Cross-Lingual Dependency Parsing with Late Decoding for Truly Low-Resource Languages

Michael Sejr Schlichtkrull, University of Amsterdam* ([email protected])
Anders Søgaard, University of Copenhagen ([email protected])

* Work done while at the University of Copenhagen.

Abstract

In cross-lingual dependency annotation projection, information is often lost during transfer because of early decoding. We present an end-to-end graph-based neural network dependency parser that can be trained to reproduce matrices of edge scores, which can be directly projected across word alignments. We show that our approach to cross-lingual dependency parsing is not only simpler, but also achieves an absolute improvement of 2.25% averaged across 10 languages compared to the previous state of the art.

1 Introduction

Dependency parsing is an integral part of many natural language processing systems. However, most research into dependency parsing has focused on learning from treebanks, i.e. collections of manually annotated, well-formed syntactic trees. In this paper, we develop and evaluate a graph-based parser which does not require the training data to be well-formed trees. We show that such a parser has an important application in cross-lingual learning.

Annotation projection is a method for developing parsers for low-resource languages, relying on aligned translations from resource-rich source languages into the target language, rather than linguistic resources such as treebanks or dictionaries. The Bible has been translated completely into 542 languages, and partially translated into a further 2344 languages. As such, the assumption that we have access to parallel Bible data is much less constraining than the assumption of access to linguistic resources. Furthermore, for truly low-resource languages, relying upon the Bible scales better than relying on less biased data such as the EuroParl corpus.

In Agić et al. (2016), a projection scheme is proposed wherein labels are collected from many sources, projected into a target language, and then averaged. Crucially, the paper demonstrates how projecting and averaging edge scores from a graph-based parser before decoding improves performance. Even so, decoding is still a requirement between projecting labels and retraining from the projected data, since their parser (TurboParser) requires well-formed input trees. This introduces a potential source of noise and loss of information that may be important for finding the best target sentence parse.

Our approach circumvents the need for decoding prior to training, thereby surpassing a state-of-the-art dependency parser trained on decoded multi-source annotation projections as done by Agić et al. We first evaluate the model across several languages, demonstrating results comparable to the state of the art on the Universal Dependencies (McDonald et al., 2013) dataset. Then, we evaluate the same model by inducing labels from cross-lingual multi-source annotation projection, comparing the performance of a model with early decoding to a model with late decoding.

Contributions We present a novel end-to-end neural graph-based dependency parser and apply it in a cross-lingual setting where the task is to induce models for truly low-resource languages, assuming only parallel Bible text. Our parser is more flexible than similar parsers, and accepts any weighted or non-weighted graph over a token sequence as input. In our setting, the input is a dense weighted graph, and we show that our parser is superior to previous best approaches to cross-lingual parsing. The code is made available on GitHub.¹

¹ https://github.com/MichSchli/Tensor-LSTM
2 Model

The goal of this section is to construct a first-order graph-based dependency parser capable of learning directly from potentially incomplete matrices of edge scores produced by another first-order graph-based parser. Our approach is to treat the encoding stage of the parser as a tensor transformation problem, wherein tensors of edge features are mapped to matrices of edge scores. This allows our model to approximate sets of scoring matrices generated by another parser directly through non-linear regression. The core component of the model is a layered sequence of recurrent neural network transformations applied to the axes of an input tensor.

More formally, any digraph $G = (V, E)$ can be expressed as a binary $|V| \times |V|$-matrix $M$, where $M_{ij} = 1$ if and only if $(j, i) \in E$ – that is, if $i$ has an ingoing edge from $j$. If $G$ is a tree rooted at $v_0$, then $v_0$ has no ingoing edges. Hence, it suffices to use a $(|V| - 1) \times |V|$-matrix. In dependency parsing, every sentence is expressed as a matrix $S \in \mathbb{R}^{w \times f}$, where $w$ is the number of words in the sentence and $f$ is the width of a feature vector corresponding to each word. The goal is to learn a function $P : \mathbb{R}^{w \times f} \to \mathbb{Z}_2^{w \times (w+1)}$, such that $P(S)$ corresponds to the matrix representation of the correct parse tree for that sentence – see Figure 1 for an example.

Figure 1: An example dependency tree and the corresponding parse matrix for the sentence "John walks his dog" (rows: words; columns: candidate heads root, John, walks, his, dog):

              root  John  walks  his  dog
    John        0     0     1     0    0
    walks       1     0     0     0    0
    his         0     0     0     0    1
    dog         0     0     1     0    0

In the arc-factored (first-order), graph-based model, $P$ is a composite function $P = D \circ E$, where the encoder $E : \mathbb{R}^{w \times f} \to \mathbb{R}^{w \times (w+1)}$ is a real-valued scoring function and the decoder $D : \mathbb{R}^{w \times (w+1)} \to \mathbb{Z}_2^{w \times (w+1)}$ is a minimum spanning tree algorithm (McDonald et al., 2005). Commonly, the encoder includes only local information – that is, $E_{ij}$ is only dependent on $S_i$ and $S_j$, where $S_i$ and $S_j$ are the feature vectors corresponding to dependent and head. Our contribution is the introduction of an LSTM-based global encoder where the entirety of $S$ is represented in the calculation of $E_{ij}$.

We begin by extending $S$ to a $(w+1) \times (f+1)$-matrix $S^{*}$ with an additional row corresponding to the root node and a single binary feature denoting whether a node is the root. We now compute a 3-tensor $F = S \boxtimes S^{*}$ of dimension $w \times (w+1) \times (2f+1)$, consisting of concatenations of all combinations of rows in $S$ and $S^{*}$. This tensor effectively contains a featurization of every edge $(u, v)$ in the complete digraph over the sentence, consisting of the features of the parent word $u$ and the child word $v$. These edge-wise feature vectors are organized in the tensor exactly as the dependency arcs in a parse matrix such as the one shown in Figure 1.

The edges represented by elements $F_{ij}$ can as such easily be interpreted in the context of related edges represented by the row $i$ and the column $j$ in which that edge occurs. The classical arc-factored parsing algorithm of McDonald et al. (2005) corresponds to applying a function $O : \mathbb{R}^{2f+1} \to \mathbb{R}$ pointwise to $S \boxtimes S^{*}$, then decoding the resulting $w \times (w+1)$-matrix. Our model diverges by applying an LSTM-based transformation $Q : \mathbb{R}^{w \times (w+1) \times (2f+1)} \to \mathbb{R}^{w \times (w+1) \times d}$ to $S \boxtimes S^{*}$ before applying an analogous transformation $O_d : \mathbb{R}^d \to \mathbb{R}$.
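To make the construction of the edge tensor concrete, the following Python sketch builds $F = S \boxtimes S^{*}$ from a word-feature matrix. It is not the authors' released code; the concatenation order of dependent and head features, and the use of a zero feature vector for the root row, are assumptions made for illustration.

    import numpy as np

    def edge_feature_tensor(S):
        # S is the w x f matrix of per-word feature vectors.
        w, f = S.shape
        # S*: prepend a root row and append a binary root-indicator column -> (w+1) x (f+1).
        S_star = np.vstack([np.zeros((1, f)), S])
        root_flag = np.zeros((w + 1, 1))
        root_flag[0, 0] = 1.0
        S_star = np.hstack([S_star, root_flag])
        # F[i, j] concatenates the features of dependent word i and candidate head j
        # (j = 0 being the artificial root), mirroring the layout of the parse matrix.
        F = np.zeros((w, w + 1, 2 * f + 1))
        for i in range(w):
            for j in range(w + 1):
                F[i, j] = np.concatenate([S[i], S_star[j]])
        return F

    # A four-word sentence with 3-dimensional word features:
    F = edge_feature_tensor(np.random.randn(4, 3))
    print(F.shape)   # (4, 5, 7) == (w, w+1, 2f+1)

For the four-word example of Figure 1 this yields one feature vector per candidate head-dependent pair, arranged exactly as the cells of the parse matrix.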
The Long Short-Term Memory (LSTM) unit is a function $\mathrm{LSTM}(x, h_{t-1}, c_{t-1}) = (h_t, c_t)$ defined through the use of several intermediary steps, following Hochreiter et al. (2001). A concatenated input vector $I = x \oplus h_{prev}$ is constructed, where $\oplus$ represents vector concatenation. Then, functions corresponding to the input, forget, and output gates are defined following the form $g_{input} = \sigma(W_{input} I + b_{input})$. Finally, the internal cell state $c_t$ and the output vector $h_t$ at time $t$ are defined using the Hadamard (pointwise) product $\bullet$:

    $c_t = g_{forget} \bullet c_{prev} + g_{input} \bullet \tanh(W_{cell} I + b_{cell})$
    $h_t = g_{output} \bullet \tanh(c_t)$

We define a function Matrix-LSTM inductively, which applies an LSTM to the rows of a matrix $X$. Formally, Matrix-LSTM is a function $M : \mathbb{R}^{a \times b} \to \mathbb{R}^{a \times c}$ such that $(h_1, c_1) = \mathrm{LSTM}(X_1, 0, 0)$, $(h_i, c_i) = \mathrm{LSTM}(X_i, h_{i-1}, c_{i-1})$ for all $1 < i \leq a$, and $M(X)_i = h_i$.

An effective extension is the bidirectional LSTM, wherein the LSTM-function is applied to the sequence both in the forward and in the backward direction, and the results are concatenated. In the matrix formulation, reversing a sequence corresponds to inverting the order of the rows. This is most naturally accomplished through left-multiplication with an exchange matrix $J_m \in \mathbb{R}^{m \times m}$ such that:

    $J_m = \begin{pmatrix} 0 & \cdots & 1 \\ \vdots & \ddots & \vdots \\ 1 & \cdots & 0 \end{pmatrix}$

Bidirectional Matrix-LSTM is therefore defined as a function $M_{2d} : \mathbb{R}^{a \times b} \to \mathbb{R}^{a \times 2c}$ such that:

    $M_{2d}(S) = M(S) \oplus_2 J_a M(J_a S)$

where $\oplus_2$ refers to concatenation along the second axis of the matrix.

Keeping in mind the goal of constructing a tensor transformation $Q$ capable of propagating information in an LSTM-like manner between any two elements of the input tensor, we are interested in constructing an equivalent of the Matrix-LSTM model operating on 3-tensors rather than matrices. This construct, when applied to the edge tensor $F = S \boxtimes S^{*}$, can then provide a means of interpreting edges in the context of related edges.

A very simple variant of such an LSTM-function operating on 3-tensors can be constructed by applying a bidirectional Matrix-LSTM to every matrix along the first axis of the tensor. This forms the center of our approach. Formally, bidirectional Tensor-LSTM is a function $T_{2d} : \mathbb{R}^{a \times b \times c} \to \mathbb{R}^{a \times b \times 2h}$ such that:

    $T_{2d}(T)_i = M_{2d}(T_i)$

This definition allows information to flow within the matrices of the first axis of the tensor, but not between them – corresponding in Figure 2 to horizontal connections along the rows, but no vertical connections along the columns. To fully cover the tensor structure, we must extend this model to include connections along columns.

This is accomplished through tensor transposition. Formally, tensor transposition is an operator $T^{T\sigma}$ where $\sigma$ is a permutation on the set $\{1, \ldots, \mathrm{rank}(T)\}$. The last axis of the tensor contains the feature representations, which we are not interested in scrambling. For the Matrix-LSTM, this leaves only one option – $M^{T(1,2)}$. When the LSTM is operating on a 3-tensor, we have two options – $T^{T(2,1,3)}$ and $T^{T(1,2,3)}$. This leads to the following definition of four-directional Tensor-LSTM as a function $T_{4d} : \mathbb{R}^{a \times b \times c} \to \mathbb{R}^{a \times b \times 4h}$, analogous to bidirectional Sequence-LSTMs:

    $T_{4d}(T) = T_{2d}(T) \oplus_3 T_{2d}(T^{T(2,1,3)})^{T(2,1,3)}$

Calculating the LSTM-function on $T^{T(1,2,3)}$ and $T^{T(2,1,3)}$ can be thought of as constructing the recurrent links either "side-wards" or "down-wards" in the tensor – or, equivalently, constructing recurrent links either between the outgoing or between the in-going edges of every vertex in the dependency graph. In Figure 2, we illustrate the two directions respectively with full or dotted edges in the hidden layer.

Figure 2: Four-directional Tensor-LSTM applied to the example sentence seen in Figure 1. The word-pair tensor $S \boxtimes S^{*}$ is represented with blue units (horizontal lines), a hidden Tensor-LSTM layer $H$ with green units (vertical lines), and the output layer with white units. The recurrent connections in the hidden layer along $H$ and $H^{T(2,1,3)}$ are illustrated respectively with dotted and fully drawn lines. [Diagram not reproduced.]
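The following Python sketch illustrates the four-directional construction with explicit loops. It is a minimal, unoptimized illustration rather than the authors' implementation; the gate ordering, random initialization, and parameter packing are assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def lstm_params(in_dim, hid_dim):
        # Weights for the four gates (input, forget, output, cell), packed row-wise.
        return rng.standard_normal((4 * hid_dim, in_dim + hid_dim)) * 0.1, np.zeros(4 * hid_dim)

    def lstm_step(x, h, c, W, b):
        d = h.shape[0]
        z = W @ np.concatenate([x, h]) + b
        sig = lambda a: 1.0 / (1.0 + np.exp(-a))
        i, f, o = sig(z[:d]), sig(z[d:2 * d]), sig(z[2 * d:3 * d])
        g = np.tanh(z[3 * d:])
        c = f * c + i * g
        return o * np.tanh(c), c

    def matrix_lstm(X, W, b, d):
        # M(X): run the LSTM over the rows of X and stack the hidden states.
        h, c, out = np.zeros(d), np.zeros(d), []
        for x in X:
            h, c = lstm_step(x, h, c, W, b)
            out.append(h)
        return np.stack(out)

    def bidir_matrix_lstm(X, fwd, bwd, d):
        # M_2d(X) = M(X) concatenated along axis 1 with J M(J X), i.e. a reversed pass.
        forward = matrix_lstm(X, *fwd, d)
        backward = matrix_lstm(X[::-1], *bwd, d)[::-1]
        return np.concatenate([forward, backward], axis=1)

    def four_directional_tensor_lstm(T, row_params, col_params, d):
        # T_2d: bidirectional Matrix-LSTM over every matrix along the first axis.
        rows = np.stack([bidir_matrix_lstm(M, *row_params, d) for M in T])
        # Transposed pass T_2d(T^{T(2,1,3)})^{T(2,1,3)}: the same construction along columns.
        T_t = np.transpose(T, (1, 0, 2))
        cols = np.stack([bidir_matrix_lstm(M, *col_params, d) for M in T_t])
        cols = np.transpose(cols, (1, 0, 2))
        return np.concatenate([rows, cols], axis=2)      # shape a x b x 4d

    d, feat = 8, 7                                       # hidden units per direction, 2f+1
    directions = lambda: (lstm_params(feat, d), lstm_params(feat, d))
    F = rng.standard_normal((4, 5, feat))                # edge tensor for the Figure 1 sentence
    H = four_directional_tensor_lstm(F, directions(), directions(), d)
    print(H.shape)                                       # (4, 5, 32) == (w, w+1, 4d)

The reversed pass corresponds to left-multiplication with the exchange matrix $J_a$, and the column-wise pass to the transposition $T^{T(2,1,3)}$ described above.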
The output of Tensor-LSTM is itself a tensor. In our experiments, we use a multi-layered variation implemented by stacking layers of models: $T_{4d,stack}(T) = T_{4d}(T_{4d}(\ldots T_{4d}(T) \ldots))$. We do not share parameters between stacked layers. Training the model is done by minimizing the value $E(G, O(Q(S \boxtimes S^{*})))$ of some loss function $E$ for each sentence $S$ with gold tensor $G$. We experiment with two loss functions.

In our monolingual set-up, we exploit the fact that parse matrices, by virtue of depicting trees, are right stochastic matrices. Following this observation, we constrain each row of $O(Q(S \boxtimes S^{*}))$ under a softmax function and use as loss the row-wise cross entropy. In our cross-lingual set-up, we use mean squared error. In both cases, prediction-time decoding is done with the Chu-Liu-Edmonds algorithm (Edmonds, 1968), following McDonald et al. (2005).
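A sketch of the two objectives on a single sentence, assuming an edge-score matrix of shape $w \times (w+1)$ and using the gold parse matrix of Figure 1. Whether the losses are summed or averaged over rows is an assumption made for illustration.

    import numpy as np

    def row_softmax(E):
        # Row-wise softmax over an edge-score matrix E of shape w x (w+1).
        Z = np.exp(E - E.max(axis=1, keepdims=True))
        return Z / Z.sum(axis=1, keepdims=True)

    def cross_entropy_loss(E, G):
        # Monolingual loss: rows of the gold parse matrix G are one-hot (each word
        # has exactly one head), so row-wise cross entropy of the softmax is well defined.
        P = row_softmax(E)
        return -np.mean(np.sum(G * np.log(P + 1e-12), axis=1))

    def mean_squared_loss(E, G):
        # Cross-lingual loss: G holds projected (possibly incomplete) edge scores
        # rather than a tree, so the model regresses on them directly.
        return np.mean((E - G) ** 2)

    # Predicted scores for the four-word example sentence and its gold parse matrix:
    E = np.random.randn(4, 5)
    G = np.zeros((4, 5))
    G[0, 2] = G[1, 0] = G[2, 4] = G[3, 2] = 1.0
    print(cross_entropy_loss(E, G), mean_squared_loss(E, G))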
3 Cross-lingual parsing

Hwa et al. (2005) is a seminal paper for cross-lingual dependency parsing, but they use very detailed heuristics to ensure that the projected syntactic structures are well-formed. Agić et al. (2016) is the latest continuation of their work, presenting a new approach to cross-lingual projection: projecting edge scores rather than subtrees. Agić et al. (2016) construct target-language treebanks by aggregating scores from multiple source languages before decoding. Averaging before decoding is especially beneficial when the parallel data is of low quality, as the decoder introduces errors when edge scores are missing. Despite averaging, there will still be scores missing from the input weight matrices, especially when the source and target languages are very distant. Below we show that we can circumvent error-inducing early decoding by training directly on the projected edge scores.

We assume source language datasets $L_1, \ldots, L_n$, parsed by monolingual arc-factored parsers. In our case, this data comes from the Bible. We assume access to a set of sentence alignment functions $A_s : L_s \times L_t \to \mathbb{R}_{0,1}$, where $A_s(S_s, S_t)$ is the confidence that $S_t$ is the translation of $S_s$. Similarly, we have access to a set of word alignment functions $W_{L_s,S_s,S_t} : S_s \times S_t \to \mathbb{R}_{0,1}$ such that, for $S_s \in L_s$ and $S_t \in L_t$, $W(w_s, w_t)$ represents the confidence that $w_s$ aligns to $w_t$ given that $S_t$ is the translation of $S_s$.

For each source language $L_s$ with a scoring function $\mathrm{score}_{L_s}$, we define a local edge-wise voting function $\mathrm{vote}_{S_s}((u_s, v_s), (u_t, v_t))$ operating on a source language edge $(u_s, v_s) \in S_s$ and a target language edge $(u_t, v_t) \in S_t$. Intuitively, every source language edge votes for every target language edge with a score proportional to the confidence of the edges aligning and the score given in the source language. For every target language edge $(u_t, v_t) \in S_t$:

    $\mathrm{vote}_{S_s}((u_s, v_s), (u_t, v_t)) = W_{L_s,S_s,S_t}(u_s, u_t) \cdot W_{L_s,S_s,S_t}(v_s, v_t) \cdot \mathrm{score}_{L_s}(u_s, v_s)$

Following Agić et al. (2016), a sentence-wise voting function is then constructed as the highest contribution from a source-language edge:

    $\mathrm{vote}_{S_s}(u_t, v_t) = \max_{u_s, v_s \in S_s} \mathrm{vote}_{S_s}((u_s, v_s), (u_t, v_t))$

The final contribution of each source language dataset $L_s$ to a target language edge $(u_t, v_t)$ is then calculated as the sum over all sentences $S_s \in L_s$ of $\mathrm{vote}_{S_s}(u_t, v_t)$, multiplied by the confidence that the source language sentence aligns with the target language sentence. For an edge $(u_t, v_t)$ in a target language sentence $S_t \in L_t$:

    $\mathrm{vote}_{L_s}(u_t, v_t) = \sum_{S_s \in L_s} A_s(S_s, S_t) \, \mathrm{vote}_{S_s}(u_t, v_t)$

Finally, we can compute a target language scoring function by summing over the votes for every source language:

    $\mathrm{score}(u_t, v_t) = \frac{\sum_{i=1}^{n} \mathrm{vote}_{L_i}(u_t, v_t)}{Z_{S_t}}$

Here, $Z_{S_t}$ is a normalization constant ensuring that the target-language scores are proportional to those created by the source-language scoring functions. As such, $Z_{S_t}$ should consist of the sum over the weights for each sentence contributing to the scoring function. We can compute this as:

    $Z_{S_t} = \sum_{i=1}^{n} \sum_{S_s \in L_i} A_s(S_s, S_t)$

The sentence alignment function is not a probability distribution; it may be the case that no source-language sentences contribute to a target language sentence, causing the sum of the weights and the sum of the votes to approach zero. In this case, we define $\mathrm{score}(u_t, v_t) = 0$. Before projection, the source language scores are all standardized to have 0 as the mean and 1 as the standard deviation. Hence, a score of 0 corresponds to assuming neither positive nor negative evidence concerning the edge.

We experiment with two methods of learning from the projected data – decoding with the Chu-Liu-Edmonds algorithm and then training as proposed in Agić et al. (2016), or directly learning to reproduce the matrices of edge scores. For alignment, we use the sentence-level hunalign algorithm introduced in Varga et al. (2005) and the token-level model presented in Östling (2015).
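The projection can be sketched as follows. This is illustrative only; the indexing scheme, in which position 0 of every sentence denotes the artificial root and the root is assumed to align to the target root, is our assumption rather than a detail taken from the paper.

    import numpy as np

    def project_edge_scores(target_len, sources):
        # `sources` maps a source language name to a list of (A, W, score) triples,
        # one per source sentence aligned to the target sentence:
        #   A        sentence-alignment confidence A_s(S_s, S_t)
        #   W[i, j]  word-alignment confidence between source position i and target
        #            position j (position 0 is the artificial root)
        #   score    standardized source edge scores, score[u, v] for head u -> dependent v
        n = target_len + 1                               # +1 for the root position
        total, Z = np.zeros((n, n)), 0.0
        for sentences in sources.values():
            for A, W, score in sentences:
                sent_vote = np.zeros((n, n))
                for ut in range(n):
                    for vt in range(n):
                        # Vote of source edge (us, vs): W[us, ut] * W[vs, vt] * score[us, vs];
                        # the sentence-wise vote keeps the maximum over source edges.
                        sent_vote[ut, vt] = np.max(np.outer(W[:, ut], W[:, vt]) * score)
                total += A * sent_vote
                Z += A
        # No contributing source sentence: define all target scores as 0.
        return total / Z if Z > 0 else total

    rng = np.random.default_rng(1)
    W = np.abs(rng.standard_normal((5, 4)))              # 4-word source, 3-word target (+ root)
    score = rng.standard_normal((5, 5))                  # source parser's standardized scores
    projected = project_edge_scores(3, {"da": [(0.9, W, score)]})
    print(projected.shape)                               # (4, 4): target-side edge-score matrix

Source scores are assumed to be standardized before being passed in, as described above, and the zero-denominator case falls back to all-zero scores.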
4 Experiments

We conduct two sets of experiments. First, we evaluate the Tensor-LSTM parser in the monolingual setting. We compare Tensor-LSTM to TurboParser (Martins et al., 2010) on several languages from the Universal Dependencies dataset. In the second experiment, we evaluate Tensor-LSTM in the cross-lingual setting. We include as baselines the delexicalized parser of McDonald et al. (2011) and the approach of Agić et al. (2016) using TurboParser. To demonstrate the effectiveness of circumventing the decoding step, we conduct the cross-lingual evaluation of Tensor-LSTM using cross entropy loss with early decoding, and using mean squared loss with late decoding.

4.1 Model selection and training

Our features consist of 500-dimensional word embeddings trained on translations of the Bible. The word embeddings were trained using skipgram with negative sampling on a word-by-sentence PMI matrix induced from the Edinburgh Bible Corpus, following Levy et al. (2017). Our embeddings are not trainable, but fixed representations throughout the learning process. Unknown tokens were represented by zero-vectors. We combined the word embeddings with one-hot encodings of POS-tags, projected across word alignments following the method of Agić et al. (2016). To verify the value of the POS-features, we conducted preliminary experiments on English development data. When including POS-tags, we found small, non-significant improvements for monolingual parsing, but significant improvements for cross-lingual parsing.

The weights were initialized using the normalized values suggested in Glorot and Bengio (2010). Following Jozefowicz et al. (2015), we add 1 to the initial forget gate bias. We trained the network using RMSprop (Tieleman and Hinton, 2012) with hyperparameters $\alpha = 0.1$ and $\gamma = 0.9$, using minibatches of 64 sentences. Following Neelakantan et al. (2015), we added a noise factor $n \sim N(0, \frac{1}{(1+t)^{0.55}})$ to the gradient in each update. We applied dropout after each LSTM-layer with a dropout probability of $p = 0.5$, and between the input layer and the first LSTM-layer with a dropout probability of $p = 0.2$ (Bluche et al., 2015). As proposed in Pascanu et al. (2012), we employed a gradient clipping factor of 15. In the monolingual setting, we used early stopping on the development set.

We experimented with 10, 50, 100, and 200 hidden units per layer, and with up to 6 layers. Using greedy search on monolingual parsing and evaluating on the English development data, we determined the optimal network shape to contain 100 units per direction per hidden layer, and a total of 4 layers.

For the cross-lingual setting, we used two additional hyperparameters. We used the development data from one of our target languages (German) to determine the optimal number of epochs before stopping. Furthermore, we trained only on a subset of the projected sentences, choosing the size of the subset using the development data. We experimented with either 5000 or 10000 randomly sampled sentences. There are two motivating factors behind this subsampling. First, while the Bible in general consists of about 30000 sentences, for many low-resource languages we do not have access to annotation projections for the full Bible, because parts were never translated, and because of varying projection quality. Second, subsampling speeds up the training, which was necessary to make our experiments practical: at 10000 sentences and on a single GPU, each epoch takes approximately 2.5 hours. As such, training for a single language could be completed in less than a day. We plot the results in Figure 3. We see that the best performance is achieved at 10000 sentences, and with respectively 6 and 5 epochs for cross entropy and mean squared loss.

Figure 3: UAS per epoch on German development data, training from 5000 or 10000 randomly sampled sentences with projected annotations (curves for cross entropy and mean squared loss at each size). [Plot not reproduced.]
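As an illustration of the gradient treatment described above, the following sketch adds the annealed Gaussian noise and applies clipping before the optimiser update (RMSprop in our setup). Two assumptions are made: the clipping factor of 15 is interpreted as a cap on the global gradient norm, and the noise schedule value is interpreted as a variance.

    import numpy as np

    def noisy_clipped_gradient(grad, t, clip=15.0, eta=0.55, rng=None):
        # Add annealed Gaussian noise with variance 1 / (1 + t)^eta at update step t,
        # then rescale the gradient so that its norm does not exceed `clip`.
        rng = rng or np.random.default_rng()
        sigma = np.sqrt(1.0 / (1.0 + t) ** eta)
        noisy = grad + rng.normal(0.0, sigma, size=grad.shape)
        norm = np.linalg.norm(noisy)
        if norm > clip:
            noisy *= clip / norm
        return noisy

    g = noisy_clipped_gradient(np.random.randn(100) * 10.0, t=0)
    print(np.linalg.norm(g) <= 15.0 + 1e-9)   # True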
4.2 Results

In the monolingual setting, we compare our parser to TurboParser (Martins et al., 2010) – a fast, capable graph-based parser used as a component in many larger systems. TurboParser is also the system of choice for the cross-lingual pipeline of Agić et al. (2016). It is therefore interesting to make a direct comparison between the two. The results can be seen in Table 1.

    Language    TurboParser    Tensor-LSTM
    English*    83.84          85.81
    German      81.45          82.64
    Danish      81.82          82.24
    Finnish     77.74          78.83
    Spanish     83.19          86.69
    French      81.17          84.63
    Czech       81.32          85.04
    Average     81.50          83.70

Table 1: Unlabeled Attachment Score on the UD test data for TurboParser and Tensor-LSTM with cross entropy loss. English development data was used for model selection (marked *).

Note that in order for a parser to be directly applicable to the annotation projection setup explored in the secondary experiment, it must be a first-order graph-based parser. In the monolingual setting, the best results reported so far for the above selection of treebanks (84.74 on average) were by the Parsito system (Straka et al., 2015), a transition-based parser using a dynamic oracle.

For the cross-lingual annotation projection experiments, we use the delexicalized system suggested by McDonald et al. (2011) as a baseline. We also compare against the annotation projection scheme using TurboParser suggested in Agić et al. (2016), representing the previous state of the art for truly low-resource cross-lingual dependency parsing. Note that while our results for the TurboParser-based system use the same training data, test data, and model as in Agić et al., our results differ due to the use of the Bible corpus rather than a Watchtower publications corpus as parallel data. The authors made results available using the Edinburgh Bible Corpus for unlabeled data. The two tested conditions of Tensor-LSTM are the mean squared loss model without intermediary decoding, and the cross entropy model with intermediary decoding. The results of the cross-lingual experiment can be seen in Table 2.

    Language        Delexicalized    TurboParser    Tensor-LSTM (Decoding)    Tensor-LSTM (No decoding)
    Czech (cs)      40.99            43.81          42.58                     41.54
    Danish (da)     49.65            54.87          54.93                     54.15
    English* (en)   48.08            52.52          52.91                     52.90
    Finnish (fi)    41.18            46.08          43.98                     45.26
    French (fr)     48.97            45.83          55.06                     53.83
    German* (de)    49.36            51.79          54.87                     53.85
    Spanish (es)    47.60            58.90          59.60                     57.81
    Persian (fa)    28.93            14.88          46.47                     48.60
    Hebrew (he)     19.06            52.89          26.17                     31.41
    Hindi (hi)      21.03            43.31          43.21                     46.09
    Average         39.49            46.29          47.98                     48.54

Table 2: Unlabeled attachment scores for the various systems. Tensor-LSTM is evaluated using cross entropy loss (with intermediary decoding) and mean squared loss (without decoding). We include the results of two baselines – the delexicalized system of McDonald et al. (2011) and the Turbo-based projection scheme of Agić et al. (2016). English and German development data was used for hyperparameter tuning (marked *).

5 Discussion

As is evident from Table 2, the variation in performance across different languages is large for all systems. This is to be expected, as the quality of the projected label sets varies widely due to linguistic differences. On average, Tensor-LSTM with mean squared loss outperforms all other systems. In Section 1, we hypothesized that incomplete projected scorings would have a larger impact upon systems reliant on an intermediary decoding step. To investigate this claim, we plot in Figure 4 the performance difference between mean squared loss and cross entropy loss for each language versus the percentage of missing edge scores.

Figure 4: Percentage of missing edge scores versus UAS difference for Tensor-LSTM with mean squared loss and cross entropy loss, with one point per target language. [Plot not reproduced.]
For languages outside the Germanic and Latin families, our claim holds – the performance of the cross entropy loss system decreases faster with the percentage of missing labels than the performance of the mean squared loss system. To an extent, this confirms our hypothesis, as we observe for the average language an improvement from circumventing the decoding step. French and Spanish, however, do not follow the same trend, with cross entropy loss outperforming mean squared loss despite the high number of missing labels.

In Table 2, performance on French and Spanish for both systems can be seen to be very high. It may be the case that Indo-European target languages are not as affected by missing labels, since most of the source languages are themselves Indo-European. Another explanation could be that some feature of the cross entropy loss function makes it especially well suited for Latin languages – as seen in Table 1, French and Spanish are also two of the languages for which Tensor-LSTM yields the highest performance improvement.

To compare the effect of missing edge scores upon performance without influence from linguistic factors such as language similarity, we repeat the cross-lingual experiment on one language with respectively 10%, 20%, 30%, and 40% of the projected and averaged edge scores artificially set to 0, simulating missing data (a minimal sketch of this perturbation is given at the end of this section). We choose the English data for this experiment, as the English projected data has the lowest percentage of missing labels of any of the languages. In Figure 5, we plot the performance for each of the two systems versus the percentage of deleted values.

Figure 5: Performance for Tensor-LSTM on English test data with 0-40% of the edge scores artificially maintained at 0, for mean squared loss and cross entropy loss. [Plot not reproduced.]

As can be clearly seen, performance drops faster with the percentage of deleted labels for the cross entropy model. This confirms our intuition that the initially lower performance using mean squared loss compared to cross entropy loss is mitigated by a greater robustness towards missing labels, gained by circumventing the decoding step in the training process. In Table 2, this is reflected as dramatic performance increases using mean squared error for Finnish, Persian, Hindi, and Hebrew – the four languages furthest removed from the predominantly Indo-European source languages and therefore the four languages with the poorest projected label quality.

Several possible avenues for future work on this project are available. In this paper, we used an extremely simple feature function; more complex feature functions are one potential source of improvement. Another interesting direction for future work would be to include POS-tagging directly as a component of Tensor-LSTM prior to the construction of $S \boxtimes S^{*}$, in a multi-task learning framework. Similarly, incorporating semantic tasks on top of dependency parsing could lead to interesting results. Finally, extensions of the Tensor-LSTM function to deeper models, wider models, or more connected models as seen in e.g. Kalchbrenner et al. (2015) may yield further performance gains.
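A minimal sketch of the blank-out perturbation referenced above. Which cells are eligible for blanking is an assumption; here every projected edge score may be set to 0.

    import numpy as np

    def blank_out_edge_scores(score_matrices, fraction, rng=None):
        # Set a random `fraction` of the projected edge scores to 0, the value used
        # for missing evidence, simulating incomplete projections.
        rng = rng or np.random.default_rng()
        blanked = []
        for M in score_matrices:
            mask = rng.random(M.shape) < fraction
            blanked.append(np.where(mask, 0.0, M))
        return blanked

    # e.g. remove roughly 30% of the scores from one projected training sentence:
    perturbed = blank_out_edge_scores([np.random.randn(4, 5)], fraction=0.3)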
6 Related Work

Experiments with neural networks for dependency parsing have focused mostly on learning higher-order scoring functions and creating efficient feature representations, with the notable exception of Fonseca et al. (2015). In their paper, a convolutional neural network is used to evaluate local edge scores based on global information. In Zhang and Zhao (2015) and Pei et al. (2015), neural networks are used to simultaneously evaluate first-order and higher-order scores for graph-based parsing, demonstrating good results. Bidirectional LSTM-models have been successfully applied to feature generation (Kiperwasser and Goldberg, 2016). Such LSTM-based features could in future work be employed and trained in conjunction with Tensor-LSTM, incorporating global information both in parsing and in featurization.

An extension of LSTM to tensor-structured data has been explored in Graves et al. (2007), and further improved upon in Kalchbrenner et al. (2015) in the form of Grid LSTM. Our approach is similar, but simpler and computationally more efficient, as no within-layer connections between the first and the second axes of the tensor are required.

Annotation projection for dependency parsing has been explored in a number of papers, starting with Hwa et al. (2005). In Tiedemann (2014) and Tiedemann (2015), the process is extended and evaluated across many languages. Li et al. (2014) follow the method of Hwa et al. (2005) and add a probabilistic target-language classifier to determine and filter out high-uncertainty trees. In Ma and Xia (2014), performance on projected data is used as an additional objective for unsupervised learning through a combined loss function.

A common thread in these papers is the use of high-quality parallel data such as the EuroParl corpus. For truly low-resource target languages, this setting is unrealistic, as parallel resources may be restricted to biased data such as the Bible. In Agić et al. (2016) this problem is addressed, and a parser is constructed which utilizes averaging over edge posteriors for many source languages to compensate for low-quality projected data. Our work builds upon their contribution by constructing a more flexible parser which can bypass a source of bias in their projected labels, and we therefore compared our results directly to theirs.

Annotation projection procedures for cross-lingual dependency parsing have been the focus of several other recent papers (Guo et al., 2015; Zhang and Barzilay, 2015; Duong et al., 2015; Rasooli and Collins, 2015). In Guo et al. (2015), distributed, language-independent feature representations are used to train shared parsers. Zhang and Barzilay (2015) introduce a tensor-based feature representation capable of incorporating prior knowledge about feature interactions learned from source languages. In Duong et al. (2015), a neural network parser is built wherein higher-level layers are shared between languages.

Finally, Rasooli and Collins (2015) leverage dense information in high-quality sentence translations to improve performance. Their work can be seen as opposite to ours – whereas Rasooli and Collins leverage high-quality translations to improve performance when such are available, we focus on improving performance in the absence of high-quality translations.

7 Conclusion

We have introduced a novel algorithm for graph-based dependency parsing based on an extension of sequence-LSTM to the more general Tensor-LSTM. We have shown that the parser with a cross entropy loss function performs comparably to the state of the art for monolingual parsing. Furthermore, we have demonstrated that the flexibility of our parser enables learning from non-well-formed data and from the output of other parsers. Using this property, we have applied our parser to a cross-lingual annotation projection problem for truly low-resource languages, demonstrating an average target-language unlabeled attachment score of 48.54, which is, to the best of our knowledge, the best result yet reported for the task.

Acknowledgments

The second author was supported by ERC Starting Grant No. 313695.

References

Željko Agić, Anders Johannsen, Barbara Plank, Héctor Martínez Alonso, Natalie Schluter, and Anders Søgaard. 2016. Multilingual projection for parsing truly low-resource languages. Transactions of the Association for Computational Linguistics, 4.

Theodore Bluche, Christopher Kermorvant, and Jerome Louradour. 2015. Where to apply dropout in recurrent neural networks for handwriting recognition? In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pages 681-685. IEEE.

Long Duong, Trevor Cohn, Steven Bird, and Paul Cook. 2015. Low resource dependency parsing: Cross-lingual parameter sharing in a neural network parser. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Short Papers), pages 845-850. Association for Computational Linguistics.

Jack Edmonds. 1968. Optimum branchings. In Mathematics and the Decision Sciences, Part 1, pages 335-345. American Mathematical Society.

Erick R Fonseca, Avenida Trabalhador São-carlense, and Sandra M Aluísio. 2015. A deep architecture for non-projective dependency parsing. In Proceedings of the 2015 NAACL-HLT Workshop on Vector Space Modeling for NLP, pages 56-61. Association for Computational Linguistics.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 2010 International Conference on Artificial Intelligence and Statistics, pages 249-256. Society for Artificial Intelligence and Statistics.

Alex Graves, Santiago Fernández, and Jürgen Schmidhuber. 2007. Multi-dimensional recurrent neural networks. arXiv preprint arXiv:0705.2011.

Jiang Guo, Wanxiang Che, David Yarowsky, Haifeng Wang, and Ting Liu. 2015. Cross-lingual dependency parsing based on distributed representations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1234-1244. Association for Computational Linguistics.

Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. 2001. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In A Field Guide to Dynamic Recurrent Neural Networks. IEEE Press.

Rebecca Hwa, Philip Resnik, Amy Weinberg, Clara Cabezas, and Okan Kolak. 2005. Bootstrapping parsers via syntactic projection across parallel texts. Natural Language Engineering, 11(03):311-325.

Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. 2015. An empirical exploration of recurrent network architectures. In Proceedings of the 32nd International Conference on Machine Learning, pages 2342-2350. International Machine Learning Society.

Nal Kalchbrenner, Ivo Danihelka, and Alex Graves. 2015. Grid long short-term memory. arXiv preprint arXiv:1507.01526.

Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. arXiv preprint arXiv:1603.04351.

Omer Levy, Anders Søgaard, and Yoav Goldberg. 2017. A strong baseline for learning cross-lingual word representations from sentence alignments. In EACL.

Zhenghua Li, Min Zhang, and Wenliang Chen. 2014. Soft cross-lingual syntax projection for dependency parsing. In Proceedings of the 25th International Conference on Computational Linguistics, pages 783-793. Association for Computational Linguistics.

Xuezhe Ma and Fei Xia. 2014. Unsupervised dependency parsing with transferring distribution via parallel guidance and entropy regularization. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 1337-1348. Association for Computational Linguistics.

André FT Martins, Noah A Smith, Eric P Xing, Pedro MQ Aguiar, and Mário AT Figueiredo. 2010. Turbo parsers: Dependency parsing by approximate variational inference. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 34-44. Association for Computational Linguistics.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. 2005. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 523-530. Association for Computational Linguistics.

Ryan McDonald, Slav Petrov, and Keith Hall. 2011. Multi-source transfer of delexicalized dependency parsers. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 62-72. Association for Computational Linguistics.

Ryan T McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith B Hall, Slav Petrov, Hao Zhang, Oscar Täckström, et al. 2013. Universal dependency annotation for multilingual parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics.

Arvind Neelakantan, Luke Vilnis, Quoc V Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. 2015. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807.

Robert Östling. 2015. Bayesian models for multilingual word alignment. Ph.D. thesis, Department of Linguistics, Stockholm University.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2012. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063.

Wenzhe Pei, Tao Ge, and Baobao Chang. 2015. An effective neural network model for graph-based dependency parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 313-322. Association for Computational Linguistics.

Mohammad Sadegh Rasooli and Michael Collins. 2015. Density-driven cross-lingual transfer of dependency parsers. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 328-338. Association for Computational Linguistics.

Milan Straka, Jan Hajič, Jana Straková, and Jan Hajič jr. 2015. Parsing universal dependency treebanks using neural networks and search-based oracle. In Proceedings of the 14th International Workshop on Treebanks and Linguistic Theories, pages 208-220. Association for Computational Linguistics.

Jörg Tiedemann. 2014. Rediscovering annotation projection for cross-lingual parser induction. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1854-1864. Association for Computational Linguistics.

Jörg Tiedemann. 2015. Cross-lingual dependency parsing with universal dependencies and predicted POS labels. In Proceedings of the Third International Conference on Dependency Linguistics, pages 340-349.

Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4:2.

Dániel Varga, Péter Halácsy, András Kornai, Viktor Nagy, László Németh, and Viktor Trón. 2005. Parallel corpora for medium density languages. In Proceedings of the 2005 Conference on Recent Advances in Natural Language Processing. Association for Computational Linguistics.

Yuan Zhang and Regina Barzilay. 2015. Hierarchical low-rank tensors for multilingual transfer parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1857-1867. Association for Computational Linguistics.

Zhisong Zhang and Hai Zhao. 2015. High-order graph-based neural dependency parsing. In Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation, pages 114-123. Association for Computational Linguistics.
