
Automatic Detection of Proverbs and their Variants

Amanda P. Rassi (1,2), Jorge Baptista (2), and Oto Vale (1)

(1) Federal University of São Carlos-UFSCar, Rodovia Washington Luís, km 235 – SP-310, São Carlos – São Paulo – Brasil, CEP 13565-905
[email protected], [email protected]
(2) University of Algarve-FSCH/CECL, Campus de Gambelas, 8005-139 Faro, Portugal
[email protected]

Abstract

This article presents the task of automatically detecting proverbs in Brazilian Portuguese, based on the intersection of the regular syntactic structure of proverbs and their core elements. We created finite-state automata that enabled us to look for these word combinations in running texts. The rationale behind this method is that, although proverbs may have a normal sentence structure and often a very commonly used lexicon, their specific word combinations may enable us to identify them and their variants irrespective of the syntactic or structural changes the proverb may undergo. The goal of this task is to gather the largest possible number of proverbs and their variants. The results showed a precision of 60.15%.

1998 ACM Subject Classification I.2.7 Natural Language Processing
Keywords and phrases Brazilian Portuguese, proverbs, syntactic structure, core element, variation
Digital Object Identifier 10.4230/OASIcs.SLATE.2014.235

1 Introduction

The existence of proverbial structures in texts, including journalistic texts, is indisputable [12], which raises the problem of identifying them as complex structures. The main problem concerning the identification of proverbs is that they have the same syntactic structure and the same words as ordinary, free sentences. However, they normally have a non-compositional meaning and must be recognized not as an ordinary string of words, but as a complex unit, formed by several words, phrases and even multiple clauses.
In this sense, proverbs resemble multiword expressions (MWE), although some authors [13, p.53] consider them a different type of linguistic unit, a kind of quoted speech inside speech itself. In this paper, we adopt the view that proverbs should be treated as MWE.

In general, automatic processing of idiomatic expressions, fixed expressions, semi-fixed expressions, proverbs and other multiword expressions is still a hard task for Natural Language Processing (NLP) [30]. Although there are many studies about the identification of multiword expressions in NLP [20, 21, 23], it is still difficult to identify them automatically in natural language texts [4, 5, 26].

In this paper we focus on the special case of proverbs in view of a double problem they pose to NLP: the fact that proverbs accept both lexical and formal (structural) variation. We aim at developing a method for automatic detection of proverbs and their variants, based on existing compilations of proverbs, by exploring the regular syntactic structures that most proverbs present. These regularities led to a formal classification of proverbs, based on their syntactic structure. Finite-state automata are used to represent the regular patterns found in these classes of proverbs. Results from the automatic identification of Brazilian Portuguese proverbs in real texts are presented.

© Amanda P. Rassi, Jorge Baptista, and Oto Vale; licensed under Creative Commons License CC-BY. 3rd Symposium on Languages, Applications and Technologies (SLATE'14). Editors: Maria João Varanda Pereira, José Paulo Leal, and Alberto Simões; pp. 235–249. OpenAccess Series in Informatics. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany.
This approach can be used in two main applications: for lexicographic work, in order to build more complete dictionaries, and for Natural Language Processing, to improve linguistic resources, tools and applications, by allowing systems to signal these micro-texts as a special type of discursive element.

2 Delimitation of the Object

Proverbs, parables, adages, aphorisms, maxims, and so on: these are all different terms used to designate similar types of sentences. Though there are conceptual differences among these terms, in practice many authors ignore such distinctions and tend to group all these linguistic expressions under the broad umbrella term of proverb. In this paper, we also adopt this broad perspective and consider proverbs as linguistic expressions forming fixed word combinations, in spite of some (limited) lexical or structural variation, often with a sentential status, that may even include subclauses, and whose global meaning is often idiomatic. These micro-texts are usually generic statements, conveying a world view or stating a moral judgement, an eternal truth, or an ideal state of affairs.

We distinguish proverbs from fixed expressions/frozen sentences (or idioms, proper). In idioms, the verb and one of its argument positions are frozen together, that is, they are distributionally invariant, or the argument nouns can only vary within a small and closed paradigm. Usually the subject of frozen sentences is distributionally free, and its selection depends not just on the verb, but on the overall meaning of the combination of the verb and its frozen arguments; e.g. Ana/Essa mesa não vale um tostão ‘Ana/This table is not worth a penny’. On the other hand, proverbs are typically completely frozen sentences, where, in spite of some (reduced) lexical variation and some (even more constrained) syntactic paraphrasing, all the elements are fixed.
In other words, proverbs have the subject position necessarily filled by a fixed element [18, p.161], while the subject in fixed expressions usually varies and may be defined intensionally, by distributional constraints.

The second property that distinguishes proverbs from fixed expressions is, according to [24], that proverbs “always have an autonomous semantic value in communicative terms, unlike idioms that are only constituents of sentences and may never occur as a full sentence.” In this sense, proverbs occupy whole sentences, while fixed expressions only replace phrases (nominal phrase, verbal phrase or prepositional phrase).

Although proverbs have syntactic structures similar to simple sentences, they cannot be recognized as common sentences, but must be understood as a single block, whose syntactic slots should always be filled by specific lexical units. This means that proverbs are formed by words and phrases like any other free sentence, but they must be understood as a complex expression, a combination of words whose use is highly constrained.

When proverbs are introduced by an enunciative mark, such as como dizem ‘as they say’, como dizia minha avó ‘as my grandmother used to say’, dizem por aí ‘people say/they say’, costuma dizer-se ‘it is often said’, etc., it is easier to identify them, because this type of mark can be extensively described. However, there is often no mark at all introducing proverbs in texts, which makes spotting them more difficult.

Finally, proverbs are prone to certain types of formal variation, particularly the ellipsis of one of their clause-type components, and they often undergo stylistic reformulation, in order to produce some perlocutionary effect. For example, a banking institution, in one advertisement of its products, recently “reinvented” the proverb Tempo é dinheiro ‘Time is money’ as Tempo não é só dinheiro. É valor ‘Time is not just money. It is value’.
This capacity of proverbs to be reinterpreted and reformulated, which some linguists have called “défigement” or “unfreezing”, is an inherent part of the paremiologic dynamics of language.

3 Related Works

Most of the work done on Brazilian Portuguese proverbs adopts a didactic or pedagogic approach [14, 25, 31], or analyzes rhetorical relations between the clauses [15, 16, 17]. We did not find any work that formally describes proverb structures in Portuguese or that tries to identify them automatically in large corpora.

For European Portuguese, Lucília Chacoto developed many studies on proverbs, both theoretical and practical. The author compared Portuguese and Spanish proverbs beginning with Quem/Quien ‘Who’ [6] and also analyzed comparative structures [7], which are two of the structures we describe in this paper.

We can also cite works for other languages, like Lacavalla [22], who compared proverbs beginning with Quand/Quando ‘When’ in Italian and French. The author uses local grammars for searching the proverbs in both languages and describes the data in Lexicon-Grammar tables, analyzing all syntactic properties and the distribution of those units. Navarro Brotons [2], in turn, compared proverbs in Spanish and French. The author analyzed the syntax, semantics and translation of proverbs and their variants in both languages and also described the data in Lexicon-Grammar tables.

We also cite the extensive work of Mirella Conenna [8, 9, 10, 11], who produced many works about proverbs in French and Italian, comparing their structures in both languages, classifying proverbs in syntactic tables, i.e. Lexicon-Grammar tables, and analyzing proverbs and their variants in equivalence classes. In all those works, the author was concerned with the formalization of the data for automatic identification and processing.

There are also some other publications about proverbs in Brazilian Portuguese, but they do not present any systematic analysis.
These include didactic materials used in schools, dictionaries, glossaries, and lists of proverbs. Most of them are used in teaching/learning Portuguese as a second language or as didactic manuals. For Brazilian Portuguese, it is still necessary to formally describe the syntactic structures of proverbs and their core elements, with a view to contributing to the construction of lexicon-syntactic resources applicable in NLP.

4 Methods

In this section we present a methodology for the automatic detection of proverbs and their variants, tested on a Brazilian Portuguese corpus, which can be summarized in six steps: (i) creating a database with proverbs found in dictionaries and other lists; (ii) defining syntactic criteria to organize the collected proverbs into formal classes; (iii) manually identifying the POS tags of their elements; (iv) generating tables with the core elements derived from POS tagging; (v) creating graphs with the basic structure for each class; and (vi) intersecting the graphs with the tables of the proverbs’ core elements to produce finite-state transducers that enable us to identify such word combinations in texts. After these steps, we could find other proverbs and their semantic variations within the same syntactic structure.

We searched for the proverbs and their variants in the PLN.Br Full corpus [3], which contains 103,080 texts, with 29,014,089 tokens, from Folha de São Paulo, a Brazilian newspaper, from 1994 to 2005.

4.1 Collection of Proverbs

The first step of this work consists in creating a list of proverbs that will serve as input seeds to recognize other proverbs and their variants in large corpora. Five different sources were used: a list of proverbs in Wikipedia, three books with proverb collections [29, 32, 34] and a dictionary of proverbs [19].
Firstly, all the expressions collected from these sources were analyzed manually, and many were discarded because they were not considered proverbs but rather idiomatic expressions (or idioms), like (1), or aphorisms and maxims, as in (2):

(1) Matar dois coelhos com uma cajadada só
    [to] kill two bunnies with just one thwack
    ‘kill two birds with one stone’
(2) Na natureza, nada se cria nada se perde, tudo se transforma
    ‘In Nature, nothing is created, nothing is lost, everything is transformed’

The idiom in (1) is a frozen sentence with a free subject slot and two frozen complements, a direct object and an instrumental complement [1, 18, 35] (class C1P2). On the other hand, (2) is an aphorism or maxim, attributed to the chemist Lavoisier (1743-1794), about the conservation of mass. In spite of its three-clause, parallelistic, proverb-like structure and its generic nature, the (known) authorship of the maxim led us to discard it from our study.

After a substantial collection of 3,502 proverbs (and their variants) had been gathered, the variants of each proverb were grouped together and one of them was selected as the entry of our lexicon (its base-form), based on its frequency among the sources consulted. Most differences between variants of the same proverb consist in the variation of their grammatical elements and in the lexical choices for their core meaningful words.

Finally, we tried to confirm whether these proverbs were (still) really in use in current Brazilian Portuguese, checking them with 5 native speakers of Brazilian Portuguese from different geographic regions.¹ Some proverbs are only used in Portugal or in Portuguese-speaking African countries, while others are very old and may no longer be in use.
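The choice of a base-form by frequency among the sources can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes that each group of variants is available as a list with one entry per attestation in a source, and the function name `choose_base_form`, the tie-breaking rule and the sample attestations are all hypothetical.

```python
from collections import Counter


def choose_base_form(attestations):
    """Return the variant attested most often across the consulted sources.

    `attestations` holds one string per occurrence of a variant in a source.
    Ties go to the variant seen first (an assumption; the text does not say
    how ties were resolved).
    """
    counts = Counter(attestations)
    return counts.most_common(1)[0][0]


# Hypothetical attestations of one proverb in five sources.
group = [
    "Cachorro mordido de cobra tem medo de linguiça",
    "Cachorro mordido de cobra tem medo de barbante",
    "Cachorro mordido de cobra tem medo de linguiça",
    "Cachorro mordido por cobra tem medo de linguiça",
    "Cachorro mordido de cobra tem medo de linguiça",
]
base_form = choose_base_form(group)
print(base_form)
```

The remaining variants of the group would then be kept alongside the base-form so that they can still be searched for in the corpus.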
From the original 3,502 proverbs (and their variants), a final list of 594 proverbs (types or base-forms) was compiled.²

¹ We consider that the sampling by region is not sufficient to confirm the presence or absence of proverbs, and we would need to consult speakers of different genders, ages, social classes, education levels, etc.; this is outside the main scope of this work.
² The list of proverbs and their classification can be consulted at the first author's profile in ResearchGate, available at https://www.researchgate.net/project/PB-proverbs.

4.2 Classifying Proverbs and POS Tagging their Elements

The list of proverbs (base-forms) was then classified into formal classes. This classification was based on the following criteria, applied in this order: (i) the number of propositions (one, two, or three clauses or clause-like units); (ii) coordination (in multiple-clause proverbs); (iii) order of the main vs the subordinate clauses (in multiple-clause proverbs); (iv) order of the constituents (in single-clause proverbs); (v) impersonal constructions; and (vi) obligatory negation. Table 1 presents the current classification.

Table 1 Formal Classification of Brazilian Portuguese Proverbs.

| Class | Structure | Example (approximate translation) | Types |
|-------|-----------|-----------------------------------|-------|
| P1F1 | Ø V w (impersonal) | Não há crime sem lei ‘There is no crime without law’ | 20 |
| P1F2 | N0 Vcop Adj/N w | A carne é fraca ‘The flesh is weak’ | 53 |
| P1F3 | N0 V w | O hábito (não) faz o monge ‘The cloth (does not) make the monk’ | 80 |
| P1F4 | N0 Neg V w | Burro velho não aprende línguas ‘Old donkey does not learn languages’ | 53 |
| P1F5 | Prep Ni N0 V w (fronted prep. phrase) | Para bom entendedor, meia palavra basta ‘For the one who understands, half word is enough’ | 45 |
| P2F1 | F1 Conjs-comp F2 (comparatives) | Mais vale um pássaro na mão do que dois voando ‘Better is a bird in the hand than two flying’ | 39 |
| P2F2 | F1 Conjc F2 (coordinated) | A palavra é de prata e o silêncio é de ouro ‘The word is silver and the silence is gold’ | 71 |
| P2F3 | N1, N2 | Tal pai, tal filho ‘Like father, like son’ | 48 |
| P2F4 | Qu-F1 F2 (interrogative subclass) | Quem tem boca vai a Roma ‘Who has a mouth goes to Rome’ | 90 |
| P2F5 | F1 Conjs F2 (subordinated) | Os amigos são muitos quando grande é a abastança ‘Friends are many when abundance is great’ | 20 |
| P2F6 | Conjs F2, F1 (fronted subord.) | Quando a esmola é demais, o santo desconfia ‘When alms are too much, the saint gets suspicious’ | 28 |
| P3 | F1, F2, F3 | Um é pouco, dois é bom, três é demais ‘One is little, two is good, three is too much’ | 47 |
| Total | | | 594 |

Some remarks on this classification are in order:
(i) impersonal constructions involve the verbs haver ‘there be’ and ter ‘to have’ with impersonal valency (the latter only exists in Brazilian Portuguese);
(ii) sentences with the copula verbs ser and estar ‘to be’ usually present an adjectival or nominal predicate; these sometimes allow for mirror permutation (A carne é fraca = fraca é carne³ ‘The flesh is weak’);
(iii) proverbs with obligatory negation usually involve negation adverbs, e.g. não ‘no/not’, nunca ‘never’, jamais ‘never’, nem ‘nor’, etc.; negation has precedence over copula verbs, so that proverbs with a negated copula were included in this class;
(iv) single-clause proverbs with a fronted prepositional phrase do not admit the basic word order;
(v) comparative proverbs, including those with a subordinate sub-clause, are a type of complex sentence, though other types of comparative structures were also included in this class;
(vi) nominal propositions named N1, N2 (in class P2F3) are treated as clausal propositions, even if they may contain no verbs and only have a ‘clausal’ or ‘propositional content’.
³ http://rainhadocarmelo.blogspot.pt/2010_02_01_archive.html [2014-03-08 13:11]

After classifying the proverbs, we manually annotated their elements with part-of-speech (POS) tags. Since each class is syntactically homogeneous, it was then relatively simple to organize the lexical items in a tabular format, so that the characteristic elements of the proverbs are aligned and can easily be identified. For the noun phrases (NP), either the subject (N0) or the complement (N1), the head noun (or pronoun) is determined, and any determiners (Det) or modifiers (Mod) are tagged and distributed across the corresponding columns. Any pre- or post-modifiers of verbs (Deus escreve direito por linhas tortas ‘God writes straight with crooked lines’), including obligatory auxiliary verbs (Não se entra em briga que não se pode ganhar ‘Do not enter into a fight you cannot win’), and other elements, such as the impersonal pronouns (Aqui se faz, aqui se paga ‘Here you do, here you pay’)⁴ or obligatory negation (Quem não tem cão caça com gato ‘Who does not have a dog hunts with a cat’), are also taken into consideration. Subordinative or coordinative elements are also provided with an adequate slot. In this way, it is relatively simple to automatically extract the core (or most representative) elements from each proverb, based on the classes’ formal homogeneity.

4.3 Extracting Core Elements

In order to extract the core words of each proverb, we analyzed all cells in each table and selected as core elements the most frequent grammatical classes in each syntactic position. For example, in almost all classes⁵ the initial NP is necessarily filled by a noun or, in rare cases, a pronoun.
The noun can be accompanied by determiners and/or adjectives and/or other nominal adjuncts, but the only position that is always filled by some element is the column <N>, either in the subject or in the complement position, so we selected the item instantiated in column <N> as one of the core elements for identifying the proverb. In all classes⁶, the VP position is necessarily filled by a verb, so the verb is selected as a key element in the constitution of the proverbs. Table 2 shows a sample of class P1F3, in a tabular format, indicating all columns⁷.

⁴ In Portuguese, the impersonal clitic pronoun -se imposes 3rd person singular agreement on the verb, thus being indistinguishable from passive-like pronominal constructions. Only a few clear-cut cases of pronominal passives were found; e.g. Entre mortos e feridos salvaram-se todos ‘Among dead and wounded all were saved’. Both strategies may be considered a form of subject (agent) degenerescence, hence contributing to the generic effect of the proverbs.
⁵ Except for class P1F1, which has no explicit subject (null subject).
⁶ Except for class P2F3, which consists of nominal phrases only and has no verb.

The core elements are defined according to the formal class of the proverb. In the case of class P1F2, the defining elements are the heads of the subject and of the predicative complement (noun or adjective) as well as the copula verb. When the head is null (e.g. Os últimos serão os primeiros ‘The last shall be the first’), the determiner or an adjective may be chosen instead. In comparative proverbs, there is often no main verb, so the determiners (3) or the comparative conjunctions (4) must be selected, along with the core nouns:

(3) Tal pai tal filho
    ‘Like father like son’
(4) Nem tanto ao mar nem tanto à terra
    ‘Not so much to sea not so much to ground’
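The selection of core elements from a class table can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' tooling: the column order is taken from the headings of Table 2, the choice of `N` and `V` as core columns follows the description above for class P1F3, and the names `COLUMNS`, `CORE_COLUMNS` and `core_elements` are hypothetical.

```python
# Column layout assumed from the headings of Table 2 (class P1F3).
COLUMNS = ["Det", "Adj", "N", "Adj", "Indet_Pass", "V",
           "Adv", "Prep", "Det", "Adj", "N", "Adj"]
CORE_COLUMNS = {"N", "V"}  # positions treated as core for this class
EMPTY = ""                 # marker for an unfilled cell (an assumption)


def core_elements(row):
    """Return the non-empty cells of the core columns, in sentence order."""
    return [cell for name, cell in zip(COLUMNS, row)
            if name in CORE_COLUMNS and cell != EMPTY]


# 'O hábito faz o monge': subject noun, verb and complement noun are filled.
row_habito = ["o", "", "hábito", "", "", "fazer", "", "", "o", "", "monge", ""]
# 'As aparências enganam': no complement, so only two core elements remain.
row_aparencias = ["o", "", "aparência", "", "", "enganar",
                  "", "", "", "", "", ""]

print(core_elements(row_habito))      # ['hábito', 'fazer', 'monge']
print(core_elements(row_aparencias))  # ['aparência', 'enganar']
```

For classes without a verb, such as P2F3, the set of core columns would have to be changed to the determiners or conjunctions, as the text describes.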
⁷ In this table the headings are read as follows: Adj = Adjective, Adv = Adverb, Det = Determinant, Indet_Pass = Pronominal passive-like construction, N = Noun, Prep = Preposition, V = Verb; the words inside chevrons correspond to lemmas.

Table 2 Sample of class P1F3 (empty cells are shown as "-").

| Proverb | Det | Adj | N | Adj | Indet_Pass | V | Adv | Prep | Det | Adj | N | Adj |
|---------|-----|-----|---|-----|------------|---|-----|------|-----|-----|---|-----|
| A adversidade faz os heróis | <o> | - | <adversidade> | - | - | <fazer> | - | - | <o> | - | <herói> | - |
| A ambição cega a razão | <o> | - | <ambição> | - | - | <cegar> | - | - | <o> | - | <razão> | - |
| A intenção faz o agravo | <o> | - | <intenção> | - | - | <fazer> | - | - | <o> | - | <agravo> | - |
| A justiça começa em casa | <o> | - | <justiça> | - | - | <começar> | - | em | - | - | <casa> | - |
| A ocasião faz o ladrão | <o> | - | <ocasião> | - | - | <fazer> | - | - | <o> | - | <ladrão> | - |
| A união faz a força | <o> | - | <união> | - | - | <fazer> | - | - | <o> | - | <força> | - |
| As aparências enganam | <o> | - | <aparência> | - | - | <enganar> | - | - | - | - | - | - |
| As más notícias chegam depressa | <o> | <mau> | <notícia> | - | - | <chegar> | depressa | - | - | - | - | - |
| As paredes têm ouvidos | <o> | - | <parede> | - | - | <ter> | - | - | - | - | <ouvido> | - |
| Boas contas fazem bons amigos | - | <bom> | <conta> | - | - | <fazer> | - | - | - | <bom> | <amigo> | - |
| Deus escreve certo por linhas tort[as] | - | - | <deus> | - | - | <escrever> | certo | por | - | - | <linha> | <torto> |
| Mentira tem perna curta | - | - | <mentira> | - | - | <ter> | - | - | - | - | <perna> | <curto> |
| Muitos cozinheiros estragam a so[pa] | <muito> | - | <cozinheiro> | - | - | <estragar> | - | - | <o> | - | <sopa> | - |
| O abismo atrai o abismo | <o> | - | <abismo> | - | - | <atrair> | - | - | <o> | - | <abismo> | - |
| O hábito faz o monge | <o> | - | <hábito> | - | - | <fazer> | - | - | <o> | - | <monge> | - |
| O justo paga pelo pecador | <o> | - | <justo> | - | - | <pagar> | - | por | <o> | - | <pecador> | - |
| O peixe se conhece pela boca | <o> | - | <peixe> | - | se | <conhecer> | - | por | <o> | - | <boca> | - |
| Os fins justificam os meios | <o> | - | <fim> | - | - | <justificar> | - | - | <o> | - | <meio> | - |
| Roupa suja se lava em casa | - | - | <roupa> | <sujo> | se | <lavar> | - | em | - | - | <casa> | - |

Figure 1 Reference graph for class P2F4.

In the common cases where a lexical element of the proverb allows for variation, all the variants are included in the corresponding slot.
This is the case of the proverb Cachorro mordido de cobra tem medo de linguiça ‘Dog bitten by a snake is afraid of sausage’, where the second noun can be replaced by barbante ‘string’ or salsicha ‘sausage’; notice, however, that the variation of grammatical elements in (5) was ignored:⁸

(5) Cachorro (que foi + <E>) mordido (de + por) cobra tem medo até de (barbante + salsicha + linguiça)
    ‘Dog (that was + <E>) bitten by a snake is afraid of (string + sausage + pork sausage)’

4.4 Creating and Applying the Graphs

Once the characteristic elements of each proverb had been identified, they were structured in a tabular format, one table for each class (the residual class “others” was not considered in this paper). Then, using the Unitex 3.1.beta linguistic development platform [27, 28], we produced a reference graph for each class. Fig. 1 illustrates the graph for class P2F4, corresponding to proverbs with a fronted subordinated clause; e.g. Se queres conhecer o vilão, põe-lhe um pau na mão ‘If you want to know a villain, put a stick in his hand’.

This graph reads as follows: the system systematically explores each line in the table of a class's core elements, replacing the variables @A, @B, etc., by the corresponding content of columns A, B, etc. These input variables are then associated with output variables (the letters below the brackets) to be reused in the output. In this case, the graph delimits the matched expression by brackets and produces the content in a normalized form, introduced by the idiom number (the table’s line number), represented by variable @%⁹.

By intersecting the reference graph with the corresponding table, the system generates one subgraph for each line of the table, and a general result graph containing all the subgraphs. The result graph can then be used to find patterns in texts.

Table 3 shows a sample of a concordance of such matched strings from the PLN.Br corpus. Each line in the table has been numbered.
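The effect of instantiating one table line as a pattern with an insertion window can be approximated outside Unitex. The sketch below swaps the finite-state graphs for Python regular expressions, so it is only an analogy: the function name `build_matcher` is hypothetical, matching is done on surface forms rather than on the table's cells, and the window size is left as a parameter. The free sentence is one of the concordance lines of Table 3.

```python
import re


def build_matcher(core_words, max_gap=3):
    """Compile a regex that finds the core words in order, allowing up to
    `max_gap` other words between two consecutive core words."""
    gap = r"(?:\W+\w+){0,%d}\W+" % max_gap
    body = gap.join(re.escape(word) for word in core_words)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)


# Core elements of 'Quem sabe faz', with and without an insertion window.
loose = build_matcher(["sabe", "faz"], max_gap=3)
strict = build_matcher(["sabe", "faz"], max_gap=0)

proverb = "Quem sabe faz"
free_sentence = "Quem sabe alguém faz uma experiência com isso"

print(bool(loose.search(proverb)))         # True
print(bool(loose.search(free_sentence)))   # True: a false positive
print(bool(strict.search(proverb)))        # True
print(bool(strict.search(free_sentence)))  # False
```

A wider window lets the pattern follow variants with inserted material, at the cost of matching free sentences that merely contain the same words in the same order.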
In this concordance, a small left context is provided, followed by the number of the proverb type in the corresponding class, the actual words in the corpus and the core words that the transducer detected; empty variables are not represented (void commas).

Table 3 Sample of a concordance of Class P2F4.

 1  é o [0003 barato que pode sair caro=barato, caro,,,]
 2  não [0006 mata engorda=mata, engorda,,,]
 3  Quem [0015 avisa amigo é=avisa, amigo,,,]
 4  Quem [0018 cala consente=cala, consente,,,]
 5  Quem [0019 Canta Seus Males Espanta=Canta, Males, Espanta,,]
 6  e como [0020 casei e quero casa=casei, quero, casa,,]
 7  quem [0023 conta um conto aumenta um ponto=conta, conto, aumenta, ponto,]
 8  quem [0028 diz o que quer ouve o que não quer=diz, quer, ouve, quer,]
 9  não [0042 arrisca não só não petisca=arrisca, petisca,,,]
10  que não [0043 choram nem mamam=choram, mamam,,,]
11  não [0044 deve não teme=deve, teme,,,]
12  Quem [0047 está dentro quer sair e quem está fora não=está, dentro, quer, sair,]
13  não [0050 sabe não ensina=sabe, ensina,,,]
14  quem [0062 pariu Mateus que o embale=pariu, Mateus, embale,,]
15  quem [0064 procura acha=procura, acha,,,]
16  Quem [0068 sabe alguém faz uma experiência com isso=sabe, faz,,,]
17  quem [0068 sabe faz a hora=sabe, faz,,,]
18  Quem [0068 Sabe Faz ao Vivo=Sabe, Faz,,,]
19  Quem [0069 sabe sabe=sabe, sabe,,,]
20  os que [0070 semeiam ventos colhem tempestades=semeiam, ventos, colhem, tempestades, ]
21  "Quem [0079 tem pressa come cru=tem, pressa, come, cru, ]
22  "quem [0085 vê cara não vê coração=vê, cara, coração,,]
23  quem [0085 vê cara não vê pulmão=vê, cara, vê,,]
24  Quem [0085 vê cara vê muito mais do que coração=vê, cara, vê, coração,]
25  Quem [0086 viver verá=viver, verá,,,]

The table presents two matches that are considered false positives, in lines 16 and 17. The proverb supposed to be found is Quem sabe faz ‘Who knows makes’, but the system found, for example, a free sentence (line 16) and a verse of a Brazilian song (line 17).

Also remarkable are the transformations (actualizations or adaptations) created by speakers. The proverb we were looking for is Quem vê cara não vê coração ‘Who sees the face does not see the heart’, as in line 22, but the speaker adapted the proverb to the context of smoking and created Quem vê cara não vê pulmão ‘Who sees the face does not see the lung’, as in line 23. In line 24 the obligatory negation of the original proverb has been deleted and the meaning actually inverted in a creative way.

In this way it was possible to find variants of proverbs other than those we had previously collected (from books, dictionaries and Wikipedia) and to find several instances of creative reuse and transformation of proverbs for rhetorical purposes.

⁸ The items linked by “+” inside parentheses can commute in the given syntactic slot; the symbol <E> represents the empty string.
⁹ The shadowed box Ins is a subgraph defining a window of 0 to 3 words and separators allowed between the proverbs’ core elements.

5 Results and Discussion

Since, to our knowledge, there is no available corpus annotated with proverbs and similar expressions, only precision was reported here.
From the previous list of 594 proverbs, 788 matches were found in the PLN.Br corpus, of which 474 (60.15%) correspond to actual proverbs. We decided to search for these lexical units in a journalistic corpus in order to check whether they also appear in common language. It has been shown [33] that literary corpora contain a large number of proverbs, but the challenge is looking for them in non-literary texts. Table 4 shows the breakdown of these results by class.

Table 4 Results of automatic identification of proverbs by class.

| Class | Proverbs (types) | Matches | Types | False-Positives |
|-------|------------------|---------|-------|-----------------|
| P1F1 | 20 | 15 | 4 | 2 |
| P1F2 | 53 | 91 | 21 | 16 |
| P1F3 | 80 | 153 | 24 | 55 |
| P1F4 | 53 | 61 | 15 | 0 |
| P1F5 | 45 | 63 | 5 | 6 |
| P2F1 | 39 | 40 | 7 | 1 |
| P2F2 | 71 | 14 | 3 | 9 |
| P2F3 | 48 | 40 | 8 | 25 |
| P2F4 | 90 | 276 | 37 | 200 |
| P2F5 | 20 | 3 | 1 | 0 |
| P2F6 | 28 | 1 | 1 | 0 |
| P3 | 47 | 31 | 11 | 0 |
| Total | 594 | 788 | 137 | 314 |

In spite of the number of matches, only 137 types (different proverbs) were found. The scarcity of proverbs in the corpus (1:36,820 words), as well as their reduced variety (23% of the types), is most probably linked to the journalistic nature of the corpus.

In this respect, the number of instances retrieved for class P2F4 is remarkable, as is its low precision (27.5%). This class includes only two lexical items besides the indefinite subject pronoun quem ‘who’, as in Quem cala consente ‘[he] who is silent [gives his] consent’. Since these are very short proverbs, a window of 5 words between the core elements may be inadequate. We repeated the experiment without any insertion window and captured 56 matches, of which 26 were false positives. The local precision of class P2F4 rose from 27.5% to 53.57%. Considering all classes, global precision rose from 60.15% to 73.35%. This may indicate that, depending on the syntactic structure of the proverb, a wider or narrower window between the core elements must be defined.
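The precision figures can be recomputed directly from the counts in Table 4. The sketch below only restates that arithmetic; the names `TABLE_4` and `precision` are hypothetical, and precision is taken to be the share of matches that are not false positives, which is how the 474 out of 788 figure is obtained.

```python
# (class, proverb types in the lexicon, matches, types found, false positives)
TABLE_4 = [
    ("P1F1", 20, 15, 4, 2), ("P1F2", 53, 91, 21, 16),
    ("P1F3", 80, 153, 24, 55), ("P1F4", 53, 61, 15, 0),
    ("P1F5", 45, 63, 5, 6), ("P2F1", 39, 40, 7, 1),
    ("P2F2", 71, 14, 3, 9), ("P2F3", 48, 40, 8, 25),
    ("P2F4", 90, 276, 37, 200), ("P2F5", 20, 3, 1, 0),
    ("P2F6", 28, 1, 1, 0), ("P3", 47, 31, 11, 0),
]


def precision(matches, false_positives):
    """Percentage of matches that correspond to actual proverbs."""
    return 100.0 * (matches - false_positives) / matches


total_matches = sum(row[2] for row in TABLE_4)
total_false_positives = sum(row[4] for row in TABLE_4)

overall = round(precision(total_matches, total_false_positives), 2)
p2f4_with_window = round(precision(276, 200), 2)
p2f4_no_window = round(precision(56, 26), 2)

print(total_matches, total_false_positives)  # 788 314
print(overall)                               # 60.15
print(p2f4_with_window)                      # 27.54
print(p2f4_no_window)                        # 53.57
```

Recall cannot be computed in the same way, since it would require a corpus in which all proverb occurrences are already annotated.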
The system matched 137 different proverbs from the previous list of 594 entries; their distribution is presented in Fig. 2. A few other proverbs have higher frequencies, but they were collapsed in Fig. 2 because they form a small number of proverbs with relatively high frequency.¹⁰ The small number of different proverbs matched by the system (23% of the total types) is probably due to the nature of the corpus. Some proverbs, as we will see below, have been adapted and reconfigured to fit the discursive needs of the author.

¹⁰ Namely, f=13, f=16, f=20, f=22, f=44, f=52, f=55 and f=88.

Figure 2 Distribution of proverbs in corpus PLN.Br Full.
