Toward Computational Processing of Less Resourced Languages

Chapter 9

Toward Computational Processing of Less Resourced Languages: Primarily Experiments for Moroccan Amazigh Language

Fadoua Ataa Allah and Siham Boulaknadel

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/51069

1. Introduction

The world is undergoing a huge transformation from industrial economies into an information economy, in which the indices of value are shifting from material to non-material resources. This transformation has rightly been described as a revolution that brings considerable dangers for the future and survival of many languages and their associated cultures. Recent years have seen a growing tendency to investigate the application of language processing methods to languages other than English. However, most development of language processing tools and methods has so far concentrated on a fairly small number of languages, mainly European and East-Asian ones.

Nevertheless, it is essential that people all over the world be able to use their own language when accessing information on the Internet or using computers. To this end, a variety of applications is needed, and substantial funding is involved. But because most of the research sponsored around the world has focused only on the economically and politically important languages, the language technology gap between the languages of the developed countries and those of the less developed ones keeps growing wider.

According to some linguists' estimates in 1995, half of the world's 6000 languages are disappearing, and 2000 of the remaining 3000 will be threatened in the next century [1]. This means that if no efforts are made to reduce the technology gap and to preserve these languages, many of them will disappear completely by the end of this century.
Unfortunately, there are numerous obstacles to progress in language processing for this kind of language. On the one hand, the features of the language itself might require specific strategies. On the other hand, the lack of previously existing language resources produces a vicious circle: having resources makes creating electronic ones and processing tools easier, but not having resources makes the development and testing of new ones more difficult and time-consuming. Furthermore, there is usually a disturbing lack of interest in the need for people to be able to use their own language in computer applications.

© 2012 Ataa Allah and Boulaknadel; licensee InTech. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
198 Text Mining

Many efforts are needed to help revitalize endangered languages, which are generally under or less resourced languages. One way is to encourage younger generations to use their mother tongue by building e-learning platforms and creating instructive games. Oral documentation can be used to preserve the culture of endangered languages, especially since many of these languages are only spoken. They have rich oral cultures with stories, sayings, songs, chants and histories, but no written forms. So, the extinction of such a language will quickly lead to the annihilation of its culture. Machine translation systems can also be employed to produce translations from other languages, in order to extend the use of these languages from familiar and home use to more formal social contexts such as media, administration, and commercial relations. Another way to contribute to preserving endangered languages is the use of the Internet.
The latter can be useful for raising awareness about the issues of language extinction and language preservation.

In this context, this paper presents the key strategies for improving the position of endangered languages in human language technologies. It describes the experiments currently underway on Amazigh at the Computer Science Studies, Information Systems and Communications Center (CEISIC) of the Royal Institute of the Amazigh Culture (IRCAM), which aim to make this language more widely communicated and used in the community.

2. Strategies for enhancing under and less resourced languages

Recently, several private companies, technology centers, and public institutes have begun to take an interest and to invest in developing technology for under and less resourced languages. To deal with this task successfully, some studies have focused on the main strategies that could be adopted in order to promote and develop this set of languages.

2.1. Linguistic contribution

Generally, the computational processing of a language involves linguistic contributions that consist of matching or modeling language competence by discovering and formally presenting the rules governing this language. These linguistic contributions can be efficiently shared through collaborative work on the web [2], replacing a local development team with a potentially bigger distributed team. This idea avoids duplication and waste of effort and resources. It was investigated in an early Montaigne project (1996), and has been implemented at GETA for the Lao language. It has also been applied by Oki to the Japanese language and by NII/NECTEC to a Japanese-Thai dictionary [3].

2.2. Resource recycling

Building electronic resources is an indispensable part of any computational language processing.
However, this task requires time and valuable human competence. An alternative solution for developing such resources is first to obtain electronic files by using Optical Character Recognition (OCR) [4], and then to generate from these files resources in a suitable standardized format that can be exploited for automated tasks.

Resource standardization is an important step in the process of resource building. It allows the reuse of resources in different research projects, tools and applications. Furthermore, it facilitates the maintenance of a coherent document life cycle through various processing stages, and enables the enrichment of existing data with new information.

2.3. Adapting CLP techniques

Adapting Computational Language Processing (CLP) techniques is an interesting way to build new tools for a specific language while taking advantage of the similarity between languages' features. This kind of experiment has been applied particularly in machine translation projects. One of these projects is the ‘MAJO system’, where exploiting the syntactic and morphological similarities between Japanese and Uighur has helped considerably in obtaining good results [5].

2.4. Extensibility focused

The philosophy of this direction suggests that any project should be designed in such a way that others can easily extend the work to another level. This means that the project's development should not focus only on getting results, but also on enabling others to continue the work [6]. There are several examples in this context. The ‘Acabit system’ was first developed for the extraction of French multiword terms, and was then extended to Japanese and Arabic [7]. Similarly, the ‘NOOJ framework’ was built for European languages, and work on this framework is still continuing for other languages such as Arabic [8] and Amazigh [9].

2.5. Open source focused

In general, under and less resourced languages are economically poor, whereas computational language processing requires substantial funding. To get around this obstacle and to cut costs, it is suggested to adopt the open source strategy. Furthermore, this strategy allows the adoption of the two previous directions (adapting CLP techniques and extensibility focused).

2.6. Professional documentation

Documentation also greatly helps in the continuation and extension of projects. This documentation could take the form of manuals or websites, assisting people who may be interested in using a project, or allowing them to access any phase of the work and continue its development.

2.7. Evaluation system

The evaluation system can be defined as a process for measuring the gap between fixed objectives and attained results. The choice of when to evaluate depends on the aim of the evaluation. Generally, a project can be evaluated before its realization, to make a diagnosis that determines the objectives of the project and its prerequisites; during development, to make a progressive evaluation that pilots and directs the progress of the development; and after implementation, to make a final evaluation that reports the level of satisfaction, the relevance and durability of the project, and finally its continuity and extensibility.

2.8. Road map

We are conscious that search engines, machine translation, human-machine dialogue, and e-learning play a key role in the survival of under and less resourced languages, in that they strongly help these languages find their way into our daily lives by extending their use from familiar contexts to social ones. We have therefore prepared a clear vision for realizing these specific projects through a progressive approach.
While studying these projects, we have noted that:

• A search engine is designed to find the information the user needs by understanding his/her query, retrieving the relevant information related to that query independently of the language used, and presenting a ranked list of search results. To this end, most search engines are based on automatic web crawlers, ranking algorithms, and relevance techniques of automatic indexing. The latter rely on keyword-based indexing, linguistic analysis indexing, concept-based indexing, or multilingual indexing.

• Machine translation aims to translate text with roughly the skill of a human, by ensuring high-quality hierarchical phrase-based translation. To this end, most machine translation systems combine the strengths of rule-based and statistical approaches to reduce the amount of linguistic information and training data required, and also to reduce the size of the statistical models while maintaining high performance. The rule-based machine translation (RBMT) approach is described as interlingual or transfer-based machine translation. It is based on lexicons with morphological, syntactic, and semantic information, and on sets of rules. The statistical machine translation (SMT) approach, in contrast, relies on parallel corpora to properly train the translation system. The two approaches can be merged in different ways: translation is performed using a rule-based engine, and statistics are then used to adjust or correct its output; rules are used to pre-process data in order to better guide the statistical engine; or rules are used to post-process the statistical output.

• Human-machine dialogue aims to support interactions between users and machines by designing products that are receptive to the user's needs.
A human-machine dialogue system can be represented as four processes: a speech recognition process that transcribes spoken sentences into written text, a natural language understanding process that extracts the meaning from the text, an execution process that performs actions based on the meaning of the conversation, and a response generation process that gives feedback to the user.

• E-learning increases access to learning opportunities by offering online transfer of knowledge and skills that is available anytime and anywhere through a variety of electronic learning and teaching solutions, such as Web-based courseware, online discussion groups, live virtual classes, and video and audio streaming. Nowadays, modern technology, especially computational language processing, is widely used in e-learning to assist with reading, writing, and speaking a language. While a person writes a sentence or reads a text aloud, the system can monitor and correct the words that are not right, or even analyze and tutor particular problems.

From this study we have noticed that these projects are mutually related, and that one can act as a part of another. Furthermore, they are based on various kinds of processing that require a large amount of specialized knowledge. Therefore, we have identified a list of the processes needed to ensure the functionality of these projects, and we have suggested arranging them chronologically in a road map for the short, medium, and long term, according to the availability of resources and the level of functionality expected within each term [10].

As discussed, achieving our goal requires a large amount of specialized knowledge that is mainly encoded in complex systems of linguistic rules and descriptions, such as grammars and lexicons, which will in turn involve a considerable amount of specialized manpower.
Thus, depending on the availability of linguistic expertise and resources, we have estimated that the short term phase will require at least a 5-year to 10-year plan to establish the low level processing and resources, and to pave the way for medium and long term applications. Based on the studies undertaken for well resourced languages, we have estimated that the two other phases will require only a 5-year plan. Figure 1 represents the road map structured around these three phases.

2.8.1. Short term phase

This phase is considered an initial step. It consists mainly of identifying the language encoding and founding the primary resources, namely keyboard, fonts, basic lexical database (list of lemmas and affixes), and the elementary corpora that serve in the elaboration of most computational language processing applications (raw text corpus, corpus for evaluating search engines and information retrieval systems, manually part of speech tagged corpus, and speech corpus). Furthermore, basic tools and applications such as an encoding converter, a sentence and token splitter, a basic concordancer and web search engine, a morphological generator, and an optical character recognition system also need to be developed in this phase.

2.8.2. Medium term phase

After the first phase has paved the way by elaborating the fundamental and basic resources, this phase needs to focus on advanced tools and applications. Depending on the size and the representativity of the elaborated resources, the processing tools of this phase could be either rule-based or statistical. The most important processing tasks to undertake in this step are stemming or lemmatization (depending on the morphological features of the studied language), part of speech tagging, morphological analysis, chunking, syntactic analysis, and speech recognition.
These processing tools will make it possible to build a spell checker, a terminology extractor, a text generator, and a human-machine dialogue system. Furthermore, they will allow the enhancement of the first phase tools and applications. The medium term phase is also the time to prepare the resources needed for the next step, including multilingual dictionaries, multilingual aligned corpora, and semantically annotated corpora.

Figure 1. Road map for under and less resourced languages.

2.8.3. Long term phase

The third step of the road map can be considered the synthesis phase of the work carried out. Besides the elaboration of a pronunciation lexicon, a WordNet, word-sense disambiguation and speech synthesis, this phase also focuses on multilingual applications, mainly machine translation systems.

3. Amazigh language features

The Amazigh language, known as Berber or Tamazight, is a branch of the Afro-Asiatic (Hamito-Semitic) languages [11, 12]. Nowadays, it covers the Northern part of Africa, which extends from the Red Sea to the Canary Isles and from the Niger in the Sahara to the Mediterranean Sea.

3.1. Sociolinguistic context

In Morocco, this language is divided, due to historical, geographical and sociolinguistic factors, into three main regional varieties, depending on the area and the communities: Tarifite in the North, Tamazight in Central Morocco and the South-East, and Tachelhite in the South-West and the High Atlas.

Amazigh is spoken by approximately half of the Moroccan population, either as a first language or bilingually with the spoken Arabic dialect. However, until 1994 it was reserved for the family domain only [13].
But in 2001, thanks to the Speech of King Mohammed VI, which established by a Dahir the creation of the Royal Institute of the Amazigh Culture, the Amazigh language became a nationally recognized institutional language; and in July 2011, it became an official language alongside classical Arabic.

3.2. Tifinaghe-IRCAM graphical system

Since ancient times, the Amazigh language has had its own script, called Tifinaghe. It is found engraved on stones and tombs in some historical sites dating back 25 centuries. Its written form has continued to change, from the traditional Tuareg writing to Neo-Tifinaghe at the end of the sixties, and to Tifinaghe-IRCAM in 2003.

The Tifinaghe-IRCAM graphical system has been adapted and computerized in order to provide the Amazigh language with an adequate and usable standard writing system. Because it was chosen to represent all the Moroccan Amazigh varieties as well as possible, it tends to be phonological [14].

However, before Tifinaghe-IRCAM was adopted as an official graphic system in Morocco, the Arabic script was widely used for religious writing and rural poetry, and the Latin script supported by the International Phonetic Alphabet (IPA) was used particularly in missionaries' works.

The Tifinaghe-IRCAM graphical system contains:

• 27 consonants including: the labials (ⴼ, ⴱ, ⵎ), the dentals (ⵜ, ⴷ, ⵟ, ⴹ, ⵏ, ⵔ, ⵕ, ⵍ), the alveolars (ⵙ, ⵣ, ⵚ, ⵥ), the palatals (ⵛ, ⵊ), the velars (ⴽ, ⴳ), the labiovelars (ⴽⵯ, ⴳⵯ), the uvulars (ⵇ, ⵅ, ⵖ), the pharyngeals (ⵃ, ⵄ) and the laryngeal (ⵀ);

• 2 semi-consonants: ⵢ and ⵡ;

• 4 vowels: three full vowels ⴰ, ⵉ, ⵓ and a neutral vowel (or schwa) ⴻ, which has a rather special status in Amazigh phonology.

3.3. Punctuation and numeral

No particular punctuation is known for Tifinaghe.
IRCAM has recommended the use of the international symbols (“ ” (space), “.”, “,”, “;”, “:”, “?”, “!”, “…”) as punctuation markers, and of the standard numerals used in Morocco (0, 1, 2, 3, 4, 5, 6, 7, 8, 9) for the Tifinaghe writing system.

3.4. Directionality

Historically, in ancient inscriptions, the Amazigh language was written horizontally from left to right and from right to left; vertically upwards and downwards; or in boustrophedon. However, the orientation most often adopted in Amazigh script is horizontal and from left to right, which is also the one adopted in the Tifinaghe-IRCAM writing system.

3.5. Amazigh morphological properties

The main syntactic categories of the Amazigh language are the noun, the verb, and the particles [14, 15, 16].

3.5.1. Noun

In the Amazigh language, the noun is a lexical unit formed from a root and a pattern. It can occur in a simple form (ⴰⵔⴳⴰⵣ ‘argaz’ the man), a compound form (ⴱⵓⵀⵢⵢⵓⴼ ‘buhyyuf’ the famine), or a derived one (ⴰⵎⵢⴰⵡⴰⴹ ‘amyawaḍ’ the communication). This unit varies in gender (masculine, feminine), number (singular, plural) and case (free case, construct case).

3.5.2. Verb

The verb, in Amazigh, has two forms: basic and derived. The basic form is composed of a root and a radical, while the derived one is based on the combination of a basic form and one of the following prefix morphemes: ⵙ ‘s’ / ⵙⵙ ‘ss’ indicating the factitive form, ⵜⵜ ‘tt’ marking the passive form, and ⵎ ‘m’ / ⵎⵎ ‘mm’ designating the reciprocal form. Whether basic or derived, the verb is conjugated in four aspects: aorist, imperfective, perfect, and negative perfect.

3.5.3. Particles

In the Amazigh language, a particle is a function word that can be assigned neither to the noun nor to the verb.
This category contains pronouns, conjunctions, prepositions, aspectual, orientation and negative particles, adverbs, and subordinates. Generally, particles are uninflected words. However, in Amazigh some of these particles are inflected, such as the possessive and demonstrative pronouns (ⵜⴰ ‘ta’ this (fem.) → ⵜⵉⵏⴰ ‘tina’ these (fem.)).

4. The complexity of Amazigh in CLP

Amazigh is an official language in Morocco. However, it has received little attention from the computational point of view for many years. Moreover, it is among the languages with rich morphology and different writing forms. Below we describe the difficulties that the Amazigh language faces in the development of computational language applications.

4.1. Amazigh script

Amazigh is one of the languages with complex and challenging pre-processing tasks. Its writing system poses three main difficulties:

• Variation in writing forms, which requires a transliterator to convert all writing conventions into the standard form ‘Tifinaghe – Unicode’. This process is confronted with spelling variation related to regional varieties ([tfucht] / [tafukt] (sun)) and to transcription systems ([tafuct] / [tafukt]), especially when the Latin or Arabic alphabet is used.

• The adopted standard form ‘Tifinaghe – Unicode’ requires special consideration even in simple applications. Most existing CLP applications were developed for the Latin script. Therefore, those that will be used for Tifinaghe – Unicode require localization and adjustment.

• Different writing conventions differ in how they use or omit spaces within or between words ([tadartino] / [tadart ino] (my house)).

4.2. Phonetic and phonology

The Amazigh phonetic and phonological problems depend particularly on the regional varieties. These problems consist of allophones and two kinds of correlations: the contrast between constrictive and occlusive consonants, and that between lax and tense ones.
• The allophone problems concern single phonemes that are realized in different ways, such as /ll/ and /k/, which are pronounced respectively as [dž] and [š] in the North.

• The contrast between constrictive and occlusive consonants concerns particularly the Riffian and the Central varieties. These have a strong tendency toward plosive spirantization, where the plosives b, t, d, ḍ, k, g are realized as their spirantized (fricative) counterparts.

• In the Amazigh phonological system, all phonemes can alternate from lax to tense, the tense realization being characterized by greater articulatory energy and often a longer duration. Some phonetic and phonological evidence treats the opposition lax versus tense as a tenseness correlation and not as gemination [17], while other work treats this opposition as gemination [14]. Moreover, the realization of this opposition varies from region to region and from consonant to consonant.

4.3. Amazigh morphology

An additional reason for the difficulty of processing the Amazigh language computationally is its rich and complex morphology. Inflectional processes in Amazigh are based primarily on both prefix and suffix concatenation. Furthermore, the base form itself can be modified in different paradigms, such as the derivational one: when the base form contains a geminated letter, this letter is altered in the derived form (ⵇⵇⵉⵎ ‘qqim’ → ⵙⵖⵉⵎ ‘svim’ (make sit)).

5. Primarily experiments for the Amazigh language

For many decades the Amazigh language was solely oral, exclusively reserved for familial and informal domains, although 50% of the Moroccan population are Amazigh speakers [14]. Since the colonial period, many studies have been undertaken, but most of them have contributed to the collection of the Amazigh oral tradition or have focused on linguistic features. Computational studies, in contrast, were neglected until the creation of the IRCAM in 2001.
This creation has enabled the Amazigh language to obtain an official spelling [18], proper encoding in the Unicode Standard [19], appropriate standards for keyboard realization, and linguistic structures [15, 18].

Nevertheless, this is not sufficient for a less-resourced language such as Amazigh to join the well-resourced languages in information technology. In this context, many research efforts, based on the approaches used for well-resourced languages, have been undertaken at the national level to improve the current situation [20, 21, 22]. In the remainder of this paper we present existing systems and resources built for the Amazigh language.

5.1. Amazigh encoding

5.1.1. Tifinaghe encoding

For many years, the Amazigh language was written in the Latin alphabet supported by the IPA, or in Arabic script. After Tifinaghe was adopted as an official script in Morocco, the Unicode encoding of this script became a necessity, and considerable efforts have been invested to this end. However, this process took a long time to complete, which required the use of ANSI encoding as a first step to integrate the Amazigh language into the educational system at the time.
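
The encoding and transliteration issues raised in Sections 4.1 and 5.1.1 can be illustrated with a minimal Python sketch. It assumes only that Tifinaghe text is stored in the Unicode Tifinagh block (U+2D30 to U+2D7F). The function names `is_tifinaghe` and `to_tifinaghe` are hypothetical, and the mapping table is a partial one reconstructed solely from the word pairs quoted in this chapter (for example ⴰⵔⴳⴰⵣ ‘argaz’); it is not IRCAM's official transliteration table and does not handle regional spelling variants.

```python
import unicodedata

# Boundaries of the Unicode "Tifinagh" block.
TIFINAGH_START, TIFINAGH_END = 0x2D30, 0x2D7F

# Partial Latin -> Tifinaghe table (illustrative assumption): it only covers
# letters attested in the chapter's own examples, e.g. 'argaz', 'buhyyuf',
# 'amyawaḍ', 'tina', 'qqim', 'svim'.
LATIN_TO_TIFINAGHE = {
    "a": "ⴰ", "b": "ⴱ", "g": "ⴳ", "ḍ": "ⴹ", "f": "ⴼ", "h": "ⵀ",
    "i": "ⵉ", "m": "ⵎ", "n": "ⵏ", "q": "ⵇ", "r": "ⵔ", "s": "ⵙ",
    "t": "ⵜ", "u": "ⵓ", "v": "ⵖ", "w": "ⵡ", "y": "ⵢ", "z": "ⵣ",
}


def is_tifinaghe(text: str) -> bool:
    """Return True if every non-space character lies in the Tifinagh block."""
    letters = [c for c in text if not c.isspace()]
    return bool(letters) and all(
        TIFINAGH_START <= ord(c) <= TIFINAGH_END for c in letters
    )


def to_tifinaghe(latin: str) -> str:
    """Map a Latin transcription to Tifinaghe, letter by letter.

    NFC normalization merges 'd' + combining dot below into the single
    character 'ḍ' so that it matches the table. Characters that are not
    in the table (spaces, punctuation, unknown letters) pass through.
    """
    normalized = unicodedata.normalize("NFC", latin.lower())
    return "".join(LATIN_TO_TIFINAGHE.get(c, c) for c in normalized)


if __name__ == "__main__":
    word = to_tifinaghe("argaz")
    print(word, is_tifinaghe(word))
```

A real transliterator would also need rules for digraphs, labiovelars, and the spacing and spelling variation described in Section 4.1; the per-character lookup here is only the simplest possible starting point.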
