
CNL Controlled Natural Language simplifying language use PDF

30 Pages·2014·2.2 MB·Italian


Controlled Natural Language Simplifying Language Use

Workshop Programme

Introduction
09:00 – 09:25  Key-Sun Choi, Hitoshi Isahara, “Workshop Introduction”

Session 1: CNL and Controlled Editing
09:25 – 09:50  Pierrette Bouillon, Liliana Gaspar, Johanna Gerlach, Victoria Porro, Johann Roturier, “Pre-editing by Forum Users: a Case Study”
09:50 – 10:15  Wai Lok Tam, Yusuke Matsubara, Koiti Hasida, Motoyuki Takaai, Eiji Aramaki, Mai Miyabe, Hiroshi Uozaki, “Generating Annotated Corpora with Autocompletion in a Controlled Language Environment”

Session 2: CNL Language Resource and Content Management
10:15 – 10:30  Delyth Prys, David Chan, Dewi Bryn Jones, “What is the Relationship between Controlled Natural Language and Language Registers?”

Coffee Break
10:30 – 11:00

Invited Talk
11:00 – 11:30  Teruko Mitamura (CMU, Language Technologies Institute)

Session 2 Continued: CNL Language Resource and Content Management
11:30 – 11:55  Kara Warburton, “Developing Lexical Resources for Controlled Authoring Purposes”
11:55 – 12:20  Giovanni Antico, Valeria Quochi, Monica Monachini, Maurizio Martinelli, “Marrying Technical Writing with LRT”

Session 3: ISO Standard for CNL
12:20 – 13:00  Key-Sun Choi, Hitoshi Isahara, “Toward ISO Standard for Controlled Natural Language”

Editors
Hitoshi Isahara, Toyohashi University of Technology
Key-Sun Choi, KAIST
Shinhoi Lee, KAIST
Sejin Nam, KAIST

Workshop Organizers
Key-Sun Choi, KAIST
Hitoshi Isahara, Toyohashi University of Technology
Christian Galinski, Infoterm
Laurent Romary, INRIA

Workshop Programme Committee
Key-Sun Choi, KAIST
Hitoshi Isahara, Toyohashi University of Technology
Christian Galinski, Infoterm
Laurent Romary, INRIA

Table of Contents
Toward ISO Standard for Controlled Natural Language
    Key-Sun Choi, Hitoshi Isahara
Pre-editing by Forum Users: a Case Study
    Pierrette Bouillon, Liliana Gaspar, Johanna Gerlach, Victoria Porro, Johann Roturier
Generating Annotated Corpora with Autocompletion in a Controlled Language Environment
    Wai Lok Tam, Yusuke Matsubara, Koiti Hasida, Motoyuki Takaai, Eiji Aramaki, Mai Miyabe, Hiroshi Uozaki
Developing Lexical Resources for Controlled Authoring Purposes
    Kara Warburton
Marrying Technical Writing with LRT
    Giovanni Antico, Valeria Quochi, Monica Monachini, Maurizio Martinelli

Preface

The study of controlled natural language has a long history, owing to its commercial impact and to its effectiveness in applications such as machine translation, librarianship, information management, terminology management, mobile communication and legal documents. “Text simplification” likewise benefits efficient communication in all kinds of language use on the Web; the simplified English Wikipedia is one instance. The current progress of linked data also promises great potential for knowledge acquisition from text and web data, for example through NLP2RDF and its NIF (http://nlp2rdf.org). Data fusion and knowledge fusion clearly benefit more from controlled or simplified text, or from a structured source. There is also a work item on this topic in ISO/TC37 “Terminology and other language and content resources”, which aims to recommend principles for controlled natural language and for its supporting environment and utilization.

This workshop sets out to explore the scope of controlled natural language for simplifying language use in three respects: pre-editing for controlled natural language use; language resources and content management systems in technical writing and mobile life; and interoperability and relations in the context of standardization. As a result, the workshop may identify the environment in which controlled natural language is used; guidelines for its use; its relationship with language resources and other systems; its interlinking, interoperability and dependency with other standards and activities; and the discovery of controlled natural language for human technical writing as well as for knowledge acquisition and knowledge fusion processes, carried out manually and/or automatically by computing and linking in the web environment.
The workshop is also an occasion to observe the cooperative work items, to identify shared tasks that can be worked on together openly, and to analyze their units for simplifying language use.

Toward ISO Standard for Controlled Natural Language

Key-Sun Choi1, Hitoshi Isahara2
1Korea Advanced Institute of Science and Technology
2Toyohashi University of Technology
E-mail: [email protected], [email protected]

Abstract
This standard is the first part of the series of ISO standards that are targeted at controlled natural language (CNL) in written languages. It focuses on the basic concepts and general principles of CNL that apply to languages in general. It will cover the properties of CNL and a CNL classification scheme. The subsequent parts will, however, focus on issues specific to particular viewpoints and/or applications, such as particular CNLs, CNL interfaces, implementation of CNLs, and evaluation techniques for CNL.

Keywords: controlled natural language, ISO standard

1. Scope

This standard is the first part of the series of ISO standards that are targeted at controlled natural language (CNL) in written languages. It focuses on the basic concepts and general principles of CNL that apply to languages in general. It will cover the properties of CNL and a CNL classification scheme. The subsequent parts will, however, focus on issues specific to particular viewpoints and/or applications, such as particular CNLs, CNL interfaces, implementation of CNLs, and evaluation techniques for CNL.

The objective of this standard is to provide language-independent, general-purpose guidelines that enable written texts to be controlled, in a reliable and reproducible manner, to suit a specific situation. In language-related activities in industry, the usability of documents is a fundamental and necessary concept. It is thus critical to have a universal definition of what can be controlled in language, and how, in order to satisfy real-world needs. Controlled natural language can be realized by selecting, simplifying or modifying lexicons and/or linguistic rules. It can also be realized by adding syntactic or semantic tags to original texts. Many applications and fields need to control language, including machine translation (MT), information retrieval (IR) and technical communication (TC).

2. Objectives of controlled natural language

2.1 Purposes of controlled natural language

There are several target applications for which CNL will be developed, such as authoring, language learning and man-machine interface or interaction.

2.2 Beneficiary of controlled natural language

There are several candidate beneficiaries of CNL. CNL can be used by humans and/or by machines.

1) By humans, in more detail:
    Writer, editor, translator and/or reader
    Native speaker or non-native speaker
    Knowledgeable person or non-knowledgeable person
    Handicapped people
2) By machines, more specifically:
    Machine translation, word processor
    Text understanding
    Information retrieval (this standardization would cover text that can be retrieved properly and inquiries that can be converted into a proper structure of keywords)

2.3 What is improved by controlled natural language

The general rules and principles of this standard constitute a systematic approach that makes cross-language, cross-domain and cross-system applications of CNLs more effective.

CNL can aim to:
    Improve readability;
    Reduce ambiguity;
    Speed up reading;
    Be easier to comprehend, e.g.,
        Improvement of comprehension for people whose first language is not the language of the document at hand;
        Improvement of comprehension for people with a different domain or application background;
        Disambiguation (to what extent and for what purpose);
        Avoiding misunderstanding.
    Reduce the cost of the whole process of an application, e.g.,
        Making human translation and localization easier, faster and more cost effective;
        Being used by computer-assisted translation and machine translation.

2.4 Re-usability of Written Text

Another benefit will be the re-usability of written text, e.g., re-usability of language resources in larger application scenarios, like the Semantic Web or decision-support systems. Here again, there are several aspects, e.g. documents, texts, sentences, phrases and terms, which can be easily retrieved and/or modified for re-use. This aspect is especially useful for CNL in industrial settings.

3. Classification of controlled natural language

This standard can be used for producing a CNL document, for converting pre-existing documents into CNL documents, and for rewriting or reproducing text based on existing text. For producing a CNL document, it can serve as a guideline for an author during the human writing process, or it can be used by a system which controls language and assists humans in writing texts.

We can classify “control” from several different viewpoints. The guidelines for CNL will be divided according to the users, such as (1) professional writers, (2) translators, (3) novices (casual users), and (4) machine translation systems.

According to the linguistic structure, there are several levels of CNL, such as (1) syntax, (2) terminology, and (3) document style. More precisely, there are:
    Morphological, lexical, syntax, semantics, vocabulary
    Character level vs. word level (language specific)
    Content reduction, sentence segmentation

We can also classify CNL based on its target domain of standards, such as (1) language, (2) user manual, and (3) know-how documents.

Additionally, we should think about “cross” aspects, such as language, domain and system (or application).

CNL can be used as a guideline for generation, such as narrative generation and CNL generated by computer. It also has an educational aspect, i.e. CNL for language learning purposes.

4. General principles of controlled natural language

4.1 Two viewpoints for CNL

There are at least two kinds of CNL from different viewpoints:

1) Principle for human comprehension
    1-1) Comprehension by human readers
    1-2) Re-usability by human writers
2) Principle for computational viewpoint
    2-1) Machine Translation: restricted language for MT input
    2-2) CNL for Information Retrieval: simple sentences that are easily retrieved, understandable input queries

4.2 Four Sets of Principles of CNL

Four sets of language-independent principles for validating controlled natural language are described in this clause:

1) Universal Principles
    1-1) Cost of whole process of an application
    1-2) Principle of complexity
2) Human-oriented Principles
    2-1) Comprehension
    2-2) Reusability (easy to edit)
3) Computer-oriented Principles
    3-1) Comprehension
    3-2) Reusability (easy to edit)
4) Object-oriented Principles
    4-1) Lexical Clarification
    4-2) Sentential Style

5. Related concepts

Simplified language is a language which is generated as the result of some procedure; simplification itself is therefore the objective of this concept, and simplified natural language has no aspect of multilinguality. CNL, on the other hand, represents some procedure which restricts some aspects of language phenomena. A CNL is not necessarily a simplified language, whereas a simplified language is a CNL.

Pre-editing by Forum Users: a Case Study

Pierrette Bouillon1, Liliana Gaspar2, Johanna Gerlach1, Victoria Porro1, Johann Roturier2
1Université de Genève FTI/TIM - 40 bvd Du Pont-d’Arve, CH-1211 Genève 4, Suisse
{Pierrette.Bouillon, Johanna.Gerlach, Victoria.Porro}@unige.ch
2Symantec Ltd.
Ballycoolin Business Park, Blanchardstown, Dublin 15, Ireland
{Liliana_Gaspar, Johann_Roturier}@symantec.com

Abstract
Previous studies have shown that pre-editing techniques can handle the extreme variability and uneven quality of user-generated content (UGC), improve its machine-translatability and reduce post-editing time. Nevertheless, it seems important to find out whether real users of online communities, which is the real-life scenario targeted by the ACCEPT project, are linguistically competent and willing to pre-edit their texts according to specific pre-editing rules. We report the findings from a user study with real French-speaking forum users who were asked to apply pre-editing rules to forum posts using a specific forum plugin. We analyse the interaction of users with pre-editing rules and evaluate the impact of the users' pre-edited versions on translation, as the ultimate goal of the ACCEPT project is to facilitate sharing of knowledge between different language communities.

Keywords: pre-editing, statistical machine translation, user-generated content, language communities

1. Introduction

Since the emergence of the web 2.0 paradigm, forums, blogs and social networks have been increasingly used by online communities to share technical information or to exchange problems and solutions to technical issues. User-generated content (UGC) now represents a large share of the informative content available on the web. However, the uneven quality of this content can hinder both readability and machine-translatability, thus preventing sharing of knowledge between language communities (Jiang et al, 2012; Roturier and Bensadoun, 2011).

The ACCEPT project (http://www.accept-project.eu/) aims at solving this issue by improving Statistical Machine Translation (SMT) of community content through minimally-intrusive pre-editing techniques, SMT improvement methods and post-editing strategies, thus allowing users to post questions or benefit from solutions on forums of other language communities. Within this project, the forums used are those of Symantec, one of the partners in the project. Pre-editing and post-editing are done using the technology of another project partner, the Acrolinx IQ engine (Bredenkamp et al, 2000). This rule-based engine uses a combination of NLP components and enables the development of declarative rules, which are written in a formalism similar to regular expressions, based on the syntactic tagging of the text.

Within the project, we used the Acrolinx engine to develop different types of pre-editing rules for French, specifically designed for the Symantec forums. Primarily, the aim of pre-editing in this context is to obtain a better translation quality in English without retraining the system with new data. In previous work, we have found that the application of these rules significantly improves MT output quality, where improvement was assessed through human comparative evaluation (Gerlach et al, 2013a; Seretan et al, to appear). Another study suggested that for specific phenomena, for example for the register mismatch between community content and training data, pre-editing produces comparable if not better results than retraining with new data (Rayner et al, 2012). Further work (Gerlach et al, 2013b) has shown that pre-editing rules that improve the output quality of SMT also have a positive impact on bilingual post-editing time, reducing it almost by half.

However, it is still unclear whether pre-editing can successfully be implemented in a forum, which is the real-life scenario targeted by the ACCEPT project. In the previous studies, the pre-editing rules were applied by native speakers with a translation background, i.e., with excellent language skills. In contrast, in the targeted scenario, the pre-editing task will have to be accomplished by the community members themselves. Although the task was simplified as much as possible for the forum users by integrating a checking tool in the forum interface, it still involves choosing among one or multiple suggestions, or even correcting the text manually, following instructions, when no reliable suggestions can be given. Applying these changes might prove difficult for users with varied linguistic knowledge, as it can involve quite complex modifications, for example restructuring a sentence to avoid a present participle. Another aspect to consider is the motivation of the users: if pre-editing requires too much time or effort, users will be less inclined to complete this step. Additionally, as users probably have little knowledge of the functioning of an SMT engine or the consequences of pre-editing, the importance of making certain changes to the source will not be obvious to them.

The aim of this study is therefore to ascertain whether light pre-editing rules which were developed using the Acrolinx formalism and which have proved to be useful for SMT can be applied successfully by forum users.

In the rest of the paper, Section 2 provides more details about the French Acrolinx pre-editing rules developed for the Symantec forums. Section 3 describes the experimental setup and provides details about the experiments conducted for evaluating the rules with forum users. In Section 4, we discuss the results obtained in these experiments and, finally, conclusions and directions for future work are provided in Section 5.

2. Pre-editing in ACCEPT

Pre-editing can take different forms: spelling and grammar checking; lexical normalisation (e.g. Han & Baldwin, 2011; Banerjee et al., 2012); Controlled Natural Language (CNL) (O’Brien, 2003; Kuhn, 2013); or reordering (e.g. Wang et al, 2007; Genzel, 2010). However, few pre-editing scenarios combine these different approaches. For partially historical reasons, CNL was mostly associated with rule-based machine translation (RBMT) (Pym, 1988; Bernth & Gdaniec, 2002; O’Brien & Roturier, 2007; Temnikova, 2011, etc.); one exception is Aikawa et al (2007). By contrast, spellchecking, normalisation and reordering were frequently used as pre-processing steps for SMT. In this work, the particularities of community content have led us to choose an eclectic approach. We developed rules of all the types mentioned above which meet the following criteria:

    The rules focus on specificities of community content that hinder SMT, namely informal and familiar style (not well covered by available training data), word confusion (related to homophones) and divergences between French and English.

    As we cannot reasonably ask forum users, whose main objective is obtaining or providing solutions to technical issues, to painstakingly study pre-editing guidelines, compliance with the rules must be checked automatically. Therefore rules must be implemented within a checking tool, in our case Acrolinx. This entails some restrictions, especially due to the nature of the Acrolinx formalism, which is for example not well suited to detecting non-local phenomena. On the positive side, it also means that rules are easily portable to other similar tools since they don’t require a lot of linguistic resources.

    Another condition for successful rule application by forum users is that suggestions are provided, since we cannot expect forum users to reformulate based only on linguistic instructions (such as “avoid the present participle”, “avoid direct questions”, “avoid long sentences”, etc). For this reason, common CNL rules like “avoid long sentences” were replaced by more specific rules, accompanied by an explanation which appears on a tooltip. A good example is the rule which replaces “, ce qui” by a full stop followed by a pronoun: “. Ceci” (see Figure 1).

    Raw:        N360 sauvegarde les fichiers en plusieurs répertoires, ce qui peut parait abscons, mais c'est correct.
    Pre-edited: N360 sauvegarde les fichiers en plusieurs répertoires. Ceci peut paraître abscons, mais c'est correct.

Figure 1. Example of pre-editing rule used to substitute traditional CNL rules like "avoid long sentences"

In the absence of forum post-edited data that would have allowed identification of badly translated phrases or phenomena, the rules were developed mainly using a corpus-oriented approach. Two specific resources proved to be particularly useful: the out-of-vocabulary (OOV) items, which are a good indicator of the data that is not covered in the training set (see Banerjee et al, 2012), and the list of frequent trigrams and bigrams present in the development data but absent from the training corpus.

Three sets of rules were developed, intended to be used in sequence. A first distinction is made between rules for humans (which also improve source quality) and rules for the machine (which can degrade it or change it considerably since the only aim is to improve MT output) (Hujisen, 1998). The rules for humans were split up into two sets, according to the pre-editing effort they require.

A first set (Set1) contains rules that can be applied automatically. This set includes rules that treat unambiguous cases and have unique suggestions. It contains rules for homophones, word confusion, tense confusion, elision and punctuation. While the precision of the rules included in this set is reasonably high, it is not perfect. The automatic application of this set therefore produces some errors that might be avoided if the rules were applied manually instead. Examples of rules contained in this set are given in Table 1.

    Rule:       Confusion of the homophones “sa” and “ça”
    Raw:        oups j'ai oublié, j'ai sa aussi.
    Pre-edited: oups j'ai oublié, j'ai ça aussi.

    Rule:       Missing or incorrect elision
    Raw:        Lancez Liveupdate et regardez si il y a un code d'erreur.
    Pre-edited: Lancez Liveupdate et regardez s'il y a un code d'erreur.

    Rule:       Missing hyphenation
    Raw:        Il est peut être infecté, ce qui serait bien dommage.
    Pre-edited: Il est peut-être infecté, ce qui serait bien dommage.

Table 1. Examples for Set1

A second set (Set2) contains rules that have to be applied manually as they have either multiple suggestions or no suggestions at all. The rules correct agreement (subject-verb, noun phrase, verb form) and style (cleft sentences, direct questions, use of present participle, incomplete negation, abbreviations), mainly related to informal/familiar language. The human intervention required to apply these rules can vary from a simple selection between two suggestions to manual changes, for example for checking a bad sequence of words. Examples of rules contained in this set are given in Table 2.

    Rule:       Avoid direct questions; Avoid abbreviations
    Raw:        Tu as lu le tuto sur le forum?
    Pre-edited: As-tu lu le tutoriel sur le forum?

    Rule:       Avoid the present participle
    Raw:        Certains jeux utilisant Internet ne fonctionnent plus.
    Pre-edited: Certains jeux qui utilisent Internet ne fonctionnent plus.

    Rule:       Avoid letters between brackets
    Raw:        Regarde le(s) barre(s) que tu as téléchargées et surtout le(s) site(s) web où tu les as récupérés.
    Pre-edited: Regarde les barres que tu as téléchargées et surtout les sites web où tu les as récupérés.

Table 2. Examples for Set2

Finally, the rules for the machine were grouped in a third set (Set3) that is applied automatically and will not be visible to end-users. These rules modify word order and frequent badly translated words or expressions to produce variants better suited to SMT. The rules developed in this framework are specific to the French-English combination and to the technical forum domain. Examples of rules contained in this set are given in Table 3.

    Rule:       Avoid informal 2nd person
    Raw:        J'ai apporté une modification dans le titre de ton sujet.
    Pre-edited: J'ai apporté une modification dans le titre de votre sujet

    Rule:       Replace pronoun by “ça”
    Raw:        Il est recommandé de la tester sur une machine dédiée.
    Pre-edited: Il est recommandé de tester ça sur une machine dédiée.

    Rule:       Avoid “merci de”
    Raw:        Merci de nous tenir au courant.
    Pre-edited: Veuillez nous tenir au courant.

Table 3. Examples for Set3

In ACCEPT, pre-editing is completed through the ACCEPT plugin directly in the Symantec forum. This plugin was developed using Acrolinx's technologies and specifically conceived to check compliance with the rules directly where content is created (ACCEPT Deliverable D5.2, 2013). The plugin “flags” potential errors or structures by underlining them in the text. Depending on the rule, when hovering with the mouse cursor over the underlined words or phrases, the user receives different feedback to help them apply the rule correction (Figure 2). For rules with suggestions, a contextual menu provides a list of potential replacements, which can be accepted with a mouse click. For rules without suggestions, a tool-tip comes up with the description of the error but no list of potential replacements is provided. Modifications then have to be made directly by editing the text. Besides these two main interactions, users can also choose to “learn words”, i.e. add a given token to the system so that it will not be flagged again, or “ignore rules”, i.e. completely deactivate a given rule. Both actions are stored within the user profile and remain active for all subsequent checking sessions. By means of a properties window, users can view learned words and ignored rules, which can be reverted at any time. Figure 2 shows the plugin in action.

Figure 2. ACCEPT pre-editing plugin used for this study

In this study, our aim is twofold. In a first step, we want to compare rule application by forum users and experts. In a second step, we wish to determine if it is preferable to have a semi-automatic, yet not entirely reliable process (where Set1 is applied automatically), or a manual process where all the rules from Set1 and Set2 are checked manually. This last approach will strongly depend on the motivation and skills of the users. These different scenarios (user vs expert, manual vs automatic) will be compared in terms of pre-editing activity (number of changes made in the source and the target) and in terms of the impact of changes on translation output. This impact will be evaluated using human comparative evaluation. In the next section, we will describe the experimental setup for the scenarios mentioned above.

3. Experimental Setup

3.1 Pre-editing

In order to compare the different pre-editing scenarios, we collected the following pre-edited versions of our corpus:

    UserSemiAuto: Rules from Set1 were applied automatically. Then, the corpus was submitted to the forum users, who applied the rules from Set2 manually using the ACCEPT plugin.

    UserAllManual: The raw corpus was submitted to the forum users, who applied the rules from Set1 and Set2 manually using the ACCEPT plugin. This version was produced at a one-week interval from UserSemiAuto.

    Expert: Rules from Set1 were applied automatically. Then, the corpus was submitted to a native French-speaking language professional, who applied the rules from Set2 manually.

    Oracle: This version is the result of manual post-processing of the Expert version by a native French speaker. All remaining grammar, punctuation and spelling issues were corrected. No style improvements were made in this step.
