
Natural language processing of incident and accident reports

203 Pages·2016·5.05 MB·English


Natural language processing of incident and accident reports: application to risk management in civil aviation

Nikola Tulechki

To cite this version: Nikola Tulechki. Natural language processing of incident and accident reports: application to risk management in civil aviation. Linguistics. Université Toulouse le Mirail - Toulouse II, 2015. English. <NNT : 2015TOU20035>. <tel-01230079>

HAL Id: tel-01230079
https://tel.archives-ouvertes.fr/tel-01230079
Submitted on 17 Nov 2015

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

THESIS
Submitted for the degree of Doctorate of the Université de Toulouse
Presented and defended on 30 September 2015 by: Nikola TULECHKI

Natural language processing of incident and accident reports: application to risk management in civil aviation
(French title: Traitement automatique de rapports d'incidents et accidents: application à la gestion du risque dans l'aviation civile)

JURY
Patrice BELLOT, PR, LSIS, Marseille (Reviewer)
Yannick TOUSSAINT, CR HDR, INRIA, Nancy (Reviewer)
Cécile FABRE, PR, CLLE-ERSS, Toulouse (Examiner)
Ludovic TANGUY, MCF HDR, CLLE-ERSS, Toulouse (Supervisor)
Eric HERMANN, Director of CFH-SD, Toulouse (Invited member)

Doctoral school and specialty: CLESCO: Sciences du langage (Language Sciences)
Research unit: CLLE-ERSS (UMR 5263)
Thesis supervisor: Ludovic TANGUY
Reviewers: Patrice BELLOT and Yannick TOUSSAINT

Abstract

This thesis describes the applications of natural language processing (NLP) to industrial risk management.
We focus on the domain of civil aviation, where incident reporting and accident investigations produce vast amounts of information, mostly in the form of textual accounts of abnormal events, and where efficient access to the information contained in the reports is required. We start by drawing a panorama of the different types of data produced in this particular domain. We analyse the documents themselves, how they are stored and organised as well as how they are used within the community. We show that the current storage and organisation paradigms are not well adapted to the data analysis requirements, and we identify the problematic areas, for which NLP technologies are part of the solution.

Specifically addressing the needs of aviation safety professionals, two initial solutions are implemented: automatic classification for assisting in the coding of reports within existing taxonomies and a system based on textual similarity for exploring collections of reports.

Based on the observation of real-world tool usage and on user feedback, we propose different methods and approaches for processing incident and accident reports and comprehensively discuss how NLP can be applied within the safety information processing framework of a high-risk sector. By deploying and evaluating certain approaches, we show how elusive aspects related to the variability and multidimensionality of language can be addressed in a practical manner and we propose bottom-up methods for managing the overabundance of textual feedback data.

Acknowledgements

Much like rock climbing, writing a thesis is both a solitary endeavour and a team exercise. Making progress up the mountain is not possible without someone standing on terra firma and holding the other end of the rope. Completing a thesis is not possible without a strong, rich and varied supportive context. For that, before we begin, I would like to express my gratitude to all the people who have helped me in doing this research.
First and foremost I would like to thank Ludovic Tanguy, who has been my adviser not only for this thesis but for my entire life in academia. Starting from the very first introductions to the domain of natural language processing and to computer programming in 2007 right up to this very moment, Ludovic's advice, guidance and support have been invaluable. Whether presented with an outrageous "great" idea or with a last-minute crisis, Ludovic has always had a trick up his sleeve and with the fitting words has pointed me in the right direction. For all that and much more, thank you, Ludovic!

I would like to thank the members of the jury, Patrice Bellot and Yannick Toussaint for agreeing to write the detailed reports and Cécile Fabre for agreeing to be my examiner.

This thesis would not have been possible without the material support and the context provided by both CFH/Safety Data and the CLLE-ERSS linguistics research laboratory.

From CFH/Safety Data, I would like to thank Eric Hermann and Michel Mazeau for providing me with this opportunity and for believing in the potential of natural language processing in the context of risk management. I would also like to express my gratitude to them for introducing me to the fields of industrial ergonomics and human factors and thus providing the foundations for my understanding of the complexities of human work and interaction with technology. My respect also goes to the rest of the CFH/Safety Data team: Céline Raynal, Christophe Pimm, Vanessa Andréani, Marion Laignelet and Pamela Maury for constituting this exceptionally rich working environment.

From CLLE-ERSS I would like to thank all the past and present members, whose various inputs throughout both the time of this writing and my academic upbringing constitute the foundations of this research. A wonderful and colourful crowd, of which I feel honoured to be a part.
I would especially like to thank Assaf Urieli and Nicolas Ribeiro, without whose help and contributions whole sections of this thesis would not exist, as well as my good friend and colleague Aleksandar Kalev for his years-long professional and personal support.

My gratitude also goes to Marie-Paule Péry-Woodley for providing a much needed external perspective and helping me overcome the dreaded writer's block and Mai Ho-Dac for always finding a way to show me the bright side of academia.

For their invaluable input in helping me understand the intricacies of aviation and flying I would like to thank Grégory Caudy, Jerôme Rodriguez and especially Reinhard Menzel for patiently sharing his profound knowledge of aviation safety management.

For accepting subjects related to my work for their Masters projects and for their exemplary work, I thank Céline Barès, Joao Pedro Campello Rodriguez and Clement Thibert.

For their support and encouragement, I thank my fellow doctoral students Fanny Lalleman, François Morlane-Hondère and Caroline Atallah.

Finally, to all those whose love has made me wake up with a smile in the morning and to all those whose words have made me fall asleep with a peaceful mind. Thank you!

Contents

List of Figures
List of Tables
Introduction

1 Basics of accident modelling and risk management
  1.1 What is an accident?
    1.1.1 From normality to disaster
    1.1.2 A complicated definition
    1.1.3 Severity, frequency and visibility
    1.1.4 The basics of incident reporting
  1.2 Risk management in a complex system
    1.2.1 The descending flow: controlling the processes
    1.2.2 The ascending flow: information driven decision making
  1.3 Looking for patterns
  1.4 Chapter conclusion

2 Safety information in civil aviation: actors, models and data
  2.1 Producing occurrence data
    2.1.1 The actors
    2.1.2 Official accident investigations
      2.1.2.1 The process
      2.1.2.2 Report examples
      2.1.2.3 Acquiring the data
    2.1.3 Preliminary reports and accident briefs
      2.1.3.1 The process
      2.1.3.2 Report examples
      2.1.3.3 Acquiring the data
    2.1.4 Voluntary reporting programs
      2.1.4.1 The process
      2.1.4.2 Report examples
      2.1.4.3 Acquiring the data
    2.1.5 Safety management systems and mandatory reporting
      2.1.5.1 The process
      2.1.5.2 Report examples
      2.1.5.3 Acquiring the data
    2.1.6 Other sources of occurrence data
      2.1.6.1 Specialised data providers
      2.1.6.2 Acquiring the data
      2.1.6.3 Press
      2.1.6.4 Community efforts and user generated content
      2.1.6.5 Acquiring the data
    2.1.7 A typology of occurrence reports
      2.1.7.1 External categorisation
      2.1.7.2 Internal categorisation
  2.2 Storing and organising occurrence data
    2.2.1 The occurrence and its lifecycle
    2.2.2 Accident models, coded data and taxonomies
    2.2.3 Examples of metadata
      2.2.3.1 Simple factual information
      2.2.3.2 Standard descriptors of the accident sequence
      2.2.3.3 The ASRS coding schema
      2.2.3.4 SMS systems and the bow-tie model
      2.2.3.5 ECCAIRS and ADREP
    2.2.4 A typology of taxonomies
  2.3 Using occurrence data
    2.3.1 Querying the collection
    2.3.2 KPIs and statistics
    2.3.3 Intelligence and monitoring
  2.4 Issues when dealing with large collections of occurrence data
    2.4.1 Issues with natural language reports
    2.4.2 Issues with coded data and taxonomies
      2.4.2.1 Complex codification schemes
      2.4.2.2 Dynamic systems and static taxonomies
      2.4.2.3 Changing models and taxonomies
      2.4.2.4 Bottleneck effects
  2.5 Summary of the issues and NLP as a solution
  2.6 Chapter conclusion

3 NLP: domains of application
  3.1 Information retrieval
    3.1.1 Problem definition
      3.1.1.1 Information need and query formulation
      3.1.1.2 Models of IR for document processing
      3.1.1.3 Displaying the results
      3.1.1.4 IR performance
      3.1.1.5 An example of full text search problem
    3.1.2 Linguistic issues in IR
      3.1.2.1 Morphological variation
      3.1.2.2 Lexical variation
      3.1.2.3 Compositionality and semantic variation
      3.1.2.4 Discourse and document structure
    3.1.3 IR for occurrence data
      3.1.3.1 Precise information needs
      3.1.3.2 Undefined information needs
      3.1.3.3 Favouring recall
    3.1.4 A broader perspective on the IR problem definition
  3.2 Automatic text categorisation
    3.2.1 Problem definition
      3.2.1.1 The nature of the classification problem
      3.2.1.2 The choice of classifier
      3.2.1.3 Document representation
    3.2.2 Specifics of applying TC to occurrence data
    3.2.3 Automatic classification of occurrence categories: an example
      3.2.3.1 Context
      3.2.3.2 Corpus size and category distribution
      3.2.3.3 Classifier and classification problem
      3.2.3.4 Results
      3.2.3.5 Industrialisation
  3.3 Chapter conclusion

4 From text to vectors
  4.1 Extracting features
    4.1.1 Tokenising
    4.1.2 Levels of normalisation
    4.1.3 Overview of a processing chain
    4.1.4 Basic processing
    4.1.5 Word n-gram extractor
    4.1.6 Detector of developed acronyms
  4.2 Representing documents in vector space
    4.2.1 The term matrix
    4.2.2 Feature weighing
  4.3 Dimensionality reduction methods
    4.3.1 Smoothing the term matrix
    4.3.2 Explicit methods
    4.3.3 Intrinsic or extrinsic, hidden or explicit?
  4.4 Chapter conclusion

5 The timePlot system: detecting similar reports over time

