Department of Computer Science Series of Publications A Report A-2013-2 Probabilistic, Information-Theoretic Models for Etymological Alignment Hannes Wettig To be presented, with permission of the Faculty of Science of the University of Helsinki, for public criticism in Auditorium XIV of the University Main Building, on February 9th 2013 at 12 o’clock noon. University of Helsinki Finland Supervisor Petri Myllym¨aki and Roman Yangarber, University of Helsinki, Finland Pre-examiners Steven de Rooij, Centrum Wiskunde & Informatica (CWI), Amsterdam, Netherlands Timo Honkela, Aalto University School of Science, Finland Opponent Erik Aurell, Skolan f¨or dataveteskap och kommunikation (KTH), Sweden and Helsinki University of Technology (TKK), Finland Custos Petri Myllym¨aki University of Helsinki, Finland Contact information Department of Computer Science P.O. Box 68 (Gustaf H¨allstr¨omin katu 2b) FI-00014 University of Helsinki Finland Email address: [email protected].fi URL: http://www.cs.helsinki.fi/ Telephone: +358 9 1911, telefax: +358 9 191 51120 Copyright c 2013 Hannes Wettig (cid:13) ISSN 1238-8645 ISBN 978-952-8588-8 (paperback) ISBN 978-952-8589-5 (PDF) Computing Reviews (1998) Classification: G.3, H.1.1, I.2.6, I.2.7 Helsinki 2013 Helsinki University Print Probabilistic, Information-Theoretic Models for Etymological Alignment Hannes Wettig Department of Computer Science P.O. Box 68, FI-00014 University of Helsinki, Finland wettig@hiit.fi PhD Thesis, Series of Publications A, Report A-2013-2 Helsinki, January 2013, 202 pages ISSN 1238-8645 ISBN 978-952-8588-8 (paperback) ISBN 978-952-8589-5 (PDF) Abstract This thesis starts out by reviewing Bayesian reasoning and Bayesian net- work models. We present results related to discriminative learning of Bayesian network parameters. Along the way, we explicitly identify a num- ber of problems arising in Bayesian model class selection. This leads us to information theory and, more specifically, the minimum description length (MDL) principle. We look at its theoretic foundations and practical impli- cations. The MDL approach provides elegant solutions for the problem of modelclassselectionandenablesustoobjectivelycompareanysetofmod- els, regardless of their parametric structure. Finally, we apply these meth- ods to problems arising in computational etymology. We develop model familiesforthetaskofsound-by-soundalignmentacrosskindredlanguages. Fed with linguistic data in the form of cognate sets, our methods provide information about the correspondence of sounds, as well as the history and ancestral structure of a language family. As a running example we take the family of Uralic languages. Computing Reviews (1998) Categories and Subject Descriptors: G.3 [Mathematics of Computing] Probability and Statistics. H.1.1 [Information Systems] Models and Principles — Systems and Information Theory. I.2.6 [Computing Methodologies] Artificial Intelligence — Learning. iii iv I.2.7 [Computing Methodologies] Artificial Intelligence — Natural Language Processing. General Terms: Data Analysis, Probabilistic Modeling, Information Theory, Natural Language Processing. Additional Key Words and Phrases: Bayesian Networks, Logistic Regression, Minimum Description Length Principle, Kolmogorov Complexity, Normalized Maximum Likelihood, Etymology, Language Alignment, Phylogenetic Trees. Foreword v vi Contents Foreword v List of Included Publications 1 List of Abbreviations 3 1 Introduction 5 2 Bayesian Reasoning 13 2.1 Bayes’ Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2 Marginal Likelihood . . . . . . . . . . . . . . . . . . . . . . 17 2.3 Bayesian Network Models . . . . . . . . . . . . . . . . . . . 20 2.4 Bayesian Model Class Selection . . . . . . . . . . . . . . . . 26 2.5 Supervised Learning Tasks . . . . . . . . . . . . . . . . . . . 33 2.6 Discriminative Parameter Learning . . . . . . . . . . . . . . 36 2.7 Empirical Evaluation . . . . . . . . . . . . . . . . . . . . . . 43 2.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3 Information Theory 49 3.1 The Minimum Description Length Principle . . . . . . . . . 49 3.2 Two-Part Codes . . . . . . . . . . . . . . . . . . . . . . . . 55 3.3 Kolmogorov Complexity . . . . . . . . . . . . . . . . . . . . 61 3.4 Normalized Maximum Likelihood . . . . . . . . . . . . . . . 66 3.5 More Properties of the NML Distribution . . . . . . . . . . 70 3.6 Computing the NML Distribution . . . . . . . . . . . . . . 73 3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4 Etymology 77 4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4.2 The Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.3 The Alignment Problem . . . . . . . . . . . . . . . . . . . . 84 vii viii Contents 4.4 Baseline Model and Extensions . . . . . . . . . . . . . . . . 87 4.4.1 Baseline Model . . . . . . . . . . . . . . . . . . . . . 87 4.4.2 Learning Procedure . . . . . . . . . . . . . . . . . . 88 4.4.3 Sanity Checking . . . . . . . . . . . . . . . . . . . . 90 4.4.4 NML . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 4.4.5 Codebook . . . . . . . . . . . . . . . . . . . . . . . . 91 4.4.6 Distinguishing Between Kinds of Events . . . . . . . 93 4.4.7 Multiple Sound Alignment . . . . . . . . . . . . . . . 93 4.4.8 Separate Encoding of Affixes . . . . . . . . . . . . . 95 4.4.9 Multilingual Alignment . . . . . . . . . . . . . . . . 97 4.5 Featurewise Context Modeling . . . . . . . . . . . . . . . . 99 4.5.1 Larger Context . . . . . . . . . . . . . . . . . . . . . 99 4.5.2 Phonetic Features . . . . . . . . . . . . . . . . . . . 99 4.5.3 Context Trees . . . . . . . . . . . . . . . . . . . . . . 100 4.5.4 Codelength . . . . . . . . . . . . . . . . . . . . . . . 102 4.5.5 Learning . . . . . . . . . . . . . . . . . . . . . . . . . 103 4.5.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . 104 4.5.7 Exploiting Monolingual Rules . . . . . . . . . . . . . 105 4.6 Imputation . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 4.7 Phylogenetic Language Trees . . . . . . . . . . . . . . . . . 109 5 Summary and Current Work 113 References 119 Original Publications 131 Summary and Contributions . . . . . . . . . . . . . . . . . . . . 131 Publication I: When Discriminative Learning of Bayesian Net- work Parameters is Easy . . . . . . . . . . . . . . . . . . . 135 Publication II: Calculating the Normalized Maximum Likelihood Distribution for Bayesian Forests . . . . . . . . . . . . . . . 143 Publication III: Probabilistic Models for Alignment of Etymologi- cal Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 Publication IV: MDL-based Models for Alignment of Etymological Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 Publication V: Using context and phonetic features in models of etymological sound change . . . . . . . . . . . . . . . . . . . 175 Publication VI: Information-Theoretic Methods for Analysis and Inference in Etymology . . . . . . . . . . . . . . . . . . . . . 187 List of Included Publications Publication I H.Wettig, P. Gru¨nwald, T.Roos, P. Myllym¨aki, H.Tirri: When Discriminative Learning of Bayesian Network Parameters Is Easy 18th International Joint Conference on Artificial Intelligence (IJCAI ’03). Publication II H. Wettig, P. Kontkanen and P. Myllyma¨ki: CalculatingtheNormalizedMaximumLikelihoodDistributionforBayesian Forests IADIS International Journal on Computer Science and Information Sys- tems 2 (2007). Publication III H. Wettig and R. Yangarber: Probabilistic Models for Aligning Etymological Data 18thNordicConferenceofComputationalLinguistics(NODALIDA2011). Publication IV H. Wettig, S. Hiltunen and R. Yangarber: MDL-based Models for Aligning Etymological Data Conf. on Recent Advances in Natural Language Processing (RANLP 2011). Publication V H. Wettig, K. Reshetnikov and R. Yangarber: Usingcontextandphoneticfeaturesinmodelsofetymologicalsoundchange EACL Joint Workshop of LINGVIS & UNCLH 2012. Publication VI H. Wettig, J. Nouri, K. Reshetnikov and R. Yangarber Information-Theoretic Methods for Analysis and Inference in Etymology Fifth Workshop on Information Theoretic Methods in Science and Engi- neering (WITMSE 2012). For detailed references, summaries and contributions of the current author see pages 131–133. 2
Description: