Journal of Artificial General Intelligence - (-) -, 2015    Submitted 2015-05-24; Accepted 2015-11-19    DOI: -

From Distributional Semantics to Conceptual Spaces: A Novel Computational Method for Concept Creation

Stephen McGregor* [email protected]
Kat Agres* [email protected]
Matthew Purver [email protected]
Geraint A. Wiggins [email protected]
School of Electronic Engineering and Computer Science
Queen Mary University of London
Mile End Road, London E1 4NS, UK
*Primary authors contributing equally to this work.

Editor: -

Abstract

We investigate the relationship between lexical spaces and contextually-defined conceptual spaces, offering applications to creative concept discovery. We define a computational method for discovering members of concepts based on semantic spaces: starting with a standard distributional model derived from corpus co-occurrence statistics, we dynamically select characteristic dimensions associated with seed terms, and thus a subspace of terms defining the related concept. This approach performs as well as, and in some cases better than, leading distributional semantic models on a WordNet-based concept discovery task, while also providing a model of concepts as convex regions within a space with interpretable dimensions. In particular, it performs well on more specific, contextualized concepts; to investigate this we therefore move beyond WordNet to a set of human empirical studies, in which we compare output against human responses on a membership task for novel concepts. Finally, a separate panel of judges rate both model output and human responses, showing similar ratings in many cases, and some commonalities and divergences which reveal interesting issues for computational concept discovery.

Keywords: distributional semantics, conceptual spaces, computational creativity, concept discovery, behavioural validation

1. Introduction

This paper presents a computational model for the discovery of concepts which attempts to bridge the gap between standard lexical distributional semantics and conceptual spaces. Beginning with a general vector space model of word meaning derived from co-occurrence statistics, we use input query terms to select a contextualized sub-space corresponding to a concept, and discover members of the concept as vectors within that space. Our general objective is to investigate whether a model which situates words in a spatial relationship to one another can be mapped to likewise spatially calibrated models of concepts. Such a model may be thought of as a method for modelling and discovering meaningful conceptual relationships, a topic of interest to general artificial intelligence. The power of our approach lies in its ability to draw this connection between an arrangement of terms in a lexical space and a representation in a cognitive space: clusterings of words are discovered in a subspace informed by an input query, and this subspace suggests the structure, in terms of constituency, of a parallel conceptual region.

Vector space models of distributional semantics are currently a popular approach for quantifying similarity in computational linguistics, but many contemporary studies need grounding and external validation (Clark, 2015). Much of the work in this area compares model performance to semantic databases, but does not directly relate results to the cognitive performance of humans, or uses very restricted targets, such as word similarity judgments, rather than higher-level concepts.
On the other hand, cognitive scientists have seen vector space models as a suitable way to model concepts, capturing notions such as conceptual similarity, degree judgements, prototypical membership and the effects of context on these (Gärdenfors, 2000). Our aim here is to investigate the connection between the two, and in the process to elucidate how humans conceptualise creativity; we therefore examine our model and compare it to state-of-the-art approaches on tasks related to concept discovery, and include evaluation against human responses.

The model we propose is based on a standard approach to distributional lexical semantics, but differs from standard models in a few important aspects. First, we propose a contextualised method of dimension selection, rather than the standard approach of general dimensional reduction: we choose significant dimensions of the model based on the seed terms given (taken to name or define the target concept), and use these to outline a concept-specific subspace. By exploring this space we can suggest members of the concept, and we show that this provides accuracy at least as good as state-of-the-art distributional semantic models on a WordNet-based task, while also providing many of the properties of conceptual space models: the subspace is convex and relatively low-dimensional, and defined by interpretable characteristic dimensions. In particular, our approach performs better on concepts which are neither very general, common, abstract concepts nor tightly-defined scientific concepts, both of which are modelled well by standard approaches. We then move beyond the limitations of WordNet's taxonomy to investigate these more novel and unusual concepts via comparisons to human judgements.

The discovery and modeling of novel concepts is not a sufficient property for a model of computational creativity, but it is a necessary one. By seeking to implement a low level approach to the delineation of conceptual regions based on the geometry of a distributed semantic space, and one which views concepts as momentary and pragmatic phenomena which can emerge in a context without predefinition or preformulation, we hope to contribute to the understanding of methods which can perform this creative task. Furthermore, by investigating the use and limitations of lexical ontologies such as WordNet, and the supplementation thereof via human judgment studies, we hope to contribute to the understanding of suitable methods for evaluation of creative behaviour. Although many computational and AI systems aim to display creative behaviour or produce creative artefacts, the evaluation of computational creativity remains challenging and controversial, and in some cases, the issue of evaluation is not even explicitly addressed. This secondary aspect of the work, the potential for meta-analysis inherent in the question of whether our model's output will be useful for guiding an evaluative discussion of creative work elsewhere, is intended to give the work its own pragmatic grounding, in that this suggests a practical application for the creative output described in the following pages.

The organization of the paper is as follows: first we offer an overview of geometric models for concepts and words, together with the relation between them, positing that a crucial correspondence between the cognitive and linguistic domains can be found by situating them both spatially. This is followed by a general description of our methodology, which involves building a distributed semantic space and then projecting subspaces from this base space informed by the conceptually contextual information contained in an input query.
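To make this two-stage procedure concrete, the minimal sketch below selects, from a PPMI-weighted co-occurrence matrix, the dimensions on which a set of seed terms jointly carry the most weight, and then ranks candidate members within the resulting subspace. The function and parameter names here are hypothetical illustrations; the actual selection and projection procedure is specified in Section 4.

```python
import numpy as np

def conceptual_subspace(ppmi, vocab, seeds, n_dims=20, n_members=10):
    """Select characteristic dimensions for a set of seed terms from a
    PPMI-weighted word-by-context matrix, then rank candidate concept
    members within the resulting subspace. Names are illustrative."""
    idx = {w: i for i, w in enumerate(vocab)}
    seed_rows = ppmi[[idx[s] for s in seeds], :]
    # Characteristic dimensions: those on which every seed term scores
    # highly (largest minimum value across the seed rows).
    dims = np.argsort(seed_rows.min(axis=0))[-n_dims:]
    # Rank all words by the total weight they carry on those dimensions.
    scores = ppmi[:, dims].sum(axis=1)
    ranked = [vocab[i] for i in np.argsort(-scores) if vocab[i] not in seeds]
    return dims, ranked[:n_members]
```

On this sketch, the returned ranked list plays the role of suggested concept members; the weighting and normalisation choices actually used by the model are described in Section 4.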
Based on this methodology, we then evaluate our approach and compare it to existing general lexical approaches on a task of discovering conceptual members, deriving models from a large scale corpus (the English language edition of Wikipedia), and evaluating against a set of classes in WordNet. We go on to investigate more creative cases via two empirical studies comparing human conceptualisations with our model's output: first by comparing automatic outputs to human-generated equivalents, and second by asking human judges to blindly compare our model's output to human generated terms. For the majority of concepts, our model performs comparably to humans. We conclude with a summary of our results and a brief consideration of possible avenues for future work.

2. Concepts and Creativity

As Barsalou (1993) has pointed out, when concepts are discussed in formally theoretical contexts, they are almost always construed in terms of words: in fact, it is almost impossible to imagine performing an analysis of conceptualisation without immediately resorting to words. Yet concepts clearly do not supervene on the words that can be used to denote and perhaps to outline them; cognitive content is seemingly something other than language, and certainly something other than the day-to-day language documented in a lexicon. On the other hand, it is even less clear how exactly a computer could be used to model conceptualisation, and here we must surely resort to some sort of commerce in linguistic symbols. Words are, as such, the sticking point between abstract qualitative conceptual processes and the likewise abstract but also essentially quantifiable computational operations involved in modeling conceptualisation.

There is something not true to life, though, in constructing a conceptual model as a look-up table, however comprehensive, merely associating words with rules. Rather, concepts seem to come about in the process of a cognitive agent's dynamic interaction with an environment, and in this there is something fundamentally creative: the ability to react to an unpredictable world necessitates the ongoing production of novel, meaningful information structures. So when we talk about conceptual creativity, what we are referring to is an agent's ability to form useful representations of a situation in an immediate and context sensitive way. The question of modeling such a process then becomes a question of the nature of the representations themselves, and in particular a matter of how they achieve their contextual dynamism.

The final piece of the picture to be painted here is a theory of conceptualisation as something that unfolds in a fundamentally spatial, geometric milieu — it is in this correspondence between concepts and space that the answer to the problem of representational structure is to be found. This section addresses the theoretical problem of how world relevant, informationally productive representations can be modeled, beginning with the question of the relationship between concepts in the world and the geometric capacities of symbol manipulating machines.

2.1 Conceptual Creativity and Computers

Koestler (1964) has proposed that creativity can be understood in terms of a bisociative act, by which some new meshing of previously disparate patterns of conceptual schemata results in the discovery of a new perspective on the world. Koestler characterises patterns of thought and behaviour in terms of matrices, and he likewise describes creativity in terms of the discovery, through bisociation, of "new, auxiliary matrices" which allow the creator to overcome some obstacle in the world.
Extending this idea to span conceptual domains including all varieties of language use as well as non-propositional conceptualisation, Fauconnier and Turner (1998) have presented a theory of conceptual blending which holds that even the most quotidian of concepts are formed through the online interaction of mental spaces, "small conceptual packets constructed as we think and talk, for purposes of local understanding and action," (p. 137). Critically, these spaces are characterised by dynamics which determine the way that they can interact in the course of mappings projected across diffuse spaces.

These constructs which construe creativity in terms of discovery and spatial dynamism fit nicely with the classic approach to creative cognition of Boden (1990), who describes creativity in terms of state spaces that can be explored and transformed. When it comes to computers modeling the creative development of concepts, it is important that emphasis be placed on the mechanism by which the conceptualisation is performed — so Wiggins (2006) describes a framework for assessing computational creativity in terms of creative behaviour, with the focus being here on the way in which the system operates rather than merely the output of that operation. Implicit in this operational sensitivity is an awareness of what Boden has described as P- (personal, local) versus H- (historical, global) creativity, and likewise, in the terminology of Ritchie (2007), the inspiring set which serves as the foundation for a creative agent's activity, with the agent attempting to extend a base set of creative artefacts without simply imitating them.

In the case of the operation of the model which will be described here, which seeks to map concepts from clusters derived from operations on geometric word representations, the fact that, for instance, the concept DOG is more or less paradigmatically similar to CAT may be inherent in the structure of the underlying data, which is to say, in the relationship between words in a large scale corpus that discusses things like cats and dogs. That CAT is sometimes akin to PANTHER, or indeed in certain circumstances to JAZZ MUSICIAN or perhaps to other less likely things, is a somewhat more interesting distinction to make. It is ultimately the discovery both of the classifications, which may be to some degree inherent in the underlying data, and of the operational, representational context in which these classifications qualify, something that is fundamentally an aspect of the system's own representational dynamics, that draws out the creativity being modeled through a computational approach to linguistics. In our conceptually contextual language model, the things that are being associated are linguistic representations – words, but words cast as dynamic structures – and the process of association involves the ongoing development of previously uncharted constructions in an astronomically immense state space of projections.

The thrust of the argument here is that creative conceptualisation is something more than just an itemisation, more than just a rearranging of things into predefined categories. A creative conceptualisation involves the creation of a new way of associating things, and it is both the associations themselves and the novel criteria for making the associations which are submitted as a creative event. Creativity happens in the course of an agent's entanglements with the world: in an unpredictable environment, flexibility and immediacy are paramount, and so the cognitive state of an agent must be tightly coupled with the environment.
The process by which such an agent achieves goals and indeed survives must involve something more than merely indexing reactions from a list of possible scenarios, since, to the agent (as opposed to its programmer), such a list is not available: novel scenarios that arise must be understood, and their representations synthesised anew. Thus, the kind of conceptual creativity that we are, in a broad sense, trying to describe and demonstrate is very much at the core of cognition in general. In order to model this phenomenon of mind and environment computationally, it is necessary to build representations which interact with both input and each other dynamically.

2.2 Conceptual Spaces

Gärdenfors (2000) has developed a geometrical theory of conceptual spaces which focuses on the dimensionality of the spaces, and the way that a space's dimensions can correspond to perception of phenomena in the world. A crucial characteristic of such a space is that the regions of the space, which might be construed as conceptual entities, can be seen as interacting with one another by virtue of the lower level relationships between their defining dimensions. Among other things, Gärdenfors' model presents a basis for resolving the difficult relationship between low level stimuli in an environment, which become dimensions in conceptual spaces, and the higher level interactions of representations apparent in cognitive processes.

Gärdenfors sees concepts as corresponding to regions in such spaces, with individual entities which belong to those concepts (or events etc.) as points within those regions. This provides a view naturally suited to modelling not only membership judgements in this way, but similarity judgements (via distances between points), the existence of prototypical members of concepts (as more central points within regions) and degree judgements (via distances from central points). Furthermore, Gärdenfors takes it as a critical property of natural concepts that they be representable as convex regions: any point that lies between two members of a region — given the notion of betweenness defined by the conceptual space and its dimensions — must also be a member of that region. This plays into the intuition that there cannot be gaps in the dimensional substrate of the conceptual manifold: a color that, in terms of brightness, lies somewhere between light red and dark red is still a shade of red. Furthermore, Gärdenfors presents a notion of salience, a factor that mediates the significance of certain aspects in the interactions between concepts: in the course of conceptual meshing, the interactive dimensions are weighted with regard to the significance of their role in the entanglement.

As will be described in Section 4, the model proposed here is grounded in the theoretical stance that concepts and conceptualisation can be understood in terms of geometry. Like Gärdenfors, we seek to develop a model of conceptualisation which is inherently spatial, and one which is able to develop concepts in a contextually sensitive way by virtue of the construction of dynamic linguistic representations which continuously interact with input in an unfolding process of ongoing conceptualisation. To this end, we are modeling creative conceptualisation, because our methodology facilitates the ongoing reaction to information as it arises in an unpredictable environment.
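These geometric notions translate directly into elementary vector operations. The following toy sketch (ours, not Gärdenfors' formal apparatus) treats the prototype of a concept as the centroid of its known members, grades typicality by distance from that prototype, and tests Euclidean betweenness, the property that convexity guarantees will never exclude a point lying between two members.

```python
import numpy as np

def prototype(members):
    # The prototypical point of a concept: the centroid of its members.
    return np.mean(members, axis=0)

def typicality(point, members):
    # Degree judgement: smaller distance to the prototype = more typical.
    return -np.linalg.norm(np.asarray(point) - prototype(members))

def between(p, a, b):
    # Euclidean betweenness: p lies on the segment from a to b exactly
    # when |a - p| + |p - b| equals |a - b|.
    p, a, b = map(np.asarray, (p, a, b))
    return np.isclose(np.linalg.norm(a - p) + np.linalg.norm(p - b),
                      np.linalg.norm(a - b))
```

In a one-dimensional brightness space, for instance, any shade lying between light red and dark red passes the between test, and so on a convex account of RED must itself count as red.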
Our model, however, operates in the rarefied but readily computable domain of corpus linguistics: where Gärdenfors describes a profusion of dimensions, both continuous and discrete, spanning colours, sensations, size, shape, time, physiognomy, society, and various other things both concrete and abstract, our model is confined to the narrow world of words as linguistic symbols and the way they relate to one another across the breadth of a textual corpus. Nonetheless, in the modern tradition of distributional semantics (see Section 3), we present a methodology for using the information inherent in a large body of text to define spatial regions relating to concepts, where both the relationships between positions and the dimensions describing the positions themselves are loaded with information. These regions we define are convex, and are defined by giving particular weight to contextually important dimensions, following Gärdenfors' approach to salience as well as the dynamics described by Fauconnier and Turner (1998), desiderata fulfilled by the model's ability to contextually adjust the influence of dimensions on the ongoing generation of subspaces.

2.3 Words and Concepts in Context

The relationship between words and concepts is fraught, as Davidson (1974) points out: in the end, meaning, intention, and cognitive content seem to emerge in dynamic tension with one another, and it is therefore problematic to say that language in use can ever really stand for a deeper cognitive representation. On the other hand, language, and certainly natural language, can no more be the substrate of thought than it can sit on top of thought; it must be grasped as a thing in itself, pragmatic, often messy, and very much in the world. Putting words beside rather than above or below concepts, Clark (2006) interprets language as "a cognition-enhancing animal-built structure... a kind of self-constructed cognitive niche: a persisting but never stationary material scaffolding," (p. 370). So language can be seen as a thing in the world, a perceptible entity that, in the sense of affordance described by Gibson (1979), affords a linguistic agent the opportunity to do something conceptual. Words are not the same thing as the concepts which they serve to denote, nor are they the atomic substance of mental states which sustain concepts, but they are nonetheless instrumental in the unfolding phenomenon of conceptualisation in an environmental context.

Barsalou (1993) has defined "concepts" as "temporary constructions in working memory" (p. 34), and in particular he has examined what he identifies as the haphazard content of the inherently vague application of language to concepts. For Barsalou, concepts are specifically something other than feature lists, or mere enumerations of the things which perpetually constitute a concept. The prevalence of language in the representation of concepts assumes, though, that concepts are "stable structures in long term memory," (p. 44), and a consequence of this is a fundamental linguistic vagary in the relationship between concepts and words. Barsalou considers the example of how words are used to conceptualise BIRDS: diverse features such as feathers, nest, and eats worms are drawn into the conceptual framework as demanded by a particular cognitive context. These momentary, contextual linguistic constructs can serve to delineate, among other things, cultural fault lines, with Chinese language speakers evidently considering swan and peacock as prototypes of BIRD and therefore associating the concept with a feature of gracefulness that is not salient in other cultural contexts.
Barsalou describes this application of language to conceptualisation in terms of the construction of "ad hoc categories," (p. 32).

A similar vocabulary is employed by Carston (2010), who defines an ad hoc concept as one that "has to be inferentially derived on, and for, the particular occasion of use," (p. 158). Using language to communicate about concepts therefore involves a process of discovering without an excess of mental effort a cognitive scenario where the implications of an expression are satisfactorily coherent. And the key to conserving mental effort in the course of generating ad hoc concepts is to outsource cognitive work to the environment, which provides the context in which the mapping from language to concept works. Allott and Textor (2012) have a slightly different perspective, describing ad hoc concepts as being themselves individuated composites consisting of "activated information that is supposed to pertain to a category or property," (p. 201). Here, again, the critical element in the formation of concepts is that there is some sort of contextual process of engagement with a cognitive apparatus which results in the emergence of a new, situated information structure. For Carston, the concept is a consequence of a contextually sensitive implicature, where for Allott and Textor the concept is a compound structure composed of information activated by a situation, but in both cases, the essential element in the formation of ad hoc concepts is contextual entanglement.

The work presented in this paper seeks to demonstrate a practical computational implementation of the kind of dynamic linguistic representations that allow for creative, contextual construction of conceptual spaces. The chief requirement for such an implementation is to build a language model which is likewise dynamic and contextually sensitive. So the key issue at stake here is the nature of the representations used in the language model: they must be interactive both with one another and with the input fed to the model. Thus we seek to build a model that captures what Barsalou (1993) has described as the "open-ended recursion and context dependence of linguistic representations" (p. 48) that facilitate the trafficking of ad hoc concepts. The best hope of doing this computationally would seem to be to associate the words which denote concepts and their features with mathematically tractable, geometrically expressible information structures, and so we next turn to a consideration of statistical and network based language models.

3. Distributional Semantic Language Models

Where Gärdenfors (2000) has described conceptual spaces in terms of latent dimensions that correlate to stimuli in the world, research in computational linguistics, including that of Widdows (2004), has developed a similarly spatial view of semantics geared more explicitly towards the domain of language. Models which represent word meanings as vectors in a vector space have been shown to have several advantages over more traditional, symbolic approaches in expressing continuous properties such as similarity and relevance, while still being able to be extended (at least in theory) to more complex semantic phenomena. For example, Widdows shows how a geometric view of meaning can be used to construct a multi-dimensional language model which, in turn, can be used to emulate higher level cognitive operations such as logic; Grefenstette and Sadrzadeh (2011) and Socher et al. (2012) offer alternative ways to compositionally construct sentence representations from their constituent words.
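The basic operation underlying all such spatial models is a similarity measure over word-vectors, most commonly the cosine of the angle between them; the simplest compositional scheme just adds the constituent vectors. The sketch below illustrates both (a deliberately naive baseline, not the tensor-based method of Grefenstette and Sadrzadeh or the recursive networks of Socher et al.).

```python
import numpy as np

def cosine(u, v):
    # Angular similarity: 1.0 for parallel vectors, 0.0 for orthogonal.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def compose(*word_vectors):
    # Naive additive composition of a phrase from its constituent words.
    return np.sum(word_vectors, axis=0)
```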
Clark (2015) offers a comprehensive overview of the field; in the following few pages, a brief history of recent developments in this area of computational linguistics will be offered, with a particular focus on the dimensional situation of these spaces.

3.1 Word Counting and Matrix Factorisation

One set of approaches relies on directly observed lexical statistics. In this view, usually termed distributional semantics, the fundamental premise is that similar words appear in similar contexts (Harris, 1957), and that semantics can therefore somehow be represented in terms of the way in which words are arranged in relation to one another across a corpus.

Motivated by statistical techniques for document retrieval such as Latent Semantic Analysis (LSA, Deerwester et al., 1990), Schütze (1992) proposed a method for representing the semantic relationships between words involving the construction of vectors based on co-occurrence counts observed across relatively large-scale textual corpora: Schütze's model built vectors populated by statistics indicating the frequency with which words occurred in the context of other terms, and then applied a clustering algorithm in order to locate angular regions in the space corresponding to senses of ambiguous words. Similarly, Lund and Burgess (1996) used a word-counting technique to build a model geared towards clustering words into conceptually oriented categories. Due primarily to concerns with the data storage and handling requirements entailed by very high dimensional representations of words occurring in many different contexts, these authors employed dimensional reduction procedures to generate smaller, denser versions of their initial co-occurrence matrices. Schütze in particular used singular value decomposition, while Lund and Burgess picked dimensions with the highest variance, under the presumption that high variance corresponded to a high degree of informativeness across co-occurrences with a given context word. More recently, Rychlý and Kilgarriff (2007) use a range of heuristics to reduce processing complexity, but calculate full matrices to enable thesaurus creation.

Subsequent work involving statistical techniques for modeling meaning in terms of word co-occurrences across a large corpus has applied more complex mathematical approaches to representing the observed relationship between words. Blei, Ng, and Jordan (2003), for instance, presented a model which moves beyond mere word counting, seeking to model the "exchangeability" (p. 994) of words across different contexts in terms of probability distributions construed over a relatively small array of latent parameters. Expanding the scope of statistical language models in a different direction, Kanejiya, Kumar, and Prasad (2003) presented a model that applied singular value decomposition to matrices describing the co-occurrence of both the morphological forms of words and the syntactical dependencies between words, achieving marginal improvements over purely semantic models in correlation with human responses to a set of test questions. More recently, Turney and Pantel (2010) have suggested that, to the degree that a co-occurrence matrix might be construed as an incomplete register of the semantic possibilities inherent in a language, singular value decomposition can be viewed as "a way of simulating the missing text," (ibid, p. 160).

Notwithstanding the increasing sophistication of models, the theme which emerges across the relatively brief history of statistical text modeling (since at least LSA (Deerwester et al., 1990)) is one of dimensional reduction.
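A minimal version of the word-counting pipeline described in this subsection (sliding-window co-occurrence counts, an association weighting such as positive PMI, and a truncated SVD of the kind Schütze used) might look as follows; this is an illustrative reconstruction, as each of the cited systems makes its own weighting and reduction choices.

```python
import numpy as np
from collections import Counter

def cooccurrence(tokens, vocab, window=5):
    # Count how often each vocabulary word appears within `window`
    # tokens of each other vocabulary word.
    idx = {w: i for i, w in enumerate(vocab)}
    counts = Counter()
    for i, w in enumerate(tokens):
        if w not in idx:
            continue
        for c in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:
            if c in idx:
                counts[(idx[w], idx[c])] += 1
    M = np.zeros((len(vocab), len(vocab)))
    for (i, j), n in counts.items():
        M[i, j] = n
    return M

def ppmi(M):
    # Positive pointwise mutual information weighting of raw counts.
    total = M.sum()
    pw = M.sum(axis=1, keepdims=True) / total
    pc = M.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((M / total) / (pw * pc))
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)

def reduce_svd(M, k=100):
    # LSA-style truncated singular value decomposition.
    U, S, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] * S[:k]
```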
The generation of dense and condensed matrices was originally driven by the necessity of computational efficiency, but latter day approaches have actually embraced dimensional reduction, and matrix factorisation in particular, as a mechanism for enhancing model performance. Pennington, Socher, and Manning (2014), for instance, describe matrix factorisation as an important element of their GloVe model, and Yogatama et al. (2015) report state-of-the-art results on word similarity, sentence completion, and sentiment analysis using a sparse coding technique for building nuanced distributional semantic models. Indeed, despite the mixed results of dimensionality reduction reported in the comprehensive survey of distributional semantic vector spaces by Lapesa and Evert (2013), an enthusiasm for matrix factorisation is currently prevalent in the field, with one recent pair of authors going so far as to assert that singular value decomposition type reduction "entails the abstraction of meaning by collapsing similar contexts and discounting noisy and irrelevant ones, hence transforming the real world term-context space into a word-latent-concept space which achieves a much deeper and concrete semantic representation of words," (Hassan and Mihalcea, 2011).

This notion of abstracting away from a dimension that literally corresponds to a co-occurrence observation is key to all the factorisation techniques described here. In this regard, the techniques predicated on the tabulation of word counts that have just been described have a certain commonality with the arguably even more abstract word embedding approaches to vector space semantics, which will be described next.

3.2 Word Embeddings from Neural Networks

Around the same time as Blei, Ng, and Jordan (2003) were developing their nuanced statistical approach to topic modeling, Bengio et al. (2003) introduced an alternative methodology for modeling word meaning based on vector spaces built by multi-layer neural networks. The objective of this technique is similar to the statistically oriented work mentioned above: the construction of a space of word-vectors, where proximity corresponds to semantic information. The technique itself, however, is different, and this difference has been motivated by a stance on the very dimensionality of the space. To use the terminology of Bengio et al., the smooth redistribution of probability achieved by a neural network helps to overcome the curse of dimensionality, by which a small movement in a jagged space can lead to a catastrophic breakdown in the language model. This smooth distribution is achieved through the steadily annealed influence of the incrementally updated weights of a neural network, resulting in a space of word-vectors which have been described as word embeddings.

The early neural approaches to language modeling involved learning a function which mapped from input word-vectors to an output distribution assigning probabilities to the occurrence of a subsequent word, and in this sense were more in line with classic n-gram language models (Brown et al., 1992), though the underlying intuition was distributional, in that word-vectors for semantically similar words were expected to have similar features and therefore similar probabilities of occurring in a certain context. In subsequent work, Collobert and Weston (2008) have demonstrated a method for using a multi-layer network for assigning nuanced levels of features to words in a sentence, again relying on the premise that proximate vectors will be processed in similar ways by the network.
Huang et al. (2012) developed a network which takes both local and non-local contextual information about a word as input, building a space of disambiguated representations where cosine distance between two sense specific word-vectors is explicitly taken as a measure of similarity.

Building on this body of work, Mikolov, Yih, and Zweig (2013) presented their word2vec model, based on a neural network that learned a space of word-vectors based on the distributional semantic insight, although rather than deriving vectors directly from co-occurrence statistics, the network training infers vectors which predict co-occurrence. The resulting model has geometric features especially suited for capturing analogical relationships between words. In this space, simple linear algebraic operations between word-vectors can yield meaningful results, with the paradigmatic example from the original literature being the calculus by which $\overrightarrow{woman} - \overrightarrow{man} + \overrightarrow{king} = \overrightarrow{queen}$. Two different and similarly effective approaches to training the vector space are presented, one involving a network that predicts a word based on its context and another that predicts a context based on a word. In both cases, the network weights and the values of word-vectors are simultaneously updated through backpropagation as the model uses a sliding context window to process the corpus, much like with the word counting techniques described above. The result is a space loaded with semantic value: the actual geometry of the space yields significant information about the relationship between words, and this information might potentially be mapped onto conceptual schema.

With that said, all these neural language models essentially use dimensions as handles for gradually and systematically pulling word-vectors into a global arrangement which satisfies the semantic relationships observed between words, per the insight into contextual correlation of the distributional hypothesis. In fact, the random initialisation of matrices means that an entirely different space, albeit with similar relative relationships between word-vectors, will be established for any given run of a model on a particular corpus. The dimensions themselves are thus populated by arbitrary values that cannot be interpreted as corresponding to any co-occurrence event in the underlying corpus. In this regard, the neural network models, like models based on matrix factorisation, seek to exchange expressiveness on a dimensional scale for compactness and robustness on the scale of the entire model. Despite this similarity, the recent trend in computational linguistics has been away from strictly statistical models and towards word embeddings, as characterised by the conclusions drawn by Baroni, Dinu, and Kruszewski (2014).

3.3 Finding Dynamic Context in a Lexical Space

The interesting – perhaps even remarkable – feature of the networks developed by Mikolov, Yih, and Zweig (2013) is that complex non-linear models underlie surface spaces where simple arithmetical operations between vectors reveal semantically loaded relationships between words, a point which has been explored by Arora et al. (2015), who go on to propose a generative model that, they suggest, restores an element of interpretability to their word-vectors.
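The analogy arithmetic itself is straightforward to reproduce given any trained embedding: compute the offset vector and return its nearest neighbour by cosine similarity, excluding the query words themselves. A sketch, assuming a hypothetical dict `vectors` mapping words to numpy arrays (e.g. loaded from pre-trained word2vec vectors):

```python
import numpy as np

def analogy(vectors, a, b, c):
    """Solve 'a is to b as c is to ?': the nearest neighbour, by cosine
    similarity, of the offset vector b - a + c."""
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)
    best, best_sim = None, -np.inf
    for word, v in vectors.items():
        if word in (a, b, c):
            continue  # the query words themselves are excluded
        sim = np.dot(target, v) / np.linalg.norm(v)
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# e.g. analogy(vectors, "man", "woman", "king") should return "queen"
```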
Levy, Goldberg, and Dagan (2015) have staged something of a counterattack on the ascendancy of word embedding approaches to language modeling, distilling what they deem to be hyperparameters deployed in the cases of the most effective systems, systemic features such as context windows of varying length that can, in principle, be applied to any model built through the traversal of large scale corpora. Once these modular model components are taken into account, as Levy et al. would have it, the evident theoretical distinctions between the two general approaches to distributional semantics can be reduced to an array of testable technical considerations.

The upshot of this view is that there seem to be some grounds for considering both statistical and network based approaches to distributional semantics as building more or less the same kind of spaces through a variety of different techniques. Indeed, as discussed above, one of the main characteristics of the majority of distributional semantic models, regardless of how they are generated, is that their dimensions become abstractions that do nothing more than delineate a likewise abstract space. These are spaces just for the sake of being spaces, where any characteristic properties of the space itself are stripped out. In an early insight that is particularly prescient with regard to the work presented here, however, Schütze (1992) observed that, with regard to distributional semantic models, "different dimensions are important for different semantic distinctions and that all are potentially useful," (p. 794). Schütze's point was that there is valuable information in the raw data of pointwise co-occurrence probabilities inherent in the comportment of words across a corpus, and this information could possibly become a strength of a sophisticated language model. This is a point that the model presented here aims to take seriously.

To return to Gärdenfors' insight regarding the geometric nature of conceptualisation, it seems impossible to imagine how a space defined by, typically, dozens to hundreds of purely abstract dimensions could ever be construed as containing the kind of conceptual differentiation that is evidently inherent in the relationship between minds and the world. This is not to say that a number representing the likelihood of two words occurring in the same context should be construed in the same richly interrelated and elaborately differentiated way as Gärdenfors' highly structured, complexly delineated conceptual spaces. Nonetheless, a space of numbers that serve both as anchors for the meaningful positioning of points and also as indicators that can be independently associated with events, even events which are themselves relatively abstract, is somehow more in the world than an essentially indivisible space of mere positions. In particular, in an unreduced base lexical space