Community Question Answering Platforms vs. Twitter for Predicting Characteristics of Urban Neighbourhoods

Marzieh Saeidi, Licia Capra, Alessandro Venerandi, Sebastian Riedel
University College London

arXiv:1701.04653v1 [cs.CL] 17 Jan 2017

Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

In this paper, we investigate whether text from a Community Question Answering (QA) platform can be used to predict and describe real-world attributes. We experiment with predicting a wide range of 62 demographic attributes for neighbourhoods of London. We use the text from the QA platform of Yahoo! Answers and compare our results to the ones obtained from Twitter microblogs. Outcomes show that the correlation between the demographic attributes predicted using text from Yahoo! Answers discussions and the observed demographic attributes can reach an average Pearson correlation coefficient of ρ = 0.54, slightly higher than the predictions obtained using Twitter data. Our qualitative analysis indicates that there is semantic relatedness between the highest correlated terms extracted from both datasets and their relative demographic attributes. Furthermore, the correlations highlight the different natures of the information contained in Yahoo! Answers and Twitter. While the former seems to offer more encyclopedic content, the latter provides information related to current sociocultural aspects.

Introduction

Recent years have seen a huge boom in the number of different social media platforms available to users. People are increasingly using these platforms to voice their opinions or let others know about their whereabouts and activities. Each of these platforms has its own characteristics and is used for different purposes. The availability of a huge amount of data from many social media platforms has inspired researchers to study the relation between the data generated through the use of these platforms and real-world attributes.

Many recent studies in this field are particularly inspired by the availability of text-based social media platforms such as blogs and Twitter. Text from Twitter microblogs, in particular, has been widely used as a data source to make predictions in many domains. For example, box-office revenues are predicted using text from Twitter (Asur, Huberman, and others 2010). Twitter data has also been used to find correlations between the mood stated in tweets and the value of the Dow Jones Industrial Average (DJIA) (Bollen, Mao, and Zeng 2011).

Predicting demographics of individual users using their language on social media platforms, especially Twitter, has been the focus of many research works: text from blogs and on-line forum posts is utilised to predict users' age through the analysis of linguistic features. Results show that the age of users can be predicted, with the predicted and observed values reaching a Pearson correlation coefficient of almost 0.7. Sociolinguistic associations using geo-tagged Twitter data have been discovered (Eisenstein, Smith, and Xing 2011) and the results indicate that demographic information of users such as first language, race, and ethnicity can be predicted by using text from Twitter with a correlation up to 0.3. Other research shows that users' income can also be predicted using tweets with a good prediction accuracy (Preoţiuc-Pietro, Lampos, and Aletras 2015). Text from Twitter microblogs has also been used to discover the relation between the language of users and the deprivation index of neighbourhoods. The collective sentiment extracted from the tweets of users has been shown (Quercia et al. 2012) to have a significant correlation (0.35) with the deprivation index of the communities the users belong to.

Data generated on QA platforms have not been used in the past for predicting real-world attributes. Most research work that utilises QA data aims to increase the performance of such platforms in analysing question quality (Li et al. 2012), predicting the best answers (Liu, Liu, and Yang 2010; Tian et al. 2013) or the best responder (Zhao et al. 2012).

In this paper, we use the text from the discussions on the QA platform of Yahoo! Answers about neighbourhoods of London to show that QA text can be used to predict the demographic attributes of the population of those neighbourhoods. We compare the performance of Yahoo! Answers data to the performance of data from Twitter, a platform that has been widely used for predicting many real-world attributes. Unlike many current works that focus on predicting one or a few selected attributes (e.g. deprivation, race or income) using social media data, we study a wide range of 62 demographic attributes. Furthermore, we test whether terms extracted from both Yahoo! Answers and Twitter are semantically related to these attributes and provide examples of sociocultural profiles of neighbourhoods through the interpretation of the coefficients of the predictive models.

The contributions of this paper can be summarised as follows:

• We show that text from QA discussions can be used to predict real-world attributes such as demographic attributes of the population of neighbourhoods with a performance comparable to Twitter data.

• Our analysis highlights the differences between data from a QA platform and Twitter: while QA data offers more encyclopedic content, the latter provides information related to current sociocultural aspects.

Datasets

Yahoo! Answers. Community QA platforms help users to obtain information from a community: a user can post questions which may then be answered by other users. Discussions can be in depth and in length but the interactions are not spontaneous. The time frame in which users take part in a discussion thread can vary from one day to several days or even to months. Moreover, QA platforms, unlike Twitter and some of the other social media platforms, are not location-based. Yahoo! Answers is one of the few QA platforms that have emerged in the past decade. Discussions in Yahoo! Answers are not domain specific and can cover a broad range of topics.

Twitter. Twitter is an on-line microblogging social network where users can post and read short 140-character messages. Twitter is used mostly to share views, opinions and news in real-time. Unlike QA platforms, Twitter is ubiquitous. While some users use Twitter to get updates on the news and their social circle, many others use it as part of their daily routine to talk about their thoughts, whereabouts, activities or sometimes just to share what is going on in their lives. This may be because of the strong tie that exists between Twitter and smartphones, which are nowadays at the centre of many people's lives. For all these reasons, a huge amount of data is constantly being created on this platform. Additionally, Twitter is a location-based platform where people can tag their locations while blogging their tweets.

Yahoo! Answers vs. Twitter. When it comes to neighbourhoods, Yahoo! Answers has been used by many users to ask or to answer questions about different characteristics of many neighbourhoods. While people may not use Twitter in the same way, they may log their tweets while being in different neighbourhoods. In this paper, we investigate the extent to which the discussions on the Yahoo! Answers platform about neighbourhoods and the microblogs that are logged from different neighbourhoods can reflect characteristics of those neighbourhoods. Table 1 shows examples of Yahoo! Answers discussions that contain the names of London neighbourhoods and the term "Jewish". The table also shows examples of tweets that have been blogged from London neighbourhoods (neighbourhood's name in brackets) and contain the term "Jewish". These examples show some of the differences between the discussions that can be found on Yahoo! Answers and on Twitter. As we can see, the QA discussions are focused on a topic, i.e. Jewish neighbourhoods in London. The answers provide explicit information on neighbourhoods that have a high Jewish population. On the other hand, microblogs of Twitter do not focus on providing explicit information about the neighbourhoods. Instead, they contain information on users' activities or observations (e.g. Jewish Museum, Jewish food) while in different locations. These activities can implicitly indicate that Jewish communities inhabit the neighbourhoods that the user is blogging from.

Table 1: Examples of Yahoo! Answers discussions and Twitter microblogs about neighbourhoods that contain the term "Jewish".

Yahoo! Answers
Q: Where can i find a jewish shop in london?
A: The main Jewish Communities in London are Stamford Hill and Golders Green, plus Hendon and Edgeware. All have many Kosher and Judaica stores on their high streets.
Q: Jewish neighborhoods in London?
A: The largest is in Gants Hill. They are predominantly Reformist Jews. Then you have the largest Hasidic Jewish Community in Europe in Stamford Hill. Then there is a large Orthodox Jewish Community in Hendon, and around 14% of Swiss Cottage is Jewish.

Twitter
- Meanwhile in Camden. @JewishMuseumLondon [tweeted from Camden]
- Challah makes me happy. Braided, proofed and egg washed #Shabbatshalom #dough #sesame #Jewishfood... [tweeted from East Finchley]
- Jewish crouton crack. For when you just need that boost #osem #mondaymorning #whoneedschickensoup [tweeted from Golders Green]

Population Demographic Data. Population demographic data is taken from the UK census provided by the Office for National Statistics [1]. Census surveys in the UK are repeated every 10 years and were last conducted in 2011. Census data is provided for specific geographical units that are created solely for the purpose of census data collection. These are called Lower Layer Super Output Areas (LSOAs) and are identified through an alphanumeric ID. Greater London is divided into 4,835 LSOAs. These are not necessarily equal in size as they have been designed to have a population of around 1,500. Census data provides statistical information on a wide range of categories such as the average house price in an LSOA, population count (or percent) of a religion or an ethnic background, income level, etc. Each category can be subdivided into further attributes. For instance, the category religion contains the attributes Muslim, Christian, Jewish, Hindu, etc.

[1] http://www.ons.gov.uk/

Method

Spatial Unit of Analysis

The spatial unit of analysis chosen for this work is the neighbourhood. This is identified with a unique name (e.g., Camden) and people normally use this name in QA discussions to refer to specific neighbourhoods. A list of neighbourhoods for London is extracted from the GeoNames gazetteer [2], a dataset containing names of geographic places including place names. For each neighbourhood, GeoNames provides its name and a set of geographic coordinates (i.e., latitude and longitude) which roughly represent its centre. Note that geographical boundaries are not provided. GeoNames contains 589 neighbourhoods that fall within the boundaries of the Greater London metropolitan area. In the remainder of the paper, we use the terms "neighbourhood" or "area" to refer to our spatial unit of analysis.

[2] http://www.geonames.org/

Pre-processing, Filtering, and Spatial Aggregation

Yahoo! Answers Data. We collect questions and answers (QAs) from Yahoo! Answers using its public API [3]. For each neighbourhood, the query consists of the name of the neighbourhood together with the keywords "London" and "area". This is to prevent obtaining irrelevant QAs for ambiguous entity names such as Victoria. For each neighbourhood, we then take all the QAs that are returned by the API. Each QA consists of a title and a content which is an elaboration on the title. This is followed by a number of answers. In total, we collect 12,947 QAs across all London neighbourhoods. These QAs span the last 5 years. It is common for users to discuss characteristics of several neighbourhoods in the same QA thread. This means that the same QA can be assigned to more than one neighbourhood. Figure 1 shows the histogram of the number of QAs for each neighbourhood. As the figure shows, the majority of areas have fewer than 100 QAs, with some areas having fewer than 10. Only a few areas have over 100 QAs.

[3] https://developer.yahoo.com/answers/

Figure 1: Histogram of the number of QAs per each London neighbourhood (x-axis: number of QAs; y-axis: number of areas).

For each neighbourhood, we create one single document by combining all the QA discussions that have been retrieved using the name of that neighbourhood. This document may or may not contain names of other neighbourhoods. We split each document into sentences and remove those neighbourhoods containing fewer than 40 sentences.

We then remove URLs from each document. The document is then converted to tokens and stop words are removed. All the tokens in all the documents are then stemmed. The goal of stemming is to reduce the different grammatical forms of a word to a common base form. Stemming is a special case of text normalisation. For example, a stemmer will transform the word "presumably" to "presum" and "provision" to "provis". To keep the most frequent words, we remove any token that has appeared less than 5 times in less than 5 unique QAs. This leaves us with 8k distinct tokens.

Twitter Data. To collect data from Twitter, we use the geographical bounding box of London, defined by the northwest and southeast points of the Greater London region. We then use this bounding box to obtain the tweets that are geo-tagged and are created within this box through the official Twitter API [4]. We stream Twitter data for 6 months between December 2015 and July 2016. At the end, we have around 2,000,000 tweets in our dataset.

[4] https://dev.twitter.com/streaming/

To assign tweets to different neighbourhoods, for each tweet, we calculate the distance between the location that it was blogged from and the centre points of all the neighbourhoods in our dataset. Note that the centre point for each neighbourhood is provided in the gazetteer. We then assign the tweet to the closest neighbourhood that is not further than 1 km from the tweet's geolocation. At the end of this process, we have a collection of tweets for each neighbourhood and we combine them to create a single document. Figure 2 shows the number of tweets per neighbourhood. As we can see, the majority of neighbourhoods have fewer than 1000 tweets.

Figure 2: Histogram of the number of tweets per each London neighbourhood (x-axis: number of tweets; y-axis: number of areas).

We remove all the target words (words starting with @) from the documents. The pre-processing is then similar to that of the QA documents. At the end of this process, we obtain 17k distinct frequent tokens for the Twitter corpus.

Population Demographic Data. As we previously explained, each attribute in census data is assigned to spatial units called LSOAs. However, these units do not geographically match our units of analysis, which are the neighbourhoods defined through the gazetteer. A map showing the spatial mismatch is presented in Figure 3. To aggregate the data contained in the LSOAs at the neighbourhood level, we use the following approach.

Often, when people talk about a neighbourhood, they refer to the area around its centre point. Therefore, the information provided for neighbourhoods in QA discussions should be closely related to this geographic point. To keep this level of local information, for each demographic attribute, we assign only the values of the nearby LSOAs to the respective neighbourhood. To do this, we calculate the distance between each neighbourhood and all the LSOAs in London. The distance is calculated between the coordinates of a neighbourhood and the coordinates of each LSOA's centroid. For each neighbourhood, we then select the 10 closest LSOAs that are not further than one kilometre away. The value of each demographic attribute for each neighbourhood is then computed by averaging the values associated with the LSOAs assigned to it. We apply this procedure to all the demographic attributes.

Figure 3: The geographic relation between LSOAs and neighbourhoods identified through the gazetteer. LSOAs are geographical shapes and their centroids are marked with dark dots. Neighbourhoods are marked with green circles.

Document Representation

A very popular method for representing a document using its words is the tf-idf approach (Salton, Fox, and Wu 1983). Tf-idf is short for term frequency-inverse document frequency, where tf indicates the frequency of a term in the document and idf is a function of the number of documents that a term has appeared in. In a tf-idf representation, the order of the words in the document is not preserved. For each term in a document, the tf-idf value is calculated as below:

\[
\text{tf-idf}(d,t) = \frac{\text{tf}(d,t)}{\log\left(\dfrac{\text{Total number of documents}}{\text{Number of documents containing the term } t}\right)} \tag{1}
\]

To discount the bias for areas that have a high number of QAs or tweets, we normalise tf values by the length of each document as below. The length of a document is defined by the number of its tokens (non-distinctive words).

\[
\text{Normalised tf}(d,t) = \frac{\text{Frequency of term } t \text{ in document } d}{\text{Number of tokens in document } d} \tag{2}
\]

Correlation

To investigate the extent to which the text obtained from the two platforms of Yahoo! Answers and Twitter reflects the true attributes of neighbourhoods, we first study whether there are significant, strong and meaningful correlations between the terms present in each corpus and the many neighbourhood attributes through the Pearson correlation coefficient ρ. For each term in each corpus, we calculate the correlation between the term and all the selected demographic attributes. To do so, for each term, we define a vector with the dimension of the number of neighbourhoods. The value of each cell in this vector represents the normalised tf-idf value of the term for the corresponding neighbourhood. For each demographic attribute, we also define a vector with the dimension of the number of neighbourhoods. Each cell represents the value for the demographic attribute of the corresponding neighbourhood. We then calculate the Pearson correlation coefficient (ρ) between these two vectors to measure the strength of the association between each term and each attribute.

Since we perform many correlation tests simultaneously, we need to correct the significance values (p-values) for multiple testing. We do so by implementing the Bonferroni correction, a multiple-comparison p-value correction, which is used when several dependent or independent statistical tests are being performed simultaneously. The Bonferroni adjustment ensures an upper bound for the probability of having an erroneous significant result among all the tests. All the p-values shown in this paper are adjusted through the use of the Bonferroni correction.

Prediction

We investigate how well the demographic attributes can be predicted by using Yahoo! Answers and Twitter data. We define the task of predicting a continuous-valued demographic attribute for unseen neighbourhoods as a regression task given their normalised tf-idf document representation. A separate regression task is defined for each demographic attribute. We choose linear regression for the prediction tasks as it has been widely used for predictions from text in the literature (Foster and Stine; Joshi et al. 2010).

Due to the high number of features (size of vocabulary) and a small number of training points, over-fitting can occur. To avoid this issue, we use elastic net regularisation, a technique that combines the regularisation of the ridge and lasso regressions. The parameters θ are estimated by minimising the following loss function. Here, y_i is the value of an attribute for the i-th neighbourhood, vector x_i is its document representation and N is the number of neighbourhoods in the training set.

\[
L = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - x_i^{T}\theta\right)^2 + \lambda_1\lVert\theta\rVert + \lambda_2\lVert\theta\rVert^2 \tag{3}
\]

Evaluation. To measure the performance of a regression model, residual-based methods such as mean squared error are commonly used. Ranking metrics such as the Pearson correlation coefficient have also been used in the literature (Preoţiuc-Pietro et al. 2015; Eisenstein, Smith, and Xing 2011). Using a ranking measure has some advantages compared to a residual-based measure. First, ranking evaluation is more robust against extreme outliers compared to an additive residual-based evaluation measure. Second, ranking metrics are more interpretable than measures such as mean squared error (Rosset, Perlich, and Zadrozny 2005). We thus use this method for evaluating the performance of the regression models in this work.

As a further performance check, we apply 10-fold cross-validation to each regression task. In each fold, we use 75% of the data for training and the remaining 25% for validation. At the end, we report the average performance over all folds together with the standard deviation. For each demographic attribute, i.e. target value, training and validation sets are sampled using stratified sampling. This is a method of sampling from a population when sub-populations within this population vary. For instance, in London, there are areas with very high or very low deprivation. In these cases, it is advantageous to sample each sub-population independently and proportionally to its size.

Results

Note. There are many attributes across several categories in the census data. Because of space limitations, we conduct most of our experiments on a selected set of attributes. These attributes are taken from religion (population of Jewish%, population of Muslim%, population of Hindu%, population of Buddhist%), ethnicity (population of Black%, population of White%, and population of Asian% ethnicity), Price (average house prices) and deprivation (Index of Multiple Deprivation, IMD [5]). For a full breakdown of results please refer to the Appendix.

[5] https://en.wikipedia.org/wiki/Multiple_deprivation_index

Correlation

Number of Correlated Terms. The numbers of significantly correlated terms from both Yahoo! Answers and Twitter with the selected demographic attributes are shown in Table 2. Note that the number of unique (frequent) words in Twitter (17k) is almost twice that in Yahoo! Answers (8k). The first column shows a demographic attribute and the second column indicates the source, i.e. Yahoo! Answers (Y!A for short) or Twitter. The third column ("All") shows the total number of terms that have a significant correlation with each attribute (p-value < 0.01). The following columns show the number of terms that have a significant correlation with the attribute with a ρ in the given ranges. The last column shows the number of terms that are significantly correlated with the attribute with a negative ρ. The data source that has the highest number of correlated terms with each attribute is highlighted in bold.

As the table shows, terms extracted from Yahoo! Answers tend to be more related, in terms of the number of correlated terms, to attributes related to religion or ethnicity compared to terms from Twitter. However, for two particular attributes (i.e., Price and Buddhist), the number of correlated terms from Twitter is higher than the number from Yahoo! Answers. These results collectively suggest that there is a wealth of terms, both in Yahoo! Answers and Twitter, which can be used to predict the population demographics.

Table 2: Number of significantly correlated terms (p-value < 0.01) from both Yahoo! Answers ("Y!A") and Twitter.

Attribute   Source    All    >0.4   [0.3,0.4]   [0.2,0.3]   <0
IMD         Y!A       115       1          48          66    0
IMD         Twitter    17       0          10           7    0
Price       Y!A        50       2          36          12    0
Price       Twitter  1120     312         533         275    0
Jewish%     Y!A        48       7          31          10    0
Jewish%     Twitter     6       0           5           1    0
Muslim%     Y!A        87       0          59          28    0
Muslim%     Twitter    13       1           8           4    0
Hindu%      Y!A         8       2           3           3    0
Hindu%      Twitter     5       0           3           2    0
Buddhist%   Y!A         1       0           1           0    0
Buddhist%   Twitter   934      18         728         188    0
Black%      Y!A       114       4          59          51    0
Black%      Twitter     2       0           2           0    0
White%      Y!A         8       0           0           0    8
White%      Twitter     0       0           0           0    0
Asian%      Y!A         6       0           3           3    0
Asian%      Twitter     1       0           1           0    0

Semantic Relatedness. In this section, we observe whether the correlations between terms and attributes are semantically meaningful. Due to the limited space, we select three attributes and their relative top correlated terms extracted from Yahoo! Answers (Table 3) and Twitter (Table 4). We choose the attributes Price and IMD as they show the highest number of correlated terms for both sources. For each source, we then choose one more attribute that has the highest number of strongly correlated terms (ρ > 0.4), i.e. Jewish% for Yahoo! Answers and Buddhist% for Twitter.

Table 3: Terms from Yahoo! Answers with the highest Pearson correlation coefficients for the selected demographic attributes. Correlations are statistically significant (p-value < 0.001).

Jewish%             (High) Price           Deprivation
Term        ρ       Term           ρ       Term        ρ
matzo       0.45    townhouse      0.4     hurt        0.4
harmony     0.45    fortune        0.39    poverty     0.36
jewish      0.41    qatar          0.39    drug        0.36
jew         0.41    diplomat       0.39    cockney     0.35
unfairly    0.42    exclusive      0.37    victim      0.35
flyover     0.41    hectic         0.36    mug         0.34
ark         0.38    desirable      0.35    trouble     0.34
straw       0.38    celeb          0.34    notorious   0.34
arab        0.39    aristocratic   0.33    rundown     0.33
kosher      0.32    fashionable    0.32    slum        0.32

Table 4: Terms from Twitter with the highest Pearson correlation coefficients for the selected demographic attributes. Correlations are statistically significant (p-value < 0.001).

Buddhist%             (High) Price             Deprivation
Term          ρ       Term             ρ       Term         ρ
think         0.44    luxury           0.66    east         0.39
long          0.42    tea              0.64    eastlondon   0.36
rainy         0.41    teatime          0.61    eastend      0.36
learn         0.40    delight          0.60    yeah         0.33
presentation  0.40    truffle          0.60    studio       0.33
mind          0.40    car              0.60    shit         0.32
para          0.40    classy           0.59    craftbeer    0.30
todo          0.40    stylish          0.59    ass          0.30
thing         0.40    gorgeous         0.59    music        0.30
heart         0.40    interiordesign   0.58    neighbour    0.29

We first examine Table 3 and provide examples of semantic similarity between Yahoo! Answers terms and the selected attributes. Words highlighted in bold are, in our view, the ones most associated with their respective attribute. For the attribute Deprivation, the majority of the terms seem to be linked to issues of deprived areas. "Poverty", "drug", "victim", all refer to social issues. "Rundown" and "slum" may be associated with the degradation of the surrounding environment. "Cockney" is a dialect traditionally spoken by working class, and thus less advantaged, Londoners. For the attribute (High) Price, most terms seem to be related to aspects of places which may offer more expensive housing. Terms such as "fortune", "diplomat", and "aristocratic" are often associated with wealth. Others seem to reflect a posh lifestyle and status symbols: "townhouse", "exclusive", "celeb", "fashionable", "desirable". For the attribute Jewish%, most of the terms seem to reflect aspects of this religion or be linguistically associated with it (i.e., "Jew" and "Jewish"). "Matzo" and "Kosher" are associated with traditional Jewish cuisine; the former is a type of flat-bread, the latter is a way of preparing food. The "ark" is a specific part of the synagogue which contains sacred texts.

We now examine Table 4. For the attribute Deprivation, nine words out of ten seem to be linked to more deprived areas. "East", "eastlondon", and "eastend", for example, provide geographical information on where deprivation is more concentrated in London (i.e., East End). Other terms seem to be related to the presence of a younger generation of creatives and artists in more deprived neighbourhoods. "Yeah", "shit", "ass", may all be jargon commonly used by this section of the population. "Studio", "craftbeer", "music" may instead refer to their main activities and occupations. As for (high) "Price", all the terms seem to relate to aspects of expensive areas, e.g. "luxury", "classy", and "stylish". "Tea", "teatime", "delight", "truffle" seem to relate to social activities of the upper class. For the attribute "Buddhist%", five terms out of ten are, in our view, associated with neighbourhoods where the majority of people are Buddhist or practise Buddhism. These terms seem to relate to aspects of this religion, e.g. "think", "learn", "mind", etc.

Interestingly, terms extracted from Yahoo! Answers and Twitter seem to offer two different kinds of knowledge. On one side, terms extracted from Yahoo! Answers are more encyclopedic as they tend to offer definitions or renowned aspects for each attribute. "Jewish%" is, for example, related to aspects of the Jewish culture such as "matzo", "harmony", and "kosher". "Deprivation" is associated with social issues such as "poverty" and "drug", but also with a degraded urban environment (e.g., "rundown", "slum"). On the other, Twitter words provide a kind of knowledge more related to current sociocultural aspects. This is the case, for example, of the jargon associated with "Deprivation" (e.g., "yeah", "shit"), or of the culinary habits related to "High Prices" (e.g., "tea", "truffle").

Prediction

The results of the regression tasks performed over the selected set of demographic attributes, in terms of Pearson correlation coefficient (ρ), are presented in Table 5. Results are averaged over 10 folds and standard deviations are displayed in parentheses.

We can see that on average, the performances of Yahoo! Answers and Twitter are very similar, with Yahoo! Answers having a slightly higher performance (4%). Twitter data can predict the majority of the religion-related attributes with a higher correlation coefficient, with the exception of population of Jewish%. On the other hand, Yahoo! Answers is superior to Twitter when predicting ethnicity-related attributes such as population of White% and Black%. We have seen in Table 2 that Twitter has very few correlated terms with the attributes White (0) and Black (2).

We also observe that IMD and Price can be predicted with a high correlation coefficient using both Yahoo! Answers and Twitter. This may be due to the fact that there are many words in our dataset that can be related to the deprivation of a neighbourhood or to how expensive a neighbourhood is. This is also evident in Table 2, where the numbers of correlated terms from both Yahoo! Answers and Twitter with these attributes are very high. On the other hand, terms that describe a religion or an ethnicity are more specific and lower in frequency. Therefore attributes that are related to religion or ethnicity are predicted with a lower accuracy.

Table 5: Prediction results in terms of ρ using Yahoo! Answers and Twitter data. Results are averaged over 10 folds and standard deviations are shown in parentheses. Correlations are statistically significant (p-value < 0.01). Terms with the highest coefficients in the regression models are also provided.

              Yahoo! Answers                    Twitter
Attribute     ρ             Terms               ρ             Terms
Muslim%       0.51 (0.07)   asian, barber       0.54 (0.05)   mileend, eastlondon
Jewish%       0.42 (0.08)   jewish, arab        0.13 (0.06)   rsa, rugby
Hindu%        0.32 (0.10)   stadium, cemetery   0.46 (0.09)   smokeyeye, asianbride
Buddhist%     0.24 (0.10)   minister, tourist   0.44 (0.07)   theatre, prayforparis
Black%        0.60 (0.07)   gang, drug          0.44 (0.08)   southlondon, frank
Asian%        0.40 (0.07)   asian, barber       0.39 (0.05)   mileend, gymselfie
White%        0.58 (0.06)   essex, suburbia     0.45 (0.08)   essex, golf
House Price   0.69 (0.05)   money, compliment   0.68 (0.04)   dailyspecial, personaltrainer
IMD           0.69 (0.03)   notorious, cockney  0.56 (0.04)   np, eastlondon

Table 5 further shows, in the column Terms, two terms that have the highest coefficients in the regression models (across the majority of folds) for each attribute and source. These terms are among the strong predictors of their respective attribute. Many of these terms appear to be related to the given demographic attribute (for both Twitter and Yahoo! Answers) and are also often amongst the top correlated terms presented in Tables 3 and 4. We follow with some examples. According to the regression coefficients for the attribute Muslim%, neighbourhoods inhabited by a Muslim majority may be located in Mile End, an East London district (i.e., Twitter terms "mileend" and "eastlondon"), and see the presence of an Asian population and barber shops (i.e., Yahoo! Answers terms "asian" and "barber"). According to the terms for Black%, neighbourhoods with a black majority tend to be located in the southern part of London (i.e., Twitter term "southlondon") and experience social issues such as the presence of criminal groups and drug use (i.e., Yahoo! Answers terms "gang" and "drug"). According to the terms for IMD, more deprived areas seem to be located in the East End of London (i.e., Twitter term "eastlondon") where the Cockney dialect is dominant (i.e., Yahoo! Answers term "cockney"). Yahoo! Answers and Twitter seem to complement one another in terms of the information they provide through the terms associated with each attribute, which in most cases are different. One noticeable difference is that Twitter tends to offer geographical information (e.g., "mileend", "southlondon", "essex"). On the other hand, terms from Yahoo! Answers sometimes match the name of the attribute (i.e. "asian" and "Jewish").

In the Appendix, in Tables 6 and 7, we show the prediction results for a wide range of 62 demographic attributes using Yahoo! Answers and Twitter. For each attribute, we display two terms with the highest coefficient common between the majority of the folds. Attributes are divided into categories such as Religion, Employment, Education, etc. Overall, the results show that Yahoo! Answers performs slightly better than Twitter with an average 1% increase over all the attributes. A Wilcoxon signed rank test shows that their results are significantly different from each other (p-value < 0.01). Outcomes in these tables show that on average, a wide range of demographic attributes of the population of neighbourhoods can be predicted using both Yahoo! Answers and Twitter with high performances of 0.54 and 0.53 respectively. While Yahoo! Answers outperforms Twitter in predicting attributes related to Ethnicity and Employment, Twitter performs better when predicting attributes relating to the Age Group and Car Ownership.

Related Work

The availability of a huge amount of data from many social media platforms has inspired researchers to study the relation between the data on these platforms and many real-world attributes. Twitter data, in particular, has been widely used as a social media source to make predictions in many domains. For example, box-office revenues are predicted using text from Twitter microblogs (Asur, Huberman, and others 2010). Election results have been predicted by performing content analysis on tweets (Tumasjan et al. 2010). It is shown that correlations exist between mood states of the collective tweets and the value of the Dow Jones Industrial Average (DJIA) (Bollen, Mao, and Zeng 2011).

Predicting demographics of individual users using their language on social media platforms, especially Twitter, has been the focus of much research. Text from blogs, telephone conversations, and forum posts is utilised for predicting authors' age (Nguyen, Smith, and Rosé 2011) with a Pearson's correlation of 0.7. Geo-tagged Twitter data have been used to predict the demographic information of authors such as first language, race, and ethnicity with correlations up to 0.3 (Eisenstein, Smith, and Xing 2011).

One aspect of urban life that has been the focus of much research work in urban data mining is finding correlations between different sources of data and the deprivation index (IMD) of neighbourhoods across a city or a country (Lathia, Quercia, and Crowcroft 2012; Quercia et al. 2012). Cellular data (Smith-Clarke, Mashhadi, and Capra 2014) and the elements present in an urban area (Venerandi et al. 2015) are among non-textual data sources that are shown to have correlations with a deprivation index. Also, the flow of public transport data has been used to find correlations (with a correlation coefficient of r = 0.21) with the IMD of urban areas available in the UK census (Lathia, Quercia, and Crowcroft 2012). Research shows that a correlation of r = 0.35 exists between the sentiment expressed in tweets of users in a community and the deprivation index of the community (Quercia et al. 2012).

Social media data has been used in many domains to find links to real-world attributes. Data generated on QA platforms, however, has not been used in the past for predicting such attributes. In this paper, we use discussions on the Yahoo! Answers QA platform to make predictions of demographic attributes of city neighbourhoods. Previous work in this domain has mainly focused on predicting the deprivation index of areas (Quercia et al. 2012). In this work, we look at a wide range of attributes and report prediction results on 62 demographic attributes. Additionally, work in urban prediction uses geolocation-based platforms such as Twitter. The QA data that has been utilised in this paper does not include geolocation information. Utilising such data presents its own challenges.

Discussion

In this paper, we investigate predicting values for real-world entities such as demographic attributes of neighbourhoods using discussions from QA platforms. We show that these attributes can be predicted using text features based on Yahoo! Answers discussions about neighbourhoods with a slightly higher correlation coefficient than predictions made using Twitter data.

Limitations

Here, we present some of the limitations of our work.

References

Lathia, N.; Quercia, D.; and Crowcroft, J. 2012. The hidden image of the city: sensing community well-being from urban mobility. In Pervasive Computing.

Li, B.; Jin, T.; Lyu, M. R.; King, I.; and Mak, B. 2012. Analyzing and predicting question quality in community question answering services. In Proceedings of the International Conference on World Wide Web.

Liu, M.; Liu, Y.; and Yang, Q. 2010. Predicting best answerers for new questions in community question answering. In Proceedings of the International Conference on Web-Age Information Management.

Nguyen, D.; Smith, N. A.; and Rosé, C. P. 2011. Author age prediction from text using linear regression. In Proceedings of the 5th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities: Annual Meeting of the Association for Computational Linguistics.

Preoţiuc-Pietro, D.; Volkova, S.; Lampos, V.; Bachrach, Y.; and Aletras, N. 2015. Studying user income through language, behaviour and affect in social media. PLoS ONE.

Preoţiuc-Pietro, D.; Lampos, V.; and Aletras, N. 2015. An analysis of the user occupational class through twitter content. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Quercia, D.; Ellis, J.; Capra, L.; and Crowcroft, J. 2012.
Tracking gross community happiness from tweets. In Pro- ceedingsoftheInternationalConferenceonComputerSup- Unification of the units of analysis. To unify the units portedCooperativeWorkandSocialComputing. of analysis, we take a heuristic approach. We do not cross- validateourresultswithotherapproaches.Thisisbecauseof Rosset, S.; Perlich, C.; and Zadrozny, B. 2005. Ranking- thelackofworkinusingnon-geotaggedtextforpredicting based evaluation of regression models. In Proceedings of attributesofneighbourhoodsinthecurrentliterature. theInternationalConferenceonDataMining. Salton,G.;Fox,E.A.;andWu,H. 1983. Extendedboolean informationretrieval. CommunicationsoftheACM. Coverage. Ourexperimentsinthispaperislimitedtothe city of London. London is a cosmopolitan city and a pop- Smith-Clarke, C.; Mashhadi, A.; and Capra, L. 2014. ular destination for travellers and settlers. Therefore, many Povertyonthecheap:Estimatingpovertymapsusingaggre- discussions can be found on Yahoo! Answers regarding its gated mobile communication networks. In Proceedings of neighbourhoods. The coverage of discussions on QA plat- International Conference on Human Factors in Computing formsmaynotbesufficientforallcitiesofinterest. Systems. Tian,Y.;Kochhar,P.S.;Lim,E.-P.;Zhu,F.;andLo,D.2013. References Predicting best answerers for new questions: An approach Asur, S.; Huberman, B.; et al. 2010. Predicting the future leveragingtopicmodelingandcollaborativevoting. InPro- withsocialmedia. InProceedingsoftheInternationalCon- ceedings of the Workshops at the International Conference ferencesonWebIntelligenceandIntelligentAgentTechnol- onSocialInformatics. ogy. Tumasjan, A.; Sprenger, T. O.; Sandner, P. G.; and Welpe, Bollen,J.;Mao,H.;andZeng,X. 2011. Twittermoodpre- I. M. 2010. Predicting elections with twitter: What 140 dictsthestockmarket. JournalofComputationalScience. characters reveal about political sentiment. In Proceedings oftheInternationalConferenceonWebandSocialMedia. 
Eisenstein,J.;Smith,N.A.;andXing,E.P. 2011. Discover- ing sociolinguistic associations with structured sparsity. In Venerandi, A.; Quattrone, G.; Capra, L.; Quercia, D.; and Proceedings of the Annual Meeting of the Association for Saez-Trumper,D. 2015. Measuringurbandeprivationfrom ComputationalLinguistics. usergeneratedcontent. InProceedingsoftheInternational ConferenceonComputerSupportedCooperativeWorkand Foster,D.P.,andStine,M.L.R.A. Featurizingtext:Con- SocialComputing. vertingtextintopredictorsforregressionanalysis. Zhao,T.;Li,C.;Li,M.;Wang,S.;Ding,Q.;andLi,L.2012. Joshi, M.; Das, D.; Gimpel, K.; and Smith, N. A. 2010. Predicting best responder in community question answer- Moviereviewsandrevenues:Anexperimentintextregres- ing using topic model method. In Proceedings of the In- sion. InHumanLanguageTechnologies:TheAnnualCon- ternationalConferencesonWebIntelligenceandIntelligent ference of the North American Chapter of the Association AgentTechnology. forComputationalLinguistics. Table6:PredictionresultsfordifferentcategoriesandattributesintermsofρusingYahoo!AnswersandTwitterdata.Results areaveragedover10foldsandstandarddeviationsareshowninparenthesis.Allcorrelationsarestatisticallysignificantwitha p-value<0.01.Foreachcategory,thedifferenceinperformancebetweenthetwosourcesarehighlightedinthecolumnrelated totheoutperforming(i.e.upwardarrow)source. 
| Attribute | Yahoo! Answers ρ | Yahoo! Answers Terms | Twitter ρ | Twitter Terms |
|---|---|---|---|---|
| Price&Deprivation | 0.69 (5%↑) | | 0.64 | |
| HousePrice | 0.69 | money, compliment | 0.68 | dailyspecial, personaltrainer |
| IMD | 0.69 | notorious, cockney | 0.56 | np, eastlondon |
| Religion | 0.37 | | 0.39 (2%↑) | |
| Muslim% | 0.51 | asian, barber | 0.54 | mileend, eastlondon |
| Jewish% | 0.42 | jewish, arab | 0.13 | rsa, rugby |
| Hindu% | 0.32 | stadium, cemetery | 0.46 | smokeyeye, asianbride |
| Buddhist% | 0.24 | minister, tourist | 0.44 | theatre, prayforparis |
| Ethnicity | 0.49 (6%↑) | | 0.43 | |
| Black% | 0.60 | gang, drug | 0.44 | southlondon, frank |
| Asian% | 0.40 | asian, barber | 0.39 | mileend, gymselfie |
| White% | 0.58 | essex, suburbia | 0.45 | essex, golf |
| Mixed% | 0.37 | reggae, gang | 0.45 | studio, southlondon |
| AgeGroup | 0.56 | | 0.60 (4%↑) | |
| 0-15 | 0.53 | mortgage, crappy | 0.54 | uel, ikea |
| 16-29 | 0.66 | student, music | 0.66 | drum, campus |
| 30-44 | 0.50 | cycle, psychic | 0.60 | nffc, loyaltylunch |
| 45-64 | 0.46 | temporarily, underrate | 0.57 | essex, golf |
| 65Plus | 0.62 | hospital, outskirts | 0.62 | golf, thearcher |
| WorkingAge | 0.58 | foody, triple | 0.64 | ukjob, delay |
| HouseholdComposition | 0.55 | | 0.58 (3%↑) | |
| CoupleWithDependentChildren% | 0.57 | belt, affordability | 0.76 | blondieblue, xoxo |
| CoupleWithoutDependentChildren% | 0.59 | role, essex | 0.55 | essex, semipermanentmakeup |
| LoneParentHousehold% | 0.61 | gang, mortgage | 0.38 | helpme, ikea |
| OnePersonHousehold% | 0.55 | hotel, fashionable | 0.70 | personaltrainer, wine |
| AtLeastOnePerson16+English1stLanguage% | 0.52 | essex, outskirts | 0.55 | golf, essex |
| NoPeopleAged16+English1stLanguage% | 0.46 | asian, foreigner | 0.53 | tube, edgwareroad |
| ResidentialStatus | 0.58 | | 0.63 (5%↑) | |
| OwnedOutright% | 0.69 | chelmsford, outskirts | 0.61 | starbucks, grand |
| OwnedWithAMortgageOrLoan% | 0.67 | belt, scummy | 0.74 | barbergang, essex |
| SocialRented% | 0.65 | cockney, dump | 0.56 | ikea, studio |
| PrivateRented% | 0.48 | hotel, privacy | 0.60 | tube, edgwareroad |
| HouseholdOne+UsualResident% | 0.49 | mortgage, gang | 0.59 | eastlondon, londonbridge |
| HouseholdNoUsualResidents% | 0.39 | hotel, square | 0.61 | hotel, marblearch |
| WholeHouseOrDetached% | 0.56 | underrate, retiree | 0.69 | instafamily, crochet |
| WholeHouseOrSemiDetached% | 0.65 | benefit, suburbia | 0.72 | essex, semipermanentmakeup |
| WholeHouseOrTerraced% | 0.55 | cypriot, value | 0.53 | followforfollow, brockley |
| FlatOrApartment% | 0.68 | location, inexpensive | 0.76 | nffc, pcm |
| Sale | 0.53 | commute, upmarket | 0.50 | personaltrainer, crochet |
| Employment | 0.54 (8%↑) | | 0.46 | |
| NoAdultsEmployed-DependentChildren% | 0.52 | interchange, cockney | 0.33 | ikea, gymtime |
| LoneParentNotInEmploymentPercent | 0.55 | slum, cockney | 0.53 | mileend, edgwareroad |
| EconomicallyActiveTotal | 0.46 | suite, deprive | 0.57 | railway, kensalrise |
| EconomicallyInactiveTotal | 0.58 | student, triple | 0.58 | mileend, gymtime |
| EconomicallyActiveEmployee | 0.28 | cycle, deprive | 0.40 | railway, royaltylunch |
| EconomicallyActiveSelfEmployed | 0.65 | jewish, affordability | 0.47 | rugby, northlondon |
| EconomicallyActiveUnemployed | 0.66 | cockney, drug | 0.50 | np, eastlondon |
| EconomicallyActiveFullTimeStudent | 0.57 | student, asian | 0.49 | np, instrumental |
| EmploymentRate | 0.54 | commute, suburban | 0.36 | railway, barbergang |
| UnemploymentRate | 0.59 | notorious, cockney | 0.40 | swap, eastlondon |

Table 7: cont.

| Attribute | Yahoo! Answers ρ | Yahoo! Answers Terms | Twitter ρ | Twitter Terms |
|---|---|---|---|---|
| Education | 0.54 | | 0.58 (4%↑) | |
| NoQualifications | 0.62 | scummy, cockney | 0.55 | eastlondon, puregym |
| HighestLevelOfQualificationLevel1% | 0.68 | essex, scummy | 0.72 | eastlondon, hackneywick |
| HighestLevelOfQualificationLevel2% | 0.69 | scummy, role | 0.77 | followforfollow, tattoo |
| HighestLevelOfQualificationApprenticeship% | 0.56 | role, truck | 0.75 | tatemodern, oldstreet |
| HighestLevelOfQualificationLevel3% | 0.16 | role, fish | 0.23 | bttower, fresher |
| HighestLevelOfQualificationLevel4%AndAbove | 0.71 | scholarship, affordability | 0.62 | rugby, cave |
| HighestLevelOfQualificationOther% | 0.38 | employer, stadium | 0.39 | endorphin, tube |
| SchoolchildrenAndFullTimeStudents18+% | 0.53 | student, chips | 0.59 | tube, np |
| Health | 0.42 | | 0.42 | |
| DayToDayActivitiesLimitedALot% | 0.33 | cockney, gang | 0.31 | eastlondon, shisha |
| DayToDayActivitiesLimitedALittle% | 0.40 | gloom, puppy | 0.52 | cafc, hackneywick |
| DayToDayActivitiesNotLimited% | 0.39 | commute, park | 0.37 | tea, enjoysmilelive |
| VeryGoodOrGoodHealth% | 0.48 | commute, park | 0.39 | rwc, tea |
| FairHealth% | 0.52 | scummy, gang | 0.59 | streetfood, coy |
| BadOrVeryBadHealth% | 0.35 | cockney, gang | 0.34 | eastlondon, east |
| CarOwnership | 0.62 | | 0.71 (9%↑) | |
| NoCarsOrVansInHousehold% | 0.72 | brewery, cockney | 0.71 | tube, groove |
| 1CarOrVanInHousehold% | 0.67 | suburban, grounds | 0.70 | onelife, supercar |
| 2CarsOrVansInHousehold% | 0.67 | hospital, outskirts | 0.71 | thearcher, golf |
| 3CarsOrVansInHousehold% | 0.57 | role, belt | 0.75 | dailypic, boxpark |
| 4OrMoreCarsOrVansInHousehold% | 0.49 | freehold, residential | 0.69 | rugby, cave |
| Average | 0.54 | | 0.53 | |
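As a reading aid for the category rows in Table 6, the sketch below recomputes the Religion and Ethnicity summaries from the attribute-level ρ values. It assumes that a category figure is the plain mean of its attributes' ρ values rounded to two decimals, and that the arrow marks the gap between the two rounded means in hundredths. This is an assumption that happens to reproduce these two rows, not a procedure stated in the paper, and it has not been checked against the other categories. The names `TABLE6` and `category_summary` are hypothetical.

```python
# Attribute-level rho values copied from Table 6 as (Yahoo! Answers, Twitter).
TABLE6 = {
    "Religion": {
        "Muslim%": (0.51, 0.54),
        "Jewish%": (0.42, 0.13),
        "Hindu%": (0.32, 0.46),
        "Buddhist%": (0.24, 0.44),
    },
    "Ethnicity": {
        "Black%": (0.60, 0.44),
        "Asian%": (0.40, 0.39),
        "White%": (0.58, 0.45),
        "Mixed%": (0.37, 0.45),
    },
}


def category_summary(attributes):
    """Return (yahoo_mean, twitter_mean, gap_in_hundredths, better_source).

    Assumption (not confirmed by the paper): the category value is the
    simple mean of the attribute-level rho values, rounded to two decimals.
    """
    yahoo = [y for y, _ in attributes.values()]
    twitter = [t for _, t in attributes.values()]
    # Work in integer hundredths so the gap is not affected by float noise.
    yahoo_pts = round(100 * sum(yahoo) / len(yahoo))
    twitter_pts = round(100 * sum(twitter) / len(twitter))
    gap = abs(yahoo_pts - twitter_pts)
    if yahoo_pts > twitter_pts:
        better = "Yahoo! Answers"
    elif twitter_pts > yahoo_pts:
        better = "Twitter"
    else:
        better = "tie"
    return yahoo_pts / 100, twitter_pts / 100, gap, better


if __name__ == "__main__":
    for name, attrs in TABLE6.items():
        print(name, category_summary(attrs))
```

Under these assumptions the Religion row comes out as 0.37 versus 0.39 with a gap of 2 in favour of Twitter, and the Ethnicity row as 0.49 versus 0.43 with a gap of 6 in favour of Yahoo! Answers, matching the values printed in Table 6.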
