
Computational and Statistical Methods for Analysing Big Data with Applications

Shen Liu, The School of Mathematical Sciences and the ARC Centre of Excellence for Mathematical & Statistical Frontiers, Queensland University of Technology, Australia

James McGree, The School of Mathematical Sciences and the ARC Centre of Excellence for Mathematical & Statistical Frontiers, Queensland University of Technology, Australia

Zongyuan Ge, Cyphy Robotics Lab, Queensland University of Technology, Australia

Yang Xie, The Graduate School of Biomedical Engineering, the University of New South Wales, Australia

AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS • SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

Academic Press is an imprint of Elsevier
125 London Wall, EC2Y 5AS
525 B Street, Suite 1800, San Diego, CA 92101-4495, USA
225 Wyman Street, Waltham, MA 02451, USA
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, UK

Copyright © 2016 Elsevier Ltd. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

ISBN: 978-0-12-803732-4

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress.

For information on all Academic Press publications visit our website at http://store.elsevier.com/

List of Figures

Figure 5.1 A 3D plot of the drill-hole data.
Figure 6.1 95% credible intervals for effect sizes for the Mortgage default example at each iteration of the sequential design process.
Figure 6.2 Optimal design versus selected design points for (a) credit score, (b) house age, (c) years employed and (d) credit card debt for the Mortgage default example.
Figure 6.3 Posterior model probabilities at each iteration of the sequential design process for the two models in the candidate set for the airline example.
Figure 6.4 Ds-optimal designs versus the designs extracted from the airline dataset for (a) departure hour and (b) distance from the origin to the destination.
Figure 7.1 Standard process for large-scale data analysis, proposed by Peng, Leek, and Caffo (2015).
Figure 7.2 Data completeness measured using the method described in Section 7.2.1.3. The numbers indicate percentage of completeness.
Figure 7.3 Customer demographics, admission and procedure claim data for year 2010, along with 2011 DIH, were used to train the model. Later, at the prediction stage, customer demographics, hospital admission and procedure claim data for year 2011 were used to predict the number of DIH in 2012.
Figure 7.4 Average days in hospital per person by age for each of the 3 years of HCF data.
Figure 7.5 Scatter-plots for bagged regression tree results for customers born before year 1948 (those aged 63 years or older when the model was trained in 2011).
Figure 7.6 Distribution of the top 200 features among the four feature subsets: (a) in the whole population; (b) in subjects born in or after 1948; (c) in subjects born before 1948; (d) in the 1+ days group; (e) shows the percentage of the four subsets with respect to the full feature set of 915 features.
Figure 8.1 Sandgate Road and its surroundings.
Figure 8.2 Individual travel times over Link A and B on 12 November 2013.
Figure 8.3 Clusters of road users and travel time estimates, Link A.
Figure 8.4 Clusters of road users and travel time estimates, Link B.
Figure 8.5 Travel time estimates of the clustering methods and spot speed data, Link A.

List of Tables

Table 6.1 Results from analysing the full dataset for year 2000 for the Mortgage default example
Table 6.2 The levels of each covariate available for selection in the sequential design process for the Mortgage example from Drovandi et al. (2015)
Table 6.3 Results from analysing the extracted dataset from the initial learning phase and the sequential design process for the Mortgage default example
Table 6.4 The levels of each covariate available for selection in the sequential design process for the airline example from Drovandi et al. (2015)
Table 7.1 A summary of big data analysis in health care
Table 7.2 Summary of the composition of the feature set. The four feature subset classes are: (1) demographic features; (2) medical features extracted from clinical information; (3) prior cost/DIH features; and (4) other miscellaneous features
Table 7.3 Performance measures
Table 7.4 Performance metrics of the proposed method, evaluated on different populations
Table 7.5 Performance metrics of predictions using feature category subsets only
Table 7.6 An example of interesting ICD-10 primary diagnosis features
Table 8.1 Proportions of road users that have the same grouping over the two links

Acknowledgment

The authors would like to thank the School of Mathematical Sciences and the ARC Centre of Excellence for Mathematical & Statistical Frontiers at Queensland University of Technology, the Australian Centre for Robotic Vision, and the Graduate School of Biomedical Engineering at the University of New South Wales for their support in the development of this book.

The authors are grateful to Ms Yike Gao for designing the front cover of this book.

1 Introduction

The history of humans storing and analysing data dates back to about 20,000 years ago, when tally sticks were used to record and document numbers. Palaeolithic tribespeople used to notch sticks or bones to keep track of supplies or trading activities, while the notches could be compared to carry out calculations, enabling them to make predictions, such as how long their food supplies would last (Marr, 2015). As one can imagine, data storage or analysis in ancient times was very limited. However, after a long journey of evolution, people are now able to collect and process huge amounts of data, such as business transactions, sensor signals, search engine queries, multimedia materials and social network activities.
As a 2011 McKinsey report (Manyika et al., 2011) stated, the amount of information in our society has been exploding, and consequently analysing large datasets, the so-called big data, will become a key basis of competition, underpinning new waves of productivity growth, innovation and consumer surplus.

1.1 What is big data?

Big data is not a new phenomenon, but one that is part of a long evolution of data collection and analysis. Among the numerous definitions of big data that have been introduced over the last decade, the one provided by Mayer-Schönberger and Cukier (2013) appears to be the most comprehensive:

Big data is "the ability of society to harness information in novel ways to produce useful insights or goods and services of significant value" and "things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value."

In the analytics community, it is widely accepted that big data can be conceptualized by the following three dimensions (Laney, 2001):

• Volume
• Velocity
• Variety

1.1.1 Volume

Volume refers to the vast amounts of data being generated and recorded. Although big data and large datasets are different concepts, to most people big data implies an enormous volume of numbers, images, videos or text. Nowadays, the amount of information being produced and processed is increasing tremendously, as illustrated by the following facts:

• 3.4 million emails are sent every second;
• 570 new websites are created every minute;
• More than 3.5 billion search queries are processed by Google every day;
• On Facebook, 30 billion pieces of content are shared every day;
• Every two days we create as much information as we did from the beginning of time until 2003;
• In 2007, the number of bits of data stored in the digital universe is thought to have exceeded the number of stars in the physical universe;
• The total amount of data being captured and stored by industry doubles every 1.2 years;
• Over 90% of all the data in the world were created in the past 2 years.

As claimed by Laney (2001), increases in data volume are usually handled by utilizing additional online storage. However, the relative value of each data point decreases proportionately as the amount of data increases. As a result, attempts have been made to profile data sources so that redundancies can be identified and eliminated. Moreover, statistical sampling can be performed to reduce the size of the dataset to be analysed.
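To make the sampling idea concrete, the following sketch (ours, not taken from the book) draws a uniform random subset of records from a dataset that is too large to hold in memory, using single-pass reservoir sampling; the file name and sample size are hypothetical placeholders.

```python
import random

def reservoir_sample(records, k, seed=42):
    """Keep a uniform random sample of k items from a stream of unknown,
    possibly enormous, length using a single pass (reservoir sampling)."""
    rng = random.Random(seed)
    sample = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)        # fill the reservoir first
        else:
            j = rng.randint(0, i)     # each later record survives with probability k/(i+1)
            if j < k:
                sample[j] = rec
    return sample

# Hypothetical usage: draw 10,000 rows from a large transaction file
with open("transactions.csv") as f:
    subset = reservoir_sample(f, k=10_000)
```

Because the pass is strictly sequential, the same approach works for records arriving from a network stream rather than a file on disk.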
1.1.2 Velocity

Velocity refers to the pace of data streaming, that is, the speed at which data are generated, recorded and communicated. Laney (2001) stated that the growth of e-commerce has increased point-of-interaction speed and, consequently, the pace at which data are used to support interactions. According to the International Data Corporation (https://www.idc.com/), the global annual rate of data production is expected to reach 5.6 zettabytes in 2015, double the figure for 2012, and by 2020 the amount of digital information in existence is expected to have grown to 40 zettabytes. To cope with this high velocity, people need to access, process, comprehend and act on data much faster and more effectively than ever before. The major issue related to velocity is that data are being generated continuously. Traditionally, there was a large time gap between data collection and data analysis; in the era of big data such a gap is problematic, as a large portion of the data may be wasted during the delay. In the presence of high velocity, data collection and analysis need to be carried out as an integrated process. Initially, research interest was directed towards the large-volume characteristic, whereas companies are now investing in big data technologies that allow data to be analysed while they are being generated.
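As a concrete illustration of analysing data as they arrive, here is a minimal sketch (our own, not from the book) that maintains running summary statistics over a stream using Welford's online algorithm, so the mean and variance are available at any moment without storing the raw observations; the sensor readings below are hypothetical.

```python
class RunningStats:
    """Track count, mean and variance of a data stream incrementally
    (Welford's online algorithm); no raw observations are retained."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

# Hypothetical usage: statistics are refreshed as each reading arrives
stats = RunningStats()
for reading in [12.1, 11.8, 12.4, 13.0]:   # stand-in for a live data feed
    stats.update(reading)
    print(stats.n, round(stats.mean, 3), round(stats.variance, 3))
```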
1.1.3 Variety

Variety refers to the heterogeneity of data sources and formats. Since there are numerous ways to collect information, it is now common to encounter various types of data coming from different sources. Before the year 2000, the most common format of data was the spreadsheet, where data are structured and fit neatly into tables or relational databases. In the 2010s, however, most data are unstructured, extracted from photos, video/audio documents, text documents, sensors, transaction records, etc. The heterogeneity of data sources and formats makes datasets too complex to store and analyse using traditional methods, and significant effort is required to tackle the challenges posed by such variety.

As stated by the Australian Government (http://www.finance.gov.au/sites/default/files/APS-Better-Practice-Guide-for-Big-Data.pdf), traditional data analysis takes a dataset from a data warehouse, which is clean and complete, with gaps filled and outliers removed. Analysis is carried out after the data are collected and stored in a storage medium such as an enterprise data warehouse. In contrast, big data analysis uses a wider variety of available data relevant to the analytics problem. The data are usually messy, consisting of different types of structured and unstructured content, and there are complex coupling relationships in big data from syntactic, semantic, social, cultural, economic, organizational and other aspects. Rather than simply interrogating the data, analysts explore them to discover insights and understanding, such as relevant data and relationships to explore further.

1.1.4 Another two V's

It is worth noting that, in addition to Laney's three Vs, Veracity and Value have frequently been mentioned in the big data literature (Marr, 2015). Veracity refers to the trustworthiness of the data, that is, the extent to which data are free of bias, noise and abnormality. Efforts should be made to keep data tidy and clean, and methods should be developed to prevent dirty data from being recorded. Value, on the other hand, refers to the amount of useful knowledge that can be extracted from data. Big data can deliver value in a broad range of fields, such as computer vision (Chapter 4 of this book), geosciences (Chapter 5), finance (Chapter 6), civil aviation (Chapter 6), health care (Chapters 7 and 8) and transportation (Chapter 8). In fact, the applications of big data are endless, and everyone is benefiting from them now that we have entered a data-rich era.

1.2 What is this book about?

Big data analysis involves a collection of techniques that can help in extracting useful information from data. Aiming at this objective, we develop and implement advanced statistical and computational methodologies for use in various high-impact areas where big data are being collected.

In Chapter 2, classification methods will be discussed; these have been extensively applied to big data in fields such as customer segmentation, fraud detection, computer vision, speech recognition and medical diagnosis. In brief, classification can be viewed as a labelling process for new observations, aiming to determine to which of a set of categories an unlabelled object belongs. Fundamentals of classification will be introduced first, followed by a discussion of several classification methods that have been popular in …

Description:
Due to the scale and complexity of data sets currently being collected in areas such as health, transportation, environmental science, engineering, information technology, business and finance, modern quantitative analysts are seeking improved and appropriate computational and statistical methods to …
