PRINCIPLES OF BIG DATA
Preparing, Sharing, and Analyzing Complex Information

JULES J. BERMAN, Ph.D., M.D.

AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS • SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

Morgan Kaufmann is an imprint of Elsevier

Acquiring Editor: Andrea Dierna
Editorial Project Manager: Heather Scherer
Project Manager: Punithavathy Govindaradjane
Designer: Russell Purdy

225 Wyman Street, Waltham, MA 02451, USA

Copyright © 2013 Elsevier Inc. All rights reserved

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods or professional practices, may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information or methods described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
Library of Congress Cataloging-in-Publication Data

Berman, Jules J.
Principles of big data : preparing, sharing, and analyzing complex information / Jules J Berman.
pages cm
ISBN 978-0-12-404576-7
1. Big data. 2. Database management. I. Title.
QA76.9.D32B47 2013
005.74–dc23
2013006421

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

Printed and bound in the United States of America
13 14 15 16 17 10 9 8 7 6 5 4 3 2 1

For information on all MK publications visit our website at www.mkp.com

Dedication

To my father, Benjamin

Contents

Acknowledgments
Author Biography
Preface
Introduction

1. Providing Structure to Unstructured Data
  Background
  Machine Translation
  Autocoding
  Indexing
  Term Extraction

2. Identification, Deidentification, and Reidentification
  Background
  Features of an Identifier System
  Registered Unique Object Identifiers
  Really Bad Identifier Methods
  Embedding Information in an Identifier: Not Recommended
  One-Way Hashes
  Use Case: Hospital Registration
  Deidentification
  Data Scrubbing
  Reidentification
  Lessons Learned

3. Ontologies and Semantics
  Background
  Classifications, the Simplest of Ontologies
  Ontologies, Classes with Multiple Parents
  Choosing a Class Model
  Introduction to Resource Description Framework Schema
  Common Pitfalls in Ontology Development

4. Introspection
  Background
  Knowledge of Self
  eXtensible Markup Language
  Introduction to Meaning
  Namespaces and the Aggregation of Meaningful Assertions
  Resource Description Framework Triples
  Reflection
  Use Case: Trusted Time Stamp
  Summary

5. Data Integration and Software Interoperability
  Background
  The Committee to Survey Standards
  Standard Trajectory
  Specifications and Standards
  Versioning
  Compliance Issues
  Interfaces to Big Data Resources

6. Immutability and Immortality
  Background
  Immutability and Identifiers
  Data Objects
  Legacy Data
  Data Born from Data
  Reconciling Identifiers across Institutions
  Zero-Knowledge Reconciliation
  The Curator’s Burden

7. Measurement
  Background
  Counting
  Gene Counting
  Dealing with Negations
  Understanding Your Control
  Practical Significance of Measurements
  Obsessive-Compulsive Disorder: The Mark of a Great Data Manager

8. Simple but Powerful Big Data Techniques
  Background
  Look at the Data
  Data Range
  Denominator
  Frequency Distributions
  Mean and Standard Deviation
  Estimation-Only Analyses
  Use Case: Watching Data Trends with Google Ngrams
  Use Case: Estimating Movie Preferences

9. Analysis
  Background
  Analytic Tasks
  Clustering, Classifying, Recommending, and Modeling
  Data Reduction
  Normalizing and Adjusting Data
  Big Data Software: Speed and Scalability
  Find Relationships, Not Similarities

10. Special Considerations in Big Data Analysis
  Background
  Theory in Search of Data
  Data in Search of a Theory
  Overfitting
  Bigness Bias
  Too Much Data
  Fixing Data
  Data Subsets in Big Data: Neither Additive nor Transitive
  Additional Big Data Pitfalls

11. Stepwise Approach to Big Data Analysis
  Background
  Step 1. A Question Is Formulated
  Step 2. Resource Evaluation
  Step 3. A Question Is Reformulated
  Step 4. Query Output Adequacy
  Step 5. Data Description
  Step 6. Data Reduction
  Step 7. Algorithms Are Selected, If Absolutely Necessary
  Step 8. Results Are Reviewed and Conclusions Are Asserted
  Step 9. Conclusions Are Examined and Subjected to Validation

12. Failure
  Background
  Failure Is Common
  Failed Standards
  Complexity
  When Does Complexity Help?
  When Redundancy Fails
  Save Money; Don’t Protect Harmless Information
  After Failure
  Use Case: Cancer Biomedical Informatics Grid, a Bridge Too Far

13. Legalities
  Background
  Responsibility for the Accuracy and Legitimacy of Contained Data
  Rights to Create, Use, and Share the Resource
  Copyright and Patent Infringements Incurred by Using Standards
  Protections for Individuals
  Consent
  Unconsented Data
  Good Policies Are a Good Policy
  Use Case: The Havasupai Story

14. Societal Issues
  Background
  How Big Data Is Perceived
  The Necessity of Data Sharing, Even When It Seems Irrelevant
  Reducing Costs and Increasing Productivity with Big Data
  Public Mistrust
  Saving Us from Ourselves
  Hubris and Hyperbole

15. The Future
  Background
  Last Words

Glossary
References
Index