
Introduction to Data Systems: Building from Python

844 Pages·2020·11.545 MB·English


Thomas Bressoud • David White

Introduction to Data Systems
Building from Python

Thomas Bressoud
Mathematics and Computer Science
Denison University
Granville, OH, USA

David White
Mathematics and Computer Science
Denison University
Granville, OH, USA

ISBN 978-3-030-54370-9    ISBN 978-3-030-54371-6 (eBook)
https://doi.org/10.1007/978-3-030-54371-6

© Springer Nature Switzerland AG 2020

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

In family life, love is the oil that eases friction, the cement that binds closer together, and the music that brings harmony. [Nietzsche]

To Suzanne, your enduring love and support inspire and sustain me. To Amanda and Ben, I love you deeply and am so proud of you. And, to the best of all time: the family of George and Leigh.
–Tom

Preface

The sources and forms of data, along with the demand for acquiring, processing, and analyzing that data are increasing at a prodigious rate. From a curricular perspective, majors, programs, and courses are being designed to enable interested students to understand these topics and prepare themselves for the practice of these associated data skills. As the sources and forms of data increase, with a commensurate increase in the data providers and interfaces for accessing data, our interpretation of what constitutes a data system must keep pace. An understanding of these forms and sources of data is exactly what is needed in a broad introduction to data systems.

Data systems encompasses the study of forms and sources of data, an increasingly important topic for both computer scientists and data scientists. This book covers data acquisition, wrangling, normalization, and curation, requiring only basic prior exposure to Python. The book includes a detailed treatment of tidy data, relational data, and hierarchical data, laying a conceptual basis for the structure, operations, and constraints of each data model, while simultaneously providing hands-on skills in Python, SQL, and XPath. The sources of data studied encompass local files, text applications and regular expressions, database servers, HTTP requests, REST APIs, and web scraping.

Who Is This Book for?

As university curricula expand to include content on data systems, the student audience for such curricular efforts is broadening as well. We identify three student constituencies at the undergraduate level:

• Computer Science: students recognizing the value of being facile while working with data and seeing the synergy with the systems and algorithmic problem-solving aspects of the computer science discipline.
• Data Science/Data Analytics: students pursuing the emerging undergraduate majors, who desire to build algorithmic and practical skills with an eye toward enabling data exploration, visualization, statistical analysis, and application of machine learning techniques for use in coursework as well as in practicums and internships.
• Multidisciplinary/domain-centric: students whose focus is on practical aspects of acquiring and transforming data for use in their own disciplines. These include students in social sciences as well as natural sciences, and often as a prelude to a “methods” course specific to their discipline.

Within computer science education, the union of these audiences has been termed the “data-centric” audience. We have carefully identified a set of essential data-related topics, and a desired level of understanding, that these student constituencies will need to be successful in their future endeavors.

This book is ideal for first- or second-year undergraduate students, either for a classroom setting or for self-directed learning, and does not require prerequisites of data structures, algorithms, or other courses beyond a first course in Python programming. This book equips students with understanding and skills that can be applied in computer science, data science/data analytics, and information technology programs as well as for internships and research experiences. By drawing together content normally spread across a set of upper-level computer science courses, it offers a single source providing the essentials for data science practitioners, an accessible and foundational second course for computer science majors, a potential second course for data science majors, and content that can supplement introductory courses seeking more exposure to real-world data. In our increasingly data-centric world, students from all domains both want and need the data-aptitude built by the material in this book.

Philosophy of This Book

We see two primary dimensions to data-aptitude, corresponding to the sources and forms of data.
Data sources range from local files, downloaded or given to the student or analyst, to a relational database system, to a wide assortment of network-based data providers. Data forms/formats include comma-separated value or other delimited flat files, tables within a relational database, XML and JSON used for data interchange, as well as unstructured data and HTML from which data can be extracted. Data models give a framework for understanding the structure, operations, and constraints for these various forms of data. A third dimension of data-aptitude relates to the protection and privacy of the data. Some data is explicitly open and freely available. Other data may be limited to use by particular applications. Still other data may be comprised of protected resources of one or more resource owners, and the acquisition and use of such data must be appropriately authorized.

An Introduction to Data Systems, as advanced by this book, takes the perspective that students can, early in their academic careers, learn these dimensions of data-aptitude. Students need, in a structured way, to learn about the forms and sources of data in modern data systems, defined broadly. As users of data systems, students need to be able to build applications to acquire data and, given the form of the data acquired, be able to extract and transform the data into a normalized form, where subsequent statistics and analysis may be easily performed on the data.

This book uses the framework of a set of data models, from a simple tabular model with appropriate constraints, to the relational data model, and to a hierarchical data model, giving structure to the various data forms seen in practice. In the data sources dimension of data-aptitude, the course follows a progression of data sources, starting the students with local files, moving on to MySQL database server and SQLite databases as representative of relational databases, and then covering more advanced client–server interaction over HTTP using provider APIs, and also extracting data from HTML.
In this way, we feel we can cover the topics of data-aptitude in sufficient depth, while keeping material broad and applicable to the wide range of data-centric students.

Since specific programming languages and packages are bound to change over the course of the reader’s lifetime, we stress a conceptual understanding over an exhaustive coverage of packages. For each topic, we begin with a high-level discussion, situating the topic within its disciplinary context. We then discuss how to work with the topic, with an emphasis on problem solving and algorithms. Lastly, we illustrate using tools such as Python, pandas, SQL, XPath, curl, etc. In this way, we ensure that readers understand the principles underlying these tools, rather than only how to use them as a black box. We select tools that are powerful enough to illustrate the concepts, and simultaneously as easy as possible for readers to learn. Readers are free to focus their energy on the essential content, rather than struggling to learn the tools. We include detailed references that can guide interested readers to further exploration.

To avoid overwhelming readers with too many different data situations, we center our exposition on three data sets (based on economic data, sociological data, and educational data) that we carry with us throughout the book. These illustrate our foundational material, the three data models, and the process of data acquisition. This allows us to center our discussion on compelling real-world problems, rather than on the tools used to solve these problems. This approach has been shown to be effective for a diverse array of student learners, and to increase retention in the discipline. When we develop data wrangling solutions on these data sets, we illustrate good software engineering principles, guiding the reader through the process of developing incrementally, testing their code as they go, and error handling. We include a large number of exercises where students can sharpen their skills.
Web Resources

Accompanying this book is a website supported by the authors, containing files and supplementary content that will aid the reader: http://datasystems.denison.edu.

This website hosts data files used in Parts I and II of the book and is used to illustrate HTTP, HTML, and web scraping in Part III. In addition, the authors have taught many iterations of a course using this book and have curated a repository of hundreds of exercises, reading questions, and hands-on activities engaging students in the material. These resources are contained within Jupyter notebooks that use the freely available nbgrader system, so that worksheets may be automatically graded. The repository also contains multiple in-depth projects guiding students to work with real-world data sets, normalize the data into the relevant data model, generate interesting questions, and answer these questions with visualizations. Via these projects, students can create a portfolio to showcase what they have learned to potential future employers and graduate programs. We host several sample projects on the book website, and we intend to add more as we continue to teach courses using this book. In this way, projects can be updated if real-world data sources change where they host data, what data they make available, or the form in which the data is provided.

To Students

This text was written with both concrete and abstract goals in mind. A large part of our motivation was to serve “data-focused” students and provide a bridge for you to take the learning from the classroom and use these skills in concrete, real-world settings.
The focus of the book is giving you the skills you need to acquire data from a multitude of sources, mutate it into a form suitable for analysis, and access it programmatically to answer interesting questions. This includes data stored in files, on database servers, provided by Application Programming Interfaces (APIs), and data obtained via web scraping. The projects (hosted on the book web page) will guide you to applying your new skills in the real world: after learning about data models, you will be able to use that knowledge to work with real-world data sets, normalize the data into the model, and then generate interesting questions and visualizations to help answer those questions. Similarly, after learning about obtaining data over the Internet, REST APIs, and authorization for protected resources, the API projects allow you to bring it all together, and acquire data from providers such as LinkedIn, Reddit, various Google APIs, Lyft, and many others. This will entail using the material from the entire arc of the book.

We surveyed students to see how the material in this book benefited them after the course was over. The vast majority of respondents reported using material from some or all of the chapters of this book in their subsequent jobs, internships, and research projects. Furthermore, if done correctly, the projects you complete can become part of a portfolio that you show to potential employers, which, coupled with your knowledge of the terms and concepts covered in this book, can help you secure the kind of data-focused job you are interested in.

Equally important to these concrete learning goals are our abstract goals: to sharpen your analytic thinking, problem solving, coding, writing, and technical reading skills. For each technique in this book, we first describe the approach in general terms, then carefully work through multiple examples, and finally provide numerous exercises (building on the examples) for you to achieve mastery.
In our examples, we model an incremental approach, where we develop a partial solution, test it, and then develop a bit more, repeating this process until we have a general solution. We encourage you to follow the same steps when you solve exercises, including the use of try/except blocks and assert statements to ensure that your code behaves as expected.

This book is written to only assume you have prior exposure to computer science principles (and Python) at the level of an introductory course. Normally, the material in this book is spread across several electives that only junior and senior computer science majors take. By condensing the material, and taking an elementary approach to it, we aim to give you the concrete and abstract skills early in your college career. However, the trade-off is that this book contains a great deal that will probably be new to you. We do not expect that you will be an expert at reading computer science books. Some parts may be difficult to understand or may require you to read multiple times. As with anything you read, if you come across a term that is new to you, we encourage you to look that term up and understand it before proceeding. It may be that the term was defined earlier in the book, in which case the index at the end of the book can tell you where the term was first defined.

Most sections begin with an abstract approach and introduce examples later. It may make sense to peek ahead at the examples if you struggle with the abstract part, or to reread it after finishing the chapter to better structure what you have learned. To help you identify the most important parts of each section, and to guide you through the types of activities that will help you to understand the material (e.g., relating the abstract concepts to experiences you may have already had in the real world), we include reading questions at the end of each section. These may be assigned by your instructor, to make sure you attempt the reading before class.
Even if they are not assigned, we recommend working through the questions as you read each section, as they will often clarify the meaning of new terms introduced, highlight potential pitfalls, and emphasize which pieces of the reading are most essential. You can and should use Python when answering reading questions that reference code, modules, and methods.

Data systems is a rapidly evolving field, and this book is the first of its kind. Previously, students who wanted to learn this material would need to do so by reading online tutorials that often treated data science tools as a black box. By emphasizing the concepts and programming underlying the tools, we aim to give you a deeper level of understanding. This understanding, coupled with the technical
