http://training.databricks.com/workshop/datasci.pdf

Data Science Training: Spark

"I want to die on Mars – but not on impact" – Elon Musk, interview with Chris Anderson

ADVANCED: DATA SCIENCE WITH APACHE SPARK

Data Science applications with Apache Spark combine the scalability of Spark with its distributed machine learning algorithms. This material expands on the "Intro to Apache Spark" workshop. Lessons focus on industry use cases for machine learning at scale, coding examples based on public data sets, and leveraging cloud-based notebooks within a team context. Includes limited free accounts on Databricks Cloud.

Prerequisites:
• Intro to Apache Spark workshop or equivalent (e.g., Spark Developer Certificate)
• Experience coding in Scala, Python, SQL
• Have some familiarity with Data Science topics (e.g., business use cases)

Topics covered include:
• Data transformation techniques based on both Spark SQL and functional programming in Scala and Python
• Predictive analytics based on MLlib, clustering with KMeans, building classifiers with a variety of algorithms, and text analytics – all with emphasis on an iterative cycle of feature engineering, modeling, and evaluation (a minimal clustering sketch follows this overview)
• Visualization techniques (matplotlib, ggplot2, D3, etc.) to surface insights
• Understand how primitives like Matrix Factorization are implemented in a distributed parallel framework, from the designers of MLlib
• Several hands-on exercises using datasets such as MovieLens, Titanic, State of the Union speeches, and the RecSys Challenge 2015
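For example, the clustering topic above is built around MLlib's KMeans. The short sketch below uses Spark's RDD-based MLlib API from Python; it is not taken from the workshop notebooks, and the toy feature vectors and appName are made up purely for illustration.

    # Minimal KMeans sketch against the RDD-based MLlib API -- toy data, not a workshop dataset.
    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    # Outside Databricks notebooks you create the context yourself;
    # inside a notebook, `sc` is already provided.
    sc = SparkContext(appName="kmeans-sketch")

    points = sc.parallelize([
        [0.0, 0.0], [1.0, 1.0],    # one cluster near the origin
        [9.0, 8.0], [8.0, 9.0],    # another cluster far away
    ])

    model = KMeans.train(points, k=2, maxIterations=10)
    print(model.clusterCenters)        # learned centroids
    print(model.predict([0.5, 0.5]))   # cluster id for a new point

The same train/predict pattern recurs in the clustering exercise later in the day, applied to a real dataset instead of toy points.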
Agenda
• Detailed agenda in Google doc
• https://docs.google.com/document/d/1T9AkXUmL6gDYTpAEEgqsy9hfJtqGlavjnGzMqzUwDfE/edit

Goals
• Patterns: Data wrangling (Transform, Model & Reason) with Spark
o Use RDDs, Transformations and Actions in the context of a Data Science problem, an algorithm & a dataset (a minimal sketch follows the speaker introductions below)
• Spend time working through MLlib
• Balance between internals & hands-on
o Internals from Reza, the MLlib lead
• ~65% of time on Databricks Cloud & Notebooks
o Take the time to get familiar with the interface & the Data Science cloud
o Make mistakes, experiment, …
• Good time for this course, this version
o Will miss many of the gory details as the framework evolves
• Summarized materials for a 3-day course
o Even if we don't finish the exercises today, that is fine
o Complete the work at home – there are also homework notebooks
o Ask us questions: @ksankar, @pacoid, @reza_zadeh, @mhfalaki, @andykonwinski, @xmeng, @michaelarmbrust, @tathadas

Tutorial Outline:

Morning
o Welcome + Getting Started (Krishna)
o Databricks Cloud mechanics (Andy)
o Ex 0: Pre-Flight Check (Krishna)
o Data Science DevOps – Introduction to Spark (Krishna)
o Ex 1: MLlib – Statistics, Linear Regression (Krishna)
o MLlib Deep Dive – Lecture (Reza)
o Design Philosophy, APIs
o Ex 2: In which we explore Disasters, Trees, Classification & the Kaggle Competition (Krishna)
o Random Forest, Bagging, Data De-correlation
o Deep dive – Leverage parallelism of RDDs, sparse vectors, etc. (Reza)

Afternoon
o Ex 3: Clustering – In which we explore Segmenting Frequent InterGallactic Hoppers (Krishna)
o Ex 4: Recommendation (Krishna)
o Theory: Matrix Factorization, SVD, … (Reza)
o On-line k-means, Spark Streaming (Reza)
o Ex 5: Mood of the Union – Text Analytics (Krishna)
o In which we analyze the mood of the nation from inferences on SOTU addresses by the POTUS (State of the Union Addresses by the President of the US)
o Ex 99: RecSys 2015 Challenge (Krishna)
o Ask Us Anything – Panel

Introducing:
Andy Konwinski @andykonwinski
Hossein Falaki @mhfalaki
Michael Armbrust @michaelarmbrust
Reza Zadeh @Reza_Zadeh
Tathagata Das @tathadas
Paco Nathan @pacoid
Xiangrui Meng @xmeng
Krishna Sankar @ksankar
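As referenced in the goals above, here is a minimal sketch of the RDD transformation/action distinction before the speaker bios. The numbers are toy data, not a workshop dataset: transformations such as filter and map are lazy, and nothing executes on the cluster until an action such as count or collect is called.

    # Toy illustration of lazy transformations vs. actions -- not a workshop exercise.
    from pyspark import SparkContext

    # Outside Databricks notebooks you create the context yourself;
    # inside a notebook, `sc` is already provided.
    sc = SparkContext(appName="rdd-sketch")

    ages = sc.parallelize([12, 25, 31, 47, 58, 63, 8])

    adults  = ages.filter(lambda a: a >= 18)        # transformation: lazy, nothing runs yet
    decades = adults.map(lambda a: (a // 10) * 10)  # transformation: still lazy

    print(decades.count())    # action: triggers the computation
    print(decades.collect())  # action: brings the results back to the driver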
About Me
o Chief Data Scientist at BlackArrow.tv
o Have been speaking at OSCON, PyCon, PyData, Strata, et al
o Reviewer of "Machine Learning with Spark"
o Picked up co-authorship of the Second Edition of "Fast Data Processing with Spark"
o Have done lots of things:
• Big Data (Retail, Bioinformatics, Financial, AdTech, …)
• Written books (Web 2.0, Wireless, Java, …)
• Standards (Web Services, Cloud), some work in AI
• Guest Lecturer at Naval PG School, …
• Planning Masters in Computational Finance or Statistics
• Volunteer as Robotics Judge at First Lego League World Competitions
o @ksankar, doubleclix.wordpress.com, [email protected]

Pre-requisites
① Register & download data from Kaggle. We cannot distribute Kaggle data; moreover, you need an account to submit entries.
a) Set up an account on Kaggle (www.kaggle.com)
b) We will be using the data from the competition "Titanic: Machine Learning from Disaster"
c) Download the data from http://www.kaggle.com/c/titanic-gettingStarted
② Register for the RecSys 2015 Challenge
a) http://2015.recsyschallenge.com/

9:00 Welcome + Getting Started

Getting Started: Step 1
Everyone will receive a username/password for one of the Databricks Cloud shards. Use your laptop and browser to log in there. We find that cloud-based notebooks are a simple way to get started using Apache Spark – as the motto "Making Big Data Simple" states. Please create and run a variety of notebooks on your account throughout the tutorial (a sample first cell against the Titanic data appears below). These accounts will remain open long enough for you to export your work.
See the product page or FAQ for more details, or contact Databricks to register for a trial account.
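Once logged in to your shard, a first notebook cell against the Titanic data from the pre-requisites might look like the sketch below. The DBFS path and the column position are assumptions for illustration only; they depend on where you upload Kaggle's train.csv.

    # A first notebook cell against the Titanic data -- a sketch, not a workshop notebook.
    # The DBFS path below is hypothetical; it depends on where you upload Kaggle's train.csv.
    raw    = sc.textFile("/FileStore/titanic/train.csv")   # `sc` is pre-defined in Databricks notebooks
    header = raw.first()
    rows   = raw.filter(lambda line: line != header)

    # Kaggle's train.csv columns begin: PassengerId,Survived,Pclass,Name,...
    # A naive split is safe for column index 1 (Survived) because the quoted,
    # comma-containing Name field only appears later in each line.
    survived = rows.map(lambda line: int(line.split(",")[1]))

    print("passengers: %d  survival rate: %.3f" % (survived.count(), survived.mean()))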
Description: