Tutorial on Large Scale Distributed Data Science from Scratch with Apache Spark 2.0 & Deep Learning *

Full-day hands-on tutorial at CIKM 2017, Wednesday 8 November 2017

Dr. James G. Shanahan

CEO and Chief Scientist at Church and Duncan Group and UC Berkeley

Dr. James G. Shanahan has spent the past 25 years developing and researching cutting-edge artificial intelligent systems. He has (co) founded several companies including: Church and Duncan Group Inc. (2007), a boutique consultancy in large scale AI which he runs in San Francisco; RTBFast (2012), a real-time bidding engine infrastructure play for digital advertising systems; and Document Souls (1999), a document-centric anticipatory information system. In 2012 he went in-house as the SVP of Data Science and Chief Scientist at NativeX, a mobile ad network that got acquired by MobVista in early 2016. In addition, he has held appointments at AT&T (Executive Director of Research), Turn Inc. (founding chief scientist), Xerox Research, Mitsubishi Research, and at Clairvoyance Corp (a spinoff research lab from CMU). Dr. Shanahan has been affiliated with the University of California at Berkeley (and Santa Cruz) since 2008 where he teaches graduate courses on big data analytics, large-scale machine learning, and stochastic optimization. He also advises several high-tech startups (including Quixey, Aylien, VoxEdu, and others) and is executive VP of science and technology at Irish Innovation Center (IIC). He has published six books, more than 50 research publications, and over 20 patents in the areas of machine learning and information processing. Dr. Shanahan received his PhD in engineering mathematics from the University of Bristol, U. K., and holds a Bachelor of Science degree from the University of Limerick, Ireland. He is a EU Marie Curie fellow. In 2011 he was selected as a member of the Silicon Valley 50 (Top 50 Irish Americans in Technology).

Liang Dai

NativeX, University of California Santa Cruz

Liang Dai is a Ph.D. candidate in Technology Information and Management department, UC Santa Cruz. There he does research in data mining on digital marketing, including campaign evaluation, online experiment design, customer value improvement, etc. Liang received the B.S. and the M.S. from Information Science and Electronic Engineering department, Zhejiang University, China. Liang is also working as a applied research scientist in Facebook, focusing on data modeling for ads product. He has hands-on experience on end to end large scale data mining projects in distributed platform, e.g. AWS, Hadoop, Spark, etc.

What is this tutorial about?

In the continuing big data revolution, Apache Spark’s open-source cluster computing framework has overtaken Hadoop MapReduce as the big data processing engine of choice. Spark maintains MapReduce’s linear scalability and fault tolerance, but offers two key advantages: Spark is much faster – as much as 100x faster for certain applications; and Spark is much easier to program, due to its inclusion of APIs for Python, Java, Scala, SQL and R, plus its user-friendly core data abstraction, the distributed data frame. In addition, Spark goes far beyond traditional batch applications to support a variety of compute-intensive tasks, including interactive queries, streaming data, machine learning, and graph processing.

This tutorial offers you an accessible introduction to large-scale distributed machine learning and data mining, and to Spark and its potential to revolutionize academic and commercial data science practices. The tutorial includes discussions of algorithm design, presentation of illustrative algorithms, relevant case studies, and practical advice and experience in writing Spark programs and running Spark clusters. Part I familiarizes you with fundamental Spark concepts, including Spark Core, functional programming a la MapReduce, RDDs/data frames/datasets, the Spark Shell, Spark Streaming and online learning, Spark SQL, MLlib, and more. Part 2 gives you hands-on algorithmic design and development experience with Spark, including building algorithms from scratch such as decision tree learning, association rule mining (aPriori), graph processing algorithms such as PageRank and shortest path, gradient descent algorithms such as support vector machines and matrix factorization, distributed parameter estimation, and deep learning. Your homegrown implementations will shed light on the internals of Spark’s MLlib libraries and on typical challenges in parallelizing machine learning algorithms. You will see examples of industrial applications and deployments of Spark.

Keywords. Distributed systems, HDFS, Spark, Hadoop, large-scale distributed machine learning, online learning, deep learning, Spark Streaming, mobile advertising

Materials. You will receive electronic handouts and a web-based iPython Jupyter notebook with example code and data. You will deploy Spark on your own multicore laptop to run and develop examples there. So that you can rapidly provision remote Spark clusters on the fly, we plan to work with Amazon Web Services to provide you with free access to Amazon's Elastic Compute Cloud (EC2) during the tutorial.

Who should attend?

Industry practitioners and researchers who wish to learn the best practices for large scale data science using next generation tools. No prior familiarity with Spark, distributed systems, how to distribute algorithms, or large-scale machine learning is required.

What will I get out of this tutorial?

  • Gain an integrated view of the data processing pipeline as the tutorial highlights its components, including exploratory data analysis, feature extraction, supervised learning, and model evaluation.

  • Introduced to the underlying statistical and algorithmic principles required to develop scalable machine learning pipelines, and gain hands-on experience in applying these principles using Apache Spark, a cluster computing system well suited for large-scale machine learning tasks.

  • Learn how to implement scalable algorithms for fundamental statistical models (linear regression, logistic regression, matrix factorization, principal component analysis) while tackling key data science problems from various domains: mobile advertising, personalized recommendation, and consumer segmentation.

  • Learn about data intensive industrial applications and deployments of Spark in fields such as mobile advertising. You will leave with an understanding of scalability challenges and the tradeoffs associated with distributed processing of large datasets.

Detailed Outline

Spark introduction
History of Spark
Introduction to data analysis with Spark
Downloading Spark and getting started on your laptop
Parallel computing
Divide and conquer, semaphores, barriers, shared nothing architectures
Core Spark
Spark basics, Functional programming, Transformations and actions, MapReduce patterns, Data frames and datasets, RDD (no keys) and Pair RDDs, PySpark, Scala and Spark Shell, Broadcast variables
Spark APIs
Java, Scala, Python, R, SQL
Data analysis and handling with Spark
Tools for exploratory data analysis, Standardization, Reservoir sampling, SparkSQL and joins, Statistics in Spark
Algorithms and programming in Spark
Algorithmic design and development with Spark

Developing algorithms from scratch
  • Decision tree learning
  • Naive Bayes

Association rule mining
  • aPriori algorithm

Graph processing algorithms
  • PageRank
  • Shortest path
  • Friend of friends
  • TextRank

Unsupervised algorithms
  • Expectation maximization

Gradient descent algorithms
  • Support vector machines
  • Matrix factorization

Deep Learning
  • Word2Vec/Glove, multilayer perceptrons, recurrent neural networks, convolutional neural networks, BackProp, deep learning with Apache Spark and TensorFlow
Spark at scale
Spark on your laptop versus Spark on an EC2 cluster
Spark libraries
SparkSQL, MLlib, GraphX, Spark Streaming
Online learning
Online learning for classification and for regression via gradient descent
Spark deployments and case studies
Mobile advertising, Recommendation engines, Security
Spark 2.0 and beyond
Desiderata for large scale data processing environments