STAT 695V, CRN 13281, Spring 2016

Course Information

Instructor
William S. Cleveland, Haas 222, wsc@purdue.edu

Prerequisites
Knowledge of basic probability and statistics, and mathematics through calculus and linear algebra. No previous knowledge of R, Hadoop, or RHIPE is needed.

Primary Audience
Graduate students in university departments where data are analyzed.

Credits
3

Time
Tuesday and Thursday, 1:30 - 2:45

Location
Lectures and Labs: SC 277

Description
This course has two components: (1) the Divide and Recombine (D&R) statistical approach to large complex data; (2) the Tessera computational environment that implements D&R, allowing a data analyst to carry out deep analysis of big data. Deep analysis means that the data are analyzed in detail at their finest granularity, and the analyst has access to any of the thousands of methods of statistics, machine learning, and visualization for use in the analysis.

Tessera has R at the front end. All analyst programming is in R. At the back end is the Hadoop distributed file system (HDFS) and parallel compute engine (MapReduce). Hadoop runs the analyst's R commands to carry out the D&R computations. Tessera software packages merge R and Hadoop, enabling communication between the two, and making programming D&R easy.
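As a rough illustration of the D&R idea, the sketch below divides a small data set into subsets, applies a method to each subset independently, and recombines the subset outputs into one overall result. It is written in plain Python only so that it is self-contained; it is not Tessera code, and in the course itself all analyst programming is in R. The function names (`divide`, `apply_method`, `recombine`), the toy records, and the choice of a size-weighted mean as the recombination are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import fmean


def divide(records, key):
    """Divide: partition the records into subsets using a key function."""
    subsets = defaultdict(list)
    for rec in records:
        subsets[key(rec)].append(rec)
    return dict(subsets)


def apply_method(subset):
    """Apply: compute a summary of one subset (its size and mean value)."""
    values = [value for _, value in subset]
    return {"n": len(values), "mean": fmean(values)}


def recombine(outputs):
    """Recombine: merge the subset outputs into one overall estimate,
    here a size-weighted average of the subset means."""
    total = sum(out["n"] for out in outputs.values())
    return sum(out["n"] * out["mean"] for out in outputs.values()) / total


# Hypothetical toy data: (subset label, measurement) pairs.
records = [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 20.0), ("b", 30.0)]

subsets = divide(records, key=lambda rec: rec[0])
# Each subset is processed independently of the others, which is the
# property that lets a cluster run this step in parallel.
outputs = {label: apply_method(subset) for label, subset in subsets.items()}
overall = recombine(outputs)
```

Here the per-subset step runs sequentially in one process. The point of the sketch is only that the subsets do not depend on one another, so the same computation can be distributed across a cluster.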

Students will have access to a Hadoop cluster provided by the Rosen Center for Advanced Computing, with the Tessera software stack installed. Reading materials and lectures will be provided electronically.

Participant Responsibilities
Participants are expected to attend class and successfully carry out the class assignments.

Course Lectures: Living Documents

Get R and Login to Hathi - Get Started

R Language: A Living Document - An Introduction to R

R Language: A Living Document - Visualization Methods

Trellis Display - An Introduction to Trellis Display with Lattice Graphics

High Performance Computing for Data Analysis: D&R with Tessera - D&R and Tessera

Reading

R Manual - R Manual Written by R Core Team

Introduction of plyr package - plyr package - plyr package (continued)

Data

Chicago Bears PSL Data in Text Format - CSV data

Datasets and Functions that Make Plots in the Book Visualizing Data - R objects