** Instructor **

William S. Cleveland, Haas 222, wsc@purdue.edu

** Prerequisites **

Knowledge of basic probability and statistics, and mathematics through
calculus and linear algebra. No previous knowledge of R, Hadoop,
or RHIPE is needed.

** Primary Audience**

Graduate students in university departments where data are analyzed.

** Credits**

3

** Time **

Tue/Thu 1:30 - 2:45

** Location**

Lectures and Labs: SC 277

** Description**

This course has two components: (1) The Divide and Recombine (D&R)
statistical approach to large complex data; (2) The Tessera computational
environment that implements D&R, allowing a data analyst to carry out deep
analysis of big data using D&R. Deep analysis means that the data are
analyzed in detail at their finest granularity, and the analyst has access to
any of the thousands of methods of statistics, machine learning, and visualization
for use in the analysis.

Tessera has R at the front end, and all analyst programming is in R. At the back end are the Hadoop distributed file system (HDFS) and the Hadoop parallel compute engine (MapReduce). Hadoop runs the analyst's R commands to carry out the D&R computations. The Tessera software packages merge R and Hadoop, enabling communication between the two and making D&R programming easy.
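
The D&R idea can be illustrated with a minimal sketch: divide the data into subsets, apply an analytic method to each subset, and recombine the subset outputs. In the course all of this is programmed in R through Tessera; the plain Python below is not the Tessera API and is meant only to show the pattern. The function names (`divide`, `apply_method`, `recombine`), the field names `site` and `value`, and the tiny dataset are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def divide(records, key):
    """Divide: group the records into subsets by the value of `key`."""
    subsets = defaultdict(list)
    for rec in records:
        subsets[rec[key]].append(rec)
    return dict(subsets)


def apply_method(subset):
    """Apply an analytic method to one subset (here: count and mean of `value`)."""
    values = [rec["value"] for rec in subset]
    return {"n": len(values), "mean": mean(values)}


def recombine(outputs):
    """Recombine: combine subset outputs into one result (a count-weighted mean)."""
    total = sum(out["n"] for out in outputs.values())
    return sum(out["n"] * out["mean"] for out in outputs.values()) / total


# Made-up data for illustration only.
records = [
    {"site": "A", "value": 1.0},
    {"site": "A", "value": 3.0},
    {"site": "B", "value": 10.0},
]

subsets = divide(records, "site")
# Each subset is processed independently, so this step could run in parallel.
outputs = {name: apply_method(subset) for name, subset in subsets.items()}
overall = recombine(outputs)
```

Because each subset is analyzed on its own, the middle step is the part a parallel engine such as MapReduce can spread across a cluster.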

Students will have access to a Hadoop cluster, provided by the Rosen Center for Advanced Computing, with the Tessera software stack installed. Reading materials and lectures will be provided electronically.

**Participant Responsibilities**

Participants are expected to attend class and successfully carry out the class
assignments.

** Get R and Log in to Hathi**
Get Started

** R Language: A Living Document**
An Introduction to R

** R Language: A Living Document**
Visualization Methods

** Trellis Display**
An Introduction to Trellis Display with Lattice Graphics

** High Performance Computing for Data Analysis: D&R with Tessera**
D&R and Tessera

** R Manual **
R Manual Written by R Core Team

** Introduction to the plyr package **
plyr package
plyr package (continued)

** Chicago Bears PSL Data in Text Format **
CSV data

** Datasets and Functions that Make Plots in the Book Visualizing Data**
R objects