William S. Cleveland, Haas 222, email@example.com
Knowledge of basic probability and statistics, and mathematics through calculus and linear algebra. No previous knowledge of R, Hadoop, or RHIPE is needed.
Graduate students in university departments where data are analyzed.
Tue/Thu 1:30 - 2:45
Lectures and Labs: SC 277
This course has two components: (1) the Divide and Recombine (D&R) statistical approach to large complex data; (2) the Tessera computational environment, which implements D&R and allows a data analyst to carry out deep analysis of big data. Deep analysis means that the data are analyzed in detail at their finest granularity, and that the analyst can use any of the thousands of methods of statistics, machine learning, and visualization in the analysis.
Tessera has R at the front end. All analyst programming is in R. At the back end is the Hadoop distributed file system (HDFS) and parallel compute engine (MapReduce). Hadoop runs the analyst's R commands to carry out the D&R computations. Tessera software packages merge R and Hadoop, enabling communication between the two, and making programming D&R easy.
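The divide, apply, and recombine steps can be sketched in a few lines of base R. This is only a minimal single-machine illustration of the idea and not Tessera code: it uses no Hadoop and no Tessera packages, and the built-in `mtcars` data set, the choice of `cyl` as the division variable, and the averaging recombination are all assumptions made for the example.

```r
# Divide: split the data into subsets by a conditioning variable.
# (mtcars and the cyl variable are illustrative choices only.)
subsets <- split(mtcars, mtcars$cyl)

# Apply: run the same analytic method on each subset independently.
# Here the method is a simple linear regression of mpg on wt.
fits <- lapply(subsets, function(d) coef(lm(mpg ~ wt, data = d)))

# Recombine: merge the per-subset outputs into one result.
coef_table <- do.call(rbind, fits)    # one row of coefficients per subset
dr_estimate <- colMeans(coef_table)   # simple average across subsets

print(coef_table)
print(dr_estimate)
```

Because each subset is analyzed independently, the apply step is the part that a parallel back end such as Hadoop could distribute across a cluster.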
Students will have access to a Hadoop cluster, provided by the Rosen Center for Advanced Computing, with the Tessera software stack installed. Reading materials and lectures will be provided electronically.
Participants are expected to attend class and successfully carry out the class assignments.
Get R and Login to Hathi - Get Started
R Language: A Living Document - An Introduction to R
R Language: A Living Document - Visualization Methods
Trellis Display - An Introduction to Trellis Display with Lattice Graphics
High Performance Computing for Data Analysis: D&R with Tessera - D&R and Tessera
R Manual - R Manual Written by R Core Team
Introduction of plyr package - plyr package; plyr package (continued)
Chicago Bears PSL Data in Text Format - CSV data
Datasets and Functions that Make Plots in the Book Visualizing Data - R objects