June 30, 2015

Tutorial Schedule

Activity                                     Time
Tessera overview (Ryan)                      13:30 - 14:00
Get started with housing example (Amanda)    14:00 - 14:30
Break                                        10 min
Housing example continued (Amanda)           14:40 - 15:30
Break                                        10 min
Taxi example (Stephen)                       15:40 - 16:30

Installation Check

Deep Analysis of Large, Complex Data

  • Data most often do not come with a model
  • If we already know (or think we know) the algorithm or model and simply apply it to the data and do nothing else, we are not doing analysis; we are processing
  • Deep analysis means
    • detailed, comprehensive analysis that does not lose important information in the data
    • learning from the data, not forcing our preconceptions on the data
    • being willing and able to use any of the 1000s of statistical, machine learning, and visualization methods as dictated by the data
    • trial and error, an iterative process of hypothesizing, fitting, validating, learning
    • a lot of visualization

Deep Analysis of Large, Complex Data

Large, complex data have any or all of the following:

  • Large number of records
  • Many variables
  • Complex data structures not readily put into tabular form of cases by variables
  • Intricate patterns and dependencies that require complex models and methods of analysis
  • Departures from the simple assumptions made by many algorithms

The Goal of Tessera

Provide an environment that allows us to do the following with large complex data:

  • Work completely in R
  • Have access to R's 1000s of statistical, ML, and vis methods, ideally with no need to rewrite them as scalable versions
  • Be able to apply any ad-hoc R code
  • Minimize time thinking about code or distributed systems
  • Maximize time thinking about the data
  • Be able to analyze it with nearly as much flexibility and ease as small data

Tessera Packages

Users interact primarily with two R packages:

  • datadr: data analysis R package implementing the Divide & Recombine (D&R) paradigm, which lets data scientists leverage parallel data and processing back ends such as Hadoop and Spark through a simple, consistent interface
  • Trelliscope: visualization package that enables flexible, detailed, scalable visualization of large, complex data

Back End Agnostic

Interface stays the same regardless of back end
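
One way to picture a back-end-agnostic interface: the analysis is written once against a generic "apply this function to every subset" step, and the back end only decides how that step is executed. The sketch below uses base R and the parallel package that ships with R. The names backends, run_analysis, trips, and mean_fare are hypothetical and made up for illustration; this is not the datadr interface, and no Hadoop or Spark back end is involved.

```r
library(parallel)  # ships with R; provides mclapply()

# Two interchangeable "back ends" sharing one signature: (subsets, fn).
backends <- list(
  serial = function(subsets, fn) lapply(subsets, fn),
  # mc.cores = 1 keeps this portable (forking is unavailable on Windows);
  # raise it on Unix-alikes to actually run in parallel.
  multicore = function(subsets, fn) mclapply(subsets, fn, mc.cores = 1L)
)

# The analysis is written once and never mentions a specific back end.
run_analysis <- function(data, by, fn, backend) {
  subsets <- split(data, data[[by]])
  unlist(backend(subsets, fn))
}

# Toy data (hypothetical values, for illustration only).
trips <- data.frame(
  zone = c("north", "north", "south", "south", "south"),
  fare = c(10, 14, 7, 9, 8)
)
mean_fare <- function(d) mean(d$fare)

res_serial    <- run_analysis(trips, "zone", mean_fare, backends$serial)
res_multicore <- run_analysis(trips, "zone", mean_fare, backends$multicore)

# Same call, same answer, different execution engine.
stopifnot(identical(res_serial, res_multicore))
print(res_serial)
```

Only the last argument changes between the two calls; the analysis function and the way the data are split stay untouched.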

Tessera Fundamentals: D&R
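
Divide & Recombine in miniature: the data are divided into subsets, the same analytic method is applied to each subset independently, and the per-subset results are recombined. The sketch below uses only base R so that it runs without any back end. The housing data frame, its values, and the helper fit_subset are hypothetical and made up for illustration; datadr expresses the same steps through its own functions, such as divide() and recombine().

```r
# Toy data: hypothetical monthly prices for three counties (illustrative only).
housing <- data.frame(
  county = rep(c("A", "B", "C"), each = 4),
  month  = rep(1:4, times = 3),
  price  = c(100, 102, 104, 106,   # county A: rises by 2 per month
             200, 199, 198, 197,   # county B: falls by 1 per month
             150, 150, 150, 150)   # county C: flat
)

# Divide: one subset per county.
by_county <- split(housing, housing$county)

# Apply: fit the same simple model to every subset independently.
fit_subset <- function(d) {
  fit <- lm(price ~ month, data = d)
  data.frame(
    county = as.character(d$county[1]),
    slope  = unname(coef(fit)["month"]),
    stringsAsFactors = FALSE
  )
}
fits <- lapply(by_county, fit_subset)

# Recombine: stack the per-subset results into one data frame.
slopes <- do.call(rbind, fits)
print(slopes)
```

Because each subset is processed independently, the apply step is the part a parallel back end can distribute.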

Tessera Fundamentals: Trelliscope

  • Trelliscope: a viz tool that enables scalable, detailed visualization of large data
  • Data is split into meaningful subsets, and a visualization method is applied to each subset
  • The user can sort and filter plots based on "cognostics" - summary statistics of interest - to explore the data
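
The split / plot / cognostics workflow above can be mimicked in base R: a panel function draws one subset, a cognostics function computes summary statistics for one subset, and the resulting table of cognostics is used to sort and filter which panels to view. This is a minimal sketch under stated assumptions, not the Trelliscope API; the data and the names panel_fn, cog_fn, ranked, and selected are hypothetical.

```r
# Toy data (hypothetical): four measurements for each of four groups.
dat <- data.frame(
  group = rep(c("g1", "g2", "g3", "g4"), each = 4),
  x     = rep(1:4, times = 4),
  y     = c(1, 2, 3, 4,    # g1: slope  1
            4, 4, 4, 4,    # g2: slope  0
            8, 6, 4, 2,    # g3: slope -2
            0, 3, 6, 9)    # g4: slope  3
)
subsets <- split(dat, dat$group)

# Panel function: the plot drawn for one subset.
panel_fn <- function(d) {
  plot(d$x, d$y, type = "b", main = as.character(d$group[1]),
       xlab = "x", ylab = "y")
}

# Cognostics function: summary statistics describing one subset.
cog_fn <- function(d) {
  data.frame(
    group  = as.character(d$group[1]),
    mean_y = mean(d$y),
    slope  = unname(coef(lm(y ~ x, data = d))["x"]),
    stringsAsFactors = FALSE
  )
}
cogs <- do.call(rbind, lapply(subsets, cog_fn))

# Sort panels by steepest trend, then keep only those with a clear trend.
ranked   <- cogs[order(-abs(cogs$slope)), ]
selected <- ranked$group[abs(ranked$slope) > 1.5]

# Draw only the selected panels, in ranked order.
# Guarded so that non-interactive runs do not open a graphics device.
if (interactive()) {
  for (g in selected) panel_fn(subsets[[g]])
}
```

With many subsets, the cognostics table is what makes the display navigable: the viewer looks at the handful of panels the sort and filter bring to the top rather than paging through all of them.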