June 30, 2015

Tutorial Schedule

Activity                                   Time
Tessera overview (Ryan)                    13:30 - 14:00
Get started with housing example (Amanda)  14:00 - 14:30
Break                                      10 min
Housing example continued (Amanda)         14:40 - 15:30
Break                                      10 min
Taxi example (Stephen)                     15:40 - 16:30

Installation Check

Deep Analysis of Large, Complex Data

  • Data most often do not come with a model
  • If we already (think we) know the algorithm or model to apply, and we simply apply it to the data and do nothing else, we are not doing analysis; we are doing processing
  • Deep analysis means
    • detailed, comprehensive analysis that does not lose important information in the data
    • learning from the data, not forcing our preconceptions on the data
    • being willing and able to use any of the 1000s of statistical, machine learning, and visualization methods as dictated by the data
    • trial and error, an iterative process of hypothesizing, fitting, validating, learning
    • a lot of visualization

Deep Analysis of Large, Complex Data

Large, complex data has any or all of the following:

  • Large number of records
  • Many variables
  • Complex data structures not readily put into tabular form of cases by variables
  • Intricate patterns and dependencies that require complex models and methods of analysis
  • Does not conform to simple assumptions made by many algorithms

The Goal of Tessera

Provide an environment that allows us to do the following with large complex data:

  • Work completely in R
  • Have access to R's 1000s of statistical, ML, and vis methods, ideally with no need to rewrite scalable versions
  • Be able to apply any ad-hoc R code
  • Minimize time thinking about code or distributed systems
  • Maximize time thinking about the data
  • Be able to analyze it with nearly as much flexibility and ease as small data

Tessera Packages

Users interact primarily with two R packages:

  • datadr: data analysis R package implementing the Divide & Recombine paradigm, which lets data scientists leverage parallel data and processing back-ends such as Hadoop and Spark through a simple, consistent interface
  • Trelliscope: visualization package that enables flexible detailed scalable visualization of large, complex data

Back End Agnostic

Interface stays the same regardless of back end
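
For example, a minimal sketch (using the housingDdf object created later in this tutorial; the output paths are hypothetical): only the output connection changes when the same divide call moves between back ends.

# in memory
byCounty <- divide(housingDdf, by = c("county", "state"))

# local disk
byCounty <- divide(housingDdf, by = c("county", "state"),
    output = localDiskConn("/tmp/byCounty", autoYes = TRUE))

# HDFS
byCounty <- divide(housingDdf, by = c("county", "state"),
    output = hdfsConn("/tmp/byCounty"))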

Tessera Fundamentals: D&R

Tessera Fundamentals: Trelliscope

  • Trelliscope: a viz tool that enables scalable, detailed visualization of large data
  • Data is split into meaningful subsets, and a visualization method is applied to each subset
  • The user can sort and filter plots based on "cognostics" - summary statistics of interest - to explore the data

The Current Tessera Distributed Computing Stack

  • trelliscope: visualization of subsets of data, web interface powered by Shiny http://shiny.rstudio.com
  • datadr: interface for divide and recombine operations
  • RHIPE: The R and Hadoop Integrated Programming Environment
  • Hadoop: Framework for managing data and computation distributed across multiple hard drives in a cluster
  • HDFS: Hadoop Distributed File System

More Information

Introduction to datadr

Installing the Tessera packages

install.packages("devtools") # if not installed
library(devtools)
install_github("tesseradata/datadr")
install_github("tesseradata/trelliscope")
install_github("hafen/housingData") # demo data

Housing Data

  • Housing sales and listing data in the United States
  • Between 2008-10-01 and 2014-03-01
  • Aggregated to the county level
  • Zillow.com data provided by Quandl (https://www.quandl.com/c/housing)

Housing Data Variables

Variable          Description
fips              Federal Information Processing Standard, a 5-digit county code
county            US county name
state             US state name
time              date (the data is aggregated monthly)
nSold             number sold this month
medListPriceSqft  median list price per square foot
medSoldPriceSqft  median sold price per square foot

datadr data representation

  • Fundamentally, all data types are stored in a back-end as key/value pairs
  • Data type abstractions on top of the key/value pairs
    • Distributed data frame (ddf):
      • A data frame that is split into chunks
      • Each chunk contains a subset of the rows of the data frame
      • Each subset may be distributed across the nodes of a cluster
    • Distributed data object (ddo):
      • Similar to distributed data frame
      • Except each chunk can be an object with any structure
      • Every distributed data frame is also a distributed data object
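
A minimal sketch of the distinction, using the housing data frame introduced below (the key/value list is made up for illustration):

# a ddf: every chunk is a data.frame holding a subset of rows
housingDdf <- ddf(housing)

# a ddo: chunks can be arbitrary R objects
kvList <- list(list("key1", letters), list("key2", rnorm(10)))
myDdo <- ddo(kvList)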

Back-ends

datadr data back-end options:

  • In memory
  • Local disk
  • HDFS
  • Spark (under development)

Data ingest

# similar to read.table function:
my.data <- drRead.table(
  hdfsConn("/home/me/dir/datafile.txt"),
  header = TRUE, sep = "\t"
)

# similar to read.csv function:
my.data2 <- drRead.csv(
    localDiskConn("c:/my/local/data.csv"))

# convert in memory data.frame to ddf:
my.data3 <- ddf(some.data.frame)

You try it

# Load necessary libraries
library(datadr)
library(trelliscope)
library(housingData)

# housing data frame is in the housingData package
housingDdf <- ddf(housing)

Division

  • A common thing to do is to divide a dataset based on the value of one or more variables
  • Another option is to divide data into random replicates
    • Use random replicates to estimate a GLM fit by applying GLM to each replicate subset and taking the mean coefficients
    • Random replicates can be used for other all-data model fitting approaches like bag of little bootstraps, consensus MCMC, etc. (see the sketch below)
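
As a concrete sketch, datadr provides rrDiv for random replicate division (the subset size of 500 rows is an arbitrary choice):

# divide into random replicate subsets of roughly 500 rows each
byRandom <- divide(housingDdf, by = rrDiv(500), update = TRUE)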

Divide example

Divide the housing data set by the variables "county" and "state"

(This kind of data division is very similar to the functionality provided by the plyr package)

byCounty <- divide(housingDdf,
    by = c("county", "state"), update = TRUE)

Divide example

byCounty
##
## Distributed data frame backed by 'kvMemory' connection
##
##  attribute      | value
## ----------------+-----------------------------------------------------------
##  names          | fips(cha), time(Dat), nSold(num), and 2 more
##  nrow           | 224369
##  size (stored)  | 15.73 MB
##  size (object)  | 15.73 MB
##  # subsets      | 2883
##
## * Other attributes: getKeys(), splitSizeDistn(), splitRowDistn(), summary()
## * Conditioning variables: county, state

Exercise: try the divide function

Now try using the divide statement to divide on one or more variables

Possible solutions

byState <- divide(housingDdf, by = "state", update = TRUE)

byMonth <- divide(housingDdf, by = "time", update = TRUE)

Exploring the ddf data object

Data divisions can be accessed by index or by key name

byCounty[[1]]
## $key
## [1] "county=Abbeville County|state=SC"
##
## $value
##    fips       time nSold medListPriceSqft medSoldPriceSqft
## 1 45001 2008-10-01    NA         73.06226               NA
## 2 45001 2008-11-01    NA         70.71429               NA
## 3 45001 2008-12-01    NA         70.71429               NA
## 4 45001 2009-01-01    NA         73.43750               NA
## 5 45001 2009-02-01    NA         78.69565               NA
## ...
byCounty[["county=Benton County|state=WA"]]

Exploring the ddf data object

Participants: try these functions on your own

  • summary(byCounty)
  • names(byCounty)
  • length(byCounty)
  • getKeys(byCounty)

Break

Transformations

  • The addTransform function applies a function to each key/value pair in a ddf
    • E.g. to calculate a summary statistic
  • The transformation is not applied immediately; it is deferred until:
    • A function that kicks off a map/reduce job is called (e.g. recombine)
    • A subset of the data is requested (e.g. byCounty[[1]])
    • drPersist function explicitly forces transformation computation
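
A minimal sketch of this deferred behavior (the summary transform is made up for illustration):

# the transform is only recorded here, not yet computed
nSoldXform <- addTransform(byCounty, function(x) sum(x$nSold, na.rm = TRUE))

# requesting one subset triggers computation for that subset only
nSoldXform[[1]]

# drPersist computes and stores the transformed values for all subsets
nSoldXform <- drPersist(nSoldXform)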

Transformation example

# Function to calculate a linear model and extract
# the slope parameter
lmCoef <- function(x) {
   coef(lm(medListPriceSqft ~ time, data = x))[2]
}

# Best practice tip: test transformation
# function on one division
lmCoef(byCounty[[1]]$value)
##          time
## -0.0002323686
# Apply the transform function to the ddf
byCountySlope <- addTransform(byCounty, lmCoef)

Transformation example

byCountySlope[[1]]
## $key
## [1] "county=Abbeville County|state=SC"
##
## $value
##          time
## -0.0002323686

Exercise: create a transformation function

  • Try creating your own transformation function
  • Hint: the input to your function will be one value from a key/value pair (e.g. byCounty[[1]]$value)
transformFn <- function(x) {
    ## you fill in here
}

# test:
transformFn(byCounty[[1]]$value)

# apply:
xformedData <- addTransform(byCounty, transformFn)

Possible solutions

# example 1
totalSold <- function(x) {
   sum(x$nSold, na.rm=TRUE)
}
byCountySold <- addTransform(byCounty, totalSold)

# example 2
timeRange <- function(x) {
   range(x$time)
}
byCountyTime <- addTransform(byCounty, timeRange)

Recombination

  • Combine transformation results together again
  • Example
countySlopes <- recombine(byCountySlope,
    combine=combRbind)
head(countySlopes)
##                 county state           val
## time  Abbeville County    SC -0.0002323686
## time1    Acadia Parish    LA  0.0019518441
## time2  Accomack County    VA -0.0092717711
## time3       Ada County    ID -0.0030197554
## time4     Adair County    IA -0.0308381951
## time5     Adair County    KY  0.0034399585

Recombination options

The combine parameter controls the form of the result:

  • combine=combRbind: rbind is used to combine results into a data.frame; this is the most frequently used option
  • combine=combCollect: results are collected into a list
  • combine=combDdo: results are combined into a ddo object
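
For example, a minimal sketch using the byCountySlope transform from earlier:

# collect results into a local list
slopeList <- recombine(byCountySlope, combine = combCollect)

# keep results as a distributed data object
slopeDdo <- recombine(byCountySlope, combine = combDdo)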

Exercise: try the recombine function

  • Apply recombine to the data with your custom transformation
  • Hint: combine=combRbind is probably the simplest option

Exercise: divide

Divide two new datasets geoCounty and wikiCounty by county and state

# look at the data first
head(geoCounty)
head(wikiCounty)

# use divide function on each

Solution

geoByCounty <- divide(geoCounty,
    by=c("county", "state"))

wikiByCounty <- divide(wikiCounty,
    by=c("county", "state"))

Data operations: drJoin

Join together multiple data objects based on key

joinedData <- drJoin(housing=byCounty,
    slope=byCountySlope,
    geo=geoByCounty,
    wiki=wikiByCounty)

Distributed data objects vs distributed data frames

  • In a ddf the value in each key/value is always a data.frame
  • A ddo can accommodate values that are not data.frames
class(joinedData)
## [1] "ddo"      "kvMemory"

Distributed data objects vs distributed data frames

joinedData[[176]]
## $key
## [1] "county=Benton County|state=WA"
##
## $value
## $housing
##     fips       time nSold medListPriceSqft medSoldPriceSqft
## 1  53005 2008-10-01   137         106.6351         106.2179
## 2  53005 2008-11-01    80         106.9650               NA
## 3  53005 2008-11-01    NA               NA         105.2370
## 4  53005 2008-12-01    95         107.6642         105.6311
## 5  53005 2009-01-01    73         107.6868         105.8892
## ...
##
## $slope
##        time
## 0.007726753
##
## $geo
##    fips       lon      lat  rMapState rMapCounty
## 1 53005 -119.5155 46.23413 washington     benton
##
## $wiki
##    fips pop2013                                                   href
## 1 53005  184486 http://en.wikipedia.org/wiki/Benton_County,_Washington

Data operations: drFilter

Filter a ddf or ddo based on key and/or value

# Note that a few county/state combinations do
# not have housing sales data:
names(joinedData[[2884]]$value)
## [1] "geo"  "wiki"
# We want to filter those out
joinedData <- drFilter(joinedData,
    function(v) {
        !is.null(v$housing)
    })

Other data operations

  • drSample: returns a ddo containing a random sample (i.e. a specified fraction) of key/value pairs
  • drSubset: applies a subsetting function to the rows of a ddf
  • drLapply: applies a function to each subset and returns the results in a ddo
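
For example, a minimal sketch (argument names follow the datadr documentation; treat the details as assumptions):

# keep a random ~10% of the key/value pairs
sampledData <- drSample(joinedData, fraction = 0.1)

# apply a function to each subset, returning the results as a ddo
rowCounts <- drLapply(byCounty, function(x) nrow(x))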

Exercise: datadr data operations

Apply one or more of these data operations to joinedData or a ddo or ddf you created

  • drJoin
  • drFilter
  • drSample
  • drSubset
  • drLapply

Using Tessera with a Hadoop cluster

Differences from in memory computation:

  • Data ingest: use hdfsConn to specify the location of the file to read from HDFS
  • Each data object is stored in HDFS
    • Use the output parameter in most functions to specify a location in HDFS where the result will be stored
housing <- drRead.csv(
    file=hdfsConn("/hdfs/data/location"),
    output=hdfsConn("/hdfs/data/second/location"))

byCounty <- divide(housing, by=c("state", "county"),
    output=hdfsConn("/hdfs/data/byCounty"))

Introduction to trelliscope

Trelliscope

  • Divide and recombine visualization tool
  • Based on Trellis display
  • Apply a visualization method to each subset of a ddf or ddo
  • Interactively sort and filter plots

Trelliscope panel function

  • Define a function to apply to each subset that creates a plot
  • Plots can be created using base R graphics, ggplot2, lattice, rbokeh, or conceptually any htmlwidget
# Plot medListPriceSqft and medSoldPriceSqft by time
library(lattice) # xyplot comes from the lattice package
timePanel <- function(x) {
   xyplot(medListPriceSqft + medSoldPriceSqft ~ time,
      data = x$housing, auto.key = TRUE,
      ylab = "Price / Sq. Ft.")
}

Trelliscope panel function

# test the panel function on one division
timePanel(joinedData[[176]]$value)

Visualization database (vdb)

  • Trelliscope creates a directory with all the data to render the plots
  • Can later re-launch the Trelliscope display without all the prior data analysis
vdbConn("housing_vdb", autoYes=TRUE)

Creating a Trelliscope display

makeDisplay(joinedData,
   name = "list_sold_vs_time_datadr",
   desc = "List and sold price over time",
   panelFn = timePanel,
   width = 400, height = 400,
   lims = list(x = "same")
)
view()

Trelliscope demo

Exercise: create a panel function

newPanelFn <- function(x) {
   # fill in here
}

# test the panel function
newPanelFn(joinedData[[1]]$value)

vdbConn("housing_vdb", autoYes=TRUE)

makeDisplay(joinedData,
   name = "panel_test",
   desc = "Your test panel function",
   panelFn = newPanelFn)

Cognostics and display organization

  • Cognostic:
    • a value or summary statistic
    • calculated on each subset
    • to help the user focus their attention on plots of interest
  • Cognostics are used to sort and filter plots in Trelliscope
  • Define a function to apply to each subset to calculate desired values
    • Return a list of named elements
    • Each list element is a single value (no vectors or complex data objects)

Cognostics function

priceCog <- function(x) {
   st <- getSplitVar(x, "state")
   ct <- getSplitVar(x, "county")
   zillowString <- gsub(" ", "-", paste(ct, st))
   list(
      slope = cog(x$slope, desc = "list price slope"),
      meanList = cogMean(x$housing$medListPriceSqft),
      meanSold = cogMean(x$housing$medSoldPriceSqft),
      lat = cog(x$geo$lat, desc = "county latitude"),
      lon = cog(x$geo$lon, desc = "county longitude"),
      wikiHref = cogHref(x$wiki$href, desc="wiki link"),
      zillowHref = cogHref(
          sprintf("http://www.zillow.com/homes/%s_rb/",
              zillowString),
          desc="zillow link")
   )
}

Use the cognostics function in trelliscope

makeDisplay(joinedData,
   name = "list_sold_vs_time_datadr2",
   desc = "List and sold price with cognostics",
   panelFn = timePanel,
   cogFn = priceCog,
   width = 400, height = 400,
   lims = list(x = "same")
)

Trelliscope demo

Exercise: create a cognostics function

newCogFn <- function(x) {
#     list(
#         name1=cog(value1, desc="description")
#     )
}

# test the cognostics function
newCogFn(joinedData[[1]]$value)

makeDisplay(joinedData,
   name = "cognostics_test",
   desc = "Test panel and cognostics function",
   panelFn = newPanelFn,
   cogFn = newCogFn)
view()

Break

Analysis of NYC Taxi Data

Data download

Data Contents

Variable            Description
medallion           identifier of the individual taxi's license
hack_license        taxi driver license id
vendor_id           unique identifier of taxi owner
rate_code           code based on type of trip
store_and_fwd_flag  ignore
pickup_datetime     date and time of trip start
dropoff_datetime    date and time of trip end
passenger_count     number of riders
trip_time_in_secs   total time between pickup and dropoff
trip_distance       distance covered in miles

Data Contents

Variable                             Description
pickup_longitude, pickup_latitude    coordinates for start of trip
dropoff_longitude, dropoff_latitude  coordinates for end of trip
payment_type                         "CRD" - credit or debit card, "CSH" - cash
fare_amount                          base fare amount
surcharge                            additional charge for peak hours, certain locations
mta_tax                              local tax
tip_amount                           tip paid, may be $0
tolls_amount                         bridge and tunnel tolls
total_amount                         total amount paid by passengers

Divide and Recombine

Running a Local Cluster

  • A local cluster can use multiple cores, each running an R process
  • Each R process requires memory for the chunks it is processing
  • Buffer size limits the number of chunks held at once, but large chunks must still be loaded into memory in full
    Hint: look at the arguments of localDiskControl (see the sketch below)
    Hint: look at process size, ps -aux on Linux or the Windows Task Manager
  • Local calculations are often disk-I/O bound
  • Clusters with HDFS achieve greater I/O bandwidth
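
A minimal sketch of tuning a local-disk back end (the cluster size and buffer value are arbitrary choices, and taxiDdf is a hypothetical ddf read from local disk):

library(parallel)

# run map/reduce tasks on 4 local R processes with a 10 MB map buffer
control <- localDiskControl(cluster = makeCluster(4),
    map_buff_size_bytes = 10485760)

byRate <- divide(taxiDdf, by = "rate_code",
    output = localDiskConn("byRate", autoYes = TRUE),
    control = control)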

Challenge Questions

  • Compute and plot quantiles of passenger_count, trip_time_in_secs, trip_distance, total_amount, tip_amount (see the sketch after this list)
  • Compute summary statistics, such as mean toll or mean tip percent by distance category
    Hint - cut() on divisions of log distance
    Hint - divide the ddf by distance category
  • Explore how tip percent changes with distance category, rate_code and perhaps hour of the day
    Hint - Think of some potentially useful cognostics
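
As a starting point for the first question, a minimal sketch using datadr's drQuantile (taxiDdf is a hypothetical ddf holding the taxi data):

# all-data quantiles of tip amount
tipQ <- drQuantile(taxiDdf, var = "tip_amount")
plot(tipQ$fval, tipQ$q, xlab = "f-value", ylab = "tip_amount")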