June 30, 2015

Tutorial Schedule

Activity                                   Time
Tessera overview (Ryan)                    13:30 - 14:00
Get started with housing example (Amanda)  14:00 - 14:30
Break                                      10 min
Housing example continued (Amanda)         14:40 - 15:30
Break                                      10 min
Taxi example (Stephen)                     15:40 - 16:30

Installation Check

Deep Analysis of Large, Complex Data

  • Data most often do not come with a model
  • If we already (think we) know the algorithm or model to apply, and we simply apply it to the data and do nothing else, we are not doing analysis; we are doing processing
  • Deep analysis means
    • detailed, comprehensive analysis that does not lose important information in the data
    • learning from the data, not forcing our preconceptions on the data
    • being willing and able to use any of the 1000s of statistical, machine learning, and visualization methods as dictated by the data
    • trial and error, an iterative process of hypothesizing, fitting, validating, learning
    • a lot of visualization

Deep Analysis of Large, Complex Data

Large, complex data has any or all of the following:

  • Large number of records
  • Many variables
  • Complex data structures not readily put into tabular form of cases by variables
  • Intricate patterns and dependencies that require complex models and methods of analysis
  • Does not conform to simple assumptions made by many algorithms

The Goal of Tessera

Provide an environment that allows us to do the following with large complex data:

  • Work completely in R
  • Have access to R's 1000s of statistical, ML, and vis methods, ideally with no need to rewrite scalable versions
  • Be able to apply any ad-hoc R code
  • Minimize time thinking about code or distributed systems
  • Maximize time thinking about the data
  • Be able to analyze it with nearly as much flexibility and ease as small data

Tessera Packages

Users interact primarily with two R packages:

  • datadr: data analysis R package implementing the Divide & Recombine paradigm, which lets data scientists leverage parallel data and processing back-ends such as Hadoop and Spark through a simple, consistent interface
  • Trelliscope: visualization package that enables flexible detailed scalable visualization of large, complex data

Back End Agnostic

Interface stays the same regardless of back end
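
For example, a minimal sketch (using the housingDdf object created later in this tutorial; the output paths are hypothetical): only the output connection changes when the same divide call moves between back ends.

# in memory
byCounty <- divide(housingDdf, by = c("county", "state"))

# local disk
byCounty <- divide(housingDdf, by = c("county", "state"),
    output = localDiskConn("/tmp/byCounty", autoYes = TRUE))

# HDFS
byCounty <- divide(housingDdf, by = c("county", "state"),
    output = hdfsConn("/tmp/byCounty"))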

Tessera Fundamentals: D&R

Tessera Fundamentals: Trelliscope

  • Trelliscope: a viz tool that enables scalable, detailed visualization of large data
  • Data is split into meaningful subsets, and a visualization method is applied to each subset
  • The user can sort and filter plots based on "cognostics" - summary statistics of interest - to explore the data

The Current Tessera Distributed Computing Stack

  • trelliscope: visualization of subsets of data, web interface powered by Shiny http://shiny.rstudio.com
  • datadr: interface for divide and recombine operations
  • RHIPE: The R and Hadoop Integrated Programming Environment
  • Hadoop: Framework for managing data and computation distributed across multiple hard drives in a cluster
  • HDFS: Hadoop Distributed File System

More Information

Introduction to datadr

Installing the Tessera packages

install.packages("devtools") # if not installed
library(devtools)
install_github("tesseradata/datadr")
install_github("tesseradata/trelliscope")
install_github("hafen/housingData") # demo data

Housing Data

  • Housing sales and listing data in the United States
  • Between 2008-10-01 and 2014-03-01
  • Aggregated to the county level
  • Zillow.com data provided by Quandl (https://www.quandl.com/c/housing)

Housing Data Variables

Variable          Description
fips              Federal Information Processing Standard, a 5-digit county code
county            US county name
state             US state name
time              date (the data is aggregated monthly)
nSold             number sold this month
medListPriceSqft  median list price per square foot
medSoldPriceSqft  median sold price per square foot

datadr data representation

  • Fundamentally, all data types are stored in a back-end as key/value pairs
  • Data type abstractions on top of the key/value pairs
    • Distributed data frame (ddf):
      • A data frame that is split into chunks
      • Each chunk contains a subset of the rows of the data frame
      • Each subset may be distributed across the nodes of a cluster
    • Distributed data object (ddo):
      • Similar to distributed data frame
      • Except each chunk can be an object with any structure
      • Every distributed data frame is also a distributed data object
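
A minimal sketch of the distinction, using the housing data frame introduced below (the key/value list is made up for illustration):

# a ddf: every chunk is a data.frame holding a subset of rows
housingDdf <- ddf(housing)

# a ddo: chunks can be arbitrary R objects
kvList <- list(list("key1", letters), list("key2", rnorm(10)))
myDdo <- ddo(kvList)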

Back-ends

datadr data back-end options:

  • In memory
  • Local disk
  • HDFS
  • Spark (under development)

Data ingest

# similar to read.table function:
my.data <- drRead.table(
  hdfsConn("/home/me/dir/datafile.txt"),
  header = TRUE, sep = "\t"
)

# similar to read.csv function:
my.data2 <- drRead.csv(
    localDiskConn("c:/my/local/data.csv"))

# convert in memory data.frame to ddf:
my.data3 <- ddf(some.data.frame)

You try it

# Load necessary libraries
library(datadr)
library(trelliscope)
library(housingData)

# housing data frame is in the housingData package
housingDdf <- ddf(housing)

Division

  • A common thing to do is to divide a dataset based on the value of one or more variables
  • Another option is to divide data into random replicates
    • Use random replicates to estimate a GLM fit by applying GLM to each replicate subset and taking the mean coefficients
    • Random replicates can be used for other all-data model fitting approaches like bag of little bootstraps, consensus MCMC, etc. (see the sketch below)
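
As a concrete sketch, datadr provides rrDiv for random replicate division (the subset size of 500 rows is an arbitrary choice):

# divide into random replicate subsets of roughly 500 rows each
byRandom <- divide(housingDdf, by = rrDiv(500), update = TRUE)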

Divide example

Divide the housing data set by the variables "county" and "state"

(This kind of data division is very similar to the functionality provided by the plyr package)

byCounty <- divide(housingDdf,
    by = c("county", "state"), update = TRUE)

Divide example

byCounty
##
## Distributed data frame backed by 'kvMemory' connection
##
##  attribute      | value
## ----------------+-----------------------------------------------------------
##  names          | fips(cha), time(Dat), nSold(num), and 2 more
##  nrow           | 224369
##  size (stored)  | 15.73 MB
##  size (object)  | 15.73 MB
##  # subsets      | 2883
##
## * Other attributes: getKeys(), splitSizeDistn(), splitRowDistn(), summary()
## * Conditioning variables: county, state

Exercise: try the divide function

Now try using the divide statement to divide on one or more variables

Possible solutions

byState <- divide(housingDdf, by = "state", update = TRUE)

byMonth <- divide(housingDdf, by = "time", update = TRUE)

Exploring the ddf data object

Data divisions can be accessed by index or by key name

byCounty[[1]]
## $key
## [1] "county=Abbeville County|state=SC"
##
## $value
##    fips       time nSold medListPriceSqft medSoldPriceSqft
## 1 45001 2008-10-01    NA         73.06226               NA
## 2 45001 2008-11-01    NA         70.71429               NA
## 3 45001 2008-12-01    NA         70.71429               NA
## 4 45001 2009-01-01    NA         73.43750               NA
## 5 45001 2009-02-01    NA         78.69565               NA
## ...
byCounty[["county=Benton County|state=WA"]]

Exploring the ddf data object

Participants: try these functions on your own

  • summary(byCounty)
  • names(byCounty)
  • length(byCounty)
  • getKeys(byCounty)

Break

Transformations

  • The addTransform function applies a function to each key/value pair in a ddf
    • E.g. to calculate a summary statistic
  • The transformation is not applied immediately; it is deferred until:
    • A function that kicks off a map/reduce job is called (e.g. recombine)
    • A subset of the data is requested (e.g. byCounty[[1]])
    • drPersist function explicitly forces transformation computation
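
A minimal sketch of this deferred behavior (the summary transform is made up for illustration):

# the transform is only recorded here, not yet computed
nSoldXform <- addTransform(byCounty, function(x) sum(x$nSold, na.rm = TRUE))

# requesting one subset triggers computation for that subset only
nSoldXform[[1]]

# drPersist computes and stores the transformed values for all subsets
nSoldXform <- drPersist(nSoldXform)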

Transformation example

# Function to calculate a linear model and extract
# the slope parameter
lmCoef <- function(x) {
   coef(lm(medListPriceSqft ~ time, data = x))[2]
}

# Best practice tip: test transformation
# function on one division
lmCoef(byCounty[[1]]$value)
##          time
## -0.0002323686
# Apply the transform function to the ddf
byCountySlope <- addTransform(byCounty, lmCoef)

Transformation example

byCountySlope[[1]]
## $key
## [1] "county=Abbeville County|state=SC"
##
## $value
##          time
## -0.0002323686

Exercise: create a transformation function

  • Try creating your own transformation function
  • Hint: the input to your function will be one value from a key/value pair (e.g. byCounty[[1]]$value)
transformFn <- function(x) {
    ## you fill in here
}

# test:
transformFn(byCounty[[1]]$value)

# apply:
xformedData <- addTransform(byCounty, transformFn)

Possible solutions

# example 1
totalSold <- function(x) {
   sum(x$nSold, na.rm=TRUE)
}
byCountySold <- addTransform(byCounty, totalSold)

# example 2
timeRange <- function(x) {
   range(x$time)
}
byCountyTime <- addTransform(byCounty, timeRange)

Recombination

  • Combine transformation results together again
  • Example
countySlopes <- recombine(byCountySlope,
    combine=combRbind)
head(countySlopes)
##                 county state           val
## time  Abbeville County    SC -0.0002323686
## time1    Acadia Parish    LA  0.0019518441
## time2  Accomack County    VA -0.0092717711
## time3       Ada County    ID -0.0030197554
## time4     Adair County    IA -0.0308381951
## time5     Adair County    KY  0.0034399585

Recombination options

The combine parameter controls the form of the result:

  • combine=combRbind: rbind is used to combine results into a data.frame; this is the most frequently used option
  • combine=combCollect: results are collected into a list
  • combine=combDdo: results are combined into a ddo object
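
For example, a minimal sketch using the byCountySlope transform from earlier:

# collect results into a local list
slopeList <- recombine(byCountySlope, combine = combCollect)

# keep results as a distributed data object
slopeDdo <- recombine(byCountySlope, combine = combDdo)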

Exercise: try the recombine function

  • Apply recombine to the data with your custom transformation
  • Hint: combine=combRbind is probably the simplest option

Exercise: divide

Divide two new datasets geoCounty and wikiCounty by county and state

# look at the data first
head(geoCounty)
head(wikiCounty)

# use divide function on each

Solution

geoByCounty <- divide(geoCounty,
    by=c("county", "state"))

wikiByCounty <- divide(wikiCounty,
    by=c("county", "state"))

Data operations: drJoin

Join together multiple data objects based on key

joinedData <- drJoin(housing=byCounty,
    slope=byCountySlope,
    geo=geoByCounty,
    wiki=wikiByCounty)

Distributed data objects vs distributed data frames

  • In a ddf the value in each key/value is always a data.frame
  • A ddo can accommodate values that are not data.frames
class(joinedData)
## [1] "ddo"      "kvMemory"

Distributed data objects vs distributed data frames

joinedData[[176]]
## $key
## [1] "county=Benton County|state=WA"
##
## $value
## $housing
##     fips       time nSold medListPriceSqft medSoldPriceSqft
## 1  53005 2008-10-01   137         106.6351         106.2179
## 2  53005 2008-11-01    80         106.9650               NA
## 3  53005 2008-11-01    NA               NA         105.2370
## 4  53005 2008-12-01    95         107.6642         105.6311
## 5  53005 2009-01-01    73         107.6868         105.8892
## ...
##
## $slope
##        time
## 0.007726753
##
## $geo
##    fips       lon      lat  rMapState rMapCounty
## 1 53005 -119.5155 46.23413 washington     benton
##
## $wiki
##    fips pop2013                                                   href
## 1 53005  184486 http://en.wikipedia.org/wiki/Benton_County,_Washington

Data operations: drFilter

Filter a ddf or ddo based on key and/or value

# Note that a few county/state combinations do
# not have housing sales data:
names(joinedData[[2884]]$value)
## [1] "geo"  "wiki"
# We want to filter those out
joinedData <- drFilter(joinedData,
    function(v) {
        !is.null(v$housing)
    })

Other data operations

  • drSample: returns a ddo containing a random sample (i.e. a specified fraction) of key/value pairs
  • drSubset: applies a subsetting function to the rows of a ddf
  • drLapply: applies a function to each subset and returns the results in a ddo
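
For example, a minimal sketch (argument names follow the datadr documentation; treat the details as assumptions):

# keep a random ~10% of the key/value pairs
sampledData <- drSample(joinedData, fraction = 0.1)

# apply a function to each subset, returning the results as a ddo
rowCounts <- drLapply(byCounty, function(x) nrow(x))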

Exercise: datadr data operations

Apply one or more of these data operations to joinedData or a ddo or ddf you created

  • drJoin
  • drFilter
  • drSample
  • drSubset
  • drLapply

Using Tessera with a Hadoop cluster

Differences from in memory computation:

  • Data ingest: use hdfsConn to specify the location of the file to read from HDFS
  • Each data object is stored in HDFS
    • Use the output parameter in most functions to specify a location in HDFS where the result will be stored
housing <- drRead.csv(
    file=hdfsConn("/hdfs/data/location"),
    output=hdfsConn("/hdfs/data/second/location"))

byCounty <- divide(housing, by=c("state", "county"),
    output=hdfsConn("/hdfs/data/byCounty"))

Introduction to trelliscope

Trelliscope

  • Divide and recombine visualization tool
  • Based on Trellis display
  • Apply a visualization method to each subset of a ddf or ddo
  • Interactively sort and filter plots

Trelliscope panel function

  • Define a function to apply to each subset that creates a plot
  • Plots can be created using base R graphics, ggplot2, lattice, rbokeh, or conceptually any htmlwidget
# Plot medListPriceSqft and medSoldPriceSqft by time
library(lattice) # xyplot comes from the lattice package
timePanel <- function(x) {
   xyplot(medListPriceSqft + medSoldPriceSqft ~ time,
      data = x$housing, auto.key = TRUE,
      ylab = "Price / Sq. Ft.")
}

Trelliscope panel function

# test the panel function on one division
timePanel(joinedData[[176]]$value)

Visualization database (vdb)

  • Trelliscope creates a directory with all the data to render the plots
  • Can later re-launch the Trelliscope display without all the prior data analysis
vdbConn("housing_vdb", autoYes=TRUE)

Creating a Trelliscope display

makeDisplay(joinedData,
   name = "list_sold_vs_time_datadr",
   desc = "List and sold price over time",
   panelFn = timePanel,
   width = 400, height = 400,
   lims = list(x = "same")
)
view()

Trelliscope demo

Exercise: create a panel function

newPanelFn <- function(x) {
   # fill in here
}

# test the panel function
newPanelFn(joinedData[[1]]$value)

vdbConn("housing_vdb", autoYes=TRUE)

makeDisplay(joinedData,
   name = "panel_test",
   desc = "Your test panel function",
   panelFn = newPanelFn)

Cognostics and display organization

  • Cognostic:
    • a value or summary statistic
    • calculated on each subset
    • to help the user focus their attention on plots of interest
  • Cognostics are used to sort and filter plots in Trelliscope
  • Define a function to apply to each subset to calculate desired values
    • Return a list of named elements
    • Each list element is a single value (no vectors or complex data objects)

Cognostics function

priceCog <- function(x) {
   st <- getSplitVar(x, "state")
   ct <- getSplitVar(x, "county")
   zillowString <- gsub(" ", "-", paste(ct, st))
   list(
      slope = cog(x$slope, desc = "list price slope"),
      meanList = cogMean(x$housing$medListPriceSqft),
      meanSold = cogMean(x$housing$medSoldPriceSqft),
      lat = cog(x$geo$lat, desc = "county latitude"),
      lon = cog(x$geo$lon, desc = "county longitude"),
      wikiHref = cogHref(x$wiki$href, desc="wiki link"),
      zillowHref = cogHref(
          sprintf("http://www.zillow.com/homes/%s_rb/",
              zillowString),
          desc="zillow link")
   )
}

Use the cognostics function in trelliscope

makeDisplay(joinedData,
   name = "list_sold_vs_time_datadr2",
   desc = "List and sold price with cognostics",
   panelFn = timePanel,
   cogFn = priceCog,
   width = 400, height = 400,
   lims = list(x = "same")
)

Trelliscope demo

Exercise: create a cognostics function

newCogFn <- function(x) {
#     list(
#         name1=cog(value1, desc="description")
#     )
}

# test the cognostics function
newCogFn(joinedData[[1]]$value)

makeDisplay(joinedData,
   name = "cognostics_test",
   desc = "Test panel and cognostics function",
   panelFn = newPanelFn,
   cogFn = newCogFn)
view()

Break

Analysis of NYC Taxi Data

Data download

Data Contents

Variable            Description
medallion           identifier of the individual taxi's license
hack_license        taxi driver license id
vendor_id           unique identifier of taxi owner
rate_code           code based on type of trip
store_and_fwd_flag  ignore
pickup_datetime     date and time of trip start
dropoff_datetime    date and time of trip end
passenger_count     number of riders
trip_time_in_secs   total time between pickup and dropoff
trip_distance       distance covered in miles

Data Contents

Variable                             Description
pickup_longitude, pickup_latitude    coordinates for start of trip
dropoff_longitude, dropoff_latitude  coordinates for end of trip
payment_type                         "CRD" - credit or debit card, "CSH" - cash
fare_amount                          base fare amount
surcharge                            additional charge for peak hours, certain locations
mta_tax                              local tax
tip_amount                           tip paid, may be $0
tolls_amount                         bridge and tunnel tolls
total_amount                         total amount paid by passengers

Divide and Recombine

Running a Local Cluster

  • A local cluster can use multiple cores, each running an R process
  • Each R process requires memory for the chunks it is processing
  • Buffer size limits the number of chunks held at once, but large chunks must still be loaded into memory in full
    Hint: look at the arguments of localDiskControl (see the sketch below)
    Hint: look at process size, ps -aux on Linux or the Windows Task Manager
  • Local calculations are often disk-I/O bound
  • Clusters with HDFS achieve greater I/O bandwidth
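
A minimal sketch of tuning a local-disk back end (the cluster size and buffer value are arbitrary choices, and taxiDdf is a hypothetical ddf read from local disk):

library(parallel)

# run map/reduce tasks on 4 local R processes with a 10 MB map buffer
control <- localDiskControl(cluster = makeCluster(4),
    map_buff_size_bytes = 10485760)

byRate <- divide(taxiDdf, by = "rate_code",
    output = localDiskConn("byRate", autoYes = TRUE),
    control = control)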

Challenge Questions

  • Compute and plot quantiles of passenger_count, trip_time_in_secs, trip_distance, total_amount, tip_amount (see the sketch after this list)
  • Compute summary statistics, such as mean toll or mean tip percent by distance category
    Hint - cut() on divisions of log distance
    Hint - divide the ddf by distance category
  • Explore how tip percent changes with distance category, rate_code and perhaps hour of the day
    Hint - Think of some potentially useful cognostics
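
As a starting point for the first question, a minimal sketch using datadr's drQuantile (taxiDdf is a hypothetical ddf holding the taxi data):

# all-data quantiles of tip amount
tipQ <- drQuantile(taxiDdf, var = "tip_amount")
plot(tipQ$fval, tipQ$q, xlab = "f-value", ylab = "tip_amount")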