Activity | Time |
---|---|
Tessera overview (Ryan) | 13:30 - 14:00 |
Get started with housing example (Amanda) | 14:00 - 14:30 |
Break | 10 min |
Housing example continued (Amanda) | 14:40 - 15:30 |
Break | 10 min |
Taxi example (Stephen) | 15:40 - 16:30 |
June 30, 2015
Large complex data has any or all of the following:
Provide an environment that allows us to do the following with large complex data:
Users interact primarily with two R packages:
Interface stays the same regardless of back end
datadr
install.packages("devtools") # if not installed
library(devtools)
install_github("tesseradata/datadr")
install_github("tesseradata/trelliscope")
install_github("hafen/housingData") # demo data
Variable | Description |
---|---|
fips | Federal Information Processing Standard (FIPS), a 5-digit county code |
county | US county name |
state | US state name |
time | date (the data is aggregated monthly) |
nSold | number sold this month |
medListPriceSqft | median list price per square foot |
medSoldPriceSqft | median sold price per square foot |
datadr data representation
Distributed data frame (ddf)
Distributed data object (ddo)
datadr data back-end options:
# similar to read.table function:
# (header and sep are arguments of drRead.table, not of hdfsConn)
my.data <- drRead.table(
  hdfsConn("/home/me/dir/datafile.txt"),
  header = TRUE, sep = "\t"
)
# similar to read.csv function:
my.data2 <- drRead.csv(
  localDiskConn("c:/my/local/data.csv"))
# convert in memory data.frame to ddf:
my.data3 <- ddf(some.data.frame)
# Load necessary libraries
library(datadr)
library(trelliscope)
library(housingData)
# housing data frame is in the housingData package
housingDdf <- ddf(housing)
Divide the housing data set by the variables "county" and "state"
(This kind of data division is very similar to the functionality provided by the plyr package)
byCounty <- divide(housingDdf, by = c("county", "state"), update = TRUE)
byCounty
##
## Distributed data frame backed by 'kvMemory' connection
##
##  attribute      | value
## ----------------+-----------------------------------------------------------
##  names          | fips(cha), time(Dat), nSold(num), and 2 more
##  nrow           | 224369
##  size (stored)  | 15.73 MB
##  size (object)  | 15.73 MB
##  # subsets      | 2883
##
## * Other attributes: getKeys(), splitSizeDistn(), splitRowDistn(), summary()
## * Conditioning variables: county, state
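To make the key/value structure concrete, here is a minimal base R sketch of the same idea. It is an analogy only, not how datadr implements `divide`; the data frame `toy` and its values are made up for illustration.

```r
# Toy stand-in for the housing data (values are invented for illustration)
toy <- data.frame(
  county = c("Benton County", "Benton County", "Ada County"),
  state  = c("WA", "WA", "ID"),
  nSold  = c(137, 80, 50),
  stringsAsFactors = FALSE
)

# Build keys in the same "county=...|state=..." style that datadr prints
keys <- paste0("county=", toy$county, "|state=", toy$state)

# split() groups rows by key; the conditioning columns are left out of
# each value, as in the byCounty subsets
subsets <- split(toy[, "nSold", drop = FALSE], keys)

names(subsets)
subsets[["county=Benton County|state=WA"]]
```

Each element of `subsets` plays the role of one value, and its name plays the role of the key.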
divide function
Now try using the divide function to divide on one or more variables
byState <- divide(housing, by="state", update = TRUE)
byMonth <- divide(housing, by="time", update=TRUE)
ddf data object
Data divisions can be accessed by index or by key name
byCounty[[1]]
## $key
## [1] "county=Abbeville County|state=SC"
##
## $value
##    fips       time nSold medListPriceSqft medSoldPriceSqft
## 1 45001 2008-10-01    NA         73.06226               NA
## 2 45001 2008-11-01    NA         70.71429               NA
## 3 45001 2008-12-01    NA         70.71429               NA
## 4 45001 2009-01-01    NA         73.43750               NA
## 5 45001 2009-02-01    NA         78.69565               NA
## ...
byCounty[["county=Benton County|state=WA"]]
ddf data object
Participants: try these functions on your own
summary(byCounty)
names(byCounty)
length(byCounty)
getKeys(byCounty)
addTransform function applies a function to each key/value pair in a ddf
The transformation is not applied until a function that triggers computation is called (e.g. recombine)
or until a subset of the data is requested (e.g. byCounty[[1]])
drPersist function explicitly forces transformation computation
# Function to calculate a linear model and extract
# the slope parameter
lmCoef <- function(x) {
  coef(lm(medListPriceSqft ~ time, data = x))[2]
}
# Best practice tip: test transformation
# function on one division
lmCoef(byCounty[[1]]$value)
##          time
## -0.0002323686
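Because `time` is a `Date`, `lm` treats it as a number of days, so the slope above is in price units per day. A small sketch with made-up data (the data frame `toy` is hypothetical) shows how to read such a slope:

```r
# Made-up series: price rises by exactly 0.5 per day
toy <- data.frame(
  time = as.Date("2010-01-01") + c(0, 10, 20, 30),
  medListPriceSqft = 100 + 0.5 * c(0, 10, 20, 30)
)

# Same model form as lmCoef: the second coefficient is the slope on time
slope <- coef(lm(medListPriceSqft ~ time, data = toy))[2]

unname(slope)        # change in price per day
unname(slope) * 365  # approximate change per year
```

Multiplying by 365 is only a rough way to express the daily slope on a yearly scale.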
# Apply the transform function to the ddf
byCountySlope <- addTransform(byCounty, lmCoef)
byCountySlope[[1]]
## $key
## [1] "county=Abbeville County|state=SC"
##
## $value
##          time
## -0.0002323686
Test your function on one subset first (byCounty[[1]]$value)
transformFn <- function(x) {
  ## you fill in here
}
# test:
transformFn(byCounty[[1]]$value)
# apply:
xformedData <- addTransform(byCounty, transformFn)
# example 1
totalSold <- function(x) {
  sum(x$nSold, na.rm=TRUE)
}
byCountySold <- addTransform(byCounty, totalSold)
# example 2
timeRange <- function(x) {
  range(x$time)
}
byCountyTime <- addTransform(byCounty, timeRange)
countySlopes <- recombine(byCountySlope, combine=combRbind)
head(countySlopes)
##                 county state           val
## time  Abbeville County    SC -0.0002323686
## time1    Acadia Parish    LA  0.0019518441
## time2  Accomack County    VA -0.0092717711
## time3       Ada County    ID -0.0030197554
## time4     Adair County    IA -0.0308381951
## time5     Adair County    KY  0.0034399585
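The transform-then-recombine pattern used above can be mimicked in base R. The sketch below is an analogy under stated assumptions, not datadr's implementation: `subsets` is a made-up named list standing in for a divided data set, and the final data frame only resembles what `combine=combRbind` returns.

```r
# Toy subsets keyed by county (values are invented for illustration)
subsets <- list(
  "county=A|state=X" = data.frame(nSold = c(1, 2, NA)),
  "county=B|state=Y" = data.frame(nSold = c(10, 20, 30))
)

# Per-subset transformation, like the function passed to addTransform
totalSold <- function(x) sum(x$nSold, na.rm = TRUE)
results <- lapply(subsets, totalSold)

# Recombine as a data frame with one row per key
combined <- data.frame(
  key = names(results),
  val = unlist(results, use.names = FALSE),
  stringsAsFactors = FALSE
)
combined
```

Keeping `results` as a list instead corresponds loosely to collecting results without binding them into a data frame.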
combine parameter controls the form of the result
combine=combRbind: rbind is used to combine results into a data.frame; this is the most frequently used option
combine=combCollect: results are collected into a list
combine=combDdo: results are combined into a ddo object
recombine function
Apply recombine to the data with your custom transformation
combine=combRbind is probably the simplest option
divide
Divide two new datasets, geoCounty and wikiCounty, by county and state
# look at the data first
head(geoCounty)
head(wikiCounty)
# use divide function on each
geoByCounty <- divide(geoCounty, by=c("county", "state"))
wikiByCounty <- divide(wikiCounty, by=c("county", "state"))
Join together multiple data objects based on key
joinedData <- drJoin(housing=byCounty, slope=byCountySlope, geo=geoByCounty, wiki=wikiByCounty)
ddf: the value in each key/value pair is always a data.frame
ddo: can accommodate values that are not data.frames
class(joinedData)
## [1] "ddo" "kvMemory"
joinedData[[176]]
## $key
## [1] "county=Benton County|state=WA"
##
## $value
## $housing
##     fips       time nSold medListPriceSqft medSoldPriceSqft
##  1 53005 2008-10-01   137         106.6351         106.2179
##  2 53005 2008-11-01    80         106.9650               NA
##  3 53005 2008-11-01    NA               NA         105.2370
##  4 53005 2008-12-01    95         107.6642         105.6311
##  5 53005 2009-01-01    73         107.6868         105.8892
##  6 53005 2009-02-01    97         108.3566               NA
##  7 53005 2009-02-01    NA               NA         104.3273
##  8 53005 2009-03-01   125         107.1968         103.2748
##  9 53005 2009-04-01   147         107.7649         102.2363
## 10 53005 2009-05-01   192         108.6823               NA
## 11 53005 2009-05-01    NA               NA         103.8925
## 12 53005 2009-06-01   256         108.5143         105.1873
## 13 53005 2009-07-01   232         108.4902         104.8865
## 14 53005 2009-08-01   250         108.5763               NA
## 15 53005 2009-08-01    NA               NA         105.6069
## 16 53005 2009-09-01   185         109.4762         111.4779
## 17 53005 2009-10-01   208         108.3172         111.3014
## 18 53005 2009-11-01   171         109.1082               NA
## 19 53005 2009-11-01    NA               NA         111.3912
## 20 53005 2009-12-01   116         109.6998         109.8340
## 21 53005 2010-01-01   112         109.8148         109.8369
## 22 53005 2010-02-01   130         109.1902               NA
## 23 53005 2010-02-01    NA               NA         109.5266
## 24 53005 2010-03-01   230         109.1489         111.3277
## 25 53005 2010-04-01   257         108.9922         111.9392
## 26 53005 2010-05-01   263         109.1406               NA
## 27 53005 2010-05-01    NA               NA         112.3360
## 28 53005 2010-06-01   277         109.7328         112.4703
## 29 53005 2010-07-01   143         109.3155         112.5845
## 30 53005 2010-08-01   147         109.5679               NA
## 31 53005 2010-08-01    NA               NA         113.5515
## 32 53005 2010-09-01   145         109.9034         114.7505
## 33 53005 2010-10-01   158         111.1056         111.6532
## 34 53005 2010-11-01   119         110.9722               NA
## 35 53005 2010-11-01    NA               NA         111.1267
## 36 53005 2010-12-01    96         112.1773         109.0577
## 37 53005 2011-01-01    84         113.0760         109.4568
## 38 53005 2011-02-01   106         114.2857               NA
## 39 53005 2011-02-01    NA               NA         111.6570
## 40 53005 2011-03-01   157         113.9522         111.6075
## 41 53005 2011-04-01   178         112.5166         111.9766
## 42 53005 2011-05-01   176         113.6156               NA
## 43 53005 2011-05-01    NA               NA         110.9980
## 44 53005 2011-06-01   160         113.2360         110.8176
## 45 53005 2011-07-01   175         112.5824         113.7276
## 46 53005 2011-08-01   153               NA               NA
## 47 53005 2011-08-01    NA         119.9161         114.2078
## 48 53005 2011-09-01   157         118.8687         114.6581
## 49 53005 2011-10-01   148         118.1545         113.7756
## 50 53005 2011-11-01    70               NA               NA
## 51 53005 2011-11-01    NA         118.0064         112.0518
## 52 53005 2011-12-01   102         118.2796         105.5297
## 53 53005 2012-01-01    72         118.4511         107.1666
## 54 53005 2012-02-01    82               NA               NA
## 55 53005 2012-02-01    NA         118.1705         107.7065
## 56 53005 2012-03-01   108         118.5601         106.8745
## 57 53005 2012-04-01   100         118.3301         108.6945
## 58 53005 2012-05-01   134               NA               NA
## 59 53005 2012-05-01    NA         118.3333         109.9282
## 60 53005 2012-06-01   116         118.0660         111.0290
## 61 53005 2012-07-01   115         118.2432         110.6429
## 62 53005 2012-08-01   119               NA               NA
## 63 53005 2012-08-01    NA         118.4626         111.5133
## 64 53005 2012-09-01   103         118.8302         111.9573
## 65 53005 2012-10-01   129         117.9672         111.9468
## 66 53005 2012-11-01   121               NA               NA
## 67 53005 2012-11-01    NA         117.8074         110.9557
## 68 53005 2012-12-01   122         118.1281         108.4398
## 69 53005 2013-01-01   110         117.2370         106.6948
## 70 53005 2013-02-01    NA         117.2093         105.8374
## 71 53005 2013-03-01    NA         117.7890         108.0909
## 72 53005 2013-04-01    NA         118.4573         108.3922
## 73 53005 2013-05-01    NA         119.0476         109.3556
## 74 53005 2013-06-01    NA         118.8369         112.1253
## 75 53005 2013-07-01    NA         118.5484         113.4037
## 76 53005 2013-08-01    NA         119.8686         113.7689
## 77 53005 2013-09-01    NA         120.1890         112.2339
## 78 53005 2013-10-01    NA         119.2271         113.0176
## 79 53005 2013-11-01    NA         118.5382         114.7066
## 80 53005 2013-12-01    NA         120.0107         112.5436
## 81 53005 2014-01-01    NA         120.5357         109.4688
## 82 53005 2014-02-01    NA         120.5588         109.4023
## 83 53005 2014-03-01    NA         121.6829         108.9446
##
## $slope
##        time
## 0.007726753
##
## $geo
##    fips       lon      lat  rMapState rMapCounty
## 1 53005 -119.5155 46.23413 washington     benton
##
## $wiki
##    fips pop2013                                                   href
## 1 53005  184486 http://en.wikipedia.org/wiki/Benton_County,_Washington
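The shape of the joined value (one named element per input, gathered under a shared key) can be illustrated with a base R sketch. This is an analogy, not datadr's implementation of `drJoin`; the lists `housing` and `geo` and their contents are invented for illustration.

```r
# Two toy key/value collections (contents are invented for illustration)
housing <- list(a = data.frame(nSold = 1:2), b = data.frame(nSold = 3:4))
geo     <- list(a = data.frame(lat = 46.2), c = data.frame(lat = 43.6))

# Named inputs, mirroring the name=value arguments passed to drJoin
sources <- list(housing = housing, geo = geo)

# For every key found in any source, gather the values that exist for it
allKeys <- sort(unique(unlist(lapply(sources, names))))
joined <- lapply(allKeys, function(k) {
  vals <- lapply(sources, function(s) s[[k]])
  vals[!vapply(vals, is.null, logical(1))]  # drop sources lacking this key
})
names(joined) <- allKeys

names(joined$a)  # present in both sources
names(joined$c)  # present in geo only
```

Keys that appear in only some inputs keep only those elements, which is why some joined values can lack a housing element.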
Filter a ddf or ddo based on key and/or value
# Note that a few county/state combinations do
# not have housing sales data:
names(joinedData[[2884]]$value)
## [1] "geo" "wiki"
# We want to filter those out
joinedData <- drFilter(joinedData,
  function(v) {
    !is.null(v$housing)
  })
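The filtering predicate can be tried out on plain lists before applying it to a ddo. The following base R sketch uses `Filter` as a stand-in for `drFilter`; the list `joined` is made up for illustration.

```r
# Toy joined data: key "c" has no housing element (invented for illustration)
joined <- list(
  a = list(housing = data.frame(nSold = 1), geo = "x"),
  c = list(geo = "y")
)

# Same predicate as above: keep values that have a housing element
hasHousing <- function(v) !is.null(v$housing)

kept <- Filter(hasHousing, joined)
names(kept)
```

Only the key whose value contains housing data remains in `kept`.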
drSample: returns a ddo containing a random sample (i.e. a specified fraction) of key/value pairs
drSubset: applies a subsetting function to the rows of a ddf
drLapply: applies a function to each subset and returns the results in a ddo
datadr data operations
Apply one or more of these data operations to joinedData or a ddo or ddf you created
drJoin
drFilter
drSample
drSubset
drLapply
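As a rough guide to what three of the operations listed above do, here is a base R sketch on a plain named list. It is an analogy only: the list `subsets` is invented, and datadr's actual sampling scheme and return types may differ from what is shown.

```r
# Toy key/value subsets (values are invented for illustration)
subsets <- list(
  a = data.frame(nSold = c(5, 50)),
  b = data.frame(nSold = c(7, 70)),
  c = data.frame(nSold = c(9, 90)),
  d = data.frame(nSold = c(11, 110))
)

# In the spirit of drSample: keep a random fraction of the key/value pairs
set.seed(1)
fraction <- 0.5
nKeep <- round(fraction * length(subsets))
sampled <- subsets[sample(length(subsets), size = nKeep)]

# In the spirit of drSubset: apply a row-level condition across all subsets
bigSales <- subset(do.call(rbind, subsets), nSold > 40)

# In the spirit of drLapply: apply a function to each subset,
# keeping one result per key
maxSold <- lapply(subsets, function(x) max(x$nSold))
```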
Differences from in memory computation:
Use hdfsConn to specify a file location to read in HDFS
Use the output parameter in most functions to specify a location in HDFS to store data
housing <- drRead.csv(
  file=hdfsConn("/hdfs/data/location"),
  output=hdfsConn("/hdfs/data/second/location"))
byCounty <- divide(housing,
  by=c("state", "county"),
  output=hdfsConn("/hdfs/data/byCounty"))
trelliscope
ddf
or ddo
library(lattice) # provides xyplot()
# Plot medListPriceSqft and medSoldPriceSqft by time
timePanel <- function(x) {
  xyplot(medListPriceSqft + medSoldPriceSqft ~ time,
    data = x$housing, auto.key = TRUE,
    ylab = "Price / Sq. Ft.")
}
# test the panel function on one division
timePanel(joinedData[[176]]$value)
vdbConn("housing_vdb", autoYes=TRUE)
makeDisplay(joinedData,
  name = "list_sold_vs_time_datadr",
  desc = "List and sold price over time",
  panelFn = timePanel,
  width = 400, height = 400,
  lims = list(x = "same")
)
view()
newPanelFn <- function(x) {
  # fill in here
}
# test the panel function
newPanelFn(joinedData[[1]]$value)
vdbConn("housing_vdb", autoYes=TRUE)
makeDisplay(joinedData,
  name = "panel_test",
  desc = "Your test panel function",
  panelFn = newPanelFn)
priceCog <- function(x) {
  st <- getSplitVar(x, "state")
  ct <- getSplitVar(x, "county")
  zillowString <- gsub(" ", "-", paste(ct, st))
  list(
    slope = cog(x$slope, desc = "list price slope"),
    meanList = cogMean(x$housing$medListPriceSqft),
    meanSold = cogMean(x$housing$medSoldPriceSqft),
    lat = cog(x$geo$lat, desc = "county latitude"),
    lon = cog(x$geo$lon, desc = "county longitude"),
    wikiHref = cogHref(x$wiki$href, desc="wiki link"),
    zillowHref = cogHref(
      sprintf("http://www.zillow.com/homes/%s_rb/", zillowString),
      desc="zillow link")
  )
}
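The string handling inside `priceCog` can be checked on its own with base R. In this sketch the values of `ct` and `st` are typed in by hand as an assumption, in place of what `getSplitVar` would return for the Benton County, WA subset.

```r
# Hand-entered stand-ins for the split variables of one subset
ct <- "Benton County"
st <- "WA"

# Same steps as in priceCog: join with a space, then replace spaces with hyphens
zillowString <- gsub(" ", "-", paste(ct, st))
zillowString

# Insert the string into the URL template used for the zillow link
url <- sprintf("http://www.zillow.com/homes/%s_rb/", zillowString)
url
```

Testing the string construction separately makes it easier to spot county names that would produce an unexpected URL.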
makeDisplay(joinedData,
  name = "list_sold_vs_time_datadr2",
  desc = "List and sold price with cognostics",
  panelFn = timePanel,
  cogFn = priceCog,
  width = 400, height = 400,
  lims = list(x = "same")
)
newCogFn <- function(x) {
  # list(
  #   name1=cog(value1, desc="description")
  # )
}
# test the cognostics function
newCogFn(joinedData[[1]]$value)
makeDisplay(joinedData,
  name = "cognostics_test",
  desc = "Test panel and cognostics function",
  panelFn = newPanelFn,
  cogFn = newCogFn)
view()
Data and solution for UseR! 2015: http://tessera.io/docs-UseR2015/
Variable | Description |
---|---|
medallion | identifier of the individual taxi's license |
hack_license | taxi driver license id |
vendor_id | unique identifier of taxi owner |
rate_code | code based on type of trip |
store_and_fwd_flag | ignore |
pickup_datetime | date and time of trip start |
dropoff_datetime | date and time of trip end |
passenger_count | number of riders |
trip_time_in_secs | total time between pickup and dropoff |
trip_distance | distance covered in miles |
pickup_longitude, pickup_latitude | coordinates for start of trip |
dropoff_longitude, dropoff_latitude | coordinates for end of trip |
payment_type | "CRD" - credit or debit card, "CSH" - cash |
fare_amount | base fare amount |
surcharge | additional charge for peak hours, certain locations |
mta_tax | local tax |
tip_amount | tip paid, may be $0 |
tolls_amount | bridge and tunnel tolls |
total_amount | total amount paid by passengers |