014.county.Rmd

### Read and Divide by County ###

As we showed previously, we divided the whole data set to subsets by state, and then calculated the mean list
and selling price for each state. According to different purpose of the data analysis, we sometime would
like to divide the data set by different variables. So we are going to show that what if we divide our housing
data by county from the text file as well.

The input keys are line numbers in the housing data text file.  They are the elements of the list
object `map.keys`.  The input values are the lines of text, which are the elements of the list object `map.values`.
This is the default when the input is a raw text file.  For each input key-value pair, `rhcollect` emits an
intermediate key-value pair, where the key is the state name (the third field in the comma-separated
line) and the value is all other fields in the line, stored as a data frame with a single row.

#### Map ####

```{r eval =FALSE, tidy=FALSE}
map3 <- expression({
  lapply(seq_along(map.keys), function(r) {
    line = strsplit(map.values[[r]], ",")[[1]]
    key <- line[1:3]
    value <- as.data.frame(rbind(line[c(-1, -2, -3)]), stringsAsFactors = FALSE)
    names(value) <- c("date", "units", "list", "selling")
    value$list <- as.numeric(value$list)
    value$selling <- as.numeric(value$selling)
    rhcollect(key, value)
  })
})

```

Note that we want to use the FIPS code as a unique identifier for the county, since counties in
different states can share a common name.  This time we've used a character vector of length 3
as the key.  It contains the unique FIPS code, the county name, and the state name.

#### Reduce ####

```{r eval=FALSE, tidy=FALSE}
reduce3 <- expression(
  pre = {
    oneCounty <- data.frame()
  },
  reduce = {
    oneCounty <- rbind(oneCounty, do.call(rbind, reduce.values))
  },
  post = {
    attr(oneCounty, "county") <- reduce.key[2]
    attr(oneCounty, "state") <- reduce.key[3]
    rhcollect(reduce.key[1], oneCounty)
  }
)
```

By removing the FIPS, county, and state columns from the data frame and storing them as
attributes, we've eliminated redundant information in each data frame. Working with massive 
data sets, we want our data to take up the least possible space on disk in order to save 
read/write time. Also we only keep the first element in `reduce.key` which is the `FIPS`
as the key in output key/value pairs.

#### Execution Function ####

```{r eval=FALSE, tidy=FALSE}
mr3 <- rhwatch(
  map      = map3,
  reduce   = reduce3,
  input    = rhfmt("/ln/tongx/housing/housing.txt", type = "text"),
  output   = rhfmt("/ln/tongx/housing/byCounty", type = "sequence"),
  mapred   = list(
    mapred.reduce.tasks = 10
  ),
  readback = FALSE
)
```

After the job completes successfully, we'll read the results from HDFS into our interactive R 
session as we did before.  This time, let's use the `max` argument to `rhread`, which specifies 
how many key-value pairs to read. The default value is -1, which means read in all key-value pairs.

```{r eval=FALSE, tidy=FALSE}
countySubsets <- rhread("/ln/tongx/housing/byCounty", max = 10)
```
```
Read 10 objects(31.39 KB) in 0.04 seconds
```
Suppose we want to see all 10 keys that we read.  Recall that key-value pairs are stored as a nested
list.  So what we want is the first element of each element in the list.  We can use `lapply` to get
them:

```{r eval=FALSE, tidy=FALSE}
keys <- unlist(lapply(countySubsets, "[[", 1))
keys
```
```
 [1] "01013" "01031" "01059" "01077" "01095" "01103" "01121" "04001" "05019" "05037"
```

Finally, let's check that we have the FIPS code, state name, and county name information saved as 
attributes of the data frame.

```{r eval=FALSE, tidy=FALSE}
attributes(countySubsets[[1]][[2]])
```
```
$names
[1] "date"             "units"            "list"             "selling"

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
[33] 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 
[65] 65 66

$state
[1] "AL"

$county
[1] "Butler"

$class
[1] "data.frame"
```