
Aggregations Examples


Below are some examples of how you can perform distributed aggregations and analytics using Apache Spark and Azure DocumentDB together. Note that Azure DocumentDB already has support for aggregations (link to blog goes here); here is how you can take it to the next level with Apache Spark.

Note that these aggregations refer to the Spark to DocumentDB Connector Notebook.

Connecting to Flights Sample Data

For these aggregation examples, we access flight performance data stored in the DepartureDelays database of the DoctorWho DocumentDB account. To connect to it, use the following code snippet:

// Import Spark to DocumentDB Connector
import com.microsoft.azure.documentdb.spark.schema._
import com.microsoft.azure.documentdb.spark._
import com.microsoft.azure.documentdb.spark.config.Config

// Connect to DocumentDB Database
val readConfig2 = Config(Map(
  "Endpoint" -> "https://doctorwho.documents.azure.com:443/",
  "Masterkey" -> "le1n99i1w5l7uvokJs3RT5ZAH8dc3ql7lx2CG0h0kK4lVWPkQnwpRLyAN0nwS1z4Cyd1lJgvGUfMWR3v8vkXKA==",
  "Database" -> "DepartureDelays",
  "preferredRegions" -> "Central US;East US 2;",
  "Collection" -> "flights_pcoll",
  "SamplingRatio" -> "1.0"))

// Create collection connection 
val coll = spark.sqlContext.read.DocumentDB(readConfig2)
coll.createOrReplaceTempView("c")

With this, we will also run a base query that transfers the filtered set of data we want from DocumentDB to Spark (where the latter can perform distributed aggregations). In this case, we ask for flights departing from Seattle (SEA).

// Run, get row count, and time query
val originSEA = spark.sql("SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c WHERE c.origin = 'SEA'")
originSEA.createOrReplaceTempView("originSEA")
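
As noted in the comment above, we can also get the row count and a rough timing for this base query. A minimal sketch, assuming Spark 2.1+ (where spark.time is available):

// Materialize the filtered data in Spark, count the rows, and time the operation
spark.time(originSEA.count())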

The results below are from running the queries using the Jupyter notebook service. Note that all of these code snippets are generic and not specific to any service.

Running LIMIT and COUNT queries

Just like you're used to in SQL / Spark SQL, let's start off with a LIMIT query:
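For example, a simple LIMIT query against the collection view might look like the following sketch (the exact query in the notebook may differ):

// Return the first 10 documents from the collection view
spark.sql("SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c LIMIT 10").show()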

The next query is a simple and fast COUNT query:
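A sketch of such a COUNT query against the same view:

// Count the number of documents in the collection view
spark.sql("SELECT COUNT(1) FROM c").show()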

GROUP BY queries

In this next set, we can easily run GROUP BY queries against our DocumentDB database:

select destination, sum(delay) as TotalDelays 
from originSEA 
group by destination 
order by sum(delay) desc limit 10
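
To execute this GROUP BY from the Scala notebook and display the results, a minimal sketch using the originSEA view defined earlier:

// Top 10 destinations by total delay for flights originating from Seattle
spark.sql("""
  SELECT destination, SUM(delay) AS TotalDelays
  FROM originSEA
  GROUP BY destination
  ORDER BY SUM(delay) DESC
  LIMIT 10""").show()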