This repository was archived by the owner on Mar 10, 2025. It is now read-only.

Configuring Power BI Direct Query to Azure Cosmos DB via Apache Spark (HDI)

Denny Lee edited this page Jul 20, 2017 · 7 revisions

A powerfully fun way to visualize your data in Azure Cosmos DB is to use Power BI. While there is an ODBC Driver (refer to Connect to Azure Cosmos DB using BI analytics tools with the ODBC driver), this method requires you to download all of the data from Azure Cosmos DB into Power BI.

To work around this issue, one technique is to use the azure-cosmosdb-spark connector, which lets you use Apache Spark as the bridge between Power BI and Azure Cosmos DB. Power BI has direct query capabilities to Apache Spark, so with the azure-cosmosdb-spark connector you can create direct connectivity from Power BI to Azure Cosmos DB.

Power BI DQ > Spark > Azure Cosmos DB

Note: these are alpha working instructions. Over time, we will simplify the steps so that this is easier to configure.

Setup

You will need the following components

Configuration

The key configuration step is copying the azure-cosmosdb-spark JARs to the worker nodes on your HDI cluster.

Getting the azure-cosmosdb-spark JARs

To get the JARs, build the code using `mvn clean package` or download them from the releases folder. As of this writing, the latest version of the JARs can be found in azure-cosmosdb-spark-0.0.3_2.0.2_2.11.

Grab these JARs and be prepared to upload them to your HDI cluster worker nodes.

Copying the JARs to your HDI cluster worker nodes

The goal here is to copy the azure-cosmosdb-spark JARs to the `/usr/hdp/current/spark2-client/jars` directory on the worker and head nodes of your cluster.
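As a rough sketch of that copy step, the Python helper below builds one `scp` command per node. It only prints the commands; it does not run them. The user name `sshuser`, the IP addresses, and the file name `example-connector.jar` are hypothetical placeholders, and the sketch assumes your SSH user can write to the target directory (if not, you would likely need to copy to a home directory first and move the files with elevated permissions).

```python
import shlex

# Target directory named in the text above.
DEST_DIR = "/usr/hdp/current/spark2-client/jars"


def build_scp_commands(jars, hosts, user, dest_dir=DEST_DIR):
    """Return one scp command string per host, copying every JAR to dest_dir."""
    commands = []
    for host in hosts:
        # scp accepts several source files followed by a single destination.
        argv = ["scp", *jars, f"{user}@{host}:{dest_dir}/"]
        commands.append(shlex.join(argv))  # shlex.join needs Python 3.8+
    return commands


if __name__ == "__main__":
    # Hypothetical placeholders: substitute your own JAR paths, node IPs, and SSH user.
    jars = ["example-connector.jar"]
    hosts = ["10.0.0.4", "10.0.0.5"]
    for command in build_scp_commands(jars, hosts, user="sshuser"):
        print(command)
```

Printing the commands first lets you review them before running anything against the cluster.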

Obtain the head and worker node IP addresses

To get this information, log in to your Azure HDI cluster and note down the list of your head and worker nodes. To do this, first log in to the Azure Portal and open your HDI cluster, as shown in the image below.

Click Cluster Dashboard.

From here, click HDInsight Cluster Dashboard.

Then click Hosts to see the list of head nodes (prefix of hn) and worker nodes (prefix of wn).
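If you copy the host list out of the dashboard as text, a small helper like the following could sort it by the hn and wn prefixes described above. This is a minimal sketch: the host names in the example are made up for illustration, and the "other" bucket simply catches any entry that matches neither prefix.

```python
def split_hosts(hostnames):
    """Group host names into head nodes (hn prefix) and worker nodes (wn prefix)."""
    groups = {"head": [], "worker": [], "other": []}
    for name in hostnames:
        prefix = name.strip().lower()[:2]
        if prefix == "hn":
            groups["head"].append(name.strip())
        elif prefix == "wn":
            groups["worker"].append(name.strip())
        else:
            # Anything without an hn or wn prefix is kept separately.
            groups["other"].append(name.strip())
    return groups


if __name__ == "__main__":
    # Hypothetical host names for illustration only.
    hosts = ["hn0-example", "hn1-example", "wn0-example", "wn1-example", "xx0-example"]
    for role, names in split_hosts(hosts).items():
        print(role, names)
```

The head and worker lists together are the nodes that need the JARs.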
