---
title: "Query HDFS data: storage pool"
titleSuffix: SQL Server Big Data Clusters
description: This tutorial demonstrates how to query HDFS data in a SQL Server 2019 big data cluster. You create an external table over data in the storage pool and then run a query.
author: WilliamDAssafMSFT
ms.author: wiassaf
ms.reviewer: hudequei
ms.date: 12/13/2019
ms.service: sql
ms.subservice: big-data-cluster
ms.topic: tutorial
---

# Tutorial: Query HDFS data in a SQL Server big data cluster
[!INCLUDESQL Server 2019]
[!INCLUDEbig-data-clusters-banner-retirement]
This tutorial demonstrates how to query HDFS data in a [!INCLUDEbig-data-clusters-2019].
In this tutorial, you learn how to:
[!div class="checklist"]
- Create an external table pointing to HDFS data in a big data cluster.
- Join this data with high-value data in the master instance.
> [!TIP]
> If you prefer, you can download and run a script for the commands in this tutorial. For instructions, see the Data virtualization samples on GitHub.
This 7-minute video walks you through querying HDFS data in a big data cluster:
## Prerequisites

- Big data tools
- kubectl
- Azure Data Studio
- SQL Server 2019 extension
- Load sample data into your big data cluster
## Create an external table to HDFS

The storage pool contains web clickstream data in a CSV file stored in HDFS. Use the following steps to define an external table that can access the data in that file.
1. In Azure Data Studio, connect to the SQL Server master instance of your big data cluster. For more information, see Connect to the SQL Server master instance.
1. Double-click the connection in the **Servers** window to show the server dashboard for the SQL Server master instance. Select **New Query**.
1. Run the following Transact-SQL command to change the context to the **Sales** database in the master instance.

   ```sql
   USE Sales
   GO
   ```
1. Define the format of the CSV file to read from HDFS. Press F5 to run the statement.

   ```sql
   CREATE EXTERNAL FILE FORMAT csv_file
   WITH (
       FORMAT_TYPE = DELIMITEDTEXT,
       FORMAT_OPTIONS(
           FIELD_TERMINATOR = ',',
           STRING_DELIMITER = '"',
           FIRST_ROW = 2,
           USE_TYPE_DEFAULT = TRUE)
   );
   ```
1. Create an external data source to the storage pool if it does not already exist.

   ```sql
   IF NOT EXISTS(SELECT * FROM sys.external_data_sources WHERE name = 'SqlStoragePool')
   BEGIN
       CREATE EXTERNAL DATA SOURCE SqlStoragePool
       WITH (LOCATION = 'sqlhdfs://controller-svc/default');
   END
   ```
1. Create an external table that can read the `/clickstream_data` directory from the storage pool. The `SqlStoragePool` is accessible from the master instance of a big data cluster. (Optional verification queries follow these steps.)

   ```sql
   CREATE EXTERNAL TABLE [web_clickstreams_hdfs]
   ("wcs_click_date_sk" BIGINT,
    "wcs_click_time_sk" BIGINT,
    "wcs_sales_sk" BIGINT,
    "wcs_item_sk" BIGINT,
    "wcs_web_page_sk" BIGINT,
    "wcs_user_sk" BIGINT)
   WITH
   (
       DATA_SOURCE = SqlStoragePool,
       LOCATION = '/clickstream_data',
       FILE_FORMAT = csv_file
   );
   GO
   ```
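Before querying, you can optionally confirm that the objects created in the steps above exist. This check isn't part of the original tutorial; it is a minimal sketch that relies only on the standard catalog views `sys.external_file_formats`, `sys.external_data_sources`, and `sys.external_tables`.

```sql
-- Confirm the external file format, data source, and table were created
SELECT name, format_type FROM sys.external_file_formats WHERE name = 'csv_file';
SELECT name, location FROM sys.external_data_sources WHERE name = 'SqlStoragePool';
SELECT name FROM sys.external_tables WHERE name = 'web_clickstreams_hdfs';
GO

-- Smoke test: read a few rows through the external table
SELECT TOP (10) * FROM [dbo].[web_clickstreams_hdfs];
GO
```

If the smoke test returns rows, the external table is reading the CSV data in HDFS correctly.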
## Query the data

Run the following query to join the HDFS data in the `web_clickstreams_hdfs` external table with the relational data in the local `Sales` database.
```sql
SELECT
    wcs_user_sk,
    SUM( CASE WHEN i_category = 'Books' THEN 1 ELSE 0 END) AS book_category_clicks,
    SUM( CASE WHEN i_category_id = 1 THEN 1 ELSE 0 END) AS [Home & Kitchen],
    SUM( CASE WHEN i_category_id = 2 THEN 1 ELSE 0 END) AS [Music],
    SUM( CASE WHEN i_category_id = 3 THEN 1 ELSE 0 END) AS [Books],
    SUM( CASE WHEN i_category_id = 4 THEN 1 ELSE 0 END) AS [Clothing & Accessories],
    SUM( CASE WHEN i_category_id = 5 THEN 1 ELSE 0 END) AS [Electronics],
    SUM( CASE WHEN i_category_id = 6 THEN 1 ELSE 0 END) AS [Tools & Home Improvement],
    SUM( CASE WHEN i_category_id = 7 THEN 1 ELSE 0 END) AS [Toys & Games],
    SUM( CASE WHEN i_category_id = 8 THEN 1 ELSE 0 END) AS [Movies & TV],
    SUM( CASE WHEN i_category_id = 9 THEN 1 ELSE 0 END) AS [Sports & Outdoors]
FROM [dbo].[web_clickstreams_hdfs]
INNER JOIN item it ON (wcs_item_sk = i_item_sk
    AND wcs_user_sk IS NOT NULL)
GROUP BY wcs_user_sk;
GO
```
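As a follow-on, you could rank users by their clicks in a single category. The variation below is a sketch, not part of the original tutorial; it reuses the same `item` join and the `i_category` column from the query above.

```sql
-- Hypothetical follow-on: the 10 users with the most 'Books' clicks
SELECT TOP (10)
    wcs_user_sk,
    COUNT(*) AS book_clicks
FROM [dbo].[web_clickstreams_hdfs]
INNER JOIN item it ON (wcs_item_sk = i_item_sk
    AND wcs_user_sk IS NOT NULL)
WHERE i_category = 'Books'
GROUP BY wcs_user_sk
ORDER BY book_clicks DESC;
GO
```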
## Clean up

Use the following command to remove the external table used in this tutorial.

```sql
DROP EXTERNAL TABLE [dbo].[web_clickstreams_hdfs];
GO
```
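If you want to confirm the removal, an optional check against the catalog (not in the original tutorial) should return no rows:

```sql
-- Should return no rows once the external table is dropped
SELECT name FROM sys.external_tables WHERE name = 'web_clickstreams_hdfs';
GO
```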
## Next steps

Advance to the next article to learn how to query Oracle from a big data cluster.

> [!div class="nextstepaction"]
> Query external data in Oracle