title | titleSuffix | description | author | ms.author | ms.reviewer | ms.date | ms.service | ms.subservice | ms.topic | monikerRange |
---|---|---|---|---|---|---|---|---|---|---|
Tutorial: Develop a clustering model in R |
SQL Machine Learning |
In this four-part tutorial series, you'll develop a model to perform clustering in R with SQL machine learning. |
VanMSFT |
vanto |
garye, jroth |
05/29/2024 |
sql |
machine-learning |
tutorial |
>=sql-server-2016||>=sql-server-linux-ver15||=azuresqldb-mi-current |
[!INCLUDE SQL Server 2016 SQL MI]
::: moniker range=">=sql-server-ver15||>=sql-server-linux-ver15" In this four-part tutorial series, you'll use R to develop and deploy a K-Means clustering model in SQL Server Machine Learning Services or on Big Data Clusters to categorize customer data. ::: moniker-end ::: moniker range="=sql-server-2017" In this four-part tutorial series, you'll use R to develop and deploy a K-Means clustering model in SQL Server Machine Learning Services to cluster customer data. ::: moniker-end ::: moniker range="=sql-server-2016" In this four-part tutorial series, you'll use R to develop and deploy a K-Means clustering model in SQL Server R Services to cluster customer data. ::: moniker-end ::: moniker range="=azuresqldb-mi-current" In this four-part tutorial series, you'll use R to develop and deploy a K-Means clustering model in Azure SQL Managed Instance Machine Learning Services to cluster customer data. ::: moniker-end
In part one of this series, you'll set up the prerequisites for the tutorial and then restore a sample dataset to a database. In parts two and three, you'll develop some R scripts in an Azure Data Studio notebook to analyze and prepare this sample data and train a machine learning model. Then, in part four, you'll run those R scripts inside a database using stored procedures.
Clustering can be explained as organizing data into groups where members of a group are similar in some way. For this tutorial series, imagine you own a retail business. You'll use the K-Means algorithm to perform the clustering of customers in a dataset of product purchases and returns. By clustering customers, you can focus your marketing efforts more effectively by targeting specific groups. K-Means clustering is an unsupervised learning algorithm that looks for patterns in data based on similarities.
In this article, you'll learn how to:
[!div class="checklist"]
- Restore a sample database
In part two, you'll learn how to prepare the data from a database to perform clustering.
In part three, you'll learn how to create and train a K-Means clustering model in R.
In part four, you'll learn how to create a stored procedure in a database that can perform clustering in R based on new data.
::: moniker range=">=sql-server-ver15||>=sql-server-linux-ver15"
-
SQL Server Machine Learning Services with the Python language option - Follow the installation instructions in the Windows installation guide or the Linux installation guide. You can also enable Machine Learning Services on SQL Server Big Data Clusters. ::: moniker-end ::: moniker range="=sql-server-2017"
-
SQL Server Machine Learning Services with the R language option - Follow the installation instructions in the Windows installation guide. ::: moniker-end ::: moniker range="=azuresqldb-mi-current"
-
Azure SQL Managed Instance Machine Learning Services. For information, see the Azure SQL Managed Instance Machine Learning Services overview.
-
SQL Server Management Studio (SSMS) - Use SSMS to restore the sample database to Azure SQL Managed Instance. To download, see SQL Server Management Studio. ::: moniker-end
-
Azure Data Studio. You'll use a notebook in Azure Data Studio for SQL. For more information about notebooks, see How to use notebooks in Azure Data Studio.
-
R IDE - This tutorial uses RStudio Desktop.
-
RODBC - This driver is used in the R scripts you'll develop in this tutorial. If it's not already installed, install it using the R command
install.packages("RODBC")
. For more information on RODBC, see CRAN - Package RODBC.
The sample dataset used in this tutorial has been saved to a .bak
database backup file for you to download and use. This dataset is derived from the tpcx-bb dataset provided by the Transaction Processing Performance Council (TPC).
::: moniker range=">=sql-server-ver15||>=sql-server-linux-ver15"
Note
If you are using Machine Learning Services on Big Data Clusters, see how to Restore a database into the SQL Server big data cluster master instance. ::: moniker-end
::: moniker range=">=sql-server-2017||>=sql-server-linux-ver15"
-
Download the file tpcxbb_1gb.bak.
-
Follow the directions in Restore a database from a backup file in Azure Data Studio, using these details:
- Import from the
tpcxbb_1gb.bak
file you downloaded. - Name the target database
tpcxbb_1gb
.
- Import from the
-
You can verify that the dataset exists after you have restored the database by querying the
dbo.customer
table:USE tpcxbb_1gb; SELECT * FROM [dbo].[customer];
::: moniker-end ::: moniker range="=azuresqldb-mi-current"
-
Download the file tpcxbb_1gb.bak.
-
Follow the directions in Restore a database to a Managed Instance in SQL Server Management Studio, using these details:
- Import from the
tpcxbb_1gb.bak
file you downloaded. - Name the target database
tpcxbb_1gb
.
- Import from the
-
You can verify that the dataset exists after you have restored the database by querying the
dbo.customer
table:USE tpcxbb_1gb; SELECT * FROM [dbo].[customer];
::: moniker-end
If you're not going to continue with this tutorial, delete the tpcxbb_1gb
database.
In part one of this tutorial series, you completed these steps:
- Installed the prerequisites
- Restored a sample database
To prepare the data for the machine learning model, follow part two of this tutorial series:
[!div class="nextstepaction"] Prepare data to perform clustering