Skip to content

Latest commit

 

History

History
141 lines (101 loc) · 8.7 KB

python-clustering-model.md

File metadata and controls

141 lines (101 loc) · 8.7 KB
title titleSuffix description author ms.author ms.reviewer ms.date ms.service ms.subservice ms.topic ms.devlang monikerRange
Python tutorial: Categorize customers
SQL machine learning
In this four-part tutorial series, the goal is to cluster customers, using K-Means, in a database using Python with SQL machine learning.
VanMSFT
vanto
monamaki, randolphwest, anmunde
05/29/2024
sql
machine-learning
tutorial
python
>=sql-server-2017||>=sql-server-linux-ver15||=azuresqldb-mi-current

Python tutorial: Categorizing customers using k-means clustering with SQL machine learning

[!INCLUDE SQL Server 2017 SQL MI]

::: moniker range=">=sql-server-ver15||>=sql-server-linux-ver15" In this four-part tutorial series, use Python to develop and deploy a K-Means clustering model in SQL Server Machine Learning Services or on Big Data Clusters to categorize customer data. ::: moniker-end ::: moniker range="=sql-server-2017" In this four-part tutorial series, use Python to develop and deploy a K-Means clustering model in SQL Server Machine Learning Services to cluster customer data. ::: moniker-end ::: moniker range="=azuresqldb-mi-current" In this four-part tutorial series, use Python to develop and deploy a K-Means clustering model in Azure SQL Managed Instance Machine Learning Services to cluster customer data. ::: moniker-end

In part one of this series, set up the prerequisites for the tutorial and then restore a sample dataset to a database. Later in this series, use this data to train and deploy a clustering model in Python with SQL machine learning.

In parts two and three of this series, develop some Python scripts in an Azure Data Studio notebook to analyze and prepare your data and train a machine learning model. Then, in part four, run those Python scripts inside a database using stored procedures.

Clustering can be explained as organizing data into groups where members of a group are similar in some way. For this tutorial series, imagine you own a retail business. Use the K-Means algorithm to perform the clustering of customers in a dataset of product purchases and returns. By clustering customers, you can focus your marketing efforts more effectively by targeting specific groups. K-Means clustering is an unsupervised learning algorithm that looks for patterns in data based on similarities.

In this article, learn how to:

[!div class="checklist"]

  • Restore a sample database

In part two, learn how to prepare the data from a database to perform clustering.

In part three, learn how to create and train a K-Means clustering model in Python.

In part four, learn how to create a stored procedure in a database that can perform clustering in Python based on new data.

Prerequisites

::: moniker range=">=sql-server-ver16||>=sql-server-linux-ver16"

::: moniker range=">=sql-server-ver15" Run the following icacls commands to grant READ & EXECUTE access to the installed libraries to SQL Server Launchpad Service and SID S-1-15-2-1 (ALL_APPLICATION_PACKAGES).

  icacls "C:\Program Files\Python310\Lib\site-packages" /grant "NT Service\MSSQLLAUNCHPAD":(OI)(CI)RX /T
  icacls "C:\Program Files\Python310\Lib\site-packages" /grant *S-1-15-2-1:(OI)(CI)RX /T

::: moniker-end

Restore the sample database

The sample dataset used in this tutorial has been saved to a .bak database backup file for you to download and use. This dataset is derived from the tpcx-bb dataset provided by the Transaction Processing Performance Council (TPC).

::: moniker range=">=sql-server-ver15||>=sql-server-linux-ver15"

Note

If you are using Machine Learning Services on Big Data Clusters, see how to Restore a database into the SQL Server big data cluster master instance. ::: moniker-end

::: moniker range=">=sql-server-2017||>=sql-server-linux-ver15"

  1. Download the file tpcxbb_1gb.bak.

  2. Follow the directions in Restore a database from a backup file in Azure Data Studio, using these details:

    • Import from the tpcxbb_1gb.bak file you downloaded.
    • Name the target database tpcxbb_1gb.
  3. You can verify that the dataset exists after you have restored the database by querying the dbo.customer table:

    USE tpcxbb_1gb;
    SELECT * FROM [dbo].[customer];

::: moniker-end ::: moniker range="=azuresqldb-mi-current"

  1. Download the file tpcxbb_1gb.bak.

  2. Follow the directions in Restore a database to a SQL Managed Instance in SQL Server Management Studio, using these details:

    • Import from the tpcxbb_1gb.bak file you downloaded.
    • Name the target database tpcxbb_1gb.
  3. You can verify that the dataset exists after you have restored the database by querying the dbo.customer table:

    USE tpcxbb_1gb;
    SELECT * FROM [dbo].[customer];

::: moniker-end

Clean up resources

If you're not going to continue with this tutorial, delete the tpcxbb_1gb database.

Next step

In part one of this tutorial series, you completed these steps:

  • Restore a sample database

To prepare the data for the machine learning model, follow part two of this tutorial series:

[!div class="nextstepaction"] Python tutorial: Prepare data to perform clustering