# Accelerate real-time big data analytics with Spark connector for Azure SQL Database and SQL Server

The Spark connector for Azure SQL Database and SQL Server enables these databases to act as input data sources and output data sinks for Spark jobs. It lets you use real-time transactional data in big data analytics and persist results for ad-hoc queries or reporting. Compared to the built-in JDBC connector, this connector can bulk insert data into SQL databases, outperforming row-by-row insertion by 10x to 20x. The connector also supports Azure Active Directory (AAD) authentication, so you can connect securely to your Azure SQL database from Azure Databricks using your AAD account. Its interfaces are similar to those of the built-in JDBC connector, so it is easy to migrate existing Spark jobs to this new connector.

## Download
To get started, download the Spark to SQL DB connector from the [azure-sqldb-spark repository](https://github.com/Azure/azure-sqldb-spark) on GitHub.

## Officially Supported Versions

| Component                             | Version                  |
| :------------------------------------ | :----------------------- |
| Apache Spark                          | 2.0.2 or later           |
| Scala                                 | 2.10 or later            |
| Microsoft JDBC Driver for SQL Server  | 6.2 or later             |
| Microsoft SQL Server                  | SQL Server 2008 or later |
| Azure SQL Database                    | Supported                |

The Spark connector for Azure SQL Database and SQL Server uses the Microsoft JDBC Driver for SQL Server to move data between Spark worker nodes and SQL databases. The data flow is as follows:

1. The Spark master node connects to SQL Server or Azure SQL Database and loads data from a specific table or by using a specific SQL query.
2. The Spark master node distributes the data to worker nodes for transformation.
3. Worker nodes connect to SQL Server or Azure SQL Database and write data to the database. Users can choose between row-by-row insertion and bulk insert, as sketched below.
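
The following is a minimal end-to-end sketch of this flow, using the connector's read and write interfaces described later in this document. The server, database, credentials, and the `City` column are illustrative placeholders, not part of the connector:

```scala
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._
import org.apache.spark.sql.SaveMode

// Placeholder connection details -- replace with your own.
val readConfig = Config(Map(
  "url"          -> "mysqlserver.database.windows.net",
  "databaseName" -> "MyDatabase",
  "dbTable"      -> "dbo.Clients",
  "user"         -> "username",
  "password"     -> "*********"
))

// 1. The master node loads data from the source table.
val clients = sqlContext.read.sqlDB(readConfig)

// 2. Worker nodes transform the data (here, a filter on a hypothetical column).
val seattleClients = clients.filter(clients("City") === "Seattle")

// 3. Worker nodes write the result back to the database.
val writeConfig = Config(Map(
  "url"          -> "mysqlserver.database.windows.net",
  "databaseName" -> "MyDatabase",
  "dbTable"      -> "dbo.SeattleClients",
  "user"         -> "username",
  "password"     -> "*********"
))
seattleClients.write.mode(SaveMode.Append).sqlDB(writeConfig)
```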
### Build the Spark to SQL DB connector
Currently, the connector project uses Maven. To build the connector without dependencies, you can run:

```sh
mvn clean package
```

You can also download the latest version of the JAR from the release folder, then include the SQL DB Spark JAR in your project.
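
For example, one common way to make the JAR available to an interactive Spark session is the standard `--jars` option; the JAR file name below is illustrative, so substitute the version you built or downloaded:

```sh
spark-shell --jars azure-sqldb-spark-1.0.2-jar-with-dependencies.jar
```
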
## Connect Spark to SQL DB using the connector

You can connect to Azure SQL Database or SQL Server from Spark jobs to read or write data. You can also run a DML or DDL query against an Azure SQL database or SQL Server database.

### Read data from Azure SQL Database or SQL Server

```scala
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

val config = Config(Map(
  "url"            -> "mysqlserver.database.windows.net",
  "databaseName"   -> "MyDatabase",
  "dbTable"        -> "dbo.Clients",
  "user"           -> "username",
  "password"       -> "*********",
  "connectTimeout" -> "5", // seconds
  "queryTimeout"   -> "5"  // seconds
))

val collection = sqlContext.read.sqlDB(config)
collection.show()
```

### Read data from Azure SQL Database or SQL Server with a specified SQL query

```scala
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

val config = Config(Map(
  "url"          -> "mysqlserver.database.windows.net",
  "databaseName" -> "MyDatabase",
  "queryCustom"  -> "SELECT TOP 100 * FROM dbo.Clients WHERE PostalCode = 98074", // SQL query
  "user"         -> "username",
  "password"     -> "*********"
))

// Read the results of the custom query
val collection = sqlContext.read.sqlDB(config)
collection.show()
```
### Write data to Azure SQL Database or SQL Server

```scala
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._
import org.apache.spark.sql.SaveMode

// Acquire a DataFrame collection (val collection)

val config = Config(Map(
  "url"          -> "mysqlserver.database.windows.net",
  "databaseName" -> "MyDatabase",
  "dbTable"      -> "dbo.Clients",
  "user"         -> "username",
  "password"     -> "*********"
))

collection.write.mode(SaveMode.Append).sqlDB(config)
```
### Run DML or DDL query in Azure SQL Database or SQL Server

```scala
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.query._

val query = """
              |UPDATE Customers
              |SET ContactName = 'Alfred Schmidt', City = 'Frankfurt'
              |WHERE CustomerID = 1;
            """.stripMargin

val config = Config(Map(
  "url"          -> "mysqlserver.database.windows.net",
  "databaseName" -> "MyDatabase",
  "user"         -> "username",
  "password"     -> "*********",
  "queryCustom"  -> query
))

sqlContext.sqlDBQuery(config)
```
## Connect Spark to Azure SQL Database using AAD authentication
You can connect to Azure SQL Database using Azure Active Directory (AAD) authentication. Use AAD authentication to centrally manage identities of database users and as an alternative to SQL Server authentication.

### Connecting using ActiveDirectoryPassword Authentication Mode

#### Setup requirements

If you are using the ActiveDirectoryPassword authentication mode, you will need to download [azure-activedirectory-library-for-java](https://github.com/AzureAD/azure-activedirectory-library-for-java) and its dependencies and include them in the Java build path.

```scala
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

val config = Config(Map(
  "url"            -> "mysqlserver.database.windows.net",
  "databaseName"   -> "MyDatabase",
  "user"           -> "username",
  "password"       -> "*********",
  "authentication" -> "ActiveDirectoryPassword",
  "encrypt"        -> "true"
))

val collection = sqlContext.read.sqlDB(config)
collection.show()
```
### Connecting using Access Token

#### Setup requirements

If you are using the access token based authentication mode, you will need to download [azure-activedirectory-library-for-java](https://github.com/AzureAD/azure-activedirectory-library-for-java) and its dependencies and include them in the Java build path.

See [Use Azure Active Directory Authentication for authentication with SQL Database](https://docs.microsoft.com/en-us/azure/sql-database/sql-database-aad-authentication) to learn how to get an access token for your Azure SQL database.
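
For illustration, here is a minimal sketch of acquiring a token with the ADAL4J library mentioned above; the authority URL, client ID, and user credentials are placeholder assumptions, and `https://database.windows.net/` is the resource URI for Azure SQL Database:

```scala
import java.util.concurrent.Executors
import com.microsoft.aad.adal4j.AuthenticationContext

// Placeholder values -- replace with your tenant, app registration, and user.
val authority = "https://login.windows.net/YourTenantId"
val resource  = "https://database.windows.net/" // Azure SQL Database resource URI
val clientId  = "your-app-client-id"
val username  = "user@yourtenant.onmicrosoft.com"
val password  = "*********"

val service = Executors.newFixedThreadPool(1)
val context = new AuthenticationContext(authority, true, service)

// Acquire the token with a username/password grant.
val result      = context.acquireToken(resource, clientId, username, password, null).get()
val accessToken = result.getAccessToken
service.shutdown()

// accessToken can now be passed to the connector config shown below.
```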
```scala
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

val config = Config(Map(
  "url"                   -> "mysqlserver.database.windows.net",
  "databaseName"          -> "MyDatabase",
  "accessToken"           -> "access_token",
  "hostNameInCertificate" -> "*.database.windows.net",
  "encrypt"               -> "true"
))

val collection = sqlContext.read.sqlDB(config)
collection.show()
```
## Write data to Azure SQL Database or SQL Server using bulk insert

The traditional JDBC connector writes data into Azure SQL Database or SQL Server using row-by-row insertion. You can use the Spark to SQL DB connector to write data to a SQL database using bulk insert instead, which significantly improves write performance when loading large data sets or loading data into tables that use a columnstore index.

```scala
import com.microsoft.azure.sqldb.spark.bulkcopy.BulkCopyMetadata
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

/**
  Add column metadata.
  If not specified, the metadata is pulled automatically from
  the destination table, which can hurt performance.
*/
var bulkCopyMetadata = new BulkCopyMetadata
bulkCopyMetadata.addColumnMetadata(1, "Title", java.sql.Types.NVARCHAR, 128, 0)
bulkCopyMetadata.addColumnMetadata(2, "FirstName", java.sql.Types.NVARCHAR, 50, 0)
bulkCopyMetadata.addColumnMetadata(3, "LastName", java.sql.Types.NVARCHAR, 50, 0)

val bulkCopyConfig = Config(Map(
  "url"               -> "mysqlserver.database.windows.net",
  "databaseName"      -> "MyDatabase",
  "user"              -> "username",
  "password"          -> "*********",
  "dbTable"           -> "dbo.Clients",
  "bulkCopyBatchSize" -> "2500",
  "bulkCopyTableLock" -> "true",
  "bulkCopyTimeout"   -> "600"
))

df.bulkCopyToSqlDB(bulkCopyConfig, bulkCopyMetadata)
// df.bulkCopyToSqlDB(bulkCopyConfig) if no metadata is specified.
```
## Next steps
If you haven't already, download the Spark connector for Azure SQL Database and SQL Server from the [azure-sqldb-spark GitHub repository](https://github.com/Azure/azure-sqldb-spark) and explore the additional resources in the repo:

- [Sample Azure Databricks notebooks](https://github.com/Azure/azure-sqldb-spark/tree/master/samples/notebooks)
- [Sample scripts (Scala)](https://github.com/Azure/azure-sqldb-spark/tree/master/samples/scripts)

You might also want to review the [Apache Spark SQL, DataFrames, and Datasets Guide](http://spark.apache.org/docs/latest/sql-programming-guide.html) and the [Azure Databricks documentation](https://docs.microsoft.com/en-us/azure/azure-databricks/).