-
Notifications
You must be signed in to change notification settings - Fork 52
Home
The Spark to SQL DB connector enables SQL databases, including Azure SQL Database and SQL Server, to act as input data source or output data sink for Spark jobs.
It allows you to utilize real time transactional data in big data analytics and persist results for adhoc queries or reporting. Comparing to the built-in JDBC connector, the Spark to SQL DB connector provides the ability to bulk insert data into SQL databases. It can outperform row by row insertion with more than 10x faster performance. The Spark to SQL DB connector also support AAD authentication. It allows you securely connect to your Azure SQL databases from Azure Databricks using your AAD account. The Spark to SQL DB connector provides similar interfaces with the built-in JDBC connector. It is very easy to migrate your existing Spark jobs to use this new connector.
To get started, download the Spark to SQL DB connector from the sqldb-spark repository on GitHub.
- Apache Spark - 2.0.2 or later
- Scala - 2.10 or later
- Microsoft JDBC Driver for SQL Server 6.2 or later
The Spark to SQL DB connector utilizes the Microsoft JDBC Driver for SQL Server to move data between Spark worker nodes and SQL databases
The dataflow is as following:
- The Spark master node connect to SQL Server or Azure SQL Databases and load data from a specific table or using a specific SQL query
- Spark master node distribute data to worker nodes for transformation.
- Worker node connect to SQL Server or Azure SQL Databases and write data to the database. User can choose to use row-by-row insertion or bulk insert.
Currently, the connector project uses maven. To build the connector without dependencies, you can run:
mvn clean package
You can also download the latest versions of the JAR from the release folder .
You can connect to Azure SQL Databases using Azure Active Directory (AAD) authentication. Use AAD authentication to centrally manage identities of database users and as an alternative to SQL Server authentication.
The following AAD authentication modes are supported:
- ActiveDirectoryIntegrated
- ActiveDirectoryPassword
- SqlPassword
If you are using the ActiveDirectoryPassword authentication mode you will need azure-activedirectory-library-for-java and its dependencies. See Connecting using ActiveDirectoryPassword Authentication Mode section for more details.
If you are using the ActiveDirectoryIntegrated mode, you will need to install the Active Directory Authentication Library for SQL Server (ADALSQL.DLL) and sqljdbc_auth.dll.
If you are using the access token based authentication mode, you will need azure-activedirectory-library-for-java and its dependencies to run the examples from this article. See Connecting using Access Token section for more details.
The traditional jdbc connector writes data into SQL databases using row-by-row insertion. You can use Spark to SQL DB connector to write data to SQL database using bulk insert. It will significantly improve the write performance when loading large data sets.