
Commit cff2f49

renamed data_bricks to databricks_ and also updated python package information
1 parent be7812a commit cff2f49

File tree

6 files changed, +9 -32 lines changed


docs/running/databricks.md (+1 -24)
@@ -3,29 +3,6 @@ title: Running on Databricks
 parent: Running Zingg on Cloud
 nav_order: 6
 ---
-There are several ways to run Zingg on Databricks. All [file formats and data sources and sinks](../dataSourcesAndSinks) are supported within Databricks.
+You can run Zingg on Databricks directly using the Databricks notebook interface. All [file formats and data sources and sinks](../dataSourcesAndSinks) are supported within Databricks.
 
-# Running directly within Databricks using the Databricks notebook interface
 This uses the Zingg Python API, and an [example notebook is available here](https://github.com/zinggAI/zingg/blob/main/examples/databricks/FebrlExample.ipynb).
-
-# Running using Databricks Connect from your local machine
-1. Configure Databricks Connect 11.3 and create the corresponding workspace/cluster as per the [Databricks docs](https://docs.databricks.com/dev-tools/databricks-connect-legacy.html). Please make sure that you run `databricks-connect configure`.
-
-Ensure to run databricks-connect configure.
-
-2. Set the env variable ZINGG_HOME to the path where the latest Zingg release jar is, e.g. the location of zingg-0.4.0.jar.
-
-3. Set the env variable DATA_BRICKS_CONNECT to Y.
-
-4. pip install zingg
-
-5. Now run Zingg using the shell script with the --run-databricks option; the Spark session will be created remotely on Databricks and the job will run in your Databricks environment,
-e.g. ./scripts/zingg.sh --run-databricks test/InMemPipeDataBricks.py
-
-Please refer to the [different options](https://docs.zingg.ai/zingg/stepbystep/zingg-command-line) available on the Zingg command line.
-
-
-# Running on Databricks using Spark Submit Jobs
-Zingg is run as a Spark Submit job along with a Python notebook-based labeler created specially to run within the Databricks cloud, since the cloud environment does not have the system console the labeler needs.
-
-Please refer to the [Databricks Zingg tutorial](https://medium.com/@sonalgoyal/identity-resolution-on-databricks-for-customer-360-591661bcafce) for a detailed tutorial.
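
As orientation for the notebook path this diff keeps, here is a minimal sketch of a Databricks notebook cell using the Zingg Python API, modeled loosely on the linked FebrlExample notebook. The field names, schema, model id, and paths are illustrative only, and exact pipe/argument signatures may vary across Zingg versions:

    from zingg.client import Arguments, ClientOptions, Zingg
    from zingg.pipes import CsvPipe, FieldDefinition, MatchType

    # Declare which attributes to compare and how (names are illustrative).
    args = Arguments()
    fname = FieldDefinition("fname", "string", MatchType.FUZZY)
    lname = FieldDefinition("lname", "string", MatchType.FUZZY)
    args.setFieldDefinition([fname, lname])

    # Where the model lives and how training samples are drawn.
    args.setModelId("100")
    args.setZinggDir("/tmp/zingg/models")
    args.setNumPartitions(4)
    args.setLabelDataSampleSize(0.5)

    # Input and output as CSV pipes; any supported source/sink works.
    schema = "id string, fname string, lname string"
    args.setData(CsvPipe("input", "/tmp/zingg/test.csv", schema))
    args.setOutput(CsvPipe("output", "/tmp/zingg/result"))

    # Run a single phase, e.g. findTrainingData / label / train / match.
    options = ClientOptions([ClientOptions.PHASE, "match"])
    Zingg(args, options).initAndExecute()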

python/PKG-INFO (+1 -1)
@@ -5,7 +5,7 @@ Summary: Zingg.ai Entity Resolution
 Home-page: www.zingg.ai
 Author: Zingg.AI
 Author-email: [email protected]
-License: UNKNOWN
+License: AGPL
 Description: ## About Zingg
 
 Zingg is an ML based entity resolution framework. The Python Package is used for building training data, training Zingg models and running the matching and linking processes.

python/README.md (+1 -1)
@@ -4,7 +4,7 @@ Zingg Python APIs for entity resolution, record linkage, data mastering and dedu
 [Zingg.AI](https://www.zingg.ai)
 
 # requirement
-python 3.6+; spark 3.1.2
+python 3.6+; spark 3.5.0
 
 # Installation
 
python/zingg/client.py (+4 -4)
@@ -46,8 +46,8 @@ def initClient():
     global _sqlContext
     global _spark
     if _spark_ctxt is None:
-        DATA_BRICKS_CONNECT = os.getenv('DATA_BRICKS_CONNECT')
-        if DATA_BRICKS_CONNECT=='Y' or DATA_BRICKS_CONNECT=='y':
+        DATABRICKS_CONNECT = os.getenv('DATABRICKS_CONNECT')
+        if DATABRICKS_CONNECT=='Y' or DATABRICKS_CONNECT=='y':
             return initDataBricksConectClient()
         else:
             return initSparkClient()

@@ -130,8 +130,8 @@ def execute(self):
     def initAndExecute(self):
         """ Method to run both init and execute methods consecutively """
         self.client.init()
-        DATA_BRICKS_CONNECT = os.getenv('DATA_BRICKS_CONNECT')
-        if DATA_BRICKS_CONNECT=='Y' or DATA_BRICKS_CONNECT=='y':
+        DATABRICKS_CONNECT = os.getenv('DATABRICKS_CONNECT')
+        if DATABRICKS_CONNECT=='Y' or DATABRICKS_CONNECT=='y':
             options = self.client.getOptions()
             inpPhase = options.get(ClientOptions.PHASE).getValue()
             if (inpPhase==ZinggOptions.LABEL.getValue()):
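
The same DATABRICKS_CONNECT check now appears in both initClient() and initAndExecute(). A possible follow-up, shown only as a sketch and not part of this commit, is to consolidate the lookup in one helper (the helper name is hypothetical):

    import os

    def _use_databricks_connect() -> bool:
        # Mirrors the check above: treat 'Y' or 'y' as enabled.
        return os.getenv('DATABRICKS_CONNECT', '').upper() == 'Y'

Callers would then set the flag before initializing the client, e.g. os.environ['DATABRICKS_CONNECT'] = 'Y' to route through Databricks Connect, or leave it unset for a local Spark session.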

scripts/zingg.sh (+1 -1)
@@ -59,7 +59,7 @@ fi
 if [[ $RUN_PYTHON_DB_CONNECT_PHASE -eq 1 ]]; then
     unset SPARK_MASTER
     unset SPARK_HOME
-    export DATA_BRICKS_CONNECT=Y
+    export DATABRICKS_CONNECT=Y
     python $EXECUTABLE
 else
     # All the additional options must be added here

test/testFebrl/testArgs.py (+1 -1)
@@ -62,7 +62,7 @@ def test_initClient_spark(self):
     def test_initClient_databricks(self):
        global _spark_ctxt
        _spark_ctxt = None
-        os.environ['DATA_BRICKS_CONNECT'] = 'Y'
+        os.environ['DATABRICKS_CONNECT'] = 'Y'
        result = initClient()
        self.assertEqual(result, 1)
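
One caveat worth noting, as a sketch rather than part of this diff: the test sets a process-wide environment variable, so clearing it between tests keeps test_initClient_spark independent of test ordering. Assuming the unittest-style class these methods belong to, a tearDown like the following would do (hypothetical, not in the repo):

    import os

    def tearDown(self):
        # Remove the flag so subsequent tests fall back to the
        # plain Spark client path in initClient().
        os.environ.pop('DATABRICKS_CONNECT', None)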

0 commit comments
