Commit 96da070

Documentation for using Pyspark and Dataframes
1 parent 55b3ca9 commit 96da070

3 files changed: +51 -1 lines changed

README.md

+2
@@ -29,6 +29,7 @@ execute arbitrary CQL queries in your Spark applications.
 - Filters rows on the server side via the CQL `WHERE` clause
 - Allows for execution of arbitrary CQL statements
 - Plays nice with Cassandra Virtual Nodes
+- Works with PySpark DataFrames
 
 ## Version Compatibility
 
@@ -75,6 +76,7 @@ See [Building And Artifacts](doc/12_building_and_artifacts.md)
 - [Building And Artifacts](doc/12_building_and_artifacts.md)
 - [The Spark Shell](doc/13_spark_shell.md)
 - [DataFrames](doc/14_data_frames.md)
+- [Python](doc/15_python.md)
 - [Frequently Asked Questions](doc/FAQ.md)
 
 ## Community

doc/14_data_frames.md

+3 -1
@@ -144,4 +144,6 @@ df.write
   .format("org.apache.spark.sql.cassandra")
   .options(Map( "table" -> "words_copy", "keyspace" -> "test"))
   .save()
-```
+```
+
+[Next - Python DataFrames](15_python.md)

doc/15_python.md

+46
@@ -0,0 +1,46 @@
+# Documentation
+
+## PySpark with Data Frames - Experimental
+
+With the inclusion of the Cassandra Data Source, PySpark can now be used with the Connector to
+access Cassandra data. This does not require DataStax Enterprise, but you are limited to
+DataFrame-only operations.
+
+### Setup
+
+To enable Cassandra access, the Spark Cassandra Connector assembly jar must be included on both the
+driver and executor classpath for the PySpark Java Gateway. This can be done by starting the PySpark
+shell similarly to how the Spark shell is started.
+
+```bash
+./bin/pyspark \
+  --driver-class-path spark-cassandra-connector-assembly-1.4.0-M1-SNAPSHOT.jar \
+  --jars spark-cassandra-connector-assembly-1.4.0-M1-SNAPSHOT.jar
+```
+
+### Loading a DataFrame in Python
+
+A DataFrame that links to Cassandra can be created by using the `org.apache.spark.sql.cassandra`
+source and specifying keyword arguments for `keyspace` and `table`.
+
+```python
+sqlContext.read\
+  .format("org.apache.spark.sql.cassandra")\
+  .options(table="kv", keyspace="test")\
+  .load().show()
+```
+
+```
++-+-+
+|k|v|
++-+-+
+|5|5|
+|1|1|
+|2|2|
+|4|4|
+|3|3|
++-+-+
+```
+
+The options and parameters are identical to the Scala Data Frames API, so
+please see [Data Frames](14_data_frames.md) for more information.
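
The load pattern documented in `doc/15_python.md` could be wrapped in a small helper so that the source name and the two options live in one place. The following is only a minimal sketch and is not part of this commit: `CASSANDRA_SOURCE`, `cassandra_options`, and `load_cassandra_table` are hypothetical names, and `sql_context` is assumed to be a PySpark `SQLContext` from a shell started with the connector assembly jar on the classpath as described under Setup.

```python
# Hypothetical helpers; neither the connector nor this commit provides them.
CASSANDRA_SOURCE = "org.apache.spark.sql.cassandra"


def cassandra_options(keyspace, table):
    """Build the keyword options named in the doc: `keyspace` and `table`."""
    return {"keyspace": keyspace, "table": table}


def load_cassandra_table(sql_context, keyspace, table):
    """Return a DataFrame linked to the given Cassandra table.

    Assumes `sql_context` exposes the DataFrameReader as `.read`
    (Spark 1.4 or later), exactly as in the documented example.
    """
    return (
        sql_context.read
        .format(CASSANDRA_SOURCE)
        .options(**cassandra_options(keyspace, table))
        .load()
    )
```

In the PySpark shell, `load_cassandra_table(sqlContext, "test", "kv").show()` would then issue the same read as the documented snippet.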
