Documentation for using Pyspark and Dataframes

RussellSpitzer · RussellSpitzer · commit 96da070bf498 · 2015-06-30T19:12:06.000-07:00
diff --git a/README.md b/README.md
@@ -29,6 +29,7 @@ execute arbitrary CQL queries in your Spark applications.
  - Filters rows on the server side via the CQL `WHERE` clause 
  - Allows for execution of arbitrary CQL statements
  - Plays nice with Cassandra Virtual Nodes
+ - Works with PySpark DataFrames
 
 ## Version Compatibility
 
@@ -75,6 +76,7 @@ See [Building And Artifacts](doc/12_building_and_artifacts.md)
   - [Building And Artifacts](doc/12_building_and_artifacts.md)
   - [The Spark Shell](doc/13_spark_shell.md)
   - [DataFrames](doc/14_data_frames.md)
+  - [Python](doc/15_python.md)
   - [Frequently Asked Questions](doc/FAQ.md)
     
 ## Community
diff --git a/doc/14_data_frames.md b/doc/14_data_frames.md
@@ -144,4 +144,6 @@ df.write
   .format("org.apache.spark.sql.cassandra")
   .options(Map( "table" -> "words_copy", "keyspace" -> "test"))
   .save()
-```
+```
+
+[Next - Python DataFrames](15_python.md) 
diff --git a/doc/15_python.md b/doc/15_python.md
@@ -0,0 +1,46 @@
+# Documentation
+
+## PySpark with DataFrames - Experimental
+
+With the inclusion of the Cassandra Data Source, PySpark can now be used with the Connector to 
+access Cassandra data. This does not require DataStax Enterprise, but you are limited to
+DataFrame-only operations.
+
+### Setup
+
+To enable Cassandra access, the Spark Cassandra Connector assembly jar must be included on both the
+driver and executor classpaths for the PySpark Java Gateway. To do this, start the PySpark shell
+the same way the Spark shell is started.
+
+```bash
+./bin/pyspark \
+  --driver-class-path spark-cassandra-connector-assembly-1.4.0-M1-SNAPSHOT.jar \
+  --jars spark-cassandra-connector-assembly-1.4.0-M1-SNAPSHOT.jar
+```
+
+### Loading a DataFrame in Python
+
+A DataFrame linked to a Cassandra table can be created by using the `org.apache.spark.sql.cassandra`
+source and specifying keyword arguments for `keyspace` and `table`.
+
+```python
+sqlContext.read\
+    .format("org.apache.spark.sql.cassandra")\
+    .options(table="kv", keyspace="test")\
+    .load().show()
+```
+
+```
++-+-+
+|k|v|
++-+-+
+|5|5|
+|1|1|
+|2|2|
+|4|4|
+|3|3|
++-+-+
+```
+
+The options and parameters are identical to those of the Scala DataFrames API, so
+see [DataFrames](14_data_frames.md) for more information.