|
4 | 4 | "cell_type": "markdown",
|
5 | 5 | "metadata": {},
|
6 | 6 | "source": [
|
7 | | - "# Basics of Transformations Demo 2" |
| 7 | + "# Basics of Transformations Demo" |
8 | 8 | ]
|
9 | 9 | },
|
10 | 10 | {
|
11 | 11 | "cell_type": "markdown",
|
12 | 12 | "metadata": {},
|
13 | 13 | "source": [
|
14 | | - "As we discussed earlier, there are a wide variety of data transformations available for use on DStreams, most of which are similar to those used on the DStreams' constituent parts.\n", |
| 14 | + "In Spark Streaming, DStreams are treated much like the RDDs that make them up. As with RDDs, a wide variety of data transformations are available.\n", |
15 | 15 | "\n",
|
16 | | - "As a reminder, here is the list of transformations from the previous demo again:\n", |
| 16 | + "Here are some examples of transformations from the Spark documentation that may be useful for your purposes:\n", |
17 | 17 | "\n",
|
18 | 18 | "| Transformation | Meaning |\n",
|
19 | 19 | "| ------------------------------ |:-------------|\n",
|
|
30 | 30 | "| **cogroup**(otherStream, [numTasks])\t| When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples.\n",
|
31 | 31 | "\n",
|
32 | 32 | "\n",
|
33 | | - "If you look at the spark streaming documentation, you will also find the `transform(func)` and `updateStateByKey(func)`. We will discuss these later in the course.\n", |
| 33 | + "If you look at the Spark Streaming documentation, you will also find the `transform(func)` and `updateStateByKey(func)` transformations. We will discuss these later in the course.\n" |
| 34 | + ] |
| 35 | + }, |
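The `cogroup` entry in the table above is the least self-explanatory. It behaves the same way on plain pair RDDs as on the pair DStreams it targets, so here is a minimal sketch on RDDs, assuming a live SparkContext `sc` (the data and variable names are illustrative, not from the notebook):

```python
# Two pair RDDs sharing a key space; cogroup groups the values from both per key.
sales = sc.parallelize([("alice", 3), ("bob", 5), ("alice", 1)])
quotas = sc.parallelize([("alice", 10), ("bob", 8)])

# cogroup yields (key, (iterable of V, iterable of W)); materialize the
# iterables with list() so the result prints readably.
grouped = sales.cogroup(quotas).mapValues(lambda vw: (list(vw[0]), list(vw[1])))
print(grouped.collect())
# e.g. [('alice', ([3, 1], [10])), ('bob', ([5], [8]))]
```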
| 36 | + { |
| 37 | + "cell_type": "markdown", |
| 38 | + "metadata": {}, |
| 39 | + "source": [ |
| 40 | + "### Demo (Part 1)" |
| 41 | + ] |
| 42 | + }, |
| 43 | + { |
| 44 | + "cell_type": "markdown", |
| 45 | + "metadata": { |
| 46 | + "collapsed": true |
| 47 | + }, |
| 48 | + "source": [ |
| 49 | + "We're going to be demoing the map and flatmap functions with respect to DStreams. One important question is \"What is the difference between the two?\"\n", |
| 50 | + "\n", |
| 51 | + "`map`: It returns a new RDD by applying a function to each element of the RDD. Function in map can return only one item. Works with DStreams as well as RDDs\n", |
34 | 52 | "\n",
|
| 53 | + "`flatMap`: Similar to map, it returns a new RDD by applying a function to each element of the RDD, but output is flattened.\n", |
| 54 | + "Also, function in flatMap can return a list of elements (0 or more). Works with DStreams as well as RDDs.\n", |
35 | 55 | "\n",
|
36 | | - "Let's go though another example:\n", |
37 | | - "\n" |
| 56 | + "Here's an example:" |
| 57 | + ] |
| 58 | + }, |
| 59 | + { |
| 60 | + "cell_type": "code", |
| 61 | + "execution_count": null, |
| 62 | + "metadata": { |
| 63 | + "collapsed": true |
| 64 | + }, |
| 65 | + "outputs": [], |
| 66 | + "source": [ |
| 67 | + "sc.parallelize([3,4,5]).map(lambda x: range(1,x)).collect()" |
| 68 | + ] |
| 69 | + }, |
| 70 | + { |
| 71 | + "cell_type": "code", |
| 72 | + "execution_count": null, |
| 73 | + "metadata": { |
| 74 | + "collapsed": true |
| 75 | + }, |
| 76 | + "outputs": [], |
| 77 | + "source": [ |
| 78 | + "sc.parallelize([3,4,5]).flatMap(lambda x: range(1,x)).collect()" |
| 79 | + ] |
| 80 | + }, |
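For reference, under Python 3 the two cells above should behave roughly as follows (a sketch of expected output, not captured notebook output):

```python
sc.parallelize([3, 4, 5]).map(lambda x: range(1, x)).collect()
# -> [range(1, 3), range(1, 4), range(1, 5)]  (one range object per element;
#    under Python 2 these would print as nested lists)

sc.parallelize([3, 4, 5]).flatMap(lambda x: range(1, x)).collect()
# -> [1, 2, 1, 2, 3, 1, 2, 3, 4]  (each element's sequence is flattened out)
```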
| 81 | + { |
| 82 | + "cell_type": "markdown", |
| 83 | + "metadata": {}, |
| 84 | + "source": [ |
| 85 | + "notice o/p is flattened out in a single list" |
| 86 | + ] |
| 87 | + }, |
| 88 | + { |
| 89 | + "cell_type": "markdown", |
| 90 | + "metadata": {}, |
| 91 | + "source": [ |
| 92 | + "Here's Another Example:" |
| 93 | + ] |
| 94 | + }, |
| 95 | + { |
| 96 | + "cell_type": "code", |
| 97 | + "execution_count": null, |
| 98 | + "metadata": { |
| 99 | + "collapsed": true |
| 100 | + }, |
| 101 | + "outputs": [], |
| 102 | + "source": [ |
| 103 | + "sc.parallelize([3,4,5]).map(lambda x: [x, x*x]).collect() " |
| 104 | + ] |
| 105 | + }, |
| 106 | + { |
| 107 | + "cell_type": "code", |
| 108 | + "execution_count": null, |
| 109 | + "metadata": { |
| 110 | + "collapsed": true |
| 111 | + }, |
| 112 | + "outputs": [], |
| 113 | + "source": [ |
| 114 | + "sc.parallelize([3,4,5]).flatMap(lambda x: [x, x*x]).collect() " |
38 | 115 | ]
|
39 | 116 | },
|
40 | 117 | {
|
41 | 118 | "cell_type": "markdown",
|
42 | 119 | "metadata": {},
|
43 | 120 | "source": [
|
44 | | - "### Demo" |
| 121 | + "Notice that the lists are flattened in the latter version." |
45 | 122 | ]
|
46 | 123 | },
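As with the previous pair, a sketch of the expected results of these two cells:

```python
sc.parallelize([3, 4, 5]).map(lambda x: [x, x*x]).collect()
# -> [[3, 9], [4, 16], [5, 25]]  (one two-element list per input element)

sc.parallelize([3, 4, 5]).flatMap(lambda x: [x, x*x]).collect()
# -> [3, 9, 4, 16, 5, 25]        (the per-element lists are flattened together)
```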
|
47 | 124 | {
|
48 | 125 | "cell_type": "markdown",
|
| 126 | + "metadata": {}, |
| 127 | + "source": [ |
| 128 | + "Here's another example, this time interacting with a file, which can often be useful for debugging code that interacts with full DStreams\n", |
| 129 | + "\n", |
| 130 | + "There is a text file `greetings.txt` with following lines:\n", |
| 131 | + "```\n", |
| 132 | + "Good Morning\n", |
| 133 | + "Good Evening\n", |
| 134 | + "Good Day\n", |
| 135 | + "Happy Birthday\n", |
| 136 | + "Happy New Year\n", |
| 137 | + "```" |
| 138 | + ] |
| 139 | + }, |
| 140 | + { |
| 141 | + "cell_type": "code", |
| 142 | + "execution_count": null, |
49 | 143 | "metadata": {
|
50 | 144 | "collapsed": true
|
51 | 145 | },
|
| 146 | + "outputs": [], |
| 147 | + "source": [ |
| 148 | + "lines = sc.textFile(\"greetings.txt\")\n", |
| 149 | + "lines.map(lambda line: line.split()).collect()" |
| 150 | + ] |
| 151 | + }, |
| 152 | + { |
| 153 | + "cell_type": "code", |
| 154 | + "execution_count": null, |
| 155 | + "metadata": { |
| 156 | + "collapsed": true |
| 157 | + }, |
| 158 | + "outputs": [], |
| 159 | + "source": [ |
| 160 | + "lines.flatMap(lambda line: line.split()).collect()" |
| 161 | + ] |
| 162 | + }, |
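The same `flatMap` call carries over to a DStream unchanged, which is the point of prototyping on a file first. A sketch, assuming a StreamingContext built from the existing `sc` and a watched directory whose name is illustrative:

```python
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 10)                       # 10-second batch interval
lines_stream = ssc.textFileStream("greetings_dir")   # watches a directory for new files
words = lines_stream.flatMap(lambda line: line.split())
words.pprint()                                       # prints a sample of each batch
# ssc.start(); ssc.awaitTermination()                # uncomment to actually run the stream
```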
| 163 | + { |
| 164 | + "cell_type": "markdown", |
| 165 | + "metadata": {}, |
| 166 | + "source": [ |
| 167 | + "# Demo (Part 2)" |
| 168 | + ] |
| 169 | + }, |
| 170 | + { |
| 171 | + "cell_type": "markdown", |
| 172 | + "metadata": {}, |
52 | 173 | "source": [
|
53 | 174 | "Last time we went over the `map` and `flapmap` functions. We'll explore a few other options.\n",
|
54 | 175 | "\n",
|
|
57 | 178 | },
|
58 | 179 | {
|
59 | 180 | "cell_type": "code",
|
60 | | - "execution_count": 1, |
61 | | - "metadata": {}, |
62 | | - "outputs": [ |
63 | | - { |
64 | | - "ename": "NameError", |
65 | | - "evalue": "name 'sc' is not defined", |
66 | | - "output_type": "error", |
67 | | - "traceback": [ |
68 | | - "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", |
69 | | - "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", |
70 | | - "\u001b[1;32m<ipython-input-1-8b5aca44da72>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mlines\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0msc\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mparallelize\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m'Its fun to have fun,'\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;34m'but you have to know how.'\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 2\u001b[0m \u001b[1;31m# Suppose then that we want to get wordcounts for this. We can use the map function from before here.\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 3\u001b[0m \u001b[1;31m# map returns a new RDD containing values created by applying the supplied function to each value in the original RDD\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 4\u001b[0m \u001b[1;31m# Here we use a lambda function which replaces some common punctuation characters with spaces and convert to lower\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 5\u001b[0m \u001b[1;31m# case, producing a new RDD:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", |
71 | | - "\u001b[1;31mNameError\u001b[0m: name 'sc' is not defined" |
72 | | - ] |
73 | | - } |
74 | | - ], |
| 181 | + "execution_count": null, |
| 182 | + "metadata": { |
| 183 | + "collapsed": true |
| 184 | + }, |
| 185 | + "outputs": [], |
75 | 186 | "source": [
|
76 | 187 | "scc = Streamingcontext(\"local[2]\",\"PythonSparkApp\", 10)\n",
|
77 | 188 | "\n",
|
|
80 | 191 | "oldwordcount = wordspair.reduceByKey(lambda x,y : x + y)\n",
|
81 | 192 | "lines = scc.socketTextStream(\"192.168.56.101\", 9999)\n",
|
82 | 193 | "\n",
|
83 | | - "lines = sc.parallelize(['Its fun to have fun,','but you have to know how.'])\n", |
84 | 194 | "# Suppose then that we want to get wordcounts for this. We can use the map function from before here. \n",
|
85 | 195 | "# map returns a new RDD containing values created by applying the supplied function to each value in the original RDD\n",
|
86 | 196 | "# Here we use a lambda function which replaces some common punctuation characters with spaces and convert to lower \n",
|
|
133 | 243 | "metadata": {},
|
134 | 244 | "source": [
|
135 | 245 | "# References\n",
|
136 | | - "1. https://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams " |
| 246 | + "1. https://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams\n" |
137 | 247 | ]
|
138 | 248 | },
|
139 | 249 | {
|
|