|
4 | 4 | "cell_type": "markdown",
|
5 | 5 | "metadata": {},
|
6 | 6 | "source": [
|
7 | | - "# Basics of Transformations Demo 2" |
| 7 | + "# Basics of Transformations Demo" |
8 | 8 | ]
|
9 | 9 | },
|
10 | 10 | {
|
11 | 11 | "cell_type": "markdown",
|
12 | 12 | "metadata": {},
|
13 | 13 | "source": [
|
14 | | - "As we discussed earlier, there are a wide variety of data transformations available for use on DStreams, most of which are similar to those used on the DStreams' constituent parts.\n", |
| 14 | + "In Spark Streaming, DStreams are treated much like the RDDs that make them up. As with RDDs, a wide variety of data transformations are available.\n", |
15 | 15 | "\n",
|
16 | | - "As a reminder, here is the list of transformations from the previous demo again:\n", |
| 16 | + "Here are some examples of transformations from the Spark documentation that may be useful for your purposes:\n", |
17 | 17 | "\n",
|
18 | 18 | "| Transformation | Meaning |\n",
|
19 | 19 | "| ------------------------------ |:-------------|\n",
|
|
30 | 30 | "| **cogroup**(otherStream, [numTasks])\t| When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples.\n",
|
31 | 31 | "\n",
|
32 | 32 | "\n",
|
33 | | - "If you look at the spark streaming documentation, you will also find the `transform(func)` and `updateStateByKey(func)`. We will discuss these later in the course.\n", |
| 33 | + "If you look at the Spark Streaming documentation, you will also find the `transform(func)` and `updateStateByKey(func)` transformations. We will discuss these later in the course.\n" |
| 34 | + ] |
| 35 | + }, |
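The `cogroup` entry in the table above is the least self-explanatory. It behaves the same way on plain pair RDDs as on the pair DStreams it targets, so here is a minimal sketch on RDDs, assuming a live SparkContext `sc` (the data and variable names are illustrative, not from the notebook):

```python
# Two pair RDDs sharing a key space; cogroup groups the values from both per key.
sales = sc.parallelize([("alice", 3), ("bob", 5), ("alice", 1)])
quotas = sc.parallelize([("alice", 10), ("bob", 8)])

# cogroup yields (key, (iterable of V, iterable of W)); materialize the
# iterables with list() so the result prints readably.
grouped = sales.cogroup(quotas).mapValues(lambda vw: (list(vw[0]), list(vw[1])))
print(grouped.collect())
# e.g. [('alice', ([3, 1], [10])), ('bob', ([5], [8]))]
```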
| 36 | + { |
| 37 | + "cell_type": "markdown", |
| 38 | + "metadata": {}, |
| 39 | + "source": [ |
| 40 | + "### Demo (Part 1)" |
| 41 | + ] |
| 42 | + }, |
| 43 | + { |
| 44 | + "cell_type": "markdown", |
| 45 | + "metadata": { |
| 46 | + "collapsed": true |
| 47 | + }, |
| 48 | + "source": [ |
| 49 | + "We're going to be demoing the map and flatmap functions with respect to DStreams. One important question is \"What is the difference between the two?\"\n", |
| 50 | + "\n", |
| 51 | + "`map`: It returns a new RDD by applying a function to each element of the RDD. Function in map can return only one item. Works with DStreams as well as RDDs\n", |
34 | 52 | "\n",
|
| 53 | + "`flatMap`: Similar to map, it returns a new RDD by applying a function to each element of the RDD, but output is flattened.\n", |
| 54 | + "Also, function in flatMap can return a list of elements (0 or more). Works with DStreams as well as RDDs.\n", |
35 | 55 | "\n",
|
36 | | - "Let's go though another example:\n", |
37 | | - "\n" |
| 56 | + "Here's an example:" |
| 57 | + ] |
| 58 | + }, |
| 59 | + { |
| 60 | + "cell_type": "code", |
| 61 | + "execution_count": null, |
| 62 | + "metadata": { |
| 63 | + "collapsed": true |
| 64 | + }, |
| 65 | + "outputs": [], |
| 66 | + "source": [ |
| 67 | + "sc.parallelize([3,4,5]).map(lambda x: range(1,x)).collect()" |
| 68 | + ] |
| 69 | + }, |
| 70 | + { |
| 71 | + "cell_type": "code", |
| 72 | + "execution_count": null, |
| 73 | + "metadata": { |
| 74 | + "collapsed": true |
| 75 | + }, |
| 76 | + "outputs": [], |
| 77 | + "source": [ |
| 78 | + "sc.parallelize([3,4,5]).flatMap(lambda x: range(1,x)).collect()" |
| 79 | + ] |
| 80 | + }, |
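For reference, under Python 3 the two cells above should behave roughly as follows (a sketch of expected output, not captured notebook output):

```python
sc.parallelize([3, 4, 5]).map(lambda x: range(1, x)).collect()
# -> [range(1, 3), range(1, 4), range(1, 5)]  (one range object per element;
#    under Python 2 these would print as nested lists)

sc.parallelize([3, 4, 5]).flatMap(lambda x: range(1, x)).collect()
# -> [1, 2, 1, 2, 3, 1, 2, 3, 4]  (each element's sequence is flattened out)
```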
| 81 | + { |
| 82 | + "cell_type": "markdown", |
| 83 | + "metadata": {}, |
| 84 | + "source": [ |
| 85 | + "notice o/p is flattened out in a single list" |
| 86 | + ] |
| 87 | + }, |
| 88 | + { |
| 89 | + "cell_type": "markdown", |
| 90 | + "metadata": {}, |
| 91 | + "source": [ |
| 92 | + "Here's Another Example:" |
| 93 | + ] |
| 94 | + }, |
| 95 | + { |
| 96 | + "cell_type": "code", |
| 97 | + "execution_count": null, |
| 98 | + "metadata": { |
| 99 | + "collapsed": true |
| 100 | + }, |
| 101 | + "outputs": [], |
| 102 | + "source": [ |
| 103 | + "sc.parallelize([3,4,5]).map(lambda x: [x, x*x]).collect() " |
| 104 | + ] |
| 105 | + }, |
| 106 | + { |
| 107 | + "cell_type": "code", |
| 108 | + "execution_count": null, |
| 109 | + "metadata": { |
| 110 | + "collapsed": true |
| 111 | + }, |
| 112 | + "outputs": [], |
| 113 | + "source": [ |
| 114 | + "sc.parallelize([3,4,5]).flatMap(lambda x: [x, x*x]).collect() " |
38 | 115 | ]
|
39 | 116 | },
|
40 | 117 | {
|
41 | 118 | "cell_type": "markdown",
|
42 | 119 | "metadata": {},
|
43 | 120 | "source": [
|
44 | | - "### Demo" |
| 121 | + "Notice that the lists are flattened in the latter version." |
45 | 122 | ]
|
46 | 123 | },
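As with the previous pair, a sketch of the expected results of these two cells:

```python
sc.parallelize([3, 4, 5]).map(lambda x: [x, x*x]).collect()
# -> [[3, 9], [4, 16], [5, 25]]  (one two-element list per input element)

sc.parallelize([3, 4, 5]).flatMap(lambda x: [x, x*x]).collect()
# -> [3, 9, 4, 16, 5, 25]        (the per-element lists are flattened together)
```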
|
47 | 124 | {
|
48 | 125 | "cell_type": "markdown",
|
| 126 | + "metadata": {}, |
| 127 | + "source": [ |
| 128 | + "Here's another example, this time interacting with a file, which can often be useful for debugging code that interacts with full DStreams\n", |
| 129 | + "\n", |
| 130 | + "There is a text file `greetings.txt` with following lines:\n", |
| 131 | + "```\n", |
| 132 | + "Good Morning\n", |
| 133 | + "Good Evening\n", |
| 134 | + "Good Day\n", |
| 135 | + "Happy Birthday\n", |
| 136 | + "Happy New Year\n", |
| 137 | + "```" |
| 138 | + ] |
| 139 | + }, |
| 140 | + { |
| 141 | + "cell_type": "code", |
| 142 | + "execution_count": null, |
49 | 143 | "metadata": {
|
50 | 144 | "collapsed": true
|
51 | 145 | },
|
| 146 | + "outputs": [], |
| 147 | + "source": [ |
| 148 | + "lines = sc.textFile(\"greetings.txt\")\n", |
| 149 | + "lines.map(lambda line: line.split()).collect()" |
| 150 | + ] |
| 151 | + }, |
| 152 | + { |
| 153 | + "cell_type": "code", |
| 154 | + "execution_count": null, |
| 155 | + "metadata": { |
| 156 | + "collapsed": true |
| 157 | + }, |
| 158 | + "outputs": [], |
| 159 | + "source": [ |
| 160 | + "lines.flatMap(lambda line: line.split()).collect()" |
| 161 | + ] |
| 162 | + }, |
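The same `flatMap` call carries over to a DStream unchanged, which is the point of prototyping on a file first. A sketch, assuming a StreamingContext built from the existing `sc` and a watched directory whose name is illustrative:

```python
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 10)                       # 10-second batch interval
lines_stream = ssc.textFileStream("greetings_dir")   # watches a directory for new files
words = lines_stream.flatMap(lambda line: line.split())
words.pprint()                                       # prints a sample of each batch
# ssc.start(); ssc.awaitTermination()                # uncomment to actually run the stream
```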
| 163 | + { |
| 164 | + "cell_type": "markdown", |
| 165 | + "metadata": {}, |
| 166 | + "source": [ |
| 167 | + "# Demo (Part 2)" |
| 168 | + ] |
| 169 | + }, |
| 170 | + { |
| 171 | + "cell_type": "markdown", |
| 172 | + "metadata": {}, |
52 | 173 | "source": [
|
53 | 174 | "Last time we went over the `map` and `flapmap` functions. We'll explore a few other options.\n",
|
54 | 175 | "\n",
|
|
57 | 178 | },
|
58 | 179 | {
|
59 | 180 | "cell_type": "code",
|
60 | | - "execution_count": 1, |
61 | | - "metadata": {}, |
62 | | - "outputs": [ |
63 | | - { |
64 | | - "ename": "NameError", |
65 | | - "evalue": "name 'sc' is not defined", |
66 | | - "output_type": "error", |
67 | | - "traceback": [ |
68 | | - "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", |
69 | | - "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", |
70 | | - "\u001b[1;32m<ipython-input-1-8b5aca44da72>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mlines\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0msc\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mparallelize\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m'Its fun to have fun,'\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;34m'but you have to know how.'\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 2\u001b[0m \u001b[1;31m# Suppose then that we want to get wordcounts for this. We can use the map function from before here.\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 3\u001b[0m \u001b[1;31m# map returns a new RDD containing values created by applying the supplied function to each value in the original RDD\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 4\u001b[0m \u001b[1;31m# Here we use a lambda function which replaces some common punctuation characters with spaces and convert to lower\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 5\u001b[0m \u001b[1;31m# case, producing a new RDD:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", |
71 | | - "\u001b[1;31mNameError\u001b[0m: name 'sc' is not defined" |
72 | | - ] |
73 | | - } |
74 | | - ], |
| 181 | + "execution_count": null, |
| 182 | + "metadata": { |
| 183 | + "collapsed": true |
| 184 | + }, |
| 185 | + "outputs": [], |
75 | 186 | "source": [
|
76 | 187 | "scc = Streamingcontext(\"local[2]\",\"PythonSparkApp\", 10)\n",
|
77 | 188 | "\n",
|
|
80 | 191 | "oldwordcount = wordspair.reduceByKey(lambda x,y : x + y)\n",
|
81 | 192 | "lines = scc.socketTextStream(\"192.168.56.101\", 9999)\n",
|
82 | 193 | "\n",
|
83 | | - "lines = sc.parallelize(['Its fun to have fun,','but you have to know how.'])\n", |
84 | 194 | "# Suppose then that we want to get wordcounts for this. We can use the map function from before here. \n",
|
85 | 195 | "# map returns a new RDD containing values created by applying the supplied function to each value in the original RDD\n",
|
86 | 196 | "# Here we use a lambda function which replaces some common punctuation characters with spaces and convert to lower \n",
|
|
133 | 243 | "metadata": {},
|
134 | 244 | "source": [
|
135 | 245 | "# References\n",
|
136 | | - "1. https://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams " |
| 246 | + "1. https://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams\n" |
137 | 247 | ]
|
138 | 248 | },
|
139 | 249 | {
|
|