Commit 3232de8
Implementing Feedback
1 parent ea9d1a4 commit 3232de8

27 files changed: +2892 -514 lines

spark_streaming_basics/.ipynb_checkpoints/03_Basics of Transformations Demo 2-checkpoint.ipynb → spark_streaming_basics/.ipynb_checkpoints/02_and_03_Basics of Transformations Demo-checkpoint.ipynb
+134 -24
@@ -4,16 +4,16 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"# Basics of Transformations Demo 2"
+"# Basics of Transformations Demo"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"As we discussed earlier, there are a wide variety of data transformations available for use on DStreams, most of which are similar to those used on the DStreams' constituent parts.\n",
+"In Spark Streaming, DStreams are treated very similarly to the RDDs that make them up. Like RDDs, there are a wide variety of data transformation options.\n",
 "\n",
-"As a reminder, here is the list of transformations from the previous demo again:\n",
+"Here are some examples of the transformations from the Spark documentation that might be useful for your purposes:\n",
 "\n",
 "| Transformation | Meaning |\n",
 "| ------------------------------ |:-------------|\n",
@@ -30,25 +30,146 @@
 "| **cogroup**(otherStream, [numTasks])\t| When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples.\n",
 "\n",
 "\n",
-"If you look at the spark streaming documentation, you will also find the `transform(func)` and `updateStateByKey(func)`. We will discuss these later in the course.\n",
+"If you look at the spark streaming documentation, you will also find the `transform(func)` and `updateStateByKey(func)`. We will discuss these later in the course.\n"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Demo (Part 1)"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {
+"collapsed": true
+},
+"source": [
+"We're going to be demoing the map and flatMap functions with respect to DStreams. One important question is \"What is the difference between the two?\"\n",
+"\n",
+"`map`: It returns a new RDD by applying a function to each element of the RDD. The function in map can return only one item. Works with DStreams as well as RDDs.\n",
 "\n",
+"`flatMap`: Similar to map, it returns a new RDD by applying a function to each element of the RDD, but the output is flattened.\n",
+"Also, the function in flatMap can return a list of elements (0 or more). Works with DStreams as well as RDDs.\n",
 "\n",
-"Let's go though another example:\n",
-"\n"
+"Here's an example:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": true
+},
+"outputs": [],
+"source": [
+"sc.parallelize([3,4,5]).map(lambda x: range(1,x)).collect()"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": true
+},
+"outputs": [],
+"source": [
+"sc.parallelize([3,4,5]).flatMap(lambda x: range(1,x)).collect()"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Notice the output is flattened into a single list"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Here's another example:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": true
+},
+"outputs": [],
+"source": [
+"sc.parallelize([3,4,5]).map(lambda x: [x, x*x]).collect() "
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": true
+},
+"outputs": [],
+"source": [
+"sc.parallelize([3,4,5]).flatMap(lambda x: [x, x*x]).collect() "
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### Demo"
+"Notice that the list is flattened in the latter version"
 ]
 },
 {
 "cell_type": "markdown",
+"metadata": {},
+"source": [
+"Here's another example, this time interacting with a file, which can often be useful for debugging code that interacts with full DStreams.\n",
+"\n",
+"There is a text file `greetings.txt` with the following lines:\n",
+"```\n",
+"Good Morning\n",
+"Good Evening\n",
+"Good Day\n",
+"Happy Birthday\n",
+"Happy New Year\n",
+"```"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
 "metadata": {
 "collapsed": true
 },
+"outputs": [],
+"source": [
+"lines = sc.textFile(\"greetings.txt\")\n",
+"lines.map(lambda line: line.split()).collect()"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": true
+},
+"outputs": [],
+"source": [
+"lines.flatMap(lambda line: line.split()).collect()"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"# Demo (Part 2)"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
 "source": [
 "Last time we went over the `map` and `flatMap` functions. We'll explore a few other options.\n",
 "\n",
@@ -57,21 +178,11 @@
 },
 {
 "cell_type": "code",
-"execution_count": 1,
-"metadata": {},
-"outputs": [
-{
-"ename": "NameError",
-"evalue": "name 'sc' is not defined",
-"output_type": "error",
-"traceback": [
-"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
-"\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)",
-"\u001b[1;32m<ipython-input-1-8b5aca44da72>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mlines\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0msc\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mparallelize\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m'Its fun to have fun,'\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;34m'but you have to know how.'\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 2\u001b[0m \u001b[1;31m# Suppose then that we want to get wordcounts for this. We can use the map function from before here.\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 3\u001b[0m \u001b[1;31m# map returns a new RDD containing values created by applying the supplied function to each value in the original RDD\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 4\u001b[0m \u001b[1;31m# Here we use a lambda function which replaces some common punctuation characters with spaces and convert to lower\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 5\u001b[0m \u001b[1;31m# case, producing a new RDD:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
-"\u001b[1;31mNameError\u001b[0m: name 'sc' is not defined"
-]
-}
-],
+"execution_count": null,
+"metadata": {
+"collapsed": true
+},
+"outputs": [],
 "source": [
 "scc = StreamingContext(\"local[2]\",\"PythonSparkApp\", 10)\n",
 "\n",
@@ -80,7 +191,6 @@
 "oldwordcount = wordspair.reduceByKey(lambda x,y : x + y)\n",
 "lines = scc.socketTextStream(\"192.168.56.101\", 9999)\n",
 "\n",
-"lines = sc.parallelize(['Its fun to have fun,','but you have to know how.'])\n",
 "# Suppose then that we want to get wordcounts for this. We can use the map function from before here. \n",
 "# map returns a new RDD containing values created by applying the supplied function to each value in the original RDD\n",
 "# Here we use a lambda function which replaces some common punctuation characters with spaces and convert to lower \n",
@@ -133,7 +243,7 @@
 "metadata": {},
 "source": [
 "# References\n",
-"1. https://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams "
+"1. https://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams\n"
 ]
 },
 {

spark_streaming_basics/.ipynb_checkpoints/02_Basics of Transformations Demo 1-checkpoint.ipynb → spark_streaming_basics/.ipynb_checkpoints/04_Basics of Transformations Exercise - Solution-checkpoint.ipynb
+24 -121
@@ -4,16 +4,22 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"# Basics of Transformations Demo 1"
+"# Basics of Transformations Exercise"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"In Spark Streaming, DStreams are treated very similarly to the RDDs that make them up. Like RDDs, there are a wide variety of data transformation options. \n",
-"\n",
-"Here are some examples of the transformations from the Spark documentation that might be useful for your purposes\n",
+"DStreams Transformations"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Exercise\n",
+"Use any of the functions above to return the largest key of every RDD in a DStream (not just the largest in the entire DStream).\n",
 "\n",
 "| Transformation | Meaning |\n",
 "| ------------------------------ |:-------------|\n",
@@ -33,40 +39,6 @@
 "If you look at the spark streaming documentation, you will also find the `transform(func)` and `updateStateByKey(func)`. We will discuss these later in the course.\n"
 ]
 },
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"### Demo"
-]
-},
-{
-"cell_type": "markdown",
-"metadata": {
-"collapsed": true
-},
-"source": [
-"We're going to be demoing the map and flatmap functions. The important question is \"What is the difference between the two?\"\n",
-"\n",
-"`map`: It returns a new RDD by applying a function to each element of the RDD. Function in map can return only one item.\n",
-"\n",
-"`flatMap`: Similar to map, it returns a new RDD by applying a function to each element of the RDD, but output is flattened.\n",
-"Also, function in flatMap can return a list of elements (0 or more)\n",
-"\n",
-"Here's an example:"
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {
-"collapsed": true
-},
-"outputs": [],
-"source": [
-"sc.parallelize([3,4,5]).map(lambda x: range(1,x)).collect()"
-]
-},
 {
 "cell_type": "code",
 "execution_count": null,
@@ -75,97 +47,28 @@
 },
 "outputs": [],
 "source": [
-"sc.parallelize([3,4,5]).flatMap(lambda x: range(1,x)).collect()"
-]
-},
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"notice o/p is flattened out in a single list"
-]
-},
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"Here's Another Example:"
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {
-"collapsed": true
-},
-"outputs": [],
-"source": [
-"sc.parallelize([3,4,5]).map(lambda x: [x, x*x]).collect() "
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {
-"collapsed": true
-},
-"outputs": [],
-"source": [
-"sc.parallelize([3,4,5]).flatMap(lambda x: [x, x*x]).collect() "
-]
-},
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"notice that the list is flattened in the latter version"
-]
-},
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"Here's another example, this time interacting with a file\n",
+"sc = SparkContext(appName=\"PythonStreamingExercise\")\n",
+"ssc = StreamingContext(sc, 1)\n",
+"# Defining the stream\n",
+"stream = ssc.queueStream([sc.parallelize([(1,\"a\"), (2,\"b\"),(1,\"c\"),(2,\"d\"),\n",
+"(1,\"e\"),(3,\"f\")],3)])\n",
 "\n",
-"There is a text file `greetings.txt` with following lines:\n",
-"```\n",
-"Good Morning\n",
-"Good Evening\n",
-"Good Day\n",
-"Happy Birthday\n",
-"Happy New Year\n",
-"```"
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {
-"collapsed": true
-},
-"outputs": [],
-"source": [
-"lines = sc.textFile(\"greetings.txt\")\n",
-"lines.map(lambda line: line.split()).collect()"
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {
-"collapsed": true
-},
-"outputs": [],
-"source": [
-"lines.flatMap(lambda line: line.split()).collect()"
+"\n",
+"# TODO: Use any of the functions above, or some combination, to\n",
+"# return the largest key of every RDD in a DStream (not just the largest in the entire DStream).\n",
+"maxstream = stream.reduce(max)\n",
+"maxstream.pprint()\n",
+"\n",
+"###### End of Exercise section\n",
+"ssc.start()\n",
+"ssc.stop(stopSparkContext=True, stopGraceFully=True)\n"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "# References\n",
-"1. https://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams\n"
+"1. https://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams"
 ]
 },
 {
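One note on the solution cell added above: Python compares tuples element-wise, so `stream.reduce(max)` yields, for each batch RDD, the (key, value) pair with the largest key rather than the key alone. A hedged standalone sketch that returns just the key, assuming the same queued test data (the `awaitTermination` timeout is an illustrative choice, not part of the commit):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="PythonStreamingExercise")
ssc = StreamingContext(sc, 1)

# Same queued test data as the solution cell; queueStream feeds the
# queued RDD(s) into the DStream, one per batch interval.
stream = ssc.queueStream([sc.parallelize(
    [(1, "a"), (2, "b"), (1, "c"), (2, "d"), (1, "e"), (3, "f")], 3)])

# Project out the keys first so reduce(max) returns the key itself
# for each RDD in the DStream.
stream.map(lambda kv: kv[0]).reduce(max).pprint()

ssc.start()
ssc.awaitTermination(timeout=5)  # let the batch process, then shut down
ssc.stop(stopSparkContext=True, stopGraceFully=True)
```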
