-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathPythonInstruction.html
117 lines (105 loc) · 6.56 KB
/
PythonInstruction.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
<!doctype html>
<html>
<head>
<meta charset='utf-8'>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/github-markdown-css/2.4.1/github-markdown.min.css">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/9.11.0/styles/default.min.css">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.8.3/katex.min.css">
<link rel="stylesheet" href="https://gitcdn.xyz/repo/goessner/mdmath/master/css/texmath.css">
<link rel="stylesheet" href="https://gitcdn.xyz/repo/goessner/mdmath/master/css/vscode-texmath.css">
</head>
<body class="markdown-body">
<h1 data-line="NaN" class="code-line code-line" id="pyspark-installation-guide">Pyspark installation guide</h1>
<h2 data-line="NaN" class="code-line code-line" id="osx-installation-guide">OSX installation guide</h2>
<h3 data-line="NaN" class="code-line code-line" id="pyspark---environment-setup">PySpark - Environment setup</h3>
<ul>
<li data-line="NaN" class="code-line code-line">
<p data-line="NaN" class="code-line code-line">
<strong>Step 1</strong>: Make sure that java and scala are installed on your computer (PySpark doesn't work very
well with Java 9, so it's better to use Java 8)</p>
</li>
<li data-line="NaN" class="code-line code-line">
<p data-line="NaN" class="code-line code-line">
<strong>Step 2</strong>: Download Spark from
<a href="https://spark.apache.org/downloads.html">here</a>
</p>
</li>
<li data-line="NaN" class="code-line code-line">
<p data-line="NaN" class="code-line code-line">
<strong>Step 3</strong>: Move the downloaded file in a choosen directory, for example
<code>~/Dev</code> and unzip it with
<code>tar -xvf Downloads/spark-2.1.0-bin-hadoop2.7.tgz</code>
</p>
</li>
<li data-line="NaN" class="code-line code-line">
<p data-line="NaN" class="code-line code-line">
<strong>Step 4</strong>: Now we need to set the following environments to set the Spark path and the Py4j path in
the bash profile</p>
</li>
</ul>
<pre class="hljs"><code><div>export SPARK_HOME = ~/Dev/hadoop/spark-2.1.0-bin-hadoop2.7
export PATH = $PATH:~/Dev/hadoop/spark-2.1.0-bin-hadoop2.7/bin
export PYTHONPATH = $SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
export PATH = $SPARK_HOME/python:$PATH
</div></code></pre>
<ul>
<li data-line="NaN" class="code-line code-line">
<strong>Step 5</strong>: restart the terminal application and then test if it works by typing</li>
</ul>
<pre class="hljs"><code><div>$SPARK_HOME/bin/pyspark
</div></code></pre>
<ul>
<li data-line="NaN" class="code-line code-line">
<strong>Step 6</strong>: To run a python script, for example
<code>FirstSparkApp.py</code> type</li>
</ul>
<pre class="hljs"><code><div>$SPARK_HOME/bin/spark-submit FirstSparkApp.py
</div></code></pre>
<h2 data-line="NaN" class="code-line code-line" id="windows-installation-guide">Windows installation Guide</h2>
<ol>
<li data-line="NaN" class="code-line code-line">
<p data-line="NaN" class="code-line code-line">Download and install the JDK from
<a href="http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html">here</a>
</p>
</li>
<li data-line="NaN" class="code-line code-line">
<p data-line="NaN" class="code-line code-line">Via terminal install pyspark using the following command</p>
<pre class="hljs"><code><div>pip install pyspark
</div></code></pre>
</li>
<li data-line="NaN" class="code-line code-line">
<p data-line="NaN" class="code-line code-line">Download and unzip winutils from
<a href="https://github.com/srccodes/hadoop-common-2.2.0-bin/archive/master.zip">here</a>
</p>
</li>
<li data-line="NaN" class="code-line code-line">
<p data-line="NaN" class="code-line code-line">Modify your environment variables in the following way:</p>
<ul>
<li data-line="NaN" class="code-line code-line">Modify the environment variable "path" and add "path_to_unzip_winutils_folder/bin"</li>
<li data-line="NaN" class="code-line code-line">Add to the environment variables a new variable called "HADOOP_HOME" and give it value "path_to_unzip_winutils_folder"</li>
</ul>
</li>
<li data-line="NaN" class="code-line code-line">
<p data-line="NaN" class="code-line code-line">In your python code you will need to add the following lines of codes</p>
<pre class="hljs"><code><div><span class="hljs-keyword">from</span> pyspark <span class="hljs-keyword">import</span> SparkContext
sc = SparkContext.getOrCreate()
</div></code></pre>
</li>
</ol>
<h2 data-line="NaN" class="code-line code-line" id="example">Example</h2>
<pre class="hljs"><code><div><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> pyspark <span class="hljs-keyword">import</span> SparkContext, SparkConf
conf = SparkConf().setAppName(<span class="hljs-string">"PiEstimate"</span>)
sc = SparkContext(conf = conf)
n = <span class="hljs-number">10000</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">f</span><span class="hljs-params">(_)</span>:</span>
x, y = np.random.random(<span class="hljs-number">2</span>)*<span class="hljs-number">2</span><span class="hljs-number">-1</span>
<span class="hljs-keyword">if</span> x**<span class="hljs-number">2</span>+y**<span class="hljs-number">2</span> <= <span class="hljs-number">1</span>:
<span class="hljs-keyword">return</span> <span class="hljs-number">1</span>
<span class="hljs-keyword">else</span>:
<span class="hljs-keyword">return</span> <span class="hljs-number">0</span>
count = sc.parallelize(range(<span class="hljs-number">1</span>,n+<span class="hljs-number">1</span>)).map(f).reduce(<span class="hljs-keyword">lambda</span> x, y: x+y)
print(<span class="hljs-string">"Pi is roughly {}"</span>.format(<span class="hljs-number">4.0</span> * count/n))
</div></code></pre>
</body>
</html>