Commit a76e143
committed
Redshift Blog Post

1 parent 7bc8069 commit a76e143

File tree: 19 files changed, +680 -15 lines changed

_config.yml  (+10 -10)

```diff
@@ -94,39 +94,39 @@ JB :
 # num_posts: 5
 # width: 580
 # colorscheme: light
-
+
 # Settings for analytics helper
 # Set 'provider' to the analytics provider you want to use.
 # Set 'provider' to false to turn analytics off globally.
-#
+#
 analytics :
-  provider : google
-  google :
+  provider : google
+  google :
   tracking_id : 'UA-40495390-2'
 # getclicky :
-#   site_id :
+#   site_id :
 # mixpanel :
 #   token : '_MIXPANEL_TOKEN_'
 # piwik :
 #   baseURL : 'myserver.tld/piwik' # Piwik installation address (without protocol)
 #   idsite : '1' # the id of the site on Piwik

-# Settings for sharing helper.
+# Settings for sharing helper.
 # Sharing is for things like tweet, plusone, like, reddit buttons etc.
 # Set 'provider' to the sharing provider you want to use.
 # Set 'provider' to false to turn sharing off globally.
 #
 sharing :
   provider : false
-
-# Settings for all other include helpers can be defined by creating
+
+# Settings for all other include helpers can be defined by creating
 # a hash with key named for the given helper. ex:
 #
 # pages_list :
-#   provider : "custom"
+#   provider : "custom"
 #
 # Setting any helper's provider to 'custom' will bypass the helper code
 # and include your custom code. Your custom file must be defined at:
 # ./_includes/custom/[HELPER]
 # where [HELPER] is the name of the helper you are overriding.
-
+
```
New file  (+102)

@@ -0,0 +1,102 @@
---
layout: post
title: "Redshift SSD Benchmarks"
description: "Benchmarking Redshift performance across different node types"
category: Redshift, Data Science, Data Warehousing
tags: [Coursera, Analytics]
---
{% include JB/setup %}

Our warehouse runs completely on Redshift, and query performance is extremely important to us. Earlier this year, the AWS team announced SSD instances for Amazon Redshift. Is the extra CPU truly worth the cost? We do a lot of processing with Redshift, so this question matters a great deal to us. To answer it, we benchmarked SSD performance and compared it against our original HDD performance.

Redshift is easy to use because its PostgreSQL-compatible JDBC drivers allow us to work with a range of familiar SQL clients. Its speedy performance comes from columnar storage and data compression.
## Experiment Setup

The Redshift instance specs below are based on on-demand pricing, though reserved instances can be up to 75% cheaper. The reported results are the mean run times over 3 runs of each query.
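The measurement itself is simple enough to sketch: time each query three times and report the mean. Below is a minimal harness along those lines; the lambda is a hypothetical stand-in for executing a real query against the cluster.

```python
import time
from statistics import mean

def benchmark(run_query, runs=3):
    """Time run_query() `runs` times and return the mean wall-clock seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return mean(samples)

# Hypothetical stand-in for executing a query against the cluster.
elapsed = benchmark(lambda: sum(range(100_000)))
print(f"mean of 3 runs: {elapsed:.6f}s")
```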
18+
<table class="table table-bordered table-striped table-hover">
19+
<colgroup>
20+
<col span="1" style="width: 20%;" />
21+
<col span="1" style="width: 20%;" />
22+
<col span="1" style="width: 20%;" />
23+
<col span="1" style="width: 20%;" />
24+
<col span="1" style="width: 20%;" />
25+
</colgroup>
26+
<thead>
27+
<tr>
28+
<td> </td>
29+
<td><b>HDD Setup 1</b></td>
30+
<td><b>HDD Setup 2</b></td>
31+
<td><b>SSD Setup 1</b></td>
32+
<td><b>SSD Setup 2</b></td>
33+
</tr>
34+
</thead>
35+
<tbody>
36+
<tr>
37+
<td><b>Nodes</b></td>
38+
<td>4 dw1.xlarge</td>
39+
<td>8 dw1.xlarge</td>
40+
<td>32 dw2.large</td>
41+
<td>4 dw2.8xlarge</td>
42+
</tr>
43+
<tr>
44+
<td><b>Storage</b></td>
45+
<td>8 TB</td>
46+
<td>16 TB</td>
47+
<td>5.12 TB</td>
48+
<td>10.24 TB</td>
49+
</tr>
50+
<tr>
51+
<td><b>Memory</b></td>
52+
<td>60 GB</td>
53+
<td>120 GB</td>
54+
<td>480 GB</td>
55+
<td>976 GB</td>
56+
</tr>
57+
<tr>
58+
<td><b>vCPU</b></td>
59+
<td>8</td>
60+
<td>16</td>
61+
<td>64</td>
62+
<td>128</td>
63+
</tr>
64+
<tr>
65+
<td><b>Price</b></td>
66+
<td>$3.4 / hr</td>
67+
<td>$6.8 / hr</td>
68+
<td>$8 / hr</td>
69+
<td>$19.2 / hr</td>
70+
</tr>
71+
</tbody>
72+
</table>
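A useful way to read the table is to normalize each setup by price. Comparing HDD Setup 1 against SSD Setup 1 (the figures below are taken directly from the table) shows how much CPU, memory, and storage each dollar buys:

```python
# Price-normalized comparison of HDD Setup 1 vs SSD Setup 1,
# using the figures from the experiment-setup table.
hdd = {"price": 3.4, "vcpu": 8,  "mem_gb": 60,  "storage_tb": 8.0}
ssd = {"price": 8.0, "vcpu": 64, "mem_gb": 480, "storage_tb": 5.12}

cpu_ratio  = (ssd["vcpu"] / ssd["price"]) / (hdd["vcpu"] / hdd["price"])
mem_ratio  = (ssd["mem_gb"] / ssd["price"]) / (hdd["mem_gb"] / hdd["price"])
disk_ratio = (ssd["storage_tb"] / ssd["price"]) / (hdd["storage_tb"] / hdd["price"])

print(f"vCPU per $:    {cpu_ratio:.2f}x")   # ~3.4x in favour of SSD
print(f"memory per $:  {mem_ratio:.2f}x")   # ~3.4x in favour of SSD
print(f"storage per $: {disk_ratio:.2f}x")  # ~0.27x, i.e. about a quarter
```

These ratios are where the conclusion's "same price, 3.4 times the CPU and memory, but only ~25% of the storage" figures come from.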
### Query 1.

First, we ran a simple join between a table with 1 billion rows and a table with 50 million rows. The total amount of data processed was around 46 GB. The results fell in favour of the SSDs.

<img src="https://dnsta5v53r71w.cloudfront.net/images/redshift-ssd-benchmark/1a.png" alt="Screenshot" style="width: 80%; margin-left:10%; margin-right:10%; margin-top:20px; margin-bottom:20px;"/>
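The actual benchmark query is not shown in the post; the sketch below only illustrates the same shape (a large fact table joined to a much smaller table, with an aggregate), run against a toy in-memory SQLite database. All table and column names are hypothetical.

```python
import sqlite3

# Toy stand-in for the "large table joined to a smaller table" shape.
# Table and column names are hypothetical; the real benchmark ran on Redshift.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INTEGER, action TEXT);
    CREATE TABLE users  (user_id INTEGER PRIMARY KEY, country TEXT);
""")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "US"), (2, "CA"), (3, "US")])
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(1, "view"), (1, "click"), (2, "view"), (3, "view")])

rows = conn.execute("""
    SELECT u.country, COUNT(*) AS n
    FROM events e
    JOIN users u ON u.user_id = e.user_id
    GROUP BY u.country
    ORDER BY n DESC
""").fetchall()
print(rows)  # [('US', 3), ('CA', 1)]
conn.close()
```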
### Query 2.

This complex query features regex matching and aggregate functions over 1 million rows produced by 4 joins. The total amount of data processed was around 100 GB. The results fell even further in favour of the SSDs, with performance improvements of 5x to 15x.

<img src="https://dnsta5v53r71w.cloudfront.net/images/redshift-ssd-benchmark/2.png" alt="Screenshot" style="width: 80%; margin-left:10%; margin-right:10%; margin-top:20px; margin-bottom:20px;"/>
### Query 3.

A query that runs window functions over a table of 1 billion rows showed surprising results. The table holds about 400 GB of data. Although the SSDs performed better overall, the smaller SSD cluster out-performed the bigger one, despite the bigger cluster having double the memory and CPU power.

<img src="https://dnsta5v53r71w.cloudfront.net/images/redshift-ssd-benchmark/3.png" alt="Screenshot" style="width: 80%; margin-left:10%; margin-right:10%; margin-top:20px; margin-bottom:20px;"/>
### Query 4.

This last query has 4 join statements, plus a subquery that itself includes 2 joins. The amount of data processed is around 107 GB. Since this query is very compute-heavy, it is not surprising that the SSDs performed 10x better. What is surprising is that the smaller SSD cluster once again outperformed the bigger one.

<img src="https://dnsta5v53r71w.cloudfront.net/images/redshift-ssd-benchmark/4a.png" alt="Screenshot" style="width: 80%; margin-left:10%; margin-right:10%; margin-top:20px; margin-bottom:20px;"/>
## Conclusion

We also ran several other queries, and the performance improvement from HDD to SSD was consistently about 5 to 10 times. From these experiments, the DW2 machines are clearly promising in terms of computation time. For the same price, the SSDs provide 3.4 times more CPU power and memory; however, they offer only about 25% of the HDDs' disk storage.

A limitation of the dw2.large SSD instances is that a Redshift cluster can contain at most 32 of them, so dw2.large clusters top out at 5.12 TB of disk storage. The only other option is to upgrade to dw2.8xlarge nodes, but this experiment shows little performance benefit in moving from dw2.large to dw2.8xlarge despite the doubled memory and CPU.

<i><small>PS: This was originally written by Jason Shao on the [Coursera blog](https://tech.coursera.org/blog/2014/12/19/redshift-benchmark/).</small></i>
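The 5.12 TB ceiling follows from arithmetic on the experiment-setup table: 32 dw2.large nodes holding 5.12 TB implies 0.16 TB (160 GB) of SSD per node, versus 2.56 TB per dw2.8xlarge node. A quick check, using only figures from the table:

```python
# Figures taken from the experiment-setup table above.
dw2_large_nodes, dw2_large_cluster_tb = 32, 5.12
dw2_8xlarge_nodes, dw2_8xlarge_cluster_tb = 4, 10.24

per_node_large_tb = dw2_large_cluster_tb / dw2_large_nodes        # 0.16 TB/node
per_node_8xlarge_tb = dw2_8xlarge_cluster_tb / dw2_8xlarge_nodes  # 2.56 TB/node

# With at most 32 dw2.large nodes, the cluster-wide SSD ceiling is:
max_dw2_large_tb = 32 * per_node_large_tb
print(f"{max_dw2_large_tb:.2f} TB")  # 5.12 TB
```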

_site/articles/index.html  (+5)

```diff
@@ -142,6 +142,11 @@ <h1>Articles <br /></h1>
 <div class='container articleList'>
 <table class="table table-responsive post-table">

+<tr>
+<td><h4 class='postTitle'><a href="/redshift,%20data%20science,%20data%20warehousing/2014/12/20/redshift-ssd-benchmarks">Redshift SSD Benchmarks</a></h4></td>
+<td class='postDate' style='vertical-align:middle;'><time datetime="2014-12-20T00:00:00-08:00">December 20, 2014</time></td>
+</tr>
+
 <tr>
 <td><h4 class='postTitle'><a href="/python/2014/04/20/pycon-2014---montreal">Pycon 2014 - Montreal</a></h4></td>
 <td class='postDate' style='vertical-align:middle;'><time datetime="2014-04-20T00:00:00-07:00">April 20, 2014</time></td>
```
Binary image files added (43.9 KB and 40.5 KB benchmark charts; not rendered in this view).

_site/atom.xml  (+99-1)

```diff
@@ -4,14 +4,112 @@
 <title>Sourabh Bajaj</title>
 <link href="http://sourabhbajaj.com/" rel="self"/>
 <link href="http://sourabhbajaj.com"/>
-<updated>2014-11-30T11:35:40-08:00</updated>
+<updated>2014-12-20T02:02:53-08:00</updated>
 <id>http://sourabhbajaj.com</id>
 <author>
 <name>Sourabh Bajaj</name>
 <email>[email protected]</email>
 </author>

+<entry>
+<title>Redshift SSD Benchmarks</title>
+<link href="http://sourabhbajaj.com/redshift,%20data%20science,%20data%20warehousing/2014/12/20/redshift-ssd-benchmarks"/>
+<updated>2014-12-20T00:00:00-08:00</updated>
+<id>http://sourabhbajaj.com/redshift,%20data%20science,%20data%20warehousing/2014/12/20/redshift-ssd-benchmarks</id>
+<content type="html">
+[HTML-escaped copy of the post body, identical to the new post file above]
+</content>
+</entry>
+
 <entry>
 <title>Pycon 2014 - Montreal</title>
 <link href="http://sourabhbajaj.com/python/2014/04/20/pycon-2014---montreal"/>
```

_site/extra/archive.html  (+20-1)

```diff
@@ -151,10 +151,29 @@ <h1>Archive <br /></h1>
 <h2>2014</h2>
-<h3>April</h3>
+<h3>December</h3>
 <ul>
+<li><span>December 20, 2014</span> &raquo; <a href="/redshift,%20data%20science,%20data%20warehousing/2014/12/20/redshift-ssd-benchmarks">Redshift SSD Benchmarks</a></li>
+</ul>
+<h3>April</h3>
+<ul>
 <li><span>April 20, 2014</span> &raquo; <a href="/python/2014/04/20/pycon-2014---montreal">Pycon 2014 - Montreal</a></li>
```
_site/extra/categories.html  (+24)

```diff
@@ -157,6 +157,10 @@ <h1>Categories <br /></h1>
 <li><a href="/categories.html#python-ref">
 python <span>2</span>
 </a></li>
+<li><a href="/categories.html#redshift, data science, data warehousing-ref">
+redshift, data science, data warehousing <span>1</span>
+</a></li>

@@ -239,6 +243,26 @@ <h2 id="python-ref">python</h2>
+</ul>
+
+<h2 id="redshift, data science, data warehousing-ref">redshift, data science, data warehousing</h2>
+<ul>
+<li><a href="/redshift,%20data%20science,%20data%20warehousing/2014/12/20/redshift-ssd-benchmarks">Redshift SSD Benchmarks</a></li>
+
 </ul>
```
_site/extra/sitemap.txt  (+1)

```diff
@@ -13,6 +13,7 @@ http://sourabhbajaj.com/index.html
 http://sourabhbajaj.com/portfolio/index.html
 http://sourabhbajaj.com/rss.xml

+http://sourabhbajaj.com/redshift,%20data%20science,%20data%20warehousing/2014/12/20/redshift-ssd-benchmarks
 http://sourabhbajaj.com/python/2014/04/20/pycon-2014---montreal
 http://sourabhbajaj.com/installation/2014/04/20/mac-os-x-setup-guide
 http://sourabhbajaj.com/python/2014/03/31/fix-valueerror-unknown-locale-utf-8
```