<!DOCTYPE html>
<html>
<head>
<title>datawerk</title>
<meta charset="utf-8" />
<link href="https://buhrmann.github.io/theme/css/bootstrap-custom.css" rel="stylesheet"/>
<link href="https://buhrmann.github.io/theme/css/pygments.css" rel="stylesheet"/>
<link href="https://buhrmann.github.io/theme/css/style.css" rel="stylesheet" />
<link href="//maxcdn.bootstrapcdn.com/font-awesome/4.2.0/css/font-awesome.min.css" rel="stylesheet">
<link rel="shortcut icon" type="image/png" href="https://buhrmann.github.io/theme/css/logo.png">
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-56071357-1', 'auto');
ga('send', 'pageview');
</script> </head>
<body>
<div class="wrap">
<div class="container-fluid">
<div class="header">
<div class="container">
<nav class="navbar navbar-default navbar-fixed-top" role="navigation">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target=".navbar-collapse">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="https://buhrmann.github.io">
<!-- <span class="fa fa-pie-chart navbar-logo"></span> datawerk -->
<span class="navbar-logo"><img src="https://buhrmann.github.io/theme/css/logo.png" style=""></img></span>
</a>
</div>
<div class="navbar-collapse collapse">
<ul class="nav navbar-nav">
<!--<li><a href="https://buhrmann.github.io/archives.html">Archives</a></li>-->
<li><a href="https://buhrmann.github.io/posts.html">Blog</a></li>
<li><a href="https://buhrmann.github.io/pages/cv.html">Interactive CV</a></li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Data Reports<span class="caret"></span></a>
<ul class="dropdown-menu" role="menu">
<!--<li class="divider"></li>
<li class="dropdown-header">Data Science Reports</li>-->
<li >
<a href="https://buhrmann.github.io/p2p-loans.html">Interest rates on <span class="caps">P2P</span> loans</a>
</li>
<li >
<a href="https://buhrmann.github.io/activity-data.html">Categorisation of inertial activity data</a>
</li>
<li >
<a href="https://buhrmann.github.io/titanic-survival.html">Titanic survival prediction</a>
</li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Data Apps<span class="caret"></span></a>
<ul class="dropdown-menu" role="menu">
<!--<li class="divider"></li>
<li class="dropdown-header">Data Science Reports</li>-->
<li >
<a href="https://buhrmann.github.io/elegans.html">C. elegans connectome explorer</a>
</li>
<li >
<a href="https://buhrmann.github.io/dash+.html">Dash+ visualization of running data</a>
</li>
</ul>
</li>
</ul>
</div>
</nav>
</div>
</div><!-- header -->
</div><!-- container-fluid -->
<div class="container main-content">
<div class="row row-centered">
<div class="col-centered col-max col-min col-sm-12 col-md-10 col-lg-10 main-content">
<section id="content" class="content">
<article class="hentry">
<header>
<span class="entry-title-info">Oct 07 · <a href="https://buhrmann.github.io/category/data-posts.html">Data Posts</a></span>
<h2 class="entry-title entry-title-tight"><a href="https://buhrmann.github.io/hadoop-distributed-cache.html" rel="bookmark" title="Permalink to Reading from distributed cache in Hadoop">Reading from distributed cache in Hadoop</a></h2>
</header>
<div class="entry-content"> <p>The distributed cache can be used to make small files (or jars etc.) available to MapReduce functions locally on each node. This can be useful, e.g., when a global stopword list is needed by all mappers for index creation. Here are two correct ways of reading a file from the distributed cache in Hadoop 2. This has changed in the new <span class="caps">API</span>, and very few books and tutorials have updated examples.</p>
<h3>Named File</h3>
<p>In the driver:</p>
<div class="highlight"><pre><span></span><span class="n">Job</span> <span class="n">job</span> <span class="o">=</span> <span class="n">Job</span><span class="o">.</span><span class="na">getInstance</span><span class="o">(</span><span class="k">new</span> <span class="n">Configuration</span><span class="o">());</span>
<span class="n">job</span><span class="o">.</span><span class="na">addCacheFile</span><span class="o">(</span><span class="k">new</span> <span class="n">URI</span> <span class="o">(</span><span class="s">"/path/to/file.csv"</span> <span class="o">+</span> <span class="s">"#filelabel"</span><span class="o">));</span>
</pre></div>
<p>In the mapper:</p>
<div class="highlight"><pre><span></span><span class="nd">@Override</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">setup</span><span class="o">(</span><span class="n">Context</span> <span class="n">context …</span></pre></div> </div><!-- entry-content -->
</article>
<article class="hentry">
<header>
<span class="entry-title-info">Oct 07 · <a href="https://buhrmann.github.io/category/reports.html">Reports</a></span>
<h2 class="entry-title entry-title-tight"><a href="https://buhrmann.github.io/p2p-loans.html" rel="bookmark" title="Permalink to Interest rates on P2P loans">Interest rates on <span class="caps">P2P</span> loans</a></h2>
</header>
<div class="entry-content"> <p>In this post I will look at linear regression to model the process determining the interest rate on peer-to-peer loans provided by the <a href="https://www.lendingclub.com/home.action">Lending Club</a>. Like other peer-to-peer services, the Lending Club aims to directly connect producers and consumers, or in this case borrowers and lenders, by cutting out the middleman. Borrowers apply for loans online and provide details about the desired loan as well as their financial status (such as their <span class="caps">FICO</span> score). Lenders use the information provided to choose which loans to invest in. The Lending Club, finally, uses a <a href="https://www.lendingclub.com/public/how-we-set-interest-rates.action">proprietary algorithm</a> to determine the interest charged on an applicant …</p> </div><!-- entry-content -->
</article>
<article class="hentry">
<header>
<span class="entry-title-info">Sep 15 · <a href="https://buhrmann.github.io/category/data-posts.html">Data Posts</a></span>
<h2 class="entry-title entry-title-tight"><a href="https://buhrmann.github.io/sna-twitter.html" rel="bookmark" title="Permalink to Ideological twitter communities">Ideological twitter communities</a></h2>
</header>
<div class="entry-content"> <p>My current academic work revolves around the interactional “autonomy” of ideological communities in social networks. As part of my investigation I sometimes come across interesting little factoids. For example, I have been looking at the interaction of communities formed around the major political parties in Spain, and the most important media outlets (newspapers and <span class="caps">TV</span> stations). One observation I thought was interesting has to do with the bias in media consumption exhibited by different communities. For example, without going into detail here about how I identified the individual communities, here is a graph showing the inequality in retweet activity exhibited …</p> </div><!-- entry-content -->
</article>
<article class="hentry">
<header>
<span class="entry-title-info">Sep 13 · <a href="https://buhrmann.github.io/category/data-posts.html">Data Posts</a></span>
<h2 class="entry-title entry-title-tight"><a href="https://buhrmann.github.io/voopter.html" rel="bookmark" title="Permalink to Exploration of voopter airfare data">Exploration of voopter airfare data</a></h2>
</header>
<div class="entry-content"> <p>I’ve recently started working as a data science freelancer for <a href="https://voopter.com.br">voopter.com.br</a>, helping them analyze the data generated by airfare searches on their website. Voopter is a metasearch engine for flights to and from Brazil. The first thing I did was to create an interactive dashboard in R and Shiny for some exploratory statistics of the millions of searches performed by users of their website (which has already led to more specific business-driven questions).</p>
<p>The dashboard provides a quick and easy way to filter and aggregate the data, which is stored in
an <span class="caps">SQL</span> database. The idea is …</p> </div><!-- entry-content -->
</article>
<article class="hentry">
<header>
<span class="entry-title-info">Jul 20 · <a href="https://buhrmann.github.io/category/data-posts.html">Data Posts</a></span>
<h2 class="entry-title entry-title-tight"><a href="https://buhrmann.github.io/elegans-pubmed.html" rel="bookmark" title="Permalink to Elegans now features PubMed search">Elegans now features PubMed search</a></h2>
</header>
<div class="entry-content"> <p>I’ve added a new PubMed search feature to <a href="https://elegans.herokuapp.com/"><em>Elegans</em></a>, the visual worm brain explorer. The idea here is to show the network of <em>C. elegans</em> neurons that get mentioned in at least n papers on PubMed, in the context of a given search query. So, for example, if one is interested in the worm’s chemotaxis behaviour, one would type in ‘chemotaxis’ and choose the citation threshold n. Initiating the search will then return the neurons that get mentioned in at least n papers along with the word ‘chemotaxis’. The search is in fact performed once for each neuron …</p> </div><!-- entry-content -->
</article>
<article class="hentry">
<header>
<span class="entry-title-info">Jun 22 · <a href="https://buhrmann.github.io/category/data-posts.html">Data Posts</a></span>
<h2 class="entry-title entry-title-tight"><a href="https://buhrmann.github.io/tfidf-analysis.html" rel="bookmark" title="Permalink to Analyzing tf-idf results in scikit-learn">Analyzing tf-idf results in scikit-learn</a></h2>
</header>
<div class="entry-content"> <p>In a <a href="https://buhrmann.github.io/sklearn-pipelines.html">previous post</a> I showed how to create text-processing pipelines for machine learning in Python using <a href="http://scikit-learn.org/stable/">scikit-learn</a>. The core of such pipelines in many cases is the vectorization of text using the <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">tf-idf</a> transformation. In this post I will show some ways of analysing and making sense of the results of a tf-idf transformation. As an example I will use the same <a href="https://www.kaggle.com/c/stumbleupon">Kaggle dataset</a>, namely webpages provided and classified by StumbleUpon as either ephemeral (content that is short-lived) or evergreen (content that can be recommended long after its initial discovery).</p>
<h3>Tf-idf</h3>
<p>As explained in the previous post, the tf-idf …</p> </div><!-- entry-content -->
</article>
<article class="hentry">
<header>
<span class="entry-title-info">Jun 17 · <a href="https://buhrmann.github.io/category/data-posts.html">Data Posts</a></span>
<h2 class="entry-title entry-title-tight"><a href="https://buhrmann.github.io/sklearn-pipelines.html" rel="bookmark" title="Permalink to Pipelines for text classification in scikit-learn">Pipelines for text classification in scikit-learn</a></h2>
</header>
<div class="entry-content"> <p><a href="http://scikit-learn.org">Scikit-learn’s</a> <a href="http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html">pipelines</a> provide a useful layer of abstraction for building complex estimators or classification models. Their purpose is to aggregate a number of data transformation steps, and a model operating on the result of these transformations, into a single object that can then be used in place of a simple estimator. This allows for the one-off definition of complex pipelines that can be re-used, for example, in cross-validation functions, grid searches, learning curves and so on. I will illustrate their use, and some pitfalls, in the context of a Kaggle text-classification challenge.</p>
<p><img src="/images/pipelines/stumbleupon_evergreen.jpg" alt="StumbleUpon Evergreen" width="1000"/></p>
<h3>The challenge</h3>
<p>The goal in the <a href="https://www.kaggle.com/c/stumbleupon">StumbleUpon Evergreen …</a></p> </div><!-- entry-content -->
</article>
<article class="hentry">
<header>
<span class="entry-title-info">Feb 02 · <a href="https://buhrmann.github.io/category/data-posts.html">Data Posts</a></span>
<h2 class="entry-title entry-title-tight"><a href="https://buhrmann.github.io/xql.html" rel="bookmark" title="Permalink to Sql to excel">Sql to excel</a></h2>
</header>
<div class="entry-content"> <p>A little Python tool to execute an <span class="caps">SQL</span> script (PostgreSQL in this case, but it should be easy to modify for MySQL etc.) and store the result in a <span class="caps">CSV</span> or Excel (xls) file:</p>
<div class="highlight"><pre><span></span><span class="sd">"""</span>
<span class="sd">Executes an sql script and stores the result in a file.</span>
<span class="sd">"""</span>
<span class="kn">import</span> <span class="nn">os</span><span class="o">,</span> <span class="nn">sys</span>
<span class="kn">import</span> <span class="nn">subprocess</span>
<span class="kn">import</span> <span class="nn">csv</span>
<span class="kn">from</span> <span class="nn">xlwt</span> <span class="kn">import</span> <span class="n">Workbook</span>
<span class="k">def</span> <span class="nf">sql_to_csv</span><span class="p">(</span><span class="n">sql_fnm</span><span class="p">,</span> <span class="n">csv_fnm</span><span class="p">):</span>
    <span class="sd">""" Write result of executing sql script to csv file"""</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">sql_fnm</span><span class="p">,</span> <span class="s1">'r'</span><span class="p">)</span> <span class="k">as</span> <span class="n">sql_file</span><span class="p">:</span>
        <span class="n">query</span> <span class="o">=</span> <span class="n">sql_file</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
        <span class="n">query</span> <span class="o">=</span> <span class="s2">"COPY ("</span> <span class="o">+</span> <span class="n">query</span> <span class="o">+</span> <span class="s2">") TO STDOUT WITH CSV HEADER"</span>
        <span class="n">cmd</span> <span class="o">=</span> <span class="s1">'psql -c "'</span> <span class="o">+</span> <span class="n">query</span> <span class="o">+</span> <span class="s1">'"'</span>
        <span class="nb">print</span><span class="p">(</span><span class="n">cmd</span><span class="p">)</span>
        <span class="n">data</span> <span class="o">=</span> <span class="n">subprocess</span><span class="o">.</span><span class="n">check_output</span><span class="p">(</span><span class="n">cmd</span><span class="p">,</span> <span class="n">shell</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">csv_fnm</span><span class="p">,</span> <span class="s1">'wb …</span></pre></div> </div><!-- entry-content -->
</article>
<article class="hentry">
<header>
<span class="entry-title-info">Jan 27 · <a href="https://buhrmann.github.io/category/data-posts.html">Data Posts</a></span>
<h2 class="entry-title entry-title-tight"><a href="https://buhrmann.github.io/scholar.html" rel="bookmark" title="Permalink to Retrieving your Google Scholar data">Retrieving your Google Scholar data</a></h2>
</header>
<div class="entry-content"> <p>For my interactive <span class="caps">CV</span> I decided to try not only to automate the creation of a bibliography of my publications, but also to extend it with a citation count for each paper, which Google Scholar happens to keep track of. Unfortunately there is no Scholar <span class="caps">API</span>. But I figured since my own profile is based on data I essentially donated to Google, it is only fair that I can have access to it too. Hence I wrote a little scraper that iterates over the publications in my Scholar profile, extracts all citations, and bins them per year. That way I …</p> </div><!-- entry-content -->
</article>
<article class="hentry">
<header>
<span class="entry-title-info">Jan 27 · <a href="https://buhrmann.github.io/category/data-posts.html">Data Posts</a></span>
<h2 class="entry-title entry-title-tight"><a href="https://buhrmann.github.io/tag-graph.html" rel="bookmark" title="Permalink to Tag graph plugin for Pelican">Tag graph plugin for Pelican</a></h2>
</header>
<div class="entry-content"> <p>On my front page I display a sort of sitemap for my blog. Since the structure of the site is not very hierarchical, I decided to show pages and posts as a graph along with their tags. To do so, I created a mini plugin for the Pelican static blog engine. The plugin is essentially a callback that gets executed when the engine has generated all posts and pages from their Markdown files. I then simply take the results and write them out in a <span class="caps">JSON</span> format that d3.js understands (a list of nodes and a list …</p> </div><!-- entry-content -->
</article>
<div class="pager">
<ul>
<li class="previous disabled"><a>← Previous</a></li>
<li class="next"><a href="https://buhrmann.github.io/posts2.html">Next →</a></li>
</ul>
</div>
</section><!-- content -->
</div>
</div><!-- row-->
</div><!-- container -->
<!-- <div class="push"></div> -->
</div> <!-- wrap -->
<div class="container-fluid aw-footer">
<div class="row-centered">
<div class="col-sm-3 col-sm-offset-1">
<h4>Author</h4>
<ul class="list-unstyled my-list-style">
<li><a href="http://www.ias-research.net/people/thomas-buhrmann/">Academic Home</a></li>
<li><a href="http://github.com/synergenz">Github</a></li>
<li><a href="http://www.linkedin.com/in/thomasbuhrmann">LinkedIn</a></li>
<li><a href="https://secure.flickr.com/photos/syngnz/">Flickr</a></li>
</ul>
</div>
<div class="col-sm-3">
<h4>Categories</h4>
<ul class="list-unstyled my-list-style">
<li><a href="https://buhrmann.github.io/category/academia.html">Academia (4)</a></li>
<li><a href="https://buhrmann.github.io/category/data-apps.html">Data Apps (2)</a></li>
<li><a href="https://buhrmann.github.io/category/data-posts.html">Data Posts (9)</a></li>
<li><a href="https://buhrmann.github.io/category/reports.html">Reports (3)</a></li>
</ul>
</div>
<div class="col-sm-3">
<h4>Tags</h4>
<ul class="tagcloud">
<li class="tag-4"><a href="https://buhrmann.github.io/tag/shiny.html">shiny</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/networks.html">networks</a></li>
<li class="tag-3"><a href="https://buhrmann.github.io/tag/sql.html">sql</a></li>
<li class="tag-3"><a href="https://buhrmann.github.io/tag/hadoop.html">hadoop</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/mongodb.html">mongodb</a></li>
<li class="tag-1"><a href="https://buhrmann.github.io/tag/visualization.html">visualization</a></li>
<li class="tag-2"><a href="https://buhrmann.github.io/tag/smcs.html">smcs</a></li>
<li class="tag-3"><a href="https://buhrmann.github.io/tag/sklearn.html">sklearn</a></li>
<li class="tag-3"><a href="https://buhrmann.github.io/tag/tf-idf.html">tf-idf</a></li>
<li class="tag-1"><a href="https://buhrmann.github.io/tag/r.html">R</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/sna.html">sna</a></li>
<li class="tag-2"><a href="https://buhrmann.github.io/tag/nosql.html">nosql</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/svm.html">svm</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/java.html">java</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/hive.html">hive</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/scraping.html">scraping</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/lda.html">lda</a></li>
<li class="tag-2"><a href="https://buhrmann.github.io/tag/kaggle.html">kaggle</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/exploratory.html">exploratory</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/titanic.html">titanic</a></li>
<li class="tag-2"><a href="https://buhrmann.github.io/tag/classification.html">classification</a></li>
<li class="tag-1"><a href="https://buhrmann.github.io/tag/python.html">python</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/random-forest.html">random forest</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/text.html">text</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/big-data.html">big data</a></li>
<li class="tag-2"><a href="https://buhrmann.github.io/tag/report.html">report</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/regression.html">regression</a></li>
<li class="tag-2"><a href="https://buhrmann.github.io/tag/graph.html">graph</a></li>
<li class="tag-2"><a href="https://buhrmann.github.io/tag/d3.html">d3</a></li>
<li class="tag-3"><a href="https://buhrmann.github.io/tag/neo4j.html">neo4j</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/flume.html">flume</a></li>
</ul>
</div>
</div>
</div>
<!-- JavaScript -->
<script src="https://code.jquery.com/jquery-2.1.1.min.js"></script>
<script src="//maxcdn.bootstrapcdn.com/bootstrap/3.2.0/js/bootstrap.min.js"></script>
<script type="text/javascript">
jQuery(document).ready(function($)
{
$("div.collapseheader").click(function () {
$header = $(this).children("span").first();
$codearea = $(this).children(".input_area");
$codearea.slideToggle(500, function () {
$header.text(function () {
return $codearea.is(":visible") ? "Collapse Code" : "Expand Code";
});
});
});
// $(window).resize(function(){
// var footerHeight = $('.aw-footer').outerHeight();
// var stickFooterPush = $('.push').height(footerHeight);
// $('.wrap').css({'marginBottom':'-' + footerHeight + 'px'});
// });
// $(window).resize();
// $(window).bind("load resize", function() {
// var footerHeight = 0,
// footerTop = 0,
// $footer = $(".aw-footer");
// positionFooter();
// function positionFooter() {
// footerHeight = $footer.height();
// footerTop = ($(window).scrollTop()+$(window).height()-footerHeight)+"px";
// console.log(footerHeight, footerTop);
// console.log($(document.body).height()+footerHeight, $(window).height());
// if ( ($(document.body).height()+footerHeight) < $(window).height()) {
// $footer.css({ position: "absolute" }).css({ top: footerTop });
// console.log("Positioning absolute");
// }
// else {
// $footer.css({ position: "static" });
// console.log("Positioning static");
// }
// }
// $(window).scroll(positionFooter).resize(positionFooter);
// });
});
</script>
</body>
</html>