Project Documentation

Getting started

Environment Setup

Jakob manages the dependencies using Poetry, a Python dependency management tool. Jakob installed Poetry via his system package manager, which is not recommended, though he does not fully understand why. (There should also be an Ubuntu package.)

It is also straightforward to install the dependencies using pip install -r requirements.txt. If some dependencies are missing from requirements.txt, that is because Jakob forgot to run poetry export, or because someone else installed something using pip without exporting.

Data sharing

Data goes in /csh/data/. Jakob proposes Syncthing, a decentralized file-sharing tool (there is no cloud; everything stays on our own computers). It is not the safest setup (if one person deletes files, they could be gone for everyone), so it could be good to keep separate backups at certain points.

Sqlite utils

Information and Useful Software

I guess the most interesting scripts are currently located in the Analysis/ directory.

Wikipedia API

An OAuth consumer for a higher request limit can be created here (after creating a Wikimedia account): https://meta.wikimedia.org/wiki/Special:OAuthConsumerRegistration/propose
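A minimal sketch of how such a consumer's credentials might be used, assuming an owner-only OAuth 1.0a consumer plus the requests and requests-oauthlib packages (the placeholder credentials are illustrative; the curid is the Operations Management page listed below):

```python
import requests
from requests_oauthlib import OAuth1

# Placeholder credentials from the OAuth consumer registration above.
auth = OAuth1(
    client_key="CONSUMER_KEY",
    client_secret="CONSUMER_SECRET",
    resource_owner_key="ACCESS_TOKEN",
    resource_owner_secret="ACCESS_SECRET",
)

# Fetch recent revisions for Operations Management (curid 1993994).
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "prop": "revisions",
        "pageids": "1993994",
        "rvprop": "timestamp|user|comment",
        "rvlimit": "50",
        "format": "json",
    },
    auth=auth,
)
print(resp.json()["query"]["pages"]["1993994"]["revisions"][:3])
```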

Next Steps

  • Add lenient and strict columns in ETL
  • Add gender information column in ETL
  • Get backlinks with the Wikimedia API and add to ETL (see the sketch after this list)
    links here
  • Get topics of complete articles with BERTopic
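As a sketch for the backlinks item above, the action API's list=backlinks module could be queried with plain requests (the title and limit are illustrative, and continuation handling is omitted):

```python
import requests

resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "list": "backlinks",
        "bltitle": "Operations management",
        "bllimit": "max",
        "format": "json",
    },
)
backlinks = [bl["title"] for bl in resp.json()["query"]["backlinks"]]
print(len(backlinks), backlinks[:5])
```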

Understanding Wikipedia

Qualitative analysis of edits

Pages Analyzed

Operations Management

http://en.wikipedia.org/?curid=1993994

  • Median edits per month: 4.0
  • Mean edits per month: 6.5
  • Standard deviation: 8.0

Operations Management started its life in 2005 with very few changes up until 2013, then saw big spikes in 2013 and 2014 (40-60 edits per month), and then trickled down towards inactivity again at ~2-5 edits per month. The spikes were caused exclusively by a group of 6 users who were always responsible for at least 50% of the edits in the months of the spikes.

These users with the high number of edits added new content to the page. Some edits got reverted (e.g. Tracteur adding unnecessary pictures in 2014-03), but most of these contributions were made in good faith and extended the page.

Contributions by users with fewer contributions in the months of the spikes were usually reverts (of unnecessary or wrong information) and formatting/typo fixes.

Financial analyst

http://en.wikipedia.org/?curid=782266

  • Median edits per month: 3
  • Mean edits per month: 5.90
  • Standard deviation: 9.57

A similar page to Operations Management: the edits were mostly done by 3 users, the spikes were mostly due to the user Fintor extending the page, and the additions were not relevant to any events happening in the world at that time.

Referee

http://en.wikipedia.org/?curid=400892

  • Median edits per month: 4
  • Mean edits per month: 6.5
  • Standard deviation: 8.4

After dropping one outlier:

  • Median edits per month: 4
  • Mean edits per month: 6
  • Standard deviation: 5.6

Aside from 2021-03, Referee is a pretty calm page, which is to be expected considering the occupation probably has not changed much in the last 20 years.

Models

  • Median edits per month: 14
  • Mean edits per month: 20
  • Standard deviation: 18.666

Models seems like a healthier page, i.e. the edits are distributed much more evenly and follow a trend. We see a spike after COVID (the first spike, in 2019-10, is too early to be COVID-related), but it seems none of the edits are COVID-related (there are also no COVID-related changes in 2020-04). The COVID spike could also be due to people having more time to edit Wikipedia pages because of COVID, since the spike slowly trails off.
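A minimal sketch of how the per-page statistics above could be computed, assuming a CSV of revisions with a timestamp column (the file and column names are hypothetical, not the project's actual pipeline):

```python
import pandas as pd

revs = pd.read_csv("data/operations_management_revisions.csv", parse_dates=["timestamp"])

# Edits per calendar month (months with zero edits are not filled in here).
per_month = revs["timestamp"].dt.to_period("M").value_counts().sort_index()

print("Median edits per month:", per_month.median())
print("Mean edits per month:", round(per_month.mean(), 2))
print("Standard deviation:", round(per_month.std(), 2))
```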

Learnings from Page_Analysis.ipynb

  1. Selecting pages by page length might not be the best strategy; average/median edit activity could be much more useful.
  2. Still, it seems like we need bigger datasets to make the data smoother.

Ways to accumulate more and bigger pages per Occupational category

  • add technology pages
  • use minor or broad detail level for the occupational classification system

Size of page content probably matters

For now I’ll try to find an acceptable minimum page size.

Spike after Page inception?

The hypothesis: the number of edits spikes after page creation and then levels off once the page has reached an acceptable/complete state.

  • Observation 1: This does not hold for pages like Baker. Baker was created in 2003 and received only around 10 edits in that year. In 2004 the number of edits was also pretty low.

    Back then, Baker was more similar to what a disambiguation page is today; the page also listed some towns with Baker in their name.

Hypothesis: The Beginnings of Wikipedia and Pages are chaotic

Should we just give Wikipedia and the occupation pages some time to settle down (1 or 2 years)? For now it seems there is no consistency between pages regarding when they settle down.

Watch out for stubs?

Is there a way to track stubs, and should we only consider pages after they lose their stub status?

Topic Modelling

For BERTopic it makes sense to split the pages into paragraphs since different paragraphs will yield different topics.
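A minimal sketch of that splitting step, assuming the page texts are available as plain strings (the variable names are hypothetical):

```python
from bertopic import BERTopic

# Hypothetical input: page title -> plain-text content.
pages = {
    "Operations management": "...full article text...",
    "Referee": "...full article text...",
}

# Split each page into paragraphs so different sections can end up in different topics.
paragraphs = [
    para.strip()
    for text in pages.values()
    for para in text.split("\n\n")
    if para.strip()
]

topic_model = BERTopic(language="english")
topics, probs = topic_model.fit_transform(paragraphs)
print(topic_model.get_topic_info().head())
```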

BERTopic Algorithm

Guide for hyperparameter configuration

Embed Documents

Cluster Documents

Dimensionality Reduction

The default algorithm is UMAP; it can be changed based on this guide.

Density Based Clustering - HDBSCAN algorithm

Tweaking hyperparameters here could be very important, since we don’t want to cluster away topics like COVID-19 (see the HDBSCAN documentation).
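A sketch of how custom dimensionality-reduction and clustering models could be passed to BERTopic; the hyperparameter values below are illustrative, not tuned:

```python
from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")

# A smaller min_cluster_size keeps niche but real topics (e.g. COVID-19)
# from being absorbed into larger clusters or noise.
hdbscan_model = HDBSCAN(min_cluster_size=10, metric="euclidean", prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
# topics, probs = topic_model.fit_transform(paragraphs)  # paragraphs as in the earlier sketch
```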

Transforming the space

The algorithm starts with the creation of a distance matrix based initially on an estimate of density: the core distance of a point $x$ is the distance to its $k$th nearest neighbor, $core_k(x)$. Using the core distance, the mutual reachability distance is then defined as $d_{mreach\text{-}k}(a,b) = \max\{core_k(a),\ core_k(b),\ d(a,b)\}$.
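A small numpy sketch of these two quantities (illustrative, not the hdbscan library’s internals):

```python
import numpy as np

def mutual_reachability(points: np.ndarray, k: int) -> np.ndarray:
    """Pairwise mutual reachability distances for an (n, d) array of points."""
    # Pairwise Euclidean distance matrix.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # core_k(x): distance to the k-th nearest neighbour (index 0 is the point itself).
    core = np.sort(dist, axis=1)[:, k]

    # d_mreach-k(a, b) = max(core_k(a), core_k(b), d(a, b))
    return np.maximum(np.maximum(core[:, None], core[None, :]), dist)
```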

Building the minimum spanning tree

The distance matrix is then used to build a weighted graph where data points are represented by vertices and the corresponding mutual reachability distance values are the weights of the edges between them.

Instead of repeatedly dropping edges above a threshold that gets lowered every iteration (very computationally expensive), the minimum spanning tree is built via Prim’s algorithm, or, “if the data lives in a metric space” (Jakob does not know about metric spaces and this is one level too deep), via other, even faster algorithms.

After the minimum spanning tree is built, it is organized into a hierarchical structure based on the distance between nodes. Then the distance threshold at which edges are cut, and clusters are split, is iteratively reduced. A minimum cluster size is defined and, based on this, a split either results in two clusters or, if one of them is smaller than the minimum size, in the cluster simply losing a point or points. Finally, clusters are selected based on cluster persistence, defined via $λ_{birth}$ and $λ_{death}$ where $λ = \frac{1}{\text{distance}}$: a cluster’s birth is when it emerges from the split of a parent cluster, and its death is when it becomes smaller than the minimum cluster size. For each point $p$ that a cluster loses we can define $λ_p$ as the $λ$ value at which the point was separated from the cluster; $λ_p$ falls between $λ_{birth}$ and $λ_{death}$. Cluster stability can then be computed as $Λ = \sum_{p \in \text{cluster}} (λ_p - λ_{birth})$.

First, all leaves are selected as clusters. Working up from the leaves, if the $Λ$ of a parent node is bigger than the sum of its children’s $Λ$, we select the parent and deselect the children. If the sum of the children’s $Λ$ is greater than the parent’s $Λ$, the parent is assigned the sum of its children’s $Λ$ and we move up the tree.
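A toy illustration of this selection rule (not the library’s code); the tree and stability values are made up:

```python
def select(node, stability, children):
    """Return (selected clusters, propagated stability) for the subtree at node."""
    kids = children.get(node, [])
    if not kids:
        return {node}, stability[node]  # leaves start out selected
    picked, child_sum = set(), 0.0
    for child in kids:
        child_picked, child_stab = select(child, stability, children)
        picked |= child_picked
        child_sum += child_stab
    if stability[node] > child_sum:
        return {node}, stability[node]  # parent wins: deselect the children
    return picked, child_sum            # children win: propagate their sum upward

# Toy tree: root splits into A and B; A splits into two leaves.
children = {"root": ["A", "B"], "A": ["A1", "A2"]}
stability = {"root": 1.0, "A": 0.9, "B": 0.5, "A1": 0.3, "A2": 0.4}
print(select("root", stability, children))  # A is kept (0.9 > 0.3 + 0.4), alongside B
```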

Comparing Embeddings?

https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

Methodological notes

The OEWS estimates are calculated for a specific month (in our case I selected only the May estimates), but they rely on the 6 most recent semi-annual survey panels (2 per year) to produce an estimate.

The May 2019 employment and wage estimates were calculated using data collected in the May 2019, November 2018, May 2018, November 2017, May 2017, and November 2016 semi-annual panels. — https://www.bls.gov/oes/oes_ques.htm#overview

Since we still only have yearly estimates for the labour statistics, we use the edits accumulated in the 12 months up to and including the month of the estimate. For May 2012 we count the edits starting with June 2011 and ending with May 2012.
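A sketch of that window rule with pandas (the DataFrame and its values are made up):

```python
import pandas as pd

edits = pd.DataFrame({
    "month": pd.period_range("2011-01", "2012-12", freq="M"),
    "n_edits": range(24),  # made-up monthly edit counts
})

def edits_for_estimate(estimate_month: str) -> int:
    """Sum the edits of the 12 months up to and including the estimate month."""
    end = pd.Period(estimate_month, freq="M")
    start = end - 11  # e.g. May 2012 -> window starts in June 2011
    window = edits[(edits["month"] >= start) & (edits["month"] <= end)]
    return int(window["n_edits"].sum())

print(edits_for_estimate("2012-05"))  # June 2011 through May 2012
```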

Data

  • Overall source for data: https://www.bls.gov/oes/tables.htm
  • Specific source link: https://www.bls.gov/oes/special.requests/oesm21nat.zip
  • Potentially better estimates for employment stats: https://www.bls.gov/oes/oes-mb3-methods.htm

Removing pages of insufficient length

The 40th percentile lies at around a page length of 10k.
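A sketch of the corresponding filter, assuming a DataFrame with a page_length column (the file and column names are hypothetical):

```python
import pandas as pd

pages = pd.read_csv("data/pages.csv")  # assumed to contain a "page_length" column

# Keep only pages at or above the 40th length percentile.
cutoff = pages["page_length"].quantile(0.40)
pages_filtered = pages[pages["page_length"] >= cutoff]
```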
