Jakob manages the dependencies using Poetry, a Python dependency management tool. Jakob actually installed it via his system package manager, which the Poetry developers don't recommend, though he does not fully understand why. (I think there should also be an Ubuntu package.)
But it's also straightforward to install the dependencies with `pip install -r requirements.txt`. If some dependencies are missing from requirements.txt, that's because Jakob forgot to run `poetry export` (e.g. `poetry export -f requirements.txt --output requirements.txt`) or someone else installed something with pip without exporting.
Data goes in /csh/data/.
Jakob proposes Syncthing, a decentralized file-sharing tool (there is no cloud, everything is on our computers, yay).
It is not the safest option (if one person deletes files, they could be gone for everyone), so it could be good to keep separate backups at certain points.
- sqlitebrowser for viewing the databases
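For a quick look without the GUI, a few lines of Python also work (the database filename below is a placeholder, not one of our actual files):

```python
import sqlite3

# Placeholder path -- point this at one of the databases in /csh/data/.
conn = sqlite3.connect("/csh/data/example.db")

# List the tables and their schemas to get an overview of the database.
for name, sql in conn.execute(
    "SELECT name, sql FROM sqlite_master WHERE type = 'table'"
):
    print(name)
    print(sql)

conn.close()
```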
I guess the most interesting scripts currently are located in the Analysis/ directory.
Creating an OAuth consumer for a higher request limit can be done here (after creating a Wikimedia account): https://meta.wikimedia.org/wiki/Special:OAuthConsumerRegistration/propose
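Once the consumer exists, the token just gets sent along with each API request. A rough sketch for an owner-only OAuth 2.0 consumer (the token and page title are placeholders; an OAuth 1.0a consumer would instead need request signing, e.g. via mwoauth):

```python
import requests

# Placeholder: the access token issued for an owner-only OAuth 2.0 consumer.
ACCESS_TOKEN = "..."

# Fetch some revision metadata for a page via the MediaWiki action API.
response = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "prop": "revisions",
        "titles": "Operations management",  # placeholder page title
        "rvprop": "timestamp|user|size",
        "rvlimit": 50,
        "format": "json",
    },
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
)
response.raise_for_status()
print(response.json())
```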
http://en.wikipedia.org/?curid=1993994
- Median edits per month: 4.0
- Mean edits per month: 6.5
- Standard deviation: 8.0
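For reference, these per-page numbers can be reproduced from the monthly edit counts roughly like this (the counts below are made up, and population vs. sample standard deviation is a choice):

```python
import statistics

# Made-up monthly edit counts; in practice these come from the revision data.
edits_per_month = [2, 4, 1, 7, 40, 3, 5, 0, 6, 2]

print("Median:", statistics.median(edits_per_month))
print("Mean:", statistics.mean(edits_per_month))
print("Std dev:", statistics.pstdev(edits_per_month))  # population std dev
```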
Operations Management started its life in 2005 with very few changes up until 2013, then big spikes in 2013 and 2014 (40-60 edits per month), and then a trickle back down towards near-inactivity again (~2-5 edits per month). The spikes are caused exclusively by a group of 6 users who were always responsible for at least 50% of the edits in the spike months.
These high-edit-count users added new content to the page. Some edits got reverted (e.g. Tracteur adding unnecessary pictures in 2014-03), but most of these contributions were made in good faith and extended the page.
Contributions by users with fewer edits in the spike months were usually reverts (of unnecessary or wrong information) and formatting/typo fixes.
http://en.wikipedia.org/?curid=782266
- Median edits per month: 3
- Mean edits per month: 5.90
- Standard deviation: 9.57
A similar page to Operations Management, with edits mostly done by 3 users. The spikes were mostly due to the user Fintor extending the page, and the additions were not related to any events happening in the world at that time.
http://en.wikipedia.org/?curid=400892
- Median edits per month: 4
- Mean edits per month: 6.5
- Standard deviation: 8.4
After dropping one outlier:
- Median edits per month: 4
- Mean edits per month: 6
- Standard deviation: 5.6
Aside from 2021-03, Referee is a pretty calm page, which is to be expected considering the occupation probably did not change much in the last 20 years.
- Median edits per month: 14
- Mean edits per month: 20
- Standard deviation: 18.666
Models seems like a healthier page, i.e. the edits are distributed much more evenly and they follow a trend. We see a spike after COVID (the first spike, 2019-10, is too early), but it seems none of the edits are COVID-related (and there are also no COVID-related changes in 2020-04). The COVID-era spike could also be due to people having more time to edit Wikipedia pages because of COVID, since the spike slowly trails off.
- Selecting pages by page length might not be the best strategy; average/median edit activity could be much more useful.
- Still, it seems like we need bigger datasets to make the data smoother.
- add technology pages
- use minor or broad detail level for the occupational classification system
For now I'll try to find an acceptable minimum page size; a rough sketch of such a filter is below.
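A sketch of what that filter could look like with pandas; the table layout, column names and thresholds are assumptions:

```python
import pandas as pd

# Assumed layout: one row per page and month.
df = pd.DataFrame({
    "page_id": [1, 1, 1, 2, 2, 2],
    "month": ["2021-01", "2021-02", "2021-03"] * 2,
    "edits": [4, 6, 5, 0, 1, 0],
    "page_length": [12000, 12100, 12300, 3000, 3000, 3100],
})

# Candidate selection by edit activity instead of (or in addition to) page length.
per_page = df.groupby("page_id").agg(
    median_edits=("edits", "median"),
    latest_length=("page_length", "last"),
)

MIN_MEDIAN_EDITS = 2       # assumed threshold
MIN_PAGE_LENGTH = 10_000   # assumed threshold

selected = per_page[
    (per_page["median_edits"] >= MIN_MEDIAN_EDITS)
    & (per_page["latest_length"] >= MIN_PAGE_LENGTH)
]
print(selected)
```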
The hypothesis: the number of edits spikes after page creation and then levels off as the page reaches an acceptable/complete state.
- Observation 1:
This does not hold for pages like Baker. Baker was created in 2003 and received only around 10 edits that year, and in 2004 the number of edits was also pretty low.
Back then, Baker was more similar to what is today a disambiguation page; the page also listed some towns with Baker in their name.
Should we just give Wikipedia occupation pages some time to settle down (1 or 2 years)? For now it seems like there is no consistency between pages in when they settle down.
Is there a way to track stubs, and should we only consider pages after they lose their stub status?
For BERTopic it makes sense to split the pages into paragraphs since different paragraphs will yield different topics.
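A minimal sketch of that split, assuming the extracted page text separates paragraphs with blank lines (the length cutoff is an arbitrary assumption):

```python
import re

def split_into_paragraphs(page_text: str) -> list[str]:
    """Split a page's plain text into paragraphs on blank lines."""
    paragraphs = re.split(r"\n\s*\n", page_text)
    # Drop very short fragments (section headings, stray whitespace);
    # the 50-character cutoff is an arbitrary assumption.
    return [p.strip() for p in paragraphs if len(p.strip()) > 50]

# Each paragraph then becomes one document for BERTopic instead of the whole page.
```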
Guide for Hyperparameter configuration
The default dimensionality reduction algorithm is UMAP; it can be changed based on this guide.
Tweaking the hyper-parameters here could be very important, since we don't want to cluster away topics like COVID-19.
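A sketch of how these could be passed to BERTopic, assuming the umap-learn and hdbscan packages; the parameter values below are guesses, not tuned settings:

```python
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN

# Guessed values -- the point is that a smaller min_cluster_size keeps
# small-but-real topics (e.g. COVID-19) from disappearing into larger clusters.
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
hdbscan_model = HDBSCAN(min_cluster_size=10, metric="euclidean", prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
# topics, probs = topic_model.fit_transform(docs)  # docs = list of paragraphs
```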
HDBSCAN documentation

HDBSCAN starts with the creation of a distance matrix, based initially on an estimate of density: the core distance $core_k(x)$ is the distance from $x$ to its $k$-th nearest neighbor. The core distance defines the mutual reachability distance as $d_{\text{mreach-}k}(a,b) = \max\{core_k(a),\ core_k(b),\ d(a,b)\}$.
The distance matrix is then used to build a weighted graph where data points are represented by vertices and the corresponding mutual reachability distance values are the weights of the edges between them.
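A small NumPy sketch of the quantities above on toy data (the data and the value of k here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))   # toy data
k = 5

# Pairwise Euclidean distances d(a, b).
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# core_k(x): distance to the k-th nearest neighbour (column 0 is the point itself).
core = np.sort(dist, axis=1)[:, k]

# Mutual reachability distance: max(core_k(a), core_k(b), d(a, b)).
mreach = np.maximum(dist, np.maximum(core[:, None], core[None, :]))
```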
Instead of repeatedly dropping edges that are above a threshold which gets lowered every iteration (very computationally expensive), the minimum spanning tree is built via Prim's algorithm, or "if the data lives in a metric space" (? Jakob does not know about metric spaces and this is one level too deep) via other, even faster algorithms.
After the minimum spanning tree is built, it is organized into a hierarchical structure based on the distances between nodes. Then the distance threshold above which edges are removed (splitting clusters) is iteratively lowered.
A minimum cluster size is defined, and based on this, a split either results in two clusters or, if one of the resulting parts is smaller than the minimum size, the split just results in the cluster losing a point or points.
Finally, clusters are selected based on cluster persistence, computed from $\lambda = 1/\text{distance}$: $\lambda_{birth}$ is the value at which a cluster splits off and becomes its own cluster, and $\lambda_{death}$ is the value at which it splits up into smaller clusters. A cluster's stability sums, over its points, how long each point stays in the cluster (its $\lambda$ minus $\lambda_{birth}$).
First, all leaf nodes are selected as clusters. Working up from the leaves, if the sum of the stabilities of a cluster's child clusters is greater than the cluster's own stability, its stability is replaced by that sum; otherwise the cluster itself is selected and all of its descendants are deselected.
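The hdbscan package exposes the result of this selection directly; a minimal usage sketch on toy data (min_cluster_size is an arbitrary choice here):

```python
import hdbscan
from sklearn.datasets import make_blobs

# Toy data with a few dense blobs.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

clusterer = hdbscan.HDBSCAN(min_cluster_size=10)  # arbitrary minimum size
labels = clusterer.fit_predict(X)                 # -1 marks noise points

# Per-cluster persistence, derived from lambda_birth / lambda_death as above.
print(clusterer.cluster_persistence_)
```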
https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
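A sketch of plugging that model in as the embedding model, assuming the sentence-transformers and bertopic packages (the placeholder docs stand in for the real paragraph corpus):

```python
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

# Placeholder corpus; in practice `docs` is the list of page paragraphs.
docs = ["Example paragraph about an occupation.",
        "Another example paragraph about training requirements."]

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = embedder.encode(docs, show_progress_bar=True)

topic_model = BERTopic(embedding_model=embedder)
# Needs a realistically sized corpus to fit; two placeholder docs are not enough.
# topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)
```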
The OEWS estimates are calculated for a specific month (in our case I selected only the May ones), but they rely on the 6 most recent semi-annual surveys (2 per year) to produce an estimate.
The May 2019 employment and wage estimates were calculated using data collected in the May 2019, November 2018, May 2018, November 2017, May 2017, and November 2016 semi-annual panels. — https://www.bls.gov/oes/oes_ques.htm#overview
Since we still only have yearly estimates for the labour statistics, we use the edits accumulated in the 12 months up to and including the month of the estimate. For May 2012 we count the edits starting with June 2011 and ending with May 2012.
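A sketch of that accumulation with pandas; the table layout and column names are assumptions:

```python
import pandas as pd

# Assumed layout: one row per page and month with an edit count.
edits = pd.DataFrame({
    "page_id": [1] * 24,
    "month": pd.period_range("2010-06", periods=24, freq="M"),
    "edits": range(24),
})

# Rolling 12-month sum per page; the value at May 2012 then covers
# June 2011 through May 2012, matching the description above.
edits = edits.sort_values(["page_id", "month"])
edits["edits_12m"] = (
    edits.groupby("page_id")["edits"]
    .rolling(window=12, min_periods=12)
    .sum()
    .reset_index(level=0, drop=True)
)

# Keep only the May rows to line them up with the May OEWS estimates.
may_totals = edits[edits["month"].dt.month == 5]
print(may_totals)
```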
- Overall source for the data: https://www.bls.gov/oes/tables.htm
- Specific source link: https://www.bls.gov/oes/special.requests/oesm21nat.zip
- Potentially better estimates for the employment stats: https://www.bls.gov/oes/oes-mb3-methods.htm
The 40th percentile lies at a page length of around 10k.
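For reference, the cutoff is just a percentile over the candidate pages' lengths, roughly like this (the lengths below are placeholders):

```python
import numpy as np

# Placeholder page lengths; in practice one value per candidate page.
page_lengths = np.array([1_200, 4_500, 8_000, 9_500, 11_000, 20_000, 35_000])

# 40th percentile of page length, used as a minimum-size cutoff.
print(np.percentile(page_lengths, 40))
```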