
Commit 501df99

i-be-snek, liniiiiii, olofg, LiNi12, and Shorouq Zahra authored
Location Normalization (#11)
* Initial commit
* Datasets
* The 1st version of 300events information extracted by GPT4 and summarizing process by GPT3.5
* Add small data table for testing purposes
* Create enwiki-title-matched-cold-spells.jsonl
* Add Wikipedia files
* Add windstorm keyphrases
* Add keyphrases for additional categories
* no message
* no message
* no message
* no message
* 🙈 Ignore results and pycache
* Create comparison-test.py Add script for testing comparison module.
* Add normalisation and comparison modules
* Update comparison module
* Update comparison test
* Update normalisation module
* Add initial comparison analysis
* Extend conversion from text to integers
* Add precision, recall and null penalty
* Add comparison of event sets
* Add event set comparison experiments
* 🙈 Ignore results and pycache
* 📌 Pin dependencies
* 💡 Add TODO comments
* 🗃️ Add preliminary schema
* 🗃️ Fix schema sqlite3 compat + add validation checks
* 🚚 Fix python script location
* ➕ Add pandas as dep for json parsing
* 🗃️ Fix database check validation for day/month
* 🎨 Format sql file
* 🗃️ Add Date field alongside d/m/y split
* 🗃️ Parse 'Events' table
* ➕ Add deps for data parsing
* ♻️ Refactor uid (7 alphanumeric)
* 🗃️ Insert Events without annotation
* ♻️ Refactor Events insert
* 💬 Fix readme
* ♻️ Refactor json parsing (safe fail)
* ✨ Parse date into (day, month, year)
* ✨ Handle missing dates + split dates to d/m/y cols
* 🎨 Add pre-commit for format/lint
* 📌 Add pre-commit deps
* 🚨 Fix lint + formatting
* 🚨 Fix lint / extra line
* 💡 Remove comment
* ♻️ Refactor y/m/d format strings
* 🧐 Fix postprocessing after change in json schema
* 🚚 Fix output path
* 🗃️ Json-normalize specific impact data
* 🗃️ Split dates into d/m/y
* ⚰️ Remove unused functions
* 📝 Add docs + helpful comments
* 🧐 Parse subevents
* 🗃️ Update schema with start+end dates for subevents
* 🚚 Rename Location_* columns
* 🐛 Fix bad date tuple return
* 🚚 Rename insertion file + add json with 8 events
* 🗃️ Add subevents to sqlite3 db
* 🙈 Ignore .DS_Store
* Fix typo
* ♻️ Small refactors + formatting
* 🔥 Remove dead file
* 🗃️ Add database + fix schema and drop annotations from subevents
* 🗃️ Add country column (raw file + in json)
* 🗃️ Add country col
* 🗃️ Update database
* 🔨 Add raw sample files
* 💬 Fix col name order
* Normalize digit/word numerals to floats in a (min, max) range (#9)
* ✨ Add number normalization to extract col data
* ➕ Add number normalization related deps
* 🧑‍💻 Install spacy model if missing
* 💡 Add comments to explain the flow (to myself)
* ♻️ Load spacy model in func
* ♻️ Fix formatting + small refactors
* 🐛 Fix inequality function (check for approx)
* ♻️ Refactor code and logic
* ♻️ Refactor label checking
* 💡 Add comments
* 🐛 Fix approx check order
* 🐛 Handle millions/billions/etc
* ✨ Add extracted/normalized min,max,approx to parquet
* 🗃️ Add min,max,approx cols to db
* ⬇️ Downgrade to python3.9
* ⬇️ Downgrade deps for python3.8
* 🏷️ Fix List type
* 🏷️ Fix more list types -> List
* 🏷️ Fix types to run code on python3.8
* 🧑‍💻 Add args to pass to python scripts
* 📝 Add docs
* Fix NormalizeNum instantiation
* Fix string '0' being identified as approximation case
* Fix Indian Rupee normalization
* Fix 'hundreds of' cases
* Fix '>43 total' case
* 🩹 Fix "none" for Total_* cols meaning "zero"
* 🚨 Fix lint + format
* 🎨 Clean out comments
* ⚰️ Remove stats
* ♻️ Add catch-all for approx
* 📦️ Refresh parquet data files and db
---------
Co-authored-by: Shorouq <[email protected]>
Co-authored-by: chanjuan meng <[email protected]>
* 🙈 Ignore geopy cache
* ✨ Normalize locations (functions + example)
* 🙈 Ignore .env + format with comments
* 📝 Add docs on getting Bing api key
* ➕ Add deps for extracting/normalizing locations
* ♻️ Refactor normalizing locations
* 🗃️ Normalize locations in db
* ♻️ Refactor util functions into own file
* 🎨 Fix formating/lint
* 🗃️ Add GADM csv with location types/levels
* ♻️ Refactor location splitting function
* 🚚 Move data to dir
* ♻️ Refactor function to handle more cases
* ♻️ General refactor of parsing code
* 🚨 Fix lint
* 🔥 Remove moved file
* ♻️ Handle cases with missing columns
* ➕ Add dev dep + description
* 🚚 Refactor module name
* 🏷️ Convert list to str for sql db
* 🔨 Use OpenStreetMap (Nominatim); ditch BING
* 🔨 Add GADM normalization layer
* ⚰️ Clean print statements
* 🐛 Fix selecting only parquet
* 🗃️ Add GADM + UNSD datasets
* ♻️ Fix var name
* ♻️ Refactor GADM id getter for robustness
* ✨ Fuzzy-match world regions
* ♻️ Split function into several
* ♻️ Refactor GADM gid function
* 🚨 Format and lint
* ♻️ Refactor function
* 🚨 Fix lint + format
* ♻️ Refactor GADM data for the USA
* 🙈 Ignore excel files
* 🗃️ Parse events with Nominatim
* 📝 Update docs to remove BING access key instructions
* 📝 Add more instructions
* 🙈 Ignore pycache no matter where it is
* 🚚 Move and rename files for clarity
* ♻️ Prefer locations with multigon/polygon
* 🚨 Format + lint
* 🗃️ Fix database schema
* 🗃️ Add location normalization (normalized name, gid, type, geojson/geometry)
* ➖ Remove unused dep
* 🚚 Fix normalization class names
* 🔊 Add logger
* 🐛 Fix cache_uninstall bug
* 🗃️ Convert GID and location type to str
* ✏️ Fix typo
* ♻️ Refactor to improve quality
* ♻️ Refactor to handle wider cases + quickfix for cardinals
* 🐛 Fix cardinal normalization
* 💬 Expand unwanted location types
* ⚡️ Add caching
* 🐛 Fix not finding lowercase unsd regions
* ⚡️ Generalize to american state if country not found
* ⚡️ Improve country matching in gadm
* ⚗️ Get locations by segment if normal querying fails
* 💡 Fix comment
* 🔊 Fix log
* ♻️ Refactor api return
* ⚡️ Match any segment of location if all fails
* ♻️ Return original area if not normalized
* 🎨 Fix formatting
* ⚡️ Drop events/subevents with no location or year
* ♻️ Expand list of location segments to remove (like city/prefecture/district/etc)
* 💬 Expand list of unwanted location types
* ✏️ Fix typo
* ⚡️ Return "name" if "int_name" not available
* 🥅 Catch pycountry exception
* 🚨 Fix lint/whitespace
* ⚡️ Expand allowed names to be returned for countries (in order of preference)
* ✏️ Fix typo
* 🐛 Fix subevents where the country and location are identical (generalize to country col)
* 🔊 Add logs to table creation script + update db
* 🐛Convert GID list to str for sqlite3
* 🐛Fix inserting partial columns into impactDB
* 🗃️ Add alternative name to Mexico
* Upgrade pre-commit
* 🚨 Format
* 📦️ Set up lfs for large files
* 🙈 Ignore dev files
* 📦️ Add parquet output before db insert
* 📝 Add doc on git lfs
* 💡 Remove debug script
---------
Co-authored-by: Ni Li <[email protected]>
Co-authored-by: CUMULUS\nili <[email protected]>
Co-authored-by: olofg <[email protected]>
Co-authored-by: LiNi12 <[email protected]>
Co-authored-by: Shorouq Zahra <[email protected]>
Co-authored-by: Shorouq <[email protected]>
Co-authored-by: chanjuan meng <[email protected]>
1 parent 0846fd0 commit 501df99

26 files changed, +17775 -556 lines changed

.gitattributes

+9
@@ -0,0 +1,9 @@
+.csv filter=lfs diff=lfs merge=lfs -text
+.db filter=lfs diff=lfs merge=lfs -text
+.sqlite filter=lfs diff=lfs merge=lfs -text
+.parquet filter=lfs diff=lfs merge=lfs -text
+.json filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.db filter=lfs diff=lfs merge=lfs -text
+*.csv filter=lfs diff=lfs merge=lfs -text
+*.sqlite filter=lfs diff=lfs merge=lfs -text
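
Note on the patterns above: a slash-free pattern is matched against the file's base name, so `.csv` would match only a file literally named `.csv`, while `*.csv` matches any file with that extension. The sketch below illustrates the difference. It assumes Python's `fnmatch` is a close enough stand-in for Git's glob matching on these simple patterns (it is not Git's actual matcher), and `matching_patterns` is a hypothetical helper, not part of this repository.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import List

# Patterns copied from the .gitattributes hunk, in file order.
PATTERNS = [
    ".csv", ".db", ".sqlite", ".parquet", ".json",
    "*.parquet", "*.db", "*.csv", "*.sqlite",
]


def matching_patterns(path: str, patterns: List[str] = PATTERNS) -> List[str]:
    """Return the patterns whose glob matches the base name of `path`.

    Approximation: slash-free patterns are compared against the base name only.
    """
    name = PurePosixPath(path).name
    return [p for p in patterns if fnmatchcase(name, p)]


if __name__ == "__main__":
    for candidate in ["Database/data/gadm_world.csv", "events.json", ".csv"]:
        print(candidate, "->", matching_patterns(candidate))
```

Under this reading, only the four `*.ext` lines route data files through LFS. The hunk contains `.json` but no `*.json`, so ordinary JSON files would not be tracked by LFS.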

.gitignore

+14 -1
@@ -1,3 +1,16 @@
+# ignore dev files
 results
-*/__pycache__
+.env
+Database/raw/*.xlsx
+Database/output/*.csv
+Database/output/Ni/*.csv
+Database/output/dev/*
+
+# ignore pycache
+**/__pycache__
+
+# ignore mac-related files
 .DS_Store
+
+# ignore geopy cache (used for normalizing locations faster)
+geopy_cache.sqlite
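
Note on the `__pycache__` change above: `*` in a gitignore pattern does not cross a `/`, so `*/__pycache__` only covers cache directories exactly one level down, whereas a leading `**/` matches at any depth (the "Ignore pycache no matter where it is" item in the commit message). The following is a minimal sketch of that difference. It translates only a tiny subset of gitignore syntax (`*` and a leading `**/`) into a regular expression; both function names are hypothetical, and this is not Git's real matcher.

```python
import re


def gitignore_like_regex(pattern: str):
    """Translate a tiny gitignore subset ('*' and a leading '**/') to a regex."""
    prefix = ""
    if pattern.startswith("**/"):
        prefix = r"(?:.*/)?"  # zero or more leading directories
        pattern = pattern[3:]
    # '*' matches any run of characters except '/'; everything else is literal.
    body = "".join("[^/]*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.compile(prefix + body)


def is_ignored(path: str, pattern: str) -> bool:
    """True if the whole repository-relative path matches the pattern."""
    return gitignore_like_regex(pattern).fullmatch(path) is not None


if __name__ == "__main__":
    for path in ["__pycache__", "Database/__pycache__", "Database/scripts/__pycache__"]:
        print(path, is_ignored(path, "*/__pycache__"), is_ignored(path, "**/__pycache__"))
```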

.pre-commit-config.yaml

+1 -1
@@ -1,6 +1,6 @@
 repos:
 - repo: https://github.com/pre-commit/pre-commit-hooks
-  rev: v2.3.0
+  rev: v4.5.0
   hooks:
   - id: end-of-file-fixer
   - id: trailing-whitespace
+3
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1b1749da5f2cfe66ba6ce95077bdbf4079d63632264b255ee319aa797ac39a3f
+size 20174

Database/data/gadm_world.csv

+3
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:4dbae7fee448f153767689661a4cc579890fae18ca40e7278919959ca9d64b0c
+size 59385015

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:50b1492d73e695baa6c150250b8b7a33b6c89b7c4b8e2dea1a594b79e8ec12af
+size 62891821

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:753b92d2be3944d3a6bfde0aa2c3b203828f40d1e819b1eb28b54efe8490f203
+size 70635681

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:9c9508191ef3e9725b4368e27db6a23ea8f22f4b3bc1b1f65c58871fcd1c049b
+size 12749997

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:83e3a13fbc15fe7050243526fea931a4960100ac27607b9f3b94007dec9d0cb3
+size 211069627

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:be18e36bf6b0be545805b0cc087088114b4e006f9d5b6c6dcdafab4fb4bc84d1
+size 211070734
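
Note on the three-line files above: these are Git LFS pointer files, which stand in for the real data and record a spec version, a SHA-256 object id, and the size in bytes. The sketch below parses such a pointer and checks downloaded bytes against it. `LfsPointer`, `parse_lfs_pointer`, and `matches_pointer` are hypothetical names for illustration, and the parser assumes exactly the `key value` layout shown in this diff.

```python
import hashlib
from typing import NamedTuple


class LfsPointer(NamedTuple):
    version: str
    oid: str   # hex SHA-256 digest of the real file content
    size: int  # size of the real file in bytes


def parse_lfs_pointer(text: str) -> LfsPointer:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, _, digest = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algo!r}")
    return LfsPointer(fields["version"], digest, int(fields["size"]))


def matches_pointer(data: bytes, pointer: LfsPointer) -> bool:
    """Check that `data` has the size and SHA-256 digest the pointer records."""
    return len(data) == pointer.size and hashlib.sha256(data).hexdigest() == pointer.oid


if __name__ == "__main__":
    pointer_text = (
        "version https://git-lfs.github.com/spec/v1\n"
        "oid sha256:1b1749da5f2cfe66ba6ce95077bdbf4079d63632264b255ee319aa797ac39a3f\n"
        "size 20174\n"
    )
    print(parse_lfs_pointer(pointer_text))
```

A check like `matches_pointer` can be useful after `git lfs pull` to confirm that a large file such as `Database/data/gadm_world.csv` was fetched in full rather than left as a pointer.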
