feat: Add Species Habitat Dataset for Faceted Map Examples #684

Draft
wants to merge 33 commits into
base: main
Changes from 31 commits
Commits
d2cc2f6
feat: add species.csv and generation script
dsmedia Feb 11, 2025
f9c979c
chore: fix Ruff linting warnings
dsmedia Feb 11, 2025
93f06ee
chore: formatting with Ruff
dsmedia Feb 11, 2025
f11954e
feat: switch to ScienceBase API for data retrieval
dsmedia Feb 11, 2025
8311060
fix: ruff formatting
dsmedia Feb 12, 2025
87753bf
feat: overhaul habitat data processing pipeline
dsmedia Feb 16, 2025
853d5ce
chore: ruff format
dsmedia Feb 16, 2025
9a778c6
fix(typing): declare aliases as types, use `Literal` correctly
dangotbanned Feb 16, 2025
728a4a7
Merge remote-tracking branch 'upstream/main' into add-species-dataset
dangotbanned Feb 16, 2025
2a1baba
refactor: use zipfile.Path for TIF extraction
dsmedia Feb 17, 2025
fe99fff
feat: Use TOML config and improve ZIP extraction in species.py
dsmedia Feb 17, 2025
cb01bd3
fix: taplo fmt
dsmedia Feb 17, 2025
c4ff52d
feat: update species to align with altair issue
dsmedia Feb 18, 2025
6cee9a4
fix: remove hardcoded habitat ids; update urls.ts
dsmedia Feb 18, 2025
aa4516f
refactor: standardize column naming and enhance documentation
dsmedia Feb 20, 2025
62d4e60
docs: add species.arrow to datapackage_additions
dsmedia Feb 20, 2025
b558302
chore: datapackage md / json
dsmedia Feb 20, 2025
7fc597f
docs: Improve ScienceBaseClient docstrings and error handling
dsmedia Feb 21, 2025
7775aec
docs: list all available formats in toml
dsmedia Feb 21, 2025
12895cd
feat: replace arrow with csv
dsmedia Feb 21, 2025
ee414a9
feat: update species.toml to reflect csv not arrow
dsmedia Feb 21, 2025
9f3267b
feat: reflect csv in main script and datapackage
dsmedia Feb 21, 2025
e40833e
feat: update datapackage
dsmedia Feb 21, 2025
9a5d0c2
docs: somewhat adhere to numpydoc
dangotbanned Feb 22, 2025
2e92ed4
Update scripts/species.py
dsmedia Feb 22, 2025
afcf17a
refactor: simplify error handling and logging patterns
dsmedia Feb 22, 2025
35904df
feat: Set CONUS counties to 0% for missing habitat values
dsmedia Feb 25, 2025
3ea6d87
# feat: Add geographic filtering to focus on coterminous US counties
dsmedia Feb 27, 2025
cd5c243
typo
mattijn Feb 27, 2025
1415aae
refactor: move inline dependencies into a group
dangotbanned Feb 27, 2025
effe713
chore: add `tqdm`
dangotbanned Feb 27, 2025
8551a8d
ci(ruff): add `BLE001` rule
dangotbanned Feb 27, 2025
4ea2185
fix: replace generic Exception handling with specific exceptions
dsmedia Mar 6, 2025
55 changes: 55 additions & 0 deletions _data/datapackage_additions.toml
@@ -1286,6 +1286,61 @@ description = "Date of monthly observation in the format 'MMM D YYYY'"
name = "price"
description = "Closing price of the S&P 500 index for the given month"

[[resources]] # Path: species.csv
path = "species.csv"
description = """
Percentage of year-round habitat for four species -- American robin, white-tailed deer,
American bullfrog, and common gartersnake -- within US counties, derived from USGS
Gap Analysis Project (GAP) Species Habitat Maps. The source rasters have a 30-meter
resolution and cover the contiguous United States. Habitat percentages are calculated
by overlaying species habitat rasters (year-round habitat represented by value 3) with
US county boundaries.

The habitat maps are in Albers Conical Equal Area projection (EPSG:5070). County boundaries
are derived from US Census Bureau cartographic boundary files (1:10,000,000 scale), from
`us-10m.json` in this repository. This dataset includes only *year-round* habitat.
The original raster data also contains values for summer and winter habitat, which are
*not* included in this dataset. Data was processed using the `exactextract` library
for zonal statistics.
"""

[resources.schema]
[[resources.schema.fields]]
name = "item_id"
description = "Unique identifier for the species data item on ScienceBase."

[[resources.schema.fields]]
name = "common_name"
description = "Common name of the species."

[[resources.schema.fields]]
name = "scientific_name"
description = "Scientific name of the species."

[[resources.schema.fields]]
name = "gap_species_code"
description = "GAP Species Code, a unique identifier for the species within the GAP dataset."

[[resources.schema.fields]]
name = "county_id"
description = "Combined state and county FIPS code, identifying the US county."

[[resources.schema.fields]]
name = "habitat_yearround_pct"
description = "Percentage of the county area that is classified as year-round habitat for the species (rounded to 4 decimal places)."

[[resources.sources]]
title = "USGS Gap Analysis Project (GAP) Species Habitat Maps"
path = "https://www.usgs.gov/programs/gap-analysis-project" # General GAP link

[[resources.sources]]
title = "US Census Bureau Cartographic Boundary Files (1:10,000,000)"
path = "https://www.census.gov/geographies/mapping-files/time-series/geo/cartographic-boundary.html"

[[resources.licenses]]
title = "U.S. Government Dataset"
path = "https://www.usa.gov/government-works"

[[resources]] # Path: stocks.csv
path = "stocks.csv"
description = "Monthly stock prices for five companies from 2000 to 2010."
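
The description above says the percentages come from overlaying the habitat rasters (value 3 = year-round) on county boundaries. As a rough illustration of that idea only, here is a pure-Python sketch on a made-up 4x4 grid. The grid, the zone ids, and the `habitat_pct` helper are all hypothetical, and whole-cell counting stands in for the fractional cell coverage that `exactextract` handles. Counting cells approximates area share here because the description states an equal-area projection.

```python
# Hypothetical 4x4 raster: 3 marks year-round habitat (per the description);
# the other values stand in for other or no habitat.
raster = [
    [0, 3, 3, 1],
    [0, 3, 2, 1],
    [3, 3, 0, 0],
    [0, 0, 0, 3],
]
# Hypothetical county mask: one zone id per cell (left half = 1, right half = 2).
zones = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [1, 1, 2, 2],
]

YEAR_ROUND = 3


def habitat_pct(raster, zones, value=YEAR_ROUND):
    """Return {zone: percent of that zone's cells equal to `value`}."""
    totals = {}
    hits = {}
    for raster_row, zone_row in zip(raster, zones):
        for cell, zone in zip(raster_row, zone_row):
            totals[zone] = totals.get(zone, 0) + 1
            if cell == value:
                hits[zone] = hits.get(zone, 0) + 1
    # Zones with no matching cells get 0.0; round to 4 places like the dataset.
    return {zone: round(100 * hits.get(zone, 0) / n, 4) for zone, n in totals.items()}


pct = habitat_pct(raster, zones)
print(pct)
```

Zones without any matching cell come out as 0.0 rather than missing, which mirrors the "0% for missing habitat values" behavior named in the commit list.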
22 changes: 22 additions & 0 deletions _data/species.toml
@@ -0,0 +1,22 @@
# _data/species.toml

[processing]
item_ids = [
'58fa449fe4b0b7ea54524c5e', # American Robin (Habitat)
'58fa817ae4b0b7ea54525c2f', # White-tailed Deer (Habitat)
'58fa3f0be4b0b7ea54524859', # American Bullfrog (Habitat)
'58fe0f4fe4b0074928294636', # Common Gartersnake (Habitat)
]
vector_fp = "../data/us-10m.json" # Relative path from TOML file
output_dir = "../data" # Relative path from TOML file
output_format = "csv" # Available formats: "csv", "parquet", "arrow"
debug = false # Controls logging level

# Areas excluded from analysis to focus on coterminous US
[processing.geographic_filter]
excluded_fips = [
{ code = "02", name = "Alaska" },
{ code = "15", name = "Hawaii" },
{ code = "72", name = "Puerto Rico" },
{ code = "78", name = "Virgin Islands" },
]
12,361 changes: 12,361 additions & 0 deletions data/species.csv

Large diffs are not rendered by default.

66 changes: 64 additions & 2 deletions datapackage.json
@@ -1,7 +1,7 @@
{
"name": "vega-datasets",
"description": "Common repository for example datasets used by Vega related projects. \nBSD-3-Clause license applies only to package code and infrastructure. Users should verify their use of datasets \ncomplies with the license terms of the original sources. Dataset license information, where included, \nis a reference starting point only and is provided without any warranty of accuracy or completeness.\n",
- "homepage": "http://github.com/vega/vega-datasets.git",
+ "homepage": "git+http://github.com/vega/vega-datasets.git",
"licenses": [
{
"name": "BSD-3-Clause",
@@ -20,7 +20,7 @@
}
],
"version": "2.11.0",
- "created": "2025-02-07T20:36:42.016594+00:00",
+ "created": "2025-02-27T12:28:27.826118+00:00",
"resources": [
{
"name": "7zip.png",
@@ -3171,6 +3171,68 @@
]
}
},
{
"name": "species.csv",
"type": "table",
"description": "Percentage of year-round habitat for four species -- American robin, white-tailed deer, \nAmerican bullfrog, and common gartersnake -- within US counties, derived from USGS \nGap Analysis Project (GAP) Species Habitat Maps. The source rasters have a 30-meter \nresolution and cover the contiguous United States. Habitat percentages are calculated \nby overlaying species habitat rasters (year-round habitat represented by value 3) with \nUS county boundaries.\n\nThe habitat maps are in Albers Conical Equal Area projection (EPSG:5070). County boundaries \nare derived from US Census Bureau cartographic boundary files (1:10,000,000 scale), from \n`us-10m.json` in this repository. This dataset includes only *year-round* habitat. \nThe original raster data also contains values for summer and winter habitat, which are \n*not* included in this dataset. Data was processed using the `exactextract` library \nfor zonal statistics.\n",
"licenses": [
{
"title": "U.S. Government Dataset",
"path": "https://www.usa.gov/government-works"
}
],
"sources": [
{
"title": "USGS Gap Analysis Project (GAP) Species Habitat Maps",
"path": "https://www.usgs.gov/programs/gap-analysis-project"
},
{
"title": "US Census Bureau Cartographic Boundary Files (1:10,000,000)",
"path": "https://www.census.gov/geographies/mapping-files/time-series/geo/cartographic-boundary.html"
}
],
"path": "species.csv",
"scheme": "file",
"format": "csv",
"mediatype": "text/csv",
"encoding": "utf-8",
"hash": "sha1:55087bc475039d7bcd59683e5c3a6fb7c955b35a",
"bytes": 1034744,
"schema": {
"fields": [
{
"name": "item_id",
"type": "string",
"description": "Unique identifier for the species data item on ScienceBase."
},
{
"name": "common_name",
"type": "string",
"description": "Common name of the species."
},
{
"name": "scientific_name",
"type": "string",
"description": "Scientific name of the species."
},
{
"name": "gap_species_code",
"type": "string",
"description": "GAP Species Code, a unique identifier for the species within the GAP dataset."
},
{
"name": "county_id",
"type": "integer",
"description": "Combined state and county FIPS code, identifying the US county."
},
{
"name": "habitat_yearround_pct",
"type": "number",
"description": "Percentage of the county area that is classified as year-round habitat for the species (rounded to 4 decimal places)."
}
]
}
},
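
The resource entry above records a `hash` with a `sha1:` prefix and a `bytes` count. Assuming these are the SHA-1 digest and length of the raw file bytes (the prefix suggests so, but `build_datapackage.py` is not shown here), a consumer could check a downloaded copy along these lines. `describe_bytes` and `matches_resource` are hypothetical helper names, and the sample payload is a stand-in, not the real file.

```python
import hashlib


def describe_bytes(payload: bytes) -> dict:
    """Build the hash/bytes pair in the same shape as the resource entry."""
    return {
        "hash": "sha1:" + hashlib.sha1(payload).hexdigest(),
        "bytes": len(payload),
    }


def matches_resource(payload: bytes, resource: dict) -> bool:
    """Compare a payload against the recorded hash and byte count."""
    actual = describe_bytes(payload)
    return actual["hash"] == resource["hash"] and actual["bytes"] == resource["bytes"]


# Stand-in payload; in practice this would be the bytes of data/species.csv.
sample = b"item_id,common_name\n"
resource = describe_bytes(sample)

ok = matches_resource(sample, resource)
tampered = matches_resource(sample + b"x", resource)
print(ok, tampered)
```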
{
"name": "stocks.csv",
"type": "table",
39 changes: 38 additions & 1 deletion datapackage.md
@@ -1,5 +1,5 @@
# vega-datasets
- `2.11.0` | [GitHub](http://github.com/vega/vega-datasets.git) | 2025-02-07 20:36:42 [UTC]
+ `2.11.0` | [GitHub](git+http://github.com/vega/vega-datasets.git) | 2025-02-27 12:28:27 [UTC]

Common repository for example datasets used by Vega related projects.
BSD-3-Clause license applies only to package code and infrastructure. Users should verify their use of datasets
@@ -1514,6 +1514,43 @@ the dot-com bubble burst (2000-2002), the mid-2000s bull market, and the 2008 fi
|:-------|:-------|:-------------------------------------------------------|
| date | string | Date of monthly observation in the format 'MMM D YYYY' |
| price | number | Closing price of the S&P 500 index for the given month |
## `species.csv`
### path
species.csv
### description
Percentage of year-round habitat for four species -- American robin, white-tailed deer,
American bullfrog, and common gartersnake -- within US counties, derived from USGS
Gap Analysis Project (GAP) Species Habitat Maps. The source rasters have a 30-meter
resolution and cover the contiguous United States. Habitat percentages are calculated
by overlaying species habitat rasters (year-round habitat represented by value 3) with
US county boundaries.

The habitat maps are in Albers Conical Equal Area projection (EPSG:5070). County boundaries
are derived from US Census Bureau cartographic boundary files (1:10,000,000 scale), from
`us-10m.json` in this repository. This dataset includes only *year-round* habitat.
The original raster data also contains values for summer and winter habitat, which are
*not* included in this dataset. Data was processed using the `exactextract` library
for zonal statistics.

### schema

| name | type | description |
|:----------------------|:--------|:----------------------------------------------------------------------------------------------------------------------|
| item_id | string | Unique identifier for the species data item on ScienceBase. |
| common_name | string | Common name of the species. |
| scientific_name | string | Scientific name of the species. |
| gap_species_code | string | GAP Species Code, a unique identifier for the species within the GAP dataset. |
| county_id | integer | Combined state and county FIPS code, identifying the US county. |
| habitat_yearround_pct | number | Percentage of the county area that is classified as year-round habitat for the species (rounded to 4 decimal places). |
### sources
| title | path |
|:------------------------------------------------------------|:--------------------------------------------------------------------------------------------|
| USGS Gap Analysis Project (GAP) Species Habitat Maps | https://www.usgs.gov/programs/gap-analysis-project |
| US Census Bureau Cartographic Boundary Files (1:10,000,000) | https://www.census.gov/geographies/mapping-files/time-series/geo/cartographic-boundary.html |
### licenses
| title | path |
|:------------------------|:-------------------------------------|
| U.S. Government Dataset | https://www.usa.gov/government-works |
## `stocks.csv`
### path
stocks.csv
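
Since the schema types `county_id` as an integer, state codes below 10 lose their leading zero in numeric form (for example `1001` rather than `01001`). If a five-character FIPS string is needed downstream, one option is to pad on read and group rows per species, which is roughly the shape a faceted map wants. The sketch below uses the documented column names with made-up placeholder rows; `load_by_species` is a hypothetical helper, not part of this PR.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows using the documented columns; values are illustrative only.
SAMPLE = "\n".join(
    [
        "item_id,common_name,scientific_name,gap_species_code,county_id,habitat_yearround_pct",
        "id-a,Species A,Genus a,CODE1,1001,12.5",
        "id-a,Species A,Genus a,CODE1,56045,0.0",
        "id-b,Species B,Genus b,CODE2,1001,99.25",
    ]
)


def load_by_species(text: str) -> dict:
    """Return {common_name: {5-digit county FIPS: habitat percentage}}."""
    by_species = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(text)):
        fips = row["county_id"].zfill(5)  # restore the leading zero if dropped
        by_species[row["common_name"]][fips] = float(row["habitat_yearround_pct"])
    return dict(by_species)


tables = load_by_species(SAMPLE)
print(tables)
```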
11 changes: 11 additions & 0 deletions pyproject.toml
@@ -14,6 +14,16 @@ version = "2.11.0"

[dependency-groups]
dev = ["ipython[kernel]>=8.30.0", "ruff>=0.8.2", "taplo>=0.9.3"]
geo-species = [
"exactextract",
"geopandas",
"pandas[pyarrow]",
"rasterio",
"requests",
"sciencebasepy",
"setuptools",
"tqdm",
]

[tool.ruff]
extend-exclude = [
@@ -114,6 +124,7 @@ include = [
"./scripts/build_datapackage.py",
"./scripts/flights.py",
"./scripts/income.py",
"./scripts/species.py",
]
pythonPlatform = "All"
pythonVersion = "3.12"