
Commit 2653e9c

Merge branch 'main' of jaanli.github:jaanli/american-community-survey
2 parents 6765336 + c7b337e commit 2653e9c

File tree

114 files changed: +381 −320 lines changed


README.md

+60 −26

@@ -53,14 +53,14 @@ A typical Framework project looks like this:
 
 ## Command reference
 
-| Command           | Description                                                |
-| ----------------- | ---------------------------------------------------------- |
-| `yarn install`    | Install or reinstall dependencies                          |
-| `yarn dev`        | Start local preview server                                 |
-| `yarn build`      | Build your static site, generating `./dist`                |
-| `yarn deploy`     | Deploy your project to Observable                          |
-| `yarn clean`      | Clear the local data loader cache                          |
-| `yarn observable` | Run commands like `observable help`                        |
+| Command           | Description                                 |
+| ----------------- | ------------------------------------------- |
+| `yarn install`    | Install or reinstall dependencies           |
+| `yarn dev`        | Start local preview server                  |
+| `yarn build`      | Build your static site, generating `./dist` |
+| `yarn deploy`     | Deploy your project to Observable           |
+| `yarn clean`      | Clear the local data loader cache           |
+| `yarn observable` | Run commands like `observable help`         |
 
 ## GPT-4 reference
 

@@ -93,14 +93,14 @@ Example plot of this data: https://s13.gifyu.com/images/SCGH2.gif (code here: ht
 
 Example visualization: live demo here - https://jaanli.github.io/american-community-survey/ (visualization code [here](https://github.com/jaanli/american-community-survey/))
 
-![image](https://github.com/jaanli/exploring_american_community_survey_data/assets/5317244/0428e121-c4ec-4a97-826f-d3f944bc7bf2)
+![image](https://github.com/jaanli/exploring_data_processing_data/assets/5317244/0428e121-c4ec-4a97-826f-d3f944bc7bf2)
 
 ## Requirements
 
 Clone the repo; create and activate a virtual environment:
 ```
-git clone https://github.com/jaanli/exploring_american_community_survey_data.git
-cd exploring_american_community_survey_data
+git clone https://github.com/jaanli/american-community-survey.git
+cd american-community-survey
 python3 -m venv .venv
 source activate
 ```

@@ -123,28 +123,62 @@ brew install duckdb
 ## Usage for 2022 ACS Public Use Microdata Sample (PUMS) Data
 
 To retrieve the list of URLs from the Census Bureau's server and download and extract the archives for all of the 50 states' PUMS files, run the following:
+
+```
+cd data_processing
+dbt run --select "public_use_microdata_sample.list_urls" \
+  --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}'
+```
+
+Then save the URLs:
+
+```
+dbt run --select "public_use_microdata_sample.urls" \
+  --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' \
+  --threads 8
+```
+
+Then execute the dbt model for downloading and extracting the archives of the microdata (takes ~2 min on a MacBook):
+
+```
+dbt run --select "public_use_microdata_sample.download_and_extract_archives" \
+  --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2022/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.csv", "output_path": "~/data/american_community_survey"}' \
+  --threads 8
+```
+
+Then generate the CSV paths:
+
+```
+dbt run --select "public_use_microdata_sample.csv_paths" \
+  --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.json", "output_path": "~/data/american_community_survey"}' \
+  --threads 8
 ```
-cd american_community_survey
-dbt run --exclude "public_use_microdata_sample.generated+" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2022/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.csv", "output_path": "~/data/american_community_survey"}'
+
+Then parse the data dictionary:
+
+```
+dbt run --select "public_use_microdata_sample.parse_data_dictionary" \
+  --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' \
+  --threads 8
 ```
 
 Then generate the SQL commands needed to map every state's individual people or housing unit variables to the easier-to-use (and read) names:
 
 ```
-python scripts/generate_sql_data_dictionary_mapping_for_extracted_csv_files.py \
-  ~/data/american_community_survey/public_use_microdata_sample_csv_paths.parquet \
-  ~/data/american_community_survey/PUMS_Data_Dictionary_2022.json
+python scripts/generate_sql_with_enum_types_and_mapped_values_renamed.py ~/data/american_community_survey/csv_paths.parquet ~/data/american_community_survey/PUMS_Data_Dictionary_2022.json
 ```
 
 Then execute these generated SQL queries (adjust `--threads` to match the number of available processor cores on your system):
 ```
-dbt run --select "public_use_microdata_sample.generated+" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2022/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.csv", "output_path": "~/data/american_community_survey"}' --threads 1
+dbt run --select "public_use_microdata_sample.generated+" \
+  --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2022/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.csv", "output_path": "~/data/american_community_survey"}' \
+  --threads 8
 ```
 
 Inspect the output folder to see what has been created in the `output_path` specified in the previous command:
 ```
 ❯ tree -hF -I '*.pdf' ~/data/american_community_survey
-[ 224]  /Users/me/data/american_community_survey/
+[ 224]  /Users/me/data/data_processing/
 ├── [ 128]  2022/
 │   └── [3.4K]  1-Year/
 │       ├── [ 128]  csv_hak/

@@ -169,7 +203,7 @@ To see the size of the csv output:
 
 ```
 ❯ du -sh ~/data/american_community_survey/2022
-6.4G	/Users/me/data/american_community_survey/2022
+6.4G	/Users/me/data/data_processing/2022
 ```
 
 And the compressed representation size:

@@ -284,12 +318,12 @@ Check that you can execute a SQL query against these files:
 ```
 duckdb -c "SELECT COUNT(*) FROM '~/data/american_community_survey/*individual_people_united_states*2021.parquet'"
 ```
-1. Create a data visualization using the compressed parquet files by adding to the `american_community_survey/models/public_use_microdata_sample/figures` directory, and using examples from here https://github.com/jaanli/american-community-survey/ or here https://github.com/jaanli/lonboard/blob/example-american-community-survey/examples/american-community-survey.ipynb
+6. Create a data visualization using the compressed parquet files by adding to the `data_processing/models/public_use_microdata_sample/figures` directory, and using examples from here https://github.com/jaanli/american-community-survey/ or here https://github.com/jaanli/lonboard/blob/example-american-community-survey/examples/american-community-survey.ipynb
 
-To save time, there is a bash script with these steps in `scripts/process_one_year_of_american_community_survey_data.sh` that can be used as follows:
+To save time, there is a bash script with these steps in `scripts/process_one_year_of_data_processing_data.sh` that can be used as follows:
 ```
-chmod a+x scripts/process_one_year_of_american_community_survey_data.sh
-./scripts/process_one_year_of_american_community_survey_data.sh 2021
+chmod a+x scripts/process_one_year_of_data_processing_data.sh
+./scripts/process_one_year_of_data_processing_data.sh 2021
 ```
 
 The argument specifies the year to be downloaded, transformed, compressed, and saved. It takes about 5 minutes per year of data.

@@ -570,7 +604,7 @@ dbt run --select "public_use_microdata_sample.microdata_area_shapefile_paths"
 ```
 5. Check that the paths are correct:
 ```
-❯ duckdb -c "SELECT * FROM '/Users/me/data/american_community_survey/microdata_area_shapefile_paths.parquet';"
+❯ duckdb -c "SELECT * FROM '/Users/me/data/data_processing/microdata_area_shapefile_paths.parquet';"
 ```
 Displays:
 

@@ -579,11 +613,11 @@ Displays:
 │                                          shp_path                                           │
 │                                           varchar                                           │
 ├─────────────────────────────────────────────────────────────────────────────────────────────┤
-│ /Users/me/data/american_community_survey/PUMA5/2010/tl_2010_02_puma10/tl_2010_02_puma10.shp │
+│ /Users/me/data/data_processing/PUMA5/2010/tl_2010_02_puma10/tl_2010_02_puma10.shp           │
 │                                              ·                                              │
 │                                              ·                                              │
 │                                              ·                                              │
-│ /Users/me/data/american_community_survey/PUMA5/2010/tl_2010_48_puma10/tl_2010_48_puma10.shp │
+│ /Users/me/data/data_processing/PUMA5/2010/tl_2010_48_puma10/tl_2010_48_puma10.shp           │
 ├─────────────────────────────────────────────────────────────────────────────────────────────┤
 │ 54 rows (40 shown)                                                                          │
 └─────────────────────────────────────────────────────────────────────────────────────────────┘

data_processing/dbt_project.yml

+5 −5

@@ -1,12 +1,12 @@
 # Name your project! Project names should contain only lowercase characters
 # and underscores. A good package name should reflect your organization's
 # name or the intended use of these models
-name: "american_community_survey"
+name: "data_processing"
 version: "1.0.0"
 config-version: 2
 
 # This setting configures which "profile" dbt uses for this project.
-profile: "american_community_survey"
+profile: "data_processing"
 
 # Variables that can be changed from the command line using the `--vars` flag:
 # example: dbt run --vars 'my_variable: my_value'

@@ -28,8 +28,8 @@ macro-paths: ["macros"]
 snapshot-paths: ["snapshots"]
 
 clean-targets: # directories to be removed by `dbt clean`
-  - "target"
-  - "dbt_packages"
+  - "target"
+  - "dbt_packages"

@@ -38,7 +38,7 @@ clean-targets: # directories to be removed by `dbt clean`
 # directory as views. These settings can be overridden in the individual model
 # files using the `{{ config(...) }}` macro.
 models:
-  american_community_survey:
+  data_processing:
     # Config indicated by + and applies to all files under models/example/
     # example:
     +materialized: view
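Note that `profile:` is resolved against `~/.dbt/profiles.yml` by default, so this rename assumes a matching `data_processing` entry exists there. A quick sanity-check sketch (assumes PyYAML is installed and the default profiles location is in use):

# Sketch: confirm the renamed profile exists where dbt looks by default.
import os

import yaml  # requires PyYAML (pip install pyyaml)

with open(os.path.expanduser("~/.dbt/profiles.yml")) as f:
    profiles = yaml.safe_load(f)

assert "data_processing" in profiles, "profiles.yml still needs a data_processing entry"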
@@ -1,27 +1,31 @@
 version: 2
 
 models:
-  - name: list_urls
-    config:
-      public_use_microdata_sample_url: "{{ var('public_use_microdata_sample_url') }}"
-      output_path: "{{ var('output_path') }}"
-  - name: download_and_extract_archives
-    config:
-      public_use_microdata_sample_url: "{{ var('public_use_microdata_sample_url') }}"
-      output_path: "{{ var('output_path') }}"
-  - name: parse_data_dictionary
-    config:
-      public_use_microdata_sample_data_dictionary_url: "{{ var('public_use_microdata_sample_data_dictionary_url') }}"
-      output_path: "{{ var('output_path') }}"
-  - name: list_shapefile_urls
-    config:
-      microdata_area_shapefile_url: "{{ var('microdata_area_shapefile_url') }}"
-      output_path: "{{ var('output_path') }}"
-  - name: download_and_extract_shapefiles
-    config:
-      microdata_area_shapefile_url: "{{ var('microdata_area_shapefile_url') }}"
-      output_path: "{{ var('output_path') }}"
-  - name: combine_shapefiles
-    config:
-      microdata_area_shapefile_url: "{{ var('microdata_area_shapefile_url') }}"
-      output_path: "{{ var('output_path') }}"
+  - name: list_urls
+    config:
+      public_use_microdata_sample_url: "{{ var('public_use_microdata_sample_url') }}"
+      output_path: "{{ var('output_path') }}"
+  - name: download_and_extract_archives
+    config:
+      public_use_microdata_sample_url: "{{ var('public_use_microdata_sample_url') }}"
+      output_path: "{{ var('output_path') }}"
+  - name: csv_paths
+    config:
+      public_use_microdata_sample_url: "{{ var('public_use_microdata_sample_url') }}"
+      output_path: "{{ var('output_path') }}"
+  - name: parse_data_dictionary
+    config:
+      public_use_microdata_sample_data_dictionary_url: "{{ var('public_use_microdata_sample_data_dictionary_url') }}"
+      output_path: "{{ var('output_path') }}"
+  - name: list_shapefile_urls
+    config:
+      microdata_area_shapefile_url: "{{ var('microdata_area_shapefile_url') }}"
+      output_path: "{{ var('output_path') }}"
+  - name: download_and_extract_shapefiles
+    config:
+      microdata_area_shapefile_url: "{{ var('microdata_area_shapefile_url') }}"
+      output_path: "{{ var('output_path') }}"
+  - name: combine_shapefiles
+    config:
+      microdata_area_shapefile_url: "{{ var('microdata_area_shapefile_url') }}"
+      output_path: "{{ var('output_path') }}"
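The new `csv_paths` entry pairs with the `csv_paths.parquet` file now referenced in the README above. The model itself is not part of this diff; a minimal sketch of what a dbt Python model with this shape could look like, assuming it simply globs the CSVs extracted by `download_and_extract_archives` under `output_path`:

# Hypothetical sketch of a csv_paths-style dbt Python model; the actual
# model is not shown in this commit. It collects the extracted CSV files
# and returns their paths as a single-column DataFrame.
import glob
import os

import pandas as pd


def model(dbt, session):
    output_path = os.path.expanduser(dbt.config.get("output_path"))
    csv_files = sorted(
        glob.glob(os.path.join(output_path, "**", "*.csv"), recursive=True)
    )
    return pd.DataFrame(csv_files, columns=["csv_path"])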

data_processing/models/public_use_microdata_sample/download_and_extract_archives.py

+2 −23

@@ -25,7 +25,7 @@ def model(dbt, session):
     base_url = dbt.config.get('public_use_microdata_sample_url')  # Assuming this is correctly set
 
     # Fetch URLs from your table or view
-    query = "SELECT * FROM list_urls"
+    query = "SELECT * FROM list_urls "
     result = session.execute(query).fetchall()
     columns = [desc[0] for desc in session.description]
     url_df = pd.DataFrame(result, columns=columns)

@@ -50,25 +50,4 @@ def model(dbt, session):
     paths_df = pd.DataFrame(extracted_files, columns=['csv_path'])
 
     # Return the DataFrame with paths to the extracted CSV files
-    return paths_df
-
-# Mock dbt and session for demonstration; replace with actual dbt and session in your environment
-class MockDBT:
-    def config(self, key):
-        return {
-            'public_use_microdata_sample_url': 'https://example.com/path/to/your/csv/files',
-            'output_path': '~/path/to/your/output/directory'
-        }.get(key, '')
-
-class MockSession:
-    def execute(self, query):
-        # Mock response; replace with actual fetching logic
-        return [{"URL": "https://example.com/path/to/your/csv_file.zip"} for _ in range(10)]
-
-dbt = MockDBT()
-session = MockSession()
-
-if __name__ == "__main__":
-    # Directly calling model function for demonstration; integrate properly within your dbt project
-    df = model(dbt, session)
-    print(df)
+    return paths_df
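Dropping the mock harness from the model file is sensible: the module-level `dbt = MockDBT()` would shadow the `dbt` object supplied when the model actually runs. If the smoke test is still useful, it can live in a standalone script instead. A sketch under assumptions (hypothetical file `scripts/smoke_test_archives.py`; the stub `model()` stands in for the real one, which a real harness would import). It also fixes two bugs in the removed mocks: `MockDBT.config` was a method while `model()` calls `dbt.config.get(...)`, and `MockSession.execute` returned a plain list, which lacks the `.fetchall()` the model calls.

# Hypothetical standalone smoke test, adapted from the mocks removed above.
from types import SimpleNamespace

import pandas as pd


def model(dbt, session):
    # Stub with the same interface as the dbt model under test; import the
    # real model function here in an actual harness.
    base_url = dbt.config.get("public_use_microdata_sample_url")  # exercises the config mock
    rows = session.execute("SELECT * FROM list_urls").fetchall()
    columns = [desc[0] for desc in session.description]
    return pd.DataFrame(rows, columns=columns)


class MockResult:
    def __init__(self, rows):
        self._rows = rows

    def fetchall(self):
        return self._rows


class MockSession:
    # Mirrors session.description as used by the model: first element of
    # each entry is the column name.
    description = [("URL",)]

    def execute(self, query):
        return MockResult([("https://example.com/path/to/your/csv_file.zip",)])


mock_dbt = SimpleNamespace(
    config=SimpleNamespace(
        get={
            "public_use_microdata_sample_url": "https://example.com/csvs/",
            "output_path": "~/tmp/acs_output",
        }.get
    )
)

if __name__ == "__main__":
    print(model(mock_dbt, MockSession()))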

data_processing/models/public_use_microdata_sample/generated/2022/enum_types_mapped_renamed/housing_units_alabama_enum_mapped_renamed_2022.sql

+2 −2

@@ -905,7 +905,7 @@ CASE FYRBLTP
     WGTP78::VARCHAR AS "Housing Weight replicate 78",
     WGTP79::VARCHAR AS "Housing Weight replicate 79",
     WGTP80::VARCHAR AS "Housing Weight replicate 80",
-FROM read_csv('/Users/me/data/american_community_survey/2022/1-Year/csv_hal/psam_h01.csv',
+FROM read_csv('~/data/american_community_survey/2022/1-Year/csv_hal/psam_h01.csv',
     parallel=False,
     all_varchar=True,
-    auto_detect=True)
+    auto_detect=True)

data_processing/models/public_use_microdata_sample/generated/2022/enum_types_mapped_renamed/housing_units_alaska_enum_mapped_renamed_2022.sql

+2 −2

@@ -905,7 +905,7 @@ CASE FYRBLTP
     WGTP78::VARCHAR AS "Housing Weight replicate 78",
     WGTP79::VARCHAR AS "Housing Weight replicate 79",
     WGTP80::VARCHAR AS "Housing Weight replicate 80",
-FROM read_csv('/Users/me/data/american_community_survey/2022/1-Year/csv_hak/psam_h02.csv',
+FROM read_csv('~/data/american_community_survey/2022/1-Year/csv_hak/psam_h02.csv',
     parallel=False,
     all_varchar=True,
-    auto_detect=True)
+    auto_detect=True)

data_processing/models/public_use_microdata_sample/generated/2022/enum_types_mapped_renamed/housing_units_arizona_enum_mapped_renamed_2022.sql

+2 −2

@@ -905,7 +905,7 @@ CASE FYRBLTP
     WGTP78::VARCHAR AS "Housing Weight replicate 78",
     WGTP79::VARCHAR AS "Housing Weight replicate 79",
     WGTP80::VARCHAR AS "Housing Weight replicate 80",
-FROM read_csv('/Users/me/data/american_community_survey/2022/1-Year/csv_haz/psam_h04.csv',
+FROM read_csv('~/data/american_community_survey/2022/1-Year/csv_haz/psam_h04.csv',
     parallel=False,
     all_varchar=True,
-    auto_detect=True)
+    auto_detect=True)

data_processing/models/public_use_microdata_sample/generated/2022/enum_types_mapped_renamed/housing_units_arkansas_enum_mapped_renamed_2022.sql

+2 −2

@@ -905,7 +905,7 @@ CASE FYRBLTP
     WGTP78::VARCHAR AS "Housing Weight replicate 78",
     WGTP79::VARCHAR AS "Housing Weight replicate 79",
     WGTP80::VARCHAR AS "Housing Weight replicate 80",
-FROM read_csv('/Users/me/data/american_community_survey/2022/1-Year/csv_har/psam_h05.csv',
+FROM read_csv('~/data/american_community_survey/2022/1-Year/csv_har/psam_h05.csv',
     parallel=False,
     all_varchar=True,
-    auto_detect=True)
+    auto_detect=True)

data_processing/models/public_use_microdata_sample/generated/2022/enum_types_mapped_renamed/housing_units_california_enum_mapped_renamed_2022.sql

+2 −2

@@ -905,7 +905,7 @@ CASE FYRBLTP
     WGTP78::VARCHAR AS "Housing Weight replicate 78",
     WGTP79::VARCHAR AS "Housing Weight replicate 79",
     WGTP80::VARCHAR AS "Housing Weight replicate 80",
-FROM read_csv('/Users/me/data/american_community_survey/2022/1-Year/csv_hca/psam_h06.csv',
+FROM read_csv('~/data/american_community_survey/2022/1-Year/csv_hca/psam_h06.csv',
     parallel=False,
     all_varchar=True,
-    auto_detect=True)
+    auto_detect=True)

data_processing/models/public_use_microdata_sample/generated/2022/enum_types_mapped_renamed/housing_units_colorado_enum_mapped_renamed_2022.sql

+2 −2

@@ -905,7 +905,7 @@ CASE FYRBLTP
     WGTP78::VARCHAR AS "Housing Weight replicate 78",
     WGTP79::VARCHAR AS "Housing Weight replicate 79",
     WGTP80::VARCHAR AS "Housing Weight replicate 80",
-FROM read_csv('/Users/me/data/american_community_survey/2022/1-Year/csv_hco/psam_h08.csv',
+FROM read_csv('~/data/american_community_survey/2022/1-Year/csv_hco/psam_h08.csv',
     parallel=False,
     all_varchar=True,
-    auto_detect=True)
+    auto_detect=True)

data_processing/models/public_use_microdata_sample/generated/2022/enum_types_mapped_renamed/housing_units_connecticut_enum_mapped_renamed_2022.sql

+2 −2

@@ -905,7 +905,7 @@ CASE FYRBLTP
     WGTP78::VARCHAR AS "Housing Weight replicate 78",
     WGTP79::VARCHAR AS "Housing Weight replicate 79",
     WGTP80::VARCHAR AS "Housing Weight replicate 80",
-FROM read_csv('/Users/me/data/american_community_survey/2022/1-Year/csv_hct/psam_h09.csv',
+FROM read_csv('~/data/american_community_survey/2022/1-Year/csv_hct/psam_h09.csv',
     parallel=False,
     all_varchar=True,
-    auto_detect=True)
+    auto_detect=True)
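The same two-line `read_csv` change (the absolute `/Users/me/...` path swapped for a `~/...` path) repeats across all of the per-state generated SQL files in this commit. One way to sanity-check a generated path outside dbt, sketched with DuckDB's Python API (expanding `~` explicitly in Python rather than relying on how the SQL engine resolves it; uses the Alabama file from the first of these diffs and assumes the extracted 2022 PUMS CSVs are on disk):

# Sketch: verify one of the per-state CSV paths referenced by the
# generated SQL. Requires duckdb (pip install duckdb).
import os

import duckdb

csv_path = os.path.expanduser(
    "~/data/american_community_survey/2022/1-Year/csv_hal/psam_h01.csv"
)
# all_varchar mirrors the option used in the generated SQL.
count = duckdb.sql(
    f"SELECT COUNT(*) FROM read_csv('{csv_path}', all_varchar=True)"
).fetchone()[0]
print(f"{count} rows in {csv_path}")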
