@@ -121,25 +121,25 @@ For debugging, the `duckdb` command line tool is available on homebrew:
brew install duckdb
```
- ## Usage for 2022 ACS Public Use Microdata Sample (PUMS) Data
+ ## Usage for 2023 ACS Public Use Microdata Sample (PUMS) Data
To retrieve the list of URLs from the Census Bureau's server and download and extract the archives for all of the 50 states' PUMS files, run the following:
```
cd data_processing
- dbt run --select "public_use_microdata_sample.list_urls" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}'
+ dbt run --select "public_use_microdata_sample.list_urls" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2023/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2023.csv", "output_path": "~/data/american_community_survey"}'
```
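Every command in this section repeats the same `--vars` JSON, so a year mismatch between the data URL and the data dictionary URL is easy to introduce. The following is a minimal sketch, not part of the repository, of building that JSON from a single year. The helper name `build_vars` is hypothetical, and the sketch assumes other years follow the same URL pattern as the commands shown here.

```python
import json

# Base URL taken from the commands in this README.
BASE = "https://www2.census.gov/programs-surveys/acs"


def build_vars(year, output_path="~/data/american_community_survey"):
    """Return the JSON string to pass to `dbt run --vars` for one ACS year.

    Hypothetical helper: both URLs are derived from the same `year`,
    so they cannot drift apart.
    """
    return json.dumps({
        "public_use_microdata_sample_url": f"{BASE}/data/pums/{year}/1-Year/",
        "public_use_microdata_sample_data_dictionary_url": (
            f"{BASE}/tech_docs/pums/data_dict/PUMS_Data_Dictionary_{year}.csv"
        ),
        "output_path": output_path,
    })


if __name__ == "__main__":
    # Print the string so it can be pasted after --vars.
    print(build_vars(2023))
```

The printed string can be wrapped in single quotes and passed after `--vars` in any of the `dbt run` commands in this section.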
Then save the URLs:
```
- dbt run --select "public_use_microdata_sample.urls" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' --threads 8
+ dbt run --select "public_use_microdata_sample.urls" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2023/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2023.csv", "output_path": "~/data/american_community_survey"}' --threads 8
```
Then execute the dbt model that downloads and extracts the archives of the microdata (takes about 2 minutes on a MacBook):
```
- dbt run --select "public_use_microdata_sample.download_and_extract_archives" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2022/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.csv", "output_path": "~/data/american_community_survey"}' --threads 8
+ dbt run --select "public_use_microdata_sample.download_and_extract_archives" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2023/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2023.csv", "output_path": "~/data/american_community_survey"}' --threads 8
```
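The three steps above run the `list_urls`, `urls`, and `download_and_extract_archives` models in order with the same variables. As a rough sketch only, the commands could be assembled programmatically as shown below. The function `dbt_commands` is hypothetical and not part of the repository, and unlike the first command above it passes `--threads` to every step.

```python
import shlex

# Model names as they appear in the --select arguments above, in run order.
MODELS = ["list_urls", "urls", "download_and_extract_archives"]


def dbt_commands(vars_json, threads=8):
    """Build one `dbt run` argument list per model (hypothetical helper).

    Nothing is executed here; the caller decides how to run the commands.
    """
    return [
        [
            "dbt", "run",
            "--select", f"public_use_microdata_sample.{model}",
            "--vars", vars_json,
            "--threads", str(threads),
        ]
        for model in MODELS
    ]


if __name__ == "__main__":
    # Placeholder variables; substitute the full --vars JSON from the README.
    example_vars = '{"output_path": "~/data/american_community_survey"}'
    for command in dbt_commands(example_vars):
        # shlex.join quotes the JSON so the line can be pasted into a shell.
        print(shlex.join(command))
```

To run the steps rather than print them, each argument list could be passed to `subprocess.run(command, check=True, cwd="data_processing")`, which stops at the first failing step.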
Then generate the CSV paths:
@@ -259,26 +259,26 @@ duckdb -c "SELECT * FROM '~/data/american_community_survey/public_use_microdata_
```
2. Download and extract the archives for all of the 50 states' PUMS files (takes about 30 seconds on a gigabit connection):
```
- dbt run --select "public_use_microdata_sample.download_and_extract_archives" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' --threads 8
+ dbt run --select "public_use_microdata_sample.download_and_extract_archives" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2023/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2023.csv", "output_path": "~/data/american_community_survey"}' --threads 8
```
Save the paths to the CSV files:
```
- dbt run --select "public_use_microdata_sample.public_use_microdata_sample_csv_paths" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' --threads 8
+ dbt run --select "public_use_microdata_sample.csv_paths" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2023/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2023.csv", "output_path": "~/data/american_community_survey"}' --threads 8
```
Check that the CSV files are present:
```
- duckdb -c "SELECT * FROM '~/data/american_community_survey/public_use_microdata_sample_csv_paths.parquet'"
+ duckdb -c "SELECT * FROM '~/data/american_community_survey/csv_paths.parquet'"
```
2. Parse the data dictionary:
```bash
- python scripts/parse_data_dictionary.py https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.csv
+ python scripts/parse_data_dictionary.py https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2023.csv
```
Then:
```
- dbt run --select "public_use_microdata_sample.public_use_microdata_sample_data_dictionary_path" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' --threads 8
+ dbt run --select "public_use_microdata_sample.data_dictionary_path" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2023/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2023.csv", "output_path": "~/data/american_community_survey"}' --threads 8
```
Check that the data dictionary path is displayed correctly:
```
@@ -287,13 +287,11 @@ duckdb -c "SELECT * FROM '~/data/american_community_survey/public_use_microdata_
1. Generate the SQL commands needed to map every state's individual people or housing unit variables to the easier to use (and read) names:
```
- python scripts/generate_sql_with_enum_types_and_mapped_values_renamed.py \
- ~/data/american_community_survey/public_use_microdata_sample_csv_paths.parquet \
- ~/data/american_community_survey/PUMS_Data_Dictionary_2021.json
+ python scripts/generate_sql_with_enum_types_and_mapped_values_renamed.py ~/data/american_community_survey/csv_paths.parquet PUMS_Data_Dictionary_2023.json
```
1. Execute these generated SQL queries using 8 threads (you can adjust this number to be higher depending on the available processor cores on your system):
```
- dbt run --select "public_use_microdata_sample.generated.2021+" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' --threads 8
+ dbt run --select "public_use_microdata_sample.generated.2023+" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2023/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2023.csv", "output_path": "~/data/american_community_survey"}' --threads 8
```
1. **Test** that the compressed parquet files are present and have the expected size:
```