@@ -126,40 +126,37 @@ To retrieve the list of URLs from the Census Bureau's server and download and ex
126
126
127
127
```
128
128
cd data_processing
129
- dbt run --select "public_use_microdata_sample.list_urls" \
130
- --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}'
129
+ dbt run --select "public_use_microdata_sample.list_urls" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}'
131
130
```
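Every command in this walkthrough repeats the same `--vars` JSON, so it can help to build that string once. The following is a minimal sketch, not part of the project: the variable names `PUMS_YEAR`, `PUMS_URL`, `DICT_URL`, `OUTPUT_PATH`, and `DBT_VARS` are hypothetical, and the command is only printed so it can be reviewed before running.

```shell
# Hypothetical helper variables; these names are illustrative, not defined by the project.
PUMS_YEAR=2021
PUMS_URL="https://www2.census.gov/programs-surveys/acs/data/pums/${PUMS_YEAR}/1-Year/"
DICT_URL="https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_${PUMS_YEAR}.csv"
# The tilde is kept literal here, exactly as in the commands above.
OUTPUT_PATH="~/data/american_community_survey"

# Assemble the JSON string that is passed to dbt's --vars flag.
DBT_VARS=$(printf '{"public_use_microdata_sample_url": "%s", "public_use_microdata_sample_data_dictionary_url": "%s", "output_path": "%s"}' \
  "$PUMS_URL" "$DICT_URL" "$OUTPUT_PATH")

# Print the command rather than executing it.
echo dbt run --select "public_use_microdata_sample.list_urls" --vars "'$DBT_VARS'"
```

Changing `PUMS_YEAR` in one place then keeps the data URL and the data dictionary URL on the same survey year.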
132
131
133
132
Then save the URLs:
134
133
135
134
```
136
- dbt run --select "public_use_microdata_sample.urls" \
137
- --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' \
138
- --threads 8
135
+ dbt run --select "public_use_microdata_sample.urls" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' --threads 8
139
136
```
140
137
141
138
Then execute the dbt model for downloading and extracting the archives of the microdata (takes about 2 minutes on a MacBook):
142
139
143
140
```
144
- dbt run --select "public_use_microdata_sample.download_and_extract_archives" \
145
- --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2022/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.csv", "output_path": "~/data/american_community_survey"}' \
146
- --threads 8
141
+ dbt run --select "public_use_microdata_sample.download_and_extract_archives" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2022/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.csv", "output_path": "~/data/american_community_survey"}' --threads 8
147
142
```
148
143
149
144
Then generate the CSV paths:
150
145
151
146
```
152
- dbt run --select "public_use_microdata_sample.csv_paths" \
153
- --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.json", "output_path": "~/data/american_community_survey"}' \
154
- --threads 8
147
+ dbt run --select "public_use_microdata_sample.csv_paths" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.json", "output_path": "~/data/american_community_survey"}' --threads 8
155
148
```
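The four dbt steps up to this point differ only in the model name, so they can be driven from a loop. This is a rough sketch under stated assumptions: `run_model` and `DRY_RUN` are hypothetical names, `--threads 8` is applied to every step for simplicity, and by default the commands are only printed rather than executed.

```shell
# Hypothetical wrapper; run_model and DRY_RUN are illustrative names, not part of the project.
DRY_RUN="${DRY_RUN:-1}"
DBT_VARS='{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}'

run_model() {
  # $1 is the model name after the "public_use_microdata_sample." prefix.
  if [ "$DRY_RUN" = "1" ]; then
    # Dry run: show the command that would be executed.
    echo "dbt run --select public_use_microdata_sample.$1 --vars '$DBT_VARS' --threads 8"
  else
    dbt run --select "public_use_microdata_sample.$1" --vars "$DBT_VARS" --threads 8
  fi
}

# Run the steps in the order given above and stop at the first failure.
for model in list_urls urls download_and_extract_archives csv_paths; do
  run_model "$model" || break
done
```

Setting `DRY_RUN=0` before running the script executes the commands instead of printing them.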
156
149
157
150
Then parse the data dictionary:
158
151
152
+ ``` bash
153
+ python scripts/parse_data_dictionary.py https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.csv
154
+ ```
155
+
156
+ Then execute the dbt model for parsing the data dictionary:
157
+
159
158
```
160
- dbt run --select "public_use_microdata_sample.parse_data_dictionary" \
161
- --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' \
162
- --threads 8
159
+ dbt run --select "public_use_microdata_sample.parse_data_dictionary" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' --threads 8
163
160
```
164
161
165
162
Then generate the SQL commands needed to map every state's individual people or housing unit variables to easier-to-use (and easier-to-read) names:
@@ -170,9 +167,7 @@ python scripts/generate_sql_with_enum_types_and_mapped_values_renamed.py ~/data/
170
167
171
168
Then execute these generated SQL queries using 8 threads (you can adjust this number depending on the available processor cores on your system):
172
169
```
173
- dbt run --select "public_use_microdata_sample.generated+" \
174
- --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2022/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.csv", "output_path": "~/data/american_community_survey"}' \
175
- --threads 8
170
+ dbt run --select "public_use_microdata_sample.generated+" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2022/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.csv", "output_path": "~/data/american_community_survey"}' --threads 8
176
171
```
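Since the thread count depends on the available processor cores, one option is to detect the core count instead of hard-coding it. The sketch below assumes `getconf _NPROCESSORS_ONLN` is available (it generally is on Linux and macOS) and falls back to 1 otherwise; `THREADS` is a hypothetical variable name.

```shell
# Detect the number of online logical processors; fall back to 1 if detection fails.
THREADS=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Show the flag that could be appended to the dbt command above.
echo "--threads $THREADS"
```

Whether using every core is actually faster depends on memory and disk throughput, so treat the detected value as an upper bound to experiment with.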
177
172
178
173
Inspect the output folder to see what has been created in the `output_path` specified in the previous command:
@@ -254,9 +249,7 @@ D SELECT COUNT(*) FROM '~/data/american_community_survey/housing_units_*[!united
254
249
255
250
1. Use dbt to save the database of the 2021 1-Year ACS PUMS data and data dictionary from the Census Bureau's server and extract the archives for all 50 states' PUMS files:
256
251
```
257
- dbt run --select "public_use_microdata_sample.public_use_microdata_sample_urls" \
258
- --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' \
259
- --threads 8
252
+ dbt run --select "public_use_microdata_sample.public_use_microdata_sample_urls" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' --threads 8
260
253
```
261
254
262
255
Check that the URLs appear correct:
@@ -265,15 +258,11 @@ duckdb -c "SELECT * FROM '~/data/american_community_survey/public_use_microdata_
265
258
```
266
259
2. Download and extract the archives for all 50 states' PUMS files (takes about 30 seconds on a gigabit connection):
267
260
```
268
- dbt run --select "public_use_microdata_sample.download_and_extract_archives" \
269
- --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' \
270
- --threads 8
261
+ dbt run --select "public_use_microdata_sample.download_and_extract_archives" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' --threads 8
271
262
```
272
263
Save the paths to the CSV files:
273
264
```
274
- dbt run --select "public_use_microdata_sample.public_use_microdata_sample_csv_paths" \
275
- --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' \
276
- --threads 8
265
+ dbt run --select "public_use_microdata_sample.public_use_microdata_sample_csv_paths" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' --threads 8
277
266
```
278
267
Check that the CSV files are present:
279
268
```
@@ -288,9 +277,7 @@ python scripts/parse_data_dictionary.py https://www2.census.gov/programs-surveys
288
277
289
278
Then:
290
279
```
291
- dbt run --select "public_use_microdata_sample.public_use_microdata_sample_data_dictionary_path" \
292
- --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' \
293
- --threads 8
280
+ dbt run --select "public_use_microdata_sample.public_use_microdata_sample_data_dictionary_path" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' --threads 8
294
281
```
295
282
Check that the data dictionary path is displayed correctly:
296
283
```
@@ -305,9 +292,7 @@ python scripts/generate_sql_with_enum_types_and_mapped_values_renamed.py \
305
292
```
306
293
1. Execute these generated SQL queries using 8 threads (you can adjust this number depending on the available processor cores on your system):
307
294
```
308
- dbt run --select "public_use_microdata_sample.generated.2021+" \
309
- --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' \
310
- --threads 8
295
+ dbt run --select "public_use_microdata_sample.generated.2021+" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' --threads 8
311
296
```
312
297
1. **Test** that the compressed parquet files are present and have the expected size:
313
298
```