Skip to content

Commit 23b1136

Browse files
committed
fix: readme
1 parent 2653e9c commit 23b1136

File tree

1 file changed

+17
-32
lines changed

1 file changed

+17
-32
lines changed

README.md

+17-32
Original file line numberDiff line numberDiff line change
@@ -126,40 +126,37 @@ To retrieve the list of URLs from the Census Bureau's server and download and ex
126126

127127
```
128128
cd data_processing
129-
dbt run --select "public_use_microdata_sample.list_urls" \
130-
--vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}'
129+
dbt run --select "public_use_microdata_sample.list_urls" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}'
131130
```
132131

133132
Then save the URLs:
134133

135134
```
136-
dbt run --select "public_use_microdata_sample.urls" \
137-
--vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' \
138-
--threads 8
135+
dbt run --select "public_use_microdata_sample.urls" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' --threads 8
139136
```
140137

141138
Then execute the dbt model for downloading and extract the archives of the microdata (takes ~2min on a Macbook):
142139

143140
```
144-
dbt run --select "public_use_microdata_sample.download_and_extract_archives" \
145-
--vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2022/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.csv", "output_path": "~/data/american_community_survey"}' \
146-
--threads 8
141+
dbt run --select "public_use_microdata_sample.download_and_extract_archives" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2022/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.csv", "output_path": "~/data/american_community_survey"}' --threads 8
147142
```
148143

149144
Then generate the CSV paths:
150145

151146
```
152-
dbt run --select "public_use_microdata_sample.csv_paths" \
153-
--vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.json", "output_path": "~/data/american_community_survey"}' \
154-
--threads 8
147+
dbt run --select "public_use_microdata_sample.csv_paths" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.json", "output_path": "~/data/american_community_survey"}' --threads 8
155148
```
156149

157150
Then parse the data dictionary:
158151

152+
```bash
153+
python scripts/parse_data_dictionary.py https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.csv
154+
```
155+
156+
Then execute the dbt model for parsing the data dictionary:
157+
159158
```
160-
dbt run --select "public_use_microdata_sample.parse_data_dictionary" \
161-
--vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' \
162-
--threads 8
159+
dbt run --select "public_use_microdata_sample.parse_data_dictionary" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' --threads 8
163160
```
164161

165162
Then generate the SQL commands needed to map every state's individual people or housing unit variables to the easier to use (and read) names:
@@ -170,9 +167,7 @@ python scripts/generate_sql_with_enum_types_and_mapped_values_renamed.py ~/data/
170167

171168
Then execute these generated SQL queries using 1 thread (you can adjust this number to be higher depending on the available processor cores on your system):
172169
```
173-
dbt run --select "public_use_microdata_sample.generated+" \
174-
--vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2022/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.csv", "output_path": "~/data/american_community_survey"}' \
175-
--threads 8
170+
dbt run --select "public_use_microdata_sample.generated+" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2022/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.csv", "output_path": "~/data/american_community_survey"}' --threads 8
176171
```
177172

178173
Inspect the output folder to see what has been created in the `output_path` specified in the previous command:
@@ -254,9 +249,7 @@ D SELECT COUNT(*) FROM '~/data/american_community_survey/housing_units_*[!united
254249

255250
1. Use dbt to save the database of the the 2021 1-Year ACS PUMS data and data dictionary from the Census Bureau's server and extract the archives for all of the 50 states' PUMS files:
256251
```
257-
dbt run --select "public_use_microdata_sample.public_use_microdata_sample_urls" \
258-
--vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' \
259-
--threads 8
252+
dbt run --select "public_use_microdata_sample.public_use_microdata_sample_urls" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' --threads 8
260253
```
261254

262255
Check that the URLs appear correct:
@@ -265,15 +258,11 @@ duckdb -c "SELECT * FROM '~/data/american_community_survey/public_use_microdata_
265258
```
266259
2. Download and extract the archives for all of the 50 states' PUMS files (takes about 30 seconds on a gigabit connection):
267260
```
268-
dbt run --select "public_use_microdata_sample.download_and_extract_archives" \
269-
--vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' \
270-
--threads 8
261+
dbt run --select "public_use_microdata_sample.download_and_extract_archives" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' --threads 8
271262
```
272263
Save the paths to the CSV files:
273264
```
274-
dbt run --select "public_use_microdata_sample.public_use_microdata_sample_csv_paths" \
275-
--vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' \
276-
--threads 8
265+
dbt run --select "public_use_microdata_sample.public_use_microdata_sample_csv_paths" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' --threads 8
277266
```
278267
Check that the CSV files are present:
279268
```
@@ -288,9 +277,7 @@ python scripts/parse_data_dictionary.py https://www2.census.gov/programs-surveys
288277

289278
Then:
290279
```
291-
dbt run --select "public_use_microdata_sample.public_use_microdata_sample_data_dictionary_path" \
292-
--vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' \
293-
--threads 8
280+
dbt run --select "public_use_microdata_sample.public_use_microdata_sample_data_dictionary_path" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' --threads 8
294281
```
295282
Check that the data dictionary path is displayed correctly:
296283
```
@@ -305,9 +292,7 @@ python scripts/generate_sql_with_enum_types_and_mapped_values_renamed.py \
305292
```
306293
1. Execute these generated SQL queries using 8 threads (you can adjust this number to be higher depending on the available processor cores on your system):
307294
```
308-
dbt run --select "public_use_microdata_sample.generated.2021+" \
309-
--vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' \
310-
--threads 8
295+
dbt run --select "public_use_microdata_sample.generated.2021+" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' --threads 8
311296
```
312297
1. **Test** that the compressed parquet files are present and have the expected size:
313298
```

0 commit comments

Comments
 (0)