
Commit ee3de8e

update for 2023 data
1 parent: c5aef2f


112 files changed: +271428 -22 lines


README.md (+11 -13)
@@ -121,25 +121,25 @@ For debugging, the `duckdb` command line tool is available on homebrew:
 brew install duckdb
 ```

-## Usage for 2022 ACS Public Use Microdata Sample (PUMS) Data
+## Usage for 2023 ACS Public Use Microdata Sample (PUMS) Data

 To retrieve the list of URLs from the Census Bureau's server and download and extract the archives for all of the 50 states' PUMS files, run the following:

 ```
 cd data_processing
-dbt run --select "public_use_microdata_sample.list_urls" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}'
+dbt run --select "public_use_microdata_sample.list_urls" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2023/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2023.csv", "output_path": "~/data/american_community_survey"}'
 ```

 Then save the URLs:

 ```
-dbt run --select "public_use_microdata_sample.urls" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' --threads 8
+dbt run --select "public_use_microdata_sample.urls" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2023/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2023.csv", "output_path": "~/data/american_community_survey"}' --threads 8
 ```

 Then execute the dbt model for downloading and extract the archives of the microdata (takes ~2min on a Macbook):

 ```
-dbt run --select "public_use_microdata_sample.download_and_extract_archives" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2022/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.csv", "output_path": "~/data/american_community_survey"}' --threads 8
+dbt run --select "public_use_microdata_sample.download_and_extract_archives" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2023/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2023.csv", "output_path": "~/data/american_community_survey"}' --threads 8
 ```

 Then generate the CSV paths:
@@ -259,26 +259,26 @@ duckdb -c "SELECT * FROM '~/data/american_community_survey/public_use_microdata_
 ```
 2. Download and extract the archives for all of the 50 states' PUMS files (takes about 30 seconds on a gigabit connection):
 ```
-dbt run --select "public_use_microdata_sample.download_and_extract_archives" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' --threads 8
+dbt run --select "public_use_microdata_sample.download_and_extract_archives" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2023/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2023.csv", "output_path": "~/data/american_community_survey"}' --threads 8
 ```
 Save the paths to the CSV files:
 ```
-dbt run --select "public_use_microdata_sample.public_use_microdata_sample_csv_paths" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' --threads 8
+dbt run --select "public_use_microdata_sample.csv_paths" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2023/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2023.csv", "output_path": "~/data/american_community_survey"}' --threads 8
 ```
 Check that the CSV files are present:
 ```
-duckdb -c "SELECT * FROM '~/data/american_community_survey/public_use_microdata_sample_csv_paths.parquet'"
+duckdb -c "SELECT * FROM '~/data/american_community_survey/csv_paths.parquet'"
 ```

 2. Parse the data dictionary:

 ```bash
-python scripts/parse_data_dictionary.py https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.csv
+python scripts/parse_data_dictionary.py https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2023.csv
 ```

 Then:
 ```
-dbt run --select "public_use_microdata_sample.public_use_microdata_sample_data_dictionary_path" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' --threads 8
+dbt run --select "public_use_microdata_sample.data_dictionary_path" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2023/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2023.csv", "output_path": "~/data/american_community_survey"}' --threads 8
 ```
 Check that the data dictionary path is displayed correctly:
 ```
@@ -287,13 +287,11 @@ duckdb -c "SELECT * FROM '~/data/american_community_survey/public_use_microdata_

 1. Generate the SQL commands needed to map every state's individual people or housing unit variables to the easier to use (and read) names:
 ```
-python scripts/generate_sql_with_enum_types_and_mapped_values_renamed.py \
-  ~/data/american_community_survey/public_use_microdata_sample_csv_paths.parquet \
-  ~/data/american_community_survey/PUMS_Data_Dictionary_2021.json
+python scripts/generate_sql_with_enum_types_and_mapped_values_renamed.py ~/data/american_community_survey/csv_paths.parquet PUMS_Data_Dictionary_2023.json
 ```
 1. Execute these generated SQL queries using 8 threads (you can adjust this number to be higher depending on the available processor cores on your system):
 ```
-dbt run --select "public_use_microdata_sample.generated.2021+" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' --threads 8
+dbt run --select "public_use_microdata_sample.generated.2023+" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2023/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2023.csv", "output_path": "~/data/american_community_survey"}' --threads 8
 ```
 1. **Test** that the compressed parquet files are present and have the expected size:
 ```
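
The README steps above drive everything from the shell. As a quick sanity check after the download and extraction step, a short Python snippet like the one below can confirm that the extracted CSV paths were recorded. This is a minimal sketch, assuming the `duckdb` Python package is installed and that `csv_paths.parquet` sits at the path used throughout the README.

```python
import os
import duckdb

# Location written by the download_and_extract_archives model (see the diff below)
csv_paths_parquet = os.path.expanduser(
    "~/data/american_community_survey/csv_paths.parquet"
)

# Count the recorded CSV files and peek at a few of their paths
con = duckdb.connect()
n_files = con.sql(f"SELECT count(*) FROM '{csv_paths_parquet}'").fetchone()[0]
print(f"{n_files} extracted CSV files recorded")
print(con.sql(f"SELECT csv_path FROM '{csv_paths_parquet}' LIMIT 5"))
```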

data_processing/models/public_use_microdata_sample/download_and_extract_archives.py (+8 -5)
@@ -25,11 +25,12 @@ def model(dbt, session):
     base_url = dbt.config.get('public_use_microdata_sample_url') # Assuming this is correctly set

     # Fetch URLs from your table or view
-    query = "SELECT * FROM list_urls "
-    result = session.execute(query).fetchall()
-    columns = [desc[0] for desc in session.description]
-    url_df = pd.DataFrame(result, columns=columns)
-
+    # query = "SELECT * FROM ref('list_urls')"
+    # result = session.execute(query).fetchall()
+    # columns = [desc[0] for desc in session.description]
+    # url_df = pd.DataFrame(result, columns=columns)
+    # load from parquet file in ~/data/american_community_survey/urls.parquet
+    url_df = pd.read_parquet('~/data/american_community_survey/urls.parquet')
     # Determine the base directory for data storage
     base_path = os.path.expanduser(dbt.config.get('output_path'))
     base_dir = os.path.join(base_path, f'{base_url.rstrip("/").split("/")[-2]}/{base_url.rstrip("/").split("/")[-1]}')
@@ -50,4 +51,6 @@ def model(dbt, session):
     paths_df = pd.DataFrame(extracted_files, columns=['csv_path'])

     # Return the DataFrame with paths to the extracted CSV files
+    # save the paths to parquet file
+    paths_df.to_parquet('~/data/american_community_survey/csv_paths.parquet', index=False)
     return paths_df
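
Taken together, the change replaces the in-session query against `list_urls` with a direct read of the `urls.parquet` file and persists the extracted paths to `csv_paths.parquet` so the README's later steps can find them. As a rough illustration only, not the repository's actual file, a self-contained sketch of the revised model flow might look like the following; the `requests`/`zipfile` download-and-extract loop and the `url` column name are assumptions.

```python
import os
import zipfile

import pandas as pd
import requests


def model(dbt, session):
    base_url = dbt.config.get("public_use_microdata_sample_url")

    # Read the URL list produced by the upstream urls model from its parquet file
    urls_parquet = os.path.expanduser("~/data/american_community_survey/urls.parquet")
    url_df = pd.read_parquet(urls_parquet)

    # Mirror the Census directory layout (e.g. pums/2023/1-Year) under output_path
    base_path = os.path.expanduser(dbt.config.get("output_path"))
    base_dir = os.path.join(
        base_path,
        f'{base_url.rstrip("/").split("/")[-2]}/{base_url.rstrip("/").split("/")[-1]}',
    )
    os.makedirs(base_dir, exist_ok=True)

    extracted_files = []
    for url in url_df["url"]:  # column name assumed; check the urls model's schema
        archive_path = os.path.join(base_dir, os.path.basename(url))
        # Download the archive for one state
        with requests.get(url, stream=True, timeout=60) as response:
            response.raise_for_status()
            with open(archive_path, "wb") as f:
                for chunk in response.iter_content(chunk_size=1 << 20):
                    f.write(chunk)
        # Extract it and record the CSV members
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(base_dir)
            extracted_files += [
                os.path.join(base_dir, name)
                for name in archive.namelist()
                if name.endswith(".csv")
            ]

    paths_df = pd.DataFrame(extracted_files, columns=["csv_path"])

    # Persist the paths so downstream steps can read them without the dbt session
    csv_paths_parquet = os.path.expanduser(
        "~/data/american_community_survey/csv_paths.parquet"
    )
    paths_df.to_parquet(csv_paths_parquet, index=False)
    return paths_df
```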
