Conversation

@vishalworkdatacommon
Contributor

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @vishalworkdatacommon, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates a new data import for national-level Full-Time Equivalent (FTE) enrollment figures from the Integrated Postsecondary Education Data System (IPEDS). The changes establish a robust, semi-automated pipeline covering the entire data lifecycle: downloading raw data from a GCP bucket, preprocessing it to remove descriptive rows, and transforming it into a structured format suitable for ingestion. The new import is configured for regular, automated refreshes to keep the data fresh.

Highlights

  • New Data Import: Introduces a new import for national-level Full-Time Equivalent (FTE) enrollment data from the Integrated Postsecondary Education Data System (IPEDS).
  • Semi-Automated Pipeline: Establishes a semi-automated pipeline for data acquisition, preprocessing, and transformation, including scripts for downloading raw data from GCP and cleaning it.
  • Comprehensive Documentation: Provides a detailed README.md outlining the import process, data sources, preprocessing steps, autorefresh mechanism, and manual execution instructions.
  • Structured Configuration: Includes manifest.json to define the import specifications, processing scripts, input/output files, and a bi-monthly cron schedule for automated refreshes.
  • Extensive Data Mappings: Adds multiple property-value mapping (pvmap) files to accurately map raw data fields to Data Commons properties and values for various educational attributes like institution control, urbanization, degree levels, student levels, and state-specific data.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds a new data import for IPEDS FTE Enrollment data, including configuration files, property-value mappings, a download script, and documentation. My review has found a few issues. Most critically, the preprocess.py script is referenced in the manifest but is missing from the PR, which will cause the import to fail. There's also a typo in manifest.json that will lead to a file not found error. Additionally, there are some inconsistencies in the pvmap files and the README.md that should be addressed for correctness and clarity.


* **Steps**:
1. Ensure raw data files are in the specified GCP bucket.
2. Execute `run.sh` to fetch the raw data files into `input_files/` and then preprocess them using `preprocess.py`.
Contributor


Severity: medium

This documentation is slightly misleading. It states that run.sh fetches and preprocesses the data. However, the run.sh script only fetches the data. The preprocess.py script is executed as a separate, subsequent step by the import framework (as defined in manifest.json). Please update the documentation to accurately describe the execution flow to avoid confusion for future maintainers.

Comment on lines +53 to +61
City: Large,value,{Number},sdg_urbanization,CityLarge,,,,,,,,,,
City: Midsize,value,{Number},sdg_urbanization,CityMidsize,,,,,,,,,,
City: Small,value,{Number},sdg_urbanization,CitySmall,,,,,,,,,,
Suburb: Large,value,{Number},sdg_urbanization,SuburbLarge,,,,,,,,,,
Suburb: Midsize,value,{Number},sdg_urbanization,SuburbMidsize,,,,,,,,,,
Suburb: Small,value,{Number},sdg_urbanization,SuburbSmall,,,,,,,,,,
Town: Fringe,value,{Number},sdg_urbanization,TownFringe,,,,,,,,,,
Town: Distant,value,{Number},sdg_urbanization,TownDistant,,,,,,,,,,
Town: Remote,value,{Number},sdg_urbanization,TownRemote,,,,,,,,,,
Contributor

Severity: medium

The property sdg_urbanization is used here. However, degreeOfUrbanization_pvmap.csv uses schoolLocationType for similar values. This inconsistency can lead to fragmented data. Please use schoolLocationType for consistency across the import.

City: Large,value,{Number},schoolLocationType,CityLarge,,,,,,,,,,
City: Midsize,value,{Number},schoolLocationType,CityMidsize,,,,,,,,,,
City: Small,value,{Number},schoolLocationType,CitySmall,,,,,,,,,,
Suburb: Large,value,{Number},schoolLocationType,SuburbLarge,,,,,,,,,,
Suburb: Midsize,value,{Number},schoolLocationType,SuburbMidsize,,,,,,,,,,
Suburb: Small,value,{Number},schoolLocationType,SuburbSmall,,,,,,,,,,
Town: Fringe,value,{Number},schoolLocationType,TownFringe,,,,,,,,,,
Town: Distant,value,{Number},schoolLocationType,TownDistant,,,,,,,,,,
Town: Remote,value,{Number},schoolLocationType,TownRemote,,,,,,,,,,

Doctor's degree - research/scholarship and professional practice,value,{Number},educationalAttainment,DoctorateDegree,doctorateDegreeMajor,ResearchScholarshipAndProfessionalPractice,,,,,,,,
Doctor's degree - research/scholarship,value,{Number},educationalAttainment,DoctorateDegree,doctorateDegreeMajor,ResearchScholarship,,,,,,,,
Doctor's degree - professional practice,value,{Number},educationalAttainment,DoctorateDegree,doctorateDegreeMajor,ProfessionalPractice,,,,,,,,
Doctor's degree - other,value,{Number},educationalAttainment,DoctorateDegree,doctorateDegreeMajor,Other,,,,,,,,
Contributor

Severity: medium

The value for doctorateDegreeMajor is Other, but in highestDegreeOffered_pvmap.csv it is OtherMajor for a similar entry. This inconsistency can lead to data fragmentation. Please use OtherMajor for consistency, assuming it is the correct enum value.

Doctor's degree - other,value,{Number},educationalAttainment,DoctorateDegree,doctorateDegreeMajor,OtherMajor,,,,,,,,

Doctor's degree - research/scholarship and professional practice,value,{Number},educationalAttainment,DoctorateDegree,doctorateDegreeMajor,ResearchScholarshipAndProfessionalPractice,,,,,,,,
Doctor's degree - research/scholarship,value,{Number},educationalAttainment,DoctorateDegree,doctorateDegreeMajor,ResearchScholarship,,,,,,,,
Doctor's degree - professional practice,value,{Number},educationalAttainment,DoctorateDegree,doctorateDegreeMajor,ProfessionalPractice,,,,,,,,
Doctor's degree - other,value,{Number},educationalAttainment,DoctorateDegree,doctorateDegreeMajor,OtherMajor,,,,,,,,
Contributor

Severity: medium

The value for doctorateDegreeMajor is OtherMajor. This seems more specific and correct than Other used in controlOfInstitution_pvmap.csv. Please ensure this is the intended value and apply it consistently.

cleaned_content = lines[start_index:]
with open(file_path, 'w') as f:
    f.writelines(cleaned_content)
print(f"Cleaned {file_path} successfully, removed {start_index} initial rows.")
Contributor

Please replace the print statements with logging.info from absl logging.
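A minimal sketch of the suggested change (the function name is hypothetical; the snippet falls back to the stdlib logger when absl is not installed):

```python
# Sketch of the suggested fix: use absl logging instead of print().
# Falls back to the standard-library logger if absl is unavailable.
try:
    from absl import logging
except ImportError:
    import logging


def report_cleaned(file_path, start_index):
    """Log (rather than print) how many descriptive rows were removed."""
    message = f"Cleaned {file_path} successfully, removed {start_index} initial rows."
    logging.info(message)
    return message
```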

Contributor Author

addressed.

@SandeepTuniki SandeepTuniki self-requested a review December 31, 2025 05:25
Comment on lines +25 to +63
{
"template_mcf": "output/ControlOfInstitution_output.tmcf",
"cleaned_csv": "output/ControlOfInstitution_output.csv"
},
{
"template_mcf": "output/DegreeOfUrbanization_output.tmcf",
"cleaned_csv": "output/DegreeOfUrbanization_output.csv"
},
{
"template_mcf": "output/HighestDegreeOffered_output.tmcf",
"cleaned_csv": "output/HighestDegreeOffered_output.csv"
},
{
"template_mcf": "output/HistoricallyBlackCollegeOrUniversity_output.tmcf",
"cleaned_csv": "output/HistoricallyBlackCollegeOrUniversity_output.csv"
},
{
"template_mcf": "output/InstitutionalCategory_output.tmcf",
"cleaned_csv": "output/InstitutionalCategory_output.csv"
},
{
"template_mcf": "output/LevelOfInstitution_output.tmcf",
"cleaned_csv": "output/LevelOfInstitution_output.csv"
},
{
"template_mcf": "output/LevelOfStudent_output.tmcf",
"cleaned_csv": "output/LevelOfStudent_output.csv"
},
{
"template_mcf": "output/SectorOfInstitution_output.tmcf",
"cleaned_csv": "output/SectorOfInstitution_output.csv"
},
{
"template_mcf": "output/State_output.tmcf",
"cleaned_csv": "output/State_output.csv"
},
{
"template_mcf": "output/TotalStudents_output.tmcf",
"cleaned_csv": "output/TotalStudents_output.csv"
Contributor

Similar to source_files, how about we use a wildcard pattern for import_inputs as well to keep it simple? (ex: template_mcf: output/*_output.tmcf)
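Assuming the import framework expands glob patterns in import_inputs the same way it does for source_files (worth verifying before adopting), the block above could collapse to something like:

```json
"import_inputs": [
    {
        "template_mcf": "output/*_output.tmcf",
        "cleaned_csv": "output/*_output.csv"
    }
]
```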

Contributor

Can you add the LICENSE text that we usually add for all source files?

# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#    https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

@@ -0,0 +1,36 @@
import os
Contributor

Can you add a top-level description to this file, matching what you already documented in the README?

"""
Cleans the raw CSV files by removing initial descriptive rows.
"""


mkdir -p "input_files"

gsutil -m cp -r gs://unresolved_mcf/IPEDS/Enrollment_FTE_National/input_files/*.csv "$SCRIPT_PATH/input_files" No newline at end of file
Contributor

Please add a new line to the end of this file.

Contributor

Similar to previous comment, please add the license text to the top of this file.

@@ -0,0 +1,26 @@
key,prop,val,,,,,,,,,,,,
Year,measuredProperty,count,populationType,Student,collegeOrGraduateSchoolEnrollment,EnrolledInCollegeOrGraduateSchool,institutionType,PostSecondaryInstitution,enrollmentStatus,FullTimeEquivalent,observationAbout,country/USA,observationPeriod,1Y
2023-24,observationDate,2024,,,,,,,,,,,,
Contributor

Quick clarification: for a period like 2023-24, is the general convention to pick the latter year (e.g. 2024) instead of the earlier year (e.g. 2023)?

if start_index != -1:
    cleaned_content = lines[start_index:]
    with open(file_path, 'w') as f:
        f.writelines(cleaned_content)
Contributor

We generally avoid modifying the original source files so that we can trace any issues back easily. Can we write the content to new files and have statvar processor work on them instead?
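The suggestion above could look roughly like this (hypothetical helper and a hypothetical *_cleaned naming convention; the wiring into the statvar processor is not shown):

```python
import os


def clean_to_new_file(file_path, start_index):
    """Write rows from start_index onward to a new *_cleaned file,
    leaving the original download untouched for traceability."""
    with open(file_path) as f:
        lines = f.readlines()
    root, ext = os.path.splitext(file_path)
    cleaned_path = f"{root}_cleaned{ext}"  # hypothetical naming convention
    with open(cleaned_path, 'w') as f:
        f.writelines(lines[start_index:])
    return cleaned_path
```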

@@ -0,0 +1,36 @@
import os
import pandas as pd
Contributor

I don't see any usage of pandas in this file. Can we remove it from the list of imports then?
