Conversation

@vishalworkdatacommon
Contributor

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @vishalworkdatacommon, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates a new data import for national-level Full-Time Equivalent (FTE) enrollment figures from the Integrated Postsecondary Education Data System (IPEDS). The changes establish a robust, semi-automated pipeline covering the entire data lifecycle: downloading raw data from a GCP bucket, preprocessing it to remove descriptive rows, and transforming it into a structured format suitable for ingestion. The new import is configured for regular, automated refreshes to keep the data fresh.

Highlights

  • New Data Import: Introduces a new import for national-level Full-Time Equivalent (FTE) enrollment data from the Integrated Postsecondary Education Data System (IPEDS).
  • Semi-Automated Pipeline: Establishes a semi-automated pipeline for data acquisition, preprocessing, and transformation, including scripts for downloading raw data from GCP and cleaning it.
  • Comprehensive Documentation: Provides a detailed README.md outlining the import process, data sources, preprocessing steps, autorefresh mechanism, and manual execution instructions.
  • Structured Configuration: Includes manifest.json to define the import specifications, processing scripts, input/output files, and a bi-monthly cron schedule for automated refreshes.
  • Extensive Data Mappings: Adds multiple property-value mapping (pvmap) files to accurately map raw data fields to Data Commons properties and values for various educational attributes like institution control, urbanization, degree levels, student levels, and state-specific data.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds a new data import for IPEDS FTE Enrollment data, including configuration files, property-value mappings, a download script, and documentation. My review has found a few issues. Most critically, the preprocess.py script is referenced in the manifest but is missing from the PR, which will cause the import to fail. There's also a typo in manifest.json that will lead to a file not found error. Additionally, there are some inconsistencies in the pvmap files and the README.md that should be addressed for correctness and clarity.


* **Steps**:
1. Ensure raw data files are in the specified GCP bucket.
2. Execute `run.sh` to fetch the raw data files into `input_files/` and then preprocess them using `preprocess.py`.
Contributor


Severity: medium

This documentation is slightly misleading. It states that run.sh fetches and preprocesses the data. However, the run.sh script only fetches the data. The preprocess.py script is executed as a separate, subsequent step by the import framework (as defined in manifest.json). Please update the documentation to accurately describe the execution flow to avoid confusion for future maintainers.

Comment on lines +53 to +61
City: Large,value,{Number},sdg_urbanization,CityLarge,,,,,,,,,,
City: Midsize,value,{Number},sdg_urbanization,CityMidsize,,,,,,,,,,
City: Small,value,{Number},sdg_urbanization,CitySmall,,,,,,,,,,
Suburb: Large,value,{Number},sdg_urbanization,SuburbLarge,,,,,,,,,,
Suburb: Midsize,value,{Number},sdg_urbanization,SuburbMidsize,,,,,,,,,,
Suburb: Small,value,{Number},sdg_urbanization,SuburbSmall,,,,,,,,,,
Town: Fringe,value,{Number},sdg_urbanization,TownFringe,,,,,,,,,,
Town: Distant,value,{Number},sdg_urbanization,TownDistant,,,,,,,,,,
Town: Remote,value,{Number},sdg_urbanization,TownRemote,,,,,,,,,,
Contributor

Severity: medium

The property sdg_urbanization is used here. However, degreeOfUrbanization_pvmap.csv uses schoolLocationType for similar values. This inconsistency can lead to fragmented data. Please use schoolLocationType for consistency across the import.

City: Large,value,{Number},schoolLocationType,CityLarge,,,,,,,,,,
City: Midsize,value,{Number},schoolLocationType,CityMidsize,,,,,,,,,,
City: Small,value,{Number},schoolLocationType,CitySmall,,,,,,,,,,
Suburb: Large,value,{Number},schoolLocationType,SuburbLarge,,,,,,,,,,
Suburb: Midsize,value,{Number},schoolLocationType,SuburbMidsize,,,,,,,,,,
Suburb: Small,value,{Number},schoolLocationType,SuburbSmall,,,,,,,,,,
Town: Fringe,value,{Number},schoolLocationType,TownFringe,,,,,,,,,,
Town: Distant,value,{Number},schoolLocationType,TownDistant,,,,,,,,,,
Town: Remote,value,{Number},schoolLocationType,TownRemote,,,,,,,,,,

Doctor's degree - research/scholarship and professional practice,value,{Number},educationalAttainment,DoctorateDegree,doctorateDegreeMajor,ResearchScholarshipAndProfessionalPractice,,,,,,,,
Doctor's degree - research/scholarship,value,{Number},educationalAttainment,DoctorateDegree,doctorateDegreeMajor,ResearchScholarship,,,,,,,,
Doctor's degree - professional practice,value,{Number},educationalAttainment,DoctorateDegree,doctorateDegreeMajor,ProfessionalPractice,,,,,,,,
Doctor's degree - other,value,{Number},educationalAttainment,DoctorateDegree,doctorateDegreeMajor,Other,,,,,,,,
Contributor

Severity: medium

The value for doctorateDegreeMajor is Other, but in highestDegreeOffered_pvmap.csv it is OtherMajor for a similar entry. This inconsistency can lead to data fragmentation. Please use OtherMajor for consistency, assuming it is the correct enum value.

Doctor's degree - other,value,{Number},educationalAttainment,DoctorateDegree,doctorateDegreeMajor,OtherMajor,,,,,,,,

Doctor's degree - research/scholarship and professional practice,value,{Number},educationalAttainment,DoctorateDegree,doctorateDegreeMajor,ResearchScholarshipAndProfessionalPractice,,,,,,,,
Doctor's degree - research/scholarship,value,{Number},educationalAttainment,DoctorateDegree,doctorateDegreeMajor,ResearchScholarship,,,,,,,,
Doctor's degree - professional practice,value,{Number},educationalAttainment,DoctorateDegree,doctorateDegreeMajor,ProfessionalPractice,,,,,,,,
Doctor's degree - other,value,{Number},educationalAttainment,DoctorateDegree,doctorateDegreeMajor,OtherMajor,,,,,,,,
Contributor

Severity: medium

The value for doctorateDegreeMajor is OtherMajor. This seems more specific and correct than Other used in controlOfInstitution_pvmap.csv. Please ensure this is the intended value and apply it consistently.

cleaned_content = lines[start_index:]
with open(file_path, 'w') as f:
    f.writelines(cleaned_content)
print(f"Cleaned {file_path} successfully, removed {start_index} initial rows.")
Contributor

Please replace the print statements with logging.info from absl logging.
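A minimal sketch of the suggested change (the function name is hypothetical; the snippet falls back to the stdlib logger when absl is not installed):

```python
# Sketch of the suggested fix: use absl logging instead of print().
# Falls back to the standard-library logger if absl is unavailable.
try:
    from absl import logging
except ImportError:
    import logging


def report_cleaned(file_path, start_index):
    """Log (rather than print) how many descriptive rows were removed."""
    message = f"Cleaned {file_path} successfully, removed {start_index} initial rows."
    logging.info(message)
    return message
```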

Contributor Author

addressed.

@SandeepTuniki SandeepTuniki self-requested a review December 31, 2025 05:25
Comment on lines +25 to +63
{
"template_mcf": "output/ControlOfInstitution_output.tmcf",
"cleaned_csv": "output/ControlOfInstitution_output.csv"
},
{
"template_mcf": "output/DegreeOfUrbanization_output.tmcf",
"cleaned_csv": "output/DegreeOfUrbanization_output.csv"
},
{
"template_mcf": "output/HighestDegreeOffered_output.tmcf",
"cleaned_csv": "output/HighestDegreeOffered_output.csv"
},
{
"template_mcf": "output/HistoricallyBlackCollegeOrUniversity_output.tmcf",
"cleaned_csv": "output/HistoricallyBlackCollegeOrUniversity_output.csv"
},
{
"template_mcf": "output/InstitutionalCategory_output.tmcf",
"cleaned_csv": "output/InstitutionalCategory_output.csv"
},
{
"template_mcf": "output/LevelOfInstitution_output.tmcf",
"cleaned_csv": "output/LevelOfInstitution_output.csv"
},
{
"template_mcf": "output/LevelOfStudent_output.tmcf",
"cleaned_csv": "output/LevelOfStudent_output.csv"
},
{
"template_mcf": "output/SectorOfInstitution_output.tmcf",
"cleaned_csv": "output/SectorOfInstitution_output.csv"
},
{
"template_mcf": "output/State_output.tmcf",
"cleaned_csv": "output/State_output.csv"
},
{
"template_mcf": "output/TotalStudents_output.tmcf",
"cleaned_csv": "output/TotalStudents_output.csv"
Contributor

Similar to source_files, how about we use a wildcard pattern for import_inputs as well to keep it simple? (ex: template_mcf: output/*_output.tmcf)
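Assuming the import framework expands glob patterns in import_inputs the same way it does for source_files (worth verifying before adopting), the block above could collapse to something like:

```json
"import_inputs": [
    {
        "template_mcf": "output/*_output.tmcf",
        "cleaned_csv": "output/*_output.csv"
    }
]
```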

Contributor

Can you add the LICENSE text that we usually add for all source files?

# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#    https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

@@ -0,0 +1,36 @@
import os
Contributor

Can you add a top-level description to this file, matching what you already documented in the README?

"""
Cleans the raw CSV files by removing initial descriptive rows.
"""


mkdir -p "input_files"

gsutil -m cp -r gs://unresolved_mcf/IPEDS/Enrollment_FTE_National/input_files/*.csv "$SCRIPT_PATH/input_files" No newline at end of file
Contributor

Please add a new line to the end of this file.

Contributor

Similar to previous comment, please add the license text to the top of this file.

@@ -0,0 +1,26 @@
key,prop,val,,,,,,,,,,,,
Year,measuredProperty,count,populationType,Student,collegeOrGraduateSchoolEnrollment,EnrolledInCollegeOrGraduateSchool,institutionType,PostSecondaryInstitution,enrollmentStatus,FullTimeEquivalent,observationAbout,country/USA,observationPeriod,1Y
2023-24,observationDate,2024,,,,,,,,,,,,
Contributor

Quick clarification: for a period like 2023-24, is the general convention to pick the latter year (e.g. 2024) instead of the earlier year (e.g. 2023)?

if start_index != -1:
    cleaned_content = lines[start_index:]
    with open(file_path, 'w') as f:
        f.writelines(cleaned_content)
Contributor

We generally avoid modifying the original source files so that we can trace any issues back easily. Can we write the content to new files and have statvar processor work on them instead?
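The suggestion above could look roughly like this (hypothetical helper and a hypothetical *_cleaned naming convention; the wiring into the statvar processor is not shown):

```python
import os


def clean_to_new_file(file_path, start_index):
    """Write rows from start_index onward to a new *_cleaned file,
    leaving the original download untouched for traceability."""
    with open(file_path) as f:
        lines = f.readlines()
    root, ext = os.path.splitext(file_path)
    cleaned_path = f"{root}_cleaned{ext}"  # hypothetical naming convention
    with open(cleaned_path, 'w') as f:
        f.writelines(lines[start_index:])
    return cleaned_path
```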

@@ -0,0 +1,36 @@
import os
import pandas as pd
Contributor

I don't see any usage of pandas in this file. Can we remove it from the list of imports then?
