Artifact for "Extracting Fix Ingredients using Language Models"

The following describes our data processing pipeline for the paper "Extracting Fix Ingredients using Language Models".

Getting the Data (Mining)

As a first step, full file-level bug context is obtained for TSSB-3M and Defects4J. For TSSB-3M, this involves downloading a large number of files (roughly 1M).

TSSB-3M

Our system uses the GitHub API to fetch only the files changed by each fix. The original dataset can be downloaded from Zenodo (https://zenodo.org/records/5845439). Unfortunately, TSSB-3M contains duplicates. While our scripts filter them out, it is better to remove them before mining (e.g., with df.drop_duplicates(subset=['commit_sha'], inplace=True)). Data is loaded into a PostgreSQL database whose schema is defined in github_scraper.rb.
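For contexts where pandas is not at hand, the deduplication step above can be mimicked with the standard library alone. This is an illustrative sketch, not part of the artifact; it assumes JSON records carrying the commit_sha field mentioned in the pandas one-liner.

```python
def dedup_commits(records):
    """Drop records whose commit_sha was already seen (keeps the first
    occurrence), mirroring df.drop_duplicates(subset=['commit_sha'])."""
    seen = set()
    unique = []
    for rec in records:
        sha = rec["commit_sha"]
        if sha not in seen:
            seen.add(sha)
            unique.append(rec)
    return unique

# Toy example with one duplicated SHA
records = [
    {"commit_sha": "a1", "project": "p1"},
    {"commit_sha": "a1", "project": "p1"},
    {"commit_sha": "b2", "project": "p2"},
]
print(len(dedup_commits(records)))  # 2
```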

For a small subset we scrape full repositories (see scrape_tssb_project_level.rb).

Defects4J

We use the Defects4J framework (https://github.com/rjust/defects4j) to checkout buggy and fixed versions as well as other useful metadata (see checkout_d4j.rb).
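A Defects4J checkout takes a project id, a bug id plus a buggy/fixed suffix, and a working directory. The helper below only builds the command line (the project, bug id, and path are placeholder values); running it requires a local Defects4J installation.

```python
def d4j_checkout_cmd(project, bug_id, version, workdir):
    """Build a `defects4j checkout` command; version is 'b' (buggy)
    or 'f' (fixed). Returns the argv list without executing it."""
    return ["defects4j", "checkout",
            "-p", project,
            "-v", f"{bug_id}{version}",
            "-w", workdir]

cmd = d4j_checkout_cmd("Lang", 1, "b", "/tmp/Lang_1_buggy")
# execute with: subprocess.run(cmd, check=True)
```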

Scripts (in mining/)

The file github_scraper.rb contains the code for our GitHub scraper. This scraper saves data into a PostgreSQL database.

scrape_tssb.rb scrapes the TSSB-3M dataset (from the original TSSB-3M JSON files). Note that TSSB-3M contains duplicates.

gen_split.rb splits the data into training, validation, and test sets. Various other scripts require a "split file"; this script was used to generate it.
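One common way to build such a split file is to assign each commit to a bucket deterministically, so reruns produce the same split. The sketch below is an assumption about the approach (the actual gen_split.rb may differ, and the 80/10/10 ratios are illustrative).

```python
import hashlib

def assign_split(commit_sha, train=0.8, valid=0.1):
    """Deterministically map a commit SHA to train/valid/test by hashing,
    so the assignment is stable across runs and machines."""
    bucket = int(hashlib.sha256(commit_sha.encode()).hexdigest(), 16) % 1000 / 1000
    if bucket < train:
        return "train"
    if bucket < train + valid:
        return "valid"
    return "test"
```

Because the bucket depends only on the SHA, adding or removing other commits never moves an existing commit between splits.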

scrape_tssb_project_level.rb clones a subset of repositories with full project-level context. First, a subset of SHAs must be extracted using the extract_project_url_and_sha.py script.

Finally, checkout_d4j.rb was used to "mine" Defects4J.

Generating Diffs (in diffing/)

The gen_dataset.rb script loads files from the database and outputs files with special diff tags (e.g., <CHANGES> and <CHANGEE> indicating the starting and ending points of a change). These tags are used 1) as bug location markers by the models and 2) by postprocessing scripts (e.g., for ingredient extraction).
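Conceptually, the tagging step wraps the changed region of a file in the two markers. This minimal sketch assumes line-level markers placed on their own lines; the artifact's actual placement may differ.

```python
def tag_change(lines, start, end):
    """Insert <CHANGES>/<CHANGEE> markers around lines[start:end]
    (0-based, end-exclusive), marking the bug location."""
    return (lines[:start] + ["<CHANGES>"]
            + lines[start:end] + ["<CHANGEE>"]
            + lines[end:])

src = ["def f(x):", "    return x + 1", "    # done"]
tagged = tag_change(src, 1, 2)  # mark the second line as changed
```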

Ingredient extraction (in postprocessing/)

Ground-truth ingredients

The ingredients_java.py and ingredients_python.py scripts extract the ground-truth ingredient identifiers from the "fixed version" of each bug.
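For Python code, identifiers can be collected from the fixed version's syntax tree. The following is a simplified sketch of the idea using the standard-library ast module, not the artifact's actual extraction logic.

```python
import ast

def fixed_version_identifiers(source):
    """Collect identifier names (variables and attribute names)
    appearing in a Python source fragment."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            names.add(node.id)
        elif isinstance(node, ast.Attribute):
            names.add(node.attr)
    return names

fix = "result = parser.parse(text)"
print(sorted(fixed_version_identifiers(fix)))  # ['parse', 'parser', 'result', 'text']
```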

extract_ingr_freqs.py is used for the frequency analysis of ingredient identifiers.

Scanner ingredients

The script ingredients_scanner.py adds ingredients extracted by the scanner (i.e., the predictions of a trained scanner model; see below) and generates training data annotated with <INGRE> tags, which is then used to train a scanner-supported repair model.
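The annotation step can be pictured as marking the scanner's predicted identifiers in the code. The exact tag placement used by ingredients_scanner.py is not specified here; this sketch assumes each predicted ingredient occurrence is prefixed with an <INGRE> marker.

```python
import re

def annotate_ingredients(code, ingredients):
    """Prefix occurrences of scanner-predicted ingredient identifiers
    with an <INGRE> marker (tag placement is an assumption)."""
    alternatives = sorted(ingredients, key=len, reverse=True)  # longest match first
    pattern = r"\b(" + "|".join(map(re.escape, alternatives)) + r")\b"
    return re.sub(pattern, r"<INGRE> \1", code)

print(annotate_ingredients("total = accumulate(values)", {"accumulate"}))
# total = <INGRE> accumulate(values)
```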

Generating snippets (in scanner/)

The scanner is trained on file snippets. The scripts gen_scan_dataset_java.py and gen_scan_dataset_python.py split file-level context into a pre-tokenized snippet dataset that can be used to train the scanner. These scripts also generate the token labels (i.e., they output a label for each token).
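The snippet generation can be sketched as sliding a fixed-size window over the token stream and emitting a per-token binary label (ingredient or not). Window and stride sizes below are toy values, not the artifact's actual hyperparameters.

```python
def snippets_with_labels(tokens, ingredient_tokens, window=4, stride=2):
    """Split a token sequence into overlapping fixed-size snippets and
    emit one binary label per token (1 = ingredient identifier)."""
    out = []
    for start in range(0, max(1, len(tokens) - window + 1), stride):
        chunk = tokens[start:start + window]
        labels = [1 if tok in ingredient_tokens else 0 for tok in chunk]
        out.append((chunk, labels))
    return out

toks = ["def", "f", "(", "x", ")", ":", "return", "x"]
pairs = snippets_with_labels(toks, {"x"})  # three overlapping snippets
```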

Training and Prediction (in training_and_inference/)

For training we used adapted versions of the Hugging Face Transformers training scripts (train_repair_model.py and train_scanner.py). The hyperparameters we used can be found in the corresponding shell scripts.

predict_repair_model.py and predict_scanner_model.py are used to predict (i.e., generate) bug fixes and extracted ingredients, respectively. Note that a feedback loop is necessary: the scanner's predictions are used (via the script in postprocessing/) to annotate the repair dataset, this dataset is then used to train the repair model, and finally the repair model's predictions are the bug fixes.
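The ordering of that loop can be made explicit with a stub pipeline. All function bodies below are placeholders (not the artifact's real entry points); only the sequencing matters.

```python
# Stub pipeline illustrating the scan -> annotate -> train -> fix ordering.
log = []

def predict_scanner(files):
    log.append("scan")
    return {f: ["some_identifier"] for f in files}  # predicted ingredients per file

def annotate_repair_dataset(files, ingredients):
    log.append("annotate")
    return [f + " <INGRE>" for f in files]  # repair dataset with <INGRE> tags

def train_repair_model(dataset):
    log.append("train")

def predict_repair_model():
    log.append("fix")  # final bug-fix predictions

files = ["Bug.java"]
ingredients = predict_scanner(files)
dataset = annotate_repair_dataset(files, ingredients)
train_repair_model(dataset)
predict_repair_model()
```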

train_repair_model_large_context.sh contains the hyperparameters for our large-context baseline.

Manually collected or annotated data (in manual/)

Some of our analyses are based on a manual evaluation. Since this data is the result of manual work, there is a chance of errors (we hope none were made).

d4j_performance.csv contains performance data for various APR tools, collected from the corresponding papers, repositories, artifacts, etc. It simply lists the Defects4J bugs correctly fixed by each tool.

manual_ingredient_check.csv contains the manual classification/analysis of ingredients mentioned in the paper.
