BigFeat is a scalable and interpretable automated feature engineering framework that enhances the quality of input features to maximize predictive performance with respect to a user-defined metric. It supports both classification and regression tasks, employing a dynamic feature generation and selection mechanism to construct expressive, interpretable features.

Given the original input features, BigFeat returns a collection of base and engineered features expected to improve predictive performance on the task at hand.
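At a glance, BigFeat follows a scikit-learn-style fit/transform workflow. The snippet below is a minimal sketch of that workflow (the dataset path and target column are placeholders, and `fit` hyperparameters are left at their defaults); the full example further below shows all parameters explicitly.

```python
# Minimal sketch: file path and target column are placeholders.
import pandas as pd
import bigfeat.bigfeat_base as bigfeat

df = pd.read_csv("data/my_dataset.csv")         # hypothetical dataset
X, y = df.drop(columns="target"), df["target"]  # hypothetical target column

bf = bigfeat.BigFeat(task_type='classification')
bf.fit(X, y, random_state=0)   # generate and select features
X_new = bf.transform(X)        # base + engineered feature matrix
```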
Ensure you have Python 3.8+ installed. BigFeat requires specific versions of Python packages, as listed in the `requirements.txt` file.
- Clone the Repository (if applicable):

  ```bash
  git clone https://github.com/DataSystemsGroupUT/BigFeat.git
  cd BigFeat
  ```

- Create a Virtual Environment (recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install Dependencies: Use the provided `requirements.txt` to install all required packages with their exact versions:

  ```bash
  pip install -r requirements.txt
  ```

  The `requirements.txt` includes:

  ```
  bigfeat==0.1
  joblib==1.4.2
  lightgbm==4.6.0
  numpy==2.2.5
  pandas==2.2.3
  python-dateutil==2.9.0.post0
  pytz==2025.2
  scikit-learn==1.6.1
  scipy==1.15.2
  six==1.17.0
  threadpoolctl==3.6.0
  tzdata==2025.2
  ```

- Install BigFeat: If not already installed via `requirements.txt`, install BigFeat locally from the repository root:

  ```bash
  pip install .
  ```

  Alternatively, install directly from the source directory:

  ```bash
  pip install ./BigFeat
  ```
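To verify the installation, a quick import check (this only confirms the package is importable; it does not run any feature generation):

```python
# Sanity check: confirm BigFeat is importable after installation.
import bigfeat.bigfeat_base as bigfeat
print("BigFeat imported:", bigfeat.BigFeat)
```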
BigFeat can be used to generate and select features for both classification and regression tasks. Below is an example demonstrating how to run BigFeat on test datasets.
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import f1_score, r2_score
import sklearn.preprocessing as preprocessing
import bigfeat.bigfeat_base as bigfeat


def run_tst(df_path, target_ft, random_state, task_type='classification'):
    """Load a dataset, label-encode categorical columns, and split into train/test."""
    df = pd.read_csv(df_path)
    # Encode categorical columns
    object_columns = df.select_dtypes(include='object')
    if len(object_columns.columns):
        df[object_columns.columns] = object_columns.apply(
            preprocessing.LabelEncoder().fit_transform)
    X = df.drop(columns=target_ft)
    y = df[target_ft]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state)
    return X_train, X_test, y_train, y_test


# Example datasets (replace with your dataset paths and target columns)
datasets = [
    ("data/shuttle.csv", "class", "classification"),
    ("data/blood-transfusion-service-center.csv", "Class", "classification"),
    ("data/credit-g.csv", "class", "classification"),
    ("data/kc1.csv", "defects", "classification"),
    ("data/nomao.csv", "Class", "classification"),
    ("data/eeg_eye_state.csv", "Class", "classification"),
    ("data/gina.csv", "class", "classification"),
    ("data/sonar.csv", "Class", "classification"),
    ("data/arcene.csv", "Class", "classification"),
    ("data/madelon.csv", "Class", "classification"),
    # Add regression datasets as needed
]

for dataset, target, task_type in datasets:
    print(f"\nProcessing dataset: {dataset}")
    X_train, X_test, y_train, y_test = run_tst(
        dataset, target, random_state=0, task_type=task_type)

    # Initialize BigFeat for the task ('classification' or 'regression')
    bf = bigfeat.BigFeat(task_type=task_type)

    # Fit BigFeat on the training data
    res = bf.fit(
        X_train, y_train,
        gen_size=5,
        random_state=0,
        iterations=5,
        estimator='avg',
        feat_imps=True,
        split_feats=None,
        check_corr=False,
        selection='fAnova',
        combine_res=True,
    )

    # Evaluate performance with and without the engineered features
    if task_type == 'classification':
        clf = LogisticRegression(random_state=0).fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        # average='weighted' handles binary and multiclass targets alike
        print(f"Original F1 Score: {f1_score(y_test, y_pred, average='weighted'):.4f}")
        clf = LogisticRegression(random_state=0).fit(bf.transform(X_train), y_train)
        y_pred_bf = clf.predict(bf.transform(X_test))
        print(f"BigFeat F1 Score: {f1_score(y_test, y_pred_bf, average='weighted'):.4f}")
    else:  # regression
        reg = LinearRegression().fit(X_train, y_train)
        y_pred = reg.predict(X_test)
        print(f"Original R² Score: {r2_score(y_test, y_pred):.4f}")
        reg = LinearRegression().fit(bf.transform(X_train), y_train)
        y_pred_bf = reg.predict(bf.transform(X_test))
        print(f"BigFeat R² Score: {r2_score(y_test, y_pred_bf):.4f}")
```
Key parameters for `bf.fit` (an alternative configuration is sketched after this list):

- `gen_size`: Number of features to generate per iteration.
- `random_state`: Seed for reproducibility.
- `iterations`: Number of feature generation iterations.
- `estimator`: Method for feature importance (`'avg'` uses RandomForest and LightGBM).
- `feat_imps`: Whether to use feature importance for guiding generation.
- `split_feats`: Strategy for splitting features (`'comb'` or `'splits'`).
- `check_corr`: Whether to check for and remove highly correlated features.
- `selection`: Feature selection method (`'stability'` or `'fAnova'`).
- `combine_res`: Whether to combine results across iterations.
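For instance, to use stability-based selection with correlation filtering, one might call `fit` as follows. This is a sketch only; the particular values are illustrative assumptions, not recommended defaults:

```python
# Illustrative alternative configuration (values are assumptions):
# stability-based selection with correlation filtering enabled.
res = bf.fit(
    X_train, y_train,
    gen_size=10,
    random_state=0,
    iterations=3,
    estimator='avg',
    feat_imps=True,
    split_feats='comb',     # combine-features strategy
    check_corr=True,        # drop highly correlated generated features
    selection='stability',  # stability selection instead of fAnova
    combine_res=True,
)
```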
If you use BigFeat in your research, please cite the following paper:
```bibtex
@inproceedings{eldeeb2022bigfeat,
  title={BigFeat: Scalable and Interpretable Automated Feature Engineering Framework},
  author={Eldeeb, Hassan and Amashukeli, Shota and ElShawi, Radwa},
  booktitle={2022 IEEE International Conference on Big Data (Big Data)},
  pages={515--524},
  year={2022},
  organization={IEEE}
}
```