# ood-detection-benchmarks

Out-of-distribution (OOD) detection is the task of determining whether a datapoint comes from a different distribution than the training dataset. For example, we may train a model to classify dog breeds and find a cat image in our dataset. This cat image would be considered out-of-distribution.

OOD detection is useful for finding label issues where the actual ground-truth label is not in the set of labels for our task (e.g. a cat label in a dog breed classification task). This can serve many use cases, including:

- Removing OOD datapoints from a dataset as part of a data cleaning pipeline
- Deciding whether new classes should be added to the task
- Gaining deeper insight into the data distribution

This work evaluates the effectiveness of various scores to detect OOD datapoints.

We also present a novel OOD score based on the average entropy of a datapoint's K nearest neighbors.

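A minimal sketch of this idea, assuming the score is computed from a model's out-of-sample predicted class probabilities and embeddings (the exact formulation lives in the notebooks below): for each datapoint, find its K nearest neighbors in embedding space and average the entropy of those neighbors' predicted class distributions, treating higher values as more likely OOD.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def knn_entropy_ood_score(embeddings, pred_probs, k=10):
    """Average entropy of each datapoint's K nearest neighbors (illustrative sketch).

    embeddings: (N, D) array of model embeddings
    pred_probs: (N, C) array of predicted class probabilities
    Returns an (N,) array of scores; higher = more likely out-of-distribution.
    """
    # Entropy of each datapoint's predicted class distribution.
    per_point_entropy = -(pred_probs * np.log(pred_probs + 1e-12)).sum(axis=1)

    # k + 1 neighbors because each point is returned as its own nearest neighbor.
    neighbor_idx = (
        NearestNeighbors(n_neighbors=k + 1).fit(embeddings).kneighbors(embeddings)[1]
    )

    # Average the entropies of the K neighbors, excluding the point itself.
    return per_point_entropy[neighbor_idx[:, 1:]].mean(axis=1)
```
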
## Methodology

We treat OOD detection as a binary classification task (True or False: is the datapoint out-of-distribution?) and evaluate the performance of various OOD scores using AUROC (area under the ROC curve).

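With scikit-learn, this evaluation reduces to a single `roc_auc_score` call once every test datapoint has an OOD score. A minimal sketch, assuming OOD datapoints are labeled 1 and in-distribution datapoints 0, with a simple baseline score (1 minus the max predicted probability) standing in for whichever score is being evaluated:

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def ood_auroc(scores_in_dist, scores_ood):
    """AUROC for separating OOD from in-distribution datapoints.

    Both arguments are 1-D arrays of OOD scores where higher = more likely OOD.
    """
    labels = np.concatenate([np.zeros(len(scores_in_dist)), np.ones(len(scores_ood))])
    scores = np.concatenate([scores_in_dist, scores_ood])
    return roc_auc_score(labels, scores)


# Example usage with a simple baseline score (1 - max predicted probability),
# where pred_probs_* are (N, num_classes) arrays of out-of-sample predictions:
# auroc = ood_auroc(1 - pred_probs_in.max(axis=1), 1 - pred_probs_ood.max(axis=1))
```
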
## Experiments

For each experiment, we perform the following procedure (a toy end-to-end sketch is shown after the table below):

1. Train a neural network model with ONLY the **in-distribution** training dataset.
2. Use this model to generate predicted probabilities and embeddings for the **in-distribution** and **out-of-distribution** test datasets (these are considered out-of-sample predictions).
3. Use the out-of-sample predictions to generate OOD scores.
4. Compute the AUROC of the OOD scores for detecting OOD datapoints.

| Experiment ID | In-Distribution | Out-of-Distribution |
| :------------ | :-------------- | :------------------ |
| 0             | cifar-10        | cifar-100           |
| 1             | cifar-100       | cifar-10            |
| 2             | mnist           | roman-numeral       |
| 3             | roman-numeral   | mnist               |
| 4             | mnist           | fashion-mnist       |
| 5             | fashion-mnist   | mnist               |

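As a rough end-to-end illustration of steps 1-4, here is a toy version of one experiment that uses synthetic data and a small scikit-learn classifier as stand-ins for the image datasets and neural networks in the notebooks; the `1 - max predicted probability` score is just a placeholder for whichever OOD score is under evaluation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Step 1: train on the in-distribution data ONLY (synthetic stand-in data).
X_in, y_in = make_classification(
    n_samples=2000, n_features=20, n_informative=10, n_classes=3, random_state=0
)
X_train, y_train = X_in[:1500], y_in[:1500]
X_test_in = X_in[1500:]
model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
model.fit(X_train, y_train)

# Step 2: out-of-sample predicted probabilities for the in-distribution and
# out-of-distribution test sets (a shifted distribution plays the role of OOD data).
X_test_ood = rng.normal(loc=5.0, scale=2.0, size=(500, 20))
probs_in = model.predict_proba(X_test_in)
probs_ood = model.predict_proba(X_test_ood)

# Step 3: turn predictions into OOD scores (placeholder: 1 - max predicted probability).
scores_in = 1 - probs_in.max(axis=1)
scores_ood = 1 - probs_ood.max(axis=1)

# Step 4: AUROC for detecting the OOD datapoints.
labels = np.concatenate([np.zeros(len(scores_in)), np.ones(len(scores_ood))])
print("AUROC:", roc_auc_score(labels, np.concatenate([scores_in, scores_ood])))
```
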
## Instructions

#### 0. Prerequisite

- [NVIDIA Container Toolkit](https://github.com/NVIDIA/nvidia-docker): allows us to properly utilize our NVIDIA GPUs inside docker environments

#### 1. Run docker-compose to build the docker image and run the container

Clone this repo and run the commands below:

```bash
sudo docker-compose build
sudo docker-compose run --rm --service-ports dcai
```

#### 2. Start Jupyter Lab

```bash
make jupyter-lab
```

#### 3. Train all models with a single notebook

[src/experiments/OOD/0_Train_Models.ipynb](src/experiments/OOD/0_Train_Models.ipynb)

#### 4. Run all experiments with a single notebook

[src/experiments/OOD/1_Evaluate_All_OOD_Experiments.ipynb](src/experiments/OOD/1_Evaluate_All_OOD_Experiments.ipynb)

## Results

Preparation of final results is in progress.