
Commit b8fadd1

Add notebooks, requirements, and environment files
1 parent e7d5f49 commit b8fadd1

29 files changed: +1472 −1 lines

.dockerignore (+12)

@@ -0,0 +1,12 @@
# ignore the .git and .cache folders
.git
.cache
# ignore all *.class files in all folders, including build root
**/*.class
# ignore all markdown (.md) files except README*.md, but still ignore README-secret.md
*.md
!README*.md
README-secret.md
# ignore folders
data
temp

.gitignore (+21)

@@ -0,0 +1,21 @@
*/__init__.py
*__pycache__
*/.ipynb*
.ipynb_checkpoints
*temp*.ipynb
*.csv
*.pickle
*.trial*
*split*
data
temp
archive

# AutoGluon models
*.ag

# embeddings
*embeddings.npy

# ONNX models
*.onnx

Dockerfile (+8)

@@ -0,0 +1,8 @@
FROM nvidia/cuda:11.4.0-cudnn8-runtime-ubuntu20.04
WORKDIR /dcai
COPY . .
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get -y update && \
    apt-get -y install make python3 python3-pip ffmpeg libsm6 libxext6 git && \
    python3 -m pip install -r requirements.txt -f https://download.pytorch.org/whl/cu113/torch_stable.html
ENV PYTHONPATH=".:${PYTHONPATH}"

Makefile (+3)

@@ -0,0 +1,3 @@
# Use Makefile to run jupyter lab for convenience so we can save args (ip, port, allow-root, etc.)
jupyter-lab:
	jupyter lab --ip 0.0.0.0 --port 8888 --allow-root

README.md (+66 −1)

@@ -1,3 +1,68 @@
# ood-detection-benchmarks

-Evaluation of algorithms to detect out-of-distribution data

Out-of-distribution (OOD) detection is the task of determining whether a datapoint comes from a different distribution than the training dataset. For example, we may train a model to classify dog breeds and find a cat image in our dataset; that cat image would be considered out-of-distribution.

OOD detection is useful for finding label issues where the actual ground-truth label is not in the set of labels for our task (e.g., a cat label in a dog-breed classification task). This serves many use cases, including:

- Removing OOD datapoints from the dataset as part of a data cleaning pipeline
- Identifying new classes to add to the task
- Gaining deeper insight into the data distribution

This work evaluates how effectively various scores detect OOD datapoints.

We also present a novel OOD score based on the average entropy of a datapoint's K nearest neighbors. A sketch of this idea follows.
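
A minimal sketch of this idea (an illustration under assumptions, not necessarily the exact formulation benchmarked here): score each test point by the mean entropy of the predicted class distributions of its K nearest training neighbors in embedding space. Function and variable names below are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_avg_entropy_scores(train_embeddings, train_pred_probs,
                           test_embeddings, k=10, eps=1e-12):
    """OOD score: mean entropy of the predicted class distributions of a
    test point's K nearest training neighbors. Higher = more likely OOD."""
    # Entropy of each training point's out-of-sample predicted distribution.
    train_entropies = -(train_pred_probs * np.log(train_pred_probs + eps)).sum(axis=1)

    # K nearest training neighbors of each test point in embedding space.
    nn = NearestNeighbors(n_neighbors=k).fit(train_embeddings)
    _, neighbor_idx = nn.kneighbors(test_embeddings)  # shape: (n_test, k)

    # Average the neighbors' entropies to score each test point.
    return train_entropies[neighbor_idx].mean(axis=1)
```

The intuition, hedged: in-distribution points tend to sit near confidently predicted (low-entropy) training neighbors, while OOD points land in regions where the model is uncertain.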

## Methodology

We treat OOD detection as a binary classification task (True or False: is the datapoint out-of-distribution?) and evaluate the performance of various OOD scores using AUROC.
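
Concretely, labeling in-distribution test points 0 and OOD test points 1, the AUROC of any score can be computed as below (a sketch; the score arrays are placeholders):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# OOD scores for the in-distribution and OOD test sets, e.g. produced by
# knn_avg_entropy_scores above (values here are illustrative placeholders).
scores_in = np.array([0.2, 0.1, 0.4])
scores_ood = np.array([0.9, 0.6, 0.8])

labels = np.concatenate([np.zeros(len(scores_in)), np.ones(len(scores_ood))])
scores = np.concatenate([scores_in, scores_ood])

auroc = roc_auc_score(labels, scores)  # 1.0 = perfect separation, 0.5 = chance
print(f"AUROC: {auroc:.3f}")
```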

## Experiments

For each experiment, we perform the following procedure:

1. Train a neural network model with ONLY the **in-distribution** training dataset.
2. Use this model to generate predicted probabilities and embeddings for the **in-distribution** and **out-of-distribution** test datasets (these are considered out-of-sample predictions).
3. Use the out-of-sample predictions to generate OOD scores (a common baseline is sketched after the table below).
4. Compute the AUROC of the OOD scores at detecting OOD datapoints.

| Experiment ID | In-Distribution | Out-of-Distribution |
| :------------ | :-------------- | :------------------ |
| 0             | cifar-10        | cifar-100           |
| 1             | cifar-100       | cifar-10            |
| 2             | mnist           | roman-numeral       |
| 3             | roman-numeral   | mnist               |
| 4             | mnist           | fashion-mnist       |
| 5             | fashion-mnist   | mnist               |
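
For reference, one standard baseline OOD score derived from out-of-sample predicted probabilities is one minus the maximum softmax probability, or alternatively the entropy of the predicted distribution. A minimal sketch of these common baselines (not necessarily the exact score set benchmarked here):

```python
import numpy as np

def max_softmax_ood_score(pred_probs):
    """Baseline OOD score: 1 - max class probability.
    Confident predictions -> low score; diffuse predictions -> high score."""
    return 1.0 - pred_probs.max(axis=1)

def entropy_ood_score(pred_probs, eps=1e-12):
    """Alternative baseline: entropy of the predicted class distribution."""
    return -(pred_probs * np.log(pred_probs + eps)).sum(axis=1)
```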

## Instructions

#### 0. Prerequisite

- [NVIDIA Container Toolkit](https://github.com/NVIDIA/nvidia-docker): allows Docker containers to properly utilize the host's NVIDIA GPUs

#### 1. Run docker-compose to build the docker image and run the container

Clone this repo and run the commands below:

```bash
sudo docker-compose build
sudo docker-compose run --rm --service-ports dcai
```

#### 2. Start Jupyter Lab

```bash
make jupyter-lab
```

#### 3. Train all models with a single notebook

[src/experiments/OOD/0_Train_Models.ipynb](src/experiments/OOD/0_Train_Models.ipynb)

#### 4. Run all experiments with a single notebook

[src/experiments/OOD/1_Evaluate_All_OOD_Experiments.ipynb](src/experiments/OOD/1_Evaluate_All_OOD_Experiments.ipynb)

## Results

Preparation of the final results is in progress.

configs/default.yaml (+9)

@@ -0,0 +1,9 @@
feature_extractor:
  path_to_onnx: "./src/image_feature_extraction/models/feature_extractor.onnx"
  batch_size: 32

nearest_neighbor:
  metric: "angular"
  n_trees: 10
  file_path: "./src/image_similarity_search/index.ann"
  precompute_neighbors: 100
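
The `nearest_neighbor` settings mirror the Annoy library's API (`annoy` is listed in requirements.txt). A minimal sketch of how this config might be consumed, assuming PyYAML is available (hydra-core pulls it in) and using placeholder embeddings:

```python
import numpy as np
import yaml
from annoy import AnnoyIndex

with open("configs/default.yaml") as f:
    cfg = yaml.safe_load(f)["nearest_neighbor"]

embeddings = np.random.rand(1000, 512).astype(np.float32)  # placeholder embeddings

# Build an Annoy index with the configured distance metric and tree count.
index = AnnoyIndex(embeddings.shape[1], cfg["metric"])
for i, vec in enumerate(embeddings):
    index.add_item(i, vec)
index.build(cfg["n_trees"])
index.save(cfg["file_path"])

# Query the configured number of precomputed neighbors for item 0.
neighbors = index.get_nns_by_item(0, cfg["precompute_neighbors"])
```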

docker-compose.yml (+24)

@@ -0,0 +1,24 @@
version: "3.8"
services:
  dcai:
    tty: true
    build:
      context: ./
      dockerfile: Dockerfile
      shm_size: "8gb"
    image: dcai
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: "all" # use all GPU devices on host machine
              capabilities: [ gpu ]
    entrypoint: bash
    ports:
      - "8888:8888"
    volumes:
      - .:/dcai

      # TODO: remove below before publishing repo
      - /home/johnson/Data:/Data # data on M.2 SSD; super fast read for training workloads
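
The `deploy.resources.reservations.devices` block passes the host's NVIDIA GPUs into the container (this is what the NVIDIA Container Toolkit prerequisite enables). A quick sanity check one might run inside the container to confirm PyTorch sees the GPUs (illustrative, not part of the commit):

```python
import torch

# Should report True and the number of GPUs reserved via docker-compose.
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("Device 0:", torch.cuda.get_device_name(0))
```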

requirements.txt (+24)

@@ -0,0 +1,24 @@
cleanlab
click
pandas
numba
catboost
openpyxl
dirhash
onnx
onnxruntime-gpu
hydra-core
altair
torch==1.10.1+cu113
torchvision==0.11.2+cu113
torchaudio==0.10.1+cu113
skorch
torchvision
pytorch-lightning
autogluon
annoy
cifar2png
tensorflow
tensorflow_datasets
umap-learn
seaborn
