Initial commit
aatkinson committed Jan 11, 2018
0 parents commit dc0a65f
Showing 31 changed files with 3,209 additions and 0 deletions.
104 changes: 104 additions & 0 deletions .gitignore
@@ -0,0 +1,104 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
.static_storage/
.media/
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
6 changes: 6 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,6 @@
This project has adopted the [Microsoft Open Source Code of
Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct
FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com)
with any additional questions or comments.
7 changes: 7 additions & 0 deletions LICENSE.txt
@@ -0,0 +1,7 @@
FigureQA Code
Copyright (c) Microsoft Corporation
All rights reserved.
MIT License
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
71 changes: 71 additions & 0 deletions README.md
@@ -0,0 +1,71 @@
# FigureQA

Code to generate the FigureQA dataset. See https://datasets.maluuba.com/FigureQA.

## Data Generation

Data generation consists of 3 parts:

1. Generate the source numerical data, styles, and question-answer pairs for the figures.
1. Generate the figure images and bounding box annotations.
1. Aggregate the figure images, questions & answers, annotations, and source data.

### Code Map

All data generation source code lives in the `figureqa/generation` subpackage:

- `questions` subpackage contains code to generate questions:

  - `categorical.py` for bar graph and pie chart questions.
  - `lines.py` for line plot questions.
  - `utils.py` for balancing and question encoding augmentation.

- `source_data_generation.py` to generate source data, questions, and answers.

- `figure_generation.py` to generate figure images and bounding boxes.

- `json_combiner.py` aggregates the generated data into the documented format. It also allows a data split to be generated in multiple batches.

- `data_utils.py` has misc. utilities for reconciling data formats, placing legends, etc.

- `figure.py` defines the figure objects in Bokeh.

- `generate_dataset.py` generates a whole dataset end-to-end.

- `show_bounding_boxes.py` generates images with bounding boxes visualized.

Each runnable module (script) can have its command line arguments displayed with `--help`.

There are some additional files used for data generation in these directories:

- `config` contains `.yaml` files that configure visual aspects, source data parameters, color splits, and dataset generation.

- `resources` contains the colors and other misc. resources for data generation.

The `docs` directory contains additional documentation on annotations, question format, and file formats.

### Prerequisites

1. Install the FigureQA fork of Bokeh from https://www.github.com/Maluuba/bokeh.
1. `pip install -r requirements.txt`.
1. Make sure you have enough disk space. The whole unzipped dataset is larger than 6 GB, and you also need room for intermediate data.

### Generate a whole dataset

#### Using a single script

This is done with the end-to-end script `generate_dataset.py`. It does the source data synthesis, figure generation, and aggregation.

This script must be run from the root directory, `FigureQA`.

The config for the actual dataset is in `config/figureqa_generation_config.yaml`.
A sample config is provided in `config/sample_figureqa_generation_config.yaml`.

Note that this does not generate the test sets.

#### With individual scripts

1. `cd FigureQA`
1. `python figureqa/generation/source_data_generation.py CONFIG_FILE.yaml SOURCE_DATA.json --<figure_type> <N_figures> ...`
1. `python figureqa/generation/figure_generation.py SOURCE_DATA.json RAW_GENERATED_DIR`
1. `python figureqa/generation/json_combiner.py FINAL_AGGREGATE_DIR RAW_GENERATED_DIR1 RAW_GENERATED_DIR2 ...`
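
The individual steps above could be chained from Python roughly as follows. This is only a sketch: `build_commands` and `run_pipeline` are hypothetical helpers, and the `--vbar`-style flag spelling is an assumption (the steps only specify `--<figure_type> <N_figures>`), so check each script's `--help` for the actual arguments.

```python
import subprocess
import sys


def build_commands(config_yaml, source_json, raw_dir, final_dir, figure_counts):
    """Return the three command lines from the steps above as argument lists.

    figure_counts maps a figure type to a count, e.g. {"vbar": 10}. The
    "--<figure_type>" flag spelling is an assumption; confirm it with --help.
    """
    source_cmd = [
        sys.executable,
        "figureqa/generation/source_data_generation.py",
        config_yaml,
        source_json,
    ]
    for figure_type, n_figures in figure_counts.items():
        source_cmd += ["--" + figure_type, str(n_figures)]

    figure_cmd = [
        sys.executable,
        "figureqa/generation/figure_generation.py",
        source_json,
        raw_dir,
    ]
    combine_cmd = [
        sys.executable,
        "figureqa/generation/json_combiner.py",
        final_dir,
        raw_dir,
    ]
    return [source_cmd, figure_cmd, combine_cmd]


def run_pipeline(commands):
    """Run each step in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


commands = build_commands(
    "CONFIG_FILE.yaml", "SOURCE_DATA.json", "RAW_GENERATED_DIR",
    "FINAL_AGGREGATE_DIR", {"vbar": 10, "pie": 5},
)
for command in commands:
    print(" ".join(command))
```

Calling `run_pipeline(commands)` from the `FigureQA` root directory would then execute the steps in sequence; the example only prints them so the arguments can be reviewed first.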
36 changes: 36 additions & 0 deletions config/color_scheme1_source_data.yaml
@@ -0,0 +1,36 @@
vbar_categorical:
  y_range: [0, 99]
  n_points_range: [2, 10]
  x_distn: ["linear"]
  shape: ["random", "random", "random", "random", "linear_inc", "linear_dec", "cluster"]
  color_sources: ["resources/color_split2.txt"]

hbar_categorical:
  y_range: [0, 99]
  n_points_range: [2, 10]
  x_distn: ["linear"]
  shape: ["random", "random", "random", "random", "linear_inc", "linear_dec", "cluster"]
  color_sources: ["resources/color_split1.txt"]

line:
  x_range: [0, 100]
  y_range: [0, 100]
  n_points_range: [5, 20]
  x_distn: ["linear"]
  shape: ["linear", "linear_with_noise", "quadratic"]
  n_classes_range: [2, 7]
  color_sources: ["resources/color_split2.txt"]
  solid_pr: 0.5

dot_line:
  x_range: [0, 100]
  y_range: [0, 100]
  n_points_range: [5, 20]
  x_distn: ["linear"]
  shape: ["linear", "linear_with_noise", "quadratic"]
  n_classes_range: [2, 7]
  color_sources: ["resources/color_split1.txt"]

pie:
  color_sources: ["resources/color_split2.txt"]
  n_classes_range: [2, 7]
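
To make the fields above concrete, here is a minimal sketch of how one bar-chart specification might be drawn from the `vbar_categorical` settings. It assumes that `*_range` values are inclusive integer bounds and that list-valued fields are sampled uniformly, so a repeated entry such as `"random"` (4 of 7 entries) would be proportionally more likely. `sample_bar_spec` is a hypothetical helper, not the repository's generator.

```python
import random

# Values copied from the vbar_categorical section above.
VBAR_CONFIG = {
    "y_range": [0, 99],
    "n_points_range": [2, 10],
    "x_distn": ["linear"],
    "shape": ["random", "random", "random", "random",
              "linear_inc", "linear_dec", "cluster"],
    "color_sources": ["resources/color_split2.txt"],
}


def sample_bar_spec(config, rng):
    """Draw one hypothetical figure spec (assumed sampling semantics)."""
    low, high = config["n_points_range"]
    return {
        "n_points": rng.randint(low, high),  # inclusive on both ends
        "shape": rng.choice(config["shape"]),
        "x_distn": rng.choice(config["x_distn"]),
        "color_source": rng.choice(config["color_sources"]),
    }


rng = random.Random(0)
spec = sample_bar_spec(VBAR_CONFIG, rng)
print(spec)

# Under uniform sampling, the share of "random" in the shape list is 4/7.
random_share = VBAR_CONFIG["shape"].count("random") / len(VBAR_CONFIG["shape"])
print(round(random_share, 3))
```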
36 changes: 36 additions & 0 deletions config/color_scheme2_source_data.yaml
@@ -0,0 +1,36 @@
vbar_categorical:
  y_range: [0, 99]
  n_points_range: [2, 10]
  x_distn: ["linear"]
  shape: ["random", "random", "random", "random", "linear_inc", "linear_dec", "cluster"]
  color_sources: ["resources/color_split1.txt"]

hbar_categorical:
  y_range: [0, 99]
  n_points_range: [2, 10]
  x_distn: ["linear"]
  shape: ["random", "random", "random", "random", "linear_inc", "linear_dec", "cluster"]
  color_sources: ["resources/color_split2.txt"]

line:
  x_range: [0, 100]
  y_range: [0, 100]
  n_points_range: [5, 20]
  x_distn: ["linear"]
  shape: ["linear", "linear_with_noise", "quadratic"]
  n_classes_range: [2, 7]
  color_sources: ["resources/color_split1.txt"]
  solid_pr: 0.5

dot_line:
  x_range: [0, 100]
  y_range: [0, 100]
  n_points_range: [5, 20]
  x_distn: ["linear"]
  shape: ["linear", "linear_with_noise", "quadratic"]
  n_classes_range: [2, 7]
  color_sources: ["resources/color_split2.txt"]

pie:
  color_sources: ["resources/color_split1.txt"]
  n_classes_range: [2, 7]
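
Compared with `config/color_scheme1_source_data.yaml`, this file differs only in that each figure type's color source is swapped. The sketch below records the `color_sources` entries from both files by hand and checks that observation; the dictionary names are illustrative and nothing is loaded from disk.

```python
# color_sources per figure type, copied from the two config files.
SCHEME1 = {
    "vbar_categorical": "resources/color_split2.txt",
    "hbar_categorical": "resources/color_split1.txt",
    "line": "resources/color_split2.txt",
    "dot_line": "resources/color_split1.txt",
    "pie": "resources/color_split2.txt",
}
SCHEME2 = {
    "vbar_categorical": "resources/color_split1.txt",
    "hbar_categorical": "resources/color_split2.txt",
    "line": "resources/color_split1.txt",
    "dot_line": "resources/color_split2.txt",
    "pie": "resources/color_split1.txt",
}

SWAP = {
    "resources/color_split1.txt": "resources/color_split2.txt",
    "resources/color_split2.txt": "resources/color_split1.txt",
}

# Every figure type uses the opposite color split in the second scheme.
is_swapped = all(SCHEME2[fig] == SWAP[src] for fig, src in SCHEME1.items())
print(is_swapped)
```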
10 changes: 10 additions & 0 deletions config/common_source_data.yaml
@@ -0,0 +1,10 @@
figure_height_px: 400
figure_width_ratio_range: [1.0, 2.0]
figure_min_width_side_legend: 1.33
draw_gridlines_pr: 0.5
draw_legend_pr: 1.0
legend_inside_pr: 0.5
legend_border_pr: 0.5
legend_label_font_sizes: ['8pt', '9pt', '10pt', '11pt']
legend_horizontal_pr: 0.5
legend_horizontal_max_classes: 3
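
A rough sketch of how these common settings could translate into one figure's layout. It rests on two assumptions: keys ending in `_pr` are probabilities, and `figure_width_ratio_range` bounds a width-to-height ratio, so with `figure_height_px: 400` widths would fall between 400 and 800 px. `sample_layout` is hypothetical; the actual generator may interpret these keys differently.

```python
import random

# Values copied from common_source_data.yaml above.
COMMON = {
    "figure_height_px": 400,
    "figure_width_ratio_range": [1.0, 2.0],
    "draw_gridlines_pr": 0.5,
    "draw_legend_pr": 1.0,
    "legend_inside_pr": 0.5,
    "legend_label_font_sizes": ["8pt", "9pt", "10pt", "11pt"],
}


def sample_layout(common, rng):
    """Draw one hypothetical layout; the key semantics are assumptions."""
    low, high = common["figure_width_ratio_range"]
    height = common["figure_height_px"]
    return {
        "height_px": height,
        "width_px": int(round(height * rng.uniform(low, high))),
        # rng.random() is in [0, 1), so a probability of 1.0 always passes.
        "gridlines": rng.random() < common["draw_gridlines_pr"],
        "legend": rng.random() < common["draw_legend_pr"],
        "legend_inside": rng.random() < common["legend_inside_pr"],
        "legend_font_size": rng.choice(common["legend_label_font_sizes"]),
    }


layout = sample_layout(COMMON, random.Random(42))
print(layout)
```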
50 changes: 50 additions & 0 deletions config/figureqa_generation_config.yaml
@@ -0,0 +1,50 @@
working_directory: figureqa_generation

destination_directory: figureqa_final

common_config_yaml: config/common_source_data.yaml

colors: resources/x11_colors_refined.txt

splits:
  - name: figureqa-train1
    partitions:
      - name: train1_1
        data_config_yaml: config/color_scheme1_source_data.yaml
        seed: 123
        vbar: 20000
        hbar: 20000
        pie: 20000
        line: 0
        dot_line: 0
      - name: train1_2
        data_config_yaml: config/color_scheme1_source_data.yaml
        seed: 456
        vbar: 0
        hbar: 0
        pie: 0
        line: 20000
        dot_line: 20000

  - name: figureqa-validation1
    partitions:
      - name: validation1
        data_config_yaml: config/color_scheme1_source_data.yaml
        seed: 654
        vbar: 4000
        hbar: 4000
        pie: 4000
        line: 4000
        dot_line: 4000

  - name: figureqa-validation2
    partitions:
      - name: validation2
        data_config_yaml: config/color_scheme2_source_data.yaml
        seed: 321
        vbar: 4000
        hbar: 4000
        pie: 4000
        line: 4000
        dot_line: 4000
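
The per-partition counts above can be totalled per split with a few lines of Python. In this sketch the `SPLITS` structure mirrors the YAML by hand rather than loading the file, and the variable names are illustrative.

```python
FIGURE_TYPES = ("vbar", "hbar", "pie", "line", "dot_line")

# Partition counts copied from the splits section above.
SPLITS = {
    "figureqa-train1": [
        {"vbar": 20000, "hbar": 20000, "pie": 20000, "line": 0, "dot_line": 0},
        {"vbar": 0, "hbar": 0, "pie": 0, "line": 20000, "dot_line": 20000},
    ],
    "figureqa-validation1": [
        {"vbar": 4000, "hbar": 4000, "pie": 4000, "line": 4000, "dot_line": 4000},
    ],
    "figureqa-validation2": [
        {"vbar": 4000, "hbar": 4000, "pie": 4000, "line": 4000, "dot_line": 4000},
    ],
}

# Sum every figure type over every partition of each split.
totals = {
    name: sum(part[fig] for part in partitions for fig in FIGURE_TYPES)
    for name, partitions in SPLITS.items()
}
for name, total in totals.items():
    print(name, total)
# figureqa-train1 100000
# figureqa-validation1 20000
# figureqa-validation2 20000
```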

39 changes: 39 additions & 0 deletions config/sample_figureqa_generation_config.yaml
@@ -0,0 +1,39 @@
working_directory: sample_generation

destination_directory: sample_final

common_config_yaml: config/common_source_data.yaml

colors: resources/x11_colors_refined.txt

splits:
  - name: figureqa-train1
    partitions:
      - name: train1_1
        data_config_yaml: config/color_scheme1_source_data.yaml
        seed: 1
        vbar: 1
        hbar: 1
        pie: 1
        line: 0
        dot_line: 0
      - name: train1_2
        data_config_yaml: config/color_scheme1_source_data.yaml
        seed: 456
        vbar: 0
        hbar: 0
        pie: 0
        line: 1
        dot_line: 1

  - name: figureqa-validation1
    partitions:
      - name: validation1
        data_config_yaml: config/color_scheme1_source_data.yaml
        seed: 1001
        vbar: 1
        hbar: 1
        pie: 1
        line: 1
        dot_line: 1
