Data

The dataset is split into five separate packages, which differ by train/validation/test split and by the color-to-figure assignment scheme used. A sample of one thousand images from the training set is also included for those who want a quick look at the data. The splits are distinguished by the color scheme used, labeled "1" or "2". The entire dataset is 3.53 GB compressed and 5.78 GB uncompressed.


Note: a small error affecting 0.5% of the question-answer pairs was discovered in the dataset.
This issue has been repaired and the updated dataset is available as of October 2, 2017.
Please use this version of the dataset to replicate our upcoming results.

Paper

Coming Soon!

GitHub Repo

Coming Soon!

Dataset Format

The training, sample, and validation splits of the data contain the following:

  • png/: Directory with all figure images in PNG format. Filenames have the format "<image_index>.png" where image_index starts at 0.
  • qa_pairs.json: Question-answer pairs and encoding details, with references to the images in png/.
  • annotations.json: Bounding-box and source data annotations, with references to the images in png/.

The test sets follow a similar format, except that the answers to the questions and the annotations.json file have been purposely omitted.
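
As a minimal sketch of how this layout might be read in Python, assuming a split has been extracted to a local directory: the `split_dir` argument and the helper names `image_path` and `load_split` are illustrative and not part of the dataset.

```python
import json
from pathlib import Path


def image_path(split_dir, image_index):
    """Return the path of a figure image: png/<image_index>.png."""
    return Path(split_dir) / "png" / f"{image_index}.png"


def load_split(split_dir):
    """Load qa_pairs.json and, if present, annotations.json from one split.

    Test splits omit annotations.json, so None is returned for it there.
    """
    split_dir = Path(split_dir)
    with open(split_dir / "qa_pairs.json", encoding="utf-8") as f:
        qa = json.load(f)

    annotations = None
    annotations_file = split_dir / "annotations.json"
    if annotations_file.exists():
        with open(annotations_file, encoding="utf-8") as f:
            annotations = json.load(f)
    return qa, annotations
```

Returning `None` for a missing annotations.json lets the same loader be used for the training, sample, validation, and test splits.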

qa_pairs.json

Top-Level Fields

  • qa_pairs: A list of "qa_pair" objects. See below for details.
  • total_distinct_questions: How many question types, identified by "question_id", are present in the dataset.
  • total_distinct_colors: How many colors, identified by "color(1|2)_id", are present in the dataset.
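
The top-level fields can be inspected with a few lines of Python. The sketch below assumes `qa` is the already-parsed content of qa_pairs.json; the function name `summarize_qa_file` is hypothetical. It is not stated here whether the declared totals describe a single split or the whole dataset, so the number of question types observed in one file may be smaller than the declared total.

```python
def summarize_qa_file(qa):
    """Report the declared totals next to what the qa_pairs list contains."""
    pairs = qa["qa_pairs"]
    # Question types actually seen in this file, keyed by question_id.
    question_ids = {pair["question_id"] for pair in pairs}
    return {
        "num_qa_pairs": len(pairs),
        "declared_distinct_questions": qa["total_distinct_questions"],
        "observed_distinct_questions": len(question_ids),
        "declared_distinct_colors": qa["total_distinct_colors"],
    }
```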

"qa_pair" Object Details

  • question_string: Question generated from a natural language template.
  • question_id: Unique identifier for the question type.
  • answer: 1 for yes/true, 0 for no/false.
  • image_index: Unique identifier for the source image. The image is at "png/<image_index>.png".
  • color1_name: Natural language name for the first color.
  • color1_id: Unique identifier for the first color.
  • color1_rgb: RGB values of the first color in a list, each value ranging from 0 to 255.
  • color2_name: Natural language name for the second color. If there is no second color in the question, this is "--None--".
  • color2_id: Unique identifier for the second color. If there is no second color in the question, this is -1.
  • color2_rgb: RGB values of the second color in a list, each value ranging from 0 to 255. If there is no second color in the question, this is [-1, -1, -1].
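
A short sketch of how "qa_pair" objects might be consumed, using the -1 sentinel of color2_id to detect single-color questions. The helper names are illustrative. The handling of test splits assumes that the "answer" key is simply absent there, which is not confirmed above.

```python
from collections import defaultdict

NO_COLOR_ID = -1  # value of color2_id when the question has no second color


def colors_in_question(qa_pair):
    """Return (name, rgb) tuples for the colors a question refers to."""
    colors = [(qa_pair["color1_name"], tuple(qa_pair["color1_rgb"]))]
    if qa_pair["color2_id"] != NO_COLOR_ID:
        colors.append((qa_pair["color2_name"], tuple(qa_pair["color2_rgb"])))
    return colors


def group_by_image(qa_pairs):
    """Map image_index -> list of (question_string, answer) tuples.

    answer is None where the key is missing (assumed for the test splits).
    """
    grouped = defaultdict(list)
    for pair in qa_pairs:
        grouped[pair["image_index"]].append(
            (pair["question_string"], pair.get("answer"))
        )
    return dict(grouped)
```

Grouping by image_index is convenient when each figure should be loaded from png/ only once and then paired with all of its questions.
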

annotations.json

This file consists of a list of annotation objects, each of which has the same top-level fields, described below. The annotation objects are described in more detail in the annotations_format.txt file provided with each split of the dataset, which is also available here.

  • type: Figure type. One of: vbar_categorical, hbar_categorical, pie, line, dot_line.
  • general_plot_data: Annotation data for general figure elements, e.g. dimensions, axes, labels, legend, etc.
  • models: A list of annotations per plot element. The structure depends on the figure type. Note that the "name" and "label(s)" fields in a "models" object correspond to the color name (i.e. "color(1|2)_name" in the qa_pairs.json file).
  • image_index: Unique identifier for the source image. The image is at "png/<image_index>.png".
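
Because annotations.json is a flat list, it can help to index it by image_index before joining it with the question-answer pairs. The following is a sketch under the assumption that `annotations` is the parsed list; `index_annotations` is a hypothetical helper. It relies only on the "type" and "image_index" fields and leaves the figure-specific "models" structure untouched.

```python
from collections import Counter

# The five figure types listed for the "type" field.
FIGURE_TYPES = {"vbar_categorical", "hbar_categorical", "pie", "line", "dot_line"}


def index_annotations(annotations):
    """Map image_index -> annotation object and count figures per type."""
    by_image = {}
    type_counts = Counter()
    for ann in annotations:
        if ann["type"] not in FIGURE_TYPES:
            raise ValueError(f"unexpected figure type: {ann['type']!r}")
        by_image[ann["image_index"]] = ann
        type_counts[ann["type"]] += 1
    return by_image, type_counts
```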