Sophisticated, human-generated datasets for natural language understanding research


Maluuba's datasets for Natural Language Understanding:

  • Machine Reading Comprehension
  • Goal Oriented Dialogue Systems
  • Conversational Interfaces and Reinforcement Learning

Maluuba is making these datasets available for the Artificial Intelligence research community.

Machine Reading Comprehension

NewsQA Dataset

Maluuba's NewsQA is a new machine reading comprehension dataset for developing algorithms that can answer questions requiring human-level comprehension and reasoning skills. The dataset contains 120K question-answer pairs based on CNN news articles. Questions are written by humans in natural language. Some questions have no answer in the article, and answers may be multi-word passages.

Goal Oriented Dialogue

Frames Dataset

Maluuba's Frames dataset is designed to drive research on truly conversational agents that can support decision-making in complex settings. The dataset was collected from human-to-human conversations held over a chat interface, in which one participant played the role of customer and the other played the role of travel agent. It contains natural, complex dialogues in which users consider different options, compare packages, and progressively build rich descriptions through conversation.

Visual Question Answering

FigureQA Dataset

Maluuba's FigureQA is a novel visual question-answering dataset that aims to drive research in visual reasoning, combining the domains of vision and language. The dataset consists of figures such as bar graphs, line plots, and pie charts, together with questions that compare quantitative attributes of figure elements. Every question has a yes or no answer, and each image comes with bounding boxes and the source data used to create the figure.