Teaching AI to read?

Pythia: Facebook’s deep learning framework for the vision and language domain

Sarah Schlothauer

The latest open-source tool from Facebook AI Research is Pythia, a deep learning framework designed to help with Visual Question Answering. It is built on top of PyTorch and offers a modular design for building AI models. Take a peek at the research involved.

Can you teach AI how to read? Facebook AI Research has yet another open source deep learning offering. Built upon PyTorch, Pythia is a modular framework for deep learning.

It was designed to help with Visual Question Answering (VQA), a task in which the AI “reads” a photo and answers questions based on the visual data available. This research can be used, for instance, to automate image captioning by reading text from a photograph.

Reading with deep learning

From the Facebook AI Research announcement regarding open sourcing Pythia:

Features include reference implementations to show how previous state-of-the-art models achieved related benchmark results and to quickly gauge the performance of new models. In addition to multitasking, Pythia also supports distributed training and a variety of datasets, as well as custom losses, metrics, scheduling, and optimizers…Pythia smooths the process of entering the growing subfield of vision and language and frees researchers to focus on faster prototyping and experimentation. Our goal is to accelerate progress by increasing the reproducibility of these models and results. This will make it easier for the community to build on, and benchmark against, successful systems.
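To give a feel for what a VQA metric looks like in practice, here is a minimal sketch of the accuracy rule commonly cited for VQA benchmarks: an answer gets full credit if at least three human annotators gave the same answer, and partial credit otherwise. The function name `vqa_accuracy` and the sample answers are made up for illustration; this is not Pythia's own code, and the official evaluation applies additional answer normalization and averaging that are left out here.

```python
def vqa_accuracy(predicted, human_answers):
    """Score one predicted answer against a list of human answers.

    Simplified rule: min(number of matching human answers / 3, 1).
    """
    normalized = predicted.strip().lower()
    # Count how many annotators gave exactly this answer (case-insensitive).
    matches = sum(1 for answer in human_answers if answer.strip().lower() == normalized)
    return min(matches / 3.0, 1.0)


# Hypothetical annotations for a single question about a street sign.
human_answers = ["stop"] * 2 + ["stop sign"] * 8

print(vqa_accuracy("stop sign", human_answers))  # eight matches, full credit
print(vqa_accuracy("stop", human_answers))       # two matches, partial credit
print(vqa_accuracy("yield", human_answers))      # no matches, no credit
```

The soft scoring means a model is not punished too harshly when human annotators themselves disagree on the answer.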

Interested in the research behind Visual Question Answering?

Read the relevant paper, “Towards VQA Models That Can Read,” by Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. You can also explore their research to see Pythia-powered examples of how deep learning “reads” text.

Pythia features

From the documentation, Pythia’s features include:

  • Model Zoo: Reference implementations for VQA using LoRRA, the Pythia model, and BAN.
  • Distributed: Supports DataParallel and DistributedDataParallel
  • Multi-tasking: Save time and train multiple datasets simultaneously
  • Customizable: Customize losses, metrics, scheduling, optimizers, and tensorboard.
  • Unopinionated: Unopinionated dataset and model implementations
  • Modules: Implementations for commonly used layers in vision and language domain
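The multi-tasking idea in the list above can be pictured with a small, framework-free sketch. It assumes one simple strategy: at each training step, pick a dataset with probability proportional to its size and draw a batch from it. The function `interleave_batches` and the toy datasets are hypothetical, and Pythia's actual scheduling may work differently.

```python
import random


def interleave_batches(datasets, batch_size, num_steps, seed=0):
    """Yield (dataset_name, batch) pairs for a multi-dataset training loop.

    At each step a dataset is chosen with probability proportional to its
    size, then a batch is sampled from it without replacement.
    """
    rng = random.Random(seed)
    names = list(datasets)
    weights = [len(datasets[name]) for name in names]
    for _ in range(num_steps):
        name = rng.choices(names, weights=weights, k=1)[0]
        examples = datasets[name]
        batch = rng.sample(examples, k=min(batch_size, len(examples)))
        yield name, batch


# Hypothetical toy datasets standing in for real vision-and-language data.
toy_datasets = {
    "dataset_a": [f"a_example_{i}" for i in range(8)],
    "dataset_b": [f"b_example_{i}" for i in range(4)],
}

for name, batch in interleave_batches(toy_datasets, batch_size=2, num_steps=4):
    # A real loop would run a forward and backward pass on the batch here.
    print(name, batch)
```

Sampling in proportion to dataset size keeps a small dataset from being repeated far more often than a large one, though other weighting schemes are just as plausible.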

Pythia can also serve as a starter codebase to bootstrap your own VQA project.

Vision & language

Check out the full documentation for a quickstart guide and available libraries.

A demo of the Pythia model is also on Colab. (You will need to install the necessary data before heading into the playground.)

Get the repo on GitHub and begin your deep learning journey by following the installation guide. The repo also includes pre-trained models that you can build upon in your next deep learning project.
