This was our final project for the Neural Networks course. It uses the CLEVR dataset from Stanford. The aim of this post is to introduce the reader to one of the most intriguing problems in AI: Visual Question Answering (VQA). It is not a tutorial on VQA, but a gentle introduction to the problem and our approach to solving it.

Visual QnA is one of the most challenging problems in Deep Learning. Here is a basic summary of what Visual QnA is, through an example:

**INPUT** to the computer is an image and a question:

Image –>

Question 1 –>

What color is the cube that is behind the silver sphere and to the left of the yellow cylinder?

**OUTPUT** 1 should be:

Brown

Question 2 –>

How many big spheres are there?

**OUTPUT** 2 should be:

2
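To make the input/output format above concrete, here is a minimal Python sketch of how one such example could be represented and how a model's answers could be scored. Everything in it is an assumption for illustration: the `VQASample` class, the `scene.png` path, the made-up predictions, and the choice of exact-match scoring are hypothetical and are not taken from our project code or from the CLEVR tooling.

```python
from dataclasses import dataclass


@dataclass
class VQASample:
    """One VQA example: an image, a question about it, and the expected answer."""
    image_path: str  # hypothetical path; stands in for the actual image
    question: str
    answer: str


def exact_match_accuracy(predictions, samples):
    """Fraction of predictions equal to the expected answer (case-insensitive)."""
    if not samples:
        return 0.0
    correct = sum(
        pred.strip().lower() == sample.answer.strip().lower()
        for pred, sample in zip(predictions, samples)
    )
    return correct / len(samples)


# The two questions from the example above, both asked about the same image.
samples = [
    VQASample(
        "scene.png",
        "What color is the cube that is behind the silver sphere "
        "and to the left of the yellow cylinder?",
        "brown",
    ),
    VQASample("scene.png", "How many big spheres are there?", "2"),
]

# Made-up model outputs: the first is right, the second is wrong.
predictions = ["Brown", "3"]

accuracy = exact_match_accuracy(predictions, samples)
print(f"accuracy = {accuracy:.2f}")
```

In this sketch an answer counts as correct only if it matches the expected string exactly, ignoring case and surrounding whitespace, which suits short answers such as a color or a count.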

We built a model that reached an accuracy of about 46%. That is actually pretty good, considering that the best models in the world have an accuracy of around 55% for numbered VQA.

Here are the links to the problem statement and our solution: