Demystifying Visual Question Answering
Submitted by Laksh Arora (@techedlaksh) on Wednesday, 14 June 2017
We are witnessing renewed excitement in multi-disciplinary Artificial Intelligence (AI) research problems. In particular, research in image and video captioning, which combines Computer Vision (CV), Natural Language Processing (NLP), and Knowledge Representation & Reasoning (KR), has increased dramatically in the past year. Ever since Alan Turing proposed the Turing Test, it has been an important concept in the philosophy of Artificial Intelligence. The Turing Test is a test of a machine’s ability to exhibit intelligent behaviour equivalent to, or indistinguishable from, that of a human. In the last couple of years, a number of papers have suggested that the task of Visual Question Answering (VQA) can be used as an alternative Turing Test. The task involves answering an open-ended question (or a series of questions) about an image.
A VQA system takes as input an image and a free-form, open-ended, natural language question about the image and produces a natural language answer as output. This goal-driven task applies to scenarios in which visually impaired users or intelligence analysts actively elicit visual information. Open-ended questions require a potentially vast set of AI capabilities to answer: fine-grained recognition (e.g. “What kind of cheese is on the pizza?”), object detection (e.g. “How many bikes are there?”), and others such as activity recognition and knowledge base reasoning.
This talk will benefit those who are interested in advanced applications of deep neural networks and who are looking forward to seeing implementations of the latest state-of-the-art models. The talk will also include a live demo of a Visual Question Answering model, and the code will be open-sourced and shared on GitHub. The open-source implementation will use Keras, a high-level neural network API written in Python that runs on top of TensorFlow, CNTK, or Theano.
In this talk, we will look at the Visual QA challenge and the dataset that came along with it. We will see different ways to model this problem using Recurrent Neural Networks (LSTMs, to be specific). Most of the code will be inspired by ICCV and NIPS papers. An important aspect of solving this problem is to have a system that can generate new answers. Here, however, the problem is treated as a classification task in which the top 1000 answers are chosen as classes. Each image is passed through the VGG-19 model, which yields a 4096-dimensional vector from its second-to-last layer; the question tokens are embedded as GloVe vectors and then passed through an LSTM model to encode the question sentence.
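The classification framing above can be illustrated with a small sketch of how answers might be turned into class labels. It uses only the Python standard library; the function names, the toy answer list, and the choice of "top" meaning "most frequent in the training set" are assumptions made for illustration, not details confirmed by the talk.

```python
from collections import Counter


def build_answer_classes(answers, top_k=1000):
    """Map the top_k most frequent answers to class indices (0 = most frequent)."""
    counts = Counter(answers)
    return {
        answer: idx
        for idx, (answer, _count) in enumerate(counts.most_common(top_k))
    }


def encode_answers(answers, answer_to_class):
    """Return a class index per answer, or None if the answer is not a class."""
    return [answer_to_class.get(answer) for answer in answers]


# Toy answers, made up for illustration (not taken from the VQA dataset).
train_answers = ["yes", "no", "yes", "2", "yes", "no", "red", "2", "no", "yes"]

# With top_k=3 the rarest answer ("red") is left without a class.
answer_to_class = build_answer_classes(train_answers, top_k=3)
labels = encode_answers(train_answers, answer_to_class)

print(answer_to_class)
print(labels)
```

One consequence of this design is visible in the output: an answer outside the chosen classes has no label, so a classifier trained this way can never produce it.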
We will cover the following:
- What deep neural networks are, including vanilla CNN and RNN models.
- Motivation: advances in convolutional and recurrent models.
- How these models are helping in current real-world applications.
- Description of VQA Dataset
- Deep dive into VGG models, GloVe vectors, LSTMs, and cost functions.
- Explanation of the code
Basic knowledge of deep learning, MLPs, CNNs, RNNs, and pre-trained models, and an interest in the latest applications of neural networks.
Laksh Arora is a Pythonista at heart with interests in applications of Machine Learning and Computer Vision. He completed his BCA in Computer Science at IPU and is a co-organiser of the PyDataDelhi meetups. He has previously given talks at the PyDataDelhi community, CSI, and various other small meetups. He has also spoken at the university level about the latest work in Machine Learning and has won multiple hackathons. He is currently a TA at Coding Blocks, where he teaches Machine Learning, and he collaborates with other enthusiasts around the globe on independent projects.