Anthill Inside 2017

On theory and concepts in Machine Learning, Deep Learning and Artificial Intelligence. Formerly Deep Learning Conf.

Decoding Neural Image Captioning

Submitted by Sachin Kumar (@sachinkmr) on Saturday, 10 June 2017

Preview video

Technical level



Full talk



Total votes:  +36


Humans have been captioning images involuntarily for decades, and in the age of social media nearly every image carries a caption on one platform or another. Psychologically, these captions are shaped by the events and scenarios running through the writer's mind, or influenced by nearby activities and emotions, and sometimes they are far removed from the image's real context. Describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing.

This talk will cover some of the common deep learning architectures, latest state-of-the-art pre-trained models for image captioning, describe advantages and concerns, and provide hands-on experience.

This talk should benefit anyone interested in advanced applications of deep neural networks and in what can be achieved by combining different state-of-the-art models. We also aim to provide an open-source implementation in Keras, a high-level library for writing deep neural networks (DNNs) that runs on CPU and GPU with Theano, TensorFlow, or CNTK as its backend.


In this talk, I will present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on the COCO dataset show the accuracy of the model and the fluency of the language it learns solely from image descriptions.
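The training objective described above can be written out as a short sketch. The notation is an assumption and does not come from the proposal: I is a training image, S = (S_1, ..., S_N) is its target caption of N words, and θ stands for the model parameters. The second equation is just the chain rule applied to the caption, which is the factorization a recurrent decoder would typically model one word at a time.

```latex
\theta^{*} = \arg\max_{\theta} \sum_{(I, S)} \log p(S \mid I; \theta),
\qquad
\log p(S \mid I; \theta) = \sum_{t=1}^{N} \log p(S_t \mid I, S_1, \ldots, S_{t-1}; \theta)
```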

Learning such a model is also a useful way to measure how relevant an image and a caption are to each other. For a batch of images and captions, we can use the model to map them all into a shared embedding space, compute a distance metric, and find the nearest neighbors of each image and of each caption.
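The retrieval step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the tiny 3-dimensional vectors are made-up stand-ins for encoder outputs, cosine similarity is one possible choice of metric (the proposal does not name one), and the helper names `cosine_similarity` and `nearest_neighbors` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(queries, candidates, k=1):
    """For each query vector, return the indices of its k most similar candidates."""
    results = []
    for query in queries:
        # Rank candidate indices from most to least similar to this query.
        ranked = sorted(
            range(len(candidates)),
            key=lambda j: cosine_similarity(query, candidates[j]),
            reverse=True,
        )
        results.append(ranked[:k])
    return results


# Made-up embeddings (assumption): in practice these would come from the
# image encoder and the caption encoder, mapped into the same space.
image_embeddings = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9]]
caption_embeddings = [[0.1, 0.1, 1.0], [1.0, 0.0, 0.1], [0.5, 0.5, 0.5]]

# Image-to-caption and caption-to-image retrieval use the same routine.
image_to_caption = nearest_neighbors(image_embeddings, caption_embeddings, k=1)
caption_to_image = nearest_neighbors(caption_embeddings, image_embeddings, k=1)
```

With these toy vectors, the first image retrieves the second caption and the second image retrieves the first caption. For a real batch, the same ranking would normally be computed with a single matrix product rather than Python loops.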

I will cover the following:

  1. What is deep learning, CNN and RNN?
  2. Motivation: CNN and RNN real world applications, state-of-the-art results
  3. Internal structure of vanilla models
  4. Description of dataset and introduction to word2vec
  5. Deep dive into word embeddings, image encoder, caption encoder, cost function
  6. Impact of GPUs (Some practical thoughts on hardware and software)
  7. Explanation of code


Basic knowledge of deep learning, MLPs, backpropagation, CNNs, RNNs, and pre-trained models such as VGG, and a lot of enthusiasm.

Speaker bio

Sachin Kumar is currently a second-year undergraduate pursuing a Bachelor of Engineering in Information Technology at Netaji Subhash Institute of Technology (NSIT), New Delhi. He is also a Teaching Assistant for the Machine Learning course at Coding Blocks. His interests include Machine Learning, Artificial Intelligence, Deep Learning and Evolutionary Computing.




  • 1
    Sandhya Ramesh (@sandhyaramesh) Reviewer a year ago

    Hi Sachin, in order to evaluate your proposal, we need draft slides and a two-minute self-recorded video of you walking us through the slides. Thanks!
