RNNs for multimodal information fusion
Data generated from real-world events are usually temporal and multimodal, spanning audio, video, depth, and other sensor streams, and these modalities must be combined intelligently for classification tasks. I will discuss a novel, generalized deep neural network architecture for combining temporal streams from multiple modalities. The hybrid Recurrent Neural Network (RNN) exploits the complementary nature of multimodal temporal information by allowing the network to learn both modality-specific temporal dynamics and the dynamics in a multimodal feature space. The efficacy of the proposed model is demonstrated on multiple datasets.
Deep Learning overview, RNN overview, proposed model, performance and comparisons
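
To make the idea in the abstract concrete, here is a rough sketch of a two-stage recurrent fusion network: one RNN per modality learns modality-specific dynamics, and a second RNN runs over their combined hidden states to model dynamics in a shared multimodal feature space. This is an illustration under stated assumptions, not the model presented in the talk. The abstract does not say how the streams are fused, so the choices here are all assumptions: concatenation of hidden states, vanilla tanh cells, time-aligned streams of equal length, and classification from the last fused state. The function names and dimensions are hypothetical, and the weights are random and untrained.

```python
import numpy as np


def init_rnn(rng, input_dim, hidden_dim, scale=0.1):
    """Random weights for a vanilla (Elman) RNN cell."""
    return {
        "W_x": rng.normal(0.0, scale, (input_dim, hidden_dim)),
        "W_h": rng.normal(0.0, scale, (hidden_dim, hidden_dim)),
        "b": np.zeros(hidden_dim),
    }


def run_rnn(params, xs):
    """Run the cell over xs of shape (T, input_dim); return states of shape (T, hidden_dim)."""
    h = np.zeros(params["b"].shape[0])
    states = []
    for x in xs:
        h = np.tanh(x @ params["W_x"] + h @ params["W_h"] + params["b"])
        states.append(h)
    return np.stack(states)


def init_model(rng, input_dims, hidden_dim, fusion_dim, num_classes):
    """One RNN per modality, one fusion RNN, and a linear classifier (all hypothetical sizes)."""
    return {
        "modality_rnns": [init_rnn(rng, d, hidden_dim) for d in input_dims],
        "fusion_rnn": init_rnn(rng, hidden_dim * len(input_dims), fusion_dim),
        "W_out": rng.normal(0.0, 0.1, (fusion_dim, num_classes)),
        "b_out": np.zeros(num_classes),
    }


def hybrid_forward(model, streams):
    """Return class probabilities for a list of time-aligned streams, each (T, input_dim_i)."""
    # Stage 1: modality-specific temporal dynamics.
    per_modality = [run_rnn(p, xs) for p, xs in zip(model["modality_rnns"], streams)]
    # Assumed fusion step: concatenate hidden states at each time step.
    fused_input = np.concatenate(per_modality, axis=1)
    # Stage 2: temporal dynamics in the joint multimodal feature space.
    fused_states = run_rnn(model["fusion_rnn"], fused_input)
    # Classify from the last fused state with a numerically stable softmax.
    logits = fused_states[-1] @ model["W_out"] + model["b_out"]
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


rng = np.random.default_rng(0)
audio = rng.normal(size=(20, 13))  # 20 time steps of made-up 13-dim audio features
video = rng.normal(size=(20, 64))  # 20 time steps of made-up 64-dim visual features
model = init_model(rng, input_dims=[13, 64], hidden_dim=16, fusion_dim=32, num_classes=3)
probs = hybrid_forward(model, [audio, video])
print(probs.shape, probs.sum())
```

In practice the vanilla cells would likely be replaced by gated units such as LSTMs or GRUs and the whole network trained end to end, but the two-stage structure is the part this sketch is meant to show.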