Anthill Inside 2018
On the current state of academic research, practice and development regarding Deep Learning and Artificial Intelligence.
25 Jul 2018, 08:45 AM – 05:25 PM IST
##About the conference and topics for submitting talks:
In 2016, The Fifth Elephant branched into a separate conference on Deep Learning. That Deep Learning conference has since grown into a large community under the brand Anthill Inside.
Anthill Inside features talks, panels and Off The Record (OTR) sessions on current research, technologies and developments around Artificial Intelligence (AI) and Deep Learning. Submit proposals for talks and workshops on the following topics:
##Perks for submitting proposals:
Submitting a proposal, especially with our process, is hard work. We appreciate your effort.
We offer each proposer one conference ticket at a discounted price, and a t-shirt.
We only accept one speaker per talk. This is non-negotiable. Workshops may have more than one instructor.
For proposals that list more than one collaborator, we offer the discounted ticket and t-shirt only to the person with whom the editorial team corresponded directly during the evaluation process.
##Target audience:
We invite beginner and advanced participants from:
to participate in Anthill Inside. At the 2018 edition, tracks will be curated separately for beginner and advanced audiences.
Developer evangelists from organizations which want developers to use their APIs and technologies for deep learning and AI should participate, speak and/or sponsor Anthill Inside.
##Format:
Anthill Inside is a two-day conference with two tracks on each day. Track details will be announced with a draft schedule in February 2018.
We are accepting sessions with the following formats:
##Selection criteria:
The first filter for a proposal is whether the technology or solution you are referring to is open source or not. The following criteria apply for closed source talks:
The criteria for selecting proposals, in the order of importance, are:
No one submits the perfect proposal in the first instance. We therefore encourage you to:
Our editorial team helps potential speakers hone their speaking skills, fine-tune and rehearse their content at least twice before the main conference, and sharpen the focus of their talks.
##How to submit a proposal (and increase your chances of getting selected):
The following guidelines will help you in submitting a proposal:
To summarize, we do not accept talks that gloss over details or try to deliver high-level knowledge without covering depth. Talks have to be backed with real insights and experiences for the content to be useful to participants.
##Passes and honorarium for speakers:
We pay an honorarium of Rs. 3,000 to each speaker and workshop instructor at the end of their talk/workshop. Confirmed speakers and instructors also get a pass to the conference and networking dinner. We do not provide free passes for speakers’ colleagues and spouses.
##Travel grants for outstation speakers:
Travel grants are available for international and domestic speakers. We evaluate each case on its merits, giving preference to women, people of non-binary gender, and Africans. If you require a grant, request it when you submit your proposal in the field where you add your location. Anthill Inside is funded through ticket purchases and sponsorships; travel grant budgets vary.
##Last date for submitting proposals is: 15 April 2018.
You must submit the following details along with your proposal, or within 10 days of submission:
##Contact details:
For information about the conference, sponsorships and tickets contact support@hasgeek.com or call 7676332020. For queries on talk submissions, write to anthillinside.editorial@hasgeek.com
Vikram Vij
@vikramvij
Submitted Apr 13, 2018
Bixby is an intelligent, personalized voice interface for your phone. It lets you seamlessly switch between voice and type/touch, and supports more than 75 domains (e.g. Camera, Gallery, Messages, WhatsApp, YouTube, Uber, etc.). It was launched in July 2017 for English and is now available in more than 200 countries, with about 8 million registered users.
My talk focuses on challenges in deep learning for Bixby’s Automatic Speech Recognition and Natural Language Understanding, ranging from CNNs vs. RNNs, word- vs. character-based models, domain classification challenges given the massive contextual input space, and grammar complexity, to multi-modal and multi-accent handling. We go into the details of hierarchical classification, session-based classification and intent rejection logic, as well as the tradeoffs between RNNs and CNNs, optimal filter sizes for CNNs, and handling variations and conflicts in the data. We also cover the use of transfer learning and bilingual models for Bixby for Hindi.
If you look at the processing steps of a voice engine, it typically works like this: the user speaks an utterance, for example “text to mom”. The NLU engine then tries to understand which domain the user is talking about and which command the user wants to execute, and the slot tagger extracts the parameters required for execution.
In a minimalistic view, Bixby accepts voice signals with its Automatic Speech Recognition (ASR) engine and passes the transcribed text to its Natural Language Understanding (NLU) engine. The NLU engine then extracts the information required for execution and sends it to devices or CP services.
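To make this flow concrete, here is a minimal, self-contained sketch of such a pipeline in Python. The component names and rule-based stubs are illustrative assumptions, not Bixby’s actual interfaces.

```python
# Minimal sketch of an ASR -> NLU -> execution flow. The stub classifiers and
# slot tagger below are placeholders standing in for the real models.
from dataclasses import dataclass

@dataclass
class NLUResult:
    domain: str   # e.g. "Messages"
    intent: str   # e.g. "send_message"
    slots: dict   # e.g. {"recipient": "mom"}

def classify_domain(text: str) -> str:
    # stand-in for the real domain classifier
    return "Messages" if "text" in text else "Unknown"

def classify_intent(text: str, domain: str) -> str:
    # stand-in for the real intent classifier
    return "send_message" if domain == "Messages" else "unknown"

def tag_slots(text: str, intent: str) -> dict:
    # stand-in for the real slot tagger
    return {"recipient": text.split()[-1]} if intent == "send_message" else {}

def understand(transcribed_text: str) -> NLUResult:
    """Take ASR output and produce the structure needed for execution."""
    domain = classify_domain(transcribed_text)
    intent = classify_intent(transcribed_text, domain)
    return NLUResult(domain, intent, tag_slots(transcribed_text, intent))

print(understand("text to mom"))
# NLUResult(domain='Messages', intent='send_message', slots={'recipient': 'mom'})
```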
Bixby Automatic Speech Recognition (ASR) was initially optimized for the US English accent only. In our testing, we found that it did not perform as well as expected. The root cause was that many people of Indian, Korean, Chinese and Spanish origin reside in the US, and the ASR did not work well for them. So we trained ASR models optimized for Indian, Korean, Chinese and Spanish accents of English, using transfer learning to save training time as well as computing resources. We then had to find a way to load the model that would work best for an individual’s voice. We incorporated an accent determination step at Bixby onboarding time, where the user is asked to speak five sentences. Word recognition accuracy is measured for all of these models, and we select the model using ASR performance as well as other cues such as keyboard language and contact information. Once determined, the accent selection is used as the default.
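A rough sketch of this selection step, assuming each accent-specific model is exposed as a callable transcriber and using a simple edit-distance word error rate; the real system also weighs the other cues mentioned above.

```python
# Sketch: score each accent-specific ASR model on the five onboarding
# sentences and pick the one with the lowest total word error rate (WER).

def wer(ref: str, hyp: str) -> float:
    """Word error rate via Levenshtein distance over word sequences."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / max(len(r), 1)

def pick_accent_model(models: dict, prompts: list, recordings: list) -> str:
    # models: accent name -> callable that transcribes an audio recording
    scores = {
        name: sum(wer(ref, transcribe(audio))
                  for ref, audio in zip(prompts, recordings))
        for name, transcribe in models.items()
    }
    return min(scores, key=scores.get)  # becomes the user's default accent model
```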
One big difference with Bixby is that we tried to build a multi-modal system that supports both touch and voice interfaces, so that a user can execute the same function with either touch or voice. We call this first version of Bixby “Bixby 1.0”.
Usually, voice assistants classify user utterances into commands without caring much about the screen state. In Bixby 1.0, we try to understand user utterances based on their screen context as well. So “find James” in the Contacts application should give you James’s contact information, and “find James” in the Gallery application should give you the images tagged as James.
To support that kind of multi-modality, we modeled application screens as contexts of the dialog management system; that is, we had to add context awareness to a traditional NLU to build a multi-modal NLU engine. The problem was that there were thousands of different screens that had to be modeled as different contexts. There was also another challenge in context awareness that comes from supporting many device types: the set of commands varies from device to device, because devices have delta functions depending on models and locales. So we needed to account for changes in the command set as well.
Now let’s look at the first challenge: the massive contextual input space. The input to the NLU engine is now not only the utterances for 6,000 commands, but also the context in which the user started talking. As shown on the previous slide, “find James” in the Gallery application should work differently from “find James” in the Contacts application. If we modeled this in the most naive way, we would maintain a command classifier per context. That would give the best performance, but the development cost is prohibitive: it means training and maintaining 2,000 classifiers. Instead, we have a hierarchical classifier in place: meta domain (for some domains), then domain, then intent, on top of a session-based architecture. Once we are inside a session, we go to intent classification directly, bypassing domain classification; if the intent classifier rejects the utterance, we fall back to the output of the domain classifier.
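A simplified sketch of this session-based routing with intent rejection; the classifier callables and the confidence threshold here are assumptions for illustration.

```python
# Sketch of hierarchical, session-based classification with a rejection
# fallback: inside a session we skip domain classification, and fall back to
# it only if the intent classifier rejects the utterance.
from typing import Callable, Dict, Optional, Tuple

Classifier = Callable[[str], Tuple[str, float]]  # returns (label, confidence)

REJECT_THRESHOLD = 0.5  # assumed confidence cut-off for intent rejection

def classify(utterance: str,
             session_domain: Optional[str],
             domain_clf: Classifier,
             intent_clfs: Dict[str, Classifier]) -> Tuple[str, str]:
    if session_domain is not None:
        # Inside a session: go straight to that domain's intent classifier.
        intent, conf = intent_clfs[session_domain](utterance)
        if conf >= REJECT_THRESHOLD:
            return session_domain, intent
        # Intent rejected: fall through to the domain classifier's output.
    domain, _ = domain_clf(utterance)
    intent, _ = intent_clfs[domain](utterance)
    return domain, intent
```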
RNN-based domain classification was designed as a word-based model. The model converges fast, but it had issues with unknown words and performed poorly for variations in the client state from which the utterance was generated. For these reasons, domain classification was moved to a character-based CNN model, which requires more data and increases build time.
A word-based model has the well-known problem of unknown words, whereas a character-based model has no unknowns. But a character-based model is not good at distinguishing between different words with similar spellings. For example, “search for s8 plus” goes to the Calculator domain because of the similar character sequence “8 plus” present in the Calculator domain’s data.
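A tiny illustration of the overlap behind this confusion, using plain character n-grams:

```python
# The character model sees shared character n-grams between a Gallery/Store
# style query and a Calculator style utterance.

def char_ngrams(text: str, n: int) -> set:
    return {text[i:i + n] for i in range(len(text) - n + 1)}

query = char_ngrams("search for s8 plus", 6)
calc = char_ngrams("8 plus 5", 6)
print(query & calc)  # {'8 plus'} pulls the query toward the Calculator domain
```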
For such a huge input space, there were extreme variations in the data, including lots of unknown words during the training phase. These unknowns hurt accuracy in many domains, which led us to experiment with CNNs as an alternative to RNNs.
There were also misclassifications for word inflections (when the word boundary goes beyond the representation). The CNN was a research candidate for countering the inflection problem we faced with the RNN. In the RNN, the state was not being learnt well: the sentence is represented in a vector space, but that space was too large for the word-based RNN to handle, and unknown words were not handled either. So we moved to CNNs for both domain and intent classification. For the slot tagger, we continued with the RNN.
Once the migration to CNN was done, the question of the optimal filter sizes for the CNN design arose. We experimented with different combinations of values of N for the N-gram CNN filters: typically, shorter values of N capture sub-word-level features, while larger values of N capture language structure. Various experiments were conducted to determine the filter sizes that achieve commercial-quality accuracy. We use multiple filters of various sizes (2x2, 4x4, 6x6, etc.), and another CNN layer produces the final output with a probabilistic score.
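As a generic sketch of the multi-filter idea (not the production model), a character-level CNN in PyTorch can combine several 1-D filter widths, approximating the N-gram filters described above; the vocabulary size, widths and dimensions here are assumptions.

```python
# A minimal character-level CNN classifier with multiple filter widths.
import torch
import torch.nn as nn

class CharCNNClassifier(nn.Module):
    def __init__(self, vocab_size=128, emb_dim=32, n_classes=50,
                 filter_widths=(2, 4, 6), n_filters=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # One Conv1d per n-gram width: small widths catch sub-word features,
        # larger widths capture longer-range structure.
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, kernel_size=w) for w in filter_widths
        )
        self.out = nn.Linear(n_filters * len(filter_widths), n_classes)

    def forward(self, char_ids):                  # (batch, seq_len)
        x = self.embed(char_ids).transpose(1, 2)  # (batch, emb_dim, seq_len)
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        logits = self.out(torch.cat(pooled, dim=1))
        return torch.softmax(logits, dim=1)       # probabilistic domain scores

# Usage: encode "search for s8 plus" as character ids and classify.
ids = torch.tensor([[min(ord(c), 127) for c in "search for s8 plus"]])
probs = CharCNNClassifier()(ids)                  # (1, n_classes)
```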
For such a huge input space there were extreme variations in the data, and at the same time there were similarities across the data. So we needed tools to help resolve such data conflicts. We used techniques such as tf-idf, cosine similarity and policy-based conflict concept words to deal with this problem.
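A small sketch of flagging near-duplicate utterances labelled under different domains with tf-idf and cosine similarity, using scikit-learn; the threshold and sample data are illustrative.

```python
# Flag possible conflicts: utterances from different domains that look too
# similar under a character n-gram tf-idf representation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

utterances = [
    ("Gallery",    "search for s8 plus photos"),
    ("Calculator", "what is 8 plus 5"),
    ("Store",      "search for s8 plus"),
]
texts = [u for _, u in utterances]
tfidf = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit_transform(texts)
sim = cosine_similarity(tfidf)

THRESHOLD = 0.6  # assumed; tune on real data
for i in range(len(utterances)):
    for j in range(i + 1, len(utterances)):
        if utterances[i][0] != utterances[j][0] and sim[i, j] > THRESHOLD:
            print("possible conflict:", utterances[i], "<->", utterances[j])
```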
As discussed earlier, we built the DNN classifier to take the context as input along with the utterance. Now we have just one classifier covering every context, but we still need to train this neural network with utterances under different contexts. For example, utterance A should map to command 1 under context alpha or beta, while utterance B should map to command 1 in context alpha and command 2 in context beta. If you maintain the training set like this it will serve the purpose, but training time and maintenance cost will still be prohibitive. So we needed a good sampling algorithm to pick the necessary training data: how well the sampling works ultimately determines the fluency of context understanding. Samsung is known for releasing many device models throughout the year, and with multi-modality, different device models have differences in UX. That is a challenge for Bixby in handling a wide variety of output spaces; the architecture here shows how we handle a variable output space.
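A minimal sketch of a single context-conditioned classifier, where an embedding of the screen context is concatenated with the utterance encoding so that one network covers all contexts; all dimensions are assumptions.

```python
# One classifier for all contexts: the same utterance can map to different
# commands depending on the embedded screen context.
import torch
import torch.nn as nn

class ContextAwareClassifier(nn.Module):
    def __init__(self, utt_dim=256, n_contexts=2000, ctx_dim=32, n_commands=6000):
        super().__init__()
        self.ctx_embed = nn.Embedding(n_contexts, ctx_dim)
        self.net = nn.Sequential(
            nn.Linear(utt_dim + ctx_dim, 512), nn.ReLU(),
            nn.Linear(512, n_commands),
        )

    def forward(self, utt_vec, ctx_id):
        # utt_vec: (batch, utt_dim) utterance encoding (e.g. from a CNN encoder)
        # ctx_id:  (batch,) id of the screen/context the user spoke from
        x = torch.cat([utt_vec, self.ctx_embed(ctx_id)], dim=1)
        return self.net(x)  # command logits, conditioned on the context
```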
We have evaluated our Bixby 1.0 architecture for its adaptability to other languages, taking Hindi as the language for our experiments.
In India, spoken Hindi is not strict Hindi; it is a mix of other languages, most often English. We used bilingual modeling to address this. We also experimented with a neural machine translation system to translate the input data from English to Hindi, which worked, and with transliteration, which also worked, but debugging and management were not good in either approach.
Dr. Vikram Vij has over 26 years of industry experience across multiple technical domains, from databases, storage and file systems, and embedded systems to intelligent services and IoT. He has worked at Samsung since 2004 and is currently Senior Vice President and Voice Intelligence R&D Team Head at the Samsung R&D Institute in Bangalore. His current focus is on building the world’s best voice intelligence experience for mobiles and other Samsung appliances. Dr. Vij received a Ph.D. and a Master’s degree in Computer Science from the University of California, Berkeley, an M.B.A. from Santa Clara University, and a B.Tech. in Electronics from IIT Kanpur.
https://www.slideshare.net/vinutharani1995/samsung-voice-intelligence-outline