Building a Context-Aware Knowledge Graph Using Graph Analysis & Language Models
At EtherLabs, we are building a video platform that provides insights into the numerous live audio and video meetings that organizations conduct every day. To derive critical metrics such as important moments, topics discussed, and possible intents from textual data, basic NLP tasks like keyphrase extraction become particularly important. The extracted keyphrases can later be used for downstream tasks such as topic modelling, intent detection, recommendation, search, and building knowledge graphs. We employed a completely unsupervised, graph-based approach to identify important keyphrases from large amounts of textual data. To make the graph more aware of the context of the discussion (HR, engineering, marketing, etc.), we use language models, trained and fine-tuned on specific domains, to re-rank the keyphrases based on domain knowledge.
Keyphrase extraction is a well-defined and extensively researched task in the field of NLP. Approaches range from supervised methods (for example, classifiers trained on Bag of Words or TF-IDF features) and unsupervised methods (graph-based and clustering) to deep learning algorithms that combine both. Recent deep learning-based approaches have yielded high performance for extracting keywords, but they require large amounts of training data and time. Tools such as spaCy and Gensim also provide black-box methods for the same task.
Although many methods and solutions are available for extracting keywords, we chose a graph-based approach inspired by the well-known TextRank algorithm, which is itself based on PageRank. The key motivations for choosing this approach are:
- Text data carries important structural information. This information can be captured by word graphs, with words forming the nodes and their co-occurrences forming the edges or relations.
- Graph-based methods work well with noisy text data and need no training, so they impose no training-data constraints.
- An unsupervised method lets us obtain candidate keywords, which can be further filtered using other methods such as syntax rules, language models, and ML classifiers.
- Graph-based extraction enables us to visualize and interpret how keywords are identified. This level of explainability helps in further fine-tuning the task, which would have been hard to achieve with deep learning algorithms.
- Graph analysis on the word graphs enables other insights, such as community detection, which can be used to detect potential topics.
Topics covered:
- The concept behind building a word graph and computing keyword ranks using the PageRank algorithm.
- Using sentence embeddings from language models to bias the PageRank computation.
- Using graph analysis methods like Betweenness Centrality and the Louvain partitioning algorithm to detect topics (communities).
- Extending the word graph to a knowledge graph to capture other relations in the data.
- Exploring graph databases, Dgraph in particular, to persist graphs.
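As a rough illustration of the first two topics, the sketch below builds a word co-occurrence graph and ranks its nodes with a weighted PageRank, optionally biased toward chosen words. It is a minimal sketch and not EtherLabs' implementation: the function names `build_word_graph` and `pagerank`, the window size, the damping factor, and the hand-set `bias` dictionary (standing in for scores that would come from comparing sentence embeddings with a domain) are all assumptions made for the example, and the input is assumed to be already tokenized and filtered.

```python
from collections import defaultdict


def build_word_graph(tokens, window=3):
    """Build an undirected co-occurrence graph from a token list.

    Two distinct words are linked when they appear within `window`
    consecutive tokens; the edge weight counts how often that happens.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:  # skip self-loops
                graph[word][other] += 1.0
                graph[other][word] += 1.0
    return {node: dict(neighbours) for node, neighbours in graph.items()}


def pagerank(graph, bias=None, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration.

    `bias` (hypothetical input) maps words to non-negative weights and
    replaces the uniform teleport distribution, which pulls rank toward
    those words. With bias=None this is plain TextRank-style ranking.
    """
    nodes = list(graph)
    if not nodes:
        return {}
    if bias is None:
        bias = {node: 1.0 for node in nodes}
    total_bias = sum(bias.get(node, 0.0) for node in nodes)
    if total_bias <= 0:
        raise ValueError("bias must give positive weight to at least one node")
    teleport = {node: bias.get(node, 0.0) / total_bias for node in nodes}
    # Total edge weight per node, used to normalise outgoing rank.
    strength = {node: sum(graph[node].values()) for node in nodes}
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        rank = {
            node: (1.0 - damping) * teleport[node]
            + damping
            * sum(rank[nb] * w / strength[nb] for nb, w in graph[node].items())
            for node in nodes
        }
    return rank


# Toy, pre-filtered token stream (an assumption; real input would be
# meeting transcripts after tokenization and stop-word removal).
tokens = ["keyphrase", "extraction", "word", "graph", "pagerank",
          "graph", "ranking", "keyphrase", "graph"]
graph = build_word_graph(tokens)
uniform = pagerank(graph)
biased = pagerank(graph, bias={"ranking": 1.0})
for word in sorted(uniform, key=uniform.get, reverse=True):
    print(f"{word:12s} uniform={uniform[word]:.3f} biased={biased[word]:.3f}")
```

Because the graph is undirected and every node has at least one edge, the ranks remain a probability distribution after each iteration. In a fuller pipeline, the `bias` weights might be derived from the similarity between each candidate's sentence embedding and a domain representation, and a graph library could replace the hand-written iteration.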
Shashank is an AI/ML Engineer at EtherLabs, Bangalore. He has an MS degree in Computer Science (specialization in ML) from Delft University of Technology, Netherlands, and has over 4 years of research and technical experience in domains such as recommendation systems, healthcare, speech & multimedia technology, IoT, NLP, and HCI.