The eighth edition of The Fifth Elephant will be held on 25 and 26 July at the NIMHANS Convention Centre in Bangalore. A thousand data scientists, ML engineers, data engineers and analysts will gather to discuss:
- Model management, including data cleaning, instrumentation and productionizing data science.
- Bad data and case studies of failure in building data products.
- Identifying and handling fraud, and data security at scale.
- Applications of data science in agriculture, media and marketing, supply chain, geo-location, SaaS and e-commerce.
- Feature engineering and ML platforms.
- What it takes to create data-driven cultures in organizations of different scales.
1. Meet Peter Wang, co-founder of Anaconda Inc, and learn about why data privacy is the first step towards robust data management; the journey of building Anaconda; and Anaconda in enterprise.
2. Talk to the Fulfillment and Supply Group (FSG) team from Flipkart, and learn about their work with platform engineering where ground truths are the source of data.
3. Attend tutorials on deep learning with RedisAI and on TransmogrifAI, Salesforce’s open source AutoML library.
4. Discuss interesting problems to solve with data science in agriculture, a SaaS perspective on multi-tenancy in machine learning (with the Freshworks team), and bias in intent classification and recommendations.
5. Meet data science, data engineering and product teams from sponsoring companies to understand how they are handling data and leveraging intelligence from data to solve interesting problems.
Why should you attend?
- Network with peers and practitioners from the data ecosystem
- Share approaches to solving expensive problems such as cleanliness of training data, model management and versioning data
- Demo your ideas in the demo session
- Join Birds of a Feather (BOF) sessions to have productive discussions on focussed topics. Or, start your own BOF session.
Full schedule published here: https://hasgeek.com/fifthelephant/2019/schedule
For more information about The Fifth Elephant, sponsorships, or anything else, call +91-7676332020 or email firstname.lastname@example.org
Shashank Jaiswal, Data scientist at Clustr
ADAM - Bootstrapping a deep NN-based sequence labeling model with minimal labeling
We will present answers to the following questions:
- Why would a generic approach not have worked?
- How did our data source and structure leave us with no previously adapted choices?
- Why was the project necessary to meet the end goals of the company?
- How did we tackle the problems we met along the way?
Company goals relevant to the ADAM project:
1. Universal product catalog
2. Aggregation and market analysis
3. Self-evolving knowledge graph
Raw dataset: introduction, structure, and the good, the bad and the ugly of the dataset.
Why ADAM (Automatic Detection and Annotation Module), a deep NN model: definition and use cases:
1. Enrichment of the knowledge graph
2. Analytics
Components of ADAM:
1. Smart automatic training data generation
2. State-of-the-art sequence tagging model
3. Active learning approach
Why this architecture was chosen:
1. Zero ground truth and no training data available whatsoever.
2. Multiple independent sources of data generation, and hence high variance.
3. Short and extremely noisy representations.
4. Prone to extreme human error (not bias, but error).
Details of the architecture and why each part was necessary:
1. Smart automatic training data generation:
  - How did we leverage the structure of the dataset (stock-item and stock-group)?
  - How did we use the existing knowledge base (CREGS)?
  - How did we improvise using information from other sources like Amazon and GS1?
2. State-of-the-art sequence tagging model:
  - Why did we create our own word embeddings, and how did that help us?
  - Why were BiLSTM and CRF used, and why are they state of the art?
  - Why was this specific architecture needed, and why wouldn’t anything else work?
  - What was the accuracy, and how well did the model perform?
3. Active learning:
  - Why active learning when we can generate labels automatically?
  - How did we design and integrate the manual annotation model ourselves?
  - How well did we reach maturity with the minimum number of manually labelled data points?
  - Why was extrinsic or intrinsic sampling used?
Conclusions we will showcase for The Fifth Elephant:
1. How to tackle the noisy data problem in the case of textual data.
2. Why a deep NN model plays an important role in generalisation.
3. Why active learning is an important concept for dealing with the problem of having no labelled data.
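The active learning step in the abstract above can be illustrated with least-confidence sampling, one common way to choose which items go to manual annotation. This is a minimal sketch under stated assumptions: the abstract does not say which sampling strategy ADAM uses, and `select_for_annotation`, the item ids and the probabilities are all hypothetical.

```python
from typing import Dict, List


def least_confidence(probabilities: List[float]) -> float:
    """Uncertainty score: 1 minus the probability of the most likely label."""
    return 1.0 - max(probabilities)


def select_for_annotation(predictions: Dict[str, List[float]], budget: int) -> List[str]:
    """Return the `budget` item ids whose predictions are least confident."""
    ranked = sorted(
        predictions,
        key=lambda item: least_confidence(predictions[item]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical per-item label probabilities produced by a tagging model.
predictions = {
    "item-a": [0.98, 0.01, 0.01],
    "item-b": [0.40, 0.35, 0.25],
    "item-c": [0.70, 0.20, 0.10],
}
selected = select_for_annotation(predictions, budget=2)
print(selected)
```

The items the model is least sure about are labelled first, so each manual label is spent where it is likely to change the model most.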
Venkateshan, Data Scientist at Logistics and Insight team at Flipkart
Solving the vehicle routing problem for optimizing shipment delivery
- Description of the context at the Flipkart delivery hub.
- Solution constraints: customer time window, maximum number of shipments per vehicle.
- Defining a non-standard cost function: total travel time, route outliers, compactness of routes.
- Formulating the problem as a variant of VRPTW (vehicle routing problem with time windows).
- Computational complexity: NP-hard (generalization of the Traveling Salesman Problem).
- Overview of some exact algorithms.
- Heuristics: (a) construction of routes and (b) route improvement.
- Description of our construction step and iterative computational procedure to improve solutions.
- Discussion of results.
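The two heuristic phases named in the outline, route construction and route improvement, can be sketched as follows. This is an illustration, not the speaker's method: it assumes Euclidean distance as the cost, a simple per-vehicle stop limit, a greedy nearest-neighbour construction and a 2-opt improvement, and it ignores time windows entirely. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def dist(a: Point, b: Point) -> float:
    return math.hypot(a[0] - b[0], a[1] - b[1])


def route_length(depot: Point, route: List[Point]) -> float:
    """Length of the tour depot -> stops in order -> depot."""
    path = [depot] + route + [depot]
    return sum(dist(path[i], path[i + 1]) for i in range(len(path) - 1))


def construct_routes(depot: Point, stops: List[Point], capacity: int) -> List[List[Point]]:
    """Construction: greedy nearest neighbour; start a new vehicle when one is full."""
    remaining = list(stops)
    routes = []
    while remaining:
        route, current = [], depot
        while remaining and len(route) < capacity:
            nxt = min(remaining, key=lambda p: dist(current, p))
            remaining.remove(nxt)
            route.append(nxt)
            current = nxt
        routes.append(route)
    return routes


def two_opt(depot: Point, route: List[Point]) -> List[Point]:
    """Improvement: keep reversing segments while that shortens the route."""
    best = list(route)
    improved = True
    while improved:
        improved = False
        for i in range(len(best) - 1):
            for j in range(i + 1, len(best)):
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                if route_length(depot, candidate) < route_length(depot, best) - 1e-9:
                    best, improved = candidate, True
    return best


depot = (0.0, 0.0)
stops = [(1, 0), (5, 5), (0, 1), (6, 5), (1, 1), (5, 6)]
initial = construct_routes(depot, stops, capacity=3)
routes = [two_opt(depot, r) for r in initial]
for r in routes:
    print(r, round(route_length(depot, r), 2))
```

A production formulation would replace `route_length` with the non-standard cost described in the talk and would reject candidate moves that violate customer time windows.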
Fasih Khatib, Data Scientist at Simpl
Ghostbusters: optimizing debt collections with survival models
TL;DR: This talk is about using survival models to optimize the process of making collection calls (“dear sir, please pay your bill, it’s overdue”).
Context:
- An overview of how the calling process is structured. This will give an understanding of what we’re trying to optimize.
- Why moving people from one level of the process to another automatically and optimally is important for recovering money.
- Why data-backed decisions are important for overall efficiency. Is it worth making 7 calls per user, or should you escalate after 4 calls?
- How using panel data for user behavior is significantly different from more standard classifiers, which use cross-sectional data.
A brief introduction to survival models:
- What survival models are, and where they are traditionally used.
- Basic terminology: survival function, hazard rate, censoring, etc.
- Non-traditional applications of survival models in fields like sales lead prioritization, marketing automation, etc.
How we use survival models:
- How math concepts are directly relevant to the business: a hazard function is directly useful as a lead score, while a survival function tells us who the ghosts are. Math => business decisions.
- Constructing hazard curves via parametric (Weibull) and non-parametric (Kaplan-Meier) methods and connecting them to our real data.
- Cox proportional hazards model.
- Data limitations force us to use censored models.
- Productionizing these models, and how to use this information to make better decisions.
- One model can solve many problems (escalation, lead scoring, write-off, etc.).
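To make the terminology above concrete, here is a minimal sketch of the Kaplan-Meier estimator of the survival function with right-censored observations. It assumes the "event" is a user paying, so survival means "has not paid yet"; the durations and flags are made-up illustration data, not figures from the talk.

```python
from typing import List, Tuple


def kaplan_meier(durations: List[float], observed: List[bool]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each time with at least one event.

    observed[i] is False when subject i is censored, i.e. the event had not
    happened by durations[i] and we stopped watching.
    """
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        # Everyone whose duration is at least t is still at risk just before t.
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, o in zip(durations, observed) if d == t and o)
        if events:
            survival *= 1.0 - events / at_risk
            curve.append((t, survival))
    return curve


# Hypothetical data: days until payment; False means still unpaid when last seen.
durations = [1, 2, 2, 3, 4]
observed = [True, True, False, True, False]
curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(t, round(s, 3))
```

Censored users still count in the at-risk set until they drop out of observation, which is why they cannot simply be discarded or treated as non-payers.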
Kumar Puspesh, CTO and co-founder at Moonfrog
10 steps to build your own data pipeline from day one of your startup
1. Be clear about requirements and constraints
  - Having a scalable system for data ingestion
  - Data design (specific or generic)
  - Querying interface: why stick to SQL?
2. Take time to design data
  - Walking through an example of generic table design
3. Sort out the data production part first
  - Identify all possible data producers (and understand requirements). In our case:
  - Android/iOS app
    - Cannot keep sending each event over the network
    - Cannot lose data even if the app crashes or is killed
    - Keep out of context from the application itself
  - Microservice(s)
    - Cannot keep sending each event over the network
    - Keep data collection agnostic of the microservice itself
4. Design v1.0 of the data pipeline
  - How and why we chose an “anti-pattern”
5. Choose/design the data warehouse
  - Data design in Redshift
  - Compression ON for certain columns
  - Tuning for scale
  - Taking care of querying patterns of product managers and data scientists
6. Open up: enable many data interfaces
  - On-demand data loading and querying: OnDemand table(s)
  - Flexibility for complicated analysis: ad hoc Redshift cluster(s)
7. Understand, tune and repeat
8. Optimize for usage
  - Added more columns at the generic level
  - More examples
9. Optimize for cost and ops
  - Retention policies of data
  - Not all events are of the same importance
  - But all events should be accessible if required
10. Upgrade to v2.0 of the data pipeline
Venkata Pingali, CEO and co-founder of Scribble Data
Anatomy of a production ML feature engineering platform
Rough outline:
Objectives of a feature engineering platform (5 mins)
- Reduce time to market
- Enhance robustness of models
- Enable explainability
Points of friction and required capabilities (20 mins)
- What is in my data? (catalog)
- Is my input data complete and correct? (health)
- How do I link existing side information? (augment/enrich)
- How do I capture tacit knowledge/signal? (labeling)
- How do I reliably prepare my training datasets? (pipelines)
- How do I check, audit and validate what has been computed? (audit)
- How do I discover what is being computed and used? (marketplace)
- How do I export and track exported discovered features for model dev? (search)
- How do I link the features to performance? (monitor)
- How do I reuse the features in the streaming path? (library)
Economics of feature engineering (5 mins)
- Feature computation is expensive, and each feature has a price
- Amortization happens over time and across models
- Process discipline is required
- Questions to ask:
  1. How many models will I have over time?
  2. How defensible should they be?
  3. How available should they be?
  4. How many features will they need?
Approaches to building one (5 mins)
- FEAST (GO-JEK; thought through but tied to GCP)
- Combine standalone components (OSS exists but incurs integration costs)
- Third party (move fast but incur platform costs)
Sherin Thomas, Senior software architect at Tensorwerk
Tutorial: Taking deep learning to production with RedisAI
2018 was the year of model servers. There were numerous initiatives to build reliable, interoperable deep learning deployment toolkits, but so far we don’t have an easy tool that can reliably handle deep learning models from all the frameworks. With the advent of Redis modules and the availability of C APIs for the major deep learning frameworks, it is now possible to turn Redis into a reliable runtime for deep learning workloads, providing a simple solution for a model serving microservice.
In this talk we will introduce RedisAI, a joint effort by [tensor]werk and RedisLabs that introduces tensors and graphs as new Redis data types and allows graphs to be executed over tensors using multiple backends (PyTorch, TensorFlow, and ONNXRuntime), on both CPU and GPU. The module also supports scripting with TorchScript, which provides a Python-like tensor language that can be used for pre- and post-processing operations, like input shaping or output ensembling. In addition, thanks to its support for the ONNX standard, including ONNX-ML, RedisAI is not strictly limited to deep learning, but also offers support for general machine learning algorithms.
We will demonstrate a full journey from training a model to deploying it to production in a highly available environment. Last, we will lay out the roadmap for the future, including automated batching, sharding, integration with Redis data types (e.g. streams) and advanced monitoring. The talk will include sample code, best practices and a live demo.
Peter Wang, Co-founder of Anaconda, Inc
The Anaconda journey: challenges faced in building an OSS business with data
- The early years
- Founding visions
- State of Python and SciPy in the early 2010s
- Python & “Big Data”
- Creation of Conda, Anaconda, PyData Community
- Technical initiatives
- Creating an OSS business
- Future challenges as we grow & scale
- Technical and community hurdles
Chris Stucchio, Head of data science at Simpl
The final stage of grief (about bad data) is acceptance
Over the course of my career I’ve gone through the many stages of grief: I’ve become angry at the poor quality of my data, I’ve attempted to bargain with engineering/PMs/etc. for better data, and I became depressed over the issue. Now I’ve reached the final stage: I accept that my data is bad. Given that my data is bad, I then attempt to model its badness, and use that model to correct for the biases introduced.
In this talk I’ll discuss how I approach bad data: I accept that I cannot fix it and instead try to model where it came from. This usually involves getting a more detailed grasp of the data generating process and writing down a formal model. In many cases this enables me to use the data model to correct and enhance my predictive model, as well as provide useful measurements and insights for improving and repairing the data collection process.
Agam Jain, Technology Architect at Zapr
Contracts, schema evolution and data pipelines
The flow of the talk:
- The need for a message bus in building a data processing pipeline.
- For the events generated in the message bus, the need for a contract for data control (with examples showing how we messed up and learnt from it).
- A more detailed explanation of what a contract is and how it can be implemented:
  - It starts with hierarchical modeling of data.
  - Relations between objects.
  - What tools are out there to store these complex relationships between entities.
- The gains from implementing contract control for any data that flows in the data pipeline:
  - From a business perspective: improving business logic, joining with other data sets.
  - From a technological perspective: schema extensibility of fields in data, predictability of development, back-dated processing, backward and forward compatibility.
  - Ability to break down the pipeline by responsibility, so teams can work on different components of the pipeline.
  - Implementing the above for multi-step data processing (enrichment).
- Additional advantages:
  - Cost
  - Data cleaning
  - Data consistency
  - Linear pipeline
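Backward and forward compatibility, mentioned in the outline above, can be checked mechanically once a contract is written down. The sketch below assumes a deliberately simplified contract format (a dict of field name to type and optional default) and simplified rules loosely modelled on how schema-based formats resolve reader and writer schemas. It is not the speaker's implementation, and the schemas `v1`, `v2` and `v3` are hypothetical.

```python
from typing import Any, Dict

# Field name -> {"type": ..., optional "default": ...}
Schema = Dict[str, Dict[str, Any]]


def can_read(reader: Schema, writer: Schema) -> bool:
    """True if data written with `writer` can be decoded using `reader`."""
    for name, spec in reader.items():
        if name in writer:
            if writer[name]["type"] != spec["type"]:
                return False  # same field, incompatible type
        elif "default" not in spec:
            return False  # reader needs a field the writer never produced
    return True


def backward_compatible(new: Schema, old: Schema) -> bool:
    """New code can still process data produced under the old contract."""
    return can_read(reader=new, writer=old)


def forward_compatible(new: Schema, old: Schema) -> bool:
    """Old code can process data produced under the new contract."""
    return can_read(reader=old, writer=new)


v1 = {"user_id": {"type": "string"}, "channel": {"type": "string"}}
v2 = dict(v1, device={"type": "string", "default": "unknown"})  # added with default
v3 = dict(v1, region={"type": "string"})  # added without default

print(backward_compatible(v2, v1), forward_compatible(v2, v1))
print(backward_compatible(v3, v1), forward_compatible(v3, v1))
```

Running such a check whenever a contract changes is one way to keep back-dated processing possible: old events remain readable by new pipeline stages.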
Analysing high throughput data in real-time
- Introduction
  - About Hotstar
- Stream processing @Hotstar
  - What is stream processing and why was it required
  - Problems that led to its use
    - Video player metrics
    - Social signals
    - User targeting
- Case study: video player metrics
  - What are the P1 metrics
  - How did we solve and compute them in real time
- Case study: social signals
  - What are the social signals
  - How did we solve engagement in real time
- Key takeaways
- Discussion
  - Why and when should we use stream processing
- Q&A
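A basic building block of computing metrics over a stream is the tumbling window: events are grouped into fixed, non-overlapping time buckets and aggregated per bucket. The sketch below is a generic illustration over an in-memory list, not Hotstar's system; the event names are hypothetical, and a real stream processor would also deal with late events, state storage and parallelism.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tumbling_window_counts(
    events: Iterable[Tuple[float, str]], window_seconds: int
) -> Dict[int, Dict[str, int]]:
    """Count events per type in fixed, non-overlapping time windows.

    Each event is (timestamp_in_seconds, event_type). The result maps each
    window's start time to its per-type counts.
    """
    windows = defaultdict(lambda: defaultdict(int))
    for timestamp, event_type in events:
        window_start = int(timestamp // window_seconds) * window_seconds
        windows[window_start][event_type] += 1
    return {start: dict(counts) for start, counts in windows.items()}


# Hypothetical player events: (seconds since start, event type).
events = [(0.5, "play"), (3.0, "buffer"), (9.9, "play"), (10.0, "play"), (12.1, "buffer")]
counts = tumbling_window_counts(events, window_seconds=10)
print(counts)
```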
Abishek Bhat, Member of data science team at Semantics3
Similarity search for product matching at Semantics3
Introduction [~ 5 mins]
This section will present an overview of the problem and the use cases that motivate it, and establish the tone for the rest of the presentation. Topics covered:
- Product matching: what is it and why is it important?
- Similarity search for product matching: what is it and how does it speed up matching?
- Example case for similarity search: a sample product document and a sample query document to explain the following sections.
Traditional text search approaches [~ 5 mins]
This section will cover our initial attempt at the similarity search problem using traditional text-based methods, largely leveraging Elasticsearch. Topics covered:
- Overview of how we set up the problem
- Bottlenecks we hit and available tuning options
- Examples of real queries
Lessons from traditional text search approaches [~ 5 mins]
This section will cover some of the key insights we gleaned from traditional text approaches and how we needed to reframe the problem. Topics covered:
- The nature of our data/problem and why Elasticsearch wasn’t a good fit
- Need for indexing multi-modal data
- Examples of failed cases
- Search is only as good as the document’s representation
Representation learning [~ 10 mins]
This section will cover how we reframed this as a representation learning problem, the different network architectures we tried, how we suited them to our needs, what worked and didn’t work, and the challenges we faced along the way. Topics covered:
- How we reframed the problem
- Different network architectures we tried and their results
- Examples of success cases which had failed previously
- Infrastructure and scaling challenges
Infrastructure challenges [~ 5 mins]
Solving the representation problem didn’t necessarily solve the similarity search problem. We only had a way to sufficiently represent all the product information in the vector space. This section will cover the infrastructure challenges, the options we considered and how we ended up choosing FAISS. Topics covered:
- Challenges and constraints
- Re-evaluating Elasticsearch
- Evaluating FAISS
- Key benchmarks
Conclusion [~ 2 mins]
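Once products are represented as vectors, similarity search reduces to finding the nearest vectors to a query. The sketch below is a brute-force cosine-similarity baseline in plain Python, shown only to make the idea concrete; it does not use FAISS or reflect Semantics3's setup, and the ids and vectors are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index: Dict[str, List[float]], query: List[float], k: int) -> List[Tuple[str, float]]:
    """Score every indexed vector against the query and keep the top k."""
    scored = [(product_id, cosine(vector, query)) for product_id, vector in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical product embeddings.
index = {
    "sku-1": [1.0, 0.0, 0.0],
    "sku-2": [0.9, 0.1, 0.0],
    "sku-3": [0.0, 1.0, 0.0],
}
results = search(index, query=[1.0, 0.0, 0.0], k=2)
print(results)
```

Brute force compares the query against every vector, so cost grows linearly with catalogue size; that scaling limit is what motivates dedicated vector indexes.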
Jacob Joseph, Data Scientist at CleverTap
Leveraging the power of analytics for MarTech
- Brief about CleverTap
- Current state of the industry
- Challenges faced by marketers on segmentation
- CleverTap’s solution for intelligent segmentation with machine learning
- Case studies showing the real impact of CleverTap’s solution
- Challenges faced by marketers on campaign content
- CleverTap’s solution for automating campaign content with a recommender engine
- Case studies showing the real impact of CleverTap’s recommender engine
Neha Kumari, Software developer with Recommendation team at Flipkart
Improving product discovery via hierarchical recommendations
- Introduction to the recommendation system at Flipkart
- Problem in hand
- Our journey towards recommending collections
- How hierarchical product taxonomies can be leveraged to solve the cold-start problem and improve product discovery
- Relevance algorithm @ scale
- Captivating findings and results
Sandeep Khurana, Data Scientist
Demystifying Social Network Analysis (SNA)
- Subject introduction and motivation
- Key concepts and terminology
- Network measures
- Tools and software used
- Analytical techniques
- Applications
  - Indian elections 2014
  - #MeToo movement
  - Indian elections 2019
  - Legal eagles on Twitter
Rajdeep Dua, Salesforce
Tutorial: Meet TransmogrifAI, Open Source AutoML powering Salesforce Einstein
- Introduction
- Need for multi-cloud and multi-tenant models
- Lessons learned while building the Einstein platform
- How traditional machine learning works
- Introducing TransmogrifAI
- Type hierarchy
- Automatic feature engineering across text, categorical, numerical and spatial features
- Handling label leakage
- Automatic model selection and hyperparameter tuning
- Models supported currently
- Demo
- Use cases being solved in production
- Summary
Pratik Sinha, Co-founder at Alt News
Technology to counter misinformation/disinformation
This is a call for the tech community to come together and create open source technology which can help fight misinformation/disinformation. The technology created will be useful not only in the Indian context but also in the global context.
Ayush Mittal, Lead data scientist at ShareChat
Sponsored talk: Feed generation at ShareChat
ShareChat is India’s largest vernacular social network platform, built to enable the next generation of India’s internet users. ShareChat is available in 14 vernacular languages. At ShareChat our data is fresh, with most users coming online for the first time; our primary goal is to serve the most relevant content to users at the appropriate time. In this talk we will discuss the new challenges these first-time internet users present. We will motivate the feed generation problem and give a walkthrough of the feed generation algorithm at ShareChat.
- Introduction to ShareChat
- Recommendation systems landscape: evolution of recommender systems from GroupLens to Netflix and the advent of collaborative filtering.
- Deep learning in recommender systems: deep learning algorithms in academia and industry which try to solve the recommendation problem at scale.
- Feed generation problem: what the feed generation problem is and how it differs from classic recommendation systems.
- Data challenges: challenges in designing feed generation for ShareChat and the unique insights that ShareChat’s data presents.
- ShareChat’s approach to solving feed generation
- Other problems at ShareChat
Shadab Siddiqui, Head of Information Security at Hotstar
Data security and startups: make the ends meet
I will cover the following in my talk:
- Data security and how it differs from application security/penetration testing.
- Ground realities of data security.
- Data security and how to implement it without compromising the organization’s growth.
- Why is data security needed when I have perimeter security, firewalls, intrusion detection systems, etc. in place?
- What are you protecting when you are enforcing data security?
- Technical solutions for implementing data security, and why this approach is better than instituting processes for protecting data.
- The big picture: GDPR compliance once data security is implemented.
- Security standards and compliance requirements for launching in countries where GDPR applies.
- How to structure/re-balance tech to be GDPR-ready.
- A high state of data security and GDPR, and its relationship with microservices.
- Metrics to track, and evaluating how your company is doing on data security parameters.
Ishita Mathur, Data scientist at GO-JEK
How GO-FOOD built a query semantics engine to help you find food faster
There are multiple components we built that can be grouped under the umbrella of Query Understanding. In this talk, I will briefly cover the following:
- Spell correction
- Intent classification
- Query expansion
- Knowledge graphs
- Autosuggest and autocomplete
Two of the most important components of the Query Understanding workflow are intent classification and query expansion. This talk will cover both in further detail and will go over the following topics with respect to the models we built:
- Finding the right data to train the models
- Choosing the right algorithm: Word2Vec versus Doc2Vec
- Available open-source libraries and implementations
- Building the end-to-end pipeline for model training and deployment
- Experimenting and iterating for continuous improvement
Peter Wang, Co-founder of Anaconda, Inc
Why is data privacy critical for robust data management?
- Data science is not just a job
  - Operationalization vs. exploration
  - Empiricism
  - Democratization, “citizen” data science
- The role and future of open source
  - Two types of OSS
  - Software isn’t just code
  - Crowdsourcing innovation
- What comes next?
  - Hardware innovation
  - Desegregating computing
  - The coming age of inference engines
Upendra Singh, Full stack data scientist at Clustr
How to build blazingly fast distributed computing like Apache Spark in-house?
In this talk we will answer the questions below, followed by a demo of the system we built.
- What is the business motivation to build a Spark-like (or better) distributed processing framework in-house?
- Why will distributed frameworks like Spark not work for us in the long run, and why do we need something else?
- What are the basic design layers, data structures and algorithms required to build such a system?
- What are the benchmark results, and how does it work better than Spark for us?
- Demo run of the framework.
Avinash Ramakanth, Tech lead at InMobi
A journey through Cosmos to understand users
The topics we will cover in this talk:
1. Introduction: a brief business context to appreciate the need to solve this problem, and the challenges involved.
2. The factors driving the decision to choose Cosmos DB as our backend store.
3. Key insights into what drives the cost of the store, and various gotchas involved when designing such a system.
4. How to optimize the cost and bring intelligence to enable auto-scalability.
5. The need for building multi-version concurrency control, and how to achieve it to enable parallel writes with multiple schema versions for the same record.
6. The tradeoff between readability and storage cost, and how to get the best of both worlds by building an Avro library to enable in-flight abbreviated compression.
Karnam Vasudeva Rao, Senior data scientist at Bayer
How we built an ML model to predict proteins for insecticidal activity
1. What are insecticidal proteins?
2. Why machine learning for protein activity identification?
3. Different approaches used by researchers
4. Why not traditional methods?
5. iFeature, a Python toolkit
  - a. Why did we choose iFeature?
  - b. What features does iFeature have?
  - c. How did we adapt it to our needs?
  - d. What were the challenges?
  - e. How did we overcome them?
6. Key learnings
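To give a sense of what a sequence-derived feature looks like, the sketch below computes amino acid composition, one of the simplest numerical descriptors of a protein sequence. It is written from scratch and does not call iFeature; the function name and the example sequence are hypothetical, and the abstract does not say which descriptors the team used.

```python
from collections import Counter
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def amino_acid_composition(sequence: str) -> Dict[str, float]:
    """Fraction of each standard amino acid among the standard residues."""
    sequence = sequence.upper()
    counts = Counter(residue for residue in sequence if residue in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


# Hypothetical short sequence, used only to show the output shape.
features = amino_acid_composition("MKKLA")
print({aa: round(v, 2) for aa, v in features.items() if v > 0})
```

Each protein becomes a fixed-length vector of 20 numbers regardless of sequence length, which is what lets standard classifiers consume it.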