Last date for submitting proposals is 15 June.
About the conference and topics for submitting talks:
The Fifth Elephant is rated as India’s best data conference. It is a conference for practitioners, by practitioners. In 2019, The Fifth Elephant will complete its eighth edition.
The Fifth Elephant is an evolving community of stakeholders invested in data in India. Our goal is to strengthen and grow this community by presenting talks, panels and Off The Record (OTR) sessions that present real insights about:
1. Data engineering and architecture: tools, frameworks, infrastructure, architecture, case studies and scaling.
2. Data science and machine learning: fundamentals, algorithms, streaming, tools, domain specific and data specific examples, case studies.
3. The journey and challenges in building data driven products: design, data insights, visualisation, culture, security, governance and case studies.
4. Talks around an emerging domain: such as IoT, finance, e-commerce, payments or data in government.
You should attend and speak at The Fifth Elephant if your work involves:
- Engineering and architecting data pipelines.
- Building ML models, pipelines and architectures.
- ML engineering.
- Analyzing data to build features for existing products.
- Using data to predict outcomes.
- Using data to create / model visualizations.
- Building products with data – either as product managers or as decision scientists.
- Researching concepts and deciding on algorithms for analyzing datasets.
- Mining data with greater speed and efficiency.
- Developer evangelists from organizations which want developers to use their APIs and technologies for machine learning, full stack engineering, and data science.
Perks for submitting proposals:
Submitting a proposal, especially with our process, is hard work. We appreciate your effort.
We offer one conference ticket at a discounted price to each proposer.
We only accept one speaker per talk. This is non-negotiable. Workshops may have more than one instructor.
In case of proposals where more than one person has been mentioned as a collaborator, we offer the discounted ticket and t-shirt only to the person with whom the editorial team corresponded directly during the evaluation process.
The Fifth Elephant is a two-day conference with two tracks on each day.
We are accepting sessions with the following formats:
- Full talks of 40 minutes.
- Crisp talks of 20 minutes.
- Off the Record (OTR) sessions on focussed topics / questions. An OTR is 60-90 minutes long and typically has up to four facilitators and one moderator.
- Workshops and tutorials of 3-6 hours duration on Machine Learning concepts and tools, full stack data engineering, and data science concepts and tools.
- Pre-events. Birds Of Feather (BOF) sessions, talks, and workshops for open houses and pre-events in Bangalore and other cities between October 2018 and the conference in 2019. Reach out to firstname.lastname@example.org should you be interested in speaking and/or hosting a community event between now and the conference in 2019.
The first filter for a proposal is whether the technology or solution you are referring to is open source or not. The following criteria apply for closed source talks:
- If the technology or solution is proprietary, and you want to speak about your proprietary solution to make a pitch to the audience, you should pick up a sponsored session. This involves paying for the speaking slot. Write to email@example.com
- If the technology or solution is in the process of being open sourced, we will consider the talk only if the solution is open sourced at least three months before the conference.
- If your solution is closed source, you should consider proposing a talk explaining why you built it in the first place; what options did you consider (business-wise and technology-wise) before making the decision to develop the solution; or, what is your specific use case that left you without existing options and necessitated creating the in-house solution.
The criteria for selecting proposals, in the order of importance, are:
- Key insight or takeaway: what can you share with participants that will help them in their work and in thinking about the ML, big data and data science problem space?
- Structure of the talk and flow of content: a detailed outline – either as mindmap or draft slides or textual description – will help us understand the focus of the talk, and the clarity of your thought process.
- Ability to communicate succinctly, and how you engage with the audience. You must submit a link to a two-minute preview video explaining what your talk is about, and what the key takeaway for the audience is.
No one submits the perfect proposal in the first instance. We therefore encourage you to:
- Submit your proposal early so that we have more time to iterate if the proposal has potential.
- Talk to us on our community Slack channel: https://friends.hasgeek.com if you want to discuss an idea for your proposal, and need help / advice on how to structure it. Head over to the link to request an invite and join #fifthel.
Our editorial team helps potential speakers in honing their speaking skills, fine tuning and rehearsing content at least twice - before the main conference - and sharpening the focus of talks.
How to submit a proposal (and increase your chances of getting selected):
The following guidelines will help you in submitting a proposal:
- Focus on why, not how. Explain to participants why you made a business or engineering decision, or why you chose a particular approach to solving your problem.
- The journey is more important than the solution you may want to explain. We are interested in the journey, not the outcome alone. Share as much detail as possible about how you solved the problem. Glossing over details does not help participants grasp real insights.
- Focus on what participants from other domains can learn/abstract from your journey / solution. Refer to these talks from The Fifth Elephant 2017, which participants liked most: http://hsgk.in/2uvYKI9 and http://hsgk.in/2ufhbWb
- We do not accept how-to talks unless they demonstrate the latest technology. If you are demonstrating new tech, show enough to motivate participants to explore the technology later. Refer to talks such as these: http://hsgk.in/2vDpag4 and http://hsgk.in/2varOqt to structure your proposal.
- Similarly, we don’t accept talks on topics that have already been covered in the previous editions. If you are unsure about whether your proposal falls in this category, drop an email to: firstname.lastname@example.org
- Content that can be read off the internet does not interest us. Our participants are keen to listen to use cases and experience stories that will help them in their practice.
To summarize, we do not accept talks that gloss over details or try to deliver high-level knowledge without covering depth. Talks have to be backed with real insights and experiences for the content to be useful to participants.
Passes and honorarium for speakers:
We pay an honorarium of Rs. 3,000 to each speaker and workshop instructor at the end of their talk/workshop. Confirmed speakers and instructors also get a pass to the conference and networking dinner. We do not provide free passes for speakers’ colleagues and spouses.
Travel grants for outstation speakers:
Travel grants are available for international and domestic speakers. We evaluate each case on its merits, giving preference to women, people of non-binary gender, and Africans. If you require a grant, request it when you submit your proposal in the field where you add your location. The Fifth Elephant is funded through ticket purchases and sponsorships; travel grant budgets vary.
You must submit the following details along with your proposal, or within 10 days of submission:
- Draft slides, mind map or a textual description detailing the structure and content of your talk.
- Link to a self-recorded, two-minute preview video, where you explain what your talk is about, and the key takeaways for participants. This preview video helps conference editors understand the lucidity of your thoughts and how invested you are in presenting insights beyond the solution you have built, or your use case. Please note that the preview video should be submitted irrespective of whether you have spoken at past editions of The Fifth Elephant.
- If you submit a workshop proposal, you must specify the target audience for your workshop; duration; number of participants you can accommodate; pre-requisites for the workshop; link to GitHub repositories and a document showing the full workshop plan.
For more information about the conference, sponsorships, or any other information contact email@example.com or call 7676332020.
Sponsor for developer evangelism, community outreach, networking with IT managers and decision-makers, and hiring.
Download our sponsorship deck or write to us for customised options. Email firstname.lastname@example.org
The Fifth Elephant 2019 sponsors:
"It works on the training data" is the new "It works on my machine": why data science is terrible engineering
Note: slides I’ve submitted are from a previous talk which I wasn’t super happy with. Will be more or less rewritten.
Talk is about autonomous organizations - organizations making as many decisions as possible without human intervention.
- Simpl is a case study; if data science team all dies, credit underwriting/fraud checks/etc should continue.
- Automated trading strategies are another example (bots playing the shares markets)
Discussion of what good engineering is:
- Loosely coupled (web frontend writes to Kafka, different backend systems read from Kafka, backend crashes don’t affect frontend)
- Modular (web + backend systems are well separated)
- Single responsibility principle (each system does one well known thing)
Data science isn’t good engineering:
- Everything is connected. A system affecting marketing alters behavior of users downstream.
- A model is a giant mess of spaghetti code. Edge cases on top of edge cases.
- Data and relationships change over time. Data often goes bad.
- Biggest problems are conceptual bugs, not coding bugs - e.g., leaking future data into the present, or allowing multiple comparisons into your testing procedure.
- Data collection management is key. If I have a team member picking up a project in March, I need to start collecting data in Jan. Time is precious, don’t waste it!
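One of the conceptual bugs named above, leaking future data into the present, can be illustrated with a small sketch. This is not taken from the talk: the `rows` records, the `day` field and the `time_ordered_split` helper are hypothetical, and the only point is that the training set should contain nothing dated on or after the cutoff used for evaluation.

```python
from datetime import date

def time_ordered_split(rows, cutoff):
    """Split rows so that every training row is dated strictly before the cutoff."""
    train = [r for r in rows if r["day"] < cutoff]
    test = [r for r in rows if r["day"] >= cutoff]
    return train, test

# Hypothetical records; only the "day" field matters for the split.
rows = [
    {"day": date(2019, 1, 5), "amount": 120},
    {"day": date(2019, 2, 10), "amount": 80},
    {"day": date(2019, 3, 1), "amount": 200},
    {"day": date(2019, 3, 20), "amount": 50},
]

train, test = time_ordered_split(rows, cutoff=date(2019, 3, 1))
print(len(train), "training rows,", len(test), "test rows")
```

A random shuffle before splitting would mix March rows into the training set, which is the kind of leak the bullet describes.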
Data science interacts with the world in which it lives:
- Selection effects: a model trained on professional basketball players finds no relation between height and player quality. Should you ignore player height when picking a team?
- Adversarial effects: if you lock down a fraud vector then fraudsters stop attacking it. Data says “no fraud attempts, safe to unlock.”
- Market adjustments: if you’re interacting with a market you will often transmit information to it, and it will adjust to you.
What you can do about this mess:
- Badly formatted data can be fixed with enough code. Missing data is gone. Always focus on collection.
- Data quality monitoring - if you have trouble parsing any data, pagerduty should blow up.
- Alerts about relationships - if your predictive model assumes a correlation between X and Y, raise an alert if that correlation vanishes.
- Amateurs focus on which model to use. Professionals think about how to backtest the model.
- Continuous monitoring - how quickly can you know that your model stopped making money?
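The "alerts about relationships" idea in the list above can be sketched roughly as follows. This is an illustration under assumptions, not the speaker's implementation: the `correlation_alert` helper and the 0.3 threshold are made up for the example, and a production system would send the message to a paging tool rather than return a string.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def correlation_alert(xs, ys, min_abs_corr=0.3):
    """Return an alert message if |correlation| is below the assumed threshold, else None."""
    r = pearson(xs, ys)
    if abs(r) < min_abs_corr:
        return f"ALERT: correlation between X and Y is {r:.2f}"
    return None

# The relationship the model relies on still holds here: no alert.
healthy = correlation_alert([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
# Here the relationship has vanished: an alert is produced.
broken = correlation_alert([1, 2, 3, 4, 5], [3, 1, 4, 1, 3])
```

Running such a check on a recent window of data, on a schedule, is one way to learn early that an assumption behind the model no longer holds.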
10 Steps to Build-Your-Own Data Pipeline - for Day 1 of your startup
- Be clear about Requirements and Constraints
- Having a scalable system for data ingestion
- Data design (Specific or Generic)
- Querying interface - why stick to SQL?
- Take time to Design Data
- Walking through example of generic table design
- Sort out Data production part first
- Identify all possible data producers (and understand requirements). In our case -
- Android/iOS app
- Cannot keep sending each event over network
- Cannot lose data even if app crashes or is killed
- Keep out of context from the application itself
- Cannot keep sending each event over network
- Keep data collection agnostic of microservice itself
- Design v1.0 of Data pipeline
- How and why we chose “anti-pattern”
- Choose/Design Data warehouse
- Data design in Redshift
- Compression ON for certain columns
- Tuning for scale
- Taking care of Querying patterns of Product Managers and Data scientists
- Open up: Enable many Data Interfaces
- On demand Data loading and querying: OnDemand Table(s)
- Flexibility for complicated analysis: Adhoc redshift cluster(s)
- Understand, Tune & Repeat
- Optimize for Usage
- Added more columns at generic level e.g.
- More examples
- Optimize for Cost & Ops
- Retention policies of data
- Not all events are of same importance
- But all events should be accessible if required
- Upgrade to v2.0 of Data pipeline
- Be clear of Requirements and Constraints
[Panel] Data driven culture in the startup ecosystem
A journey through Cosmos to understand users.
The topics we will be covering in this talk:
1. Introduction - Briefly provide business context to appreciate the need to solve this problem, and challenges involved.
2. The factors driving the decision to choose Cosmos DB as our backend store.
3. Key insights into what drives cost of the store, and various gotchas involved when designing such a system.
4. How to optimize the cost and bring intelligence to enable auto-scalability.
5. The need for building a multi version concurrency control and how to achieve it to enable parallel writes with multiple schema versions for the same record.
6. The tradeoff between readability and storage cost, and how to get the best of both worlds by building an Avro library to enable in-flight abbreviated compression.
How to build blazingly fast distributed computing like Apache Spark In-house?
In this talk we will answer the questions below, followed by a demo of the system we built.
What is the business motivation to build a Spark-like (or better) distributed processing framework in-house?
Why will distributed frameworks like Spark not work for us in the long run, and why do we need something else?
What are the basic design layers, data structures and algorithms required to build such a system?
What are the benchmark results, and how does it work better than Spark for us?
Demo run of the framework.
[Tutorial] Meet TransmogrifAI, Open Source AutoML powering Salesforce Einstein
- Need of Multicloud and multi tenant models
- Lessons learned while building Einstein platform
- How traditional machine learning works
- Introducing TransmogrifAI
- Type Hierarchy
- Automatic Feature Engineering across text, categorical, numerical, spatial features
- Handling label leakage
- Automatic Model Selection and hyperparameter tuning
- Models supported currently
- Use cases being solved in production
Technology to counter misinformation/disinformation
This is a call for the tech community to come together and create open source technology which can help fight misinformation/disinformation. The technology created will be useful not only in the Indian context but also in the global context.
[Birds of a Feather] Incubation to Production : Building Data Products for ever changing business @Flipkart
Will cover a few case studies covering how we adapted to important business milestones:
1. Mobile App traffic surpassing Desktop
2. Scaling from a single seller (WS Retail) to tens of thousands, and from a handful of categories to thousands.
3. Launch of new private label business line.
Building Robust, Reliable Data Pipelines
The flow would look like this
- The Need for a Message Bus in building a data processing pipeline
- For the events generated in the Message Bus, the need for a contract for data control (with examples showing how we messed up and learnt from it).
- explain in more detail of what a contract is
- how it can be implemented
- starts with hierarchical modeling of data. relations between objects
- what tools are out there to store these complex relationships between entities
Discuss the gains from implementing contract control for any data that flows in the data pipeline
- from a business perspective of improving business logic, joining with other data sets
- from a technological perspective:
- Schema extensibility of fields in data,
- predictability of development,
- back dated processing - backward and forward compatibility
- Able to break down the pipeline by responsibility - teams can work on different components of the pipeline
- Implementing the above for multi-step data processing (enrichment)
Additional advantages:
- Cost wise
- Data cleaning
- Data consistency
- Linear pipeline
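The contract idea outlined above can be sketched in a few lines. This is a minimal illustration, not the speakers' design: the contract format (field name mapped to a type and a required flag), the `ORDER_V1`/`ORDER_V2` contracts and both helper functions are hypothetical, and the compatibility rule is a deliberately simplified one.

```python
# Hypothetical contract format: field name -> (expected type, required?)
ORDER_V1 = {"order_id": (str, True), "amount": (float, True)}
ORDER_V2 = {"order_id": (str, True), "amount": (float, True), "coupon": (str, False)}

def validate(event, contract):
    """Return a list of contract violations for one event (empty list means valid)."""
    errors = []
    for field, (ftype, required) in contract.items():
        if field not in event:
            if required:
                errors.append(f"missing required field: {field}")
        elif not isinstance(event[field], ftype):
            errors.append(f"wrong type for {field}: expected {ftype.__name__}")
    return errors

def is_backward_compatible(old, new):
    """Simplified rule: keep every old field and its type, never tighten an
    optional field to required, and only add fields that are optional."""
    for field, (ftype, required) in old.items():
        if field not in new:
            return False
        new_type, new_required = new[field]
        if new_type is not ftype or (new_required and not required):
            return False
    return all(not req for name, (_, req) in new.items() if name not in old)

print(validate({"order_id": "A1"}, ORDER_V1))
print(is_backward_compatible(ORDER_V1, ORDER_V2))
```

Checking compatibility before a new contract version is published is what makes back-dated processing predictable: events written under the old version still pass validation under the new one.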
Formulation and solution of vehicle routing problem for optimizing shipment delivery routes
- Description of the context at the Flipkart delivery hub.
- Solution constraints - customer time window, maximum number of shipments per vehicle.
- Defining a non-standard cost function - total travel time, route outliers, compactness of routes.
- Formulating the problem as a variant of VRPTW (vehicle routing problem with time windows).
- Computational complexity: NP-hard (generalization of Traveling Salesman Problem)
- Overview of some exact algorithms.
- Heuristics - (a) construction of routes and (b) route improvement
- Description of our construction step and iterative computational procedure to improve solutions.
- Discussion of results
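The two heuristic phases mentioned in the outline, route construction followed by route improvement, can be illustrated with a toy example. This sketch is not the method described in the talk: it swaps in a nearest-neighbour construction and a 2-opt improvement on a single vehicle, uses plain Euclidean distance as the cost, and ignores time windows and shipment limits. The coordinates and function names are made up.

```python
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def route_length(route):
    return sum(dist(route[i], route[i + 1]) for i in range(len(route) - 1))

def construct_route(hub, stops):
    """Construction step: always visit the nearest unvisited stop, then return to the hub."""
    route, remaining = [hub], list(stops)
    while remaining:
        nxt = min(remaining, key=lambda s: dist(route[-1], s))
        remaining.remove(nxt)
        route.append(nxt)
    return route + [hub]

def improve_route(route):
    """Improvement step (2-opt): reverse a segment whenever that shortens the route."""
    best = route[:]
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 2):
            for j in range(i + 1, len(best) - 1):
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                if route_length(candidate) < route_length(best) - 1e-9:
                    best, improved = candidate, True
    return best

hub = (0, 0)
stops = [(2, 0), (0, 2), (2, 2), (1, 3)]
initial = construct_route(hub, stops)
final = improve_route(initial)
print(round(route_length(initial), 2), "->", round(route_length(final), 2))
```

A non-standard cost function such as the one in the outline would replace `route_length`, and constraints such as time windows would be checked before a candidate route is accepted.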
Anatomy of a Production ML Feature Engineering Platform
Objectives of a feature engineering platform (5 mins)
- Reduce time to market
- Enhance robustness of models
- Enable explainability
Points of friction & required capabilities (20 mins)
- What is in my data? (catalog)
- Is my input data complete and correct? (health)
- How do I link existing side information? (augment/enrich)
- How do I capture tacit knowledge/signal? (labeling)
- How do I reliably prepare my training datasets? (pipelines)
- How do I check, audit & validate what has been computed? (audit)
- How do I discover what is being computed and used? (marketplace)
- How do I export and track exported discovered features for model dev (search)
- How do I link the features to performance? (monitor)
- How do I reuse the features in the streaming path? (library)
Economics of Feature Engineering (5 mins)
- Feature computation is expensive, and each feature has a price
- Amortization happens over time & across models
- Process discipline required
- Questions to ask:
1. How many models will I have over time?
2. How defensible should they be?
3. How available should they be?
4. How many features will they need?
Approaches to building one (5 mins)
- FEAST (Go-JEK; Thought through but tied to GCP)
- Combine standalone components (OSS exists but incur integration costs)
- Third-party (Move fast but incur platform costs)
ADAM - Bootstrapping a Deep NN based Sequence Labeling Model with minimal labeling
We would be presenting answers to the following…
Why just any generic approach would not have worked?
How our data source and structure left us with no previously adapted choices?
Why the project was necessary to meet the end goals of the company?
And how did we tackle a number of problems on the way?
Following are the primary product goals of the company which are relevant to the ADAM project:
1)Universal Product Catalog
2)Aggregation and Market Analysis
3)Self evolving Knowledge Graph
Introduction, Structure and The good, bad and the ugly of the data-set.
WHY ADAM (Automatic Detection and Annotation Module)… A deep NN model:
1) Enrichment of Knowledge graph
2) For Analytics
Components of ADAM:
1)Smart Automatic Training Data Generation
2)State of the Art Sequence Tagging Model
3)Active Learning approach
Why the above architecture is chosen:
1)Zero ground truth and no training data available whatsoever.
2)Multi-Independent Source of data generation thus imagine the variance
3)Short representations and extremely noisy
4)Prone to Extreme human error (not bias but error!!!)
Finally details of the architecture and WHY they were necessary:
1)Smart Automatic Training Data Generation:
<>How we leveraged the structure of dataset (Stock-item and Stock-group)?
<>How we used the existing knowledge-base AKA (CREGS)?
<>How we improvised using the information from other sources like Amazon and GS1?
2)State of the Art Sequence Tagging Model:
<>Why we created our own word embeddings and how it helped us?
<>Why BiLSTM and CRF were used and why are they state of the art?
<>Why this specific architecture was needed and why anything else wouldn’t work
<>What was the accuracy and how well did the model perform?
<>Why Active learning when we can generate labels automatically?
<>How we integrated and designed Manual annotation model ourselves?
<>How well did we reach maturity with the minimum data-points manually labelled?
<>Why was extrinsic sampling or intrinsic sampling used?
Conclusion that we will showcase as per 5th Elephant:
1)How to tackle the noisy data problem in case of textual data?
2)Why a deep NN model plays an important role in generalisation?
3)Why Active Learning is a really important concept for dealing with the problem of no label data?
NIMHANS Convention Centre
Hosur Main Road