The Fifth Elephant 2015

A conference on data, machine learning, and distributed and parallel computing

Machine Learning, Distributed and Parallel Computing, and High-performance Computing are the themes for this year’s edition of Fifth Elephant.

The deadline for submitting a proposal is 15th June 2015.

We are looking for talks and workshops from academics and practitioners who are in the business of making sense of data, big and small.

Track 1: Discovering Insights and Driving Decisions

This track is about general, novel, fundamental, and advanced techniques for making sense of data and driving decisions from data. This could encompass applications of the following ML paradigms:

  • Statistical Visualizations
  • Unsupervised Learning
  • Supervised Learning
  • Semi-Supervised Learning
  • Active Learning
  • Reinforcement Learning
  • Monte Carlo techniques and probabilistic programming
  • Deep Learning

These may be applied across various data modalities, including multivariate, text, speech, time series, images, video, and transactions.

Track 2: Speed at Scale

This track is about tools and processes for collecting, indexing, and processing vast amounts of data. The theme includes:

  • Distributed and Parallel Computing
  • Real Time Analytics and Stream Processing
  • MapReduce and Graph Computing frameworks
  • Kafka, Spark, Hadoop, MPI
  • Stories of parallelizing sequential programs
  • Cost/Security/Disaster Management of Data

Commitment to Open Source

HasGeek believes in open source as the binding force of our community. If you are describing a codebase for developers to work with, we’d like it to be available under a permissive open source license. If your software is commercially licensed or available under a combination of commercial and restrictive open source licenses (such as the various forms of the GPL), please consider picking up a sponsorship. We recognize that there are valid reasons for commercial licensing, but ask that you support us in return for giving you an audience. Your session will be marked on the schedule as a sponsored session.

Workshops

If you are interested in conducting a hands-on session on any of the topics falling under the themes of the two tracks described above, please submit a proposal under the workshops section. We also need you to tell us about your past experience in teaching and/or conducting workshops.

Hosted by

The Fifth Elephant - known as one of the best data science and Machine Learning conferences in Asia - has transitioned into a year-round forum for conversations about data and ML engineering, data science in production, and data security and privacy practices.
Vedang Manerikar

@vedang

Dead Simple Scalability Patterns

Submitted Jun 15, 2015

Everyone dreams of being ‘Web Scale’, but we start out small. We — most of us — don’t launch a service and expect it to serve millions of requests from Day 1. This means that we don’t think about the ways in which our stack will blow up when the number of requests does start climbing. This talk lists simple patterns and checks that Development and Operations teams should implement from Day 1 in order to ensure a robust distributed system.

Outline

This talk will highlight development patterns that are easy to catch in code review and go a long way in improving the life of your system. For example,

  1. Do not make an unbounded number of DB calls in any request.
    Bad Idea: For each person who retweeted “Ellen’s Oscar Selfie”, fetch their avatar from the DB.
  2. Do not fetch an unbounded amount of data from the DB.
    Bad Idea: Fetch all users who retweeted “Ellen’s Oscar Selfie”.
  3. Build timeouts into every network call made by the system.
    Bad Idea: Wait forever for this list of RT users to load, and don’t render the page until it does.
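
As a rough illustration of the three rules above, here is a minimal Python sketch. The table layout, `PAGE_SIZE`, and every function name are hypothetical and chosen only for this example; they are not taken from the talk. An in-memory SQLite database stands in for the real DB, and a thread-based wrapper stands in for a network client's own timeout setting.

```python
import sqlite3
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

PAGE_SIZE = 50  # hypothetical cap on rows returned per request


def make_demo_db(n_users=200):
    """Build an in-memory database with hypothetical tables for the demo."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, avatar TEXT)")
    db.execute("CREATE TABLE retweets (tweet_id INTEGER, user_id INTEGER)")
    db.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [(i, f"avatar-{i}.png") for i in range(n_users)],
    )
    db.executemany(
        "INSERT INTO retweets VALUES (1, ?)", [(i,) for i in range(n_users)]
    )
    return db


def retweeter_avatars(db, tweet_id, after_user_id=-1, limit=PAGE_SIZE):
    """Return one bounded page of (user_id, avatar) rows in a single query."""
    # Rule 1: one JOIN instead of one avatar lookup per retweeter.
    # Rule 2: LIMIT caps the rows fetched; after_user_id pages through the rest.
    return db.execute(
        "SELECT u.id, u.avatar FROM retweets r "
        "JOIN users u ON u.id = r.user_id "
        "WHERE r.tweet_id = ? AND u.id > ? ORDER BY u.id LIMIT ?",
        (tweet_id, after_user_id, limit),
    ).fetchall()


def call_with_timeout(fn, timeout_s, fallback):
    """Rule 3: wait at most timeout_s seconds for fn(), else return fallback."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback
    finally:
        # Do not block on the slow call; the worker thread finishes on its own.
        executor.shutdown(wait=False)


def slow_avatar_service():
    """Hypothetical stand-in for a remote call that responds too slowly."""
    time.sleep(0.5)
    return "late response"
```

In practice, most network clients expose their own timeout setting (for example, the `timeout` argument of `urllib.request.urlopen`), and using that is usually preferable to wrapping the call in a thread as done here.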

Slides will list out a large number of “obvious” (and some not-so-obvious) strategies that all distributed systems engineers should follow. For example,

  1. Data Projections - Fetch from the DB only the minimum amount of data required to satisfy a request
  2. Simple Profiling - Count the number of DB calls you make to serve a request end-to-end
  3. Essential Monitoring - Measure the statistics that show whether a component is useful. Do you know your cache hit-to-miss ratio?
  4. Awareness of Limits - What is the throughput limit of an Amazon EBS volume?
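
The first three strategies above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `CountingDB`, `MeasuredCache`, `load_display_name`, and the `users` table are hypothetical names invented for this example, and SQLite stands in for whatever database the system actually uses.

```python
import sqlite3


class CountingDB:
    """Wraps a sqlite3 connection and counts the queries issued through it."""

    def __init__(self, conn):
        self.conn = conn
        self.calls = 0  # simple profiling: DB calls made so far

    def query(self, sql, params=()):
        self.calls += 1
        return self.conn.execute(sql, params).fetchall()


class MeasuredCache:
    """Dict-backed cache that records hits and misses."""

    def __init__(self):
        self.data = {}
        self.hits = 0
        self.misses = 0

    def get_or_load(self, key, loader):
        if key in self.data:
            self.hits += 1
        else:
            self.misses += 1
            self.data[key] = loader(key)
        return self.data[key]

    def hit_ratio(self):
        """Fraction of lookups served from the cache (0.0 if none yet)."""
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def load_display_name(db, user_id):
    """Projection: select only the column the request needs, not SELECT *."""
    rows = db.query("SELECT name FROM users WHERE id = ?", (user_id,))
    return rows[0][0] if rows else None
```

Resetting `calls` at the start of a request and logging it at the end gives a per-request DB call count, and `hit_ratio()` answers the cache question directly.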

I will also talk about architectural patterns that should be baked in from Day 1. For example,

  1. Separation of concerns using Message Queues
  2. LRU caching for permanent, unchanging data
  3. Version numbers in the schema for feature rollouts

... and more.
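
To make the three architectural patterns listed above concrete, here is a small Python sketch. Everything in it is an assumption made for illustration: an in-process `queue.Queue` stands in for a real message broker, `functools.lru_cache` stands in for a shared cache, and the record fields, `CURRENT_SCHEMA_VERSION`, and the v1-to-v2 change are hypothetical.

```python
import queue
import threading
from functools import lru_cache

CURRENT_SCHEMA_VERSION = 2  # hypothetical current version of stored records


def upgrade(record):
    """Bring a stored record up to the current schema version."""
    version = record.get("schema_version", 1)  # unversioned records count as v1
    if version == 1:
        # Hypothetical v1 -> v2 change: split "name" into first and last name.
        first, _, last = record["name"].partition(" ")
        record = {
            "schema_version": CURRENT_SCHEMA_VERSION,
            "first_name": first,
            "last_name": last,
        }
    return record


@lru_cache(maxsize=1024)
def country_name(code):
    """Stand-in for a lookup of data that never changes once written."""
    return {"IN": "India", "JP": "Japan"}.get(code, "Unknown")


def start_worker(jobs, results):
    """Consume jobs from the queue so the producer never does the slow work."""

    def run():
        while True:
            job = jobs.get()
            try:
                if job is None:  # sentinel value: stop the worker
                    return
                results.append(upgrade(job))
            finally:
                jobs.task_done()

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker
```

The producer only calls `jobs.put(...)` and returns immediately, which is the separation of concerns the first pattern refers to; a bounded queue (`queue.Queue(maxsize=...)`) also keeps a slow consumer from letting the backlog grow without limit.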

Speaker bio

Vedang Manerikar is a Platform Architect at Helpshift and has helped the Helpshift SDK go from 0 installs to 1 Billion+ installs. Along the way, he has stayed up long nights, refactored multiple systems, and learned everything in this talk the hard way. He is also terrible at Markdown.
