The Fifth Elephant 2015

A conference on data, machine learning, and distributed and parallel computing

Machine Learning, Distributed and Parallel Computing, and High-performance Computing are the themes for this year’s edition of Fifth Elephant.

The deadline for submitting a proposal is 15th June 2015.

We are looking for talks and workshops from academics and practitioners who are in the business of making sense of data, big and small.

Track 1: Discovering Insights and Driving Decisions

This track is about general, novel, fundamental, and advanced techniques for making sense of data and driving decisions from data. This could encompass applications of the following ML paradigms:

  • Statistical Visualizations
  • Unsupervised Learning
  • Supervised Learning
  • Semi-Supervised Learning
  • Active Learning
  • Reinforcement Learning
  • Monte Carlo techniques and probabilistic programming
  • Deep Learning

These may be applied across various data modalities, including multivariate data, text, speech, time series, images, video, transactions, etc.

Track 2: Speed at Scale

This track is about tools and processes for collecting, indexing, and processing vast amounts of data. The theme includes:

  • Distributed and Parallel Computing
  • Real Time Analytics and Stream Processing
  • MapReduce and Graph Computing frameworks
  • Kafka, Spark, Hadoop, MPI
  • Stories of parallelizing sequential programs
  • Cost/Security/Disaster Management of Data

Commitment to Open Source

HasGeek believes in open source as the binding force of our community. If you are describing a codebase for developers to work with, we’d like it to be available under a permissive open source license. If your software is commercially licensed or available under a combination of commercial and restrictive open source licenses (such as the various forms of the GPL), please consider picking up a sponsorship. We recognize that there are valid reasons for commercial licensing, but ask that you support us in return for giving you an audience. Your session will be marked on the schedule as a sponsored session.

Workshops

If you are interested in conducting a hands-on session on any of the topics falling under the themes of the two tracks described above, please submit a proposal under the workshops section. We also need you to tell us about your past experience in teaching and/or conducting workshops.

Hosted by

The Fifth Elephant, known as one of the best data science and machine learning conferences in Asia, has transitioned into a year-round forum for conversations about data and ML engineering, data science in production, and data security and privacy practices.

Kaushik Paranjape

@kaushik_paranjape

Big Data Engineering made easy

Submitted Jun 14, 2015

Switching databases in order to scale up, and then porting every algorithm and reporting function already implemented to the new database, is a challenge. At Sokrati we have eased this pain by building proprietary APIs for internal use.

Outline

Sokrati deals with a lot of data (20TB+). Each team used to worry about scaling up its own databases, and all the teams faced similar problems. Setting up a new shard for a database, or migrating data between shards, was a nightmarish activity. On the business front, the analytics team would wait forever to fetch data, and had to come up with their own ways of analysing it because it was impossible to load into standard tools.

This is when the idea struck: why don’t we build a layer that eases life for all the developers? In theory, this layer should handle unbounded scale, shard and re-shard when needed, add or remove boxes depending on the size of the data, and archive data to an archive store.

Athena:
We came up with the idea of building a REST-based web service that would be simple to use and to scale. The service has just three calls: Store, Fetch, and Status. Store takes a table name, a database name, a schema, and an S3 file containing the actual data as input, and inserts or updates the data. Fetch builds the right query from the given selectors and filters and saves the output to an S3 file. Since the service deals with huge amounts of data, we decided that both of these calls should run as offline jobs, and so we added a Status servlet to check whether a job is complete.
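The asynchronous Store / Fetch / Status flow described above can be illustrated with a small in-memory sketch. This is not Sokrati's implementation: the class name, method signatures, the "PENDING" / "DONE" states, and the convention that the schema names a key column are all assumptions made for illustration, and plain dictionaries stand in for S3 and the database.

```python
class AthenaSketch:
    """Hypothetical in-memory model of the Store / Fetch / Status calls."""

    def __init__(self):
        self.s3 = {}        # path -> list of row dicts (stand-in for S3 files)
        self.tables = {}    # (dbname, tablename) -> {key value: row}
        self.jobs = {}      # job id -> (kind, args), in submission order
        self.state = {}     # job id -> "PENDING" or "DONE" (assumed states)
        self._next_id = 1

    def _enqueue(self, kind, args):
        job_id = self._next_id
        self._next_id += 1
        self.jobs[job_id] = (kind, args)
        self.state[job_id] = "PENDING"
        return job_id

    def store(self, dbname, tablename, schema, s3_path):
        # Returns immediately; the data is loaded later by run_pending().
        return self._enqueue("store", (dbname, tablename, schema, s3_path))

    def fetch(self, dbname, tablename, selectors, filters, out_path):
        # selectors: columns to return; filters: column -> required value.
        return self._enqueue("fetch", (dbname, tablename, selectors, filters, out_path))

    def status(self, job_id):
        return self.state[job_id]

    def run_pending(self):
        """Stand-in for the offline worker that executes queued jobs."""
        for job_id, (kind, args) in list(self.jobs.items()):
            if self.state[job_id] != "PENDING":
                continue
            if kind == "store":
                dbname, tablename, schema, s3_path = args
                table = self.tables.setdefault((dbname, tablename), {})
                for row in self.s3[s3_path]:
                    # Assumption: schema["key"] names the primary-key column,
                    # so a repeated key updates the existing row.
                    table[row[schema["key"]]] = row
            else:
                dbname, tablename, selectors, filters, out_path = args
                rows = self.tables.get((dbname, tablename), {}).values()
                self.s3[out_path] = [
                    {col: row[col] for col in selectors}
                    for row in rows
                    if all(row[col] == val for col, val in filters.items())
                ]
            self.state[job_id] = "DONE"
```

A caller would submit a store or fetch, keep the returned job id, and poll status until the job is done before reading the output file. In a real service the three calls would be HTTP endpoints and the worker would run separately from the request handlers.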

Ares:
Ares is the computation framework call. It accepts a series of MapReduce jobs as input, along with schedules for executing those jobs. Each job can read data from Athena and write back to Athena if needed.
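As a rough sketch of what a series of MapReduce jobs reading from and writing back to a shared store might look like, the following runs jobs in order over an in-memory dictionary. The MapReduceJob fields, the run_job and run_series helpers, and the "daily" schedule string are hypothetical names chosen for this example; the schedule is carried as metadata only and is not executed, and the dictionary merely stands in for Athena.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class MapReduceJob:
    name: str
    mapper: Callable[[dict], Iterable[tuple]]   # row -> (key, value) pairs
    reducer: Callable[[object, list], dict]     # key, values -> output row
    source: str                                 # table to read from the store
    sink: Optional[str] = None                  # table to write back to, if any
    schedule: str = "daily"                     # metadata only in this sketch


def run_job(job, store):
    """Run one job: map every source row, group by key, reduce each group."""
    groups = defaultdict(list)
    for row in store[job.source]:
        for key, value in job.mapper(row):
            groups[key].append(value)
    output = [job.reducer(key, groups[key]) for key in sorted(groups)]
    if job.sink is not None:
        store[job.sink] = output    # write back so a later job can read it
    return output


def run_series(jobs, store):
    """Run jobs in the order given; later jobs may read earlier sinks."""
    return [run_job(job, store) for job in jobs]
```

Chaining works because each job names its source and optional sink: a first job could total clicks per ad into a new table, and a second job could read that table to find the maximum.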

To summarize: data scientists can now focus only on creating more models, and the data collection team can focus only on adding more data sources. The Sokrati infra team takes care of scaling up both the DBs and the applications!

Speaker bio

Kaushik is the brain behind many of the technologies built at Sokrati. He is responsible for building scale and efficiency into the software architecture that helps deliver millions of ads every day. Before Sokrati he worked at Veraz Networks, a company that built soft switches, where he owned the Presence Server (social networking engine). Kaushik holds a Bachelor’s degree in Computer Science from V.I.T, Pune.

