The Fifth Elephant 2015

A conference on data, machine learning, and distributed and parallel computing

Kaushik Paranjape

@kaushik_paranjape

Big Data Engineering made easy

Submitted Jun 14, 2015

Switching to a new database in order to scale up, and then porting all the algorithms and reporting functionality already implemented to that database, is a challenge. At Sokrati we have eased this pain by implementing proprietary APIs (for internal use).

Outline

Sokrati deals with a lot of data (20TB+). Each team used to worry about scaling up its own databases, and all of them faced similar problems. Setting up a new shard for a database or migrating data between shards was a nightmarish activity. On the business front, the analytics team would wait forever to fetch data, and had to come up with its own ways of analysing it because the data was impossible to load into standard tools.

That is when the idea struck: why not build a layer that eases life for all the developers? In theory, this layer should be able to handle unbounded scale, shard / re-shard when needed, add / remove boxes depending on the size of the data, and archive data to an archive store.

Athena:
We came up with the idea of building a REST-based web service that would be simple to use and to scale. This service would have just three calls: Store, Fetch and Status. Store takes a tablename, a dbname, a schema and an S3 file containing the actual data as input, and inserts / updates the data. Fetch builds the right query depending on selectors and filters and saves the output in an S3 file. Since this service deals with huge amounts of data, we decided that both of these calls should be offline jobs, and hence we added a status servlet to check whether a job is complete.
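The three calls can be illustrated with a minimal in-memory sketch. This is not Sokrati's implementation: the class name AthenaSketch, the method signatures, the job states, the use of the first schema column as the upsert key, and the equality-only filters are all assumptions made for illustration. A plain dictionary stands in for S3, and run_pending stands in for the offline worker.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Job:
    kind: str               # "store" or "fetch"
    params: dict
    state: str = "PENDING"  # hypothetical states: PENDING -> COMPLETE


class AthenaSketch:
    """Toy stand-in for a Store / Fetch / Status service (all names assumed)."""

    def __init__(self):
        self.files = {}    # stands in for S3: path -> list of row dicts
        self.tables = {}   # (dbname, tablename) -> {key value: row}
        self.jobs = {}     # job id -> Job
        self._ids = itertools.count(1)

    def _submit(self, kind, **params):
        job_id = next(self._ids)
        self.jobs[job_id] = Job(kind, params)
        return job_id

    def store(self, dbname, tablename, schema, s3_file):
        # Queue an insert / update of the rows found in s3_file.
        return self._submit("store", dbname=dbname, tablename=tablename,
                            schema=schema, s3_file=s3_file)

    def fetch(self, dbname, tablename, selectors, filters, output_file):
        # Queue a query; the result is written to output_file.
        return self._submit("fetch", dbname=dbname, tablename=tablename,
                            selectors=selectors, filters=filters,
                            output_file=output_file)

    def status(self, job_id):
        return self.jobs[job_id].state

    def run_pending(self):
        # Stand-in for the offline worker that processes queued jobs.
        for job in self.jobs.values():
            if job.state != "PENDING":
                continue
            p = job.params
            table = self.tables.setdefault((p["dbname"], p["tablename"]), {})
            if job.kind == "store":
                key = p["schema"][0]  # assumption: first column is the key
                for row in self.files[p["s3_file"]]:
                    table[row[key]] = {col: row[col] for col in p["schema"]}
            else:
                self.files[p["output_file"]] = [
                    {col: row[col] for col in p["selectors"]}
                    for row in table.values()
                    if all(row[c] == v for c, v in p["filters"].items())
                ]
            job.state = "COMPLETE"
```

A caller would submit a store or fetch job, poll status with the returned job id, and read the output file only once the job reports COMPLETE.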

Ares:
Ares is the computation framework call. It accepts a series of map-reduce jobs as input, along with schedules for executing those jobs. Each job can read data from Athena and write back to Athena if needed.
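A series of map-reduce jobs can be sketched in a few lines of Python. Again, this is only an illustration under stated assumptions: MapReduceJob, run_job and run_pipeline are hypothetical names, the schedule field is recorded but not interpreted, and rows are passed in directly rather than being read from or written back to Athena.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass
class MapReduceJob:
    name: str
    mapper: Callable         # row -> iterable of (key, value) pairs
    reducer: Callable        # (key, list of values) -> output row
    schedule: str = "daily"  # hypothetical; stored only, never evaluated here


def run_job(job, rows):
    # Map phase: emit (key, value) pairs and group the values by key.
    groups = defaultdict(list)
    for row in rows:
        for key, value in job.mapper(row):
            groups[key].append(value)
    # Reduce phase: collapse each group into one output row.
    return [job.reducer(key, values) for key, values in groups.items()]


def run_pipeline(jobs, rows):
    # Run the jobs in order; each job consumes the previous job's output.
    for job in jobs:
        rows = run_job(job, rows)
    return rows


# Example job: total clicks per campaign (field names are made up).
clicks_per_campaign = MapReduceJob(
    name="clicks_per_campaign",
    mapper=lambda row: [(row["campaign"], row["clicks"])],
    reducer=lambda key, values: {"campaign": key, "clicks": sum(values)},
)
```

Chaining the jobs through a single list keeps the sketch small; a real framework would also have to handle scheduling, failures and distribution across machines.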

To summarize: data scientists can now focus only on creating more models, and the data collection team can focus only on adding more data sources. The Sokrati infra team takes care of scaling them up, both DBs and applications!

Speaker bio

Kaushik is the brain behind many of the technologies built at Sokrati. He is responsible for building scale and efficiency into the software architecture that helps deliver millions of ads every day. Before Sokrati, he worked at Veraz Networks, a company that built soft switches, where he owned the Presence Server (social networking engine). Kaushik holds a Bachelor’s Degree in Computer Science from V.I.T, Pune.

