The Fifth Elephant 2015

A conference on data, machine learning, and distributed and parallel computing

Big Data Engineering made easy

Submitted by Kaushik Paranjape (@kaushik-paranjape) on Sunday, 14 June 2015

Technical level

Intermediate

Section

Full Talk

Status

Submitted

Objective

Switching databases in order to scale, and then porting every algorithm and reporting function already implemented to the new database, is a challenge. At Sokrati we have eased this pain by implementing proprietary APIs (for internal use).

Description

Sokrati deals with a lot of data (20TB+). Each team used to worry about scaling up its own databases, and the teams all faced similar problems. Setting up a new shard for a database, or migrating data between shards, was a nightmarish activity. On the business front, the analytics team would wait forever to fetch data, and had to come up with its own ways of analysing it because it was impossible to load into standard tools.

That is when the idea struck: why not build a layer that eases life for all the developers? This layer should, in theory, handle unlimited scale, shard and re-shard when needed, add or remove boxes depending on the size of the data, and archive data to an archive store.

Athena:
We came up with the idea of building a REST-based web service that would be simple to use and to scale. The service would have just three calls: Store, Fetch and Status. Store would take tablename, dbname, schema and an S3 file containing the actual data as input, and insert or update the data. Fetch would build the right query from the given selectors and filters and save the output to an S3 file. Since the service would deal with huge amounts of data, we decided that both of these calls should run as offline jobs, and so we added a status servlet to check whether a job is complete.
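The offline-job pattern described here (submit a Store or Fetch, then poll Status) can be sketched roughly as follows. This is a minimal in-memory illustration, not Sokrati's actual API: the class name `AthenaSketch`, the method signatures, the job states, the `run_pending` worker stand-in, and the assumption that Fetch also takes a database and table name are all hypothetical.

```python
import itertools
from dataclasses import dataclass
from enum import Enum


class JobState(Enum):
    PENDING = "PENDING"
    COMPLETE = "COMPLETE"


@dataclass
class Job:
    job_id: int
    kind: str      # "store" or "fetch"
    params: dict   # arguments the offline worker would act on
    state: JobState = JobState.PENDING


class AthenaSketch:
    """Hypothetical in-memory stand-in for the Store, Fetch and Status calls."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._jobs = {}

    def _submit(self, kind, **params):
        # Both Store and Fetch only enqueue a job and return its id.
        job = Job(next(self._ids), kind, params)
        self._jobs[job.job_id] = job
        return job.job_id

    def store(self, dbname, tablename, schema, s3_file):
        return self._submit("store", dbname=dbname, tablename=tablename,
                            schema=schema, s3_file=s3_file)

    def fetch(self, dbname, tablename, selectors, filters, output_s3_file):
        return self._submit("fetch", dbname=dbname, tablename=tablename,
                            selectors=selectors, filters=filters,
                            output_s3_file=output_s3_file)

    def status(self, job_id):
        # The caller polls this until the job reports COMPLETE.
        return self._jobs[job_id].state.value

    def run_pending(self):
        # Stand-in for the offline worker: marks every pending job complete
        # without touching any real database or S3 bucket.
        for job in self._jobs.values():
            job.state = JobState.COMPLETE


athena = AthenaSketch()
store_id = athena.store("ads", "clicks", {"campaign": "string"},
                        "s3://example-bucket/clicks.csv")
state_before = athena.status(store_id)
athena.run_pending()
state_after = athena.status(store_id)
```

Returning a job id immediately keeps the web service responsive however large the S3 file is; the caller decides how often to poll.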

Ares:
Ares is the computation framework call. It accepts a series of map-reduce jobs as input, along with schedules for executing them. Each job can read data from Athena and write back to Athena if needed.
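As a rough illustration of what "a series of map-reduce jobs" with schedules might look like to a caller, here is a small single-process sketch. It is not the Ares API: `MapReduceJob`, `run_job`, `run_chain`, and the cron-style `schedule` string are hypothetical names, the schedule is carried but never interpreted, and plain Python lists stand in for data read from and written back to Athena.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass
class MapReduceJob:
    name: str
    mapper: Callable     # record -> iterable of (key, value) pairs
    reducer: Callable    # (key, list of values) -> one output record
    schedule: str = ""   # hypothetical cron-style string; not interpreted here


def run_job(job, records):
    """Run one job in-process: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in job.mapper(record):
            groups[key].append(value)
    return [job.reducer(key, values) for key, values in sorted(groups.items())]


def run_chain(jobs, records):
    """Feed the output of each job into the next one, in order."""
    for job in jobs:
        records = run_job(job, records)
    return records


clicks_per_campaign = MapReduceJob(
    name="clicks_per_campaign",
    mapper=lambda row: [(row["campaign"], 1)],
    reducer=lambda key, values: {"campaign": key, "clicks": sum(values)},
    schedule="0 2 * * *",
)

rows = [{"campaign": "a"}, {"campaign": "b"}, {"campaign": "a"}]
result = run_chain([clicks_per_campaign], rows)
```

In a real deployment the scheduling and job ordering would be delegated to a workflow manager rather than a loop like `run_chain`.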

To summarize: data scientists can now focus only on creating more models, and the data collection team can focus only on adding more data sources. The Sokrati infra team takes care of scaling both the databases and the applications!

Speaker bio

Kaushik is the brain behind many of the technologies built at Sokrati. He is responsible for building scale and efficiency into the software architecture that helps deliver millions of ads every day. Before Sokrati he worked at Veraz Networks, a company that built soft switches, where he owned the Presence Server (social networking engine). Kaushik holds a Bachelor’s Degree in Computer Science from V.I.T, Pune.

Comments

  • 1
    Shashi Gowda (@g0wda) Reviewer 3 years ago

    Could you provide links about Athena and Ares? Or are these proprietary to Sokrati?

  • 1
    Veera (@vbala) 3 years ago

    Can I compare Ares to Oozie? Does Ares support workflow management?

    • 1
      Kaushik Paranjape (@kaushik-paranjape) Proposer 3 years ago

      Ares is an API for running map-reduce jobs. It internally uses a workflow management framework, so comparing Ares with Oozie would not be appropriate. Ares internally uses Azkaban (which can be compared with Oozie; there is enough online documentation about it). If we find it apt, Azkaban could be replaced with Oozie without affecting any development.