Big Data Engineering made easy

Jul 2015

16 Thu 08:30 AM – 06:35 PM IST

17 Fri 08:30 AM – 06:30 PM IST

18 Sat 09:00 AM – 06:30 PM IST

NIMHANS Convention center

Submitted Jun 14, 2015

Section: Full Talk
Technical level: Intermediate

Switching databases in order to scale up, and then porting every algorithm and reporting feature already implemented to the new database, is a challenge. At Sokrati we have eased this pain by implementing proprietary APIs (for internal use).

Outline

Sokrati deals with a lot of data (20TB+). Each team used to worry about scaling up its own databases, and all of them faced similar problems. Setting up a new shard for a database or migrating data between shards was a nightmarish activity. On the business front, the analytics team would wait forever to fetch data, and had to come up with its own ways of analysing it because it was impossible to load into standard tools.

This is when an idea struck: why don’t we build a layer that eases life for all the developers? This layer should be able to handle theoretically infinite scale, shard / re-shard when needed, add / remove boxes depending on the size of the data, and archive data to an archive store.

Athena:
We came up with the idea of building a REST-based web service that would be simple to use and to scale. The service has just three calls: Store, Fetch and Status. Store takes a table name, a database name, a schema and an S3 file containing the actual data as input, and inserts / updates the data. Fetch builds the right query from the given selectors and filters and saves the output to an S3 file. Since the service deals with huge amounts of data, we decided that both of these calls should run as offline jobs, and so we added a Status servlet to check whether a job is complete.
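The three calls can be pictured with a small in-memory sketch. This is not Athena's actual API: the class name `AthenaSketch`, the `build_query` helper, the argument names, the status strings and the `%s` placeholder style are all assumptions made for illustration, and a thread pool stands in for the offline job runner.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


def build_query(dbname, tablename, selectors, filters):
    """Build a SQL string from selectors (columns) and filters (column -> value)."""
    columns = ", ".join(selectors) if selectors else "*"
    query = f"SELECT {columns} FROM {dbname}.{tablename}"
    if filters:
        # Placeholders only; the values would be bound by the database driver.
        where = " AND ".join(f"{col} = %s" for col in sorted(filters))
        query += f" WHERE {where}"
    return query


class AthenaSketch:
    """Hypothetical in-memory stand-in for the Store, Fetch and Status calls."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=2)
        self._jobs = {}  # job_id -> Future

    def _submit(self, func):
        # Every call returns a job id immediately; the work runs in the background.
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = self._pool.submit(func)
        return job_id

    def store(self, dbname, tablename, schema, s3_file):
        # A real service would read the S3 file and insert / update rows
        # according to the schema; here we only record what was asked for.
        return self._submit(lambda: f"stored {s3_file} into {dbname}.{tablename}")

    def fetch(self, dbname, tablename, selectors, filters, output_s3_file):
        query = build_query(dbname, tablename, selectors, filters)
        # A real service would run the query and write the rows to S3.
        return self._submit(lambda: {"query": query, "output": output_s3_file})

    def status(self, job_id):
        future = self._jobs.get(job_id)
        if future is None:
            return "UNKNOWN"
        if not future.done():
            return "RUNNING"
        return "FAILED" if future.exception() else "COMPLETE"
```

A caller would submit a `store` or `fetch`, keep the returned job id, and poll `status` until it reports completion, which mirrors the offline-job design described above.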

Ares:
Ares is the computation framework. Its call accepts a series of MapReduce jobs as input, along with schedules for executing those jobs. Each job can read data from Athena and write back to Athena if needed.
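As a rough illustration of "a series of MapReduce jobs", the sketch below chains two jobs in plain Python, with the output of one feeding the next. The `MapReduceJob` class, its `schedule` field, the helper functions and the sample rows are hypothetical and do not reflect Ares's real input format. In the real system the records would come from Athena rather than from an in-memory list.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass
class MapReduceJob:
    name: str
    mapper: Callable    # record -> iterable of (key, value) pairs
    reducer: Callable   # (key, list of values) -> output record
    schedule: str = "daily"  # hypothetical schedule label; not interpreted here


def run_job(job, records):
    """Run one map step, group values by key, then run the reduce step."""
    groups = defaultdict(list)
    for record in records:
        for key, value in job.mapper(record):
            groups[key].append(value)
    return [job.reducer(key, values) for key, values in sorted(groups.items())]


def run_pipeline(jobs, records):
    """Run the jobs in order; each job reads the previous job's output."""
    for job in jobs:
        records = run_job(job, records)
    return records


# Sample input, standing in for rows fetched from Athena.
rows = [
    {"campaign": "a", "clicks": 3},
    {"campaign": "b", "clicks": 5},
    {"campaign": "a", "clicks": 2},
]

clicks_per_campaign = MapReduceJob(
    name="clicks_per_campaign",
    mapper=lambda r: [(r["campaign"], r["clicks"])],
    reducer=lambda key, values: {"campaign": key, "clicks": sum(values)},
)

total_clicks = MapReduceJob(
    name="total_clicks",
    mapper=lambda r: [("all", r["clicks"])],
    reducer=lambda key, values: {"campaign": key, "clicks": sum(values)},
    schedule="hourly",
)

result = run_pipeline([clicks_per_campaign, total_clicks], rows)
```

Keeping each job as a plain mapper / reducer pair means a scheduler only needs the job list and the schedule labels; where each job runs and how it scales stays inside the framework.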

To summarize: data scientists can now focus only on creating more models, and the data collection team can focus only on adding more data sources. The Sokrati infra team takes care of scaling them up, both DBs and applications!

Speaker bio

Kaushik is the brain behind many of the technologies built at Sokrati. He is responsible for building scale and efficiency into the software architecture that helps deliver millions of ads every day. Before Sokrati he worked at Veraz Networks, a company that built soft switches, where he owned the Presence Server (a social networking engine). Kaushik holds a Bachelor’s Degree in Computer Science from V.I.T, Pune.

The Fifth Elephant 2015