Building a high-performance distributed crawler
This talk describes how we use NoSQL databases such as MongoDB and Redis to store huge amounts of data and analyze it using tools like Elasticsearch. It also aims to provide insight into leveraging different cloud services to build a high-performance cluster for web crawling.
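To make the storage/analysis split concrete, here is a minimal sketch of the kind of document a crawler might persist in MongoDB and later index into Elasticsearch via its bulk API. The field names (`url`, `status`, `body`, `fetched_at`) and the index name are illustrative assumptions, not the actual schema from the talk, and the real database clients are left out.

```python
import json
from datetime import datetime, timezone

def make_page_document(url, html, status_code):
    """Build an illustrative document for one crawled page.
    A real crawler would pass this to a MongoDB insert."""
    return {
        "url": url,
        "status": status_code,
        "body": html,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
    }

def to_es_bulk(docs, index="pages"):
    """Serialise documents into Elasticsearch's newline-delimited
    bulk format: one action line, then one source line, per doc."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"
```

The same documents can then be POSTed to the Elasticsearch `_bulk` endpoint in batches rather than one request per page.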
Building a crawler is easy. Having multiple crawlers running every day is not, and having them use the database and other resources optimally is tougher still. This talk hopes to give attendees insight into building a crawler from the ground up and the various technologies involved. Starting from a humble setup of one server for MongoDB and one server to run crawlers, we have been able to migrate to a cluster of crawler machines while also optimising the number of hits to the database. This shows that you can always improve the architecture when it comes to Big Data.
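One common way to cut the number of database hits, sketched below under assumptions of my own (the talk may use a different technique): buffer writes in memory and flush them in bulk, so 250 page inserts cost three round trips instead of 250. `db_insert_many` is a hypothetical stand-in for a real bulk call such as a MongoDB `insert_many`.

```python
class BatchedWriter:
    """Buffer documents and flush them in bulk to reduce
    per-document round trips to the database."""

    def __init__(self, db_insert_many, batch_size=100):
        self.db_insert_many = db_insert_many  # bulk-insert callable (stand-in)
        self.batch_size = batch_size
        self.buffer = []

    def write(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Send everything buffered so far in one bulk call."""
        if self.buffer:
            self.db_insert_many(self.buffer)
            self.buffer = []
```

Called with `batch_size=100`, writing 250 documents triggers two automatic flushes plus one final explicit `flush()` at shutdown.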
A basic knowledge of NoSQL is important, along with an idea of how crawlers work (though this will be covered in about a minute during the talk).
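For readers who want that minute's worth of background up front, here is a bare-bones sketch of the core crawl loop: keep a frontier of URLs to visit, fetch each page, extract its links, and enqueue the ones not seen yet. The `fetch` callable is injected so the sketch stays self-contained; a real crawler would make an HTTP request there.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl: pop a URL from the frontier, fetch its
    HTML, extract links, and enqueue unseen absolute URLs."""
    frontier = deque([start_url])
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        parser = LinkExtractor()
        parser.feed(fetch(url))
        for link in parser.links:
            absolute = urljoin(url, link)  # resolve relative hrefs
            if absolute not in visited:
                frontier.append(absolute)
    return visited
```

The distributed version discussed in the talk amounts to sharing that frontier and visited set across machines, which is where stores like Redis come in.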
I work as an Architect at CognitiveClouds with a focus on Big Data and Cloud Infrastructure. We were instrumental in building a web crawler that crawls data from more than 500 websites every day. I have also been a Technical Lead at Sourcebits Technologies with a specialisation in Ruby, and I contribute to open-source repositories like Rails when I can. When I'm not coding, I like to read books, listen to music and play the guitar.