The Fifth Elephant 2013

An Event on Big Data and Cloud Computing

Building a high performance distributed crawler

Submitted by Sandeep Ravichandran (@sandeepr) on Monday, 15 April 2013


Technical level

Intermediate

Section

Storage and Databases

Status

Submitted

Total votes: +9

Objective

This talk describes how we use NoSQL databases such as MongoDB and Redis to store a huge amount of data and analyze it with tools like Elasticsearch. It also aims to provide insight into leveraging different cloud services to build a high-performance cluster for web crawling.
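As a rough sketch of the kind of pipeline described above (hosts, database, collection and index names here are purely illustrative, and the indexing call assumes the elasticsearch-py 8.x client), a crawled page might be stored in MongoDB, indexed into Elasticsearch for analysis, and marked as crawled in Redis:

```python
# Hypothetical sketch only: store the raw page in MongoDB, index the searchable
# fields into Elasticsearch, and record crawl state in Redis. Names and hosts
# are illustrative, not necessarily the ones used in the talk.
from pymongo import MongoClient
from elasticsearch import Elasticsearch
import redis

pages = MongoClient("mongodb://localhost:27017").crawler.pages
es = Elasticsearch("http://localhost:9200")    # assumes the elasticsearch-py 8.x client
r = redis.Redis(host="localhost", port=6379)

def store_page(url, html, text):
    # Keep the full document in MongoDB for later processing.
    doc_id = pages.insert_one({"url": url, "html": html, "text": text}).inserted_id
    # Index the analyzable fields into Elasticsearch.
    es.index(index="pages", id=str(doc_id), document={"url": url, "text": text})
    # Mark the URL as crawled in Redis so other workers skip it.
    r.sadd("crawled_urls", url)
```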

Description

Building a crawler is easy. Keeping multiple crawlers running every day is not, and having them use the database and other resources optimally is tougher still. This talk hopes to give attendees insight into building a crawler from the ground up and the various technologies involved. Starting from a humble setup of one server for MongoDB and one server running the crawlers, we have been able to migrate to a cluster of crawler machines while also optimising the number of hits to the database. The point is that, when it comes to Big Data, you can always improve the architecture.
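As one illustration of the kind of optimisation involved (a sketch under assumed names and a made-up batch size, not necessarily the exact approach the talk covers), URLs can be deduplicated in a Redis set and page documents written to MongoDB in batches, so the database is hit far less often than with one insert per page:

```python
# One possible way to cut per-URL database hits (illustrative sketch only):
# deduplicate URLs in a Redis set and batch page writes into MongoDB.
import redis
from pymongo import MongoClient

r = redis.Redis()
pages = MongoClient().crawler.pages
BATCH_SIZE = 500          # made-up figure; tune to the workload
buffer = []

def enqueue(url):
    # SADD returns 1 only the first time a URL is seen, so each URL is queued once.
    if r.sadd("seen_urls", url):
        r.rpush("frontier", url)

def save(doc):
    # Buffer crawled documents and flush them in bulk to reduce round trips.
    buffer.append(doc)
    if len(buffer) >= BATCH_SIZE:
        pages.insert_many(buffer)
        buffer.clear()
```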

Requirements

A basic knowledge of NoSQL is important, along with an understanding of how crawlers work (though this will be covered in about a minute during the talk).
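For reference, that one-minute version of how a crawler works boils down to a fetch-parse-enqueue loop; a minimal, purely illustrative sketch using requests and BeautifulSoup:

```python
# The one-minute version of a crawler: fetch a page, parse out its links,
# enqueue the ones not seen yet. Real crawlers add politeness delays,
# robots.txt handling, retries and persistent deduplication on top of this.
from collections import deque
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

def crawl(seed, max_pages=10):
    queue, seen = deque([seed]), {seed}
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        html = requests.get(url, timeout=10).text          # fetch
        soup = BeautifulSoup(html, "html.parser")          # parse
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link not in seen:                           # enqueue new links
                seen.add(link)
                queue.append(link)
```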

Speaker bio

I work as an Architect at CognitiveClouds with a focus on Big Data and Cloud Infrastructure. We were instrumental in building a web crawler that crawls data from more than 500 websites every day. I have also been a Technical Lead at Sourcebits Technologies, specialising in Ruby. I contribute to open source repositories like Rails when I can. When I'm not coding, I like to read books, listen to music and play the guitar.

Comments

  • Srinivasan Seshadri (@sesh) 5 years ago

    The objective seems unrelated to the title and the description.

  • t3rmin4t0r (@t3rmin4t0r) 5 years ago (edited 5 years ago)

    Not quite sure this is a real big-data problem. And IMHO, Mongo/Redis are horrible ways to hold data that is mostly cold (for most crawlers).

    Something like Solr/Nutch has covered the platform levels of this problem rather well - this sounds like you are reinventing something without understanding the existing tools that simplify the problem.

    Take a look at - http://blog.scrapinghub.com/2013/05/13/mongo-bad-for-scraped-data/
