Building a high performance distributed crawler
Submitted by Sandeep Ravichandran (@sandeepr) on Monday, 15 April 2013
Storage and Databases
This talk describes how we use NoSQL databases such as MongoDB and Redis to store large amounts of data, and how we analyze that data with tools such as Elasticsearch. It also aims to provide insight into leveraging different cloud services to build a high-performance cluster for web crawling.
Building a crawler is easy. Running multiple crawlers every day is not, and making them use the database and other resources optimally is tougher still. This talk hopes to give attendees insight into building a crawler from the ground up and the various technologies involved. Starting from a humble setup of one server for MongoDB and one server to run the crawlers, we have been able to migrate to a cluster that runs the crawlers while also optimising the number of hits to the database. The point is that you can always improve the architecture when it comes to Big Data.
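One common way to cut the number of database hits is to buffer crawled records and write them in batches. The sketch below is only an illustration of that general idea, not the architecture presented in the talk. The names `BatchedWriter`, `InMemoryStore`, and `crawl_into` are hypothetical, and the in-memory store stands in for a real bulk write to a database.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


class BatchedWriter:
    """Buffers records and hands them to `flush_fn` in groups.

    `flush_fn` is a hypothetical stand-in for a bulk write to the data store.
    """

    def __init__(self, flush_fn: Callable[[List[Record]], None], batch_size: int = 100) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._flush_fn = flush_fn
        self._batch_size = batch_size
        self._buffer: List[Record] = []

    def add(self, record: Record) -> None:
        """Queue one record; write the batch once it is full."""
        self._buffer.append(record)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Write any buffered records in a single call."""
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        self._flush_fn(batch)


class InMemoryStore:
    """Assumed placeholder for a database; counts how often it is hit."""

    def __init__(self) -> None:
        self.records: List[Record] = []
        self.write_calls = 0

    def bulk_write(self, batch: List[Record]) -> None:
        self.write_calls += 1
        self.records.extend(batch)


def crawl_into(store: InMemoryStore, urls: List[str], batch_size: int) -> int:
    """Pretend to crawl each URL and return the number of store writes."""
    writer = BatchedWriter(store.bulk_write, batch_size)
    for url in urls:
        # A real crawler would fetch and parse the page here.
        writer.add({"url": url, "status": "fetched"})
    writer.flush()  # write the final, partially filled batch
    return store.write_calls
```

With 250 URLs and a batch size of 100, the store is hit three times instead of 250. The trade-off is that records still in the buffer are lost if the crawler process dies before a flush, so the batch size is a balance between write load and durability.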
A basic knowledge of NoSQL is important, along with an understanding of how crawlers work (though this will be covered in about a minute during the talk).
I work as an Architect at CognitiveClouds with a focus on Big Data and Cloud Infrastructure. We were instrumental in building a web crawler that crawls data from more than 500 websites every day. I have also been a Technical Lead at Sourcebits Technologies, specialising in Ruby. I contribute to open source repositories such as Rails when I can. When I'm not coding, I like to read books, listen to music and play the guitar.