Experiences at Kosmix with Cloud Computing

Sat, 9 Jul 2011, 09:30 AM – 05:30 PM IST

Dharmaram College, Bengaluru

PHP is today the world’s most popular open source web development language. It is used by millions of websites, most often via applications like WordPress and Drupal. Deploying a PHP website is straightforward and supported by nearly every web hosting provider.

There are limits to how much load a single web server can take, though. For your website to scale, you will sooner or later need to transition to a multi-server deployment, and this can be hard. It requires thinking about web development in entirely new ways.

The exciting new world of cloud computing promises to make all this much better. “Cloud computing” is an umbrella term for a range of tools and techniques that make scalability possible. Scaling PHP in the Cloud is a one-day conference on what it takes to make the leap from single-server to multi-server deployments, and on making sense of the new world beyond.

Hashtag: #phpcloud

Sessions are 45 minutes each (30 minutes of talk + 15 minutes of Q&A). If you’d like to do a shorter session, please indicate so in the description.

To attend this event, buy your ticket from http://phpcloud.doattend.com/

Scaling PHP in the Cloud was an event by HasGeek in 2011.

This submission has been added to the schedule

Experiences at Kosmix with Cloud Computing

Submitted Jul 5, 2011

Section: Development
Technical level: Beginner
Session type: Discussion

Hands-on knowledge of building a substantially large-scale system, and the experiences and learnings from it.

Outline

I was the founding CTO of Kosmix, which has now been acquired by Walmart for their ecommerce efforts. Kosmix started off in 2004 as a next-generation search engine; the thesis was that categorization was fundamental to understanding the (information on the) web.

In any case, as a result of this endeavour we built a huge scalable system that crawled and indexed over 10 billion URLs and served over a million search queries each day. Righthealth.com, powered by Kosmix, became the #1 health web site in the world in terms of traffic.

We realized first hand the need for a distributed file system such as GFS, the need for a job tracking system that would automatically restart failed jobs, and the need for a computation framework that would make many computations simple and easy (now MapReduce, Pig, Hive etc.).
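To make the job tracking idea concrete, here is a minimal sketch of a runner that restarts failed jobs. It is not the Kosmix system: the names `run_with_restarts` and `make_flaky_job`, the retry limit, and the simulated crash are all assumptions made for illustration, and everything runs in a single process.

```python
from typing import Callable, Dict, Optional, Tuple


def run_with_restarts(
    jobs: Dict[str, Callable[[], object]],
    max_attempts: int = 3,
) -> Tuple[Dict[str, object], Dict[str, str]]:
    """Run each job, restarting it after a failure, up to max_attempts times.

    Returns (results, failed): results maps job name to its return value,
    failed maps job name to the last error seen for jobs that never succeeded.
    """
    results: Dict[str, object] = {}
    failed: Dict[str, str] = {}
    for name, job in jobs.items():
        last_error: Optional[Exception] = None
        for _attempt in range(max_attempts):
            try:
                results[name] = job()
                break  # job succeeded, no restart needed
            except Exception as exc:  # any failure triggers a restart
                last_error = exc
        else:
            # The loop finished without a successful run.
            failed[name] = repr(last_error)
    return results, failed


def make_flaky_job(failures_before_success: int, value: object) -> Callable[[], object]:
    """Hypothetical job that crashes a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def job() -> object:
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise RuntimeError("simulated crash")
        return value

    return job


if __name__ == "__main__":
    results, failed = run_with_restarts(
        {"crawl": make_flaky_job(2, "done"), "index": make_flaky_job(5, "done")},
        max_attempts=3,
    )
    print(results)
    print(failed)
```

A real tracker would also have to detect machine failures, reschedule jobs on other nodes, and persist its own state; the sketch only shows the restart loop.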

We had to write some code in assembly to get the performance needed to keep the capex budget from blowing out of proportion.

I have also been involved with, and have helped, several other large data projects, such as Citrusleaf (www.citrusleaf.net) and Inmobi’s data warehouse. I have also helped a few companies reason about how to build what the Aadhaar project (UID project) needs in terms of scale.

As far back as 1988 I was involved with building parallel database systems (Gamma from UW Madison, Brahma at IIT Bombay where I was a faculty member), which in today's world would have been called DB systems in the cloud. Conceptually, though, there is nothing new in the idea of distributing data and work to multiple machines and collating the results.
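The idea of distributing data and work and then collating the results can be sketched in a few lines. This is only an illustration under stated assumptions: worker threads stand in for separate machines, the data is split round-robin, and the function names (`partition`, `count_words`, `distributed_word_count`) are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def partition(items: Sequence[str], num_workers: int) -> List[List[str]]:
    """Distribute the data: deal items round-robin into one shard per worker."""
    shards: List[List[str]] = [[] for _ in range(num_workers)]
    for i, item in enumerate(items):
        shards[i % num_workers].append(item)
    return shards


def count_words(shard: List[str]) -> Counter:
    """Distribute the work: each worker counts words in its own shard only."""
    counts: Counter = Counter()
    for line in shard:
        counts.update(line.lower().split())
    return counts


def distributed_word_count(lines: Sequence[str], num_workers: int = 4) -> Counter:
    """Scatter the lines to workers, then collate the partial counts."""
    shards = partition(lines, num_workers)
    # Threads are a stand-in for machines in this sketch.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(count_words, shards))
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


if __name__ == "__main__":
    print(distributed_word_count(["a b a", "b c", "a"], num_workers=2))
```

The same three steps (partition, compute locally, merge) underlie both a parallel database query and a MapReduce job; what changes at scale is mostly how failures and data movement are handled.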

I am looking for feedback on which aspects, if any, of these experiences would be interesting to the community. We could tailor the session accordingly.

Here are some possible focus areas, topics where we can delve deep:

i) Building a large data warehouse (using cloud computing): issues
How large is large? Do we need real-time answers to queries? Are the queries of the streaming variety (needing to look at only the latest data)? Depending on which tradeoffs are acceptable, different solutions can be assembled from a combination of Hadoop, Hive, Pig, and other FOSS.

ii) Building a feature-rich, ultra-fast web search (using cloud computing)
How does one build an ultra-fast search engine that also gives categorized results? How does one put together disparate media types in a search result? How does one rank these disparate media types?

iii) Building a large, scalable search backend (a system for crawling, indexing, annotating, categorizing etc.) for billions of URLs.