The Fifth Elephant 2012

Finding the elephant in the data.

What are your users doing on your website or in your store? How do you turn the piles of data your organization generates into actionable information? Where do you get complementary data to make yours more comprehensive? What tech, and what techniques?

The Fifth Elephant is a two day conference on big data.

Early Geek tickets are available from fifthelephant.doattend.com.

The proposal funnel below will enable you to submit a session and vote on proposed sessions. It is a good practice introduce yourself and share details about your work as well as the subject of your talk while proposing a session.

Each community member can vote for or against a talk. A vote from each member of the Editorial Panel is equivalent to two community votes. Both types of votes will be considered for final speaker selection.

It’s useful to keep a few guidelines in mind while submitting proposals:

  1. Describe how to use something that is available under a liberal open source license. Participants can use this without having to pay you anything.

  2. Tell a story of how you did something. If it involves commercial tools, please explain why they made sense.

  3. Buy a slot to pitch whatever commercial tool you are backing.

Speakers will get a free ticket to both days of the event. Proposers whose talks are not on the final schedule will be able to purchase tickets at the Early Geek price of Rs. 1800.

Hosted by

The Fifth Elephant - known as one of the best data science and Machine Learning conference in Asia - has transitioned into a year-round forum for conversations about data and ML engineering; data science in production; data security and privacy practices. more

Santosh Gannavarapu

Dianemo: Distributed task management to achieve faster throughput

Submitted Jun 28, 2012

There is always a need to run batch/offline jobs while utilizing resources efficiently. The initial attempts are to set aside pre-configured worker boxes that will pick up jobs either by a cron scheduler or by a pre-designated manager application that is responsible for kicking off processes. Assigning boxes statically would necessitate deploying boxes that may either run out of bandwidth or may be so under-utilized that you are paying for unnecessary CPU cycles.

The need of the hour is a distributed task management system that will distribute load uniformly across boxes whose CPU cycles are available while distributing the load such that every box in the fleet is efficiently used. Such a system should be tolerant of host failures resulting in high-availability.

Outline

After evaluating various open source schedulers we decided to build our own. While distributing load across boxes is the essential requirement in many there are subtle nuances such as honoring CPU load, distributed scheduling agent or priority that are missing in many.

Dainemo has more to give than just distributing load. Some of its features are listed below:

  • Achieving peer-to-peer distributed scheduler: The scheduler operates by deploying the agent across all the boxes set aside as worker boxes. Each agent picks up the job by consulting a common database of jobs. Thus each host behaves as a peer agnostic of other peers in the network. Two hosts wouldn’t know anything about each other. The jobs are currently time-based. The scheduler has a job Generator which checks at the various schedules and creates jobs that are later picked off by the agents.

  • Linear Scalability: The system achieves linear scalability and elasticity. Need for additional capacity is as easy as adding another host with the scheduling agent. Upon successful provisioning the box becomes operational by picking up unscheduled jobs. If there is excess capacity simply shutdown the agent and remove the host.
    Weighted Jobs: Each job will be assigned a designated number of pre-configured tokens. Similarly each host is assigned a set of tokens providing an indicator to the amount of resources available on the host. Each scheduling agent will pickup jobs amounting to the tokens available at any time.

  • Honoring the CPU load: The number of tokens available is directly proportional to the CPU load on the host. If the load on the box is high the tokens available are brought down to zero immediately thus disabling the scheduler from picking up any further jobs. This heuristic honors load spiked caused by any process running on the host, thus facilitating better CPU utilization.

  • Priority Based Scheduling: There are currently two types of priority in Dianemo priority 1 and priority 2. Priority 1 has a reserved set of tokens, thus giving scheduling time to priority 1 jobs first. Upon availability of any additional tokens priority 2 jobs are picked up. Priority 1 jobs are always picked up at the scheduled time allowing a configurable minimal grace period. Upon failures Priority 1 jobs are immediately marked failed. Priority 2 jobs are picked up upto a larger grace period. Upon failures priority 2 jobs are re-tried upto 3 times with retrial window being 30 minutes away.

  • Scheduling dependency jobs: A job can be marked a child of another job. Thus if the parent job is finished only then the child job is kicked off as long as the scheduled time is within proper limits.

  • Leader agent: A leader agent is selected at real-time within the available scheduling agents. The agent will check to see if there any jobs that aren’t picked up and will mark them as failed to run. If there are any jobs whose heartbeats haven’t been updated in a designated time then they are marked as UNREACHABLE.

Speaker bio

Santosh Gannavarapu is Co-Founding Geek & CTO of Sokrati. Sokrati is the leading Search Marketing platform in India.

Comments

{{ gettext('Login to leave a comment') }}

{{ gettext('Post a comment…') }}
{{ gettext('New comment') }}
{{ formTitle }}

{{ errorMsg }}

{{ gettext('No comments posted yet') }}

Hosted by

The Fifth Elephant - known as one of the best data science and Machine Learning conference in Asia - has transitioned into a year-round forum for conversations about data and ML engineering; data science in production; data security and privacy practices. more