Real-time Ingestion of logs into Hive with a low latency, to query and respond to events

Jul 2016

28 Thu 08:30 AM – 06:25 PM IST

29 Fri 08:30 AM – 06:15 PM IST

30 Sat 08:45 AM – 05:00 PM IST

31 Sun 08:15 AM – 06:00 PM IST

NIMHANS Convention Centre

The Fifth Elephant is India’s most renowned data science conference. It is a space for discussing some of the most cutting edge developments in the fields of machine learning, data science and technology that powers data collection and analysis.

Machine Learning, Distributed and Parallel Computing, and High-performance Computing continue to be the themes for this year’s edition of Fifth Elephant.

We are now accepting submissions for our next edition, which will take place in Bangalore on 28-29 July 2016.

#Tracks

We are looking for application level and tool-centric talks and tutorials on the following topics:

Deep Learning
Text Mining
Computer Vision
Social Network Analysis
Large-scale Machine Learning (ML)
Internet of Things (IoT)
Computational Biology
ML in healthcare
ML in education
ML in energy and ecology
ML in agriculture
Analytics for emerging markets
ML in e-governance
ML in smart cities
ML in defense

The deadline for submitting proposals is 30th April 2016

Format

This year’s edition spans two days of hands-on workshops and conference. We are inviting proposals for:

Full-length 40 minute talks.
Crisp 15-minute talks.
Sponsored sessions, 15 minute duration (limited slots available; subject to editorial scrutiny and approval).
Hands-on Workshop sessions, 3 and 6 hour duration.

Selection process

Proposals will be filtered and shortlisted by an Editorial Panel. We urge you to add links to videos / slide decks when submitting proposals. This will help us understand your past speaking experience. Blurbs or blog posts covering the relevance of a particular problem statement and how it is tackled will help the Editorial Panel better judge your proposals.

We expect you to submit an outline of your proposed talk – either in the form of a mind map or a text document or draft slides within two weeks of submitting your proposal.

We will notify you about the status of your proposal within three weeks of submission.

Selected speakers must participate in one or two rounds of rehearsals before the conference. This is mandatory and helps you prepare well for the conference.

There is only one speaker per session. Entry is free for selected speakers. As our budget is limited, we will prefer speakers from locations closer to home, but will do our best to cover anyone exceptional. HasGeek will provide a grant to cover part of your travel and accommodation in Bangalore. Grants are limited and made available to speakers delivering full sessions (40 minutes or longer).

Commitment to open source

HasGeek believes in open source as the binding force of our community. If you are describing a codebase for developers to work with, we’d like it to be available under a permissive open source licence. If your software is commercially licensed or available under a combination of commercial and restrictive open source licences (such as the various forms of the GPL), please consider picking up a sponsorship. We recognise that there are valid reasons for commercial licensing, but ask that you support us in return for giving you an audience. Your session will be marked on the schedule as a sponsored session.

Key dates and deadlines

Revised paper submission deadline: 17 June 2016
Confirmed talks announcement (in batches): 13 June 2016
Schedule announcement: 30 June 2016
Conference dates: 28-29 July 2016

##Venue
The Fifth Elephant will be held at the NIMHANS Convention Centre, Dairy Circle, Bangalore.

##Contact
For more information about speaking proposals, tickets and sponsorships, contact info@hasgeek.com or call +91-7676332020.

Hosted by

The Fifth Elephant

The Fifth Elephant, known as one of the best data science and machine learning conferences in Asia, has transitioned into a year-round forum for conversations about data and ML engineering, data science in production, and data security and privacy practices.

Submitted Mar 14, 2016

Section: Crisp talk

Technical level: Intermediate

The threat landscape is changing rapidly, and targeted attacks are becoming more and more common. Detecting such attacks requires a data-driven approach, which means processing petabytes of telemetry data (AV detections, system access logs, network statistics, etc.) received from endpoints, firewalls, gateways and similar sources.

Distributed systems like Apache Hadoop allow for such processing. However, providing near zero-day protection requires ingesting data as soon as it arrives. The traditional approach of MapReduce batch processing can provide very high throughput (number of events processed per second), but at the cost of increased latency, on the order of several minutes to a few hours. Apache Storm processes events in real time with very low latency (on the order of a few seconds), but it cannot be used to compute arbitrary functions on an arbitrary dataset in real time.

This has given rise to the “Lambda Architecture”, which implements big data systems by combining a “batch layer” (MapReduce) with a “speed layer” (real-time processing with Apache Storm).
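As a rough illustration of the idea, the sketch below merges a batch view and a speed view at query time. It is a toy model in plain Python, not Hadoop or Storm code; the function names, the `host` field and the sample events are all assumptions made up for this example.

```python
from collections import Counter

def batch_view(events):
    """Batch layer (toy): recompute per-host counts from the full historical dataset."""
    return Counter(e["host"] for e in events)

def speed_view(recent_events):
    """Speed layer (toy): per-host counts for events that arrived after the last batch run."""
    return Counter(e["host"] for e in recent_events)

def query(batch, speed):
    """Merge both views so the answer covers every event seen so far."""
    return batch + speed

# Hypothetical sample data: events already covered by the last batch run,
# and events that arrived since then.
historical = [{"host": "gw-1"}, {"host": "gw-1"}, {"host": "fw-2"}]
recent = [{"host": "gw-1"}, {"host": "ep-9"}]

merged = query(batch_view(historical), speed_view(recent))
print(dict(merged))
```

The batch view is complete but stale, and the speed view is fresh but covers only a short window. The query combines them to get both properties.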

In most use cases, Apache Hive serves as the “batch layer” application, executing MapReduce jobs from plain SQL queries. To support Hive queries, however, the data must be at rest on the distributed file system (HDFS) in a format that Hive understands. Traditionally, MapReduce jobs have been used to implement the data ingestion service that performs the ETL tasks of loading data into Hive. To support the “speed layer” of the Lambda Architecture, the data ingestion service must also meet the low-latency requirement. Overall, then, the ingestion service should accept incoming telemetry events in real time, perform the required formatting and cleansing, and then both send the processed stream of events to “speed layer” applications and ingest the events into Hive.
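The responsibilities listed above (accept, cleanse, fan out to both layers) can be sketched as follows. This is a minimal in-memory model, assuming JSON-encoded events with `host` and `type` fields; the field names, the `cleanse` and `ingest` helpers and the two sink callbacks are hypothetical and stand in for the real Storm and Hive integrations.

```python
import json

def cleanse(raw_line):
    """Parse one raw telemetry line; return a normalized dict, or None if unusable."""
    try:
        event = json.loads(raw_line)
    except json.JSONDecodeError:
        return None
    if not isinstance(event, dict) or "host" not in event or "type" not in event:
        return None
    return {
        "host": str(event["host"]).strip().lower(),
        "type": str(event["type"]).strip(),
    }

def ingest(raw_lines, speed_sink, hive_sink):
    """Send every clean event to both the speed layer and the Hive ingestion path."""
    accepted = 0
    for line in raw_lines:
        event = cleanse(line)
        if event is None:
            continue  # drop malformed or incomplete input
        speed_sink(event)
        hive_sink(event)
        accepted += 1
    return accepted

# Hypothetical input: one valid event, one non-JSON line, one event missing "host".
speed_events, hive_events = [], []
raw = ['{"host": " GW-1 ", "type": "av_detection"}', "not json", '{"type": "login"}']
count = ingest(raw, speed_events.append, hive_events.append)
print(count, speed_events)
```

Passing the sinks in as callables keeps the cleansing logic independent of where events end up, so the same step can feed both layers.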

To meet the low-latency requirement, the natural choice for implementing the data ingestion service is Apache Storm, since it supports real-time processing and can stream events to Hive using the HCatalog streaming API. However, our tests and research indicate that although Apache Storm delivers the required low latency, its overall throughput (number of events stored per second) when ingesting events into Hive is low compared to MapReduce jobs, due to limitations of the HCatalog streaming API and the Hive Metastore.

The technique presented combines Apache Storm and MapReduce into a “hybrid data ingestion pipeline” that supports the requirements of both “speed layer” and “batch layer” applications, while also achieving the high throughput required for ingestion into Hive.
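The abstract does not describe how the hybrid pipeline works internally, so the sketch below is only one plausible reading and not the speaker's implementation. It assumes each event is forwarded to the speed layer immediately, while the same events are buffered and handed to a bulk loader in larger batches. The `HybridIngestor` class, its callbacks and the `batch_size` parameter are all hypothetical.

```python
class HybridIngestor:
    """Toy model: a per-event low-latency path plus a buffered bulk-load path."""

    def __init__(self, speed_sink, bulk_loader, batch_size):
        self.speed_sink = speed_sink    # stands in for the streaming (Storm) path
        self.bulk_loader = bulk_loader  # stands in for a batch (MapReduce) load into Hive
        self.batch_size = batch_size
        self.buffer = []

    def on_event(self, event):
        self.speed_sink(event)          # forward immediately for low latency
        self.buffer.append(event)       # accumulate for a high-throughput bulk load
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Hand the buffered events to the bulk loader as one batch."""
        if self.buffer:
            self.bulk_loader(list(self.buffer))
            self.buffer.clear()

# Hypothetical run: seven events with a batch size of three.
speed_seen, bulk_loads = [], []
ingestor = HybridIngestor(speed_seen.append, bulk_loads.append, batch_size=3)
for i in range(7):
    ingestor.on_event({"id": i})
ingestor.flush()  # push the final partial batch

print(len(speed_seen), [len(batch) for batch in bulk_loads])
```

In this model, the batch size trades off how quickly data becomes queryable in the batch store against how much per-load overhead is amortized over each write.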

Outline

Current security scenario
Ways to Ingest Data

Streaming Data Ingest
Batch Data Ingest
Introducing the hybrid ingestion technique using the best of both

Usage in conjunction with machine learning techniques in the security domain

Requirements

Speaker bio

Pallav has been working with the Symantec Cyber Security Services Group, focusing mainly on identifying targeted attacks using big data analytics. Over the years he has been involved in multiple architectural engagements, including architecture assessments, proofs of concept, reviews, analysis and product selection.

Slides

https://www.slideshare.net/secret/LaAkLuo4OOZX57