The Fifth Elephant 2016

India's most renowned data science conference

Distributed change data capture platform

Submitted by Chandraprakash Bhagtani (@cpbhagtani) on Tuesday, 14 June 2016

Technical level

Intermediate

Section

Full talk

Status

Submitted

Abstract

Data processing has moved from classical data-warehouse batch reporting to real-time processing and analytics. RDBMS (OLTP) data is one kind of data needed for analysis and for deriving business insights. The traditional way of ingesting RDBMS data into an analytical system (Hadoop etc.) is bulk import or query-based ingestion. This approach has the following issues:

– A process that periodically copies a snapshot of the entire source consumes too much time and resources.

– Alternative approaches that rely on timestamp columns, triggers, or complex queries often hurt performance and increase complexity.

– Source tables need many changes to their schema and surrounding systems to support data copy, for example CDC columns or a read slave.

– Periodic bulk copy also stresses the network and other resources. Usage follows a sinusoidal pattern: these processes create huge spikes, and the systems sit idle the rest of the time.

– Since really old DB schemas weren't designed with data ingestion in mind, hard deletes and updates without CDC timestamps are a common occurrence. This results in losing facts.
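
The loss of facts under query-based ingestion can be illustrated with a small sketch. It assumes a hypothetical `orders` table with an `updated_at` column and a hypothetical `poll_changes` helper, and it uses an in-memory SQLite database purely for demonstration; none of these names come from the talk.

```python
import sqlite3


def poll_changes(conn, last_seen):
    """Query-based ingestion: fetch rows modified since the previous poll."""
    return conn.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY id",
        (last_seen,),
    ).fetchall()


# Hypothetical source table with a CDC-style timestamp column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "new", 100), (2, "new", 100)],
)

# The first poll sees both rows.
first_poll = poll_changes(conn, last_seen=0)

# Between polls, row 1 changes twice and row 2 is hard-deleted.
conn.execute("UPDATE orders SET status = 'paid', updated_at = 110 WHERE id = 1")
conn.execute("UPDATE orders SET status = 'shipped', updated_at = 120 WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 2")

# The second poll returns only the latest image of row 1.
second_poll = poll_changes(conn, last_seen=100)
print(second_poll)  # [(1, 'shipped', 120)]
```

The second poll never reports the intermediate `paid` state or the deletion of row 2, so a downstream copy would keep a stale row 2 indefinitely. A log-based capture would see all three changes as separate events.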

We have built R3D3, a source-agnostic distributed change data capture platform. The platform handles thousands of CDC events per second per server and supports strong look-back capabilities and a subscription model. By providing a source-DB-agnostic CDC schema (Avro), it offers pluggable replication to secondary storage such as Hive, HBase, and Cassandra. In addition, through its rich subscription model, R3D3 allows both batch ingestion and real-time streaming on top of a single pipeline. Additional features include:

– Replication in near real time.

– No extra pressure on the source RDBMS, and no CDC column is required in the tables.

– Fault tolerance, at-least-once semantics, and order guarantees.

– Replays in case of failure.

– Schema evolution.

– Safeguarding of PII and sensitive data via encryption/masking driven by classification metadata.

– Real-time publishing of auditing/metrics events, and dashboarding.

– Bootstrapping a table (getting its history).
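
At-least-once delivery combined with replays means a consumer may see the same change event more than once. One common way to keep the replica correct in that case is to make the apply step idempotent by tracking the last applied log position per row. The sketch below illustrates that general technique only; `ChangeEvent`, `ReplicaTable`, and every field name are hypothetical and are not R3D3's actual schema or implementation.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ChangeEvent:
    # Hypothetical event envelope; field names are assumptions.
    table: str
    key: int             # primary key of the changed row
    seq: int             # position in the source change log, strictly increasing
    op: str              # "insert", "update" or "delete"
    row: Optional[dict]  # row image after the change; None for deletes


class ReplicaTable:
    """In-memory stand-in for a secondary store."""

    def __init__(self):
        self.rows: Dict[int, dict] = {}
        self.applied_seq: Dict[int, int] = {}

    def apply(self, event: ChangeEvent) -> bool:
        """Apply an event once; return False for duplicates or stale replays."""
        if event.seq <= self.applied_seq.get(event.key, -1):
            return False
        if event.op == "delete":
            self.rows.pop(event.key, None)
        else:
            self.rows[event.key] = dict(event.row)
        # The position is kept even after a delete, so a replayed
        # older insert cannot resurrect the row.
        self.applied_seq[event.key] = event.seq
        return True


events = [
    ChangeEvent("orders", 1, 1, "insert", {"status": "new"}),
    ChangeEvent("orders", 2, 2, "insert", {"status": "new"}),
    ChangeEvent("orders", 1, 3, "update", {"status": "paid"}),
    ChangeEvent("orders", 2, 4, "delete", None),
]

# Deliver the first three events, then simulate a failure and a replay
# that restarts from the second event.
delivered = events[:3] + events[1:]

replica = ReplicaTable()
results = [replica.apply(e) for e in delivered]
print(results)       # [True, True, True, False, False, True]
print(replica.rows)  # {1: {'status': 'paid'}}
```

The two replayed events are skipped, and the final state matches what a single in-order pass would produce. This relies on events for the same key arriving in log order, which is why order guarantees and at-least-once semantics are listed together.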

Outline

In this talk, I will cover the following:
1. Design and architecture.
2. How we achieve order guarantees and at-least-once semantics.
3. Bootstrapping a table, i.e. getting history and reconciliation.
4. Auditing and metrics.
5. Schema evolution.
6. Security and metadata.
7. Challenges faced.
8. Future enhancements.

Speaker bio

Chandra has 8 years of experience in big data systems and works as a staff engineer at Intuit.

Comments

  • Rajeev Kumar (@rajeevkumarjha) 2 years ago

    Looks like a very interesting talk
