The Fifth Elephant 2016

India's most renowned data science conference

Chandraprakash Bhagtani

@cpbhagtani

Distributed change data capture platform

Submitted Jun 14, 2016

Today’s processing systems have moved from classical data-warehouse batch reporting to real-time processing and analytics. RDBMS (OLTP) data is one of the key data sources required for analysis and for deriving business insights. The traditional way of ingesting RDBMS data into an analytical system (Hadoop, etc.) is via bulk import or query-based ingestion. This approach has the following issues:

-- A process that periodically copies a snapshot of the entire source consumes too much time and resources.

-- Alternate approaches that include timestamp columns, triggers, or complex queries often hurt performance and increase complexity.

-- Source tables need significant changes to their schema and surrounding systems to support data copy, for example adding CDC columns or setting up a read slave.

-- Periodic bulk copy also stresses the network and other resources. Usage follows a sinusoidal wave: these processes create huge spikes, and the rest of the time the systems sit idle.

-- Since older DB schemas weren’t designed with data ingestion in mind, hard deletes and updates without CDC timestamps are a common occurrence, which results in losing facts (see the query-based pull sketch after this list).
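
To make the last two issues concrete, here is a minimal sketch of a typical query-based incremental pull. The table and column names (`orders`, `updated_at`) are hypothetical and the snippet assumes a generic DB-API style connection; it only illustrates why this style of ingestion misses hard deletes and rows updated without a reliable CDC timestamp.

```python
import sqlite3  # stand-in for any DB-API connection to the source RDBMS


def pull_increment(conn, last_watermark):
    """Naive query-based ingestion: pull rows modified since the last run.

    Problems this sketch makes visible:
      * Hard-deleted rows never match the predicate, so deletes are lost.
      * Rows updated without touching `updated_at` are silently skipped.
      * The scan runs against the production table, adding load on every cycle.
    """
    cur = conn.execute(
        "SELECT id, status, updated_at FROM orders WHERE updated_at > ?",
        (last_watermark,),
    )
    rows = cur.fetchall()
    # New watermark is the max timestamp seen; anything missed stays invisible.
    new_watermark = max((r[2] for r in rows), default=last_watermark)
    return rows, new_watermark
```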

We have built R3D3, a source-agnostic, distributed change data capture platform. It handles thousands of CDC events per second per server and supports strong look-back capabilities and a subscription model. By providing a source-DB-agnostic CDC schema (Avro), it offers pluggable replication to any kind of secondary storage (Hive, HBase, Cassandra). In addition, the rich subscription model lets R3D3 support both batch ingestion and real-time streaming on top of a single pipeline (a sketch of such an event envelope and subscriber follows the feature list below). Additional features include:

-- Replication in near real time.

-- No extra pressure on the source RDBMS, and no CDC column is required on tables.

-- Fault tolerance, at-least-once semantics, and ordering guarantees.

-- Replays in case of failure.

-- Schema evolution.

-- Safeguarding of PII and sensitive data via encryption/masking, driven by classification metadata.

-- Real-time publishing of auditing/metrics events, and dashboarding.

-- Bootstrapping a table (getting its history).
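
As a rough illustration of the source-agnostic CDC schema and the subscription model mentioned above, the sketch below defines a generic change-event envelope (Avro-style, expressed here as a plain Python dict) and a minimal subscriber that fans the same event out to different sinks. All field and sink names are assumptions for illustration; this is not the actual R3D3 schema or API.

```python
import json

# Hypothetical source-agnostic CDC envelope, written as an Avro-style record.
# Field names are illustrative; the real R3D3 schema may differ.
CDC_EVENT_SCHEMA = {
    "type": "record",
    "name": "ChangeEvent",
    "fields": [
        {"name": "source_db", "type": "string"},   # e.g. "orders_mysql"
        {"name": "table", "type": "string"},
        {"name": "op", "type": {"type": "enum", "name": "Op",
                                "symbols": ["INSERT", "UPDATE", "DELETE"]}},
        {"name": "commit_ts", "type": "long"},     # source commit time (ms)
        {"name": "sequence", "type": "long"},      # per-table ordering key
        {"name": "before", "type": ["null", "string"], "default": None},
        {"name": "after", "type": ["null", "string"], "default": None},
    ],
}


def handle_event(event, sinks):
    """Fan a single change event out to every subscribed sink.

    `sinks` is a list of callables (e.g. writers for Hive, HBase, Cassandra);
    each receives the same source-agnostic envelope, which is what makes the
    replication pluggable.
    """
    for sink in sinks:
        sink(event)


if __name__ == "__main__":
    sample = {
        "source_db": "orders_mysql",
        "table": "orders",
        "op": "UPDATE",
        "commit_ts": 1465900000000,
        "sequence": 42,
        "before": json.dumps({"id": 7, "status": "NEW"}),
        "after": json.dumps({"id": 7, "status": "SHIPPED"}),
    }
    handle_event(sample, sinks=[lambda e: print("hive <-", e["op"], e["table"])])
```

Because every sink consumes the same envelope, a batch loader and a real-time streaming consumer can subscribe to the same pipeline without the source database knowing or caring about either.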

Outline

In this talk, I will cover the following:

  1. Design and architecture.
  2. How we achieve order guarantees and at-least-once semantics.
  3. Bootstrapping a table, i.e. getting history and reconciliation.
  4. Auditing and metrics.
  5. Schema evolution.
  6. Security and metadata.
  7. Challenges faced.
  8. Future enhancements.

Speaker bio

Chandra has 8 years of experience in big data systems. He works as a staff engineer at Intuit.
