The Fifth Elephant 2015

A conference on data, machine learning, and distributed and parallel computing

Machine Learning, Distributed and Parallel Computing, and High-performance Computing are the themes for this year’s edition of Fifth Elephant.

The deadline for submitting a proposal is 15th June 2015

We are looking for talks and workshops from academics and practitioners who are in the business of making sense of data, big and small.

Track 1: Discovering Insights and Driving Decisions

This track is about general, novel, fundamental, and advanced techniques for making sense of data and driving decisions from data. This could encompass applications of the following ML paradigms:

  • Statistical Visualizations
  • Unsupervised Learning
  • Supervised Learning
  • Semi-Supervised Learning
  • Active Learning
  • Reinforcement Learning
  • Monte-carlo techniques and probabilistic programming
  • Deep Learning

Across various data modalities including multi-variate, text, speech, time series, images, video, transactions, etc.

Track 2: Speed at Scale

This track is about tools and processes for collecting, indexing, and processing vast amounts of data. The theme includes:

  • Distributed and Parallel Computing
  • Real Time Analytics and Stream Processing
  • MapReduce and Graph Computing frameworks
  • Kafka, Spark, Hadoop, MPI
  • Stories of parallelizing sequential programs
  • Cost/Security/Disaster Management of Data

Commitment to Open Source

HasGeek believes in open source as the binding force of our community. If you are describing a codebase for developers to work with, we’d like it to be available under a permissive open source license. If your software is commercially licensed or available under a combination of commercial and restrictive open source licenses (such as the various forms of the GPL), please consider picking up a sponsorship. We recognize that there are valid reasons for commercial licensing, but ask that you support us in return for giving you an audience. Your session will be marked on the schedule as a sponsored session.

Workshops

If you are interested in conducting a hands-on session on any of the topics falling under the themes of the two tracks described above, please submit a proposal under the workshops section. We also need you to tell us about your past experience in teaching and/or conducting workshops.

Hosted by

The Fifth Elephant - known as one of the best data science and Machine Learning conference in Asia - has transitioned into a year-round forum for conversations about data and ML engineering; data science in production; data security and privacy practices. more

Srihari Sriraman

@ssrihari

postgres clusters and their nuances

Submitted Jun 15, 2015

We built a postgres cluster using repmgr to serve 2k requests per second, and store 5G of data per day. You’ll learn about postgres’ WAL replication and archival, how repmgr works, how we leveraged it for our needs, hooked it up to our application, and built multiple lines of defence in case something bad happens. And oh, we’ll also compare it with RDS for good measure.

Outline

Because of the nature of our application (experimentation), it was crucial for us to protect the integrity and prevent against loss of data. Hence, we needed a quick failover mechanism in place.

There isn’t any out of the box postgres cluster management solution that fit our need. RDS at that time didn’t have read replica support for postgres. We built this mechanism ourselves using repmgr.

The talk would cover details on multiple axes, but gere are a few things we did around this as a gist:

  • Evaluated postgres latencies on the same box and different box.
  • Set up a postgres cluster using repmgr on vagrant.
  • Multiple lines of defence in case of a failover:
    • Repmgr runs a script to notify applications (all running instances) about failover.
    • Application polls to check if it can write to the master DB. And attemps to find the master from the cluster if not.
  • We load tested this, and were able to achieve a maximum downtime of 5s.

Note: This will be a talk by two people: Srihari and Nivedita. That’s okay right? :)

-- Update
Here’s a gist version of the below for more on the talk: https://gist.github.com/ssrihari/6443dbc0d4da4a91ba48

Why

  • Do you even need a cluster of postgreses? Why did we choose to create a cluster ourselves?
  • Why didn’t RDS work for us? Why it might not work for you.
  • What are other RDS like services out there?

Postgres side of things:

  • What you need to know within postgres so your data can be streamed
    down to multiple nodes quick, and efficiently.

Hardware

  • Can you do without an SSD? Are writing fast enough? How to find out?
  • Will performance degrade when you’re writing WALs?
  • How fast do you need to ship WALs to another machine?

WAL (Write Ahead Logging)

  • What is it? How does it work?
  • How to configure it for a cluster - hot_standby, wal_keep_segments, etc.

WAL Archives

  • Why do you need it when you have streaming replication?
  • How do you set the archive and streaming delays?

How does this impact normal workflows?

  • Large deletes or updates can blot WALs
  • Read-only standbys, large queries and timeouts

Repmgr side of things:

  • What does this magic replication manager do? What I learnt by
    digging into it’s C code.

What does a cluster of postgres machines look like?

  • What are ‘master’, ‘standbys’, ‘witnesses’, and ‘failed’ nodes
  • Cascading, and Synchronous replication
  • priorities, configurations

Into the guts of ‘how’ repmgr works

  • The consensus algorithm it uses
  • repl_nodes - a table of all the machines in the cluster, their configs, statuses
  • repl_status - how far is each machine behind the master?
  • repl_events - what’s happening in the cluster along a timeline

The things repmgr does for your cluster

  • well, it exists. not a single open tool (save pgpool) out there that’s as good.
  • help setup, configure the relationships between postgres boxes
  • ‘Clone’, ‘Register’, ‘Follow’, ‘Promote’ commands
  • monitor for failure, elect a master, automate failover
  • How to tap into the promote and failover scenarios?

The things repmgr doesn’t

  • It doesn’t talk to your application. it doesn’t act as a load
    balancer. or a single entry point to your cluster.
  • Tell you when a failover happens. You need to rig that up yourself.
  • Give you extensive support or documentation. But, you have the source.

Failovers

How things can go wrong

  • Multi-master / split brain / STONITH
  • No master
  • Network disconnect, Disk full

What happens when a failover happens?

  • How long does it take? How do you know which is the new master?
  • How do we recover a failed DB? How long does it take?
  • How do we add a new machine to the cluster?
  • How do we remove stale node entries?

When things have gone wrong

  • What measures to take to ensure you’re equipped for bad scenarios.
  • How do you save data when you know you’re going to lose it?
  • Deletes/Truncates cascade. How do you free disk space?
  • When is your free disk space critically low?
  • How to recover from WAL archives? Logical backups? Filesystem backups?
  • Which backup/restore mechanism will work best for you?

Wrapping up the big picture

  • What does my application do when there’s a failover?
  • How do the db-connections move to the new master?
  • How can the application, and repmgr work together for maximum robustness?
  • Alerting, Monitoring, Maintenance

Speaker bio

Nid and Srihari are developers at Nilenso, a software cooperative in Bangalore.

Slides

https://gist.github.com/ssrihari/6443dbc0d4da4a91ba48#file-postgres-clusters-nuances-png

Comments

{{ gettext('Login to leave a comment') }}

{{ gettext('Post a comment…') }}
{{ gettext('New comment') }}
{{ formTitle }}

{{ errorMsg }}

{{ gettext('No comments posted yet') }}

Hosted by

The Fifth Elephant - known as one of the best data science and Machine Learning conference in Asia - has transitioned into a year-round forum for conversations about data and ML engineering; data science in production; data security and privacy practices. more