Rootconf Mini 2024

Geeking out on systems and security since 2012

Snehasish Roy

@snehasishroy

Zero Downtime, Zero Compromise: How PhonePe's DocStore Handles Billions of Documents

Submitted Oct 28, 2024

Overview

Ever wondered what happens when millions of PhonePe users share documents, buy insurance, or upload KYC information? Enter DocStore - the powerhouse behind PhonePe’s massive document operations. This home-grown object storage platform seamlessly handles thousands of critical transactions, from instant chat attachments to vital insurance documents, powering both customer experiences and developer platforms.

Picture this: Every time you send a photo in PhonePe chat, submit documents for insurance, or interact with our developer portals, you’re tapping into DocStore’s capabilities. It’s not just a storage system - it’s the digital vault that safeguards and serves documents for India’s leading fintech platform, processing terabytes of data while ensuring bank-grade security and lightning-fast accessibility.

Unlike most fintech players who rely on public clouds, PhonePe took the road less traveled - building everything on our private cloud infrastructure. But with great control comes greater responsibility: how do you ensure zero downtime when handling billions of critical documents across multiple data centers?

Our answer came in the form of an Active-Passive architecture. Using GlusterFS geo-replication and our custom-built Elasticsearch replication plugin, we’ve created a system that stays resilient even when an entire data center goes dark. Join us as we unveil the challenges we tackled, the solutions we crafted, and the lessons we learned while building DocStore.
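
To make the Active-Passive idea concrete, here is a minimal sketch in Python. It is not DocStore's implementation: the names `DataCenter` and `ActivePassiveRouter` are hypothetical, health is a simple flag rather than a real health check, and data replication (GlusterFS geo-replication, the Elasticsearch plugin) is out of scope. It only shows the routing decision: all traffic goes to one active site, and the passive site is promoted when the active one fails.

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    """Hypothetical stand-in for a site; `healthy` would come from real health checks."""
    name: str
    healthy: bool = True


class ActivePassiveRouter:
    """Sends all traffic to the active site and promotes the passive site on failure."""

    def __init__(self, active: DataCenter, passive: DataCenter) -> None:
        self.active = active
        self.passive = passive

    def route(self) -> DataCenter:
        # Fail over only when the active site is down and the passive site is up.
        if not self.active.healthy and self.passive.healthy:
            self.active, self.passive = self.passive, self.active
        if not self.active.healthy:
            raise RuntimeError("no healthy data center available")
        return self.active
```

In this sketch a recovered site does not automatically take traffic back; it stays passive until the current active site fails. That is one possible design choice, assumed here for simplicity.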

Agenda

  • Design and architecture for petabyte-scale storage.
  • Tech stack: GlusterFS, Elasticsearch, RabbitMQ, Aerospike.
  • Challenges faced in GlusterFS geo-replication.
  • Challenges faced in enabling replication in an Elasticsearch cluster.

Takeaways

  • Build vs Buy
  • Dos and don’ts for managing infrastructure at scale
  • Critical considerations for implementing fault tolerance
  • Practical insights for large-scale storage systems

Audience

  • Site Reliability and DevOps Engineers
  • Engineering leaders
  • Cloud architects and engineers
  • Teams building large-scale storage solutions
