Choosing datastores

Guide on how to select datastores to solve different problems

Accepting submissions till 15 Aug 2021, 11:59 PM

How do you select datastores while staying aware of their limitations for the problem at hand? Are there misconceptions you wish someone had cleared up for you when you started on your journey of scaling with datastores?

The "Choosing datastores for your use cases" conference will help you understand:

  • Running datastores at scale, including tuning, debugging and operations.
  • Solving specific use cases with a particular datastore.
  • Data modelling and developer experience with a datastore.

Senior infrastructure and software engineers from Farfetch, Aerospike, Zeotap, eightfold.ai, LinkedIn and Tesco engineering will share war stories and their learnings with practitioners in the audience.

View schedule at https://hasgeek.com/rootconf/choosing-datastores/schedule

Contact information: Join the Rootconf Telegram group at https://t.me/rootconf or follow @rootconf on Twitter.
For inquiries, contact Rootconf at rootconf.editorial@hasgeek.com or call 7676332020.

Hosted by

Rootconf is a community-funded platform for activities and discussions on the following topics: Site Reliability Engineering (SRE); infrastructure costs, including cloud costs and their optimization; and security, including cloud security.

Kalyanasundaram Somasundaram

@ksomasun

Online Data Stores at LinkedIn and their Evolution

Submitted Aug 14, 2021

Abstract
“In its early days, the LinkedIn data ecosystem was quite simple. A single RDBMS contained a handful of tables for user data such as profiles, connections, etc. This RDBMS was augmented with two specialized systems: one provided full text search of the corpus of user profile data, the other provided efficient traversal of the relationship graph. These latter two systems were kept up-to-date by Databus, a change capture stream that propagates writes to the RDBMS primary data store, in commit order, to the search and graph clusters. Over the years, as LinkedIn evolved, so did its data needs.”

The above is an excerpt from LinkedIn’s Espresso paper, published in 2013. At that time, LinkedIn had 200 million users worldwide. After the growth phase that followed, the user base today is roughly 4x that number, alongside ever-increasing user engagement and new feature rollouts. During this growth phase, LinkedIn’s data systems evolved for each of our use cases. In this talk, we will attempt to give a glimpse of our online storage ecosystem and its evolution.
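
The change capture idea in the excerpt (writes to the primary store propagated, in commit order, to derived systems) can be illustrated with a small sketch. This is not Databus code: the `Change` record, its `scn` (commit sequence number) field, and the `Subscriber` class are hypothetical names chosen for illustration, and the sketch assumes the primary assigns each committed write a strictly increasing sequence number.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Change:
    scn: int                 # hypothetical commit sequence number from the primary
    key: str
    value: Optional[dict]    # None marks a delete


class Subscriber:
    """A derived system (for example a search or graph index) fed by the stream."""

    def __init__(self):
        self.state = {}
        self.last_scn = 0    # highest sequence number applied so far

    def apply(self, change):
        if change.scn <= self.last_scn:
            return           # already applied, so replaying the log is safe
        if change.value is None:
            self.state.pop(change.key, None)
        else:
            self.state[change.key] = change.value
        self.last_scn = change.scn


def replay(log, subscriber):
    """Apply changes in commit order, regardless of arrival order."""
    for change in sorted(log, key=lambda c: c.scn):
        subscriber.apply(change)
```

Tracking the last applied sequence number is one simple way to make replay idempotent, so a subscriber that restarts can re-read the stream from an earlier point without corrupting its state.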

Online data systems such as Oracle and MySQL evolved from single-datacenter to multi-datacenter deployments.
In addition to these relational systems, the online storage fleet today houses:

  • Custom NoSQL cluster(s)

    • Espresso is LinkedIn’s NoSQL cluster
    • It is sharded and supports secondary indexes
    • It serves on the order of millions (O(M)) of queries per second
  • Derived Data Store(s)

    • It can be prudent to precompute and transform data from one form to another, so that other systems can read the transformed data directly
    • Serves the transformed data for low-latency use cases
  • BLOB Storage

    • Distributed file storage, similar to Azure Blob Storage and AWS S3
    • Data is immutable
    • Supports replication for consistency, and cross-colo reads for read-after-write consistency
  • Couchbase

    • De facto caching solution for our source-of-truth databases
  • OLAP system

    • Supports analytics on real-time and offline data stored as segments
    • Segments are time-partitioned data
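
Two of the ideas in the list above, a sharded store with a secondary index and a cache in front of a source-of-truth database, can be sketched in a few lines. This is a toy illustration under stated assumptions, not how Espresso or Couchbase are implemented: `ShardedStore`, `CacheAside`, the CRC32-based shard choice, and the `company` field are all hypothetical.

```python
import zlib


class ShardedStore:
    """Toy source-of-truth store: documents are hash-partitioned by primary key."""

    def __init__(self, num_shards=4, indexed_field="company"):
        self.indexed_field = indexed_field
        self.shards = [{} for _ in range(num_shards)]
        # One local secondary index per shard: field value -> set of primary keys.
        self.indexes = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        return zlib.crc32(key.encode("utf-8")) % len(self.shards)

    def put(self, key, doc):
        sid = self._shard_for(key)
        old = self.shards[sid].get(key)
        if old is not None and self.indexed_field in old:
            # Drop the stale index entry before writing the new document.
            self.indexes[sid][old[self.indexed_field]].discard(key)
        self.shards[sid][key] = doc
        if self.indexed_field in doc:
            self.indexes[sid].setdefault(doc[self.indexed_field], set()).add(key)

    def get(self, key):
        return self.shards[self._shard_for(key)].get(key)

    def find_by_index(self, value):
        # Scatter-gather: a secondary-index query has to visit every shard.
        keys = set()
        for index in self.indexes:
            keys |= index.get(value, set())
        return sorted(keys)


class CacheAside:
    """Cache-aside wrapper: read from cache, fall back to the store on a miss."""

    def __init__(self, store):
        self.store = store
        self.cache = {}
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        self.misses += 1
        doc = self.store.get(key)
        if doc is not None:
            self.cache[key] = doc
        return doc

    def put(self, key, doc):
        self.store.put(key, doc)
        self.cache.pop(key, None)  # invalidate so the next read refetches
```

The sketch makes one trade-off visible: point reads by primary key touch a single shard, while a secondary-index lookup fans out to all of them.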

Glue for Data Systems

  • Cluster Manager/State Machine

    • A quorum makes sure that the monitoring and state map stay consistent
    • Helix initiates state transitions to converge to the ideal state, and also acts as partition allocator and job scheduler
  • Provisioner

    • Based on user requirements, the provisioner allocates resources in a cluster that runs the required stack
    • Each service has to track its cost to serve, in order to put an upper bound on resource utilization in the multi-tenant infrastructure
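
The "converge to ideal state" behaviour described for the cluster manager can be sketched as a reconciliation loop. This does not use the Helix API: the three-state OFFLINE/SLAVE/MASTER model, the `NEXT_STEP` table, and the function names are assumptions made for illustration, and the ordering constraints a real controller would enforce (such as never having two masters for a partition) are left out.

```python
# Next legal single step from a replica's current state toward its target state,
# in a hypothetical three-state model.
NEXT_STEP = {
    ("OFFLINE", "SLAVE"): "SLAVE",
    ("OFFLINE", "MASTER"): "SLAVE",    # must pass through SLAVE first
    ("SLAVE", "MASTER"): "MASTER",
    ("SLAVE", "OFFLINE"): "OFFLINE",
    ("MASTER", "SLAVE"): "SLAVE",
    ("MASTER", "OFFLINE"): "SLAVE",    # must be demoted before going offline
}


def plan_transitions(current, ideal):
    """Return (partition, node, from_state, to_state) steps, one per lagging replica."""
    steps = []
    for partition in sorted(set(ideal) | set(current)):
        have_map = current.get(partition, {})
        want_map = ideal.get(partition, {})
        for node in sorted(set(have_map) | set(want_map)):
            have = have_map.get(node, "OFFLINE")
            want = want_map.get(node, "OFFLINE")
            if have != want:
                steps.append((partition, node, have, NEXT_STEP[(have, want)]))
    return steps


def converge(current, ideal, max_rounds=10):
    """Apply planned steps round by round until nothing is left to change."""
    state = {p: dict(nodes) for p, nodes in current.items()}
    rounds = 0
    while rounds < max_rounds:
        steps = plan_transitions(state, ideal)
        if not steps:
            break
        for partition, node, _, to_state in steps:
            state.setdefault(partition, {})[node] = to_state
        rounds += 1
    return state, rounds
```

For example, moving mastership of partition `p0` from node `n1` to node `n2` takes two rounds in this model: `n1` is demoted while `n2` comes online as a slave, and then `n2` is promoted.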

All these components form the online storage stack for LinkedIn. Each one has a unique use case, and we strongly believe that “one size fits all” isn’t true in the data realm!
