BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//HasGeek//NONSGML Funnel//EN
DESCRIPTION:Guide on how to select datastores to solve different problems
X-WR-CALDESC:Guide on how to select datastores to solve different problems
NAME:Choosing datastores
X-WR-CALNAME:Choosing datastores
REFRESH-INTERVAL;VALUE=DURATION:PT12H
SUMMARY:Choosing datastores
TIMEZONE-ID:Asia/Kolkata
X-PUBLISHED-TTL:PT12H
X-WR-TIMEZONE:Asia/Kolkata
BEGIN:VEVENT
SUMMARY:Introduction to the conference\; why Data Stores?
DTSTART:20210903T063000Z
DTEND:20210903T064000Z
DTSTAMP:20260421T150714Z
UID:session/DioiwBNHXCGoQqhcVPW3Qk@hasgeek.com
SEQUENCE:0
CREATED:20210715T085755Z
LAST-MODIFIED:20210827T061730Z
LOCATION:Online
ORGANIZER;CN=Rootconf:MAILTO:no-reply@hasgeek.com
BEGIN:VALARM
ACTION:display
DESCRIPTION:Introduction to the conference\; why Data Stores? in 5 minutes
TRIGGER:-PT5M
END:VALARM
END:VEVENT
BEGIN:VEVENT
SUMMARY:Flow of the conference and house-keeping rules
DTSTART:20210903T064000Z
DTEND:20210903T065000Z
DTSTAMP:20260421T150714Z
UID:session/RgdtTB6AcyaC2g9NtzBVnq@hasgeek.com
SEQUENCE:0
CREATED:20210827T061746Z
LAST-MODIFIED:20210902T054355Z
LOCATION:Online
ORGANIZER;CN=Rootconf:MAILTO:no-reply@hasgeek.com
BEGIN:VALARM
ACTION:display
DESCRIPTION:Flow of the conference and house-keeping rules in 5 minutes
TRIGGER:-PT5M
END:VALARM
END:VEVENT
BEGIN:VEVENT
SUMMARY:Online data stores at LinkedIn and their evolution.
DTSTART:20210903T065000Z
DTEND:20210903T072000Z
DTSTAMP:20260421T150714Z
UID:session/hrWUKw247XnqN3N5k28R5@hasgeek.com
SEQUENCE:2
CATEGORIES:30 min talk,Talk scheduled for pre-recording
CREATED:20210831T045939Z
DESCRIPTION:Abstract\n“In its early days\, the LinkedIn data ecosystem
  was quite simple. A single RDBMS contained a handful of tables for
  user data such as profiles\, connections\, etc. This RDBMS was
  augmented with two specialized systems: one provided full-text search
  of the corpus of user profile data\, the other provided efficient
  traversal of the relationship graph. These latter two systems were
  kept up-to-date by Databus\, a change capture stream that propagates
  writes to the RDBMS primary data store\, in commit order\, to the
  search and graph clusters. Over the years\, as LinkedIn evolved\, so
  did its data needs.”\n\nThe above is an excerpt from LinkedIn’s
  Espresso paper in 2013. At that time LinkedIn had 200 million users
  worldwide. With the growth phase that followed\, the user base today
  is ~4x that number\; add to it ever-increasing user engagement and
  new feature rollouts. During this growth phase\, LinkedIn's data
  systems evolved for each of our use cases. In this talk\, we will
  attempt to give a glimpse of our online storage ecosystem and its
  evolution.\n\nOnline data systems like Oracle and MySQL evolved from
  single-datacenter to multi-datacenter deployments.\nIn addition to
  these relational systems\, the online storage fleet today houses:
 \n\n- Custom NoSQL cluster(s)\n    - Espresso is LinkedIn’s NoSQL
  cluster\n    - It is sharded and supports secondary indexes\n    -
  It serves queries at O(M) (order of millions of) queries per
  second\n\n- Derived data store(s)\n    - It might be prudent to
  precompute and transform data from one form to another so that other
  systems can directly read the transformed data\n    - Serves the
  transformed data for low-latency use cases\n\n- BLOB storage\n    -
  Distributed file storage like Azure Blob and AWS S3\n    - Data is
  immutable\n    - Supports replication for consistency and cross-colo
  reads for read-after-write consistency\n\n- Couchbase\n    - De
  facto caching solution for our source-of-truth databases\n\n- OLAP
  system\n    - Supports analytics on real-time and offline data
  stored as segments\n    - Segments are time-partitioned
  data\n\n#### Glue for Data Systems\n\n- Cluster manager/state
  machine\n    - A quorum makes sure the monitoring and state map is
  consistent\n    - Helix initiates state transitions to converge to
  the ideal state\, and also does the job of partition allocation and
  job scheduling\n\n- Provisioner\n    - Based on user requirements\,
  the provisioner allocates resources in a cluster running the
  required stack\n    - Each service has to track cost-to-serve\, to
  upper-bound resource utilization in a multi-tenant
  infrastructure\n\nAll these components form the online storage stack
  at LinkedIn. Each one has a unique use case\, and we strongly
  believe that “one size fits all” isn’t true in the data realm!
LAST-MODIFIED:20230810T072606Z
LOCATION:Online
ORGANIZER;CN=Rootconf:MAILTO:no-reply@hasgeek.com
URL:https://hasgeek.com/rootconf/choosing-datastores/schedule/online-data-
 stores-at-linkedin-and-their-evolution-hrWUKw247XnqN3N5k28R5
BEGIN:VALARM
ACTION:display
DESCRIPTION:Online data stores at LinkedIn and their evolution. in 5 minut
 es
TRIGGER:-PT5M
END:VALARM
END:VEVENT
BEGIN:VEVENT
SUMMARY:Identity resolution at billions scale @Zeotap
DTSTART:20210903T072000Z
DTEND:20210903T075500Z
DTSTAMP:20260421T150714Z
UID:session/RNS8Wq8gWdNq7xextWByuU@hasgeek.com
SEQUENCE:2
CATEGORIES:30 min talk
CREATED:20210827T152636Z
DESCRIPTION:Zeotap maintains an identity asset of around 50 billion
  IDs\, which enables the ID resolution and audience matching use
  cases for the AdTech & MarTech industries. In this presentation\, we
  will look at the journey of Zeotap's identity store over the years:
  how it evolved from a naive KV implementation to a hybrid KV + graph
  store\, the data access patterns used for solving the business
  needs\, and the underlying data models created using the Aerospike &
  Scylla data stores.\nAdditionally\, we will provide some insights
  into ops management\, performance\, and cost metrics\, which helped
  us make rational decisions throughout the evolution process.
LAST-MODIFIED:20240123T121857Z
LOCATION:Online
ORGANIZER;CN=Rootconf:MAILTO:no-reply@hasgeek.com
URL:https://hasgeek.com/rootconf/choosing-datastores/schedule/identity-res
 olution-at-billions-scale-zeotap-RNS8Wq8gWdNq7xextWByuU
BEGIN:VALARM
ACTION:display
DESCRIPTION:Identity resolution at billions scale @Zeotap in 5 minutes
TRIGGER:-PT5M
END:VALARM
END:VEVENT
BEGIN:VEVENT
SUMMARY:Break
DTSTART:20210903T075500Z
DTEND:20210903T080000Z
DTSTAMP:20260421T150714Z
UID:session/VHJ8iDWaWS6nugq81fmipZ@hasgeek.com
SEQUENCE:0
CREATED:20210827T062148Z
LAST-MODIFIED:20210901T084708Z
LOCATION:Online
ORGANIZER;CN=Rootconf:MAILTO:no-reply@hasgeek.com
BEGIN:VALARM
ACTION:display
DESCRIPTION:Break in 5 minutes
TRIGGER:-PT5M
END:VALARM
END:VEVENT
BEGIN:VEVENT
SUMMARY:Behind the scenes of Aerospike’s cross datacenter replication (X
 DR)
DTSTART:20210903T080000Z
DTEND:20210903T083500Z
DTSTAMP:20260421T150714Z
UID:session/GqEhXCpxEotAeCh1M7cLCd@hasgeek.com
SEQUENCE:1
CATEGORIES:30 min talk,Organize walkthrough between speaker and editor
CREATED:20210827T061944Z
DESCRIPTION:Aerospike’s cross datacenter replication (XDR) is the
  module responsible for shipping data across multiple datacenters\,
  typically across geographical locations. Distributed systems come
  with many unique challenges\, and XDR is equipped to handle
  them.\n\nIn this talk we will take a look at the internals of XDR.
  We will get into some details of how we achieve high performance by
  exploiting cache friendliness and core affinity. We will also cover
  the challenges we face in keeping data consistent. Many things can
  fail in a distributed system\; we will discuss some of the common
  failures\, like source node failure\, destination node failure\, and
  network issues.\nPlease find the link to the presentation
  [here](https://drive.google.com/file/d/1fzNtkcjD-7cWSpcDRqBDwv82dSeA
 UgaD/view?usp=sharing).
LAST-MODIFIED:20230108T103046Z
LOCATION:Online
ORGANIZER;CN=Rootconf:MAILTO:no-reply@hasgeek.com
URL:https://hasgeek.com/rootconf/choosing-datastores/schedule/behind-the-s
 cenes-of-aerospikes-cross-datacenter-replication-xdr-GqEhXCpxEotAeCh1M7cLC
 d
BEGIN:VALARM
ACTION:display
DESCRIPTION:Behind the scenes of Aerospike’s cross datacenter replicatio
 n (XDR) in 5 minutes
TRIGGER:-PT5M
END:VALARM
END:VEVENT
BEGIN:VEVENT
SUMMARY:A Big Data Store – performance optimised for writes\, reads or b
 oth?
DTSTART:20210903T083500Z
DTEND:20210903T090500Z
DTSTAMP:20260421T150714Z
UID:session/EiDPvPvMWgZrtBRGqQeNL5@hasgeek.com
SEQUENCE:1
CATEGORIES:30 min talk
CREATED:20210827T152525Z
DESCRIPTION:At Tesco\, the 3rd largest retailer in the world\, data is
  huge\, and so is the urgency of getting the latest data for use in
  operations and decision making.\n\nWe have modernized our demand
  forecasting system and moved it to the Hadoop platform\, giving us
  the power and flexibility of a distributed platform to improve our
  accuracy with more data and better algorithms. We have also been
  able to manage the forecast at the most granular level\, leading to
  huge volumes of data.\n\nEach time we forecast\, we generate 1 to
  1.2 billion records (about 140 GB of data)\, three times a day. This
  is saved in a data store\, and the total data queried at any point
  is about 3 TB in a single table/entity store.\nWe needed a data
  store that provides fast reads\, with a response time of less than
  200 ms across 3 TB of data\, yet lets us write the bulk data of 1
  billion records in 15 to 20 minutes without disrupting read
  performance. This meant we had to achieve a write speed of 800k to
  1.1 million records/sec while not impacting read performance.\nWe
  know most data store architectures allow you to tune towards faster
  reads or faster writes\, not both. We evaluated many data stores and
  finally had to come up with a different architectural pattern to
  achieve this. The same pattern could be applied in a SQL database
  like Postgres or a NoSQL database like HBase\, and we did so
  successfully in both.\n\nIn this talk I would like to share how we
  achieved this\, while continuing to support smaller streaming
  updates as well. In the process we also discovered a few less well
  known nuances about tuning HBase for fast reads. I would like to
  share those as well\, if time permits. Finally\, I would like to
  touch upon: is there a trade-off? Can we have it all?\n\n**About
  Me**\nYou can view my profile here:
  https://www.linkedin.com/in/saigeethamn/\nYou can view a few of my
  articles at https://www.saigeetha.in/blog
LAST-MODIFIED:20230108T103046Z
LOCATION:Online
ORGANIZER;CN=Rootconf:MAILTO:no-reply@hasgeek.com
URL:https://hasgeek.com/rootconf/choosing-datastores/schedule/a-big-data-s
 tore-performance-optimised-for-writes-reads-or-both-EiDPvPvMWgZrtBRGqQeNL5
BEGIN:VALARM
ACTION:display
DESCRIPTION:A Big Data Store – performance optimised for writes\, reads 
 or both? in 5 minutes
TRIGGER:-PT5M
END:VALARM
END:VEVENT
BEGIN:VEVENT
SUMMARY:Break
DTSTART:20210903T090500Z
DTEND:20210903T091000Z
DTSTAMP:20260421T150714Z
UID:session/HUoY4ZY69AHzthXv4cGQqf@hasgeek.com
SEQUENCE:0
CREATED:20210831T045959Z
LAST-MODIFIED:20210901T084729Z
LOCATION:Online
ORGANIZER;CN=Rootconf:MAILTO:no-reply@hasgeek.com
BEGIN:VALARM
ACTION:display
DESCRIPTION:Break in 5 minutes
TRIGGER:-PT5M
END:VALARM
END:VEVENT
BEGIN:VEVENT
SUMMARY:A hybrid MySQL data model with horizontal sharding and a global da
 ta platform
DTSTART:20210903T091000Z
DTEND:20210903T093000Z
DTSTAMP:20260421T150714Z
UID:session/9GdaFLNYshMePHPpCH2jaU@hasgeek.com
SEQUENCE:2
CATEGORIES:Organize walkthrough between speaker and editor
CREATED:20210827T062059Z
DESCRIPTION:A very popular interview question revolves around choosing
  MySQL vs NoSQL datastores. This decision is much more relevant for
  early startups\, as their choice determines potential data
  scalability and performance bottlenecks in the future. A smart
  decision at this point can save painstaking effort to redesign and
  mass-migrate data later.\n\nFor us\, our final data model ended up
  helping us build our own sharding scheme that scaled our database
  infra by 100x\, and also build a cross-region replication
  framework\, with a team of just two engineers in a span of ~6
  months.\n\nLet's look at the high-level database requirements of a
  typical startup -\n- The overall system needs to be very stable and
  easy to operate and maintain\n- The schema should be easy to update
  and flexible\, to avoid large structural changes in the future\n-
  It's OK not to be at petabyte scale upfront\, but it should be easy
  to scale both vertically and horizontally in the near future as more
  customers get onboarded\n- It has strong documentation and community
  support to help with quick decision-making\n\nNow\, MySQL ticks a
  lot of the checkboxes above and provides many inbuilt DB features
  (auto increment\, indexes\, partitions) that make it easy to create
  efficient datastores upfront.\nBut the fact of the matter is: MySQL
  was originally designed as a single-node system\, not with the
  modern distributed data center setup in mind. Also\, unless very
  carefully planned\, audited and updated\, a MySQL schema can become
  too rigid to accommodate the needs of a dynamic data model.\n\nAt
  Eightfold\, our data architects realized this early on and thought
  about how to implement a datastore that gives us the best of both
  worlds -\n- Excels at traditional DB properties (ACID)\, providing
  DB features like auto increment\, indexes\, partitions\, etc.\n-
  Has balanced read/write performance\n- Doesn't have us worried about
  future limitations of scale\n- And best of all\, lets us use SQL\,
  which can be called the 'English language of the database
  world'\n\nTo achieve this\, we implemented our data model in MySQL
  with a few key ground rules -\n- No SQL foreign key
  relationships\n- No JOINs in SQL for production queries\n- By
  default\, prefer columns instead of an opaque JSON blob\n- JSON is
  OK for certain data that may be unstructured\n- Denormalize data in
  tables based on use case\, caching and write/update paths\n\nIt
  might be evident from the above that each of our tables has one set
  of individual columns and another set of JSON blobs. This helps us
  achieve a hybrid of normalization and denormalization.\n\nEach
  top-level entity\, like a candidate profile or a job position\, has
  its own table with standard individual columns that we may
  frequently use in indexes\, e.g. timestamps\, users\, category\,
  etc.\nThe JSON blob field\, on the other hand\, may store any
  unstructured data that is directly associated with the table
  entity\, e.g. details of the certifications of a profile as a
  certificates_json column in the profile table.\n\nThis is followed
  by a set of second-level entities\, like profile tags\, notes\,
  etc.\, that need a reference to the original top-level entity - the
  profile in this case.\n\nWhile we may store the id of the top-level
  profile in a column in profile_tag\, for instance\, we do not
  establish a SQL foreign key relationship. This helps us keep
  relationships loose.\n\nA top-level entity may have a one-to-many
  relationship with other entities just by aggregating entity ids as
  part of the data_json field.\n\nOur tables are designed in a way
  that most application lookups are simple id-based lookups.\n\n#
  Sharding MySQL\n\nThe true merit of our data model showed up once we
  started implementing our own custom application-level sharding on
  top of MySQL.\n\nAlmost 95% of our data is logically partitioned on
  the basis of customer ids. We needed a way to physically partition
  these customer ids into multiple database clusters without losing
  out on O(1) lookup performance.\n\nIntroducing a layer of
  indirection in between\, we mapped customers to logical shards and
  shards to physical clusters. This made sure we could distribute a
  customer across one or many clusters\, or have multiple customers
  present on the same physical cluster. Our data infrastructure could
  operate in both single- and multi-tenant modes.\n\nImplementation -
  we use the entity id to encode which shard a customer maps to. Each
  table row also has the customer_id.\n\n```\nentity_id = <shard_id>
  << A | <sequence_id>\ncustomer_id\n```\n\n<sequence_id> is generated
  from a sequence table that has an auto_increment field\, while there
  is no auto_increment field in our main data tables.\n\nOur metadata
  store maps customer_id to shards and shards to physical clusters\,
  so while making DB connections we can pick the correct cluster or
  set of clusters to execute the query on.\n\n# Shard migration\n\nTo
  keep our sharding scheme backwards compatible\, we had used shard_id
  0 for all existing customers. But with more data\, we needed to
  'extract' these customers out of shard_id 0 to newer shards on
  separate clusters.\n\nTo achieve this\, we created a shard migration
  framework that was capable of performing in-place id translation
  from one shard_id to another. We were able to achieve this in a
  simple way thanks to the no-foreign-key constraint.\n\n# Global data
  platform\n\nWhile operating data stores across multiple AWS
  regions\, we found that some of these regions did not support AWS
  global clusters. We hence wanted to build an equivalent global
  replication framework that can make sure clusters across regions are
  eventually consistent.\n\nWe again made use of our sharding
  strategy\, including a global bit -\n\n```\nentity_id = <global_bit>
  << B | <shard_id> << A | <sequence_id>\n```\n\nWe also implemented
  SQS-based propagation of DB save operations\, which are handled in
  the other region by a batch-processing framework.
LAST-MODIFIED:20230810T072606Z
LOCATION:Online
ORGANIZER;CN=Rootconf:MAILTO:no-reply@hasgeek.com
URL:https://hasgeek.com/rootconf/choosing-datastores/schedule/a-hybrid-mys
 ql-data-model-with-horizontal-sharding-and-a-gobal-data-platform-9GdaFLNYs
 hMePHPpCH2jaU
BEGIN:VALARM
ACTION:display
DESCRIPTION:A hybrid MySQL data model with horizontal sharding and a globa
 l data platform in 5 minutes
TRIGGER:-PT5M
END:VALARM
END:VEVENT
BEGIN:VEVENT
SUMMARY:Databases at Scale - what I wish someone had told me.
DTSTART:20210903T093000Z
DTEND:20210903T095000Z
DTSTAMP:20260421T150714Z
UID:session/TgEWuRBeZdThwqDY2A2z8u@hasgeek.com
SEQUENCE:1
CATEGORIES:15 min talk,Slides for pre-recorded talk reviewed and approved
CREATED:20210827T061827Z
DESCRIPTION:Databases are at the core of many systems\; the task of
  choosing the right one is a journey\, and doing it at scale can be
  quite an endeavour.\n\nThis talk aims to clarify some misconceptions
  about databases\, giving the audience a clear picture of the
  challenges of running a database at scale\, whether in data volume
  or geo-distribution.
LAST-MODIFIED:20230108T103046Z
LOCATION:Online
ORGANIZER;CN=Rootconf:MAILTO:no-reply@hasgeek.com
URL:https://hasgeek.com/rootconf/choosing-datastores/schedule/databases-sc
 ale-what-i-wish-someone-had-told-me-TgEWuRBeZdThwqDY2A2z8u
BEGIN:VALARM
ACTION:display
DESCRIPTION:Databases at Scale - what I wish someone had told me. in 5 min
 utes
TRIGGER:-PT5M
END:VALARM
END:VEVENT
BEGIN:VEVENT
SUMMARY:Summary of key takeaways\; next steps
DTSTART:20210903T095000Z
DTEND:20210903T100000Z
DTSTAMP:20260421T150714Z
UID:session/Do6aou6dwVTp5v7iP7sYsy@hasgeek.com
SEQUENCE:0
CREATED:20210827T062241Z
LAST-MODIFIED:20210902T093616Z
LOCATION:Online
ORGANIZER;CN=Rootconf:MAILTO:no-reply@hasgeek.com
BEGIN:VALARM
ACTION:display
DESCRIPTION:Summary of key takeaways\; next steps in 5 minutes
TRIGGER:-PT5M
END:VALARM
END:VEVENT
END:VCALENDAR
