A Big Data Store – Performance optimised for writes, reads or both?
At Tesco, the third-largest retailer in the world, data volumes are huge, and so is the urgency of getting the latest data into operations and decision making.
We have modernized our demand forecasting system and moved it to the Hadoop platform, giving us the power and flexibility of a distributed platform to improve our accuracy with more data and better algorithms. We are also now able to manage the forecast at the most granular level, which produces huge volumes of data.
Each forecast run generates 1 to 1.2 billion records (about 140 GB of data), and we run three times a day. This data must be saved in a data store, and the total data queried at any point is about 3 TB in a single table/entity store.
We needed a data store able to serve reads with a response time under 200 ms across 3 TB of data, and yet able to bulk-write the 1 billion records generated per run in 15 to 20 minutes without disrupting read performance. This meant sustaining a write speed of roughly 800 thousand to 1.1 million records/sec while leaving reads unaffected.
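The throughput figure follows directly from the volume and the write window; a quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the required write throughput:
# 1 billion records must land within a 15- to 20-minute window.
RECORDS_PER_RUN = 1_000_000_000
WINDOW_FAST_S = 15 * 60   # tightest window, in seconds
WINDOW_SLOW_S = 20 * 60   # most relaxed window, in seconds

# The slow window gives the minimum rate, the fast window the maximum.
min_rate = RECORDS_PER_RUN / WINDOW_SLOW_S
max_rate = RECORDS_PER_RUN / WINDOW_FAST_S

print(f"{min_rate:,.0f} to {max_rate:,.0f} records/sec")
# → 833,333 to 1,111,111 records/sec
```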
Most data store architectures let you tune towards faster reads or faster writes, not both. We evaluated many data stores and ultimately had to devise a different architectural pattern to achieve this. The same pattern can be applied in a SQL database such as Postgres or a NoSQL database such as HBase, and we did so successfully in both.
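The abstract deliberately does not reveal the pattern itself. For intuition only, one well-known way to decouple bulk writes from reads is to build the new data set entirely outside the read path and then publish it with a single atomic switch; a minimal in-memory sketch of that general load-then-switch idea (hypothetical, not necessarily the pattern presented in the talk):

```python
import threading

class SnapshotStore:
    """Reads always hit the currently published snapshot; a bulk load
    builds a fresh snapshot off to the side and publishes it with one
    atomic reference swap, so in-flight reads are never disrupted.
    (Illustrative only; not the talk's actual pattern.)"""

    def __init__(self):
        self._snapshot = {}            # currently published data
        self._lock = threading.Lock()  # serialises concurrent publishers

    def get(self, key):
        # Readers grab a local reference; the swap never blocks them.
        return self._snapshot.get(key)

    def bulk_load(self, records):
        # Heavy write work happens entirely outside the read path.
        fresh = dict(records)
        with self._lock:
            self._snapshot = fresh     # atomic publish

store = SnapshotStore()
store.bulk_load([("sku-1", 42)])
print(store.get("sku-1"))  # → 42
```

In a real database the same shape appears as loading into a staging table or region and then switching a view, pointer, or partition, so the expensive ingest never contends with the serving path.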
In this talk I would like to share how we achieved this while continuing to support smaller streaming updates as well. Along the way we also discovered a few less well known nuances of tuning HBase for fast reads, which I would like to share if time permits. Finally, I would like to touch upon: is there a trade-off? Can we have it all?