Empowering Data Scientists at Farfetch to GoFar with PaaS
Machine Learning is a strategic goal at Farfetch, and making Data Scientists more productive is a key objective in achieving it. Cloud providers such as AWS, Google Cloud, and Azure offer many services and products for creating value in the ML and data space. But whether an organisation adopts these services or builds such a product from scratch, the key stakeholders, the Data Scientists, are often ignored. While these cloud offerings are capable of hosting and delivering the promised models and analytics, the Data Scientist's workflow within the existing enterprise infrastructure (approvals, security, access, setup) is largely overlooked. The onus falls on the user trying to solve the problem, who faces the tedious job of navigating the enterprise tree just to figure out what is needed to get set up.
As an Azure strategic partner, our objective at Farfetch is to combine Azure's enterprise services with cutting-edge open-source technologies, adding value for our Data Scientists so they can hit the ground running on every problem.
Every process and requirement is captured with the spotlight on Data Scientists. The question is always: what is the use case, and how do we enable our users to execute it efficiently? With that goal in mind, we are building a Platform as a Service with multi-tenancy at its core, giving control back to our users in an enterprise context so they can use and extend the platform as they please.
We identified the requirements and blockers, and built an ML Platform around them, leveraging the components outlined below:
Workflow Orchestration Layer - Airflow on Kubernetes [7 mins]
- Provisioning the Infrastructure through custom Terraform modules
- Airflow Deployment Pipelines
- Monitoring Airflow deployment
- Airflow DAG development - Tests, Integrations
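One of the cheapest wins in DAG development is an integrity test that runs in CI before a DAG ever reaches the scheduler. Below is a minimal, library-agnostic sketch of such a check: given a mapping of tasks to their downstream tasks, it verifies the graph is actually acyclic. The task names and pipelines are purely illustrative, not from our actual platform.

```python
# Sketch of a DAG integrity check, independent of any orchestrator:
# verify that a task -> downstream-tasks mapping contains no cycles.
def has_cycle(deps):
    """Return True if the dependency graph contains a cycle (DFS with
    a 'currently visiting' set to detect back edges)."""
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:        # back edge: we looped onto our own path
            return True
        visiting.add(node)
        for nxt in deps.get(node, ()):
            if visit(nxt):
                return True
        visiting.remove(node)
        done.add(node)
        return False

    return any(visit(n) for n in deps)

# Hypothetical ETL pipeline: extract -> transform -> load
etl = {"extract": ["transform"], "transform": ["load"], "load": []}
assert not has_cycle(etl)

# A mis-wired pipeline where load feeds back into extract
broken = {"extract": ["transform"], "transform": ["load"], "load": ["extract"]}
assert has_cycle(broken)
```

The same idea extends to asserting naming conventions, default retries, and owner tags across all DAGs, so broken definitions fail a pipeline instead of failing in production.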
Processing Layer - Databricks [7 mins]
- User and Access Management
- Secrets Management
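To keep credentials out of notebooks and repos, secrets are resolved through Databricks secret scopes (`dbutils.secrets.get`), which can be backed by Azure Key Vault. The sketch below shows one common pattern, with an environment-variable fallback so the same code is testable outside the workspace; the scope and key names are hypothetical.

```python
import os

def get_secret(scope: str, key: str, dbutils=None) -> str:
    """Resolve a secret either from a Databricks secret scope (pass the
    dbutils handle available inside a notebook) or, for local development,
    from an environment variable named SCOPE_KEY."""
    if dbutils is not None:
        # On Databricks, scopes can be backed by Azure Key Vault, so the
        # credential itself never appears in code or version control.
        return dbutils.secrets.get(scope=scope, key=key)
    env_name = f"{scope}_{key}".upper().replace("-", "_")
    value = os.environ.get(env_name)
    if value is None:
        raise KeyError(f"secret {scope}/{key} not found")
    return value

# Local usage (scope/key names are illustrative):
os.environ["ML_PLATFORM_STORAGE_ACCOUNT_KEY"] = "dummy-for-local-tests"
print(get_secret("ml-platform", "storage-account-key"))
```

Inside a notebook the call becomes `get_secret("ml-platform", "storage-account-key", dbutils=dbutils)`, and the resolution path changes without any change to the calling code.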
Storage Layer - ADLS [7 mins]
- Architecture of Data Lake
- Fine grained Access Control and Governance
- Use case: Accessing Data from Databricks
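ADLS Gen2 locations are addressed from Databricks via `abfss://` URIs. A small helper that builds these paths keeps container and storage-account names in one place; the account and container names below are hypothetical, not our actual lake layout.

```python
def abfss_path(container: str, storage_account: str, relative_path: str) -> str:
    """Build an ABFSS URI for an ADLS Gen2 location:
    abfss://<container>@<account>.dfs.core.windows.net/<path>."""
    return (
        f"abfss://{container}@{storage_account}.dfs.core.windows.net/"
        f"{relative_path.lstrip('/')}"
    )

# Illustrative names only:
path = abfss_path("curated", "ffdatalake", "sales/2021/orders.parquet")
print(path)

# Inside a Databricks notebook, once the cluster is configured to
# authenticate against the lake (e.g. via a service principal or AAD
# passthrough), the path can be read directly:
#   df = spark.read.parquet(path)
```

Centralising path construction also makes the fine-grained access-control story easier to reason about: each tenant's containers and prefixes map cleanly to the ACLs granted on them.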
Key Takeaways
- ML Platforms as a Service
- Airflow Setup at Enterprise Scale, best practices
- Decoupling the Processing and Storage Layers, and integration with the Enterprise