The Fifth Elephant 2023 Monsoon

On AI, industrial applications of ML, and MLOps


Mohamed Imran K R

@mohamedimran

The data bottleneck in distributed AI/ML workloads

Submitted Jul 15, 2023

It's a no-brainer that huge volumes of data, in the high terabytes to petabytes, will have to be processed to build any foundation model or even to train LLMs. In this talk, I propose to discuss the pain points of handling data at this scale:

1. The choice of distributed training setups such as Slurm and DeepSpeed (a PyTorch wrapper), and how they leverage multi-node GPUs, which remains one of the fundamental problems in AI/ML today (see the first sketch after this list).
2. The choice of storage technology and its material impact on data access speeds: where to use PVCs, object storage, block storage, and in-memory file systems (see the storage probe below).
3. The choice of networking setup (200 Gb/s vs. 400 Gb/s links, Open vSwitch aggregation, bonding, and offloads) and the network settings needed to achieve high throughput (see the NCCL example below).
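
For point 1, a minimal sketch of a multi-node DeepSpeed training loop is below. The model, batch shapes, and the contents of ds_config.json (which would hold the ZeRO stage, precision, and batch-size settings) are hypothetical placeholders, not a real workload; a real run would be launched with the deepspeed runner and a hostfile listing the nodes.

```python
# minimal_deepspeed_train.py -- a hedged sketch of multi-node training with
# DeepSpeed, launched e.g. via: deepspeed --hostfile hostfile minimal_deepspeed_train.py
# The model and data below are toy placeholders, not a real workload.
import torch
import torch.nn as nn
import deepspeed

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10))

# deepspeed.initialize wraps the model and optimizer for distributed execution;
# ZeRO stage, precision, and batch sizes come from the (assumed) ds_config.json.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)

for step in range(100):
    # Synthetic batch; a real run would stream shards from the storage tier
    # discussed in point 2, which is often the actual bottleneck.
    x = torch.randn(32, 1024, device=engine.device)
    y = torch.randint(0, 10, (32,), device=engine.device)
    loss = nn.functional.cross_entropy(engine(x), y)
    engine.backward(loss)  # DeepSpeed handles the gradient all-reduce across nodes
    engine.step()
```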
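
For point 2, the material impact of the storage tier can be measured directly. The sketch below times sequential reads from whatever is mounted at each path; the mount points and file names are assumptions, and in practice you would drop the page cache between runs (or read files larger than RAM) so the numbers reflect the tier rather than cached memory.

```python
# storage_probe.py -- a rough sketch for comparing read throughput across
# storage tiers (e.g. a PVC mount, a block volume, tmpfs). The paths are
# illustrative assumptions; point them at whatever is mounted on your nodes.
import os
import time

def read_throughput(path: str, block_size: int = 8 * 1024 * 1024) -> float:
    """Sequentially read `path` and return throughput in MB/s."""
    size = os.path.getsize(path)
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(block_size):
            pass
    elapsed = time.perf_counter() - start
    return size / elapsed / 1e6

# Hypothetical mount points for the tiers mentioned above.
for tier, path in [
    ("pvc", "/mnt/pvc/shard-000.tar"),
    ("block", "/mnt/block/shard-000.tar"),
    ("tmpfs", "/dev/shm/shard-000.tar"),
]:
    if os.path.exists(path):
        print(f"{tier}: {read_throughput(path):.0f} MB/s")
```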
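
For point 3, much of the network tuning for collective communication surfaces as NCCL settings. The sketch below pins NCCL to a specific high-bandwidth interface before initializing the process group; the interface name bond0 is an assumption and should match whatever bonded interface your aggregation setup actually exposes.

```python
# nccl_env.py -- a sketch of the kind of NCCL/network knobs that matter for
# multi-node throughput. Interface names are assumptions specific to your fabric.
import os
import torch.distributed as dist

# Pin NCCL to the high-bandwidth interface (e.g. the 200/400 Gb/s bond)
# instead of letting it fall back to a slow management NIC.
os.environ.setdefault("NCCL_SOCKET_IFNAME", "bond0")  # assumed bonded interface
os.environ.setdefault("NCCL_IB_DISABLE", "0")         # keep RDMA/InfiniBand enabled
os.environ.setdefault("NCCL_DEBUG", "INFO")           # log which transport NCCL picked

# Rendezvous details (MASTER_ADDR/PORT, RANK, WORLD_SIZE) are expected to be
# provided by the launcher, e.g. Slurm or the deepspeed runner.
dist.init_process_group(backend="nccl")
```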

This trifecta, correctly set up and optimized, can determine whether your training runs in days, weeks, or months, with significant cost savings as a result.

In an ideal slot of 30 minutes, this talk will discuss the various strategies for optimizing training speeds, inference speeds, and pipeline building, while focusing on the basic compute, storage, and network aspects for a fundamental understanding and relatability to existing compute problems.



Hosted by

Jump starting better data engineering and AI futures

Supported by

E2E Cloud is India's first AI hyperscaler: a cloud computing platform providing accelerated cloud-based solutions with maximum optimization at the lowest pricing.