Friday, 11 Aug 2023, 09:00 AM – 06:00 PM IST
Mohamed Imran K R
It's a no-brainer that huge amounts of data, in the high terabytes to petabytes, will have to be processed to build any foundation model or even to train LLMs. In this talk, I propose to discuss the pain points of handling data at this scale:
1. The choice of distributed training setups such as SLURM, DeepSpeed (a PyTorch-based wrapper), and other tools, and how to leverage multi-node GPUs, which is one of the fundamental problems in AI/ML today (a minimal sketch follows this list).
2. The choice of storage technologies and their material impact on data access speeds: where to use PVCs (persistent volume claims), object storage, block storage, and in-memory file systems.
3. The choice of networking setup (200 Gbps vs 400 Gbps links, Open vSwitch aggregation, bonding, and offloads) and the network settings needed to achieve high throughput.
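As a taste of point 1, here is a minimal sketch of a DeepSpeed training loop that scales from one GPU to many nodes. The toy model, ZeRO stage, batch sizes, and learning rate below are hypothetical placeholders rather than recommendations from the talk; the point is only that `deepspeed.initialize` wraps an ordinary PyTorch model so the same script can be spawned per GPU by the `deepspeed` launcher or under SLURM.

```python
# Minimal multi-node training sketch with DeepSpeed.
# Model, config values, and the random batch are placeholders; adapt to your
# own cluster, dataset, and model.
import torch
import torch.nn as nn
import deepspeed

# A toy model standing in for a real transformer.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

# Hypothetical DeepSpeed config: ZeRO stage 2, bf16, modest batch sizes.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

# deepspeed.initialize wraps the model for distributed execution; the launcher
# (deepspeed --hostfile ... or srun under SLURM) starts one process per GPU.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

for step in range(100):
    # Replace with a real dataloader; this random batch is just a placeholder.
    x = torch.randn(8, 1024, device=engine.device, dtype=torch.bfloat16)
    loss = engine(x).float().pow(2).mean()  # dummy loss for illustration
    engine.backward(loss)   # handles gradient accumulation and ZeRO sharding
    engine.step()           # optimizer step (and LR schedule, if configured)
```

On a cluster this would typically be started with something like `deepspeed --hostfile hostfile train.py`, or via `srun` under SLURM, which is where the multi-node leverage comes from.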
This trifecta, correctly set up and optimized, can determine whether your training runs in days, weeks, or months, with significant cost savings as a result.
In an ideal slot of 30 minutes, this talk will discuss strategies for optimizing training speeds, inference speeds, and building pipelines, focusing on the basic compute, storage, and network aspects for a fundamental understanding and relatability to existing compute problems.
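On the network side (point 3 above), much of the throughput tuning happens before the first collective operation runs. The snippet below is a hedged sketch of environment-level NCCL settings applied before process-group initialization; `bond0` and `mlx5_0,mlx5_1` are placeholder interface and HCA names that would need to match the actual bonded NIC and RDMA setup on the cluster.

```python
# Sketch of environment-level network tuning before initializing the process
# group. Interface and HCA names are placeholders for illustration only.
import os
import torch.distributed as dist

# Pin NCCL traffic to the bonded high-speed interface, not the management NIC.
os.environ.setdefault("NCCL_SOCKET_IFNAME", "bond0")    # placeholder name
# Prefer the InfiniBand/RoCE HCAs when available.
os.environ.setdefault("NCCL_IB_HCA", "mlx5_0,mlx5_1")   # placeholder names
# Surface NCCL's transport choices in the logs so misrouted traffic is visible.
os.environ.setdefault("NCCL_DEBUG", "INFO")

# Rank, world size, and master address normally come from the launcher
# (deepspeed, torchrun, or srun), so no arguments are needed here.
dist.init_process_group(backend="nccl")
```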