Friday, 11 Aug 2023, 09:00 AM – 06:00 PM IST
Mohamed Imran K R
It's a no-brainer that huge amounts of data, in the high terabytes to petabytes, will have to be processed to build any foundation model or even to train LLMs. In this talk, I propose to discuss the pain points of handling data at this scale:
1. The choice of distributed training setups such as SLURM, DeepSpeed (a PyTorch-based wrapper), and other tools, and how to leverage multi-node GPUs, which is one of the fundamental problems in AI/ML today (a minimal sketch follows this list).
2. The choice of storage technologies and their material impact on data access speeds: where to use PVCs (persistent volume claims), object storage, block storage, and in-memory file systems.
3. The choice of networking setup (200 Gbps vs 400 Gbps links, Open vSwitch aggregation, bonding, and offloads) and the network settings needed to achieve high throughput.
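As a taste of point 1, here is a minimal sketch of a DeepSpeed training loop that scales from one GPU to many nodes. The toy model, ZeRO stage, batch sizes, and learning rate below are hypothetical placeholders rather than recommendations from the talk; the point is only that `deepspeed.initialize` wraps an ordinary PyTorch model so the same script can be spawned per GPU by the `deepspeed` launcher or under SLURM.

```python
# Minimal multi-node training sketch with DeepSpeed.
# Model, config values, and the random batch are placeholders; adapt to your
# own cluster, dataset, and model.
import torch
import torch.nn as nn
import deepspeed

# A toy model standing in for a real transformer.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

# Hypothetical DeepSpeed config: ZeRO stage 2, bf16, modest batch sizes.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

# deepspeed.initialize wraps the model for distributed execution; the launcher
# (deepspeed --hostfile ... or srun under SLURM) starts one process per GPU.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

for step in range(100):
    # Replace with a real dataloader; this random batch is just a placeholder.
    x = torch.randn(8, 1024, device=engine.device, dtype=torch.bfloat16)
    loss = engine(x).float().pow(2).mean()  # dummy loss for illustration
    engine.backward(loss)   # handles gradient accumulation and ZeRO sharding
    engine.step()           # optimizer step (and LR schedule, if configured)
```

On a cluster this would typically be started with something like `deepspeed --hostfile hostfile train.py`, or via `srun` under SLURM, which is where the multi-node leverage comes from.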
This trifecta, correctly set up and optimized, can determine whether your training runs in days, weeks, or months, with significant cost savings as a result.
In an ideal slot of 30 minutes, this talk will discuss strategies for optimizing training speeds, inference speeds, and building pipelines, focusing on the basic compute, storage, and network aspects for a fundamental understanding and relatability to existing compute problems.
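On the network side (point 3 above), much of the throughput tuning happens before the first collective operation runs. The snippet below is a hedged sketch of environment-level NCCL settings applied before process-group initialization; `bond0` and `mlx5_0,mlx5_1` are placeholder interface and HCA names that would need to match the actual bonded NIC and RDMA setup on the cluster.

```python
# Sketch of environment-level network tuning before initializing the process
# group. Interface and HCA names are placeholders for illustration only.
import os
import torch.distributed as dist

# Pin NCCL traffic to the bonded high-speed interface, not the management NIC.
os.environ.setdefault("NCCL_SOCKET_IFNAME", "bond0")    # placeholder name
# Prefer the InfiniBand/RoCE HCAs when available.
os.environ.setdefault("NCCL_IB_HCA", "mlx5_0,mlx5_1")   # placeholder names
# Surface NCCL's transport choices in the logs so misrouted traffic is visible.
os.environ.setdefault("NCCL_DEBUG", "INFO")

# Rank, world size, and master address normally come from the launcher
# (deepspeed, torchrun, or srun), so no arguments are needed here.
dist.init_process_group(backend="nccl")
```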