Sat, 27 Sep 2025, 10:00 AM – 01:30 PM IST
To provide participants with a practical understanding of the principles, architectures, algorithms, and key challenges involved in processing and managing massive-scale graph data in distributed environments.
Graphs are ubiquitous! They are fundamental to modeling complex relationships, from social and recommendation networks to financial systems and knowledge graphs. But real-world graphs often contain billions or trillions of nodes and edges, far exceeding the capabilities of single-machine processing.
This workshop explores the architectural and algorithmic foundations of distributed graph systems like Spark GraphX, which are essential for scaling graph processing.
We’ll begin by examining foundational graph data models (Property Graphs and RDF) and then discuss architectural considerations including:
Common query patterns in graph data:
Why traditional batch systems (like MapReduce) fail for graph workloads due to high I/O overhead.
How specialized distributed models like Pregel and Gather-Apply-Scatter (GAS) emerged, including a discussion of the “think like a vertex” framework popularized by Google.
Effective graph partitioning strategies:
Mechanisms for fault tolerance in graphs:
Distributed Querying for graphs — how query rewrites for distributed graph databases differ.
Distributed Graph Neural Networks (GNNs)
As graphs grow massive, training GNNs at scale introduces new challenges beyond traditional distributed graph processing. In this section, we’ll explore:
Why distributed GNNs?
Real-world graphs (social, citation, web-scale) cannot fit into a single GPU/host.
Need to partition graphs across machines while preserving neighborhood structure for training.
Challenges unique to GNNs:
Neighborhood Explosion: Higher-order neighbors expand exponentially, making sampling critical (see the back-of-the-envelope sketch after this list).
Communication Bottlenecks: Feature exchange across partitions increases network cost.
Popular frameworks and techniques:
Architectural patterns:
Data parallelism — partition the graph, train locally, periodically sync parameters.
Model parallelism — shard large GNN models across GPUs.
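To make the neighborhood explosion concrete, here is a back-of-the-envelope sketch comparing the size of a full k-hop neighborhood against a fixed sampling fan-out; the average degree, fan-out, and layer count below are assumed numbers chosen purely for illustration.

```scala
object NeighborhoodExplosionSketch {
  def main(args: Array[String]): Unit = {
    // Assumed numbers for illustration only.
    val avgDegree    = 50   // average out-degree of the graph
    val sampleFanout = 10   // neighbours sampled per vertex per layer
    val layers       = 3    // GNN depth

    // Rough upper bound on vertices touched per target vertex:
    // full k-hop neighbourhood vs. a fixed sampling fan-out.
    val full    = (1 to layers).map(k => math.pow(avgDegree, k).toLong)
    val sampled = (1 to layers).map(k => math.pow(sampleFanout, k).toLong)

    full.zip(sampled).zipWithIndex.foreach { case ((f, s), i) =>
      println(s"layer ${i + 1}: full ~ $f vertices, sampled ~ $s vertices")
    }
    // Without sampling the frontier grows as degree^k (125,000 at layer 3 here);
    // sampling caps it at fanout^k (1,000), which also bounds the cross-partition
    // feature traffic during distributed training.
  }
}
```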
This workshop was first conducted as part of the Rootconf 2025 Annual Conference on 16 May. By popular demand, the instructor, Varuni HK, is repeating the workshop.
Participants will conceptually design the compute() function for a distributed BFS, tracing message flow and state updates across supersteps to understand the vertex-centric model in action.
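For orientation, a minimal sketch of what such a vertex program can look like with GraphX's Pregel operator is shown below. The toy four-vertex graph, the object name, and the choice of hop distance as the vertex state are assumptions for illustration, not the workshop's exact exercise.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

object BfsPregelSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("bfs-sketch").setMaster("local[*]"))

    // Toy graph (edge attributes unused): 0 -> 1 -> 2 -> 3, plus a shortcut 0 -> 2.
    val edges = sc.parallelize(Seq(Edge(0L, 1L, 1), Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(0L, 2L, 1)))
    val sourceId: VertexId = 0L

    // Vertex state = best-known hop distance from the source; the source starts at 0.
    val initialGraph = Graph.fromEdges(edges, Double.PositiveInfinity)
      .mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)

    val bfs = initialGraph.pregel(Double.PositiveInfinity)(
      // "compute": keep the smaller of the current state and the incoming message.
      (_, dist, msg) => math.min(dist, msg),
      // sendMsg: send a message only when it would improve the neighbour's distance.
      triplet =>
        if (triplet.srcAttr + 1 < triplet.dstAttr) Iterator((triplet.dstId, triplet.srcAttr + 1))
        else Iterator.empty,
      // mergeMsg: combine messages arriving at the same vertex in the same superstep.
      (a, b) => math.min(a, b)
    )

    bfs.vertices.collect().sorted.foreach { case (id, d) => println(s"vertex $id -> $d hops") }
    sc.stop()
  }
}
```

Each round of messages corresponds to one superstep: vertices that receive an improved distance become active and notify their neighbours, and the computation halts once no messages are in flight.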
Using a small example graph, participants will manually simulate and compare the communication cost (cross-partition messages) of a simple query (e.g., a 2-hop neighborhood) under different partitioning strategies:
This highlights the direct impact of partitioning on performance.
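As a rough preview of that exercise, the sketch below counts cut edges (each one a proxy for a cross-partition message whenever a traversal crosses it) on a hypothetical six-vertex graph under two assumed placements, a hash-based one and a locality-aware one; the graph and both strategies are invented for illustration.

```scala
object PartitionCostSketch {
  // Hypothetical undirected graph: two tight clusters {1,2,3} and {4,5,6}
  // joined by a single bridge edge (3,4).
  val edges: Seq[(Int, Int)] = Seq(
    (1, 2), (2, 3), (1, 3),   // cluster A
    (4, 5), (5, 6), (4, 6),   // cluster B
    (3, 4)                    // bridge
  )

  // Strategy 1: hash partitioning, i.e. vertex id modulo the number of partitions.
  def hashPartition(v: Int): Int = v % 2

  // Strategy 2: locality-aware cut that keeps each cluster on one partition.
  def clusterPartition(v: Int): Int = if (v <= 3) 0 else 1

  // An edge whose endpoints live on different partitions costs a network
  // message every time a query traverses it.
  def cutEdges(part: Int => Int): Int =
    edges.count { case (u, v) => part(u) != part(v) }

  def main(args: Array[String]): Unit = {
    println(s"hash partitioning:  ${cutEdges(hashPartition)} cut edges")    // 5
    println(s"locality-aware cut: ${cutEdges(clusterPartition)} cut edges") // 1
    // A 2-hop neighbourhood query pays the cross-partition cost on every cut
    // edge it touches, so the hash layout generates far more network traffic here.
  }
}
```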
Refresher on how MapReduce works.
Participants will code the steps for implementing iterative PageRank using GraphX operators like aggregateMessages, understanding how data flow and aggregation work in a data-parallel framework.
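For reference, a minimal sketch of how such an iteration can be wired together with aggregateMessages appears below; the toy edge list, the damping factor of 0.85, and the fixed iteration count are illustrative assumptions rather than the workshop's exact code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

object PageRankSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pagerank-sketch").setMaster("local[*]"))

    // Toy directed graph (edge attributes unused).
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1), Edge(1L, 3L, 1)))
    val damping = 0.85
    val iterations = 10

    // Vertex attribute = (out-degree, current rank); every rank starts at 1.0.
    var graph: Graph[(Int, Double), Int] = {
      val base = Graph.fromEdges(edges, 0)
      base.outerJoinVertices(base.outDegrees)((_, _, deg) => (deg.getOrElse(0), 1.0))
    }

    for (_ <- 1 to iterations) {
      // Scatter: each vertex sends rank / outDegree along its out-edges;
      // gather: contributions arriving at a vertex are summed.
      val contribs: VertexRDD[Double] = graph.aggregateMessages[Double](
        ctx => ctx.sendToDst(ctx.srcAttr._2 / ctx.srcAttr._1),
        _ + _
      )
      // Apply: the PageRank update; vertices with no in-links keep the base term.
      graph = graph.outerJoinVertices(contribs) { (_, attr, sumOpt) =>
        (attr._1, (1 - damping) + damping * sumOpt.getOrElse(0.0))
      }
    }

    graph.vertices.collect().sorted.foreach { case (id, (_, rank)) => println(f"vertex $id: rank $rank%.4f") }
    sc.stop()
  }
}
```

The scatter, gather, and apply steps here mirror the GAS decomposition discussed earlier, expressed as data-parallel operations over the vertex and edge RDDs.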
By the end of the workshop, participants will:
Anyone interested in understanding how distributed graph systems work and how they differ from traditional distributed systems.
It’s perfect for those looking to explore the unique challenges and solutions in scaling graph data.
Varuni works on the Indexing team at Couchbase and previously worked at JP Morgan. Varuni loves distributed systems, graphs, statistics, and ML.
This workshop is open to Rootconf members and Rootconf 2025 ticket buyers.
This workshop is limited to 20 participants. Seats are available on a first-come, first-served basis. 🎟️
For inquiries about the workshop, contact +91-7676332020 or write to info@hasgeek.com.