This workshop aims to give participants a practical understanding of the principles, architectures, algorithms, and key challenges involved in processing and managing massive-scale graph data in distributed environments.
Graphs are ubiquitous! They are fundamental to modeling complex relationships, from social and recommendation networks to financial systems and knowledge graphs. But real-world graphs often contain billions or trillions of nodes and edges, far exceeding the capabilities of single-machine processing.
This workshop explores the architectural and algorithmic foundations of distributed graph systems like Spark GraphX, which are essential for scaling graph processing.
We’ll begin by examining foundational graph data models (Property Graphs and RDF) and then discuss architectural considerations including:
- Partitioning
- Replication
- Querying
- Fault tolerance strategies employed in distributed graph environments.

- Common query patterns in graph data:
  - Reachability queries
  - Subgraph and pattern matching queries
  - Keyword search queries
  - Path queries
- Why traditional batch systems (like MapReduce) struggle with iterative graph workloads: each iteration runs as a separate job that writes its output to disk and reads it back, which causes high I/O overhead.
- How specialized distributed models like Pregel and Gather-Apply-Scatter (GAS) emerged, including a discussion of the “think like a vertex” framework popularized by Google.
- Effective graph partitioning strategies
- Mechanisms for fault tolerance in graphs:
  - Checkpointing
  - Lineage-based recovery
- Distributed querying for graphs: how query rewrites for distributed graph databases differ.
- Graph summarization: approaches for summarizing large graphs in a distributed setup, including parallelizing and distributing GNNs (Graph Neural Networks).
Participants will conceptually design the compute() function for a distributed BFS, tracing message flow and state updates across supersteps to understand the vertex-centric model in action.
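To give a feel for this exercise, here is a minimal single-process sketch of a vertex-centric BFS in plain Python. It is not the Pregel or GraphX API: the names `compute`, `run_bfs`, and `GRAPH` are hypothetical, and the loop only imitates supersteps, with a vertex being woken up whenever it has incoming messages.

```python
from collections import defaultdict

INF = float("inf")

# Hypothetical example graph: vertex -> list of out-neighbors.
GRAPH = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}


def compute(vertex, dist, messages, neighbors, superstep, source):
    """Vertex program for one superstep.

    Returns (new_dist, outgoing) where outgoing is a list of
    (target_vertex, proposed_distance) messages.
    """
    if superstep == 0:
        candidate = 0 if vertex == source else INF
    else:
        candidate = min(messages, default=INF)
    if candidate < dist:
        # Distance improved: adopt it and notify every out-neighbor.
        return candidate, [(n, candidate + 1) for n in neighbors]
    # No improvement: send nothing (the vertex effectively votes to halt).
    return dist, []


def run_bfs(graph, source):
    """Simulate supersteps until no messages are in flight."""
    dist = {v: INF for v in graph}
    inbox = {}
    trace = []  # (superstep, number of messages sent)
    superstep = 0
    while True:
        # Every vertex runs in superstep 0; later only vertices with mail do.
        active = list(graph) if superstep == 0 else [v for v in graph if inbox.get(v)]
        if not active:
            break
        outbox = defaultdict(list)
        for v in active:
            dist[v], outgoing = compute(
                v, dist[v], inbox.get(v, []), graph[v], superstep, source
            )
            for target, value in outgoing:
                outbox[target].append(value)
        trace.append((superstep, sum(len(m) for m in outbox.values())))
        inbox = outbox
        superstep += 1
    return dist, trace


dist, trace = run_bfs(GRAPH, "A")
print(dist)   # distances from A: A=0, B=1, C=1, D=2, E=3
print(trace)  # messages sent per superstep
```

Tracing `trace` by hand for a small graph like this one is close to what the exercise asks for: which vertices are active in each superstep, what they receive, and what they send.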
Using a small example graph, participants will manually simulate and compare the communication cost (cross-partition messages) for a simple query (e.g., 2-hop neighborhood) under different partitioning strategies:
- Hash Partitioning
- Manually optimized Edge-Cut
This highlights the direct impact of partitioning on performance.
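The manual simulation can be cross-checked with a few lines of Python. The sketch below assumes a hypothetical six-vertex graph (two triangles joined by one bridge edge), two partitions, and a simple cost model in which every frontier vertex sends one message along each of its edges; `two_hop_cost` and the other names are illustrative only.

```python
# Hypothetical undirected graph: two triangles joined by the edge (2, 3).
EDGES = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
NUM_PARTITIONS = 2


def build_adjacency(edges):
    """Build an undirected adjacency list from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def two_hop_cost(adj, part, start, hops=2):
    """Expand `hops` levels from `start`, counting messages.

    Returns (reached_vertices, total_messages, cross_partition_messages).
    """
    visited = {start}
    frontier = [start]
    total = cross = 0
    for _ in range(hops):
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                total += 1
                if part[u] != part[v]:
                    cross += 1  # this message crosses a partition boundary
                if v not in visited:
                    visited.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return visited, total, cross


adj = build_adjacency(EDGES)

# Strategy 1: hash partitioning (vertex id modulo number of partitions).
hash_part = {v: v % NUM_PARTITIONS for v in adj}

# Strategy 2: manually chosen edge-cut that keeps each triangle together.
manual_part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

hash_result = two_hop_cost(adj, hash_part, start=0)
manual_result = two_hop_cost(adj, manual_part, start=0)
print("hash:  ", hash_result)    # 5 of 7 messages cross partitions
print("manual:", manual_result)  # 1 of 7 messages crosses partitions
```

Both strategies reach the same vertices with the same total number of messages; only the share that crosses partitions changes, which is the quantity the exercise compares.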
Participants will code the steps for implementing iterative PageRank using GraphX operators like aggregateMessages, understanding how data flow and aggregation work in a data-parallel framework.
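The data flow behind this exercise can be sketched in plain Python before moving to GraphX. The block below is not GraphX code: `aggregate_messages` is a hypothetical helper that only imitates the send-then-merge pattern of the aggregateMessages operator on a single machine, and the update rule `rank = reset_prob + (1 - reset_prob) * sum_of_contributions` is one common unnormalized PageRank formulation, assumed here for illustration.

```python
def aggregate_messages(edges, attrs, send_msg, merge_msg):
    """Imitate a send/merge aggregation over edges (illustrative helper).

    send_msg(src, dst, src_attr, dst_attr) returns a list of
    (target_vertex, message) pairs; merge_msg combines two messages
    addressed to the same vertex.
    """
    merged = {}
    for src, dst in edges:
        for target, msg in send_msg(src, dst, attrs[src], attrs[dst]):
            merged[target] = merge_msg(merged[target], msg) if target in merged else msg
    return merged


def pagerank(edges, num_iter=20, reset_prob=0.15):
    """Iterative PageRank over a directed edge list."""
    vertices = sorted({v for edge in edges for v in edge})
    out_degree = {v: 0 for v in vertices}
    for src, _ in edges:
        out_degree[src] += 1

    rank = {v: 1.0 for v in vertices}
    for _ in range(num_iter):
        # Vertex attribute visible on each edge: (current rank, out-degree).
        attrs = {v: (rank[v], out_degree[v]) for v in vertices}
        contribs = aggregate_messages(
            edges,
            attrs,
            # Each source splits its rank evenly across its out-edges.
            send_msg=lambda src, dst, sa, da: [(dst, sa[0] / sa[1])],
            # Contributions arriving at the same vertex are summed.
            merge_msg=lambda a, b: a + b,
        )
        # Vertices that received no message get a contribution of 0.
        rank = {
            v: reset_prob + (1 - reset_prob) * contribs.get(v, 0.0)
            for v in vertices
        }
    return rank


# Hypothetical example: a small directed graph with no dangling vertices.
EXAMPLE_EDGES = [(1, 2), (1, 3), (2, 3), (3, 1)]
ranks = pagerank(EXAMPLE_EDGES, num_iter=100)
for vertex, value in sorted(ranks.items()):
    print(vertex, round(value, 4))
```

The split into a per-edge "send" step and a per-vertex "merge" step is the part that carries over to a data-parallel framework, where each step can run independently on every partition.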
An exercise on how to summarize large graphs using Graph Neural Networks in a distributed setting.
- A basic understanding of graph theory concepts (e.g., BFS, DFS).
- Familiarity with distributed computing fundamentals (e.g., data partitioning, fault tolerance).
- Some exposure to data-parallel paradigms (e.g., MapReduce) and a basic understanding of Spark and RDDs.
- Comfort with reading and writing Python code.
By the end of the workshop, participants will:
- Understand the core principles of distributed graph processing.
- Be familiar with the architectural considerations of distributed graph systems, such as partitioning, replication, and fault tolerance.
- Gain beginner-level insights into using frameworks like Pregel and GraphX.
- Walk away with conceptual tools to model and debug large-scale graph workloads.
This workshop is for anyone interested in understanding how distributed graph systems work and how they differ from traditional distributed systems.
It’s perfect for those looking to explore the unique challenges and solutions in scaling graph data.
Varuni works in R&D at Couchbase and previously worked at JP Morgan. Varuni loves distributed systems, statistics, and ML.
This workshop is open to Rootconf members and Rootconf 2025 ticket buyers.
It is limited to 20 participants. Seats will be available on a first-come, first-served basis. 🎟️
For inquiries about the workshop, contact +91-7676332020 or write to info@hasgeek.com.