Distributed graphs workshop - think like a vertex; scale like a cluster

Workshop goal

To provide participants with a practical understanding of the principles, architectures, algorithms, and key challenges involved in processing and managing massive-scale graph data in distributed environments.

Overview

Graphs are ubiquitous! They are fundamental to modeling complex relationships, from social and recommendation networks to financial systems and knowledge graphs. But real-world graphs often contain billions or trillions of nodes and edges, far exceeding the capabilities of single-machine processing.

This workshop explores the architectural and algorithmic foundations of distributed graph systems like Spark GraphX, which are essential for scaling graph processing.

We’ll begin by examining foundational graph data models (Property Graphs and RDF) and then discuss architectural considerations in distributed graph environments, including:

  • Partitioning
  • Replication
  • Querying
  • Fault tolerance strategies

Participants will explore:

  • Common query patterns in graph data:

    • Reachability queries
    • Subgraph and pattern matching queries
    • Keyword search queries
    • Path queries
  • Why traditional batch systems (like MapReduce) fail for iterative graph workloads: each iteration must write intermediate state back to disk, incurring high I/O overhead.

  • How specialized distributed models like Pregel and Gather-Apply-Scatter (GAS) emerged, including a discussion of the “think like a vertex” framework popularized by Google (a minimal superstep sketch follows this list).

  • Effective graph partitioning strategies:

    • Edge-Cut
    • Vertex-Cut
  • Mechanisms for fault tolerance in graphs:

    • Checkpointing
    • Lineage-based recovery
  • Distributed querying for graphs: how query planning and rewriting in distributed graph databases differ from the single-machine case.
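
To make the vertex-centric model concrete, here is a minimal single-process sketch of a Pregel-style superstep loop. All names here (pregel, compute, inbox) are illustrative assumptions, not a real framework API:

```python
# Hypothetical sketch: a Pregel-style superstep loop on one machine.
from collections import defaultdict

def pregel(state, adj, compute, max_supersteps=30):
    """state: {vertex: value}; adj: {vertex: [neighbors]}.
    compute(superstep, value, messages, neighbors) returns
    (new_value, [(dest, msg), ...], vote_to_halt)."""
    inbox = defaultdict(list)
    active = set(state)                        # every vertex starts active
    for superstep in range(max_supersteps):
        if not active:
            break
        outbox = defaultdict(list)
        next_active = set()
        for v in active:
            value, msgs, halt = compute(superstep, state[v], inbox[v], adj[v])
            state[v] = value
            for dest, msg in msgs:
                outbox[dest].append(msg)
            if not halt:
                next_active.add(v)
        inbox = outbox
        active = next_active | set(outbox)     # messages reactivate halted vertices
    return state
```

In a real system such as Pregel or GraphX, this same loop runs in parallel across workers, and messages that cross partition boundaries travel over the network.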

Distributed Graph Neural Networks (GNNs)

As graphs grow massive, training GNNs at scale introduces new challenges beyond traditional distributed graph processing. In this section, we’ll explore:

  • Why distributed GNNs?

    • Real-world graphs (social, citation, web-scale) cannot fit in the memory of a single GPU or host.

    • Need to partition graphs across machines while preserving neighborhood structure for training.

  • Challenges unique to GNNs:

    • Neighborhood Explosion: Higher-order neighbors expand exponentially, making sampling critical.

    • Communication Bottlenecks: Feature exchange across partitions increases network cost.

  • Popular frameworks and techniques:

    • GraphSAGE-style neighbor sampling to reduce neighborhood explosion (see the sketch after this list).
    • Mini-batch training with subgraph sampling (e.g., Cluster-GCN, GraphSAINT).
  • Architectural patterns:

    • Data parallelism — partition the graph, train locally, periodically sync parameters.

    • Model parallelism — shard large GNN models across GPUs.
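
To make the sampling idea concrete, below is a minimal sketch of GraphSAGE-style fixed-fanout neighbor sampling. The function name, toy graph, and fanouts are made-up illustrations (production systems such as DGL and PyG ship optimized samplers):

```python
# Hypothetical sketch: fixed-fanout sampling caps the number of nodes
# touched per layer, instead of expanding every k-hop neighbor.
import random

def sample_neighborhood(adj, seeds, fanouts):
    """adj: {node: [neighbors]}; fanouts: per-layer sample sizes."""
    layers, frontier = [set(seeds)], set(seeds)
    for fanout in fanouts:
        nxt = set()
        for node in frontier:
            nbrs = adj[node]
            nxt.update(random.sample(nbrs, min(fanout, len(nbrs))))
        layers.append(nxt)
        frontier = nxt
    return layers

adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
print(sample_neighborhood(adj, seeds=[0], fanouts=[2, 2]))
```

With fanouts of [10, 10], a 2-layer batch touches at most 1 + 10 + 100 nodes per seed vertex, regardless of the true degrees.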


This workshop was conducted as part of Rootconf 2025 Annual Conference on 16 May.

By popular demand, the instructor, Varuni HK, is running the workshop again.

Hands-on activities

📌 Distributed Breadth-First Search (BFS) using the Pregel Model

Participants will conceptually design the compute() function for a distributed BFS, tracing message flow and state updates across supersteps to understand the vertex-centric model in action.
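
One possible shape for this exercise, traced on a toy graph in plain Python (the graph and helper names are illustrative, not the Pregel API):

```python
# Hypothetical trace of distributed BFS in the Pregel style: in each
# superstep, a vertex that improves its distance messages its neighbors.
from collections import defaultdict

INF = float("inf")

def bfs_supersteps(adj, source):
    dist = {v: INF for v in adj}
    inbox = {source: [0]}                     # superstep 0: source hears "0"
    step = 0
    while inbox:                              # no messages => all vertices halt
        outbox = defaultdict(list)
        for v, msgs in inbox.items():
            best = min(msgs)
            if best < dist[v]:                # distance improved this superstep
                dist[v] = best
                for n in adj[v]:
                    outbox[n].append(best + 1)
        print(f"superstep {step}: {dist}")
        inbox, step = dict(outbox), step + 1
    return dist

adj = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"],
       "D": ["B", "C", "E"], "E": ["D"]}
bfs_supersteps(adj, "A")
```

The loop terminates when no vertex improves its distance, so no messages are sent: the vertex-centric equivalent of every vertex voting to halt.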

📌 Partitioning Strategy Comparison

Using a small example graph, participants will manually simulate and compare the communication cost (cross-partition messages) for a simple query (e.g., 2-hop neighborhood) under different partitioning strategies:

  • Hash Partitioning
  • Manually optimized Edge-Cut

This highlights the direct impact of partitioning on performance.
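
For reference, the simulation can be as small as the following sketch; the six-vertex graph and both partition assignments are invented for illustration:

```python
# Count cross-partition messages for a 2-hop neighborhood query under
# two different vertex-to-partition assignments (hypothetical example).
def cross_partition_messages(adj, part, source, hops=2):
    msgs, frontier, seen = 0, {source}, {source}
    for _ in range(hops):
        nxt = set()
        for u in frontier:
            for v in adj[u]:
                if part[u] != part[v]:        # message crosses the network
                    msgs += 1
                if v not in seen:
                    nxt.add(v); seen.add(v)
        frontier = nxt
    return msgs

# Two triangles (0-1-2 and 3-4-5) joined by one bridge edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
hash_part = {v: v % 2 for v in adj}           # hash-style: scatters each triangle
edge_cut  = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}  # cuts only the bridge

print(cross_partition_messages(adj, hash_part, source=0))  # 5 crossings
print(cross_partition_messages(adj, edge_cut, source=0))   # 1 crossing
```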

📌 MapReduce Discussion

A refresher on how MapReduce works.
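
A bare-bones single-process illustration of the model: map emits key-value pairs, a shuffle groups them by key, and reduce folds each group. Real frameworks distribute all three phases across machines:

```python
# Hypothetical word-count sketch of the MapReduce programming model.
from collections import defaultdict

def map_fn(line):
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    return word, sum(counts)

def mapreduce(lines):
    groups = defaultdict(list)
    for line in lines:                         # map phase
        for k, v in map_fn(line):
            groups[k].append(v)                # shuffle: group values by key
    return [reduce_fn(k, vs) for k, vs in groups.items()]  # reduce phase

print(mapreduce(["a graph of graphs", "a graph"]))
# [('a', 2), ('graph', 2), ('of', 1), ('graphs', 1)]
```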

📌 Distributed PageRank using Spark GraphX

Participants will implement iterative PageRank using GraphX operators such as aggregateMessages, learning how data flow and aggregation work in a data-parallel framework.
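
GraphX’s aggregateMessages belongs to the Scala API. As a Python-flavored sketch of the same send-and-merge pattern, here is iterative PageRank over plain Spark RDDs; the toy edge list and iteration count are assumptions for illustration:

```python
# Hypothetical PySpark sketch mirroring the aggregateMessages pattern:
# each vertex sends rank/out-degree along its edges ("sendMsg"), and
# contributions are summed per destination vertex ("mergeMsg").
from pyspark import SparkContext

sc = SparkContext("local", "pagerank-sketch")

edges = sc.parallelize([(1, 2), (1, 3), (2, 3), (3, 1)])
links = edges.groupByKey().cache()            # (src, [dst, ...])
ranks = links.mapValues(lambda _: 1.0)        # initial rank 1.0 per vertex

for _ in range(10):                           # fixed number of iterations
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dst, kv[1][1] / len(kv[1][0])) for dst in kv[1][0]])
    ranks = contribs.reduceByKey(lambda a, b: a + b) \
                    .mapValues(lambda s: 0.15 + 0.85 * s)

print(sorted(ranks.collect()))
```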


Prerequisites

  • A basic understanding of graph theory concepts (e.g., BFS, DFS).
  • Familiarity with distributed computing fundamentals (e.g., Data Partitioning, Fault Tolerance).
  • Some exposure to data-parallel paradigms (e.g., MapReduce) and a basic understanding of Spark and RDDs.
  • Comfort with reading and writing Python code.

Key learnings for participants

By the end of the workshop, participants will:

  • Understand the core principles of distributed graph processing.
  • Be familiar with the architectural considerations of distributed graph systems, including partitioning, replication, querying, and fault tolerance.
  • Gain beginner-level insights into using frameworks like Pregel and GraphX.
  • Walk away with conceptual tools to model and debug large-scale graph workloads.

Who should attend?

Anyone interested in understanding how distributed graph systems work and how they differ from traditional distributed systems.

It’s perfect for those looking to explore the unique challenges and solutions in scaling graph data.


👩‍💻 Instructor bio

Varuni works on the Indexing team at Couchbase and previously worked at JP Morgan. Varuni loves distributed systems, graphs, statistics, and ML.

How to attend this workshop

This workshop is open to Rootconf members and to Rootconf 2025 ticket buyers.

This workshop is limited to 20 participants. Seats are available on a first-come, first-served basis. 🎟️

Contact information ☎️

For inquiries about the workshop, contact +91-7676332020 or write to info@hasgeek.com.
