Prachi Jha

@prachiAwesomeJha

Debugging Silent Network Failures with eBPF

Submitted Jan 30, 2026

Problem Statement

When building systems that handle many concurrent connections, TCP drops can happen silently at the kernel level. I discovered this while stress testing a Go auction server—spawning 1000 concurrent clients caused TCP_LISTEN_OVERFLOW drops because connections arrived faster than the server could accept them.

Traditional debugging was frustrating:

  • Application logs showed “connection refused” but not why
  • tcpdump was too heavyweight to run during load testing, and its captures were mostly noise
  • netstat -s showed system-wide drop counters but couldn’t tell me which process or connection was affected

I needed real-time, per-process visibility into TCP drops to understand what was happening during high load.

Why this matters for distributed systems:
When services experience network issues (database connections timing out, microservice communication failures, container networking problems), the root cause is often kernel-level TCP drops that don’t surface in application logs.

Approach/Solution

I built an eBPF-based TCP drop monitor that hooks the Linux kernel’s kfree_skb tracepoint, which fires when a socket buffer is freed, including the paths where packets are dropped. The eBPF program captures drop events inside the kernel and streams them to userspace through a BPF ring buffer, where the tool prints them in real time with process attribution.

Architecture:

  • Kernel: ~30 lines of eBPF C hook the kfree_skb tracepoint and capture the PID, the drop reason code, and the kernel function address where the drop occurred
  • Ring buffer: a 64KB BPF ring buffer streams events to userspace (events stay in kernel memory until read)
  • Userspace: a Go program reads events, resolves kernel addresses to symbol names by binary-searching a sorted table parsed from /proc/kallsyms, and formats output (256KB read buffer)
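The symbol-resolution step can be sketched roughly as follows: parse (address, name) pairs from /proc/kallsyms, sort them by address, then binary-search for the last symbol starting at or below the faulting address. This is a minimal sketch, not the prototype’s exact code; the symbol table here is synthetic, whereas the real tool builds it from /proc/kallsyms at startup.

```go
package main

import (
	"fmt"
	"sort"
)

// ksym is one entry parsed from /proc/kallsyms: a symbol's start address and name.
type ksym struct {
	addr uint64
	name string
}

// resolve maps a kernel address to "symbol+0xoffset" by binary-searching
// a slice of ksyms sorted by ascending address.
func resolve(syms []ksym, addr uint64) string {
	// Index of the first symbol starting strictly after addr.
	i := sort.Search(len(syms), func(i int) bool { return syms[i].addr > addr })
	if i == 0 {
		// Address falls below the lowest known symbol; print it raw.
		return fmt.Sprintf("0x%x", addr)
	}
	s := syms[i-1]
	return fmt.Sprintf("%s+0x%x", s.name, addr-s.addr)
}

func main() {
	// Synthetic table standing in for a parsed /proc/kallsyms.
	syms := []ksym{
		{0xffffffff81000000, "_text"},
		{0xffffffff81a00000, "tcp_v4_syn_rcv"},
		{0xffffffff81a01000, "tcp_v4_do_rcv"},
	}
	sort.Slice(syms, func(a, b int) bool { return syms[a].addr < syms[b].addr })
	fmt.Println(resolve(syms, 0xffffffff81a0008c)) // tcp_v4_syn_rcv+0x8c
}
```

Binary search keeps per-event resolution at O(log n) over the ~100k symbols a typical kernel exports, which matters when drops arrive in bursts.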

Drop reasons detected:

  • TCP_LISTEN_OVERFLOW - listen queue overflowed (my stress test case)
  • TCP_RETRANSMIT - retransmit failures
  • NETFILTER_DROP - firewall drops
  • TCP_CSUM - checksum errors
  • And more...
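On the userspace side, decoding a drop reason is just a table lookup. A hedged sketch: the numeric codes below are illustrative placeholders, not the kernel’s real values — the actual codes come from the kernel’s enum skb_drop_reason and vary by kernel version, so a real tool should derive this table from the running kernel’s headers or BTF.

```go
package main

import "fmt"

// dropReasons maps a kernel drop reason code to a human-readable name.
// NOTE: the numeric keys are placeholders for illustration; real values
// come from the kernel's enum skb_drop_reason and differ across versions.
var dropReasons = map[uint32]string{
	1: "NETFILTER_DROP",
	2: "TCP_CSUM",
	3: "TCP_RETRANSMIT",
	4: "TCP_LISTEN_OVERFLOW",
}

// reasonName decodes a reason code, falling back to UNKNOWN(n) for codes
// the table doesn't cover (newer kernels keep adding reasons).
func reasonName(code uint32) string {
	if name, ok := dropReasons[code]; ok {
		return name
	}
	return fmt.Sprintf("UNKNOWN(%d)", code)
}

func main() {
	fmt.Println(reasonName(4))    // TCP_LISTEN_OVERFLOW
	fmt.Println(reasonName(9999)) // UNKNOWN(9999)
}
```

The UNKNOWN fallback matters in practice: failing loudly on an unmapped code would make the tool break on every kernel upgrade.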

Caveats:

  • PID attribution is best-effort: the eBPF program captures whatever task is current when the drop fires, which is often the owning process but can be an unrelated task (e.g., ksoftirqd when the drop happens in softirq context)
  • kfree_skb is a hot path. At extreme drop rates, monitoring overhead becomes non-trivial
  • This is a debugging tool, not a production monitoring solution at high scale

Example output:

[15:04:23] Drop | PID: 1234 | Reason: TCP_LISTEN_OVERFLOW | Function: tcp_v4_syn_rcv+0x8c

Now you know: Process 1234 is experiencing TCP drops. You can investigate the specific process and connection.

Demo: Live TCP drop detection using SYN flood to trigger TCP_LISTEN_OVERFLOW drops, showing real-time per-process monitoring.

Takeaways for the Audience

1. eBPF for Distributed Systems Debugging

  • Kernel-level visibility without kernel modules
  • Safe (verifier-checked) and low overhead; widely used in production observability tooling
  • Useful for debugging Postgres replication lag, Kafka consumer issues, Spark executor disconnects

2. When to Use eBPF

  • Need kernel-level events (network, I/O, scheduling)
  • High-frequency monitoring with minimal overhead
  • Process-level attribution of system events
  • Don’t use it for: application logic, anything userspace can handle

Key insight: eBPF gives you visibility into kernel events, but attributing those events to userspace processes is tricky. For my use case (debugging listen queue overflows), the PID is usually correct. For other scenarios, you need to validate.

3. Real-World Applications

  • Load Testing: Debug why your server can’t handle expected load (my use case: auction server stress testing)
  • Distributed Systems: Investigate connection timeouts, service communication failures
  • Production Debugging: When applications show network symptoms but logs don’t explain why

4. Production Considerations

  • Kernel version compatibility (BTF helps)
  • Requires root/sudo to load eBPF programs into kernel
  • Integration with existing metrics systems
  • Trade-offs: real-time detail vs aggregated efficiency
  • Best used for debugging specific issues, not continuous production monitoring at scale

Honest assessment: This tool helped me debug my load testing issue. For production use at scale, you’d want kernel-side aggregation and more rigorous testing of the PID attribution accuracy.

Key lesson: When distributed systems show symptoms but application logs don’t explain why, drop down to the kernel level. TCP drops are often the smoking gun.


Project status: Working prototype, will be open-sourced on GitHub. This is a learning share and debugging tool, not a product pitch.


Hosted by

Bengaluru Systems Meetup
