Prachi Jha

@prachiAwesomeJha

Debugging Silent Network Failures with eBPF

Submitted Jan 30, 2026

Problem Statement

When building systems that handle many concurrent connections, TCP drops can happen silently at the kernel level. I discovered this while stress testing a Go auction server—spawning 1000 concurrent clients caused TCP_LISTEN_OVERFLOW drops because connections arrived faster than the server could accept them.

Traditional debugging was frustrating:

  • Application logs showed “connection refused” but not why
  • tcpdump was too heavyweight to run during load testing, and its captures were mostly noise
  • netstat -s showed system-wide drop counters but couldn’t tell me which process or connection was affected

I needed real-time, per-process visibility into TCP drops to understand what was happening during high load.

Why this matters for distributed systems:
When services experience network issues (database connections timing out, microservice communication failures, container networking problems), the root cause is often kernel-level TCP drops that don’t surface in application logs.

Approach/Solution

I built an eBPF-based TCP drop monitor that hooks the Linux kernel’s kfree_skb tracepoint, which fires when a socket buffer is freed, including the paths where packets are dropped. The eBPF program captures drop events inside the kernel and streams them to userspace through a BPF ring buffer, where the tool prints them in real time with process attribution.

Architecture:

  • Kernel: ~30 lines of eBPF C hook the kfree_skb tracepoint and capture the PID, the drop reason code, and the kernel function address where the drop occurred
  • Ring buffer: a 64KB BPF ring buffer streams events to userspace (events stay in kernel memory until read)
  • Userspace: a Go program reads events, resolves kernel addresses to symbol names by binary-searching a sorted table parsed from /proc/kallsyms, and formats output (256KB read buffer)
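The symbol-resolution step can be sketched roughly as follows: parse (address, name) pairs from /proc/kallsyms, sort them by address, then binary-search for the last symbol starting at or below the faulting address. This is a minimal sketch, not the prototype’s exact code; the symbol table here is synthetic, whereas the real tool builds it from /proc/kallsyms at startup.

```go
package main

import (
	"fmt"
	"sort"
)

// ksym is one entry parsed from /proc/kallsyms: a symbol's start address and name.
type ksym struct {
	addr uint64
	name string
}

// resolve maps a kernel address to "symbol+0xoffset" by binary-searching
// a slice of ksyms sorted by ascending address.
func resolve(syms []ksym, addr uint64) string {
	// Index of the first symbol starting strictly after addr.
	i := sort.Search(len(syms), func(i int) bool { return syms[i].addr > addr })
	if i == 0 {
		// Address falls below the lowest known symbol; print it raw.
		return fmt.Sprintf("0x%x", addr)
	}
	s := syms[i-1]
	return fmt.Sprintf("%s+0x%x", s.name, addr-s.addr)
}

func main() {
	// Synthetic table standing in for a parsed /proc/kallsyms.
	syms := []ksym{
		{0xffffffff81000000, "_text"},
		{0xffffffff81a00000, "tcp_v4_syn_rcv"},
		{0xffffffff81a01000, "tcp_v4_do_rcv"},
	}
	sort.Slice(syms, func(a, b int) bool { return syms[a].addr < syms[b].addr })
	fmt.Println(resolve(syms, 0xffffffff81a0008c)) // tcp_v4_syn_rcv+0x8c
}
```

Binary search keeps per-event resolution at O(log n) over the ~100k symbols a typical kernel exports, which matters when drops arrive in bursts.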

Drop reasons detected:

  • TCP_LISTEN_OVERFLOW - listen queue overflowed (my stress test case)
  • TCP_RETRANSMIT - retransmit failures
  • NETFILTER_DROP - firewall drops
  • TCP_CSUM - checksum errors
  • And more...
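On the userspace side, decoding a drop reason is just a table lookup. A hedged sketch: the numeric codes below are illustrative placeholders, not the kernel’s real values — the actual codes come from the kernel’s enum skb_drop_reason and vary by kernel version, so a real tool should derive this table from the running kernel’s headers or BTF.

```go
package main

import "fmt"

// dropReasons maps a kernel drop reason code to a human-readable name.
// NOTE: the numeric keys are placeholders for illustration; real values
// come from the kernel's enum skb_drop_reason and differ across versions.
var dropReasons = map[uint32]string{
	1: "NETFILTER_DROP",
	2: "TCP_CSUM",
	3: "TCP_RETRANSMIT",
	4: "TCP_LISTEN_OVERFLOW",
}

// reasonName decodes a reason code, falling back to UNKNOWN(n) for codes
// the table doesn't cover (newer kernels keep adding reasons).
func reasonName(code uint32) string {
	if name, ok := dropReasons[code]; ok {
		return name
	}
	return fmt.Sprintf("UNKNOWN(%d)", code)
}

func main() {
	fmt.Println(reasonName(4))    // TCP_LISTEN_OVERFLOW
	fmt.Println(reasonName(9999)) // UNKNOWN(9999)
}
```

The UNKNOWN fallback matters in practice: failing loudly on an unmapped code would make the tool break on every kernel upgrade.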

Caveats:

  • PID attribution is best-effort: the eBPF program captures whatever task is current when the drop fires, which is often the owning process but can be an unrelated task (e.g., ksoftirqd when the drop happens in softirq context)
  • kfree_skb is a hot path. At extreme drop rates, monitoring overhead becomes non-trivial
  • This is a debugging tool, not a production monitoring solution at high scale

Example output:

[15:04:23] Drop | PID: 1234 | Reason: TCP_LISTEN_OVERFLOW | Function: tcp_v4_syn_rcv+0x8c

Now you know: Process 1234 is experiencing TCP drops. You can investigate the specific process and connection.

Demo: Live TCP drop detection using SYN flood to trigger TCP_LISTEN_OVERFLOW drops, showing real-time per-process monitoring.

Takeaways for the Audience

1. eBPF for Distributed Systems Debugging

  • Kernel-level visibility without kernel modules
  • Safe (verifier-checked) and low overhead; widely used in production observability tooling
  • Useful for debugging Postgres replication lag, Kafka consumer issues, Spark executor disconnects

2. When to Use eBPF

  • Need kernel-level events (network, I/O, scheduling)
  • High-frequency monitoring with minimal overhead
  • Process-level attribution of system events
  • Don’t use it for: application logic, anything userspace can handle

Key insight: eBPF gives you visibility into kernel events, but attributing those events to userspace processes is tricky. For my use case (debugging listen queue overflows), the PID is usually correct. For other scenarios, you need to validate.

3. Real-World Applications

  • Load Testing: Debug why your server can’t handle expected load (my use case: auction server stress testing)
  • Distributed Systems: Investigate connection timeouts, service communication failures
  • Production Debugging: When applications show network symptoms but logs don’t explain why

4. Production Considerations

  • Kernel version compatibility (BTF helps)
  • Requires root/sudo to load eBPF programs into kernel
  • Integration with existing metrics systems
  • Trade-offs: real-time detail vs aggregated efficiency
  • Best used for debugging specific issues, not continuous production monitoring at scale

Honest assessment: This tool helped me debug my load testing issue. For production use at scale, you’d want kernel-side aggregation and more rigorous testing of the PID attribution accuracy.

Key lesson: When distributed systems show symptoms but application logs don’t explain why, drop down to the kernel level. TCP drops are often the smoking gun.


Project status: Working prototype, will be open-sourced on GitHub. This is a learning share and debugging tool, not a product pitch.


Hosted by

Bengaluru Systems Meetup
