tcptop/.planning/research/PITFALLS.md
2026-03-21 18:08:55 -04:00

Pitfalls Research

Domain: Rust eBPF network monitoring CLI (tcptop)
Researched: 2026-03-21
Confidence: HIGH (well-documented domain with many real-world projects to learn from)

Critical Pitfalls

Pitfall 1: macOS Has No eBPF -- Platform Abstraction Must Be Day-One Architecture

What goes wrong: Developers start building the eBPF data collection layer for Linux, get it working, then discover macOS support requires a completely different backend (DTrace, nstat, Network Extension Framework, or libpcap). The resulting code is tightly coupled to eBPF concepts, and retrofitting a platform abstraction layer requires rewriting most of the data pipeline.

Why it happens: eBPF is Linux-only. macOS uses DTrace for kernel tracing, but DTrace is severely limited by System Integrity Protection (SIP) -- even with sudo, SIP prevents tracing system binaries and most kernel probes. Apple's Network Extension framework requires Objective-C/Swift and entitlements. There is no clean equivalent to Linux eBPF on macOS.

How to avoid: Define a DataSource trait from the very first line of code. The trait returns platform-agnostic connection records (source/dest IP, port, bytes, packets, state, PID). Linux implements it with eBPF via Aya. macOS implements it with a fallback: likely nettop/nstat parsing, libpcap, or lsof+netstat polling. Accept that macOS will have lower fidelity data -- this is a known tradeoff. Do NOT attempt to build a DTrace backend that requires users to disable SIP.
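
A minimal sketch of what that trait boundary could look like. The names `ConnectionRecord`, `ConnState`, `Protocol`, `StaticSource`, and `sample_record` are illustrative assumptions, not part of the project; only the `DataSource` name and the field list come from the text above.

```rust
use std::net::{IpAddr, Ipv4Addr};

/// Transport protocol of a tracked connection or flow.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Protocol {
    Tcp,
    Udp,
}

/// Coarse connection state; a real backend would likely carry more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConnState {
    Established,
    Closed,
    Unknown,
}

/// Platform-agnostic record. No eBPF map handles or file descriptors appear here.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ConnectionRecord {
    pub protocol: Protocol,
    pub src: (IpAddr, u16),
    pub dst: (IpAddr, u16),
    pub bytes: u64,
    pub packets: u64,
    pub state: ConnState,
    /// None when the backend cannot attribute the connection to a process.
    pub pid: Option<u32>,
}

/// Hypothetical boundary between collection backends and the rest of the tool.
pub trait DataSource {
    /// Human-readable backend name, e.g. for a status line.
    fn name(&self) -> &'static str;
    /// Return the records observed since the previous call.
    fn poll(&mut self) -> Result<Vec<ConnectionRecord>, String>;
}

/// In-memory backend for tests. Real backends would be chosen with cfg(target_os).
pub struct StaticSource {
    pub records: Vec<ConnectionRecord>,
}

impl DataSource for StaticSource {
    fn name(&self) -> &'static str {
        "static"
    }

    fn poll(&mut self) -> Result<Vec<ConnectionRecord>, String> {
        // Hand out the queued records once, then report empty intervals.
        Ok(std::mem::take(&mut self.records))
    }
}

/// Example record for exercising a backend.
pub fn sample_record() -> ConnectionRecord {
    ConnectionRecord {
        protocol: Protocol::Tcp,
        src: (IpAddr::V4(Ipv4Addr::new(10, 0, 0, 1)), 40000),
        dst: (IpAddr::V4(Ipv4Addr::new(10, 0, 0, 2)), 443),
        bytes: 6600,
        packets: 12,
        state: ConnState::Established,
        pid: Some(4242),
    }
}
```

Code above the trait (TUI, CSV logging) would hold a `Box<dyn DataSource>` and never name a concrete backend. An in-memory source like `StaticSource` also lets the pipeline be tested without root or a Linux kernel.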

Warning signs:

  • eBPF-specific types (map handles, program file descriptors) leaking into core data structures
  • No cfg(target_os) gates in the first prototype
  • "We'll add macOS later" appearing in planning discussions

Phase to address: Phase 1 (Architecture/Foundation). The trait boundary must exist before any eBPF code is written.


Pitfall 2: eBPF Verifier Rejection -- Programs That Compile Fine But the Kernel Refuses to Load

What goes wrong: The eBPF program compiles successfully with rustc/bpf-linker but the kernel verifier rejects it at load time. Common rejections: "program too complex" (exceeding instruction complexity limit), unbounded loops, stack overflow (512-byte limit), or accessing memory the verifier cannot prove is safe. This is the single most frustrating part of eBPF development.

Why it happens: The eBPF verifier simulates every possible execution path. A loop bounded by a 16-bit variable can force the verifier to check up to 65,535 iterations, and branches inside loops multiply the number of states it has to explore. Rust's safe abstractions (Option, Result, match) tend to generate more branches than equivalent C, so complexity limits are hit sooner. The stack is limited to 512 bytes (256 bytes when tail calls are combined with BPF-to-BPF calls), so large local structs are not an option.

How to avoid:

  • Keep eBPF programs minimal: extract only the data needed (src/dst IP, port, bytes, PID) and push everything else to userspace
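  • Use the bpf_loop() helper (kernel 5.17+) for longer iteration when the minimum supported kernel allows it; on older kernels, keep loops short with compile-time constant bounds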
  • Avoid deep nesting; flatten conditionals
  • Use eBPF maps (HashMap, PerCpuHashMap) to store state instead of stack variables
  • Test verifier acceptance on the oldest target kernel version, not just your dev machine
  • Write small, focused programs (one per hook point) rather than monolithic programs
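
One way to keep the kernel-side payload small is to fix its layout and guard its size at compile time. The sketch below is an assumption about what such an event might look like: `ConnEvent`, its field set, and the 64-byte budget are hypothetical. It uses only `core`, so the same definition could sit in a shared `no_std` crate.

```rust
use core::mem::size_of;

/// Hypothetical kernel-to-userspace event holding only the fields listed above.
/// #[repr(C)] keeps the layout identical on both sides of the boundary.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ConnEvent {
    /// IPv4 addresses are assumed to be stored in IPv4-mapped IPv6 form.
    pub src_addr: [u8; 16],
    pub dst_addr: [u8; 16],
    pub bytes: u64,
    pub pid: u32,
    pub src_port: u16,
    pub dst_port: u16,
}

/// Stack budget this sketch allows for one event (an assumption, far below 512 bytes).
pub const EVENT_STACK_BUDGET: usize = 64;

// Compile-time guard: the build fails if the event grows past the budget.
const _: () = assert!(size_of::<ConnEvent>() <= EVENT_STACK_BUDGET);

/// Size of the event in bytes, exposed for logging or tests.
pub fn event_size() -> usize {
    size_of::<ConnEvent>()
}
```

With this field order the struct has no padding, and the guard turns an oversized event into a build error instead of a verifier rejection at load time.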

Warning signs:

  • eBPF program source file exceeding 200 lines
  • Using Rust iterators or complex pattern matching in eBPF code
  • "Works on kernel 6.x but fails on 5.15"
  • Stack variables larger than ~100 bytes

Phase to address: Phase 2 (eBPF implementation). Must be understood before writing the first kprobe/tracepoint.


Pitfall 3: Nightly Rust Toolchain and bpf-linker Fragility

What goes wrong: Aya requires Rust nightly for eBPF compilation (bpfel-unknown-none target). The bpf-linker requires a specific LLVM version matching the nightly toolchain. A routine rustup update breaks the build because bpf-linker is pinned to a specific LLVM version that no longer matches the new nightly. CI breaks. Contributors cannot build. Days are lost debugging toolchain issues.

Why it happens: eBPF compilation uses unstable Rust features (build-std=core, BPF target). bpf-linker links against LLVM and must match the LLVM version used by rustc. When rustc bumps its LLVM (e.g., from 20 to 21), bpf-linker may not have a matching release yet.

How to avoid:

  • Pin the exact nightly date in rust-toolchain.toml (e.g., channel = "nightly-2025-11-28")
  • Pin bpf-linker version in CI and document the exact install command
  • Use a workspace with separate xtask build orchestration (the Aya template provides this pattern)
  • Test nightly updates in a branch before merging to main
  • Document the full toolchain setup in CONTRIBUTING.md

Warning signs:

  • No rust-toolchain.toml in the repository
  • cargo install bpf-linker without version pinning
  • CI using nightly instead of nightly-YYYY-MM-DD
  • "Works on my machine" but fails for other contributors
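
The "nightly instead of nightly-YYYY-MM-DD" warning sign can be checked mechanically, for example from an xtask or CI step. The sketch below is an assumption about how such a check might look. `toolchain_channel` and `is_pinned_nightly` are hypothetical helpers, and the line scan is deliberately naive rather than a full TOML parser.

```rust
/// Returns true for channels of the form "nightly-YYYY-MM-DD".
/// Only the shape is checked, not whether the date is a real calendar date.
pub fn is_pinned_nightly(channel: &str) -> bool {
    let Some(date) = channel.strip_prefix("nightly-") else {
        return false;
    };
    let parts: Vec<&str> = date.split('-').collect();
    if parts.len() != 3 {
        return false;
    }
    let lens_ok = parts[0].len() == 4 && parts[1].len() == 2 && parts[2].len() == 2;
    let digits_ok = parts
        .iter()
        .all(|part| part.chars().all(|c| c.is_ascii_digit()));
    lens_ok && digits_ok
}

/// Extracts the value of the first `channel = "..."` line from rust-toolchain.toml text.
pub fn toolchain_channel(toml_text: &str) -> Option<&str> {
    toml_text.lines().find_map(|line| {
        let rest = line.trim().strip_prefix("channel")?;
        let value = rest.trim_start().strip_prefix('=')?.trim();
        value.strip_prefix('"')?.strip_suffix('"')
    })
}
```

A CI step could read `rust-toolchain.toml`, pass its contents to `toolchain_channel`, and fail the build when `is_pinned_nightly` returns false.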

Phase to address: Phase 1 (Project setup). Pin toolchain before writing any code.


Pitfall 4: Kernel Version Compatibility -- What Works on Your Dev Machine Fails in Production

What goes wrong: eBPF features vary dramatically across kernel versions. Your program uses kfuncs, ring_buffer, or BTF features available on kernel 6.x but your users run Ubuntu 20.04 (kernel 5.4) or RHEL 8 (kernel 4.18). The program either fails to load or silently produces no data.

Why it happens: Key feature availability by kernel version:

  • 4.18: Basic kprobes, tracepoints, perf_event_array
  • 5.4: BTF support begins
  • 5.8: Ring buffer (BPF_MAP_TYPE_RINGBUF)
  • 5.17: bpf_loop() helper, CO-RE improvements
  • 6.x: Various kfuncs, improved verifier

Most developers work on recent kernels and never test on older ones.

How to avoid:

  • Decide on a minimum kernel version early (recommend 5.4 for broad compatibility, or 5.8 if ring_buffer is needed)
  • Use perf_event_array as fallback if ring_buffer is unavailable
  • Check /sys/kernel/btf/vmlinux at startup; provide a clear error message if BTF is missing
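  • Use kprobes on kernel functions that have stayed stable in practice (tcp_v4_connect, tcp_set_state, tcp_sendmsg) rather than tracepoints that may not exist on older kernels; kprobe targets are not a guaranteed ABI, so confirm each symbol exists on every supported kernel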
  • Test in VMs with minimum supported kernel version as part of CI
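
The startup checks above can be sketched as a small, pure decision function. The version thresholds and the BTF path come from this section. The names `parse_kernel_release`, `check_support`, and `EventTransport`, and the choice of 5.4 as the minimum, are assumptions for illustration.

```rust
use std::path::Path;

/// Transport used to move events from kernel to userspace.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EventTransport {
    RingBuffer,
    PerfEventArray,
}

/// Minimum supported kernel in this sketch (the section's broad-compatibility option).
pub const MIN_KERNEL: (u32, u32) = (5, 4);
pub const BTF_PATH: &str = "/sys/kernel/btf/vmlinux";

/// Parse (major, minor) from a release string such as "5.15.0-91-generic".
pub fn parse_kernel_release(release: &str) -> Option<(u32, u32)> {
    let mut parts = release.trim().split('.');
    let major = parts.next()?.parse().ok()?;
    // Keep only leading digits so suffixes such as "8-rc1" do not break parsing.
    let minor_digits: String = parts
        .next()?
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    let minor = minor_digits.parse().ok()?;
    Some((major, minor))
}

/// True when the kernel exposes BTF at the usual location.
pub fn btf_available() -> bool {
    Path::new(BTF_PATH).exists()
}

/// Decide whether the kernel is usable and which transport to pick.
pub fn check_support(version: (u32, u32), btf_present: bool) -> Result<EventTransport, String> {
    if version < MIN_KERNEL {
        return Err(format!(
            "kernel {}.{} is older than the minimum supported {}.{}",
            version.0, version.1, MIN_KERNEL.0, MIN_KERNEL.1
        ));
    }
    if !btf_present {
        return Err(format!("BTF not found at {}", BTF_PATH));
    }
    // BPF_MAP_TYPE_RINGBUF exists from 5.8; older kernels fall back to perf_event_array.
    if version >= (5, 8) {
        Ok(EventTransport::RingBuffer)
    } else {
        Ok(EventTransport::PerfEventArray)
    }
}
```

At startup the release string could be read from `/proc/sys/kernel/osrelease` and combined with `btf_available()`. Keeping `check_support` free of I/O makes every kernel/BTF combination testable on any machine. Version numbers are only a heuristic because distribution kernels backport features, so probing for the feature itself is the more robust check.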

Warning signs:

  • No minimum kernel version documented
  • Using features without checking kernel availability
  • Only testing on your development kernel
  • No runtime feature detection at program startup

Phase to address: Phase 2 (eBPF implementation) for initial decisions; Phase 4 (Packaging/Distribution) for runtime detection and error messages.


Pitfall 5: Missed Events and Data Gaps Under High Connection Volume

What goes wrong: Under heavy network load (thousands of connections/second), the eBPF-to-userspace data pipeline drops events silently. The TUI shows stale data, ghost connections that never disappear, or missing connections. Users lose trust in the tool's accuracy.

Why it happens: Multiple failure modes:

  1. Perf buffer overflow: Per-CPU buffers fill up; the kernel drops events and reports only a lost-sample count, which is easy to ignore
  2. Ring buffer overflow: Single shared buffer fills; new events are dropped
  3. kRetProbe limit: Kernel limits active kRetProbes to ~4,096 (kernel 6.4.5 default); excess probes fail silently
  4. Race conditions: Connection closes between eBPF probe firing and userspace processing; "ghost" entries persist
  5. Userspace processing lag: TUI rendering blocks event consumption; events queue up and overflow

How to avoid:

  • Use ring_buffer over perf_event_array (5x less overhead on multi-core systems; 7% vs 35%)
  • Process events in a dedicated thread, separate from TUI rendering
  • Implement periodic reconciliation: sweep connection map and remove entries for connections no longer in kernel state
  • Monitor and expose a "dropped events" counter so users know when data is incomplete
  • Size ring buffer appropriately (start with 256KB, make configurable)
  • Use per-CPU hash maps for aggregation in kernel space; send summaries to userspace, not per-packet events
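
The thread separation and the dropped-events counter can be combined in one userspace hand-off. In the sketch below, a reader thread would drain the kernel buffer and offer events to a bounded queue without ever blocking, counting what it cannot deliver. `EventQueue` is a hypothetical type built on the standard library. It models only the userspace side, not the kernel ring buffer.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};
use std::sync::Arc;

/// Bounded hand-off from the event-reader thread to the consumer (e.g. the TUI).
pub struct EventQueue<T> {
    tx: SyncSender<T>,
    dropped: Arc<AtomicU64>,
}

impl<T> EventQueue<T> {
    /// Create a queue holding at most `capacity` undelivered events.
    pub fn new(capacity: usize) -> (Self, Receiver<T>) {
        let (tx, rx) = sync_channel(capacity);
        let queue = Self {
            tx,
            dropped: Arc::new(AtomicU64::new(0)),
        };
        (queue, rx)
    }

    /// Non-blocking hand-off. Returns false and bumps the counter when the
    /// queue is full or the consumer has gone away.
    pub fn offer(&self, event: T) -> bool {
        match self.tx.try_send(event) {
            Ok(()) => true,
            Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => {
                self.dropped.fetch_add(1, Ordering::Relaxed);
                false
            }
        }
    }

    /// Number of events dropped so far.
    pub fn dropped(&self) -> u64 {
        self.dropped.load(Ordering::Relaxed)
    }

    /// Shared handle so another thread (e.g. the status bar) can read the counter.
    pub fn dropped_handle(&self) -> Arc<AtomicU64> {
        Arc::clone(&self.dropped)
    }
}
```

Because `offer` never blocks, a slow renderer cannot stall the thread draining the kernel buffer. The counter makes any resulting loss visible instead of silent. Drops that happen inside the kernel would need their own counter, for example one incremented by the eBPF program when a buffer write fails.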

Warning signs:

  • Connection count in TUI never decreases
  • CPU spikes correlate with network activity
  • No "events dropped" metric anywhere in the codebase
  • Single-threaded event loop handling both eBPF events and TUI rendering
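
A connection count that never decreases usually means the reconciliation sweep is missing. A minimal sketch of that sweep follows, assuming the set of live connections has already been obtained from an authoritative source (how it is obtained is left open here). `ConnKey` and `reconcile` are hypothetical names.

```rust
use std::collections::{HashMap, HashSet};
use std::net::SocketAddr;

/// Identity of a tracked connection.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ConnKey {
    pub local: SocketAddr,
    pub remote: SocketAddr,
}

/// Remove tracked entries whose key is absent from the live set.
/// Returns how many ghost entries were removed.
pub fn reconcile<V>(tracked: &mut HashMap<ConnKey, V>, live: &HashSet<ConnKey>) -> usize {
    let before = tracked.len();
    tracked.retain(|key, _| live.contains(key));
    before - tracked.len()
}
```

Running this on a timer bounds how long a ghost entry survives, even when the close event for a connection was dropped.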

Phase to address: Phase 2 (eBPF data pipeline) for architecture; Phase 3 (TUI) for thread separation.


Pitfall 6: UDP "Connection" Tracking Is a Fundamentally Different Problem

What goes wrong: Developers model UDP tracking the same way as TCP -- expecting connect/accept/close lifecycle events. But UDP is connectionless. There are no state transitions to hook. The tool either shows nothing for UDP or shows every single datagram as a separate "connection."

Why it happens: TCP has explicit state machine hooks (tcp_v4_connect, tcp_close, tcp_set_state). UDP has none -- it's fire-and-forget. The kernel does track UDP sockets, but there's no equivalent to TCP connection lifecycle.

How to avoid:

  • Track UDP as "flows" not "connections": aggregate by (src_ip, src_port, dst_ip, dst_port) tuple
  • Hook udp_sendmsg and udp_recvmsg for byte/packet counting
  • Implement flow timeout: if no packets seen for N seconds (configurable, default 30s), mark flow as expired
  • Display UDP flows separately or with a distinct state indicator ("ACTIVE" / "IDLE" / "EXPIRED")
  • Do NOT promise RTT/latency for UDP -- it's meaningless without application-layer protocol knowledge (the project already notes this in PROJECT.md)
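
The flow model above can be sketched as a table keyed by the 4-tuple, with state derived from the time since the last datagram. The 30-second expiry is the default named above. The separate idle threshold, the type names, and the use of a `Duration` offset from program start as the clock are assumptions made to keep the example deterministic.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::time::Duration;

/// (src_ip, src_port, dst_ip, dst_port) tuple identifying a UDP flow.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct FlowKey {
    pub src: SocketAddr,
    pub dst: SocketAddr,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FlowState {
    Active,
    Idle,
    Expired,
}

#[derive(Debug, Default, Clone, Copy)]
struct FlowStats {
    bytes: u64,
    packets: u64,
    last_seen: Duration,
}

pub struct UdpFlowTable {
    flows: HashMap<FlowKey, FlowStats>,
    idle_after: Duration,
    expire_after: Duration,
}

impl UdpFlowTable {
    pub fn new(idle_after: Duration, expire_after: Duration) -> Self {
        Self { flows: HashMap::new(), idle_after, expire_after }
    }

    /// Account one datagram of `bytes` seen at time `now`.
    pub fn record(&mut self, key: FlowKey, bytes: u64, now: Duration) {
        let stats = self.flows.entry(key).or_default();
        stats.bytes += bytes;
        stats.packets += 1;
        stats.last_seen = now;
    }

    /// (bytes, packets) accumulated for a flow.
    pub fn totals(&self, key: &FlowKey) -> Option<(u64, u64)> {
        self.flows.get(key).map(|s| (s.bytes, s.packets))
    }

    /// State derived from how long the flow has been silent.
    pub fn state(&self, key: &FlowKey, now: Duration) -> Option<FlowState> {
        let age = now.saturating_sub(self.flows.get(key)?.last_seen);
        Some(if age >= self.expire_after {
            FlowState::Expired
        } else if age >= self.idle_after {
            FlowState::Idle
        } else {
            FlowState::Active
        })
    }

    /// Drop expired flows; returns how many were removed.
    pub fn sweep(&mut self, now: Duration) -> usize {
        let expire_after = self.expire_after;
        let before = self.flows.len();
        self.flows.retain(|_, s| now.saturating_sub(s.last_seen) < expire_after);
        before - self.flows.len()
    }
}
```

The table has no notion of connect or close, which keeps it separate from the TCP state machine. In a real tool `now` would come from a monotonic clock such as `Instant::now()` measured against program start.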

Warning signs:

  • Shared data structures between TCP and UDP tracking with a "state" field that doesn't apply to UDP
  • No timeout/expiry mechanism for UDP entries
  • Attempting to estimate UDP RTT without acknowledging it's heuristic at best

Phase to address: Phase 2 (eBPF implementation). Design UDP flow tracking as a separate subsystem from TCP state tracking.


Technical Debt Patterns

| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
|---|---|---|---|
| Hardcoding eBPF as the only backend | Faster initial development | Cannot support macOS; testing requires root/kernel | Never -- trait abstraction is cheap upfront |
| Polling /proc/net/tcp instead of eBPF | Works without root, no kernel dependency | Misses short-lived connections, high overhead, no per-packet stats | MVP prototype only, replace before v0.1 |
| Single-threaded event loop | Simpler code, no synchronization | TUI blocks event processing; dropped events under load | Never for production; acceptable for proof-of-concept |
| Using perf_event_array when ring_buffer is available | Works on kernel 5.4+ | 5x higher overhead on multi-core; event ordering issues | Only as fallback for kernel < 5.8 |
| Embedding eBPF bytecode at compile time | No runtime compilation needed | Locked to one kernel version's struct layouts without CO-RE | Acceptable if using CO-RE/BTF for portability |
| Skipping process (PID) resolution | Simpler eBPF programs | Major missing feature; users expect to know which process owns a connection | MVP only; add in same phase as eBPF work |

Integration Gotchas

| Integration | Common Mistake | Correct Approach |
|---|---|---|
| Aya eBPF map access | Using HashMap when data is updated from multiple CPUs, causing lock contention | Use PerCpuHashMap for counters; aggregate in userspace |
| Kernel kprobes | Hooking internal kernel functions that get renamed/removed across versions | Hook stable exported functions: tcp_v4_connect, tcp_v6_connect, tcp_close, tcp_set_state, tcp_sendmsg, tcp_recvmsg |
| Process info from eBPF | Calling bpf_get_current_pid_tgid() in network hooks where the context may be a kernel thread (softirq) | Accept that some connections will report PID 0 or the PID of an unrelated interrupted task; fall back to the socket owner UID (sk->sk_uid) or a /proc lookup |
| Terminal raw mode | Not restoring terminal on panic/crash | Use scopeguard or a custom panic handler to restore terminal state before exit |
| CSV logging | Flushing on every write | Buffer writes with BufWriter; flush on an interval or on signal |
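
For the terminal raw mode row, a drop guard is one way to get restore-on-panic without an extra crate. The sketch below is generic: `RestoreGuard` is a hypothetical type, and the closure passed to it stands in for whatever restore call the chosen TUI library provides.

```rust
/// Runs a cleanup closure when dropped, including during panic unwinding.
/// This does not help when the binary is built with panic = "abort".
pub struct RestoreGuard<F: FnMut()> {
    restore: F,
}

impl<F: FnMut()> RestoreGuard<F> {
    /// Arm the guard; `restore` would leave raw mode and the alternate screen.
    pub fn new(restore: F) -> Self {
        Self { restore }
    }
}

impl<F: FnMut()> Drop for RestoreGuard<F> {
    fn drop(&mut self) {
        (self.restore)();
    }
}
```

The guard would be created right after entering raw mode and kept alive for the lifetime of the TUI loop, so early returns and panics share the same cleanup path.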

Performance Traps

| Trap | Symptoms | Prevention | When It Breaks |
|---|---|---|---|
| TUI rendering at uncapped frame rate | 50%+ CPU on a single core; fan spin | Cap at 4 FPS for a monitoring tool; use event-driven rendering (only redraw on new data or user input) | Immediately in debug builds; visible in release at 60 FPS |
| Per-packet events from eBPF to userspace | Ring buffer overflow; massive CPU in userspace parsing | Aggregate in-kernel using PerCpuHashMap; send periodic summaries or only state-change events | Above ~10k packets/second |
| Sorting the connection table on every frame | Noticeable lag with 1000+ connections | Sort only when sort column changes or on a timer (every 1s); use incremental insert for new entries | Above ~500 connections |
| String formatting for every connection row | Allocations per frame per row | Pre-allocate row buffers; use write! into reusable String buffers | Above ~200 connections at 4+ FPS |
| DNS reverse lookup for every IP | Blocks rendering; DNS timeout stalls TUI | Cache resolved names; resolve asynchronously; display IP immediately, update with name when ready | Any network with DNS latency > 50ms |
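
The frame cap and event-driven rendering from the first row can be combined in a small limiter: redraw only when something changed, and never more often than the cap. `FrameLimiter` is a hypothetical helper. As an assumption for testability, time is passed in as a `Duration` offset rather than read from a clock.

```rust
use std::time::Duration;

/// Decides when the TUI may redraw: only if dirty, and at most `max_fps` times per second.
pub struct FrameLimiter {
    min_interval: Duration,
    last_draw: Option<Duration>,
    dirty: bool,
}

impl FrameLimiter {
    pub fn with_max_fps(max_fps: u32) -> Self {
        Self {
            // 4 FPS gives a 250 ms minimum interval; guard against a zero divisor.
            min_interval: Duration::from_secs(1) / max_fps.max(1),
            last_draw: None,
            dirty: true, // draw the first frame unconditionally
        }
    }

    /// Call when new data arrives or the user presses a key.
    pub fn mark_dirty(&mut self) {
        self.dirty = true;
    }

    /// Returns true when a redraw should happen at `now`, and records it.
    pub fn should_draw(&mut self, now: Duration) -> bool {
        if !self.dirty {
            return false;
        }
        let due = match self.last_draw {
            None => true,
            Some(last) => now.saturating_sub(last) >= self.min_interval,
        };
        if due {
            self.last_draw = Some(now);
            self.dirty = false;
        }
        due
    }
}
```

An idle screen then costs no rendering work at all, and a burst of events is coalesced into at most one redraw per interval.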

Security Mistakes

| Mistake | Risk | Prevention |
|---|---|---|
| Not dropping privileges after eBPF program is loaded | Running entire application as root unnecessarily | Load eBPF programs, then drop to the original user (setgid before setuid); only the loader needs elevated capabilities: CAP_BPF plus CAP_PERFMON for kprobes on kernel 5.8+, CAP_SYS_ADMIN on older kernels |
| Logging sensitive connection data to CSV without warning | User inadvertently captures connection metadata to a world-readable file | Default CSV to mode 0600; warn in --help that CSV contains network metadata |
| Shipping pre-compiled eBPF bytecode without signing | Supply chain risk if bytecode is modified | Use include_bytes! to embed bytecode in the binary; sign release binaries |
| Not validating eBPF map data in userspace | Corrupted/malicious map data could cause crashes | Validate all data read from eBPF maps; handle malformed entries gracefully |
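
Validating map data in userspace amounts to treating every record as untrusted bytes. The sketch below decodes one endpoint from a hypothetical 20-byte little-endian layout (2-byte address family, 2-byte port, 16-byte address). The layout, `decode_endpoint`, and `DecodeError` are assumptions for illustration, not the project's wire format.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// Linux address family values.
pub const AF_INET: u16 = 2;
pub const AF_INET6: u16 = 10;
/// Size of the assumed record: family (2) + port (2) + address (16).
pub const ENDPOINT_LEN: usize = 20;

#[derive(Debug, PartialEq, Eq)]
pub enum DecodeError {
    WrongLength(usize),
    UnknownFamily(u16),
}

/// Decode one endpoint, rejecting malformed input instead of panicking.
pub fn decode_endpoint(raw: &[u8]) -> Result<(IpAddr, u16), DecodeError> {
    if raw.len() != ENDPOINT_LEN {
        return Err(DecodeError::WrongLength(raw.len()));
    }
    let family = u16::from_le_bytes([raw[0], raw[1]]);
    let port = u16::from_le_bytes([raw[2], raw[3]]);
    let mut addr = [0u8; 16];
    addr.copy_from_slice(&raw[4..ENDPOINT_LEN]);
    match family {
        // IPv4 addresses occupy the first four address bytes in this layout.
        AF_INET => Ok((IpAddr::V4(Ipv4Addr::new(addr[0], addr[1], addr[2], addr[3])), port)),
        AF_INET6 => Ok((IpAddr::V6(Ipv6Addr::from(addr)), port)),
        other => Err(DecodeError::UnknownFamily(other)),
    }
}
```

A caller would count and skip records that fail to decode, so a single malformed entry cannot take down the TUI.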

UX Pitfalls

| Pitfall | User Impact | Better Approach |
|---|---|---|
| Cryptic error when not running as root | User sees "EPERM" or "operation not permitted" | Detect missing capabilities at startup; print: "tcptop requires root privileges. Run with: sudo tcptop" |
| No indication of kernel incompatibility | Program silently shows no data | Check kernel version and BTF availability at startup; print specific guidance |
| Showing raw IP addresses only | Hard to identify connections at a glance | Async DNS resolution with IP shown immediately and hostname replacing it when resolved |
| Overwhelming amount of data with no default filter | Too many connections to read | Default to showing top 50 connections by bandwidth; allow expand with keybinding |
| No visual indication of sort column | User doesn't know how data is ordered | Highlight active sort column header; show sort direction arrow |
| Screen flicker on resize | Jarring visual experience | Handle SIGWINCH; debounce resize events; clear and redraw once |
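
For the first row, the message can be produced by mapping the load failure to text before it reaches the user. The sketch below assumes the failure surfaces as, or can be converted to, a `std::io::Error`. An eBPF loader library will typically have its own error type, so `describe_load_error` is a hypothetical adapter rather than a description of any library's API.

```rust
use std::io;

/// Turn a low-level load failure into a message a user can act on.
pub fn describe_load_error(err: &io::Error) -> String {
    match err.kind() {
        // EPERM and EACCES both map to PermissionDenied.
        io::ErrorKind::PermissionDenied => {
            "tcptop requires root privileges. Run with: sudo tcptop".to_string()
        }
        _ => format!("failed to load eBPF programs: {}", err),
    }
}
```

The caller would print this message and exit with a non-zero status instead of letting a backtrace reach the terminal.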

"Looks Done But Isn't" Checklist

  • eBPF loading: Often missing graceful fallback when BTF is unavailable -- verify behavior on kernel without /sys/kernel/btf/vmlinux
  • Connection tracking: Often missing cleanup of terminated connections -- verify connection count returns to 0 after all connections close
  • IPv6 support: Often hardcoded to IPv4 only -- verify tcp_v6_connect and IPv6 address display works
  • Short-lived connections: Often missed entirely -- verify a curl request appears and disappears in the TUI
  • Process name resolution: Often shows PID only or empty for kernel threads -- verify graceful handling of PID 0 / kernel threads
  • Terminal restore: Often broken on Ctrl+C or panic -- verify terminal is usable after abnormal exit
  • Privilege error message: Often just prints a Rust backtrace -- verify clean error message when run without sudo
  • macOS backend: Often "planned" but never built -- verify basic functionality on macOS even if lower fidelity
  • UDP flow expiry: Often accumulates forever -- verify UDP entries disappear after timeout period

Recovery Strategies

| Pitfall | Recovery Cost | Recovery Steps |
|---|---|---|
| No platform abstraction (Pitfall 1) | HIGH | Introduce trait; refactor all eBPF calls behind it; significant rewrite of data pipeline |
| Verifier rejection (Pitfall 2) | MEDIUM | Simplify eBPF program; split into smaller programs; move logic to userspace |
| Toolchain breakage (Pitfall 3) | LOW | Pin toolchain to last known working nightly; update bpf-linker to match |
| Kernel version incompatibility (Pitfall 4) | MEDIUM | Add runtime feature detection; implement fallback paths for older kernels |
| Missed events (Pitfall 5) | MEDIUM | Switch to ring_buffer; add dedicated event thread; implement reconciliation |
| UDP tracking confusion (Pitfall 6) | MEDIUM | Separate UDP into distinct flow-tracking subsystem with timeout |

Pitfall-to-Phase Mapping

| Pitfall | Prevention Phase | Verification |
|---|---|---|
| No platform abstraction | Phase 1 (Foundation) | DataSource trait exists; eBPF is one implementation behind it |
| Verifier rejection | Phase 2 (eBPF) | All eBPF programs load on minimum target kernel version in CI |
| Toolchain fragility | Phase 1 (Foundation) | rust-toolchain.toml pinned; CI builds green; CONTRIBUTING.md documents setup |
| Kernel version compat | Phase 2 (eBPF) + Phase 4 (Distribution) | Tested on kernel 5.4 and latest; startup prints clear error on unsupported kernel |
| Missed events | Phase 2 (Data pipeline) + Phase 3 (TUI) | Load test with 10k connections; dropped event counter stays at 0 |
| UDP flow tracking | Phase 2 (eBPF) | UDP flows appear with activity and expire after timeout |
| TUI performance | Phase 3 (TUI) | CPU usage < 5% at idle with 500 connections displayed |
| Privilege handling | Phase 1 (Foundation) | Running without sudo prints helpful error; running with sudo drops privileges after eBPF load |
| macOS support | Phase 1 (trait) + Phase 4 (macOS backend) | cargo build succeeds on macOS; basic connection listing works |


Pitfalls research for: Rust eBPF network monitoring CLI (tcptop) Researched: 2026-03-21