Zachary D. Rowitsch c765d8e457 docs: complete project research

2026-03-21 18:08:55 -04:00

Pitfalls Research

Domain: Rust eBPF network monitoring CLI (tcptop)
Researched: 2026-03-21
Confidence: HIGH (well-documented domain with many real-world projects to learn from)

Critical Pitfalls

Pitfall 1: macOS Has No eBPF -- Platform Abstraction Must Be Day-One Architecture

What goes wrong: Developers start building the eBPF data collection layer for Linux, get it working, then discover macOS support requires a completely different backend (DTrace, nstat, Network Extension Framework, or libpcap). The resulting code is tightly coupled to eBPF concepts, and retrofitting a platform abstraction layer requires rewriting most of the data pipeline.

Why it happens: eBPF is Linux-only. macOS uses DTrace for kernel tracing, but DTrace is severely limited by System Integrity Protection (SIP) -- even with sudo, SIP prevents tracing system binaries and most kernel probes. Apple's Network Extension framework requires Objective-C/Swift and entitlements. There is no clean equivalent to Linux eBPF on macOS.

How to avoid: Define a DataSource trait from the very first line of code. The trait returns platform-agnostic connection records (source/dest IP, port, bytes, packets, state, PID). Linux implements it with eBPF via Aya. macOS implements it with a fallback: likely nettop/nstat parsing, libpcap, or lsof+netstat polling. Accept that macOS will have lower fidelity data -- this is a known tradeoff. Do NOT attempt to build a DTrace backend that requires users to disable SIP.
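The trait boundary described above could look roughly like the following. This is a minimal sketch, not the project's actual API: the names `ConnectionRecord`, `Protocol`, `SourceError`, `StaticSource`, and `total_bytes` are hypothetical, and the record omits the connection state field for brevity. The point is that nothing in these types mentions eBPF maps or file descriptors, so a Linux backend and a lower-fidelity macOS backend can both sit behind `DataSource`.

```rust
use std::net::IpAddr;

/// Transport protocol of a tracked connection or flow.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Protocol {
    Tcp,
    Udp,
}

/// Platform-agnostic record (hypothetical layout).
/// No eBPF map handles or file descriptors appear here.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ConnectionRecord {
    pub protocol: Protocol,
    pub local_addr: IpAddr,
    pub local_port: u16,
    pub remote_addr: IpAddr,
    pub remote_port: u16,
    pub bytes_sent: u64,
    pub bytes_received: u64,
    pub packets: u64,
    /// `None` when the backend cannot attribute the connection to a process.
    pub pid: Option<u32>,
}

/// Error type shared by all backends (hypothetical; a real one would carry more detail).
#[derive(Debug)]
pub struct SourceError(pub String);

/// The boundary between collection backends and the rest of the application.
pub trait DataSource {
    /// Human-readable backend name, e.g. for a status bar.
    fn name(&self) -> &'static str;
    /// Return the current snapshot of known connections.
    fn poll(&mut self) -> Result<Vec<ConnectionRecord>, SourceError>;
}

/// Stand-in backend for tests and for developing the TUI without root.
pub struct StaticSource {
    pub records: Vec<ConnectionRecord>,
}

impl DataSource for StaticSource {
    fn name(&self) -> &'static str {
        "static"
    }

    fn poll(&mut self) -> Result<Vec<ConnectionRecord>, SourceError> {
        Ok(self.records.clone())
    }
}

/// Core logic depends only on the trait, never on a concrete backend.
pub fn total_bytes(source: &mut dyn DataSource) -> Result<u64, SourceError> {
    let records = source.poll()?;
    Ok(records.iter().map(|r| r.bytes_sent + r.bytes_received).sum())
}
```

A fixed-data backend such as `StaticSource` also gives the TUI something to render on machines where no privileged backend is available.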

Warning signs:

eBPF-specific types (map handles, program file descriptors) leaking into core data structures
No cfg(target_os) gates in the first prototype
"We'll add macOS later" appearing in planning discussions

Phase to address: Phase 1 (Architecture/Foundation). The trait boundary must exist before any eBPF code is written.

Pitfall 2: eBPF Verifier Rejection -- Programs That Compile Fine But the Kernel Refuses to Load

What goes wrong: The eBPF program compiles successfully with rustc/bpf-linker but the kernel verifier rejects it at load time. Common rejections: "program too complex" (exceeding instruction complexity limit), unbounded loops, stack overflow (512-byte limit), or accessing memory the verifier cannot prove is safe. This is the single most frustrating part of eBPF development.

Why it happens: The eBPF verifier simulates every possible execution path. A loop bounded by a 16-bit variable means the verifier checks up to 65,535 iterations. Branches inside loops multiply complexity exponentially. Rust's safe abstractions (Option, Result, match) generate more branches than equivalent C, hitting complexity limits faster. The 512-byte stack limit (256 bytes per program when tail calls are combined with BPF-to-BPF calls) means you cannot have large local structs.

How to avoid:

Keep eBPF programs minimal: extract only the data needed (src/dst IP, port, bytes, PID) and push everything else to userspace
Use bpf_loop() helper (kernel 5.17+) for any iteration instead of for-loops
Avoid deep nesting; flatten conditionals
Use eBPF maps (HashMap, PerCpuHashMap) to store state instead of stack variables
Test verifier acceptance on the oldest target kernel version, not just your dev machine
Write small, focused programs (one per hook point) rather than monolithic programs
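One cheap guard for the "extract only the data needed" rule is a compile-time size check on the event struct shared between the eBPF program and userspace. The sketch below uses a hypothetical `ConnEvent` layout and an illustrative 64-byte budget (`MAX_EVENT_BYTES`); neither comes from the project. It is written against `std` so it runs anywhere; inside a `no_std` eBPF crate the same check would use `core::mem::size_of`.

```rust
use std::mem::size_of;

/// Hypothetical event layout shared by the eBPF program and userspace.
/// `#[repr(C)]` fixes field order and padding on both sides.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ConnEvent {
    pub saddr: [u8; 16], // IPv4 addresses can be stored as IPv4-mapped IPv6
    pub daddr: [u8; 16],
    pub bytes: u64,
    pub pid: u32,
    pub sport: u16,
    pub dport: u16,
}

/// Budget chosen for illustration; well under the 512-byte eBPF stack limit.
pub const MAX_EVENT_BYTES: usize = 64;

// Fails the build if someone grows the struct past the budget.
const _: () = assert!(size_of::<ConnEvent>() <= MAX_EVENT_BYTES);
```

Ordering fields from largest to smallest alignment, as here, avoids interior padding, so the struct occupies exactly the sum of its field sizes.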

Warning signs:

eBPF program source file exceeding 200 lines
Using Rust iterators or complex pattern matching in eBPF code
"Works on kernel 6.x but fails on 5.15"
Stack variables larger than ~100 bytes

Phase to address: Phase 2 (eBPF implementation). Must be understood before writing the first kprobe/tracepoint.

Pitfall 3: Nightly Rust Toolchain and bpf-linker Fragility

What goes wrong: Aya requires Rust nightly for eBPF compilation (bpfel-unknown-none target). The bpf-linker requires a specific LLVM version matching the nightly toolchain. A routine rustup update breaks the build because bpf-linker is pinned to a specific LLVM version that no longer matches the new nightly. CI breaks. Contributors cannot build. Days are lost debugging toolchain issues.

Why it happens: eBPF compilation uses unstable Rust features (build-std=core, BPF target). bpf-linker links against LLVM and must match the LLVM version used by rustc. When rustc bumps its LLVM (e.g., from 20 to 21), bpf-linker may not have a matching release yet.

How to avoid:

Pin the exact nightly date in rust-toolchain.toml (e.g., channel = "nightly-2025-11-28")
Pin bpf-linker version in CI and document the exact install command
Use a workspace with separate xtask build orchestration (the Aya template provides this pattern)
Test nightly updates in a branch before merging to main
Document the full toolchain setup in CONTRIBUTING.md
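The pinning rule can also be enforced mechanically, for example from an xtask step that reads the `channel` value out of rust-toolchain.toml. The helper below is a hypothetical sketch of just the check itself (`is_pinned_nightly` is not part of any existing tool); it validates the shape `nightly-YYYY-MM-DD` and nothing more.

```rust
/// Returns true for channels of the form `nightly-YYYY-MM-DD`.
/// Only the shape is checked; calendar validity is not.
pub fn is_pinned_nightly(channel: &str) -> bool {
    let Some(date) = channel.strip_prefix("nightly-") else {
        return false;
    };
    let parts: Vec<&str> = date.split('-').collect();
    parts.len() == 3
        && parts[0].len() == 4
        && parts[1].len() == 2
        && parts[2].len() == 2
        && parts.iter().all(|p| p.bytes().all(|b| b.is_ascii_digit()))
}
```

A CI job could fail the build when this returns false, which catches a floating `nightly` channel before it breaks contributors.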

Warning signs:

No rust-toolchain.toml in the repository
cargo install bpf-linker without version pinning
CI using nightly instead of nightly-YYYY-MM-DD
"Works on my machine" but fails for other contributors

Phase to address: Phase 1 (Project setup). Pin toolchain before writing any code.

Pitfall 4: Kernel Version Compatibility -- What Works on Your Dev Machine Fails in Production

What goes wrong: eBPF features vary dramatically across kernel versions. Your program uses kfuncs, ring_buffer, or BTF features available on kernel 6.x but your users run Ubuntu 20.04 (kernel 5.4) or RHEL 8 (kernel 4.18). The program either fails to load or silently produces no data.

Why it happens: Key feature availability by kernel version:

4.18: Basic kprobes, tracepoints, perf_event_array
5.4: Kernel BTF exposed at /sys/kernel/btf/vmlinux (only if the distribution builds with CONFIG_DEBUG_INFO_BTF)
5.8: Ring buffer (BPF_MAP_TYPE_RINGBUF)
5.17: bpf_loop() helper, CO-RE improvements
6.x: Various kfuncs, improved verifier

Most developers work on recent kernels and never test on older ones.

How to avoid:

Decide on a minimum kernel version early (recommend 5.4 for broad compatibility, or 5.8 if ring_buffer is needed)
Use perf_event_array as fallback if ring_buffer is unavailable
Check /sys/kernel/btf/vmlinux at startup; provide a clear error message if BTF is missing
Use kprobes on long-lived kernel functions (tcp_v4_connect, tcp_set_state, tcp_sendmsg) rather than tracepoints that may not exist on older kernels; kprobe targets are not a stable ABI, so confirm each symbol on the minimum supported kernel
Test in VMs with minimum supported kernel version as part of CI
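The startup checks above can be sketched in plain Rust. The function names (`parse_kernel_release`, `choose_transport`, `btf_available`, `startup_check`) and the error wording are illustrative assumptions. The release string would come from `uname -r` or /proc/sys/kernel/osrelease. Version numbers are only a proxy, since distributions backport features, so a real implementation might prefer probing for the feature itself.

```rust
use std::path::Path;

/// Which kernel-to-userspace transport to use.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Transport {
    RingBuffer,
    PerfEventArray,
}

/// Parse a release string such as "5.15.0-91-generic" into (5, 15).
/// Returns `None` if the major or minor component is not numeric.
pub fn parse_kernel_release(release: &str) -> Option<(u32, u32)> {
    let mut parts = release.split('.');
    let major = parts.next()?.parse().ok()?;
    // The minor component may carry a suffix, e.g. "8-rc1".
    let digits: String = parts
        .next()?
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    let minor = digits.parse().ok()?;
    Some((major, minor))
}

/// Ring buffer needs 5.8+; otherwise fall back to perf_event_array.
pub fn choose_transport(version: (u32, u32)) -> Transport {
    if version >= (5, 8) {
        Transport::RingBuffer
    } else {
        Transport::PerfEventArray
    }
}

/// True when the kernel exposes its BTF type information.
pub fn btf_available() -> bool {
    Path::new("/sys/kernel/btf/vmlinux").exists()
}

/// Combine the checks into one result with a user-facing message on failure.
pub fn startup_check(
    release: &str,
    min: (u32, u32),
    has_btf: bool,
) -> Result<Transport, String> {
    let version = parse_kernel_release(release)
        .ok_or_else(|| format!("could not parse kernel release {release:?}"))?;
    if version < min {
        return Err(format!(
            "kernel {}.{} is older than the minimum supported {}.{}",
            version.0, version.1, min.0, min.1
        ));
    }
    if !has_btf {
        return Err("kernel BTF not found at /sys/kernel/btf/vmlinux".to_string());
    }
    Ok(choose_transport(version))
}
```

Passing `has_btf` in as a parameter, rather than calling `btf_available()` inside, keeps the decision logic testable on machines that are not running Linux.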

Warning signs:

No minimum kernel version documented
Using features without checking kernel availability
Only testing on your development kernel
No runtime feature detection at program startup

Phase to address: Phase 2 (eBPF implementation) for initial decisions; Phase 4 (Packaging/Distribution) for runtime detection and error messages.

Pitfall 5: Missed Events and Data Gaps Under High Connection Volume

What goes wrong: Under heavy network load (thousands of connections/second), the eBPF-to-userspace data pipeline drops events silently. The TUI shows stale data, ghost connections that never disappear, or missing connections. Users lose trust in the tool's accuracy.

Why it happens: Multiple failure modes:

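Perf buffer overflow: Per-CPU buffers fill up; the kernel drops events and reports only a lost-sample count, which is easy to ignore
Ring buffer overflow: Single shared buffer fills; new events are dropped
kretprobe maxactive limit: Each kretprobe handles a bounded number of concurrent in-flight calls (the default scales with CPU count, and the kernel caps the configurable value at 4,096); returns beyond the limit are skipped and only counted in nmissed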
Race conditions: Connection closes between eBPF probe firing and userspace processing; "ghost" entries persist
Userspace processing lag: TUI rendering blocks event consumption; events queue up and overflow

How to avoid:

Use ring_buffer over perf_event_array (5x less overhead on multi-core systems; 7% vs 35%)
Process events in a dedicated thread, separate from TUI rendering
Implement periodic reconciliation: sweep connection map and remove entries for connections no longer in kernel state
Monitor and expose a "dropped events" counter so users know when data is incomplete
Size ring buffer appropriately (start with 256 KiB, make configurable; the kernel requires a power-of-two multiple of the page size)
Use per-CPU hash maps for aggregation in kernel space; send summaries to userspace, not per-packet events
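The thread separation and the "dropped events" counter can be illustrated with the standard library alone. The sketch below models only the handoff between a dedicated event-reader thread and the TUI thread; the names `EventSender`, `DropCounter`, and `event_queue` are hypothetical. Drops inside the kernel ring buffer would need their own counter on the eBPF side.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};
use std::sync::Arc;

/// Producer side of a bounded queue that counts what it cannot deliver.
pub struct EventSender<T> {
    tx: SyncSender<T>,
    dropped: Arc<AtomicU64>,
}

impl<T> EventSender<T> {
    /// Never blocks: if the consumer is behind (or gone), the event is
    /// discarded and the drop counter is incremented.
    pub fn offer(&self, event: T) {
        match self.tx.try_send(event) {
            Ok(()) => {}
            Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => {
                self.dropped.fetch_add(1, Ordering::Relaxed);
            }
        }
    }
}

/// Cloneable read-only view of the drop counter, e.g. for a status line.
#[derive(Clone)]
pub struct DropCounter(Arc<AtomicU64>);

impl DropCounter {
    pub fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// Create a bounded queue holding at most `capacity` undelivered events.
pub fn event_queue<T>(capacity: usize) -> (EventSender<T>, Receiver<T>, DropCounter) {
    let (tx, rx) = sync_channel(capacity);
    let dropped = Arc::new(AtomicU64::new(0));
    let sender = EventSender {
        tx,
        dropped: Arc::clone(&dropped),
    };
    (sender, rx, DropCounter(dropped))
}
```

In use, the reader thread would own the `EventSender` and call `offer` for each event it pulls from the kernel, while the TUI thread drains the `Receiver` with `try_iter()` once per frame and displays `DropCounter::get()`. Because `offer` never blocks, slow rendering cannot stall event consumption.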

Warning signs:

Connection count in TUI never decreases
CPU spikes correlate with network activity
No "events dropped" metric anywhere in the codebase
Single-threaded event loop handling both eBPF events and TUI rendering

Phase to address: Phase 2 (eBPF data pipeline) for architecture; Phase 3 (TUI) for thread separation.

Pitfall 6: UDP "Connection" Tracking Is a Fundamentally Different Problem

What goes wrong: Developers model UDP tracking the same way as TCP -- expecting connect/accept/close lifecycle events. But UDP is connectionless. There are no state transitions to hook. The tool either shows nothing for UDP or shows every single datagram as a separate "connection."

Why it happens: TCP has explicit state machine hooks (tcp_v4_connect, tcp_close, tcp_set_state). UDP has none -- it's fire-and-forget. The kernel does track UDP sockets, but there's no equivalent to TCP connection lifecycle.

How to avoid:

Track UDP as "flows" not "connections": aggregate by (src_ip, src_port, dst_ip, dst_port) tuple
Hook udp_sendmsg and udp_recvmsg for byte/packet counting (plus udpv6_sendmsg and udpv6_recvmsg for IPv6)
Implement flow timeout: if no packets seen for N seconds (configurable, default 30s), mark flow as expired
Display UDP flows separately or with a distinct state indicator ("ACTIVE" / "IDLE" / "EXPIRED")
Do NOT promise RTT/latency for UDP -- it's meaningless without application-layer protocol knowledge (the project already notes this in PROJECT.md)
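The flow model above maps onto a small userspace table keyed by the 4-tuple. This is a sketch under assumptions: `FlowKey`, `FlowStats`, and `UdpFlowTable` are hypothetical names, and the caller passes the current time in explicitly so expiry can be tested without sleeping. An "IDLE" state could be derived from `last_seen` with a second, shorter threshold.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

/// Identity of a UDP flow: the (src_ip, src_port, dst_ip, dst_port) tuple.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct FlowKey {
    pub src_ip: IpAddr,
    pub src_port: u16,
    pub dst_ip: IpAddr,
    pub dst_port: u16,
}

/// Running totals for one flow.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FlowStats {
    pub bytes: u64,
    pub packets: u64,
    pub last_seen: Instant,
}

/// UDP flows aggregated by tuple and expired after a period of silence.
pub struct UdpFlowTable {
    flows: HashMap<FlowKey, FlowStats>,
    timeout: Duration,
}

impl UdpFlowTable {
    pub fn new(timeout: Duration) -> Self {
        Self {
            flows: HashMap::new(),
            timeout,
        }
    }

    /// Account for one datagram of `bytes` bytes observed at `now`.
    pub fn record(&mut self, key: FlowKey, bytes: u64, now: Instant) {
        let stats = self.flows.entry(key).or_insert(FlowStats {
            bytes: 0,
            packets: 0,
            last_seen: now,
        });
        stats.bytes += bytes;
        stats.packets += 1;
        stats.last_seen = now;
    }

    /// Remove flows silent for at least the timeout; returns how many were removed.
    pub fn expire(&mut self, now: Instant) -> usize {
        let timeout = self.timeout;
        let before = self.flows.len();
        self.flows
            .retain(|_, s| now.duration_since(s.last_seen) < timeout);
        before - self.flows.len()
    }

    pub fn get(&self, key: &FlowKey) -> Option<&FlowStats> {
        self.flows.get(key)
    }

    pub fn len(&self) -> usize {
        self.flows.len()
    }
}
```

Keeping this table separate from TCP tracking means there is no shared "state" field that has to be faked for UDP.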

Warning signs:

Shared data structures between TCP and UDP tracking with a "state" field that doesn't apply to UDP
No timeout/expiry mechanism for UDP entries
Attempting to estimate UDP RTT without acknowledging it's heuristic at best

Phase to address: Phase 2 (eBPF implementation). Design UDP flow tracking as a separate subsystem from TCP state tracking.

Technical Debt Patterns

Shortcut	Immediate Benefit	Long-term Cost	When Acceptable
Hardcoding eBPF as the only backend	Faster initial development	Cannot support macOS; testing requires root/kernel	Never -- trait abstraction is cheap upfront
Polling /proc/net/tcp instead of eBPF	Works without root, no kernel dependency	Misses short-lived connections, high overhead, no per-packet stats	MVP prototype only, replace before v0.1
Single-threaded event loop	Simpler code, no synchronization	TUI blocks event processing; dropped events under load	Never for production; acceptable for proof-of-concept
Using perf_event_array when ring_buffer is available	Works on kernel 5.4+	5x higher overhead on multi-core; event ordering issues	Only as fallback for kernel < 5.8
Embedding eBPF bytecode at compile time	No runtime compilation needed	Locked to one kernel version's struct layouts without CO-RE	Acceptable if using CO-RE/BTF for portability
Skipping process (PID) resolution	Simpler eBPF programs	Major missing feature; users expect to know which process owns a connection	MVP only; add in same phase as eBPF work

Integration Gotchas

Integration	Common Mistake	Correct Approach
Aya eBPF map access	Using `HashMap` for counters that are updated from multiple CPUs, which causes contention	Use `PerCpuHashMap` for counters; aggregate in userspace
Kernel kprobes	Hooking internal kernel functions that get renamed/removed across versions	Hook long-lived, widely traced functions: `tcp_v4_connect`, `tcp_v6_connect`, `tcp_close`, `tcp_set_state`, `tcp_sendmsg`, `tcp_recvmsg`; kprobe targets are not a stable ABI, so verify each on the minimum supported kernel
Process info from eBPF	Calling `bpf_get_current_pid_tgid()` in network hooks that can run in softirq context, where the current task may be unrelated to the socket	Accept that some events will carry PID 0 or an unrelated PID; record the PID at process-context hooks (connect, sendmsg, recvmsg) keyed by socket, or fall back to /proc. Note that `sk_uid` yields the owning user ID, not a PID
Terminal raw mode	Not restoring terminal on panic/crash	Use `scopeguard` or custom panic handler to restore terminal state before exit
CSV logging	Flushing on every write	Buffer writes, flush on interval or signal; use `BufWriter` with periodic flush
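The CSV row in the table above can be made concrete with `std::io::BufWriter`. The wrapper below is a hypothetical sketch (`CsvLogger` is not an existing type): it buffers rows and flushes only when a configurable interval has elapsed, with the current time passed in by the caller. It performs no CSV escaping, so fields must not contain commas or newlines.

```rust
use std::io::{self, BufWriter, Write};
use std::time::{Duration, Instant};

/// Buffered CSV writer that flushes on an interval rather than per row.
pub struct CsvLogger<W: Write> {
    out: BufWriter<W>,
    flush_every: Duration,
    last_flush: Instant,
}

impl<W: Write> CsvLogger<W> {
    pub fn new(inner: W, flush_every: Duration, now: Instant) -> Self {
        Self {
            out: BufWriter::new(inner),
            flush_every,
            last_flush: now,
        }
    }

    /// Append one row; flush only if the interval has elapsed.
    /// Fields are written verbatim (no quoting or escaping).
    pub fn write_row(&mut self, fields: &[&str], now: Instant) -> io::Result<()> {
        writeln!(self.out, "{}", fields.join(","))?;
        if now.duration_since(self.last_flush) >= self.flush_every {
            self.out.flush()?;
            self.last_flush = now;
        }
        Ok(())
    }

    /// Force buffered rows out, e.g. on shutdown or on a signal.
    pub fn flush(&mut self) -> io::Result<()> {
        self.out.flush()
    }

    /// The underlying writer; shows only data that has already been flushed.
    pub fn inner(&self) -> &W {
        self.out.get_ref()
    }
}
```

`BufWriter` also flushes when dropped, but errors at that point are silently discarded, so calling `flush` explicitly on shutdown is the safer habit.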

Performance Traps

Trap	Symptoms	Prevention	When It Breaks
TUI rendering at uncapped frame rate	50%+ CPU on a single core; fan spin	Cap at 4 FPS for a monitoring tool; use event-driven rendering (only redraw on new data or user input)	Immediately in debug builds; visible in release at 60 FPS
Per-packet events from eBPF to userspace	Ring buffer overflow; massive CPU in userspace parsing	Aggregate in-kernel using PerCpuHashMap; send periodic summaries or only state-change events	Above ~10k packets/second
Sorting the connection table on every frame	Noticeable lag with 1000+ connections	Sort only when sort column changes or on a timer (every 1s); use incremental insert for new entries	Above ~500 connections
String formatting for every connection row	Allocations per frame per row	Pre-allocate row buffers; use `write!` into reusable `String` buffers	Above ~200 connections at 4+ FPS
DNS reverse lookup for every IP	Blocks rendering; DNS timeout stalls TUI	Cache resolved names; resolve asynchronously; display IP immediately, update with name when ready	Any network with DNS latency > 50ms
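The first row's advice, capping the frame rate and redrawing only when something changed, reduces to a small piece of state. The `RedrawGate` type below is a hypothetical sketch, independent of any TUI library; the caller supplies the current time and performs the actual drawing.

```rust
use std::time::{Duration, Instant};

/// Decides whether a frame should be drawn: only when data changed,
/// and never more often than the configured rate.
pub struct RedrawGate {
    min_interval: Duration,
    last_draw: Option<Instant>,
    dirty: bool,
}

impl RedrawGate {
    /// `fps` of 0 is treated as 1 to avoid dividing by zero.
    pub fn with_max_fps(fps: u32) -> Self {
        Self {
            min_interval: Duration::from_secs(1) / fps.max(1),
            last_draw: None,
            dirty: true, // draw the first frame
        }
    }

    /// Call when new data arrives or the user presses a key.
    pub fn mark_dirty(&mut self) {
        self.dirty = true;
    }

    /// Returns true if the caller should render now; resets the dirty flag.
    pub fn should_draw(&mut self, now: Instant) -> bool {
        if !self.dirty {
            return false;
        }
        let due = match self.last_draw {
            None => true,
            Some(t) => now.duration_since(t) >= self.min_interval,
        };
        if due {
            self.last_draw = Some(now);
            self.dirty = false;
        }
        due
    }
}
```

With `RedrawGate::with_max_fps(4)` the minimum gap between frames is 250 ms, and an idle screen costs no rendering at all.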

Security Mistakes

Mistake	Risk	Prevention
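Not dropping privileges after eBPF program is loaded	Running entire application as root unnecessarily	Load eBPF programs, then drop to original user with `setgid` followed by `setuid`; only the loader needs elevated capabilities (CAP_BPF + CAP_PERFMON for kprobe/tracepoint programs on kernel 5.8+, CAP_SYS_ADMIN on older kernels; CAP_NET_ADMIN applies to network-attached program types such as XDP or tc)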
Logging sensitive connection data to CSV without warning	User inadvertently captures connection metadata to a world-readable file	Default CSV to mode 0600; warn in --help that CSV contains network metadata
Shipping pre-compiled eBPF bytecode without signing	Supply chain risk if bytecode is modified	Use `include_bytes!` to embed bytecode in the binary; sign release binaries
Not validating eBPF map data in userspace	Corrupted/malicious map data could cause crashes	Validate all data read from eBPF maps; handle malformed entries gracefully
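Validating map data in userspace mostly means never trusting lengths or enum-like fields in raw bytes. The decoder below is a sketch over a hypothetical 16-byte wire layout (`family`, `port`, `pid`, `bytes`, in native byte order); the layout and the names `RawEvent`, `DecodeError`, and `decode_event` are assumptions for illustration. The address-family constants are the Linux values.

```rust
pub const AF_INET: u16 = 2; // Linux value
pub const AF_INET6: u16 = 10; // Linux value

/// Size of the hypothetical wire layout: u16 + u16 + u32 + u64.
pub const EVENT_LEN: usize = 16;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RawEvent {
    pub family: u16,
    pub port: u16,
    pub pid: u32,
    pub bytes: u64,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum DecodeError {
    WrongLength { expected: usize, actual: usize },
    UnknownFamily(u16),
}

/// Decode one event, rejecting anything that does not match the layout.
pub fn decode_event(buf: &[u8]) -> Result<RawEvent, DecodeError> {
    if buf.len() != EVENT_LEN {
        return Err(DecodeError::WrongLength {
            expected: EVENT_LEN,
            actual: buf.len(),
        });
    }
    // Length is checked above, so the fixed indices below cannot panic.
    let family = u16::from_ne_bytes([buf[0], buf[1]]);
    let port = u16::from_ne_bytes([buf[2], buf[3]]);
    let pid = u32::from_ne_bytes([buf[4], buf[5], buf[6], buf[7]]);
    let mut b8 = [0u8; 8];
    b8.copy_from_slice(&buf[8..16]);
    let bytes = u64::from_ne_bytes(b8);

    if family != AF_INET && family != AF_INET6 {
        return Err(DecodeError::UnknownFamily(family));
    }
    Ok(RawEvent {
        family,
        port,
        pid,
        bytes,
    })
}
```

Returning an error instead of panicking lets the caller count and skip malformed entries, which matches the "handle malformed entries gracefully" guidance in the table.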

UX Pitfalls

Pitfall	User Impact	Better Approach
Cryptic error when not running as root	User sees "EPERM" or "operation not permitted"	Detect missing capabilities at startup; print: "tcptop requires root privileges. Run with: sudo tcptop"
No indication of kernel incompatibility	Program silently shows no data	Check kernel version and BTF availability at startup; print specific guidance
Showing raw IP addresses only	Hard to identify connections at a glance	Async DNS resolution with IP shown immediately and hostname replacing it when resolved
Overwhelming amount of data with no default filter	Too many connections to read	Default to showing top 50 connections by bandwidth; allow expand with keybinding
No visual indication of sort column	User doesn't know how data is ordered	Highlight active sort column header; show sort direction arrow
Screen flicker on resize	Jarring visual experience	Handle SIGWINCH; debounce resize events; clear and redraw once
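The first row of the table amounts to translating low-level errors before they reach the user. As a simplified sketch, the function below assumes the load failure surfaces as a `std::io::Error`; a real Aya-based loader returns its own error types, which would need to be mapped first. The name `describe_load_error` is hypothetical.

```rust
use std::io;

/// Turn a low-level load failure into a message a user can act on.
pub fn describe_load_error(err: &io::Error) -> String {
    match err.kind() {
        io::ErrorKind::PermissionDenied => {
            "tcptop requires root privileges. Run with: sudo tcptop".to_string()
        }
        // Anything else keeps the original detail for bug reports.
        _ => format!("failed to load eBPF programs: {err}"),
    }
}
```

On Unix, both EPERM and EACCES map to `io::ErrorKind::PermissionDenied`, so this one arm covers the common "not running as root" case.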

"Looks Done But Isn't" Checklist

eBPF loading: Often missing graceful fallback when BTF is unavailable -- verify behavior on kernel without /sys/kernel/btf/vmlinux
Connection tracking: Often missing cleanup of terminated connections -- verify connection count returns to 0 after all connections close
IPv6 support: Often hardcoded to IPv4 only -- verify tcp_v6_connect and IPv6 address display works
Short-lived connections: Often missed entirely -- verify a curl request appears and disappears in the TUI
Process name resolution: Often shows PID only or empty for kernel threads -- verify graceful handling of PID 0 / kernel threads
Terminal restore: Often broken on Ctrl+C or panic -- verify terminal is usable after abnormal exit
Privilege error message: Often just prints a Rust backtrace -- verify clean error message when run without sudo
macOS backend: Often "planned" but never built -- verify basic functionality on macOS even if lower fidelity
UDP flow expiry: Often accumulates forever -- verify UDP entries disappear after timeout period
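For the connection-tracking item, the cleanup check is easier to pass if stale entries are swept periodically against an authoritative list of live connections. The helper below is a generic sketch; `reconcile` is a hypothetical name, and where the `live` set comes from (for example a periodic read of /proc/net/tcp) is an assumption left to the caller.

```rust
use std::collections::{HashMap, HashSet};
use std::hash::Hash;

/// Drop tracked entries whose key is no longer in the live set.
/// Returns the number of entries removed.
pub fn reconcile<K, V>(tracked: &mut HashMap<K, V>, live: &HashSet<K>) -> usize
where
    K: Eq + Hash,
{
    let before = tracked.len();
    tracked.retain(|key, _| live.contains(key));
    before - tracked.len()
}
```

Running this on a timer bounds how long a "ghost" connection can linger after a missed close event.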

Recovery Strategies

Pitfall	Recovery Cost	Recovery Steps
No platform abstraction (Pitfall 1)	HIGH	Introduce trait; refactor all eBPF calls behind it; significant rewrite of data pipeline
Verifier rejection (Pitfall 2)	MEDIUM	Simplify eBPF program; split into smaller programs; move logic to userspace
Toolchain breakage (Pitfall 3)	LOW	Pin toolchain to last known working nightly; update bpf-linker to match
Kernel version incompatibility (Pitfall 4)	MEDIUM	Add runtime feature detection; implement fallback paths for older kernels
Missed events (Pitfall 5)	MEDIUM	Switch to ring_buffer; add dedicated event thread; implement reconciliation
UDP tracking confusion (Pitfall 6)	MEDIUM	Separate UDP into distinct flow-tracking subsystem with timeout

Pitfall-to-Phase Mapping

Pitfall	Prevention Phase	Verification
No platform abstraction	Phase 1 (Foundation)	`DataSource` trait exists; eBPF is one implementation behind it
Verifier rejection	Phase 2 (eBPF)	All eBPF programs load on minimum target kernel version in CI
Toolchain fragility	Phase 1 (Foundation)	`rust-toolchain.toml` pinned; CI builds green; CONTRIBUTING.md documents setup
Kernel version compat	Phase 2 (eBPF) + Phase 4 (Distribution)	Tested on kernel 5.4 and latest; startup prints clear error on unsupported kernel
Missed events	Phase 2 (Data pipeline) + Phase 3 (TUI)	Load test with 10k connections; dropped event counter stays at 0
UDP flow tracking	Phase 2 (eBPF)	UDP flows appear with activity and expire after timeout
TUI performance	Phase 3 (TUI)	CPU usage < 5% at idle with 500 connections displayed
Privilege handling	Phase 1 (Foundation)	Running without sudo prints helpful error; running with sudo drops privileges after eBPF load
macOS support	Phase 1 (trait) + Phase 4 (macOS backend)	`cargo build` succeeds on macOS; basic connection listing works

Sources

Pitfalls research for: Rust eBPF network monitoring CLI (tcptop) Researched: 2026-03-21
