babel RTT metric false samples

11 Apr 2024

Hi,

The Babel RTT metric measurements provided by BIRD appear suspect for 
my setup. The metric through a tunnel with a latency of about 5ms is 
shown in Babel as 150+ms.

Can others replicate this issue? (should be easy to check for other 
babel users since RTT measurement is on by default in recent versions)

First I suspected a problem with the tunnel, but I compared bird's babel 
RTT measurement against a long-running ping for the same time period and 
got ~160ms measured by bird's babel implementation, and 4.6ms with a 
28ms maximum latency reported by pings in the same wireguard tunnel. 
Other machines across my network also report similarly inflated RTT 
metrics for all non-wired links.

Debug logs show many RTT samples with approximately correct values 
(4-6ms), then the occasional IHU for which 800-1200ms is calculated instead. 
Calculating the RTT metric by hand using babel packet logs shows that 
the calculations are correct. By correlating two packet dumps (the 
machines have <1ms NTP timekeeping) I can also see that the packets for 
which high RTT is calculated have similar transit times through the 
tunnel as other packets. Hence, I suspect the accuracy of the packet 
timestamps recorded by bird. Is the current packet timestamping system 
giving correct timestamps if the packet arrives while babel is 
processing another event?
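
For reference, the by-hand check described above can be sketched roughly as follows. This is a minimal sketch, not BIRD's code: it assumes the usual Babel RTT scheme (32-bit microsecond timestamps, RTT = time elapsed locally minus the time the neighbour held the packet) and an exponentially smoothed average. The function names and the decay weight of 42/256 are assumptions chosen for illustration.

```python
MASK32 = 0xFFFFFFFF  # assumption: timestamps are 32-bit microsecond counters


def rtt_sample_us(hello_origin, hello_receive, ihu_transmit, ihu_receive):
    """One RTT sample in microseconds.

    hello_origin  - our clock when we sent the Hello
    hello_receive - neighbour's clock when it received that Hello
    ihu_transmit  - neighbour's clock when it sent the IHU back
    ihu_receive   - our clock when we received the IHU (the suspect value)
    """
    local_elapsed = (ihu_receive - hello_origin) & MASK32   # wrap-safe
    remote_hold = (ihu_transmit - hello_receive) & MASK32   # wrap-safe
    return max(local_elapsed - remote_hold, 0)


def smooth(samples, decay=42):
    """Exponential average; each new sample gets weight decay/256 (assumed)."""
    srtt = None
    for sample in samples:
        if srtt is None:
            srtt = float(sample)
        else:
            srtt = (decay * sample + (256 - decay) * srtt) / 256
    return srtt


if __name__ == "__main__":
    # A late ihu_receive timestamp inflates the sample directly.
    print(rtt_sample_us(1_000, 500_000, 502_000, 8_000))
    # A single ~1s outlier among 5ms samples pulls the smoothed value far up.
    print(smooth([5_000] * 20 + [1_000_000]))
```

Under these assumptions, any delay between a packet's arrival and the moment its receive timestamp is recorded is added in full to the sample, and a few such samples are enough to hold the smoothed value well above the real tunnel latency.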

I can provide packet captures for anyone interested in debugging further.

Thanks,
Stephanie.

Stephanie Wilde-Hobbs
