Interesting Arping bug report

2012-10-05, Categories: unix, coding, network, arping

A few months ago I was strolling in the Debian bug tracking system and found a curious bug filed against Arping, a program I maintain.

It said that, unlike Arping 2.09, Arping 2.11 did not update the ARP cache after a successful reply. I thought that was odd, since Arping has no code that touches the ARP cache, neither to read it nor to write it. Surely this behaviour hadn't changed?

I tried to reproduce the behaviour and sure enough, with Arping 2.09 the ARP cache is updated, while with 2.11 it's not.

$ arp -na | grep 192.168.0.123
$ # --- First try Arping 2.11 ---
$ sudo ./arping-2.11 -c 1 192.168.0.123
ARPING 192.168.0.123
60 bytes from 00:22:33:44:55:66 (192.168.0.123): index=0 time=1.188 msec

--- 192.168.0.123 statistics ---
1 packets transmitted, 1 packets received,   0% unanswered (0 extra)
$ arp -na | grep 192.168.0.123
$ # --- Ok, that didn't change the ARP cache. Now try 2.09 ---
$ sudo ./arping-2.09 -c 1 192.168.0.123
ARPING 192.168.0.123
60 bytes from 00:22:33:44:55:66 (192.168.0.123): index=0 time=794.888 usec

--- 192.168.0.123 statistics ---
1 packets transmitted, 1 packets received,   0% unanswered (0 extra)
$ arp -na | grep 192.168.0.123
? (192.168.0.123) at 00:22:33:44:55:66 [ether] on wlan0

How could that be? I suspected that maybe the kernel saw the ARP reply, and snooped it into the ARP table. But I quickly confirmed that the packets going over the wire were the same for 2.09 and 2.11 (as they should be).

So what changed between 2.09 and 2.11?

$ git log --pretty=oneline arping-2.09..arping-2.11 | wc -l
43

Ugh. Before doing a bisection I skimmed through the descriptions. Most were comments, compile fixes and documentation. Only a handful changed functionality. The first two of those didn't look suspicious, so either it was the switch to getifaddrs() or some minor change that shouldn't have mattered.

Between Arping 2.09 and 2.10 I replaced the interface-finding code. The old code was an ugly hack: it ran /sbin/ip route get 1.1.1.1 to get the outgoing interface from the routing table. Since the output of the various "show me the routing table" commands differs between OSs, I had to implement this subprocess (ugly) and parsing (ugly) several times. The new implementation uses getifaddrs() to traverse the interfaces programmatically.
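
To make the getifaddrs() approach concrete, here is a minimal sketch of how an interface could be picked by walking the interface list and checking which IPv4 subnet contains the target. It is not Arping's actual code: the helper names same_network() and find_interface_for() are made up for illustration, and the sketch only considers directly connected IPv4 networks without consulting the routing table.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <ifaddrs.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>

/* True if target is in the same network as addr under mask.
 * All three values are in network byte order. */
static int same_network(uint32_t target, uint32_t addr, uint32_t mask)
{
    return (target & mask) == (addr & mask);
}

/* Hypothetical helper: copy into out the name of the first interface
 * whose IPv4 subnet contains target (a dotted-quad string).
 * Returns 0 on success, -1 if nothing matches or on error. */
static int find_interface_for(const char *target, char *out, size_t outlen)
{
    struct in_addr dst;
    struct ifaddrs *list, *cur;
    int rc = -1;

    assert(target != NULL && out != NULL);
    if (outlen == 0 || inet_pton(AF_INET, target, &dst) != 1)
        return -1;
    if (getifaddrs(&list) != 0)
        return -1;

    for (cur = list; cur != NULL; cur = cur->ifa_next) {
        const struct sockaddr_in *addr, *mask;

        /* Some entries carry no address or netmask; skip them. */
        if (cur->ifa_addr == NULL || cur->ifa_netmask == NULL)
            continue;
        if (cur->ifa_addr->sa_family != AF_INET)
            continue;

        addr = (const struct sockaddr_in *)cur->ifa_addr;
        mask = (const struct sockaddr_in *)cur->ifa_netmask;
        if (same_network(dst.s_addr, addr->sin_addr.s_addr,
                         mask->sin_addr.s_addr)) {
            snprintf(out, outlen, "%s", cur->ifa_name);
            rc = 0;
            break;
        }
    }
    freeifaddrs(list);
    return rc;
}
```

The point for this story is that nothing here asks the kernel to make a routing decision for the target address. The code only reads the list of configured interfaces.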

The old code was still there as a fallback. It would never actually get used unless there's a Linux system out there that doesn't have getifaddrs(), which seems to have been added in glibc 2.3, back in 2002. Anyway, it was trivial to temporarily switch interface selection back to the old method, and I confirmed that this was indeed what caused the change of behaviour.
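
For comparison, the old subprocess-and-parse method could look roughly like the sketch below. Again, this is an illustration and not the code shipped in Arping: parse_dev() and route_get_interface() are hypothetical names, and the parser assumes the Linux iproute2 output format, where the interface name follows the word "dev".

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Find the token "dev" in one line of `ip route get` output and copy
 * the token after it into out. Returns 0 on success, -1 otherwise. */
static int parse_dev(const char *line, char *out, size_t outlen)
{
    char copy[512];
    char *save = NULL;
    char *tok;

    assert(line != NULL && out != NULL);
    if (outlen == 0)
        return -1;
    snprintf(copy, sizeof(copy), "%s", line);

    for (tok = strtok_r(copy, " \t\n", &save); tok != NULL;
         tok = strtok_r(NULL, " \t\n", &save)) {
        if (strcmp(tok, "dev") == 0) {
            tok = strtok_r(NULL, " \t\n", &save);
            if (tok == NULL)
                return -1;
            snprintf(out, outlen, "%s", tok);
            return 0;
        }
    }
    return -1;
}

/* Run `/sbin/ip route get <target>` and parse the interface name.
 * Returns 0 on success, -1 on any failure. */
static int route_get_interface(const char *target, char *out, size_t outlen)
{
    char cmd[128];
    char line[512];
    FILE *fp;
    int rc = -1;

    /* Accept only dotted-quad characters so that nothing unexpected
     * is passed to the shell that popen() starts. */
    if (target[0] == '\0' || strlen(target) > 15 ||
        strspn(target, "0123456789.") != strlen(target))
        return -1;

    snprintf(cmd, sizeof(cmd), "/sbin/ip route get %s 2>/dev/null", target);
    fp = popen(cmd, "r");
    if (fp == NULL)
        return -1;
    while (rc != 0 && fgets(line, sizeof(line), fp) != NULL)
        rc = parse_dev(line, out, outlen);
    pclose(fp);
    return rc;
}
```

Unlike the getifaddrs() walk, this variant makes the kernel perform a route lookup for the exact target address every time it runs.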

Surely ip route get doesn't send an ARP request and populate the ARP cache when it gets the reply? No. So if ip route get 1.1.1.1 doesn't do it, and arping-2.11 1.1.1.1 doesn't do it, then surely ip route get 1.1.1.1 ; arping-2.11 1.1.1.1 doesn't do it?

Yes, yes it does. It seems ip route get 1.1.1.1 followed by arping-2.11 1.1.1.1 will cause 1.1.1.1 to show up in the ARP cache. And it doesn't matter if ip route get is run as an ordinary user or as root! (arping of course has to run as root or have the CAP_NET_RAW capability.) Only the exact address given to ip route get becomes "open to be filled" by the second command, so the effect seems to be per address, and ip route get apparently modifies state in the kernel.

$ arp -na | grep 192.168.0.123
$ sudo ./arping-2.11 -i wlan0 -q -c 1 192.168.0.123
$ arp -na | grep 192.168.0.123
$ # --- Ok, still no entry in the ARP cache. Now try running both commands ---
$ ip route get 192.168.0.123 ; sudo ./arping-2.11 -i wlan0 -q -c 1 192.168.0.123
192.168.0.123 dev wlan0  src 192.168.0.100 
    cache  mtu 1500 advmss 1460 hoplimit 64
$ arp -na | grep 192.168.0.123
? (192.168.0.123) at 00:22:33:44:55:66 [ether] on wlan0
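
The arp -na checks in these transcripts can also be done from a program. On Linux the ARP cache is exposed as text in /proc/net/arp, one entry per line after a header, with the columns IP address, HW type, Flags, HW address, Mask and Device. The sketch below parses that format. The struct and function names are made up for illustration, and treating the flag bit 0x2 (ATF_COM) as "entry is complete" is the only interpretation it applies.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define ARP_FLAG_COMPLETE 0x2 /* same value as ATF_COM in <net/if_arp.h> */

struct arp_entry {
    char ip[64];
    char mac[64];
    char dev[64];
    unsigned flags;
};

/* Parse one data line of /proc/net/arp. Returns 0 on success and -1
 * for anything that does not match, including the header line. */
static int parse_arp_line(const char *line, struct arp_entry *e)
{
    unsigned hwtype;
    int n;

    assert(line != NULL && e != NULL);
    n = sscanf(line, "%63s %x %x %63s %*s %63s",
               e->ip, &hwtype, &e->flags, e->mac, e->dev);
    return n == 5 ? 0 : -1;
}

/* Returns 1 if ip has a complete entry in the table at path, 0 if it
 * does not, and -1 if the file cannot be opened. */
static int arp_cache_has(const char *path, const char *ip,
                         struct arp_entry *found)
{
    char line[256];
    FILE *fp = fopen(path, "r");
    int rc = 0;

    if (fp == NULL)
        return -1;
    while (fgets(line, sizeof(line), fp) != NULL) {
        struct arp_entry e;

        if (parse_arp_line(line, &e) == 0 && strcmp(e.ip, ip) == 0 &&
            (e.flags & ARP_FLAG_COMPLETE)) {
            *found = e;
            rc = 1;
            break;
        }
    }
    fclose(fp);
    return rc;
}
```

Calling arp_cache_has("/proc/net/arp", "192.168.0.123", &entry) before and after each command would give the same before/after comparison as the grep lines above.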

I closed the bug since it's working as intended.

I have not dived into the kernel source to find the reason for this, but I may come back and update this post if and when I do.
