Re: [PATCH 0/5] IP checksum improvements

26 Apr 2010

      Ondrej Zajicek <santiago@crfreenet.org> wrote on 2010/04/26 10:09:36:
...
On Mon, Apr 26, 2010 at 01:57:29AM +0200, Joakim Tjernlund wrote:
...
...
Ah, now I think I know. The while(buf < end) loop is optimized for
post-increment, so that is why.
I tested a little and was surprised: the while loop is only 3-5% slower
than my for loop, and it is mainly the post-increment that causes that.
On x86 I can hardly see any difference between post and pre inc.
I also got a 5% slowdown on MIPS. If I replaced while(buf < end) with
while(buf != end), I got no slowdown.
while(buf != end) got worse on ppc. gcc 4.3.4 did even worse
than gcc 3.4.6. I think it is safe to say that gcc 4.3.4 is busted when
it comes to optimization, even on x86. I have seen -O1 do better than -O2
on x86 with gcc 3.4.3.

Since gcc in general isn't very good at optimization, I think the best bet
is to have different loops for different archs. I have seen people do that
based on endianness:
#ifdef CPU_BIG_ENDIAN
  for(buf--; len; --len)
    sum = add32(sum, *++buf);
#else
  while(buf != end)
    sum = add32(sum, *buf++);
#endif
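
For anyone who wants to try the two loop shapes side by side, here is a
minimal standalone sketch. It assumes add32 is a 32-bit add with end-around
carry; that definition, and the names sum_preinc, sum_postinc and
check_loops_agree, are illustrative only and not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed helper: 32-bit add with end-around carry (ones' complement). */
static inline uint32_t add32(uint32_t sum, uint32_t x)
{
  uint32_t z = sum + x;
  return z + (z < sum);   /* z < sum means the add wrapped: fold the carry in */
}

/* Pre-increment form: step back one word, then load with *++buf.
 * Caveat: buf-- must not move before the start of the underlying array,
 * so the caller has to guarantee at least one word in front of buf. */
static uint32_t sum_preinc(const uint32_t *buf, size_t len)
{
  uint32_t sum = 0;
  for (buf--; len; --len)
    sum = add32(sum, *++buf);
  return sum;
}

/* Post-increment form, terminating on pointer inequality. */
static uint32_t sum_postinc(const uint32_t *buf, size_t len)
{
  const uint32_t *end = buf + len;
  uint32_t sum = 0;
  while (buf != end)
    sum = add32(sum, *buf++);
  return sum;
}

/* Self-check: both loops must produce the same sum. */
static void check_loops_agree(void)
{
  /* w[0] is padding so that buf-- in sum_preinc stays inside the array. */
  static const uint32_t w[5] = {0, 0xffffffffu, 2u, 0x80000000u, 0x80000001u};
  assert(sum_preinc(w + 1, 4) == sum_postinc(w + 1, 4));
}
```

Both functions return the same value for the same input, so any timing
difference comes only from how the compiler schedules the increment and
the loop test on a given arch.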

Would you consider the asm version of add32 for PowerPC too?
...
...
However, gcc won't inline add32 as it is too big on ppc and that
is a disaster. Could you add inline to add32?
There is an 'inline' in current git.
Sorry, didn't notice that.