[PATCH 0/5] IP checksum improvements

Mon Apr 26 10:25:29 CEST 2010

Ondrej Zajicek <santiago at crfreenet.org> wrote on 2010/04/26 10:09:36:
>
> On Mon, Apr 26, 2010 at 01:57:29AM +0200, Joakim Tjernlund wrote:
> > > Ah, now I think I know. The while(buf < end) is optimized for
> > > post inc so that is why.
> >
> > Tested a little and was surprised: only 3-5% slower with the while loop
> > compared to my for loop; it is mainly the post increment that does that.
> > On x86 I can hardly see any difference between post and pre inc.
>
> I also got a 5% slowdown on MIPS. If I replaced while(buf < end) with
> while(buf != end), I got no slowdown.

while(buf != end) got worse on ppc. gcc 4.3.4 did even worse
than gcc 3.4.6. I think it is safe to say that gcc 4.3.4 is busted when
it comes to optimization, even on x86. I have seen -O1 do better than -O2
for x86 with gcc 3.4.3.

Since gcc in general isn't very good at optimization, I think the best bet
is to have different loops for different archs. I have seen people do that
based on endianness:
#ifdef CPU_BIG_ENDIAN
  for(buf--; len; --len)
    sum = add32(sum, *++buf);
#else
  while(buf != end)
    sum = add32(sum, *buf++);
#endif
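
To make the two loop shapes above easier to compare, here is a minimal,
self-contained sketch. It is not the code from the patch: the body of add32
is only an assumed portable version (32-bit add with end-around carry), and
sum32_preinc, sum32_postinc, sum32 and check_loops_agree are made-up names
for illustration. CPU_BIG_ENDIAN is the macro from the snippet above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed portable add32: one's-complement add, carry folded back in. */
static inline uint32_t add32(uint32_t a, uint32_t b)
{
    uint32_t z = a + b;
    return z + (z < a);   /* z < a means the 32-bit add wrapped */
}

/* Pre-increment form, counting len down to zero.
 * Caveat: buf-- forms a pointer one element before the array, which
 * ISO C does not formally allow; it mirrors the loop in the mail. */
static uint32_t sum32_preinc(const uint32_t *buf, size_t len)
{
    uint32_t sum = 0;
    for (buf--; len; --len)
        sum = add32(sum, *++buf);
    return sum;
}

/* Post-increment form, comparing against an end pointer. */
static uint32_t sum32_postinc(const uint32_t *buf, const uint32_t *end)
{
    uint32_t sum = 0;
    while (buf != end)
        sum = add32(sum, *buf++);
    return sum;
}

/* Per-arch selection, keyed on endianness as in the snippet above. */
static uint32_t sum32(const uint32_t *buf, size_t len)
{
#ifdef CPU_BIG_ENDIAN
    return sum32_preinc(buf, len);
#else
    return sum32_postinc(buf, buf + len);
#endif
}

/* Sanity check: both loop shapes must produce the same sum. */
static void check_loops_agree(const uint32_t *buf, size_t len)
{
    assert(sum32_preinc(buf, len) == sum32_postinc(buf, buf + len));
}
```

Both variants compute the same value, so which one a given arch gets is
purely a question of what gcc generates for it and should be measured
per compiler version.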

Would you consider the asm version of add32 for PowerPC too?

>
> > However, gcc won't inline add32 as it is too big on ppc and that
> > is a disaster. Could you add inline to add32?
>
> There is 'inline' in a current git.

Sorry, didn't notice that.
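
For what it's worth, plain 'inline' is only a hint, so gcc may still decide
the function is too big. A minimal sketch of forcing it, assuming GCC's
always_inline attribute is acceptable in the tree (FORCE_INLINE and
add32_forced are made-up names, and the function body is an assumed
portable version, not the one in git):

```c
#include <stdint.h>

/* Hypothetical helper macro: force inlining on GCC-compatible
 * compilers, fall back to a plain hint elsewhere. */
#if defined(__GNUC__)
#define FORCE_INLINE static inline __attribute__((always_inline))
#else
#define FORCE_INLINE static inline
#endif

/* Assumed portable body: 32-bit add with end-around carry. */
FORCE_INLINE uint32_t add32_forced(uint32_t a, uint32_t b)
{
    uint32_t z = a + b;
    return z + (z < a);   /* add the carry back if the sum wrapped */
}
```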