Ondrej Zajicek <santiago@crfreenet.org> wrote on 2010/04/25 23:20:52:
On Sun, Apr 25, 2010 at 11:41:17AM +0200, Joakim Tjernlund wrote:
Here is a series of performance improvements to the Internet checksum. With these changes applied I get about 20-30% better performance on x86 and PowerPC.
Although I agree with Martin Mares that this kind of optimization should be done mainly when we know (from profiling) that BIRD spends a significant share of its time (during update processing) in that function, I made some changes to the checksum function and merged some of these patches.
I did some more optimizations (changing the loop condition, removing the len decrement), and together with your change to add32 I got a checksum function that is twice as fast (on x86) as the old code. Changing postincrement to preincrement leads to worse results (only 1.4 times faster than the old code), so I kept postincrement.
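For readers without the patch at hand, the loop shape described above can be sketched roughly as follows. This is a minimal illustration, not BIRD's actual code: the body of add32 and the helper names sum32_words and fold16 are assumptions, and the buffer is assumed to be 4-byte aligned with a length that is a multiple of 4.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed shape of add32: ones-complement add that folds the carry
 * out of bit 31 back into bit 0. */
static uint32_t add32(uint32_t sum, uint32_t x)
{
    uint32_t z = sum + x;
    return z + (z < x);          /* z < x exactly when the add wrapped */
}

/* Hypothetical helper: sum whole 32-bit words. The loop compares the
 * pointer against a precomputed end, so no separate len counter has to
 * be decremented; the pointer is advanced with post-increment. */
static uint32_t sum32_words(const uint32_t *buf, size_t len_bytes)
{
    const uint32_t *end = buf + len_bytes / 4;
    uint32_t sum = 0;

    while (buf < end)
        sum = add32(sum, *buf++);
    return sum;
}

/* Hypothetical helper: fold the 32-bit ones-complement sum to 16 bits. */
static uint16_t fold16(uint32_t sum)
{
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);   /* second pass absorbs the carry */
    return (uint16_t) sum;
}
```

The final checksum would be the bitwise complement of the folded value; byte order and odd-length tails are left out of this sketch.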
On x86? That is strange. On x86 that should only lead to one extra add outside the loop, or so I think.
Ah, now I think I know. The while (buf < end) loop is optimized for post-increment, so that is why.
I do think performance will be worse on every other arch, as the above is probably very x86-tuned.
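To make the post- versus pre-increment distinction concrete, here is a small sketch of the two loop forms. It is an illustration only: the function names are hypothetical, and a plain wrapping add stands in for add32 so that only the loop shape differs. The pre-increment form handles the first word outside the loop, which is the "one extra add" mentioned above; on PowerPC this form can map onto the update-form load instructions, which is presumably why it was proposed.

```c
#include <stddef.h>
#include <stdint.h>

/* Post-increment: load through the pointer, then advance it. */
static uint32_t sum_postinc(const uint32_t *buf, size_t n)
{
    const uint32_t *end = buf + n;
    uint32_t sum = 0;

    while (buf < end)
        sum += *buf++;
    return sum;
}

/* Pre-increment: take the first word before the loop, then
 * advance-and-load for the remaining n - 1 words. */
static uint32_t sum_preinc(const uint32_t *buf, size_t n)
{
    uint32_t sum;

    if (n == 0)
        return 0;
    sum = *buf;                  /* the one extra add outside the loop */
    for (n--; n; n--)
        sum += *++buf;
    return sum;
}
```

Both functions return the same result for any input; which one is faster depends on the compiler and the target, as the measurements in this thread suggest.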
I tested a little and was surprised: the while loop is only 3-5% slower than my for loop, and it is mainly the post-increment that accounts for that. On x86 I can hardly see any difference between post- and pre-increment. However, gcc won't inline add32 on ppc because it is too big, and that is a disaster. Could you add inline to add32?

 Jocke
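The request above amounts to a one-word change to the declaration. A minimal sketch, assuming add32 is a ones-complement add with carry fold (its real body in BIRD may differ); the FORCE_INLINE macro is hypothetical and only shows the stronger GCC-specific option in case the plain inline hint is not enough:

```c
#include <stdint.h>

/* Hypothetical macro: plain "static inline" is only a hint, so on GCC
 * the always_inline attribute can be used to insist on inlining. */
#if defined(__GNUC__)
#define FORCE_INLINE static inline __attribute__((always_inline))
#else
#define FORCE_INLINE static inline
#endif

/* Assumed body of add32: add with end-around carry. */
FORCE_INLINE uint32_t add32(uint32_t sum, uint32_t x)
{
    uint32_t z = sum + x;
    return z + (z < x);          /* add 1 back if the 32-bit add wrapped */
}
```

Whether the compiler actually inlines the call can be checked by inspecting the generated assembly (for example with gcc -S) on the target in question.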