Ondrej Zajicek <santiago@crfreenet.org> wrote on 2010/04/26 10:09:36:
On Mon, Apr 26, 2010 at 01:57:29AM +0200, Joakim Tjernlund wrote:
Ah, now I think I know. The while(buf < end) is optimized for post inc so that is why.
I tested a little and was surprised: the while loop is only 3-5% slower than my for loop, and it is mainly the post-increment that causes it. On x86 I can hardly see any difference between post- and pre-increment.
I also got a 5% slowdown on MIPS. If I replaced while(buf < end) with while(buf != end), I got no slowdown.
while(buf != end) got worse on ppc, and gcc 4.3.4 did even worse than gcc 3.4.6. I think it is safe to say that gcc 4.3.4 is busted when it comes to optimization, even on x86; I have seen -O1 do better than -O2 on x86 with gcc 3.4.3. Since gcc in general isn't very good at optimization, I think the best bet is to have different loops for different archs. I have seen people do that based on endianness:

    #ifdef CPU_BIG_ENDIAN
      for (buf--; len; --len)
        sum = add32(sum, *++buf);
    #else
      while (buf != end)
        sum = add32(sum, *buf++);
    #endif

Would you consider the asm version of add32 for PowerPC too?
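To make the two loop shapes easier to compare, here is a rough, self-contained sketch. The body of add32 is only my assumption of an end-around-carry add, and sum32_preinc, sum32_postinc and sum32 are made-up names, not what is in the tree. The pre-increment variant loads the first word up front rather than stepping the pointer back before the buffer, which ISO C does not strictly allow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed add32: 32-bit add that folds the carry out of bit 31 back in. */
static inline uint32_t add32(uint32_t sum, uint32_t x)
{
    uint32_t z = sum + x;
    return z + (z < x);   /* z < x exactly when the addition wrapped */
}

/* Pre-increment form (hypothetical name), the shape reported faster on ppc. */
static uint32_t sum32_preinc(const uint32_t *buf, size_t len)
{
    uint32_t sum;

    if (len == 0)
        return 0;
    sum = *buf;                     /* first word; same as add32(0, *buf) */
    while (--len)
        sum = add32(sum, *++buf);
    return sum;
}

/* Post-increment form (hypothetical name) using the != end test. */
static uint32_t sum32_postinc(const uint32_t *buf, const uint32_t *end)
{
    uint32_t sum = 0;

    /* With != the loop only stops if buf hits end exactly. */
    assert(buf <= end);
    while (buf != end)
        sum = add32(sum, *buf++);
    return sum;
}

/* Pick a loop per arch, keyed on the CPU_BIG_ENDIAN macro from the mail. */
static uint32_t sum32(const uint32_t *buf, size_t len)
{
#ifdef CPU_BIG_ENDIAN
    return sum32_preinc(buf, len);
#else
    return sum32_postinc(buf, buf + len);
#endif
}
```

Both variants should give the same sum for the same input, so which one gets compiled in is purely a question of what the compiler makes of it on a given arch.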
However, gcc won't inline add32 as it is too big on ppc and that is a disaster. Could you add inline to add32?
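Since plain inline is only a hint to the compiler, something along these lines might be needed if gcc still refuses to inline it on ppc. This is just a sketch: FORCE_INLINE is a made-up macro name, and the add32 body is my assumption of an end-around-carry add rather than the actual code. The always_inline attribute itself is a documented GCC function attribute.

```c
#include <stdint.h>

/* FORCE_INLINE is a hypothetical helper macro, not something from the tree. */
#if defined(__GNUC__)
#define FORCE_INLINE static inline __attribute__((always_inline))
#else
#define FORCE_INLINE static inline
#endif

/* Assumed add32: add with the carry out of bit 31 folded back in. */
FORCE_INLINE uint32_t add32(uint32_t sum, uint32_t x)
{
    uint32_t z = sum + x;
    return z + (z < x);   /* z < x exactly when the addition wrapped */
}
```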
There is already an 'inline' in the current git.
Sorry, didn't notice that.