[PATCH 0/5] IP checksum improvements

Joakim Tjernlund

25 Apr 2010 25 Apr '10

9:41 a.m.

Here are a series of performance improvements on the Internet checksum. With these changes applied I get about 20-30% better performance on x86 and PowerPC. Even though we got off on the wrong foot I got curious enough to do some more investigation and I leared more about "add with carry" and how gcc handle them. Feel free to ignore these patches, I just wanted to at least document my findings in patch form. Joakim Tjernlund (5): checksum: improve add32 checksum: Optimize add32() for PowerPC checksum: use pre increment. checksum: optimize loop and get rid of add16() checksum: Optimize first addition. lib/checksum.c | 61 ++++++++++++++++++++++++++++++++----------------------- 1 files changed, 35 insertions(+), 26 deletions(-)

Show replies by date

Joakim Tjernlund

25 Apr 25 Apr

9:41 a.m.

New subject: [PATCH 1/5] checksum: improve add32

Gcc does not recognize z + (z < sum) as an "add with carry" However, x86 recognizes if (z < x) z++ as an "add with carry" operation so lets use that instead. Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se> --- lib/checksum.c | 4 +++- 1 files changed, 3 insertions(+), 1 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index 33cb386..bf70cab 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -26,7 +26,9 @@ static u32 add32(u32 sum, u32 x) { u32 z = sum + x; - return z + (z < sum); + if (z < x) + z++; + return z; } static u16 -- 1.6.4.4

Joakim Tjernlund

9:41 a.m.

New subject: [PATCH 2/5] checksum: Optimize add32() for PowerPC

PowerPC does not recognize add32() as an "add with carry" operation so use inline assembler instead. Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se> --- lib/checksum.c | 12 +++++++++++- 1 files changed, 11 insertions(+), 1 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index bf70cab..cd0fefd 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -21,7 +21,16 @@ add16(u16 sum, u16 x) u16 z = sum + x; return z + (z < sum); } - +#ifdef __powerpc__ +static +u32 +add32(u32 sum, u32 x) +{ + /* add and set carry; add carry */ + asm ("addc %0, %0, %1; addze %0, %0": "+r"(sum): "r" (x): "xer"); + return sum; +} +#else static u32 add32(u32 sum, u32 x) { @@ -30,6 +39,7 @@ add32(u32 sum, u32 x) z++; return z; } +#endif static u16 ipsum_calc_block(u16 *x, unsigned len, u16 sum) -- 1.6.4.4

Joakim Tjernlund

9:41 a.m.

New subject: [PATCH 3/5] checksum: use pre increment.

Some archs(RISC like archs) can do pre increment and load in one insn but gcc optimization often fails to take advantage of that. Help gcc to do the right thing by using pre increment instead of post increment. Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se> --- lib/checksum.c | 8 +++----- 1 files changed, 3 insertions(+), 5 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index cd0fefd..2e427a2 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -72,11 +72,9 @@ ipsum_calc_block(u16 *x, unsigned len, u16 sum) len >>= 1; tmp = 0; xx = (u32 *) x; - while (len) - { - tmp = add32(tmp, *xx++); - len--; - } + for (xx--; len; --len); + tmp = add32(tmp, *++xx); + xx++; sum = add16(sum, add16(tmp & 0xffff, tmp >> 16U)); if (rest) sum = add16(sum, *(u16 *) xx); -- 1.6.4.4

Martin Mares

9:33 p.m.

New subject: [PATCH 3/5] checksum: use pre increment.

...

Some archs(RISC like archs) can do pre increment and load in one insn but gcc optimization often fails to take advantage of that. Help gcc to do the right thing by using pre increment instead of post increment.

This one is a little bit dubious, I would rather not twist the code so much in order to help a specific version of GCC. I guess that this will vary wildly between GCC versions. Have a nice fortnight -- Martin `MJ' Mares <mj@ucw.cz> http://mj.ucw.cz/ Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth "This is an object-oriented system. If we change anything, the users object."

Joakim Tjernlund

9:40 p.m.

New subject: [PATCH 3/5] checksum: use pre increment.

Martin Mares <mj@ucw.cz> wrote on 2010/04/25 23:33:49:

...

...
Some archs(RISC like archs) can do pre increment and load in one insn but gcc optimization often fails to take advantage of that. Help gcc to do the right thing by using pre increment instead of post increment.

This one is a little bit dubious, I would rather not twist the code so much in order to help a specific version of GCC. I guess that this will vary wildly between GCC versions.

Sadly not, I have used 2.95, 3.4.6 and 4.3.4 and PowerPC still doesn't get it right. I agree that gcc SHOULD fix this but in real life it doesn't so I always help gcc when it comes to loops such as these. Jocke

Ondrej Filip

26 Apr 26 Apr

3:40 p.m.

New subject: [PATCH 3/5] checksum: use pre increment.

On 25.4.2010 23:40, Joakim Tjernlund wrote:

...

Martin Mares <mj@ucw.cz> wrote on 2010/04/25 23:33:49:

...
...
Some archs(RISC like archs) can do pre increment and load in one insn but gcc optimization often fails to take advantage of that. Help gcc to do the right thing by using pre increment instead of post increment.

This one is a little bit dubious, I would rather not twist the code so much in order to help a specific version of GCC. I guess that this will vary wildly between GCC versions.

Sadly not, I have used 2.95, 3.4.6 and 4.3.4 and PowerPC still doesn't get it right. I agree that gcc SHOULD fix this but in real life it doesn't so I always help gcc when it comes to loops such as these.

I support MJ's view. I don't think it's a good idea to add asm lines to BIRD's code. Ondrej

...

Jocke

Joakim Tjernlund

10:53 p.m.

New subject: [PATCH 3/5] checksum: use pre increment.

Ondrej Filip <feela@network.cz> wrote on 2010/04/26 17:40:57:

...

On 25.4.2010 23:40, Joakim Tjernlund wrote:

...
Martin Mares <mj@ucw.cz> wrote on 2010/04/25 23:33:49:

...
...
Some archs(RISC like archs) can do pre increment and load in one insn but gcc optimization often fails to take advantage of that. Help gcc to do the right thing by using pre increment instead of post increment.

This one is a little bit dubious, I would rather not twist the code so much in order to help a specific version of GCC. I guess that this will vary wildly between GCC versions.

Sadly not, I have used 2.95, 3.4.6 and 4.3.4 and PowerPC still doesn't get it right. I agree that gcc SHOULD fix this but in real life it doesn't so I always help gcc when it comes to loops such as these.

I support MJ's view. I don't think it's a good idea to add asm lines to BIRD's code.

OK but just as you know, "add with carry" isn't supported in any gcc, not even in gcc git repo. It will be a LONG time until such support is in place and in a compiler near a you. I would not be surprised if x86 is the only arch gcc supports "add with carry".

Ondrej Zajicek

25 Apr 25 Apr

9:20 p.m.

On Sun, Apr 25, 2010 at 11:41:17AM +0200, Joakim Tjernlund wrote:

...

Here are a series of performance improvements on the Internet checksum. With these changes applied I get about 20-30% better performance on x86 and PowerPC.

Although i agree with Martin Mares that such kind of optimizations should be done mainly if we know (from profiling) that BIRD spends a significant share of time (during update processing) in that function, i did some changes to the checksum function and merged some of these patches. I did some more optimizations (changing the loop condition, removing len decrement) and together with your change to add32 i got two times faster checksum function (on x86) than the old code. Changing postincrement to preincrement leads to worse results (only 1.4 times faster than the old code) so i kept postincrement. -- Elen sila lumenn' omentielvo Ondrej 'SanTiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."

Joakim Tjernlund

9:54 p.m.

Ondrej Zajicek <santiago@crfreenet.org> wrote on 2010/04/25 23:20:52:

...

On Sun, Apr 25, 2010 at 11:41:17AM +0200, Joakim Tjernlund wrote:

...
Here are a series of performance improvements on the Internet checksum. With these changes applied I get about 20-30% better performance on x86 and PowerPC.

Although i agree with Martin Mares that such kind of optimizations should be done mainly if we know (from profiling) that BIRD spends a significant share of time (during update processing) in that function, i did some changes to the checksum function and merged some of these patches.

I did some more optimizations (changing the loop condition, removing len decrement) and together with your change to add32 i got two times faster checksum function (on x86) than the old code. Changing postincrement to preincrement leads to worse results (only 1.4 times faster than the old code) so i kept postincrement.

On x86? That is strange. On x86 that should only lead to one extra add outside the loop, or so I think. the while(buf < end) definitely slower on any RISC like CPU. Did you test for (; len; --len) sum = addr32(sum, *buf++); ? Was the other arch's also faster with that? Jocke

Joakim Tjernlund

10:09 p.m.

...

Ondrej Zajicek <santiago@crfreenet.org> wrote on 2010/04/25 23:20:52:

...
On Sun, Apr 25, 2010 at 11:41:17AM +0200, Joakim Tjernlund wrote:

...
Here are a series of performance improvements on the Internet checksum. With these changes applied I get about 20-30% better performance on x86 and PowerPC.

Although i agree with Martin Mares that such kind of optimizations should be done mainly if we know (from profiling) that BIRD spends a significant share of time (during update processing) in that function, i did some changes to the checksum function and merged some of these patches.

I did some more optimizations (changing the loop condition, removing len decrement) and together with your change to add32 i got two times faster checksum function (on x86) than the old code. Changing postincrement to preincrement leads to worse results (only 1.4 times faster than the old code) so i kept postincrement.

On x86? That is strange. On x86 that should only lead to one extra add outside the loop, or so I think.

Ah, now I think I know. The while(buf < end) is optimized for post inc so that is why. I do think performance is worse on every other arch as the above is probably very x86 tuned.

...

the while(buf < end) definitely slower on any RISC like CPU. Did you test for (; len; --len) sum = addr32(sum, *buf++); ? Was the other arch's also faster with that?

Jocke

Joakim Tjernlund

11:57 p.m.

...

...
Ondrej Zajicek <santiago@crfreenet.org> wrote on 2010/04/25 23:20:52:

...
On Sun, Apr 25, 2010 at 11:41:17AM +0200, Joakim Tjernlund wrote:

...
Here are a series of performance improvements on the Internet checksum. With these changes applied I get about 20-30% better performance on x86 and PowerPC.

Although i agree with Martin Mares that such kind of optimizations should be done mainly if we know (from profiling) that BIRD spends a significant share of time (during update processing) in that function, i did some changes to the checksum function and merged some of these patches.

I did some more optimizations (changing the loop condition, removing len decrement) and together with your change to add32 i got two times faster checksum function (on x86) than the old code. Changing postincrement to preincrement leads to worse results (only 1.4 times faster than the old code) so i kept postincrement.

On x86? That is strange. On x86 that should only lead to one extra add outside the loop, or so I think.

Ah, now I think I know. The while(buf < end) is optimized for post inc so that is why.

I do think performance is worse on every other arch as the above is probably very x86 tuned.

tested little and was surprised, only 3-5% slower with the while loop compared to my for loop, it is mainly the post increment that does that. On x86 I can hardly see any difference between post and pre inc. However, gcc won't inline add32 as it is too big on ppc and that is a disaster. Could you add inline to add32? Jocke

Ondrej Zajicek

26 Apr 26 Apr

8:09 a.m.

On Mon, Apr 26, 2010 at 01:57:29AM +0200, Joakim Tjernlund wrote:

...

...
Ah, now I think I know. The while(buf < end) is optimized for post inc so that is why.

tested little and was surprised, only 3-5% slower with the while loop compared to my for loop, it is mainly the post increment that does that. On x86 I can hardly see any difference between post and pre inc.

I also got 5% slowdown on MIPS. If i replaced while(buf < end) with while(buf != end), i got no slowdown.

...

However, gcc won't inline add32 as it is too big on ppc and that is a disaster. Could you add inline to add32?

There is 'inline' in a current git. -- Elen sila lumenn' omentielvo Ondrej 'SanTiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."

Joakim Tjernlund

8:25 a.m.

Ondrej Zajicek <santiago@crfreenet.org> wrote on 2010/04/26 10:09:36:

...

On Mon, Apr 26, 2010 at 01:57:29AM +0200, Joakim Tjernlund wrote:

...
...
Ah, now I think I know. The while(buf < end) is optimized for post inc so that is why.

tested little and was surprised, only 3-5% slower with the while loop compared to my for loop, it is mainly the post increment that does that. On x86 I can hardly see any difference between post and pre inc.

I also got 5% slowdown on MIPS. If i replaced while(buf < end) with while(buf != end), i got no slowdown.

while(buf != end) got worse in ppc. gcc 4.3.4 got even more worse than gcc 3.4.6. I think it is safe to say that gcc 4.3.4 is busted when it comes to optimization, even on x86. Seen -O1 do better than -O2 for x86 with gcc 3.4.3. Since gcc in general isn't very good at optimization I think the best bet is to have different loops for different archs. I seen people do that based on endian: #ifdef CPU_BIG_ENDIAN for(buf--; len, --len) sum = acc32(sum, *++buf); #else while(buf != end) sum = add32(sum, *buf++); #endif Would you consider the asm version of add32 for PowerPC too?

...

...
However, gcc won't inline add32 as it is too big on ppc and that is a disaster. Could you add inline to add32?

There is 'inline' in a current git.

Sorry, didn't notice that.

Martin Mares

8:31 a.m.

Hello!

...

while(buf != end) got worse in ppc. gcc 4.3.4 got even more worse than gcc 3.4.6. I think it is safe to say that gcc 4.3.4 is busted when it comes to optimization, even on x86. Seen -O1 do better than -O2 for x86 with gcc 3.4.3.

BTW have you tried unrolling the loops or using __attribute__((hot))?

...

Since gcc in general isn't very good at optimization I think the best bet is to have different loops for different archs. I seen people do that based on endian: #ifdef CPU_BIG_ENDIAN for(buf--; len, --len) sum = acc32(sum, *++buf); #else while(buf != end) sum = add32(sum, *buf++); #endif

Huh, what should do endianity have in common with the choice of pre-/postincrement? However, I would still like to see a full profile of running BIRD, so that we know the real hot spots. (If it turns out that the checksum function is a hot spot, it would be interesting to write a SSE version.) Have a nice fortnight -- Martin `MJ' Mares <mj@ucw.cz> http://mj.ucw.cz/ Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth Not all rumors are as misleading as this one.

Joakim Tjernlund

8:46 a.m.

Martin Mares <mj@ucw.cz> wrote on 2010/04/26 10:31:31:

...

Hello!

...
while(buf != end) got worse in ppc. gcc 4.3.4 got even more worse than gcc 3.4.6. I think it is safe to say that gcc 4.3.4 is busted when it comes to optimization, even on x86. Seen -O1 do better than -O2 for x86 with gcc 3.4.3.

BTW have you tried unrolling the loops or using __attribute__((hot))?

Not ATM, will probably don't do much on ppc. Its branch prediction makes the loop "free" when do a for(;len;--len) loop.

...

...
Since gcc in general isn't very good at optimization I think the best bet is to have different loops for different archs. I seen people do that based on endian: #ifdef CPU_BIG_ENDIAN for(buf--; len, --len) sum = acc32(sum, *++buf); #else while(buf != end) sum = add32(sum, *buf++); #endif

Huh, what should do endianity have in common with the choice of pre-/postincrement?

Because most archs that can deal with preinc. are big endian, the for loop is important too. Decrement and test for zero is basically free.

Martin Mares

27 Apr 27 Apr

7:47 a.m.

Hello!

...

...
Huh, what should do endianity have in common with the choice of pre-/postincrement?

Because most archs that can deal with preinc. are big endian, the for loop is important too. Decrement and test for zero is basically free.

If GCC generates obviously suboptimal code on some platform, fix GCC or at least file a bug report. Trying to outwit the compiler by re-arranging the source code in various magical ways can work only with specific combinations of compiler + version + platform, so your code will need constant monitoring and updating as the compilers evolve, which is a maintenance disaster. If this function is a real hot spot, let us code it in assembly for the platforms on which GCC does not do a good job. But only if a profile shows that the checksumming functions are a real hot spot. Otherwise let's keep the code simple and straightforward. So before suggesting more changes, please show us the profile. Howgh. Have a nice fortnight -- Martin `MJ' Mares <mj@ucw.cz> http://mj.ucw.cz/ Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth

Joakim Tjernlund

9:38 a.m.

Martin Mares <mj@ucw.cz> wrote on 2010/04/27 09:47:36:

...

Hello!

...
...
Huh, what should do endianity have in common with the choice of pre-/postincrement?

Because most archs that can deal with preinc. are big endian, the for loop is important too. Decrement and test for zero is basically free.

If GCC generates obviously suboptimal code on some platform, fix GCC or at least file a bug report. Trying to outwit the compiler by re-arranging the source code in various magical ways can work only with specific combinations of compiler + version + platform, so your code will need constant monitoring and updating as the compilers evolve, which is a maintenance disaster.

Fixing gcc is simpler said than done. Many have tried but gcc is simply not the compiler to use if you want speed. The current impl. is just what you say you don't want to support. The "while(buf < end)" is tailored to be fast on x86 and harder to maintain than the simpler and more natural "for (; len; --len)" loop. I have spent enough time on this now, if anyone wants to profile please do. Jocke

Martin Mares

25 Apr 25 Apr

9:32 p.m.

Hello!

...

Here are a series of performance improvements on the Internet checksum. With these changes applied I get about 20-30% better performance on x86 and PowerPC.

Even though we got off on the wrong foot I got curious enough to do some more investigation and I leared more about "add with carry" and how gcc handle them. Feel free to ignore these patches, I just wanted to at least document my findings in patch form.

Yes, I like these. Have a nice fortnight -- Martin `MJ' Mares <mj@ucw.cz> http://mj.ucw.cz/ Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth

Otto Solares

26 Apr 26 Apr

4:20 a.m.

On Sun, Apr 25, 2010 at 11:41:17AM +0200, Joakim Tjernlund wrote:

...

Here are a series of performance improvements on the ...

Joakim, although you seems to have a strong character I appreciate your performance tuning on BIRD, everyday less people do this kind of analysis on F/OSS projects so thank you for your efforts. (And someday check the BGP part with the full Internet feed ;) ). - Otto

5935

Age (days ago)

5937

Last active (days ago)

List overview

Download

19 comments

6 participants

participants (6)

Joakim Tjernlund
Joakim Tjernlund
Martin Mares
Ondrej Filip
Ondrej Zajicek
Otto Solares