[PATCH] ipsum_calc_block: Optimize size and speed

Joakim Tjernlund

23 Apr 2010 23 Apr '10

8:02 a.m.

This is a much simpler and efficent impl. of the IP checksum. It is a dry port from Quagga and I have not tested it. --- lib/checksum.c | 44 +++++++++++--------------------------------- 1 files changed, 11 insertions(+), 33 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index 33cb386..b836bdb 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -15,25 +15,10 @@ #include "nest/bird.h" #include "checksum.h" -static u16 /* One-complement addition */ -add16(u16 sum, u16 x) -{ - u16 z = sum + x; - return z + (z < sum); -} - -static u32 -add32(u32 sum, u32 x) -{ - u32 z = sum + x; - return z + (z < sum); -} - static u16 ipsum_calc_block(u16 *x, unsigned len, u16 sum) { - int rest; - u32 tmp, *xx; + u32 tmp; /* * A few simple facts about the IP checksum (see RFC 1071 for detailed @@ -51,23 +36,16 @@ ipsum_calc_block(u16 *x, unsigned len, u16 sum) if (!len) return sum; len >>= 1; - if ((unsigned long) x & 2) /* Align to 32-bit boundary */ - { - sum = add16(sum, *x++); - len--; - } - rest = len & 1; - len >>= 1; - tmp = 0; - xx = (u32 *) x; - while (len) - { - tmp = add32(tmp, *xx++); - len--; - } - sum = add16(sum, add16(tmp & 0xffff, tmp >> 16U)); - if (rest) - sum = add16(sum, *(u16 *) xx); + tmp = sum; + for(x--; len; --len) + tmp += *++x; + /* + * Add back carry outs from top 16 bits to low 16 bits. + */ + tmp = (tmp >> 16) + (tmp & 0xffff); /* add high-16 to low-16 */ + tmp += (tmp >> 16); /* add carry */ + sum = ~tmp; /* ones-complement, then truncate to 16 bits */ + return sum; } -- 1.6.4.4

Show replies by date

Joakim Tjernlund

23 Apr 23 Apr

8:12 a.m.

Joakim Tjernlund <Joakim.Tjernlund@transmode.se> wrote on 2010/04/23 10:02:54:

...

This is a much simpler and efficent impl. of the IP checksum. It is a dry port from Quagga and I have not tested it. --- lib/checksum.c | 44 +++++++++++--------------------------------- 1 files changed, 11 insertions(+), 33 deletions(-)

diff --git a/lib/checksum.c b/lib/checksum.c index 33cb386..b836bdb 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -15,25 +15,10 @@ #include "nest/bird.h" #include "checksum.h"

-static u16 /* One-complement addition */ -add16(u16 sum, u16 x) -{ - u16 z = sum + x; - return z + (z < sum); -} - -static u32 -add32(u32 sum, u32 x) -{ - u32 z = sum + x; - return z + (z < sum); -} - static u16 ipsum_calc_block(u16 *x, unsigned len, u16 sum) { - int rest; - u32 tmp, *xx; + u32 tmp;

/* * A few simple facts about the IP checksum (see RFC 1071 for detailed @@ -51,23 +36,16 @@ ipsum_calc_block(u16 *x, unsigned len, u16 sum) if (!len) return sum; len >>= 1; - if ((unsigned long) x & 2) /* Align to 32-bit boundary */ - { - sum = add16(sum, *x++); - len--; - } - rest = len & 1; - len >>= 1; - tmp = 0; - xx = (u32 *) x; - while (len) - { - tmp = add32(tmp, *xx++); - len--; - } - sum = add16(sum, add16(tmp & 0xffff, tmp >> 16U)); - if (rest) - sum = add16(sum, *(u16 *) xx); + tmp = sum; + for(x--; len; --len) + tmp += *++x; + /* + * Add back carry outs from top 16 bits to low 16 bits. + */ + tmp = (tmp >> 16) + (tmp & 0xffff); /* add high-16 to low-16 */ + tmp += (tmp >> 16); /* add carry */ + sum = ~tmp; /* ones-complement, then truncate to 16 bits */

Possibly the "sum = ~tmp;" line should be removed. Looks like it is performed by the calling routines. Jocke

Ondrej Zajicek

8:42 a.m.

On Fri, Apr 23, 2010 at 10:02:54AM +0200, Joakim Tjernlund wrote:

...

This is a much simpler and efficent impl. of the IP checksum. It is a dry port from Quagga and I have not tested it.

Thanks, i would test that. -- Elen sila lumenn' omentielvo Ondrej 'SanTiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."

Martin Mares

1:03 p.m.

Hello!

...

This is a much simpler and efficent impl. of the IP checksum. It is a dry port from Quagga and I have not tested it.

Are you convinced that your version is more efficient? The original version processes 32 bits at a time, while your code does only 16 bits at a time. It might be worth the saved branch, but do you have any benchmark proving that it helps? Have a nice fortnight -- Martin `MJ' Mares <mj@ucw.cz> http://mj.ucw.cz/ Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth Hex dump: Where witches put used curses...

Joakim Tjernlund

1:14 p.m.

Martin Mares <mj@ucw.cz> wrote on 2010/04/23 15:03:34:

...

Hello!

...
This is a much simpler and efficent impl. of the IP checksum. It is a dry port from Quagga and I have not tested it.

Are you convinced that your version is more efficient? The original version processes 32 bits at a time, while your code does only 16 bits at a time. It might be worth the saved branch, but do you have any benchmark proving that it helps?

Fairly, I once had the same idea for Quagga but found all those extra tests and additions were much slower(I benched it). Just look at the number of extra ops one has to do in the current code. If you want to do it faster you have to go ASM. It would be easy to add support for that too but it can wait. Jocke

Martin Mares

1:20 p.m.

Hello!

...

Fairly, I once had the same idea for Quagga but found all those extra tests and additions were much slower(I benched it). Just look at the number of extra ops one has to do in the current code. If you want to do it faster you have to go ASM. It would be easy to add support for that too but it can wait.

My primary reaction was "If something isn't broken, don't fix it." I.e., unless you have good reasons for rewriting a piece of code, don't do that. Your version is more readable and I would be in favour of accepting it, but I would still like to see at least a very simple benchmark which shows that it is not significantly slower. (Anyway, it might be interesting to run OProfile on Bird to see which parts of the code are real hot spots and where we should focus more on maintainibility than on speed.) Have a nice fortnight -- Martin `MJ' Mares <mj@ucw.cz> http://mj.ucw.cz/ Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth To understand a program you must become both the machine and the program.

Ondrej Zajicek

1:48 p.m.

On Fri, Apr 23, 2010 at 03:20:55PM +0200, Martin Mares wrote:

...

Hello!

...
Fairly, I once had the same idea for Quagga but found all those extra tests and additions were much slower(I benched it). Just look at the number of extra ops one has to do in the current code. If you want to do it faster you have to go ASM. It would be easy to add support for that too but it can wait.

My primary reaction was "If something isn't broken, don't fix it." I.e., unless you have good reasons for rewriting a piece of code, don't do that.

I agree. I would probably look at the Fletcher checksum, as the endianity swapping before and after that is crazy, but would do benchmarking before merging such change. But there are still several problems in OSPF that should be fixed before doing some performance tuning.

...

Your version is more readable and I would be in favour of accepting it, but I would still like to see at least a very simple benchmark which shows that it is not significantly slower.

OTOH, i think that most cases in the original code can be eliminated anyway, because the buffer is always aligned.

...

(Anyway, it might be interesting to run OProfile on Bird to see which parts of the code are real hot spots and where we should focus more on maintainibility than on speed.)

One thing which would probably be the most important on really big OSPF areas is the priority queue in SPF (Dijkstra) calculation. -- Elen sila lumenn' omentielvo Ondrej 'SanTiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."

Joakim Tjernlund

2 p.m.

...

On Fri, Apr 23, 2010 at 03:20:55PM +0200, Martin Mares wrote:

...
Hello!

...
Fairly, I once had the same idea for Quagga but found all those extra tests and additions were much slower(I benched it). Just look at the number of extra ops one has to do in the current code. If you want to do it faster you have to go ASM. It would be easy to add support for that too but it can wait.

My primary reaction was "If something isn't broken, don't fix it." I.e., unless you have good reasons for rewriting a piece of code, don't do that.

I agree. I would probably look at the Fletcher checksum, as the endianity swapping before and after that is crazy, but would do benchmarking before merging such change. But there are still several problems in OSPF that should be fixed before doing some performance tuning.

...
Your version is more readable and I would be in favour of accepting it, but I would still like to see at least a very simple benchmark which shows that it is not significantly slower.

OTOH, i think that most cases in the original code can be eliminated anyway, because the buffer is always aligned.

But you can't get rid of: z + (z < sum) which is the real bottleneck. Perhaps this doesn't cost much on high end CPUs but it sure does on embedded CPUs

...

...
(Anyway, it might be interesting to run OProfile on Bird to see which parts of the code are real hot spots and where we should focus more on maintainibility than on speed.)

One thing which would probably be the most important on really big OSPF areas is the priority queue in SPF (Dijkstra) calculation.

Martin Mares

2:07 p.m.

Hello!

...

But you can't get rid of: z + (z < sum) which is the real bottleneck. Perhaps this doesn't cost much on high end CPUs but it sure does on embedded CPUs

Why should it be? It can be compiled as a sequence of "add with carry" instructions, can't it? Have a nice fortnight -- Martin `MJ' Mares <mj@ucw.cz> http://mj.ucw.cz/ Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth A student who changes the course of history is probably taking an exam.

Joakim Tjernlund

2:14 p.m.

...

Hello!

...
But you can't get rid of: z + (z < sum) which is the real bottleneck. Perhaps this doesn't cost much on high end CPUs but it sure does on embedded CPUs

Why should it be? It can be compiled as a sequence of "add with carry" instructions, can't it?

Yes, but have you seen gcc do that? I havn't, perhaps gcc has become smarter recently? Jocke

Ondrej Zajicek

4:32 p.m.

On Fri, Apr 23, 2010 at 03:20:55PM +0200, Martin Mares wrote:

...

Hello!

...
Fairly, I once had the same idea for Quagga but found all those extra tests and additions were much slower(I benched it). Just look at the number of extra ops one has to do in the current code. If you want to do it faster you have to go ASM. It would be easy to add support for that too but it can wait.

My primary reaction was "If something isn't broken, don't fix it." I.e., unless you have good reasons for rewriting a piece of code, don't do that.

Your version is more readable and I would be in favour of accepting it, but I would still like to see at least a very simple benchmark which shows that it is not significantly slower.

I was curious enough to do some benchmarks and got these results: Intel Atom: suggested code ~ 1.2* faster AMD Geode: no diference MIPS ADM5120: old code ~ 1.2* faster So there isn't really difference in performance of both implementations. Even on slow embedded AMD Geode CPU, it gives ~ 180 MB/s. -- Elen sila lumenn' omentielvo Ondrej 'SanTiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."

Ondrej Filip

4:41 p.m.

On 23.4.2010 18:32, Ondrej Zajicek wrote:

...

On Fri, Apr 23, 2010 at 03:20:55PM +0200, Martin Mares wrote:

...
Hello!

...
Fairly, I once had the same idea for Quagga but found all those extra tests and additions were much slower(I benched it). Just look at the number of extra ops one has to do in the current code. If you want to do it faster you have to go ASM. It would be easy to add support for that too but it can wait.

My primary reaction was "If something isn't broken, don't fix it." I.e., unless you have good reasons for rewriting a piece of code, don't do that.

Your version is more readable and I would be in favour of accepting it, but I would still like to see at least a very simple benchmark which shows that it is not significantly slower.

I was curious enough to do some benchmarks and got these results:

Intel Atom: suggested code ~ 1.2* faster AMD Geode: no diference MIPS ADM5120: old code ~ 1.2* faster

So there isn't really difference in performance of both implementations. Even on slow embedded AMD Geode CPU, it gives ~ 180 MB/s.

Hmm, so there is no major reason for change. I would really support dont-fix-whats-not-broken approach. And also we should keep different code from Quagga to have heterogeneity. Ondrej

Joakim Tjernlund

5:01 p.m.

Ondrej Filip <feela@network.cz> wrote on 2010/04/23 18:41:57:

...

On 23.4.2010 18:32, Ondrej Zajicek wrote:

...
On Fri, Apr 23, 2010 at 03:20:55PM +0200, Martin Mares wrote:

...
Hello!

...
Fairly, I once had the same idea for Quagga but found all those extra tests and additions were much slower(I benched it). Just look at the number of extra ops one has to do in the current code. If you want to do it faster you have to go ASM. It would be easy to add support for that too but it can wait.

My primary reaction was "If something isn't broken, don't fix it." I.e., unless you have good reasons for rewriting a piece of code, don't do that.

Your version is more readable and I would be in favour of accepting it, but I would still like to see at least a very simple benchmark which shows that it is not significantly slower.

I was curious enough to do some benchmarks and got these results:

Intel Atom: suggested code ~ 1.2* faster AMD Geode: no diference MIPS ADM5120: old code ~ 1.2* faster

So there isn't really difference in performance of both implementations. Even on slow embedded AMD Geode CPU, it gives ~ 180 MB/s.

No difference? what does 1.2 mean? to me this means 20% which is a lot

...

Hmm, so there is no major reason for change. I would really support dont-fix-whats-not-broken approach. And also we should keep different

So it is smaller, faster and easier to read and you still want to keep the old code? Very non Open source like I must say.

...

code from Quagga to have heterogeneity.

What kind of argument is that? Sorry, but this all feels like NIH syndrome.

...

Ondrej

Martin Mares

5:23 p.m.

Hello!

...

...
...
So there isn't really difference in performance of both implementations. Even on slow embedded AMD Geode CPU, it gives ~ 180 MB/s.

No difference? what does 1.2 mean? to me this means 20% which is a lot

Yes, but according to Santiago's benchmarks, your code is sometimes 20% faster, sometimes 20% slower. It does not seem like a reason for change. If you have any benchmark showing that BIRD spends a substantial amount of time in this function, let's optimize it properly, even if it means having several versions for different CPU's. Otherwise, let us stick with the current code and concentrate our effort on places which matter. The difference in code size and in readability is really tiny. Have a nice fortnight -- Martin `MJ' Mares <mj@ucw.cz> http://mj.ucw.cz/ Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth Why is "abbreviation" such a long word?

Joakim Tjernlund

5:40 p.m.

Martin Mares <mj@ucw.cz> wrote on 2010/04/23 19:23:18:

...

Hello!

...
...
...
So there isn't really difference in performance of both implementations. Even on slow embedded AMD Geode CPU, it gives ~ 180 MB/s.

No difference? what does 1.2 mean? to me this means 20% which is a lot

Yes, but according to Santiago's benchmarks, your code is sometimes 20% faster, sometimes 20% slower. It does not seem like a reason for change.

uhh, 20% slower? Ahh now I see, the MIPS. That is really strange. Santiago, are you sure that is not a typo?

...

If you have any benchmark showing that BIRD spends a substantial amount of time in this function, let's optimize it properly, even if it means having several versions for different CPU's. Otherwise, let us stick with the current code and concentrate our effort on places which matter.

Peformance matter, especially when the network grows. Is this the way BIRD development works? Only work on stuff that currently feels important is acceptet? Jocke

Martin Mares

5:47 p.m.

Hello!

...

Peformance matter, especially when the network grows. Is this the way BIRD development works? Only work on stuff that currently feels important is acceptet?

Yes, performance matters. This is why performance optimizations in BIRD have to be justified by real numbers, not by hand-waving. Have a nice fortnight -- Martin `MJ' Mares <mj@ucw.cz> http://mj.ucw.cz/ Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth God is real, unless declared integer.

Ondrej Zajicek

6:46 p.m.

On Fri, Apr 23, 2010 at 07:40:28PM +0200, Joakim Tjernlund wrote:

...

...
...
...
...
So there isn't really difference in performance of both implementations. Even on slow embedded AMD Geode CPU, it gives ~ 180 MB/s.

No difference? what does 1.2 mean? to me this means 20% which is a lot

Yes, but according to Santiago's benchmarks, your code is sometimes 20% faster, sometimes 20% slower. It does not seem like a reason for change.

uhh, 20% slower? Ahh now I see, the MIPS. That is really strange. Santiago, are you sure that is not a typo?

Yes i am sure, i checked it several times. -- Elen sila lumenn' omentielvo Ondrej 'SanTiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."

Ondrej Zajicek

7:39 p.m.

On Fri, Apr 23, 2010 at 07:40:28PM +0200, Joakim Tjernlund wrote:

...

Martin Mares <mj@ucw.cz> wrote on 2010/04/23 19:23:18:

...
Hello!

...
...
...
So there isn't really difference in performance of both implementations. Even on slow embedded AMD Geode CPU, it gives ~ 180 MB/s.

No difference? what does 1.2 mean? to me this means 20% which is a lot

Yes, but according to Santiago's benchmarks, your code is sometimes 20% faster, sometimes 20% slower. It does not seem like a reason for change.

uhh, 20% slower? Ahh now I see, the MIPS. That is really strange. Santiago, are you sure that is not a typo?

FYI, code z = sum + x, z + (z < sum) was compiled to: addu $2,$3,$2 sltu $3,$2,$3 addu $3,$2,$3 Therefore, doing half number of iterations outweights in that case. BTW, it was compiled by GCC 3.4.6 -- Elen sila lumenn' omentielvo Ondrej 'SanTiago' Zajicek (email: santiago@crfreenet.org) OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net) "To err is human -- to blame it on a computer is even more so."

Joakim Tjernlund

8:19 p.m.

Ondrej Zajicek <santiago@crfreenet.org> wrote on 2010/04/23 21:39:06:

...

On Fri, Apr 23, 2010 at 07:40:28PM +0200, Joakim Tjernlund wrote:

...
Martin Mares <mj@ucw.cz> wrote on 2010/04/23 19:23:18:

...
Hello!

...
...
...
So there isn't really difference in performance of both implementations. Even on slow embedded AMD Geode CPU, it gives ~ 180 MB/s.

No difference? what does 1.2 mean? to me this means 20% which is a lot

Yes, but according to Santiago's benchmarks, your code is sometimes 20% faster, sometimes 20% slower. It does not seem like a reason for change.

uhh, 20% slower? Ahh now I see, the MIPS. That is really strange. Santiago, are you sure that is not a typo?

FYI, code z = sum + x, z + (z < sum) was compiled to:

addu $2,$3,$2 sltu $3,$2,$3 addu $3,$2,$3

OK, MIPS has always been a strange platform to me. So I had to test myself again: x84 Core 2 duo, 3.1 MHz: New code: 64 byte buffer: 5899 +/-2.3% 128 byte buffer: 5570 +/-3.1% 256 byte buffer: 5797 +/-0.3% 512 byte buffer: 5501 +/-1.1% 1024 byte buffer: 5357 +/-1.5% 2048 byte buffer: 5277 +/-0.6% 4096 byte buffer: 5249 +/-1.2% 8192 byte buffer: 5245 +/-2.1% 16384 byte buffer: 5221 +/-1.6% Old code: 64 byte buffer: 7237 +/-0.4% 128 byte buffer: 6505 +/-1.7% 256 byte buffer: 6075 +/-1.6% 512 byte buffer: 6120 +/-1.6% 1024 byte buffer: 5773 +/-8.2% 2048 byte buffer: 5790 +/-2.0% 4096 byte buffer: 5474 +/-0.7% 8192 byte buffer: 5679 +/-47.1% 16384 byte buffer: 5339 +/-1.3% PowerPC MPC 8321, 266 Mhz New Code: 64 byte buffer: 68349 +/-8.0% 128 byte buffer: 58271 +/-8.7% 256 byte buffer: 52945 +/-8.4% 512 byte buffer: 50535 +/-8.6% 1024 byte buffer: 49288 +/-9.6% 2048 byte buffer: 48984 +/-10.3% 4096 byte buffer: 48345 +/-8.6% 8192 byte buffer: 48127 +/-8.4% Old Code: 64 byte buffer: 68349 +/-8.0% 128 byte buffer: 58271 +/-8.7% 256 byte buffer: 52945 +/-8.4% 512 byte buffer: 50535 +/-8.6% 1024 byte buffer: 49288 +/-9.6% 2048 byte buffer: 48984 +/-10.3% 4096 byte buffer: 48345 +/-8.6% 8192 byte buffer: 48127 +/-8.4% Just for fun, replace add32 with static inline unsigned long add32(unsigned long sum, unsigned long x) { asm ("addc %0, %0, %1": "=r"(sum) : "r" (x)); return sum; } MPC 8321 with asm addc: 64 byte buffer: 52007 +/-8.7% 128 byte buffer: 41986 +/-9.9% 256 byte buffer: 37160 +/-11.4% 512 byte buffer: 34593 +/-10.3% 1024 byte buffer: 33265 +/-10.4% 2048 byte buffer: 32648 +/-11.4% 4096 byte buffer: 32843 +/-14.1% 8192 byte buffer: 32223 +/-12.5% So the new code is better on both platforms and the asm addc on ppc is very fast. Test prog attached. Jocke (See attached file: crc32test.c)

Ondrej Filip

5:28 p.m.

On 23.4.2010 19:01, Joakim Tjernlund wrote:

...

Ondrej Filip <feela@network.cz> wrote on 2010/04/23 18:41:57:

...
On 23.4.2010 18:32, Ondrej Zajicek wrote:

...
On Fri, Apr 23, 2010 at 03:20:55PM +0200, Martin Mares wrote:

...
Hello!

My primary reaction was "If something isn't broken, don't fix it." I.e., unless you have good reasons for rewriting a piece of code, don't do that.

Your version is more readable and I would be in favour of accepting it, but I would still like to see at least a very simple benchmark which shows that it is not significantly slower.

I was curious enough to do some benchmarks and got these results:

Intel Atom: suggested code ~ 1.2* faster AMD Geode: no diference MIPS ADM5120: old code ~ 1.2* faster

So there isn't really difference in performance of both implementations. Even on slow embedded AMD Geode CPU, it gives ~ 180 MB/s.

No difference? what does 1.2 mean? to me this means 20% which is a lot

And the code is 20% slower on MIPS. So I do not see the point. Anyway, 20% in a not often used operation does not have to be even visible.

...

...
Hmm, so there is no major reason for change. I would really support dont-fix-whats-not-broken approach. And also we should keep different

So it is smaller, faster and easier to read and you still want to keep the old code? Very non Open source like I must say.

It was not proven that this code is faster.

...

...
code from Quagga to have heterogeneity.

What kind of argument is that?

Well, let me explain you a few things. It had happened many times, that some wrong (mainly BGP) routing announcement were exported into Internet and all routers of one type (often Quaggas) has crashed. BIRD is used on some very important places like world largest IXPs as a route server (RS). Each IXP usually has at least two different RSs (if possible to avoid implementation dependent bugs. The second RS is usually OpenBGPd of Quagga. So generally speaking, I do not want to use any code from those routing daemons.

...

Sorry, but this all feels like NIH syndrome.

Again, BIRD is used on some very important places and therefore we are very conservative in accepting new patches. But I don't think we have NIH syndrome. We have been accepting foreign patches since beginning and Ondrej Zajicek is a good example. :-) He had sent me some new patches and later we started a regular cooperation and he became a key developer. Ondrej

...

...
Ondrej

Joakim Tjernlund

5:54 p.m.

Ondrej Filip <feela@network.cz> wrote on 2010/04/23 19:28:27:

...

On 23.4.2010 19:01, Joakim Tjernlund wrote:

...
Ondrej Filip <feela@network.cz> wrote on 2010/04/23 18:41:57:

...
On 23.4.2010 18:32, Ondrej Zajicek wrote:

...
On Fri, Apr 23, 2010 at 03:20:55PM +0200, Martin Mares wrote:

...
Hello!

My primary reaction was "If something isn't broken, don't fix it." I.e., unless you have good reasons for rewriting a piece of code, don't do that.

Your version is more readable and I would be in favour of accepting it, but I would still like to see at least a very simple benchmark which shows that it is not significantly slower.

I was curious enough to do some benchmarks and got these results:

Intel Atom: suggested code ~ 1.2* faster AMD Geode: no diference MIPS ADM5120: old code ~ 1.2* faster

So there isn't really difference in performance of both implementations. Even on slow embedded AMD Geode CPU, it gives ~ 180 MB/s.

No difference? what does 1.2 mean? to me this means 20% which is a lot

And the code is 20% slower on MIPS. So I do not see the point. Anyway, 20% in a not often used operation does not have to be even visible.

...
...
Hmm, so there is no major reason for change. I would really support dont-fix-whats-not-broken approach. And also we should keep different

So it is smaller, faster and easier to read and you still want to keep the old code? Very non Open source like I must say.

It was not proven that this code is faster.

...
...
code from Quagga to have heterogeneity.

What kind of argument is that?

Well, let me explain you a few things. It had happened many times, that some wrong (mainly BGP) routing announcement were exported into Internet and all routers of one type (often Quaggas) has crashed. BIRD is used on some very important places like world largest IXPs as a route server (RS). Each IXP usually has at least two different RSs (if possible to avoid implementation dependent bugs. The second RS is usually OpenBGPd of Quagga. So generally speaking, I do not want to use any code from those routing daemons.

...
Sorry, but this all feels like NIH syndrome.

Again, BIRD is used on some very important places and therefore we are very conservative in accepting new patches. But I don't think we have NIH syndrome. We have been accepting foreign patches since beginning and Ondrej Zajicek is a good example. :-) He had sent me some new patches and later we started a regular cooperation and he became a key developer.

So basically you are saying that outsiders like my self aren't welcome because BIRD is so important to some IXPs that you don't want to take any chanches? I had hoped that the possible changes I would need to do could be fed back into BIRD so I didn't have to maintain them myslef forever. Jocke

Ondrej Filip

6:04 p.m.

...

...
Again, BIRD is used on some very important places and therefore we are very conservative in accepting new patches. But I don't think we have NIH syndrome. We have been accepting foreign patches since beginning and Ondrej Zajicek is a good example. :-) He had sent me some new patches and later we started a regular cooperation and he became a key developer.

So basically you are saying that outsiders like my self aren't welcome because BIRD is so important to some IXPs that you don't want to take any chanches? I had hoped that the possible changes I would need to do could be fed back into BIRD so I didn't have to maintain them myslef forever.

Dear Jocke, no, I didn't say this. And again, all ideas are welcome. Thank you for yours. Ondrej

...

Jocke

Joakim Tjernlund

8:55 p.m.

Ondrej Filip <feela@network.cz> wrote on 2010/04/23 20:04:27:

...

...
...
Again, BIRD is used on some very important places and therefore we are very conservative in accepting new patches. But I don't think we have NIH syndrome. We have been accepting foreign patches since beginning and Ondrej Zajicek is a good example. :-) He had sent me some new patches and later we started a regular cooperation and he became a key developer.

So basically you are saying that outsiders like my self aren't welcome because BIRD is so important to some IXPs that you don't want to take any chanches? I had hoped that the possible changes I would need to do could be fed back into BIRD so I didn't have to maintain them myslef forever.

Dear Jocke, no, I didn't say this.

And again, all ideas are welcome. Thank you for yours.

It seems to me that you are so afraid to break something for your precious IXPs that you rather drop user contributions than integrate them unless the changes has been proven correct. I, as a developer, has to do all the work, testing and "prove" that the change is "good". My view would be the work load should be shared by the community, this will help moving BIRD forward much quicker. If you really care about big IXPs, you should consider a stable branch and a development branch for "unproven" stuff and let the community carry some of the work load. Jocke

Martin Mares

9:11 p.m.

Hello!

...

It seems to me that you are so afraid to break something for your precious IXPs that you rather drop user contributions than integrate them unless the changes has been proven correct. I, as a developer, has to do all the work, testing and "prove" that the change is "good". My view would be the work load should be shared by the community, this will help moving BIRD forward much quicker.

A necessary step in sharing your work with the community is to convince the community that the change you want to introduce is useful and/or interesting. Why else should they spend any effort on helping to develop and test your change? Actually, if you look inside the archive of this mailing list, you will be surprised that most people have had it very easy to get their feature accepted. Perhaps "this feature would be useful for my routers" is a sounder argument that "this will boost the performance, but I won't tell you how and why". Perhaps the problem does not lie in the BIRD community, but in your attitude... Maybe you could take a look at the development process of some other projects which aim for producing high quality code, like for example the Linux kernel. Have a nice fortnight -- Martin `MJ' Mares <mj@ucw.cz> http://mj.ucw.cz/ Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth Anti-trust laws should be approached with exactly that attitude.

Joakim Tjernlund

10:39 p.m.

Martin Mares <mj@ucw.cz> wrote on 2010/04/23 23:11:15:

...

Hello!

...
It seems to me that you are so afraid to break something for your precious IXPs that you rather drop user contributions than integrate them unless the changes has been proven correct. I, as a developer, has to do all the work, testing and "prove" that the change is "good". My view would be the work load should be shared by the community, this will help moving BIRD forward much quicker.

A necessary step in sharing your work with the community is to convince the community that the change you want to introduce is useful and/or interesting.

I did that, it is simpler and performs better or equal than the current code.

...

Why else should they spend any effort on helping to develop and test your change?

Actually, if you look inside the archive of this mailing list, you will be surprised that most people have had it very easy to get their feature

Lets hope this was the odd exception then.

...

accepted. Perhaps "this feature would be useful for my routers" is a sounder argument that "this will boost the performance, but I won't tell you how and why". Perhaps the problem does not lie in the BIRD community, but in your attitude...

I didn't expect so much pushback from what was a simple improvement, both performance wise and cleaner code. Instead there was don't touch working code even if it is ugly/worse and you must prove beyond doubts that it improves performance. Pretty impossible and not worth the amount of work I would have to put in just for such a small change.

...

Maybe you could take a look at the development process of some other projects which aim for producing high quality code, like for example the Linux kernel.

Been there, done that, still do that.

Martin Mares

6:10 p.m.

Hello!

...

So basically you are saying that outsiders like my self aren't welcome because BIRD is so important to some IXPs that you don't want to take any chanches?

Certainly not. However, it means that the criteria for accepting patches are somewhat stricter than in many other projects. All ideas are of course welcome (if this discussion leads to skipping all byte order conversions in OSPF on big-endian machines, I will be happy), but please be prepared that being an "obvious improvement" is often a highly subjective trait, so the other developers will sometimes want to see a proof that the patch does indeed produce better results than the status quo. In my opinion, one of the key qualities of a good engineer is to prefer exact observation and measurement of reality over personal feelings. Have a nice fortnight -- Martin `MJ' Mares <mj@ucw.cz> http://mj.ucw.cz/ Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth System going down at 5 pm to install scheduler bug.

Joakim Tjernlund

8:37 p.m.

Martin Mares <mj@ucw.cz> wrote on 2010/04/23 20:10:14:

...

Hello!

...
So basically you are saying that outsiders like my self aren't welcome because BIRD is so important to some IXPs that you don't want to take any chanches?

Certainly not. However, it means that the criteria for accepting patches are somewhat stricter than in many other projects.

All ideas are of course welcome (if this discussion leads to skipping all byte order conversions in OSPF on big-endian machines, I will be happy), but please be prepared that being an "obvious improvement" is often a highly subjective trait, so the other developers will sometimes want to see a proof that the patch does indeed produce better results than the status quo.

In my opinion, one of the key qualities of a good engineer is to prefer exact observation and measurement of reality over personal feelings.

How did you come to the conclusion the the current code was better than the previous version? Seems like "hand waving" to me. I told told you I had benched the "add carry in C" before and it wasn't any better(worse actually). Santiago benched it too and it was better or just as good as before. Only the MIPS had a regression. Now I have also benched yet again x86 and ppc and both are better with the simpler version. So what now? what more proof do you need? Actually, I give up now. There is nothing more to add and if every change I propose needs this level of "proof", I can't image how it hard it must be to propose something more advanced.

Martin Mares

9:02 p.m.

Hello!

...

How did you come to the conclusion the the current code was better than the previous version? Seems like "hand waving" to me.

Did I claim anywhere that the old code is better? I only pointed out the lack of arguments about the new code being better, which is a reason to stay with the old, tested code.

...

I told told you I had benched the "add carry in C" before and it wasn't any better(worse actually).

Actually, back in the ages when I wrote the old checksum function, I have checked that it performs better than a trivial implementation, and now you claim otherwise, so I naturally want to see new data which show that modern hardware behaves differently.

...

Santiago benched it too and it was better or just as good as before. Only the MIPS had a regression.

If I recall his results correctly, he has performed three tests: In the 1st one, your code was 20% faster. In the 2nd one, it was of the same speed. In the 3rd one, it was 20% slower. Maybe I wear different glasses from yours, but I clearly see that on average, there is no improvement.

...

So what now? what more proof do you need?

First of all, I want at least a rudimentary proof that IT MATTERS AT ALL. We are spending lots of time talking about a minor (20%) speedup in a small chunk of code, without having any clue about how often it gets called and what fraction of the total time is really spent there. Unless it shows up on the profile, I really think it is a waste of time to optimize that place at all.

...

Actually, I give up now. There is nothing more to add and if every change I propose needs this level of "proof", I can't image how it hard it must be to propose something more advanced.

Well, the choice is yours. Have a nice fortnight -- Martin `MJ' Mares <mj@ucw.cz> http://mj.ucw.cz/ Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth To understand a program you must become both the machine and the program.

Joakim Tjernlund

10:10 p.m.

Martin Mares <mj@ucw.cz> wrote on 2010/04/23 23:02:29:

...

Hello!

...
How did you come to the conclusion the the current code was better than the previous version? Seems like "hand waving" to me.

Did I claim anywhere that the old code is better? I only pointed out

You did when you commited it and I asked twice how you came to that conlusion. Pretty much all recent tests shows otherwise.

...

the lack of arguments about the new code being better, which is a reason to stay with the old, tested code.

...
I told told you I had benched the "add carry in C" before and it wasn't any better(worse actually).

Actually, back in the ages when I wrote the old checksum function, I have checked that it performs better than a trivial implementation, and now you claim otherwise, so I naturally want to see new data which show that modern hardware behaves differently.

...
Santiago benched it too and it was better or just as good as before. Only the MIPS had a regression.

If I recall his results correctly, he has performed three tests:

In the 1st one, your code was 20% faster. In the 2nd one, it was of the same speed. In the 3rd one, it was 20% slower.

Maybe I wear different glasses from yours, but I clearly see that on average, there is no improvement.

You have ignored my tests which also show an improvement.

...

...
So what now? what more proof do you need?

First of all, I want at least a rudimentary proof that IT MATTERS AT ALL. We are spending lots of time talking about a minor (20%) speedup in a small chunk of code, without having any clue about how often it gets called and what fraction of the total time is really spent there.

20% is a lot and I not going to bench it any further. if you are not happy with the results so far, nothing I can do will change that.

5939

Age (days ago)

5939

Last active (days ago)

List overview

Download

28 comments

5 participants

participants (5)

Joakim Tjernlund
Joakim Tjernlund
Martin Mares
Ondrej Filip
Ondrej Zajicek