Bug Report: Unaligned Access in BGP Code on ARMv7 Platforms
Dear BIRD Community, I am writing to report a bug observed in the BGP implementation on ARMv7 platforms, which results in a SIGBUS error due to unaligned memory access in the NEON assembler code. I debugged this issue on version 2.15.1 of BIRD. ### Problem Description When running BIRD on ARMv7, the application crashes with a `SIGBUS` signal. The issue stems from an unaligned memory access instruction in the NEON assembly: ``` vst1.8 {d16-d17}, [r0 :64] ``` This instruction requires the address in `r0` to be 64-bit (8-byte) aligned. However, in some cases, `r0` is not correctly aligned, leading to a bus error. Here is an excerpt from the kernel logs showing the alignment trap: ``` kern.err kernel: [90636.175853] Alignment trap: not handling instruction f4400a1f at [<00056fbc>] kern.alert kernel: [90636.182044] Unhandled fault: alignment exception (0x801) at 0xb6dec01c kern.alert kernel: [90636.184946] pgd = 190a2833 kern.alert kernel: [90636.191494] [b6dec01c] *pgd=82e91835, *pte=8502b75f, *ppte=8502bc7f ``` ### Backtrace The crash occurs in the `bgp_get_prefix` function due to the misaligned memory address: ``` Program received signal SIGBUS, Bus error. 0x00057504 in bgp_get_prefix (path_id=0, net=0xb6f3e030, c=0xb6f27420) at proto/bgp/attrs.c:1710 1710 proto/bgp/attrs.c: No such file or directory. (gdb) bt #0 0x00057504 in bgp_get_prefix (path_id=0, net=0xb6f3e030, c=0xb6f27420) at proto/bgp/attrs.c:1710 #1 bgp_rt_notify (old=0x400, new=0x64, n=0x4, C=0xb6f27420, P=0xb6f27850) at proto/bgp/attrs.c:1966 #2 bgp_rt_notify (P=0xb6f27850, C=0xb6f27420, n=0x4, new=0x64, old=0x0) at proto/bgp/attrs.c:1936 ... ``` The exact crashing line is this instruction: ```c px->path_id = path_id; ``` You can view it here: https://github.com/CZ-NIC/bird/blob/0b684a43bd7ce4a32c9cd7754b88286bcd1815bb... ### Root Cause The root cause appears to be insufficient alignment of memory allocated for structures, specifically in this line: ```c px = mb_alloc(c->pool, sizeof(struct bgp_prefix) + net->length); ``` The allocated memory may not be properly aligned for structures containing 64-bit data types, which is mandatory on ARMv7 when using NEON instructions. ### Temporary Workaround To mitigate the issue, we are currently using the GCC compiler flag `-mno-unaligned-access`. This flag ensures that the compiler avoids generating code that assumes unaligned access is supported, thereby preventing the `SIGBUS` error. https://github.com/freifunk-berlin/falter-packages/commit/fcce390fc57b44593f... ### Request for Feedback I would like to hear the community's thoughts on the best approach to resolve this issue permanently. If needed, I can provide further logs or test configurations to reproduce the problem. Bests, Nick
On Thu, Dec 05, 2024 at 11:35:13PM +0100, nick via Bird-users wrote:
Dear BIRD Community,
I am writing to report a bug observed in the BGP implementation on ARMv7 platforms, which results in a SIGBUS error due to unaligned memory access in the NEON assembler code. I debugged this issue on version 2.15.1 of BIRD.
The root cause appears to be insufficient alignment of memory allocated for structures, specifically in this line:
```c px = mb_alloc(c->pool, sizeof(struct bgp_prefix) + net->length); ```
Hello Thanks for the report, can you get from your crash the pointer / aligment of 'px' and also get sizeof(struct mblock) in your build? -- Elen sila lumenn' omentielvo Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) "To err is human -- to blame it on a computer is even more so."
Sorry for the long response time. Here is the requested output:
Program received signal SIGBUS, Bus error. 0x00056fc0 in bgp_get_prefix (path_id=0, net=0xb6f0709c, c=0xb6ee9bc0) at proto/bgp/attrs.c:1710 1710 proto/bgp/attrs.c: No such file or directory. (gdb) p sizeof(struct mblock) $1 = 16 (gdb) p px $2 = (struct bgp_prefix *) 0xb6eac01c (gdb) p (uintptr_t)px % 4 $3 = 0 (gdb) p (uintptr_t)px % 8 $4 = 4
On 12/6/24 3:16 AM, Ondrej Zajicek wrote:
On Thu, Dec 05, 2024 at 11:35:13PM +0100, nick via Bird-users wrote:
Dear BIRD Community,
I am writing to report a bug observed in the BGP implementation on ARMv7 platforms, which results in a SIGBUS error due to unaligned memory access in the NEON assembler code. I debugged this issue on version 2.15.1 of BIRD.
The root cause appears to be insufficient alignment of memory allocated for structures, specifically in this line:
```c px = mb_alloc(c->pool, sizeof(struct bgp_prefix) + net->length); ``` Hello
Thanks for the report, can you get from your crash the pointer / aligment of 'px' and also get sizeof(struct mblock) in your build?
I also uploaded the coredumpfile: https://github.com/PolynomialDivision/coredumpupload/blob/main/bird_coredump On 12/10/24 8:56 PM, nick via Bird-users wrote:
Sorry for the long response time.
Here is the requested output:
Program received signal SIGBUS, Bus error. 0x00056fc0 in bgp_get_prefix (path_id=0, net=0xb6f0709c, c=0xb6ee9bc0) at proto/bgp/attrs.c:1710 1710 proto/bgp/attrs.c: No such file or directory. (gdb) p sizeof(struct mblock) $1 = 16 (gdb) p px $2 = (struct bgp_prefix *) 0xb6eac01c (gdb) p (uintptr_t)px % 4 $3 = 0 (gdb) p (uintptr_t)px % 8 $4 = 4
On 12/6/24 3:16 AM, Ondrej Zajicek wrote:
On Thu, Dec 05, 2024 at 11:35:13PM +0100, nick via Bird-users wrote:
Dear BIRD Community,
I am writing to report a bug observed in the BGP implementation on ARMv7 platforms, which results in a SIGBUS error due to unaligned memory access in the NEON assembler code. I debugged this issue on version 2.15.1 of BIRD.
The root cause appears to be insufficient alignment of memory allocated for structures, specifically in this line:
```c px = mb_alloc(c->pool, sizeof(struct bgp_prefix) + net->length); ``` Hello
Thanks for the report, can you get from your crash the pointer / aligment of 'px' and also get sizeof(struct mblock) in your build?
On Tue, Dec 10, 2024 at 09:15:46PM +0100, nick via Bird-users wrote:
I also uploaded the coredumpfile: https://github.com/PolynomialDivision/coredumpupload/blob/main/bird_coredump
Thanks. This seems like an interesting issue. In BIRD, generic net_addr structure is explicitly u64-aligned (to accomodate VPN variants), while specific net_addr_ip4 and net_addr_ip6 are just u32-aligned. In this case net_addr_ip6 is allocated with u32 alignment, but then copied with net_copy(), which assumes generic net_addr for arguments, and compiler probably used some u64-optimized copying, which required 64-bit alignment despite being on 32-bit platform, For starters, try the attached patch. But it is preliminary, we will revisit alignment of these structures.
The root cause appears to be insufficient alignment of memory allocated for structures, specifically in this line:
```c px = mb_alloc(c->pool, sizeof(struct bgp_prefix) + net->length); ```
Note that it is really allocated two lines above, here: px = sl_alloc(c->prefix_slab); -- Elen sila lumenn' omentielvo Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) "To err is human -- to blame it on a computer is even more so."
Thank you! I just ran a quick test and encountered the same crash at the same line. I’ll have more time to investigate tomorrow and can provide additional details then. Do you have any other ideas I could try in the meantime? Core dump: https://github.com/PolynomialDivision/coredumpupload/tree/main/u64patch On 12/11/24 1:41 AM, Ondrej Zajicek wrote:
On Tue, Dec 10, 2024 at 09:15:46PM +0100, nick via Bird-users wrote:
I also uploaded the coredumpfile: https://github.com/PolynomialDivision/coredumpupload/blob/main/bird_coredump Thanks. This seems like an interesting issue. In BIRD, generic net_addr structure is explicitly u64-aligned (to accomodate VPN variants), while specific net_addr_ip4 and net_addr_ip6 are just u32-aligned. In this case net_addr_ip6 is allocated with u32 alignment, but then copied with net_copy(), which assumes generic net_addr for arguments, and compiler probably used some u64-optimized copying, which required 64-bit alignment despite being on 32-bit platform,
For starters, try the attached patch. But it is preliminary, we will revisit alignment of these structures.
The root cause appears to be insufficient alignment of memory allocated for structures, specifically in this line:
```c px = mb_alloc(c->pool, sizeof(struct bgp_prefix) + net->length); ``` Note that it is really allocated two lines above, here:
px = sl_alloc(c->prefix_slab);
On Wed, Dec 11, 2024 at 02:14:39AM +0100, nick wrote:
Thank you! I just ran a quick test and encountered the same crash at the same line. I’ll have more time to investigate tomorrow and can provide additional details then. Do you have any other ideas I could try in the meantime?
A part of my previous explanation was wrong, the issue is not related to net_copy(), as the crash happened even before. For some reason, struct bgp_prefix is u64-aligned. That makes sense in the original code, where u64 alignment is forced by struct net_addr, but i have no idea why this is true even after the patch removing the forced alignment. The slab allocator in sl_alloc() just returns u32-aligned memory, because it is configured for sizeof(struct bgp_prefix) + sizeof(net_addr_ip6) and the second is 20 as net_addr_ip6 is u32-aligned. The sl_alloc() internally only enforces word-size alignment, wich is u32 in your case. Please disregard the previous patch. Can you try the first attached patch (log-alignment.patch) to log alignment of different structures with or without the second patch (net-addr-u32-align.patch) and report me results?
On 12/11/24 1:41 AM, Ondrej Zajicek wrote:
On Tue, Dec 10, 2024 at 09:15:46PM +0100, nick via Bird-users wrote:
I also uploaded the coredumpfile: https://github.com/PolynomialDivision/coredumpupload/blob/main/bird_coredump Thanks. This seems like an interesting issue. In BIRD, generic net_addr structure is explicitly u64-aligned (to accomodate VPN variants), while specific net_addr_ip4 and net_addr_ip6 are just u32-aligned. In this case net_addr_ip6 is allocated with u32 alignment, but then copied with net_copy(), which assumes generic net_addr for arguments, and compiler probably used some u64-optimized copying, which required 64-bit alignment despite being on 32-bit platform,
For starters, try the attached patch. But it is preliminary, we will revisit alignment of these structures.
The root cause appears to be insufficient alignment of memory allocated for structures, specifically in this line:
```c px = mb_alloc(c->pool, sizeof(struct bgp_prefix) + net->length); ``` Note that it is really allocated two lines above, here:
px = sl_alloc(c->prefix_slab);
-- Elen sila lumenn' omentielvo Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) "To err is human -- to blame it on a computer is even more so."
Can you give me a quick guide on how to enable the debug output? I tried for example setting in the config file:
log "/tmp/birddebug.log" all; debug protocols all;
I am not able to see any output from ALIGN. On 12/11/24 4:40 AM, Ondrej Zajicek wrote:
On Wed, Dec 11, 2024 at 02:14:39AM +0100, nick wrote:
Thank you! I just ran a quick test and encountered the same crash at the same line. I’ll have more time to investigate tomorrow and can provide additional details then. Do you have any other ideas I could try in the meantime? A part of my previous explanation was wrong, the issue is not related to net_copy(), as the crash happened even before.
For some reason, struct bgp_prefix is u64-aligned. That makes sense in the original code, where u64 alignment is forced by struct net_addr, but i have no idea why this is true even after the patch removing the forced alignment.
The slab allocator in sl_alloc() just returns u32-aligned memory, because it is configured for sizeof(struct bgp_prefix) + sizeof(net_addr_ip6) and the second is 20 as net_addr_ip6 is u32-aligned. The sl_alloc() internally only enforces word-size alignment, wich is u32 in your case.
Please disregard the previous patch. Can you try the first attached patch (log-alignment.patch) to log alignment of different structures with or without the second patch (net-addr-u32-align.patch) and report me results?
On 12/11/24 1:41 AM, Ondrej Zajicek wrote:
On Tue, Dec 10, 2024 at 09:15:46PM +0100, nick via Bird-users wrote:
I also uploaded the coredumpfile: https://github.com/PolynomialDivision/coredumpupload/blob/main/bird_coredump Thanks. This seems like an interesting issue. In BIRD, generic net_addr structure is explicitly u64-aligned (to accomodate VPN variants), while specific net_addr_ip4 and net_addr_ip6 are just u32-aligned. In this case net_addr_ip6 is allocated with u32 alignment, but then copied with net_copy(), which assumes generic net_addr for arguments, and compiler probably used some u64-optimized copying, which required 64-bit alignment despite being on 32-bit platform,
For starters, try the attached patch. But it is preliminary, we will revisit alignment of these structures.
> The root cause appears to be insufficient alignment of memory > allocated for > structures, specifically in this line: > > ```c > px = mb_alloc(c->pool, sizeof(struct bgp_prefix) + net->length); > ``` Note that it is really allocated two lines above, here:
px = sl_alloc(c->prefix_slab);
On Thu, Dec 12, 2024 at 12:50:31AM +0100, nick wrote:
Can you give me a quick guide on how to enable the debug output? I tried for example setting in the config file:
log "/tmp/birddebug.log" all; debug protocols all;
I am not able to see any output from ALIGN.
As the log message is generated even before the config file is parsed, it uses the default log output, which is stderr and syslog (if available): # ./bird -l bird: ALIGN: 4 4 4 8 8 # grep ALIGN /var/log/syslog 2024-12-12T02:11:59.513900+01:00 feanor bird: ALIGN: 4 4 4 8 8 (also, you could use option -D <debugfile>, in which case the default log output would be directed there.) Alternatively, if logging to the usual configured logfile would be more practical to you, the test call could be moved from bgp_build() to bgp_init() (attached log-alignment-v2.patch), which is called after everything is initialized and log is directed to the configured file.
On 12/11/24 4:40 AM, Ondrej Zajicek wrote:
On Wed, Dec 11, 2024 at 02:14:39AM +0100, nick wrote:
Thank you! I just ran a quick test and encountered the same crash at the same line. I’ll have more time to investigate tomorrow and can provide additional details then. Do you have any other ideas I could try in the meantime? A part of my previous explanation was wrong, the issue is not related to net_copy(), as the crash happened even before.
For some reason, struct bgp_prefix is u64-aligned. That makes sense in the original code, where u64 alignment is forced by struct net_addr, but i have no idea why this is true even after the patch removing the forced alignment.
The slab allocator in sl_alloc() just returns u32-aligned memory, because it is configured for sizeof(struct bgp_prefix) + sizeof(net_addr_ip6) and the second is 20 as net_addr_ip6 is u32-aligned. The sl_alloc() internally only enforces word-size alignment, wich is u32 in your case.
Please disregard the previous patch. Can you try the first attached patch (log-alignment.patch) to log alignment of different structures with or without the second patch (net-addr-u32-align.patch) and report me results?
-- Elen sila lumenn' omentielvo Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) "To err is human -- to blame it on a computer is even more so."
Dear all, Apologies for the earlier confusion—my toolchain was misconfigured, and I mistakenly compiled the unpatched BIRD sources. Upon correcting this, I believe the first patch already addresses the issue. However, I noticed that the two patches result in different alignments. Below is a summary of the observed alignments: *No Patch:* |ALIGN: 8 4 4 4 8| *First Patch (fix-net-addr-alignment.patch):* |ALIGN: 2 4 4 4 4| *Last Patch (net-addr-u32-align.patch):* |ALIGN: 4 4 4 4 4| Bests, Nick On 12/12/24 2:31 AM, Ondrej Zajicek wrote:
On Thu, Dec 12, 2024 at 12:50:31AM +0100, nick wrote:
Can you give me a quick guide on how to enable the debug output? I tried for example setting in the config file:
log "/tmp/birddebug.log" all; debug protocols all; I am not able to see any output from ALIGN. As the log message is generated even before the config file is parsed, it uses the default log output, which is stderr and syslog (if available):
# ./bird -l bird: ALIGN: 4 4 4 8 8
# grep ALIGN /var/log/syslog 2024-12-12T02:11:59.513900+01:00 feanor bird: ALIGN: 4 4 4 8 8
(also, you could use option -D <debugfile>, in which case the default log output would be directed there.)
Alternatively, if logging to the usual configured logfile would be more practical to you, the test call could be moved from bgp_build() to bgp_init() (attached log-alignment-v2.patch), which is called after everything is initialized and log is directed to the configured file.
On 12/11/24 4:40 AM, Ondrej Zajicek wrote:
On Wed, Dec 11, 2024 at 02:14:39AM +0100, nick wrote:
Thank you! I just ran a quick test and encountered the same crash at the same line. I’ll have more time to investigate tomorrow and can provide additional details then. Do you have any other ideas I could try in the meantime? A part of my previous explanation was wrong, the issue is not related to net_copy(), as the crash happened even before.
For some reason, struct bgp_prefix is u64-aligned. That makes sense in the original code, where u64 alignment is forced by struct net_addr, but i have no idea why this is true even after the patch removing the forced alignment.
The slab allocator in sl_alloc() just returns u32-aligned memory, because it is configured for sizeof(struct bgp_prefix) + sizeof(net_addr_ip6) and the second is 20 as net_addr_ip6 is u32-aligned. The sl_alloc() internally only enforces word-size alignment, wich is u32 in your case.
Please disregard the previous patch. Can you try the first attached patch (log-alignment.patch) to log alignment of different structures with or without the second patch (net-addr-u32-align.patch) and report me results?
On Mon, Dec 16, 2024 at 05:43:16PM +0100, nick wrote:
Dear all,
Apologies for the earlier confusion—my toolchain was misconfigured, and I mistakenly compiled the unpatched BIRD sources. Upon correcting this, I believe the first patch already addresses the issue. However, I noticed that the two patches result in different alignments. Below is a summary of the observed alignments:
*No Patch:* |ALIGN: 8 4 4 4 8|
*First Patch (fix-net-addr-alignment.patch):* |ALIGN: 2 4 4 4 4|
*Last Patch (net-addr-u32-align.patch):* |ALIGN: 4 4 4 4 4|
Hi I suspected that the second crash was also on the unpatched binary :-). The last patch (net-addr-u32-align.patch) is a better way to handle this issue, although not perfect (you could possibly get a crash when VPN network types are used, but not with regular IP network types). We are working on thorough review of alignment issues as there were some other issues on armhf in BIRD 3 development. Note that armhf is likely the worst case for alignment issues, as it requires 8B alignment for u64 but 4B alignment for pointers. -- Elen sila lumenn' omentielvo Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) "To err is human -- to blame it on a computer is even more so."
Hey, I did a PR for the routing feed containing your latest patch [0]. I hope everything is alright. I'm looking forward to the BIRD 3 release. If I need to test something, I'm happy to help. I can also give you SSH access via IPv6 to an OpenWrt device for testing, if you'd like. [0] - https://github.com/openwrt/routing/pull/1097 On 12/16/24 6:08 PM, Ondrej Zajicek wrote:
On Mon, Dec 16, 2024 at 05:43:16PM +0100, nick wrote:
Dear all,
Apologies for the earlier confusion—my toolchain was misconfigured, and I mistakenly compiled the unpatched BIRD sources. Upon correcting this, I believe the first patch already addresses the issue. However, I noticed that the two patches result in different alignments. Below is a summary of the observed alignments:
*No Patch:* |ALIGN: 8 4 4 4 8|
*First Patch (fix-net-addr-alignment.patch):* |ALIGN: 2 4 4 4 4|
*Last Patch (net-addr-u32-align.patch):* |ALIGN: 4 4 4 4 4| Hi
I suspected that the second crash was also on the unpatched binary :-). The last patch (net-addr-u32-align.patch) is a better way to handle this issue, although not perfect (you could possibly get a crash when VPN network types are used, but not with regular IP network types).
We are working on thorough review of alignment issues as there were some other issues on armhf in BIRD 3 development. Note that armhf is likely the worst case for alignment issues, as it requires 8B alignment for u64 but 4B alignment for pointers.
On Wed, Dec 18, 2024 at 12:26:37AM +0100, nick wrote:
Hey,
I did a PR for the routing feed containing your latest patch [0]. I hope everything is alright. I'm looking forward to the BIRD 3 release. If I need to test something, I'm happy to help. I can also give you SSH access via IPv6 to an OpenWrt device for testing, if you'd like.
Hi I merged the official fix, it is these two patches: commit 34b7d77e06f7fa2756a68fa4b1ebdce509dfe277 Lib: Data type for VPN route distinguishers commit 161aef353a56532752d5d0f97e432c0915058050 Lib: Ensure that all net_addr structures have the same alignment (The first patch is necessary for VPN networks that used u64) Note that there are still issues related to OSPF. -- Elen sila lumenn' omentielvo Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org) "To err is human -- to blame it on a computer is even more so."
participants (2)
-
nick -
Ondrej Zajicek