[BMP] BIRD socket best practice
Hi BIRD team!

We found a case where the BMP code tries to connect to the BMP collector service with sk_open() and this drives up CPU utilization. To reproduce it:

1. The server machine to which BMP PDU packets will be sent should be reachable (so it can be pinged).
2. The BMP collector service itself should not be running on this server.
3. Run BIRD with the BMP protocol enabled.

After that you should observe that the BIRD process has significantly increased CPU utilization. This seems to be related to the “BIRD socket”, because when I capture network traffic on the host machine (where BIRD is running), I can see a massive number of TCP packets exchanged between the BIRD host machine and the BMP collector machine. At the moment the socket type used for the BMP connection is SK_TCP_ACTIVE. Do you have any idea what is going wrong, or how the BIRD socket should be used properly?

As a temporary fix, I have prepared a patch that avoids this issue, but it is a very ugly hack: it frees the BIRD socket outside of the IO code (sk_free()) and initializes the socket again every time an ECONNREFUSED error is passed to the err_hook callback.

I also need a tip: is there a way to get a notification from the BIRD socket when we lose the connection to the BMP collector service? One option is to check whether sk_send() failed, but what about the situation where there are no updates to send for a long time and I would like to be notified as soon as the connection is lost? Is this possible with the current BIRD implementation, or should I add a timer callback that somehow checks whether the BMP collector service is alive? I need this mechanism to synchronize/re-send all BMP data to the collector.

Currently we have switched to the BMP code provided on the bmp branch of the BIRD GitLab repo.

Additionally, I have a question about the enclosed code. Can I free the list node and the node data itself when sk_send() returns a value greater than or equal to 0 (>= 0), as in the code below?

  WALK_LIST_DELSAFE(tx_data, tx_data_next, p->tx_queue)
  {
    ...
    rv = sk_send(p->sk, data_size);
    if (rv < 0) {
      return;
    }

    mb_free(tx_data->data);
    rem_node((node *) tx_data);
    mb_free(tx_data);
    if (rv == 0) {
      return;
    }
    ...

Or should I do that only if sk_send() returns a value greater than 0 (> 0)? My goal is to send all data from the list if there was only a "temporary" problem with sk_send().

Thanks,
----
Pawel Maslanka
Senior Software Engineer
Office: +1.617.444.1234
Cell: +1.617.444.1234
Akamai Technologies
150 Broadway
Cambridge, MA 02142
Connect with Us: <https://community.akamai.com/> <http://blogs.akamai.com/> <https://twitter.com/akamai> <http://www.facebook.com/AkamaiTechnologies> <http://www.linkedin.com/company/akamai-technologies> <http://www.youtube.com/user/akamaitechnologies?feature=results_main>

On Thu, Jun 03, 2021 at 11:19:32PM +0000, Maslanka, Pawel wrote:
> Hi BIRD team!
>
> We found a case where the BMP code tries to connect to the BMP collector service with sk_open() and this drives up CPU utilization. To reproduce it:
>
> 1. The server machine to which BMP PDU packets will be sent should be reachable (so it can be pinged).
> 2. The BMP collector service itself should not be running on this server.
> 3. Run BIRD with the BMP protocol enabled.
>
> After that you should observe that the BIRD process has significantly increased CPU utilization. This seems to be related to the “BIRD socket”, because when I capture network traffic on the host machine (where BIRD is running), I can see a massive number of TCP packets exchanged between the BIRD host machine and the BMP collector machine. At the moment the socket type used for the BMP connection is SK_TCP_ACTIVE.
> Do you have any idea what is going wrong, or how the BIRD socket should be used properly?

Hi

After a failed attempt to connect(), the socket's err_hook is called. In that case you are supposed to close the socket and either disable the protocol or set up a timeout to restart the connection attempts. See rpki_err_hook() or bgp_sock_err(). Otherwise, the BIRD socket layer would try to connect() again immediately.

This part is missing from bmp_sock_err() in our bmp branch; I should fix that. It is still WiP.

> I need also a tip if there is a way to get notification from BIRD
> socket if we lost connection with BMP collector service? One option is to

If a connection is closed regularly, the socket's err_hook is called as well, but with err=0 (see [...]p_sock_err()). In most cases the handling would be similar to that of an actual error (try to re-establish the connection after some timeout).

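A minimal, self-contained sketch of the recovery pattern described in the two answers above: on any err_hook call (err != 0 for a real error, err == 0 for a regular close) close the socket and wait for a timer before reconnecting, instead of reconnecting right away. It deliberately does not use the BIRD API. `struct conn`, `conn_start()`, `conn_err_hook()`, `conn_timer_fired()` and the 5-second `RETRY_DELAY` are hypothetical stand-ins for the protocol's socket, its err_hook and a retry timer; for the real thing, rpki_err_hook() and bgp_sock_err() are the references named above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; none of these names are BIRD API. */
enum conn_state { CONN_IDLE, CONN_CONNECTING, CONN_WAIT_RETRY };

struct conn {
  enum conn_state state;
  bool sock_open;        /* stands in for a live socket object */
  int retry_at;          /* time (s) when the next attempt is due */
  int connect_attempts;  /* how many times a connect was started */
};

#define RETRY_DELAY 5    /* assumed value, in seconds */

/* Start one connect attempt (stands in for creating and opening
 * an active TCP socket). */
static void conn_start(struct conn *c)
{
  assert(!c->sock_open);
  c->sock_open = true;
  c->state = CONN_CONNECTING;
  c->connect_attempts++;
}

/* Error hook. err != 0 is a real error (e.g. ECONNREFUSED),
 * err == 0 is a regular close by the peer. Both take the same
 * path here: close the socket, then wait for the retry timer. */
static void conn_err_hook(struct conn *c, int err, int now)
{
  (void) err;                      /* a real hook would log it */
  c->sock_open = false;            /* close the socket first */
  c->state = CONN_WAIT_RETRY;      /* no reconnect from inside the hook */
  c->retry_at = now + RETRY_DELAY;
}

/* Timer callback: reconnect only once the delay has elapsed. */
static void conn_timer_fired(struct conn *c, int now)
{
  if (c->state == CONN_WAIT_RETRY && now >= c->retry_at)
    conn_start(c);
}
```

With this shape, a collector that refuses connections costs one connect attempt per `RETRY_DELAY` rather than a tight reconnect loop, and a lost connection is noticed through the same hook even when nothing is being sent.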
> Currently we have switched to the BMP code provided on the bmp branch of the BIRD GitLab repo.
>
> Additionally, I have a question about the enclosed code. Can I free the list node and the node data itself when sk_send() returns a value greater than or equal to 0 (>= 0), as in the code below?
>
> WALK_LIST_DELSAFE(tx_data, tx_data_next, p->tx_queue)
> {
>   ...
>   rv = sk_send(p->sk, data_size);
>   if (rv < 0) {
>     return;
>   }
>
>   mb_free(tx_data->data);
>   rem_node((node *) tx_data);
>   mb_free(tx_data);
>   if (rv == 0) {
>     return;
>   }
>   ...
>
> Or should I do that only if sk_send() returns a value greater than 0 (> 0)? My goal is to send all data from the list if there was only a "temporary" problem with sk_send().

This looks OK. If sk_send() returns > 0, the data were sent; you can free the data and continue the loop. If sk_send() returns 0, the data were not sent, but they stay in sk->tbuf, so you can free the data from your tx_queue, break the loop, and wait for tx_hook to happen again.

--
Elen sila lumenn' omentielvo

Ondrej 'Santiago' Zajicek (email: santiago@crfreenet.org)
OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net)
"To err is human -- to blame it on a computer is even more so."

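The return-value handling discussed above can be sketched as a small, self-contained queue flush. This is only an illustration under stated assumptions: `struct tx_item`, `struct tx_queue`, `tx_push()`, `tx_flush()` and the `send_fn` callback are hypothetical and are not the BIRD list or socket API. The callback models the three outcomes from the thread: > 0 (sent), 0 (not sent yet, but held in the socket's buffer) and < 0 (error). `scripted_send()` is a test double that replays a fixed sequence of results.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; this is not the BIRD list or socket API. */
struct tx_item { struct tx_item *next; int payload; };
struct tx_queue { struct tx_item *head; };

/* Modelled return values: > 0 sent, 0 held in the socket buffer
 * (not sent yet), < 0 error. */
typedef int (*send_fn)(int payload, void *ctx);

/* Append one item at the tail; returns 0 on success, -1 on OOM. */
static int tx_push(struct tx_queue *q, int payload)
{
  struct tx_item *it = malloc(sizeof *it);
  if (!it)
    return -1;
  it->payload = payload;
  it->next = NULL;
  struct tx_item **pp = &q->head;
  while (*pp)
    pp = &(*pp)->next;
  *pp = it;
  return 0;
}

/* Flush the queue; returns how many items were removed. */
static int tx_flush(struct tx_queue *q, send_fn send, void *ctx)
{
  int removed = 0;
  while (q->head) {
    struct tx_item *it = q->head;
    int rv = send(it->payload, ctx);
    if (rv < 0)
      break;              /* error: keep the item queued */
    q->head = it->next;   /* rv >= 0: the socket side holds the data */
    free(it);
    removed++;
    if (rv == 0)
      break;              /* socket busy: resume from the tx hook */
  }
  return removed;
}

/* Test double: returns the next scripted result on each call. */
struct script { const int *rv; int calls; };

static int scripted_send(int payload, void *ctx)
{
  struct script *s = ctx;
  (void) payload;
  return s->rv[s->calls++];
}
```

Calling `tx_flush()` again from the transmit-ready callback continues with the remaining items, which is the "wait for tx_hook" step from the answer above.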
Hi Ondrej,

Thank you for your input! This is valuable information for me. I will keep you updated as I work on this.

Thanks,
----
Pawel Maslanka

On 6/7/21, 2:33 AM, "Ondrej Zajicek" <santiago@crfreenet.org> wrote:

> [...]

Hi Ondrej,

I would like to ask you to review changes designed to improve the handling of errors caused by an interrupted connection to the BMP collector. Additionally, I cleaned up the code by removing unnecessary conditions, which I replaced with 'proto_state' checks.

The patch is attached. It is fully compatible with the 'bmp' branch in the BIRD GitLab repo. Of course, I would be happy if these changes could be merged into the 'bmp' branch if they pass review :)

Thanks,
----
Pawel Maslanka

On 6/7/21, 11:26 PM, "Maslanka, Pawel" <pmaslank@akamai.com> wrote:

> [...]

participants (2): Maslanka, Pawel; Ondrej Zajicek