Skip to content

[CRASH] then followed by deadlock #3742

@rrevels-bw

Description

@rrevels-bw

OpenSIPS version you are running

version: opensips 3.4.14 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: 9fda082
main.c compiled on  with gcc 12

Crash Core Dump

I havent sent a trap or otherwise kill the process yet as it may be helpful to have it running still for troubleshooting.

=== PID 1444082 ===
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
futex_wait (private=0, expected=2, futex_word=0x7f538ec424c8 <syslog_lock>) at ../sysdeps/nptl/futex-internal.h:146
#0  futex_wait (private=0, expected=2, futex_word=0x7f538ec424c8 <syslog_lock>) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait_private (futex=futex@entry=0x7f538ec424c8 <syslog_lock>) at ./nptl/lowlevellock.c:34
#2  0x00007f538eb6833c in __vsyslog_internal (pri=130, fmt=0x560642ccf168 "%sCRITICAL:core:%s: segfault in process pid: %d, id: %d\n", ap=0x7ffc51b9c030, mode_flags=2) at ./misc/syslog.c:144
#3  0x0000560642b28ee1 in dprint (log_level=-2, facility=128, module=0x560642ca8011 "core", func=0x560642ccf958 <__FUNCTION__.1> "sig_usr", stderr_fmt=0x560642ccf160 "%s [%d] %sCRITICAL:core:%s: segfault in process pid: %d, id: %d\n", syslog_fmt=0x560642ccf168 "%sCRITICAL:core:%s: segfault in process pid: %d, id: %d\n", format=0x560642ccf1a8 "segfault in process pid: %d, id: %d\n") at ./dprint.c:698
#4  0x0000560642b9022b in sig_usr (signo=-1899748152) at ./dprint.h:206
#5  <signal handler called>
#6  __strlen_evex () at ../sysdeps/x86_64/multiarch/strlen-evex.S:79
#7  0x00007f538eac5258 in __vfprintf_internal (s=s@entry=0x7ffc51b9d420, format=format@entry=0x560642cdc520 "%sCRITICAL:core:%s: freeing already freed %s pointer (%p), first free: %s: %s(%ld) - aborting!\n", ap=ap@entry=0x7ffc51b9d648, mode_flags=mode_flags@entry=2) at ./stdio-common/vfprintf-process-arg.c:397
#8  0x00007f538eae5758 in __vsnprintf_internal (string=0x7ffc51b9d6c1 "CRITICAL:core:fm_free: freeing already freed shm pointer (0x7f534710e328), first free: \001", maxlen=<optimized out>, maxlen@entry=959, format=format@entry=0x560642cdc520 "%sCRITICAL:core:%s: freeing already freed %s pointer (%p), first free: %s: %s(%ld) - aborting!\n", args=args@entry=0x7ffc51b9d648, mode_flags=mode_flags@entry=2) at ./libio/vsnprintf.c:114
#9  0x00007f538eb67f6c in __vsyslog_internal (pri=130, fmt=<optimized out>, ap=0x7ffc51b9db30, mode_flags=<optimized out>) at ./misc/syslog.c:218
#10 0x0000560642b28ee1 in dprint (log_level=log_level@entry=-2, facility=128, module=module@entry=0x560642ca8011 "core", func=func@entry=0x560642cdf308 <__FUNCTION__.1> "fm_free", stderr_fmt=stderr_fmt@entry=0x560642cdc518 "%s [%d] %sCRITICAL:core:%s: freeing already freed %s pointer (%p), first free: %s: %s(%ld) - aborting!\n", syslog_fmt=syslog_fmt@entry=0x560642cdc520 "%sCRITICAL:core:%s: freeing already freed %s pointer (%p), first free: %s: %s(%ld) - aborting!\n", format=0x560642cdc580 "freeing already freed %s pointer (%p), first free: %s: %s(%ld) - aborting!\n") at ./dprint.c:698
#11 0x0000560642bd3377 in fm_free (fm=<optimized out>, p=0x7f534710e328, file=<optimized out>, func=<optimized out>, line=<optimized out>) at mem/f_malloc_dyn.h:240
#12 0x0000560642bc47e8 in _shm_free (ptr=0x7f534710e328, file=0x560642cda883 "usr_avp.c", function=0x560642cdaa98 <__FUNCTION__.12> "destroy_avp", line=400) at evi/../mem/shm_mem.h:550
#13 0x0000560642bc6115 in destroy_avps (flags=flags@entry=0, name=name@entry=231, all=all@entry=1) at ./usr_avp.c:413
#14 0x0000560642b63ce2 in pv_set_avp (msg=<optimized out>, param=<optimized out>, op=19, val=0x7ffc51b9de10) at ./pvar.c:2844
#15 0x0000560642b05af8 in do_assign (msg=msg@entry=0x7f538b7224e0, a=a@entry=0x7f538ade93e0) at ./action.c:420
#16 0x0000560642b0616b in do_action (a=a@entry=0x7f538ade93e0, msg=msg@entry=0x7f538b7224e0) at ./action.c:1148
#17 0x0000560642b090d0 in run_action_list (a=<optimized out>, msg=msg@entry=0x7f538b7224e0) at ./action.c:190
#18 0x0000560642b07fe1 in do_action (a=a@entry=0x7f538adec6d8, msg=msg@entry=0x7f538b7224e0) at ./action.c:836
#19 0x0000560642b090d0 in run_action_list (a=<optimized out>, msg=msg@entry=0x7f538b7224e0) at ./action.c:190
#20 0x0000560642b07fe1 in do_action (a=a@entry=0x7f538adf2f58, msg=msg@entry=0x7f538b7224e0) at ./action.c:836
#21 0x0000560642b090d0 in run_action_list (a=<optimized out>, msg=msg@entry=0x7f538b7224e0) at ./action.c:190
#22 0x0000560642b07fe1 in do_action (a=a@entry=0x7f538adf3088, msg=msg@entry=0x7f538b7224e0) at ./action.c:836
#23 0x0000560642b090d0 in run_action_list (a=<optimized out>, msg=msg@entry=0x7f538b7224e0) at ./action.c:190
#24 0x0000560642b07fe1 in do_action (a=a@entry=0x7f538adf31b8, msg=msg@entry=0x7f538b7224e0) at ./action.c:836
#25 0x0000560642b090d0 in run_action_list (a=<optimized out>, msg=msg@entry=0x7f538b7224e0) at ./action.c:190
#26 0x0000560642b07fe1 in do_action (a=a@entry=0x7f538adf32e8, msg=msg@entry=0x7f538b7224e0) at ./action.c:836
#27 0x0000560642b090d0 in run_action_list (a=<optimized out>, msg=msg@entry=0x7f538b7224e0) at ./action.c:190
#28 0x0000560642b07fe1 in do_action (a=a@entry=0x7f538adf5528, msg=msg@entry=0x7f538b7224e0) at ./action.c:836
#29 0x0000560642b090d0 in run_action_list (a=a@entry=0x7f538adf5528, msg=msg@entry=0x7f538b7224e0) at ./action.c:190
#30 0x0000560642b095d4 in run_actions (msg=0x7f538b7224e0, a=0x7f538adf5528) at ./action.c:136
#31 run_top_route (sr=..., msg=msg@entry=0x7f538b7224e0) at ./action.c:250
#32 0x00007f532a98f4b4 in reply_received (p_msg=<optimized out>) at ./modules/tm/t_reply.c:1628
#33 0x0000560642b328c5 in forward_reply (msg=msg@entry=0x7f538b7224e0) at ./forward.c:499
#34 0x0000560642b6f981 in receive_msg (buf=0x560642ef40a0 <buf> "SIP/2.0 300 Multiple Choices\r\nVia: SIP/2.0/UDP 192.168.102.195:5060;received=192.168.102.195;rport=5060;branch=z9hG4bKaa2f.3f31c8c3.0\r\nVia: SIP/2.0/UDP 192.168.100.107:5003;branch=z9hG4bK-3221182-4777"..., len=<optimized out>, rcv_info=rcv_info@entry=0x7ffc51b9f170, existing_context=existing_context@entry=0x0, msg_flags=msg_flags@entry=0) at ./receive.c:283
#35 0x0000560642c9a907 in udp_read_req (si=<optimized out>, bytes_read=<optimized out>) at net/proto_udp/proto_udp.c:186
#36 0x0000560642c86f3f in handle_io (fm=0x7f538aecc618, event_type=event_type@entry=1, idx=0) at net/net_udp.c:295
#37 0x0000560642c87fa1 in io_wait_loop_epoll (repeat=<optimized out>, t=<optimized out>, h=<optimized out>) at net/../io_wait_loop.h:305
#38 0x0000560642c8dd05 in udp_start_processes (chd_rank=chd_rank@entry=0x560642dc2238 <chd_rank>, startup_done=startup_done@entry=0x7f532b53d810) at net/net_udp.c:528
#39 0x0000560642b039c1 in main_loop () at ./main.c:237
#40 main (argc=<optimized out>, argv=<optimized out>) at ./main.c:966
[Inferior 1 (process 1444082) detached]

Describe the traffic that generated the bug

fairly high volume of traffic (250 cps to 450 cps). there are sql queries and rest requests and at times a bit of failover and retransmissions. crash problem appears to occur after message has been relayed and branch transaction AVPs are being cleaned up. in this case a 300 was seen and opensips crashed and then the thread seems to be waiting on a lock to get released while trying to print the message about the segfault.

The other processing threads are in endless loop of __GI_sched_yield () and maxing out the cpu.

This is very intermittent. I sometimes have to run the load for a day or two before seeing the crash and then it is a coin toss if the deadlock will occur or not.

Have been fighting with this while trying to get to version 3.4 of opensips but do see it as well on version 3.2.

To Reproduce

start decent volume of SIP traffic that will have some amount of failover and/or lack of response resulting in retransmissions. active dialogs are only allowed to reach about 4000 in lab and there is plenty of shared as well as package memory headroom. cpu is typically at 60%.

i am not seeing messages from the timer threads for these failures but if i change the call flow to have many retransmissions i get the scheduling messages and a crash long before any memory issues could arise. im not particularly interested in this as i dont expect to have call flows that fail to get responses so consider that to be the real problem in that scenario .

Relevant System Logs

OS/environment information

  • Operating System:
  • OpenSIPS installation:
  • other relevant information:
    6.1.0-40-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.153-1 (2025-09-20) x86_64 GNU/Linux

Additional context

as mentioned, i have left opensips running in this state so i may be able to answer questions to help identify these issues.

Metadata

Metadata

Assignees

Type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions