FreeSWITCH

loopback channels stay alive

Details

  • Type: Bug Bug
  • Status: Closed Closed
  • Priority: Minor Minor
  • Resolution: Fixed
  • Affects Version/s: None
  • Fix Version/s: None
  • Component/s: freeswitch-core
  • Platform:
    Linux x86/gcc
  • Uname:
    Linux asp2.eworldcom.hu 2.6.22-14-xen #1 SMP Tue Feb 12 04:26:15 UTC 2008 x86_64 GNU/Linux
  • CPU Info:
    Hide
    root@asp2:/opt/freeswitch/log/loopback# cat /proc/cpuinfo
    processor : 0
    vendor_id : AuthenticAMD
    cpu family : 15
    model : 33
    model name : Dual Core AMD Opteron(tm) Processor 270
    stepping : 2
    cpu MHz : 1994.995
    cache size : 1024 KB
    physical id : 1
    siblings : 1
    core id : 1
    cpu cores : 1
    fpu : yes
    fpu_exception : yes
    cpuid level : 1
    wp : yes
    flags : fpu tsc msr pae mce cx8 apic mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy
    bogomips : 3991.58
    TLB size : 1024 4K pages
    clflush size : 64
    cache_alignment : 64
    address sizes : 40 bits physical, 48 bits virtual
    power management: ts fid vid ttp

    processor : 1
    vendor_id : AuthenticAMD
    cpu family : 15
    model : 33
    model name : Dual Core AMD Opteron(tm) Processor 270
    stepping : 2
    cpu MHz : 1994.995
    cache size : 1024 KB
    physical id : 1
    siblings : 1
    core id : 1
    cpu cores : 1
    fpu : yes
    fpu_exception : yes
    cpuid level : 1
    wp : yes
    flags : fpu tsc msr pae mce cx8 apic mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow up pni lahf_lm cmp_legacy
    bogomips : 3991.58
    TLB size : 1024 4K pages
    clflush size : 64
    cache_alignment : 64
    address sizes : 40 bits physical, 48 bits virtual
    power management: ts fid vid ttp

    processor : 2
    vendor_id : AuthenticAMD
    cpu family : 15
    model : 33
    model name : Dual Core AMD Opteron(tm) Processor 270
    stepping : 2
    cpu MHz : 1994.995
    cache size : 1024 KB
    physical id : 1
    siblings : 1
    core id : 1
    cpu cores : 1
    fpu : yes
    fpu_exception : yes
    cpuid level : 1
    wp : yes
    flags : fpu tsc msr pae mce cx8 apic mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow up pni lahf_lm cmp_legacy
    bogomips : 3991.58
    TLB size : 1024 4K pages
    clflush size : 64
    cache_alignment : 64
    address sizes : 40 bits physical, 48 bits virtual
    power management: ts fid vid ttp

    processor : 3
    vendor_id : AuthenticAMD
    cpu family : 15
    model : 33
    model name : Dual Core AMD Opteron(tm) Processor 270
    stepping : 2
    cpu MHz : 1994.995
    cache size : 1024 KB
    physical id : 1
    siblings : 1
    core id : 1
    cpu cores : 1
    fpu : yes
    fpu_exception : yes
    cpuid level : 1
    wp : yes
    flags : fpu tsc msr pae mce cx8 apic mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow up pni lahf_lm cmp_legacy
    bogomips : 3991.58
    TLB size : 1024 4K pages
    clflush size : 64
    cache_alignment : 64
    address sizes : 40 bits physical, 48 bits virtual
    power management: ts fid vid ttp
    Show
    root@asp2:/opt/freeswitch/log/loopback# cat /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 15 model : 33 model name : Dual Core AMD Opteron(tm) Processor 270 stepping : 2 cpu MHz : 1994.995 cache size : 1024 KB physical id : 1 siblings : 1 core id : 1 cpu cores : 1 fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu tsc msr pae mce cx8 apic mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy bogomips : 3991.58 TLB size : 1024 4K pages clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: ts fid vid ttp processor : 1 vendor_id : AuthenticAMD cpu family : 15 model : 33 model name : Dual Core AMD Opteron(tm) Processor 270 stepping : 2 cpu MHz : 1994.995 cache size : 1024 KB physical id : 1 siblings : 1 core id : 1 cpu cores : 1 fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu tsc msr pae mce cx8 apic mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow up pni lahf_lm cmp_legacy bogomips : 3991.58 TLB size : 1024 4K pages clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: ts fid vid ttp processor : 2 vendor_id : AuthenticAMD cpu family : 15 model : 33 model name : Dual Core AMD Opteron(tm) Processor 270 stepping : 2 cpu MHz : 1994.995 cache size : 1024 KB physical id : 1 siblings : 1 core id : 1 cpu cores : 1 fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu tsc msr pae mce cx8 apic mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow up pni lahf_lm cmp_legacy bogomips : 3991.58 TLB size : 1024 4K pages clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: ts fid vid ttp processor : 3 vendor_id : AuthenticAMD cpu family : 15 model : 33 model name : Dual Core AMD Opteron(tm) Processor 270 stepping : 2 cpu MHz : 1994.995 cache size : 1024 KB physical id : 1 siblings : 1 core id : 1 cpu cores : 1 fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu tsc msr pae mce cx8 apic mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow up pni lahf_lm cmp_legacy bogomips : 3991.58 TLB size : 1024 4K pages clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: ts fid vid ttp
  • GCC Version:
    gcc version 4.2.4 (Ubuntu 4.2.4-1ubuntu3)
  • FreeSWITCH GIT Revision:
    12914
  • Reproduced with GIT HEAD?:
    yes
  • Description:
    Hide
    We originate loopback channels which are bridged to sofia.
    The call is pushed into fifo.


    I connected 2 FS's .say A and B.

    on A I originate calls, with test.py. the dialplan is A.xml
    B plays back soundfiles. (B.xml)

    I can reproduce with a small script which sends bgapi originate.
    After running this script for a while (it takes sometimes a 15 miutes sometimes 1 hour) and stop. "show channels" shows old channels.

    eg:
    I added SWITCH_DEBUG_RWLOCKS define to see what happens with rwlocks


    show calls:
    539c80f9-21bd-11de-841e-4731eebc5a5d,outbound,2009-04-05 10:39:50,1238920790,loopback/#2#279#9243#1#predictivedialer#-a,CS_EXECUTE,FreeSWITCH,0000000000,,pp-
    queue,fifo,pp-queue-2 in,XML,queues-asp2,L16,8000,L16,8000

    when I try to uuid_kill it I get:
    2009-04-05 11:28:31 [ERR] mod_commands.c:1259 kill_function() loopback/#2#279#9243#1#predictivedialer#-b Read lock FAIL
    freeswitch@default>

    The last lines mentioning this channels in the log are:

    2009-04-05 10:40:05 [DEBUG] switch_core_state_machine.c:420 switch_core_session_run() (loopback/#2#279#9243#1#predictivedialer#-b) State Change CS_REPORTING
    -> CS_DONE
    2009-04-05 10:40:05 [DEBUG] switch_core_session.c:1001 switch_core_session_thread() Session 205351 (loopback/#2#279#9243#1#predictivedialer#-b) Locked, Waiti
    ng on external entities
    2009-04-05 10:40:05 [ERR] switch_core_session.c:1003 switch_core_session_thread() loopback/#2#279#9243#1#predictivedialer#-b Write lock ACQUIRED


    Show
    We originate loopback channels which are bridged to sofia. The call is pushed into fifo. I connected 2 FS's .say A and B. on A I originate calls, with test.py. the dialplan is A.xml B plays back soundfiles. (B.xml) I can reproduce with a small script which sends bgapi originate. After running this script for a while (it takes sometimes a 15 miutes sometimes 1 hour) and stop. "show channels" shows old channels. eg: I added SWITCH_DEBUG_RWLOCKS define to see what happens with rwlocks show calls: 539c80f9-21bd-11de-841e-4731eebc5a5d,outbound,2009-04-05 10:39:50,1238920790,loopback/#2#279#9243#1#predictivedialer#-a,CS_EXECUTE,FreeSWITCH,0000000000,,pp- queue,fifo,pp-queue-2 in,XML,queues-asp2,L16,8000,L16,8000 when I try to uuid_kill it I get: 2009-04-05 11:28:31 [ERR] mod_commands.c:1259 kill_function() loopback/#2#279#9243#1#predictivedialer#-b Read lock FAIL freeswitch@default> The last lines mentioning this channels in the log are: 2009-04-05 10:40:05 [DEBUG] switch_core_state_machine.c:420 switch_core_session_run() (loopback/#2#279#9243#1#predictivedialer#-b) State Change CS_REPORTING -> CS_DONE 2009-04-05 10:40:05 [DEBUG] switch_core_session.c:1001 switch_core_session_thread() Session 205351 (loopback/#2#279#9243#1#predictivedialer#-b) Locked, Waiti ng on external entities 2009-04-05 10:40:05 [ERR] switch_core_session.c:1003 switch_core_session_thread() loopback/#2#279#9243#1#predictivedialer#-b Write lock ACQUIRED
  1. A.xml
    (5 kB)
    Tamas Cseke
    06/Apr/09 5:41 AM
  2. B.xml
    (0.5 kB)
    Tamas Cseke
    06/Apr/09 5:41 AM
  3. test.py
    (0.9 kB)
    Tamas Cseke
    06/Apr/09 5:41 AM
  4. vg.log
    (384 kB)
    Tamas Cseke
    09/Apr/09 9:13 AM

Activity

Hide
Tamas Cseke added a comment - 06/Apr/09 5:55 AM
It is just assumption, but maybe helps:

I guess setting SSF_DESTROYED flag blocks in switch_core_session_thread becasue the write lock is acquired in the line right before it according to the log

and I lookes into switch_core_session_read_lock which is performed when I make "uuid_kill", and I added logs to see
which condition is true in switch_test_flag(session, SSF_DESTROYED) || switch_channel_down(session->channel)
I always see SSF_DESTROYED isn't set, but the channel is down.
Show
Tamas Cseke added a comment - 06/Apr/09 5:55 AM It is just assumption, but maybe helps: I guess setting SSF_DESTROYED flag blocks in switch_core_session_thread becasue the write lock is acquired in the line right before it according to the log and I lookes into switch_core_session_read_lock which is performed when I make "uuid_kill", and I added logs to see which condition is true in switch_test_flag(session, SSF_DESTROYED) || switch_channel_down(session->channel) I always see SSF_DESTROYED isn't set, but the channel is down.
Hide
Tamas Cseke added a comment - 06/Apr/09 6:17 AM
We connected with gdb to the running FS process.
And we found the thread might be in deadlock:

http://pastebin.freeswitch.org/8323

should be hanging on this:
    switch_mutex_lock(frame->codec->mutex);
Show
Tamas Cseke added a comment - 06/Apr/09 6:17 AM We connected with gdb to the running FS process. And we found the thread might be in deadlock: http://pastebin.freeswitch.org/8323 should be hanging on this:     switch_mutex_lock(frame->codec->mutex);
Hide
Tamas Cseke added a comment - 06/Apr/09 6:51 AM
this is the thread of the loopback-b channel:
 http://pastebin.freeswitch.org/8324

it waits for:
 switch_core_session_write_lock(session);
(it isn't acquired just the log is before the lock)
Show
Tamas Cseke added a comment - 06/Apr/09 6:51 AM this is the thread of the loopback-b channel:  http://pastebin.freeswitch.org/8324 it waits for:  switch_core_session_write_lock(session); (it isn't acquired just the log is before the lock)
Hide
Anthony Minessale II added a comment - 06/Apr/09 11:36 AM
it's hard to follow this dp
what exactly happens? can you make a more minimalistic recipe to reproduce?
is it always bgapi originate to loopback who in turn calls bridge to another box?




Show
Anthony Minessale II added a comment - 06/Apr/09 11:36 AM it's hard to follow this dp what exactly happens? can you make a more minimalistic recipe to reproduce? is it always bgapi originate to loopback who in turn calls bridge to another box?
Hide
Anthony Minessale II added a comment - 06/Apr/09 3:25 PM
can you get me in on ssh so i can examine the core file myself?
email ssh details to consulting@freeswitch.org

our ssh key is in tree in support-d/shinzon.pub
Show
Anthony Minessale II added a comment - 06/Apr/09 3:25 PM can you get me in on ssh so i can examine the core file myself? email ssh details to consulting@freeswitch.org our ssh key is in tree in support-d/shinzon.pub
Hide
Tamas Cseke added a comment - 07/Apr/09 2:53 AM
minimalistic recipe:
-we bgapi originate loopback channels.
-loopback-b bridge sofia
-loopback-a goes into fifo
-after timeout the call is being hungup.

the symptoms:
loopback-a waits for codec mutex
loopback-b waits for session write lock

I emailed the ssh details.
Show
Tamas Cseke added a comment - 07/Apr/09 2:53 AM minimalistic recipe: -we bgapi originate loopback channels. -loopback-b bridge sofia -loopback-a goes into fifo -after timeout the call is being hungup. the symptoms: loopback-a waits for codec mutex loopback-b waits for session write lock I emailed the ssh details.
Hide
Tamas Cseke added a comment - 07/Apr/09 3:33 AM
I upgraded to r12933 and I still can reproduce
Show
Tamas Cseke added a comment - 07/Apr/09 3:33 AM I upgraded to r12933 and I still can reproduce
Hide
Tamas Cseke added a comment - 08/Apr/09 6:07 AM
fixed in r12944
Show
Tamas Cseke added a comment - 08/Apr/09 6:07 AM fixed in r12944
Hide
Tamas Cseke added a comment - 08/Apr/09 9:45 AM
we didn't have the original problem so I closed the bug.
But it seems to break other thing

We had 3 segfaults today with the same backtrace:
http://pastebin.freeswitch.org/8381

This was the last lines in the log .I know debug would be much better, but I can't reproduce it yet, I'm still trying...


2009-04-08 15:37:50 [NOTICE] mod_loopback.c:655 channel_receive_message() Channel [loopback/#71#83629#0036493XX#1#predictivedialer#-a] has been answered
2009-04-08 15:37:50 [NOTICE] switch_ivr_originate.c:1970 switch_ivr_originate() Channel [loopback/#71#83629#0036493XX#1#predictivedialer#-b] has been answe
red
2009-04-08 15:37:50 [NOTICE] switch_ivr.c:1345 switch_ivr_session_transfer() Transfer loopback/#71#83629#0036493XX#1#predictivedialer#-a to XML[pp-queue@q
ueues]
...
2009-04-08 15:37:53 [NOTICE] switch_ivr_bridge.c:998 switch_ivr_multi_threaded_bridge() Hangup loopback/#71#83629#0036493XX#1#predictivedialer#-b [CS_EXEC
UTE] [NONE]

I upgraded from r12913 to 12948. before that we never had such cores.
Show
Tamas Cseke added a comment - 08/Apr/09 9:45 AM we didn't have the original problem so I closed the bug. But it seems to break other thing We had 3 segfaults today with the same backtrace: http://pastebin.freeswitch.org/8381 This was the last lines in the log .I know debug would be much better, but I can't reproduce it yet, I'm still trying... 2009-04-08 15:37:50 [NOTICE] mod_loopback.c:655 channel_receive_message() Channel [loopback/#71#83629#0036493XX#1#predictivedialer#-a] has been answered 2009-04-08 15:37:50 [NOTICE] switch_ivr_originate.c:1970 switch_ivr_originate() Channel [loopback/#71#83629#0036493XX#1#predictivedialer#-b] has been answe red 2009-04-08 15:37:50 [NOTICE] switch_ivr.c:1345 switch_ivr_session_transfer() Transfer loopback/#71#83629#0036493XX#1#predictivedialer#-a to XML[pp-queue@q ueues] ... 2009-04-08 15:37:53 [NOTICE] switch_ivr_bridge.c:998 switch_ivr_multi_threaded_bridge() Hangup loopback/#71#83629#0036493XX#1#predictivedialer#-b [CS_EXEC UTE] [NONE] I upgraded from r12913 to 12948. before that we never had such cores.
Hide
Tamas Cseke added a comment - 09/Apr/09 9:13 AM
I'm trying r12956.

I had 1 quiet big segfault (1.1G)
http://pastebin.freeswitch.org/8416
This is a bit more complicated dialplan.

I also realized possible memory leak with the same scenario (test.py -A.xm-B.xml)
Brian think the problem is around python, but asked me to run with valgrind.

It was this:
root@asp2:/DEVEL/freeswitch# valgrind --version
valgrind-3.3.0-Debian
root@asp2:/DEVEL/freeswitch# valgrind --tool=memcheck --log-file=vg.log --leak-check=full --leak-resolution=high --show-reachable=yes .libs/freeswitch -vg

I attach vg.log.
Show
Tamas Cseke added a comment - 09/Apr/09 9:13 AM I'm trying r12956. I had 1 quiet big segfault (1.1G) http://pastebin.freeswitch.org/8416 This is a bit more complicated dialplan. I also realized possible memory leak with the same scenario (test.py -A.xm-B.xml) Brian think the problem is around python, but asked me to run with valgrind. It was this: root@asp2:/DEVEL/freeswitch# valgrind --version valgrind-3.3.0-Debian root@asp2:/DEVEL/freeswitch# valgrind --tool=memcheck --log-file=vg.log --leak-check=full --leak-resolution=high --show-reachable=yes .libs/freeswitch -vg I attach vg.log.
Hide
Tamas Cseke added a comment - 10/Apr/09 5:34 AM
we're trying r12980.

we had one case when the channel is stucked in.
Here is the backtrace: http://pastebin.freeswitch.org/8427
I try to reproduce reliably, but maybe it is obvious for you.

(I tried the test with esl lib, and I don't have memory problem yet, so don't care about that)
Show
Tamas Cseke added a comment - 10/Apr/09 5:34 AM we're trying r12980. we had one case when the channel is stucked in. Here is the backtrace: http://pastebin.freeswitch.org/8427 I try to reproduce reliably, but maybe it is obvious for you. (I tried the test with esl lib, and I don't have memory problem yet, so don't care about that)
Hide
Anthony Minessale II added a comment - 10/Apr/09 12:43 PM
try r12986
Show
Anthony Minessale II added a comment - 10/Apr/09 12:43 PM try r12986
Hide
Tamas Cseke added a comment - 14/Apr/09 9:40 AM
we're testing, we haven't had any problem yet.
tonight I upgrade a production box too, and we will see what happen.
thank you.
Show
Tamas Cseke added a comment - 14/Apr/09 9:40 AM we're testing, we haven't had any problem yet. tonight I upgrade a production box too, and we will see what happen. thank you.
Hide
Tamas Cseke added a comment - 15/Apr/09 3:33 AM
We have a segfault, while dereferencing codec implementation.
Here is the backtrace: http://pastebin.freeswitch.org/8596
Maybe this is related to this issue.

This happend once with r13008.
Show
Tamas Cseke added a comment - 15/Apr/09 3:33 AM We have a segfault, while dereferencing codec implementation. Here is the backtrace: http://pastebin.freeswitch.org/8596 Maybe this is related to this issue. This happend once with r13008.
Hide
Anthony Minessale II added a comment - 15/Apr/09 8:27 AM
You must be on at least 13018 but i strongly recommend svn trunk as you debug this issue.
Show
Anthony Minessale II added a comment - 15/Apr/09 8:27 AM You must be on at least 13018 but i strongly recommend svn trunk as you debug this issue.
Hide
Tamas Cseke added a comment - 15/Apr/09 9:41 AM
we are always on latest trunk on test boxes. and since we didn't have problem yesterday with it I updated to this 13008.
now I installed r13027 without testing.
 
Show
Tamas Cseke added a comment - 15/Apr/09 9:41 AM we are always on latest trunk on test boxes. and since we didn't have problem yesterday with it I updated to this 13008. now I installed r13027 without testing.  
Hide
Anthony Minessale II added a comment - 15/Apr/09 9:51 AM
ok,
np

I recommend you watch the svn version on mod_loopback and or switch_core_io.c and try to update whenever they change until your problem is gone for sure.
Show
Anthony Minessale II added a comment - 15/Apr/09 9:51 AM ok, np I recommend you watch the svn version on mod_loopback and or switch_core_io.c and try to update whenever they change until your problem is gone for sure.
Hide
Bill Belanger added a comment - 05/May/09 1:06 PM
As of version 12944, loopback appears to be broken in our setup. We connect two conferences using mod_loopback ('api conference test-1 dial loopback/test-2'), with a special entry in the dialplan that will connect 'test-xx' to its own unique conference.

Revision 12943 appears to work fine, but after 12944 audio traveling from one conference to another is broken. Audio is severely delayed, and only heard in blips. This continues to hold true for current (13232).
Show
Bill Belanger added a comment - 05/May/09 1:06 PM As of version 12944, loopback appears to be broken in our setup. We connect two conferences using mod_loopback ('api conference test-1 dial loopback/test-2'), with a special entry in the dialplan that will connect 'test-xx' to its own unique conference. Revision 12943 appears to work fine, but after 12944 audio traveling from one conference to another is broken. Audio is severely delayed, and only heard in blips. This continues to hold true for current (13232).
Hide
Anthony Minessale II added a comment - 05/May/09 2:35 PM
try 13236
Show
Anthony Minessale II added a comment - 05/May/09 2:35 PM try 13236
Hide
Bill Belanger added a comment - 05/May/09 3:08 PM
My issue mentioned above is Fixed in revision 13236.
Show
Bill Belanger added a comment - 05/May/09 3:08 PM My issue mentioned above is Fixed in revision 13236.
Hide
Anthony Minessale II added a comment - 05/May/09 4:32 PM
r13236
Show
Anthony Minessale II added a comment - 05/May/09 4:32 PM r13236

People

Dates

  • Created:
    06/Apr/09 5:37 AM
    Updated:
    27/Sep/10 1:45 AM
    Resolved:
    05/May/09 4:32 PM