Instability - June 2008

Latest news about SIP Sorcery
Aaron
Site Admin
Posts: 4652
Joined: Thu Jul 12, 2007 12:13 am

Post by Aaron » Sat Jun 28, 2008 12:39 am

Hi All,

Just a quick note relating to the instability of the sipswitch over the last couple of weeks. A memory leak was identified 5 or 6 weeks ago but is proving difficult to track down. Initially the leak did not appear to be causing any issues other than chewing up memory, but in the last 2 to 3 weeks there have been some incidents where certain sipswitch services were impaired or failed. I suspect that the problem causing the memory leak has grown in proportion to the increased sipswitch user base and is now having a bigger impact.

Unfortunately this period of instability and increased load coincided with a time when I was travelling and unable to spend any time looking into it. I am now back on the ground, am actively looking into the problem and hope to get it sorted out shortly. Blueface have also moved their SIP monitoring system and for the time being it cannot be used for monitoring the sipswitch.

At this point I suspect the Ruby engine the sipswitch uses for its dial plan execution is the cause of the leak. Initial investigations indicate that when a dial plan with an error is executed, an exception is generated and, as a result, the Ruby engine does not clean itself up. Something like this is not entirely unexpected, as the Ruby engine (IronRuby) is not even at the alpha stage and the implementation the sipswitch is using was built directly from the IronRuby development trunk! There is sometimes a price to pay for being too close to the cutting edge.

Regards,

Aaron

Post by Aaron » Sun Jul 06, 2008 2:33 am

A further update on this issue. After some serious debugging, testing and profiling I am now confident that the memory leak is being caused by the Ruby script hosting engine. Specifically, the leak manifests when a dial plan script generates certain types of exceptions, such as ArgumentOutOfRangeException or NullReferenceException. Dial plan scripts that do not generate exceptions do not cause any problems.
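For anyone wanting to run a similar check against their own script host, the comparison described above (scripts that throw versus scripts that don't) can be sketched roughly as follows. This is only an illustration in Python: `ToyEngine` is a hypothetical stand-in with a deliberately planted bug, not IronRuby or the sipswitch, and the run count and buffer size are arbitrary.

```python
import tracemalloc


class ToyEngine:
    """Hypothetical script host with a planted leak; NOT IronRuby."""

    def __init__(self):
        self._retained = []  # simulates state that is never released

    def execute(self, script):
        scope = bytearray(10_000)  # per-run working memory
        try:
            script()
        except Exception:
            # Planted bug: the scope is kept alive whenever the script fails.
            self._retained.append(scope)


def growth_bytes(engine, script, runs=200):
    """Return the net traced memory growth after `runs` executions."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    for _ in range(runs):
        engine.execute(script)
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before


def good_script():
    return 1 + 1


def bad_script():
    return [][5]  # raises IndexError


# Same engine type, same number of runs; only the script differs.
clean = growth_bytes(ToyEngine(), good_script)
leaky = growth_bytes(ToyEngine(), bad_script)
print(f"no exceptions: {clean} bytes, with exceptions: {leaky} bytes")
```

If growth only shows up in the run that throws, the exception path is the place to look.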

I have posted my findings to the IronRuby forum and hopefully a fix will be possible in the near future.

Regards,

Aaron

Post by Aaron » Sat Jul 12, 2008 12:26 am

Hi All,

This issue is still present in the sipswitch and is a big problem. It means we have to keep a constant eye on the memory utilisation of the sipswitch and restart it roughly every 3 or 4 days.

While the issue has been conclusively isolated to exceptions being generated within Ruby dial plans, the underlying problem in the IronRuby code has not been found. The only way to alleviate the memory leak would be to disable Ruby dial plans, but that's not an option given that they are now widely in use.

As a consequence, the majority of the time I have available to work on the sipswitch was first spent finding the memory leak and has since been spent looking through the IronRuby code to see if I can spot anything there. In the meantime work on feature requests and non-critical bugs for the sipswitch has been put on the backburner.

Without getting to the bottom of this we'll be back to the situation where it's possible for the sipswitch to stall and become unresponsive if we don't catch a spike in memory usage in time.
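As a rough illustration of the kind of check involved in catching memory growth before it becomes a stall, here is a minimal sketch in Python. It is not the monitoring the sipswitch actually uses; the function name, the sample figures and the 1600 MB limit are all made up for the example, and it assumes memory grows roughly linearly between the first and last sample.

```python
def hours_until_limit(samples, limit_mb):
    """Estimate hours left before memory reaches limit_mb.

    samples: list of (hours_since_start, memory_mb) tuples, oldest first.
    Returns 0.0 if the limit is already reached, or None if there is
    too little data or no measurable growth.
    """
    if not samples:
        return None
    (t0, m0), (t1, m1) = samples[0], samples[-1]
    if m1 >= limit_mb:
        return 0.0
    if t1 <= t0 or m1 <= m0:
        return None
    rate_mb_per_hour = (m1 - m0) / (t1 - t0)
    return (limit_mb - m1) / rate_mb_per_hour


# Made-up readings: 400 MB at start, 1000 MB two days later.
samples = [(0, 400), (24, 700), (48, 1000)]
remaining = hours_until_limit(samples, limit_mb=1600)
print(remaining)
```

A check like this could be run on a schedule so that a restart is planned ahead of time rather than after services have already been impaired.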

Regards,

Aaron

Post by Aaron » Thu Jul 17, 2008 10:44 am

Hi All,

We are still not at the bottom of this despite over three weeks of looking into it and investing in memory profiling tools. A memory leak was found in the IronRuby code, but the steps that were taken to work around it have not been successful.

As an additional test, logging of events to the database and recording of CDRs have been temporarily turned off in order to investigate what impact that has. It's envisaged that they will be off for a maximum of 48 hours.

Regards,

Aaron
