hanging on by a thread

I’ve been doing a lot of core dump analysis at work on Solaris, digging deep into the bag of coder kung-fu using adb, which is a debugger that is not for the faint of heart.

So, trying to help someone troubleshoot a problem with AOLserver on Linux, I asked the simple question: “Did it drop a core file?” The answer was, “No, it didn’t.” That prompts the next question: “Did you set ulimit -c unlimited?” The affirmative answer surprised me: how is it not dropping a core file? Ahh, that’s when it occurred to me: Linux has had issues with threads for a long time now. So, I decided to do some hacking on my own Debian Linux box running my own hand-compiled 2.6.7 kernel.
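For reference, the same limit that “ulimit -c unlimited” sets from the shell can also be raised from inside a process. This is only a minimal sketch, assuming a POSIX system with getrlimit/setrlimit; the helper name raise_core_limit is made up for illustration and is not part of AOLserver.

```c
#include <stdio.h>
#include <sys/resource.h>

/*
 * Hypothetical helper: raise the soft core-file size limit to the hard
 * limit. This matches "ulimit -c unlimited" only when the hard limit is
 * itself unlimited; an unprivileged process cannot go above the hard limit.
 * Returns 0 on success, -1 on error.
 */
int raise_core_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("getrlimit(RLIMIT_CORE)");
        return -1;
    }
    rl.rlim_cur = rl.rlim_max;   /* soft limit := hard limit */
    if (setrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("setrlimit(RLIMIT_CORE)");
        return -1;
    }
    return 0;
}
```

Calling something like this early in main() would make the process independent of whatever limit the parent shell or init happened to set.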

Apparently, there are some kernel patches that implement multithreaded core dumps, and it looks like they were merged into the 2.5.47 kernel. So, why isn’t AOLserver writing a core file when it segfaults?

After making dinner and putting the kids to bed, I thought about it and wrote a little test program that starts a thread and then aborts in the thread. It wrote out a core file just fine! Then it dawned on me: AOLserver does a setuid to drop root privileges, and Linux, by default, won’t write a core file if the process has changed uid/euid. Figuring I wasn’t the first person to want Linux to behave like every other Unix out there, I found my answer in the unofficial comp.os.linux.development.* FAQ, under “How can I make a suid executable dump core?” After making the change in nsd/nsmain.c, adding the prctl(PR_SET_DUMPABLE) call, I can now get AOLserver to dump a core file on Linux. Yay.
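Roughly, the idea looks like the sketch below. This is not the actual diff committed to nsd/nsmain.c, just an illustration assuming Linux and <sys/prctl.h>; the helper names make_dumpable and is_dumpable are made up. The call has to come after the uid change, because changing uid/euid is what clears the flag.

```c
#include <stdio.h>
#include <sys/prctl.h>

/*
 * Hypothetical helper: mark the calling process as dumpable again.
 * Intended to be called after setuid()/setgid() has dropped privileges.
 * Returns 0 on success, -1 on error.
 */
int make_dumpable(void)
{
    if (prctl(PR_SET_DUMPABLE, 1, 0, 0, 0) < 0) {
        perror("prctl(PR_SET_DUMPABLE)");
        return -1;
    }
    return 0;
}

/* Hypothetical helper: return the current dumpable flag, or -1 on error. */
int is_dumpable(void)
{
    return prctl(PR_GET_DUMPABLE, 0, 0, 0, 0);
}
```

The core file size limit still applies on top of this, so both the dumpable flag and “ulimit -c” need to allow the dump.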

I’ve just filed SF Bug #1031599, attached the diffs and committed the change.

While this doesn’t solve the original problem of why this person’s AOLserver is crashing, at least now I might be able to get a core file to look at …

Comments

  1. On which line was the prctl(PR_SET_DUMPABLE) call added? Thanks.

  2. Line 398 in nsd/nsmain.c rev 1.59.

  3. Today I started AOLserver (with the prctl call added) by running “./bin/nsd -t ./sample-config.tcl -u xmail” on the console, and I get a core file when it goes down. How kind of you. But if I start AOLserver by adding the line “wm:235:respawn:/usr/local/aolserver/bin/nsd -it /usr/local/aolserver/sample-config.tcl -u xmail -g xmail” to /etc/inittab, I find no core file when it goes down. How can I get a core file in this automatic start-up mode?
    It runs on Red Hat Linux 7.1 with kernel 2.4.23.

  4. You might want to try something like this in your /etc/inittab:

    wm:235:respawn:/bin/sh -c "ulimit -c unlimited; /usr/local/aolserver/bin/nsd -it /usr/local/aolserver/sample-config.tcl -u xmail -g xmail"

    If this doesn’t work, then your problem may be something else: the “xmail” user may not have permission to write the core file. Make sure directory permissions are correct under /usr/local/aolserver so that the “xmail” user can write its core file there.

  5. On Red Hat Linux 7.1 with kernel 2.4.23, no matter how many methods I try, there is still no core file when AOLserver goes down if it was started from /etc/inittab. Then I tried it on Red Hat Fedora Core 2 with kernel 2.6.5, and it’s OK, haha……
    I think whether a core file gets written depends on the platform and kernel version.

  6. I want to ask you a question: is AOLserver’s maxthread limited by Linux’s 256-thread limit? Can I set maxthread to 800? Thanks.

  7. If you’re using a modern kernel (I’m using a 2.6 kernel on Debian Linux) you can set Linux’s maxthreads:

    # sysctl kernel.threads-max
    kernel.threads-max = 16383

    AOLserver’s maxthreads ought to be limited by the OS’s thread max limit, sure. But, if you can raise the OS limit, AOLserver should be able to grow with it.

  8. My AOLserver produces a core file every day. The gdb output is:
    (gdb) where
    #0 0x4017b801 in __kill () from /lib/i686/libc.so.6
    #1 0x4011e61b in raise (sig=6) at signals.c:65
    #2 0x4017cd82 in abort () at ../sysdeps/generic/abort.c:88
    #3 0x400d3651 in Tcl_PanicVA () at eval.c:41
    #4 0x400d3674 in Tcl_Panic () at eval.c:41
    #5 0x40064e28 in NsThreadFatal () at eval.c:41
    #6 0x40066e90 in NsCreateThread () at eval.c:41
    #7 0x40065c3a in Ns_ThreadCreate () at eval.c:41
    #8 0x4003e791 in SchedThread () at eval.c:41
    #9 0x40065cd5 in NsThreadMain () at eval.c:41
    #10 0x40067264 in ThreadMain () at eval.c:41
    #11 0x4011bbfd in pthread_start_thread (arg=0x404cac00) at manager.c:262

    This AOLserver 3.5.11, downloaded from SourceForge, receives 5 million requests. Before the core file is created, nsd’s memory use grows to 2G. The top output looks like this:

    PID   USER  PRI NI SIZE  RSS  SHARE STAT %CPU %MEM TIME COMMAND
    20668 xmail   9  0 1292M 1.3G  2524 S     0.0 34.6 0:04 nsd
    20669 xmail   9  0 1292M 1.3G  2524 S     0.0 34.6 0:07 nsd
    20670 xmail   9  0 1292M 1.3G  2524 S     0.0 34.6 0:05 nsd
    20671 xmail   9  0 1292M 1.3G  2524 S     0.0 34.6 0:04 nsd

    I wonder whether my .so file has a memory leak, or whether the Tcl thread library does. When a thread is cleaned up, is all the memory it used released?

    Or is the load on this server too high, and that causes the problem?

  9. Definitely sounds like you have a memory leak somewhere. Are you running OpenACS or is everything code you wrote?

  10. I am not running OpenACS. Just by running “ab -n300 -c300 http://server/index.html”, my server’s resident memory keeps growing, and even after some idle time nsd’s resident memory does not go back down. I use AOLserver 3.5.11 and Tcl 8.4.7 and request only static HTML pages, and the result is the same. What is wrong with AOLserver or Tcl?

  11. I used Tcl 8.4.0 instead of Tcl 8.4.7, and it’s OK, haha……
