rqs gone bad. - SGI

  1. rqs gone bad.


    I just installed the new patch (5473) on a bunch of machines. All
    went fine, except on one of them. Here, the rqsall seems to hang
    forever, i.e. inst says:

    (... the usual stuff, nothing wrong ...)
    Installations and removals were successful.
    Requickstarting ELF files (see rqsall(1))

    And there it hangs. In another shell, I can see

    lavoisier:/# date
    Sat Feb 7 16:21:41 MET 2004
    lavoisier:/# ps -ef | head -1 ; ps -ef | grep '[r]qs'
    UID PID PPID C STIME TTY TIME CMD
    root 616672 619262 0 11:15:36 ? 294:31 /usr/etc/rqs -force_requickstart -load_address 0x7c70000 -timestamp 0x4024bac8
    root 619262 619653 0 11:12:21 ? 1:16 /usr/etc/rqsall -force -no_echo -inst #3 -o /var/inst/.rqsfiles -rescan /var/in


    So, rqs has been running for 5 hours. With top, I can see it gets
    ~95% CPU. There is nothing interesting in /var/adm/SYSLOG .
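
    The check above can be scripted. What follows is only a rough sketch:
    it assumes ps -ef prints the eight columns shown above (with STIME as
    a single field) and that TIME is either mm:ss or hh:mm:ss. The
    function name long_running_rqs is made up for illustration.

```shell
# long_running_rqs MINUTES
# Reads `ps -ef` output on stdin and prints the PID of every process whose
# command is exactly "rqs" (any directory) and whose accumulated CPU time
# is greater than MINUTES.
# Assumed column layout: UID PID PPID C STIME TTY TIME CMD ...
long_running_rqs() {
    awk -v limit="$1" '
        $8 ~ /(^|\/)rqs$/ {
            n = split($7, t, ":")
            if (n == 2)      minutes = t[1]               # mm:ss
            else if (n == 3) minutes = t[1] * 60 + t[2]   # hh:mm:ss
            else             next                         # unknown format, skip
            if (minutes + 0 > limit + 0) print $2
        }
    '
}
```

    Usage would be something like "ps -ef | long_running_rqs 60", which
    lists rqs processes that have used more than an hour of CPU time.
    rqsall itself is not matched, only rqs.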

    Can I somehow find out what's holding up rqs? I checked the file
    /var/inst/.rqsfiles, it is dated 11:12, and ends with

    lb:libc.so.1
    v:sgi1.0
    t:0x3e948995
    c:0x513644fd
    el:
    eo:

    which tells me nothing, apart from suggesting the problem is
    related to libc.... Or perhaps not

    I have never seen this before, and I have no idea what to do.
    Simply kill rqs and hope for the best? Reboot? xfs_check?

    This is a Power Indigo2 XZ, Extreme, IP26, running 6.5.18m.

    All hints appreciated,
    sincerely,
    --
    Hans Peter Verne ( hpv at kjemi dot uio dot no )

    `It would seem that you have no useful skill or talent whatsoever,
    have you thought of going into teaching?' -- Terry Pratchett, "Mort"

  2. Re: rqs gone bad.

    Hans Peter Verne wrote:
    >
    >I just installed the new patch (5473) on a bunch of machines. All
    >went fine, except on one of them. Here, the rqsall seems to hang
    >forever, i.e. inst says:
    >
    > (... the usual stuff, nothing wrong ...)
    > Installations and removals were successful.
    > Requickstarting ELF files (see rqsall(1))
    >
    >And there it hangs. In another shell, I can see
    >
    > lavoisier:/# date
    > Sat Feb 7 16:21:41 MET 2004
    > lavoisier:/# ps -ef | head -1 ; ps -ef | grep '[r]qs'
    > UID PID PPID C STIME TTY TIME CMD
    > root 616672 619262 0 11:15:36 ? 294:31 /usr/etc/rqs -force_requickstart -load_address 0x7c70000 -timestamp 0x4024bac8
    > root 619262 619653 0 11:12:21 ? 1:16 /usr/etc/rqsall -force -no_echo -inst #3 -o /var/inst/.rqsfiles -rescan /var/in
    >
    >
    >So, rqs has been running for 5 hours. With top, I can see it gets
    >~95% CPU. There is nothing interesting in /var/adm/SYSLOG .


    This is a big surprise. No 'hanging rqs' issue is known
    or has been known for years (in fact I don't recall any such
    issue, ever, though a couple of botches have led to rqsall never
    finishing due to circular links in /var/inst/.rqsfiles).

    Try intercepting with par(1) to get a report on what rqs is doing.

    rqs is written to be safe at all times.
    A new DSO is written out and the removal and replacement
    of the old one is designed to be 'atomic'.

    So it should be the case that killing 616672 causes
    no problem.


    >Can I somehow find out what's holding up rqs? I checked the file


    Well, look at par(1). That can help.

    >/var/inst/.rqsfiles, it is dated 11:12, and ends with
    >
    >lb:libc.so.1
    >v:sgi1.0
    >t:0x3e948995
    >c:0x513644fd
    >el:
    >eo:
    >
    >which tells me nothing, apart from suggesting the problem is
    >related to libc.... Or perhaps not


    No, not a correct conclusion.

    >I have never seen this before, and I have no idea what to do.
    >Simply kill rqs and hope for the best? Reboot? xfs_check?


    a) try par on rqs to find out what it is doing, report that
    here.

    b) kill rqs.

    c) xfs_check is a good idea: a file system problem is
    the only cause I can think of offhand.

    c2) If /usr/lib*/so_locations is trashed this can
    give rqs problems, though not leading to an infinite loop
    that I know of.


    >This is a Power Indigo2 XZ, Extreme, IP26, running 6.5.18m.


    d) even if rqs and rqsall fail your machine will still be ok.
    It just might not start up some system utilities quite
    as fast as it could when they are run.

    e) Neither has any known bugs (I do have things I'd like to
    do to improve reporting of its actions -- right now the
    available options for reporting are not very useful)

    f) I don't understand what is causing the problem.

    g) You can always rerun rqsall (don't run 2 at once!) later if
    you wish to. Not a bad idea given this odd situation.
    See the rqsall man page.

    Sign me very surprised by your difficulty.
    David B. Anderson davea at sgi dot com http://reality.sgiweb.org/davea
    [rqs, rqsall maintainer.]

  3. Re: rqs gone bad.

    Hans Peter Verne writes:
    >
    > I have never seen this before, and I have no idea what to do.
    > Simply kill rqs and hope for the best? Reboot? xfs_check?
    >
    > This is a Power Indigo2 XZ, Extreme, IP26, running 6.5.18m.


    Interesting. I have also regularly seen the same rqs behavior lately
    on an IP26 (Challenge M), on .21 and .22. I haven't looked deeper into
    this yet, so I can't really help you, but the fact that you're only
    experiencing it on an R8000 made me raise an eyebrow. Probably no
    coincidence.

    so long,
    Timo

    --
    Timo Kanera . GPG Key-ID: 1024D/30CDB412



  4. Re: rqs gone bad.


    In article davea@quasar.engr.sgi.com (David Anderson) writes:

    > >So, rqs has been running for 5 hours. With top, I can see it gets
    > >~95% CPU. There is nothing interesting in /var/adm/SYSLOG .

    >
    > This is a big surprise.


    Thanks for your interest. I had to put this away for some time,
    but I tried out some more today.

    > No 'hanging rqs' issue is known
    > or has been known for years (in fact I don't recall such at all ever,
    > though a couple of botches have led to rqsall never finishing
    > due to circular-links in /var/inst/.rqsfiles).


    Can .rqsfiles be regenerated somehow?

    > Try intercepting with par(1) to get a report on what it is doing.


    OK. I don't have all that much experience with par, but I didn't
    find much. See below.

    > So it should be the case that killing 616672 should cause
    > no problem.


    Yup, so I did.

    > c) xfs_check is a good idea: a file system problem is
    > the only cause I can think of offhand.


    The system is offsite, so I can't easily boot from CD, but
    xfs_check -s said nothing.

    > c2) If /usr/lib*/so_locations is trashed this can
    > give rqs problems, though not leading to an infinite loop
    > that I know of.


    I don't know what to look for here, though:

    # ls -l /usr/lib*/so_locations
    -r--r--r-- 1 root root 54463 Feb 8 12:11 /usr/lib/so_locations
    -r--r--r-- 1 root root 69160 Feb 8 12:11 /usr/lib32/so_locations
    -r--r--r-- 1 root root 10078 Feb 8 12:11 /usr/lib64/so_locations



    > a) try par on rqs to find out what it is doing, report that
    > here.


    Now, this is what I did:

    tail /var/inst/INSTLOG showed how rqsall was called, so I tried:

    /usr/etc/rqsall -v -count -log /tmp/rqs.log -force \
    -o /var/inst/.rqsfiles-done2 -rescan /var/inst/.rqsfiles

    This outputs a lot, then hangs; the last message is

    removing starting address 0xd3f7000 of length 0x7000 from memory list vl_vec[23] mlist 0x101a08c0 for /usr/lib32/libawareaudio.so.1 -- SUCCEED

    /tmp/rqs.log remains empty, and /var/inst/.rqsfiles-done2 is not even
    created. So (in another window) I run "ps -ef | grep rqs" and find it
    hanging here:

    root 762052 762698 0 14:30:13 pts/4 36:52 /usr/etc/rqs -force_requickstart -load_address 0x4ca0000 -timestamp 0x403368e5

    ps won't tell me the full arg list, but pid 762052 has been running for
    quite some time. So I try

    # par -SS -Q -A -i -s -r -k -i -p 762052

    After a couple of minutes, it has printed this list:

    0mS inetd(762132): I/O queued; flags 0x14019 dev 0,48 count 16384 blkno 45600
    0mS inetd(762132): I/O started; flags 0x14019 dev 0,48 count 16384 blkno 45600
    830mS (762132): was sent signal SIGCLD
    869mS sh(762363): I/O queued; flags 0x9 dev 0,44 count 8192 blkno 845952
    870mS sh(762363): I/O started; flags 0x9 dev 0,44 count 8192 blkno 845952
    888mS (762132): was sent signal SIGCLD
    1040mS (762132): was sent signal SIGCLD
    1220mS (762132): was sent signal SIGCLD
    1396mS (762132): was sent signal SIGCLD
    71354mS (750293): was sent signal SIGCLD


    I'm still stumped!

    --
    Hans Peter Verne ( hpv at kjemi dot uio dot no )

    `It would seem that you have no useful skill or talent whatsoever,
    have you thought of going into teaching?' -- Terry Pratchett, "Mort"

  5. Re: rqs gone bad.

    Hans Peter Verne wrote:
    >
    >In article davea@quasar.engr.sgi.com (David Anderson) writes:
    >
    >> >So, rqs has been running for 5 hours. With top, I can see it gets
    >> >~95% CPU. There is nothing interesting in /var/adm/SYSLOG .

    ....
    >> No 'hanging rqs' issue is known
    >> or has been known for years (in fact I don't recall such at all ever,
    >> though a couple of botches have led to rqsall never finishing
    >> due to circular-links in /var/inst/.rqsfiles).

    >
    >Can .rqsfiles be regenerated somehow?


    Regrettably no. But that does not seem to be your problem,
    and anyway modern rqsall notices the circular list and
    avoids looping.


    >> Try intercepting with par(1) to get a report on what it is doing.

    >
    >OK. I don't have all that much experience with par, but I didn't
    >find much. See below.


    Yes, not much of any use there.


    >> a) try par on rqs to find out what it is doing, report that
    >> here.

    >
    >Now, this is what I did:
    >
    >tail /var/inst/INSTLOG showed how rqsall was called, so I tried:
    >
    > /usr/etc/rqsall -v -count -log /tmp/rqs.log -force \
    > -o /var/inst/.rqsfiles-done2 -rescan /var/inst/.rqsfiles
    >
    >This outputs a lot, then hangs, last message is


    I'd like to see a lot more than 'the last message' here.
    What I'm looking for is the name of the file rqsall is
    going to run rqs on, and the rqs command being run.
    It should be in this output somewhere, not far from the end.
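
    If the whole run is saved to a file, something like the following
    could pull out the last few path names mentioned in it. This is only
    a sketch: it assumes path names appear as whitespace-separated words
    beginning with "/", as in the one message quoted in this thread, and
    the function name last_paths is made up for illustration.

```shell
# last_paths N
# Reads saved rqsall output on stdin and prints the last N words that look
# like absolute path names (words beginning with "/"), in order of appearance.
last_paths() {
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^\//) print $i }' | tail -n "$1"
}
```

    For example, "last_paths 5 < saved-output" (where saved-output is
    whatever file the rqsall stdout/stderr was captured in) shows the
    five most recently named files, which should include the one rqs was
    working on when it hung.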

    > removing starting address 0xd3f7000 of length 0x7000 from memory list vl_vec[23] mlist 0x101a08c0 for /usr/lib32/libawareaudio.so.1 -- SUCCEED
    >
    >/tmp/rqs.log remains empty, and /var/inst/.rqsfiles-done2 is not even
    >created. So (in another window) I run "ps -ef | grep rqs" and find it
    >hanging here:
    >
    > root 762052 762698 0 14:30:13 pts/4 36:52 /usr/etc/rqs -force_requickstart -load_address 0x4ca0000 -timestamp 0x403368e5



    Regrettably the kernel string array limit makes
    the file name being rqs'd invisible to ps.

    That's the rqs command, but missing the file name.

    The fact that par tells us nothing about this process is curious.
    It makes me think (since there is no I/O going on) that the object
    is damaged somehow. Or could the object be NFS-mounted
    and the mount point be unusable, hanging up rqs?
    Look at the path names in
    /var/inst/.rqsfiles for anything that looks like
    a full path to an NFS file (for this system).
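
    One way to do that scan is sketched below. It assumes each record
    line has the tag:value shape seen in the excerpt quoted earlier in
    the thread, and it treats any value beginning with "/" as a path,
    since the tag that carries the full path is not shown here. The
    function name list_paths_under is made up, and the mount points to
    pass in must come from your own mount table.

```shell
# list_paths_under FILE PREFIX...
# Prints every value in FILE (lines of the form tag:value) that is an
# absolute path equal to, or located below, one of the given PREFIX
# directories. A trailing "/" on a PREFIX is ignored.
list_paths_under() {
    file=$1
    shift
    for prefix in "$@"; do
        awk -v p="${prefix%/}" '
            {
                i = index($0, ":")
                if (i == 0) next                  # not a tag:value line
                v = substr($0, i + 1)             # everything after the first colon
                if (substr(v, 1, 1) != "/") next  # not an absolute path
                if (v == p || index(v, p "/") == 1) print v
            }
        ' "$file"
    done
}
```

    For example, "list_paths_under /var/inst/.rqsfiles /net /hosts"
    would list entries under two hypothetical NFS mount points. The file
    is only read, never modified.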

    Again, keep a backup of /var/inst/.rqsfiles . Don't let it get lost.

    >ps won't tell me the full arg list, but pid 762052 has been running for
    >quite some time. So I try
    >
    > # par -SS -Q -A -i -s -r -k -i -p 762052
    >
    >After a couple of minutes, it has printed this list:


    The so_locations files you ls -l'd were of sensible length
    (/usr/lib64/so_locations was a tiny bit smaller than I would
    have expected, but within the bounds of reasonableness).

    Could you email as much of the tail of an rqsall run stdout/stderr
    as you can? (add
    -debug reason
    to the options you give rqsall.)

    When rqs hangs, kill it, letting rqsall run to completion.
    Somehow I have to find out *which* file is hanging rqs.
    The file will be a full path in /var/inst/.rqsfiles .

    We should take this to support or to email and report findings
    here when we have something.


    Thanks for your patience.
    David Anderson davea at sgi dot com
