System hanging during dump - FreeBSD

  1. System hanging during dump

    Last night, I attempted a full, compressed backup of my 181GB /home
    (on a PATA disk) to a remote system. The backup started at 2159 and
    everything appeared normal until about 0040 when the system became
    non-responsive and this lasted until the dump completed at 1033. This
    is the first full backup of /home I've made for several years (due to
    lack of space).

    I noticed the non-responsiveness at about 0500 when:
    - The dump, gzip and fifo pipeline were running normally.
    - A 'systat -v' I had started was running normally (though it
    reported an excessive number of 'D' processes). Other values
    all appeared normal.
    - No response to return key at a zsh prompt
    - No response to up/down arrows in mutt
    [above all done in pre-existing ssh sessions from another host]
    - telnet to port 22 connected but didn't produce a banner.

    The duration above is based on system logs - which show nothing
    happened during this period. At the end, there were various anomalous
    entries:
    Oct 15 10:33:27 server ntpd[750]: too many recvbufs allocated (40)
    Oct 15 10:33:30 server sshd[947]: error: accept: Software caused connection abort
    Oct 15 10:33:34 server kernel: TCP: [192.168.123.123]:59516 to [192.168.123.200]:25 tcpflags 0x4; syncache_chkrst: Spurious RST without matching syncache entry (possibly syncookie only), segment ignored

    Possibly useful information:
    The dump pipeline was:
    dump -uaL0 -C 32 -f - /home | reblock | gzip [stdout connected to socket
    to remote server]
    'reblock' is basically a 200MB FIFO I wrote to desynchronise the (often
    I/O bound) dump from the CPU-bound gzip.
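
The buffering idea behind 'reblock' can be sketched as a bounded in-memory FIFO sitting between a reader and a writer. This is not the author's program, only a minimal Python illustration: the name relay, the 64KB chunk size and the thread-plus-queue design are assumptions, and only the 200MB capacity comes from the post.

```python
import queue
import threading

CHUNK = 64 * 1024                   # bytes per read (assumed value)
BUFFER_BYTES = 200 * 1024 * 1024    # the 200MB capacity mentioned in the post


def relay(src, dst, buffer_bytes=BUFFER_BYTES, chunk=CHUNK):
    """Copy src to dst through a bounded FIFO and return the bytes copied.

    A reader thread fills the queue while the caller drains it, so a stall
    on one side only stops the other once the buffer is full or empty.
    """
    fifo = queue.Queue(maxsize=max(1, buffer_bytes // chunk))

    def reader():
        while True:
            data = src.read(chunk)
            fifo.put(data)           # blocks while the buffer is full
            if not data:             # empty read marks end of input
                return

    worker = threading.Thread(target=reader, daemon=True)
    worker.start()

    total = 0
    while True:
        data = fifo.get()            # blocks while the buffer is empty
        if not data:
            break
        dst.write(data)
        total += len(data)
    worker.join()
    return total
```

In a pipeline it would be called as relay(sys.stdin.buffer, sys.stdout.buffer) between the producer and the compressor. Bounding the queue by chunk count keeps memory use at or below the configured size even when reads come back short.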

    server% uname -a
    FreeBSD server.vk2pj.dyndns.org 7.0-STABLE FreeBSD 7.0-STABLE #18: Sun May 18 15:02:39 EST 2008 root@server.vk2pj.dyndns.org:/var/obj/k7/usr/src/sys/server i386
    server% df -ki
    Filesystem 1024-blocks Used Avail Capacity iused ifree %iused Mounted on
    /dev/ad0s3d 204648864 181911710 6365246 97% 1703016 11353942 13% /home

    About the only thing that happened at around this time was nightly
    updates. These start at 0005, fetching CTM cvs-cur updates, applying
    them to /home/ncvs, then cvs updating /home/ports. Looking at
    timestamps, /home/ports/graphics/icod/CVS/Entries was updated at
    0042 and /home/ports/graphics/imlib2_loaders/CVS/Entries (the next
    entry) was updated at 1034.

    Whilst /home is fairly full, I can't see that the snapshot meta and
    rollback data would have occupied the 20GB free (and no 'out-of-space'
    messages were generated). Is there some limit on the number of inodes
    that can be updated whilst a snapshot exists?

    Has anyone else seen anything similar?
    --
    Peter Jeremy
    Please excuse any delays as the result of my ISP's inability to implement
    an MTA that is either RFC2821-compliant or matches their claimed behaviour.



  2. Re: System hanging during dump

    On Wed, Oct 15, 2008 at 07:24:28PM +1100, Peter Jeremy wrote:
    > Last night, I attempted a full, compressed backup of my 181GB /home
    > (on a PATA disk) to a remote system. The backup started at 2159 and
    > everything appeared normal until about 0040 when the system became
    > non-responsive and this lasted until the dump completed at 1033. This
    > is the first full backup of /home I've made for several years (due to
    > lack of space).
    >
    > I noticed the non-responsiveness at about 0500 when:
    > - The dump, gzip and fifo pipeline were running normally.
    > - A 'systat -v' I had started was running normally (though it
    > reported an excessive number of 'D' processes). Other values
    > all appeared normal.
    > - No response to return key at a zsh prompt
    > - No response to up/down arrows in mutt
    > [above all done in pre-existing ssh sessions from another host]
    > - telnet to port 22 connected but didn't produce a banner.
    >
    > The duration above is based on system logs - which show nothing
    > happened during this period. At the end, there were various anomalous
    > entries:
    > Oct 15 10:33:27 server ntpd[750]: too many recvbufs allocated (40)
    > Oct 15 10:33:30 server sshd[947]: error: accept: Software caused connection abort
    > Oct 15 10:33:34 server kernel: TCP: [192.168.123.123]:59516 to [192.168.123.200]:25 tcpflags 0x4; syncache_chkrst: Spurious RST without matching syncache entry (possibly syncookie only), segment ignored
    >
    > Possibly useful information:
    > The dump pipeline was:
    > dump -uaL0 -C 32 -f - /home | reblock | gzip [stdout connected to socket
    > to remote server]
    > 'reblock' is basically a 200MB FIFO I wrote to desynchronise the (often
    > I/O bound) dump from the CPU-bound gzip.
    >
    > server% uname -a
    > FreeBSD server.vk2pj.dyndns.org 7.0-STABLE FreeBSD 7.0-STABLE #18: Sun May 18 15:02:39 EST 2008 root@server.vk2pj.dyndns.org:/var/obj/k7/usr/src/sys/server i386
    > server% df -ki
    > Filesystem 1024-blocks Used Avail Capacity iused ifree %iused Mounted on
    > /dev/ad0s3d 204648864 181911710 6365246 97% 1703016 11353942 13% /home
    >
    > About the only thing that happened at around this time was nightly
    > updates. These start at 0005, fetching CTM cvs-cur updates, applying
    > them to /home/ncvs, then cvs updating /home/ports. Looking at
    > timestamps, /home/ports/graphics/icod/CVS/Entries was updated at
    > 0042 and /home/ports/graphics/imlib2_loaders/CVS/Entries (the next
    > entry) was updated at 1034.
    >
    > Whilst /home is fairly full, I can't see that the snapshot meta and
    > rollback data would have occupied the 20GB free (and no 'out-of-space'
    > messages were generated). Is there some limit on the number of inodes
    > that can be updated whilst a snapshot exists?
    >
    > Has anyone else seen anything similar?


    It's a known problem documented in my Wiki -- see "dump/restore". Note
    the part about UFS2 snapshot generation. I'm almost certain this is
    what you're describing.

    http://wiki.freebsd.org/JeremyChadwi...eported_issues

    This is one of the many reasons why I moved our backup infrastructure
    over to use rsnapshot/rsync, despite the atime modification problem.

    --
    | Jeremy Chadwick jdc at parodius.com |
    | Parodius Networking http://www.parodius.com/ |
    | UNIX Systems Administrator Mountain View, CA, USA |
    | Making life hard for others since 1977. PGP: 4BD6C0CB |

    _______________________________________________
    freebsd-stable@freebsd.org mailing list
    http://lists.freebsd.org/mailman/lis...freebsd-stable
    To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org"


  3. Re: System hanging during dump

    On 2008-Oct-15 01:35:38 -0700, Jeremy Chadwick wrote:
    >On Wed, Oct 15, 2008 at 07:24:28PM +1100, Peter Jeremy wrote:
    >> Last night, I attempted a full, compressed backup of my 181GB /home
    >> (on a PATA disk) to a remote system. The backup started at 2159 and
    >> everything appeared normal until about 0040 when the system became
    >> non-responsive and this lasted until the dump completed at 1033. This
    >> is the first full backup of /home I've made for several years (due to
    >> lack of space).

    ...
    >It's a known problem documented in my Wiki -- see "dump/restore". Note
    >the part about UFS2 snapshot generation. I'm almost certain this is
    >what you're describing.


    No, my problem is not mentioned in your Wiki. You mention:
    * dump process frequently hangs
    In my case, the dump was progressing normally. The _rest_ of the
    system was hung.

    * UFS2 snapshot generation (mksnap_ffs, dump -L) takes too long; system is unusable during this time
    In my case, snapshot creation took ~4 minutes. The system was
    running normally for 2.6 hours after snapshot creation completed
    before it froze.

    * Filesystems not cleanly shut down if reboot performed while dump(8) still running
    Not applicable.

    --
    Peter Jeremy


  4. Re: System hanging during dump

    On 2008-Oct-15 02:08:48 -0700, Jeremy Chadwick wrote:
    >On Wed, Oct 15, 2008 at 07:58:43PM +1100, Peter Jeremy wrote:
    >> On 2008-Oct-15 01:35:38 -0700, Jeremy Chadwick wrote:
    >> >On Wed, Oct 15, 2008 at 07:24:28PM +1100, Peter Jeremy wrote:
    >> >> Last night, I attempted a full, compressed backup of my 181GB /home
    >> >> (on a PATA disk) to a remote system. The backup started at 2159 and
    >> >> everything appeared normal until about 0040 when the system became
    >> >> non-responsive and this lasted until the dump completed at 1033. This
    >> >> is the first full backup of /home I've made for several years (due to
    >> >> lack of space).

    >> ...
    >> >It's a known problem documented in my Wiki -- see "dump/restore". Note
    >> >the part about UFS2 snapshot generation. I'm almost certain this is
    >> >what you're describing.

    >>
    >> * UFS2 snapshot generation (mksnap_ffs, dump -L) takes too long; system is unusable during this time
    >> In my case, snapshot creation took ~4 minutes. The system was
    >> running normally for 2.6 hours after snapshot creation completed
    >> before it froze.

    >
    >Did you read the References, including the one from myself?


    Yes. In my case, dump started and ran mksnap_ffs. About 4 minutes
    later, actual dumping started and data streaming continued for about
    12.6 hours. The system froze about 2.6 hours into the dump (after
    dump had written about 31GB).

    >Snapshot generation in some cases took only minutes, but *removal* of
    >the generated snapshot took 1.5 hours or more, hanging the system
    >until the removal was complete.


    Based on progress reports from both dump and my fifo process, the
    snapshot removal began about 10 hours _after_ the system froze
    (during this time, dump wrote about 143GB). Given the timeline,
    it's fairly clear that neither mksnap_ffs nor the 'rm snapshot'
    were running at the time the system froze. I am therefore quite
    confident that the problem I saw is not related to either creation
    or removal of snapshots.
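
The intervals in this timeline can be checked with a little clock arithmetic. The sketch below uses the times reported in the thread (start 2159, freeze about 0040, completion 1033); the 2203 value for the end of snapshot creation is an assumption derived from "about 4 minutes" after the start, and the helper name hours_between is illustrative.

```python
from datetime import datetime, timedelta


def hours_between(start, end):
    """Hours from start to end, both 24-hour 'HHMM' strings.

    The end time is assumed to fall less than 24 hours after the start.
    """
    fmt = "%H%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta < timedelta(0):
        delta += timedelta(days=1)   # the interval crossed midnight
    return delta.total_seconds() / 3600


dump_start = "2159"      # from the original post
snapshot_done = "2203"   # assumed: start plus roughly 4 minutes
freeze = "0040"          # "about 0040" in the original post
dump_end = "1033"        # from the original post

print(round(hours_between(dump_start, dump_end), 1))    # whole run: 12.6
print(round(hours_between(snapshot_done, freeze), 1))   # snapshot to freeze: 2.6
print(round(hours_between(freeze, dump_end), 1))        # freeze to completion: 9.9
```

These values agree with the roughly 12.6, 2.6 and 10 hour figures quoted in the post, given that the freeze time itself is only approximate.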

    I have been using FreeBSD snapshots for many years and am quite
    familiar with their quirks. I have never seen this particular
    problem before. (And FWIW, I _am_ using Doug Ambrisko's patch to
    ffs_snapshot.c).

    --
    Peter Jeremy


  5. Re: System hanging during dump

    On Wed, Oct 15, 2008 at 08:48:09PM +1100, Peter Jeremy wrote:
    > On 2008-Oct-15 02:08:48 -0700, Jeremy Chadwick wrote:
    > >On Wed, Oct 15, 2008 at 07:58:43PM +1100, Peter Jeremy wrote:
    > >> On 2008-Oct-15 01:35:38 -0700, Jeremy Chadwick wrote:
    > >> >On Wed, Oct 15, 2008 at 07:24:28PM +1100, Peter Jeremy wrote:
    > >> >> Last night, I attempted a full, compressed backup of my 181GB /home
    > >> >> (on a PATA disk) to a remote system. The backup started at 2159 and
    > >> >> everything appeared normal until about 0040 when the system became
    > >> >> non-responsive and this lasted until the dump completed at 1033. This
    > >> >> is the first full backup of /home I've made for several years (due to
    > >> >> lack of space).
    > >> ...
    > >> >It's a known problem documented in my Wiki -- see "dump/restore". Note
    > >> >the part about UFS2 snapshot generation. I'm almost certain this is
    > >> >what you're describing.
    > >>
    > >> * UFS2 snapshot generation (mksnap_ffs, dump -L) takes too long; system is unusable during this time
    > >> In my case, snapshot creation took ~4 minutes. The system was
    > >> running normally for 2.6 hours after snapshot creation completed
    > >> before it froze.

    > >
    > >Did you read the References, including the one from myself?

    >
    > Yes. In my case, dump started and ran mksnap_ffs. About 4 minutes
    > later, actual dumping started and data streaming continued for about
    > 12.6 hours. The system froze about 2.6 hours into the dump (after
    > dump had written about 31GB).
    >
    > >Snapshot generation in some cases took only minutes, but *removal* of
    > >the generated snapshot took 1.5 hours or more, hanging the system
    > >until the removal was complete.

    >
    > Based on progress reports from both dump and my fifo process, the
    > snapshot removal began about 10 hours _after_ the system froze
    > (during this time, dump wrote about 143GB). Given the timeline,
    > it's fairly clear that neither mksnap_ffs nor the 'rm snapshot'
    > were running at the time the system froze. I am therefore quite
    > confident that the problem I saw is not related to either creation
    > or removal of snapshots.
    >
    > I have been using FreeBSD snapshots for many years and am quite
    > familiar with their quirks. I have never seen this particular
    > problem before. (And FWIW, I _am_ using Doug Ambrisko's patch to
    > ffs_snapshot.c).


    I don't doubt your seniority or technical skill set. I was simply
    offering information that appeared relevant.

    Sorry for the noise and incorrect correlation.

    --
    | Jeremy Chadwick jdc at parodius.com |
    | Parodius Networking http://www.parodius.com/ |
    | UNIX Systems Administrator Mountain View, CA, USA |
    | Making life hard for others since 1977. PGP: 4BD6C0CB |



  6. Re: System hanging during dump

    * heliocentric@gmail.com [081015 12:00] wrote:
    >
    > Perhaps a 'show locks' with an 8-CURRENT kernel with WITNESS enabled
    > will shed light on the problem? As most of the filesystem locking
    > doesn't use lockmgr in 7-STABLE, it would be silent with that kernel.


    I might be wrong here, but I think the reason you can only see these
    warnings in 8.x is that lockmgr in 7.x isn't watched by WITNESS (but
    it is in 8.x).

    (In other words, filesystems use lockmgr in just about every version
    of FreeBSD back to 2.x or before.)

    -Alfred


  7. Re: System hanging during dump

    Jeremy Chadwick wrote:
    >> Has anyone else seen anything similar?

    >
    > It's a known problem documented in my Wiki -- see "dump/restore". Note
    > the part about UFS2 snapshot generation. I'm almost certain this is
    > what you're describing.


    I don't know how you can say this so confidently without even comparing
    the process wait channels.
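
One way to compare wait channels is to tally them for processes in disk wait. The following is a rough sketch, not a tool from the thread: it assumes the output of ps -axo pid,state,wchan,comm, assumes that ps prints a placeholder such as "-" when a process has no wait channel, and the function names are illustrative.

```python
import subprocess
from collections import Counter

# Column selection for ps; the keywords are assumed to be supported locally.
PS_CMD = ["ps", "-axo", "pid,state,wchan,comm"]


def wchan_counts(ps_text, state_prefix="D"):
    """Count wait channels of processes whose state starts with state_prefix.

    Assumes four whitespace-separated columns per row (pid, state, wchan,
    command) and that an idle wait channel is shown as a placeholder rather
    than left blank. Rows that do not have four fields are skipped.
    """
    counts = Counter()
    for line in ps_text.splitlines()[1:]:      # first line is the header
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        _pid, state, wchan, _command = fields
        if state.startswith(state_prefix):
            counts[wchan] += 1
    return counts


def snapshot():
    """Run ps once and return its text output."""
    return subprocess.run(PS_CMD, capture_output=True, text=True, check=True).stdout
```

Calling wchan_counts(snapshot()).most_common() on a sluggish system lists the wait channels that the largest number of 'D' processes are sleeping on, which is the kind of evidence needed to tell one hang from another.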

    Peter, there was a bug causing dump to hang (completely unrelated to
    UFS2 snapshot generation) merged to RELENG_7 a month or so ago. Can you
    try updating?

    Kris


  8. Re: System hanging during dump

    On 10/15/08, Alfred Perlstein wrote:
    > * heliocentric@gmail.com [081015 12:00] wrote:
    >>
    >> Perhaps a 'show locks' with an 8-CURRENT kernel with WITNESS enabled
    >> will shed light on the problem? As most of the filesystem locking
    >> doesn't use lockmgr in 7-STABLE, it would be silent with that kernel.

    >
    > I might be wrong here, but I think the reason you can only see these
    > warnings in 8.x is that lockmgr in 7.x isn't watched by WITNESS (but
    > it is in 8.x).
    >
    > (In other words, filesystems use lockmgr in just about every version
    > of FreeBSD back to 2.x or before.)


    Yes, you're right. I misread the email about the change.


  9. Re: System hanging during dump

    On 2008-Oct-15 21:37:36 +0100, Kris Kennaway wrote:
    >Peter, there was a bug causing dump to hang (completely unrelated to
    >UFS2 snapshot generation) merged to RELENG_7 a month or so ago. Can you
    >try updating?


    Well, dump wasn't hanging, rather it was hanging the rest of the system.
    In any case, I have upgraded to a recent -stable and am no longer able
    to reproduce the problem.

    I have built myself a looping 'ps -axl' which should let me gather more
    information if it does re-appear. (In the process, I've found that ps
    leaks memory, though that's not a problem until you wrap it in a loop).
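
A periodic sampler of this kind can be sketched as follows. This is not the modified ps described above: it is an assumed alternative that starts a fresh process for every sample and appends the timestamped output to a log file. The function name, log format and interval are illustrative.

```python
import subprocess
import time


def sample(commands, log_path, iterations, interval=10.0):
    """Run each command once per iteration and append its output to log_path.

    commands is a list of argument lists, for example [["ps", "-axl"]].
    Every sample starts a new process, so memory held by one invocation is
    released when that process exits.
    """
    with open(log_path, "a") as log:
        for i in range(iterations):
            for cmd in commands:
                result = subprocess.run(cmd, capture_output=True, text=True)
                stamp = time.strftime("%Y-%m-%d %H:%M:%S")
                log.write(f"=== {stamp} {' '.join(cmd)}\n")
                log.write(result.stdout)
            log.flush()                  # push each sample out promptly
            if i + 1 < iterations:
                time.sleep(interval)
```

For example, sample([["ps", "-axl"]], "/var/tmp/ps.log", iterations=360, interval=10) would record one sample every ten seconds for an hour; the log path here is hypothetical.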

    --
    Peter Jeremy


  10. Re: System hanging during dump

    On Sun, Oct 19, 2008 at 02:21:04PM +1100, Peter Jeremy wrote:
    > On 2008-Oct-15 21:37:36 +0100, Kris Kennaway wrote:
    > >Peter, there was a bug causing dump to hang (completely unrelated to
    > >UFS2 snapshot generation) merged to RELENG_7 a month or so ago. Can you
    > >try updating?

    >
    > Well, dump wasn't hanging, rather it was hanging the rest of the system.
    > In any case, I have upgraded to a recent -stable and am no longer able
    > to reproduce the problem.
    >
    > I have built myself a looping 'ps -axl' which should let me gather more
    > information if it does re-appear. (In the process, I've found that ps
    > leaks memory, though that's not a problem until you wrap it in a loop).


    Which memory? Kernel memory? How did you notice this? Could you add
    vmstat -z and vmstat -m to the loop and watch which allocation grows?
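
Watching which allocation grows comes down to comparing per-name counters between two samples. The sketch below shows only that comparison step. It assumes each sample has already been turned into a mapping from allocation name to in-use count; parsing the vmstat output is deliberately left out because its column layout is not shown in this thread, and the function name growth is illustrative.

```python
def growth(before, after):
    """Return (name, delta) pairs for counters that increased, largest first.

    before and after map an allocation name to its in-use count. Names that
    appear only in the later sample are treated as starting from zero.
    """
    deltas = {name: after[name] - before.get(name, 0) for name in after}
    grew = [(name, delta) for name, delta in deltas.items() if delta > 0]
    return sorted(grew, key=lambda item: item[1], reverse=True)
```

Applied to successive samples, a name that keeps appearing near the top of the result is the likely source of a leak.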




  11. Re: System hanging during dump

    Sorry for the late reply.

    On 2008-Oct-19 11:39:02 +0300, Kostik Belousov wrote:
    >> I have built myself a looping 'ps -axl' which should let me gather more
    >> information if it does re-appear. (In the process, I've found that ps
    >> leaks memory, though that's not a problem until you wrap it in a loop).

    >
    >Which memory? Kernel memory? How did you notice this? Could you add
    >vmstat -z and vmstat -m to the loop and watch which allocation grows?


    ps(1) malloc's memory and doesn't free it. This isn't an issue in
    normal operation because it's a once-through program. I hacked ps to
    turn the guts of main() into a while(1){} loop and this showed the
    process was growing. There were a couple of superfluous strdup()
    calls that could be removed but I don't think it's worth making it
    exhaustively clean up after itself (my hacking included hard-wiring
    the options so I'm not sure my cleanup code is complete in the general
    case). As a low priority, I'll create a PR covering the strdup's.

    --
    Peter Jeremy


  12. Re: System hanging during dump

    On Sun, Nov 02, 2008 at 07:37:05PM +1100, Peter Jeremy wrote:
    > Sorry for the late reply.
    >
    > On 2008-Oct-19 11:39:02 +0300, Kostik Belousov wrote:
    > >> I have built myself a looping 'ps -axl' which should let me gather more
    > >> information if it does re-appear. (In the process, I've found that ps
    > >> leaks memory, though that's not a problem until you wrap it in a loop).

    > >
    > >Which memory? Kernel memory? How did you notice this? Could you add
    > >vmstat -z and vmstat -m to the loop and watch which allocation grows?

    >
    > ps(1) malloc's memory and doesn't free it. This isn't an issue in
    > normal operation because it's a once-through program. I hacked ps to
    > turn the guts of main() into a while(1){} loop and this showed the
    > process was growing. There were a couple of superfluous strdup()
    > calls that could be removed but I don't think it's worth making it
    > exhaustively clean up after itself (my hacking included hard-wiring
    > the options so I'm not sure my cleanup code is complete in the general
    > case). As a low priority, I'll create a PR covering the strdup's.


    Thank you for the clarification. Please Cc: me on the PR and I will look at it.


