Stale filehandles again - NFS

  1. Stale filehandles again

    I thought I understood stale NFS filehandles: if you change the server
    filesystem (e.g. by copying files to a new drive) then your filehandles
    will all change and you'll get stale filehandles. But I just had a
    case where I had some stale filehandles without changing the server
    filesystem, and I'm curious why.

    The server and clients are all running CentOS 4.1 (a recompile of
    RedHat Enterprise 4.1). I have several filesystems exported from the
    server to the clients. When the server died last night (bug in lvm2
    snapshots) the clients all (same behavior) were able to recover most
    of the filesystems, but got stale filehandles on two of them. The
    only thing similar about the two that failed is that they're rarely
    (if ever) used by the users.

    Given that fsck shouldn't screw with inode numbers or the filesystem
    UUID or generation number, I'm really curious what happened to cause
    two of the filesystems to break in this case. Any thoughts?

    Damian Menscher
    --
    -=#| Physics Grad Student & SysAdmin @ U Illinois Urbana-Champaign |#=-
    -=#| 488 LLP, 1110 W. Green St, Urbana, IL 61801 Ofc:(217)333-0038 |#=-
    -=#| 4602 Beckman, VMIL/MS, Imaging Technology Group:(217)244-3074 |#=-
    -=#| www.uiuc.edu/~menscher/ Fax:(217)333-9819 |#=-
    -=#| The above opinions are not necessarily those of my employers. |#=-

  2. Re: Stale filehandles again

    Damian Menscher wrote:
    >I thought I understood stale NFS filehandles: if you change the server
    >filesystem (eg. by copying files to a new drive) then your filehandles
    >will all change and you'll get stale filehandles. But I just had a
    >case where I had some stale filehandles without changing the server
    >filesystem, and I'm curious why.
    >
    >The server and clients are all running CentOS 4.1 (a recompile of
    >RedHat Enterprise 4.1). I have several filesystems exported from the
    >server to the clients. When the server died last night (bug in lvm2
    >snapshots) the clients all (same behavior) were able to recover most
    >of the filesystems, but got stale filehandles on two of them. The
    >only thing similar about the two that failed is that they're rarely
    >(if ever) used by the users.


    It's a bug. We have it, too, but have never had time to track it
    down. Linux's client is streets better than it was, and more-or-less
    up to industry average in quality, but is still not wonderful.

    Curiously, I have seen that on Solaris, when things were going really
    pear-shaped, and suspect that the design of NFS and its implementations
    are such that it tends to be the symptom of a completely unrelated
    error. That may be true in the Linux case, too.

    One of my standard hobby-horses is to describe the appalling state of
    checking, diagnostics and tracing in modern software, and to try to
    get the message across to vendors that putting 10% of development into
    such things SAVES money. I can witness this from experience, but have
    not so far succeeded in persuading any of the veeps.


    Regards,
    Nick Maclaren.

  3. Re: Stale filehandles again

    Damian Menscher wrote:

    > I thought I understood stale NFS filehandles: if you change the server
    > filesystem (eg. by copying files to a new drive) then your filehandles
    > will all change and you'll get stale filehandles. But I just had a
    > case where I had some stale filehandles without changing the server
    > filesystem, and I'm curious why.


    The first time I ran into stale file handles was changing the OS.
    I had an HP-UX system that could boot two different versions of HP-UX.
    If I went back to the version that exported it originally it was fine.

    No reason that I know of that only some would do it, though.

    -- glen


  4. Re: Stale filehandles again

    Nick Maclaren wrote:
    > Damian Menscher wrote:
    >>I thought I understood stale NFS filehandles: if you change the server
    >>filesystem (eg. by copying files to a new drive) then your filehandles
    >>will all change and you'll get stale filehandles. But I just had a
    >>case where I had some stale filehandles without changing the server
    >>filesystem, and I'm curious why.
    >>
    >>The server and clients are all running CentOS 4.1 (a recompile of
    >>RedHat Enterprise 4.1). I have several filesystems exported from the
    >>server to the clients. When the server died last night (bug in lvm2
    >>snapshots) the clients all (same behavior) were able to recover most
    >>of the filesystems, but got stale filehandles on two of them. The
    >>only thing similar about the two that failed is that they're rarely
    >>(if ever) used by the users.


    > It's a bug. We have it, too, but have never had time to track it
    > down. Linux's client is streets better than it was, and more-or-less
    > up to industry average in quality, but is still not wonderful.


    Any guesses what to do to diagnose such things? Maybe get a packet
    capture? We had another server crash today (3ware ioctl() broke stuff)
    and again got stuck with stale filehandles. Kinda defeats the purpose
    of a hard mount.

    I wouldn't mind helping get this fixed if anyone can suggest what
    diagnostics to try next time it happens (probably either tonight or
    the coming weekend).
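One cheap thing to collect next time, alongside a packet capture: which mount points actually return ESTALE, the errno behind the "Stale NFS file handle" message. A minimal sketch follows, assuming Python is available on the clients. The mount point names are hypothetical, and a stat() on a hard mount can block while the server is still down, so this is only useful once the server is back.

```python
import errno
import os


def find_stale(paths, stat=os.stat):
    """Return the subset of paths whose stat() fails with ESTALE."""
    stale = []
    for path in paths:
        try:
            stat(path)
        except OSError as exc:
            # ESTALE is the errno reported as "Stale NFS file handle".
            # Other errors (ENOENT, EACCES, ...) are ignored here.
            if exc.errno == errno.ESTALE:
                stale.append(path)
    return stale


if __name__ == "__main__":
    # Hypothetical mount points; substitute the real NFS mounts.
    for mount_point in find_stale(["/mnt/home", "/mnt/scratch"]):
        print("stale:", mount_point)
```

Running it from cron on each client and logging the result with a timestamp would at least show whether the same exports go stale every time.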

    Damian Menscher

  5. Re: Stale filehandles again


    Damian Menscher writes:
    |>
    |> > It's a bug. We have it, too, but have never had time to track it
    |> > down. Linux's client is streets better than it was, and more-or-less
    |> > up to industry average in quality, but is still not wonderful.
    |>
    |> Any guesses what to do to diagnose such things? Maybe get a packet
    |> capture? We had another server crash today (3ware ioctl() broke stuff)
    |> and again got stuck with stale filehandles. Kinda defeats the purpose
    |> of a hard mount.
    |>
    |> I wouldn't mind helping get this fixed if anyone can suggest what
    |> diagnostics to try next time it happens (probably either tonight or
    |> the coming weekend).

    I have been telling vendors for 10-20 years that inserting checking
    for and diagnostics of internal errors, and the tracing of data and
    control flow, saves money - but I have got nowhere. Linux is no
    worse and no better than proprietary systems, which is to say that
    it is a software engineering disaster area.

    When faced with this sort of problem, I reckon that I have a 30%
    chance of success after several weeks' work, and I am one of the
    few remaining people with experience in diagnosing such failures
    in hostile, alien software. I wish you the best of luck :-(


    Regards,
    Nick Maclaren.

  6. Re: Stale filehandles again

    Nick Maclaren wrote:
    > Damian Menscher writes:
    > |>
    > |> > It's a bug. We have it, too, but have never had time to track it
    > |> > down. Linux's client is streets better than it was, and more-or-less
    > |> > up to industry average in quality, but is still not wonderful.
    > |>
    > |> Any guesses what to do to diagnose such things? Maybe get a packet
    > |> capture? We had another server crash today (3ware ioctl() broke stuff)
    > |> and again got stuck with stale filehandles. Kinda defeats the purpose
    > |> of a hard mount.
    > |>
    > |> I wouldn't mind helping get this fixed if anyone can suggest what
    > |> diagnostics to try next time it happens (probably either tonight or
    > |> the coming weekend).
    >
    > When faced with this sort of problem, I reckon that I have a 30%
    > chance of success after several weeks' work, and I am one of the
    > few remaining people with experience in diagnosing such failures
    > in hostile, alien software. I wish you the best of luck :-(


    I think I got lucky, thanks to a local admin who suggested I check
    out the fsid option. That, combined with my suspicion that LVM was
    breaking things, turned up lots of stuff on Google, including this
    explanation:
    https://www.redhat.com/archives/linu.../msg00029.html

    Sadly, RedHat considers this behavior to be NOTABUG:
    https://bugzilla.redhat.com/bugzilla....cgi?id=166750
    since it can't be helped that the minor numbers of logical volumes
    are dynamically assigned. Which really just says (to me) that the
    fsid should NOT be based on the major/minor numbers of the block
    device, but rather on the UUID of the filesystem. It seems so
    obvious -- what am I missing? Is there some reason this wouldn't
    work? I think all modern filesystems contain a UUID, right?
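To make the argument concrete, here is a toy model of the two approaches. It is not the real Linux nfsd filehandle layout: the Export class, both handle functions, the device numbers, and the UUID are all made up for illustration. It only shows why a handle keyed on the block device's major/minor numbers stops matching when a logical volume comes back with a different minor number, while one keyed on the filesystem UUID does not.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Export:
    """Hypothetical view of an exported filesystem on the server."""
    major: int   # block device major number
    minor: int   # block device minor number (dynamic for logical volumes)
    uuid: str    # filesystem UUID, stored in the filesystem itself


def handle_from_devno(export, inode, generation):
    """Toy filehandle keyed on the device numbers."""
    key = f"{export.major}:{export.minor}:{inode}:{generation}"
    return hashlib.sha1(key.encode()).hexdigest()[:16]


def handle_from_uuid(export, inode, generation):
    """Toy filehandle keyed on the filesystem UUID."""
    key = f"{export.uuid}:{inode}:{generation}"
    return hashlib.sha1(key.encode()).hexdigest()[:16]


if __name__ == "__main__":
    fs_uuid = "00000000-1111-2222-3333-444444444444"  # made-up value
    before = Export(major=253, minor=2, uuid=fs_uuid)
    after = Export(major=253, minor=5, uuid=fs_uuid)   # same data, new minor

    # Same file (inode 1234, generation 7) before and after the reboot.
    print("devno handle survives:",
          handle_from_devno(before, 1234, 7) == handle_from_devno(after, 1234, 7))
    print("uuid handle survives: ",
          handle_from_uuid(before, 1234, 7) == handle_from_uuid(after, 1234, 7))
```

In the meantime, the fsid option mentioned above is the workaround: as far as I understand exports(5), setting an explicit fsid= on each export stops the identifier from being derived from the device numbers.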

    Damian Menscher

  7. Re: Stale filehandles again


    Damian Menscher writes:
    |> >
    |> > When faced with this sort of problem, I reckon that I have a 30%
    |> > chance of success after several weeks' work, and I am one of the
    |> > few remaining people with experience in diagnosing such failures
    |> > in hostile, alien software. I wish you the best of luck :-(
    |>
    |> I think I got lucky, thanks to a local admin who suggested I check
    |> out the fsid option. That, combined with my suspicion that LVM was
    |> breaking things, turned up lots of stuff on google, including this
    |> explanation:
    |> https://www.redhat.com/archives/linu.../msg00029.html

    Ah. I misunderstood. Yes, you should expect that message after
    a server reboot (and not just on Linux). I was referring to the
    case where you get that message and there is no reboot involved.


    Regards,
    Nick Maclaren.

  8. Re: Stale filehandles again

    nmm1@cus.cam.ac.uk (Nick Maclaren) writes:

    >One of my standard hobby-horses is to describe the appalling state of
    >checking, diagnostics and tracing in modern software, and to try to
    >get the message across to vendors that putting 10% of development into
    >such things SAVES money. I can witness this from experience, but have
    >not so far succeeded in persuading any of the veeps.



    Solaris DTrace seems to be extremely helpful in tracking the origin of
    any error return from the kernel to userland, including stale file
    handles.

    Stale file handles, of course, should not happen as a result of anything
    other than a file being removed on one node or the server while it's
    still in use on another node.

    The rest of the discussion seems to indicate that some servers don't
    have persistent filehandles across reboots; that is bad because they're
    supposed to persist.

    I think, but am not sure, that NFSv4 allows you to refresh filehandles, as
    this is needed for dataset migration. (But whether that guarantees that
    you get the same file back is not clear to me.)

    Casper
    --
    Expressed in this posting are my opinions. They are in no way related
    to opinions held by my employer, Sun Microsystems.
    Statements on Sun products included here are not gospel and may
    be fiction rather than truth.

  9. Re: Stale filehandles again


    In article <4357721a$0$11073$e4fe514c@news.xs4all.nl>,
    Casper H.S. Dik writes:
    |> nmm1@cus.cam.ac.uk (Nick Maclaren) writes:
    |>
    |> >One of my standard hobby-horses is to describe the appalling state of
    |> >checking, diagnostics and tracing in modern software, and to try to
    |> >get the message across to vendors that putting 10% of development into
    |> >such things SAVES money. I can witness this from experience, but have
    |> >not so far succeeded in persuading any of the veeps.
    |>
    |> Solaris DTrace seems to be extremely helpful in tracking the origin of
    |> any error return from the kernel to userland, including stale file
    |> handles.

    Unfortunately, NFS problems are very rarely that simple. Yes, I
    used dtrace in identifying one such problem (37331588 and 37341541,
    if you are interested), but it took over a week's work to locate
    it even as far as I did (see my messages of 12 Apr 2005 14:58:49,
    12 Apr 2005 20:46:44 and 13 Apr 2005 22:11:33 in 37331588, for
    example).

    dtrace is a useful tool, but is NOT a substitute for what I am
    saying any well engineered system should have built in. I know
    of no such Unix or Microsoft system :-(


    Regards,
    Nick Maclaren.

  10. Re: Stale filehandles again

    Nick Maclaren wrote:
    > Damian Menscher writes:
    > |> >
    > |> > When faced with this sort of problem, I reckon that I have a 30%
    > |> > chance of success after several weeks' work, and I am one of the
    > |> > few remaining people with experience in diagnosing such failures
    > |> > in hostile, alien software. I wish you the best of luck :-(
    > |>
    > |> I think I got lucky, thanks to a local admin who suggested I check
    > |> out the fsid option. That, combined with my suspicion that LVM was
    > |> breaking things, turned up lots of stuff on google, including this
    > |> explanation:
    > |> https://www.redhat.com/archives/linu.../msg00029.html
    >
    > Ah. I misunderstood. Yes, you should expect that message after
    > a server reboot (and not just on Linux). I was referring to the
    > case where you get that message and there is no reboot involved.


    Huh? As Casper said, NFS filehandles are supposed to persist across a
    reboot. This is not occurring on Linux due to the stupidity of using
    the block device major/minor numbers (which are dynamically assigned
    upon reboot for logical volumes, and so can change) rather than the
    filesystem UUID.

    Or am I not following what you are saying?
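The claim about device numbers changing is easy to check on the server. A minimal sketch, assuming Python is available there: snapshot the major/minor numbers behind each exported path before a reboot, snapshot again afterwards, and compare. The export paths in the example are hypothetical.

```python
import os


def device_numbers(paths):
    """Map each path to the (major, minor) of the device it lives on."""
    result = {}
    for path in paths:
        dev = os.stat(path).st_dev
        result[path] = (os.major(dev), os.minor(dev))
    return result


def changed(before, after):
    """Return the paths whose device numbers differ between two snapshots."""
    return sorted(
        path for path in before
        if path in after and tuple(before[path]) != tuple(after[path])
    )


if __name__ == "__main__":
    # Hypothetical snapshots taken before and after a reboot.
    before = {"/export/home": (253, 2), "/export/scratch": (253, 3)}
    after = {"/export/home": (253, 2), "/export/scratch": (253, 5)}
    print("device numbers changed for:", changed(before, after))
```

If the exports that went stale on the clients are exactly the ones reported here, that points at the major/minor explanation rather than at a client bug.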

    Damian Menscher
