Mysterious NFS-related issue
I have a mail server setup that works as follows:
- MX servers take a lockfile before writing to a message file. The
file is opened for read and write, and the server calls
lseek(fd, 0, SEEK_END) to position at the end before it starts
writing. During the write, it has to seek back to the point where
writing began in order to write a header sequence containing
information that is only known after the message data is written.
Only a small amount of data is written there (less than 100 bytes).
After the message file is written and closed, a request file is
written to another NFS location.
- Index management servers read these request files, and use them to
update an index to messages within the message file. The message file
is opened read-only to verify that the message data contains a valid
header, and the text message header data is read at this point to
update a separate header database file. No locking is done here as
the index server is the only one writing to the index file and
database files.
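
For anyone trying to reproduce this, here is a minimal sketch of the write and verify pattern described above, in Python using the same system calls. The header layout (HEADER_MAGIC, HEADER_FMT), the ".lock" naming, the O_EXCL lock and the function names are assumptions made up for illustration. The real header format and locking scheme are not shown in this post.

```python
import os
import struct
import zlib

HEADER_MAGIC = b"MSG1"                      # hypothetical marker
HEADER_FMT = "!4sII"                        # magic, body length, CRC32 of body
HEADER_SIZE = struct.calcsize(HEADER_FMT)   # 12 bytes, well under 100


def write_all(fd, data):
    """Call os.write until every byte of data has been written."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(fd, view)
        view = view[written:]


def append_message(path, body):
    """Append one message; return the offset where its header starts."""
    lock_path = path + ".lock"              # hypothetical lockfile name
    # Assumed lock scheme: O_EXCL makes creation fail if the lock exists.
    lock_fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    try:
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            start = os.lseek(fd, 0, os.SEEK_END)   # go to end of file
            write_all(fd, b"\0" * HEADER_SIZE)     # placeholder header
            write_all(fd, body)                    # message data
            header = struct.pack(HEADER_FMT, HEADER_MAGIC,
                                 len(body), zlib.crc32(body))
            os.lseek(fd, start, os.SEEK_SET)       # seek back to the start
            write_all(fd, header)                  # fill in the real header
        finally:
            os.close(fd)
    finally:
        os.close(lock_fd)
        os.unlink(lock_path)
    return start


def read_message(path, offset):
    """Index-server side: open read-only and verify the header."""
    fd = os.open(path, os.O_RDONLY)
    try:
        header = os.pread(fd, HEADER_SIZE, offset)
        if len(header) < HEADER_SIZE:
            raise ValueError("short header at offset %d" % offset)
        magic, length, crc = struct.unpack(HEADER_FMT, header)
        body = os.pread(fd, length, offset + HEADER_SIZE)
        if (magic != HEADER_MAGIC or len(body) != length
                or zlib.crc32(body) != crc):
            raise ValueError("invalid message at offset %d" % offset)
        return body
    finally:
        os.close(fd)
```

Note that between the placeholder write and the final header write, the header region of the new message really is all nulls on disk, so a reader looking at the file during that window would see an invalid header.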
The problem is that, on rare occasions, I come across a message that
is completely nulled out within the message file, yet the index and
header database files contain valid information for it. That means
the message was good when the index management server examined it,
but at some later point (e.g. when a new message came in) the data
of the previous message was overwritten entirely with nulls.
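
A rough way to find the affected messages is to walk the index and check whether each message's byte range is all nulls. The sketch below assumes the index can supply an (offset, length) pair per message. The function name is_nulled and its parameters are hypothetical, not part of the actual setup.

```python
import os


def is_nulled(path, offset, length, chunk_size=65536):
    """Return True if every byte in [offset, offset + length) is NUL."""
    if length <= 0:
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        pos = offset
        remaining = length
        while remaining > 0:
            chunk = os.pread(fd, min(chunk_size, remaining), pos)
            if not chunk:
                return False          # range runs past the end of the file
            if chunk != bytes(len(chunk)):
                return False          # found at least one non-NUL byte
            pos += len(chunk)
            remaining -= len(chunk)
        return True
    finally:
        os.close(fd)
```

Running this for every indexed message and logging the hits along with the file's modification time might help narrow down which later write coincides with the damage.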
There are over 100 MX servers running Linux 2.6, all mounting a
NetApp (Network Appliance) over NFS.
Has anyone else experienced similar NFS issues, or does anyone have
an idea how to avoid this problem?
Re: Mysterious NFS-related issue
Never mind... it looks like there were some NFS bugs in the 2.6.20
Linux kernel (that didn't exist in the 2.6.16 kernel).