Introduction : We have a C++ application on the RHEL 5.4 platform. It uses TCP/IP socket programming to send and receive messages through the socket write and read calls, and we are seeing intermittent run-time write failures.

Detailed Description : Consider a server and a client. The server writes messages using the write command and the client is supposed to read the same data.

The main code snippets on the server side are as follows [it is not feasible to post the actual files and application code as a whole]

sockfd = socket(AF_INET, SOCK_STREAM, 0);

int reuse=1;
setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR,& reuse, sizeof(reuse));

my_addr.sin_family = AF_INET; /* host byte order */
my_addr.sin_port = htons(port); /* short, network byte order */
my_addr.sin_addr.s_addr = INADDR_ANY; /* auto-fill with my IP */
bzero(&(my_addr.sin_zero), 8); /* zero the rest of the struct */
---------
int rc = 0;
while ((rc = bind(sockfd, (struct sockaddr *)&my_addr, sizeof(struct sockaddr))) == -1 && errno == EINTR);
---------
while ((rc =listen(sockfd, 1024)) == -1 && errno == EINTR);
---------
sin_size = sizeof(struct sockaddr_in);

while ((new_fd = accept(sockfd, (struct sockaddr *)&their_addr,(unsigned *) &sin_size)) == -1 && errno == EINTR);
---------

Message-writing code: here 2464 is the total size of the message. The loop is there to make sure the whole message gets written.
BOOL WSClient::writeMsg(WSMsg *msg, int fd, int *errorNumber)
{
    PCHAR data = NULL;
    int n = 0;
    int nbytes = 0;
    int size = sizeof(WSMsg);
    int retry = 10;

    data = new char [sizeof(WSMsg)];
    memcpy(data, msg, sizeof(WSMsg));

    if (fd != -1)
    {
        while (nbytes != size && retry--)
        {
            n = write(fd, &data[nbytes], (size - nbytes));

            if (n != size)
            {
                FO::std_out ("%IY %@","Bytes tried [%d] .. Bytes Sent [%d] pid [%ld]\n", (size - nbytes), n, getpid());
                ::fflush(stdout);
            }

            if (n >= 0)
                nbytes = nbytes + n;

            if (n == -1 || retry == 0)
            {
                if (errno != EINTR || retry == 0)
                {
                    if (errorNumber)
                        *errorNumber = errno;

                    delete [] data;
                    return FALSE;
                }

                sleep(1);
            }
        }
    }

    delete [] data;

    return TRUE;
}

----------------------
The main code lines on the client side:
new_fd = socket(AF_INET, SOCK_STREAM, 0);

their_addr.sin_family = AF_INET;
their_addr.sin_port = htons(port);
their_addr.sin_addr = *((struct in_addr *) he->h_addr);
bzero(&(their_addr.sin_zero), 8);

connect(new_fd, (struct sockaddr *)&their_addr, sizeof(struct sockaddr));

Code for reading data: here too a loop is used to read the full 2464 bytes, mirroring writeMsg on the server side.

BOOL WSClient::readMsg(WSMsg *msg)
{
    PCHAR data = NULL;
    int nbytes = 0;
    int size = sizeof(WSMsg);
    int n = 0;
    int retry = 10;

    data = new char[sizeof(WSMsg)];
    memset(data, NULL, sizeof(WSMsg));

    while (nbytes != size && retry--)
    {
        n = read (getDataReadFD(), &data[nbytes], (size - nbytes));
        if (n != size)
        {
            FO::std_out ("%IY %@","Bytes tried [%d] .. Bytes Received [%d] pid [%ld]\n",
                         (size - nbytes), n, getpid());
            fflush(stdout);
        }

        if (n >= 0)
            nbytes = nbytes + n;

        if (n == -1 || retry == 0)
        {
            if (errno != EINTR || retry == 0)
            {
                delete [] data;
                return FALSE;
            }

            sleep(1);
        }
    }

    memcpy(msg, data, sizeof(WSMsg));

    delete [] data;
    return TRUE;
}
With this setup, it works fine in normal cases, i.e. each message is assumed to be 2464 bytes in size; the write command writes the message and the read command in the client reads it in full.

The issue happens when we write a lot of messages [say 100 messages] in quick succession. The first set of messages [say the first 40 messages] is written successfully. When it then attempts to write the next message [say the 41st message], the message is only partially written [instead of the full 2464 bytes, only, say, 1800 bytes are written]. Since the loop in the writeMsg function keeps track of the number of bytes written, it invokes the write command again to write the rest of the data [2464 - 1800 = 664 bytes].
This is where the problem comes in. This second write fails with EAGAIN, so the write never completes. On the receiving side, the incomplete message is read, which in turn causes further issues.

In short, the issue appears to be on the write side: one write completes only partially on the server, the immediate attempt to write the rest of the data fails with EAGAIN, and hence the clients read incomplete information.

Please see the strace output of the server program [please note that it includes only the write calls; we have removed the rest of the calls for ease of presentation]
7187:write(19, "\1\0\0\0flight_leg\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 \0\0"..., 2464) = 2464
7188:write(19, "\1\0\0\0flight_leg\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 \0\0"..., 2464) = 2464
7189:write(19, "\1\0\0\0flight_leg\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 \0\0"..., 2464) = 2464
7194:write(19, "\1\0\0\0flight_leg\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 \0\0"..., 2464) = 2464
7195:write(19, "\1\0\0\0flight_leg\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 \0\0"..., 2464) = 2464
7196:write(19, "\1\0\0\0flight_leg\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 \0\0"..., 2464) = 2464
7197:write(19, "\1\0\0\0flight_leg\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 \0\0"..., 2464) = 2464
7198:write(19, "\1\0\0\0flight_leg\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 \0\0"..., 2464) = 2464 ---->Comment : Up to here, each write attempted 2464 bytes and wrote them in full.
7199:write(19, "\1\0\0\0flight_leg\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 \0\0"..., 2464) = 608 ---->Comment : The socket write was only partial [608 bytes out of 2464 bytes]. No error code was returned, only the number of bytes written.
7202:write(19, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 \0\0\0\0\0\0\0"..., 1856) = -1 EAGAIN (Resource temporarily unavailable) ---->Comment : When a partial write happens, the existing code tries again to write the pending data [here the write was attempted for 1856 bytes], but the write failed with EAGAIN. Getting EAGAIN during a burst of writes is usual [this happens for EK, IB in Solaris as well]. In that case we write the message to a backup store and retry. This is the normal behaviour.
7234:write(19, "\1\0\0\0flight_leg\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 \0\0"..., 2464) = -1 EAGAIN (Resource temporarily unavailable)
7265:write(19, "\1\0\0\0flight_leg\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 \0\0"..., 2464) = -1 EAGAIN (Resource temporarily unavailable)
7296:write(19, "\1\0\0\0flight_leg\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 \0\0"..., 2464) = -1 EAGAIN (Resource temporarily unavailable)
7327:write(19, "\1\0\0\0flight_leg\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 \0\0"..., 2464) = -1
7513:write(19, "\1\0\0\0flight_leg\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 \0\0"..., 2464) = -1 EAGAIN (Resource temporarily unavailable)
7544:write(19, "\1\0\0\0flight_leg\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 \0\0"..., 2464) = -1 EAGAIN (Resource temporarily unavailable)
7575:write(19, "\1\0\0\0flight_leg\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 \0\0"..., 2464) = 2464
7576:write(19, "\1\0\0\0flight_leg\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 \0\0"..., 2464) = 2464
7577:write(19, "\1\0\0\0flight_leg\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 \0\0"..., 2464) = 2464
7578:write(19, "\1\0\0\0flight_leg\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 \0\0"..., 2464) = 2464
7581:write(19, "\1\0\0\0flight_leg\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 \0\0"..., 2464) = 1600
7584:write(19, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 \0\0\0\0\0\0\0"..., 864) = -1 EAGAIN (Resource temporarily unavailable)
7616:write(19, "\1\0\0\0flight_leg\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 \0\0"..., 2464) = -1 EAGAIN (Resource temporarily unavailable)

Additional Information
We have checked the generic reasons for partial writes and could not find any that apply to our case. We checked PIPE_BUF, which is 4096 bytes.
We are writing only 2464 bytes, which is less than PIPE_BUF, so we assume the write should happen atomically.

Another important point is that we have a Solaris version of the same application. It has been running in various production environments for several years, and no such issue has been reported there.
We trussed the same scenario on Solaris and confirmed that a partial write followed by EAGAIN, as seen on Linux, never happens there.

Analysis conclusion: As per the strace output of the server program,
Point 1:
As shown in the strace output, a set of write calls successfully writes the whole data:
7197:write(19, "\1\0\0\0flight_leg\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 \0\0"..., 2464) = 2464
7198:write(19, "\1\0\0\0flight_leg\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 \0\0"..., 2464) = 2464
Up to here, each write attempted 2464 bytes and wrote them in full. No issue so far.

Point 2:
The next write call writes the data only partially:
7199:write(19, "\1\0\0\0flight_leg\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 \0\0"..., 2464) = 608
Here a non-atomic write happened [608 bytes out of 2464 bytes]. We would like to know the possible causes of such behaviour. In the Solaris environment, we could not find any such partial [non-atomic] write; there, a write either writes the data fully [2464 bytes] or writes nothing at all [0 bytes]. Under what circumstances can Linux produce such a non-atomic write? Is there any setting by which we can make the write always atomic?

Point 3:
The immediate attempt to write the rest of the data then failed with EAGAIN, as follows:
7202:write(19, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 \0\0\0\0\0\0\0"..., 1856) = -1 EAGAIN (Resource temporarily unavailable)
In the Solaris environment we also see this EAGAIN error when back-to-back writes happen in a burst; the difference is that in Solaris there is no earlier non-atomic [partial] write as in point 2.


Could anyone please help with how to proceed on this?

Regards
AMV