On Wed, 28 Jun 2006, Ivan Voras wrote:

> Leo Huang wrote:
>> The result is followed:
>> OS          Clients  Result (queries per second)  TPS (got from iostat)
>> FreeBSD6.1  50       516.1                        about 2000

Seems normal for drives that do write caching.

>> Debian3.1   50       49.8                         about 200

Seems too slow for disks that do write caching. Maybe Debian does something
to force the drive to complete its i/o, or just does a full sync() like
someone mentioned Linux doing.

>> I know that MySQL uses fsync() to flush both the data and log files at

> I tried to see the effects from fsync() with this little program:
> #include <fcntl.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #define BUF_SIZE 512
> #define COUNT 50000
> int main() {
> int fd;
> char buf[BUF_SIZE];
> int i;
> fd = open("test.file", O_CREAT|O_TRUNC|O_WRONLY, 0600);
> if (fd < 0) {
> printf("cannot open\n");
> exit(1);
> }
> for (i = 0; i < COUNT; i++) {
> if (write(fd, buf, BUF_SIZE) != BUF_SIZE) {
> printf("error writing\n");
> exit(1);
> }

The results are much clearer with BUF_SIZE == 1 and COUNT <= fs_blocksize.
Then the file system keeps writing the same block and inode, and drives
with write caching are limited mainly by their software overhead. With
a program similar to the above, I get the following times on a 7200 RPM
ATA drive:

COUNT = fs_blocksize = 8192

to regular file in /tmp (mount options: none, no soft updates):
    7.76 seconds (iostat 500-3500 tps 4.5-7.7 KB/t)
to /dev/null on a devfs-free system:
    9.67 seconds (iostat 450-2200 tps 8.0 KB/t)
to /dev/ttyv on a devfs-free system:
    16.30 seconds (iostat 500-550 tps 8.0 KB/t) (yes, /dev/ttyv0 is slowest!)

Er, the results were clear. In a previous run, with different mount options
(-async and maybe -noatime) and COUNT = 1000, I got 4000+ tps 4.5 KB/t
consistently for the regular file. 4.5 is the average of 8 and 1 (which I
thought was 1 8K data block and 1 1K inode block, but now think was 1 1K
data block and 1 8K inode block). Changing COUNT back to 1000 now gives
a consistent 4.5 KB/t but only about 500 tps. The variation in the block
size is caused by 8192 being larger than 1000 -- the file usually consists
of 1-7 fragments, except that at the limits it is empty or 1 block.

fsync()ing /dev/null and /dev/ttyv1 is apparently slow because I (or
someone at my request) prematurely removed the hack for not syncing
file times for device files. IN_LAZYMOD was supposed to make the
hack unnecessary, but I never got around to making IN_LAZYMOD apply
more generally. In -current, it only applies to device files that are
not in devfs and on ffs without soft updates, but there are no such
files so it never applies. In my kernel, it applies to all files but
still only for atimes and not for soft updates.

It is strange that fsync()ing /dev/null is slower than fsync()ing
a regular file, and especially strange that fsync()ing /dev/ttyv1
is much slower than fsync()ing /dev/null. Both should be about twice
as fast since only 1 block needs to be written (an inode block).

> if (fsync(fd) < 0) {
> printf("error in fsync\n");
> exit(1);
> }
> }
> close(fd);
> unlink("test.file");
> return 0;
> }
> But I see strange results with iostat. It shows 16KB transactions, ~2900 tps
> and 46 MB/s. On the other hand, the program runs for ~36 seconds, which gives
> ~1390 tps (this is a single desktop drive). Since 36 seconds of 46MB/s would
> result in a file 1.6 GB in size, while it's clearly 50000*512=25MB, iostat is
> lying.

This is because you fsync() every 512 bytes. The file system then writes
a 16K inode block and a 16K data block, giving 64 times as much i/o as

> I think it's a too valuable tool to be lying. For what it's worth, gstat is
> also lying in the same way.

iostat and gstat just report whatever is recorded by devstat(9). The
recording is done at a fairly low level but not low enough to be
correct. Recorders lie mainly for block sizes larger than 64K. E.g.,
geom claims that all (?) disk devices can handle block sizes up to
MAXPHYS (128K) and then splits up i/o's into whatever sizes the disk
device drivers handle. Most disk device drivers claim to handle
DFLTPHYS (64K) whether or not the disk drive can handle that, and may
further split up the i/o as necessary. This makes it hard to see the
sizes that actually reach the hardware.

freebsd-fs@freebsd.org mailing list