[PATCH] ext4: dir inode reservation V3 - Kernel
This is a discussion on [PATCH] ext4: dir inode reservation V3 - Kernel ; Basic idea of my dir inode reservation patch can be found here, http://lists.openwall.net/linux-ext4/2007/11/05/3 1, What does dir inode reservation do Dir inode reservation tries to reserve several inodes in inodes table for a directory when this directory is created. When ...
| | LinkBack | Tools |
|
#1
| |||
| |||
| http://lists.openwall.net/linux-ext4/2007/11/05/3 1, What does dir inode reservation do Dir inode reservation tries to reserve several inodes in inodes table for a directory when this directory is created. When create new file under this directory, try to allocate inode from the reserved inodes area. This is called as dir_ireserve inode allocator. 2, What does dir inode reservation help If we create huge number of directories, and create files in each directory alternatively (like some web proxy or mail server do), current linear inode allocator in ext[234] will introduce unnecessary hard disk accessing and seeking during creation/unlink, which brings out poor performance. Here is an example. The uppercase letters represent directories inode, and corresponded lowercase letters represent file inodes under the parent directory. When number of directories are much more than block groups number, some inodes of directories will be allocated in inodes table like this, <-- blk 0 --> A B C D E F H I When the files under each directory increase alternatively, some time latter (if each directory has 2 files), the layout of this inodes table will be, <-- blk 0 --> <-- blk 1 --> <-- blk 2 --> A B C D E F H I a b c d e f h i a b c d e f h i Now if these directories are unlinked recursively in hashed order, because every time when unlink a file from a directory, the inode of directory should also be updated to hard disk. If each I/O session can only write 1 block into hard disk (this is the most simple condition which will not happen in practice), the access sequence should be, 1 0 2 0 1 0 2 0 1 0 2 0 1 0 2 0 1 0 2 0 1 0 2 0 1 0 2 0 1 0 2 0 >From the above sequence, we can find in order to remove 8 directories and their files, 32 blocks should be written to hard disk. Dir inode reservation patches tries to improve unlink performance in above condition. In dir_ireserve inode allocator, directory inodes will be allocated like, <-- blk 0 --> <-- blk 1 --> ...<-- blk 6 --> <-- blk 7 --> A B H I If we create new files under each directory, the layout in inodes table will be, <-- blk 0 --> <-- blk 1 --> ...<-- blk 6 --> <-- blk 7 --> A a a B b b H h h I i i Because files inodes are very near to parent directory inode, when these files are removed recursively, directory inode and its file inodes can be updated to hard disk in one I/O session. If each I/O session can only write 1 block into hard disk (like we assumed as before), the access sequence should be, 0 1 2 3 4 5 6 7 By dir_ireserve inode allocator, 25 extra block I/O can be avoided. At the same time, because files inodes are near to parent directory inode, we also avoid the unnecessary hard disk seeking between files inode and directory inode. >From benchmark, I also find a helpful side effect from dir inode reservation -- it is 10%~30% faster to create these directories and files, faster result can be observed on more directories are created. The reason is same, when create new files the parent directory inode is updated to hard disk, too. Place files inodes and directory inode nearly can merge many redundant block I/O. 3, Full compatible to current ext[234] This is 3rd release (and 5th house keeping version)of my dir inode reservation patch. V1 is only for feasibility research, V2 is in compatible to ext[23], both require to patch e2fsprogs. V3 is only patch to kernel, no e2fsprogs modification. V3 has no any on-disk modification or file system data structure change, therefore it is full compatible to current ext[234]. 4, Dir inode reservation is optional Dir inode reservation is optional, you can use -o followed by one of these options to enable dir inode reservation during mount ext4 file system: dir_ireserve=low dir_ireserve=normal dir_ireserve=high Currently, 'low' reserves 15 file inodes for each directory, 'normal' reserves 31 inodes and 'high' reserves 127 inodes. Reserving more than 127 inodes does not help to performance obviously. 5, Performance number On a Core-Duo, 2MB DDM memory, 7200 RPM SATA PC, I built a 50GB ext4 partition, and tried to create 50000 directories, and create 15 (1KB) files in each directory alternatively. After a remount, I tried to remove all the directories and files recursively by a 'rm -rf'. Bellow is the benchmark result, normal ext4 ext4 with dir inode reservation mount options: -o data=writeback -o data=writeback,dir_ireserve=low Create dirs: real 0m49.101s real 2m59.703s Create files: real 24m17.962s real 21m8.161s Unlink all: real 24m43.788s real 17m29.862s Creating dirs with dir inode reservation is slower than normal ext4 as predicted, because allocating directory inodes in non-linear order will cause extra hard disk seeking and block I/O. Creating files with dir inode reservation is 13% faster than normal ext4. Unlink all the directories and files is 29.2% faster as expected. When number of directories is increased, the performance improvement will be more considerable. More benchmark result will be posted here if necessary, because I need more time to run more test cases. 6, Credits Andreas Dilger and Danial Phillips developed the original idea of inode reservation. Andreas introduced me this idea when I wanted to improve unlink performance on ext4. Theodore Tso noted me compatibility issue in V2 patch and brought out several useful advice on e2fsprogs patching. Mingming Cao reviewed V2 kernel patch and pointed out several issues on the patch e.g. lack of document, unacceptable patch to ext4 patch queue, and provided many other help on my work. Jan Kara helped to review V2 kernel patch, most of the code of V3 patch is developed based on Jan Kara's advice.Also thank Eric Sandeen, Chris Wedgwood and other friendly guys who discussed ext4 stuffs with me in irc. Thanks to make me enjoy open source development. 7, Feedback For any feedback, criticism, please email to coyli at suse dot de. I am very happy to hear from you and improve the dir inode reservation patch. Signed-off-by: Coly Li Cc: Andreas Dilger Cc: Mingming Cao --- fs/ext4/ialloc.c | 77 ++++++++++++++++++++++++++++++++++++++++++++ fs/ext4/super.c | 16 +++++++++ include/linux/ext4_fs.h | 8 ++++ include/linux/ext4_fs_sb.h | 2 + 4 files changed, 103 insertions(+), 0 deletions(-) diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c index 17b5df1..f838a72 100644 --- a/fs/ext4/ialloc.c +++ b/fs/ext4/ialloc.c @@ -10,6 +10,8 @@ * Stephen Tweedie (sct@redhat.com), 1993 * Big-endian to little-endian byte-swapping/bitmaps by * David S. Miller (davem@caip.rutgers.edu), 1995 + * Directory inodes reservation by + * Coly Li (coyli@suse.de), 2007 */ #include @@ -478,6 +480,75 @@ static int find_group_other(struct super_block *sb, struct inode *parent, return -1; } +static int ext4_ino_from_ireserve(handle_t *handle, struct inode *dir, + int mode, ext4_group_t *group, unsigned long *ino) +{ + struct super_block *sb; + struct ext4_sb_info *sbi; + struct ext4_group_desc *gdp = NULL; + struct buffer_head *gdp_bh = NULL, *bitmap_bh = NULL; + ext4_group_t ires_group = *group; + unsigned long ires_ino; + int i, bit; + + sb = dir->i_sb; + sbi = EXT4_SB(sb); + + /* if the inode number is not for directory, + * only try to allocate after directory's inode + */ + if (!S_ISDIR(mode)) { + *ino = dir->i_ino % EXT4_INODES_PER_GROUP(sb); + return 0; + } + + /* reserve inodes for new directory */ + for (i = 0; i < sbi->s_groups_count; i++) { + gdp = ext4_get_group_desc(sb, ires_group, &gdp_bh); + if (!gdp) + goto fail; + bit = 0; +try_same_group: + if (bit < EXT4_INODES_PER_GROUP(sb)) { + brelse(bitmap_bh); + bitmap_bh = read_inode_bitmap(sb, ires_group); + if (!bitmap_bh) + goto fail; + + BUFFER_TRACE(bitmap_bh, "get_write_access"); + if (ext4_journal_get_write_access( + handle, bitmap_bh) != 0) + goto fail; + if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, ires_group), + bit, bitmap_bh->b_data)) { + /* we won it */ + BUFFER_TRACE(bitmap_bh, + "call ext4_journal_dirty_metadata"); + if (ext4_journal_dirty_metadata(handle, + bitmap_bh) != 0) + goto fail; + ires_ino = bit; + goto find; + } + /* we lost it */ + jbd2_journal_release_buffer(handle, bitmap_bh); + bit += sbi->s_dir_ireserve_nr; + goto try_same_group; + } + if (++ires_group == sbi->s_groups_count) + ires_group = 0; + } + goto fail; +find: + brelse(bitmap_bh); + *group = ires_group; + *ino = ires_ino; + return 0; +fail: + brelse(bitmap_bh); + return -ENOSPC; +} + /* * There are two policies for allocating an inode. If the new inode is * a directory, then a forward search is made for a block group with both @@ -543,6 +614,12 @@ struct inode *ext4_new_inode(handle_t *handle, struct inode * dir, int mode) ino = 0; + if (test_opt(sb, DIR_IRESERVE)) { + err = ext4_ino_from_ireserve(handle, dir, + mode, &group, &ino); + if ((!err) && S_ISDIR(mode)) + goto got; + } repeat_in_this_group: ino = ext4_find_next_zero_bit((unsigned long *) bitmap_bh->b_data, EXT4_INODES_PER_GROUP(sb), ino); diff --git a/fs/ext4/super.c b/fs/ext4/super.c index b626339..9b873b7 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -884,6 +884,7 @@ enum { Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev, Opt_journal_checksum, Opt_journal_async_commit, Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback, + Opt_dir_ireserve_low, Opt_dir_ireserve_normal, Opt_dir_ireserve_high, Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota, Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota, Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota, @@ -929,6 +930,9 @@ static match_table_t tokens = { {Opt_data_journal, "data=journal"}, {Opt_data_ordered, "data=ordered"}, {Opt_data_writeback, "data=writeback"}, + {Opt_dir_ireserve_low, "dir_ireserve=low"}, + {Opt_dir_ireserve_normal, "dir_ireserve=normal"}, + {Opt_dir_ireserve_high, "dir_ireserve=high"}, {Opt_offusrjquota, "usrjquota="}, {Opt_usrjquota, "usrjquota=%s"}, {Opt_offgrpjquota, "grpjquota="}, @@ -1311,6 +1315,18 @@ clear_qf_name: return 0; sbi->s_stripe = option; break; + case Opt_dir_ireserve_low: + set_opt(sbi->s_mount_opt, DIR_IRESERVE); + sbi->s_dir_ireserve_nr = EXT4_DIR_IRESERVE_LOW; + break; + case Opt_dir_ireserve_normal: + set_opt(sbi->s_mount_opt, DIR_IRESERVE); + sbi->s_dir_ireserve_nr = EXT4_DIR_IRESERVE_NORMAL; + break; + case Opt_dir_ireserve_high: + set_opt(sbi->s_mount_opt, DIR_IRESERVE); + sbi->s_dir_ireserve_nr = EXT4_DIR_IRESERVE_HIGH; + break; default: printk (KERN_ERR "EXT4-fs: Unrecognized mount option \"%s\" " diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h index bcdb59d..88f5173 100644 --- a/include/linux/ext4_fs.h +++ b/include/linux/ext4_fs.h @@ -92,6 +92,13 @@ struct ext4_allocation_request { #define EXT4_GOOD_OLD_FIRST_INO 11 /* + * Macro-instructions used to reserve inodes for directories + */ +#define EXT4_DIR_IRESERVE_LOW 16 +#define EXT4_DIR_IRESERVE_NORMAL 64 +#define EXT4_DIR_IRESERVE_HIGH 128 + +/* * Maximal count of links to a file */ #define EXT4_LINK_MAX 65000 @@ -502,6 +509,7 @@ do { \ #define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT 0x1000000 /* Journal Async Commit */ #define EXT4_MOUNT_DELALLOC 0x2000000 /* Delalloc support */ #define EXT4_MOUNT_MBALLOC 0x4000000 /* Buddy allocation support */ +#define EXT4_MOUNT_DIR_IRESERVE 0x10000000/* dir inode reservation */ /* Compatibility, for having both ext2_fs.h and ext4_fs.h included at once */ #ifndef _LINUX_EXT2_FS_H #define clear_opt(o, opt) o &= ~EXT4_MOUNT_##opt diff --git a/include/linux/ext4_fs_sb.h b/include/linux/ext4_fs_sb.h index 744e746..790b0cf 100644 --- a/include/linux/ext4_fs_sb.h +++ b/include/linux/ext4_fs_sb.h @@ -147,6 +147,8 @@ struct ext4_sb_info { /* locality groups */ struct ext4_locality_group *s_locality_groups; + /* number of inodes we reserve in a directory */ + int s_dir_ireserve_nr; }; #define EXT4_GROUP_INFO(sb, group) \ EXT4_SB(sb)->s_group_info[(group) >> EXT4_DESC_PER_BLOCK_BITS(sb)] \ -- Coly Li SuSE PRC Labs - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |
|
#2
| |||
| |||
| hmm. so you trade 265% degradation of creation for 40% improvement of unlink? thanks, Alex Coly Li wrote: > normal ext4 ext4 with dir inode reservation > mount options: -o data=writeback -o data=writeback,dir_ireserve=low > Create dirs: real 0m49.101s real 2m59.703s > Create files: real 24m17.962s real 21m8.161s > Unlink all: real 24m43.788s real 17m29.862s > Creating dirs with dir inode reservation is slower than normal ext4 as predicted, because allocating > directory inodes in non-linear order will cause extra hard disk seeking and block I/O. Creating > files with dir inode reservation is 13% faster than normal ext4. Unlink all the directories and > files is 29.2% faster as expected. > When number of directories is increased, the performance improvement will be more considerable. More > benchmark result will be posted here if necessary, because I need more time to run more test cases. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |
|
#3
| |||
| |||
| Thanks for the feedback :-) Alex Tomas wrote: > hmm. so you trade 265% degradation of creation for 40% improvement of > unlink? > 265% degradation is only for creating 50000 empty directories. This is not a common case. There are 13% improvement on create 15 files in each directories. Total time on creating these directories and files are 25m6s VS. 24m86s, indeed, dir inode reservation is a little faster. Maybe most of the people will not create dozens of empty directories in their applications, therefore IMHO the 265% degradation is acceptable. If user really need to create so many empty directories, they also can mount the file system without dir inode reservation to get better performance. > thanks, Alex > > Coly Li wrote: >> normal ext4 ext4 with dir inode reservation >> mount options: -o data=writeback -o >> data=writeback,dir_ireserve=low >> Create dirs: real 0m49.101s real 2m59.703s >> Create files: real 24m17.962s real 21m8.161s >> Unlink all: real 24m43.788s real 17m29.862s >> Creating dirs with dir inode reservation is slower than normal ext4 as >> predicted, because allocating >> directory inodes in non-linear order will cause extra hard disk >> seeking and block I/O. Creating >> files with dir inode reservation is 13% faster than normal ext4. >> Unlink all the directories and >> files is 29.2% faster as expected. >> When number of directories is increased, the performance improvement >> will be more considerable. More >> benchmark result will be posted here if necessary, because I need more >> time to run more test cases. > -- Coly Li SuSE PRC Labs - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |
|
#4
| |||
| |||
| Coly Li wrote: > Thanks for the feedback :-) > > Alex Tomas wrote: >> hmm. so you trade 265% degradation of creation for 40% improvement of >> unlink? >> > 265% degradation is only for creating 50000 empty directories. This is not a common case. > There are 13% improvement on create 15 files in each directories. Total time on creating these > directories and files are 25m6s VS. 24m86s, indeed, dir inode reservation is a little faster. Sorry a typo here, it's 25m6s VS. 24m7.86s. > > Maybe most of the people will not create dozens of empty directories in their applications, > therefore IMHO the 265% degradation is acceptable. > > If user really need to create so many empty directories, they also can mount the file system without > dir inode reservation to get better performance. > >> thanks, Alex >> >> Coly Li wrote: >>> normal ext4 ext4 with dir inode reservation >>> mount options: -o data=writeback -o >>> data=writeback,dir_ireserve=low >>> Create dirs: real 0m49.101s real 2m59.703s >>> Create files: real 24m17.962s real 21m8.161s >>> Unlink all: real 24m43.788s real 17m29.862s >>> Creating dirs with dir inode reservation is slower than normal ext4 as >>> predicted, because allocating >>> directory inodes in non-linear order will cause extra hard disk >>> seeking and block I/O. Creating >>> files with dir inode reservation is 13% faster than normal ext4. >>> Unlink all the directories and >>> files is 29.2% faster as expected. >>> When number of directories is increased, the performance improvement >>> will be more considerable. More >>> benchmark result will be posted here if necessary, because I need more >>> time to run more test cases. > -- Coly Li SuSE PRC Labs - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |
|
#5
| |||
| |||
| On Tue, 2007-11-13 at 22:12 +0800, Coly Li wrote: > Basic idea of my dir inode reservation patch can be found here, > http://lists.openwall.net/linux-ext4/2007/11/05/3 > > 1, What does dir inode reservation do > Dir inode reservation tries to reserve several inodes in inodes table for a directory when this > directory is created. When create new file under this directory, try to allocate inode from the > reserved inodes area. This is called as dir_ireserve inode allocator. > Thanks for the update. Let me try to understand your method: So the basic idea is not do linear inode allocation for directory? Inode structure block for directory file is only coming from block 0, N, N +N,... where the number of skipped blocks N is stored in the in-core superblock structure. When ever need to allocate an inode for directory, skip N reserved bits (space for N*16 inodes) if the previous block is already allocated. That way place two directories with the hole of N*16 inodes structures, then allow files under the first directory stay closer with their parent directory. Is this correct? > 4, Dir inode reservation is optional > Dir inode reservation is optional, you can use -o followed by one of these options to enable dir > inode reservation during mount ext4 file system: > dir_ireserve=low > dir_ireserve=normal > dir_ireserve=high Would be nice to pass the tuning info low/normal/high(16/64/128 blocks correspondingly) via something else rather than mount options. > Currently, 'low' reserves 15 file inodes for each directory, 'normal' reserves 31 inodes and 'high' > reserves 127 inodes. Reserving more than 127 inodes does not help to performance obviously. > > > 5, Performance number > On a Core-Duo, 2MB DDM memory, 7200 RPM SATA PC, I built a 50GB ext4 partition, and tried to create > 50000 directories, and create 15 (1KB) files in each directory alternatively. After a remount, I > tried to remove all the directories and files recursively by a 'rm -rf'. Bellow is the benchmark result, > normal ext4 ext4 with dir inode reservation > mount options: -o data=writeback -o data=writeback,dir_ireserve=low > Create dirs: real 0m49.101s real 2m59.703s > Create files: real 24m17.962s real 21m8.161s > Unlink all: real 24m43.788s real 17m29.862s > Creating dirs with dir inode reservation is slower than normal ext4 as predicted, because allocating > directory inodes in non-linear order will cause extra hard disk seeking and block I/O. Hmm...I suspect there is bug in your patch, the extra seek should not contribute to 4 times slower > > #include > @@ -478,6 +480,75 @@ static int find_group_other(struct super_block *sb, struct inode *parent, > return -1; > } > > +static int ext4_ino_from_ireserve(handle_t *handle, struct inode *dir, > + int mode, ext4_group_t *group, unsigned long *ino) > +{ > + struct super_block *sb; > + struct ext4_sb_info *sbi; > + struct ext4_group_desc *gdp = NULL; > + struct buffer_head *gdp_bh = NULL, *bitmap_bh = NULL; > + ext4_group_t ires_group = *group; > + unsigned long ires_ino; > + int i, bit; > + > + sb = dir->i_sb; > + sbi = EXT4_SB(sb); > + > + /* if the inode number is not for directory, > + * only try to allocate after directory's inode > + */ > + if (!S_ISDIR(mode)) { > + *ino = dir->i_ino % EXT4_INODES_PER_GROUP(sb); > + return 0; > + } > + > + /* reserve inodes for new directory */ > + for (i = 0; i < sbi->s_groups_count; i++) { > + gdp = ext4_get_group_desc(sb, ires_group, &gdp_bh); > + if (!gdp) > + goto fail; > + bit = 0; > +try_same_group: > + if (bit < EXT4_INODES_PER_GROUP(sb)) { > + brelse(bitmap_bh); > + bitmap_bh = read_inode_bitmap(sb, ires_group); > + if (!bitmap_bh) > + goto fail; > + > + BUFFER_TRACE(bitmap_bh, "get_write_access"); > + if (ext4_journal_get_write_access( > + handle, bitmap_bh) != 0) > + goto fail; > + if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, ires_group), > + bit, bitmap_bh->b_data)) { > + /* we won it */ > + BUFFER_TRACE(bitmap_bh, > + "call ext4_journal_dirty_metadata"); > + if (ext4_journal_dirty_metadata(handle, > + bitmap_bh) != 0) > + goto fail; > + ires_ino = bit; > + goto find; > + } > + /* we lost it */ > + jbd2_journal_release_buffer(handle, bitmap_bh); > + bit += sbi->s_dir_ireserve_nr; > + goto try_same_group; > + } > + if (++ires_group == sbi->s_groups_count) > + ires_group = 0; > + } > + goto fail; > +find: > + brelse(bitmap_bh); > + *group = ires_group; > + *ino = ires_ino; > + return 0; > +fail: > + brelse(bitmap_bh); > + return -ENOSPC; > +} > + > /* > * There are two policies for allocating an inode. If the new inode is > * a directory, then a forward search is made for a block group with both > @@ -543,6 +614,12 @@ struct inode *ext4_new_inode(handle_t *handle, struct inode * dir, int mode) > > ino = 0; > > + if (test_opt(sb, DIR_IRESERVE)) { > + err = ext4_ino_from_ireserve(handle, dir, > + mode, &group, &ino); > + if ((!err) && S_ISDIR(mode)) > + goto got; > + } So you are calling ext4_ino_from_ireserve() inside a loop iterate all block groups? I think this is bug, it should move outside of the loop. > repeat_in_this_group: > ino = ext4_find_next_zero_bit((unsigned long *) > bitmap_bh->b_data, EXT4_INODES_PER_GROUP(sb), ino); > diff --git a/fs/ext4/super.c b/fs/ext4/super.c > index b626339..9b873b7 100644 > --- a/fs/ext4/super.c > +++ b/fs/ext4/super.c > @@ -884,6 +884,7 @@ enum { > Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev, > Opt_journal_checksum, Opt_journal_async_commit, > Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback, > + Opt_dir_ireserve_low, Opt_dir_ireserve_normal, Opt_dir_ireserve_high, > Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota, > Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota, > Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota, > @@ -929,6 +930,9 @@ static match_table_t tokens = { > {Opt_data_journal, "data=journal"}, > {Opt_data_ordered, "data=ordered"}, > {Opt_data_writeback, "data=writeback"}, > + {Opt_dir_ireserve_low, "dir_ireserve=low"}, > + {Opt_dir_ireserve_normal, "dir_ireserve=normal"}, > + {Opt_dir_ireserve_high, "dir_ireserve=high"}, > {Opt_offusrjquota, "usrjquota="}, > {Opt_usrjquota, "usrjquota=%s"}, > {Opt_offgrpjquota, "grpjquota="}, > @@ -1311,6 +1315,18 @@ clear_qf_name: > return 0; > sbi->s_stripe = option; > break; > + case Opt_dir_ireserve_low: > + set_opt(sbi->s_mount_opt, DIR_IRESERVE); > + sbi->s_dir_ireserve_nr = EXT4_DIR_IRESERVE_LOW; > + break; > + case Opt_dir_ireserve_normal: > + set_opt(sbi->s_mount_opt, DIR_IRESERVE); > + sbi->s_dir_ireserve_nr = EXT4_DIR_IRESERVE_NORMAL; > + break; > + case Opt_dir_ireserve_high: > + set_opt(sbi->s_mount_opt, DIR_IRESERVE); > + sbi->s_dir_ireserve_nr = EXT4_DIR_IRESERVE_HIGH; > + break; > default: > printk (KERN_ERR > "EXT4-fs: Unrecognized mount option \"%s\" " > diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h > index bcdb59d..88f5173 100644 > --- a/include/linux/ext4_fs.h > +++ b/include/linux/ext4_fs.h > @@ -92,6 +92,13 @@ struct ext4_allocation_request { > #define EXT4_GOOD_OLD_FIRST_INO 11 > > /* > + * Macro-instructions used to reserve inodes for directories > + */ > +#define EXT4_DIR_IRESERVE_LOW 16 > +#define EXT4_DIR_IRESERVE_NORMAL 64 > +#define EXT4_DIR_IRESERVE_HIGH 128 > + > +/* > * Maximal count of links to a file > */ > #define EXT4_LINK_MAX 65000 > @@ -502,6 +509,7 @@ do { \ > #define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT 0x1000000 /* Journal Async Commit */ > #define EXT4_MOUNT_DELALLOC 0x2000000 /* Delalloc support */ > #define EXT4_MOUNT_MBALLOC 0x4000000 /* Buddy allocation support */ > +#define EXT4_MOUNT_DIR_IRESERVE 0x10000000/* dir inode reservation */ > /* Compatibility, for having both ext2_fs.h and ext4_fs.h included at once */ > #ifndef _LINUX_EXT2_FS_H > #define clear_opt(o, opt) o &= ~EXT4_MOUNT_##opt > diff --git a/include/linux/ext4_fs_sb.h b/include/linux/ext4_fs_sb.h > index 744e746..790b0cf 100644 > --- a/include/linux/ext4_fs_sb.h > +++ b/include/linux/ext4_fs_sb.h > @@ -147,6 +147,8 @@ struct ext4_sb_info { > > /* locality groups */ > struct ext4_locality_group *s_locality_groups; > + /* number of inodes we reserve in a directory */ > + int s_dir_ireserve_nr; > }; > #define EXT4_GROUP_INFO(sb, group) \ > EXT4_SB(sb)->s_group_info[(group) >> EXT4_DESC_PER_BLOCK_BITS(sb)] \ > > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |
|
#6
| |||
| |||
| Thanks for the feedback :-) Mingming Cao wrote: > On Tue, 2007-11-13 at 22:12 +0800, Coly Li wrote: >> Basic idea of my dir inode reservation patch can be found here, >> http://lists.openwall.net/linux-ext4/2007/11/05/3 >> >> 1, What does dir inode reservation do >> Dir inode reservation tries to reserve several inodes in inodes table for a directory when this >> directory is created. When create new file under this directory, try to allocate inode from the >> reserved inodes area. This is called as dir_ireserve inode allocator. >> > Thanks for the update. > > Let me try to understand your method: > > So the basic idea is not do linear inode allocation for directory? Inode > structure block for directory file is only coming from block 0, N, N > +N,... where the number of skipped blocks N is stored in the in-core > superblock structure. N is not stored in in-core superblock. N = s_dir_ireserve_nr / inodes_per_block. What is stored in in-core superblock is number of inodes to be reserved for each directory. > > When ever need to allocate an inode for directory, skip N reserved bits > (space for N*16 inodes) if the previous block is already allocated. That > way place two directories with the hole of N*16 inodes structures, then > allow files under the first directory stay closer with their parent > directory. Is this correct? The hole is (s_dir_ireserve_nr - 1), not N * s_dir_ireserve_nr. Because directory inode will also use a inode slot from reserved area, reset slots number for files is (s_dir_ireserve_nr - 1). Except for the reserved inodes number, your understanding exactly matches my idea. > > >> 4, Dir inode reservation is optional >> Dir inode reservation is optional, you can use -o followed by one of these options to enable dir >> inode reservation during mount ext4 file system: >> dir_ireserve=low >> dir_ireserve=normal >> dir_ireserve=high > > Would be nice to pass the tuning info low/normal/high(16/64/128 blocks > correspondingly) via something else rather than mount options. Sure, I agree with you. Also I am thinking should this patch permit user to input reserved inodes number directly other than a low/normal/high. Also I am looking for methods to display the tuning info more convenient to users. > >> Currently, 'low' reserves 15 file inodes for each directory, 'normal' reserves 31 inodes and 'high' >> reserves 127 inodes. Reserving more than 127 inodes does not help to performance obviously. >> >> >> 5, Performance number >> On a Core-Duo, 2MB DDM memory, 7200 RPM SATA PC, I built a 50GB ext4 partition, and tried to create >> 50000 directories, and create 15 (1KB) files in each directory alternatively. After a remount, I >> tried to remove all the directories and files recursively by a 'rm -rf'. Bellow is the benchmark result, >> normal ext4 ext4 with dir inode reservation >> mount options: -o data=writeback -o data=writeback,dir_ireserve=low >> Create dirs: real 0m49.101s real 2m59.703s >> Create files: real 24m17.962s real 21m8.161s >> Unlink all: real 24m43.788s real 17m29.862s >> Creating dirs with dir inode reservation is slower than normal ext4 as predicted, because allocating >> directory inodes in non-linear order will cause extra hard disk seeking and block I/O. > > Hmm...I suspect there is bug in your patch, the extra seek should not > contribute to 4 times slower I agree with you :-) > >> #include >> @@ -478,6 +480,75 @@ static int find_group_other(struct super_block *sb, struct inode *parent, >> return -1; >> } >> >> +static int ext4_ino_from_ireserve(handle_t *handle, struct inode *dir, >> + int mode, ext4_group_t *group, unsigned long *ino) >> +{ >> + struct super_block *sb; >> + struct ext4_sb_info *sbi; >> + struct ext4_group_desc *gdp = NULL; >> + struct buffer_head *gdp_bh = NULL, *bitmap_bh = NULL; >> + ext4_group_t ires_group = *group; >> + unsigned long ires_ino; >> + int i, bit; >> + >> + sb = dir->i_sb; >> + sbi = EXT4_SB(sb); >> + >> + /* if the inode number is not for directory, >> + * only try to allocate after directory's inode >> + */ >> + if (!S_ISDIR(mode)) { >> + *ino = dir->i_ino % EXT4_INODES_PER_GROUP(sb); >> + return 0; >> + } >> + >> + /* reserve inodes for new directory */ >> + for (i = 0; i < sbi->s_groups_count; i++) { >> + gdp = ext4_get_group_desc(sb, ires_group, &gdp_bh); >> + if (!gdp) >> + goto fail; >> + bit = 0; >> +try_same_group: >> + if (bit < EXT4_INODES_PER_GROUP(sb)) { >> + brelse(bitmap_bh); >> + bitmap_bh = read_inode_bitmap(sb, ires_group); >> + if (!bitmap_bh) >> + goto fail; >> + >> + BUFFER_TRACE(bitmap_bh, "get_write_access"); >> + if (ext4_journal_get_write_access( >> + handle, bitmap_bh) != 0) >> + goto fail; >> + if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, ires_group), >> + bit, bitmap_bh->b_data)) { >> + /* we won it */ >> + BUFFER_TRACE(bitmap_bh, >> + "call ext4_journal_dirty_metadata"); >> + if (ext4_journal_dirty_metadata(handle, >> + bitmap_bh) != 0) >> + goto fail; >> + ires_ino = bit; >> + goto find; >> + } >> + /* we lost it */ >> + jbd2_journal_release_buffer(handle, bitmap_bh); >> + bit += sbi->s_dir_ireserve_nr; >> + goto try_same_group; >> + } >> + if (++ires_group == sbi->s_groups_count) >> + ires_group = 0; >> + } >> + goto fail; >> +find: >> + brelse(bitmap_bh); >> + *group = ires_group; >> + *ino = ires_ino; >> + return 0; >> +fail: >> + brelse(bitmap_bh); >> + return -ENOSPC; >> +} >> + >> /* >> * There are two policies for allocating an inode. If the new inode is >> * a directory, then a forward search is made for a block group with both >> @@ -543,6 +614,12 @@ struct inode *ext4_new_inode(handle_t *handle, struct inode * dir, int mode) >> >> ino = 0; >> >> + if (test_opt(sb, DIR_IRESERVE)) { >> + err = ext4_ino_from_ireserve(handle, dir, >> + mode, &group, &ino); >> + if ((!err) && S_ISDIR(mode)) >> + goto got; >> + } > > > So you are calling ext4_ino_from_ireserve() inside a loop iterate all > block groups? I think this is bug, it should move outside of the loop. > Hmm, sure, putting ext4_ino_from_ireserve() in this loop is redundant, it should be out of this loop. Neat eyes:-) So this is second bug in my patch. I will submit V4 patch and basic benchmark number before next internal conf-call. Thanks for your review, great help to me :-) >> repeat_in_this_group: >> ino = ext4_find_next_zero_bit((unsigned long *) >> bitmap_bh->b_data, EXT4_INODES_PER_GROUP(sb), ino); >> diff --git a/fs/ext4/super.c b/fs/ext4/super.c >> index b626339..9b873b7 100644 >> --- a/fs/ext4/super.c >> +++ b/fs/ext4/super.c >> @@ -884,6 +884,7 @@ enum { >> Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev, >> Opt_journal_checksum, Opt_journal_async_commit, >> Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback, >> + Opt_dir_ireserve_low, Opt_dir_ireserve_normal, Opt_dir_ireserve_high, >> Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota, >> Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota, >> Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota, >> @@ -929,6 +930,9 @@ static match_table_t tokens = { >> {Opt_data_journal, "data=journal"}, >> {Opt_data_ordered, "data=ordered"}, >> {Opt_data_writeback, "data=writeback"}, >> + {Opt_dir_ireserve_low, "dir_ireserve=low"}, >> + {Opt_dir_ireserve_normal, "dir_ireserve=normal"}, >> + {Opt_dir_ireserve_high, "dir_ireserve=high"}, >> {Opt_offusrjquota, "usrjquota="}, >> {Opt_usrjquota, "usrjquota=%s"}, >> {Opt_offgrpjquota, "grpjquota="}, >> @@ -1311,6 +1315,18 @@ clear_qf_name: >> return 0; >> sbi->s_stripe = option; >> break; >> + case Opt_dir_ireserve_low: >> + set_opt(sbi->s_mount_opt, DIR_IRESERVE); >> + sbi->s_dir_ireserve_nr = EXT4_DIR_IRESERVE_LOW; >> + break; >> + case Opt_dir_ireserve_normal: >> + set_opt(sbi->s_mount_opt, DIR_IRESERVE); >> + sbi->s_dir_ireserve_nr = EXT4_DIR_IRESERVE_NORMAL; >> + break; >> + case Opt_dir_ireserve_high: >> + set_opt(sbi->s_mount_opt, DIR_IRESERVE); >> + sbi->s_dir_ireserve_nr = EXT4_DIR_IRESERVE_HIGH; >> + break; >> default: >> printk (KERN_ERR >> "EXT4-fs: Unrecognized mount option \"%s\" " >> diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h >> index bcdb59d..88f5173 100644 >> --- a/include/linux/ext4_fs.h >> +++ b/include/linux/ext4_fs.h >> @@ -92,6 +92,13 @@ struct ext4_allocation_request { >> #define EXT4_GOOD_OLD_FIRST_INO 11 >> >> /* >> + * Macro-instructions used to reserve inodes for directories >> + */ >> +#define EXT4_DIR_IRESERVE_LOW 16 >> +#define EXT4_DIR_IRESERVE_NORMAL 64 >> +#define EXT4_DIR_IRESERVE_HIGH 128 >> + >> +/* >> * Maximal count of links to a file >> */ >> #define EXT4_LINK_MAX 65000 >> @@ -502,6 +509,7 @@ do { \ >> #define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT 0x1000000 /* Journal Async Commit */ >> #define EXT4_MOUNT_DELALLOC 0x2000000 /* Delalloc support */ >> #define EXT4_MOUNT_MBALLOC 0x4000000 /* Buddy allocation support */ >> +#define EXT4_MOUNT_DIR_IRESERVE 0x10000000/* dir inode reservation */ >> /* Compatibility, for having both ext2_fs.h and ext4_fs.h included at once */ >> #ifndef _LINUX_EXT2_FS_H >> #define clear_opt(o, opt) o &= ~EXT4_MOUNT_##opt >> diff --git a/include/linux/ext4_fs_sb.h b/include/linux/ext4_fs_sb.h >> index 744e746..790b0cf 100644 >> --- a/include/linux/ext4_fs_sb.h >> +++ b/include/linux/ext4_fs_sb.h >> @@ -147,6 +147,8 @@ struct ext4_sb_info { >> >> /* locality groups */ >> struct ext4_locality_group *s_locality_groups; >> + /* number of inodes we reserve in a directory */ >> + int s_dir_ireserve_nr; >> }; >> #define EXT4_GROUP_INFO(sb, group) \ >> EXT4_SB(sb)->s_group_info[(group) >> EXT4_DESC_PER_BLOCK_BITS(sb)] \ >> >> > -- Coly Li SuSE PRC Labs - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |
|
#7
| |||
| |||
| Hi Coly, finally I've found some time to have a look at a new version of your patch. > 5, Performance number > On a Core-Duo, 2MB DDM memory, 7200 RPM SATA PC, I built a 50GB ext4 > partition, and tried to create 50000 directories, and create 15 (1KB) > files in each directory alternatively. After a remount, I tried to > remove all the directories and files recursively by a 'rm -rf'. Bellow > is the benchmark result, > normal ext4 ext4 with dir inode reservation > mount options: -o data=writeback -o data=writeback,dir_ireserve=low > Create dirs: real 0m49.101s real 2m59.703s > Create files: real 24m17.962s real 21m8.161s > Unlink all: real 24m43.788s real 17m29.862s > Creating dirs with dir inode reservation is slower than normal ext4 as > predicted, because allocating directory inodes in non-linear order > will cause extra hard disk seeking and block I/O. Creating files with > dir inode reservation is 13% faster than normal ext4. Unlink all the > directories and files is 29.2% faster as expected. When number of > directories is increased, the performance improvement will be more > considerable. More benchmark result will be posted here if necessary, > because I need more time to run more test cases. Maybe to get some better idea - could you compare how long does take untar of a kernel tree, find through the whole kernel tree and removal of the tree? Also measuring CPU load during your benchmarks would be useful so that we can see whether we don't increase too much the cost of searching in bitmaps... The patch is nicely short ![]() > diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c > index 17b5df1..f838a72 100644 > --- a/fs/ext4/ialloc.c > +++ b/fs/ext4/ialloc.c > @@ -10,6 +10,8 @@ > * Stephen Tweedie (sct@redhat.com), 1993 > * Big-endian to little-endian byte-swapping/bitmaps by > * David S. Miller (davem@caip.rutgers.edu), 1995 > + * Directory inodes reservation by > + * Coly Li (coyli@suse.de), 2007 > */ > > #include > @@ -478,6 +480,75 @@ static int find_group_other(struct super_block *sb, struct inode *parent, > return -1; > } > > +static int ext4_ino_from_ireserve(handle_t *handle, struct inode *dir, > + int mode, ext4_group_t *group, unsigned long *ino) > +{ > + struct super_block *sb; > + struct ext4_sb_info *sbi; > + struct ext4_group_desc *gdp = NULL; > + struct buffer_head *gdp_bh = NULL, *bitmap_bh = NULL; > + ext4_group_t ires_group = *group; > + unsigned long ires_ino; > + int i, bit; > + > + sb = dir->i_sb; > + sbi = EXT4_SB(sb); > + > + /* if the inode number is not for directory, > + * only try to allocate after directory's inode > + */ > + if (!S_ISDIR(mode)) { > + *ino = dir->i_ino % EXT4_INODES_PER_GROUP(sb); > + return 0; > + } ^^^ You don't set a group here - is this intentional? It means we get the group from find_group_other() and there we start searching at a place corresponding to parent's inode number... That would be an interesting heuristic but I'm not sure if that's what you want. > + > + /* reserve inodes for new directory */ > + for (i = 0; i < sbi->s_groups_count; i++) { > + gdp = ext4_get_group_desc(sb, ires_group, &gdp_bh); > + if (!gdp) > + goto fail; > + bit = 0; > +try_same_group: > + if (bit < EXT4_INODES_PER_GROUP(sb)) { > + brelse(bitmap_bh); > + bitmap_bh = read_inode_bitmap(sb, ires_group); > + if (!bitmap_bh) > + goto fail; > + > + BUFFER_TRACE(bitmap_bh, "get_write_access"); > + if (ext4_journal_get_write_access( > + handle, bitmap_bh) != 0) > + goto fail; > + if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, ires_group), > + bit, bitmap_bh->b_data)) { > + /* we won it */ > + BUFFER_TRACE(bitmap_bh, > + "call ext4_journal_dirty_metadata"); > + if (ext4_journal_dirty_metadata(handle, > + bitmap_bh) != 0) > + goto fail; > + ires_ino = bit; > + goto find; > + } > + /* we lost it */ > + jbd2_journal_release_buffer(handle, bitmap_bh); > + bit += sbi->s_dir_ireserve_nr; > + goto try_same_group; > + } So this above is just a while loop coded with goto... While loop would be IMO better. Honza -- Jan Kara SuSE CR Labs - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |
|
#8
| |||
| |||
| Jan, Thanks for taking time to review the patch :-) Jan Kara wrote: > Hi Coly, > > finally I've found some time to have a look at a new version of your > patch. > >> 5, Performance number >> On a Core-Duo, 2MB DDM memory, 7200 RPM SATA PC, I built a 50GB ext4 >> partition, and tried to create 50000 directories, and create 15 (1KB) >> files in each directory alternatively. After a remount, I tried to >> remove all the directories and files recursively by a 'rm -rf'. Bellow >> is the benchmark result, >> normal ext4 ext4 with dir inode reservation >> mount options: -o data=writeback -o data=writeback,dir_ireserve=low >> Create dirs: real 0m49.101s real 2m59.703s >> Create files: real 24m17.962s real 21m8.161s >> Unlink all: real 24m43.788s real 17m29.862s >> Creating dirs with dir inode reservation is slower than normal ext4 as >> predicted, because allocating directory inodes in non-linear order >> will cause extra hard disk seeking and block I/O. Creating files with >> dir inode reservation is 13% faster than normal ext4. Unlink all the >> directories and files is 29.2% faster as expected. When number of >> directories is increased, the performance improvement will be more >> considerable. More benchmark result will be posted here if necessary, >> because I need more time to run more test cases. > Maybe to get some better idea - could you compare how long does take > untar of a kernel tree, find through the whole kernel tree and removal > of the tree? Also measuring CPU load during your benchmarks would be > useful so that we can see whether we don't increase too much the cost of > searching in bitmaps... Sure, good ideas, I will add these items to my benchmark list. > > The patch is nicely short ![]() Thanks for the encouragement. The first version was more than 2000 lines, including kernel and e2fsprogs patches. Finally I find only around 100 lines kernel patch can achieve same performance too. Really nice :-) > >> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c >> index 17b5df1..f838a72 100644 >> --- a/fs/ext4/ialloc.c >> +++ b/fs/ext4/ialloc.c >> @@ -10,6 +10,8 @@ >> * Stephen Tweedie (sct@redhat.com), 1993 >> * Big-endian to little-endian byte-swapping/bitmaps by >> * David S. Miller (davem@caip.rutgers.edu), 1995 >> + * Directory inodes reservation by >> + * Coly Li (coyli@suse.de), 2007 >> */ >> >> #include >> @@ -478,6 +480,75 @@ static int find_group_other(struct super_block *sb, struct inode *parent, >> return -1; >> } >> >> +static int ext4_ino_from_ireserve(handle_t *handle, struct inode *dir, >> + int mode, ext4_group_t *group, unsigned long *ino) >> +{ >> + struct super_block *sb; >> + struct ext4_sb_info *sbi; >> + struct ext4_group_desc *gdp = NULL; >> + struct buffer_head *gdp_bh = NULL, *bitmap_bh = NULL; >> + ext4_group_t ires_group = *group; >> + unsigned long ires_ino; >> + int i, bit; >> + >> + sb = dir->i_sb; >> + sbi = EXT4_SB(sb); >> + >> + /* if the inode number is not for directory, >> + * only try to allocate after directory's inode >> + */ >> + if (!S_ISDIR(mode)) { >> + *ino = dir->i_ino % EXT4_INODES_PER_GROUP(sb); >> + return 0; >> + } > ^^^ You don't set a group here - is this intentional? It means we get > the group from find_group_other() and there we start searching at a > place corresponding to parent's inode number... That would be an > interesting heuristic but I'm not sure if that's what you want. Yes, if allocating a file inode, just return first inode offset in the reserved area of parent directory. In this case, group is decided by find_group_other() or find_group_orlov(), ext4_ino_from_ireserve() just tries to persuade linear inode allocator to search free inode slot after parent's inode. > >> + >> + /* reserve inodes for new directory */ >> + for (i = 0; i < sbi->s_groups_count; i++) { >> + gdp = ext4_get_group_desc(sb, ires_group, &gdp_bh); >> + if (!gdp) >> + goto fail; >> + bit = 0; >> +try_same_group: >> + if (bit < EXT4_INODES_PER_GROUP(sb)) { >> + brelse(bitmap_bh); >> + bitmap_bh = read_inode_bitmap(sb, ires_group); >> + if (!bitmap_bh) >> + goto fail; >> + >> + BUFFER_TRACE(bitmap_bh, "get_write_access"); >> + if (ext4_journal_get_write_access( >> + handle, bitmap_bh) != 0) >> + goto fail; >> + if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, ires_group), >> + bit, bitmap_bh->b_data)) { >> + /* we won it */ >> + BUFFER_TRACE(bitmap_bh, >> + "call ext4_journal_dirty_metadata"); >> + if (ext4_journal_dirty_metadata(handle, >> + bitmap_bh) != 0) >> + goto fail; >> + ires_ino = bit; >> + goto find; >> + } >> + /* we lost it */ >> + jbd2_journal_release_buffer(handle, bitmap_bh); >> + bit += sbi->s_dir_ireserve_nr; >> + goto try_same_group; >> + } > So this above is just a while loop coded with goto... While loop > would be IMO better. The only reason for me to use a goto, is 80 column limitation BTW, goto does not hurt performanceand readability here. IMHO, it's acceptable :-) > > Honza Today I finished V4 patch, which fixs a bug and provides new mount option to configure reserved inodes number. It can be faster on creating empty directories, I will release it once I finished basic benchmark. Your advice on benchmark is valuable, I will post the benchmark result next week. Best regards. -- Coly Li SuSE PRC Labs - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |
|
#9
| |||
| |||
| On Wed 21-11-07 00:40:17, Coly Li wrote: > Jan Kara wrote: > >> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c > >> index 17b5df1..f838a72 100644 > >> --- a/fs/ext4/ialloc.c > >> +++ b/fs/ext4/ialloc.c > >> @@ -10,6 +10,8 @@ > >> * Stephen Tweedie (sct@redhat.com), 1993 > >> * Big-endian to little-endian byte-swapping/bitmaps by > >> * David S. Miller (davem@caip.rutgers.edu), 1995 > >> + * Directory inodes reservation by > >> + * Coly Li (coyli@suse.de), 2007 > >> */ > >> > >> #include > >> @@ -478,6 +480,75 @@ static int find_group_other(struct super_block *sb, struct inode *parent, > >> return -1; > >> } > >> > >> +static int ext4_ino_from_ireserve(handle_t *handle, struct inode *dir, > >> + int mode, ext4_group_t *group, unsigned long *ino) > >> +{ > >> + struct super_block *sb; > >> + struct ext4_sb_info *sbi; > >> + struct ext4_group_desc *gdp = NULL; > >> + struct buffer_head *gdp_bh = NULL, *bitmap_bh = NULL; > >> + ext4_group_t ires_group = *group; > >> + unsigned long ires_ino; > >> + int i, bit; > >> + > >> + sb = dir->i_sb; > >> + sbi = EXT4_SB(sb); > >> + > >> + /* if the inode number is not for directory, > >> + * only try to allocate after directory's inode > >> + */ > >> + if (!S_ISDIR(mode)) { > >> + *ino = dir->i_ino % EXT4_INODES_PER_GROUP(sb); > >> + return 0; > >> + } > > ^^^ You don't set a group here - is this intentional? It means we get > > the group from find_group_other() and there we start searching at a > > place corresponding to parent's inode number... That would be an > > interesting heuristic but I'm not sure if that's what you want. > Yes, if allocating a file inode, just return first inode offset in the reserved area of parent > directory. In this case, group is decided by find_group_other() or find_group_orlov(), > ext4_ino_from_ireserve() just tries to persuade linear inode allocator to search free inode slot > after parent's inode. But what I mean is: Parent directory is in group 1, with inode number 10, now find_group_other will set group to 2 and you set inode number to 10 so linear allocator will start searching in group 2, inode number 10 which is *not* just after directory inode.... > >> + > >> + /* reserve inodes for new directory */ > >> + for (i = 0; i < sbi->s_groups_count; i++) { > >> + gdp = ext4_get_group_desc(sb, ires_group, &gdp_bh); > >> + if (!gdp) > >> + goto fail; > >> + bit = 0; > >> +try_same_group: > >> + if (bit < EXT4_INODES_PER_GROUP(sb)) { > >> + brelse(bitmap_bh); > >> + bitmap_bh = read_inode_bitmap(sb, ires_group); > >> + if (!bitmap_bh) > >> + goto fail; > >> + > >> + BUFFER_TRACE(bitmap_bh, "get_write_access"); > >> + if (ext4_journal_get_write_access( > >> + handle, bitmap_bh) != 0) > >> + goto fail; > >> + if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, ires_group), > >> + bit, bitmap_bh->b_data)) { > >> + /* we won it */ > >> + BUFFER_TRACE(bitmap_bh, > >> + "call ext4_journal_dirty_metadata"); > >> + if (ext4_journal_dirty_metadata(handle, > >> + bitmap_bh) != 0) > >> + goto fail; > >> + ires_ino = bit; > >> + goto find; > >> + } > >> + /* we lost it */ > >> + jbd2_journal_release_buffer(handle, bitmap_bh); > >> + bit += sbi->s_dir_ireserve_nr; > >> + goto try_same_group; > >> + } > > So this above is just a while loop coded with goto... While loop > > would be IMO better. > > The only reason for me to use a goto, is 80 column limitation BTW,> goto does not hurt performance and readability here. IMHO, it's > acceptable :-) But you could just remove goto try_same_group; and change 'if' to 'while'. Honza -- Jan Kara SUSE Labs, CR - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |
|
#10
| |||
| |||
| On Tue, 2007-11-20 at 12:14 +0800, Coly Li wrote: > Thanks for the feedback :-) > > Mingming Cao wrote: > > On Tue, 2007-11-13 at 22:12 +0800, Coly Li wrote: > >> Basic idea of my dir inode reservation patch can be found here, > >> http://lists.openwall.net/linux-ext4/2007/11/05/3 > >> > >> 1, What does dir inode reservation do > >> Dir inode reservation tries to reserve several inodes in inodes table for a directory when this > >> directory is created. When create new file under this directory, try to allocate inode from the > >> reserved inodes area. This is called as dir_ireserve inode allocator. > >> > > Thanks for the update. > > > > Let me try to understand your method: > > > > So the basic idea is not do linear inode allocation for directory? Inode > > structure block for directory file is only coming from block 0, N, N > > +N,... where the number of skipped blocks N is stored in the in-core > > superblock structure. > > N is not stored in in-core superblock. N = s_dir_ireserve_nr / inodes_per_block. What is stored in > in-core superblock is number of inodes to be reserved for each directory. > > > > > When ever need to allocate an inode for directory, skip N reserved bits > > (space for N*16 inodes) if the previous block is already allocated. That > > way place two directories with the hole of N*16 inodes structures, then > > allow files under the first directory stay closer with their parent > > directory. Is this correct? > > The hole is (s_dir_ireserve_nr - 1), not N * s_dir_ireserve_nr. Because directory inode will also > use a inode slot from reserved area, reset slots number for files is (s_dir_ireserve_nr - 1). > Except for the reserved inodes number, your understanding exactly matches my idea. > Ok, thanks for clarification. The performance gain on large number of directories looks interesting, but I am afraid this makes the uninitialized block group feature less useful (to speed up fsck) than before The reserved inode will causethe unused inode watermark higher than before, and spread the directories over to other block groups earlier than before. Maybe it worth it...not sure. > > > > > >> 4, Dir inode reservation is optional > >> Dir inode reservation is optional, you can use -o followed by one of these options to enable dir > >> inode reservation during mount ext4 file system: > >> dir_ireserve=low > >> dir_ireserve=normal > >> dir_ireserve=high > > > > Would be nice to pass the tuning info low/normal/high(16/64/128 blocks > > correspondingly) via something else rather than mount options. > > Sure, I agree with you. Also I am thinking should this patch permit user to input reserved inodes > number directly other than a low/normal/high. Also I am looking for methods to display the tuning > info more convenient to users. > export/tune through /procfs? > > > >> Currently, 'low' reserves 15 file inodes for each directory, 'normal' reserves 31 inodes and 'high' > >> reserves 127 inodes. Reserving more than 127 inodes does not help to performance obviously. > >> > >> > >> 5, Performance number > >> On a Core-Duo, 2MB DDM memory, 7200 RPM SATA PC, I built a 50GB ext4 partition, and tried to create > >> 50000 directories, and create 15 (1KB) files in each directory alternatively. After a remount, I > >> tried to remove all the directories and files recursively by a 'rm -rf'. Bellow is the benchmark result, > >> normal ext4 ext4 with dir inode reservation > >> mount options: -o data=writeback -o data=writeback,dir_ireserve=low > >> Create dirs: real 0m49.101s real 2m59.703s > >> Create files: real 24m17.962s real 21m8.161s > >> Unlink all: real 24m43.788s real 17m29.862s > >> Creating dirs with dir inode reservation is slower than normal ext4 as predicted, because allocating > >> directory inodes in non-linear order will cause extra hard disk seeking and block I/O. > > > > Hmm...I suspect there is bug in your patch, the extra seek should not > > contribute to 4 times slower > > I agree with you :-) > > > > >> #include > >> @@ -478,6 +480,75 @@ static int find_group_other(struct super_block *sb, struct inode *parent, > >> return -1; > >> } > >> > >> +static int ext4_ino_from_ireserve(handle_t *handle, struct inode *dir, > >> + int mode, ext4_group_t *group, unsigned long *ino) > >> +{ > >> + struct super_block *sb; > >> + struct ext4_sb_info *sbi; > >> + struct ext4_group_desc *gdp = NULL; > >> + struct buffer_head *gdp_bh = NULL, *bitmap_bh = NULL; > >> + ext4_group_t ires_group = *group; > >> + unsigned long ires_ino; > >> + int i, bit; > >> + > >> + sb = dir->i_sb; > >> + sbi = EXT4_SB(sb); > >> + > >> + /* if the inode number is not for directory, > >> + * only try to allocate after directory's inode > >> + */ > >> + if (!S_ISDIR(mode)) { > >> + *ino = dir->i_ino % EXT4_INODES_PER_GROUP(sb); > >> + return 0; > >> + } > >> + > >> + /* reserve inodes for new directory */ > >> + for (i = 0; i < sbi->s_groups_count; i++) { > >> + gdp = ext4_get_group_desc(sb, ires_group, &gdp_bh); > >> + if (!gdp) > >> + goto fail; > >> + bit = 0; It would be nice to remember the last allocated bit for each block group, so we don't have to start from bit 0 (in the worse case scan the entire block group) for every single directory. Probably this can be saved in the in-core block group descriptor. Right now the in core block group descriptor structure is the same on-disk block-group structure though, it might worth to separate them and provide load/store helper functions. > >> +try_same_group: > >> + if (bit < EXT4_INODES_PER_GROUP(sb)) { > >> + brelse(bitmap_bh); > >> + bitmap_bh = read_inode_bitmap(sb, ires_group); > >> + if (!bitmap_bh) > >> + goto fail; > >> + > >> + BUFFER_TRACE(bitmap_bh, "get_write_access"); > >> + if (ext4_journal_get_write_access( > >> + handle, bitmap_bh) != 0) > >> + goto fail; > >> + if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, ires_group), > >> + bit, bitmap_bh->b_data)) { > >> + /* we won it */ > >> + BUFFER_TRACE(bitmap_bh, > >> + "call ext4_journal_dirty_metadata"); > >> + if (ext4_journal_dirty_metadata(handle, > >> + bitmap_bh) != 0) > >> + goto fail; > >> + ires_ino = bit; > >> + goto find; > >> + } > >> + /* we lost it */ > >> + jbd2_journal_release_buffer(handle, bitmap_bh); > >> + bit += sbi->s_dir_ireserve_nr; > >> + goto try_same_group; > >> + } > >> + if (++ires_group == sbi->s_groups_count) > >> + ires_group = 0; > >> + } > >> + goto fail; > >> +find: > >> + brelse(bitmap_bh); > >> + *group = ires_group; > >> + *ino = ires_ino; > >> + return 0; > >> +fail: > >> + brelse(bitmap_bh); > >> + return -ENOSPC; > >> +} > >> + > >> /* > >> * There are two policies for allocating an inode. If the new inode is > >> * a directory, then a forward search is made for a block group with both > >> @@ -543,6 +614,12 @@ struct inode *ext4_new_inode(handle_t *handle, struct inode * dir, int mode) > >> > >> ino = 0; > >> > >> + if (test_opt(sb, DIR_IRESERVE)) { > >> + err = ext4_ino_from_ireserve(handle, dir, > >> + mode, &group, &ino); > >> + if ((!err) && S_ISDIR(mode)) > >> + goto got; > >> + } > > > > > > So you are calling ext4_ino_from_ireserve() inside a loop iterate all > > block groups? I think this is bug, it should move outside of the loop. > > > > Hmm, sure, putting ext4_ino_from_ireserve() in this loop is redundant, it should be out of this > loop. Neat eyes:-) So this is second bug in my patch. > > I will submit V4 patch and basic benchmark number before next internal conf-call. Thanks for your > review, great help to me :-) > > > >> repeat_in_this_group: > >> ino = ext4_find_next_zero_bit((unsigned long *) > >> bitmap_bh->b_data, EXT4_INODES_PER_GROUP(sb), ino); > >> diff --git a/fs/ext4/super.c b/fs/ext4/super.c > >> index b626339..9b873b7 100644 > >> --- a/fs/ext4/super.c > >> +++ b/fs/ext4/super.c > >> @@ -884,6 +884,7 @@ enum { > >> Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev, > >> Opt_journal_checksum, Opt_journal_async_commit, > >> Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback, > >> + Opt_dir_ireserve_low, Opt_dir_ireserve_normal, Opt_dir_ireserve_high, > >> Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota, > >> Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota, > >> Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota, > >> @@ -929,6 +930,9 @@ static match_table_t tokens = { > >> {Opt_data_journal, "data=journal"}, > >> {Opt_data_ordered, "data=ordered"}, > >> {Opt_data_writeback, "data=writeback"}, > >> + {Opt_dir_ireserve_low, "dir_ireserve=low"}, > >> + {Opt_dir_ireserve_normal, "dir_ireserve=normal"}, > >> + {Opt_dir_ireserve_high, "dir_ireserve=high"}, > >> {Opt_offusrjquota, "usrjquota="}, > >> {Opt_usrjquota, "usrjquota=%s"}, > >> {Opt_offgrpjquota, "grpjquota="}, > >> @@ -1311,6 +1315,18 @@ clear_qf_name: > >> return 0; > >> sbi->s_stripe = option; > >> break; > >> + case Opt_dir_ireserve_low: > >> + set_opt(sbi->s_mount_opt, DIR_IRESERVE); > >> + sbi->s_dir_ireserve_nr = EXT4_DIR_IRESERVE_LOW; > >> + break; > >> + case Opt_dir_ireserve_normal: > >> + set_opt(sbi->s_mount_opt, DIR_IRESERVE); > >> + sbi->s_dir_ireserve_nr = EXT4_DIR_IRESERVE_NORMAL; > >> + break; > >> + case Opt_dir_ireserve_high: > >> + set_opt(sbi->s_mount_opt, DIR_IRESERVE); > >> + sbi->s_dir_ireserve_nr = EXT4_DIR_IRESERVE_HIGH; > >> + break; > >> default: > >> printk (KERN_ERR > >> "EXT4-fs: Unrecognized mount option \"%s\" " > >> diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h > >> index bcdb59d..88f5173 100644 > >> --- a/include/linux/ext4_fs.h > >> +++ b/include/linux/ext4_fs.h > >> @@ -92,6 +92,13 @@ struct ext4_allocation_request { > >> #define EXT4_GOOD_OLD_FIRST_INO 11 > >> > >> /* > >> + * Macro-instructions used to reserve inodes for directories > >> + */ > >> +#define EXT4_DIR_IRESERVE_LOW 16 > >> +#define EXT4_DIR_IRESERVE_NORMAL 64 > >> +#define EXT4_DIR_IRESERVE_HIGH 128 > >> + > >> +/* > >> * Maximal count of links to a file > >> */ > >> #define EXT4_LINK_MAX 65000 > >> @@ -502,6 +509,7 @@ do { \ > >> #define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT 0x1000000 /* Journal Async Commit */ > >> #define EXT4_MOUNT_DELALLOC 0x2000000 /* Delalloc support */ > >> #define EXT4_MOUNT_MBALLOC 0x4000000 /* Buddy allocation support */ > >> +#define EXT4_MOUNT_DIR_IRESERVE 0x10000000/* dir inode reservation */ > >> /* Compatibility, for having both ext2_fs.h and ext4_fs.h included at once */ > >> #ifndef _LINUX_EXT2_FS_H > >> #define clear_opt(o, opt) o &= ~EXT4_MOUNT_##opt > >> diff --git a/include/linux/ext4_fs_sb.h b/include/linux/ext4_fs_sb.h > >> index 744e746..790b0cf 100644 > >> --- a/include/linux/ext4_fs_sb.h > >> +++ b/include/linux/ext4_fs_sb.h > >> @@ -147,6 +147,8 @@ struct ext4_sb_info { > >> > >> /* locality groups */ > >> struct ext4_locality_group *s_locality_groups; > >> + /* number of inodes we reserve in a directory */ > >> + int s_dir_ireserve_nr; > >> }; > >> #define EXT4_GROUP_INFO(sb, group) \ > >> EXT4_SB(sb)->s_group_info[(group) >> EXT4_DESC_PER_BLOCK_BITS(sb)] \ > >> > >> > > > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |
|
#11
| |||
| |||
| On Nov 20, 2007 12:22 -0800, Mingming Cao wrote: > On Tue, 2007-11-20 at 12:14 +0800, Coly Li wrote: > > Mingming Cao wrote: > > > On Tue, 2007-11-13 at 22:12 +0800, Coly Li wrote: > > The hole is (s_dir_ireserve_nr - 1), not N * s_dir_ireserve_nr. Because > > directory inode will also use a inode slot from reserved area, reset > > slots number for files is (s_dir_ireserve_nr - 1). Except for the > > reserved inodes number, your understanding exactly matches my idea. > > The performance gain on large number of directories looks interesting, > but I am afraid this makes the uninitialized block group feature less > useful (to speed up fsck) than before The reserved inode will cause> the unused inode watermark higher than before, and spread the > directories over to other block groups earlier than before. Maybe it > worth it...not sure. My original thoughts on the design for this were slightly different: - that the per-directory reserved window would scale with the size of the directory, so that even (or especially) with htree directories the inodes would be kept in hash-ordered sets to speed up stat/unlink - it would be possible/desirable to re-use the existing block bitmap reservation code to handle inode bitmap reservation for directories while those directories are in-core. We already have the mechanisms for this, "all" that would need to change is have the reservation code point at the inode bitmaps but I don't know how easy that is - after an unmount/remount it would be desirable to re-use the same blocks for getting good hash->inode mappings, wherein lies the problem of compatibility One possible solutions for the restart problem is searching the directory leaf block in which an inode is being added for the inode numbers and try to use those as a goal for the inode allocation... Has a minor problem with ordering, because usually the inode is allocated before the dirent is created, but isn't impossible to fix (e.g. find dirent slot first, keep a pointer to it, check for inode goals, and then fill in dirent inum after allocating inode) > > >> 5, Performance number > > >> On a Core-Duo, 2MB DDM memory, 7200 RPM SATA PC, I built a 50GB ext4 > > >> partition, and tried to create 50000 directories, and create 15 (1KB) > > >> files in each directory alternatively. After a remount, I tried to > > >> remove all the directories and files recursively by a 'rm -rf'. > > >> Below is the benchmark result, > > >> normal ext4 ext4 with dir inode reservation > > >> mount options: -o data=writeback -o data=writeback,dir_ireserve=low > > >> Create dirs: real 0m49.101s real 2m59.703s > > >> Create files: real 24m17.962s real 21m8.161s > > >> Unlink all: real 24m43.788s real 17m29.862s One likely reason that the create dirs step is slower is that this is doing a lot more IO than in the past. Only a single inode in each inode table block is being used, so that means that a lot of empty bytes are being read and written (maybe 16x as much data in this case). Also, in what order are you creating files in the directories? If you are creating them in directory order like: for (f = 0; f < 15; f++) for (i = 0; i < 50000; i++) touch dir$i/f$f then it is completely unsurprising that directory reservation is faster at file create/unlink because those inodes are now contiguous at the expense of having gaps in the inode sequence. Creating 15 files per directory is of course the optimum test case also. How does this patch behave with benchmarks like dbench, mongo, postmark? > It would be nice to remember the last allocated bit for each block > group, so we don't have to start from bit 0 (in the worse case scan the > entire block group) for every single directory. Probably this can be > saved in the in-core block group descriptor. Right now the in core > block group descriptor structure is the same on-disk block-group > structure though, it might worth to separate them and provide load/store > helper functions. Note that mballoc already creates an in-memory struct for each group. I think the initialization of this should be moved outside of mballoc so that it can be used for other purposes as you propose. Eric had a benchmark where creating many files/subdirs would cause a huge slowdown because of bitmap searching, and having a per-group pointer with the first free inode (or last allocated inode might be less work to track) would speed this up a lot. Cheers, Andreas -- Andreas Dilger Sr. Software Engineer, Lustre Group Sun Microsystems of Canada, Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |
« Previous Thread
|
Next Thread »
| Tools | |
| |
| | ||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| [PATCH, RFC] ext4: Use preallocation when reading from the inode table | unix | Kernel | 8 | 09-25-2008 11:50 PM |
| [PATCH 3/6] ext4: don't inherit inappropriate inode flags from parent | unix | Kernel | 1 | 06-19-2008 03:50 PM |
| [PATCH 3/6] ext4: don't inherit inappropriate inode flags from parent | unix | Kernel | 1 | 06-11-2008 12:10 AM |
| [PATCH 3/3] ext4: don't inherit inappropriate inode flags from parent | unix | Kernel | 0 | 06-04-2008 01:00 PM |
| [PATCH] ext4: issue warning when bad inode found via ext4_lookup | unix | Kernel | 0 | 12-14-2007 05:50 PM |
All times are GMT. The time now is 09:37 AM.


BTW, goto does not hurt performance
The reserved inode will cause