Optimizing Windows 2003 SP1 R2 for a large number of small files - Storage




  1. Optimizing Windows 2003 SP1 R2 for a large number of small files

    Hello,

    I am currently running an online backup service using NovaNet-WEB
    software. My current environment includes a single Dell PowerEdge 2850
    server (2.8GHz, 2GB RAM, 10K RPM SCSI drives in RAID 5) and two Dell
    PowerVault 745N NAS boxes (2.8GHz, 2GB RAM, 4x250GB 7200RPM SATA II
    Maxtor Drives in a RAID 5). The nature of the application is such that
    I end up with millions of mostly small files on my storage volumes.
    The application seems to add at least 20K of overhead to each file
    (for encryption, compression, etc.) and therefore my median file size
    is 20K.

    My current cluster size is set to 4K on all data volumes and I run
    Diskeeper on all volumes daily. I do not run VSS. In addition, I do a
    robocopy of my data to an external 1TB LaCie USB 2.0 drive. The I/O
    performance of my system is laughable. On my SCSI drives in RAID 5 I
    get about 10MBytes/sec read/write performance. On my SATA drives in
    the NAS boxes my throughput is about 2MBytes/sec. On my single-drive
    USB 2.0 connection I get over 40MBytes/sec. Backing up and
    defragmenting my volumes takes over a day, which is obviously a big
    problem.

    Therefore, I have decided to purchase an entry level SAN RAID box from
    Infortrend. The model is S12F-G1420
    (http://www.infortrend.com/main/2_pro...2f-r-g1420.asp). It has a
    single controller module (FC 4Gb), 1GB of Cache, 6 x 500GB Seagate
    Near-Line drives (8ms access time, 16MB cache).

    My question is how to best optimize my environment to get the best
    possible performance out of the new hardware. I am thinking of
    connecting my primary PowerEdge server and one of my NASes to the SAN
    via FibreChannel and using the NAS for backup only. Here is a list of
    questions:

    1) What RAID level should I use? I am mostly concerned with
    performance and therefore was thinking of using RAID 1+0 across all 6
    500GB drives giving me about 1.5TB of usable space.

    2) How many LUNS should I partition my 1.5TB into? How large should
    they be for best performance?

    3) On the file system side, is there a performance hit for having a
    large (1.5TB) volume as opposed to 2 smaller ones (750GB each)?
    Ideally, I would prefer to have one large volume.

    4) Since I will probably be using multiple LUNs and they will be
    presented as individual disks to the OS I will need to create a volume
    to join them into one partition. What kind of volume should I use for
    best performance? Simple Volume? Stripe Set?

    5) What cluster size should I use when formatting my partitions? I
    was thinking of using 32KB since my median file size is 20KB. However,
    the Infortrend box has a stripe size of 128KB at the hardware level.
    Any recommendations on how to optimize performance vs. space usage in
    this case?

    6) In Windows 2003 SP1 R2 do I need to worry about MFT fragmentation?
    What should I set the size of the MFT to?

    7) When backing up my data, should I use concurrent robocopy threads
    or a single instance to speed up the backup of the data?

    8) What other types of optimizations can I implement? Disabling 8.3
    names for instance?

    9) If disabling something like 8.3 names, can it be done on a single
    volume only or does it need to be done for all volumes on a particular
    Windows system? How does this work in a FibreChannel SAN environment
    where the volume is shared by multiple systems?

    10) In the future, what is the best way to grow my storage capacity
    when adding more disks to the SAN? Should I expand the hardware RAID
    or create a new one? Should I simply create a new Volume on the OS
    side, or can I extend the existing volume? My preference is always to
    have all my data on one partition, if the performance is not affected
    too much.

    Thank you in advance for your help. I realize this is quite a laundry
    list.

    Vadim.


  2. Re: Optimizing Windows 2003 SP1 R2 for a large number of small files

    Since you have 2 NAS boxes, probably the best way to go is to create 2 LUNs
    (one for each) but expose them with DFS - so the client sees a single share
    with 2 subfolders (each folder representing a LUN). This will help with
    both management and availability. If you lose a LUN - you are still 50%
    up - vs 100% down.

    As to cluster size - it has to be a multiple of 4k. The problem w/lots of
    small files is the wasted space at the end of a cluster. With 4k clusters
    and a 20k file size, you have a nice fit with very little wasted space - but
    you also have a propensity towards fragmentation; though this could be
    greatly reduced if the application created and immediately reserved 20k for
    the write vs creating a file w/size 0 and writing 20k (basically the data
    gets interleaved in 4k blocks which is what you are seeing).

    If the median size is 21k (vs the 20k specified in your mail), then you are
    losing 3k at the end of the file. This isn't _necessarily_ a bad thing b/c
    if you access the file to add a few bytes - there is some space to do it
    w/out fragmenting. Whereas if the file is exactly 20k and the blocks are
    4k, then there is no room and when you add some data to the end (assuming
    that the files are right next to each other) the file will fragment.

    To help with that you could select a larger default cluster size (8k, 16k,
    32k, 64k). If you go with 8k, then the disk space required for a 20k file
    will be 24k. It would also make sure that you always have some 'spare'
    space to extend a file if needed w/out inducing further fragmentation.
    Going to 16k or 32k would mean 12k of extra space. Your perf would also
    improve as you would be able to read the entire file w/out an extra seek.
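
    The cluster arithmetic above is easy to check with a quick back-of-envelope script (a hypothetical helper, not part of any tool mentioned in this thread):

```python
import math

def allocation_kb(file_kb, cluster_kb):
    """KB a file actually consumes on disk: whole clusters only."""
    return math.ceil(file_kb / cluster_kb) * cluster_kb

# Slack (wasted KB) for a 20KB file at the cluster sizes discussed
for cluster in (4, 8, 16, 32, 64):
    alloc = allocation_kb(20, cluster)
    print(f"{cluster}KB clusters: {alloc}KB on disk, {alloc - 20}KB slack")
```

    The output matches the numbers above: 8KB clusters put a 20KB file in 24KB, and 16KB or 32KB clusters waste 12KB per file.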

    The RAID selected is a tradeoff between availability, performance, and
    efficiency (% of GB available for data). A stripe set is by far the
    fastest - but if you lose any of the drives, then you lose it all. For some
    workloads this is acceptable - for others it isn't. 0+1 is fast &
    available, but only 1/2 the GB you paid for are available at any given time.
    RAID5 is hard to beat as the middle of the road on all 3. So, really it
    comes down to what is most important.


    Pat






  3. Re: Optimizing Windows 2003 SP1 R2 for a large number of small files

    To properly calculate optimal cluster size you need to know your AVERAGE FILE
    SIZE. The MS defrag report tells you this, as do many other programs.

    For example, my XP Pro box has a 192KB average. In this case 3x64=192
    divides evenly, so 64KB would be optimal for my average file size. Now if I
    were concerned about cluster waste [and I haven't been in years] I might
    want to go to 32KB [the average file would be in 6 clusters instead of 3]
    but I would have a reduced amount of waste for the smaller files.

    Here are some thoughts:
    1. We haven't had a lot of 4KB files in years. Heck, a blank Word doc is
    19KB [at 4KB clusters that's 5 clusters!].
    2. That 192KB file would have to be in 48 pieces vs. 3-6 pieces. Which is
    going to get retrieved/written faster? That's right, the smaller number of
    pieces, so there is a performance improvement from "right-sizing" your
    cluster size.
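
    A hypothetical sketch of the same tradeoff, scaled to the original poster's millions of files (the worst case assumes every cluster lands in its own fragment, and the file counts are illustrative, not measured):

```python
import math

def worst_case_pieces(file_kb, cluster_kb):
    """Upper bound on fragments: at most one per cluster."""
    return math.ceil(file_kb / cluster_kb)

def total_slack_gb(n_files, file_kb, cluster_kb):
    """Aggregate end-of-file waste across n_files of one size."""
    slack_kb = math.ceil(file_kb / cluster_kb) * cluster_kb - file_kb
    return n_files * slack_kb / (1024 * 1024)

print(worst_case_pieces(192, 4))                     # 48 pieces
print(worst_case_pieces(192, 64))                    # 3 pieces
print(round(total_slack_gb(5_000_000, 20, 32), 1))   # ~57.2 GB wasted
print(round(total_slack_gb(5_000_000, 20, 64), 1))   # ~209.8 GB wasted
```

    With millions of 20KB files the per-file waste adds up fast, which is why the cluster-size choice is a real tradeoff here rather than a non-issue.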

    Concerning RAID:
    0+1 is faster than RAID5. But you MUST have a hot spare in the system
    [every hardware RAID system should have hot spares] since this RAID level
    is a mirror of stripes. Lose one drive and you are left with a bare stripe
    and no mirror. Lose one more drive and you are only as good as your backup.
    RAID 10, on the other hand, is a stripe of mirrors. Lose one disk and you
    are still mirrored on the remaining pairs, with only one pair degraded.
    Great performance with higher fault tolerance. I would recommend you
    consider RAID 10.

    Personally I don't use or recommend Raid5 especially when considering
    system/boot and data partitions on the same array.

  4. Re: Optimizing Windows 2003 SP1 R2 for a large number of small files

    Pat,

    Thank you for your reply. A few follow-up questions:

    1) As far as performance goes, is there a recommended maximum LUN size?

    2) Will I get better performance on a system with two 750GB partitions
    as opposed to one 1TB partition?

    3) Since I will be using multiple LUNs, should I use Spanned Volume or
    Striped volume to access them with a single drive letter?

    4) How large should I make my MFT?

    5) Will disabling 8.3 names make a difference?

    6) Can you disable 8.3 names only on a particular volume and not on the
    entire system?

    Thank you,
    Vadim.



  5. Re: Optimizing Windows 2003 SP1 R2 for a large number of small files

    1) LUN size has no impact one way or the other on performance. The "overhead"
    of a LUN is in the mounting - i.e. there is the same overhead for a 1GB LUN
    and a 1TB LUN. Likewise (assuming 0 fragmentation) file size is more or
    less immaterial for performance. File count per directory is probably the
    most impactful factor, and even then not until you get > 100k files in a
    directory. This is b/c of how NTFS indexes the files.

    2) Performance won't be different - except in some extreme circumstances, in
    which case the 2 LUNs will beat the 1 LUN. But that assumes that: a) the
    LUNs have no common spindles and b) the rest of the storage subsystem is
    maxed out.

    3) Striped is generally faster - esp. for small files.

    4) You can use the default - the file system will auto expand. Though since
    you have a very good idea of your data I would probably increase it from the
    default 1/8th to 1/4 or 3/8. This will help prevent MFT fragmentation.

    5) Probably. 8.3 adds overhead (more work on every file creation), but
    you'll need to make sure that your app is compatible with 8.3 names disabled.
    Also, the perf improvement is negligible for directories with small file
    counts - though for very large file counts the difference can become quite
    large due to the number of name collisions.

    6) Nope. It's a system setting.


    Pat




  6. Re: Optimizing Windows 2003 SP1 R2 for a large number of small files

    You want median, not mean/average. There are a number of reasons why, but
    basically it comes down to this: a very small number of very large files
    can skew the average significantly, which may not be a big deal on some
    systems but can make a huge difference on others. That said, it may not be
    an issue unless your file count is quite large (in this case any waste is
    multiplied by the millions).

    You are correct though, that file sizes in general are increasing. Some of
    it is due to the move towards transportable data (e.g. XML) which generally
    takes more room than binary equivalents.
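
    The skew is easy to demonstrate with a made-up distribution (hypothetical numbers, not measured data):

```python
from statistics import mean, median

# A million 20KB backup files plus five 4GB outliers
sizes_kb = [20] * 1_000_000 + [4 * 1024 * 1024] * 5

print(median(sizes_kb))       # 20 -- unchanged by the outliers
print(round(mean(sizes_kb)))  # 41 -- five big files double the "average"
```

    Picking a cluster size for the 41KB "average" would roughly double the per-file slack for the million typical files, which is why the median is the number to size against.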



    Pat



    "Joshua Bolton" wrote in message
    news:B7040839-6049-4F83-A0F7-371512D4B456@microsoft.com...
    > To properly calculate optimal cluster size you need to know your AVERAGE
    > FILE
    > Size. MS defrag report tells you this as may other programs.
    >
    > For example my xp pro box has 192kb for average. In this case 3x64=192
    > divides evenly so 64kb would be optimal for my average file size. Now if
    > I
    > was concerning about cluster waste [and I haven't been in years] I may
    > want
    > to go to 32kb [average file would be in 6 clusters instead of 3] but I
    > would
    > have a reduced amount of waste for the smaller files.
    >
    > Here are some thoughts:
    > 1. We haven't had alot 4kb files in years. Heck, a blank Word doc is 19kb
    > [
    > @4kb thats 5 sectors!].
    > 2. that 192kb file would have to be in 48 pieces vs 3-6 pieces. Which is
    > going to get retrieved/written faster? That's right, the smaller number
    > of
    > pieces so there is a performance improvement with "right" sizing your
    > cluster
    > size.
    >
    > Concerning RAID:
    > 0+1 is faster than RAID 5. But you MUST have a hot spare in the system
    > [every hardware RAID system should have hot spares] since this RAID level
    > is a mirror of stripes. You lose one drive and you are only striped, with
    > no mirror. Lose one more drive and you are only as good as your backup.
    > RAID 10, on the other hand, is a stripe of mirrors. You lose one disk and
    > you are still mirrored, but you lost the stripe. Great performance with
    > higher fault tolerance. I would recommend you consider RAID 10.
    >
    > Personally, I don't use or recommend RAID 5, especially when considering
    > system/boot and data partitions on the same array.
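
The fault-tolerance difference between 0+1 and 10 described above can be checked by brute force. A sketch assuming a hypothetical 6-disk array (the disk count from the original post; the layouts below are illustrative):

```python
from itertools import combinations

# Enumerate every two-disk failure of a 6-disk array and count which
# ones survive under RAID 0+1 (mirror of stripes) vs. RAID 10
# (stripe of mirrors).
DISKS = range(6)
SIDE_A = {0, 1, 2}                 # RAID 0+1: stripe set A, mirrored by set B
SIDE_B = {3, 4, 5}
PAIRS = [{0, 3}, {1, 4}, {2, 5}]   # RAID 10: three mirrored pairs

def survives_0plus1(failed):
    # The array survives as long as one complete stripe side is intact.
    return SIDE_A.isdisjoint(failed) or SIDE_B.isdisjoint(failed)

def survives_10(failed):
    # The array survives as long as no mirrored pair loses both members.
    return not any(p <= failed for p in PAIRS)

two_disk = list(combinations(DISKS, 2))
ok_0plus1 = sum(survives_0plus1(set(f)) for f in two_disk)
ok_10 = sum(survives_10(set(f)) for f in two_disk)
print(f"RAID 0+1 survives {ok_0plus1}/{len(two_disk)} two-disk failures")
print(f"RAID 10  survives {ok_10}/{len(two_disk)} two-disk failures")
```

With 6 disks, 0+1 survives only 6 of the 15 possible two-disk failures while 10 survives 12, which matches the "stripe of mirrors" reasoning above.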



  7. Re: Optimizing Windows 2003 SP1 R2 for a large number of small files

    1. Realize that there is a write penalty of 4 for RAID 5. If you do a lot
    of reads and few writes, this should not be an issue. As the percentage of
    writes in the workload increases, the performance of the array will decrease
    disproportionately due to this significant write penalty.

    2. A 10K FC drive can do about 90 random 4K IOs per second at a response
    time of 20 ms. A 10K SATA drive is more like 40; less than half the
    performance.

    3. The usual "segregate, then stitch it back together with DFS" advice. As
    the size of a volume increases, so does the size of the metadata files
    associated with that volume. If you have a logical way to partition the
    data (a subdirectory structure, for instance), multiple LUNs with DFS to
    present them as a single namespace would make sense.

    4. The usual NTFS optimizations apply. Disabling 8.3 name support comes to
    mind.

    John


    wrote in message
    news:1154905213.536019.147420@i3g2000cwc.googlegro ups.com...
    > Hello,
    >
    > I am currently running an online backup service using NovaNet-WEB
    > software. My current environment includes a single Dell PowerEdge 2850
    > server (2.8GHz, 2GB RAM, 10K RPM SCSI drives in RAID 5) and two Dell
    > PowerVault 745N NAS boxes (2.8GHz, 2GB RAM, 4x250GB 7200RPM SATAII
    > Maxtor Drives in a RAID 5). The nature of the application is such that
    > I end up with millions of mostly small files on my storage volumes.
    > The application seems to have at least a 20K overhead for each file
    > (for encryption, compression, etc) and therefore, my median file size
    > is 20K.
    >
    > My current cluster size is set to 4K on all data volumes and I run
    > Diskeeper on all volumes daily. I do not run VSS. In addition, I do a
    > robocopy of my data to an external USB 2 1TB Lacie drive. The IO
    > performance of my system is laughable. On my SCSI drives in RAID 5 I
    > get about 10MBytes/sec read/write performance. On my SATA drives in
    > the NAS boxes my throughput is about 2MBytes/sec. On my single drive
    > USB II connection I get over 40MB/sec. To back up and defrag my
    > volumes takes over a day, which is obviously a big problem.
    >
    > Therefore, I have decided to purchase an entry level SAN RAID box from
    > Infortrend. The model is S12F-G1420
    > (http://www.infortrend.com/main/2_pro...2f-r-g1420.asp). It has a
    > single controller module (FC 4Gb), 1GB of Cache, 6 x 500GB Seagate
    > Near-Line drives (8ms access time, 16MB cache).
    >
    > My question is how to best optimize my environment to get the best
    > possible performance out of the new hardware. I am thinking of
    > connecting my primary PowerEdge server and one of my NASes to the SAN
    > via FibreChannel and using the NAS for backup only. Here is a list of
    > questions:
    >
    > 1) What RAID level should I use? I am mostly concerned with
    > performance and therefore was thinking of using RAID 1+0 across all 6
    > 500GB drives giving me about 1.5TB of usable space.
    >
    > 2) How many LUNS should I partition my 1.5TB into? How large should
    > they be for best performance?
    >
    > 3) On the file system side, is there a performance hit for having a
    > large (1.5TB) volume as opposed to 2 smaller ones (750GB each)?
    > Ideally, I would prefer to have one large volume.
    >
    > 4) Since I will probably be using multiple LUNs and they will be
    > presented as individual disks to the OS I will need to create a volume
    > to join them into one partition. What kind of volume should I use for
    > best performance? Simple Volume? Stripe Set?
    >
    > 5) What cluster size should I use when formatting my partitions? I
    > was thinking of using 32KB since my median file size is 20KB. However,
    > the Infortrend box has a stripe size of 128KB at the hardware level.
    > Any recommendations on how to optimize performance vs. space usage in
    > this case?
    >
    > 6) In Windows 2003 SP1 R2 do I need to worry about MFT fragmentation?
    > What should I set the size of the MFT to?
    >
    > 7) When backing up my data, should I use concurrent robocopy threads
    > or a single instance to speed up the backup of the data?
    >
    > 8) What other types of optimizations can I implement? Disabling 8.3
    > names for instance?
    >
    > 9) If disabling something like 8.3 names, can it be done on a single
    > volume only or does it need to be done for all volumes on a particular
    > Windows system? How does this work in a FibreChannel SAN environment
    > where the volume is shared by multiple systems?
    >
    > 10) In the future, what is the best way to grow my storage capacity
    > when adding more disks to the SAN? Should I expand the hardware RAID
    > or create a new one? Should I simply create a new Volume on the OS
    > side or can I extend the existing volume? My preference is always to
    > have all my data on one partition, if the performance is not affected
    > too much.
    >
    > Thank you in advance for your help. I realize this is quite a laundry
    > list.
    >
    > Vadim.
    >




  8. Re: Optimizing Windows 2003 SP1 R2 for a large number of small files

    John,

    Thank you for the reply. Could you please elaborate on the write
    penalty of 4 for RAID 5? Am I correct in understanding that, because of
    the parity that needs to be calculated and written with every write,
    there is a performance hit? If so, what RAID level will give me the best
    performance? RAID 1+0, perhaps?

    Also, for #3, do you mean that I should take my 2.5TB of total storage
    on a single logical drive (inside my RAID unit), partition it into
    smaller partitions, map those logical partitions to my host, and then
    use the MS Disk Volume Manager to connect them back together as a
    spanned volume? If so, how large should I make the individual partitions?

    Also, am I asking for trouble if I map the same set of logical drives
    to different Windows 2003 hosts and have the hosts access those drives
    at the same time? If so, what are my options if I want one host to do
    most of the writing/reading to the LUN and another host to back up
    the data to an external drive while resetting archive bits?

    Thanks for your help.

    Vadim.

    John Fullbright [MVP] wrote:
    > [quoted text snipped]



  9. Re: Optimizing Windows 2003 SP1 R2 for a large number of small files

    Without any additional logic at the virtualization layer, RAID 0 will give
    the best performance because there is no write penalty. Of course, there's
    also no parity. RAID 1 has a write penalty of 2. Data is mirrored. RAID 5
    has a write penalty of 4. The difference between RAID 4 and RAID 5 is that
    with RAID 4 the parity is on a dedicated spindle. With RAID 5 the parity is
    dispersed across the set.

    In traditional RAID 4/5 there are four IO operations per write: you read
    the old data, read the old parity, write the new data, and write the new
    parity (the new parity is calculated in between). The vendor I work for
    uses RAID 4/RAID-DP as the underlying parity scheme. The difference is in
    the virtualization layer. The virtualization layer always writes to new
    space and never overwrites, so you avoid multiple reads and writes. All
    writes go to a log that is held in NVRAM. When a write event occurs (log 1
    fills, log 2 opens, and log 1 is processed), writes are coalesced and
    parity is only calculated once (in RAM). A side effect of that design is
    snapshots: because the data is already there (no overwrite), there is no
    performance-draining copy-on-write process.
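
As a rough illustration (a sketch using the ~90 random IOs/sec-per-drive figure and the write penalties stated in this thread; not a vendor benchmark), the effective random IOPS of a 6-drive array falls off quickly as writes dominate under RAID 5:

```python
# Back-of-envelope model: each host read costs one disk IO, each host
# write costs `write_penalty` disk IOs, so a mixed workload sees the
# array's raw IOPS divided by the blended cost per host IO.
def effective_iops(disks, iops_per_disk, write_frac, write_penalty):
    """Random IOPS the array can deliver for a mixed read/write load."""
    read_frac = 1.0 - write_frac
    cost_per_host_io = read_frac + write_frac * write_penalty
    return disks * iops_per_disk / cost_per_host_io

# Write penalties as described above: RAID 0 = 1, RAID 1/10 = 2, RAID 5 = 4.
PENALTY = {"RAID 0": 1, "RAID 1/10": 2, "RAID 5": 4}
for level, penalty in PENALTY.items():
    iops = effective_iops(disks=6, iops_per_disk=90, write_frac=0.5,
                          write_penalty=penalty)
    print(f"{level:<10} at 50% writes: {iops:.0f} IOPS")
```

At a 50/50 read/write mix the RAID 5 array delivers 216 IOPS against 360 for RAID 1/10, which is why the write penalty matters so much for a write-heavy backup workload.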

    If you have a large volume that consists of several subdirectories, why
    not take each subdirectory and place it in a separate volume? Then you
    can use DFS to stitch the namespace back together so that it all appears
    as one namespace.

    If we are talking block-mode access to a volume, you don't want to present
    it to multiple hosts at the same time unless you are controlling access
    with the SCSI RESERVE command (as the clusdisk driver does). If you are
    presenting a share to multiple hosts (file-mode access), the SMB/CIFS
    protocol handles contention by implementing oplocks.

    For backup purposes, you would use snapshots or shadow copies. The
    primary host has block-mode access to the LUN. You would take a
    snapshot, then present the snapshot to the backup host.







    "Vadim" wrote in message
    news:1155613353.845179.156970@m73g2000cwd.googlegr oups.com...
    > [quoted text snipped]




+ Reply to Thread