[RFC] [PATCH -mm 0/2] memcg: per cgroup dirty_ratio - Kernel


Thread: [RFC] [PATCH -mm 0/2] memcg: per cgroup dirty_ratio

  1. [RFC] [PATCH -mm 0/2] memcg: per cgroup dirty_ratio


    The goal of the patch is to control how many dirty file pages a cgroup can have
    at any given time (see also [1]).

    Dirty file and writeback pages are accounted for each cgroup using the memory
    controller statistics. Moreover, the dirty_ratio parameter is added to the
    memory controller. It contains, as a percentage of the cgroup memory, the
    number of dirty pages at which the processes belonging to the cgroup which are
    generating disk writes will start writing out dirty data.

    So, the behaviour is actually the same as the global dirty_ratio, except that
    it works per cgroup.

    Interface:
    - two new entries "writeback" and "filedirty" are added to the file
    memory.stat, to export to userspace respectively the number of pages under
    writeback and the number of dirty file pages in the cgroup

    - the new file memory.dirty_ratio is added in the cgroup filesystem to show/set
    the memcg dirty_ratio
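
    As a rough illustration of how the two interface pieces could be used
    together from userspace, here is a minimal Python sketch. It is not code
    from the patch. It assumes memory.stat keeps its usual "name value" line
    format, that "filedirty" is reported in pages as described above, and that
    "cgroup memory" means the cgroup's memory limit in pages; the helper names
    are hypothetical.

```python
def parse_memory_stat(text):
    """Parse "name value" lines (the usual memory.stat layout) into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def over_dirty_limit(stats, cgroup_limit_pages, dirty_ratio):
    """Return True when the cgroup's dirty file pages exceed dirty_ratio percent.

    Assumption: only "filedirty" is compared against the threshold; the patch
    may well count writeback pages too.
    """
    threshold_pages = cgroup_limit_pages * dirty_ratio // 100
    return stats.get("filedirty", 0) > threshold_pages


if __name__ == "__main__":
    sample = "writeback 12\nfiledirty 300\n"  # made-up numbers
    print(over_dirty_limit(parse_memory_stat(sample), 2000, 10))
```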

    [ This patch is still experimental and I have only done a few quick tests. I'd
    like to run more detailed benchmarks and compare the results; I suspect the
    overhead introduced by this patch is not negligible. BTW, I would prefer a
    dirty limit in bytes instead of a percentage of memory. Bytes are much more
    flexible IMHO: they allow finer-grained limits, so they would work better
    on large-memory machines. ]

    [1] http://lkml.org/lkml/2008/9/9/245

    -Andrea
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: [RFC] [PATCH -mm 0/2] memcg: per cgroup dirty_ratio

    On Fri, 12 Sep 2008 17:09:50 +0200
    Andrea Righi wrote:

    >
    > The goal of the patch is to control how much dirty file pages a cgroup can have
    > at any given time (see also [1]).
    >
    > Dirty file and writeback pages are accounted for each cgroup using the memory
    > controller statistics. Moreover, the dirty_ratio parameter is added to the
    > memory controller. It contains, as a percentage of the cgroup memory, the
    > number of dirty pages at which the processes belonging to the cgroup which are
    > generating disk writes will start writing out dirty data.
    >
    > So, the behaviour is actually the same as the global dirty_ratio, except that
    > it works per cgroup.
    >
    > Interface:
    > - two new entries "writeback" and "filedirty" are added to the file
    > memory.stat, to export to userspace respectively the number of pages under
    > writeback and the number of dirty file pages in the cgroup
    >
    > - the new file memory.dirty_ratio is added in the cgroup filesystem to show/set
    > the memcg dirty_ratio


    Seems like a desirable objective.

    > [ This patch is still experimental and I only did few quick tests. I'd like to
    > do run more detailed benchmarks and compare the results, I guess the overhead
    > introduced by this patch shouldn't be so small... and BTW I would prefer a
    > dirty limit in bytes, instead of using a percentage of memory. Bytes are hugely
    > more flexible IMHO, they allow to define more fine-grained limits and so this
    > would work better on large memory machines. ]
    >
    > [1] http://lkml.org/lkml/2008/9/9/245


    I tend to duck experimental and rfc patches

    One thing to think about please: Michael Rubin is hitting problems with
    the existing /proc/sys/vm/dirty_ratio. Its present granularity of 1%
    is just too coarse for really large machines, and as
    memory-size/disk-speed ratios continue to increase, this will just get
    worse.

    So after thinking about it a bit I encouraged him to propose a patch
    which adds a new /proc/sys/vm/hires-dirty-ratio (for some value of
    "hires") which simply offers a higher-resolution interface to the
    same internal kernel machinery.

    How does this affect you? I don't think we should be adding new
    interfaces which have the old 1%-resolution problem. Once we get this
    higher-resolution interface sorted out, your new interface should do it
    the same way.



  3. Re: [RFC] [PATCH -mm 0/2] memcg: per cgroup dirty_ratio

    Andrew Morton wrote:
    > On Fri, 12 Sep 2008 17:09:50 +0200
    > Andrea Righi wrote:
    >
    >> The goal of the patch is to control how much dirty file pages a cgroup can have
    >> at any given time (see also [1]).
    >>
    >> Dirty file and writeback pages are accounted for each cgroup using the memory
    >> controller statistics. Moreover, the dirty_ratio parameter is added to the
    >> memory controller. It contains, as a percentage of the cgroup memory, the
    >> number of dirty pages at which the processes belonging to the cgroup which are
    >> generating disk writes will start writing out dirty data.
    >>
    >> So, the behaviour is actually the same as the global dirty_ratio, except that
    >> it works per cgroup.
    >>
    >> Interface:
    >> - two new entries "writeback" and "filedirty" are added to the file
    >> memory.stat, to export to userspace respectively the number of pages under
    >> writeback and the number of dirty file pages in the cgroup
    >>
    >> - the new file memory.dirty_ratio is added in the cgroup filesystem to show/set
    >> the memcg dirty_ratio

    >
    > Seems like a desirable objective.
    >
    >> [ This patch is still experimental and I only did few quick tests. I'd like to
    >> do run more detailed benchmarks and compare the results, I guess the overhead
    >> introduced by this patch shouldn't be so small... and BTW I would prefer a
    >> dirty limit in bytes, instead of using a percentage of memory. Bytes are hugely
    >> more flexible IMHO, they allow to define more fine-grained limits and so this
    >> would work better on large memory machines. ]
    >>
    >> [1] http://lkml.org/lkml/2008/9/9/245

    >
    > I tend to duck experimental and rfc patches
    >
    > One thing to think about please: Michael Rubin is hitting problems with
    > the existing /proc/sys/vm/dirty-ratio. Its present granularity of 1%
    > is just too coarse for really large machines, and as
    > memory-size/disk-speed ratios continue to increase, this will just get
    > worse.
    >
    > So after thinking about it a bit I encouraged him to propose a patch
    > which adds a new /proc/sys/vm/hires-dirty-ratio (for some value of
    > "hires" ) which simply offers a higher-resolution interface to the
    > same internal kernel machinery.
    >
    > How does this affect you? I don't think we should be adding new
    > interfaces which have the old 1%-resolution problem. Once we get this
    > higher-resolution interface sorted out, your new interface should do it
    > the same way.


    Totally agree.

    The hires-dirty-ratio interface seems much better. I'll follow the progress
    of this new interface; in any case, adopting the same approach in my patch
    doesn't look too difficult.

    BTW why not use a simple dirty-ratio-in-bytes?

    Thanks for commenting,
    -Andrea

  4. Re: [RFC] [PATCH -mm 0/2] memcg: per cgroup dirty_ratio

    On Sat, 13 Sep 2008 01:04:35 +0200
    Andrea Righi wrote:

    > BTW why not use a simple dirty-ratio-in-bytes?


    s/ratio/amount/

    No particular reason - I haven't really thought about it frankly.

    A "ratio" might make more sense in a containerised setup, particularly
    if the container can be resized on the fly.

  5. Re: [RFC] [PATCH -mm 0/2] memcg: per cgroup dirty_ratio

    Currently the problem we are hitting is that we cannot specify pdflush
    to have background limits less than 1% of memory. I am currently
    finishing up a patch right now that adds a dirty_ratio_millis
    interface. I hope to submit the patch to LKML by the end of the week.

    The idea is that we don't want to break backwards compatibility and we
    also don't want to have two conflicting knobs in the sysctl or
    /proc/sys/vm/ space. I thought adding a new knob for those who want to
    specify finer grained functionality was a compromise. So the patch has
    a vm_dirty_ratio and a vm_dirty_ratio_millis interface. The first to
    specify 0-100% and the second to specify .0 to .999%.

    So to represent 0.125% of RAM we set
    vm_dirty_ratio = 0
    vm_dirty_ratio_millis = 125

    The same for the background_ratio.

    Any feedback?

    mrubin

    On Fri, Sep 12, 2008 at 4:10 PM, Andrew Morton
    wrote:
    > On Sat, 13 Sep 2008 01:04:35 +0200
    > Andrea Righi wrote:
    >
    >> BTW why not use a simple dirty-ratio-in-bytes?

    >
    > s/ratio/amount/
    >
    > No particular reason - I haven't really thought about it frankly.
    >
    > A "ratio" might make more sense in a containerised setup, particularly
    > if the container can be resized on the fly.
    >


  6. Re: [RFC] [PATCH -mm 0/2] memcg: per cgroup dirty_ratio

    On Fri, Sep 12, 2008 at 1:18 PM, Andrew Morton
    wrote:
    > One thing to think about please: Michael Rubin is hitting problems with
    > the existing /proc/sys/vm/dirty-ratio. Its present granularity of 1%
    > is just too coarse for really large machines, and as
    > memory-size/disk-speed ratios continue to increase, this will just get
    > worse.


    Re-sending since I top-posted before. Never again. Also adding more
    thoughts on a byte based interface.

    Currently the problem we are hitting is that we cannot specify pdflush
    to have background limits less than 1% of memory. I am currently
    finishing up a patch right now that adds a dirty_ratio_millis
    interface. I hope to submit the patch to LKML by the end of the week.

    The idea is that we don't want to break backwards compatibility and we
    also don't want to have two conflicting knobs in the sysctl or
    /proc/sys/vm/ space. I thought adding a new knob for those who want to
    specify finer grained functionality was a compromise. So the patch has
    a vm_dirty_ratio and a vm_dirty_ratio_millis interface. The first to
    specify 0-100% and the second to specify .0 to .999%.

    So to represent 0.125% of RAM we set
    vm_dirty_ratio = 0
    vm_dirty_ratio_millis = 125

    The same for the background_ratio.
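
    To make the arithmetic concrete, here is a minimal Python sketch of how
    the two knobs might combine into a single threshold. It is an
    illustration, not the patch: the function name, the validation rules and
    the use of a page count as the base are assumptions.

```python
def dirty_threshold_pages(dirtyable_pages, ratio, ratio_millis):
    """Combine a whole-percent knob and a thousandths-of-a-percent knob.

    ratio        -- 0..100, whole percent (like vm_dirty_ratio)
    ratio_millis -- 0..999, thousandths of a percent (like vm_dirty_ratio_millis)
    """
    if not 0 <= ratio <= 100:
        raise ValueError("ratio must be in 0..100")
    if not 0 <= ratio_millis <= 999:
        raise ValueError("ratio_millis must be in 0..999")
    # Work in thousandths of a percent so everything stays integer arithmetic.
    millipercent = ratio * 1000 + ratio_millis
    if millipercent > 100_000:
        raise ValueError("combined value exceeds 100%")
    return dirtyable_pages * millipercent // 100_000


if __name__ == "__main__":
    pages = 4 * 1024 * 1024  # 16 GiB worth of 4 KiB pages
    print(dirty_threshold_pages(pages, 0, 125))  # 0.125%
    print(dirty_threshold_pages(pages, 1, 0))    # 1%
```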

    I would also prefer using a bytes interface but I am not sure how to
    offer that without either removing the legacy interface of the ratios
    or by offering a concurrent interface that might be confusing such as
    when users are looking at the old one and not aware of a new one.

    Any feedback?

    mrubin

  7. Re: [RFC] [PATCH -mm 0/2] memcg: per cgroup dirty_ratio

    Michael Rubin wrote:
    > On Fri, Sep 12, 2008 at 1:18 PM, Andrew Morton
    > wrote:
    >> One thing to think about please: Michael Rubin is hitting problems with
    >> the existing /proc/sys/vm/dirty-ratio. Its present granularity of 1%
    >> is just too coarse for really large machines, and as
    >> memory-size/disk-speed ratios continue to increase, this will just get
    >> worse.

    >
    > Re-sending since I top-posted before. Never again. Also adding more
    > thoughts on a byte based interface.
    >
    > Currently the problem we are hitting is that we cannot specify pdflush
    > to have background limits less than 1% of memory. I am currently
    > finishing up a patch right now that adds a dirty_ratio_millis
    > interface. I hope to submit the patch to LKML by the end of the week.
    >
    > The idea is that we don't want to break backwards compatibility and we
    > also don't want to have two conflicting knobs in the sysctl or
    > /proc/sys/vm/ space. I thought adding a new knob for those who want to
    > specify finer grained functionality was a compromise. So the patch has
    > a vm_dirty_ratio and a vm_dirty_ratio_millis interface. The first to
    > specify 0-100% and the second to specify .0 to .999%.
    >
    > So to represent 0.125% of RAM we set
    > vm_dirty_ratio = 0
    > vm_dirty_ratio_millis = 125
    >
    > The same for the background_ratio.
    >
    > I would also prefer using a bytes interface but I am not sure how to
    > offer that without either removing the legacy interface of the ratios
    > or by offering a concurrent interface that might be confusing such as
    > when users are looking at the old one and not aware of a new one.
    >
    > Any feedback?
    >
    > mrubin


    I think using millis is OK today, but it may not scale well to systems
    with 1TB of memory (in this case the minimum granularity would be about 10MB).

    A bytes/pages interface would also solve this problem for tomorrow's
    machines.
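
    The granularity figure can be checked with a few lines of Python. This is
    only a sketch: it applies the percentage to total memory for simplicity,
    and the helper name is made up.

```python
def min_step_bytes(total_bytes, steps_per_percent):
    """Smallest non-zero limit expressible with the given resolution.

    steps_per_percent is 1 for a whole-percent knob and 1000 for a
    thousandths-of-a-percent ("millis") knob.
    """
    return total_bytes // (100 * steps_per_percent)


if __name__ == "__main__":
    TIB = 1 << 40
    print(min_step_bytes(TIB, 1))     # 10995116277 bytes, roughly 10 GiB
    print(min_step_bytes(TIB, 1000))  # 10995116 bytes, roughly 10.5 MiB
```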

    Moreover, wouldn't it be safer to make them mutually exclusive? I mean,
    writing a value != 0 to vm_dirty_ratio_millis automatically sets
    vm_dirty_ratio to 0 (disabled) and vice versa (this could be implemented
    using an appropriate .proc_handler, for example).

    Granted, I would like to be able to set percentages like 12.456%, but
    without mutual exclusion a simple "sysctl -p" could cause unexpected
    behaviour, for example by reconfiguring vm_dirty_ratio but not
    vm_dirty_ratio_millis.

    The same should apply to a bytes/pages interface, so setting
    vm_dirty_bytes != 0 (or vm_dirty_pages) should "disable" vm_dirty_ratio
    and vice versa.
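
    The mutual-exclusion rule can be modelled in a few lines of Python. This
    is a userspace sketch of the idea only, not a .proc_handler; the class
    and method names are hypothetical and the default ratio is arbitrary.

```python
class DirtyKnobs:
    """Two knobs for one limit: writing a non-zero value to one zeroes the other."""

    def __init__(self, ratio=10):
        self.ratio = ratio    # whole percent; 0 means "disabled"
        self.nbytes = 0       # absolute limit in bytes; 0 means "disabled"

    def write_ratio(self, value):
        self.ratio = value
        if value != 0:
            self.nbytes = 0   # the byte knob now reads back as 0

    def write_bytes(self, value):
        self.nbytes = value
        if value != 0:
            self.ratio = 0    # the ratio knob now reads back as 0

    def threshold_bytes(self, dirtyable_bytes):
        """Effective limit: the byte knob wins when it is set."""
        if self.nbytes:
            return self.nbytes
        return dirtyable_bytes * self.ratio // 100


if __name__ == "__main__":
    knobs = DirtyKnobs()
    knobs.write_bytes(1 << 20)
    print(knobs.ratio, knobs.nbytes)
```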

    Thanks,
    -Andrea

  8. Re: [RFC] [PATCH -mm 0/2] memcg: per cgroup dirty_ratio

    > Currently the problem we are hitting is that we cannot specify pdflush
    > to have background limits less than 1% of memory. I am currently
    > finishing up a patch right now that adds a dirty_ratio_millis
    > interface. I hope to submit the patch to LKML by the end of the week.
    >
    > The idea is that we don't want to break backwards compatibility and we
    > also don't want to have two conflicting knobs in the sysctl or
    > /proc/sys/vm/ space. I thought adding a new knob for those who want to
    > specify finer grained functionality was a compromise. So the patch has
    > a vm_dirty_ratio and a vm_dirty_ratio_millis interface. The first to
    > specify 0-100% and the second to specify .0 to .999%.
    >
    > So to represent 0.125% of RAM we set
    > vm_dirty_ratio = 0
    > vm_dirty_ratio_millis = 125
    >
    > The same for the background_ratio.


    Why is vm_dirty_ratio = 0.125 wrong?
    It is harder for whoever writes the parser, but it gives a nicer user experience.

    >
    > I would also prefer using a bytes interface but I am not sure how to
    > offer that without either removing the legacy interface of the ratios
    > or by offering a concurrent interface that might be confusing such as
    > when users are looking at the old one and not aware of a new one.
    >
    > Any feedback?


    Sure.
    We don't have any motivation for this interface change.




  9. Re: [RFC] [PATCH -mm 0/2] memcg: per cgroup dirty_ratio

    On Tue, Sep 23, 2008 at 10:48 AM, KOSAKI Motohiro
    wrote:
    > Why vm_dirty_ratio = 0.125 is wrong?
    > it is hardly for parser maker, but it have nicer user experience.


    Here's an idea to build off Kosaki's suggestion and incorporate other
    previous suggestions.

    What if we have two knobs for every ratio? So we could have
    vm_dirty_ratio and also vm_dirty_KB.

    vm_dirty_KB allows the user to set the number of KB desired and also
    read back the number of KB that has been set.

    Writing to vm_dirty_ratio works just as before and only allows whole
    percentages.
    Reading from vm_dirty_ratio returns a whole percentage as before, except
    that if KB has been set it can return a fractional percentage (rounded
    off to thousandths).

    This way we allow new functionality and preserve old functionality
    while not surprising the user.
    Maybe we should deprecate the vm_dirty_ratio interface also and point
    folks to the vm_dirty_KB.
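
    A small Python sketch of the read-back behaviour described above. The
    function name and the choice of base (dirtyable memory in KB) are
    assumptions made for illustration.

```python
def read_dirty_ratio(ratio, dirty_kb, dirtyable_kb):
    """What a read of vm_dirty_ratio might return under this proposal.

    If the KB knob is unset (0) the old whole-percent value is returned;
    otherwise the KB limit is shown as a percentage rounded to thousandths.
    """
    if dirty_kb == 0:
        return str(ratio)
    return f"{100 * dirty_kb / dirtyable_kb:.3f}"


if __name__ == "__main__":
    sixteen_gib_kb = 16 * 1024 * 1024
    print(read_dirty_ratio(10, 0, sixteen_gib_kb))          # old behaviour
    print(read_dirty_ratio(10, 20 * 1024, sixteen_gib_kb))  # 20 MiB limit set
```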

    > We don't have any motivation of its interface change.


    We are seeing problems where we are generating a lot of dirty memory
    from asynchronous background writes while more important traffic is
    operating with DIRECT_IO. The DIRECT_IO traffic will incur high
    latency spikes as the pdflush hits the background threshold and tries
    to write a lot of dirty buffers at once.

    What we want to do is lower the background threshold low enough so
    that we don't end up writing a lot of data at one time. As systems get
    more and more memory this is and will become difficult. 1% of system
    RAM could tie up a disk.

    mrubin

  10. Re: [RFC] [PATCH -mm 0/2] memcg: per cgroup dirty_ratio

    > > We don't have any motivation of its interface change.
    >
    > We are seeing problems where we are generating a lot of dirty memory
    > from asynchronous background writes while more important traffic is
    > operating with DIRECT_IO. The DIRECT_IO traffic will incur high
    > latency spikes as the pdflush hits the background threshold and tries
    > to write a lot of dirty buffers at once.
    >
    > What we want to do is lower the background threshold low enough so
    > that we don't end up writing a lot of data at one time. As systems get
    > more and more memory this is and will become difficult. 1% of system
    > RAM could tie up a disk.


    Yup.
    Sorry, I chose my words badly in my last mail, which caused confusion.
    I only disagreed with vm_dirty_KB.

    I agree with a fine-grained vm_dirty_ratio.

    Thanks.



  11. Re: [RFC] [PATCH -mm 0/2] memcg: per cgroup dirty_ratio

    KOSAKI Motohiro wrote:
    >> Currently the problem we are hitting is that we cannot specify pdflush
    >> to have background limits less than 1% of memory. I am currently
    >> finishing up a patch right now that adds a dirty_ratio_millis
    >> interface. I hope to submit the patch to LKML by the end of the week.
    >>
    >> The idea is that we don't want to break backwards compatibility and we
    >> also don't want to have two conflicting knobs in the sysctl or
    >> /proc/sys/vm/ space. I thought adding a new knob for those who want to
    >> specify finer grained functionality was a compromise. So the patch has
    >> a vm_dirty_ratio and a vm_dirty_ratio_millis interface. The first to
    >> specify 0-100% and the second to specify .0 to .999%.
    >>
    >> So to represent 0.125% of RAM we set
    >> vm_dirty_ratio = 0
    >> vm_dirty_ratio_millis = 125
    >>
    >> The same for the background_ratio.

    >
    > Why vm_dirty_ratio = 0.125 is wrong?
    > it is hardly for parser maker, but it have nicer user experience.
    >
    >> I would also prefer using a bytes interface but I am not sure how to
    >> offer that without either removing the legacy interface of the ratios
    >> or by offering a concurrent interface that might be confusing such as
    >> when users are looking at the old one and not aware of a new one.
    >>
    >> Any feedback?

    >
    > Sure.
    > We don't have any motivation of its interface change.


    The more I think about this, the more I would prefer an interface in KB
    (or pages) that automatically adjusts the old integer percentage in
    dirty_ratio (and likewise for dirty_background_ratio).

    Parsing decimal values on write doesn't seem to be a big problem, but if
    the user expects to read an int from vm_dirty_ratio and instead receives
    something like 0.125, well... this could break something. So, IMHO, this
    approach also changes the kernel-userspace interface.

    -Andrea

  12. Re: [RFC] [PATCH -mm 0/2] memcg: per cgroup dirty_ratio

    Michael Rubin wrote:
    > On Fri, Sep 12, 2008 at 1:18 PM, Andrew Morton
    > wrote:
    >> One thing to think about please: Michael Rubin is hitting problems with
    >> the existing /proc/sys/vm/dirty-ratio. Its present granularity of 1%
    >> is just too coarse for really large machines, and as
    >> memory-size/disk-speed ratios continue to increase, this will just get
    >> worse.

    >
    > Re-sending since I top-posted before. Never again. Also adding more
    > thoughts on a byte based interface.
    >
    > Currently the problem we are hitting is that we cannot specify pdflush
    > to have background limits less than 1% of memory. I am currently
    > finishing up a patch right now that adds a dirty_ratio_millis
    > interface. I hope to submit the patch to LKML by the end of the week.
    >
    > The idea is that we don't want to break backwards compatibility and we
    > also don't want to have two conflicting knobs in the sysctl or
    > /proc/sys/vm/ space. I thought adding a new knob for those who want to
    > specify finer grained functionality was a compromise. So the patch has
    > a vm_dirty_ratio and a vm_dirty_ratio_millis interface. The first to
    > specify 0-100% and the second to specify .0 to .999%.
    >
    > So to represent 0.125% of RAM we set
    > vm_dirty_ratio = 0
    > vm_dirty_ratio_millis = 125
    >
    > The same for the background_ratio.
    >
    > I would also prefer using a bytes interface but I am not sure how to
    > offer that without either removing the legacy interface of the ratios
    > or by offering a concurrent interface that might be confusing such as
    > when users are looking at the old one and not aware of a new one.
    >


    Just provide a vm_dirty_ratio_in_bytes interface and keep it in sync with
    vm_dirty_ratio (they are just two representations of the same internal value),
    and for higher resolution suggest that users use the bytes interface.



    --
    Balbir

  13. Re: [RFC] [PATCH -mm 0/2] memcg: per cgroup dirty_ratio

    Balbir Singh wrote:
    > Michael Rubin wrote:
    >> On Fri, Sep 12, 2008 at 1:18 PM, Andrew Morton
    >> wrote:
    >>> One thing to think about please: Michael Rubin is hitting problems with
    >>> the existing /proc/sys/vm/dirty-ratio. Its present granularity of 1%
    >>> is just too coarse for really large machines, and as
    >>> memory-size/disk-speed ratios continue to increase, this will just get
    >>> worse.

    >> Re-sending since I top-posted before. Never again. Also adding more
    >> thoughts on a byte based interface.
    >>
    >> Currently the problem we are hitting is that we cannot specify pdflush
    >> to have background limits less than 1% of memory. I am currently
    >> finishing up a patch right now that adds a dirty_ratio_millis
    >> interface. I hope to submit the patch to LKML by the end of the week.
    >>
    >> The idea is that we don't want to break backwards compatibility and we
    >> also don't want to have two conflicting knobs in the sysctl or
    >> /proc/sys/vm/ space. I thought adding a new knob for those who want to
    >> specify finer grained functionality was a compromise. So the patch has
    >> a vm_dirty_ratio and a vm_dirty_ratio_millis interface. The first to
    >> specify 0-100% and the second to specify .0 to .999%.
    >>
    >> So to represent 0.125% of RAM we set
    >> vm_dirty_ratio = 0
    >> vm_dirty_ratio_millis = 125
    >>
    >> The same for the background_ratio.
    >>
    >> I would also prefer using a bytes interface but I am not sure how to
    >> offer that without either removing the legacy interface of the ratios
    >> or by offering a concurrent interface that might be confusing such as
    >> when users are looking at the old one and not aware of a new one.
    >>

    >
    > Just provide a vm_dirty_ratio_in_bytes interface and keep it in sync with
    > vm_dirty_ratio (they are just two representations of the same internal value)
    > and for higher resolution propose that users use the bytes interface.


    Hi Balbir,

    Now that I have read the documentation carefully, the description in
    Documentation/filesystems/proc.txt seems a bit misleading. In proc.txt we
    say that dirty_ratio and dirty_background_ratio are "a percentage of total
    system memory", but in mm/page-writeback.c we apply the percentages to the
    dirtyable memory: free pages + reclaimable pages. So, first of all, I think
    we should clarify this in the documentation...

    That said, keeping vm_dirty_amount_in_bytes in sync with
    dirty_ratio_in_percentage is not a trivial task. One is a static value,
    the other depends on the dirtyable memory in the system. If we want to
    preserve the same behaviour we should do the following:

    dirty_ratio = x => dirty_amount_in_bytes = x * dirtyable_memory / 100

    dirty_amount_in_bytes = y => dirty_ratio = y / dirtyable_memory * 100

    But any time the dirtyable memory (or the total memory in the system)
    changes, we would have to update both values to keep them
    coherent (ouch!).
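
    The two conversions above, and the way they drift apart, can be sketched
    in Python. The numbers are invented for illustration and the function
    names are hypothetical.

```python
def bytes_from_ratio(ratio, dirtyable_bytes):
    """dirty_ratio = x  =>  dirty_amount_in_bytes = x * dirtyable_memory / 100"""
    return ratio * dirtyable_bytes // 100


def ratio_from_bytes(amount_bytes, dirtyable_bytes):
    """dirty_amount_in_bytes = y  =>  dirty_ratio = y / dirtyable_memory * 100"""
    return amount_bytes * 100 // dirtyable_bytes


if __name__ == "__main__":
    dirtyable = 8_000_000_000
    amount = bytes_from_ratio(10, dirtyable)   # byte value matching 10%
    # Dirtyable memory halves; the stored byte value no longer means 10%.
    dirtyable = 4_000_000_000
    print(amount, ratio_from_bytes(amount, dirtyable))
```

    Whichever of the two values is treated as the master copy, the other has
    to be recomputed every time the dirtyable memory changes.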

    Possible solutions:

    1) introduce a fine-grained dirty_ratio that handles decimals through a
    suitable parser (disadvantage: this would break compatibility with all the
    userspace apps that expect to read an int from vm_dirty_ratio)

    2) introduce dirty_ratio + dirty_ratio_millis (disadvantage: can
    cause unexpected behaviour when something is written to
    dirty_ratio by someone unaware of dirty_ratio_millis)

    3) introduce dirty_ratio + dirty_amount_in_bytes as mutually exclusive
    knobs, where writing to one automatically "disables" the other
    (disadvantage: writing to dirty_ratio while unaware of
    dirty_amount_in_bytes can cause unexpected behaviour)

    4) introduce dirty_ratio + dirty_amount_in_bytes and change the
    old behaviour: when something is written to dirty_ratio,
    dirty_amount_in_bytes is computed as a function of totalram_pages (or
    the memcg limit) and then we always use this static value, instead of
    something that depends on the dirtyable memory - we can easily update
    dirty_amount_in_bytes also when totalram_pages or the memcg limit
    changes (disadvantage: changes an old - working - behaviour).

    5) handle fine-grained dirty_ratio decimals through a suitable parser when
    something is written to dirty_ratio; export the percentage units via
    dirty_ratio, and the decimals via dirty_ratio_decimals; writing to
    dirty_ratio_decimals is not allowed.

    I tend to choose 5. The same for dirty_background_ratio.
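
    A minimal Python sketch of what option 5 could look like from the user's
    side. The three-digit limit on the fractional part and the function name
    are assumptions, not something settled in this thread.

```python
def parse_dirty_ratio(text):
    """Parse a write such as "12.456" into (units, decimals).

    units    -- whole percent, what a read of dirty_ratio would show
    decimals -- thousandths of a percent, what a read of the read-only
                dirty_ratio_decimals would show
    """
    text = text.strip()
    whole, dot, frac = text.partition(".")
    if not whole.isdecimal() or (dot and not frac.isdecimal()) or len(frac) > 3:
        raise ValueError(f"invalid dirty_ratio: {text!r}")
    units = int(whole)
    # Pad on the right so "12.4" means 400 thousandths, not 4.
    decimals = int(frac.ljust(3, "0")) if frac else 0
    if units * 1000 + decimals > 100_000:
        raise ValueError("dirty_ratio must not exceed 100%")
    return units, decimals


if __name__ == "__main__":
    for value in ("5", "0.125", "12.4"):
        print(value, parse_dirty_ratio(value))
```

    Old tools that write and read whole numbers keep working, because a read
    of dirty_ratio still returns an int.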

    -Andrea

  14. Re: [RFC] [PATCH -mm 0/2] memcg: per cgroup dirty_ratio

    On Tue, 07 Oct 2008 17:49:49 +0200
    Andrea Righi wrote:

    > Balbir Singh wrote:
    > > Michael Rubin wrote:
    > >> On Fri, Sep 12, 2008 at 1:18 PM, Andrew Morton
    > >> wrote:
    > >>> One thing to think about please: Michael Rubin is hitting problems with
    > >>> the existing /proc/sys/vm/dirty-ratio. Its present granularity of 1%
    > >>> is just too coarse for really large machines, and as
    > >>> memory-size/disk-speed ratios continue to increase, this will just get
    > >>> worse.
    > >> Re-sending since I top-posted before. Never again. Also adding more
    > >> thoughts on a byte based interface.
    > >>
    > >> Currently the problem we are hitting is that we cannot specify pdflush
    > >> to have background limits less than 1% of memory. I am currently
    > >> finishing up a patch right now that adds a dirty_ratio_millis
    > >> interface. I hope to submit the patch to LKML by the end of the week.
    > >>
    > >> The idea is that we don't want to break backwards compatibility and we
    > >> also don't want to have two conflicting knobs in the sysctl or
    > >> /proc/sys/vm/ space. I thought adding a new knob for those who want to
    > >> specify finer grained functionality was a compromise. So the patch has
    > >> a vm_dirty_ratio and a vm_dirty_ratio_millis interface. The first to
    > >> specify 0-100% and the second to specify .0 to .999%.
    > >>
    > >> So to represent 0.125% of RAM we set
    > >> vm_dirty_ratio = 0
    > >> vm_dirty_ratio_millis = 125
    > >>
    > >> The same for the background_ratio.
    > >>
    > >> I would also prefer using a bytes interface but I am not sure how to
    > >> offer that without either removing the legacy interface of the ratios
    > >> or by offering a concurrent interface that might be confusing such as
    > >> when users are looking at the old one and not aware of a new one.
    > >>

    > >
    > > Just provide a vm_dirty_ratio_in_bytes interface and keep it in sync with
    > > vm_dirty_ratio (they are just two representations of the same internal value)
    > > and for higher resolution propose that users use the bytes interface.

    >
    > Hi Balbir,
    >
    > now that I read carefully the documentation, the description in
    > Documentation/filesystems/proc.txt seems to be a bit misleading. In
    > proc.txt we say that dirty_ratio and dirty_background_ratio are "a
    > percentage of total system memory", but in mm/page-writeback.c we apply
    > the percentages to the dirtyable memory: free pages + reclaimable pages.
    > So, first of all I think we should clarify this in the documentation...
    >
    > Saying that, keeping in sync the vm_dirty_amount_in_bytes according to
    > dirty_ratio_in_percentage is not a trivial task. One is a static value,
    > the other depends on the dirtyable memory in the system. If we want to
    > preserve the same behaviour we should do the following:
    >
    > dirty_ratio = x => dirty_amount_in_bytes = x * dirtyable_memory / 100
    >
    > dirty_amount_in_bytes = y => dirty_ratio = y / dirtyable_memory * 100
    >
    > But anytime the dirtyable memory (or the total memory in the system)
    > changes we should update both values accordingly to preserve the
    > coherency between them (ouch!).
    >
    > Possible solutions:
    >
    > 1) introduce fine-grained dirty_ratio handling decimals by an opportune
    > parser (disadvantage: this would break the compatibility with all the
    > userspace apps that expect to read an int from vm_dirty_ratio)
    >
    > 2) introduce dirty_ratio + dirty_ratio_millis (disadvantage: can
    > generate unexpected behaviours when something is written to
    > dirty_ratio ignoring the existence of dirty_ratio_millis)
    >
    > 3) introduce dirty_ratio + dirty_amount_in_bytes mutually exclusive,
    > writing to one automatically "disable" the other (disadvantage:
    > writing to dirty_ratio ignoring dirty_amount_in_bytes can cause
    > unexpected behaviours)
    >
    > 4) introduce dirty_ratio + dirty_amount_in_bytes and change the
    > old behaviour: when something is written to dirty_ratio,
    > dirty_amount_in_bytes is evaluated in function of totalram_pages (or
    > the memcg limit) and then we always use this static value, instead of
    > something that depends on the dirtyable memory - we can easily update
    > dirty_amount_in_bytes also when totalram_pages or the memcg limit
    > changes (disadvantage: change an old - working - behaviour).
    >
    > 5) handle fine-grained dirty_ratio decimals by an opportune parser when
    > writing something to dirty_ratio; export the percentage units via
    > dirty_ratio, and the decimals via dirty_ratio_decimals; writing to
    > dirty_ratio_decimals is not allowed.
    >
    > I tend to choose 5. The same for dirty_background_ratio.
    >


    Hmm... I agree with "5". Like this?
    ==
    provides
    - vm.dirty_ratio (1/100)
    - vm.dirty_ratio_percentmille (1/100,000, pcm)

    and allows
    # echo 0.05 > vm/dirty_ratio
    # cat vm/dirty_ratio
    0
    # cat vm/dirty_ratio_percentmille
    50
    ==
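
    A minimal userspace sketch of the kind of parser this proposal would need,
    assuming the 1/100,000 (pcm) unit named above, so that "5" maps to 5000,
    "0.5" to 500 and "0.125" to 125. The function name and the PCM_* macros are
    hypothetical and are not taken from any posted patch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define PCM_PER_PERCENT 1000                 /* hypothetical: 1% = 1000 pcm */
#define PCM_MAX (100 * PCM_PER_PERCENT)      /* 100% expressed in pcm */

/*
 * Parse a decimal percentage such as "5", "0.5" or "0.125" into pcm.
 * At most three fractional digits are accepted, since 1 pcm = 0.001%.
 * Returns 0 on success, -1 on malformed or out-of-range input.
 */
int parse_percent_to_pcm(const char *s, int *pcm_out)
{
    long whole = 0, frac = 0;
    int frac_digits = 0;

    assert(s != NULL && pcm_out != NULL);

    if (!isdigit((unsigned char)*s))
        return -1;
    while (isdigit((unsigned char)*s)) {
        whole = whole * 10 + (*s++ - '0');
        if (whole > 100)
            return -1;                       /* also bounds the accumulator */
    }
    if (*s == '.') {
        s++;
        if (!isdigit((unsigned char)*s))
            return -1;
        while (isdigit((unsigned char)*s)) {
            if (++frac_digits > 3)
                return -1;                   /* finer than 1 pcm */
            frac = frac * 10 + (*s++ - '0');
        }
    }
    if (*s == '\n')
        s++;                                 /* tolerate the newline from echo */
    if (*s != '\0')
        return -1;

    /* Scale the fraction to exactly three digits: ".5" becomes 500. */
    for (; frac_digits < 3; frac_digits++)
        frac *= 10;

    if (whole * PCM_PER_PERCENT + frac > PCM_MAX)
        return -1;
    *pcm_out = (int)(whole * PCM_PER_PERCENT + frac);
    return 0;
}
```

    The integer part would then be what dirty_ratio reports on read, and the
    full pcm value what the read-only companion file reports.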

    Thanks,
    -Kame


  15. Re: [RFC] [PATCH -mm 0/2] memcg: per cgroup dirty_ratio

    KAMEZAWA Hiroyuki wrote:
    > On Tue, 07 Oct 2008 17:49:49 +0200
    > Andrea Righi wrote:
    >
    >> Balbir Singh wrote:
    >>> Michael Rubin wrote:
    >>>> On Fri, Sep 12, 2008 at 1:18 PM, Andrew Morton
    >>>> wrote:
    >>>>> One thing to think about please: Michael Rubin is hitting problems with
    >>>>> the existing /proc/sys/vm/dirty-ratio. Its present granularity of 1%
    >>>>> is just too coarse for really large machines, and as
    >>>>> memory-size/disk-speed ratios continue to increase, this will just get
    >>>>> worse.
    >>>> Re-sending since I top-posted before. Never again. Also adding more
    >>>> thoughts on a byte based interface.
    >>>>
    >>>> Currently the problem we are hitting is that we cannot specify pdflush
    >>>> to have background limits less than 1% of memory. I am currently
    >>>> finishing up a patch right now that adds a dirty_ratio_millis
    >>>> interface. I hope to submit the patch to LKML by the end of the week.
    >>>>
    >>>> The idea is that we don't want to break backwards compatibility and we
    >>>> also don't want to have two conflicting knobs in the sysctl or
    >>>> /proc/sys/vm/ space. I thought adding a new knob for those who want to
    >>>> specify finer grained functionality was a compromise. So the patch has
    >>>> a vm_dirty_ratio and a vm_dirty_ratio_millis interface. The first to
    >>>> specify 0-100% and the second to specify .0 to .999%.
    >>>>
    >>>> So to represent 0.125% of RAM we set
    >>>> vm_dirty_ratio = 0
    >>>> vm_dirty_ratio_millis = 125
    >>>>
    >>>> The same for the background_ratio.
    >>>>
    >>>> I would also prefer using a bytes interface but I am not sure how to
    >>>> offer that without either removing the legacy interface of the ratios
    >>>> or by offering a concurrent interface that might be confusing such as
    >>>> when users are looking at the old one and not aware of a new one.
    >>>>
    >>> Just provide a vm_dirty_ratio_in_bytes interface and keep it in sync with
    >>> vm_dirty_ratio (they are just two representations of the same internal value)
    >>> and for higher resolution propose that users use the bytes interface.

    >> Hi Balbir,
    >>
    >> now that I read carefully the documentation, the description in
    >> Documentation/filesystems/proc.txt seems to be a bit misleading. In
    >> proc.txt we say that dirty_ratio and dirty_background_ratio are "a
    >> percentage of total system memory", but in mm/page-writeback.c we apply
    >> the percentages to the dirtyable memory: free pages + reclaimable pages.
    >> So, first of all I think we should clarify this in the documentation...
    >>
    >> Saying that, keeping in sync the vm_dirty_amount_in_bytes according to
    >> dirty_ratio_in_percentage is not a trivial task. One is a static value,
    >> the other depends on the dirtyable memory in the system. If we want to
    >> preserve the same behaviour we should do the following:
    >>
    >> dirty_ratio = x => dirty_amount_in_bytes = x * dirtyable_memory / 100
    >>
    >> dirty_amount_in_bytes = y => dirty_ratio = y / dirtyable_memory * 100
    >>
    >> But anytime the dirtyable memory (or the total memory in the system)
    >> changes we should update both values accordingly to preserve the
    >> coherency between them (ouch!).
    >>


    I see what you mean.
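
    The two conversions quoted above can be sketched as follows, to make the
    coherency problem concrete. This is only an illustration under stated
    assumptions: the function names are hypothetical, the amount of dirtyable
    memory is passed in as a plain byte count, and the ratio is a whole
    percentage.

```c
#include <assert.h>
#include <stdint.h>

/* dirty_ratio = x  =>  dirty_amount_in_bytes = x * dirtyable_memory / 100 */
uint64_t bytes_from_ratio(unsigned int ratio, uint64_t dirtyable_bytes)
{
    assert(ratio <= 100);
    /* Fits in 64 bits as long as dirtyable_bytes is below 2^64 / 100. */
    return (uint64_t)ratio * dirtyable_bytes / 100;
}

/* dirty_amount_in_bytes = y  =>  dirty_ratio = y / dirtyable_memory * 100,
 * computed as y * 100 / dirtyable_memory so the integer division does not
 * truncate to zero first. */
unsigned int ratio_from_bytes(uint64_t bytes, uint64_t dirtyable_bytes)
{
    assert(dirtyable_bytes > 0 && bytes <= dirtyable_bytes);
    return (unsigned int)(bytes * 100 / dirtyable_bytes);
}
```

    With 8,000,000,000 bytes of dirtyable memory, a ratio of 10 corresponds to
    800,000,000 bytes. If dirtyable memory later drops to 4,000,000,000 bytes,
    the same byte value reads back as a ratio of 20, while the same ratio reads
    back as 400,000,000 bytes. One of the two knobs has to move, which is the
    "ouch" in the quoted text.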

    >> Possible solutions:
    >>
    >> 1) introduce fine-grained dirty_ratio handling decimals by an opportune
    >> parser (disadvantage: this would break the compatibility with all the
    >> userspace apps that expect to read an int from vm_dirty_ratio)
    >>
    >> 2) introduce dirty_ratio + dirty_ratio_millis (disadvantage: can
    >> generate unexpected behaviours when something is written to
    >> dirty_ratio ignoring the existence of dirty_ratio_millis)
    >>
    >> 3) introduce dirty_ratio + dirty_amount_in_bytes mutually exclusive,
    >> writing to one automatically "disable" the other (disadvantage:
    >> writing to dirty_ratio ignoring dirty_amount_in_bytes can cause
    >> unexpected behaviours)
    >>
    >> 4) introduce dirty_ratio + dirty_amount_in_bytes and change the
    >> old behaviour: when something is written to dirty_ratio,
    >> dirty_amount_in_bytes is evaluated in function of totalram_pages (or
    >> the memcg limit) and then we always use this static value, instead of
    >> something that depends on the dirtyable memory - we can easily update
    >> dirty_amount_in_bytes also when totalram_pages or the memcg limit
    >> changes (disadvantage: change an old - working - behaviour).
    >>
    >> 5) handle fine-grained dirty_ratio decimals by an opportune parser when
    >> writing something to dirty_ratio; export the percentage units via
    >> dirty_ratio, and the decimals via dirty_ratio_decimals; writing to
    >> dirty_ratio_decimals is not allowed.
    >>
    >> I tend to choose 5. The same for dirty_background_ratio.
    >>

    >
    > Hmm... I agree with "5". Like this?
    > ==
    > provides
    > - vm.dirty_ratio (1/100)
    > - vm.dirty_ratio_percentmille (1/100,000, pcm)
    >
    > and allows
    > # echo 0.05 > vm/dirty_ratio
    > # cat vm/dirty_ratio
    > 0
    > # cat vm/dirty_ratio_percentmille
    > 50
    > ==


    I guess this would be the easiest way forward. I'll let you choose the
    granularity of the interface and its meaning.
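
    To put a number on "granularity", the sketch below computes the smallest
    threshold change each unit allows. The helper name and the 64 GiB figure
    are assumptions chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Smallest threshold change, in bytes, for an interface that divides
 * dirtyable memory into units_per_whole steps:
 * 100 for percent, 100000 for percent mille (pcm).
 */
uint64_t step_bytes(uint64_t dirtyable_bytes, uint32_t units_per_whole)
{
    assert(units_per_whole > 0);
    return dirtyable_bytes / units_per_whole;
}
```

    For 64 GiB of dirtyable memory, step_bytes(64ULL << 30, 100) is about
    655 MiB per step, while step_bytes(64ULL << 30, 100000) is about 671 KiB.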


    --
    Balbir

  16. [PATCH -mm] page-writeback: fine-grained dirty_ratio and dirty_background_ratio

    The current granularity of 5% of dirtyable memory for dirty pages writeback is
    too coarse for large memory machines and this will get worse as
    memory-size/disk-speed ratio continues to increase.

    These large writebacks can be unpleasant for desktop or latency-sensitive
    environments, where the time to complete a writeback can be perceived as a
    lack of responsiveness by the whole system.

    So a way to define finer-grained settings is needed.

    What follows is a solution similar to the one discussed in [1], but I tried
    to simplify things a little in order to provide the same functionality (in
    particular, to avoid backward compatibility problems) and to reduce the
    amount of code needed to implement an in-kernel parser that handles
    percentages with decimal digits.

    The kernel provides the following parameters:
    - dirty_ratio, dirty_background_ratio in percentage
    (1 ... 100)
    - dirty_ratio_pcm, dirty_background_ratio_pcm in units of percent mille
    (1 ... 100,000)

    Both dirty_ratio and dirty_ratio_pcm refer to the same vm_dirty_ratio variable,
    only the interface to read/write this value is different. The same is valid for
    dirty_background_ratio and dirty_background_ratio_pcm.

    In this way it's possible to provide a fine grained interface to configure the
    writeback policy and at the same time preserve the compatibility with the old
    coarse grained dirty_ratio / dirty_background_ratio users.

    Examples:
    # echo 5 > /proc/sys/vm/dirty_ratio
    # cat /proc/sys/vm/dirty_ratio
    5
    # cat /proc/sys/vm/dirty_ratio_pcm
    5000

    # echo 500 > /proc/sys/vm/dirty_ratio_pcm
    # cat /proc/sys/vm/dirty_ratio
    0
    # cat /proc/sys/vm/dirty_ratio_pcm
    500

    # echo 5500 > /proc/sys/vm/dirty_ratio_pcm
    # cat /proc/sys/vm/dirty_ratio
    5
    # cat /proc/sys/vm/dirty_ratio_pcm
    5500
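
    The examples above follow from both files being views of one stored
    integer. A userspace model of that behaviour is sketched here. PERCENT_PCM
    and ONE_HUNDRED_PCM are the names used in this patch, while the four helper
    functions are hypothetical and only mirror the scaling and range checks.

```c
#include <assert.h>
#include <stddef.h>

#define PERCENT_PCM 1000
#define ONE_HUNDRED_PCM (100 * PERCENT_PCM)

/* Write through the percent view: range-check, then scale up to pcm. */
int store_percent(int *stored_pcm, int percent)
{
    assert(stored_pcm != NULL);
    if (percent < 1 || percent > 100)
        return -1;
    *stored_pcm = percent * PERCENT_PCM;
    return 0;
}

/* Write through the pcm view: range-check, then store unchanged. */
int store_pcm(int *stored_pcm, int pcm)
{
    assert(stored_pcm != NULL);
    if (pcm < 1 || pcm > ONE_HUNDRED_PCM)
        return -1;
    *stored_pcm = pcm;
    return 0;
}

/* Read through the percent view: integer division drops the fraction. */
int read_percent(int stored_pcm)
{
    return stored_pcm / PERCENT_PCM;
}

/* Read through the pcm view: the stored value as is. */
int read_pcm(int stored_pcm)
{
    return stored_pcm;
}
```

    One consequence of the truncating read is that a value read from the
    percent view and written back loses its fractional part: 5500 pcm reads as
    5 and would be stored again as 5000.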

    [1] http://lkml.org/lkml/2008/10/7/230

    Signed-off-by: Andrea Righi
    ---
    Documentation/filesystems/proc.txt | 20 +++++++++
    include/linux/sysctl.h | 7 +++
    kernel/sysctl.c | 80 +++++++++++++++++++++++++++++++++--
    kernel/sysctl_check.c | 3 +
    mm/page-writeback.c | 29 ++++++++++---
    5 files changed, 128 insertions(+), 11 deletions(-)

    diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
    index 394eb2c..95f31f5 100644
    --- a/Documentation/filesystems/proc.txt
    +++ b/Documentation/filesystems/proc.txt
    @@ -1383,6 +1383,16 @@ dirty_background_ratio
    Contains, as a percentage of total system memory, the number of pages at which
    the pdflush background writeback daemon will start writing out dirty data.

    +dirty_background_ratio_pcm
    +--------------------------
    +
    +A fine-grained interface to configure dirty_background_ratio.
    +
    +Contains, as a percentage in units of pcm (percent mille) of the dirtyable
    +system memory (free pages + mapped pages + file cache, not including locked
    +pages and HugePages), the number of pages at which the pdflush background
    +writeback daemon will start writing out dirty data.
    +
    dirty_ratio
    -----------------

    @@ -1390,6 +1400,16 @@ Contains, as a percentage of total system memory, the number of pages at which
    a process which is generating disk writes will itself start writing out dirty
    data.

    +dirty_ratio_pcm
    +---------------
    +
    +A fine-grained interface to configure dirty_ratio.
    +
    +Contains, as a percentage in units of pcm (percent mille) of the dirtyable
    +system memory (free pages + mapped pages + file cache, not including locked
    +pages and HugePages), the number of pages at which a process which is
    +generating disk writes will itself start writing out dirty data.
    +
    dirty_writeback_centisecs
    -------------------------

    diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
    index 39d471d..799594b 100644
    --- a/include/linux/sysctl.h
    +++ b/include/linux/sysctl.h
    @@ -32,6 +32,9 @@
    struct file;
    struct completion;

    +#define PERCENT_PCM 1000
    +#define ONE_HUNDRED_PCM (100 * PERCENT_PCM)
    +
    #define CTL_MAXNAME 10 /* how many path components do we allow in a
    call to sysctl? In other words, what is
    the largest acceptable value for the nlen
    @@ -205,6 +208,8 @@ enum
    VM_PANIC_ON_OOM=33, /* panic at out-of-memory */
    VM_VDSO_ENABLED=34, /* map VDSO into new processes? */
    VM_MIN_SLAB=35, /* Percent pages ignored by zone reclaim */
    + VM_DIRTY_BACKGROUND_PCM = 36, /* fine-grained dirty_background_ratio */
    + VM_DIRTY_RATIO_PCM = 37, /* fine-grained dirty_ratio */
    };


    @@ -991,6 +996,8 @@ extern int proc_dointvec_userhz_jiffies(struct ctl_table *, int, struct file *,
    void __user *, size_t *, loff_t *);
    extern int proc_dointvec_ms_jiffies(struct ctl_table *, int, struct file *,
    void __user *, size_t *, loff_t *);
    +extern int proc_dointvec_pcm_minmax(struct ctl_table *, int, struct file *,
    + void __user *, size_t *, loff_t *);
    extern int proc_doulongvec_minmax(struct ctl_table *, int, struct file *,
    void __user *, size_t *, loff_t *);
    extern int proc_doulongvec_ms_jiffies_minmax(struct ctl_table *table, int,
    diff --git a/kernel/sysctl.c b/kernel/sysctl.c
    index fcd66f1..e22ab48 100644
    --- a/kernel/sysctl.c
    +++ b/kernel/sysctl.c
    @@ -89,9 +89,7 @@ extern int rcutorture_runnable;
    #endif /* #ifdef CONFIG_RCU_TORTURE_TEST */

    /* Constants used for minimum and maximum */
    -#if defined(CONFIG_HIGHMEM) || defined(CONFIG_DETECT_SOFTLOCKUP)
    static int one = 1;
    -#endif

    #ifdef CONFIG_DETECT_SOFTLOCKUP
    static int sixty = 60;
    @@ -104,6 +102,7 @@ static int two = 2;

    static int zero;
    static int one_hundred = 100;
    +static int one_hundred_pcm = ONE_HUNDRED_PCM;

    /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
    static int maxolduid = 65535;
    @@ -910,12 +909,23 @@ static struct ctl_table vm_table[] = {
    .data = &dirty_background_ratio,
    .maxlen = sizeof(dirty_background_ratio),
    .mode = 0644,
    - .proc_handler = &proc_dointvec_minmax,
    + .proc_handler = &proc_dointvec_pcm_minmax,
    .strategy = &sysctl_intvec,
    - .extra1 = &zero,
    + .extra1 = &one,
    .extra2 = &one_hundred,
    },
    {
    + .ctl_name = VM_DIRTY_BACKGROUND_PCM,
    + .procname = "dirty_background_ratio_pcm",
    + .data = &dirty_background_ratio,
    + .maxlen = sizeof(dirty_background_ratio),
    + .mode = 0644,
    + .proc_handler = &proc_dointvec_minmax,
    + .strategy = &sysctl_intvec,
    + .extra1 = &one,
    + .extra2 = &one_hundred_pcm,
    + },
    + {
    .ctl_name = VM_DIRTY_RATIO,
    .procname = "dirty_ratio",
    .data = &vm_dirty_ratio,
    @@ -923,10 +933,21 @@ static struct ctl_table vm_table[] = {
    .mode = 0644,
    .proc_handler = &dirty_ratio_handler,
    .strategy = &sysctl_intvec,
    - .extra1 = &zero,
    + .extra1 = &one,
    .extra2 = &one_hundred,
    },
    {
    + .ctl_name = VM_DIRTY_RATIO_PCM,
    + .procname = "dirty_ratio_pcm",
    + .data = &vm_dirty_ratio,
    + .maxlen = sizeof(vm_dirty_ratio),
    + .mode = 0644,
    + .proc_handler = &dirty_ratio_handler,
    + .strategy = &sysctl_intvec,
    + .extra1 = &one,
    + .extra2 = &one_hundred_pcm,
    + },
    + {
    .procname = "dirty_writeback_centisecs",
    .data = &dirty_writeback_interval,
    .maxlen = sizeof(dirty_writeback_interval),
    @@ -2539,6 +2560,35 @@ int proc_doulongvec_ms_jiffies_minmax(struct ctl_table *table, int write,
    lenp, ppos, HZ, 1000l);
    }

    +static int do_proc_dointvec_pcm_minmax_conv(int *negp, unsigned long *lvalp,
    + int *valp, int write, void *data)
    +{
    + struct do_proc_dointvec_minmax_conv_param *param = data;
    + int val;
    +
    + if (write) {
    + if (*lvalp > LONG_MAX / PERCENT_PCM)
    + return -EINVAL;
    + val = *negp ? -*lvalp : *lvalp;
    + if ((param->min && *param->min > val) ||
    + (param->max && *param->max < val))
    + return -EINVAL;
    + *valp = val * PERCENT_PCM;
    + } else {
    + unsigned long lval;
    +
    + val = *valp;
    + if (val < 0) {
    + *negp = -1;
    + lval = (unsigned long)-val;
    + } else {
    + *negp = 0;
    + lval = (unsigned long)val;
    + }
    + *lvalp = lval / PERCENT_PCM;
    + }
    + return 0;
    +}

    static int do_proc_dointvec_jiffies_conv(int *negp, unsigned long *lvalp,
    int *valp,
    @@ -2677,6 +2727,19 @@ int proc_dointvec_ms_jiffies(struct ctl_table *table, int write, struct file *fi
    do_proc_dointvec_ms_jiffies_conv, NULL);
    }

    +int proc_dointvec_pcm_minmax(struct ctl_table *table, int write,
    + struct file *filp, void __user *buffer, size_t *lenp,
    + loff_t *ppos)
    +{
    + struct do_proc_dointvec_minmax_conv_param param = {
    + .min = (int *)table->extra1,
    + .max = (int *)table->extra2,
    + };
    +
    + return do_proc_dointvec(table, write, filp, buffer, lenp, ppos,
    + do_proc_dointvec_pcm_minmax_conv, &param);
    +}
    +
    static int proc_do_cad_pid(struct ctl_table *table, int write, struct file *filp,
    void __user *buffer, size_t *lenp, loff_t *ppos)
    {
    @@ -2725,6 +2788,13 @@ int proc_dointvec_jiffies(struct ctl_table *table, int write, struct file *filp,
    return -ENOSYS;
    }

    +int proc_dointvec_pcm_minmax(struct ctl_table *table, int write,
    + struct file *filp, void __user *buffer, size_t *lenp,
    + loff_t *ppos)
    +{
    + return -ENOSYS;
    +}
    +
    int proc_dointvec_userhz_jiffies(struct ctl_table *table, int write, struct file *filp,
    void __user *buffer, size_t *lenp, loff_t *ppos)
    {
    diff --git a/kernel/sysctl_check.c b/kernel/sysctl_check.c
    index c35da23..83934a8 100644
    --- a/kernel/sysctl_check.c
    +++ b/kernel/sysctl_check.c
    @@ -111,7 +111,9 @@ static const struct trans_ctl_table trans_vm_table[] = {
    { VM_OVERCOMMIT_MEMORY, "overcommit_memory" },
    { VM_PAGE_CLUSTER, "page-cluster" },
    { VM_DIRTY_BACKGROUND, "dirty_background_ratio" },
    + { VM_DIRTY_BACKGROUND_PCM, "dirty_background_ratio_pcm" },
    { VM_DIRTY_RATIO, "dirty_ratio" },
    + { VM_DIRTY_RATIO_PCM, "dirty_ratio_pcm" },
    { VM_DIRTY_WB_CS, "dirty_writeback_centisecs" },
    { VM_DIRTY_EXPIRE_CS, "dirty_expire_centisecs" },
    { VM_NR_PDFLUSH_THREADS, "nr_pdflush_threads" },
    @@ -1494,6 +1496,7 @@ int sysctl_check_table(struct nsproxy *namespaces, struct ctl_table *table)
    (table->proc_handler == proc_dostring) ||
    (table->proc_handler == proc_dointvec) ||
    (table->proc_handler == proc_dointvec_minmax) ||
    + (table->proc_handler == proc_dointvec_pcm_minmax) ||
    (table->proc_handler == proc_dointvec_jiffies) ||
    (table->proc_handler == proc_dointvec_userhz_jiffies) ||
    (table->proc_handler == proc_dointvec_ms_jiffies) ||
    diff --git a/mm/page-writeback.c b/mm/page-writeback.c
    index c6d6088..6bc8c9b 100644
    --- a/mm/page-writeback.c
    +++ b/mm/page-writeback.c
    @@ -66,7 +66,7 @@ static inline long sync_writeback_pages(void)
    /*
    * Start background writeback (via pdflush) at this percentage
    */
    -int dirty_background_ratio = 5;
    +int dirty_background_ratio = 5 * PERCENT_PCM;

    /*
    * free highmem will not be subtracted from the total free memory
    @@ -77,7 +77,7 @@ int vm_highmem_is_dirtyable;
    /*
    * The generator of dirty data starts writeback at this percentage
    */
    -int vm_dirty_ratio = 10;
    +int vm_dirty_ratio = 10 * PERCENT_PCM;

    /*
    * The interval between `kupdate'-style writebacks, in jiffies
    @@ -135,7 +135,8 @@ static int calc_period_shift(void)
    {
    unsigned long dirty_total;

    - dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) / 100;
    + dirty_total = (vm_dirty_ratio * determine_dirtyable_memory())
    + / ONE_HUNDRED_PCM;
    return 2 + ilog2(dirty_total - 1);
    }

    @@ -147,7 +148,23 @@ int dirty_ratio_handler(struct ctl_table *table, int write,
    loff_t *ppos)
    {
    int old_ratio = vm_dirty_ratio;
    - int ret = proc_dointvec_minmax(table, write, filp, buffer, lenp, ppos);
    + int ret;
    +
    + switch (table->ctl_name) {
    + case VM_DIRTY_RATIO:
    + ret = proc_dointvec_pcm_minmax(table, write, filp, buffer,
    + lenp, ppos);
    + break;
    + case VM_DIRTY_RATIO_PCM:
    + ret = proc_dointvec_minmax(table, write, filp, buffer,
    + lenp, ppos);
    + break;
    + default:
    + ret = -EINVAL;
    + WARN_ON(1);
    + break;
    + }
    +
    if (ret == 0 && write && vm_dirty_ratio != old_ratio) {
    int shift = calc_period_shift();
    prop_change_shift(&vm_completions, shift);
    @@ -380,8 +397,8 @@ get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
    if (background_ratio >= dirty_ratio)
    background_ratio = dirty_ratio / 2;

    - background = (background_ratio * available_memory) / 100;
    - dirty = (dirty_ratio * available_memory) / 100;
    + background = (background_ratio * available_memory) / ONE_HUNDRED_PCM;
    + dirty = (dirty_ratio * available_memory) / ONE_HUNDRED_PCM;
    tsk = current;
    if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
    background += background / 4;

  17. Re: [PATCH -mm] page-writeback: fine-grained dirty_ratio and dirty_background_ratio

    On Thu, 09 Oct 2008 17:29:46 +0200
    Andrea Righi wrote:

    > The current granularity of 5% of dirtyable memory for dirty pages writeback is
    > too coarse for large memory machines and this will get worse as
    > memory-size/disk-speed ratio continues to increase.
    >
    > These large writebacks can be unpleasant for desktop or latency-sensitive
    > environments, where the time to complete a writeback can be perceived as a
    > lack of responsiveness by the whole system.
    >
    > So a way to define finer-grained settings is needed.
    >
    > What follows is a solution similar to the one discussed in [1], but I tried
    > to simplify things a little in order to provide the same functionality (in
    > particular, to avoid backward compatibility problems) and to reduce the
    > amount of code needed to implement an in-kernel parser that handles
    > percentages with decimal digits.
    >
    > The kernel provides the following parameters:
    > - dirty_ratio, dirty_background_ratio in percentage
    > (1 ... 100)
    > - dirty_ratio_pcm, dirty_background_ratio_pcm in units of percent mille
    > (1 ... 100,000)
    >
    > Both dirty_ratio and dirty_ratio_pcm refer to the same vm_dirty_ratio variable,
    > only the interface to read/write this value is different. The same is valid for
    > dirty_background_ratio and dirty_background_ratio_pcm.
    >
    > In this way it's possible to provide a fine grained interface to configure the
    > writeback policy and at the same time preserve the compatibility with the old
    > coarse grained dirty_ratio / dirty_background_ratio users.
    >
    > Examples:
    > # echo 5 > /proc/sys/vm/dirty_ratio
    > # cat /proc/sys/vm/dirty_ratio
    > 5
    > # cat /proc/sys/vm/dirty_ratio_pcm
    > 5000
    >
    > # echo 500 > /proc/sys/vm/dirty_ratio_pcm
    > # cat /proc/sys/vm/dirty_ratio
    > 0
    > # cat /proc/sys/vm/dirty_ratio_pcm
    > 500
    >
    > # echo 5500 > /proc/sys/vm/dirty_ratio_pcm
    > # cat /proc/sys/vm/dirty_ratio
    > 5
    > # cat /proc/sys/vm/dirty_ratio_pcm
    > 5500
    >

    I like this. Thanks.



    > -int dirty_background_ratio = 5;
    > +int dirty_background_ratio = 5 * PERCENT_PCM;
    >
    > /*
    > * free highmem will not be subtracted from the total free memory
    > @@ -77,7 +77,7 @@ int vm_highmem_is_dirtyable;
    > /*
    > * The generator of dirty data starts writeback at this percentage
    > */
    > -int vm_dirty_ratio = 10;
    > +int vm_dirty_ratio = 10 * PERCENT_PCM;
    >
    > /*
    > * The interval between `kupdate'-style writebacks, in jiffies
    > @@ -135,7 +135,8 @@ static int calc_period_shift(void)
    > {
    > unsigned long dirty_total;
    >
    > - dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) / 100;
    > + dirty_total = (vm_dirty_ratio * determine_dirtyable_memory())
    > + / ONE_HUNDRED_PCM;
    > return 2 + ilog2(dirty_total - 1);
    > }
    >

    I wonder... doesn't this overflow on 32-bit systems?

    Thanks,
    -Kame



  18. Re: [PATCH -mm] page-writeback: fine-grained dirty_ratio and dirty_background_ratio

    KAMEZAWA Hiroyuki wrote:
    >
    >
    >> -int dirty_background_ratio = 5;
    >> +int dirty_background_ratio = 5 * PERCENT_PCM;
    >>
    >> /*
    >> * free highmem will not be subtracted from the total free memory
    >> @@ -77,7 +77,7 @@ int vm_highmem_is_dirtyable;
    >> /*
    >> * The generator of dirty data starts writeback at this percentage
    >> */
    >> -int vm_dirty_ratio = 10;
    >> +int vm_dirty_ratio = 10 * PERCENT_PCM;
    >>
    >> /*
    >> * The interval between `kupdate'-style writebacks, in jiffies
    >> @@ -135,7 +135,8 @@ static int calc_period_shift(void)
    >> {
    >> unsigned long dirty_total;
    >>
    >> - dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) / 100;
    >> + dirty_total = (vm_dirty_ratio * determine_dirtyable_memory())
    >> + / ONE_HUNDRED_PCM;
    >> return 2 + ilog2(dirty_total - 1);
    >> }
    >>

    > I wonder... doesn't this overflow on 32-bit systems?


    Correct! The worst case is (in pages):

    4GB = 100,000 * determine_dirtyable_memory()

    which means 42950 pages (~168MB) of dirtyable memory are enough to overflow.
    Using a u64 for dirty_total should resolve this.
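
    The arithmetic above can be checked with a small userspace model. This is a
    sketch under stated assumptions, not the kernel code: uint32_t stands in
    for a 32-bit unsigned long, and all three function names are hypothetical.
    Note that in C a multiplication is carried out in the type of its operands,
    so the 64-bit variant widens one operand before multiplying rather than
    only widening the variable that receives the result.

```c
#include <assert.h>
#include <stdint.h>

#define ONE_HUNDRED_PCM 100000u

/* 32-bit model: the product wraps modulo 2^32 before the division. */
uint32_t dirty_total_32(uint32_t ratio_pcm, uint32_t dirtyable_pages)
{
    return (ratio_pcm * dirtyable_pages) / ONE_HUNDRED_PCM;
}

/* Widen one operand first so the whole product is computed in 64 bits. */
uint64_t dirty_total_64(uint32_t ratio_pcm, uint32_t dirtyable_pages)
{
    assert(ratio_pcm <= ONE_HUNDRED_PCM);
    return ((uint64_t)ratio_pcm * dirtyable_pages) / ONE_HUNDRED_PCM;
}

/* Largest page count whose product with ratio_pcm still fits in 32 bits. */
uint32_t max_pages_without_wrap(uint32_t ratio_pcm)
{
    assert(ratio_pcm > 0);
    return UINT32_MAX / ratio_pcm;
}
```

    With ratio_pcm = 100000 the last safe value is 42949 pages, so 42950 pages
    is the first count that wraps, which matches the figure above. At a ratio
    of 10000 pcm (10%) the limit is 429496 pages.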

    Delta patch is below.

    Unfortunately I only have 64-bit machines at hand right now. Maybe tomorrow
    I'll be able to get a 32-bit box, unless someone tests this first.

    Thanks!
    -Andrea

    ---
    Subject: fix overflow in 32-bit systems using fine-grained dirty_ratio

    Signed-off-by: Andrea Righi
    Signed-off-by: KAMEZAWA Hiroyuki
    ---
    mm/page-writeback.c | 2 +-
    1 files changed, 1 insertions(+), 1 deletions(-)

    diff --git a/mm/page-writeback.c b/mm/page-writeback.c
    index 6bc8c9b..29913e5 100644
    --- a/mm/page-writeback.c
    +++ b/mm/page-writeback.c
    @@ -133,7 +133,7 @@ static struct prop_descriptor vm_dirties;
    */
    static int calc_period_shift(void)
    {
    - unsigned long dirty_total;
    + u64 dirty_total;

    dirty_total = (vm_dirty_ratio * determine_dirtyable_memory())
    / ONE_HUNDRED_PCM;

  19. Re: [PATCH -mm] page-writeback: fine-grained dirty_ratio and dirty_background_ratio

    Andrea Righi wrote:
    > KAMEZAWA Hiroyuki wrote:
    >>
    >>
    >>> -int dirty_background_ratio = 5;
    >>> +int dirty_background_ratio = 5 * PERCENT_PCM;
    >>>
    >>> /*
    >>> * free highmem will not be subtracted from the total free memory
    >>> @@ -77,7 +77,7 @@ int vm_highmem_is_dirtyable;
    >>> /*
    >>> * The generator of dirty data starts writeback at this percentage
    >>> */
    >>> -int vm_dirty_ratio = 10;
    >>> +int vm_dirty_ratio = 10 * PERCENT_PCM;
    >>>
    >>> /*
    >>> * The interval between `kupdate'-style writebacks, in jiffies
    >>> @@ -135,7 +135,8 @@ static int calc_period_shift(void)
    >>> {
    >>> unsigned long dirty_total;
    >>>
    >>> - dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) / 100;
    >>> + dirty_total = (vm_dirty_ratio * determine_dirtyable_memory())
    >>> + / ONE_HUNDRED_PCM;
    >>> return 2 + ilog2(dirty_total - 1);
    >>> }
    >>>

    >> I wonder... doesn't this overflow on 32-bit systems?

    >
    > Correct! The worst case is (in pages):
    >
    > 4GB = 100,000 * determine_dirtyable_memory()
    >
    > which means 42950 pages (~168MB) of dirtyable memory are enough to overflow.
    > Using a u64 for dirty_total should resolve this.
    >
    > Delta patch is below.
    >
    > Unfortunately I only have 64-bit machines at hand right now. Maybe tomorrow
    > I'll be able to get a 32-bit box, unless someone tests this first.
    >
    > Thanks!
    > -Andrea


    I was able to sort this out quickly by creating a 1GB i386 VM with kvm.

    Everything seems to work fine, and with the following fix it doesn't overflow.

    -Andrea


    >
    > ---
    > Subject: fix overflow in 32-bit systems using fine-grained dirty_ratio
    >
    > Signed-off-by: Andrea Righi
    > Signed-off-by: KAMEZAWA Hiroyuki
    > ---
    > mm/page-writeback.c | 2 +-
    > 1 files changed, 1 insertions(+), 1 deletions(-)
    >
    > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
    > index 6bc8c9b..29913e5 100644
    > --- a/mm/page-writeback.c
    > +++ b/mm/page-writeback.c
    > @@ -133,7 +133,7 @@ static struct prop_descriptor vm_dirties;
    > */
    > static int calc_period_shift(void)
    > {
    > - unsigned long dirty_total;
    > + u64 dirty_total;
    >
    > dirty_total = (vm_dirty_ratio * determine_dirtyable_memory())
    > / ONE_HUNDRED_PCM;


  20. [PATCH -mm] mm: fine-grained dirty_ratio_pcm and dirty_background_ratio_pcm (v2)

    The current granularity of 5% of dirtyable memory for dirty pages writeback is
    too coarse for large memory machines and this will get worse as
    memory-size/disk-speed ratio continues to increase.

    These large writebacks can be unpleasant for desktop or latency-sensitive
    environments, where the time to complete each writeback can be perceived as a
    lack of responsiveness by the whole system.

    What follows is a solution similar to the one discussed in [1], but
    simplified a little in order to provide the same functionality (in
    particular, to avoid backward compatibility problems) and to reduce the
    amount of code needed to implement an in-kernel parser that handles
    percentages with decimal digits.

    The kernel provides the following parameters:
    - dirty_ratio, dirty_background_ratio in percentage (1 ... 100)
    - dirty_ratio_pcm, dirty_background_ratio_pcm in units of percent mille (1 ... 100,000)

    Both dirty_ratio and dirty_ratio_pcm refer to the same vm_dirty_ratio variable;
    only the interface used to read and write the value differs. The same applies
    to dirty_background_ratio.

    In this way it's possible to provide a fine-grained interface to configure the
    writeback policy and at the same time preserve the compatibility with the old
    dirty_ratio / dirty_background_ratio users.
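
    As a rough illustration of the granularity gained, the threshold
    arithmetic (ratio * dirtyable pages / ONE_HUNDRED_PCM, as in the
    get_dirty_limits() hunk of this patch) can be sketched in userspace C.
    The helper name dirty_threshold_pages() and the 16 GiB / 4 KiB-page
    figures are assumptions made for the example, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Same constants as the patch adds to include/linux/sysctl.h. */
#define PERCENT_PCM     1000
#define ONE_HUNDRED_PCM (100 * PERCENT_PCM)

/*
 * Hypothetical helper: number of pages at which writeback starts, for a
 * ratio expressed in pcm. The product is widened to 64 bits before the
 * multiplication so it cannot wrap for realistic page counts.
 */
static inline uint64_t dirty_threshold_pages(uint32_t ratio_pcm,
                                             uint64_t dirtyable_pages)
{
    assert(ratio_pcm <= ONE_HUNDRED_PCM);
    return ((uint64_t)ratio_pcm * dirtyable_pages) / ONE_HUNDRED_PCM;
}
```

    With 4194304 dirtyable pages (16 GiB of 4 KiB pages, an assumed
    configuration), dirty_threshold_pages(1000, 4194304) gives 41943 pages,
    about 164 MiB per 1% step, while dirty_threshold_pages(1, 4194304) gives
    41 pages per pcm step.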

    Examples:
    # echo 5 > /proc/sys/vm/dirty_ratio
    # cat /proc/sys/vm/dirty_ratio
    5
    # cat /proc/sys/vm/dirty_ratio_pcm
    5000

    # echo 500 > /proc/sys/vm/dirty_ratio_pcm
    # cat /proc/sys/vm/dirty_ratio
    0
    # cat /proc/sys/vm/dirty_ratio_pcm
    500

    # echo 5500 > /proc/sys/vm/dirty_ratio_pcm
    # cat /proc/sys/vm/dirty_ratio
    5
    # cat /proc/sys/vm/dirty_ratio_pcm
    5500
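
    The values in the examples above follow from a plain scale-by-1000
    conversion with truncating integer division on the read side. A minimal
    sketch, in which percent_to_pcm() and pcm_to_percent() are hypothetical
    names used only for illustration:

```c
#include <assert.h>

/* Same constants as the patch adds to include/linux/sysctl.h. */
#define PERCENT_PCM     1000
#define ONE_HUNDRED_PCM (100 * PERCENT_PCM)

/* Hypothetical helper: value stored when N is written to dirty_ratio. */
static inline int percent_to_pcm(int percent)
{
    assert(percent >= 0 && percent <= 100);
    return percent * PERCENT_PCM;
}

/*
 * Hypothetical helper: value shown when dirty_ratio is read back.
 * Integer division truncates, so anything below 1000 pcm reads as 0
 * and 5500 pcm reads as 5.
 */
static inline int pcm_to_percent(int pcm)
{
    assert(pcm >= 0 && pcm <= ONE_HUNDRED_PCM);
    return pcm / PERCENT_PCM;
}
```

    Reading through the coarse interface therefore loses the fractional
    part, but the stored pcm value is unchanged.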

    Changelog: (v1 -> v2)

    * fix overflow on 32-bit systems (calc_period_shift needs a u64)
    * rebased onto (and tested on) 2.6.28-rc2-mm1

    [1] http://lkml.org/lkml/2008/10/7/230
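
    On the overflow mentioned in the changelog: in C the width of a product
    is decided by its operands, not by the variable it is assigned to. The
    sketch below uses uint32_t to stand in for a 32-bit unsigned long; the
    function names are hypothetical, and whether the kernel expression
    behaves the same way depends on the operand types there.

```c
#include <assert.h>
#include <stdint.h>

#define ONE_HUNDRED_PCM 100000

/*
 * Both operands are 32 bits wide, so the multiplication is done modulo
 * 2^32 and may wrap before the result reaches the 64-bit variable.
 */
static inline uint64_t dirty_total_narrow(uint32_t ratio_pcm, uint32_t pages)
{
    assert(ratio_pcm <= ONE_HUNDRED_PCM);
    uint64_t total = ratio_pcm * pages;            /* 32-bit multiply */
    return total / ONE_HUNDRED_PCM;
}

/* Widening one operand first makes the whole multiplication 64-bit. */
static inline uint64_t dirty_total_wide(uint32_t ratio_pcm, uint32_t pages)
{
    assert(ratio_pcm <= ONE_HUNDRED_PCM);
    uint64_t total = (uint64_t)ratio_pcm * pages;  /* 64-bit multiply */
    return total / ONE_HUNDRED_PCM;
}
```

    With 262144 pages (1 GiB of 4 KiB pages, an assumed figure) both
    variants agree at 10000 pcm, since the product still fits in 32 bits,
    but at 40000 pcm only the widened variant returns the expected 104857.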

    Signed-off-by: Andrea Righi
    ---
    Documentation/filesystems/proc.txt | 20 +++++++++
    include/linux/sysctl.h | 7 +++
    kernel/sysctl.c | 80 +++++++++++++++++++++++++++++++++--
    kernel/sysctl_check.c | 3 +
    mm/page-writeback.c | 31 +++++++++++---
    5 files changed, 129 insertions(+), 12 deletions(-)

    diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
    index bcceb99..38ed5bf 100644
    --- a/Documentation/filesystems/proc.txt
    +++ b/Documentation/filesystems/proc.txt
    @@ -1389,6 +1389,16 @@ pages + file cache, not including locked pages and HugePages), the number of
    pages at which the pdflush background writeback daemon will start writing out
    dirty data.

    +dirty_background_ratio_pcm
    +--------------------------
    +
    +A fine-grained interface to configure dirty_background_ratio.
    +
    +Contains, as a percentage in units of pcm (percent mille) of the dirtyable
    +system memory (free pages + mapped pages + file cache, not including locked
    +pages and HugePages), the number of pages at which the pdflush background
    +writeback daemon will start writing out dirty data.
    +
    dirty_ratio
    -----------------

    @@ -1397,6 +1407,16 @@ pages + file cache, not including locked pages and HugePages), the number of
    pages at which a process which is generating disk writes will itself start
    writing out dirty data.

    +dirty_ratio_pcm
    +---------------
    +
    +A fine-grained interface to configure dirty_ratio.
    +
    +Contains, as a percentage in units of pcm (percent mille) of the dirtyable
    +system memory (free pages + mapped pages + file cache, not including locked
    +pages and HugePages), the number of pages at which a process which is
    +generating disk writes will itself start writing out dirty data.
    +
    dirty_writeback_centisecs
    -------------------------

    diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
    index 39d471d..799594b 100644
    --- a/include/linux/sysctl.h
    +++ b/include/linux/sysctl.h
    @@ -32,6 +32,9 @@
    struct file;
    struct completion;

    +#define PERCENT_PCM 1000
    +#define ONE_HUNDRED_PCM (100 * PERCENT_PCM)
    +
    #define CTL_MAXNAME 10 /* how many path components do we allow in a
    call to sysctl? In other words, what is
    the largest acceptable value for the nlen
    @@ -205,6 +208,8 @@ enum
    VM_PANIC_ON_OOM=33, /* panic at out-of-memory */
    VM_VDSO_ENABLED=34, /* map VDSO into new processes? */
    VM_MIN_SLAB=35, /* Percent pages ignored by zone reclaim */
    + VM_DIRTY_BACKGROUND_PCM = 36, /* fine-grained dirty_background_ratio */
    + VM_DIRTY_RATIO_PCM = 37, /* fine-grained dirty_ratio */
    };


    @@ -991,6 +996,8 @@ extern int proc_dointvec_userhz_jiffies(struct ctl_table *, int, struct file *,
    void __user *, size_t *, loff_t *);
    extern int proc_dointvec_ms_jiffies(struct ctl_table *, int, struct file *,
    void __user *, size_t *, loff_t *);
    +extern int proc_dointvec_pcm_minmax(struct ctl_table *, int, struct file *,
    + void __user *, size_t *, loff_t *);
    extern int proc_doulongvec_minmax(struct ctl_table *, int, struct file *,
    void __user *, size_t *, loff_t *);
    extern int proc_doulongvec_ms_jiffies_minmax(struct ctl_table *table, int,
    diff --git a/kernel/sysctl.c b/kernel/sysctl.c
    index d14953a..06ba902 100644
    --- a/kernel/sysctl.c
    +++ b/kernel/sysctl.c
    @@ -88,9 +88,7 @@ extern int rcutorture_runnable;
    #endif /* #ifdef CONFIG_RCU_TORTURE_TEST */

    /* Constants used for minimum and maximum */
    -#if defined(CONFIG_HIGHMEM) || defined(CONFIG_DETECT_SOFTLOCKUP)
    static int one = 1;
    -#endif

    #ifdef CONFIG_DETECT_SOFTLOCKUP
    static int sixty = 60;
    @@ -103,6 +101,7 @@ static int two = 2;

    static int zero;
    static int one_hundred = 100;
    +static int one_hundred_pcm = ONE_HUNDRED_PCM;

    /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
    static int maxolduid = 65535;
    @@ -926,12 +925,23 @@ static struct ctl_table vm_table[] = {
    .data = &dirty_background_ratio,
    .maxlen = sizeof(dirty_background_ratio),
    .mode = 0644,
    - .proc_handler = &proc_dointvec_minmax,
    + .proc_handler = &proc_dointvec_pcm_minmax,
    .strategy = &sysctl_intvec,
    - .extra1 = &zero,
    + .extra1 = &one,
    .extra2 = &one_hundred,
    },
    {
    + .ctl_name = VM_DIRTY_BACKGROUND_PCM,
    + .procname = "dirty_background_ratio_pcm",
    + .data = &dirty_background_ratio,
    + .maxlen = sizeof(dirty_background_ratio),
    + .mode = 0644,
    + .proc_handler = &proc_dointvec_minmax,
    + .strategy = &sysctl_intvec,
    + .extra1 = &one,
    + .extra2 = &one_hundred_pcm,
    + },
    + {
    .ctl_name = VM_DIRTY_RATIO,
    .procname = "dirty_ratio",
    .data = &vm_dirty_ratio,
    @@ -939,10 +949,21 @@ static struct ctl_table vm_table[] = {
    .mode = 0644,
    .proc_handler = &dirty_ratio_handler,
    .strategy = &sysctl_intvec,
    - .extra1 = &zero,
    + .extra1 = &one,
    .extra2 = &one_hundred,
    },
    {
    + .ctl_name = VM_DIRTY_RATIO_PCM,
    + .procname = "dirty_ratio_pcm",
    + .data = &vm_dirty_ratio,
    + .maxlen = sizeof(vm_dirty_ratio),
    + .mode = 0644,
    + .proc_handler = &dirty_ratio_handler,
    + .strategy = &sysctl_intvec,
    + .extra1 = &one,
    + .extra2 = &one_hundred_pcm,
    + },
    + {
    .procname = "dirty_writeback_centisecs",
    .data = &dirty_writeback_interval,
    .maxlen = sizeof(dirty_writeback_interval),
    @@ -2525,6 +2546,35 @@ int proc_doulongvec_ms_jiffies_minmax(struct ctl_table *table, int write,
    lenp, ppos, HZ, 1000l);
    }

    +static int do_proc_dointvec_pcm_minmax_conv(int *negp, unsigned long *lvalp,
    + int *valp, int write, void *data)
    +{
    + struct do_proc_dointvec_minmax_conv_param *param = data;
    + int val;
    +
    + if (write) {
    + if (*lvalp > LONG_MAX / PERCENT_PCM)
    + return -EINVAL;
    + val = *negp ? -*lvalp : *lvalp;
    + if ((param->min && *param->min > val) ||
    + (param->max && *param->max < val))
    + return -EINVAL;
    + *valp = val * PERCENT_PCM;
    + } else {
    + unsigned long lval;
    +
    + val = *valp;
    + if (val < 0) {
    + *negp = -1;
    + lval = (unsigned long)-val;
    + } else {
    + *negp = 0;
    + lval = (unsigned long)val;
    + }
    + *lvalp = lval / PERCENT_PCM;
    + }
    + return 0;
    +}

    static int do_proc_dointvec_jiffies_conv(int *negp, unsigned long *lvalp,
    int *valp,
    @@ -2663,6 +2713,19 @@ int proc_dointvec_ms_jiffies(struct ctl_table *table, int write, struct file *fi
    do_proc_dointvec_ms_jiffies_conv, NULL);
    }

    +int proc_dointvec_pcm_minmax(struct ctl_table *table, int write,
    + struct file *filp, void __user *buffer, size_t *lenp,
    + loff_t *ppos)
    +{
    + struct do_proc_dointvec_minmax_conv_param param = {
    + .min = (int *)table->extra1,
    + .max = (int *)table->extra2,
    + };
    +
    + return do_proc_dointvec(table, write, filp, buffer, lenp, ppos,
    + do_proc_dointvec_pcm_minmax_conv, &param);
    +}
    +
    static int proc_do_cad_pid(struct ctl_table *table, int write, struct file *filp,
    void __user *buffer, size_t *lenp, loff_t *ppos)
    {
    @@ -2711,6 +2774,13 @@ int proc_dointvec_jiffies(struct ctl_table *table, int write, struct file *filp,
    return -ENOSYS;
    }

    +int proc_dointvec_pcm_minmax(struct ctl_table *table, int write,
    + struct file *filp, void __user *buffer, size_t *lenp,
    + loff_t *ppos)
    +{
    + return -ENOSYS;
    +}
    +
    int proc_dointvec_userhz_jiffies(struct ctl_table *table, int write, struct file *filp,
    void __user *buffer, size_t *lenp, loff_t *ppos)
    {
    diff --git a/kernel/sysctl_check.c b/kernel/sysctl_check.c
    index c35da23..83934a8 100644
    --- a/kernel/sysctl_check.c
    +++ b/kernel/sysctl_check.c
    @@ -111,7 +111,9 @@ static const struct trans_ctl_table trans_vm_table[] = {
    { VM_OVERCOMMIT_MEMORY, "overcommit_memory" },
    { VM_PAGE_CLUSTER, "page-cluster" },
    { VM_DIRTY_BACKGROUND, "dirty_background_ratio" },
    + { VM_DIRTY_BACKGROUND_PCM, "dirty_background_ratio_pcm" },
    { VM_DIRTY_RATIO, "dirty_ratio" },
    + { VM_DIRTY_RATIO_PCM, "dirty_ratio_pcm" },
    { VM_DIRTY_WB_CS, "dirty_writeback_centisecs" },
    { VM_DIRTY_EXPIRE_CS, "dirty_expire_centisecs" },
    { VM_NR_PDFLUSH_THREADS, "nr_pdflush_threads" },
    @@ -1494,6 +1496,7 @@ int sysctl_check_table(struct nsproxy *namespaces, struct ctl_table *table)
    (table->proc_handler == proc_dostring) ||
    (table->proc_handler == proc_dointvec) ||
    (table->proc_handler == proc_dointvec_minmax) ||
    + (table->proc_handler == proc_dointvec_pcm_minmax) ||
    (table->proc_handler == proc_dointvec_jiffies) ||
    (table->proc_handler == proc_dointvec_userhz_jiffies) ||
    (table->proc_handler == proc_dointvec_ms_jiffies) ||
    diff --git a/mm/page-writeback.c b/mm/page-writeback.c
    index b3584bf..e010a39 100644
    --- a/mm/page-writeback.c
    +++ b/mm/page-writeback.c
    @@ -66,7 +66,7 @@ static inline long sync_writeback_pages(void)
    /*
    * Start background writeback (via pdflush) at this percentage
    */
    -int dirty_background_ratio = 5;
    +int dirty_background_ratio = 5 * PERCENT_PCM;

    /*
    * free highmem will not be subtracted from the total free memory
    @@ -77,7 +77,7 @@ int vm_highmem_is_dirtyable;
    /*
    * The generator of dirty data starts writeback at this percentage
    */
    -int vm_dirty_ratio = 10;
    +int vm_dirty_ratio = 10 * PERCENT_PCM;

    /*
    * The interval between `kupdate'-style writebacks, in jiffies
    @@ -133,9 +133,10 @@ static struct prop_descriptor vm_dirties;
    */
    static int calc_period_shift(void)
    {
    - unsigned long dirty_total;
    + u64 dirty_total;

    - dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) / 100;
    + dirty_total = (vm_dirty_ratio * determine_dirtyable_memory())
    + / ONE_HUNDRED_PCM;
    return 2 + ilog2(dirty_total - 1);
    }

    @@ -147,7 +148,23 @@ int dirty_ratio_handler(struct ctl_table *table, int write,
    loff_t *ppos)
    {
    int old_ratio = vm_dirty_ratio;
    - int ret = proc_dointvec_minmax(table, write, filp, buffer, lenp, ppos);
    + int ret;
    +
    + switch (table->ctl_name) {
    + case VM_DIRTY_RATIO:
    + ret = proc_dointvec_pcm_minmax(table, write, filp, buffer,
    + lenp, ppos);
    + break;
    + case VM_DIRTY_RATIO_PCM:
    + ret = proc_dointvec_minmax(table, write, filp, buffer,
    + lenp, ppos);
    + break;
    + default:
    + ret = -EINVAL;
    + WARN_ON(1);
    + break;
    + }
    +
    if (ret == 0 && write && vm_dirty_ratio != old_ratio) {
    int shift = calc_period_shift();
    prop_change_shift(&vm_completions, shift);
    @@ -380,8 +397,8 @@ get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
    if (background_ratio >= dirty_ratio)
    background_ratio = dirty_ratio / 2;

    - background = (background_ratio * available_memory) / 100;
    - dirty = (dirty_ratio * available_memory) / 100;
    + background = (background_ratio * available_memory) / ONE_HUNDRED_PCM;
    + dirty = (dirty_ratio * available_memory) / ONE_HUNDRED_PCM;
    tsk = current;
    if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
    background += background / 4;
