[PATCH 0/2] NR_CPUS: increase maximum NR_CPUS to 4096 - Kernel


  1. [PATCH 0/2] NR_CPUS: increase maximum NR_CPUS to 4096


    * Increases the limit of NR_CPUS to 4096 and introduces a
    boolean called "MAXSMP" which when set (e.g. "allyesconfig")
    will set NR_CPUS = 4096 and NODES_SHIFT = 9 (512).

    I've been running this config (4k NR_CPUS, 512 Max Nodes)
    on an AMD box with 2 dual-cores and 4gb memory as well as an
    Intel box with 4 single-core cpus and 8Mb. I've also
    successfully booted it in a simulated 2cpus/1Gb environment.

    Based on:
    git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
    + x86/latest .../x86/linux-2.6-x86.git
    + sched-devel/latest .../mingo/linux-2.6-sched-devel.git

    Signed-off-by: Mike Travis
    ---

    Memory usage effects from upping NR_CPUS to 4096 and MAX_NUMANODES to 512.

    255-akpm2: akpm2 config with NR_CPUS=255 / NUMA_NODE_SHIFT=6
    4k-akpm2: akpm2 config with NR_CPUS=4096 / NUMA_NODE_SHIFT=9

    ====== Data (-l 1000)
    1 - 255-akpm2
    2 - 4k-akpm2

    .1. .2. ..final..
    1114112 +3899392 5013504 +350% irq_desc(.data.cacheline_aligned)
    313344 +4177920 4491264 +1333% irq_cfg(.data.read_mostly)
    76800 +537600 614400 +700% early_node_map(.init.data)
    32640 +491648 524288 +1506% boot_pageset(.bss)
    32640 +491648 524288 +1506% boot_cpu_pda(.data.cacheline_aligned)
    23040 +161280 184320 +700% initkmem_list3(.init.data)
    5632 +39424 45056 +700% node_devices(.bss)
    4096 +28672 32768 +700% plat_node_bdata(.bss)
    2656 +34312 36968 +1291% cache_cache(.data)
    2048 +14336 16384 +700% rio_devs(.init.data)
    2048 +260096 262144 +12700% node_to_cpumask_map(.data.read_mostly)
    2040 +30728 32768 +1506% centrino_model(.bss)
    2040 +30728 32768 +1506% centrino_cpu(.bss)
    2040 +30728 32768 +1506% _cpu_pda(.data.read_mostly)
    1024 +1024 2048 +100% pxm_to_node_map(.data)
    1024 +7168 8192 +700% nodes_add(.bss)
    1024 +7168 8192 +700% nodes(.init.data)
    1024 +7168 8192 +700% hugepage_freelists(.bss)
    1020 +15364 16384 +1506% x86_cpu_to_node_map_init(.data)
    1020 +15364 16384 +1506% cpu_set_freq(.bss)
    1020 +15364 16384 +1506% cpu_min_freq(.bss)
    1020 +15364 16384 +1506% cpu_max_freq(.bss)
    1020 +15364 16384 +1506% cpu_is_managed(.bss)
    1020 +15364 16384 +1506% cpu_cur_freq(.bss)
    512 +3584 4096 +700% zone_movable_pfn(.init.data)
    512 +3584 4096 +700% scal_devs(.init.data)
    512 +3584 4096 +700% node_data(.data.read_mostly)
    510 +7682 8192 +1506% x86_cpu_to_apicid_init(.init.data)
    510 +7682 8192 +1506% x86_bios_cpu_apicid_init(.init.data)
    0 +4096 4096 . tvec_base_done(.data)
    0 +2048 2048 . surplus_huge_pages_node(.bss)
    0 +2048 2048 . nr_huge_pages_node(.bss)
    0 +2048 2048 . node_to_pxm_map(.data)
    0 +2048 2048 . node_order(.bss)
    0 +2048 2048 . node_load(.bss)
    0 +2048 2048 . free_huge_pages_node(.bss)
    0 +2048 2048 . fake_node_to_pxm_map(.init.data)
    0 +1552 1552 . def_root_domain(.bss)
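The largest relative jump in this table, node_to_cpumask_map (2048 to 262144 bytes), can be cross-checked by hand. The sketch below is only an illustration. It assumes a 64-bit build where a CPU mask is a bitmap stored in whole 64-bit longs, and that the map holds one mask per possible node; neither assumption is stated in the posting, and the helper names are hypothetical.

```python
# Hypothetical helpers; the layout assumptions are ours, not from the posting.
BITS_PER_LONG = 64  # assumption: 64-bit x86 build


def cpumask_bytes(nr_cpus: int) -> int:
    """Size of a CPU bitmap rounded up to whole 64-bit longs."""
    longs = (nr_cpus + BITS_PER_LONG - 1) // BITS_PER_LONG
    return longs * (BITS_PER_LONG // 8)


def max_nodes(nodes_shift: int) -> int:
    """Number of NUMA nodes implied by a node shift."""
    return 1 << nodes_shift


def node_to_cpumask_map_bytes(nr_cpus: int, nodes_shift: int) -> int:
    """One CPU mask per possible node (assumed layout)."""
    return max_nodes(nodes_shift) * cpumask_bytes(nr_cpus)


small = node_to_cpumask_map_bytes(255, 6)   # 255-akpm2 config
large = node_to_cpumask_map_bytes(4096, 9)  # 4k-akpm2 config
print(small, large)
```

Under these assumptions the two results agree with the 2048 and 262144 byte figures reported for node_to_cpumask_map: both the mask size and the node count grow, which is why this entry grows far more than arrays sized by only one of the two limits.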

    ====== Sections (-l 500)
    1 - 255-akpm2
    2 - 4k-akpm2

    .1. .2. ..final..
    63092788 +10579589 73672377 +16% Total
    41514099 +93823 41607922 <1% .debug_info
    6648945 -2268 6646677 <1% .debug_loc
    3365341 +7483 3372824 <1% .text
    2631073 -1672 2629401 <1% .debug_line
    1320219 +31557 1351776 +2% .debug_abbrev
    1149568 +4391040 5540608 +381% .data.cacheline_aligned
    1106192 -4784 1101408 <1% .debug_ranges
    732736 +728832 1461568 +99% .bss
    329672 +4474992 4804664 +1357% .data.read_mostly
    285576 +100320 385896 +35% .data
    173664 +751936 925600 +432% .init.data
    40824 +7808 48632 +19% .data.percpu

    ====== Text/Data ()
    1 - 255-akpm2
    2 - 4k-akpm2

    .1. .2. ..final..
    3364864 +8192 3373056 <1% TextSize
    1552384 +100352 1652736 +6% DataSize
    733184 +729088 1462272 +99% BssSize
    393216 +757760 1150976 +192% InitSize
    40960 +8192 49152 +20% PerCPU
    1529856 +8869888 10399744 +579% OtherSize
    7614464 +10473472 18087936 +137% Totals

    ====== PerCPU ()
    1 - 255-akpm2
    2 - 4k-akpm2

    .1. .2. ..final..
    18432 -2048 16384 -11% kstat
    10240 -2048 8192 -20% init_tss
    2048 -2048 . -100% fdtable_defer_list
    0 +2048 2048 . node_domains
    0 +2048 2048 . lru_add_active_pvecs
    0 +2048 2048 . cpuidle_devices
    0 +2048 2048 . cpu_mask
    0 +2048 2048 . cpu_info
    0 +2048 2048 . cpu_core_map
    0 +2048 2048 . core_domains
    30720 +8192 38912 +26% Totals

    ====== Stack (-l 1000)
    1 - 255-akpm2
    2 - 4k-akpm2

    .1. .2. ..final..
    0 +4216 4216 . show_schedstat
    0 +2744 2744 . build_sched_domains
    0 +2152 2152 . centrino_target
    0 +1640 1640 . setup_IO_APIC
    0 +1592 1592 . move_task_off_dead_cpu
    0 +1576 1576 . setup_IO_APIC_irq
    0 +1560 1560 . tick_notify
    0 +1560 1560 . __assign_irq_vector
    0 +1552 1552 . arch_setup_msi_irq
    0 +1552 1552 . arch_setup_ht_irq
    0 +1544 1544 . tick_do_periodic_broadcast
    0 +1544 1544 . irq_affinity_write_proc
    0 +1144 1144 . threshold_create_device
    0 +1112 1112 . sched_balance_self
    0 +1064 1064 . _cpu_down
    0 +1056 1056 . __smp_call_function_mask
    0 +1048 1048 . store_threshold_limit
    0 +1048 1048 . set_ioapic_affinity_irq
    0 +1048 1048 . acpi_processor_set_throttling
    0 +1048 1048 . acpi_map_lsapic
    0 +1040 1040 . store_interrupt_enable
    0 +1040 1040 . set_msi_irq_affinity
    0 +1040 1040 . set_ht_irq_affinity
    0 +1032 1032 . store_error_count
    0 +1032 1032 . show_error_count
    0 +1032 1032 . setup_ioapic_dest
    0 +1032 1032 . sched_setaffinity
    0 +1032 1032 . physflat_send_IPI_allbutself
    0 +1032 1032 . native_flush_tlb_others
    0 +1032 1032 . move_masked_irq
    0 +1032 1032 . flat_send_IPI_allbutself
    0 +1024 1024 . pci_bus_show_cpuaffinity
    0 +1024 1024 . machine_crash_shutdown
    0 +1024 1024 . local_cpus_show
    0 +1024 1024 . irq_complete_move
    0 +1024 1024 . ioapic_retrigger_irq
    0 +1024 1024 . fixup_irqs
    0 +1024 1024 . create_irq
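One plausible reading of this table, not stated in the posting, is that each listed function keeps one or more full CPU masks on its stack. With NR_CPUS=4096 a mask is 4096 / 8 = 512 bytes, so frames cluster just above multiples of 512. The sketch below only does that division; the constant and function names are hypothetical.

```python
NR_CPUS = 4096
CPUMASK_BYTES = NR_CPUS // 8  # 512 bytes per full CPU bitmap


def masks_that_fit(frame_bytes: int) -> int:
    """How many whole 512-byte masks a stack frame of this size could hold."""
    return frame_bytes // CPUMASK_BYTES


# Stack sizes copied from the table above.
for func, frame in [
    ("show_schedstat", 4216),
    ("build_sched_domains", 2744),
    ("__assign_irq_vector", 1560),
    ("sched_setaffinity", 1032),
]:
    print(func, frame, masks_that_fit(frame))
```

If that reading is right, these frames would shrink again by passing masks by pointer or allocating them off-stack rather than holding them as locals.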

    ====== MemInfo ()
    1 - 255-akpm2
    2 - 4k-akpm2

    .1. .2. ..final..
    30146560 +786432 30932992 +2% Active
    1018880 +64512 1083392 +6% Active(Node.0)
    6517760 +132096 6649856 +2% Active(Node.1)
    17465344 -12288 17453056 <1% AnonPages
    2932736 +69632 3002368 +2% AnonPages(Node.0)
    14532608 -81920 14450688 <1% AnonPages(Node.1)
    5804032 +327680 6131712 +5% Buffers
    57851904 +17252352 75104256 +29% Cached
    10078793728 -5058560 10073735168 <1% CommitLimit
    73453568 +1028096 74481664 +1% Committed_AS
    184320 -184320 . -100% Dirty
    20480 -20480 . -100% Dirty(Node.0)
    163840 -163840 . -100% Dirty(Node.1)
    3391488 +352256 3743744 +10% FilePages(Node.0)
    60264448 +17227776 77492224 +28% FilePages(Node.1)
    50900992 +16818176 67719168 +33% Inactive
    561152 +41984 603136 +7% Inactive(Node.0)
    12164096 +4162560 16326656 +34% Inactive(Node.1)
    8847360 +16384 8863744 <1% Mapped
    290816 +45056 335872 +15% Mapped(Node.0)
    8556544 -28672 8527872 <1% Mapped(Node.1)
    4014837760 -54091776 3960745984 -1% MemFree
    2012188672 -16060416 1996128256 <1% MemFree(Node.0)
    2002522112 -38031360 1964490752 -1% MemFree(Node.1)
    4151279616 -10117120 4141162496 <1% MemTotal
    134877184 +16060416 150937600 +11% MemUsed(Node.0)
    143912960 +38031360 181944320 +26% MemUsed(Node.1)
    2306048 -8192 2297856 <1% PageTables
    872448 +32768 905216 +3% PageTables(Node.0)
    1433600 -40960 1392640 -2% PageTables(Node.1)
    8290304 +217088 8507392 +2% SReclaimable
    1155072 -184320 970752 -15% SReclaimable(Node.0)
    7135232 +401408 7536640 +5% SReclaimable(Node.1)
    12480512 +11087872 23568384 +88% SUnreclaim
    4730880 +6569984 11300864 +138% SUnreclaim(Node.0)
    7749632 +4517888 12267520 +58% SUnreclaim(Node.1)
    20770816 +11304960 32075776 +54% Slab
    5885952 +6385664 12271616 +108% Slab(Node.0)
    14884864 +4919296 19804160 +33% Slab(Node.1)
    159670272 -4096 159666176 <1% VmallocUsed
    23140846592 +33765376 23174611968 +0% Totals


    Memory usage in a simulated 2cpu/1gb environment using the
    default configuration and NR_CPUS=4096, MAX Nodes=512:

    Memory: 1013440k/1048576k available
    (3588k kernel code, 33728k reserved, 1962k data, 1212k init)

    MemTotal: 1014652 kB
    MemFree: 991364 kB
    Buffers: 192 kB
    Cached: 3436 kB
    SwapCached: 0 kB
    Active: 1636 kB
    Inactive: 2648 kB
    SwapTotal: 0 kB
    SwapFree: 0 kB
    Dirty: 20 kB
    Writeback: 0 kB
    AnonPages: 656 kB
    Mapped: 1412 kB
    Slab: 12752 kB
    SReclaimable: 236 kB
    SUnreclaim: 12516 kB
    PageTables: 36 kB
    NFS_Unstable: 0 kB
    Bounce: 0 kB
    CommitLimit: 507324 kB
    Committed_AS: 0 kB
    VmallocTotal: 34359738367 kB
    VmallocUsed: 4896 kB
    VmallocChunk: 34359733471 kB
    HugePages_Total: 0
    HugePages_Free: 0
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    Hugepagesize: 2048 kB

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: [PATCH 1/2] boot: increase stack size for kernel boot loader decompressor

    On Fri, Apr 04, 2008 at 06:30:15PM -0700, Mike Travis wrote:
    > * Increase stack size for the kernel bootloader decompressor. This is
    > needed to boot a kernel with NR_CPUS = 4096. I tested with 8k stack
    > size but that wasn't sufficient.
    >
    > Based on:
    > git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
    > + x86/latest .../x86/linux-2.6-x86.git
    > + sched-devel/latest .../mingo/linux-2.6-sched-devel.git
    >
    > Signed-off-by: Mike Travis
    > ---
    > arch/x86/boot/compressed/head_64.S | 2 +-
    > 1 file changed, 1 insertion(+), 1 deletion(-)
    >
    > --- linux-2.6.25-rc5.orig/arch/x86/boot/compressed/head_64.S
    > +++ linux-2.6.25-rc5/arch/x86/boot/compressed/head_64.S
    > @@ -314,5 +314,5 @@ gdt_end:
    > /* Stack for uncompression */
    > .balign 4
    > user_stack:
    > - .fill 4096,4,0
    > + .fill 16384,4,0

    --------------^^^ * ^

    Changed from 16K to 64K. I wonder what is using so much space on
    this stack?

    > user_stack_end:
    >
    > --
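The correction above rests on how the GNU assembler's `.fill repeat, size, value` directive works: it emits `repeat` copies of a `size`-byte value, so the reserved space is the product of the first two arguments. A minimal sketch of that arithmetic (the helper name is hypothetical):

```python
def fill_bytes(repeat: int, size: int = 1) -> int:
    """Bytes reserved by a GNU as `.fill repeat, size, value` directive."""
    return repeat * size


old_stack = fill_bytes(4096, 4)    # .fill 4096,4,0
new_stack = fill_bytes(16384, 4)   # .fill 16384,4,0

print(old_stack // 1024, "KiB ->", new_stack // 1024, "KiB")
```

So the original stack was already 16 KiB rather than 4 KiB, and the patch raises it to 64 KiB rather than 16 KiB.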


  3. Re: [PATCH 1/2] boot: increase stack size for kernel boot loader decompressor

    Alexander van Heukelum wrote:
    > On Fri, Apr 04, 2008 at 06:30:15PM -0700, Mike Travis wrote:
    >> * Increase stack size for the kernel bootloader decompressor. This is
    >> needed to boot a kernel with NR_CPUS = 4096. I tested with 8k stack
    >> size but that wasn't sufficient.
    >>
    >> Based on:
    >> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
    >> + x86/latest .../x86/linux-2.6-x86.git
    >> + sched-devel/latest .../mingo/linux-2.6-sched-devel.git
    >>
    >> Signed-off-by: Mike Travis
    >> ---
    >> arch/x86/boot/compressed/head_64.S | 2 +-
    >> 1 file changed, 1 insertion(+), 1 deletion(-)
    >>
    >> --- linux-2.6.25-rc5.orig/arch/x86/boot/compressed/head_64.S
    >> +++ linux-2.6.25-rc5/arch/x86/boot/compressed/head_64.S
    >> @@ -314,5 +314,5 @@ gdt_end:
    >> /* Stack for uncompression */
    >> .balign 4
    >> user_stack:
    >> - .fill 4096,4,0
    >> + .fill 16384,4,0

    > --------------^^^ * ^
    >
    > Changed from 16K to 64K. I wonder what is using so much space on
    > this stack?
    >
    >> user_stack_end:
    >>
    >> --


    Hi,

    That is a good question. It's pretty difficult to debug at that early
    stage (any ideas are certainly welcome!). It's mostly hit and miss (and
    handy access to the reset button ;-) I could do some further research
    but since it's "throwaway" memory (at least I think it is), then I didn't
    think it important to pursue.

    And thanks for the correction, I thought I was bumping a byte count, not
    a word count.

    Thanks,
    Mike

  4. Re: [PATCH 1/2] boot: increase stack size for kernel boot loader decompressor

    On Mon, Apr 07, 2008 at 11:14:16AM -0700, Mike Travis wrote:
    > Alexander van Heukelum wrote:
    > > On Fri, Apr 04, 2008 at 06:30:15PM -0700, Mike Travis wrote:
    > >> * Increase stack size for the kernel bootloader decompressor. This is
    > >> needed to boot a kernel with NR_CPUS = 4096. I tested with 8k stack
    > >> size but that wasn't sufficient.
    > >>
    > >> Based on:
    > >> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
    > >> + x86/latest .../x86/linux-2.6-x86.git
    > >> + sched-devel/latest .../mingo/linux-2.6-sched-devel.git
    > >>
    > >> Signed-off-by: Mike Travis
    > >> ---
    > >> arch/x86/boot/compressed/head_64.S | 2 +-
    > >> 1 file changed, 1 insertion(+), 1 deletion(-)
    > >>
    > >> --- linux-2.6.25-rc5.orig/arch/x86/boot/compressed/head_64.S
    > >> +++ linux-2.6.25-rc5/arch/x86/boot/compressed/head_64.S
    > >> @@ -314,5 +314,5 @@ gdt_end:
    > >> /* Stack for uncompression */
    > >> .balign 4
    > >> user_stack:
    > >> - .fill 4096,4,0
    > >> + .fill 16384,4,0

    > > --------------^^^ * ^
    > >
    > > Changed from 16K to 64K. I wonder what is using so much space on
    > > this stack?
    > >
    > >> user_stack_end:
    > >>
    > >> --

    >
    > Hi,
    >
    > That is a good question. It's pretty difficult to debug at that early
    > stage (any ideas are certainly welcome!). It's mostly hit and miss (and
    > handy access to the reset button ;-) I could do some further research
    > but since it's "throwaway" memory (at least I think it is), then I didn't
    > think it important to pursue.


    It's certainly not important enough to put much time in. I tried
    MAXSMP on top of just Ingo's -x86 with qemu, though, but it wouldn't
    crash. I set the stack size to 16 bytes, and it still booted happily
    (of course there is still about 11 kilobytes of inflate code which
    is then overwritten by stack use).

    I did see that the malloc space that the inflate code is using is
    taken from _after_ the end of the bss. I don't see how this is
    protected from being used/overwritten. Changing the stack size changes
    the memory layout a bit... maybe you were so unlucky to create a
    vmlinux image that was just barely smaller than some threshold and
    increasing the stack size made the decompression/relocation area be
    located somewhere else?

    Test patch follows.

    Greetings,
    Alexander

    > Thanks,
    > Mike


    diff --git a/arch/x86/boot/compressed/vmlinux_32.lds b/arch/x86/boot/compressed/vmlinux_32.lds
    index bb3c483..c858e30 100644
    --- a/arch/x86/boot/compressed/vmlinux_32.lds
    +++ b/arch/x86/boot/compressed/vmlinux_32.lds
    @@ -39,5 +39,6 @@ SECTIONS
    *(.bss.*)
    *(COMMON)
    _end = . ;
    + _real_end = . + 0x4000;
    }
    }
    diff --git a/arch/x86/boot/compressed/vmlinux_64.lds b/arch/x86/boot/compressed/vmlinux_64.lds
    index 7e5c720..9bef3cd 100644
    --- a/arch/x86/boot/compressed/vmlinux_64.lds
    +++ b/arch/x86/boot/compressed/vmlinux_64.lds
    @@ -44,5 +44,6 @@ SECTIONS
    pgtable = . ;
    . = . + 4096 * 6;
    _heap = .;
    + _heap_end = . + 0x7000;
    }
    }


  5. Re: [PATCH 1/2] boot: increase stack size for kernel boot loader decompressor


    * Alexander van Heukelum wrote:

    > I did see that the malloc space that the inflate code is using is
    > taken from _after_ the end of the bss. I don't see how this is
    > protected from being used/overwritten. Changing the stack size changes
    > the memory layout a bit... maybe you were so unlucky to create a
    > vmlinux image that was just barely smaller than some threshold and
    > increasing the stack size made the decompression/relocation area be
    > located somewhere else?
    >
    > Test patch follows.


    that's a really interesting theory.

    FWIW, i've been booting allyesconfig bzImages for a long time (with
    only a minimal number of drivers disabled - mostly old ISA ones that
    assume the presence of the real hardware), and they boot and work fine
    on both 32-bit and 64-bit typical whitebox PCs. That means huge bzImages
    that decompress into a ~41 MB kernel image. I'd expect that to be a
    rather severe test of the decompressor.

    Ingo

  6. [PATCH] x86: cleanup boot-heap usage

    The kernel decompressor wrapper uses memory located beyond the
    end of the image. This might lead to hard to debug problems,
    but even if it can be proven to be safe, it is at the very
    least unclean. I don't see any advantages either, unless you
    count it not being zeroed out as an advantage. This patch
    moves the boot-heap area to the bss segment.

    Signed-off-by: Alexander van Heukelum

    ---

    On Tue, Apr 08, 2008 at 10:23:54AM +0200, Ingo Molnar wrote:
    > * Alexander van Heukelum wrote:
    > > I did see that the malloc space that the inflate code is using is
    > > taken from _after_ the end of the bss. I don't see how this is
    > > protected from being used/overwritten. Changing the stack size changes
    > > the memory layout a bit... maybe you were so unlucky to create a
    > > vmlinux image that was just barely smaller than some threshold and
    > > increasing the stack size made the decompression/relocation area be
    > > located somewhere else?
    > >
    > > Test patch follows.

    >
    > that's a really interesting theory.
    >
    > FWIW, i've been booting allyesconfig bzImages for a long time (with
    > only a minimal number of drivers disabled - mostly old ISA ones that
    > assume the presence of the real hardware), and they boot and work fine
    > on both 32-bit and 64-bit typical whitebox PCs. That means huge bzImages
    > that decompress into a ~41 MB kernel image. I'd expect that to be a
    > rather severe test of the decompressor.
    >
    > Ingo


    Hi Ingo,

    Even if this patch might not solve the problem, I think it
    is a good clean-up that is suitable for -x86? qemu is happy
    with it.

    Greetings,
    Alexander

    arch/x86/boot/compressed/head_32.S | 15 +++++++++------
    arch/x86/boot/compressed/head_64.S | 22 +++++++++++++---------
    arch/x86/boot/compressed/misc.c | 8 +-------
    include/asm-x86/boot.h | 8 ++++++++
    4 files changed, 31 insertions(+), 22 deletions(-)

    diff --git a/arch/x86/boot/compressed/head_32.S b/arch/x86/boot/compressed/head_32.S
    index 036e635..ba7736c 100644
    --- a/arch/x86/boot/compressed/head_32.S
    +++ b/arch/x86/boot/compressed/head_32.S
    @@ -130,7 +130,7 @@ relocated:
    /*
    * Setup the stack for the decompressor
    */
    - leal stack_end(%ebx), %esp
    + leal boot_stack_end(%ebx), %esp

    /*
    * Do the decompression, and jump to the new kernel..
    @@ -142,8 +142,8 @@ relocated:
    pushl %eax # input_len
    leal input_data(%ebx), %eax
    pushl %eax # input_data
    - leal _end(%ebx), %eax
    - pushl %eax # end of the image as third argument
    + leal boot_heap(%ebx), %eax
    + pushl %eax # heap area as third argument
    pushl %esi # real mode pointer as second arg
    call decompress_kernel
    addl $20, %esp
    @@ -181,7 +181,10 @@ relocated:
    jmp *%ebp

    .bss
    +/* Stack and heap for uncompression */
    .balign 4
    -stack:
    - .fill 4096, 1, 0
    -stack_end:
    +boot_heap:
    + .fill BOOT_HEAP_SIZE, 1, 0
    +boot_stack:
    + .fill BOOT_STACK_SIZE, 1, 0
    +boot_stack_end:
    diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
    index e8657b9..7a212a6 100644
    --- a/arch/x86/boot/compressed/head_64.S
    +++ b/arch/x86/boot/compressed/head_64.S
    @@ -28,6 +28,7 @@
    #include
    #include
    #include
    +#include
    #include
    #include

    @@ -62,7 +63,7 @@ startup_32:
    subl $1b, %ebp

    /* setup a stack and make sure cpu supports long mode. */
    - movl $user_stack_end, %eax
    + movl $boot_stack_end, %eax
    addl %ebp, %eax
    movl %eax, %esp

    @@ -274,7 +275,7 @@ relocated:
    stosb

    /* Setup the stack */
    - leaq user_stack_end(%rip), %rsp
    + leaq boot_stack_end(%rip), %rsp

    /* zero EFLAGS after setting rsp */
    pushq $0
    @@ -285,7 +286,7 @@ relocated:
    */
    pushq %rsi # Save the real mode argument
    movq %rsi, %rdi # real mode address
    - leaq _heap(%rip), %rsi # _heap
    + leaq boot_heap(%rip), %rsi # malloc area for uncompression
    leaq input_data(%rip), %rdx # input_data
    movl input_len(%rip), %eax
    movq %rax, %rcx # input_len
    @@ -310,9 +311,12 @@ gdt:
    .quad 0x0080890000000000 /* TS descriptor */
    .quad 0x0000000000000000 /* TS continued */
    gdt_end:
    - .bss
    -/* Stack for uncompression */
    - .balign 4
    -user_stack:
    - .fill 4096,4,0
    -user_stack_end:
    +
    +.bss
    +/* Stack and heap for uncompression */
    +.balign 4
    +boot_heap:
    + .fill BOOT_HEAP_SIZE, 1, 0
    +boot_stack:
    + .fill BOOT_STACK_SIZE, 1, 0
    +boot_stack_end:
    diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
    index dad4e69..90456ce 100644
    --- a/arch/x86/boot/compressed/misc.c
    +++ b/arch/x86/boot/compressed/misc.c
    @@ -217,12 +217,6 @@ static void putstr(const char *);
    static memptr free_mem_ptr;
    static memptr free_mem_end_ptr;

    -#ifdef CONFIG_X86_64
    -#define HEAP_SIZE 0x7000
    -#else
    -#define HEAP_SIZE 0x4000
    -#endif
    -
    static char *vidmem;
    static int vidport;
    static int lines, cols;
    @@ -449,7 +443,7 @@ asmlinkage void decompress_kernel(void *rmode, memptr heap,

    window = output; /* Output buffer (Normally at 1M) */
    free_mem_ptr = heap; /* Heap */
    - free_mem_end_ptr = heap + HEAP_SIZE;
    + free_mem_end_ptr = heap + BOOT_HEAP_SIZE;
    inbuf = input_data; /* Input buffer */
    insize = input_len;
    inptr = 0;
    diff --git a/include/asm-x86/boot.h b/include/asm-x86/boot.h
    index ed8affb..2faed7e 100644
    --- a/include/asm-x86/boot.h
    +++ b/include/asm-x86/boot.h
    @@ -17,4 +17,12 @@
    + (CONFIG_PHYSICAL_ALIGN - 1)) \
    & ~(CONFIG_PHYSICAL_ALIGN - 1))

    +#ifdef CONFIG_X86_64
    +#define BOOT_HEAP_SIZE 0x7000
    +#define BOOT_STACK_SIZE 0x4000
    +#else
    +#define BOOT_HEAP_SIZE 0x4000
    +#define BOOT_STACK_SIZE 0x1000
    +#endif
    +
    #endif /* _ASM_BOOT_H */
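The constants added to boot.h can be compared with the `.fill` directives the patch removes. The sketch below uses only values that appear in the diff; the dictionary layout and helper name are hypothetical, and it assumes the heap and stack are the only two regions this patch places in the wrapper's .bss.

```python
# Values copied from the include/asm-x86/boot.h hunk.
BOOT_SIZES = {
    "x86_64": {"heap": 0x7000, "stack": 0x4000},
    "i386": {"heap": 0x4000, "stack": 0x1000},
}

# Stack sizes before the patch, from the removed .fill lines (repeat * size).
OLD_STACK = {
    "x86_64": 4096 * 4,  # .fill 4096,4,0
    "i386": 4096 * 1,    # .fill 4096, 1, 0
}


def bss_reserved(arch: str) -> int:
    """Bytes of .bss taken by boot_heap plus boot_stack for one architecture."""
    sizes = BOOT_SIZES[arch]
    return sizes["heap"] + sizes["stack"]


for arch in BOOT_SIZES:
    unchanged = BOOT_SIZES[arch]["stack"] == OLD_STACK[arch]
    print(arch, hex(bss_reserved(arch)), "stack unchanged:", unchanged)
```

The stack sizes come out the same as before on both architectures, so the patch changes where the heap lives and leaves the stack size alone.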


  7. Re: [PATCH 1/2] boot: increase stack size for kernel boot loader decompressor

    Hi Ingo,

    I see you have applied the following patch to x86#for-akpm. It was
    really meant for testing only. I think you meant to use this one instead?

    [PATCH 1/2] boot: increase stack size for kernel boot loader decompressor
    http://lkml.org/lkml/2008/3/25/497

    Otherwise, please consider [PATCH] x86: cleanup boot-heap usage instead.
    http://lkml.org/lkml/2008/4/8/90

    Greetings,
    Alexander

    On Mon, Apr 07, 2008 at 11:44:34PM +0200, Alexander van Heukelum wrote:
    > diff --git a/arch/x86/boot/compressed/vmlinux_32.lds b/arch/x86/boot/compressed/vmlinux_32.lds
    > index bb3c483..c858e30 100644
    > --- a/arch/x86/boot/compressed/vmlinux_32.lds
    > +++ b/arch/x86/boot/compressed/vmlinux_32.lds
    > @@ -39,5 +39,6 @@ SECTIONS
    > *(.bss.*)
    > *(COMMON)
    > _end = . ;
    > + _real_end = . + 0x4000;
    > }
    > }
    > diff --git a/arch/x86/boot/compressed/vmlinux_64.lds b/arch/x86/boot/compressed/vmlinux_64.lds
    > index 7e5c720..9bef3cd 100644
    > --- a/arch/x86/boot/compressed/vmlinux_64.lds
    > +++ b/arch/x86/boot/compressed/vmlinux_64.lds
    > @@ -44,5 +44,6 @@ SECTIONS
    > pgtable = . ;
    > . = . + 4096 * 6;
    > _heap = .;
    > + _heap_end = . + 0x7000;
    > }
    > }



  8. Re: [PATCH 1/2] boot: increase stack size for kernel boot loader decompressor


    * Alexander van Heukelum wrote:

    > Hi Ingo,
    >
    > I see you have applied the following patch to x86#for-akpm. It was
    > really meant for testing only. I think you meant to use this one
    > instead?


    yep, i wanted to see how it holds up in testing - it's OK so far. I've
    got your other, fuller one queued up meanwhile - it's not pushed out
    yet.

    Ingo

  9. Re: [PATCH 1/2] boot: increase stack size for kernel boot loader decompressor

    Ingo Molnar wrote:
    > * Alexander van Heukelum wrote:
    >
    >> I did see that the malloc space that the inflate code is using is
    >> taken from _after_ the end of the bss. I don't see how this is
    >> protected from being used/overwritten. Changing the stack size changes
    >> the memory layout a bit... maybe you were so unlucky to create a
    >> vmlinux image that was just barely smaller than some threshold and
    >> increasing the stack size made the decompression/relocation area be
    >> located somewhere else?
    >>
    >> Test patch follows.

    >
    > that's a really interesting theory.
    >
    > FWIW, i've been booting allyesconfig bzImages for a long time (with
    > only a minimal number of drivers disabled - mostly old ISA ones that
    > assume the presence of the real hardware), and they boot and work fine
    > on both 32-bit and 64-bit typical whitebox PCs. That means huge bzImages
    > that decompress into a ~41 MB kernel image. I'd expect that to be a
    > rather severe test of the decompressor.
    >
    > Ingo


    Well admittedly, I did discover this problem way early in booting
    up a 4k NR_CPU kernel (obviously ;-). Once it booted, I haven't
    revisited that problem again. I wonder if it's a pathological case
    of a single bitstream that causes expansion instead of compression?
    Note I was using the akpm2 config script with NR_CPUS=4096 and
    NODES_SHIFT=9 (plus some other tweaks specific to our AMD and Intel
    boxes.)

    Thanks,
    Mike

  10. Re: [PATCH 1/2] boot: increase stack size for kernel boot loader decompressor

    Ingo Molnar wrote:
    > * Alexander van Heukelum wrote:
    >
    >> Hi Ingo,
    >>
    >> I see you have applied the following patch to x86#for-akpm. It was
    >> really meant for testing only. I think you meant to use this one
    >> instead?

    >
    > yep, i wanted to see how it holds up in testing - it's OK so far. I've
    > got your other, fuller one queued up meanwhile - it's not pushed out
    > yet.
    >
    > Ingo


    I will try it out on my failing case as soon as I can...

    Thanks,
    Mike

  11. Re: [PATCH 1/2] boot: increase stack size for kernel boot loader decompressor


    * Mike Travis wrote:

    > Ingo Molnar wrote:
    > > * Alexander van Heukelum wrote:
    > >
    > >> Hi Ingo,
    > >>
    > >> I see you have applied the following patch to x86#for-akpm. It was
    > >> really meant for testing only. I think you meant to use this one
    > >> instead?

    > >
    > > yep, i wanted to see how it holds up in testing - it's OK so far. I've
    > > got your other, fuller one queued up meanwhile - it's not pushed out
    > > yet.
    > >
    > > Ingo

    >
    > I will try it out on my failing case as soon as I can...


    more test results: i just booted an allyesconfig 64-bit (MAXSMP, etc.)
    kernel on x86 native hardware successfully - that has Alexander's patch
    included but not your boot tweak. (has all your other patches included)

    would you expect a real 4K CPUs system to boot any differently? So early
    during bootup all x86 hardware is just a uniprocessor, so i'd be
    surprised if there was any difference.

    [ in any case, if the tweak still makes a real difference for you we can
    still apply it because it does not hurt anyone - but let's try to avoid
    black voodoo tweaks as much as possible ]

    btw, booting up MAXSMP is pretty impressive:

    CONFIG_NR_CPUS=4096

    shows how far Linux scalability has come

    i've got a bugreport for you though: MAXSMP does not suspend+resume
    correctly ;-) It gets this far:

    [ 146.348790] PM: Syncing filesystems ... done.
    [ 146.353488] PM: Preparing system for mem sleep
    [ 146.360204] Freezing user space processes ... (elapsed 0.00 seconds) done.
    [ 146.367172] Freezing remaining freezable tasks ... (elapsed 0.93 seconds) done.
    [ 147.309618] PM: Entering mem sleep
    [ 147.313032] Suspending console(s)

    then reboots spontaneously instead of resuming. (I use the
    suspend+resume self-test feature below to conduct automated
    suspend/resume tests.)

    Ingo

    ------------------------->
    Subject: suspend: sleepy linux self-test
    From: David Brownell

    See the appended; it includes more of Ingo's suggestions.

    Since this is increasingly unrelated to the "sleepy linux" concept
    (a version of what systems like OLPC, N700, and N800 are doing), I
    got rid of the "sleepy.c" file.

    - Dave

    ============ CUT HERE
    Boot-time test for system suspend states (STR or standby). The generic
    RTC framework triggers wakeup alarms, used to exit those states.

    - Measures some aspects of suspend time; uses "jiffies". This
    should probably use a clocksource instead, since those often
    work properly even while IRQs are disabled.

    - Includes a command line parameter, which needs work yet ... it
    currently turns this test off, but it should also let the target
    state be specified (and maybe even default to "no test").
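One way the command line parameter described above could let the target state be specified is sketched below in plain userspace C. This is an illustration only, not code from the patch: the enum, the function name parse_test_suspend() and the accepted strings "mem" and "standby" are all assumptions. The optional leading '=' is handled on the assumption that the handler is handed the remainder of the option string after the "test_suspend" prefix.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel's suspend_state_t values. */
enum test_state {
    TEST_OFF,       /* plays the role of PM_SUSPEND_ON: no test */
    TEST_STANDBY,   /* plays the role of PM_SUSPEND_STANDBY */
    TEST_MEM        /* plays the role of PM_SUSPEND_MEM */
};

/*
 * Map the text following "test_suspend" to a target state.
 * A bare "test_suspend" (empty value) keeps the behaviour of the
 * posted patch and turns the test off; so does any unknown word.
 */
static enum test_state parse_test_suspend(const char *value)
{
    if (value == NULL || *value == '\0')
        return TEST_OFF;
    if (*value == '=')          /* accept "test_suspend=mem" style input */
        value++;
    if (strcmp(value, "mem") == 0)
        return TEST_MEM;
    if (strcmp(value, "standby") == 0)
        return TEST_STANDBY;
    return TEST_OFF;
}
```

With this shape, "test_suspend=standby" would select the standby test, while a bare "test_suspend" still disables the test as the posted handler does.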

    Lightly tested on an ARM system, which reported that suspending devices
    took 7 msec and resuming them took 132 msec:

    * The PCMCIA stack misbehaved a bit. It didn't finish enumerating
    the card before it suspended, so the wakeup event came from the
    CF card IRQ not from the RTC!

    * The MMC stack misbehaved more seriously. It wants to remove devices
    during the suspend sequence (quite needlessly, on this hardware),
    which now makes Linux unhappy.

    Workaround in both cases was to take the memory card out before booting.

    Also includes some Kconfig tweaks to help reduce configuration bugs on
    x86, by avoiding the legacy RTC driver when the generic RTC framework
    is enabled ... those should become a separate patch.

    Signed-off-by: Ingo Molnar
    ---
    drivers/char/Kconfig | 5 +
    drivers/rtc/Kconfig | 1
    kernel/power/Kconfig | 10 +++
    kernel/power/main.c | 163 +++++++++++++++++++++++++++++++++++++++++++++++++++
    4 files changed, 178 insertions(+), 1 deletion(-)

    Index: linux/drivers/char/Kconfig
    ===================================================================
    --- linux.orig/drivers/char/Kconfig
    +++ linux/drivers/char/Kconfig
    @@ -704,9 +704,12 @@ config NVRAM
    To compile this driver as a module, choose M here: the
    module will be called nvram.

    +comment "You are using the RTC framework, not the legacy CMOS RTC driver"
    + depends on RTC_DRV_CMOS
    +
    config RTC
    tristate "Enhanced Real Time Clock Support"
    - depends on !PPC && !PARISC && !IA64 && !M68K && !SPARC && !FRV && !ARM && !SUPERH && !S390
    + depends on !PPC && !PARISC && !IA64 && !M68K && !SPARC && !FRV && !ARM && !SUPERH && !S390 && !RTC_DRV_CMOS
    ---help---
    If you say Y here and create a character special file /dev/rtc with
    major number 10 and minor number 135 using mknod ("man mknod"), you
    Index: linux/drivers/rtc/Kconfig
    ===================================================================
    --- linux.orig/drivers/rtc/Kconfig
    +++ linux/drivers/rtc/Kconfig
    @@ -303,6 +303,7 @@ comment "Platform RTC drivers"
    config RTC_DRV_CMOS
    tristate "PC-style 'CMOS'"
    depends on X86 || ALPHA || ARM || M32R || ATARI || PPC || MIPS
    + default y if X86
    help
    Say "yes" here to get direct support for the real time clock
    found in every PC or ACPI-based system, and some other boards.
    Index: linux/kernel/power/Kconfig
    ===================================================================
    --- linux.orig/kernel/power/Kconfig
    +++ linux/kernel/power/Kconfig
    @@ -104,6 +104,16 @@ config SUSPEND
    powered and thus its contents are preserved, such as the
    suspend-to-RAM state (e.g. the ACPI S3 state).

    +config PM_TEST_SUSPEND
    + bool "Test suspend/resume and wakealarm during bootup"
    + depends on SUSPEND && PM_DEBUG && RTC_LIB=y
    + ---help---
    + This option will suspend your machine during bootup, and make
    + it wake up a few seconds later using the RTC's wakeup alarm.
    +
    + You probably want to have your system's RTC driver statically
    + linked, ensuring that it's available when this test runs.
    +
    config SUSPEND_FREEZER
    bool "Enable freezer for suspend to RAM/standby" \
    if ARCH_WANTS_FREEZER_CONTROL || BROKEN
    Index: linux/kernel/power/main.c
    ===================================================================
    --- linux.orig/kernel/power/main.c
    +++ linux/kernel/power/main.c
    @@ -132,6 +132,52 @@ static inline int suspend_test(int level

    #ifdef CONFIG_SUSPEND

    +#ifdef CONFIG_PM_TEST_SUSPEND
    +
    +/*
    + * We test the system suspend code by setting an RTC wakealarm a short
    + * time in the future, then suspending. Suspending the devices won't
    + * normally take long ... some systems only need a few milliseconds.
    + *
    + * The time it takes is system-specific though, so when we test this
    + * during system bootup we allow a LOT of time.
    + */
    +#define TEST_SUSPEND_SECONDS 5
    +
    +static unsigned long suspend_test_start_time;
    +
    +static void suspend_test_start(void)
    +{
    + /* FIXME Use better timebase than "jiffies", ideally a clocksource.
    + * What we want is a hardware counter that will work correctly even
    + * during the irqs-are-off stages of the suspend/resume cycle...
    + */
    + suspend_test_start_time = jiffies;
    +}
    +
    +static void suspend_test_finish(const char *label)
    +{
    + long nj = jiffies - suspend_test_start_time;
    + unsigned msec;
    +
    + msec = jiffies_to_msecs((nj >= 0) ? nj : -nj);
    + pr_info("PM: %s took %d.%03d seconds\n", label,
    + msec / 1000, msec % 1000);
    + WARN_ON_ONCE(msec > ((TEST_SUSPEND_SECONDS+5) * 1000));
    +}
    +
    +#else
    +
    +static void suspend_test_start(void)
    +{
    +}
    +
    +static void suspend_test_finish(const char *label)
    +{
    +}
    +
    +#endif
    +
    /* This is just an arbitrary number */
    #define FREE_PAGE_NUMBER (100)

    @@ -264,11 +310,14 @@ int suspend_devices_and_enter(suspend_st
    goto Close;
    }
    suspend_console();
    +
    + suspend_test_start();
    error = device_suspend(PMSG_SUSPEND);
    if (error) {
    printk(KERN_ERR "PM: Some devices failed to suspend\n");
    goto Resume_console;
    }
    + suspend_test_finish("suspend devices");

    if (suspend_test(TEST_DEVICES))
    goto Resume_devices;
    @@ -291,7 +340,9 @@ int suspend_devices_and_enter(suspend_st
    if (suspend_ops->finish)
    suspend_ops->finish();
    Resume_devices:
    + suspend_test_start();
    device_resume();
    + suspend_test_finish("resume devices");
    Resume_console:
    resume_console();
    Close:
    @@ -515,3 +566,115 @@ static int __init pm_init(void)
    }

    core_initcall(pm_init);
    +
    +
    +#ifdef CONFIG_PM_TEST_SUSPEND
    +
    +#include <linux/rtc.h>
    +
    +/*
    + * To test system suspend, we need a hands-off mechanism to resume the
    + * system. RTCs with wakeup alarms are the most common mechanism
    + * that's self-contained.
    + */
    +
    +static void __init test_wakealarm(struct rtc_device *rtc, suspend_state_t state)
    +{
    + static char err_readtime [] __initdata =
    + KERN_ERR "PM: can't read %s time, err %d\n";
    + static char err_wakealarm [] __initdata =
    + KERN_ERR "PM: can't set %s wakealarm, err %d\n";
    + static char err_suspend [] __initdata =
    + KERN_ERR "PM: suspend test failed, error %d\n";
    + static char info_test [] __initdata =
    + KERN_INFO "PM: test RTC wakeup from '%s' suspend\n";
    +
    + unsigned long now;
    + struct rtc_wkalrm alm;
    + int status;
    +
    + /* this may fail if the RTC hasn't been initialized */
    + status = rtc_read_time(rtc, &alm.time);
    + if (status < 0) {
    + printk(err_readtime, rtc->dev.bus_id, status);
    + return;
    + }
    + rtc_tm_to_time(&alm.time, &now);
    +
    + memset(&alm, 0, sizeof alm);
    + rtc_time_to_tm(now + TEST_SUSPEND_SECONDS, &alm.time);
    + alm.enabled = true;
    +
    + status = rtc_set_alarm(rtc, &alm);
    + if (status < 0) {
    + printk(err_wakealarm, rtc->dev.bus_id, status);
    + return;
    + }
    +
    + if (state == PM_SUSPEND_MEM) {
    + printk(info_test, pm_states[state]);
    + status = pm_suspend(state);
    + if (status == -ENODEV)
    + state = PM_SUSPEND_STANDBY;
    + }
    + if (state == PM_SUSPEND_STANDBY) {
    + printk(info_test, pm_states[state]);
    + status = pm_suspend(state);
    + }
    + if (status < 0)
    + printk(err_suspend, status);
    +}
    +
    +static int __init has_wakealarm(struct device *dev, void *name_ptr)
    +{
    + struct rtc_device *candidate = to_rtc_device(dev);
    +
    + if (!candidate->ops->set_alarm)
    + return 0;
    + if (!device_may_wakeup(candidate->dev.parent))
    + return 0;
    +
    + *(char **)name_ptr = dev->bus_id;
    + return 1;
    +}
    +
    +/*
    + * We normally test Suspend-to-RAM, with standby as a backup when
    + * the system doesn't support that state. But we also need to be
    + * able to disable the powerup test, and tell it to ignore STR since
    + * the RTC may not work then.
    + */
    +static suspend_state_t test_state __initdata = PM_SUSPEND_MEM;
    +
    +static int __init setup_test_suspend(char *value)
    +{
    + /* FIXME accept "standby", etc */
    + test_state = PM_SUSPEND_ON;
    + return 0;
    +}
    +__setup("test_suspend", setup_test_suspend);
    +
    +static int __init test_suspend(void)
    +{
    + static char warn_no_rtc[] __initdata =
    + KERN_WARNING "PM: no wakealarm-capable RTC driver is ready\n";
    +
    + char *pony = NULL;
    + struct rtc_device *rtc = NULL;
    +
    + class_find_device(rtc_class, &pony, has_wakealarm);
    + if (pony)
    + rtc = rtc_class_open(pony);
    +
    + if (rtc) {
    + if (test_state != PM_SUSPEND_ON)
    + test_wakealarm(rtc, test_state);
    + rtc_class_close(rtc);
    + } else
    + printk(warn_no_rtc);
    +
    + return 0;
    +}
    +late_initcall(test_suspend);
    +
    +#endif /* CONFIG_PM_TEST_SUSPEND */
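
The elapsed-time measurement done by suspend_test_start() and suspend_test_finish() in the patch above can be sketched in userspace C as follows. This is a minimal sketch under stated assumptions: elapsed_ticks() and ticks_to_msecs() are hypothetical names, the hz argument stands in for the kernel's HZ, and the sketch relies on plain unsigned subtraction to survive counter wraparound, which is a different technique from the signed difference and absolute value used in the patch.

```c
#include <assert.h>
#include <limits.h>

/*
 * Number of ticks between two samples of a free-running counter.
 * Unsigned subtraction is modulo 2^N, so the result stays correct
 * even if the counter wrapped once between the two samples.
 */
static unsigned long elapsed_ticks(unsigned long start, unsigned long now)
{
    return now - start;
}

/* Convert a tick count to milliseconds for a counter running at hz ticks/s. */
static unsigned long ticks_to_msecs(unsigned long ticks, unsigned int hz)
{
    assert(hz > 0);
    /* Split into whole seconds and remainder to limit overflow risk. */
    return (ticks / hz) * 1000UL + ((ticks % hz) * 1000UL) / hz;
}
```

For example, 1250 ticks at 250 ticks per second convert to 5000 msec, which would then be printed in the "%d.%03d seconds" style used by the patch. As the FIXME in the patch notes, a tick counter of this kind does not advance while interrupts are off, so the figures are only approximate.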

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  12. Re: [PATCH 1/2] boot: increase stack size for kernel boot loader decompressor

    On Tue, Apr 8, 2008 at 1:23 AM, Ingo Molnar wrote:
    >
    > * Alexander van Heukelum wrote:
    >
    > > I did see that the malloc space that the inflate code is using is
    > > taken from _after_ the end of the bss. I don't see how this is
    > > protected from being used/overwritten. Changing the stack size changes
    > > the memory layout a bit... maybe you were so unlucky to create a
    > > vmlinux image that was just barely smaller than some threshold and
    > > increasing the stack size made the decompression/relocation area be
    > > located somewhere else?
    > >
    > > Test patch follows.

    >
    > that's a really interesting theory.
    >
    > FWIW, i've been booting allyesconfig bzImages for a long time (with
    > only minimal amount of drivers disabled - mostly old ISA ones that
    > assume the presence of the real hardware), and they boot and work fine
    > on both 32-bit and 64-bit typical whitebox PCs. That means huge bzImages
    > that decompress into a ~41 MB kernel image. I'd expect that to be a
    > rather severe test of the decompressor.


    i don't think that Alexander's patch is needed.

    also, because Alex moved the heap before _end,
    we may need to add some extra to the buffer offset

    /* Replace the compressed data size with the uncompressed size */
    subl input_len(%ebp), %ebx
    movl output_len(%ebp), %eax
    addl %eax, %ebx
    /* Add 8 bytes for every 32K input block */
    shrl $12, %eax
    addl %eax, %ebx
    /* Add 32K + 18 bytes of extra slack and align on a 4K boundary */
    addl $(32768 + 18 + 4095), %ebx
    andl $~4095, %ebx =============================> need to add heap size too.
    .....



    /* Replace the compressed data size with the uncompressed size */
    movl input_len(%rip), %eax
    subq %rax, %rbx
    movl output_len(%rip), %eax
    addq %rax, %rbx
    /* Add 8 bytes for every 32K input block */
    shrq $12, %rax
    addq %rax, %rbx
    /* Add 32K + 18 bytes of extra slack and align on a 4K boundary */
    addq $(32768 + 18 + 4095), %rbx
    =============================> need to add heap size too.
    andq $~4095, %rbx

    do we need to move pgtable before _end?

    YH

  13. Re: [PATCH] x86: cleanup boot-heap usage

    On Tue, Apr 8, 2008 at 3:54 AM, Alexander van Heukelum wrote:
    > The kernel decompressor wrapper uses memory located beyond the
    > end of the image. This might lead to hard to debug problems,
    > but even if it can be proven to be safe, it is at the very
    > least unclean. I don't see any advantages either, unless you
    > count it not being zeroed out as an advantage. This patch
    > moves the boot-heap area to the bss segment.
    >
    > Signed-off-by: Alexander van Heukelum
    >
    > ---
    >
    > On Tue, Apr 08, 2008 at 10:23:54AM +0200, Ingo Molnar wrote:
    > > * Alexander van Heukelum wrote:
    > > > I did see that the malloc space that the inflate code is using is
    > > > taken from _after_ the end of the bss. I don't see how this is
    > > > protected from being used/overwritten. Changing the stack size changes
    > > > the memory layout a bit... maybe you were so unlucky to create a
    > > > vmlinux image that was just barely smaller than some threshold and
    > > > increasing the stack size made the decompression/relocation area be
    > > > located somewhere else?

    the compressed image is copied to the end of the buffer (with extra
    room for the relocated code, from .text to _end), and the
    decompression is then done in place. the .text section is near the end.

    YH

  14. Re: [PATCH 1/2] boot: increase stack size for kernel boot loader decompressor

    Ingo Molnar wrote:
    > * Mike Travis wrote:
    >
    >> Ingo Molnar wrote:
    >>> * Alexander van Heukelum wrote:
    >>>
    >>>> Hi Ingo,
    >>>>
    >>>> I see you have applied the following patch to x86#for-akpm. It was
    >>>> really meant for testing only. I think you meant to use this one
    >>>> instead?
    >>> yep, i wanted to see how it holds up in testing - it's OK so far. I've
    >>> got your other, fuller one queued up meanwhile - it's not pushed out
    >>> yet.
    >>>
    >>> Ingo

    >> I will try it out on my failing case as soon as I can...

    >
    > more test results: i just booted an allyesconfig 64-bit (MAXSMP, etc.)
    > kernel on x86 native hardware successfully - that has Alexander's patch
    > included but not your boot tweak. (has all your other patches included)
    >
    > would you expect a real 4K CPUs system to boot any differently? So early
    > during bootup all x86 hardware is just a uniprocessor, so i'd be
    > surprised if there was any difference.
    >
    > [ in any case, if the tweak still makes a real difference for you we can
    > still apply it because it does not hurt anyone - but let's try to avoid
    > black voodoo tweaks as much as possible ]


    Yes, my patch is not needed. I booted the akpm2 config with 512 possible
    cpus (8 real) on an Intel box with 8gig total ram. It booted fine and is
    running some cpuset and sched-domain tests now.

    One problem though, even though it has slots for extra cpus to be brought
    online there is no /sys/devices/system/cpu/cpuXX/online file to actually bring
    them online. This was shown in a simulated run I did with 64 real cpus
    and 12 of them disabled. They showed up in the 'possible' map but no
    way to bring them online. [Unless there's a trick I don't know about.
    Nothing is mentioned in Documentation/cpu-hotplug.txt about this.]
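
For reference, on systems where the per-cpu control file does exist, bringing a cpu online from userspace amounts to writing "1" to it. The sketch below is illustrative only: the helper names cpu_online_path() and cpu_set_online() are made up for this example, and the path used is the standard sysfs location /sys/devices/system/cpu/cpuN/online. In the situation described above, where the file is absent, the fopen() call simply fails.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the sysfs path of the "online" control file for one cpu. */
static int cpu_online_path(char *buf, size_t len, unsigned int cpu)
{
    int n;

    assert(buf != NULL);
    n = snprintf(buf, len, "/sys/devices/system/cpu/cpu%u/online", cpu);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;   /* -1 on truncation */
}

/* Write "1" or "0" to the control file; returns 0 on success, -1 on error. */
static int cpu_set_online(unsigned int cpu, int online)
{
    char path[64];
    FILE *f;
    int rc;

    if (cpu_online_path(path, sizeof path, cpu) < 0)
        return -1;
    f = fopen(path, "w");
    if (f == NULL)
        return -1;      /* file missing, or insufficient privileges */
    rc = (fputs(online ? "1" : "0", f) == EOF) ? -1 : 0;
    if (fclose(f) == EOF)
        rc = -1;
    return rc;
}
```

A call such as cpu_set_online(12, 1) needs root privileges and only succeeds if the kernel exposes the file for that cpu.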

    >
    > btw, booting up MAXSMP is pretty impressive:
    >
    > CONFIG_NR_CPUS=4096
    >
    > shows how far Linux scalability has come
    >
    > i've got a bugreport for you though: MAXSMP does not suspend+resume
    > correctly ;-) It gets this far:
    >
    > [ 146.348790] PM: Syncing filesystems ... done.
    > [ 146.353488] PM: Preparing system for mem sleep
    > [ 146.360204] Freezing user space processes ... (elapsed 0.00 seconds) done.
    > [ 146.367172] Freezing remaining freezable tasks ... (elapsed 0.93 seconds) done.
    > [ 147.309618] PM: Entering mem sleep
    > [ 147.313032] Suspending console(s)
    >
    > then reboots spontaneously instead of resuming. (I use the
    > suspend+resume self-test feature below to conduct automated
    > suspend/resume tests.)
    >
    > Ingo


    I will try it out. Yes, please send me any tests I can add to my suite.

    Thanks!
    Mike



  15. Re: [PATCH 1/2] boot: increase stack size for kernel boot loader decompressor

    Ingo Molnar wrote:

    > i've got a bugreport for you though: MAXSMP does not suspend+resume
    > correctly ;-) It gets this far:
    >
    > [ 146.348790] PM: Syncing filesystems ... done.
    > [ 146.353488] PM: Preparing system for mem sleep
    > [ 146.360204] Freezing user space processes ... (elapsed 0.00 seconds) done.
    > [ 146.367172] Freezing remaining freezable tasks ... (elapsed 0.93 seconds) done.
    > [ 147.309618] PM: Entering mem sleep
    > [ 147.313032] Suspending console(s)
    >
    > then reboots spontaneously instead of resuming. (I use the
    > suspend+resume self-test feature below to conduct automated
    > suspend/resume tests.)


    Here are my test results. It worked fine on the Intel box, but
    hung hard on the AMD box. I'll do some further testing to narrow
    it down.

    Thanks,
    Mike

    ------------------------------------------------------------
    Intel box:

    PM: test RTC wakeup from 'mem' suspend
    PM: test RTC wakeup from 'standby' suspend
    PM: Syncing filesystems ... done.
    PM: Preparing system for standby sleep
    Freezing user space processes ... (elapsed 0.00 seconds) done.
    Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
    PM: Entering standby sleep
    Suspending console(s)

  16. Re: [PATCH 0/2] NR_CPUS: increase maximum NR_CPUS to 4096

    On Fri, Apr 4, 2008 at 6:30 PM, Mike Travis wrote:
    >
    > * Increases the limit of NR_CPUS to 4096 and introduces a
    > boolean called "MAXSMP" which when set (e.g. "allyesconfig")
    > will set NR_CPUS = 4096 and NODES_SHIFT = 9 (512).
    >
    > I've been running this config (4k NR_CPUS, 512 Max Nodes)
    > on an AMD box with 2 dual-cores and 4gb memory as well as an
    > Intel box with 4 single-core cpus and 8Mb. I've also
    > successfully booted it in a simulated 2cpus/1Gb environment.
    >
    > Based on:
    > git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
    > + x86/latest .../x86/linux-2.6-x86.git
    > + sched-devel/latest .../mingo/linux-2.6-sched-devel.git
    >
    > Signed-off-by: Mike Travis


    got

    ------------[ cut here ]------------
    WARNING: at kernel/sched_fair.c:815 hrtick_start_fair+0x69/0x156()
    Modules linked in:
    Pid: 1, comm: swapper Not tainted 2.6.25-rc8-x86-latest.git-smp-01033-ga39ae31-dirty #77

    Call Trace:
    [] warn_on_slowpath+0x67/0x8e
    [] hrtick_start_fair+0x69/0x156
    [] ? dequeue_entity+0x2a/0xf8
    [] dequeue_task_fair+0x5f/0x7e
    [] dequeue_task+0x22/0x44
    [] deactivate_task+0x39/0x69
    [] schedule+0x1b9/0x5c5
    [] ? autoremove_wake_function+0x20/0x5e
    [] schedule_timeout+0x31/0xd7
    [] ? __wake_up+0x52/0x75
    [] wait_for_common+0x103/0x189
    [] ? default_wake_function+0x0/0x36
    [] wait_for_completion+0x2b/0x41
    [] call_usermodehelper_exec+0x87/0xe5
    [] kobject_uevent_env+0x3d0/0x424
    [] kobject_uevent+0x1e/0x34
    [] device_add+0x2f9/0x494
    [] device_register+0x28/0x43
    [] pcie_port_device_register+0x3f1/0x43e
    [] ? pcibios_set_master+0x8d/0xa8
    [] pcie_portdrv_probe+0x79/0xbb
    [] pci_call_probe+0xe5/0x146
    [] pci_device_probe+0x64/0xa2
    [] driver_probe_device+0xcf/0x16d
    [] ? sysfs_addrm_finish+0x2f/0x22b
    [] ? __driver_attach+0x0/0xbe
    [] __driver_attach+0x6e/0xbe
    [] bus_for_each_dev+0x5e/0xa2
    [] driver_attach+0x2f/0x45
    [] bus_add_driver+0xc6/0x226
    [] ? bus_put+0x29/0x3f
    [] driver_register+0x6d/0xfc
    [] __pci_register_driver+0x62/0xb0
    [] pcie_portdrv_init+0x4a/0x72
    [] kernel_init+0x1b4/0x340
    [] child_rip+0xa/0x12
    [] ? kernel_init+0x0/0x340
    [] ? child_rip+0x0/0x12

    ---[ end trace e26645195698f5cf ]---
    BUG: unable to handle kernel NULL pointer dereference at 0000000000000148
    IP: [] pick_next_task_fair+0x7c/0xbb
    PGD 0
    Oops: 0000 [1] SMP
    CPU 28
    Modules linked in:
    Pid: 1, comm: swapper Not tainted 2.6.25-rc8-x86-latest.git-smp-01033-ga39ae31-dirty #77
    RIP: 0010:[] [] pick_next_task_fair+0x7c/0xbb
    RSP: 0018:ffff81081cc5cd70 EFLAGS: 00010046
    RAX: 0000000000000000 RBX: ffff81383c21a280 RCX: 0000000000000000
    RDX: ffff81383c224080 RSI: ffff81383c224080 RDI: 0000000063e15417
    RBP: ffff81081cc5cda0 R08: 0000000000000000 R09: ffff81383c224108
    R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
    R13: ffff81383c224080 R14: ffff81383c224080 R15: 000000000000001c
    FS: 0000000000000000(0000) GS:ffff81401cc3c600(0000) knlGS:0000000000000000
    CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
    CR2: 0000000000000148 CR3: 0000000000201000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process swapper (pid: 1, threadinfo ffff81081cc5c000, task ffff81401cc52000)
    Stack: ffff81081cc5cda0 0000000063e15417 ffffffff80a81840 0000000000000000
    00000000fffeecfd ffff81383c224080 ffff81081cc5ce70 ffffffff80a57d28
    ffff81081cc5ce00 ffff81081cc5ce20 ffffffff81963080 ffffffff81963080
    Call Trace:
    [] schedule+0x2b0/0x5c5
    [] ? autoremove_wake_function+0x20/0x5e
    [] schedule_timeout+0x31/0xd7
    [] ? __wake_up+0x52/0x75
    [] wait_for_common+0x103/0x189
    [] ? default_wake_function+0x0/0x36
    [] wait_for_completion+0x2b/0x41
    [] call_usermodehelper_exec+0x87/0xe5
    [] kobject_uevent_env+0x3d0/0x424
    [] kobject_uevent+0x1e/0x34
    [] device_add+0x2f9/0x494
    [] device_register+0x28/0x43
    [] pcie_port_device_register+0x3f1/0x43e
    [] ? pcibios_set_master+0x8d/0xa8
    [] pcie_portdrv_probe+0x79/0xbb
    [] pci_call_probe+0xe5/0x146
    [] pci_device_probe+0x64/0xa2
    [] driver_probe_device+0xcf/0x16d
    [] ? sysfs_addrm_finish+0x2f/0x22b
    [] ? __driver_attach+0x0/0xbe
    [] __driver_attach+0x6e/0xbe
    [] bus_for_each_dev+0x5e/0xa2
    [] driver_attach+0x2f/0x45
    [] bus_add_driver+0xc6/0x226
    [] ? bus_put+0x29/0x3f
    [] driver_register+0x6d/0xfc
    [] __pci_register_driver+0x62/0xb0
    [] pcie_portdrv_init+0x4a/0x72
    [] kernel_init+0x1b4/0x340
    [] child_rip+0xa/0x12
    [] ? kernel_init+0x0/0x340
    [] ? child_rip+0x0/0x12


    Code: 24 40 78 1c 8b 3d 36 05 b3 00 48 89 da be 00 04 00 00 e8 7a eb ff ff 49 39 c6 7f 04 4c 8b 63 48 4c 89 e6 48 89 df e8 29 f1 ff ff <49> 8b 9c 24 48 01 00 00 48 85 db 75 a5 49 8d 5c 24 c8 4c 89 ef
    RIP [] pick_next_task_fair+0x7c/0xbb
    RSP
    CR2: 0000000000000148
    ---[ end trace e26645195698f5cf ]---

  17. Re: [PATCH 0/2] NR_CPUS: increase maximum NR_CPUS to 4096

    Yinghai Lu wrote:
    > On Fri, Apr 4, 2008 at 6:30 PM, Mike Travis wrote:
    >> * Increases the limit of NR_CPUS to 4096 and introduces a
    >> boolean called "MAXSMP" which when set (e.g. "allyesconfig")
    >> will set NR_CPUS = 4096 and NODES_SHIFT = 9 (512).
    >>
    >> I've been running this config (4k NR_CPUS, 512 Max Nodes)
    >> on an AMD box with 2 dual-cores and 4gb memory as well as an
    >> Intel box with 4 single-core cpus and 8Mb. I've also
    >> successfully booted it in a simulated 2cpus/1Gb environment.
    >>
    >> Based on:
    >> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
    >> + x86/latest .../x86/linux-2.6-x86.git
    >> + sched-devel/latest .../mingo/linux-2.6-sched-devel.git
    >>
    >> Signed-off-by: Mike Travis

    >
    > got


    Hi Yinghai,

    Thanks for the feedback! Would you send me your config file and
    other details (like cpu type/mem size/etc.) and I'll attempt
    to reproduce the failure.

    (My problem is that only the AMD box is a real "workstation", the
    Intel box is a dual quad-cpu server so it's really deficient in I/O.)

    Thanks,
    Mike



  18. Re: [PATCH 1/2] boot: increase stack size for kernel boot loader decompressor


    On Tue, 8 Apr 2008 10:54:15 -0700, "Yinghai Lu" said:
    > On Tue, Apr 8, 2008 at 1:23 AM, Ingo Molnar wrote:
    > >
    > > * Alexander van Heukelum wrote:
    > >
    > > > I did see that the malloc space that the inflate code is using is
    > > > taken from _after_ the end of the bss. I don't see how this is
    > > > protected from being used/overwritten. Changing the stack size changes
    > > > the memory layout a bit... maybe you were so unlucky to create a
    > > > vmlinux image that was just barely smaller than some threshold and
    > > > increasing the stack size made the decompression/relocation area be
    > > > located somewhere else?
    > > >
    > > > Test patch follows.

    > >
    > > that's a really interesting theory.
    > >
    > > FWIW, i've been booting allyesconfig bzImages for a long time (with
    > > only minimal amount of drivers disabled - mostly old ISA ones that
    > > assume the presence of the real hardware), and they boot and work fine
    > > on both 32-bit and 64-bit typical whitebox PCs. That means huge bzImages
    > > that decompress into a ~41 MB kernel image. I'd expect that to be a
    > > rather severe test of the decompressor.

    >
    > i don't think that Alexander's patch is needed.


    Hello Yinghai Lu,

    Indeed, I now think it is not needed either. The decompression is
    done in-place nowadays: the (compressed) image is moved to a high
    memory address first, then the decompression is done starting at
    the low end of the buffer. It is guaranteed that the output never
    overwrites the input, and the decompression code, the stack, and
    the heap are all at higher addresses than the input buffer. The
    same goes for the pagetables needed for x86_64.
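
The in-place scheme described above can be illustrated with a toy decoder. This is a sketch only and makes several assumptions: it uses a trivial run-length format of (count, byte) pairs instead of gzip, the function name rle_inflate_in_place() is invented for this example, and it checks at run time for the overlap that the kernel's slack calculation is meant to rule out in advance.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy in-place decoder. The packed input (count, byte pairs) occupies the
 * last in_len bytes of buf; the output is written starting at buf[0].
 * Returns the number of bytes produced, or -1 if a write would overwrite
 * input that has not been read yet (or run past the end of the buffer).
 */
static long rle_inflate_in_place(unsigned char *buf, size_t buf_len,
                                 size_t in_len)
{
    size_t rd, wr = 0;

    assert(buf != NULL);
    if (in_len > buf_len || in_len % 2 != 0)
        return -1;
    rd = buf_len - in_len;              /* next unread input byte */

    while (rd < buf_len) {
        unsigned char count = buf[rd];
        unsigned char value = buf[rd + 1];

        rd += 2;                        /* this pair is now consumed */
        if (wr + count > rd)            /* output would reach unread input */
            return -1;
        memset(buf + wr, value, count);
        wr += count;
    }
    return (long)wr;
}
```

With the 6-byte input {4,'a', 4,'b', 1,'c'}, which expands to 9 bytes, a 10-byte buffer (one byte of slack) decodes cleanly, while a 9-byte buffer with no slack is rejected because the second run would overwrite the third pair before it is read. That is the role the extra slack plays in the real decompressor.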

    > also, because Alex moved the heap before _end,
    > we may need to add some extra to the buffer offset
    >
    > /* Replace the compressed data size with the uncompressed size */
    > subl input_len(%ebp), %ebx
    > movl output_len(%ebp), %eax
    > addl %eax, %ebx
    > /* Add 8 bytes for every 32K input block */
    > shrl $12, %eax
    > addl %eax, %ebx
    > /* Add 32K + 18 bytes of extra slack and align on a 4K boundary */
    > addl $(32768 + 18 + 4095), %ebx
    > andl $~4095, %ebx =============================> need add heap size too.
    > ....


    No, that size is accounted for automatically: the code computes the
    buffer size needed (including slack) minus the buffer size that is
    already available (in the embedded gzip-file). The image is moved
    by this amount (rounded up to a page). So that part is fine.
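
    [Editor's sketch] The arithmetic in the quoted assembly can be restated
    in C to make the individual terms easier to check. This is a minimal
    sketch, not the kernel's code: the function name relocation_target and
    the parameter start (whatever the register holds on entry, which the
    quoted fragment does not show) are assumptions, and it uses 64-bit
    arithmetic throughout, whereas the 32-bit variant wraps at 2^32.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Restates the quoted register arithmetic step by step.
 * start      - value in %ebx/%rbx on entry (assumption: not shown in the quote)
 * input_len  - size of the compressed data
 * output_len - size of the uncompressed data
 */
static uint64_t relocation_target(uint64_t start, uint32_t input_len,
                                  uint32_t output_len)
{
    uint64_t v = start;

    /* Replace the compressed data size with the uncompressed size */
    v -= input_len;
    v += output_len;

    /* Add 8 bytes for every 32K block: 8 / 32768 == 1 / 4096, i.e. >> 12 */
    v += (uint64_t)(output_len >> 12);

    /* Add 32K + 18 bytes of slack; the + 4095 and the mask round up to 4K */
    v += 32768 + 18 + 4095;
    v &= ~(uint64_t)4095;

    assert((v & 4095) == 0);  /* result is always page aligned */
    return v;
}
```

    For example, with start = 16 MB, a 10 MB compressed image and a 41 MB
    uncompressed image, the sketch yields 49328128. The per-block term
    scales with the image size, while the 32K + 18 term is constant.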

    >
    >
    > /* Replace the compressed data size with the uncompressed size */
    > movl input_len(%rip), %eax
    > subq %rax, %rbx
    > movl output_len(%rip), %eax
    > addq %rax, %rbx
    > /* Add 8 bytes for every 32K input block */
    > shrq $12, %rax
    > addq %rax, %rbx
    > /* Add 32K + 18 bytes of extra slack and align on a 4K boundary */
    > addq $(32768 + 18 + 4095), %rbx
    > =============================> need add heap size too.
    > andq $~4095, %rbx
    >
    > do we need to move pgtable before _end?


    I just tried, but it fails: the pgtable is built and enabled in the
    32-bit setup code, but the kernel image is moved in the 64-bit part...
    overwriting the pagetable with zeroes.

    I can't think of an obvious safe place to put the pagetables, though.
    One option is to move the image in the 32-bit code and tell the 64-bit
    part somehow not to do it again... by calling into the 64-bit code at a
    different place, for example.

    Greetings,
    Alexander

    > YH

    --
    Alexander van Heukelum
    heukelum@fastmail.fm

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  19. Re: [PATCH 1/2] boot: increase stack size for kernel boot loader decompressor

    On Wed, Apr 9, 2008 at 8:08 AM, Alexander van Heukelum wrote:
    >
    > On Tue, 8 Apr 2008 10:54:15 -0700, "Yinghai Lu" said:
    >
    > > On Tue, Apr 8, 2008 at 1:23 AM, Ingo Molnar wrote:
    > > >
    > > > * Alexander van Heukelum wrote:
    > > >
    > > > > I did see that the malloc space that the inflate code is using is
    > > > > taken from _after_ the end of the bss. I don't see how this is
    > > > > protected from being used/overwritten. Changing the stack size changes
    > > > > the memory layout a bit... maybe you were so unlucky to create a
    > > > > vmlinux image that was just barely smaller than some threshold and
    > > > > increasing the stack size made the decompression/relocation area be
    > > > > located somewhere else?
    > > > >
    > > > > Test patch follows.
    > > >
    > > > that's a really interesting theory.
    > > >
    > > > FWIIW, i've been booting allyesconfig bzImages for a long time (with
    > > > only minimal amount of drivers disabled - mostly old ISA ones that
    > > > assume the presence of the real hardware), and they boot and work fine
    > > > on both 32-bit and 64-bit typical whitebox PCs. That means huge bzImages
    > > > that decompresses into a ~41 MB kernel image. I'd expect that to be a
    > > > rather severe test of the decompressor.

    > >
    > > i don't think that Alexander's patch is needed.

    >
    > Hello Yinghai Lu,
    >
    > Indeed, I now think it is not needed either. The decompression is
    > done in-place nowadays: the (compressed) image is moved to a high
    > memory address first, then the decompression is done starting at
    > the low end of the buffer. It is guaranteed that the output never
    > overwrites the input, and the decompression code, the stack, and
    > the heap are all at higher addresses than the input buffer. The
    > same goes for the pagetables needed for x86_64.
    >
    >
    > > also because Alex move heap before _end,
    > > we may need add some extra for buffer offset
    > >
    > > /* Replace the compressed data size with the uncompressed size */
    > > subl input_len(%ebp), %ebx
    > > movl output_len(%ebp), %eax
    > > addl %eax, %ebx
    > > /* Add 8 bytes for every 32K input block */
    > > shrl $12, %eax
    > > addl %eax, %ebx
    > > /* Add 32K + 18 bytes of extra slack and align on a 4K boundary */
    > > addl $(32768 + 18 + 4095), %ebx
    > > andl $~4095, %ebx =============================> need add heap size too.
    > > ....

    >
    > No, that size is accounted for automatically: the code computes the
    > buffer size needed (including slack) minus the buffer size that is
    > already available (in the embedded gzip-file). The image is moved
    > by this amount (rounded up to a page). So that part is fine.


    I just wonder whether that 32K + 18 formula still works if Ingo has a
    very big vmlinux.

    YH

  20. Re: [PATCH 1/2] boot: increase stack size for kernel boot loader decompressor

    On Wed, Apr 9, 2008 at 8:08 AM, Alexander van Heukelum wrote:
    >
    > On Tue, 8 Apr 2008 10:54:15 -0700, "Yinghai Lu" said:
    >
    > > On Tue, Apr 8, 2008 at 1:23 AM, Ingo Molnar wrote:
    > > >
    > > > * Alexander van Heukelum wrote:
    > > >
    > > > > I did see that the malloc space that the inflate code is using is
    > > > > taken from _after_ the end of the bss. I don't see how this is
    > > > > protected from being used/overwritten. Changing the stack size changes
    > > > > the memory layout a bit... maybe you were so unlucky to create a
    > > > > vmlinux image that was just barely smaller than some threshold and
    > > > > increasing the stack size made the decompression/relocation area be
    > > > > located somewhere else?
    > > > >
    > > > > Test patch follows.
    > > >
    > > > that's a really interesting theory.
    > > >
    > > > FWIIW, i've been booting allyesconfig bzImages for a long time (with
    > > > only minimal amount of drivers disabled - mostly old ISA ones that
    > > > assume the presence of the real hardware), and they boot and work fine
    > > > on both 32-bit and 64-bit typical whitebox PCs. That means huge bzImages
    > > > that decompresses into a ~41 MB kernel image. I'd expect that to be a
    > > > rather severe test of the decompressor.

    > >
    > > i don't think that Alexander's patch is needed.

    >
    > Hello Yinghai Lu,
    >
    > Indeed, I now think it is not needed either. The decompression is
    > done in-place nowadays: the (compressed) image is moved to a high
    > memory address first, then the decompression is done starting at
    > the low end of the buffer. It is guaranteed that the output never
    > overwrites the input, and the decompression code, the stack, and
    > the heap are all at higher addresses than the input buffer. The
    > same goes for the pagetables needed for x86_64.
    >
    >
    > > also because Alex move heap before _end,
    > > we may need add some extra for buffer offset
    > >
    > > /* Replace the compressed data size with the uncompressed size */
    > > subl input_len(%ebp), %ebx
    > > movl output_len(%ebp), %eax
    > > addl %eax, %ebx
    > > /* Add 8 bytes for every 32K input block */
    > > shrl $12, %eax
    > > addl %eax, %ebx
    > > /* Add 32K + 18 bytes of extra slack and align on a 4K boundary */
    > > addl $(32768 + 18 + 4095), %ebx
    > > andl $~4095, %ebx =============================> need add heap size too.
    > > ....

    >
    > No, that size is accounted for automatically: the code computes the
    > buffer size needed (including slack) minus the buffer size that is
    > already available (in the embedded gzip-file). The image is moved
    > by this amount (rounded up to a page). So that part is fine.


    Yes, that doesn't need to be changed.

    ....
    > > do we need to move pgtable before _end?

    >
    > I just tried, but it fails: the pgtable is built and enabled in the
    > 32-bit setup code, but the kernel image is moved in the 64-bit part...
    > overwriting the pagetable with zeroes.
    >
    > I can't think of an obvious safe place to put the pagetables, though.
    > One option is to move the image in the 32-bit code and tell the 64-bit
    > part somehow not to do it again... by calling into the 64-bit code at a
    > different place, for example.


    The current pgtable is safe until arch/x86/kernel/head_64.S starts
    using the new pgtable.

    YH
