From sfr at canb.auug.org.au Sat Sep 11 16:32:00 2004
From: sfr at canb.auug.org.au (Stephen Rothwell)
Date: Sat, 11 Sep 2004 16:32:00 +1000
Subject: linuxppc64-dev mailing list
In-Reply-To: <200409110049.i8B0nPJ0013638@supreme.pcug.org.au>
References: <200409110049.i8B0nPJ0013638@supreme.pcug.org.au>
Message-ID: <20040911163200.2ef3cd04.sfr@canb.auug.org.au>

On Sat, 11 Sep 2004 10:49:25 +1000 (EST) Stephen Rothwell wrote:
>
> > From: Hugh Blemings
> >
> > Chat with Anton this morning and Hollis this afternoon I suggest we go
> > ahead and set up at least a temporary linuxppc64-dev mailing list on
> > ozlabs.org
> >
> > Does this sound sane ?
> >
> > If so, Stephen could you oblige ?
>
> I will do this after lunch, OK?

This is now done (as you will all have guessed).

> The trick is how to publicize it ...

Do we have access to the old subscriber list?

Anyone want to volunteer to be list owner. Currently I am it and am more
than willing to stay in this position.

Can we revive the linuxppc.org domain, or at least get it hosted somewhere
(DNS and web)? ozlabs.org is available for this.

-- 
Cheers,
Stephen Rothwell                    sfr at canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040911/4d709008/attachment.pgp

From anton at samba.org Sun Sep 12 12:28:31 2004
From: anton at samba.org (Anton Blanchard)
Date: Sun, 12 Sep 2004 12:28:31 +1000
Subject: DEBUG_INFO
Message-ID: <20040912022831.GG32755@krispykreme>

Hi,

# grep DEBUG_INFO .config
CONFIG_DEBUG_INFO=y

# ls -l vmlinux arch/ppc64/boot/zImage
-rwxr-xr-x 1 anton anton 18428775 Sep 12 11:38 arch/ppc64/boot/zImage
-rwxr-xr-x 1 anton anton 39353167 Sep 12 11:37 vmlinux

We should at least strip the vmlinux we stuff into zImage, an 18MB zImage
is pretty obnoxious and it fails to load over the network on my 270:

BOOTP S = 1
FILE: zImage.congo
Load Addr=0x4000 Max Size=0xbfc000
FINAL File Size = 12566528 bytes.
!20EE000B !

Does anyone with special Makefile powers want to have a go at fixing this?
Bonus points for converting zlib stuff to the generic /lib stuff like Tom
Rini did on ppc32 :)

It looks like yaboot can load a 40MB vmlinux OK, so I'd prefer to leave
the vmlinux unstripped (so people can use gdb against it, addr2line etc).

Anton

From anton at samba.org Fri Sep 10 22:14:56 2004
From: anton at samba.org (Anton Blanchard)
Date: Fri, 10 Sep 2004 22:14:56 +1000
Subject: [PATCH] [ppc64] Remove unused ppc64_calibrate_delay
In-Reply-To: <20040910121238.GG24408@krispykreme>
References: <20040910121238.GG24408@krispykreme>
Message-ID: <20040910121456.GH24408@krispykreme>

- Remove ppc64_calibrate_delay, no longer used
- Formatting fixups

Signed-off-by: Anton Blanchard

diff --exclude-from=/home/anton/dontdiff -urN foobar2/arch/ppc64/kernel/setup.c foobar3/arch/ppc64/kernel/setup.c
--- foobar2/arch/ppc64/kernel/setup.c	2004-09-10 19:42:54.402526581 +1000
+++ foobar3/arch/ppc64/kernel/setup.c	2004-09-10 19:38:55.392966432 +1000
@@ -51,7 +51,6 @@
 extern unsigned long klimit;
 /* extern void *stab; */
 extern HTAB htab_data;
-extern unsigned long loops_per_jiffy;
 
 int have_of = 1;
 
@@ -68,11 +67,11 @@
 		unsigned long r7);
 extern void fw_feature_init(void);
 
-extern void iSeries_init_early( void );
-extern void pSeries_init_early( void );
+extern void iSeries_init_early(void);
+extern void pSeries_init_early(void);
 extern void pSeriesLP_init_early(void);
 extern void pmac_init_early(void);
-extern void mm_init_ppc64( void );
+extern void mm_init_ppc64(void);
 extern void pseries_secondary_smp_init(unsigned long);
 extern int idle_setup(void);
 extern void vpa_init(int cpu);
@@ -263,10 +262,10 @@
 
 #ifdef CONFIG_PPC_ISERIES
 	/* pSeries systems are identified in prom.c via OF. */
-	if ( itLpNaca.xLparInstalled == 1 )
+	if (itLpNaca.xLparInstalled == 1)
 		systemcfg->platform = PLATFORM_ISERIES_LPAR;
 #endif
-	
+
 	switch (systemcfg->platform) {
 #ifdef CONFIG_PPC_ISERIES
 	case PLATFORM_ISERIES_LPAR:
@@ -627,17 +626,6 @@
 
 arch_initcall(ppc_init);
 
-void __init ppc64_calibrate_delay(void)
-{
-	loops_per_jiffy = tb_ticks_per_jiffy;
-
-	printk("Calibrating delay loop... 
%lu.%02lu BogoMips\n", - loops_per_jiffy/(500000/HZ), - loops_per_jiffy/(5000/HZ) % 100); -} - -extern void (*calibrate_delay)(void); - #ifdef CONFIG_IRQSTACKS static void __init irqstack_early_init(void) { @@ -693,7 +681,6 @@ extern int panic_timeout; extern void do_init_bootmem(void); - calibrate_delay = ppc64_calibrate_delay; ppc64_boot_msg(0x12, "Setup Arch"); From anton at samba.org Fri Sep 10 19:11:28 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 10 Sep 2004 19:11:28 +1000 Subject: [PATCH] Enable NUMA API on ppc64 In-Reply-To: <20040910090943.GC24408@krispykreme> References: <20040910090458.GB24408@krispykreme> <20040910090943.GC24408@krispykreme> Message-ID: <20040910091128.GD24408@krispykreme> Plumb the NUMA API syscalls into ppc64. Also add some missing cond_syscalls so we still link with NUMA API disabled. Signed-off-by: Anton Blanchard diff -puN arch/ppc64/kernel/misc.S~numa_api arch/ppc64/kernel/misc.S --- foobar2/arch/ppc64/kernel/misc.S~numa_api 2004-09-10 18:32:34.385768254 +1000 +++ foobar2-anton/arch/ppc64/kernel/misc.S 2004-09-10 18:32:34.420765564 +1000 @@ -860,9 +860,9 @@ _GLOBAL(sys_call_table32) .llong .sys_ni_syscall /* 256 reserved for sys_debug_setcontext */ .llong .sys_ni_syscall /* 257 reserved for vserver */ .llong .sys_ni_syscall /* 258 reserved for new sys_remap_file_pages */ - .llong .sys_ni_syscall /* 259 reserved for new sys_mbind */ - .llong .sys_ni_syscall /* 260 reserved for new sys_get_mempolicy */ - .llong .sys_ni_syscall /* 261 reserved for new sys_set_mempolicy */ + .llong .compat_mbind + .llong .compat_get_mempolicy /* 260 */ + .llong .compat_set_mempolicy .llong .compat_sys_mq_open .llong .sys_mq_unlink .llong .compat_sys_mq_timedsend @@ -1132,9 +1132,9 @@ _GLOBAL(sys_call_table) .llong .sys_ni_syscall /* 256 reserved for sys_debug_setcontext */ .llong .sys_ni_syscall /* 257 reserved for vserver */ .llong .sys_ni_syscall /* 258 reserved for new sys_remap_file_pages */ - .llong .sys_ni_syscall /* 259 reserved 
for new sys_mbind */ - .llong .sys_ni_syscall /* 260 reserved for new sys_get_mempolicy */ - .llong .sys_ni_syscall /* 261 reserved for new sys_set_mempolicy */ + .llong .sys_mbind + .llong .sys_get_mempolicy /* 260 */ + .llong .sys_set_mempolicy .llong .sys_mq_open .llong .sys_mq_unlink .llong .sys_mq_timedsend diff -puN arch/ppc64/mm/numa.c~numa_api arch/ppc64/mm/numa.c diff -puN include/asm-ppc64/unistd.h~numa_api include/asm-ppc64/unistd.h --- foobar2/include/asm-ppc64/unistd.h~numa_api 2004-09-10 18:32:34.397767332 +1000 +++ foobar2-anton/include/asm-ppc64/unistd.h 2004-09-10 18:32:34.423765333 +1000 @@ -269,9 +269,9 @@ /* Number 256 is reserved for sys_debug_setcontext */ /* Number 257 is reserved for vserver */ /* Number 258 is reserved for new sys_remap_file_pages */ -/* Number 259 is reserved for new sys_mbind */ -/* Number 260 is reserved for new sys_get_mempolicy */ -/* Number 261 is reserved for new sys_set_mempolicy */ +#define __NR_mbind 259 +#define __NR_get_mempolicy 260 +#define __NR_set_mempolicy 261 #define __NR_mq_open 262 #define __NR_mq_unlink 263 #define __NR_mq_timedsend 264 diff -puN kernel/sys.c~fix_numa_api kernel/sys.c --- foobar2/kernel/sys.c~fix_numa_api 2004-09-10 18:59:26.757155478 +1000 +++ foobar2-anton/kernel/sys.c 2004-09-10 19:00:16.455837772 +1000 @@ -274,7 +274,9 @@ cond_syscall(compat_sys_mq_getsetattr) cond_syscall(sys_mbind) cond_syscall(sys_get_mempolicy) cond_syscall(sys_set_mempolicy) +cond_syscall(compat_mbind) cond_syscall(compat_get_mempolicy) +cond_syscall(compat_set_mempolicy) /* arch-specific weak syscall entries */ cond_syscall(sys_pciconfig_read) _ From anton at samba.org Fri Sep 10 19:09:43 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 10 Sep 2004 19:09:43 +1000 Subject: [PATCH] RTAS error logs can appear twice in dmesg In-Reply-To: <20040910090458.GB24408@krispykreme> References: <20040910090458.GB24408@krispykreme> Message-ID: <20040910090943.GC24408@krispykreme> Ive started seeing rtas errors 
printed twice. Remove the second call to printk_log_rtas.

Signed-off-by: Anton Blanchard

===== rtasd.c 1.30 vs edited =====
--- 1.30/arch/ppc64/kernel/rtasd.c	Fri Sep  3 19:08:18 2004
+++ edited/rtasd.c	Fri Sep 10 17:09:57 2004
@@ -216,12 +216,13 @@
 	if (!no_more_logging && !(err_type & ERR_FLAG_BOOT))
 		nvram_write_error_log(buf, len, err_type);
 
-	/* rtas errors can occur during boot, and we do want to capture
+	/*
+	 * rtas errors can occur during boot, and we do want to capture
 	 * those somewhere, even if nvram isn't ready (why not?), and even
-	 * if rtasd isn't ready. Put them into the boot log, at least. */
-	if ((err_type & ERR_TYPE_MASK) == ERR_TYPE_RTAS_LOG) {
+	 * if rtasd isn't ready. Put them into the boot log, at least.
+	 */
+	if ((err_type & ERR_TYPE_MASK) == ERR_TYPE_RTAS_LOG)
 		printk_log_rtas(buf, len);
-	}
 
 	/* Check to see if we need to or have stopped logging */
 	if (fatal || no_more_logging) {
@@ -233,9 +234,6 @@
 	/* call type specific method for error */
 	switch (err_type & ERR_TYPE_MASK) {
 	case ERR_TYPE_RTAS_LOG:
-		/* put into syslog and error_log file */
-		printk_log_rtas(buf, len);
-
 		offset = rtas_error_log_buffer_max *
 			((rtas_log_start+rtas_log_size) & LOG_NUMBER_MASK);
 

From anton at samba.org Fri Sep 10 22:23:37 2004
From: anton at samba.org (Anton Blanchard)
Date: Fri, 10 Sep 2004 22:23:37 +1000
Subject: [PATCH] [ppc64] Use early_param
In-Reply-To: <20040910121941.GI24408@krispykreme>
References: <20040910121238.GG24408@krispykreme>
	<20040910121456.GH24408@krispykreme>
	<20040910121941.GI24408@krispykreme>
Message-ID: <20040910122337.GJ24408@krispykreme>

Make use of Rusty's early_param code. It's good stuff. We appear to be
the first user :)

Move vpa_init and idle_setup later in boot, we don't have to do them
right up front in setup_system.

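[Editor's sketch, not part of the patch] For readers unfamiliar with the idea behind early_param, the following userspace C approximation may help. It assumes a simple table of name/handler pairs that is walked once over the command line; the table layout and the function parse_early_params_sketch are hypothetical and are not the kernel's actual implementation. Only the early_numa handler body mirrors what the patch itself adds to numa.c.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for one registered early parameter. */
struct early_param_entry {
	const char *name;
	int (*handler)(char *value);
};

static int numa_enabled = 1;
static int numa_debug;

/* Same logic as the early_numa() handler added by the patch. */
static int early_numa(char *p)
{
	if (!p)
		return 0;

	if (strstr(p, "off"))
		numa_enabled = 0;

	if (strstr(p, "debug"))
		numa_debug = 1;

	return 0;
}

/* Hypothetical registration table; the kernel builds its own differently. */
static const struct early_param_entry early_params[] = {
	{ "numa", early_numa },
};

/*
 * Split a copy of the command line on spaces, split each token at the
 * first '=', and call the handler whose name matches. Options without
 * a value are passed NULL.
 */
void parse_early_params_sketch(const char *cmd_line)
{
	char buf[256];
	char *tok;
	size_t i;

	strncpy(buf, cmd_line, sizeof(buf) - 1);
	buf[sizeof(buf) - 1] = '\0';

	for (tok = strtok(buf, " "); tok; tok = strtok(NULL, " ")) {
		char *val = strchr(tok, '=');

		if (val)
			*val++ = '\0';

		for (i = 0; i < sizeof(early_params) / sizeof(early_params[0]); i++)
			if (strcmp(tok, early_params[i].name) == 0)
				early_params[i].handler(val);
	}
}
```

Under these assumptions, calling parse_early_params_sketch("root=/dev/sda2 numa=off,debug") leaves numa_enabled at 0 and numa_debug at 1. The point of the design is that each subsystem registers its own handler instead of running strstr() over the whole saved command line.
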
Signed-off-by: Anton Blanchard diff --exclude-from=/home/anton/dontdiff -urN foobar2/arch/ppc64/kernel/setup.c foobar3/arch/ppc64/kernel/setup.c --- foobar2/arch/ppc64/kernel/setup.c 2004-09-10 19:52:08.296273933 +1000 +++ foobar3/arch/ppc64/kernel/setup.c 2004-09-10 19:50:06.856308336 +1000 @@ -244,7 +244,21 @@ systemcfg->processorCount = num_present_cpus(); } + #endif /* !defined(CONFIG_PPC_ISERIES) && defined(CONFIG_SMP) */ + +#ifdef CONFIG_XMON +static int __init early_xmon(char *p) +{ + /* ensure xmon is enabled */ + xmon_init(); + debugger(0); + + return 0; +} +early_param("xmon", early_xmon); +#endif + /* * Do some initial setup of the system. The parameters are those which * were passed in from the bootloader. @@ -256,10 +270,6 @@ int ret, i; #endif -#ifdef CONFIG_XMON_DEFAULT - xmon_init(); -#endif - #ifdef CONFIG_PPC_ISERIES /* pSeries systems are identified in prom.c via OF. */ if (itLpNaca.xLparInstalled == 1) @@ -290,6 +300,9 @@ #endif /* CONFIG_PPC_PMAC */ } +#ifdef CONFIG_XMON_DEFAULT + xmon_init(); +#endif /* If we were passed an initrd, set the ROOT_DEV properly if the values * look sensible. If not, clear initrd reference. */ @@ -330,6 +343,11 @@ iSeries_parse_cmdline(); #endif + /* Save unparsed command line copy for /proc/cmdline */ + strlcpy(saved_command_line, cmd_line, COMMAND_LINE_SIZE); + + parse_early_param(); + #ifdef CONFIG_SMP #ifndef CONFIG_PPC_ISERIES /* @@ -351,6 +369,10 @@ i); } } + + if (cur_cpu_spec->firmware_features & FW_FEATURE_SPLPAR) + vpa_init(boot_cpuid); + #endif /* CONFIG_PPC_PSERIES */ #endif /* CONFIG_SMP */ @@ -380,15 +402,6 @@ printk("-----------------------------------------------------\n"); mm_init_ppc64(); - -#if defined(CONFIG_SMP) && defined(CONFIG_PPC_PSERIES) - if (cur_cpu_spec->firmware_features & FW_FEATURE_SPLPAR) { - vpa_init(boot_cpuid); - } -#endif - - /* Select the correct idle loop for the platform. 
*/ - idle_setup(); } void machine_restart(char *cmd) @@ -512,30 +525,20 @@ .show = show_cpuinfo, }; -#endif +#if 0 /* XXX not currently used */ +unsigned long memory_limit; - /* Look for mem= option on command line */ - if (strstr(cmd_line, "mem=")) { - char *p, *q; - unsigned long maxmem = 0; - extern unsigned long __max_memory; - - for (q = cmd_line; (p = strstr(q, "mem=")) != 0; ) { - q = p + 4; - if (p > cmd_line && p[-1] != ' ') - continue; - maxmem = simple_strtoul(q, &q, 0); - if (*q == 'k' || *q == 'K') { - maxmem <<= 10; - ++q; - } else if (*q == 'm' || *q == 'M') { - maxmem <<= 20; - ++q; - } - } - __max_memory = maxmem; - } +static int __init early_parsemem(char *p) +{ + if (!p) + return 0; + + memory_limit = memparse(p, &p); + + return 0; } +early_param("mem", early_parsemem); +#endif #ifdef CONFIG_PPC_PSERIES static int __init set_preferred_console(void) @@ -681,16 +684,10 @@ extern int panic_timeout; extern void do_init_bootmem(void); - ppc64_boot_msg(0x12, "Setup Arch"); -#ifdef CONFIG_XMON - if (strstr(cmd_line, "xmon")) { - /* ensure xmon is enabled */ - xmon_init(); - debugger(0); - } -#endif /* CONFIG_XMON */ + *cmdline_p = cmd_line; + /* * Set cache line size based on type of cpu as a default. @@ -711,16 +708,15 @@ init_mm.end_data = (unsigned long) _edata; init_mm.brk = klimit; - /* Save unparsed command line copy for /proc/cmdline */ - strlcpy(saved_command_line, cmd_line, COMMAND_LINE_SIZE); - *cmdline_p = cmd_line; - irqstack_early_init(); emergency_stack_init(); /* set up the bootmem stuff with available memory */ do_init_bootmem(); + /* Select the correct idle loop for the platform. 
*/
+	idle_setup();
+
 	ppc_md.setup_arch();
 	paging_init();
 
diff --exclude-from=/home/anton/dontdiff -urN foobar2/arch/ppc64/mm/numa.c foobar3/arch/ppc64/mm/numa.c
--- foobar2/arch/ppc64/mm/numa.c	2004-09-10 19:52:05.108989721 +1000
+++ foobar3/arch/ppc64/mm/numa.c	2004-09-10 19:46:40.576232848 +1000
@@ -18,6 +18,8 @@
 #include
 #include
 
+static int numa_enabled = 1;
+
 static int numa_debug;
 #define dbg(args...) if (numa_debug) { printk(KERN_INFO args); }
 
@@ -189,10 +191,7 @@
 	long entries = lmb_end_of_DRAM() >> MEMORY_INCREMENT_SHIFT;
 	unsigned long i;
 
-	if (strstr(saved_command_line, "numa=debug"))
-		numa_debug = 1;
-
-	if (strstr(saved_command_line, "numa=off")) {
+	if (numa_enabled == 0) {
 		printk(KERN_WARNING "NUMA disabled by user\n");
 		return -1;
 	}
@@ -587,3 +586,18 @@
 				start_pfn, zholes_size);
 	}
 }
+
+static int __init early_numa(char *p)
+{
+	if (!p)
+		return 0;
+
+	if (strstr(p, "off"))
+		numa_enabled = 0;
+
+	if (strstr(p, "debug"))
+		numa_debug = 1;
+
+	return 0;
+}
+early_param("numa", early_numa);

From anton at samba.org Fri Sep 10 22:32:09 2004
From: anton at samba.org (Anton Blanchard)
Date: Fri, 10 Sep 2004 22:32:09 +1000
Subject: [PATCH] [ppc64] Enable POWER5 low power mode in idle loop
In-Reply-To: <20040910122904.GK24408@krispykreme>
References: <20040910121238.GG24408@krispykreme>
	<20040910121456.GH24408@krispykreme>
	<20040910121941.GI24408@krispykreme>
	<20040910122337.GJ24408@krispykreme>
	<20040910122904.GK24408@krispykreme>
Message-ID: <20040910123209.GL24408@krispykreme>

Now that we understand (and have fixed) the problem with using low power
mode in the idle loop, let's enable it. It should save a fair amount of
power.

(The problem was that our exceptions were inheriting the low power mode
and so were executing at a fraction of the normal cpu issue rate. We fixed
it by always bumping our priority to medium at the start of every
exception).

Signed-off-by: Anton Blanchard

diff -puN arch/ppc64/kernel/idle.c~enable_r31_in_idle arch/ppc64/kernel/idle.c
--- foobar2/arch/ppc64/kernel/idle.c~enable_r31_in_idle	2004-09-10 20:58:19.402799782 +1000
+++ foobar2-anton/arch/ppc64/kernel/idle.c	2004-09-10 20:58:19.423798168 +1000
@@ -142,7 +142,12 @@ int default_idle(void)
 
 			while (!need_resched() && !cpu_is_offline(cpu)) {
 				barrier();
+				/*
+				 * Go into low thread priority and possibly
+				 * low power mode.
+				 */
 				HMT_low();
+				HMT_very_low();
 			}
 
 			HMT_medium();
@@ -184,18 +189,18 @@ int dedicated_idle(void)
 			start_snooze = __get_tb() +
 				*smt_snooze_delay * tb_ticks_per_usec;
 			while (!need_resched() && !cpu_is_offline(cpu)) {
-				/* need_resched could be 1 or 0 at this
-				 * point. If it is 0, set it to 0, so
-				 * an IPI/Prod is sent. If it is 1, keep
-				 * it that way & schedule work.
+				/*
+				 * Go into low thread priority and possibly
+				 * low power mode.
 				 */
+				HMT_low();
+				HMT_very_low();
+
 				if (*smt_snooze_delay == 0 ||
-				    __get_tb() < start_snooze) {
-					HMT_low(); /* Low thread priority */
+				    __get_tb() < start_snooze)
 					continue;
-				}
 
-				HMT_very_low(); /* Low power mode */
+				HMT_medium();
 
 				if (!(ppaca->lppaca.xIdle)) {
 					/* Indicate we are no longer polling for
@@ -210,7 +215,6 @@ int dedicated_idle(void)
 					break;
 				}
 
-				/* DRENG: Go HMT_medium here ? */
 				local_irq_disable();
 
 				/* SMT dynamic mode. Cede will result
_

From anton at samba.org Fri Sep 10 19:13:42 2004
From: anton at samba.org (Anton Blanchard)
Date: Fri, 10 Sep 2004 19:13:42 +1000
Subject: [PATCH] [ppc64] Give the kernel an OPD section
In-Reply-To: <20040910091128.GD24408@krispykreme>
References: <20040910090458.GB24408@krispykreme>
	<20040910090943.GC24408@krispykreme>
	<20040910091128.GD24408@krispykreme>
Message-ID: <20040910091342.GE24408@krispykreme>

From: Alan Modra

Give the kernel an OPD section, required for recent ppc64 toolchains.

Signed-off-by: Anton Blanchard

diff -puN arch/ppc64/kernel/vmlinux.lds.S~kernel-opd arch/ppc64/kernel/vmlinux.lds.S
--- gr_work/arch/ppc64/kernel/vmlinux.lds.S~kernel-opd	2004-09-04 21:14:22.123514698 -0500
+++ gr_work-anton/arch/ppc64/kernel/vmlinux.lds.S	2004-09-04 21:14:22.133513110 -0500
@@ -117,10 +117,13 @@ SECTIONS
 
 	.data : {
 		*(.data .data.rel* .toc1)
-		*(.opd)
 		*(.branch_lt)
 	}
 
+	.opd : {
+		*(.opd)
+	}
+
 	.got : {
 		__toc_start = .;
 		*(.got)
_

From anton at samba.org Fri Sep 10 22:19:41 2004
From: anton at samba.org (Anton Blanchard)
Date: Fri, 10 Sep 2004 22:19:41 +1000
Subject: [PATCH] [ppc64] Remove EEH command line device matching code
In-Reply-To: <20040910121456.GH24408@krispykreme>
References: <20040910121238.GG24408@krispykreme>
	<20040910121456.GH24408@krispykreme>
Message-ID: <20040910121941.GI24408@krispykreme>

We have had reports of people attempting to disable EEH on POWER5 boxes.
This is not supported, and the device will most likely not respond to
config space reads/writes. Remove the IBM location matching code that was
being used to disable devices as well as the global option.

We already have the ability to ignore EEH errors via the panic_on_oops
sysctl option; advanced users should make use of that instead.

Signed-off-by: Anton Blanchard diff --exclude-from=/home/anton/dontdiff -urN foobar2/arch/ppc64/kernel/eeh.c foobar3/arch/ppc64/kernel/eeh.c --- foobar2/arch/ppc64/kernel/eeh.c 2004-09-10 19:41:20.500660954 +1000 +++ foobar3/arch/ppc64/kernel/eeh.c 2004-09-10 19:41:15.745368932 +1000 @@ -48,9 +48,6 @@ static int ibm_slot_error_detail; static int eeh_subsystem_enabled; -#define EEH_MAX_OPTS 4096 -static char *eeh_opts; -static int eeh_opts_last; /* Buffer for reporting slot-error-detail rtas calls */ static unsigned char slot_errbuf[RTAS_ERROR_LOG_MAX]; @@ -62,10 +59,6 @@ static DEFINE_PER_CPU(unsigned long, false_positives); static DEFINE_PER_CPU(unsigned long, ignored_failures); -static int eeh_check_opts_config(struct device_node *dn, int class_code, - int vendor_id, int device_id, - int default_state); - /** * The pci address cache subsystem. This subsystem places * PCI device address resources into a red-black tree, sorted @@ -497,7 +490,6 @@ struct eeh_early_enable_info { unsigned int buid_hi; unsigned int buid_lo; - int force_off; }; /* Enable eeh for the given device node. */ @@ -539,18 +531,8 @@ if ((*class_code >> 16) == PCI_BASE_CLASS_DISPLAY) enable = 0; - if (!eeh_check_opts_config(dn, *class_code, *vendor_id, *device_id, - enable)) { - if (enable) { - printk(KERN_WARNING "EEH: %s user requested to run " - "without EEH checking.\n", dn->full_name); - enable = 0; - } - } - - if (!enable || info->force_off) { + if (!enable) dn->eeh_mode |= EEH_MODE_NOCHECK; - } /* Ok... see if this device supports EEH. Some do, some don't, * and the only way to find out is to check each and every one. 
*/ @@ -604,15 +586,12 @@ { struct device_node *phb, *np; struct eeh_early_enable_info info; - char *eeh_force_off = strstr(saved_command_line, "eeh-force-off"); init_pci_config_tokens(); np = of_find_node_by_path("/rtas"); - if (np == NULL) { - printk(KERN_WARNING "EEH: RTAS not found !\n"); + if (np == NULL) return; - } ibm_set_eeh_option = rtas_token("ibm,set-eeh-option"); ibm_set_slot_reset = rtas_token("ibm,set-slot-reset"); @@ -632,13 +611,6 @@ eeh_error_buf_size = RTAS_ERROR_LOG_MAX; } - info.force_off = 0; - if (eeh_force_off) { - printk(KERN_WARNING "EEH: WARNING: PCI Enhanced I/O Error " - "Handling is user disabled\n"); - info.force_off = 1; - } - /* Enable EEH for all adapters. Note that eeh requires buid's */ for (phb = of_find_node_by_name(NULL, "pci"); phb; phb = of_find_node_by_name(phb, "pci")) { @@ -653,11 +625,10 @@ traverse_pci_devices(phb, early_enable_eeh, &info); } - if (eeh_subsystem_enabled) { + if (eeh_subsystem_enabled) printk(KERN_INFO "EEH: PCI Enhanced I/O Error Handling Enabled\n"); - } else { - printk(KERN_WARNING "EEH: disabled PCI Enhanced I/O Error Handling\n"); - } + else + printk(KERN_WARNING "EEH: No capable adapters found\n"); } /** @@ -816,129 +787,3 @@ return 0; } __initcall(eeh_init_proc); - -/* - * Test if "dev" should be configured on or off. - * This processes the options literally from left to right. - * This lets the user specify stupid combinations of options, - * but at least the result should be very predictable. 
- */ -static int eeh_check_opts_config(struct device_node *dn, - int class_code, int vendor_id, int device_id, - int default_state) -{ - char devname[32], classname[32]; - char *strs[8], *s; - int nstrs, i; - int ret = default_state; - - /* Build list of strings to match */ - nstrs = 0; - s = (char *)get_property(dn, "ibm,loc-code", NULL); - if (s) - strs[nstrs++] = s; - sprintf(devname, "dev%04x:%04x", vendor_id, device_id); - strs[nstrs++] = devname; - sprintf(classname, "class%04x", class_code); - strs[nstrs++] = classname; - strs[nstrs++] = ""; /* yes, this matches the empty string */ - - /* - * Now see if any string matches the eeh_opts list. - * The eeh_opts list entries start with + or -. - */ - for (s = eeh_opts; s && (s < (eeh_opts + eeh_opts_last)); - s += strlen(s)+1) { - for (i = 0; i < nstrs; i++) { - if (strcasecmp(strs[i], s+1) == 0) { - ret = (strs[i][0] == '+') ? 1 : 0; - } - } - } - return ret; -} - -/* - * Handle kernel eeh-on & eeh-off cmd line options for eeh. - * - * We support: - * eeh-off=loc1,loc2,loc3... - * - * and this option can be repeated so - * eeh-off=loc1,loc2 eeh-off=loc3 - * is the same as eeh-off=loc1,loc2,loc3 - * - * loc is an IBM location code that can be found in a manual or - * via openfirmware (or the Hardware Management Console). - * - * We also support these additional "loc" values: - * - * dev#:# vendor:device id in hex (e.g. dev1022:2000) - * class# class id in hex (e.g. class0200) - * - * If no location code is specified all devices are assumed - * so eeh-off means eeh by default is off. - */ - -/* - * This is implemented as a null separated list of strings. - * Each string looks like this: "+X" or "-X" - * where X is a loc code, vendor:device, class (as shown above) - * or empty which is used to indicate all. - * - * We interpret this option string list so that it will literally - * behave left-to-right even if some combinations don't make sense. 
- */ -static int __init eeh_parm(char *str, int state) -{ - char *s, *cur, *curend; - - if (!eeh_opts) { - eeh_opts = alloc_bootmem(EEH_MAX_OPTS); - eeh_opts[eeh_opts_last++] = '+'; /* default */ - eeh_opts[eeh_opts_last++] = '\0'; - } - if (*str == '\0') { - eeh_opts[eeh_opts_last++] = state ? '+' : '-'; - eeh_opts[eeh_opts_last++] = '\0'; - return 1; - } - if (*str == '=') - str++; - for (s = str; s && *s != '\0'; s = curend) { - cur = s; - /* ignore empties. Don't treat as "all-on" or "all-off" */ - while (*cur == ',') - cur++; - curend = strchr(cur, ','); - if (!curend) - curend = cur + strlen(cur); - if (*cur) { - int curlen = curend-cur; - if (eeh_opts_last + curlen > EEH_MAX_OPTS-2) { - printk(KERN_WARNING "EEH: sorry...too many " - "eeh cmd line options\n"); - return 1; - } - eeh_opts[eeh_opts_last++] = state ? '+' : '-'; - strncpy(eeh_opts+eeh_opts_last, cur, curlen); - eeh_opts_last += curlen; - eeh_opts[eeh_opts_last++] = '\0'; - } - } - - return 1; -} - -static int __init eehoff_parm(char *str) -{ - return eeh_parm(str, 0); -} - -static int __init eehon_parm(char *str) -{ - return eeh_parm(str, 1); -} - -__setup("eeh-off", eehoff_parm); -__setup("eeh-on", eehon_parm); From anton at samba.org Fri Sep 10 22:12:38 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 10 Sep 2004 22:12:38 +1000 Subject: [PATCH] [ppc64] Clean up kernel command line code Message-ID: <20040910121238.GG24408@krispykreme> Clean up some of our command line code: - We were copying the command line out of the device tree twice, but the first time we forgot to add CONFIG_CMDLINE. Fix this and remove the second copy. - The command line birec code ran after we had done some command line parsing in prom.c. This had the opportunity to really confuse the user, with some options being parsed out of the device tree and the other out of birecs. Luckily we could find no user of the command line birecs, so remove them. 
- remove duplicate printing of kernel command line; - clean up iseries inits and create an iSeries_parse_cmdline. Signed-off-by: Anton Blanchard diff --exclude-from=/home/anton/dontdiff -urN foobar2/arch/ppc64/kernel/chrp_setup.c foobar3/arch/ppc64/kernel/chrp_setup.c --- foobar2/arch/ppc64/kernel/chrp_setup.c 2004-09-10 19:33:10.910718416 +1000 +++ foobar3/arch/ppc64/kernel/chrp_setup.c 2004-09-10 19:29:44.431528121 +1000 @@ -140,8 +140,6 @@ ROOT_DEV = Root_SDA2; } - printk("Boot arguments: %s\n", cmd_line); - fwnmi_init(); #ifndef CONFIG_PPC_ISERIES diff --exclude-from=/home/anton/dontdiff -urN foobar2/arch/ppc64/kernel/iSeries_setup.c foobar3/arch/ppc64/kernel/iSeries_setup.c --- foobar2/arch/ppc64/kernel/iSeries_setup.c 2004-09-10 19:33:10.918717801 +1000 +++ foobar3/arch/ppc64/kernel/iSeries_setup.c 2004-09-10 19:30:32.042107395 +1000 @@ -333,32 +333,31 @@ #endif if (itLpNaca.xPirEnvironMode == 0) piranha_simulator = 1; + + /* Associate Lp Event Queue 0 with processor 0 */ + HvCallEvent_setLpEventQueueInterruptProc(0, 0); + + mf_init(); + mf_initialized = 1; + mb(); } -void __init iSeries_init(unsigned long r3, unsigned long r4, unsigned long r5, - unsigned long r6, unsigned long r7) +void __init iSeries_parse_cmdline(void) { char *p, *q; - /* Associate Lp Event Queue 0 with processor 0 */ - HvCallEvent_setLpEventQueueInterruptProc(0, 0); - /* copy the command line parameter from the primary VSP */ HvCallEvent_dmaToSp(cmd_line, 2 * 64* 1024, 256, HvLpDma_Direction_RemoteToLocal); p = cmd_line; q = cmd_line + 255; - while( p < q ) { + while(p < q) { if (!*p || *p == '\n') break; ++p; } *p = 0; - - mf_init(); - mf_initialized = 1; - mb(); } /* diff --exclude-from=/home/anton/dontdiff -urN foobar2/arch/ppc64/kernel/prom.c foobar3/arch/ppc64/kernel/prom.c --- foobar2/arch/ppc64/kernel/prom.c 2004-09-10 19:33:10.928717033 +1000 +++ foobar3/arch/ppc64/kernel/prom.c 2004-09-10 19:35:32.146053659 +1000 @@ -1707,6 +1707,9 @@ } RELOC(cmd_line[0]) = 0; +#ifdef 
CONFIG_CMDLINE + strlcpy(RELOC(cmd_line), CONFIG_CMDLINE, sizeof(cmd_line)); +#endif /* CONFIG_CMDLINE */ if ((long)_prom->chosen > 0) { prom_getprop(_prom->chosen, "bootargs", p, sizeof(cmd_line)); if (p != NULL && p[0] != 0) diff --exclude-from=/home/anton/dontdiff -urN foobar2/arch/ppc64/kernel/setup.c foobar3/arch/ppc64/kernel/setup.c --- foobar2/arch/ppc64/kernel/setup.c 2004-09-10 19:33:10.934716571 +1000 +++ foobar3/arch/ppc64/kernel/setup.c 2004-09-10 19:32:10.785258973 +1000 @@ -68,7 +68,6 @@ unsigned long r7); extern void fw_feature_init(void); -extern void iSeries_init( void ); extern void iSeries_init_early( void ); extern void pSeries_init_early( void ); extern void pSeriesLP_init_early(void); @@ -77,6 +76,7 @@ extern void pseries_secondary_smp_init(unsigned long); extern int idle_setup(void); extern void vpa_init(int cpu); +extern void iSeries_parse_cmdline(void); unsigned long decr_overclock = 1; unsigned long decr_overclock_proc0 = 1; @@ -87,10 +87,6 @@ unsigned char aux_device_present; -void parse_cmd_line(unsigned long r3, unsigned long r4, unsigned long r5, - unsigned long r6, unsigned long r7); -int parse_bootinfo(void); - #ifdef CONFIG_MAGIC_SYSRQ unsigned long SYSRQ_KEY; #endif /* CONFIG_MAGIC_SYSRQ */ @@ -282,19 +278,16 @@ case PLATFORM_PSERIES: fw_feature_init(); pSeries_init_early(); - parse_bootinfo(); break; case PLATFORM_PSERIES_LPAR: fw_feature_init(); pSeriesLP_init_early(); - parse_bootinfo(); break; #endif /* CONFIG_PPC_PSERIES */ #ifdef CONFIG_PPC_PMAC case PLATFORM_POWERMAC: pmac_init_early(); - parse_bootinfo(); #endif /* CONFIG_PPC_PMAC */ } @@ -334,6 +327,10 @@ } #endif /* CONFIG_PPC_PSERIES */ +#ifdef CONFIG_PPC_ISERIES + iSeries_parse_cmdline(); +#endif + #ifdef CONFIG_SMP #ifndef CONFIG_PPC_ISERIES /* @@ -393,18 +390,6 @@ /* Select the correct idle loop for the platform. 
*/ idle_setup(); - - switch (systemcfg->platform) { -#ifdef CONFIG_PPC_ISERIES - case PLATFORM_ISERIES_LPAR: - iSeries_init(); - break; -#endif - default: - /* The following relies on the device tree being */ - /* fully configured. */ - parse_cmd_line(r3, r4, r5, r6, r7); - } } void machine_restart(char *cmd) @@ -528,31 +513,6 @@ .show = show_cpuinfo, }; -/* - * Fetch the cmd_line from open firmware. - */ -void parse_cmd_line(unsigned long r3, unsigned long r4, unsigned long r5, - unsigned long r6, unsigned long r7) -{ - cmd_line[0] = 0; - -#ifdef CONFIG_CMDLINE - strlcpy(cmd_line, CONFIG_CMDLINE, sizeof(cmd_line)); -#endif /* CONFIG_CMDLINE */ - -#ifdef CONFIG_PPC_PSERIES - { - struct device_node *chosen; - - chosen = of_find_node_by_name(NULL, "chosen"); - if (chosen != NULL) { - char *p; - p = get_property(chosen, "bootargs", NULL); - if (p != NULL && p[0] != 0) - strlcpy(cmd_line, p, sizeof(cmd_line)); - of_node_put(chosen); - } - } #endif /* Look for mem= option on command line */ @@ -652,26 +612,6 @@ } console_initcall(set_preferred_console); - -int parse_bootinfo(void) -{ - struct bi_record *rec; - - rec = prom.bi_recs; - - if ( rec == NULL || rec->tag != BI_FIRST ) - return -1; - - for ( ; rec->tag != BI_LAST ; rec = bi_rec_next(rec) ) { - switch (rec->tag) { - case BI_CMD_LINE: - strlcpy(cmd_line, (void *)rec->data, sizeof(cmd_line)); - break; - } - } - - return 0; -} #endif int __init ppc_init(void) From anton at samba.org Fri Sep 10 19:04:58 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 10 Sep 2004 19:04:58 +1000 Subject: [PATCH] remove SPINLINE config option Message-ID: <20040910090458.GB24408@krispykreme> After the spinlock rework, CONFIG_SPINLINE doesnt work and causes a compile error. Remove it for now. 
Anton

Signed-off-by: Anton Blanchard

diff -puN arch/ppc64/lib/locks.c~remove_spinline arch/ppc64/lib/locks.c
--- foobar2/arch/ppc64/lib/locks.c~remove_spinline	2004-09-10 18:10:07.925966698 +1000
+++ foobar2-anton/arch/ppc64/lib/locks.c	2004-09-10 18:10:23.249023185 +1000
@@ -20,8 +20,6 @@
 #include
 #include
 
-#ifndef CONFIG_SPINLINE
-
 /* waiting for a spinlock... */
 #if defined(CONFIG_PPC_SPLPAR) || defined(CONFIG_PPC_ISERIES)
 
@@ -95,5 +93,3 @@ void spin_unlock_wait(spinlock_t *lock)
 }
 
 EXPORT_SYMBOL(spin_unlock_wait);
-
-#endif /* CONFIG_SPINLINE */
diff -puN ./arch/ppc64/Kconfig.debug~remove_spinline ./arch/ppc64/Kconfig.debug
--- foobar2/./arch/ppc64/Kconfig.debug~remove_spinline	2004-09-10 18:10:33.861789115 +1000
+++ foobar2-anton/./arch/ppc64/Kconfig.debug	2004-09-10 18:10:42.108471315 +1000
@@ -44,16 +44,6 @@ config IRQSTACKS
 	  for handling hard and soft interrupts. This can help avoid
 	  overflowing the process kernel stacks.
 
-config SPINLINE
-	bool "Inline spinlock code at each call site"
-	depends on SMP && !PPC_SPLPAR && !PPC_ISERIES
-	help
-	  Say Y if you want to have the code for acquiring spinlocks
-	  and rwlocks inlined at each call site. This makes the kernel
-	  somewhat bigger, but can be useful when profiling the kernel.
-
-	  If in doubt, say N.
-
 config SCHEDSTATS
 	bool "Collect scheduler statistics"
 	depends on DEBUG_KERNEL && PROC_FS
_

From anton at samba.org Fri Sep 10 22:36:49 2004
From: anton at samba.org (Anton Blanchard)
Date: Fri, 10 Sep 2004 22:36:49 +1000
Subject: [PATCH] [ppc64] Clean up idle loop code
In-Reply-To: <20040910123209.GL24408@krispykreme>
References: <20040910121238.GG24408@krispykreme> <20040910121456.GH24408@krispykreme> <20040910121941.GI24408@krispykreme> <20040910122337.GJ24408@krispykreme> <20040910122904.GK24408@krispykreme> <20040910123209.GL24408@krispykreme>
Message-ID: <20040910123649.GM24408@krispykreme>

Clean up our idle loop code:

- Remove a bunch of useless includes and make most functions static

- There were places where we weren't disabling interrupts before
  checking need_resched and then calling the hypervisor to sleep our
  thread. We might race with an IPI and end up missing a reschedule.
  Disable interrupts around these regions to make them safe.

- We forgot to turn off the polling flag when exiting the
  dedicated_idle idle loop. This could have resulted in all manner of
  problems, as other cpus would avoid sending IPIs to force reschedules.

- Add a missing check for cpu_is_offline in the shared cpu idle loop.
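
The second point above can be sketched in plain C. This is only a single-threaded userspace model of the ordering, not kernel code: struct sim_cpu, sim_cede() and sim_idle_once() are hypothetical names, and the flags stand in for local_irq_disable()/local_irq_enable(), need_resched() and cede_processor(). It cannot reproduce the race itself; it only shows the "disable interrupts, re-check, then sleep" sequence that the patch uses.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-cpu state; not the kernel's types. */
struct sim_cpu {
	bool irqs_enabled;	/* models local_irq_disable()/local_irq_enable() */
	bool need_resched;	/* models the TIF_NEED_RESCHED thread flag */
	int cedes;		/* how many times we went to sleep in the "hypervisor" */
};

/* Models cede_processor(): returning from the cede re-enables interrupts. */
static void sim_cede(struct sim_cpu *c)
{
	c->cedes++;
	c->irqs_enabled = true;
}

/*
 * One pass through the sleep path. The need_resched check happens with
 * interrupts disabled, so (in the real kernel) an IPI cannot set the flag
 * between the check and the cede. Returns true if we ceded.
 */
static bool sim_idle_once(struct sim_cpu *c)
{
	assert(c);

	c->irqs_enabled = false;	/* local_irq_disable() */

	if (!c->need_resched) {
		sim_cede(c);		/* sleep; wakes with interrupts on */
		return true;
	}

	c->irqs_enabled = true;		/* local_irq_enable(), go reschedule */
	return false;
}
```

Either way the function returns with interrupts enabled, which matches the "if (!need_resched()) cede_processor(); else local_irq_enable();" shape in the diff.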
Signed-off-by: Anton Blanchard diff -puN arch/ppc64/kernel/idle.c~cleanup_idle arch/ppc64/kernel/idle.c --- foobar2/arch/ppc64/kernel/idle.c~cleanup_idle 2004-09-10 21:56:13.748876543 +1000 +++ foobar2-anton/arch/ppc64/kernel/idle.c 2004-09-10 22:02:54.115249090 +1000 @@ -16,28 +16,16 @@ */ #include -#include #include #include -#include #include -#include -#include -#include -#include -#include #include -#include -#include #include -#include #include #include -#include #include #include -#include #include #include @@ -45,11 +33,11 @@ extern long cede_processor(void); extern long poll_pending(void); extern void power4_idle(void); -int (*idle_loop)(void); +static int (*idle_loop)(void); #ifdef CONFIG_PPC_ISERIES -unsigned long maxYieldTime = 0; -unsigned long minYieldTime = 0xffffffffffffffffUL; +static unsigned long maxYieldTime = 0; +static unsigned long minYieldTime = 0xffffffffffffffffUL; static void yield_shared_processor(void) { @@ -80,7 +68,7 @@ static void yield_shared_processor(void) process_iSeries_events(); } -int iSeries_idle(void) +static int iSeries_idle(void) { struct paca_struct *lpaca; long oldval; @@ -91,13 +79,10 @@ int iSeries_idle(void) CTRL = mfspr(CTRLF); CTRL &= ~RUNLATCH; mtspr(CTRLT, CTRL); -#if 0 - init_idle(); -#endif lpaca = get_paca(); - for (;;) { + while (1) { if (lpaca->lppaca.xSharedProc) { if (ItLpQueue_isLpIntPending(lpaca->lpqueue_ptr)) process_iSeries_events(); @@ -125,11 +110,13 @@ int iSeries_idle(void) schedule(); } + return 0; } -#endif -int default_idle(void) +#else + +static int default_idle(void) { long oldval; unsigned int cpu = smp_processor_id(); @@ -164,8 +151,6 @@ int default_idle(void) return 0; } -#ifdef CONFIG_PPC_PSERIES - DECLARE_PER_CPU(unsigned long, smt_snooze_delay); int dedicated_idle(void) @@ -179,8 +164,10 @@ int dedicated_idle(void) ppaca = &paca[cpu ^ 1]; while (1) { - /* Indicate to the HV that we are idle. Now would be - * a good time to find other work to dispatch. 
*/ + /* + * Indicate to the HV that we are idle. Now would be + * a good time to find other work to dispatch. + */ lpaca->lppaca.xIdle = 1; oldval = test_and_clear_thread_flag(TIF_NEED_RESCHED); @@ -203,21 +190,17 @@ int dedicated_idle(void) HMT_medium(); if (!(ppaca->lppaca.xIdle)) { - /* Indicate we are no longer polling for - * work, and then clear need_resched. If - * need_resched was 1, set it back to 1 - * and schedule work + local_irq_disable(); + + /* + * We are about to sleep the thread + * and so wont be polling any + * more. */ clear_thread_flag(TIF_POLLING_NRFLAG); - oldval = test_and_clear_thread_flag(TIF_NEED_RESCHED); - if(oldval == 1) { - set_need_resched(); - break; - } - - local_irq_disable(); - /* SMT dynamic mode. Cede will result + /* + * SMT dynamic mode. Cede will result * in this thread going dormant, if the * partner thread is still doing work. * Thread wakes up if partner goes idle, @@ -225,15 +208,21 @@ int dedicated_idle(void) * occurs. Returning from the cede * enables external interrupts. */ - cede_processor(); + if (!need_resched()) + cede_processor(); + else + local_irq_enable(); } else { - /* Give the HV an opportunity at the + /* + * Give the HV an opportunity at the * processor, since we are not doing * any work. */ poll_pending(); } } + + clear_thread_flag(TIF_POLLING_NRFLAG); } else { set_need_resched(); } @@ -247,48 +236,49 @@ int dedicated_idle(void) return 0; } -int shared_idle(void) +static int shared_idle(void) { struct paca_struct *lpaca = get_paca(); + unsigned int cpu = smp_processor_id(); while (1) { - if (cpu_is_offline(smp_processor_id()) && - system_state == SYSTEM_RUNNING) - cpu_die(); - - /* Indicate to the HV that we are idle. Now would be - * a good time to find other work to dispatch. */ + /* + * Indicate to the HV that we are idle. Now would be + * a good time to find other work to dispatch. 
+ */ lpaca->lppaca.xIdle = 1; - if (!need_resched()) { - local_irq_disable(); - - /* + while (!need_resched() && !cpu_is_offline(cpu)) { + local_irq_disable(); + + /* * Yield the processor to the hypervisor. We return if * an external interrupt occurs (which are driven prior * to returning here) or if a prod occurs from another - * processor. When returning here, external interrupts + * processor. When returning here, external interrupts * are enabled. + * + * Check need_resched() again with interrupts disabled + * to avoid a race. */ - cede_processor(); + if (!need_resched()) + cede_processor(); + else + local_irq_enable(); } HMT_medium(); lpaca->lppaca.xIdle = 0; schedule(); + if (cpu_is_offline(smp_processor_id()) && + system_state == SYSTEM_RUNNING) + cpu_die(); } return 0; } -#endif - -int cpu_idle(void) -{ - idle_loop(); - return 0; -} -int native_idle(void) +static int powermac_idle(void) { while(1) { if (!need_resched()) @@ -298,6 +288,13 @@ int native_idle(void) } return 0; } +#endif + +int cpu_idle(void) +{ + idle_loop(); + return 0; +} int idle_setup(void) { @@ -318,8 +315,8 @@ int idle_setup(void) idle_loop = default_idle; } } else if (systemcfg->platform == PLATFORM_POWERMAC) { - printk("idle = native_idle\n"); - idle_loop = native_idle; + printk("idle = powermac_idle\n"); + idle_loop = powermac_idle; } else { printk("idle_setup: unknown platform, use default_idle\n"); idle_loop = default_idle; @@ -328,4 +325,3 @@ int idle_setup(void) return 1; } - _ From anton at samba.org Fri Sep 10 19:15:58 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 10 Sep 2004 19:15:58 +1000 Subject: [PATCH] [ppc64] Use nm --synthetic where available In-Reply-To: <20040910091342.GE24408@krispykreme> References: <20040910090458.GB24408@krispykreme> <20040910090943.GC24408@krispykreme> <20040910091128.GD24408@krispykreme> <20040910091342.GE24408@krispykreme> Message-ID: <20040910091558.GF24408@krispykreme> On new toolchains we need to use nm --synthetic or we miss 
code symbols.

Sam, I'm not thrilled about this patch but I'm not sure of an easier
way. Any ideas?

Signed-off-by: Anton Blanchard

diff -puN arch/ppc64/Makefile~nm_synthetic arch/ppc64/Makefile
--- gr_work/arch/ppc64/Makefile~nm_synthetic 2004-09-01 03:45:49.180788436 -0500
+++ gr_work-anton/arch/ppc64/Makefile 2004-09-01 03:46:31.467604301 -0500
@@ -22,6 +22,12 @@ LD := $(LD) -m elf64ppc
 CC := $(CC) -m64
 endif
+new_nm := $(shell if $(NM) --help 2>&1 | grep -- '--synthetic' > /dev/null; then echo y; else echo n; fi)
+
+ifeq ($(new_nm),y)
+NM := $(NM) --synthetic
+endif
+
 CHECKFLAGS += -m64 -D__powerpc__=1
 LDFLAGS := -m elf64ppc
_

From anton at samba.org Fri Sep 10 22:29:04 2004
From: anton at samba.org (Anton Blanchard)
Date: Fri, 10 Sep 2004 22:29:04 +1000
Subject: [PATCH] [ppc64] Restore smt-enabled=off kernel command line option
In-Reply-To: <20040910122337.GJ24408@krispykreme>
References: <20040910121238.GG24408@krispykreme> <20040910121456.GH24408@krispykreme> <20040910121941.GI24408@krispykreme> <20040910122337.GJ24408@krispykreme>
Message-ID: <20040910122904.GK24408@krispykreme>

Restore the smt-enabled=off kernel command line functionality:

- Remove the SMT_DYNAMIC state now that smt_snooze_delay allows for the
  same thing.

- Remove the early prom.c parsing for the option, put it into an
  early_param instead.

- In setup_cpu_maps honour the smt-enabled setting.

Note to Nathan: In order to allow cpu hotplug add of secondary threads
after booting with smt-enabled=off, I had to initialise cpu_present_map
to cpu_possible_map in smp_cpus_done. I'm not sure how you want to
handle this, but it seems our present map currently does not allow cpus
to be added into the partition that weren't there at boot (but were in
the possible map).
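
As a rough illustration of how the two option sources are meant to combine, here is a userspace sketch. It assumes the behaviour of early_smt_enabled() and check_smt_enabled() in the patch below: SMT defaults to on, the command line accepts on/1/off/0, the OF "ibm,smt-enabled" property accepts on/off, and any smt-enabled= on the command line overrules the OF option. The function resolve_smt_enabled() and its parameters are hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <string.h>

/*
 * Decide whether SMT is enabled at boot.
 *
 * cmdline_seen: non-zero if smt-enabled= appeared on the command line
 * cmdline_val:  its value, or NULL if none was supplied
 * of_val:       value of the OF "ibm,smt-enabled" property, or NULL
 *
 * Returns 1 for enabled, 0 for disabled.
 */
static int resolve_smt_enabled(int cmdline_seen, const char *cmdline_val,
			       const char *of_val)
{
	int enabled = 1;	/* default, as with smt_enabled_at_boot = 1 */

	assert(cmdline_seen || cmdline_val == NULL);

	if (cmdline_seen) {
		/* The command line overrules OF even if the value is junk. */
		if (cmdline_val) {
			if (!strcmp(cmdline_val, "on") || !strcmp(cmdline_val, "1"))
				enabled = 1;
			else if (!strcmp(cmdline_val, "off") || !strcmp(cmdline_val, "0"))
				enabled = 0;
		}
		return enabled;
	}

	if (of_val) {
		if (!strcmp(of_val, "on"))
			enabled = 1;
		else if (!strcmp(of_val, "off"))
			enabled = 0;
	}

	return enabled;
}
```

For example, resolve_smt_enabled(1, "off", "on") gives 0, while an unrecognised command line value leaves the default in place and still hides the OF setting.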
Signed-off-by: Anton Blanchard diff -puN arch/ppc64/kernel/idle.c~cmdline-5 arch/ppc64/kernel/idle.c --- foobar2/arch/ppc64/kernel/idle.c~cmdline-5 2004-09-10 20:08:09.790873157 +1000 +++ foobar2-anton/arch/ppc64/kernel/idle.c 2004-09-10 20:08:09.850868545 +1000 @@ -197,12 +197,7 @@ int dedicated_idle(void) HMT_very_low(); /* Low power mode */ - /* If the SMT mode is system controlled & the - * partner thread is doing work, switch into - * ST mode. - */ - if((naca->smt_state == SMT_DYNAMIC) && - (!(ppaca->lppaca.xIdle))) { + if (!(ppaca->lppaca.xIdle)) { /* Indicate we are no longer polling for * work, and then clear need_resched. If * need_resched was 1, set it back to 1 diff -puN arch/ppc64/kernel/prom.c~cmdline-5 arch/ppc64/kernel/prom.c --- foobar2/arch/ppc64/kernel/prom.c~cmdline-5 2004-09-10 20:08:09.798872542 +1000 +++ foobar2-anton/arch/ppc64/kernel/prom.c 2004-09-10 20:08:09.858867930 +1000 @@ -918,11 +918,7 @@ static void __init prom_hold_cpus(unsign = (void *)virt_to_abs(&__secondary_hold_acknowledge); unsigned long secondary_hold = virt_to_abs(*PTRRELOC((unsigned long *)__secondary_hold)); - struct systemcfg *_systemcfg = RELOC(systemcfg); struct prom_t *_prom = PTRRELOC(&prom); -#ifdef CONFIG_SMP - struct naca_struct *_naca = RELOC(naca); -#endif prom_debug("prom_hold_cpus: start...\n"); prom_debug(" 1) spinloop = 0x%x\n", (unsigned long)spinloop); @@ -1003,18 +999,18 @@ static void __init prom_hold_cpus(unsign (*acknowledge == ((unsigned long)-1)); i++ ) ; if (*acknowledge == cpuid) { - prom_printf("... done\n"); + prom_printf(" done\n"); /* We have to get every CPU out of OF, * even if we never start it. */ if (cpuid >= NR_CPUS) goto next; } else { - prom_printf("... 
failed: %x\n", *acknowledge); + prom_printf(" failed: %x\n", *acknowledge); } } #ifdef CONFIG_SMP else - prom_printf("%x : booting cpu %s\n", cpuid, path); + prom_printf("%x : boot cpu %s\n", cpuid, path); #endif next: #ifdef CONFIG_SMP @@ -1023,13 +1019,6 @@ next: cpuid++; if (cpuid >= NR_CPUS) continue; - prom_printf("%x : preparing thread ... ", - interrupt_server[i]); - if (_naca->smt_state) { - prom_printf("available\n"); - } else { - prom_printf("not available\n"); - } } #endif cpuid++; @@ -1068,57 +1057,6 @@ next: prom_debug("prom_hold_cpus: end...\n"); } -static void __init smt_setup(void) -{ - char *p, *q; - char my_smt_enabled = SMT_DYNAMIC; - ihandle prom_options = 0; - char option[9]; - unsigned long offset = reloc_offset(); - struct naca_struct *_naca = RELOC(naca); - char found = 0; - - if (strstr(RELOC(cmd_line), RELOC("smt-enabled="))) { - for (q = RELOC(cmd_line); (p = strstr(q, RELOC("smt-enabled="))) != 0; ) { - q = p + 12; - if (p > RELOC(cmd_line) && p[-1] != ' ') - continue; - found = 1; - if (q[0] == 'o' && q[1] == 'f' && - q[2] == 'f' && (q[3] == ' ' || q[3] == '\0')) { - my_smt_enabled = SMT_OFF; - } else if (q[0]=='o' && q[1] == 'n' && - (q[2] == ' ' || q[2] == '\0')) { - my_smt_enabled = SMT_ON; - } else { - my_smt_enabled = SMT_DYNAMIC; - } - } - } - if (!found) { - prom_options = call_prom("finddevice", 1, 1, ADDR("/options")); - if (prom_options != (ihandle) -1) { - prom_getprop(prom_options, "ibm,smt-enabled", - option, sizeof(option)); - if (option[0] != 0) { - found = 1; - if (!strcmp(option, RELOC("off"))) - my_smt_enabled = SMT_OFF; - else if (!strcmp(option, RELOC("on"))) - my_smt_enabled = SMT_ON; - else - my_smt_enabled = SMT_DYNAMIC; - } - } - } - - if (!found ) - my_smt_enabled = SMT_DYNAMIC; /* default to on */ - - _naca->smt_state = my_smt_enabled; -} - - #ifdef CONFIG_BOOTX_TEXT /* This function will enable the early boot text when doing OF booting. 
This @@ -1730,8 +1668,6 @@ prom_init(unsigned long r3, unsigned lon /* Initialize some system info into the Naca early... */ prom_initialize_naca(); - smt_setup(); - /* If we are on an SMP machine, then we *MUST* do the * following, regardless of whether we have an SMP * kernel or not. diff -puN arch/ppc64/kernel/setup.c~cmdline-5 arch/ppc64/kernel/setup.c --- foobar2/arch/ppc64/kernel/setup.c~cmdline-5 2004-09-10 20:08:09.805872004 +1000 +++ foobar2-anton/arch/ppc64/kernel/setup.c 2004-09-10 20:26:50.139096936 +1000 @@ -152,6 +152,50 @@ void __init disable_early_printk(void) } #if !defined(CONFIG_PPC_ISERIES) && defined(CONFIG_SMP) + +static int smt_enabled_cmdline; + +/* Look for ibm,smt-enabled OF option */ +static void check_smt_enabled(void) +{ + struct device_node *dn; + char *smt_option; + + /* Allow the command line to overrule the OF option */ + if (smt_enabled_cmdline) + return; + + dn = of_find_node_by_path("/options"); + + if (dn) { + smt_option = (char *)get_property(dn, "ibm,smt-enabled", NULL); + + if (smt_option) { + if (!strcmp(smt_option, "on")) + smt_enabled_at_boot = 1; + else if (!strcmp(smt_option, "off")) + smt_enabled_at_boot = 0; + } + } +} + +/* Look for smt-enabled= cmdline option */ +static int __init early_smt_enabled(char *p) +{ + smt_enabled_cmdline = 1; + + if (!p) + return 0; + + if (!strcmp(p, "on") || !strcmp(p, "1")) + smt_enabled_at_boot = 1; + else if (!strcmp(p, "off") || !strcmp(p, "0")) + smt_enabled_at_boot = 0; + + return 0; +} +early_param("smt-enabled", early_smt_enabled); + /** * setup_cpu_maps - initialize the following cpu maps: * cpu_possible_map @@ -174,6 +218,8 @@ static void __init setup_cpu_maps(void) struct device_node *dn = NULL; int cpu = 0; + check_smt_enabled(); + while ((dn = of_find_node_by_type(dn, "cpu")) && cpu < NR_CPUS) { u32 *intserv; int j, len = sizeof(u32), nthreads; @@ -186,9 +232,16 @@ static void __init setup_cpu_maps(void) nthreads = len / sizeof(u32); for (j = 0; j < nthreads && cpu < 
NR_CPUS; j++) { + /* + * Only spin up secondary threads if SMT is enabled. + * We must leave space in the logical map for the + * threads. + */ + if (j == 0 || smt_enabled_at_boot) { + cpu_set(cpu, cpu_present_map); + set_hard_smp_processor_id(cpu, intserv[j]); + } cpu_set(cpu, cpu_possible_map); - cpu_set(cpu, cpu_present_map); - set_hard_smp_processor_id(cpu, intserv[j]); cpu++; } } diff -puN arch/ppc64/kernel/smp.c~cmdline-5 arch/ppc64/kernel/smp.c --- foobar2/arch/ppc64/kernel/smp.c~cmdline-5 2004-09-10 20:08:09.811871543 +1000 +++ foobar2-anton/arch/ppc64/kernel/smp.c 2004-09-10 20:48:26.959223351 +1000 @@ -74,6 +74,8 @@ void smp_call_function_interrupt(void); extern long register_vpa(unsigned long flags, unsigned long proc, unsigned long vpa); +int smt_enabled_at_boot = 1; + /* Low level assembly function used to backup CPU 0 state */ extern void __save_cpu_setup(void); @@ -942,4 +944,12 @@ void __init smp_cpus_done(unsigned int m smp_threads_ready = 1; set_cpus_allowed(current, old_mask); + + /* + * We know at boot the maximum number of cpus we can add to + * a partition and set cpu_possible_map accordingly. cpu_present_map + * needs to match for the hotplug code to allow us to hot add + * any offline cpus. 
+ */ + cpu_present_map = cpu_possible_map; } diff -puN include/asm-ppc64/memory.h~cmdline-5 include/asm-ppc64/memory.h --- foobar2/include/asm-ppc64/memory.h~cmdline-5 2004-09-10 20:08:09.817871081 +1000 +++ foobar2-anton/include/asm-ppc64/memory.h 2004-09-10 20:08:09.865867392 +1000 @@ -56,14 +56,4 @@ static inline void isync(void) #define HMT_MEDIUM_HIGH "\tor 5,5,5 # medium high priority\n" #define HMT_HIGH "\tor 3,3,3 # high priority\n" -/* - * Various operational modes for SMT - * Off : never run threaded - * On : always run threaded - * Dynamic: Allow the system to switch modes as needed - */ -#define SMT_OFF 0 -#define SMT_ON 1 -#define SMT_DYNAMIC 2 - #endif diff -puN include/asm-ppc64/naca.h~cmdline-5 include/asm-ppc64/naca.h --- foobar2/include/asm-ppc64/naca.h~cmdline-5 2004-09-10 20:08:09.823870620 +1000 +++ foobar2-anton/include/asm-ppc64/naca.h 2004-09-10 20:08:09.867867238 +1000 @@ -37,9 +37,6 @@ struct naca_struct { u32 dCacheL1LinesPerPage; /* L1 d-cache lines / page 0x64 */ u32 iCacheL1LogLineSize; /* L1 i-cache line size Log2 0x68 */ u32 iCacheL1LinesPerPage; /* L1 i-cache lines / page 0x6c */ - u8 smt_state; /* 0 = SMT off 0x70 */ - /* 1 = SMT on */ - /* 2 = SMT dynamic */ u8 resv0[15]; /* Reserved 0x71 - 0x7F */ }; diff -puN include/asm-ppc64/smp.h~cmdline-5 include/asm-ppc64/smp.h --- foobar2/include/asm-ppc64/smp.h~cmdline-5 2004-09-10 20:08:09.829870159 +1000 +++ foobar2-anton/include/asm-ppc64/smp.h 2004-09-10 20:08:09.868867161 +1000 @@ -65,6 +65,8 @@ extern int query_cpu_stopped(unsigned in #define set_hard_smp_processor_id(CPU, VAL) \ do { (paca[(CPU)].hw_cpu_id = (VAL)); } while (0) +extern int smt_enabled_at_boot; + #endif /* __ASSEMBLY__ */ #endif /* !(_PPC64_SMP_H) */ _ From david at gibson.dropbear.id.au Mon Sep 13 14:11:19 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Mon, 13 Sep 2004 14:11:19 +1000 Subject: [PPC64] Improved VSID allocation algorithm Message-ID: <20040913041119.GA5351@zax> Andrew, please apply. 
This patch has been tested both on SLB and segment table machines.
This new approach is far from the final word in VSID/context
allocation, but it's a noticeable improvement on the old method.

Replace the VSID allocation algorithm. The new algorithm first
generates a 36-bit "proto-VSID" (with 0xfffffffff reserved). For
kernel addresses this is equal to the ESID (address >> 28), for user
addresses it is:

	(context << 15) | (esid & 0x7fff)

These are distinguishable from kernel proto-VSIDs because the top bit
is clear. Proto-VSIDs with the top two bits equal to 0b10 are
reserved for now. The proto-VSIDs are then scrambled into real VSIDs
with the multiplicative hash:

	VSID = (proto-VSID * VSID_MULTIPLIER) % VSID_MODULUS
	where	VSID_MULTIPLIER = 268435399 = 0xFFFFFC7
		VSID_MODULUS = 2^36-1 = 0xFFFFFFFFF

This scramble is 1:1, because VSID_MULTIPLIER and VSID_MODULUS are
co-prime since VSID_MULTIPLIER is prime (the largest 28-bit prime, in
fact).

This scheme has a number of advantages over the old one:

- We now have VSIDs for every kernel address (i.e. everything above
  0xC000000000000000), except the very top segment. That simplifies a
  number of things.

- We allow for 15 significant bits of ESID for user addresses with 20
  bits of context. i.e. 8T (43 bits) of address space for up to 1M
  contexts, significantly more than the old method (although we will
  need changes in the hash path and context allocation to take
  advantage of this).

- Because we use a real multiplicative hash function, we have better
  and more robust hash scattering with this VSID algorithm (at least
  based on some initial results).

Because the MODULUS is 2^n-1 we can use a trick to compute it
efficiently without a divide or extra multiply. This makes the new
algorithm barely slower than the old one.
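
The scramble and the modulus trick described above can be tried in userspace with a small sketch like the following. It mirrors vsid_scramble(), get_kernel_vsid() and get_vsid() from the patch, but uses uint64_t in place of unsigned long and defines SID_SHIFT as 28 and USER_ESID_BITS as 15, which are assumptions taken from the description rather than from the kernel headers. vsid_scramble_slow() is a hypothetical reference helper added only for comparison.

```c
#include <assert.h>
#include <stdint.h>

#define VSID_MULTIPLIER	268435399ULL		/* largest 28-bit prime */
#define VSID_BITS	36
#define VSID_MODULUS	((1ULL << VSID_BITS) - 1)
#define SID_SHIFT	28			/* assumed: 256MB segments */
#define USER_ESID_BITS	15			/* assumed: from the text above */

/* Reference version: straightforward multiply and modulus. */
static uint64_t vsid_scramble_slow(uint64_t protovsid)
{
	return (protovsid * VSID_MULTIPLIER) % VSID_MODULUS;
}

/*
 * Same result without a divide. Since 2^36 == 1 (mod 2^36-1), the
 * product x = hi * 2^36 + lo is congruent to hi + lo. That sum is less
 * than 2 * (2^36-1), so at most one subtraction of the modulus remains;
 * adding the carry out of (x + 1) and masking performs it.
 */
static uint64_t vsid_scramble(uint64_t protovsid)
{
	uint64_t x;

	/* Only defined for 36-bit inputs; the product then fits in 64 bits. */
	assert(protovsid <= VSID_MODULUS);

	x = protovsid * VSID_MULTIPLIER;
	x = (x >> VSID_BITS) + (x & VSID_MODULUS);
	return (x + ((x + 1) >> VSID_BITS)) & VSID_MODULUS;
}

/* Kernel addresses: proto-VSID is simply the ESID. */
static uint64_t get_kernel_vsid(uint64_t ea)
{
	return vsid_scramble(ea >> SID_SHIFT);
}

/* User addresses: proto-VSID is (context << 15) | ESID. */
static uint64_t get_vsid(uint64_t context, uint64_t ea)
{
	return vsid_scramble((context << USER_ESID_BITS) | (ea >> SID_SHIFT));
}
```

With these definitions, get_kernel_vsid(0xC000000000000000) evaluates to 0x40bffffd5 and get_kernel_vsid(0xD000000000000000) to 0xb0cffffd1, which agree with the precomputed KERNELBASE and VMALLOCBASE values the patch puts in head.S. The reserved proto-VSID 0xfffffffff scrambles to 0.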
Index: working-2.6/include/asm-ppc64/mmu_context.h =================================================================== --- working-2.6.orig/include/asm-ppc64/mmu_context.h 2004-08-25 10:37:27.000000000 +1000 +++ working-2.6/include/asm-ppc64/mmu_context.h 2004-09-09 16:18:05.847490304 +1000 @@ -34,7 +34,7 @@ } #define NO_CONTEXT 0 -#define FIRST_USER_CONTEXT 0x10 /* First 16 reserved for kernel */ +#define FIRST_USER_CONTEXT 1 #define LAST_USER_CONTEXT 0x8000 /* Same as PID_MAX for now... */ #define NUM_USER_CONTEXT (LAST_USER_CONTEXT-FIRST_USER_CONTEXT) @@ -181,46 +181,87 @@ local_irq_restore(flags); } -/* This is only valid for kernel (including vmalloc, imalloc and bolted) EA's +/* VSID allocation + * =============== + * + * We first generate a 36-bit "proto-VSID". For kernel addresses this + * is equal to the ESID, for user addresses it is: + * (context << 15) | (esid & 0x7fff) + * + * The two forms are distinguishable because the top bit is 0 for user + * addresses, whereas the top two bits are 1 for kernel addresses. + * Proto-VSIDs with the top two bits equal to 0b10 are reserved for + * now. + * + * The proto-VSIDs are then scrambled into real VSIDs with the + * multiplicative hash: + * + * VSID = (proto-VSID * VSID_MULTIPLIER) % VSID_MODULUS + * where VSID_MULTIPLIER = 268435399 = 0xFFFFFC7 + * VSID_MODULUS = 2^36-1 = 0xFFFFFFFFF + * + * This scramble is only well defined for proto-VSIDs below + * 0xFFFFFFFFF, so both proto-VSID and actual VSID 0xFFFFFFFFF are + * reserved. VSID_MULTIPLIER is prime (the largest 28-bit prime, in + * fact), so in particular it is co-prime to VSID_MODULUS, making this + * a 1:1 scrambling function. Because the modulus is 2^n-1 we can + * compute it efficiently without a divide or extra multiply (see + * below). + * + * This scheme has several advantages over older methods: + * + * - We have VSIDs allocated for every kernel address + * (i.e. 
everything above 0xC000000000000000), except the very top + * segment, which simplifies several things. + * + * - We allow for 15 significant bits of ESID and 20 bits of + * context for user addresses. i.e. 8T (43 bits) of address space for + * up to 1M contexts (although the page table structure and context + * allocation will need changes to take advantage of this). + * + * - The scramble function gives robust scattering in the hash + * table (at least based on some initial results). The previous + * method was more susceptible to pathological cases giving excessive + * hash collisions. */ -static inline unsigned long -get_kernel_vsid( unsigned long ea ) -{ - unsigned long ordinal, vsid; - - ordinal = (((ea >> 28) & 0x1fff) * LAST_USER_CONTEXT) | (ea >> 60); - vsid = (ordinal * VSID_RANDOMIZER) & VSID_MASK; - -#ifdef HTABSTRESS - /* For debug, this path creates a very poor vsid distribuition. - * A user program can access virtual addresses in the form - * 0x0yyyyxxxx000 where yyyy = xxxx to cause multiple mappings - * to hash to the same page table group. - */ - ordinal = ((ea >> 28) & 0x1fff) | (ea >> 44); - vsid = ordinal & VSID_MASK; -#endif /* HTABSTRESS */ - return vsid; -} +/* + * WARNING - If you change these you must make sure the asm + * implementations in slb_allocate(), do_stab_bolted and mmu.h + * (ASM_VSID_SCRAMBLE macro) are changed accordingly. + * + * You'll also need to change the precomputed VSID values in head.S + * which are used by the iSeries firmware. + */ + +static inline unsigned long vsid_scramble(unsigned long protovsid) +{ +#if 0 + /* The code below is equivalent to this function for arguments + * < 2^VSID_BITS, which is all this should ever be called + * with. However gcc is not clever enough to compute the + * modulus (2^n-1) without a second multiply. 
*/ + return ((protovsid * VSID_MULTIPLIER) % VSID_MODULUS); +#else /* 1 */ + unsigned long x; + + x = protovsid * VSID_MULTIPLIER; + x = (x >> VSID_BITS) + (x & VSID_MODULUS); + return (x + ((x+1) >> VSID_BITS)) & VSID_MODULUS; +#endif /* 1 */ +} -/* This is only valid for user EA's (user EA's do not exceed 2^41 (EADDR_SIZE)) - */ -static inline unsigned long -get_vsid( unsigned long context, unsigned long ea ) +/* This is only valid for addresses >= KERNELBASE */ +static inline unsigned long get_kernel_vsid(unsigned long ea) { - unsigned long ordinal, vsid; - - ordinal = (((ea >> 28) & 0x1fff) * LAST_USER_CONTEXT) | context; - vsid = (ordinal * VSID_RANDOMIZER) & VSID_MASK; - -#ifdef HTABSTRESS - /* See comment above. */ - ordinal = ((ea >> 28) & 0x1fff) | (context << 16); - vsid = ordinal & VSID_MASK; -#endif /* HTABSTRESS */ + return vsid_scramble(ea >> SID_SHIFT); +} - return vsid; +/* This is only valid for user addresses (which are below 2^41) */ +static inline unsigned long get_vsid(unsigned long context, unsigned long ea) +{ + return vsid_scramble((context << USER_ESID_BITS) + | (ea >> SID_SHIFT)); } #endif /* __PPC64_MMU_CONTEXT_H */ Index: working-2.6/include/asm-ppc64/mmu.h =================================================================== --- working-2.6.orig/include/asm-ppc64/mmu.h 2004-09-07 10:38:00.000000000 +1000 +++ working-2.6/include/asm-ppc64/mmu.h 2004-09-09 15:04:16.814447984 +1000 @@ -15,6 +15,7 @@ #include #include +#include #ifndef __ASSEMBLY__ @@ -215,12 +216,44 @@ #define SLB_VSID_KERNEL (SLB_VSID_KP|SLB_VSID_C) #define SLB_VSID_USER (SLB_VSID_KP|SLB_VSID_KS) -#define VSID_RANDOMIZER ASM_CONST(42470972311) -#define VSID_MASK 0xfffffffffUL -/* Because we never access addresses below KERNELBASE as kernel - * addresses, this VSID is never used for anything real, and will - * never have pages hashed into it */ -#define BAD_VSID ASM_CONST(0) +#define VSID_MULTIPLIER ASM_CONST(268435399) /* largest 28-bit prime */ +#define VSID_BITS 36 
+#define VSID_MODULUS ((1UL<= \ + * 2^36-1, then r3+1 has the 2^36 bit set. So, if r3+1 has \ + * the bit clear, r3 already has the answer we want, if it \ + * doesn't, the answer is the low 36 bits of r3+1. So in all \ + * cases the answer is the low 36 bits of (r3 + ((r3+1) >> 36))*/\ + addi rx,rt,1; \ + srdi rx,rx,VSID_BITS; /* extract 2^36 bit */ \ + add rt,rt,rx /* Block size masks */ #define BL_128K 0x000 Index: working-2.6/arch/ppc64/mm/slb_low.S =================================================================== --- working-2.6.orig/arch/ppc64/mm/slb_low.S 2004-09-07 10:38:00.000000000 +1000 +++ working-2.6/arch/ppc64/mm/slb_low.S 2004-09-09 15:04:16.815447832 +1000 @@ -68,19 +68,19 @@ srdi r3,r3,28 /* get esid */ cmpldi cr7,r9,0xc /* cmp KERNELBASE for later use */ - /* r9 = region, r3 = esid, cr7 = <>KERNELBASE */ - - rldicr. r11,r3,32,16 - bne- 8f /* invalid ea bits set */ - addi r11,r9,-1 - cmpldi r11,0xb - blt- 8f /* invalid region */ + rldimi r10,r3,28,0 /* r10= ESID<<28 | entry */ + oris r10,r10,SLB_ESID_V at h /* r10 |= SLB_ESID_V */ - /* r9 = region, r3 = esid, r10 = entry, cr7 = <>KERNELBASE */ + /* r3 = esid, r10 = esid_data, cr7 = <>KERNELBASE */ blt cr7,0f /* user or kernel? */ - /* kernel address */ + /* kernel address: proto-VSID = ESID */ + /* WARNING - MAGIC: we don't use the VSID 0xfffffffff, but + * this code will generate the protoVSID 0xfffffffff for the + * top segment. That's ok, the scramble below will translate + * it to VSID 0, which is reserved as a bad VSID - one which + * will never have any pages in it. */ li r11,SLB_VSID_KERNEL BEGIN_FTR_SECTION bne cr7,9f @@ -88,8 +88,12 @@ END_FTR_SECTION_IFSET(CPU_FTR_16M_PAGE) b 9f -0: /* user address */ +0: /* user address: proto-VSID = context<<15 | ESID */ li r11,SLB_VSID_USER + + srdi. 
r9,r3,13 + bne- 8f /* invalid ea bits set */ + #ifdef CONFIG_HUGETLB_PAGE BEGIN_FTR_SECTION /* check against the hugepage ranges */ @@ -111,33 +115,18 @@ #endif /* CONFIG_HUGETLB_PAGE */ 6: ld r9,PACACONTEXTID(r13) + rldimi r3,r9,USER_ESID_BITS,0 -9: /* r9 = "context", r3 = esid, r11 = flags, r10 = entry */ - - rldimi r9,r3,15,0 /* r9= VSID ordinal */ - -7: rldimi r10,r3,28,0 /* r10= ESID<<28 | entry */ - oris r10,r10,SLB_ESID_V at h /* r10 |= SLB_ESID_V */ - - /* r9 = ordinal, r3 = esid, r11 = flags, r10 = esid_data */ - - li r3,VSID_RANDOMIZER at higher - sldi r3,r3,32 - oris r3,r3,VSID_RANDOMIZER at h - ori r3,r3,VSID_RANDOMIZER at l - - mulld r9,r3,r9 /* r9 = ordinal * VSID_RANDOMIZER */ - clrldi r9,r9,28 /* r9 &= VSID_MASK */ - sldi r9,r9,SLB_VSID_SHIFT /* r9 <<= SLB_VSID_SHIFT */ - or r9,r9,r11 /* r9 |= flags */ +9: /* r3 = protovsid, r11 = flags, r10 = esid_data, cr7 = <>KERNELBASE */ + ASM_VSID_SCRAMBLE(r3,r9) - /* r9 = vsid_data, r10 = esid_data, cr7 = <>KERNELBASE */ + rldimi r11,r3,SLB_VSID_SHIFT,16 /* combine VSID and flags */ /* * No need for an isync before or after this slbmte. The exception * we enter with and the rfid we exit with are context synchronizing. 
*/ - slbmte r9,r10 + slbmte r11,r10 bgelr cr7 /* we're done for kernel addresses */ @@ -160,6 +149,6 @@ blr 8: /* invalid EA */ - li r9,0 /* 0 VSID ordinal -> BAD_VSID */ + li r3,0 /* BAD_VSID */ li r11,SLB_VSID_USER /* flags don't much matter */ - b 7b + b 9b Index: working-2.6/arch/ppc64/kernel/head.S =================================================================== --- working-2.6.orig/arch/ppc64/kernel/head.S 2004-09-09 15:04:16.770454672 +1000 +++ working-2.6/arch/ppc64/kernel/head.S 2004-09-09 15:04:16.817447528 +1000 @@ -548,15 +548,15 @@ .llong 0 /* Reserved */ .llong 0 /* Reserved */ .llong 0 /* Reserved */ - .llong 0xc00000000 /* KERNELBASE ESID */ - .llong 0x6a99b4b14 /* KERNELBASE VSID */ + .llong (KERNELBASE>>SID_SHIFT) + .llong 0x40bffffd5 /* KERNELBASE VSID */ /* We have to list the bolted VMALLOC segment here, too, so that it * will be restored on shared processor switch */ - .llong 0xd00000000 /* VMALLOCBASE ESID */ - .llong 0x08d12e6ab /* VMALLOCBASE VSID */ + .llong (VMALLOCBASE>>SID_SHIFT) + .llong 0xb0cffffd1 /* VMALLOCBASE VSID */ .llong 8192 /* # pages to map (32 MB) */ .llong 0 /* Offset from start of loadarea to start of map */ - .llong 0x0006a99b4b140000 /* VPN of first page to map */ + .llong 0x40bffffd50000 /* VPN of first page to map */ . 
= 0x6100 @@ -1064,18 +1064,9 @@ rldimi r10,r11,7,52 /* r10 = first ste of the group */ /* Calculate VSID */ - /* (((ea >> 28) & 0x1fff) << 15) | (ea >> 60) */ - rldic r11,r11,15,36 - ori r11,r11,0xc - - /* VSID_RANDOMIZER */ - li r9,9 - sldi r9,r9,32 - oris r9,r9,58231 - ori r9,r9,39831 - - mulld r9,r11,r9 - rldic r9,r9,12,16 /* r9 = vsid << 12 */ + /* This is a kernel address, so protovsid = ESID */ + ASM_VSID_SCRAMBLE(r11, r9) + rldic r9,r11,12,16 /* r9 = vsid << 12 */ /* Search the primary group for a free entry */ 1: ld r11,0(r10) /* Test valid bit of the current ste */ Index: working-2.6/arch/ppc64/mm/stab.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/stab.c 2004-08-25 10:37:26.000000000 +1000 +++ working-2.6/arch/ppc64/mm/stab.c 2004-09-09 15:04:16.818447376 +1000 @@ -115,15 +115,11 @@ unsigned char stab_entry; unsigned long offset; - /* Check for invalid effective addresses. */ - if (!IS_VALID_EA(ea)) - return 1; - /* Kernel or user address? */ if (ea >= KERNELBASE) { vsid = get_kernel_vsid(ea); } else { - if (! mm) + if ((ea >= TASK_SIZE_USER64) || (! mm)) return 1; vsid = get_vsid(mm->context.id, ea); Index: working-2.6/include/asm-ppc64/pgtable.h =================================================================== --- working-2.6.orig/include/asm-ppc64/pgtable.h 2004-09-07 10:38:00.000000000 +1000 +++ working-2.6/include/asm-ppc64/pgtable.h 2004-09-09 15:29:13.949495840 +1000 @@ -45,10 +45,16 @@ PGD_INDEX_SIZE + PAGE_SHIFT) /* + * Size of EA range mapped by our pagetables. 
+ */ +#define PGTABLE_EA_BITS 41 +#define PGTABLE_EA_MASK ((1UL< physical */ #define KRANGE_START KERNELBASE -#define KRANGE_END (KRANGE_START + VALID_EA_BITS) +#define KRANGE_END (KRANGE_START + PGTABLE_EA_MASK) /* * Define the user address range */ #define USER_START (0UL) -#define USER_END (USER_START + VALID_EA_BITS) +#define USER_END (USER_START + PGTABLE_EA_MASK) /* Index: working-2.6/arch/ppc64/mm/hash_utils.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/hash_utils.c 2004-08-26 10:20:55.000000000 +1000 +++ working-2.6/arch/ppc64/mm/hash_utils.c 2004-09-09 15:04:16.820447072 +1000 @@ -253,24 +253,24 @@ int local = 0; cpumask_t tmp; - /* Check for invalid addresses. */ - if (!IS_VALID_EA(ea)) - return 1; - switch (REGION_ID(ea)) { case USER_REGION_ID: user_region = 1; mm = current->mm; - if (mm == NULL) + if ((ea > USER_END) || (! mm)) return 1; vsid = get_vsid(mm->context.id, ea); break; case IO_REGION_ID: + if (ea > IMALLOC_END) + return 1; mm = &ioremap_mm; vsid = get_kernel_vsid(ea); break; case VMALLOC_REGION_ID: + if (ea > VMALLOC_END) + return 1; mm = &init_mm; vsid = get_kernel_vsid(ea); break; Index: working-2.6/include/asm-ppc64/page.h =================================================================== --- working-2.6.orig/include/asm-ppc64/page.h 2004-09-07 10:38:00.000000000 +1000 +++ working-2.6/include/asm-ppc64/page.h 2004-09-09 15:04:16.820447072 +1000 @@ -212,17 +212,6 @@ #define USER_REGION_ID (0UL) #define REGION_ID(X) (((unsigned long)(X))>>REGION_SHIFT) -/* - * Define valid/invalid EA bits (for all ranges) - */ -#define VALID_EA_BITS (0x000001ffffffffffUL) -#define INVALID_EA_BITS (~(REGION_MASK|VALID_EA_BITS)) - -#define IS_VALID_REGION_ID(x) \ - (((x) == USER_REGION_ID) || ((x) >= KERNEL_REGION_ID)) -#define IS_VALID_EA(x) \ - ((!((x) & INVALID_EA_BITS)) && IS_VALID_REGION_ID(REGION_ID(x))) - #define __bpn_to_ba(x) ((((unsigned long)(x))<> PAGE_SHIFT) -- David Gibson | For every 
complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. http://www.ozlabs.org/people/dgibson From anton at samba.org Mon Sep 13 20:55:05 2004 From: anton at samba.org (Anton Blanchard) Date: Mon, 13 Sep 2004 20:55:05 +1000 Subject: [PATCH] [ppc64] force_sigsegv fixes Message-ID: <20040913105505.GA14553@krispykreme> Replace do_exit() in 64bit signal code with force_sig/force_sigsegv where appropriate. Signed-off-by: Anton Blanchard diff -puN arch/ppc64/kernel/signal.c~signal_fixes arch/ppc64/kernel/signal.c --- 2.6.9-rc1-mm5/arch/ppc64/kernel/signal.c~signal_fixes 2004-09-13 19:53:00.173734784 +1000 +++ 2.6.9-rc1-mm5-anton/arch/ppc64/kernel/signal.c 2004-09-13 19:53:07.350235795 +1000 @@ -371,7 +371,8 @@ badframe: printk("badframe in sys_rt_sigreturn, regs=%p uc=%p &uc->uc_mcontext=%p\n", regs, uc, &uc->uc_mcontext); #endif - do_exit(SIGSEGV); + force_sig(SIGSEGV, current); + return 0; } static void setup_rt_frame(int signr, struct k_sigaction *ka, siginfo_t *info, @@ -446,7 +447,7 @@ badframe: printk("badframe in setup_rt_frame, regs=%p frame=%p newsp=%lx\n", regs, frame, newsp); #endif - do_exit(SIGSEGV); + force_sigsegv(signr, current); } _ From anton at samba.org Mon Sep 13 20:56:56 2004 From: anton at samba.org (Anton Blanchard) Date: Mon, 13 Sep 2004 20:56:56 +1000 Subject: [PATCH] [ppc64] powersave_nap sysctl In-Reply-To: <20040913105505.GA14553@krispykreme> References: <20040913105505.GA14553@krispykreme> Message-ID: <20040913105656.GB14553@krispykreme> Implement powersave_nap sysctl, like ppc32. This allows us to disable the nap function which is useful when profiling with oprofile (to get an accurate count of idle time). 
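For anyone who wants to flip the setting from a profiling harness rather than a shell, a minimal userspace sketch follows. It assumes the sysctl appears as /proc/sys/kernel/powersave-nap (inferred from the "kernel" table and the "powersave-nap" procname in the patch below) and that writing 0 turns nap off. The helper name write_int_sysctl is made up for illustration and is not part of the patch.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of the patch): write an integer to a
 * sysctl-style file. Returns 0 on success, -1 on any failure.
 */
static int write_int_sysctl(const char *path, int value)
{
	assert(path != NULL);

	FILE *f = fopen(path, "w");
	if (!f)
		return -1;

	/* proc_dointvec-style files accept a decimal integer plus newline */
	int ok = fprintf(f, "%d\n", value) > 0;

	/* fclose() flushes, so a failed write can also surface here */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

Assuming the path above, write_int_sysctl("/proc/sys/kernel/powersave-nap", 0) before an oprofile run and write_int_sysctl(..., 1) afterwards would be the intended use; both need root.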
Signed-off-by: Anton Blanchard diff -puN arch/ppc64/kernel/idle.c~powersave_nap arch/ppc64/kernel/idle.c --- 2.6.9-rc1-mm5/arch/ppc64/kernel/idle.c~powersave_nap 2004-09-13 19:51:24.809722022 +1000 +++ 2.6.9-rc1-mm5-anton/arch/ppc64/kernel/idle.c 2004-09-13 19:51:24.835720023 +1000 @@ -20,6 +20,8 @@ #include #include #include +#include +#include #include #include @@ -296,6 +298,38 @@ int cpu_idle(void) return 0; } +int powersave_nap; + +#ifdef CONFIG_SYSCTL +/* + * Register the sysctl to set/clear powersave_nap. + */ +static ctl_table powersave_nap_ctl_table[]={ + { + .ctl_name = KERN_PPC_POWERSAVE_NAP, + .procname = "powersave-nap", + .data = &powersave_nap, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { 0, }, +}; +static ctl_table powersave_nap_sysctl_root[] = { + { 1, "kernel", NULL, 0, 0755, powersave_nap_ctl_table, }, + { 0,}, +}; + +static int __init +register_powersave_nap_sysctl(void) +{ + register_sysctl_table(powersave_nap_sysctl_root, 0); + + return 0; +} +__initcall(register_powersave_nap_sysctl); +#endif + int idle_setup(void) { #ifdef CONFIG_PPC_ISERIES diff -puN arch/ppc64/kernel/setup.c~powersave_nap arch/ppc64/kernel/setup.c --- 2.6.9-rc1-mm5/arch/ppc64/kernel/setup.c~powersave_nap 2004-09-13 19:51:24.815721561 +1000 +++ 2.6.9-rc1-mm5-anton/arch/ppc64/kernel/setup.c 2004-09-13 19:51:24.837719870 +1000 @@ -82,8 +82,6 @@ unsigned long decr_overclock_proc0 = 1; unsigned long decr_overclock_set = 0; unsigned long decr_overclock_proc0_set = 0; -int powersave_nap; - unsigned char aux_device_present; #ifdef CONFIG_MAGIC_SYSRQ _ From anton at samba.org Mon Sep 13 21:10:24 2004 From: anton at samba.org (Anton Blanchard) Date: Mon, 13 Sep 2004 21:10:24 +1000 Subject: [PATCH] [ppc64] Clean up asm/mmu.h In-Reply-To: <20040913110837.GD14553@krispykreme> References: <20040913105505.GA14553@krispykreme> <20040913105656.GB14553@krispykreme> <20040913110742.GC14553@krispykreme> <20040913110837.GD14553@krispykreme> Message-ID: 
<20040913111024.GE14553@krispykreme> Remove some old definitions that arent relevant to us. Signed-off-by: Anton Blanchard diff -puN include/asm-ppc64/mmu.h~rip_up_mmu_h include/asm-ppc64/mmu.h --- 2.6.9-rc1-mm5/include/asm-ppc64/mmu.h~rip_up_mmu_h 2004-09-13 19:51:29.477016885 +1000 +++ 2.6.9-rc1-mm5-anton/include/asm-ppc64/mmu.h 2004-09-13 19:51:29.499015194 +1000 @@ -107,18 +107,6 @@ typedef struct { extern HTAB htab_data; -void invalidate_hpte( unsigned long slot ); -long select_hpte_slot( unsigned long vpn ); -void create_valid_hpte( unsigned long slot, unsigned long vpn, - unsigned long prpn, unsigned hash, - void * ptep, unsigned hpteflags, - unsigned bolted ); - -#define PD_SHIFT (10+12) /* Page directory */ -#define PD_MASK 0x02FF -#define PT_SHIFT (12) /* Page Table */ -#define PT_MASK 0x02FF - #define LARGE_PAGE_SHIFT 24 static inline unsigned long hpt_hash(unsigned long vpn, int large) @@ -255,149 +243,4 @@ extern void htab_finish_init(void); srdi rx,rx,VSID_BITS; /* extract 2^36 bit */ \ add rt,rt,rx -/* Block size masks */ -#define BL_128K 0x000 -#define BL_256K 0x001 -#define BL_512K 0x003 -#define BL_1M 0x007 -#define BL_2M 0x00F -#define BL_4M 0x01F -#define BL_8M 0x03F -#define BL_16M 0x07F -#define BL_32M 0x0FF -#define BL_64M 0x1FF -#define BL_128M 0x3FF -#define BL_256M 0x7FF - -/* Used to set up SDR1 register */ -#define HASH_TABLE_SIZE_64K 0x00010000 -#define HASH_TABLE_SIZE_128K 0x00020000 -#define HASH_TABLE_SIZE_256K 0x00040000 -#define HASH_TABLE_SIZE_512K 0x00080000 -#define HASH_TABLE_SIZE_1M 0x00100000 -#define HASH_TABLE_SIZE_2M 0x00200000 -#define HASH_TABLE_SIZE_4M 0x00400000 -#define HASH_TABLE_MASK_64K 0x000 -#define HASH_TABLE_MASK_128K 0x001 -#define HASH_TABLE_MASK_256K 0x003 -#define HASH_TABLE_MASK_512K 0x007 -#define HASH_TABLE_MASK_1M 0x00F -#define HASH_TABLE_MASK_2M 0x01F -#define HASH_TABLE_MASK_4M 0x03F - -/* These are the Ks and Kp from the PowerPC books. For proper operation, - * Ks = 0, Kp = 1. 
- */ -#define MI_AP 786 -#define MI_Ks 0x80000000 /* Should not be set */ -#define MI_Kp 0x40000000 /* Should always be set */ - -/* The effective page number register. When read, contains the information - * about the last instruction TLB miss. When MI_RPN is written, bits in - * this register are used to create the TLB entry. - */ -#define MI_EPN 787 -#define MI_EPNMASK 0xfffff000 /* Effective page number for entry */ -#define MI_EVALID 0x00000200 /* Entry is valid */ -#define MI_ASIDMASK 0x0000000f /* ASID match value */ - /* Reset value is undefined */ - -/* A "level 1" or "segment" or whatever you want to call it register. - * For the instruction TLB, it contains bits that get loaded into the - * TLB entry when the MI_RPN is written. - */ -#define MI_TWC 789 -#define MI_APG 0x000001e0 /* Access protection group (0) */ -#define MI_GUARDED 0x00000010 /* Guarded storage */ -#define MI_PSMASK 0x0000000c /* Mask of page size bits */ -#define MI_PS8MEG 0x0000000c /* 8M page size */ -#define MI_PS512K 0x00000004 /* 512K page size */ -#define MI_PS4K_16K 0x00000000 /* 4K or 16K page size */ -#define MI_SVALID 0x00000001 /* Segment entry is valid */ - /* Reset value is undefined */ - -/* Real page number. Defined by the pte. Writing this register - * causes a TLB entry to be created for the instruction TLB, using - * additional information from the MI_EPN, and MI_TWC registers. - */ -#define MI_RPN 790 - -/* Define an RPN value for mapping kernel memory to large virtual - * pages for boot initialization. This has real page number of 0, - * large page size, shared page, cache enabled, and valid. - * Also mark all subpages valid and write access. 
- */ -#define MI_BOOTINIT 0x000001fd - -#define MD_CTR 792 /* Data TLB control register */ -#define MD_GPM 0x80000000 /* Set domain manager mode */ -#define MD_PPM 0x40000000 /* Set subpage protection */ -#define MD_CIDEF 0x20000000 /* Set cache inhibit when MMU dis */ -#define MD_WTDEF 0x10000000 /* Set writethrough when MMU dis */ -#define MD_RSV4I 0x08000000 /* Reserve 4 TLB entries */ -#define MD_TWAM 0x04000000 /* Use 4K page hardware assist */ -#define MD_PPCS 0x02000000 /* Use MI_RPN prob/priv state */ -#define MD_IDXMASK 0x00001f00 /* TLB index to be loaded */ -#define MD_RESETVAL 0x04000000 /* Value of register at reset */ - -#define M_CASID 793 /* Address space ID (context) to match */ -#define MC_ASIDMASK 0x0000000f /* Bits used for ASID value */ - - -/* These are the Ks and Kp from the PowerPC books. For proper operation, - * Ks = 0, Kp = 1. - */ -#define MD_AP 794 -#define MD_Ks 0x80000000 /* Should not be set */ -#define MD_Kp 0x40000000 /* Should always be set */ - -/* The effective page number register. When read, contains the information - * about the last instruction TLB miss. When MD_RPN is written, bits in - * this register are used to create the TLB entry. - */ -#define MD_EPN 795 -#define MD_EPNMASK 0xfffff000 /* Effective page number for entry */ -#define MD_EVALID 0x00000200 /* Entry is valid */ -#define MD_ASIDMASK 0x0000000f /* ASID match value */ - /* Reset value is undefined */ - -/* The pointer to the base address of the first level page table. - * During a software tablewalk, reading this register provides the address - * of the entry associated with MD_EPN. - */ -#define M_TWB 796 -#define M_L1TB 0xfffff000 /* Level 1 table base address */ -#define M_L1INDX 0x00000ffc /* Level 1 index, when read */ - /* Reset value is undefined */ - -/* A "level 1" or "segment" or whatever you want to call it register. - * For the data TLB, it contains bits that get loaded into the TLB entry - * when the MD_RPN is written. 
It is also provides the hardware assist - * for finding the PTE address during software tablewalk. - */ -#define MD_TWC 797 -#define MD_L2TB 0xfffff000 /* Level 2 table base address */ -#define MD_L2INDX 0xfffffe00 /* Level 2 index (*pte), when read */ -#define MD_APG 0x000001e0 /* Access protection group (0) */ -#define MD_GUARDED 0x00000010 /* Guarded storage */ -#define MD_PSMASK 0x0000000c /* Mask of page size bits */ -#define MD_PS8MEG 0x0000000c /* 8M page size */ -#define MD_PS512K 0x00000004 /* 512K page size */ -#define MD_PS4K_16K 0x00000000 /* 4K or 16K page size */ -#define MD_WT 0x00000002 /* Use writethrough page attribute */ -#define MD_SVALID 0x00000001 /* Segment entry is valid */ - /* Reset value is undefined */ - - -/* Real page number. Defined by the pte. Writing this register - * causes a TLB entry to be created for the data TLB, using - * additional information from the MD_EPN, and MD_TWC registers. - */ -#define MD_RPN 798 - -/* This is a temporary storage register that could be used to save - * a processor working register during a tablewalk. - */ -#define M_TW 799 - #endif /* _PPC64_MMU_H_ */ _ From anton at samba.org Mon Sep 13 21:08:37 2004 From: anton at samba.org (Anton Blanchard) Date: Mon, 13 Sep 2004 21:08:37 +1000 Subject: [PATCH] [ppc64] iseries build fixes In-Reply-To: <20040913110742.GC14553@krispykreme> References: <20040913105505.GA14553@krispykreme> <20040913105656.GB14553@krispykreme> <20040913110742.GC14553@krispykreme> Message-ID: <20040913110837.GD14553@krispykreme> Fix one compile warning and one build warning on iseries. 
Signed-off-by: Anton Blanchard

diff -puN arch/ppc64/mm/init.c~fix_iseries arch/ppc64/mm/init.c
--- 2.6.9-rc1-mm5/arch/ppc64/mm/init.c~fix_iseries	2004-09-13 19:51:27.220930624 +1000
+++ 2.6.9-rc1-mm5-anton/arch/ppc64/mm/init.c	2004-09-13 19:51:27.246928626 +1000
@@ -534,7 +534,9 @@ arch_initcall(mmu_context_init);
  */
 void __init mm_init_ppc64(void)
 {
+#ifndef CONFIG_PPC_ISERIES
 	unsigned long i;
+#endif
 
 	ppc64_boot_msg(0x100, "MM Init");
 
diff -puN include/asm-ppc64/page.h~fix_iseries include/asm-ppc64/page.h
--- 2.6.9-rc1-mm5/include/asm-ppc64/page.h~fix_iseries	2004-09-13 19:51:27.225930240 +1000
+++ 2.6.9-rc1-mm5-anton/include/asm-ppc64/page.h	2004-09-13 19:51:27.248928472 +1000
@@ -201,9 +201,9 @@ extern int page_is_ram(unsigned long pfn
 /* to change! */
 #define PAGE_OFFSET	ASM_CONST(0xC000000000000000)
 #define KERNELBASE	PAGE_OFFSET
-#define VMALLOCBASE	0xD000000000000000UL
-#define IOREGIONBASE	0xE000000000000000UL
-#define EEHREGIONBASE	0xA000000000000000UL
+#define VMALLOCBASE	ASM_CONST(0xD000000000000000)
+#define IOREGIONBASE	ASM_CONST(0xE000000000000000)
+#define EEHREGIONBASE	ASM_CONST(0xA000000000000000)
 
 #define IO_REGION_ID	(IOREGIONBASE>>REGION_SHIFT)
 #define EEH_REGION_ID	(EEHREGIONBASE>>REGION_SHIFT)
_

From anton at samba.org Mon Sep 13 21:11:27 2004
From: anton at samba.org (Anton Blanchard)
Date: Mon, 13 Sep 2004 21:11:27 +1000
Subject: [PATCH] [ppc64] Fix pseries build in -mm
In-Reply-To: <20040913111024.GE14553@krispykreme>
References: <20040913105505.GA14553@krispykreme>
	<20040913105656.GB14553@krispykreme>
	<20040913110742.GC14553@krispykreme>
	<20040913110837.GD14553@krispykreme>
	<20040913111024.GE14553@krispykreme>
Message-ID: <20040913111127.GF14553@krispykreme>

Looks like a list macro cleanup patch went in, resulting in two definitions
of *dev. Remove one.
Signed-off-by: Anton Blanchard

diff -puN arch/ppc64/kernel/pSeries_pci.c~fix_pseries arch/ppc64/kernel/pSeries_pci.c
--- 2.6.9-rc1-mm5/arch/ppc64/kernel/pSeries_pci.c~fix_pseries	2004-09-13 19:58:29.941874428 +1000
+++ 2.6.9-rc1-mm5-anton/arch/ppc64/kernel/pSeries_pci.c	2004-09-13 19:59:21.967773089 +1000
@@ -601,7 +601,6 @@ EXPORT_SYMBOL(pcibios_fixup_device_resou
 void __devinit pcibios_fixup_bus(struct pci_bus *bus)
 {
 	struct pci_controller *hose = PCI_GET_PHB_PTR(bus);
-	struct pci_dev *dev;
 
 	/* XXX or bus->parent? */
 	struct pci_dev *dev = bus->self;
_

From anton at samba.org Mon Sep 13 21:07:42 2004
From: anton at samba.org (Anton Blanchard)
Date: Mon, 13 Sep 2004 21:07:42 +1000
Subject: [PATCH] [ppc64] replace mmu_context_queue with idr allocator
In-Reply-To: <20040913105656.GB14553@krispykreme>
References: <20040913105505.GA14553@krispykreme>
	<20040913105656.GB14553@krispykreme>
Message-ID: <20040913110742.GC14553@krispykreme>

Replace the mmu_context_queue structure with the idr allocator. The
mmu_context_queue allocation was quite large (~200kB) so on most machines
we will have a reduction in usage.

We might put a single entry cache on the front of this so we are more
likely to reuse ppc64 MMU hashtable entries that are in the caches.
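As a rough illustration of the allocation behaviour this change relies on (hand out a free context number, keep 0 reserved as NO_CONTEXT, give the id back on destroy), here is a userspace sketch. It is not the kernel idr API: struct ctx_alloc and the ctx_alloc_* helpers are hypothetical, a fixed bitmap stands in for the idr tree, locking is left out, and MAX_CONTEXT is shrunk for the example. That the lowest free id is returned first is an assumption of the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NO_CONTEXT	0
#define MAX_CONTEXT	1023	/* illustration only; the patch uses 0x100000-1 */

/* Hypothetical stand-in for the idr: one bit per context id. */
struct ctx_alloc {
	uint64_t bitmap[(MAX_CONTEXT + 1 + 63) / 64];
};

static void ctx_alloc_init(struct ctx_alloc *a)
{
	memset(a, 0, sizeof(*a));
	/* Reserve id 0 so it can keep meaning "no context". */
	a->bitmap[0] |= 1ULL;
}

/* Return the lowest free id, or -1 if every id up to MAX_CONTEXT is taken. */
static int ctx_alloc_get(struct ctx_alloc *a)
{
	for (int id = 0; id <= MAX_CONTEXT; id++) {
		uint64_t bit = 1ULL << (id % 64);

		if (!(a->bitmap[id / 64] & bit)) {
			a->bitmap[id / 64] |= bit;
			return id;
		}
	}
	return -1;
}

/* Release an id so a later ctx_alloc_get() can hand it out again. */
static void ctx_alloc_put(struct ctx_alloc *a, int id)
{
	assert(id > NO_CONTEXT && id <= MAX_CONTEXT);
	a->bitmap[id / 64] &= ~(1ULL << (id % 64));
}
```

A fixed bitmap is used here only to keep the sketch short; unlike the idr it does not grow on demand, so it says nothing about the memory saving mentioned above.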
Signed-off-by: Anton Blanchard diff -puN arch/ppc64/mm/init.c~context_queue arch/ppc64/mm/init.c --- 2.6.9-rc1-mm5/arch/ppc64/mm/init.c~context_queue 2004-09-13 19:51:26.130749817 +1000 +++ 2.6.9-rc1-mm5-anton/arch/ppc64/mm/init.c 2004-09-13 19:51:26.164747203 +1000 @@ -36,6 +36,7 @@ #include #include #include +#include #include #include @@ -62,8 +63,6 @@ #include #include - -struct mmu_context_queue_t mmu_context_queue; int mem_init_done; unsigned long ioremap_bot = IMALLOC_BASE; static unsigned long phbs_io_bot = PHBS_IO_BASE; @@ -477,6 +476,59 @@ void free_initrd_mem(unsigned long start } #endif +static spinlock_t mmu_context_lock = SPIN_LOCK_UNLOCKED; +static DEFINE_IDR(mmu_context_idr); + +int init_new_context(struct task_struct *tsk, struct mm_struct *mm) +{ + int index; + int err; + +again: + if (!idr_pre_get(&mmu_context_idr, GFP_KERNEL)) + return -ENOMEM; + + spin_lock(&mmu_context_lock); + err = idr_get_new(&mmu_context_idr, NULL, &index); + spin_unlock(&mmu_context_lock); + + if (err == -EAGAIN) + goto again; + else if (err) + return err; + + if (index > MAX_CONTEXT) { + idr_remove(&mmu_context_idr, index); + return -ENOMEM; + } + + mm->context.id = index; + + return 0; +} + +void destroy_context(struct mm_struct *mm) +{ + spin_lock(&mmu_context_lock); + idr_remove(&mmu_context_idr, mm->context.id); + spin_unlock(&mmu_context_lock); + + mm->context.id = NO_CONTEXT; +} + +static int __init mmu_context_init(void) +{ + int index; + + /* Reserve the first (invalid) context*/ + idr_pre_get(&mmu_context_idr, GFP_KERNEL); + idr_get_new(&mmu_context_idr, NULL, &index); + BUG_ON(0 != index); + + return 0; +} +arch_initcall(mmu_context_init); + /* * Do very early mm setup. */ @@ -486,17 +538,6 @@ void __init mm_init_ppc64(void) ppc64_boot_msg(0x100, "MM Init"); - /* Reserve all contexts < FIRST_USER_CONTEXT for kernel use. - * The range of contexts [FIRST_USER_CONTEXT, NUM_USER_CONTEXT) - * are stored on a stack/queue for easy allocation and deallocation. 
- */ - mmu_context_queue.lock = SPIN_LOCK_UNLOCKED; - mmu_context_queue.head = 0; - mmu_context_queue.tail = NUM_USER_CONTEXT-1; - mmu_context_queue.size = NUM_USER_CONTEXT; - for (i = 0; i < NUM_USER_CONTEXT; i++) - mmu_context_queue.elements[i] = i + FIRST_USER_CONTEXT; - /* This is the story of the IO hole... please, keep seated, * unfortunately, we are out of oxygen masks at the moment. * So we need some rough way to tell where your big IO hole diff -puN include/asm-ppc64/mmu_context.h~context_queue include/asm-ppc64/mmu_context.h --- 2.6.9-rc1-mm5/include/asm-ppc64/mmu_context.h~context_queue 2004-09-13 19:51:26.142748894 +1000 +++ 2.6.9-rc1-mm5-anton/include/asm-ppc64/mmu_context.h 2004-09-13 19:51:26.168746896 +1000 @@ -2,11 +2,9 @@ #define __PPC64_MMU_CONTEXT_H #include -#include #include #include #include -#include #include /* @@ -33,107 +31,15 @@ static inline int sched_find_first_bit(u return __ffs(b[2]) + 128; } -#define NO_CONTEXT 0 -#define FIRST_USER_CONTEXT 1 -#define LAST_USER_CONTEXT 0x8000 /* Same as PID_MAX for now... */ -#define NUM_USER_CONTEXT (LAST_USER_CONTEXT-FIRST_USER_CONTEXT) - -/* Choose whether we want to implement our context - * number allocator as a LIFO or FIFO queue. - */ -#if 1 -#define MMU_CONTEXT_LIFO -#else -#define MMU_CONTEXT_FIFO -#endif - -struct mmu_context_queue_t { - spinlock_t lock; - long head; - long tail; - long size; - mm_context_id_t elements[LAST_USER_CONTEXT]; -}; - -extern struct mmu_context_queue_t mmu_context_queue; - -static inline void -enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) +static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) { } -/* - * The context number queue has underflowed. - * Meaning: we tried to push a context number that was freed - * back onto the context queue and the queue was already full. 
- */ -static inline void -mmu_context_underflow(void) -{ - printk(KERN_DEBUG "mmu_context_underflow\n"); - panic("mmu_context_underflow"); -} - -/* - * Set up the context for a new address space. - */ -static inline int -init_new_context(struct task_struct *tsk, struct mm_struct *mm) -{ - long head; - unsigned long flags; - /* This does the right thing across a fork (I hope) */ - - spin_lock_irqsave(&mmu_context_queue.lock, flags); - - if (mmu_context_queue.size <= 0) { - spin_unlock_irqrestore(&mmu_context_queue.lock, flags); - return -ENOMEM; - } +#define NO_CONTEXT 0 +#define MAX_CONTEXT (0x100000-1) - head = mmu_context_queue.head; - mm->context.id = mmu_context_queue.elements[head]; - - head = (head < LAST_USER_CONTEXT-1) ? head+1 : 0; - mmu_context_queue.head = head; - mmu_context_queue.size--; - - spin_unlock_irqrestore(&mmu_context_queue.lock, flags); - - return 0; -} - -/* - * We're finished using the context for an address space. - */ -static inline void -destroy_context(struct mm_struct *mm) -{ - long index; - unsigned long flags; - - spin_lock_irqsave(&mmu_context_queue.lock, flags); - - if (mmu_context_queue.size >= NUM_USER_CONTEXT) { - spin_unlock_irqrestore(&mmu_context_queue.lock, flags); - mmu_context_underflow(); - } - -#ifdef MMU_CONTEXT_LIFO - index = mmu_context_queue.head; - index = (index > 0) ? index-1 : LAST_USER_CONTEXT-1; - mmu_context_queue.head = index; -#else - index = mmu_context_queue.tail; - index = (index < LAST_USER_CONTEXT-1) ? 
index+1 : 0; - mmu_context_queue.tail = index; -#endif - - mmu_context_queue.size++; - mmu_context_queue.elements[index] = mm->context.id; - - spin_unlock_irqrestore(&mmu_context_queue.lock, flags); -} +extern int init_new_context(struct task_struct *tsk, struct mm_struct *mm); +extern void destroy_context(struct mm_struct *mm); extern void switch_stab(struct task_struct *tsk, struct mm_struct *mm); extern void switch_slb(struct task_struct *tsk, struct mm_struct *mm); _ From jschopp at austin.ibm.com Tue Sep 14 01:56:54 2004 From: jschopp at austin.ibm.com (Joel Schopp) Date: Mon, 13 Sep 2004 10:56:54 -0500 Subject: [Fwd: [RFC][PATCH] flush_hash_page] Message-ID: <4145C346.6090102@austin.ibm.com> Resending since list was down last time I sent it. --------------------------------------------------- I'm very new to the ppc64 memory management code, please forgive my ignorance. I need to be able to use flush_hash_page on arbitrary ptes for memory remove. There is a comment in flush_hash_page about not supporting large ptes. It looks like most of that work has already been done, and all that is needed is the following patch. Am I missing something? -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: largepte.patch Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040913/fc2227de/attachment.txt From anton at samba.org Tue Sep 14 02:27:46 2004 From: anton at samba.org (Anton Blanchard) Date: Tue, 14 Sep 2004 02:27:46 +1000 Subject: [PATCH] [ppc64] Restore smt-enabled=off kernel command line option In-Reply-To: <1095091735.12145.72.camel@biclops.private.network> References: <20040910121238.GG24408@krispykreme> <20040910121456.GH24408@krispykreme> <20040910121941.GI24408@krispykreme> <20040910122337.GJ24408@krispykreme> <20040910122904.GK24408@krispykreme> <1095091735.12145.72.camel@biclops.private.network> Message-ID: <20040913162746.GF12514@krispykreme> > Whoops, sorry, didn't mean to break smt-enabled=off. 
No problem, it gave me a chance to clean up some more code :)

> As you mentioned, we do not have any code which updates the present map
> when a cpu is added or removed. What we really need is cpu-specific
> hooks in of_add_node/of_remove_node which will take care of that. I'll
> send along a patch for that shortly.

Ahh yes that makes sense. Can we initially set it up so if we boot ST or
with maxcpus= that it's still possible to hotplug add the threads or cpus?

Anton

From linas at austin.ibm.com Tue Sep 14 03:09:46 2004
From: linas at austin.ibm.com (Linas Vepstas)
Date: Mon, 13 Sep 2004 12:09:46 -0500
Subject: vDSO preliminary implementation
In-Reply-To: <1094863417.2667.152.camel@gaston>
References: <1093496594.2172.80.camel@gaston>
	<20040910211158.GX9645@austin.ibm.com>
	<1094863417.2667.152.camel@gaston>
Message-ID: <20040913170946.GB9645@austin.ibm.com>

On Sat, Sep 11, 2004 at 10:43:38AM +1000, Benjamin Herrenschmidt was heard to remark:
>
> >
> > > Here's a first shot at implementing a vDSO for ppc32/ppc64. This is definitely
> >
> > What's vDSO ? Google was amazingly unhelpful in figuring this out.
>
> virtual .so, that is a library mapped by the kernel in userspace

Let me re-phrase that: what's it good for?

Is this a mechanism for sharing the text segment of a library between
all users? Ye olde AIX had this feature; I've never thought about
whether Linux does this or not; shared libs were loaded so that the
text segment of a library appeared only once in 'real' memory, and
was thus shared by the various apps. I'm not sure, I think in AIX
even the "ptes" were shared too: the text was always loaded into the
same segment (segment 0 iirc), so you wouldn't have tlb misses on
things like libc.

I've never thought about how Linux loads libraries, so excuse me on
this newbie-sounding question. How does Linux load .so's today?
Is there one copy per process, or are they shared?
--linas

From olof at austin.ibm.com Tue Sep 14 03:20:08 2004
From: olof at austin.ibm.com (Olof Johansson)
Date: Mon, 13 Sep 2004 12:20:08 -0500
Subject: [Fwd: [RFC][PATCH] flush_hash_page]
In-Reply-To: <4145C346.6090102@austin.ibm.com>
References: <4145C346.6090102@austin.ibm.com>
Message-ID: <4145D6C8.8000702@austin.ibm.com>

Joel Schopp wrote:
> Resending since list was down last time I sent it.
> ---------------------------------------------------
>
> I'm very new to the ppc64 memory management code, please forgive my
> ignorance. I need to be able to use flush_hash_page on arbitrary ptes
> for memory remove. There is a comment in flush_hash_page about not
> supporting large ptes. It looks like most of that work has already been
> done, and all that is needed is the following patch. Am I missing
> something?

I think you might be missing something. You just changed the function to
assume that all pages are large pages if the CPU supports them. This is
true for kernel pages, but not for user ones. As a result, the wrong hash
function will/might be used.

You need to know if the page you're looking to flush is large or not.
Right now there's no way to pass that down, thus the comment in the
function.

-Olof

From haveblue at us.ibm.com Tue Sep 14 03:08:29 2004
From: haveblue at us.ibm.com (Dave Hansen)
Date: Mon, 13 Sep 2004 10:08:29 -0700
Subject: [Fwd: [RFC][PATCH] flush_hash_page]
In-Reply-To: <4145C346.6090102@austin.ibm.com>
References: <4145C346.6090102@austin.ibm.com>
Message-ID: <1095095309.3422.7.camel@localhost>

On Mon, 2004-09-13 at 08:56, Joel Schopp wrote:
> Resending since list was down last time I sent it.
> ---------------------------------------------------
>
> I'm very new to the ppc64 memory management code, please forgive my
> ignorance. I need to be able to use flush_hash_page on arbitrary ptes
> for memory remove. There is a comment in flush_hash_page about not
> supporting large ptes. It looks like most of that work has already been
> done, and all that is needed is the following patch. Am I missing
> something?
> ________________________________________________________________________
> + if (cur_cpu_spec->cpu_features & CPU_FTR_16M_PAGE)
> + large = 1;
> +

First of all, if you do that, doesn't it assume that all ptes that are
passed in are large pages? I don't think that's correct.

Also, think about how huge pages are implemented in Linux. Do huge
pages really even get Linux ptes, or just pmds that act like ptes?

That reminds me. Anton, I don't see ppc64 setting up the Linux
pagetable for the kernel mappings anywhere. Did I just miss them?

-- Dave

From linas at austin.ibm.com Tue Sep 14 06:05:39 2004
From: linas at austin.ibm.com (Linas Vepstas)
Date: Mon, 13 Sep 2004 15:05:39 -0500
Subject: Resending: [PATCH] PPC64: New version of EEH notifier code
Message-ID: <20040913200539.GD9645@austin.ibm.com>

Resending:
----- The following addresses had permanent fatal errors -----

Paul,

I picked up the eeh notifier call-chain patch from
http://ozlabs.org/ppc64-patches/ patch 239, I believe.
Because it doesn't apply cleanly any more, I whacked on it a bit
to get it to apply; the result is below. I'd suggest sending this
upstream, as soon as reasonable. It's not 'perfect', but it does
provide a convenient base to do further work from.
--linas Signed-off-by: Linas Vepstas ===== arch/ppc64/kernel/eeh.c 1.30 vs edited ===== --- 1.30/arch/ppc64/kernel/eeh.c Thu Sep 2 15:22:27 2004 +++ edited/arch/ppc64/kernel/eeh.c Thu Sep 2 16:06:58 2004 @@ -17,29 +17,79 @@ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ +#include #include +#include +#include +#include #include #include -#include -#include #include -#include #include -#include -#include -#include +#include +#include #include #include -#include #include +#include #include "pci.h" #undef DEBUG +/** Overview: + * EEH, or "Extended Error Handling" is a PCI bridge technology for + * dealing with PCI bus errors that can't be dealt with within the + * usual PCI framework, except by check-stopping the CPU. Systems + * that are designed for high-availability/reliability cannot afford + * to crash due to a "mere" PCI error, thus the need for EEH. + * An EEH-capable bridge operates by converting a detected error + * into a "slot freeze", taking the PCI adapter off-line, making + * the slot behave, from the OS'es point of view, as if the slot + * were "empty": all reads return 0xff's and all writes are silently + * ignored. EEH slot isolation events can be triggered by parity + * errors on the address or data busses (e.g. during posted writes), + * which in turn might be caused by dust, vibration, humidity, + * radioactivity or plain-old failed hardware. + * + * Note, however, that one of the leading causes of EEH slot + * freeze events are buggy device drivers, buggy device microcode, + * or buggy device hardware. This is because any attempt by the + * device to bus-master data to a memory address that is not + * assigned to the device will trigger a slot freeze. (The idea + * is to prevent devices-gone-wild from corrupting system memory). + * Buggy hardware/drivers will have a miserable time co-existing + * with EEH. + * + * Ideally, a PCI device driver, when suspecting that an isolation + * event has occured (e.g. 
by reading 0xff's), will then ask EEH + * whether this is the case, and then take appropriate steps to + * reset the PCI slot, the PCI device, and then resume operations. + * However, until that day, the checking is done here, with the + * eeh_check_failure() routine embedded in the MMIO macros. If + * the slot is found to be isolated, an "EEH Event" is synthesized + * and sent out for processing. + */ + +/** Bus Unit ID macros; get low and hi 32-bits of the 64-bit BUID */ #define BUID_HI(buid) ((buid) >> 32) #define BUID_LO(buid) ((buid) & 0xffffffff) -#define CONFIG_ADDR(busno, devfn) \ - (((((busno) & 0xff) << 8) | ((devfn) & 0xf8)) << 8) + +/* EEH event workqueue setup. */ +static spinlock_t eeh_eventlist_lock = SPIN_LOCK_UNLOCKED; +LIST_HEAD(eeh_eventlist); +static void eeh_event_handler(void *); +DECLARE_WORK(eeh_event_wq, eeh_event_handler, NULL); + +static struct notifier_block *eeh_notifier_chain; + +/* + * If a device driver keeps reading an MMIO register in an interrupt + * handler after a slot isolation event has occurred, we assume it + * is broken and panic. This sets the threshold for how many read + * attempts we allow before panicking. + */ +#define EEH_MAX_FAILS 1000 +static atomic_t eeh_fail_count; /* RTAS tokens */ static int ibm_set_eeh_option; @@ -61,6 +111,7 @@ static DEFINE_PER_CPU(unsigned long, total_mmio_ffs); static DEFINE_PER_CPU(unsigned long, false_positives); static DEFINE_PER_CPU(unsigned long, ignored_failures); +static DEFINE_PER_CPU(unsigned long, slot_resets); static int eeh_check_opts_config(struct device_node *dn, int class_code, int vendor_id, int device_id, @@ -71,7 +122,8 @@ * PCI device address resources into a red-black tree, sorted * according to the address range, so that given only an i/o * address, the corresponding PCI device can be **quickly** - * found. + * found. It is safe to perform an address lookup in an interrupt + * context; this ability is an important feature. 
* * Currently, the only customer of this code is the EEH subsystem; * thus, this code has been somewhat tailored to suit EEH better. @@ -340,6 +392,94 @@ #endif } +/* --------------------------------------------------------------- */ +/* Above lies the PCI Address Cache. Below lies the EEH event infrastructure */ + +/** + * eeh_register_notifier - Register to find out about EEH events. + * @nb: notifier block to callback on events + */ +int eeh_register_notifier(struct notifier_block *nb) +{ + return notifier_chain_register(&eeh_notifier_chain, nb); +} + +/** + * eeh_unregister_notifier - Unregister to an EEH event notifier. + * @nb: notifier block to callback on events + */ +int eeh_unregister_notifier(struct notifier_block *nb) +{ + return notifier_chain_unregister(&eeh_notifier_chain, nb); +} + +/** + * eeh_panic - call panic() for an eeh event that cannot be handled. + * The philosophy of this routine is that it is better to panic and + * halt the OS than it is to risk possible data corruption by + * oblivious device drivers that don't know better. + * + * @dev pci device that had an eeh event + * @reset_state current reset state of the device slot + */ +static void eeh_panic(struct pci_dev *dev, int reset_state) +{ + /* + * XXX We should create a seperate sysctl for this. + * + * Since the panic_on_oops sysctl is used to halt the system + * in light of potential corruption, we can use it here. + */ + if (panic_on_oops) + panic("EEH: MMIO failure (%d) on device:%s %s\n", reset_state, + pci_name(dev), pci_pretty_name(dev)); + else { + __get_cpu_var(ignored_failures)++; + printk(KERN_INFO "EEH: Ignored MMIO failure (%d) on device:%s %s\n", + reset_state, pci_name(dev), pci_pretty_name(dev)); + } +} + +/** + * eeh_event_handler - dispatch EEH events. The detection of a frozen + * slot can occur inside an interrupt, where it can be hard to do + * anything about it. 
The goal of this routine is to pull these + * detection events out of the context of the interrupt handler, and + * re-dispatch them for processing at a later time in a normal context. + * + * @dummy - unused + */ +static void eeh_event_handler(void *dummy) +{ + unsigned long flags; + struct eeh_event *event; + + while (1) { + spin_lock_irqsave(&eeh_eventlist_lock, flags); + event = NULL; + if (!list_empty(&eeh_eventlist)) { + event = list_entry(eeh_eventlist.next, struct eeh_event, list); + list_del(&event->list); + } + spin_unlock_irqrestore(&eeh_eventlist_lock, flags); + if (event == NULL) + break; + + printk(KERN_INFO "EEH: MMIO failure (%d), notifying device " + "%s %s\n", event->reset_state, + pci_name(event->dev), pci_pretty_name(event->dev)); + + atomic_set(&eeh_fail_count, 0); + notifier_call_chain (&eeh_notifier_chain, + EEH_NOTIFY_FREEZE, event); + + __get_cpu_var(slot_resets)++; + + pci_dev_put(event->dev); + kfree(event); + } +} + /** * eeh_token_to_phys - convert EEH address token to phys address * @token i/o token, should be address in the form 0xA.... @@ -371,11 +511,11 @@ * * Check for an EEH failure for the given device node. Call this * routine if the result of a read was all 0xff's and you want to - * find out if this is due to an EEH slot freeze event. This routine + * find out if this is due to an EEH slot freeze. This routine * will query firmware for the EEH status. * * Returns 0 if there has not been an EEH error; otherwise returns - * an error code. + * a non-zero value and queues up a slot isolation event notification. * * It is safe to call this routine in an interrupt context. */ @@ -384,6 +524,8 @@ int ret; int rets[2]; unsigned long flags; + int rc, reset_state; + struct eeh_event *event; __get_cpu_var(total_mmio_ffs)++; @@ -402,6 +544,24 @@ if (!dn->eeh_config_addr) { return 0; } + + /* + * If we already have a pending isolation event for this + * slot, we know it's bad already, we don't need to check...
+ */ + if (dn->eeh_mode & EEH_MODE_ISOLATED) { + atomic_inc(&eeh_fail_count); + if (atomic_read(&eeh_fail_count) >= EEH_MAX_FAILS) { + /* re-read the slot reset state */ + rets[0] = -1; + rtas_call(ibm_read_slot_reset_state, 3, 3, rets, + dn->eeh_config_addr, + BUID_HI(dn->phb->buid), + BUID_LO(dn->phb->buid)); + eeh_panic(dev, rets[0]); + } + return 0; + } /* * Now test for an EEH failure. This is VERY expensive. @@ -414,45 +574,52 @@ dn->eeh_config_addr, BUID_HI(dn->phb->buid), BUID_LO(dn->phb->buid)); - if (ret == 0 && rets[1] == 1 && rets[0] >= 2) { - int log_event; - - spin_lock_irqsave(&slot_errbuf_lock, flags); - memset(slot_errbuf, 0, eeh_error_buf_size); - - log_event = rtas_call(ibm_slot_error_detail, - 8, 1, NULL, dn->eeh_config_addr, - BUID_HI(dn->phb->buid), - BUID_LO(dn->phb->buid), NULL, 0, - virt_to_phys(slot_errbuf), - eeh_error_buf_size, - 1 /* Temporary Error */); - - if (log_event == 0) - log_error(slot_errbuf, ERR_TYPE_RTAS_LOG, - 1 /* Fatal */); - - spin_unlock_irqrestore(&slot_errbuf_lock, flags); - - /* - * XXX We should create a separate sysctl for this. - * - * Since the panic_on_oops sysctl is used to halt - * the system in light of potential corruption, we - * can use it here. 
- */ - if (panic_on_oops) { - panic("EEH: MMIO failure (%d) on device:%s %s\n", - rets[0], dn->name, dn->full_name); - } else { - __get_cpu_var(ignored_failures)++; - printk(KERN_INFO "EEH: MMIO failure (%d) on device:%s %s\n", - rets[0], dn->name, dn->full_name); - } - } else { + if (!(ret == 0 && rets[1] == 1 && rets[0] >= 2)) { __get_cpu_var(false_positives)++; + return 0; } + /* prevent repeated reports of this failure */ + dn->eeh_mode |= EEH_MODE_ISOLATED; + + reset_state = rets[0]; + + spin_lock_irqsave(&slot_errbuf_lock, flags); + memset(slot_errbuf, 0, eeh_error_buf_size); + + rc = rtas_call(ibm_slot_error_detail, + 8, 1, NULL, dn->eeh_config_addr, + BUID_HI(dn->phb->buid), + BUID_LO(dn->phb->buid), NULL, 0, + virt_to_phys(slot_errbuf), + eeh_error_buf_size, + 1 /* Temporary Error */); + + if (rc == 0) + log_error(slot_errbuf, ERR_TYPE_RTAS_LOG, 0); + spin_unlock_irqrestore(&slot_errbuf_lock, flags); + + event = kmalloc(sizeof(*event), GFP_ATOMIC); + if (event == NULL) { + eeh_panic(dev, reset_state); + return 1; + } + + event->dev = dev; + event->dn = dn; + event->reset_state = reset_state; + + /* We may or may not be called in an interrupt context */ + spin_lock_irqsave(&eeh_eventlist_lock, flags); + list_add(&event->list, &eeh_eventlist); + spin_unlock_irqrestore(&eeh_eventlist_lock, flags); + + /* Most EEH events are due to device driver bugs. Having + * a stack trace will help the device-driver authors figure + * out what happened. So print that out. 
*/ + dump_stack(); + schedule_work(&eeh_event_wq); + return 0; } @@ -768,11 +935,13 @@ { unsigned int cpu; unsigned long ffs = 0, positives = 0, failures = 0; + unsigned long resets = 0; for_each_cpu(cpu) { ffs += per_cpu(total_mmio_ffs, cpu); positives += per_cpu(false_positives, cpu); failures += per_cpu(ignored_failures, cpu); + resets += per_cpu(slot_resets, cpu); } if (0 == eeh_subsystem_enabled) { @@ -782,8 +951,11 @@ seq_printf(m, "EEH Subsystem is enabled\n"); seq_printf(m, "eeh_total_mmio_ffs=%ld\n" "eeh_false_positives=%ld\n" - "eeh_ignored_failures=%ld\n", - ffs, positives, failures); + "eeh_ignored_failures=%ld\n" + "eeh_slot_resets=%ld\n" + "eeh_fail_count=%d\n", + ffs, positives, failures, resets, + eeh_fail_count.counter); } return 0; ===== include/asm-ppc64/eeh.h 1.15 vs edited ===== --- 1.15/include/asm-ppc64/eeh.h Thu Sep 2 15:22:27 2004 +++ edited/include/asm-ppc64/eeh.h Thu Sep 2 15:38:32 2004 @@ -20,8 +20,10 @@ #ifndef _PPC64_EEH_H #define _PPC64_EEH_H -#include #include +#include +#include +#include struct pci_dev; struct device_node; @@ -41,6 +43,7 @@ /* Values for eeh_mode bits in device_node */ #define EEH_MODE_SUPPORTED (1<<0) #define EEH_MODE_NOCHECK (1<<1) +#define EEH_MODE_ISOLATED (1<<2) extern void __init eeh_init(void); unsigned long eeh_check_failure(void *token, unsigned long val); @@ -76,7 +79,28 @@ #define EEH_RELEASE_DMA 3 int eeh_set_option(struct pci_dev *dev, int options); -/* + +/** + * Notifier event flags. + */ +#define EEH_NOTIFY_FREEZE 1 + +/** EEH event -- structure holding pci slot data that describes + * a change in the isolation status of a PCI slot. A pointer + * to this struct is passed as the data pointer in a notify callback. + */ +struct eeh_event { + struct list_head list; + struct pci_dev *dev; + struct device_node *dn; + int reset_state; +}; + +/** Register to find out about EEH events. 
*/ +int eeh_register_notifier(struct notifier_block *nb); +int eeh_unregister_notifier(struct notifier_block *nb); + +/** * EEH_POSSIBLE_ERROR() -- test for possible MMIO failure. * * Order this macro for performance. From jschopp at austin.ibm.com Tue Sep 14 06:10:03 2004 From: jschopp at austin.ibm.com (Joel Schopp) Date: Mon, 13 Sep 2004 15:10:03 -0500 Subject: [Fwd: [RFC][PATCH] flush_hash_page] In-Reply-To: <1095095309.3422.7.camel@localhost> References: <4145C346.6090102@austin.ibm.com> <1095095309.3422.7.camel@localhost> Message-ID: <4145FE9B.5090801@austin.ibm.com> htab_initialize calls create_pte_mapping on every lmb, with large set to true if it is supported. create_pte_mapping calls pSeries_lpar_hpte_insert, which calls H_ENTER, which creates a hardware page table entry. This leads me to believe that all physical memory gets initialized with large ptes. I'm sure I'm wrong, but I just don't see the rest of the code. >>+ if (cur_cpu_spec->cpu_features & CPU_FTR_16M_PAGE) >>+ large = 1; >>+ > > > First of all, if you do that, doesn't it assume that all ptes that are > passed in are large pages? I don't think that's correct. I have a hard time with that assumption myself, I was hoping by proposing the wrong answer somebody would be able to help me with the right one. > > Also, think about how huge pages are implemented in Linux. Do huge > pages really even get Linux ptes, or just pmds that act like ptes? We need to be clear in our language to differ between hardware page table entries and Linux page table entries. > > That reminds me. Anton, I don't see ppc64 setting up the Linux > pagetable for the kernel mappings anywhere. Did I just miss them? > > -- Dave > > From linas at austin.ibm.com Tue Sep 14 06:57:49 2004 From: linas at austin.ibm.com (Linas Vepstas) Date: Mon, 13 Sep 2004 15:57:49 -0500 Subject: [PATCH 1/1] ppc64: Block config accesses during BIST Message-ID: <20040913205749.GE9645@austin.ibm.com> Forwarding ... 
Brian sent this patch while the list was down. The problem that
spurs this patch was discussed a number of times on this mailing list.
I like this patch; it seems to solve the problem with a minimum of
fuss.

I suspect this patch doesn't apply cleanly after other recent
changes.

Torvalds suggests using "Pirated-by:" when forwarding a patch such as this:
http://www.ussg.iu.edu/hypermail/linux/kernel/0405.3/0226.html

Signed-off-by: Linas Vepstas

--linas

----- Forwarded message from brking at us.ibm.com -----

Subject: [PATCH 1/1] ppc64: Block config accesses during BIST

Some PCI adapters on pSeries and iSeries hardware (ipr scsi adapters)
have an exposure today in that they issue BIST to the adapter to reset
the card. If, during the time it takes to complete BIST, userspace attempts
to access PCI config space, the host bus bridge will master abort the
access since the ipr adapter does not respond on the PCI bus for a brief
period of time when running BIST. This master abort results in the host
PCI bridge isolating that PCI device from the rest of the system, making
the device unusable until Linux is rebooted. This patch is an attempt to
close that exposure by introducing some blocking code in the arch-specific
PCI code. The intent is to have the ipr device driver invoke these routines
to prevent userspace PCI accesses from occurring during this window. It has
been tested by running BIST on an ipr adapter while running a script which
looped reading the config space of that adapter through sysfs. Without the
patch, an EEH error occurs. With the patch there is no EEH error. Tested on
Power 5 and iSeries Power 4.
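
To illustrate the intended semantics, here is a minimal userspace sketch, not the kernel code itself. It assumes a made-up struct model_node with a fake config space and hypothetical model_* helpers; none of these names exist in the patch. It only models the rule described above: while blocked, reads succeed with all 1's data and writes are silently dropped but reported as successful. The per-node spinlock that the real routines take is left out here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a device node; NOT the kernel's struct device_node. */
struct model_node {
	bool cfgio_blocked;	/* plays the role of the OF_NO_CFGIO flag */
	uint32_t cfg[64];	/* fake config space, one slot per register */
};

/* Called by a driver just before it starts BIST. */
void model_block_config_io(struct model_node *n)
{
	n->cfgio_blocked = true;
}

/* Called by a driver once BIST has completed. */
void model_unblock_config_io(struct model_node *n)
{
	n->cfgio_blocked = false;
}

/* While blocked, a read returns success with all 1's data. */
int model_read_config(struct model_node *n, unsigned int where, uint32_t *val)
{
	if (n->cfgio_blocked)
		*val = 0xffffffffu;
	else
		*val = n->cfg[where % 64];
	return 0;
}

/* While blocked, a write is ignored but still reported as successful. */
int model_write_config(struct model_node *n, unsigned int where, uint32_t val)
{
	if (!n->cfgio_blocked)
		n->cfg[where % 64] = val;
	return 0;
}
```

A driver following this pattern would block config I/O, start BIST, wait for it to finish, and then unblock; anything poking at config space in the meantime sees all 1's instead of triggering a master abort.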
Signed-off-by: Brian King --- linux-2.6.9-rc1-bk8-bjking1/arch/ppc64/kernel/iSeries_pci.c | 127 +++++++++- linux-2.6.9-rc1-bk8-bjking1/arch/ppc64/kernel/pSeries_pci.c | 103 +++++++- linux-2.6.9-rc1-bk8-bjking1/arch/ppc64/kernel/prom.c | 1 linux-2.6.9-rc1-bk8-bjking1/include/asm-ppc64/iSeries/iSeries_pci.h | 2 linux-2.6.9-rc1-bk8-bjking1/include/asm-ppc64/pci.h | 6 linux-2.6.9-rc1-bk8-bjking1/include/asm-ppc64/prom.h | 5 6 files changed, 226 insertions(+), 18 deletions(-) diff -puN include/asm-ppc64/prom.h~ppc64_block_cfg_io_during_bist include/asm-ppc64/prom.h --- linux-2.6.9-rc1-bk8/include/asm-ppc64/prom.h~ppc64_block_cfg_io_during_bist 2004-09-01 16:20:35.000000000 -0500 +++ linux-2.6.9-rc1-bk8-bjking1/include/asm-ppc64/prom.h 2004-09-01 16:20:35.000000000 -0500 @@ -169,16 +169,21 @@ struct device_node { struct proc_dir_entry *addr_link; /* addr symlink */ atomic_t _users; /* reference count */ unsigned long _flags; + spinlock_t config_lock; }; /* flag descriptions */ #define OF_STALE 0 /* node is slated for deletion */ #define OF_DYNAMIC 1 /* node and properties were allocated via kmalloc */ +#define OF_NO_CFGIO 2 /* config space accesses should fail */ #define OF_IS_STALE(x) test_bit(OF_STALE, &x->_flags) #define OF_MARK_STALE(x) set_bit(OF_STALE, &x->_flags) #define OF_IS_DYNAMIC(x) test_bit(OF_DYNAMIC, &x->_flags) #define OF_MARK_DYNAMIC(x) set_bit(OF_DYNAMIC, &x->_flags) +#define OF_IS_CFGIO_BLOCKED(x) test_bit(OF_NO_CFGIO, &x->_flags) +#define OF_UNBLOCK_CFGIO(x) clear_bit(OF_NO_CFGIO, &x->_flags) +#define OF_BLOCK_CFGIO(x) set_bit(OF_NO_CFGIO, &x->_flags) /* * Until 32-bit ppc can add proc_dir_entries to its device_node diff -puN arch/ppc64/kernel/prom.c~ppc64_block_cfg_io_during_bist arch/ppc64/kernel/prom.c --- linux-2.6.9-rc1-bk8/arch/ppc64/kernel/prom.c~ppc64_block_cfg_io_during_bist 2004-09-01 16:20:35.000000000 -0500 +++ linux-2.6.9-rc1-bk8-bjking1/arch/ppc64/kernel/prom.c 2004-09-01 16:20:35.000000000 -0500 @@ -2959,6 +2959,7 @@ int 
of_add_node(const char *path, struct np->properties = proplist; OF_MARK_DYNAMIC(np); + spin_lock_init(&np->config_lock); of_node_get(np); np->parent = derive_parent(path); if (!np->parent) { diff -puN arch/ppc64/kernel/pSeries_pci.c~ppc64_block_cfg_io_during_bist arch/ppc64/kernel/pSeries_pci.c --- linux-2.6.9-rc1-bk8/arch/ppc64/kernel/pSeries_pci.c~ppc64_block_cfg_io_during_bist 2004-09-01 16:20:35.000000000 -0500 +++ linux-2.6.9-rc1-bk8-bjking1/arch/ppc64/kernel/pSeries_pci.c 2004-09-01 16:20:35.000000000 -0500 @@ -61,15 +61,12 @@ static int s7a_workaround; extern unsigned long pci_probe_only; -static int rtas_read_config(struct device_node *dn, int where, int size, u32 *val) +static int __rtas_read_config(struct device_node *dn, int where, int size, u32 *val) { int returnval = -1; unsigned long buid, addr; int ret; - if (!dn) - return -2; - addr = (dn->busno << 16) | (dn->devfn << 8) | where; buid = dn->phb->buid; if (buid) { @@ -82,6 +79,23 @@ static int rtas_read_config(struct devic return ret; } +static int rtas_read_config(struct device_node *dn, int where, int size, u32 *val) +{ + unsigned long flags; + int ret = 0; + + if (!dn) + return -2; + + spin_lock_irqsave(&dn->config_lock, flags); + if (OF_IS_CFGIO_BLOCKED(dn)) + *val = -1; + else + ret = __rtas_read_config(dn, where, size, val); + spin_unlock_irqrestore(&dn->config_lock, flags); + return ret; +} + static int rtas_pci_read_config(struct pci_bus *bus, unsigned int devfn, int where, int size, u32 *val) @@ -100,14 +114,11 @@ static int rtas_pci_read_config(struct p return PCIBIOS_DEVICE_NOT_FOUND; } -static int rtas_write_config(struct device_node *dn, int where, int size, u32 val) +static int __rtas_write_config(struct device_node *dn, int where, int size, u32 val) { unsigned long buid, addr; int ret; - if (!dn) - return -2; - addr = (dn->busno << 16) | (dn->devfn << 8) | where; buid = dn->phb->buid; if (buid) { @@ -118,6 +129,21 @@ static int rtas_write_config(struct devi return ret; } +static int 
rtas_write_config(struct device_node *dn, int where, int size, u32 val) +{ + unsigned long flags; + int ret = 0; + + if (!dn) + return -2; + + spin_lock_irqsave(&dn->config_lock, flags); + if (!OF_IS_CFGIO_BLOCKED(dn)) + ret = __rtas_write_config(dn, where, size, val); + spin_unlock_irqrestore(&dn->config_lock, flags); + return ret; +} + static int rtas_pci_write_config(struct pci_bus *bus, unsigned int devfn, int where, int size, u32 val) @@ -141,6 +167,67 @@ struct pci_ops rtas_pci_ops = { rtas_pci_write_config }; +/** + * pci_block_config_io - Block PCI config reads/writes + * @pdev: pci device struct + * + * This function blocks any PCI config accesses from occurring. + * Device drivers may call this prior to running BIST if the + * adapter cannot handle PCI config reads or writes when + * running BIST. When blocked, any writes will be ignored and + * treated as successful and any reads will return all 1's data. + * + * Return value: + * nothing + **/ +void pci_block_config_io(struct pci_dev *pdev) +{ + struct device_node *dn = pci_device_to_OF_node(pdev); + unsigned long flags; + + spin_lock_irqsave(&dn->config_lock, flags); + OF_BLOCK_CFGIO(dn); + spin_unlock_irqrestore(&dn->config_lock, flags); +} +EXPORT_SYMBOL(pci_block_config_io); + +/** + * pci_unblock_config_io - Unblock PCI config reads/writes + * @pdev: pci device struct + * + * This function allows PCI config accesses to resume. + * + * Return value: + * nothing + **/ +void pci_unblock_config_io(struct pci_dev *pdev) +{ + struct device_node *dn = pci_device_to_OF_node(pdev); + unsigned long flags; + + spin_lock_irqsave(&dn->config_lock, flags); + OF_UNBLOCK_CFGIO(dn); + spin_unlock_irqrestore(&dn->config_lock, flags); +} +EXPORT_SYMBOL(pci_unblock_config_io); + +/** + * pci_start_bist - Start BIST on a PCI device + * @pdev: pci device struct + * + * This function allows a device driver to start BIST + * when PCI config accesses are disabled. 
+ * + * Return value: + * result of the PCI_BIST config space write + **/ +int pci_start_bist(struct pci_dev *pdev) +{ + struct device_node *dn = pci_device_to_OF_node(pdev); + return __rtas_write_config(dn, PCI_BIST, 1, PCI_BIST_START); +} +EXPORT_SYMBOL(pci_start_bist); + /****************************************************************** * pci_read_irq_line * diff -puN include/asm-ppc64/pci.h~ppc64_block_cfg_io_during_bist include/asm-ppc64/pci.h --- linux-2.6.9-rc1-bk8/include/asm-ppc64/pci.h~ppc64_block_cfg_io_during_bist 2004-09-01 16:20:35.000000000 -0500 +++ linux-2.6.9-rc1-bk8-bjking1/include/asm-ppc64/pci.h 2004-09-01 16:20:35.000000000 -0500 @@ -233,6 +233,12 @@ extern int pci_read_irq_line(struct pci_ extern void pcibios_add_platform_entries(struct pci_dev *dev); +extern void pci_block_config_io(struct pci_dev *dev); + +extern void pci_unblock_config_io(struct pci_dev *dev); + +extern int pci_start_bist(struct pci_dev *dev); + #endif /* __KERNEL__ */ #endif /* __PPC64_PCI_H */ diff -puN include/asm-ppc64/iSeries/iSeries_pci.h~ppc64_block_cfg_io_during_bist include/asm-ppc64/iSeries/iSeries_pci.h --- linux-2.6.9-rc1-bk8/include/asm-ppc64/iSeries/iSeries_pci.h~ppc64_block_cfg_io_during_bist 2004-09-01 16:20:35.000000000 -0500 +++ linux-2.6.9-rc1-bk8-bjking1/include/asm-ppc64/iSeries/iSeries_pci.h 2004-09-01 16:20:35.000000000 -0500 @@ -91,6 +91,7 @@ struct iSeries_Device_Node { int ReturnCode; /* Return Code Holder */ int IoRetry; /* Current Retry Count */ int Flags; /* Possible flags(disable/bist)*/ +#define ISERIES_CFGIO_BLOCKED 1 u16 Vendor; /* Vendor ID */ u8 LogicalSlot; /* Hv Slot Index for Tces */ struct iommu_table* iommu_table;/* Device TCE Table */ @@ -99,6 +100,7 @@ struct iSeries_Device_Node { u8 FrameId; /* iSeries spcn Frame Id */ char CardLocation[4];/* Char format of planar vpd */ char Location[20]; /* Frame 1, Card C10 */ + spinlock_t config_lock; }; /************************************************************************/ diff -puN
arch/ppc64/kernel/iSeries_pci.c~ppc64_block_cfg_io_during_bist arch/ppc64/kernel/iSeries_pci.c --- linux-2.6.9-rc1-bk8/arch/ppc64/kernel/iSeries_pci.c~ppc64_block_cfg_io_during_bist 2004-09-01 16:20:35.000000000 -0500 +++ linux-2.6.9-rc1-bk8-bjking1/arch/ppc64/kernel/iSeries_pci.c 2004-09-01 16:20:35.000000000 -0500 @@ -131,6 +131,7 @@ static struct iSeries_Device_Node *build node->AgentId = AgentId; node->DevFn = PCI_DEVFN(ISERIES_ENCODE_DEVICE(AgentId), Function); node->IoRetry = 0; + spin_lock_init(&node->config_lock); iSeries_Get_Location_Code(node); PCIFR("Device 0x%02X.%2X, Node:0x%p ", ISERIES_BUS(node), ISERIES_DEVFUN(node), node); @@ -515,16 +516,12 @@ static u64 hv_cfg_write_func[4] = { /* * Read PCI config space */ -static int iSeries_pci_read_config(struct pci_bus *bus, unsigned int devfn, +static int __iSeries_pci_read_config(struct iSeries_Device_Node *node, int offset, int size, u32 *val) { - struct iSeries_Device_Node *node = find_Device_Node(bus->number, devfn); u64 fn; struct HvCallPci_LoadReturn ret; - if (node == NULL) - return PCIBIOS_DEVICE_NOT_FOUND; - fn = hv_cfg_read_func[(size - 1) & 3]; HvCall3Ret16(fn, &ret, node->DsaAddr.DsaAddr, offset, 0); @@ -537,20 +534,36 @@ static int iSeries_pci_read_config(struc return 0; } +static int iSeries_pci_read_config(struct pci_bus *bus, unsigned int devfn, + int offset, int size, u32 *val) +{ + struct iSeries_Device_Node *node = find_Device_Node(bus->number, devfn); + int ret = PCIBIOS_DEVICE_NOT_FOUND; + unsigned long flags; + + if (node) { + ret = 0; + spin_lock_irqsave(&node->config_lock, flags); + if (node->Flags & ISERIES_CFGIO_BLOCKED) + *val = -1; + else + ret = __iSeries_pci_read_config(node, offset, size, val); + spin_unlock_irqrestore(&node->config_lock, flags); + } + + return ret; +} + /* * Write PCI config space */ -static int iSeries_pci_write_config(struct pci_bus *bus, unsigned int devfn, +static int __iSeries_pci_write_config(struct iSeries_Device_Node *node, int offset, int size, u32 
val) { - struct iSeries_Device_Node *node = find_Device_Node(bus->number, devfn); u64 fn; u64 ret; - if (node == NULL) - return PCIBIOS_DEVICE_NOT_FOUND; - fn = hv_cfg_write_func[(size - 1) & 3]; ret = HvCall4(fn, node->DsaAddr.DsaAddr, offset, val, 0); @@ -560,6 +573,23 @@ static int iSeries_pci_write_config(stru return 0; } +static int iSeries_pci_write_config(struct pci_bus *bus, unsigned int devfn, + int offset, int size, u32 val) +{ + struct iSeries_Device_Node *node = find_Device_Node(bus->number, devfn); + int ret = PCIBIOS_DEVICE_NOT_FOUND; + unsigned long flags; + + if (node) { + spin_lock_irqsave(&node->config_lock, flags); + if (!(node->Flags & ISERIES_CFGIO_BLOCKED)) + ret = __iSeries_pci_write_config(node, offset, size, val); + spin_unlock_irqrestore(&node->config_lock, flags); + } + + return ret; +} + static struct pci_ops iSeries_pci_ops = { .read = iSeries_pci_read_config, .write = iSeries_pci_write_config @@ -820,3 +850,80 @@ void iSeries_Write_Long(u32 data, void * } while (CheckReturnCode("WWL", DevNode, rc) != 0); } EXPORT_SYMBOL(iSeries_Write_Long); + +/** + * pci_block_config_io - Block PCI config reads/writes + * @pdev: pci device struct + * + * This function blocks any PCI config accesses from occurring. + * Device drivers may call this prior to running BIST if the + * adapter cannot handle PCI config reads or writes when + * running BIST. When blocked, any writes will be ignored and + * treated as successful and any reads will return all 1's data. 
+ * + * Return value: + * nothing + **/ +void pci_block_config_io(struct pci_dev *pdev) +{ + struct iSeries_Device_Node *node; + unsigned long flags; + + node = find_Device_Node(pdev->bus->number, pdev->devfn); + + if (node == NULL) + return; + + spin_lock_irqsave(&node->config_lock, flags); + node->Flags |= ISERIES_CFGIO_BLOCKED; + spin_unlock_irqrestore(&node->config_lock, flags); +} +EXPORT_SYMBOL(pci_block_config_io); + +/** + * pci_unblock_config_io - Unblock PCI config reads/writes + * @pdev: pci device struct + * + * This function allows PCI config accesses to resume. + * + * Return value: + * nothing + **/ +void pci_unblock_config_io(struct pci_dev *pdev) +{ + struct iSeries_Device_Node *node; + unsigned long flags; + + node = find_Device_Node(pdev->bus->number, pdev->devfn); + + if (node == NULL) + return; + + spin_lock_irqsave(&node->config_lock, flags); + node->Flags &= ~ISERIES_CFGIO_BLOCKED; + spin_unlock_irqrestore(&node->config_lock, flags); +} +EXPORT_SYMBOL(pci_unblock_config_io); + +/** + * pci_start_bist - Start BIST on a PCI device + * @pdev: pci device struct + * + * This function allows a device driver to start BIST + * when PCI config accesses are disabled. + * + * Return value: + * PCIBIOS_DEVICE_NOT_FOUND if no node exists, else the config write result + **/ +int pci_start_bist(struct pci_dev *pdev) +{ + struct iSeries_Device_Node *node; + + node = find_Device_Node(pdev->bus->number, pdev->devfn); + + if (node == NULL) + return PCIBIOS_DEVICE_NOT_FOUND; + + return __iSeries_pci_write_config(node, PCI_BIST, 1, PCI_BIST_START); +} +EXPORT_SYMBOL(pci_start_bist); _ ----- End forwarded message ----- From brking at us.ibm.com Tue Sep 14 07:05:39 2004 From: brking at us.ibm.com (Brian King) Date: Mon, 13 Sep 2004 16:05:39 -0500 Subject: [PATCH 1/1] ppc64: Block config accesses during BIST In-Reply-To: <20040913205749.GE9645@austin.ibm.com> References: <20040913205749.GE9645@austin.ibm.com> Message-ID: <41460BA3.9070007@us.ibm.com> I'll be sending a patch that applies cleanly fairly soon.
-Brian

Linas Vepstas wrote:
> Forwarding ...
>
> Brian sent this patch while the list was down. The problem that
> spurs this patch was discussed a number of times on this mailing list.
> I like this patch; it seems to solve the problem with a minimum of
> fuss.
>
> I suspect this patch doesn't apply cleanly after other recent
> changes.
>
> Torvalds suggests using "Pirated-by:" when forwarding a patch such as this:
> http://www.ussg.iu.edu/hypermail/linux/kernel/0405.3/0226.html
>
> Signed-off-by: Linas Vepstas
>
> --linas
>
> ----- Forwarded message from brking at us.ibm.com -----
>
> Subject: [PATCH 1/1] ppc64: Block config accesses during BIST
>
> Some PCI adapters on pSeries and iSeries hardware (ipr scsi adapters)
> have an exposure today in that they issue BIST to the adapter to reset
> the card. If, during the time it takes to complete BIST, userspace attempts
> to