From linas at austin.ibm.com Fri Apr 1 06:06:22 2005
From: linas at austin.ibm.com (Linas Vepstas)
Date: Thu, 31 Mar 2005 14:06:22 -0600
Subject: Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]
In-Reply-To: <20050322175728.GE12675@colo.lackof.org>
References: <20050223002409.GA10909@austin.ibm.com>
	<20050223174356.GH13081@kroah.com>
	<1109207532.5384.32.camel@gaston>
	<20050224013137.GF2088@austin.ibm.com>
	<20050226063609.GC7036@colo.lackof.org>
	<20050321231028.GV498@austin.ibm.com>
	<20050322175728.GE12675@colo.lackof.org>
Message-ID: <20050331200622.GG15596@austin.ibm.com>

Hmm, Got distracted by other issues, so I'm answering a week late...

On Tue, Mar 22, 2005 at 10:57:28AM -0700, Grant Grundler was heard to remark:
> On Mon, Mar 21, 2005 at 05:10:28PM -0600, Linas Vepstas wrote:
> > My current hardware will halt all i/o to/from the symbios controller
> > upon detection of a PCI error. The recovery procedure that I am
> > currently using is to call system firmware (aka 'bios') to raise
> > and then lower the #RST pci signal line for 1/4 second, then wait 2
> > seconds for the PCI bus to settle, then restore the PCI config space
> > registers (BARs, interrupt line, etc) to what they used to be. Then,
> > I call sym_start_up() in an attempt to get the symbios card working
> > again. And that's where I get stuck ...
>
> Does this process cause a SCSI bus reset?

Don't get a chance to get that far. Have to bring up the PCI interfaces
first, before any scsi command can be issued.

> BTW, when did sym2 get a chance to cleanup "pending" requests?

Yes, the sym2 driver has mechanisms for that.

> You want everything moved back to the "queued" state or failed
> (flush pending IO so upper layers can retry if they want).

Upper layer is the linux block device; my understanding is that it does
not retry, nor do the filesystems above that. Passing errors upwards
seems to be pretty darned fatal. My goal is to limit retries to the driver.
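
The recovery sequence quoted above (pulse #RST, let the bus settle, restore
config space, restart the driver) can be sketched roughly as below. This is
only a user-space illustration of the ordering: every name in it
(fake_pci_dev, fake_fw_pulse_reset, fake_recover, and so on) is made up for
the sketch and is not a firmware, EEH, or sym2 interface, and the delays are
recorded rather than actually waited for.

```c
#include <assert.h>
#include <string.h>

/* All names below are hypothetical stand-ins, not real kernel,
 * firmware, or sym2 interfaces. */

enum { NUM_CFG_REGS = 16 };

struct fake_pci_dev {
	unsigned int cfg[NUM_CFG_REGS];		/* live config space */
	unsigned int saved_cfg[NUM_CFG_REGS];	/* copy taken before the error */
	int in_reset;				/* 1 while #RST is asserted */
	int driver_running;
	char log[128];				/* order in which the steps ran */
};

/* Append a step name to the log, never overflowing the buffer. */
static void note(struct fake_pci_dev *dev, const char *step)
{
	strncat(dev->log, step, sizeof(dev->log) - strlen(dev->log) - 1);
}

/* Stand-in for the firmware call that raises and lowers #RST.
 * The reset wipes config space and stops the driver. */
static void fake_fw_pulse_reset(struct fake_pci_dev *dev, unsigned int hold_ms)
{
	(void)hold_ms;		/* real code would hold the line this long */
	dev->in_reset = 1;
	memset(dev->cfg, 0, sizeof(dev->cfg));
	dev->driver_running = 0;
	dev->in_reset = 0;
	note(dev, "reset;");
}

/* Stand-in for the settle delay after #RST is released. */
static void fake_settle(struct fake_pci_dev *dev, unsigned int ms)
{
	(void)ms;
	note(dev, "settle;");
}

/* Put BARs, interrupt line, etc. back to their saved values. */
static void fake_restore_config(struct fake_pci_dev *dev)
{
	memcpy(dev->cfg, dev->saved_cfg, sizeof(dev->cfg));
	note(dev, "restore;");
}

/* Stand-in for the driver restart; refuses to run on stale config. */
static int fake_driver_start(struct fake_pci_dev *dev)
{
	assert(!dev->in_reset);
	if (memcmp(dev->cfg, dev->saved_cfg, sizeof(dev->cfg)) != 0)
		return -1;
	dev->driver_running = 1;
	note(dev, "start;");
	return 0;
}

/* The whole sequence, in the order described in the quoted text. */
static int fake_recover(struct fake_pci_dev *dev)
{
	fake_fw_pulse_reset(dev, 250);	/* 1/4 second */
	fake_settle(dev, 2000);		/* 2 seconds */
	fake_restore_config(dev);
	return fake_driver_start(dev);
}
```

The point of the sketch is only the ordering constraint: the restart step
must not run until the config registers are back in place.
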

> > Sometimes, I get the PCI error while the card is sitting there idly
> > after the #RST, but more often, I get the error in sym_chip_reset(),
> > immediately after the OUTB (nc_istat, SRST);
>
> Oh? Is this the driver trying to issue SCSI Reset?

No, I am trying to reinitialize the scsi card after the pci bus has been
reset. This has nothing to do with scsi bus resets, as far as I know ...

--linas

From apw at us.ibm.com Fri Apr 1 05:34:58 2005
From: apw at us.ibm.com (Amos Waterland)
Date: Thu, 31 Mar 2005 14:34:58 -0500
Subject: [patch] fix prom.c compile warning
Message-ID: <20050331193458.GA4186@kvasir.watson.ibm.com>

The code in unflatten_device_tree knows that get_property is written to
only return with lenp equal to 1 when also returning a valid pointer.
The gcc 3.3.3 compiler is not able to prove this to itself, so it warns
about a possible uninitialized pointer dereference:

 .../arch/ppc64/kernel/prom.c: In function `unflatten_device_tree':
 .../arch/ppc64/kernel/prom.c:828: warning: `p' might be used uninitialized in this function

Unless it is desired to rework the interaction between the two
functions, this will keep the existing behavior but quiet the compiler.
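
A minimal sketch of the kind of pattern that produces this warning, assuming
a caller that assigns the pointer on one path and dereferences it under a
separate length check. The names struct node, get_prop and first_byte are
made up for illustration; this is not the prom.c code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical property holder, for illustration only. */
struct node {
	const char *value;
	int len;
};

/* Hypothetical getter: sets *lenp to 1 only when it also returns a
 * valid pointer. The caller relies on that contract. */
static const char *get_prop(const struct node *np, int *lenp)
{
	if (np->value == NULL || np->len != 1)
		return NULL;	/* *lenp is left untouched */
	*lenp = 1;
	return np->value;
}

/* Returns the single property byte, or -1 if there is none. */
static int first_byte(const struct node *np)
{
	const char *p = NULL;	/* without "= NULL" a compiler may warn */
	int l = 0;

	if (np != NULL)
		p = get_prop(np, &l);	/* p assigned on this path only */
	if (l != 1)
		return -1;
	/* l == 1 implies p is valid, but only by get_prop's contract,
	 * which the compiler does not necessarily track. */
	assert(p != NULL);
	return (unsigned char)*p;
}
```

The initializer does not change what the function returns on any path; it
only gives the pointer a defined value on the path the compiler cannot rule
out.
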

Signed-off-by: Amos Waterland

===== arch/ppc64/kernel/prom.c 1.127 vs edited =====
--- 1.127/arch/ppc64/kernel/prom.c 2005-03-28 17:21:21 -05:00
+++ edited/arch/ppc64/kernel/prom.c 2005-03-31 13:40:42 -05:00
@@ -825,7 +825,7 @@
 {
 	unsigned long start, mem, size;
 	struct device_node **allnextp = &allnodes;
-	char *p;
+	char *p = NULL;
 	int l = 0;
 
 	DBG(" -> unflatten_device_tree()\n");

From linas at austin.ibm.com Fri Apr 1 06:14:09 2005
From: linas at austin.ibm.com (Linas Vepstas)
Date: Thu, 31 Mar 2005 14:14:09 -0600
Subject: Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]
In-Reply-To: <4240581C.1000906@us.ibm.com>
References: <20050223002409.GA10909@austin.ibm.com>
	<20050223174356.GH13081@kroah.com>
	<1109207532.5384.32.camel@gaston>
	<20050224013137.GF2088@austin.ibm.com>
	<20050226063609.GC7036@colo.lackof.org>
	<20050321231028.GV498@austin.ibm.com>
	<4240581C.1000906@us.ibm.com>
Message-ID: <20050331201409.GH15596@austin.ibm.com>

On Tue, Mar 22, 2005 at 11:38:36AM -0600, Brian King was heard to remark:
> Linas Vepstas wrote:
> >
> > My current hardware will halt all i/o to/from the symbios controller
> > upon detection of a PCI error. The recovery procedure that I am
> > currently using is to call system firmware (aka 'bios') to raise
> > and then lower the #RST pci signal line for 1/4 second, then wait 2
> > seconds for the PCI bus to settle, then restore the PCI config space
> > registers (BARs, interrupt line, etc) to what they used to be. Then,
> > I call sym_start_up() in an attempt to get the symbios card working
> > again. And that's where I get stuck ...
> >
> > My assumption is that after the #RST, that the symbios card will sit
> > there, dumb and stupid, with no scripts running. But sometimes I find
> > that the card has done something to make the PCI error hardware trip
> > again. Typically, this means that the card attempted to DMA to some
> > address that it's not allowed to touch, or raised #SERR or possibly
> > #PERR (I can't tell which).
>
> What config registers are you restoring?

BARs, grant, latency, interrupt, cacheline size.

> Is it possible symbios does not
> like something in your config restore?

possibly...

> Another possibility is that asserting PCI reset is not cleanly resetting
> the card. Does PCI reset force BIST to be run on these cards? You could
> try to manually run BIST on the card after the PCI reset to see if that

I didn't see bist in the code, but I wasn't looking for it either.
I could try that.

> helps, or you could try power cycling the slot instead of using PCI reset.

yes I could :( I'll try that next. Problem is, not all slots are
power-cyclable, only the hotplug slots are. I've discovered that, for
example, the ethernet chips are soldered to the motherboard, and can't be
power-cycled (but fortunately, those don't give me trouble).

--linas

From linas at austin.ibm.com Fri Apr 1 06:21:48 2005
From: linas at austin.ibm.com (Linas Vepstas)
Date: Thu, 31 Mar 2005 14:21:48 -0600
Subject: RFC/Patch more xmon additions
In-Reply-To: <16936.10223.704710.234312@cargo.ozlabs.ibm.com>
References: <421E3BE3.90301@vnet.ibm.com>
	<16936.10223.704710.234312@cargo.ozlabs.ibm.com>
Message-ID: <20050331202148.GI15596@austin.ibm.com>

Hi Will,

I just unearthed this email from the deep mound ...

On Fri, Mar 04, 2005 at 08:18:39PM +1100, Paul Mackerras was heard to remark:
> will schmidt writes:
>
> > Am looking for comments on this additional function i've added to xmon
> > on the side..
> >
> > the bulk of my intent was to make it easier for me to poke at memory
> > within a particular user process.
>
> The main problem I have with it is that we seem to be accessing a lot
> of kernel data structures without checking any pointers or using
> mread() to read the memory safely. One of the goals of xmon is that
> it should be as reliable as possible even if kernel data structures
> are corrupted, and I think your patch would reduce that reliability.

Please clean up per Paul's suggestions and resubmit; as a matter of
principle, it's nice to have the debugger print parsed output instead
of having to count 289 bytes into some struct task or such to manually
decode a bitflag ...

--linas

From dwmw2 at infradead.org Fri Apr 1 06:44:47 2005
From: dwmw2 at infradead.org (David Woodhouse)
Date: Thu, 31 Mar 2005 21:44:47 +0100
Subject: [PATCH] Export re{serv,leas}e_pmc_hardware() for oprofile
Message-ID: <1112301887.24487.363.camel@hades.cambridge.redhat.com>

CONFIG_OPROFILE=m doesn't work on ppc64 if these aren't exported...

Signed-off-by: David Woodhouse

--- linux-2.6.11/arch/ppc64/kernel/pmc.c.orig 2005-03-31 20:31:07.000000000 +0100
+++ linux-2.6.11/arch/ppc64/kernel/pmc.c 2005-03-31 20:30:15.000000000 +0100
@@ -12,6 +12,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 
@@ -50,6 +51,7 @@ int reserve_pmc_hardware(perf_irq_t new_
 	spin_unlock(&pmc_owner_lock);
 	return err;
 }
+EXPORT_SYMBOL_GPL(reserve_pmc_hardware);
 
 void release_pmc_hardware(void)
 {
@@ -62,3 +64,4 @@ void release_pmc_hardware(void)
 
 	spin_unlock(&pmc_owner_lock);
 }
+EXPORT_SYMBOL_GPL(release_pmc_hardware);

--
dwmw2

From jschopp at austin.ibm.com Fri Apr 1 07:42:12 2005
From: jschopp at austin.ibm.com (Joel Schopp)
Date: Thu, 31 Mar 2005 15:42:12 -0600
Subject: system call for LPAR characteristics
Message-ID: <424C6EB4.7040400@austin.ibm.com>

Scott,

Not sure if Manish got back to you or not. Saw this message go by on a
mailing list (IBM internal one) and thought it might relate to what you
were asking about. If it does you might post to
linuxppc64-dev at ozlabs.org and ask about /proc/ppc64/lparcfg containing
PURR data. Come to think of it that mailing list might be a good place
to get your kernel questions answered.

-Joel

-------- Original Message --------
On Thu, Mar 31, 2005 at 02:05:33PM -0600, Chakarat Skawratananond wrote:
> Hi All,
>
> AIX has the lpar_get_info( ) system call.
> Doesn't seem like we have the equivalent for LoP.
> If not, is there a workaround?
>
> We have /proc/ppc64/lparcfg but this is configuration data, not the
> current CPU use.

/proc/ppc64/lparcfg has recently been amended to include cpu (PURR)
usage. It might not be a feature in either of the distros yet. Jeff
Scheel might know.

-Olof

From mikpe at csd.uu.se Fri Apr 1 08:07:34 2005
From: mikpe at csd.uu.se (Mikael Pettersson)
Date: Fri, 1 Apr 2005 00:07:34 +0200 (MEST)
Subject: [PATCH 2.6.12-rc1-mm5 1/3] perfctr: ppc64 arch hooks
Message-ID: <200503312207.j2VM7YUI011924@alkaid.it.uu.se>

Here's a 3-part patch kit which adds a ppc64 driver to perfctr,
written by David Gibson. ppc64 is sufficiently different from
ppc32 that this driver is kept separate from my ppc32 driver.
This shouldn't matter unless people actually want to run ppc32
kernels on ppc64 processors.

ppc64 perfctr driver from David Gibson:
- ppc64 arch hooks: Kconfig, syscalls numbers and tables, task struct,
  and process management ops (switch_to, exit, fork)

Signed-off-by: Mikael Pettersson

 arch/ppc64/Kconfig | 1 +
 arch/ppc64/kernel/misc.S | 12 ++++++++++++
 arch/ppc64/kernel/process.c | 6 ++++++
 include/asm-ppc64/processor.h | 2 ++
 include/asm-ppc64/unistd.h | 8 +++++++-
 5 files changed, 28 insertions(+), 1 deletion(-)

diff -rupN linux-2.6.12-rc1-mm4/arch/ppc64/Kconfig linux-2.6.12-rc1-mm4.perfctr27-ppc64-arch-hooks/arch/ppc64/Kconfig
--- linux-2.6.12-rc1-mm4/arch/ppc64/Kconfig 2005-03-31 21:08:24.000000000 +0200
+++ linux-2.6.12-rc1-mm4.perfctr27-ppc64-arch-hooks/arch/ppc64/Kconfig 2005-03-31 23:28:07.000000000 +0200
@@ -297,6 +297,7 @@ config SECCOMP
 
 endmenu
 
+source "drivers/perfctr/Kconfig"
 
 menu "General setup"
 
diff -rupN linux-2.6.12-rc1-mm4/arch/ppc64/kernel/misc.S linux-2.6.12-rc1-mm4.perfctr27-ppc64-arch-hooks/arch/ppc64/kernel/misc.S
--- linux-2.6.12-rc1-mm4/arch/ppc64/kernel/misc.S 2005-03-31 21:08:24.000000000 +0200
+++ linux-2.6.12-rc1-mm4.perfctr27-ppc64-arch-hooks/arch/ppc64/kernel/misc.S 2005-03-31 23:28:07.000000000 +0200
@@ -956,6 +956,12 @@ _GLOBAL(sys_call_table32)
 	.llong .sys32_request_key
 	.llong .compat_sys_keyctl
 	.llong .compat_sys_waitid
+	.llong .sys_ni_syscall /* 273 reserved for sys_ioprio_set */
+	.llong .sys_ni_syscall /* 274 reserved for sys_ioprio_get */
+	.llong .sys_vperfctr_open /* 275 */
+	.llong .sys_vperfctr_control
+	.llong .sys_vperfctr_write
+	.llong .sys_vperfctr_read
 
 	.balign 8
 _GLOBAL(sys_call_table)
@@ -1232,3 +1238,9 @@ _GLOBAL(sys_call_table)
 	.llong .sys_request_key /* 270 */
 	.llong .sys_keyctl
 	.llong .sys_waitid
+	.llong .sys_ni_syscall /* 273 reserved for sys_ioprio_set */
+	.llong .sys_ni_syscall /* 274 reserved for sys_ioprio_get */
+	.llong .sys_vperfctr_open /* 275 */
+	.llong .sys_vperfctr_control
+	.llong .sys_vperfctr_write
+	.llong .sys_vperfctr_read
diff -rupN linux-2.6.12-rc1-mm4/arch/ppc64/kernel/process.c linux-2.6.12-rc1-mm4.perfctr27-ppc64-arch-hooks/arch/ppc64/kernel/process.c
--- linux-2.6.12-rc1-mm4/arch/ppc64/kernel/process.c 2005-03-31 21:07:46.000000000 +0200
+++ linux-2.6.12-rc1-mm4.perfctr27-ppc64-arch-hooks/arch/ppc64/kernel/process.c 2005-03-31 23:28:07.000000000 +0200
@@ -36,6 +36,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 
@@ -225,7 +226,9 @@ struct task_struct *__switch_to(struct t
 
 	local_irq_save(flags);
 
+	perfctr_suspend_thread(&prev->thread);
 	last = _switch(old_thread, new_thread);
+	perfctr_resume_thread(&current->thread);
 
 	local_irq_restore(flags);
 
@@ -323,6 +326,7 @@ void exit_thread(void)
 		last_task_used_altivec = NULL;
 #endif /* CONFIG_ALTIVEC */
 #endif /* CONFIG_SMP */
+	perfctr_exit_thread(&current->thread);
 }
 
 void flush_thread(void)
@@ -425,6 +429,8 @@ copy_thread(int nr, unsigned long clone_
 	 */
 	kregs->nip = *((unsigned long *)ret_from_fork);
 
+	perfctr_copy_task(p, regs);
+
 	return 0;
 }
 
diff -rupN linux-2.6.12-rc1-mm4/include/asm-ppc64/processor.h linux-2.6.12-rc1-mm4.perfctr27-ppc64-arch-hooks/include/asm-ppc64/processor.h
--- linux-2.6.12-rc1-mm4/include/asm-ppc64/processor.h 2005-03-31 21:08:31.000000000 +0200
+++ linux-2.6.12-rc1-mm4.perfctr27-ppc64-arch-hooks/include/asm-ppc64/processor.h 2005-03-31 23:28:07.000000000 +0200
@@ -574,6 +574,8 @@ struct thread_struct {
 	unsigned long vrsave;
 	int used_vr; /* set if process has used altivec */
 #endif /* CONFIG_ALTIVEC */
+	/* performance counters */
+	struct vperfctr *perfctr;
 };
 
 #define ARCH_MIN_TASKALIGN 16
diff -rupN linux-2.6.12-rc1-mm4/include/asm-ppc64/unistd.h linux-2.6.12-rc1-mm4.perfctr27-ppc64-arch-hooks/include/asm-ppc64/unistd.h
--- linux-2.6.12-rc1-mm4/include/asm-ppc64/unistd.h 2005-03-31 21:07:54.000000000 +0200
+++ linux-2.6.12-rc1-mm4.perfctr27-ppc64-arch-hooks/include/asm-ppc64/unistd.h 2005-03-31 23:28:07.000000000 +0200
@@ -283,8 +283,14 @@
 #define __NR_request_key 270
 #define __NR_keyctl 271
 #define __NR_waitid 272
+/* 273 is reserved for ioprio_set */
+/* 274 is reserved for ioprio_get */
+#define __NR_vperfctr_open 275
+#define __NR_vperfctr_control (__NR_vperfctr_open+1)
+#define __NR_vperfctr_write (__NR_vperfctr_open+2)
+#define __NR_vperfctr_read (__NR_vperfctr_open+3)
 
-#define __NR_syscalls 273
+#define __NR_syscalls 279
 #ifdef __KERNEL__
 #define NR_syscalls __NR_syscalls
 #endif

From mikpe at csd.uu.se Fri Apr 1 08:09:04 2005
From: mikpe at csd.uu.se (Mikael Pettersson)
Date: Fri, 1 Apr 2005 00:09:04 +0200 (MEST)
Subject: [PATCH 2.6.12-rc1-mm5 2/3] perfctr: common updates for ppc64
Message-ID: <200503312209.j2VM94QH011932@alkaid.it.uu.se>

ppc64 perfctr driver from David Gibson:
- perfctr common updates: Makefile, version
- perfctr virtual quirk: the ppc64 low-level driver is unable to
  prevent all stray overflow interrupts, on ppc64 (and only ppc64)
  the right action in this case is to ignore the interrupt and resume

Signed-off-by: Mikael Pettersson

 drivers/perfctr/Makefile | 5 ++++-
 drivers/perfctr/version.h | 2 +-
 drivers/perfctr/virtual.c | 11 ++++++++++-
 3 files changed, 15 insertions(+), 3 deletions(-)

diff -rupN linux-2.6.12-rc1-mm4/drivers/perfctr/Makefile linux-2.6.12-rc1-mm4.perfctr-ppc64-common-update/drivers/perfctr/Makefile
--- linux-2.6.12-rc1-mm4/drivers/perfctr/Makefile 2005-03-31 21:08:26.000000000 +0200
+++ linux-2.6.12-rc1-mm4.perfctr-ppc64-common-update/drivers/perfctr/Makefile 2005-03-31 23:36:04.000000000 +0200
@@ -1,4 +1,4 @@
-# $Id: Makefile,v 1.26 2004/05/30 23:02:14 mikpe Exp $
+# $Id: Makefile,v 1.27 2005/03/23 01:29:34 mikpe Exp $
 # Makefile for the Performance-monitoring counters driver.
 
 # This also covers x86_64.
@@ -8,6 +8,9 @@ tests-objs-$(CONFIG_X86) := x86_tests.o perfctr-objs-$(CONFIG_PPC32) := ppc.o tests-objs-$(CONFIG_PPC32) := ppc_tests.o +perfctr-objs-$(CONFIG_PPC64) := ppc64.o +tests-objs-$(CONFIG_PPC64) := ppc64_tests.o + perfctr-objs-y += init.o perfctr-objs-$(CONFIG_PERFCTR_INIT_TESTS) += $(tests-objs-y) perfctr-objs-$(CONFIG_PERFCTR_VIRTUAL) += virtual.o diff -rupN linux-2.6.12-rc1-mm4/drivers/perfctr/version.h linux-2.6.12-rc1-mm4.perfctr-ppc64-common-update/drivers/perfctr/version.h --- linux-2.6.12-rc1-mm4/drivers/perfctr/version.h 2005-03-31 21:08:26.000000000 +0200 +++ linux-2.6.12-rc1-mm4.perfctr-ppc64-common-update/drivers/perfctr/version.h 2005-03-31 23:36:04.000000000 +0200 @@ -1 +1 @@ -#define VERSION "2.7.14" +#define VERSION "2.7.15" diff -rupN linux-2.6.12-rc1-mm4/drivers/perfctr/virtual.c linux-2.6.12-rc1-mm4.perfctr-ppc64-common-update/drivers/perfctr/virtual.c --- linux-2.6.12-rc1-mm4/drivers/perfctr/virtual.c 2005-03-31 21:08:26.000000000 +0200 +++ linux-2.6.12-rc1-mm4.perfctr-ppc64-common-update/drivers/perfctr/virtual.c 2005-03-31 23:36:04.000000000 +0200 @@ -1,4 +1,4 @@ -/* $Id: virtual.c,v 1.111 2005/02/20 11:56:44 mikpe Exp $ +/* $Id: virtual.c,v 1.115 2005/03/28 22:39:02 mikpe Exp $ * Virtual per-process performance counters. * * Copyright (C) 1999-2005 Mikael Pettersson @@ -272,8 +272,17 @@ static void vperfctr_handle_overflow(str pmc_mask = perfctr_cpu_identify_overflow(&perfctr->cpu_state); if (!pmc_mask) { +#ifdef CONFIG_PPC64 + /* On some hardware (ppc64, in particular) it's + * impossible to control interrupts finely enough to + * eliminate overflows on counters we don't care + * about. So in this case just restart the counters + * and keep going. */ + vperfctr_resume(perfctr); +#else printk(KERN_ERR "%s: BUG! 
pid %d has unidentifiable overflow source\n", __FUNCTION__, tsk->pid); +#endif return; } perfctr->ireload_needed = 1; From mikpe at csd.uu.se Fri Apr 1 08:09:49 2005 From: mikpe at csd.uu.se (Mikael Pettersson) Date: Fri, 1 Apr 2005 00:09:49 +0200 (MEST) Subject: [PATCH 2.6.12-rc1-mm5 3/3] perfctr: ppc64 driver core Message-ID: <200503312209.j2VM9nCe011940@alkaid.it.uu.se> ppc64 perfctr driver from David Gibson : - ppc64 perfctr driver core Signed-off-by: Mikael Pettersson drivers/perfctr/ppc64.c | 743 ++++++++++++++++++++++++++++++++++++++++++ drivers/perfctr/ppc64_tests.c | 322 ++++++++++++++++++ drivers/perfctr/ppc64_tests.h | 12 include/asm-ppc64/perfctr.h | 166 +++++++++ 4 files changed, 1243 insertions(+) diff -rupN linux-2.6.12-rc1-mm4/drivers/perfctr/ppc64.c linux-2.6.12-rc1-mm4.perfctr-ppc64-driver/drivers/perfctr/ppc64.c --- linux-2.6.12-rc1-mm4/drivers/perfctr/ppc64.c 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.12-rc1-mm4.perfctr-ppc64-driver/drivers/perfctr/ppc64.c 2005-03-31 23:37:37.000000000 +0200 @@ -0,0 +1,743 @@ +/* + * PPC64 performance-monitoring counters driver. + * + * based on Mikael Pettersson's 32 bit ppc code + * Copyright (C) 2004 David Gibson, IBM Corporation. + * Copyright (C) 2004 Mikael Pettersson + */ + +#include +#include +#include +#include +#include +#include +#include /* tb_ticks_per_jiffy */ +#include +#include + +#include "ppc64_tests.h" + +extern void ppc64_enable_pmcs(void); + +/* Support for lazy perfctr SPR updates. */ +struct per_cpu_cache { /* roughly a subset of perfctr_cpu_state */ + unsigned int id; /* cache owner id */ + /* Physically indexed cache of the MMCRs. */ + unsigned long ppc64_mmcr0, ppc64_mmcr1, ppc64_mmcra; +}; +static DEFINE_PER_CPU(struct per_cpu_cache, per_cpu_cache); +#define __get_cpu_cache(cpu) (&per_cpu(per_cpu_cache, cpu)) +#define get_cpu_cache() (&__get_cpu_var(per_cpu_cache)) + +/* Structure for counter snapshots, as 32-bit values. 
*/ +struct perfctr_low_ctrs { + unsigned int tsc; + unsigned int pmc[8]; +}; + +static unsigned int new_id(void) +{ + static DEFINE_SPINLOCK(lock); + static unsigned int counter; + int id; + + spin_lock(&lock); + id = ++counter; + spin_unlock(&lock); + return id; +} + +static inline unsigned int read_pmc(unsigned int pmc) +{ + switch (pmc) { + case 0: + return mfspr(SPRN_PMC1); + break; + case 1: + return mfspr(SPRN_PMC2); + break; + case 2: + return mfspr(SPRN_PMC3); + break; + case 3: + return mfspr(SPRN_PMC4); + break; + case 4: + return mfspr(SPRN_PMC5); + break; + case 5: + return mfspr(SPRN_PMC6); + break; + case 6: + return mfspr(SPRN_PMC7); + break; + case 7: + return mfspr(SPRN_PMC8); + break; + + default: + return -EINVAL; + } +} + +static inline void write_pmc(int pmc, s32 val) +{ + switch (pmc) { + case 0: + mtspr(SPRN_PMC1, val); + break; + case 1: + mtspr(SPRN_PMC2, val); + break; + case 2: + mtspr(SPRN_PMC3, val); + break; + case 3: + mtspr(SPRN_PMC4, val); + break; + case 4: + mtspr(SPRN_PMC5, val); + break; + case 5: + mtspr(SPRN_PMC6, val); + break; + case 6: + mtspr(SPRN_PMC7, val); + break; + case 7: + mtspr(SPRN_PMC8, val); + break; + } +} + +#ifdef CONFIG_PERFCTR_INTERRUPT_SUPPORT +static void perfctr_default_ihandler(unsigned long pc) +{ + unsigned int mmcr0 = mfspr(SPRN_MMCR0); + + mmcr0 &= ~MMCR0_PMXE; + mtspr(SPRN_MMCR0, mmcr0); +} + +static perfctr_ihandler_t perfctr_ihandler = perfctr_default_ihandler; + +void do_perfctr_interrupt(struct pt_regs *regs) +{ + unsigned long mmcr0; + + /* interrupts are disabled here, so we don't need to + * preempt_disable() */ + + (*perfctr_ihandler)(instruction_pointer(regs)); + + /* clear PMAO so the interrupt doesn't reassert immediately */ + mmcr0 = mfspr(SPRN_MMCR0) & ~MMCR0_PMAO; + mtspr(SPRN_MMCR0, mmcr0); +} + +void perfctr_cpu_set_ihandler(perfctr_ihandler_t ihandler) +{ + perfctr_ihandler = ihandler ? 
ihandler : perfctr_default_ihandler; +} + +#else +#define perfctr_cstatus_has_ictrs(cstatus) 0 +#endif + + +#if defined(CONFIG_SMP) && defined(CONFIG_PERFCTR_INTERRUPT_SUPPORT) + +static inline void +set_isuspend_cpu(struct perfctr_cpu_state *state, int cpu) +{ + state->isuspend_cpu = cpu; +} + +static inline int +is_isuspend_cpu(const struct perfctr_cpu_state *state, int cpu) +{ + return state->isuspend_cpu == cpu; +} + +static inline void clear_isuspend_cpu(struct perfctr_cpu_state *state) +{ + state->isuspend_cpu = NR_CPUS; +} + +#else +static inline void set_isuspend_cpu(struct perfctr_cpu_state *state, int cpu) { } +static inline int is_isuspend_cpu(const struct perfctr_cpu_state *state, int cpu) { return 1; } +static inline void clear_isuspend_cpu(struct perfctr_cpu_state *state) { } +#endif + + +static void ppc64_clear_counters(void) +{ + mtspr(SPRN_MMCR0, 0); + mtspr(SPRN_MMCR1, 0); + mtspr(SPRN_MMCRA, 0); + + mtspr(SPRN_PMC1, 0); + mtspr(SPRN_PMC2, 0); + mtspr(SPRN_PMC3, 0); + mtspr(SPRN_PMC4, 0); + mtspr(SPRN_PMC5, 0); + mtspr(SPRN_PMC6, 0); + + if (cpu_has_feature(CPU_FTR_PMC8)) { + mtspr(SPRN_PMC7, 0); + mtspr(SPRN_PMC8, 0); + } +} + +/* + * Driver methods, internal and exported. + */ + +static void perfctr_cpu_write_control(const struct perfctr_cpu_state *state) +{ + struct per_cpu_cache *cache; + unsigned long long value; + + cache = get_cpu_cache(); + /* + * Order matters here: update threshmult and event + * selectors before updating global control, which + * potentially enables PMIs. + * + * Since mtspr doesn't accept a runtime value for the + * SPR number, unroll the loop so each mtspr targets + * a constant SPR. + * + * For processors without MMCR2, we ensure that the + * cache and the state indicate the same value for it, + * preventing any actual mtspr to it. Ditto for MMCR1. 
+ */ + value = state->control.mmcra; + if (value != cache->ppc64_mmcra) { + cache->ppc64_mmcra = value; + mtspr(SPRN_MMCRA, value); + } + value = state->control.mmcr1; + if (value != cache->ppc64_mmcr1) { + cache->ppc64_mmcr1 = value; + mtspr(SPRN_MMCR1, value); + } + value = state->control.mmcr0; + if (perfctr_cstatus_has_ictrs(state->user.cstatus)) + value |= MMCR0_PMXE; + if (value != cache->ppc64_mmcr0) { + cache->ppc64_mmcr0 = value; + mtspr(SPRN_MMCR0, value); + } + cache->id = state->id; +} + +static void perfctr_cpu_read_counters(struct perfctr_cpu_state *state, + struct perfctr_low_ctrs *ctrs) +{ + unsigned int cstatus, i, pmc; + + cstatus = state->user.cstatus; + if (perfctr_cstatus_has_tsc(cstatus)) + ctrs->tsc = mftb() & 0xffffffff; + + for (i = 0; i < perfctr_cstatus_nractrs(cstatus); ++i) { + pmc = state->user.pmc[i].map; + ctrs->pmc[i] = read_pmc(pmc); + } +} + +#ifdef CONFIG_PERFCTR_INTERRUPT_SUPPORT +static void perfctr_cpu_isuspend(struct perfctr_cpu_state *state) +{ + unsigned int cstatus, nrctrs, i; + int cpu; + + cpu = smp_processor_id(); + set_isuspend_cpu(state, cpu); /* early to limit cpu's live range */ + cstatus = state->user.cstatus; + nrctrs = perfctr_cstatus_nrctrs(cstatus); + for (i = perfctr_cstatus_nractrs(cstatus); i < nrctrs; ++i) { + unsigned int pmc = state->user.pmc[i].map; + unsigned int now = read_pmc(pmc); + + state->user.pmc[i].sum += now - state->user.pmc[i].start; + state->user.pmc[i].start = now; + } +} + +static void perfctr_cpu_iresume(const struct perfctr_cpu_state *state) +{ + struct per_cpu_cache *cache; + unsigned int cstatus, nrctrs, i; + int cpu; + + cpu = smp_processor_id(); + cache = __get_cpu_cache(cpu); + if (cache->id == state->id) { + /* Clearing cache->id to force write_control() + to unfreeze MMCR0 would be done here, but it + is subsumed by resume()'s MMCR0 reload logic. */ + if (is_isuspend_cpu(state, cpu)) { + return; /* skip reload of PMCs */ + } + } + /* + * The CPU state wasn't ours. 
+ * + * The counters must be frozen before being reinitialised, + * to prevent unexpected increments and missed overflows. + * + * All unused counters must be reset to a non-overflow state. + */ + if (!(cache->ppc64_mmcr0 & MMCR0_FC)) { + cache->ppc64_mmcr0 |= MMCR0_FC; + mtspr(SPRN_MMCR0, cache->ppc64_mmcr0); + } + cstatus = state->user.cstatus; + nrctrs = perfctr_cstatus_nrctrs(cstatus); + for (i = perfctr_cstatus_nractrs(cstatus); i < nrctrs; ++i) { + write_pmc(state->user.pmc[i].map, state->user.pmc[i].start); + } +} + +/* Call perfctr_cpu_ireload() just before perfctr_cpu_resume() to + bypass internal caching and force a reload if the I-mode PMCs. */ +void perfctr_cpu_ireload(struct perfctr_cpu_state *state) +{ +#ifdef CONFIG_SMP + clear_isuspend_cpu(state); +#else + get_cpu_cache()->id = 0; +#endif +} + +/* PRE: the counters have been suspended and sampled by perfctr_cpu_suspend() */ +unsigned int perfctr_cpu_identify_overflow(struct perfctr_cpu_state *state) +{ + unsigned int cstatus, nractrs, nrctrs, i; + unsigned int pmc_mask = 0; + int nr_pmcs = 6; + + if (cpu_has_feature(CPU_FTR_PMC8)) + nr_pmcs = 8; + + cstatus = state->user.cstatus; + nractrs = perfctr_cstatus_nractrs(cstatus); + nrctrs = perfctr_cstatus_nrctrs(cstatus); + + /* Ickity, ickity, ick. We don't have fine enough interrupt + * control to disable interrupts on all the counters we're not + * interested in. So, we have to deal with overflows on actrs + * amd unused PMCs as well as the ones we actually care + * about. 
*/ + for (i = 0; i < nractrs; ++i) { + int pmc = state->user.pmc[i].map; + unsigned int val = read_pmc(pmc); + + /* For actrs, force a sample if they overflowed */ + + if ((int)val < 0) { + state->user.pmc[i].sum += val - state->user.pmc[i].start; + state->user.pmc[i].start = 0; + write_pmc(pmc, 0); + } + } + for (; i < nrctrs; ++i) { + if ((int)state->user.pmc[i].start < 0) { /* PPC64-specific */ + int pmc = state->user.pmc[i].map; + /* XXX: "+=" to correct for overshots */ + state->user.pmc[i].start = state->control.ireset[pmc]; + pmc_mask |= (1 << i); + } + } + + /* Clear any unused overflowed counters, so we don't loop on + * the interrupt */ + for (i = 0; i < nr_pmcs; ++i) { + if (! (state->unused_pmcs & (1<control.header.nractrs; + nrctrs = i + state->control.header.nrictrs; + for(; i < nrctrs; ++i) { + unsigned int pmc = state->user.pmc[i].map; + if ((int)state->control.ireset[pmc] < 0) /* PPC64-specific */ + return -EINVAL; + state->user.pmc[i].start = state->control.ireset[pmc]; + } + return 0; +} + +#else /* CONFIG_PERFCTR_INTERRUPT_SUPPORT */ +static inline void perfctr_cpu_isuspend(struct perfctr_cpu_state *state) { } +static inline void perfctr_cpu_iresume(const struct perfctr_cpu_state *state) { } +static inline int check_ireset(struct perfctr_cpu_state *state) { return 0; } +#endif /* CONFIG_PERFCTR_INTERRUPT_SUPPORT */ + +static int check_control(struct perfctr_cpu_state *state) +{ + unsigned int i, nractrs, nrctrs, pmc_mask, pmc; + unsigned int nr_pmcs = 6; + + if (cpu_has_feature(CPU_FTR_PMC8)) + nr_pmcs = 8; + + nractrs = state->control.header.nractrs; + nrctrs = nractrs + state->control.header.nrictrs; + if (nrctrs < nractrs || nrctrs > nr_pmcs) + return -EINVAL; + + pmc_mask = 0; + for (i = 0; i < nrctrs; ++i) { + pmc = state->control.pmc_map[i]; + state->user.pmc[i].map = pmc; + if (pmc >= nr_pmcs || (pmc_mask & (1<control.mmcr0 & MMCR0_PMXE) + || (state->control.mmcr0 & MMCR0_PMAO) + || (state->control.mmcr0 & MMCR0_TBEE) ) + return -EINVAL; 
+ + state->unused_pmcs = ((1 << nr_pmcs)-1) & ~pmc_mask; + + state->id = new_id(); + + return 0; +} + +int perfctr_cpu_update_control(struct perfctr_cpu_state *state, int is_global) +{ + int err; + + clear_isuspend_cpu(state); + state->user.cstatus = 0; + + /* disallow i-mode counters if we cannot catch the interrupts */ + if (!(perfctr_info.cpu_features & PERFCTR_FEATURE_PCINT) + && state->control.header.nrictrs) + return -EPERM; + + err = check_control(state); /* may initialise state->cstatus */ + if (err < 0) + return err; + err = check_ireset(state); + if (err < 0) + return err; + state->user.cstatus |= perfctr_mk_cstatus(state->control.header.tsc_on, + state->control.header.nractrs, + state->control.header.nrictrs); + return 0; +} + +/* + * get_reg_offset() maps SPR numbers to offsets into struct perfctr_cpu_control. + */ +static const struct { + unsigned int spr; + unsigned int offset; + unsigned int size; +} reg_offsets[] = { + { SPRN_MMCR0, offsetof(struct perfctr_cpu_control, mmcr0), sizeof(long) }, + { SPRN_MMCR1, offsetof(struct perfctr_cpu_control, mmcr1), sizeof(long) }, + { SPRN_MMCRA, offsetof(struct perfctr_cpu_control, mmcra), sizeof(long) }, + { SPRN_PMC1, offsetof(struct perfctr_cpu_control, ireset[1-1]), sizeof(int) }, + { SPRN_PMC2, offsetof(struct perfctr_cpu_control, ireset[2-1]), sizeof(int) }, + { SPRN_PMC3, offsetof(struct perfctr_cpu_control, ireset[3-1]), sizeof(int) }, + { SPRN_PMC4, offsetof(struct perfctr_cpu_control, ireset[4-1]), sizeof(int) }, + { SPRN_PMC5, offsetof(struct perfctr_cpu_control, ireset[5-1]), sizeof(int) }, + { SPRN_PMC6, offsetof(struct perfctr_cpu_control, ireset[6-1]), sizeof(int) }, + { SPRN_PMC7, offsetof(struct perfctr_cpu_control, ireset[7-1]), sizeof(int) }, + { SPRN_PMC8, offsetof(struct perfctr_cpu_control, ireset[8-1]), sizeof(int) }, +}; + +static int get_reg_offset(unsigned int spr, unsigned int *size) +{ + unsigned int i; + + for(i = 0; i < ARRAY_SIZE(reg_offsets); ++i) + if (spr == reg_offsets[i].spr) 
{ + *size = reg_offsets[i].size; + return reg_offsets[i].offset; + } + return -1; +} + +static int access_regs(struct perfctr_cpu_control *control, + void *argp, unsigned int argbytes, int do_write) +{ + struct perfctr_cpu_reg *regs; + unsigned int i, nr_regs, size; + int offset; + + nr_regs = argbytes / sizeof(struct perfctr_cpu_reg); + if (nr_regs * sizeof(struct perfctr_cpu_reg) != argbytes) + return -EINVAL; + regs = (struct perfctr_cpu_reg*)argp; + + for(i = 0; i < nr_regs; ++i) { + offset = get_reg_offset(regs[i].nr, &size); + if (offset < 0) + return -EINVAL; + if (size == sizeof(long)) { + unsigned long *where = (unsigned long*)((char*)control + offset); + if (do_write) + *where = regs[i].value; + else + regs[i].value = *where; + } else { + unsigned int *where = (unsigned int*)((char*)control + offset); + if (do_write) + *where = regs[i].value; + else + regs[i].value = *where; + } + } + return argbytes; +} + +int perfctr_cpu_control_write(struct perfctr_cpu_control *control, unsigned int domain, + const void *srcp, unsigned int srcbytes) +{ + if (domain != PERFCTR_DOMAIN_CPU_REGS) + return -EINVAL; + return access_regs(control, (void*)srcp, srcbytes, 1); +} + +int perfctr_cpu_control_read(const struct perfctr_cpu_control *control, unsigned int domain, + void *dstp, unsigned int dstbytes) +{ + if (domain != PERFCTR_DOMAIN_CPU_REGS) + return -EINVAL; + return access_regs((struct perfctr_cpu_control*)control, dstp, dstbytes, 0); +} + +void perfctr_cpu_suspend(struct perfctr_cpu_state *state) +{ + unsigned int i, cstatus; + struct perfctr_low_ctrs now; + + /* quiesce the counters */ + mtspr(SPRN_MMCR0, MMCR0_FC); + get_cpu_cache()->ppc64_mmcr0 = MMCR0_FC; + + if (perfctr_cstatus_has_ictrs(state->user.cstatus)) + perfctr_cpu_isuspend(state); + + perfctr_cpu_read_counters(state, &now); + cstatus = state->user.cstatus; + if (perfctr_cstatus_has_tsc(cstatus)) + state->user.tsc_sum += now.tsc - state->user.tsc_start; + + for (i = 0; i < 
perfctr_cstatus_nractrs(cstatus); ++i)
+		state->user.pmc[i].sum += now.pmc[i] - state->user.pmc[i].start;
+}
+
+void perfctr_cpu_resume(struct perfctr_cpu_state *state)
+{
+	struct perfctr_low_ctrs now;
+	unsigned int i, cstatus;
+
+	if (perfctr_cstatus_has_ictrs(state->user.cstatus))
+		perfctr_cpu_iresume(state);
+	perfctr_cpu_write_control(state);
+
+	perfctr_cpu_read_counters(state, &now);
+	cstatus = state->user.cstatus;
+	if (perfctr_cstatus_has_tsc(cstatus))
+		state->user.tsc_start = now.tsc;
+
+	for (i = 0; i < perfctr_cstatus_nractrs(cstatus); ++i)
+		state->user.pmc[i].start = now.pmc[i];
+
+	/* XXX: if (SMP && start.tsc == now.tsc) ++now.tsc; */
+}
+
+void perfctr_cpu_sample(struct perfctr_cpu_state *state)
+{
+	unsigned int i, cstatus, nractrs;
+	struct perfctr_low_ctrs now;
+
+	perfctr_cpu_read_counters(state, &now);
+	cstatus = state->user.cstatus;
+	if (perfctr_cstatus_has_tsc(cstatus)) {
+		state->user.tsc_sum += now.tsc - state->user.tsc_start;
+		state->user.tsc_start = now.tsc;
+	}
+	nractrs = perfctr_cstatus_nractrs(cstatus);
+	for(i = 0; i < nractrs; ++i) {
+		state->user.pmc[i].sum += now.pmc[i] - state->user.pmc[i].start;
+		state->user.pmc[i].start = now.pmc[i];
+	}
+}
+
+static void perfctr_cpu_clear_counters(void)
+{
+	struct per_cpu_cache *cache;
+
+	cache = get_cpu_cache();
+	memset(cache, 0, sizeof *cache);
+	cache->id = 0;
+
+	ppc64_clear_counters();
+}
+
+/****************************************************************
+ *								*
+ * Processor detection and initialisation procedures.		*
+ *								*
+ ****************************************************************/
+
+static void ppc64_cpu_setup(void)
+{
+	/* allow user to initialize these????
*/
+
+	unsigned long long mmcr0 = mfspr(SPRN_MMCR0);
+	unsigned long long mmcra = mfspr(SPRN_MMCRA);
+
+
+	ppc64_enable_pmcs();
+
+	mmcr0 |= MMCR0_FC;
+	mtspr(SPRN_MMCR0, mmcr0);
+
+	mmcr0 |= MMCR0_FCM1|MMCR0_PMXE|MMCR0_FCECE;
+	mmcr0 |= MMCR0_PMC1CE|MMCR0_PMCjCE;
+	mtspr(SPRN_MMCR0, mmcr0);
+
+	mmcra |= MMCRA_SAMPLE_ENABLE;
+	mtspr(SPRN_MMCRA, mmcra);
+
+	printk("setup on cpu %d, mmcr0 %lx\n", smp_processor_id(),
+	       mfspr(SPRN_MMCR0));
+	printk("setup on cpu %d, mmcr1 %lx\n", smp_processor_id(),
+	       mfspr(SPRN_MMCR1));
+	printk("setup on cpu %d, mmcra %lx\n", smp_processor_id(),
+	       mfspr(SPRN_MMCRA));
+
+/*	mtmsrd(mfmsr() | MSR_PMM); */
+
+	ppc64_clear_counters();
+
+	mmcr0 = mfspr(SPRN_MMCR0);
+	mmcr0 &= ~MMCR0_PMAO;
+	mmcr0 &= ~MMCR0_FC;
+	mtspr(SPRN_MMCR0, mmcr0);
+
+	printk("start on cpu %d, mmcr0 %llx\n", smp_processor_id(), mmcr0);
+}
+
+
+static void perfctr_cpu_clear_one(void *ignore)
+{
+	/* PREEMPT note: when called via on_each_cpu(),
+	   this is in IRQ context with preemption disabled. */
+	perfctr_cpu_clear_counters();
+}
+
+static void perfctr_cpu_reset(void)
+{
+	on_each_cpu(perfctr_cpu_clear_one, NULL, 1, 1);
+	perfctr_cpu_set_ihandler(NULL);
+}
+
+int __init perfctr_cpu_init(void)
+{
+	extern unsigned long ppc_proc_freq;
+	extern unsigned long ppc_tb_freq;
+
+	perfctr_info.cpu_features = PERFCTR_FEATURE_RDTSC
+		| PERFCTR_FEATURE_RDPMC | PERFCTR_FEATURE_PCINT;
+
+	perfctr_cpu_name = "PowerPC64";
+
+	perfctr_info.cpu_khz = ppc_proc_freq / 1000;
+	/* We need to round here rather than truncating, because in a
+	 * few cases the raw ratio can end up being 7.9999 or
+	 * suchlike */
+	perfctr_info.tsc_to_cpu_mult =
+		(ppc_proc_freq + ppc_tb_freq - 1) / ppc_tb_freq;
+
+	on_each_cpu((void *)ppc64_cpu_setup, NULL, 0, 1);
+
+	perfctr_ppc64_init_tests();
+
+	perfctr_cpu_reset();
+	return 0;
+}
+
+void __exit perfctr_cpu_exit(void)
+{
+	perfctr_cpu_reset();
+}
+
+/****************************************************************
+ *								*
+ * Hardware reservation.
*
+ *								*
+ ****************************************************************/
+
+static spinlock_t service_mutex = SPIN_LOCK_UNLOCKED;
+static const char *current_service = NULL;
+
+const char *perfctr_cpu_reserve(const char *service)
+{
+	const char *ret;
+
+	spin_lock(&service_mutex);
+
+	ret = current_service;
+	if (ret)
+		goto out;
+
+	ret = "unknown driver (oprofile?)";
+	if (reserve_pmc_hardware(do_perfctr_interrupt) != 0)
+		goto out;
+
+	current_service = service;
+	ret = NULL;
+
+ out:
+	spin_unlock(&service_mutex);
+	return ret;
+}
+
+void perfctr_cpu_release(const char *service)
+{
+	spin_lock(&service_mutex);
+
+	if (service != current_service) {
+		printk(KERN_ERR "%s: attempt by %s to release while reserved by %s\n",
+		       __FUNCTION__, service, current_service);
+		goto out;
+	}
+
+	/* power down the counters */
+	perfctr_cpu_reset();
+	current_service = NULL;
+	release_pmc_hardware();
+
+ out:
+	spin_unlock(&service_mutex);
+}
diff -rupN linux-2.6.12-rc1-mm4/drivers/perfctr/ppc64_tests.c linux-2.6.12-rc1-mm4.perfctr-ppc64-driver/drivers/perfctr/ppc64_tests.c
--- linux-2.6.12-rc1-mm4/drivers/perfctr/ppc64_tests.c	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.12-rc1-mm4.perfctr-ppc64-driver/drivers/perfctr/ppc64_tests.c	2005-03-31 23:37:37.000000000 +0200
@@ -0,0 +1,322 @@
+/*
+ * Performance-monitoring counters driver.
+ * Optional PPC64-specific init-time tests.
+ *
+ * Copyright (C) 2004 David Gibson, IBM Corporation.
+ * Copyright (C) 2004 Mikael Pettersson
+ */
+#include
+#include
+#include
+#include
+#include
+#include
+#include /* for tb_ticks_per_jiffy */
+#include "ppc64_tests.h"
+
+#define NITER	256
+#define X2(S)	S"; "S
+#define X8(S)	X2(X2(X2(S)))
+
+static void __init do_read_tbl(unsigned int unused)
+{
+	unsigned int i, dummy;
+	for(i = 0; i < NITER/8; ++i)
+		__asm__ __volatile__(X8("mftbl %0") : "=r"(dummy));
+}
+
+static void __init do_read_pmc1(unsigned int unused)
+{
+	unsigned int i, dummy;
+	for(i = 0; i < NITER/8; ++i)
+		__asm__ __volatile__(X8("mfspr %0," __stringify(SPRN_PMC1)) : "=r"(dummy));
+}
+
+static void __init do_read_pmc2(unsigned int unused)
+{
+	unsigned int i, dummy;
+	for(i = 0; i < NITER/8; ++i)
+		__asm__ __volatile__(X8("mfspr %0," __stringify(SPRN_PMC2)) : "=r"(dummy));
+}
+
+static void __init do_read_pmc3(unsigned int unused)
+{
+	unsigned int i, dummy;
+	for(i = 0; i < NITER/8; ++i)
+		__asm__ __volatile__(X8("mfspr %0," __stringify(SPRN_PMC3)) : "=r"(dummy));
+}
+
+static void __init do_read_pmc4(unsigned int unused)
+{
+	unsigned int i, dummy;
+	for(i = 0; i < NITER/8; ++i)
+		__asm__ __volatile__(X8("mfspr %0," __stringify(SPRN_PMC4)) : "=r"(dummy));
+}
+
+static void __init do_read_mmcr0(unsigned int unused)
+{
+	unsigned int i, dummy;
+	for(i = 0; i < NITER/8; ++i)
+		__asm__ __volatile__(X8("mfspr %0," __stringify(SPRN_MMCR0)) : "=r"(dummy));
+}
+
+static void __init do_read_mmcr1(unsigned int unused)
+{
+	unsigned int i, dummy;
+	for(i = 0; i < NITER/8; ++i)
+		__asm__ __volatile__(X8("mfspr %0," __stringify(SPRN_MMCR1)) : "=r"(dummy));
+}
+
+static void __init do_write_pmc2(unsigned int arg)
+{
+	unsigned int i;
+	for(i = 0; i < NITER/8; ++i)
+		__asm__ __volatile__(X8("mtspr " __stringify(SPRN_PMC2) ",%0") : : "r"(arg));
+}
+
+static void __init do_write_pmc3(unsigned int arg)
+{
+	unsigned int i;
+	for(i = 0; i < NITER/8; ++i)
+		__asm__ __volatile__(X8("mtspr " __stringify(SPRN_PMC3) ",%0") : : "r"(arg));
+}
+
+static void __init
do_write_pmc4(unsigned int arg)
+{
+	unsigned int i;
+	for(i = 0; i < NITER/8; ++i)
+		__asm__ __volatile__(X8("mtspr " __stringify(SPRN_PMC4) ",%0") : : "r"(arg));
+}
+
+static void __init do_write_mmcr1(unsigned int arg)
+{
+	unsigned int i;
+	for(i = 0; i < NITER/8; ++i)
+		__asm__ __volatile__(X8("mtspr " __stringify(SPRN_MMCR1) ",%0") : : "r"(arg));
+}
+
+static void __init do_write_mmcr0(unsigned int arg)
+{
+	unsigned int i;
+	for(i = 0; i < NITER/8; ++i)
+		__asm__ __volatile__(X8("mtspr " __stringify(SPRN_MMCR0) ",%0") : : "r"(arg));
+}
+
+static void __init do_empty_loop(unsigned int unused)
+{
+	unsigned i;
+	for(i = 0; i < NITER/8; ++i)
+		__asm__ __volatile__("" : : );
+}
+
+static unsigned __init run(void (*doit)(unsigned int), unsigned int arg)
+{
+	unsigned int start, stop;
+	start = mfspr(SPRN_PMC1);
+	(*doit)(arg);	/* should take < 2^32 cycles to complete */
+	stop = mfspr(SPRN_PMC1);
+	return stop - start;
+}
+
+static void __init init_tests_message(void)
+{
+#if 0
+	printk(KERN_INFO "Please email the following PERFCTR INIT lines "
+	       "to mikpe at csd.uu.se\n"
+	       KERN_INFO "To remove this message, rebuild the driver "
+	       "with CONFIG_PERFCTR_INIT_TESTS=n\n");
+	printk(KERN_INFO "PERFCTR INIT: PVR 0x%08x, CPU clock %u kHz, TB clock %lu kHz\n",
+	       pvr,
+	       perfctr_info.cpu_khz,
+	       tb_ticks_per_jiffy*(HZ/10)/(1000/10));
+#endif
+}
+
+static void __init clear(void)
+{
+	mtspr(SPRN_MMCR0, 0);
+	mtspr(SPRN_MMCR1, 0);
+	mtspr(SPRN_MMCRA, 0);
+	mtspr(SPRN_PMC1, 0);
+	mtspr(SPRN_PMC2, 0);
+	mtspr(SPRN_PMC3, 0);
+	mtspr(SPRN_PMC4, 0);
+	mtspr(SPRN_PMC5, 0);
+	mtspr(SPRN_PMC6, 0);
+	mtspr(SPRN_PMC7, 0);
+	mtspr(SPRN_PMC8, 0);
+}
+
+static void __init check_fcece(unsigned int pmc1ce)
+{
+	unsigned int mmcr0;
+	unsigned int pmc1;
+	int x = 0;
+
+	/* JHE check out section 1.6.6.2 of the POWER5 pdf */
+
+	/*
+	 * This test checks if MMCR0[FC] is set after PMC1 overflows
+	 * when MMCR0[FCECE] is set.
+	 * 74xx documentation states this behaviour, while documentation
+	 * for 604/750 processors doesn't mention this at all.
+	 *
+	 * Also output the value of PMC1 shortly after the overflow.
+	 * This tells us if PMC1 really was frozen. On 604/750, it may not
+	 * freeze since we don't enable PMIs. [No freeze confirmed on 750.]
+	 *
+	 * When pmc1ce == 0, MMCR0[PMC1CE] is zero. It's unclear whether
+	 * this masks all PMC1 overflow events or just PMC1 PMIs.
+	 *
+	 * PMC1 counts processor cycles, with 100 to go before overflowing.
+	 * FCECE is set.
+	 * PMC1CE is clear if !pmc1ce, otherwise set.
+	 */
+	pmc1 = mfspr(SPRN_PMC1);
+
+	mtspr(SPRN_PMC1, 0x80000000-100);
+	mmcr0 = MMCR0_FCECE | MMCR0_SHRFC;
+
+	if (pmc1ce)
+		mmcr0 |= MMCR0_PMC1CE;
+
+	mtspr(SPRN_MMCR0, mmcr0);
+
+	pmc1 = mfspr(SPRN_PMC1);
+
+	do {
+		do_empty_loop(0);
+
+		pmc1 = mfspr(SPRN_PMC1);
+		if (x++ > 20000000) {
+			break;
+		}
+	} while (!(mfspr(SPRN_PMC1) & 0x80000000));
+	do_empty_loop(0);
+
+	printk(KERN_INFO "PERFCTR INIT: %s(%u): MMCR0[FC] is %u, PMC1 is %#lx\n",
+	       __FUNCTION__, pmc1ce,
+	       !!(mfspr(SPRN_MMCR0) & MMCR0_FC), mfspr(SPRN_PMC1));
+	mtspr(SPRN_MMCR0, 0);
+	mtspr(SPRN_PMC1, 0);
+}
+
+static void __init check_trigger(unsigned int pmc1ce)
+{
+	unsigned int mmcr0;
+	unsigned int pmc1;
+	int x = 0;
+
+	/*
+	 * This test checks if MMCR0[TRIGGER] is reset after PMC1 overflows.
+	 * 74xx documentation states this behaviour, while documentation
+	 * for 604/750 processors doesn't mention this at all.
+	 * [No reset confirmed on 750.]
+	 *
+	 * Also output the values of PMC1 and PMC2 shortly after the overflow.
+	 * PMC2 should be equal to PMC1-0x80000000.
+	 *
+	 * When pmc1ce == 0, MMCR0[PMC1CE] is zero. It's unclear whether
+	 * this masks all PMC1 overflow events or just PMC1 PMIs.
+	 *
+	 * PMC1 counts processor cycles, with 100 to go before overflowing.
+	 * PMC2 counts processor cycles, starting from 0.
+	 * TRIGGER is set, so PMC2 doesn't start until PMC1 overflows.
+	 * PMC1CE is clear if !pmc1ce, otherwise set.
+	 */
+	mtspr(SPRN_PMC2, 0);
+	mtspr(SPRN_PMC1, 0x80000000-100);
+	mmcr0 = MMCR0_TRIGGER | MMCR0_SHRFC | MMCR0_FCHV;
+
+	if (pmc1ce)
+		mmcr0 |= MMCR0_PMC1CE;
+
+	mtspr(SPRN_MMCR0, mmcr0);
+	do {
+		do_empty_loop(0);
+		pmc1 = mfspr(SPRN_PMC1);
+		if (x++ > 20000000) {
+			break;
+		}
+
+	} while (!(mfspr(SPRN_PMC1) & 0x80000000));
+	do_empty_loop(0);
+	printk(KERN_INFO "PERFCTR INIT: %s(%u): MMCR0[TRIGGER] is %u, PMC1 is %#lx, PMC2 is %#lx\n",
+	       __FUNCTION__, pmc1ce,
+	       !!(mfspr(SPRN_MMCR0) & MMCR0_TRIGGER), mfspr(SPRN_PMC1), mfspr(SPRN_PMC2));
+	mtspr(SPRN_MMCR0, 0);
+	mtspr(SPRN_PMC1, 0);
+	mtspr(SPRN_PMC2, 0);
+}
+
+static void __init measure_overheads(void)
+{
+	int i;
+	unsigned int mmcr0, loop, ticks[12];
+	const char *name[12];
+
+	clear();
+
+	/* PMC1 = "processor cycles",
+	   PMC2 = "completed instructions",
+	   not disabled in any mode,
+	   no interrupts */
+	/* mmcr0 = (0x01 << 6) | (0x02 << 0); */
+	mmcr0 = MMCR0_SHRFC | MMCR0_FCWAIT;
+	mtspr(SPRN_MMCR0, mmcr0);
+
+	name[0] = "mftbl";
+	ticks[0] = run(do_read_tbl, 0);
+	name[1] = "mfspr (pmc1)";
+	ticks[1] = run(do_read_pmc1, 0);
+	name[2] = "mfspr (pmc2)";
+	ticks[2] = run(do_read_pmc2, 0);
+	name[3] = "mfspr (pmc3)";
+	ticks[3] = run(do_read_pmc3, 0);
+	name[4] = "mfspr (pmc4)";
+	ticks[4] = run(do_read_pmc4, 0);
+	name[5] = "mfspr (mmcr0)";
+	ticks[5] = run(do_read_mmcr0, 0);
+	name[6] = "mfspr (mmcr1)";
+	ticks[6] = run(do_read_mmcr1, 0);
+	name[7] = "mtspr (pmc2)";
+	ticks[7] = run(do_write_pmc2, 0);
+	name[8] = "mtspr (pmc3)";
+	ticks[8] = run(do_write_pmc3, 0);
+	name[9] = "mtspr (pmc4)";
+	ticks[9] = run(do_write_pmc4, 0);
+	name[10] = "mtspr (mmcr1)";
+	ticks[10] = run(do_write_mmcr1, 0);
+	name[11] = "mtspr (mmcr0)";
+	ticks[11] = run(do_write_mmcr0, mmcr0);
+
+	loop = run(do_empty_loop, 0);
+
+	clear();
+
+	init_tests_message();
+	printk(KERN_INFO "PERFCTR INIT: NITER == %u\n", NITER);
+	printk(KERN_INFO "PERFCTR INIT: loop overhead is %u cycles\n", loop);
+	for(i = 0; i < ARRAY_SIZE(ticks); ++i) {
+
		unsigned int x;
+		if (!ticks[i])
+			continue;
+		x = ((ticks[i] - loop) * 10) / NITER;
+		printk(KERN_INFO "PERFCTR INIT: %s cost is %u.%u cycles (%u total)\n",
+		       name[i], x/10, x%10, ticks[i]);
+	}
+
+	check_fcece(0);
+#if 0
+	check_fcece(1);
+	check_trigger(0);
+	check_trigger(1);
+#endif
+}
+
+void __init perfctr_ppc64_init_tests(void)
+{
+	preempt_disable();
+	measure_overheads();
+	preempt_enable();
+}
diff -rupN linux-2.6.12-rc1-mm4/drivers/perfctr/ppc64_tests.h linux-2.6.12-rc1-mm4.perfctr-ppc64-driver/drivers/perfctr/ppc64_tests.h
--- linux-2.6.12-rc1-mm4/drivers/perfctr/ppc64_tests.h	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.12-rc1-mm4.perfctr-ppc64-driver/drivers/perfctr/ppc64_tests.h	2005-03-31 23:37:37.000000000 +0200
@@ -0,0 +1,12 @@
+/*
+ * Performance-monitoring counters driver.
+ * Optional PPC64-specific init-time tests.
+ *
+ * Copyright (C) 2004 Mikael Pettersson
+ */
+
+#ifdef CONFIG_PERFCTR_INIT_TESTS
+extern void perfctr_ppc64_init_tests(void);
+#else
+static inline void perfctr_ppc64_init_tests(void) { }
+#endif
diff -rupN linux-2.6.12-rc1-mm4/include/asm-ppc64/perfctr.h linux-2.6.12-rc1-mm4.perfctr-ppc64-driver/include/asm-ppc64/perfctr.h
--- linux-2.6.12-rc1-mm4/include/asm-ppc64/perfctr.h	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.12-rc1-mm4.perfctr-ppc64-driver/include/asm-ppc64/perfctr.h	2005-03-31 23:37:37.000000000 +0200
@@ -0,0 +1,166 @@
+/*
+ * PPC64 Performance-Monitoring Counters driver
+ *
+ * Copyright (C) 2004 David Gibson, IBM Corporation.
+ * Copyright (C) 2004 Mikael Pettersson
+ */
+#ifndef _ASM_PPC64_PERFCTR_H
+#define _ASM_PPC64_PERFCTR_H
+
+#include
+
+struct perfctr_sum_ctrs {
+	__u64 tsc;
+	__u64 pmc[8];	/* the size is not part of the user ABI */
+};
+
+struct perfctr_cpu_control_header {
+	__u32 tsc_on;
+	__u32 nractrs;	/* number of accumulation-mode counters */
+	__u32 nrictrs;	/* number of interrupt-mode counters */
+};
+
+struct perfctr_cpu_state_user {
+	__u32 cstatus;
+	/* The two tsc fields must be inlined. Placing them in a
+	   sub-struct causes unwanted internal padding on x86-64. */
+	__u32 tsc_start;
+	__u64 tsc_sum;
+	struct {
+		__u32 map;
+		__u32 start;
+		__u64 sum;
+	} pmc[8];	/* the size is not part of the user ABI */
+};
+
+/* cstatus is a re-encoding of control.tsc_on/nractrs/nrictrs
+   which should have less overhead in most cases */
+/* XXX: ppc driver internally also uses cstatus&(1<<30) */
+
+static inline
+unsigned int perfctr_mk_cstatus(unsigned int tsc_on, unsigned int nractrs,
+				unsigned int nrictrs)
+{
+	return (tsc_on<<31) | (nrictrs<<16) | ((nractrs+nrictrs)<<8) | nractrs;
+}
+
+static inline unsigned int perfctr_cstatus_enabled(unsigned int cstatus)
+{
+	return cstatus;
+}
+
+static inline int perfctr_cstatus_has_tsc(unsigned int cstatus)
+{
+	return (int)cstatus < 0;	/* test and jump on sign */
+}
+
+static inline unsigned int perfctr_cstatus_nractrs(unsigned int cstatus)
+{
+	return cstatus & 0x7F;		/* and with imm8 */
+}
+
+static inline unsigned int perfctr_cstatus_nrctrs(unsigned int cstatus)
+{
+	return (cstatus >> 8) & 0x7F;
+}
+
+static inline unsigned int perfctr_cstatus_has_ictrs(unsigned int cstatus)
+{
+	return cstatus & (0x7F << 16);
+}
+
+/*
+ * 'struct siginfo' support for perfctr overflow signals.
+ * In unbuffered mode, si_code is set to SI_PMC_OVF and a bitmask
+ * describing which perfctrs overflowed is put in si_pmc_ovf_mask.
+ * A bitmask is used since more than one perfctr can have overflowed
+ * by the time the interrupt handler runs.
+ */
+#define SI_PMC_OVF	-8
+#define si_pmc_ovf_mask	_sifields._pad[0] /* XXX: use an unsigned field later */
+
+#ifdef __KERNEL__
+
+#if defined(CONFIG_PERFCTR)
+
+struct perfctr_cpu_control {
+	struct perfctr_cpu_control_header header;
+	u64 mmcr0;
+	u64 mmcr1;
+	u64 mmcra;
+	unsigned int ireset[8];		/* [0,0x7fffffff], for i-mode counters, physical indices */
+	unsigned int pmc_map[8];	/* virtual to physical index map */
+};
+
+struct perfctr_cpu_state {
+	/* Don't change field order here without first considering the number
+	   of cache lines touched during sampling and context switching. */
+	unsigned int id;
+	int isuspend_cpu;
+	struct perfctr_cpu_state_user user;
+	unsigned int unused_pmcs;
+	struct perfctr_cpu_control control;
+};
+
+/* Driver init/exit. */
+extern int perfctr_cpu_init(void);
+extern void perfctr_cpu_exit(void);
+
+/* CPU type name. */
+extern char *perfctr_cpu_name;
+
+/* Hardware reservation. */
+extern const char *perfctr_cpu_reserve(const char *service);
+extern void perfctr_cpu_release(const char *service);
+
+/* PRE: state has no running interrupt-mode counters.
+   Check that the new control data is valid.
+   Update the driver's private control data.
+   Returns a negative error code if the control data is invalid. */
+extern int perfctr_cpu_update_control(struct perfctr_cpu_state *state, int is_global);
+
+/* Parse and update control for the given domain. */
+extern int perfctr_cpu_control_write(struct perfctr_cpu_control *control,
+				     unsigned int domain,
+				     const void *srcp, unsigned int srcbytes);
+
+/* Retrieve and format control for the given domain.
+   Returns number of bytes written. */
+extern int perfctr_cpu_control_read(const struct perfctr_cpu_control *control,
+				    unsigned int domain,
+				    void *dstp, unsigned int dstbytes);
+
+/* Read a-mode counters. Subtract from start and accumulate into sums.
+   Must be called with preemption disabled.
*/
+extern void perfctr_cpu_suspend(struct perfctr_cpu_state *state);
+
+/* Write control registers. Read a-mode counters into start.
+   Must be called with preemption disabled. */
+extern void perfctr_cpu_resume(struct perfctr_cpu_state *state);
+
+/* Perform an efficient combined suspend/resume operation.
+   Must be called with preemption disabled. */
+extern void perfctr_cpu_sample(struct perfctr_cpu_state *state);
+
+/* The type of a perfctr overflow interrupt handler.
+   It will be called in IRQ context, with preemption disabled. */
+typedef void (*perfctr_ihandler_t)(unsigned long pc);
+
+/* Operations related to overflow interrupt handling. */
+#ifdef CONFIG_PERFCTR_INTERRUPT_SUPPORT
+extern void perfctr_cpu_set_ihandler(perfctr_ihandler_t);
+extern void perfctr_cpu_ireload(struct perfctr_cpu_state*);
+extern unsigned int perfctr_cpu_identify_overflow(struct perfctr_cpu_state*);
+#else
+static inline void perfctr_cpu_set_ihandler(perfctr_ihandler_t x) { }
+#endif
+static inline int perfctr_cpu_has_pending_interrupt(const struct perfctr_cpu_state *state)
+{
+	return 0;
+}
+
+#endif	/* CONFIG_PERFCTR */
+
+#endif	/* __KERNEL__ */
+
+#endif	/* _ASM_PPC64_PERFCTR_H */

From akpm at osdl.org Fri Apr 1 09:11:29 2005
From: akpm at osdl.org (Andrew Morton)
Date: Thu, 31 Mar 2005 15:11:29 -0800
Subject: [PATCH 2.6.12-rc1-mm5 1/3] perfctr: ppc64 arch hooks
In-Reply-To: <200503312207.j2VM7YUI011924@alkaid.it.uu.se>
References: <200503312207.j2VM7YUI011924@alkaid.it.uu.se>
Message-ID: <20050331151129.279b0618.akpm@osdl.org>

Mikael Pettersson wrote:
>
> Here's a 3-part patch kit which adds a ppc64 driver to perfctr,
> written by David Gibson .

Well that seems like progress. Where do we feel that we stand wrt
preparedness for merging all this up?
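
For readers following the perfctr.h hunk in the patch above: the cstatus
word packs tsc_on, nractrs and nrictrs into a single 32-bit value. Below
is a minimal user-space sketch of that packing. It assumes only the bit
layout visible in perfctr_mk_cstatus() and its accessors; the names
mk_cstatus() and cstatus_*() and the <stdint.h> types are illustrative
stand-ins, not the driver's own definitions.

```c
#include <stdint.h>

/*
 * Hypothetical user-space mirror of the layout used by
 * perfctr_mk_cstatus() in the patch:
 *   bit  31     tsc_on
 *   bits 16-22  nrictrs (interrupt-mode counters)
 *   bits  8-14  nractrs + nrictrs (all counters)
 *   bits  0-6   nractrs (accumulation-mode counters)
 */
static inline uint32_t mk_cstatus(uint32_t tsc_on, uint32_t nractrs,
				  uint32_t nrictrs)
{
	return (tsc_on << 31) | (nrictrs << 16) |
	       ((nractrs + nrictrs) << 8) | nractrs;
}

/* The TSC flag sits in the sign bit, so a signed compare tests it. */
static inline int cstatus_has_tsc(uint32_t cstatus)
{
	return (int32_t)cstatus < 0;
}

static inline uint32_t cstatus_nractrs(uint32_t cstatus)
{
	return cstatus & 0x7F;
}

static inline uint32_t cstatus_nrctrs(uint32_t cstatus)
{
	return (cstatus >> 8) & 0x7F;
}

/* Non-zero if any interrupt-mode counters are configured. */
static inline uint32_t cstatus_has_ictrs(uint32_t cstatus)
{
	return cstatus & (0x7Fu << 16);
}
```

With tsc_on = 1, nractrs = 3 and nrictrs = 2, mk_cstatus() yields
0x80020503, which is why the suspend/resume/sample loops in the driver
can read the a-mode counter count from the low byte alone.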

From david at gibson.dropbear.id.au Fri Apr 1 09:49:40 2005
From: david at gibson.dropbear.id.au (David Gibson)
Date: Fri, 1 Apr 2005 09:49:40 +1000
Subject: [PATCH 2.6.12-rc1-mm5 1/3] perfctr: ppc64 arch hooks
In-Reply-To: <20050331151129.279b0618.akpm@osdl.org>
References: <200503312207.j2VM7YUI011924@alkaid.it.uu.se> <20050331151129.279b0618.akpm@osdl.org>
Message-ID: <20050331234940.GA21676@localhost.localdomain>

On Thu, Mar 31, 2005 at 03:11:29PM -0800, Andrew Morton wrote:
> Mikael Pettersson wrote:
> >
> > Here's a 3-part patch kit which adds a ppc64 driver to perfctr,
> > written by David Gibson .
>
> Well that seems like progress. Where do we feel that we stand wrt
> preparedness for merging all this up?

I'm still uneasy about it. There were sufficient changes made getting
this one ready to go that I'm not confident there aren't more
important things to be found.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you. NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/people/dgibson

From akpm at osdl.org Fri Apr 1 11:33:02 2005
From: akpm at osdl.org (Andrew Morton)
Date: Thu, 31 Mar 2005 17:33:02 -0800
Subject: [PATCH 2.6.12-rc1-mm5 1/3] perfctr: ppc64 arch hooks
In-Reply-To: <20050331234940.GA21676@localhost.localdomain>
References: <200503312207.j2VM7YUI011924@alkaid.it.uu.se> <20050331151129.279b0618.akpm@osdl.org> <20050331234940.GA21676@localhost.localdomain>
Message-ID: <20050331173302.3ec64e59.akpm@osdl.org>

David Gibson wrote:
>
> On Thu, Mar 31, 2005 at 03:11:29PM -0800, Andrew Morton wrote:
> > Mikael Pettersson wrote:
> > >
> > > Here's a 3-part patch kit which adds a ppc64 driver to perfctr,
> > > written by David Gibson .
> >
> > Well that seems like progress. Where do we feel that we stand wrt
> > preparedness for merging all this up?
>
> I'm still uneasy about it.
> There were sufficient changes made getting
> this one ready to go that I'm not confident there aren't more
> important things to be found.

That's a bit open-ended. How do we determine whether more things will be
needed? How do we know when we're done?

From grundler at parisc-linux.org Fri Apr 1 16:08:34 2005
From: grundler at parisc-linux.org (Grant Grundler)
Date: Thu, 31 Mar 2005 23:08:34 -0700
Subject: Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]
In-Reply-To: <20050331200622.GG15596@austin.ibm.com>
References: <20050223002409.GA10909@austin.ibm.com> <20050223174356.GH13081@kroah.com> <1109207532.5384.32.camel@gaston> <20050224013137.GF2088@austin.ibm.com> <20050226063609.GC7036@colo.lackof.org> <20050321231028.GV498@austin.ibm.com> <20050322175728.GE12675@colo.lackof.org> <20050331200622.GG15596@austin.ibm.com>
Message-ID: <20050401060834.GB29734@colo.lackof.org>

On Thu, Mar 31, 2005 at 02:06:22PM -0600, Linas Vepstas wrote:
> > Does this process cause a SCSI bus reset?
>
> Don't get a chance to get that far. Have to bring up the PCI interfaces
> first, before any scsi command can be issued.

My point is you want the scsi bus to get reset so devices drop all
pending IO and stop trying to tell you how much work they've done.
I thought this was possible by banging on registers in the 53c8xx chips.

> > BTW, when did sym2 get a chance to cleanup "pending" requests?
>
> Yes, the sym2 driver has mechanisms for that.

Uhm, *when*?
It wasn't clear from your previous description.
I would take care of this *before* trying to get the card back on its feet.

> > You want everything moved back to the "queued" state or failed
> > (flush pending IO so upper layers can retry if they want).
>
> Upper layer is the linux block device; my understanding is that it does
> not retry, nor do the filesystems above that. Passing errors upwards
> seems to be pretty darned fatal. My goal is to limit retries to the
> driver.

That's a bad idea.
Been there done that.

Upper layers can be a lot smarter about retries than the driver ever
could be. While the driver knows more about the transport and why
something might fail, upper layers will know alternate paths
to the same devices or to the same data on different devices.
Upper layers also set the recovery policy for particular storage.

Trying to do recovery transparently in the drivers is going to also
mess up other high level SW like Service Guard or LifeKeeper.
They want to know when a path has failed, log it, and make sure
someone gets sent to service the HW if thresholds are exceeded.

Let higher layers like dm, VxFS, LVM worry about recovery.

> > > Sometimes, I get the PCI error while the card is sitting there idly
> > > after the #RST, but more often, I get the error in sym_chip_reset(),
> > > immediately after the OUTB (nc_istat, SRST);
> >
> > Oh? Is this the driver trying to issue SCSI Reset?
>
> No I am trying to reinitialize the scsi card after the pci bus has been
> reset. This has nothing to do with scsi bus resets, as far as I know
> ...

Ok. Sounds like the card hasn't yet recovered from the PCI Bus reset.
I don't know enough about programming 53c8xx chips to tell you
where in the process it's dying or why. If you collect traces of
which registers get read/written before it dies again, that would be
a necessary step for whoever tries to sort this out.

hth,
grant

From grundler at parisc-linux.org Fri Apr 1 16:15:08 2005
From: grundler at parisc-linux.org (Grant Grundler)
Date: Thu, 31 Mar 2005 23:15:08 -0700
Subject: Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]
In-Reply-To: <20050331201409.GH15596@austin.ibm.com>
References: <20050223002409.GA10909@austin.ibm.com> <20050223174356.GH13081@kroah.com> <1109207532.5384.32.camel@gaston> <20050224013137.GF2088@austin.ibm.com> <20050226063609.GC7036@colo.lackof.org> <20050321231028.GV498@austin.ibm.com> <4240581C.1000906@us.ibm.com> <20050331201409.GH15596@austin.ibm.com>
Message-ID: <20050401061508.GC29734@colo.lackof.org>

On Thu, Mar 31, 2005 at 02:14:09PM -0600, Linas Vepstas wrote:
> > What config registers are you restoring?
>
> BARs, grant, latency, interrupt, cacheline size.

"grant" is PCI_COMMAND?
If so, I think you have all of them.
You may want to leave BUS_MASTER disabled until you think the driver
is in a state where it needs to do DMA again. E.g. before kicking off the
scripts engine.

> > helps, or you could try power cycling the slot instead of using PCI reset.
>
> yes I could :( I'll try that next. Problem is, not all slots are
> power-cyclable, only the hotplug slots are. I've discovered that
> for example, the ethernet chips are soldered to the motherboard, and
> can't be power-cycled (but fortunately, those don't give me trouble).

They can if the NIC driver doesn't deal with programming the phy properly.
We had a problem with tg3 because of that in the past.
The phy doesn't get reset as part of the PCI Bus RESET.

grant

From mikpe at csd.uu.se Fri Apr 1 22:46:53 2005
From: mikpe at csd.uu.se (Mikael Pettersson)
Date: Fri, 1 Apr 2005 14:46:53 +0200
Subject: [PATCH 2.6.12-rc1-mm5 1/3] perfctr: ppc64 arch hooks
In-Reply-To: <20050331173302.3ec64e59.akpm@osdl.org>
References: <200503312207.j2VM7YUI011924@alkaid.it.uu.se> <20050331151129.279b0618.akpm@osdl.org> <20050331234940.GA21676@localhost.localdomain> <20050331173302.3ec64e59.akpm@osdl.org>
Message-ID: <16973.17085.561804.567539@alkaid.it.uu.se>

Andrew Morton writes:
 > David Gibson wrote:
 > >
 > > On Thu, Mar 31, 2005 at 03:11:29PM -0800, Andrew Morton wrote:
 > > > Mikael Pettersson wrote:
 > > > >
 > > > > Here's a 3-part patch kit which adds a ppc64 driver to perfctr,
 > > > > written by David Gibson .
 > > >
 > > > Well that seems like progress. Where do we feel that we stand wrt
 > > > preparedness for merging all this up?
 > >
 > > I'm still uneasy about it. There were sufficient changes made getting
 > > this one ready to go that I'm not confident there aren't more
 > > important things to be found.
 >
 > That's a bit open-ended. How do we determine whether more things will be
 > needed? How do we know when we're done?

I have two planned changes that will be done RSN:
- On x86/x86-64, user-space uses the mmap()ed state's TSC start
  value as a way to detect if a user-space sampling operation
  (which needs to be "virtually atomic") was preempted by the kernel.
  On ppc{32,64} we've used the TB for the same thing up to now,
  but that doesn't quite work because the TB is about an order of
  magnitude or two too slow. So the plan is to change ppc to store
  a software generation counter in the mmap()ed state, and change
  the ppc user-space to check that one instead.
- Move common stuff to .

In addition, there is one unresolved issue:
- A counter's value is represented by a 64-bit software sum,
  a 32-bit start value containing the HW counter's value at the
  start of the current time slice, and the current HW counter's value
  (now).
  The actual value is computed as sum + (now - start).
  This is reflected in the mmap()ed state, which contains a variable-
  length { u32 map; u32 start; u64 sum; } pmc[] array.
  This layout is very cache-efficient on current 32 and 64-bit CPUs,
  but there is a _possible_ concern that it won't do on 10+ GHz CPUs.
  So the question is, should we change it to use 64-bit start values
  already now (and take more cache misses), or should that wait a few
  years until it becomes a necessity (causing ABI change issues)?

/Mikael

From will_schmidt at vnet.ibm.com Sat Apr 2 00:05:47 2005
From: will_schmidt at vnet.ibm.com (will schmidt)
Date: Fri, 01 Apr 2005 08:05:47 -0600
Subject: RFC/Patch more xmon additions
In-Reply-To: <20050331202148.GI15596@austin.ibm.com>
References: <421E3BE3.90301@vnet.ibm.com> <16936.10223.704710.234312@cargo.ozlabs.ibm.com> <20050331202148.GI15596@austin.ibm.com>
Message-ID: <424D553B.60306@vnet.ibm.com>

Linas Vepstas wrote:
> Hi Will,
>
> I just unearthed this email from the deep mound ...
>
> On Fri, Mar 04, 2005 at 08:18:39PM +1100, Paul Mackerras was heard to remark:
>
>>will schmidt writes:
>>
>>
>>>Am looking for comments on this additional function i've added to xmon
>>>on the side..
>>>
>>>the bulk of my intent was to make it easier for me to poke at memory
>>>within a particular user process.
>>
>>The main problem I have with it is that we seem to be accessing a lot
>>of kernel data structures without checking any pointers or using
>>mread() to read the memory safely. One of the goals of xmon is that
>>it should be as reliable as possible even if kernel data structures
>>are corrupted, and I think your patch would reduce that reliability.
>
>
> Please clean up per Paul's suggestions and resubmit; as a matter of principle,
> it's nice to have the debugger print parsed output instead of having to count 289
> bytes into some struct task or such to manually decode a bitflag ...
>
> --linas

Yeah, it's still on the TODO list.
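
On the counter representation Mikael describes earlier in this thread
(value = sum + (now - start), with a u32 start and a u64 sum), a small
worked sketch may help. It assumes the { u32 map; u32 start; u64 sum; }
layout quoted from his message; struct pmc_state, pmc_value() and
wrap_seconds() are hypothetical names made up for illustration. Reading
the "10+ GHz" concern as the 32-bit wrap interval is an interpretation
here, not something the message states outright.

```c
#include <stdint.h>

/* Hypothetical mirror of one pmc[] entry in the mmap()ed state. */
struct pmc_state {
	uint32_t map;
	uint32_t start;	/* HW counter value at start of the time slice */
	uint64_t sum;	/* 64-bit software accumulator */
};

/*
 * value = sum + (now - start).  The subtraction is done in unsigned
 * 32-bit arithmetic, i.e. modulo 2^32, so the result stays correct if
 * the hardware counter wrapped at most once since 'start' was taken.
 */
static inline uint64_t pmc_value(const struct pmc_state *p, uint32_t now)
{
	return p->sum + (uint32_t)(now - p->start);
}

/*
 * Assumption: a counter that increments once per clock cycle.  The time
 * until a 32-bit counter wraps is then 2^32 / frequency (in Hz).
 */
static inline double wrap_seconds(double hz)
{
	return 4294967296.0 / hz;
}
```

For example, with start = 0xFFFFFF00, sum = 1000 and now = 0x00000010,
pmc_value() gives 1000 + 0x110 = 1272 despite the wrap. Under the
once-per-cycle assumption, wrap_seconds() is roughly 2.1 s at 2 GHz and
roughly 0.43 s at 10 GHz, so the window in which a 32-bit start value is
unambiguous shrinks as clock rates rise.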

From brking at us.ibm.com Sat Apr 2 01:27:22 2005
From: brking at us.ibm.com (Brian King)
Date: Fri, 01 Apr 2005 09:27:22 -0600
Subject: Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]
In-Reply-To: <20050401060834.GB29734@colo.lackof.org>
References: <20050223002409.GA10909@austin.ibm.com> <20050223174356.GH13081@kroah.com> <1109207532.5384.32.camel@gaston> <20050224013137.GF2088@austin.ibm.com> <20050226063609.GC7036@colo.lackof.org> <20050321231028.GV498@austin.ibm.com> <20050322175728.GE12675@colo.lackof.org> <20050331200622.GG15596@austin.ibm.com> <20050401060834.GB29734@colo.lackof.org>
Message-ID: <424D685A.6070505@us.ibm.com>

Grant Grundler wrote:
>>>You want everything moved back to the "queued" state or failed
>>>(flush pending IO so upper layers can retry if they want).
>>
>>Upper layer is the linux block device; my understanding is that it does
>>not retry, nor do the filesystems above that. Passing errors upwards
>>seems to be pretty darned fatal. My goal is to limit retries to the
>>driver.
>
>
> That's a bad idea. Been there done that.
>
> Upper layers can be a lot smarter about retries than the driver ever
> could be. While the driver knows more about the transport and why
> something might fail, upper layers will know alternate paths
> to the same devices or to the same data on different devices.
> Upper layers also set the recovery policy for particular storage.
>
> Trying to do recovery transparently in the drivers is going to also
> mess up other high level SW like Service Guard or LifeKeeper.
> They want to know when a path has failed, log it, and make sure
> someone gets sent to service the HW if thresholds are exceeded.
>
> Let higher layers like dm, VxFS, LVM worry about recovery.

The sym2 driver should fail everything back with DID_ERROR. In most
cases, the scsi midlayer will retry if the upper layer allows retries
and you will get the behavior you desire.
If retries are not allowed, like for a tape device, the command will get failed back to the upper layer driver. -- Brian King eServer Storage I/O IBM Linux Technology Center From akpm at osdl.org Sat Apr 2 04:25:14 2005 From: akpm at osdl.org (Andrew Morton) Date: Fri, 1 Apr 2005 10:25:14 -0800 Subject: [PATCH 2.6.12-rc1-mm5 1/3] perfctr: ppc64 arch hooks In-Reply-To: <16973.17085.561804.567539@alkaid.it.uu.se> References: <200503312207.j2VM7YUI011924@alkaid.it.uu.se> <20050331151129.279b0618.akpm@osdl.org> <20050331234940.GA21676@localhost.localdomain> <20050331173302.3ec64e59.akpm@osdl.org> <16973.17085.561804.567539@alkaid.it.uu.se> Message-ID: <20050401102514.505ad059.akpm@osdl.org> Mikael Pettersson wrote: > > In addition, there is one unresolved issue: > - A counter's value is represented by a 64-bit software sum, > a 32-bit start value containing the HW counter's value at the > start of the current time slice, and the current HW counter's value > (now). The actual value is computed as sum + (now - start). > This is reflected in the mmap()ed state, which contains a variable- > length { u32 map; u32 start; u64 sum; } pmc[] array. > This layout is very cache-efficient on current 32 and 64-bit CPUs, > but there is a _possible_ concern that it won't do on 10+ GHz CPUs. > So the question is, should we change it to use 64-bit start values > already now (and take more cache misses), or should that wait a few > years until it becomes a necessity (causing ABI change issues)? I'd be inclined to make the change now, personally. ABI changes are a pain for everyone. From benh at kernel.crashing.org Sun Apr 3 11:16:28 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Sun, 03 Apr 2005 11:16:28 +1000 Subject: [PATCH] ppc64: Fix boot memory corruption Message-ID: <1112490989.6577.255.camel@gaston> Hi ! Nathan's patch "make OF node fixup code usable at runtim" is introducing a snaky bug. 
We do 2 passes over this code, one to measure how much memory will be needed so we can allocate a single block, and one to do the actual fixup. However, the new code does some result-checking of prom_alloc() which breaks this mecanism, as the first pass always starts at "0", thus we fail to measure the additional size properly and allocate a block smaller than what we'll actually use for the fixup. This cause us to override whatever sits there, with variable results depending on the memory layout of the machine (but typically crashes). This patch fixes it by starting the "measure" pass with an initial size set to 16 and not 0. Signed-off-by: Benjamin Herrenschmidt Index: linux-work/arch/ppc64/kernel/prom.c =================================================================== --- linux-work.orig/arch/ppc64/kernel/prom.c 2005-04-03 10:02:55.000000000 +1000 +++ linux-work/arch/ppc64/kernel/prom.c 2005-04-03 11:08:18.000000000 +1000 @@ -601,8 +601,19 @@ /* Initialize virtual IRQ map */ virt_irq_init(); - /* Finish device-tree (pre-parsing some properties etc...) */ + /* + * Finish device-tree (pre-parsing some properties etc...) + * We do this in 2 passes. One with "measure_only" set, which + * will only measure the amount of memory needed, then we can + * allocate that memory, and call finish_node again. However, + * we must be careful as most routines will fail nowadays when + * prom_alloc() returns 0, so we must make sure our first pass + * doesn't start at 0. 
We pre-initialize size to 16 for that + * reason and then remove those additional 16 bytes + */ + size = 16; finish_node(allnodes, &size, NULL, 0, 0, 1); + size -= 16; end = start = (unsigned long)abs_to_virt(lmb_alloc(size, 128)); finish_node(allnodes, &end, NULL, 0, 0, 0); BUG_ON(end != start + size); From david at gibson.dropbear.id.au Mon Apr 4 13:25:19 2005 From: david at gibson.dropbear.id.au (David Gibson) Date: Mon, 4 Apr 2005 13:25:19 +1000 Subject: [PATCH 2.6.12-rc1-mm5 1/3] perfctr: ppc64 arch hooks In-Reply-To: <16973.17085.561804.567539@alkaid.it.uu.se> References: <200503312207.j2VM7YUI011924@alkaid.it.uu.se> <20050331151129.279b0618.akpm@osdl.org> <20050331234940.GA21676@localhost.localdomain> <20050331173302.3ec64e59.akpm@osdl.org> <16973.17085.561804.567539@alkaid.it.uu.se> Message-ID: <20050404032519.GB29805@localhost.localdomain> On Fri, Apr 01, 2005 at 02:46:53PM +0200, Mikael Pettersson wrote: > Andrew Morton writes: > > David Gibson wrote: > > > > > > On Thu, Mar 31, 2005 at 03:11:29PM -0800, Andrew Morton wrote: > > > > Mikael Pettersson wrote: > > > > > > > > > > Here's a 3-part patch kit which adds a ppc64 driver to perfctr, > > > > > written by David Gibson . > > > > > > > > Well that seems like progress. Where do we feel that we stand wrt > > > > preparedness for merging all this up? > > > > > > I'm still uneasy about it. There were sufficient changes made getting > > > this one ready to go that I'm not confident there aren't more > > > important things to be found. > > > > That's a bit open-ended. How do we determine whether more things will be > > needed? How do we know when we're done? > > I have two planned changes that will be done RSN: > - On x86/x86-64, user-space uses the mmap()ed state's TSC start > value as a way to detect if a user-space sampling operation > (which needs to be "virtually atomic") was preempted by the kernel. 
> On ppc{32,64} we've used the TB for the same thing up to now, > but that doesn't quite work because the TB is about a magnitude > or two too slow. So the plan is to change ppc to store a > software generation counter in the mmap()ed state, and change > the ppc user-space to check that one instead. If we're going to do it for ppc, we might as well do it for all platforms. That gets us one step closer to eliminating cstatus from the user visible stuff, too, which I think should be done. > - Move common stuff to . > > In addition, there is one unresolved issue: > - A counter's value is represented by a 64-bit software sum, > a 32-bit start value containing the HW counter's value at the > start of the current time slice, and the current HW counter's value > (now). The actual value is computed as sum + (now - start). > This is reflected in the mmap()ed state, which contains a variable- > length { u32 map; u32 start; u64 sum; } pmc[] array. > This layout is very cache-efficient on current 32 and 64-bit CPUs, > but there is a _possible_ concern that it won't do on 10+ GHz CPUs. > So the question is, should we change it to use 64-bit start values > already now (and take more cache misses), or should that wait a few > years until it becomes a necessity (causing ABI change issues)? Is there any way we could rearrange the user visible stuff to not include the 'map' field? After all userspace set up the counters, so it ought to know what the mapping is already... That would mean we could fit in a 64-bit start value without having to mess around to get good alignment. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! 
http://www.ozlabs.org/people/dgibson From benh at kernel.crashing.org Mon Apr 4 17:28:01 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Mon, 04 Apr 2005 17:28:01 +1000 Subject: [PATCH] ppc64: Fix semantics of __ioremap Message-ID: <1112599682.26085.35.camel@gaston> Hi ! This patch fixes ppc64 __ioremap() so that it stops adding implicitely _PAGE_GUARDED when the cache is not writeback, and instead, let the callers provide the flag they want here. This allows things like framebuffers to explicitely request a non-cacheable and non-guarded mapping which is more efficient for that type of memory without side effects. The patch also fixes all current callers to add _PAGE_GUARDED except btext, which is fine without it. Signed-off-by: Benjamin Herrenschmidt Index: linux-work/arch/ppc64/kernel/pSeries_setup.c =================================================================== --- linux-work.orig/arch/ppc64/kernel/pSeries_setup.c 2005-03-10 13:43:01.000000000 +1100 +++ linux-work/arch/ppc64/kernel/pSeries_setup.c 2005-04-04 17:18:34.000000000 +1000 @@ -363,7 +363,7 @@ find_udbg_vterm(); else if (physport) { /* Map the uart for udbg. */ - comport = (void *)__ioremap(physport, 16, _PAGE_NO_CACHE); + comport = (void *)ioremap(physport, 16); udbg_init_uart(comport, default_speed); ppc_md.udbg_putc = udbg_putc; Index: linux-work/arch/ppc64/kernel/maple_setup.c =================================================================== --- linux-work.orig/arch/ppc64/kernel/maple_setup.c 2005-01-31 14:18:14.000000000 +1100 +++ linux-work/arch/ppc64/kernel/maple_setup.c 2005-04-04 17:18:49.000000000 +1000 @@ -142,7 +142,7 @@ if (physport) { void *comport; /* Map the uart for udbg. 
*/ - comport = (void *)__ioremap(physport, 16, _PAGE_NO_CACHE); + comport = (void *)ioremap(physport, 16); udbg_init_uart(comport, default_speed); ppc_md.udbg_putc = udbg_putc; Index: linux-work/arch/ppc64/kernel/pci.c =================================================================== --- linux-work.orig/arch/ppc64/kernel/pci.c 2005-04-03 10:02:55.000000000 +1000 +++ linux-work/arch/ppc64/kernel/pci.c 2005-04-04 17:18:05.000000000 +1000 @@ -547,8 +547,9 @@ if (range == NULL || (rlen < sizeof(struct isa_range))) { printk(KERN_ERR "no ISA ranges or unexpected isa range size," "mapping 64k\n"); - __ioremap_explicit(phb_io_base_phys, (unsigned long)phb_io_base_virt, - 0x10000, _PAGE_NO_CACHE); + __ioremap_explicit(phb_io_base_phys, + (unsigned long)phb_io_base_virt, + 0x10000, _PAGE_NO_CACHE | _PAGE_GUARDED); return; } @@ -576,7 +577,7 @@ __ioremap_explicit(phb_io_base_phys, (unsigned long) phb_io_base_virt, - size, _PAGE_NO_CACHE); + size, _PAGE_NO_CACHE | _PAGE_GUARDED); } } @@ -692,7 +693,7 @@ struct resource *res; hose->io_base_virt = __ioremap(hose->io_base_phys, size, - _PAGE_NO_CACHE); + _PAGE_NO_CACHE | _PAGE_GUARDED); DBG("phb%d io_base_phys 0x%lx io_base_virt 0x%lx\n", hose->global_number, hose->io_base_phys, (unsigned long) hose->io_base_virt); @@ -780,7 +781,8 @@ if (get_bus_io_range(bus, &start_phys, &start_virt, &size)) return 1; printk("mapping IO %lx -> %lx, size: %lx\n", start_phys, start_virt, size); - if (__ioremap_explicit(start_phys, start_virt, size, _PAGE_NO_CACHE)) + if (__ioremap_explicit(start_phys, start_virt, size, + _PAGE_NO_CACHE | _PAGE_GUARDED)) return 1; return 0; Index: linux-work/arch/ppc64/mm/init.c =================================================================== --- linux-work.orig/arch/ppc64/mm/init.c 2005-04-03 10:02:55.000000000 +1000 +++ linux-work/arch/ppc64/mm/init.c 2005-04-04 17:17:01.000000000 +1000 @@ -155,7 +155,8 @@ ptep = pte_alloc_kernel(&ioremap_mm, pmdp, ea); pa = abs_to_phys(pa); - set_pte_at(&ioremap_mm, ea, 
ptep, pfn_pte(pa >> PAGE_SHIFT, __pgprot(flags))); + set_pte_at(&ioremap_mm, ea, ptep, pfn_pte(pa >> PAGE_SHIFT, + __pgprot(flags))); spin_unlock(&ioremap_mm.page_table_lock); } else { unsigned long va, vpn, hash, hpteg; @@ -191,12 +192,9 @@ if ((flags & _PAGE_PRESENT) == 0) flags |= pgprot_val(PAGE_KERNEL); - if (flags & (_PAGE_NO_CACHE | _PAGE_WRITETHRU)) - flags |= _PAGE_GUARDED; - for (i = 0; i < size; i += PAGE_SIZE) { + for (i = 0; i < size; i += PAGE_SIZE) map_io_page(ea+i, pa+i, flags); - } return (void __iomem *) (ea + (addr & ~PAGE_MASK)); } @@ -205,7 +203,7 @@ void __iomem * ioremap(unsigned long addr, unsigned long size) { - return __ioremap(addr, size, _PAGE_NO_CACHE); + return __ioremap(addr, size, _PAGE_NO_CACHE | _PAGE_GUARDED); } void __iomem * @@ -272,7 +270,8 @@ return 1; } if (ea != (unsigned long) area->addr) { - printk(KERN_ERR "unexpected addr return from im_get_area\n"); + printk(KERN_ERR "unexpected addr return from " + "im_get_area\n"); return 1; } } @@ -315,7 +314,8 @@ continue; if (pte_present(page)) continue; - printk(KERN_CRIT "Whee.. Swapped out page in kernel page table\n"); + printk(KERN_CRIT "Whee.. Swapped out page in kernel page" + " table\n"); } while (address < end); } @@ -352,7 +352,7 @@ * Access to IO memory should be serialized by driver. * This code is modeled after vmalloc code - unmap_vm_area() * - * XXX what about calls before mem_init_done (ie python_countermeasures()) + * XXX what about calls before mem_init_done (ie python_countermeasures()) */ void iounmap(volatile void __iomem *token) { From benh at kernel.crashing.org Tue Apr 5 16:40:57 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Tue, 05 Apr 2005 16:40:57 +1000 Subject: [PATCH] ppc32: Fix AGP and sleep again Message-ID: <1112683258.9567.13.camel@gaston> Hi ! My previous patch that added sleep support for uninorth-agp and some AGP "off" stuff in radeonfb and aty128fb is breaking some configs. 
More specifically, it has problems with rage128 setups since the DRI code for these in X doesn't properly re-enable AGP on wakeup or console switch (unlike the radeon DRM). This patch fixes the problem for pmac once for all by using a different approach. The AGP driver "registers" special suspend/resume callbacks with some arch code that the fbdev's can later on call to suspend and resume AGP, making sure it's resumed back in the same state it was when suspended. This is platform specific for now. It would be too complicated to try to do a generic implementation of this at this point due to all sort of weird things going on with AGP on other architectures. We'll re-work that whole problem cleanly once we finally merge fbdev's and DRI. In the meantime, please apply this patch which brings back some r128 based laptops into working condition as far as system sleep is concerned. Signed-off-by: Benjamin Herrenschmidt Index: linux-work/drivers/char/agp/uninorth-agp.c =================================================================== --- linux-work.orig/drivers/char/agp/uninorth-agp.c 2005-03-15 11:57:17.000000000 +1100 +++ linux-work/drivers/char/agp/uninorth-agp.c 2005-04-05 15:20:29.000000000 +1000 @@ -10,6 +10,7 @@ #include #include #include +#include #include "agp.h" /* @@ -26,6 +27,7 @@ static int uninorth_rev; static int is_u3; + static int uninorth_fetch_size(void) { int i; @@ -264,7 +266,8 @@ &scratch); } while ((scratch & PCI_AGP_COMMAND_AGP) == 0 && ++timeout < 1000); if ((scratch & PCI_AGP_COMMAND_AGP) == 0) - printk(KERN_ERR PFX "failed to write UniNorth AGP command reg\n"); + printk(KERN_ERR PFX "failed to write UniNorth AGP" + " command register\n"); if (uninorth_rev >= 0x30) { /* This is an AGP V3 */ @@ -278,13 +281,24 @@ } #ifdef CONFIG_PM -static int agp_uninorth_suspend(struct pci_dev *pdev, pm_message_t state) +/* + * These Power Management routines are _not_ called by the normal PCI PM layer, + * but directly by the video driver through function 
pointers in the device + * tree. + */ +static int agp_uninorth_suspend(struct pci_dev *pdev) { + struct agp_bridge_data *bridge; u32 cmd; u8 agp; struct pci_dev *device = NULL; - if (state != PMSG_SUSPEND) + bridge = agp_find_bridge(pdev); + if (bridge == NULL) + return -ENODEV; + + /* Only one suspend supported */ + if (bridge->dev_private_data) return 0; /* turn off AGP on the video chip, if it was enabled */ @@ -309,12 +323,13 @@ printk("uninorth-agp: disabling AGP on device %s\n", pci_name(device)); cmd &= ~PCI_AGP_COMMAND_AGP; - pci_write_config_dword(device, agp + PCI_AGP_COMMAND, cmd); + pci_write_config_dword(device, agp + PCI_AGP_COMMAND, cmd); } /* turn off AGP on the bridge */ agp = pci_find_capability(pdev, PCI_CAP_ID_AGP); pci_read_config_dword(pdev, agp + PCI_AGP_COMMAND, &cmd); + bridge->dev_private_data = (void *)cmd; if (cmd & PCI_AGP_COMMAND_AGP) { printk("uninorth-agp: disabling AGP on bridge %s\n", pci_name(pdev)); @@ -329,9 +344,23 @@ static int agp_uninorth_resume(struct pci_dev *pdev) { + struct agp_bridge_data *bridge; + u32 command; + + bridge = agp_find_bridge(pdev); + if (bridge == NULL) + return -ENODEV; + + command = (u32)bridge->dev_private_data; + bridge->dev_private_data = NULL; + if (!(command & PCI_AGP_COMMAND_AGP)) + return 0; + + uninorth_agp_enable(bridge, command); + return 0; } -#endif +#endif /* CONFIG_PM */ static int uninorth_create_gatt_table(struct agp_bridge_data *bridge) { @@ -575,6 +604,12 @@ of_node_put(uninorth_node); } +#ifdef CONFIG_PM + /* Inform platform of our suspend/resume caps */ + pmac_register_agp_pm(pdev, agp_uninorth_suspend, agp_uninorth_resume); +#endif + + /* Allocate & setup our driver */ bridge = agp_alloc_bridge(); if (!bridge) return -ENOMEM; @@ -599,6 +634,11 @@ { struct agp_bridge_data *bridge = pci_get_drvdata(pdev); +#ifdef CONFIG_PM + /* Inform platform of our suspend/resume caps */ + pmac_register_agp_pm(pdev, NULL, NULL); +#endif + agp_remove_bridge(bridge); agp_put_bridge(bridge); } @@ 
-622,10 +662,6 @@ .id_table = agp_uninorth_pci_table, .probe = agp_uninorth_probe, .remove = agp_uninorth_remove, -#ifdef CONFIG_PM - .suspend = agp_uninorth_suspend, - .resume = agp_uninorth_resume, -#endif }; static int __init agp_uninorth_init(void) Index: linux-work/arch/ppc64/kernel/pmac_feature.c =================================================================== --- linux-work.orig/arch/ppc64/kernel/pmac_feature.c 2005-03-29 15:44:35.000000000 +1000 +++ linux-work/arch/ppc64/kernel/pmac_feature.c 2005-04-05 14:39:52.000000000 +1000 @@ -674,3 +674,67 @@ dump_HT_speeds("PCI-X HT Downlink", cfg, freq); #endif } + +/* + * Early video resume hook + */ + +static void (*pmac_early_vresume_proc)(void *data) __pmacdata; +static void *pmac_early_vresume_data __pmacdata; + +void pmac_set_early_video_resume(void (*proc)(void *data), void *data) +{ + if (_machine != _MACH_Pmac) + return; + preempt_disable(); + pmac_early_vresume_proc = proc; + pmac_early_vresume_data = data; + preempt_enable(); +} +EXPORT_SYMBOL(pmac_set_early_video_resume); + + +/* + * AGP related suspend/resume code + */ + +static struct pci_dev *pmac_agp_bridge __pmacdata; +static int (*pmac_agp_suspend)(struct pci_dev *bridge) __pmacdata; +static int (*pmac_agp_resume)(struct pci_dev *bridge) __pmacdata; + +void __pmac pmac_register_agp_pm(struct pci_dev *bridge, + int (*suspend)(struct pci_dev *bridge), + int (*resume)(struct pci_dev *bridge)) +{ + if (suspend || resume) { + pmac_agp_bridge = bridge; + pmac_agp_suspend = suspend; + pmac_agp_resume = resume; + return; + } + if (bridge != pmac_agp_bridge) + return; + pmac_agp_suspend = pmac_agp_resume = NULL; + return; +} +EXPORT_SYMBOL(pmac_register_agp_pm); + +void __pmac pmac_suspend_agp_for_card(struct pci_dev *dev) +{ + if (pmac_agp_bridge == NULL || pmac_agp_suspend == NULL) + return; + if (pmac_agp_bridge->bus != dev->bus) + return; + pmac_agp_suspend(pmac_agp_bridge); +} +EXPORT_SYMBOL(pmac_suspend_agp_for_card); + +void __pmac 
pmac_resume_agp_for_card(struct pci_dev *dev) +{ + if (pmac_agp_bridge == NULL || pmac_agp_resume == NULL) + return; + if (pmac_agp_bridge->bus != dev->bus) + return; + pmac_agp_resume(pmac_agp_bridge); +} +EXPORT_SYMBOL(pmac_resume_agp_for_card); Index: linux-work/arch/ppc/platforms/pmac_feature.c =================================================================== --- linux-work.orig/arch/ppc/platforms/pmac_feature.c 2005-04-05 14:29:30.000000000 +1000 +++ linux-work/arch/ppc/platforms/pmac_feature.c 2005-04-05 15:20:06.000000000 +1000 @@ -2944,3 +2944,48 @@ if (pmac_early_vresume_proc) pmac_early_vresume_proc(pmac_early_vresume_data); } + +/* + * AGP related suspend/resume code + */ + +static struct pci_dev *pmac_agp_bridge __pmacdata; +static int (*pmac_agp_suspend)(struct pci_dev *bridge) __pmacdata; +static int (*pmac_agp_resume)(struct pci_dev *bridge) __pmacdata; + +void __pmac pmac_register_agp_pm(struct pci_dev *bridge, + int (*suspend)(struct pci_dev *bridge), + int (*resume)(struct pci_dev *bridge)) +{ + if (suspend || resume) { + pmac_agp_bridge = bridge; + pmac_agp_suspend = suspend; + pmac_agp_resume = resume; + return; + } + if (bridge != pmac_agp_bridge) + return; + pmac_agp_suspend = pmac_agp_resume = NULL; + return; +} +EXPORT_SYMBOL(pmac_register_agp_pm); + +void __pmac pmac_suspend_agp_for_card(struct pci_dev *dev) +{ + if (pmac_agp_bridge == NULL || pmac_agp_suspend == NULL) + return; + if (pmac_agp_bridge->bus != dev->bus) + return; + pmac_agp_suspend(pmac_agp_bridge); +} +EXPORT_SYMBOL(pmac_suspend_agp_for_card); + +void __pmac pmac_resume_agp_for_card(struct pci_dev *dev) +{ + if (pmac_agp_bridge == NULL || pmac_agp_resume == NULL) + return; + if (pmac_agp_bridge->bus != dev->bus) + return; + pmac_agp_resume(pmac_agp_bridge); +} +EXPORT_SYMBOL(pmac_resume_agp_for_card); Index: linux-work/drivers/video/aty/radeon_pm.c =================================================================== --- linux-work.orig/drivers/video/aty/radeon_pm.c 
2005-04-01 09:04:19.000000000 +1000 +++ linux-work/drivers/video/aty/radeon_pm.c 2005-04-05 15:21:54.000000000 +1000 @@ -2520,13 +2520,10 @@ } -static/*extern*/ int susdisking = 0; - int radeonfb_pci_suspend(struct pci_dev *pdev, pm_message_t state) { struct fb_info *info = pci_get_drvdata(pdev); struct radeonfb_info *rinfo = info->par; - u8 agp; int i; if (state == pdev->dev.power.power_state) @@ -2542,11 +2539,6 @@ */ if (state != PM_SUSPEND_MEM) goto done; - if (susdisking) { - printk("radeonfb (%s): suspending to disk but state = %d\n", - pci_name(pdev), state); - goto done; - } acquire_console_sem(); @@ -2567,27 +2559,13 @@ rinfo->lock_blank = 1; del_timer_sync(&rinfo->lvds_timer); - /* Disable AGP. The AGP host should have done it, but since ordering - * isn't always properly guaranteed in this specific case, let's make - * sure it's disabled on card side now. Ultimately, when merging fbdev - * and dri into some common infrastructure, this will be handled - * more nicely. The host bridge side will (or will not) be dealt with - * by the bridge AGP driver, we don't attempt to touch it here. +#ifdef CONFIG_PPC_PMAC + /* On powermac, we have hooks to properly suspend/resume AGP now, + * use them here. 
We'll ultimately need some generic support here, + * but the generic code isn't quite ready for that yet */ - agp = pci_find_capability(pdev, PCI_CAP_ID_AGP); - if (agp) { - u32 cmd; - - pci_read_config_dword(pdev, agp + PCI_AGP_COMMAND, &cmd); - if (cmd & PCI_AGP_COMMAND_AGP) { - printk(KERN_INFO "radeonfb (%s): AGP was enabled, " - "disabling ...\n", - pci_name(pdev)); - cmd &= ~PCI_AGP_COMMAND_AGP; - pci_write_config_dword(pdev, agp + PCI_AGP_COMMAND, - cmd); - } - } + pmac_suspend_agp_for_card(pdev); +#endif /* CONFIG_PPC_PMAC */ /* If we support wakeup from poweroff, we save all regs we can including cfg * space @@ -2699,6 +2677,15 @@ rinfo->lock_blank = 0; radeon_screen_blank(rinfo, FB_BLANK_UNBLANK, 1); +#ifdef CONFIG_PPC_PMAC + /* On powermac, we have hooks to properly suspend/resume AGP now, + * use them here. We'll ultimately need some generic support here, + * but the generic code isn't quite ready for that yet + */ + pmac_resume_agp_for_card(pdev); +#endif /* CONFIG_PPC_PMAC */ + + /* Check status of dynclk */ if (rinfo->dynclk == 1) radeon_pm_enable_dynamic_mode(rinfo); Index: linux-work/include/asm-ppc/pmac_feature.h =================================================================== --- linux-work.orig/include/asm-ppc/pmac_feature.h 2005-03-15 11:59:39.000000000 +1100 +++ linux-work/include/asm-ppc/pmac_feature.h 2005-04-05 14:29:31.000000000 +1000 @@ -305,6 +305,17 @@ #define PMAC_FTR_DEF(x) ((_MACH_Pmac << 16) | (x)) +/* The AGP driver registers itself here */ +extern void pmac_register_agp_pm(struct pci_dev *bridge, + int (*suspend)(struct pci_dev *bridge), + int (*resume)(struct pci_dev *bridge)); + +/* Those are meant to be used by video drivers to deal with AGP + * suspend resume properly + */ +extern void pmac_suspend_agp_for_card(struct pci_dev *dev); +extern void pmac_resume_agp_for_card(struct pci_dev *dev); + /* * The part below is for use by macio_asic.c only, do not rely Index: linux-work/drivers/video/aty/aty128fb.c 
=================================================================== --- linux-work.orig/drivers/video/aty/aty128fb.c 2005-04-01 09:04:18.000000000 +1000 +++ linux-work/drivers/video/aty/aty128fb.c 2005-04-05 15:22:17.000000000 +1000 @@ -2331,7 +2331,6 @@ { struct fb_info *info = pci_get_drvdata(pdev); struct aty128fb_par *par = info->par; - u8 agp; /* We don't do anything but D2, for now we return 0, but * we may want to change that. How do we know if the BIOS @@ -2369,26 +2368,13 @@ par->asleep = 1; par->lock_blank = 1; - /* Disable AGP. The AGP host should have done it, but since ordering - * isn't always properly guaranteed in this specific case, let's make - * sure it's disabled on card side now. Ultimately, when merging fbdev - * and dri into some common infrastructure, this will be handled - * more nicely. The host bridge side will (or will not) be dealt with - * by the bridge AGP driver, we don't attempt to touch it here. +#ifdef CONFIG_PPC_PMAC + /* On powermac, we have hooks to properly suspend/resume AGP now, + * use them here. We'll ultimately need some generic support here, + * but the generic code isn't quite ready for that yet */ - agp = pci_find_capability(pdev, PCI_CAP_ID_AGP); - if (agp) { - u32 cmd; - - pci_read_config_dword(pdev, agp + PCI_AGP_COMMAND, &cmd); - if (cmd & PCI_AGP_COMMAND_AGP) { - printk(KERN_INFO "aty128fb: AGP was enabled, " - "disabling ...\n"); - cmd &= ~PCI_AGP_COMMAND_AGP; - pci_write_config_dword(pdev, agp + PCI_AGP_COMMAND, - cmd); - } - } + pmac_suspend_agp_for_card(pdev); +#endif /* CONFIG_PPC_PMAC */ /* We need a way to make sure the fbdev layer will _not_ touch the * framebuffer before we put the chip to suspend state. On 2.4, I @@ -2432,6 +2418,14 @@ par->lock_blank = 0; aty128fb_blank(0, info); +#ifdef CONFIG_PPC_PMAC + /* On powermac, we have hooks to properly suspend/resume AGP now, + * use them here. 
We'll ultimately need some generic support here, + * but the generic code isn't quite ready for that yet + */ + pmac_resume_agp_for_card(pdev); +#endif /* CONFIG_PPC_PMAC */ + pdev->dev.power.power_state = PMSG_ON; printk(KERN_DEBUG "aty128fb: resumed !\n"); From benh at kernel.crashing.org Tue Apr 5 17:15:11 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Tue, 05 Apr 2005 17:15:11 +1000 Subject: PCI Error Recovery API Proposal (updated) In-Reply-To: <20050314181420.GD498@austin.ibm.com> References: <20050223002409.GA10909@austin.ibm.com> <20050223174356.GH13081@kroah.com> <20050224011409.GE2088@austin.ibm.com> <421DDEF7.7080103@jp.fujitsu.com> <20050224231455.GH2088@austin.ibm.com> <421E9D16.3000606@jp.fujitsu.com> <20050312013251.GA2609@austin.ibm.com> <4235847F.3080705@jp.fujitsu.com> <20050314181420.GD498@austin.ibm.com> Message-ID: <1112685311.9518.35.camel@gaston> Hi ! I've been away for a while, but here is my latest update of the proposal, if we all agree with it, it will go to kernel/Documentation somewhere and we'll start implementing the ppc64 side of it. The error recovery API support is exposed by the driver in the form of a structure of function pointers pointed to by a new field in struct pci_driver. The absence of this pointer in pci_driver denotes an "non-aware" driver, behaviour on these is platform dependant. Platforms like ppc64 can try to simulate hotplug remove/add. The definition of "pci_error_token" is not covered here. It is based on Seto's work on the synchronous error detection. We still need to define functions for extracting infos out of an opaque error token. This is separate from this API. 
This structure has the form:

	struct pci_error_handlers {
		int (*error_detected)(struct pci_dev *dev, pci_error_token error);
		int (*error_recover)(struct pci_dev *dev);
		int (*error_restart)(struct pci_dev *dev);
		int (*link_reset)(struct pci_dev *dev);
		int (*slot_reset)(struct pci_dev *dev);
	};

A driver doesn't have to implement all of these callbacks. The only mandatory one is error_detected. If a callback is not implemented, the corresponding feature is considered unsupported. For example, if error_recover and error_restart (they really go together, see description to understand why) aren't there, then the driver is assumed not to do any direct recovery and to require a reset. If link_reset is not implemented, the card is assumed not to care about link resets, in which case, if recover is supported, the core can try recover (but not slot_reset unless it really did reset the slot). If slot reset is not supported, link reset can be called instead on a slot reset.

The first call will always be:

1) error_detected()

Error detected. This is sent once after an error has been detected. At this point, the device might not be accessible anymore depending on the platform (the slot will be isolated on ppc64). The driver may already have "noticed" the error because of a failing IO, but this is the proper "synchronisation point", that is, it gives the driver a chance to clean up and wait for pending work (timers, whatever, etc...) to complete; it can take semaphores, schedule, etc... everything but touch the device. Within this function and after it returns, the driver shouldn't do any new IOs. Called in task context. This is sort of a "quiesce" point. See note about interrupts at the end of this doc.

Result codes:
- PCIERR_RESULT_CAN_RECOVER: Return this if you think you might be able to recover the HW by just banging IOs or if you want to be given a chance to extract some diagnostic information (see below).
- PCIERR_RESULT_NEED_RESET: Return this if you think you can't recover unless the slot is reset.
- PCIERR_RESULT_DISCONNECT: Return this if you think you won't recover at all (this will detach the driver? or just leave it dangling? to be decided).

So at this point, we have called error_detected() for all drivers on the segment that had the error. On ppc64, the slot is isolated. What happens now typically depends on the result from the drivers. If all drivers on the segment/slot return PCIERR_RESULT_CAN_RECOVER, we would re-enable IOs on the slot (or do nothing special if the platform doesn't isolate slots) and call 2). If not and we can reset slots, we go to 4); if neither, we have a dead slot. If it's a hotplug slot, we might "simulate" reset by triggering HW unplug/replug though.

2) error_recover()

This is the "early recovery" call. IOs are allowed again, but DMA is not (hrm... to be discussed, I prefer not), with some restrictions. This is NOT a callback for the driver to start operations again, only to peek/poke at the device, extract diagnostic information if any, and eventually do things like trigger a device local reset or such things, but not restart operations. This is sent if all drivers on a segment agree that they can try to recover and no automatic link reset was performed by the HW. If the platform can't just re-enable IOs without a slot reset or a link reset, it doesn't call this callback and goes directly to 3) or 4). All IOs should be done _synchronously_ from within this callback; errors triggered by them will be returned via the normal pci_check_whatever() api, and no new error_detected() callback will be issued due to an error happening here. However, such an error might cause IOs to be re-blocked for the whole segment, and thus invalidate the recovery that other devices on the same segment might have done, forcing the whole segment into one of the next states, that is link reset or slot reset.
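The decision taken once every driver on the segment has answered error_detected() can be sketched as follows. This is only an illustration of the rule stated above, not platform code: the PCIERR_RESULT_* names come from the proposal, but their numeric values, the next_step enum and pick_next_step() are hypothetical. Since the handling of PCIERR_RESULT_DISCONNECT is still to be decided, the sketch simply treats it like any other answer that is not CAN_RECOVER.

```c
#include <stdbool.h>
#include <stddef.h>

/* Result names are from the proposal; the numeric values are made up. */
enum pcierr_result {
    PCIERR_RESULT_CAN_RECOVER,
    PCIERR_RESULT_NEED_RESET,
    PCIERR_RESULT_DISCONNECT
};

/* Hypothetical names for what the platform does next. */
enum next_step {
    NEXT_ERROR_RECOVER, /* step 2): re-enable IOs, call error_recover() */
    NEXT_SLOT_RESET,    /* step 4): reset the slot, call slot_reset() */
    NEXT_DEAD_SLOT      /* neither is possible */
};

/*
 * Combine the error_detected() answers of all drivers on one segment.
 * Early recovery is attempted only if every driver said CAN_RECOVER;
 * otherwise a slot reset is used when the platform can do one.
 */
static enum next_step pick_next_step(const enum pcierr_result *results,
                                     size_t ndrivers, bool can_reset_slot)
{
    bool all_can_recover = true;
    size_t i;

    for (i = 0; i < ndrivers; i++) {
        if (results[i] != PCIERR_RESULT_CAN_RECOVER)
            all_can_recover = false;
    }

    if (all_can_recover)
        return NEXT_ERROR_RECOVER;
    return can_reset_slot ? NEXT_SLOT_RESET : NEXT_DEAD_SLOT;
}
```

A single driver answering NEED_RESET is enough to push the whole segment to the slot reset path, which matches the "all drivers on the segment/slot" wording.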
Result codes:
- PCIERR_RESULT_RECOVERED: Return this if you think your device is fully functional and think you are ready to start to do your normal driver job again. There is no guarantee that because you returned that, you'll be allowed to actually proceed, as another driver on the same segment might have failed and thus triggered a slot reset on platforms that support it.
- PCIERR_RESULT_NEED_RESET: Return this if you think your device is not recoverable in its current state and you need a slot reset to proceed.
- PCIERR_RESULT_DISCONNECT: Same as above. Total failure, no recovery even after reset, driver dead. (To be defined more precisely)

3) link_reset()

This is called after the link has been reset. This is typically a PCI Express specific state at this point and is done whenever a non fatal error has been detected that can be "solved" by resetting the link. The driver is informed here of that reset and should check if the device appears to be in working condition. This function acts a bit like 2) error_recover(), that is, it is not supposed to restart normal driver IO operations right away, just "probe" the device to check its recoverability status. If all is right, then the core will call error_restart() once all drivers have ack'd link_reset().

Result codes: (identical to error_recover)

4) slot_reset()

This is called after the slot has been hard reset (and PCI BARs re-configured by the platform). If the platform supports PCI hotplug, it can implement this by toggling power on the slot off/on. Drivers here have a chance to re-initialize the hardware (re-download firmware etc...), but drivers shouldn't restart normal IO processing operations at this point (see note about interrupts, they aren't guaranteed to be delivered until the restart callback has been called). Upon success from this callback, the platform will call error_restart() to complete the error handling and let the driver restart normal IO request processing.

However, a driver can still return a critical failure from here in
case it just can't get its device back from reset. There is just
nothing we can do about it though. The driver will just be considered
"dead" in this case.

Result codes:

  - PCIERR_RESULT_DISCONNECT
    Same as above.

5) error_restart()

This is called if all drivers on the segment have returned
PCIERR_RESULT_RECOVERED from one of the 3 previous callbacks. That
basically tells the driver to restart activity, everything is back &
running. No result code is taken into account here. If a new error
happens, it will restart a new error handling process.

That's it. I think this covers all the possibilities. The way those
callbacks are called is platform policy. A platform with no slot reset
capability for example may want to just "ignore" drivers that can't
recover (disconnect them) and try to let other cards on the same
segment recover. Keep in mind that in most real life cases, though,
there will be only one driver per segment.

Now, there is a note about interrupts. If you get an interrupt and
your device is dead or has been isolated, there is a problem :)

After much thinking, I decided to leave that to the platform. That is,
the recovery API only specifies that:

  - There is no guarantee that interrupt delivery can proceed from
    any device on the segment starting from the error detection and
    until the restart callback is sent, at which point interrupts are
    expected to be fully operational.

  - There is no guarantee that interrupt delivery is stopped, that
    is, a driver that gets an interrupt after detecting an error, or
    that detects an error within the interrupt handler such that it
    prevents proper ack'ing of the interrupt (and thus removal of the
    source) should just return IRQ_NOTHANDLED. It's up to the
    platform to deal with that condition, typically by masking the
    irq source during the duration of the error handling.
    It is expected that the platform "knows" which interrupts are
    routed to error-management capable slots and can deal with
    temporarily disabling that irq number during error processing
    (this isn't terribly complex). That means some IRQ latency for
    other devices sharing the interrupt, but there is simply no other
    way. High end platforms aren't supposed to share interrupts
    between many devices anyway :)

Ben.


From cfriesen at nortel.com Wed Apr 6 03:33:23 2005
From: cfriesen at nortel.com (Chris Friesen)
Date: Tue, 05 Apr 2005 11:33:23 -0600
Subject: help, trying to invalidate entire icache on 970
Message-ID: <4252CBE3.3010701@nortel.com>

I'm having issues with some code that is supposed to invalidate the
entire icache on a 970.

I have a little test app in userspace that overwrites an instruction
and calls some kernel code to invalidate the whole cache.
Unfortunately, sometimes the new instruction doesn't get run, and if I
start a kernel build in the background, it occurs quite frequently.

The kernel code is accessed via an ioctl() on a device node, and it
looks like this:

local_irq_save()
sync
repeated 513 times:
	b 128
	<31 nops>
isync
local_irq_restore()

Basically, I'm trying to do the brute-force method of flushing it, as
described in the 970 manual. Obviously I'm missing something, but I'm
not sure what. In case it matters, the machine is dual-cpu.

Anyone have any ideas? Anyone have any such code that works?

Chris


From moilanen at austin.ibm.com Wed Apr 6 05:33:34 2005
From: moilanen at austin.ibm.com (Jake Moilanen)
Date: Tue, 5 Apr 2005 14:33:34 -0500
Subject: help, trying to invalidate entire icache on 970
In-Reply-To: <4252CBE3.3010701@nortel.com>
References: <4252CBE3.3010701@nortel.com>
Message-ID: <20050405143334.33e466b0.moilanen@austin.ibm.com>

On Tue, 05 Apr 2005 11:33:23 -0600
Chris Friesen wrote:

> I'm having issues with some code that is supposed to invalidate the
> entire icache on a 970.
>
> I have a little test app in userspace that overwrites an instruction and
> calls some kernel code to invalidate the whole cache. Unfortunately,
> sometimes the new instruction doesn't get run, and if I start a kernel
> build in the background, it occurs quite frequently.

IIRC to modify an instruction you need the following sequence to
flush the icache correctly:

dcbst
sync
icbi
isync

I can't remember if the 970 actually requires this sequence or not.

Jake


From cfriesen at nortel.com Wed Apr 6 06:07:23 2005
From: cfriesen at nortel.com (Chris Friesen)
Date: Tue, 05 Apr 2005 14:07:23 -0600
Subject: help, trying to invalidate entire icache on 970
In-Reply-To: <20050405143334.33e466b0.moilanen@austin.ibm.com>
References: <4252CBE3.3010701@nortel.com> <20050405143334.33e466b0.moilanen@austin.ibm.com>
Message-ID: <4252EFFB.5010000@nortel.com>

Jake Moilanen wrote:

> IIRC to modify an instruction you need the following sequence to
> flush the icache correctly:
>
> dcbst
> sync
> icbi
> isync

This works if you know the address that was modified.

My problem is that I have an application (emulator) that modifies its
own instructions but doesn't track the addresses. Thus I need to flush
the entire dcache (on the 970 this is just a "sync"), and invalidate
the entire icache.

Chris


From cfriesen at nortel.com Wed Apr 6 07:15:42 2005
From: cfriesen at nortel.com (Chris Friesen)
Date: Tue, 05 Apr 2005 15:15:42 -0600
Subject: help, trying to invalidate entire icache on 970
In-Reply-To: <4252CBE3.3010701@nortel.com>
References: <4252CBE3.3010701@nortel.com>
Message-ID: <4252FFFE.5090700@nortel.com>

Friesen, Christopher [CAR:VC21:EXCH] wrote:

> Basically, I'm trying to do the brute-force method of flushing it, as
> described in the 970 manual. Obviously I'm missing something, but I'm
> not sure what. In case it matters, the machine is dual-cpu.
>
> Anyone have any ideas? Anyone have any such code that works?

I've switched over to using the en_icbi method of invalidation.
It seems to work, but I'm calling icbi once for every cacheline and
that seems suboptimal.

The manual says that 4 bits of the address are used to index the
icache, and thus each icbi call with en_icbi enabled will result in 16
cachelines being invalidated. Unfortunately I didn't see anywhere
where it explained which bits they are and how they map to cachelines.
Just calling icbi 32 times and incrementing the address by a cacheline
each time didn't work.

I should be able to get away with only calling it 32 times, assuming I
pick exactly the right addresses, but I'm at a loss as to which
addresses to use.

Anyone able to help?

Chris


From linas at austin.ibm.com Wed Apr 6 07:43:03 2005
From: linas at austin.ibm.com (Linas Vepstas)
Date: Tue, 5 Apr 2005 16:43:03 -0500
Subject: help, trying to invalidate entire icache on 970
In-Reply-To: <4252FFFE.5090700@nortel.com>
References: <4252CBE3.3010701@nortel.com> <4252FFFE.5090700@nortel.com>
Message-ID: <20050405214303.GO15596@austin.ibm.com>

On Tue, Apr 05, 2005 at 03:15:42PM -0600, Chris Friesen was heard to remark:
> Friesen, Christopher [CAR:VC21:EXCH] wrote:
>
> >Basically, I'm trying to do the brute-force method of flushing it, as
> >described in the 970 manual. Obviously I'm missing something, but I'm
> >not sure what. In case it matters, the machine is dual-cpu.
> >
> >Anyone have any ideas? Anyone have any such code that works?
>
> I've switched over to using the en_icbi method of invalidation. It
> seems to work, but I'm calling icbi once for every cacheline and that
> seems suboptimal.
>
> The manual says that 4 bits of the address are used to index the icache,
> and thus each icbi call with en_icbi enabled will result in 16

I'm not quite clear on what you are doing ... but some general remarks:

In general, the caches tend to be n-way set associative. So
invalidating a given cache line will invalidate only one of
the n ways. This might explain why your first attempt didn't work.
I don't know what n is for the 970.
Typically it is 2 or 4 for this class of cpu. It tends to vary from
one model to another.

Which "4 bits" are involved tends to vary from one core to another.
Even if you found something that worked on the 970, it might not work
on the next generation, since the address lines would be wired
differently.

Similar remarks apply for the assumption that there's only 16 or 32
cache blocks or lines or whatever ...

I'm not sure I know what en_icbi is (have never scanned the 970 docs).
Maybe it's invalidating all cache lines that alias to the same address
tag.

Why, again, is it that you can't just call icbi with the address of
the instruction that has been changed?

--linas


From cfriesen at nortel.com Wed Apr 6 08:15:46 2005
From: cfriesen at nortel.com (Chris Friesen)
Date: Tue, 05 Apr 2005 16:15:46 -0600
Subject: help, trying to invalidate entire icache on 970
In-Reply-To: <20050405214303.GO15596@austin.ibm.com>
References: <4252CBE3.3010701@nortel.com> <4252FFFE.5090700@nortel.com> <20050405214303.GO15596@austin.ibm.com>
Message-ID: <42530E12.4090206@nortel.com>

Linas Vepstas wrote:

> In general, the caches tend to be n-way set associative. So
> invalidating a given cache line will invalidate only one of
> the n ways. This might explain why your first attempt didn't work.
> I don't know what n is for the 970. Typically it is 2 or 4 for this
> class of cpu. It tends to vary from one model to another.

The icache is direct mapped, but is indexed by four bits in the
effective address such that a given physical address can be aliased to
16 positions in the cache.

> Which "4 bits" are involved tends to vary from one core to another.
> Even if you found something that worked on the 970, it might not work
> on the next generation, since the address lines would be wired
> differently.
>
> Similar remarks apply for the assumption that there's only 16 or 32
> cache blocks or lines or whatever ...

Right. This whole chunk of code is 970-specific.
We have other code for other cpus (the 74xx for instance can
flash-invalidate the whole icache with one instruction).

> I'm not sure I know what en_icbi is (have never scanned the 970 docs).
> Maybe it's invalidating all cache lines that alias to the same address
> tag.

Yep. I'm trying to figure those aliasing patterns out so I can
minimize the number of icbi calls needed.

> Why, again, is it that you can't just call icbi with the address of the
> instruction that has been changed?

I have a pre-existing app that modifies itself and doesn't track the
addresses. All I get is the app telling me "I just modified
something." Thus, I have to flush the entire dcache, and invalidate
the entire icache in order to ensure that the new code gets run. It's
horribly kludgy I know, but that's what I've got to deal with.

Chris


From paulus at samba.org Wed Apr 6 10:28:09 2005
From: paulus at samba.org (Paul Mackerras)
Date: Wed, 6 Apr 2005 10:28:09 +1000
Subject: help, trying to invalidate entire icache on 970
In-Reply-To: <4252EFFB.5010000@nortel.com>
References: <4252CBE3.3010701@nortel.com> <20050405143334.33e466b0.moilanen@austin.ibm.com> <4252EFFB.5010000@nortel.com>
Message-ID: <16979.11545.264476.711612@cargo.ozlabs.ibm.com>

Chris Friesen writes:

> My problem is that I have an application (emulator) that modifies its
> own instructions but doesn't track the addresses. Thus I need to flush
> the entire dcache (on the 970 this is just a "sync"), and invalidate the
> entire icache.

Current BK now has support for making pages non-executable with
mprotect. You could mprotect the pages RW to start with and have a
SIGSEGV handler. When the emulator tries to execute from a page you
will get a SIGSEGV, and you can flush that page (with 32 x dcbst;
sync; 32 x icbi; isync) and then mprotect it RX and return from the
signal handler. If the emulator writes to it you get another SIGSEGV
and mprotect it back to RW.

Paul.
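
The scheme Paul describes can be sketched in userspace C roughly as
follows. This is a minimal sketch under several assumptions, not a
tested 970 implementation: NPAGES, region, is_exec[], flip_page() and
code_region_init() are hypothetical names made up for the example, the
GCC/Clang builtin __builtin___clear_cache() stands in for the explicit
dcbst/sync/icbi/isync loop, and calling mprotect() from a signal
handler is assumed to be acceptable (it is a plain system call on
Linux, although POSIX does not list it as async-signal-safe).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for MAP_ANONYMOUS and siginfo_t */
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 16              /* hypothetical size of the code region, in pages */

static char *region;           /* start of the emulator's code region */
static size_t page_size;
static volatile sig_atomic_t is_exec[NPAGES];   /* 1 = page is currently RX */

/* Toggle one page between RW and RX, flushing before it becomes executable. */
static void flip_page(size_t idx)
{
    char *page = region + idx * page_size;

    if (is_exec[idx]) {
        /* A write hit an RX page: make it writable again. */
        if (mprotect(page, page_size, PROT_READ | PROT_WRITE) != 0)
            _exit(1);
        is_exec[idx] = 0;
    } else {
        /* An execute hit an RW page: sync the caches, then allow execution. */
        __builtin___clear_cache(page, page + page_size);
        if (mprotect(page, page_size, PROT_READ | PROT_EXEC) != 0)
            _exit(1);
        is_exec[idx] = 1;
    }
}

/* SIGSEGV handler: the recorded page state tells us which kind of fault it was. */
static void on_segv(int sig, siginfo_t *info, void *ctx)
{
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)region;

    (void)sig;
    (void)ctx;
    if (addr < base || addr >= base + NPAGES * page_size)
        _exit(1);              /* a real crash, not one of our pages */
    flip_page((addr - base) / page_size);
}

/* Map the region RW and install the handler; returns 0 on success. */
static int code_region_init(void)
{
    struct sigaction sa;

    page_size = (size_t)sysconf(_SC_PAGESIZE);
    region = mmap(NULL, NPAGES * page_size, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        return -1;
    assert(((uintptr_t)region % page_size) == 0);

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```

With this in place the emulator never has to report addresses: each
page is flushed lazily, the first time it is executed after having
been written. The cost is two faults per write/execute cycle per page,
which may or may not be cheaper than invalidating the whole icache.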


From segher at kernel.crashing.org Wed Apr 6 12:49:33 2005
From: segher at kernel.crashing.org (Segher Boessenkool)
Date: Wed, 6 Apr 2005 04:49:33 +0200
Subject: help, trying to invalidate entire icache on 970
In-Reply-To: <20050405143334.33e466b0.moilanen@austin.ibm.com>
References: <4252CBE3.3010701@nortel.com> <20050405143334.33e466b0.moilanen@austin.ibm.com>
Message-ID:

> IIRC to modify an instruction you need the following sequence to
> flush the icache correctly:
>
> dcbst
> sync
> icbi
> isync
>
> I can't remember if the 970 actually requires this sequence or not.

It does not, as the DL1 cache is store-through; i.e., the dcbst insn
is superfluous here (the sync is required in general, though!)

Segher


From benh at kernel.crashing.org Wed Apr 6 14:10:38 2005
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Wed, 06 Apr 2005 14:10:38 +1000
Subject: [PATCH] ppc64: Improve mapping of vDSO
Message-ID: <1112760638.9518.93.camel@gaston>

Hi Andrew!

This patch reworks the way the ppc64 vDSO is mapped in user memory by
the kernel to make it more robust against possible collisions with
executable segments. Instead of just whacking a VMA at 1Mb, I now use
get_unmapped_area() with a hint, and I moved the mapping of the vDSO
to after the mapping of the various ELF segments and of the
interpreter, so that conflicts get caught properly (it still has to be
before create_elf_tables since the latter will fill the
AT_SYSINFO_EHDR with the proper address).

While I was at it, I also changed the 32 and 64 bits vDSO's to link at
their "natural" address of 1Mb instead of 0. This is the address where
they are normally mapped in absence of conflict. By doing so, it
should be possible to properly prelink once it's been verified to work
on glibc.
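
The "hint" behaviour has a userspace analogue in mmap() called without
MAP_FIXED: the kernel tries the requested address and picks another
one if that range is already taken. The sketch below only illustrates
that idea from userspace and is not the kernel code path touched by
the patch; map_with_hint() is a hypothetical helper name, and 0x100000
is used simply because it is the 1Mb base mentioned above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for MAP_ANONYMOUS */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask for `len` bytes near `hint`, letting the
 * kernel choose a different address if the hinted range is in use.
 * Returns NULL on failure.
 */
static void *map_with_hint(uintptr_t hint, size_t len)
{
    void *p;

    assert(len > 0);
    p = mmap((void *)hint, len, PROT_READ,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);   /* note: no MAP_FIXED */
    return p == MAP_FAILED ? NULL : p;
}
```

Calling map_with_hint(0x100000, len) twice returns two different
addresses, since the second mapping cannot reuse the range taken by
the first; neither call fails just because the hinted range is busy.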
Please apply for 2.6.12, Signed-off-by: Benjamin Herrenschmidt Index: linux-work/arch/ppc64/kernel/vdso.c =================================================================== --- linux-work.orig/arch/ppc64/kernel/vdso.c 2005-03-07 10:22:15.000000000 +1100 +++ linux-work/arch/ppc64/kernel/vdso.c 2005-04-06 13:32:41.000000000 +1000 @@ -213,13 +213,14 @@ vdso_base = VDSO64_MBASE; } + current->thread.vdso_base = 0; + /* vDSO has a problem and was disabled, just don't "enable" it for the * process */ - if (vdso_pages == 0) { - current->thread.vdso_base = 0; + if (vdso_pages == 0) return 0; - } + vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL); if (vma == NULL) return -ENOMEM; @@ -230,12 +231,16 @@ memset(vma, 0, sizeof(*vma)); /* - * pick a base address for the vDSO in process space. We have a default - * base of 1Mb on which we had a random offset up to 1Mb. - * XXX: Add possibility for a program header to specify that location + * pick a base address for the vDSO in process space. We try to put it + * at vdso_base which is the "natural" base for it, but we might fail + * and end up putting it elsewhere. 
*/ + vdso_base = get_unmapped_area(NULL, vdso_base, + vdso_pages << PAGE_SHIFT, 0, 0); + if (vdso_base & ~PAGE_MASK) + return (int)vdso_base; + current->thread.vdso_base = vdso_base; - /* + ((unsigned long)vma & 0x000ff000); */ vma->vm_mm = mm; vma->vm_start = current->thread.vdso_base; Index: linux-work/fs/binfmt_elf.c =================================================================== --- linux-work.orig/fs/binfmt_elf.c 2005-04-03 10:02:57.000000000 +1000 +++ linux-work/fs/binfmt_elf.c 2005-04-06 13:10:49.000000000 +1000 @@ -782,14 +782,6 @@ goto out_free_dentry; } -#ifdef ARCH_HAS_SETUP_ADDITIONAL_PAGES - retval = arch_setup_additional_pages(bprm, executable_stack); - if (retval < 0) { - send_sig(SIGKILL, current, 0); - goto out_free_dentry; - } -#endif /* ARCH_HAS_SETUP_ADDITIONAL_PAGES */ - current->mm->start_stack = bprm->p; /* Now we do a little grungy work by mmaping the ELF image into @@ -949,6 +941,14 @@ set_binfmt(&elf_format); +#ifdef ARCH_HAS_SETUP_ADDITIONAL_PAGES + retval = arch_setup_additional_pages(bprm, executable_stack); + if (retval < 0) { + send_sig(SIGKILL, current, 0); + goto out_free_dentry; + } +#endif /* ARCH_HAS_SETUP_ADDITIONAL_PAGES */ + compute_creds(bprm); current->flags &= ~PF_FORKNOEXEC; create_elf_tables(bprm, &loc->elf_ex, (interpreter_type == INTERPRETER_AOUT), Index: linux-work/include/asm-ppc64/vdso.h =================================================================== --- linux-work.orig/include/asm-ppc64/vdso.h 2005-03-15 11:57:38.000000000 +1100 +++ linux-work/include/asm-ppc64/vdso.h 2005-04-06 13:33:20.000000000 +1000 @@ -4,12 +4,12 @@ #ifdef __KERNEL__ /* Default link addresses for the vDSOs */ -#define VDSO32_LBASE 0 -#define VDSO64_LBASE 0 +#define VDSO32_LBASE 0x100000 +#define VDSO64_LBASE 0x100000 /* Default map addresses */ -#define VDSO32_MBASE 0x100000 -#define VDSO64_MBASE 0x100000 +#define VDSO32_MBASE VDSO32_LBASE +#define VDSO64_MBASE VDSO64_LBASE #define VDSO_VERSION_STRING LINUX_2.6.12 From benh at 
kernel.crashing.org Wed Apr 6 14:57:29 2005
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Wed, 06 Apr 2005 14:57:29 +1000
Subject: help, trying to invalidate entire icache on 970
In-Reply-To:
References: <4252CBE3.3010701@nortel.com> <20050405143334.33e466b0.moilanen@austin.ibm.com>
Message-ID: <1112763449.9568.99.camel@gaston>

On Wed, 2005-04-06 at 04:49 +0200, Segher Boessenkool wrote:
> > IIRC to modify an instruction you need the following sequence to
> > flush the icache correctly:
> >
> > dcbst
> > sync
> > icbi
> > isync
> >
> > I can't remember if the 970 actually requires this sequence or not.
>
> It does not, as the DL1 cache is store-through; i.e., the dcbst insn
> is superfluous here (the sync is required in general, though!)

Isn't dcbst a nop on 970 anyway?

Ben.


From benh at kernel.crashing.org Wed Apr 6 15:00:13 2005
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Wed, 06 Apr 2005 15:00:13 +1000
Subject: help, trying to invalidate entire icache on 970
In-Reply-To: <42530E12.4090206@nortel.com>
References: <4252CBE3.3010701@nortel.com> <4252FFFE.5090700@nortel.com> <20050405214303.GO15596@austin.ibm.com> <42530E12.4090206@nortel.com>
Message-ID: <1112763613.9567.102.camel@gaston>

On Tue, 2005-04-05 at 16:15 -0600, Chris Friesen wrote:

> I have a pre-existing app that modifies itself and doesn't track the
> addresses. All I get is the app telling me "I just modified something."
> Thus, I have to flush the entire dcache, and invalidate the entire
> icache in order to ensure that the new code gets run. It's horribly
> kludgy I know, but that's what I've got to deal with.

Stupid idea: have you checked that maybe when it calls you back for
that "I just modified something", the address is actually still in one
of the registers or a known stack location where you can "peek" at
it? :)

That would solve your problem once and for all ...
Unless that app gets rebuilt regularly, but in this case, you should
really get them to fix the API to the callback to take the address.

Ben.


From ntl at pobox.com Wed Apr 6 15:30:02 2005
From: ntl at pobox.com (Nathan Lynch)
Date: Wed, 6 Apr 2005 00:30:02 -0500
Subject: [RFC/PATCH] numa: distinguish associativity domain from node id
Message-ID: <20050406053002.GF3611@otto>

Yes, yet another numa.c patch... this is strictly rfc for now.

The ppc64 numa code makes some possibly invalid assumptions about the
numbering of "associativity domains" (which may be considered NUMA
nodes). As far as I've been able to determine from the architecture
docs, there is no guarantee about the numbering of associativity
domains, i.e. the values that are contained in ibm,associativity
device node properties. Yet we seem to assume that the numbering of
the domains begins at zero and that the range is contiguous, and we
use the domain number for a given resource as its logical node id.
This strikes me as a problem waiting to happen, and in fact I've been
seeing some problems in the lab with larger machines violating or at
least straining these assumptions.

Consider one such case: the associativity domain for all memory in a
partition is 0x1, but the processors are in shared mode (so no
associativity info for them) -- all the memory is placed in node 1
while all cpus are mapped to node 0. But in this case, we should
really have only one logical node, with all memory and cpus mapped to
it.

Another case I've seen is that of a partition with all processors and
memory having an associativity domain of 0x1. We end up with
everything in node 1 and an empty (yet online) node 0.

I propose treating the associativity domain for a resource as a
"cookie" without making any assumptions about its value. During numa
init, each distinct domain is mapped to a logical node.
so that the following holds: the logical node numbering begins at zero and is contiguous, and resources added after boot which do not map to an already initialized domain are associated with logical node 0. The patch implements these, and attempts to separate the notion of associativity domain from that of logical node where appropriate. Lightly tested on Power5 LPAR with two numa nodes - it boots and the information under /sys/devices/system/node looks correct. I'm going to omit a signed-off line for now; there's probably some stupid bug introduced or something that someone will find objectionable (memory hotplug folks?)... Please review. Thanks, Nathan arch/ppc64/mm/numa.c | 207 +++++++++++++++++++++++++++---------------- 1 files changed, 131 insertions(+), 76 deletions(-) Index: linux-2.6.12-rc2/arch/ppc64/mm/numa.c =================================================================== --- linux-2.6.12-rc2.orig/arch/ppc64/mm/numa.c 2005-04-05 12:59:28.000000000 -0500 +++ linux-2.6.12-rc2/arch/ppc64/mm/numa.c 2005-04-06 00:17:08.000000000 -0500 @@ -58,6 +58,69 @@ EXPORT_SYMBOL(numa_memory_lookup_table); EXPORT_SYMBOL(numa_cpumask_lookup_table); EXPORT_SYMBOL(nr_cpus_in_node); +/* Maps nid to platform "associativity domain". */ +#define INVALID_DOMAIN (-1) +static int nid_domain[MAX_NUMNODES] = { [0 ... (MAX_NUMNODES - 1)] = + INVALID_DOMAIN }; +/* nid to platform domain id. O(1). */ +static int nid_to_domain(int nid) +{ + BUG_ON(0 > nid || nid >= MAX_NUMNODES); + return nid_domain[nid]; +} + +/* Platform domain to nid. If the given domain does not map to a node, + * return -1. O(n). + */ +static int domain_to_nid(int domain) +{ + int nid; + + for_each_node(nid) + if (domain == nid_to_domain(nid)) + return nid; + return -1; +} + +/* Associate domain with the given nid. 
*/ +static void __init assign_domain_to_nid(int domain, int nid) +{ + BUG_ON(0 > nid || nid >= MAX_NUMNODES); + BUG_ON(domain == INVALID_DOMAIN); + BUG_ON(nid_domain[nid] != INVALID_DOMAIN); + + nid_domain[nid] = domain; + dbg("OF associativity domain 0x%x mapped to node %i\n", domain, nid); +} + +/* Given a previously unencountered associativity domain, find the + * first unused slot in the nid_domain array where it can be plugged + * in. + */ +static int __init setup_domain(int domain) +{ + int nid; + + for_each_node(nid) { + int tmp = nid_to_domain(nid); + + if (tmp != INVALID_DOMAIN) + continue; + + /* Do not set up the same domain twice. */ + BUG_ON(tmp == domain); + + assign_domain_to_nid(domain, nid); + return nid; + } + + printk(KERN_WARNING "Can't associate domain 0x%x with a node, " + "MAX_NUMNODES=%i, num_online_nodes=%i\n", domain, + MAX_NUMNODES, num_online_nodes()); + + return 0; +} + static inline void map_cpu_to_node(int cpu, int node) { numa_cpu_lookup_table[cpu] = node; @@ -117,26 +180,58 @@ static struct device_node * __devinit fi /* must hold reference to node during call */ static int *of_get_associativity(struct device_node *dev) { - return (unsigned int *)get_property(dev, "ibm,associativity", NULL); + return (int *)get_property(dev, "ibm,associativity", NULL); } -static int of_node_numa_domain(struct device_node *device) +/* + * Given an OF device node, return the logical node id to which it + * belongs. If the node has no associativity information, the result + * is 0. During boot, this function will map domains to nodes as + * necessary. 
+ */ +static int of_node_to_nid(struct device_node *dn) { - int numa_domain; - unsigned int *tmp; + int *tmp, domain, nid; if (min_common_depth == -1) return 0; - tmp = of_get_associativity(device); - if (tmp && (tmp[0] >= min_common_depth)) { - numa_domain = tmp[min_common_depth]; - } else { - dbg("WARNING: no NUMA information for %s\n", - device->full_name); - numa_domain = 0; + tmp = of_get_associativity(dn); + + if (!tmp || (tmp[0] < min_common_depth)) { + dbg("no NUMA information for %s\n", dn->full_name); + return 0; } - return numa_domain; + + domain = tmp[min_common_depth]; + + /* + * POWER4 LPAR uses 0xffff as invalid node, + * just use node zero. + */ + if (domain == 0xffff) + nid = 0; + else + nid = domain_to_nid(domain); + + /* If we haven't seen this domain before, associate it with a + * node if we're still in boot. If we're up and running and + * the domain is previously unknown, we have no choice but to + * map the resource to an initialized node, so we map it to + * nid 0. + */ + if (nid < 0) { + if (system_state < SYSTEM_RUNNING) { + nid = setup_domain(domain); + } else { + nid = 0; + dbg("Resource %s has associativity domain" + " %x which was not known at boot, assigning" + " to node %i\n", dn->full_name, domain, nid); + } + } + node_set_online(nid); + return nid; } /* @@ -228,7 +323,7 @@ static unsigned long read_n_cells(int n, */ static int numa_setup_cpu(unsigned long lcpu) { - int numa_domain = 0; + int nid = 0; struct device_node *cpu = find_cpu_node(lcpu); if (!cpu) { @@ -236,27 +331,16 @@ static int numa_setup_cpu(unsigned long goto out; } - numa_domain = of_node_numa_domain(cpu); + nid = of_node_to_nid(cpu); - if (numa_domain >= num_online_nodes()) { - /* - * POWER4 LPAR uses 0xffff as invalid node, - * dont warn in this case. 
- */ - if (numa_domain != 0xffff) - printk(KERN_ERR "WARNING: cpu %ld " - "maps to invalid NUMA node %d\n", - lcpu, numa_domain); - numa_domain = 0; - } out: - node_set_online(numa_domain); + node_set_online(nid); - map_cpu_to_node(lcpu, numa_domain); + map_cpu_to_node(lcpu, nid); of_node_put(cpu); - return numa_domain; + return nid; } static int cpu_numa_callback(struct notifier_block *nfb, @@ -319,7 +403,6 @@ static int __init parse_numa_properties( struct device_node *cpu = NULL; struct device_node *memory = NULL; int addr_cells, size_cells; - int max_domain = 0; long entries = lmb_end_of_DRAM() >> MEMORY_INCREMENT_SHIFT; unsigned long i; @@ -341,7 +424,7 @@ static int __init parse_numa_properties( if (min_common_depth < 0) return min_common_depth; - max_domain = numa_setup_cpu(boot_cpuid); + numa_setup_cpu(boot_cpuid); /* * Even though we connect cpus to numa domains later in SMP init, @@ -350,20 +433,8 @@ static int __init parse_numa_properties( * As a result of hotplug we could still have cpus appear later on * with larger node ids. In that case we force the cpu into node 0. */ - for_each_cpu(i) { - int numa_domain; - - cpu = find_cpu_node(i); - - if (cpu) { - numa_domain = of_node_numa_domain(cpu); - of_node_put(cpu); - - if (numa_