From paulus at samba.org Wed Mar 1 09:51:30 2006 From: paulus at samba.org (Paul Mackerras) Date: Wed, 1 Mar 2006 09:51:30 +1100 Subject: Membership stats (Was: Re: merge these lists?) In-Reply-To: References: <20060208110718.57e9f9f5.sfr@canb.auug.org.au> Message-ID: <17412.54258.868906.846176@cargo.ozlabs.ibm.com> Kumar Gala writes: > Where did we leave on with this? I was about to request that > marc.theaimsgroup.com start archiving some of the ppc lists but figured > doing it after we merged lists would be better. My current thought is to call the combined list "linuxppc-dev" and put the appropriate redirection in place from linuxppc64-dev to linuxppc-dev. If anyone really objects to that, shout now. Paul. From pradeep at us.ibm.com Wed Mar 1 13:20:54 2006 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Tue, 28 Feb 2006 19:20:54 -0700 Subject: Problems loading some select modules In-Reply-To: <17411.55866.172377.50234@cargo.ozlabs.ibm.com> Message-ID: Hello Paul, Please find the tar file with the files that you requested. (See attached file: findex.tar.gz) This was picked up from the svn tree (5421) of openib.org. This is not yet in the mainline. To set the context -this was seen on a Sles9sp2 machine and not a Rhel4U3 machine, as indicated in the previous mail. Thanks for looking into this! Pradeep pradeep at us.ibm.com Paul Mackerras wrote on 02/27/2006 09:06:02 PM: > Pradeep Satyanarayana writes: > > > I was trying to load some Infiniband modules (using modprobe) on Power5 > > machine (p570), and I get the following error: > > > > WARNING: Error inserting findex > > (/lib/modules/2.6.16-rc2/kernel/drivers/infiniband/core/findex.ko): Invalid > > module format > > > > Also, in /var/log/messages I see the following error about the same module: > > > > kernel: findex: doesn't contain .toc or .stubs. > > Interesting. I don't see findex.c in the kernel sources anywhere. 
It > could be that a very simple module that only accesses variables on the > stack would not need a toc, and maybe in this case the toolchain > doesn't generate a toc. Could you send me the source of your module > plus the generated findex.ko? > > Paul. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20060228/8291c691/attachment.htm -------------- next part -------------- A non-text attachment was scrubbed... Name: findex.tar.gz Type: application/octet-stream Size: 38685 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20060228/8291c691/attachment.obj From mbligh at mbligh.org Wed Mar 1 10:56:24 2006 From: mbligh at mbligh.org (Martin Bligh) Date: Tue, 28 Feb 2006 15:56:24 -0800 Subject: 2.6.16-rc5-mm1 In-Reply-To: <20060228042439.43e6ef41.akpm@osdl.org> References: <20060228042439.43e6ef41.akpm@osdl.org> Message-ID: <4404E328.7070807@mbligh.org> Andrew Morton wrote: > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc5/2.6.16-rc5-mm1/ New panic on IBM power4 lpar of P690. 2.6.16-rc5-git3 is OK. 
(config: http://ftp.kernel.org/pub/linux/kernel/people/mbligh/config/abat/power4) http://test.kernel.org/24165/debug/console.log STAFProc version 2.6.2 initialized Oops: Kernel access of bad area, sig: 11 [#1] SMP NR_CPUS=32 NUMA PSERIES LPAR Modules linked in: NIP: C000000000064748 LR: C000000000064764 CTR: C0000000000A8B10 REGS: c00000077dfaad30 TRAP: 0300 Not tainted (2.6.16-rc5-mm1-autokern1) MSR: 8000000000009032 CR: 28000488 XER: 00000000 DAR: 0000000000000000, DSISR: 0000000040000000 TASK = c00000076a1ac720[11058] 'mingetty' THREAD: c00000077dfa8000 CPU: 1 GPR00: 0000000000000007 C00000077DFAAFB0 C000000000644F70 C00000076F303F08 GPR04: C00000076D478E00 0000000000000000 C00000076D478E00 0000000000000001 GPR08: 000000000000000A 0000000000000000 00000000000001AA C00000076F303F08 GPR12: 0000000048000442 C000000000548B80 0000000010060000 0000000010060000 GPR16: 0000000010060000 0000000010080000 0000000010080000 0000000010060000 GPR20: 0000000010010000 0000000000000001 0000000000000000 0000000000000001 GPR24: 0000000010003738 C00000077255F010 0000000000000001 0000000000000000 GPR28: 0000000000000007 0000000000000000 C00000000055F4A8 C0000000041F82A0 NIP [C000000000064748] .__rcu_process_callbacks+0x1fc/0x2f8 LR [C000000000064764] .__rcu_process_callbacks+0x218/0x2f8 Call Trace: [C00000077DFAAFB0] [C000000000064764] .__rcu_process_callbacks+0x218/0x2f8 (unreliable) [C00000077DFAB040] [C000000000064874] .rcu_process_callbacks+0x30/0x58 [C00000077DFAB0C0] [C000000000053848] .tasklet_action+0xe4/0x19c [C00000077DFAB160] [C000000000053EA8] .__do_softirq+0x9c/0x16c [C00000077DFAB200] [C00000000000B6C8] .do_softirq+0x74/0xac [C00000077DFAB280] [C000000000054388] .irq_exit+0x64/0x7c [C00000077DFAB300] [C00000000001E0AC] .timer_interrupt+0x460/0x48c [C00000077DFAB3E0] [C0000000000034DC] decrementer_common+0xdc/0x100 --- Exception: 901 at ._atomic_dec_and_lock+0x3c/0xb8 LR = .mntput_no_expire+0x30/0xcc [C00000077DFAB6D0] [C0000007680CF438] 0xc0000007680cf438 (unreliable) 
[C00000077DFAB750] [C0000000000CD6D0] .mntput_no_expire+0x30/0xcc
[C00000077DFAB7E0] [C0000000000B9F40] .path_release+0x44/0x5c
[C00000077DFAB870] [C0000000000ED840] .proc_pid_follow_link+0x34/0xf0
[C00000077DFAB900] [C0000000000BD01C] .__link_path_walk+0xe64/0x1394
[C00000077DFAB9E0] [C0000000000BD5DC] .link_path_walk+0x90/0x168
[C00000077DFABAE0] [C0000000000BDE28] .do_path_lookup+0x2fc/0x364
[C00000077DFABB90] [C0000000000BF4E0] .__user_walk_fd+0x68/0xa8
[C00000077DFABC30] [C0000000000B5578] .vfs_stat_fd+0x24/0x70
[C00000077DFABD30] [C0000000000B56BC] .sys_stat64+0x1c/0x50
[C00000077DFABE30] [C00000000000871C] syscall_exit+0x0/0x40
Instruction dump:
38000000 901d0080 e87f0040 2fa30000 419e00fc 7c6b1b78 3b800000 ebab0000
7d635b78 fbbf0040 60000000 e92b0008 f8410028 60000000 e9690010
<0>Kernel panic - not syncing: Fatal exception in interrupt
smp_call_function on cpu 1: other cpus not responding (1)
-- 0:conmux-control -- time-stamp -- Feb/28/06 5:08:52 --

From huangjq at cn.ibm.com Wed Mar 1 18:26:42 2006
From: huangjq at cn.ibm.com (Jin Qi Huang)
Date: Wed, 1 Mar 2006 15:26:42 +0800
Subject: Kernel oops then panic when perform a soft reset on ppc64 box
In-Reply-To: <17401.9075.295712.950980@cargo.ozlabs.ibm.com>
Message-ID:

I have found some information on the IBM website:

1. The state of the processor after taking the Soft Reset exception is unremarkable, because SRESET# merely causes an exception.
2. The SRESET# pin only causes the processor to take the System Reset exception.

This puzzles me: since a soft reset only causes a System Reset exception and the state of the processor is unremarkable, why does the Linux exception handler let the system die?

Thanks for your reply!
--
Regards,

Paul Mackerras
2006-02-20 10:03
To Jin Qi Huang/China/Contr/IBM at IBMCN
cc linuxppc64-dev at ozlabs.org
Subject Re: Kernel oops then panic when perform a soft reset on ppc64 box

Jin Qi Huang writes:

> When I perform a soft reset on HMC console to a ppc64 box, the kernel oops
> then panic, here is the procedure to reproduce it:

That's normal, what did you expect it to do?

Paul.

From olof at lixom.net Thu Mar 2 03:45:31 2006
From: olof at lixom.net (Olof Johansson)
Date: Wed, 1 Mar 2006 10:45:31 -0600
Subject: [PATCH] Fix powerpc bad_page_fault output (Re: 2.6.16-rc5-mm1)
In-Reply-To: <4404E328.7070807@mbligh.org>
References: <20060228042439.43e6ef41.akpm@osdl.org> <4404E328.7070807@mbligh.org>
Message-ID: <20060301164531.GA17755@pb15.lixom.net>

On Tue, Feb 28, 2006 at 03:56:24PM -0800, Martin Bligh wrote:
> Andrew Morton wrote:
> >ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc5/2.6.16-rc5-mm1/
>
> New panic on IBM power4 lpar of P690. 2.6.16-rc5-git3 is OK.
>
> (config:
> http://ftp.kernel.org/pub/linux/kernel/people/mbligh/config/abat/power4)
>
> http://test.kernel.org/24165/debug/console.log

For what it's worth, this is a NULL pointer dereference in the RCU code.

Seems that the human-readable parts are printed at a different printk level (well, _at_ a level), so they fell off. Not good.

Andrew and/or Paulus, see patch below.

Thanks,

Olof

---

It seems that the die() output is printk'd without any printk level, so some distros will log the register dumps and the human-readable format differently.

(I.e.
see http://test.kernel.org/24165/debug/console.log, which lacks the KERN_ALERT parts)

Changing the die() output to include a level will likely confuse users that currently rely on getting the output where they're getting it, so instead remove it from the bad_page_fault() output.

Signed-off-by: Olof Johansson

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index ec4adcb..fee050a 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -389,7 +389,7 @@ void bad_page_fault(struct pt_regs *regs
 
 	/* kernel has accessed a bad area */
 
-	printk(KERN_ALERT "Unable to handle kernel paging request for ");
+	printk("Unable to handle kernel paging request for ");
 	switch (regs->trap) {
 		case 0x300:
 		case 0x380:
@@ -402,8 +402,7 @@ void bad_page_fault(struct pt_regs *regs
 		default:
 			printk("unknown fault\n");
 	}
-	printk(KERN_ALERT "Faulting instruction address: 0x%08lx\n",
-		regs->nip);
+	printk("Faulting instruction address: 0x%08lx\n", regs->nip);
 
 	die("Kernel access of bad area", regs, sig);
 }

From greg at kroah.com Thu Mar 2 08:46:00 2006
From: greg at kroah.com (Greg KH)
Date: Wed, 1 Mar 2006 13:46:00 -0800
Subject: fix build breakage in eeh.c in 2.6.16-rc5-git5
Message-ID: <20060301214600.GA17702@kroah.com>

This patch should fix a problem with eeh_add_device_late() not being defined in the ppc64 build process, causing the build to break.

Signed-off-by: Greg Kroah-Hartman

---
 include/asm-powerpc/eeh.h | 1 +
 1 files changed, 1 insertion(+)

--- linux-2.6.15.orig/include/asm-powerpc/eeh.h 2006-03-01 11:30:19.000000000 -0800
+++ linux-2.6.15/include/asm-powerpc/eeh.h 2006-03-01 12:04:25.000000000 -0800
@@ -61,6 +61,7 @@
  * to finish the eeh setup for this device.
*/ void eeh_add_device_early(struct device_node *); +void eeh_add_device_late(struct pci_dev *dev); void eeh_add_device_tree_early(struct device_node *); void eeh_add_device_tree_late(struct pci_bus *); From paulus at samba.org Thu Mar 2 08:54:56 2006 From: paulus at samba.org (Paul Mackerras) Date: Thu, 2 Mar 2006 08:54:56 +1100 Subject: fix build breakage in eeh.c in 2.6.16-rc5-git5 In-Reply-To: <20060301214600.GA17702@kroah.com> References: <20060301214600.GA17702@kroah.com> Message-ID: <17414.6192.426294.502401@cargo.ozlabs.ibm.com> Greg KH writes: > This patch should fixe a problem with eeh_add_device_late() not being > defined in the ppc64 build process, causing the build to break. John Rose just sent a patch making eeh_add_device_late static and moving it to be defined before it is called in arch/powerpc/platforms/pseries/eeh.c. Since he maintains this stuff, I'm more inclined to take his patch. Paul. From greg at kroah.com Thu Mar 2 09:03:28 2006 From: greg at kroah.com (Greg KH) Date: Wed, 1 Mar 2006 14:03:28 -0800 Subject: fix build breakage in eeh.c in 2.6.16-rc5-git5 In-Reply-To: <17414.6192.426294.502401@cargo.ozlabs.ibm.com> References: <20060301214600.GA17702@kroah.com> <17414.6192.426294.502401@cargo.ozlabs.ibm.com> Message-ID: <20060301220328.GB7354@kroah.com> On Thu, Mar 02, 2006 at 08:54:56AM +1100, Paul Mackerras wrote: > Greg KH writes: > > > This patch should fixe a problem with eeh_add_device_late() not being > > defined in the ppc64 build process, causing the build to break. > > John Rose just sent a patch making eeh_add_device_late static and > moving it to be defined before it is called in > arch/powerpc/platforms/pseries/eeh.c. > > Since he maintains this stuff, I'm more inclined to take his patch. 
That's fine with me, as long as it makes it into 2.6.16-final :) thanks, greg k-h From greg at kroah.com Thu Mar 2 09:15:40 2006 From: greg at kroah.com (Greg KH) Date: Wed, 1 Mar 2006 14:15:40 -0800 Subject: fix build breakage in eeh.c in 2.6.16-rc5-git5 In-Reply-To: <20060301220328.GB7354@kroah.com> References: <20060301214600.GA17702@kroah.com> <17414.6192.426294.502401@cargo.ozlabs.ibm.com> <20060301220328.GB7354@kroah.com> Message-ID: <20060301221540.GA9638@kroah.com> On Wed, Mar 01, 2006 at 02:03:28PM -0800, Greg KH wrote: > On Thu, Mar 02, 2006 at 08:54:56AM +1100, Paul Mackerras wrote: > > Greg KH writes: > > > > > This patch should fixe a problem with eeh_add_device_late() not being > > > defined in the ppc64 build process, causing the build to break. > > > > John Rose just sent a patch making eeh_add_device_late static and > > moving it to be defined before it is called in > > arch/powerpc/platforms/pseries/eeh.c. > > > > Since he maintains this stuff, I'm more inclined to take his patch. > > That's fine with me, as long as it makes it into 2.6.16-final :) Hm, looks like my fix made it into Linus's tree, so you might want to send him the "correct" way to do this against that. thanks, greg k-h From paulmck at us.ibm.com Thu Mar 2 11:09:36 2006 From: paulmck at us.ibm.com (Paul E. McKenney) Date: Wed, 1 Mar 2006 16:09:36 -0800 Subject: [PATCH] Fix powerpc bad_page_fault output (Re: 2.6.16-rc5-mm1) In-Reply-To: <20060301164531.GA17755@pb15.lixom.net> References: <20060228042439.43e6ef41.akpm@osdl.org> <4404E328.7070807@mbligh.org> <20060301164531.GA17755@pb15.lixom.net> Message-ID: <20060302000936.GE1296@us.ibm.com> On Wed, Mar 01, 2006 at 10:45:31AM -0600, Olof Johansson wrote: > On Tue, Feb 28, 2006 at 03:56:24PM -0800, Martin Bligh wrote: > > Andrew Morton wrote: > > >ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc5/2.6.16-rc5-mm1/ > > > > New panic on IBM power4 lpar of P690. 2.6.16-rc5-git3 is OK. 
> > > > (config: > > http://ftp.kernel.org/pub/linux/kernel/people/mbligh/config/abat/power4) > > > > http://test.kernel.org/24165/debug/console.log > > For what it's worth, this is a NULL pointer dereference in the RCU > code. And in an area where it is tougher than usual to blame the problem on a broken use of RCU, as well. ;-) The "rcp" argument to __rcu_process_callbacks() is C00000076F303F08 and "rdp" is C00000076F303F08, or am I mis-remembering the POWER ABI? Thanx, Paul > Seems that the human-readible parts are printed at a differnet printk level > (well, _at_ a level), so they fell off. Not good. > > Andrew and/or Paulus, see patch below. > > > Thanks, > > Olof > > > --- > > It seems that the die() output is printk'd without any prink level, > so some distros will log the register dumps and the human readible > format differently. > > (I.e. see http://test.kernel.org/24165/debug/console.log, which lacks > the KERN_ALERT parts) > > Changing the die() output to include a level will likely confuse users > that currently rely on getting the output where they're getting it, > so instead remove it from the bad_page_fault() output. 
> > Signed-off-by: Olof Johansson > > > diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c > index ec4adcb..fee050a 100644 > --- a/arch/powerpc/mm/fault.c > +++ b/arch/powerpc/mm/fault.c > @@ -389,7 +389,7 @@ void bad_page_fault(struct pt_regs *regs > > /* kernel has accessed a bad area */ > > - printk(KERN_ALERT "Unable to handle kernel paging request for "); > + printk("Unable to handle kernel paging request for "); > switch (regs->trap) { > case 0x300: > case 0x380: > @@ -402,8 +402,7 @@ void bad_page_fault(struct pt_regs *regs > default: > printk("unknown fault\n"); > } > - printk(KERN_ALERT "Faulting instruction address: 0x%08lx\n", > - regs->nip); > + printk("Faulting instruction address: 0x%08lx\n", regs->nip); > > die("Kernel access of bad area", regs, sig); > } > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo at vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > From paulus at samba.org Thu Mar 2 11:35:18 2006 From: paulus at samba.org (Paul Mackerras) Date: Thu, 2 Mar 2006 11:35:18 +1100 Subject: [PATCH] Fix powerpc bad_page_fault output (Re: 2.6.16-rc5-mm1) In-Reply-To: <20060301164531.GA17755@pb15.lixom.net> References: <20060228042439.43e6ef41.akpm@osdl.org> <4404E328.7070807@mbligh.org> <20060301164531.GA17755@pb15.lixom.net> Message-ID: <17414.15814.146349.883153@cargo.ozlabs.ibm.com> Olof Johansson writes: > Seems that the human-readible parts are printed at a differnet printk level > (well, _at_ a level), so they fell off. Not good. My understanding was that printk lines without a level are considered to be at KERN_ERR or so. Is that wrong? > Andrew and/or Paulus, see patch below. It really seems strange to be *removing* printk level tags. I'd like to nack this until I understand why it will improve things. 
At the very least it needs a big fat comment so some janitor doesn't come along and put the tags back in. Paul. From mbligh at mbligh.org Thu Mar 2 12:14:21 2006 From: mbligh at mbligh.org (Martin Bligh) Date: Wed, 01 Mar 2006 17:14:21 -0800 Subject: [PATCH] Fix powerpc bad_page_fault output (Re: 2.6.16-rc5-mm1) In-Reply-To: <17414.15814.146349.883153@cargo.ozlabs.ibm.com> References: <20060228042439.43e6ef41.akpm@osdl.org> <4404E328.7070807@mbligh.org> <20060301164531.GA17755@pb15.lixom.net> <17414.15814.146349.883153@cargo.ozlabs.ibm.com> Message-ID: <440646ED.2030108@mbligh.org> Paul Mackerras wrote: > Olof Johansson writes: > > >>Seems that the human-readible parts are printed at a differnet printk level >>(well, _at_ a level), so they fell off. Not good. > > > My understanding was that printk lines without a level are considered > to be at KERN_ERR or so. Is that wrong? > > >>Andrew and/or Paulus, see patch below. > > > It really seems strange to be *removing* printk level tags. I'd like > to nack this until I understand why it will improve things. At the > very least it needs a big fat comment so some janitor doesn't come > along and put the tags back in. He's removing KERN_ALERT ... I guess it could get switched from KERN_ALERT to KERN_ERR, but ... Either way, KERN_ALERT seems way too low to me. I object to getting half the oops, and not the other half ;-) M. From olof at lixom.net Thu Mar 2 13:22:44 2006 From: olof at lixom.net (Olof Johansson) Date: Wed, 1 Mar 2006 20:22:44 -0600 Subject: [PATCH] Fix powerpc bad_page_fault output (Re: 2.6.16-rc5-mm1) In-Reply-To: <440646ED.2030108@mbligh.org> References: <20060228042439.43e6ef41.akpm@osdl.org> <4404E328.7070807@mbligh.org> <20060301164531.GA17755@pb15.lixom.net> <17414.15814.146349.883153@cargo.ozlabs.ibm.com> <440646ED.2030108@mbligh.org> Message-ID: <20060302022244.GB17755@pb15.lixom.net> On Wed, Mar 01, 2006 at 05:14:21PM -0800, Martin Bligh wrote: > He's removing KERN_ALERT ... 
I guess it could get switched from
> KERN_ALERT to KERN_ERR, but ...
>
> Either way, KERN_ALERT seems way too low to me. I object to getting
> half the oops, and not the other half ;-)

Right. The new printk's were added recently, and I took the KERN_ALERT level from the x86 code then without double-checking what die() uses. I guess I could move the die() output over instead, or move them both to KERN_ERR.

-Olof

From paulus at samba.org Thu Mar 2 16:16:30 2006
From: paulus at samba.org (Paul Mackerras)
Date: Thu, 2 Mar 2006 16:16:30 +1100
Subject: [PATCH] Fix powerpc bad_page_fault output (Re: 2.6.16-rc5-mm1)
In-Reply-To: <440646ED.2030108@mbligh.org>
References: <20060228042439.43e6ef41.akpm@osdl.org> <4404E328.7070807@mbligh.org> <20060301164531.GA17755@pb15.lixom.net> <17414.15814.146349.883153@cargo.ozlabs.ibm.com> <440646ED.2030108@mbligh.org>
Message-ID: <17414.32686.589133.160989@cargo.ozlabs.ibm.com>

Martin Bligh writes:

> He's removing KERN_ALERT ... I guess it could get switched from
> KERN_ALERT to KERN_ERR, but ...
>
> Either way, KERN_ALERT seems way too low to me. I object to getting
> half the oops, and not the other half ;-)

KERN_ALERT is two steps higher in priority (lower number) than KERN_ERR. Why on earth would we see KERN_ERR messages but not KERN_ALERT messages? In fact die() should probably be using KERN_EMERG.

Messages without a loglevel are by default logged at KERN_WARNING level, one step lower in priority than KERN_ERR.

This all sounds to me like there is something wacky going on somewhere, and we need to get to the bottom of it rather than just remove printk tags.

Paul.
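
The level ordering described in the message above can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions, not kernel code: the LVL_* names, message_level() and reaches_console() are made up for illustration, and the model covers just the numeric comparison against a console threshold, not whatever a distro's syslog configuration does with each level afterwards. The "<N>" prefix is the same tag that shows up raw in the "<0>Kernel panic" line of the console log earlier in the thread.

```c
/* Numeric values mirror the kernel's KERN_* levels.
 * Lower number means higher priority. Names are illustrative only. */
enum loglevel {
    LVL_EMERG   = 0,
    LVL_ALERT   = 1,
    LVL_CRIT    = 2,
    LVL_ERR     = 3,
    LVL_WARNING = 4,
    LVL_NOTICE  = 5,
    LVL_INFO    = 6,
    LVL_DEBUG   = 7
};

/* A message with no "<N>" tag is treated as KERN_WARNING. */
#define DEFAULT_MESSAGE_LEVEL LVL_WARNING

/* Return the level encoded in a leading "<N>" tag, or the default
 * level when the message carries no tag. */
static int message_level(const char *msg)
{
    if (msg[0] == '<' && msg[1] >= '0' && msg[1] <= '7' && msg[2] == '>')
        return msg[1] - '0';
    return DEFAULT_MESSAGE_LEVEL;
}

/* A message is shown on the console only when its level is numerically
 * below the console threshold (assumed to behave like console_loglevel). */
static int reaches_console(const char *msg, int console_loglevel)
{
    return message_level(msg) < console_loglevel;
}
```

With a threshold of 4, reaches_console("<1>Unable to handle kernel paging request", 4) is true while the untagged die() output is filtered out. A plain threshold can therefore drop the untagged half of an oops but never the KERN_ALERT half, which is why the missing KERN_ALERT lines in the log above look like something else is going on.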
From anton at samba.org Thu Mar 2 16:24:29 2006 From: anton at samba.org (Anton Blanchard) Date: Thu, 2 Mar 2006 16:24:29 +1100 Subject: [PATCH] Fix powerpc bad_page_fault output (Re: 2.6.16-rc5-mm1) In-Reply-To: <20060302022244.GB17755@pb15.lixom.net> References: <20060228042439.43e6ef41.akpm@osdl.org> <4404E328.7070807@mbligh.org> <20060301164531.GA17755@pb15.lixom.net> <17414.15814.146349.883153@cargo.ozlabs.ibm.com> <440646ED.2030108@mbligh.org> <20060302022244.GB17755@pb15.lixom.net> Message-ID: <20060302052428.GF5552@krispykreme> > Right. The new printk's were added recently, and I took the KERN_ALERT > level from the x86 code then without double-checking what die() uses. I > guess I could move the die() output over instead, or move them both to > KERN_ERR. I just noticed x86 can now pass the log level around via show_trace_log_lvl and show_stack_log_lvl. Something we might want to add so we can KERN_EMERG the whole oops. Anton From santil at us.ibm.com Fri Mar 3 06:40:18 2006 From: santil at us.ibm.com (Santiago Leon) Date: Thu, 02 Mar 2006 13:40:18 -0600 Subject: [PATCH] powerpc: ibmveth: Harden driver initilisation for kexec In-Reply-To: <20060131042903.GF28896@krispykreme> References: <20060131041055.5623C68A46@ozlabs.org> <20060131042903.GF28896@krispykreme> Message-ID: <44074A22.8060705@us.ibm.com> From: Michael Ellerman After a kexec the veth driver will fail when trying to register with the Hypervisor because the previous kernel has not unregistered. So if the registration fails, we unregister and then try again. Signed-off-by: Michael Ellerman Acked-by: Anton Blanchard Signed-off-by: Santiago Leon --- drivers/net/ibmveth.c | 32 ++++++++++++++++++++++++++------ 1 files changed, 26 insertions(+), 6 deletions(-) Looks good to me, and has been around for a couple of months. 
Index: kexec/drivers/net/ibmveth.c =================================================================== --- kexec.orig/drivers/net/ibmveth.c +++ kexec/drivers/net/ibmveth.c @@ -436,6 +436,31 @@ static void ibmveth_cleanup(struct ibmve ibmveth_free_buffer_pool(adapter, &adapter->rx_buff_pool[i]); } +static int ibmveth_register_logical_lan(struct ibmveth_adapter *adapter, + union ibmveth_buf_desc rxq_desc, u64 mac_address) +{ + int rc, try_again = 1; + + /* After a kexec the adapter will still be open, so our attempt to + * open it will fail. So if we get a failure we free the adapter and + * try again, but only once. */ +retry: + rc = h_register_logical_lan(adapter->vdev->unit_address, + adapter->buffer_list_dma, rxq_desc.desc, + adapter->filter_list_dma, mac_address); + + if (rc != H_Success && try_again) { + do { + rc = h_free_logical_lan(adapter->vdev->unit_address); + } while (H_isLongBusy(rc) || (rc == H_Busy)); + + try_again = 0; + goto retry; + } + + return rc; +} + static int ibmveth_open(struct net_device *netdev) { struct ibmveth_adapter *adapter = netdev->priv; @@ -504,12 +529,7 @@ static int ibmveth_open(struct net_devic ibmveth_debug_printk("filter list @ 0x%p\n", adapter->filter_list_addr); ibmveth_debug_printk("receive q @ 0x%p\n", adapter->rx_queue.queue_addr); - - lpar_rc = h_register_logical_lan(adapter->vdev->unit_address, - adapter->buffer_list_dma, - rxq_desc.desc, - adapter->filter_list_dma, - mac_address); + lpar_rc = ibmveth_register_logical_lan(adapter, rxq_desc, mac_address); if(lpar_rc != H_Success) { ibmveth_error_printk("h_register_logical_lan failed with %ld\n", lpar_rc); From michael at ellerman.id.au Fri Mar 3 11:22:45 2006 From: michael at ellerman.id.au (Michael Ellerman) Date: Fri, 3 Mar 2006 11:22:45 +1100 Subject: [PATCH] powerpc: ibmveth: Harden driver initilisation for kexec In-Reply-To: <44074A22.8060705@us.ibm.com> References: <20060131041055.5623C68A46@ozlabs.org> <20060131042903.GF28896@krispykreme> 
<44074A22.8060705@us.ibm.com> Message-ID: <200603031122.51174.michael@ellerman.id.au> Hi Jeff, I realise it's late, but it'd be really good if you could send this up for 2.6.16, we're hosed without it. cheers On Fri, 3 Mar 2006 06:40, Santiago Leon wrote: > From: Michael Ellerman > > After a kexec the veth driver will fail when trying to register with the > Hypervisor because the previous kernel has not unregistered. > > So if the registration fails, we unregister and then try again. > > Signed-off-by: Michael Ellerman > Acked-by: Anton Blanchard > Signed-off-by: Santiago Leon > --- > > drivers/net/ibmveth.c | 32 ++++++++++++++++++++++++++------ > 1 files changed, 26 insertions(+), 6 deletions(-) > > Looks good to me, and has been around for a couple of months. > > Index: kexec/drivers/net/ibmveth.c > =================================================================== > --- kexec.orig/drivers/net/ibmveth.c > +++ kexec/drivers/net/ibmveth.c > @@ -436,6 +436,31 @@ static void ibmveth_cleanup(struct ibmve > ibmveth_free_buffer_pool(adapter, &adapter->rx_buff_pool[i]); > } > > +static int ibmveth_register_logical_lan(struct ibmveth_adapter *adapter, > + union ibmveth_buf_desc rxq_desc, u64 mac_address) > +{ > + int rc, try_again = 1; > + > + /* After a kexec the adapter will still be open, so our attempt to > + * open it will fail. So if we get a failure we free the adapter and > + * try again, but only once. 
*/ > +retry: > + rc = h_register_logical_lan(adapter->vdev->unit_address, > + adapter->buffer_list_dma, rxq_desc.desc, > + adapter->filter_list_dma, mac_address); > + > + if (rc != H_Success && try_again) { > + do { > + rc = h_free_logical_lan(adapter->vdev->unit_address); > + } while (H_isLongBusy(rc) || (rc == H_Busy)); > + > + try_again = 0; > + goto retry; > + } > + > + return rc; > +} > + > static int ibmveth_open(struct net_device *netdev) > { > struct ibmveth_adapter *adapter = netdev->priv; > @@ -504,12 +529,7 @@ static int ibmveth_open(struct net_devic > ibmveth_debug_printk("filter list @ 0x%p\n", adapter->filter_list_addr); > ibmveth_debug_printk("receive q @ 0x%p\n", > adapter->rx_queue.queue_addr); > > - > - lpar_rc = h_register_logical_lan(adapter->vdev->unit_address, > - adapter->buffer_list_dma, > - rxq_desc.desc, > - adapter->filter_list_dma, > - mac_address); > + lpar_rc = ibmveth_register_logical_lan(adapter, rxq_desc, mac_address); > > if(lpar_rc != H_Success) { > ibmveth_error_printk("h_register_logical_lan failed with %ld\n", > lpar_rc); > > > > _______________________________________________ > Linuxppc64-dev mailing list > Linuxppc64-dev at ozlabs.org > https://ozlabs.org/mailman/listinfo/linuxppc64-dev -- Michael Ellerman IBM OzLabs wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20060303/c8e585b2/attachment.pgp From rdunlap at xenotime.net Fri Mar 3 11:34:23 2006 From: rdunlap at xenotime.net (Randy.Dunlap) Date: Thu, 2 Mar 2006 16:34:23 -0800 Subject: [PATCH] powerpc: ibmveth: Harden driver initilisation for kexec In-Reply-To: <200603031122.51174.michael@ellerman.id.au> References: <20060131041055.5623C68A46@ozlabs.org> <20060131042903.GF28896@krispykreme> <44074A22.8060705@us.ibm.com> <200603031122.51174.michael@ellerman.id.au> Message-ID: <20060302163423.f758c5bc.rdunlap@xenotime.net> On Fri, 3 Mar 2006 11:22:45 +1100 Michael Ellerman wrote: > Hi Jeff, > > I realise it's late, but it'd be really good if you could send this up for > 2.6.16, we're hosed without it. I'm wondering if this means that for every virtual/hypervisor situation, we have to modify any $interested_drivers. Why wouldn't we come up with a cleaner solution (in the long term)? E.g., could the hypervisor know when one of it's virtual OSes dies or reboots and release its resources then? This patch just looks like a short-term solution to me. > cheers > > On Fri, 3 Mar 2006 06:40, Santiago Leon wrote: > > From: Michael Ellerman > > > > After a kexec the veth driver will fail when trying to register with the > > Hypervisor because the previous kernel has not unregistered. > > > > So if the registration fails, we unregister and then try again. > > > > Signed-off-by: Michael Ellerman > > Acked-by: Anton Blanchard > > Signed-off-by: Santiago Leon > > --- > > > > drivers/net/ibmveth.c | 32 ++++++++++++++++++++++++++------ > > 1 files changed, 26 insertions(+), 6 deletions(-) > > > > Looks good to me, and has been around for a couple of months. 
> > > > Index: kexec/drivers/net/ibmveth.c > > =================================================================== > > --- kexec.orig/drivers/net/ibmveth.c > > +++ kexec/drivers/net/ibmveth.c > > @@ -436,6 +436,31 @@ static void ibmveth_cleanup(struct ibmve > > ibmveth_free_buffer_pool(adapter, &adapter->rx_buff_pool[i]); > > } > > > > +static int ibmveth_register_logical_lan(struct ibmveth_adapter *adapter, > > + union ibmveth_buf_desc rxq_desc, u64 mac_address) > > +{ > > + int rc, try_again = 1; > > + > > + /* After a kexec the adapter will still be open, so our attempt to > > + * open it will fail. So if we get a failure we free the adapter and > > + * try again, but only once. */ > > +retry: > > + rc = h_register_logical_lan(adapter->vdev->unit_address, > > + adapter->buffer_list_dma, rxq_desc.desc, > > + adapter->filter_list_dma, mac_address); > > + > > + if (rc != H_Success && try_again) { > > + do { > > + rc = h_free_logical_lan(adapter->vdev->unit_address); > > + } while (H_isLongBusy(rc) || (rc == H_Busy)); > > + > > + try_again = 0; > > + goto retry; > > + } > > + > > + return rc; > > +} > > + > > static int ibmveth_open(struct net_device *netdev) > > { > > struct ibmveth_adapter *adapter = netdev->priv; > > @@ -504,12 +529,7 @@ static int ibmveth_open(struct net_devic > > ibmveth_debug_printk("filter list @ 0x%p\n", adapter->filter_list_addr); > > ibmveth_debug_printk("receive q @ 0x%p\n", > > adapter->rx_queue.queue_addr); > > > > - > > - lpar_rc = h_register_logical_lan(adapter->vdev->unit_address, > > - adapter->buffer_list_dma, > > - rxq_desc.desc, > > - adapter->filter_list_dma, > > - mac_address); > > + lpar_rc = ibmveth_register_logical_lan(adapter, rxq_desc, mac_address); > > > > if(lpar_rc != H_Success) { > > ibmveth_error_printk("h_register_logical_lan failed with %ld\n", > > lpar_rc); > > > > > > > > _______________________________________________ > > Linuxppc64-dev mailing list > > Linuxppc64-dev at ozlabs.org > > 
https://ozlabs.org/mailman/listinfo/linuxppc64-dev > > -- > Michael Ellerman > IBM OzLabs --- ~Randy From paulus at samba.org Fri Mar 3 12:00:54 2006 From: paulus at samba.org (Paul Mackerras) Date: Fri, 3 Mar 2006 12:00:54 +1100 Subject: [PATCH] powerpc: ibmveth: Harden driver initilisation for kexec In-Reply-To: <20060302163423.f758c5bc.rdunlap@xenotime.net> References: <20060131041055.5623C68A46@ozlabs.org> <20060131042903.GF28896@krispykreme> <44074A22.8060705@us.ibm.com> <200603031122.51174.michael@ellerman.id.au> <20060302163423.f758c5bc.rdunlap@xenotime.net> Message-ID: <17415.38214.77398.803632@cargo.ozlabs.ibm.com> Randy.Dunlap writes: > E.g., could the hypervisor know when one of it's virtual OSes > dies or reboots and release its resources then? I think the point is that with kexec, the same virtual machine keeps running, so the hypervisor doesn't see the OS dying or rebooting. Paul. From michael at ellerman.id.au Fri Mar 3 12:10:47 2006 From: michael at ellerman.id.au (Michael Ellerman) Date: Fri, 3 Mar 2006 12:10:47 +1100 Subject: [PATCH] powerpc: ibmveth: Harden driver initilisation for kexec In-Reply-To: <20060302163423.f758c5bc.rdunlap@xenotime.net> References: <20060131041055.5623C68A46@ozlabs.org> <200603031122.51174.michael@ellerman.id.au> <20060302163423.f758c5bc.rdunlap@xenotime.net> Message-ID: <200603031210.53220.michael@ellerman.id.au> On Fri, 3 Mar 2006 11:34, Randy.Dunlap wrote: > On Fri, 3 Mar 2006 11:22:45 +1100 Michael Ellerman wrote: > > Hi Jeff, > > > > I realise it's late, but it'd be really good if you could send this up > > for 2.6.16, we're hosed without it. > > I'm wondering if this means that for every virtual/hypervisor > situation, we have to modify any $interested_drivers. > Why wouldn't we come up with a cleaner solution (in the long term)? > > E.g., could the hypervisor know when one of it's virtual OSes > dies or reboots and release its resources then? 
It does exactly that for a regular reboot, but when we kexec we _don't_ die or reboot, as far as the Hypervisor is concerned it's all systems go. It's something of a double-edged sword, we're totally in control which gives us lots of flexibility, and _fast_ reboot times, but we also have to do a bit of extra stuff (ie. this patch) to keep things sane. cheers -- Michael Ellerman IBM OzLabs wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20060303/911a8749/attachment.pgp From jgarzik at pobox.com Fri Mar 3 12:04:39 2006 From: jgarzik at pobox.com (Jeff Garzik) Date: Thu, 02 Mar 2006 20:04:39 -0500 Subject: [PATCH] powerpc: ibmveth: Harden driver initilisation for kexec In-Reply-To: <44074A22.8060705@us.ibm.com> References: <20060131041055.5623C68A46@ozlabs.org> <20060131042903.GF28896@krispykreme> <44074A22.8060705@us.ibm.com> Message-ID: <44079627.6070100@pobox.com> Santiago Leon wrote: > From: Michael Ellerman > > After a kexec the veth driver will fail when trying to register with the > Hypervisor because the previous kernel has not unregistered. > > So if the registration fails, we unregister and then try again. > > Signed-off-by: Michael Ellerman > Acked-by: Anton Blanchard > Signed-off-by: Santiago Leon > --- > > drivers/net/ibmveth.c | 32 ++++++++++++++++++++++++++------ > 1 files changed, 26 insertions(+), 6 deletions(-) > > Looks good to me, and has been around for a couple of months. This seems completely bonkers to me: are resources available? if no free resources try again It makes resource checking pointless. 
Jeff From michael at ellerman.id.au Fri Mar 3 13:11:56 2006 From: michael at ellerman.id.au (Michael Ellerman) Date: Fri, 3 Mar 2006 13:11:56 +1100 Subject: [PATCH] powerpc: ibmveth: Harden driver initilisation for kexec In-Reply-To: <44079627.6070100@pobox.com> References: <20060131041055.5623C68A46@ozlabs.org> <44074A22.8060705@us.ibm.com> <44079627.6070100@pobox.com> Message-ID: <200603031312.00787.michael@ellerman.id.au> On Fri, 3 Mar 2006 12:04, Jeff Garzik wrote: > Santiago Leon wrote: > > From: Michael Ellerman > > > > After a kexec the veth driver will fail when trying to register with the > > Hypervisor because the previous kernel has not unregistered. > > > > So if the registration fails, we unregister and then try again. > > > > Signed-off-by: Michael Ellerman > > Acked-by: Anton Blanchard > > Signed-off-by: Santiago Leon > > --- > > > > drivers/net/ibmveth.c | 32 ++++++++++++++++++++++++++------ > > 1 files changed, 26 insertions(+), 6 deletions(-) > > > > Looks good to me, and has been around for a couple of months. > > This seems completely bonkers to me: > > are resources available? > if no > free resources > try again I'm not sure I follow, are you suggesting we do the h_free_logical_lan() unconditionally, followed by h_register_logical_lan() ?? If that's what you mean, I didn't do it that way because it would effect the normal code path. This patch only modifies the behaviour if we fail to register the adapter. I'm much more comfortable changing the failure case than the default. cheers -- Michael Ellerman IBM OzLabs wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20060303/b9cd998a/attachment.pgp From rdunlap at xenotime.net Fri Mar 3 15:12:42 2006 From: rdunlap at xenotime.net (Randy.Dunlap) Date: Thu, 2 Mar 2006 20:12:42 -0800 Subject: [PATCH] powerpc: ibmveth: Harden driver initilisation for kexec In-Reply-To: <200603031210.53220.michael@ellerman.id.au> References: <20060131041055.5623C68A46@ozlabs.org> <200603031122.51174.michael@ellerman.id.au> <20060302163423.f758c5bc.rdunlap@xenotime.net> <200603031210.53220.michael@ellerman.id.au> Message-ID: <20060302201242.b688f811.rdunlap@xenotime.net> On Fri, 3 Mar 2006 12:10:47 +1100 Michael Ellerman wrote: > On Fri, 3 Mar 2006 11:34, Randy.Dunlap wrote: > > On Fri, 3 Mar 2006 11:22:45 +1100 Michael Ellerman wrote: > > > Hi Jeff, > > > > > > I realise it's late, but it'd be really good if you could send this up > > > for 2.6.16, we're hosed without it. > > > > I'm wondering if this means that for every virtual/hypervisor > > situation, we have to modify any $interested_drivers. > > Why wouldn't we come up with a cleaner solution (in the long term)? > > > > E.g., could the hypervisor know when one of it's virtual OSes > > dies or reboots and release its resources then? > > It does exactly that for a regular reboot, but when we kexec we _don't_ die or > reboot, as far as the Hypervisor is concerned it's all systems go. > > It's something of a double-edged sword, we're totally in control which gives > us lots of flexibility, and _fast_ reboot times, but we also have to do a bit > of extra stuff (ie. this patch) to keep things sane. s/this patch/some patch/ Yes, you have certainly thought about this more/longer than I have, so why is something more generic like this bad instead of good: Somewhere early in start_kernel() (e.g.), do an hv call that says "free all assigned resources". Maybe hv doesn't know "all assigned resources." 
Maybe it's just that this patch is simpler than an hv change, although this (current) patch could leave some other drivers that need to be "fixed," while an hv change wouldn't do that. So I'm not opposed to this current patch as a short-term solution, but I don't think it's the right long-term solution. --- ~Randy From david at gibson.dropbear.id.au Fri Mar 3 16:24:06 2006 From: david at gibson.dropbear.id.au (David Gibson) Date: Fri, 3 Mar 2006 16:24:06 +1100 Subject: powerpc: Fix pud_ERROR() message Message-ID: <20060303052406.GK23766@localhost.localdomain> Paulus, please apply The powerpc pud_ERROR() function misleadingly prints a message indicating a pmd error. This patch fixes. Signed-off-by: David Gibson Index: working-2.6/include/asm-powerpc/pgtable-4k.h =================================================================== --- working-2.6.orig/include/asm-powerpc/pgtable-4k.h 2006-03-03 16:21:31.000000000 +1100 +++ working-2.6/include/asm-powerpc/pgtable-4k.h 2006-03-03 16:21:53.000000000 +1100 @@ -93,4 +93,4 @@ (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))) #define pud_ERROR(e) \ - printk("%s:%d: bad pmd %08lx.\n", __FILE__, __LINE__, pud_val(e)) + printk("%s:%d: bad pud %08lx.\n", __FILE__, __LINE__, pud_val(e)) -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson 
From dhowells at redhat.com Sat Mar 4 03:03:00 2006 From: dhowells at redhat.com (David Howells) Date: Fri, 03 Mar 2006 16:03:00 +0000 Subject: Memory barriers and spin_unlock safety Message-ID: <32518.1141401780@warthog.cambridge.redhat.com> Hi, We've just had an interesting discussion on IRC and this has come up with two unanswered questions: (1) Is spin_unlock() is entirely safe on Pentium3+ and x86_64 where ?FENCE instructions are available? Consider the following case, where you want to do two reads effectively atomically, and so wrap them in a spinlock: spin_lock(&mtx); a = *A; b = *B; spin_unlock(&mtx); On x86 Pentium3+ and x86_64, what's to stop you from getting the reads done after the unlock since there's no LFENCE instruction there to stop you? What you'd expect is: LOCK WRITE mtx --> implies MFENCE READ *A } which may be reordered READ *B } WRITE mtx But what you might get instead is this: LOCK WRITE mtx --> implies MFENCE WRITE mtx --> implies SFENCE READ *A } which may be reordered READ *B } There doesn't seem to be anything that says that the reads can't leak outside of the locked section; at least, there doesn't in the AMD's system programming manual for Amd64 (book 2, section 7.1). Writes on the other hand may not happen out of order, so changing things inside a critical section would seem to be okay. 
On PowerPC, on the other hand, the barriers have to be made explicit because they're not implied by LWARX/STWCX or by ordinary stores: LWARX mtx STWCX mtx ISYNC READ *A } which may be reordered READ *B } LWSYNC WRITE mtx So, should the spin_unlock() on i386 and x86_64 be doing an LFENCE instruction before unlocking? (2) What is the minimum functionality that can be expected of a memory barriers? I was of the opinion that all we could expect is for the CPU executing one them to force the instructions it is executing to be complete up to a point - depending on the type of barrier - before continuing past it. On pentiums, x86_64, and frv this seems to be exactly what you get for a barrier; there doesn't seem to be any external evidence of it that appears on the bus, other than the CPU does a load of memory transactions. However, on ppc/ppc64, it seems to be more thorough than that, and there seems to be some special interaction between the CPU processing the instruction and the other CPUs in the system. It's not entirely obvious from the manual just what this does. As I understand it, Andrew Morton is of the opinion that issuing a read barrier on one CPU will cause the other CPUs in the system to sync up, but that doesn't look likely on all archs. David From dhowells at redhat.com Sat Mar 4 03:45:46 2006 From: dhowells at redhat.com (David Howells) Date: Fri, 03 Mar 2006 16:45:46 +0000 Subject: Memory barriers and spin_unlock safety In-Reply-To: <32518.1141401780@warthog.cambridge.redhat.com> References: <32518.1141401780@warthog.cambridge.redhat.com> Message-ID: <1146.1141404346@warthog.cambridge.redhat.com> David Howells wrote: > WRITE mtx > --> implies SFENCE Actually, I'm not sure this is true. The AMD64 Instruction Manual's writeup of SFENCE implies that writes can be reordered, which sort of contradicts what the AMD64 System Programming Manual says. 
If this isn't true, then x86_64 at least should do MFENCE before the store in spin_unlock() or change the store to be LOCK'ed. The same may also apply for Pentium3+ class CPUs with the i386 arch. David From torvalds at osdl.org Sat Mar 4 03:55:35 2006 From: torvalds at osdl.org (Linus Torvalds) Date: Fri, 3 Mar 2006 08:55:35 -0800 (PST) Subject: Memory barriers and spin_unlock safety In-Reply-To: <32518.1141401780@warthog.cambridge.redhat.com> References: <32518.1141401780@warthog.cambridge.redhat.com> Message-ID: On Fri, 3 Mar 2006, David Howells wrote: > > We've just had an interesting discussion on IRC and this has come up with two > unanswered questions: > > (1) Is spin_unlock() is entirely safe on Pentium3+ and x86_64 where ?FENCE > instructions are available? > > Consider the following case, where you want to do two reads effectively > atomically, and so wrap them in a spinlock: > > spin_lock(&mtx); > a = *A; > b = *B; > spin_unlock(&mtx); > > On x86 Pentium3+ and x86_64, what's to stop you from getting the reads > done after the unlock since there's no LFENCE instruction there to stop > you? The rules are, afaik, that reads can pass buffered writes, BUT WRITES CANNOT PASS READS (aka "writes to memory are always carried out in program order"). IOW, reads can bubble up, but writes cannot. So the way I read the Intel rules is that "passing" is always about being done earlier than otherwise allowed, not about being done later. (You only "pass" somebody in traffic when you go ahead of them. If you fall behind them, you don't "pass" them, _they_ pass you). Now, this is not so much meant to be a semantic argument (the meaning of the word "pass") as to an explanation of what I believe Intel meant, since we know from Intel designers that the simple non-atomic write is supposedly a perfectly fine unlock instruction. So when Intel says "reads can be carried out speculatively and in any order", that just says that reads are not ordered wrt other _reads_. 
They _are_ ordered wrt other writes, but only one way: they can pass an earlier write, but they can't fall back behind a later one. This is consistent with (a) optimization (you want to do reads _early_, not late) (b) behaviour (we've been told that a single write is sufficient, with the exception of an early P6 core revision) (c) at least one way of reading the documentation. And I claim that (a) and (b) are the important parts, and that (c) is just the rationale. > (2) What is the minimum functionality that can be expected of a memory > barriers? I was of the opinion that all we could expect is for the CPU > executing one them to force the instructions it is executing to be > complete up to a point - depending on the type of barrier - before > continuing past it. Well, no. You should expect even _less_. The core can continue doing things past a barrier. For example, a write barrier may not actually serialize anything at all: the sane way of doing write barriers is to just put a note in the write-queue, and that note just disallows write queue entries from being moved around it. So you might have a write barrier with two writes on either side, and the writes might _both_ be outstanding wrt the core despite the barrier. So there's not necessarily any synchronization at all on a execution core level, just a partial ordering between the resulting actions of the core. > However, on ppc/ppc64, it seems to be more thorough than that, and there > seems to be some special interaction between the CPU processing the > instruction and the other CPUs in the system. It's not entirely obvious > from the manual just what this does. PPC has an absolutely _horrible_ memory ordering implementation, as far as I can tell. The thing is broken. I think it's just implementation breakage, not anything really fundamental, but the fact that their write barriers are expensive is a big sign that they are doing something bad. 
For example, their write buffers may not have a way to serialize in the buffers, and at that point from an _implementation_ standpoint, you just have to serialize the whole core to make sure that writes don't pass each other. > As I understand it, Andrew Morton is of the opinion that issuing a read > barrier on one CPU will cause the other CPUs in the system to sync up, but > that doesn't look likely on all archs. No. Issuing a read barrier on one CPU will do absolutely _nothing_ on the other CPU. All barriers are purely local to one CPU, and do not generate any bus traffic what-so-ever. They only potentially affect the order of bus traffic due to the instructions around them (obviously). So a read barrier on one CPU _has_ to be paired with a write barrier on the other side in order to make sense (although the write barrier can obviously be of the implied kind, ie a lock/unlock event, or just architecture-specific knowledge of write behaviour, ie for example knowing that writes are always seen in-order on x86). Linus From torvalds at osdl.org Sat Mar 4 04:03:05 2006 From: torvalds at osdl.org (Linus Torvalds) Date: Fri, 3 Mar 2006 09:03:05 -0800 (PST) Subject: Memory barriers and spin_unlock safety In-Reply-To: <1146.1141404346@warthog.cambridge.redhat.com> References: <32518.1141401780@warthog.cambridge.redhat.com> <1146.1141404346@warthog.cambridge.redhat.com> Message-ID: On Fri, 3 Mar 2006, David Howells wrote: > David Howells wrote: > > > WRITE mtx > > --> implies SFENCE > > Actually, I'm not sure this is true. The AMD64 Instruction Manual's writeup of > SFENCE implies that writes can be reordered, which sort of contradicts what > the AMD64 System Programming Manual says. Note that _normal_ writes never need an SFENCE, because they are ordered by the core. The reason to use SFENCE is because of _special_ writes. 
For example, if you use a non-temporal store, then the write buffer ordering goes away, because there is no write buffer involved (the store goes directly to the L2 or outside the bus). Or when you talk to weakly ordered memory (ie a frame buffer that isn't cached, and where the MTRR memory ordering bits say that writes be done speculatively), you may want to say "I'm going to do the store that starts the graphics pipeline, all my previous stores need to be done now". THAT is when you need to use SFENCE. So SFENCE really isn't about the "smp_wmb()" kind of fencing at all. It's about the much weaker ordering that is allowed by the special IO memory types and nontemporal instructions. (Actually, I think one special case of non-temporal instruction is the "repeat movs/stos" thing: I think you should _not_ use a "repeat stos" to unlock a spinlock, exactly because those stores are not ordered wrt each other, and they can bypass the write queue. Of course, doing that would be insane anyway, so no harm done ;^). > If this isn't true, then x86_64 at least should do MFENCE before the store in > spin_unlock() or change the store to be LOCK'ed. The same may also apply for > Pentium3+ class CPUs with the i386 arch. No. But if you want to make sure, you can always check with Intel engineers. I'm pretty sure I have this right, though, because Intel engineers have certainly looked at Linux sources and locking, and nobody has ever said that we'd need an SFENCE. 
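To make the non-temporal case concrete, here is a minimal user-space sketch in C, not taken from the kernel. It fills a buffer with non-temporal stores and only then publishes a flag, with an SFENCE in between. The names `nt_fill_and_publish`, `buf` and `ready` are made up for illustration; the SSE2 intrinsics `_mm_stream_si32()` and `_mm_sfence()` stand in for MOVNTI and SFENCE, and on targets without SSE2 the sketch simply falls back to ordinary stores.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>	/* _mm_stream_si32(), _mm_sfence() */
#endif

/* Hypothetical helper: fill buf[0..n) with value, then set *ready to 1. */
static void nt_fill_and_publish(int *buf, size_t n, int value, atomic_int *ready)
{
	assert(buf != NULL && ready != NULL);

	for (size_t i = 0; i < n; i++) {
#if defined(__SSE2__)
		/* Non-temporal store (MOVNTI): weakly ordered, unlike a normal store. */
		_mm_stream_si32(&buf[i], value);
#else
		buf[i] = value;	/* fallback: ordinary store */
#endif
	}

#if defined(__SSE2__)
	/* Order the non-temporal stores before the flag store below. */
	_mm_sfence();
#endif

	/* Ordinary release store; on x86 this is typically a plain mov. */
	atomic_store_explicit(ready, 1, memory_order_release);
}
```

If the loop used normal stores only, the fence would be unnecessary on x86, which is the distinction drawn above between SFENCE and smp_wmb()-style ordering.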
Linus From arjan at infradead.org Sat Mar 4 07:02:12 2006 From: arjan at infradead.org (Arjan van de Ven) Date: Fri, 03 Mar 2006 21:02:12 +0100 Subject: Memory barriers and spin_unlock safety In-Reply-To: <1146.1141404346@warthog.cambridge.redhat.com> References: <32518.1141401780@warthog.cambridge.redhat.com> <1146.1141404346@warthog.cambridge.redhat.com> Message-ID: <1141416133.10732.65.camel@laptopd505.fenrus.org> On Fri, 2006-03-03 at 16:45 +0000, David Howells wrote: > David Howells wrote: > > > WRITE mtx > > --> implies SFENCE > > Actually, I'm not sure this is true. The AMD64 Instruction Manual's writeup of > SFENCE implies that writes can be reordered, which sort of contradicts what > the AMD64 System Programming Manual says. There are 2 or 3 special instructions which do "non-temporal stores" (movntq and movnti and maybe one more). sfence is designed for those. From dhowells at redhat.com Sat Mar 4 07:15:35 2006 From: dhowells at redhat.com (David Howells) Date: Fri, 03 Mar 2006 20:15:35 +0000 Subject: Memory barriers and spin_unlock safety In-Reply-To: References: <32518.1141401780@warthog.cambridge.redhat.com> Message-ID: <5001.1141416935@warthog.cambridge.redhat.com> Linus Torvalds wrote: > The rules are, afaik, that reads can pass buffered writes, BUT WRITES > CANNOT PASS READS (aka "writes to memory are always carried out in program > order"). So in the example I gave, a read after the spin_unlock() may actually get executed before the store in the spin_unlock(), but a read before the unlock will not get executed after. > No. Issuing a read barrier on one CPU will do absolutely _nothing_ on the > other CPU. Well, I think you mean will guarantee absolutely _nothing_ on the other CPU for the Linux kernel. According to the IBM powerpc book I have, it does actually do something on the other CPUs, though it doesn't say exactly what. Anyway, thanks. I'll write up some documentation on barriers for inclusion in the kernel. 
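The pairing point made earlier in this thread (a read barrier on one CPU only means something when matched by a write barrier on the other) can be sketched in portable C11. This is an illustration under assumptions, not kernel code: `struct mailbox`, `mailbox_publish` and `mailbox_try_read` are hypothetical names, and the C11 release and acquire fences only roughly play the roles of smp_wmb() and smp_rmb().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mailbox shared between a writer CPU and a reader CPU. */
struct mailbox {
	int payload;		/* ordinary data */
	atomic_int ready;	/* 0 = empty, 1 = payload is valid */
};

/* Writer side: data store, write barrier, flag store. */
static void mailbox_publish(struct mailbox *m, int value)
{
	m->payload = value;
	atomic_thread_fence(memory_order_release);	/* roughly smp_wmb() */
	atomic_store_explicit(&m->ready, 1, memory_order_relaxed);
}

/* Reader side: flag load, read barrier, data load. */
static bool mailbox_try_read(struct mailbox *m, int *out)
{
	assert(out != NULL);

	if (!atomic_load_explicit(&m->ready, memory_order_relaxed))
		return false;	/* nothing published yet; *out is untouched */
	atomic_thread_fence(memory_order_acquire);	/* roughly smp_rmb() */
	*out = m->payload;
	return true;
}
```

Neither fence causes anything to happen on the other CPU. Each one only constrains the order in which its own CPU's accesses become visible, and the guarantee comes from having both halves.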
David From dhowells at redhat.com Sat Mar 4 07:17:07 2006 From: dhowells at redhat.com (David Howells) Date: Fri, 03 Mar 2006 20:17:07 +0000 Subject: Memory barriers and spin_unlock safety In-Reply-To: References: <32518.1141401780@warthog.cambridge.redhat.com> <1146.1141404346@warthog.cambridge.redhat.com> Message-ID: <5041.1141417027@warthog.cambridge.redhat.com> Linus Torvalds wrote: > Note that _normal_ writes never need an SFENCE, because they are ordered > by the core. > > The reason to use SFENCE is because of _special_ writes. I suspect, then, that x86_64 should not have an SFENCE for smp_wmb(), and that only io_wmb() should have that. David From benh at kernel.crashing.org Sat Mar 4 08:06:05 2006 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Sat, 04 Mar 2006 08:06:05 +1100 Subject: Memory barriers and spin_unlock safety In-Reply-To: References: <32518.1141401780@warthog.cambridge.redhat.com> Message-ID: <1141419966.3888.67.camel@localhost.localdomain> > PPC has an absolutely _horrible_ memory ordering implementation, as far as > I can tell. The thing is broken. I think it's just implementation > breakage, not anything really fundamental, but the fact that their write > barriers are expensive is a big sign that they are doing something bad. Are they ? read barriers and full barriers are, write barriers should be fairly cheap (but then, I haven't measured). > For example, their write buffers may not have a way to serialize in the > buffers, and at that point from an _implementation_ standpoint, you just > have to serialize the whole core to make sure that writes don't pass each > other. The main problem I've had in the past with the ppc barriers is more a subtle thing in the spec that unfortunately was taken to the word by implementors, and is that the simple write barrier (eieio) will only order within the same storage space, that is will not order between cacheable and non-cacheable storage. That means IOs could leak out of locks etc... 
Which is why we use expensive barriers in MMIO wrappers for now (though we might investigate the use of mmioXb instead in the future). > No. Issuing a read barrier on one CPU will do absolutely _nothing_ on the > other CPU. All barriers are purely local to one CPU, and do not generate > any bus traffic what-so-ever. They only potentially affect the order of > bus traffic due to the instructions around them (obviously). Actually, the ppc's full barrier (sync) will generate bus traffic, and I think in some case eieio barriers can propagate to the chipset to enforce ordering there too depending on some voodoo settings and wether the storage space is cacheable or not. > So a read barrier on one CPU _has_ to be paired with a write barrier on > the other side in order to make sense (although the write barrier can > obviously be of the implied kind, ie a lock/unlock event, or just > architecture-specific knowledge of write behaviour, ie for example knowing > that writes are always seen in-order on x86). > > Linus > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo at vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ From torvalds at osdl.org Sat Mar 4 08:31:08 2006 From: torvalds at osdl.org (Linus Torvalds) Date: Fri, 3 Mar 2006 13:31:08 -0800 (PST) Subject: Memory barriers and spin_unlock safety In-Reply-To: <5001.1141416935@warthog.cambridge.redhat.com> References: <32518.1141401780@warthog.cambridge.redhat.com> <5001.1141416935@warthog.cambridge.redhat.com> Message-ID: On Fri, 3 Mar 2006, David Howells wrote: > > So in the example I gave, a read after the spin_unlock() may actually get > executed before the store in the spin_unlock(), but a read before the unlock > will not get executed after. Yes. > > No. Issuing a read barrier on one CPU will do absolutely _nothing_ on the > > other CPU. 
> > Well, I think you mean will guarantee absolutely _nothing_ on the other CPU for > the Linux kernel. According to the IBM powerpc book I have, it does actually > do something on the other CPUs, though it doesn't say exactly what. Yeah, Power really does have some funky stuff in their memory ordering. I'm not quite sure why, though. And it definitely isn't implied by any of the Linux kernel barriers. (They also do TLB coherency in hw etc strange things). Linus From torvalds at osdl.org Sat Mar 4 08:34:17 2006 From: torvalds at osdl.org (Linus Torvalds) Date: Fri, 3 Mar 2006 13:34:17 -0800 (PST) Subject: Memory barriers and spin_unlock safety In-Reply-To: <5041.1141417027@warthog.cambridge.redhat.com> References: <32518.1141401780@warthog.cambridge.redhat.com> <1146.1141404346@warthog.cambridge.redhat.com> <5041.1141417027@warthog.cambridge.redhat.com> Message-ID: On Fri, 3 Mar 2006, David Howells wrote: > > I suspect, then, that x86_64 should not have an SFENCE for smp_wmb(), and that > only io_wmb() should have that. Indeed. I think smp_wmb() should be a compiler fence only on x86(-64), ie just compile to a "barrier()" (and not even that on UP, of course). Linus From davem at davemloft.net Sat Mar 4 08:52:17 2006 From: davem at davemloft.net (David S. 
Miller) Date: Fri, 03 Mar 2006 13:52:17 -0800 (PST) Subject: Memory barriers and spin_unlock safety In-Reply-To: <200603031518.15806.hollisb@us.ibm.com> References: <1141419966.3888.67.camel@localhost.localdomain> <200603031518.15806.hollisb@us.ibm.com> Message-ID: <20060303.135217.65983538.davem@davemloft.net> From: Hollis Blanchard Date: Fri, 3 Mar 2006 15:18:13 -0600 > On Friday 03 March 2006 15:06, Benjamin Herrenschmidt wrote: > > The main problem I've had in the past with the ppc barriers is more a > > subtle thing in the spec that unfortunately was taken to the word by > > implementors, and is that the simple write barrier (eieio) will only > > order within the same storage space, that is will not order between > > cacheable and non-cacheable storage. > > I've heard Sparc has the same issue... in which case it may not be a "chip > designer was too literal" thing, but rather it really simplifies chip > implementation to do it that way. There is a "membar #MemIssue" that is meant to deal with this should it ever matter, but for most sparc64 chips it doesn't which is why we don't use that memory barrier type at all in the Linux kernel. For UltraSPARC-I and II it technically could matter in Relaxed Memory Ordering (RMO) mode which is what we run the kernel and 64-bit userspace in, but I've never seen an issue resulting from it. For UltraSPARC-III and later, the chip only implements the Total Store Ordering (TSO) memory model and the manual explicitly states that cacheable and non-cacheable memory operations are ordered, even using language such as "there is an implicit 'membar #MemIssue' between them". It further goes on to say: The UltraSPARCIII Cu processor maintains ordering between cacheable and non-cacheable accesses. The UltraSPARC III Cu processor maintains TSO ordering between memory references regardless of their cacheability. Niagara behaves almost identically to UltraSPARC-III in this area. 
From torvalds at osdl.org Sat Mar 4 09:04:21 2006 From: torvalds at osdl.org (Linus Torvalds) Date: Fri, 3 Mar 2006 14:04:21 -0800 (PST) Subject: Memory barriers and spin_unlock safety In-Reply-To: <1141419966.3888.67.camel@localhost.localdomain> References: <32518.1141401780@warthog.cambridge.redhat.com> <1141419966.3888.67.camel@localhost.localdomain> Message-ID: On Sat, 4 Mar 2006, Benjamin Herrenschmidt wrote: > > The main problem I've had in the past with the ppc barriers is more a > subtle thing in the spec that unfortunately was taken to the word by > implementors, and is that the simple write barrier (eieio) will only > order within the same storage space, that is will not order between > cacheable and non-cacheable storage. If so, a simple write barrier should be sufficient. That's exactly what the x86 write barriers do too, ie stores to magic IO space are _not_ ordered wrt a normal [smp_]wmb() (or, as per how this thread started, a spin_unlock()) at all. On x86, we actually have this "CONFIG_X86_OOSTORE" configuration option that gets enabled if you select a WINCHIP device, because that allows a weaker memory ordering for normal memory too, and that will end up using an "sfence" instruction for store buffers. But it's not normally enabled. So the eieio should be sufficient, then. Of course, the x86 store buffers do tend to flush out stuff after a certain cycle-delay too, so there may be drivers that technically are buggy on x86, but where the store buffer in practice is small and flushes out quickly enough that you'll never _see_ the bug. > Actually, the ppc's full barrier (sync) will generate bus traffic, and I > think in some case eieio barriers can propagate to the chipset to > enforce ordering there too depending on some voodoo settings and wether > the storage space is cacheable or not. Well, the regular kernel ops definitely won't depend on that, since that's not the case anywhere else. 
Linus From bcrl at linux.intel.com Sat Mar 4 08:51:14 2006 From: bcrl at linux.intel.com (Benjamin LaHaise) Date: Fri, 3 Mar 2006 13:51:14 -0800 Subject: Memory barriers and spin_unlock safety In-Reply-To: References: <32518.1141401780@warthog.cambridge.redhat.com> <1146.1141404346@warthog.cambridge.redhat.com> <5041.1141417027@warthog.cambridge.redhat.com> Message-ID: <20060303215114.GA13893@linux.intel.com> On Fri, Mar 03, 2006 at 01:34:17PM -0800, Linus Torvalds wrote: > Indeed. I think smp_wmb() should be a compiler fence only on x86(-64), ie > just compile to a "barrier()" (and not even that on UP, of course). Actually, no. At least in testing an implementation of Dekker's and Peterson's algorithms as a replacement for the locked operation in our spinlocks, it is absolutely necessary to have an sfence in the lock to ensure the lock is visible to the other CPU before proceeding. I'd use smp_wmb() as the fence is completely unnecessary on UP and is even irq-safe. Here's a copy of the Peterson's implementation to illustrate (it works, it's just slower than the existing spinlocks). 

		-ben

diff --git a/include/asm-x86_64/spinlock.h b/include/asm-x86_64/spinlock.h
index fe484a6..45bd386 100644
--- a/include/asm-x86_64/spinlock.h
+++ b/include/asm-x86_64/spinlock.h
@@ -4,6 +4,8 @@
 #include
 #include
 #include
+#include
+#include
 #include
 
 /*
@@ -18,50 +20,53 @@
  */
 
 #define __raw_spin_is_locked(x) \
-		(*(volatile signed int *)(&(x)->slock) <= 0)
-
-#define __raw_spin_lock_string \
-	"\n1:\t" \
-	"lock ; decl %0\n\t" \
-	"js 2f\n" \
-	LOCK_SECTION_START("") \
-	"2:\t" \
-	"rep;nop\n\t" \
-	"cmpl $0,%0\n\t" \
-	"jle 2b\n\t" \
-	"jmp 1b\n" \
-	LOCK_SECTION_END
-
-#define __raw_spin_unlock_string \
-	"movl $1,%0" \
-		:"=m" (lock->slock) : : "memory"
+		((*(volatile signed int *)(x) & ~0xff) != 0)
 
 static inline void __raw_spin_lock(raw_spinlock_t *lock)
 {
-	__asm__ __volatile__(
-		__raw_spin_lock_string
-		:"=m" (lock->slock) : : "memory");
+	int cpu = read_pda(cpunumber);
+
+	barrier();
+	lock->flags[cpu] = 1;
+	lock->turn = cpu ^ 1;
+	barrier();
+
+	asm volatile("sfence":::"memory");
+
+	while (lock->flags[cpu ^ 1] && (lock->turn != cpu)) {
+		cpu_relax();
+		barrier();
+	}
 }
 
 #define __raw_spin_lock_flags(lock, flags) __raw_spin_lock(lock)
 
 static inline int __raw_spin_trylock(raw_spinlock_t *lock)
 {
-	int oldval;
-
-	__asm__ __volatile__(
-		"xchgl %0,%1"
-		:"=q" (oldval), "=m" (lock->slock)
-		:"0" (0) : "memory");
-
-	return oldval > 0;
+	int cpu = read_pda(cpunumber);
+	barrier();
+	if (__raw_spin_is_locked(lock))
+		return 0;
+
+	lock->flags[cpu] = 1;
+	lock->turn = cpu ^ 1;
+	asm volatile("sfence":::"memory");
+
+	if (lock->flags[cpu ^ 1] && (lock->turn != cpu)) {
+		lock->flags[cpu] = 0;
+		barrier();
+		return 0;
+	}
+	return 1;
 }
 
 static inline void __raw_spin_unlock(raw_spinlock_t *lock)
 {
-	__asm__ __volatile__(
-		__raw_spin_unlock_string
-	);
+	int cpu;
+	//asm volatile("lfence":::"memory");
+	cpu = read_pda(cpunumber);
+	lock->flags[cpu] = 0;
+	barrier();
 }
 
 #define __raw_spin_unlock_wait(lock) \
diff --git a/include/asm-x86_64/spinlock_types.h b/include/asm-x86_64/spinlock_types.h
index 59efe84..a409cbf 100644
--- a/include/asm-x86_64/spinlock_types.h
+++ b/include/asm-x86_64/spinlock_types.h
@@ -6,10 +6,11 @@
 #endif
 
 typedef struct {
-	volatile unsigned int slock;
+	volatile unsigned char turn;
+	volatile unsigned char flags[3];
 } raw_spinlock_t;
 
-#define __RAW_SPIN_LOCK_UNLOCKED { 1 }
+#define __RAW_SPIN_LOCK_UNLOCKED { 0, { 0, } }
 
 typedef struct {
 	volatile unsigned int lock;

From torvalds at osdl.org Sat Mar 4 09:21:58 2006
From: torvalds at osdl.org (Linus Torvalds)
Date: Fri, 3 Mar 2006 14:21:58 -0800 (PST)
Subject: Memory barriers and spin_unlock safety
In-Reply-To: <20060303215114.GA13893@linux.intel.com>
References: <32518.1141401780@warthog.cambridge.redhat.com>
	<1146.1141404346@warthog.cambridge.redhat.com>
	<5041.1141417027@warthog.cambridge.redhat.com>
	<20060303215114.GA13893@linux.intel.com>
Message-ID:

On Fri, 3 Mar 2006, Benjamin LaHaise wrote:
>
> Actually, no. At least in testing an implementation of Dekker's and
> Peterson's algorithms as a replacement for the locked operation in
> our spinlocks, it is absolutely necessary to have an sfence in the lock
> to ensure the lock is visible to the other CPU before proceeding.

I suspect you have some bug in your implementation. I think Dekker's
algorithm depends on the reads and writes being ordered, and you don't
seem to do that.

The thing is, you pretty much _have_ to be wrong, because the x86-64
memory ordering rules are _exactly_ the same as for x86, and we've had
that simple store as an unlock for a long long time.

		Linus

From torvalds at osdl.org Sat Mar 4 09:36:34 2006
From: torvalds at osdl.org (Linus Torvalds)
Date: Fri, 3 Mar 2006 14:36:34 -0800 (PST)
Subject: Memory barriers and spin_unlock safety
In-Reply-To:
References: <32518.1141401780@warthog.cambridge.redhat.com>
	<1146.1141404346@warthog.cambridge.redhat.com>
	<5041.1141417027@warthog.cambridge.redhat.com>
	<20060303215114.GA13893@linux.intel.com>
Message-ID:

On Fri, 3 Mar 2006, Linus Torvalds wrote:
>
> I suspect you have some bug in your implementation. I think Dekker's
> algorithm depends on the reads and writes being ordered, and you don't
> seem to do that.

IOW, I think you need a full memory barrier after the
	"lock->turn = cpu ^ 1;"
and you should have a "smp_rmb()" in between your reads of
	"lock->flags[cpu ^ 1]"
and
	"lock->turn"
to give the ordering that Dekker (or Peterson) expects.

IOW, the code should be something like

	lock->flags[other] = 1;
	smp_wmb();
	lock->turn = other
	smp_mb();
	while (lock->turn == cpu) {
		smp_rmb();
		if (!lock->flags[other])
			break;
	}

where the wmb's are no-ops on x86, but the rmb's certainly are not.

I _suspect_ that the fact that it starts working with an 'sfence' in there
somewhere is just because the sfence ends up being "serializing enough"
that it just happens to work, but that it has nothing to do with the
current kernel wmb() being wrong.

		Linus

From hollisb at us.ibm.com Sat Mar 4 08:18:13 2006
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Fri, 3 Mar 2006 15:18:13 -0600
Subject: Memory barriers and spin_unlock safety
In-Reply-To: <1141419966.3888.67.camel@localhost.localdomain>
References: <32518.1141401780@warthog.cambridge.redhat.com>
	<1141419966.3888.67.camel@localhost.localdomain>
Message-ID: <200603031518.15806.hollisb@us.ibm.com>

On Friday 03 March 2006 15:06, Benjamin Herrenschmidt wrote:
> The main problem I've had in the past with the ppc barriers is more a
> subtle thing in the spec that unfortunately was taken to the word by
> implementors, and is that the simple write barrier (eieio) will only
> order within the same storage space, that is will not order between
> cacheable and non-cacheable storage.

I've heard Sparc has the same issue... in which case it may not be a "chip
designer was too literal" thing, but rather it really simplifies chip
implementation to do it that way.

-- 
Hollis Blanchard
IBM Linux Technology Center

From paulus at samba.org Sat Mar 4 21:58:04 2006
From: paulus at samba.org (Paul Mackerras)
Date: Sat, 4 Mar 2006 21:58:04 +1100
Subject: Memory barriers and spin_unlock safety
In-Reply-To: <1141419966.3888.67.camel@localhost.localdomain>
References: <32518.1141401780@warthog.cambridge.redhat.com>
	<1141419966.3888.67.camel@localhost.localdomain>
Message-ID: <17417.29372.744064.211813@cargo.ozlabs.ibm.com>

Benjamin Herrenschmidt writes:

> Actually, the ppc's full barrier (sync) will generate bus traffic, and I
> think in some case eieio barriers can propagate to the chipset to
> enforce ordering there too depending on some voodoo settings and whether
> the storage space is cacheable or not.

Eieio has to go to the PCI host bridge because it is supposed to
prevent write-combining, both in the host bridge and in the CPU.

Paul.
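
[Editorial sketch, not part of the original thread.] The Peterson's-algorithm
exchange between Benjamin LaHaise and Linus Torvalds earlier in this thread
turns on one point: the stores to flags[] and turn must be globally visible
before the loads that follow them. The user-space sketch below illustrates
that requirement under stated assumptions. It uses C11 <stdatomic.h> and
pthreads rather than the kernel's barrier(), smp_mb() or read_pda(); it
relies on the default sequentially consistent ordering of atomic_store() and
atomic_load(), which is stronger than the smp_wmb()/smp_mb()/smp_rmb()
combination proposed in the thread; and the names peterson_lock, worker and
run_demo are hypothetical. Only the flags/turn field names come from the
posted patch.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Two-thread Peterson lock. Field names follow the patch in the thread;
 * everything else is an illustrative assumption. */
struct peterson_lock {
	atomic_int flags[2];	/* flags[i] != 0: thread i wants the lock */
	atomic_int turn;	/* id of the thread that must wait on a tie */
};

static struct peterson_lock demo_lock;	/* static storage: starts unlocked */
static long counter;			/* protected by demo_lock */
static int in_critical;			/* protected by demo_lock */

static void peterson_lock(struct peterson_lock *l, int self)
{
	int other = self ^ 1;

	/* Default atomic_store()/atomic_load() are sequentially consistent,
	 * so these stores cannot be reordered after the loads below. That
	 * store-to-load ordering is what a full memory barrier provides. */
	atomic_store(&l->flags[self], 1);
	atomic_store(&l->turn, other);

	/* Wait while the other thread wants the lock and it is our turn
	 * to yield. */
	while (atomic_load(&l->flags[other]) && atomic_load(&l->turn) == other)
		sched_yield();
}

static void peterson_unlock(struct peterson_lock *l, int self)
{
	atomic_store(&l->flags[self], 0);
}

struct worker_arg {
	int id;			/* 0 or 1 */
	long iterations;
};

static void *worker(void *p)
{
	struct worker_arg *arg = p;

	for (long i = 0; i < arg->iterations; i++) {
		peterson_lock(&demo_lock, arg->id);
		assert(in_critical == 0);	/* nobody else is inside */
		in_critical = 1;
		counter++;
		in_critical = 0;
		peterson_unlock(&demo_lock, arg->id);
	}
	return NULL;
}

/* Runs two workers and returns the final counter, or -1 if a thread
 * could not be created. */
long run_demo(long iterations)
{
	pthread_t threads[2];
	struct worker_arg args[2] = { { 0, iterations }, { 1, iterations } };
	int started;

	counter = 0;
	for (started = 0; started < 2; started++)
		if (pthread_create(&threads[started], NULL, worker,
				   &args[started]) != 0)
			break;
	for (int i = 0; i < started; i++)
		pthread_join(threads[i], NULL);
	return started == 2 ? counter : -1;
}
```

Calling run_demo(n) from a program linked with the pthread library should
return 2 * n when mutual exclusion holds. Weakening the stores to relaxed
ordering without adding a full fence before the loads would reintroduce the
store-to-load reordering that the thread is arguing about.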

From paulus at samba.org Sat Mar 4 21:58:06 2006
From: paulus at samba.org (Paul Mackerras)
Date: Sat, 4 Mar 2006 21:58:06 +1100
Subject: Memory barriers and spin_unlock safety
In-Reply-To:
References: <32518.1141401780@warthog.cambridge.redhat.com>
Message-ID: <17417.29375.87604.537434@cargo.ozlabs.ibm.com>

Linus Torvalds writes:

> PPC has an absolutely _horrible_ memory ordering implementation, as far as
> I can tell. The thing is broken. I think it's just implementation
> breakage, not anything really fundamental, but the fact that their write
> barriers are expensive is a big sign that they are doing something bad.

An smp_wmb() is just an eieio on PPC, which is pretty cheap. I made
wmb() be a sync though, because it seemed that there were drivers that
expected wmb() to provide an ordering between a write to memory and a
write to an MMIO register. If that is a bogus assumption then we
could make wmb() lighter-weight (after auditing all the drivers we're
interested in, of course, ...).

And in a subsequent message:

> If so, a simple write barrier should be sufficient. That's exactly what
> the x86 write barriers do too, ie stores to magic IO space are _not_
> ordered wrt a normal [smp_]wmb() (or, as per how this thread started, a
> spin_unlock()) at all.

By magic IO space, do you mean just any old memory-mapped device
register in a PCI device, or do you mean something else?

Paul.

From schwab at suse.de Sun Mar 5 01:03:52 2006
From: schwab at suse.de (Andreas Schwab)
Date: Sat, 04 Mar 2006 15:03:52 +0100
Subject: GigE on PowerMac G5
Message-ID:

I suppose the NIC in the PowerMac G5 can do GigE, yet when plugged into a
GB switch it is only willing to talk 100MB with it. Any idea why? Kernel
is 2.6.16-rc5-git2.

# lsprop /proc/device-tree/ht@0,f2000000/pci@6/ethernet@f
name               "ethernet"
linux,phandle      ff9c53d8
interrupt-parent   ff9779b0
gbit-phy
assigned-addresses 82047810 00000000 80400000 00000000 00200000
                   82047830 00000000 80300000 00000000 00100000
local-mac-address  00 0a 95 ba b8 70 .....p
stats              00000000 00000000 00000000 00000000 00000000
reg                00047800 00000000 00000000 00000000 00000000
                   02047810 00000000 00000000 00000000 00020000
                   02047830 00000000 00000000 00000000 00010000
max-frame-size     000005ee (1518)
address-bits       00000030 (48)
built-in
compatible         "K2-GMAC"
category           "net"
removable          "network"
network-type       "ethernet"
device_type        "network"
fast-back-to-back
devsel-speed       00000002
max-latency        00000040 (64)
min-grant          00000040 (64)
interrupts         00000029 00000001
class-code         00020000 (131072)
revision-id        00000000
device-id          0000004c (76)
vendor-id          0000106b (4203)
# ethtool eth0
Settings for eth0:
        Supported ports: [ TP MII ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Half 1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Half 1000baseT/Full
        Advertised auto-negotiation: No
        Speed: 100Mb/s
        Duplex: Full
        Port: MII
        PHYAD: 0
        Transceiver: external
        Auto-negotiation: on
        Supports Wake-on: g
        Wake-on: d
        Current message level: 0x00000007 (7)
        Link detected: yes

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab at suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

From schwab at suse.de Sun Mar 5 01:53:38 2006
From: schwab at suse.de (Andreas Schwab)
Date: Sat, 04 Mar 2006 15:53:38 +0100
Subject: GigE on PowerMac G5
Message-ID:

[Sorry for duplicate posting, I've used the wrong list address.]

I suppose the NIC in the PowerMac G5 can do GigE, yet when plugged into a
GB switch it is only willing to talk 100MB with it. Any idea why? Kernel
is 2.6.16-rc5-git2.

# lsprop /proc/device-tree/ht@0,f2000000/pci@6/ethernet@f
name               "ethernet"
linux,phandle      ff9c53d8
interrupt-parent   ff9779b0
gbit-phy
assigned-addresses 82047810 00000000 80400000 00000000 00200000
                   82047830 00000000 80300000 00000000 00100000
local-mac-address  00 0a 95 ba b8 70 .....p
stats              00000000 00000000 00000000 00000000 00000000
reg                00047800 00000000 00000000 00000000 00000000
                   02047810 00000000 00000000 00000000 00020000
                   02047830 00000000 00000000 00000000 00010000
max-frame-size     000005ee (1518)
address-bits       00000030 (48)
built-in
compatible         "K2-GMAC"
category           "net"
removable          "network"
network-type       "ethernet"
device_type        "network"
fast-back-to-back
devsel-speed       00000002
max-latency        00000040 (64)
min-grant          00000040 (64)
interrupts         00000029 00000001
class-code         00020000 (131072)
revision-id        00000000
device-id          0000004c (76)
vendor-id          0000106b (4203)
# ethtool eth0
Settings for eth0:
        Supported ports: [ TP MII ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Half 1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Half 1000baseT/Full
        Advertised auto-negotiation: No
        Speed: 100Mb/s
        Duplex: Full
        Port: MII
        PHYAD: 0
        Transceiver: external
        Auto-negotiation: on
        Supports Wake-on: g
        Wake-on: d
        Current message level: 0x00000007 (7)
        Link detected: yes

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab at suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

From torvalds at osdl.org Sun Mar 5 04:28:54 2006
From: torvalds at osdl.org (Linus Torvalds)
Date: Sat, 4 Mar 2006 09:28:54 -0800 (PST)
Subject: Memory barriers and spin_unlock safety
In-Reply-To: <17417.29375.87604.537434@cargo.ozlabs.ibm.com>
References: <32518.1141401780@warthog.cambridge.redhat.com>
	<17417.29375.87604.537434@cargo.ozlabs.ibm.com>
Message-ID:

On Sat, 4 Mar 2006, Paul Mackerras wrote:
>
> > If so, a simple write barrier should be sufficient. That's exactly what
> > the x86 write barriers do too, ie stores to magic IO space are _not_
> > ordered wrt a normal [smp_]wmb() (or, as per how this thread started, a
> > spin_unlock()) at all.
>
> By magic IO space, do you mean just any old memory-mapped device
> register in a PCI device, or do you mean something else?

Any old memory-mapped device that has been marked as write-combining in
the MTRR's or page tables.

So the rules from the PC side (and like it or not, they end up being
what all the drivers are tested with) are:

 - regular stores are ordered by write barriers
 - PIO stores are always synchronous
 - MMIO stores are ordered by IO semantics
 - PCI ordering must be honored:
    * write combining is only allowed on PCI memory resources
      that are marked prefetchable. If your host bridge does write
      combining in general, it's not a "host bridge", it's a "host
      disaster".
    * for others, writes can always be posted, but they cannot
      be re-ordered wrt either reads or writes to that device
      (ie a read will always be fully synchronizing)
 - io_wmb must be honored

In addition, it will help a hell of a lot if you follow the PC notion of
"per-region extra rules", ie you'd default to the non-prefetchable
behaviour even for areas that are prefetchable from a PCI standpoint, but
allow some way to relax the ordering rules in various ways.

PC's use MTRR's or page table hints for this, but it's actually perfectly
possible to do it by virtual address (ie decide on "ioremap()" time by
looking at some bits that you've saved away to remap it to a certain
virtual address range, and then use the virtual address as a hint for
readl/writel whether you need to serialize or not).

On x86, we already use the "virtual address" trick to distinguish between
PIO and MMIO for the newer ioread/iowrite interface (the older
inb/outb/readb/writeb interfaces obviously don't need that, since the IO
space is statically encoded in the function call itself).

The reason I mention the MTRR emulation is again just purely compatibility
with drivers that get 99.9% of all the testing on a PC platform.

		Linus

From benh at kernel.crashing.org Sun Mar 5 08:16:40 2006
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sun, 05 Mar 2006 08:16:40 +1100
Subject: GigE on PowerMac G5
In-Reply-To:
References:
Message-ID: <1141507000.17127.4.camel@localhost.localdomain>

On Sat, 2006-03-04 at 15:53 +0100, Andreas Schwab wrote:
> [Sorry for duplicate posting, I've used the wrong list address.]
>
> I suppose the NIC in the PowerMac G5 can do GigE, yet when plugged into a
> GB switch it is only willing to talk 100MB with it. Any idea why? Kernel
> is 2.6.16-rc5-git2.

Works for me... Must be a problem with auto-neg and your switch, or the
cable.... Can you check how the switch is configured maybe ? You can
also try forcing the link speed with ethtool.

Ben.

> # lsprop /proc/device-tree/ht@0,f2000000/pci@6/ethernet@f
> name               "ethernet"
> linux,phandle      ff9c53d8
> interrupt-parent   ff9779b0
> gbit-phy
> assigned-addresses 82047810 00000000 80400000 00000000 00200000
>                    82047830 00000000 80300000 00000000 00100000
> local-mac-address  00 0a 95 ba b8 70 .....p
> stats              00000000 00000000 00000000 00000000 00000000
> reg                00047800 00000000 00000000 00000000 00000000
>                    02047810 00000000 00000000 00000000 00020000
>                    02047830 00000000 00000000 00000000 00010000
> max-frame-size     000005ee (1518)
> address-bits       00000030 (48)
> built-in
> compatible         "K2-GMAC"
> category           "net"
> removable          "network"
> network-type       "ethernet"
> device_type        "network"
> fast-back-to-back
> devsel-speed       00000002
> max-latency        00000040 (64)
> min-grant          00000040 (64)
> interrupts         00000029 00000001
> class-code         00020000 (131072)
> revision-id        00000000
> device-id          0000004c (76)
> vendor-id          0000106b (4203)
> # ethtool eth0
> Settings for eth0:
>         Supported ports: [ TP MII ]
>         Supported link modes:   10baseT/Half 10baseT/Full
>                                 100baseT/Half 100baseT/Full
>                                 1000baseT/Half 1000baseT/Full
>         Supports auto-negotiation: Yes
>         Advertised link modes:  10baseT/Half 10baseT/Full
>                                 100baseT/Half 100baseT/Full
>                                 1000baseT/Half 1000baseT/Full
>         Advertised auto-negotiation: No
>         Speed: 100Mb/s
>         Duplex: Full
>         Port: MII
>         PHYAD: 0
>         Transceiver: external
>         Auto-negotiation: on
>         Supports Wake-on: g
>         Wake-on: d
>         Current message level: 0x00000007 (7)
>         Link detected: yes
>
> Andreas.
>

From benh at kernel.crashing.org Sun Mar 5 09:49:53 2006
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sun, 05 Mar 2006 09:49:53 +1100
Subject: Memory barriers and spin_unlock safety
In-Reply-To: <17417.29372.744064.211813@cargo.ozlabs.ibm.com>
References: <32518.1141401780@warthog.cambridge.redhat.com>
	<1141419966.3888.67.camel@localhost.localdomain>
	<17417.29372.744064.211813@cargo.ozlabs.ibm.com>
Message-ID: <1141512594.17127.16.camel@localhost.localdomain>

On Sat, 2006-03-04 at 21:58 +1100, Paul Mackerras wrote:
> Benjamin Herrenschmidt writes:
>
> > Actually, the ppc's full barrier (sync) will generate bus traffic, and I
> > think in some case eieio barriers can propagate to the chipset to
> > enforce ordering there too depending on some voodoo settings and whether
> > the storage space is cacheable or not.
>
> Eieio has to go to the PCI host bridge because it is supposed to
> prevent write-combining, both in the host bridge and in the CPU.

That can be disabled with HID bits tho ;)

Ben.

From paulus at samba.org Sun Mar 5 10:36:18 2006
From: paulus at samba.org (Paul Mackerras)
Date: Sun, 5 Mar 2006 10:36:18 +1100
Subject: GigE on PowerMac G5
In-Reply-To:
References:
Message-ID: <17418.9330.763002.180595@cargo.ozlabs.ibm.com>

Andreas Schwab writes:

> I suppose the NIC in the PowerMac G5 can do GigE, yet when plugged into a
> GB switch it is only willing to talk 100MB with it. Any idea why? Kernel
> is 2.6.16-rc5-git2.

It does 1000Mb/s here...

# ethtool eth0
Settings for eth0:
        Supported ports: [ TP MII ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Half 1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Half 1000baseT/Full
        Advertised auto-negotiation: No
        Speed: 1000Mb/s
        Duplex: Full
        Port: MII
        PHYAD: 0
        Transceiver: external
        Auto-negotiation: on
        Supports Wake-on: g
        Wake-on: d
        Current message level: 0x00000007 (7)
        Link detected: yes

Paul.

From mbuesch at freenet.de Sun Mar 5 13:04:40 2006
From: mbuesch at freenet.de (Michael Buesch)
Date: Sun, 5 Mar 2006 03:04:40 +0100
Subject: Memory barriers and spin_unlock safety
In-Reply-To: <17417.29375.87604.537434@cargo.ozlabs.ibm.com>
References: <32518.1141401780@warthog.cambridge.redhat.com>
	<17417.29375.87604.537434@cargo.ozlabs.ibm.com>
Message-ID: <200603050304.41436.mbuesch@freenet.de>

On Saturday 04 March 2006 11:58, you wrote:
> Linus Torvalds writes:
>
> > PPC has an absolutely _horrible_ memory ordering implementation, as far as
> > I can tell. The thing is broken. I think it's just implementation
> > breakage, not anything really fundamental, but the fact that their write
> > barriers are expensive is a big sign that they are doing something bad.
>
> An smp_wmb() is just an eieio on PPC, which is pretty cheap. I made
> wmb() be a sync though, because it seemed that there were drivers that
> expected wmb() to provide an ordering between a write to memory and a
> write to an MMIO register. If that is a bogus assumption then we
> could make wmb() lighter-weight (after auditing all the drivers we're
> interested in, of course, ...).

In the bcm43xx driver there is code which looks like the following:

	/* Write some coherent DMA memory */
	wmb();
	/* Write MMIO, which depends on the DMA memory
	 * write to be finished.
	 */

Are the assumptions in this code correct? Is wmb() the correct thing
to do here?
I heavily tested this code on PPC UP and did not see any anomaly, yet.

-- 
Greetings Michael.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20060305/578694ae/attachment.pgp

From david at gibson.dropbear.id.au Mon Mar 6 12:51:29 2006
From: david at gibson.dropbear.id.au (David Gibson)
Date: Mon, 6 Mar 2006 12:51:29 +1100
Subject: powerpc: Make pmd_bad() and pud_bad() checks non-trivial
Message-ID: <20060306015129.GA21408@localhost.localdomain>

Paulus, please apply.

At present, the powerpc pmd_bad() and pud_bad() macros return false
unless the given pmd or pud is zero. This patch makes these tests
more thorough, checking if the given pmd or pud looks like a plausible
pte page or pmd page pointer respectively. This can result in helpful
error messages when messing with the pagetable code.

Signed-off-by: David Gibson

Index: working-2.6/include/asm-powerpc/pgtable.h
===================================================================
--- working-2.6.orig/include/asm-powerpc/pgtable.h	2006-03-06 11:38:45.000000000 +1100
+++ working-2.6/include/asm-powerpc/pgtable.h	2006-03-06 12:51:14.000000000 +1100
@@ -188,9 +188,13 @@ static inline pte_t pfn_pte(unsigned lon
 #define pte_pfn(x)		((unsigned long)((pte_val(x)>>PTE_RPN_SHIFT)))
 #define pte_page(x)		pfn_to_page(pte_pfn(x))
 
+#define PMD_BAD_BITS		(PTE_TABLE_SIZE-1)
+#define PUD_BAD_BITS		(PMD_TABLE_SIZE-1)
+
 #define pmd_set(pmdp, pmdval)	(pmd_val(*(pmdp)) = (pmdval))
 #define pmd_none(pmd)		(!pmd_val(pmd))
-#define pmd_bad(pmd)		(pmd_val(pmd) == 0)
+#define pmd_bad(pmd)		(!is_kernel_addr(pmd_val(pmd)) \
+				 || (pmd_val(pmd) & PMD_BAD_BITS))
 #define pmd_present(pmd)	(pmd_val(pmd) != 0)
 #define pmd_clear(pmdp)		(pmd_val(*(pmdp)) = 0)
 #define pmd_page_kernel(pmd)	(pmd_val(pmd) & ~PMD_MASKED_BITS)
@@ -198,7 +202,8 @@ static inline pte_t pfn_pte(unsigned lon
 
 #define pud_set(pudp, pudval)	(pud_val(*(pudp)) = (pudval))
 #define pud_none(pud)		(!pud_val(pud))
-#define pud_bad(pud)		((pud_val(pud)) == 0)
+#define pud_bad(pud)		(!is_kernel_addr(pud_val(pud)) \
+				 || (pud_val(pud) & PUD_BAD_BITS))
 #define pud_present(pud)	(pud_val(pud) != 0)
 #define pud_clear(pudp)		(pud_val(*(pudp)) = 0)
 #define pud_page(pud)		(pud_val(pud) & ~PUD_MASKED_BITS)

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you. NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

From michael at ellerman.id.au Mon Mar 6 13:29:07 2006
From: michael at ellerman.id.au (Michael Ellerman)
Date: Mon, 6 Mar 2006 13:29:07 +1100
Subject: dead hvc_console with kdump kernel
In-Reply-To:
References:
Message-ID: <200603061329.08262.michael@ellerman.id.au>

On Fri, 24 Feb 2006 07:41, Ryan Arnold wrote:
> If interrupts end up being disabled by the kexec call and you still need
> the console you could try to find a way to set hp->irq = NO_IRQ in this
> case such that the khvcd thread is continually rescheduled to poll the
> hypervisor buffer and never sleeps indefinitely, as via the interrupt
> driven method.

Still not sure what's going on here, some interrupt weirdness.
This patch serves as a workaround for the moment:

Index: kdump/drivers/char/hvc_console.c
===================================================================
--- kdump.orig/drivers/char/hvc_console.c	2006-03-06 12:19:42.000000000 +1100
+++ kdump/drivers/char/hvc_console.c	2006-03-06 12:22:32.000000000 +1100
@@ -591,10 +591,12 @@ static int hvc_poll(struct hvc_struct *h
 	if (test_bit(TTY_THROTTLED, &tty->flags))
 		goto throttled;
 
+#ifndef CONFIG_CRASH_DUMP
 	/* If we aren't interrupt driven and aren't throttled, we always
 	 * request a reschedule
 	 */
 	if (hp->irq == NO_IRQ)
+#endif
 		poll_mask |= HVC_POLL_READ;
 
 	/* Read data if any */

From michael at ellerman.id.au Mon Mar 6 17:26:30 2006
From: michael at ellerman.id.au (Michael Ellerman)
Date: Mon, 6 Mar 2006 17:26:30 +1100
Subject: dead hvc_console with kdump kernel
In-Reply-To: <200603061329.08262.michael@ellerman.id.au>
References: <200603061329.08262.michael@ellerman.id.au>
Message-ID: <200603061726.35175.michael@ellerman.id.au>

On Mon, 6 Mar 2006 13:29, Michael Ellerman wrote:
> On Fri, 24 Feb 2006 07:41, Ryan Arnold wrote:
> > If interrupts end up being disabled by the kexec call and you still need
> > the console you could try to find a way to set hp->irq = NO_IRQ in this
> > case such that the khvcd thread is continually rescheduled to poll the
> > hypervisor buffer and never sleeps indefinitely, as via the interrupt
> > driven method.
>
> Still not sure what's going on here, some interrupt weirdness.

I'm stuck on this, when we switch to the kdump kernel we just stop getting
interrupts for the console. We never see them in xics_get_irq(), but
everything else seems to be working dandy. :(

-- 
Michael Ellerman
IBM OzLabs

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20060306/9f3bc794/attachment.pgp

From schwab at suse.de Mon Mar 6 21:40:59 2006
From: schwab at suse.de (Andreas Schwab)
Date: Mon, 06 Mar 2006 11:40:59 +0100
Subject: GigE on PowerMac G5
In-Reply-To: <1141507000.17127.4.camel@localhost.localdomain> (Benjamin
	Herrenschmidt's message of "Sun, 05 Mar 2006 08:16:40 +1100")
References: <1141507000.17127.4.camel@localhost.localdomain>
Message-ID:

Benjamin Herrenschmidt writes:

> On Sat, 2006-03-04 at 15:53 +0100, Andreas Schwab wrote:
>> I suppose the NIC in the PowerMac G5 can do GigE, yet when plugged into a
>> GB switch it is only willing to talk 100MB with it. Any idea why? Kernel
>> is 2.6.16-rc5-git2.
>
> Works for me... Must be a problem with auto-neg and your switch, or the
> cable.... Can you check how the switch is configured maybe ? You can
> also try forcing the link speed with ethtool.

It's not the cable, I have swapped it with another system where Gb is
working fine. Neither it's the switch port, I have swapped it too. I
can't force the speed with ethtool either. Any other idea what to look
for?

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab at suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

From benh at kernel.crashing.org Tue Mar 7 00:15:06 2006
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Tue, 07 Mar 2006 00:15:06 +1100
Subject: GigE on PowerMac G5
In-Reply-To:
References: <1141507000.17127.4.camel@localhost.localdomain>
Message-ID: <1141650907.11221.61.camel@localhost.localdomain>

On Mon, 2006-03-06 at 11:40 +0100, Andreas Schwab wrote:
> Benjamin Herrenschmidt writes:
>
> > On Sat, 2006-03-04 at 15:53 +0100, Andreas Schwab wrote:
> >> I suppose the NIC in the PowerMac G5 can do GigE, yet when plugged into a
> >> GB switch it is only willing to talk 100MB with it. Any idea why? Kernel
> >> is 2.6.16-rc5-git2.
> >
> > Works for me... Must be a problem with auto-neg and your switch, or the
> > cable.... Can you check how the switch is configured maybe ? You can
> > also try forcing the link speed with ethtool.
>
> It's not the cable, I have swapped it with another system where Gb is
> working fine. Neither it's the switch port, I have swapped it too. I
> can't force the speed with ethtool either. Any other idea what to look
> for?

At this point, all I can say is... does it work in OS X ?

Ben.

From olh at suse.de Tue Mar 7 06:38:17 2006
From: olh at suse.de (Olaf Hering)
Date: Mon, 6 Mar 2006 20:38:17 +0100
Subject: [PATCH] change compat shmget size arg to signed
In-Reply-To: <20060224111242.08f14bd9.sfr@canb.auug.org.au>
References: <20060224101644.548b0c24.sfr@canb.auug.org.au>
	<20060223232717.GB29454@suse.de>
	<20060224111242.08f14bd9.sfr@canb.auug.org.au>
Message-ID: <20060306193817.GA3214@suse.de>

On Fri, Feb 24, Stephen Rothwell wrote:

> On Fri, 24 Feb 2006 00:27:17 +0100 Olaf Hering wrote:
> >
> > On Fri, Feb 24, Stephen Rothwell wrote:
> >
> > > Does the ltp test fail on a standard kernel(where SHMMAX is 0x2000000), or
> > > only on a SLES kernel (where SHMMAX is ULONG_MAX)?
> >
> > It fails with SLES9 and SLES10. SLES9 has 0x2000000 as default.
>
> So what was shm_ctlmax set to when the test was run.
>
> I am trying to figure out why this test:
>
> 	if (size < SHMMIN || size > shm_ctlmax)
> 		return -EINVAL;
>
> Doesn't return -EINVAL for size == 0xffffffff if shm_ctlmax is 0x2000000?

shm_ctlmax is a sysctl, so it can have anything. The ltp test is invalid.
shmget02 does not fail after:
echo $(( 0x2000000 )) > /proc/sys/kernel/shmmax

From johnrose at austin.ibm.com Tue Mar 7 12:03:28 2006
From: johnrose at austin.ibm.com (John Rose)
Date: Mon, 06 Mar 2006 19:03:28 -0600
Subject: [PATCH 1/3] cleanup PCI Host Bridge setup
Message-ID: <1141693408.8166.16.camel@sinatra.austin.ibm.com>

Since setup_phb() and pci_process_bridge_OF_ranges() are always called
together, and since the latter falls under the category of "setup", move
the latter into the former.

Thanks-
John

Signed-off-by: John Rose

diff -puN arch/powerpc/kernel/rtas_pci.c~cleanup_phb_setup arch/powerpc/kernel/rtas_pci.c
--- 2_6_p5/arch/powerpc/kernel/rtas_pci.c~cleanup_phb_setup	2006-03-03 15:42:35.000000000 -0600
+++ 2_6_p5-johnrose/arch/powerpc/kernel/rtas_pci.c	2006-03-06 17:23:50.000000000 -0600
@@ -292,6 +292,8 @@ static int __devinit setup_phb(struct de
 	phb->ops = &rtas_pci_ops;
 	phb->buid = get_phb_buid(dev);
 
+	pci_process_bridge_OF_ranges(phb, dev, 0);
+
 	return 0;
 }
 
@@ -323,7 +325,6 @@ unsigned long __init find_and_init_phbs(
 		if (!phb)
 			continue;
 		setup_phb(node, phb);
-		pci_process_bridge_OF_ranges(phb, node, 0);
 		pci_setup_phb_io(phb, index == 0);
 #ifdef CONFIG_PPC_PSERIES
 		/* XXX This code need serious fixing ... --BenH */
@@ -369,7 +370,6 @@ struct pci_controller * __devinit init_p
 	if (!phb)
 		return NULL;
 	setup_phb(dn, phb);
-	pci_process_bridge_OF_ranges(phb, dn, primary);
 
 	pci_setup_phb_io_dynamic(phb, primary);
 
_

From johnrose at austin.ibm.com Tue Mar 7 12:03:58 2006
From: johnrose at austin.ibm.com (John Rose)
Date: Mon, 06 Mar 2006 19:03:58 -0600
Subject: [PATCH 2/3] move init_phb_dynamic() to pseries
Message-ID: <1141693438.8166.20.camel@sinatra.austin.ibm.com>

Since init_phb_dynamic() only comes into play during dynamic partitioning
on POWER systems, move it to pseries-specific file. This is also necessary
for the addition of some pseries-specific fixups during PHB creation.

Thanks-
John

Signed-off-by: John Rose

diff -puN arch/powerpc/kernel/rtas_pci.c~move_init_phb_dyn arch/powerpc/kernel/rtas_pci.c
--- 2_6_p5/arch/powerpc/kernel/rtas_pci.c~move_init_phb_dyn	2006-03-03 15:43:03.000000000 -0600
+++ 2_6_p5-johnrose/arch/powerpc/kernel/rtas_pci.c	2006-03-03 15:43:03.000000000 -0600
@@ -280,8 +280,7 @@ static int phb_set_bus_ranges(struct dev
 	return 0;
 }
 
-static int __devinit setup_phb(struct device_node *dev,
-			       struct pci_controller *phb)
+int __devinit setup_phb(struct device_node *dev, struct pci_controller *phb)
 {
 	if (is_python(dev))
 		python_countermeasures(dev);
@@ -360,26 +359,6 @@ unsigned long __init find_and_init_phbs(
 	return 0;
 }
 
-struct pci_controller * __devinit init_phb_dynamic(struct device_node *dn)
-{
-	struct pci_controller *phb;
-	int primary;
-
-	primary = list_empty(&hose_list);
-	phb = pcibios_alloc_controller(dn);
-	if (!phb)
-		return NULL;
-	setup_phb(dn, phb);
-
-	pci_setup_phb_io_dynamic(phb, primary);
-
-	pci_devs_phb_init_dynamic(phb);
-	scan_phb(phb);
-
-	return phb;
-}
-EXPORT_SYMBOL(init_phb_dynamic);
-
 /* RPA-specific bits for removing PHBs */
 int pcibios_remove_root_bus(struct pci_controller *phb)
 {
diff -puN arch/powerpc/kernel/pci_64.c~move_init_phb_dyn arch/powerpc/kernel/pci_64.c
diff -puN include/asm-powerpc/ppc-pci.h~move_init_phb_dyn include/asm-powerpc/ppc-pci.h
--- 2_6_p5/include/asm-powerpc/ppc-pci.h~move_init_phb_dyn	2006-03-03 15:43:03.000000000 -0600
+++ 2_6_p5-johnrose/include/asm-powerpc/ppc-pci.h	2006-03-03 15:43:03.000000000 -0600
@@ -38,6 +38,7 @@ void *traverse_pci_devices(struct device
 
 void pci_devs_phb_init(void);
 void pci_devs_phb_init_dynamic(struct pci_controller *phb);
+int setup_phb(struct device_node *dev, struct pci_controller *phb);
 void __devinit scan_phb(struct pci_controller *hose);
 
 /* From rtas_pci.h */
diff -puN arch/powerpc/platforms/pseries/pci_dlpar.c~move_init_phb_dyn arch/powerpc/platforms/pseries/pci_dlpar.c
--- 2_6_p5/arch/powerpc/platforms/pseries/pci_dlpar.c~move_init_phb_dyn	2006-03-03 15:43:03.000000000 -0600
+++ 2_6_p5-johnrose/arch/powerpc/platforms/pseries/pci_dlpar.c	2006-03-06 17:23:46.000000000 -0600
@@ -27,6 +27,7 @@
 
 #include
 #include
+#include
 
 static struct pci_bus *
 find_bus_among_children(struct pci_bus *bus,
@@ -179,3 +180,23 @@ pcibios_add_pci_devices(struct pci_bus *
 	}
 }
 EXPORT_SYMBOL_GPL(pcibios_add_pci_devices);
+
+struct pci_controller * __devinit init_phb_dynamic(struct device_node *dn)
+{
+	struct pci_controller *phb;
+	int primary;
+
+	primary = list_empty(&hose_list);
+	phb = pcibios_alloc_controller(dn);
+	if (!phb)
+		return NULL;
+	setup_phb(dn, phb);
+
+	pci_setup_phb_io_dynamic(phb, primary);
+
+	pci_devs_phb_init_dynamic(phb);
+	scan_phb(phb);
+
+	return phb;
+}
+EXPORT_SYMBOL_GPL(init_phb_dynamic);
_

From johnrose at austin.ibm.com Tue Mar 7 12:04:25 2006
From: johnrose at austin.ibm.com (John Rose)
Date: Mon, 06 Mar 2006 19:04:25 -0600
Subject: [PATCH 3/3] properly configure DDR/P5IOC children devs
Message-ID: <1141693465.8166.22.camel@sinatra.austin.ibm.com>

The dynamic add path for PCI Host Bridges can fail to configure children
adapters under P5IOC controllers. It fails to properly fixup bus/device
resources, and it fails to properly enable EEH.
Both of these steps need to occur before any children devices are enabled in pci_bus_add_devices(). This fix has been tested for P5IOC and non-P5IOC slots. Thanks- John Signed-off-by: John Rose diff -puN arch/powerpc/kernel/pci_64.c~fixup_phb_devs arch/powerpc/kernel/pci_64.c --- 2_6_p5/arch/powerpc/kernel/pci_64.c~fixup_phb_devs 2006-03-03 15:43:38.000000000 -0600 +++ 2_6_p5-johnrose/arch/powerpc/kernel/pci_64.c 2006-03-03 15:43:39.000000000 -0600 @@ -589,7 +589,6 @@ void __devinit scan_phb(struct pci_contr #endif /* CONFIG_PPC_MULTIPLATFORM */ if (mode == PCI_PROBE_NORMAL) hose->last_busno = bus->subordinate = pci_scan_child_bus(bus); - pci_bus_add_devices(bus); } static int __init pcibios_init(void) @@ -608,8 +607,10 @@ static int __init pcibios_init(void) printk("PCI: Probing PCI hardware\n"); /* Scan all of the recorded PCI controllers. */ - list_for_each_entry_safe(hose, tmp, &hose_list, list_node) + list_for_each_entry_safe(hose, tmp, &hose_list, list_node) { scan_phb(hose); + pci_bus_add_devices(hose->bus); + } #ifndef CONFIG_PPC_ISERIES if (pci_probe_only) diff -puN arch/powerpc/platforms/pseries/pci_dlpar.c~fixup_phb_devs arch/powerpc/platforms/pseries/pci_dlpar.c --- 2_6_p5/arch/powerpc/platforms/pseries/pci_dlpar.c~fixup_phb_devs 2006-03-03 15:44:04.000000000 -0600 +++ 2_6_p5-johnrose/arch/powerpc/platforms/pseries/pci_dlpar.c 2006-03-03 15:46:25.000000000 -0600 @@ -195,7 +195,13 @@ struct pci_controller * __devinit init_p pci_setup_phb_io_dynamic(phb, primary); pci_devs_phb_init_dynamic(phb); + + if (dn->child) + eeh_add_device_tree_early(dn); + scan_phb(phb); + pcibios_fixup_new_pci_devices(phb->bus, 0); + pci_bus_add_devices(phb->bus); return phb; } _ From schwab at suse.de Tue Mar 7 23:53:21 2006 From: schwab at suse.de (Andreas Schwab) Date: Tue, 07 Mar 2006 13:53:21 +0100 Subject: GigE on PowerMac G5 In-Reply-To: <1141650907.11221.61.camel@localhost.localdomain> (Benjamin Herrenschmidt's message of "Tue, 07 Mar 2006 00:15:06 +1100") References: 
 <1141507000.17127.4.camel@localhost.localdomain>
 <1141650907.11221.61.camel@localhost.localdomain>
Message-ID:

Benjamin Herrenschmidt writes:

> At this point, all I can say is... does it work in OS X ?

Strange, OS X can't do it either. Looks like I have a hardware problem.

Andreas.

--
Andreas Schwab, SuSE Labs, schwab at suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

From dhowells at redhat.com Wed Mar 8 04:36:59 2006
From: dhowells at redhat.com (David Howells)
Date: Tue, 07 Mar 2006 17:36:59 +0000
Subject: Memory barriers and spin_unlock safety
In-Reply-To: <5041.1141417027@warthog.cambridge.redhat.com>
References: <5041.1141417027@warthog.cambridge.redhat.com>
 <32518.1141401780@warthog.cambridge.redhat.com>
 <1146.1141404346@warthog.cambridge.redhat.com>
Message-ID: <31420.1141753019@warthog.cambridge.redhat.com>

David Howells wrote:

> I suspect, then, that x86_64 should not have an SFENCE for smp_wmb(), and
> that only io_wmb() should have that.

Hmmm... We don't actually have io_wmb()... Should the following be added to
all archs?

        io_mb()
        io_rmb()
        io_wmb()

David

From dhowells at redhat.com Wed Mar 8 04:40:45 2006
From: dhowells at redhat.com (David Howells)
Date: Tue, 07 Mar 2006 17:40:45 +0000
Subject: [PATCH] Document Linux's memory barriers
Message-ID: <31492.1141753245@warthog.cambridge.redhat.com>

The attached patch documents the Linux kernel's memory barriers.
Signed-Off-By: David Howells --- warthog>diffstat -p1 mb.diff Documentation/memory-barriers.txt | 359 ++++++++++++++++++++++++++++++++++++++ 1 files changed, 359 insertions(+) diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt new file mode 100644 index 0000000..c2fc51b --- /dev/null +++ b/Documentation/memory-barriers.txt @@ -0,0 +1,359 @@ + ============================ + LINUX KERNEL MEMORY BARRIERS + ============================ + +Contents: + + (*) What are memory barriers? + + (*) Linux kernel memory barrier functions. + + (*) Implied kernel memory barriers. + + (*) i386 and x86_64 arch specific notes. + + +========================= +WHAT ARE MEMORY BARRIERS? +========================= + +Memory barriers are instructions to both the compiler and the CPU to impose a +partial ordering between the memory access operations specified either side of +the barrier. + +Older and less complex CPUs will perform memory accesses in exactly the order +specified, so if one is given the following piece of code: + + a = *A; + *B = b; + c = *C; + d = *D; + *E = e; + +It can be guaranteed that it will complete the memory access for each +instruction before moving on to the next line, leading to a definite sequence +of operations on the bus: + + read *A, write *B, read *C, read *D, write *E. + +However, with newer and more complex CPUs, this isn't always true because: + + (*) they can rearrange the order of the memory accesses to promote better use + of the CPU buses and caches; + + (*) reads are synchronous and may need to be done immediately to permit + progress, whereas writes can often be deferred without a problem; + + (*) and they are able to combine reads and writes to improve performance when + talking to the SDRAM (modern SDRAM chips can do batched accesses of + adjacent locations, cutting down on transaction setup costs). 
+ +So what you might actually get from the above piece of code is: + + read *A, read *C+*D, write *E, write *B + +Under normal operation, this is probably not going to be a problem; however, +there are two circumstances where it definitely _can_ be a problem: + + (1) I/O + + Many I/O devices can be memory mapped, and so appear to the CPU as if + they're just memory locations. However, to control the device, the driver + has to make the right accesses in exactly the right order. + + Consider, for example, an ethernet chipset such as the AMD PCnet32. It + presents to the CPU an "address register" and a bunch of "data registers". + The way it's accessed is to write the index of the internal register you + want to access to the address register, and then read or write the + appropriate data register to access the chip's internal register: + + *ADR = ctl_reg_3; + reg = *DATA; + + The problem with a clever CPU or a clever compiler is that the write to + the address register isn't guaranteed to happen before the access to the + data register, if the CPU or the compiler thinks it is more efficient to + defer the address write: + + read *DATA, write *ADR + + then things will break. + + The way to deal with this is to insert an I/O memory barrier between the + two accesses: + + *ADR = ctl_reg_3; + mb(); + reg = *DATA; + + In this case, the barrier makes a guarantee that all memory accesses + before the barrier will happen before all the memory accesses after the + barrier. It does _not_ guarantee that all memory accesses before the + barrier will be complete by the time the barrier is complete. + + (2) Multiprocessor interaction + + When there's a system with more than one processor, these may be working + on the same set of data, but attempting not to use locks as locks are + quite expensive. This means that accesses that affect both CPUs may have + to be carefully ordered to prevent error. + + Consider the R/W semaphore slow path. 
In that, a waiting process is + queued on the semaphore, as noted by it having a record on its stack + linked to the semaphore's list: + + struct rw_semaphore { + ... + struct list_head waiters; + }; + + struct rwsem_waiter { + struct list_head list; + struct task_struct *task; + }; + + To wake up the waiter, the up_read() or up_write() functions have to read + the pointer from this record to know as to where the next waiter record + is, clear the task pointer, call wake_up_process() on the task, and + release the task struct reference held: + + READ waiter->list.next; + READ waiter->task; + WRITE waiter->task; + CALL wakeup + RELEASE task + + If any of these steps occur out of order, then the whole thing may fail. + + Note that the waiter does not get the semaphore lock again - it just waits + for its task pointer to be cleared. Since the record is on its stack, this + means that if the task pointer is cleared _before_ the next pointer in the + list is read, then another CPU might start processing the waiter and it + might clobber its stack before up*() functions have a chance to read the + next pointer. + + CPU 0 CPU 1 + =============================== =============================== + down_xxx() + Queue waiter + Sleep + up_yyy() + READ waiter->task; + WRITE waiter->task; + + Resume processing + down_xxx() returns + call foo() + foo() clobbers *waiter + + READ waiter->list.next; + --- OOPS --- + + This could be dealt with using a spinlock, but then the down_xxx() + function has to get the spinlock again after it's been woken up, which is + a waste of resources. + + The way to deal with this is to insert an SMP memory barrier: + + READ waiter->list.next; + READ waiter->task; + smp_mb(); + WRITE waiter->task; + CALL wakeup + RELEASE task + + In this case, the barrier makes a guarantee that all memory accesses + before the barrier will happen before all the memory accesses after the + barrier. 
It does _not_ guarantee that all memory accesses before the + barrier will be complete by the time the barrier is complete. + + SMP memory barriers are normally no-ops on a UP system because the CPU + orders overlapping accesses with respect to itself. + + +===================================== +LINUX KERNEL MEMORY BARRIER FUNCTIONS +===================================== + +The Linux kernel has six basic memory barriers: + + MANDATORY (I/O) SMP + =============== ================ + GENERAL mb() smp_mb() + READ rmb() smp_rmb() + WRITE wmb() smp_wmb() + +General memory barriers make a guarantee that all memory accesses specified +before the barrier will happen before all memory accesses specified after the +barrier. + +Read memory barriers make a guarantee that all memory reads specified before +the barrier will happen before all memory reads specified after the barrier. + +Write memory barriers make a guarantee that all memory writes specified before +the barrier will happen before all memory writes specified after the barrier. + +SMP memory barriers are no-ops on uniprocessor compiled systems because it is +assumed that a CPU will be self-consistent, and will order overlapping accesses +with respect to itself. + +There is no guarantee that any of the memory accesses specified before a memory +barrier will be complete by the completion of a memory barrier; the barrier can +be considered to draw a line in the access queue that accesses of the +appropriate type may not cross. + +There is no guarantee that issuing a memory barrier on one CPU will have any +direct effect on another CPU or any other hardware in the system. The indirect +effect will be the order the first CPU commits its accesses to the bus. + +Note that these are the _minimum_ guarantees. Different architectures may give +more substantial guarantees, but they may not be relied upon outside of arch +specific code. 
+ + +There are some more advanced barriering functions: + + (*) set_mb(var, value) + (*) set_wmb(var, value) + + These assign the value to the variable and then insert at least a write + barrier after it, depending on the function. + + +============================== +IMPLIED KERNEL MEMORY BARRIERS +============================== + +Some of the other functions in the linux kernel imply memory barriers. For +instance all the following (pseudo-)locking functions imply barriers. + + (*) interrupt disablement and/or interrupts + (*) spin locks + (*) R/W spin locks + (*) mutexes + (*) semaphores + (*) R/W semaphores + +In all cases there are variants on a LOCK operation and an UNLOCK operation. + + (*) LOCK operation implication: + + Memory accesses issued after the LOCK will be completed after the LOCK + accesses have completed. + + Memory accesses issued before the LOCK may be completed after the LOCK + accesses have completed. + + (*) UNLOCK operation implication: + + Memory accesses issued before the UNLOCK will be completed before the + UNLOCK accesses have completed. + + Memory accesses issued after the UNLOCK may be completed before the UNLOCK + accesses have completed. + + (*) LOCK vs UNLOCK implication: + + The LOCK accesses will be completed before the unlock accesses. + +Locks and semaphores may not provide any guarantee of ordering on UP compiled +systems, and so can't be counted on in such a situation to actually do +anything at all, especially with respect to I/O memory barriering. + +Either interrupt disablement (LOCK) and enablement (UNLOCK) will barrier +memory and I/O accesses individually, or interrupt handling will barrier +memory and I/O accesses on entry and on exit. This prevents an interrupt +routine interfering with accesses made in a disabled-interrupt section of code +and vice versa. 
+ +This specification is a _minimum_ guarantee; any particular architecture may +provide more substantial guarantees, but these may not be relied upon outside +of arch specific code. + + +As an example, consider the following: + + *A = a; + *B = b; + LOCK + *C = c; + *D = d; + UNLOCK + *E = e; + *F = f; + +The following sequence of events on the bus is acceptable: + + LOCK, *F+*A, *E, *C+*D, *B, UNLOCK + +But none of the following are: + + *F+*A, *B, LOCK, *C, *D, UNLOCK, *E + *A, *B, *C, LOCK, *D, UNLOCK, *E, *F + *A, *B, LOCK, *C, UNLOCK, *D, *E, *F + *B, LOCK, *C, *D, UNLOCK, *F+*A, *E + + +Consider also the following (going back to the AMD PCnet example): + + DISABLE IRQ + *ADR = ctl_reg_3; + mb(); + x = *DATA; + *ADR = ctl_reg_4; + mb(); + *DATA = y; + *ADR = ctl_reg_5; + mb(); + z = *DATA; + ENABLE IRQ + + *ADR = ctl_reg_7; + mb(); + q = *DATA + + +What's to stop "z = *DATA" crossing "*ADR = ctl_reg_7" and reading from the +wrong register? (There's no guarantee that the process of handling an +interrupt will barrier memory accesses in any way). + + +============================== +I386 AND X86_64 SPECIFIC NOTES +============================== + +Earlier i386 CPUs (pre-Pentium-III) are fully ordered - the operations on the +bus appear in program order - and so there's no requirement for any sort of +explicit memory barriers. + +From the Pentium-III onwards were three new memory barrier instructions: +LFENCE, SFENCE and MFENCE which correspond to the kernel memory barrier +functions rmb(), wmb() and mb(). However, there are additional implicit memory +barriers in the CPU implementation: + + (*) Interrupt processing implies mb(). + + (*) The LOCK prefix adds implication of mb() on whatever instruction it is + attached to. + + (*) Normal writes to memory imply wmb() [and so SFENCE is normally not + required]. 
+
+ (*) Normal writes imply a semi-rmb(): reads before a write may not complete
+     after that write, but reads after a write may complete before the write
+     (ie: reads may go _ahead_ of writes).
+
+ (*) Non-temporal writes imply no memory barrier, and are the intended target
+     of SFENCE.
+
+ (*) Accesses to uncached memory imply mb() [eg: memory mapped I/O].
+
+
+======================
+POWERPC SPECIFIC NOTES
+======================
+
+The powerpc is weakly ordered, and its read and write accesses may be
+completed generally in any order. Its memory barriers are also to some extent
+more substantial than the minimum requirement, and may directly affect
+hardware outside of the CPU.

From matthew at wil.cx Wed Mar 8 04:40:57 2006
From: matthew at wil.cx (Matthew Wilcox)
Date: Tue, 7 Mar 2006 10:40:57 -0700
Subject: Memory barriers and spin_unlock safety
In-Reply-To: <31420.1141753019@warthog.cambridge.redhat.com>
References: <5041.1141417027@warthog.cambridge.redhat.com>
 <32518.1141401780@warthog.cambridge.redhat.com>
 <1146.1141404346@warthog.cambridge.redhat.com>
 <31420.1141753019@warthog.cambridge.redhat.com>
Message-ID: <20060307174057.GD7301@parisc-linux.org>

On Tue, Mar 07, 2006 at 05:36:59PM +0000, David Howells wrote:
> David Howells wrote:
>
> > I suspect, then, that x86_64 should not have an SFENCE for smp_wmb(), and
> > that only io_wmb() should have that.
>
> Hmmm... We don't actually have io_wmb()... Should the following be added to
> all archs?
>
>       io_mb()
>       io_rmb()
>       io_wmb()

it's spelled mmiowb(), and reads from IO space are synchronous, so don't
need barriers.
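
For readers following the thread, the write-barrier/read-barrier pairing that
the patch describes can be sketched in userspace. This is only an analogy
under stated assumptions, not kernel code: C11 fences stand in for the kernel
primitives, with the release fence sitting roughly where smp_wmb() would go
and the acquire fence roughly where smp_rmb() would go (the C11 fences are
somewhat stronger than those barriers). The names payload, ready, publish()
and consume() are hypothetical and appear nowhere in the thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static int payload;        /* plain data, written before the flag is set */
static atomic_bool ready;  /* flag polled by the reader; starts out false */

/* Writer side: store the data, fence, then set the flag.  The release
 * fence keeps the payload store from being reordered after the flag
 * store (assumption: this is where smp_wmb() would sit in kernel code). */
static void publish(int value)
{
    payload = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, true, memory_order_relaxed);
}

/* Reader side: check the flag, fence, then read the data.  Returns false
 * without touching *out if the flag has not been observed yet.  The
 * acquire fence keeps the payload read from being reordered before the
 * flag read (assumption: this is where smp_rmb() would sit). */
static bool consume(int *out)
{
    assert(out != NULL);
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return false;
    atomic_thread_fence(memory_order_acquire);
    *out = payload;
    return true;
}
```

Without the two fences, a reader on another CPU could in principle see the
flag set and still read a stale payload, which is the kind of reordering the
patch warns about. None of this says anything about MMIO ordering; that is
the separate question raised above.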
From ak at suse.de Tue Mar 7 21:34:52 2006
From: ak at suse.de (Andi Kleen)
Date: Tue, 7 Mar 2006 11:34:52 +0100
Subject: [PATCH] Document Linux's memory barriers
In-Reply-To: <31492.1141753245@warthog.cambridge.redhat.com>
References: <31492.1141753245@warthog.cambridge.redhat.com>
Message-ID: <200603071134.52962.ak@suse.de>

On Tuesday 07 March 2006 18:40, David Howells wrote:

> +Older and less complex CPUs will perform memory accesses in exactly the order
> +specified, so if one is given the following piece of code:
> +
> +        a = *A;
> +        *B = b;
> +        c = *C;
> +        d = *D;
> +        *E = e;
> +
> +It can be guaranteed that it will complete the memory access for each
> +instruction before moving on to the next line, leading to a definite sequence
> +of operations on the bus:

Actually gcc is free to reorder it
(often it will not when it cannot prove that they don't alias, but sometimes
it can)

> +
> +     Consider, for example, an ethernet chipset such as the AMD PCnet32. It
> +     presents to the CPU an "address register" and a bunch of "data registers".
> +     The way it's accessed is to write the index of the internal register you
> +     want to access to the address register, and then read or write the
> +     appropriate data register to access the chip's internal register:
> +
> +        *ADR = ctl_reg_3;
> +        reg = *DATA;

You're not supposed to do it this way anyways. The official way to access
MMIO space is using read/write[bwlq]

Haven't read all of it sorry, but thanks for the work of documenting it.
-Andi

From torvalds at osdl.org Wed Mar 8 05:28:45 2006
From: torvalds at osdl.org (Linus Torvalds)
Date: Tue, 7 Mar 2006 10:28:45 -0800 (PST)
Subject: Memory barriers and spin_unlock safety
In-Reply-To: <1141755496.31814.56.camel@localhost.localdomain>
References: <5041.1141417027@warthog.cambridge.redhat.com>
 <32518.1141401780@warthog.cambridge.redhat.com>
 <1146.1141404346@warthog.cambridge.redhat.com>
 <31420.1141753019@warthog.cambridge.redhat.com>
 <1141755496.31814.56.camel@localhost.localdomain>
Message-ID:

On Tue, 7 Mar 2006, Alan Cox wrote:
>
> What kind of mb/rmb/wmb goes with ioread/iowrite ? It seems we actually
> need one that can work out what to do for the general io API ?

The ioread/iowrite things only guarantee the laxer MMIO rules, since it
_might_ be mmio. So you'd use the mmio barriers.

In fact, I would suggest that architectures that can do PIO in a more
relaxed manner (x86 cannot, since all the serialization is in hardware)
would do even a PIO in the more relaxed ordering (ie writes can at least
be posted, but obviously not merged, since that would be against PCI
specs).

x86 tends to serialize PIO too much (I think at least Intel CPU's will
actually wait for the PIO write to be acknowledged by _something_ on the
bus, although it obviously can't wait for the device to have acted on it).

Linus

From dhowells at redhat.com Wed Mar 8 05:30:40 2006
From: dhowells at redhat.com (David Howells)
Date: Tue, 07 Mar 2006 18:30:40 +0000
Subject: [PATCH] Document Linux's memory barriers
In-Reply-To: <200603071134.52962.ak@suse.de>
References: <200603071134.52962.ak@suse.de>
 <31492.1141753245@warthog.cambridge.redhat.com>
Message-ID: <7621.1141756240@warthog.cambridge.redhat.com>

Andi Kleen wrote:

> Actually gcc is free to reorder it
> (often it will not when it cannot prove that they don't alias, but sometimes
> it can)

Yeah... I have mentioned the fact that compilers can reorder too, but
obviously not enough.

> You're not supposed to do it this way anyways.
The official way to access
> MMIO space is using read/write[bwlq]

True, I suppose. I should make it clear that these accessor functions imply
memory barriers, if indeed they do, and that you should use them rather than
accessing I/O registers directly (at least, outside the arch you should).

David

From ak at suse.de Tue Mar 7 22:13:46 2006
From: ak at suse.de (Andi Kleen)
Date: Tue, 7 Mar 2006 12:13:46 +0100
Subject: [PATCH] Document Linux's memory barriers
In-Reply-To: <7621.1141756240@warthog.cambridge.redhat.com>
References: <200603071134.52962.ak@suse.de>
 <31492.1141753245@warthog.cambridge.redhat.com>
 <7621.1141756240@warthog.cambridge.redhat.com>
Message-ID: <200603071213.47885.ak@suse.de>

On Tuesday 07 March 2006 19:30, David Howells wrote:

> > You're not supposed to do it this way anyways. The official way to access
> > MMIO space is using read/write[bwlq]
>
> True, I suppose. I should make it clear that these accessor functions imply
> memory barriers, if indeed they do,

I don't think they do.

> and that you should use them rather than
> accessing I/O registers directly (at least, outside the arch you should).

Even inside the architecture it's a good idea.

-Andi

From alan at lxorguk.ukuu.org.uk Wed Mar 8 05:40:25 2006
From: alan at lxorguk.ukuu.org.uk (Alan Cox)
Date: Tue, 07 Mar 2006 18:40:25 +0000
Subject: [PATCH] Document Linux's memory barriers
In-Reply-To: <31492.1141753245@warthog.cambridge.redhat.com>
References: <31492.1141753245@warthog.cambridge.redhat.com>
Message-ID: <1141756825.31814.75.camel@localhost.localdomain>

On Maw, 2006-03-07 at 17:40 +0000, David Howells wrote:
> +Older and less complex CPUs will perform memory accesses in exactly the order
> +specified, so if one is given the following piece of code:

Not really true.
Some of the fairly old dumb processors don't do this to the
bus, and just about anything with a cache won't (as it'll burst cache
lines to main memory)

> +     want to access to the address register, and then read or write the
> +     appropriate data register to access the chip's internal register:
> +
> +        *ADR = ctl_reg_3;
> +        reg = *DATA;

Not allowed anyway

> +     In this case, the barrier makes a guarantee that all memory accesses
> +     before the barrier will happen before all the memory accesses after the
> +     barrier. It does _not_ guarantee that all memory accesses before the
> +     barrier will be complete by the time the barrier is complete.

A more meaningful example would be barriers versus an IRQ handler. Which
leads nicely onto section 2

> +General memory barriers make a guarantee that all memory accesses specified
> +before the barrier will happen before all memory accesses specified after the
> +barrier.

No. They guarantee that to an observer also running on that set of
processors the accesses to main memory will appear to be ordered in that
manner. They don't guarantee I/O related ordering for non main memory
due to things like PCI posting rules and NUMA goings on.

As an example of the difference here a Geode will reorder stores as it
feels but snoop the bus such that it can ensure an external bus master
cannot observe this by holding it off the bus to fix up ordering
violations first.

> +Read memory barriers make a guarantee that all memory reads specified before
> +the barrier will happen before all memory reads specified after the barrier.
> +
> +Write memory barriers make a guarantee that all memory writes specified before
> +the barrier will happen before all memory writes specified after the barrier.
Both with the caveat above > +There is no guarantee that any of the memory accesses specified before a memory > +barrier will be complete by the completion of a memory barrier; the barrier can > +be considered to draw a line in the access queue that accesses of the > +appropriate type may not cross. CPU generated accesses to main memory > + (*) interrupt disablement and/or interrupts > + (*) spin locks > + (*) R/W spin locks > + (*) mutexes > + (*) semaphores > + (*) R/W semaphores Should probably cover schedule() here. > +Locks and semaphores may not provide any guarantee of ordering on UP compiled > +systems, and so can't be counted on in su