From anton at samba.org Fri Jan 2 23:02:50 2004
From: anton at samba.org (Anton Blanchard)
Date: Fri, 2 Jan 2004 23:02:50 +1100
Subject: pci_map_single return value
In-Reply-To: <3FE8B4D5.2070405@us.ibm.com>
References: <3FD732FD.10903@us.ibm.com>
	<20031217225328.GB25456@krispykreme>
	<3FE8B4D5.2070405@us.ibm.com>
Message-ID: <20040102120250.GU28023@krispykreme>

Hi Brian,

> How does this look?

This is what I currently have, it adds some documentation and links into
the generic DMA API.

Anton

===== Documentation/DMA-API.txt 1.3 vs edited =====
--- 1.3/Documentation/DMA-API.txt Mon May 26 16:18:46 2003
+++ edited/Documentation/DMA-API.txt Sun Dec 21 07:16:36 2003
@@ -199,6 +199,18 @@
 cache width is.

 int
+dma_error(dma_addr_t dma_addr)
+
+int
+pci_dma_error(dma_addr_t dma_addr)
+
+In some circumstances dma_map_single and dma_map_page will fail to create
+a mapping. A driver can check for these errors by testing the returned
+dma address with dma_error(). A non zero return value means the mapping
+could not be created and the driver should take appropriate action (eg
+reduce current DMA mapping usage or delay and try again later).
+
+int
 dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
 	enum dma_data_direction direction)
 int
@@ -210,7 +222,10 @@
 Returns: the number of physical segments mapped (this may be shorted
 than passed in if the block layer determines that some
 elements of the scatter/gather list are physically adjacent and thus
-may be mapped with a single entry).
+may be mapped with a single entry).
+
+As with the other mapping interfaces, dma_map_sg can fail. When it
+does, 0 is returned and a driver should take appropriate action.
void dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nhwentries, ===== Documentation/DMA-mapping.txt 1.17 vs edited ===== --- 1.17/Documentation/DMA-mapping.txt Sun Aug 17 04:46:50 2003 +++ edited/Documentation/DMA-mapping.txt Fri Jan 2 22:50:47 2004 @@ -519,7 +519,7 @@ ends and the second one starts on a page boundary - in fact this is a huge advantage for cards which either cannot do scatter-gather or have very limited number of scatter-gather entries) and returns the actual number -of sg entries it mapped them to. +of sg entries it mapped them to. On failure 0 is returned. Then you should loop count times (note: this can be less than nents times) and use sg_dma_address() and sg_dma_len() macros where you previously @@ -809,6 +809,27 @@ deleted. 2) More to come... + + Handling Errors + +DMA address space is limited on some architectures and an allocation +failure can be determined by: + +- checking if pci_alloc_consistent returns NULL or pci_map_sg returns 0 + +- checking the returned dma_addr_t of pci_map_single and pci_map_page + by using pci_dma_error(): + + dma_addr_t dma_handle; + + dma_handle = pci_map_single(dev, addr, size, direction); + if (pci_dma_error(dma_handle)) { + /* + * reduce current DMA mapping usage, + * delay and try again later or + * reset driver. 
+		 */
+	}

 			   Closing

===== include/asm-generic/dma-mapping.h 1.4 vs edited =====
--- 1.4/include/asm-generic/dma-mapping.h Tue Jan 14 09:37:47 2003
+++ edited/include/asm-generic/dma-mapping.h Sun Dec 21 06:12:41 2003
@@ -120,6 +120,12 @@
 	pci_dma_sync_sg(to_pci_dev(dev), sg, nelems, (int)direction);
 }

+static inline int
+dma_error(dma_addr_t dma_addr)
+{
+	return pci_dma_error(dma_addr);
+}
+
 /* Now for the API extensions over the pci_ one */

 #define dma_alloc_noncoherent(d, s, h, f) dma_alloc_coherent(d, s, h, f)
===== include/asm-generic/pci-dma-compat.h 1.3 vs edited =====
--- 1.3/include/asm-generic/pci-dma-compat.h Tue Jan 14 03:26:02 2003
+++ edited/include/asm-generic/pci-dma-compat.h Sun Dec 21 06:09:37 2003
@@ -84,4 +84,10 @@
 	dma_sync_sg(hwdev == NULL ? NULL : &hwdev->dev, sg, nelems, (enum dma_data_direction)direction);
 }

+static inline int
+pci_dma_error(dma_addr_t dma_addr)
+{
+	return dma_error(dma_addr);
+}
+
 #endif
===== include/asm-i386/dma-mapping.h 1.2 vs edited =====
--- 1.2/include/asm-i386/dma-mapping.h Tue Jan 14 03:28:47 2003
+++ edited/include/asm-i386/dma-mapping.h Fri Jan 2 22:54:53 2004
@@ -91,6 +91,12 @@
 {
 	flush_write_buffers();
 }
+
+static inline int
+dma_error(dma_addr_t dma_addr)
+{
+	return 0;
+}

 static inline int
 dma_supported(struct device *dev, u64 mask)

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From jschopp at austin.ibm.com Tue Jan 6 04:43:08 2004
From: jschopp at austin.ibm.com (jschopp at austin.ibm.com)
Date: Mon, 5 Jan 2004 11:43:08 -0600 (CST)
Subject: spinlocks
In-Reply-To: <20031228052954.GD24358@krispykreme>
Message-ID:

Just got back from vacation so I'm not sure what the status of this is.
Dec 16, before I was on vacation, I sent a patch to the list to try to fix
locks as well, much of it is relevant to this discussion though it probably
needs updating now as locks have been changing underneath it.

I think we additionally need to separate out HMT_LOW and HMT_MEDIUM.
It just happens that almost all of the current pSeries (and the G5 from Apple) don't support HMT, so it is a bit wasteful to call it. -JOel On Sun, 28 Dec 2003, Anton Blanchard wrote: > > Hi, > > We really have to get the new spinlocks beaten into shape... > > 1. They are massive: 24 inline instructions. They eat hot icache for > breakfast. > > 2. They add a bunch of clobbers: > > : "=&r"(tmp), "=&r"(tmp2) > : "r"(&lock->lock) > : "r3", "r4", "r5", "cr0", "cr1", "ctr", "xer", "memory"); > > We tie gcc's hands behind its back for the unlikely case that we have > to call into the hypervisor. > > 3. Separate spinlocks for iseries and pseries where most of it is > duplicated. > > 4. They add more reliance on the paca. We have to stop using the paca > for everything that isnt architecturally required and move to per cpu > data. In the end we may have to put the processor virtual area in the > paca, but we need to be thinking about this issue. > > As an aside, can someone explain why we reread the lock holder: > > lwsync # if odd, give up cycles\n\ > ldx %1,0,%2 # reverify the lock holder\n\ > cmpd %0,%1\n\ > bne 1b # new holder so restart\n\ > > Wont there be a race regardless of whether this code is there? > > 4. I like how we store r13 into the lock since it could save one > register and will make the guys wanting debug spinlocks a bit happier > (you can work out which cpu has the lock via the spinlock value) > > Im proposing a few things: > > 1. Recognise that once we are in SPLPAR mode, all performance bets are > off and we can burn more cycles. If we are calling into the hypervisor, > the path length there is going to dwarf us so why optimise for it? > > 2. Move the slow path out of line. We had problems with this due to the > limited reach of a conditional branch but we can fix this by compiling > with -ffunction-sections. We only then encounter problems if we get a > function that is larger than 32kB. If that happens, something is really > wrong :) > > 3. 
In the slow path call a single out of line function when calling > into the hypervisor that saves/restores all relevant registers. The call > will be nop'ed out by the cpufeature fixup stuff on non SPLPAR. With > the new module interface we should be able to handle cpufeature fixups > in modules. > > Outstanding stuff: > - implement the out of line splpar_spinlock code > - fix cpu features to fixup stuff in modules > - work out how to use FW_FEATURE_SPLPAR in the FTR_SECTION code > > Here is what Im thinking the spinlocks should look like: > > static inline void _raw_spin_lock(spinlock_t *lock) > { > unsigned long tmp; > > asm volatile( > "1: ldarx %0,0,%1 # spin_lock\n\ > cmpdi 0,%0,0\n\ > bne- 2f\n\ > stdcx. 13,0,%1\n\ > bne- 1b\n\ > isync\n\ > .subsection 1\n\ > 2:" > HMT_LOW > BEGIN_FTR_SECTION > " mflr %0\n\ > bl .splpar_spinlock\n" > END_FTR_SECTION_IFSET(CPU_FTR_SPLPAR) > " ldx %0,0,%1\n\ > cmpdi 0,%0,0\n\ > bne- 2b\n" > HMT_MEDIUM > " b 1b\n\ > .previous" > : "=&r"(tmp) > : "r"(&lock->lock) > : "cr0", "memory"); > } > > Anton > > ===== arch/ppc64/Makefile 1.39 vs edited ===== > --- 1.39/arch/ppc64/Makefile Tue Dec 9 03:23:33 2003 > +++ edited/arch/ppc64/Makefile Sun Dec 28 13:41:49 2003 > @@ -28,7 +28,8 @@ > > LDFLAGS := -m elf64ppc > LDFLAGS_vmlinux := -Bstatic -e $(KERNELLOAD) -Ttext $(KERNELLOAD) > -CFLAGS += -msoft-float -pipe -Wno-uninitialized -mminimal-toc > +CFLAGS += -msoft-float -pipe -Wno-uninitialized -mminimal-toc \ > + -mtraceback=none -ffunction-sections > > ifeq ($(CONFIG_POWER4_ONLY),y) > CFLAGS += -mcpu=power4 > ===== include/asm-ppc64/spinlock.h 1.7 vs edited ===== > --- 1.7/include/asm-ppc64/spinlock.h Sat Nov 15 05:45:32 2003 > +++ edited/include/asm-ppc64/spinlock.h Sun Dec 28 13:50:18 2003 > @@ -15,14 +15,14 @@ > * 2 of the License, or (at your option) any later version. > */ > > -#include > +#include > > /* > * The following define is being used to select basic or shared processor > * locking when running on an RPA platform. 
As we do more performance
>  * tuning, I would expect this selection mechanism to change. Dave E.
>  */
> -#define SPLPAR_LOCKS
> +#undef SPLPAR_LOCKS
>  #define HVSC ".long 0x44000022\n"
>
>  typedef struct {
> @@ -138,25 +138,33 @@
>  	: "r3", "r4", "r5", "cr0", "cr1", "ctr", "xer", "memory");
>  }
>  #else
> -static __inline__ void _raw_spin_lock(spinlock_t *lock)
> +
> +static inline void _raw_spin_lock(spinlock_t *lock)
>  {
>  	unsigned long tmp;
>
> -	__asm__ __volatile__(
> -	"b		2f		# spin_lock\n\
> -1:"
> -	HMT_LOW
> -"	ldx		%0,0,%1		# load the lock value\n\
> -	cmpdi		0,%0,0		# if not locked, try to acquire\n\
> -	bne+		1b\n\
> -2: \n"
> -	HMT_MEDIUM
> -"	ldarx		%0,0,%1\n\
> +	asm volatile(
> +"1:	ldarx		%0,0,%1		# spin_lock\n\
>  	cmpdi		0,%0,0\n\
> -	bne-		1b\n\
> +	bne-		2f\n\
>  	stdcx.		13,0,%1\n\
> -	bne-		2b\n\
> -	isync"
> +	bne-		1b\n\
> +	isync\n\
> +	.subsection 1\n\
> +2:"
> +	HMT_LOW
> +#if 0
> +BEGIN_FTR_SECTION
> +"	mflr		%0\n\
> +	bl		.splpar_spinlock\n"
> +END_FTR_SECTION_IFSET(CPU_FTR_SPLPAR)
> +#endif
> +"	ldx		%0,0,%1\n\
> +	cmpdi		0,%0,0\n\
> +	bne-		2b\n"
> +	HMT_MEDIUM
> +"	b		1b\n\
> +	.previous"
>  	: "=&r"(tmp)
>  	: "r"(&lock->lock)
>  	: "cr0", "memory");
>
>
>

From boutcher at us.ibm.com Tue Jan 6 04:50:36 2004
From: boutcher at us.ibm.com (David Boutcher)
Date: Mon, 5 Jan 2004 11:50:36 -0600
Subject: spinlocks
In-Reply-To:
Message-ID:

owner-linuxppc64-dev at lists.linuxppc.org wrote on 01/05/2004 11:43:08 AM:

> I think we additionally need to separate out HMT_LOW and HMT_MEDIUM. It
> just happens that almost all of the current pSeries (and the G5 from
> Apple) don't support HMT, so it is a bit wasteful to call it.

Could, though the HMT_* instructions just map to "or 0,0,0", "or 1,1,1",
etc, which are otherwise no-ops. The processor should do those in one
cycle, or even less.

Dave Boutcher
IBM Linux Technology Center

** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/

From jschopp at austin.ibm.com Tue Jan 6 05:07:45 2004
From: jschopp at austin.ibm.com (jschopp at austin.ibm.com)
Date: Mon, 5 Jan 2004 12:07:45 -0600 (CST)
Subject: spinlocks
In-Reply-To:
Message-ID:

Based on past experience putting "or" instructions in locking routines I
think it is worth doing. The performance hit isn't from the execution of
the instructions but from the extra space they take up. It is not a huge
difference, but it is measurable. Spinlocks are called enough that
anything measurable should be optimized. Amdahl's law and all.

On Mon, 5 Jan 2004, David Boutcher wrote:

> Could, though the HMT_* instructions just map to "or 0,0,0", "or 1,1,1",
> etc, which are otherwise no-ops. The processor should do those in one
> cycle, or even less.
>
> Dave Boutcher
> IBM Linux Technology Center
>

From benh at kernel.crashing.org Tue Jan 6 08:34:03 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Tue, 06 Jan 2004 08:34:03 +1100
Subject: spinlocks
In-Reply-To:
References:
Message-ID: <1073338443.761.77.camel@gaston>

On Tue, 2004-01-06 at 05:07, jschopp at austin.ibm.com wrote:
> Based on past experience putting "or" instructions in locking routines I
> think it is worth doing. The performance hit isn't from the execution of
> the instructions but from the extra space they take up. It is not a huge
> difference, but it is measurable. Spinlocks are called enough that
> anything measurable should be optimized. Amdahl's law and all.

I tend to think that our spinlocks are so big nowadays that it would
probably be worth un-inlining them....

Ben.

** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/

From anton at samba.org Tue Jan 6 11:52:32 2004
From: anton at samba.org (Anton Blanchard)
Date: Tue, 6 Jan 2004 11:52:32 +1100
Subject: spinlocks
In-Reply-To:
References: <20031228052954.GD24358@krispykreme>
Message-ID: <20040106005232.GK12213@krispykreme>

Hi Joel,

> Just got back from vacation so I'm not sure what the status of this is.
> Dec 16, before I was on vacation, I sent a patch to the list to try to fix
> locks as well, much of it is relevant to this discussion though it probably
> needs updating now as locks have been changing underneath it.
>
> I think we additionally need to separate out HMT_LOW and HMT_MEDIUM. It
> just happens that almost all of the current pSeries (and the G5 from
> Apple) don't support HMT, so it is a bit wasteful to call it.

Yeah I pushed the HMT_LOW/HMT_MEDIUM bits into the out of line slow path
based on your earlier results. My theory was the improvement you saw came
mainly from less icache footprint, so putting them out of line in the
slow and hopefully uncommon path should do the same.

Assuming we have to have a single kernel image for all pseries/g5
platforms, then we can't do a lot about them other than nop'ing them out.
Of course or r1,r1,r1 is a nop already although it's not the preferred nop
(preferred nop does get handled a little more efficiently on POWER4 from
memory).

Anton

From anton at samba.org Wed Jan 7 00:09:37 2004
From: anton at samba.org (Anton Blanchard)
Date: Wed, 7 Jan 2004 00:09:37 +1100
Subject: spinlocks
In-Reply-To: <1073338443.761.77.camel@gaston>
References: <1073338443.761.77.camel@gaston>
Message-ID: <20040106130937.GL12213@krispykreme>

> I tend to think that our spinlocks are so big nowadays that it would
> probably be worth un-inlining them....

I prefer out of line slowpath directly below the function rather than
one single out of line spinlock.
It makes profiling much easier, while we can backtrace out of the spinlock when doing readprofile profiling, for hardware performance monitor profiling we get an address that happened somewhere in time and cant do a backtrace. We should give both methods a go, perhaps SMP kernel on UP and something larger like an 8way. Other than sdet is there a benchmark that will really stress our spinlocks and isnt a real pain to run? Heres my current idea for a spinlock: static inline void _raw_spin_lock(spinlock_t *lock) { unsigned long tmp; asm volatile( "1: ldarx %0,0,%1 # spin_lock\n\ cmpdi 0,%0,0\n\ bne- 2f\n\ stdcx. 13,0,%1\n\ bne- 1b\n\ isync\n\ .subsection 1\n\ 2:" HMT_LOW BEGIN_FTR_SECTION " mflr %0\n\ bl .splpar_spinlock_r%1 mtlr %0\n" END_FTR_SECTION_IFSET(CPU_FTR_SPLPAR) " ldx %0,0,%1\n\ cmpdi 0,%0,0\n\ bne- 2b\n" HMT_MEDIUM " b 1b\n\ .previous" : "=&r"(tmp) : "r"(&lock->lock) : "cr0", "memory"); } And below is the magic goo to bind it together, thanks to Alan Modra for pointing out I can create dynamic functions names in inline assembly :) Anton /* * the function that called us may have used stack below the SP, so we * allocate enough here to avoid it. */ #define STACKFRAMESIZE (288 + 3*8) #define SAVE_R3 0 #define SAVE_R4 8 #define SAVE_R5 16 /* junk the kernel provides */ #if 1 #define GLOBAL(A) A #define HVSC .long 0x44000022 #define r1 1 #define r3 3 #define r4 4 #define r5 5 #endif /* * NOTE: This code relies on the vpa and the processor id being within the * paca. Ugly stuff but it works for now. */ #define SPLPAR_SPINLOCK(REG) \ SPLPAR_spinlock_r##REG :\ stdu r1,-STACKFRAMESIZE(r1); \ std r4,SAVE_R4(r1); \ std r5,SAVE_R5(r1); \ lwz r5,0x280(REG); /* load dispatch counter */ \ andi. 
r4,5,1; /* if even then go back and spin */ \ beq 1f; \ std r3,SAVE_R3(r1); \ li 3,0xE4; /* give up the cycles H_CONFER */ \ lhz 4,0x18(REG); /* processor number */ \ HVSC; \ ld r3,SAVE_R3(r1); \ 1: ld r4,SAVE_R4(r1); \ ld r5,SAVE_R5(r1); \ addi r1,r1,STACKFRAMESIZE; \ blr SPLPAR_SPINLOCK(0) SPLPAR_SPINLOCK(3) SPLPAR_SPINLOCK(4) SPLPAR_SPINLOCK(5) SPLPAR_SPINLOCK(6) SPLPAR_SPINLOCK(7) SPLPAR_SPINLOCK(8) SPLPAR_SPINLOCK(9) SPLPAR_SPINLOCK(10) SPLPAR_SPINLOCK(11) SPLPAR_SPINLOCK(12) SPLPAR_SPINLOCK(14) SPLPAR_SPINLOCK(15) SPLPAR_SPINLOCK(16) SPLPAR_SPINLOCK(17) SPLPAR_SPINLOCK(18) SPLPAR_SPINLOCK(19) SPLPAR_SPINLOCK(20) SPLPAR_SPINLOCK(21) SPLPAR_SPINLOCK(22) SPLPAR_SPINLOCK(23) SPLPAR_SPINLOCK(24) SPLPAR_SPINLOCK(25) SPLPAR_SPINLOCK(26) SPLPAR_SPINLOCK(27) SPLPAR_SPINLOCK(28) SPLPAR_SPINLOCK(29) SPLPAR_SPINLOCK(30) SPLPAR_SPINLOCK(31) ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From dwm at austin.ibm.com Wed Jan 7 03:47:53 2004 From: dwm at austin.ibm.com (Doug Maxey) Date: Tue, 06 Jan 2004 10:47:53 -0600 Subject: 2.4/2.6 kdb access to iospace Message-ID: <200401061647.i06GlrIn008663@falcon30.maxey.austin.rr.com> Howdy, I am wondering if anyone has thought about a patch for linux kdb that would allow access to the iospace registers. Does anyone have a feel for what would be required? Something along the lines of an ioremap under the covers possibly? Or am I way off with this? ++doug ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From johnrose at austin.ibm.com Wed Jan 7 08:29:48 2004 From: johnrose at austin.ibm.com (John Rose) Date: Tue, 06 Jan 2004 15:29:48 -0600 Subject: [PATCH] rtas_extended_busy_delay_time() fix Message-ID: <1073424587.18091.23.camel@verve> The rtas_extended_busy_delay_time() function does not calculate the expected number of milliseconds given an RTAS return code. Julie DeWandel of Redhat pointed out this bug and solution. 
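As a rough illustration of the intended calculation (an RTAS status code of 990n should hint a delay of 10^n milliseconds, bounded at n=5), here is a stand-alone C sketch. It is not the kernel function: the name hinted_delay_ms(), the plain int interface, and the treatment of codes below 9900 are assumptions made only for this example.

```c
/*
 * Hypothetical stand-alone helper, NOT the kernel's
 * rtas_extended_busy_delay_time(): map an RTAS extended-busy status
 * code 990n to the hinted delay of 10^n milliseconds.
 */
static unsigned int hinted_delay_ms(int status)
{
	int order = status - 9900;
	unsigned int ms;

	if (order < 0)
		order = 0;	/* assumption: codes below 9900 are treated as n=0 */
	if (order > 5)
		order = 5;	/* bound at n=5, i.e. 100 seconds */

	/* multiply by ten once per order of magnitude, starting from 1 ms */
	for (ms = 1; order > 0; order--)
		ms *= 10;

	return ms;
}
```

Under these assumptions 9900 gives 1 ms and 9905 gives 100000 ms (100 seconds); anything above 9905 is clamped to the same 100 seconds.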
The comment above the function says:

/* Given an RTAS status code of 990n compute the hinted delay of 10^n
 * (last digit) milliseconds. For now we bound at n=5 (100 secs).
 */

This matches the RPA description of what should happen, but the code
doesn't do this. The calculation is a bit hard to follow, and contains a
magic number with 1 too many zeroes. As a result, it calculates the
following:

rtas_extended_busy_delay_time(9900) = 0
rtas_extended_busy_delay_time(9901) = 1
rtas_extended_busy_delay_time(9902) = 10
rtas_extended_busy_delay_time(9903) = 100
rtas_extended_busy_delay_time(9904) = 1000
rtas_extended_busy_delay_time(9905) = 10000

The fix as proposed by Julie makes the calculation more obvious, and
fixes the error. Comments welcome.

Thanks-
John

diff -Nru a/arch/ppc64/kernel/rtas.c b/arch/ppc64/kernel/rtas.c
--- a/arch/ppc64/kernel/rtas.c Tue Jan 6 15:21:46 2004
+++ b/arch/ppc64/kernel/rtas.c Tue Jan 6 15:21:46 2004
@@ -197,9 +197,10 @@
 		order = 5; /* bound */

 	/* Use microseconds for reasonable accuracy */
-	for (ms = 1000; order > 0; order--)
-		ms = ms * 10;
-	return ms / (1000000/HZ); /* round down is fine */
+	for (ms=1; order > 0; order--)
+		ms *= 10;
+
+	return ms;
 }

 int

From lxiep at us.ibm.com Wed Jan 7 09:44:23 2004
From: lxiep at us.ibm.com (Linda Xie)
Date: Tue, 06 Jan 2004 16:44:23 -0600
Subject: [PATCH][2.6] set up vio_dev's driver field
Message-ID: <3FFB3A47.8090404@us.ltcfwd.linux.ibm.com>

Hi-,

The attached patch fixes vio_dev's driver field.

Comments are welcome.

Thanks,

Linda

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: vio.patch
Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040106/65cd4edb/attachment.txt

From hollisb at us.ibm.com Wed Jan 7 10:26:09 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Tue, 6 Jan 2004 17:26:09 -0600
Subject: [PATCH][2.6] set up vio_dev's driver field
In-Reply-To: <3FFB3A47.8090404@us.ltcfwd.linux.ibm.com>
References: <3FFB3A47.8090404@us.ltcfwd.linux.ibm.com>
Message-ID:

On Jan 6, 2004, at 4:44 PM, Linda Xie wrote:
> diff -Nru a/arch/ppc64/kernel/vio.c b/arch/ppc64/kernel/vio.c
> --- a/arch/ppc64/kernel/vio.c Tue Jan 6 16:29:17 2004
> +++ b/arch/ppc64/kernel/vio.c Tue Jan 6 16:29:17 2004
> @@ -189,7 +189,7 @@
>  		const struct vio_device_id* id;
>
>  		id = vio_match_device(driver->id_table, dev);
> -		if (id && (0 < driver->probe(dev, id))) {
> +		if (id && (0 == driver->probe(dev, id))) {
>  			printk(KERN_DEBUG "%s: driver %s/%s took device %p\n",
>  				__FUNCTION__, id->type, id->compat, dev);
>  			dev->driver = driver;

You're right that the drivers return 0 on success, but all this code is
about to be replaced with 2.6 driver model code anyways. The driver model
gives us basic sysfs presence and list locking for free.

Could you test this patch instead? It should require no driver changes.
(I don't think the patch will be whitespace-wrapped but let me know.)

Comments from Greg KH also welcome, though Linda's mail prompted me to
send this out before I've double-checked everything. :) In particular I
had to create a static struct device to act as the VIO bus device, since
the virtual bus doesn't have an actual root struct device (unlike PCI and
USB)...

--
Hollis Blanchard
IBM Linux Technology Center

-------------- next part --------------
===== arch/ppc64/kernel/vio.c 1.4 vs edited =====
--- 1.4/arch/ppc64/kernel/vio.c Fri Dec 5 18:09:27 2003
+++ edited/arch/ppc64/kernel/vio.c Tue Jan 6 17:19:12 2004
@@ -4,6 +4,7 @@
  * Copyright (c) 2003 IBM Corp.
* Dave Engebretsen engebret at us.ibm.com * Santiago Leon santil at us.ibm.com + * Hollis Blanchard * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License @@ -16,6 +17,7 @@ #include #include #include +#include #include #include #include @@ -26,6 +28,8 @@ #include #endif +#define DBGENTER() pr_debug("%s entered\n", __FUNCTION__) + extern struct TceTable *build_tce_table(struct TceTable *tbl); extern dma_addr_t get_tces(struct TceTable *, unsigned order, @@ -33,83 +37,75 @@ extern void tce_free(struct TceTable *tbl, dma_addr_t dma_addr, unsigned order, unsigned num_pages); +static struct device vio_bus_device; /* for viodev->dev.parent */ +static int vio_num_address_cells; -static struct vio_bus vio_bus; -static LIST_HEAD(registered_vio_drivers); -int vio_num_address_cells; -EXPORT_SYMBOL(vio_num_address_cells); - -/* TODO: - * really fit into driver model (see include/linux/device.h) - * locking around list accesses - */ +/* convert from struct device to struct vio_dev and pass to driver. + * dev->driver has already been set by generic code because vio_bus_match + * succeeded. */ +static int vio_bus_probe(struct device *dev) +{ + struct vio_dev *viodev = to_vio_dev(dev); + struct vio_driver *viodrv = to_vio_driver(dev->driver); + const struct vio_device_id *id; + int error = -ENODEV; + + DBGENTER(); + + if (!viodrv->probe) + return error; + + id = vio_match_device(viodrv->id_table, viodev); + if (id) { + error = viodrv->probe(viodev, id); + } + + return error; +} + +/* convert from struct device to struct vio_dev and pass to driver. 
*/
+static int vio_bus_remove(struct device *dev)
+{
+	struct vio_dev *viodev = to_vio_dev(dev);
+	struct vio_driver *viodrv = to_vio_driver(dev->driver);
+
+	DBGENTER();
+
+	if (viodrv->remove) {
+		return viodrv->remove(viodev);
+	}
+
+	/* driver can't remove */
+	return 1;
+}

 /**
  * vio_register_driver: - Register a new vio driver
  * @drv: The vio_driver structure to be registered.
- *
- * Adds the driver structure to the list of registered drivers
- * Returns the number of vio devices which were claimed by the driver
- * during registration. The driver remains registered even if the
- * return value is zero.
  */
-int vio_register_driver(struct vio_driver *drv)
+int vio_register_driver(struct vio_driver *viodrv)
 {
-	int count = 0;
-	struct vio_dev *dev;
-
-	printk(KERN_DEBUG "%s: driver %s/%s registering\n", __FUNCTION__,
-		drv->id_table[0].type, drv->id_table[0].type);
+	printk(KERN_DEBUG "%s: driver %s registering\n", __FUNCTION__,
+		viodrv->name);

-	/* find matching devices not already claimed by other drivers and pass
-	 * them to probe() */
-	list_for_each_entry(dev, &vio_bus.devices, devices_list) {
-		const struct vio_device_id* id;
-
-		if (dev->driver)
-			continue; /* this device is already owned */
-
-		id = vio_match_device(drv->id_table, dev);
-		if (drv && id) {
-			if (0 == drv->probe(dev, id)) {
-				printk(KERN_DEBUG " took device %p\n", dev);
-				dev->driver = drv;
-				count++;
-			}
-		}
-	}
+	/* fill in 'struct device' fields */
+	viodrv->driver.name = viodrv->name;
+	viodrv->driver.bus = &vio_bus_type;
+	viodrv->driver.probe = vio_bus_probe;
+	viodrv->driver.remove = vio_bus_remove;

-	list_add_tail(&drv->node, &registered_vio_drivers);
-
-	return count;
+	return driver_register(&viodrv->driver);
 }
 EXPORT_SYMBOL(vio_register_driver);

 /**
  * vio_unregister_driver - Remove registration of vio driver.
  * @driver: The vio_driver struct to be removed form registration
- *
- * Searches for devices that are assigned to the driver and calls
- * driver->remove() for each one.
Removes the driver from the list - * of registered drivers. Returns the number of devices that were - * assigned to that driver. */ -int vio_unregister_driver(struct vio_driver *driver) +int vio_unregister_driver(struct vio_driver *viodrv) { - struct vio_dev *dev; - int devices_found = 0; - - list_for_each_entry(dev, &vio_bus.devices, devices_list) { - if (dev->driver == driver) { - driver->remove(dev); - dev->driver = NULL; - devices_found++; - } - } - - list_del(&driver->node); - - return devices_found; + driver_unregister(&viodrv->driver); + return 0; } EXPORT_SYMBOL(vio_unregister_driver); @@ -125,6 +121,8 @@ const struct vio_device_id * vio_match_device(const struct vio_device_id *ids, const struct vio_dev *dev) { + DBGENTER(); + while (ids->type) { if ((strncmp(dev->archdata->type, ids->type, strlen(ids->type)) == 0) && device_is_compatible((struct device_node*)dev->archdata, ids->compat)) @@ -137,12 +135,20 @@ /** * vio_bus_init: - Initialize the virtual IO bus */ -int __init +static int __init vio_bus_init(void) { struct device_node *node_vroot, *node_vdev; + int err; - INIT_LIST_HEAD(&vio_bus.devices); + err = bus_register(&vio_bus_type); + if (err) + return err; + + /* the parent of all vio devices */ + memset(&vio_bus_device, 0, sizeof(struct device)); + strcpy(vio_bus_device.bus_id, "vio"); + device_register(&vio_bus_device); /* * Create device node entries for each virtual device @@ -171,39 +177,21 @@ __initcall(vio_bus_init); -/** - * vio_probe_device - attach dev to appropriate driver - * @dev: device to find a driver for - * - * Walks the list of registered VIO drivers looking for one to take this - * device. - * - * Returns a pointer to the matched driver or NULL if driver is not - * found. 
- */
-struct vio_driver * __devinit vio_probe_device(struct vio_dev* dev)
+/* vio_dev refcount hit 0 */
+static void __devinit vio_dev_release(struct device *dev)
 {
-	struct vio_driver *driver;
+	struct vio_dev *viodev = to_vio_dev(dev);

-	list_for_each_entry(driver, &registered_vio_drivers, node) {
-		const struct vio_device_id* id;
+	DBGENTER();

-		id = vio_match_device(driver->id_table, dev);
-		if (id && (0 < driver->probe(dev, id))) {
-			printk(KERN_DEBUG "%s: driver %s/%s took device %p\n",
-				__FUNCTION__, id->type, id->compat, dev);
-			dev->driver = driver;
-			return driver;
-		}
-	}
-
-	printk(KERN_DEBUG "%s: device %p found no driver\n", __FUNCTION__, dev);
-	return NULL;
+	/* XXX free TCE table */
+	of_node_put(viodev->archdata);
+	kfree(viodev);
 }

 /**
  * vio_register_device: - Register a new vio device.
- * @archdata: The OF node for this device.
+ * @node_vdev: The OF node for this device.
  *
  * Creates and initializes a vio_dev structure from the data in
  * node_vdev (archdata) and adds it to the list of virtual devices.
@@ -212,11 +200,13 @@ */ struct vio_dev * __devinit vio_register_device(struct device_node *node_vdev) { - struct vio_dev *dev; + struct vio_dev *viodev; unsigned int *unit_address; unsigned int *irq_p; - /* guarantee all vio_devs have 'device_type' field*/ + DBGENTER(); + + /* we need the 'device_type' property, in order to match with drivers */ if ((NULL == node_vdev->type)) { printk(KERN_WARNING "%s: node %s missing 'device_type'\n", __FUNCTION__, @@ -232,37 +222,46 @@ } /* allocate a vio_dev for this node */ - dev = kmalloc(sizeof(*dev), GFP_KERNEL); - if (!dev) + viodev = kmalloc(sizeof(struct vio_dev), GFP_KERNEL); + if (!viodev) { return NULL; - memset(dev, 0, sizeof(*dev)); + } + memset(viodev, 0, sizeof(struct vio_dev)); - dev->archdata = (void*)of_node_get(node_vdev); - dev->bus = &vio_bus; - dev->unit_address = *unit_address; - dev->tce_table = vio_build_tce_table(dev); - - irq_p = (unsigned int *) get_property(node_vdev, "interrupts", 0); - if(irq_p) { - dev->irq = irq_offset_up(*irq_p); - } else { - dev->irq = (unsigned int) -1; + viodev->archdata = (void *)of_node_get(node_vdev); + viodev->unit_address = *unit_address; + viodev->tce_table = vio_build_tce_table(viodev); + + viodev->irq = (unsigned int) -1; + irq_p = (unsigned int *)get_property(node_vdev, "interrupts", 0); + if (irq_p) { + viodev->irq = irq_offset_up(*irq_p); } - list_add_tail(&dev->devices_list, &vio_bus.devices); + /* init generic 'struct device' fields: */ + viodev->device.parent = &vio_bus_device; + viodev->device.bus = &vio_bus_type; + snprintf(viodev->device.bus_id, BUS_ID_SIZE, "%s@%lx", + node_vdev->name, viodev->unit_address); + viodev->device.release = vio_dev_release; - vio_probe_device(dev); /* finally, assign it to a driver */ + /* register with generic device framework */ + if (device_register(&viodev->device)) { + printk(KERN_ERR "%s: failed to register device %s\n", __FUNCTION__, + viodev->device.bus_id); + } - return dev; + return viodev; } 
+EXPORT_SYMBOL(vio_register_device); -int __devinit vio_unregister_device(struct vio_dev *dev) +int __devinit vio_unregister_device(struct vio_dev *viodev) { - list_del(&dev->devices_list); - of_node_put(dev->archdata); - + DBGENTER(); + device_unregister(&viodev->device); return 0; } +EXPORT_SYMBOL(vio_unregister_device); /** * vio_get_attribute: - get attribute for virtual device @@ -529,6 +528,30 @@ } } EXPORT_SYMBOL(vio_free_consistent); + +static int vio_bus_match(struct device *dev, struct device_driver *drv) +{ + const struct vio_dev *vio_dev = to_vio_dev(dev); + struct vio_driver *vio_drv = to_vio_driver(drv); + const struct vio_device_id *ids = vio_drv->id_table; + const struct vio_device_id *found_id; + + DBGENTER(); + + if (!ids) + return 0; + + found_id = vio_match_device(ids, vio_dev); + if (found_id) + return 1; + + return 0; +} + +struct bus_type vio_bus_type = { + .name = "vio", + .match = vio_bus_match, +}; EXPORT_SYMBOL(plpar_hcall_norets); EXPORT_SYMBOL(plpar_hcall_8arg_2ret); ===== include/asm-ppc64/vio.h 1.3 vs edited ===== --- 1.3/include/asm-ppc64/vio.h Tue Dec 16 15:22:18 2003 +++ edited/include/asm-ppc64/vio.h Fri Dec 19 13:57:42 2003 @@ -16,6 +16,7 @@ #include #include +#include #include #include #include @@ -64,11 +65,11 @@ void vio_free_consistent(struct vio_dev *dev, size_t size, void *vaddr, dma_addr_t dma_handle); +extern struct bus_type vio_bus_type; + struct vio_device_id { char *type; char *compat; -/* I don't think we need this - unsigned long driver_data; */ /* Data private to the driver */ }; struct vio_driver { @@ -76,55 +77,60 @@ char *name; const struct vio_device_id *id_table; /* NULL if wants all devices */ int (*probe) (struct vio_dev *dev, const struct vio_device_id *id); /* New device inserted */ - void (*remove) (struct vio_dev *dev); /* Device removed (NULL if not a hot-plug capable driver) */ + int (*remove) (struct vio_dev *dev); /* Device removed (NULL if not a hot-plug capable driver) */ unsigned long driver_data; 
+ + struct device_driver driver; }; -struct vio_bus; +static inline struct vio_driver *to_vio_driver(struct device_driver *drv) +{ + return container_of(drv, struct vio_driver, driver); +} + /* * The vio_dev structure is used to describe virtual I/O devices. */ struct vio_dev { - struct list_head devices_list; /* node in list of all vio devices */ - struct device_node *archdata; /* Open Firmware node */ - struct vio_bus *bus; /* bus this device is on */ - struct vio_driver *driver; /* owning driver */ + struct device_node *archdata; /* Open Firmware node */ void *driver_data; /* data private to the driver */ unsigned long unit_address; - - struct TceTable *tce_table; /* vio_map_* uses this */ + struct TceTable *tce_table; /* vio_map_* uses this */ unsigned int irq; - struct proc_dir_entry *procent; /* device entry in /proc/bus/vio */ -}; -struct vio_bus { - struct list_head devices; /* list of virtual devices */ + struct device device; }; +static inline struct vio_dev *to_vio_dev(struct device *dev) +{ + return container_of(dev, struct vio_dev, device); +} +/* taken from pci_module_init() */ static inline int vio_module_init(struct vio_driver *drv) { - int rc = vio_register_driver (drv); + int rc = vio_register_driver(drv); - if (rc > 0) - return 0; + if (rc > 0) + return 0; - /* iff CONFIG_HOTPLUG and built into kernel, we should - * leave the driver around for future hotplug events. - * For the module case, a hotplug daemon of some sort - * should load a module in response to an insert event. */ + /* iff CONFIG_HOTPLUG and built into kernel, we should + * leave the driver around for future hotplug events. + * For the module case, a hotplug daemon of some sort + * should load a module in response to an insert event. 
*/
 #if defined(CONFIG_HOTPLUG) && !defined(MODULE)
-	if (rc == 0)
-		return 0;
+	if (rc == 0)
+		return 0;
 #else
-	if (rc == 0)
-		rc = -ENODEV;
+	if (rc == 0)
+		rc = -ENODEV;
 #endif
-	/* if we get here, we need to clean up vio driver instance
-	 * and return some sort of error */
+	/* if we get here, we need to clean up vio driver instance
+	 * and return some sort of error */
+	vio_unregister_driver(drv);
-	return rc;
+	return rc;
 }
 #endif /* _PHYP_H */

From olof at austin.ibm.com Wed Jan 7 11:01:09 2004
From: olof at austin.ibm.com (olof at austin.ibm.com)
Date: Tue, 6 Jan 2004 18:01:09 -0600 (CST)
Subject: [PATCH] [2.4] [RHEL] Backport of benh's PTE mgmt changes
Message-ID:

Below is a 2.4 backport of parts of benh's 2.6 pte_free rewrite. It's
different in a few ways:

1. 2.4 has no RCU. Instead I just send a synchronous IPI to all processors.
Since the IPI won't be delivered until a processor is out of hash_page, it
can be used as a barrier between new and old traversals.

2. There's no batching of TLB shootdowns, like in 2.6. So I had to hijack
do_check_pgt_cache(). This is ugly, and I'm not too happy about it, but
I think RedHat would be more likely to accept this than a change in
generic code (at this point in the product cycle). Julie, feel free to
prove me wrong. :-)

3. Because of the above reason, I had to add an extra per-cpu lock for the
pte_freelist_batch structures.

4. The __hash_page locking is rougher than in 2.6. I left the hash locks
there, since I believe they are still needed.

5. I recycled _PAGE_HPTENOIX, since it's never used. There were no other
free bits available...

(6. RedHat disabled the fast PTE/PMD/PGD allocator, so the patch won't
apply cleanly to an ameslab or marcelo 2.4 tree, but the differences are
pretty obvious.)

I think that's it. Please provide feedback. We're working on a deadline
with RedHat, so sooner is better than later. I'll be beating on this with
the specweb benchmark over the next couple of days as well.
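The batching scheme in points 1 to 3 above can be pictured with a small
user-space model. This is only a sketch under assumptions, not the kernel
code: struct free_batch, batch_add(), batch_drain() and sync_all_cpus() are
made-up names, malloc()/free() stand in for the page allocator, a counter
stands in for the synchronous IPI, and the per-CPU lock is left out. It
shows one barrier being paid per batch rather than per page, and a batch
sized so that header plus slots fit in exactly one page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096u	/* assumed page size for this model */

/* One batch occupies exactly one page: a small header plus pointer slots. */
struct free_batch {
	unsigned int index;	/* number of slots currently in use */
	void *pages[];		/* fills the remainder of the page */
};

/* Subtract the header first, then divide by the slot size. */
#define BATCH_SLOTS \
	((MODEL_PAGE_SIZE - sizeof(struct free_batch)) / sizeof(void *))

static unsigned long sync_calls;	/* number of "barriers" issued */
static unsigned long freed_pages;	/* number of pages released */

/* Hypothetical stand-in for the synchronous IPI to all processors. */
static void sync_all_cpus(void)
{
	sync_calls++;
}

/* Release every queued page behind a single barrier. */
static void batch_flush(struct free_batch *b)
{
	sync_all_cpus();
	for (unsigned int i = 0; i < b->index; i++) {
		free(b->pages[i]);
		freed_pages++;
	}
	free(b);
}

/* Queue one page. If no batch page can be allocated, fall back to an
 * immediate free that pays for its own barrier. */
static void batch_add(struct free_batch **cur, void *page)
{
	if (*cur == NULL) {
		*cur = malloc(MODEL_PAGE_SIZE);
		if (*cur == NULL) {
			sync_all_cpus();
			free(page);
			freed_pages++;
			return;
		}
		(*cur)->index = 0;
	}

	assert((*cur)->index < BATCH_SLOTS);
	(*cur)->pages[(*cur)->index++] = page;

	if ((*cur)->index == BATCH_SLOTS) {
		batch_flush(*cur);
		*cur = NULL;
	}
}

/* Push out a partially filled batch (the role the patch gives to the
 * do_check_pgt_cache() hook). */
static void batch_drain(struct free_batch **cur)
{
	if (*cur != NULL) {
		batch_flush(*cur);
		*cur = NULL;
	}
}
```

With a batch in place the cost of the barrier is amortised over up to
BATCH_SLOTS frees; only the out-of-memory fallback pays for it per page.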
:-) Thanks, Olof Olof Johansson Office: 4E002/905 Linux on Power Development IBM Systems Group Email: olof at austin.ibm.com Phone: 512-838-9858 All opinions are my own and not those of IBM Index: arch/ppc64/kernel/htab.c =================================================================== RCS file: /cvs/local/rhel/arch/ppc64/kernel/htab.c,v retrieving revision 1.1.1.2 diff -w -p -u -r1.1.1.2 htab.c --- arch/ppc64/kernel/htab.c 5 Sep 2003 18:57:01 -0000 1.1.1.2 +++ arch/ppc64/kernel/htab.c 6 Jan 2004 22:31:58 -0000 @@ -320,8 +320,10 @@ int __hash_page(unsigned long ea, unsign unsigned long va, vpn; unsigned long newpp, prpn; unsigned long hpteflags, lock_slot; + unsigned long access_ok, tmp; long slot; pte_t old_pte, new_pte; + int ret = 0; /* Search the Linux page table for a match with va */ va = (vsid << 28) | (ea & 0x0fffffff); @@ -337,21 +339,52 @@ int __hash_page(unsigned long ea, unsign * Check the user's access rights to the page. If access should be * prevented then send the problem up to do_page_fault. */ -#ifdef CONFIG_SHARED_MEMORY_ADDRESSING + + /* + * Check the user's access rights to the page. If access should be + * prevented then send the problem up to do_page_fault. + */ + access |= _PAGE_PRESENT; - if (unlikely(access & ~(pte_val(*ptep)))) { + + /* We'll do access checking and _PAGE_BUSY setting in assembly, since + * it needs to be atomic. + */ + + __asm__ __volatile__ ("\n + 1: ldarx %0,0,%3\n + # Check access rights (access & ~(pte_val(*ptep)))\n + andc. %1,%2,%0\n + bne- 2f\n + # Check if PTE is busy\n + andi. %1,%0,%4\n + bne- 1b\n + ori %0,%0,%4\n + # Write the linux PTE atomically (setting busy)\n + stdcx. %0,0,%3\n + bne- 1b\n + li %1,1\n + b 3f\n + 2: stdcx. 
%0,0,%3 # to clear the reservation\n + li %1,0\n + 3:" + : "=r" (old_pte), "=r" (access_ok) + : "r" (access), "r" (ptep), "i" (_PAGE_BUSY) + : "cc", "memory"); + +#ifdef CONFIG_SHARED_MEMORY_ADDRESSING + if (unlikely(!access_ok)) { if(!(((ea >> SMALLOC_EA_SHIFT) == (SMALLOC_START >> SMALLOC_EA_SHIFT)) && ((current->thread.flags) & PPC_FLAG_SHARED))) { - spin_unlock(&hash_table_lock[lock_slot].lock); - return 1; + ret = 1; + goto out_unlock; } } #else - access |= _PAGE_PRESENT; - if (unlikely(access & ~(pte_val(*ptep)))) { - spin_unlock(&hash_table_lock[lock_slot].lock); - return 1; + if (unlikely(!access_ok)) { + ret = 1; + goto out_unlock; } #endif @@ -428,9 +461,22 @@ int __hash_page(unsigned long ea, unsign *ptep = new_pte; } +out_unlock: + tmp = _PAGE_BUSY; + + /* Clear _PAGE_BUSY flag atomically. */ + __asm__ __volatile__ (" + 1: ldarx %0,0,%2\n + andc. %0,%0,%1\n + stdcx. %0,0,%2\n + bne- 1b\n" + : "=r" (new_pte) + : "r" (tmp), "r" (ptep) + : "cc", "memory"); + spin_unlock(&hash_table_lock[lock_slot].lock); - return 0; + return ret; } /* @@ -497,12 +543,6 @@ int hash_page(unsigned long ea, unsigned pgdir = mm->pgd; if (pgdir == NULL) return 1; - /* - * Lock the Linux page table to prevent mmap and kswapd - * from modifying entries while we search and update - */ - spin_lock(&mm->page_table_lock); - ptep = find_linux_pte(pgdir, ea); /* * If no pte found or not present, send the problem up to @@ -515,8 +555,6 @@ int hash_page(unsigned long ea, unsigned ret = 1; } - spin_unlock(&mm->page_table_lock); - return ret; } Index: arch/ppc64/mm/init.c =================================================================== RCS file: /cvs/local/rhel/arch/ppc64/mm/init.c,v retrieving revision 1.1.1.1 diff -w -p -u -r1.1.1.1 init.c --- arch/ppc64/mm/init.c 7 Aug 2003 03:21:44 -0000 1.1.1.1 +++ arch/ppc64/mm/init.c 6 Jan 2004 22:42:55 -0000 @@ -104,9 +104,72 @@ unsigned long __max_memory; */ mmu_gather_t mmu_gathers[NR_CPUS]; +/* PTE free batching structures. 
We need a lock since not all + * operations take place under page_table_lock. Keep it per-CPU + * to avoid bottlenecks. + */ + +spinlock_t pte_freelist_lock[NR_CPUS] = { [0 ... NR_CPUS-1] = SPIN_LOCK_UNLOCKED}; +struct pte_freelist_batch *pte_freelist_cur[NR_CPUS]; + +unsigned long pte_freelist_forced_free; + +static void pte_free_smp_sync(void *arg) +{ + /* Do nothing, just ensure we sync with all CPUs */ +} + +/* This is only called when we are critically out of memory + * (and fail to get a page in pte_free_tlb). + */ +void pte_free_now(struct page *ptepage) +{ + pte_freelist_forced_free++; + + smp_call_function(pte_free_smp_sync, NULL, 0, 1); + + pte_free_kernel(page_address(ptepage)); +} + + +void pte_free_batch(struct pte_freelist_batch *batch) +{ + unsigned int i; + + /* A sync is good enough: It will ensure that no other + * CPU is currently traversing down to a free'd pte. + */ + + smp_call_function(pte_free_smp_sync, NULL, 0, 1); + + for (i = 0; i < batch->index; i++) + pte_free_kernel(page_address(batch->pages[i])); + free_page((unsigned long)batch); +} + + int do_check_pgt_cache(int low, int high) { int freed = 0; + struct pte_freelist_batch **batchp; + spinlock_t *lock = &pte_freelist_lock[smp_processor_id()]; + + /* We use this function to push the current pte free batch to be + * deallocated, since do_check_pgt_cache() is called at the end of each + * free_one_pgd() and other parts of VM relies on all PTE's being + * properly freed upon return from that function. 
+	 */
+
+	spin_lock(lock);
+
+	batchp = &pte_freelist_cur[smp_processor_id()];
+
+	if(*batchp) {
+		pte_free_batch(*batchp);
+		*batchp = NULL;
+	}
+
+	spin_unlock(lock);
 #if 0
 	if (pgtable_cache_size > high) {
@@ -120,6 +183,7 @@ int do_check_pgt_cache(int low, int high
 		} while (pgtable_cache_size > low);
 	}
 #endif
+
 	return freed;
 }

Index: include/asm-ppc64/mmu.h
===================================================================
RCS file: /cvs/local/rhel/include/asm-ppc64/mmu.h,v
retrieving revision 1.1.1.1
diff -w -p -u -r1.1.1.1 mmu.h
Index: include/asm-ppc64/pgalloc.h
===================================================================
RCS file: /cvs/local/rhel/include/asm-ppc64/pgalloc.h,v
retrieving revision 1.1.1.2
diff -w -p -u -r1.1.1.2 pgalloc.h
--- include/asm-ppc64/pgalloc.h 26 Sep 2003 14:42:15 -0000 1.1.1.2
+++ include/asm-ppc64/pgalloc.h 6 Jan 2004 22:34:39 -0000
@@ -112,7 +112,51 @@ pte_alloc_one(struct mm_struct *mm, unsi
 	return NULL;
 }

-#define pte_free(pte_page) pte_free_kernel(page_address(pte_page))
+#define pte_free(pte_page) __pte_free(pte_page)
+
+struct pte_freelist_batch
+{
+	unsigned int index;
+	struct page * pages[0];
+};
+
+#define PTE_FREELIST_SIZE ((PAGE_SIZE - sizeof(struct pte_freelist_batch)) / \
+	sizeof(struct page *))
+
+extern void pte_free_now(struct page *ptepage);
+extern void pte_free_batch(struct pte_freelist_batch *batch);
+
+extern struct pte_freelist_batch *pte_freelist_cur[];
+extern spinlock_t pte_freelist_lock[];
+
+static inline void __pte_free(struct page *ptepage)
+{
+	spinlock_t *lock = &pte_freelist_lock[smp_processor_id()];
+	struct pte_freelist_batch **batchp;
+
+	spin_lock(lock);
+
+	batchp = &pte_freelist_cur[smp_processor_id()];
+
+	if (*batchp == NULL) {
+		*batchp = (struct pte_freelist_batch *)__get_free_page(GFP_ATOMIC);
+		if (*batchp == NULL) {
+			spin_unlock(lock);
+			pte_free_now(ptepage);
+			return;
+		}
+		(*batchp)->index = 0;
+	}
+
+	(*batchp)->pages[(*batchp)->index++] = ptepage;
+	if ((*batchp)->index ==
PTE_FREELIST_SIZE) { + pte_free_batch(*batchp); + *batchp = NULL; + } + + spin_unlock(lock); +} + extern int do_check_pgt_cache(int, int); Index: include/asm-ppc64/pgtable.h =================================================================== RCS file: /cvs/local/rhel/include/asm-ppc64/pgtable.h,v retrieving revision 1.1.1.1 diff -w -p -u -r1.1.1.1 pgtable.h --- include/asm-ppc64/pgtable.h 7 Aug 2003 03:21:59 -0000 1.1.1.1 +++ include/asm-ppc64/pgtable.h 6 Jan 2004 22:34:23 -0000 @@ -88,22 +88,22 @@ * Bits in a linux-style PTE. These match the bits in the * (hardware-defined) PowerPC PTE as closely as possible. */ -#define _PAGE_PRESENT 0x001UL /* software: pte contains a translation */ -#define _PAGE_USER 0x002UL /* matches one of the PP bits */ -#define _PAGE_RW 0x004UL /* software: user write access allowed */ -#define _PAGE_GUARDED 0x008UL -#define _PAGE_COHERENT 0x010UL /* M: enforce memory coherence (SMP systems) */ -#define _PAGE_NO_CACHE 0x020UL /* I: cache inhibit */ -#define _PAGE_WRITETHRU 0x040UL /* W: cache write-through */ -#define _PAGE_DIRTY 0x080UL /* C: page changed */ -#define _PAGE_ACCESSED 0x100UL /* R: page referenced */ -#define _PAGE_HPTENOIX 0x200UL /* software: pte HPTE slot unknown */ -#define _PAGE_HASHPTE 0x400UL /* software: pte has an associated HPTE */ -#define _PAGE_EXEC 0x800UL /* software: i-cache coherence required */ -#define _PAGE_SECONDARY 0x8000UL /* software: HPTE is in secondary group */ -#define _PAGE_GROUP_IX 0x7000UL /* software: HPTE index within group */ +#define _PAGE_PRESENT 0x0001 /* software: pte contains a translation */ +#define _PAGE_USER 0x0002 /* matches one of the PP bits */ +#define _PAGE_RW 0x0004 /* software: user write access allowed */ +#define _PAGE_GUARDED 0x0008 +#define _PAGE_COHERENT 0x0010 /* M: enforce memory coherence (SMP systems) */ +#define _PAGE_NO_CACHE 0x0020 /* I: cache inhibit */ +#define _PAGE_WRITETHRU 0x0040 /* W: cache write-through */ +#define _PAGE_DIRTY 0x0080 /* C: page changed */ 
+#define _PAGE_ACCESSED 0x0100 /* R: page referenced */ +#define _PAGE_BUSY 0x0200 /* software: pte & hash are busy */ +#define _PAGE_HASHPTE 0x0400 /* software: pte has an associated HPTE */ +#define _PAGE_EXEC 0x0800 /* software: i-cache coherence required */ +#define _PAGE_GROUP_IX 0x7000 /* software: HPTE index within group */ +#define _PAGE_SECONDARY 0x8000 /* software: HPTE is in secondary group */ /* Bits 0x7000 identify the index within an HPT Group */ -#define _PAGE_HPTEFLAGS (_PAGE_HASHPTE | _PAGE_HPTENOIX | _PAGE_SECONDARY | _PAGE_GROUP_IX) +#define _PAGE_HPTEFLAGS (_PAGE_HASHPTE | _PAGE_SECONDARY | _PAGE_GROUP_IX) /* PAGE_MASK gives the right answer below, but only by accident */ /* It should be preserving the high 48 bits and then specifically */ /* preserving _PAGE_SECONDARY | _PAGE_GROUP_IX */ @@ -290,12 +290,14 @@ static inline unsigned long pte_update( __asm__ __volatile__("\n\ 1: ldarx %0,0,%3 \n\ + andi. %1,%0,%7 # loop on _PAGE_BUSY set\n\ + bne- 1b \n\ andc %1,%0,%4 \n\ or %1,%1,%5 \n\ stdcx. %1,0,%3 \n\ bne- 1b" : "=&r" (old), "=&r" (tmp), "=m" (*p) - : "r" (p), "r" (clr), "r" (set), "m" (*p) + : "r" (p), "r" (clr), "r" (set), "m" (*p), "i" (_PAGE_BUSY) : "cc" ); return old; } ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From benh at kernel.crashing.org Wed Jan 7 16:08:09 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Wed, 07 Jan 2004 16:08:09 +1100 Subject: [PATCH] [2.4] [RHEL] Backport of benh's PTE mgmt changes In-Reply-To: References: Message-ID: <1073452088.4067.75.camel@gaston> On Wed, 2004-01-07 at 11:01, olof at austin.ibm.com wrote: > Below is a 2.4 backport of parts of benh's 2.6 pte_free rewrite. It's > different in a few ways: > > 1. 2.4 has no RCU. Instead I just send a syncronous IPI to all processors. > Since the IPI won't be delivered until a processor is out of hash_page, it > can be used as a barrier between new and old traversals. But is also quite expensive.... > 2. 
There's no batching of TLB shootdowns, like in 2.6. So I had to hijack > do_check_pgt_cache(). This is ugly, and I'm not too happy about it, but > I think RedHat would be more likely to accept this than a change in > generic code (at this point in the product cycle). Julie, feel free to > prove me wrong. :-) > > 3. Because of the above reason, I had to add an extra per-cpu lock for the > pte_freelist_batch structures. > > 4. The __hash_page locking is rougher than in 2.6. I left the hash locks > there, since I believe they are still needed. > > 5. I recycled _PAGE_HASHNOIX, since it's never used. There were no other > free bits available... I moved bits around on 2.6, basically, _PAGE_FILE can be moved as it's only used when !_PAGE_PRESENT, to make room. > > (6. RedHat disabled the fast PTE/PMD/PGD allocator, so the patch won't > apply cleanly to an ameslab or marcelo 2.4 tree, but the differences are > pretty obvious.) > > > > I think that's it. Please provide feedback. We're working on a deadline > with RedHat, so sooner is better than later. I'll be beating on this with > the specweb benchmark over the next couple of days as well. :-) Comments in the patch. > + /* > + * Check the user's access rights to the page. If access should be > + * prevented then send the problem up to do_page_fault. > + */ > + > access |= _PAGE_PRESENT; > - if (unlikely(access & ~(pte_val(*ptep)))) { > + > + /* We'll do access checking and _PAGE_BUSY setting in assembly, since > + * it needs to be atomic. > + */ > + > + __asm__ __volatile__ ("\n > + 1: ldarx %0,0,%3\n > + # Check access rights (access & ~(pte_val(*ptep)))\n > + andc. %1,%2,%0\n > + bne- 2f\n > + # Check if PTE is busy\n > + andi. %1,%0,%4\n > + bne- 1b\n > + ori %0,%0,%4\n > + # Write the linux PTE atomically (setting busy)\n > + stdcx. %0,0,%3\n > + bne- 1b\n > + li %1,1\n > + b 3f\n > + 2: stdcx. 
%0,0,%3 # to clear the reservation\n
> + li %1,0\n
> + 3:"
> + : "=r" (old_pte), "=r" (access_ok)
> + : "r" (access), "r" (ptep), "i" (_PAGE_BUSY)
> + : "cc", "memory");

.../...

Heh, so you kept the C version stuffing the asm atomic stuff
in :) Why not... well, it's definitely less invasive than what
I did in 2.6 but also less performant since I optimized the
branches to the ppc_md. hooks. That's probably ok for 2.4 though.

> + /* Clear _PAGE_BUSY flag atomically. */
> + __asm__ __volatile__ ("
> + 1: ldarx %0,0,%2\n
> + andc. %0,%0,%1\n
> + stdcx. %0,0,%2\n
> + bne- 1b\n"
> + : "=r" (new_pte)
> + : "r" (tmp), "r" (ptep)
> + : "cc", "memory");

I'm not sure we need to clear _PAGE_BUSY atomically.... I definitely
don't in 2.6... But we need to make sure this clear happens after
anything that was done previously.

The rest is a bit scary but it's 2.4 so... :) I suppose it should
work though I would have to spend more time looking at the code path
in detail.

Ben.

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From bishfak at in.ibm.com Wed Jan 7 16:29:18 2004
From: bishfak at in.ibm.com (Ishfak F Bhagat)
Date: Wed, 7 Jan 2004 10:59:18 +0530
Subject: problem with building 64 bit library - symbols seem to be getting messed up
In-Reply-To: <20040106092102.A13626@lists.linuxppc.org>
Message-ID:

Oops, my mail did not go as ASCII and got base64 encoded ! Here it is
again:

I am building a 64 bit library on ppc linux using the gcc (cross)
compiler v3.2.
During linking I get the following errors : ------------------- link errs ------------------ /BUILD/sb/ldapdevnew/obj/ppc_linux_2/libraries/libldap/ssl/64/getdn.o: In function `no symbol': /BUILD/sb/ldapdevnew/obj/ppc_linux_2/libraries/libldap/ssl/64/getdn.o(.text+0xe9c): multiple definition of `no symbol' /BUILD/sb/ldapdevnew/obj/ppc_linux_2/libraries/libldap/ssl/64/getdn.o(.text+0x120): first defined here /opt/cross/lib/gcc-lib/powerpc64-linux/3.2/../../../../powerpc64-linux/bin/ld: Warning: size of symbol `' changed from 212 to 2444 in /BUILD/sb/ldapdevnew/obj/ppc_linux_2/libraries/libldap/ssl/64/getdn.o /BUILD/sb/ldapdevnew/obj/ppc_linux_2/libraries/libldap/ssl/64/getdn.o: In function `list_delete_esc': /BUILD/sb/ldapdevnew/obj/ppc_linux_2/libraries/libldap/ssl/64/getdn.o(.opd+0x138): multiple definition of `list_delete_esc' /BUILD/sb/ldapdevnew/obj/ppc_linux_2/libraries/libldap/ssl/64/getdn.o(.text+0x1828):first defined here /opt/cross/lib/gcc-lib/powerpc64-linux/3.2/../../../../powerpc64-linux/bin/ld: Warning: size of symbol `list_delete_esc' changed from 3404 to 24 in /BUILD/sb/ldapdevnew/obj/ppc_linux_2/libraries/libldap/ssl/64/getdn.o -------------------------------------------------- going thru the nm output for the above .o file, I realized that the symbols are getting messed up. Few of the initial characters for most symbols are getting stripped off, and some symbols just blank. E.g. the last symbol is seen as '8z' but is actually 'utf8z' in the src file ! 
----------------- part of the nm output ------------------ 00000000000000ad d U U 0000000000000120 T 0000000000000e9c T 0000000000002808 T 0000000000002a78 T 0000000000000228 D 00000000000003d8 D 0000000000000468 D 0000000000006360 T 0000000000000498 D 00000000000063c8 T 0000000000000510 D 00000000000072ac T 0000000000000528 D 00000000000073d0 T 0000000000000540 D 00000000000074d0 T 0000000000004be8 T .ldap_explode_dns 000000000000043d d .ldap_explode_dns2 0000000000000000 T .malloc 0000000000000439 d 2ufn2 0000000000000430 d 8gthan 000000000000042f d 8slasht 000000000000042b d 8z --------------------------------------------------- I use the following command (thru my makefile) for compiling the above file : ----------------------- compile command ---------------------- powerpc64-linux-gcc -c -O0 -DTEMPLATEFILE="\"/usr/ldap/etc/ldaptemplates.conf\"" -DFILTERFILE="\"/usr/ldap/etc/ldapfilter.conf\"" -DLDAPV3 -DLOCALCP_TRANSLATION -DSSL -DLDAP_SSL_MAX -DLDAP_THREADSAFE -DSLAPD_CALLBACKS -DPPC_LINUX_2 -D_PPC_LINUX_2 -DLINUX2 -DLDAP_DEBUG -DLDAP_REFERRALS -DRDBM_CACHE -DREVERSE_INDEXING -DNEEDPROTOS -D__EXTENSIONS__ -DCLIENT_SERVER_LOCALCP_TRANSLATION -DLOCALCP_ASCII -D_XOPEN_SOURCE=500 -D_XOPEN_SOURCE_EXTENDED -D_BSD_SOURCE -D__SVR4 -D_ALL_SOURCE -mno-altivec -mabi=no-altivec -D_REENTRANT -D__64BIT__ -DPPC__64 -fPIC -nostartfiles -I/BUILD/sb/ldapdevnew/export/ppc_linux_2/ldap/usr/include -I/BUILD/sb/ldapdevnew/adks/common/include -I/BUILD/sb/ldapdevnew/adks/ppc_linux_2/gskit/701.9/include -I/BUILD/sb/ldapdevnew/adks/ppc_linux_2/kerberos5/include /BUILD/sb/ldapdevnew/src/libraries/libldap/getdn.c ----------------------- compile command ---------------------- Any clues to what is going wrong? NOTE: The above errors are noticed only on some files. I am able to build some other libraries successfully and the above errors are only noticed on some .c files for this particular library. Does it look like a compiler bug? 
I searched the SuSe website and 3.2 seems to be the latest GCC compiler.
Any help is greatly appreciated.

Thanks and Regards,
Ishfak Bhagat
Staff Software Engineer
IBM India Software Labs, Pune
Tie-Line - 92-47022
Tel - 91 20 26901022 (91 20 26982424, Extn: 1022)
Fax - 91 20 26982425

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From olof at austin.ibm.com Wed Jan 7 16:50:43 2004
From: olof at austin.ibm.com (olof at austin.ibm.com)
Date: Tue, 6 Jan 2004 23:50:43 -0600 (CST)
Subject: [PATCH] [2.4] [RHEL] Backport of benh's PTE mgmt changes
In-Reply-To: <1073452088.4067.75.camel@gaston>
Message-ID:

On Wed, 7 Jan 2004, Benjamin Herrenschmidt wrote:

> On Wed, 2004-01-07 at 11:01, olof at austin.ibm.com wrote:
> > Below is a 2.4 backport of parts of benh's 2.6 pte_free rewrite. It's
> > different in a few ways:
> >
> > 1. 2.4 has no RCU. Instead I just send a synchronous IPI to all processors.
> > Since the IPI won't be delivered until a processor is out of hash_page, it
> > can be used as a barrier between new and old traversals.
>
> But is also quite expensive....

I'll try to see how visible it is with workloads. I'm not sure how
synchronization could be achieved without either RCU or IPI support, so
hopefully it won't be a big hit.

> I moved bits around on 2.6, basically, _PAGE_FILE can be moved as it's
> only used when !_PAGE_PRESENT, to make room.

Yeah, I wanted to keep changes at a minimum here so I just grabbed the
first available.

> Heh, so you kept the C version stuffing the asm atomic stuff
> in :) Why not... well, it's definitely less invasive than what
> I did in 2.6 but also less performant since I optimized the
> branches to the ppc_md. hooks. That's probably ok for 2.4 though.

Exactly. I quite honestly had no desire to rewrite the 2.4 function in
asm. Performance should only be worse compared to 2.6, it doesn't do much
harm to 2.4. In other words: I brought over the functionality, but not the
additional performance enhancements.
:)

> > + /* Clear _PAGE_BUSY flag atomically. */
> > + __asm__ __volatile__ ("
> > + 1: ldarx %0,0,%2\n
> > + andc. %0,%0,%1\n
> > + stdcx. %0,0,%2\n
> > + bne- 1b\n"
> > + : "=r" (new_pte)
> > + : "r" (tmp), "r" (ptep)
> > + : "cc", "memory");
>
> I'm not sure we need to clear _PAGE_BUSY atomically.... I definitely
> don't in 2.6... But we need to make sure this clear happens after
> anything that was done previously.

You're right. Easy to change.

> The rest is a bit scary but it's 2.4 so... :) I suppose it should
> work though I would have to spend more time looking at the code path
> in detail.

Yeah, I wanted to backport the bug/design fix with as little disturbance
to the rest as possible. I'll run some comparison benchmarks to see how
much the added IPI hurts.

-Olof

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From bishfak at in.ibm.com Thu Jan 8 00:01:59 2004
From: bishfak at in.ibm.com (Ishfak F Bhagat)
Date: Wed, 7 Jan 2004 18:31:59 +0530
Subject: problem with building 64 bit library - symbols seem to be getting messed up - SOLVED
In-Reply-To:
Message-ID:

Found the problem ! my sandbox was in NFS space and that was causing this
corruption. If I have a local sandbox, everything works fine ! Wonder if
it is my NFS server (which happens to be an AIX Machine) or the SLES 8 NFS
client that is buggy.

Regards,
Ishfak

Ishfak F Bhagat/India/IBM at IBMIN
Sent by: owner-linuxppc64-dev at lists.linuxppc.org
01/07/2004 10:59 AM
To: linuxppc64-dev at lists.linuxppc.org
cc:
Subject: Re: problem with building 64 bit library - symbols seem to be getting messed up

Oops, my mail did not go as ASCII and got base64 encoded ! Here it is
again:

I am building a 64 bit library on ppc linux using the gcc (cross)
compiler v3.2.
I searched the SuSe website and 3.2 seems to be the latest GCC compiler.
Any help is greatly appreciated.

Thanks and Regards,
Ishfak Bhagat
Staff Software Engineer
IBM India Software Labs, Pune
Tie-Line - 92-47022
Tel - 91 20 26901022 (91 20 26982424, Extn: 1022)
Fax - 91 20 26982425

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From engebret at vnet.ibm.com Thu Jan 8 01:40:06 2004
From: engebret at vnet.ibm.com (Dave Engebretsen)
Date: Wed, 07 Jan 2004 08:40:06 -0600
Subject: spinlocks
In-Reply-To: <20040106005232.GK12213@krispykreme>
References: <20031228052954.GD24358@krispykreme> <20040106005232.GK12213@krispykreme>
Message-ID: <3FFC1A46.4010202@vnet.ibm.com>

Anton Blanchard wrote:

> Assuming we have to have a single kernel image for all pseries/g5
> platforms, then we cant do a lot about them other than nop'ing them out.
> Of course or r1,r1,r1 is a nop already although its not the preferred
> nop (preferred nop does get handled a little more efficiently on POWER4
> from memory).
>

Is a single binary for Apple & pSeries a goal? While it has some
obvious advantages, there is likely to be a number of areas (the
spinlock discussion being one) where the goals are quite different.

Dave.

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From olof at austin.ibm.com Thu Jan 8 01:48:23 2004
From: olof at austin.ibm.com (olof at austin.ibm.com)
Date: Wed, 7 Jan 2004 08:48:23 -0600 (CST)
Subject: spinlocks
In-Reply-To: <3FFC1A46.4010202@vnet.ibm.com>
Message-ID:

On Wed, 7 Jan 2004, Dave Engebretsen wrote:

> Is a single binary for Apple & pSeries a goal? While it has some
> obvious advantages, there is likely to be a number of areas (the
> spinlock discussion being one) where the goals are quite different.

Are they really all that different? We need to keep the pSeries code
running smoothly on a small-config SMP machine too (i.e. p615 and the
like).
-Olof

Olof Johansson                                        Office: 4E002/905
Linux on Power Development                            IBM Systems Group
Email: olof at austin.ibm.com                          Phone: 512-838-9858
All opinions are my own and not those of IBM

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From engebret at vnet.ibm.com Thu Jan 8 02:02:22 2004
From: engebret at vnet.ibm.com (Dave Engebretsen)
Date: Wed, 07 Jan 2004 09:02:22 -0600
Subject: spinlocks
In-Reply-To:
References:
Message-ID: <3FFC1F7E.1020904@vnet.ibm.com>

olof at austin.ibm.com wrote:

> On Wed, 7 Jan 2004, Dave Engebretsen wrote:
>
>
>>Is a single binary for Apple & pSeries a goal? While it has some
>>obvious advantages, there is likely to be a number of areas (the
>>spinlock discussion being one) where the goals are quite different.
>
>
> Are they really all that different? We need to keep the pSeries code
> running smoothly on a small-config SMP machine too (i.e. p615 and the
> like).
>
>
> -Olof

Maybe not - just raising the debate. Nothing in all this will keep the
code from running smoothly on small config p615 machines. In many ways,
the more advanced virtualization results in machines which are much
smaller than anything else, so tuning for small is good for i/pSeries too.

Everything being equal, I would just as soon see a common binary. But
items like HMT priorities are almost certainly going to exist in the Mac
binaries -- frankly, in the scheme of things a few extra no-ops in the
kernel are not going to be the performance bottleneck an end user sees.

Dave.

** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/ From engebret at vnet.ibm.com Thu Jan 8 02:09:32 2004 From: engebret at vnet.ibm.com (Dave Engebretsen) Date: Wed, 07 Jan 2004 09:09:32 -0600 Subject: spinlocks In-Reply-To: <20031228052954.GD24358@krispykreme> References: <20031228052954.GD24358@krispykreme> Message-ID: <3FFC212C.1010906@vnet.ibm.com> Just getting caught up after break - Anton Blanchard wrote: > Hi, > > We really have to get the new spinlocks beaten into shape... > > 3. Separate spinlocks for iseries and pseries where most of it is > duplicated. I do not follow this point - > As an aside, can someone explain why we reread the lock holder: > > lwsync # if odd, give up cycles\n\ > ldx %1,0,%2 # reverify the lock holder\n\ > cmpd %0,%1\n\ > bne 1b # new holder so restart\n\ > > Wont there be a race regardless of whether this code is there? It is a tricky case, but the sequence is required. Here is the situation: Proc A holds the lock Proc B sees proc A as the holder, then gets preempted Proc A drops the lock, then cedes for a long time Proc B reads proc A's yield count, which is valid (odd) Proc B confers to proc A, but does not wake up until after A is dispatched. The lwsync + reread ensures this cannot occur. > 1. Recognise that once we are in SPLPAR mode, all performance bets are > off and we can burn more cycles. If we are calling into the hypervisor, > the path length there is going to dwarf us so why optimise for it? While I agree performance is less important in SPLPAR mode than dedicated, it is still important. The vast majority of customers on iSeries run in this mode. Dave. ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From engebret at vnet.ibm.com Thu Jan 8 02:39:07 2004 From: engebret at vnet.ibm.com (Dave Engebretsen) Date: Wed, 07 Jan 2004 09:39:07 -0600 Subject: spinlocks In-Reply-To: <20040106130937.GL12213@krispykreme> References: <1073338443.761.77.camel@gaston> <20040106130937.GL12213@krispykreme> Message-ID: <3FFC281B.2090007@vnet.ibm.com> Anton Blanchard wrote: >> BenH: >>I tend to think that our spinlocks are so big nowadays that it would >>probably be worth un-inlining them.... > If we uninline them, the advantage of leaf function optimizations are lost -- it seems like that would be a pretty big hit, right?. We don't have any good data, but it may well be about a wash vs. the 1/2 cache line of extra instructions introduced for shared processors. > > I prefer out of line slowpath directly below the function rather than > one single out of line spinlock. It makes profiling much easier, while we > can backtrace out of the spinlock when doing readprofile profiling, for > hardware performance monitor profiling we get an address that happened > somewhere in time and cant do a backtrace. > Isn't this going to result in shared processor locks always stacking the "mini-frame"? That is a pretty big hit for what is likely to be a very common customer configuration. > static inline void _raw_spin_lock(spinlock_t *lock) > { ... > #define SPLPAR_SPINLOCK(REG) \ > SPLPAR_spinlock_r##REG :\ > stdu r1,-STACKFRAMESIZE(r1); \ > std r4,SAVE_R4(r1); \ > std r5,SAVE_R5(r1); \ > lwz r5,0x280(REG); /* load dispatch counter */ \ > andi. r4,5,1; /* if even then go back and spin */ \ > beq 1f; \ > std r3,SAVE_R3(r1); \ > li 3,0xE4; /* give up the cycles H_CONFER */ \ > lhz 4,0x18(REG); /* processor number */ \ > HVSC; \ > ld r3,SAVE_R3(r1); \ > 1: ld r4,SAVE_R4(r1); \ > ld r5,SAVE_R5(r1); \ > addi r1,r1,STACKFRAMESIZE; \ > blr > > SPLPAR_SPINLOCK(0) What magic results in this ending up at the end of each function? 
When Peter & I were just looking at this, he pointed out that lwz r5,0x2580(0) may not quite have the intended results :) Also, where in this are cr0, cr1, and xer marked as clobbered? They are all volatile over the hcall. Dave. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From lxiep at us.ibm.com Thu Jan 8 03:42:24 2004 From: lxiep at us.ibm.com (Linda Xie) Date: Wed, 07 Jan 2004 10:42:24 -0600 Subject: [PATCH][2.6] Virtual Etherne Driver References: <1070489395.21837.89.camel@santit30> Message-ID: <3FFC36F0.6000205@us.ltcfwd.linux.ibm.com> > +static int __devinit ibmveth_probe(struct vio_dev *dev, const struct vio_device_id *id) > +{ > + int rc; > + struct net_device *netdev; > + struct ibmveth_adapter *adapter; > + > + unsigned int *mac_addr_p; > + unsigned int *mcastFilterSize_p; > + > + > + ibmveth_debug_printk_no_adapter("entering ibmveth_probe for UA 0x%lx\n", > + dev->unit_address); > + > + mac_addr_p = (unsigned int *) vio_get_attribute(dev, VETH_MAC_ADDR, 0); > + if(!mac_addr_p) { > + ibmveth_printk(KERN_WARNING "Can't find VETH_MAC_ADDR attribute\n"); > + return 0; > + } Should a non-zero value be returned from here? Since "0" usually means "SUCCESS". I would suggest that probe should return "-ENODEV"(not valid vio_dev) in this case. > + > + mcastFilterSize_p= (unsigned int *) vio_get_attribute(dev, VETH_MCAST_FILTER_SIZE, 0); > + if(!mcastFilterSize_p) { > + ibmveth_printk(KERN_WARNING "Can't find VETH_MCAST_FILTER_SIZE attribute\n"); > + return 0; > + } For the same reason. Thanks, Linda ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olof at austin.ibm.com Thu Jan 8 08:43:22 2004 From: olof at austin.ibm.com (olof at austin.ibm.com) Date: Wed, 7 Jan 2004 15:43:22 -0600 (CST) Subject: [PATCH] [2.4] [RHEL] Backport of benh's PTE mgmt changes In-Reply-To: Message-ID: On Tue, 6 Jan 2004 olof at forte.austin.ibm.com wrote: > I'll try to see how visible is with workloads. 
I'm not sure how > synchronization could be achieved without either RCU or IPI support, so > hopefully it won't be a big hit. No visible impact on SPECweb, as far as I can tell. SDET didn't show much of a difference either, but it's hard to tell since numbers vary quite a bit. But anyway: The solution is too obvious: A rwlock, with hash_page taking it for reading, and pte_freelist_batch taking it for writing momentarily to synchronize with the readers. This way, batch free only has to wait for all hash_page()s to complete, and no IPI is needed. Code size of hash_page is largely unaltered from the old page_table_lock spin_locks. I also moved to a nonatomic clearing of _PAGE_BUSY, and added the PMD frees to be managed by the freelist stuff too, just as in 2.6. So, bottom line: This patch gives no particular performance benefit over baseline, but it does remove the deadlock hangs. New patch below. -Olof Index: arch/ppc64/kernel/htab.c =================================================================== RCS file: /cvs/local/rhel/arch/ppc64/kernel/htab.c,v retrieving revision 1.1.1.2 diff -p -u -r1.1.1.2 htab.c --- arch/ppc64/kernel/htab.c 5 Sep 2003 18:57:01 -0000 1.1.1.2 +++ arch/ppc64/kernel/htab.c 7 Jan 2004 20:25:52 -0000 @@ -64,6 +64,7 @@ HTAB htab_data = {NULL, 0, 0, 0, 0}; extern unsigned long _SDR1; extern unsigned long klimit; +extern rwlock_t pte_hash_lock; void make_pte(HPTE *htab, unsigned long va, unsigned long pa, int mode, unsigned long hash_mask, int large); @@ -320,8 +321,10 @@ int __hash_page(unsigned long ea, unsign unsigned long va, vpn; unsigned long newpp, prpn; unsigned long hpteflags, lock_slot; + unsigned long access_ok, tmp; long slot; pte_t old_pte, new_pte; + int ret = 0; /* Search the Linux page table for a match with va */ va = (vsid << 28) | (ea & 0x0fffffff); @@ -337,21 +340,52 @@ int __hash_page(unsigned long ea, unsign * Check the user's access rights to the page. 
If access should be * prevented then send the problem up to do_page_fault. */ -#ifdef CONFIG_SHARED_MEMORY_ADDRESSING + + /* + * Check the user's access rights to the page. If access should be + * prevented then send the problem up to do_page_fault. + */ + access |= _PAGE_PRESENT; - if (unlikely(access & ~(pte_val(*ptep)))) { + + /* We'll do access checking and _PAGE_BUSY setting in assembly, since + * it needs to be atomic. + */ + + __asm__ __volatile__ ("\n + 1: ldarx %0,0,%3\n + # Check access rights (access & ~(pte_val(*ptep)))\n + andc. %1,%2,%0\n + bne- 2f\n + # Check if PTE is busy\n + andi. %1,%0,%4\n + bne- 1b\n + ori %0,%0,%4\n + # Write the linux PTE atomically (setting busy)\n + stdcx. %0,0,%3\n + bne- 1b\n + li %1,1\n + b 3f\n + 2: stdcx. %0,0,%3 # to clear the reservation\n + li %1,0\n + 3:" + : "=r" (old_pte), "=r" (access_ok) + : "r" (access), "r" (ptep), "i" (_PAGE_BUSY) + : "cr0", "memory"); + +#ifdef CONFIG_SHARED_MEMORY_ADDRESSING + if (unlikely(!access_ok)) { if(!(((ea >> SMALLOC_EA_SHIFT) == (SMALLOC_START >> SMALLOC_EA_SHIFT)) && ((current->thread.flags) & PPC_FLAG_SHARED))) { - spin_unlock(&hash_table_lock[lock_slot].lock); - return 1; + ret = 1; + goto out_unlock; } } #else - access |= _PAGE_PRESENT; - if (unlikely(access & ~(pte_val(*ptep)))) { - spin_unlock(&hash_table_lock[lock_slot].lock); - return 1; + if (unlikely(!access_ok)) { + ret = 1; + goto out_unlock; } #endif @@ -428,9 +462,14 @@ int __hash_page(unsigned long ea, unsign *ptep = new_pte; } +out_unlock: + smp_wmb(); + + pte_val(*ptep) &= ~_PAGE_BUSY; + spin_unlock(&hash_table_lock[lock_slot].lock); - return 0; + return ret; } /* @@ -497,11 +536,14 @@ int hash_page(unsigned long ea, unsigned pgdir = mm->pgd; if (pgdir == NULL) return 1; - /* - * Lock the Linux page table to prevent mmap and kswapd - * from modifying entries while we search and update + /* The pte_hash_lock is used to block any PTE deallocations + * while we walk the tree and use the entry. 
While technically + * we both read and write the PTE entry while holding the read + * lock, the _PAGE_BUSY bit will block pte_update()s to the + * specific entry. */ - spin_lock(&mm->page_table_lock); + + read_lock(&pte_hash_lock); ptep = find_linux_pte(pgdir, ea); /* @@ -514,8 +556,7 @@ int hash_page(unsigned long ea, unsigned /* If no pte, send the problem up to do_page_fault */ ret = 1; } - - spin_unlock(&mm->page_table_lock); + read_unlock(&pte_hash_lock); return ret; } Index: arch/ppc64/mm/init.c =================================================================== RCS file: /cvs/local/rhel/arch/ppc64/mm/init.c,v retrieving revision 1.1.1.1 diff -p -u -r1.1.1.1 init.c --- arch/ppc64/mm/init.c 7 Aug 2003 03:21:44 -0000 1.1.1.1 +++ arch/ppc64/mm/init.c 7 Jan 2004 20:45:14 -0000 @@ -104,9 +104,78 @@ unsigned long __max_memory; */ mmu_gather_t mmu_gathers[NR_CPUS]; +/* PTE free batching structures. We need a lock since not all + * operations take place under page_table_lock. Keep it per-CPU + * to avoid bottlenecks. + */ + +spinlock_t pte_freelist_lock[NR_CPUS] = { [0 ... NR_CPUS-1] = SPIN_LOCK_UNLOCKED}; +struct pte_freelist_batch *pte_freelist_cur[NR_CPUS]; +rwlock_t pte_hash_lock = RW_LOCK_UNLOCKED; + +unsigned long pte_freelist_forced_free; + +static inline void pte_free_sync(void) +{ + unsigned long flags; + + /* A sync (lock/unlock) is good enough: It will ensure that no + * other CPU is in hash_page, currently traversing down to a + * free'd pte. + */ + + write_lock_irqsave(&pte_hash_lock, flags); + write_unlock_irqrestore(&pte_hash_lock, flags); +} + + +/* This is only called when we are critically out of memory + * (and fail to get a page in pte_free_tlb). 
+ */ +void pte_free_now(pte_t *pte) +{ + pte_freelist_forced_free++; + + pte_free_sync(); + + pte_free_kernel(pte); +} + + +void pte_free_batch(struct pte_freelist_batch *batch) +{ + unsigned int i; + + pte_free_sync(); + + for (i = 0; i < batch->index; i++) + pte_free_kernel(batch->entry[i]); + free_page((unsigned long)batch); +} + + int do_check_pgt_cache(int low, int high) { int freed = 0; + struct pte_freelist_batch **batchp; + spinlock_t *lock = &pte_freelist_lock[smp_processor_id()]; + + /* We use this function to push the current pte free batch to be + * deallocated, since do_check_pgt_cache() is called at the end of each + * free_one_pgd() and other parts of VM relies on all PTE's being + * properly freed upon return from that function. + */ + + spin_lock(lock); + + batchp = &pte_freelist_cur[smp_processor_id()]; + + if(*batchp) { + pte_free_batch(*batchp); + *batchp = NULL; + } + + spin_unlock(lock); #if 0 if (pgtable_cache_size > high) { Index: include/asm-ppc64/pgalloc.h =================================================================== RCS file: /cvs/local/rhel/include/asm-ppc64/pgalloc.h,v retrieving revision 1.1.1.2 diff -p -u -r1.1.1.2 pgalloc.h --- include/asm-ppc64/pgalloc.h 26 Sep 2003 14:42:15 -0000 1.1.1.2 +++ include/asm-ppc64/pgalloc.h 7 Jan 2004 20:46:01 -0000 @@ -93,12 +93,6 @@ pte_free_kernel(pte_t *pte) } -static inline void -pmd_free (pmd_t *pmd) -{ - free_page((unsigned long)pmd); -} - #define pte_alloc_one_fast(mm, address) (0) static inline struct page * @@ -112,7 +106,57 @@ pte_alloc_one(struct mm_struct *mm, unsi return NULL; } -#define pte_free(pte_page) pte_free_kernel(page_address(pte_page)) +/* Use the PTE functions for freeing PMD as well, since the same + * problem with tree traversals apply. Since pmd pointers are always + * virtual, no need for a page_address() translation. 
+ */ + +#define pte_free(pte_page) __pte_free(page_address(pte_page)) +#define pmd_free(pmd) __pte_free(pmd) + +struct pte_freelist_batch +{ + unsigned int index; + void* entry[0]; +}; + +#define PTE_FREELIST_SIZE ((PAGE_SIZE - sizeof(struct pte_freelist_batch)) / \ + sizeof(struct page *)) + +extern void pte_free_now(pte_t *pte); +extern void pte_free_batch(struct pte_freelist_batch *batch); + +extern struct pte_freelist_batch *pte_freelist_cur[]; +extern spinlock_t pte_freelist_lock[]; + +static inline void __pte_free(pte_t *pte) +{ + spinlock_t *lock = &pte_freelist_lock[smp_processor_id()]; + struct pte_freelist_batch **batchp; + + spin_lock(lock); + + batchp = &pte_freelist_cur[smp_processor_id()]; + + if (*batchp == NULL) { + *batchp = (struct pte_freelist_batch *)__get_free_page(GFP_ATOMIC); + if (*batchp == NULL) { + spin_unlock(lock); + pte_free_now(pte); + return; + } + (*batchp)->index = 0; + } + + (*batchp)->entry[(*batchp)->index++] = pte; + if ((*batchp)->index == PTE_FREELIST_SIZE) { + pte_free_batch(*batchp); + *batchp = NULL; + } + + spin_unlock(lock); +} + extern int do_check_pgt_cache(int, int); Index: include/asm-ppc64/pgtable.h =================================================================== RCS file: /cvs/local/rhel/include/asm-ppc64/pgtable.h,v retrieving revision 1.1.1.1 diff -p -u -r1.1.1.1 pgtable.h --- include/asm-ppc64/pgtable.h 7 Aug 2003 03:21:59 -0000 1.1.1.1 +++ include/asm-ppc64/pgtable.h 7 Jan 2004 20:32:57 -0000 @@ -88,22 +88,22 @@ * Bits in a linux-style PTE. These match the bits in the * (hardware-defined) PowerPC PTE as closely as possible. 
*/ -#define _PAGE_PRESENT 0x001UL /* software: pte contains a translation */ -#define _PAGE_USER 0x002UL /* matches one of the PP bits */ -#define _PAGE_RW 0x004UL /* software: user write access allowed */ -#define _PAGE_GUARDED 0x008UL -#define _PAGE_COHERENT 0x010UL /* M: enforce memory coherence (SMP systems) */ -#define _PAGE_NO_CACHE 0x020UL /* I: cache inhibit */ -#define _PAGE_WRITETHRU 0x040UL /* W: cache write-through */ -#define _PAGE_DIRTY 0x080UL /* C: page changed */ -#define _PAGE_ACCESSED 0x100UL /* R: page referenced */ -#define _PAGE_HPTENOIX 0x200UL /* software: pte HPTE slot unknown */ -#define _PAGE_HASHPTE 0x400UL /* software: pte has an associated HPTE */ -#define _PAGE_EXEC 0x800UL /* software: i-cache coherence required */ -#define _PAGE_SECONDARY 0x8000UL /* software: HPTE is in secondary group */ -#define _PAGE_GROUP_IX 0x7000UL /* software: HPTE index within group */ +#define _PAGE_PRESENT 0x0001 /* software: pte contains a translation */ +#define _PAGE_USER 0x0002 /* matches one of the PP bits */ +#define _PAGE_RW 0x0004 /* software: user write access allowed */ +#define _PAGE_GUARDED 0x0008 +#define _PAGE_COHERENT 0x0010 /* M: enforce memory coherence (SMP systems) */ +#define _PAGE_NO_CACHE 0x0020 /* I: cache inhibit */ +#define _PAGE_WRITETHRU 0x0040 /* W: cache write-through */ +#define _PAGE_DIRTY 0x0080 /* C: page changed */ +#define _PAGE_ACCESSED 0x0100 /* R: page referenced */ +#define _PAGE_BUSY 0x0200 /* software: pte & hash are busy */ +#define _PAGE_HASHPTE 0x0400 /* software: pte has an associated HPTE */ +#define _PAGE_EXEC 0x0800 /* software: i-cache coherence required */ +#define _PAGE_GROUP_IX 0x7000 /* software: HPTE index within group */ +#define _PAGE_SECONDARY 0x8000 /* software: HPTE is in secondary group */ /* Bits 0x7000 identify the index within an HPT Group */ -#define _PAGE_HPTEFLAGS (_PAGE_HASHPTE | _PAGE_HPTENOIX | _PAGE_SECONDARY | _PAGE_GROUP_IX) +#define _PAGE_HPTEFLAGS (_PAGE_HASHPTE | _PAGE_SECONDARY | 
_PAGE_GROUP_IX) /* PAGE_MASK gives the right answer below, but only by accident */ /* It should be preserving the high 48 bits and then specifically */ /* preserving _PAGE_SECONDARY | _PAGE_GROUP_IX */ @@ -289,13 +289,15 @@ static inline unsigned long pte_update( unsigned long old, tmp; __asm__ __volatile__("\n\ -1: ldarx %0,0,%3 \n\ +1: ldarx %0,0,%3 \n\ + andi. %1,%0,%7 # loop on _PAGE_BUSY set\n\ + bne- 1b \n\ andc %1,%0,%4 \n\ or %1,%1,%5 \n\ stdcx. %1,0,%3 \n\ bne- 1b" : "=&r" (old), "=&r" (tmp), "=m" (*p) - : "r" (p), "r" (clr), "r" (set), "m" (*p) + : "r" (p), "r" (clr), "r" (set), "m" (*p), "i" (_PAGE_BUSY) : "cc" ); return old; } ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From hollisb at us.ibm.com Thu Jan 8 08:53:29 2004 From: hollisb at us.ibm.com (Hollis Blanchard) Date: Wed, 7 Jan 2004 15:53:29 -0600 Subject: [PATCH][2.6] set up vio_dev's driver field In-Reply-To: References: Message-ID: On Jan 7, 2004, at 3:19 PM, Linda Xie wrote: > >Could you test this patch instead? > > I tested your patch. Here are the things I found: > > 1) DLPAR vio_dev removal completed w/o errors from kernel and user > space: > BTW, the current vio_unregister_device() code always returns 0 > (SUCCESS). Yes, unfortunately device_unregister doesn't return an error code. > 2) DLPAR vio_dev addition failed in vio_register_device() because > device_register(&viodev->device) > call failed. > > 3) Changes needed in vio_register_device(): It should free viodev > struct and return NULL > when device_register() fails. I guess so. I was thinking that failure to register with the device layer was non-fatal (the driver could still handle its device), but that's not true: the driver won't even be notified of the new device. So I'll make these changes. Of course the real question is why device_register() failed... I'll try to find a partition to debug on. -- Hollis Blanchard IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From benh at kernel.crashing.org Thu Jan 8 10:58:14 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Thu, 08 Jan 2004 10:58:14 +1100 Subject: [PATCH] [2.4] [RHEL] Backport of benh's PTE mgmt changes In-Reply-To: References: Message-ID: <1073519894.5753.117.camel@gaston> On Thu, 2004-01-08 at 08:43, olof at austin.ibm.com wrote: > On Tue, 6 Jan 2004 olof at forte.austin.ibm.com wrote: > > > I'll try to see how visible is with workloads. I'm not sure how > > synchronization could be achieved without either RCU or IPI support, so > > hopefully it won't be a big hit. > > No visible impact on SPECweb, as far as I can tell. SDET didn't show much > of a difference either, but it's hard to tell since numbers vary quite a bit. > > But anyway: The solution is too obvious: A rwlock, with hash_page taking > it for reading, and pte_freelist_batch taking it for writing momentarily > to synchronize with the readers. This way, batch free only has to wait for > all hash_page()s to complete, and no IPI is needed. Code size of hash_page > is largely unaltered from the old page_table_lock spin_locks. A global rwlock may not be that good, as it means global cache ping-pong, whereas the page table lock was per-mm ... unless you put the rwlock in the mm (like in the mmu_context), but then you need the mm pointer in pte_free etc... Ben. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From linas at austin.ibm.com Thu Jan 8 11:21:18 2004 From: linas at austin.ibm.com (linas at austin.ibm.com) Date: Wed, 7 Jan 2004 18:21:18 -0600 Subject: problem with building 64 bit library - symbols seem to be getting messed up - SOLVED In-Reply-To: ; from bishfak@in.ibm.com on Wed, Jan 07, 2004 at 06:31:59PM +0530 References: Message-ID: <20040107182117.A41368@forte.austin.ibm.com> On Wed, Jan 07, 2004 at 06:31:59PM +0530, Ishfak F Bhagat wrote: > > Found the problem ! > my sandbox was in NFS space and that was causing this corruption. 
If I > have a local sandbox, everything works fine ! > > Wonder if it is my NFS server (which happens to be an AIX Machine) or the > SLES 8 NFS client that is buggy. There have been NFS corruption bugs found and fixed in SLES8. Try getting a newer SLES8 --linas ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From johnrose at austin.ibm.com Fri Jan 9 04:02:10 2004 From: johnrose at austin.ibm.com (John Rose) Date: Thu, 08 Jan 2004 11:02:10 -0600 Subject: [PATCH] xics_disable_irq() irq server bug Message-ID: <1073581330.21007.7.camel@verve> The xics_disable_irq() function can use an incorrect irq server when setting an irq to the lowest priority. This happens in the case of CONFIG_IRQ_ALL_CPUS=y, which is the default case. This bug prevents DLPAR removal for slots that contain adapters that have been activated since boot. Will push tomorrow if there are no objections. Thanks- John diff -Nru a/arch/ppc64/kernel/xics.c b/arch/ppc64/kernel/xics.c --- a/arch/ppc64/kernel/xics.c Thu Jan 8 10:58:28 2004 +++ b/arch/ppc64/kernel/xics.c Thu Jan 8 10:58:28 2004 @@ -213,21 +213,17 @@ pSeriesLP_qirr_info }; + /* XXX Fix this when we clean up large irq support */ extern cpumask_t get_irq_affinity(unsigned int irq); -void xics_enable_irq(unsigned int irq) +static int get_irq_server(unsigned int irq) { - long call_status; - unsigned int server; cpumask_t cpumask = get_irq_affinity(irq); cpumask_t allcpus = CPU_MASK_ALL; cpumask_t tmp = CPU_MASK_NONE; - - irq = irq_offset_down(irq); - if (irq == XICS_IPI) - return; - + unsigned int server; + #ifdef CONFIG_IRQ_ALL_CPUS /* For the moment only implement delivery to all cpus or one cpu */ if (smp_threads_ready) { @@ -247,7 +243,20 @@ #else server = default_server; #endif + return server; + +} + +void xics_enable_irq(unsigned int irq) +{ + long call_status; + unsigned int server; + irq = irq_offset_down(irq); + if (irq == XICS_IPI) + return; + + server = get_irq_server(irq); call_status = rtas_call(ibm_set_xive, 
3, 1, NULL, irq, server, DEFAULT_PRIORITY); if (call_status != 0) { @@ -268,6 +277,7 @@ void xics_disable_irq(unsigned int irq) { long call_status; + unsigned int server; irq = irq_offset_down(irq); if (irq == XICS_IPI) @@ -280,9 +290,9 @@ return; } + server = get_irq_server(irq); /* Have to set XIVE to 0xff to be able to remove a slot */ - call_status = rtas_call(ibm_set_xive, 3, 1, NULL, irq, default_server, - 0xff); + call_status = rtas_call(ibm_set_xive, 3, 1, NULL, irq, server, 0xff); if (call_status != 0) { printk("xics_disable_irq: irq=%x: ibm_set_xive(0xff) returned %lx\n", irq, call_status); ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Fri Jan 9 17:18:06 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 9 Jan 2004 17:18:06 +1100 Subject: ppc64 PTE hacks In-Reply-To: <20031223235632.GE934@krispykreme> References: <20031223235632.GE934@krispykreme> Message-ID: <20040109061805.GC25504@krispykreme> > I just remembered we never merged this patch from Paul. It would be > great to get rid of the flush_tlb_* functions. Here it is updated for 2.6, using percpu data etc. Its currently getting some stress testing and if that passes and there are no concerns I'll merge it in. As Ben mentioned we need it for page aging to work. 
Anton ===== arch/ppc64/kernel/pSeries_htab.c 1.13 vs edited ===== --- 1.13/arch/ppc64/kernel/pSeries_htab.c Fri Dec 5 10:00:40 2003 +++ edited/arch/ppc64/kernel/pSeries_htab.c Fri Jan 9 14:39:15 2004 @@ -300,7 +300,7 @@ int i, j; HPTE *hptep; Hpte_dword0 dw0; - struct ppc64_tlb_batch *batch = &ppc64_tlb_batch[smp_processor_id()]; + struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch); /* XXX fix for large ptes */ unsigned long large = 0; ===== arch/ppc64/kernel/pSeries_lpar.c 1.35 vs edited ===== --- 1.35/arch/ppc64/kernel/pSeries_lpar.c Thu Nov 13 10:23:27 2003 +++ edited/arch/ppc64/kernel/pSeries_lpar.c Fri Jan 9 14:39:15 2004 @@ -602,7 +602,7 @@ { int i; unsigned long flags; - struct ppc64_tlb_batch *batch = &ppc64_tlb_batch[smp_processor_id()]; + struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch); spin_lock_irqsave(&pSeries_lpar_tlbie_lock, flags); ===== arch/ppc64/kernel/process.c 1.44 vs edited ===== --- 1.44/arch/ppc64/kernel/process.c Wed Dec 17 15:27:52 2003 +++ edited/arch/ppc64/kernel/process.c Fri Jan 9 14:39:15 2004 @@ -49,6 +49,7 @@ #include #include #include +#include #ifndef CONFIG_SMP struct task_struct *last_task_used_math = NULL; @@ -145,6 +146,8 @@ if (new->thread.regs && last_task_used_altivec == new) new->thread.regs->msr |= MSR_VEC; #endif /* CONFIG_ALTIVEC */ + + flush_tlb_pending(); new_thread = &new->thread; old_thread = &current->thread; ===== arch/ppc64/mm/Makefile 1.9 vs edited ===== --- 1.9/arch/ppc64/mm/Makefile Wed Dec 17 16:08:23 2003 +++ edited/arch/ppc64/mm/Makefile Fri Jan 9 14:39:18 2004 @@ -4,6 +4,6 @@ EXTRA_CFLAGS += -mno-minimal-toc -obj-y := fault.o init.o extable.o imalloc.o hash_utils.o hash_low.o +obj-y := fault.o init.o extable.o imalloc.o hash_utils.o hash_low.o tlb.o obj-$(CONFIG_DISCONTIGMEM) += numa.o obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o ===== arch/ppc64/mm/hash_utils.c 1.44 vs edited ===== --- 1.44/arch/ppc64/mm/hash_utils.c Sun Jan 4 21:47:33 2004 +++ edited/arch/ppc64/mm/hash_utils.c Fri 
Jan 9 14:39:18 2004 @@ -325,8 +325,7 @@ ppc_md.flush_hash_range(context, number, local); } else { int i; - struct ppc64_tlb_batch *batch = - &ppc64_tlb_batch[smp_processor_id()]; + struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch); for (i = 0; i < number; i++) flush_hash_page(context, batch->addr[i], batch->pte[i], ===== arch/ppc64/mm/init.c 1.54 vs edited ===== --- 1.54/arch/ppc64/mm/init.c Sun Jan 4 21:47:33 2004 +++ edited/arch/ppc64/mm/init.c Fri Jan 9 15:38:03 2004 @@ -170,17 +170,27 @@ printk("%d pages swap cached\n",cached); } -void * -ioremap(unsigned long addr, unsigned long size) -{ #ifdef CONFIG_PPC_ISERIES + +void *ioremap(unsigned long addr, unsigned long size) +{ return (void*)addr; +} + +void iounmap(void *addr) +{ + return; +} + #else + +void * +ioremap(unsigned long addr, unsigned long size) +{ void *ret = __ioremap(addr, size, _PAGE_NO_CACHE); if(mem_init_done) return eeh_ioremap(addr, ret); /* may remap the addr */ return ret; -#endif } void * @@ -326,7 +336,7 @@ * * XXX what about calls before mem_init_done (ie python_countermeasures()) */ -void pSeries_iounmap(void *addr) +void iounmap(void *addr) { unsigned long address, start, end, size; struct mm_struct *mm; @@ -352,29 +362,18 @@ spin_lock(&mm->page_table_lock); dir = pgd_offset_i(address); - flush_cache_all(); + flush_cache_vunmap(address, end); do { unmap_im_area_pmd(dir, address, end - address); address = (address + PGDIR_SIZE) & PGDIR_MASK; dir++; } while (address && (address < end)); - __flush_tlb_range(mm, start, end); + flush_tlb_kernel_range(start, end); spin_unlock(&mm->page_table_lock); return; } -void iounmap(void *addr) -{ -#ifdef CONFIG_PPC_ISERIES - /* iSeries I/O Remap is a noop */ - return; -#else - /* DRENG / PPPBBB todo */ - return pSeries_iounmap(addr); -#endif -} - int iounmap_explicit(void *addr, unsigned long size) { struct vm_struct *area; @@ -463,152 +462,7 @@ } } -void -flush_tlb_mm(struct mm_struct *mm) -{ - struct vm_area_struct *mp; - - 
spin_lock(&mm->page_table_lock); - - for (mp = mm->mmap; mp != NULL; mp = mp->vm_next) - __flush_tlb_range(mm, mp->vm_start, mp->vm_end); - - /* XXX are there races with checking cpu_vm_mask? - Anton */ - cpus_clear(mm->cpu_vm_mask); - - spin_unlock(&mm->page_table_lock); -} - -/* - * Callers should hold the mm->page_table_lock - */ -void -flush_tlb_page(struct vm_area_struct *vma, unsigned long vmaddr) -{ - unsigned long context = 0; - pgd_t *pgd; - pmd_t *pmd; - pte_t *ptep; - pte_t pte; - int local = 0; - cpumask_t tmp; - - switch( REGION_ID(vmaddr) ) { - case VMALLOC_REGION_ID: - pgd = pgd_offset_k( vmaddr ); - break; - case IO_REGION_ID: - pgd = pgd_offset_i( vmaddr ); - break; - case USER_REGION_ID: - pgd = pgd_offset( vma->vm_mm, vmaddr ); - context = vma->vm_mm->context; - - /* XXX are there races with checking cpu_vm_mask? - Anton */ - tmp = cpumask_of_cpu(smp_processor_id()); - if (cpus_equal(vma->vm_mm->cpu_vm_mask, tmp)) - local = 1; - - break; - default: - panic("flush_tlb_page: invalid region 0x%016lx", vmaddr); - - } - - if (!pgd_none(*pgd)) { - pmd = pmd_offset(pgd, vmaddr); - if (pmd_present(*pmd)) { - ptep = pte_offset_kernel(pmd, vmaddr); - /* Check if HPTE might exist and flush it if so */ - pte = __pte(pte_update(ptep, _PAGE_HPTEFLAGS, 0)); - if ( pte_val(pte) & _PAGE_HASHPTE ) { - flush_hash_page(context, vmaddr, pte, local); - } - } - WARN_ON(pmd_hugepage(*pmd)); - } -} - -struct ppc64_tlb_batch ppc64_tlb_batch[NR_CPUS]; - -void -__flush_tlb_range(struct mm_struct *mm, unsigned long start, unsigned long end) -{ - pgd_t *pgd; - pmd_t *pmd; - pte_t *ptep; - pte_t pte; - unsigned long pgd_end, pmd_end; - unsigned long context = 0; - struct ppc64_tlb_batch *batch = &ppc64_tlb_batch[smp_processor_id()]; - unsigned long i = 0; - int local = 0; - cpumask_t tmp; - - switch(REGION_ID(start)) { - case VMALLOC_REGION_ID: - pgd = pgd_offset_k(start); - break; - case IO_REGION_ID: - pgd = pgd_offset_i(start); - break; - case USER_REGION_ID: - pgd = 
pgd_offset(mm, start); - context = mm->context; - - /* XXX are there races with checking cpu_vm_mask? - Anton */ - tmp = cpumask_of_cpu(smp_processor_id()); - if (cpus_equal(mm->cpu_vm_mask, tmp)) - local = 1; - - break; - default: - panic("flush_tlb_range: invalid region for start (%016lx) and end (%016lx)\n", start, end); - } - - do { - pgd_end = (start + PGDIR_SIZE) & PGDIR_MASK; - if (pgd_end > end) - pgd_end = end; - if (!pgd_none(*pgd)) { - pmd = pmd_offset(pgd, start); - do { - pmd_end = (start + PMD_SIZE) & PMD_MASK; - if (pmd_end > end) - pmd_end = end; - if (pmd_present(*pmd)) { - ptep = pte_offset_kernel(pmd, start); - do { - if (pte_val(*ptep) & _PAGE_HASHPTE) { - pte = __pte(pte_update(ptep, _PAGE_HPTEFLAGS, 0)); - if (pte_val(pte) & _PAGE_HASHPTE) { - batch->pte[i] = pte; - batch->addr[i] = start; - i++; - if (i == PPC64_TLB_BATCH_NR) { - flush_hash_range(context, i, local); - i = 0; - } - } - } - start += PAGE_SIZE; - ++ptep; - } while (start < pmd_end); - } else { - WARN_ON(pmd_hugepage(*pmd)); - start = pmd_end; - } - ++pmd; - } while (start < pgd_end); - } else { - start = pgd_end; - } - ++pgd; - } while (start < end); - - if (i) - flush_hash_range(context, i, local); -} +#endif void free_initmem(void) { ===== arch/ppc64/mm/tlb.c 1.54 vs edited ===== --- /dev/null 2004-01-07 16:07:03.000000000 +1100 +++ edited/arch/ppc64/mm/tlb.c 2003-12-28 16:19:25.000000000 +1100 @@ -0,0 +1,84 @@ +/* + * This file contains the routines for flushing entries from the + * TLB and MMU hash table. + * + * Derived from arch/ppc64/mm/init.c: + * Copyright (C) 1995-1996 Gary Thomas (gdt at linuxppc.org) + * + * Modifications by Paul Mackerras (PowerMac) (paulus at cs.anu.edu.au) + * and Cort Dougan (PReP) (cort at cs.nmt.edu) + * Copyright (C) 1996 Paul Mackerras + * Amiga/APUS changes by Jesper Skov (jskov at cygnus.co.uk). 
+ * + * Derived from "arch/i386/mm/init.c" + * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds + * + * Dave Engebretsen + * Rework for PPC64 port. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ +#include +#include +#include +#include +#include +#include +#include + +DEFINE_PER_CPU(struct ppc64_tlb_batch, ppc64_tlb_batch); + +/* + * Update the MMU hash table to correspond with a change to + * a Linux PTE. If wrprot is true, it is permissible to + * change the existing HPTE to read-only rather than removing it + * (if we remove it we should clear the _PTE_HPTEFLAGS bits). + */ +void hpte_update(pte_t *ptep, unsigned long pte, int wrprot) +{ + struct page *ptepage; + struct mm_struct *mm; + unsigned long addr; + int i; + unsigned long context = 0; + struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch); + + ptepage = virt_to_page(ptep); + mm = (struct mm_struct *) ptepage->mapping; + addr = ptepage->index + (((unsigned long)ptep & ~PAGE_MASK) << 9); + if (REGION_ID(addr) == USER_REGION_ID) + context = mm->context; + i = batch->index; + if (unlikely(i != 0 && context != batch->context)) { + flush_tlb_pending(); + i = 0; + } + if (i == 0) { + batch->context = context; + batch->mm = mm; + } + batch->pte[i] = __pte(pte); + batch->addr[i] = addr; + batch->index = ++i; + if (i >= PPC64_TLB_BATCH_NR) + flush_tlb_pending(); +} + +void __flush_tlb_pending(struct ppc64_tlb_batch *batch) +{ + int i; + int local = 0; + + i = batch->index; + if (batch->mm->cpu_vm_mask == (1 << smp_processor_id())) + local = 1; + if (i == 1) + flush_hash_page(batch->context, batch->addr[0], batch->pte[0], + local); + else + flush_hash_range(batch->context, i, local); + batch->index = 0; +}===== include/asm-ppc64/pgtable.h 1.30 vs edited ===== --- 
1.30/include/asm-ppc64/pgtable.h Wed Dec 17 16:08:23 2003 +++ edited/include/asm-ppc64/pgtable.h Fri Jan 9 14:39:18 2004 @@ -11,6 +11,7 @@ #include /* For TASK_SIZE */ #include #include +#include #endif /* __ASSEMBLY__ */ /* PMD_SHIFT determines what a second-level page table entry can map */ @@ -288,71 +289,93 @@ /* Atomic PTE updates */ -static inline unsigned long pte_update( pte_t *p, unsigned long clr, - unsigned long set ) +static inline unsigned long pte_update(pte_t *p, unsigned long clr) { unsigned long old, tmp; - + __asm__ __volatile__( "1: ldarx %0,0,%3 # pte_update\n\ - andi. %1,%0,%7\n\ + andi. %1,%0,%6\n\ bne- 1b \n\ andc %1,%0,%4 \n\ - or %1,%1,%5 \n\ stdcx. %1,0,%3 \n\ bne- 1b" : "=&r" (old), "=&r" (tmp), "=m" (*p) - : "r" (p), "r" (clr), "r" (set), "m" (*p), "i" (_PAGE_BUSY) + : "r" (p), "r" (clr), "m" (*p), "i" (_PAGE_BUSY) : "cc" ); return old; } +/* PTE updating functions */ +extern void hpte_update(pte_t *ptep, unsigned long pte, int wrprot); + static inline int ptep_test_and_clear_young(pte_t *ptep) { - return (pte_update(ptep, _PAGE_ACCESSED, 0) & _PAGE_ACCESSED) != 0; + unsigned long old; + + old = pte_update(ptep, _PAGE_ACCESSED | _PAGE_HPTEFLAGS); + if (old & _PAGE_HASHPTE) { + hpte_update(ptep, old, 0); + flush_tlb_pending(); /* XXX generic code doesn't flush */ + } + return (old & _PAGE_ACCESSED) != 0; } static inline int ptep_test_and_clear_dirty(pte_t *ptep) { - return (pte_update(ptep, _PAGE_DIRTY, 0) & _PAGE_DIRTY) != 0; -} + unsigned long old; -static inline pte_t ptep_get_and_clear(pte_t *ptep) -{ - return __pte(pte_update(ptep, ~_PAGE_HPTEFLAGS, 0)); + old = pte_update(ptep, _PAGE_DIRTY); + if ((~old & (_PAGE_HASHPTE | _PAGE_RW | _PAGE_DIRTY)) == 0) + hpte_update(ptep, old, 1); + return (old & _PAGE_DIRTY) != 0; } static inline void ptep_set_wrprotect(pte_t *ptep) { - pte_update(ptep, _PAGE_RW, 0); + unsigned long old; + + old = pte_update(ptep, _PAGE_RW); + if ((~old & (_PAGE_HASHPTE | _PAGE_RW | _PAGE_DIRTY)) == 0) + 
hpte_update(ptep, old, 1); } -static inline void ptep_mkdirty(pte_t *ptep) +static inline pte_t ptep_get_and_clear(pte_t *ptep) { - pte_update(ptep, 0, _PAGE_DIRTY); + unsigned long old = pte_update(ptep, ~0UL); + + if (old & _PAGE_HASHPTE) + hpte_update(ptep, old, 0); + return __pte(old); } -/* - * Macro to mark a page protection value as "uncacheable". - */ -#define pgprot_noncached(prot) (__pgprot(pgprot_val(prot) | _PAGE_NO_CACHE | _PAGE_GUARDED)) +static inline void pte_clear(pte_t * ptep) +{ + unsigned long old = pte_update(ptep, ~0UL); -#define pte_same(A,B) (((pte_val(A) ^ pte_val(B)) & ~_PAGE_HPTEFLAGS) == 0) + if (old & _PAGE_HASHPTE) + hpte_update(ptep, old, 0); +} /* * set_pte stores a linux PTE into the linux page table. - * On machines which use an MMU hash table we avoid changing the - * _PAGE_HASHPTE bit. */ static inline void set_pte(pte_t *ptep, pte_t pte) { - pte_update(ptep, ~_PAGE_HPTEFLAGS, pte_val(pte) & ~_PAGE_HPTEFLAGS); + /* XXX is there a better way to handle this? */ + if (pte_present(*ptep)) + pte_clear(ptep); + if (pte_present(pte)) + flush_tlb_pending(); + *ptep = __pte(pte_val(pte)) & ~_PAGE_HPTEFLAGS; } -static inline void pte_clear(pte_t * ptep) -{ - pte_update(ptep, ~_PAGE_HPTEFLAGS, 0); -} +/* + * Macro to mark a page protection value as "uncacheable". 
+ */ +#define pgprot_noncached(prot) (__pgprot(pgprot_val(prot) | _PAGE_NO_CACHE | _PAGE_GUARDED)) + +#define pte_same(A,B) (((pte_val(A) ^ pte_val(B)) & ~_PAGE_HPTEFLAGS) == 0) extern unsigned long ioremap_bot, ioremap_base; ===== include/asm-ppc64/tlb.h 1.10 vs edited ===== --- 1.10/include/asm-ppc64/tlb.h Wed Dec 17 15:51:16 2003 +++ edited/include/asm-ppc64/tlb.h Fri Jan 9 14:39:18 2004 @@ -12,11 +12,9 @@ #ifndef _PPC64_TLB_H #define _PPC64_TLB_H -#include #include -#include -#include +struct mmu_gather; static inline void tlb_flush(struct mmu_gather *tlb); /* Avoid pulling in another include just for this */ @@ -29,66 +27,13 @@ #define tlb_start_vma(tlb, vma) do { } while (0) #define tlb_end_vma(tlb, vma) do { } while (0) -/* Should make this at least as large as the generic batch size, but it - * takes up too much space */ -#define PPC64_TLB_BATCH_NR 192 - -struct ppc64_tlb_batch { - unsigned long index; - pte_t pte[PPC64_TLB_BATCH_NR]; - unsigned long addr[PPC64_TLB_BATCH_NR]; - unsigned long vaddr[PPC64_TLB_BATCH_NR]; -}; - -extern struct ppc64_tlb_batch ppc64_tlb_batch[NR_CPUS]; - -static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep, - unsigned long address) -{ - int cpu = smp_processor_id(); - struct ppc64_tlb_batch *batch = &ppc64_tlb_batch[cpu]; - unsigned long i = batch->index; - pte_t pte; - cpumask_t local_cpumask = cpumask_of_cpu(cpu); - - if (pte_val(*ptep) & _PAGE_HASHPTE) { - pte = __pte(pte_update(ptep, _PAGE_HPTEFLAGS, 0)); - if (pte_val(pte) & _PAGE_HASHPTE) { - - batch->pte[i] = pte; - batch->addr[i] = address; - i++; - - if (i == PPC64_TLB_BATCH_NR) { - int local = 0; - - if (cpus_equal(tlb->mm->cpu_vm_mask, local_cpumask)) - local = 1; - - flush_hash_range(tlb->mm->context, i, local); - i = 0; - } - } - } - - batch->index = i; -} +#define __tlb_remove_tlb_entry(tlb, pte, address) do { } while (0) extern void pte_free_finish(void); static inline void tlb_flush(struct mmu_gather *tlb) { - int cpu = 
smp_processor_id(); - struct ppc64_tlb_batch *batch = &ppc64_tlb_batch[cpu]; - int local = 0; - cpumask_t local_cpumask = cpumask_of_cpu(smp_processor_id()); - - if (cpus_equal(tlb->mm->cpu_vm_mask, local_cpumask)) - local = 1; - - flush_hash_range(tlb->mm->context, batch->index, local); - batch->index = 0; - + flush_tlb_pending(); pte_free_finish(); } ===== include/asm-ppc64/tlbflush.h 1.4 vs edited ===== --- 1.4/include/asm-ppc64/tlbflush.h Fri Jun 7 18:21:41 2002 +++ edited/include/asm-ppc64/tlbflush.h Fri Jan 9 14:39:18 2004 @@ -1,10 +1,6 @@ #ifndef _PPC64_TLBFLUSH_H #define _PPC64_TLBFLUSH_H -#include -#include -#include - /* * TLB flushing: * @@ -15,21 +11,37 @@ * - flush_tlb_pgtables(mm, start, end) flushes a range of page tables */ -extern void flush_tlb_mm(struct mm_struct *mm); -extern void flush_tlb_page(struct vm_area_struct *vma, unsigned long vmaddr); -extern void __flush_tlb_range(struct mm_struct *mm, - unsigned long start, unsigned long end); -#define flush_tlb_range(vma, start, end) \ - __flush_tlb_range(vma->vm_mm, start, end) +#include +#include + +#define PPC64_TLB_BATCH_NR 192 -#define flush_tlb_kernel_range(start, end) \ - __flush_tlb_range(&init_mm, (start), (end)) +struct mm_struct; +struct ppc64_tlb_batch { + unsigned long index; + unsigned long context; + struct mm_struct *mm; + pte_t pte[PPC64_TLB_BATCH_NR]; + unsigned long addr[PPC64_TLB_BATCH_NR]; + unsigned long vaddr[PPC64_TLB_BATCH_NR]; +}; +DECLARE_PER_CPU(struct ppc64_tlb_batch, ppc64_tlb_batch); -static inline void flush_tlb_pgtables(struct mm_struct *mm, - unsigned long start, unsigned long end) +extern void __flush_tlb_pending(struct ppc64_tlb_batch *batch); + +static inline void flush_tlb_pending(void) { - /* PPC has hw page tables. 
*/ + struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch); + + if (batch->index) + __flush_tlb_pending(batch); } + +#define flush_tlb_mm(mm) flush_tlb_pending() +#define flush_tlb_page(vma, addr) flush_tlb_pending() +#define flush_tlb_range(vma, start, end) flush_tlb_pending() +#define flush_tlb_kernel_range(start, end) flush_tlb_pending() +#define flush_tlb_pgtables(mm, start, end) do { } while (0) extern void flush_hash_page(unsigned long context, unsigned long ea, pte_t pte, int local); ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From gregkh at us.ibm.com Sat Jan 10 05:29:50 2004 From: gregkh at us.ibm.com (Greg KH) Date: Fri, 9 Jan 2004 10:29:50 -0800 Subject: [PATCH][2.6] set up vio_dev's driver field In-Reply-To: References: <3FFB3A47.8090404@us.ltcfwd.linux.ibm.com> Message-ID: <20040109182950.GA8858@us.ibm.com> On Tue, Jan 06, 2004 at 05:26:09PM -0600, Hollis Blanchard wrote: > On Jan 6, 2004, at 4:44 PM, Linda Xie wrote: > >diff -Nru a/arch/ppc64/kernel/vio.c b/arch/ppc64/kernel/vio.c > >--- a/arch/ppc64/kernel/vio.c Tue Jan 6 16:29:17 2004 > >+++ b/arch/ppc64/kernel/vio.c Tue Jan 6 16:29:17 2004 > >@@ -189,7 +189,7 @@ > > const struct vio_device_id* id; > > > > id = vio_match_device(driver->id_table, dev); > >- if (id && (0 < driver->probe(dev, id))) { > >+ if (id && (0 == driver->probe(dev, id))) { > > printk(KERN_DEBUG "%s: driver %s/%s took device > > %p\n", > > __FUNCTION__, id->type, id->compat, dev); > > dev->driver = driver; > > You're right that the drivers return 0 on success, but all this code is > about to be replaced with 2.6 driver model code anyway. The driver > model gives us basic sysfs presence and list locking for free. > > Could you test this patch instead? It should require no driver changes. > (I don't think the patch will be whitespace-wrapped but let me know.) 
It is wrapped :( > Comments from Greg KH also welcome, though Linda's mail prompted me to > send this out before I've double-checked everything. :) In particular I > had to create a static struct device to act as the VIO bus device, > since the virtual bus doesn't have an actual root struct device (unlike > PCI and USB)... Ick, ick, ick. _Please_ never make a struct device static. Bad things will happen if you get your reference counting wrong. Hm, actually it looks like it will get messed up if the release() function gets called for it. Are you doing this to get a "parent" device to hang everything else off of? (I have no idea what "vio" is, is it a bus?) Few comments on the patch below: > +int vio_register_driver(struct vio_driver *viodrv) > { > - int count = 0; > - struct vio_dev *dev; > - > - printk(KERN_DEBUG "%s: driver %s/%s registering\n", __FUNCTION__, > - drv->id_table[0].type, drv->id_table[0].type); > + printk(KERN_DEBUG "%s: driver %s registering\n", __FUNCTION__, > + viodrv->name); > > - /* find matching devices not already claimed by other drivers and > pass > - * them to probe() */ > - list_for_each_entry(dev, &vio_bus.devices, devices_list) { > - const struct vio_device_id* id; > - > - if (dev->driver) > - continue; /* this device is already owned */ > - > - id = vio_match_device(drv->id_table, dev); > - if (drv && id) { > - if (0 == drv->probe(dev, id)) { > - printk(KERN_DEBUG " took device %p\n", dev); > - dev->driver = drv; > - count++; > - } > - } > - } > + /* fill in 'struct device' fields */ > + viodrv->driver.name = viodrv->name; > + viodrv->driver.bus = &vio_bus_type; > + viodrv->driver.probe = vio_bus_probe; > + viodrv->driver.remove = vio_bus_remove; Don't you mean "driver" structure in that comment? > -int vio_unregister_driver(struct vio_driver *driver) > +int vio_unregister_driver(struct vio_driver *viodrv) Why return anything here? Who cares at unregister time? Are you going to fail something if it doesn't happen? 
It always will be unregistered :) > +static int __init > vio_bus_init(void) > { > struct device_node *node_vroot, *node_vdev; > + int err; > > - INIT_LIST_HEAD(&vio_bus.devices); > + err = bus_register(&vio_bus_type); > + if (err) > + return err; > + > + /* the parent of all vio devices */ > + memset(&vio_bus_device, 0, sizeof(struct device)); > + strcpy(vio_bus_device.bus_id, "vio"); > + device_register(&vio_bus_device); Oops, you just set up the "release" function of your static device to call kfree when it is unregistered. Not good :( If you don't have a place on the system bus, just keep the first device's parent as NULL. That way it will be placed at the top of the device directory properly. This is how pci busses work. > +static inline struct vio_driver *to_vio_driver(struct device_driver > *drv) > +{ > + return container_of(drv, struct vio_driver, driver); > +} > + It's ok to make that a macro instead. It doesn't have to be an inline function. > +/* taken from pci_module_init() */ > static inline int vio_module_init(struct vio_driver *drv) > { > - int rc = vio_register_driver (drv); > + int rc = vio_register_driver(drv); > > - if (rc > 0) > - return 0; > + if (rc > 0) > + return 0; > > - /* iff CONFIG_HOTPLUG and built into kernel, we should > - * leave the driver around for future hotplug events. > - * For the module case, a hotplug daemon of some sort > - * should load a module in response to an insert event. > + /* iff CONFIG_HOTPLUG and built into kernel, we should > + * leave the driver around for future hotplug events. > + * For the module case, a hotplug daemon of some sort > + * should load a module in response to an insert event. 
*/ > #if defined(CONFIG_HOTPLUG) && !defined(MODULE) > - if (rc == 0) > - return 0; > + if (rc == 0) > + return 0; > #else > - if (rc == 0) > - rc = -ENODEV; > + if (rc == 0) > + rc = -ENODEV; > #endif > > - /* if we get here, we need to clean up vio driver instance > - * and return some sort of error */ > + /* if we get here, we need to clean up vio driver instance > + * and return some sort of error */ > + vio_unregister_driver(drv); > > - return rc; > + return rc; > } Eeek! I want to fix that code in pci_module_init() so it doesn't do this at all. Please don't copy that horrible function. Just register the driver with a call to vio_register_driver() and drop the whole vio_module_init() completely. I'll be doing that for pci soon, and there's no reason you want to duplicate this broken logic (you always want your module probe to succeed, for lots of reasons...) Hope this helps, greg k-h From gregkh at us.ibm.com Sat Jan 10 05:30:30 2004 From: gregkh at us.ibm.com (Greg KH) Date: Fri, 9 Jan 2004 10:30:30 -0800 Subject: [PATCH][2.6] set up vio_dev's driver field In-Reply-To: References: Message-ID: <20040109183030.GB8858@us.ibm.com> On Wed, Jan 07, 2004 at 03:53:29PM -0600, Hollis Blanchard wrote: > > Of course the real question is why device_register() failed... I'll try > to find a partition to debug on. Commonly this happens if you try to register two devices with the same name. See if that's the case in your drivers. thanks, greg k-h ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From hollisb at us.ibm.com Sat Jan 10 06:18:46 2004 From: hollisb at us.ibm.com (Hollis Blanchard) Date: Fri, 9 Jan 2004 13:18:46 -0600 Subject: [PATCH][2.6] set up vio_dev's driver field In-Reply-To: <20040109182950.GA8858@us.ibm.com> References: <3FFB3A47.8090404@us.ltcfwd.linux.ibm.com> <20040109182950.GA8858@us.ibm.com> Message-ID: On Jan 9, 2004, at 12:29 PM, Greg KH wrote: > On Tue, Jan 06, 2004 at 05:26:09PM -0600, Hollis Blanchard wrote: >> Comments from Greg KH also welcome, though Linda's mail prompted me to >> send this out before I've double-checked everything. :) In particular >> I >> had to create a static struct device to act as the VIO bus device, >> since the virtual bus doesn't have an actual root struct device >> (unlike >> PCI and USB)... > > Ick, ick, ick. _Please_ never make a struct device static. Bad things > will happen if you get your reference counting wrong. Hm, actually it > looks like it will get messed up if the release() function gets called > for it. > > Are you doing this to get a "parent" device to hang everything else off > of? Yes, exactly. Without such a parent device, the devices will be listed at the same level as PCI busses. I think they could be more organized than that; Open Firmware for example creates a "vdevice" directory at the same level as the PCI busses. > (I have no idea what "vio" is, is it a bus?) "vio" means "virtual IO", such as virtual ethernet or virtual SCSI. These can also be hotplugged... > Few comments on the patch below: > >> + /* fill in 'struct device' fields */ >> + viodrv->driver.name = viodrv->name; >> + viodrv->driver.bus = &vio_bus_type; >> + viodrv->driver.probe = vio_bus_probe; >> + viodrv->driver.remove = vio_bus_remove; > > Don't you mean "driver" structure in that comment? Yup, thanks. >> -int vio_unregister_driver(struct vio_driver *driver) >> +int vio_unregister_driver(struct vio_driver *viodrv) > > Why return anything here? Who cares at unregister time? 
Are you going > to fail something if it doesn't happen? It always will be unregistered > :) Habit I guess. Almost all functions can fail... Here, what if the driver is in use when somebody tries to unregister it? Right now, vio_unregister_driver() is very simple, but maybe in the future it will become more complicated, and then we'll wish we had that return code and didn't have to modify callers. So habit, and planning for the future. >> +static int __init >> vio_bus_init(void) >> { >> struct device_node *node_vroot, *node_vdev; >> + int err; >> >> - INIT_LIST_HEAD(&vio_bus.devices); >> + err = bus_register(&vio_bus_type); >> + if (err) >> + return err; >> + >> + /* the parent of all vio devices */ >> + memset(&vio_bus_device, 0, sizeof(struct device)); >> + strcpy(vio_bus_device.bus_id, "vio"); >> + device_register(&vio_bus_device); > > Oops, you just set up the "release" function of your static device to > call kfree when it is unregistered. Not good :( Hmm, how would it become unregistered? > If you don't have a place on the system bus, just keep the first > device's parent as NULL. That way it will be placed at the top of the > device directory properly. This is how pci busses work. There could potentially be dozens of virtual devices (console, ethernet, scsi). Right now the "devices" sysfs directory contains only directories (pci*, legacy, system). Without a fake VIO parent device, there will be no VIO analog to the PCI busses here, and all virtual devices (with names like "vty@30000000" and "llan@3000010") will appear in this directory. Is that what you want? Would that impact tools like lsbus? >> +/* taken from pci_module_init() */ >> static inline int vio_module_init(struct vio_driver *drv) > Eeek! I want to fix that code in pci_module_init() so it doesn't do > this at all. Please don't copy that horrible function. Just register > the driver with a call to vio_register_driver() and drop the whole > vio_module_init() completely. 
I'll be doing that for pci soon, and > there's no reason you want to duplicate this broken logic (you always > want your module probe to succeed, for lots of reasons...) Good to know. PCI has been the model for the VIO code. I figured there was a reason for this function, even if I didn't understand it... :) -- Hollis Blanchard IBM Linux Technology Center From gregkh at us.ibm.com Sat Jan 10 06:41:00 2004 From: gregkh at us.ibm.com (Greg KH) Date: Fri, 9 Jan 2004 11:41:00 -0800 Subject: [PATCH][2.6] set up vio_dev's driver field In-Reply-To: References: <3FFB3A47.8090404@us.ltcfwd.linux.ibm.com> <20040109182950.GA8858@us.ibm.com> Message-ID: <20040109194100.GA7512@us.ibm.com> On Fri, Jan 09, 2004 at 01:18:46PM -0600, Hollis Blanchard wrote: > On Jan 9, 2004, at 12:29 PM, Greg KH wrote: > >On Tue, Jan 06, 2004 at 05:26:09PM -0600, Hollis Blanchard wrote: > >>Comments from Greg KH also welcome, though Linda's mail prompted me to > >>send this out before I've double-checked everything. :) In particular > >>I > >>had to create a static struct device to act as the VIO bus device, > >>since the virtual bus doesn't have an actual root struct device > >>(unlike > >>PCI and USB)... > > > >Ick, ick, ick. _Please_ never make a struct device static. Bad things > >will happen if you get your reference counting wrong. Hm, actually it > >looks like it will get messed up if the release() function gets called > >for it. > > > >Are you doing this to get a "parent" device to hang everything else off > >of? > > Yes, exactly. Without such a parent device, the devices will be listed > at the same level as PCI busses. I think they could be more organized > than that; Open Firmware for example creates a "vdevice" directory at > the same level as the PCI busses. Ah, ok. See below for more... > >(I have no idea what "vio" is, is it a bus?) > > "vio" means "virtual IO", such as virtual ethernet or virtual SCSI. 
> These can also be hotplugged... But can your "parent" device ever go away? Can the vio code be built as a module? > >>-int vio_unregister_driver(struct vio_driver *driver) > >>+int vio_unregister_driver(struct vio_driver *viodrv) > > > >Why return anything here? Who cares at unregister time? Are you going > >to fail something if it doesn't happen? It always will be unregistered > >:) > > Habit I guess. Almost all functions can fail... Here, what if the > driver is in use when somebody tried to unregister it? Right now, > vio_unregister_driver() is very simple, but maybe in the future it will > become more complicated, and then we'll wish we had that return code > and didn't have to modify callers. So habit, and planning for the > future. But my point is, what could you ever do if unregister fails? > >>+static int __init > >> vio_bus_init(void) > >> { > >> struct device_node *node_vroot, *node_vdev; > >>+ int err; > >> > >>- INIT_LIST_HEAD(&vio_bus.devices); > >>+ err = bus_register(&vio_bus_type); > >>+ if (err) > >>+ return err; > >>+ > >>+ /* the parent of all vio devices */ > >>+ memset(&vio_bus_device, 0, sizeof(struct device)); > >>+ strcpy(vio_bus_device.bus_id, "vio"); > >>+ device_register(&vio_bus_device); > > > >Oops, you just set up the "release" function of your static device to > >call kfree when it is unregistered. Not good :( > > Hmm, how would it become unregistered? You never unload this code? It can't be a module? You never clean up all of your devices at shutdown time? If you do, this needs to be dynamically created. > >If you don't have a place on the system bus, just keep the first > >device's parent as NULL. That way it will be placed at the top of the > >device directory properly. This is how pci busses work. > > There could potentially be dozens of virtual devices (console, > ethernet, scsi). Right now the "devices" sysfs directory contains only > directories (pci*, legacy, system). 
Without a fake VIO parent device, > there will be no VIO analog to the PCI busses here, and all virtual > devices (with names like "vty at 30000000" and "llan at 3000010") will appear > in this directory. Is that what you want? Would that impact tools like > lsbus? No, a "fake" parent device is fine in this case. But what kind of devices do these VIO devices hang off of? They should have some kind of addressable device as a parent, right? But if not, then yes, a "fake" parent device is ok. thanks, greg k-h ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From agl at us.ibm.com Sat Jan 10 08:19:49 2004 From: agl at us.ibm.com (Adam Litke) Date: 09 Jan 2004 13:19:49 -0800 Subject: [RFC] implicit hugetlb pages (new patch-set) Message-ID: <1073683188.1297.105.camel@agtpad> I have been working with implicit hugetlb stuff lately and have added 2 new features / fixes to the original code: * Safe fallback when implicit allocations fail * Dynamic hugetlb area resizing for 32-bit address spaces The code is broken out into 3 smaller patches: hugetlb_implicit, mmu_context_to_struct, hugetlb_dyn_as which I will include and describe in reply to this email. This has been tested on SpecJBB with a 32-bit JVM and results have been good so far. Can anyone think of cases where this code will break badly? I have tried to consider some cases and I came up with these: * Contention between the stack area and the hugetlb area. * I currently don't shrink the hugetlb region, should I? * Are there any problems with extending the mmu_context to a struct? * Any other things I didn't think of Thanks. -- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From agl at us.ibm.com Sat Jan 10 08:27:20 2004 From: agl at us.ibm.com (Adam Litke) Date: 09 Jan 2004 13:27:20 -0800 Subject: [RFC] implicit hugetlb pages (hugetlb_implicit) In-Reply-To: <1073683188.1297.105.camel@agtpad> References: <1073683188.1297.105.camel@agtpad> Message-ID: <1073683640.1297.111.camel@agtpad> hugetlb_implicit (2.6.0): This patch includes the anonymous mmap work from Dave Gibson (right?) as well as my shared mem support. I have added safe fallback for implicit allocations. This patch uses a fixed address space range of 80000000 - c0000000 for huge pages. -- snip -- diff -purN linux-2.6.0/fs/hugetlbfs/inode.c linux-2.6.0-implicit/fs/hugetlbfs/inode.c --- linux-2.6.0/fs/hugetlbfs/inode.c 2003-12-17 18:59:36.000000000 -0800 +++ linux-2.6.0-implicit/fs/hugetlbfs/inode.c 2004-01-08 16:19:31.000000000 -0800 @@ -26,12 +26,17 @@ #include #include #include +#include #include +#include /* some random number */ #define HUGETLBFS_MAGIC 0x958458f6 +extern int mmap_use_hugepages; +extern int mmap_hugepages_map_sz; + static struct super_operations hugetlbfs_ops; static struct address_space_operations hugetlbfs_aops; struct file_operations hugetlbfs_file_operations; @@ -82,7 +87,7 @@ static int hugetlbfs_file_mmap(struct fi unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags); #else -static unsigned long +unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags) { @@ -115,6 +120,65 @@ hugetlb_get_unmapped_area(struct file *f } #endif +int mmap_hugetlb_implicit(unsigned long len) +{ + /* Are we enabled? */ + if (!mmap_use_hugepages) + return 0; + /* Must be HPAGE aligned */ + if (len & ~HPAGE_MASK) + return 0; + /* Are we under the minimum size? 
*/ + if (mmap_hugepages_map_sz + && len < (mmap_hugepages_map_sz << 20)) + return 0; + /* Do we have enough free huge pages? */ + if (!is_hugepage_mem_enough(len)) + return 0; + + return 1; +} + +unsigned long +try_hugetlb_get_unmapped_area(struct file *file, unsigned long addr, + unsigned long len, unsigned long pgoff, unsigned long *flags) +{ + long pre_error = 0; + + /* Check some prerequisites */ + if (!capable(CAP_IPC_LOCK)) + pre_error = -EPERM; + else if (file) + pre_error = -EINVAL; + + /* Explicit requests for huge pages are allowed to return errors */ + if (*flags & MAP_HUGETLB) { + if (pre_error) + return pre_error; + return hugetlb_get_unmapped_area(NULL, addr, len, pgoff, *flags); + } + + /* + * When implicit request fails, return 0 so we can + * retry later with regular pages. + */ + if (mmap_hugetlb_implicit(len)) { + if (pre_error) + goto out; + addr = hugetlb_get_unmapped_area(NULL, addr, len, pgoff, *flags); + if (IS_ERR((void *)addr)) + goto out; + else { + *flags |= MAP_HUGETLB; + return addr; + } + } + +out: + *flags &= ~MAP_HUGETLB; + return 0; +} + /* * Read a page. Again trivial. If it didn't already exist * in the page cache, it is zero-filled. 
diff -purN linux-2.6.0/include/asm-i386/mman.h linux-2.6.0-implicit/include/asm-i386/mman.h --- linux-2.6.0/include/asm-i386/mman.h 2003-12-17 18:58:15.000000000 -0800 +++ linux-2.6.0-implicit/include/asm-i386/mman.h 2004-01-08 16:19:31.000000000 -0800 @@ -11,6 +11,11 @@ #define MAP_SHARED 0x01 /* Share changes */ #define MAP_PRIVATE 0x02 /* Changes are private */ +#ifdef CONFIG_HUGETLB_PAGE +#define MAP_HUGETLB 0x04 /* Use huge pages */ +#else +#define MAP_HUGETLB 0x00 +#endif #define MAP_TYPE 0x0f /* Mask for type of mapping */ #define MAP_FIXED 0x10 /* Interpret addr exactly */ #define MAP_ANONYMOUS 0x20 /* don't use a file */ diff -purN linux-2.6.0/include/asm-ppc64/mman.h linux-2.6.0-implicit/include/asm-ppc64/mman.h --- linux-2.6.0/include/asm-ppc64/mman.h 2003-12-17 18:58:47.000000000 -0800 +++ linux-2.6.0-implicit/include/asm-ppc64/mman.h 2004-01-08 16:19:31.000000000 -0800 @@ -18,6 +18,11 @@ #define MAP_SHARED 0x01 /* Share changes */ #define MAP_PRIVATE 0x02 /* Changes are private */ +#ifdef CONFIG_HUGETLB_PAGE +#define MAP_HUGETLB 0x04 +#else +#define MAP_HUGETLB 0x0 +#endif #define MAP_TYPE 0x0f /* Mask for type of mapping */ #define MAP_FIXED 0x10 /* Interpret addr exactly */ #define MAP_ANONYMOUS 0x20 /* don't use a file */ diff -purN linux-2.6.0/include/linux/hugetlb.h linux-2.6.0-implicit/include/linux/hugetlb.h --- linux-2.6.0/include/linux/hugetlb.h 2003-12-17 18:58:49.000000000 -0800 +++ linux-2.6.0-implicit/include/linux/hugetlb.h 2004-01-08 16:19:31.000000000 -0800 @@ -118,4 +118,9 @@ static inline void set_file_hugepages(st #endif /* !CONFIG_HUGETLBFS */ +unsigned long +hugetlb_get_unmapped_area(struct file *, unsigned long, unsigned long, + unsigned long, unsigned long); + + #endif /* _LINUX_HUGETLB_H */ diff -purN linux-2.6.0/include/linux/mman.h linux-2.6.0-implicit/include/linux/mman.h --- linux-2.6.0/include/linux/mman.h 2003-12-17 18:58:15.000000000 -0800 +++ linux-2.6.0-implicit/include/linux/mman.h 2004-01-08 16:19:31.000000000 -0800 
@@ -58,6 +58,9 @@ calc_vm_flag_bits(unsigned long flags) return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) | _calc_vm_trans(flags, MAP_DENYWRITE, VM_DENYWRITE ) | _calc_vm_trans(flags, MAP_EXECUTABLE, VM_EXECUTABLE) | +#ifdef CONFIG_HUGETLB_PAGE + _calc_vm_trans(flags, MAP_HUGETLB, VM_HUGETLB ) | +#endif _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ); } diff -purN linux-2.6.0/include/linux/sysctl.h linux-2.6.0-implicit/include/linux/sysctl.h --- linux-2.6.0/include/linux/sysctl.h 2003-12-17 18:58:56.000000000 -0800 +++ linux-2.6.0-implicit/include/linux/sysctl.h 2004-01-08 16:19:31.000000000 -0800 @@ -127,6 +127,10 @@ enum KERN_PANIC_ON_OOPS=57, /* int: whether we will panic on an oops */ KERN_HPPA_PWRSW=58, /* int: hppa soft-power enable */ KERN_HPPA_UNALIGNED=59, /* int: hppa unaligned-trap enable */ + KERN_SHMUSEHUGEPAGES=60, /* int: back shm with huge pages */ + KERN_MMAPUSEHUGEPAGES=61, /* int: back anon mmap with huge pages */ + KERN_HPAGES_PER_FILE=62, /* int: max bigpages per file */ + KERN_HPAGES_MAP_SZ=63, /* int: min size (MB) of mapping */ }; diff -purN linux-2.6.0/ipc/shm.c linux-2.6.0-implicit/ipc/shm.c --- linux-2.6.0/ipc/shm.c 2003-12-17 18:58:49.000000000 -0800 +++ linux-2.6.0-implicit/ipc/shm.c 2004-01-08 16:19:31.000000000 -0800 @@ -32,6 +32,9 @@ #define shm_flags shm_perm.mode +extern int shm_use_hugepages; +extern int shm_hugepages_per_file; + static struct file_operations shm_file_operations; static struct vm_operations_struct shm_vm_ops; @@ -165,6 +168,31 @@ static struct vm_operations_struct shm_v .nopage = shmem_nopage, }; +#ifdef CONFIG_HUGETLBFS +int shm_with_hugepages(int shmflag, size_t size) +{ + /* flag specified explicitly */ + if (shmflag & SHM_HUGETLB) + return 1; + /* Are we disabled? */ + if (!shm_use_hugepages) + return 0; + /* Must be HPAGE aligned */ + if (size & ~HPAGE_MASK) + return 0; + /* Are we under the max per file? 
*/ + if ((size >> HPAGE_SHIFT) > shm_hugepages_per_file) + return 0; + /* Do we have enough free huge pages? */ + if (!is_hugepage_mem_enough(size)) + return 0; + + return 1; +} +#else +int shm_with_hugepages(int shmflag, size_t size) { return 0; } +#endif + static int newseg (key_t key, int shmflg, size_t size) { int error; @@ -194,8 +222,10 @@ static int newseg (key_t key, int shmflg return error; } - if (shmflg & SHM_HUGETLB) + if (shm_with_hugepages(shmflg, size)) { + shmflg |= SHM_HUGETLB; file = hugetlb_zero_setup(size); + } else { sprintf (name, "SYSV%08x", key); file = shmem_file_setup(name, size, VM_ACCOUNT); diff -purN linux-2.6.0/kernel/sysctl.c linux-2.6.0-implicit/kernel/sysctl.c --- linux-2.6.0/kernel/sysctl.c 2003-12-17 18:58:08.000000000 -0800 +++ linux-2.6.0-implicit/kernel/sysctl.c 2004-01-08 16:19:31.000000000 -0800 @@ -60,6 +60,8 @@ extern int cad_pid; extern int pid_max; extern int sysctl_lower_zone_protection; extern int min_free_kbytes; +extern int shm_use_hugepages, shm_hugepages_per_file; +extern int mmap_use_hugepages, mmap_hugepages_map_sz; /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */ static int maxolduid = 65535; @@ -579,6 +581,40 @@ static ctl_table kern_table[] = { .mode = 0644, .proc_handler = &proc_dointvec, }, +#ifdef CONFIG_HUGETLBFS + { + .ctl_name = KERN_SHMUSEHUGEPAGES, + .procname = "shm-use-hugepages", + .data = &shm_use_hugepages, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { + .ctl_name = KERN_MMAPUSEHUGEPAGES, + .procname = "mmap-use-hugepages", + .data = &mmap_use_hugepages, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { + .ctl_name = KERN_HPAGES_PER_FILE, + .procname = "shm-hugepages-per-file", + .data = &shm_hugepages_per_file, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { + .ctl_name = KERN_HPAGES_MAP_SZ, + .procname = "mmap-hugepages-min-mapping", + .data = 
&mmap_hugepages_map_sz, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, +#endif { .ctl_name = 0 } }; diff -purN linux-2.6.0/mm/mmap.c linux-2.6.0-implicit/mm/mmap.c --- linux-2.6.0/mm/mmap.c 2003-12-17 18:58:58.000000000 -0800 +++ linux-2.6.0-implicit/mm/mmap.c 2004-01-08 16:20:10.000000000 -0800 @@ -20,6 +20,7 @@ #include #include #include +#include #include #include @@ -59,6 +60,9 @@ EXPORT_SYMBOL(sysctl_overcommit_memory); EXPORT_SYMBOL(sysctl_overcommit_ratio); EXPORT_SYMBOL(vm_committed_space); +int mmap_use_hugepages = 0; +int mmap_hugepages_map_sz = 256; + /* * Requires inode->i_mapping->i_shared_sem */ @@ -473,7 +477,7 @@ unsigned long do_mmap_pgoff(struct file int correct_wcount = 0; int error; struct rb_node ** rb_link, * rb_parent; - unsigned long charged = 0; + unsigned long charged = 0, addr_save = addr; if (file) { if (!file->f_op || !file->f_op->mmap) @@ -501,8 +505,17 @@ unsigned long do_mmap_pgoff(struct file /* Obtain the address to map to. we verify (or select) it and ensure * that it represents a valid section of the address space. + * VM_HUGETLB will never appear in vm_flags when CONFIG_HUGETLB is + * unset. 
*/ - addr = get_unmapped_area(file, addr, len, pgoff, flags); +#ifdef CONFIG_HUGETLBFS + addr = try_hugetlb_get_unmapped_area(NULL, addr, len, pgoff, &flags); + if (IS_ERR((void *)addr)) + return addr; + else if (addr == 0) +#endif + addr = get_unmapped_area(file, addr_save, len, pgoff, flags); + if (addr & ~PAGE_MASK) return addr; @@ -566,6 +579,9 @@ unsigned long do_mmap_pgoff(struct file default: return -EINVAL; case MAP_PRIVATE: +#ifdef CONFIG_HUGETLBFS + case (MAP_PRIVATE|MAP_HUGETLB): +#endif vm_flags &= ~(VM_SHARED | VM_MAYSHARE); /* fall through */ case MAP_SHARED: @@ -650,10 +666,31 @@ munmap_back: error = file->f_op->mmap(file, vma); if (error) goto unmap_and_free_vma; - } else if (vm_flags & VM_SHARED) { - error = shmem_zero_setup(vma); - if (error) - goto free_vma; + } else if ((vm_flags & VM_SHARED) || (vm_flags & VM_HUGETLB)) { + if (!is_vm_hugetlb_page(vma)) { + error = shmem_zero_setup(vma); + if (error) + goto free_vma; + } else { + /* + * Presumably hugetlb_zero_setup() acquires a + * reference count for us. The difference + * between this and the shmem_zero_setup() + * case is that we can encounter an error + * _after_ allocating the file. The error + * path was adjusted slightly to fput() for us. + */ + struct file *new_file = hugetlb_zero_setup(len); + if (IS_ERR(new_file)) { + error = PTR_ERR(new_file); + goto free_vma; + } else { + vma->vm_file = new_file; + error = new_file->f_op->mmap(new_file, vma); + if (error) + goto unmap_and_free_vma; + } + } } /* We set VM_ACCOUNT in a shared mapping's vm_flags, to inform @@ -701,11 +738,21 @@ out: unmap_and_free_vma: if (correct_wcount) atomic_inc(&inode->i_writecount); - vma->vm_file = NULL; - fput(file); - /* Undo any partial mapping done by a device driver. */ + /* + * Undo any partial mapping done by a device driver. + * hugetlb wants to know the vma's file etc. so nuke + * the file afterward. 
+ */ zap_page_range(vma, vma->vm_start, vma->vm_end - vma->vm_start); + + /* + * vma->vm_file may be different from file in the hugetlb case. + */ + if (vma->vm_file) + fput(vma->vm_file); + vma->vm_file = NULL; + free_vma: kmem_cache_free(vm_area_cachep, vma); unacct_error: diff -purN linux-2.6.0/mm/shmem.c linux-2.6.0-implicit/mm/shmem.c --- linux-2.6.0/mm/shmem.c 2003-12-17 18:58:48.000000000 -0800 +++ linux-2.6.0-implicit/mm/shmem.c 2004-01-08 16:19:31.000000000 -0800 @@ -40,6 +40,29 @@ #include #include +int shm_use_hugepages; + +/* + * On 64bit archs the vmalloc area is very large, + * so we allocate the array in vmalloc on 64bit archs. + * + * Assuming 2M pages (x86 and x86-64) those default setting + * will allow up to 128G of bigpages in a single file on + * 64bit archs and 64G on 32bit archs using the max + * kmalloc size of 128k. So tweaking in practice is needed + * only to go past 128G of bigpages per file on 64bit archs. + * + * This sysctl is in page units (each page large BIGPAGE_SIZE). + */ +#ifdef CONFIG_HUGETLBFS +#if BITS_PER_LONG == 64 +int shm_hugepages_per_file = 128UL << (30 - HPAGE_SHIFT); +#else +int shm_hugepages_per_file = 131072 / sizeof(struct page *); +#endif +#endif + + /* This magic number is used in glibc for posix shared memory */ #define TMPFS_MAGIC 0x01021994 -- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From agl at us.ibm.com Sat Jan 10 08:29:38 2004 From: agl at us.ibm.com (Adam Litke) Date: 09 Jan 2004 13:29:38 -0800 Subject: [RFC] implicit hugetlb pages (mmu_context_to_struct) In-Reply-To: <1073683188.1297.105.camel@agtpad> References: <1073683188.1297.105.camel@agtpad> Message-ID: <1073683778.1298.115.camel@agtpad> mmu_context_to_struct (2.6.0): This patch converts the mmu_context variable to a structure. It is needed for the dynamic address space resizing patch. 
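For readers skimming the patch that follows, the core of the conversion can be sketched in a few lines of standalone C. This is only an illustration, not code from the patch: the helper names context_low_hpages() and context_id() are hypothetical, and uint64_t stands in for the kernel's unsigned long on ppc64. The struct layout and the masking with ~CONTEXT_LOW_HPAGES mirror what the diff does in mmu.h and destroy_context().

```c
#include <stdint.h>

/* Flag bit kept in the same word as the context id (value as in asm-ppc64/mmu.h). */
#define CONTEXT_LOW_HPAGES (UINT64_C(1) << 63)

/* Before the patch: typedef unsigned long mm_context_t;
 * After the patch the same word becomes the single member of a struct,
 * so further fields can be added later. */
typedef struct {
	uint64_t flags;		/* context id plus the CONTEXT_LOW_HPAGES bit */
} mm_context_t;

/* Hypothetical helper: is the low hugepage window open for this context? */
static inline int context_low_hpages(mm_context_t ctx)
{
	return (ctx.flags & CONTEXT_LOW_HPAGES) != 0;
}

/* Hypothetical helper: the context id with the flag bit masked off,
 * as destroy_context() does before returning the id to the queue. */
static inline uint64_t context_id(mm_context_t ctx)
{
	return ctx.flags & ~CONTEXT_LOW_HPAGES;
}
```

Every former use of mm->context as an integer becomes mm->context.flags, which is why the diff touches so many call sites while leaving behaviour unchanged.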
-- snip -- diff -purN linux-2.6.0/arch/ppc64/kernel/htab.c linux-2.6.0-context-struct/arch/ppc64/kernel/htab.c --- linux-2.6.0/arch/ppc64/kernel/htab.c 2003-12-17 18:58:57.000000000 -0800 +++ linux-2.6.0-context-struct/arch/ppc64/kernel/htab.c 2004-01-08 15:23:05.000000000 -0800 @@ -390,7 +390,7 @@ int hash_page(unsigned long ea, unsigned if (mm == NULL) return 1; - vsid = get_vsid(mm->context, ea); + vsid = get_vsid(mm->context.flags, ea); break; case IO_REGION_ID: mm = &ioremap_mm; diff -purN linux-2.6.0/arch/ppc64/kernel/stab.c linux-2.6.0-context-struct/arch/ppc64/kernel/stab.c --- linux-2.6.0/arch/ppc64/kernel/stab.c 2003-12-17 18:59:17.000000000 -0800 +++ linux-2.6.0-context-struct/arch/ppc64/kernel/stab.c 2004-01-08 15:23:05.000000000 -0800 @@ -270,14 +270,14 @@ int ste_allocate(unsigned long ea) if (REGION_ID(ea) >= KERNEL_REGION_ID) { kernel_segment = 1; vsid = get_kernel_vsid(ea); - context = REGION_ID(ea); + context.flags = REGION_ID(ea); } else { if (! current->mm) return 1; context = current->mm->context; - vsid = get_vsid(context, ea); + vsid = get_vsid(context.flags, ea); } esid = GET_ESID(ea); @@ -307,7 +307,7 @@ static void preload_stab(struct task_str for (esid = 0; esid < 16; esid++) { unsigned long ea = esid << SID_SHIFT; - vsid = get_vsid(mm->context, ea); + vsid = get_vsid(mm->context.flags, ea); __ste_allocate(esid, vsid, 0, mm->context); } } else { @@ -321,7 +321,7 @@ static void preload_stab(struct task_str if (!IS_VALID_EA(pc) || (REGION_ID(pc) >= KERNEL_REGION_ID)) return; - vsid = get_vsid(mm->context, pc); + vsid = get_vsid(mm->context.flags, pc); __ste_allocate(GET_ESID(pc), vsid, 0, mm->context); } @@ -329,7 +329,7 @@ static void preload_stab(struct task_str if (!IS_VALID_EA(stack) || (REGION_ID(stack) >= KERNEL_REGION_ID)) return; - vsid = get_vsid(mm->context, stack); + vsid = get_vsid(mm->context.flags, stack); __ste_allocate(GET_ESID(stack), vsid, 0, mm->context); } } diff -purN linux-2.6.0/arch/ppc64/mm/hugetlbpage.c 
linux-2.6.0-context-struct/arch/ppc64/mm/hugetlbpage.c --- linux-2.6.0/arch/ppc64/mm/hugetlbpage.c 2003-12-17 18:58:50.000000000 -0800 +++ linux-2.6.0-context-struct/arch/ppc64/mm/hugetlbpage.c 2004-01-08 15:50:29.000000000 -0800 @@ -245,7 +245,7 @@ static int open_32bit_htlbpage_range(str struct vm_area_struct *vma; unsigned long addr; - if (mm->context & CONTEXT_LOW_HPAGES) + if (mm->context.flags & CONTEXT_LOW_HPAGES) return 0; /* The window is already open */ /* Check no VMAs are in the region */ @@ -282,7 +282,7 @@ static int open_32bit_htlbpage_range(str /* FIXME: do we need to scan for PTEs too? */ - mm->context |= CONTEXT_LOW_HPAGES; + mm->context.flags |= CONTEXT_LOW_HPAGES; /* the context change must make it to memory before the slbia, * so that further SLB misses do the right thing. */ @@ -590,7 +590,6 @@ full_search: } } - unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags) @@ -780,7 +779,7 @@ static void flush_hash_hugepage(mm_conte BUG_ON(hugepte_bad(pte)); BUG_ON(!in_hugepage_area(context, ea)); - vsid = get_vsid(context, ea); + vsid = get_vsid(context.flags, ea); va = (vsid << 28) | (ea & 0x0fffffff); vpn = va >> LARGE_PAGE_SHIFT; diff -purN linux-2.6.0/arch/ppc64/mm/init.c linux-2.6.0-context-struct/arch/ppc64/mm/init.c --- linux-2.6.0/arch/ppc64/mm/init.c 2003-12-17 18:58:57.000000000 -0800 +++ linux-2.6.0-context-struct/arch/ppc64/mm/init.c 2004-01-08 15:23:05.000000000 -0800 @@ -275,7 +275,7 @@ flush_tlb_page(struct vm_area_struct *vm break; case USER_REGION_ID: pgd = pgd_offset( vma->vm_mm, vmaddr ); - context = vma->vm_mm->context; + context = vma->vm_mm->context.flags; /* XXX are there races with checking cpu_vm_mask? 
- Anton */ tmp = cpumask_of_cpu(smp_processor_id()); @@ -327,7 +327,7 @@ __flush_tlb_range(struct mm_struct *mm, break; case USER_REGION_ID: pgd = pgd_offset(mm, start); - context = mm->context; + context = mm->context.flags; /* XXX are there races with checking cpu_vm_mask? - Anton */ tmp = cpumask_of_cpu(smp_processor_id()); @@ -431,7 +431,7 @@ void __init mm_init_ppc64(void) mmu_context_queue.tail = NUM_USER_CONTEXT-1; mmu_context_queue.size = NUM_USER_CONTEXT; for(index=0; index < NUM_USER_CONTEXT ;index++) { - mmu_context_queue.elements[index] = index+FIRST_USER_CONTEXT; + mmu_context_queue.elements[index].flags = index+FIRST_USER_CONTEXT; } /* Setup guard pages for the Paca's */ @@ -717,7 +717,7 @@ void update_mmu_cache(struct vm_area_str return; ptep = find_linux_pte(pgdir, ea); - vsid = get_vsid(vma->vm_mm->context, ea); + vsid = get_vsid(vma->vm_mm->context.flags, ea); tmp = cpumask_of_cpu(smp_processor_id()); if (cpus_equal(vma->vm_mm->cpu_vm_mask, tmp)) diff -purN linux-2.6.0/arch/ppc64/xmon/xmon.c linux-2.6.0-context-struct/arch/ppc64/xmon/xmon.c --- linux-2.6.0/arch/ppc64/xmon/xmon.c 2003-12-17 18:59:28.000000000 -0800 +++ linux-2.6.0-context-struct/arch/ppc64/xmon/xmon.c 2004-01-08 15:23:05.000000000 -0800 @@ -1936,7 +1936,7 @@ mem_translate() // if in user range, use the current task's page directory else if ( ( ea >= USER_START ) && ( ea <= USER_END ) ) { mm = current->mm; - vsid = get_vsid(mm->context, ea ); + vsid = get_vsid(mm->context.flags, ea ); } pgdir = mm->pgd; va = ( vsid << 28 ) | ( ea & 0x0fffffff ); diff -purN linux-2.6.0/include/asm-ppc64/mmu.h linux-2.6.0-context-struct/include/asm-ppc64/mmu.h --- linux-2.6.0/include/asm-ppc64/mmu.h 2003-12-17 18:59:05.000000000 -0800 +++ linux-2.6.0-context-struct/include/asm-ppc64/mmu.h 2004-01-08 15:42:11.000000000 -0800 @@ -15,8 +15,10 @@ #ifndef __ASSEMBLY__ -/* Default "unsigned long" context */ -typedef unsigned long mm_context_t; +/* Time to allow for more things here */ +typedef struct { + 
unsigned long flags; +} mm_context_t; #ifdef CONFIG_HUGETLB_PAGE #define CONTEXT_LOW_HPAGES (1UL<<63) diff -purN linux-2.6.0/include/asm-ppc64/mmu_context.h linux-2.6.0-context-struct/include/asm-ppc64/mmu_context.h --- linux-2.6.0/include/asm-ppc64/mmu_context.h 2003-12-17 18:58:40.000000000 -0800 +++ linux-2.6.0-context-struct/include/asm-ppc64/mmu_context.h 2004-01-08 15:43:07.000000000 -0800 @@ -127,8 +127,8 @@ destroy_context(struct mm_struct *mm) #endif mmu_context_queue.size++; - mmu_context_queue.elements[index] = - mm->context & ~CONTEXT_LOW_HPAGES; + mmu_context_queue.elements[index].flags = + mm->context.flags & ~CONTEXT_LOW_HPAGES; spin_unlock_irqrestore(&mmu_context_queue.lock, flags); } diff -purN linux-2.6.0/include/asm-ppc64/page.h linux-2.6.0-context-struct/include/asm-ppc64/page.h --- linux-2.6.0/include/asm-ppc64/page.h 2003-12-17 18:58:04.000000000 -0800 +++ linux-2.6.0-context-struct/include/asm-ppc64/page.h 2004-01-08 15:51:39.000000000 -0800 @@ -32,6 +32,7 @@ /* For 64-bit processes the hugepage range is 1T-1.5T */ #define TASK_HPAGE_BASE (0x0000010000000000UL) #define TASK_HPAGE_END (0x0000018000000000UL) + /* For 32-bit processes the hugepage range is 2-3G */ #define TASK_HPAGE_BASE_32 (0x80000000UL) #define TASK_HPAGE_END_32 (0xc0000000UL) @@ -39,14 +40,14 @@ #define ARCH_HAS_HUGEPAGE_ONLY_RANGE #define is_hugepage_only_range(addr, len) \ ( ((addr > (TASK_HPAGE_BASE-len)) && (addr < TASK_HPAGE_END)) || \ - ((current->mm->context & CONTEXT_LOW_HPAGES) && \ + ((current->mm->context.flags & CONTEXT_LOW_HPAGES) && \ (addr > (TASK_HPAGE_BASE_32-len)) && (addr < TASK_HPAGE_END_32)) ) + #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA - #define in_hugepage_area(context, addr) \ ((cur_cpu_spec->cpu_features & CPU_FTR_16M_PAGE) && \ ((((addr) >= TASK_HPAGE_BASE) && ((addr) < TASK_HPAGE_END)) || \ - (((context) & CONTEXT_LOW_HPAGES) && \ + (((context.flags) & CONTEXT_LOW_HPAGES) && \ (((addr) >= TASK_HPAGE_BASE_32) && ((addr) < TASK_HPAGE_END_32))))) #else 
/* !CONFIG_HUGETLB_PAGE */ diff -purN linux-2.6.0/include/asm-ppc64/tlb.h linux-2.6.0-context-struct/include/asm-ppc64/tlb.h --- linux-2.6.0/include/asm-ppc64/tlb.h 2003-12-17 18:58:40.000000000 -0800 +++ linux-2.6.0-context-struct/include/asm-ppc64/tlb.h 2004-01-08 15:23:05.000000000 -0800 @@ -65,7 +65,7 @@ static inline void __tlb_remove_tlb_entr if (cpus_equal(tlb->mm->cpu_vm_mask, local_cpumask)) local = 1; - flush_hash_range(tlb->mm->context, i, local); + flush_hash_range(tlb->mm->context.flags, i, local); i = 0; } } @@ -84,7 +84,7 @@ static inline void tlb_flush(struct mmu_ if (cpus_equal(tlb->mm->cpu_vm_mask, local_cpumask)) local = 1; - flush_hash_range(tlb->mm->context, batch->index, local); + flush_hash_range(tlb->mm->context.flags, batch->index, local); batch->index = 0; } -- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From agl at us.ibm.com Sat Jan 10 08:33:18 2004 From: agl at us.ibm.com (Adam Litke) Date: 09 Jan 2004 13:33:18 -0800 Subject: [RFC] implicit hugetlb pages (hugetlb_dyn_as) In-Reply-To: <1073683188.1297.105.camel@agtpad> References: <1073683188.1297.105.camel@agtpad> Message-ID: <1073683998.1297.120.camel@agtpad> hugetlb_dyn_as (2.6.0): This patch adds support for dynamic resizing of the address space region used to address hugetlb pages. This region starts empty and grows down from f0000000 in segment sized increments as needed. Requires hugetlb_implicit and mmu_context_to_struct. 
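The "grows down in segment sized increments" arithmetic can be sketched on its own, outside the kernel. The following is a simplified illustration under stated assumptions, not the patch's grow_hugetlb_region(): the names SEGMENT_SIZE, HPAGE_BASE_MIN, HPAGE_END, align_down() and grow_base() are made up here, the VMA lookups are left out, and 0 is returned on failure instead of -ENOMEM. The 256MB segment size and the 0x40000000 to 0xf0000000 limits are taken from the patch below.

```c
#include <stdint.h>

#define SEGMENT_SIZE	(UINT64_C(256) << 20)	/* 256MB segments */
#define HPAGE_BASE_MIN	UINT64_C(0x40000000)	/* lowest base the region may reach */
#define HPAGE_END	UINT64_C(0xf0000000)	/* fixed top of the region */

/* Round addr down to a multiple of size (size must be a power of two). */
static uint64_t align_down(uint64_t addr, uint64_t size)
{
	return addr & ~(size - 1);
}

/*
 * Given the lowest address already in use inside the region (HPAGE_END
 * when the region is still empty) and the length of a new mapping,
 * return the new segment-aligned base, or 0 if the region cannot grow
 * that far.
 */
static uint64_t grow_base(uint64_t lowest_used, uint64_t len)
{
	uint64_t new_base;

	if (len == 0 || len > lowest_used)
		return 0;

	new_base = align_down(lowest_used - len, SEGMENT_SIZE);
	if (new_base < HPAGE_BASE_MIN)
		return 0;

	return new_base;
}
```

For example, a first 16MB mapping in an empty region ends just below HPAGE_END and moves the base down by one whole segment, since the base is always rounded down to a segment boundary.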
-- snip -- diff -purN linux-2.6.0-implicit/arch/ppc64/kernel/setup.c linux-2.6.0-implicit+dynas/arch/ppc64/kernel/setup.c --- linux-2.6.0-implicit/arch/ppc64/kernel/setup.c 2004-01-09 10:50:20.000000000 -0800 +++ linux-2.6.0-implicit+dynas/arch/ppc64/kernel/setup.c 2004-01-09 11:06:23.000000000 -0800 @@ -523,6 +523,9 @@ void __init setup_arch(char **cmdline_p) init_mm.end_code = (unsigned long) _etext; init_mm.end_data = (unsigned long) _edata; init_mm.brk = klimit; +#ifdef CONFIG_HUGETLB_PAGE + init_mm.context.hugetlb_base = TASK_HPAGE_BASE_32; +#endif /* Save unparsed command line copy for /proc/cmdline */ strcpy(saved_command_line, cmd_line); diff -purN linux-2.6.0-implicit/arch/ppc64/mm/hugetlbpage.c linux-2.6.0-implicit+dynas/arch/ppc64/mm/hugetlbpage.c --- linux-2.6.0-implicit/arch/ppc64/mm/hugetlbpage.c 2004-01-09 11:37:33.000000000 -0800 +++ linux-2.6.0-implicit+dynas/arch/ppc64/mm/hugetlbpage.c 2004-01-09 11:14:31.000000000 -0800 @@ -249,14 +249,14 @@ static int open_32bit_htlbpage_range(str return 0; /* The window is already open */ /* Check no VMAs are in the region */ - vma = find_vma(mm, TASK_HPAGE_BASE_32); + vma = find_vma(mm, mm->context.hugetlb_base); if (vma && (vma->vm_start < TASK_HPAGE_END_32)) return -EBUSY; /* Clean up any leftover PTE pages in the region */ spin_lock(&mm->page_table_lock); - for (addr = TASK_HPAGE_BASE_32; addr < TASK_HPAGE_END_32; + for (addr = mm->context.hugetlb_base; addr < TASK_HPAGE_END_32; addr += PMD_SIZE) { pgd_t *pgd = pgd_offset(mm, addr); pmd_t *pmd = pmd_offset(pgd, addr); @@ -590,6 +590,32 @@ full_search: } } +unsigned long grow_hugetlb_region(unsigned long hpage_base, unsigned long len) +{ + struct vm_area_struct *vma = NULL; + unsigned long i, new_base, vma_start = hpage_base; + + vma = find_vma(current->mm, vma_start); + vma_start = (vma && vma->vm_start < TASK_HPAGE_END_32) ? 
+ vma->vm_start : TASK_HPAGE_END_32; + printk("First vma in hugetlb region starts at: %lx\n", vma_start); + + new_base = _ALIGN_DOWN(vma_start - len, 256<<20); + if (new_base < TASK_HPAGE_BASE_32) + return -ENOMEM; + + printk("Try to move hugetlb_base down to: %lx\n", new_base); + vma = find_vma(current->mm, new_base); + if (vma && vma->vm_start < hpage_base) { + printk("Found vma at %lx aborting\n", vma->vm_start); + return -ENOMEM; + } + + current->mm->context.hugetlb_base = new_base; + printk("Area clean returning an area at: %lx\n", vma_start-len); + return vma_start - len; +} + unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags) @@ -610,7 +636,7 @@ unsigned long hugetlb_get_unmapped_area( if (err) return err; /* Should this just be EINVAL? */ - base = TASK_HPAGE_BASE_32; + base = current->mm->context.hugetlb_base; end = TASK_HPAGE_END_32; } else { base = TASK_HPAGE_BASE; @@ -624,7 +650,7 @@ unsigned long hugetlb_get_unmapped_area( for (vma = find_vma(current->mm, addr); ; vma = vma->vm_next) { /* At this point: (!vma || addr < vma->vm_end). 
*/ if (addr + len > end) - return -ENOMEM; + break; /* We couldn't find an area */ if (!vma || (addr + len) <= vma->vm_start) return addr; addr = ALIGN(vma->vm_end, HPAGE_SIZE); @@ -633,6 +659,8 @@ unsigned long hugetlb_get_unmapped_area( * this alignment shouldn't have skipped over any * other vmas */ } + /* Get the space by expanding the hugetlb region */ + return grow_hugetlb_region(base, len); } static inline unsigned long computeHugeHptePP(unsigned int hugepte) diff -purN linux-2.6.0-implicit/fs/hugetlbfs/inode.c linux-2.6.0-implicit+dynas/fs/hugetlbfs/inode.c --- linux-2.6.0-implicit/fs/hugetlbfs/inode.c 2004-01-09 10:50:34.000000000 -0800 +++ linux-2.6.0-implicit+dynas/fs/hugetlbfs/inode.c 2004-01-09 11:16:30.000000000 -0800 @@ -155,6 +155,7 @@ try_hugetlb_get_unmapped_area(struct fil if (*flags & MAP_HUGETLB) { if (pre_error) return pre_error; + printk("Doing explicit hugetlb mmap\n"); return hugetlb_get_unmapped_area(NULL, addr, len, pgoff, *flags); } @@ -165,10 +166,13 @@ try_hugetlb_get_unmapped_area(struct fil if (mmap_hugetlb_implicit(len)) { if (pre_error) goto out; + printk("Doing implicit hugetlb mmap..."); addr = hugetlb_get_unmapped_area(NULL, addr, len, pgoff, *flags); - if (IS_ERR((void *)addr)) + if (IS_ERR((void *)addr)) { + printk("failed - falling back.\n"); goto out; - else { + } else { + printk("succeeded.\n"); *flags |= MAP_HUGETLB; return addr; } diff -purN linux-2.6.0-implicit/include/asm-ppc64/mmu.h linux-2.6.0-implicit+dynas/include/asm-ppc64/mmu.h --- linux-2.6.0-implicit/include/asm-ppc64/mmu.h 2004-01-09 11:37:33.000000000 -0800 +++ linux-2.6.0-implicit+dynas/include/asm-ppc64/mmu.h 2004-01-09 11:17:35.000000000 -0800 @@ -18,6 +18,9 @@ /* Time to allow for more things here */ typedef struct { unsigned long flags; +#ifdef CONFIG_HUGETLB_PAGE + unsigned long hugetlb_base; +#endif } mm_context_t; #ifdef CONFIG_HUGETLB_PAGE diff -purN linux-2.6.0-implicit/include/asm-ppc64/mmu_context.h 
linux-2.6.0-implicit+dynas/include/asm-ppc64/mmu_context.h --- linux-2.6.0-implicit/include/asm-ppc64/mmu_context.h 2004-01-09 11:37:33.000000000 -0800 +++ linux-2.6.0-implicit+dynas/include/asm-ppc64/mmu_context.h 2004-01-09 11:18:44.000000000 -0800 @@ -90,6 +90,9 @@ init_new_context(struct task_struct *tsk head = mmu_context_queue.head; mm->context = mmu_context_queue.elements[head]; +#ifdef CONFIG_HUGETLB_PAGE + mm->context.hugetlb_base = TASK_HPAGE_END_32; +#endif head = (head < LAST_USER_CONTEXT-1) ? head+1 : 0; mmu_context_queue.head = head; diff -purN linux-2.6.0-implicit/include/asm-ppc64/page.h linux-2.6.0-implicit+dynas/include/asm-ppc64/page.h --- linux-2.6.0-implicit/include/asm-ppc64/page.h 2004-01-09 11:37:33.000000000 -0800 +++ linux-2.6.0-implicit+dynas/include/asm-ppc64/page.h 2004-01-09 11:22:58.000000000 -0800 @@ -33,22 +33,28 @@ #define TASK_HPAGE_BASE (0x0000010000000000UL) #define TASK_HPAGE_END (0x0000018000000000UL) -/* For 32-bit processes the hugepage range is 2-3G */ -#define TASK_HPAGE_BASE_32 (0x80000000UL) -#define TASK_HPAGE_END_32 (0xc0000000UL) +/* + * We have much greater contention for segments in a + * 32-bit address space. Therefore, the region reserved + * for huge pages is dynamically resized. These values + * define the maximum range allowed for huge pages. 
+ */ +#define TASK_HPAGE_BASE_32 (0x40000000UL) +#define TASK_HPAGE_END_32 (0xf0000000UL) #define ARCH_HAS_HUGEPAGE_ONLY_RANGE #define is_hugepage_only_range(addr, len) \ ( ((addr > (TASK_HPAGE_BASE-len)) && (addr < TASK_HPAGE_END)) || \ ((current->mm->context.flags & CONTEXT_LOW_HPAGES) && \ - (addr > (TASK_HPAGE_BASE_32-len)) && (addr < TASK_HPAGE_END_32)) ) + (addr > (current->mm->context.hugetlb_base-len)) && \ + (addr < TASK_HPAGE_END_32)) ) #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA #define in_hugepage_area(context, addr) \ ((cur_cpu_spec->cpu_features & CPU_FTR_16M_PAGE) && \ ((((addr) >= TASK_HPAGE_BASE) && ((addr) < TASK_HPAGE_END)) || \ (((context.flags) & CONTEXT_LOW_HPAGES) && \ - (((addr) >= TASK_HPAGE_BASE_32) && ((addr) < TASK_HPAGE_END_32))))) + (((addr) >= context.hugetlb_base) && ((addr) < TASK_HPAGE_END_32))))) #else /* !CONFIG_HUGETLB_PAGE */ -- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From linas at austin.ibm.com Sat Jan 10 09:46:43 2004 From: linas at austin.ibm.com (linas at austin.ibm.com) Date: Fri, 9 Jan 2004 16:46:43 -0600 Subject: per page execute In-Reply-To: <20031230221841.GB22998@bubble.sa.bigpond.net.au>; from amodra@bigpond.net.au on Wed, Dec 31, 2003 at 08:48:41AM +1030 References: <20031227121524.GA24358@krispykreme> <20031230221841.GB22998@bubble.sa.bigpond.net.au> Message-ID: <20040109164643.A21956@forte.austin.ibm.com> I'm reading some old email ... On Wed, Dec 31, 2003 at 08:48:41AM +1030, Alan Modra wrote: > > On Sat, Dec 27, 2003 at 11:15:25PM +1100, Anton Blanchard wrote: > > [25] .plt NOBITS 10010c08 000c00 0000c0 00 WAX 0 0 4 > > [26] .bss NOBITS 10010cc8 000c00 000004 00 WA 0 0 1 > > > > Look how the non executable bss butts right onto the executable plt. > > Even with the patch below, we are failing some security tests that try > > and exec stuff out of the bss. Thats because the stuff ends up in the same > > page as the plt. 
Alan, could this be considered a toolchain bug? > > Possibly. What about .got (exec) and adjacent .sdata (non-exec)? The > ABI says that shared libs access .sdata via the got pointer, so > there's no hope of separating them. Dumb question: can you pad out the .got to a 4K boundary? Yes, that's a bogus fix if the kernel page size is different. --linas ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From amodra at bigpond.net.au Sat Jan 10 10:28:37 2004 From: amodra at bigpond.net.au (Alan Modra) Date: Sat, 10 Jan 2004 09:58:37 +1030 Subject: per page execute In-Reply-To: <20040109164643.A21956@forte.austin.ibm.com> References: <20031227121524.GA24358@krispykreme> <20031230221841.GB22998@bubble.sa.bigpond.net.au> <20040109164643.A21956@forte.austin.ibm.com> Message-ID: <20040109232837.GN2969@bubble.modra.org> On Fri, Jan 09, 2004 at 04:46:43PM -0600, linas at austin.ibm.com wrote: > Dumb question: can you pad out the .got to a 4K boundary? Yes. One way is to create some empty sections with the required alignment. .section ".got" .p2align 12 .section ".sdata" .p2align 12 You can also tweak linker scripts to do the alignment. -- Alan Modra IBM OzLabs - Linux Technology Centre ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From rsa at us.ibm.com Sat Jan 10 10:57:51 2004 From: rsa at us.ibm.com (Ryan Arnold) Date: 09 Jan 2004 17:57:51 -0600 Subject: Looking for commonality in 2.6 vio drivers sysfs requirements Message-ID: <1073692671.26608.220.camel@SigurRos.rchland.ibm.com> Hey all, I'm looking to see what kind of requirements for sysfs entries we're going to need for the PPC64 vio device drivers to determine if there is a heavy enough requirement to justify separating the forthcoming vio sysfs enablement code from the vio bus code in vio.c into something like vio-sysfs.c, much like scsi_sysfs.c, net-sysfs.c, or pci_sysfs.c. The drivers that are affected are: ibmveth: no sysfs impl. currently in 2.6 (Santiago Leon?) 
hvcs: Hypervisor Virtual Console Server : in dev. (Ryan Arnold) hvsi: I have no idea (Hollis Blanchard?) ibmvscsi : no sysfs impl. currently in 2.6 (Santiago Leon?) vscsi_server: I have no idea (Dave Boutcher?) hvc_console: no sysfs impl currently in 2.6 (?) I'll send out an email on hvcs sysfs requirements early next week. Ryan S. Arnold IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From linas at austin.ibm.com Sat Jan 10 11:31:31 2004 From: linas at austin.ibm.com (linas at austin.ibm.com) Date: Fri, 9 Jan 2004 18:31:31 -0600 Subject: spinlocks In-Reply-To: <3FFC212C.1010906@vnet.ibm.com>; from engebret@vnet.ibm.com on Wed, Jan 07, 2004 at 09:09:32AM -0600 References: <20031228052954.GD24358@krispykreme> <3FFC212C.1010906@vnet.ibm.com> Message-ID: <20040109183131.B21956@forte.austin.ibm.com> On Wed, Jan 07, 2004 at 09:09:32AM -0600, Dave Engebretsen wrote: > Anton Blanchard wrote: > > As an aside, can someone explain why we reread the lock holder: > > > > lwsync # if odd, give up cycles\n\ > > ldx %1,0,%2 # reverify the lock holder\n\ > > cmpd %0,%1\n\ > > bne 1b # new holder so restart\n\ > > > > Won't there be a race regardless of whether this code is there? > > It is a tricky case, but the sequence is required. Here is the situation: > > Proc A holds the lock > Proc B sees proc A as the holder, then gets preempted > Proc A drops the lock, then cedes for a long time > Proc B reads proc A's yield count, which is valid (odd) > Proc B confers to proc A, but does not wake up until after A is dispatched. > > The lwsync + reread ensures this cannot occur. Uhh, can someone copy these comments into the code, for future reference? Over the last year, I've fixed several locking/race bugs that involved some subtle assumptions. -- linas speaking of subtle assumptions: why do HMT_LOW, HMT_MED need to be no-ops? Why just make them be nothing at all? ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From benh at kernel.crashing.org Sun Jan 11 11:16:33 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Sun, 11 Jan 2004 11:16:33 +1100 Subject: [PATCH][2.6] set up vio_dev's driver field In-Reply-To: <20040109194100.GA7512@us.ibm.com> References: <3FFB3A47.8090404@us.ltcfwd.linux.ibm.com> <20040109182950.GA8858@us.ibm.com> <20040109194100.GA7512@us.ibm.com> Message-ID: <1073780193.764.35.camel@gaston> > No, a "fake" parent device is fine in this case. But what kind of > devices do these VIO devices hang off of? They should have some kind of > addressable device as a parent, right? But if not, then yes, a "fake" > parent device is ok. They don't have an addressable device as a parent. Those devices exist on machines running several logical partitions (IBM virtualisation environment) and the "API" to them is via traps to the hypervisor. (So basically syscall-like traps to the layer below linux). Ben. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From benh at kernel.crashing.org Sun Jan 11 11:18:31 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Sun, 11 Jan 2004 11:18:31 +1100 Subject: Looking for commonality in 2.6 vio drivers sysfs requirements In-Reply-To: <1073692671.26608.220.camel@SigurRos.rchland.ibm.com> References: <1073692671.26608.220.camel@SigurRos.rchland.ibm.com> Message-ID: <1073780310.765.37.camel@gaston> On Sat, 2004-01-10 at 10:57, Ryan Arnold wrote: > Hey all, > > I'm looking to see what kind of requirements for sysfs entries we're > going to need for the PPC64 vio device drivers to determine if there is > a heavy enough requirement to justify separating the forthcoming vio > sysfs enablement code from the vio bus code in vio.c into something like > vio-sysfs.c, much like scsi_sysfs.c, net-sysfs.c, or pci_sysfs.c. > > The drivers that are affected are: Well, if the problem is just about splitting files, I'd say it's not very important at this point. 
Implement things, and if you end up with too much sysfs-related stuff in the vio.c file, then split it :) It's just a matter of taste... Ben. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Sun Jan 11 11:56:27 2004 From: anton at samba.org (Anton Blanchard) Date: Sun, 11 Jan 2004 11:56:27 +1100 Subject: spinlocks In-Reply-To: <3FFC1A46.4010202@vnet.ibm.com> References: <20031228052954.GD24358@krispykreme> <20040106005232.GK12213@krispykreme> <3FFC1A46.4010202@vnet.ibm.com> Message-ID: <20040111005627.GA6663@krispykreme> Hi, > Is a single binary for Apple & pSeries a goal? While it has some > obvious advantages, there is likely to be a number of areas (the > spinlock discussion being one) where the goals are quite different. I think the goals of G5 and POWER4 SMP and LPAR are quite similar as far as spinlocks go, they should both benefit from an improvement in icache usage. Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Sun Jan 11 11:57:41 2004 From: anton at samba.org (Anton Blanchard) Date: Sun, 11 Jan 2004 11:57:41 +1100 Subject: spinlocks In-Reply-To: References: <3FFC1A46.4010202@vnet.ibm.com> Message-ID: <20040111005741.GB6663@krispykreme> > Are they really all that different? We need to keep the pSeries code > running smoothly on a small-config SMP machine too (i.e. p615 and the > like). Yeah I like the pressure it will put on us to keep things reasonably small. Check out the size of a default 2.6 ppc64 compile, its huge! Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Sun Jan 11 12:35:38 2004 From: anton at samba.org (Anton Blanchard) Date: Sun, 11 Jan 2004 12:35:38 +1100 Subject: spinlocks In-Reply-To: <3FFC212C.1010906@vnet.ibm.com> References: <20031228052954.GD24358@krispykreme> <3FFC212C.1010906@vnet.ibm.com> Message-ID: <20040111013538.GC6663@krispykreme> > >3. 
Separate spinlocks for iseries and pseries where most of it is > >duplicated. > I do not follow this point - Was just thinking out aloud, perhaps the inline portion of the iseries and pseries spinlocks could be shared. (Assuming we out of line the SPLPAR bits.) > It is a tricky case, but the sequence is required. Here is the situation: > > Proc A holds the lock > Proc B sees proc A as the holder, then gets preempted > Proc A drops the lock, then cedes for a long time > Proc B reads proc A's yield count, which is valid (odd) > Proc B confers to proc A, but does not wake up until after A is dispatched. > > The lwsync + reread ensures this cannot occur. OK. I'm wondering what stops that scenario from happening in the 5 instructions between when we reverify the lock holder and actually call into the hypervisor. > While I agree performance is less important in SPLPAR mode than > dedicated, it is still important. The vast majority of customers on > iSeries run in this mode. Sure. Do we have an estimate for the path length for a phyp confer hcall where we return straight back to the partition? If it's in the order of 100 instructions, then I prefer to add 10 instructions in linux to that path rather than 10 instructions to the non SPLPAR path. Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Sun Jan 11 12:52:30 2004 From: anton at samba.org (Anton Blanchard) Date: Sun, 11 Jan 2004 12:52:30 +1100 Subject: spinlocks In-Reply-To: <3FFC281B.2090007@vnet.ibm.com> References: <1073338443.761.77.camel@gaston> <20040106130937.GL12213@krispykreme> <3FFC281B.2090007@vnet.ibm.com> Message-ID: <20040111015230.GD6663@krispykreme> > If we uninline them, the advantage of leaf function optimizations are > lost -- it seems like that would be a pretty big hit, right? We don't > have any good data, but it may well be about a wash vs. the 1/2 cache > line of extra instructions introduced for shared processors. 
We can execute a large number of instructions in the time it takes to
satisfy one cache miss from memory. Half a cache line is an awfully
large thing to inline for something as common as a spinlock. The only
data we have so far is from Joel, and his results show a small but
noticeable improvement from removing 2 spinlock instructions in our fast
path.

> Isn't this going to result in shared processor locks always stacking the
> "mini-frame"? That is a pretty big hit for what is likely to be a very
> common customer configuration.

Perhaps. I don't see why it would be such a big hit, considering phyp
will often end up swapping contexts in that code path. I'm guessing that
will take a long time to complete.

> What magic results in this ending up at the end of each function?

There is only 1 copy of it in the kernel.

> When Peter & I were just looking at this, he pointed out that lwz
> r5,0x2580(0) may not quite have the intended results :)

Thanks, it needs some work still :)

> Also, where in this are cr0, cr1, and xer marked as clobbered? They are
> all volatile over the hcall.

We'll have to add them.

Anton

From david at gibson.dropbear.id.au Mon Jan 12 12:54:00 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Mon, 12 Jan 2004 12:54:00 +1100
Subject: [RFC] implicit hugetlb pages (mmu_context_to_struct)
In-Reply-To: <1073683778.1298.115.camel@agtpad>
References: <1073683188.1297.105.camel@agtpad> <1073683778.1298.115.camel@agtpad>
Message-ID: <20040112015400.GA8262@zax>

On Fri, Jan 09, 2004 at 01:29:38PM -0800, Adam Litke wrote:
>
> mmu_context_to_struct (2.6.0):
> This patch converts the mmu_context variable to a structure. It is
> needed for the dynamic address space resizing patch.

Ok, changing the context to a struct is a reasonable idea, but "flags"
is a really bad name for the existing field.
What the mmu_context (currently) mostly contains is the actual mm
context ID which is used to create VSIDs. Additionally it has exactly
one flag - CONTEXT_LOW_HPAGES - but that's only there because I avoided
doing this conversion to a struct when I first did the hugepage support.
If you're going to convert the mmu context to a struct, it would make
sense to pull that flag out and put it in its own field. If we add
dynamic hugepage range support in the right way, I think we'll be able
to subsume the LOW_HPAGES flag into whatever information the dynamic
range stuff needs.

I also think it would probably be better to no longer include the whole
mmu context structure in the mmu_context_queue. All it needs to store is
the actual context number, and having the other information there has
the potential to cause strange and subtle bugs. I remember debugging
various problems from not properly clearing the LOW_HPAGES flag when
contexts were placed back into the queue.

--
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson

From david at gibson.dropbear.id.au Mon Jan 12 13:35:15 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Mon, 12 Jan 2004 13:35:15 +1100
Subject: [RFC] implicit hugetlb pages (hugetlb_dyn_as)
In-Reply-To: <1073683998.1297.120.camel@agtpad>
References: <1073683188.1297.105.camel@agtpad> <1073683998.1297.120.camel@agtpad>
Message-ID: <20040112023515.GC8262@zax>

On Fri, Jan 09, 2004 at 01:33:18PM -0800, Adam Litke wrote:
>
> hugetlb_dyn_as (2.6.0):
> This patch adds support for dynamic resizing of the address space
> region used to address hugetlb pages. This region starts empty and
> grows down from f0000000 in segment sized increments as needed.
Have you considered the approach of, instead of using a single hugepage
range, using a bitmask indicating which segments are hugepage segments?
That's more flexible, and I suspect it may actually lead to some
implementation simplifications, especially if you want to support
shrinking the hugepage region.

> Requires hugetlb_implicit and mmu_context_to_struct.

This change is orthogonal to the hugetlb_implicit stuff; it would be
nice to make it independent of that patch.

Comments on some details below:

> diff -purN linux-2.6.0-implicit/arch/ppc64/kernel/setup.c linux-2.6.0-implicit+dynas/arch/ppc64/kernel/setup.c
> +++ linux-2.6.0-implicit+dynas/arch/ppc64/kernel/setup.c 2004-01-09 11:06:23.000000000 -0800
> @@ -523,6 +523,9 @@ void __init setup_arch(char **cmdline_p)
>  	init_mm.end_code = (unsigned long) _etext;
>  	init_mm.end_data = (unsigned long) _edata;
>  	init_mm.brk = klimit;
> +#ifdef CONFIG_HUGETLB_PAGE
> +	init_mm.context.hugetlb_base = TASK_HPAGE_BASE_32;
> +#endif

Erm... this appears to be giving init the largest possible hugetlb
range, which seems odd. I can't see why init would want hugepages.
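
The bitmask alternative raised at the top of this message could look
roughly like the sketch below. It is only an illustration, assuming a
32-bit address space split into sixteen 256MB segments; the struct and
helper names (hpage_segments, hpage_segment_open and so on) are
hypothetical and do not come from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: a 32-bit address space split into sixteen 256MB segments. */
#define SEGMENT_SHIFT 28
#define NUM_SEGMENTS  16

/* Hypothetical per-mm state: bit n set means segment n holds hugepages. */
struct hpage_segments {
	uint16_t mask;
};

static inline unsigned int addr_to_segment(uint32_t addr)
{
	return addr >> SEGMENT_SHIFT;
}

/* Mark one segment as a hugepage segment. */
static inline void hpage_segment_open(struct hpage_segments *s, unsigned int seg)
{
	s->mask |= (uint16_t)(1u << seg);
}

/*
 * Return a segment to normal pages. A single base address cannot
 * express this for a segment in the middle of the range.
 */
static inline void hpage_segment_close(struct hpage_segments *s, unsigned int seg)
{
	s->mask &= (uint16_t)~(1u << seg);
}

/* Does this address fall inside a segment currently used for hugepages? */
static inline bool in_hpage_segment(const struct hpage_segments *s, uint32_t addr)
{
	return (s->mask >> addr_to_segment(addr)) & 1u;
}

/* An empty mask means no 32-bit hugepage segments are open at all. */
static inline bool hpage_segments_empty(const struct hpage_segments *s)
{
	return s->mask == 0;
}
```

With this shape, shrinking the region is just clearing a bit, and an
"is anything open" test is a comparison of the mask against zero.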
> /* Save unparsed command line copy for /proc/cmdline */ > strcpy(saved_command_line, cmd_line); > diff -purN linux-2.6.0-implicit/arch/ppc64/mm/hugetlbpage.c linux-2.6.0-implicit+dynas/arch/ppc64/mm/hugetlbpage.c > +++ linux-2.6.0-implicit+dynas/arch/ppc64/mm/hugetlbpage.c 2004-01-09 11:14:31.000000000 -0800 > @@ -249,14 +249,14 @@ static int open_32bit_htlbpage_range(str > return 0; /* The window is already open */ > > /* Check no VMAs are in the region */ > - vma = find_vma(mm, TASK_HPAGE_BASE_32); > + vma = find_vma(mm, mm->context.hugetlb_base); > > if (vma && (vma->vm_start < TASK_HPAGE_END_32)) > return -EBUSY; > > /* Clean up any leftover PTE pages in the region */ > spin_lock(&mm->page_table_lock); > - for (addr = TASK_HPAGE_BASE_32; addr < TASK_HPAGE_END_32; > + for (addr = mm->context.hugetlb_base; addr < TASK_HPAGE_END_32; > addr += PMD_SIZE) { > pgd_t *pgd = pgd_offset(mm, addr); > pmd_t *pmd = pmd_offset(pgd, addr); > @@ -590,6 +590,32 @@ full_search: > } > } > > +unsigned long grow_hugetlb_region(unsigned long hpage_base, unsigned long len) > +{ > + struct vm_area_struct *vma = NULL; > + unsigned long i, new_base, vma_start = hpage_base; i is an unused variable. > + vma = find_vma(current->mm, vma_start); > + vma_start = (vma && vma->vm_start < TASK_HPAGE_END_32) ? > + vma->vm_start : TASK_HPAGE_END_32; > + printk("First vma in hugetlb region starts at: %lx\n", vma_start); > + new_base = _ALIGN_DOWN(vma_start - len, 256<<20); > + if (new_base < TASK_HPAGE_BASE_32) > + return -ENOMEM; > + printk("Try to move hugetlb_base down to: %lx\n", new_base); > + vma = find_vma(current->mm, new_base); > + if (vma && vma->vm_start < hpage_base) { > + printk("Found vma at %lx aborting\n", vma->vm_start); > + return -ENOMEM; > + } > + > + current->mm->context.hugetlb_base = new_base; > + printk("Area clean returning an area at: %lx\n", vma_start-len); > + return vma_start - len; > +} This isn't quite sufficient. 
There could be non-hugepage SLB entries in place for the segments which have now become hugepage segments, so you'll need to flush those. The simplest approach is probably an IPI to slbia on all CPUs, like we do in open_32bit_htlbpage_range() when we set the LOW_HPAGES flag. Speaking of which, you no longer need the LOW_HPAGES flag, instead you can just test whether the (32bit) hugepage range is non-empty. > unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr, > unsigned long len, unsigned long pgoff, > unsigned long flags) > @@ -610,7 +636,7 @@ unsigned long hugetlb_get_unmapped_area( > if (err) > return err; /* Should this just be EINVAL? */ > > - base = TASK_HPAGE_BASE_32; > + base = current->mm->context.hugetlb_base; > end = TASK_HPAGE_END_32; > } else { > base = TASK_HPAGE_BASE; > @@ -624,7 +650,7 @@ unsigned long hugetlb_get_unmapped_area( > for (vma = find_vma(current->mm, addr); ; vma = vma->vm_next) { > /* At this point: (!vma || addr < vma->vm_end). */ > if (addr + len > end) > - return -ENOMEM; > + break; /* We couldn't find an area */ > if (!vma || (addr + len) <= vma->vm_start) > return addr; > addr = ALIGN(vma->vm_end, HPAGE_SIZE); > @@ -633,6 +659,8 @@ unsigned long hugetlb_get_unmapped_area( > * this alignment shouldn't have skipped over any > * other vmas */ > } > + /* Get the space by expanding the hugetlb region */ > + return grow_hugetlb_region(base, len); > } > > static inline unsigned long computeHugeHptePP(unsigned int hugepte) > diff -purN linux-2.6.0-implicit/fs/hugetlbfs/inode.c linux-2.6.0-implicit+dynas/fs/hugetlbfs/inode.c > +++ linux-2.6.0-implicit+dynas/fs/hugetlbfs/inode.c 2004-01-09 11:16:30.000000000 -0800 > @@ -155,6 +155,7 @@ try_hugetlb_get_unmapped_area(struct fil > if (*flags & MAP_HUGETLB) { > if (pre_error) > return pre_error; > + printk("Doing explicit hugetlb mmap\n"); > return hugetlb_get_unmapped_area(NULL, addr, len, pgoff, *flags); > } > > @@ -165,10 +166,13 @@ 
try_hugetlb_get_unmapped_area(struct fil > if (mmap_hugetlb_implicit(len)) { > if (pre_error) > goto out; > + printk("Doing implicit hugetlb mmap..."); > addr = hugetlb_get_unmapped_area(NULL, addr, len, pgoff, *flags); > - if (IS_ERR((void *)addr)) > + if (IS_ERR((void *)addr)) { > + printk("failed - falling back.\n"); > goto out; > - else { > + } else { > + printk("succeeded.\n"); > *flags |= MAP_HUGETLB; > return addr; Afaict this is the only dependence on the implicit hugepage patch, and all it does is add some comments. Best to ditch it so the patches become independent. > diff -purN linux-2.6.0-implicit/include/asm-ppc64/mmu.h linux-2.6.0-implicit+dynas/include/asm-ppc64/mmu.h > +++ linux-2.6.0-implicit+dynas/include/asm-ppc64/mmu.h 2004-01-09 11:17:35.000000000 -0800 > @@ -18,6 +18,9 @@ > /* Time to allow for more things here */ > typedef struct { > unsigned long flags; > +#ifdef CONFIG_HUGETLB_PAGE > + unsigned long hugetlb_base; > +#endif > } mm_context_t; > > #ifdef CONFIG_HUGETLB_PAGE > diff -purN linux-2.6.0-implicit/include/asm-ppc64/mmu_context.h linux-2.6.0-implicit+dynas/include/asm-ppc64/mmu_context.h > +++ linux-2.6.0-implicit+dynas/include/asm-ppc64/mmu_context.h 2004-01-09 11:18:44.000000000 -0800 > @@ -90,6 +90,9 @@ init_new_context(struct task_struct *tsk > > head = mmu_context_queue.head; > mm->context = mmu_context_queue.elements[head]; > +#ifdef CONFIG_HUGETLB_PAGE > + mm->context.hugetlb_base = TASK_HPAGE_END_32; > +#endif > > head = (head < LAST_USER_CONTEXT-1) ? 
head+1 : 0; > mmu_context_queue.head = head; > diff -purN linux-2.6.0-implicit/include/asm-ppc64/page.h linux-2.6.0-implicit+dynas/include/asm-ppc64/page.h > +++ linux-2.6.0-implicit+dynas/include/asm-ppc64/page.h 2004-01-09 11:22:58.000000000 -0800 > @@ -33,22 +33,28 @@ > #define TASK_HPAGE_BASE (0x0000010000000000UL) > #define TASK_HPAGE_END (0x0000018000000000UL) > > -/* For 32-bit processes the hugepage range is 2-3G */ > -#define TASK_HPAGE_BASE_32 (0x80000000UL) > -#define TASK_HPAGE_END_32 (0xc0000000UL) > +/* > + * We have much greater contention for segments in a > + * 32-bit address space. Therefore, the region reserved > + * for huge pages is dynamically resized. These values > + * define the maximum range allowed for huge pages. > + */ > +#define TASK_HPAGE_BASE_32 (0x40000000UL) > +#define TASK_HPAGE_END_32 (0xf0000000UL) > > #define ARCH_HAS_HUGEPAGE_ONLY_RANGE > #define is_hugepage_only_range(addr, len) \ > ( ((addr > (TASK_HPAGE_BASE-len)) && (addr < TASK_HPAGE_END)) || \ > ((current->mm->context.flags & CONTEXT_LOW_HPAGES) && \ > - (addr > (TASK_HPAGE_BASE_32-len)) && (addr < TASK_HPAGE_END_32)) ) > + (addr > (current->mm->context.hugetlb_base-len)) && \ > + (addr < TASK_HPAGE_END_32)) ) I assume context.hugetlb_base is supposed to be protected by the mmap_sem. Have you double checked to make sure that all the callers of this macro hold the mmap_sem? > #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA > #define in_hugepage_area(context, addr) \ > ((cur_cpu_spec->cpu_features & CPU_FTR_16M_PAGE) && \ > ((((addr) >= TASK_HPAGE_BASE) && ((addr) < TASK_HPAGE_END)) || \ > (((context.flags) & CONTEXT_LOW_HPAGES) && \ > - (((addr) >= TASK_HPAGE_BASE_32) && ((addr) < TASK_HPAGE_END_32))))) > + (((addr) >= context.hugetlb_base) && ((addr) < TASK_HPAGE_END_32))))) > #else /* !CONFIG_HUGETLB_PAGE */ -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. 
http://www.ozlabs.org/people/dgibson ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From david at gibson.dropbear.id.au Mon Jan 12 15:19:18 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Mon, 12 Jan 2004 15:19:18 +1100 Subject: [RFC] implicit hugetlb pages (hugetlb_implicit) In-Reply-To: <1073683640.1297.111.camel@agtpad> References: <1073683188.1297.105.camel@agtpad> <1073683640.1297.111.camel@agtpad> Message-ID: <20040112041918.GD8262@zax> On Fri, Jan 09, 2004 at 01:27:20PM -0800, Adam Litke wrote: > > hugetlb_implicit (2.6.0): > This patch includes the anonymous mmap work from Dave Gibson > (right?) I'm not sure what you're referring to here. My patches for lbss support also include support for copy-on-write of hugepages and various other changes which can make them act kind of like anonymous pages. But I don't see much in this patch that looks familiar. > as well as my shared mem support. I have added safe fallback for > implicit allocations. This patch uses a fixed address space range of > 80000000 - c0000000 for huge pages. 
Some detailed comments below: > diff -purN linux-2.6.0/fs/hugetlbfs/inode.c linux-2.6.0-implicit/fs/hugetlbfs/inode.c > +++ linux-2.6.0-implicit/fs/hugetlbfs/inode.c 2004-01-08 16:19:31.000000000 -0800 > @@ -26,12 +26,17 @@ > #include > #include > #include > +#include > > #include > +#include > > /* some random number */ > #define HUGETLBFS_MAGIC 0x958458f6 > > +extern int mmap_use_hugepages; > +extern int mmap_hugepages_map_sz; > + > static struct super_operations hugetlbfs_ops; > static struct address_space_operations hugetlbfs_aops; > struct file_operations hugetlbfs_file_operations; > @@ -82,7 +87,7 @@ static int hugetlbfs_file_mmap(struct fi > unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr, > unsigned long len, unsigned long pgoff, unsigned long flags); > #else > -static unsigned long > +unsigned long > hugetlb_get_unmapped_area(struct file *file, unsigned long addr, > unsigned long len, unsigned long pgoff, unsigned long flags) > { > @@ -115,6 +120,65 @@ hugetlb_get_unmapped_area(struct file *f > } > #endif > > +int mmap_hugetlb_implicit(unsigned long len) > +{ > + /* Are we enabled? */ > + if (!mmap_use_hugepages) > + return 0; > + /* Must be HPAGE aligned */ > + if (len & ~HPAGE_MASK) > + return 0; > + /* Are we under the minimum size? */ > + if (mmap_hugepages_map_sz > + && len < (mmap_hugepages_map_sz << 20)) > + return 0; > + /* Do we have enough free huge pages? */ > + if (!is_hugepage_mem_enough(len)) > + return 0; Is this test safe/necessary? i.e. a) is there any potential race which could cause the mmap() to fail because it's short of memory despite suceeding the test here and b) can't we just let the mmap fail and fall back then rather than checking beforehand? Do we need/want any consideration of the given "hint" address here? 
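
Question (b) above amounts to replacing a check-then-act sequence with a
single attempt that is allowed to fail. A minimal user-space sketch of
that idea follows, assuming a hypothetical free-page counter
(free_hugepages) and helper (reserve_hugepages()); neither exists in the
patch, and the real kernel accounting is different.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical pool counter; the real kernel accounting differs. */
static atomic_long free_hugepages;

/*
 * Try to take nr pages from the pool in one atomic step.
 * Returns true on success, false if the pool is too small.
 * There is no window between the check and the reservation,
 * so a separate "is there enough memory" test is not needed.
 */
static bool reserve_hugepages(long nr)
{
	long old = atomic_load(&free_hugepages);

	do {
		if (old < nr)
			return false;	/* caller falls back to normal pages */
		/* on failure, old is reloaded with the current value */
	} while (!atomic_compare_exchange_weak(&free_hugepages, &old, old - nr));

	return true;
}
```

A caller would simply attempt the reservation and use regular pages when
it returns false, rather than testing beforehand and hoping the answer
is still true by the time the pages are needed.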
> + return 1; > +} > + > +unsigned long > +try_hugetlb_get_unmapped_area(struct file *file, unsigned long addr, > + unsigned long len, unsigned long pgoff, unsigned long *flags) > +{ > + long pre_error = 0; > + > + /* Check some prerequisites */ > + if (!capable(CAP_IPC_LOCK)) > + pre_error = -EPERM; > + else if (file) > + pre_error = -EINVAL; We can't use the file argument, and the only caller passes NULL, so it shouldn't be there at all. > + /* Explicit requests for huge pages are allowed to return errors */ > + if (*flags & MAP_HUGETLB) { > + if (pre_error) > + return pre_error; > + return hugetlb_get_unmapped_area(NULL, addr, len, pgoff, *flags); > + } > + > + /* > + * When implicit request fails, return 0 so we can > + * retry later with regular pages. > + */ > + if (mmap_hugetlb_implicit(len)) { > + if (pre_error) > + goto out; > + addr = hugetlb_get_unmapped_area(NULL, addr, len, pgoff, *flags); > + if (IS_ERR((void *)addr)) > + goto out; > + else { > + *flags |= MAP_HUGETLB; > + return addr; > + } > + } > + > +out: > + *flags &= ~MAP_HUGETLB; > + return 0; > +} This does assume that 0 is never a valid address returned for a hugepage range. That's true now, but it makes be slightly uncomfortable, since there's no inherent reason we couldn't make segment zero a hugepage segment. > /* > * Read a page. Again trivial. If it didn't already exist > * in the page cache, it is zero-filled. > diff -purN linux-2.6.0/include/asm-i386/mman.h linux-2.6.0-implicit/include/asm-i386/mman.h > +++ linux-2.6.0-implicit/include/asm-i386/mman.h 2004-01-08 16:19:31.000000000 -0800 > @@ -11,6 +11,11 @@ > > #define MAP_SHARED 0x01 /* Share changes */ > #define MAP_PRIVATE 0x02 /* Changes are private */ > +#ifdef CONFIG_HUGETLB_PAGE > +#define MAP_HUGETLB 0x04 /* Use huge pages */ > +#else > +#define MAP_HUGETLB 0x00 > +#endif > #define MAP_TYPE 0x0f /* Mask for type of mapping */ I think MAP_HUGETLB should lie outside the MAP_TYPE bits. 
It doesn't specify a distinctly different mapping type like SHARED or PRIVATE, so it belongs as a flag, not in the low bits. Also, this is part of the ABI, so it shouldn't be conditional upon CONFIG options. > #define MAP_FIXED 0x10 /* Interpret addr exactly */ > #define MAP_ANONYMOUS 0x20 /* don't use a file */ > diff -purN linux-2.6.0/include/asm-ppc64/mman.h linux-2.6.0-implicit/include/asm-ppc64/mman.h > +++ linux-2.6.0-implicit/include/asm-ppc64/mman.h 2004-01-08 16:19:31.000000000 -0800 > @@ -18,6 +18,11 @@ > > #define MAP_SHARED 0x01 /* Share changes */ > #define MAP_PRIVATE 0x02 /* Changes are private */ > +#ifdef CONFIG_HUGETLB_PAGE > +#define MAP_HUGETLB 0x04 > +#else > +#define MAP_HUGETLB 0x0 > +#endif Ditto. > #define MAP_TYPE 0x0f /* Mask for type of mapping */ > #define MAP_FIXED 0x10 /* Interpret addr exactly */ > #define MAP_ANONYMOUS 0x20 /* don't use a file */ > diff -purN linux-2.6.0/include/linux/hugetlb.h linux-2.6.0-implicit/include/linux/hugetlb.h > +++ linux-2.6.0-implicit/include/linux/hugetlb.h 2004-01-08 16:19:31.000000000 -0800 > @@ -118,4 +118,9 @@ static inline void set_file_hugepages(st > > #endif /* !CONFIG_HUGETLBFS */ > > +unsigned long > +hugetlb_get_unmapped_area(struct file *, unsigned long, unsigned long, > + unsigned long, unsigned long); > + > + > #endif /* _LINUX_HUGETLB_H */ > diff -purN linux-2.6.0/include/linux/mman.h linux-2.6.0-implicit/include/linux/mman.h > +++ linux-2.6.0-implicit/include/linux/mman.h 2004-01-08 16:19:31.000000000 -0800 > @@ -58,6 +58,9 @@ calc_vm_flag_bits(unsigned long flags) > return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) | > _calc_vm_trans(flags, MAP_DENYWRITE, VM_DENYWRITE ) | > _calc_vm_trans(flags, MAP_EXECUTABLE, VM_EXECUTABLE) | > +#ifdef CONFIG_HUGETLB_PAGE > + _calc_vm_trans(flags, MAP_HUGETLB, VM_HUGETLB ) | > +#endif > _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ); > } > > diff -purN linux-2.6.0/include/linux/sysctl.h linux-2.6.0-implicit/include/linux/sysctl.h > +++ 
linux-2.6.0-implicit/include/linux/sysctl.h 2004-01-08 16:19:31.000000000 -0800 > @@ -127,6 +127,10 @@ enum > KERN_PANIC_ON_OOPS=57, /* int: whether we will panic on an oops */ > KERN_HPPA_PWRSW=58, /* int: hppa soft-power enable */ > KERN_HPPA_UNALIGNED=59, /* int: hppa unaligned-trap enable */ > + KERN_SHMUSEHUGEPAGES=60, /* int: back shm with huge pages */ > + KERN_MMAPUSEHUGEPAGES=61, /* int: back anon mmap with huge pages */ > + KERN_HPAGES_PER_FILE=62, /* int: max bigpages per file */ > + KERN_HPAGES_MAP_SZ=63, /* int: min size (MB) of mapping */ > }; > > > diff -purN linux-2.6.0/ipc/shm.c linux-2.6.0-implicit/ipc/shm.c > +++ linux-2.6.0-implicit/ipc/shm.c 2004-01-08 16:19:31.000000000 -0800 > @@ -32,6 +32,9 @@ > > #define shm_flags shm_perm.mode > > +extern int shm_use_hugepages; > +extern int shm_hugepages_per_file; > + > static struct file_operations shm_file_operations; > static struct vm_operations_struct shm_vm_ops; > > @@ -165,6 +168,31 @@ static struct vm_operations_struct shm_v > .nopage = shmem_nopage, > }; > > +#ifdef CONFIG_HUGETLBFS > +int shm_with_hugepages(int shmflag, size_t size) > +{ > + /* flag specified explicitly */ > + if (shmflag & SHM_HUGETLB) > + return 1; > + /* Are we disabled? */ > + if (!shm_use_hugepages) > + return 0; > + /* Must be HPAGE aligned */ > + if (size & ~HPAGE_MASK) > + return 0; > + /* Are we under the max per file? */ > + if ((size >> HPAGE_SHIFT) > shm_hugepages_per_file) > + return 0; I don't really understand this per-file restriction. More comments below. > + /* Do we have enough free huge pages? */ > + if (!is_hugepage_mem_enough(size)) > + return 0; Same concerns with this test as in the mmap case. 
> + return 1; > +} > +#else > +int shm_with_hugepages(int shmflag, size_t size) { return 0; } > +#endif > + > static int newseg (key_t key, int shmflg, size_t size) > { > int error; > @@ -194,8 +222,10 @@ static int newseg (key_t key, int shmflg > return error; > } > > - if (shmflg & SHM_HUGETLB) > + if (shm_with_hugepages(shmflg, size)) { > + shmflg |= SHM_HUGETLB; > file = hugetlb_zero_setup(size); > + } > else { > sprintf (name, "SYSV%08x", key); > file = shmem_file_setup(name, size, VM_ACCOUNT); > diff -purN linux-2.6.0/kernel/sysctl.c linux-2.6.0-implicit/kernel/sysctl.c > +++ linux-2.6.0-implicit/kernel/sysctl.c 2004-01-08 16:19:31.000000000 -0800 > @@ -60,6 +60,8 @@ extern int cad_pid; > extern int pid_max; > extern int sysctl_lower_zone_protection; > extern int min_free_kbytes; > +extern int shm_use_hugepages, shm_hugepages_per_file; > +extern int mmap_use_hugepages, mmap_hugepages_map_sz; > > /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */ > static int maxolduid = 65535; > @@ -579,6 +581,40 @@ static ctl_table kern_table[] = { > .mode = 0644, > .proc_handler = &proc_dointvec, > }, > +#ifdef CONFIG_HUGETLBFS > + { > + .ctl_name = KERN_SHMUSEHUGEPAGES, > + .procname = "shm-use-hugepages", > + .data = &shm_use_hugepages, > + .maxlen = sizeof(int), > + .mode = 0644, > + .proc_handler = &proc_dointvec, > + }, > + { > + .ctl_name = KERN_MMAPUSEHUGEPAGES, > + .procname = "mmap-use-hugepages", > + .data = &mmap_use_hugepages, > + .maxlen = sizeof(int), > + .mode = 0644, > + .proc_handler = &proc_dointvec, > + }, > + { > + .ctl_name = KERN_HPAGES_PER_FILE, > + .procname = "shm-hugepages-per-file", > + .data = &shm_hugepages_per_file, > + .maxlen = sizeof(int), > + .mode = 0644, > + .proc_handler = &proc_dointvec, > + }, > + { > + .ctl_name = KERN_HPAGES_MAP_SZ, > + .procname = "mmap-hugepages-min-mapping", > + .data = &mmap_hugepages_map_sz, > + .maxlen = sizeof(int), > + .mode 0644, > + .proc_handler = &proc_dointvec, > + }, > 
+#endif > { .ctl_name = 0 } > }; > > diff -purN linux-2.6.0/mm/mmap.c linux-2.6.0-implicit/mm/mmap.c > +++ linux-2.6.0-implicit/mm/mmap.c 2004-01-08 16:20:10.000000000 -0800 > @@ -20,6 +20,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -59,6 +60,9 @@ EXPORT_SYMBOL(sysctl_overcommit_memory); > EXPORT_SYMBOL(sysctl_overcommit_ratio); > EXPORT_SYMBOL(vm_committed_space); > > +int mmap_use_hugepages = 0; > +int mmap_hugepages_map_sz = 256; > + > /* > * Requires inode->i_mapping->i_shared_sem > */ > @@ -473,7 +477,7 @@ unsigned long do_mmap_pgoff(struct file > int correct_wcount = 0; > int error; > struct rb_node ** rb_link, * rb_parent; > - unsigned long charged = 0; > + unsigned long charged = 0, addr_save = addr; > > if (file) { > if (!file->f_op || !file->f_op->mmap) > @@ -501,8 +505,17 @@ unsigned long do_mmap_pgoff(struct file > > /* Obtain the address to map to. we verify (or select) it and ensure > * that it represents a valid section of the address space. > + * VM_HUGETLB will never appear in vm_flags when CONFIG_HUGETLB is > + * unset. > */ > - addr = get_unmapped_area(file, addr, len, pgoff, flags); > +#ifdef CONFIG_HUGETLBFS > + addr = try_hugetlb_get_unmapped_area(NULL, addr, len, pgoff, &flags); > + if (IS_ERR((void *)addr)) > + return addr; This doesn't look right - we don't fall back if try_hugetlb...() fails. But it can fail if we don't have the right permissions, for one thing in which case we certainly do want to fall back. > + else if (addr == 0) > +#endif > + addr = get_unmapped_area(file, addr_save, len, pgoff, flags); Hmm... yes. I think the logic would be simpler if try_hugetlb..() always returned error, rather than zero and we fall back in all cases. That also lets us eliminate the ugly #ifdef by defining try_hugetlb...() to -ENOSYS in the !CONFIG_HUGETLBFS case. 
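
The simplification suggested here can be sketched in user space as
below. All names (try_hugetlb_area, normal_area, pick_area, the two
boolean knobs) and the returned addresses are made up for illustration,
and addr_is_err() is only a simplified stand-in for the kernel's
IS_ERR(). How explicit MAP_HUGETLB requests should report errors is left
out of the sketch.

```c
#include <errno.h>
#include <stdbool.h>

/* Simplified stand-in for the kernel's IS_ERR() on unsigned long values. */
#define MAX_ERRNO 4095UL

static bool addr_is_err(unsigned long addr)
{
	return addr >= (unsigned long)-MAX_ERRNO;
}

/* Hypothetical knobs so the sketch can be exercised. */
static bool hugetlb_configured = true;
static bool hugetlb_allowed = true;

/* Always returns either a usable address or an encoded -errno, never 0. */
static unsigned long try_hugetlb_area(unsigned long len)
{
	(void)len;
	if (!hugetlb_configured)
		return (unsigned long)-ENOSYS;	/* the !CONFIG_HUGETLBFS stub */
	if (!hugetlb_allowed)
		return (unsigned long)-EPERM;	/* e.g. missing capability */
	return 0x80000000UL;			/* made-up hugepage address */
}

static unsigned long normal_area(unsigned long len)
{
	(void)len;
	return 0x10000000UL;			/* made-up normal address */
}

/* Caller: any failure of the hugepage attempt falls back, no #ifdef needed. */
static unsigned long pick_area(unsigned long len)
{
	unsigned long addr = try_hugetlb_area(len);

	if (addr_is_err(addr))
		addr = normal_area(len);
	return addr;
}
```

Because zero is no longer used as a "please fall back" marker, the
caller has one error test to make, and address 0 stays available as a
legitimate result.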
> if (addr & ~PAGE_MASK) > return addr; > > @@ -566,6 +579,9 @@ unsigned long do_mmap_pgoff(struct file > default: > return -EINVAL; > case MAP_PRIVATE: > +#ifdef CONFIG_HUGETLBFS > + case (MAP_PRIVATE|MAP_HUGETLB): > +#endif This bit of ugliness wouldn't be necessary if MAP_HUGETLB were up in the high bits like it should be. Also note that without my hugepage COW patches, MAP_PRIVATE semantics don't actually work on hugepages. > vm_flags &= ~(VM_SHARED | VM_MAYSHARE); > /* fall through */ > case MAP_SHARED: > @@ -650,10 +666,31 @@ munmap_back: > error = file->f_op->mmap(file, vma); > if (error) > goto unmap_and_free_vma; > - } else if (vm_flags & VM_SHARED) { > - error = shmem_zero_setup(vma); > - if (error) > - goto free_vma; > + } else if ((vm_flags & VM_SHARED) || (vm_flags & VM_HUGETLB)) { > + if (!is_vm_hugetlb_page(vma)) { > + error = shmem_zero_setup(vma); > + if (error) > + goto free_vma; > + } else { > + /* > + * Presumably hugetlb_zero_setup() acquires a > + * reference count for us. The difference > + * between this and the shmem_zero_setup() > + * case is that we can encounter an error > + * _after_ allocating the file. The error > + * path was adjusted slightly to fput() for us. > + */ > + struct file *new_file = hugetlb_zero_setup(len); > + if (IS_ERR(new_file)) { > + error = PTR_ERR(new_file); > + goto free_vma; > + } else { > + vma->vm_file = new_file; > + error = new_file->f_op->mmap(new_file, vma); > + if (error) > + goto unmap_and_free_vma; > + } > + } > } > > /* We set VM_ACCOUNT in a shared mapping's vm_flags, to inform > @@ -701,11 +738,21 @@ out: > unmap_and_free_vma: > if (correct_wcount) > atomic_inc(&inode->i_writecount); > - vma->vm_file = NULL; > - fput(file); > > - /* Undo any partial mapping done by a device driver. */ > + /* > + * Undo any partial mapping done by a device driver. > + * hugetlb wants to know the vma's file etc. so nuke > + * the file afterward. 
> + */ > zap_page_range(vma, vma->vm_start, vma->vm_end - vma->vm_start); > + > + /* > + * vma->vm_file may be different from file in the hugetlb case. > + */ > + if (vma->vm_file) > + fput(vma->vm_file); > + vma->vm_file = NULL; > + > free_vma: > kmem_cache_free(vm_area_cachep, vma); > unacct_error: > diff -purN linux-2.6.0/mm/shmem.c linux-2.6.0-implicit/mm/shmem.c > +++ linux-2.6.0-implicit/mm/shmem.c 2004-01-08 16:19:31.000000000 -0800 > @@ -40,6 +40,29 @@ > #include > #include > > +int shm_use_hugepages; > + > +/* > + * On 64bit archs the vmalloc area is very large, > + * so we allocate the array in vmalloc on 64bit archs. > + * > + * Assuming 2M pages (x86 and x86-64) those default setting > + * will allow up to 128G of bigpages in a single file on > + * 64bit archs and 64G on 32bit archs using the max > + * kmalloc size of 128k. So tweaking in practice is needed > + * only to go past 128G of bigpages per file on 64bit archs. > + * > + * This sysctl is in page units (each page large BIGPAGE_SIZE). > + */ > +#ifdef CONFIG_HUGETLBFS > +#if BITS_PER_LONG == 64 > +int shm_hugepages_per_file = 128UL << (30 - HPAGE_SHIFT); > +#else > +int shm_hugepages_per_file = 131072 / sizeof(struct page *); > +#endif > +#endif I'm not sure what array this is talking about. I don't see why this limit on the number of hugepages per file exists. > /* This magic number is used in glibc for posix shared memory */ > #define TMPFS_MAGIC 0x01021994 > -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. http://www.ozlabs.org/people/dgibson ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From agl at us.ibm.com Tue Jan 13 10:38:00 2004 From: agl at us.ibm.com (Adam Litke) Date: 12 Jan 2004 15:38:00 -0800 Subject: [RFC] implicit hugetlb pages (hugetlb_implicit) In-Reply-To: <20040112041918.GD8262@zax> References: <1073683188.1297.105.camel@agtpad> <1073683640.1297.111.camel@agtpad> <20040112041918.GD8262@zax> Message-ID: <1073950680.710.120.camel@agtpad> Thank you for your comments and suggestions. They are proving very helpful as I work to clean this up. On Sun, 2004-01-11 at 20:19, David Gibson wrote: > On Fri, Jan 09, 2004 at 01:27:20PM -0800, Adam Litke wrote: > > > > hugetlb_implicit (2.6.0): > > This patch includes the anonymous mmap work from Dave Gibson > > (right?) > > I'm not sure what you're referring to here. My patches for lbss > support also include support for copy-on-write of hugepages and > various other changes which can make them act kind of like anonymous > pages. > > But I don't see much in this patch that looks familiar. Hmm. Could the original author of hugetlb for anonymous mmap claim credit for the initial code? > > + /* Do we have enough free huge pages? */ > > + if (!is_hugepage_mem_enough(len)) > > + return 0; > > Is this test safe/necessary? i.e. a) is there any potential race > which could cause the mmap() to fail because it's short of memory > despite suceeding the test here and b) can't we just let the mmap fail > and fall back then rather than checking beforehand? You're right. Now that safe fallback is working, we might as well defer this test to get_unmapped area. > > Do we need/want any consideration of the given "hint" address here? I am trying to do what the kernel does for normal mmaps here. If someone hints at an address, they hopefully have a good reason for it. I wouldn't want to override it just so I can do implicit hugetlb. Most applications pass NULL for the hint right? 
> > +	/* Explicit requests for huge pages are allowed to return errors */
> > +	if (*flags & MAP_HUGETLB) {
> > +		if (pre_error)
> > +			return pre_error;
> > +		return hugetlb_get_unmapped_area(NULL, addr, len, pgoff, *flags);
> > +	}
> > +
> > +	/*
> > +	 * When implicit request fails, return 0 so we can
> > +	 * retry later with regular pages.
> > +	 */
> > +	if (mmap_hugetlb_implicit(len)) {
> > +		if (pre_error)
> > +			goto out;
> > +		addr = hugetlb_get_unmapped_area(NULL, addr, len, pgoff, *flags);
> > +		if (IS_ERR((void *)addr))
> > +			goto out;
> > +		else {
> > +			*flags |= MAP_HUGETLB;
> > +			return addr;
> > +		}
> > +	}
> > +
> > +out:
> > +	*flags &= ~MAP_HUGETLB;
> > +	return 0;
> > +}
>
> This does assume that 0 is never a valid address returned for a
> hugepage range. That's true now, but it makes me slightly
> uncomfortable, since there's no inherent reason we couldn't make
> segment zero a hugepage segment.

You definitely found an ugly part of the patch. Cleanup in progress.

> > +#ifdef CONFIG_HUGETLBFS
> > +int shm_with_hugepages(int shmflag, size_t size)
> > +{
> > +	/* flag specified explicitly */
> > +	if (shmflag & SHM_HUGETLB)
> > +		return 1;
> > +	/* Are we disabled? */
> > +	if (!shm_use_hugepages)
> > +		return 0;
> > +	/* Must be HPAGE aligned */
> > +	if (size & ~HPAGE_MASK)
> > +		return 0;
> > +	/* Are we under the max per file? */
> > +	if ((size >> HPAGE_SHIFT) > shm_hugepages_per_file)
> > +		return 0;
>
> I don't really understand this per-file restriction. More comments
> below.

Since hugetlb pages are a relatively scarce resource, this is a
rudimentary method to ensure that one application doesn't allocate more
than its fair share of hugetlb memory.

> > +	/* Do we have enough free huge pages? */
> > +	if (!is_hugepage_mem_enough(size))
> > +		return 0;
>
> Same concerns with this test as in the mmap case.

You're right. This is racy. I haven't given the shared mem part of the
patch nearly as much attention as the mmap part.
I am going to leave this partially broken until I clean up the fallback
code for mmaps so I can put that here as well.

> > @@ -501,8 +505,17 @@ unsigned long do_mmap_pgoff(struct file
> >
> >  	/* Obtain the address to map to. we verify (or select) it and ensure
> >  	 * that it represents a valid section of the address space.
> > +	 * VM_HUGETLB will never appear in vm_flags when CONFIG_HUGETLB is
> > +	 * unset.
> >  	 */
> > -	addr = get_unmapped_area(file, addr, len, pgoff, flags);
> > +#ifdef CONFIG_HUGETLBFS
> > +	addr = try_hugetlb_get_unmapped_area(NULL, addr, len, pgoff, &flags);
> > +	if (IS_ERR((void *)addr))
> > +		return addr;
>
> This doesn't look right - we don't fall back if try_hugetlb...()
> fails. But it can fail if we don't have the right permissions, for
> one thing in which case we certainly do want to fall back.

I admit this is messy and I am working on cleaning it up.

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/

From linas at austin.ibm.com Tue Jan 13 10:40:28 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Mon, 12 Jan 2004 17:40:28 -0600
Subject: EEH Error Recovery
Message-ID: <20040112174028.A30368@forte.austin.ibm.com>

I am about to start working on EEH error recovery. Looks to me like
this will be a long and complicated process. Any
thoughts/opinions/requirements regarding this topic?

Anyone else out there who is twiddling EEH? Whose toes might I be
stepping on? Who might I have to work with closely?

--linas

** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/

From david at gibson.dropbear.id.au Tue Jan 13 10:57:08 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Tue, 13 Jan 2004 10:57:08 +1100
Subject: [RFC] implicit hugetlb pages (hugetlb_implicit)
In-Reply-To: <1073950680.710.120.camel@agtpad>
References: <1073683188.1297.105.camel@agtpad> <1073683640.1297.111.camel@agtpad> <20040112041918.GD8262@zax> <1073950680.710.120.camel@agtpad>
Message-ID: <20040112235708.GB10302@zax>

On Mon, Jan 12, 2004 at 03:38:00PM -0800, Adam Litke wrote:
> Thank you for your comments and suggestions. They are proving very
> helpful as I work to clean this up.

Glad to hear it :)

> On Sun, 2004-01-11 at 20:19, David Gibson wrote:
> > On Fri, Jan 09, 2004 at 01:27:20PM -0800, Adam Litke wrote:
> > >
> > > hugetlb_implicit (2.6.0):
> > > This patch includes the anonymous mmap work from Dave Gibson
> > > (right?)
> >
> > I'm not sure what you're referring to here. My patches for lbss
> > support also include support for copy-on-write of hugepages and
> > various other changes which can make them act kind of like anonymous
> > pages.
> >
> > But I don't see much in this patch that looks familiar.
>
> Hmm. Could the original author of hugetlb for anonymous mmap claim
> credit for the initial code?

I think I once knew who it was, but I've forgotten, sorry.

Incidentally, you probably do want to fold in my hugepage-COW stuff
(although it does mean some more generic changes). Otherwise hugepages
are always MAP_SHARED, which means with an implicit hugepage mmap()
certain regions of memory will silently have totally different
semantics to what you expect - it could get very weird across a fork().
And for that matter there's at least one plain-old-bug in the current
hugepage code which is addressed in my patch (the LOW_HPAGES bit isn't
propagated correctly across a fork()).

I'll attach my patch, which also includes the hugepage ELF segment stuff.
I'm afraid I haven't had a chance to separate out those parts of the
patch yet.

> > > +	/* Do we have enough free huge pages? */
> > > +	if (!is_hugepage_mem_enough(len))
> > > +		return 0;
> >
> > Is this test safe/necessary? i.e. a) is there any potential race
> > which could cause the mmap() to fail because it's short of memory
> > despite succeeding the test here and b) can't we just let the mmap
> > fail and fall back then rather than checking beforehand?
>
> You're right. Now that safe fallback is working, we might as well
> defer this test to get_unmapped_area.

Ok.

> > Do we need/want any consideration of the given "hint" address here?
>
> I am trying to do what the kernel does for normal mmaps here. If
> someone hints at an address, they hopefully have a good reason for it.
> I wouldn't want to override it just so I can do implicit hugetlb. Most
> applications pass NULL for the hint, right?

That's kind of my point: what if someone gives a hugepage aligned size
with a non-aligned hint address - currently the test is only on the
size. We either have to map at somewhere other than the hint address
(which is what the patch does now, I think), or only attempt a hugepage
map if the hint address is also aligned.

This is for the case where the hint really is a hint, of course - so we
don't have to obey it. If it's MAP_FIXED it's a different code path,
and we never attempt a hugepage mapping (unless it's explicitly from
hugetlbfs). Perhaps we should, though.

> > > +	/* Explicit requests for huge pages are allowed to return errors */
> > > +	if (*flags & MAP_HUGETLB) {
> > > +		if (pre_error)
> > > +			return pre_error;
> > > +		return hugetlb_get_unmapped_area(NULL, addr, len, pgoff, *flags);
> > > +	}
> > > +
> > > +	/*
> > > +	 * When implicit request fails, return 0 so we can
> > > +	 * retry later with regular pages.
> > > +	 */
> > > +	if (mmap_hugetlb_implicit(len)) {
> > > +		if (pre_error)
> > > +			goto out;
> > > +		addr = hugetlb_get_unmapped_area(NULL, addr, len, pgoff, *flags);
> > > +		if (IS_ERR((void *)addr))
> > > +			goto out;
> > > +		else {
> > > +			*flags |= MAP_HUGETLB;
> > > +			return addr;
> > > +		}
> > > +	}
> > > +
> > > +out:
> > > +	*flags &= ~MAP_HUGETLB;
> > > +	return 0;
> > > +}
> >
> > This does assume that 0 is never a valid address returned for
> > a hugepage range. That's true now, but it makes me slightly
> > uncomfortable, since there's no inherent reason we couldn't make
> > segment zero a hugepage segment.
>
> You definitely found an ugly part of the patch. Cleanup in progress.

Excellent.

> > > +#ifdef CONFIG_HUGETLBFS
> > > +int shm_with_hugepages(int shmflag, size_t size)
> > > +{
> > > +	/* flag specified explicitly */
> > > +	if (shmflag & SHM_HUGETLB)
> > > +		return 1;
> > > +	/* Are we disabled? */
> > > +	if (!shm_use_hugepages)
> > > +		return 0;
> > > +	/* Must be HPAGE aligned */
> > > +	if (size & ~HPAGE_MASK)
> > > +		return 0;
> > > +	/* Are we under the max per file? */
> > > +	if ((size >> HPAGE_SHIFT) > shm_hugepages_per_file)
> > > +		return 0;
> >
> > I don't really understand this per-file restriction. More comments
> > below.
>
> Since hugetlb pages are a relatively scarce resource, this is a
> rudimentary method to ensure that one application doesn't allocate
> more than its fair share of hugetlb memory.

Ah, ok. It's probably worth adding a comment or two to that effect.
At the moment I don't think this is particularly necessary, since you
need root (well CAP_IPC_LOCK) to allocate hugepages. But we may well
want to change that, so some sort of limit is probably a good idea. I
wonder if there is a more direct way of accomplishing this.

> > > +	/* Do we have enough free huge pages? */
> > > +	if (!is_hugepage_mem_enough(size))
> > > +		return 0;
> >
> > Same concerns with this test as in the mmap case.
>
> You're right. This is racy. I haven't given the shared mem part of the
> patch nearly as much attention as the mmap part. I am going to leave
> this partially broken until I clean up the fallback code for mmaps so
> I can put that here as well.

Fair enough.

> > > @@ -501,8 +505,17 @@ unsigned long do_mmap_pgoff(struct file
> > >
> > >  	/* Obtain the address to map to. we verify (or select) it and ensure
> > >  	 * that it represents a valid section of the address space.
> > > +	 * VM_HUGETLB will never appear in vm_flags when CONFIG_HUGETLB is
> > > +	 * unset.
> > >  	 */
> > > -	addr = get_unmapped_area(file, addr, len, pgoff, flags);
> > > +#ifdef CONFIG_HUGETLBFS
> > > +	addr = try_hugetlb_get_unmapped_area(NULL, addr, len, pgoff, &flags);
> > > +	if (IS_ERR((void *)addr))
> > > +		return addr;
> >
> > This doesn't look right - we don't fall back if try_hugetlb...()
> > fails. But it can fail if we don't have the right permissions, for
> > one thing in which case we certainly do want to fall back.
>
> I admit this is messy and I am working on cleaning it up.

Great.

--
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson -------------- next part -------------- diff -urN ppc64-linux-2.5/arch/ppc64/mm/hugetlbpage.c linux-gogogo/arch/ppc64/mm/hugetlbpage.c --- ppc64-linux-2.5/arch/ppc64/mm/hugetlbpage.c 2003-10-14 22:33:33.000000000 +1000 +++ linux-gogogo/arch/ppc64/mm/hugetlbpage.c 2003-11-25 17:04:25.000000000 +1100 @@ -118,6 +118,16 @@ #define hugepte_page(x) pfn_to_page(hugepte_pfn(x)) #define hugepte_none(x) (!(hugepte_val(x) & _HUGEPAGE_PFN)) +#define hugepte_write(x) (hugepte_val(x) & _HUGEPAGE_RW) +#define hugepte_same(A,B) \ + (((hugepte_val(A) ^ hugepte_val(B)) & ~_HUGEPAGE_HPTEFLAGS) == 0) + +static inline hugepte_t hugepte_mkwrite(hugepte_t pte) +{ + hugepte_val(pte) |= _HUGEPAGE_RW; + return pte; +} + static void free_huge_page(struct page *page); static void flush_hash_hugepage(mm_context_t context, unsigned long ea, @@ -219,20 +229,6 @@ pmd_clear((pmd_t *)(ptep+i)); } -/* - * This function checks for proper alignment of input addr and len parameters. - */ -int is_aligned_hugepage_range(unsigned long addr, unsigned long len) -{ - if (len & ~HPAGE_MASK) - return -EINVAL; - if (addr & ~HPAGE_MASK) - return -EINVAL; - if (! is_hugepage_only_range(addr, len)) - return -EINVAL; - return 0; -} - static void do_slbia(void *unused) { asm volatile ("isync; slbia; isync":::"memory"); @@ -251,8 +247,11 @@ /* Check no VMAs are in the region */ vma = find_vma(mm, TASK_HPAGE_BASE_32); - if (vma && (vma->vm_start < TASK_HPAGE_END_32)) + if (vma && (vma->vm_start < TASK_HPAGE_END_32)) { + printk(KERN_DEBUG "Low HTLB region busy: PID=%d vma @ %lx-%lx\n", + current->pid, vma->vm_start, vma->vm_end); return -EBUSY; + } /* Clean up any leftover PTE pages in the region */ spin_lock(&mm->page_table_lock); @@ -293,6 +292,43 @@ return 0; } +int is_aligned_hugepage_range(unsigned long addr, unsigned long len) +{ + if (len & ~HPAGE_MASK) + return -EINVAL; + if (addr & ~HPAGE_MASK) + return -EINVAL; + if (! 
is_hugepage_only_range(addr, len)) + return -EINVAL; + return 0; +} + +int is_potential_hugepage_range(unsigned long addr, unsigned long len) +{ + if (len & ~HPAGE_MASK) + return -EINVAL; + if (addr & ~HPAGE_MASK) + return -EINVAL; + if (! is_hugepage_potential_range(addr, len)) + return -EINVAL; + return 0; +} + + +int prepare_hugepage_range(unsigned long addr, unsigned long len) +{ + int ret; + + BUG_ON(is_potential_hugepage_range(addr, len) != 0); + + if (is_hugepage_low_range(addr, len)) { + ret = open_32bit_htlbpage_range(current->mm); + if (ret) + return ret; + } + return 0; +} + int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, struct vm_area_struct *vma) { @@ -300,6 +336,16 @@ struct page *ptepage; unsigned long addr = vma->vm_start; unsigned long end = vma->vm_end; + cpumask_t tmp; + int cow; + int local; + + /* XXX are there races with checking cpu_vm_mask? - Anton */ + tmp = cpumask_of_cpu(smp_processor_id()); + if (cpus_equal(vma->vm_mm->cpu_vm_mask, tmp)) + local = 1; + + cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE; while (addr < end) { BUG_ON(! 
in_hugepage_area(src->context, addr)); @@ -310,6 +356,17 @@ return -ENOMEM; src_pte = hugepte_offset(src, addr); + + if (cow) { + entry = __hugepte(hugepte_update(src_pte, + _HUGEPAGE_RW + | _HUGEPAGE_HPTEFLAGS, + 0)); + if ((addr % HPAGE_SIZE) == 0) + flush_hash_hugepage(src->context, addr, + entry, local); + } + entry = *src_pte; if ((addr % HPAGE_SIZE) == 0) { @@ -483,12 +540,16 @@ struct mm_struct *mm = current->mm; unsigned long addr; int ret = 0; + int writable; WARN_ON(!is_vm_hugetlb_page(vma)); BUG_ON((vma->vm_start % HPAGE_SIZE) != 0); BUG_ON((vma->vm_end % HPAGE_SIZE) != 0); spin_lock(&mm->page_table_lock); + + writable = (vma->vm_flags & VM_WRITE) && (vma->vm_flags & VM_SHARED); + for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) { unsigned long idx; hugepte_t *pte = hugepte_alloc(mm, addr); @@ -518,15 +579,25 @@ ret = -ENOMEM; goto out; } - ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC); - unlock_page(page); + /* This is a new page, all full of zeroes. If + * we're MAP_SHARED, the page needs to go into + * the page cache. If it's MAP_PRIVATE it + * might as well be made "anonymous" now or + * we'll just have to copy it on the first + * write. */ + if (vma->vm_flags & VM_SHARED) { + ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC); + unlock_page(page); + } else { + writable = (vma->vm_flags & VM_WRITE); + } if (ret) { hugetlb_put_quota(mapping); free_huge_page(page); goto out; } } - setup_huge_pte(mm, page, pte, vma->vm_flags & VM_WRITE); + setup_huge_pte(mm, page, pte, writable); } out: spin_unlock(&mm->page_table_lock); @@ -659,10 +730,9 @@ if (!in_hugepage_area(mm->context, ea)) return -1; - ea &= ~(HPAGE_SIZE-1); - /* We have to find the first hugepte in the batch, since * that's the one that will store the HPTE flags */ + ea &= HPAGE_MASK; ptep = hugepte_offset(mm, ea); /* Search the Linux page table for a match with va */ @@ -683,7 +753,7 @@ * prevented then send the problem up to do_page_fault. 
*/ is_write = access & _PAGE_RW; - if (unlikely(is_write && !(hugepte_val(*ptep) & _HUGEPAGE_RW))) + if (unlikely(is_write && !hugepte_write(*ptep))) return 1; /* @@ -886,10 +956,11 @@ spin_unlock(&htlbpage_lock); } htlbpage_max = htlbpage_free = htlbpage_total = i; - printk("Total HugeTLB memory allocated, %d\n", htlbpage_free); + printk(KERN_INFO "Total HugeTLB memory allocated, %d\n", + htlbpage_free); } else { htlbpage_max = 0; - printk("CPU does not support HugeTLB\n"); + printk(KERN_INFO "CPU does not support HugeTLB\n"); } return 0; @@ -914,6 +985,121 @@ return (size + ~HPAGE_MASK)/HPAGE_SIZE <= htlbpage_free; } +static int hugepage_cow(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, hugepte_t *ptep, hugepte_t pte) +{ + struct page *old_page, *new_page; + int i; + cpumask_t tmp; + int local; + + BUG_ON(!pfn_valid(hugepte_pfn(*ptep))); + + old_page = hugepte_page(*ptep); + + /* XXX are there races with checking cpu_vm_mask? - Anton */ + tmp = cpumask_of_cpu(smp_processor_id()); + if (cpus_equal(vma->vm_mm->cpu_vm_mask, tmp)) + local = 1; + + /* If no-one else is actually using this page, avoid the copy + * and just make the page writable */ + if (!TestSetPageLocked(old_page)) { + int avoidcopy = (page_count(old_page) == 1); + unlock_page(old_page); + if (avoidcopy) { + for (i = 0; i < HUGEPTE_BATCH_SIZE; i++) + set_hugepte(ptep+i, hugepte_mkwrite(pte)); + + + pte = __hugepte(hugepte_update(ptep, _HUGEPAGE_HPTEFLAGS, 0)); + if (hugepte_val(pte) & _HUGEPAGE_HASHPTE) + flush_hash_hugepage(mm->context, address, + pte, local); + spin_unlock(&mm->page_table_lock); + return VM_FAULT_MINOR; + } + } + + page_cache_get(old_page); + + spin_unlock(&mm->page_table_lock); + + new_page = alloc_hugetlb_page(); + if (! new_page) { + page_cache_release(old_page); + + /* Logically this is OOM, not a SIGBUS, but an OOM + * could cause the kernel to go killing other + * processes which won't help the hugepage situation + * at all (?) 
*/ + return VM_FAULT_SIGBUS; + } + + for (i = 0; i < HPAGE_SIZE/PAGE_SIZE; i++) + copy_user_highpage(new_page + i, old_page + i, address + i*PAGE_SIZE); + + spin_lock(&mm->page_table_lock); + + /* XXX are there races with checking cpu_vm_mask? - Anton */ + tmp = cpumask_of_cpu(smp_processor_id()); + if (cpus_equal(vma->vm_mm->cpu_vm_mask, tmp)) + local = 1; + + ptep = hugepte_offset(mm, address); + if (hugepte_same(*ptep, pte)) { + /* Break COW */ + for (i = 0; i < HUGEPTE_BATCH_SIZE; i++) + hugepte_update(ptep, ~0, + hugepte_val(mk_hugepte(new_page, 1))); + + if (hugepte_val(pte) & _HUGEPAGE_HASHPTE) + flush_hash_hugepage(mm->context, address, + pte, local); + + /* Make the old page be freed below */ + new_page = old_page; + } + page_cache_release(new_page); + page_cache_release(old_page); + spin_unlock(&mm->page_table_lock); + return VM_FAULT_MINOR; +} + +int handle_hugetlb_mm_fault(struct mm_struct *mm, struct vm_area_struct * vma, + unsigned long address, int write_access) +{ + hugepte_t *ptep; + int rc = VM_FAULT_SIGBUS; + + spin_lock(&mm->page_table_lock); + + ptep = hugepte_offset(mm, address & HPAGE_MASK); + + if ( (! ptep) || hugepte_none(*ptep)) + goto fail; + + /* Otherwise, there ought to be a real hugepte here */ + BUG_ON(hugepte_bad(*ptep)); + + rc = VM_FAULT_MINOR; + + if (! (write_access && !hugepte_write(*ptep))) { + printk(KERN_WARNING "Unexpected hugepte fault (wr=%d hugepte=%08x\n", + write_access, hugepte_val(*ptep)); + goto fail; + } + + /* The only faults we should actually get are COWs */ + /* this drops the page_table_lock */ + return hugepage_cow(mm, vma, address, ptep, *ptep); + + fail: + spin_unlock(&mm->page_table_lock); + + return rc; +} + /* * We cannot handle pagefaults against hugetlb pages at all. 
They cause * handle_mm_fault() to try to instantiate regular-sized pages in the diff -urN ppc64-linux-2.5/arch/ppc64/mm/init.c linux-gogogo/arch/ppc64/mm/init.c --- ppc64-linux-2.5/arch/ppc64/mm/init.c 2003-10-24 09:50:18.000000000 +1000 +++ linux-gogogo/arch/ppc64/mm/init.c 2003-11-25 14:29:53.000000000 +1100 @@ -549,7 +549,11 @@ ++ptep; } while (start < pmd_end); } else { - WARN_ON(pmd_hugepage(*pmd)); + /* We don't need to flush huge + * pages here, because that's + * done in + * copy_hugetlb_page_range() + * if necessary */ start = pmd_end; } ++pmd; diff -urN ppc64-linux-2.5/fs/binfmt_elf.c linux-gogogo/fs/binfmt_elf.c --- ppc64-linux-2.5/fs/binfmt_elf.c 2003-10-23 08:29:46.000000000 +1000 +++ linux-gogogo/fs/binfmt_elf.c 2003-11-27 15:58:12.000000000 +1100 @@ -265,11 +265,81 @@ #ifndef elf_map +#ifdef CONFIG_HUGETLBFS +#include + +static unsigned long elf_htlb_map(struct file *filep, unsigned long addr, + struct elf_phdr *eppnt, int prot, int type) +{ + struct file *htlbfile; + unsigned long start, len; + unsigned long map_addr; + int retval; + + printk(KERN_DEBUG "Found HTLB ELF segment %lx-%lx\n", + addr, addr + eppnt->p_memsz); + start = addr & HPAGE_MASK; + len = ALIGN(eppnt->p_memsz + (addr & ~HPAGE_MASK), HPAGE_SIZE); + + /* If we have data from the file to put in the segment, we + * have to make it writable, so that we can read it in there + * (mprotect() doesn't work on hugepages */ + if (eppnt->p_filesz != 0) + prot |= PROT_WRITE; + + if (is_potential_hugepage_range(start, len) != 0) { + printk(KERN_WARNING "HTLB ELF segment is not a valid hugepage range\n"); + return -EINVAL; + } + + htlbfile = hugetlb_zero_setup(eppnt->p_memsz); + if (IS_ERR(htlbfile)) { + printk(KERN_WARNING "Unable to allocate HTLB ELF segment (%ld)\n", + PTR_ERR(htlbfile)); + return PTR_ERR(htlbfile); + } + set_file_hugepages(htlbfile); + down_write(&current->mm->mmap_sem); + map_addr = do_mmap(htlbfile, start, len, prot, type, 0); + up_write(&current->mm->mmap_sem); + fput(htlbfile); + + if
(eppnt->p_filesz != 0) { + loff_t pos = eppnt->p_offset; + + printk("Reading %lu bytes of file data into HTLB segment\n", + (unsigned long) eppnt->p_filesz); + retval = vfs_read(filep, (void __user *)addr, eppnt->p_filesz, &pos); + printk("HTLB read returned %d\n", retval); + if (retval < 0) { + extern asmlinkage long sys_munmap(unsigned long, size_t); + sys_munmap(start, len); + return retval; + } + } + + + return map_addr; +} +#else +static inline int elf_htlb_map(struct file *filep, unsigned long addr, + struct elf_phdr *eppnt, int prot, int type) +{ + return -ENOSYS; +} +#endif static unsigned long elf_map(struct file *filep, unsigned long addr, struct elf_phdr *eppnt, int prot, int type) { unsigned long map_addr; + if (eppnt->p_flags & PF_LINUX_HTLB) { + map_addr = elf_htlb_map(filep, addr, eppnt, prot, type); + if (map_addr < (unsigned long)(-1024)) + return map_addr; + printk(KERN_DEBUG "Falling back to non HTLB allocation\n"); + } + down_write(&current->mm->mmap_sem); map_addr = do_mmap(filep, ELF_PAGESTART(addr), eppnt->p_filesz + ELF_PAGEOFFSET(eppnt->p_vaddr), prot, type, diff -urN ppc64-linux-2.5/include/asm-ppc64/mmu_context.h linux-gogogo/include/asm-ppc64/mmu_context.h --- ppc64-linux-2.5/include/asm-ppc64/mmu_context.h 2003-09-12 21:06:51.000000000 +1000 +++ linux-gogogo/include/asm-ppc64/mmu_context.h 2003-11-25 13:07:49.000000000 +1100 @@ -80,6 +80,8 @@ { long head; unsigned long flags; + /* This does the right thing across a fork (I hope) */ + unsigned long low_hpages = mm->context & CONTEXT_LOW_HPAGES; spin_lock_irqsave(&mmu_context_queue.lock, flags); @@ -90,6 +92,7 @@ head = mmu_context_queue.head; mm->context = mmu_context_queue.elements[head]; + mm->context |= low_hpages; head = (head < LAST_USER_CONTEXT-1) ?
head+1 : 0; mmu_context_queue.head = head; diff -urN ppc64-linux-2.5/include/asm-ppc64/page.h linux-gogogo/include/asm-ppc64/page.h --- ppc64-linux-2.5/include/asm-ppc64/page.h 2003-09-12 21:06:51.000000000 +1000 +++ linux-gogogo/include/asm-ppc64/page.h 2003-11-24 18:00:54.000000000 +1100 @@ -37,11 +37,22 @@ #define TASK_HPAGE_END_32 (0xc0000000UL) #define ARCH_HAS_HUGEPAGE_ONLY_RANGE +#define ARCH_HAS_PREPARE_HUGEPAGE_RANGE + +#define is_hugepage_low_range(addr, len) \ + (((addr) > (TASK_HPAGE_BASE_32-(len))) && ((addr) < TASK_HPAGE_END_32)) +#define is_hugepage_high_range(addr, len) \ + (((addr) > (TASK_HPAGE_BASE-(len))) && ((addr) < TASK_HPAGE_END)) + +#define is_hugepage_potential_range(addr, len) \ + (is_hugepage_high_range(addr, len) || is_hugepage_low_range(addr, len)) #define is_hugepage_only_range(addr, len) \ - ( ((addr > (TASK_HPAGE_BASE-len)) && (addr < TASK_HPAGE_END)) || \ - ((current->mm->context & CONTEXT_LOW_HPAGES) && \ - (addr > (TASK_HPAGE_BASE_32-len)) && (addr < TASK_HPAGE_END_32)) ) + (is_hugepage_high_range((addr), (len)) || \ + ( (current->mm->context & CONTEXT_LOW_HPAGES) && \ + is_hugepage_low_range((addr), (len)) ) ) + #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA +#define ARCH_HANDLES_HUGEPAGE_FAULTS #define in_hugepage_area(context, addr) \ ((cur_cpu_spec->cpu_features & CPU_FTR_16M_PAGE) && \ diff -urN ppc64-linux-2.5/include/linux/elf.h linux-gogogo/include/linux/elf.h --- ppc64-linux-2.5/include/linux/elf.h 2003-10-07 11:38:42.000000000 +1000 +++ linux-gogogo/include/linux/elf.h 2003-11-18 16:46:12.000000000 +1100 @@ -271,6 +271,11 @@ #define PF_W 0x2 #define PF_X 0x1 +#define PF_MASKOS 0x0ff00000 +#define PF_MASKPROC 0xf0000000 + +#define PF_LINUX_HTLB 0x00100000 + typedef struct elf32_phdr{ Elf32_Word p_type; Elf32_Off p_offset; diff -urN ppc64-linux-2.5/include/linux/hugetlb.h linux-gogogo/include/linux/hugetlb.h --- ppc64-linux-2.5/include/linux/hugetlb.h 2003-09-27 22:48:37.000000000 +1000 +++ linux-gogogo/include/linux/hugetlb.h 
2003-11-25 15:04:35.000000000 +1100 @@ -41,6 +41,22 @@ #define is_hugepage_only_range(addr, len) 0 #endif +#ifndef ARCH_HAS_PREPARE_HUGEPAGE_RANGE +#define is_potential_hugepage_range(addr, len) \ + (is_aligned_hugepage_range((addr), (len))) +#define prepare_hugepage_range(addr, len) (0) +#else +int is_potential_hugepage_range(unsigned long addr, unsigned long len); +int prepare_hugepage_range(unsigned long addr, unsigned long len); +#endif + +#ifndef ARCH_HANDLES_HUGEPAGE_FAULTS +#define handle_hugetlb_mm_fault(mm, vma, a, w) (VM_FAULT_SIGBUS) +#else +int handle_hugetlb_mm_fault(struct mm_struct *mm, struct vm_area_struct * vma, + unsigned long address, int write_access); +#endif + #else /* !CONFIG_HUGETLB_PAGE */ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma) @@ -61,6 +77,8 @@ #define mark_mm_hugetlb(mm, vma) do { } while (0) #define follow_huge_pmd(mm, addr, pmd, write) 0 #define is_aligned_hugepage_range(addr, len) 0 +#define is_allowed_hugepage_range(addr, len) 0 +#define prepare_hugepage_range(addr, len) (-EINVAL) #define pmd_huge(x) 0 #define is_hugepage_only_range(addr, len) 0 diff -urN ppc64-linux-2.5/mm/memory.c linux-gogogo/mm/memory.c --- ppc64-linux-2.5/mm/memory.c 2003-11-17 11:20:18.000000000 +1100 +++ linux-gogogo/mm/memory.c 2003-11-18 12:42:34.000000000 +1100 @@ -1603,7 +1603,8 @@ inc_page_state(pgfault); if (is_vm_hugetlb_page(vma)) - return VM_FAULT_SIGBUS; /* mapping truncation does this. */ + /* mapping truncation can do this. */ + return handle_hugetlb_mm_fault(mm, vma, address, write_access); /* * We need the page table lock to synchronize with kswapd diff -urN ppc64-linux-2.5/mm/mmap.c linux-gogogo/mm/mmap.c --- ppc64-linux-2.5/mm/mmap.c 2003-10-23 08:29:46.000000000 +1000 +++ linux-gogogo/mm/mmap.c 2003-11-25 15:04:49.000000000 +1100 @@ -787,7 +787,9 @@ /* * Make sure that addr and length are properly aligned. 
*/ - ret = is_aligned_hugepage_range(addr, len); + ret = is_potential_hugepage_range(addr, len); + if (ret == 0) + ret = prepare_hugepage_range(addr, len); } else { /* * Ensure that a normal request is not falling in a From olh at suse.de Wed Jan 14 05:07:41 2004 From: olh at suse.de (Olaf Hering) Date: Tue, 13 Jan 2004 19:07:41 +0100 Subject: possible asm syntax errors in spinlock.h Message-ID: <20040113180741.GA18807@suse.de> I got this crash several times on a p660 and p630: Entering kdb (current=0xc000000177748c70, pid 12) on processor 5 due to KDB_ENTER() [5]kdb> e e = 0x000000000000000e [5]kdb> excp cpu 5: Vector: 300 (Data Access) at [c000000177743ba0] pc: c000000000052198 lr: c00000000005242c sp: c000000177743e20 msr: a000000000001032 dar: 280 dsisr: 200000 current = 0xc000000177748c70 paca = 0xc00000000044c000 current = c000000177748c70, pid = 12, comm = migration/5 [5]kdb> bt 0xc000000177748c70 00000012 00000001 0 005 stop 0xc0000001777491a0*migration/5 SP(esp) PC(eip) Function(args) 0xc000000177743e20 0xc000000000052198 .move_task_away +0x420 0xc000000177743ed0 0xc00000000005242c .migration_thread +0x1d0 0xc000000177743f90 0xc000000000018938 .kernel_thread +0x4c [5]kdb> rd gpr0 = 0xc000000000444000 gpr1 = 0xc000000177743e20 gpr2 = 0xc00000000059edf0 gpr3 = 0xc000000008189560 gpr4 = 0x0000000000000000 gpr5 = 0x0000000000000000 gpr6 = 0x0000000024002042 gpr7 = 0x0000000000000000 gpr8 = 0x0000000000000000 gpr9 = 0x0000000000000000 gpr10 = 0xc000000007f4ef20 gpr11 = 0xc00000000059c010 gpr12 = 0x000000003ccbf700 gpr13 = 0xc00000000044c000 gpr14 = 0x0000000000000000 gpr15 = 0x0000000000000000 gpr16 = 0x0000000000000000 gpr17 = 0x0000000000000000 gpr18 = 0x0000000000000000 gpr19 = 0x0000000000000000 gpr20 = 0x0000000000230000 gpr21 = 0x00000000006b0000 gpr22 = 0x0000000000000000 gpr23 = 0x0000000000400000 gpr24 = 0xc0000000005a5ba0 gpr25 = 0xc0000000005a5ba0 gpr26 = 0xa000000000009032 gpr27 = 0xc000000007f501b8 gpr28 = 0xc000000008189560 gpr29 = 
0xc000000007f26f20 gpr30 = 0xc0000000004df0a0 gpr31 = 0xc000000177743e20 nip = 0xc000000000052198 msr = 0xa000000000001032 esp = 0xc000000177743e20 orig_gpr3 = 0x0000000000230000 ctr = 0x0000000000000000 link = 0xc00000000005242c xer = 0x0000000020000000 ccr = 0x0000000084002042 mq = 0x0000000000000000 trap = 0x0000000000000300 dar = 0x0000000000000280 dsisr = 0x0000000000200000 result = 0x0000000000000000 &regs = 0xc000000177743ba0 [5]kdb> id c000000000052190 0xc000000000052190 .move_task_away+0x418 cmpdi r0,0 0xc000000000052194 .move_task_away+0x41c beq 0xc0000000000521c4 .move_task_away+0x44c 0xc000000000052198 .move_task_away+0x420 lwz r5,640(r0) 0xc00000000005219c .move_task_away+0x424 andi. r11,r5,1 0xc0000000000521a0 .move_task_away+0x428 beq 0xc000000000052188 .move_task_away+0x410 0xc0000000000521a4 .move_task_away+0x42c .long 0x7c2004ac 0xc0000000000521a8 .move_task_away+0x430 ldx r11,r0,r10 0xc0000000000521ac .move_task_away+0x434 cmpd r0,r11 0xc0000000000521b0 .move_task_away+0x438 bne 0xc000000000052188 .move_task_away+0x410 0xc0000000000521b4 .move_task_away+0x43c li r3,228 0xc0000000000521b8 .move_task_away+0x440 lhz r4,24(r0) 0xc0000000000521bc .move_task_away+0x444 svca 8 0xc000000000052198 looks like a r5 = *0x280 according to dar. This one might fix it, according to Segher. Please review. --- /dev/shm/linuxppc64-2.5/include/asm-ppc64/spinlock.h 2003-11-14 19:45:32.000000000 +0100 +++ ./include/asm-ppc64/spinlock.h 2004-01-13 19:03:32.000000000 +0100 @@ -47,7 +47,7 @@ static __inline__ int _raw_spin_trylock( stdcx. 13,0,%1\n\ bne- 1b\n\ isync\n\ -2:" : "=&r"(tmp) +2:" : "=&b"(tmp) : "r"(&lock->lock) : "cr0", "memory"); @@ -95,7 +95,7 @@ static __inline__ void _raw_spin_lock(sp stdcx. 13,0,%2\n\ bne- 2b\n\ isync" - : "=&r"(tmp), "=&r"(tmp2) + : "=&b"(tmp), "=&b"(tmp2) : "r"(&lock->lock) : "r0", "r3", "r4", "r5", "ctr", "cr0", "cr1", "cr2", "cr3", "cr4", "xer", "memory"); @@ -133,7 +133,7 @@ static __inline__ void _raw_spin_lock(sp stdcx.
13,0,%2\n\ bne- 2b\n\ isync" - : "=&r"(tmp), "=&r"(tmp2) + : "=&b"(tmp), "=&b"(tmp2) : "r"(&lock->lock) : "r3", "r4", "r5", "cr0", "cr1", "ctr", "xer", "memory"); } @@ -157,7 +157,7 @@ static __inline__ void _raw_spin_lock(sp stdcx. 13,0,%1\n\ bne- 2b\n\ isync" - : "=&r"(tmp) + : "=&b"(tmp) : "r"(&lock->lock) : "cr0", "memory"); } @@ -211,7 +211,7 @@ static __inline__ int _raw_read_trylock( bne- 1b\n\ li %1,1\n\ isync\n\ -2:" : "=&r"(tmp), "=&r"(ret) +2:" : "=&b"(tmp), "=&b"(ret) : "r"(&rw->lock) : "cr0", "memory"); @@ -253,7 +253,7 @@ static __inline__ void _raw_read_lock(rw stdcx. %0,0,%2\n\ bne- 2b\n\ isync" - : "=&r"(tmp), "=&r"(tmp2) + : "=&b"(tmp), "=&b"(tmp2) : "r"(&rw->lock) : "r0", "r3", "r4", "r5", "ctr", "cr0", "cr1", "cr2", "cr3", "cr4", "xer", "memory"); @@ -290,7 +290,7 @@ static __inline__ void _raw_read_lock(rw stdcx. %0,0,%2\n\ bne- 2b\n\ isync" - : "=&r"(tmp), "=&r"(tmp2) + : "=&b"(tmp), "=&b"(tmp2) : "r"(&rw->lock) : "r3", "r4", "r5", "cr0", "cr1", "ctr", "xer", "memory"); } @@ -314,7 +314,7 @@ static __inline__ void _raw_read_lock(rw stdcx. %0,0,%1\n\ bne- 2b\n\ isync" - : "=&r"(tmp) + : "=&b"(tmp) : "r"(&rw->lock) : "cr0", "memory"); } @@ -331,7 +331,7 @@ static __inline__ void _raw_read_unlock( addic %0,%0,-1\n\ stdcx. %0,0,%1\n\ bne- 1b" - : "=&r"(tmp) + : "=&b"(tmp) : "r"(&rw->lock) : "cr0", "memory"); } @@ -350,7 +350,7 @@ static __inline__ int _raw_write_trylock bne- 1b\n\ li %1,1\n\ isync\n\ -2:" : "=&r"(tmp), "=&r"(ret) +2:" : "=&b"(tmp), "=&b"(ret) : "r"(&rw->lock), "r"(-1) : "cr0", "memory"); @@ -393,7 +393,7 @@ static __inline__ void _raw_write_lock(r stdcx. 13,0,%2\n\ bne- 2b\n\ isync" - : "=&r"(tmp), "=&r"(tmp2) + : "=&b"(tmp), "=&b"(tmp2) : "r"(&rw->lock) : "r0", "r3", "r4", "r5", "ctr", "cr0", "cr1", "cr2", "cr3", "cr4", "xer", "memory"); @@ -433,7 +433,7 @@ static __inline__ void _raw_write_lock(r stdcx. 
13,0,%2\n\ bne- 2b\n\ isync" - : "=&r"(tmp), "=&r"(tmp2) + : "=&b"(tmp), "=&b"(tmp2) : "r"(&rw->lock) : "r3", "r4", "r5", "cr0", "cr1", "ctr", "xer", "memory"); } @@ -457,7 +457,7 @@ static __inline__ void _raw_write_lock(r stdcx. 13,0,%1\n\ bne- 2b\n\ isync" - : "=&r"(tmp) + : "=&b"(tmp) : "r"(&rw->lock) : "cr0", "memory"); } -- USB is for mice, FireWire is for men! sUse lINUX ag, n?RNBERG ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From segher at kernel.crashing.org Wed Jan 14 05:19:32 2004 From: segher at kernel.crashing.org (Segher Boessenkool) Date: Tue, 13 Jan 2004 19:19:32 +0100 Subject: possible asm syntax errors in spinlock.h In-Reply-To: <20040113180741.GA18807@suse.de> References: <20040113180741.GA18807@suse.de> Message-ID: <091C35C8-45F5-11D8-859B-000A95A4DC02@kernel.crashing.org> > This one might fix it, according to Segher. > > Please review. I think you changed more than are really necessary, but I don't have the code handy to really check. You only need to change the args that are used in constructs like some_load_or_store some_offset(%N) . Segher ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From segher at kernel.crashing.org Wed Jan 14 05:41:02 2004 From: segher at kernel.crashing.org (Segher Boessenkool) Date: Tue, 13 Jan 2004 19:41:02 +0100 Subject: possible asm syntax errors in spinlock.h In-Reply-To: <091C35C8-45F5-11D8-859B-000A95A4DC02@kernel.crashing.org> References: <20040113180741.GA18807@suse.de> <091C35C8-45F5-11D8-859B-000A95A4DC02@kernel.crashing.org> Message-ID: <09F42C2C-45F8-11D8-859B-000A95A4DC02@kernel.crashing.org> > I think you changed more than are really necessary, but I don't have > the code handy to really check. You only need to change the args > that are used in constructs like some_load_or_store some_offset(%N) . Like so. Segher -------------- next part -------------- A non-text attachment was scrubbed... 
Name: patch-spinlock Type: application/octet-stream Size: 2745 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040113/8f3c7ead/attachment.obj From olh at suse.de Wed Jan 14 05:45:13 2004 From: olh at suse.de (Olaf Hering) Date: Tue, 13 Jan 2004 19:45:13 +0100 Subject: possible asm syntax errors in spinlock.h In-Reply-To: <091C35C8-45F5-11D8-859B-000A95A4DC02@kernel.crashing.org> References: <20040113180741.GA18807@suse.de> <091C35C8-45F5-11D8-859B-000A95A4DC02@kernel.crashing.org> Message-ID: <20040113184513.GA16276@suse.de> On Tue, Jan 13, Segher Boessenkool wrote: > >This one might fix it, according to Segher. > > > >Please review. > > I think you changed more than are really necessary, but I don't have > the code handy to really check. You only need to change the args > that are used in constructs like some_load_or_store some_offset(%N) . This one might work better: --- /dev/shm/linuxppc64-2.5/include/asm-ppc64/spinlock.h 2003-11-14 19:45:32.000000000 +0100 +++ ./include/asm-ppc64/spinlock.h 2004-01-13 19:31:06.000000000 +0100 @@ -95,7 +95,7 @@ static __inline__ void _raw_spin_lock(sp stdcx. 13,0,%2\n\ bne- 2b\n\ isync" - : "=&r"(tmp), "=&r"(tmp2) + : "=&b"(tmp), "=&r"(tmp2) : "r"(&lock->lock) : "r0", "r3", "r4", "r5", "ctr", "cr0", "cr1", "cr2", "cr3", "cr4", "xer", "memory"); @@ -133,7 +133,7 @@ static __inline__ void _raw_spin_lock(sp stdcx. 13,0,%2\n\ bne- 2b\n\ isync" - : "=&r"(tmp), "=&r"(tmp2) + : "=&b"(tmp), "=&r"(tmp2) : "r"(&lock->lock) : "r3", "r4", "r5", "cr0", "cr1", "ctr", "xer", "memory"); } @@ -253,7 +253,7 @@ static __inline__ void _raw_read_lock(rw stdcx. %0,0,%2\n\ bne- 2b\n\ isync" - : "=&r"(tmp), "=&r"(tmp2) + : "=&b"(tmp), "=&r"(tmp2) : "r"(&rw->lock) : "r0", "r3", "r4", "r5", "ctr", "cr0", "cr1", "cr2", "cr3", "cr4", "xer", "memory"); @@ -290,7 +290,7 @@ static __inline__ void _raw_read_lock(rw stdcx. 
%0,0,%2\n\ bne- 2b\n\ isync" - : "=&r"(tmp), "=&r"(tmp2) + : "=&b"(tmp), "=&r"(tmp2) : "r"(&rw->lock) : "r3", "r4", "r5", "cr0", "cr1", "ctr", "xer", "memory"); } @@ -393,7 +393,7 @@ static __inline__ void _raw_write_lock(r stdcx. 13,0,%2\n\ bne- 2b\n\ isync" - : "=&r"(tmp), "=&r"(tmp2) + : "=&b"(tmp), "=&r"(tmp2) : "r"(&rw->lock) : "r0", "r3", "r4", "r5", "ctr", "cr0", "cr1", "cr2", "cr3", "cr4", "xer", "memory"); @@ -433,7 +433,7 @@ static __inline__ void _raw_write_lock(r stdcx. 13,0,%2\n\ bne- 2b\n\ isync" - : "=&r"(tmp), "=&r"(tmp2) + : "=&b"(tmp), "=&r"(tmp2) : "r"(&rw->lock) : "r3", "r4", "r5", "cr0", "cr1", "ctr", "xer", "memory"); } -- USB is for mice, FireWire is for men! sUse lINUX ag, nÜRNBERG ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Wed Jan 14 08:18:10 2004 From: anton at samba.org (Anton Blanchard) Date: Wed, 14 Jan 2004 08:18:10 +1100 Subject: possible asm syntax errors in spinlock.h In-Reply-To: <20040113180741.GA18807@suse.de> References: <20040113180741.GA18807@suse.de> Message-ID: <20040113211810.GA13397@krispykreme> > I got this crash several times on a p660 and p630: > -2:" : "=&r"(tmp) > +2:" : "=&b"(tmp) Yep, if you are using SPLPAR locks then it's definitely a bug. Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From kravetz at us.ibm.com Wed Jan 14 08:40:42 2004 From: kravetz at us.ibm.com (Mike Kravetz) Date: Tue, 13 Jan 2004 13:40:42 -0800 Subject: good ppc64 kernel source for p615 Message-ID: <20040113134042.A3771@w-mikek2.beaverton.ibm.com> Hello, I just recently acquired a p615 and have been attempting to get a ppc64 kernel up and running on this box. I have built the toolchain as described at 'penguinppc64.org'. The toolchain appears to work as I can build and boot 2.4 kernels. However, 2.6 kernels (pulled from source.scl.ameslab.gov) fail to boot. 
Yesterday, I was seeing it fail at: time_init: decrementer frequency = 124.999447 MHz time_init: processor frequency = 1000.000000 MHz cpu 0: Vector: 380 (Data SLB Access) at [c0000000fef9fb00] pc: c000000000039498 (.prom_n_addr_cells+0x14/0x6c) lr: c00000000041dea0 (.vio_bus_init+0x3c/0xac) sp: c0000000fef9fd80 msr: 9000000000009032 dar: 70 dsisr: 200000 current = 0xc0000000fef6f360 paca = 0xc00000000049e000 pid = 1, comm = swapper A fix for this was submitted, and after pulling the fix I am getting a little further. However, I am still unable to boot. It 'hangs' shortly after this: [boot]0020 XICS Init [boot]0021 XICS Done PID hash table entries: 16 (order 4: 256 bytes) time_init: decrementer frequency = 124.999928 MHz time_init: processor frequency = 1000.000000 MHz The machine just sits here and I am unable to even enter the xmon debugger. FYI - I am just using the default config (make defconfig). Any suggestions would be appreciated. Thanks, -- Mike ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From hollisb at us.ibm.com Wed Jan 14 09:28:38 2004 From: hollisb at us.ibm.com (Hollis Blanchard) Date: Tue, 13 Jan 2004 16:28:38 -0600 Subject: good ppc64 kernel source for p615 In-Reply-To: <20040113134042.A3771@w-mikek2.beaverton.ibm.com> References: <20040113134042.A3771@w-mikek2.beaverton.ibm.com> Message-ID: On Jan 13, 2004, at 3:40 PM, Mike Kravetz wrote: > > Yesterday, I was seeing it fail at: > > time_init: decrementer frequency = 124.999447 MHz > time_init: processor frequency = 1000.000000 MHz > cpu 0: Vector: 380 (Data SLB Access) at [c0000000fef9fb00] > pc: c000000000039498 (.prom_n_addr_cells+0x14/0x6c) > lr: c00000000041dea0 (.vio_bus_init+0x3c/0xac) > sp: c0000000fef9fd80 > msr: 9000000000009032 > dar: 70 > dsisr: 200000 > current = 0xc0000000fef6f360 > paca = 0xc00000000049e000 > pid = 1, comm = swapper > > A fix for this was submitted, and after pulling the fix I > am getting a little further. Yeah, sorry about that. 
> However, I am still unable to boot. It 'hangs' shortly after this: > > [boot]0020 XICS Init > [boot]0021 XICS Done > PID hash table entries: 16 (order 4: 256 bytes) > time_init: decrementer frequency = 124.999928 MHz > time_init: processor frequency = 1000.000000 MHz > > The machine just sits here and I am unable to even enter > the xmon debugger. I'm surprised because I would expect to see the message "missing or empty /vdevice node" here. I guess it's possible there's still a problem here; my test system finally came back so I'll check it out. You might try the kernel parameter "initcall_debug" to print addresses of functions being called at that stage of boot (see do_initcalls in init/main.c). Then refer to System.map, although I think since we tend to have embedded sysmap these days that printk could be a bit more useful... -- Hollis Blanchard IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Wed Jan 14 10:57:45 2004 From: anton at samba.org (Anton Blanchard) Date: Wed, 14 Jan 2004 10:57:45 +1100 Subject: good ppc64 kernel source for p615 In-Reply-To: References: <20040113134042.A3771@w-mikek2.beaverton.ibm.com> Message-ID: <20040113235744.GB13397@krispykreme> > >However, I am still unable to boot. It 'hangs' shortly after this: > > > >[boot]0020 XICS Init > >[boot]0021 XICS Done > >PID hash table entries: 16 (order 4: 256 bytes) > >time_init: decrementer frequency = 124.999928 MHz > >time_init: processor frequency = 1000.000000 MHz > > > >The machine just sits here and I am unable to even enter > >the xmon debugger. That's a classic "you haven't set your console=" problem. SLES has some magic to look at your console device in OF and make a guess in the kernel as to where we should talk. We should probably merge the patch into 2.6. Are you using a serial console? Anton ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From kravetz at us.ibm.com Wed Jan 14 13:52:45 2004 From: kravetz at us.ibm.com (Mike Kravetz) Date: Tue, 13 Jan 2004 18:52:45 -0800 Subject: good ppc64 kernel source for p615 In-Reply-To: <20040113235744.GB13397@krispykreme>; from anton@samba.org on Wed, Jan 14, 2004 at 10:57:45AM +1100 References: <20040113134042.A3771@w-mikek2.beaverton.ibm.com> <20040113235744.GB13397@krispykreme> Message-ID: <20040113185245.A1747@w-mikek2.beaverton.ibm.com> On Wed, Jan 14, 2004 at 10:57:45AM +1100, Anton Blanchard wrote: > > Thats a classic "you havent set your console=" problem. SLES has some > magic to look at your console device in OF and make a guess in the > kernel as to where we should talk. We should probably merge the patch > into 2.6. > You got it! When I started this, I expected to get to a point where the boot would fail due to a configuration problem (initrd). However, since I didn't have the console set I didn't see this happening (as I expected). I'll dig the 'AUTOCONSOLE' code out of the SLES kernel and pass it along to see if it may be 'acceptable'. Thanks! -- Mike ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Wed Jan 14 17:32:50 2004 From: anton at samba.org (Anton Blanchard) Date: Wed, 14 Jan 2004 17:32:50 +1100 Subject: compiling a 32/64bit biarch toolchain Message-ID: <20040114063250.GA24637@krispykreme> Hi, I recently had to compile a 32/64bit biarch toolchain and there were a number of tricks required. Here's a quick attempt to document them. Better run the instructions below by hand, there's a good chance you'll blow up somewhere in the middle and have to fix things by hand. Hopefully someone will come up with a nicer way to do this :) I have also attached two patches: ngpt_patch - nptl is doing naughty things in autoconf, it's trying to link stuff before we have a libc available. Fix that by removing the tests; they are just some sanity checks and our toolchain passes. 
(I needed this patch to compile glibc) glibc64_backward_compatibility_patch - produces an old school glibc, only required if you have existing applications. Anton #!/bin/sh # grab CVS binutils # grab CVS gcc (3.3 or 3.4) # grab Alan Modra's latest gcc patch (ftp://ftp.linuxppc64.org/pub/people/amodra/) # grab CVS glibc rm -rf src-ppc64 gcc-ppc64 glibc-ppc32 glibc-ppc64 mkdir src-ppc64 gcc-ppc64 glibc-ppc32 glibc-ppc64 cd src-ppc64 ../build_binutils cd .. cd gcc-ppc64 ../build_gcc_1 cd .. # # Stop here if you are only building a toolchain to compile kernels # cd glibc-ppc32 ../build32_glibc_1 rm -rf ../glibc-ppc32/* ../build32_glibc_2 cd .. cd glibc-ppc64 ../build_glibc_1 rm -rf ../glibc-ppc64/* ../build_glibc_2 cd .. rm -rf gcc-ppc64/* cd gcc-ppc64 ../build_gcc_2 cd .. rm -rf glibc-ppc32/* cd glibc-ppc32 ../build32_glibc_3 cd .. rm -rf glibc-ppc64/* cd glibc-ppc64 ../build_glibc_3 cd .. # Dont seem to need this in recent 3.3/3.4 builds: # # cd /usr/local/lib/gcc-lib/powerpc-linux/${GCC_VERSION}/include # rm -rf root asm linux bits -------------- next part -------------- #!/bin/sh -e ROOT=/usr/local/ppc64 SRC=/scratch/anton/toolchain/new/src HOST=powerpc-linux MAKEOPTS=-j20 PATH=$ROOT/bin:/usr/bin:/bin export ROOT PATH $SRC/configure --prefix=$ROOT \ --build=$HOST --host=$HOST --disable-nls --target=powerpc-linux \ --enable-targets=powerpc64-linux make $MAKEOPTS make install cd $ROOT/bin for z in addr2line ar as c++filt ld nm objcopy objdump ranlib readelf \ size strings strip; do # The next line should only do something on a powerpc-linux host test -x powerpc-linux-$z || ln -sf $z powerpc-linux-$z ln -sf powerpc-linux-$z powerpc64-linux-$z done rm powerpc64-linux-as powerpc64-linux-ld cat > powerpc64-linux-as << \EOF #! /bin/sh exec /usr/local/ppc64/bin/powerpc-linux-as -a64 "$@" EOF cat > powerpc64-linux-ld << \EOF #! 
/bin/sh exec /usr/local/ppc64/bin/powerpc-linux-ld -melf64ppc "$@" EOF chmod a+x powerpc64-linux-as powerpc64-linux-ld mkdir -p $ROOT/powerpc64-linux/bin cd $ROOT/powerpc64-linux/bin for z in ar nm ranlib strip; do test -x $z || ln -sf ../../powerpc-linux/bin/$z $z done ln -sf $ROOT/bin/powerpc64-linux-as as ln -sf $ROOT/bin/powerpc64-linux-ld ld -------------- next part -------------- #!/bin/sh -e ROOT=/usr/local/ppc64 SRC=/scratch/anton/toolchain/new/gcc-3.3 HOST=powerpc-linux MAKEOPTS=-j20 PATH=$ROOT/bin:/usr/bin:/bin export ROOT PATH $SRC/configure --prefix=$ROOT \ --build=$HOST --host=$HOST --target=powerpc-linux --enable-biarch \ --disable-nls --disable-threads --disable-shared --enable-languages=c make $MAKEOPTS make install # work out the gcc version GCC_VERSION="`grep "^gcc_version " Makefile | sed 's/gcc_version[ =]*//'`" # Make a link so configure knows we have a 64 bit compiler too. cd $ROOT/bin for z in gcc cpp; do # The following line should only do something on a powerpc-linux host. test -x powerpc-linux-$z || ln -sf $z powerpc-linux-$z cat > powerpc64-linux-$z << EOF #! /bin/sh exec /usr/local/ppc64/bin/powerpc-linux-$z -m64 "\$@" EOF chmod a+x powerpc64-linux-$z done cd $ROOT/powerpc64-linux/bin ln -sf $ROOT/bin/powerpc64-linux-gcc gcc # Allow the biarch compiler to find the ppc64 library and crt files. cd $ROOT/powerpc-linux ln -sf ../powerpc64-linux/lib lib64 # The following is a hack for a glibc. As at 2003-08-11, glibc requires # libgcc_eh.a, but this is only built with a shared libgcc. 
cd $ROOT/lib/gcc-lib/powerpc-linux/$GCC_VERSION ln -sf libgcc.a libgcc_eh.a ln -sf libgcc.a 64/libgcc_eh.a -------------- next part -------------- #!/bin/sh -e ROOT=/usr/local/ppc64 SRC=/scratch/anton/toolchain/new/libc_work HOST=powerpc-linux MAKEOPTS=-j20 HEADERS=/scratch/anton/ameslab-2.5/include PATH=$ROOT/bin:/usr/bin:/bin export ROOT HEADERS PATH ( cd $HEADERS && rm asm ) ( cd $HEADERS && ln -sf asm-ppc asm ) $SRC/configure \ --prefix=$ROOT/powerpc-linux --build=$HOST --host=powerpc-linux \ --with-headers=$HEADERS --without-cvs --with-tls \ --enable-add-ons=nptl --disable-shared make $MAKEOPTS make install -------------- next part -------------- #!/bin/sh -e ROOT=/usr/local/ppc64 SRC=/scratch/anton/toolchain/new/libc_work HOST=powerpc-linux MAKEOPTS=-j20 HEADERS=/scratch/anton/ameslab-2.5/include PATH=$ROOT/bin:/usr/bin:/bin export ROOT HEADERS PATH ( cd $HEADERS && rm asm ) ( cd $HEADERS && ln -sf asm-ppc asm ) $SRC/configure \ --prefix=$ROOT/powerpc-linux --build=$HOST --host=powerpc-linux \ --with-headers=$HEADERS --without-cvs --with-tls \ --enable-add-ons=nptl --enable-shared make $MAKEOPTS make install mkdir -p $ROOT/powerpc-linux/include/linux/ mkdir -p $ROOT/powerpc-linux/include/asm/ mkdir -p $ROOT/powerpc-linux/include/asm-generic/ cp -a $HEADERS/linux/* $ROOT/powerpc-linux/include/linux/ cp -a $HEADERS/asm-ppc/* $ROOT/powerpc-linux/include/asm/ cp -a $HEADERS/asm-generic/* $ROOT/powerpc-linux/include/asm-generic/ -------------- next part -------------- #!/bin/sh -e ROOT=/usr/local/ppc64 SRC=/scratch/anton/toolchain/new/libc_work HOST=powerpc-linux MAKEOPTS=-j20 HEADERS=/scratch/anton/ameslab-2.5/include PATH=$ROOT/bin:/usr/bin:/bin export ROOT HEADERS PATH ( cd $HEADERS && rm asm ) ( cd $HEADERS && ln -sf asm-ppc64 asm ) $SRC/configure \ --prefix=$ROOT/powerpc64-linux --build=$HOST --host=powerpc64-linux \ --with-headers=$HEADERS --without-cvs --with-tls \ --enable-add-ons=nptl --disable-shared make $MAKEOPTS make install -------------- next part 
-------------- #!/bin/sh -e ROOT=/usr/local/ppc64 SRC=/scratch/anton/toolchain/new/libc_work HOST=powerpc-linux MAKEOPTS=-j20 HEADERS=/scratch/anton/ameslab-2.5/include PATH=$ROOT/bin:/usr/bin:/bin export ROOT HEADERS PATH ( cd $HEADERS && rm asm ) ( cd $HEADERS && ln -sf asm-ppc64 asm ) $SRC/configure \ --prefix=$ROOT/powerpc64-linux --build=$HOST --host=powerpc64-linux \ --with-headers=$HEADERS --without-cvs --with-tls \ --enable-add-ons=nptl --enable-shared make $MAKEOPTS make install mkdir -p $ROOT/powerpc64-linux/include/linux/ mkdir -p $ROOT/powerpc64-linux/include/asm/ mkdir -p $ROOT/powerpc64-linux/include/asm-generic/ cp -a $HEADERS/linux/* $ROOT/powerpc64-linux/include/linux/ cp -a $HEADERS/asm-ppc64/* $ROOT/powerpc64-linux/include/asm/ cp -a $HEADERS/asm-generic/* $ROOT/powerpc64-linux/include/asm-generic/ -------------- next part -------------- #!/bin/sh -e ROOT=/usr/local/ppc64 SRC=/scratch/anton/toolchain/new/gcc-3.4 HOST=powerpc-linux MAKEOPTS=-j20 PATH=$ROOT/bin:/usr/bin:/bin export ROOT PATH # Fix for some biarch madness mkdir $ROOT/powerpc-linux/lib/64/nof for i in crt?.o do ln -s $ROOT/powerpc-linux/lib/64/nof/$i ../../../lib64/ done $SRC/configure --prefix=$ROOT \ --build=$HOST --host=$HOST --target=powerpc-linux --enable-biarch \ --disable-nls --enable-shared --enable-__cxa_atexit \ --enable-languages=c,c++ #--enable-languages=c,c++,f77 - add f77 if you want #--with-headers=$ROOT/powerpc-linux/include - not needed any more? make $MAKEOPTS make install # work out the gcc version GCC_VERSION="`grep "^gcc_version " Makefile | sed 's/gcc_version[ =]*//'`" cd $ROOT/bin for z in gcc cpp g++ c++; do # The following line should only do something on a powerpc-linux host. test -x powerpc-linux-$z || ln -sf $z powerpc-linux-$z cat > powerpc64-linux-$z << EOF #! 
/bin/sh exec $ROOT/bin/powerpc-linux-$z -m64 -isystem /usr/local/ppc64/lib/gcc-lib/powerpc-linux/${GCC_VERSION}/include -isystem /usr/local/ppc64/powerpc64-linux/include "\$@" EOF chmod a+x powerpc64-linux-$z done -------------- next part -------------- #!/bin/sh -e ROOT=/usr/local/ppc64 SRC=/scratch/anton/toolchain/new/libc_work HOST=powerpc-linux MAKEOPTS=-j20 HEADERS=/scratch/anton/ameslab-2.5/include PATH=$ROOT/bin:/usr/bin:/bin export ROOT HEADERS PATH ( cd $HEADERS && rm asm ) ( cd $HEADERS && ln -sf asm-ppc asm ) $SRC/configure \ --prefix=$ROOT/powerpc-linux --build=$HOST --host=powerpc-linux \ --with-headers=$HEADERS --without-cvs --with-tls \ --enable-add-ons=nptl --enable-shared make $MAKEOPTS make install -------------- next part -------------- #!/bin/sh -e ROOT=/usr/local/ppc64 SRC=/scratch/anton/toolchain/new/libc_work HOST=powerpc-linux MAKEOPTS=-j20 HEADERS=/scratch/anton/ameslab-2.5/include PATH=$ROOT/bin:/usr/bin:/bin export ROOT HEADERS PATH ( cd $HEADERS && rm asm ) ( cd $HEADERS && ln -sf asm-ppc64 asm ) $SRC/configure \ --prefix=$ROOT/powerpc64-linux --build=$HOST --host=powerpc64-linux \ --with-headers=$HEADERS --without-cvs --with-tls \ --enable-add-ons=nptl --enable-shared make $MAKEOPTS make install -------------- next part -------------- Index: nptl/sysdeps/pthread/configure =================================================================== RCS file: /cvs/glibc/libc/nptl/sysdeps/pthread/configure,v retrieving revision 1.10 diff -u -r1.10 configure --- nptl/sysdeps/pthread/configure 3 Dec 2003 06:50:01 -0000 1.10 +++ nptl/sysdeps/pthread/configure 14 Jan 2004 05:29:40 -0000 @@ -24,136 +24,3 @@ fi -echo "$as_me:$LINENO: checking for forced unwind support" >&5 -echo $ECHO_N "checking for forced unwind support... $ECHO_C" >&6 -if test "${libc_cv_forced_unwind+set}" = set; then - echo $ECHO_N "(cached) $ECHO_C" >&6 -else - cat >conftest.$ac_ext <<_ACEOF -/* confdefs.h. 
*/ -_ACEOF -cat confdefs.h >>conftest.$ac_ext -cat >>conftest.$ac_ext <<_ACEOF -/* end confdefs.h. */ -#include -int -main () -{ - -struct _Unwind_Exception exc; -struct _Unwind_Context *context; -_Unwind_GetCFA (context) - ; - return 0; -} -_ACEOF -rm -f conftest.$ac_objext conftest$ac_exeext -if { (eval echo "$as_me:$LINENO: \"$ac_link\"") >&5 - (eval $ac_link) 2>conftest.er1 - ac_status=$? - grep -v '^ *+' conftest.er1 >conftest.err - rm -f conftest.er1 - cat conftest.err >&5 - echo "$as_me:$LINENO: \$? = $ac_status" >&5 - (exit $ac_status); } && - { ac_try='test -z "$ac_c_werror_flag" - || test ! -s conftest.err' - { (eval echo "$as_me:$LINENO: \"$ac_try\"") >&5 - (eval $ac_try) 2>&5 - ac_status=$? - echo "$as_me:$LINENO: \$? = $ac_status" >&5 - (exit $ac_status); }; } && - { ac_try='test -s conftest$ac_exeext' - { (eval echo "$as_me:$LINENO: \"$ac_try\"") >&5 - (eval $ac_try) 2>&5 - ac_status=$? - echo "$as_me:$LINENO: \$? = $ac_status" >&5 - (exit $ac_status); }; }; then - libc_cv_forced_unwind=yes -else - echo "$as_me: failed program was:" >&5 -sed 's/^/| /' conftest.$ac_ext >&5 - -libc_cv_forced_unwind=no -fi -rm -f conftest.err conftest.$ac_objext \ - conftest$ac_exeext conftest.$ac_ext -fi -echo "$as_me:$LINENO: result: $libc_cv_forced_unwind" >&5 -echo "${ECHO_T}$libc_cv_forced_unwind" >&6 -if test $libc_cv_forced_unwind = yes; then - cat >>confdefs.h <<\_ACEOF -#define HAVE_FORCED_UNWIND 1 -_ACEOF - - old_CFLAGS="$CFLAGS" - CFLAGS="$CFLAGS -Werror -fexceptions" - echo "$as_me:$LINENO: checking for C cleanup handling" >&5 -echo $ECHO_N "checking for C cleanup handling... $ECHO_C" >&6 -if test "${libc_cv_c_cleanup+set}" = set; then - echo $ECHO_N "(cached) $ECHO_C" >&6 -else - cat >conftest.$ac_ext <<_ACEOF -/* confdefs.h. */ -_ACEOF -cat confdefs.h >>conftest.$ac_ext -cat >>conftest.$ac_ext <<_ACEOF -/* end confdefs.h. 
*/ - -#include -void cl (void *a) { } -int -main () -{ - - int a __attribute__ ((cleanup (cl))); - puts ("test") - ; - return 0; -} -_ACEOF -rm -f conftest.$ac_objext conftest$ac_exeext -if { (eval echo "$as_me:$LINENO: \"$ac_link\"") >&5 - (eval $ac_link) 2>conftest.er1 - ac_status=$? - grep -v '^ *+' conftest.er1 >conftest.err - rm -f conftest.er1 - cat conftest.err >&5 - echo "$as_me:$LINENO: \$? = $ac_status" >&5 - (exit $ac_status); } && - { ac_try='test -z "$ac_c_werror_flag" - || test ! -s conftest.err' - { (eval echo "$as_me:$LINENO: \"$ac_try\"") >&5 - (eval $ac_try) 2>&5 - ac_status=$? - echo "$as_me:$LINENO: \$? = $ac_status" >&5 - (exit $ac_status); }; } && - { ac_try='test -s conftest$ac_exeext' - { (eval echo "$as_me:$LINENO: \"$ac_try\"") >&5 - (eval $ac_try) 2>&5 - ac_status=$? - echo "$as_me:$LINENO: \$? = $ac_status" >&5 - (exit $ac_status); }; }; then - libc_cv_c_cleanup=yes -else - echo "$as_me: failed program was:" >&5 -sed 's/^/| /' conftest.$ac_ext >&5 - -libc_cv_c_cleanup=no -fi -rm -f conftest.err conftest.$ac_objext \ - conftest$ac_exeext conftest.$ac_ext -fi -echo "$as_me:$LINENO: result: $libc_cv_c_cleanup" >&5 -echo "${ECHO_T}$libc_cv_c_cleanup" >&6 - CFLAGS="$old_CFLAGS" - if test $libc_cv_c_cleanup = no; then - { { echo "$as_me:$LINENO: error: the compiler must support C cleanup handling" >&5 -echo "$as_me: error: the compiler must support C cleanup handling" >&2;} - { (exit 1); exit 1; }; } - fi -else - { { echo "$as_me:$LINENO: error: forced unwind support is required" >&5 -echo "$as_me: error: forced unwind support is required" >&2;} - { (exit 1); exit 1; }; } -fi -------------- next part -------------- Index: shlib-versions =================================================================== RCS file: /cvs/glibc/libc/shlib-versions,v retrieving revision 1.66 diff -u -r1.66 shlib-versions --- shlib-versions 5 Sep 2002 09:32:03 -0000 1.66 +++ shlib-versions 14 Jan 2004 05:29:37 -0000 @@ -24,7 +24,7 @@ s390x-.*-linux.* DEFAULT 
GLIBC_2.2 cris-.*-linux.* DEFAULT GLIBC_2.2 x86_64-.*-linux.* DEFAULT GLIBC_2.2.5 -powerpc64-.*-linux.* DEFAULT GLIBC_2.3 +powerpc64-.*-linux.* DEFAULT GLIBC_2.2.5 .*-.*-gnu-gnu.* DEFAULT GLIBC_2.2.6 # Configuration Library=version Earliest symbol set (optional) @@ -70,7 +70,7 @@ mips.*-.*-linux.* ld=ld.so.1 GLIBC_2.0 GLIBC_2.2 hppa.*-.*-.* ld=ld.so.1 GLIBC_2.2 s390x-.*-linux.* ld=ld64.so.1 GLIBC_2.2 -powerpc64.*-.*-linux.* ld=ld64.so.1 GLIBC_2.3 +powerpc64.*-.*-linux.* ld=ld64.so.1 GLIBC_2.2.5 cris-.*-linux.* ld=ld.so.1 GLIBC_2.2 x86_64-.*-linux.* ld=ld-linux-x86-64.so.2 GLIBC_2.2.5 # We use the ELF ABI standard name for the default. Index: linuxthreads/shlib-versions =================================================================== RCS file: /cvs/glibc/libc/linuxthreads/shlib-versions,v retrieving revision 1.10 diff -u -r1.10 shlib-versions --- linuxthreads/shlib-versions 5 Sep 2002 10:14:32 -0000 1.10 +++ linuxthreads/shlib-versions 14 Jan 2004 05:29:38 -0000 @@ -7,5 +7,5 @@ s390x-.*-linux.* libpthread=0 GLIBC_2.2 cris-.*-linux.* libpthread=0 GLIBC_2.2 x86_64-.*-linux.* libpthread=0 GLIBC_2.2.5 -powerpc64-.*-linux.* libpthread=0 GLIBC_2.3 +powerpc64-.*-linux.* libpthread=0 GLIBC_2.2.5 .*-.*-linux.* libpthread=0 Index: sysdeps/unix/sysv/linux/powerpc/bits/stat.h =================================================================== RCS file: /cvs/glibc/libc/sysdeps/unix/sysv/linux/powerpc/bits/stat.h,v retrieving revision 1.7 diff -u -r1.7 stat.h --- sysdeps/unix/sysv/linux/powerpc/bits/stat.h 26 Jun 2003 17:00:37 -0000 1.7 +++ sysdeps/unix/sysv/linux/powerpc/bits/stat.h 14 Jan 2004 05:29:43 -0000 @@ -24,13 +24,18 @@ #include /* Versions of the `struct stat' data structure. 
*/ -#define _STAT_VER_LINUX_OLD 1 -#define _STAT_VER_KERNEL 1 -#define _STAT_VER_SVR4 2 -#define _STAT_VER_LINUX 3 #if __WORDSIZE == 32 +# define _STAT_VER_LINUX_OLD 1 +# define _STAT_VER_KERNEL 1 +# define _STAT_VER_SVR4 2 +# define _STAT_VER_LINUX 3 # define _STAT_VER _STAT_VER_LINUX #else +/* We used STAT_VER_LINUX 3 in glibc 2.2.5, which has the exact + * layout as the kernel struct stat, so we define them to be the same. + */ +# define _STAT_VER_LINUX 3 +# define _STAT_VER_KERNEL 3 # define _STAT_VER _STAT_VER_KERNEL #endif From olh at suse.de Wed Jan 14 19:07:49 2004 From: olh at suse.de (Olaf Hering) Date: Wed, 14 Jan 2004 09:07:49 +0100 Subject: good ppc64 kernel source for p615 In-Reply-To: <20040113185245.A1747@w-mikek2.beaverton.ibm.com> References: <20040113134042.A3771@w-mikek2.beaverton.ibm.com> <20040113235744.GB13397@krispykreme> <20040113185245.A1747@w-mikek2.beaverton.ibm.com> Message-ID: <20040114080749.GA374@suse.de> On Tue, Jan 13, Mike Kravetz wrote: > > On Wed, Jan 14, 2004 at 10:57:45AM +1100, Anton Blanchard wrote: > > > > Thats a classic "you havent set your console=" problem. SLES has some > > magic to look at your console device in OF and make a guess in the > > kernel as to where we should talk. We should probably merge the patch > > into 2.6. > > > > You got it! When I started this, I expected to get to a point > where the boot would fail due to a configuration problem(initrd). > However, since I didn't have the console set I didn't see this > happening (as I expected). > > I'll dig the 'AUTOCONSOLE' code out of the SLES kernel and pass it > along to see if it may be 'acceptable'. Please apply it soon, and don't forget ppc32 while you are at it. 
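The command-line handling at the heart of the AUTOCONSOLE patch can be sketched in isolation, roughly as follows. This is only an illustration under assumptions, not the patch itself: port_to_ttyS() and add_auto_console() are hypothetical helper names that do not appear in the patch, and a plain CMD_LINE_SIZE buffer stands in for the kernel's cmd_line. The port values and the "AUTOCONSOLE console=" prefix are the ones the patch uses.

```c
#include <stdio.h>
#include <string.h>

#define CMD_LINE_SIZE 512

/* Hypothetical helper: map a legacy serial I/O port base to a ttyS
 * index, using the four port values the patch switches on.
 * Returns -1 for any other port. */
static int port_to_ttyS(unsigned int port)
{
	switch (port) {
	case 0x3f8: return 0;
	case 0x2f8: return 1;
	case 0x898: return 2;
	case 0x890: return 3;
	default:    return -1;
	}
}

/* Hypothetical helper: prepend "AUTOCONSOLE console=<dev> " to cmd_line
 * unless a console= option is already present. cmd_line must point to a
 * buffer of at least CMD_LINE_SIZE bytes. Returns 1 if the line was
 * changed, 0 otherwise. */
static int add_auto_console(char *cmd_line, const char *dev)
{
	char tmp[CMD_LINE_SIZE];

	if (dev == NULL || strstr(cmd_line, "console=") != NULL)
		return 0;

	/* snprintf truncates and NUL-terminates if the result is too long */
	snprintf(tmp, sizeof(tmp), "AUTOCONSOLE console=%s %s", dev, cmd_line);
	memcpy(cmd_line, tmp, strlen(tmp) + 1);
	return 1;
}
```

The "AUTOCONSOLE" marker left in the command line makes it visible afterwards that the console was guessed rather than given by the user, and a second call is a no-op because the line then already contains console=.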
diff -purNX /home/olaf/kernel/kernel_exclude.txt linux-2.6.0-test11.orig/arch/ppc/kernel/setup.c linux-2.6.0-test11.SuSE/arch/ppc/kernel/setup.c --- linux-2.6.0-test11.orig/arch/ppc/kernel/setup.c 2003-11-26 20:45:38.000000000 +0000 +++ linux-2.6.0-test11.SuSE/arch/ppc/kernel/setup.c 2003-11-30 19:23:42.000000000 +0000 @@ -452,6 +452,56 @@ platform_init(unsigned long r3, unsigned } } } + +#ifdef CONFIG_SERIAL_CORE_CONSOLE + /* Hack -- add console=ttySn if necessary */ + if(strstr(cmd_line, "console=") == NULL) { + extern char *of_stdout_device; + struct device_node *prom_stdout; + + prom_stdout = find_path_device(of_stdout_device); + if (prom_stdout) { + unsigned char *name; + printk(KERN_INFO "of_stdout_device %s\n", of_stdout_device); + name = get_property(prom_stdout, "name", NULL); + if (name) { + int i; +#if 1 + printk(KERN_INFO "name %s\n", name); +#endif + i = -1; +#ifdef CONFIG_SERIAL_8250_CONSOLE + if (strcmp(name, "serial") == 0) { + u32 *reg = (u32 *)get_property(prom_stdout, "reg", &i); + if (i > 8) { + switch (reg[1]) { + case 0x3f8: i = 0; break; + case 0x2f8: i = 1; break; + case 0x898: i = 2; break; + case 0x890: i = 3; break; + } + } + } +#endif +#ifdef CONFIG_SERIAL_PMACZILOG_CONSOLE + if (strcmp(name, "ch-a") == 0) + i = 0; + if (strcmp(name, "ch-b") == 0) + i = 1; +#endif + if (i >= 0) { + char tmp_cmd_line[512]; + snprintf(tmp_cmd_line, 512, + "AUTOCONSOLE console=ttyS%d %s", + i, cmd_line); + memcpy(cmd_line, tmp_cmd_line, 512); + printk("console= not found, add console=ttyS%d\n", i); + } + } + } + } +#endif + #ifdef CONFIG_ADB if (strstr(cmd_line, "adb_sync")) { extern int __adb_probe_sync; diff -purNX /home/olaf/kernel/kernel_exclude.txt linux-2.6.0-test11.orig/arch/ppc/syslib/prom_init.c linux-2.6.0-test11.SuSE/arch/ppc/syslib/prom_init.c --- linux-2.6.0-test11.orig/arch/ppc/syslib/prom_init.c 2003-11-26 20:45:52.000000000 +0000 +++ linux-2.6.0-test11.SuSE/arch/ppc/syslib/prom_init.c 2003-11-30 19:24:01.000000000 +0000 @@ -118,7 +118,7 @@ 
ihandle prom_stdout __initdata = 0; char *prom_display_paths[FB_MAX] __initdata = { 0, }; phandle prom_display_nodes[FB_MAX] __initdata; unsigned int prom_num_displays __initdata = 0; -static char *of_stdout_device __initdata = 0; +char *of_stdout_device __initdata = 0; static ihandle prom_disp_node __initdata = 0; unsigned int rtas_data; /* physical pointer */ @@ -861,6 +861,11 @@ prom_init(int r3, int r4, prom_entry pp) for (i = 0; i < prom_num_displays; ++i) prom_display_paths[i] = PTRUNRELOC(prom_display_paths[i]); +#ifdef CONFIG_SERIAL_CORE_CONSOLE + /* Relocate the of stdout for console autodetection */ + of_stdout_device = PTRUNRELOC(of_stdout_device); +#endif + prom_print("returning 0x"); prom_print_hex(phys); prom_print("from prom_init\n"); diff -p -purNX kernel_exclude.txt x/linux-2.6.0-test10/arch/ppc64/kernel/setup.c linux-2.6.0-test10/arch/ppc64/kernel/setup.c --- x/linux-2.6.0-test10/arch/ppc64/kernel/setup.c 2003-11-25 20:51:52.000000000 +0100 +++ linux-2.6.0-test10/arch/ppc64/kernel/setup.c 2003-11-26 12:29:26.000000000 +0100 @@ -399,6 +399,44 @@ void parse_cmd_line(unsigned long r3, un } #endif +#ifdef CONFIG_PPC_PSERIES + /* Hack -- add console=ttySn,9600 if necessary */ + if(strstr(cmd_line, "console=") == NULL) { + struct device_node *prom_stdout = find_path_device(of_stdout_device); + u32 *reg; + int i; + char *name, *val = NULL; + printk("of_stdout_device %s\n", of_stdout_device); + if (prom_stdout) { + name = (char *)get_property(prom_stdout, "name", NULL); + if (name) { + if (strcmp(name, "serial") == 0) { + reg = (u32 *)get_property(prom_stdout, "reg", &i); + if (i > 8) { + switch (reg[1]) { + case 0x3f8: val = "ttyS0,9600"; break; + case 0x2f8: val = "ttyS1,9600"; break; + case 0x898: val = "ttyS2,9600"; break; + case 0x890: val = "ttyS3,9600"; break; + } + } + } else if (strcmp(name, "vty") == 0) { + /* pSeries LPAR virtual console */ + val = "hvc0"; + } + if (val) { + char tmp_cmd_line[CMD_LINE_SIZE]; + snprintf(tmp_cmd_line, 
CMD_LINE_SIZE, + "AUTOCONSOLE console=%s %s", + val, cmd_line); + memcpy(cmd_line, tmp_cmd_line, CMD_LINE_SIZE); + printk("console= not found, add console=%s\n", val); + } + } + } + } +#endif + /* Look for mem= option on command line */ if (strstr(cmd_line, "mem=")) { char *p, *q; diff -p -purNX kernel_exclude.txt x/linux-2.6.0-test10/include/asm-ppc64/bootinfo.h linux-2.6.0-test10/include/asm-ppc64/bootinfo.h --- x/linux-2.6.0-test10/include/asm-ppc64/bootinfo.h 2003-11-24 02:32:06.000000000 +0100 +++ linux-2.6.0-test10/include/asm-ppc64/bootinfo.h 2003-11-26 12:30:17.000000000 +0100 @@ -17,6 +17,8 @@ #include +#define CMD_LINE_SIZE 512 + /* We use a u32 for the type of the fields since they're written by * the bootloader which is a 32-bit process and read by the kernel * which is a 64-bit process. This way they can both agree on the -- USB is for mice, FireWire is for men! sUse lINUX ag, nÜRNBERG ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From hollisb at us.ibm.com Thu Jan 15 02:26:32 2004 From: hollisb at us.ibm.com (Hollis Blanchard) Date: Wed, 14 Jan 2004 09:26:32 -0600 Subject: autoconsole In-Reply-To: <20040114080749.GA374@suse.de> References: <20040113134042.A3771@w-mikek2.beaverton.ibm.com> <20040113235744.GB13397@krispykreme> <20040113185245.A1747@w-mikek2.beaverton.ibm.com> <20040114080749.GA374@suse.de> Message-ID: <080D5891-46A6-11D8-9E58-000A95A0560C@us.ibm.com> On Jan 14, 2004, at 2:07 AM, Olaf Hering wrote: > On Tue, Jan 13, Mike Kravetz wrote: >> On Wed, Jan 14, 2004 at 10:57:45AM +1100, Anton Blanchard wrote: >>> >>> Thats a classic "you havent set your console=" problem. SLES has some >>> magic to look at your console device in OF and make a guess in the >>> kernel as to where we should talk. We should probably merge the patch >>> into 2.6. >> You got it! When I started this, I expected to get to a point >> where the boot would fail due to a configuration problem(initrd). 
>> However, since I didn't have the console set I didn't see this >> happening (as I expected). >> >> I'll dig the 'AUTOCONSOLE' code out of the SLES kernel and pass it >> along to see if it may be 'acceptable'. > > Please apply it soon, and dont forget ppc32 while your are at it. [snip] I haven't had time to check it out yet, but Sparc pushed an add_preferred_console() to 2.5 a couple weeks ago. See arch/sparc/kernel/setup.c set_preferred_console() ; it's a bit cleaner looking than what you've posted here. :) -- Hollis Blanchard IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From moilanen at austin.ibm.com Thu Jan 15 02:32:26 2004 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Wed, 14 Jan 2004 09:32:26 -0600 Subject: [PATCH][2.6] Nested Interrupt support Message-ID: <1074094346.2389.42.camel@magik> The xics code is not behaving completly correct. When a hw interrupt is taken the CPPR is changed to 0x5. If while this interrupt is being processed, the CPU gets interrupted with a higher priority interrupt (eg IPI), the IPI's EOI will write the CPPR back down to 0xFF instead of what it was at when it interrupted the hw interrupt (0x5). One concern I have is at the end of ppc_irq_dispatch_handler(), there is a check to see if the desc->handler went away due to an interrupt being disabled. If the handler does go away, desc->handler->end will not be called and the irq_stack will get out of sync. I could not find anywhere were this handler would actually be removed (eg function pointer set to zero). Why is this code still here? Thanks, Jake -------------- next part -------------- # This is a BitKeeper generated patch for the following project: # Project Name: Linux kernel tree # This patch format is intended for GNU patch command version 2.5 or higher. 
# This patch includes the following deltas: # ChangeSet 1.1344 -> 1.1345 # arch/ppc64/kernel/xics.c 1.36 -> 1.37 # # The following is the BitKeeper ChangeSet Log # -------------------------------------------- # 04/01/14 moilanen at threadlp13.austin.ibm.com 1.1345 # Nested interrupt support. # -------------------------------------------- # diff -Nru a/arch/ppc64/kernel/xics.c b/arch/ppc64/kernel/xics.c --- a/arch/ppc64/kernel/xics.c Wed Jan 14 09:07:59 2004 +++ b/arch/ppc64/kernel/xics.c Wed Jan 14 09:07:59 2004 @@ -92,6 +92,21 @@ static unsigned int default_server = 0xFF; static unsigned int default_distrib_server = 0; +/* Number of nested IRQs we can store */ +#define IRQ_DEPTH 2 + +struct cpu_irq_stack +{ + int depth; + int priority[IRQ_DEPTH]; + int irq[IRQ_DEPTH]; +}; + +struct cpu_irq_stack _irq_stack[NR_CPUS]; + +#define irq_stack _irq_stack[smp_processor_id()] +#define irq_stack_depth (irq_stack).depth + /* * XICS only has a single IPI, so encode the messages per CPU */ @@ -293,20 +308,36 @@ void xics_end_irq(unsigned int irq) { int cpu = smp_processor_id(); + unsigned int priority; + + if (irq >= 0 && irq != irq_offset_up(xics_irq_8259_cascade)) { + irq_stack_depth--; + priority = irq_stack.priority[irq_stack_depth]; + } else { + priority = 0xff; + } iosync(); - ops->xirr_info_set(cpu, ((0xff<<24) | (irq_offset_down(irq)))); + ops->xirr_info_set(cpu, (priority<<24) | (irq_offset_down(irq))); } void xics_mask_and_ack_irq(u_int irq) { int cpu = smp_processor_id(); + unsigned int priority; if (irq < irq_offset_value()) { + if (irq >= 0) { + irq_stack_depth--; + priority = irq_stack.priority[irq_stack_depth]; + } else { + priority = 0xff; + } + i8259_pic.ack(irq); iosync(); - ops->xirr_info_set(cpu, ((0xff<<24) | + ops->xirr_info_set(cpu, ((priority<<24) | xics_irq_8259_cascade_real)); iosync(); } @@ -316,10 +347,12 @@ { u_int cpu = smp_processor_id(); u_int vec; + u_int priority; int irq; vec = ops->xirr_info_get(cpu); - /* (vec >> 24) == old priority */ + + 
priority = vec >> 24; vec &= 0x00ffffff; /* for sanity, this had better be < NR_IRQS - 16 */ @@ -336,6 +369,13 @@ } else { irq = irq_offset_up(vec); } + + if (irq >= 0) { + irq_stack.priority[irq_stack_depth] = priority; + irq_stack.irq[irq_stack_depth] = irq; + irq_stack_depth++; + } + return irq; } @@ -404,7 +444,7 @@ void xics_init_IRQ(void) { - int i; + int i, j; unsigned long intr_size = 0; struct device_node *np; uint *ireg, ilen, indx = 0; @@ -522,6 +562,14 @@ xics_8259_pic.disable = i8259_pic.disable; for (i = 0; i < 16; ++i) get_real_irq_desc(i)->handler = &xics_8259_pic; + + for (i = 0; i < NR_CPUS; i++) { + _irq_stack[i].depth = 0; + for (j = 0; j < IRQ_DEPTH; j++) { + _irq_stack[i].priority[j] = 0xff; + _irq_stack[i].irq[j] = -1; + } + } ops->cppr_info(boot_cpuid, 0xff); iosync(); From olh at suse.de Thu Jan 15 04:30:04 2004 From: olh at suse.de (Olaf Hering) Date: Wed, 14 Jan 2004 18:30:04 +0100 Subject: possible asm syntax errors in spinlock.h In-Reply-To: <20040113211810.GA13397@krispykreme> References: <20040113180741.GA18807@suse.de> <20040113211810.GA13397@krispykreme> Message-ID: <20040114173004.GC15207@suse.de> On Wed, Jan 14, Anton Blanchard wrote: > > > I got this crash several times on a p660 and p630: > > > -2:" : "=&r"(tmp) > > +2:" : "=&b"(tmp) > > Yep if you are using SPLPAR locks then its definitely a bug. Can you fix it in the tree? -- USB is for mice, FireWire is for men! sUse lINUX ag, n?RNBERG ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From meissner at suse.de Thu Jan 15 04:38:09 2004 From: meissner at suse.de (Marcus Meissner) Date: Wed, 14 Jan 2004 18:38:09 +0100 Subject: small fix to ppc32_timer_create Message-ID: <20040114173809.GA30646@suse.de> Hi, Small obvious fix to ppc32_timer_create. Since sys_timer_create access structures we pass on the stack, we need set_fs(KERNEL_DS). 
Ciao, Marcus --- arch/ppc64/kernel/sys_ppc32.c 2004-01-14 12:17:56.000000000 +0000 +++ arch/ppc64/kernel/sys_ppc32.c 2004-01-14 17:20:26.000000000 +0000 @@ -2934,6 +2934,7 @@ return -EFAULT; savefs = get_fs(); + set_fs(KERNEL_DS); err = sys_timer_create(clock, &event, &t); set_fs(savefs); ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Thu Jan 15 10:16:38 2004 From: anton at samba.org (Anton Blanchard) Date: Thu, 15 Jan 2004 10:16:38 +1100 Subject: possible asm syntax errors in spinlock.h In-Reply-To: <20040114173004.GC15207@suse.de> References: <20040113180741.GA18807@suse.de> <20040113211810.GA13397@krispykreme> <20040114173004.GC15207@suse.de> Message-ID: <20040114231638.GB27924@krispykreme> > Can you fix it in the tree? Yep, I just pushed it. Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Thu Jan 15 12:18:08 2004 From: anton at samba.org (Anton Blanchard) Date: Thu, 15 Jan 2004 12:18:08 +1100 Subject: autoconsole In-Reply-To: <080D5891-46A6-11D8-9E58-000A95A0560C@us.ibm.com> References: <20040113134042.A3771@w-mikek2.beaverton.ibm.com> <20040113235744.GB13397@krispykreme> <20040113185245.A1747@w-mikek2.beaverton.ibm.com> <20040114080749.GA374@suse.de> <080D5891-46A6-11D8-9E58-000A95A0560C@us.ibm.com> Message-ID: <20040115011808.GD27924@krispykreme> > I haven't had time to check it out yet, but Sparc pushed an > add_preferred_console() to 2.5 a couple weeks ago. See > arch/sparc/kernel/setup.c set_preferred_console() ; it's a bit cleaner > looking than what you've posted here. :) Agreed, how does this look? I could only compile test it, I dont have a machine to run on at the moment. Could we ever end up with the console on a hvc other than 0? 
Anton ===== arch/ppc64/kernel/setup.c 1.31 vs edited ===== --- 1.31/arch/ppc64/kernel/setup.c Wed Oct 8 12:53:40 2003 +++ edited/arch/ppc64/kernel/setup.c Thu Jan 15 12:14:06 2004 @@ -405,6 +405,56 @@ } } +static int __init set_preferred_console(void) +{ + struct device_node *prom_stdout; + char *name; + + /* The user has requested a console so this is already set up. */ + if (strstr(cmd_line, "console=")) + return -EBUSY; + + prom_stdout = find_path_device(of_stdout_device); + if (!prom_stdout) + return -ENODEV; + + name = (char *)get_property(prom_stdout, "name", NULL); + if (!name) + return -ENODEV; + + if (strcmp(name, "serial") == 0) { + int i; + u32 *reg = (u32 *)get_property(prom_stdout, "reg", &i); + if (i > 8) { + int offset; + switch (reg[1]) { + case 0x3f8: + offset = 0; + break; + case 0x2f8: + offset = 1; + break; + case 0x898: + offset = 2; + break; + case 0x890: + offset = 3; + break; + default: + /* We dont recognise the serial port */ + return -ENODEV; + } + + return add_preferred_console("ttyS", offset, NULL); + } + } else if (strcmp(name, "vty") == 0) { + /* pSeries LPAR virtual console */ + return add_preferred_console("hvc", 0, NULL); + } + + return -ENODEV; +} +console_initcall(set_preferred_console); char *bi_tag2str(unsigned long tag) { ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From olh at suse.de Thu Jan 15 21:40:34 2004 From: olh at suse.de (Olaf Hering) Date: Thu, 15 Jan 2004 11:40:34 +0100 Subject: autoconsole In-Reply-To: <20040115011808.GD27924@krispykreme> References: <20040113134042.A3771@w-mikek2.beaverton.ibm.com> <20040113235744.GB13397@krispykreme> <20040113185245.A1747@w-mikek2.beaverton.ibm.com> <20040114080749.GA374@suse.de> <080D5891-46A6-11D8-9E58-000A95A0560C@us.ibm.com> <20040115011808.GD27924@krispykreme> Message-ID: <20040115104034.GB14269@suse.de> On Thu, Jan 15, Anton Blanchard wrote: > > > I haven't had time to check it out yet, but Sparc pushed an > > add_preferred_console() to 2.5 a couple weeks ago. See > > arch/sparc/kernel/setup.c set_preferred_console() ; it's a bit cleaner > > looking than what you've posted here. :) > > Agreed, how does this look? I could only compile test it, I dont have a > machine to run on at the moment. Thanks Anton! I wasnt aware of that function. > Could we ever end up with the console on a hvc other than 0? I dont know, we had hvc0 and noone complained. Or they used console= all the time. -- USB is for mice, FireWire is for men! sUse lINUX ag, n?RNBERG ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From jdewand at redhat.com Fri Jan 16 01:56:58 2004 From: jdewand at redhat.com (Julie DeWandel) Date: Thu, 15 Jan 2004 09:56:58 -0500 Subject: lparcfg.c bug fix Message-ID: <4006AA3A.6080509@redhat.com> While code reviewing the lparcfg.c file, I noticed what I believe to be an error. I've attached a patch to correct it and would ask that someone submit it to the ameslab tree (since I am unfamiliar with the proper procedure), but only if they agree this is a problem. Thanks, Julie -- Julie DeWandel Red Hat, Inc. Tel (978) 692-3113 x23251 -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: lparcfg.patch Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040115/d97bb66e/attachment.txt From kravetz at us.ibm.com Fri Jan 16 03:58:49 2004 From: kravetz at us.ibm.com (Mike Kravetz) Date: Thu, 15 Jan 2004 08:58:49 -0800 Subject: autoconsole In-Reply-To: <20040115011808.GD27924@krispykreme>; from anton@samba.org on Thu, Jan 15, 2004 at 12:18:08PM +1100 References: <20040113134042.A3771@w-mikek2.beaverton.ibm.com> <20040113235744.GB13397@krispykreme> <20040113185245.A1747@w-mikek2.beaverton.ibm.com> <20040114080749.GA374@suse.de> <080D5891-46A6-11D8-9E58-000A95A0560C@us.ibm.com> <20040115011808.GD27924@krispykreme> Message-ID: <20040115085849.A1808@w-mikek2.beaverton.ibm.com> On Thu, Jan 15, 2004 at 12:18:08PM +1100, Anton Blanchard wrote: > > Agreed, how does this look? I could only compile test it, I dont have a > machine to run on at the moment. > > + > + if (strcmp(name, "serial") == 0) { > + int i; > + u32 *reg = (u32 *)get_property(prom_stdout, "reg", &i); > + if (i > 8) { > + int offset; > + switch (reg[1]) { > + case 0x3f8: > + offset = 0; > + break; > + case 0x2f8: > + offset = 1; > + break; > + case 0x898: > + offset = 2; > + break; > + case 0x890: > + offset = 3; > + break; > + default: > + /* We dont recognise the serial port */ > + return -ENODEV; > + } > + > + return add_preferred_console("ttyS", offset, NULL); > + } My only concern would be the lack of a 'speed' setting for the serial port/console. Is there any way to determine the 'speed' of the serial port? I don't know this code/architecture well enough, but am looking. Note that the SLES code (or at least that ported by Olaf) had the speed hard coded to 9600. My 'guess' is that the speed of the serial ports is configurable, but again I don't know this arch well enough to say. I'll try out this code on my box with a 9600 speed serial console and let you know what happens. -- Mike ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/

From olh at suse.de Fri Jan 16 04:06:31 2004
From: olh at suse.de (Olaf Hering)
Date: Thu, 15 Jan 2004 18:06:31 +0100
Subject: autoconsole
In-Reply-To: <20040115085849.A1808@w-mikek2.beaverton.ibm.com>
References: <20040113134042.A3771@w-mikek2.beaverton.ibm.com> <20040113235744.GB13397@krispykreme> <20040113185245.A1747@w-mikek2.beaverton.ibm.com> <20040114080749.GA374@suse.de> <080D5891-46A6-11D8-9E58-000A95A0560C@us.ibm.com> <20040115011808.GD27924@krispykreme> <20040115085849.A1808@w-mikek2.beaverton.ibm.com>
Message-ID: <20040115170631.GA22399@suse.de>

On Thu, Jan 15, Mike Kravetz wrote:

> I'll try out this code on my box with a 9600 speed serial console and
> let you know what happens.

The serial driver defaults to 9600, so it will (and indeed does) work as
expected.

--
USB is for mice, FireWire is for men!

sUse lINUX ag, nÜRNBERG

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From segher at kernel.crashing.org Fri Jan 16 04:21:03 2004
From: segher at kernel.crashing.org (Segher Boessenkool)
Date: Thu, 15 Jan 2004 18:21:03 +0100
Subject: autoconsole
In-Reply-To: <20040115170631.GA22399@suse.de>
References: <20040113134042.A3771@w-mikek2.beaverton.ibm.com> <20040113235744.GB13397@krispykreme> <20040113185245.A1747@w-mikek2.beaverton.ibm.com> <20040114080749.GA374@suse.de> <080D5891-46A6-11D8-9E58-000A95A0560C@us.ibm.com> <20040115011808.GD27924@krispykreme> <20040115085849.A1808@w-mikek2.beaverton.ibm.com> <20040115170631.GA22399@suse.de>
Message-ID: <3238B645-477F-11D8-A8C3-000A95A4DC02@kernel.crashing.org>

>> I'll try out this code on my box with a 9600 speed serial console and
>> let you know what happens.
>
> The serial driver defaults to 9600, so it will (and indeed does) work
> as
> expected.

Some firmwares set it to 19200, and it would be nice to not have to
switch the com settings on the attached terminal halfway through
booting ;-)

It's trivial to detect the current setting, just do the inverse of how
the baud speed is set.

Segher

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From dwm at austin.ibm.com Fri Jan 16 05:06:22 2004
From: dwm at austin.ibm.com (Doug Maxey)
Date: Thu, 15 Jan 2004 12:06:22 -0600
Subject: autoconsole
Message-ID: <200401151806.i0FI6MIK013402@localhost.localdomain>

Segher,

On the JS20, the speed is always 19200 for the serial console.

++doug

On Thu, 15 Jan 2004 18:21:03 +0100, Segher Boessenkool wrote:
>Some firmwares set it to 19200, and it would be nice to not have to
>switch the com settings on the attached terminal halfway through
>booting ;-)
>It's trivial to detect the current setting, just do the inverse of how
>the baud speed is set.
>Segher

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From segher at kernel.crashing.org Fri Jan 16 05:37:27 2004
From: segher at kernel.crashing.org (Segher Boessenkool)
Date: Thu, 15 Jan 2004 19:37:27 +0100
Subject: autoconsole
In-Reply-To: <200401151806.i0FI6MIK013402@localhost.localdomain>
References: <200401151806.i0FI6MIK013402@localhost.localdomain>
Message-ID:

On 15-jan-04, at 19:06, Doug Maxey wrote:

> On the JS20, the speed is always 19200 for the serial console.

I know that, you know that, but the kernel didn't, last time I tried...

Segher

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From nathanl at austin.ibm.com Fri Jan 16 08:03:14 2004
From: nathanl at austin.ibm.com (Nathan Lynch)
Date: Thu, 15 Jan 2004 15:03:14 -0600
Subject: [PATCH] [TRIVIAL] include guards in include/asm-ppc64 (2.6)
Message-ID: <40070012.8080901@austin.ibm.com>

I guess I'm feeling janitorial today...

Except for the case of percpu.h this adds include guards where they are missing.
I changed the existing guard in percpu.h from __ARCH_I386_PERCPU__ to
__ARCH_PPC64_PERCPU__.

Patch is against 2.5 ameslab tree.

Nathan

-------------- next part --------------
A non-text attachment was scrubbed...
Name: asm_ppc64_headers.patch
Type: text/x-patch
Size: 2661 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040115/b81c62f7/attachment.bin

From anton at samba.org Fri Jan 16 08:31:59 2004
From: anton at samba.org (Anton Blanchard)
Date: Fri, 16 Jan 2004 08:31:59 +1100
Subject: small fix to ppc32_timer_create
In-Reply-To: <20040114173809.GA30646@suse.de>
References: <20040114173809.GA30646@suse.de>
Message-ID: <20040115213159.GA25094@krispykreme>

> Small obvious fix to ppc32_timer_create. Since sys_timer_create access
> structures we pass on the stack, we need set_fs(KERNEL_DS).

Nice catch Marcus. Applied.

Anton

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From olof at austin.ibm.com Fri Jan 16 10:31:29 2004
From: olof at austin.ibm.com (Olof Johansson)
Date: Thu, 15 Jan 2004 17:31:29 -0600
Subject: [2.6] EEH detection patch
Message-ID: <400722D1.8070709@austin.ibm.com>

The attached patch is an EEH detection fix that was applied to 2.4
months ago. It'll apply to the current 2.6 ames tree with a 1-line
offset.

Apparently there are some issues with the current EEH detection code
(i.e. the pci device walking code is deadlock prone). Since this fix
will actually increase the risk for false EEH positives and as such
increase the deadlock windows, it might not be suitable to be applied at
the moment.

I just wanted to make sure it's known that the current detection code is
faulty, so the fix can go in when it's safe to use and/or the
eeh_check_failure() code is fixed.

-Olof

--
Olof Johansson Office: 4F005/905
pSeries Linux Development IBM Systems Group
Email: olof at austin.ibm.com Phone: 512-838-9858
All opinions are my own and not those of IBM

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: eeh-detection-patch
Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040115/de4b7d0b/attachment.txt

From rod at thalescomputers.fr Sat Jan 17 01:02:49 2004
From: rod at thalescomputers.fr (=?ISO-8859-1?Q?R=E9gis_Odey=E9?=)
Date: Fri, 16 Jan 2004 15:02:49 +0100
Subject: NAP mode on powerpc 970
Message-ID: <4007EF09.1070305@thalescomputers.fr>

Hi,

I'm currently working with a JS20 running Suse SLES8. And I would like
to analyse the behaviour of the NAP mode of the powerpc 970.

I was not able to find anything related to the NAP mode in the ppc64
branch.

Is there somebody who try to patch the idle loop of the ppc64 branch as
it is already done in the ppc32 branch for some ppc processors ?

The other related issue I have is how to enable the NAP mode by setting
the hypervisor HID0 register. Is there a way to write the hypervisor
register through the kernel ? through the firmware of the JS20 ?

Any related information will be really helpful.

Thanks.

--
Régis Odeyé
Thales Computers, a Thales company.
www.thalescomputers.com
E-mail: rod at thalescomputers.fr
Tel: +33 (0)4 98 16 34 86 - Fax: +33 (0)4 98 16 34 01

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From moilanen at austin.ibm.com Sat Jan 17 05:01:19 2004
From: moilanen at austin.ibm.com (Jake Moilanen)
Date: Fri, 16 Jan 2004 12:01:19 -0600
Subject: NAP mode on powerpc 970
In-Reply-To: <4007EF09.1070305@thalescomputers.fr>
References: <4007EF09.1070305@thalescomputers.fr>
Message-ID: <1074276078.1240.227.camel@magik>

> I was not able to find anything related to the NAP mode in the ppc64
> branch.

Linux does not support NAP mode for the 970's.
Going to this mode powers off the caches (thus killing the caches).
There are some other potential issues (thermal, and power supply).

This should not be done in Linux in the first place. The correct place
is in the FW.

> Is there somebody who try to patch the idle loop of the ppc64 branch as
> it is already done in the ppc32 branch for some ppc processors ?
>
> The other related issue I have is how to enable the NAP mode by setting
> the hypervisor HID0 register. Is there a way to write the hypervisor
> register through the kernel ? through the firmware of the JS20 ?

Yes, bit 9 on HID0 needs to be set then the POW bit in the MSR (bit
45). This will quiesce the processor and prefetch engine and put you
into doze mode. The buses will get the quiesced next and once they are
you will be in NAP mode. Any interrupt will kick you back to full-power
mode.

HID0 is a hypervisor resource and linux does not have access to it.

There is more information in book 4 of the 970.

Thanks,
Jake

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From olof at austin.ibm.com Sat Jan 17 05:19:44 2004
From: olof at austin.ibm.com (Olof Johansson)
Date: Fri, 16 Jan 2004 12:19:44 -0600
Subject: [2.4] SLB noloop patch
Message-ID: <40082B40.4070808@austin.ibm.com>

The 2.5 equivalent of this patch got baked into Anton's big SLB rewrite.
There seems to be less interest to bring the bigger rewrite back to 2.4,
but the noloop stuff is still a valuable enhancement (and smaller in
scope).

I've attached the patch, unless someone objects I'll push it to ameslab
early next week (with the hope that it'll make it to Marcelo as well).

Thanks,
-Olof

--
Olof Johansson Office: 4F005/905
pSeries Linux Development IBM Systems Group
Email: olof at austin.ibm.com Phone: 512-838-9858
All opinions are my own and not those of IBM

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: slb-noloop-patch.24
Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040116/4f802373/attachment.txt

From benh at kernel.crashing.org Sat Jan 17 13:02:27 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sat, 17 Jan 2004 13:02:27 +1100
Subject: NAP mode on powerpc 970
In-Reply-To: <1074276078.1240.227.camel@magik>
References: <4007EF09.1070305@thalescomputers.fr> <1074276078.1240.227.camel@magik>
Message-ID: <1074304946.8360.15.camel@gaston>

On Sat, 2004-01-17 at 05:01, Jake Moilanen wrote:
> > I was not able to find anything related to the NAP mode in the ppc64
> > branch.
>
> Linux does not support NAP mode for the 970's. Going to this mode
> powers off the caches (thus killing the caches). There are some other
> potential issues (thermal, and power supply).
>
> This should not be done in Linux in the first place. The correct place
> is in the FW.

No, no ... I do it on the G5 without problem :) The north bridge will
get the CPU out of NAP mode for snooping, the caches aren't powered off.

I will get the patch doing that to ameslab 2.6 soon, when I start
getting the G5 bits in.

> > Is there somebody who try to patch the idle loop of the ppc64 branch as
> > it is already done in the ppc32 branch for some ppc processors ?
> >
> > The other related issue I have is how to enable the NAP mode by setting
> > the hypervisor HID0 register. Is there a way to write the hypervisor
> > register through the kernel ? through the firmware of the JS20 ?
>
> Yes, bit 9 on HID0 needs to be set then the POW bit in the MSR (bit
> 45). This will quiesce the processor and prefetch engine and put you
> into doze mode. The buses will get the quiesced next and once they are
> you will be in NAP mode. Any interrupt will kick you back to full-power
> mode.
>
> HID0 is a hypervisor resource and linux does not have access to it.
>
> There is more information in book 4 of the 970.

Then HV could do it as well...
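
The bit numbers quoted above (HID0 bit 9, MSR bit 45) follow the IBM convention in which bit 0 is the most significant bit of a 64-bit register. A minimal sketch of turning them into masks, assuming that convention; the macro and function names are made up for illustration and are not taken from the kernel source:

```c
#include <assert.h>
#include <stdint.h>

/* IBM bit numbering on a 64-bit register: bit 0 is the MSB, bit 63 the LSB. */
#define IBM_BIT64(n) (UINT64_C(1) << (63 - (n)))

/* Hypothetical names for the two bits mentioned in the thread. */
#define DEMO_HID0_NAP IBM_BIT64(9)   /* HID0 bit 9  */
#define DEMO_MSR_POW  IBM_BIT64(45)  /* MSR bit 45  */

/* Return a copy of a HID0 value with the NAP enable bit set. */
static inline uint64_t demo_hid0_enable_nap(uint64_t hid0)
{
    return hid0 | DEMO_HID0_NAP;
}

/* Sanity checks on the resulting masks. */
static inline void demo_check_masks(void)
{
    assert(DEMO_MSR_POW == UINT64_C(0x40000));              /* 1 << 18 */
    assert(DEMO_HID0_NAP == UINT64_C(0x0040000000000000));  /* 1 << 54 */
}
```

This only computes the masks. Actually writing HID0 or the MSR needs privileged instructions, and as noted in the thread HID0 may not be writable from the kernel at all on a given platform.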
Actually, on the js20, it could just leave HID0:NAP set permanently and let the kernel use MSR:POW from its idle loop... Ben. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Mon Jan 19 01:15:36 2004 From: anton at samba.org (Anton Blanchard) Date: Mon, 19 Jan 2004 01:15:36 +1100 Subject: [PATCH] [TRIVIAL] include guards in include/asm-ppc64 (2.6) In-Reply-To: <40070012.8080901@austin.ibm.com> References: <40070012.8080901@austin.ibm.com> Message-ID: <20040118141536.GD6293@krispykreme> Hi Nathan, > I guess I'm feeling janitorial today... > > Except for the case of percpu.h this adds include guards where they are > missing. I changed the existing guard in percpu.h from > __ARCH_I386_PERCPU__ to __ARCH_PPC64_PERCPU__. > > Patch is against 2.5 ameslab tree. Looks good to me. Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Mon Jan 19 01:17:42 2004 From: anton at samba.org (Anton Blanchard) Date: Mon, 19 Jan 2004 01:17:42 +1100 Subject: autoconsole In-Reply-To: <3238B645-477F-11D8-A8C3-000A95A4DC02@kernel.crashing.org> References: <20040113134042.A3771@w-mikek2.beaverton.ibm.com> <20040113235744.GB13397@krispykreme> <20040113185245.A1747@w-mikek2.beaverton.ibm.com> <20040114080749.GA374@suse.de> <080D5891-46A6-11D8-9E58-000A95A0560C@us.ibm.com> <20040115011808.GD27924@krispykreme> <20040115085849.A1808@w-mikek2.beaverton.ibm.com> <20040115170631.GA22399@suse.de> <3238B645-477F-11D8-A8C3-000A95A4DC02@kernel.crashing.org> Message-ID: <20040118141742.GE6293@krispykreme> > It's trivial to detect the current setting, just do the inverse of > how the baud speed is set. Anyone feel like coding this up? Or does OF export the baud rate somewhere? Anton ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From anton at samba.org Mon Jan 19 15:20:22 2004 From: anton at samba.org (Anton Blanchard) Date: Mon, 19 Jan 2004 15:20:22 +1100 Subject: [PATCH][2.6] Nested Interrupt support In-Reply-To: <1074094346.2389.42.camel@magik> References: <1074094346.2389.42.camel@magik> Message-ID: <20040119042022.GA20834@krispykreme> Hi Jake, > The xics code is not behaving completly correct. When a hw interrupt is > taken the CPPR is changed to 0x5. If while this interrupt is being > processed, the CPU gets interrupted with a higher priority interrupt (eg > IPI), the IPI's EOI will write the CPPR back down to 0xFF instead of > what it was at when it interrupted the hw interrupt (0x5). Looks good. Could we use per cpu data here (do we init per cpu data before the xics setup)? Also Im wondering if we should have a quick check for overflow of the buffer. > One concern I have is at the end of ppc_irq_dispatch_handler(), there is > a check to see if the desc->handler went away due to an interrupt being > disabled. If the handler does go away, desc->handler->end will not be > called and the irq_stack will get out of sync. I could not find > anywhere were this handler would actually be removed (eg function > pointer set to zero). Why is this code still here? Im not sure, how does it match up with what x86 does these days? Anton ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From benh at kernel.crashing.org Mon Jan 19 20:21:13 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Mon, 19 Jan 2004 20:21:13 +1100 Subject: autoconsole In-Reply-To: <20040118141742.GE6293@krispykreme> References: <20040113134042.A3771@w-mikek2.beaverton.ibm.com> <20040113235744.GB13397@krispykreme> <20040113185245.A1747@w-mikek2.beaverton.ibm.com> <20040114080749.GA374@suse.de> <080D5891-46A6-11D8-9E58-000A95A0560C@us.ibm.com> <20040115011808.GD27924@krispykreme> <20040115085849.A1808@w-mikek2.beaverton.ibm.com> <20040115170631.GA22399@suse.de> <3238B645-477F-11D8-A8C3-000A95A4DC02@kernel.crashing.org> <20040118141742.GE6293@krispykreme> Message-ID: <1074504073.814.56.camel@gaston> On Mon, 2004-01-19 at 01:17, Anton Blanchard wrote: > > It's trivial to detect the current setting, just do the inverse of > > how the baud speed is set. > > Anyone feel like coding this up? Or does OF export the baud rate somewhere? OF doesn't afaik, and i'm not sure you can even read the register on the 8530 at least Ben. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From segher at kernel.crashing.org Tue Jan 20 01:14:38 2004 From: segher at kernel.crashing.org (Segher Boessenkool) Date: Mon, 19 Jan 2004 15:14:38 +0100 Subject: autoconsole In-Reply-To: <1074504073.814.56.camel@gaston> References: <20040113134042.A3771@w-mikek2.beaverton.ibm.com> <20040113235744.GB13397@krispykreme> <20040113185245.A1747@w-mikek2.beaverton.ibm.com> <20040114080749.GA374@suse.de> <080D5891-46A6-11D8-9E58-000A95A0560C@us.ibm.com> <20040115011808.GD27924@krispykreme> <20040115085849.A1808@w-mikek2.beaverton.ibm.com> <20040115170631.GA22399@suse.de> <3238B645-477F-11D8-A8C3-000A95A4DC02@kernel.crashing.org> <20040118141742.GE6293@krispykreme> <1074504073.814.56.camel@gaston> Message-ID: > OF doesn't afaik, and i'm not sure you can even read the register on > the 8530 at least You can read it on 16450 and up, at least. 
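
Reading the setting back on a 16450-class UART amounts to reading the 16-bit divisor latch (the two bytes visible at offsets 0 and 1 while the DLAB bit in the line control register is set) and inverting the divide. A minimal sketch of that arithmetic, assuming the common 1.8432 MHz UART clock, i.e. a base rate of 115200; the function names are hypothetical and the actual register reads are left out:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: PC-style UART clock of 1.8432 MHz, which gives a base
 * rate of 115200 baud at divisor 1. Other hardware may use another clock. */
#define DEMO_UART_BASE_BAUD 115200u

/* Combine the low (DLL) and high (DLM) divisor latch bytes. */
static inline uint16_t demo_uart_divisor(uint8_t dll, uint8_t dlm)
{
    return (uint16_t)(((uint16_t)dlm << 8) | dll);
}

/* Inverse of the usual "divisor = base / baud" programming step.
 * Returns 0 when the divisor is 0, i.e. no valid setting was found. */
static inline unsigned int demo_uart_baud_from_divisor(uint16_t divisor)
{
    if (divisor == 0)
        return 0;
    return DEMO_UART_BASE_BAUD / divisor;
}

/* Example values: divisor 12 is 9600 baud, divisor 6 is 19200 baud. */
static inline void demo_uart_selfcheck(void)
{
    assert(demo_uart_baud_from_divisor(demo_uart_divisor(12, 0)) == 9600);
    assert(demo_uart_baud_from_divisor(demo_uart_divisor(6, 0)) == 19200);
}
```

With the 9600 and 19200 settings discussed in this thread, the latch would hold 12 and 6 respectively under the stated clock assumption.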
It's fair to assume that's or lowest common denominator (or lower than it, actually). Point me to where to detect it, and I'll do it. Segher ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From kravetz at us.ibm.com Tue Jan 20 04:04:18 2004 From: kravetz at us.ibm.com (Mike Kravetz) Date: Mon, 19 Jan 2004 09:04:18 -0800 Subject: autoconsole In-Reply-To: ; from segher@kernel.crashing.org on Mon, Jan 19, 2004 at 03:14:38PM +0100 References: <20040113185245.A1747@w-mikek2.beaverton.ibm.com> <20040114080749.GA374@suse.de> <080D5891-46A6-11D8-9E58-000A95A0560C@us.ibm.com> <20040115011808.GD27924@krispykreme> <20040115085849.A1808@w-mikek2.beaverton.ibm.com> <20040115170631.GA22399@suse.de> <3238B645-477F-11D8-A8C3-000A95A4DC02@kernel.crashing.org> <20040118141742.GE6293@krispykreme> <1074504073.814.56.camel@gaston> Message-ID: <20040119090418.B1802@w-mikek2.beaverton.ibm.com> On Mon, Jan 19, 2004 at 03:14:38PM +0100, Segher Boessenkool wrote: > > Point me to where to detect it, and I'll do it. > Me too. :) Last week, I wrote some code to dump what I thought was all the serial port related information from OF. Unfortunately, I couldn't make any sense of it. This is like the code: + if (i > 8) { + int offset; + switch (reg[1]) { + case 0x3f8: + offset = 0; + break; + case 0x2f8: + offset = 1; + break; + case 0x898: + offset = 2; + break; + case 0x890: + offset = 3; + break; + default: + /* We dont recognise the serial port */ + return -ENODEV; + } + + return add_preferred_console("ttyS", offset, NULL); + } Can anyone point to where this stuff is documented? -- Mike ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From olof at austin.ibm.com Tue Jan 20 04:15:18 2004 From: olof at austin.ibm.com (Olof Johansson) Date: Mon, 19 Jan 2004 11:15:18 -0600 Subject: [PATCH] [2.6] xmon doesn't compile without CONFIG_MAGIC_SYSRQ Message-ID: <400C10A6.2020701@austin.ibm.com> I guess I can see that one would want to use xmon without magic_sysrq, so attached patch makes xmon compile without it. -Olof -- Olof Johansson Office: 4F005/905 pSeries Linux Development IBM Systems Group Email: olof at austin.ibm.com Phone: 512-838-9858 All opinions are my own and not those of IBM -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: xmon-no-sysrq Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040119/4bae58ba/attachment.txt From segher at kernel.crashing.org Tue Jan 20 04:45:11 2004 From: segher at kernel.crashing.org (Segher Boessenkool) Date: Mon, 19 Jan 2004 18:45:11 +0100 Subject: autoconsole In-Reply-To: <20040119090418.B1802@w-mikek2.beaverton.ibm.com> References: <20040113185245.A1747@w-mikek2.beaverton.ibm.com> <20040114080749.GA374@suse.de> <080D5891-46A6-11D8-9E58-000A95A0560C@us.ibm.com> <20040115011808.GD27924@krispykreme> <20040115085849.A1808@w-mikek2.beaverton.ibm.com> <20040115170631.GA22399@suse.de> <3238B645-477F-11D8-A8C3-000A95A4DC02@kernel.crashing.org> <20040118141742.GE6293@krispykreme> <1074504073.814.56.camel@gaston> <20040119090418.B1802@w-mikek2.beaverton.ibm.com> Message-ID: <3AC0E7B4-4AA7-11D8-9C4D-000A95A4DC02@kernel.crashing.org> > Can anyone point to where this stuff is documented? I think the CHRP docs should document this... 3f8, 2f8 are just the legacy x86 i/o addresses for the first and second serial ports; I assume 898, 890 are the CHRP standardized addresses for the third and fourth? Segher ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/

From olof at austin.ibm.com Tue Jan 20 05:03:14 2004
From: olof at austin.ibm.com (Olof Johansson)
Date: Mon, 19 Jan 2004 12:03:14 -0600
Subject: [2.4] PCI bases with value 0 -- upstream status?
Message-ID: <400C1BE2.2080009@austin.ibm.com>

I'm not sure how many times this subject has come up, but here we go again:

The pci_read_bridge_bases() code in drivers/pci/pci.c assumes that all
resources start on a non-zero address, which is not true on our systems.
On LPAR machines, as well as some SMP configs, we might very well have a
0 base.

I think the fixes have been in Ames before, but might have been taken
out to keep us aligned with mainline? Have patches to fix
drivers/pci/pci.c upstream been shot down? If so, should we add it back
to Ames?

This keeps a mainline or ameslab 2.4 kernel from using the first device
of an LPAR system, and it's shown up on a p650 in SMP mode as well,
resulting in unprobed internal SCSI interfaces.

-Olof

--
Olof Johansson                                        Office: 4F005/905
pSeries Linux Development                             IBM Systems Group
Email: olof at austin.ibm.com                          Phone: 512-838-9858
All opinions are my own and not those of IBM

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From mdewand at redhat.com Tue Jan 20 05:39:23 2004
From: mdewand at redhat.com (Mark DeWandel)
Date: Mon, 19 Jan 2004 13:39:23 -0500
Subject: Comments concerning Enhanced Flash Patch
Message-ID: <20040119183923.GC7011@redhat.com>

Regarding the attached enhanced flash patch, Red Hat has the following
questions and issues:

o In rtas_do_extended_delay(), should the blocking state of the current
  task be TASK_UNINTERRUPTIBLE? If you're waiting on hardware, you
  normally want to block uninterruptibly; however, this doesn't appear
  to be the case here. Can someone please comment on this?

o Although there is enforcement for a single opener, there is concern
  about providing mutual exclusion for multiple writers.
  Creating new processes/threads via fork/pthread_create in conjunction
  with possible effects from dup(2) could introduce the risk of data
  corruption. Red Hat suggests the use of a semaphore that guards reads
  and writes either in the entire module or for a specific file.

o In manage_flash() and validate_flash(), there was concern about the
  worst case elapsed length of time for the execution of the loop.
  Terminating the loop after some threshold and returning an error code
  (-EIO?) is one suggested solution. Please advise whether this is an
  acceptable solution or whether a problem even exists at all.

o In rtas_flash_init(), there is the possibility for memory leaks in the
  failure cases. In addition, there is also the possibility of
  dereferencing a null pointer in initialize_flash_pde_data() via dp if
  create_flash_pde() returns null. This really must be fixed.

--
Mark DeWandel
Red Hat, Inc.
(978) 692-3113 ext. 23252
-------------- next part --------------
diff -urpN -X /home/johnrose/tmp/diffignore.txt /usr/src/linux-2.4.21-6.EL/arch/ppc64/kernel/ppc_ksyms.c ./EL_ef/arch/ppc64/kernel/ppc_ksyms.c
--- linux-2.4.21-6.EL/arch/ppc64/kernel/ppc_ksyms.c	2003-12-09 13:42:04.000000000 -0600
+++ ./arch/ppc64/kernel/ppc_ksyms.c	2004-01-15 15:31:47.000000000 -0600
@@ -266,6 +266,7 @@ EXPORT_SYMBOL(rtas_token);
 EXPORT_SYMBOL(rtas_call);
 EXPORT_SYMBOL(rtas_data_buf);
 EXPORT_SYMBOL(rtas_data_buf_lock);
+EXPORT_SYMBOL(rtas_do_extended_delay);
 #endif
 
 #ifndef CONFIG_PPC_ISERIES
diff -urpN -X /home/johnrose/tmp/diffignore.txt /usr/src/linux-2.4.21-6.EL/arch/ppc64/kernel/rtas.c ./EL_ef/arch/ppc64/kernel/rtas.c
--- linux-2.4.21-6.EL/arch/ppc64/kernel/rtas.c	2003-12-09 13:41:30.000000000 -0600
+++ ./arch/ppc64/kernel/rtas.c	2004-01-15 16:09:20.000000000 -0600
@@ -184,6 +184,31 @@ rtas_call(int token, int nargs, int nret
 	return (ulong)((nret > 0) ? rtas_args->rets[0] : 0);
 }
 
+/* Given an RTAS status code of 990n perform the hinted delay of 10^n
+ * (last digit) milliseconds.
For now we bound at n=5 (100 secs). + */ +void +rtas_do_extended_delay(int status) +{ + int order = status - 9900; + unsigned long ms; + unsigned long jiffies; + + if (order < 0) + order = 0; /* RTC depends on this for -2 clock busy */ + else if (order > 5) + order = 5; /* bound */ + + /* Use microseconds for reasonable accuracy */ + for (ms=1; order > 0; order--) + ms *= 10; + + jiffies = (ms * HZ) / 1000; + + set_current_state(TASK_INTERRUPTIBLE); + schedule_timeout(jiffies); +} + #define FLASH_BLOCK_LIST_VERSION (1UL) static void rtas_flash_firmware(void) diff -urpN -X /home/johnrose/tmp/diffignore.txt /usr/src/linux-2.4.21-6.EL/arch/ppc64/kernel/rtas_flash.c ./EL_ef/arch/ppc64/kernel/rtas_flash.c --- linux-2.4.21-6.EL/arch/ppc64/kernel/rtas_flash.c 2002-11-28 17:53:11.000000000 -0600 +++ ./arch/ppc64/kernel/rtas_flash.c 2004-01-15 16:33:31.000000000 -0600 @@ -24,7 +24,53 @@ #define MODULE_VERSION "1.0" #define MODULE_NAME "rtas_flash" -#define FIRMWARE_FLASH_NAME "firmware_flash" +#define FIRMWARE_FLASH_NAME "firmware_flash" +#define FIRMWARE_UPDATE_NAME "firmware_update" +#define MANAGE_FLASH_NAME "manage_flash" +#define VALIDATE_FLASH_NAME "validate_flash" + +/* General RTAS Status Codes */ +#define RTAS_RC_SUCCESS 0 +#define RTAS_RC_HW_ERR -1 +#define RTAS_RC_BUSY -2 + +/* Flash image status values */ +#define FLASH_AUTH -9002 /* RTAS Not Service Authority Partition */ +#define FLASH_NO_OP -1099 /* No operation initiated by user */ +#define FLASH_IMG_SHORT -1005 /* Flash image shorter than expected */ +#define FLASH_IMG_BAD_LEN -1004 /* Bad length value in flash list block */ +#define FLASH_IMG_NULL_DATA -1003 /* Bad data value in flash list block */ +#define FLASH_IMG_READY 0 /* Firmware img ready for flash on reboot */ + +/* Manage image status values */ +#define MANAGE_AUTH -9002 /* RTAS Not Service Authority Partition */ +#define MANAGE_ACTIVE_ERR -9001 /* RTAS Cannot Overwrite Active Img */ +#define MANAGE_NO_OP -1099 /* No operation initiated by user 
*/ +#define MANAGE_PARAM_ERR -3 /* RTAS Parameter Error */ +#define MANAGE_HW_ERR -1 /* RTAS Hardware Error */ + +/* Validate image status values */ +#define VALIDATE_AUTH -9002 /* RTAS Not Service Authority Partition */ +#define VALIDATE_NO_OP -1099 /* No operation initiated by the user */ +#define VALIDATE_INCOMPLETE -1002 /* User copied < VALIDATE_BUF_SIZE */ +#define VALIDATE_READY -1001 /* Firmware image ready for validation */ +#define VALIDATE_PARAM_ERR -3 /* RTAS Parameter Error */ +#define VALIDATE_HW_ERR -1 /* RTAS Hardware Error */ +#define VALIDATE_TMP_UPDATE 0 /* Validate Return Status */ +#define VALIDATE_FLASH_AUTH 1 /* Validate Return Status */ +#define VALIDATE_INVALID_IMG 2 /* Validate Return Status */ +#define VALIDATE_CUR_UNKNOWN 3 /* Validate Return Status */ +#define VALIDATE_TMP_COMMIT_DL 4 /* Validate Return Status */ +#define VALIDATE_TMP_COMMIT 5 /* Validate Return Status */ +#define VALIDATE_TMP_UPDATE_DL 6 /* Validate Return Status */ + +/* ibm,manage-flash-image operation tokens */ +#define RTAS_REJECT_TMP_IMG 0 +#define RTAS_COMMIT_TMP_IMG 1 + +/* Array sizes */ +#define VALIDATE_BUF_SIZE 4096 +#define RTAS_MSG_MAXLEN 64 /* Local copy of the flash block list. * We only allow one open of the flash proc file and create this @@ -36,21 +82,35 @@ * is treated as the number of entries currently in the block * (i.e. not a byte count). This is all fixed on release. 
*/ -static struct flash_block_list *flist; -static char *flash_msg; -static int flash_possible; - -static int rtas_flash_open(struct inode *inode, struct file *file) -{ - if ((file->f_mode & FMODE_WRITE) && flash_possible) { - if (flist) - return -EBUSY; - flist = (struct flash_block_list *)get_free_page(GFP_KERNEL); - if (!flist) - return -ENOMEM; - } - return 0; -} + +/* Status int must be first member of struct */ +struct rtas_update_flash_t +{ + int status; /* Flash update status */ + struct flash_block_list *flist; /* Local copy of flash block list */ +}; + +/* Status int must be first member of struct */ +struct rtas_manage_flash_t +{ + int status; /* Returned status */ + unsigned int op; /* Reject or commit image */ +}; + +/* Status int must be first member of struct */ +struct rtas_validate_flash_t +{ + int status; /* Returned status */ + char buf[VALIDATE_BUF_SIZE]; /* Candidate image buffer */ + unsigned int buf_size; /* Size of image buf */ + unsigned int update_results; /* Update results token */ +}; + +static spinlock_t flash_file_open_lock = SPIN_LOCK_UNLOCKED; +static struct proc_dir_entry *firmware_flash_pde = NULL; +static struct proc_dir_entry *firmware_update_pde = NULL; +static struct proc_dir_entry *validate_pde = NULL; +static struct proc_dir_entry *manage_pde = NULL; /* Do simple sanity checks on the flash image. */ static int flash_list_valid(struct flash_block_list *flist) @@ -59,32 +119,27 @@ static int flash_list_valid(struct flash int i; unsigned long block_size, image_size; - flash_msg = NULL; /* Paranoid self test here. We also collect the image size. 
*/ image_size = 0; for (f = flist; f; f = f->next) { for (i = 0; i < f->num_blocks; i++) { if (f->blocks[i].data == NULL) { - flash_msg = "error: internal error null data\n"; - return 0; + return FLASH_IMG_NULL_DATA; } block_size = f->blocks[i].length; if (block_size <= 0 || block_size > PAGE_SIZE) { - flash_msg = "error: internal error bad length\n"; - return 0; + return FLASH_IMG_BAD_LEN; } image_size += block_size; } } - if (image_size < (256 << 10)) { - if (image_size < 2) - flash_msg = NULL; /* allow "clear" of image */ - else - flash_msg = "error: flash image short\n"; - return 0; - } + + if (image_size < 2) + return FLASH_NO_OP; + printk(KERN_INFO "FLASH: flash image with %ld bytes stored for hardware flash on reboot\n", image_size); - return 1; + + return FLASH_IMG_READY; } static void free_flash_list(struct flash_block_list *f) @@ -103,56 +158,91 @@ static void free_flash_list(struct flash static int rtas_flash_release(struct inode *inode, struct file *file) { - if (flist) { - /* Always clear saved list on a new attempt. 
*/ + struct proc_dir_entry *dp = file->f_dentry->d_inode->u.generic_ip; + struct rtas_update_flash_t *uf; + + uf = (struct rtas_update_flash_t *) dp->data; + if (uf->flist) { + /* File was opened in write mode for a new flash attempt */ + /* Clear saved list */ if (rtas_firmware_flash_list.next) { free_flash_list(rtas_firmware_flash_list.next); rtas_firmware_flash_list.next = NULL; } - if (flash_list_valid(flist)) - rtas_firmware_flash_list.next = flist; + if (uf->status != FLASH_AUTH) + uf->status = flash_list_valid(uf->flist); + + if (uf->status == FLASH_IMG_READY) + rtas_firmware_flash_list.next = uf->flist; else - free_flash_list(flist); - flist = NULL; + free_flash_list(uf->flist); + + uf->flist = NULL; } + + atomic_dec(&dp->count); return 0; } +static int get_flash_status_msg(int status, char *buf, int size) +{ + int len; + + switch (status) { + case FLASH_AUTH: + len = snprintf(buf, size, "error: this partition does not have service authority\n"); + break; + case FLASH_NO_OP: + len = snprintf(buf, size, "info: no firmware image for flash\n"); + break; + case FLASH_IMG_SHORT: + len = snprintf(buf, size, "error: flash image short\n"); + break; + case FLASH_IMG_BAD_LEN: + len = snprintf(buf, size, "error: internal error bad length\n"); + break; + case FLASH_IMG_NULL_DATA: + len = snprintf(buf, size, "error: internal error null data\n"); + break; + case FLASH_IMG_READY: + len = snprintf(buf, size, "ready: firmware image ready for flash on reboot\n"); + break; + default: + len = snprintf(buf, size, "error: unexpected status value %d\n", status); + break; + } + + return len >= size ? 
size-1 : len; +} + /* Reading the proc file will show status (not the firmware contents) */ static ssize_t rtas_flash_read(struct file *file, char *buf, size_t count, loff_t *ppos) { - int error; - char *msg; + struct proc_dir_entry *dp = file->f_dentry->d_inode->u.generic_ip; + struct rtas_update_flash_t *uf; + char msg[RTAS_MSG_MAXLEN]; int msglen; - if (!flash_possible) { - msg = "error: this partition does not have service authority\n"; - } else if (flist) { - msg = "info: this file is busy for write by some process\n"; - } else if (flash_msg) { - msg = flash_msg; /* message from last flash attempt */ - } else if (rtas_firmware_flash_list.next) { - msg = "ready: firmware image ready for flash on reboot\n"; - } else { - msg = "info: no firmware image for flash\n"; + uf = (struct rtas_update_flash_t *) dp->data; + + if (!strcmp(dp->name, FIRMWARE_FLASH_NAME)) { + msglen = get_flash_status_msg(uf->status, msg, RTAS_MSG_MAXLEN); + } else { /* FIRMWARE_UPDATE_NAME */ + msglen = sprintf(msg, "%d\n", uf->status); } - msglen = strlen(msg); + + if (*ppos >= msglen) + return 0; + msglen -= *ppos; if (msglen > count) msglen = count; - if (ppos && *ppos != 0) - return 0; /* be cheap */ - - error = verify_area(VERIFY_WRITE, buf, msglen); - if (error) - return -EINVAL; - - copy_to_user(buf, msg, msglen); + if (copy_to_user(buf, msg + (*ppos), msglen)) + return -EFAULT; + *ppos += msglen; - if (ppos) - *ppos = msglen; return msglen; } @@ -164,14 +254,28 @@ static ssize_t rtas_flash_read(struct fi static ssize_t rtas_flash_write(struct file *file, const char *buffer, size_t count, loff_t *off) { - size_t len = count; + struct proc_dir_entry *dp = file->f_dentry->d_inode->u.generic_ip; + struct rtas_update_flash_t *uf; char *p; int next_free; - struct flash_block_list *fl = flist; + struct flash_block_list *fl; + + uf = (struct rtas_update_flash_t *) dp->data; - if (!flash_possible || len == 0) - return len; /* discard data */ + if (uf->status == FLASH_AUTH || count == 0) + 
return count; /* discard data */ + /* In the case that the image is not ready for flashing, the memory + * allocated for the block list will be freed upon the release of the + * proc file + */ + if (uf->flist == NULL) { + uf->flist = (struct flash_block_list *) get_free_page(GFP_KERNEL); + if (!uf->flist) + return -ENOMEM; + } + + fl = uf->flist; while (fl->next) fl = fl->next; /* seek to last block_list for append */ next_free = fl->num_blocks; @@ -184,47 +288,366 @@ static ssize_t rtas_flash_write(struct f next_free = 0; } - if (len > PAGE_SIZE) - len = PAGE_SIZE; + if (count > PAGE_SIZE) + count = PAGE_SIZE; p = (char *)get_free_page(GFP_KERNEL); if (!p) return -ENOMEM; - if(copy_from_user(p, buffer, len)) { + + if(copy_from_user(p, buffer, count)) { free_page((unsigned long)p); return -EFAULT; } fl->blocks[next_free].data = p; - fl->blocks[next_free].length = len; + fl->blocks[next_free].length = count; fl->num_blocks++; - return len; + return count; +} + +static int rtas_excl_open(struct inode *inode, struct file *file) +{ + struct proc_dir_entry *dp = file->f_dentry->d_inode->u.generic_ip; + + /* Enforce exclusive open with use count of PDE */ + spin_lock(&flash_file_open_lock); + if (atomic_read(&dp->count) > 1) { + spin_unlock(&flash_file_open_lock); + return -EBUSY; + } + + atomic_inc(&dp->count); + spin_unlock(&flash_file_open_lock); + + return 0; +} + +static int rtas_excl_release(struct inode *inode, struct file *file) +{ + struct proc_dir_entry *dp = file->f_dentry->d_inode->u.generic_ip; + + atomic_dec(&dp->count); + + return 0; +} + +static void manage_flash(struct rtas_manage_flash_t *args_buf) +{ + s32 rc; + + while (1) { + rc = (s32) rtas_call(rtas_token("ibm,manage-flash-image"), 1, + 1, NULL, (long) args_buf->op); + if (rc == RTAS_RC_BUSY) + udelay(1); + else if (rtas_is_extended_busy(rc)) + rtas_do_extended_delay(rc); + else + break; + } + + args_buf->status = rc; +} + +static ssize_t manage_flash_read(struct file *file, char *buf, + size_t 
count, loff_t *ppos) +{ + struct proc_dir_entry *dp = file->f_dentry->d_inode->u.generic_ip; + struct rtas_manage_flash_t *args_buf; + char msg[RTAS_MSG_MAXLEN]; + int msglen; + + args_buf = (struct rtas_manage_flash_t *) dp->data; + if (args_buf == NULL) + return 0; + + msglen = sprintf(msg, "%d\n", args_buf->status); + if (*ppos >= msglen) + return 0; + + msglen -= *ppos; + if (msglen > count) + msglen = count; + + if (copy_to_user(buf, msg + (*ppos), msglen)) + return -EFAULT; + *ppos += msglen; + + return msglen; +} + +static ssize_t manage_flash_write(struct file *file, const char *buf, + size_t count, loff_t *off) +{ + struct proc_dir_entry *dp = file->f_dentry->d_inode->u.generic_ip; + struct rtas_manage_flash_t *args_buf; + const char reject_str[] = "0"; + const char commit_str[] = "1"; + char msg[RTAS_MSG_MAXLEN]; + int op; + + args_buf = (struct rtas_manage_flash_t *) dp->data; + if ((args_buf->status == MANAGE_AUTH) || (count == 0)) + return count; + + if (count > RTAS_MSG_MAXLEN) + count = RTAS_MSG_MAXLEN; + if (copy_from_user(msg, buf, count)) + return -EFAULT; + + if (strncmp(buf, reject_str, strlen(reject_str)) == 0) + op = RTAS_REJECT_TMP_IMG; + else if (strncmp(buf, commit_str, strlen(commit_str)) == 0) + op = RTAS_COMMIT_TMP_IMG; + else + return -EINVAL; + + args_buf->op = op; + manage_flash(args_buf); + *off += count; + + return count; +} + +static void validate_flash(struct rtas_validate_flash_t *args_buf) +{ + int token = rtas_token("ibm,validate-flash-image"); + unsigned int wait_time; + long update_results; + s32 rc; + + rc = 0; + while(1) { + spin_lock(&rtas_data_buf_lock); + memcpy(rtas_data_buf, args_buf->buf, VALIDATE_BUF_SIZE); + rc = (s32) rtas_call(token, 2, 2, &update_results, + __pa(rtas_data_buf), args_buf->buf_size); + memcpy(args_buf->buf, rtas_data_buf, VALIDATE_BUF_SIZE); + spin_unlock(&rtas_data_buf_lock); + + if (rc == RTAS_RC_BUSY) + udelay(1); + else if (rtas_is_extended_busy(rc)) { + rtas_do_extended_delay(rc); + } else + 
break; + } + + args_buf->status = rc; + args_buf->update_results = (u32) update_results; +} + +static int get_validate_flash_msg(struct rtas_validate_flash_t *args_buf, + char *msg, int size) +{ + int n; + + if (args_buf->status >= VALIDATE_TMP_UPDATE) { + n = snprintf(msg, size, "%u\n", args_buf->update_results); + if ((args_buf->update_results >= VALIDATE_CUR_UNKNOWN) || + (args_buf->update_results == VALIDATE_TMP_UPDATE)) + n += snprintf(msg + n, size - n, "%s\n", args_buf->buf); + } else { + n = snprintf(msg, size, "%d\n", args_buf->status); + } + + return n >= size ? size - 1 : n; +} + +static ssize_t validate_flash_read(struct file *file, char *buf, + size_t count, loff_t *ppos) +{ + struct proc_dir_entry *dp = file->f_dentry->d_inode->u.generic_ip; + struct rtas_validate_flash_t *args_buf; + char msg[RTAS_MSG_MAXLEN]; + int msglen; + + args_buf = (struct rtas_validate_flash_t *) dp->data; + + msglen = get_validate_flash_msg(args_buf, msg, RTAS_MSG_MAXLEN); + + if (*ppos >= msglen) + return 0; + + msglen -= *ppos; + if (msglen > count) + msglen = count; + + if (copy_to_user(buf, msg + (*ppos), msglen)) + return -EFAULT; + *ppos += msglen; + + return msglen; +} + +static ssize_t validate_flash_write(struct file *file, const char *buf, + size_t count, loff_t *off) +{ + struct proc_dir_entry *dp = file->f_dentry->d_inode->u.generic_ip; + struct rtas_validate_flash_t *args_buf; + + args_buf = (struct rtas_validate_flash_t *) dp->data; + + if (dp->data == NULL) { + dp->data = kmalloc(sizeof(struct rtas_validate_flash_t), + GFP_KERNEL); + if (dp->data == NULL) + return -ENOMEM; + } + + /* We are only interested in the first 4K of the + * candidate image */ + if ((*off >= VALIDATE_BUF_SIZE) || + (args_buf->status == VALIDATE_AUTH)) { + *off += count; + return count; + } + + if (*off + count >= VALIDATE_BUF_SIZE) { + count = VALIDATE_BUF_SIZE - *off; + args_buf->status = VALIDATE_READY; + } else { + args_buf->status = VALIDATE_INCOMPLETE; + } + + if 
(copy_from_user(args_buf->buf + *off, buf, count)) + return -EFAULT; + *off += count; + + return count; +} + +static int validate_flash_release(struct inode *inode, struct file *file) +{ + struct proc_dir_entry *dp = file->f_dentry->d_inode->u.generic_ip; + struct rtas_validate_flash_t *args_buf; + + args_buf = (struct rtas_validate_flash_t *) dp->data; + + if (args_buf->status == VALIDATE_READY) { + args_buf->buf_size = VALIDATE_BUF_SIZE; + validate_flash(args_buf); + } + + atomic_dec(&dp->count); + + return 0; +} + +static inline void remove_flash_pde(struct proc_dir_entry *dp) +{ + if (dp) { + if (dp->data != NULL) + kfree(dp->data); + remove_proc_entry(dp->name, rtas_proc_dir); + } +} + +static inline int initialize_flash_pde_data(const char *rtas_call_name, + size_t buf_size, + struct proc_dir_entry *dp) +{ + int *status; + int token; + + dp->data = kmalloc(buf_size, GFP_KERNEL); + if (dp->data == NULL) { + remove_flash_pde(dp); + return -ENOMEM; + } + + memset(dp->data, 0, buf_size); + + /* This code assumes that the status int is the first member of the + * struct + */ + status = (int *) dp->data; + token = rtas_token(rtas_call_name); + if (token == RTAS_UNKNOWN_SERVICE) + *status = FLASH_AUTH; + else + *status = FLASH_NO_OP; + + return 0; +} + +static inline struct proc_dir_entry * create_flash_pde(const char *filename, + struct file_operations *fops) +{ + struct proc_dir_entry *ent = NULL; + + ent = create_proc_entry(filename, S_IRUSR | S_IWUSR, rtas_proc_dir); + if (ent != NULL) { + ent->nlink = 1; + ent->proc_fops = fops; + ent->owner = THIS_MODULE; + } + + return ent; } static struct file_operations rtas_flash_operations = { read: rtas_flash_read, write: rtas_flash_write, - open: rtas_flash_open, + open: rtas_excl_open, release: rtas_flash_release, }; +static struct file_operations manage_flash_operations = { + read: manage_flash_read, + write: manage_flash_write, + open: rtas_excl_open, + release: rtas_excl_release, +}; + +static struct file_operations 
validate_flash_operations = { + read: validate_flash_read, + write: validate_flash_write, + open: rtas_excl_open, + release: validate_flash_release, +}; int __init rtas_flash_init(void) { - struct proc_dir_entry *ent = NULL; + int rc; if (!rtas_proc_dir) { printk(KERN_WARNING "rtas proc dir does not already exist"); return -ENOENT; } - if (rtas_token("ibm,update-flash-64-and-reboot") != RTAS_UNKNOWN_SERVICE) - flash_possible = 1; - - if ((ent = create_proc_entry(FIRMWARE_FLASH_NAME, S_IRUSR | S_IWUSR, rtas_proc_dir)) != NULL) { - ent->nlink = 1; - ent->proc_fops = &rtas_flash_operations; - ent->owner = THIS_MODULE; - } + firmware_flash_pde = create_flash_pde(FIRMWARE_FLASH_NAME, + &rtas_flash_operations); + rc = initialize_flash_pde_data("ibm,update-flash-64-and-reboot", + sizeof(struct rtas_update_flash_t), + firmware_flash_pde); + if (rc != 0) + return rc; + + firmware_update_pde = create_flash_pde(FIRMWARE_UPDATE_NAME, + &rtas_flash_operations); + rc = initialize_flash_pde_data("ibm,update-flash-64-and-reboot", + sizeof(struct rtas_update_flash_t), + firmware_update_pde); + if (rc != 0) + return rc; + + validate_pde = create_flash_pde(VALIDATE_FLASH_NAME, + &validate_flash_operations); + rc = initialize_flash_pde_data("ibm,validate-flash-image", + sizeof(struct rtas_validate_flash_t), + validate_pde); + if (rc != 0) + return rc; + + manage_pde = create_flash_pde(MANAGE_FLASH_NAME, + &manage_flash_operations); + rc = initialize_flash_pde_data("ibm,manage-flash-image", + sizeof(struct rtas_manage_flash_t), + manage_pde); + if (rc != 0) + return rc; + return 0; } @@ -232,7 +655,10 @@ void __exit rtas_flash_cleanup(void) { if (!rtas_proc_dir) return; - remove_proc_entry(FIRMWARE_FLASH_NAME, rtas_proc_dir); + remove_flash_pde(firmware_flash_pde); + remove_flash_pde(firmware_update_pde); + remove_flash_pde(validate_pde); + remove_flash_pde(manage_pde); } module_init(rtas_flash_init); diff -urpN -X /home/johnrose/tmp/diffignore.txt 
/usr/src/linux-2.4.21-6.EL/include/asm-ppc64/rtas.h ./EL_ef/include/asm-ppc64/rtas.h
--- linux-2.4.21-6.EL/include/asm-ppc64/rtas.h	2003-12-09 13:41:33.000000000 -0600
+++ ./include/asm-ppc64/rtas.h	2004-01-15 15:32:13.000000000 -0600
@@ -182,6 +182,13 @@ extern int rtas_errinjct_close(unsigned
 extern struct proc_dir_entry *rtas_proc_dir;
 extern struct errinjct_token ei_token_list[MAX_ERRINJCT_TOKENS];
 
+/* Given an RTAS status code of 9900..9905 compute the hinted delay */
+void rtas_do_extended_delay(int status);
+static inline int rtas_is_extended_busy(int status)
+{
+	return status >= 9900 && status <= 9905;
+}
+
 extern void pSeries_log_error(char *buf, unsigned int err_type, int fatal);
 
 /* Error types logged. */

From johnrose at austin.ibm.com Tue Jan 20 06:15:05 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Mon, 19 Jan 2004 13:15:05 -0600
Subject: rtas syscall
Message-ID: <1074539705.23918.30.camel@verve>

Paul, Rusty, Everyone-

A month or two ago, I pushed an implementation of an RTAS system call as
proposed by Rusty and Paul to Ameslab 2.6. I picked a syscall number of
255 for this, because it was free. How can I ensure that this number
will be reserved upstream for my syscall?

Thanks-
John

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From jschopp at austin.ibm.com Tue Jan 20 07:14:21 2004
From: jschopp at austin.ibm.com (jschopp at austin.ibm.com)
Date: Mon, 19 Jan 2004 14:14:21 -0600 (CST)
Subject: [PATCH][2.6] Nested Interrupt support
In-Reply-To: <20040119042022.GA20834@krispykreme>
Message-ID:

We can't use percpu data here because the memory manager hasn't been
initialized. To use per cpu data we need to be able to call kmalloc.

On Mon, 19 Jan 2004, Anton Blanchard wrote:

> Looks good. Could we use per cpu data here (do we init per cpu data
> before the xics setup)? Also Im wondering if we should have a quick
> check for overflow of the buffer.
>

** Sent via the linuxppc64-dev mail list.
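
To make the two points in this exchange concrete, here is a minimal
user-space sketch, not code from the patch. The per-CPU state sits in a
statically sized array, so nothing has to be allocated during early
setup, and a push checks for overflow before it writes. MAX_CPUS,
STACK_DEPTH and the function names are hypothetical stand-ins, and the
-1 error return is only an assumption about how an overflow might be
reported.

```c
#define MAX_CPUS    4	/* hypothetical stand-in for the real CPU limit */
#define STACK_DEPTH 2	/* hypothetical nesting limit */

struct prio_stack {
	int depth;
	int priority[STACK_DEPTH];
};

/* Static storage: usable before any allocator has been initialised. */
static struct prio_stack prio_stacks[MAX_CPUS];

/* Save a priority for the given CPU. Returns 0 on success, or -1 if
 * cpu is out of range or the stack for that CPU is already full. */
int prio_push(int cpu, int priority)
{
	struct prio_stack *s;

	if (cpu < 0 || cpu >= MAX_CPUS)
		return -1;
	s = &prio_stacks[cpu];
	if (s->depth >= STACK_DEPTH)
		return -1;	/* the overflow check */
	s->priority[s->depth++] = priority;
	return 0;
}

/* Pop the most recently saved priority into *priority. Returns 0 on
 * success, or -1 if cpu is out of range or the stack is empty. */
int prio_pop(int cpu, int *priority)
{
	struct prio_stack *s;

	if (cpu < 0 || cpu >= MAX_CPUS)
		return -1;
	s = &prio_stacks[cpu];
	if (s->depth <= 0)
		return -1;
	*priority = s->priority[--s->depth];
	return 0;
}
```

Whether an overflow should return an error, warn, or panic is a design
choice this sketch does not try to settle.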
See http://lists.linuxppc.org/

From moilanen at austin.ibm.com Tue Jan 20 07:19:00 2004
From: moilanen at austin.ibm.com (Jake Moilanen)
Date: Mon, 19 Jan 2004 14:19:00 -0600
Subject: NAP mode on powerpc 970
In-Reply-To: <1074304946.8360.15.camel@gaston>
References: <4007EF09.1070305@thalescomputers.fr>
	<1074276078.1240.227.camel@magik>
	<1074304946.8360.15.camel@gaston>
Message-ID: <1074543540.1100.258.camel@magik>

> > Linux does not support NAP mode for the 970's. Going to this mode
> > powers off the caches (thus killing the caches). There are some other
> > potential issues (thermal, and power supply).
> >
> > This should not be done in Linux in the first place. The correct place
> > is in the FW.
>
> No, no ... I do it on the G5 without problem :)

On the G5, how does the OS have knowledge if it needs to go into NAP
mode or not?

From my understanding, going in and out of NAP mode causes power spikes
and there are some unknown thermal implications on the CPU.

> The north bridge will get the CPU out of NAP mode for snooping, the
> cache aren't powered off.

You are correct. After getting a second opinion from a different FW
guy, I was told that the caches are not powered off.

> Then HV could do it as well... Actually, on the js20, it could just
> leave HID0:NAP set permanently and let the kernel use MSR:POW from
> its idle loop...

When I talked to the HV team, they did not want to adopt this solution.
They are leaning towards another alternative, or having it done in FW
so the change does not need to be made in multiple places (e.g.
different Linux distros and AIX).

Thanks,
Jake

** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/

From olof at austin.ibm.com Tue Jan 20 07:47:50 2004
From: olof at austin.ibm.com (Olof Johansson)
Date: Mon, 19 Jan 2004 14:47:50 -0600
Subject: [PATCH] [2.6] xmon doesn't compile without CONFIG_MAGIC_SYSRQ
In-Reply-To: <400C10A6.2020701@austin.ibm.com>
References: <400C10A6.2020701@austin.ibm.com>
Message-ID: <400C4276.6070404@austin.ibm.com>

Olof Johansson wrote:
> I guess I can see that one would want to use xmon without magic_sysrq,
> so attached patch makes xmon compile without it.

There's also a missing dependency from CONFIG_XMON to
CONFIG_DEBUG_KERNEL, added by attached patch. I'll push it together with
the other xmon change.

-Olof

--
Olof Johansson                                        Office: 4F005/905
pSeries Linux Development                             IBM Systems Group
Email: olof at austin.ibm.com                          Phone: 512-838-9858
All opinions are my own and not those of IBM
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: xmon-debug-kernel
Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040119/611024a0/attachment.txt

From moilanen at austin.ibm.com Tue Jan 20 08:22:49 2004
From: moilanen at austin.ibm.com (Jake Moilanen)
Date: Mon, 19 Jan 2004 15:22:49 -0600
Subject: [PATCH][2.6] Nested Interrupt support
In-Reply-To: <20040119042022.GA20834@krispykreme>
References: <1074094346.2389.42.camel@magik>
	<20040119042022.GA20834@krispykreme>
Message-ID: <1074547368.1098.326.camel@magik>

On Sun, 2004-01-18 at 22:20, Anton Blanchard wrote:
> Looks good. Could we use per cpu data here (do we init per cpu data
> before the xics setup)? Also Im wondering if we should have a quick
> check for overflow of the buffer.

I do have a debug patch that did more validation of the stacks if you
want more. I attached a patch w/ updates.

> > One concern I have is at the end of ppc_irq_dispatch_handler(), there is
> > a check to see if the desc->handler went away due to an interrupt being
> > disabled.
If the handler does go away, desc->handler->end will not be
> > called and the irq_stack will get out of sync. I could not find
> > anywhere where this handler would actually be removed (eg function
> > pointer set to zero). Why is this code still here?
>
> Im not sure, how does it match up with what x86 does these days?

It looks like x86 just does desc->handler->end(irq) w/o checking that
desc->handler is valid or not. I'll make the change to just do the
end() call.

Thanks,
Jake
-------------- next part --------------
# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
#	ChangeSet			1.1389 -> 1.1390
#	arch/ppc64/kernel/irq.c		1.53 -> 1.54
#	arch/ppc64/kernel/xics.c	1.37 -> 1.38
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 04/01/19 moilanen at threadlp13.austin.ibm.com 1.1390
# XICs nested interrupt support.
# --------------------------------------------
#
diff -Nru a/arch/ppc64/kernel/irq.c b/arch/ppc64/kernel/irq.c
--- a/arch/ppc64/kernel/irq.c	Mon Jan 19 15:16:10 2004
+++ b/arch/ppc64/kernel/irq.c	Mon Jan 19 15:16:10 2004
@@ -822,16 +822,9 @@
 	}
 out:
 	desc->status &= ~IRQ_INPROGRESS;
-	/*
-	 * The ->end() handler has to deal with interrupts which got
-	 * disabled while the handler was running.
- */ - if (desc->handler) { - if (desc->handler->end) - desc->handler->end(irq); - else if (desc->handler->enable) - desc->handler->enable(irq); - } + + desc->handler->end(irq); + spin_unlock(&desc->lock); } diff -Nru a/arch/ppc64/kernel/xics.c b/arch/ppc64/kernel/xics.c --- a/arch/ppc64/kernel/xics.c Mon Jan 19 15:16:10 2004 +++ b/arch/ppc64/kernel/xics.c Mon Jan 19 15:16:10 2004 @@ -92,6 +92,21 @@ static unsigned int default_server = 0xFF; static unsigned int default_distrib_server = 0; +/* Number of nested IRQs we can store */ +#define IRQ_DEPTH 2 + +struct cpu_irq_stack +{ + int depth; + int priority[IRQ_DEPTH]; + int irq[IRQ_DEPTH]; +}; + +struct cpu_irq_stack _irq_stack[NR_CPUS]; + +#define irq_stack _irq_stack[smp_processor_id()] +#define irq_stack_depth (irq_stack).depth + /* * XICS only has a single IPI, so encode the messages per CPU */ @@ -302,20 +317,36 @@ void xics_end_irq(unsigned int irq) { int cpu = smp_processor_id(); + unsigned int priority; + + if (irq >= 0 && irq != irq_offset_up(xics_irq_8259_cascade)) { + irq_stack_depth--; + priority = irq_stack.priority[irq_stack_depth]; + } else { + priority = 0xff; + } iosync(); - ops->xirr_info_set(cpu, ((0xff<<24) | (irq_offset_down(irq)))); + ops->xirr_info_set(cpu, (priority<<24) | (irq_offset_down(irq))); } void xics_mask_and_ack_irq(u_int irq) { int cpu = smp_processor_id(); + unsigned int priority; if (irq < irq_offset_value()) { + if (irq >= 0) { + irq_stack_depth--; + priority = irq_stack.priority[irq_stack_depth]; + } else { + priority = 0xff; + } + i8259_pic.ack(irq); iosync(); - ops->xirr_info_set(cpu, ((0xff<<24) | + ops->xirr_info_set(cpu, ((priority<<24) | xics_irq_8259_cascade_real)); iosync(); } @@ -325,10 +356,12 @@ { u_int cpu = smp_processor_id(); u_int vec; + u_int priority; int irq; vec = ops->xirr_info_get(cpu); - /* (vec >> 24) == old priority */ + + priority = vec >> 24; vec &= 0x00ffffff; /* for sanity, this had better be < NR_IRQS - 16 */ @@ -345,6 +378,16 @@ } else { irq = 
irq_offset_up(vec);
 	}
+
+	if (irq >= 0) {
+		if (irq_stack_depth >= IRQ_DEPTH)
+			panic("Illegal irq stack depth");
+
+		irq_stack.priority[irq_stack_depth] = priority;
+		irq_stack.irq[irq_stack_depth] = irq;
+		irq_stack_depth++;
+	}
+
 	return irq;
 }

@@ -413,7 +456,7 @@

 void xics_init_IRQ(void)
 {
-	int i;
+	int i, j;
 	unsigned long intr_size = 0;
 	struct device_node *np;
 	uint *ireg, ilen, indx = 0;
@@ -531,6 +574,14 @@
 	xics_8259_pic.disable = i8259_pic.disable;
 	for (i = 0; i < 16; ++i)
 		get_real_irq_desc(i)->handler = &xics_8259_pic;
+
+	for (i = 0; i < NR_CPUS; i++) {
+		_irq_stack[i].depth = 0;
+		for (j = 0; j < IRQ_DEPTH; j++) {
+			_irq_stack[i].priority[j] = 0xff;
+			_irq_stack[i].irq[j] = -1;
+		}
+	}

 	ops->cppr_info(boot_cpuid, 0xff);
 	iosync();

From benh at kernel.crashing.org Tue Jan 20 08:41:28 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Tue, 20 Jan 2004 08:41:28 +1100
Subject: autoconsole
In-Reply-To: <3AC0E7B4-4AA7-11D8-9C4D-000A95A4DC02@kernel.crashing.org>
References: <20040113185245.A1747@w-mikek2.beaverton.ibm.com>
	<20040114080749.GA374@suse.de>
	<080D5891-46A6-11D8-9E58-000A95A0560C@us.ibm.com>
	<20040115011808.GD27924@krispykreme>
	<20040115085849.A1808@w-mikek2.beaverton.ibm.com>
	<20040115170631.GA22399@suse.de>
	<3238B645-477F-11D8-A8C3-000A95A4DC02@kernel.crashing.org>
	<20040118141742.GE6293@krispykreme>
	<1074504073.814.56.camel@gaston>
	<20040119090418.B1802@w-mikek2.beaverton.ibm.com>
	<3AC0E7B4-4AA7-11D8-9C4D-000A95A4DC02@kernel.crashing.org>
Message-ID: <1074548488.10585.35.camel@gaston>

On Tue, 2004-01-20 at 04:45, Segher Boessenkool wrote:
> > Can anyone point to where this stuff is documented?
>
> I think the CHRP docs should document this...
>
> 3f8, 2f8 are just the legacy x86 i/o addresses for the first and
> second serial ports; I assume 898, 890 are the CHRP standardized
> addresses for the third and fourth?

Can't you use the OF node names instead?

Ben.
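For readers following the XICS nested-interrupt patch earlier in this thread: it saves the priority read from the XIRR when an interrupt is accepted and restores it when the interrupt ends, using a small fixed-depth stack per CPU. Below is a minimal user-space sketch of that push/pop discipline, assuming a single CPU. The function names and the PRIO_IDLE constant are hypothetical, and where the patch calls panic() on overflow the sketch returns -1 instead.

```c
/* Number of nested IRQs we can store, as in the patch. */
#define IRQ_DEPTH 2

/* Priority value the patch uses when nothing was saved (0xff). */
#define PRIO_IDLE 0xffu

struct cpu_irq_stack {
	int depth;
	unsigned int priority[IRQ_DEPTH];
	int irq[IRQ_DEPTH];
};

/* Reset the stack to empty, mirroring the loop added to xics_init_IRQ(). */
static void irq_stack_init(struct cpu_irq_stack *s)
{
	int j;

	s->depth = 0;
	for (j = 0; j < IRQ_DEPTH; j++) {
		s->priority[j] = PRIO_IDLE;
		s->irq[j] = -1;
	}
}

/*
 * Record the priority that was in force when irq was accepted.
 * Returns 0 on success, -1 if the stack is full (the patch panics here).
 */
static int irq_stack_push(struct cpu_irq_stack *s, int irq,
			  unsigned int priority)
{
	if (s->depth >= IRQ_DEPTH)
		return -1;
	s->priority[s->depth] = priority;
	s->irq[s->depth] = irq;
	s->depth++;
	return 0;
}

/*
 * Pop the most recent entry and return the priority to restore.
 * Returning PRIO_IDLE for an empty stack is a guard added by this
 * sketch; the patch does not check for underflow.
 */
static unsigned int irq_stack_pop(struct cpu_irq_stack *s)
{
	if (s->depth <= 0)
		return PRIO_IDLE;
	s->depth--;
	return s->priority[s->depth];
}
```

Pushing two interrupts and then ending them in reverse order hands back the saved priorities last-in, first-out, which is the property the end-of-interrupt path relies on.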
From nathanl at austin.ibm.com Tue Jan 20 08:42:26 2004
From: nathanl at austin.ibm.com (Nathan Lynch)
Date: Mon, 19 Jan 2004 15:42:26 -0600
Subject: [PATCH] [2.6] xmon doesn't compile without CONFIG_MAGIC_SYSRQ
In-Reply-To: <400C4276.6070404@austin.ibm.com>
References: <400C10A6.2020701@austin.ibm.com> <400C4276.6070404@austin.ibm.com>
Message-ID: <400C4F42.50009@austin.ibm.com>

> There's also a missing dependency from CONFIG_XMON to
> CONFIG_DEBUG_KERNEL, added by attached patch. I'll push it together with
> the other xmon change.
> ===== Kconfig 1.33 vs edited =====
> --- 1.33/arch/ppc64/Kconfig	Tue Dec 16 22:27:52 2003
> +++ edited/Kconfig	Mon Jan 19 14:44:56 2004
> @@ -380,6 +380,7 @@
>
>  config XMON
>  	bool "XMON"
> +	depends on DEBUG_KERNEL
>  	help
>  	  Include in-kernel hooks for the xmon kernel monitor/debugger.
>  	  Unless you are intending to debug the kernel, say N here.

Would you consider the attached patch instead, which I posted a while
back? Both xmon and kdb should depend on DEBUG_KERNEL. Additionally,
one should be able to enable DEBUG_KERNEL without having to enable a
debugger.

Thanks,
Nathan
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: debugger_optional.patch
Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040119/9ccddeab/attachment.txt

From benh at kernel.crashing.org Tue Jan 20 08:50:39 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Tue, 20 Jan 2004 08:50:39 +1100
Subject: NAP mode on powerpc 970
In-Reply-To: <1074543540.1100.258.camel@magik>
References: <4007EF09.1070305@thalescomputers.fr>
	<1074276078.1240.227.camel@magik>
	<1074304946.8360.15.camel@gaston>
	<1074543540.1100.258.camel@magik>
Message-ID: <1074549039.11809.48.camel@gaston>

> On the G5, how does the OS have knowledge if it needs to go into NAP
> mode or not?
From my understanding, going in and out of NAP mode causes
> power spikes and there are some unknown thermal implications on the
> CPU.

I just do it in the idle loop like with any previous G4 or G3 CPU, and it
appears to work fine. Do you have some documentation about the possible
issues? I didn't catch anything special in the Darwin code regarding use
of NAP mode either; it seems to be used normally there as well.

> When I talked to the HV team, they do not want to do this solution.
> They are leaning towards another alternative, or having it done in FW so
> the change does not need to be done in multiple places (e.g. Different
> linux distros, and AIX).

No comment...

Ben.

From benh at kernel.crashing.org Tue Jan 20 09:09:51 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Tue, 20 Jan 2004 09:09:51 +1100
Subject: [2.4] PCI bases with value 0 -- upstream status?
In-Reply-To: <400C1BE2.2080009@austin.ibm.com>
References: <400C1BE2.2080009@austin.ibm.com>
Message-ID: <1074550190.12326.59.camel@gaston>

On Tue, 2004-01-20 at 05:03, Olof Johansson wrote:
> I'm not sure how many times this subject has come up, but here we go again:
>
> The pci_read_bridge_bases() code in drivers/pci/pci.c assumes that all
> resources start on a non-zero address, which is not true on our systems.
> On LPAR machines, as well as some SMP configs, we might very well have a
> 0 base.
>

Yes, 0 is a valid base on PCI afaik, though lots of code tends to assume
it's not...

I'd suggest you run the patch through lkml though... Getting it into 2.6
first would surely help.

Ben.
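To illustrate the point about zero bases: a bridge window is described by a base and a limit, and a window whose base is 0 is still a usable window as long as the base does not exceed the limit. The sketch below contrasts a check that also demands a non-zero base with one that only compares base and limit. It is a user-space illustration with hypothetical helper names, not the code in drivers/pci/pci.c.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check that treats a zero base as "window not set up".
 * A window that legitimately starts at address 0 is rejected.
 */
static bool window_valid_nonzero_base(uint64_t base, uint64_t limit)
{
	return base != 0 && base <= limit;
}

/*
 * Hypothetical check that accepts any window with base <= limit,
 * including one that starts at address 0.
 */
static bool window_valid(uint64_t base, uint64_t limit)
{
	return base <= limit;
}
```

With a window of [0, 0xfffff] the first helper reports the window as absent while the second accepts it; both agree on windows with a non-zero base.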
From benh at kernel.crashing.org Tue Jan 20 09:12:01 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Tue, 20 Jan 2004 09:12:01 +1100
Subject: [PATCH][2.6] Nested Interrupt support
In-Reply-To:
References:
Message-ID: <1074550321.10595.62.camel@gaston>

On Tue, 2004-01-20 at 07:14, jschopp at austin.ibm.com wrote:
> We can't use percpu data here because the memory manager hasn't been
> initialized. To use per cpu data we need to be able to call kmalloc.

Ugh? Hopefully not!

percpu variables are stored in a separate .data section, you don't need
kmalloc to be available to use them...

Ben.

From jschopp at austin.ibm.com Tue Jan 20 09:43:48 2004
From: jschopp at austin.ibm.com (jschopp at austin.ibm.com)
Date: Mon, 19 Jan 2004 16:43:48 -0600 (CST)
Subject: [PATCH][2.6] Nested Interrupt support
In-Reply-To: <1074550321.10595.62.camel@gaston>
Message-ID:

Upon further thought it appears that I was incorrect and you are correct.
Jake should be able to use DEFINE_PER_CPU just fine. I'm afraid I was
thinking of __alloc_percpu, which is not what would be used here.

On Tue, 20 Jan 2004, Benjamin Herrenschmidt wrote:
> > We can't use percpu data here because the memory manager hasn't been
> > initialized. To use per cpu data we need to be able to call kmalloc.
>
> Ugh ? Hopefully not !
>
> percpu are stored in a separate .data section, you don't need
> kmalloc to be available to use them...

From johnrose at austin.ibm.com Tue Jan 20 09:56:08 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Mon, 19 Jan 2004 16:56:08 -0600
Subject: Comments concerning Enhanced Flash Patch
In-Reply-To:
References:
Message-ID: <1074552968.23918.105.camel@verve>

Hi Mark-

> o In rtas_do_extended_delay(), should the blocking state of the current
>   task be TASK_UNINTERRUPTIBLE?
If you're waiting on hardware, you
>   normally want to block uninterruptibly however this doesn't appear
>   to be the case here. Can someone please comment on this?

To catch up the rest of the list, we have changed the handling of the
"extended RTAS delay" to use schedule_timeout() rather than udelay().
The extended delay time can be up to 100 secs, which is excessive for
udelay().

As for this question, I think you should be able to signal a task while
it's blocking for one of these delays. However, my initial patch didn't
check for pending signals after schedule_timeout() returns. Mark, I'll
have a patch out to you shortly that does.

> o Although there is enforcement for a single opener, there is concern
>   about providing mutual exclusion for multiple writers. Creating new
>   processes/threads via fork/pthread_create in conjunction with possible
>   effects from dup(2) could introduce the risk of data corruption. Red
>   Hat suggests the use of a semaphore that guards reads and writes either
>   in the entire module or for a specific file.

This comes down to accounting for possible/reasonable user error vs.
keeping the code simple. In the realm of /proc/ppc64 files, there are
_plenty_ of places where this type of data corruption could occur.
Errinjct and scanlog are two quick examples. My additions to the flash
code did not introduce this vulnerability, but regardless, it doesn't
seem probable to me. A user who writes a pthreaded app for such a job is
just begging for trouble :) If I'm wrong, we should address this problem
across the board.

> o In manage_flash() and validate_flash(), there was concern about the
>   worst case elapsed length of time for the execution of the loop.
>   Terminating the loop after some threshold and returning an error code
>   (-EIO?) is one suggested solution. Please advise whether this is an
>   acceptable solution or whether a problem even exists at all.

I don't think that any of the RTAS-related kernel code does this.
The case of firmware endlessly returning busy seems remote to me.
Opinions? Another problem that exists across the board, if it's a real
possibility.

> o In rtas_flash_init(), there is the possibility for memory leaks in the
>   failure cases. In addition, there is also the possibility of
>   dereferencing a null pointer in initialize_flash_pde_data() via dp if
>   create_flash_pde() returns null. This really must be fixed.

Agreed, will be in patch.

Thanks-
John

From rusty at au1.ibm.com Tue Jan 20 11:45:02 2004
From: rusty at au1.ibm.com (Rusty Russell)
Date: Tue, 20 Jan 2004 11:45:02 +1100
Subject: rtas syscall
In-Reply-To: Your message of "Mon, 19 Jan 2004 13:15:05 MDT." <1074539705.23918.30.camel@verve>
Message-ID: <20040120004745.0CA3F17DE9@ozlabs.au.ibm.com>

In message <1074539705.23918.30.camel at verve> you write:
> Paul, Rusty, Everyone-
>
> A month or two ago, I pushed an implementation of an RTAS system call as
> proposed by Rusty and Paul to Ameslab 2.6. I picked a syscall number of
> 255 for this, because it was free. How can I ensure that this number
> will be reserved upstream for my syscall?

Submit the syscall reservation as a patch, or just send a mail as you
have done. It's not unusual for syscall reservation patches to go in
upstream long before the actual implementation.

Cheers,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

From david at gibson.dropbear.id.au Tue Jan 20 14:55:48 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Tue, 20 Jan 2004 14:55:48 +1100
Subject: [PPC64] Fix for 32-bit execve() error path
Message-ID: <20040120035548.GE6455@zax>

Andrew, please apply.

The patch below fixes a bug in ppc64's 32-bit execve() path. It
duplicates logic already in the normal fs/exec.c do_execve() to avoid
dropping a NULL mm.
The bprm.mm becomes NULL once the exec passes the "point of no return".
Without this patch a failure past that point (e.g. mmap() failure) will
cause an oops; with it, just a killed process.

diff -urN ppc64-linux-2.5/arch/ppc64/kernel/sys_ppc32.c linux-gogogo/arch/ppc64/kernel/sys_ppc32.c
--- ppc64-linux-2.5/arch/ppc64/kernel/sys_ppc32.c	2004-01-19 14:20:32.484450172 +1100
+++ linux-gogogo/arch/ppc64/kernel/sys_ppc32.c	2004-01-20 14:15:02.093551035 +1100
@@ -2084,7 +2084,8 @@
 	security_bprm_free(&bprm);

 out_mm:
-	mmdrop(bprm.mm);
+	if (bprm.mm)
+		mmdrop(bprm.mm);

 out_file:
 	if (bprm.file) {

--
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson

From anton at samba.org Tue Jan 20 22:34:21 2004
From: anton at samba.org (Anton Blanchard)
Date: Tue, 20 Jan 2004 22:34:21 +1100
Subject: NAP mode on powerpc 970
In-Reply-To: <1074543540.1100.258.camel@magik>
References: <4007EF09.1070305@thalescomputers.fr>
	<1074276078.1240.227.camel@magik>
	<1074304946.8360.15.camel@gaston>
	<1074543540.1100.258.camel@magik>
Message-ID: <20040120113421.GL3620@krispykreme>

> When I talked to the HV team, they do not want to do this solution.
> They are leaning towards another alternative, or having it done in FW so
> the change does not need to be done in multiple places (e.g. Different
> linux distros, and AIX).

I disagree, we have to support both Apple and IBM products. Stashing
stuff into our FW may be a good idea for AIX but we don't have to do it
too.

Anton

From anton at samba.org Wed Jan 21 01:02:20 2004
From: anton at samba.org (Anton Blanchard)
Date: Wed, 21 Jan 2004 01:02:20 +1100
Subject: [2.4] PCI bases with value 0 -- upstream status?
In-Reply-To: <400C1BE2.2080009@austin.ibm.com>
References: <400C1BE2.2080009@austin.ibm.com>
Message-ID: <20040120140220.GP3620@krispykreme>

> I'm not sure how many times this subject has come up, but here we go again:
>
> The pci_read_bridge_bases() code in drivers/pci/pci.c assumes that all
> resources start on a non-zero address, which is not true on our systems.
> On LPAR machines, as well as some SMP configs, we might very well have a
> 0 base.
>
> I think the fixes have been in Ames before, but might have been taken
> out to keep us aligned with mainline? Have patches to fix
> drivers/pci/pci.c upstream been shot down? If so, should we add it back
> to Ames?

I've tried countless times to get the damn thing in but Linus refuses
each time :)

Anton

From johnrose at austin.ibm.com Wed Jan 21 03:50:45 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Tue, 20 Jan 2004 10:50:45 -0600
Subject: RTAS syscall reservation
Message-ID: <1074617445.25124.0.camel@verve>

Does the following look appropriate for reserving the RTAS syscall on
2.4? If so, I'll mail to lkml.

Thanks-
John

diff -Nru a/include/asm-ppc/unistd.h b/include/asm-ppc/unistd.h
--- a/include/asm-ppc/unistd.h	Tue Jan 20 10:47:07 2004
+++ b/include/asm-ppc/unistd.h	Tue Jan 20 10:47:07 2004
@@ -256,6 +256,7 @@
 #define __NR_clock_nanosleep	248
 #endif
 #define __NR_swapcontext	249
+#define __NR_rtas		255

 #define __NR(n)	#n

diff -Nru a/include/asm-ppc64/unistd.h b/include/asm-ppc64/unistd.h
--- a/include/asm-ppc64/unistd.h	Tue Jan 20 10:47:07 2004
+++ b/include/asm-ppc64/unistd.h	Tue Jan 20 10:47:07 2004
@@ -244,6 +244,7 @@
 #define __NR_alloc_hugepages	232
 #define __NR_free_hugepages	233
 #define __NR_exit_group		234
+#define __NR_rtas		255

 /* On powerpc a system call basically clobbers the same registers like a
  * function call, with the exception of LR (which is needed for the
From olof at austin.ibm.com Wed Jan 21 10:40:31 2004
From: olof at austin.ibm.com (Olof Johansson)
Date: Tue, 20 Jan 2004 17:40:31 -0600
Subject: [2.4] [PATCH] [RFC] Increasing MAX_ORDER for large mem configs
Message-ID: <400DBC6F.9010900@austin.ibm.com>

There's a variable defined that will override MAX_ORDER to what it's set
at. ia64 uses it whenever they have large pages enabled. For us, in some
cases, it's very beneficial to set it: to get larger dentry/buffer/inode
hash tables for very large mem configs.

Attached patch ups the default from order 11.

There's a risk of negative impact for some small to mid-size mem
configs, since the hashes might take more memory on the system, but it's
at the same time supposed to be controlled by total amount of ram (and
capped at MAX_ORDER).

Does anyone have concerns with this increase? Would someone be willing
to try booting a small (iSeries?) config to make sure it behaves ok? I
don't have access to one myself.

Thanks,

-Olof

--
Olof Johansson                                        Office: 4F005/905
pSeries Linux Development                             IBM Systems Group
Email: olof at austin.ibm.com                          Phone: 512-838-9858
All opinions are my own and not those of IBM
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: max-order
Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040120/9925659b/attachment.txt

From paulus at samba.org Wed Jan 21 11:56:22 2004
From: paulus at samba.org (Paul Mackerras)
Date: Wed, 21 Jan 2004 11:56:22 +1100
Subject: RTAS syscall reservation
In-Reply-To: <1074617445.25124.0.camel@verve>
References: <1074617445.25124.0.camel@verve>
Message-ID: <16397.52790.518865.682322@cargo.ozlabs.ibm.com>

John Rose writes:

> Does the following look appropriate for reserving the RTAS syscall on
> 2.4? If so, I'll mail to lkml.

Don't worry about it, I have it assigned already. :)

Paul.
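As background for the MAX_ORDER discussion above: an order-n allocation covers 2^n contiguous pages, so raising the largest permitted order raises the largest single hash table that can be allocated. The sketch below converts between orders and bytes. It assumes a 4 KiB page size, which the message does not state, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Page size assumed for this illustration only. */
#define SKETCH_PAGE_SIZE 4096ULL

/* Highest order the sketch will consider, to keep the shift well defined. */
#define SKETCH_MAX_ORDER 40u

/* Bytes covered by one allocation of the given order (2^order pages). */
static uint64_t order_to_bytes(unsigned int order)
{
	return SKETCH_PAGE_SIZE << order;
}

/*
 * Smallest order whose allocation holds at least `bytes` bytes,
 * capped at SKETCH_MAX_ORDER.
 */
static unsigned int bytes_to_order(uint64_t bytes)
{
	unsigned int order = 0;

	while (order < SKETCH_MAX_ORDER && order_to_bytes(order) < bytes)
		order++;
	return order;
}
```

Under that page-size assumption, order 9 works out to 2097152 bytes, the same figure the boot log later in this archive prints for its order-9 hash tables.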
From gringorat at yahoo.ca Wed Jan 21 15:42:45 2004
From: gringorat at yahoo.ca (Yannick Bertrand)
Date: Tue, 20 Jan 2004 23:42:45 -0500 (EST)
Subject: Successful boot on a 7025-F80
Message-ID: <20040121044245.55817.qmail@web60107.mail.yahoo.com>

Hi,

I just wanted to report that I successfully booted Linux 2.4.21 (with the
ppc64 patch) on an IBM 7025-F80 (with two 450 MHz RS64-III cpus and 4 Gigs
of RAM). I built the kernel with the ppc64 toolchain. I'm actually using
the kernel to run Gentoo Linux PPC (which is intended to run on Apple
hardware!).

Here's the output of dmesg:

Starting Linux PPC64 2.4.21
-----------------------------------------------------
naca = 0xc000000000004000
naca->pftSize = 0x1a
naca->paca = 0xc000000000426000
systemcfg = 0xc000000000005000
systemcfg->platform = 0x100
systemcfg->processor = 0x340001
systemcfg->processorCount = 0x2
systemcfg->physicalMemorySize = 0x100000000
systemcfg->dCacheL1LineSize = 0x80
systemcfg->iCacheL1LineSize = 0x80
htab_data.htab = 0xc0000000f4000000
htab_data.num_ptegs = 0x80000
-----------------------------------------------------
Linux version 2.4.21 (root at localhost.localdomain) (gcc version 3.2.3) #1 SMP Thu Jan 15 16:06:36 EST 2004
Boot arguments: console=ttyS0,9600n8 root=/dev/sdc1 ro
On node 0 totalpages: 1048576
zone(0): 1048576 pages.
zone(1): 0 pages.
zone(2): 0 pages.
Kernel command line: console=ttyS0,9600n8 root=/dev/sdc1 ro
time_init: decrementer frequency = 451.186209 MHz
time_init: processor frequency = 451.200000 MHz
Console: colour dummy device 80x25
Calibrating delay loop... 901.12 BogoMIPS
Memory: 3973968k available (2992k kernel code, 3124k data, 500k init) [c000000000000000,c000000100000000]
kdb version 2.1 by Scott Lurndal, Keith Owens.
Copyright SGI, All Rights Reserved
Dentry cache hash table entries: 131072 (order: 9, 2097152 bytes)
Inode cache hash table entries: 131072 (order: 9, 2097152 bytes)
Mount cache hash table entries: 256 (order: 0, 4096 bytes)
Buffer-cache hash table entries: 262144 (order: 9, 2097152 bytes)
Page-cache hash table entries: 262144 (order: 9, 2097152 bytes)
proc_ppc64: Creating /proc/ppc64/pmc
PCI: Creating ../proc/ppc64/pcifr
PCI: Creating ../proc/ppc64/pci
POSIX conformance testing by UNIFIX
Entering SMP Mode...
Probe found 2 CPUs
Waiting for 1 CPUs
Processor 1 found.
Waiting on wait_init_idle (map = 0x0)
All processors have done init_idle
PCI: Probing PCI hardware
ISA bridge at 00:10.0
PCI: Probing PCI hardware done
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
Initializing RT netlink socket
i/pSeries Real Time Clock Driver v1.1
RTAS daemon started
PPC64 nvram contains 262144 bytes
Starting kswapd
Journalled Block Device driver loaded
devfs: v1.12c (20020818) Richard Gooch (rgooch at atnf.csiro.au)
devfs: boot_options: 0x1
Installing knfsd (copyright (C) 1996 okir at monad.swb.de).
initialize_kbd: Keyboard reset failed, no ACK
Detected PS/2 Mouse Port.
pty: 256 Unix98 ptys configured
keyboard: Timeout - AT keyboard not present?(ed)
keyboard: Timeout - AT keyboard not present?(f4)
Serial driver version 5.05c (2001-07-08) with MANY_PORTS SHARE_IRQ SERIAL_PCI enabled
ttyS00 at 0x03f8 (irq = 4) is a 16550A
ttyS01 at 0x02f8 (irq = 3) is a 16550A
Floppy drive(s): fd0 is 2.88M
FDC 0 is a National Semiconductor PC87306
RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize
loop: loaded (max 8 devices)
Intel(R) PRO/1000 Network Driver - version 5.0.43-k1
Copyright (c) 1999-2003 Intel Corporation.
pcnet32.c:v1.27a 10.02.2002 tsbogend at alpha.franken.de
PCI: Enabling device 04:01.0 (0140 -> 0143)
pcnet32: PCnet/FAST 79C971 at 0x2ec00, warning: CSR address invalid,
    using instead PROM address of 00 06 29 6c b0 85
    tx_start_pt(0x0c00):~220 bytes, BCR18(6821):BurstWrEn NoUFlow
    SRAMSIZE=0x7f00, SRAM_BND=0x4000, assigned IRQ 36.
eth0: registered as PCnet/FAST 79C971
PCI: Device 11:01.0 not available because of resource collisions
pcnet32: failed to enable device -- err=-22
PCI: Enabling device 2a:01.0 (0140 -> 0143)
pcnet32: PCnet/FAST 79C971 at 0x291c00, warning: CSR address invalid,
    using instead PROM address of 00 06 29 6c 90 00
    tx_start_pt(0x0c00):~220 bytes, BCR18(6821):BurstWrEn NoUFlow
    SRAMSIZE=0x7f00, SRAM_BND=0x4000, assigned IRQ 58.
eth1: registered as PCnet/FAST 79C971
pcnet32: 2 cards_found.
Uniform Multi-Platform E-IDE driver Revision: 7.00beta4-2.4
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
SCSI subsystem driver Revision: 1.00
PCI: Enabling device 1d:01.0 (0140 -> 0143)
sym53c8xx: at PCI bus 29, device 1, function 0
sym53c8xx: setting PCI_COMMAND_MASTER...(fix-up)
sym53c8xx: 53c875 detected
PCI: Enabling device 24:01.0 (0140 -> 0143)
sym53c8xx: at PCI bus 36, device 1, function 0
sym53c8xx: setting PCI_COMMAND_MASTER...(fix-up)
sym53c8xx: 53c875 detected
PCI: Enabling device 01:01.0 (0140 -> 0143)
sym53c8xx: at PCI bus 1, device 1, function 0
sym53c8xx: setting PCI_COMMAND_MASTER...(fix-up)
sym53c8xx: setting PCI_COMMAND_INVALIDATE (fix-up)
sym53c8xx: 53c896 detected
PCI: Enabling device 01:01.1 (0140 -> 0143)
sym53c8xx: at PCI bus 1, device 1, function 1
sym53c8xx: setting PCI_COMMAND_MASTER...(fix-up)
sym53c8xx: setting PCI_COMMAND_INVALIDATE (fix-up)
sym53c8xx: 53c896 detected
sym53c875-0: rev 0x3 on pci bus 29 device 1 function 0 irq 54
sym53c875-0: ID 7, Fast-20, Parity Checking
sym53c875-1: rev 0x3 on pci bus 36 device 1 function 0 irq 56
sym53c875-1: ID 7, Fast-20, Parity Checking
sym53c896-2: rev 0x7 on pci bus
1 device 1 function 0 irq 35
sym53c896-2: ID 7, Fast-40, Parity Checking
sym53c896-2: handling phase mismatch from SCRIPTS.
sym53c896-3: rev 0x7 on pci bus 1 device 1 function 1 irq 34
sym53c896-3: ID 7, Fast-40, Parity Checking
sym53c896-3: handling phase mismatch from SCRIPTS.
scsi0 : sym53c8xx-1.7.3c-20010512
scsi1 : sym53c8xx-1.7.3c-20010512
scsi2 : sym53c8xx-1.7.3c-20010512
scsi3 : sym53c8xx-1.7.3c-20010512
sym53c875-0-<8,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
  Vendor: IBM Model: DMVS09D Rev: 0255
  Type: Direct-Access ANSI SCSI revision: 03
sym53c875-0-<9,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
  Vendor: IBM Model: DMVS09D Rev: 0255
  Type: Direct-Access ANSI SCSI revision: 03
sym53c875-0-<10,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
  Vendor: IBM Model: DMVS Rev: 0255
  Type: Direct-Access ANSI SCSI revision: 03
sym53c875-0-<11,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
  Vendor: IBM Model: DMVS Rev: 0100
  Type: Direct-Access ANSI SCSI revision: 03
sym53c875-0-<12,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
  Vendor: IBM Model: ST318305LC Rev: C549
  Type: Direct-Access ANSI SCSI revision: 03
sym53c875-0-<13,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
  Vendor: IBM Model: ST318305LC Rev: C507
  Type: Direct-Access ANSI SCSI revision: 03
  Vendor: IBM Model: HSBP06E RSU2SCSI Rev: B018
  Type: Enclosure ANSI SCSI revision: 02
sym53c875-1-<8,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
  Vendor: IBM Model: DMVS09D Rev: 0255
  Type: Direct-Access ANSI SCSI revision: 03
sym53c875-1-<9,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
  Vendor: IBM Model: DMVS09D Rev: 0255
  Type: Direct-Access ANSI SCSI revision: 03
sym53c875-1-<10,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
  Vendor: IBM Model: DMVS Rev: 0255
  Type: Direct-Access ANSI SCSI revision: 03
sym53c875-1-<11,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
  Vendor: IBM Model: DMVS Rev: 0255
  Type: Direct-Access ANSI SCSI revision: 03
sym53c875-1-<12,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
  Vendor: IBM Model: ST318305LC Rev: C507
  Type: Direct-Access ANSI SCSI revision: 03
sym53c875-1-<13,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
  Vendor: IBM Model: DMVS Rev: 0255
  Type: Direct-Access ANSI SCSI revision: 03
  Vendor: IBM Model: HSBP06E RSU2SCSI Rev: B018
  Type: Enclosure ANSI SCSI revision: 02
sym53c896-2-<0,*>: FAST-10 SCSI 10.0 MB/s (100.0 ns, offset 15)
  Vendor: ARCHIVE Model: IBM-STD224000N!D Rev: 7500
  Type: Sequential-Access ANSI SCSI revision: 02
  Vendor: IBM Model: CDRM00203 !K Rev: 1_03
  Type: CD-ROM ANSI SCSI revision: 02
st: Version 20020805, bufsize 32768, wrt 30720, max init. bufs 4, s/g segs 16
Attached scsi tape st0 at scsi2, channel 0, id 0, lun 0
Attached scsi disk sda at scsi0, channel 0, id 8, lun 0
Attached scsi disk sdb at scsi0, channel 0, id 9, lun 0
Attached scsi disk sdc at scsi0, channel 0, id 10, lun 0
Attached scsi disk sdd at scsi0, channel 0, id 11, lun 0
Attached scsi disk sde at scsi0, channel 0, id 12, lun 0
Attached scsi disk sdf at scsi0, channel 0, id 13, lun 0
Attached scsi disk sdg at scsi1, channel 0, id 8, lun 0
Attached scsi disk sdh at scsi1, channel 0, id 9, lun 0
Attached scsi disk sdi at scsi1, channel 0, id 10, lun 0
Attached scsi disk sdj at scsi1, channel 0, id 11, lun 0
Attached scsi disk sdk at scsi1, channel 0, id 12, lun 0
Attached scsi disk sdl at scsi1, channel 0, id 13, lun 0
SCSI device sda: 17774160 512-byte hdwr sectors (9100 MB)
Partition check:
 /dev/scsi/host0/bus0/target8/lun0: p1 p2
SCSI device sdb: 17774160 512-byte hdwr sectors (9100 MB)
 /dev/scsi/host0/bus0/target9/lun0: p1 p2
sdc: Spinning up disk..............ready
SCSI device sdc: 17774160 512-byte hdwr sectors (9100 MB)
 /dev/scsi/host0/bus0/target10/lun0: p1
sdd: Spinning up disk...............ready
SCSI device sdd: 17774160 512-byte hdwr sectors (9100 MB)
 /dev/scsi/host0/bus0/target11/lun0:
sym53c875-0-<12,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
sde:
Spinning up disk...<6>sym53c875-0-<12,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
..<6>sym53c875-0-<12,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
.<6>sym53c875-0-<12,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
.<6>sym53c875-0-<12,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
.<6>sym53c875-0-<12,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
ready
SCSI device sde: 17774160 512-byte hdwr sectors (9100 MB)
 /dev/scsi/host0/bus0/target12/lun0:
sym53c875-0-<13,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
sdf: Spinning up disk...<6>sym53c875-0-<13,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
..<6>sym53c875-0-<13,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
.<6>sym53c875-0-<13,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
.<6>sym53c875-0-<13,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
.<6>sym53c875-0-<13,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
ready
SCSI device sdf: 35548320 512-byte hdwr sectors (18201 MB)
 /dev/scsi/host0/bus0/target13/lun0:
SCSI device sdg: 17774160 512-byte hdwr sectors (9100 MB)
 /dev/scsi/host1/bus0/target8/lun0:
SCSI device sdh: 17774160 512-byte hdwr sectors (9100 MB)
 /dev/scsi/host1/bus0/target9/lun0:
sdi: Spinning up disk..............ready
SCSI device sdi: 17774160 512-byte hdwr sectors (9100 MB)
 /dev/scsi/host1/bus0/target10/lun0:
sdj: Spinning up disk...............ready
SCSI device sdj: 17774160 512-byte hdwr sectors (9100 MB)
 /dev/scsi/host1/bus0/target11/lun0:
sym53c875-1-<12,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
sdk: Spinning up disk...<6>sym53c875-1-<12,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
..<6>sym53c875-1-<12,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
.<6>sym53c875-1-<12,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
.<6>sym53c875-1-<12,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
.<6>sym53c875-1-<12,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
ready
SCSI device sdk: 35548320 512-byte
hdwr sectors (18201 MB)
 /dev/scsi/host1/bus0/target12/lun0:
sdl: Spinning up disk.................................................................................................not responding...
sdl : READ CAPACITY failed.
sdl : status = 1, message = 00, host = 0, driver = 28
Current sd00:00: sns = 70 2
ASC=4c ASCQ= 0
Raw sense data:0x70 0x00 0x02 0x00 0x00 0x00 0x00 0x18 0x00 0x00 0x00 0x00 0x4c 0x00 0x01 0x00 0x00 0x00 0x00 0x00 0x01 0x13 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
sdl : block size assumed to be 512 bytes, disk size 1GB.
 /dev/scsi/host1/bus0/target13/lun0:<6>Device 08:b0 not ready.
 I/O error: dev 08:b0, sector 0
Device 08:b0 not ready.
 I/O error: dev 08:b0, sector 0
 unable to read partition table
Attached scsi CD-ROM sr0 at scsi2, channel 0, id 1, lun 0
sym53c896-2-<1,*>: FAST-10 SCSI 10.0 MB/s (100.0 ns, offset 15)
sr0: scsi-1 drive
Uniform CD-ROM driver Revision: 3.12
Attached scsi generic sg6 at scsi0, channel 0, id 15, lun 0, type 13
Attached scsi generic sg13 at scsi1, channel 0, id 15, lun 0, type 13
NET4: Linux TCP/IP 1.0 for NET4.0
IP Protocols: ICMP, UDP, TCP, IGMP
IP: routing cache hash table of 16384 buckets, 256Kbytes
TCP: Hash tables configured (established 131072 bind 65536)
IPv4 over IPv4 tunneling driver
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
VFS: Mounted root (ext2 filesystem) readonly.
Mounted devfs on /dev
Freeing unused kernel memory: 500k init

Yannick Bertrand
From mdewand at redhat.com Thu Jan 22 00:30:37 2004
From: mdewand at redhat.com (Mark DeWandel)
Date: Wed, 21 Jan 2004 08:30:37 -0500
Subject: Comments concerning Enhanced Flash Patch
In-Reply-To: <1074552968.23918.105.camel@verve>
References: <1074552968.23918.105.camel@verve>
Message-ID: <20040121133037.GB11814@redhat.com>

John,

I have just one complaint about the memory leak fix in the attached
patch that you sent me yesterday. Calling remove_flash_pde() more than
once for a given proc_dir_entry can occur if the kmalloc() in
initialize_flash_pde_data() fails. If this happens, remove_flash_pde()
will try to free dp->data and remove the proc_dir_entry more than once.
One solution is to remove the call to remove_flash_pde() in
initialize_flash_pde_data() since it is guaranteed to be called in
rtas_flash_init() for the error path.

The consensus among Red Hat engineers is that the following issues still
must be resolved:

[1] In manage_flash() and validate_flash(), a bug in firmware could
    effectively become a busy wait if the return code from rtas_call()
    is consistently RTAS_RC_BUSY for a prolonged period of time. The
    check for pending signals in rtas_do_extended_delay() provides a
    bail-out of sorts in the blocking case but doesn't guarantee
    termination of the loop if there's a bug in firmware and no signal
    is posted. Even if this never becomes a real problem, providing a
    way out of this loop after some threshold certainly doesn't hurt
    anything. It's just good defensive programming.

[2] The need for mutual exclusion in the read/write paths is still a
    sticking point as well. The introduction of a semaphore to guard
    these paths is all that is being requested.

Can we get a patch which includes this?

--
Mark DeWandel
Red Hat, Inc.
(978) 692-3113 ext.
23252 -------------- next part -------------- diff -X /home/johnrose/tmp/diffignore.txt -urpN /usr/src/linux-2.4.21-6.EL/arch/ppc64/kernel/ppc_ksyms.c ./EL_ef/arch/ppc64/kernel/ppc_ksyms.c --- linux-2.4.21-6.EL/arch/ppc64/kernel/ppc_ksyms.c 2003-12-09 13:42:04.000000000 -0600 +++ ./EL_ef/arch/ppc64/kernel/ppc_ksyms.c 2004-01-15 15:31:47.000000000 -0600 @@ -266,6 +266,7 @@ EXPORT_SYMBOL(rtas_token); EXPORT_SYMBOL(rtas_call); EXPORT_SYMBOL(rtas_data_buf); EXPORT_SYMBOL(rtas_data_buf_lock); +EXPORT_SYMBOL(rtas_do_extended_delay); #endif #ifndef CONFIG_PPC_ISERIES diff -X /home/johnrose/tmp/diffignore.txt -urpN /usr/src/linux-2.4.21-6.EL/arch/ppc64/kernel/rtas.c ./EL_ef/arch/ppc64/kernel/rtas.c --- linux-2.4.21-6.EL/arch/ppc64/kernel/rtas.c 2003-12-09 13:41:30.000000000 -0600 +++ ./EL_ef/arch/ppc64/kernel/rtas.c 2004-01-19 16:13:36.000000000 -0600 @@ -184,6 +184,35 @@ rtas_call(int token, int nargs, int nret return (ulong)((nret > 0) ? rtas_args->rets[0] : 0); } +/* Given an RTAS status code of 990n perform the hinted delay of 10^n + * (last digit) milliseconds. For now we bound at n=5 (100 secs). 
+ */ +int +rtas_do_extended_delay(int status) +{ + int order = status - 9900; + unsigned long ms; + unsigned long jiffies; + + if (order < 0) + order = 0; /* RTC depends on this for -2 clock busy */ + else if (order > 5) + order = 5; /* bound */ + + /* Use microseconds for reasonable accuracy */ + for (ms=1; order > 0; order--) + ms *= 10; + + jiffies = (ms * HZ) / 1000; + + set_current_state(TASK_INTERRUPTIBLE); + schedule_timeout(jiffies); + if (signal_pending(current)) + return RTAS_DELAY_INTR; + + return 0; +} + #define FLASH_BLOCK_LIST_VERSION (1UL) static void rtas_flash_firmware(void) diff -X /home/johnrose/tmp/diffignore.txt -urpN /usr/src/linux-2.4.21-6.EL/arch/ppc64/kernel/rtas_flash.c ./EL_ef/arch/ppc64/kernel/rtas_flash.c --- linux-2.4.21-6.EL/arch/ppc64/kernel/rtas_flash.c 2002-11-28 17:53:11.000000000 -0600 +++ ./EL_ef/arch/ppc64/kernel/rtas_flash.c 2004-01-19 17:28:16.000000000 -0600 @@ -24,7 +24,56 @@ #define MODULE_VERSION "1.0" #define MODULE_NAME "rtas_flash" -#define FIRMWARE_FLASH_NAME "firmware_flash" +#define FIRMWARE_FLASH_NAME "firmware_flash" +#define FIRMWARE_UPDATE_NAME "firmware_update" +#define MANAGE_FLASH_NAME "manage_flash" +#define VALIDATE_FLASH_NAME "validate_flash" + +/* General RTAS Status Codes */ +#define RTAS_RC_SUCCESS 0 +#define RTAS_RC_HW_ERR -1 +#define RTAS_RC_BUSY -2 + +/* Interrupted RTAS operation */ +#define RTAS_INTR -1098 + +/* Flash image status values */ +#define FLASH_AUTH -9002 /* RTAS Not Service Authority Partition */ +#define FLASH_NO_OP -1099 /* No operation initiated by user */ +#define FLASH_IMG_SHORT -1005 /* Flash image shorter than expected */ +#define FLASH_IMG_BAD_LEN -1004 /* Bad length value in flash list block */ +#define FLASH_IMG_NULL_DATA -1003 /* Bad data value in flash list block */ +#define FLASH_IMG_READY 0 /* Firmware img ready for flash on reboot */ + +/* Manage image status values */ +#define MANAGE_AUTH -9002 /* RTAS Not Service Authority Partition */ +#define MANAGE_ACTIVE_ERR -9001 
/* RTAS Cannot Overwrite Active Img */ +#define MANAGE_NO_OP -1099 /* No operation initiated by user */ +#define MANAGE_PARAM_ERR -3 /* RTAS Parameter Error */ +#define MANAGE_HW_ERR -1 /* RTAS Hardware Error */ + +/* Validate image status values */ +#define VALIDATE_AUTH -9002 /* RTAS Not Service Authority Partition */ +#define VALIDATE_NO_OP -1099 /* No operation initiated by the user */ +#define VALIDATE_INCOMPLETE -1002 /* User copied < VALIDATE_BUF_SIZE */ +#define VALIDATE_READY -1001 /* Firmware image ready for validation */ +#define VALIDATE_PARAM_ERR -3 /* RTAS Parameter Error */ +#define VALIDATE_HW_ERR -1 /* RTAS Hardware Error */ +#define VALIDATE_TMP_UPDATE 0 /* Validate Return Status */ +#define VALIDATE_FLASH_AUTH 1 /* Validate Return Status */ +#define VALIDATE_INVALID_IMG 2 /* Validate Return Status */ +#define VALIDATE_CUR_UNKNOWN 3 /* Validate Return Status */ +#define VALIDATE_TMP_COMMIT_DL 4 /* Validate Return Status */ +#define VALIDATE_TMP_COMMIT 5 /* Validate Return Status */ +#define VALIDATE_TMP_UPDATE_DL 6 /* Validate Return Status */ + +/* ibm,manage-flash-image operation tokens */ +#define RTAS_REJECT_TMP_IMG 0 +#define RTAS_COMMIT_TMP_IMG 1 + +/* Array sizes */ +#define VALIDATE_BUF_SIZE 4096 +#define RTAS_MSG_MAXLEN 64 /* Local copy of the flash block list. * We only allow one open of the flash proc file and create this @@ -36,21 +85,35 @@ * is treated as the number of entries currently in the block * (i.e. not a byte count). This is all fixed on release. 
*/ -static struct flash_block_list *flist; -static char *flash_msg; -static int flash_possible; - -static int rtas_flash_open(struct inode *inode, struct file *file) -{ - if ((file->f_mode & FMODE_WRITE) && flash_possible) { - if (flist) - return -EBUSY; - flist = (struct flash_block_list *)get_free_page(GFP_KERNEL); - if (!flist) - return -ENOMEM; - } - return 0; -} + +/* Status int must be first member of struct */ +struct rtas_update_flash_t +{ + int status; /* Flash update status */ + struct flash_block_list *flist; /* Local copy of flash block list */ +}; + +/* Status int must be first member of struct */ +struct rtas_manage_flash_t +{ + int status; /* Returned status */ + unsigned int op; /* Reject or commit image */ +}; + +/* Status int must be first member of struct */ +struct rtas_validate_flash_t +{ + int status; /* Returned status */ + char buf[VALIDATE_BUF_SIZE]; /* Candidate image buffer */ + unsigned int buf_size; /* Size of image buf */ + unsigned int update_results; /* Update results token */ +}; + +static spinlock_t flash_file_open_lock = SPIN_LOCK_UNLOCKED; +static struct proc_dir_entry *firmware_flash_pde = NULL; +static struct proc_dir_entry *firmware_update_pde = NULL; +static struct proc_dir_entry *validate_pde = NULL; +static struct proc_dir_entry *manage_pde = NULL; /* Do simple sanity checks on the flash image. */ static int flash_list_valid(struct flash_block_list *flist) @@ -59,32 +122,27 @@ static int flash_list_valid(struct flash int i; unsigned long block_size, image_size; - flash_msg = NULL; /* Paranoid self test here. We also collect the image size. 
*/ image_size = 0; for (f = flist; f; f = f->next) { for (i = 0; i < f->num_blocks; i++) { if (f->blocks[i].data == NULL) { - flash_msg = "error: internal error null data\n"; - return 0; + return FLASH_IMG_NULL_DATA; } block_size = f->blocks[i].length; if (block_size <= 0 || block_size > PAGE_SIZE) { - flash_msg = "error: internal error bad length\n"; - return 0; + return FLASH_IMG_BAD_LEN; } image_size += block_size; } } - if (image_size < (256 << 10)) { - if (image_size < 2) - flash_msg = NULL; /* allow "clear" of image */ - else - flash_msg = "error: flash image short\n"; - return 0; - } + + if (image_size < 2) + return FLASH_NO_OP; + printk(KERN_INFO "FLASH: flash image with %ld bytes stored for hardware flash on reboot\n", image_size); - return 1; + + return FLASH_IMG_READY; } static void free_flash_list(struct flash_block_list *f) @@ -103,56 +161,91 @@ static void free_flash_list(struct flash static int rtas_flash_release(struct inode *inode, struct file *file) { - if (flist) { - /* Always clear saved list on a new attempt. 
*/ + struct proc_dir_entry *dp = file->f_dentry->d_inode->u.generic_ip; + struct rtas_update_flash_t *uf; + + uf = (struct rtas_update_flash_t *) dp->data; + if (uf->flist) { + /* File was opened in write mode for a new flash attempt */ + /* Clear saved list */ if (rtas_firmware_flash_list.next) { free_flash_list(rtas_firmware_flash_list.next); rtas_firmware_flash_list.next = NULL; } - if (flash_list_valid(flist)) - rtas_firmware_flash_list.next = flist; + if (uf->status != FLASH_AUTH) + uf->status = flash_list_valid(uf->flist); + + if (uf->status == FLASH_IMG_READY) + rtas_firmware_flash_list.next = uf->flist; else - free_flash_list(flist); - flist = NULL; + free_flash_list(uf->flist); + + uf->flist = NULL; } + + atomic_dec(&dp->count); return 0; } +static int get_flash_status_msg(int status, char *buf, int size) +{ + int len; + + switch (status) { + case FLASH_AUTH: + len = snprintf(buf, size, "error: this partition does not have service authority\n"); + break; + case FLASH_NO_OP: + len = snprintf(buf, size, "info: no firmware image for flash\n"); + break; + case FLASH_IMG_SHORT: + len = snprintf(buf, size, "error: flash image short\n"); + break; + case FLASH_IMG_BAD_LEN: + len = snprintf(buf, size, "error: internal error bad length\n"); + break; + case FLASH_IMG_NULL_DATA: + len = snprintf(buf, size, "error: internal error null data\n"); + break; + case FLASH_IMG_READY: + len = snprintf(buf, size, "ready: firmware image ready for flash on reboot\n"); + break; + default: + len = snprintf(buf, size, "error: unexpected status value %d\n", status); + break; + } + + return len >= size ? 
size-1 : len; +} + /* Reading the proc file will show status (not the firmware contents) */ static ssize_t rtas_flash_read(struct file *file, char *buf, size_t count, loff_t *ppos) { - int error; - char *msg; + struct proc_dir_entry *dp = file->f_dentry->d_inode->u.generic_ip; + struct rtas_update_flash_t *uf; + char msg[RTAS_MSG_MAXLEN]; int msglen; - if (!flash_possible) { - msg = "error: this partition does not have service authority\n"; - } else if (flist) { - msg = "info: this file is busy for write by some process\n"; - } else if (flash_msg) { - msg = flash_msg; /* message from last flash attempt */ - } else if (rtas_firmware_flash_list.next) { - msg = "ready: firmware image ready for flash on reboot\n"; - } else { - msg = "info: no firmware image for flash\n"; + uf = (struct rtas_update_flash_t *) dp->data; + + if (!strcmp(dp->name, FIRMWARE_FLASH_NAME)) { + msglen = get_flash_status_msg(uf->status, msg, RTAS_MSG_MAXLEN); + } else { /* FIRMWARE_UPDATE_NAME */ + msglen = sprintf(msg, "%d\n", uf->status); } - msglen = strlen(msg); + + if (*ppos >= msglen) + return 0; + msglen -= *ppos; if (msglen > count) msglen = count; - if (ppos && *ppos != 0) - return 0; /* be cheap */ - - error = verify_area(VERIFY_WRITE, buf, msglen); - if (error) - return -EINVAL; - - copy_to_user(buf, msg, msglen); + if (copy_to_user(buf, msg + (*ppos), msglen)) + return -EFAULT; + *ppos += msglen; - if (ppos) - *ppos = msglen; return msglen; } @@ -164,14 +257,28 @@ static ssize_t rtas_flash_read(struct fi static ssize_t rtas_flash_write(struct file *file, const char *buffer, size_t count, loff_t *off) { - size_t len = count; + struct proc_dir_entry *dp = file->f_dentry->d_inode->u.generic_ip; + struct rtas_update_flash_t *uf; char *p; int next_free; - struct flash_block_list *fl = flist; + struct flash_block_list *fl; + + uf = (struct rtas_update_flash_t *) dp->data; + + if (uf->status == FLASH_AUTH || count == 0) + return count; /* discard data */ - if (!flash_possible || len == 0) - 
return len; /* discard data */ + /* In the case that the image is not ready for flashing, the memory + * allocated for the block list will be freed upon the release of the + * proc file + */ + if (uf->flist == NULL) { + uf->flist = (struct flash_block_list *) get_free_page(GFP_KERNEL); + if (!uf->flist) + return -ENOMEM; + } + fl = uf->flist; while (fl->next) fl = fl->next; /* seek to last block_list for append */ next_free = fl->num_blocks; @@ -184,55 +291,409 @@ static ssize_t rtas_flash_write(struct f next_free = 0; } - if (len > PAGE_SIZE) - len = PAGE_SIZE; + if (count > PAGE_SIZE) + count = PAGE_SIZE; p = (char *)get_free_page(GFP_KERNEL); if (!p) return -ENOMEM; - if(copy_from_user(p, buffer, len)) { + + if(copy_from_user(p, buffer, count)) { free_page((unsigned long)p); return -EFAULT; } fl->blocks[next_free].data = p; - fl->blocks[next_free].length = len; + fl->blocks[next_free].length = count; fl->num_blocks++; - return len; + return count; +} + +static int rtas_excl_open(struct inode *inode, struct file *file) +{ + struct proc_dir_entry *dp = file->f_dentry->d_inode->u.generic_ip; + + /* Enforce exclusive open with use count of PDE */ + spin_lock(&flash_file_open_lock); + if (atomic_read(&dp->count) > 1) { + spin_unlock(&flash_file_open_lock); + return -EBUSY; + } + + atomic_inc(&dp->count); + spin_unlock(&flash_file_open_lock); + + return 0; +} + +static int rtas_excl_release(struct inode *inode, struct file *file) +{ + struct proc_dir_entry *dp = file->f_dentry->d_inode->u.generic_ip; + + atomic_dec(&dp->count); + + return 0; +} + +static void manage_flash(struct rtas_manage_flash_t *args_buf) +{ + s32 delay_rc; + s32 rc; + + while (1) { + rc = (s32) rtas_call(rtas_token("ibm,manage-flash-image"), 1, + 1, NULL, (long) args_buf->op); + if (rc == RTAS_RC_BUSY) + udelay(1); + else if (rtas_is_extended_busy(rc)) { + if ((delay_rc = rtas_do_extended_delay(rc))) { + /* Delay interrupted */ + args_buf->status = delay_rc; + break; + } + } else { + 
args_buf->status = rc; + break; + } + } +} + +static ssize_t manage_flash_read(struct file *file, char *buf, + size_t count, loff_t *ppos) +{ + struct proc_dir_entry *dp = file->f_dentry->d_inode->u.generic_ip; + struct rtas_manage_flash_t *args_buf; + char msg[RTAS_MSG_MAXLEN]; + int msglen; + + args_buf = (struct rtas_manage_flash_t *) dp->data; + if (args_buf == NULL) + return 0; + + msglen = sprintf(msg, "%d\n", args_buf->status); + if (*ppos >= msglen) + return 0; + + msglen -= *ppos; + if (msglen > count) + msglen = count; + + if (copy_to_user(buf, msg + (*ppos), msglen)) + return -EFAULT; + *ppos += msglen; + + return msglen; +} + +static ssize_t manage_flash_write(struct file *file, const char *buf, + size_t count, loff_t *off) +{ + struct proc_dir_entry *dp = file->f_dentry->d_inode->u.generic_ip; + struct rtas_manage_flash_t *args_buf; + const char reject_str[] = "0"; + const char commit_str[] = "1"; + char msg[RTAS_MSG_MAXLEN]; + int op; + + args_buf = (struct rtas_manage_flash_t *) dp->data; + if ((args_buf->status == MANAGE_AUTH) || (count == 0)) + return count; + + if (count > RTAS_MSG_MAXLEN) + count = RTAS_MSG_MAXLEN; + if (copy_from_user(msg, buf, count)) + return -EFAULT; + + if (strncmp(buf, reject_str, strlen(reject_str)) == 0) + op = RTAS_REJECT_TMP_IMG; + else if (strncmp(buf, commit_str, strlen(commit_str)) == 0) + op = RTAS_COMMIT_TMP_IMG; + else + return -EINVAL; + + args_buf->op = op; + manage_flash(args_buf); + *off += count; + + return count; +} + +static void validate_flash(struct rtas_validate_flash_t *args_buf) +{ + int token = rtas_token("ibm,validate-flash-image"); + unsigned int wait_time; + long update_results; + s32 delay_rc; + s32 rc; + + rc = 0; + while(1) { + spin_lock(&rtas_data_buf_lock); + memcpy(rtas_data_buf, args_buf->buf, VALIDATE_BUF_SIZE); + rc = (s32) rtas_call(token, 2, 2, &update_results, + __pa(rtas_data_buf), args_buf->buf_size); + memcpy(args_buf->buf, rtas_data_buf, VALIDATE_BUF_SIZE); + 
spin_unlock(&rtas_data_buf_lock); + + if (rc == RTAS_RC_BUSY) + udelay(1); + else if (rtas_is_extended_busy(rc)) { + if ((delay_rc = rtas_do_extended_delay(rc))) { + /* Delay interrupted */ + args_buf->status = delay_rc; + break; + } + } else { + args_buf->status = rc; + args_buf->update_results = (u32) update_results; + break; + } + } +} + +static int get_validate_flash_msg(struct rtas_validate_flash_t *args_buf, + char *msg, int size) +{ + int n; + + if (args_buf->status >= VALIDATE_TMP_UPDATE) { + n = snprintf(msg, size, "%u\n", args_buf->update_results); + if ((args_buf->update_results >= VALIDATE_CUR_UNKNOWN) || + (args_buf->update_results == VALIDATE_TMP_UPDATE)) + n += snprintf(msg + n, size - n, "%s\n", args_buf->buf); + } else { + n = snprintf(msg, size, "%d\n", args_buf->status); + } + + return n >= size ? size - 1 : n; +} + +static ssize_t validate_flash_read(struct file *file, char *buf, + size_t count, loff_t *ppos) +{ + struct proc_dir_entry *dp = file->f_dentry->d_inode->u.generic_ip; + struct rtas_validate_flash_t *args_buf; + char msg[RTAS_MSG_MAXLEN]; + int msglen; + + args_buf = (struct rtas_validate_flash_t *) dp->data; + + msglen = get_validate_flash_msg(args_buf, msg, RTAS_MSG_MAXLEN); + + if (*ppos >= msglen) + return 0; + + msglen -= *ppos; + if (msglen > count) + msglen = count; + + if (copy_to_user(buf, msg + (*ppos), msglen)) + return -EFAULT; + *ppos += msglen; + + return msglen; +} + +static ssize_t validate_flash_write(struct file *file, const char *buf, + size_t count, loff_t *off) +{ + struct proc_dir_entry *dp = file->f_dentry->d_inode->u.generic_ip; + struct rtas_validate_flash_t *args_buf; + + args_buf = (struct rtas_validate_flash_t *) dp->data; + + if (dp->data == NULL) { + dp->data = kmalloc(sizeof(struct rtas_validate_flash_t), + GFP_KERNEL); + if (dp->data == NULL) + return -ENOMEM; + } + + /* We are only interested in the first 4K of the + * candidate image */ + if ((*off >= VALIDATE_BUF_SIZE) || + (args_buf->status == 
VALIDATE_AUTH)) { + *off += count; + return count; + } + + if (*off + count >= VALIDATE_BUF_SIZE) { + count = VALIDATE_BUF_SIZE - *off; + args_buf->status = VALIDATE_READY; + } else { + args_buf->status = VALIDATE_INCOMPLETE; + } + + if (copy_from_user(args_buf->buf + *off, buf, count)) + return -EFAULT; + *off += count; + + return count; +} + +static int validate_flash_release(struct inode *inode, struct file *file) +{ + struct proc_dir_entry *dp = file->f_dentry->d_inode->u.generic_ip; + struct rtas_validate_flash_t *args_buf; + + args_buf = (struct rtas_validate_flash_t *) dp->data; + + if (args_buf->status == VALIDATE_READY) { + args_buf->buf_size = VALIDATE_BUF_SIZE; + validate_flash(args_buf); + } + + atomic_dec(&dp->count); + + return 0; +} + +static inline void remove_flash_pde(struct proc_dir_entry *dp) +{ + if (dp) { + if (dp->data != NULL) + kfree(dp->data); + remove_proc_entry(dp->name, rtas_proc_dir); + } +} + +static inline int initialize_flash_pde_data(const char *rtas_call_name, + size_t buf_size, + struct proc_dir_entry *dp) +{ + int *status; + int token; + + dp->data = kmalloc(buf_size, GFP_KERNEL); + if (dp->data == NULL) { + remove_flash_pde(dp); + return -ENOMEM; + } + + memset(dp->data, 0, buf_size); + + /* This code assumes that the status int is the first member of the + * struct + */ + status = (int *) dp->data; + token = rtas_token(rtas_call_name); + if (token == RTAS_UNKNOWN_SERVICE) + *status = FLASH_AUTH; + else + *status = FLASH_NO_OP; + + return 0; +} + +static inline struct proc_dir_entry * create_flash_pde(const char *filename, + struct file_operations *fops) +{ + struct proc_dir_entry *ent = NULL; + + ent = create_proc_entry(filename, S_IRUSR | S_IWUSR, rtas_proc_dir); + if (ent != NULL) { + ent->nlink = 1; + ent->proc_fops = fops; + ent->owner = THIS_MODULE; + } + + return ent; } static struct file_operations rtas_flash_operations = { read: rtas_flash_read, write: rtas_flash_write, - open: rtas_flash_open, + open: rtas_excl_open, 
release: rtas_flash_release, }; +static struct file_operations manage_flash_operations = { + read: manage_flash_read, + write: manage_flash_write, + open: rtas_excl_open, + release: rtas_excl_release, +}; + +static struct file_operations validate_flash_operations = { + read: validate_flash_read, + write: validate_flash_write, + open: rtas_excl_open, + release: validate_flash_release, +}; + +#define CHECK_PDE_CREATE(_pdevar, _rcvar, _label) \ + if (!_pdevar) { \ + _rcvar = -ENOMEM; \ + goto _label; \ + } + +#define CHECK_RC(_rc, _label) \ + if (_rc != 0) \ + goto _label; int __init rtas_flash_init(void) { - struct proc_dir_entry *ent = NULL; + int rc; if (!rtas_proc_dir) { - printk(KERN_WARNING "rtas proc dir does not already exist"); + printk(KERN_WARNING "%s: rtas proc dir does not already exist", + __FUNCTION__); return -ENOENT; } - if (rtas_token("ibm,update-flash-64-and-reboot") != RTAS_UNKNOWN_SERVICE) - flash_possible = 1; - - if ((ent = create_proc_entry(FIRMWARE_FLASH_NAME, S_IRUSR | S_IWUSR, rtas_proc_dir)) != NULL) { - ent->nlink = 1; - ent->proc_fops = &rtas_flash_operations; - ent->owner = THIS_MODULE; + firmware_flash_pde = create_flash_pde(FIRMWARE_FLASH_NAME, + &rtas_flash_operations); + CHECK_PDE_CREATE(firmware_flash_pde, rc, done); + + rc = initialize_flash_pde_data("ibm,update-flash-64-and-reboot", + sizeof(struct rtas_update_flash_t), + firmware_flash_pde); + CHECK_RC(rc, done); + + firmware_update_pde = create_flash_pde(FIRMWARE_UPDATE_NAME, + &rtas_flash_operations); + CHECK_PDE_CREATE(firmware_update_pde, rc, done); + + rc = initialize_flash_pde_data("ibm,update-flash-64-and-reboot", + sizeof(struct rtas_update_flash_t), + firmware_update_pde); + CHECK_RC(rc, done); + + validate_pde = create_flash_pde(VALIDATE_FLASH_NAME, + &validate_flash_operations); + CHECK_PDE_CREATE(validate_pde, rc, done); + + rc = initialize_flash_pde_data("ibm,validate-flash-image", + sizeof(struct rtas_validate_flash_t), + validate_pde); + CHECK_RC(rc, done); + + 
manage_pde = create_flash_pde(MANAGE_FLASH_NAME, + &manage_flash_operations); + CHECK_PDE_CREATE(manage_pde, rc, done); + + rc = initialize_flash_pde_data("ibm,manage-flash-image", + sizeof(struct rtas_manage_flash_t), + manage_pde); +done: + if (rc != 0) { + remove_flash_pde(firmware_flash_pde); + remove_flash_pde(firmware_update_pde); + remove_flash_pde(validate_pde); + remove_flash_pde(manage_pde); } - return 0; + + return rc; } void __exit rtas_flash_cleanup(void) { if (!rtas_proc_dir) return; - remove_proc_entry(FIRMWARE_FLASH_NAME, rtas_proc_dir); + + remove_flash_pde(firmware_flash_pde); + remove_flash_pde(firmware_update_pde); + remove_flash_pde(validate_pde); + remove_flash_pde(manage_pde); } module_init(rtas_flash_init); diff -X /home/johnrose/tmp/diffignore.txt -urpN /usr/src/linux-2.4.21-6.EL/include/asm-ppc64/rtas.h ./EL_ef/include/asm-ppc64/rtas.h --- linux-2.4.21-6.EL/include/asm-ppc64/rtas.h 2003-12-09 13:41:33.000000000 -0600 +++ ./EL_ef/include/asm-ppc64/rtas.h 2004-01-19 17:28:45.000000000 -0600 @@ -24,6 +24,9 @@ #define MAX_ERRINJCT_TOKENS 8 /* Max # tokens. */ #define WORKSPACE_SIZE 1024 +/* Extended Delay Interrupted by Signal */ +#define RTAS_DELAY_INTR -1098 + /* * In general to call RTAS use rtas_token("string") to lookup * an RTAS token for the given string (e.g. "event-scan"). @@ -182,6 +185,13 @@ extern int rtas_errinjct_close(unsigned extern struct proc_dir_entry *rtas_proc_dir; extern struct errinjct_token ei_token_list[MAX_ERRINJCT_TOKENS]; +/* Given an RTAS status code of 9900..9905 compute the hinted delay */ +extern int rtas_do_extended_delay(int status); +static inline int rtas_is_extended_busy(int status) +{ + return status >= 9900 && status <= 9905; +} + extern void pSeries_log_error(char *buf, unsigned int err_type, int fatal); /* Error types logged. 
*/ From olof at austin.ibm.com Thu Jan 22 15:47:01 2004 From: olof at austin.ibm.com (olof at austin.ibm.com) Date: Wed, 21 Jan 2004 22:47:01 -0600 (CST) Subject: [2.4] SLB noloop patch In-Reply-To: <40082B40.4070808@austin.ibm.com> Message-ID: On Fri, 16 Jan 2004, Olof Johansson wrote: > The 2.5 equivalent of this patch got baked into Anton's big SLB rewrite. > There seems to be less interest to bring the bigger rewrite back to 2.4, > but the noloop stuff is still a valuable enhancement (and smaller in scope). Julie Dewandel found a glaring error in the previous patch. Here's an incremental diff, I'll push it to BK in the morning. It's been tested quite a bit for the last few days so I'm quite sure it's correct now. -Olof Olof Johansson Office: 4E002/905 Linux on Power Development IBM Systems Group Email: olof at austin.ibm.com Phone: 512-838-9858 All opinions are my own and not those of IBM -------------- next part -------------- ===== arch/ppc64/kernel/head.S 1.21 vs edited ===== --- 1.21/arch/ppc64/kernel/head.S Mon Jan 19 11:23:02 2004 +++ edited/arch/ppc64/kernel/head.S Tue Jan 20 13:36:05 2004 @@ -1225,8 +1225,9 @@ mulld r20,r20,r21 clrldi r20,r20,28 /* r20 = vsid */ - /* No free entry - just take the next entry, round-robin */ - /* XXX we should get the number of SLB entries from the naca */ + /* No searching for free entries, just take the next + * entry round-robin + */ SLB_NUM_ENTRIES = 64 2: mfspr r21,SPRG3 ld r22,PACASTABRR(r21) @@ -1250,17 +1251,15 @@ * for the kernel stack during the first part of exception exit * which gets invalidated due to a tlbie from another cpu at a * non recoverable point (after setting srr0/1) - Anton - */ - slbmfee r23,r22 - srdi r23,r23,28 - /* + * * This is incorrect (r1 is not the kernel stack) if we entered * from userspace but there is no critical window from userspace * so this should be OK. Also if we cast out the userspace stack * segment while in userspace we will fault it straight back in. 
*/ - srdi r21,r1,28 - cmpd r21,r23 + xor r23,r1,r21 + srdi r23,r23,28 + cmpdi r23,0 beq- 2b /* Invalidate the old entry */ From segher at kernel.crashing.org Thu Jan 22 19:39:59 2004 From: segher at kernel.crashing.org (Segher Boessenkool) Date: Thu, 22 Jan 2004 09:39:59 +0100 Subject: autoconsole In-Reply-To: <1074548488.10585.35.camel@gaston> References: <20040113185245.A1747@w-mikek2.beaverton.ibm.com> <20040114080749.GA374@suse.de> <080D5891-46A6-11D8-9E58-000A95A0560C@us.ibm.com> <20040115011808.GD27924@krispykreme> <20040115085849.A1808@w-mikek2.beaverton.ibm.com> <20040115170631.GA22399@suse.de> <3238B645-477F-11D8-A8C3-000A95A4DC02@kernel.crashing.org> <20040118141742.GE6293@krispykreme> <1074504073.814.56.camel@gaston> <20040119090418.B1802@w-mikek2.beaverton.ibm.com> <3AC0E7B4-4AA7-11D8-9C4D-000A95A4DC02@kernel.crashing.org> <1074548488.10585.35.camel@gaston> Message-ID: <902F6216-4CB6-11D8-90BE-000A95A4DC02@kernel.crashing.org> >> 3f8, 2f8 are just the legacy x86 i/o addresses for the first and >> second serial ports; I assume 898, 890 are the CHRP standardized >> addresses for the third and fourth? > > can't you use the OF node names instead ? Sure: /some_host_bridge/some_pci/some_faked_isa/NS16450 at 3f8 /some_host_bridge/some_pci/some_faked_isa/NS16450 at 2f8 /some_host_bridge/some_pci/some_faked_isa/NS16450 at 898 /some_host_bridge/some_pci/some_faked_isa/NS16450 at 890 I don't think they're actually called this in any current Open Firmware implementation, though; not even serial at 3f8 or something like that (although they _should_ be!) But I suspect this is not what you meant? I don't know what you did mean then, though. Segher ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/

From rod at thalescomputers.fr Thu Jan 22 21:41:34 2004
From: rod at thalescomputers.fr (Régis Odeyé)
Date: Thu, 22 Jan 2004 11:41:34 +0100
Subject: MPIC Timer of U3
Message-ID: <400FA8DE.8080603@thalescomputers.fr>

Hi,

I'm trying to use the MPIC Timer sub-module of the U3 (host bridge on
JS20). First of all, I did not see any API in the kernel except for the
definition of the MPIC structure in open_pic_defs.h, so I added a small
(dynamically loaded) module that remaps the MPIC structure of the U3
through an ioremap call. The mapping seems to be OK because I read the
Vendor ID properly and I can read and write the interrupt part of the
MPIC (the Source part of the MPIC structure). Unfortunately, the Timer
Frequency and the Timers do not seem to be programmable (they always
read back as 0x0).

Has anybody experienced this behaviour with the MPIC Timer of the U3?

Regards.

--
Régis Odeyé
Thales Computers, a Thales company.
www.thalescomputers.com
E-mail: rod at thalescomputers.fr
Tel: +33 (0)4 98 16 34 86 - Fax: +33 (0)4 98 16 34 01

From moilanen at austin.ibm.com Fri Jan 23 01:28:10 2004
From: moilanen at austin.ibm.com (Jake Moilanen)
Date: Thu, 22 Jan 2004 08:28:10 -0600
Subject: [PATCH][2.6] Nested Interrupt support
In-Reply-To: <20040119042022.GA20834@krispykreme>
References: <1074094346.2389.42.camel@magik> <20040119042022.GA20834@krispykreme>
Message-ID: <1074781690.23288.571.camel@magik>

> Looks good. Could we use per cpu data here (do we init per cpu data
> before the xics setup)? Also I'm wondering if we should have a quick
> check for overflow of the buffer.
>

Here's the patch using per cpu data for the irq stack.
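For readers skimming the thread, the bookkeeping behind "nested interrupt support" can be sketched in plain user-space C. This is only an illustration under stated assumptions, not the kernel code: the helper names irq_stack_push(), irq_stack_pop() and xirr_eoi_value() are made up here, the stack is passed explicitly rather than being per-CPU data, and the underflow check in the pop helper is an addition that the patch itself does not have. The idea is that the priority in force before an interrupt was accepted is saved on a small stack and restored at end-of-interrupt, instead of always writing 0xff.

```c
#include <assert.h>

#define IRQ_DEPTH   2      /* same nesting bound the patch uses */
#define PRIO_LOWEST 0xffu  /* least-favoured priority: accept everything */

struct cpu_irq_stack {
	int depth;                         /* number of saved entries */
	unsigned int priority[IRQ_DEPTH];  /* priority in force before each irq */
	int irq[IRQ_DEPTH];                /* irq number for each level */
};

/*
 * Interrupt accepted: remember the priority that was in force before it.
 * Returns 0 on success, -1 if the stack is full (the patch panics instead).
 */
static int irq_stack_push(struct cpu_irq_stack *s, int irq,
			  unsigned int old_priority)
{
	assert(irq >= 0);
	if (s->depth >= IRQ_DEPTH)
		return -1;
	s->priority[s->depth] = old_priority;
	s->irq[s->depth] = irq;
	s->depth++;
	return 0;
}

/*
 * End of interrupt: return the priority to restore.
 * The empty-stack fallback is an assumption added for this sketch.
 */
static unsigned int irq_stack_pop(struct cpu_irq_stack *s)
{
	if (s->depth == 0)
		return PRIO_LOWEST;
	s->depth--;
	return s->priority[s->depth];
}

/* Value written back at EOI: priority in the top byte, source in the low 24 bits. */
static unsigned int xirr_eoi_value(unsigned int priority, unsigned int source)
{
	return (priority << 24) | (source & 0x00ffffffu);
}
```

With two levels pushed, the pops return the saved priorities in reverse order, so an interrupted lower-priority handler resumes with its own priority rather than with everything unmasked.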
Thanks, Jake -------------- next part -------------- # This is a BitKeeper generated patch for the following project: # Project Name: Linux kernel tree # This patch format is intended for GNU patch command version 2.5 or higher. # This patch includes the following deltas: # ChangeSet 1.1393 -> 1.1394 # arch/ppc64/kernel/irq.c 1.54 -> 1.55 # arch/ppc64/kernel/xics.c 1.38 -> 1.39 # # The following is the BitKeeper ChangeSet Log # -------------------------------------------- # 04/01/22 moilanen at threadlp13.austin.ibm.com 1.1394 # Nested interrupt support. # -------------------------------------------- # diff -Nru a/arch/ppc64/kernel/irq.c b/arch/ppc64/kernel/irq.c --- a/arch/ppc64/kernel/irq.c Thu Jan 22 08:20:25 2004 +++ b/arch/ppc64/kernel/irq.c Thu Jan 22 08:20:25 2004 @@ -822,16 +822,9 @@ } out: desc->status &= ~IRQ_INPROGRESS; - /* - * The ->end() handler has to deal with interrupts which got - * disabled while the handler was running. - */ - if (desc->handler) { - if (desc->handler->end) - desc->handler->end(irq); - else if (desc->handler->enable) - desc->handler->enable(irq); - } + + desc->handler->end(irq); + spin_unlock(&desc->lock); } diff -Nru a/arch/ppc64/kernel/xics.c b/arch/ppc64/kernel/xics.c --- a/arch/ppc64/kernel/xics.c Thu Jan 22 08:20:25 2004 +++ b/arch/ppc64/kernel/xics.c Thu Jan 22 08:20:25 2004 @@ -92,6 +92,21 @@ static unsigned int default_server = 0xFF; static unsigned int default_distrib_server = 0; +/* Number of nested IRQs we can store */ +#define IRQ_DEPTH 2 + +struct cpu_irq_stack +{ + int depth; + int priority[IRQ_DEPTH]; + int irq[IRQ_DEPTH]; +}; + +DEFINE_PER_CPU(struct cpu_irq_stack, _irq_stack); + +#define irq_stack __get_cpu_var(_irq_stack) +#define irq_stack_depth (irq_stack).depth + /* * XICS only has a single IPI, so encode the messages per CPU */ @@ -302,20 +317,36 @@ void xics_end_irq(unsigned int irq) { int cpu = smp_processor_id(); + unsigned int priority; + + if (irq >= 0 && irq != irq_offset_up(xics_irq_8259_cascade)) { + 
irq_stack_depth--; + priority = irq_stack.priority[irq_stack_depth]; + } else { + priority = 0xff; + } iosync(); - ops->xirr_info_set(cpu, ((0xff<<24) | (irq_offset_down(irq)))); + ops->xirr_info_set(cpu, (priority<<24) | (irq_offset_down(irq))); } void xics_mask_and_ack_irq(u_int irq) { int cpu = smp_processor_id(); + unsigned int priority; if (irq < irq_offset_value()) { + if (irq >= 0) { + irq_stack_depth--; + priority = irq_stack.priority[irq_stack_depth]; + } else { + priority = 0xff; + } + i8259_pic.ack(irq); iosync(); - ops->xirr_info_set(cpu, ((0xff<<24) | + ops->xirr_info_set(cpu, ((priority<<24) | xics_irq_8259_cascade_real)); iosync(); } @@ -325,10 +356,12 @@ { u_int cpu = smp_processor_id(); u_int vec; + u_int priority; int irq; vec = ops->xirr_info_get(cpu); - /* (vec >> 24) == old priority */ + + priority = vec >> 24; vec &= 0x00ffffff; /* for sanity, this had better be < NR_IRQS - 16 */ @@ -345,6 +378,16 @@ } else { irq = irq_offset_up(vec); } + + if (irq >= 0) { + if (irq_stack_depth >= IRQ_DEPTH) + panic("Illegal irq stack depth"); + + irq_stack.priority[irq_stack_depth] = priority; + irq_stack.irq[irq_stack_depth] = irq; + irq_stack_depth++; + } + return irq; } @@ -413,7 +456,7 @@ void xics_init_IRQ(void) { - int i; + int i, j; unsigned long intr_size = 0; struct device_node *np; uint *ireg, ilen, indx = 0; @@ -531,6 +574,14 @@ xics_8259_pic.disable = i8259_pic.disable; for (i = 0; i < 16; ++i) get_real_irq_desc(i)->handler = &xics_8259_pic; + + for (i = 0; i < NR_CPUS; i++) { + per_cpu(_irq_stack, i).depth = 0; + for (j = 0; j < IRQ_DEPTH; j++) { + per_cpu(_irq_stack, i).priority[j] = 0xff; + per_cpu(_irq_stack, i).irq[j] = -1; + } + } ops->cppr_info(boot_cpuid, 0xff); iosync(); From agl at us.ibm.com Fri Jan 23 05:48:39 2004 From: agl at us.ibm.com (Adam Litke) Date: 22 Jan 2004 10:48:39 -0800 Subject: `tlbiel' instruction still breaks POWER3 on 2.6.0 Message-ID: <1074797319.669.26.camel@agtpad> I know this has been discussed before. 
Stock 2.6.0 won't boot power3 because of the infamous 'tlbiel' instruction stuff. 'arch/ppc64/Makefile' is also hardcoded to power4 via -mcpu=power4. I know about the workaround and have gotten a kernel to boot. Is there a rolled up patch in another ppc64 tree (ameslab, rsync)? Could we get it pushed into 2.6? -- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From linas at austin.ibm.com Fri Jan 23 07:50:38 2004 From: linas at austin.ibm.com (linas at austin.ibm.com) Date: Thu, 22 Jan 2004 14:50:38 -0600 Subject: `tlbiel' instruction still breaks POWER3 on 2.6.0 In-Reply-To: <1074797319.669.26.camel@agtpad>; from agl@us.ibm.com on Thu, Jan 22, 2004 at 10:48:39AM -0800 References: <1074797319.669.26.camel@agtpad> Message-ID: <20040122145038.M22416@forte.austin.ibm.com> On Thu, Jan 22, 2004 at 10:48:39AM -0800, Adam Litke wrote: > > I know this has been discussed before. Stock 2.6.0 won't boot power3 > because of the infamous 'tlbiel' instruction stuff. > 'arch/ppc64/Makefile' is also hardcoded to power4 via -mcpu=power4. I > know about the workaround and have gotten a kernel to boot. Is there a > rolled up patch in another ppc64 tree (ameslab, rsync)? Could we get it ameslab 2.6 boots fine on power3 fyi it even works fine when the userland is the old suse sles8, pre-rc-anything. > pushed into 2.6? Im guessing there's a lot in ameslab thats not in 2.6 yet? ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olof at austin.ibm.com Fri Jan 23 08:45:05 2004 From: olof at austin.ibm.com (Olof Johansson) Date: Thu, 22 Jan 2004 15:45:05 -0600 Subject: [2.4] [PATCH] hash_page rework, take 2 Message-ID: <40104461.5030804@austin.ibm.com> Ok, so the previous approach of the hash_page rework had a few drawbacks as pointed out by Ben and others. Here's a new try, I'm looking for any feedback I can get on it! 
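As a rough illustration of the locking scheme this message goes on to describe (one rwlock per CPU, read-locked in hash_page(), all of them write-locked before a PTE page is freed), here is a single-threaded user-space model. It is a sketch, not the patch: every name in it (rwlock_model, NR_CPUS_MODEL, hash_page_enter and so on) is made up for illustration, and the "locks" are plain counters rather than real kernel rwlocks.

```c
#include <assert.h>

#define NR_CPUS_MODEL 4		/* hypothetical CPU count, for illustration only */

/* Counter-based stand-in for a kernel rwlock; not safe for real concurrency. */
struct rwlock_model {
	int readers;	/* CPUs currently inside the read-side section */
	int writer;	/* 1 while the lock is held for writing */
};

static struct rwlock_model pte_lock[NR_CPUS_MODEL];

/* hash_page() side: take only the local CPU's lock, in read mode. */
static int hash_page_enter(int cpu)
{
	if (pte_lock[cpu].writer)
		return 0;	/* a real rwlock would spin here */
	pte_lock[cpu].readers++;
	return 1;
}

/* hash_page() side: drop the read lock once the PTE is no longer referenced. */
static void hash_page_exit(int cpu)
{
	assert(pte_lock[cpu].readers > 0);
	pte_lock[cpu].readers--;
}

/*
 * Deallocation side: the PTE page may be freed only once every CPU's
 * lock could be taken for writing, i.e. no CPU is still in hash_page().
 * Returns 1 when freeing is safe, 0 when some reader is still active.
 */
static int pte_free_is_safe(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS_MODEL; cpu++) {
		if (pte_lock[cpu].readers || pte_lock[cpu].writer)
			return 0;	/* a real rwlock would wait instead */
		pte_lock[cpu].writer = 1;	/* write-lock ... */
		pte_lock[cpu].writer = 0;	/* ... and release straight away */
	}
	return 1;
}
```

The model assumes the PTE page has already been unlinked from the page tables before pte_free_is_safe() is called, so a reader that enters afterwards cannot find it.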
The IPI approach ended up causing a whole lot of interrupts, so I went with a rwlock per CPU instead. hash_page() takes the lock in read mode, so all the deallocation code needs to do is make sure it could take all locks for writing. Once it's been able to do so it's guaranteed that no readers are holding on to references to a PTE about to be deallocated. While I was at it, I switched over to the per-HPTE locking that 2.6 uses. I've been kicking this around quite a bit in the specweb setup, and it's been running fine. I didn't see any obvious contention for any of the structures on an 8-way machine, so I didn't pursue further enhancements by aligning stuff on cache lines. This is, after all, not 2.6. :-) Thanks, Olof -- Olof Johansson Office: 4F005/905 pSeries Linux Development IBM Systems Group Email: olof at austin.ibm.com Phone: 512-838-9858 All opinions are my own and not those of IBM ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From nfont at austin.ibm.com Fri Jan 23 08:58:57 2004 From: nfont at austin.ibm.com (Nathan Fontenot) Date: Thu, 22 Jan 2004 15:58:57 -0600 Subject: rtas-last-error patch Message-ID: <1074808737.3189.39.camel@mudbug.austin.ibm.com> The attached patch will log an error message to nvram any time an rtas call returns a hardware failure. This is done by making an additional rtas call with the rtas-last-error token and logging the returned buffer. This ability is something that several people have expressed interest in having to aid in debugging. I am hoping to push this to Ames lab around the end of next week. All comments welcome. Nathan Fontenot -- -------------- next part -------------- A non-text attachment was scrubbed...
Name: rtas.patch Type: text/x-patch Size: 1987 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040122/70b8d972/attachment.bin From anton at samba.org Fri Jan 23 09:02:28 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 23 Jan 2004 09:02:28 +1100 Subject: `tlbiel' instruction still breaks POWER3 on 2.6.0 In-Reply-To: <1074797319.669.26.camel@agtpad> References: <1074797319.669.26.camel@agtpad> Message-ID: <20040122220227.GF11236@krispykreme> > I know this has been discussed before. Stock 2.6.0 won't boot power3 > because of the infamous 'tlbiel' instruction stuff. > 'arch/ppc64/Makefile' is also hardcoded to power4 via -mcpu=power4. I > know about the workaround and have gotten a kernel to boot. Is there a > rolled up patch in another ppc64 tree (ameslab, rsync)? Could we get it > pushed into 2.6? 2.6.0 is ancient :) The fix is in current linus BK and ameslab-2.5 Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From benh at kernel.crashing.org Fri Jan 23 11:24:32 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Fri, 23 Jan 2004 11:24:32 +1100 Subject: MPIC Timer of U3 In-Reply-To: <400FA8DE.8080603@thalescomputers.fr> References: <400FA8DE.8080603@thalescomputers.fr> Message-ID: <1074817472.974.162.camel@gaston> On Thu, 2004-01-22 at 21:41, R?gis Odey? wrote: > Hi, > > I'm trying to use the MPIC Timer sub-module of the U3 (host bridge on JS20). I don't recommend that. I can't tell more at this point, but this resource may not be available in future revisions of the product. (Actually, if js20 uses U3H, I'm not sure the timer is still there at all in this revision even) > First of all, I did not see any API in the kernel except for the > definition of MPIC structure in open_pic_defs.h, so I added a small > module (dynamically loaded) remapping the mpic structure of the U3 > (through ioremap call).
> > The mapping seems to be OK because I read the Vendor Id properly and I'm > able to read/write to the interruptions part of the MPIC (Source part of > the MPIC structure). > > But unfortunately, the Timer Frequency and the Timers seem not to be > programmable (re-read always 0x0). > > Is there anybody who experimented such a behaviour with the MPIC Timer > of the U3 ? Hrm... It may just not exist in the U3H version of the cell. Ben. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Fri Jan 23 12:29:26 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 23 Jan 2004 12:29:26 +1100 Subject: `tlbiel' instruction still breaks POWER3 on 2.6.0 In-Reply-To: <20040122145038.M22416@forte.austin.ibm.com> References: <1074797319.669.26.camel@agtpad> <20040122145038.M22416@forte.austin.ibm.com> Message-ID: <20040123012926.GI11236@krispykreme> > ameslab 2.6 boots fine on power3 > > fyi it even works fine when the userland is the old suse sles8, > pre-rc-anything. This is good to know. As far as I know the only patch needed to compile upstream 2.6 on SLES8 is attached below. The fix is actually in -mm but I doubt we'll get it into mainline (Ive tried a few times and Linus spat it out). If anyone else is having build problems on existing distros can they speak up now? (using either current ameslab or linus BK) > Im guessing there's a lot in ameslab thats not in 2.6 yet? Its getting closer. The core bits we need to get merged are SPLPAR spinlocks and large irqs. 
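For context on the patch at the end of this message: RELOC_HIDE() passes a pointer through an empty asm statement so the compiler can no longer tell which object it came from, then adds a byte offset to it. The sketch below uses the patched "=r" form of the macro in a user-space program. It assumes a GCC-compatible compiler on a target where unsigned long is pointer-sized; the array table and the helper nth_int() are purely illustrative and appear nowhere in the kernel.

```c
#include <stddef.h>

/*
 * Patched form of the macro: the result is forced into a register ("=r").
 * __typeof__ is used here instead of typeof so the sketch also builds
 * in strict ISO modes; the kernel header spells it typeof.
 */
#define RELOC_HIDE(ptr, off)					\
  ({ unsigned long __ptr;					\
     __asm__ ("" : "=r"(__ptr) : "0"(ptr));			\
     (__typeof__(ptr)) (__ptr + (off)); })

static int table[4] = { 10, 20, 30, 40 };	/* illustrative data */

/* Return the n-th int after base, hiding base's origin from the optimiser. */
static int nth_int(int *base, size_t n)
{
	int *p = RELOC_HIDE(base, n * sizeof(int));

	return *p;
}
```

Calling nth_int(table, 2) reads the third element of the array; the only point of the macro is that the compiler cannot make assumptions about p based on where base came from.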
Anton -- Workaround for ppc64 compiler bug which has since been fixed gr16_work-anton/include/linux/compiler-gcc.h | 2 +- 1 files changed, 1 insertion(+), 1 deletion(-) diff -puN include/linux/compiler-gcc.h~reloc_hide_patch include/linux/compiler-gcc.h --- gr16_work/include/linux/compiler-gcc.h~reloc_hide_patch 2003-09-27 15:18:31.000000000 -0500 +++ gr16_work-anton/include/linux/compiler-gcc.h 2003-09-27 15:18:44.000000000 -0500 @@ -13,5 +13,5 @@ shouldn't recognize the original var, and make assumptions about it */ #define RELOC_HIDE(ptr, off) \ ({ unsigned long __ptr; \ - __asm__ ("" : "=g"(__ptr) : "0"(ptr)); \ + __asm__ ("" : "=r"(__ptr) : "0"(ptr)); \ (typeof(ptr)) (__ptr + (off)); }) _ ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From kennyt at php.net Fri Jan 23 13:49:57 2004 From: kennyt at php.net (Ken Tossell) Date: Thu, 22 Jan 2004 21:49:57 -0500 (EST) Subject: Successful boot on a 7025-F80 In-Reply-To: <20040121044245.55817.qmail@web60107.mail.yahoo.com> References: <20040121044245.55817.qmail@web60107.mail.yahoo.com> Message-ID: On Tue, 20 Jan 2004, Yannick Bertrand wrote: > I'm actually using the kernel to run Gentoo Linux PPC > (which is intended to run on Apple hardware!). Wow! I wish I had a ppc64 laying around! (Gentoo > * :) Yep, actually got me to reply to a linuxppc64-dev mail! Ken ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From anton at samba.org Fri Jan 23 14:03:10 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 23 Jan 2004 14:03:10 +1100 Subject: autoconsole In-Reply-To: <20040118141742.GE6293@krispykreme> References: <20040113235744.GB13397@krispykreme> <20040113185245.A1747@w-mikek2.beaverton.ibm.com> <20040114080749.GA374@suse.de> <080D5891-46A6-11D8-9E58-000A95A0560C@us.ibm.com> <20040115011808.GD27924@krispykreme> <20040115085849.A1808@w-mikek2.beaverton.ibm.com> <20040115170631.GA22399@suse.de> <3238B645-477F-11D8-A8C3-000A95A4DC02@kernel.crashing.org> <20040118141742.GE6293@krispykreme> Message-ID: <20040123030310.GK11236@krispykreme> > Anyone feel like coding this up? Or does OF export the baud rate somewhere? I just merged the simple console detection patch. If someone feels the urge to add to it feel free :) Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From paulus at samba.org Fri Jan 23 14:24:26 2004 From: paulus at samba.org (Paul Mackerras) Date: Fri, 23 Jan 2004 14:24:26 +1100 Subject: [PATCH][2.6] Nested Interrupt support In-Reply-To: <1074781690.23288.571.camel@magik> References: <1074094346.2389.42.camel@magik> <20040119042022.GA20834@krispykreme> <1074781690.23288.571.camel@magik> Message-ID: <16400.37866.867318.95501@cargo.ozlabs.ibm.com> Jake Moilanen writes: > Here's the patch using per cpu data for the irq stack. Can't we find somewhere on the kernel stack to stash this? Could we use regs->softe maybe? Regards, Paul. ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From olh at suse.de Fri Jan 23 22:35:49 2004 From: olh at suse.de (Olaf Hering) Date: Fri, 23 Jan 2004 12:35:49 +0100 Subject: autoconsole In-Reply-To: <20040123030310.GK11236@krispykreme> References: <20040113235744.GB13397@krispykreme> <20040113185245.A1747@w-mikek2.beaverton.ibm.com> <20040114080749.GA374@suse.de> <080D5891-46A6-11D8-9E58-000A95A0560C@us.ibm.com> <20040115011808.GD27924@krispykreme> <20040115085849.A1808@w-mikek2.beaverton.ibm.com> <20040115170631.GA22399@suse.de> <3238B645-477F-11D8-A8C3-000A95A4DC02@kernel.crashing.org> <20040118141742.GE6293@krispykreme> <20040123030310.GK11236@krispykreme> Message-ID: <20040123113549.GA20697@suse.de> On Fri, Jan 23, Anton Blanchard wrote: > > > Anyone feel like coding this up? Or does OF export the baud rate somewhere? > > I just merged the simple console detection patch. If someone feels the > urge to add to it feel free :) Move it into CONFIG_*PSERIES* or Rochester will be unhappy. -- USB is for mice, FireWire is for men! sUse lINUX ag, nÜRNBERG ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Sat Jan 24 20:35:09 2004 From: anton at samba.org (Anton Blanchard) Date: Sat, 24 Jan 2004 20:35:09 +1100 Subject: ppc64 PTE hacks In-Reply-To: <20040109061805.GC25504@krispykreme> References: <20031223235632.GE934@krispykreme> <20040109061805.GC25504@krispykreme> Message-ID: <20040124093509.GP11236@krispykreme> > Here it is updated for 2.6, using percpu data etc. Its currently getting > some stress testing and if that passes and there are no concerns I'll > merge it in. As Ben mentioned we need it for page aging to work. It turns out there were some nasty bugs (rmap stuff wasnt working on vmalloc regions). We were also doing spurious flushes on ptes that previously had the DIRTY/RW bits changed. Im stressing this for a while, if things look good and there are no complaints I'll check it in.
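For readers following the patch below: its central idea appears to be that every Linux PTE change which had a hash table entry queues an invalidation in a per-CPU batch, and the flush_tlb_* hooks simply drain that batch. A minimal user-space sketch of that pattern follows. The names (tlb_batch_sketch, hpte_update_sketch, flush_pending_sketch) and the tiny batch size are assumptions for illustration; the patch itself uses a per-CPU struct ppc64_tlb_batch with PPC64_TLB_BATCH_NR set to 192 and performs real hash table flushes where this sketch only counts.

```c
#include <assert.h>

#define BATCH_NR 4	/* small stand-in for PPC64_TLB_BATCH_NR */

struct tlb_batch_sketch {
	unsigned long index;		/* number of queued entries */
	unsigned long context;		/* context shared by all queued entries */
	unsigned long addr[BATCH_NR];	/* addresses waiting to be flushed */
};

static struct tlb_batch_sketch batch;	/* the patch keeps one of these per CPU */
static unsigned long flush_calls;	/* times the hardware flush would have run */
static unsigned long entries_flushed;	/* total entries handed to those flushes */

/* Drain the batch; a no-op when nothing is queued. */
static void flush_pending_sketch(void)
{
	if (batch.index == 0)
		return;
	flush_calls++;
	entries_flushed += batch.index;
	batch.index = 0;
}

/* Queue one invalidation, flushing automatically once the batch is full. */
static void hpte_update_sketch(unsigned long context, unsigned long addr)
{
	unsigned long i = batch.index;

	/* A context change with entries still queued means a flush was missed. */
	assert(i == 0 || context == batch.context);
	if (i == 0)
		batch.context = context;
	batch.addr[i] = addr;
	batch.index = ++i;
	if (i >= BATCH_NR)
		flush_pending_sketch();
}
```

Queuing five addresses with a batch size of four triggers one automatic flush and leaves one entry pending until the next explicit flush_pending_sketch() call.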
Anton ===== arch/ppc64/kernel/pSeries_htab.c 1.14 vs edited ===== --- 1.14/arch/ppc64/kernel/pSeries_htab.c Tue Jan 20 13:07:05 2004 +++ edited/arch/ppc64/kernel/pSeries_htab.c Sat Jan 24 17:04:43 2004 @@ -300,7 +300,7 @@ int i, j; HPTE *hptep; Hpte_dword0 dw0; - struct ppc64_tlb_batch *batch = &ppc64_tlb_batch[smp_processor_id()]; + struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch); /* XXX fix for large ptes */ unsigned long large = 0; ===== arch/ppc64/kernel/pSeries_lpar.c 1.37 vs edited ===== --- 1.37/arch/ppc64/kernel/pSeries_lpar.c Fri Jan 23 11:18:06 2004 +++ edited/arch/ppc64/kernel/pSeries_lpar.c Sat Jan 24 17:04:43 2004 @@ -420,10 +420,8 @@ lpar_rc = plpar_pte_protect(flags, slot, (avpn << 7)); - if (lpar_rc == H_Not_Found) { - udbg_printf("updatepp missed\n"); + if (lpar_rc == H_Not_Found) return -1; - } if (lpar_rc != H_Success) panic("bad return code from pte protect rc = %lx\n", lpar_rc); @@ -521,10 +519,8 @@ lpar_rc = plpar_pte_remove(H_AVPN, slot, (avpn << 7), &dummy1, &dummy2); - if (lpar_rc == H_Not_Found) { - udbg_printf("invalidate missed\n"); + if (lpar_rc == H_Not_Found) return; - } if (lpar_rc != H_Success) panic("Bad return code from invalidate rc = %lx\n", lpar_rc); @@ -539,7 +535,7 @@ { int i; unsigned long flags; - struct ppc64_tlb_batch *batch = &ppc64_tlb_batch[smp_processor_id()]; + struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch); spin_lock_irqsave(&pSeries_lpar_tlbie_lock, flags); ===== arch/ppc64/kernel/process.c 1.46 vs edited ===== --- 1.46/arch/ppc64/kernel/process.c Thu Jan 22 18:37:14 2004 +++ edited/arch/ppc64/kernel/process.c Sat Jan 24 17:04:44 2004 @@ -49,14 +49,20 @@ #include #include #include +#include #ifndef CONFIG_SMP struct task_struct *last_task_used_math = NULL; struct task_struct *last_task_used_altivec = NULL; #endif -struct mm_struct ioremap_mm = { pgd : ioremap_dir - ,page_table_lock : SPIN_LOCK_UNLOCKED }; +struct mm_struct ioremap_mm = { + .pgd = ioremap_dir, + .mm_users = 
ATOMIC_INIT(2), + .mm_count = ATOMIC_INIT(1), + .cpu_vm_mask = CPU_MASK_ALL, + .page_table_lock = SPIN_LOCK_UNLOCKED, +}; char *sysmap = NULL; unsigned long sysmap_size = 0; @@ -145,6 +151,8 @@ if (new->thread.regs && last_task_used_altivec == new) new->thread.regs->msr |= MSR_VEC; #endif /* CONFIG_ALTIVEC */ + + flush_tlb_pending(); new_thread = &new->thread; old_thread = &current->thread; ===== arch/ppc64/mm/Makefile 1.12 vs edited ===== --- 1.12/arch/ppc64/mm/Makefile Thu Jan 22 16:29:08 2004 +++ edited/arch/ppc64/mm/Makefile Sat Jan 24 17:04:44 2004 @@ -4,6 +4,6 @@ EXTRA_CFLAGS += -mno-minimal-toc -obj-y := fault.o init.o imalloc.o hash_utils.o hash_low.o +obj-y := fault.o init.o imalloc.o hash_utils.o hash_low.o tlb.o obj-$(CONFIG_DISCONTIGMEM) += numa.o obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o ===== arch/ppc64/mm/hash_utils.c 1.45 vs edited ===== --- 1.45/arch/ppc64/mm/hash_utils.c Tue Jan 20 13:07:09 2004 +++ edited/arch/ppc64/mm/hash_utils.c Sat Jan 24 17:04:44 2004 @@ -325,8 +325,7 @@ ppc_md.flush_hash_range(context, number, local); } else { int i; - struct ppc64_tlb_batch *batch = - &ppc64_tlb_batch[smp_processor_id()]; + struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch); for (i = 0; i < number; i++) flush_hash_page(context, batch->addr[i], batch->pte[i], ===== arch/ppc64/mm/init.c 1.55 vs edited ===== --- 1.55/arch/ppc64/mm/init.c Tue Jan 20 13:07:09 2004 +++ edited/arch/ppc64/mm/init.c Sat Jan 24 17:04:44 2004 @@ -90,57 +90,6 @@ /* max amount of RAM to use */ unsigned long __max_memory; -/* This is declared as we are using the more or less generic - * include/asm-ppc64/tlb.h file -- tgall - */ -DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); -DEFINE_PER_CPU(struct pte_freelist_batch *, pte_freelist_cur); -unsigned long pte_freelist_forced_free; - -static void pte_free_smp_sync(void *arg) -{ - /* Do nothing, just ensure we sync with all CPUs */ -} - -/* This is only called when we are critically out of memory - * (and fail to get a page in
pte_free_tlb). - */ -void pte_free_now(struct page *ptepage) -{ - pte_freelist_forced_free++; - - smp_call_function(pte_free_smp_sync, NULL, 0, 1); - - pte_free(ptepage); -} - -static void pte_free_rcu_callback(void *arg) -{ - struct pte_freelist_batch *batch = arg; - unsigned int i; - - for (i = 0; i < batch->index; i++) - pte_free(batch->pages[i]); - free_page((unsigned long)batch); -} - -void pte_free_submit(struct pte_freelist_batch *batch) -{ - INIT_RCU_HEAD(&batch->rcu); - call_rcu(&batch->rcu, pte_free_rcu_callback, batch); -} - -void pte_free_finish(void) -{ - /* This is safe as we are holding page_table_lock */ - struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur); - - if (*batchp == NULL) - return; - pte_free_submit(*batchp); - *batchp = NULL; -} - void show_mem(void) { int total = 0, reserved = 0; @@ -170,17 +119,27 @@ printk("%d pages swap cached\n",cached); } -void * -ioremap(unsigned long addr, unsigned long size) -{ #ifdef CONFIG_PPC_ISERIES + +void *ioremap(unsigned long addr, unsigned long size) +{ return (void*)addr; +} + +void iounmap(void *addr) +{ + return; +} + #else + +void * +ioremap(unsigned long addr, unsigned long size) +{ void *ret = __ioremap(addr, size, _PAGE_NO_CACHE); if(mem_init_done) return eeh_ioremap(addr, ret); /* may remap the addr */ return ret; -#endif } void * @@ -326,7 +285,7 @@ * * XXX what about calls before mem_init_done (ie python_countermeasures()) */ -void pSeries_iounmap(void *addr) +void iounmap(void *addr) { unsigned long address, start, end, size; struct mm_struct *mm; @@ -352,29 +311,18 @@ spin_lock(&mm->page_table_lock); dir = pgd_offset_i(address); - flush_cache_all(); + flush_cache_vunmap(address, end); do { unmap_im_area_pmd(dir, address, end - address); address = (address + PGDIR_SIZE) & PGDIR_MASK; dir++; } while (address && (address < end)); - __flush_tlb_range(mm, start, end); + flush_tlb_kernel_range(start, end); spin_unlock(&mm->page_table_lock); return; } -void iounmap(void *addr) -{ 
-#ifdef CONFIG_PPC_ISERIES - /* iSeries I/O Remap is a noop */ - return; -#else - /* DRENG / PPPBBB todo */ - return pSeries_iounmap(addr); -#endif -} - int iounmap_explicit(void *addr, unsigned long size) { struct vm_struct *area; @@ -463,152 +411,7 @@ } } -void -flush_tlb_mm(struct mm_struct *mm) -{ - struct vm_area_struct *mp; - - spin_lock(&mm->page_table_lock); - - for (mp = mm->mmap; mp != NULL; mp = mp->vm_next) - __flush_tlb_range(mm, mp->vm_start, mp->vm_end); - - /* XXX are there races with checking cpu_vm_mask? - Anton */ - cpus_clear(mm->cpu_vm_mask); - - spin_unlock(&mm->page_table_lock); -} - -/* - * Callers should hold the mm->page_table_lock - */ -void -flush_tlb_page(struct vm_area_struct *vma, unsigned long vmaddr) -{ - unsigned long context = 0; - pgd_t *pgd; - pmd_t *pmd; - pte_t *ptep; - pte_t pte; - int local = 0; - cpumask_t tmp; - - switch( REGION_ID(vmaddr) ) { - case VMALLOC_REGION_ID: - pgd = pgd_offset_k( vmaddr ); - break; - case IO_REGION_ID: - pgd = pgd_offset_i( vmaddr ); - break; - case USER_REGION_ID: - pgd = pgd_offset( vma->vm_mm, vmaddr ); - context = vma->vm_mm->context; - - /* XXX are there races with checking cpu_vm_mask? 
- Anton */ - tmp = cpumask_of_cpu(smp_processor_id()); - if (cpus_equal(vma->vm_mm->cpu_vm_mask, tmp)) - local = 1; - - break; - default: - panic("flush_tlb_page: invalid region 0x%016lx", vmaddr); - - } - - if (!pgd_none(*pgd)) { - pmd = pmd_offset(pgd, vmaddr); - if (pmd_present(*pmd)) { - ptep = pte_offset_kernel(pmd, vmaddr); - /* Check if HPTE might exist and flush it if so */ - pte = __pte(pte_update(ptep, _PAGE_HPTEFLAGS, 0)); - if ( pte_val(pte) & _PAGE_HASHPTE ) { - flush_hash_page(context, vmaddr, pte, local); - } - } - WARN_ON(pmd_hugepage(*pmd)); - } -} - -struct ppc64_tlb_batch ppc64_tlb_batch[NR_CPUS]; - -void -__flush_tlb_range(struct mm_struct *mm, unsigned long start, unsigned long end) -{ - pgd_t *pgd; - pmd_t *pmd; - pte_t *ptep; - pte_t pte; - unsigned long pgd_end, pmd_end; - unsigned long context = 0; - struct ppc64_tlb_batch *batch = &ppc64_tlb_batch[smp_processor_id()]; - unsigned long i = 0; - int local = 0; - cpumask_t tmp; - - switch(REGION_ID(start)) { - case VMALLOC_REGION_ID: - pgd = pgd_offset_k(start); - break; - case IO_REGION_ID: - pgd = pgd_offset_i(start); - break; - case USER_REGION_ID: - pgd = pgd_offset(mm, start); - context = mm->context; - - /* XXX are there races with checking cpu_vm_mask? 
- Anton */ - tmp = cpumask_of_cpu(smp_processor_id()); - if (cpus_equal(mm->cpu_vm_mask, tmp)) - local = 1; - - break; - default: - panic("flush_tlb_range: invalid region for start (%016lx) and end (%016lx)\n", start, end); - } - - do { - pgd_end = (start + PGDIR_SIZE) & PGDIR_MASK; - if (pgd_end > end) - pgd_end = end; - if (!pgd_none(*pgd)) { - pmd = pmd_offset(pgd, start); - do { - pmd_end = (start + PMD_SIZE) & PMD_MASK; - if (pmd_end > end) - pmd_end = end; - if (pmd_present(*pmd)) { - ptep = pte_offset_kernel(pmd, start); - do { - if (pte_val(*ptep) & _PAGE_HASHPTE) { - pte = __pte(pte_update(ptep, _PAGE_HPTEFLAGS, 0)); - if (pte_val(pte) & _PAGE_HASHPTE) { - batch->pte[i] = pte; - batch->addr[i] = start; - i++; - if (i == PPC64_TLB_BATCH_NR) { - flush_hash_range(context, i, local); - i = 0; - } - } - } - start += PAGE_SIZE; - ++ptep; - } while (start < pmd_end); - } else { - WARN_ON(pmd_hugepage(*pmd)); - start = pmd_end; - } - ++pmd; - } while (start < pgd_end); - } else { - start = pgd_end; - } - ++pgd; - } while (start < end); - - if (i) - flush_hash_range(context, i, local); -} +#endif void free_initmem(void) { ===== include/asm-ppc64/pgtable.h 1.33 vs edited ===== --- 1.33/include/asm-ppc64/pgtable.h Thu Jan 22 17:11:59 2004 +++ edited/include/asm-ppc64/pgtable.h Sat Jan 24 17:04:45 2004 @@ -12,6 +12,7 @@ #include /* For TASK_SIZE */ #include #include +#include #endif /* __ASSEMBLY__ */ /* PMD_SHIFT determines what a second-level page table entry can map */ @@ -289,71 +290,115 @@ /* Atomic PTE updates */ -static inline unsigned long pte_update( pte_t *p, unsigned long clr, - unsigned long set ) +static inline unsigned long pte_update(pte_t *p, unsigned long clr) { unsigned long old, tmp; - + __asm__ __volatile__( "1: ldarx %0,0,%3 # pte_update\n\ - andi. %1,%0,%7\n\ + andi. %1,%0,%6\n\ bne- 1b \n\ andc %1,%0,%4 \n\ - or %1,%1,%5 \n\ stdcx. 
%1,0,%3 \n\ bne- 1b" : "=&r" (old), "=&r" (tmp), "=m" (*p) - : "r" (p), "r" (clr), "r" (set), "m" (*p), "i" (_PAGE_BUSY) + : "r" (p), "r" (clr), "m" (*p), "i" (_PAGE_BUSY) : "cc" ); return old; } +/* PTE updating functions */ +extern void hpte_update(pte_t *ptep, unsigned long pte, int wrprot); + static inline int ptep_test_and_clear_young(pte_t *ptep) { - return (pte_update(ptep, _PAGE_ACCESSED, 0) & _PAGE_ACCESSED) != 0; + unsigned long old; + + old = pte_update(ptep, _PAGE_ACCESSED | _PAGE_HPTEFLAGS); + if (old & _PAGE_HASHPTE) { + hpte_update(ptep, old, 0); + flush_tlb_pending(); /* XXX generic code doesn't flush */ + } + return (old & _PAGE_ACCESSED) != 0; } +/* + * On RW/DIRTY bit transitions we can avoid flushing the hpte. For the + * moment we do it but we need to test if the optimisation is worth it. + */ +#if 1 static inline int ptep_test_and_clear_dirty(pte_t *ptep) { - return (pte_update(ptep, _PAGE_DIRTY, 0) & _PAGE_DIRTY) != 0; + unsigned long old; + + old = pte_update(ptep, _PAGE_DIRTY | _PAGE_HPTEFLAGS); + if (old & _PAGE_HASHPTE) + hpte_update(ptep, old, 0); + return (old & _PAGE_DIRTY) != 0; } -static inline pte_t ptep_get_and_clear(pte_t *ptep) +static inline void ptep_set_wrprotect(pte_t *ptep) +{ + unsigned long old; + + old = pte_update(ptep, _PAGE_RW | _PAGE_HPTEFLAGS); + if (old & _PAGE_HASHPTE) + hpte_update(ptep, old, 0); +} +#else +static inline int ptep_test_and_clear_dirty(pte_t *ptep) { - return __pte(pte_update(ptep, ~_PAGE_HPTEFLAGS, 0)); + unsigned long old; + + old = pte_update(ptep, _PAGE_DIRTY); + if ((~old & (_PAGE_HASHPTE | _PAGE_RW | _PAGE_DIRTY)) == 0) + hpte_update(ptep, old, 1); + return (old & _PAGE_DIRTY) != 0; } static inline void ptep_set_wrprotect(pte_t *ptep) { - pte_update(ptep, _PAGE_RW, 0); + unsigned long old; + + old = pte_update(ptep, _PAGE_RW); + if ((~old & (_PAGE_HASHPTE | _PAGE_RW | _PAGE_DIRTY)) == 0) + hpte_update(ptep, old, 1); } +#endif -static inline void ptep_mkdirty(pte_t *ptep) +static inline pte_t 
ptep_get_and_clear(pte_t *ptep) { - pte_update(ptep, 0, _PAGE_DIRTY); + unsigned long old = pte_update(ptep, ~0UL); + + if (old & _PAGE_HASHPTE) + hpte_update(ptep, old, 0); + return __pte(old); } -/* - * Macro to mark a page protection value as "uncacheable". - */ -#define pgprot_noncached(prot) (__pgprot(pgprot_val(prot) | _PAGE_NO_CACHE | _PAGE_GUARDED)) +static inline void pte_clear(pte_t * ptep) +{ + unsigned long old = pte_update(ptep, ~0UL); -#define pte_same(A,B) (((pte_val(A) ^ pte_val(B)) & ~_PAGE_HPTEFLAGS) == 0) + if (old & _PAGE_HASHPTE) + hpte_update(ptep, old, 0); +} /* * set_pte stores a linux PTE into the linux page table. - * On machines which use an MMU hash table we avoid changing the - * _PAGE_HASHPTE bit. */ static inline void set_pte(pte_t *ptep, pte_t pte) { - pte_update(ptep, ~_PAGE_HPTEFLAGS, pte_val(pte) & ~_PAGE_HPTEFLAGS); + if (pte_present(*ptep)) + pte_clear(ptep); + *ptep = __pte(pte_val(pte)) & ~_PAGE_HPTEFLAGS; } -static inline void pte_clear(pte_t * ptep) -{ - pte_update(ptep, ~_PAGE_HPTEFLAGS, 0); -} +/* + * Macro to mark a page protection value as "uncacheable". 
+ */ +#define pgprot_noncached(prot) (__pgprot(pgprot_val(prot) | _PAGE_NO_CACHE | _PAGE_GUARDED)) + +#define pte_same(A,B) (((pte_val(A) ^ pte_val(B)) & ~_PAGE_HPTEFLAGS) == 0) extern unsigned long ioremap_bot, ioremap_base; ===== include/asm-ppc64/tlb.h 1.11 vs edited ===== --- 1.11/include/asm-ppc64/tlb.h Tue Jan 20 13:08:24 2004 +++ edited/include/asm-ppc64/tlb.h Sat Jan 24 17:04:45 2004 @@ -12,11 +12,9 @@ #ifndef _PPC64_TLB_H #define _PPC64_TLB_H -#include #include -#include -#include +struct mmu_gather; static inline void tlb_flush(struct mmu_gather *tlb); /* Avoid pulling in another include just for this */ @@ -29,66 +27,13 @@ #define tlb_start_vma(tlb, vma) do { } while (0) #define tlb_end_vma(tlb, vma) do { } while (0) -/* Should make this at least as large as the generic batch size, but it - * takes up too much space */ -#define PPC64_TLB_BATCH_NR 192 - -struct ppc64_tlb_batch { - unsigned long index; - pte_t pte[PPC64_TLB_BATCH_NR]; - unsigned long addr[PPC64_TLB_BATCH_NR]; - unsigned long vaddr[PPC64_TLB_BATCH_NR]; -}; - -extern struct ppc64_tlb_batch ppc64_tlb_batch[NR_CPUS]; - -static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep, - unsigned long address) -{ - int cpu = smp_processor_id(); - struct ppc64_tlb_batch *batch = &ppc64_tlb_batch[cpu]; - unsigned long i = batch->index; - pte_t pte; - cpumask_t local_cpumask = cpumask_of_cpu(cpu); - - if (pte_val(*ptep) & _PAGE_HASHPTE) { - pte = __pte(pte_update(ptep, _PAGE_HPTEFLAGS, 0)); - if (pte_val(pte) & _PAGE_HASHPTE) { - - batch->pte[i] = pte; - batch->addr[i] = address; - i++; - - if (i == PPC64_TLB_BATCH_NR) { - int local = 0; - - if (cpus_equal(tlb->mm->cpu_vm_mask, local_cpumask)) - local = 1; - - flush_hash_range(tlb->mm->context, i, local); - i = 0; - } - } - } - - batch->index = i; -} +#define __tlb_remove_tlb_entry(tlb, pte, address) do { } while (0) extern void pte_free_finish(void); static inline void tlb_flush(struct mmu_gather *tlb) { - int cpu = 
smp_processor_id(); - struct ppc64_tlb_batch *batch = &ppc64_tlb_batch[cpu]; - int local = 0; - cpumask_t local_cpumask = cpumask_of_cpu(smp_processor_id()); - - if (cpus_equal(tlb->mm->cpu_vm_mask, local_cpumask)) - local = 1; - - flush_hash_range(tlb->mm->context, batch->index, local); - batch->index = 0; - + flush_tlb_pending(); pte_free_finish(); } ===== include/asm-ppc64/tlbflush.h 1.4 vs edited ===== --- 1.4/include/asm-ppc64/tlbflush.h Fri Jun 7 18:21:41 2002 +++ edited/include/asm-ppc64/tlbflush.h Sat Jan 24 17:04:45 2004 @@ -1,10 +1,6 @@ #ifndef _PPC64_TLBFLUSH_H #define _PPC64_TLBFLUSH_H -#include -#include -#include - /* * TLB flushing: * @@ -15,21 +11,37 @@ * - flush_tlb_pgtables(mm, start, end) flushes a range of page tables */ -extern void flush_tlb_mm(struct mm_struct *mm); -extern void flush_tlb_page(struct vm_area_struct *vma, unsigned long vmaddr); -extern void __flush_tlb_range(struct mm_struct *mm, - unsigned long start, unsigned long end); -#define flush_tlb_range(vma, start, end) \ - __flush_tlb_range(vma->vm_mm, start, end) +#include +#include + +#define PPC64_TLB_BATCH_NR 192 -#define flush_tlb_kernel_range(start, end) \ - __flush_tlb_range(&init_mm, (start), (end)) +struct mm_struct; +struct ppc64_tlb_batch { + unsigned long index; + unsigned long context; + struct mm_struct *mm; + pte_t pte[PPC64_TLB_BATCH_NR]; + unsigned long addr[PPC64_TLB_BATCH_NR]; + unsigned long vaddr[PPC64_TLB_BATCH_NR]; +}; +DECLARE_PER_CPU(struct ppc64_tlb_batch, ppc64_tlb_batch); -static inline void flush_tlb_pgtables(struct mm_struct *mm, - unsigned long start, unsigned long end) +extern void __flush_tlb_pending(struct ppc64_tlb_batch *batch); + +static inline void flush_tlb_pending(void) { - /* PPC has hw page tables. 
*/ + struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch); + + if (batch->index) + __flush_tlb_pending(batch); } + +#define flush_tlb_mm(mm) flush_tlb_pending() +#define flush_tlb_page(vma, addr) flush_tlb_pending() +#define flush_tlb_range(vma, start, end) flush_tlb_pending() +#define flush_tlb_kernel_range(start, end) flush_tlb_pending() +#define flush_tlb_pgtables(mm, start, end) do { } while (0) extern void flush_hash_page(unsigned long context, unsigned long ea, pte_t pte, int local); ===== include/linux/init_task.h 1.27 vs edited ===== --- 1.27/include/linux/init_task.h Tue Aug 19 12:46:23 2003 +++ edited/include/linux/init_task.h Sat Jan 24 17:04:46 2004 @@ -40,6 +40,7 @@ .mmap_sem = __RWSEM_INITIALIZER(name.mmap_sem), \ .page_table_lock = SPIN_LOCK_UNLOCKED, \ .mmlist = LIST_HEAD_INIT(name.mmlist), \ + .cpu_vm_mask = CPU_MASK_ALL, \ .default_kioctx = INIT_KIOCTX(name.default_kioctx, name), \ } ===== mm/vmalloc.c 1.29 vs edited ===== --- 1.29/mm/vmalloc.c Wed Oct 8 12:53:44 2003 +++ edited/mm/vmalloc.c Sat Jan 24 17:04:46 2004 @@ -114,15 +114,16 @@ unsigned long size, pgprot_t prot, struct page ***pages) { - unsigned long end; + unsigned long base, end; + base = address & PGDIR_MASK; address &= ~PGDIR_MASK; end = address + size; if (end > PGDIR_SIZE) end = PGDIR_SIZE; do { - pte_t * pte = pte_alloc_kernel(&init_mm, pmd, address); + pte_t * pte = pte_alloc_kernel(&init_mm, pmd, base + address); if (!pte) return -ENOMEM; if (map_area_pte(pte, address, end - address, prot, pages)) ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From hozer at hozed.org Sun Jan 25 09:03:03 2004 From: hozer at hozed.org (Troy Benjegerdes) Date: Sat, 24 Jan 2004 16:03:03 -0600 Subject: NAP mode on powerpc 970 In-Reply-To: <20040120113421.GL3620@krispykreme> References: <4007EF09.1070305@thalescomputers.fr> <1074276078.1240.227.camel@magik> <1074304946.8360.15.camel@gaston> <1074543540.1100.258.camel@magik> <20040120113421.GL3620@krispykreme> Message-ID: <20040124220303.GX25308@kalmia.hozed.org> On Tue, Jan 20, 2004 at 10:34:21PM +1100, Anton Blanchard wrote: > > > > When I talked to the HV team, they do not want to do this solution. > > They are leaning towards another alternative, or having it done in FW so > > the change does not need to be done in multiple places (e.g. Different > > linux distros, and AIX). > > I disagree, we have to support both Apple and IBM products. Stashing > stuff into our FW may be a good idea for AIX but we dont have to do it > too. More importantly... people doing HPC on IBM hardware will want to run *WITHOUT* the overhead of the hypervisor. The hypervisor is a great idea for virtualization for the enterprise area, but when you are trying to get every last floating point operation out of a CPU you run 1 process per CPU, and the operating system becomes pure overhead. -- -------------------------------------------------------------------------- Troy Benjegerdes 'da hozer' hozer at drgw.net ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Sun Jan 25 16:34:55 2004 From: anton at samba.org (Anton Blanchard) Date: Sun, 25 Jan 2004 16:34:55 +1100 Subject: ppc64 PTE hacks In-Reply-To: <20040124093509.GP11236@krispykreme> References: <20031223235632.GE934@krispykreme> <20040109061805.GC25504@krispykreme> <20040124093509.GP11236@krispykreme> Message-ID: <20040125053455.GT11236@krispykreme> > It turns out there were some nasty bugs (rmap stuff wasnt working on > vmalloc regions). 
We were also doing spurious flushes on ptes that > previously had the DIRTY/RW bits changed. > > Im stressing this for a while, if things look good and there are no > complaints I'll check it in. Oops, I was missing the all important tlb.c Anton --- /dev/null 2003-12-27 21:14:33.000000000 +1100 +++ junk/arch/ppc64/mm/tlb.c 2004-01-24 17:05:16.968766720 +1100 @@ -0,0 +1,144 @@ +/* + * This file contains the routines for flushing entries from the + * TLB and MMU hash table. + * + * Derived from arch/ppc64/mm/init.c: + * Copyright (C) 1995-1996 Gary Thomas (gdt at linuxppc.org) + * + * Modifications by Paul Mackerras (PowerMac) (paulus at cs.anu.edu.au) + * and Cort Dougan (PReP) (cort at cs.nmt.edu) + * Copyright (C) 1996 Paul Mackerras + * Amiga/APUS changes by Jesper Skov (jskov at cygnus.co.uk). + * + * Derived from "arch/i386/mm/init.c" + * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds + * + * Dave Engebretsen + * Rework for PPC64 port. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +DEFINE_PER_CPU(struct ppc64_tlb_batch, ppc64_tlb_batch); + +/* This is declared as we are using the more or less generic + * include/asm-ppc64/tlb.h file -- tgall + */ +DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); +DEFINE_PER_CPU(struct pte_freelist_batch *, pte_freelist_cur); +unsigned long pte_freelist_forced_free; + +/* + * Update the MMU hash table to correspond with a change to + * a Linux PTE. If wrprot is true, it is permissible to + * change the existing HPTE to read-only rather than removing it + * (if we remove it we should clear the _PTE_HPTEFLAGS bits). 
+ */ +void hpte_update(pte_t *ptep, unsigned long pte, int wrprot) +{ + struct page *ptepage; + struct mm_struct *mm; + unsigned long addr; + int i; + unsigned long context = 0; + struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch); + + ptepage = virt_to_page(ptep); + mm = (struct mm_struct *) ptepage->mapping; + addr = ptep_to_address(ptep); + + if (REGION_ID(addr) == USER_REGION_ID) + context = mm->context; + i = batch->index; + /* + * Something has gone wrong, probably a missing flush_tlb_*. + * Warn here so we can catch such problems. + */ + WARN_ON(i != 0 && context != batch->context); + if (i == 0) { + batch->context = context; + batch->mm = mm; + } + batch->pte[i] = __pte(pte); + batch->addr[i] = addr; + batch->index = ++i; + if (i >= PPC64_TLB_BATCH_NR) + flush_tlb_pending(); +} + +void __flush_tlb_pending(struct ppc64_tlb_batch *batch) +{ + int i; + cpumask_t tmp = cpumask_of_cpu(smp_processor_id()); + int local = 0; + + BUG_ON(in_interrupt()); + + i = batch->index; + if (cpus_equal(batch->mm->cpu_vm_mask, tmp)) + local = 1; + + if (i == 1) + flush_hash_page(batch->context, batch->addr[0], batch->pte[0], + local); + else + flush_hash_range(batch->context, i, local); + batch->index = 0; +} + +static void pte_free_smp_sync(void *arg) +{ + /* Do nothing, just ensure we sync with all CPUs */ +} + +/* This is only called when we are critically out of memory + * (and fail to get a page in pte_free_tlb). 
+ */ +void pte_free_now(struct page *ptepage) +{ + pte_freelist_forced_free++; + + smp_call_function(pte_free_smp_sync, NULL, 0, 1); + + pte_free(ptepage); +} + +static void pte_free_rcu_callback(void *arg) +{ + struct pte_freelist_batch *batch = arg; + unsigned int i; + + for (i = 0; i < batch->index; i++) + pte_free(batch->pages[i]); + free_page((unsigned long)batch); +} + +void pte_free_submit(struct pte_freelist_batch *batch) +{ + INIT_RCU_HEAD(&batch->rcu); + call_rcu(&batch->rcu, pte_free_rcu_callback, batch); +} + +void pte_free_finish(void) +{ + /* This is safe as we are holding page_table_lock */ + struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur); + + if (*batchp == NULL) + return; + pte_free_submit(*batchp); + *batchp = NULL; +} ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Mon Jan 26 23:09:28 2004 From: anton at samba.org (Anton Blanchard) Date: Mon, 26 Jan 2004 23:09:28 +1100 Subject: ppc64 PTE hacks In-Reply-To: <20040125053455.GT11236@krispykreme> References: <20031223235632.GE934@krispykreme> <20040109061805.GC25504@krispykreme> <20040124093509.GP11236@krispykreme> <20040125053455.GT11236@krispykreme> Message-ID: <20040126120928.GZ11236@krispykreme> I noticed another thing that worries me: static inline int ptep_test_and_clear_dirty(pte_t *ptep) { unsigned long old; old = pte_update(ptep, _PAGE_DIRTY | _PAGE_HPTEFLAGS); if (old & _PAGE_HASHPTE) hpte_update(ptep, old, 0); return (old & _PAGE_DIRTY) != 0; } #define ptep_clear_flush_dirty(__vma, __address, __ptep) \ ({ \ int __dirty = ptep_test_and_clear_dirty(__ptep); \ if (__dirty) \ flush_tlb_page(__vma, __address); \ __dirty; \ }) We call ptep_clear_flush_dirty in msync. Even if the pte was not dirty ptep_test_and_clear_dirty will add the pte to the batch and zero hpteflags. Notice how we only flush if the pte was dirty. So we can miss the flush... We can fix this in a few ways. 
- Code the flush into ptep_test_and_clear_dirty since msync is the only place
  that uses it.
- Fix the generic ptep_clear_flush_dirty code to always call flush_tlb_page
- Create our own version of ptep_clear_flush_dirty
- Fix hpte_update to not cast out the hpte on DIRTY bit transitions

Thoughts?

Anton

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From benh at kernel.crashing.org Mon Jan 26 23:17:55 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Mon, 26 Jan 2004 23:17:55 +1100
Subject: ppc64 PTE hacks
In-Reply-To: <20040126120928.GZ11236@krispykreme>
References: <20031223235632.GE934@krispykreme> <20040109061805.GC25504@krispykreme> <20040124093509.GP11236@krispykreme> <20040125053455.GT11236@krispykreme> <20040126120928.GZ11236@krispykreme>
Message-ID: <1075119453.5655.54.camel@gaston>

> - Code the flush into ptep_test_and_clear_dirty since msync is the only place
> that uses it.
> - Fix the generic ptep_clear_flush_dirty code to always call flush_tlb_page
> - Create our own version of ptep_clear_flush_dirty
> - Fix hpte_update to not cast out the hpte on DIRTY bit transitions

Or better, do a smarter pte_update that does nothing if the bit wasn't
modified...

Ben.

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From hozer at hozed.org Tue Jan 27 04:12:36 2004
From: hozer at hozed.org (Troy Benjegerdes)
Date: Mon, 26 Jan 2004 11:12:36 -0600
Subject: spinlocks
In-Reply-To: <3FFC1F7E.1020904@vnet.ibm.com>
References: <3FFC1F7E.1020904@vnet.ibm.com>
Message-ID: <20040126171236.GB25308@kalmia.hozed.org>

On Wed, Jan 07, 2004 at 09:02:22AM -0600, Dave Engebretsen wrote:
>
> olof at austin.ibm.com wrote:
> >On Wed, 7 Jan 2004, Dave Engebretsen wrote:
> >
> >
> >>Is a single binary for Apple & pSeries a goal? While it has some
> >>obvious advantages, there is likely to be a number of areas (the
> >>spinlock discussion being one) where the goals are quite different.
> > > > > >Are they really all that different? We need to keep the pSeries code > >running smoothly on a small-config SMP machine too (i.e. p615 and the > >like). > > > > > >-Olof > > Maybe not - just raising the debate. Nothing is all this will not keep > the code running smoothly on small config p615 machines. In many ways, > the more advanced virtualaztion results in machines which are much > smaller than anything else, so tuning for small is good for i/pSeries too. > > Everything being equal, I would just as soon see a common binary. But > items like HMT priorities are almost certainly going to exist in the Mac > binaries -- frankly, in the scheme of things a few extra noops in the > kernel are not going to be the performance bottleneck an end user sees. Even if no distribution actually ships a common binary, I vote we need to support it. The people MOST interested in a common binary are people doing QA testing. If someone shows up on linuxppc64-dev with a strange kernel bug, the first thing I would want to say is "Can you reproduce this bug on THIS binary kernel image?". I have both pmac and PReP machines, and even a 5-10% performance hit would be worth it to not have to spend the time to compile a different kernel for both machines. Supporting a common binary makes it easier to get new users up and running so they don't have to figure out which one of 15 different configurations they actually need on their hardware. Oh yeah, and don't forget installers!! *one* kernel binary for installers is SO much easier. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From johnrose at austin.ibm.com Tue Jan 27 12:11:49 2004 From: johnrose at austin.ibm.com (John Rose) Date: Mon, 26 Jan 2004 19:11:49 -0600 Subject: PCI Probe Question Message-ID: <1075165909.8973.22.camel@verve> How feasible is the generic portion of the patch below? Currently, the PCI probe code exports pci_scan_slot() for use by hotplug (or dlpar) modules. 
This turns out to be a problem, because pci_scan_slot() scans forward for 8 devfn values, starting at the one passed in. We need pci_scan_device() to be available for module use. Scanning the pci_bus corresponding to a phb reveals 3 slots within 8 devfn values. Consider a phb with 3 slots at devfn values 58, 5a, and 5e. A dlpar add of the first slot would call scan_slot() with 58, and scan_slot() adds redudant pci_dev's for the 5a and 5e to the devices list of the phb bus. For this case, we already know the devfn value we're interested in. The device_node structure for the new slot has it. If we call pci_scan_device() directly, we avoid the redundant devices problem. Thoughts? John diff -Nru a/drivers/pci/hotplug/rpadlpar_core.c b/drivers/pci/hotplug/rpadlpar_core.c --- a/drivers/pci/hotplug/rpadlpar_core.c Mon Jan 26 18:58:20 2004 +++ b/drivers/pci/hotplug/rpadlpar_core.c Mon Jan 26 18:58:20 2004 @@ -138,14 +138,36 @@ return 0; } +static int dlpar_pci_scan_device(struct pci_bus *bus, int devfn) +{ + struct pci_dev *dev; + + dev = pci_scan_device(bus, devfn); + + if (!dev) + return 1; + + /* Fix up broken headers */ + pci_fixup_device(PCI_FIXUP_HEADER, dev); + + /* + * Add the device to our list of discovered devices + * and the bus list for fixup functions, etc. 
+ */ + INIT_LIST_HEAD(&dev->global_list); + list_add_tail(&dev->bus_list, &bus->devices); + + return 0; +} + static struct pci_dev *dlpar_pci_add_bus(struct device_node *dn) { struct pci_controller *hose = dn->phb; struct pci_dev *dev = NULL; - - /* Scan phb bus for devices, adding new ones to bus->devices */ - if (!pci_scan_slot(hose->bus, dn->devfn)) { - printk(KERN_ERR "%s: found no devices on bus\n", __FUNCTION__); + + /* Scan phb bus for EADS device, adding new one to bus->devices */ + if (dlpar_pci_scan_device(hose->bus, dn->devfn)) { + printk(KERN_ERR "%s: found no device on bus\n", __FUNCTION__); return NULL; } diff -Nru a/drivers/pci/probe.c b/drivers/pci/probe.c --- a/drivers/pci/probe.c Mon Jan 26 18:58:20 2004 +++ b/drivers/pci/probe.c Mon Jan 26 18:58:20 2004 @@ -483,7 +483,7 @@ * Read the config data for a PCI device, sanity-check it * and fill in the dev structure... */ -static struct pci_dev * __devinit +struct pci_dev * __devinit pci_scan_device(struct pci_bus *bus, int devfn) { struct pci_dev *dev; @@ -685,4 +685,5 @@ EXPORT_SYMBOL(pci_do_scan_bus); EXPORT_SYMBOL(pci_scan_slot); EXPORT_SYMBOL(pci_scan_bridge); +EXPORT_SYMBOL(pci_scan_device); #endif diff -Nru a/drivers/pci/quirks.c b/drivers/pci/quirks.c --- a/drivers/pci/quirks.c Mon Jan 26 18:58:20 2004 +++ b/drivers/pci/quirks.c Mon Jan 26 18:58:20 2004 @@ -976,3 +976,7 @@ pci_do_fixups(dev, pass, pcibios_fixups); pci_do_fixups(dev, pass, pci_fixups); } + +#ifdef CONFIG_HOTPLUG +EXPORT_SYMBOL(pci_fixup_device); +#endif diff -Nru a/include/linux/pci.h b/include/linux/pci.h --- a/include/linux/pci.h Mon Jan 26 18:58:20 2004 +++ b/include/linux/pci.h Mon Jan 26 18:58:20 2004 @@ -580,6 +580,7 @@ return pci_scan_bus_parented(NULL, bus, ops, sysdata); } int pci_scan_slot(struct pci_bus *bus, int devfn); +struct pci_dev * pci_scan_device(struct pci_bus *bus, int devfn); void pci_bus_add_devices(struct pci_bus *bus); void pci_name_device(struct pci_dev *dev); char *pci_class_name(u32 class); ** Sent 
via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From gregkh at us.ibm.com Tue Jan 27 12:25:07 2004
From: gregkh at us.ibm.com (Greg KH)
Date: Mon, 26 Jan 2004 17:25:07 -0800
Subject: PCI Probe Question
In-Reply-To: <1075165909.8973.22.camel@verve>
References: <1075165909.8973.22.camel@verve>
Message-ID: <20040127012507.GA3295@us.ibm.com>

On Mon, Jan 26, 2004 at 07:11:49PM -0600, John Rose wrote:
> How feasible is the generic portion of the patch below?
>
> Currently, the PCI probe code exports pci_scan_slot() for use by hotplug
> (or dlpar) modules. This turns out to be a problem, because
> pci_scan_slot() scans forward for 8 devfn values, starting at the one
> passed in. We need pci_scan_device() to be available for module use.

Wait, what's wrong with scanning for all 8 devfn values? That's what we
have to do, right?

> Scanning the pci_bus corresponding to a phb reveals 3 slots within 8
> devfn values.

What is a "phb"?

> Consider a phb with 3 slots at devfn values 58, 5a, and
> 5e. A dlpar add of the first slot would call scan_slot() with 58, and
> scan_slot() adds redundant pci_dev's for the 5a and 5e to the devices
> list of the phb bus.

So you are trying to individually add the different pci devices within
the same device? Hm, that's not very nice. So you are not really
talking about a physical slot here, right? You can divide up a physical
device among partitions?

Also, how could this be a multifunction device?

> For this case, we already know the devfn value we're interested in. The
> device_node structure for the new slot has it. If we call
> pci_scan_device() directly, we avoid the redundant devices problem.
>
> Thoughts?

If you convince me you really have to do this, why duplicate the
existing pci code in your driver? How about just creating a
pci_scan_single_device() function for the pci core that does that logic
(and make pci_scan_slot() call it.) That would make your code simpler,
and we would not have to export pci_fixup_device().
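A toy, user-space sketch of the refactoring suggested above, not the real PCI core: a single-device helper does the per-devfn work, and the slot-wide scan simply calls it for 8 consecutive devfn values. struct toy_bus and every toy_* function are hypothetical stand-ins invented for illustration, and the probe is reduced to a table lookup; the real scan also has to deal with things like header fixups.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_DEVS 16

/* Hypothetical stand-in for struct pci_bus (illustration only). */
struct toy_bus {
	const unsigned char *present;      /* devfns that answer a probe */
	size_t npresent;
	unsigned char added[TOY_MAX_DEVS]; /* devfns added to the bus list */
	size_t nadded;
};

/* Return 1 if something responds at devfn, 0 otherwise. */
static int toy_responds(const struct toy_bus *bus, unsigned int devfn)
{
	size_t i;

	for (i = 0; i < bus->npresent; i++)
		if (bus->present[i] == devfn)
			return 1;
	return 0;
}

/*
 * Probe exactly one devfn and add it to the bus list if it responds.
 * Returns the number of devices added (0 or 1).
 */
static int toy_scan_single_device(struct toy_bus *bus, unsigned int devfn)
{
	assert(bus != NULL);
	if (!toy_responds(bus, devfn) || bus->nadded >= TOY_MAX_DEVS)
		return 0;
	bus->added[bus->nadded++] = (unsigned char)devfn;
	return 1;
}

/* Slot-wide scan: walk 8 consecutive devfns through the same helper. */
static int toy_scan_slot(struct toy_bus *bus, unsigned int devfn)
{
	int func, nr = 0;

	for (func = 0; func < 8; func++, devfn++)
		nr += toy_scan_single_device(bus, devfn);
	return nr;
}
```

With devices present at devfn 0x58, 0x5a and 0x5e, toy_scan_single_device(bus, 0x58) adds one entry while toy_scan_slot(bus, 0x58) adds all three, which is the difference the DLPAR case in this thread cares about.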
thanks, greg k-h ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olof at austin.ibm.com Tue Jan 27 14:20:00 2004 From: olof at austin.ibm.com (olof at austin.ibm.com) Date: Mon, 26 Jan 2004 21:20:00 -0600 (CST) Subject: [2.5][PATCH] Remove warnings from rtas-proc.c Message-ID: I'm tired of seeing the warnings go by every time I compile. It might be valid C99 code, but my GCC still warns about it: arch/ppc64/kernel/rtas-proc.c: In function `ppc_rtas_poweron_read': arch/ppc64/kernel/rtas-proc.c:294: warning: ISO C90 forbids mixed declarations and code ..and so on.. Patch below. -Olof Olof Johansson Office: 4E002/905 Linux on Power Development IBM Systems Group Email: olof at austin.ibm.com Phone: 512-838-9858 All opinions are my own and not those of IBM ===== arch/ppc64/kernel/rtas-proc.c 1.12 vs edited ===== --- 1.12/arch/ppc64/kernel/rtas-proc.c Mon Jan 19 20:07:06 2004 +++ edited/arch/ppc64/kernel/rtas-proc.c Mon Jan 26 21:17:40 2004 @@ -285,13 +285,13 @@ size_t count, loff_t *ppos) { char stkbuf[40]; /* its small, its on stack */ - int n; + int n, sn; if (power_on_time == 0) n = snprintf(stkbuf, 40, "Power on time not set\n"); else n = snprintf(stkbuf, 40, "%lu\n", power_on_time); - int sn = strlen (stkbuf) +1; + sn = strlen (stkbuf) +1; if (*ppos >= sn) return 0; if (n > sn - *ppos) @@ -331,18 +331,19 @@ static ssize_t ppc_rtas_progress_read(struct file * file, char * buf, size_t count, loff_t *ppos) { - int n = 0; + int sn, n = 0; + char *tmpbuf; if (progress_led == NULL) return 0; - char * tmpbuf = kmalloc (MAX_LINELENGTH, GFP_KERNEL); + tmpbuf = kmalloc (MAX_LINELENGTH, GFP_KERNEL); if (!tmpbuf) { printk(KERN_ERR "error: kmalloc failed\n"); return -ENOMEM; } n = sprintf (tmpbuf, "%s\n", progress_led); - int sn = strlen (tmpbuf) +1; + sn = strlen (tmpbuf) +1; if (*ppos >= sn) { kfree (tmpbuf); return 0; @@ -398,15 +399,14 @@ { unsigned int year, mon, day, hour, min, sec; unsigned long *ret = kmalloc(4*8, GFP_KERNEL); - int n, 
error; + int n, sn, error; + char stkbuf[40]; /* its small, its on stack */ error = rtas_call(rtas_token("get-time-of-day"), 0, 8, ret); year = ret[0]; mon = ret[1]; day = ret[2]; hour = ret[3]; min = ret[4]; sec = ret[5]; - char stkbuf[40]; /* its small, its on stack */ - if (error != 0){ printk(KERN_WARNING "error: reading the clock returned: %s\n", ppc_rtas_process_error(error)); @@ -416,7 +416,7 @@ } kfree(ret); - int sn = strlen (stkbuf) +1; + sn = strlen (stkbuf) +1; if (*ppos >= sn) return 0; if (n > sn - *ppos) @@ -860,11 +860,12 @@ static ssize_t ppc_rtas_tone_freq_read(struct file * file, char * buf, size_t count, loff_t *ppos) { - int n; + int n, sn; char stkbuf[40]; /* its small, its on stack */ + n = snprintf(stkbuf, 40, "%lu\n", rtas_tone_frequency); - int sn = strlen (stkbuf) +1; + sn = strlen (stkbuf) +1; if (*ppos >= sn) return 0; if (n > sn - *ppos) @@ -913,11 +914,12 @@ static ssize_t ppc_rtas_tone_volume_read(struct file * file, char * buf, size_t count, loff_t *ppos) { - int n; + int n, sn; char stkbuf[40]; /* its small, its on stack */ + n = snprintf(stkbuf, 40, "%lu\n", rtas_tone_volume); - int sn = strlen (stkbuf) +1; + sn = strlen (stkbuf) +1; if (*ppos >= sn) return 0; if (n > sn - *ppos) ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Tue Jan 27 14:51:46 2004 From: anton at samba.org (Anton Blanchard) Date: Tue, 27 Jan 2004 14:51:46 +1100 Subject: [2.5][PATCH] Remove warnings from rtas-proc.c In-Reply-To: References: Message-ID: <20040127035146.GF11236@krispykreme> > I'm tired of seeing the warnings go by every time I compile. It might be > valid C99 code, but my GCC still warns about it: Looks good. The recent fixes to the lparcfg code are causing warnings too Anton ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From olof at austin.ibm.com Tue Jan 27 16:17:04 2004 From: olof at austin.ibm.com (olof at austin.ibm.com) Date: Mon, 26 Jan 2004 23:17:04 -0600 (CST) Subject: [2.5][PATCH] Remove warnings from rtas-proc.c In-Reply-To: <20040127035146.GF11236@krispykreme> Message-ID: On Tue, 27 Jan 2004, Anton Blanchard wrote: > > I'm tired of seeing the warnings go by every time I compile. It might be > > valid C99 code, but my GCC still warns about it: > > Looks good. > > The recent fixes to the lparcfg code are causing warnings too I guess I never enable it. :) Fixes to both have been pushed to ameslab. -Olof Olof Johansson Office: 4E002/905 Linux on Power Development IBM Systems Group Email: olof at austin.ibm.com Phone: 512-838-9858 All opinions are my own and not those of IBM ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From paulus at samba.org Tue Jan 27 19:38:31 2004 From: paulus at samba.org (Paul Mackerras) Date: Tue, 27 Jan 2004 19:38:31 +1100 Subject: PCI Probe Question In-Reply-To: <20040127012507.GA3295@us.ibm.com> References: <1075165909.8973.22.camel@verve> <20040127012507.GA3295@us.ibm.com> Message-ID: <16406.9095.277972.439160@cargo.ozlabs.ibm.com> Greg KH writes: > On Mon, Jan 26, 2004 at 07:11:49PM -0600, John Rose wrote: > > How feasible is the generic portion of the patch below? > > > > Currently, the PCI probe code exports pci_scan_slot() for use by hotplug > > (or dlpar) modules. This turns out to be a problem, because > > pci_scan_slot() scans forward for 8 devfn values, starting at the one > > passed in. We need pci_scan_device() to be available for module use. > > Wait, what's wrong with scanning for all 8 devfn values? That's what we > have to do, right? Hmmm, I don't think you want to scan all 8 values if it is a single-function device, since it may well respond at all 8 function addresses, or even do weird things if you access func != 0. 
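To make the devfn arithmetic in this thread concrete, here is a small sketch that decodes the values 58, 5a and 5e, assuming they are hexadecimal and use the standard PCI devfn layout (device number in bits 7..3, function in bits 2..0). The TOY_* macros and toy_* helpers are names made up for this illustration; they only mirror that layout.

```c
#include <assert.h>
#include <stdio.h>

/* Standard PCI devfn layout: bits 7..3 device (slot), bits 2..0 function. */
#define TOY_SLOT(devfn) (((devfn) >> 3) & 0x1f)
#define TOY_FUNC(devfn) ((devfn) & 0x07)

/*
 * Return 1 if 'other' falls inside the 8-devfn window that a forward
 * scan starting at 'start' would walk, 0 otherwise.
 */
static int toy_in_scan_window(unsigned int start, unsigned int other)
{
	return other >= start && other < start + 8;
}

/* Print how one devfn decodes into device and function numbers. */
static void toy_print_devfn(unsigned int devfn)
{
	assert(devfn <= 0xff);
	printf("devfn 0x%02x -> device %u, function %u\n",
	       devfn, TOY_SLOT(devfn), TOY_FUNC(devfn));
}
```

Under that assumption 0x58, 0x5a and 0x5e all decode to device 11 (functions 0, 2 and 6), so a forward scan of 8 devfn values starting at 0x58 walks over all three.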
> > Scanning the pci_bus corresponding to a phb reveals 3 slots within 8 > > devfn values. > > What is a "phb"? PCI host bridge. You know what that is, right? :) > > Consider a phb with 3 slots at devfn values 58, 5a, and > > 5e. A dlpar add of the first slot would call scan_slot() with 58, and > > scan_slot() adds redudant pci_dev's for the 5a and 5e to the devices > > list of the phb bus. > > So you are trying to individually add the different pci devices within > the same device? Hm, that's not very nice. So you are not really > talking about a physical slot here, right? You can divide up a physical > device among partitions? Well, the thing is that we have PCI-PCI bridges which are multifunction devices, that is, you get 4 bridges in the one PCI device. Each bridge typically has only one slot behind it, since the bridge is where the hotplugging is done. And the hypervisor likes to be able to hand out each slot to a different partition. So your partition might get to access function 2 of the PCI-PCI bridge device but not function 0. Hope that makes it a little clearer. Regards, Paul. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Tue Jan 27 21:46:40 2004 From: anton at samba.org (Anton Blanchard) Date: Tue, 27 Jan 2004 21:46:40 +1100 Subject: ioremap problems Message-ID: <20040127104640.GK11236@krispykreme> Hi, Just got this might sleep warning when using recent 2.6 ameslab. Looks like __add_new_im_area is doing a kmalloc inside a spinlock. Someone in the mood to fix it up? Im a little side tracked at the moment :) Anton [c000000000051d88] .__might_sleep+0xec/0x130 [c000000000088704] .kmem_cache_alloc+0xb8/0xc0 [c0000000000426e0] .__add_new_im_area+0x60/0xb0 [c0000000000427c8] .__im_get_area+0x98/0xc0 [c0000000000428c4] .im_get_free_area+0xd4/0xfc [c000000000041270] .__ioremap+0x94/0xcc [c000000000041194] .ioremap+0x1c/0x64 [c00000000022b458] .e1000_probe+0x138/0x5dc ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From anton at samba.org Tue Jan 27 21:55:33 2004 From: anton at samba.org (Anton Blanchard) Date: Tue, 27 Jan 2004 21:55:33 +1100 Subject: ioremap problems In-Reply-To: <20040127104640.GK11236@krispykreme> References: <20040127104640.GK11236@krispykreme> Message-ID: <20040127105533.GL11236@krispykreme> > Just got this might sleep warning when using recent 2.6 ameslab. Looks > like __add_new_im_area is doing a kmalloc inside a spinlock. Someone > in the mood to fix it up? Im a little side tracked at the moment :) Heres the spinlock sleep debugging patch. Give it a spin on a 2.6 kernel (dont forget to enable the config option), I'll bet there are still 5 bugs in our drivers left to find. Anton -- Sleep with spinlock debugging. gr16c-anton/arch/ppc64/Kconfig | 7 +++++++ gr16c-anton/include/asm-ppc64/hardirq.h | 2 +- gr16c-anton/include/linux/preempt.h | 17 +++++++++++------ gr16c-anton/kernel/sched.c | 4 ++-- 4 files changed, 21 insertions(+), 9 deletions(-) diff -puN arch/ppc64/Kconfig~spinlock_sleep arch/ppc64/Kconfig --- gr16c/arch/ppc64/Kconfig~spinlock_sleep 2004-01-23 15:05:56.365930995 +1100 +++ gr16c-anton/arch/ppc64/Kconfig 2004-01-23 15:05:56.384934743 +1100 @@ -410,6 +410,13 @@ config DEBUG_PAGEALLOC This results in a large slowdown, but helps to find certain types of memory corruptions. +config DEBUG_SPINLOCK_SLEEP + bool "Sleep-inside-spinlock checking" + depends on DEBUG_KERNEL + help + If you say Y here, various routines which may sleep will become very + noisy if they are called with a spinlock held. 
+ endmenu source "security/Kconfig" diff -puN include/asm-ppc64/hardirq.h~spinlock_sleep include/asm-ppc64/hardirq.h --- gr16c/include/asm-ppc64/hardirq.h~spinlock_sleep 2004-01-23 15:05:56.368931587 +1100 +++ gr16c-anton/include/asm-ppc64/hardirq.h 2004-01-23 15:05:56.385934940 +1100 @@ -80,7 +80,7 @@ typedef struct { #define irq_enter() (preempt_count() += HARDIRQ_OFFSET) -#ifdef CONFIG_PREEMPT +#if defined(CONFIG_PREEMPT) || defined(CONFIG_DEBUG_SPINLOCK_SLEEP) # define in_atomic() ((preempt_count() & ~PREEMPT_ACTIVE) != kernel_locked()) # define IRQ_EXIT_OFFSET (HARDIRQ_OFFSET-1) #else diff -puN include/linux/preempt.h~spinlock_sleep include/linux/preempt.h --- gr16c/include/linux/preempt.h~spinlock_sleep 2004-01-23 15:05:56.372932376 +1100 +++ gr16c-anton/include/linux/preempt.h 2004-01-23 15:05:56.387935334 +1100 @@ -24,6 +24,17 @@ do { \ extern void preempt_schedule(void); +#define preempt_check_resched() \ +do { \ + if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) \ + preempt_schedule(); \ +} while (0) +#else +#define preempt_check_resched() do { } while (0) +#endif + +#if defined(CONFIG_PREEMPT) || defined(CONFIG_DEBUG_SPINLOCK_SLEEP) + #define preempt_disable() \ do { \ inc_preempt_count(); \ @@ -36,12 +47,6 @@ do { \ dec_preempt_count(); \ } while (0) -#define preempt_check_resched() \ -do { \ - if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) \ - preempt_schedule(); \ -} while (0) - #define preempt_enable() \ do { \ preempt_enable_no_resched(); \ diff -puN kernel/fork.c~spinlock_sleep kernel/fork.c diff -puN kernel/sched.c~spinlock_sleep kernel/sched.c --- gr16c/kernel/sched.c~spinlock_sleep 2004-01-23 15:05:56.379933756 +1100 +++ gr16c-anton/kernel/sched.c 2004-01-23 15:05:56.393936518 +1100 @@ -728,7 +728,7 @@ void sched_fork(task_t *p) INIT_LIST_HEAD(&p->run_list); p->array = NULL; spin_lock_init(&p->switch_lock); -#ifdef CONFIG_PREEMPT +#if defined(CONFIG_PREEMPT) || defined(CONFIG_DEBUG_SPINLOCK_SLEEP) /* * During context-switch we hold precisely 
one spinlock, which * schedule_tail drops. (in the common case it's this_rq()->lock, @@ -2659,7 +2659,7 @@ void __init init_idle(task_t *idle, int local_irq_restore(flags); /* Set the preempt count _outside_ the spinlocks! */ -#ifdef CONFIG_PREEMPT +#if defined(CONFIG_PREEMPT) || defined(CONFIG_DEBUG_SPINLOCK_SLEEP) idle->thread_info->preempt_count = (idle->lock_depth >= 0); #else idle->thread_info->preempt_count = 0; _ ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Tue Jan 27 23:03:06 2004 From: anton at samba.org (Anton Blanchard) Date: Tue, 27 Jan 2004 23:03:06 +1100 Subject: [2.5][PATCH] Remove warnings from rtas-proc.c In-Reply-To: References: <20040127035146.GF11236@krispykreme> Message-ID: <20040127120306.GQ11236@krispykreme> > I guess I never enable it. :) Fixes to both have been pushed to ameslab. Thanks Olof! Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From johnrose at austin.ibm.com Wed Jan 28 03:52:38 2004 From: johnrose at austin.ibm.com (John Rose) Date: Tue, 27 Jan 2004 10:52:38 -0600 Subject: PCI Probe Question In-Reply-To: References: Message-ID: <1075222358.10285.6.camel@verve> > I am currently testing pcnet32 PHP on a PPC64 plateform. One of the > adapters I > am using is a multifunction device (it has 4 ports). From a PCI > Hotplug point of > view, because we phsically insert/remove the adapter, it makes more > sense to > have all functions(4 ports) get configured/unconfigured at the same > time. > > Linda > > The proposed change is in the RPA DLPAR driver. The PCI Hotplug driver calls pci_scan_slot() when "hotplug adding" an adapter, and this behavior isn't affected by the proposed change. John ** Sent via the linuxppc64-dev mail list. 
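See http://lists.linuxppc.org/

From johnrose at austin.ibm.com Wed Jan 28 03:52:38 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Tue, 27 Jan 2004 10:52:38 -0600
Subject: PCI Probe Question
In-Reply-To:
References:
Message-ID: <1075222358.10285.6.camel@verve>

> I am currently testing pcnet32 PHP on a PPC64 platform. One of the
> adapters I am using is a multifunction device (it has 4 ports). From a
> PCI Hotplug point of view, because we physically insert/remove the
> adapter, it makes more sense to have all functions (4 ports) get
> configured/unconfigured at the same time.
>
> Linda
>

The proposed change is in the RPA DLPAR driver. The PCI Hotplug driver calls
pci_scan_slot() when "hotplug adding" an adapter, and this behavior isn't
affected by the proposed change.

John

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From johnrose at austin.ibm.com Wed Jan 28 05:06:54 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Tue, 27 Jan 2004 12:06:54 -0600
Subject: [PATCH] ioremap kmallocs inside spinlocks - review request
Message-ID: <1075226812.10285.17.camel@verve>

Anton, Team-

Good catch, there are two such uses of kmalloc in that file. Thoughts on the
patch below? The vm_struct's that might be required are preallocated inside
im_get_area() and im_get_free_area(), so that the kmalloc's are moved outside
the spinlock.

The nastiness of this patch is that zero, one, or both of the alloc'ed
vm_structs are freed in different functions than they are allocated in. The
other option is to report back to im_get_area() and im_get_free_area() about
which of the new regions were used, and free where necessary there. I think I
like my way better than that.

OR the easiest option is a one-line change within the spinlock from
GFP_KERNEL to GFP_ATOMIC. :) But I understand that this is discouraged.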
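For comparison, a user-space sketch of the "other option" described above: the caller preallocates before taking the lock, the locked section only links nodes, and the caller frees whatever was not consumed after unlocking. Everything here is hypothetical (struct region, sleeping_alloc, toy_lock, get_region); the lock is a plain flag standing in for imlist_lock, malloc stands in for kmalloc(GFP_KERNEL), and this is not the imalloc code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct vm_struct. */
struct region {
	unsigned long addr;
	unsigned long size;
	struct region *next;
};

static struct region *region_list; /* kept sorted by addr */
static int list_locked;            /* toy stand-in for imlist_lock */

static void toy_lock(void)   { assert(!list_locked); list_locked = 1; }
static void toy_unlock(void) { assert(list_locked);  list_locked = 0; }

/* Models an allocation that may sleep: never legal with the lock held. */
static void *sleeping_alloc(size_t n)
{
	assert(!list_locked);
	return malloc(n);
}

/*
 * Return the region for [addr, addr + size), creating it if needed.
 * Overlap handling is left out to keep the sketch short.
 */
static struct region *get_region(unsigned long addr, unsigned long size)
{
	struct region *spare, *found, **p;

	/* Allocate before locking, while sleeping is still allowed. */
	spare = sleeping_alloc(sizeof(*spare));
	if (!spare)
		return NULL;

	toy_lock();
	for (p = &region_list; *p != NULL; p = &(*p)->next)
		if ((*p)->addr >= addr)
			break;
	if (*p != NULL && (*p)->addr == addr && (*p)->size == size) {
		found = *p;        /* already present, spare stays unused */
	} else {
		spare->addr = addr;
		spare->size = size;
		spare->next = *p;
		*p = spare;
		found = spare;
		spare = NULL;      /* ownership moved to the list */
	}
	toy_unlock();

	/* Free the unused node in the function that allocated it. */
	free(spare);           /* free(NULL) is a no-op */
	return found;
}
```

The point of this shape is that allocation and release of the spare node stay in one function, at the cost of sometimes allocating a node that is thrown away immediately.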
Comments welcome, thanks- John diff -Nru a/arch/ppc64/mm/imalloc.c b/arch/ppc64/mm/imalloc.c --- a/arch/ppc64/mm/imalloc.c Tue Jan 27 11:55:38 2004 +++ b/arch/ppc64/mm/imalloc.c Tue Jan 27 11:55:38 2004 @@ -102,112 +102,111 @@ } static struct vm_struct * split_im_region(unsigned long v_addr, - unsigned long size, struct vm_struct *parent) + unsigned long size, struct vm_struct *parent, + struct vm_struct *new_vm1, struct vm_struct *new_vm2) { - struct vm_struct *vm1 = NULL; - struct vm_struct *vm2 = NULL; - struct vm_struct *new_vm = NULL; + struct vm_struct *requested_vm = NULL; - vm1 = (struct vm_struct *) kmalloc(sizeof(*vm1), GFP_KERNEL); - if (vm1 == NULL) { - printk(KERN_ERR "%s() out of memory\n", __FUNCTION__); - return NULL; - } - if (v_addr == (unsigned long) parent->addr) { - /* Use existing parent vm_struct to represent child, allocate - * new one for the remainder of parent range + /* Use existing parent vm_struct to represent child, use + * first new one for the remainder of parent range, free + * second new one */ - vm1->size = parent->size - size; - vm1->addr = (void *) (v_addr + size); - vm1->next = parent->next; + new_vm1->size = parent->size - size; + new_vm1->addr = (void *) (v_addr + size); + new_vm1->next = parent->next; parent->size = size; - parent->next = vm1; - new_vm = parent; + parent->next = new_vm1; + requested_vm = parent; + + kfree(new_vm2); } else if (v_addr + size == (unsigned long) parent->addr + parent->size) { - /* Allocate new vm_struct to represent child, use existing - * parent one for remainder of parent range + /* Use first new vm_struct to represent child, use existing + * parent one for remainder of parent range, free second new + * vm_struct */ - vm1->size = size; - vm1->addr = (void *) v_addr; - vm1->next = parent->next; - new_vm = vm1; + new_vm1->size = size; + new_vm1->addr = (void *) v_addr; + new_vm1->next = parent->next; + requested_vm = new_vm1; parent->size -= size; - parent->next = vm1; + parent->next = new_vm1; 
+ + kfree(new_vm2); } else { - /* Allocate two new vm_structs for the new child and + /* Use two new vm_structs for the new child and * uppermost remainder, and use existing parent one for the * lower remainder of parent range */ - vm2 = (struct vm_struct *) kmalloc(sizeof(*vm2), GFP_KERNEL); - if (vm2 == NULL) { - printk(KERN_ERR "%s() out of memory\n", __FUNCTION__); - kfree(vm1); - return NULL; - } - - vm1->size = size; - vm1->addr = (void *) v_addr; - vm1->next = vm2; - new_vm = vm1; - - vm2->size = ((unsigned long) parent->addr + parent->size) - - (v_addr + size); - vm2->addr = (void *) v_addr + size; - vm2->next = parent->next; + + new_vm1->size = size; + new_vm1->addr = (void *) v_addr; + new_vm1->next = new_vm2; + requested_vm = new_vm1; + + new_vm2->size = ((unsigned long) parent->addr + parent->size) + - (v_addr + size); + new_vm2->addr = (void *) v_addr + size; + new_vm2->next = parent->next; parent->size = v_addr - (unsigned long) parent->addr; - parent->next = vm1; + parent->next = new_vm1; } - return new_vm; + return requested_vm; } static struct vm_struct * __add_new_im_area(unsigned long req_addr, - unsigned long size) + unsigned long size, + struct vm_struct *new_vm) { - struct vm_struct **p, *tmp, *area; + struct vm_struct **p, *tmp; for (p = &imlist; (tmp = *p) ; p = &tmp->next) { if (req_addr + size <= (unsigned long)tmp->addr) break; } - area = (struct vm_struct *) kmalloc(sizeof(*area), GFP_KERNEL); - if (!area) - return NULL; - area->flags = 0; - area->addr = (void *)req_addr; - area->size = size; - area->next = *p; - *p = area; + new_vm->flags = 0; + new_vm->addr = (void *)req_addr; + new_vm->size = size; + new_vm->next = *p; + *p = new_vm; - return area; + return new_vm; } static struct vm_struct * __im_get_area(unsigned long req_addr, unsigned long size, - int criteria) + int criteria, + struct vm_struct *new_vm1, + struct vm_struct *new_vm2) { struct vm_struct *tmp; int status; status = im_region_status(req_addr, size, &tmp); if 
((criteria & status) == 0) { + kfree(new_vm1); + kfree(new_vm2); return NULL; } switch (status) { case IM_REGION_UNUSED: - tmp = __add_new_im_area(req_addr, size); + tmp = __add_new_im_area(req_addr, size, new_vm1); + kfree(new_vm2); break; case IM_REGION_SUBSET: - tmp = split_im_region(req_addr, size, tmp); + tmp = split_im_region(req_addr, size, tmp, new_vm1, + new_vm2); break; case IM_REGION_EXISTS: + kfree(new_vm1); + kfree(new_vm2); break; default: printk(KERN_ERR "%s() unexpected imalloc region status\n", @@ -221,8 +220,27 @@ struct vm_struct * im_get_free_area(unsigned long size) { struct vm_struct *area; + struct vm_struct *new_vm1; + struct vm_struct *new_vm2; unsigned long addr; + /* Allocate new vm_structs here to avoid kmalloc inside spinlock. + * If not used, these will be freed in __im_get_area() or + * split_im_region(). + */ + new_vm1 = (struct vm_struct *) kmalloc(sizeof(*new_vm1), GFP_KERNEL); + if (!new_vm1) { + printk(KERN_ERR "%s() out of memory\n", __FUNCTION__); + return NULL; + } + + new_vm2 = (struct vm_struct *) kmalloc(sizeof(*new_vm2), GFP_KERNEL); + if (!new_vm2) { + printk(KERN_ERR "%s() out of memory\n", __FUNCTION__); + kfree(new_vm1); + return NULL; + } + write_lock(&imlist_lock); if (get_free_im_addr(size, &addr)) { printk(KERN_ERR "%s() cannot obtain addr for size 0x%lx\n", @@ -231,7 +249,7 @@ goto next_im_done; } - area = __im_get_area(addr, size, IM_REGION_UNUSED); + area = __im_get_area(addr, size, IM_REGION_UNUSED, new_vm1, new_vm2); if (area == NULL) { printk(KERN_ERR "%s() cannot obtain area for addr 0x%lx size 0x%lx\n", @@ -246,9 +264,28 @@ int criteria) { struct vm_struct *area; + struct vm_struct *new_vm1; + struct vm_struct *new_vm2; + + /* Allocate new vm_structs here to avoid kmalloc inside spinlock. + * If not used, these will be freed in __im_get_area() or + * split_im_region(). 
+ */ + new_vm1 = (struct vm_struct *) kmalloc(sizeof(*new_vm1), GFP_KERNEL); + if (!new_vm1) { + printk(KERN_ERR "%s() out of memory\n", __FUNCTION__); + return NULL; + } + + new_vm2 = (struct vm_struct *) kmalloc(sizeof(*new_vm2), GFP_KERNEL); + if (!new_vm2) { + printk(KERN_ERR "%s() out of memory\n", __FUNCTION__); + kfree(new_vm1); + return NULL; + } write_lock(&imlist_lock); - area = __im_get_area(v_addr, size, criteria); + area = __im_get_area(v_addr, size, criteria, new_vm1, new_vm2); write_unlock(&imlist_lock); return area; } ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From nathanl at austin.ibm.com Wed Jan 28 05:39:02 2004 From: nathanl at austin.ibm.com (Nathan Lynch) Date: Tue, 27 Jan 2004 12:39:02 -0600 Subject: [PATCH] ioremap kmallocs inside spinlocks - review request In-Reply-To: <1075226812.10285.17.camel@verve> References: <1075226812.10285.17.camel@verve> Message-ID: <4016B046.2060103@austin.ibm.com> I have a naive suggestion for fixing the "sleep while atomic" warnings... what about changing imlist_lock to a semaphore? If imlist is traversed in interrupt context, then this is obviously not feasible, but I thought I would check. Nathan ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From jdewand at redhat.com Wed Jan 28 07:36:59 2004 From: jdewand at redhat.com (Julie DeWandel) Date: Tue, 27 Jan 2004 15:36:59 -0500 Subject: virt_to_bus for ppc64 Message-ID: <4016CBEB.8020106@redhat.com> Hi, I'm playing around with 2.6 code and one of the modules I am building is looking for a virt_to_bus function. 
I was wondering if the following patch would be appropriate to use: --- linux-2.6.1/include/asm-ppc64/io.h.orig 2004-01-27 15:02:15.000000000 -0500 +++ linux-2.6.1/include/asm-ppc64/io.h 2004-01-27 15:04:20.000000000 -0500 @@ -143,6 +143,7 @@ static inline unsigned long virt_to_phys #endif return __pa((unsigned long)address); } +#define virt_to_bus virt_to_phys static inline void * phys_to_virt(unsigned long address) { @@ -151,6 +152,7 @@ static inline void * phys_to_virt(unsign #endif return (void *) __va(address); } +#define bus_to_virt phys_to_virt /* * Change "struct page" to physical address. -- Julie DeWandel Red Hat, Inc. Tel (978) 692-3113 x23251 ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From gregkh at us.ibm.com Wed Jan 28 07:39:39 2004 From: gregkh at us.ibm.com (Greg KH) Date: Tue, 27 Jan 2004 12:39:39 -0800 Subject: PCI Probe Question In-Reply-To: <16406.9095.277972.439160@cargo.ozlabs.ibm.com> References: <1075165909.8973.22.camel@verve> <20040127012507.GA3295@us.ibm.com> <16406.9095.277972.439160@cargo.ozlabs.ibm.com> Message-ID: <20040127203939.GA6102@us.ibm.com> On Tue, Jan 27, 2004 at 07:38:31PM +1100, Paul Mackerras wrote: > Greg KH writes: > > > On Mon, Jan 26, 2004 at 07:11:49PM -0600, John Rose wrote: > > > How feasible is the generic portion of the patch below? > > > > > > Currently, the PCI probe code exports pci_scan_slot() for use by hotplug > > > (or dlpar) modules. This turns out to be a problem, because > > > pci_scan_slot() scans forward for 8 devfn values, starting at the one > > > passed in. We need pci_scan_device() to be available for module use. > > > > Wait, what's wrong with scanning for all 8 devfn values? That's what we > > have to do, right? > > Hmmm, I don't think you want to scan all 8 values if it is a > single-function device, since it may well respond at all 8 function > addresses, or even do weird things if you access func != 0. 
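A minimal user-space sketch of the rule being discussed, assuming the usual PCI convention that bit 7 of the header type register of function 0 marks a multifunction device. The struct fake_slot table and count_functions() are hypothetical names made up for illustration; this is not the kernel code, only a model of why a scan should stop after function 0 on a single-function device even if that device answers at all eight function addresses:

```c
#include <stdint.h>

#define HDR_TYPE_MULTIFUNCTION 0x80 /* bit 7 of the PCI header type register */

/* Hypothetical stand-in for one PCI device number (8 possible functions). */
struct fake_slot {
	uint8_t present[8];     /* non-zero if config reads at this function succeed */
	uint8_t header_type[8]; /* header type byte returned by each function */
};

/* Count the functions a cautious scan would register for this slot. */
static int count_functions(const struct fake_slot *slot)
{
	int func, nr = 0;

	for (func = 0; func < 8; func++) {
		if (!slot->present[func])
			continue; /* nothing answers here, try the next function */
		nr++;
		/* Single-function device: do not probe functions 1-7 at all. */
		if (func == 0 &&
		    !(slot->header_type[0] & HDR_TYPE_MULTIFUNCTION))
			break;
	}
	return nr;
}
```

With a single-function device that aliases at all eight addresses, count_functions() reports one function instead of eight.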
Hence the following code in pci_scan_slot(): /* * If this is a single function device, * don't scan past the first function. */ if (!dev->multifunction) break; > Well, the thing is that we have PCI-PCI bridges which are > multifunction devices, that is, you get 4 bridges in the one PCI > device. Each bridge typically has only one slot behind it, since the > bridge is where the hotplugging is done. And the hypervisor likes to > be able to hand out each slot to a different partition. So your > partition might get to access function 2 of the PCI-PCI bridge device > but not function 0. Ick, ick, ick. How do you all handle this in your boot up code then? > Hope that makes it a little clearer. Yes, thanks. I think my proposed change will still work for you, right? thanks, greg k-h ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From johnrose at austin.ibm.com Wed Jan 28 07:52:50 2004 From: johnrose at austin.ibm.com (John Rose) Date: Tue, 27 Jan 2004 14:52:50 -0600 Subject: PCI Probe Question In-Reply-To: <20040127012507.GA3295@us.ibm.com> References: <1075165909.8973.22.camel@verve> <20040127012507.GA3295@us.ibm.com> Message-ID: <1075236769.10285.30.camel@verve> Hi Greg- Thanks for responding. I think Paul answered your other questions better than I could have :) > How do you all handle this in your boot up code then? Not sure I understand what you're asking, but I believe that the hypervisor only allows access (during scan of the PHB device) to the devfn's corresponding to the slots that your partition currently owns. > If you convince me you really have to do this, why duplicate the > existing pci code in your driver? How about just creating a > pci_scan_single_device() function for the pci core that does that logic > (and make pci_scan_slot() call it.) Agreed. I initially wanted to minimize my changes to generic code, but this looks better. How's this patch? 
Thanks- John diff -Nru a/drivers/pci/hotplug/rpadlpar_core.c b/drivers/pci/hotplug/rpadlpar_core.c --- a/drivers/pci/hotplug/rpadlpar_core.c Tue Jan 27 14:41:07 2004 +++ b/drivers/pci/hotplug/rpadlpar_core.c Tue Jan 27 14:41:07 2004 @@ -143,9 +143,9 @@ struct pci_controller *hose = dn->phb; struct pci_dev *dev = NULL; - /* Scan phb bus for devices, adding new ones to bus->devices */ - if (!pci_scan_slot(hose->bus, dn->devfn)) { - printk(KERN_ERR "%s: found no devices on bus\n", __FUNCTION__); + /* Scan phb bus for EADS device, adding new one to bus->devices */ + if (!pci_scan_single_device(hose->bus, dn->devfn)) { + printk(KERN_ERR "%s: found no device on bus\n", __FUNCTION__); return NULL; } diff -Nru a/drivers/pci/probe.c b/drivers/pci/probe.c --- a/drivers/pci/probe.c Tue Jan 27 14:41:07 2004 +++ b/drivers/pci/probe.c Tue Jan 27 14:41:07 2004 @@ -535,6 +535,30 @@ return dev; } +struct pci_dev * __devinit +pci_scan_single_device(struct pci_bus *bus, int devfn) +{ + struct pci_dev *dev; + + dev = pci_scan_device(bus, devfn); + pci_scan_msi_device(dev); + + if (!dev) + return NULL; + + /* Fix up broken headers */ + pci_fixup_device(PCI_FIXUP_HEADER, dev); + + /* + * Add the device to our list of discovered devices + * and the bus list for fixup functions, etc. + */ + INIT_LIST_HEAD(&dev->global_list); + list_add_tail(&dev->bus_list, &bus->devices); + + return dev; +} + /** * pci_scan_slot - scan a PCI slot on a bus for devices. * @bus: PCI bus to scan @@ -551,39 +575,17 @@ for (func = 0; func < 8; func++, devfn++) { struct pci_dev *dev; - dev = pci_scan_device(bus, devfn); - pci_scan_msi_device(dev); -#if 0 - if (func == 0) { - if (!dev) + dev = pci_scan_single_device(bus, devfn); + if (dev) { + nr++; + + /* + * If this is a single function device, + * don't scan past the first function. 
+ */ + if (!dev->multifunction) break; - } else { - if (!dev) - continue; - dev->multifunction = 1; } -#else - if (!dev) - continue; -#endif - - /* Fix up broken headers */ - pci_fixup_device(PCI_FIXUP_HEADER, dev); - - /* - * Add the device to our list of discovered devices - * and the bus list for fixup functions, etc. - */ - INIT_LIST_HEAD(&dev->global_list); - list_add_tail(&dev->bus_list, &bus->devices); - nr++; - - /* - * If this is a single function device, - * don't scan past the first function. - */ - if (!dev->multifunction) - break; } return nr; } @@ -686,4 +688,5 @@ EXPORT_SYMBOL(pci_do_scan_bus); EXPORT_SYMBOL(pci_scan_slot); EXPORT_SYMBOL(pci_scan_bridge); +EXPORT_SYMBOL(pci_scan_single_device); #endif diff -Nru a/include/linux/pci.h b/include/linux/pci.h --- a/include/linux/pci.h Tue Jan 27 14:41:07 2004 +++ b/include/linux/pci.h Tue Jan 27 14:41:07 2004 @@ -585,6 +585,7 @@ return pci_scan_bus_parented(NULL, bus, ops, sysdata); } int pci_scan_slot(struct pci_bus *bus, int devfn); +struct pci_dev * pci_scan_single_device(struct pci_bus *bus, int devfn); void pci_bus_add_devices(struct pci_bus *bus); void pci_name_device(struct pci_dev *dev); char *pci_class_name(u32 class); ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From gregkh at us.ibm.com Wed Jan 28 08:05:29 2004 From: gregkh at us.ibm.com (Greg KH) Date: Tue, 27 Jan 2004 13:05:29 -0800 Subject: PCI Probe Question In-Reply-To: <1075236769.10285.30.camel@verve> References: <1075165909.8973.22.camel@verve> <20040127012507.GA3295@us.ibm.com> <1075236769.10285.30.camel@verve> Message-ID: <20040127210529.GB6332@us.ibm.com> On Tue, Jan 27, 2004 at 02:52:50PM -0600, John Rose wrote: > Hi Greg- > > Thanks for responding. I think Paul answered your other questions > better than I could have :) > > > How do you all handle this in your boot up code then? 
> > Not sure I understand what you're asking, but I believe that the > hypervisor only allows access (during scan of the PHB device) to the > devfn's corresponding to the slots that your partition currently owns. But doesn't the ppc64 start up code eventually call pci_scan_slot()? I haven't looked through your startup code to determine this or not. > > If you convince me you really have to do this, why duplicate the > > existing pci code in your driver? How about just creating a > > pci_scan_single_device() function for the pci core that does that logic > > (and make pci_scan_slot() call it.) > > Agreed. I initially wanted to minimize my changes to generic code, but > this looks better. How's this patch? How about a patch against a clean kernel so I can see it better? :) thanks, greg k-h ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From benh at kernel.crashing.org Wed Jan 28 08:50:34 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Wed, 28 Jan 2004 08:50:34 +1100 Subject: virt_to_bus for ppc64 In-Reply-To: <4016CBEB.8020106@redhat.com> References: <4016CBEB.8020106@redhat.com> Message-ID: <1075240233.5658.216.camel@gaston> On Wed, 2004-01-28 at 07:36, Julie DeWandel wrote: > Hi, > > I'm playing around with 2.6 code and one of the modules I am building is > looking for a virt_to_bus function. I was wondering if the following > patch would be appropriate to use: virt_to_bus is deprecated, but you probably know that :) The problem is that on ppc64, you _MUST_ go through the PCI DMA API, because this is the only way you'll have the iommu set up properly for your device to be able to access main memory. virt_to_bus() cannot work. Time for fixing this module... Ben. ** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/ From johnrose at austin.ibm.com Wed Jan 28 09:07:01 2004 From: johnrose at austin.ibm.com (John Rose) Date: Tue, 27 Jan 2004 16:07:01 -0600 Subject: PCI Probe Question In-Reply-To: <20040127210529.GB6332@us.ibm.com> References: <1075165909.8973.22.camel@verve> <20040127012507.GA3295@us.ibm.com> <1075236769.10285.30.camel@verve> <20040127210529.GB6332@us.ibm.com> Message-ID: <1075241220.10285.42.camel@verve> > But doesn't the ppc64 start up code eventually call pci_scan_slot()? I > haven't looked through your startup code to determine this or not. Yes, but at boot time this doesn't result in redundant entries. Consider a PHB with slots at functions 0, 3, and 7, and consider that 0 isn't owned by you at boot. So 3 and 7 get picked up at boot by pci_scan_slot(). Now when you dlpar add 0, you don't want 3 and 7 to be added again. That's my fundamental problem. Will have that bkbits patch right out :) Thanks- John ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From johnrose at austin.ibm.com Wed Jan 28 10:28:22 2004 From: johnrose at austin.ibm.com (John Rose) Date: Tue, 27 Jan 2004 17:28:22 -0600 Subject: PCI Probe Question In-Reply-To: <20040127210529.GB6332@us.ibm.com> References: <1075165909.8973.22.camel@verve> <20040127012507.GA3295@us.ibm.com> <1075236769.10285.30.camel@verve> <20040127210529.GB6332@us.ibm.com> Message-ID: <1075246102.10285.54.camel@verve> Hi- > How about a patch against a clean kernel so I can see it better? :) Here's a patch against bkbits. Jake's fix for probing multifunc devices with no function 0 will affect code in the same neighborhood, so the resolution of that will affect this.
Non-Greg readers, find info on Jake's problem/fix at: http://www.ussg.iu.edu/hypermail/linux/kernel/0401.3/0766.html Thanks- John diff -Nru a/drivers/pci/probe.c b/drivers/pci/probe.c --- a/drivers/pci/probe.c Tue Jan 27 17:22:31 2004 +++ b/drivers/pci/probe.c Tue Jan 27 17:22:31 2004 @@ -535,6 +535,30 @@ return dev; } +struct pci_dev * __devinit +pci_scan_single_device(struct pci_bus *bus, int devfn) +{ + struct pci_dev *dev; + + dev = pci_scan_device(bus, devfn); + pci_scan_msi_device(dev); + + if (!dev) + return NULL; + + /* Fix up broken headers */ + pci_fixup_device(PCI_FIXUP_HEADER, dev); + + /* + * Add the device to our list of discovered devices + * and the bus list for fixup functions, etc. + */ + INIT_LIST_HEAD(&dev->global_list); + list_add_tail(&dev->bus_list, &bus->devices); + + return dev; +} + /** * pci_scan_slot - scan a PCI slot on a bus for devices. * @bus: PCI bus to scan @@ -551,34 +575,23 @@ for (func = 0; func < 8; func++, devfn++) { struct pci_dev *dev; - dev = pci_scan_device(bus, devfn); - pci_scan_msi_device(dev); - if (func == 0) { - if (!dev) - break; + dev = pci_scan_single_device(bus, devfn); + if (dev) { + nr++; + + /* + * If this is a single function device, + * don't scan past the first function. + */ + if (!dev->multifunction) + if (func > 0) + dev->multifunction = 1; + else + break; } else { - if (!dev) - continue; - dev->multifunction = 1; + if (func == 0) + break; } - - /* Fix up broken headers */ - pci_fixup_device(PCI_FIXUP_HEADER, dev); - - /* - * Add the device to our list of discovered devices - * and the bus list for fixup functions, etc. - */ - INIT_LIST_HEAD(&dev->global_list); - list_add_tail(&dev->bus_list, &bus->devices); - nr++; - - /* - * If this is a single function device, - * don't scan past the first function. 
- */ - if (!dev->multifunction) - break; } return nr; } @@ -681,4 +694,5 @@ EXPORT_SYMBOL(pci_do_scan_bus); EXPORT_SYMBOL(pci_scan_slot); EXPORT_SYMBOL(pci_scan_bridge); +EXPORT_SYMBOL(pci_scan_single_device); #endif diff -Nru a/include/linux/pci.h b/include/linux/pci.h --- a/include/linux/pci.h Tue Jan 27 17:22:31 2004 +++ b/include/linux/pci.h Tue Jan 27 17:22:31 2004 @@ -585,6 +585,7 @@ return pci_scan_bus_parented(NULL, bus, ops, sysdata); } int pci_scan_slot(struct pci_bus *bus, int devfn); +struct pci_dev * pci_scan_single_device(struct pci_bus *bus, int devfn); void pci_bus_add_devices(struct pci_bus *bus); void pci_name_device(struct pci_dev *dev); char *pci_class_name(u32 class); ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From paulus at samba.org Wed Jan 28 10:52:21 2004 From: paulus at samba.org (Paul Mackerras) Date: Wed, 28 Jan 2004 10:52:21 +1100 Subject: PCI Probe Question In-Reply-To: References: Message-ID: <16406.63925.224418.719207@cargo.ozlabs.ibm.com> Linda Xie writes: > I am currently testing pcnet32 PHP on a PPC64 platform. One of the > adapters I am using is a multifunction device (it has 4 ports). From > a PCI Hotplug point of view, because we physically insert/remove the > adapter, it makes more sense to have all functions (4 ports) get > configured/unconfigured at the same time. If that is the same 4-port pcnet32 card that I have used in the past, it has a PCI-PCI bridge with four slots behind it, each with a separate pcnet32 chip connected to it. So that card doesn't have 4 functions, in the PCI sense of the term. Regards, Paul. ** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/ From olof at austin.ibm.com Wed Jan 28 11:31:33 2004 From: olof at austin.ibm.com (Olof Johansson) Date: Tue, 27 Jan 2004 18:31:33 -0600 Subject: [PATCH] [2.6] set_preferred_console breaks iSeries Message-ID: <401702E5.1020507@austin.ibm.com> I've pushed the attached patch to ameslab, it breaks building on iSeries. There's other breakage right now too: vio.c is not compiled for iSeries, but ibmvscsi needs it for vio_enable_interrupts and vio_disable_interrupts. If vio.c is added, it won't build since it seems to rely on OF stuff (get_property, plpar_hcall_norets, find_devices, etc). -Olof -- Olof Johansson Office: 4F005/905 pSeries Linux Development IBM Systems Group Email: olof at austin.ibm.com Phone: 512-838-9858 All opinions are my own and not those of IBM ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Wed Jan 28 11:46:19 2004 From: anton at samba.org (Anton Blanchard) Date: Wed, 28 Jan 2004 11:46:19 +1100 Subject: [PATCH] [2.6] set_preferred_console breaks iSeries In-Reply-To: <401702E5.1020507@austin.ibm.com> References: <401702E5.1020507@austin.ibm.com> Message-ID: <20040128004618.GW11236@krispykreme> Hi, > I've pushed the attached patch to ameslab, it breaks building on iSeries. Sorry about that, I fixed it and sent an update to akpm but forgot to do the push :) Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olof at austin.ibm.com Wed Jan 28 11:57:19 2004 From: olof at austin.ibm.com (Olof Johansson) Date: Tue, 27 Jan 2004 18:57:19 -0600 Subject: [PATCH] [2.6] set_preferred_console breaks iSeries In-Reply-To: <20040128004618.GW11236@krispykreme> References: <401702E5.1020507@austin.ibm.com> <20040128004618.GW11236@krispykreme> Message-ID: <401708EF.2050202@austin.ibm.com> Anton Blanchard wrote: >>I've pushed the attached patch to ameslab, it breaks building on iSeries. 
> > Sorry about that, I fixed it and sent an update to akpm but forgot to do the > push :) ...and I forgot to attach the patch. Here it is. -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: set_preferred_console-buildfix Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040127/66e04d09/attachment.txt From paulus at samba.org Wed Jan 28 21:03:04 2004 From: paulus at samba.org (Paul Mackerras) Date: Wed, 28 Jan 2004 21:03:04 +1100 Subject: PCI Probe Question In-Reply-To: References: Message-ID: <16407.35032.364597.792833@cargo.ozlabs.ibm.com> Linda Xie writes: > >it has a PCI-PCI bridge with four slots behind it > > Does that mean you can plug an adapter into one of the four slots? No, I meant "slots" in the abstract sense as in device IDs on a PCI bus segment. The card I have used has four AMD PCNet chips connected to a PCI-PCI bridge chip, and has four network connectors on the end. There are no physical PCI slot connectors on the board. > The card I am using doesn't have any slots, it has 4 network > connectors. Sounds like the same thing. The point is that even though it is a single card, it looks like a PCI-PCI bridge with four devices behind it, each using a different PCI device ID. Paul. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From rod at thalescomputers.fr Thu Jan 29 05:33:31 2004 From: rod at thalescomputers.fr (Régis Odeyé) Date: Wed, 28 Jan 2004 19:33:31 +0100 Subject: High Precision Timer on JS20 Message-ID: <4018007B.7080500@thalescomputers.fr> Hi, I'm now trying to use the High Precision Timer (HPET) of the 8111 Hypertransport IO hub. This function is part of the LPC sub-device which is supposed to be probed as a pci sub-device (device B, function 0 in the IO Hub doc.).
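For reference, a sketch of how a device number and a function number are packed into the 8-bit devfn value used when addressing PCI config space. The macros below are user-space copies modelled on PCI_DEVFN, PCI_SLOT and PCI_FUNC from include/linux/pci.h; the DEVFN_* names are hypothetical, and the slot number in the usage note is only a placeholder, not the real location of the LPC bridge:

```c
/* User-space copies of the kernel's devfn helpers, for illustration only. */

/* Pack a 5-bit device (slot) number and a 3-bit function number. */
#define DEVFN(slot, func)  ((((slot) & 0x1f) << 3) | ((func) & 0x07))

/* Recover the device (slot) number from a devfn value. */
#define DEVFN_SLOT(devfn)  (((devfn) >> 3) & 0x1f)

/* Recover the function number from a devfn value. */
#define DEVFN_FUNC(devfn)  ((devfn) & 0x07)
```

For a placeholder device number of 5, DEVFN(5, 0) is 40 and DEVFN(5, 1) is 41, so two functions of one device differ only in the low three bits.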
But it seems that the only sub-device listed in the device tree and detected through the pci scan is the IDE controller which is the function 1 of this part of the IO hub. Even trying to reach the function 0 (and the other available functions) through pci_config_read call is failing (returning 0xffffffff). Did I miss something? Is there a control of the pci function number in the related RTAS call? Regards. -- Régis Odeyé Thales Computers, a Thales company. www.thalescomputers.com E-mail: rod at thalescomputers.fr Tel: +33 (0)4 98 16 34 86 - Fax: +33 (0)4 98 16 34 01 ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olh at suse.de Thu Jan 29 07:21:37 2004 From: olh at suse.de (Olaf Hering) Date: Wed, 28 Jan 2004 21:21:37 +0100 Subject: [PATCH] kdb dmesg output broken after log_buf changes Message-ID: <20040128202137.GA24101@suse.de> This change should fix the dmesg command. diff -p -purN linuxppc64-2.5/kernel/printk.c linuxppc64-2.5.kdb/kernel/printk.c --- linuxppc64-2.5/kernel/printk.c 2004-01-20 03:08:38.000000000 +0100 +++ linuxppc64-2.5.kdb/kernel/printk.c 2004-01-28 21:19:28.000000000 +0100 @@ -366,7 +366,7 @@ out: void kdb_syslog_data(char *syslog_data[4]) { syslog_data[0] = log_buf; - syslog_data[1] = log_buf + sizeof(log_buf); + syslog_data[1] = log_buf + __LOG_BUF_LEN; syslog_data[2] = log_buf + log_end - (logged_chars < __LOG_BUF_LEN ? logged_chars : __LOG_BUF_LEN); syslog_data[3] = log_buf + log_end; } -- USB is for mice, FireWire is for men! sUse lINUX ag, nÜRNBERG ** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/ From segher at kernel.crashing.org Thu Jan 29 20:32:04 2004 From: segher at kernel.crashing.org (Segher Boessenkool) Date: Thu, 29 Jan 2004 10:32:04 +0100 Subject: High Precision Timer on JS20 In-Reply-To: <4018007B.7080500@thalescomputers.fr> References: <4018007B.7080500@thalescomputers.fr> Message-ID: > I'm now trying to use the High Precision Timer (HPET) of the 8111 > Hypertransport IO hub. This function is part of the LPC sub-device which > is supposed to be probed as a pci sub-device (device B, function 0 in > the IO Hub doc.). You'll have to enable it first (DevB:0xa0). > But it seems that the only sub-device listed in the device tree and > detected through the pci scan is the IDE controller which is the > function 1 of this part of the IO hub. > > Even trying to reach the function 0 (and the other available functions) > through pci_config_read call is failing (returning 0xffffffff). Function 0 can't be disabled; you must be doing something wrong. Maybe you're trying for the wrong address? Bus 0 dev 4 func 0 is what you're after... Segher ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From greg at kroah.com Fri Jan 30 11:55:46 2004 From: greg at kroah.com (Greg KH) Date: Thu, 29 Jan 2004 16:55:46 -0800 Subject: PCI Probe Question In-Reply-To: <1075246102.10285.54.camel@verve> References: <1075165909.8973.22.camel@verve> <20040127012507.GA3295@us.ibm.com> <1075236769.10285.30.camel@verve> <20040127210529.GB6332@us.ibm.com> <1075246102.10285.54.camel@verve> Message-ID: <20040130005546.GA11252@kroah.com> On Tue, Jan 27, 2004 at 05:28:22PM -0600, John Rose wrote: > Hi- > > > How about a patch against a clean kernel so I can see it better? :) > > Here's a patch against bkbits. Jake's fix for probing multifunc devices > with no function 0 will affect code in the same neighborhood, so the > resolution of that will affect this.
Non-Greg readers, find info on > Jake's problem/fix at: > http://www.ussg.iu.edu/hypermail/linux/kernel/0401.3/0766.html Thanks, I've applied this to my PCI trees and will send it on to Linus. greg k-h ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olh at suse.de Fri Jan 30 21:33:26 2004 From: olh at suse.de (Olaf Hering) Date: Fri, 30 Jan 2004 11:33:26 +0100 Subject: [PATCH] kdb dmesg output broken after log_buf changes In-Reply-To: <20040130001259.GD28168@krispykreme> References: <20040128202137.GA24101@suse.de> <20040129221304.GB28168@krispykreme> <20040129221931.GA6071@suse.de> <20040130001259.GD28168@krispykreme> Message-ID: <20040130103326.GA6811@suse.de> On Fri, Jan 30, Anton Blanchard wrote: > > > Like this? Or update to 4.3? Oh yes. :) > > Cool. Considering my lack of kdb knowledge I'll leave it to someone else > (Will?) to merge that patch. I sent it in (too large for the list), like the dis-asm.h adding linuxppc64-dev at lists.linuxppc.org back to Cc: > I cleaned up the xmon hooks to make them somewhat generic. The hope here > is that we can switch debuggers at run time. It will also allow us to > step through problems (at the moment if you get a DSI, hit xmon and fix > the instruction up, you still end up panicking. I've wanted a number of > times to be able to fix it and continue on). > > Patch is attached below. > > My question is, how does kdb work? I can't see any hooks in arch/ppc64 to > handle dodgy page faults etc. Does kdb work without requiring any arch > hooks? > > Anton > > ===== arch/ppc64/Kconfig 1.39 vs edited ===== > --- 1.39/arch/ppc64/Kconfig Sat Jan 24 13:39:39 2004 > +++ edited/arch/ppc64/Kconfig Tue Jan 27 00:07:51 2004 > @@ -378,16 +378,15 @@ > keys are documented in . Don't say Y > unless you really know what this hack does.
> > -choice > - optional > - depends on DEBUG_KERNEL > - prompt "Kernel Debugger" > - > config XMON > bool "XMON" > help > Include in-kernel hooks for the xmon kernel monitor/debugger. > Unless you are intending to debug the kernel, say N here. > + > +config XMON_DEFAULT > + bool "Enable xmon by default" > + depends on XMON > > config KDB > bool "KDB" > @@ -395,17 +394,10 @@ > Include in-kernel hooks for the kdb kernel monitor/debugger. > Unless you are intending to debug the kernel, say N here. > > -endchoice > - > - > -config XMON_DEFAULT > - bool "Enable xmon by default" > - depends on XMON > - > +# XXX FIXME > config KDB_OFF > bool "Turn KDB off as default." > depends on KDB > - > help > KDB will remain built into the kernel, but will be turned off. > "cat 1 > /proc/sys/kernel/kdb" to turn it on. > ===== arch/ppc64/kernel/open_pic.c 1.25 vs edited ===== > --- 1.25/arch/ppc64/kernel/open_pic.c Tue Jan 20 14:56:57 2004 > +++ edited/arch/ppc64/kernel/open_pic.c Tue Jan 27 00:07:52 2004 > @@ -591,9 +591,9 @@ > request_irq(openpic_vec_ipi+1, openpic_ipi_action, SA_INTERRUPT, > "IPI1 (reschedule)", 0); > request_irq(openpic_vec_ipi+2, openpic_ipi_action, SA_INTERRUPT, > - "IPI2 (invalidate tlb)", 0); > + "IPI2 (unused)", 0); > request_irq(openpic_vec_ipi+3, openpic_ipi_action, SA_INTERRUPT, > - "IPI3 (xmon break)", 0); > + "IPI3 (debugger break)", 0); > > for ( i = 0; i < OPENPIC_NUM_IPI ; i++ ) > openpic_enable_ipi(openpic_vec_ipi+i); > ===== arch/ppc64/kernel/ppc_ksyms.c 1.35 vs edited ===== > --- 1.35/arch/ppc64/kernel/ppc_ksyms.c Sun Jan 25 00:55:49 2004 > +++ edited/arch/ppc64/kernel/ppc_ksyms.c Tue Jan 27 00:35:17 2004 > @@ -203,24 +203,21 @@ > EXPORT_SYMBOL(timer_interrupt); > EXPORT_SYMBOL(get_wchan); > EXPORT_SYMBOL(console_drivers); > -#ifdef CONFIG_XMON > -EXPORT_SYMBOL(xmon); > -#endif > > #ifdef CONFIG_DEBUG_KERNEL > -extern void (*debugger)(struct pt_regs *regs); > -extern int (*debugger_bpt)(struct pt_regs *regs); > -extern int (*debugger_sstep)(struct 
pt_regs *regs); > -extern int (*debugger_iabr_match)(struct pt_regs *regs); > -extern int (*debugger_dabr_match)(struct pt_regs *regs); > -extern void (*debugger_fault_handler)(struct pt_regs *regs); > +extern int (*__debugger)(struct pt_regs *regs); > +extern int (*__debugger_bpt)(struct pt_regs *regs); > +extern int (*__debugger_sstep)(struct pt_regs *regs); > +extern int (*__debugger_iabr_match)(struct pt_regs *regs); > +extern int (*__debugger_dabr_match)(struct pt_regs *regs); > +extern int (*__debugger_fault_handler)(struct pt_regs *regs); > > -EXPORT_SYMBOL(debugger); > -EXPORT_SYMBOL(debugger_bpt); > -EXPORT_SYMBOL(debugger_sstep); > -EXPORT_SYMBOL(debugger_iabr_match); > -EXPORT_SYMBOL(debugger_dabr_match); > -EXPORT_SYMBOL(debugger_fault_handler); > +EXPORT_SYMBOL(__debugger); > +EXPORT_SYMBOL(__debugger_bpt); > +EXPORT_SYMBOL(__debugger_sstep); > +EXPORT_SYMBOL(__debugger_iabr_match); > +EXPORT_SYMBOL(__debugger_dabr_match); > +EXPORT_SYMBOL(__debugger_fault_handler); > #endif > > EXPORT_SYMBOL(tb_ticks_per_usec); > ===== arch/ppc64/kernel/setup.c 1.49 vs edited ===== > --- 1.49/arch/ppc64/kernel/setup.c Wed Jan 28 11:25:51 2004 > +++ edited/arch/ppc64/kernel/setup.c Wed Jan 28 16:31:19 2004 > @@ -68,10 +68,6 @@ > unsigned long decr_overclock_set = 0; > unsigned long decr_overclock_proc0_set = 0; > > -#ifdef CONFIG_XMON > -extern void xmon_map_scc(void); > -#endif > - > char saved_command_line[256]; > unsigned char aux_device_present; > > @@ -148,15 +144,14 @@ > unsigned long r6, unsigned long r7) > { > #ifdef CONFIG_PPC_PSERIES > - unsigned int ret, i; > + unsigned int ret, i; > #endif > > #ifdef CONFIG_XMON_DEFAULT > - debugger = xmon; > - debugger_bpt = xmon_bpt; > - debugger_sstep = xmon_sstep; > - debugger_iabr_match = xmon_iabr_match; > - debugger_dabr_match = xmon_dabr_match; > + xmon_init(); > +#endif > +#ifdef CONFIG_KDB_DEFAULT > + /* XXX FIXME */ > #endif > > #ifdef CONFIG_PPC_ISERIES > @@ -555,12 +550,14 @@ > calibrate_delay = 
ppc64_calibrate_delay; > > ppc64_boot_msg(0x12, "Setup Arch"); > + > #ifdef CONFIG_XMON > - xmon_map_scc(); > - if (strstr(cmd_line, "xmon")) > - xmon(0); > + if (strstr(cmd_line, "xmon")) { > + /* ensure xmon is enabled */ > + xmon_init(); > + debugger(0); > + } > #endif /* CONFIG_XMON */ > - > > /* > * Set cache line size based on type of cpu as a default. > ===== arch/ppc64/kernel/smp.c 1.56 vs edited ===== > --- 1.56/arch/ppc64/kernel/smp.c Thu Jan 22 14:44:04 2004 > +++ edited/arch/ppc64/kernel/smp.c Tue Jan 27 00:30:12 2004 > @@ -414,7 +414,7 @@ > > void smp_message_recv(int msg, struct pt_regs *regs) > { > - switch( msg ) { > + switch(msg) { > case PPC_MSG_CALL_FUNCTION: > #ifdef CONFIG_KDB > kdb_smp_regs[smp_processor_id()]=regs; > @@ -430,11 +430,11 @@ > /* spare */ > break; > #endif > -#ifdef CONFIG_XMON > - case PPC_MSG_XMON_BREAK: > - xmon(regs); > +#ifdef CONFIG_DEBUG_KERNEL > + case PPC_MSG_DEBUGGER_BREAK: > + debugger(regs); > break; > -#endif /* CONFIG_XMON */ > +#endif > default: > printk("SMP %d: smp_message_recv(): unknown msg %d\n", > smp_processor_id(), msg); > @@ -447,12 +447,12 @@ > smp_message_pass(cpu, PPC_MSG_RESCHEDULE, 0, 0); > } > > -#ifdef CONFIG_XMON > -void smp_send_xmon_break(int cpu) > +#ifdef CONFIG_DEBUG_KERNEL > +void smp_send_debugger_break(int cpu) > { > - smp_message_pass(cpu, PPC_MSG_XMON_BREAK, 0, 0); > + smp_message_pass(cpu, PPC_MSG_DEBUGGER_BREAK, 0, 0); > } > -#endif /* CONFIG_XMON */ > +#endif > > static void stop_this_cpu(void *dummy) > { > @@ -530,10 +530,7 @@ > printk("smp_call_function on cpu %d: other cpus not " > "responding (%d)\n", smp_processor_id(), > atomic_read(&data.started)); > -#ifdef CONFIG_DEBUG_KERNEL > - if (debugger) > - debugger(0); > -#endif > + debugger(0); > goto out; > } > } > @@ -548,10 +545,7 @@ > smp_processor_id(), > atomic_read(&data.finished), > atomic_read(&data.started)); > -#ifdef CONFIG_DEBUG_KERNEL > - if (debugger) > - debugger(0); > -#endif > + debugger(0); > goto out; > } > } > 
===== arch/ppc64/kernel/traps.c 1.25 vs edited ===== > --- 1.25/arch/ppc64/kernel/traps.c Tue Jan 20 13:07:09 2004 > +++ edited/arch/ppc64/kernel/traps.c Tue Jan 27 00:33:00 2004 > @@ -46,12 +46,12 @@ > #endif > > #ifdef CONFIG_DEBUG_KERNEL > -void (*debugger)(struct pt_regs *regs); > -int (*debugger_bpt)(struct pt_regs *regs); > -int (*debugger_sstep)(struct pt_regs *regs); > -int (*debugger_iabr_match)(struct pt_regs *regs); > -int (*debugger_dabr_match)(struct pt_regs *regs); > -void (*debugger_fault_handler)(struct pt_regs *regs); > +int (*__debugger)(struct pt_regs *regs); > +int (*__debugger_bpt)(struct pt_regs *regs); > +int (*__debugger_sstep)(struct pt_regs *regs); > +int (*__debugger_iabr_match)(struct pt_regs *regs); > +int (*__debugger_dabr_match)(struct pt_regs *regs); > +int (*__debugger_fault_handler)(struct pt_regs *regs); > #endif > > /* > @@ -88,11 +88,8 @@ > _exception(int signr, siginfo_t *info, struct pt_regs *regs) > { > if (!user_mode(regs)) { > -#ifdef CONFIG_DEBUG_KERNEL > - if (debugger) > - debugger(regs); > -#endif > - die("Exception in kernel mode\n", regs, signr); > + if (!debugger(regs)) > + die("Exception in kernel mode\n", regs, signr); > } > > force_sig_info(signr, info, current); > @@ -146,11 +143,7 @@ > } > #endif > > -#ifdef CONFIG_DEBUG_KERNEL > - if (debugger) > - debugger(regs); > - else > -#endif > + if (!debugger(regs)) > panic("System Reset"); > > /* Must die if the interrupt is not recoverable */ > @@ -228,14 +221,11 @@ > } > #endif > > -#ifdef CONFIG_DEBUG_KERNEL > - if (debugger_fault_handler) { > - debugger_fault_handler(regs); > + if (debugger_fault_handler(regs)) > return; > - } > - if (debugger) > - debugger(regs); > -#endif > + if (debugger(regs)) > + return; > + > console_verbose(); > spin_lock_irq(&die_lock); > bust_spinlocks(1); > @@ -267,10 +257,8 @@ > { > siginfo_t info; > > -#ifdef CONFIG_DEBUG_KERNEL > - if (debugger_iabr_match && debugger_iabr_match(regs)) > + if (debugger_iabr_match(regs)) > return; > 
-#endif > info.si_signo = SIGTRAP; > info.si_errno = 0; > info.si_code = TRAP_BRKPT; > @@ -387,10 +375,9 @@ > } else if (regs->msr & 0x20000) { > /* trap exception */ > > -#ifdef CONFIG_DEBUG_KERNEL > - if (debugger_bpt && debugger_bpt(regs)) > + if (debugger_bpt(regs)) > return; > -#endif > + > if (check_bug_trap(regs)) { > regs->nip += 4; > return; > @@ -434,10 +421,9 @@ > > regs->msr &= ~MSR_SE; /* Turn off 'trace' bit */ > > -#ifdef CONFIG_DEBUG_KERNEL > - if (debugger_sstep && debugger_sstep(regs)) > + if (debugger_sstep(regs)) > return; > -#endif > + > info.si_signo = SIGTRAP; > info.si_errno = 0; > info.si_code = TRAP_TRACE; > ===== arch/ppc64/kernel/xics.c 1.39 vs edited ===== > --- 1.39/arch/ppc64/kernel/xics.c Thu Jan 22 14:44:05 2004 > +++ edited/arch/ppc64/kernel/xics.c Tue Jan 27 00:20:34 2004 > @@ -375,11 +375,11 @@ > smp_message_recv(PPC_MSG_MIGRATE_TASK, regs); > } > #endif > -#ifdef CONFIG_XMON > - if (test_and_clear_bit(PPC_MSG_XMON_BREAK, > +#ifdef CONFIG_DEBUG_KERNEL > + if (test_and_clear_bit(PPC_MSG_DEBUGGER_BREAK, > &xics_ipi_message[cpu].value)) { > mb(); > - smp_message_recv(PPC_MSG_XMON_BREAK, regs); > + smp_message_recv(PPC_MSG_DEBUGGER_BREAK, regs); > } > #endif > } > ===== arch/ppc64/mm/fault.c 1.14 vs edited ===== > --- 1.14/arch/ppc64/mm/fault.c Fri Sep 12 21:01:40 2003 > +++ edited/arch/ppc64/mm/fault.c Tue Jan 27 00:31:32 2004 > @@ -37,12 +37,6 @@ > #include > #include > > -#include > - > -#ifdef CONFIG_DEBUG_KERNEL > -int debugger_kernel_faults = 1; > -#endif > - > void bad_page_fault(struct pt_regs *, unsigned long, int); > > /* > @@ -60,13 +54,10 @@ > unsigned long code = SEGV_MAPERR; > unsigned long is_write = error_code & 0x02000000; > > -#ifdef CONFIG_DEBUG_KERNEL > - if (debugger_fault_handler && (regs->trap == 0x300 || > - regs->trap == 0x380)) { > - debugger_fault_handler(regs); > - return; > + if (regs->trap == 0x300 || regs->trap == 0x380) { > + if (debugger_fault_handler(regs)) > + return; > } > -#endif > > /* On a 
kernel SLB miss we can only check for a valid exception entry */ > if (!user_mode(regs) && (regs->trap == 0x380)) { > @@ -74,13 +65,10 @@ > return; > } > > -#ifdef CONFIG_DEBUG_KERNEL > if (error_code & 0x00400000) { > - /* DABR match */ > if (debugger_dabr_match(regs)) > return; > } > -#endif > > if (in_atomic() || mm == NULL) { > bad_page_fault(regs, address, SIGSEGV); > @@ -149,11 +137,6 @@ > info.si_errno = 0; > info.si_code = code; > info.si_addr = (void *) address; > -#ifdef CONFIG_XMON > - ifppcdebug(PPCDBG_SIGNALXMON) > - PPCDBG_ENTER_DEBUGGER_REGS(regs); > -#endif > - > force_sig_info(SIGSEGV, &info, current); > return; > } > @@ -207,9 +190,7 @@ > } > > /* kernel has accessed a bad area */ > -#ifdef CONFIG_DEBUG_KERNEL > - if (debugger_kernel_faults) > - debugger(regs); > -#endif > + if (debugger(regs)) > + return; > die("Kernel access of bad area", regs, sig); > } > ===== arch/ppc64/mm/init.c 1.55 vs edited ===== > --- 1.55/arch/ppc64/mm/init.c Tue Jan 20 13:07:09 2004 > +++ edited/arch/ppc64/mm/init.c Tue Jan 27 00:31:39 2004 > @@ -59,6 +59,7 @@ > #include > #include > #include > +#include > > #ifdef CONFIG_PPC_ISERIES > #include > @@ -694,7 +695,7 @@ > if (start == 0) { > udbg_printf("do_init_bootmem: failed to allocate a bitmap.\n"); > udbg_printf("\tbootmap_pages = 0x%lx.\n", bootmap_pages); > - PPCDBG_ENTER_DEBUGGER(); > + debugger(0); > } > > boot_mapsize = init_bootmem(start >> PAGE_SHIFT, total_pages); > ===== arch/ppc64/xmon/start.c 1.8 vs edited ===== > --- 1.8/arch/ppc64/xmon/start.c Thu Jan 22 17:11:59 2004 > +++ edited/arch/ppc64/xmon/start.c Tue Jan 27 00:38:33 2004 > @@ -11,6 +11,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -31,30 +32,30 @@ > } > > #ifdef CONFIG_MAGIC_SYSRQ > +extern int xmon(struct pt_regs *pt_regs); > > static void sysrq_handle_xmon(int key, struct pt_regs *pt_regs, > struct tty_struct *tty) > { > + /* ensure xmon is enabled */ > + xmon_init(); > xmon(pt_regs); > } > > static 
struct sysrq_key_op sysrq_xmon_op = > { > .handler = sysrq_handle_xmon, > - .help_msg = "xmon", > + .help_msg = "Xmon", > .action_msg = "Entering xmon\n", > }; > > -#endif /* CONFIG_MAGIC_SYSRQ */ > - > -void > -xmon_map_scc(void) > +static int __init setup_xmon_sysrq(void) > { > -#ifdef CONFIG_MAGIC_SYSRQ > - /* This maybe isn't the best place to register sysrq 'x' */ > __sysrq_put_key_op('x', &sysrq_xmon_op); > -#endif /* CONFIG_MAGIC_SYSRQ */ > + return 0; > } > +__initcall(setup_xmon_sysrq); > +#endif /* CONFIG_MAGIC_SYSRQ */ > > int > xmon_write(void *handle, void *ptr, int nb) > @@ -79,11 +80,6 @@ > void *xmon_stdin; > void *xmon_stdout; > void *xmon_stderr; > - > -void > -xmon_init(void) > -{ > -} > > int > xmon_putc(int c, void *f) > ===== arch/ppc64/xmon/xmon.c 1.32 vs edited ===== > --- 1.32/arch/ppc64/xmon/xmon.c Tue Jan 20 13:07:09 2004 > +++ edited/arch/ppc64/xmon/xmon.c Tue Jan 27 00:37:43 2004 > @@ -72,11 +72,10 @@ > static unsigned bpinstr = 0x7fe00008; /* trap */ > > /* Prototypes */ > -extern void (*debugger_fault_handler)(struct pt_regs *); > static int cmds(struct pt_regs *); > static int mread(unsigned long, void *, int); > static int mwrite(unsigned long, void *, int); > -static void handle_fault(struct pt_regs *); > +static int handle_fault(struct pt_regs *); > static void byterev(unsigned char *, int); > static void memex(void); > static int bsesc(void); > @@ -114,10 +113,6 @@ > #endif /* CONFIG_SMP */ > static void csum(void); > static void bootcmds(void); > -static void mem_translate(void); > -static void mem_check(void); > -static void mem_find_real(void); > -static void mem_find_vsid(void); > > static void debug_trace(void); > > @@ -223,7 +218,7 @@ > #endif > } > > -void > +int > xmon(struct pt_regs *excp) > { > struct pt_regs regs; > @@ -328,17 +323,9 @@ > clear_bit(smp_processor_id(), &cpus_in_xmon); > #endif /* CONFIG_SMP */ > set_msrd(msr); /* restore interrupt enable */ > -} > > -void > -xmon_irq(int irq, void *d, struct pt_regs 
*regs) > -{ > - unsigned long flags; > - local_save_flags(flags); > - local_irq_disable(); > - printf("Keyboard interrupt\n"); > - xmon(regs); > - local_irq_restore(flags); > + /* XXX return 1 to try and recover */ > + return 0; > } > > int > @@ -522,18 +509,6 @@ > case 'z': > memzcan(); > break; > - case 'x': > - mem_translate(); > - break; > - case 'c': > - mem_check(); > - break; > - case 'f': > - mem_find_real(); > - break; > - case 'e': > - mem_find_vsid(); > - break; > case 'i': > show_mem(); > break; > @@ -630,7 +605,7 @@ > printf("stopping all cpus\n"); > /* interrupt other cpu(s) */ > cpu = MSG_ALL_BUT_SELF; > - smp_send_xmon_break(cpu); > + smp_send_debugger_break(cpu); > return; > } > termch = cmd; > @@ -1180,7 +1155,7 @@ > > n = 0; > if (setjmp(bus_error_jmp) == 0) { > - debugger_fault_handler = handle_fault; > + __debugger_fault_handler = handle_fault; > sync(); > p = (char *)adrs; > q = (char *)buf; > @@ -1205,7 +1180,7 @@ > __delay(200); > n = size; > } > - debugger_fault_handler = 0; > + __debugger_fault_handler = 0; > return n; > } > > @@ -1217,7 +1192,7 @@ > > n = 0; > if (setjmp(bus_error_jmp) == 0) { > - debugger_fault_handler = handle_fault; > + __debugger_fault_handler = handle_fault; > sync(); > p = (char *) adrs; > q = (char *) buf; > @@ -1244,14 +1219,14 @@ > } else { > printf("*** Error writing address %x\n", adrs + n); > } > - debugger_fault_handler = 0; > + __debugger_fault_handler = 0; > return n; > } > > static int fault_type; > static char *fault_chars[] = { "--", "**", "##" }; > > -static void > +static int > handle_fault(struct pt_regs *regs) > { > switch (regs->trap) { > @@ -1267,6 +1242,8 @@ > } > > longjmp(bus_error_jmp, 1); > + > + return 0; > } > > #define SWAP(a, b, t) ((t) = (a), (a) = (b), (b) = (t)) > @@ -1880,7 +1857,7 @@ > char namebuf[128]; > > if (setjmp(bus_error_jmp) == 0) { > - debugger_fault_handler = handle_fault; > + __debugger_fault_handler = handle_fault; > sync(); > name = kallsyms_lookup(address, &size, 
&offset, &modname, > namebuf); > @@ -1891,7 +1868,7 @@ > name = "symbol lookup failed"; > } > > - debugger_fault_handler = 0; > + __debugger_fault_handler = 0; > > if (!name) { > char addrstr[sizeof("0x%lx") + (BITS_PER_LONG*3/10)]; > @@ -1919,240 +1896,8 @@ > } > } > > -void > -mem_translate() > +static void debug_trace(void) > { > - int c; > - unsigned long ea, va, vsid, vpn, page, hpteg_slot_primary, hpteg_slot_secondary, primary_hash, i, *steg, esid, stabl; > - HPTE * hpte; > - struct mm_struct * mm; > - pte_t *ptep = NULL; > - void * pgdir; > - > - c = inchar(); > - if ((isxdigit(c) && c != 'f' && c != 'd') || c == '\n') > - termch = c; > - scanhex((void *)&ea); > - > - if ((ea >= KRANGE_START) && (ea <= (KRANGE_START + (1UL<<60)))) { > - ptep = 0; > - vsid = get_kernel_vsid(ea); > - va = ( vsid << 28 ) | ( ea & 0x0fffffff ); > - } else { > - // if in vmalloc range, use the vmalloc page directory > - if ( ( ea >= VMALLOC_START ) && ( ea <= VMALLOC_END ) ) { > - mm = &init_mm; > - vsid = get_kernel_vsid( ea ); > - } > - // if in ioremap range, use the ioremap page directory > - else if ( ( ea >= IMALLOC_START ) && ( ea <= IMALLOC_END ) ) { > - mm = &ioremap_mm; > - vsid = get_kernel_vsid( ea ); > - } > - // if in user range, use the current task's page directory > - else if ( ( ea >= USER_START ) && ( ea <= USER_END ) ) { > - mm = current->mm; > - vsid = get_vsid(mm->context, ea ); > - } > - pgdir = mm->pgd; > - va = ( vsid << 28 ) | ( ea & 0x0fffffff ); > - ptep = find_linux_pte( pgdir, ea ); > - } > - > - vpn = ((vsid << 28) | (((ea) & 0xFFFF000))) >> 12; > - page = vpn & 0xffff; > - esid = (ea >> 28) & 0xFFFFFFFFF; > - > - // Search the primary group for an available slot > - primary_hash = ( vsid & 0x7fffffffff ) ^ page; > - hpteg_slot_primary = ( primary_hash & htab_data.htab_hash_mask ) * HPTES_PER_GROUP; > - hpteg_slot_secondary = ( ~primary_hash & htab_data.htab_hash_mask ) * HPTES_PER_GROUP; > - > - printf("ea : %.16lx\n", ea); > - printf("esid : 
%.16lx\n", esid); > - printf("vsid : %.16lx\n", vsid); > - > - printf("\nSoftware Page Table\n-------------------\n"); > - printf("ptep : %.16lx\n", ((unsigned long *)ptep)); > - if(ptep) { > - printf("*ptep : %.16lx\n", *((unsigned long *)ptep)); > - } > - > - hpte = htab_data.htab + hpteg_slot_primary; > - printf("\nHardware Page Table\n-------------------\n"); > - printf("htab base : %.16lx\n", htab_data.htab); > - printf("slot primary : %.16lx\n", hpteg_slot_primary); > - printf("slot secondary : %.16lx\n", hpteg_slot_secondary); > - printf("\nPrimary Group\n"); > - for (i=0; i<8; ++i) { > - if ( hpte->dw0.dw0.v != 0 ) { > - printf("%d: (hpte)%.16lx %.16lx\n", i, hpte->dw0.dword0, hpte->dw1.dword1); > - printf(" vsid: %.13lx api: %.2lx hash: %.1lx\n", > - (hpte->dw0.dw0.avpn)>>5, > - (hpte->dw0.dw0.avpn) & 0x1f, > - (hpte->dw0.dw0.h)); > - printf(" rpn: %.13lx \n", (hpte->dw1.dw1.rpn)); > - printf(" pp: %.1lx \n", > - ((hpte->dw1.dw1.pp0)<<2)|(hpte->dw1.dw1.pp)); > - printf(" wimgn: %.2lx reference: %.1lx change: %.1lx\n", > - ((hpte->dw1.dw1.w)<<4)| > - ((hpte->dw1.dw1.i)<<3)| > - ((hpte->dw1.dw1.m)<<2)| > - ((hpte->dw1.dw1.g)<<1)| > - ((hpte->dw1.dw1.n)<<0), > - hpte->dw1.dw1.r, hpte->dw1.dw1.c); > - } > - hpte++; > - } > - > - printf("\nSecondary Group\n"); > - // Search the secondary group > - hpte = htab_data.htab + hpteg_slot_secondary; > - for (i=0; i<8; ++i) { > - if(hpte->dw0.dw0.v) { > - printf("%d: (hpte)%.16lx %.16lx\n", i, hpte->dw0.dword0, hpte->dw1.dword1); > - printf(" vsid: %.13lx api: %.2lx hash: %.1lx\n", > - (hpte->dw0.dw0.avpn)>>5, > - (hpte->dw0.dw0.avpn) & 0x1f, > - (hpte->dw0.dw0.h)); > - printf(" rpn: %.13lx \n", (hpte->dw1.dw1.rpn)); > - printf(" pp: %.1lx \n", > - ((hpte->dw1.dw1.pp0)<<2)|(hpte->dw1.dw1.pp)); > - printf(" wimgn: %.2lx reference: %.1lx change: %.1lx\n", > - ((hpte->dw1.dw1.w)<<4)| > - ((hpte->dw1.dw1.i)<<3)| > - ((hpte->dw1.dw1.m)<<2)| > - ((hpte->dw1.dw1.g)<<1)| > - ((hpte->dw1.dw1.n)<<0), > - hpte->dw1.dw1.r, 
hpte->dw1.dw1.c); > - } > - hpte++; > - } > - > - printf("\nHardware Segment Table\n-----------------------\n"); > - stabl = (unsigned long)(KERNELBASE+(_ASR&0xFFFFFFFFFFFFFFFE)); > - steg = (unsigned long *)((stabl) | ((esid & 0x1f) << 7)); > - > - printf("stab base : %.16lx\n", stabl); > - printf("slot : %.16lx\n", steg); > - > - for (i=0; i<8; ++i) { > - printf("%d: (ste) %.16lx %.16lx\n", i, > - *((unsigned long *)(steg+i*2)),*((unsigned long *)(steg+i*2+1)) ); > - } > -} > - > -void mem_check() > -{ > - unsigned long htab_size_bytes; > - unsigned long htab_end; > - unsigned long last_rpn; > - HPTE *hpte1, *hpte2; > - > - htab_size_bytes = htab_data.htab_num_ptegs * 128; // 128B / PTEG > - htab_end = (unsigned long)htab_data.htab + htab_size_bytes; > - // last_rpn = (naca->physicalMemorySize-1) >> PAGE_SHIFT; > - last_rpn = 0xfffff; > - > - printf("\nHardware Page Table Check\n-------------------\n"); > - printf("htab base : %.16lx\n", htab_data.htab); > - printf("htab size : %.16lx\n", htab_size_bytes); > - > -#if 1 > - for(hpte1 = htab_data.htab; hpte1 < (HPTE *)htab_end; hpte1++) { > - if ( hpte1->dw0.dw0.v != 0 ) { > - if ( hpte1->dw1.dw1.rpn <= last_rpn ) { > - for(hpte2 = hpte1+1; hpte2 < (HPTE *)htab_end; hpte2++) { > - if ( hpte2->dw0.dw0.v != 0 ) { > - if(hpte1->dw1.dw1.rpn == hpte2->dw1.dw1.rpn) { > - printf(" Duplicate rpn: %.13lx \n", (hpte1->dw1.dw1.rpn)); > - printf(" hpte1: %16.16lx *hpte1: %16.16lx %16.16lx\n", > - hpte1, hpte1->dw0.dword0, hpte1->dw1.dword1); > - printf(" hpte2: %16.16lx *hpte2: %16.16lx %16.16lx\n", > - hpte2, hpte2->dw0.dword0, hpte2->dw1.dword1); > - } > - } > - } > - } else { > - printf(" Bogus rpn: %.13lx \n", (hpte1->dw1.dw1.rpn)); > - printf(" hpte: %16.16lx *hpte: %16.16lx %16.16lx\n", > - hpte1, hpte1->dw0.dword0, hpte1->dw1.dword1); > - } > - } > - } > -#endif > - printf("\nDone -------------------\n"); > -} > - > -void mem_find_real() > -{ > - unsigned long htab_size_bytes; > - unsigned long htab_end; > - unsigned 
long last_rpn; > - HPTE *hpte1; > - unsigned long pa, rpn; > - int c; > - > - c = inchar(); > - if ((isxdigit(c) && c != 'f' && c != 'd') || c == '\n') > - termch = c; > - scanhex((void *)&pa); > - rpn = pa >> 12; > - > - htab_size_bytes = htab_data.htab_num_ptegs * 128; // 128B / PTEG > - htab_end = (unsigned long)htab_data.htab + htab_size_bytes; > - // last_rpn = (naca->physicalMemorySize-1) >> PAGE_SHIFT; > - last_rpn = 0xfffff; > - > - printf("\nMem Find RPN\n-------------------\n"); > - printf("htab base : %.16lx\n", htab_data.htab); > - printf("htab size : %.16lx\n", htab_size_bytes); > - > - for(hpte1 = htab_data.htab; hpte1 < (HPTE *)htab_end; hpte1++) { > - if ( hpte1->dw0.dw0.v != 0 ) { > - if ( hpte1->dw1.dw1.rpn == rpn ) { > - printf(" Found rpn: %.13lx \n", (hpte1->dw1.dw1.rpn)); > - printf(" hpte: %16.16lx *hpte1: %16.16lx %16.16lx\n", > - hpte1, hpte1->dw0.dword0, hpte1->dw1.dword1); > - } > - } > - } > - printf("\nDone -------------------\n"); > -} > - > -void mem_find_vsid() > -{ > - unsigned long htab_size_bytes; > - unsigned long htab_end; > - HPTE *hpte1; > - unsigned long vsid; > - int c; > - > - c = inchar(); > - if ((isxdigit(c) && c != 'f' && c != 'd') || c == '\n') > - termch = c; > - scanhex((void *)&vsid); > - > - htab_size_bytes = htab_data.htab_num_ptegs * 128; // 128B / PTEG > - htab_end = (unsigned long)htab_data.htab + htab_size_bytes; > - > - printf("\nMem Find VSID\n-------------------\n"); > - printf("htab base : %.16lx\n", htab_data.htab); > - printf("htab size : %.16lx\n", htab_size_bytes); > - > - for(hpte1 = htab_data.htab; hpte1 < (HPTE *)htab_end; hpte1++) { > - if ( hpte1->dw0.dw0.v != 0 ) { > - if ( ((hpte1->dw0.dw0.avpn)>>5) == vsid ) { > - printf(" Found vsid: %.16lx \n", ((hpte1->dw0.dw0.avpn) >> 5)); > - printf(" hpte: %16.16lx *hpte1: %16.16lx %16.16lx\n", > - hpte1, hpte1->dw0.dword0, hpte1->dw1.dword1); > - } > - } > - } > - printf("\nDone -------------------\n"); > -} > - > -static void debug_trace(void) { > 
unsigned long val, cmd, on; > > cmd = skipbl(); > @@ -2198,4 +1943,13 @@ > } > cmd = skipbl(); > } > +} > + > +void xmon_init(void) > +{ > + __debugger = xmon; > + __debugger_bpt = xmon_bpt; > + __debugger_sstep = xmon_sstep; > + __debugger_iabr_match = xmon_iabr_match; > + __debugger_dabr_match = xmon_dabr_match; > } > ===== include/asm-ppc64/ppcdebug.h 1.4 vs edited ===== > --- 1.4/include/asm-ppc64/ppcdebug.h Fri Sep 13 21:19:46 2002 > +++ edited/include/asm-ppc64/ppcdebug.h Tue Jan 27 00:07:54 2004 > @@ -95,24 +95,11 @@ > #define ppcdebugset(FLAGS) (udbg_ifdebug(FLAGS)) > #define PPCDBG_BINFMT (test_thread_flag(TIF_32BIT) ? PPCDBG_BINFMT32 : PPCDBG_BINFMT64) > > -#ifdef CONFIG_XMON > -#define PPCDBG_ENTER_DEBUGGER() xmon(0) > -#define PPCDBG_ENTER_DEBUGGER_REGS(X) xmon(X) > -#endif > - > #else > #define PPCDBG(...) do {;} while (0) > #define PPCDBGCALL(FLAGS,FUNCTION) do {;} while (0) > #define ifppcdebug(...) if (0) > #define ppcdebugset(FLAGS) (0) > #endif /* CONFIG_PPCDBG */ > - > -#ifndef PPCDBG_ENTER_DEBUGGER > -#define PPCDBG_ENTER_DEBUGGER() do {;} while(0) > -#endif > - > -#ifndef PPCDBG_ENTER_DEBUGGER_REGS > -#define PPCDBG_ENTER_DEBUGGER_REGS(A) do {;} while(0) > -#endif > > #endif /*__PPCDEBUG_H */ > ===== include/asm-ppc64/smp.h 1.17 vs edited ===== > --- 1.17/include/asm-ppc64/smp.h Tue Jan 20 13:08:24 2004 > +++ edited/include/asm-ppc64/smp.h Tue Jan 27 00:24:40 2004 > @@ -29,8 +29,7 @@ > #ifdef CONFIG_SMP > > extern void smp_message_pass(int target, int msg, unsigned long data, int wait); > -extern void smp_send_tlb_invalidate(int); > -extern void smp_send_xmon_break(int cpu); > +extern void smp_send_debugger_break(int cpu); > struct pt_regs; > extern void smp_message_recv(int, struct pt_regs *); > > @@ -63,17 +62,22 @@ > * in /proc/interrupts will be wrong!!! 
--Troy */ > #define PPC_MSG_CALL_FUNCTION 0 > #define PPC_MSG_RESCHEDULE 1 > +/* This is unused now */ > +#if 0 > #define PPC_MSG_MIGRATE_TASK 2 > -#define PPC_MSG_XMON_BREAK 3 > +#endif > +#define PPC_MSG_DEBUGGER_BREAK 3 > > void smp_init_iSeries(void); > void smp_init_pSeries(void); > > #endif /* !(CONFIG_SMP) */ > -#endif /* __ASSEMBLY__ */ > > #define get_hard_smp_processor_id(CPU) (paca[(CPU)].xHwProcNum) > -#define set_hard_smp_processor_id(CPU, VAL) do { (paca[(CPU)].xHwProcNum = VAL); } while (0) > +#define set_hard_smp_processor_id(CPU, VAL) \ > + do { (paca[(CPU)].xHwProcNum = VAL); } while (0) > + > +#endif /* __ASSEMBLY__ */ > > #endif /* !(_PPC64_SMP_H) */ > #endif /* __KERNEL__ */ > ===== include/asm-ppc64/system.h 1.25 vs edited ===== > --- 1.25/include/asm-ppc64/system.h Thu Jan 22 16:29:20 2004 > +++ edited/include/asm-ppc64/system.h Tue Jan 27 00:28:18 2004 > @@ -9,6 +9,7 @@ > */ > > #include > +#include > #include > #include > #include > @@ -53,30 +54,40 @@ > #endif /* CONFIG_SMP */ > > #ifdef CONFIG_DEBUG_KERNEL > -extern void (*debugger)(struct pt_regs *regs); > -extern int (*debugger_bpt)(struct pt_regs *regs); > -extern int (*debugger_sstep)(struct pt_regs *regs); > -extern int (*debugger_iabr_match)(struct pt_regs *regs); > -extern int (*debugger_dabr_match)(struct pt_regs *regs); > -extern void (*debugger_fault_handler)(struct pt_regs *regs); > -#else > -#define debugger(regs) do { } while (0) > -#define debugger_bpt(regs) 0 > -#define debugger_sstep(regs) 0 > -#define debugger_iabr_match(regs) 0 > -#define debugger_dabr_match(regs) 0 > -#define debugger_fault_handler ((void (*)(struct pt_regs *))0) > -#endif > + > +extern int (*__debugger)(struct pt_regs *regs); > +extern int (*__debugger_bpt)(struct pt_regs *regs); > +extern int (*__debugger_sstep)(struct pt_regs *regs); > +extern int (*__debugger_iabr_match)(struct pt_regs *regs); > +extern int (*__debugger_dabr_match)(struct pt_regs *regs); > +extern int 
(*__debugger_fault_handler)(struct pt_regs *regs);
> +
> +#define DEBUGGER_BOILERPLATE(__NAME) \
> +static inline int __NAME(struct pt_regs *regs) \
> +{ \
> +	if (unlikely(__ ## __NAME)) \
> +		return __ ## __NAME(regs); \
> +	return 0; \
> +}
> +
> +DEBUGGER_BOILERPLATE(debugger)
> +DEBUGGER_BOILERPLATE(debugger_bpt)
> +DEBUGGER_BOILERPLATE(debugger_sstep)
> +DEBUGGER_BOILERPLATE(debugger_iabr_match)
> +DEBUGGER_BOILERPLATE(debugger_dabr_match)
> +DEBUGGER_BOILERPLATE(debugger_fault_handler)
>
>  #ifdef CONFIG_XMON
> -extern void xmon_irq(int, void *, struct pt_regs *);
> +extern void xmon_init(void);
> +#endif
>
> -extern void xmon(struct pt_regs *regs);
> -extern int xmon_bpt(struct pt_regs *regs);
> -extern int xmon_sstep(struct pt_regs *regs);
> -extern int xmon_iabr_match(struct pt_regs *regs);
> -extern int xmon_dabr_match(struct pt_regs *regs);
> -extern void (*xmon_fault_handler)(struct pt_regs *regs);
> +#else
> +#define DEBUGGER(regs) 0
> +#define DEBUGGER_BPT(regs) 0
> +#define DEBUGGER_SSTEP(regs) 0
> +#define DEBUGGER_IABR_MATCH(regs) 0
> +#define DEBUGGER_DABR_MATCH(regs) 0
> +#define DEBUGGER_FAULT_HANDLER(regs) 0
>  #endif
>
>  extern void show_regs(struct pt_regs * regs);
>

--
USB is for mice, FireWire is for men!

sUse lINUX ag, nÜRNBERG

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From moilanen at austin.ibm.com Sat Jan 31 03:40:44 2004
From: moilanen at austin.ibm.com (Jake Moilanen)
Date: Fri, 30 Jan 2004 10:40:44 -0600
Subject: [PATCH][2.6] rtas error-inject support
Message-ID: <1075480843.682.188.camel@magik>

Here is support for the rtas error-inject call. Error inject is used by
many test organizations to inject hardware errors, so that the error
paths taken on a real hardware error can be tested. Error inject should
not be used in a production environment.

Thanks,
Jake
-------------- next part --------------
# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
#	ChangeSet	1.1393 -> 1.1394
#	arch/ppc64/kernel/rtas.c	1.21 -> 1.22
#	arch/ppc64/defconfig	1.41 -> 1.42
#	arch/ppc64/kernel/rtas-proc.c	1.12 -> 1.13
#	arch/ppc64/Kconfig	1.35 -> 1.36
#	include/asm-ppc64/rtas.h	1.17 -> 1.18
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 04/01/30	moilanen at threadlp13.austin.ibm.com	1.1394
# Error Inject support
# --------------------------------------------
#
diff -Nru a/arch/ppc64/Kconfig b/arch/ppc64/Kconfig
--- a/arch/ppc64/Kconfig	Fri Jan 30 10:30:39 2004
+++ b/arch/ppc64/Kconfig	Fri Jan 30 10:30:39 2004
@@ -164,6 +164,14 @@
 	  Provide system capacity information via human readable
 	  = pairs through a /proc/ppc64/lparcfg interface.

+config RTAS_ERRINJCT
+	bool "RTAS Errinject"
+	depends on PPC_RTAS
+	help
+	  Provide ability to inject errors into hardware for the purpose
+	  of testing hardware error code path. Do not use on production
+	  machine.
+
 endmenu

diff -Nru a/arch/ppc64/defconfig b/arch/ppc64/defconfig
--- a/arch/ppc64/defconfig	Fri Jan 30 10:30:39 2004
+++ b/arch/ppc64/defconfig	Fri Jan 30 10:30:39 2004
@@ -59,7 +59,7 @@
 # CONFIG_RTAS_FLASH is not set
 CONFIG_SCANLOG=y
 CONFIG_PPC_RTAS=y
-
+# CONFIG_RTAS_ERRINJCT is not set
 #
 # General setup
 #
diff -Nru a/arch/ppc64/kernel/rtas-proc.c b/arch/ppc64/kernel/rtas-proc.c
--- a/arch/ppc64/kernel/rtas-proc.c	Fri Jan 30 10:30:39 2004
+++ b/arch/ppc64/kernel/rtas-proc.c	Fri Jan 30 10:30:39 2004
@@ -126,6 +126,7 @@

 static unsigned long rtas_tone_frequency = 1000;
 static unsigned long rtas_tone_volume = 0;
+static unsigned int open_token = 0;

 /* ****************STRUCTS******************************************* */
 struct individual_sensor {
@@ -165,6 +166,12 @@
 		size_t count, loff_t *ppos);
 static ssize_t ppc_rtas_rmo_buf_read(struct file *file, char *buf,
 				    size_t count, loff_t *ppos);
+static int ppc_rtas_errinjct_open(struct inode *inode, struct file *file);
+static int ppc_rtas_errinjct_release(struct inode *inode, struct file *file);
+static ssize_t ppc_rtas_errinjct_write(struct file * file, const char * buf,
+				       size_t count, loff_t *ppos);
+static ssize_t ppc_rtas_errinjct_read(struct file *file, char *buf,
+				      size_t count, loff_t *ppos);

 struct file_operations ppc_rtas_poweron_operations = {
 	.read = ppc_rtas_poweron_read,
@@ -189,6 +196,13 @@
 	.write = ppc_rtas_tone_volume_write
 };

+struct file_operations ppc_rtas_errinjct_operations = {
+	.open = ppc_rtas_errinjct_open,
+	.read = ppc_rtas_errinjct_read,
+	.write = ppc_rtas_errinjct_write,
+	.release = ppc_rtas_errinjct_release
+};
+
 static struct file_operations ppc_rtas_rmo_buf_ops = {
 	.read = ppc_rtas_rmo_buf_read,
 };
@@ -207,7 +221,8 @@
 void proc_rtas_init(void)
 {
 	struct proc_dir_entry *entry;
-
+	int errinjct_token;
+
 	rtas_node = of_find_node_by_name(NULL, "rtas");
 	if ((rtas_node == NULL) || (systemcfg->platform == PLATFORM_ISERIES_LPAR)) {
 		return;
@@ -244,6 +259,14 @@
 	entry = 
create_proc_entry("rmo_buffer", S_IRUSR, proc_ppc64.rtas);
 	if (entry) entry->proc_fops = &ppc_rtas_rmo_buf_ops;
+
+#ifdef CONFIG_RTAS_ERRINJCT
+	errinjct_token = rtas_token("ibm,errinjct");
+	if (errinjct_token != RTAS_UNKNOWN_SERVICE) {
+		entry = create_proc_entry("errinjct",S_IWUSR|S_IRUGO, proc_ppc64.rtas);
+		if (entry) entry->proc_fops = &ppc_rtas_errinjct_operations;
+	}
+#endif
 }

 /* ****************************************************************** */
@@ -928,6 +951,139 @@
 		return -EFAULT;
 	}
 	*ppos += n;
+	return n;
+}
+
+/* ****************************************************************** */
+/* ERRINJCT */
+/* ****************************************************************** */
+static int ppc_rtas_errinjct_open(struct inode *inode, struct file *file)
+{
+	int rc;
+
+	/* We will only allow one process to use error inject at a
+	   time. Since errinjct is usually only used for testing,
+	   this shouldn't be an issue */
+	if (open_token) {
+		return -EAGAIN;
+	}
+	rc = rtas_errinjct_open();
+	if (rc < 0) {
+		return -EIO;
+	}
+	open_token = rc;
+
+	return 0;
+}
+
+static ssize_t ppc_rtas_errinjct_write(struct file * file, const char * buf,
+				       size_t count, loff_t *ppos)
+{
+
+	char * ei_token;
+	char * workspace = NULL;
+	size_t max_len;
+	int token_len;
+	int rc;
+
+	/* Verify the errinjct token length */
+	if (count < ERRINJCT_TOKEN_LEN) {
+		max_len = count;
+	} else {
+		max_len = ERRINJCT_TOKEN_LEN;
+	}
+
+	token_len = strnlen(buf, max_len);
+	token_len++; /* Add one for the null termination */
+
+	ei_token = (char *)kmalloc(token_len, GFP_KERNEL);
+	if (!ei_token) {
+		printk(KERN_WARNING "error: kmalloc failed\n");
+		return -ENOMEM;
+	}
+
+	strncpy(ei_token, buf, token_len);
+
+	if (count > token_len + WORKSPACE_SIZE) {
+		count = token_len + WORKSPACE_SIZE;
+	}
+
+	buf += token_len;
+
+	/* check if there is a workspace */
+	if (count > token_len) {
+		/* Verify the workspace size */
+		if ((count - token_len) > WORKSPACE_SIZE) {
+			max_len = WORKSPACE_SIZE;
+		} else 
{
+			max_len = count - token_len;
+		}
+
+		workspace = (char *)kmalloc(max_len, GFP_KERNEL);
+		if (!workspace) {
+			printk(KERN_WARNING "error: failed kmalloc\n");
+			kfree(ei_token);
+			return -ENOMEM;
+		}
+		copy_from_user(workspace, buf, max_len);
+	}
+
+	rc = rtas_errinjct(open_token, ei_token, workspace, max_len);
+
+	if (count > token_len) {
+		kfree(workspace);
+	}
+	kfree(ei_token);
+
+	return rc < 0 ? rc : count;
+}
+
+static int ppc_rtas_errinjct_release(struct inode *inode, struct file *file)
+{
+	int rc;
+
+	rc = rtas_errinjct_close(open_token);
+	if (rc) {
+		return rc;
+	}
+	open_token = 0;
+	return 0;
+}
+
+static ssize_t ppc_rtas_errinjct_read(struct file *file, char *buf,
+				      size_t count, loff_t *ppos)
+{
+	char * buffer;
+	int i;
+	int n = 0;
+
+	buffer = (char *)kmalloc(MAX_ERRINJCT_TOKENS * (ERRINJCT_TOKEN_LEN+1),
+				 GFP_KERNEL);
+	if (!buffer) {
+		printk(KERN_ERR "error: kmalloc failed\n");
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < MAX_ERRINJCT_TOKENS && ei_token_list[i].value; i++) {
+		n += sprintf(buffer+n, ei_token_list[i].name);
+		n += sprintf(buffer+n, "\n");
+	}
+
+	if (*ppos >= strlen(buffer)) {
+		kfree(buffer);
+		return 0;
+	}
+	if (n > strlen(buffer) - *ppos)
+		n = strlen(buffer) - *ppos;
+
+	if (n > count)
+		n = count;
+
+	memcpy(buf, buffer + *ppos, n);
+
+	*ppos += n;
+
+	kfree(buffer);
 	return n;
 }

diff -Nru a/arch/ppc64/kernel/rtas.c b/arch/ppc64/kernel/rtas.c
--- a/arch/ppc64/kernel/rtas.c	Fri Jan 30 10:30:39 2004
+++ b/arch/ppc64/kernel/rtas.c	Fri Jan 30 10:30:39 2004
@@ -33,6 +33,7 @@
 #include

 struct flash_block_list_header rtas_firmware_flash_list = {0, 0};
+struct errinjct_token ei_token_list[MAX_ERRINJCT_TOKENS];

 /*
  * prom_init() is called very early on, before the kernel text
@@ -191,6 +192,10 @@
 	int order = status - 9900;
 	unsigned long ms;

+	if (status < RTAS_EXTENDED_DELAY_MIN ||
+	    status > RTAS_EXTENDED_DELAY_MAX)
+		return 0;
+
 	if (order < 0)
 		order = 0; /* RTC depends on this for -2 clock busy */
 	else if (order > 5)
@@ -423,6 
+428,159 @@
 	return 0;
 }

+
+#ifdef CONFIG_RTAS_ERRINJCT
+int
+rtas_errinjct_open(void)
+{
+	u32 ret[2];
+	int open_token;
+	int rc;
+	unsigned int time;
+
+
+	while (1) {
+		/*
+		 * The rc and open_token values are backwards due to a
+		 * misprint in the RPA.
+		 */
+		open_token = rtas_call(rtas_token("ibm,open-errinjct"), 0, 2, (void *) &ret);
+		rc = ret[0];
+
+		if (rc == RTAS_BUSY) {
+			continue;
+		}
+
+		if ((time = rtas_extended_busy_delay_time(rc))) {
+			udelay(time * 1000);
+			continue;
+		}
+
+		if (rc < 0) {
+			printk(KERN_WARNING "error: ibm,open-errinjct failed (%d)\n", rc);
+			return rc;
+		}
+
+		return open_token;
+	}
+}
+
+int
+rtas_errinjct(unsigned int open_token, char * ei_token, char * workspace, size_t workspace_size)
+{
+	struct errinjct_token * ei;
+	int rtas_ei_token = -1;
+	unsigned int time;
+	int rc = 0;
+	int i;
+
+	ei = ei_token_list;
+	for (i = 0; i < MAX_ERRINJCT_TOKENS && ei->name; i++) {
+		if (strcmp(ei_token, ei->name) == 0) {
+			rtas_ei_token = ei->value;
+			break;
+		}
+		ei++;
+	}
+	if (rtas_ei_token == -1) {
+		return -EINVAL;
+	}
+
+	spin_lock(&rtas_data_buf_lock);
+
+	while (1) {
+		if (rc != RTAS_BUSY && workspace) {
+			memset(rtas_data_buf, 0, RTAS_DATA_BUF_SIZE);
+			memcpy(rtas_data_buf, workspace, workspace_size);
+		}
+
+		rc = rtas_call(rtas_token("ibm,errinjct"), 3, 1, NULL,
+			       rtas_ei_token, open_token, __pa(rtas_data_buf));
+
+		if (rc == RTAS_BUSY) {
+			continue;
+		}
+
+		if ((time = rtas_extended_busy_delay_time(rc))) {
+			spin_unlock(&rtas_data_buf_lock);
+			udelay(time * 1000);
+			spin_lock(&rtas_data_buf_lock);
+			continue;
+		}
+
+		if (rc != 0) {
+			printk(KERN_WARNING "error: ibm,errinjct failed (%d)\n", rc);
+		}
+
+		spin_unlock(&rtas_data_buf_lock);
+
+		return rc;
+	}
+}
+
+int
+rtas_errinjct_close(unsigned int open_token)
+{
+	int rc;
+	unsigned int time;
+
+	while (1) {
+		rc = rtas_call(rtas_token("ibm,close-errinjct"), 1, 1, NULL, open_token);
+
+		if (rc == RTAS_BUSY) {
+			continue;
+		}
+
+		if ((time = rtas_extended_busy_delay_time(rc))) {
+			
udelay(time * 1000); + continue; + } + + if (rc != 0) { + printk(KERN_WARNING "error: ibm,close-errinjct failed (%d)\n", rc); + } + + return rc; + } +} + +static int __init rtas_errinjct_init(void) +{ + char * token_array; + char * end_array; + int array_len = 0; + int len; + int i, j; + + token_array = (char *) get_property(rtas.dev, "ibm,errinjct-tokens", + &array_len); + end_array = token_array + array_len; + for (i = 0, j = 0; i < MAX_ERRINJCT_TOKENS && token_array < end_array; i++) { + + len = strnlen(token_array, ERRINJCT_TOKEN_LEN) + 1; + ei_token_list[i].name = (char *) kmalloc(len, GFP_KERNEL); + if (!ei_token_list[i].name) { + printk(KERN_WARNING "error: kmalloc failed\n"); + return -ENOMEM; + } + + strcpy(ei_token_list[i].name, token_array); + token_array += len; + + ei_token_list[i].value = *(int *)token_array; + token_array += sizeof(int); + } + for (; i < MAX_ERRINJCT_TOKENS; i++) { + ei_token_list[i].name = 0; + ei_token_list[i].value = 0; + } + + return 0; + +} +#endif + +__initcall(rtas_errinjct_init); EXPORT_SYMBOL(rtas_firmware_flash_list); diff -Nru a/include/asm-ppc64/rtas.h b/include/asm-ppc64/rtas.h --- a/include/asm-ppc64/rtas.h Fri Jan 30 10:30:39 2004 +++ b/include/asm-ppc64/rtas.h Fri Jan 30 10:30:39 2004 @@ -22,6 +22,11 @@ /* Buffer size for ppc_rtas system call. */ #define RTAS_RMOBUF_MAX (64 * 1024) +/* Error inject defines */ +#define ERRINJCT_TOKEN_LEN 24 /* Max length of an error inject token */ +#define MAX_ERRINJCT_TOKENS 15 /* Max # tokens. 
*/ +#define WORKSPACE_SIZE 1024 + /* RTAS return codes */ #define RTAS_BUSY -2 /* RTAS Return Status - Busy */ #define RTAS_EXTENDED_DELAY_MIN 9900 @@ -141,6 +146,11 @@ unsigned char buffer[1]; /* allocated by klimit bump */ }; +struct errinjct_token { + char * name; + int value; +}; + struct flash_block { char *data; unsigned long length; @@ -178,6 +188,9 @@ extern int rtas_get_sensor(int sensor, int index, int *state); extern int rtas_get_power_level(int powerdomain, int *level); extern int rtas_set_indicator(int indicator, int index, int new_value); +extern int rtas_errinjct_open(void); +extern int rtas_errinjct(unsigned int, char *, char *, size_t); +extern int rtas_errinjct_close(unsigned int); /* Given an RTAS status code of 9900..9905 compute the hinted delay */ unsigned int rtas_extended_busy_delay_time(int status); @@ -187,6 +200,7 @@ } extern void pSeries_log_error(char *buf, unsigned int err_type, int fatal); +extern struct errinjct_token ei_token_list[MAX_ERRINJCT_TOKENS]; /* Error types logged. */ #define ERR_FLAG_ALREADY_LOGGED 0x0 From willschm at us.ibm.com Sat Jan 31 03:54:37 2004 From: willschm at us.ibm.com (Will Schmidt) Date: Fri, 30 Jan 2004 10:54:37 -0600 Subject: Fw: [PATCH] kdb dmesg output broken after log_buf changes Message-ID: > My question is, how does kdb work? I cant see any hooks in arch/ppc64 to > handle dodgy page faults etc. Does kdb work without requiring any arch > hooks? When the changes were made in traps.c and fault.c to use the debugger() call, instead of the xmon() call, kdb was changed to do the same. The KDB change to set those hooks are at the bottom of arch/ppc64/kdb/kdbasupport.c in function kdba_init()... 
willschm at us.ibm.com
Linux on PowerPC-64 Development
IBM Rochester

----- Forwarded by Will Schmidt/Rochester/IBM on 01/30/2004 10:53 AM -----

Will Schmidt, 01/30/2004 08:50 AM

To:      Olaf Hering
cc:      Anton Blanchard, linuxppc64-dev at lists.linuxppc.org
From:    Will Schmidt/Rochester/IBM at IBMUS
Subject: Re: [PATCH] kdb dmesg output broken after log_buf changes

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From hollisb at us.ibm.com Sat Jan 31 04:27:47 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Fri, 30 Jan 2004 11:27:47 -0600
Subject: [PATCH][2.6] rtas error-inject support
In-Reply-To: <1075480843.682.188.camel@magik>
References: <1075480843.682.188.camel@magik>
Message-ID: <9F2C2DDE-5349-11D8-928E-000A95A0560C@us.ibm.com>

On Jan 30, 2004, at 10:40 AM, Jake Moilanen wrote:

> Here is support for the rtas error-inject call.

[snip]

+static ssize_t ppc_rtas_errinjct_read(struct file *file, char *buf,
+                                      size_t count, loff_t *ppos)
...
+        memcpy(buf, buffer + *ppos, n);

That should be copy_to_user(), right? (ppc_rtas_errinjct_write() does
use copy_from_user().)

--
Hollis Blanchard
IBM Linux Technology Center

** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/

From moilanen at austin.ibm.com Sat Jan 31 05:56:09 2004
From: moilanen at austin.ibm.com (Jake Moilanen)
Date: Fri, 30 Jan 2004 12:56:09 -0600
Subject: [PATCH][2.6] rtas error-inject support
In-Reply-To: <9F2C2DDE-5349-11D8-928E-000A95A0560C@us.ibm.com>
References: <1075480843.682.188.camel@magik> <9F2C2DDE-5349-11D8-928E-000A95A0560C@us.ibm.com>
Message-ID: <1075488969.682.192.camel@magik>

> +static ssize_t ppc_rtas_errinjct_read(struct file *file, char *buf,
> +                                      size_t count, loff_t *ppos)
> ...
> +        memcpy(buf, buffer + *ppos, n);
>
> That should be copy_to_user(), right? (ppc_rtas_errinjct_write() does
> use copy_from_user().)

Whoops, you're right. Good catch. This was leftover from the port from
2.4.

I attached the new patch.

Thanks,
Jake

-------------- next part --------------
# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
#   ChangeSet                      1.1393 -> 1.1394
#   arch/ppc64/kernel/rtas.c       1.21   -> 1.22
#   arch/ppc64/defconfig           1.41   -> 1.42
#   arch/ppc64/kernel/rtas-proc.c  1.12   -> 1.13
#   arch/ppc64/Kconfig             1.35   -> 1.36
#   include/asm-ppc64/rtas.h       1.17   -> 1.18
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 04/01/30 moilanen at threadlp13.austin.ibm.com 1.1394
# Error inject support
# --------------------------------------------
#
diff -Nru a/arch/ppc64/Kconfig b/arch/ppc64/Kconfig
--- a/arch/ppc64/Kconfig Fri Jan 30 12:50:16 2004
+++ b/arch/ppc64/Kconfig Fri Jan 30 12:50:16 2004
@@ -164,6 +164,14 @@
           Provide system capacity information via human readable
           = pairs through a /proc/ppc64/lparcfg interface.
 
+config RTAS_ERRINJCT
+        bool "RTAS Errinject"
+        depends on PPC_RTAS
+        help
+          Provide ability to inject errors into hardware for the purpose
+          of testing hardware error code path. Do not use on production
+          machine.
+ endmenu diff -Nru a/arch/ppc64/defconfig b/arch/ppc64/defconfig --- a/arch/ppc64/defconfig Fri Jan 30 12:50:16 2004 +++ b/arch/ppc64/defconfig Fri Jan 30 12:50:16 2004 @@ -59,7 +59,7 @@ # CONFIG_RTAS_FLASH is not set CONFIG_SCANLOG=y CONFIG_PPC_RTAS=y - +# CONFIG_RTAS_ERRINJCT is not set # # General setup # diff -Nru a/arch/ppc64/kernel/rtas-proc.c b/arch/ppc64/kernel/rtas-proc.c --- a/arch/ppc64/kernel/rtas-proc.c Fri Jan 30 12:50:16 2004 +++ b/arch/ppc64/kernel/rtas-proc.c Fri Jan 30 12:50:16 2004 @@ -126,6 +126,7 @@ static unsigned long rtas_tone_frequency = 1000; static unsigned long rtas_tone_volume = 0; +static unsigned int open_token = 0; /* ****************STRUCTS******************************************* */ struct individual_sensor { @@ -165,6 +166,12 @@ size_t count, loff_t *ppos); static ssize_t ppc_rtas_rmo_buf_read(struct file *file, char *buf, size_t count, loff_t *ppos); +static int ppc_rtas_errinjct_open(struct inode *inode, struct file *file); +static int ppc_rtas_errinjct_release(struct inode *inode, struct file *file); +static ssize_t ppc_rtas_errinjct_write(struct file * file, const char * buf, + size_t count, loff_t *ppos); +static ssize_t ppc_rtas_errinjct_read(struct file *file, char *buf, + size_t count, loff_t *ppos); struct file_operations ppc_rtas_poweron_operations = { .read = ppc_rtas_poweron_read, @@ -189,6 +196,13 @@ .write = ppc_rtas_tone_volume_write }; +struct file_operations ppc_rtas_errinjct_operations = { + .open = ppc_rtas_errinjct_open, + .read = ppc_rtas_errinjct_read, + .write = ppc_rtas_errinjct_write, + .release = ppc_rtas_errinjct_release +}; + static struct file_operations ppc_rtas_rmo_buf_ops = { .read = ppc_rtas_rmo_buf_read, }; @@ -207,7 +221,8 @@ void proc_rtas_init(void) { struct proc_dir_entry *entry; - + int errinjct_token; + rtas_node = of_find_node_by_name(NULL, "rtas"); if ((rtas_node == NULL) || (systemcfg->platform == PLATFORM_ISERIES_LPAR)) { return; @@ -244,6 +259,14 @@ entry = 
create_proc_entry("rmo_buffer", S_IRUSR, proc_ppc64.rtas); if (entry) entry->proc_fops = &ppc_rtas_rmo_buf_ops; + +#ifdef CONFIG_RTAS_ERRINJCT + errinjct_token = rtas_token("ibm,errinjct"); + if (errinjct_token != RTAS_UNKNOWN_SERVICE) { + entry = create_proc_entry("errinjct",S_IWUSR|S_IRUGO, proc_ppc64.rtas); + if (entry) entry->proc_fops = &ppc_rtas_errinjct_operations; + } +#endif } /* ****************************************************************** */ @@ -928,6 +951,139 @@ return -EFAULT; } *ppos += n; + return n; +} + +/* ****************************************************************** */ +/* ERRINJCT */ +/* ****************************************************************** */ +static int ppc_rtas_errinjct_open(struct inode *inode, struct file *file) +{ + int rc; + + /* We will only allow one process to use error inject at a + time. Since errinjct is usually only used for testing, + this shouldn't be an issue */ + if (open_token) { + return -EAGAIN; + } + rc = rtas_errinjct_open(); + if (rc < 0) { + return -EIO; + } + open_token = rc; + + return 0; +} + +static ssize_t ppc_rtas_errinjct_write(struct file * file, const char * buf, + size_t count, loff_t *ppos) +{ + + char * ei_token; + char * workspace = NULL; + size_t max_len; + int token_len; + int rc; + + /* Verify the errinjct token length */ + if (count < ERRINJCT_TOKEN_LEN) { + max_len = count; + } else { + max_len = ERRINJCT_TOKEN_LEN; + } + + token_len = strnlen(buf, max_len); + token_len++; /* Add one for the null termination */ + + ei_token = (char *)kmalloc(token_len, GFP_KERNEL); + if (!ei_token) { + printk(KERN_WARNING "error: kmalloc failed\n"); + return -ENOMEM; + } + + strncpy(ei_token, buf, token_len); + + if (count > token_len + WORKSPACE_SIZE) { + count = token_len + WORKSPACE_SIZE; + } + + buf += token_len; + + /* check if there is a workspace */ + if (count > token_len) { + /* Verify the workspace size */ + if ((count - token_len) > WORKSPACE_SIZE) { + max_len = WORKSPACE_SIZE; + } else 
{ + max_len = count - token_len; + } + + workspace = (char *)kmalloc(max_len, GFP_KERNEL); + if (!workspace) { + printk(KERN_WARNING "error: failed kmalloc\n"); + kfree(ei_token); + return -ENOMEM; + } + copy_from_user(workspace, buf, max_len); + } + + rc = rtas_errinjct(open_token, ei_token, workspace, max_len); + + if (count > token_len) { + kfree(workspace); + } + kfree(ei_token); + + return rc < 0 ? rc : count; +} + +static int ppc_rtas_errinjct_release(struct inode *inode, struct file *file) +{ + int rc; + + rc = rtas_errinjct_close(open_token); + if (rc) { + return rc; + } + open_token = 0; + return 0; +} + +static ssize_t ppc_rtas_errinjct_read(struct file *file, char *buf, + size_t count, loff_t *ppos) +{ + char * buffer; + int i; + int n = 0; + + buffer = (char *)kmalloc(MAX_ERRINJCT_TOKENS * (ERRINJCT_TOKEN_LEN+1), + GFP_KERNEL); + if (!buffer) { + printk(KERN_ERR "error: kmalloc failed\n"); + return -ENOMEM; + } + + for (i = 0; i < MAX_ERRINJCT_TOKENS && ei_token_list[i].value; i++) { + n += sprintf(buffer+n, ei_token_list[i].name); + n += sprintf(buffer+n, "\n"); + } + + if (*ppos >= strlen(buffer)) { + kfree(buffer); + return 0; + } + if (n > strlen(buffer) - *ppos) + n = strlen(buffer) - *ppos; + + if (n > count) + n = count; + + copy_to_user(buf, buffer + *ppos, n); + + *ppos += n; + + kfree(buffer); return n; } diff -Nru a/arch/ppc64/kernel/rtas.c b/arch/ppc64/kernel/rtas.c --- a/arch/ppc64/kernel/rtas.c Fri Jan 30 12:50:15 2004 +++ b/arch/ppc64/kernel/rtas.c Fri Jan 30 12:50:16 2004 @@ -33,6 +33,7 @@ #include struct flash_block_list_header rtas_firmware_flash_list = {0, 0}; +struct errinjct_token ei_token_list[MAX_ERRINJCT_TOKENS]; /* * prom_init() is called very early on, before the kernel text @@ -191,6 +192,10 @@ int order = status - 9900; unsigned long ms; + if (status < RTAS_EXTENDED_DELAY_MIN || + status > RTAS_EXTENDED_DELAY_MAX) + return 0; + if (order < 0) order = 0; /* RTC depends on this for -2 clock busy */ else if (order > 5) @@ -423,6 
+428,159 @@ return 0; } + +#ifdef CONFIG_RTAS_ERRINJCT +int +rtas_errinjct_open(void) +{ + u32 ret[2]; + int open_token; + int rc; + unsigned int time; + + + while (1) { + /* + * The rc and open_token values are backwards due to a + * misprint in the RPA. + */ + open_token = rtas_call(rtas_token("ibm,open-errinjct"), 0, 2, (void *) &ret); + rc = ret[0]; + + if (rc == RTAS_BUSY) { + continue; + } + + if ((time = rtas_extended_busy_delay_time(rc))) { + udelay(time * 1000); + continue; + } + + if (rc < 0) { + printk(KERN_WARNING "error: ibm,open-errinjct failed (%d)\n", rc); + return rc; + } + + return open_token; + } +} + +int +rtas_errinjct(unsigned int open_token, char * ei_token, char * workspace, size_t workspace_size) +{ + struct errinjct_token * ei; + int rtas_ei_token = -1; + unsigned int time; + int rc = 0; + int i; + + ei = ei_token_list; + for (i = 0; i < MAX_ERRINJCT_TOKENS && ei->name; i++) { + if (strcmp(ei_token, ei->name) == 0) { + rtas_ei_token = ei->value; + break; + } + ei++; + } + if (rtas_ei_token == -1) { + return -EINVAL; + } + + spin_lock(&rtas_data_buf_lock); + + while (1) { + if (rc != RTAS_BUSY && workspace) { + memset(rtas_data_buf, 0, RTAS_DATA_BUF_SIZE); + memcpy(rtas_data_buf, workspace, workspace_size); + } + + rc = rtas_call(rtas_token("ibm,errinjct"), 3, 1, NULL, + rtas_ei_token, open_token, __pa(rtas_data_buf)); + + if (rc == RTAS_BUSY) { + continue; + } + + if ((time = rtas_extended_busy_delay_time(rc))) { + spin_unlock(&rtas_data_buf_lock); + udelay(time * 1000); + spin_lock(&rtas_data_buf_lock); + continue; + } + + if (rc != 0) { + printk(KERN_WARNING "error: ibm,errinjct failed (%d)\n", rc); + } + + spin_unlock(&rtas_data_buf_lock); + + return rc; + } +} + +int +rtas_errinjct_close(unsigned int open_token) +{ + int rc; + unsigned int time; + + while (1) { + rc = rtas_call(rtas_token("ibm,close-errinjct"), 1, 1, NULL, open_token); + + if (rc == RTAS_BUSY) { + continue; + } + + if ((time = rtas_extended_busy_delay_time(rc))) { + 
udelay(time * 1000); + continue; + } + + if (rc != 0) { + printk(KERN_WARNING "error: ibm,close-errinjct failed (%d)\n", rc); + } + + return rc; + } +} + +static int __init rtas_errinjct_init(void) +{ + char * token_array; + char * end_array; + int array_len = 0; + int len; + int i, j; + + token_array = (char *) get_property(rtas.dev, "ibm,errinjct-tokens", + &array_len); + end_array = token_array + array_len; + for (i = 0, j = 0; i < MAX_ERRINJCT_TOKENS && token_array < end_array; i++) { + + len = strnlen(token_array, ERRINJCT_TOKEN_LEN) + 1; + ei_token_list[i].name = (char *) kmalloc(len, GFP_KERNEL); + if (!ei_token_list[i].name) { + printk(KERN_WARNING "error: kmalloc failed\n"); + return -ENOMEM; + } + + strcpy(ei_token_list[i].name, token_array); + token_array += len; + + ei_token_list[i].value = *(int *)token_array; + token_array += sizeof(int); + } + for (; i < MAX_ERRINJCT_TOKENS; i++) { + ei_token_list[i].name = 0; + ei_token_list[i].value = 0; + } + + return 0; + +} +#endif + +__initcall(rtas_errinjct_init); EXPORT_SYMBOL(rtas_firmware_flash_list); diff -Nru a/include/asm-ppc64/rtas.h b/include/asm-ppc64/rtas.h --- a/include/asm-ppc64/rtas.h Fri Jan 30 12:50:16 2004 +++ b/include/asm-ppc64/rtas.h Fri Jan 30 12:50:16 2004 @@ -22,6 +22,11 @@ /* Buffer size for ppc_rtas system call. */ #define RTAS_RMOBUF_MAX (64 * 1024) +/* Error inject defines */ +#define ERRINJCT_TOKEN_LEN 24 /* Max length of an error inject token */ +#define MAX_ERRINJCT_TOKENS 15 /* Max # tokens. 
*/ +#define WORKSPACE_SIZE 1024 + /* RTAS return codes */ #define RTAS_BUSY -2 /* RTAS Return Status - Busy */ #define RTAS_EXTENDED_DELAY_MIN 9900 @@ -141,6 +146,11 @@ unsigned char buffer[1]; /* allocated by klimit bump */ }; +struct errinjct_token { + char * name; + int value; +}; + struct flash_block { char *data; unsigned long length; @@ -178,6 +188,9 @@ extern int rtas_get_sensor(int sensor, int index, int *state); extern int rtas_get_power_level(int powerdomain, int *level); extern int rtas_set_indicator(int indicator, int index, int new_value); +extern int rtas_errinjct_open(void); +extern int rtas_errinjct(unsigned int, char *, char *, size_t); +extern int rtas_errinjct_close(unsigned int); /* Given an RTAS status code of 9900..9905 compute the hinted delay */ unsigned int rtas_extended_busy_delay_time(int status); @@ -187,6 +200,7 @@ } extern void pSeries_log_error(char *buf, unsigned int err_type, int fatal); +extern struct errinjct_token ei_token_list[MAX_ERRINJCT_TOKENS]; /* Error types logged. */ #define ERR_FLAG_ALREADY_LOGGED 0x0 From hollisb at us.ibm.com Sat Jan 31 06:47:34 2004 From: hollisb at us.ibm.com (Hollis Blanchard) Date: Fri, 30 Jan 2004 13:47:34 -0600 Subject: [PATCH][2.6] rtas error-inject support In-Reply-To: <1075488969.682.192.camel@magik> References: <1075480843.682.188.camel@magik> <9F2C2DDE-5349-11D8-928E-000A95A0560C@us.ibm.com> <1075488969.682.192.camel@magik> Message-ID: <264AA26A-535D-11D8-928E-000A95A0560C@us.ibm.com> On Jan 30, 2004, at 12:56 PM, Jake Moilanen wrote: >> +static ssize_t ppc_rtas_errinjct_read(struct file *file, char *buf, >> + size_t count, loff_t *ppos) >> ... >> + memcpy(buf, buffer + *ppos, n); >> >> That should be copy_to_user(), right? (ppc_rtas_errinjct_write() does >> use copy_from_user().) > > Whoops, your right. Good catch. This was leftover from the port from > 2.4. That statement was alarming... 
:) I only found one memcpy left in that
file in 2.4, but I guess we're supposed to check for EFAULT:

===== arch/ppc64/kernel/rtas-proc.c 1.13 vs edited =====
--- 1.13/arch/ppc64/kernel/rtas-proc.c Mon Jan 26 23:11:12 2004
+++ edited/arch/ppc64/kernel/rtas-proc.c Fri Jan 30 13:55:18 2004
@@ -492,7 +492,10 @@
         else
                 *eof = 1;
 
-        memcpy(buf, buffer + off, n);
+        if (copy_to_user(buf, buffer + off, n)) {
+                kfree(buffer);
+                return -EFAULT;
+        }
         *start = buf;
         kfree(buffer);
         return n;

(Mike do you want to check that in to the 2.4 tree?)

--
Hollis Blanchard
IBM Linux Technology Center

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From moilanen at austin.ibm.com Sat Jan 31 07:03:43 2004
From: moilanen at austin.ibm.com (Jake Moilanen)
Date: Fri, 30 Jan 2004 14:03:43 -0600
Subject: [PATCH][2.6] rtas error-inject support
In-Reply-To: <264AA26A-535D-11D8-928E-000A95A0560C@us.ibm.com>
References: <1075480843.682.188.camel@magik> <9F2C2DDE-5349-11D8-928E-000A95A0560C@us.ibm.com> <1075488969.682.192.camel@magik> <264AA26A-535D-11D8-928E-000A95A0560C@us.ibm.com>
Message-ID: <1075493023.682.199.camel@magik>

> > Whoops, you're right. Good catch. This was leftover from the port from
> > 2.4.
>
> That statement was alarming... :) I only found one memcpy left in that
> file in 2.4, but I guess we're supposed to check for EFAULT:

IIRC Linas went through a couple of months ago and fixed up rtas-proc.c.
The whole file was using memcpy instead of copy_to/from_user().

Thanks,
Jake

** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/ From moilanen at austin.ibm.com Sat Jan 31 09:29:03 2004 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Fri, 30 Jan 2004 16:29:03 -0600 Subject: [PATCH][2.6] rtas error-inject support In-Reply-To: <264AA26A-535D-11D8-928E-000A95A0560C@us.ibm.com> References: <1075480843.682.188.camel@magik> <9F2C2DDE-5349-11D8-928E-000A95A0560C@us.ibm.com> <1075488969.682.192.camel@magik> <264AA26A-535D-11D8-928E-000A95A0560C@us.ibm.com> Message-ID: <1075501743.681.214.camel@magik> > > That statement was alarming... :) I only found one memcpy left in that > file in 2.4, but I guess we're supposed to check for EFAULT: Here's a patch w/ check for EFAULT. Thanks, Jake -------------- next part -------------- # This is a BitKeeper generated patch for the following project: # Project Name: Linux kernel tree # This patch format is intended for GNU patch command version 2.5 or higher. # This patch includes the following deltas: # ChangeSet 1.1393 -> 1.1394 # arch/ppc64/kernel/rtas.c 1.21 -> 1.22 # arch/ppc64/defconfig 1.41 -> 1.42 # arch/ppc64/kernel/rtas-proc.c 1.12 -> 1.13 # arch/ppc64/Kconfig 1.35 -> 1.36 # include/asm-ppc64/rtas.h 1.17 -> 1.18 # # The following is the BitKeeper ChangeSet Log # -------------------------------------------- # 04/01/30 moilanen at threadlp13.austin.ibm.com 1.1394 # Error inject support. # -------------------------------------------- # diff -Nru a/arch/ppc64/Kconfig b/arch/ppc64/Kconfig --- a/arch/ppc64/Kconfig Fri Jan 30 16:25:35 2004 +++ b/arch/ppc64/Kconfig Fri Jan 30 16:25:35 2004 @@ -164,6 +164,14 @@ Provide system capacity information via human readable = pairs through a /proc/ppc64/lparcfg interface. +config RTAS_ERRINJCT + bool "RTAS Errinject" + depends on PPC_RTAS + help + Provide ability to inject errors into hardware for the purpose + of testing hardware error code path. Do not use on production + machine. 
+ endmenu diff -Nru a/arch/ppc64/defconfig b/arch/ppc64/defconfig --- a/arch/ppc64/defconfig Fri Jan 30 16:25:35 2004 +++ b/arch/ppc64/defconfig Fri Jan 30 16:25:35 2004 @@ -59,7 +59,7 @@ # CONFIG_RTAS_FLASH is not set CONFIG_SCANLOG=y CONFIG_PPC_RTAS=y - +# CONFIG_RTAS_ERRINJCT is not set # # General setup # diff -Nru a/arch/ppc64/kernel/rtas-proc.c b/arch/ppc64/kernel/rtas-proc.c --- a/arch/ppc64/kernel/rtas-proc.c Fri Jan 30 16:25:35 2004 +++ b/arch/ppc64/kernel/rtas-proc.c Fri Jan 30 16:25:35 2004 @@ -126,6 +126,7 @@ static unsigned long rtas_tone_frequency = 1000; static unsigned long rtas_tone_volume = 0; +static unsigned int open_token = 0; /* ****************STRUCTS******************************************* */ struct individual_sensor { @@ -165,6 +166,12 @@ size_t count, loff_t *ppos); static ssize_t ppc_rtas_rmo_buf_read(struct file *file, char *buf, size_t count, loff_t *ppos); +static int ppc_rtas_errinjct_open(struct inode *inode, struct file *file); +static int ppc_rtas_errinjct_release(struct inode *inode, struct file *file); +static ssize_t ppc_rtas_errinjct_write(struct file * file, const char * buf, + size_t count, loff_t *ppos); +static ssize_t ppc_rtas_errinjct_read(struct file *file, char *buf, + size_t count, loff_t *ppos); struct file_operations ppc_rtas_poweron_operations = { .read = ppc_rtas_poweron_read, @@ -189,6 +196,13 @@ .write = ppc_rtas_tone_volume_write }; +struct file_operations ppc_rtas_errinjct_operations = { + .open = ppc_rtas_errinjct_open, + .read = ppc_rtas_errinjct_read, + .write = ppc_rtas_errinjct_write, + .release = ppc_rtas_errinjct_release +}; + static struct file_operations ppc_rtas_rmo_buf_ops = { .read = ppc_rtas_rmo_buf_read, }; @@ -207,7 +221,8 @@ void proc_rtas_init(void) { struct proc_dir_entry *entry; - + int errinjct_token; + rtas_node = of_find_node_by_name(NULL, "rtas"); if ((rtas_node == NULL) || (systemcfg->platform == PLATFORM_ISERIES_LPAR)) { return; @@ -244,6 +259,14 @@ entry = 
create_proc_entry("rmo_buffer", S_IRUSR, proc_ppc64.rtas); if (entry) entry->proc_fops = &ppc_rtas_rmo_buf_ops; + +#ifdef CONFIG_RTAS_ERRINJCT + errinjct_token = rtas_token("ibm,errinjct"); + if (errinjct_token != RTAS_UNKNOWN_SERVICE) { + entry = create_proc_entry("errinjct",S_IWUSR|S_IRUGO, proc_ppc64.rtas); + if (entry) entry->proc_fops = &ppc_rtas_errinjct_operations; + } +#endif } /* ****************************************************************** */ @@ -928,6 +951,146 @@ return -EFAULT; } *ppos += n; + return n; +} + +/* ****************************************************************** */ +/* ERRINJCT */ +/* ****************************************************************** */ +static int ppc_rtas_errinjct_open(struct inode *inode, struct file *file) +{ + int rc; + + /* We will only allow one process to use error inject at a + time. Since errinjct is usually only used for testing, + this shouldn't be an issue */ + if (open_token) { + return -EAGAIN; + } + rc = rtas_errinjct_open(); + if (rc < 0) { + return -EIO; + } + open_token = rc; + + return 0; +} + +static ssize_t ppc_rtas_errinjct_write(struct file * file, const char * buf, + size_t count, loff_t *ppos) +{ + + char * ei_token; + char * workspace = NULL; + size_t max_len; + int token_len; + int rc; + + /* Verify the errinjct token length */ + if (count < ERRINJCT_TOKEN_LEN) { + max_len = count; + } else { + max_len = ERRINJCT_TOKEN_LEN; + } + + token_len = strnlen(buf, max_len); + token_len++; /* Add one for the null termination */ + + ei_token = (char *)kmalloc(token_len, GFP_KERNEL); + if (!ei_token) { + printk(KERN_WARNING "error: kmalloc failed\n"); + return -ENOMEM; + } + + strncpy(ei_token, buf, token_len); + + if (count > token_len + WORKSPACE_SIZE) { + count = token_len + WORKSPACE_SIZE; + } + + buf += token_len; + + /* check if there is a workspace */ + if (count > token_len) { + /* Verify the workspace size */ + if ((count - token_len) > WORKSPACE_SIZE) { + max_len = WORKSPACE_SIZE; + } else 
{ + max_len = count - token_len; + } + + workspace = (char *)kmalloc(max_len, GFP_KERNEL); + if (!workspace) { + printk(KERN_WARNING "error: failed kmalloc\n"); + kfree(ei_token); + return -ENOMEM; + } + if (copy_from_user(workspace, buf, max_len)) { + kfree(ei_token); + kfree(workspace); + return -EFAULT; + } + } + + rc = rtas_errinjct(open_token, ei_token, workspace, max_len); + + if (count > token_len) { + kfree(workspace); + } + kfree(ei_token); + + return rc < 0 ? rc : count; +} + +static int ppc_rtas_errinjct_release(struct inode *inode, struct file *file) +{ + int rc; + + rc = rtas_errinjct_close(open_token); + if (rc) { + return rc; + } + open_token = 0; + return 0; +} + +static ssize_t ppc_rtas_errinjct_read(struct file *file, char *buf, + size_t count, loff_t *ppos) +{ + char * buffer; + int i; + int n = 0; + + buffer = (char *)kmalloc(MAX_ERRINJCT_TOKENS * (ERRINJCT_TOKEN_LEN+1), + GFP_KERNEL); + if (!buffer) { + printk(KERN_ERR "error: kmalloc failed\n"); + return -ENOMEM; + } + + for (i = 0; i < MAX_ERRINJCT_TOKENS && ei_token_list[i].value; i++) { + n += sprintf(buffer+n, ei_token_list[i].name); + n += sprintf(buffer+n, "\n"); + } + + if (*ppos >= strlen(buffer)) { + kfree(buffer); + return 0; + } + if (n > strlen(buffer) - *ppos) + n = strlen(buffer) - *ppos; + + if (n > count) + n = count; + + if (copy_to_user(buf, buffer + *ppos, n)) { + kfree(buffer); + return -EFAULT; + } + + *ppos += n; + + kfree(buffer); return n; } diff -Nru a/arch/ppc64/kernel/rtas.c b/arch/ppc64/kernel/rtas.c --- a/arch/ppc64/kernel/rtas.c Fri Jan 30 16:25:35 2004 +++ b/arch/ppc64/kernel/rtas.c Fri Jan 30 16:25:35 2004 @@ -33,6 +33,7 @@ #include struct flash_block_list_header rtas_firmware_flash_list = {0, 0}; +struct errinjct_token ei_token_list[MAX_ERRINJCT_TOKENS]; /* * prom_init() is called very early on, before the kernel text @@ -191,6 +192,10 @@ int order = status - 9900; unsigned long ms; + if (status < RTAS_EXTENDED_DELAY_MIN || + status > RTAS_EXTENDED_DELAY_MAX) + 
return 0; + if (order < 0) order = 0; /* RTC depends on this for -2 clock busy */ else if (order > 5) @@ -423,6 +428,159 @@ return 0; } + +#ifdef CONFIG_RTAS_ERRINJCT +int +rtas_errinjct_open(void) +{ + u32 ret[2]; + int open_token; + int rc; + unsigned int time; + + + while (1) { + /* + * The rc and open_token values are backwards due to a + * misprint in the RPA. + */ + open_token = rtas_call(rtas_token("ibm,open-errinjct"), 0, 2, (void *) &ret); + rc = ret[0]; + + if (rc == RTAS_BUSY) { + continue; + } + + if ((time = rtas_extended_busy_delay_time(rc))) { + udelay(time * 1000); + continue; + } + + if (rc < 0) { + printk(KERN_WARNING "error: ibm,open-errinjct failed (%d)\n", rc); + return rc; + } + + return open_token; + } +} + +int +rtas_errinjct(unsigned int open_token, char * ei_token, char * workspace, size_t workspace_size) +{ + struct errinjct_token * ei; + int rtas_ei_token = -1; + unsigned int time; + int rc = 0; + int i; + + ei = ei_token_list; + for (i = 0; i < MAX_ERRINJCT_TOKENS && ei->name; i++) { + if (strcmp(ei_token, ei->name) == 0) { + rtas_ei_token = ei->value; + break; + } + ei++; + } + if (rtas_ei_token == -1) { + return -EINVAL; + } + + spin_lock(&rtas_data_buf_lock); + + while (1) { + if (rc != RTAS_BUSY && workspace) { + memset(rtas_data_buf, 0, RTAS_DATA_BUF_SIZE); + memcpy(rtas_data_buf, workspace, workspace_size); + } + + rc = rtas_call(rtas_token("ibm,errinjct"), 3, 1, NULL, + rtas_ei_token, open_token, __pa(rtas_data_buf)); + + if (rc == RTAS_BUSY) { + continue; + } + + if ((time = rtas_extended_busy_delay_time(rc))) { + spin_unlock(&rtas_data_buf_lock); + udelay(time * 1000); + spin_lock(&rtas_data_buf_lock); + continue; + } + + if (rc != 0) { + printk(KERN_WARNING "error: ibm,errinjct failed (%d)\n", rc); + } + + spin_unlock(&rtas_data_buf_lock); + + return rc; + } +} + +int +rtas_errinjct_close(unsigned int open_token) +{ + int rc; + unsigned int time; + + while (1) { + rc = rtas_call(rtas_token("ibm,close-errinjct"), 1, 1, NULL, 
open_token); + + if (rc == RTAS_BUSY) { + continue; + } + + if ((time = rtas_extended_busy_delay_time(rc))) { + udelay(time * 1000); + continue; + } + + if (rc != 0) { + printk(KERN_WARNING "error: ibm,close-errinjct failed (%d)\n", rc); + } + + return rc; + } +} + +static int __init rtas_errinjct_init(void) +{ + char * token_array; + char * end_array; + int array_len = 0; + int len; + int i, j; + + token_array = (char *) get_property(rtas.dev, "ibm,errinjct-tokens", + &array_len); + end_array = token_array + array_len; + for (i = 0, j = 0; i < MAX_ERRINJCT_TOKENS && token_array < end_array; i++) { + + len = strnlen(token_array, ERRINJCT_TOKEN_LEN) + 1; + ei_token_list[i].name = (char *) kmalloc(len, GFP_KERNEL); + if (!ei_token_list[i].name) { + printk(KERN_WARNING "error: kmalloc failed\n"); + return -ENOMEM; + } + + strcpy(ei_token_list[i].name, token_array); + token_array += len; + + ei_token_list[i].value = *(int *)token_array; + token_array += sizeof(int); + } + for (; i < MAX_ERRINJCT_TOKENS; i++) { + ei_token_list[i].name = 0; + ei_token_list[i].value = 0; + } + + return 0; + +} +#endif + +__initcall(rtas_errinjct_init); EXPORT_SYMBOL(rtas_firmware_flash_list); diff -Nru a/include/asm-ppc64/rtas.h b/include/asm-ppc64/rtas.h --- a/include/asm-ppc64/rtas.h Fri Jan 30 16:25:35 2004 +++ b/include/asm-ppc64/rtas.h Fri Jan 30 16:25:35 2004 @@ -22,6 +22,11 @@ /* Buffer size for ppc_rtas system call. */ #define RTAS_RMOBUF_MAX (64 * 1024) +/* Error inject defines */ +#define ERRINJCT_TOKEN_LEN 24 /* Max length of an error inject token */ +#define MAX_ERRINJCT_TOKENS 15 /* Max # tokens. 
*/ +#define WORKSPACE_SIZE 1024 + /* RTAS return codes */ #define RTAS_BUSY -2 /* RTAS Return Status - Busy */ #define RTAS_EXTENDED_DELAY_MIN 9900 @@ -141,6 +146,11 @@ unsigned char buffer[1]; /* allocated by klimit bump */ }; +struct errinjct_token { + char * name; + int value; +}; + struct flash_block { char *data; unsigned long length; @@ -178,6 +188,9 @@ extern int rtas_get_sensor(int sensor, int index, int *state); extern int rtas_get_power_level(int powerdomain, int *level); extern int rtas_set_indicator(int indicator, int index, int new_value); +extern int rtas_errinjct_open(void); +extern int rtas_errinjct(unsigned int, char *, char *, size_t); +extern int rtas_errinjct_close(unsigned int); /* Given an RTAS status code of 9900..9905 compute the hinted delay */ unsigned int rtas_extended_busy_delay_time(int status); @@ -187,6 +200,7 @@ } extern void pSeries_log_error(char *buf, unsigned int err_type, int fatal); +extern struct errinjct_token ei_token_list[MAX_ERRINJCT_TOKENS]; /* Error types logged. */ #define ERR_FLAG_ALREADY_LOGGED 0x0
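
The read-side pattern this thread converged on (clamp the length against the file offset and the caller's count, copy, and return -EFAULT if the copy fails) can be modelled in user space. What follows is only a minimal sketch, not code from the patch: fake_copy_to_user() and bounded_read() are hypothetical names, and a NULL destination is assumed to stand in for a faulting user pointer. The one convention taken from the kernel is that copy_to_user() returns the number of bytes it could not copy, so any non-zero result is treated as a fault.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical user-space stand-in for copy_to_user(). Like the kernel
 * helper, it returns the number of bytes NOT copied (0 on success).
 * A NULL destination is used here to model a faulting user pointer.
 */
static size_t fake_copy_to_user(char *dst, const char *src, size_t n)
{
    if (dst == NULL)
        return n;
    memcpy(dst, src, n);
    return 0;
}

/*
 * Model of a read handler: hand out at most `count` bytes of `src`
 * starting at *ppos, advance *ppos only when the copy succeeded, and
 * return 0 at end of data or a negative errno value on failure.
 */
static long bounded_read(char *dst, size_t count, long long *ppos,
                         const char *src, size_t src_len)
{
    size_t n;

    assert(ppos != NULL && src != NULL);

    if (*ppos < 0)
        return -EINVAL;
    if ((unsigned long long)*ppos >= src_len)
        return 0;                       /* nothing left to read */

    n = src_len - (size_t)*ppos;        /* bytes remaining */
    if (n > count)
        n = count;                      /* never exceed the caller's buffer */

    if (fake_copy_to_user(dst, src + *ppos, n) != 0)
        return -EFAULT;                 /* offset is left untouched */

    *ppos += (long long)n;
    return (long)n;
}
```

Note that the offset is advanced only after a successful copy, so a caller that gets -EFAULT can retry from the same position.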