[K42-discussion] Linux Dynamic Upgrade

Christopher Yeoh cyeoh at au1.ibm.com
Tue Oct 3 10:04:20 EST 2006


Will Schmidt (LTC) has been looking at doing a Linux implementation of the
K42 dynamic upgrade. For those who are interested, here's a forward of
some email about it....

From: Will Schmidt <will_schmidt at vnet.ibm.com>
Sender: team-bounces at ozlabs.au.ibm.com
To: team at ozlabs.au.ibm.com
Subject: [Team] Dynamic Update (work in progress.. )
Date: Fri, 29 Sep 2006 16:36:37 -0500

Hi Folks,
   looking for commentary, thoughts, what part needs more oomph, etc.
For now sending just to team at oz.

This is a work in progress..   A sample of dynamic upgrade for Linux.
This is loosely based on the papers written about K42's
implementation.  

The loopback code was chosen, as it seemed like it would be a
straightforward place to get a demo going.

An overview.. 

The idea is to allow a module to be upgraded on the fly, without
requiring that the module be unloaded, filesystems be unmounted, etc.  

To me, the most likely scenario will involve a bug being discovered, code
getting fixed, modules being rebuilt, and then trying to load the new
module on top of the old one...

Because modutils seems to resist my attempts to load a module multiple
times, I had to cheat a bit.  I created loopX.c symlinks to loop.c, so
during my build I effectively get multiple copies of the same module
(loop2, loop3, loop4, etc.).  Longer term, this might be fixed via an
enhancement to modprobe/modutils.  Maybe. :-)

Because modutils/kernel gets cranky if I try to load multiple copies of
the same module (due to non-static symbols clashing), I first needed to
find all the non-static functions in loop.c and put them elsewhere,
where they won't cause problems.  Thus functions like
loop_register_transfer() and loop_unregister_transfer() get moved to a
new module/file, loop_core.c.  The underlying loop_device and gendisk
structures (*loop_dev and **disks) get moved too.

Next, I needed a way to determine whether I was the first instance or an
update instance.  For this I key off of the register_blkdev() call in
loop_init().  If that call fails, I assume I'm an update and call into
update_loop_init() instead of continuing through loop_init().  The
update_loop_init() function loops through the disk devices and changes
their fops pointers to point at the new switcher_fops, which has just a
bit of logic to toggle between the original fops and the new fops (more
on that in a bit).  Piggybacking on the register_blkdev() call doesn't
seem clean or safe, since any other registration failure is also treated
as an update, but it appears sufficient for this demo, as an easy
alternative doesn't come to mind.
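
A minimal userspace sketch of that first-instance-or-update decision,
for illustration only.  Everything prefixed demo_ is hypothetical and
merely stands in for register_blkdev() and loop_init(); it is not code
from the patch.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's table of block majors. */
#define DEMO_MAX_MAJORS 16
static const char *demo_owner[DEMO_MAX_MAJORS];

/* Returns 0 on success and nonzero on any failure, like the check in loop_init(). */
static int demo_register_major(unsigned int major, const char *name)
{
	if (major >= DEMO_MAX_MAJORS || demo_owner[major] != NULL)
		return -1;
	demo_owner[major] = name;
	return 0;
}

enum demo_path { DEMO_FRESH_INIT, DEMO_UPDATE_INIT };

/* Mirrors the patched loop_init(): a failed registration selects the update path. */
static enum demo_path demo_module_init(unsigned int major)
{
	if (demo_register_major(major, "loop"))
		return DEMO_UPDATE_INIT;
	return DEMO_FRESH_INIT;
}
```

The first demo_module_init(7) takes the fresh path and a second call
takes the update path.  An out-of-range major takes the update path as
well, which is the "not clean or safe" part: every failure looks like an
upgrade.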

And the real change comes next..  I've got two new _fops structures
involved.  The first is preserved_lo_fops, which contains pointers
back to the original lo_fops functions; the second is switcher_fops,
which points to controlling functions that direct the calls between
the new and old versions.
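
The relationship between the two tables can be sketched in plain
userspace C.  This is a simplified illustration with a single open
operation; the demo_ names and the use_new flag are assumptions made for
the example and do not appear in the patch.

```c
#include <stddef.h>

struct demo_ops {
	int (*open)(int unit);
};

struct demo_disk {
	const struct demo_ops *ops;	/* plays the role of disk->fops */
};

static int old_open(int unit) { (void)unit; return 1; }	/* old module's code */
static int new_open(int unit) { (void)unit; return 2; }	/* update module's code */

static const struct demo_ops old_ops = { .open = old_open };
static const struct demo_ops new_ops = { .open = new_open };	/* like lo_fops */
static struct demo_ops preserved_ops;				/* like preserved_lo_fops */

static int use_new;	/* hypothetical policy flag; the patch uses counters */

/* The controlling function: routes each call to the old or the new code. */
static int open_switcher(int unit)
{
	return use_new ? new_ops.open(unit) : preserved_ops.open(unit);
}

static const struct demo_ops switcher_ops = { .open = open_switcher };

/* Save the current function pointers, then interpose the switcher. */
static void demo_install_switcher(struct demo_disk *disk)
{
	preserved_ops = *disk->ops;
	disk->ops = &switcher_ops;
}

static struct demo_disk demo_disk0 = { .ops = &old_ops };
```

After demo_install_switcher(&demo_disk0), callers still go through
demo_disk0.ops->open() as before; only the table behind the pointer has
changed, so existing users of the disk need not be disturbed.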

For the switching logic, in this case I'm just using counters with
arbitrary threshold values to determine when to call the new version of
the function, and another arbitrary counter to trigger when to update the
fops pointer to bypass switcher_fops completely and call the 'new'
functions directly.  This is where some fancier RCU sort of code could
be involved.
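
As a rough illustration of the counter logic, here is a userspace sketch
that uses C11 atomics as a stand-in for whatever kernel primitive would
really be chosen.  It is not RCU, it collapses the patch's separate
counters into one, and the demo_ names and threshold values are
assumptions made for the example.

```c
#include <stdatomic.h>

#define SWITCHOVER_COUNT 3	/* calls before the new code is used */
#define FORCEOVER_COUNT  5	/* calls before the switcher removes itself */

typedef int (*demo_ioctl_fn)(int cmd);

static int old_ioctl(int cmd) { return cmd + 100; }
static int new_ioctl(int cmd) { return cmd + 200; }
static int ioctl_switcher(int cmd);

/* The published entry point, analogous to the disk's fops pointer. */
static _Atomic(demo_ioctl_fn) demo_ioctl = ioctl_switcher;
static atomic_ulong demo_calls;

static int ioctl_switcher(int cmd)
{
	/* Number of this call, counted atomically so concurrent callers agree. */
	unsigned long n = atomic_fetch_add(&demo_calls, 1) + 1;

	if (n < SWITCHOVER_COUNT)
		return old_ioctl(cmd);
	if (n > FORCEOVER_COUNT)
		atomic_store(&demo_ioctl, new_ioctl);	/* later calls bypass the switcher */
	return new_ioctl(cmd);
}

/* What a caller does: load the current entry point, then call through it. */
static int demo_call_ioctl(int cmd)
{
	demo_ioctl_fn fn = atomic_load(&demo_ioctl);
	return fn(cmd);
}
```

The atomic store only makes publishing the new pointer safe.  Knowing
when no caller is still executing the old code is a separate problem,
and that is the part a grace-period mechanism such as RCU would have to
cover.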

Other comments on the code.. 
	I used #if 0 blocks to fence off the portions moved to loop_core.c, just
to keep the patch smaller.
	After moving the transfer_none and none_funcs references out of loop.c,
the build complains about incompatible types.  My attempts to cast those
errors away weren't successful, and I'm not sure why my incantations
didn't work.
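
One possible cause, offered as a guess rather than something the message
confirms: loop_core.h in the patch declares none_funcs and xor_funcs as
pointers, while loop_core.c defines them as structure objects, and it
declares transfer_none() with a by-value struct loop_device first
parameter where the definition takes a pointer.  loop_core.c does not
include loop_core.h, so the compiler never sees the two side by side;
the mismatch would only surface in loop.c, for example as
&none_funcs having type struct loop_func_table **.  The sketch below
shows matching declarations with hypothetical demo_ names.

```c
#include <stddef.h>

struct demo_dev { int key; };

struct demo_func_table {
	int number;
	int (*transfer)(struct demo_dev *dev, int cmd);
};

/*
 * Shared-header part.  Each declaration must have the same type as its
 * definition.  These variants would NOT match the definitions below:
 *   extern struct demo_func_table *demo_none_funcs;       (a pointer)
 *   extern int demo_transfer_none(struct demo_dev, int);  (by-value arg)
 */
extern int demo_transfer_none(struct demo_dev *dev, int cmd);
extern struct demo_func_table demo_none_funcs;

/* Definitions, as they would appear in the core file. */
int demo_transfer_none(struct demo_dev *dev, int cmd)
{
	(void)dev;
	return cmd;
}

struct demo_func_table demo_none_funcs = {
	.number = 0,
	.transfer = demo_transfer_none,
};

/* A user of the header: taking the address yields the expected pointer type. */
static struct demo_func_table *demo_pick_xfer(void)
{
	return &demo_none_funcs;
}
```

Including the shared header from the file that holds the definitions
lets the compiler flag this kind of drift at build time.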

	That's enough babble for the moment..  code attached.  :-)
-Will

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 7b3b94d..bdf595e 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -75,10 +75,23 @@ #include <linux/gfp.h>
 
 #include <asm/uaccess.h>
 
-static int max_loop = 8;
-static struct loop_device *loop_dev;
-static struct gendisk **disks;
+#include "loop_core.h"
 
+static unsigned long open_count=0;
+static unsigned long ioctl_count=0;
+static unsigned long release_count=0;
+static unsigned long force_count=0;
+
+
+#define DBG \
+		printk(KERN_INFO "%s called %s\n", __FILE__, __FUNCTION__);
+
+#define SWITCHOVER_OPEN_COUNT 5
+#define SWITCHOVER_IO_COUNT 10
+#define SWITCHOVER_RELEASE_COUNT 20
+#define FORCEOVER_COUNT 15
+
+#if 0
 /*
  * Transfer functions
  */
@@ -153,6 +166,7 @@ static struct loop_func_table *xfer_func
 	&none_funcs,
 	&xor_funcs
 };
+#endif
 
 static loff_t get_loop_size(struct loop_device *lo, struct file *file)
 {
@@ -851,6 +865,7 @@ static int loop_set_fd(struct loop_devic
 	return error;
 }
 
+#if 0
 static int
 loop_release_xfer(struct loop_device *lo)
 {
@@ -866,6 +881,7 @@ loop_release_xfer(struct loop_device *lo
 	}
 	return err;
 }
+#endif
 
 static int
 loop_init_xfer(struct loop_device *lo, struct loop_func_table *xfer,
@@ -1144,6 +1160,7 @@ static int lo_ioctl(struct inode * inode
 	struct loop_device *lo = inode->i_bdev->bd_disk->private_data;
 	int err;
 
+	DBG
 	mutex_lock(&lo->lo_ctl_mutex);
 	switch (cmd) {
 	case LOOP_SET_FD:
@@ -1178,6 +1195,7 @@ static int lo_open(struct inode *inode, 
 {
 	struct loop_device *lo = inode->i_bdev->bd_disk->private_data;
 
+	DBG
 	mutex_lock(&lo->lo_ctl_mutex);
 	lo->lo_refcnt++;
 	mutex_unlock(&lo->lo_ctl_mutex);
@@ -1189,6 +1207,7 @@ static int lo_release(struct inode *inod
 {
 	struct loop_device *lo = inode->i_bdev->bd_disk->private_data;
 
+	DBG
 	mutex_lock(&lo->lo_ctl_mutex);
 	--lo->lo_refcnt;
 	mutex_unlock(&lo->lo_ctl_mutex);
@@ -1203,6 +1222,9 @@ static struct block_device_operations lo
 	.ioctl =	lo_ioctl,
 };
 
+static struct block_device_operations preserved_lo_fops = {
+};
+
 /*
  * And now the modules code and kernel interface.
  */
@@ -1211,6 +1233,7 @@ MODULE_PARM_DESC(max_loop, "Maximum numb
 MODULE_LICENSE("GPL");
 MODULE_ALIAS_BLOCKDEV_MAJOR(LOOP_MAJOR);
 
+#if 0
 int loop_register_transfer(struct loop_func_table *funcs)
 {
 	unsigned int n = funcs->number;
@@ -1243,9 +1266,102 @@ int loop_unregister_transfer(int number)
 
 	return 0;
 }
+#endif
+
+static int force_new_disk_fops(struct inode *inode)
+{
+	/* force the 'new' fops. */
+	inode->i_bdev->bd_disk->fops=&lo_fops;
+	return 0;
+}
+
+static int preserve_old_disk_fops(struct gendisk *disk)
+{
+	/* preserve the existing fops pointers in preserved_lo_fops */
+	preserved_lo_fops.owner = disk->fops->owner;
+	preserved_lo_fops.open = disk->fops->open;
+	preserved_lo_fops.release= disk->fops->release;
+	preserved_lo_fops.ioctl= disk->fops->ioctl;
+	/* TODO - add smarts here to check versioning.. */
+	return 0;
+}
+
+static int lo_ioctl_switcher(struct inode *inode, struct file *file,
+	unsigned int cmd, unsigned long arg)
+{
+	ioctl_count++;
+	force_count++;
+	DBG
+	if ( ioctl_count < SWITCHOVER_IO_COUNT )
+		return preserved_lo_fops.ioctl(inode,file,cmd,arg);
+	else {
+		if (force_count > FORCEOVER_COUNT )
+			force_new_disk_fops(inode);/* next time around, bypass the switcher funcs */
+		return lo_fops.ioctl(inode,file,cmd,arg);
+	}
+}
+
+static int lo_open_switcher(struct inode *inode, struct file *file)
+{
+	open_count++;
+	DBG
+
+	if ( open_count < SWITCHOVER_OPEN_COUNT )
+		return preserved_lo_fops.open(inode,file);
+	else
+		return lo_fops.open(inode,file);
+}
+
+static int lo_release_switcher(struct inode *inode, struct file *file)
+{
+	release_count++;
+	DBG
+
+	if ( release_count < SWITCHOVER_RELEASE_COUNT )
+		return preserved_lo_fops.release(inode,file);
+	else
+		return lo_fops.release(inode,file);
+}
+
+static struct block_device_operations switcher_fops = {
+	.owner =	THIS_MODULE,
+	.open =		lo_open_switcher,
+	.release =	lo_release_switcher,
+	.ioctl =	lo_ioctl_switcher,
+};
+
+/*
+ * this is functionally a subset of loop_init, which bypasses the
+ * initialization portions of loop_init that would otherwise
+ * prevent us from using the existing disk and loop structures.
+ */
+static int update_loop_init(void)
+{
+	int i;
+	printk(KERN_WARNING "loop: proceeding with "
+				"assumption that this is an upgrade path\n");
+
+	printk(KERN_INFO "module update setup path \n");
+
+	/* need to check for errors, and return -EIO to simulate the
+	 blkdev failed reference I hijacked to get here. */
+
+	for (i = 0; i < max_loop; i++) {
+		struct loop_device *lo = &loop_dev[i];
+		struct gendisk *disk = disks[i];
+		preserve_old_disk_fops(disk);
+		disk->fops = &switcher_fops;
+		printk(KERN_INFO "disks[%d] is %p \n",i,&disks[i]);
+		printk(KERN_INFO "loop_dev[%d] is %p \n",i,&loop_dev[i]);
+	}
+	printk(KERN_INFO "loop: update_loop_init completed\n");
+	return 0;
+}
 
+#if 0
 EXPORT_SYMBOL(loop_register_transfer);
 EXPORT_SYMBOL(loop_unregister_transfer);
+#endif
 
 static int __init loop_init(void)
 {
@@ -1258,7 +1374,7 @@ static int __init loop_init(void)
 	}
 
 	if (register_blkdev(LOOP_MAJOR, "loop"))
-		return -EIO;
+		return update_loop_init();
 
 	loop_dev = kmalloc(max_loop * sizeof(struct loop_device), GFP_KERNEL);
 	if (!loop_dev)
@@ -1279,6 +1395,7 @@ static int __init loop_init(void)
 		struct loop_device *lo = &loop_dev[i];
 		struct gendisk *disk = disks[i];
 
+		printk(KERN_INFO "normal setup path \n");
 		memset(lo, 0, sizeof(*lo));
 		lo->lo_queue = blk_alloc_queue(GFP_KERNEL);
 		if (!lo->lo_queue)
--- /dev/null	2006-07-25 15:29:37.852970584 -0500
+++ drivers/block/loop_core.c	2006-09-27 15:12:25.000000000 -0500
@@ -0,0 +1,173 @@
+/*
+ *  linux/drivers/block/loop_core.c
+ */
+
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/stat.h>
+#include <linux/errno.h>
+#include <linux/major.h>
+#include <linux/wait.h>
+#include <linux/blkdev.h>
+#include <linux/blkpg.h>
+#include <linux/init.h>
+#include <linux/smp_lock.h>
+#include <linux/swap.h>
+#include <linux/slab.h>
+#include <linux/loop.h>
+#include <linux/suspend.h>
+#include <linux/writeback.h>
+#include <linux/buffer_head.h>		/* for invalidate_bdev() */
+#include <linux/completion.h>
+#include <linux/highmem.h>
+#include <linux/gfp.h>
+
+#include <asm/uaccess.h>
+
+struct loop_device *loop_dev;
+EXPORT_SYMBOL(loop_dev);
+
+struct gendisk **disks;
+EXPORT_SYMBOL(disks);
+
+
+int max_loop = 8;
+EXPORT_SYMBOL(max_loop);
+
+/*
+ * Transfer functions
+ */
+int transfer_none(struct loop_device *lo, int cmd,
+			 struct page *raw_page, unsigned raw_off,
+			 struct page *loop_page, unsigned loop_off,
+			 int size, sector_t real_block)
+{
+	char *raw_buf = kmap_atomic(raw_page, KM_USER0) + raw_off;
+	char *loop_buf = kmap_atomic(loop_page, KM_USER1) + loop_off;
+
+	if (cmd == READ)
+		memcpy(loop_buf, raw_buf, size);
+	else
+		memcpy(raw_buf, loop_buf, size);
+
+	kunmap_atomic(raw_buf, KM_USER0);
+	kunmap_atomic(loop_buf, KM_USER1);
+	cond_resched();
+	return 0;
+}
+EXPORT_SYMBOL(transfer_none);
+
+static int transfer_xor(struct loop_device *lo, int cmd,
+			struct page *raw_page, unsigned raw_off,
+			struct page *loop_page, unsigned loop_off,
+			int size, sector_t real_block)
+{
+	char *raw_buf = kmap_atomic(raw_page, KM_USER0) + raw_off;
+	char *loop_buf = kmap_atomic(loop_page, KM_USER1) + loop_off;
+	char *in, *out, *key;
+	int i, keysize;
+
+	if (cmd == READ) {
+		in = raw_buf;
+		out = loop_buf;
+	} else {
+		in = loop_buf;
+		out = raw_buf;
+	}
+
+	key = lo->lo_encrypt_key;
+	keysize = lo->lo_encrypt_key_size;
+	for (i = 0; i < size; i++)
+		*out++ = *in++ ^ key[(i & 511) % keysize];
+
+	kunmap_atomic(raw_buf, KM_USER0);
+	kunmap_atomic(loop_buf, KM_USER1);
+	cond_resched();
+	return 0;
+}
+
+static int xor_init(struct loop_device *lo, const struct loop_info64 *info)
+{
+	if (unlikely(info->lo_encrypt_key_size <= 0))
+		return -EINVAL;
+	return 0;
+}
+
+struct loop_func_table none_funcs = {
+	.number = LO_CRYPT_NONE,
+	.transfer = transfer_none,
+}; 	
+EXPORT_SYMBOL(none_funcs);
+
+struct loop_func_table xor_funcs = {
+	.number = LO_CRYPT_XOR,
+	.transfer = transfer_xor,
+	.init = xor_init
+}; 	
+EXPORT_SYMBOL(xor_funcs);
+
+/* xfer_funcs[0] is special - its release function is never called */
+struct loop_func_table *xfer_funcs[MAX_LO_CRYPT] = {
+	&none_funcs,
+	&xor_funcs
+};
+EXPORT_SYMBOL(xfer_funcs);
+
+
+int loop_register_transfer(struct loop_func_table *funcs)
+{
+	unsigned int n = funcs->number;
+
+	if (n >= MAX_LO_CRYPT || xfer_funcs[n])
+		return -EINVAL;
+	xfer_funcs[n] = funcs;
+	return 0;
+}
+EXPORT_SYMBOL(loop_register_transfer);
+
+int
+loop_release_xfer(struct loop_device *lo)
+{
+	int err = 0;
+	struct loop_func_table *xfer = lo->lo_encryption;
+
+	if (xfer) {
+		if (xfer->release)
+			err = xfer->release(lo);
+		lo->transfer = NULL;
+		lo->lo_encryption = NULL;
+		module_put(xfer->owner);
+	}
+	return err;
+}
+EXPORT_SYMBOL(loop_release_xfer);
+
+int loop_unregister_transfer(int number)
+{
+	unsigned int n = number;
+	struct loop_device *lo;
+	struct loop_func_table *xfer;
+
+	if (n == 0 || n >= MAX_LO_CRYPT || (xfer = xfer_funcs[n]) == NULL)
+		return -EINVAL;
+
+	xfer_funcs[n] = NULL;
+
+	for (lo = &loop_dev[0]; lo < &loop_dev[max_loop]; lo++) {
+		mutex_lock(&lo->lo_ctl_mutex);
+
+		if (lo->lo_encryption == xfer)
+			loop_release_xfer(lo);
+
+		mutex_unlock(&lo->lo_ctl_mutex);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(loop_unregister_transfer);
+
+MODULE_LICENSE("GPL");
+
--- /dev/null	2006-07-25 15:29:37.852970584 -0500
+++ drivers/block/loop_core.h	2006-09-28 13:20:48.000000000 -0500
@@ -0,0 +1,18 @@
+
+
+
+extern int max_loop;
+
+extern struct loop_device *loop_dev;
+extern struct gendisk **disks;
+
+extern int transfer_none(struct loop_device , int ,
+			 struct page *, unsigned ,
+			 struct page *, unsigned ,
+			 int , sector_t );
+
+extern struct loop_func_table *none_funcs;
+extern struct loop_func_table *xor_funcs;
+extern struct loop_func_table *xfer_funcs[];
+
+extern int loop_release_xfer(struct loop_device *);
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 410f259..85893de 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -14,7 +14,11 @@ obj-$(CONFIG_ATARI_ACSI)	+= acsi.o
 obj-$(CONFIG_ATARI_SLM)		+= acsi_slm.o
 obj-$(CONFIG_AMIGA_Z2RAM)	+= z2ram.o
 obj-$(CONFIG_BLK_DEV_RAM)	+= rd.o
+obj-$(CONFIG_BLK_DEV_LOOP)	+= loop_core.o
 obj-$(CONFIG_BLK_DEV_LOOP)	+= loop.o
+obj-$(CONFIG_BLK_DEV_LOOP)	+= loop2.o
+obj-$(CONFIG_BLK_DEV_LOOP)	+= loop3.o
+obj-$(CONFIG_BLK_DEV_LOOP)	+= loop4.o
 obj-$(CONFIG_BLK_DEV_PS2)	+= ps2esdi.o
 obj-$(CONFIG_BLK_DEV_XD)	+= xd.o
 obj-$(CONFIG_BLK_CPQ_DA)	+= cpqarray.o
_______________________________________________
Team mailing list
Team at ozlabs.au.ibm.com
http://ozlabs.au.ibm.com/mailman/listinfo/team

From: Nathan Lynch <nathanl at austin.ibm.com>
Sender: team-bounces at ozlabs.au.ibm.com
To: will_schmidt at vnet.ibm.com
Cc: team at ozlabs.au.ibm.com
Subject: Re: [Team] Dynamic Update (work in progress.. )
Date: Sat, 30 Sep 2006 23:02:56 -0500

On Fri, 2006-09-29 at 16:36 -0500, Will Schmidt wrote:
> Hi Folks, 
>    looking for commentary, thoughts, what part needs more oomph, etc.
> For now sending just to team at oz.  
> 
> This is a work in progress..   A sample of dynamic upgrade for Linux.
> This is loosely based on the papers written about K42's
> implementation.  
> 
> The loopback code was chosen, as it seemed like it would be a
> straightforward place to get a demo going.
> 
> An overview.. 
> 
> The idea is to allow a module to be upgraded on the fly, without
> requiring that the module be unloaded, filesystems be unmounted, etc.  
> 
> To me, the most likely scenario will involve a bug being discovered, code
> getting fixed, modules being rebuilt, and then trying to load the new
> module on top of the old one...   


While I think live patching of kernel bugs is a worthy pursuit, I don't
think this approach scales.  Making every driver "dynamic upgrade-aware"
would really make the code more difficult to read and maintain.  As you
illustrate with your treatment of updating the file ops, there tend to
be hairy (and perhaps unavoidable) race conditions involved.

My belief is that the class of kernel bugs that actually lend themselves
to live patching is relatively small, and that efforts to support live
patching should be proportionate.  I think a fair amount of the
infrastructure for that already exists with kprobes.  In fact, as an
exercise, I did implement a kprobes-based patch for the recent
gettimeofday problem -- no offense, but I think it's no sicker than
this ;-)

You might review some of the dynamic/static tracing discussion on lkml
lately; there have been some interesting ideas flying around that could
be applicable to this sort of thing.


-- 
Nathan Lynch <nathanl at austin.ibm.com>
