Spent a day tracking down bugs in my skb-reservation patch mk II; turns out that one of the things I was chasing is a generic 2.3.35 bug. I'll look at that tomorrow (not much has changed in the network code, so probably most productive would be to see if it's in 2.3.34).
Alexey didn't like the last version, due to lack of generality. Hope he likes this one more. If not, I'll have to think hard.
Had dinner with Stephen Rothwell, Adrienne his wife and their kids Anthony and Jacqui; had a really great time. I guess the married guys in the office feel I could use some real cooking once in a while, and if this is the standard of hospitality I can expect, I should encourage it...
Busy running test suite; fixing bugs left and right, as expected. Made it through the packetfilter and conntrack part of the suite: just the NAT and backwards compatibility to go.
Today's hint: a spin_lock_bh() is not enough to stop timers from going off on SMP machines.
The new code is definitely more refined. I'll be testing it and benching it against my good old real packet dumps (that's going to take some time, since they are in tcpdump format, and I want to write some decent playback tools).
Adding another 2GB to the system reminds me: I need to get on top of the backup situation RSN; the current two-big-disks approach isn't going to scale (I'm thinking a dedicated box with raid-5...).
With skb field reservation, I can have the ftp conntrack module actually store the offset and length of the address within the IP packet for use by the ftp NAT module, to avoid duplicated effort. That implies that it's a good idea.
Tomorrow (Christmas) I'm forcing myself to take the day off, but on boxing day I attack the NAT code, then it's down to testing...
From: Rusty Russell <rusty@linuxcare.com.au>
To: torvalds@transmeta.com, alan@lxorguk.ukuu.org.uk
Subject: [PATCH] Trivial name typo.
Date: Thu, 23 Dec 1999 15:24:07 +1100

Just noticed this... 2.2 and 2.3. This Russel disease must be stamped out before it becomes widespread.

Rust.

--- linux-2.2/net/core/dev.c.~1~ Sun Dec 5 13:24:45 1999
+++ linux-2.2/net/core/dev.c Thu Dec 23 15:20:21 1999
@@ -56,7 +56,7 @@
  * Adam Sulmicki : Bug Fix : Network Device Unload
  * A network device unload needs to purge
  * the backlog queue.
- * Paul Rusty Russel : SIOCSIFNAME
+ * Paul Rusty Russell : SIOCSIFNAME
  */

 #include <asm/uaccess.h>
--
Hacking time.

Here's Linus' response, cc'd to Alan:
On Thu, 23 Dec 1999, Rusty Russell wrote:
>
> Just noticed this... 2.2 and 2.3. This Russel disease must be stamped
> out before it becomes widespread.

There's a serious shortage of the etter "", and we're trying to seriousy cut down our usage of the etter in order to improve conditions in the worst affected areas. ettes "" an "" ae aso affecte, and may une cetain cicumstances be epace with the ette "x" which is in pentifu suppy.

Patch appie,

inus
Got 4 patches in the pipe for linux-kernel; they're piling up enough for me to actually create a `patches' mail folder so I don't drop any.
Jan Harkes finally caught a clean netfilter oops, which explains a number of problems people have been seeing.
New netfilter release tonight, after I figured out why fragments usually weren't being forwarded. Wanted to test on the SMP box, but the damn thing has the hard drive on IDE3; I'll wait for a PC guru tomorrow to figure out how to deal with this and concentrate on tonight's release.
No Christmas for Rusty this year. I'll be taking the holidays themselves off (if you don't do that it's just simply too depressing), but the rest of the time will be playing catchup with netfilter, SMP and User Mode Linux.
Alexey (rapidly becoming my hero) corrected me on concurrency in 2.3.x: FYI here it is:
Subject: Re: Concurrency within netfilter hooks
Date: Tue, 14 Dec 1999 17:53:10 +0300 (MSK)

Hello!

> For 2.4, it won't happen, except for packets from userspace being
> interrupted by bottom halves and timers,

Processes from userspace really overlap since 2.3.15.

> but this is changing: you can look into Alexey's crystal ball at

It is not necessary to look into magic crystals. 8)

- Hooks, executed in process context, i.e. all output, post-routing etc. must be multithreaded.
- Hooks (and all the code), usually executed from net_bh (input, forwarding) also must be multithreaded, but not softnet is reason for this. Netfilter itself creates concurrency in all the paths, which used to be executed in net_bh context, when it reinjects packets.

Essentially, softnet adds __nothing__ new to these rules, except for one thing: concurency becomes common, rather than marginal phenomenon in all the paths.

Essentially, it is the main argument, why I do not jest when proposing to add softnet before 2.4. All the complexity and all the bugs are already in 2.3 and softnet only clarifies code and fixes bugs. 8)8)

Alexey
Housewarming was last night, which was fun. Kinda quiet, but what do you expect from Canberra? Paulus was still recovering from his recent return from SF, but everyone else made it, including Hugh and Lucy's 2-week old baby girl Rachael, who was pretty well behaved. My lovely ex-flatmates turned up to make sure I really did have somewhere else to live, and there was no chance of me trying to move back in with them 8-).
Fragment problems still blowing up the test suite. Fragments suck rocks. People forwarding fragments through connection tracking are going to see really bad performance. We have to defragment, then the forward code refragments, then we refragment.
Tridge called around this morning seeking a vote on the latest hire: looks good.
User Mode Linux package release scheduled for Monday the 20th; we should make it in time.
Getting some (justified) flak for netfilter bugs. Need a new release this weekend, since gargle has been rock solid for 4 days for me. After this release I'll start running all of Linuxcare Ozlabs through it.
My 386 seems to be netfiltering perfectly. And fairly fast, now I suppressed logging. Go figure; those bug reports must have been a subversive Microsoft plant 8-).
Most of today was spent greeting people and drinking coffee; slow day. I've noticed that Tridge is starting to get edgy not coding, and he's really central to the office; I think that getting used to all the weirdness of being involved in a pre-IPO company is starting to get to us all.
iptables-save is written, I need to write iptables-restore. Also on my TODO list is the branch for netfilter 1.0 (the NAT replacement), which requires the new skb reservation code, which I need to feed to Alexey...
Finally got around to watching the last 3 episodes of Babylon 5; I think they lost some impact given the months-long hiatus I had before finally getting around to it (I'm not a TV person), but I definitely have that melancholy `end of a good long book' feeling after 5 years. I do plan on watching the entire thing again sometime, maybe one season a week. One day.
Alan pulled my ipfw patch out for 2.2.14-pre11, because it seemed a likely candidate for memory corruption, and the problem has gone away. Symptoms are kmalloc corruption, which looks unlikely with my patch, which only alters locking. There's one dodgy `I assume this is safe' thing I did, which I'll try reverting.
Separated out the logging line-count patch and sent it to him separately in the meantime.
Sent off patch to Alexey for a taste-test, see if he likes it; it solves many of my dealing-with-fragment issues.
Camera crews came and went yesterday to interview Tridge and Dave Mandala. The interview showed tonight, and came across really well I thought.
Discovered that my sense of humor is not always appreciated by people I work with. At least it's not boring (Sam J. Bushell once had a T-shirt I always admired which went something like: `Where I come from, my behaviour is considered orthodox'). I guess I'm a goof.
Revised Makefile tags patch again; `this time for sure'. We'll see if Linus digests the last one before deciding how to feed him this one.
Got keys for the new place, and moving in tonight; expect to spend the next few weeks shopping for odds and ends. Housewarming is on the 11th December (Paul Mackerras should be back by then).
Wrote the first cut of a User Mode Kernel HOWTO, which I hope Jeff polishes a little and we can then expand and release. Need to hack on module support too; if I'm really lucky, Marc Boucher will find my bug and make a netfilter release before I do (now I sorted out his samba.org CVS access).
Bugger; just found another fragment problem. I think I'll have to ask Alexey to move the
Signed the lease today on the apartment; move in tomorrow after work. The inimitable Miguel de Icaza sent a congratulatory EMail in his inimitable effusive style (Subject: WOOOHOOO!) on my move to Linuxcare.
Did some work on User Mode Linux yesterday; incorporated in a new release. I'm trying to convince Jeff Dike to take it to the next level with a core team, regular releases with announcements, and get some real momentum up. This is an extremely important development, since otherwise you can't really debug a kernel without duplicate hardware or VMWare; and everyone knows what I think about making kernel development dependent on proprietary software.
Moving into apartment Wednesday; Real Estate agents suck. Booked travel back to Adelaide for a few days to gather my stuff.
For future reference: don't stay with friends for longer than three weeks at the outside, however wonderful they may be; they're great to live with, but for a moment consider that I may not be.
This week I will do another netfilter release; preferably with the fragment and crash fixes; I found a race, thanks to discussions with Paul Mackerras (Linux PPC legend and all-round nice guy), and reworked some locking, but I don't think that's the problem.
Compressed read-only loopback also needs a release; there seems to be a great deal of interest in this code, so I'm brushing it up for inclusion in 2.4 as experimental.
I must say living with two attractive intelligent women who regard a bath towel as suitable morning attire has been a wonderful experience, but I'm finally moving into my new place on Tuesday; inner-city furnished two bedroom apartment. Two-bedroom so I can finally return some of the offers to crash at other hackers' places around the world (thanks guys!) without having to resort to the sofa.
The trip was good, but hell. I'll take a couple of days off, and just rest; when I get too little sleep for an extended period, I get oversensitive and generally useless. This weekend was filled with interviews, dinners and flying to Melbourne for an engagement party, so I didn't manage to get my recovery time.
There's a netfilter bug in masquerading local packets. Must look into it.
Netfilter interest is picking up as more people realize that we're missing functionality, and that it's fairly easy to do it. I'm hoping that Daniel Stone will come up with a workable IRC module. Fingers crossed.
I've been distracted by this trip; and today was a particularly poor day towards the end. I'm disappointed not to be getting to Ottawa this trip; maybe next time.
Turns out that I can get a flight from here (SF) to Montreal for US$400; given the amount (and quality) of work which Marc Boucher has done on netfilter recently, I can't turn down the opportunity to meet him, even if it means another four flights (via Chicago).
In my copious free time, I'm trying to write a compressed block device for the next generation of the LinuxCare Bootable Business Card, because it looks like a fun hack. I'm also trying to talk to everyone here about what they're doing, and what's happening; it's heaps of fun.
Marc Boucher submitted a patch against CVS already; a nice fix for ftp (which the testsuite should have found, but didn't, because it's too simple).
I trialled Andrew Tridgell's separate-the-men-from-the-boys support questionnaire today; took me 1 hour, I got 6/10, and I didn't do it properly. If anyone beats me, they can have my job.
Finally printed and read Alexey's documentation on the `ip' command last night, and was stunned; I was expecting to wade through an incredibly complex and obtuse document, but it's fantastic. Is there anything Alexey can't do? I must meet this guy (he got an invitation to submit to the Australian Linux Expo, but AFAIK he didn't respond).
Netfilter debugging continues; problems, but I will overcome.
Watching Paul Mackerras do the IBM RS/6000 port is cool: he's showing an admirable degree of persistence and it's paying off: the bootloader and serial port work, but Linux proper doesn't boot yet. Maybe tomorrow. You can tell he's done this kind of thing before.
Second version of the Linux Graphing Project is up; I'm getting this one printed out. gargle (my 386 test box) now boots again, and is on the network: I'm compiling a new kernel for it to run the netfilter testsuite. Double-NAT and ftp fixes done.
Spoke with Art Tyde on the phone: the great thing about Linux work has been the quality of people you meet, and talking with Art reminded me of that. Added him to my mental list of people to meet.
I think this will work out really well: I just need to make sure that it doesn't all go horribly wrong (I think it was an Apple employee who said `A people hire A people; B people hire B and C people', and that applies here).
I'm staying in Canberra with my old friend Lisa and her stunning flatmate Vanessa; I'm looking for my own place, and won't be here long; at least hopefully before Vanessa gets sick of my drooling and throws me down the stairs. I'm sure she'll miss me when I'm gone. Sure.
Thursday I went to Tridge's lecture on parallel external sorting, which was really interesting. That night went to an SGI future-directions talk, which was mainly Linux; it looks very good. My 386 should arrive tomorrow; if it does, I can finish dual-table NAT, run the testsuite and get netfilter 0.1.11 out the door.
Errors on my HDD this evening: running badblocks over /home found some. This is a worry: I backed up (full) last night, and did an incremental immediately after the error. I don't want to lose this drive. I have been thrashing it a lot to produce my images, but I expect my hardware to simply take it. Preparing for the Canberra move.
BIG NEWS! I'm going to Canberra next week until March (IETF), to work in the same office as the LinuxCare guys (ie. ANU people: Tridgell, Mackerras, Rothwell). I've postponed the knee recon until after that. Tridge wants me to work for LinuxCare, which would be kinda fun, but I'm still with WatchGuard at the moment.
This week's distraction project is starting to bear some fruit as well, but it's gonna have to hold for a while.
Hoping for a productive few days to get 0.1.11 out the door. Glad I'm not going to ALS, because I'm just starting to get on a roll, and travelling right now would fuck it up.
Documentation has fallen behind again, and needs updating. That's what I'll be working on for a while; other than that the 0.1.10 release is almost ready, just the NAT rewrite to go.
0.1.9 released, another couple of minor bugs reported, and I realized a fairly significant one (hint: don't insmod the ip_nat module for the first time in heavy traffic). More stuff on the scoreboard, and a major documentation update going on at the moment (I've been slack).
After bitching in my last entry about netfilter not turning into a Bazaar project, I got an empathic mail from Bill Stearns, who has similar issues with Mason. I thought about it for a bit, then decided to make a scoreboard of contributors; sure, it'll be a drain to process, but if it works in encouraging people to participate, it's a small price to pay.
My frustrations, however, must have been showing: I sent a mail to linux-kernel about the impenetrability of the networking code. Alexey took it with good humour, but as I started writing it I realized how trivial much of it is to fix: it's not bad code, it's just that the naming of structure members and functions is such a mess. The skbuff functions are too widely-used to be repaired, but most of the non-exported functions could well be fixed without too much disruption.
It's important, which means I'd better do it.
The left mouse button on my VAIO stopped working too; I can tap the pad to get the same effect, but can't chord the buttons for a middle click (ie. no paste in xterms). I'm chasing down the paperwork to see if I need to return it to the US to get it repaired; either way, getting it fixed is going to cost serious time I simply don't have. Looks like I'll have to live with it.
Found bug (I mod_timer() then add_timer() for tcp, due to a reordering ARGH). Going to bed.
Locking coding finished in conntrack: for NAT I got lazy, but it'll get better next release (promise... well, maybe not). First test tomorrow. Sent Andi Kleen my Netwinder (it was just gathering dust here, and Andi said he wanted one); cost me about $35 shipping. Ordered Curt Schimmel's book from Amazon, since it's not available for 4-5 weeks locally, and I want to read it.
Normally, conntrack locking would be simple: the reference count starts at one, and every skb which is attached to it bumps the count. Destroying the skb would drop the count. If the connection times out, mark it as dead and drop the count by one. Whoever drops it to zero gets to free it. Great.
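The simple scheme in this entry can be sketched in userspace C. This is only an illustration of the counting rules, not the netfilter code: `struct conn`, `conn_get()`, `conn_put()`, `conn_timeout()` and the `conns_freed` counter are all hypothetical names, and C11 atomics stand in for whatever the kernel would use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a connection-tracking entry. */
struct conn {
    atomic_int refcnt;  /* starts at one: the reference dropped at timeout */
    atomic_bool dead;   /* set once the connection has timed out */
};

/* Bookkeeping for this example only, so the free can be observed. */
static int conns_freed;

static struct conn *conn_new(void)
{
    struct conn *c = malloc(sizeof(*c));
    if (!c)
        return NULL;
    atomic_init(&c->refcnt, 1);
    atomic_init(&c->dead, false);
    return c;
}

/* Called when an skb is attached to the connection. */
static void conn_get(struct conn *c)
{
    atomic_fetch_add(&c->refcnt, 1);
}

/* Called when an skb is destroyed; whoever drops to zero frees. */
static void conn_put(struct conn *c)
{
    /* atomic_fetch_sub returns the previous value. */
    int old = atomic_fetch_sub(&c->refcnt, 1);
    assert(old > 0);
    if (old == 1) {
        conns_freed++;
        free(c);
    }
}

/* Timeout: mark the connection dead and drop the initial reference. */
static void conn_timeout(struct conn *c)
{
    atomic_store(&c->dead, true);
    conn_put(c);
}
```

With one skb attached, a timeout marks the entry dead but does not free it; the free happens when the last skb calls `conn_put()`.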
Except I don't get to track skb destruction, so it's harder. Basically, you take the skb, you do a read lock on the hash, find the connection, bump its reference and drop the read lock. Then you can play with the conntrack all you want (as long as you don't want to alter it; for that you'd need another lock; I'd better check that). When you release the skb, you drop the reference.
This has all kinds of ugly side effects, such as: what happens when a connection track is deleted and the NAT then looks for it? (Answer: don't do that; we always delete on a timer, ick).
Moreover, helpers and protocols have a different problem. You could use the same trick: when a new connection comes in, bump the protocol or helper reference count, and when it's destroyed drop it. Unfortunately, connections can last a very long time, and you don't want to have to wait for them all to expire before you can rmmod the helper.
Good locking is hard. Hard locking is bad.
Linux Kongress was good: met some new faces; highlight of the trip was meeting Andi Kleen. Now if I have a vodka with Alexey, I can die happy. Andi convinced me to expand my planned kernel locking HOWTO into a kernel hacking HOWTO: that's bounced around a little now, and is ready for first release.
It was at Linux Kongress that someone mentioned the credit to Rob Malda in the kernel: I thought Alexey had dropped that patch (I didn't look carefully though, obviously). It's amazing how many people actually read the kernel; I sent mail to Rob, and he indicated that I was not the first to tell him.
I'm not travelling again for a while: certainly not ALS. I lost far too much time, and even now my body clock is badly fucked up (don't want to lose more time trying to sync it). The Bazaar is also out.
I know what's going to be in 0.1.8; should be finished by the time I return on Wednesday the 15th. Have to write my Linux Magazine article, and want to rewrite my netfilter talk for Linux Kongress; this is what planes and spare laptop batteries are for (thank god for my VAIO).
0.1.8 should have the static mapping stuff (which means rewriting the ipnatctl shared library infrastructure closer to the iptables one), stateful packet filtering, chain renaming, and more testsuites work.
Argh: I wrote netfilter because of all the cool stuff I could do on top of it (especially userspace), but I'm still caught in the kernel, while cool stuff (like the easter egg in iptables) tempts me away...
Oh, and The Bazaar is actually happening; Steve Blood sent me a mail. That would make 6 conferences this year, which is about 3 too many. On the other hand, I like the idea of The Bazaar... so I'm delaying my decision.
ARGH: late news. I figured out the SMP problem! Fuck. CONFIG_SMP isn't enough to control SMP, you need __SMP__ which all the headers use. I hacked this in for some modules, but not globally. Fixed in the Makefile: must retransmit cleanup patch to Linus.
Compiling 2.3.16 SMP now, to see if my testsuite still passes. 2.3.16 broke initfunc, so my 0.1.5 doesn't compile, forcing a release. The `local connect to masquerade' bug can be neatly solved by a small kernel patch I sent to netdev: Alexey may not like it, in which case I'll work around it at my end. With that patch, my entire test suite should pass, and the only bug left is the weird NFS one I found.
Not many people using 2.3 kernels: haven't had the flood I expected. I think the fs corruption problems plus the fact that 2.3.16 doesn't compile on UP has saved me from a trial-by-fire, and gentled the user upramp. I did need the increase in users, however, to flush some bugs; testsuites can't do everything.
On the home front, my knee seems to be improving nicely; by the time I return from Linux Kongress in a week, I should be able to drive again. Then I go in for the reconstruction in mid-October (this is the second time I've torn the Anterior Cruciate Ligament in my left knee, and I'm sick of it), and I'll be off my feet for two-three weeks. Delaying Canberra move. 8-(.
So I'm back to experimenting on my production box, slow and dangerous work. Tomorrow I'll head into town and see if I can get a replacement switch of some kind.
One interesting bug has got me slightly stuck: fixing it one way requires a kernel change, and fixing it another requires a semantic change. I've proposed the kernel change to netdev, but I don't think Alexey will like what I've done (he'll almost certainly say I shouldn't be altering the source of a packet after it's been routed). That leaves only two known bugs: the fact that defragmentation and local nfs traffic don't seem to mix (which I need my 386 box back to test) and the crashing when rmmod'ing on SMP kernels.
So much for a release tonight. Oh well.
Talked to David Bonn of WatchGuard while I was in Seattle: because of lucrative (and numerous) offers elsewhere, I will probably be leaving them. Now that netfilter is in the official kernel it seems a good time. They're more than happy to keep paying me, but I'd prefer to start moving into some other area (netfilter will occupy me for some time to come though).
Oh, and I moved my diary. Got to go fight more bugs...
Tomorrow I fly back to Australia, and don't really get to rest: I have to do some kernel patches and process bug reports (I expect a reasonable number).
I'm going out tonight with Ace to celebrate. Chinese all round 8-).
I've been nominally taking two weeks off (bad timing, but isn't it always?); after my tutorial at LinuxWorld, Ace and I have been touring the US: Disneyland before the conference, San Francisco afterwards, then Vegas, Denver, and a huge driving trip through Wyoming, ducking into Utah, and back to Wyoming/Montana for Yellowstone. In Cody, Wyoming, I threw my left knee out again, and currently am hobbling on crutches. It's probably reopened the partial tear in my Anterior Cruciate Ligament. I'm hoping to be back on my feet for Linux Kongress, but from previous experience I'll have a marked limp.
Anyway, my net access has been really dodgy: some of these hotels don't have direct long-distance dialling from the rooms (you need to use credit cards). My Sony VAIO has been great for these long trips: each battery is worth about 4 hours of Freeciv.
At LinuxWorld Expo, I crashed Alexey's kernel (with my mods) with my test suite, and corrupted my /home really badly. Ted Ts'o was interested because fsck didn't fix it *sigh*. I've been nervous about further testing while I've only got one box here, and no decent net. A number of people have suggested VMWare, but I figure if you need a proprietary piece of software to develop Free software, we might as well make the whole thing proprietary.
Should be able to handle stairs by the time I reach Canberra, which is a requirement for getting into the space the guys rented. I should learn to be more careful.
Got a reply already from Jim Pick of kernelnotes.org, so it looks like http://netfilter.kernelnotes.org will be the first site. www.kernelnotes.org is my homepage, so this is really nice; I owe Jim a beer or three.
Andi Kleen gave my latest patches (up on my ISP's web space for want of a real site) the thumbs up, hence the need for a reliable site: once netfilter goes into the development kernels, it's going to need a set of reliable sites.
I read the report this evening when someone sent it to ipchains (no, I still haven't been able to subscribe to bugtraq, even though I try every six months or so). My first reaction was to jump online and look at the bugtraq archives to see the response. That's when I found out that my ISP (Camtech, now OzEMail) had cancelled my account: calling the helpdesk revealed that it had expired on the 11th. Of course, I had renewed online after they sent me a letter; he told me to take it up with accounts, which is only open during office hours.
Tomorrow I get a new ISP; Camtech's service was OK, but their billing system was always completely fucked, and it finally bit me.
So I made my patch, and wrote some EMails, dropped them on a floppy and called Duncan Grove (Michael was out somewhere, got his answering machine). A couple of hours later, I was in the University updating my web page and dumping a couple of mails to the net using telnet port 25.
Netfilter work continues; I fixed the truncated-packets problem in both the new ip_tables code, and the backwards compatibility code. Moreover, it inspired me to take a detour and start hacking up my `ipt_unclean' module for iptables which matches on suspicious packets (eg. ping of death, weird fragments, etc). The tests for the short packets case went into the testsuite, which is slowly increasing in scope.
Lost a lot of work when 2.3.11 screwed my disks over. Fortunately, I had backed up a couple of hours before, but any file I had touched during that time was deleted by fsck. Needless to say, I don't trust the 2.3 series as far as I can throw them, and 2.2 is bad at the moment: I'm building a 2.2.5 (last known kernel which didn't corrupt file systems).
This development is very disturbing to me. I'm used to trusting the Linux kernel implicitly, and fs corruption makes me feel like someone who was hit by an earthquake and never views the ground quite the same again.
It also makes me want to look deeper into the port of Linux to userspace. A kernel in a window is a nice idea, and probably worth investigating. And no, I won't use VMWare; if kernel debugging can't be done without a proprietary product, what's the point?
One of the good things about writing a test suite is that you actually find some of the stupid mistakes. Too bad I kept working on the test suite after 0.1.3 was released. Hence 0.1.3.1 tonight, to fix the two most glaring errors.
Laptops suck.
Hence my insistence on a >2 year warranty on the new one. Looks like I'll be picking up my Dad's old G3 powerbook, which still has 2 years left on the clock, and the price is right ($2000 US).
Working on running the test suite I wrote this afternoon. It's tough when it crashes your machine. Looking through kern.log, found part of the Debian package list at offset 241665. I'd just screwed my kernel over, so maybe it's just a glitch, but with Linus's warning about fs corruption, it scared the fuck out of me. Booted back to 2.2.10 for the moment to recompile and backup.
The conference has left me physically and emotionally exhausted. By the end it was all I could do in some cases to be civil; I didn't accompany the others who went to Sydney. I took a day off and read Cryptonomicon; I am doing netfilter debugging in slow mode as well. Looks like I'm going to Augsburg in September for Linux Kongress again.
The big news is that I am planning to move to Canberra in September to work with Tridgell, Paul Mackerras et al. for 3-6 months. This will be superb; I can finally set up my test network like I want, and test on other machines, etc.
At the conference Dave Miller explained the new locking strategy in the network code. Now I understand what netfilter needs to do; I'm going to need the sk reference counts implemented to fix things properly though (I can hack around it for the moment). Writing the Kernel Locking HOWTO is on my TODO list.
Much other hackery occurred: I now understand rsync, and think Tridge is a God (imagine, a useful PhD subject!). devfs is the Right Way (or very close approximation), although I want the naming thing formalized a little. I want to be a founding member of maddog's School of Microcomputing and Microbrewing.
Uncrashing netfilter code. Yummy.
There seem to be a number of developments with other people doing packet-mangling stuff; a 2.0.36 version of FreeBSD's direct sockets, a double-masquerading (effectively masq+portforward) patch, and some work on masquerading speedups. We need to get netfilter in soon, so people can use it as a base; of course I want Andi to be happy with it.
I've been increasingly dropping EMails on the floor; netfilter stuff and conference work take priority, and everything else a distant second. My usually well-formatted and verbose style has become more terse under the pressure. Those who know me realize I'm busy, but I feel sorry for people who send me ipchains questions; I have no choice but to respond, but my recent responses have been less-than-helpful in some cases. Fortunately, the mailing list seems self-sustaining at the moment (thank God for those guys).
I'm not expecting much sleep in the next couple of weeks. And as soon as the conference is over, I'm back to netfilter coding; God I'm looking forward to it. Tonight was almost all netfilter; last night was too (rewrote the Makefiles to be non-recursive based on a discussion with Andrew Tridgell long ago, and they are sweet). Tridge was trying to convince me to use VMWare for debugging, but I don't want to be reliant on proprietary tools again. I figure VMWare only have about 18 months before a free version comes out anyway; I can wait.
This is the version I want to merge with DaveM if no problems appear in the next few days. I found a few bugs in my code, and did a couple of cleanups in the stuff I'm sending Dave. I know this is going to hit me at the same time the conference does, but I'll let Dave decide.
So I've made a full backup (at least that now works again!) and I'm having to run 2.3.8 on my production machine. I'm resolved to get a test network, but not here; this episode has taught me that I need the company of fellow hackers. I'm pursuing a couple of different options at the moment, and as soon as the conference is over, that'll get my full attention.
My patience has run out.
The ethernet card and modem card don't both work at the same time. I have only one monitor, which needs to be flipped between gargle and ketchup. Gargle has 4MB of RAM, and the serial line connecting it to kevin seems to create enough noise to blow TCP performance to the Internet to shreds. I need a serious fucking test network; unfortunately test networks aren't mobile, and I am. That's gonna have to change.
Just trying to get the package list so I can install netbase (and hence have remote login capability to gargle) has taken me over two hours. Setting up machines is such a PITA.
So now I'm going to have to go through the driver with a fine-tooth comb, seeking something that's not under a lock (the card simply stops responding after a while).
Conference is coming along nicely. Spent the weekend on my tutorial presentation with Michael; he's proofreading it today. Decided it was worth converting to LyX for the workbooks, since it looks so much nicer than printed HTML.
Ace is helping me set up my 386 box; it has only 4MB, so installing Debian or Redhat was out (we also tried Slackware). Smalllinux worked though; tomorrow I'll get her to put in the NE2000 network card, and start the slow climb of building it into a Debian box. She called the box `gargle', in keeping with my network theme of joke punchlines (`kevin', `hambush' and `ketchup' are the others).
Booked my tickets for the August trip to LinuxWorld, with Ace. She and I had much fun using travelocity to find reasonable fares. Bought Ace a copy of `Learning the UNIX Operating System', an O'Reilly book; if she makes it through that I'll get her Linux in a Nutshell. She's playing on my old laptop (the one that doesn't run on batteries anymore).
Bought her a Furby. Don't worry, I'll balance it all out by getting her a Palm Pilot later. Really.
It sometimes seems to me that all the other kernel guys have mastery of a large number of areas of the kernel, and I have this tiny bit. I think I need to attack more areas, especially SMP and locking issues. Dave Miller shocked me by reworking the interaction with NFS and the page cache: it's really very little to do with his home ground of networking per se, but it's a major rework which appeared inside a week. I wish I could grab some area of the kernel and rework it in days, not months. I think I'm too used to working with people I'm much more experienced than; the Linux world is much more competitive, and I'd like to work closer with some gurus to hone my skills which plateaued here on the end of the earth.
So, after netfilter, I think I'm going to find a more collaborative Free Software project to tackle; I don't really care what it is, as long as there are top people and the project is interesting. Rusty's Fucked-up Network Protocol might be just the ticket.
I'm looking forward to getting this finished, then it's ftp data mangling (shouldn't be too hard). Then on to writing the compatibility layer, then we'll be ready for the masses; I'll take a weekend off for a change, and then come back and code audit, probably some minor cleanups, and rewriting the HOWTO.
Wrote another column for Linux Magazine; this one is pretty cool. That means I don't have to do it in the middle of conference organization, so it's out of the way already. A friend of mine from Canberra called about tickets in August (LinuxWorld and a small tour of the US with my SO and misc. others) so I'm chasing that up as well. On the topic of conferences, they are trickling in for CALU in July; looks like we'll hit about 200 or so.
Tentative August itinerary:
Netscape was crashing for me on the appindex entry form, so I didn't get to enter the details for the ipchains 1.3.9 release. Before I got around to upgrading Netscape though, someone (someone I don't know) beat me to it; that's really cool!
Got state tracking to work, without crashing. Expect a snapshot tonight (with ftp tracking). I'm going to leave the addition of an iptables module for state tracking to someone else; it's not that hard, and it's a good project for someone.
At Linux Expo DaveM asked how Juanjo Ciarlante (who has been doing a lot of 2.1 masq work) felt about me replacing the masq stuff with my NAT layer. I sent Juanjo a mail when I got back; he's looking forward to hacking on netfilter, and likes my HOWTO!
You have to be careful not to trash someone's project; everyone knows masq needs a rewrite, but it can still be a wrenching feeling to have your code taken out. Of course, much of my code was inspired by the masquerading code and BSD's ipfilter anyway, so it lives on. I'm looking forward to Juanjo hacking on netfilter; he's got RL experience which will be a great contribution.
--- linux-netfilter/net/core/netfilter.c.~6~ Sun May 30 12:22:22 1999
+++ linux-netfilter/net/core/netfilter.c Mon May 31 21:21:26 1999
@@ -493,6 +493,7 @@
         printk("Crap bits: 0x%04X", nf_debug);
         printk("\n");
 }
+#endif /* CONFIG_NETFILTER_DEBUG */
 /* One semaphore for all of them. */
 DECLARE_MUTEX(modreg_sem);
@@ -615,4 +616,3 @@
 {
         return modreg_find(headaddr, name, name_cmpfn);
 }
-#endif /* CONFIG_NETFILTER_DEBUG */

Been working on connection tracking. Cut less than a thousand lines today; should have been higher. Still, I'm quite happy with it. One advantage of separating connection tracking from NAT is that it's going to be fairly easy to test.
Posted my minmax.h patch to linux-kernel. Linus may be stubborn, but I really think it's for the best. I'll probably get some flames, and Linus'll probably just drop it on the floor. Oh well.
Need to sleep; this stuff is not going to be solved by a single all-nighter, so it's best to keep hacking away at it. Did some conference stuff today too; it's going to be really cool (in retrospect).
Andrew Tridgell called, and we talked about many things; congratulated him on his impending new job, discussed finance. Andrew is organizing a minibus from Canberra. Andrew should have been the one to organize the conference, except he's only just recently come out of his thesis-induced hole, and if he'd done it the conference would have been in Canberra. He did make me think about other ways of getting finance for the conference, in particular, advertising space. I've been pursuing these in parallel.
Worked on netfilter some more; two changes I have been resisting have become necessary. Firstly, hooks now have a priority, so we can ensure that local NAT occurs before packet filtering, preserving their independence. Secondly, hooks can return NF_STOLEN, to indicate that they have taken control of the skbuff. This is required to efficiently support ipfilter's "fastroute" option, which queues the skb. I disagree with the ipfilter "all-in-one" approach, but it is a valid use, and I am not going to dictate it by designing limitations into the netfilter infrastructure.
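The two changes can be pictured with a small userspace toy. This is only a sketch under stated assumptions, not the netfilter code: the names `register_hook`, `run_hooks`, `V_ACCEPT`, `V_DROP`, `V_STOLEN` and the toy hooks are hypothetical, and "lower priority number runs first" is an assumption made for the illustration.

```c
#include <stddef.h>

/* Toy verdicts; V_STOLEN plays the role NF_STOLEN has in the text. */
enum verdict { V_DROP, V_ACCEPT, V_STOLEN };

struct packet {
    int dst_port;
    int owned_by_hook;      /* set when a hook keeps the packet */
};

typedef enum verdict (*hook_fn)(struct packet *pkt);

struct hook {
    int priority;           /* assumption: lower value runs earlier */
    hook_fn fn;
};

#define MAX_HOOKS 8

struct hook_list {
    struct hook hooks[MAX_HOOKS];
    size_t count;
};

/* Insert a hook, keeping the list sorted by ascending priority. */
int register_hook(struct hook_list *list, int priority, hook_fn fn)
{
    size_t i;

    if (list->count == MAX_HOOKS)
        return -1;
    i = list->count;
    while (i > 0 && list->hooks[i - 1].priority > priority) {
        list->hooks[i] = list->hooks[i - 1];
        i--;
    }
    list->hooks[i].priority = priority;
    list->hooks[i].fn = fn;
    list->count++;
    return 0;
}

/* Run hooks in priority order; stop on the first non-accept verdict.
 * On V_STOLEN the caller must not touch the packet again. */
enum verdict run_hooks(const struct hook_list *list, struct packet *pkt)
{
    for (size_t i = 0; i < list->count; i++) {
        enum verdict v = list->hooks[i].fn(pkt);
        if (v != V_ACCEPT)
            return v;
    }
    return V_ACCEPT;
}

/* Toy NAT: redirect port 80 to 8080. */
enum verdict toy_nat(struct packet *pkt)
{
    if (pkt->dst_port == 80)
        pkt->dst_port = 8080;
    return V_ACCEPT;
}

/* Toy filter: only the post-NAT port is allowed through. */
enum verdict toy_filter(struct packet *pkt)
{
    return pkt->dst_port == 8080 ? V_ACCEPT : V_DROP;
}

/* Toy queueing hook: takes ownership of the packet. */
enum verdict toy_steal(struct packet *pkt)
{
    pkt->owned_by_hook = 1;
    return V_STOLEN;
}
```

Registering `toy_filter` at priority 10 and `toy_nat` at priority 0 makes NAT run first regardless of registration order, which is the independence the priorities are meant to preserve.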
Had a big win when Ronald Kuetemeier said he implemented SAMBA failover using netfilter's NAT layer in a couple of hours. He hit some panics though, so I'll be hunting those down today.
DaveM took all the locking code out of the network stack, and benched it (until, obviously, it crashed). Twice as fast (I'm guessing it was a big SMP machine). Hence, he is itching to do away with net_bh, and move to packet queues for each CPU. I'm going to merge into his tree before that, which means in the next couple of weeks.
He did tell me that my ipchains code and the TCP stack were the only parts that didn't need fixing for the new locking.
Talked to Alan Cox about locking in the firewall code; he says it doesn't really get any better than 1 read semaphore on traversal. Larry McVoy has said a few times that Irix got so many locks that the overhead of grabbing dozens of them made performance suck (and I'd be worried about deadlocks).
Wensong Zhang didn't make it from China; the US wouldn't give him a VISA due to the embassy bombing fiasco. Larry spent a while on the phone to the US embassy in Beijing; no luck. It sucks, because I wanted to talk about his stuff on top of netfilter.
Staying with Raster, we discussed a whole heap of stuff. Future developments in E, CPU usage, etc. The Linux Magazine guys paid me for my articles. I got my copy of Open Sources off Chris DiBona, finally. Discussed the LSB with Dan Quinlan: he hopes for something serious by end of year. Met one of my most prominent users and related-project developers, Bill Stearns, and I walked through the problems of ipfwadm-to-ipchains conversion; he's the author of ipfwadm2ipchains. Some mods are already forthcoming.
Spoke briefly with Werner Almesberger after his excellent talk on Traffic Queuing under Linux; I'd read some of the code briefly, but didn't really have a good understanding of its flexibility, which I do now. He told me that the u32 classifier is faster than ipchains; not a huge shock, but definitely something I'll be looking at carefully.
Speaking of Alexey's code, he admitted I was right recently (he suggested the ability to mark interfaces with a number; I said the ability to rename interfaces and use interface wildcards was better). Makes me think that my work isn't a complete waste of time; after Werner's talk, I've even more respect for Alexey's coding ability...
Finally met Paul Mackerras, PPP guy, and as well as telling him to come to CALU, told him about renaming interfaces under Linux 2.2; I'll send him that patch for pppd if I can find it...
Looks like there really is going to be a Linux Developer con happening. The idea has been expressed by many people, and it looks like Larry McV and Victor Y might actually get it off the ground. I am also looking at The Bazaar, in December; but if I do Ottawa as well, that's 6 conferences this year; three above my limit.
Conference registrations are starting to trickle in; each one offers assurance that I'm not going to go bankrupt. This makes life easier for me.
I released v1.1.2 of ipchains-scripts, finally. The ipchains-1.3.9 awaits only the new Quick Reference Card (I just sent a reminder to Scott; he promised it this week). The new HOWTO only awaits ipchains 1.3.9.
I've been having a really interesting dialogue with my BSD counterpart, Darren Reed. Actually, I'm flattering myself, since his ipfilter does more than ipchains by a long shot, is older, more mature, and cross-platform. I've shared a few concerns and questions from skimming his source, and we've traded problem reports where there may be overlaps. It's been really instructive on a number of fronts, mainly for NAT implementation issues (see, Darren has real, live users, something I currently lack for netfilter).
This trip is going to be hell; in my three-and-a-half days I have to catch up with my old friend Chris Yeoh, who now lives in Denver, shmooze with the LinuxWorld organizers Natalie and Kathy (who run a real conference, and have been really good telling me how it's supposed to be done), talk to the PowerPC guys about that laptop for development, give Raster the six-pack of vintage beer for letting me stay with him, give hemos the nice bottle of wine I brought (free slashdot ads), and Larry McVoy the other nice bottle (Michael and I stayed with him in SF in March), catch up with Wensong Zhang to talk about the virtual-server project, catch up with the Linux Magazine editors to get my payment, maddog Hall to discuss conference airfares and US date format, and hopefully have time to see Alan Cox again and drag him to my netfilter WIP to get his criticism. Hence I'll be restrained in my drinking this trip. No, really.
Then, hopefully, I'll get a respite to do some actual coding before getting swamped by the Australian Conference. Argh.
Some people asked me about my routine. Well, I get up (around 11am), dial up my ISP, tell my laptop to upgrade to the latest Debian (I love apt), grab news and mail, then shut down the laptop and head into town (bus or walk, it's only 25 mins) for coffee. Over coffee I read my mail and reply to it. My batteries last for about 90 minutes, so sometimes I get to do some hacking in that time as well, but usually it is all spent on ipchains support.
Then I usually play something at the arcade (currently Gauntlet Legends, what a money pit), and head back home. If there's anything urgent, I connect again to let my mail out, otherwise I start hacking. The serious work doesn't usually start until around 7pm, after I've eaten and settled down for some serious hacking. Around 2/3am, I hit the sack and repeat.
Not very exciting, but it works for me.
At least I got the IPX packet filtering stuff off to Jay; not tested, but it compiles. I was planning on spending tonight on data mangling, but issues with iptables got in the way, and I ended up fixing some icky bugs with TCP.
Snapshot tomorrow: it's been too long, and now I've got some iptables fixes which need to go in; since Jerome and Herve are actively working on that stuff, we need to keep in sync.
Keeping an online diary is weird. I get EMail (and off-the-cuff comments: thanks Jerome; I'll have to introduce you to Meryki) from people about it. I'm tempted to take down the link from the front page. Still, it's my space to rant, and I hope no one takes it too seriously. If it keeps my ranting off linux-kernel, that can only be a good thing (Alex Buell, are you listening?).
I ended up stripping SACK permission and window scale options from the initial SYN. I'm not going to rewrite SACK options, and I don't want to allocate huge buffers for giant windows, so this seemed the easiest path. Not exactly non-intrusive, but we're violating so many boundaries with this stuff anyway, that I don't think it matters.
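The entry doesn't say how the options get stripped. One common way to do it without changing the segment length is to overwrite the unwanted options with NOPs. The sketch below assumes that approach; the function name `strip_syn_options` and the `OPT_*` constants are made up for illustration, though the option kind numbers (0 end-of-list, 1 NOP, 3 window scale, 4 SACK-permitted) are the standard TCP ones. The TCP checksum would need fixing up afterwards, which is not shown.

```c
#include <stddef.h>

/* Standard TCP option kinds; the OPT_* names are local to this sketch. */
enum {
    OPT_EOL    = 0,   /* end of option list */
    OPT_NOP    = 1,   /* one-byte padding */
    OPT_WSCALE = 3,   /* window scale */
    OPT_SACKOK = 4    /* SACK permitted */
};

/* Walk the TCP option bytes of a SYN and blank out window-scale and
 * SACK-permitted options with NOPs, leaving the length unchanged.
 * Returns the number of options blanked, or -1 if the option list is
 * malformed (in which case some options may already have been blanked). */
int strip_syn_options(unsigned char *opts, size_t len)
{
    size_t i = 0;
    int stripped = 0;

    while (i < len) {
        unsigned char kind = opts[i];
        size_t optlen;

        if (kind == OPT_EOL)
            break;
        if (kind == OPT_NOP) {
            i++;
            continue;
        }
        /* Every other option carries a length byte covering kind+len+data. */
        if (i + 1 >= len)
            return -1;
        optlen = opts[i + 1];
        if (optlen < 2 || optlen > len - i)
            return -1;
        if (kind == OPT_WSCALE || kind == OPT_SACKOK) {
            for (size_t j = 0; j < optlen; j++)
                opts[i + j] = OPT_NOP;
            stripped++;
        }
        i += optlen;
    }
    return stripped;
}
```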
My current implementation is a sledgehammer, and it will be *slow*. There are several worthy optimizations which I've avoided until I get it working, then I'll look at speed. My ISP is having issues at the moment, so I've been unable to grab mail. Hope nothing important has happened.
Let's step back a bit: why do we want to replace data inside a packet as it flies past?
FTP. It puts the address of where to connect the data backchannel to in the data stream. We have to find out what this is, and replace it. Due to the format, this may involve changing the length of the packet.
There's a hack in the current masquerading code, but it assumes that the command is in a single packet (it isn't always). I wanted to solve a more generic problem: replacing a pattern which is less than the size of one packet, with something else. This could give spectacular side effects: imagine your Linux router substituting "idiot" for "boss" on TCP streams going out from the research network, and "boss" for "idiot" on the way back.
Nobody ever suspects the router...
Anyway, turns out that this problem is hard. What if there is more than one replacement in the packet? What if it's a fragment? What if doing the replacement(s) causes a packet to exceed the MTU of the link? Or the MSS of the receiver? What about out-of-order packets, or partial matches?
So this project turns out to be bigger than I originally intended. For the FTP case, you can probably just drop all partially matching packets, and hope they'll be coalesced on retransmission. For the generic case, we have to get tricky... thank God this is all in userspace.
This is my current obsession, and I'll know I've succeeded when I can ftp large files full of matches through my Linux box, and get the same results as `sed', even with deliberately induced packet losses.
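To make the partial-match problem concrete, here is a toy of just one piece of it: replacing a fixed pattern in an in-order byte stream that arrives in arbitrary chunks, holding back any bytes that might be the start of a match. It is a sketch under stated assumptions, not the code described in these entries: the `subst_*` names are hypothetical, it uses a KMP failure table (a substitution of technique chosen for the example), and it ignores fragments, reordering, retransmission, MTU/MSS limits and sequence number adjustment entirely.

```c
#include <stddef.h>
#include <string.h>

#define PAT_MAX 32

struct subst {
    const char *pat;        /* pattern to find */
    const char *rep;        /* replacement text */
    size_t plen, rlen;
    size_t fail[PAT_MAX];   /* KMP failure table for pat */
    size_t held;            /* bytes of pat matched but not yet emitted */
};

int subst_init(struct subst *s, const char *pat, const char *rep)
{
    size_t k = 0;

    s->plen = strlen(pat);
    if (s->plen == 0 || s->plen > PAT_MAX)
        return -1;
    s->pat = pat;
    s->rep = rep;
    s->rlen = strlen(rep);
    s->held = 0;
    s->fail[0] = 0;
    for (size_t i = 1; i < s->plen; i++) {
        while (k > 0 && pat[i] != pat[k])
            k = s->fail[k - 1];
        if (pat[i] == pat[k])
            k++;
        s->fail[i] = k;
    }
    return 0;
}

/* Append n bytes to out at *pos; -1 if they do not fit. */
static int emit(char *out, size_t cap, size_t *pos, const char *src, size_t n)
{
    if (n > cap - *pos)
        return -1;
    memcpy(out + *pos, src, n);
    *pos += n;
    return 0;
}

/* Feed one chunk. Returns bytes written to out, or -1 if out is too
 * small (the state is then no longer usable). */
long subst_feed(struct subst *s, const char *in, size_t len,
                char *out, size_t cap)
{
    size_t pos = 0;

    for (size_t i = 0; i < len; i++) {
        char c = in[i];

        /* On mismatch, release held bytes that can no longer begin a match. */
        while (s->held > 0 && c != s->pat[s->held]) {
            size_t keep = s->fail[s->held - 1];
            if (emit(out, cap, &pos, s->pat, s->held - keep))
                return -1;
            s->held = keep;
        }
        if (c == s->pat[s->held]) {
            if (++s->held == s->plen) {
                if (emit(out, cap, &pos, s->rep, s->rlen))
                    return -1;
                s->held = 0;
            }
        } else if (emit(out, cap, &pos, &c, 1)) {
            return -1;
        }
    }
    return (long)pos;
}

/* End of stream: whatever is still held was not a full match. */
long subst_flush(struct subst *s, char *out, size_t cap)
{
    size_t pos = 0;

    if (emit(out, cap, &pos, s->pat, s->held))
        return -1;
    s->held = 0;
    return (long)pos;
}
```

Feeding "the bo" and then "ss is a bos" with the "boss" to "idiot" substitution from the example above yields "the " and then "idiot is a ", with "bos" held back until the flush. The output length differing from the input length is exactly what makes the real packet-level problem unpleasant.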
I can't put it off any longer; Jay Schulist is probably out there hunting me now. IPX firewalling compiles (both kernel and userspace tools). Neither tested, nor neat, but I'm diffing up a 2.2.7 patch to appease him now.
Happy thoughts of packet data mangling are wandering through my brain. What if the new packet exceeds MSS? Or MTU? I think we have to chop to length. Should work.
Today, got a call from Darren and Petrina; two of my Canberran friends, who were in Sydney for the weekend. They knew I was there too, since they read it here. Cool. Had lunch; Amex still worked.
Just before I flew out, met up with Meryki, one of the girls from the pub-crawl group. Well, who am I kidding; the only girl from that group. I figured it couldn't be a bad thing to spend a couple of hours in the company of a tall, leggy blonde, and we had fun. Told Ace all the details when she picked me up from the airport, so I had to be well-behaved, and I was.
Dedicating some time to conference organization, but I'm hoping to get my packet data substitution code working in the next couple of days. netfilter is still moving, though, due to iptables patches streaming in from Jerome de Vivie and Herve Eychenne.
That was the least of what happened over the last couple of days. I flew to Sydney on Thursday, and stayed with Chris Saunderson, an old Adelaide friend who escaped to Sydney. Nice to talk to someone dealing with serious networks, and Chris is really cool.
I got my coffee fix at Bambini's on Liverpool street; the place I learnt to drink short blacks three years ago. Friday lunch I had a meeting about the Conference with Grahame Kelly, Jamie Honan and Terry Dawson. It was really great; everyone saw eye-to-eye and I'm working on a number of ideas which came out of that. Talking with Terry about LDP stuff afterwards was really informative: the next version of my HOWTO will now be in docbook form.
Friday night was the SLUG meeting, and (as a disinterested observer), I acted as returning officer for the voting in of the committee. A little unexpected. I was really there to promote the conference, which I did, and many brochures were snapped up.
Then Horms (Simon Horman, ZipWorld guy who I met at LinuxWorld March) invited me along to a pub crawl. I returned to Chris's apartment at 5:37am (he had to wander down and let me in, since I didn't have a key). I knew he was going to be downloading the Quake III demo, so I figured he might still be up: no such luck, as his Voodoo I wasn't up to the task, apparently.
After finishing the netfilter HOWTO, and doing some minor netfilter_dev tweaks, I went out to dinner (Zest) this evening. I needed to, because it's the only place I know of which sells Coopers Vintage Ale, and I need some of that for bribery at my upcoming stay with Chris Saunderson (Sydney this weekend) and Raster (North Carolina, LinuxExpo); both are Coopers drinkers. Not cheap, but neither are hotel rooms...
Tomorrow, I'll be beginning the design and writing of my userspace content matching code. I'm not entirely sure how it's best approached; I'm going to need to think through some scenarios. FTP control channel mangling is particularly difficult.
BTW, taper sucks. It keeps core dumping. I think it's time for tar.
Documentation almost finished; then I can get back to the IPX firewalling I promised Jay, and some conference organization issues which need to be addressed this week. I want to have things well in hand before LinuxExpo, so I'm not stressing out on the plane.
Anyway, as I was writing documentation, I decided that I should rename all the references to `bind's to `rule's, and all `perconns' to `bindings'. A huge search and replace job, but it has the benefit of making the nomenclature match the draft NAT RFC, the netfilter HOWTO, and the user's perspective. It also happens to be more accurate.
Still writing documentation; the netfilter HOWTO. Now I'm on programmers' documentation. I have to finish by the end of the weekend, so I can release it and another snapshot. After the documentation is done, I expect more users, more bug reports, and maybe more patches and enhancements.
When I was implementing ipchains, I noticed that the kernel firewall code to match interfaces (the `-V' option) was broken in 2.1. It had been broken by someone who didn't understand all the issues who adapted it for the new interface/alias code. When someone breaks a feature, you have to look at how fragile that feature was in the first place; will it break again? When the breakage was undiscovered for so long, you have to ask how many people actually use it anyway, and is it vital to those people?
Combined with the fact that I could drop a whole heap of code (in particular, notifiers for devices going up and down), avoid a loop in the critical path of the packet filter code, and generally make it simpler, this made me decide to remove it. The `-V' predated the `-W' option (match by interface name), so its existence made sense in the early days of ipfwadm, but now?
For a long time, no-one came up to me with a reason for wanting the `-V' option back, until an ISP system administrator came up with a convincing one. They assigned all their dialup PPP customers the same interface address, and used one set of rules for all of them. Thus I implemented what he really wanted: wildcard interface names (eg. `ppp+').
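The wildcard rule is simple enough to sketch: a trailing `+' means "any interface whose name begins with this". The helper below is an illustration only, with a made-up name (`iface_match`); it is not the kernel's or ipchains' actual matching code.

```c
#include <string.h>

/* Return 1 if ifname matches pattern, 0 otherwise. A pattern ending
 * in '+' matches any interface name that starts with the part before
 * the '+' (so "ppp+" matches "ppp0", "ppp17" and "ppp"); any other
 * pattern must match exactly. */
int iface_match(const char *pattern, const char *ifname)
{
    size_t len = strlen(pattern);

    if (len > 0 && pattern[len - 1] == '+')
        return strncmp(pattern, ifname, len - 1) == 0;
    return strcmp(pattern, ifname) == 0;
}
```

A prefix compare like this needs no per-device bookkeeping, which fits the earlier point about dropping the notifiers for devices going up and down.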
It was another ISP who came up with the second fair reason. They had pre-configured rules for each interface address for their static-IP dialup customers. The normal authentication mechanisms took care of assigning the address to the interface, and thus ensured the correct filtering rules were used. The interface name depended on which line they dialled in on, which varied.
Thus the existence of the SIOCSIFNAME in 2.2; you can actually alter the name of an interface. The idea was to add a pppd option to allow it to change the interface name to some name (depending on client), thus allowing filtering by interface name. It'd be pretty cool for an ISP to do an ifconfig and see a list of clients (eg. `bigcorp-ppp4' instead of `ppp4').
Tonight was the local Linux User Group meeting (LinuxSA) where I got to give out the glossy brochures for the first time. A number of people pointed out that we need a mail address for Linux Australia, where we can tell corporate types to send cheques.
Richard Stallman turned up at the meeting. Richard was focussed, as always, on freedom in software. I think he made a number of people think. I like Richard's mind, too bad his body doesn't bathe more often. Ace cringed when he picked his nose and ate it, but hey, I'm more easy-going than her (but then, I didn't see it, and wasn't eating pizza across from him at the time).
RMS is giving his usual talk on Thursday night. I'll probably go and drag Ace along; he's usually entertaining for the first hour.
Work on the netfilter HOWTO continues. Slowly.
Some fun with backups: taper told me there were "5 errors" after it completed my backup. Didn't say what they were, and a Verify gave 7 errors and 5 warnings. After it segfaulted the first time, I'm not really ready to trust it that much.
Still, I'm giving it a go for a week, to see how well it does. At the end of that I'll try a full restore, and do a compare. It'd be nice if my modem and SCSI PCMCIA cards worked at the same time though (a bit much to ask, since my modem card is really dodgy, and doesn't even work with my ethernet card).
A PowerPC user with compilation problems sent me a mail: not much I can do about it. I sent Cort Dougan a plea; if they want PowerPC supported, they need to get a machine to me (ideally a G3 laptop).
While I was writing this, taper segfaulted again. Great. Maybe (just maybe) it's running out of RAM: I'll try upping the swap.
Previously, I just made big tarballs on my laptop (and more recently, my 2GB Jaz drive), but (being manual) it's prone to error, and doesn't cover much of my home dir (only the devel stuff, not my mail). So I installed taper a while back in the hope of getting to use it.
The documentation is a little long (I can sympathize with those who look at the ipchains HOWTO and say `You want me to read THAT?'), but after a bit of fumbling, I've figured it out. It's really quite cute. I guess like every coder I've tried my hand at homebrew backups (and had them replaced by others' homebrew backups), and I've used Solstice Backup (IIRC it's rebadged Legato Networker), which is a VERY nice backup utility. Any backup utility which gracefully recovers from kill-9'ing the various processes, and also has cool features like allowing the user to do their own restores (I would have liked a simpler front-end for the lusers though) is ubercool.
So, I can't do devel while I'm backing up, hence the blurb here. Early morning tomorrow (11am), hence the early night tonight.
iptables made it down to text size ~3800 before creeping back up to 4004 bytes once all the FIXMEs were resolved. I'm not too unhappy with that; if modules discarded their initdata, it'd be even better, but I think that's planned for 2.3.
Work progresses on the iptables userspace tool (it's currently in pieces); separating out the various protocol handling in userspace as it is now done in the kernel. Gives me a chance to review some old cruft, and fully support some options (like arbitrary TCP flags detection, and TCP option detection).
My worry now is that testing this stuff is going to be so hard. It's going to have to be an exhaustive test, and those things take time to write (and test the testsuite). Meanwhile, this coding isn't writing my HOWTO any faster (really, after this, I'll get back to it. I promise).
I want to get the HOWTO finished well before Linux Expo next month, and at least two more snapshots under my belt. We'll see.
The good news is that iptables has shrunk again; I've almost got it under 1 x86 page (although actually insmod'ing it into the kernel seems to add some weight). Under one page is the holy grail, but I'll settle for "smaller than ipfwadm", as long as it's also faster than ipfwadm (which, per-rule, is marginally faster than ipchains).
For the curious:
bash-2.02$ ls -l ip_tables.o
-rw-rw-r--   1 rusty    fwdev        8840 Apr 15 07:40 ip_tables.o
bash-2.02$ size ip_tables.o
   text    data     bss     dec     hex filename
   4397     620       0    5017    1399 ip_tables.o
bash-2.02$ lsmod
Module                  Size  Used by
ip_tables               5880   0  (unused)

The ipchains mailing list seems broken for the moment: hope it's back by the time I return, because I spent (wasted) some time replying to a "what is the latest version of ipchains and where can I get it" question.
Back to putting iptables back together again tomorrow, then onward with the documentation. Hell, I might go all out and produce a web page.
Spent much of today reworking my March LinuxWorld conference tutorial based on the feedback summary LinuxWorld sent us. I felt that the tutorial response was disappointing, mainly because techniques which work with 40 people (as attended our practice run at LinuxSA) don't scale to 200 people. And August is presumably going to be even bigger, so I'm taking the knife to the tutorial.
SuSE's Michael Hasenstein has been banging on my netfilter releases, with little joy it seems. I'm trying to get everything to work for him: he seems to be a man of much patience, given the amount of bug reports he has sent me already, and that's exactly the kind of person you want to help you in the early stages.
Really, it works for me!(TM)
Hard coding is like combing long hair. You can't just run the comb through once; you have to do it repeatedly. While there are some knots which benefit from repeated short strokes, generally it's best to comb through the entire thing before repeating. And NAT is very hairy; I didn't even understand the problem when I started (fortunately, I knew it).
The week before last, I rewrote a major part of netfilter (basically the binding management). I've just rewritten the other part (the connection management), because by the time I finished the last cut, I realized that my nice, neat model was stretched out of shape by the addition of local packet handling.
At least I've been having fun along the way, and learnt a whole heap about NAT and networking in general. It'll be fun to compare notes with other implementors of this stuff; there are heaps of fun issues.
Well, it all compiles; userspace and all. 5:40am; I'm not even going to try to see what it will do when I insmod it and pump a packet through it. Debugging is tomorrow's task (I hate crashing my machine). I'll back up before going to bed.
I was really happy to discover that Wensong Zhang (the virtual server project head, based in China) is going to be at Linux Expo, so I'll finally get to meet him. I'll just be flying there and back, and staying at Raster's if all goes according to plan. Must remember to buy beer to bribe him with...
Netfilter feels more mature now. This is the last major feature-add release, from here on it should be doco and bugfixing.
I did realize, today, that port allocation is less symmetrical than I thought. Consider these cases:
There goes my pretty model. Oh well, it was months old anyway.
So I figure this asymmetry is a detail to be handled at the per-protocol level; I'll tell them what direction this was initiated in, and let them sort it out.
The other problem (discovered just before release, and hacked) is that there are cases where only per-protocol mapping is to be done, not IP mapping. I hacked in a special case (if the IP specification is the full range, it means "don't change it"), but it's not very neat.
Recompiled with serial-console; today I'll hook up the machines together so I can see the messages when it crashes, and maybe I can find a clue. It's almost certainly some stupid mistake; I found a couple already.
After my LinuxWorld talk, more people are asking about the netfilter stuff, so I need to get a new snapshot out ASAP. I hate network debugging; it's a huge hassle (three machines, two connected with ethernet, two with a parallel cable, one machine headless). The kernel debugging interface is also damn primitive for those of us used to source-level debuggers. Still, I don't get paid for my looks...
Still, I did manage one success last night; I set up and somewhat configured Enlightenment 0.15.4 for my SO. I think Ace will like it.
PS. Woohoo! It was a *&%!ing debugging printk trying to deref a NULL pointer (thanks, serial console!). Preparing the release now: looks good!
Introduced my SO to Spellcast last night, after she proof-read my Linux Magazine column. One day in my copious free time, I'll have to do a Gnome-Spellcast rewrite. Don't busy-wait on that one...
At least I'm not alone in writing a LM column; Alan Cox writes one as well. Hope he gets paid more than I do. I ran out of inspiration so this one is just a list of various IP stack bugs (mainly fragment problems). I know Alan would write this column way better than me, but I can simply say I ran out of space when someone points out that I missed a major one. Next week, TCP bugs.
Well, it's back to crashing my box with netfilter. Never a dull moment.
Hello Paul:

I just wanted to write and thank you for the tremendous job you did writing the ipchains howto. I've been working with Linux on my home network for about a year, and network security is an area I've long been interested in. Until now, I haven't made the time to learn about it. Your howto has provided me with a lot of useful general information, as well as the inspiration to dig deeper. You obviously put a lot of work into it!

Also, I'd like to send my thanks to you and all of the others who worked to make ipchains available. I use it on my home firewall to provide access to a mixture of Win95/NT/CE, MacOS, OpenVMS, and Linux machines. Please forward my gratitude to other contributors.

Best regards,
Earl Morren
River Falls, Wisconsin, USA

You can see my previous response under ``Wednesday February 16 1999''. Anyone who attended my `Future Plan for Linux Packet Filtering' talk at LinuxWorld will know that my greatest contribution to Linux Packet filtering was the HOWTO (the changes to the packet filter code were evolutionary, and insufficient). While netfilter will change all this (and ipchains was a necessary stepping stone for me to get the experience and user feedback required for the netfilter infrastructure), I still regard the HOWTO as my greatest Linux achievement.
There are now four people doing translations of my HOWTO into other languages, and I consider this to be a huge compliment.
Meanwhile, I've reworked NAT (again). I think the new stackable framework for NAT binding is more efficient and generally nicer. You will now be able to specify rules like "masquerade everything out ppp0", and have TCP, UDP, ICMP and other handled, with full per-protocol support (without having to insert specific rules). In addition, if someone were to write a TCP load-sharing module, you'd be able to say things like "redirect,tcp-loadshare", which would first redirect packets to a local port range, then loadshare them between that. Each stage is responsible for calling the next stage, so you get to pre- and post-filter their actions.
Each protocol provides a simple default binding, which does the actual allocation of the new connection, based on its "range" parameter. Other things (such as load-sharing, redirect, masquerading) can hand it a different range parameter. It's sweet.
Per-protocol handling is done very similarly to the old code: both tcp and udp allow you to register "per-protocol" handlers, in which you specify alternate timeout and callbacks for a given destination port. This works whether the connection is being NAT'ed or RNAT'ed.
The new infrastructure is necessary because the old way of having the user specify what bind function to call was getting messy. Firstly, it meant a rule for each protocol, and secondly, "null" bindings didn't know which bind function to call, and didn't get the benefit of per-protocol handling (mainly timeout differences).
Now I have my test network up (two laptops, kevin and ketchup, connected by a PLIP cable, and a network connection from ketchup to hambush, my Netwinder), it should speed development and testing of NAT.
Local NAT, now I know how to do it properly, is on hold in order to speed up release. Once through-NAT is stable, I'll do a release then work on local NAT again.
To do it correctly, you need to create a Directed Acyclic Graph of the rule dependencies, then sort them without putting any rule before another rule it depends on. Rule B "depends on" Rule A if there is a packet which could match both, and the verdicts are different (here we don't care about counters). Figuring out the intersection of two rules is the awkward part (consider the case of interface name comparisons with possible wildcards and inverses involved).
My brain hurt trying to remember DAG stuff from my undergrad days, when I realized that it was far easier to assign a score to each rule, and sort them into descending order. Each rule has a score which is the number of packet matches it has, plus the scores of each of its dependents, plus the number of dependents. This means that if B depends on A, then A will always have a score > B. Since we have a valid order already (the original rules), it's trivial to traverse this backwards to calculate the scores, then sort into score order.
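The scoring trick can be sketched in a few lines of Python. This is a minimal model, assuming the dependencies have already been worked out and are handed in as a mapping from a rule's index to the indices of the later rules that depend on it; `score_rules` and `reorder` are hypothetical names.

```python
def score_rules(matches, dependents):
    """matches[i] is the packet-match count of rule i, in original order.

    dependents maps i to the indices (all greater than i) of the rules
    that depend on rule i and so must stay after it.
    """
    scores = [0] * len(matches)
    # Walk the original (valid) order backwards, so every dependent
    # already has its score when we reach the rule it depends on.
    for i in range(len(matches) - 1, -1, -1):
        deps = dependents.get(i, ())
        scores[i] = matches[i] + sum(scores[j] for j in deps) + len(deps)
    return scores


def reorder(matches, dependents):
    """Return rule indices sorted into descending score order."""
    scores = score_rules(matches, dependents)
    return sorted(range(len(matches)), key=lambda i: -scores[i])
```

For example, with match counts `[0, 5, 100, 2]` and rule 1 depending on rule 0, the scores come out as `[6, 5, 100, 2]`: the busy rule 2 moves to the front, while rule 0 still sorts ahead of rule 1 even though it has never matched a packet.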
Netfilter tomorrow, I promise...
This week should see my test network up and running; while I've given up on the Netwinder as a development box (2.2 isn't ready on Netwinder yet), with a serial cable I can use it as a client. This should fast-track the next netfilter development phase, which will be my focus for the next two weeks.
The main issue is going to be speed; the first cut of netfilter's NAT will be slow. Not as slow as a 2.2 kernel with transparent proxy compiled in which is also doing masquerading, but still too slow. The real benchmark in this battle is either a FreeBSD box, or (closer to home), Alexey's iproute NAT code. If I can get within 10% of Alexey for real traffic, I'll cut his code out as well (removing code == GOOD).
Of course, the real aim is to use the cache code to allow Alexey's fast forwarding to work in as many cases as possible; even if you're doing NAT, portforwarding, packet filtering etc on some packets.
I have a 6GB disk with a 2GB real-life packet dump on it, thanks to WatchGuard. In about a month I hope to have the tools in place for using this to stress-test my laptop; this is the stuff I will be benchmarking on. Finally, I'll have a reasonable response to "what size pipe can I masquerade on my Pentium 166 laptop?".
IPX firewalling is also coming along; only two weeks behind the schedule I promised Jay. Kernel module compiles; working on userspace.
I promised Jay Schulist that I'd finish IPX firewalling for him, so that should be done tonight (need to finish the userspace tool). Tomorrow we (Michael Neuling and I) catch the train to San Francisco, and Larry McVoy has offered to put us up. Then we fly to Orlando for DisneyWorld, then New York, then home.
I want to release another snapshot soon; NAT in particular is getting interesting (but needs far more testing). I'm pretty sure locking is still hosed, but what's an occasional crash between friends?
Far too much to write about at LinuxWorld. I'm pretty much committed for LinuxExpo in May. Don't know about LinuxWorld August, although if they get Alexey, I'll be there.
Random ideas that have come forth this week include: the Linux Kernel developer human pyramid, the Linux Enquirer, the Kernel Hacker secret handshake, the Linux development ship which circles the world in International waters, allowing crypto development.
I won't do a write-up, as everyone else will. It was big, but it also had worrying shades to it.
How do you allocate ports for masquerading (or any NAT where you're sharing the address space you're mapping to with a real interface)? This is done in the older code by simply hardcoding the port range 61000 - 65095 for masquerading.
This is bad because it breaks rlogin: basically, privileged ports should get mapped to privileged ports. It also restricts the number of connections you can masquerade. You also have to decide whether your NAT overlaps with an interface address (what if they bring up an interface in the middle of the NAT range?), or restrict all NAT to those ports.
Previously I had something to allow the NAT code to `claim' ports from the TCP and UDP layers. This is nicer, but still has the problems above, and means that the UDP and TCP layers need to be altered. Also, consider the case of port 8080 being allocated by NAT, and you want to start a web server there: you're out of luck.
OK. The other solution is to keep track of all connections (even those not being NAT'ed), and simply make sure no allocations clash. This should work quite well (with caching, these `null' perconns are cheap), and even allows us to share a NAT range with a real IP from a box behind the NAT machine.
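A minimal sketch of that idea in Python, assuming every tracked connection is reduced to a (proto, src_ip, src_port, dst_ip, dst_port) tuple as it appears on the wire. The function names are hypothetical, and the split into 1-1024 versus the rest follows the port mapping mentioned elsewhere in these notes.

```python
from itertools import chain


def port_class(port):
    """Ports below 1025 stay within 1-1024; everything else stays above."""
    return range(1, 1025) if port < 1025 else range(1025, 65536)


def allocate_port(conns, proto, nat_ip, src_port, dst_ip, dst_port):
    """Pick a source port so the NAT'ed tuple clashes with nothing tracked.

    conns is a set of (proto, src_ip, src_port, dst_ip, dst_port) tuples
    for every connection seen, including `null' ones that are not NAT'ed.
    Returns the chosen port, or None if the whole class is in use.
    """
    # Prefer the original port, then fall back to the rest of its class.
    for port in chain([src_port], port_class(src_port)):
        candidate = (proto, nat_ip, port, dst_ip, dst_port)
        if candidate not in conns:
            conns.add(candidate)
            return port
    return None
```

Because only complete tuples are reserved, a local web server on port 8080 is never locked out: NAT simply avoids the specific tuples that the server's real connections occupy.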
The only design problem is that there is a race when two NAT boxes happen to map UDP packets going to each other over the other packet's server port. For example, say we have a UDP server on port 50000 on box A, and port 60000 on box B. Both boxes are masquerading for networks behind them. Box A masquerades an initial UDP packet going to box B's port 60000; it happens to set the source port to 50000. Box B masquerades an initial UDP packet going to box A's port 50000; it happens to set the source port to 60000. The two packets cross in transit.
Each box will think the other packet is a reply, and demasquerade it (which is wrong). This only happens if both are initial packets (if either box has seen the other packet first, it won't assign that port, since it would be a duplicate perconn). Moreover, we can detect this case for TCP, so it has to be UDP.
The worst case is for servers on low ports (we map ports < 1025 to 1-1024), giving a 1 in a million chance. Consider two DNS servers/NAT boxes, each masquerading another DNS server. The DNS requests cross; the incoming request will be demasqueraded (instead of going to the local server) and the internal server will reply (instead of the external server). If the masquerading is one-shot (ie. expires after the first reply), then the reply will be masqueraded on a new port, and ignored by the initial server. The next request will work. Otherwise, the answer will be accepted as kosher.
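As a back-of-the-envelope check on the "1 in a million" figure: assuming (this is an assumption, not something stated above) that each box picks its low source port uniformly and independently from the 1024 available, both boxes have to land on the one port that matters.

```python
from fractions import Fraction

LOW_PORTS = 1024  # ports 1-1024, the low range described above


def crossing_odds(ports=LOW_PORTS):
    """Chance that both boxes independently pick the other's server port."""
    return Fraction(1, ports) ** 2


print(crossing_odds())  # 1/1048576
```

That is before requiring the two initial packets to actually cross in transit, so the real-world odds would be lower still.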
It might be possible to come up with a less contrived case, but it seems that this is unlikely to be a real issue.
I guess it's natural to blame the entire thing on one person, but this is ridiculous; masquerading should be credited to the original BSD authors, or anyone but me. The current masquerading code is even more bazaar-like than most of the code: there are many more names spread throughout its parts.
With that in mind, I replied thus:
Well, Linus started the kernel, Fred van Kempen did most of the early networking code, Alan Cox then took it over, Daniel Boulet and Ugen J.S.Antsilevich did the original BSD firewalling code, Alan Cox and Jos Vos ported it and modified it for Linux, Pauline Middelink did the masquerading additions, and most recently Juan Jose Ciarlante has been maintaining and enhancing it while I reworked the packet filtering code for 2.2. David S. Miller is the main current maintainer of the IP code, and Alexey Kuznetsov is the main TCP/IP hacker at the moment. Help an old lady across the road; she probably wrote one of the per-protocol masquerading modules or something.
Now I just have to test the module (tomorrow...). A week until I leave for Seattle then LinuxWorld Expo, and I have to get registrations for the Conference of Australian Linux users organized before I leave...
I realised that my idiotic library to match patterns in packets is a complete waste of space; I'll steal Brian Murrell's code I think. Brian reports that his web server occasionally splits PASV responses (no doubt due to Nagle): this will break the current MASQ code, and we must handle this case, even though it's mega icky. I wonder how many people are getting 1 in 100 masq ftp failures and not realising it (you'd have to be using a browser, or something else which uses passive ftp).
Meanwhile I'll do naive ports and fix them later. Release another snapshot on Wednesday, I hope. Tonight Michael and I went over the tutorial, and tomorrow night is the LinuxSA meeting.
Cleared the way for userspace handling of per-protocol issues; now I need to port the per-protocol modules from the old ip_masq code and test them. Of particular interest is Quake, where the detrimental effects of shuffling each packet through userspace are most likely to be noticed (I'd guess up to 200 microseconds extra delay each way on my Pentium 166). Basically, if I can get away with Quake, I can do anything (well, scanning each packet of a CU-SeeMe stream might chew CPU, but millisecond latency doesn't matter much there).
The way it's implemented is not what I was originally planning, but it makes sense. The tcp-nat and udp-nat modules take a setsockopt(), which allows you to add or delete a port from the `userspace' list. Then, any new connections set up to that port pass packets to userspace, with mark equal to that port (so different processes can wait for different protocols).
This can be trivially extended to allow the handling to be done by a kernel module, should userspace be too slow for some cases (but I don't want to encourage this unless I'm backed into a corner. With a knife at my throat).
Also figured out the `genuine transparent proxy' solution; writing a special NAT module to support it should be trivial, and it'll be functionally superior to the current setup as well (errors on outgoing connection establishment can be forwarded to the original client).
Advertising & News Inc -- Wednesday February 4th 2037
RADICAL "FREE VISION" BILL UNLIKELY TO PASS CONGRESS
An independent bill curtailing business rights on advertisements is extremely unlikely to obtain `serious consideration' according to White House spokesperson David Gammet. David Stallman, Independent congressman and grandson of the late Free Software advocate Richard Stallman, described the Advertising Liability Repeal Bill as `a return to the intentions of the constitution' regarding copyright law. `There is no evidence that Advertising Liability does anything other than reduce freedom to line the pockets of large corporations, such as Advertising & News Inc'. [thispublicationisawhollyownedsubsidiaryofadvertisingandnews].
Advertising Liability can be traced back to the landmark Pearl And Dean vs. Presley Estate case in 2013, in which the Supreme Court ruled that `use of copyrighted artwork for public viewing ... whether for advertisement or other purpose ... implies a liability on behalf of the viewer'.
In recent years, the Free Vision Foundation has promoted the use of "Open" advertisements, for which no liability is incurred; that is, the viewer pays nothing for seeing the ad. In certain niche markets (mainly educational and technical fields) these Open Advertisements claim increasing market share.
According to Advertising & News spokesman William Gateman, people want to spend money to see advertisements. `The so-called Open Ads have their place in niche markets, but it takes large teams of artists, focus groups and market research to produce quality advertisements. Obviously, no one can afford to do this for free. People are prepared to pay in return for high quality advertisements; it costs over five million dollars for a twenty second slot in the Superbowl, and we can't just give that away.'
Chairman of the Artist Protection Agency, Paul Johnston, goes further. `What anti-business radicals like the Free Vision Foundation can't seem to understand is that Advertising Liability creates thousands of jobs, and is one of the leading exports of the United States. The average person pays just 27c a day for advertising or advertising liability insurance; if it weren't for rampant liability evasion, this amount would be reduced even further.'
The Advertising Liability Protection Bill, due to be introduced next month, increases fines for Liability evasion, offers increased rewards for reporting, and simplifies collection procedures. It is widely expected to pass. Neither party has announced support for the Advertising Liability Repeal Bill, so this reporter won't be letting her insurance lapse just yet. [Emily Postnews, Washington DC]
I'm told by those who claim to know, that modern terrorist attacks are frequently done by a group of disparate people who come together for one job, complete the task and then go their separate ways. The Harvard Business Review (IIRC) took the typical Open Source organisation model as a new way of doing business: rather than a static organization with multiple goals, one organization per task, lasting only as long as the task takes.
Thus, the "Open Source Community" is an even more vague term (cf. "The Business Community" or "The Terrorist Community"); there are members of the Linux Community who aren't on speaking terms with members of the FreeBSD Community, even those driven mainly by their antipathy for the other project!
Someone used to dealing with legal entities like a large corporation bases their interaction on this fact:
The individual they are speaking with has power to enter into agreements on behalf of the corporation.
Thus, you can treat the individual as if they were the corporation itself. It's a fundamental assumption, so much so that people honour the assumption even when it's not true (eg. Nick Leeson, Barings Bank). The individual is "responsible for" the company, and "speaks for" the company.
Mr. Raymond's self-stated aim of selling Free Software to corporations means explaining it in terms they can relate to. This means adopting the role of "spokesman for" the Open Source Community, and representing them as an organisation. The message: "you can deal with the Open Source Community to your advantage".
One of the golden rules of engineering is "you can't push a string". Well, as a general rule you can't push the Open Source Community. It's not that responsibility is "decentralized", it's that there isn't any; we're not a corporation, or even a conglomerate of corporations.
Otherwise Netscape would have been able to make a deal: they release the source to their browser, in return for Apache not competing head to head with Netscape's SuiteSpot. Even considering such a deal is ludicrous, and shows a fundamental misunderstanding of "the Free Software Community".
If you're tempted to think this way, just replace "Open Source Community" with "Everyone Whose Name Starts With Q". Try cutting a deal with "Everyone Whose Name Starts With Q"; the implication that you'll have to deal with each one, one at a time, is correct. You can't push a string.
To be honest, it is possible to push; it's trickier because you can only affect individuals. Sue them; one at a time. Tip them off to the SPA; at least the software audit will cause them problems. Push their employer. Attack with frivolous patents. Of course, you'd better be ready for some really bad backlashes...
Without a stick, what interactions are possible? You can offer a carrot, and pull. Find something you want that some people out there might also want, and use it as your carrot. Netscape used their browser; even going so far as placing ads on slashdot for developers. Hardware vendors use their hardware itself; release the specs, and people who have the card will be able to use it by writing a driver. Corel are adding to Wine because they want to use it.
Realize also that the Bazaar phenomenon is a statistical effect: once there is only one member it becomes a Cathedral. In fact, at one user the distinction between Free Software and proprietary software vanishes. The theory that all bugs will be found quickly assumes that taking "care factor" multiplied by "skill factor" of each user, and adding them together, reaches a sufficient amount to overcome bugs. But like any statistical effect, there will be cases where it doesn't happen.
This is why Mr. Raymond's fetchmail program crashes for me about once a month. So I type:
(sleep 2; echo USER rustcorp
sleep 2; echo PASS password
sleep 2; echo RETR 1
sleep 2; echo DELE 1
sleep 2; echo QUIT
sleep 2) | telnet mail.camtech.net.au pop-3 > /tmp/mail
and continue as normal. Care factor: v. low. I did submit a bug report once.
Mail me your suggestions...
After a good run last Sunday with Michael and myself (which got the userspace tool, iptables, to compile), I went and rewrote the support library again to get it somewhere decent, causing an additional delay.
Wrote a man page, did some light testing, and the kernel hasn't crashed in a while, even though all my net traffic (even locally-generated stuff) is going through the NAT code. I use my development kernels for development; keeps stability on my mind.
The snapshot of my work (dubbed "netfilter") is here, (and here is the README). Now that Internic have agreed that, yes, Ryan really DID pay them for the rustcorp.com domain...
Michael and I put in some serious hours on userspace (and a bit of NAT testing) on Sunday. Just spent about 8 hours reworking the iptables support library (it wasn't powerful enough to support iptables, and needed a redesign). Lost some time due to Australia Day on Tuesday, and the 2.2.0 kernel release.
Squished another bug I found in NAT at Michael's place, and made TCP masq/NAT timeouts a bit smarter. We're getting there. Well, people are starting to get up, so that's a sign that it's getting late. iptables compiles again (this time, without warnings), and is much nicer than it used to be.
Got the LinuxWorld glossy brochure; it looks cool. I'm going to enjoy speaking at that, but I can't help but hope that The Bazaar eclipses it, because I love how they're flying developers in. Wish I could afford that for the Australian Linux Conference...
Squished a bug in my new stuff for broadcast/multicast packet filtering (there's a cheat function called dev_loopback_xmit) where packets were coming back to us without passing through the outgoing chain.
That's it for my IMPORTANT: stuff on my TODO list, other than Mike's work on the iptables tool. I'll work on something non-critical tomorrow (probably the cache stuff, since it's so cool, or maybe userspace NAT). I have a busy weekend, so I don't know how much I'll get done. A Gnome local packet watcher would be cool if I get the chance (unlikely).
DaveM referred a "does this go in 2.2" from Linus to me yesterday. I tell you, getting mail from Dave, Alexey, Alan, Linus et al. still gives me a warm fuzzy feeling. I really need a 24x7 home connection to improve my response time though. Maybe I should just move to Silicon Valley (except I can't afford it).
Took out some crap code which allowed you to specify what port range to NAT/masq. Now the code simply maps unprivileged ports to unprivileged ports and privileged to privileged. How many people will want to masq only some ports (and let the other packets through unchanged?). I guess I got infected by the packet-filter/masquerading confusion caused by previous Linux implementations (and avoiding that is the entire point of the new code). Won't happen again.
Almost finished my IMPORTANT: stuff on my TODO list; went through the entire IP subdirectory and inserted "nf_drop()" calls. See, there are other places than the packet filtering code which drop packets; in particular, people are reluctant to use Source Address Verification because it means the loss of spoof logging.
Currently nf_drop() calls are everywhere a packet gets dropped without getting delivered to userspace. This is cute, but hugely invasive and difficult to maintain. I'm tempted to restrict it to packets which are dropped between netfilter hooks (eg. once it's inside the TCP code, dropping it is fair game), or restricting it to only reporting packets which are not dropped due to out-of-mem or other stress-related errors (about half the nf_drop calls are due to this).
There are some issues with the tunnelling code, and how the hooks should work with it. I've never used it, so I'll have to delve deeper. Similar issues apply for the eql driver: does the OUTGOING hook get called twice (once for the eql device, and once for the real device?).
Michael and I are getting together on Sunday afternoon through evening to complete the alpha version of the iptables userspace tool, meaning a release (hopefully) Monday. If that happens, I can sleep on Monday (Ace is taking the day off, so I could spend it over her place), or on the public holiday Tuesday.
Then I want to spend some time trying to implement the SPF interface changes that Brian and I discussed on ipchains-dev. If I can get that settled, I can write some documentation and add a chapter to the HOWTO.
Well, kernel compile with the nf_drop calls in place has finished; I'm off to bed. I'll boot it tomorrow.
This quickly proved troublesome as well. While theoretically the right approach (at the per-protocol level we know the PID of the receiver), there are many side-cases which make implementation awkward. The final straw was the fact that there's no simple mapping from the `struct sock' to the pid. There's UID, but not PID; I'd have to add it in...
At this stage I started thinking of leaving it unimplemented, when I realized that it's pretty much unnecessary anyway. I'd chosen PIDs over UIDs because there's a special PID (0) which can be used to indicate "no recipient" for a packet. But for this to be useful, you really need a daemon to maintain your rules anyway. And a daemon can use the destination port to figure that out, so why sweat about it in the kernel for a `neat' feature which is of questionable worth?
Bottom line: PIDs (on output and input) are gone.
On the netfilter side, there are six things on my TODO list, three of which need to be completed before the weekend. I have to get onto Mike for the userspace tool, and I'll write the doco.
Some of you may have noticed the article in today's Age ( here ). Only two things in it are misleading. One is the phrase ``/the/ Linux kernel maintainer'': I am only the IP Packet Filter maintainer. The other is the ``If you want to make billions of dollars selling software licences... If you want to make millions, you can'' quote, which makes it sound like Red Hat and Walnut Creek sell software licences.
The first error is an over-zealous subedit, and the second is my fault. I love Richard Keech's sound bites in his article; I think I'll get Cybersource to vet my next interview. 8-)
BTW, if you get approached by Nathan Cochrane (Age) for an interview, you're in good hands; the man is clued up.
If anyone's wondering, it was Dave Bonn of WatchGuard who decided to write the ad for the Usenix Job board on a napkin, and Dave is v. v. cool; I might not have noticed Yet Another Formal Ad (I wasn't looking for a job at the time).
In honour of this, the informal sessions at CALU '99 will be written on Official BOF Napkins and pinned to the BOF board. (From the Jargon File, ``Abbreviation for the phrase "Birds Of a Feather" (flocking together), an informal discussion group and/or bull session scheduled on a conference program.'').
My laptop got stolen last week; fortunately all I lost was a few chapters of my LDP book (damn) and some EMail. I'd been taking a couple of weeks off firewall development (still doing maintenance though). If anyone sees an NEC Versa 6200MX (grey, 13.3'' screen) in Australia, with a missing pcmcia flap cover and possibly a blue penguin badge on it, please send me EMail.
I got interviewed in Melbourne for the IT Age by the shockingly clued-up Nathan Cochrane. I went in expecting to have to explain the difference between Open Source and shareware, and came out pleasantly surprised: it's nice to meet a journalist who reads slashdot.
While Nathan's interview with me isn't going to make slashdot, unlike last Tuesday's interview with Raster, it was an interesting experience. I'm a little worried about one quote, but it'll be OK I think.
I made an idiot of myself on netdev recently with my completely bogus `NAT UDP checksum' patch. The same idiocy is repeated below in my critique of the NAT draft, but I'm going to resist the urge to expunge it. I figure the best way to make up for it is to cut some more killer code and hope everyone forgets...
I'm headed back to the US in March, to visit WatchGuard in Seattle and give a tutorial at LinuxWorld and The Bazaar. Will be fun. Michael and I completed the workbook for the LinuxWorld tutorial, and it looks pretty wicked. I think people will get their money's worth.
Hope to release a NAT/netfilter proof-of-concept against 2.1.131 tomorrow. It'll be buggy and incomplete, but it'll be something for people to look at (unfortunately just a primitive tool for NAT, but no userspace tools for firewalling yet).
Once I have the userspace tools in place as well, I'd like to get an rsync repository set up (Ghent who's hosting rustcorp.com has been enthusiastic about it) for people to play with. There are enough people interested to make this a decent payoff, but there's only one way to find out.
Spending more time on NAT, especially handling fragments. Think I have a good generic algorithm for handling it in all cases. So much for the ``Translation of outbound TCP/UDP fragments (i.e., those originating from private hosts) in NAPT setup are doomed to fail.'' from the draft. This assertion comes from the lack of genericism shown throughout the draft, I think.
Every protocol has marks to distinguish concurrent streams. For TCP and UDP, these are the source and destination ports. For many ICMP types, it is the id field.
These can be further divided into parts we can change and parts we can't. For example, on outgoing packets, you can change the source port for TCP and UDP packets, and the id field for ICMP packets. This concept extends to arbitrary protocols, rather than being restricted to TCP and UDP.
For unknown protocols, their assertion that NAT is impossible is unduly pessimistic. If we have N external IPs which we want to use for NAT, we can have at most N DIFFERENT internal IPs accessing the SAME external IP using the SAME unknown protocol. How likely this is to happen depends on the site, and the number of IPs, but for many cases this might be quite fine.
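The N-addresses limit falls out naturally if the only thing that can tell streams apart is the address pair. A rough Python sketch, assuming a flat pool of NAT addresses and a hypothetical `bind_unknown_proto` helper:

```python
def bind_unknown_proto(bindings, nat_pool, proto, internal_ip, remote_ip):
    """Choose a NAT address for a protocol with no ports or ids to rewrite.

    bindings maps (proto, remote_ip) to a dict of nat_ip -> internal_ip.
    With nothing but addresses to distinguish streams, each NAT address
    can serve only one internal host per (proto, remote_ip) pair.
    Returns the NAT address, or None once the pool is used up.
    """
    in_use = bindings.setdefault((proto, remote_ip), {})
    # Reuse an existing binding for this internal host if there is one.
    for nat_ip, owner in in_use.items():
        if owner == internal_ip:
            return nat_ip
    for nat_ip in nat_pool:
        if nat_ip not in in_use:
            in_use[nat_ip] = internal_ip
            return nat_ip
    return None
```

With a pool of two addresses, two internal hosts can talk protocol 47 to the same remote host; a third is refused for that remote host but can still reach any other.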
Their checksum calculations don't take into account the UDP 0xFFFF checksum problem (you can't adjust it reliably unless you know if it really was 0xFFFF or 0). I rechecksum the entire packet for this case, or when that's not possible (first fragment, or inside an ICMP reply), flip a coin. For ICMP replies we could remember outgoing UDP checksums which are 0xFFFF (we're adjusting the packets anyway), but no one seems to check these checksums inside ICMPs anyway. I expect to see weird systematic one-in-131072 DNS failures through badly written NAT boxes in the next few years, though.
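For readers unfamiliar with the adjustment being discussed: a NAT box can patch a one's-complement checksum for just the 16-bit words it rewrote, instead of re-summing the packet. Below is a minimal Python sketch of the standard incremental update as formulated in RFC 1624, checked against a full recompute. The helper names are hypothetical, and this says nothing about what any particular NAT implementation does.

```python
def fold(x):
    """Fold carries back in: one's-complement addition on 16 bits."""
    while x >> 16:
        x = (x & 0xFFFF) + (x >> 16)
    return x


def checksum(words):
    """Full Internet checksum over a list of 16-bit words."""
    return ~fold(sum(words)) & 0xFFFF


def adjust(old_sum, old_word, new_word):
    """Incremental update, RFC 1624 style: HC' = ~(~HC + ~m + m')."""
    total = (~old_sum & 0xFFFF) + (~old_word & 0xFFFF) + new_word
    return ~fold(total) & 0xFFFF


# Words of a sample IPv4 header with its checksum field left as zero.
header = [0x4500, 0x0073, 0x0000, 0x4000, 0x4011,
          0xC0A8, 0x0001, 0xC0A8, 0x00C7]
before = checksum(header)

# Rewrite the high half of the source address, as a NAT box might.
rewritten = list(header)
rewritten[5] = 0x0A00
after = adjust(before, header[5], rewritten[5])
```

Here `after` agrees with `checksum(rewritten)`, which is the property the incremental form relies on.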
Fragment handling is another sticky issue. They say it's too hard for NAPT outgoing setup (and don't discuss it for other cases). My original tuple allocation protocol put a first priority at reducing clashes between IP pairs; eg. if there was already a connection to the external address 1.2.3.4, from the assigned NAT address 5.6.7.8, we'd prefer to use a different NAT address for the next connection. Due to Dan Kegel's lobbying, it's now second priority. If you do have only one connection using that IP pair and protocol, you know how to un-NAT the fragment. Otherwise, make copies and un-NAT each way for every possible connection; it will be sorted out when the offset=0 fragment arrives (which will only be mapped through the real connection).
Outgoing fragments are simply NAT'ed like anything else for which we don't understand the protocol.
Just don't suggest that we wait for the offset=0 fragment to arrive, or you'll rot in hell surrounded by PIX firewalls. Linux 2.1 and above send out fragments backwards for easier space allocation and because outgoing packets can be checksummed backwards and fragmented in one run. (God I love reading some of Alan Cox's posts).
Of course, Linux boxes can simply turn on `CONFIG_IP_ALWAYS_DEFRAG', and get around the problem altogether. But I like to solve the hard problems as well.
I know more about NAT principles than I did two weeks ago. Things like what to do about fragments, how to deal with games, how to interact with firewalling rules, and most recently, how to assign NAT addresses.
Here is my current NAT address allocation scheme (this is for masq or normal NAT, but the principle is the same for load-sharing/virtual server NAT):
alloc_perconn() handles output address allocation; we give alloc_perconn the range of allowable source addresses (this may be a single address, for example for masquerading), and a copy of the current tuple (the "src" parts of which may be changed by the function). First priority is uniqueness: tuple must be unique. Second priority is consistency (the same src is mapped the same way). Third priority is robustness in the case of fragments. Fourth priority is fairness.
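The four priorities map neatly onto a lexicographic sort key. The Python below is a toy model of that ordering only, not the real alloc_perconn(): it assumes tuples of the form (proto, src, sport, dst, dport), only ever changes the source address (never the port), and `alloc_perconn_sketch` is a hypothetical name.

```python
def alloc_perconn_sketch(conns, addr_range, tup):
    """conns maps each original tuple to its NAT'ed tuple; updated in place."""
    proto, src, sport, dst, dport = tup
    used = set(conns.values())

    def rank(addr):
        # Second priority: this src is already mapped to addr.
        consistent = any(o[1] == src and n[1] == addr
                         for o, n in conns.items())
        # Third priority: avoid sharing an (addr, dst) pair for this
        # protocol, so fragments stay unambiguous.
        shares_pair = any(n[0] == proto and n[1] == addr and n[3] == dst
                          for n in conns.values())
        # Fourth priority: spread load across the range.
        load = sum(1 for n in conns.values() if n[1] == addr)
        return (not consistent, shares_pair, load)

    # First priority: the resulting tuple must be unique.
    candidates = [a for a in addr_range
                  if (proto, a, sport, dst, dport) not in used]
    if not candidates:
        return None
    best = min(candidates, key=rank)
    conns[tup] = (proto, best, sport, dst, dport)
    return conns[tup]
```

Because Python compares tuples element by element, `min` only looks at a lower priority when every higher one ties, which is exactly the "first, second, third, fourth" behaviour described.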
Would you follow these guys through Networking hell?
Some time since my last confession. I've been working on NAT. It's not easy, but I'm probably only three rewrites away from something usable 8-). The new scheme tries to be very generic with the handling of protocols, load-sharing NAT, masquerading, etc. It also tries to be blazingly fast, and tight. Some things are cooler than the previous code. One thing to note is that Andrew Tridgell was right: it *is* possible to handle fragments in masquerading in a number of cases, but the question remains as to whether this makes enough difference in real life.
The good news is that the netfilter code hasn't required any feature additions for a week now. Feature stability is usually a precursor to runtime stability.
Moved my module feature registration code out to its own file: I use it in two places in the netfilter code, and at least once more so far in the masq code. The big 2.1 kernel lock works some magic here, but some of this code is going to flake when 2.3 rolls in.
Note to self: ponder security implications of ignoring source interface when deNATting.
My minimum aim with the NAT code is to be able to have a box supplying masquerading for multiple internal interfaces through asymmetric-routed or eql external interfaces while supplying load-balancing NAT for incoming connections, and have it set up by two intuitive commands.
Michael and I are getting together some weekend soon to work on userspace tools for firewalling and NAT. Then this might be of more interest to the world at large.
In other news, my NetWinder is apparently waiting for me in Seattle; they'll be sending it to me soon (fingers crossed). Then I'll have three machines in my test network.
Going sailing this Sunday if the weather holds up; trying to find a place which rents out cats or small boats.
More rewrites. I realized that I can move everything that was a module right out to userspace now. Not the most efficient solution (standard localhost pings go from 0.1ms to 0.2ms when routed through userspace) but good enough for small sites, and great for testing.
My intention is to allow people to develop in userspace and then painlessly move to a kernel module if they need the speed. Developing in userspace, with its richness of tools, is much nicer. Hope to release again later this week.
Weather is improving here. Summer coming.
Things are (fortunately) quiet on all fronts at the moment; the only reports of problems with the latest ipchains are with the bz2 files (probably due to a new bzip2 version (0.90) here).
I've visited Sydney, Brisbane and Melbourne since last writing, promoting the conference. I also took the opportunity to inspect and pencil in bookings for the speakers' accommodation and the BOF sessions. I have now met all the committee members bar one (Nathan Bailey).
I had lunch with Darren Reed while in Melbourne. Darren is the author of ipfilter: the packet filtering system now used by NetBSD, OpenBSD and FreeBSD. Darren seemed really cool, and didn't raise any red flags over my plans; it looks like they should be sufficient for ipfilter to be painlessly ported over the top of the new infrastructure. This is very convenient for people running heterogeneous networks, and also promotes competition between the different firewall flavors (ipfilter is more sophisticated than ipchains by a long shot).
Long weekend of hacking ahead of me; I want to get a snapshot out by Monday.
Released libfw0.2 today. I have high hopes that Brian Murrell will do awesome things with this. No doubt there are still things to be done.
Just compiling my rewrite of firewall code (called netfilter). The code has two entry functions (replacing call_*_firewall()):
void nf_hook(int pf, unsigned int hook, struct sk_buff *skb, const union if_or_pid *in, const union if_or_pid *out, nf_okfn okfn, nf_badfn badfn, void *arg);
int nf_hook_wait(int pf, unsigned int hooknum, struct sk_buff **pskb, const union if_or_pid *in, const union if_or_pid *out);
The first is for when you can't block; it takes over the skb (this is necessary because a netfilter hook might want to pass a packet to userspace). The second gives you a return value, but might block (and thus can't be called from an interrupt or bh).
I was putting off implementing the second until I really had to. This was a mistake, because when I tried to do it, I couldn't. Turns out that it's far easier to implement the nf_hook_wait() case first, then have a kernel thread handle the other case (nf_hook()). Ended up only taking a few hours to write that way around, and it's much simpler.
I came across a new lesson in locking in the old firewall code a while ago, which I applied to the new code with a twist. No locking is done for the traversal of the list, which is a real trick. By using the clumsy interface of registering firewall hooks by a structure, with the next pointer for a linked list already in place, a structure can be pulled out of the list without changing it in any way.
This means that we are OK traversing the list even if the element we are on is removed (the next pointer will point to the element which has just taken its place). If we assume that the actual destruction of the element (usually caused by removal of the module associated with it) takes place long afterwards, this is safe.
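A minimal sketch of that unlink-without-touching-the-element trick, in plain C. All names here (struct hook, hook_register() and friends) are made up for illustration and are not the netfilter ones. It only shows the pointer shape: writers are assumed to be serialised by a lock that isn't shown, and real SMP code would also have to worry about memory ordering.

```c
#include <stddef.h>

/* Hypothetical hook record; not the real netfilter structure. */
struct hook {
    struct hook *next;   /* left alone when the hook is unlinked */
    int (*fn)(int pkt);  /* returns nonzero to accept the packet */
};

/* Two toy hook functions, for illustration only. */
static int accept_all(int pkt) { (void)pkt; return 1; }
static int reject_odd(int pkt) { return (pkt % 2) == 0; }

/* Writers are assumed to hold some lock (not shown). Prepending
 * never modifies an element that is already on the list. */
static void hook_register(struct hook **head, struct hook *h)
{
    h->next = *head;
    *head = h;
}

/* Unlink h by redirecting whatever pointed at it. h itself is not
 * modified, so a reader standing on h can still follow h->next back
 * onto the live list. */
static void hook_unregister(struct hook **head, struct hook *h)
{
    struct hook **pp;

    for (pp = head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == h) {
            *pp = h->next;
            return;
        }
    }
}

/* Reader: takes no lock. Returns 1 if every hook accepts pkt. */
static int hook_traverse(struct hook *start, int pkt)
{
    struct hook *h;

    for (h = start; h != NULL; h = h->next)
        if (!h->fn(pkt))
            return 0;
    return 1;
}
```

Calling hook_traverse() starting from an element that has just been unregistered still reaches the rest of the list, which is the whole point; it only stays safe as long as the element's memory really does outlive any reader still standing on it.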
Of course, if we want to queue to userspace, this breaks. So in this case (speed no longer being critical), we grab the lock, check that the element is still on the list (if not, drop), and increment a counter showing that it has ordered packets to userspace and they're not back yet. When trying to unregister the hook, we return -EBUSY if this count is non-zero.
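The queue-to-userspace bookkeeping could look roughly like this. Again a sketch with invented names (struct qhook, qhook_queue(), and so on), using a C11 atomic_flag as a stand-in for the real lock; the -ENOENT return for "hook already gone, drop the packet" is my assumption for the example, only the -EBUSY on unregister comes from the description above.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical hook record with a count of packets still in userspace. */
struct qhook {
    struct qhook *next;
    unsigned int in_flight;
};

static atomic_flag qhook_lock = ATOMIC_FLAG_INIT;  /* stand-in spinlock */
static struct qhook *qhook_list;

static void qhook_lock_acquire(void)
{
    while (atomic_flag_test_and_set(&qhook_lock))
        ;  /* spin */
}

static void qhook_lock_release(void)
{
    atomic_flag_clear(&qhook_lock);
}

void qhook_register(struct qhook *h)
{
    qhook_lock_acquire();
    h->in_flight = 0;
    h->next = qhook_list;
    qhook_list = h;
    qhook_lock_release();
}

/* Called before handing a packet to userspace on behalf of h.
 * Returns 0 if counted, or -ENOENT (assumed) if h is no longer on
 * the list, in which case the caller drops the packet. */
int qhook_queue(struct qhook *h)
{
    struct qhook *p;
    int ret = -ENOENT;

    qhook_lock_acquire();
    for (p = qhook_list; p != NULL; p = p->next) {
        if (p == h) {
            h->in_flight++;
            ret = 0;
            break;
        }
    }
    qhook_lock_release();
    return ret;
}

/* Called when userspace hands a packet back. */
void qhook_reinject(struct qhook *h)
{
    qhook_lock_acquire();
    if (h->in_flight > 0)
        h->in_flight--;
    qhook_lock_release();
}

/* Refuses with -EBUSY while packets are still out in userspace. */
int qhook_unregister(struct qhook *h)
{
    struct qhook **pp;
    int ret = -ENOENT;

    qhook_lock_acquire();
    if (h->in_flight != 0) {
        ret = -EBUSY;
    } else {
        for (pp = &qhook_list; *pp != NULL; pp = &(*pp)->next) {
            if (*pp == h) {
                *pp = h->next;  /* h->next itself is left untouched */
                ret = 0;
                break;
            }
        }
    }
    qhook_lock_release();
    return ret;
}
```

Taking the lock on this path is affordable because, as noted, speed is no longer critical once a packet is headed for userspace.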
New releases of ipchains coming thick and fast as bugs are squished from the reorganisation.
libfw release coming up: problems injecting packets through raw sockets are the only thing stopping me; worked around a netlink bug already.
Race conditions in the new firewall registration code. The old code has a really nice property that you can traverse the list of firewalls without a lock; in fact, this has a race condition, but because removing a module is not done until well after the module is deregistered, no one ever notices. I want to keep this, so another rewrite of this stuff (closer to the original) is in order.
Back from Canberra, where I finally met Andrew Tridgell of Samba fame; had a great time there. Been working on 1.3.7, after a few bad bugs in 1.3.6, and Andi Kleen submitted a long options patch against 1.3.6 (and some cool info for the HOWTO).
Struck a snag with the 2.3 firewalling infrastructure which has forced me to implement a chunk now which I was planning to leave until later. This is good, as I'll have a fairly complete system once it's done.
I lost a bootable version of my new firewall code when my hard drive died; two weeks' solid work down the plughole. The rewrite currently compiles, but doesn't boot much.
I'm in Perth at the moment, and the weather is fine. Met up with the local LUG (this is the beginning of my Australian plug-the-conference tour); very cool people, so I'm hoping for some good papers and a number of attendees.
Took some time to make a MIN/MAX macro consolidation patch against 2.1.125. Trickier than it sounds (to do it right). No mail access at the moment (no phone in the room: must find a net cafe), so it'll be a while before that goes out: I'm back in Adelaide on Sunday.
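For a flavour of why MIN/MAX is trickier than it sounds, here is one well-known pitfall and one common fix. This is a generic illustration and not what the patch itself does: the macro names are invented, and the "safe" form relies on GNU C extensions (statement expressions and __typeof__), so it assumes gcc or a compatible compiler.

```c
/* Naive form: whichever argument wins is evaluated twice, so side
 * effects in the arguments happen twice. */
#define NAIVE_MIN(a, b) ((a) < (b) ? (a) : (b))

/* GNU C form: each argument is evaluated exactly once into a
 * temporary of its own type. */
#define SAFE_MIN(a, b) \
    ({ __typeof__(a) _a = (a); __typeof__(b) _b = (b); _a < _b ? _a : _b; })
#define SAFE_MAX(a, b) \
    ({ __typeof__(a) _a = (a); __typeof__(b) _b = (b); _a > _b ? _a : _b; })

/* How many times does i++ run inside the naive macro? */
static int naive_increments(void)
{
    int i = 0;
    int r = NAIVE_MIN(i++, 10);

    (void)r;
    return i;  /* 2: once in the comparison, once in the result */
}

/* Same question for the single-evaluation form. */
static int safe_increments(void)
{
    int i = 0;
    int r = SAFE_MIN(i++, 10);

    (void)r;
    return i;  /* 1 */
}
```

Unparenthesised arguments and mixed signed/unsigned comparisons are further traps of the same kind, and a consolidated macro has to behave sensibly at every existing call site.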
Well, it boots. I'm using it right now, in a coffee shop in Perth. Was worried for a while over my overzealous sanity checks, but it makes sense for RST packets to not pass the OUTPUT hook.
Time to write the first cut of the queuing interface; I have a three hour flight, after all.
This time I'm going to take up Dave Miller's suggestion to hang out in the Silicon Valley area for a week or so; hopefully make an SVLUG meeting.