Berkeley CSUA MOTD: Entry 10742

2003/10/22-24 [Computer/SW/OS/FreeBSD] UID:10742 Activity:nil
10/22   What's a good place to learn about the mmap and vfork architectures?
        Like how they work in the OS, etc etc. Thanks.
        \_ If you're just starting out, I recommend the following articles.
           They're interesting and fairly contemporary, and the author /
           interviewee is coming to a CSUA meeting tomorrow.  Hint hint.
           http://www.daemonnews.org/200001/freebsd_vm.html
           http://kerneltrap.org/node/view/8
        \_ APUE by Richard Stevens
           \_ As far as I remember, APUE only tells you what mmap and vfork do
              and when to use them, but it doesn't tell you how those system
              calls are actually implemented.
           \_ Isn't Apue the guy on the simpsons?
        \_ /usr/src
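
        A minimal sketch of the two calls in question (illustrative only,
        not from the thread; the file path and message are arbitrary):
        mmap() maps a file into the address space in place of read(), and
        vfork() spawns a child that borrows the parent's address space
        until it calls an exec function or _exit().

            #include <fcntl.h>
            #include <stdio.h>
            #include <sys/mman.h>
            #include <sys/stat.h>
            #include <sys/types.h>
            #include <sys/wait.h>
            #include <unistd.h>

            int main(void)
            {
                /* mmap: map a file into memory instead of read()ing it. */
                int fd = open("/etc/motd", O_RDONLY); /* any regular file */
                if (fd < 0) { perror("open"); return 1; }
                struct stat st;
                if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
                char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE,
                               fd, 0);
                if (p == MAP_FAILED) { perror("mmap"); return 1; }
                fwrite(p, 1, st.st_size, stdout); /* pages fault in lazily */
                munmap(p, st.st_size);
                close(fd);

                /* vfork: like fork, but the child borrows the parent's
                 * address space (nothing is copied) and may only call an
                 * exec function or _exit(). */
                pid_t pid = vfork();
                if (pid == 0) {
                    execlp("echo", "echo", "hello from the vforked child",
                           (char *)NULL);
                    _exit(127);          /* reached only if exec fails */
                }
                waitpid(pid, NULL, 0);
                return 0;
            }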
Cache (8192 bytes)
www.daemonnews.org/200001/freebsd_vm.html
For the last year I have concentrated on a number of major kernel subsystems within FreeBSD, with the VM and Swap subsystems being the most interesting and NFS being a necessary chore. In the VM arena the only major rewrite I have done is of the swap subsystem. Most of my work was cleanup and maintenance, with only moderate code rewriting and no major algorithmic adjustments within the VM subsystem. The bulk of the VM subsystem's theoretical base remains unchanged, and a lot of the credit for the modernization effort of the last few years belongs to John Dyson and David Greenman. Not being a historian like Kirk, I will not attempt to tag the various features with people's names, since I would invariably get it wrong.

Before moving along to the actual design, let's spend a little time on the necessity of maintaining and modernizing any long-living codebase. In the programming world, algorithms tend to be more important than code, and it is precisely due to BSD's academic roots that a great deal of attention was paid to algorithm design from the beginning. More attention paid to the design generally leads to a clean and flexible codebase that can be fairly easily modified, extended, or replaced over time. While BSD is considered an old operating system by some people, those of us who work on it tend to view it more as a mature codebase which has had various components modified, extended, or replaced with modern code. It has evolved, and FreeBSD is at the bleeding edge no matter how old some of the code might be. This is an important distinction to make, and one that is unfortunately lost on many people. The biggest error a programmer can make is to not learn from history, and this is precisely the error that many other modern operating systems have made. NT is the best example of this, and the consequences have been dire. Linux also makes this mistake to some degree - enough that we BSD folk can make small jokes about it every once in a while, anyway (grin). Linux's problem is simply one of a lack of experience and history to compare ideas against, a problem that is easily and rapidly being addressed by the Linux community in the same way it has been addressed in the BSD community: by continuous code development. The NT folks, on the other hand, repeatedly make the same mistakes solved by UNIX decades ago and then spend years fixing them. They have a severe case of "not designed here" and "we are always right because our marketing department says so."

Much of the apparent complexity of the FreeBSD design, especially in the VM/Swap subsystem, is a direct result of having to solve serious performance issues that occur under various conditions. These issues are not due to bad algorithmic design but instead arise from environmental factors. In any direct comparison between platforms, these issues become most apparent when system resources begin to get stressed. As I describe FreeBSD's VM/Swap subsystem, the reader should always keep two points in mind. First, the most important aspect of performance design is what is known as "optimizing the critical path." It is often the case that performance optimizations add a little bloat to the code in order to make the critical path perform better. Second, a solid, generalized design outperforms a heavily-optimized design over the long run.
While a generalized design may end up being slower than a heavily-optimized design when first implemented, the generalized design tends to be easier to adapt to changing conditions, while the heavily-optimized design winds up having to be thrown away. Any codebase that will survive and be maintainable for years must therefore be designed properly from the beginning, even if it costs some performance. Twenty years ago people were still arguing that programming in assembly was better than programming in a high-level language because it produced code that was ten times as fast. Today, the fallibility of that argument is obvious - as are the parallels to algorithmic design and code generalization.

VM Objects

The best way to begin describing the FreeBSD VM system is to look at it from the perspective of a user-level process. Each user process sees a single, private, contiguous VM address space containing several types of memory objects. Program code and program data are effectively a single memory-mapped file (the binary file being run), but program code is read-only while program data is copy-on-write. Program BSS is just memory allocated and filled with zeros on demand, called demand-zero page fill. Arbitrary files can be memory-mapped into the address space as well, which is how the shared library mechanism works. Such mappings can require modifications made to them to remain private to the process making them.

The fork system call adds an entirely new dimension to the VM management problem, on top of the complexity already given. A program binary data page (which is a basic copy-on-write page) illustrates the complexity. A program binary contains a preinitialized data section which is initially mapped directly from the program file. When a program is loaded into a process's VM space, this area is initially memory-mapped and backed by the program binary itself, allowing the VM system to free/reuse the page and later load it back in from the binary. The moment a process modifies this data, however, the VM system must make a private copy of the page for that process. Since the private copy has been modified, the VM system may no longer free it, because there is no longer any way to restore it later on.

You will notice immediately that what was originally a simple file mapping has become much more complex. Data may be modified on a page-by-page basis, whereas the file mapping encompasses many pages at once. The complexity increases further when a process forks: the result is two processes, each with its own private address space, including any modifications made by the original process prior to the call to fork(). It would be silly for the VM system to make a complete copy of the data at the time of the fork(), because it is quite possible that at least one of the two processes will only need to read from a given page from then on, allowing the original page to continue to be used. What was a private page is made copy-on-write again, since each process (parent and child) expects its own post-fork modifications to remain private to itself and not affect the other.

FreeBSD manages all of this through a layered VM Object model. The original binary program file winds up being the lowest VM Object layer. A copy-on-write layer is pushed on top of that to hold those pages which had to be copied from the original file. If the program modifies a data page belonging to the original file, the VM system takes a fault and makes a copy of the page in the higher layer. A fork is a common operation for any BSD system, so this example will consider a program that starts up and forks.
When the process starts, the VM system creates an object layer, let's call it A. A represents the file - pages may be paged in and out of the file's physical media as necessary. Paging in from the disk is reasonable for a program, but we really don't want to page back out and overwrite the executable. The VM system therefore creates a second layer, B, that will be physically backed by swap space. On the first write to a page after this, a new page is created in B, and its contents are initialized from A.

When the program forks, the VM system creates two new object layers - C1 for the parent, and C2 for the child - that rest on top of B. In this case, let's say a page in B is modified by the original parent process. The process will take a copy-on-write fault and duplicate the page in C1, leaving the original page in B untouched. Now, let's say the same page in B is modified by the child process. The process will take a copy-on-write fault and duplicate the page in C2. The original page in B is now completely hidden, since both C1 and C2 have a copy and B could theoretically be destroyed if it does not represent a "real" file. However, this sort of optimization is not trivial to make because it is so fine-grained. Now, suppose (as is often the case) that the child process does an exec(). Its current address space is usually replaced by a new ...
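
The copy-on-write layering described above is easy to observe from userland. Below is a minimal sketch (mine, not from the article; the 4096-byte page size is an assumption) in which a child modifies a MAP_PRIVATE anonymous page after fork(). The write faults the page into the child's private top layer (the C2 object above), while the parent continues to see the original page.

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* One demand-zero, copy-on-write page - the swap-backed "B" layer. */
        char *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANON, -1, 0);
        if (page == MAP_FAILED) { perror("mmap"); return 1; }
        strcpy(page, "original");

        pid_t pid = fork();
        if (pid == 0) {
            /* Child: this write takes a copy-on-write fault and duplicates
             * the page into the child's own top layer ("C2"). */
            strcpy(page, "child's copy");
            printf("child  sees: %s\n", page);
            _exit(0);
        }
        waitpid(pid, NULL, 0);
        /* Parent: unaffected by the child's write; still the pre-fork page. */
        printf("parent sees: %s\n", page);
        return 0;
    }

If the parent also wrote to the page, each process would end up with its own copy in its top layer and the original page in B would be completely hidden, which is exactly the situation the article describes.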
Cache (8192 bytes)
kerneltrap.org/node/view/8
I've always meant to go back and get a master's in something other than CS, and I still might. In any case, throughout this early period I got involved with the Commodore PET (8th grade), then the Amiga (late high school). I learned 6502 machine code (hex codes, not assembly) on the PET, which led to my writing an assembler and editor in 6502 hex for the PET, after which I wrote most programs in assembly. In any case, this interest in C eventually led to the writing of the DICE C compiler for the Amiga, which I did because I thought Lattice C was too expensive for many Amiga programmers. I sold DICE as shareware and it quite unexpectedly generated a fair chunk of income. This allowed me to expand into later Amiga models (A3000) as well as put together some fairly souped-up PCs for the times, on which I ran Linux.

From late high school through college and beyond I also worked for a small engineering company in Truckee. I wound up doing all of the digital design work and software for several telemetry systems installed in the area, as well as a number of other projects. We did everything from single-chip boards to fairly sophisticated memory-managed 68000 boards, and everything from digital microprocessor systems to analog/radio systems, power supplies, and so forth. We even did a 600-gate ASIC chip design once, which was a lot of fun.

Upon returning to the Bay Area in 1994ish I started a small ISP called BEST Internet with a couple of friends. This rapidly progressed, acquiring TLG (The Little Garden) in later years, merging with a pure web hosting provider called HiWay Technologies, and eventually being sold to Verio, at which point I and the other founders decided it was time to do something else. The finish occurred at the height of Internet Mania, so we all did quite well. After that I took a year off, then started another startup with my brother. We did a very mundane-sounding billing system which was actually very sophisticated, but our timing couldn't have been worse and we wound up having to idle the company. I still hope to adapt some of the database technology we developed for Backplane Inc. for open source use. At the moment I am focusing on developing the Backplane technology for open-source use, working on FreeBSD, and planning the vacation that I couldn't take for the last 7 years.

Matt Dillon: The core of the database is a quorum-based peer-to-peer replication system that is able to maintain transactional integrity across all peers and snapshots. The database itself uses a very basic SQL command set similar to the original MySQL command set. The core replication features have a large number of potential applications including distributed web serving, distributed filesystems, and so forth.

JA: I remember using DICE C back when I owned an Amiga 2500 in college. Finally, could it be used to compile something as complex as the FreeBSD kernel?

Matt Dillon: It doesn't implement GCC's __asm__ macros and it doesn't implement inlines. It will compile under UNIX but, of course, it produces 68000 code. It is fully self-contained and can compile, assemble, and link code into Amiga-style binaries, and it can also produce ROMable code. In fact, I used DICE quite extensively for some of the telemetry projects while working up in Tahoe. DICE does a fair number of optimizations, but nothing compared to what GCC can do. It still performs quite well, though it can only produce 68000 output. I will say with absolute authority that it actually took TWO weeks to write!
Actually I was able to write the preprocessor (dcpp) and the compiler core in about two weeks, but I still had to use the Lattice assembler and linker and, really, after only two weeks' worth of work about the only thing I could compile was a hello world program. But this was enough to convince me that completing the compiler was possible, and in later weeks I beefed up the compiler core, wrote my own native linker and assembler, and eventually started selling DICE as shareware. The linker took about a day, though I later decided to really do it right and make it collapse sections properly to support my auto-init stuff, and that took a little longer. DICE took many months to refine into what I would consider a commercial-quality system. I started selling it as shareware and eventually formed a small company called Obvious Implementations Corporation with a couple of friends (people will remember Bryce Nesbit and John Toebes) to sell it commercially. But as time progressed the Amiga started to decline, due to Commodore's crash and burn and the increased processing power of PCs. It did well enough as a side project, but eventually it ran its course and we decided to give the source away.

Matt Dillon: I got into BSD (the original CSRG BSD) starting in my last year of high school. A good friend of mine was taking a number of UCB courses and that gave me access to B50-Evans. I was very much into C programming during my Berkeley years but only did a little kernel hacking. There was a Perkin-Elmer in Cory Hall we had access to, and I rewrote the serial driver to use the PE's microcoded DMA (which still required a real interrupt before it would do anything) to improve SLIP performance between it and B50-Evans. I knew Phil Lapsley, and made many friends (quite a few of whom are still involved with FreeBSD), but I never really got involved with CSRG. I didn't actually meet icons like Kirk McKusick till well after my college years.

In any case, after Berkeley I moved up to Tahoe, where the small engineering company I worked for was located, and FreeBSD didn't even exist then, not really. I fell in with Linux and did a bunch of work on the Linux kernel, but once FreeBSD really got going I slipped back into the fold and began using it. Simply put, FreeBSD was BSD, and I just had to go back to the UNIX I knew and loved from UCB. When we started BEST Internet we ran both BSDI and FreeBSD, and other things (I'm not going to talk about the IRIX snafu). Eventually it became clear that FreeBSD was the only way to go, especially considering my penchant for tracking down and fixing OS bugs. My focus was to keep BEST's shell machines, with 2000 user accounts per machine, up and running for as long as possible, and that meant finding and fixing bugs as they came up. I also had access to a great deal of hardware in later years at BEST (in the 1996 timeframe), which put me in a unique position to ferret problems out and fix them. I found bugs in the VM system, in quotas, in the filesystem code, and in many other places during those years. This work eventually led to my getting what we call the "commit bit", which allowed me to commit directly to the FreeBSD CVS tree. With this bit in hand I decided to fix the VM system by rewriting half of it, which turned me into a Terror from the point of view of some of the old-boys in the tree, who became alarmed at the rate I began committing changes.
I think part of it was simply that many of them didn't understand the VM system and therefore didn't understand what I was doing, but since they weren't really interested in fixing the VM system themselves and I was getting massive support from the end-user community, they didn't have much of a choice. This was a time when the VM system had a huge vacuum due to the departure of John Dyson. The only other person doing significant work on it was David Greenman, and he was already burning out after making hundreds of changes to John's in-progress work in order to get the system to not crash. This led to an overreaction by core, which (by my account, anyway) led to some rather draconian rules in an attempt to slow me down. Suffice it to say that there was a lot of friction during this time, and I even lost my commit bit for a number of months because of it. It didn't stick, of course; it just made DG's job harder, because now he or Alan Cox (not the Linux Alan Cox, our Alan Cox) had to commit my submissions themselves, and I was still fixing bugs at an insane rate. Eventually my work proved out, both in the later 3.x releases and in the FreeBSD-4.0 release, and I became known as one of the VM Gurus.

JA: Your impressive record of successful fixes and improvements stands on its own. On the other side, have you made any noticeable mistak...