Check out our annotated group photo, which puts names to as many people in the picture as possible.

High-performance database requirements

The first presentation was by Lance Larsh of Oracle who, essentially, provided a laundry list of changes and features Oracle would like to see in order to get better performance out of high-end, large database servers. It comes down to the following:

* Raw I/O has a few problems that keep it from achieving the performance it should: large operations are broken down into 64KB batches which are transferred sequentially.
* The block I/O interface deals in blocks, so large chunks are broken down into 512-byte pieces which are shoved individually into the block system, where they are immediately reassembled. It would be far better to just keep the large operations intact.
* The global I/O request lock needs to be split into a set of per-device locks, allowing more operations to be carried out in parallel.
* I/O to high memory currently requires bounce buffers; much modern, high-end hardware can address high memory directly, meaning that the bounce buffers are unnecessary.
* Larger pages would help as well: memory and performance gains can be significant when the page table size can be reduced by a factor of 512.

Most of these requests met with little resistance. One exception was the request for non-preemption of processes holding critical locks, which was considered dangerous and unnecessary.

SCTP

SCTP is the "Stream Control Transmission Protocol," defined by RFC 2960. It is intended to be a new transport protocol with many of the advantages of both TCP and UDP, and with some additional features, such as dynamic network failover for multihomed hosts. SCTP also incorporates multiple streams of messages into a single connection, which can bring performance benefits in situations where multiple streams are used. A beginning SCTP implementation was presented by La Monte Yarroll, along with a description of the changes that are needed in the kernel. Among other things, a new version of the bind() system call is needed which can bind a socket to multiple addresses, so that the failover mechanism can work.

Block I/O requirements

Stephen Tweedie presented a list of block I/O issues he thought needed to be addressed, as a way of starting discussion on the topic. Much of his talk mirrored the list presented earlier by Oracle's Lance Larsh, but there were a number of new items as well.

The first set of problems with the current block layer has to do with scalability:

* The system needs to be able to handle very large numbers of devices, leading, among other things, to an exhaustion of the number of device numbers available (see also the March 29 LWN kernel page).
* Large memory systems need to be able to work without using bounce buffers.
* The current sector addressing scheme, which limits devices to 2TB, needs to be redone.
* There is also an issue with some SCSI devices which can report large numbers of units on each target.

If an I/O operation fails, the kernel has no way of knowing whether the disk simply has a bad sector or the controller is on fire. As a result, a bad block will cause a drive to be removed from a RAID array, even though it is otherwise functioning fine. The system needs to provide information on just what has failed; in the case of a write failure, the data in memory is still good and can be used by the system.

The splitting up of large requests in the block layer is a recurring theme here. The kernel does not maintain separate queues for distinct devices, and I/O scheduling fairness is also an issue at higher levels.

Then there is a set of new features required by the kernel. I/O barriers are one such feature: a journaling filesystem, for example, needs to know that everything has been written to the journal before its associated "commit" record goes to disk. By placing a barrier, the journaling code could ensure that no operations are reordered around the commit. On SCSI disks, the barrier can be implemented with a SCSI ordered tag, meaning that there is no need to actually wait until the barrier has been reached. On IDE systems, instead, an actual "flush and wait" cycle would be required.

A similar topic is the need for explicit control of write caching in the drive itself. Disk drives typically report that a write operation is complete as soon as the data reaches the drive's internal cache, but it's often necessary to know that the data has actually reached the oxide on the disk itself. There is also a need for access to the real geometry of the drive. And, as hardware configurations become more dynamic, it's important to be able to track disks as they move around.

There was surprisingly little controversy in this session, perhaps because it concentrated on goals and didn't get into actual implementation designs.
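To make the barrier idea a bit more concrete, here is a minimal user-space sketch of how a journaling filesystem might commit a transaction using a barrier request. The structures, flags, and helper functions are invented for illustration; they are not the kernel's actual block interface.

    /*
     * Minimal sketch of the I/O-barrier idea.  All names here
     * (struct io_request, REQ_BARRIER, submit_request()) are
     * hypothetical; they illustrate the concept only.
     */
    #include <stdio.h>

    enum req_flags { REQ_NORMAL = 0, REQ_BARRIER = 1 };

    struct io_request {
        const char *description;
        int flags;               /* REQ_BARRIER: no reordering past it */
    };

    /* Stand-in for queueing a request to the block layer. */
    static void submit_request(struct io_request *req)
    {
        printf("queued: %s%s\n", req->description,
               (req->flags & REQ_BARRIER) ? " [barrier]" : "");
    }

    /*
     * Committing a journal transaction: all journal blocks must reach
     * stable storage before the commit record.  With a barrier, the
     * filesystem does not have to wait for the journal writes to
     * finish; it only has to ensure the commit cannot pass them.
     */
    static void commit_transaction(void)
    {
        struct io_request journal = { "journal blocks", REQ_NORMAL };
        struct io_request commit  = { "commit record",  REQ_BARRIER };

        submit_request(&journal);   /* may still be in flight... */
        submit_request(&commit);    /* ...but cannot pass the barrier */
    }

    int main(void)
    {
        commit_transaction();
        return 0;
    }

The point of the flag is ordering, not completion: the commit record may be written whenever the drive gets around to it, as long as it is not written first.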
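The 2TB limit mentioned in the scalability list above falls out of simple arithmetic, assuming 512-byte sectors indexed by a 32-bit number; a quick check:

    /* Where the 2TB limit comes from: a 32-bit sector index over
     * 512-byte sectors tops out at 2 TiB. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t sectors  = 1ULL << 32;     /* max sectors with a 32-bit index */
        uint64_t capacity = sectors * 512;  /* bytes */

        printf("max capacity: %llu bytes (%llu GiB)\n",
               (unsigned long long)capacity,
               (unsigned long long)(capacity >> 30));  /* 2048 GiB == 2 TiB */
        return 0;
    }

Getting past the limit means either a wider sector index or addressing the device in larger units.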
Integrating high-performance filesystems

Steve Lord of SGI presented some of the advanced features of the XFS filesystem, with the idea that, perhaps, some of them should be moved into the Linux VFS layer. The most interesting, certainly, was the "delayed write allocation" technique employed by XFS. When a process extends a file, XFS makes a note of the space that has been used, but does not actually go to the trouble of figuring out where it will live on the disk. Only when the kernel gets around to actually flushing the file data to the disk will that allocation be done. There are a couple of benefits to this scheme:

* Processes that write data to files often write more data shortly thereafter. Delayed allocation means that the filesystem can lay out all of that data together, making the file contiguous on disk.
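As a rough illustration of the delayed-allocation idea, the sketch below reserves space at write time and makes the actual placement decision only at flush time, when the total amount of data is known. The structures and the toy allocator are hypothetical; this is not XFS code.

    /*
     * Sketch of delayed write allocation: writes only reserve space,
     * and one placement decision is made at flush time, so the data
     * can be laid out contiguously.  All names are invented.
     */
    #include <stdio.h>
    #include <stddef.h>

    struct file_state {
        size_t reserved_bytes;   /* space promised to the writer */
        size_t disk_offset;      /* assigned only at flush time */
        int    allocated;
    };

    static size_t next_free_byte = 0;   /* toy "allocator" cursor */

    /* write path: note the space used, but do not place it yet */
    static void delayed_write(struct file_state *f, size_t nbytes)
    {
        f->reserved_bytes += nbytes;
    }

    /* flush path: allocate once, contiguously, for everything reserved */
    static void flush(struct file_state *f)
    {
        if (!f->allocated) {
            f->disk_offset = next_free_byte;
            next_free_byte += f->reserved_bytes;
            f->allocated = 1;
        }
        printf("placed %zu bytes contiguously at offset %zu\n",
               f->reserved_bytes, f->disk_offset);
    }

    int main(void)
    {
        struct file_state f = { 0, 0, 0 };

        delayed_write(&f, 4096);    /* several small writes... */
        delayed_write(&f, 4096);
        delayed_write(&f, 8192);
        flush(&f);                  /* ...one contiguous allocation */
        return 0;
    }

Because the decision is deferred, data written in several small chunks can still end up in a single contiguous extent, which is the benefit listed above.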
The network driver API

Jamal Hadi Salim led a session describing changes to the network driver interface. The stock Linux kernel performs poorly under very heavy network loads - as the number of packets received goes up, the number of packets actually processed begins to drop, until it approaches zero in especially hostile situations. A number of problems have been identified in the current networking stack, including:

* In heavy load situations, too many interrupts are generated. When several tens of thousands of packets must be dispatched every second, the system simply does not have the resources to deal with a hardware interrupt for every packet.
* Currently, packets are dropped far too late, after considerable resources have been expended on them.

Jamal's work (done with Robert Olsson and Alexey Kuznetsov) has focused primarily on the first two problems. The first thing that has been done is to provide a mechanism to tell drivers that the networking load is high. The drivers should then tell their interfaces to cut back on interrupts. After all, when a heavy stream of packets is coming in, there will always be a few of them waiting in the DMA buffers, and the interrupts carry little new information.

When interrupts are off, the networking code will instead poll drivers when it is ready to accept new packets. Each interface has a quota stating how many packets will be accepted. If the traffic is heavy, it is entirely likely that the DMA rings for one or more drivers will overflow, since the kernel is not polling often enough. Once that happens, packets will be dropped by the interface itself (it has, after all, no place to put them). Thus the kernel need not process them at all, and they do not even cross the I/O bus. The end result is an order of magnitude increase in the number of packets a Linux system can route.
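A simplified sketch of the receive scheme described above: when load is high the receive interrupt is turned off and the stack polls the interface, taking at most a quota of packets per pass, until the DMA ring is drained. The names and structure here are invented for illustration; they are not the actual driver API changes presented at the summit.

    /*
     * Sketch of the interrupt-mitigation / polling scheme.  Under heavy
     * load the receive interrupt is turned off and the stack polls the
     * driver, taking at most 'quota' packets per pass.  All names are
     * invented for illustration.
     */
    #include <stdio.h>

    struct net_interface {
        int ring_packets;     /* packets waiting in the DMA ring */
        int quota;            /* max packets accepted per poll */
        int irq_enabled;
    };

    /* interrupt handler: under load, switch to polled mode */
    static void rx_interrupt(struct net_interface *nif)
    {
        nif->irq_enabled = 0;   /* cut back on interrupts; a real driver
                                   would also ask the stack to poll it */
    }

    /* called by the stack when it is ready to accept new packets */
    static int poll_interface(struct net_interface *nif)
    {
        int done = 0;

        while (nif->ring_packets > 0 && done < nif->quota) {
            nif->ring_packets--;      /* hand one packet to the stack */
            done++;
        }

        if (nif->ring_packets == 0)
            nif->irq_enabled = 1;     /* ring drained: interrupts back on */

        return done;
    }

    int main(void)
    {
        struct net_interface eth = { .ring_packets = 300, .quota = 64,
                                     .irq_enabled = 1 };
        int passes = 0;

        rx_interrupt(&eth);
        while (!eth.irq_enabled) {    /* stack keeps polling while busy */
            poll_interface(&eth);
            passes++;
        }
        printf("drained ring in %d polling passes\n", passes);
        return 0;
    }

Anything beyond what the ring can hold is dropped by the hardware itself, before it consumes bus bandwidth or CPU time.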
Hot plug devices

A common theme with modern hardware is the ability to attach and detach devices while the system is running - hot plugging. The kernel needs to be able to deal with this kind of environment; USB maintainer Johannes Erdfelt gave a presentation on how that is being done. How, for example, should a new device in the system be named? Device attributes - permissions, owner, configuration - should be preserved across plugging events when possible. Selecting the proper driver for a new device can be a challenge - sometimes there's more than one to choose from. Sometimes user space programs need to know about device events; the current scheme is to run a script, /sbin/hotplug, for every device event. Linus likes this approach, since it makes it easy to do the right thing most of the time.

Much of the time, however, was spent discussing the naming issue, and the associated issue ...
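For illustration, here is a toy stand-in for a /sbin/hotplug handler. The calling conventions assumed here - the subsystem name as the first argument and details such as an ACTION variable in the environment - are assumptions made for the sketch, not a definitive description of the interface.

    /*
     * Toy stand-in for a /sbin/hotplug handler.  The kernel runs the
     * program for every device event; the conventions assumed here
     * (subsystem as argv[1], ACTION in the environment) are
     * illustrative only.
     */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        const char *subsystem = (argc > 1) ? argv[1] : "unknown";
        const char *action    = getenv("ACTION");  /* e.g. "add", "remove" */

        /* A real handler would load a driver module, fix device-node
         * permissions, or notify interested user-space programs here. */
        fprintf(stderr, "hotplug event: subsystem=%s action=%s\n",
                subsystem, action ? action : "(none)");

        return EXIT_SUCCESS;
    }

Keeping the policy in a user-space script like this is what makes the approach attractive: the kernel only reports the event, and the "right thing" can be decided outside it.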