by Dominique Heger & Philip Carinhas, Fortuitous Technologies https://fortuitous.com
Over the last several years, the Linux operating system has gained acceptance as the operating system of choice in many scientific and commercial environments. The performance of Linux has improved significantly compared to traditional UNIX flavors, particularly on smaller SMP systems with up to 4 processors. Recently, there has been an increased emphasis on Linux performance in mid-range to high-end enterprise-class environments, consisting of SMP systems that are configured with 64 CPUs. Scalability and performance of Linux 2.6 are therefore paramount for applications that must scale to high CPU counts on large systems. This article highlights some of the performance and scalability improvements of the Linux 2.6 kernel.
Most modern computer architectures support more than one memory page size. To illustrate, the IA-32 architecture supports either 4KB or 4MB pages. The Linux 2.4 kernel used large pages only for mapping the kernel image. In general, large page usage is primarily intended to provide performance improvements for high performance computing applications, as well as database applications that have large working sets. Any memory-access-intensive application that uses large amounts of virtual memory may obtain performance improvements by using large pages. Linux 2.6 can utilize 2MB or 4MB large pages, AIX uses 16MB large pages, whereas Solaris large pages are 4MB in size. The large page performance improvements are attributable to reduced translation lookaside buffer (TLB) misses: each TLB entry maps a larger region of memory, so the same number of entries covers a larger share of the working set. Large pages further improve memory prefetching by eliminating the need to restart prefetch operations on 4KB boundaries.
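The effect of page size on TLB coverage can be illustrated with some simple arithmetic. The following is a minimal sketch, not a measurement: the 1GB working set and the 64-entry TLB are hypothetical values chosen for illustration, and the helper names are not taken from any kernel interface.

```python
KB = 1024
MB = 1024 * KB

def pages_needed(working_set_bytes, page_size_bytes):
    """Number of pages (and therefore translations) needed to map a working set."""
    return -(-working_set_bytes // page_size_bytes)  # ceiling division

def tlb_reach(entries, page_size_bytes):
    """Bytes of memory that can be addressed without taking a TLB miss."""
    return entries * page_size_bytes

working_set = 1024 * MB  # hypothetical 1GB working set
tlb_entries = 64         # hypothetical TLB size

small_pages = pages_needed(working_set, 4 * KB)
large_pages = pages_needed(working_set, 4 * MB)
small_reach = tlb_reach(tlb_entries, 4 * KB)
large_reach = tlb_reach(tlb_entries, 4 * MB)

print(f"4KB pages: {small_pages} translations, TLB reach {small_reach // KB} KB")
print(f"4MB pages: {large_pages} translations, TLB reach {large_reach // MB} MB")
```

Under these assumptions, the same working set needs 262144 translations with 4KB pages but only 256 with 4MB pages, which is why fewer TLB misses can be expected.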
The Linux 2.6 scheduler is a multi-queue scheduler that assigns a run-queue to each CPU, promoting a local scheduling approach. The previous incarnation of the Linux scheduler used the concept of goodness to determine which thread to execute next. All runnable tasks were kept on a single run-queue, implemented as a linked list of threads. In Linux 2.6, the single run-queue lock was replaced with a per-CPU lock, ensuring better scalability on SMP systems. The new per-CPU run-queue scheme decomposes the run-queue into a number of buckets (in priority order) and uses a bitmap to identify the buckets that hold runnable tasks. Locating the next task to execute requires reading the bitmap to identify the first bucket with runnable tasks, and then choosing the first task in that bucket's run-queue.
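The bucket-and-bitmap lookup described above can be sketched as follows. This is an illustrative model rather than kernel code: the class and method names are hypothetical, the 140 priority levels are an assumption, and a lower index is assumed to mean a higher priority.

```python
from collections import deque

NUM_PRIORITIES = 140  # assumption: number of priority buckets

class RunQueue:
    """Hypothetical per-CPU run-queue: one FIFO bucket per priority plus a bitmap."""

    def __init__(self):
        self.bitmap = 0  # bit i is set when bucket i holds at least one task
        self.buckets = [deque() for _ in range(NUM_PRIORITIES)]

    def enqueue(self, task, prio):
        self.buckets[prio].append(task)
        self.bitmap |= 1 << prio

    def pick_next(self):
        if self.bitmap == 0:
            return None  # nothing runnable on this CPU
        # Isolate the lowest set bit to find the first non-empty bucket.
        prio = (self.bitmap & -self.bitmap).bit_length() - 1
        task = self.buckets[prio].popleft()
        if not self.buckets[prio]:
            self.bitmap &= ~(1 << prio)  # bucket drained, clear its bit
        return task

rq = RunQueue()
rq.enqueue("editor", 120)
rq.enqueue("audio", 100)
rq.enqueue("video", 100)
order = [rq.pick_next() for _ in range(4)]
print(order)
```

The cost of picking the next task does not depend on how many tasks are runnable, only on finding the first set bit, which is the property the per-CPU design relies on.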
The Linux 2.6 environment provides a Non-Uniform Memory Access (NUMA) aware extension to the new scheduler. The focus is on increasing the likelihood that memory references are local rather than remote on NUMA systems. The NUMA-aware extension augments the existing CPU scheduler implementation via a node-balancing framework.
Next to the preemptible kernel support in Linux 2.6, the Native POSIX Threading Library (NPTL) represents the next generation POSIX threading solution for Linux, and hence has received a lot of attention from the performance community. The new threading implementation in Linux 2.6 has several major advantages, such as in-kernel POSIX signal handling. In a well-designed multi-threaded application domain, fast user space synchronization (futex) can be utilized. In contrast to Linux 2.4, the futex framework avoids a scheduling collapse during heavy lock contention among different threads.
The I/O scheduler in Linux is the interface between the generic block layer and the low-level device drivers. The block layer provides functions that are utilized by file systems and the virtual memory manager to submit I/O requests to block devices. As prioritized resource management seeks to regulate the use of a disk subsystem by an application, the I/O scheduler is considered an important kernel component in the I/O path.
It is further possible to tune disk usage in the kernel layers above and below the I/O scheduler. One option is to adjust the I/O pattern generated by the file system or the virtual memory manager (VMM). Another option is to adjust the way specific device drivers or device controllers handle I/O requests. Further, a new read-ahead algorithm designed and implemented by Dominique Heger and Steve Pratt for Linux 2.6 significantly boosts read I/O throughput for all the I/O schedulers discussed below.
The Deadline I/O scheduler available in Linux 2.6 incorporates a per-request expiration based approach, and operates on five I/O queues. The basic idea behind the implementation is to aggressively reorder requests to improve I/O performance while simultaneously ensuring that no I/O request is starved. More specifically, the scheduler introduces the notion of a per-request deadline, which is used to assign a higher preference to read requests than to write requests. In short, all read requests are to be satisfied within a specified time period, whereas write requests do not have any specific deadlines associated with them. When the block device driver is ready to launch another disk I/O request, the core algorithm of the deadline scheduler is invoked. In simplified form, the first action is to check whether there are I/O requests waiting in the dispatch queue; if so, there is no additional decision to be made on what to execute next. Otherwise, it is necessary to move a new set of I/O requests to the dispatch queue.
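The dispatch decision described above can be sketched in a few lines. This is a deliberately reduced model and not the kernel implementation: it uses three queues instead of five, abstract time units, and hypothetical names and parameters (`DeadlineSketch`, `read_expire`, `batch`). Following the description in this article, only reads carry a deadline.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass(eq=False)  # identity comparison, so equal-looking requests stay distinct
class Request:
    sector: int
    is_read: bool
    deadline: Optional[int]  # only reads carry a deadline in this sketch

class DeadlineSketch:
    def __init__(self, read_expire=5, batch=2):
        self.read_expire = read_expire  # hypothetical time units
        self.batch = batch              # requests moved per refill
        self.pending = []               # all waiting requests
        self.read_fifo = deque()        # waiting reads in arrival order
        self.dispatch = deque()         # requests ready for the driver

    def submit(self, sector, is_read, now):
        deadline = now + self.read_expire if is_read else None
        req = Request(sector, is_read, deadline)
        self.pending.append(req)
        if is_read:
            self.read_fifo.append(req)

    def next_request(self, now):
        # If the dispatch queue holds work, there is no decision to make.
        if not self.dispatch:
            self._refill(now)
        return self.dispatch.popleft() if self.dispatch else None

    def _refill(self, now):
        if self.read_fifo and self.read_fifo[0].deadline <= now:
            # The oldest read has expired: serve reads in arrival order.
            chosen = list(self.read_fifo)[:self.batch]
        else:
            # No expired read: pick requests in sector order to limit seeks.
            chosen = sorted(self.pending, key=lambda r: r.sector)[:self.batch]
        for req in chosen:
            self.pending.remove(req)
            if req.is_read:
                self.read_fifo.remove(req)
            self.dispatch.append(req)

sched = DeadlineSketch(read_expire=5, batch=2)
sched.submit(900, True, now=0)   # read, deadline at time 5
sched.submit(100, False, now=1)
sched.submit(200, False, now=1)
first = sched.next_request(now=2).sector
second = sched.next_request(now=3).sector
sched.submit(50, False, now=4)
third = sched.next_request(now=6).sector  # read deadline has passed by now
fourth = sched.next_request(now=7).sector
print(first, second, third, fourth)
```

In this run the read at sector 900 is initially bypassed in favor of sector order, but once its deadline has passed it is dispatched ahead of the write at sector 50, which is the starvation protection the text describes.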
The Anticipatory I/O scheduler's design attempts to reduce the per-thread read response time. It introduces a controlled delay component into the dispatching equation. The delay is invoked on any new request to the device driver, thereby allowing a thread that has just finished its I/O request to submit a new request. Based on locality, this increases the chance that the next request dispatched will require only a short seek. The tradeoff between reduced seeks and decreased disk utilization (due to the additional delay factor in dispatching a request) is managed by a cost-benefit calculation.
The Completely Fair Queuing (CFQ) I/O scheduler can be considered an extension of the better known stochastic fair queuing (SFQ) scheduler implementation. The focus of both implementations is on the fair allocation of I/O bandwidth among all the initiators of I/O requests. An SFQ-based scheduler design was initially proposed for some network subsystems. The goal is to distribute the available I/O bandwidth as equally as possible among the I/O requests.
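The notion of fair allocation among initiators can be illustrated with a simple round-robin over per-initiator queues. This is a generic fair queuing sketch and not the CFQ implementation: the class name, the `quantum` parameter, and the use of a process name as the initiator key are assumptions made for illustration.

```python
from collections import deque

class FairQueueSketch:
    def __init__(self, quantum=1):
        self.quantum = quantum  # requests served per initiator per round
        self.queues = {}        # initiator -> deque of its pending requests

    def submit(self, initiator, request):
        self.queues.setdefault(initiator, deque()).append(request)

    def dispatch_round(self):
        """Serve up to `quantum` requests from each initiator in turn."""
        served = []
        for initiator in list(self.queues):
            queue = self.queues[initiator]
            for _ in range(min(self.quantum, len(queue))):
                served.append((initiator, queue.popleft()))
            if not queue:
                del self.queues[initiator]  # drop drained queues
        return served

fq = FairQueueSketch(quantum=1)
for req in ("a1", "a2", "a3"):
    fq.submit("A", req)      # initiator A floods the queue
fq.submit("B", "b1")         # initiator B issues a single request
round1 = fq.dispatch_round()
round2 = fq.dispatch_round()
print(round1, round2)
```

Although initiator A submitted three requests before B submitted one, B's request is served in the first round, so a heavy initiator cannot monopolize the device.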
The Linux 2.6 Noop I/O scheduler can be considered a minimal I/O scheduler that performs basic merging and sorting. The noop scheduler is mainly used for non-disk-based block devices such as memory devices, as well as for specialized software or hardware environments that incorporate their own I/O scheduling and caching functionality, and therefore require only minimal assistance from the kernel. For large-scale I/O configurations that incorporate RAID controllers and many disk drives, the noop scheduler therefore has the potential to outperform the other three I/O schedulers.
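The basic merging mentioned above amounts to combining requests that cover contiguous sectors into one larger request. The following minimal sketch shows the idea; the function name and the `(start_sector, sector_count)` tuple representation are hypothetical and do not reflect kernel data structures.

```python
def merge_adjacent(requests):
    """Sort (start_sector, sector_count) requests and merge contiguous ones."""
    merged = []
    for start, count in sorted(requests):
        if merged and merged[-1][0] + merged[-1][1] == start:
            # This request begins exactly where the previous one ends.
            prev_start, prev_count = merged[-1]
            merged[-1] = (prev_start, prev_count + count)
        else:
            merged.append((start, count))
    return merged

result = merge_adjacent([(8, 8), (0, 8), (32, 8)])
print(result)
```

Here the requests for sectors 0-7 and 8-15 collapse into a single 16-sector request, while the request starting at sector 32 remains separate.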
The Linux 2.6 kernel represents another evolutionary step forward, and builds upon its predecessors to boost application performance through enhancements to the VM subsystem, the CPU scheduler, and the I/O scheduler. In addition, this new version of the kernel delivers important functional enhancements in security, scalability, and networking.
This outline highlights the major performance features in Linux 2.6. Please visit the Fortuitous Website https://Fortuitous.com for the full article on Linux 2.6 Performance Enhancements. Fortuitous Technologies provides high quality IT services, focusing on performance tuning, capacity planning, and training.
© 2012-07-21 Dominique Heger & Philip Carinhas