[HURD] Disk I/O Performance Tuning

Maksym Planeta

Abstract

The most obvious reason the Hurd feels slow compared to mainstream systems like GNU/Linux is poor I/O performance, in particular very slow hard disk access. This slowness comes from missing or badly implemented common optimization techniques. One of them is clustered page reading. The technique is based on the observation that a program rarely reads just one page of a file. To minimize disk head movement, several consecutive pages are therefore read at once, even though only one page was requested.

Additional Information

Full name: Maksym Planeta

Contact information:

email, Jabber ID: mcsim.planeta(at)gmail.com

IRC nick: mcsim

Phone number: +380623375636

General information

Personal information

I'm a 20-year-old System Programming student at Donetsk National Technical University (Ukraine). This is my last year before the bachelor's degree, and I want to continue studying by applying to a master's program. I'm interested in operating system development, specifically microkernel-based systems and especially their virtual memory organization.

Motivations

I chose this project idea because I'm interested in the virtual memory organisation of microkernel operating systems. The most appealing thing for me is the feeling of low-level programming for a microkernel ;)

Experience

I started studying programming about 7 years ago at school. My first language was Pascal. Now I know two languages: C and Python. I think I know C quite well, whereas my Python level is somewhere between intermediate and bad. I think I got the biggest part of my programming experience from programming for the HURD, because this kind of experience is mostly real-world. I worked on two projects, which are described in the next section. These projects did not involve design work; they required only debugging skills, which I believe I have gained.

Involvement in other projects

The HURD is the first project in which I have been actively involved. I met the HURD a year ago, when I tried to take part in GSoC. Although I wasn't successful, I stayed in the community and started the project that I had planned to do for GSoC. Then in autumn I started fixing bugs in tmpfs and defpager.

Hurd description

The HURD is a multiserver operating system that uses the Mach microkernel. The term "multiserver" means that system and user tasks are implemented by a set of servers. These servers communicate with each other using the IPC mechanism and the virtual memory subsystem, both of which are implemented by the microkernel. The HURD also provides a set of libraries that simplify the implementation of new servers, among them libdiskfs, libpager, libstoreio and others.

The main goal of the HURD is to provide a secure and flexible operating system. Its interface also makes it possible, for instance, to access the contents of a directory inside an archive that is stored on an ISO image that is itself reached through FTP, as if it were a regular directory, just by stacking translators.

This is what makes me interested in the HURD, and I hope that in the future it will attract more developers and become ready for everyday use.

Project information

The best description of the paging mechanism in Mach was written by Neal Walfield in his paper "External pager mechanism" [1]. To be more specific, I'll describe how paging works, paying attention to possible readahead optimizations.

First of all, to access any data in an external store, the data has to be mapped into the process's virtual address space using the function vm_map. vm_map is very similar to mmap, but instead of a file descriptor a memory object name has to be specified (in short, a memory object name is an integer just like a file descriptor and plays much the same role). But mapping does not yet mean that the actual data is loaded into memory and can be accessed immediately. On the contrary, the pages that represent the mapped object are not yet backed by physical pages, which is why a page fault occurs when they are accessed.
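The analogy with mmap can be illustrated with a minimal POSIX sketch. This is not Mach code: it uses mmap on a temporary file rather than vm_map on a memory object, and the function name first_byte_via_mapping and the temporary file path are made up for the illustration. It only shows that establishing the mapping and touching the data are two separate steps.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write `text` to a temporary file, map the file read-only, and return
   the first byte as seen through the mapping (or -1 on any error).
   Hypothetical helper, for illustration only. */
int first_byte_via_mapping(const char *text)
{
    size_t len = strlen(text);
    assert(len > 0);

    char path[] = "/tmp/readahead-sketch-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);  /* the file stays alive until fd is closed */

    int result = -1;
    if (write(fd, text, len) == (ssize_t)len) {
        /* Step 1: the mapping only sets up the address range. */
        unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            /* Step 2: the first access faults, and the kernel has to
               provide the page before the read can complete. */
            result = p[0];
            munmap(p, len);
        }
    }
    close(fd);
    return result;
}
```

Calling first_byte_via_mapping("hurd") returns the character 'h'. On Mach the second step is where vm_fault and the pager come into play.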

Control then reaches the architecture-specific page fault handler in the kernel. This handler determines the memory map of the thread that caused the page fault and calls the function vm_fault. The goal of this function is to find the memory object for the faulting address and ask its pager for the appropriate data. The pager acts as the backing store for the memory object. It is involved not only when the memory object needs data, but also when data has to be evicted to the backing store to free physical memory. Any server can act as a pager, for example the ext2fs translator; it just has to implement the appropriate interface. There are two types of memory. One is called "anonymous"; its pager is the default pager, which stores data in the swap partition. The other type is "managed"; its pager can be any pager except the default one.

When a page fault is resolved, the pager is invoked through the function memory_object_data_request, which asks the pager for the necessary data. This function is called from vm_fault_page. In this project I'm going to change vm_fault_page so that it asks the pager not for a single page but for a specified number of pages. Examples of clustered page reading can be found in the OSF Mach code [2] and in KAM's patch [3].
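One simple way to choose which pages to request is to align the request to a fixed-size cluster that contains the faulting page. The sketch below shows only this arithmetic. It is not taken from OSF Mach or from KAM's patch, and the names cluster_for_fault, struct cluster_range and SKETCH_PAGE_SIZE, as well as the 4096-byte page size, are assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size, illustration only */

struct cluster_range {
    size_t offset;  /* byte offset of the first page to request */
    size_t length;  /* number of bytes to request */
};

/* Return the cluster-aligned range that contains fault_offset,
   clamped to the page-rounded end of the object. */
struct cluster_range
cluster_for_fault(size_t fault_offset, size_t object_size, size_t cluster_pages)
{
    assert(cluster_pages > 0);
    assert(fault_offset < object_size);

    size_t cluster_bytes = cluster_pages * SKETCH_PAGE_SIZE;
    size_t start = fault_offset - fault_offset % cluster_bytes;
    size_t end = start + cluster_bytes;

    /* Round the object size up to a whole number of pages. */
    size_t object_end = (object_size + SKETCH_PAGE_SIZE - 1)
                        / SKETCH_PAGE_SIZE * SKETCH_PAGE_SIZE;
    if (end > object_end)
        end = object_end;

    struct cluster_range range = { start, end - start };
    return range;
}
```

For a fault inside page 5 of a large object with a cluster of 4 pages, this yields a request for pages 4 to 7. Near the end of the object the request shrinks, so the pager is never asked for pages that do not exist.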

The function m_o_data_request (m_o_ is short for memory_object_; in fact this is an RPC) invokes the server, which returns the data to the kernel via the RPC m_o_data_supply. After that, the kernel installs the supplied data in the virtual address space of the process that generated the page fault. Finally the task can be resumed. Often these RPCs aren't used directly in server code; rather, they are hidden behind code from the libpager library.

Paging data out is a similar process: when the kernel decides to evict data, it calls m_o_data_return on the appropriate pager so that memory can be freed. The pager is free to store the data in the backing store or just drop it. In the end, though, this data has to be deallocated with the function vm_deallocate.

From the user's point of view, the project will consist of implementing the function madvise. It is a libc function used to suggest a read ahead strategy, and currently it always returns an error code. Other implementation details are still under discussion. Most likely the major portion of the changes will be made in the function vm_fault_page. Some internal kernel data structures will also be extended to support read ahead. To make read ahead configurable, the kernel API will be extended. One possible way of extending the API is to add a new function.

A possible prototype for this function is the following:

vm_advise (vm_task_t target_task, vm_address_t address, vm_size_t length, vm_size_t chunk_size, integer_t flags)

Here target_task is the task to be affected. The starting address is address. The length of the memory range to which the advice applies is length. The parameter chunk_size determines the amount of memory that should be read ahead. Flags are needed to specify additional attributes. For example, madvise has the advice value MADV_SEQUENTIAL. Here is its description from the man page: "Expect page references in sequential order. (Hence, pages in the given range can be aggressively read ahead, and may be freed soon after they are accessed.)". So there should be a special flag that suggests not only reading ahead several pages but also evicting the same number of pages that lie behind the address of the page fault.
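How chunk_size and such a flag could be interpreted at fault time can be sketched as follows. Everything here is hypothetical: the flag SKETCH_ADV_SEQUENTIAL, struct advise_plan and plan_for_fault are not existing Mach interfaces, and the sketch assumes that the read range starts at the faulting page and that chunk_size is a multiple of an assumed 4096-byte page.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u     /* assumed page size */
#define SKETCH_ADV_SEQUENTIAL 0x1  /* hypothetical flag, not a Mach constant */

struct advise_plan {
    size_t read_start;   /* first byte to request from the pager */
    size_t read_len;     /* bytes to request, faulting page included */
    size_t evict_start;  /* first byte that may be released */
    size_t evict_len;    /* bytes that may be released (0 means none) */
};

/* Decide what to read and what to release for a fault at fault_addr
   inside the advised region [region_start, region_start + region_len). */
struct advise_plan
plan_for_fault(size_t region_start, size_t region_len,
               size_t fault_addr, size_t chunk_size, int flags)
{
    assert(region_start % SKETCH_PAGE_SIZE == 0);
    assert(chunk_size > 0 && chunk_size % SKETCH_PAGE_SIZE == 0);
    assert(fault_addr >= region_start);
    assert(fault_addr - region_start < region_len);

    size_t region_end = region_start + region_len;
    size_t page = fault_addr - fault_addr % SKETCH_PAGE_SIZE;

    struct advise_plan plan = { page, chunk_size, page, 0 };

    /* Do not read past the end of the advised region. */
    if (region_end - page < chunk_size)
        plan.read_len = region_end - page;

    /* Sequential access: the chunk just behind the fault is unlikely
       to be touched again, so it becomes a candidate for eviction. */
    if (flags & SKETCH_ADV_SEQUENTIAL) {
        size_t behind = page - region_start;
        plan.evict_len = behind < chunk_size ? behind : chunk_size;
        plan.evict_start = page - plan.evict_len;
    }
    return plan;
}
```

With a chunk of 4 pages and a fault in page 10, the plan reads pages 10 to 13 and, when the sequential flag is set, offers pages 6 to 9 for eviction. Without the flag nothing is evicted.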

The libpager interface should also be changed. Currently the functions pager_{read,write}_page are called in a loop, and the translator has to handle pages one by one, regardless of whether it could handle them in bigger chunks. For compatibility reasons the interface should be supplemented with new functions rather than having the old functions replaced.

Since the libpager interface will change, translators that implement their own backing store will be able to use the new interface. As part of the GSoC project I'm planning to do this only for ext2fs, because it is the most critical area for I/O improvement; other translators can be moved to the new interface later. By upgrading ext2fs I don't mean implementing an I/O scheduler, because that is quite a difficult task, but only replacing pager_{read,write}_page with new functions, like pager_{read,write}_pages, that do the same work for several pages in one call. As long as the chunk size and the fragmentation are not very large, the absence of a scheduler should not matter much.
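The difference between the per-page loop and a multi-page call can be sketched with a fake in-memory store that counts how many separate read requests it receives. The functions sketch_read_page and sketch_read_pages are simplified stand-ins invented for this illustration; they do not have the real libpager signatures, and the final interface is still to be designed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumed page size */
#define SKETCH_STORE_PAGES 8u    /* size of the fake backing store */

struct sketch_store {
    unsigned char data[SKETCH_STORE_PAGES * SKETCH_PAGE_SIZE];
    unsigned requests;  /* number of separate reads issued to the store */
};

/* One read request against the fake store. */
static int store_read(struct sketch_store *s, size_t offset, size_t len,
                      unsigned char *buf)
{
    if (offset > sizeof s->data || len > sizeof s->data - offset)
        return -1;
    memcpy(buf, s->data + offset, len);
    s->requests++;
    return 0;
}

/* Old style: exactly one page per call. */
int sketch_read_page(struct sketch_store *s, size_t page, unsigned char *buf)
{
    return store_read(s, page * SKETCH_PAGE_SIZE, SKETCH_PAGE_SIZE, buf);
}

/* New style: a contiguous run of pages in one call. */
int sketch_read_pages(struct sketch_store *s, size_t first, size_t npages,
                      unsigned char *buf)
{
    return store_read(s, first * SKETCH_PAGE_SIZE,
                      npages * SKETCH_PAGE_SIZE, buf);
}

/* Read the first npages pages either page by page or in one call and
   return how many store requests that took (-1 on error). */
int sketch_count_requests(size_t npages, int clustered)
{
    static struct sketch_store store;  /* zero-filled fake disk */
    static unsigned char buf[SKETCH_STORE_PAGES * SKETCH_PAGE_SIZE];

    assert(npages <= SKETCH_STORE_PAGES);
    store.requests = 0;

    if (clustered) {
        if (sketch_read_pages(&store, 0, npages, buf) != 0)
            return -1;
    } else {
        for (size_t i = 0; i < npages; i++)
            if (sketch_read_page(&store, i, buf + i * SKETCH_PAGE_SIZE) != 0)
                return -1;
    }
    return (int)store.requests;
}
```

Reading 4 pages takes 4 store requests with the per-page loop and 1 request with the multi-page call. On a real disk each request may cost a seek, which is where the expected gain comes from.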

Since the API will change, the documentation for both hurd and gnumach should certainly be updated.

Schedule

This schedule shows my intentions; the actual one will be shown on my user page.

  1. Before May 5. Study the current code base, dig through the IRC logs, look through KAM's patch. Discuss the project with possible mentors. Study what has to be changed in the ext2 translator. During this period I have classes, so most of my time will be spent on university.
  2. May 5 - May 27. I have exams during this period, so I will not be able to work on GSoC.
  3. May 27 - June 7. Implement a simple read ahead technique. During this period I expect to write code that makes it possible to read more than one page at once.
  4. June 8 - June 23. Implement other techniques.
  5. June 24 - July 2. Implement madvise.
  6. July 2 - July 13. Debug madvise.
  7. Interim period. By this time read ahead in the kernel and madvise should be implemented.
  8. July 14 - July 20. Implement clustered page reading in libpager.
  9. July 23 - July 26. Start implementing clustered page reading in ext2fs.
  10. July 27 - August 1. I'll have exams during this period.
  11. August 2 - August 7. Finish clustered page reading in ext2fs.
  12. August 8 - August 13. Rearrange commits, write documentation, measure performance.
  13. August 13 - August 20. One-week buffer.

Preparation stage

I have studied how Mach's VM subsystem works both from the kernel side (when I was working on porting Richard Braun's memory allocator) and from the user side (when I was fixing bugs in tmpfs and defpager). Although I know how Mach's VM subsystem works in general, there are gaps in my knowledge. To fill them, I think I have to study carefully how vm_maps are organized and how the kernel communicates with the pager, because this is the core of the project. As a source of inspiration I'll also look through the OSF Mach code, KAM's patch and possibly the Linux kernel.

I think the difficulty of this project lies less in the coding than in self-organisation, good time management and proper communication with the mentor. So the things I have to learn from this point of view are regular progress reporting and keeping to the schedule.

AFAIK, it has recently become possible to debug the kernel using gdb+qemu. I think this is great, because it should be a much more comfortable way to debug the kernel than the one I was using before. I haven't tested this feature yet, so I have to update my virtual machine image and try it before the coding stage starts, because debugging takes a lot of time.

Conclusion

I will show my actual progress on my user page at the Hurd site (see the link in the additional information).

Links

  1. http://www.bddebian.com:8888/~hurd-web/microkernel/mach/external_pager_mechanism/
  2. http://cvs.mklinux.org/cgi-bin/cvsweb/osfmk/src/mach_kernel/vm/vm_fault.c?rev=1.1.1.1
  3. http://lists.gnu.org/archive/html/bug-hurd/2010-06/txtG4H5sAx52G.txt

Code samples