The OpenNET Project / Index page


Linux VM (linux vm memory process)


_ RU.OS.CMP (2:5077/15.22) _________________________________________ RU.OS.CMP _
 From : Vadim Kolontsov 2:5020/400 17 Jan 99 22:28:02
 Subj : Linux VM
________________________________________________________________________________
From: [email protected] (Vadim Kolontsov)
Reply-To: [email protected]

Hi,

for a better-argued comparison... my apologies if this text has already appeared here more than once. The original is on the author's site, http://apollo.backplane.com. The document is immediately followed by John Dyson's comment on it. The previous message contains the FreeBSD VM system overview.

V.

-------------------------------------------------------------------------------
Date: Wed, 13 Jan 1999 23:20:22 -0800 (PST)
From: Matthew Dillon <[email protected]>
Message-Id: <[email protected]>
Cc: [email protected]
Subject: Review and report of linux kernel VM

General Overview

I've been looking at the linux kernel VM - mainly just to see what they've changed since I last looked at it. It's quite interesting... not bad at all though it is definitely a bit more memory-resource-intensive than FreeBSD's. However, it needs a *lot* of work when it comes to freeing up pages. I apologize in advance for any mistakes I've made!

Basically, the linux kernel uses persistent hardware-level page tables in a mostly platform-independent fashion. The function of the persistent page tables is roughly equivalent to the function of FreeBSD's vm_object's. That is, the page tables are used to manage sharing and copy-on-write functions for VM objects.

For example, when a process fork()'s, pages are duplicated literally by copying pte's. Writeable MAP_PRIVATE pages are write-protected and marked for copy-on-write. A global resident-page array is used to keep track of shared reference counts.

Swapped-out pages are also represented by pte's and also marked for copy-on-write as appropriate. The swap block is stored in the PFN area of the pte (as far as I can tell).
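
As a rough illustration of the fork() behaviour described above, here is a minimal Python sketch. It is not Linux kernel code: the names Pte, Memory, fork and write are hypothetical, a page table is modelled as a plain dict, and swapped-out pages are left out entirely. It only shows the idea of duplicating pte's, write-protecting writable pages, and tracking sharing in a global per-frame reference count array.

```python
from dataclasses import dataclass


@dataclass
class Pte:
    """Hypothetical page table entry: a frame number plus two flags."""
    frame: int
    writable: bool
    cow: bool = False


class Memory:
    """Hypothetical physical memory with a global resident-page array."""

    def __init__(self, nframes):
        self.refcount = [0] * nframes   # shared reference count per frame
        self.data = [None] * nframes    # page contents (one value per frame)
        self.free = list(range(nframes))

    def alloc(self, value):
        frame = self.free.pop()
        self.refcount[frame] = 1
        self.data[frame] = value
        return frame


def fork(parent_table, mem):
    """Duplicate a page table by copying pte's, sharing every frame."""
    child = {}
    for vaddr, pte in parent_table.items():
        if pte.writable or pte.cow:
            # Writable private page: write-protect it and mark it COW.
            pte.writable = False
            pte.cow = True
        mem.refcount[pte.frame] += 1
        child[vaddr] = Pte(pte.frame, pte.writable, pte.cow)
    return child


def write(table, vaddr, value, mem):
    """Simulate a store, resolving a copy-on-write fault if needed."""
    pte = table[vaddr]
    if pte.cow:
        if mem.refcount[pte.frame] > 1:
            # Still shared: drop our reference and take a private copy.
            mem.refcount[pte.frame] -= 1
            pte.frame = mem.alloc(mem.data[pte.frame])
        # Sole owner now: the page can simply be made writable again.
        pte.cow = False
        pte.writable = True
    elif not pte.writable:
        raise PermissionError("write to read-only page")
    mem.data[pte.frame] = value
```

After fork() both tables point at the same frame with a reference count of 2; the first write from either side takes a private copy, and the last remaining owner just gets its write permission back without copying.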
The swap system keeps a separate shared reference count to manage swap usage. The overhead is around 3 bytes per swap page (whether it is in use or not), and another pte-sized (int usually) field when storing the swap block in the pagetable.

Linux cannot swap out its page tables, mainly due to the direct use of the page tables in handling VM object sharing.

In general terms, linux's VM system is much cleaner than FreeBSD's... and I mean a *whole lot* cleaner, but at the cost of eating some extra memory. It isn't a whole lot of extra memory - maybe a meg or two for a typical system managing a lot of processes, and much less for typical 'small' systems. They are able to completely avoid the vm_object stacking (and related complexity) that we do, and they are able to completely avoid most of the pmap complexity in FreeBSD as well.

Linux appears to implement a unified buffer cache. It's pretty straightforward except the object relationship is stored in the memory-map management structures in each process rather than in a vm_object type of structure.

Linux appears to map all of physical memory into KVM. This avoids FreeBSD's (struct buf) complexity at the cost of not being able to deal with huge-memory configurations. I'm not 100% sure of this, but it's my read of the code until someone corrects me.

Problems

Swap allocation is terrible. Linux uses a linear array which it scans looking for a free swap block. It does a relatively simple swap cluster cache, but eats the full linear scan if that fails, which can be terribly nasty. The swap clustering algorithm is a piece of crap, too -- once swap becomes fragmented, the linux swapper falls on its face. It does read-ahead based on the swapblk, which wouldn't be bad if it clustered writes by object or didn't have a fragmentation problem. As it stands, their read clustering is useless.

Swap deallocation is fast since they are using a simple reference count array.

File read-ahead is haphazard at best.
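
The swap allocation scheme described above can be sketched roughly as follows. This is a simplified Python model, not the Linux implementation: the SwapMap class, its probes counter and the cluster size of 4 are all assumptions made for illustration. It shows a per-block reference count array, a cheap "current cluster" fast path, and the fall-back to a full linear scan when the fast path fails.

```python
class SwapMap:
    """Hypothetical swap map: one reference count per swap block."""

    def __init__(self, nblocks, cluster=4):
        self.refs = [0] * nblocks   # 0 means the block is free
        self.cluster = cluster      # blocks handed out per cluster
        self.next = 0               # next candidate in the current cluster
        self.left = 0               # blocks remaining in the current cluster
        self.probes = 0             # entries examined by linear scans

    def _scan(self):
        """Slow path: linear scan of the whole array for a free block."""
        for block, count in enumerate(self.refs):
            self.probes += 1
            if count == 0:
                return block
        return None

    def alloc(self):
        """Return a free block number, or None if swap is full."""
        if (self.left > 0 and self.next < len(self.refs)
                and self.refs[self.next] == 0):
            # Fast path: the next block of the current cluster is free.
            block = self.next
        else:
            block = self._scan()
            if block is None:
                return None
            self.left = self.cluster    # start a new cluster here
        self.refs[block] = 1
        self.next = block + 1
        self.left -= 1
        return block

    def dup(self, block):
        """Share an allocated block (for example across fork())."""
        self.refs[block] += 1

    def free(self, block):
        """Deallocation is just a reference count decrement."""
        self.refs[block] -= 1
```

In this model, once the map is fragmented almost every allocation ends up in _scan(), which is the cost the text complains about, while free() stays constant-time.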
The paging queues (determining the age of the page and whether to free or clean it) need to be written... the algorithms being used are terrible.

* For the nominal page scan, it is using a one-hand clock algorithm. All I can say is: Oh my god! Are they nuts? That was abandoned a decade ago. The priority mechanism they've implemented is nearly useless.

* To locate pages to swap out, it takes a pass through the task list. Ostensibly it locates the task with the largest RSS to then try to swap pages out from, rather than selecting pages that are not in use. From my read of the code, it also botches this badly.

Linux does not appear to do any page coloring whatsoever, but it would not be hard to add it in.

Linux cannot swap out its page tables or page directories. Thus, idle tasks can eat a significant amount of memory. This isn't a big deal for most systems (small systems: no problem. Big systems: probably have lots of memory anyway). But mmap()'d files can create a significant burden if you have a lot of forked processes (news, sendmail, web server, etc...). Not only does Linux have to scan the page tables for all the processes mapping the file, whether or not they are actively using the page being checked for, but Linux's swapout algorithm scans page tables and, effectively, makes redundant scans of shared objects.

What FreeBSD can learn

Well, the main thing is that the Linux VM system is very, very clean compared to the FreeBSD implementation. Cleaning up FreeBSD's VM system complexity is what I've been concentrating on and will continue to concentrate on. However, part of the reason that FreeBSD's VM system is more complex is because it does not use the page tables to store reference information. Instead, it uses the vm_object and pmap modules. I actually like this feature of FreeBSD. A lot.
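
For reference, the one-hand clock scan criticized in the list of problems above can be sketched as follows. This is the textbook form of the algorithm in Python, not the Linux code; the Clock class and its method names are hypothetical, and Linux's priority mechanism is not modelled.

```python
class Clock:
    """Textbook one-hand clock (second chance) page replacement."""

    def __init__(self, nframes):
        self.referenced = [False] * nframes  # per-frame "recently used" bit
        self.hand = 0                        # the single clock hand

    def touch(self, frame):
        """Record an access to a frame (the hardware would set this bit)."""
        self.referenced[frame] = True

    def evict(self):
        """Sweep the hand until a frame with a clear bit is found."""
        while True:
            frame = self.hand
            self.hand = (self.hand + 1) % len(self.referenced)
            if self.referenced[frame]:
                # Recently used: clear the bit and give a second chance.
                self.referenced[frame] = False
            else:
                return frame
```

The single hand both clears reference bits and picks victims, so a page gets only one sweep of the hand to prove it is still in use, and the scan knows nothing about how old or how dirty a page is.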
The biggest thing we need to do to clean up our VM system is, basically, to completely rewrite the struct buf filesystem buffering mechanism to make it much, much less complex - basically it should only be used as placeholders for read and write ops and not used to cache block number mappings between the files and the VM system, nor should it be used to map pages into KVM. Separating out these three mechanisms into three different subsystems would simplify the code enormously, I think. For example, we could implement a simple vm_object KVM mapping mechanism using FreeBSD's existing vm_object stacking model to map portions of a vm_object (aka filesystem partition) into KVM.

Linux demarks interrupts from supervisor code much better than we do. If we move some of the more sophisticated operational capabilities out of our interrupt subsystem, we could get rid of most of the spl*() junk we currently have to do. This is a real sore spot in current FreeBSD code. Interrupts are just too complex. I'd also get rid of FreeBSD's intermediate 'software interrupt' layer, which is able to do even more complex things than hard interrupt code. The latency considerations just don't make any sense versus running pending software interrupts synchronously in tsleep(), prior to actually sleeping. We need to do this anyway (or move softints to kernel threads) to be able to take advantage of SMP mechanisms. The *only* thing our interrupts should be allowed to do is finish I/O on a page or use zalloc().

-Matt

Matthew Dillon <[email protected]>

-------------------------------------------------------------------------------
From: "John S. Dyson" <[email protected]>
Subject: Re: Review and report of linux kernel VM
Cc: [email protected], [email protected], [email protected], [email protected]

>
> In general terms, linux's VM system is much cleaner than FreeBSD's... and
> I mean a *whole lot* cleaner, but at the cost of eating some extra memory.
> It isn't a whole lot of extra memory - maybe a meg or two for a typical
> system managing a lot of processes, and much less for typical 'small'
> systems. They are able to completely avoid the vm_object stacking
> (and related complexity) that we do, and they are able to completely
> avoid most of the pmap complexity in FreeBSD as well.
>
IMO, the "cleanness" might be better described as "too simple."

>
> Linux appears to map all of physical memory into KVM. This avoids
> FreeBSD's (struct buf) complexity at the cost of not being able to
> deal with huge-memory configurations. I'm not 100% sure of this, but
> it's my read of the code until someone corrects me.
>
I suggest that we should get rid of the (struct buf) complexity by creating the concept of temporary kernel mappings. Such mappings are a limited resource, so the system doesn't have to map all of memory, yet they give a cleaner, more consistent scheme than the current one. The vm_page_t's at the end of the struct bufs were only a first step in that arena. There are about 5-10 more steps needed before it is really fully realized.

>
> * For the nominal page scan, it is using a one-hand clock algorithm.
>   All I can say is: Oh my god! Are they nuts? That was abandoned
>   a decade ago. The priority mechanism they've implemented is nearly
>   useless.
>
> * To locate pages to swap out, it takes a pass through the task list.
>   Ostensibly it locates the task with the largest RSS to then try to
>   swap pages out from, rather than selecting pages that are not in use.
>   From my read of the code, it also botches this badly.
>
Yep, and it has been very difficult for me not to "educate" them on the right way to do it. Frankly, their code works really well until it is overused. Given the Linux VM code, "overused" mostly means used at all. :-).

>
> Linux does not appear to do any page coloring whatsoever, but it would
> not be hard to add it in.
>
It wasn't hard to add to FreeBSD, but the coloring should be moved to a machine-dependent section of the codebase.

>
> Linux cannot swap out its page tables or page directories. Thus, idle
> tasks can eat a significant amount of memory. This isn't a big deal for
> most systems (small systems: no problem. Big systems: probably have lots
> of memory anyway). But mmap()'d files can create a significant burden
> if you have a lot of forked processes (news, sendmail, web server,
> etc...). Not only does Linux have to scan the page tables for all the
> processes mapping the file, whether or not they are actively using the
> page being checked for, but Linux's swapout algorithm scans page tables
> and, effectively, makes redundant scans of shared objects.
>
The key here is to NEVER swap out page tables or page directories. One should FREE them when it is possible. The notion of a dirty page table, or one that is on disk, is meaningless. The FreeBSD code releases page table pages when they are empty. Page directories should be freeable when all descendants are no longer mapped (including page tables). Of course, in that evaluation, the kernel mappings should be ignored, and when "swapping" page directories in, they are rebuilt from the kernel requirements. (FreeBSD doesn't release page directories yet, unless a process exits.)

> What FreeBSD can learn
>
> Well, the main thing is that the Linux VM system is very, very clean
> compared to the FreeBSD implementation. Cleaning up FreeBSD's VM system
> complexity is what I've been concentrating on and will continue to
> concentrate on. However, part of the reason that FreeBSD's VM system
> is more complex is because it does not use the page tables to store
> reference information. Instead, it uses the vm_object and pmap modules.
> I actually like this feature of FreeBSD. A lot.
>
IMO, the pmap level is super flexible, but also there is too much stratification between the pmap code and the VM code.
Layering is good for reference implementations, but also adds overhead. It would be "nice" for the upper-level VM code to be able to simply modify page table entries sometimes, wouldn't it? One thing that I did in the FreeBSD code was to minimize the transitions between the pmap and VM layers. Perhaps that should be better defined, but if you wanted to rework the interfaces, you might see a major cleanup.

>
> The biggest thing we need to do to clean up our VM system is, basically,
> to completely rewrite the struct buf filesystem buffering mechanism to
> make it much, much less complex - basically it should only be used as
> placeholders for read and write ops and not used to cache block number
> mappings between the files and the VM system, nor should it be used to
> map pages into KVM. Separating out these three mechanisms into three
> different subsystems would simplify the code enormously, I think. For
> example, we could implement a simple vm_object KVM mapping mechanism
> using FreeBSD's existing vm_object stacking model to map portions of a
> vm_object (aka filesystem partition) into KVM.
>
I agree. Take a look at using a separate kernel mapping concept (there might even be some of that work still lying around somewhere). You can then change buffers into what they should really be: I/O requests. The kernel mappings can be a dynamically allocated resource that is cached LRU or somesuch. Part of the technology to support them ended up being pmap_kenter/pmap_kremove and pmap_qenter/pmap_qremove. Without those, the temporary kernel mappings would have been terribly expensive.

>
> Linux demarks interrupts from supervisor code much better than we do.
> If we move some of the more sophisticated operational capabilities
> out of our interrupt subsystem, we could get rid of most of the spl*()
> junk we currently have to do. This is a real sore spot in current
> FreeBSD code. Interrupts are just too complex.
> I'd also get rid of
> FreeBSD's intermediate 'software interrupt' layer, which is able to
> do even more complex things than hard interrupt code. The latency
> considerations just don't make any sense versus running pending software
> interrupts synchronously in tsleep(), prior to actually sleeping. We
> need to do this anyway (or move softints to kernel threads) to be able
> to take advantage of SMP mechanisms. The *only* thing our interrupts
> should be allowed to do is finish I/O on a page or use zalloc().
>
Note that Linux doesn't even handle IDE PIO correctly, and has historically lost interrupts due to it (and they added a software interrupt scheme to allow it to work). IDE DMA saved their a**.

John

-------------------------------------------------------------------------------
--- ifmail v.2.14dev2
 * Origin: Tver State University NOC (2:5020/400@fidonet)
