Thread: Re: [COMMITTERS] pgsql-server/ /configure /configure.in rc/incl ...
[moving to -performance, please drop -committers from replies]

> > I've toyed with the idea of adding this because it is monstrously
> > more efficient than select()/poll() in basically every way, shape,
> > and form.
>
> From what I've looked at, kqueue only wins when you are watching a
> large number of file descriptors at the same time; which is an
> operation done nowhere in Postgres.  I think the above would be a
> complete waste of effort.

It scales very well to many thousands of descriptors, but it works well
on small numbers of descriptors too: kqueue is about 5x faster than
select() or poll() at the low end of the fd count.  As I said earlier, I
don't think there is _much_ to gain in this regard; it would be a speed
improvement, but only on one OS supported by PostgreSQL.  I think there
are bigger speed improvements to be had elsewhere in the code.

> > Is this one of the areas of PostgreSQL that just needs to get
> > slowly migrated to use mmap() or are there any gaping reasons why
> > to not use the family of system calls?
>
> There has been much speculation on this, and no proof that it
> actually buys us anything to justify the portability hit.

Actually, I don't think it would be that big of a portability hit: you
would still read() and write() as always, but in performance-sensitive
areas an #ifdef HAVE_MMAP section would make the appropriate mmap()
calls.  If the system doesn't have mmap(), there isn't much to lose and
we're in the same position we're in now.

> There would be some nontrivial problems to solve, such as the
> mechanics of accessing a large number of files from a large number
> of backends without running out of virtual memory.  Also, is it
> guaranteed that multiple backends mmap'ing the same block will
> access the very same physical buffer, and not multiple copies?
> Multiple copies would be fatal.  See the archives for more
> discussion.

I have read through the archives.  Making a call to madvise() will
speed up access to the pages because it gives the VM hints about the
order in which the pages will be accessed/used.  Here are a few bits
from the BSD mmap() and madvise() man pages:

mmap(2):

    MAP_NOSYNC    Causes data dirtied via this VM map to be flushed to
                  physical media only when necessary (usually by the
                  pager) rather than gratuitously.  Typically this
                  prevents the update daemons from flushing pages
                  dirtied through such maps and thus allows efficient
                  sharing of memory across unassociated processes
                  using a file-backed shared memory map.  Without this
                  option any VM pages you dirty may be flushed to disk
                  every so often (every 30-60 seconds usually), which
                  can create performance problems if you do not need
                  that to occur (such as when you are using shared
                  file-backed mmap regions for IPC purposes).  Note
                  that VM/filesystem coherency is maintained whether
                  you use MAP_NOSYNC or not.  This option is not
                  portable across UNIX platforms (yet), though some
                  may implement the same behavior by default.

                  WARNING!  Extending a file with ftruncate(2), thus
                  creating a big hole, and then filling the hole by
                  modifying a shared mmap() can lead to severe file
                  fragmentation.  In order to avoid such fragmentation
                  you should always pre-allocate the file's backing
                  store by write()ing zeros into the newly extended
                  area prior to modifying the area via your mmap().
                  The fragmentation problem is especially sensitive to
                  MAP_NOSYNC pages, because pages may be flushed to
                  disk in a totally random order.
                  The same applies when using MAP_NOSYNC to implement
                  a file-based shared memory store.  It is recommended
                  that you create the backing store by write()ing
                  zeros to the backing file rather than
                  ftruncate()ing it.  You can test file fragmentation
                  by observing the KB/t (kilobytes per transfer)
                  results from an ``iostat 1'' while reading a large
                  file sequentially, e.g. using ``dd if=filename
                  of=/dev/null bs=32k''.

                  The fsync(2) function will flush all dirty data and
                  metadata associated with a file, including dirty
                  NOSYNC VM data, to physical media.  The sync(8)
                  command and sync(2) system call generally do not
                  flush dirty NOSYNC VM data.  The msync(2) system
                  call is obsolete since BSD implements a coherent
                  filesystem buffer cache.  However, it may be used to
                  associate dirty VM pages with filesystem buffers and
                  thus cause them to be flushed to physical media
                  sooner rather than later.

madvise(2):

    MADV_NORMAL      Tells the system to revert to the default paging
                     behavior.

    MADV_RANDOM      Is a hint that pages will be accessed randomly,
                     and prefetching is likely not advantageous.

    MADV_SEQUENTIAL  Causes the VM system to depress the priority of
                     pages immediately preceding a given page when it
                     is faulted in.

mprotect(2):

    The mprotect() system call changes the specified pages to have
    protection prot.  Not all implementations will guarantee protection
    on a page basis; the granularity of protection changes may be as
    large as an entire region.  A region is the virtual address space
    defined by the start and end addresses of a struct vm_map_entry.

    Currently these protection bits are known; they can be combined
    (OR'd together):

    PROT_NONE    No permissions at all.
    PROT_READ    The pages can be read.
    PROT_WRITE   The pages can be written.
    PROT_EXEC    The pages can be executed.

msync(2):

    The msync() system call writes any modified pages back to the
    filesystem and updates the file modification time.  If len is 0,
    all modified pages within the region containing addr will be
    flushed; if len is non-zero, only those pages containing addr and
    len-1 succeeding locations will be examined.  The flags argument
    may be specified as follows:

    MS_ASYNC        Return immediately
    MS_SYNC         Perform synchronous writes
    MS_INVALIDATE   Invalidate all cached data

A few thoughts come to mind:

1) Backends could share buffers by mmap()'ing shared regions of data.
   While I haven't seen any numbers to reflect this, I'd wager that
   mmap() is a faster interface than ipc.

2) While there are various file IO schemes scattered all over the
   place, the bulk of the critical routines that would need to be
   updated are in backend/storage/file/fd.c, more specifically (a
   rough sketch of what this might look like follows below):

   *) fileNameOpenFile() would need the appropriate mmap() call made
      to it.

   *) FileTruncate() would need some attention to avoid fragmentation.

   *) a new "sync" GUC would have to be introduced to handle msync
      (affects only pg_fsync() and pg_fdatasync()).

3) There's a bit of code in pgsql/src/backend/storage/smgr that could
   be gutted/removed.  Which of those storage types are even used any
   more?  There's a reference in the code to PostgreSQL 3.0.  :)

And I think that'd be it.  The LRU code could be used, if necessary, to
help manage the amount of mmap()'ed data in the VM at any one time; at
the very least that could be handled by a shm var that the various
backends increment/decrement as files are open()'ed/close()'ed.  I
didn't spend too long looking at this, but I _think_ that'd cover 80%
of PostgreSQL's disk access needs.
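To make item 2 a bit more concrete, here is a rough sketch of what an
#ifdef HAVE_MMAP path could look like.  The names (MmapFile, mf_open(),
mf_read()) are invented for illustration only and are not the existing
fd.c interfaces; error handling is kept to a minimum.

    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>
    #ifdef HAVE_MMAP
    #include <sys/mman.h>
    #endif

    typedef struct MmapFile
    {
        int     fd;     /* plain descriptor, still used for write()/fsync() */
        void   *base;   /* mapped region, or NULL if we fell back to read() */
        size_t  len;    /* length of the mapping */
    } MmapFile;

    static int
    mf_open(MmapFile *mf, const char *path)
    {
        struct stat st;

        mf->fd = open(path, O_RDWR);
        mf->base = NULL;
        mf->len = 0;
        if (mf->fd < 0)
            return -1;
        if (fstat(mf->fd, &st) < 0 || st.st_size == 0)
            return 0;               /* fall back to plain read()/write() */

    #ifdef HAVE_MMAP
        mf->base = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED,
                        mf->fd, 0);
        if (mf->base == MAP_FAILED)
        {
            mf->base = NULL;        /* degrade gracefully to read() */
            return 0;
        }
        mf->len = (size_t) st.st_size;
    #ifdef MADV_SEQUENTIAL
        /* advisory only: tells the VM how the pages will likely be used */
        madvise(mf->base, mf->len, MADV_SEQUENTIAL);
    #endif
    #endif
        return 0;
    }

    /* Read out of the mapping when there is one, otherwise fall back. */
    static ssize_t
    mf_read(MmapFile *mf, void *buf, size_t nbytes, off_t offset)
    {
        if (mf->base != NULL && (size_t) offset + nbytes <= mf->len)
        {
            memcpy(buf, (char *) mf->base + offset, nbytes);
            return (ssize_t) nbytes;
        }
        return pread(mf->fd, buf, nbytes, offset);
    }

The point of the fallback branches is that platforms without mmap()
lose nothing: the plain descriptor is still there and read()/write()
behave exactly as before.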
The next bit to possibly add would be passing a flag on FileOpen
operations to act as a hint to madvise(), so that the VM could
proactively react to PostgreSQL's access patterns.

I don't have my copy of Stevens handy (it's some 700mi away atm,
otherwise I'd cite it), but if Tom or someone else has it handy, look
up the example re: the performance gain from read()'ing an mmap()'ed
file versus a non-mmap()'ed file.  The difference is non-trivial and
_WELL_ worth the time given the speed increase.  The same speed benefit
held true for writes as well, iirc.  It's been a while, but I think it
was around page 330; the index has it listed and it's not that hard of
an example to find.

-sc

--
Sean Chittenden
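A tiny sketch of the FileOpen-hint idea just described.  The AccessHint
enum and apply_access_hint() are invented names, and the region is
assumed to have been mmap()'ed already (as in the earlier sketch):

    #include <sys/mman.h>
    #include <stddef.h>

    typedef enum AccessHint
    {
        HINT_NORMAL,        /* no particular pattern */
        HINT_SEQUENTIAL,    /* sequential scans, COPY */
        HINT_RANDOM         /* index probes */
    } AccessHint;

    /* Forward a caller-supplied access-pattern hint to madvise(2). */
    static void
    apply_access_hint(void *base, size_t len, AccessHint hint)
    {
        int advice = MADV_NORMAL;

        if (hint == HINT_SEQUENTIAL)
            advice = MADV_SEQUENTIAL;
        else if (hint == HINT_RANDOM)
            advice = MADV_RANDOM;

        /* purely advisory: a failure here is harmless and ignored */
        (void) madvise(base, len, advice);
    }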
On Thu, 2003-03-06 at 19:36, Sean Chittenden wrote:
> I don't have my copy of Stevens handy (it's some 700mi away atm,
> otherwise I'd cite it), but if Tom or someone else has it handy, look
> up the example re: the performance gain from read()'ing an mmap()'ed
> file versus a non-mmap()'ed file.  The difference is non-trivial and
> _WELL_ worth the time given the speed increase.

Can anyone confirm this?  If so, one easy step we could take in this
direction would be adapting COPY FROM to use mmap().

Cheers,

Neil

--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
> > I don't have my copy of Stevens handy (it's some 700mi away atm,
> > otherwise I'd cite it), but if Tom or someone else has it handy,
> > look up the example re: the performance gain from read()'ing an
> > mmap()'ed file versus a non-mmap()'ed file.  The difference is
> > non-trivial and _WELL_ worth the time given the speed increase.
>
> Can anyone confirm this?  If so, one easy step we could take in this
> direction would be adapting COPY FROM to use mmap().

Weeee!  Alright, so I got to have some fun writing out some simple
tests with mmap() and friends tonight.  Are the results interesting?
Absolutely!  Is this a simple benchmark?  Yup.  Do I think it simulates
PostgreSQL?  Eh, not particularly.  Does it demonstrate that mmap() is
a win and something worth implementing?  I sure hope so.  Is this a
test program to demonstrate the ideal use of mmap() in PostgreSQL?  No.
Is it a place to start a factual discussion?  I hope so.

I have four tests here that are conditionalized by cpp.

# The first one uses read() and write() with the buffer size set
# to the same size as the file.
gcc -O3 -finline-functions -fkeep-inline-functions -funroll-loops -o test-mmap test-mmap.c
/usr/bin/time ./test-mmap > /dev/null
Beginning tests with file: services
Page size: 4096
File read size is the same as the file size
Number of iterations: 100000
Start time: 1047013002.412516
Time: 82.88178
Completed tests
       82.09 real         2.13 user        68.98 sys

# The second one uses read() and write() with the default buffer size:
# 65536
gcc -O3 -finline-functions -fkeep-inline-functions -funroll-loops -DDEFAULT_READSIZE=1 -o test-mmap test-mmap.c
/usr/bin/time ./test-mmap > /dev/null
Beginning tests with file: services
Page size: 4096
File read size is default read size: 65536
Number of iterations: 100000
Start time: 1047013085.16204
Time: 18.155511
Completed tests
       18.16 real         0.90 user        14.79 sys
# Please note this is significantly faster, but that's expected

# The third test uses mmap() + madvise() + write()
gcc -O3 -finline-functions -fkeep-inline-functions -funroll-loops -DDEFAULT_READSIZE=1 -DDO_MMAP=1 -o test-mmap test-mmap.c
/usr/bin/time ./test-mmap > /dev/null
Beginning tests with file: services
Page size: 4096
File read size is the same as the file size
Number of iterations: 100000
Start time: 1047013103.859818
Time: 8.4294203644
Completed tests
        7.24 real         0.41 user         5.92 sys
# Faster still, and twice as fast as the normal read() case

# The last test calls mmap() only once when the file is opened, and
# msync()'s, munmap()'s, and close()'s the file only once at exit.
gcc -O3 -finline-functions -fkeep-inline-functions -funroll-loops -DDEFAULT_READSIZE=1 -DDO_MMAP=1 -DDO_MMAP_ONCE=1 -o test-mmap test-mmap.c
/usr/bin/time ./test-mmap > /dev/null
Beginning tests with file: services
Page size: 4096
File read size is the same as the file size
Number of iterations: 100000
Start time: 1047013111.623712
Time: 1.174076
Completed tests
        1.18 real         0.09 user         0.92 sys
# Substantially faster

Obviously this isn't perfect, but reading and writing data is faster
(specifically, moving pages through the VM/OS).  Doing partial writes
from mmap()'ed data should be faster, along with scanning through
mmap()'ed portions of - or completely mmap()'ed - files, because the
pages are already loaded in the VM.  PostgreSQL's LRU file descriptor
cache could easily be adjusted to add mmap()'ing of frequently accessed
files (specifically, system catalogs come to mind).
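For anyone who doesn't want to fetch the program, the interesting cases
boil down to roughly the following (a loose reconstruction, not the
actual test-mmap.c, and with error handling mostly omitted): a plain
read()/write() copy loop versus an mmap() + madvise() + write() pass
over the file, repeated ITERATIONS times.

    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define ITERATIONS 100000

    /* read()/write() copy loop (the "default read size" case). */
    static void
    copy_with_read(const char *path, char *buf, size_t bufsize)
    {
        int     fd = open(path, O_RDONLY);
        ssize_t n;

        if (fd < 0)
            return;
        while ((n = read(fd, buf, bufsize)) > 0)
            (void) write(STDOUT_FILENO, buf, (size_t) n);
        close(fd);
    }

    /* mmap() + madvise() + one big write(); mapped and unmapped each pass. */
    static void
    copy_with_mmap(const char *path)
    {
        struct stat st;
        void       *base;
        int         fd = open(path, O_RDONLY);

        if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
        {
            if (fd >= 0)
                close(fd);
            return;
        }
        base = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (base != MAP_FAILED)
        {
            madvise(base, (size_t) st.st_size, MADV_SEQUENTIAL);
            (void) write(STDOUT_FILENO, base, (size_t) st.st_size);
            munmap(base, (size_t) st.st_size);
        }
        close(fd);
    }

    int
    main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] : "services";
        char       *buf = malloc(65536);
        int         i;

        for (i = 0; i < ITERATIONS; i++)
        {
    #ifdef DO_MMAP
            copy_with_mmap(path);
    #else
            copy_with_read(path, buf, 65536);
    #endif
        }
        free(buf);
        return 0;
    }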
It's not hard to figure out how often particular files are accessed and
either _avoid_ mmap()'ing a file that isn't accessed often, or mmap()
files that _are_ accessed often.  mmap() does have a cost, but I'd
wager that mmap()'ing the same file a second or third time from a
different process would be more efficient.  The speedup of searching
through an mmap()'ed file may make it worth mmap()'ing all files,
however, as long as the system stays under a tunable resource limit
(max_mmaped_bytes?).

If someone is so inclined or there's enough interest, I can reverse
this test case so that data is written to an mmap()'ed file, but the
same performance difference should hold true (assuming this isn't a
write to a tape drive ::grin::).

The program used to generate the above tests is at:

http://people.freebsd.org/~seanc/mmap_test/

Please ask if you have questions.

-sc

--
Sean Chittenden
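The reversed (write-side) test could look something like the sketch
below: pre-allocate the backing store by write()ing zeros (per the
mmap(2) warning quoted earlier in the thread), dirty the pages through
a shared mapping, then msync().  The function name is invented for
illustration and short writes are not retried.

    #include <sys/types.h>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    static int
    write_via_mmap(const char *path, const void *data, size_t len)
    {
        static const char zeros[8192];      /* implicitly zero-filled */
        size_t  done = 0;
        void   *base;
        int     fd;

        if (len == 0)
            return -1;
        fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd < 0)
            return -1;

        /* Pre-allocate the backing store with real zero blocks (no holes). */
        while (done < len)
        {
            size_t  chunk = (len - done < sizeof(zeros)) ? len - done
                                                         : sizeof(zeros);

            if (write(fd, zeros, chunk) != (ssize_t) chunk)
            {
                close(fd);
                return -1;
            }
            done += chunk;
        }

        base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED)
        {
            close(fd);
            return -1;
        }

        memcpy(base, data, len);    /* "write" simply by dirtying the pages */
        msync(base, len, MS_SYNC);  /* force the dirty pages out to disk */
        munmap(base, len);
        close(fd);
        return 0;
    }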
Sean Chittenden <sean@chittenden.org> writes:
> Absolutely!  Is this a simple benchmark?  Yup.  Do I think it
> simulates PostgreSQL?  Eh, not particularly.

This would be on what OS?  What hardware?  What size test file?  Do the
"iterations" mean so many reads of the entire file, or so many
buffer-sized read requests?  Did the mmap case actually *read*
anything, or just map and unmap the file?

Also, what did you do to normalize for the effects of the test file
being already in kernel disk cache after the first test?

			regards, tom lane
> > Absolutely!  Is this a simple benchmark?  Yup.  Do I think it
> > simulates PostgreSQL?  Eh, not particularly.

I think quite a few of these questions would have been answered by
reading the code/Makefile....

> This would be on what OS?

FreeBSD, but it shouldn't matter.  Any reasonably written VM should
have similar numbers (though BSD is generally regarded as having the
best VM, which I think Linux poached not that long ago, iirc
::grimace::).

> What hardware?

My ultra-pathetic laptop with some fine - overly noisy and can hardly
buildworld - IDE drives.

> What size test file?

In this case, only 72K.  I've just updated the test program to use an
array of files, though.

> Do the "iterations" mean so many reads of the entire file, or so
> many buffer-sized read requests?

In some cases, yes.  With the file mmap()'ed, sorta.  One of the test
cases (the one that did it in ~8s) mmap()'ed and munmap()'ed the file
on every iteration and was twice as fast as the vanilla read() call.

> Did the mmap case actually *read* anything, or just map and unmap
> the file?

Nope, it read the file and wrote it out to stdout (which was redirected
to /dev/null).

> Also, what did you do to normalize for the effects of the test file
> being already in kernel disk cache after the first test?

That honestly doesn't matter too much, since I wasn't testing the rate
of reading files in from my hard drive, only the OS's ability to move
pages of data around.  In any case, I've updated my test case to
iterate through an array of files instead of just reading in a copy of
/etc/services.  My laptop is generally a poor benchmark for disk read
performance given that it takes 8hrs to buildworld, over 12hrs to build
Mozilla, 18 for KDE, and about 48hrs for OpenOffice.  :)  Someone with
faster disks may want to try this and report back, but it doesn't
matter much for weighing the benefits of mmap().  The point is that
there are calls available that substantially speed up read()'s and
write()'s by letting the VM align pages of data and take hints about
how they will be used.

For the sake of argument re: the previously run tests, I'll reverse the
order in which I ran them, and I bet dime to dollar that the times will
be identical.
% make ~/open_source/mmap_test
cp -f /etc/services ./services
gcc -O3 -finline-functions -fkeep-inline-functions -funroll-loops -DDEFAULT_READSIZE=1 -DDO_MMAP=1 -DDO_MMAP_ONCE=1 -o mmap-test mmap-test.c
/usr/bin/time ./mmap-test > /dev/null
Beginning tests with file: services
Page size: 4096
File read size is the same as the file size
Number of iterations: 100000
Start time: 1047064672.276544
Time: 1.281477
Completed tests
        1.29 real         0.10 user         0.92 sys
gcc -O3 -finline-functions -fkeep-inline-functions -funroll-loops -DDEFAULT_READSIZE=1 -DDO_MMAP=1 -o mmap-test mmap-test.c
/usr/bin/time ./mmap-test > /dev/null
Beginning tests with file: services
Page size: 4096
File read size is the same as the file size
Number of iterations: 100000
Start time: 1047064674.266191
Time: 7.486622
Completed tests
        7.49 real         0.41 user         6.01 sys
gcc -O3 -finline-functions -fkeep-inline-functions -funroll-loops -DDEFAULT_READSIZE=1 -o mmap-test mmap-test.c
/usr/bin/time ./mmap-test > /dev/null
Beginning tests with file: services
Page size: 4096
File read size is default read size: 65536
Number of iterations: 100000
Start time: 1047064682.288637
Time: 19.35214
Completed tests
       19.04 real         0.88 user        15.43 sys
gcc -O3 -finline-functions -fkeep-inline-functions -funroll-loops -o mmap-test mmap-test.c
/usr/bin/time ./mmap-test > /dev/null
Beginning tests with file: services
Page size: 4096
File read size is the same as the file size
Number of iterations: 100000
Start time: 1047064701.867031
Time: 82.4294540875
Completed tests
       81.57 real         2.10 user        69.55 sys

Here's the updated test that iterates through a list of files.  Ooh!
One better: the files I've used are actual data files from ~pgsql.  The
new benchmark iterates through the list of files, calls bench() once
for each file, and restarts at the first file after reaching the end of
its list (ARGV).  Whoa, if these tests are even close to the real
world, then at the very least we should be mmap()'ing the file every
time we read it (assuming we're reading more than just a handful of
bytes):

find /usr/local/pgsql/data -type f | /usr/bin/xargs /usr/bin/time ./mmap-test > /dev/null
Page size: 4096
File read size is the same as the file size
Number of iterations: 100000
Start time: 1047071143.463360
Time: 12.109530
Completed tests
       12.11 real         0.36 user         6.80 sys

find /usr/local/pgsql/data -type f | /usr/bin/xargs /usr/bin/time ./mmap-test > /dev/null
Page size: 4096
File read size is default read size: 65536
Number of iterations: 100000
.... [been waiting here for >40min now....]

Ah well, if these tests finish this century I'll post the results in a
bit, but it's pretty clearly a win.  In terms of the data that I'm
copying, I'm copying ~700MB of data from my test DB on my laptop.  I
only have 256MB of RAM, so I can pretty much promise you that the data
isn't in my system buffers.

If anyone else would like to run the tests or look at the results,
please check it out: o1 and o2 should be the only targets used if FILES
is bigger than the RAM on the system.  o3 is by far and away the
fastest, but only in rare cases will a DBA have more RAM than data.
But, as mentioned earlier, the LRU cache could easily be modified to
munmap() infrequently accessed files to keep the size of mmap()'ed data
down to a reasonable level.

The updated test programs are at:

http://people.FreeBSD.org/~seanc/mmap_test/

-sc

--
Sean Chittenden
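The LRU/munmap() bookkeeping hinted at above needs surprisingly little
state.  A minimal sketch follows, with the structure, list handling,
and the max_mmaped_bytes knob all hypothetical, and with the counters
living in per-process memory rather than the shared-memory variable
suggested upthread:

    #include <sys/mman.h>
    #include <stddef.h>
    #include <stdlib.h>

    typedef struct MappedSeg
    {
        void             *base;
        size_t            len;
        struct MappedSeg *next;    /* list kept in LRU order; head is oldest */
    } MappedSeg;

    static MappedSeg *lru_head = NULL;
    static size_t     total_mapped = 0;
    static size_t     max_mmaped_bytes = 64 * 1024 * 1024;  /* stand-in GUC */

    /* Call before mmap()'ing another 'want' bytes: unmap the oldest
     * segments until the new mapping would fit under the limit. */
    static void
    evict_until_room(size_t want)
    {
        while (lru_head != NULL && total_mapped + want > max_mmaped_bytes)
        {
            MappedSeg *victim = lru_head;

            lru_head = victim->next;
            munmap(victim->base, victim->len);
            total_mapped -= victim->len;
            free(victim);           /* assumes segments were malloc()'ed */
        }
    }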