Thread: mmap (was First set of OSDL Shared Mem scalability results, some wierdness ...
mmap (was First set of OSDL Shared Mem scalability results, some wierdness ...
From: Aaron Werman
pg to my mind is unique in not trying to avoid OS buffering. Other dbmses spend a substantial effort to create a virtual OS (task management, I/O drivers, etc.) both in code and support. Choosing mmap seems such a limiting option - it adds OS dependency and limits kernel developer options (2G limits, global mlock serializations, porting problems, inability to schedule or parallelize I/O, still having to coordinate writers and readers). More to the point, I think it is very hard to effectively coordinate multithreaded I/O, and mmap seems used mostly to manage relatively simple scenarios.

If the I/O options are:
- OS (which has enormous investment and is stable, but is general purpose with overhead)
- pg (direct I/O would be costly and potentially destabilizing, but with big possible performance rewards)
- mmap (a feature mostly used to reduce buffer copies in less concurrent apps such as image processing, that has major architectural risk including an order of magnitude more semaphores, but can reduce some extra block copies)

mmap doesn't look that promising.

/Aaron

----- Original Message -----
From: "Kevin Brown" <kevin@sysexperts.com>
To: <pgsql-performance@postgresql.org>
Sent: Thursday, October 14, 2004 4:25 PM
Subject: Re: [PERFORM] First set of OSDL Shared Mem scalability results, some wierdness ...

> Tom Lane wrote:
> > Kevin Brown <kevin@sysexperts.com> writes:
> > > Tom Lane wrote:
> > >> mmap() is Right Out because it does not afford us sufficient control over when changes to the in-memory data will propagate to disk.
> >
> > > ... that's especially true if we simply cannot have the page written to disk in a partially-modified state (something I can easily see being an issue for the WAL -- would the same hold true of the index/data files?).
> >
> > You're almost there. Remember the fundamental WAL rule: log entries must hit disk before the data changes they describe. That means that we need not only a way of forcing changes to disk (fsync) but a way of being sure that changes have *not* gone to disk yet. In the existing implementation we get that by just not issuing write() for a given page until we know that the relevant WAL log entries are fsync'd down to disk. (BTW, this is what the LSN field on every page is for: it tells the buffer manager the latest WAL offset that has to be flushed before it can safely write the page.)
> >
> > mmap provides msync which is comparable to fsync, but AFAICS it provides no way to prevent an in-memory change from reaching disk too soon. This would mean that WAL entries would have to be written *and flushed* before we could make the data change at all, which would convert multiple updates of a single page into a series of write-and-wait-for-WAL-fsync steps. Not good. fsync'ing WAL once per transaction is bad enough, once per atomic action is intolerable.
>
> Hmm...something just occurred to me about this.
>
> Would a hybrid approach be possible? That is, use mmap() to handle reads, and use write() to handle writes?
>
> Any code that wishes to write to a page would have to recognize that it's doing so and fetch a copy from the storage manager (or something), which would look to see if the page already exists as a writeable buffer. If it doesn't, it creates it by allocating the memory and then copying the page from the mmap()ed area to the new buffer, and returning it. If it does, it just returns a pointer to the buffer.
> There would obviously have to be some bookkeeping involved: the storage manager would have to know how to map a mmap()ed page back to a writeable buffer and vice-versa, so that once it decides to write the buffer it can determine which page in the original file the buffer corresponds to (so it can do the appropriate seek()).
>
> In a write-heavy database, you'll end up with a lot of memory copy operations, but with the scheme we currently use you get that anyway (it just happens in kernel code instead of user code), so I don't see that as much of a loss, if any. Where you win is in a read-heavy database: you end up being able to read directly from the pages in the kernel's page cache and thus save a memory copy from kernel space to user space, not to mention the context switch that happens due to issuing the read().
>
> Obviously you'd want to mmap() the file read-only in order to prevent the issues you mention regarding an errant backend, and then reopen the file read-write for the purpose of writing to it. In fact, you could decouple the two: mmap() the file, then close the file -- the mmap()ed region will remain mapped. Then, as long as the file remains mapped, you need to open the file again only when you want to write to it.
>
> --
> Kevin Brown   kevin@sysexperts.com

--
Regards,
/Aaron
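A minimal sketch of the hybrid read-via-mmap / write-via-write() scheme Kevin describes in the quoted message above, assuming an 8 kB page size; the file name, the WriteBuffer structure, and the control flow are illustrative only (error handling and file sizing omitted), not PostgreSQL's actual storage-manager API:

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE_SZ  8192
#define N_PAGES  128

/* Private, writeable copy of one page, tagged with its file offset. */
typedef struct {
    off_t offset;
    char  data[PAGE_SZ];
} WriteBuffer;

int main(void)
{
    int   fd  = open("datafile", O_RDWR);            /* placeholder file */
    char *map = mmap(NULL, (size_t) PAGE_SZ * N_PAGES,
                     PROT_READ, MAP_SHARED, fd, 0);  /* reads come straight
                                                        from the kernel's
                                                        page cache        */

    /* A writer first copies the page out of the read-only mapping ...   */
    WriteBuffer *buf = malloc(sizeof *buf);
    buf->offset = 3 * PAGE_SZ;
    memcpy(buf->data, map + buf->offset, PAGE_SZ);

    /* ... scribbles on the private copy ...                             */
    buf->data[100] = 0x42;

    /* ... and only after the relevant WAL has been flushed writes it
     * back through an ordinary file descriptor, keeping the
     * write-ordering control that a writeable mapping would give up.    */
    pwrite(fd, buf->data, PAGE_SZ, buf->offset);

    munmap(map, (size_t) PAGE_SZ * N_PAGES);
    free(buf);
    close(fd);
    return 0;
}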
Re: mmap (was First set of OSDL Shared Mem scalability results, some wierdness ...
From: Kevin Brown
Aaron Werman wrote:
> pg to my mind is unique in not trying to avoid OS buffering. Other dbmses spend a substantial effort to create a virtual OS (task management, I/O drivers, etc.) both in code and support. Choosing mmap seems such a limiting option - it adds OS dependency and limits kernel developer options (2G limits, global mlock serializations, porting problems, inability to schedule or parallelize I/O, still having to coordinate writers and readers).

I'm not sure I entirely agree with this. Whether you access a file via mmap() or via read(), the end result is that you still have to access it, and since PG has significant chunks of system-dependent code that it heavily relies on as it is (e.g., locking mechanisms, shared memory), writing the I/O subsystem in a similar way doesn't seem to me to be that much of a stretch (especially since PG already has the storage manager), though it might involve quite a bit of work.

As for parallelization of I/O, the use of mmap() for reads should significantly improve parallelization -- now instead of issuing read() system calls, possibly for the same set of blocks, all the backends would essentially be examining the same data directly. The performance improvements as a result of accessing the kernel's cache pages directly instead of having it do buffer copies to process-local memory should increase as concurrency goes up. But see below.

> More to the point, I think it is very hard to effectively coordinate multithreaded I/O, and mmap seems used mostly to manage relatively simple scenarios.

PG already manages and coordinates multithreaded I/O. The mechanisms used to coordinate writes needn't change at all. But the way reads are done relative to writes might have to be rethought, since an mmap()ed buffer always reflects what's actually in kernel space at the time the buffer is accessed, while a buffer retrieved via read() reflects the state of the file at the time of the read(). If it's necessary for the state of the buffers to be fixed at examination time, then mmap() will be at best a draw, not a win.

> mmap doesn't look that promising.

This ultimately depends on two things: how much time is spent copying buffers around in kernel memory, and how much advantage can be gained by freeing up the memory used by the backends to store the backend-local copies of the disk pages they use (and thus making that memory available to the kernel to use for additional disk buffering). The gains from the former are likely small. The gains from the latter are probably also small, but harder to estimate.

The use of mmap() is probably one of those optimizations that should be done when there's little else left to optimize, because the potential gains are possibly (if not probably) relatively small and the amount of work involved may be quite large. So I agree -- compared with other, much lower-hanging fruit, mmap() doesn't look promising.

--
Kevin Brown   kevin@sysexperts.com
Re: mmap (was First set of OSDL Shared Mem scalability results, some wierdness ...
From: Sean Chittenden
>> pg to my mind is unique in not trying to avoid OS buffering. Other dbmses spend a substantial effort to create a virtual OS (task management, I/O drivers, etc.) both in code and support. Choosing mmap seems such a limiting option - it adds OS dependency and limits kernel developer options (2G limits, global mlock serializations, porting problems, inability to schedule or parallelize I/O, still having to coordinate writers and readers).

2G limits? That must be a Linux limitation, not a limitation with mmap(2). On OS-X and FreeBSD it's anywhere from 4GB to ... well, whatever the 64bit limit is (which is bigger than any data file in $PGDATA). An mlock(2) serialization problem is going to be cheaper than hitting the disk in nearly all cases and should be no worse than a context switch or semaphore (what we use for the current locking scheme), of which PostgreSQL causes plenty of 'em because it's multi-process, not multi-threaded. Coordination of data isn't necessary if you mmap(2) data as a private block, which takes a snapshot of the page at the time you make the mmap(2) call and gets copied only when the page is written to. More on that later.

> I'm not sure I entirely agree with this. Whether you access a file via mmap() or via read(), the end result is that you still have to access it, and since PG has significant chunks of system-dependent code that it heavily relies on as it is (e.g., locking mechanisms, shared memory), writing the I/O subsystem in a similar way doesn't seem to me to be that much of a stretch (especially since PG already has the storage manager), though it might involve quite a bit of work.

Obviously you have to access the file on the hard drive, but you're forgetting an enormous advantage of mmap(2). With a read(2) system call, the program has to allocate space for the read(2), then the kernel copies the data into that newly allocated userland memory. With mmap(2) there is no second copy.

Let's look at what happens with a read(2) call. To read(2) data you have to have a block of memory to copy data into. Assume your OS of choice has a good malloc(3) implementation and it only needs to call brk(2) once to extend the process's memory address space after the first malloc(3) call. There's your first system call, which guarantees one context switch. The second hit, a much larger hit, is the actual read(2) call itself, wherein the kernel has to copy the data twice: once into a kernel buffer, then from the kernel buffer into the userland's memory space. Yuk. Web servers figured out long ago that read(2) is slow and evil in terms of performance. Apache uses mmap(2) to send static files at performance levels that don't suck and is actually quite fast (in terms of responsiveness; I'm not talking about Apache's parallelism/concurrency performance levels... which in 1.X aren't great).

mmap(2) is a totally different animal in that you don't ever need to make calls to read(2): mmap(2) is used in place of those calls (with #ifdef and a good abstraction, the rest of PostgreSQL wouldn't know it was working with a page of mmap(2)'ed data or need to know that it is). Instead you mmap(2) a file descriptor and the kernel does some heavy lifting/optimized magic in its VM. The kernel reads the file descriptor and places the data it reads into its buffer (exactly the same as what happens with read(2)), but, instead of copying the data to userspace, mmap(2) adjusts the process's address space and maps the address of the kernel buffer into the process's address space. No copying necessary. The savings here are *huge*!
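As a concrete illustration of the two access paths being compared here, a small standalone sketch (error handling omitted; /etc/services is just a convenient example file):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/etc/services", O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    /* read(2) path: allocate a userland buffer, then have the kernel
     * copy the file contents into it. */
    char *buf = malloc(st.st_size);
    read(fd, buf, st.st_size);

    /* mmap(2) path: no userland buffer and no extra copy; the process's
     * address space is simply pointed at the kernel's cached pages,
     * which are faulted in from disk only as they are touched. */
    char *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);

    printf("first byte via read(): %c, via mmap(): %c\n", buf[0], map[0]);

    munmap(map, st.st_size);
    free(buf);
    close(fd);
    return 0;
}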
Depending on the mmap(2) implementation, the VM may not even get a page from disk until it's actually needed. So, let's say you mmap(2) a 16MB file. The address space picks up an extra 16MB of bits that the process *can* use, but doesn't necessarily use. So if a user reads only ten pages out of that 16MB file, only those 10 pages (10 * getpagesize(), usually 40,960 bytes) are faulted in, which is about 0.24% of the file (((4096 * 10) / (16 * 1024 * 1024)) * 100). Did I forget to mention that if the file is already in the kernel's buffers, there's no need for the kernel to access the hard drive? Another big win for data that's hot/frequently accessed. There's another large savings if the machine is doing network IO too...

> As for parallelization of I/O, the use of mmap() for reads should significantly improve parallelization -- now instead of issuing read() system calls, possibly for the same set of blocks, all the backends would essentially be examining the same data directly. The performance improvements as a result of accessing the kernel's cache pages directly instead of having it do buffer copies to process-local memory should increase as concurrency goes up. But see below.

That's kinda true... though not quite correct. The improvement in IO concurrency comes from zero-socket-copy operations from the disk to the network controller. If a write(2) system call is issued on a page of mmap(2)'ed data (and your operating system supports it, I know FreeBSD does, but don't think Linux does), then the page of data is DMA'ed by the network controller and sent out without the data needing to be copied into the network controller's buffer. So, instead of the CPU copying data from the OS's buffer to a kernel buffer, the network card grabs the chunk of data in one interrupt because of the DMA (direct memory access). This is a pretty big deal for web serving, but if you've got a database sending large sets of data over the network, assuming the network isn't the bottleneck, this results in a hefty performance boost (that won't be noticed by most until they're running huge, very busy installations). This optimization comes for free and without needing to add one line of code to an application once mmap(2) has been added to an application.

>> More to the point, I think it is very hard to effectively coordinate multithreaded I/O, and mmap seems used mostly to manage relatively simple scenarios.
>
> PG already manages and coordinates multithreaded I/O. The mechanisms used to coordinate writes needn't change at all. But the way reads are done relative to writes might have to be rethought, since an mmap()ed buffer always reflects what's actually in kernel space at the time the buffer is accessed, while a buffer retrieved via read() reflects the state of the file at the time of the read(). If it's necessary for the state of the buffers to be fixed at examination time, then mmap() will be at best a draw, not a win.

Here's where things can get interesting from a transaction standpoint. Your statement is correct up until you make the assertion that a page needs to be fixed.
If you're doing a read(2) transaction, mmap(2) a region and set the MAP_PRIVATE flag so the ground won't change underneath you. No copying of this page is done by the kernel unless it gets written to. If you're doing a write(2) or are directly scribbling on an mmap(2)'ed page[1], you need to grab some kind of an exclusive lock on the page/file (mlock(2) is going to be no more expensive than a semaphore, but probably less expensive). We already do that with semaphores, however. So for databases that don't have high contention for the same page/file of data, there are no additional copies made. When a piece of data is written, a page is duplicated before it gets scribbled on, but the application never knows this happens. The next time a process mmap(2)'s a region of memory that's been written to, it'll get the updated data without any need to flush a cache or mark pages as dirty: the operating system does all of this for us (and probably faster too). mmap(2) implementations are, IMHO, more optimized than shared memory implementations (mmap(2) is a VM function, which gets many eyes to look it over and is always being tuned, whereas shared mem is a bastardized subsystem that works, but isn't integral to any performance areas in the kernel so it gets neglected. Just my observations from the *BSD commit lists; Linux may be different).

[1] I forgot to mention earlier, you don't have to write(2) data to a file if it's mmap(2)'ed; you can change the contents of an mmap(2)'ed region, then msync(2) it back to disk (to ensure it gets written out) or let the last munmap(2) call do that for you (which would be just as dangerous as running without fsync... but would result in an additional performance boost).

>> mmap doesn't look that promising.
>
> This ultimately depends on two things: how much time is spent copying buffers around in kernel memory, and how much advantage can be gained by freeing up the memory used by the backends to store the backend-local copies of the disk pages they use (and thus making that memory available to the kernel to use for additional disk buffering).

Someone on IRC pointed me to some OSDL benchmarks, which broke down where time is being spent. Want to know what the most expensive part of PostgreSQL is? *drum roll*

http://khack.osdl.org/stp/297960/profile/DBT_2_Profile-tick.sort

 3967393 total                       1.7735
 2331284 default_idle            36426.3125
  825716 do_sigaction             1290.1813
  133126 __copy_from_user_ll      1040.0469
   97780 __copy_to_user_ll         763.9062
   43135 finish_task_switch        269.5938
   30973 do_anonymous_page          62.4456
   24175 scsi_request_fn            22.2197
   23355 __do_softirq              121.6406
   17039 __wake_up                 133.1172
   16527 __make_request             10.8730
    9823 try_to_wake_up             13.6431
    9525 generic_unplug_device      66.1458
    8799 find_get_page              78.5625
    7878 scsi_end_request           30.7734

Copying data to/from userspace and signal handling!!!! Let's hear it for the need for mmap(2)!!! *crowd goes wild*

> The gains from the former are likely small. The gains from the latter are probably also small, but harder to estimate.

I disagree.

> The use of mmap() is probably one of those optimizations that should be done when there's little else left to optimize, because the potential gains are possibly (if not probably) relatively small and the amount of work involved may be quite large.

If system/kernel time is where most of your database spends its time, then mmap(2) is a huge optimization that is very much worth pursuing.
It's stable (nearly all webservers use it, notably Apache), widely deployed, POSIX specified (granted not all implementations are 100% consistent, but that's an OS bug and mmap(2) doesn't have to be turned on for those platforms: it's no worse than where we are now), and well optimized by operating system hackers. I guarantee that your operating system of choice has a faster VM and disk cache than PostgreSQL's userland cache, never mind that using the OS's buffers leads to many performance boosts as the OS can short-circuit common pathways that would require data copying (ex: zero-socket-copy operations and copying data to/from userland). mmap(2) isn't a panacea or replacement for good software design, but it certainly does make IO operations vastly faster, which is what PostgreSQL does a lot of (hence its need for a userland cache).

Remember, back when PostgreSQL had its architecture thunk up, mmap(2) hardly existed in anyone's eyes, never mind it being widely used or a POSIX function. It wasn't until Apache started using it that operating system vendors felt the need to implement it or make it work well. Now it's integral to nearly all virtual memory implementations and a modern OS can't live without it or have it broken in any way.

It would be largely beneficial to PostgreSQL to heavily utilize mmap(2). A few places it should be used include:

*) Storage. It is a good idea to mmap(2) all files instead of read(2)'ing files. mmap(2) doesn't fetch a page from disk until it's actually needed, which is a nifty savings. Sure it causes a fault in the kernel, but it won't the second time that page is accessed. Changes are necessary to src/backend/storage/file/, possibly src/backend/storage/freespace/ (why is it using fread(3) and not read(2)?), src/backend/storage/large_object/ can remain gimpy since people should use BYTEA instead (IMHO), src/backend/storage/page/ doesn't need changes (I don't think), and src/backend/storage/smgr/ shouldn't need any modifications either.

*) ARC. Why munmap(2) data if you don't need to? With ARC, it's possible for the database to coach the operating system in what pages should be persistent. ARC's a smart algorithm for handling the needs of a database. Instead of having a cache of pages in userland, PostgreSQL would have a cache of mmap(2)'ed pages. It's shared between processes, the changes are public to external programs read(2)'ing data, and it's quick. The need for shared memory from the kernel drops to nearly nothing. The need for mmap(2)'able space in the kernel, however, does go up. Unlike SysV shared mem, this can normally be changed on the fly. The end result would be: if a page is needed, it checks to see if it's in the cache. If it is, the mmap(2)'ed page is returned. If it isn't, the page gets read(2)/mmap(2)'ed like it currently is loaded (except in the mmap(2) case, where after the data has been loaded, the page gets munmap(2)'ed). If ARC decides to keep the page, the page doesn't get munmap(2)'ed. I don't think any changes would need to be made here, though, to take advantage of mmap(2) if the changes are made in the places mentioned above in the Storage point.

A few other perks:

*) DIRECTIO can be used without much of a cache coherency headache since the cache of data is in the kernel, not userland.

*) NFS.
I'm not suggesting multiple clients use the same data directory via NFS (unless read only), but if there were a single client accessing a data directory over NFS, performance would be much better than it is today because data consistency is handled by the kernel, so in-flight packets for writes that get dropped or lost won't cause a slowdown (mmap(2) behaves differently with NFS pages) or corruption.

*) mmap(2) is conditional on the operating system's abilities, but doesn't require any architectural changes. It does change the location of the cache, from the userland down into the kernel. This is a change for database administrators, but a good one, IMHO. Previously, the operating system would be split 25% kernel, 75% user because PostgreSQL would need the available RAM for its cache. Now, that can be moved closer to the opposite, 75% kernel, 25% user, because most of the memory is mmap(2)'ed pages instead of actual memory in the userland.

*) Pages can be protected via PROT_(EXEC|READ|WRITE). For backends that aren't making changes to the DDL or system catalogs (permissions, etc.), pages that are loaded from the catalogs could be loaded with the protection PROT_READ, which would prevent changes to the catalogs. All DDL and permission altering commands (anything that touches the system catalogs) would then load the page with the PROT_WRITE bit set, make their changes, then PROT_READ the page again. This would provide a first line of defense against buggy programs or exploits.

*) Eliminates the double caching done currently (caching in PostgreSQL and the kernel) by pushing the cache into the kernel... but without PostgreSQL knowing it's working on a page that's in the kernel.

Please ask questions if you have them.

-sc

--
Sean Chittenden
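A rough illustration of the PROT_READ/PROT_WRITE idea from the list above, assuming a page-aligned file mapping; this is only a sketch of the mprotect(2) dance with a placeholder file name, not anything that exists in PostgreSQL:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long pagesz = sysconf(_SC_PAGESIZE);
    int  fd     = open("catalog_page", O_RDWR | O_CREAT, 0600); /* placeholder */
    ftruncate(fd, pagesz);                 /* make sure one page is backed    */

    /* Catalog pages are normally mapped read-only, so a stray write in a
     * buggy backend faults instead of silently corrupting the catalog.      */
    char *page = mmap(NULL, pagesz, PROT_READ, MAP_SHARED, fd, 0);

    /* A DDL / permission-altering command temporarily opens the page for
     * writing, makes its change, then locks it back down.                   */
    mprotect(page, pagesz, PROT_READ | PROT_WRITE);
    page[42] = 1;                          /* the actual catalog update      */
    mprotect(page, pagesz, PROT_READ);

    munmap(page, pagesz);
    close(fd);
    return 0;
}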
Re: mmap (was First set of OSDL Shared Mem scalability results, some wierdness ...
From: Mark Wong
On Fri, Oct 15, 2004 at 01:09:01PM -0700, Sean Chittenden wrote:
[snip]
> > This ultimately depends on two things: how much time is spent copying buffers around in kernel memory, and how much advantage can be gained by freeing up the memory used by the backends to store the backend-local copies of the disk pages they use (and thus making that memory available to the kernel to use for additional disk buffering).
>
> Someone on IRC pointed me to some OSDL benchmarks, which broke down where time is being spent. Want to know what the most expensive part of PostgreSQL is? *drum roll*
>
> http://khack.osdl.org/stp/297960/profile/DBT_2_Profile-tick.sort
>
>  3967393 total                       1.7735
>  2331284 default_idle            36426.3125
>   825716 do_sigaction             1290.1813
>   133126 __copy_from_user_ll      1040.0469
>    97780 __copy_to_user_ll         763.9062
>    43135 finish_task_switch        269.5938
>    30973 do_anonymous_page          62.4456
>    24175 scsi_request_fn            22.2197
>    23355 __do_softirq              121.6406
>    17039 __wake_up                 133.1172
>    16527 __make_request             10.8730
>     9823 try_to_wake_up             13.6431
>     9525 generic_unplug_device      66.1458
>     8799 find_get_page              78.5625
>     7878 scsi_end_request           30.7734
>
> Copying data to/from userspace and signal handling!!!! Let's hear it for the need for mmap(2)!!! *crowd goes wild*
[snip]

I know where the do_sigaction is coming from in this particular case. Manfred Spraul tracked it to a pair of pgsignal calls in libpq. Commenting those two calls out virtually eliminates do_sigaction from the kernel profile for this workload.

I've lost track of the discussion over the past year, but I heard a rumor that it was finally addressed to some degree. I did understand it touched on a lot of other things, but can anyone summarize where that discussion has gone?

Mark
Re: mmap (was First set of OSDL Shared Mem scalability results, some wierdness ...
From: Tom Lane
Sean Chittenden <sean@chittenden.org> writes:
> Coordination of data isn't necessary if you mmap(2) data as a private block, which takes a snapshot of the page at the time you make the mmap(2) call and gets copied only when the page is written to. More on that later.

We cannot move to a model where different backends have different views of the same page, which seems to me to be inherent in the idea of using MAP_PRIVATE for anything. To take just one example, a backend that had mapped one btree index page some time ago could get completely confused if that page splits, because it might see the effects of the split in nearby index pages but not in the one that was split. Or it could follow an index link to a heap entry that isn't there anymore, or miss an entry it should have seen. MVCC doesn't save you from this because btree adjustments happen below the level of transactions.

However the really major difficulty with using mmap is that it breaks the scheme we are currently using for WAL, because you don't have any way to restrict how soon a change in an mmap'd page will go to disk. (No, I don't believe that mlock guarantees this. It says that the page will not be removed from main memory; it does not specify that, say, the syncer won't write the contents out anyway.)

> Let's look at what happens with a read(2) call. To read(2) data you have to have a block of memory to copy data into. Assume your OS of choice has a good malloc(3) implementation and it only needs to call brk(2) once to extend the process's memory address space after the first malloc(3) call. There's your first system call, which guarantees one context switch.

Wrong. Our reads occur into shared memory allocated at postmaster startup, remember?

> mmap(2) is a totally different animal in that you don't ever need to make calls to read(2): mmap(2) is used in place of those calls (With #ifdef and a good abstraction, the rest of PostgreSQL wouldn't know it was working with a page of mmap(2)'ed data or need to know that it is).

Instead, you have to worry about address space management and keeping a consistent view of the data.

> ... If a write(2) system call is issued on a page of mmap(2)'ed data (and your operating system supports it, I know FreeBSD does, but don't think Linux does), then the page of data is DMA'ed by the network controller and sent out without the data needing to be copied into the network controller's buffer.

Perfectly irrelevant to Postgres, since there is no situation where we'd ever write directly from a disk buffer to a socket; in the present implementation there are at least two levels of copy needed in between (datatype-specific output function and protocol message assembly). And that's not even counting the fact that any data item large enough to make the savings interesting would have been sliced, diced, and compressed by TOAST.

> ... If you're doing a write(2) or are directly scribbling on an mmap(2)'ed page[1], you need to grab some kind of an exclusive lock on the page/file (mlock(2) is going to be no more expensive than a semaphore, but probably less expensive).

More incorrect information. The locking involved here is done by LWLockAcquire, which is significantly *less* expensive than a kernel call in the case where there is no need to block. (If you have to block, any kernel call to do so is probably about as bad as any other.) Switching over to mlock would likely make things considerably slower. In any case, you didn't actually mean to say mlock, did you?
It doesn't lock pages against writes by other processes AFAICS.

> shared mem is a bastardized subsystem that works, but isn't integral to any performance areas in the kernel so it gets neglected.

What performance issues do you think shared memory needs to have fixed? We don't issue any shmem kernel calls after the initial shmget, so comparing the level of kernel tenseness about shmget to the level of tenseness about mmap is simply irrelevant. Perhaps the reason you don't see any traffic about this on the kernel lists is that shared memory already works fine and doesn't need any fixing.

> Please ask questions if you have them.

Do you have any arguments that are actually convincing? What I just read was a proposal to essentially throw away not only the entire low-level data access model, but the entire low-level locking model, and start from scratch. There is no possible way we could support both this approach and the current one, which means that we'd be permanently dropping support for all platforms without high-quality mmap implementations; and despite your enthusiasm I don't think that that category includes every interesting platform.

Furthermore, you didn't give any really convincing reasons to think that the enormous effort involved would be repaid. Those oprofile reports Josh just put up showed 3% of the CPU time going into userspace/kernelspace copying. Even assuming that that number consists entirely of reads and writes of shared buffers (and of course no other kernel call ever transfers any data across that boundary ;-)), there's no way we are going to buy into this sort of project in hopes of a 3% win.

			regards, tom lane
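For readers following the WAL argument: the ordering rule Tom refers to (WAL describing a page must be flushed before the page itself may be written) can be sketched roughly as below; the type and function names are illustrative stand-ins, not the actual buffer-manager code:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;            /* byte position in the WAL stream */

typedef struct {
    XLogRecPtr lsn;                     /* newest WAL record touching page */
    char       data[8192];
} Page;

static XLogRecPtr wal_flushed_upto;     /* WAL known to be safely on disk  */

/* Stand-ins for the real WAL-flush and storage-manager write routines. */
static void wal_flush(XLogRecPtr upto)
{
    printf("fsync WAL through %llu\n", (unsigned long long) upto);
}

static void page_write(const Page *p)
{
    printf("write out page with LSN %llu\n", (unsigned long long) p->lsn);
}

/* A dirty page may reach disk only after the WAL describing its changes
 * has been flushed; this is the ordering that a writeable mmap of the
 * data files would take out of PostgreSQL's hands. */
static void flush_dirty_page(Page *page)
{
    if (page->lsn > wal_flushed_upto) {
        wal_flush(page->lsn);
        wal_flushed_upto = page->lsn;
    }
    page_write(page);
}

int main(void)
{
    Page p = { .lsn = 1234 };
    flush_dirty_page(&p);
    return 0;
}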
Re: mmap (was First set of OSDL Shared Mem scalability results, some wierdness ...
From: Tom Lane
Mark Wong <markw@osdl.org> writes:
> I know where the do_sigaction is coming from in this particular case. Manfred Spraul tracked it to a pair of pgsignal calls in libpq. Commenting those two calls out virtually eliminates do_sigaction from the kernel profile for this workload.

Hmm, I suppose those are the ones associated with suppressing SIGPIPE during send(). It looks to me like those should go away in 8.0 if you have compiled with ENABLE_THREAD_SAFETY ... exactly how is PG being built in the current round of tests?

			regards, tom lane
Re: mmap (was First set of OSDL Shared Mem scalability results, some wierdness ...
From: Mark Wong
On Fri, Oct 15, 2004 at 05:37:50PM -0400, Tom Lane wrote:
> Mark Wong <markw@osdl.org> writes:
> > I know where the do_sigaction is coming from in this particular case. Manfred Spraul tracked it to a pair of pgsignal calls in libpq. Commenting those two calls out virtually eliminates do_sigaction from the kernel profile for this workload.
>
> Hmm, I suppose those are the ones associated with suppressing SIGPIPE during send(). It looks to me like those should go away in 8.0 if you have compiled with ENABLE_THREAD_SAFETY ... exactly how is PG being built in the current round of tests?

Ah, yes. Ok. It's not being configured with any options. That'll be easy to remedy though. I'll get that change made and we can try again.

Mark
From: Bruce Momjian

Tom Lane wrote:
> Mark Wong <markw@osdl.org> writes:
> > I know where the do_sigaction is coming from in this particular case. Manfred Spraul tracked it to a pair of pgsignal calls in libpq. Commenting those two calls out virtually eliminates do_sigaction from the kernel profile for this workload.
>
> Hmm, I suppose those are the ones associated with suppressing SIGPIPE during send(). It looks to me like those should go away in 8.0 if you have compiled with ENABLE_THREAD_SAFETY ... exactly how is PG being built in the current round of tests?

Yes, those calls are gone in 8.0 with --enable-thread-safety and were added specifically because of Manfred's reports.

--
Bruce Momjian                     | http://candle.pha.pa.us
pgman@candle.pha.pa.us            | (610) 359-1001
+ If your life is a hard drive,   | 13 Roberts Road
+ Christ can be your backup.      | Newtown Square, Pennsylvania 19073
Re: mmap (was First set of OSDL Shared Mem scalability results, some wierdness ...
From: Mark Wong
On Fri, Oct 15, 2004 at 09:22:03PM -0400, Bruce Momjian wrote:
> Tom Lane wrote:
> > Mark Wong <markw@osdl.org> writes:
> > > I know where the do_sigaction is coming from in this particular case. Manfred Spraul tracked it to a pair of pgsignal calls in libpq. Commenting those two calls out virtually eliminates do_sigaction from the kernel profile for this workload.
> >
> > Hmm, I suppose those are the ones associated with suppressing SIGPIPE during send(). It looks to me like those should go away in 8.0 if you have compiled with ENABLE_THREAD_SAFETY ... exactly how is PG being built in the current round of tests?
>
> Yes, those calls are gone in 8.0 with --enable-thread-safety and were added specifically because of Manfred's reports.

Ok, I had the build commands changed for installing PostgreSQL in STP. The do_sigaction call isn't at the top of the profile anymore; here's a reference for those who are interested. It should have the same test parameters as the one Tom referenced a little earlier:

http://khack.osdl.org/stp/298230/

Mark
Re: mmap (was First set of OSDL Shared Mem scalability results, some wierdness ...
From: Sean Chittenden
> However the really major difficulty with using mmap is that it breaks the scheme we are currently using for WAL, because you don't have any way to restrict how soon a change in an mmap'd page will go to disk. (No, I don't believe that mlock guarantees this. It says that the page will not be removed from main memory; it does not specify that, say, the syncer won't write the contents out anyway.)

I had to think about this for a minute (now nearly a week) and reread the docs on WAL before I grokked what could happen here. You're absolutely right in that WAL needs to be taken into account first. How does this execution path sound to you?

By default, all mmap(2)'ed pages are MAP_SHARED. There are no complications with regards to reads. When a backend wishes to write a page, the following steps are taken:

1) Backend grabs a lock from the lockmgr to write to the page (exactly as it does now).

2) Backend mmap(2)'s a second copy of the page(s) being written to, this time with the MAP_PRIVATE flag set. Mapping a copy of the page again is wasteful in terms of address space, but does not require any more memory than our current scheme. The re-mapping of the page with MAP_PRIVATE prevents changes to the data that other backends are viewing.

3) The writing backend can then scribble on its private copy of the page(s) as it sees fit.

4) Once it has completed making changes and a transaction is to be committed, the backend WAL logs its changes.

5) Once the WAL logging is complete and it has hit the disk, the backend msync(2)'s its private copy of the pages to disk (ASYNC or SYNC, it doesn't really matter too much to me).

6) Optional(?). I'm not sure whether or not the backend would need to also issue an msync(2) MS_INVALIDATE, but I suspect it would not need to on systems with unified buffer caches such as FreeBSD or OS-X. On HPUX, or other older *NIXes, it may be necessary. *shrug* I could be trying to be overly protective here.

7) Backend munmap(2)'s its private copy of the written-on page(s).

8) Backend releases its lock from the lockmgr.

At this point, the remaining backends now are able to see the updated pages of data.
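A bare-bones sketch of steps 2-7 for a single page, assuming an 8 kB page, a placeholder data file, and leaving out the lockmgr and WAL steps; hypothetical code to show the proposed mechanics only (and note Tom's reply below on whether msync(2) of a MAP_PRIVATE region can write anything back at all):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE_SZ 8192

int main(void)
{
    int   fd     = open("datafile", O_RDWR);  /* placeholder; assumed to be
                                                  at least four pages long  */
    off_t target = 3 * PAGE_SZ;               /* the page being modified   */

    /* Step 2: second, private mapping of the page; other backends keep
     * seeing the shared copy while this one gets scribbled on.            */
    char *priv = mmap(NULL, PAGE_SZ, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE, fd, target);

    /* Step 3: modify the private copy.                                    */
    priv[100] = 0x42;

    /* ... step 4 (WAL logging) and its fsync would happen here ...        */

    /* Step 5: attempt to push the change back out.                        */
    msync(priv, PAGE_SZ, MS_SYNC);

    /* Step 7: drop the private mapping.                                   */
    munmap(priv, PAGE_SZ);

    close(fd);
    return 0;
}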
>> Let's look at what happens with a read(2) call. To read(2) data you have to have a block of memory to copy data into. Assume your OS of choice has a good malloc(3) implementation and it only needs to call brk(2) once to extend the process's memory address space after the first malloc(3) call. There's your first system call, which guarantees one context switch.
>
> Wrong. Our reads occur into shared memory allocated at postmaster startup, remember?

Doh. Fair enough. In most programs that involve read(2), a call to malloc(3) needs to be made.

>> mmap(2) is a totally different animal in that you don't ever need to make calls to read(2): mmap(2) is used in place of those calls (With #ifdef and a good abstraction, the rest of PostgreSQL wouldn't know it was working with a page of mmap(2)'ed data or need to know that it is).
>
> Instead, you have to worry about address space management and keeping a consistent view of the data.

Which is largely handled by mmap() and the VM.

>> ... If a write(2) system call is issued on a page of mmap(2)'ed data (and your operating system supports it, I know FreeBSD does, but don't think Linux does), then the page of data is DMA'ed by the network controller and sent out without the data needing to be copied into the network controller's buffer.
>
> Perfectly irrelevant to Postgres, since there is no situation where we'd ever write directly from a disk buffer to a socket; in the present implementation there are at least two levels of copy needed in between (datatype-specific output function and protocol message assembly). And that's not even counting the fact that any data item large enough to make the savings interesting would have been sliced, diced, and compressed by TOAST.

The biggest winners will be columns whose storage type is PLAIN or EXTERNAL. writev(2) from mmap(2)'ed pages and non-mmap(2)'ed pages would be a nice perk too (not sure if PostgreSQL uses this or not). Since compression isn't happening on most tuples under 1K in size, and most tuples in a database are going to be under that, most tuples are going to be uncompressed. Total pages for the database, however, is likely a different story. For large tuples that are uncompressed and larger than a page, it is probably beneficial to use sendfile(2) instead of mmap(2) + write(2)'ing the page/file. If a large tuple is compressed, it'd be interesting to see if it'd be worthwhile to have the data uncompressed onto anonymously mmap(2)'ed page(s); that way the benefits of zero-socket-copies could be used.

>> shared mem is a bastardized subsystem that works, but isn't integral to any performance areas in the kernel so it gets neglected.
>
> What performance issues do you think shared memory needs to have fixed? We don't issue any shmem kernel calls after the initial shmget, so comparing the level of kernel tenseness about shmget to the level of tenseness about mmap is simply irrelevant. Perhaps the reason you don't see any traffic about this on the kernel lists is that shared memory already works fine and doesn't need any fixing.

I'm gunna get flamed for this, but I think it's improperly used as a second level cache on top of the operating system's cache. mmap(2) would consolidate all caching into the kernel.

>> Please ask questions if you have them.
>
> Do you have any arguments that are actually convincing?

Three things come to mind.

1) A single cache for pages

2) Ability to give access hints to the kernel regarding future IO

3) On the fly memory use for a cache. There would be no need to preallocate slabs of shared memory on startup.

And a more minor point would be:

4) Not having shared pages get lost when the backend dies (mmap(2) uses refcounts and cleans itself up, no need for ipcs/ipcrm/ipcclean). This isn't too practical in production though, but it sucks doing PostgreSQL development on OS-X because there is no ipcs/ipcrm command.

> What I just read was a proposal to essentially throw away not only the entire low-level data access model, but the entire low-level locking model, and start from scratch.

From the above list, steps 2, 3, 5, 6, and 7 would be different than our current approach, all of which could be safely handled with some #ifdef's on platforms that don't have mmap(2).

> There is no possible way we could support both this approach and the current one, which means that we'd be permanently dropping support for all platforms without high-quality mmap implementations;

Architecturally, I don't see anything different or any incompatibilities that aren't solved with an #ifdef USE_MMAP/#else/#endif.

> Furthermore, you didn't give any really convincing reasons to think that the enormous effort involved would be repaid.

Stevens has a great reimplementation of cat(1) that uses mmap(2) and benchmarks the two.
I did my own version of that here:

http://people.freebsd.org/~seanc/mmap_test/

When read(2)'ing/write(2)'ing /etc/services 100,000 times without mmap(2), it takes 82 seconds. With mmap(2), it takes anywhere from 1.1 to 18 seconds. Worst case scenario with mmap(2) yields a speedup by a factor of four. Best case scenario... *shrug* something better than 4x. I doubt PostgreSQL would see 4x speedups in the IO department, but I do think it would be vastly greater than the 3% suggested.

> Those oprofile reports Josh just put up showed 3% of the CPU time going into userspace/kernelspace copying. Even assuming that that number consists entirely of reads and writes of shared buffers (and of course no other kernel call ever transfers any data across that boundary ;-)), there's no way we are going to buy into this sort of project in hopes of a 3% win.

Would it be helpful if I created a test program that demonstrated the execution path for writing mmap(2)'ed pages as outlined above?

-sc

--
Sean Chittenden
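For reference, a crude harness in the spirit of the mmap_test comparison above (not the actual code behind those numbers): it times copying /etc/services to /dev/null through read(2)+write(2) versus through an mmap(2)'ed view:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

#define ITERATIONS 100000

int main(void)
{
    struct stat st;
    stat("/etc/services", &st);
    char   *buf = malloc(st.st_size);
    int     out = open("/dev/null", O_WRONLY);
    clock_t t0, t1;

    /* read(2)+write(2) path: every iteration copies the file from the
     * kernel's cache into a userland buffer, then back into the kernel
     * for the write. */
    t0 = clock();
    for (int i = 0; i < ITERATIONS; i++) {
        int fd = open("/etc/services", O_RDONLY);
        read(fd, buf, st.st_size);
        write(out, buf, st.st_size);
        close(fd);
    }
    t1 = clock();
    printf("read+write: %.2f s\n", (double) (t1 - t0) / CLOCKS_PER_SEC);

    /* mmap(2)+write(2) path: the write works straight out of the mapped
     * pages, so the copy into a userland buffer disappears. */
    t0 = clock();
    for (int i = 0; i < ITERATIONS; i++) {
        int   fd  = open("/etc/services", O_RDONLY);
        char *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        write(out, map, st.st_size);
        munmap(map, st.st_size);
        close(fd);
    }
    t1 = clock();
    printf("mmap+write: %.2f s\n", (double) (t1 - t0) / CLOCKS_PER_SEC);

    free(buf);
    close(out);
    return 0;
}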
Re: mmap (was First set of OSDL Shared Mem scalability results, some wierdness ...
From: Tom Lane
Sean Chittenden <sean@chittenden.org> writes:
> When a backend wishes to write a page, the following steps are taken:
> ...
> 2) Backend mmap(2)'s a second copy of the page(s) being written to, this time with the MAP_PRIVATE flag set.
> ...
> 5) Once the WAL logging is complete and it has hit the disk, the backend msync(2)'s its private copy of the pages to disk (ASYNC or SYNC, it doesn't really matter too much to me).

My man page for mmap says that changes in a MAP_PRIVATE region are private; they do not affect the file at all, msync or no. So I don't think the above actually works.

In any case, this scheme still forces you to flush WAL records to disk before making the changed page visible to other backends, so I don't see how it improves the situation. In the existing scheme we only have to fsync WAL at (1) transaction commit, (2) when we are forced to write a page out from shared buffers because we are short of buffers, or (3) checkpoint. Anything that implies an fsync per atomic action is going to be a loser. It does not matter how great your kernel API is if you only get to perform one atomic action per disk rotation :-(

The important point here is that you can't postpone making changes at the page level visible to other backends; there's no MVCC at this level. Consider for example two backends wanting to insert a new row. If they both MAP_PRIVATE the same page, they'll probably choose the same tuple slot on the page to insert into (certainly there is nothing to stop that from happening). Now you have conflicting definitions for the same CTID, not to mention probably conflicting uses of the page's physical free space; disaster ensues.

So "atomic action" really means "lock page, make changes, add WAL record to in-memory WAL buffers, unlock page" with the understanding that as soon as you unlock the page the changes you've made in it are visible to all other backends. You *can't* afford to put a WAL fsync in this sequence.

You could possibly buy back most of the lossage in this scenario if there were some efficient way for a backend to hold the low-level lock on a page just until some other backend wanted to modify the page; whereupon the previous owner would have to do what's needed to make his changes visible before releasing the lock. Given the right access patterns you don't have to fsync very often (though given the wrong access patterns you're still in deep trouble). But we don't have any such mechanism and I think the communication costs of one would be forbidding.

> [ much snipped ]
> 4) Not having shared pages get lost when the backend dies (mmap(2) uses refcounts and cleans itself up, no need for ipcs/ipcrm/ipcclean).

Actually, that is not a bug, that's a feature. One of the things that scares me about mmap is that a crashing backend is able to scribble all over live disk buffers before it finally SEGV's (think about memcpy gone wrong and similar cases). In our existing scheme there's a pretty good chance that we will be able to commit hara-kiri before any of the trashed data gets written out. In an mmap scheme, it's time to dig out your backup tapes, because there simply is no distinction between transient and permanent data --- the kernel has no way to know that you didn't mean it.

In short, I remain entirely unconvinced that mmap is of any interest to us.

			regards, tom lane
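A tiny experiment along the lines of Tom's man-page point, assuming a throwaway scratch file; it shows that a store into a MAP_PRIVATE mapping does not reach the file even after msync(2):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Scratch file containing a single known byte. */
    int fd = open("scratch", O_RDWR | O_CREAT | O_TRUNC, 0600);
    write(fd, "A", 1);

    char *priv = mmap(NULL, 1, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    priv[0] = 'B';                 /* modify the private copy          */
    msync(priv, 1, MS_SYNC);       /* "flush" it ...                   */
    munmap(priv, 1);

    char back;
    pread(fd, &back, 1, 0);
    /* Expect 'A': the private change never reached the file. */
    printf("file still contains: %c\n", back);

    close(fd);
    unlink("scratch");
    return 0;
}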