Thread: MIT benchmarks pgsql multicore (up to 48) performance
Hi,
to whom it may concern:
http://pdos.csail.mit.edu/mosbench/
They tested with 8.3.9; I wonder what results 9.0 would give.
Best regards and keep up the good work
Hakan
Dan,

(BTW, OpenSQL Conference is going to be at MIT in 2 weeks. Think anyone
from the MOSBENCH team could attend?
http://www.opensqlcamp.org/Main_Page)

> The big takeaway for -hackers, I think, is that lock manager
> performance is going to be an issue for large multicore systems, and
> the uncontended cases need to be lock-free. That includes cases where
> multiple threads are trying to acquire the same lock in compatible
> modes.

Yes; we were aware of this due to work Jignesh did at Sun on TPC-E.

> Currently even acquiring a shared heavyweight lock requires taking out
> an exclusive LWLock on the partition, and acquiring shared LWLocks
> requires acquiring a spinlock. All of this gets more expensive on
> multicores, where even acquiring spinlocks can take longer than the
> work being done in the critical section.

Certainly. The question has always been how to fix it without breaking
major features and endangering data integrity.

> Note that their implementation of the lock manager omits some features
> for simplicity, like deadlock detection, 2PC, and probably any
> semblance of portability. (These are the sort of things we're allowed
> to do in the research world! :-)

Well, nice that you did! We'd never have that much time to experiment
with non-production code as a group within the project. So now we have
a theoretical solution, parts of which we can look at implementing in
some watered-down form.

> The other major bottleneck they ran into was a kernel one: reading from
> the heap file requires a couple lseek operations, and Linux acquires a
> mutex on the inode to do that. The proper place to fix this is
> certainly in the kernel but it may be possible to work around in
> Postgres.

Or we could complain to Kernel.org. They've been fairly responsive in
the past. Too bad this didn't get posted earlier; I just got back from
LinuxCon.

So, you know someone who can speak technically to this issue? I can put
them in touch with the Linux geeks in charge of that part of the kernel
code.

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
On Mon, Oct 4, 2010 at 8:44 AM, Hakan Kocaman <hkocam@googlemail.com> wrote:
> Hi,
> to whom it may concern:
> http://pdos.csail.mit.edu/mosbench/
> They tested with 8.3.9; I wonder what results 9.0 would give.
> Best regards and keep up the good work

They mention that these tests were run on the older 8xxx-series
Opterons, which have much slower memory and HyperTransport speeds. I
wonder how much better the newer 6xxx-series Magny-Cours would have
done on it... When I tested some simple benchmarks like pgbench, I got
scalability right up to 48 processes on our 48-core Magny-Cours
machines. Still, lots of room for improvement in kernel and pgsql.

--
To understand recursion, one must first understand recursion.
I wasn't involved in this work, but I do know a bit about it. Sadly,
the work on Postgres performance was cut down to under a page, complete
with the amazing offhand mention of "rewriting PostgreSQL's lock
manager". Here are a few more details...

The benchmarks in this paper are all about stressing the kernel. The
database is entirely in memory -- it's stored on tmpfs rather than on
disk, and it fits within shared_buffers. The workload consists of index
lookups and inserts on a single table. You can fill in all the caveats
about what conclusions can and cannot be drawn from this workload.

The big takeaway for -hackers, I think, is that lock manager
performance is going to be an issue for large multicore systems, and
the uncontended cases need to be lock-free. That includes cases where
multiple threads are trying to acquire the same lock in compatible
modes.

Currently even acquiring a shared heavyweight lock requires taking out
an exclusive LWLock on the partition, and acquiring shared LWLocks
requires acquiring a spinlock. All of this gets more expensive on
multicores, where even acquiring spinlocks can take longer than the
work being done in the critical section.

Their modifications to Postgres should be available in the code that
was published last night. As I understand it, the approach is to
implement LWLocks with atomic operations on a counter that contains
both the exclusive and shared lock counts. Heavyweight locks do
something similar, but with counters for each lock mode packed into a
word.

Note that their implementation of the lock manager omits some features
for simplicity, like deadlock detection, 2PC, and probably any
semblance of portability. (These are the sort of things we're allowed
to do in the research world! :-)

The other major bottleneck they ran into was a kernel one: reading from
the heap file requires a couple lseek operations, and Linux acquires a
mutex on the inode to do that. The proper place to fix this is
certainly in the kernel but it may be possible to work around in
Postgres.

Dan

--
Dan R. K. Ports    MIT CSAIL    http://drkp.net/
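[For illustration, a minimal sketch of the atomic-counter idea Dan
describes -- hypothetical names, GCC __sync builtins assumed; the real
MOSBENCH patch packs the counts differently and also has to handle
waiters, which this omits entirely:]

#include <stdint.h>
#include <stdbool.h>

/* Shared and exclusive state packed into one 32-bit word: the low bits
 * count shared holders, the high bit marks an exclusive holder. */
#define EXCLUSIVE_BIT 0x80000000u

typedef struct { volatile uint32_t state; } atomic_lwlock;

/* Try to take the lock in shared mode; callers loop on false. */
static bool
lwlock_try_shared(atomic_lwlock *lock)
{
    uint32_t old = lock->state;

    if (old & EXCLUSIVE_BIT)    /* an exclusive holder is present */
        return false;
    /* Bump the shared count only if nobody changed the word meanwhile. */
    return __sync_bool_compare_and_swap(&lock->state, old, old + 1);
}

/* Try to take the lock exclusively; succeeds only from the idle state. */
static bool
lwlock_try_exclusive(atomic_lwlock *lock)
{
    return __sync_bool_compare_and_swap(&lock->state, 0, EXCLUSIVE_BIT);
}

static void
lwlock_release_shared(atomic_lwlock *lock)
{
    __sync_fetch_and_sub(&lock->state, 1);
}

static void
lwlock_release_exclusive(atomic_lwlock *lock)
{
    __sync_fetch_and_and(&lock->state, ~EXCLUSIVE_BIT);
}

[The point of the exercise: the uncontended shared-shared case costs a
single compare-and-swap instead of a spinlock acquire/release around
the LWLock bookkeeping.]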
On 10/04/10 20:49, Josh Berkus wrote:
>> The other major bottleneck they ran into was a kernel one: reading from
>> the heap file requires a couple lseek operations, and Linux acquires a
>> mutex on the inode to do that. The proper place to fix this is
>> certainly in the kernel but it may be possible to work around in
>> Postgres.
>
> Or we could complain to Kernel.org. They've been fairly responsive in
> the past. Too bad this didn't get posted earlier; I just got back from
> LinuxCon.
>
> So, you know someone who can speak technically to this issue? I can put
> them in touch with the Linux geeks in charge of that part of the kernel
> code.

Hmmm... lseek? As in the "lseek() then read() or write()" idiom? AFAIK
it cannot be fixed, since you're modifying the global "stream position"
variable and something has got to lock that.

OTOH, pread() / pwrite() don't have to do that.
On Wed, Oct 6, 2010 at 5:31 PM, Ivan Voras <ivoras@freebsd.org> wrote:
> On 10/04/10 20:49, Josh Berkus wrote:
>
>>> The other major bottleneck they ran into was a kernel one: reading from
>>> the heap file requires a couple lseek operations, and Linux acquires a
>>> mutex on the inode to do that. The proper place to fix this is
>>> certainly in the kernel but it may be possible to work around in
>>> Postgres.
>>
>> Or we could complain to Kernel.org. They've been fairly responsive in
>> the past. Too bad this didn't get posted earlier; I just got back from
>> LinuxCon.
>>
>> So, you know someone who can speak technically to this issue? I can put
>> them in touch with the Linux geeks in charge of that part of the kernel
>> code.
>
> Hmmm... lseek? As in the "lseek() then read() or write()" idiom? AFAIK
> it cannot be fixed, since you're modifying the global "stream position"
> variable and something has got to lock that.
>
> OTOH, pread() / pwrite() don't have to do that.

While lseek is very "cheap", it is like any other system call in that
when you multiply "cheap" by "a jillion" you end up with "notable" or
even "lots". I've personally seen notable performance improvements from
switching to pread/pwrite instead of lseek+{read,write}. For platforms
that don't implement pread or pwrite, wrapper calls are trivial to
produce. One less system call is, in this case, 50% fewer.

--
Jon
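[For concreteness, the two idioms Jon compares, as a sketch -- fd is
assumed to be an open heap file and offset a block-aligned position;
BLCKSZ is defined here only to keep the example self-contained:]

#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

/* Two syscalls: lseek updates the descriptor's file position (which is
 * where Linux takes the inode mutex), then read consumes from it. */
static ssize_t
read_block_lseek(int fd, char *buf, off_t offset)
{
    if (lseek(fd, offset, SEEK_SET) < 0)
        return -1;
    return read(fd, buf, BLCKSZ);
}

/* One syscall: pread reads at an explicit offset and never touches the
 * file position at all. */
static ssize_t
read_block_pread(int fd, char *buf, off_t offset)
{
    return pread(fd, buf, BLCKSZ, offset);
}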
On Wed, Oct 6, 2010 at 6:31 PM, Ivan Voras <ivoras@freebsd.org> wrote:
> On 10/04/10 20:49, Josh Berkus wrote:
>
>>> The other major bottleneck they ran into was a kernel one: reading from
>>> the heap file requires a couple lseek operations, and Linux acquires a
>>> mutex on the inode to do that. The proper place to fix this is
>>> certainly in the kernel but it may be possible to work around in
>>> Postgres.
>>
>> Or we could complain to Kernel.org. They've been fairly responsive in
>> the past. Too bad this didn't get posted earlier; I just got back from
>> LinuxCon.
>>
>> So, you know someone who can speak technically to this issue? I can put
>> them in touch with the Linux geeks in charge of that part of the kernel
>> code.
>
> Hmmm... lseek? As in the "lseek() then read() or write()" idiom? AFAIK
> it cannot be fixed, since you're modifying the global "stream position"
> variable and something has got to lock that.

Well, there are lock-free algorithms using CAS, no?

> OTOH, pread() / pwrite() don't have to do that.

Hey, I didn't know about those. That sounds like it might be worth
investigating, though I confess I lack a 48-core machine on which to
measure the alleged benefit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Ivan Voras <ivoras@freebsd.org> writes:
> On 10/04/10 20:49, Josh Berkus wrote:
>>> The other major bottleneck they ran into was a kernel one: reading from
>>> the heap file requires a couple lseek operations, and Linux acquires a
>>> mutex on the inode to do that.

> Hmmm... lseek? As in the "lseek() then read() or write()" idiom? AFAIK
> it cannot be fixed, since you're modifying the global "stream position"
> variable and something has got to lock that.

Um, there is no "global stream position" associated with an inode.
A file position is associated with an open file descriptor.

If Josh quoted the problem correctly, the issue is that the kernel is
locking a file's inode (which may be referenced by quite a lot of file
descriptors) in order to change the state of one file descriptor.
It sure sounds like a possible source of contention to me.

			regards, tom lane
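[To make Tom's distinction concrete -- a sketch, assuming a test file
named "data" with at least 101 bytes:]

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char a, b;
    /* Two open() calls on one file: same inode, but two independent
     * open file descriptions, each with its own file position. */
    int fd1 = open("data", O_RDONLY);
    int fd2 = open("data", O_RDONLY);

    lseek(fd1, 100, SEEK_SET);   /* moves only fd1's position */
    read(fd1, &a, 1);            /* reads byte 100 */
    read(fd2, &b, 1);            /* still reads byte 0 */

    printf("%c %c\n", a, b);
    return 0;
}

[The position lives in the per-open state, yet -- per the paper -- the
kernel serializes lseek on the shared inode, which is exactly why many
backends seeking in the same file contend.]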
* Robert Haas (robertmhaas@gmail.com) wrote:
> Hey, I didn't know about those. That sounds like it might be worth
> investigating, though I confess I lack a 48-core machine on which to
> measure the alleged benefit.

I've got a couple of 24-core systems, if it'd be sufficiently useful to
test with.

	Stephen
On Wed, Oct 6, 2010 at 9:30 PM, Stephen Frost <sfrost@snowman.net> wrote:
> * Robert Haas (robertmhaas@gmail.com) wrote:
>> Hey, I didn't know about those. That sounds like it might be worth
>> investigating, though I confess I lack a 48-core machine on which to
>> measure the alleged benefit.
>
> I've got a couple of 24-core systems, if it'd be sufficiently useful to
> test with.

It's good to be you. I don't suppose you could try to replicate the
lseek() contention?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
* Robert Haas (robertmhaas@gmail.com) wrote:
> It's good to be you.

They're HP BL465 G7s w/ 2x 12-core AMD processors and 48G of RAM.
Unfortunately, they currently only have local storage, but it seems
unlikely that would be an issue for this.

> I don't suppose you could try to replicate the lseek() contention?

I can give it a shot, but the impression I had from the paper is that
the lseek() contention wouldn't be seen without the changes to the lock
manager...? Or did I misunderstand?

	Thanks,

		Stephen
On 7 October 2010 03:25, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ivan Voras <ivoras@freebsd.org> writes:
>> On 10/04/10 20:49, Josh Berkus wrote:
>>>> The other major bottleneck they ran into was a kernel one: reading from
>>>> the heap file requires a couple lseek operations, and Linux acquires a
>>>> mutex on the inode to do that.
>
>> Hmmm... lseek? As in the "lseek() then read() or write()" idiom? AFAIK
>> it cannot be fixed, since you're modifying the global "stream position"
>> variable and something has got to lock that.
>
> Um, there is no "global stream position" associated with an inode.
> A file position is associated with an open file descriptor.

You're right, of course; I was pattern-matching late last night on the
"lseek()" and "locking problems" keywords and ignored "inode".

> If Josh quoted the problem correctly, the issue is that the kernel is
> locking a file's inode (which may be referenced by quite a lot of file
> descriptors) in order to change the state of one file descriptor.
> It sure sounds like a possible source of contention to me.

Though it does depend on the details of how pg uses it. Forked
processes share their parents' file descriptors.
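[Ivan's caveat in sketch form: descriptors that were open at fork()
time share one open file description, and therefore one file position.
Note, though, that each Postgres backend opens its heap files itself,
after the fork, so this sharing doesn't apply to them directly:]

#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    char c;
    int fd = open("data", O_RDONLY);   /* hypothetical test file */

    if (fork() == 0)
    {
        read(fd, &c, 1);               /* child reads byte 0 ... */
        _exit(0);
    }
    wait(NULL);
    read(fd, &c, 1);                   /* ... parent then reads byte 1,
                                          because the position is shared */
    return 0;
}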
On Wed, Oct 6, 2010 at 10:07 PM, Stephen Frost <sfrost@snowman.net> wrote:
> * Robert Haas (robertmhaas@gmail.com) wrote:
>> It's good to be you.
>
> They're HP BL465 G7s w/ 2x 12-core AMD processors and 48G of RAM.
> Unfortunately, they currently only have local storage, but it seems
> unlikely that would be an issue for this.
>
>> I don't suppose you could try to replicate the lseek() contention?
>
> I can give it a shot, but the impression I had from the paper is that
> the lseek() contention wouldn't be seen without the changes to the lock
> manager...? Or did I misunderstand?

<rereads appropriate section of paper>

Looks like the lock manager problems hit at 28 cores, and the lseek
problems at 36 cores. So your system might not even be big enough to
manifest either problem. It's unclear to me whether a 48-core system
would be able to see the lseek issues without improvements to the lock
manager, but perhaps it would be possible by, say, increasing the
number of lock partitions by 8x. It would be nice to segregate these
issues though, because using pread/pwrite is probably a lot less work
than rewriting our lock manager.

Do you have tools to measure the lseek overhead? If so, we could
prepare a patch to use pread()/pwrite() and just see whether that
reduced the overhead, without worrying so much about whether it was
actually a major bottleneck.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
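[One quick way to measure, assuming a Linux test box: attach strace to
a busy backend with "strace -c -p <backend_pid>" while the workload
runs; the -c summary printed on exit gives per-syscall call counts and
times, so the lseek volume is visible directly. strace's own overhead
distorts the absolute times, but the counts are reliable.]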
On 10/07/10 02:39, Robert Haas wrote:
> On Wed, Oct 6, 2010 at 6:31 PM, Ivan Voras <ivoras@freebsd.org> wrote:
>> On 10/04/10 20:49, Josh Berkus wrote:
>>
>>>> The other major bottleneck they ran into was a kernel one: reading from
>>>> the heap file requires a couple lseek operations, and Linux acquires a
>>>> mutex on the inode to do that. The proper place to fix this is
>>>> certainly in the kernel but it may be possible to work around in
>>>> Postgres.
>>>
>>> Or we could complain to Kernel.org. They've been fairly responsive in
>>> the past. Too bad this didn't get posted earlier; I just got back from
>>> LinuxCon.
>>>
>>> So, you know someone who can speak technically to this issue? I can put
>>> them in touch with the Linux geeks in charge of that part of the kernel
>>> code.
>>
>> Hmmm... lseek? As in the "lseek() then read() or write()" idiom? AFAIK
>> it cannot be fixed, since you're modifying the global "stream position"
>> variable and something has got to lock that.
>
> Well, there are lock-free algorithms using CAS, no?

Nothing is really "lock-free" -- in this case the algorithms simply
push the locking down to atomic operations on the CPU (and the memory
bus). Semantically, *something* has to lock the memory region for
however brief a period of time and then propagate that update to other
CPUs' caches (i.e. invalidate them).

>> OTOH, pread() / pwrite() don't have to do that.
>
> Hey, I didn't know about those. That sounds like it might be worth
> investigating, though I confess I lack a 48-core machine on which to
> measure the alleged benefit.

As Jon said, it will in any case cut the number of these syscalls in
half, and they can be wrapped in a C macro on platforms which don't
implement them; a sketch follows below.

http://man.freebsd.org/pread

(And just in case it's needed: pread() is a special case of preadv().)
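[A sketch of the wrapper Ivan mentions, for platforms without pread();
HAVE_PREAD stands in for a hypothetical configure-time test. Note the
fallback reintroduces both syscalls and the shared file-position
update, so it's a portability shim rather than a performance fix:]

#ifndef HAVE_PREAD
#include <unistd.h>

/* Emulate pread() with lseek()+read(); unlike the real thing, this
 * does move the descriptor's file position. */
static ssize_t
fallback_pread(int fd, void *buf, size_t nbytes, off_t offset)
{
    if (lseek(fd, offset, SEEK_SET) < 0)
        return -1;
    return read(fd, buf, nbytes);
}

#define pread(fd, buf, nbytes, offset) \
    fallback_pread((fd), (buf), (nbytes), (offset))
#endif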
Robert Haas <robertmhaas@gmail.com> wrote:

> perhaps it would be possible by, say, increasing the number of
> lock partitions by 8x. It would be nice to segregate these issues
> though, because using pread/pwrite is probably a lot less work
> than rewriting our lock manager.

You mean easier than changing this 4 to a 7?:

#define LOG2_NUM_LOCK_PARTITIONS  4

Or am I missing something?

-Kevin
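[For the arithmetic: the partition count is derived from that constant
as a power of two, so bumping the 4 to a 7 is exactly the 8x increase
under discussion:]

#define LOG2_NUM_LOCK_PARTITIONS  4
#define NUM_LOCK_PARTITIONS  (1 << LOG2_NUM_LOCK_PARTITIONS)

/* 1 << 4 = 16 partitions today; with 7 it becomes 1 << 7 = 128. */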
* Kevin Grittner (Kevin.Grittner@wicourts.gov) wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
>> perhaps it would be possible by, say, increasing the number of
>> lock partitions by 8x. It would be nice to segregate these issues
>> though, because using pread/pwrite is probably a lot less work
>> than rewriting our lock manager.
>
> You mean easier than changing this 4 to a 7?:
>
> #define LOG2_NUM_LOCK_PARTITIONS  4
>
> Or am I missing something?

I'm pretty sure we were talking about the change described in the paper
of moving to a system which uses atomic changes instead of spinlocks
for certain locking situations. If that's all the MIT folks did, they
certainly made it sound like a lot more. :)

	Stephen
Stephen Frost <sfrost@snowman.net> wrote:
> Kevin Grittner (Kevin.Grittner@wicourts.gov) wrote:
>> Robert Haas <robertmhaas@gmail.com> wrote:
>>> perhaps it would be possible by, say, increasing the number of
>>> lock partitions by 8x.

>> changing this 4 to a 7?:
>>
>> #define LOG2_NUM_LOCK_PARTITIONS  4

> I'm pretty sure we were talking about the change described in the
> paper of moving to a system which uses atomic changes instead of
> spinlocks for certain locking situations.

Well, they also mentioned increasing the number of lock partitions to
reduce contention, and that seemed to be what Robert was talking about
in the quoted section. Of course, that's not the *only* thing they did;
it's just the point which seemed to be under discussion just there.

-Kevin
On Thu, Oct 7, 2010 at 1:21 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
>
>> perhaps it would be possible by, say, increasing the number of
>> lock partitions by 8x. It would be nice to segregate these issues
>> though, because using pread/pwrite is probably a lot less work
>> than rewriting our lock manager.
>
> You mean easier than changing this 4 to a 7?:
>
> #define LOG2_NUM_LOCK_PARTITIONS  4
>
> Or am I missing something?

Right. They did something more complicated (and, I think, better) than
that, but that change by itself might reduce the lock contention enough
to make the lseek() issue visible.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company