Thread: MIT benchmarks pgsql multicore (up to 48) performance

MIT benchmarks pgsql multicore (up to 48) performance

From
Hakan Kocaman
Date:
Hi,

To whom it may concern:
http://pdos.csail.mit.edu/mosbench/

They tested with 8.3.9; I wonder what results 9.0 would give.

Best regards and keep up the good work

Hakan

Re: [HACKERS] MIT benchmarks pgsql multicore (up to 48) performance

From
Josh Berkus
Date:
Dan,

(btw, OpenSQL Conference is going to be at MIT in 2 weeks.  Think
anyone from the MOSBENCH team could attend?
http://www.opensqlcamp.org/Main_Page)

> The big takeaway for -hackers, I think, is that lock manager
> performance is going to be an issue for large multicore systems, and
> the uncontended cases need to be lock-free. That includes cases where
> multiple threads are trying to acquire the same lock in compatible
> modes.

Yes; we were aware of this due to work Jignesh did at Sun on TPC-E.

> Currently even acquiring a shared heavyweight lock requires taking out
> an exclusive LWLock on the partition, and acquiring shared LWLocks
> requires acquiring a spinlock. All of this gets more expensive on
> multicores, where even acquiring spinlocks can take longer than the
> work being done in the critical section.

Certainly, the question has always been how to fix it without breaking
major features and endangering data integrity.

> Note that their implementation of the lock manager omits some features
> for simplicity, like deadlock detection, 2PC, and probably any
> semblance of portability. (These are the sort of things we're allowed
> to do in the research world! :-)

Well, it's nice that you did!  We'd never have that much time to
experiment with non-production stuff as a group within the project.  So
now we have a theoretical solution, parts of which we can look at
implementing in some watered-down form.

> The other major bottleneck they ran into was a kernel one: reading from
> the heap file requires a couple lseek operations, and Linux acquires a
> mutex on the inode to do that. The proper place to fix this is
> certainly in the kernel but it may be possible to work around in
> Postgres.

Or we could complain to Kernel.org.  They've been fairly responsive in
the past.  Too bad this didn't get posted earlier; I just got back from
LinuxCon.

So you know someone who can speak technically to this issue? I can put
them in touch with the Linux geeks in charge of that part of the kernel
code.

--
                                  -- Josh Berkus
                                     PostgreSQL Experts Inc.
                                     http://www.pgexperts.com

Re: MIT benchmarks pgsql multicore (up to 48) performance

From
Scott Marlowe
Date:
On Mon, Oct 4, 2010 at 8:44 AM, Hakan Kocaman <hkocam@googlemail.com> wrote:
> Hi,
> To whom it may concern:
> http://pdos.csail.mit.edu/mosbench/
> They tested with 8.3.9; I wonder what results 9.0 would give.
> Best regards and keep up the good work

They mention that these tests were run on the older 8xxx-series
Opterons, which have much slower memory and HyperTransport (HT) speeds.
I wonder how much better the newer 6xxx-series Magny-Cours would have
done on it...  When I tested some simple benchmarks like pgbench, I
got scalability right up to 48 processes on our 48-core Magny-Cours
machines.

Still, lots of room for improvement in kernel and pgsql.

--
To understand recursion, one must first understand recursion.

Re: [HACKERS] MIT benchmarks pgsql multicore (up to 48) performance

From
Dan Ports
Date:
I wasn't involved in this work but I do know a bit about it. Sadly, the
work on Postgres performance was cut down to under a page, complete
with the amazing offhand mention of "rewriting PostgreSQL's lock
manager". Here are a few more details...

The benchmarks in this paper are all about stressing the kernel. The
database is entirely in memory -- it's stored on tmpfs rather than on
disk, and it fits within shared_buffers. The workload consists of index
lookups and inserts on a single table. You can fill in all the caveats
about what conclusions can and cannot be drawn from this workload.

The big takeaway for -hackers, I think, is that lock manager
performance is going to be an issue for large multicore systems, and
the uncontended cases need to be lock-free. That includes cases where
multiple threads are trying to acquire the same lock in compatible
modes.

Currently even acquiring a shared heavyweight lock requires taking out
an exclusive LWLock on the partition, and acquiring shared LWLocks
requires acquiring a spinlock. All of this gets more expensive on
multicores, where even acquiring spinlocks can take longer than the
work being done in the critical section.

Their modifications to Postgres should be available in the code that
was published last night. As I understand it, the approach is to
implement LWLocks with atomic operations on a counter that contains
both the exclusive and shared lock count. Heavyweight locks do
something similar but with counters for each lock mode packed into a
word.
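
Very roughly, I imagine the shared-acquire fast path looks something
like this (my own sketch with made-up names, using GCC's __sync
builtins; this is not their actual code):

    #include <stdbool.h>
    #include <stdint.h>

    /* one word holds an exclusive bit plus a shared-holder count */
    #define EXCL_BIT ((uint32_t) 1 << 31)

    typedef struct { volatile uint32_t state; } cas_lwlock;

    /* Shared acquire: bump the count, but only while no exclusive
     * holder is present.  The uncontended case is a single
     * compare-and-swap, with no spinlock taken at all. */
    static bool
    lw_try_acquire_shared(cas_lwlock *lk)
    {
        for (;;)
        {
            uint32_t old = lk->state;

            if (old & EXCL_BIT)
                return false;   /* exclusive holder; caller must wait */
            if (__sync_bool_compare_and_swap(&lk->state, old, old + 1))
                return true;
            /* lost a race with another backend; reload and retry */
        }
    }

    /* Exclusive acquire: only succeeds when there are no holders. */
    static bool
    lw_try_acquire_exclusive(cas_lwlock *lk)
    {
        return __sync_bool_compare_and_swap(&lk->state, 0, EXCL_BIT);
    }

    static void
    lw_release_shared(cas_lwlock *lk)
    {
        __sync_fetch_and_sub(&lk->state, 1);
    }

    static void
    lw_release_exclusive(cas_lwlock *lk)
    {
        __sync_fetch_and_and(&lk->state, ~EXCL_BIT);
    }

The point being that the uncontended shared acquire above never touches
a spinlock; presumably the heavyweight-lock version does the same sort
of thing with a small count per lock mode packed into the word.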

Note that their implementation of the lock manager omits some features
for simplicity, like deadlock detection, 2PC, and probably any
semblance of portability. (These are the sort of things we're allowed
to do in the research world! :-)

The other major bottleneck they ran into was a kernel one: reading from
the heap file requires a couple lseek operations, and Linux acquires a
mutex on the inode to do that. The proper place to fix this is
certainly in the kernel but it may be possible to work around in
Postgres.

Dan

--
Dan R. K. Ports              MIT CSAIL                http://drkp.net/

Re: [HACKERS] MIT benchmarks pgsql multicore (up to 48) performance

From
Ivan Voras
Date:
On 10/04/10 20:49, Josh Berkus wrote:

>> The other major bottleneck they ran into was a kernel one: reading from
>> the heap file requires a couple lseek operations, and Linux acquires a
>> mutex on the inode to do that. The proper place to fix this is
>> certainly in the kernel but it may be possible to work around in
>> Postgres.
>
> Or we could complain to Kernel.org.  They've been fairly responsive in
> the past.  Too bad this didn't get posted earlier; I just got back from
> LinuxCon.
>
> So you know someone who can speak technically to this issue? I can put
> them in touch with the Linux geeks in charge of that part of the kernel
> code.

Hmmm... lseek? As in the "lseek() then read() or write()" idiom? AFAIK it
cannot be fixed, since you're modifying the global "stream position"
variable and something has got to lock that.

OTOH, pread() / pwrite() don't have to do that.
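
Just to illustrate the difference (the names and the 8K block size here
are mine, purely for illustration):

    #include <sys/types.h>
    #include <unistd.h>     /* lseek, read, pread */

    #define BLCKSZ 8192     /* an 8K block, PostgreSQL-style */

    /* two syscalls, and the lseek() has to update the descriptor's
       file position under whatever lock the kernel uses for that */
    static ssize_t
    read_block_seek(int fd, void *buf, off_t blocknum)
    {
        if (lseek(fd, blocknum * BLCKSZ, SEEK_SET) < 0)
            return -1;
        return read(fd, buf, BLCKSZ);
    }

    /* one syscall, and the descriptor's file position is never touched */
    static ssize_t
    read_block_pread(int fd, void *buf, off_t blocknum)
    {
        return pread(fd, buf, BLCKSZ, blocknum * BLCKSZ);
    }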

Re: [HACKERS] MIT benchmarks pgsql multicore (up to 48) performance

From
Jon Nelson
Date:
On Wed, Oct 6, 2010 at 5:31 PM, Ivan Voras <ivoras@freebsd.org> wrote:
> On 10/04/10 20:49, Josh Berkus wrote:
>
>>> The other major bottleneck they ran into was a kernel one: reading from
>>> the heap file requires a couple lseek operations, and Linux acquires a
>>> mutex on the inode to do that. The proper place to fix this is
>>> certainly in the kernel but it may be possible to work around in
>>> Postgres.
>>
>> Or we could complain to Kernel.org.  They've been fairly responsive in
>> the past.  Too bad this didn't get posted earlier; I just got back from
>> LinuxCon.
>>
>> So you know someone who can speak technically to this issue? I can put
>> them in touch with the Linux geeks in charge of that part of the kernel
>> code.
>
> Hmmm... lseek? As in the "lseek() then read() or write()" idiom? AFAIK it
> cannot be fixed, since you're modifying the global "stream position"
> variable and something has got to lock that.
>
> OTOH, pread() / pwrite() don't have to do that.

While lseek is very "cheap", it is like any other system call in that
when you multiply "cheap" by "a jillion" you end up with "notable" or
even "lots".  I've personally seen notable performance improvements
from switching to pread/pwrite instead of lseek+{read,write}.  For
platforms that don't implement pread or pwrite, wrapper calls are
trivial to produce.  One less system call is, in this case, 50% fewer.
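
Something along these lines would do as a fallback (an untested sketch;
the HAVE_PREAD guard is just an autoconf-style placeholder, and note
that the emulation does move the file position, so it's only safe if
nothing else relies on that descriptor's offset):

    #include <sys/types.h>
    #include <unistd.h>

    #ifndef HAVE_PREAD
    /* emulate pread() with lseek()+read() on platforms that lack it */
    static ssize_t
    my_pread(int fd, void *buf, size_t nbytes, off_t offset)
    {
        if (lseek(fd, offset, SEEK_SET) < 0)
            return -1;
        return read(fd, buf, nbytes);
    }
    #define pread(fd, buf, nbytes, offset) my_pread(fd, buf, nbytes, offset)
    #endif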


--
Jon

Re: [HACKERS] MIT benchmarks pgsql multicore (up to 48) performance

From
Robert Haas
Date:
On Wed, Oct 6, 2010 at 6:31 PM, Ivan Voras <ivoras@freebsd.org> wrote:
> On 10/04/10 20:49, Josh Berkus wrote:
>
>>> The other major bottleneck they ran into was a kernel one: reading from
>>> the heap file requires a couple lseek operations, and Linux acquires a
>>> mutex on the inode to do that. The proper place to fix this is
>>> certainly in the kernel but it may be possible to work around in
>>> Postgres.
>>
>> Or we could complain to Kernel.org.  They've been fairly responsive in
>> the past.  Too bad this didn't get posted earlier; I just got back from
>> LinuxCon.
>>
>> So you know someone who can speak technically to this issue? I can put
>> them in touch with the Linux geeks in charge of that part of the kernel
>> code.
>
> Hmmm... lseek? As in the "lseek() then read() or write()" idiom? AFAIK it
> cannot be fixed, since you're modifying the global "stream position"
> variable and something has got to lock that.

Well, there are lock free algorithms using CAS, no?

> OTOH, pread() / pwrite() don't have to do that.

Hey, I didn't know about those.  That sounds like it might be worth
investigating, though I confess I lack a 48-core machine on which to
measure the alleged benefit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Re: [HACKERS] MIT benchmarks pgsql multicore (up to 48) performance

From
Tom Lane
Date:
Ivan Voras <ivoras@freebsd.org> writes:
> On 10/04/10 20:49, Josh Berkus wrote:
>>> The other major bottleneck they ran into was a kernel one: reading from
>>> the heap file requires a couple lseek operations, and Linux acquires a
>>> mutex on the inode to do that.

> Hmmm... lseek? As in the "lseek() then read() or write()" idiom? AFAIK it
> cannot be fixed, since you're modifying the global "stream position"
> variable and something has got to lock that.

Um, there is no "global stream position" associated with an inode.
A file position is associated with an open-file descriptor.

If Josh quoted the problem correctly, the issue is that the kernel is
locking a file's inode (which may be referenced by quite a lot of file
descriptors) in order to change the state of one file descriptor.
It sure sounds like a possible source of contention to me.

            regards, tom lane

Re: [HACKERS] MIT benchmarks pgsql multicore (up to 48) performance

From
Stephen Frost
Date:
* Robert Haas (robertmhaas@gmail.com) wrote:
> Hey, I didn't know about those.  That sounds like it might be worth
> investigating, though I confess I lack a 48-core machine on which to
> measure the alleged benefit.

I've got a couple 24-core systems, if it'd be sufficiently useful to
test with..

    Stephen

Re: [HACKERS] MIT benchmarks pgsql multicore (up to 48) performance

From
Robert Haas
Date:
On Wed, Oct 6, 2010 at 9:30 PM, Stephen Frost <sfrost@snowman.net> wrote:
> * Robert Haas (robertmhaas@gmail.com) wrote:
>> Hey, I didn't know about those.  That sounds like it might be worth
>> investigating, though I confess I lack a 48-core machine on which to
>> measure the alleged benefit.
>
> I've got a couple 24-core systems, if it'd be sufficiently useful to
> test with..

It's good to be you.

I don't suppose you could try to replicate the lseek() contention?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Re: [HACKERS] MIT benchmarks pgsql multicore (up to 48) performance

From
Stephen Frost
Date:
* Robert Haas (robertmhaas@gmail.com) wrote:
> It's good to be you.

They're HP BL465 G7's w/ 2x 12-core AMD processors and 48G of RAM.
Unfortunately, they currently only have local storage, but it seems
unlikely that would be an issue for this.

> I don't suppose you could try to replicate the lseek() contention?

I can give it a shot, but the impression I had from the paper is that
the lseek() contention wouldn't be seen without the changes to the lock
manager...?  Or did I misunderstand?

    Thanks,

        Stephen

Re: [HACKERS] MIT benchmarks pgsql multicore (up to 48) performance

From
Ivan Voras
Date:
On 7 October 2010 03:25, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ivan Voras <ivoras@freebsd.org> writes:
>> On 10/04/10 20:49, Josh Berkus wrote:
>>>> The other major bottleneck they ran into was a kernel one: reading from
>>>> the heap file requires a couple lseek operations, and Linux acquires a
>>>> mutex on the inode to do that.
>
>> Hmmm... lseek? As in the "lseek() then read() or write()" idiom? AFAIK it
>> cannot be fixed, since you're modifying the global "stream position"
>> variable and something has got to lock that.
>
> Um, there is no "global stream position" associated with an inode.
> A file position is associated with an open-file descriptor.

You're right, of course; I was pattern-matching late last night on the
"lseek()" and "locking problems" keywords and ignored "inode".

> If Josh quoted the problem correctly, the issue is that the kernel is
> locking a file's inode (which may be referenced by quite a lot of file
> descriptors) in order to change the state of one file descriptor.
> It sure sounds like a possible source of contention to me.

Though it does depend on the details of how pg uses it. Forked
processes share their parents' file descriptors.

Re: [HACKERS] MIT benchmarks pgsql multicore (up to 48) performance

From
Robert Haas
Date:
On Wed, Oct 6, 2010 at 10:07 PM, Stephen Frost <sfrost@snowman.net> wrote:
> * Robert Haas (robertmhaas@gmail.com) wrote:
>> It's good to be you.
>
> They're HP BL465 G7's w/ 2x 12-core AMD processors and 48G of RAM.
> Unfortunately, they currently only have local storage, but it seems
> unlikely that would be an issue for this.
>
>> I don't suppose you could try to replicate the lseek() contention?
>
> I can give it a shot, but the impression I had from the paper is that
> the lseek() contention wouldn't be seen without the changes to the lock
> manager...?  Or did I misunderstand?

<rereads appropriate section of paper>

Looks like the lock manager problems hit at 28 cores, and the lseek
problems at 36 cores.  So your system might not even be big enough to
manifest either problem.

It's unclear to me whether a 48-core system would be able to see the
lseek issues without improvements to the lock manager, but perhaps it
would be possible by, say, increasing the number of lock partitions by
8x.  It would be nice to segregate these issues though, because using
pread/pwrite is probably a lot less work than rewriting our lock
manager.  Do you have tools to measure the lseek overhead?  If so, we
could prepare a patch to use pread()/pwrite() and just see whether
that reduced the overhead, without worrying so much about whether it
was actually a major bottleneck.
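
(For a rough first cut, running strace -c -e trace=lseek -p <pid>
against a busy backend for a minute or so and then hitting Ctrl-C
should at least tell us how many lseek calls we're issuing and how
much time they account for.)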

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Re: [HACKERS] MIT benchmarks pgsql multicore (up to 48) performance

From
Ivan Voras
Date:
On 10/07/10 02:39, Robert Haas wrote:
> On Wed, Oct 6, 2010 at 6:31 PM, Ivan Voras<ivoras@freebsd.org>  wrote:
>> On 10/04/10 20:49, Josh Berkus wrote:
>>
>>>> The other major bottleneck they ran into was a kernel one: reading from
>>>> the heap file requires a couple lseek operations, and Linux acquires a
>>>> mutex on the inode to do that. The proper place to fix this is
>>>> certainly in the kernel but it may be possible to work around in
>>>> Postgres.
>>>
>>> Or we could complain to Kernel.org.  They've been fairly responsive in
>>> the past.  Too bad this didn't get posted earlier; I just got back from
>>> LinuxCon.
>>>
>>> So you know someone who can speak technically to this issue? I can put
>>> them in touch with the Linux geeks in charge of that part of the kernel
>>> code.
>>
>> Hmmm... lseek? As in the "lseek() then read() or write()" idiom? AFAIK it
>> cannot be fixed, since you're modifying the global "stream position"
>> variable and something has got to lock that.
>
> Well, there are lock free algorithms using CAS, no?

Nothing is really "lock free" - in this case the algorithms simply push
the locking down to atomic operations on the CPU (and the memory bus).
Semantically, *something* has to lock the memory region for however
brief a period of time and then propagate that update to other CPUs'
caches (i.e. invalidate them).
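
For example, even the simplest "lock-free" counter still makes the CPU
take exclusive ownership of the cache line for the duration of the
operation (a sketch with GCC builtins; the names are mine):

    #include <stdint.h>

    static volatile uint32_t nshared;   /* some shared counter */

    static void
    counter_inc(void)
    {
        for (;;)
        {
            uint32_t old = nshared;

            /* no mutex anywhere, but the CAS itself still has to win
               ownership of the cache line before it can succeed */
            if (__sync_bool_compare_and_swap(&nshared, old, old + 1))
                break;
            /* another CPU got there first; retry */
        }
    }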

>> OTOH, pread() / pwrite() don't have to do that.
>
> Hey, I didn't know about those.  That sounds like it might be worth
> investigating, though I confess I lack a 48-core machine on which to
> measure the alleged benefit.

As Jon said, it will in any case reduce the number of these syscalls by
half, and they can be wrapped by a C macro for the platforms which don't
implement them.

http://man.freebsd.org/pread

(and just in case it's needed: pread() is a special case of preadv()).

Re: [HACKERS] MIT benchmarks pgsql multicore (up to 48) performance

From
"Kevin Grittner"
Date:
Robert Haas <robertmhaas@gmail.com> wrote:

> perhaps it would be possible by, say, increasing the number of
> lock partitions by 8x.  It would be nice to segregate these issues
> though, because using pread/pwrite is probably a lot less work
> than rewriting our lock manager.

You mean easier than changing this 4 to a 7?:

#define LOG2_NUM_LOCK_PARTITIONS  4

Or am I missing something?
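
(If I'm reading lwlock.h correctly, the partition count is just derived
from that, so the one-character change is exactly your 8x:

    #define LOG2_NUM_LOCK_PARTITIONS  4
    #define NUM_LOCK_PARTITIONS  (1 << LOG2_NUM_LOCK_PARTITIONS)

i.e. 16 partitions today, 128 with a 7.)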

-Kevin

Re: [HACKERS] MIT benchmarks pgsql multicore (up to 48) performance

From
Stephen Frost
Date:
* Kevin Grittner (Kevin.Grittner@wicourts.gov) wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
> > perhaps it would be possible by, say, increasing the number of
> > lock partitions by 8x.  It would be nice to segregate these issues
> > though, because using pread/pwrite is probably a lot less work
> > than rewriting our lock manager.
>
> You mean easier than changing this 4 to a 7?:
>
> #define LOG2_NUM_LOCK_PARTITIONS  4
>
> Or am I missing something?

I'm pretty sure we were talking about the change described in the paper
of moving to a system which uses atomic operations instead of spinlocks
for certain locking situations..

If that's all the MIT folks did, they certainly made it sound like a lot
more. :)

    Stephen

Attachment

Re: [HACKERS] MIT benchmarks pgsql multicore (up to 48) performance

From
"Kevin Grittner"
Date:
Stephen Frost <sfrost@snowman.net> wrote:
> Kevin Grittner (Kevin.Grittner@wicourts.gov) wrote:
>> Robert Haas <robertmhaas@gmail.com> wrote:

>>> perhaps it would be possible by, say, increasing the number of
>>> lock partitions by 8x.

>> changing this 4 to a 7?:
>>
>> #define LOG2_NUM_LOCK_PARTITIONS  4

> I'm pretty sure we were talking about the change described in the
> paper of moving to a system which uses atomic changes instead of
> spinlocks for certain locking situations..

Well, they also mentioned increasing the number of lock partitions
to reduce contention, and that seemed to be what Robert was talking
about in the quoted section.

Of course, that's not the *only* thing they did; it's just the point
which seemed to be under discussion just there.

-Kevin

Re: [HACKERS] MIT benchmarks pgsql multicore (up to 48) performance

From
Robert Haas
Date:
On Thu, Oct 7, 2010 at 1:21 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
>
>> perhaps it would be possible by, say, increasing the number of
>> lock partitions by 8x.  It would be nice to segregate these issues
>> though, because using pread/pwrite is probably a lot less work
>> than rewriting our lock manager.
>
> You mean easier than changing this 4 to a 7?:
>
> #define LOG2_NUM_LOCK_PARTITIONS  4
>
> Or am I missing something?

Right.  They did something more complicated (and, I think, better)
than that, but that change by itself might ameliorate the lock
contention enough to expose the lseek() issue.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company