Thread: Scaling PostgreSQL on multicore Power8
Hello hackers,

Recently we were given access to an IBM test server, a 9119-MHE with 8 CPUs * 8 cores * 8 threads. We decided to take advantage of this and look for read-scalability bottlenecks (pgbench -S).

All the details are here:
http://www.postgrespro.ru/blog/pgsql/2015/08/30/p8scaling

Performance of 9.4 stopped growing after 100 clients, and 9.5/9.6 stopped after 150 (on 4 NUMA nodes). Profiling with perf showed that ProcArrayLock in GetSnapshotData is the inhibitor, but replacing GetSnapshotData with a stub did not significantly improve scalability. Looking for the bottleneck with gdb, we found another place: we noticed s_lock in PinBuffer and UnpinBuffer. As a test we rewrote PinBuffer and UnpinBuffer with atomic operations, and we liked the result: the performance degradation almost completely disappeared, and scaling continued up to 400 clients (4 NUMA nodes with 256 "CPUs").

To scale Postgres on large NUMA machines, bufmgr must be ported to atomic operations. During our tests we found no errors in our patch, but most likely that is too optimistic, and the patch is intended for testing only.

Who has any thoughts?

--
YUriy Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
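In outline, the change being described replaces the buffer-header spinlock on the pin/unpin hot path with atomic operations. A minimal sketch of the idea, against a simplified stand-in for BufferDesc (the Toy* names and constants are illustrative only, not the attached patch; the real functions also deal with flags, BM_PIN_COUNT_WAITER, per-backend private refcounts, and so on):

#include "postgres.h"
#include "port/atomics.h"

#define TOY_MAX_USAGE_COUNT 5	/* stands in for BM_MAX_USAGE_COUNT */

/* Simplified stand-in for BufferDesc: only the fields relevant here. */
typedef struct ToyBufferDesc
{
	pg_atomic_uint32 refcount;		/* number of pins across backends */
	pg_atomic_uint32 usage_count;	/* clock-sweep usage counter */
} ToyBufferDesc;

/*
 * Lock-free pin: the hot path is a single fetch-and-add instead of
 * LockBufHdr()/UnlockBufHdr(), so concurrent pinners of the same buffer
 * no longer serialize on s_lock.
 */
static void
ToyPinBuffer(ToyBufferDesc *buf)
{
	pg_atomic_fetch_add_u32(&buf->refcount, 1);

	/* Saturate usage_count without taking a lock. */
	for (;;)
	{
		uint32		usage = pg_atomic_read_u32(&buf->usage_count);

		if (usage >= TOY_MAX_USAGE_COUNT)
			break;
		if (pg_atomic_compare_exchange_u32(&buf->usage_count,
										   &usage, usage + 1))
			break;
		/* CAS failed: another backend changed it; retry with the new value. */
	}
}

static void
ToyUnpinBuffer(ToyBufferDesc *buf)
{
	pg_atomic_fetch_sub_u32(&buf->refcount, 1);
}

The catch, discussed in the replies below, is that several bufmgr code paths rely on the spinlock to keep flags, usage_count and refcount mutually consistent, which separate atomics like these do not provide.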
On 2015-08-31 13:54:57 +0300, YUriy Zhuravlev wrote:
> We noticed s_lock in PinBuffer and UnpinBuffer. As a test we rewrote
> PinBuffer and UnpinBuffer with atomic operations, and we liked the
> result: the performance degradation almost completely disappeared, and
> scaling continued up to 400 clients (4 NUMA nodes with 256 "CPUs").
>
> To scale Postgres on large NUMA machines, bufmgr must be ported to
> atomic operations. During our tests we found no errors in our patch,
> but most likely that is too optimistic, and the patch is intended for
> testing only.

I agree that this is necessary, and it matches what I've profiled. Unfortunately I don't think the patch can be quite as simple as presented here - we rely on the exclusion provided by the spinlock in a bunch of places for more than the manipulation of individual values, and those places are currently only correct if no other manipulation can be going on concurrently. But it's definitely doable.

I have an initial patch doing this, but for me it seemed to be necessary to merge flags, usage count and refcount into a single value - otherwise the consistency is hard to maintain, because you can't do a CAS over all the values at once.

> diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
> index 3ae2848..5fdaca7 100644
> --- a/src/backend/storage/buffer/buf_init.c
> +++ b/src/backend/storage/buffer/buf_init.c
> @@ -95,9 +95,9 @@ InitBufferPool(void)
> 		BufferDesc *buf = GetBufferDescriptor(i);
>
> 		CLEAR_BUFFERTAG(buf->tag);
> -		buf->flags = 0;
> -		buf->usage_count = 0;
> -		buf->refcount = 0;
> +		buf->flags.value = 0;
> +		buf->usage_count.value = 0;
> +		buf->refcount.value = 0;
> 		buf->wait_backend_pid = 0;

That's definitely not correct: you should initialize the atomics using pg_atomic_init_u32() and write to them using pg_atomic_write_u32() - not access them directly. This breaks the fallback paths.

Greetings,

Andres Freund
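To make both of these points concrete, here is a hedged sketch (the Toy* names, bit widths and packed layout are illustrative assumptions, not Andres's actual patch): initialization goes through pg_atomic_init_u32(), and folding refcount, usage count and flags into one 32-bit word lets a single compare-and-swap update them consistently.

#include "postgres.h"
#include "port/atomics.h"

/*
 * One possible packing (illustrative): low 18 bits refcount,
 * next 4 bits usage_count, remaining high bits flags.
 */
#define TOY_REFCOUNT_ONE		1
#define TOY_USAGECOUNT_SHIFT	18
#define TOY_USAGECOUNT_ONE		(1U << TOY_USAGECOUNT_SHIFT)
#define TOY_USAGECOUNT_MASK		(((1U << 4) - 1) << TOY_USAGECOUNT_SHIFT)
#define TOY_MAX_USAGECOUNT		(5U << TOY_USAGECOUNT_SHIFT)

typedef struct ToyPackedBufferDesc
{
	pg_atomic_uint32 state;		/* flags | usage_count | refcount */
} ToyPackedBufferDesc;

/* Correct initialization: go through the atomics API, never ->value. */
static void
ToyInitBufferDesc(ToyPackedBufferDesc *buf)
{
	pg_atomic_init_u32(&buf->state, 0);
}

/*
 * Pin via a CAS loop: refcount and usage_count change together, so any
 * reader of the state word sees a mutually consistent pair, which the
 * three-separate-atomics version cannot guarantee.
 */
static void
ToyPinBufferPacked(ToyPackedBufferDesc *buf)
{
	uint32		old_state = pg_atomic_read_u32(&buf->state);

	for (;;)
	{
		uint32		new_state = old_state + TOY_REFCOUNT_ONE;

		if ((new_state & TOY_USAGECOUNT_MASK) < TOY_MAX_USAGECOUNT)
			new_state += TOY_USAGECOUNT_ONE;

		if (pg_atomic_compare_exchange_u32(&buf->state,
										   &old_state, new_state))
			break;
		/* On failure, old_state now holds the current value; retry. */
	}
}

In such a scheme, flag updates that used to happen under the spinlock would likewise become CAS (or pg_atomic_fetch_or_u32/pg_atomic_fetch_and_u32) operations on the same word.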
On Monday 31 August 2015 13:03:07 you wrote:
> That's definitely not correct: you should initialize the atomics using
> pg_atomic_init_u32() and write to them using pg_atomic_write_u32() - not
> access them directly. This breaks the fallback paths.

You're right. For now that is just there to silence the compiler; this patch is a proof of concept only.

--
YUriy Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 08/31/2015 12:54 PM, YUriy Zhuravlev wrote:
> Hello hackers,
>
> Recently we were given access to an IBM test server, a 9119-MHE with
> 8 CPUs * 8 cores * 8 threads. We decided to take advantage of this and
> look for read-scalability bottlenecks (pgbench -S).
>
> All the details are here:
> http://www.postgrespro.ru/blog/pgsql/2015/08/30/p8scaling
>
> Performance of 9.4 stopped growing after 100 clients, and 9.5/9.6
> stopped after 150 (on 4 NUMA nodes). Profiling with perf showed that
> ProcArrayLock in GetSnapshotData is the inhibitor, but replacing
> GetSnapshotData with a stub did not significantly improve scalability.
> Looking for the bottleneck with gdb, we found another place: we noticed
> s_lock in PinBuffer and UnpinBuffer. As a test we rewrote PinBuffer and
> UnpinBuffer with atomic operations, and we liked the result: the
> performance degradation almost completely disappeared, and scaling
> continued up to 400 clients (4 NUMA nodes with 256 "CPUs").
>
> To scale Postgres on large NUMA machines, bufmgr must be ported to
> atomic operations. During our tests we found no errors in our patch,
> but most likely that is too optimistic, and the patch is intended for
> testing only.
>
> Who has any thoughts?

Well, I could test the patch on an x86 machine with 4 sockets (64 cores), but I wonder whether it makes sense at this point, as the patch really is not correct (judging by what Andres says).

Also, what pgbench scale was used for the testing?

regards

--
Tomas Vondra                   http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-08-31 17:43:08 +0200, Tomas Vondra wrote:
> Well, I could test the patch on an x86 machine with 4 sockets (64 cores),
> but I wonder whether it makes sense at this point, as the patch really
> is not correct (judging by what Andres says).

Additionally, for default pgbench it really only becomes a bottleneck after GetSnapshotData() is fixed. You can make it a problem much earlier if you have index nested loops over a lot of rows.

- Andres
On 08/31/2015 05:48 PM, Andres Freund wrote:
> On 2015-08-31 17:43:08 +0200, Tomas Vondra wrote:
>> Well, I could test the patch on an x86 machine with 4 sockets (64 cores),
>> but I wonder whether it makes sense at this point, as the patch really
>> is not correct (judging by what Andres says).
>
> Additionally, for default pgbench it really only becomes a bottleneck
> after GetSnapshotData() is fixed. You can make it a problem much earlier
> if you have index nested loops over a lot of rows.

[scratches head] So does this mean it's worth testing the patch on x86 or not, in its current state? Or should we come up with another test case, exercising the nested loops?

regards

--
Tomas Vondra                   http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-08-31 17:54:17 +0200, Tomas Vondra wrote:
> [scratches head] So does this mean it's worth testing the patch on x86
> or not, in its current state?

You could try if you're interested, but I don't think it's very meaningful. The patch is just a POC and rather broadly incorrect. Don't get me wrong, I think it's rather important that we fix this. But that requires, imo, conceptual/development work right now, not performance testing.

> Or should we come up with another test case, exercising the nested
> loops?

You'd need to do that, yes.

Greetings,

Andres Freund
We did not get any effect on 64 cores with SMT=8, and we do not have a 64-CPU x86 machine with HT disabled.

You can set scale > 1000, with shared_buffers much larger than the pgbench_accounts_pkey index. You can also increase contention further by hitting not only the top of the b-tree index but also one specific buffer:

select * from pgbench_accounts where aid = 1;

On Mon, 2015-08-31 at 17:43 +0200, Tomas Vondra wrote:
> On 08/31/2015 12:54 PM, YUriy Zhuravlev wrote:
> > Hello hackers,
> >
> > Recently we were given access to an IBM test server, a 9119-MHE with
> > 8 CPUs * 8 cores * 8 threads. We decided to take advantage of this
> > and look for read-scalability bottlenecks (pgbench -S).
> >
> > All the details are here:
> > http://www.postgrespro.ru/blog/pgsql/2015/08/30/p8scaling
> >
> > Performance of 9.4 stopped growing after 100 clients, and 9.5/9.6
> > stopped after 150 (on 4 NUMA nodes). Profiling with perf showed that
> > ProcArrayLock in GetSnapshotData is the inhibitor, but replacing
> > GetSnapshotData with a stub did not significantly improve
> > scalability. Looking for the bottleneck with gdb, we found another
> > place: we noticed s_lock in PinBuffer and UnpinBuffer. As a test we
> > rewrote PinBuffer and UnpinBuffer with atomic operations, and we
> > liked the result: the performance degradation almost completely
> > disappeared, and scaling continued up to 400 clients (4 NUMA nodes
> > with 256 "CPUs").
> >
> > To scale Postgres on large NUMA machines, bufmgr must be ported to
> > atomic operations. During our tests we found no errors in our patch,
> > but most likely that is too optimistic, and the patch is intended
> > for testing only.
> >
> > Who has any thoughts?
>
> Well, I could test the patch on an x86 machine with 4 sockets (64
> cores), but I wonder whether it makes sense at this point, as the
> patch really is not correct (judging by what Andres says).
>
> Also, what pgbench scale was used for the testing?
>
> regards
>
> --
> Tomas Vondra                   http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
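Put together, one possible way to reproduce that setup (the file name is arbitrary and the exact numbers are suggestions inferred from the advice above, not figures from the original tests): initialize with a large scale factor, set shared_buffers well above the size of pgbench_accounts_pkey, and run a custom script so every client keeps pinning the same index and heap buffers.

-- hot.sql: all clients read the same row, so they contend on pinning
-- the same b-tree root/inner pages and the same heap buffer.
SELECT * FROM pgbench_accounts WHERE aid = 1;

-- Example invocation (shell), assuming a database named "bench":
--   pgbench -i -s 1000 bench
--   pgbench -n -f hot.sql -c 256 -j 256 -T 300 bench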
On Monday 31 August 2015 17:54:17 Tomas Vondra wrote:
> So does this mean it's worth testing the patch on x86
> or not, in its current state?

It's really interesting. But you need true 64 cores without HT (32 cores + HT shows no effect).

--
YUriy Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Monday 31 August 2015 17:48:50 Andres Freund wrote:
> Additionally, for default pgbench it really only becomes a bottleneck
> after GetSnapshotData() is fixed. You can make it a problem much earlier
> if you have index nested loops over a lot of rows.

Is 100 000 000 rows a lot? A simple select query like pgbench's is a common workload; not for everyone, but still...

--
YUriy Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Monday 31 August 2015 17:43:08 Tomas Vondra wrote:
> Well, I could test the patch on an x86 machine with 4 sockets (64 cores),
> but I wonder whether it makes sense at this point, as the patch really
> is not correct (judging by what Andres says).

Can you test the patch from this thread instead:
http://www.postgresql.org/message-id/2400449.GjM57CE0Yg@dinodell ?
In our view it is correct, although this is not obvious.

--
YUriy Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company