Thread: Scaling PostgreSQL on multicore Power8
Hello hackers,

Recently we were given access to an IBM test server, a 9119-MHE with 8 CPUs * 8 cores * 8 threads. We decided to take advantage of this and look for read-scalability bottlenecks (pgbench -S).

All the details are here:
http://www.postgrespro.ru/blog/pgsql/2015/08/30/p8scaling

Performance of 9.4 stopped growing after 100 clients, and 9.5/9.6 stopped after 150 (on 4 NUMA nodes). Profiling with perf showed that ProcArrayLock in GetSnapshotData is the inhibitor, but replacing GetSnapshotData with a stub did not significantly improve scalability. Looking for the bottleneck with gdb, we found another place: we noticed s_lock in PinBuffer and UnpinBuffer. As a test we rewrote PinBuffer and UnpinBuffer with atomic operations, and we liked the result: the performance degradation almost completely disappeared, and scaling continued up to 400 clients (4 NUMA nodes with 256 "CPUs").

To scale Postgres on large NUMA machines, bufmgr must be ported to atomic operations. During our tests we found no errors in our patch, but most likely that is too optimistic, and the patch is intended for testing only.

Who has any thoughts?

--
YUriy Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
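In outline, the change being described replaces the buffer-header spinlock on the pin/unpin hot path with atomic operations. A minimal sketch of the idea, against a simplified stand-in for BufferDesc (the Toy* names and constants are illustrative only, not the attached patch; the real functions also deal with flags, BM_PIN_COUNT_WAITER, per-backend private refcounts, and so on):

#include "postgres.h"
#include "port/atomics.h"

#define TOY_MAX_USAGE_COUNT 5	/* stands in for BM_MAX_USAGE_COUNT */

/* Simplified stand-in for BufferDesc: only the fields relevant here. */
typedef struct ToyBufferDesc
{
	pg_atomic_uint32 refcount;		/* number of pins across backends */
	pg_atomic_uint32 usage_count;	/* clock-sweep usage counter */
} ToyBufferDesc;

/*
 * Lock-free pin: the hot path is a single fetch-and-add instead of
 * LockBufHdr()/UnlockBufHdr(), so concurrent pinners of the same buffer
 * no longer serialize on s_lock.
 */
static void
ToyPinBuffer(ToyBufferDesc *buf)
{
	pg_atomic_fetch_add_u32(&buf->refcount, 1);

	/* Saturate usage_count without taking a lock. */
	for (;;)
	{
		uint32		usage = pg_atomic_read_u32(&buf->usage_count);

		if (usage >= TOY_MAX_USAGE_COUNT)
			break;
		if (pg_atomic_compare_exchange_u32(&buf->usage_count,
										   &usage, usage + 1))
			break;
		/* CAS failed: another backend changed it; retry with the new value. */
	}
}

static void
ToyUnpinBuffer(ToyBufferDesc *buf)
{
	pg_atomic_fetch_sub_u32(&buf->refcount, 1);
}

The catch, discussed in the replies below, is that several bufmgr code paths rely on the spinlock to keep flags, usage_count and refcount mutually consistent, which separate atomics like these do not provide.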
On 2015-08-31 13:54:57 +0300, YUriy Zhuravlev wrote:
> We noticed s_lock in PinBuffer and UnpinBuffer. As a test we rewrote
> PinBuffer and UnpinBuffer with atomic operations, and we liked the
> result: the performance degradation almost completely disappeared, and
> scaling continued up to 400 clients (4 NUMA nodes with 256 "CPUs").
>
> To scale Postgres on large NUMA machines, bufmgr must be ported to
> atomic operations. During our tests we found no errors in our patch,
> but most likely that is too optimistic, and the patch is intended for
> testing only.

I agree that this is necessary, and it matches what I've profiled. Unfortunately I don't think the patch can be quite as simple as presented here - we rely on the exclusion provided by the spinlock in a bunch of places for more than the manipulation of individual values, and those places are currently only correct if no other manipulation can be going on concurrently. But it's definitely doable.

I have an initial patch doing this, but for me it seemed to be necessary to merge flags, usage count and refcount into a single value - otherwise the consistency is hard to maintain, because you can't do a CAS over all the values at once.

> diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
> index 3ae2848..5fdaca7 100644
> --- a/src/backend/storage/buffer/buf_init.c
> +++ b/src/backend/storage/buffer/buf_init.c
> @@ -95,9 +95,9 @@ InitBufferPool(void)
> 		BufferDesc *buf = GetBufferDescriptor(i);
>
> 		CLEAR_BUFFERTAG(buf->tag);
> -		buf->flags = 0;
> -		buf->usage_count = 0;
> -		buf->refcount = 0;
> +		buf->flags.value = 0;
> +		buf->usage_count.value = 0;
> +		buf->refcount.value = 0;
> 		buf->wait_backend_pid = 0;

That's definitely not correct: you should initialize the atomics using pg_atomic_init_u32() and write to them using pg_atomic_write_u32() - not access them directly. This breaks the fallback paths.

Greetings,

Andres Freund
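To make both of these points concrete, here is a hedged sketch (the Toy* names, bit widths and packed layout are illustrative assumptions, not Andres's actual patch): initialization goes through pg_atomic_init_u32(), and folding refcount, usage count and flags into one 32-bit word lets a single compare-and-swap update them consistently.

#include "postgres.h"
#include "port/atomics.h"

/*
 * One possible packing (illustrative): low 18 bits refcount,
 * next 4 bits usage_count, remaining high bits flags.
 */
#define TOY_REFCOUNT_ONE		1
#define TOY_USAGECOUNT_SHIFT	18
#define TOY_USAGECOUNT_ONE		(1U << TOY_USAGECOUNT_SHIFT)
#define TOY_USAGECOUNT_MASK		(((1U << 4) - 1) << TOY_USAGECOUNT_SHIFT)
#define TOY_MAX_USAGECOUNT		(5U << TOY_USAGECOUNT_SHIFT)

typedef struct ToyPackedBufferDesc
{
	pg_atomic_uint32 state;		/* flags | usage_count | refcount */
} ToyPackedBufferDesc;

/* Correct initialization: go through the atomics API, never ->value. */
static void
ToyInitBufferDesc(ToyPackedBufferDesc *buf)
{
	pg_atomic_init_u32(&buf->state, 0);
}

/*
 * Pin via a CAS loop: refcount and usage_count change together, so any
 * reader of the state word sees a mutually consistent pair, which the
 * three-separate-atomics version cannot guarantee.
 */
static void
ToyPinBufferPacked(ToyPackedBufferDesc *buf)
{
	uint32		old_state = pg_atomic_read_u32(&buf->state);

	for (;;)
	{
		uint32		new_state = old_state + TOY_REFCOUNT_ONE;

		if ((new_state & TOY_USAGECOUNT_MASK) < TOY_MAX_USAGECOUNT)
			new_state += TOY_USAGECOUNT_ONE;

		if (pg_atomic_compare_exchange_u32(&buf->state,
										   &old_state, new_state))
			break;
		/* On failure, old_state now holds the current value; retry. */
	}
}

In such a scheme, flag updates that used to happen under the spinlock would likewise become CAS (or pg_atomic_fetch_or_u32/pg_atomic_fetch_and_u32) operations on the same word.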
On Monday 31 August 2015 13:03:07 you wrote:
> That's definitely not correct: you should initialize the atomics using
> pg_atomic_init_u32() and write to them using pg_atomic_write_u32() - not
> access them directly. This breaks the fallback paths.

You're right. For now that is just there to silence the compiler; this patch is a proof of concept only.

--
YUriy Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 08/31/2015 12:54 PM, YUriy Zhuravlev wrote:
> Hello hackers,
>
> Recently we were given access to an IBM test server, a 9119-MHE with
> 8 CPUs * 8 cores * 8 threads. We decided to take advantage of this and
> look for read-scalability bottlenecks (pgbench -S).
>
> All the details are here:
> http://www.postgrespro.ru/blog/pgsql/2015/08/30/p8scaling
>
> Performance of 9.4 stopped growing after 100 clients, and 9.5/9.6
> stopped after 150 (on 4 NUMA nodes). Profiling with perf showed that
> ProcArrayLock in GetSnapshotData is the inhibitor, but replacing
> GetSnapshotData with a stub did not significantly improve scalability.
> Looking for the bottleneck with gdb, we found another place: we noticed
> s_lock in PinBuffer and UnpinBuffer. As a test we rewrote PinBuffer and
> UnpinBuffer with atomic operations, and we liked the result: the
> performance degradation almost completely disappeared, and scaling
> continued up to 400 clients (4 NUMA nodes with 256 "CPUs").
>
> To scale Postgres on large NUMA machines, bufmgr must be ported to
> atomic operations. During our tests we found no errors in our patch,
> but most likely that is too optimistic, and the patch is intended for
> testing only.
>
> Who has any thoughts?

Well, I could test the patch on an x86 machine with 4 sockets (64 cores), but I wonder whether it makes sense at this point, as the patch really is not correct (judging by what Andres says).

Also, what pgbench scale was used for the testing?

regards

--
Tomas Vondra                   http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-08-31 17:43:08 +0200, Tomas Vondra wrote:
> Well, I could test the patch on an x86 machine with 4 sockets (64 cores),
> but I wonder whether it makes sense at this point, as the patch really
> is not correct (judging by what Andres says).

Additionally, for default pgbench it really only becomes a bottleneck after GetSnapshotData() is fixed. You can make it a problem much earlier if you have index nested loops over a lot of rows.

- Andres
On 08/31/2015 05:48 PM, Andres Freund wrote:
> On 2015-08-31 17:43:08 +0200, Tomas Vondra wrote:
>> Well, I could test the patch on an x86 machine with 4 sockets (64 cores),
>> but I wonder whether it makes sense at this point, as the patch really
>> is not correct (judging by what Andres says).
>
> Additionally, for default pgbench it really only becomes a bottleneck
> after GetSnapshotData() is fixed. You can make it a problem much earlier
> if you have index nested loops over a lot of rows.

[scratches head] So does this mean it's worth testing the patch on x86 or not, in its current state? Or should we come up with another test case, exercising the nested loops?

regards

--
Tomas Vondra                   http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-08-31 17:54:17 +0200, Tomas Vondra wrote:
> [scratches head] So does this mean it's worth testing the patch on x86
> or not, in its current state?

You could try if you're interested, but I don't think it's very meaningful. The patch is just a POC and rather broadly incorrect. Don't get me wrong, I think it's rather important that we fix this. But that requires, imo, conceptual/development work right now, not performance testing.

> Or should we come up with another test case, exercising the nested
> loops?

You'd need to do that, yes.

Greetings,

Andres Freund
We did not get any effect on 64 cores with SMT=8, and we do not have a 64-CPU x86 machine with HT disabled.

You can set scale > 1000, with shared_buffers much larger than the pgbench_accounts_pkey index. You can also increase contention further by hitting not only the top of the b-tree index but also one specific buffer:

select * from pgbench_accounts where aid = 1;

On Mon, 2015-08-31 at 17:43 +0200, Tomas Vondra wrote:
> On 08/31/2015 12:54 PM, YUriy Zhuravlev wrote:
> > Hello hackers,
> >
> > Recently we were given access to an IBM test server, a 9119-MHE with
> > 8 CPUs * 8 cores * 8 threads. We decided to take advantage of this
> > and look for read-scalability bottlenecks (pgbench -S).
> >
> > All the details are here:
> > http://www.postgrespro.ru/blog/pgsql/2015/08/30/p8scaling
> >
> > Performance of 9.4 stopped growing after 100 clients, and 9.5/9.6
> > stopped after 150 (on 4 NUMA nodes). Profiling with perf showed that
> > ProcArrayLock in GetSnapshotData is the inhibitor, but replacing
> > GetSnapshotData with a stub did not significantly improve
> > scalability. Looking for the bottleneck with gdb, we found another
> > place: we noticed s_lock in PinBuffer and UnpinBuffer. As a test we
> > rewrote PinBuffer and UnpinBuffer with atomic operations, and we
> > liked the result: the performance degradation almost completely
> > disappeared, and scaling continued up to 400 clients (4 NUMA nodes
> > with 256 "CPUs").
> >
> > To scale Postgres on large NUMA machines, bufmgr must be ported to
> > atomic operations. During our tests we found no errors in our patch,
> > but most likely that is too optimistic, and the patch is intended
> > for testing only.
> >
> > Who has any thoughts?
>
> Well, I could test the patch on an x86 machine with 4 sockets (64
> cores), but I wonder whether it makes sense at this point, as the
> patch really is not correct (judging by what Andres says).
>
> Also, what pgbench scale was used for the testing?
>
> regards
>
> --
> Tomas Vondra                   http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
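Put together, one possible way to reproduce that setup (the file name is arbitrary and the exact numbers are suggestions inferred from the advice above, not figures from the original tests): initialize with a large scale factor, set shared_buffers well above the size of pgbench_accounts_pkey, and run a custom script so every client keeps pinning the same index and heap buffers.

-- hot.sql: all clients read the same row, so they contend on pinning
-- the same b-tree root/inner pages and the same heap buffer.
SELECT * FROM pgbench_accounts WHERE aid = 1;

-- Example invocation (shell), assuming a database named "bench":
--   pgbench -i -s 1000 bench
--   pgbench -n -f hot.sql -c 256 -j 256 -T 300 bench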
On Monday 31 August 2015 17:54:17 Tomas Vondra wrote:
> So does this mean it's worth testing the patch on x86
> or not, in its current state?

It's really interesting. But you need true 64 cores without HT (32 cores + HT shows no effect).

--
YUriy Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Monday 31 August 2015 17:48:50 Andres Freund wrote:
> Additionally, for default pgbench it really only becomes a bottleneck
> after GetSnapshotData() is fixed. You can make it a problem much earlier
> if you have index nested loops over a lot of rows.

Is 100 000 000 rows a lot? A simple select query like pgbench's is a common workload; not for everyone, but still...

--
YUriy Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Monday 31 August 2015 17:43:08 Tomas Vondra wrote:
> Well, I could test the patch on an x86 machine with 4 sockets (64 cores),
> but I wonder whether it makes sense at this point, as the patch really
> is not correct (judging by what Andres says).

Can you test the patch from this thread instead:
http://www.postgresql.org/message-id/2400449.GjM57CE0Yg@dinodell ?
In our view it is correct, although this is not obvious.

--
YUriy Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company