Thread: Speed up Clog Access by increasing CLOG buffers

Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
After reducing ProcArrayLock contention in commit
0e141c0fbb211bdd23783afa731e3eef95c9ad7a, the other lock
which seems to be contentious in read-write transactions is
CLogControlLock.  In my investigation, I found that the contention
is mainly due to two reasons.  First, while writing the transaction
status in CLOG (TransactionIdSetPageStatus()), we acquire the
CLogControlLock in Exclusive mode, which contends with every other
transaction that accesses the CLOG to check transaction status; Simon
has already proposed a patch [1] to reduce this.  Second, when the
required CLOG page is not found in the CLOG buffers, we need to acquire
the CLogControlLock in Exclusive mode to read it in, which again contends
with the shared lockers that access transaction status.

Increasing CLOG buffers to 64 helps in reducing the contention due to the
second reason.  Experiments revealed that increasing CLOG buffers only helps
once the contention around ProcArrayLock is reduced.
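
As a sketch of the shape of the change (the actual patch may differ in
details), clog.c sizes the number of CLOG buffers from shared_buffers
with a hard cap, and the proposal raises that cap from 32 to 64:

/* before (sketch of the existing formula in clog.c):
 *   #define CLOGShmemBuffers()  Min(32, Max(4, NBuffers / 512))
 * after the proposed patch: */
#define CLOGShmemBuffers()  Min(64, Max(4, NBuffers / 512))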

Performance Data
-----------------------------
RAM - 500GB
8 sockets, 64 cores (hyperthreaded, 128 threads total)

Non-default parameters
------------------------------------
max_connections = 300
shared_buffers=8GB
min_wal_size=10GB
max_wal_size=15GB
checkpoint_timeout = 35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 256MB

pgbench setup
------------------------
scale factor - 300
Data is on magnetic disk and WAL on ssd.
pgbench -M prepared tpc-b

HEAD - commit 0e141c0f
Patch-1 - increase_clog_bufs_v1

Client Count/Patch_ver      1      8     16     32     64    128    256
HEAD                      911   5695   9886  18028  27851  28654  25714
Patch-1                   954   5568   9898  18450  29313  31108  28213


This data shows an increase of ~5% at 64 clients and 8~10% at higher
client counts, without degradation at lower client counts.  In the above
data there is some fluctuation at 8 clients, but I attribute that to
run-to-run variation; if anybody has doubts, I can re-verify the data at
lower client counts.

Now if we try to further increase the number of CLOG buffers to 128,
no improvement is seen.

I have also verified that this improvement can be seen only after the
contention around ProcArrayLock is reduced.  Below is the data with the
commit before the ProcArrayLock reduction patch.  The setup and test
are the same as for the previous test.

HEAD - commit 253de7e1
Patch-1 - increase_clog_bufs_v1


Client Count/Patch_ver    128    256
HEAD                    16657  10512
Patch-1                 16694  10477


I think the benefit of this patch would be more significant along
with the other patch to reduce CLogControlLock contention [1]
(I have not tested both patches together, as a few issues are still
left in the other patch); however, it has its own independent
value, so it can be considered separately.

Thoughts?



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: Speed up Clog Access by increasing CLOG buffers

From
Andres Freund
Date:
On 2015-09-01 10:19:19 +0530, Amit Kapila wrote:
> pgbench setup
> ------------------------
> scale factor - 300
> Data is on magnetic disk and WAL on ssd.
> pgbench -M prepared tpc-b
> 
> HEAD - commit 0e141c0f
> Patch-1 - increase_clog_bufs_v1
> 
> Client Count/Patch_ver 1 8 16 32 64 128 256 HEAD 911 5695 9886 18028 27851
> 28654 25714 Patch-1 954 5568 9898 18450 29313 31108 28213
> 
> 
> This data shows that there is an increase of ~5% at 64 client-count
> and 8~10% at more higher clients without degradation at lower client-
> count. In above data, there is some fluctuation seen at 8-client-count,
> but I attribute that to run-to-run variation, however if anybody has doubts
> I can again re-verify the data at lower client counts.

> Now if we try to further increase the number of CLOG buffers to 128,
> no improvement is seen.
> 
> I have also verified that this improvement can be seen only after the
> contention around ProcArrayLock is reduced.  Below is the data with
> Commit before the ProcArrayLock reduction patch.  Setup and test
> is same as mentioned for previous test.

The buffer replacement algorithm for clog is rather stupid - I do wonder
where the cutoff is that it hurts.

Could you perhaps try to create a testcase where xids are accessed that
are so far apart on average that they're unlikely to be in memory? And
then test that across a number of client counts?

There's two reasons that I'd like to see that: First I'd like to avoid
regression, second I'd like to avoid having to bump the maximum number
of buffers by small buffers after every hardware generation...

>  /*
>   * Number of shared CLOG buffers.
>   *
> - * Testing during the PostgreSQL 9.2 development cycle revealed that on a
> + * Testing during the PostgreSQL 9.6 development cycle revealed that on a
>   * large multi-processor system, it was possible to have more CLOG page
> - * requests in flight at one time than the number of CLOG buffers which existed
> - * at that time, which was hardcoded to 8.  Further testing revealed that
> - * performance dropped off with more than 32 CLOG buffers, possibly because
> - * the linear buffer search algorithm doesn't scale well.
> + * requests in flight at one time than the number of CLOG buffers which
> + * existed at that time, which was 32 assuming there are enough shared_buffers.
> + * Further testing revealed that either performance stayed same or dropped off
> + * with more than 64 CLOG buffers, possibly because the linear buffer search
> + * algorithm doesn't scale well or some other locking bottlenecks in the
> + * system mask the improvement.
>   *
> - * Unconditionally increasing the number of CLOG buffers to 32 did not seem
> + * Unconditionally increasing the number of CLOG buffers to 64 did not seem
>   * like a good idea, because it would increase the minimum amount of shared
>   * memory required to start, which could be a problem for people running very
>   * small configurations.  The following formula seems to represent a reasonable
>   * compromise: people with very low values for shared_buffers will get fewer
> - * CLOG buffers as well, and everyone else will get 32.
> + * CLOG buffers as well, and everyone else will get 64.
>   *
>   * It is likely that some further work will be needed here in future releases;
>   * for example, on a 64-core server, the maximum number of CLOG requests that
>   * can be simultaneously in flight will be even larger.  But that will
>   * apparently require more than just changing the formula, so for now we take
> - * the easy way out.
> + * the easy way out.  It could also happen that after removing other locking
> + * bottlenecks, further increase in CLOG buffers can help, but that's not the
> + * case now.
>   */

I think the comment should be more drastically rephrased to not
reference individual versions and numbers.

Greetings,

Andres Freund



Re: Speed up Clog Access by increasing CLOG buffers

From
Alvaro Herrera
Date:
Andres Freund wrote:

> The buffer replacement algorithm for clog is rather stupid - I do wonder
> where the cutoff is that it hurts.
> 
> Could you perhaps try to create a testcase where xids are accessed that
> are so far apart on average that they're unlikely to be in memory? And
> then test that across a number of client counts?
> 
> There's two reasons that I'd like to see that: First I'd like to avoid
> regression, second I'd like to avoid having to bump the maximum number
> of buffers by small buffers after every hardware generation...

I wonder if it would make sense to explore an idea that has been floated
for years now -- to have pg_clog pages be allocated as part of shared
buffers rather than have their own separate pool.  That way, no separate
hardcoded allocation limit is needed.  It's probably pretty tricky to
implement, though :-(

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Andres Freund
Date:
Hi,

On 2015-09-07 10:34:10 -0300, Alvaro Herrera wrote:
> I wonder if it would make sense to explore an idea that has been floated
> for years now -- to have pg_clog pages be allocated as part of shared
> buffers rather than have their own separate pool.  That way, no separate
> hardcoded allocation limit is needed.  It's probably pretty tricky to
> implement, though :-(

I still think that'd be a good plan, especially as it'd also let us use
a lot of related infrastructure. I doubt we could just use the standard
cache replacement mechanism though - it's not particularly efficient
either... I also have my doubts that a hash table lookup at every clog
lookup is going to be ok performancewise.

The biggest problem will probably be that the buffer manager is pretty
directly tied to relations and breaking up that bond won't be all that
easy. My guess is that the easiest way to at least explore this is to
define pg_clog/... as their own tablespaces (akin to pg_global) and
treat the files therein as plain relations.

Greetings,

Andres Freund



Re: Speed up Clog Access by increasing CLOG buffers

From
Alvaro Herrera
Date:
Andres Freund wrote:

> On 2015-09-07 10:34:10 -0300, Alvaro Herrera wrote:
> > I wonder if it would make sense to explore an idea that has been floated
> > for years now -- to have pg_clog pages be allocated as part of shared
> > buffers rather than have their own separate pool.  That way, no separate
> > hardcoded allocation limit is needed.  It's probably pretty tricky to
> > implement, though :-(
> 
> I still think that'd be a good plan, especially as it'd also let us use
> a lot of related infrastructure. I doubt we could just use the standard
> cache replacement mechanism though - it's not particularly efficient
> either... I also have my doubts that a hash table lookup at every clog
> lookup is going to be ok performancewise.

Yeah.  I guess we'd have to mark buffers as unusable for regular pages
("somehow"), and have a separate lookup mechanism.  As I said, it is
likely to be tricky.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Mon, Sep 7, 2015 at 7:04 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> Andres Freund wrote:
>
> > The buffer replacement algorithm for clog is rather stupid - I do wonder
> > where the cutoff is that it hurts.
> >
> > Could you perhaps try to create a testcase where xids are accessed that
> > are so far apart on average that they're unlikely to be in memory?
> >

Yes, I am working on it.  What I have in mind is to create a table with
a large number of rows (say 50000000), with each row written by a
different transaction id.  Then each transaction should try to update rows
that are at least 1048576 apart (the number of transactions whose status
can be held in 32 CLOG buffers), so that each update will try to access a
CLOG page that is not in memory.  Let me know if you can think of any
better or simpler way.


> > There's two reasons that I'd like to see that: First I'd like to avoid
> > regression, second I'd like to avoid having to bump the maximum number
> > of buffers by small buffers after every hardware generation...
>
> I wonder if it would make sense to explore an idea that has been floated
> for years now -- to have pg_clog pages be allocated as part of shared
> buffers rather than have their own separate pool.
>

There could be some benefits, but I think we would still have to acquire
an Exclusive lock while committing a transaction or while extending the
CLOG, which are also major sources of contention in this area.  The
benefits of moving it to shared_buffers could be that the upper limit on
the number of pages that can be retained in memory could be increased, and
even when a page has to be replaced, the responsibility for flushing it
could be delegated to checkpoint.  So yes, there could be benefits with
this idea, but I am not sure they are worth investigating.  One thing we
could try, if you think it is beneficial, is to just skip the fsync during
the write of CLOG pages and, if that helps, then think of pushing the
flush to checkpoint (something similar to what Andres has mentioned on a
nearby thread).

Yet another way could be to have a configuration variable for clog buffers
(Clog_Buffers).


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Mon, Sep 7, 2015 at 9:34 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Andres Freund wrote:
>> The buffer replacement algorithm for clog is rather stupid - I do wonder
>> where the cutoff is that it hurts.
>>
>> Could you perhaps try to create a testcase where xids are accessed that
>> are so far apart on average that they're unlikely to be in memory? And
>> then test that across a number of client counts?
>>
>> There's two reasons that I'd like to see that: First I'd like to avoid
>> regression, second I'd like to avoid having to bump the maximum number
>> of buffers by small buffers after every hardware generation...
>
> I wonder if it would make sense to explore an idea that has been floated
> for years now -- to have pg_clog pages be allocated as part of shared
> buffers rather than have their own separate pool.  That way, no separate
> hardcoded allocation limit is needed.  It's probably pretty tricky to
> implement, though :-(

Yeah, I looked at that once and threw my hands up in despair pretty
quickly.  I also considered another idea that looked simpler: instead
of giving every SLRU its own pool of pages, have one pool of pages for
all of them, separate from shared buffers but common to all SLRUs.
That looked easier, but still not easy.

I've also considered trying to replace the entire SLRU system with new
code and throwing away what exists today.  The locking mode is just
really strange compared to what we do elsewhere.  That, too, does not
look all that easy.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Thu, Sep 3, 2015 at 5:11 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2015-09-01 10:19:19 +0530, Amit Kapila wrote:
> > pgbench setup
> > ------------------------
> > scale factor - 300
> > Data is on magnetic disk and WAL on ssd.
> > pgbench -M prepared tpc-b
> >
> > HEAD - commit 0e141c0f
> > Patch-1 - increase_clog_bufs_v1
> >
>
> The buffer replacement algorithm for clog is rather stupid - I do wonder
> where the cutoff is that it hurts.
>
> Could you perhaps try to create a testcase where xids are accessed that
> are so far apart on average that they're unlikely to be in memory? And
> then test that across a number of client counts?
>

Okay, I have tried one such test, but the best I could come up with is
one where, on average, every 100th access is a disk access; I then tested
it with different numbers of CLOG buffers and client counts.  Below are
the results:

Non-default parameters
------------------------------------
max_connections = 300
shared_buffers=32GB
min_wal_size=10GB
max_wal_size=15GB
checkpoint_timeout = 35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 256MB
autovacuum=off


HEAD - commit 49124613
Patch-1 - Clog Buffers - 64
Patch-2 - Clog Buffers - 128

Client Count/Patch_ver      1      8     64    128
HEAD                      1395   8336  37866  34463
Patch-1                   1615   8180  37799  35315
Patch-2                   1409   8219  37068  34729
 
So there is not much difference in the test results with different values
of CLOG buffers, probably because I/O has dominated the test; it shows
that increasing the CLOG buffers won't regress the current behaviour even
when many more transaction-status accesses fall outside the CLOG buffers.

Now about the test: create a table with a large number of rows (say
11617457; I tried to create a larger one, but it was taking too much time,
more than a day), with each row written by a different transaction id.
Then each transaction updates rows that are at least 1048576 apart (the
number of transactions whose status can be held in 32 CLOG buffers), so
that ideally each update tries to access a CLOG page that is not in
memory; however, since the value to update is selected randomly, in
practice roughly every 100th access ends up being a disk access.

Test
-------
1. Attached file clog_prep.sh should create and populate the required
table and create the function used to access the CLOG pages.  You
might want to update no_of_rows based on the number of rows you want
to create in the table.
2. Attached file  access_clog_disk.sql is used to execute the function
with random values. You might want to update nrows variable based
on the rows created in previous step.
3. Use pgbench as follows with different client count
./pgbench -c 4 -j 4 -n -M prepared -f "access_clog_disk.sql" -T 300 postgres
4. To ensure that the clog access function always accesses the same data
during each run, the data_directory created by step 1 is copied back
before each run.

By adding some instrumentation, I have verified that approximately
every 100th access is a disk access; the attached patch clog_info-v1.patch
adds the necessary instrumentation to the code.

As an example, pgbench test yields below results:
./pgbench -c 4 -j 4 -n -M prepared -f "access_clog_disk.sql" -T 180 postgres

LOG:  trans_status(3169396)
LOG:  trans_status_disk(29546)
LOG:  trans_status(3054952)
LOG:  trans_status_disk(28291)
LOG:  trans_status(3131242)
LOG:  trans_status_disk(28989)
LOG:  trans_status(3155449)
LOG:  trans_status_disk(29347)

Here 'trans_status' is the number of times the process accessed the
CLOG status and 'trans_status_disk' is the number of times it went
to disk to access a CLOG page.
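
As a rough sketch of the kind of instrumentation involved (the counter
names follow the log output above; the exact placement in
clog_info-v1.patch may differ):

static uint64 trans_status = 0;      /* CLOG status lookups by this backend */
static uint64 trans_status_disk = 0; /* lookups that had to read a page in */

/*
 * Bump trans_status in TransactionIdGetStatus() (clog.c) and
 * trans_status_disk in SlruPhysicalReadPage() (slru.c), and report both
 * counters with elog(LOG, ...) when the backend exits.
 */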

 
>
> >  /*
> >   * Number of shared CLOG buffers.
> >   *
>
>
>
> I think the comment should be more drastically rephrased to not
> reference individual versions and numbers.
>

Updated comments and the patch (increate_clog_bufs_v2.patch)
containing the same is attached.


With Regards,
Amit Kapila.
Attachment

Re: Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Fri, Sep 11, 2015 at 10:31 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Could you perhaps try to create a testcase where xids are accessed that
> > are so far apart on average that they're unlikely to be in memory? And
> > then test that across a number of client counts?
> >
>
> Now about the test, create a table with large number of rows (say 11617457,
> I have tried to create larger, but it was taking too much time (more than a day))
> and have each row with different transaction id.  Now each transaction should
> update rows that are at least 1048576 (number of transactions whose status can
> be held in 32 CLog buffers) distance apart, that way ideally for each update it will
> try to access Clog page that is not in-memory, however as the value to update
> is getting selected randomly and that leads to every 100th access as disk access.

What about just running a regular pgbench test, but hacking the
XID-assignment code so that we increment the XID counter by 100 each
time instead of 1?
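
A minimal sketch of such a hack (not a tested patch; the names are the
existing ones in varsup.c, but a real patch would also need to handle
CLOG/subtrans extension for the skipped xids): in GetNewTransactionId(),
after handing out an xid and while still holding XidGenLock, advance the
shared counter an extra 99 times so consecutive transactions land roughly
100 xids apart:

{
    int     i;

    /* burn 99 extra xids to spread transactions across CLOG pages */
    for (i = 0; i < 99; i++)
        TransactionIdAdvance(ShmemVariableCache->nextXid);
}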

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Fri, Sep 11, 2015 at 9:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Sep 11, 2015 at 10:31 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > Could you perhaps try to create a testcase where xids are accessed that
> > > are so far apart on average that they're unlikely to be in memory? And
> > > then test that across a number of client counts?
> > >
> >
> > Now about the test, create a table with large number of rows (say 11617457,
> > I have tried to create larger, but it was taking too much time (more than a day))
> > and have each row with different transaction id.  Now each transaction should
> > update rows that are at least 1048576 (number of transactions whose status can
> > be held in 32 CLog buffers) distance apart, that way ideally for each update it will
> > try to access Clog page that is not in-memory, however as the value to update
> > is getting selected randomly and that leads to every 100th access as disk access.
>
> What about just running a regular pgbench test, but hacking the
> XID-assignment code so that we increment the XID counter by 100 each
> time instead of 1?
>

If I am not wrong, we need a difference of 1048576 transactions
between records to make each CLOG access a disk access, so if we
increment the XID counter by 100, then probably every 10000th (or a
multiple of 10000) transaction would go for a disk access.

The number 1048576 is derived by below calc:
#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)

Transaction difference required for each transaction to go for disk access:
CLOG_XACTS_PER_PAGE * num_clog_buffers.
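
Working that out with the default 8K block size:

/* CLOG_XACTS_PER_PAGE = BLCKSZ * CLOG_XACTS_PER_BYTE = 8192 * 4 = 32768,
 * so 32 CLOG buffers cover 32768 * 32 = 1048576 transactions. */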

I think making every 100th transaction-status access a disk access
is sufficient to prove that there is no regression with the patch for the
scenario Andres asked about, or do you think it is not?

Another possibility here could be to try commenting out the fsync
in the CLOG path to see how much it impacts the performance of this test
and then of the pgbench test.  I am not sure there will be any impact,
because even if every 100th transaction goes to disk, that is still less
than the WAL fsync which we have to perform for each transaction.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Fri, Sep 11, 2015 at 11:01 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> If I am not wrong we need 1048576 number of transactions difference
> for each record to make each CLOG access a disk access, so if we
> increment XID counter by 100, then probably every 10000th (or multiplier
> of 10000) transaction would go for disk access.
>
> The number 1048576 is derived by below calc:
> #define CLOG_XACTS_PER_BYTE 4
> #define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)
>
> Transaction difference required for each transaction to go for disk access:
> CLOG_XACTS_PER_PAGE * num_clog_buffers.
>
> I think reducing to every 100th access for transaction status as disk access
> is sufficient to prove that there is no regression with the patch for the
> screnario
> asked by Andres or do you think it is not?

I have no idea.  I was just suggesting that hacking the server somehow
might be an easier way of creating the scenario Andres was interested
in than the process you described.  But feel free to ignore me, I
haven't taken much time to think about this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Speed up Clog Access by increasing CLOG buffers

From
Jesper Pedersen
Date:
On 09/11/2015 10:31 AM, Amit Kapila wrote:
> Updated comments and the patch (increate_clog_bufs_v2.patch)
> containing the same is attached.
>

I have done various runs on an Intel Xeon 28C/56T w/ 256Gb mem and 2 x 
RAID10 SSD (data + xlog) with Min(64,).

Kept the shared_buffers=64GB and effective_cache_size=160GB settings 
across all runs, but did runs with both synchronous_commit on and off 
and different scale factors for pgbench.

The results fluctuate within -2 to +2% for all client counts,
depending on the latency average.

So no real conclusion from here other than the patch doesn't help/hurt
performance on this setup; it likely depends on further CLogControlLock
related changes to see a real benefit.

Best regards, Jesper




Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Fri, Sep 18, 2015 at 11:08 PM, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
On 09/11/2015 10:31 AM, Amit Kapila wrote:
Updated comments and the patch (increate_clog_bufs_v2.patch)
containing the same is attached.


I have done various runs on an Intel Xeon 28C/56T w/ 256Gb mem and 2 x RAID10 SSD (data + xlog) with Min(64,).


The benefit with this patch can be seen at somewhat higher
client counts, as you can see in my initial mail; can you please
try once with client count > 64?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Peter Geoghegan
Date:
On Mon, Aug 31, 2015 at 9:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Increasing CLOG buffers to 64 helps in reducing the contention due to second
> reason.  Experiments revealed that increasing CLOG buffers only helps
> once the contention around ProcArrayLock is reduced.

There has been a lot of research on bitmap compression, more or less
for the benefit of bitmap index access methods.

Simple techniques like run length encoding are effective for some
things. If the need to map the bitmap into memory to access the status
of transactions is a concern, there has been work done on that, too.
Byte-aligned bitmap compression is a technique that might offer a good
trade-off between compressing clog and decompression overhead -- I
think that there basically is no decompression overhead, because set
operations can be performed on the "compressed" representation
directly. There are other techniques, too.

Something to consider. There could be multiple benefits to compressing
clog, even beyond simply avoiding managing clog buffers.
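
As a toy illustration of why clog data tends to compress well (not a
concrete design), each transaction status occupies two bits and long
stretches of committed transactions are common:

/* clog.h: TRANSACTION_STATUS_COMMITTED is 0x01, two bits per xact, so a
 * byte holding four committed transactions is 0x55.  A fully-committed
 * CLOG page (32768 xacts) is 8192 bytes of 0x55, which a run-length
 * style encoding collapses to a single (value = 0x55, count = 8192)
 * pair. */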

-- 
Peter Geoghegan



Re: Speed up Clog Access by increasing CLOG buffers

From
Jesper Pedersen
Date:
On 09/18/2015 11:11 PM, Amit Kapila wrote:
>> I have done various runs on an Intel Xeon 28C/56T w/ 256Gb mem and 2 x
>> RAID10 SSD (data + xlog) with Min(64,).
>>
>>
> The benefit with this patch could be seen at somewhat higher
> client-count as you can see in my initial mail, can you please
> once try with client count > 64?
>

Client counts were from 1 to 80.

I did do one run with Min(128,) like you, but didn't see any difference
in the result compared to Min(64,), so I focused instead on the
sync_commit on/off testing case.

Best regards, Jesper




Re: Speed up Clog Access by increasing CLOG buffers

From
Jeff Janes
Date:
On Fri, Sep 11, 2015 at 8:01 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Sep 11, 2015 at 9:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Sep 11, 2015 at 10:31 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > Could you perhaps try to create a testcase where xids are accessed that
> > > are so far apart on average that they're unlikely to be in memory? And
> > > then test that across a number of client counts?
> > >
> >
> > Now about the test, create a table with large number of rows (say 11617457,
> > I have tried to create larger, but it was taking too much time (more than a day))
> > and have each row with different transaction id.  Now each transaction should
> > update rows that are at least 1048576 (number of transactions whose status can
> > be held in 32 CLog buffers) distance apart, that way ideally for each update it will
> > try to access Clog page that is not in-memory, however as the value to update
> > is getting selected randomly and that leads to every 100th access as disk access.
>
> What about just running a regular pgbench test, but hacking the
> XID-assignment code so that we increment the XID counter by 100 each
> time instead of 1?
>

If I am not wrong we need 1048576 number of transactions difference
for each record to make each CLOG access a disk access, so if we
increment XID counter by 100, then probably every 10000th (or multiplier
of 10000) transaction would go for disk access.

The number 1048576 is derived by below calc:
#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)

Transaction difference required for each transaction to go for disk access:
CLOG_XACTS_PER_PAGE * num_clog_buffers.


That guarantees that every xid occupies its own 32-contiguous-pages chunk of clog.  

But clog pages are not pulled in and out in 32-page chunks, but in one-page chunks.  So you would only need a difference of 32,768 to get every real transaction to live on its own clog page, which means every look-up of a different real transaction would have to do a page replacement.  (I think your references to disk access here are misleading.  Isn't the issue here the contention on the lock that controls the page replacement, not the actual IO?)

I've attached a patch that allows you to set the guc "JJ_xid", which makes it burn the given number of xids every time one new one is asked for.  (The patch introduces lots of other stuff as well, but I didn't feel like ripping the irrelevant parts out--if you don't set any of the other gucs it introduces from their defaults, they shouldn't cause you trouble.)  I think there are other tools around that do the same thing, but this is the one I know about.  It is easy to drive the system into wrap-around shutdown with this, so lowering autovacuum_vacuum_cost_delay is a good idea.

Actually I haven't attached it, because then the commitfest app would list it as the patch needing review; instead I've put it here: https://drive.google.com/file/d/0Bzqrh1SO9FcERV9EUThtT3pacmM/view?usp=sharing

I think reducing to every 100th access for transaction status as disk access
is sufficient to prove that there is no regression with the patch for the screnario
asked by Andres or do you think it is not?

Now another possibility here could be that we try by commenting out fsync
in CLOG path to see how much it impact the performance of this test and
then for pgbench test.  I am not sure there will be any impact because even
every 100th transaction goes to disk access that is still less as compare
WAL fsync which we have to perform for each transaction. 

You mentioned that your clog is not on ssd, but surely at this scale of hardware, the hdd the clog is on has a bbu in front of it, no?

But I thought Andres' concern was not about fsync, but about the fact that the SLRU does linear scans (repeatedly) of the buffers while holding the control lock?  At some point, scanning more and more buffers under the lock is going to cause more contention than scanning fewer buffers and just evicting a page will.
 
Cheers,

Jeff

Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Mon, Oct 5, 2015 at 6:34 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Fri, Sep 11, 2015 at 8:01 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

If I am not wrong we need 1048576 number of transactions difference
for each record to make each CLOG access a disk access, so if we
increment XID counter by 100, then probably every 10000th (or multiplier
of 10000) transaction would go for disk access.

The number 1048576 is derived by below calc:
#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)

Transaction difference required for each transaction to go for disk access:
CLOG_XACTS_PER_PAGE * num_clog_buffers.


That guarantees that every xid occupies its own 32-contiguous-pages chunk of clog.  

But clog pages are not pulled in and out in 32-page chunks, but one page chunks.  So you would only need 32,768 differences to get every real transaction to live on its own clog page, which means every look up of a different real transaction would have to do a page replacement.


Agreed, but that doesn't affect the test result with the test done above.
 
 (I think your references to disk access here are misleading.  Isn't the issue here the contention on the lock that controls the page replacement, not the actual IO?)


The point is that if no I/O is needed, then all the read accesses for
transaction status will just use shared locks; however, if an I/O is
needed, then an exclusive lock is required.
 
I've attached a patch that allows you set the guc "JJ_xid",which makes it burn the given number of xids every time one new one is asked for.  (The patch introduces lots of other stuff as well, but I didn't feel like ripping the irrelevant parts out--if you don't set any of the other gucs it introduces from their defaults, they shouldn't cause you trouble.)  I think there are other tools around that do the same thing, but this is the one I know about.  It is easy to drive the system into wrap-around shutdown with this, so lowering autovacuum_vacuum_cost_delay is a good idea.

Actually I haven't attached it, because then the commitfest app will list it as the patch needing review, instead I've put it here https://drive.google.com/file/d/0Bzqrh1SO9FcERV9EUThtT3pacmM/view?usp=sharing


Thanks, I think probably this could also be used for testing.
 
I think reducing to every 100th access for transaction status as disk access
is sufficient to prove that there is no regression with the patch for the screnario
asked by Andres or do you think it is not?

Now another possibility here could be that we try by commenting out fsync
in CLOG path to see how much it impact the performance of this test and
then for pgbench test.  I am not sure there will be any impact because even
every 100th transaction goes to disk access that is still less as compare
WAL fsync which we have to perform for each transaction. 

You mentioned that your clog is not on ssd, but surely at this scale of hardware, the hdd the clog is on has a bbu in front of it, no?


Yes.
 
But I thought Andres' concern was not about fsync, but about the fact that the SLRU does linear scans (repeatedly) of the buffers while holding the control lock?  At some point, scanning more and more buffers under the lock is going to cause more contention than scanning fewer buffers and just evicting a page will.
 

Yes, at some point that could matter, but I could not see the impact
at 64 or 128 CLOG buffers.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Mon, Sep 21, 2015 at 6:25 PM, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
On 09/18/2015 11:11 PM, Amit Kapila wrote:
I have done various runs on an Intel Xeon 28C/56T w/ 256Gb mem and 2 x
RAID10 SSD (data + xlog) with Min(64,).


The benefit with this patch could be seen at somewhat higher
client-count as you can see in my initial mail, can you please
once try with client count > 64?


Client count were from 1 to 80.

I did do one run with Min(128,) like you, but didn't see any difference in the result compared to Min(64,), so focused instead in the sync_commit on/off testing case.

I think the main focus for tests in this area would be at higher client
counts.  At what scale factors have you taken the data, and what are
the other non-default settings you have used?  By the way, have you
tried dropping and recreating the database and restarting the server
after each run?  Can you share the exact steps you have used to perform
the tests?  I am not sure why it is not showing the benefit in your
testing; maybe the benefit shows up only on somewhat higher-end machines,
or some of the settings used for the test are not the same as mine, or
the way the pgbench read-write workload is tested is different.

In any case, I went ahead and tried further reducing the CLogControlLock
contention by grouping the transaction status updates.  The basic idea
is the same as the one used to reduce ProcArrayLock contention [1], which
is to allow one of the procs to become the leader and update the
transaction status for the other active transactions in the system.  This
has helped to reduce the contention around CLogControlLock.  The attached
patch group_update_clog_v1.patch implements this idea.
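
The core of the pattern looks roughly like the following (a sketch
modelled on the ProcArrayLock group-clear code; the fields clogGroupFirst
and clogGroupNext are illustrative and need not match
group_update_clog_v1.patch):

uint32      nextidx = pg_atomic_read_u32(&ProcGlobal->clogGroupFirst);

while (true)
{
    pg_atomic_write_u32(&MyProc->clogGroupNext, nextidx);
    if (pg_atomic_compare_exchange_u32(&ProcGlobal->clogGroupFirst,
                                       &nextidx, MyProc->pgprocno))
        break;              /* we are now on the list of pending updaters */
}

if (nextidx != INVALID_PGPROCNO)
{
    /* follower: sleep on our semaphore until the leader has written our
     * advertised xid's status, then return without touching the lock */
}
else
{
    /* leader: acquire CLogControlLock in Exclusive mode once, detach the
     * whole list with pg_atomic_exchange_u32(), set the status for every
     * member's advertised xid, release the lock, and wake the members */
}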

I have taken performance data with this patch to see the impact at
various scale factors.  All the data is for cases when the data fits in
shared buffers and is taken against commit 5c90a2ff on a server with the
below configuration and non-default postgresql.conf settings.

Performance Data
-----------------------------
RAM - 500GB
8 sockets, 64 cores (hyperthreaded, 128 threads total)

Non-default parameters
------------------------------------
max_connections = 300
shared_buffers=8GB
min_wal_size=10GB
max_wal_size=15GB
checkpoint_timeout = 35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 256MB

Refer attached files for performance data.

sc_300_perf.png - This data indicates that at scale factor 300, there is a
gain of ~15% at higher client counts, without degradation at lower client
counts.
different_sc_perf.png - At various scale factors, there is a gain of
~15% to 41% at higher client counts, and in some cases we see a gain
of ~5% at a moderate client count (64) as well.
perf_write_clogcontrollock_data_v1.ods - Detailed performance data at
various client counts and scale factors.

Feel free to ask for more details if the data in attached files is not clear.

Below is the LWLock_Stats information with and without patch:

Stats Data
---------
A. scale_factor = 300; shared_buffers=32GB; client_connections - 128

HEAD - 5c90a2ff
----------------
CLogControlLock Data
------------------------
PID 94100 lwlock main 11: shacq 678672 exacq 326477 blk 204427 spindelay 8532 dequeue self 93192
PID 94129 lwlock main 11: shacq 757047 exacq 363176 blk 207840 spindelay 8866 dequeue self 96601
PID 94115 lwlock main 11: shacq 721632 exacq 345967 blk 207665 spindelay 8595 dequeue self 96185
PID 94011 lwlock main 11: shacq 501900 exacq 241346 blk 173295 spindelay 7882 dequeue self 78134
PID 94087 lwlock main 11: shacq 653701 exacq 314311 blk 201733 spindelay 8419 dequeue self 92190

After Patch group_update_clog_v1
----------------
CLogControlLock Data
------------------------
PID 100205 lwlock main 11: shacq 836897 exacq 176007 blk 116328 spindelay 1206 dequeue self 54485
PID 100034 lwlock main 11: shacq 437610 exacq 91419 blk 77523 spindelay 994 dequeue self 35419
PID 100175 lwlock main 11: shacq 748948 exacq 158970 blk 114027 spindelay 1277 dequeue self 53486
PID 100162 lwlock main 11: shacq 717262 exacq 152807 blk 115268 spindelay 1227 dequeue self 51643
PID 100214 lwlock main 11: shacq 856044 exacq 180422 blk 113695 spindelay 1202 dequeue self 54435

The above data indicates that contention due to CLogControlLock is
reduced by around 50% with this patch.

The reasons for the remaining contention could be:

1. Readers of clog data (checking transaction status data) can take the
CLogControlLock in Exclusive mode when reading a page from disk; this can
contend with other readers (shared lockers of CLogControlLock) and with
the exclusive locker which updates transaction status.  One of the ways to
mitigate this contention is to increase the number of CLOG buffers, for
which a patch has already been posted on this thread.

2. Readers of clog data (checking transaction status data) take the
CLogControlLock in Shared mode, which can contend with the exclusive
locker (the group leader) which updates transaction status.  I have tried
to reduce the amount of work done by the group leader by allowing it to
read the CLOG page just once for all the transactions in the group which
updated the same CLOG page (an idea similar to what we currently use for
updating the status of transactions having a sub-transaction tree), but
that hasn't given any further performance boost, so I left it out.

I think we can use some other ways as well to reduce the contention around
CLogControlLock, by doing somewhat major surgery around SLRU like using
buffer pools similar to shared buffers, but this idea gives us a moderate
improvement without much impact on the existing mechanism.


Thoughts?
Attachment

Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Mon, Sep 21, 2015 at 6:34 AM, Peter Geoghegan <pg@heroku.com> wrote:
>
> On Mon, Aug 31, 2015 at 9:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Increasing CLOG buffers to 64 helps in reducing the contention due to second
> > reason.  Experiments revealed that increasing CLOG buffers only helps
> > once the contention around ProcArrayLock is reduced.
>
> There has been a lot of research on bitmap compression, more or less
> for the benefit of bitmap index access methods.
>
> Simple techniques like run length encoding are effective for some
> things. If the need to map the bitmap into memory to access the status
> of transactions is a concern, there has been work done on that, too.
> Byte-aligned bitmap compression is a technique that might offer a good
> trade-off between compression clog, and decompression overhead -- I
> think that there basically is no decompression overhead, because set
> operations can be performed on the "compressed" representation
> directly. There are other techniques, too.
>

I can see benefits to doing compression for CLOG, but I think it won't
be straightforward: besides handling compression and decompression, the
current code relies on the transaction id to find the clog page, which
will not work after compression unless we change that mapping.  Also, I
think it could avoid the increase in clog buffers, which can help readers,
but it won't help much with the contention around clog updates for
transaction status.

Overall this idea sounds promising, but I think the work involved is more
than the benefit I am expecting for the current optimization we are
discussing.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Tue, Nov 17, 2015 at 1:32 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Sep 21, 2015 at 6:34 AM, Peter Geoghegan <pg@heroku.com> wrote:
> >
> > On Mon, Aug 31, 2015 at 9:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > Increasing CLOG buffers to 64 helps in reducing the contention due to second
> > > reason.  Experiments revealed that increasing CLOG buffers only helps
> > > once the contention around ProcArrayLock is reduced.
> >
>
> Overall this idea sounds promising, but I think the work involved is more
> than the benefit I am expecting for the current optimization we are
> discussing.
>

Sorry, I think the last line is slightly confusing; let me try to write
it again:

Overall this idea sounds promising, but I think the work involved is more
than the benefit expected from the current optimization we are
discussing.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Simon Riggs
Date:
On 17 November 2015 at 06:50, Amit Kapila <amit.kapila16@gmail.com> wrote:
 
In anycase, I went ahead and tried further reducing the CLogControlLock
contention by grouping the transaction status updates.  The basic idea
is same as is used to reduce the ProcArrayLock contention [1] which is to
allow one of the proc to become leader and update the transaction status for
other active transactions in system.  This has helped to reduce the contention
around CLOGControlLock.

Sounds good. The technique has proved effective with proc array and makes sense to use here also.
 
Attached patch group_update_clog_v1.patch
implements this idea.

I don't think we should be doing this only for transactions that don't have subtransactions. We are trying to speed up real cases, not just benchmarks.

So +1 for the concept, patch is going in right direction though lets do the full press-up.

The above data indicates that contention due to CLogControlLock is
reduced by around 50% with this patch.

The reasons for remaining contention could be:

1. Readers of clog data (checking transaction status data) can take
Exclusive CLOGControlLock when reading the page from disk, this can
contend with other Readers (shared lockers of CLogControlLock) and with
exclusive locker which updates transaction status. One of the ways to
mitigate this contention is to increase the number of CLOG buffers for which
patch has been already posted on this thread.

2. Readers of clog data (checking transaction status data) takes shared
CLOGControlLock which can contend with exclusive locker (Group leader) which
updates transaction status.  I have tried to reduce the amount of work done
by group leader, by allowing group leader to just read the Clog page once
for all the transactions in the group which updated the same CLOG page
(idea similar to what we currently we use for updating the status of transactions
having sub-transaction tree), but that hasn't given any further performance boost,
so I left it.

I think we can use some other ways as well to reduce the contention around
CLOGControlLock by doing somewhat major surgery around SLRU like using
buffer pools similar to shared buffers, but this idea gives us moderate
improvement without much impact on exiting mechanism.

My earlier patch to reduce contention by changing required lock level is still valid here. Increasing the number of buffers doesn't do enough to remove that.

I'm working on a patch to use a fast-update area like we use for GIN. If a page is not available when we want to record commit, just store it in a hash table, when not in crash recovery. I'm experimenting with writing WAL for any xids earlier than last checkpoint, though we could also trickle writes and/or flush them in batches at checkpoint time - your code would help there.

The hash table can also be used for lookups. My thinking is that most reads of older xids are caused by long running transactions, so they cause a page fault at commit and then other page faults later when people read them back in. The hash table works for both kinds of page fault.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Tue, Nov 17, 2015 at 2:45 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 17 November 2015 at 06:50, Amit Kapila <amit.kapila16@gmail.com> wrote:
 
In anycase, I went ahead and tried further reducing the CLogControlLock
contention by grouping the transaction status updates.  The basic idea
is same as is used to reduce the ProcArrayLock contention [1] which is to
allow one of the proc to become leader and update the transaction status for
other active transactions in system.  This has helped to reduce the contention
around CLOGControlLock.

Sounds good. The technique has proved effective with proc array and makes sense to use here also.
 
Attached patch group_update_clog_v1.patch
implements this idea.

I don't think we should be doing this only for transactions that don't have subtransactions.

The reason for not doing this optimization for subtransactions is that we
need to advertise the information the group leader needs for updating
the transaction status, and if we want to do it for subtransactions, then
all the subtransaction ids need to be advertised.  Now here the tricky
part is that the number of subtransactions whose status needs to
be updated is dynamic, so reserving memory for it would be difficult.
However, we can reserve some space in PGPROC as we do for XidCache
(the cache of subtransaction ids) and then use that to advertise that many
xids at a time, or just allow this optimization if the number of
subtransactions is less than or equal to the size of this new XidCache.  I
am not sure it is a good idea to use the existing XidCache for this
purpose, in which case we need to have a separate space in PGPROC for it.
I don't see allocating space for 64 or so subxids as a problem; however,
doing it for a bigger number could be a cause for concern.
 
We are trying to speed up real cases, not just benchmarks.

So +1 for the concept, patch is going in right direction though lets do the full press-up.


I have mentioned above the reason for not doing it for subtransactions;
if you think it is viable to reserve space in shared memory for this
purpose, then I can include the optimization for subtransactions as well.

 
The above data indicates that contention due to CLogControlLock is
reduced by around 50% with this patch.

The reasons for remaining contention could be:

1. Readers of clog data (checking transaction status data) can take
Exclusive CLOGControlLock when reading the page from disk, this can
contend with other Readers (shared lockers of CLogControlLock) and with
exclusive locker which updates transaction status. One of the ways to
mitigate this contention is to increase the number of CLOG buffers for which
patch has been already posted on this thread.

2. Readers of clog data (checking transaction status data) takes shared
CLOGControlLock which can contend with exclusive locker (Group leader) which
updates transaction status.  I have tried to reduce the amount of work done
by group leader, by allowing group leader to just read the Clog page once
for all the transactions in the group which updated the same CLOG page
(idea similar to what we currently we use for updating the status of transactions
having sub-transaction tree), but that hasn't given any further performance boost,
so I left it.

I think we can use some other ways as well to reduce the contention around
CLOGControlLock by doing somewhat major surgery around SLRU like using
buffer pools similar to shared buffers, but this idea gives us moderate
improvement without much impact on exiting mechanism.

My earlier patch to reduce contention by changing required lock level is still valid here. Increasing the number of buffers doesn't do enough to remove that.


I understand that increasing the number of buffers alone is not
enough; that's why I have tried this group leader idea.  However,
if we do something along the lines of what you have described below
(handling page faults), it could avoid the need for increasing buffers.
 
I'm working on a patch to use a fast-update area like we use for GIN. If a page is not available when we want to record commit, just store it in a hash table, when not in crash recovery. I'm experimenting with writing WAL for any xids earlier than last checkpoint, though we could also trickle writes and/or flush them in batches at checkpoint time - your code would help there.

The hash table can also be used for lookups. My thinking is that most reads of older xids are caused by long running transactions, so they cause a page fault at commit and then other page faults later when people read them back in. The hash table works for both kinds of page fault.



With Regards,
Amit Kapila.

Re: Speed up Clog Access by increasing CLOG buffers

From
Simon Riggs
Date:
On 17 November 2015 at 11:27, Amit Kapila <amit.kapila16@gmail.com> wrote:
 
Attached patch group_update_clog_v1.patch
implements this idea.

I don't think we should be doing this only for transactions that don't have subtransactions.

The reason for not doing this optimization for subtransactions is that we
need to advertise the information that Group leader needs for updating
the transaction status and if we want to do it for sub transactions, then
all the subtransaction id's needs to be advertised.  Now here the tricky
part is that number of subtransactions for which the status needs to
be updated is dynamic, so reserving memory for it would be difficult.
However, we can reserve some space in Proc like we do for XidCache
(cache of sub transaction ids) and then use that to advertise that many
Xid's at-a-time or just allow this optimization if number of subtransactions
is lesser than or equal to the size of this new XidCache.  I am not sure
if it is good idea to use the existing XidCache for this purpose in which
case we need to have a separate space in PGProc for this purpose.  I
don't see allocating space for 64 or so subxid's as a problem, however
doing it for bigger number could be cause of concern.  
 
We are trying to speed up real cases, not just benchmarks.

So +1 for the concept, patch is going in right direction though lets do the full press-up.


I have mentioned above the reason for not doing it for sub transactions, if
you think it is viable to reserve space in shared memory for this purpose, then
I can include the optimization for subtransactions as well.

The number of subxids is unbounded, so as you say, reserving shmem isn't viable.

I'm interested in real world cases, so allocating 65 xids per process isn't needed, but what we can say is that the optimization shouldn't break down abruptly in the presence of a small/reasonable number of subtransactions.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Tue, Nov 17, 2015 at 5:04 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 17 November 2015 at 11:27, Amit Kapila <amit.kapila16@gmail.com> wrote:

We are trying to speed up real cases, not just benchmarks.

So +1 for the concept, patch is going in right direction though lets do the full press-up.


I have mentioned above the reason for not doing it for sub transactions, if
you think it is viable to reserve space in shared memory for this purpose, then
I can include the optimization for subtransactions as well.

The number of subxids is unbounded, so as you say, reserving shmem isn't viable.

I'm interested in real world cases, so allocating 65 xids per process isn't needed, but we can say is that the optimization shouldn't break down abruptly in the presence of a small/reasonable number of subtransactions.


I think in that case what we can do is: if the total number of
subtransactions is less than or equal to 64 (we can find that via the
overflowed flag in PGXACT), then apply this optimisation, else use
the existing flow to update the transaction status.  I think for that we
don't even need to reserve any additional memory.  Does that sound
sensible to you?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Tue, Nov 17, 2015 at 5:18 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Nov 17, 2015 at 5:04 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 17 November 2015 at 11:27, Amit Kapila <amit.kapila16@gmail.com> wrote:

We are trying to speed up real cases, not just benchmarks.

So +1 for the concept, patch is going in right direction though lets do the full press-up.


I have mentioned above the reason for not doing it for sub transactions, if
you think it is viable to reserve space in shared memory for this purpose, then
I can include the optimization for subtransactions as well.

The number of subxids is unbounded, so as you say, reserving shmem isn't viable.

I'm interested in real world cases, so allocating 65 xids per process isn't needed, but we can say is that the optimization shouldn't break down abruptly in the presence of a small/reasonable number of subtransactions.


I think in that case what we can do is if the total number of
sub transactions is lesser than equal to 64 (we can find that by
overflowed flag in PGXact) , then apply this optimisation, else use
the existing flow to update the transaction status.  I think for that we
don't even need to reserve any additional memory. 

I think this won't work as it is, because the subxids in XidCache could
be on different pages, in which case we either need an additional flag
in the XidCache array or a separate array to indicate which subxids'
status we want to update.  I don't see any better way to do this
optimization for subtransactions; do you have something else in
mind?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Simon Riggs
Date:
On 17 November 2015 at 11:48, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Nov 17, 2015 at 5:04 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 17 November 2015 at 11:27, Amit Kapila <amit.kapila16@gmail.com> wrote:

We are trying to speed up real cases, not just benchmarks.

So +1 for the concept, patch is going in right direction though lets do the full press-up.


I have mentioned above the reason for not doing it for sub transactions, if
you think it is viable to reserve space in shared memory for this purpose, then
I can include the optimization for subtransactions as well.

The number of subxids is unbounded, so as you say, reserving shmem isn't viable.

I'm interested in real world cases, so allocating 65 xids per process isn't needed, but we can say is that the optimization shouldn't break down abruptly in the presence of a small/reasonable number of subtransactions.


I think in that case what we can do is if the total number of
sub transactions is lesser than equal to 64 (we can find that by
overflowed flag in PGXact) , then apply this optimisation, else use
the existing flow to update the transaction status.  I think for that we
don't even need to reserve any additional memory. Does that sound
sensible to you?

I understand you to mean that the leader should look backwards through the queue collecting xids while !(PGXACT->overflowed)

No additional shmem is required

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Tue, Nov 17, 2015 at 6:30 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 17 November 2015 at 11:48, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Nov 17, 2015 at 5:04 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 17 November 2015 at 11:27, Amit Kapila <amit.kapila16@gmail.com> wrote:

We are trying to speed up real cases, not just benchmarks.

So +1 for the concept, patch is going in right direction though lets do the full press-up.


I have mentioned above the reason for not doing it for sub transactions, if
you think it is viable to reserve space in shared memory for this purpose, then
I can include the optimization for subtransactions as well.

The number of subxids is unbounded, so as you say, reserving shmem isn't viable.

I'm interested in real world cases, so allocating 65 xids per process isn't needed, but we can say is that the optimization shouldn't break down abruptly in the presence of a small/reasonable number of subtransactions.


I think in that case what we can do is if the total number of
sub transactions is lesser than equal to 64 (we can find that by
overflowed flag in PGXact) , then apply this optimisation, else use
the existing flow to update the transaction status.  I think for that we
don't even need to reserve any additional memory. Does that sound
sensible to you?

I understand you to mean that the leader should look backwards through the queue collecting xids while !(PGXACT->overflowed)

 
Yes, that is what the above idea is, but the problem with that is that the
leader won't be able to collect the subxids of the member procs (from each
member proc's XidCache), as it doesn't have the information about which of
those subxids need to be updated as part of the current transaction status
update (for subtransactions on different clog pages, we update the status
in multiple phases).  I think it could only be possible to use the above idea
if all the subtransactions are on the same page, which we can identify in
function TransactionIdSetPageStatus().  Though it looks okay to apply this
optimization when the number of subtransactions is less than 65 and all of
them are on the same page, it would still be better if we could apply it
generically for all cases where the number of subtransactions is small
(say 32 or 64).  Does this explanation clarify the problem with the above
idea for handling subtransactions?
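
(For concreteness, a minimal sketch of the kind of guard being discussed, not the
actual patch, could look like the following; nsubxids/subxids are illustrative
parameters, while TransactionIdToPage and PGPROC_MAX_CACHED_SUBXIDS are the
existing clog.c/proc.h definitions.)

static bool
CanUseGroupXidStatusUpdate(TransactionId xid, int nsubxids,
                           TransactionId *subxids, bool subxidsOverflowed)
{
    int     pageno = TransactionIdToPage(xid);  /* xid / CLOG_XACTS_PER_PAGE */
    int     i;

    /* If the subxid cache overflowed, fall back to the normal path. */
    if (subxidsOverflowed || nsubxids > PGPROC_MAX_CACHED_SUBXIDS)
        return false;

    /* Every subtransaction must be on the same clog page as the parent. */
    for (i = 0; i < nsubxids; i++)
    {
        if (TransactionIdToPage(subxids[i]) != pageno)
            return false;
    }

    return true;
}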


No additional shmem is required


If we want to do it for all cases where the number of subtransactions
is small, then we need extra memory.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Tue, Nov 17, 2015 at 6:30 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 17 November 2015 at 11:48, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think in that case what we can do is if the total number of
sub transactions is lesser than equal to 64 (we can find that by
overflowed flag in PGXact) , then apply this optimisation, else use
the existing flow to update the transaction status.  I think for that we
don't even need to reserve any additional memory. Does that sound
sensible to you?

I understand you to mean that the leader should look backwards through the queue collecting xids while !(PGXACT->overflowed)

No additional shmem is required


Okay, as discussed I have handled the case of sub-transactions without
additional shmem in the attached patch.  Apart from that, I have tried
to apply this optimization for Prepared transactions as well, but as
the dummy proc used for such transactions doesn't have semaphore like
backend proc's, so it is not possible to use such a proc in group status
updation as each group member needs to wait on semaphore.  It is not tad
difficult to add the support for that case if we are okay with creating additional
semaphore for each such dummy proc which I was not sure, so I have left
it for now.
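
(For context on why the semaphore matters: a non-leader group member sleeps on
its own semaphore until the leader has performed its update, roughly the way
the existing ProcArrayLock group-clear path does.  A simplified sketch of that
member-side wait is below; the clogGroupMember flag follows the patch versions
posted later in this thread, and the exact PGSemaphoreLock signature differs
across branches.)

    int     extraWaits = 0;

    /* Sleep until the group leader clears our membership flag. */
    for (;;)
    {
        PGSemaphoreLock(&proc->sem);        /* signature varies by branch */
        if (!proc->clogGroupMember)         /* leader has done our update */
            break;
        extraWaits++;
    }

    /* Re-issue any extra wakeups so the semaphore count stays balanced. */
    while (extraWaits-- > 0)
        PGSemaphoreUnlock(&proc->sem);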


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: Speed up Clog Access by increasing CLOG buffers

From
Jeff Janes
Date:
On Thu, Nov 26, 2015 at 11:32 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Nov 17, 2015 at 6:30 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>> On 17 November 2015 at 11:48, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>
>>>
>>> I think in that case what we can do is if the total number of
>>> sub transactions is lesser than equal to 64 (we can find that by
>>> overflowed flag in PGXact) , then apply this optimisation, else use
>>> the existing flow to update the transaction status.  I think for that we
>>> don't even need to reserve any additional memory. Does that sound
>>> sensible to you?
>>
>>
>> I understand you to mean that the leader should look backwards through the
>> queue collecting xids while !(PGXACT->overflowed)
>>
>> No additional shmem is required
>>
>
> Okay, as discussed I have handled the case of sub-transactions without
> additional shmem in the attached patch.  Apart from that, I have tried
> to apply this optimization for Prepared transactions as well, but as
> the dummy proc used for such transactions doesn't have semaphore like
> backend proc's, so it is not possible to use such a proc in group status
> updation as each group member needs to wait on semaphore.  It is not tad
> difficult to add the support for that case if we are okay with creating
> additional
> semaphore for each such dummy proc which I was not sure, so I have left
> it for now.

Is this proposal instead of, or in addition to, the original thread
topic of increasing clog buffers to 64?

Thanks,

Jeff



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Sun, Nov 29, 2015 at 1:47 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>
> On Thu, Nov 26, 2015 at 11:32 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Tue, Nov 17, 2015 at 6:30 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >>
> >> On 17 November 2015 at 11:48, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >>>
> >>>
> >>> I think in that case what we can do is if the total number of
> >>> sub transactions is lesser than equal to 64 (we can find that by
> >>> overflowed flag in PGXact) , then apply this optimisation, else use
> >>> the existing flow to update the transaction status.  I think for that we
> >>> don't even need to reserve any additional memory. Does that sound
> >>> sensible to you?
> >>
> >>
> >> I understand you to mean that the leader should look backwards through the
> >> queue collecting xids while !(PGXACT->overflowed)
> >>
> >> No additional shmem is required
> >>
> >
> > Okay, as discussed I have handled the case of sub-transactions without
> > additional shmem in the attached patch.  Apart from that, I have tried
> > to apply this optimization for Prepared transactions as well, but as
> > the dummy proc used for such transactions doesn't have semaphore like
> > backend proc's, so it is not possible to use such a proc in group status
> > updation as each group member needs to wait on semaphore.  It is not tad
> > difficult to add the support for that case if we are okay with creating
> > additional
> > semaphore for each such dummy proc which I was not sure, so I have left
> > it for now.
>
> Is this proposal instead of, or in addition to, the original thread
> topic of increasing clog buffers to 64?
>

This is in addition to increasing the clog buffers to 64, but with this patch
(Group Clog updation), the effect of increasing the clog buffers will be smaller.
 



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Fri, Nov 27, 2015 at 2:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Okay, as discussed I have handled the case of sub-transactions without
> additional shmem in the attached patch.  Apart from that, I have tried
> to apply this optimization for Prepared transactions as well, but as
> the dummy proc used for such transactions doesn't have semaphore like
> backend proc's, so it is not possible to use such a proc in group status
> updation as each group member needs to wait on semaphore.  It is not tad
> difficult to add the support for that case if we are okay with creating
> additional
> semaphore for each such dummy proc which I was not sure, so I have left
> it for now.

"updation" is not a word.  "acquirations" is not a word.  "penality"
is spelled wrong.

I think the approach this patch takes is pretty darned strange, and
almost certainly not what we want.  What you're doing here is putting
every outstanding CLOG-update request into a linked list, and then the
leader goes and does all of those CLOG updates.  But there's no
guarantee that the pages that need to be updated are even present in a
CLOG buffer.  If it turns out that all of the batched CLOG updates are
part of resident pages, then this is going to work great, just like
the similar ProcArrayLock optimization.  But if the pages are not
resident, then you will get WORSE concurrency and SLOWER performance
than the status quo.  The leader will sit there and read every page
that is needed, and to do that it will repeatedly release and
reacquire CLogControlLock (inside SimpleLruReadPage).  If you didn't
have a leader, the reads of all those pages could happen at the same
time, but with this design, they get serialized.  That's not good.

My idea for how this could possibly work is that you could have a list
of waiting backends for each SLRU buffer page.  Pages with waiting
backends can't be evicted without performing the updates for which
backends are waiting.  Updates to non-resident pages just work as they
do now.  When a backend acquires CLogControlLock to perform updates to
a given page, it also performs all other pending updates to that page
and releases those waiters.  When a backend acquires CLogControlLock
to evict a page, it must perform any pending updates and write the
page before completing the eviction.
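
(Purely to illustrate the shape of that idea, not anything in a posted patch,
the per-buffer bookkeeping might look something like this; every name below is
hypothetical.)

typedef struct ClogPendingUpdate
{
    TransactionId  xid;            /* xid whose status should be set */
    XidStatus      status;         /* target status */
    XLogRecPtr     lsn;            /* commit LSN, for async-commit tracking */
    int            waiterProcno;   /* backend to wake once the update is done */
    struct ClogPendingUpdate *next;
} ClogPendingUpdate;

/*
 * One pending-update list head per SLRU buffer slot (this would have to live
 * in shared memory in reality).  A page whose list is non-empty cannot be
 * evicted until the pending updates have been applied and their waiters
 * released.
 */
static ClogPendingUpdate **clog_pending_updates;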

I agree with Simon that it's probably a good idea for this
optimization to handle cases where a backend has a non-overflowed list
of subtransactions.  That seems doable.  Handling the case where the
subxid list has overflowed seems unimportant; it should happen rarely
and is therefore not performance-critical.  Also, handling the case
where the XIDs are spread over multiple pages seems far too
complicated to be worth the effort of trying to fit into a "fast
path".  Optimizing the case where there are 1+ XIDs that need to be
updated but all on the same page should cover well over 90% of commits
on real systems, very possibly over 99%.  That should be plenty good
enough to get whatever contention-reduction benefit is possible here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Wed, Dec 2, 2015 at 8:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>
> I think the approach this patch takes is pretty darned strange, and
> almost certainly not what we want.  What you're doing here is putting
> every outstanding CLOG-update request into a linked list, and then the
> leader goes and does all of those CLOG updates.  But there's no
> guarantee that the pages that need to be updated are even present in a
> CLOG buffer.  If it turns out that all of the batched CLOG updates are
> part of resident pages, then this is going to work great, just like
> the similar ProcArrayLock optimization.  But if the pages are not
> resident, then you will get WORSE concurrency and SLOWER performance
> than the status quo.  The leader will sit there and read every page
> that is needed, and to do that it will repeatedly release and
> reacquire CLogControlLock (inside SimpleLruReadPage).  If you didn't
> have a leader, the reads of all those pages could happen at the same
> time, but with this design, they get serialized.  That's not good.
>

I think the way to address is don't add backend to Group list if it is
not intended to update the same page as Group leader.  For transactions
to be on different pages, they have to be 32768 transactionid's far apart
and I don't see much possibility of that happening for concurrent
transactions that are going to be grouped.
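
(The 32768 figure comes straight from the clog page arithmetic in clog.c: two
status bits per transaction on a page of the default 8kB BLCKSZ.)

/* From clog.c */
#define CLOG_BITS_PER_XACT      2
#define CLOG_XACTS_PER_BYTE     4
#define CLOG_XACTS_PER_PAGE     (BLCKSZ * CLOG_XACTS_PER_BYTE)   /* 8192 * 4 = 32768 */

#define TransactionIdToPage(xid)    ((xid) / (TransactionId) CLOG_XACTS_PER_PAGE)

/*
 * So two xids land on different clog pages only if they are at least
 * 32768 transaction ids apart, which is unlikely for transactions that
 * commit close enough together to end up in the same group.
 */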

> My idea for how this could possibly work is that you could have a list
> of waiting backends for each SLRU buffer page.
>

Won't this mean that first we need to ensure that page exists in one of
the buffers and once we have page in SLRU buffer, we can form the
list and ensure that before eviction, the list must be processed?
If my understanding is right, then for this to work we need to probably
acquire CLogControlLock in Shared mode in addition to acquiring it
in Exclusive mode for updating the status on page and performing
pending updates for other backends.

>
> I agree with Simon that it's probably a good idea for this
> optimization to handle cases where a backend has a non-overflowed list
> of subtransactions.  That seems doable.
>

Agreed and I have already handled it in the last version of patch posted
by me.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Thu, Dec 3, 2015 at 1:48 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think the way to address is don't add backend to Group list if it is
> not intended to update the same page as Group leader.  For transactions
> to be on different pages, they have to be 32768 transactionid's far apart
> and I don't see much possibility of that happening for concurrent
> transactions that are going to be grouped.

That might work.

>> My idea for how this could possibly work is that you could have a list
>> of waiting backends for each SLRU buffer page.
>
> Won't this mean that first we need to ensure that page exists in one of
> the buffers and once we have page in SLRU buffer, we can form the
> list and ensure that before eviction, the list must be processed?
> If my understanding is right, then for this to work we need to probably
> acquire CLogControlLock in Shared mode in addition to acquiring it
> in Exclusive mode for updating the status on page and performing
> pending updates for other backends.

Hmm, that wouldn't be good.  You're right: this is a problem with my
idea.  We can try what you suggested above and see how that works.  We
could also have two or more slots for groups - if a backend doesn't
get the lock, it joins the existing group for the same page, or else
creates a new group if any slot is unused.  I think it might be
advantageous to have at least two groups because otherwise things
might slow down when some transactions are rolling over to a new page
while others are still in flight for the previous page.  Perhaps we
should try it both ways and benchmark.
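
(A rough sketch of what the multi-slot variant could look like; the slot count
and all names here are illustrative, and the array would have to live in shared
memory, e.g. hanging off PROC_HDR.)

#define NUM_CLOG_UPDATE_GROUPS  2       /* hypothetical */

typedef struct ClogUpdateGroup
{
    pg_atomic_uint32 first;     /* pgprocno of the list head, or INVALID_PGPROCNO */
    int              pageno;    /* clog page this group is updating */
} ClogUpdateGroup;

/*
 * A backend that cannot get CLogControlLock joins the group whose pageno
 * matches its own, claims an unused slot if none matches, and otherwise
 * falls back to the ordinary update path.
 */
static ClogUpdateGroup clog_update_groups[NUM_CLOG_UPDATE_GROUPS];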

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Wed, Dec 9, 2015 at 1:02 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Dec 3, 2015 at 1:48 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I think the way to address is don't add backend to Group list if it is
> > not intended to update the same page as Group leader.  For transactions
> > to be on different pages, they have to be 32768 transactionid's far apart
> > and I don't see much possibility of that happening for concurrent
> > transactions that are going to be grouped.
>
> That might work.
>

Okay, attached patch group_update_clog_v3.patch implements the above.

> >> My idea for how this could possibly work is that you could have a list
> >> of waiting backends for each SLRU buffer page.
> >
> > Won't this mean that first we need to ensure that page exists in one of
> > the buffers and once we have page in SLRU buffer, we can form the
> > list and ensure that before eviction, the list must be processed?
> > If my understanding is right, then for this to work we need to probably
> > acquire CLogControlLock in Shared mode in addition to acquiring it
> > in Exclusive mode for updating the status on page and performing
> > pending updates for other backends.
>
> Hmm, that wouldn't be good.  You're right: this is a problem with my
> idea.  We can try what you suggested above and see how that works.  We
> could also have two or more slots for groups - if a backend doesn't
> get the lock, it joins the existing group for the same page, or else
> creates a new group if any slot is unused.
>

I have implemented this idea as well in the attached patch
group_slots_update_clog_v3.patch

>  I think it might be
> advantageous to have at least two groups because otherwise things
> might slow down when some transactions are rolling over to a new page
> while others are still in flight for the previous page.  Perhaps we
> should try it both ways and benchmark.
>

Sure, I can do the benchmarks with both the patches, but before that
if you can once check whether group_slots_update_clog_v3.patch is inline
with what you have in mind then it will be helpful.

Note - I have used attached patch transaction_burner_v1.patch (extracted
from Jeff's patch upthread) to verify the transactions that fall into different
page boundaries. 

Attachment

Re: Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Sat, Dec 12, 2015 at 8:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

>>  I think it might be
>> advantageous to have at least two groups because otherwise things
>> might slow down when some transactions are rolling over to a new page
>> while others are still in flight for the previous page.  Perhaps we
>> should try it both ways and benchmark.
>>
>
> Sure, I can do the benchmarks with both the patches, but before that
> if you can once check whether group_slots_update_clog_v3.patch is inline
> with what you have in mind then it will be helpful.

Benchmarking sounds good.  This looks broadly like what I was thinking
about, although I'm not very sure you've got all the details right.

Some random comments:

- TransactionGroupUpdateXidStatus could do just as well without
add_proc_to_group.  You could just say if (group_no >= NUM_GROUPS)
break; instead.  Also, I think you could combine the two if statements
inside the loop.  if (nextidx != INVALID_PGPROCNO &&
ProcGlobal->allProcs[nextidx].clogPage == proc->clogPage) break; or
something like that.

- memberXid and memberXidstatus are terrible names.  Member of what?
That's going to be clear as mud to the next person looking at the
definition of PGPROC.  And the capitalization of memberXidstatus isn't
even consistent.  Nor is nextupdateXidStatusElem.  Please do give some
thought to the names you pick for variables and structure members.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Thu, Dec 17, 2015 at 12:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Dec 12, 2015 at 8:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

>>  I think it might be
>> advantageous to have at least two groups because otherwise things
>> might slow down when some transactions are rolling over to a new page
>> while others are still in flight for the previous page.  Perhaps we
>> should try it both ways and benchmark.
>>
>
> Sure, I can do the benchmarks with both the patches, but before that
> if you can once check whether group_slots_update_clog_v3.patch is inline
> with what you have in mind then it will be helpful.

Benchmarking sounds good.  This looks broadly like what I was thinking
about, although I'm not very sure you've got all the details right.


Unfortunately, I didn't have access to the high-end Intel machine on which I
took the performance data last time, so I ran these tests on a POWER-8 machine
whose I/O subsystem is not that good.  As a result, the write performance data
at a lower scale factor like 300 is reasonably good, whereas at higher scale
factors (>= 1000) the test is mainly I/O bound and there is not much difference
with or without the patch.

Performance Data
-----------------------------
M/c configuration:
IBM POWER-8 24 cores, 192 hardware threads
RAM = 492GB

Non-default parameters
------------------------------------
max_connections = 300
shared_buffers=32GB
min_wal_size=10GB
max_wal_size=15GB
checkpoint_timeout    =35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 256MB

Attached files show the performance data with both the patches at
scale factor 300 and 1000.

Read Patch-1 and Patch-2 in graphs as below:

Patch-1 - group_update_clog_v3.patch
Patch-2 - group_slots_update_v3.patch

Observations
----------------------
1. At scale factor 300, there is a gain of 11% at 128 clients and
27% at 256 clients with Patch-1.  At 4 clients, the performance with the
patch is 0.6% less (which might be a run-to-run variation or a small
regression, but I think it is too small to be bothered about).

2. At scale factor 1000, there is no visible difference; at lower client
counts there is a <1% regression, which could be due to the I/O-bound
nature of the test.

3. On these runs, Patch-2 is almost always worse than Patch-1, but
the difference between them is not significant.
 
Some random comments:

- TransactionGroupUpdateXidStatus could do just as well without
add_proc_to_group.  You could just say if (group_no >= NUM_GROUPS)
break; instead.  Also, I think you could combine the two if statements
inside the loop.  if (nextidx != INVALID_PGPROCNO &&
ProcGlobal->allProcs[nextidx].clogPage == proc->clogPage) break; or
something like that.

- memberXid and memberXidstatus are terrible names.  Member of what?

How about changing them to clogGroupMemberXid and
clogGroupMemberXidStatus?
 
That's going to be clear as mud to the next person looking at the
definition of PGPROC.

I understand that you don't like the naming convention, but using
such harsh language could sometimes hurt others.
 
  And the capitalization of memberXidstatus isn't
even consistent.  Nor is nextupdateXidStatusElem.  Please do give some
thought to the names you pick for variables and structure members.


Got it, I will do so.

Let me know what you think about whether we need to proceed with slots
approach and try some more performance data?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Fri, Dec 18, 2015 at 1:16 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> 1. At scale factor 300, there is a gain of 11% at 128 clients and
> 27% at 256 clients with Patch-1.  At 4 clients, the performance with the
> patch is 0.6% less (which might be a run-to-run variation or a small
> regression, but I think it is too small to be bothered about).
>
> 2. At scale factor 1000, there is no visible difference; at lower client
> counts there is a <1% regression, which could be due to the I/O-bound
> nature of the test.
>
> 3. On these runs, Patch-2 is almost always worse than Patch-1, but
> the difference between them is not significant.

Hmm, that's interesting.  So the slots don't help.  I was concerned
that with only a single slot, you might have things moving quickly
until you hit the point where you switch over to the next clog
segment, and then you get a bad stall.  It sounds like that either
doesn't happen in practice, or more likely it does happen but the
extra slot doesn't eliminate the stall because there's I/O at that
point.  Either way, it sounds like we can forget the slots idea for
now.

>> Some random comments:
>>
>> - TransactionGroupUpdateXidStatus could do just as well without
>> add_proc_to_group.  You could just say if (group_no >= NUM_GROUPS)
>> break; instead.  Also, I think you could combine the two if statements
>> inside the loop.  if (nextidx != INVALID_PGPROCNO &&
>> ProcGlobal->allProcs[nextidx].clogPage == proc->clogPage) break; or
>> something like that.
>>
>> - memberXid and memberXidstatus are terrible names.  Member of what?
>
> How about changing them to clogGroupMemberXid and
> clogGroupMemberXidStatus?

What we've currently got for group XID clearing for the ProcArray is
clearXid, nextClearXidElem, and backendLatestXid.  We should try to
make these things consistent.  Maybe rename those to
procArrayGroupMember, procArrayGroupNext, procArrayGroupXid and then
start all of these identifiers with clogGroup as you propose.

>> That's going to be clear as mud to the next person looking at the
>> definition of PGPROC.
>
> I understand that you don't like the naming convention, but using
> such harsh language could sometimes hurt others.

Sorry.  If I am slightly frustrated here I think it is because this
same point has been raised about three times now, by me and also by
Andres, just with respect to this particular technique, and also on
other patches.  But you are right - that is no excuse for being rude.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Fri, Dec 18, 2015 at 9:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Dec 18, 2015 at 1:16 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> >> Some random comments:
> >>
> >> - TransactionGroupUpdateXidStatus could do just as well without
> >> add_proc_to_group.  You could just say if (group_no >= NUM_GROUPS)
> >> break; instead.  Also, I think you could combine the two if statements
> >> inside the loop.  if (nextidx != INVALID_PGPROCNO &&
> >> ProcGlobal->allProcs[nextidx].clogPage == proc->clogPage) break; or
> >> something like that.
> >>

Changed as per suggestion.

> >> - memberXid and memberXidstatus are terrible names.  Member of what?
> >
> > How about changing them to clogGroupMemberXid and
> > clogGroupMemberXidStatus?
>
> What we've currently got for group XID clearing for the ProcArray is
> clearXid, nextClearXidElem, and backendLatestXid.  We should try to
> make these things consistent.  Maybe rename those to
> procArrayGroupMember, procArrayGroupNext, procArrayGroupXid
>

Here procArrayGroupXid sounds like Xid at group level, how about
procArrayGroupMemberXid?
Find the patch with renamed variables for PGProc
(rename_pgproc_variables_v1.patch) attached with mail.

> and then
> start all of these identifiers with clogGroup as you propose.
>

I have changed them accordingly in the attached patch
(group_update_clog_v4.patch)  and addressed other comments given by
you.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Mon, Dec 21, 2015 at 1:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Dec 18, 2015 at 9:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Fri, Dec 18, 2015 at 1:16 AM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>>
>> >> Some random comments:
>> >>
>> >> - TransactionGroupUpdateXidStatus could do just as well without
>> >> add_proc_to_group.  You could just say if (group_no >= NUM_GROUPS)
>> >> break; instead.  Also, I think you could combine the two if statements
>> >> inside the loop.  if (nextidx != INVALID_PGPROCNO &&
>> >> ProcGlobal->allProcs[nextidx].clogPage == proc->clogPage) break; or
>> >> something like that.
>> >>
>
> Changed as per suggestion.
>
>> >> - memberXid and memberXidstatus are terrible names.  Member of what?
>> >
>> > How about changing them to clogGroupMemberXid and
>> > clogGroupMemberXidStatus?
>>
>> What we've currently got for group XID clearing for the ProcArray is
>> clearXid, nextClearXidElem, and backendLatestXid.  We should try to
>> make these things consistent.  Maybe rename those to
>> procArrayGroupMember, procArrayGroupNext, procArrayGroupXid
>>
>
> Here procArrayGroupXid sounds like Xid at group level, how about
> procArrayGroupMemberXid?
> Find the patch with renamed variables for PGProc
> (rename_pgproc_variables_v1.patch) attached with mail.

I sort of hate to make these member names any longer, but I wonder if
we should make it procArrayGroupClearXid etc.  Otherwise it might be
confused with some other type of grouping of PGPROCs.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Tue, Dec 22, 2015 at 10:43 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Dec 21, 2015 at 1:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Fri, Dec 18, 2015 at 9:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >>
> >> On Fri, Dec 18, 2015 at 1:16 AM, Amit Kapila <amit.kapila16@gmail.com>
> >> wrote:
> >>
> >> >> Some random comments:
> >> >>
> >> >> - TransactionGroupUpdateXidStatus could do just as well without
> >> >> add_proc_to_group.  You could just say if (group_no >= NUM_GROUPS)
> >> >> break; instead.  Also, I think you could combine the two if statements
> >> >> inside the loop.  if (nextidx != INVALID_PGPROCNO &&
> >> >> ProcGlobal->allProcs[nextidx].clogPage == proc->clogPage) break; or
> >> >> something like that.
> >> >>
> >
> > Changed as per suggestion.
> >
> >> >> - memberXid and memberXidstatus are terrible names.  Member of what?
> >> >
> >> > How about changing them to clogGroupMemberXid and
> >> > clogGroupMemberXidStatus?
> >>
> >> What we've currently got for group XID clearing for the ProcArray is
> >> clearXid, nextClearXidElem, and backendLatestXid.  We should try to
> >> make these things consistent.  Maybe rename those to
> >> procArrayGroupMember, procArrayGroupNext, procArrayGroupXid
> >>
> >
> > Here procArrayGroupXid sounds like Xid at group level, how about
> > procArrayGroupMemberXid?
> > Find the patch with renamed variables for PGProc
> > (rename_pgproc_variables_v1.patch) attached with mail.
>
> I sort of hate to make these member names any longer, but I wonder if
> we should make it procArrayGroupClearXid etc.

If we go by this suggestion, then the name will look like:
PGProc
{
..
bool procArrayGroupClearXid, pg_atomic_uint32 procArrayGroupNextClearXid,
TransactionId procArrayGroupLatestXid;
..

PROC_HDR
{
..
pg_atomic_uint32 procArrayGroupFirstClearXid;
..
}

I think whatever I sent in last patch were better.  It seems to me it is
better to add some comments before variable names, so that anybody
referring them can understand better and I have added comments in
attached patch rename_pgproc_variables_v2.patch to explain the same. 
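
(For illustration, the kind of commented field layout being argued for might
read like this; it is only a sketch, with the comments being illustrative and
the field names following the naming convention discussed here.)

    /* In PGPROC: support for group transaction-id clearing (sketch) */
    bool             procArrayGroupMember;    /* true if waiting in the group list */
    pg_atomic_uint32 procArrayGroupNext;      /* pgprocno of next member, or INVALID_PGPROCNO */
    TransactionId    procArrayGroupMemberXid; /* latest xid (main or subxact) to clear */

    /* In PROC_HDR: head of the list of procs waiting for group XID clearing */
    pg_atomic_uint32 procArrayGroupFirst;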

>  Otherwise it might be
> confused with some other type of grouping of PGPROCs.
>

Won't procArray prefix distinguish it from other type of groupings?

About the CLogControlLock patch, yesterday I noticed that the SLRU code
can return an error in some cases, which could lead to a hang for group
members, as once the group leader errors out, there is no one to wake
them up.  However, on looking further into the code, I found that this
path (TransactionIdSetPageStatus()) is always called in a critical
section (RecordTransactionCommit() ensures the same), so any ERROR in
this path will be converted to a PANIC, which doesn't require any wakeup
mechanism for group members.  In any case, if you find any other way
which can lead to an error (one not converted to a PANIC), I have already
handled the error case in the attached patch
(group_update_clog_error_handling_v4.patch), and if you also don't
find any such case, then the previous patch stands good.
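
(To spell out the critical-section point, a minimal sketch, not the real
RecordTransactionCommit(), is below: elog.c promotes any ERROR raised while
CritSectionCount > 0 to PANIC, so a failure inside the status update cannot
strand waiting group members.)

static void
commit_status_update_sketch(TransactionId xid, int nchildren,
                            TransactionId *children)
{
    START_CRIT_SECTION();

    /* Ends up in TransactionIdSetPageStatus(), via the group path or not. */
    TransactionIdCommitTree(xid, nchildren, children);

    END_CRIT_SECTION();
}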



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: Speed up Clog Access by increasing CLOG buffers

From
Michael Paquier
Date:
On Wed, Dec 23, 2015 at 6:16 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> blah.

autovacuum log: Moved to next CF as thread is really active.
-- 
Michael



Re: Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Wed, Dec 23, 2015 at 1:16 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Dec 22, 2015 at 10:43 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Mon, Dec 21, 2015 at 1:27 AM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>> > On Fri, Dec 18, 2015 at 9:58 PM, Robert Haas <robertmhaas@gmail.com>
>> > wrote:
>> >>
>> >> On Fri, Dec 18, 2015 at 1:16 AM, Amit Kapila <amit.kapila16@gmail.com>
>> >> wrote:
>> >>
>> >> >> Some random comments:
>> >> >>
>> >> >> - TransactionGroupUpdateXidStatus could do just as well without
>> >> >> add_proc_to_group.  You could just say if (group_no >= NUM_GROUPS)
>> >> >> break; instead.  Also, I think you could combine the two if
>> >> >> statements
>> >> >> inside the loop.  if (nextidx != INVALID_PGPROCNO &&
>> >> >> ProcGlobal->allProcs[nextidx].clogPage == proc->clogPage) break; or
>> >> >> something like that.
>> >> >>
>> >
>> > Changed as per suggestion.
>> >
>> >> >> - memberXid and memberXidstatus are terrible names.  Member of what?
>> >> >
>> >> > How about changing them to clogGroupMemberXid and
>> >> > clogGroupMemberXidStatus?
>> >>
>> >> What we've currently got for group XID clearing for the ProcArray is
>> >> clearXid, nextClearXidElem, and backendLatestXid.  We should try to
>> >> make these things consistent.  Maybe rename those to
>> >> procArrayGroupMember, procArrayGroupNext, procArrayGroupXid
>> >>
>> >
>> > Here procArrayGroupXid sounds like Xid at group level, how about
>> > procArrayGroupMemberXid?
>> > Find the patch with renamed variables for PGProc
>> > (rename_pgproc_variables_v1.patch) attached with mail.
>>
>> I sort of hate to make these member names any longer, but I wonder if
>> we should make it procArrayGroupClearXid etc.
>
> If we go by this suggestion, then the name will look like:
> PGProc
> {
> ..
> bool procArrayGroupClearXid, pg_atomic_uint32 procArrayGroupNextClearXid,
> TransactionId procArrayGroupLatestXid;
> ..
>
> PROC_HDR
> {
> ..
> pg_atomic_uint32 procArrayGroupFirstClearXid;
> ..
> }
>
> I think whatever I sent in last patch were better.  It seems to me it is
> better to add some comments before variable names, so that anybody
> referring them can understand better and I have added comments in
> attached patch rename_pgproc_variables_v2.patch to explain the same.

Well, I don't know.  Anybody else have an opinion?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Fri, Dec 25, 2015 at 6:36 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Dec 23, 2015 at 1:16 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >
>> > Here procArrayGroupXid sounds like Xid at group level, how about
>> > procArrayGroupMemberXid?
>> > Find the patch with renamed variables for PGProc
>> > (rename_pgproc_variables_v1.patch) attached with mail.
>>
>> I sort of hate to make these member names any longer, but I wonder if
>> we should make it procArrayGroupClearXid etc.
>
> If we go by this suggestion, then the name will look like:
> PGProc
> {
> ..
> bool procArrayGroupClearXid, pg_atomic_uint32 procArrayGroupNextClearXid,
> TransactionId procArrayGroupLatestXid;
> ..
>
> PROC_HDR
> {
> ..
> pg_atomic_uint32 procArrayGroupFirstClearXid;
> ..
> }
>
> I think whatever I sent in last patch were better.  It seems to me it is
> better to add some comments before variable names, so that anybody
> referring them can understand better and I have added comments in
> attached patch rename_pgproc_variables_v2.patch to explain the same.

Well, I don't know.  Anybody else have an opinion?


It seems that either people don't have any opinion on this matter or they
are okay with either of the naming conventions being discussed.  I think
specifying Member after procArrayGroup can help distinguishing which
variables are specific to the whole group and which are specific to a
particular member.  I think that will be helpful for other places as well
if we use this technique to improve performance.  Let me know what
you think about the same.

I have verified that previous patches can be applied cleanly and passes
make check-world.  To avoid confusion, I am attaching the latest
patches with this mail.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: Speed up Clog Access by increasing CLOG buffers

From
Thom Brown
Date:
On 7 January 2016 at 05:24, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Dec 25, 2015 at 6:36 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Wed, Dec 23, 2015 at 1:16 AM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>> >> >
>> >> > Here procArrayGroupXid sounds like Xid at group level, how about
>> >> > procArrayGroupMemberXid?
>> >> > Find the patch with renamed variables for PGProc
>> >> > (rename_pgproc_variables_v1.patch) attached with mail.
>> >>
>> >> I sort of hate to make these member names any longer, but I wonder if
>> >> we should make it procArrayGroupClearXid etc.
>> >
>> > If we go by this suggestion, then the name will look like:
>> > PGProc
>> > {
>> > ..
>> > bool procArrayGroupClearXid, pg_atomic_uint32
>> > procArrayGroupNextClearXid,
>> > TransactionId procArrayGroupLatestXid;
>> > ..
>> >
>> > PROC_HDR
>> > {
>> > ..
>> > pg_atomic_uint32 procArrayGroupFirstClearXid;
>> > ..
>> > }
>> >
>> > I think whatever I sent in last patch were better.  It seems to me it is
>> > better to add some comments before variable names, so that anybody
>> > referring them can understand better and I have added comments in
>> > attached patch rename_pgproc_variables_v2.patch to explain the same.
>>
>> Well, I don't know.  Anybody else have an opinion?
>>
>
> It seems that either people don't have any opinion on this matter or they
> are okay with either of the naming conventions being discussed.  I think
> specifying Member after procArrayGroup can help distinguishing which
> variables are specific to the whole group and which are specific to a
> particular member.  I think that will be helpful for other places as well
> if we use this technique to improve performance.  Let me know what
> you think about the same.
>
> I have verified that previous patches can be applied cleanly and passes
> make check-world.  To avoid confusion, I am attaching the latest
> patches with this mail.

Patches still apply 1 month later.

I don't really have an opinion on the variable naming.  I guess they
only need making longer if there's going to be some confusion about
what they're for, but I'm guessing it's not a blocker here.

Thom



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Tue, Feb 9, 2016 at 7:26 PM, Thom Brown <thom@linux.com> wrote:
>
> On 7 January 2016 at 05:24, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Fri, Dec 25, 2015 at 6:36 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> >>
> >> On Wed, Dec 23, 2015 at 1:16 AM, Amit Kapila <amit.kapila16@gmail.com>
> >> wrote:
> >> >> >
> >> >> > Here procArrayGroupXid sounds like Xid at group level, how about
> >> >> > procArrayGroupMemberXid?
> >> >> > Find the patch with renamed variables for PGProc
> >> >> > (rename_pgproc_variables_v1.patch) attached with mail.
> >> >>
> >> >> I sort of hate to make these member names any longer, but I wonder if
> >> >> we should make it procArrayGroupClearXid etc.
> >> >
> >> > If we go by this suggestion, then the name will look like:
> >> > PGProc
> >> > {
> >> > ..
> >> > bool procArrayGroupClearXid, pg_atomic_uint32
> >> > procArrayGroupNextClearXid,
> >> > TransactionId procArrayGroupLatestXid;
> >> > ..
> >> >
> >> > PROC_HDR
> >> > {
> >> > ..
> >> > pg_atomic_uint32 procArrayGroupFirstClearXid;
> >> > ..
> >> > }
> >> >
> >> > I think whatever I sent in last patch were better.  It seems to me it is
> >> > better to add some comments before variable names, so that anybody
> >> > referring them can understand better and I have added comments in
> >> > attached patch rename_pgproc_variables_v2.patch to explain the same.
> >>
> >> Well, I don't know.  Anybody else have an opinion?
> >>
> >
> > It seems that either people don't have any opinion on this matter or they
> > are okay with either of the naming conventions being discussed.  I think
> > specifying Member after procArrayGroup can help distinguishing which
> > variables are specific to the whole group and which are specific to a
> > particular member.  I think that will be helpful for other places as well
> > if we use this technique to improve performance.  Let me know what
> > you think about the same.
> >
> > I have verified that previous patches can be applied cleanly and passes
> > make check-world.  To avoid confusion, I am attaching the latest
> > patches with this mail.
>
> Patches still apply 1 month later.
>

Thanks for verification!
 
>
> I don't really have an opinion on the variable naming.  I guess they
> only need making longer if there's going to be some confusion about
> what they're for,

makes sense, that is the reason why I have added few comments
as well, but not sure if you are suggesting something else.

> but I'm guessing it's not a blocker here.
>

I also think so, but not sure what else is required here.  The basic
idea of this rename_pgproc_variables_v2.patch is to rename
few variables in existing similar code, so that the main patch
group_update_clog can adapt those naming convention if required,
other than that I have handled all review comments raised in this
thread (mainly by Simon and Robert).

Is there anything, I can do to move this forward?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Tue, Feb 9, 2016 at 11:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Patches still apply 1 month later.
>
> Thanks for verification!
>
>>
>> I don't really have an opinion on the variable naming.  I guess they
>> only need making longer if there's going to be some confusion about
>> what they're for,
>
> makes sense, that is the reason why I have added few comments
> as well, but not sure if you are suggesting something else.
>
>> but I'm guessing it's not a blocker here.
>>
>
> I also think so, but not sure what else is required here.  The basic
> idea of this rename_pgproc_variables_v2.patch is to rename
> few variables in existing similar code, so that the main patch
> group_update_clog can adapt those naming convention if required,
> other than that I have handled all review comments raised in this
> thread (mainly by Simon and Robert).
>
> Is there anything, I can do to move this forward?

Well, looking at this again, I think I'm OK to go with your names.
That doesn't seem like the thing to hold up the patch for.  So I'll go
ahead and push the renaming patch now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Thu, Feb 11, 2016 at 9:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Is there anything, I can do to move this forward?
>
> Well, looking at this again, I think I'm OK to go with your names.
> That doesn't seem like the thing to hold up the patch for.  So I'll go
> ahead and push the renaming patch now.

On the substantive part of the patch, this doesn't look safe:

+    /*
+     * Add ourselves to the list of processes needing a group XID status
+     * update.
+     */
+    proc->clogGroupMember = true;
+    proc->clogGroupMemberXid = xid;
+    proc->clogGroupMemberXidStatus = status;
+    proc->clogGroupMemberPage = pageno;
+    proc->clogGroupMemberLsn = lsn;
+    while (true)
+    {
+        nextidx = pg_atomic_read_u32(&procglobal->clogGroupFirst);
+
+        /*
+         * Add the proc to list if the clog page where we need to update the
+         * current transaction status is same as group leader's clog page.
+         */
+        if (nextidx != INVALID_PGPROCNO &&
+            ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage)
+            return false;

DANGER HERE!

+        pg_atomic_write_u32(&proc->clogGroupNext, nextidx);
+
+        if (pg_atomic_compare_exchange_u32(&procglobal->clogGroupFirst,
+                                           &nextidx,
+                                           (uint32) proc->pgprocno))
+            break;
+    }

There is a potential ABA problem here.  Suppose that this code
executes in one process as far as the line that says DANGER HERE.
Then, the group leader wakes up, performs all of the CLOG
modifications, performs another write transaction, and again becomes
the group leader, but for a different member page.  Then, the original
process that went to sleep at DANGER HERE wakes up.  At this point,
the pg_atomic_compare_exchange_u32 will succeed and we'll have
processes with different pages in the list, contrary to the intention
of the code.

This kind of thing can be really subtle and difficult to fix.  The
problem might not happen even with a very large amount of testing, and
then might happen in the real world on some other hardware or on
really rare occasions.  In general, compare-and-swap loops need to be
really really simple with minimal dependencies on other data, ideally
none.  It's very hard to make anything else work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Thu, Feb 11, 2016 at 8:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On the substantive part of the patch, this doesn't look safe:
>
> +    /*
> +     * Add ourselves to the list of processes needing a group XID status
> +     * update.
> +     */
> +    proc->clogGroupMember = true;
> +    proc->clogGroupMemberXid = xid;
> +    proc->clogGroupMemberXidStatus = status;
> +    proc->clogGroupMemberPage = pageno;
> +    proc->clogGroupMemberLsn = lsn;
> +    while (true)
> +    {
> +        nextidx = pg_atomic_read_u32(&procglobal->clogGroupFirst);
> +
> +        /*
> +         * Add the proc to list if the clog page where we need to update the
> +         * current transaction status is same as group leader's clog page.
> +         */
> +        if (nextidx != INVALID_PGPROCNO &&
> +            ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage)
> +            return false;
>
> DANGER HERE!
>
> +        pg_atomic_write_u32(&proc->clogGroupNext, nextidx);
> +
> +        if (pg_atomic_compare_exchange_u32(&procglobal->clogGroupFirst,
> +                                           &nextidx,
> +                                           (uint32) proc->pgprocno))
> +            break;
> +    }
>
> There is a potential ABA problem here.  Suppose that this code
> executes in one process as far as the line that says DANGER HERE.
> Then, the group leader wakes up, performs all of the CLOG
> modifications, performs another write transaction, and again becomes
> the group leader, but for a different member page.  Then, the original
> process that went to sleep at DANGER HERE wakes up.  At this point,
> the pg_atomic_compare_exchange_u32 will succeed and we'll have
> processes with different pages in the list, contrary to the intention
> of the code.
>

Very Good Catch.  I think if we want to address this we can detect
the non-group leader transactions that tries to update the different
CLOG page (different from group-leader) after acquiring
CLogControlLock and then mark these transactions such that
after waking they need to perform CLOG update via normal path.
Now this can decrease the latency of such transactions, but I
think there will be only very few transactions if at-all there which
can face this condition, because most of the concurrent transactions
should be on same page, otherwise the idea of multiple-slots we
have tried upthread would have shown benefits.
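
(A sketch of that first alternative is below; clogGroupMemberPage and
clogGroupNext are the fields from the patch, while set_status_on_locked_page()
and clogGroupRetryUpdate are purely hypothetical names used for illustration.)

/*
 * Leader-side handling, sketched: walk the group list while holding
 * CLogControlLock exclusively, apply the updates that are on the leader's
 * page, and flag any member that slipped in with a different page (the
 * ABA case) so that it retries via the normal path after it is woken.
 */
while (nextidx != INVALID_PGPROCNO)
{
    PGPROC *member = &ProcGlobal->allProcs[nextidx];

    if (member->clogGroupMemberPage == leader->clogGroupMemberPage)
        set_status_on_locked_page(member);      /* hypothetical helper */
    else
        member->clogGroupRetryUpdate = true;    /* hypothetical flag */

    nextidx = pg_atomic_read_u32(&member->clogGroupNext);
    /* the member is woken via its semaphore afterwards (omitted) */
}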
Another idea could be that we update the comments indicating the
possibility of multiple Clog-page updates in same group on the basis
that such cases will be less and even if it happens, it won't effect the
transaction status update.

Do you have anything else in mind?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Fri, Feb 12, 2016 at 12:55 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Very Good Catch.  I think if we want to address this we can detect
> the non-group leader transactions that tries to update the different
> CLOG page (different from group-leader) after acquiring
> CLogControlLock and then mark these transactions such that
> after waking they need to perform CLOG update via normal path.
> Now this can decrease the latency of such transactions, but I

I think you mean "increase".

> think there will be only very few transactions if at-all there which
> can face this condition, because most of the concurrent transactions
> should be on same page, otherwise the idea of multiple-slots we
> have tried upthread would have shown benefits.
> Another idea could be that we update the comments indicating the
> possibility of multiple Clog-page updates in same group on the basis
> that such cases will be less and even if it happens, it won't effect the
> transaction status update.

I think either approach of those approaches could work, as long as the
logic is correct and the comments are clear.  The important thing is
that the code had better do something safe if this situation ever
occurs, and the comments had better be clear that this is a possible
situation so that someone modifying the code in the future doesn't
think it's impossible, rely on it not happening, and consequently
introduce a very-low-probability bug.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Sat, Feb 13, 2016 at 10:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Feb 12, 2016 at 12:55 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Very Good Catch.  I think if we want to address this we can detect
> > the non-group leader transactions that tries to update the different
> > CLOG page (different from group-leader) after acquiring
> > CLogControlLock and then mark these transactions such that
> > after waking they need to perform CLOG update via normal path.
> > Now this can decrease the latency of such transactions, but I
>
> I think you mean "increase".
>

Yes.

> > think there will be only very few transactions if at-all there which
> > can face this condition, because most of the concurrent transactions
> > should be on same page, otherwise the idea of multiple-slots we
> > have tried upthread would have shown benefits.
> > Another idea could be that we update the comments indicating the
> > possibility of multiple Clog-page updates in same group on the basis
> > that such cases will be less and even if it happens, it won't effect the
> > transaction status update.
>
> I think either approach of those approaches could work, as long as the
> logic is correct and the comments are clear.  The important thing is
> that the code had better do something safe if this situation ever
> occurs, and the comments had better be clear that this is a possible
> situation so that someone modifying the code in the future doesn't
> think it's impossible, rely on it not happening, and consequently
> introduce a very-low-probability bug.
>


Okay, I have updated the comments to mention this possibility and the
possible improvement, in case we ever face such a situation.  I have also
once again verified that even if a group contains transaction status
updates for multiple pages, it works fine.

Performance data with attached patch is as below.

M/c configuration
-----------------------------
RAM - 500GB
8 sockets, 64 cores (hyperthreaded, 128 threads total)

Non-default parameters
------------------------------------
max_connections = 300
shared_buffers=8GB
min_wal_size=10GB
max_wal_size=15GB
checkpoint_timeout    =35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 256MB

Client_Count/Patch_Ver      1      64     128     256
HEAD(481725c0)            963   28145   28593   26447
Patch-1                   938   28152   31703   29402


We can see 10~11% performance improvement as observed
previously.  You might see 0.02% performance difference with
patch as regression, but that is just a run-to-run variation.

Note - To take this performance data, I had to revert commit
ac1d7945, which is a known issue in HEAD as reported here [1].

[1] -


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Sun, Feb 21, 2016 at 10:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Client_Count/Patch_Ver      1      64     128     256
HEAD(481725c0)            963   28145   28593   26447
Patch-1                   938   28152   31703   29402


We can see 10~11% performance improvement as observed
previously.  You might see 0.02% performance difference with
patch as regression, but that is just a run-to-run variation.

Don't the single-client numbers show about a 3% regression?  Surely not 0.02%.
 
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Sun, Feb 21, 2016 at 12:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Feb 21, 2016 at 10:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Client_Count/Patch_Ver      1      64     128     256
HEAD(481725c0)            963   28145   28593   26447
Patch-1                   938   28152   31703   29402


We can see 10~11% performance improvement as observed
previously.  You might see 0.02% performance difference with
patch as regression, but that is just a run-to-run variation.

Don't the single-client numbers show about a 3% regression?  Surely not 0.02%.


Sorry, you are right, it is ~2.66%, but in read-write pgbench tests, I
could see such fluctuation.  Also patch doesn't change single-client
case.  However, if you still feel that there could be impact by patch,
I can re-run the single client case once again with different combinations
like first with HEAD and then patch and vice versa.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Sun, Feb 21, 2016 at 12:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sun, Feb 21, 2016 at 12:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Feb 21, 2016 at 10:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Client_Count/Patch_Ver      1      64     128     256
HEAD(481725c0)            963   28145   28593   26447
Patch-1                   938   28152   31703   29402


We can see 10~11% performance improvement as observed
previously.  You might see 0.02% performance difference with
patch as regression, but that is just a run-to-run variation.

Don't the single-client numbers show about a 3% regression?  Surely not 0.02%.


Sorry, you are right, it is ~2.66%, but in read-write pgbench tests, I
could see such fluctuation.  Also patch doesn't change single-client
case.  However, if you still feel that there could be impact by patch,
I can re-run the single client case once again with different combinations
like first with HEAD and then patch and vice versa.

Are these results from a single run, or median-of-three?

I mean, my basic feeling is that I would not accept a 2-3% regression in the single client case to get a 10% speedup in the case where we have 128 clients.  A lot of people will not have 128 clients; quite a few will have a single session, or just a few.  Sometimes just making the code more complex can hurt performance in subtle ways, e.g. by making it fit into the L1 instruction cache less well.  If the numbers you have here are accurate, I'd vote to reject the patch.

Note that we already have apparently regressed single-client performance noticeably between 9.0 and 9.5:

http://bonesmoses.org/2016/01/08/pg-phriday-how-far-weve-come/

I bet that wasn't a single patch but a series of patches which made things more complex to improve concurrency behavior, but in the process each one made the single-client case a tiny bit slower.  In the end, that adds up.  I think we need to put some effort into figuring out if there is a way we can get some of that single-client performance (and ideally more) back.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Sun, Feb 21, 2016 at 2:28 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Feb 21, 2016 at 12:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sun, Feb 21, 2016 at 12:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Feb 21, 2016 at 10:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Client_Count/Patch_Ver 1 64 128 256
HEAD(481725c0) 963 28145 28593 26447
Patch-1 938 28152 31703 29402


We can see 10~11% performance improvement as observed
previously.  You might see 0.02% performance difference with
patch as regression, but that is just a run-to-run variation.

Don't the single-client numbers show about a 3% regression?  Surely not 0.02%.


Sorry, you are right, it is ~2.66%, but I do see such fluctuations in
read-write pgbench tests.  Also, the patch doesn't change the single-client
case.  However, if you still feel that the patch could have an impact,
I can re-run the single-client case once again in different orders,
first with HEAD and then the patch, and vice versa.

Are these results from a single run, or median-of-three?


This was median-of-three, but the highest TPS with the patch is 1119
and with HEAD it is 969, which shows a gain at single client-count.
Sometimes I see such differences; they could be due to autovacuum
getting triggered in some runs, which leads to such variations.
The difference generally disappears if I try 2-3 times, unless there
is some real regression, or if I switch off autovacuum and do a manual
vacuum after each run.  This time, I haven't re-run the tests multiple
times.
 
I mean, my basic feeling is that I would not accept a 2-3% regression in the single client case to get a 10% speedup in the case where we have 128 clients.

I understand your point.  I think to verify whether it is run-to-run
variation or an actual regression, I will re-run these tests on single
client multiple times and post the result.
 
  A lot of people will not have 128 clients; quite a few will have a single session, or just a few.  Sometimes just making the code more complex can hurt performance in subtle ways, e.g. by making it fit into the L1 instruction cache less well.  If the numbers you have here are accurate, I'd vote to reject the patch.


One point to note is that this patch, along with the first patch I
posted in this thread to increase CLOG buffers, can make a significant
reduction in contention on CLogControlLock.  OTOH, I think introducing
a regression at single-client is also not a sane thing to do, so let's
first try to find out whether there actually is a regression and, if
there is, whether we can mitigate it by writing the code with somewhat
fewer instructions or in a slightly different way; then we can decide
whether to reject the patch or not.  Does that sound reasonable to you?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Sun, Feb 21, 2016 at 7:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I mean, my basic feeling is that I would not accept a 2-3% regression in the single client case to get a 10% speedup in the case where we have 128 clients.

I understand your point.  I think to verify whether it is run-to-run
variation or an actual regression, I will re-run these tests on single
client multiple times and post the result.

Perhaps you could also try it on a couple of different machines (e.g. MacBook Pro and a couple of different large servers).
 

  A lot of people will not have 128 clients; quite a few will have a single session, or just a few.  Sometimes just making the code more complex can hurt performance in subtle ways, e.g. by making it fit into the L1 instruction cache less well.  If the numbers you have here are accurate, I'd vote to reject the patch.
One point to note is that this patch, along with the first patch I
posted in this thread to increase CLOG buffers, can make a significant
reduction in contention on CLogControlLock.  OTOH, I think introducing
a regression at single-client is also not a sane thing to do, so let's
first try to find out whether there actually is a regression and, if
there is, whether we can mitigate it by writing the code with somewhat
fewer instructions or in a slightly different way; then we can decide
whether to reject the patch or not.  Does that sound reasonable to you?

Yes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Tue, Feb 23, 2016 at 7:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Feb 21, 2016 at 7:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I mean, my basic feeling is that I would not accept a 2-3% regression in the single client case to get a 10% speedup in the case where we have 128 clients.


When I tried running pgbench first with the patch and then with HEAD, I saw a 1.2% performance increase with the patch.  TPS with the patch is 976 and with HEAD it is 964.  For the three 30-minute TPS runs, refer to "Patch – group_clog_update_v5" and, before that, "HEAD – Commit 481725c0" in perf_write_clogcontrollock_data_v6.ods attached to this mail.

Nonetheless, I have observed that the below new check added by the patch can affect single-client performance.  So I have changed it such that the new check is done only when there is actually a need for a group update, which means when multiple clients try to update the clog at the same time.

+       if (!InRecovery &&
+               all_trans_same_page &&
+               nsubxids < PGPROC_MAX_CACHED_SUBXIDS &&
+               !IsGXactActive())

 
I understand your point.  I think to verify whether it is run-to-run
variation or an actual regression, I will re-run these tests on single
client multiple times and post the result.

Perhaps you could also try it on a couple of different machines (e.g. MacBook Pro and a couple of different large servers).

 
Okay, I have tried the latest patch (group_update_clog_v6.patch) on 2 different big servers and then on a MacBook Pro.  The detailed data for the various runs can be found in the attached document perf_write_clogcontrollock_data_v6.ods.  I have taken the performance data for higher client counts with a somewhat larger scale factor (1000), and the medians are as below:

M/c configuration
-----------------------------
RAM - 500GB
8 sockets, 64 cores (Hyperthreaded, 128 threads total)

Non-default parameters
------------------------------------
max_connections = 1000
shared_buffers=32GB
min_wal_size=10GB
max_wal_size=15GB
checkpoint_timeout    =35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 256MB


Client_Count/Patch_ver 1 8 64 128 256
HEAD 871 5090 17760 17616 13907
PATCH 900 5110 18331 20277 19263


Here, we can see that there is a gain of ~15% to ~38% at higher client count.

The attached document (perf_write_clogcontrollock_data_v6.ods) contains data, mainly focussing on single client performance.  The data is for multiple runs on different machines, so I thought it is better to present in form of document rather than dumping everything in e-mail.  Do let me know if there is any confusion in understanding/interpreting the data.

Thanks to Dilip Kumar for helping me conduct the tests of this patch on the MacBook Pro.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Sat, Feb 27, 2016 at 10:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Here, we can see that there is a gain of ~15% to ~38% at higher client count.

The attached document (perf_write_clogcontrollock_data_v6.ods) contains data, mainly focussing on single client performance.  The data is for multiple runs on different machines, so I thought it is better to present in form of document rather than dumping everything in e-mail.  Do let me know if there is any confusion in understanding/interpreting the data.


Forgot to mention that all these tests have been done by reverting commit-ac1d794.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Fri, Feb 26, 2016 at 11:37 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sat, Feb 27, 2016 at 10:03 AM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>
>> Here, we can see that there is a gain of ~15% to ~38% at higher client
>> count.
>>
>> The attached document (perf_write_clogcontrollock_data_v6.ods) contains
>> data, mainly focussing on single client performance.  The data is for
>> multiple runs on different machines, so I thought it is better to present in
>> form of document rather than dumping everything in e-mail.  Do let me know
>> if there is any confusion in understanding/interpreting the data.
>
> Forgot to mention that all these tests have been done by reverting
> commit-ac1d794.

OK, that seems better.  But I have a question: if we don't really need
to make this optimization apply only when everything is on the same
page, then why even try?  If we didn't try, we wouldn't need the
all_trans_same_page flag, which would reduce the amount of code
change.  Would that hurt anything? Taking it even further, we could
remove the check from TransactionGroupUpdateXidStatus too.  I'd be
curious to know whether that set of changes would improve performance
or regress it.  Or maybe it does nothing, in which case perhaps
simpler is better.

All things being equal, it's probably better if the cases where
transactions from different pages get into the list together is
something that is more or less expected rather than a
once-in-a-blue-moon scenario - that way, if any bugs exist, we'll find
them.  The downside of that is that we could increase latency for the
leader that way - doing other work on the same page shouldn't hurt
much but different pages is a bigger hit.  But that hit might be
trivial enough not to be worth worrying about.

+       /*
+        * Now that we've released the lock, go back and wake everybody up.  We
+        * don't do this under the lock so as to keep lock hold times to a
+        * minimum.  The system calls we need to perform to wake other processes
+        * up are probably much slower than the simple memory writes
we did while
+        * holding the lock.
+        */

This comment was true in the place that you cut-and-pasted it from,
but it's not true here, since we potentially need to read from disk.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Mon, Feb 29, 2016 at 11:10 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Feb 26, 2016 at 11:37 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Sat, Feb 27, 2016 at 10:03 AM, Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> >>
> >> Here, we can see that there is a gain of ~15% to ~38% at higher client
> >> count.
> >>
> >> The attached document (perf_write_clogcontrollock_data_v6.ods) contains
> >> data, mainly focussing on single client performance.  The data is for
> >> multiple runs on different machines, so I thought it is better to present in
> >> form of document rather than dumping everything in e-mail.  Do let me know
> >> if there is any confusion in understanding/interpreting the data.
> >
> > Forgot to mention that all these tests have been done by reverting
> > commit-ac1d794.
>
> OK, that seems better.  But I have a question: if we don't really need
> to make this optimization apply only when everything is on the same
> page, then why even try?  If we didn't try, we wouldn't need the
> all_trans_same_page flag, which would reduce the amount of code
> change.

I am not sure I understood your question; do you want to know why, in the first place, transactions spanning more than one page call the function TransactionIdSetPageStatus()?  If we want to avoid attempting the transaction status update when they are on different pages, then I think we need some
major changes in TransactionIdSetTreeStatus().


>  Would that hurt anything? Taking it even further, we could
> remove the check from TransactionGroupUpdateXidStatus too.  I'd be
> curious to know whether that set of changes would improve performance
> or regress it.  Or maybe it does nothing, in which case perhaps
> simpler is better.
>
> All things being equal, it's probably better if the cases where
> transactions from different pages get into the list together is
> something that is more or less expected rather than a
> once-in-a-blue-moon scenario - that way, if any bugs exist, we'll find
> them.  The downside of that is that we could increase latency for the
> leader that way - doing other work on the same page shouldn't hurt
> much but different pages is a bigger hit.  But that hit might be
> trivial enough not to be worth worrying about.
>

In my tests, the same-page check in TransactionGroupUpdateXidStatus() doesn't impact performance in any way, and I think the reason is that it rarely happens that a group contains multiple pages, and even when it does, there is hardly much impact.  So, I will remove that check; I think that's what you want for now as well.

> +       /*
> +        * Now that we've released the lock, go back and wake everybody up.  We
> +        * don't do this under the lock so as to keep lock hold times to a
> +        * minimum.  The system calls we need to perform to wake other processes
> +        * up are probably much slower than the simple memory writes
> we did while
> +        * holding the lock.
> +        */
>
> This comment was true in the place that you cut-and-pasted it from,
> but it's not true here, since we potentially need to read from disk.
>

Okay, will change.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Mon, Feb 29, 2016 at 11:10 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Feb 26, 2016 at 11:37 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Sat, Feb 27, 2016 at 10:03 AM, Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> >>
> >> Here, we can see that there is a gain of ~15% to ~38% at higher client
> >> count.
> >>
> >> The attached document (perf_write_clogcontrollock_data_v6.ods) contains
> >> data, mainly focussing on single client performance.  The data is for
> >> multiple runs on different machines, so I thought it is better to present in
> >> form of document rather than dumping everything in e-mail.  Do let me know
> >> if there is any confusion in understanding/interpreting the data.
> >
> > Forgot to mention that all these tests have been done by reverting
> > commit-ac1d794.
>
> OK, that seems better.  But I have a question: if we don't really need
> to make this optimization apply only when everything is on the same
> page, then why even try?
>

This is to handle the case where sub-transactions belonging to a transaction are on different pages.  The reason is that currently I am using the XidCache stored in each proc to pass the sub-transaction information to TransactionIdSetPageStatusInternal(); if we allow sub-transactions from different pages, then I need to extract from that cache the subxids that belong to the page whose status we are trying to update.  That would add a few more cycles in the code path under the exclusive lock without any clear benefit, which is why I have not implemented it.  I have explained the same in the code comments as well:

+ * This optimization is only applicable if the transaction and
+ * all child sub-transactions belong to same page which we presume to be the
+ * most common case, we might be able to apply this when they are not on same
+ * page, but that needs us to map sub-transactions in proc's XidCache based
+ * on pageno for which each time Group leader needs to set the transaction
+ * status and that can lead to some performance penalty as well because it
+ * needs to be done after acquiring CLogControlLock, so let's leave that
+ * case for now.
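
To make the tradeoff concrete, here is a minimal, purely illustrative sketch (not part of the patch) of the extra bucketing step the group leader would need if we allowed sub-transactions from different pages.  The page-number arithmetic mirrors the private macros in clog.c (2 status bits per transaction, so BLCKSZ * 4 transactions per page); the function name and the caller-supplied output buffer are made up for illustration only.

#include "postgres.h"			/* TransactionId, BLCKSZ */

#define MY_CLOG_XACTS_PER_PAGE	(BLCKSZ * 4)	/* 2 status bits per xact */
#define MyTransactionIdToPage(xid) \
	((int) ((xid) / (TransactionId) MY_CLOG_XACTS_PER_PAGE))

/*
 * Collect, from one group member's cached subxids, only those that live on
 * the CLOG page currently being updated.  In a cross-page design the leader
 * would have to do this per member, per page, while holding CLogControlLock.
 */
static int
CollectSubxidsForPage(const TransactionId *subxids, int nsubxids,
					  int pageno, TransactionId *out)
{
	int			i;
	int			n = 0;

	for (i = 0; i < nsubxids; i++)
	{
		if (MyTransactionIdToPage(subxids[i]) == pageno)
			out[n++] = subxids[i];
	}
	return n;
}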




With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
David Steele
Date:
On 2/26/16 11:37 PM, Amit Kapila wrote:

> On Sat, Feb 27, 2016 at 10:03 AM, Amit Kapila <amit.kapila16@gmail.com
>
>     Here, we can see that there is a gain of ~15% to ~38% at higher
>     client count.
>
>     The attached document (perf_write_clogcontrollock_data_v6.ods)
>     contains data, mainly focussing on single client performance.  The
>     data is for multiple runs on different machines, so I thought it is
>     better to present in form of document rather than dumping everything
>     in e-mail.  Do let me know if there is any confusion in
>     understanding/interpreting the data.
>
> Forgot to mention that all these tests have been done by
> reverting commit-ac1d794.

This patch no longer applies cleanly:

$ git apply ../other/group_update_clog_v6.patch
error: patch failed: src/backend/storage/lmgr/proc.c:404
error: src/backend/storage/lmgr/proc.c: patch does not apply
error: patch failed: src/include/storage/proc.h:152
error: src/include/storage/proc.h: patch does not apply

It's not clear to me whether Robert has completed a review of this code 
or it still needs to be reviewed more comprehensively.

Other than a comment that needs to be fixed it seems that all questions 
have been answered by Amit.

Is this "ready for committer" or still in need of further review?

-- 
-David
david@pgmasters.net



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Tue, Mar 15, 2016 at 12:00 AM, David Steele <david@pgmasters.net> wrote:
>
> On 2/26/16 11:37 PM, Amit Kapila wrote:
>
>> On Sat, Feb 27, 2016 at 10:03 AM, Amit Kapila <amit.kapila16@gmail.com
>>
>>     Here, we can see that there is a gain of ~15% to ~38% at higher
>>     client count.
>>
>>     The attached document (perf_write_clogcontrollock_data_v6.ods)
>>     contains data, mainly focussing on single client performance.  The
>>     data is for multiple runs on different machines, so I thought it is
>>     better to present in form of document rather than dumping everything
>>     in e-mail.  Do let me know if there is any confusion in
>>     understanding/interpreting the data.
>>
>> Forgot to mention that all these tests have been done by
>> reverting commit-ac1d794.
>
>
> This patch no longer applies cleanly:
>
> $ git apply ../other/group_update_clog_v6.patch
> error: patch failed: src/backend/storage/lmgr/proc.c:404
> error: src/backend/storage/lmgr/proc.c: patch does not apply
> error: patch failed: src/include/storage/proc.h:152
> error: src/include/storage/proc.h: patch does not apply
>

For me, it works with patch -p1 < <path_of_patch>, but anyhow I have updated the patch based on a recent commit.  Can you please check the latest patch and see whether it applies cleanly for you now?
 
>
> It's not clear to me whether Robert has completed a review of this code or it still needs to be reviewed more comprehensively.
>
> Other than a comment that needs to be fixed it seems that all questions have been answered by Amit.
>

I have updated the comments and changed the name of one variable from "all_trans_same_page" to "all_xact_same_page", as pointed out offlist by Alvaro.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: Speed up Clog Access by increasing CLOG buffers

From
David Steele
Date:
On 3/15/16 1:17 AM, Amit Kapila wrote:

> On Tue, Mar 15, 2016 at 12:00 AM, David Steele <david@pgmasters.net
>
>> This patch no longer applies cleanly:
>>
>> $ git apply ../other/group_update_clog_v6.patch
>> error: patch failed: src/backend/storage/lmgr/proc.c:404
>> error: src/backend/storage/lmgr/proc.c: patch does not apply
>> error: patch failed: src/include/storage/proc.h:152
>> error: src/include/storage/proc.h: patch does not apply
> 
> For me, with patch -p1 < <path_of_patch> it works, but any how I have
> updated the patch based on recent commit.  Can you please check the
> latest patch and see if it applies cleanly for you now.

Yes, it now applies cleanly (101fd93).

-- 
-David
david@pgmasters.net



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Tue, Mar 15, 2016 at 7:54 PM, David Steele <david@pgmasters.net> wrote:
>
> On 3/15/16 1:17 AM, Amit Kapila wrote:
>
> > On Tue, Mar 15, 2016 at 12:00 AM, David Steele <david@pgmasters.net
> >
> >> This patch no longer applies cleanly:
> >>
> >> $ git apply ../other/group_update_clog_v6.patch
> >> error: patch failed: src/backend/storage/lmgr/proc.c:404
> >> error: src/backend/storage/lmgr/proc.c: patch does not apply
> >> error: patch failed: src/include/storage/proc.h:152
> >> error: src/include/storage/proc.h: patch does not apply
> >
> > For me, with patch -p1 < <path_of_patch> it works, but any how I have
> > updated the patch based on recent commit.  Can you please check the
> > latest patch and see if it applies cleanly for you now.
>
> Yes, it now applies cleanly (101fd93).
>

Thanks for verification.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Alvaro Herrera
Date:
David Steele wrote:

> This patch no longer applies cleanly:
> 
> $ git apply ../other/group_update_clog_v6.patch

Normally "git apply -3" gives good results in these cases -- it applies
the 3-way merge algorithm just as if you had applied the patch to the
revision it was built on and later git-merged with the latest head.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Jesper Pedersen
Date:
On 03/15/2016 01:17 AM, Amit Kapila wrote:
> I have updated the comments and changed the name of one of a variable from
> "all_trans_same_page" to "all_xact_same_page" as pointed out offlist by
> Alvaro.
>
>

I have done a run, and don't see any regressions.

Intel Xeon 28C/56T @ 2GHz w/ 256GB + 2 x RAID10 (data + xlog) SSD.

I can provide perf / flamegraph profiles if needed.

Thanks for working on this !

Best regards,
  Jesper


Attachment

Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Wed, Mar 16, 2016 at 11:57 PM, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>
> On 03/15/2016 01:17 AM, Amit Kapila wrote:
>>
>> I have updated the comments and changed the name of one of a variable from
>> "all_trans_same_page" to "all_xact_same_page" as pointed out offlist by
>> Alvaro.
>>
>>
>
> I have done a run, and don't see any regressions.
>

Can you provide the details of the test, like whether this is a pgbench read-write test, and, if possible, the steps for the test execution?

I wonder if you can do the test with unlogged tables (if you are using pgbench, then I think you need to change the CREATE TABLE commands to use the UNLOGGED option).
 
>
> Intel Xeon 28C/56T @ 2GHz w/ 256GB + 2 x RAID10 (data + xlog) SSD.
>

Can you provide the CPU information (probably by using lscpu)?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Dilip Kumar
Date:

On Thu, Mar 17, 2016 at 11:39 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

 I have reviewed the patch; here are some review comments so far, and I will continue to review.

1.

+
+ /*
+  * Add the proc to list, if the clog page where we need to update the

+  */
+ if (nextidx != INVALID_PGPROCNO &&
+ ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage)
+ return false;

Should we clear all these structure variables that we set above in case we are not adding ourselves to the group?  I can see it will not cause any problem even if we don't clear them;
if we decide not to clear them, I think we can add a comment mentioning that (a rough sketch of what I mean follows the quoted lines below).

+ proc->clogGroupMember = true;
+ proc->clogGroupMemberXid = xid;
+ proc->clogGroupMemberXidStatus = status;
+ proc->clogGroupMemberPage = pageno;
+ proc->clogGroupMemberLsn = lsn;
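
Just to illustrate the suggestion, something along these lines could be done before bailing out.  This is a rough sketch only, not the actual patch; the particular reset values are my guesses:

	if (nextidx != INVALID_PGPROCNO &&
		ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage)
	{
		/* undo the bookkeeping set up just above before falling back */
		proc->clogGroupMember = false;
		proc->clogGroupMemberXid = InvalidTransactionId;
		proc->clogGroupMemberXidStatus = TRANSACTION_STATUS_IN_PROGRESS;
		proc->clogGroupMemberPage = -1;
		proc->clogGroupMemberLsn = InvalidXLogRecPtr;
		return false;			/* caller falls back to the normal path */
	}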


2.

Here we are updating our own proc; I think we don't need an atomic operation here, since we have not yet been added to the list.

+ if (nextidx != INVALID_PGPROCNO &&
+ ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage)
+ return false;
+
+ pg_atomic_write_u32(&proc->clogGroupNext, nextidx);


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Andres Freund
Date:
Hi,

On 2016-03-15 10:47:12 +0530, Amit Kapila wrote:
> @@ -248,12 +256,67 @@ set_status_by_pages(int nsubxids, TransactionId *subxids,
>   * Record the final state of transaction entries in the commit log for
>   * all entries on a single page.  Atomic only on this page.
>   *
> + * Group the status update for transactions. This improves the efficiency
> + * of the transaction status update by reducing the number of lock
> + * acquisitions required for it.  To achieve the group transaction status
> + * update, we need to populate the transaction status related information
> + * in shared memory and doing it for overflowed sub-transactions would need
> + * a big chunk of shared memory, so we are not doing this optimization for
> + * such cases. This optimization is only applicable if the transaction and
> + * all child sub-transactions belong to same page which we presume to be the
> + * most common case, we might be able to apply this when they are not on same
> + * page, but that needs us to map sub-transactions in proc's XidCache based
> + * on pageno for which each time a group leader needs to set the transaction
> + * status and that can lead to some performance penalty as well because it
> + * needs to be done after acquiring CLogControlLock, so let's leave that
> + * case for now.  We don't do this optimization for prepared transactions
> + * as the dummy proc associated with such transactions doesn't have a
> + * semaphore associated with it and the same is required for group status
> + * update.  We choose not to create a semaphore for dummy procs for this
> + * purpose as the advantage of using this optimization for prepared transactions
> + * is not clear.
> + *

I think you should try to break up some of the sentences, one of them
spans 7 lines.

I'm actually rather unconvinced that it's all that common that all
subtransactions are on one page. If you have concurrency - otherwise
there'd be not much point in this patch - they'll usually be heavily
interleaved, no?  You can argue that you don't care about subxacts,
because they're more often used in less concurrent scenarios, but if
that's the argument, it should actually be made.


>   * Otherwise API is same as TransactionIdSetTreeStatus()
>   */
>  static void
>  TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
>                             TransactionId *subxids, XidStatus status,
> -                           XLogRecPtr lsn, int pageno)
> +                           XLogRecPtr lsn, int pageno,
> +                           bool all_xact_same_page)
> +{
> +    /*
> +     * If we can immediately acquire CLogControlLock, we update the status
> +     * of our own XID and release the lock.  If not, use group XID status
> +     * update to improve efficiency and if still not able to update, then
> +     * acquire CLogControlLock and update it.
> +     */
> +    if (LWLockConditionalAcquire(CLogControlLock, LW_EXCLUSIVE))
> +    {
> +        TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status, lsn, pageno);
> +        LWLockRelease(CLogControlLock);
> +    }
> +    else if (!all_xact_same_page ||
> +             nsubxids > PGPROC_MAX_CACHED_SUBXIDS ||
> +             IsGXactActive() ||
> +             !TransactionGroupUpdateXidStatus(xid, status, lsn, pageno))
> +    {
> +        LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
> +
> +        TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status, lsn, pageno);
> +
> +        LWLockRelease(CLogControlLock);
> +    }
> +}
>

This code is a bit arcane. I think it should be restructured to
a) Directly go for LWLockAcquire if !all_xact_same_page || nsubxids > PGPROC_MAX_CACHED_SUBXIDS || IsGXactActive().  Going for a conditional lock acquire first can be rather expensive.

b) I'd rather see an explicit fallback for the !TransactionGroupUpdateXidStatus case, this way it's too hard to understand.  It's also harder to add probes to detect whether that
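
Roughly, the shape I have in mind for a) and b) is something like this (illustration only, untested; names reused from the hunk above, not the committed code):

static void
TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
						   TransactionId *subxids, XidStatus status,
						   XLogRecPtr lsn, int pageno,
						   bool all_xact_same_page)
{
	/* a) cases that can never use the group path take the lock directly */
	if (!all_xact_same_page ||
		nsubxids > PGPROC_MAX_CACHED_SUBXIDS ||
		IsGXactActive())
	{
		LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
		TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status, lsn, pageno);
		LWLockRelease(CLogControlLock);
		return;
	}

	/* uncontended fast path */
	if (LWLockConditionalAcquire(CLogControlLock, LW_EXCLUSIVE))
	{
		TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status, lsn, pageno);
		LWLockRelease(CLogControlLock);
		return;
	}

	/* try to piggyback on (or become) a group leader */
	if (TransactionGroupUpdateXidStatus(xid, status, lsn, pageno))
		return;

	/* b) explicit fallback when joining a group didn't work out */
	LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
	TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status, lsn, pageno);
	LWLockRelease(CLogControlLock);
}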
 


> +
> +/*
> + * When we cannot immediately acquire CLogControlLock in exclusive mode at
> + * commit time, add ourselves to a list of processes that need their XIDs
> + * status update.

At this point my "ABA Problem" alarm goes off. If it's not an actual
danger, can you please document close by, why not?


> The first process to add itself to the list will acquire
> + * CLogControlLock in exclusive mode and perform TransactionIdSetPageStatusInternal
> + * on behalf of all group members.  This avoids a great deal of contention
> + * around CLogControlLock when many processes are trying to commit at once,
> + * since the lock need not be repeatedly handed off from one committing
> + * process to the next.
> + *
> + * Returns true, if transaction status is updated in clog page, else return
> + * false.
> + */
> +static bool
> +TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
> +                                XLogRecPtr lsn, int pageno)
> +{
> +    volatile PROC_HDR *procglobal = ProcGlobal;
> +    PGPROC       *proc = MyProc;
> +    uint32        nextidx;
> +    uint32        wakeidx;
> +    int            extraWaits = -1;
> +
> +    /* We should definitely have an XID whose status needs to be updated. */
> +    Assert(TransactionIdIsValid(xid));
> +
> +    /*
> +     * Add ourselves to the list of processes needing a group XID status
> +     * update.
> +     */
> +    proc->clogGroupMember = true;
> +    proc->clogGroupMemberXid = xid;
> +    proc->clogGroupMemberXidStatus = status;
> +    proc->clogGroupMemberPage = pageno;
> +    proc->clogGroupMemberLsn = lsn;
> +    while (true)
> +    {
> +        nextidx = pg_atomic_read_u32(&procglobal->clogGroupFirst);
> +
> +        /*
> +         * Add the proc to list, if the clog page where we need to update the
> +         * current transaction status is same as group leader's clog page.
> +         * There is a race condition here such that after doing the below
> +         * check and before adding this proc's clog update to a group, if the
> +         * group leader already finishes the group update for this page and
> +         * becomes group leader of another group which updates different clog
> +         * page, then it will lead to a situation where a single group can
> +         * have different clog page updates.  Now the chances of such a race
> +         * condition are less and even if it happens, the only downside is
> +         * that it could lead to serial access of clog pages from disk if
> +         * those pages are not in memory.  Tests doesn't indicate any
> +         * performance hit due to different clog page updates in same group,
> +         * however in future, if we want to improve the situation, then we can
> +         * detect the non-group leader transactions that tries to update the
> +         * different CLOG page after acquiring CLogControlLock and then mark
> +         * these transactions such that after waking they need to perform CLOG
> +         * update via normal path.
> +         */

Needs a good portion of polishing.


> +        if (nextidx != INVALID_PGPROCNO &&
> +            ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage)
> +            return false;

I think we're returning with clogGroupMember = true - that doesn't look
right.


> +        pg_atomic_write_u32(&proc->clogGroupNext, nextidx);
> +
> +        if (pg_atomic_compare_exchange_u32(&procglobal->clogGroupFirst,
> +                                           &nextidx,
> +                                           (uint32) proc->pgprocno))
> +            break;
> +    }

So this indeed has ABA type problems. And you appear to be arguing above
that that's ok. Need to ponder that for a bit.

So, we enqueue ourselves as the *head* of the wait list, if there's
other waiters. Seems like it could lead to the first element after the
leader to be delayed longer than the others.


FWIW, you can move the nextidx = part out of the loop;
pg_atomic_compare_exchange will update the nextidx value from memory, so no
need for another load afterwards.


> +    /*
> +     * If the list was not empty, the leader will update the status of our
> +     * XID. It is impossible to have followers without a leader because the
> +     * first process that has added itself to the list will always have
> +     * nextidx as INVALID_PGPROCNO.
> +     */
> +    if (nextidx != INVALID_PGPROCNO)
> +    {
> +        /* Sleep until the leader updates our XID status. */
> +        for (;;)
> +        {
> +            /* acts as a read barrier */
> +            PGSemaphoreLock(&proc->sem);
> +            if (!proc->clogGroupMember)
> +                break;
> +            extraWaits++;
> +        }
> +
> +        Assert(pg_atomic_read_u32(&proc->clogGroupNext) == INVALID_PGPROCNO);
> +
> +        /* Fix semaphore count for any absorbed wakeups */
> +        while (extraWaits-- > 0)
> +            PGSemaphoreUnlock(&proc->sem);
> +        return true;
> +    }
> +
> +    /* We are the leader.  Acquire the lock on behalf of everyone. */
> +    LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
> +
> +    /*
> +     * Now that we've got the lock, clear the list of processes waiting for
> +     * group XID status update, saving a pointer to the head of the list.
> +     * Trying to pop elements one at a time could lead to an ABA problem.
> +     */
> +    while (true)
> +    {
> +        nextidx = pg_atomic_read_u32(&procglobal->clogGroupFirst);
> +        if (pg_atomic_compare_exchange_u32(&procglobal->clogGroupFirst,
> +                                           &nextidx,
> +                                           INVALID_PGPROCNO))
> +            break;
> +    }

Hm. It seems like you should simply use pg_atomic_exchange_u32(),
rather than compare_exchange?


> diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
> index c4fd9ef..120b9c0 100644
> --- a/src/backend/access/transam/twophase.c
> +++ b/src/backend/access/transam/twophase.c
> @@ -177,7 +177,7 @@ static TwoPhaseStateData *TwoPhaseState;
>  /*
>   * Global transaction entry currently locked by us, if any.
>   */
> -static GlobalTransaction MyLockedGxact = NULL;
> +GlobalTransaction MyLockedGxact = NULL;

Hm, I'm doubtful it's worthwhile to expose this, just so we can use an
inline function, but whatever.


> +#include "access/clog.h"
>  #include "access/xlogdefs.h"
>  #include "lib/ilist.h"
>  #include "storage/latch.h"
> @@ -154,6 +155,17 @@ struct PGPROC
>  
>      uint32          wait_event_info;        /* proc's wait information */
>  
> +    /* Support for group transaction status update. */
> +    bool        clogGroupMember;    /* true, if member of clog group */
> +    pg_atomic_uint32 clogGroupNext;        /* next clog group member */
> +    TransactionId clogGroupMemberXid;    /* transaction id of clog group member */
> +    XidStatus    clogGroupMemberXidStatus;        /* transaction status of clog
> +                                                 * group member */
> +    int            clogGroupMemberPage;    /* clog page corresponding to
> +                                         * transaction id of clog group member */
> +    XLogRecPtr    clogGroupMemberLsn;        /* WAL location of commit record for
> +                                         * clog group member */
> +

Man, we're surely bloating PGPROC at a prodigious rate.


That's my first pass over the code itself.


Hm. Details aside, what concerns me most is that the whole group
mechanism, as implemented, only works as long as transactions only span
a short and regular amount of time. As soon as there's some variance in
transaction duration, the likelihood of building a group, where all xids
are on one page, diminishes. That likely works well in benchmarking, but
I'm afraid it's much less the case in the real world, where there's
network latency involved, and where applications actually contain
computations themselves.

If I understand correctly, without having followed the thread, the
reason you came up with this batching on a per-page level is to bound
the amount of effort spent by the leader; and thus bound the latency?

I think it's worthwhile to create a benchmark that does something like
BEGIN;SELECT ... FOR UPDATE; SELECT pg_sleep(random_time);
INSERT;COMMIT; you'd find that if random is a bit larger (say 20-200ms,
completely realistic values for network RTT + application computation),
the success rate of group updates shrinks noticeably.

Greetings,

Andres Freund



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Tue, Mar 22, 2016 at 4:22 PM, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2016-03-15 10:47:12 +0530, Amit Kapila wrote:
> > @@ -248,12 +256,67 @@ set_status_by_pages(int nsubxids, TransactionId *subxids,
> >   * Record the final state of transaction entries in the commit log for
> >   * all entries on a single page.  Atomic only on this page.
> >   *
> > + * Group the status update for transactions. This improves the efficiency
> > + * of the transaction status update by reducing the number of lock
> > + * acquisitions required for it.  To achieve the group transaction status
> > + * update, we need to populate the transaction status related information
> > + * in shared memory and doing it for overflowed sub-transactions would need
> > + * a big chunk of shared memory, so we are not doing this optimization for
> > + * such cases. This optimization is only applicable if the transaction and
> > + * all child sub-transactions belong to same page which we presume to be the
> > + * most common case, we might be able to apply this when they are not on same
> > + * page, but that needs us to map sub-transactions in proc's XidCache based
> > + * on pageno for which each time a group leader needs to set the transaction
> > + * status and that can lead to some performance penalty as well because it
> > + * needs to be done after acquiring CLogControlLock, so let's leave that
> > + * case for now.  We don't do this optimization for prepared transactions
> > + * as the dummy proc associated with such transactions doesn't have a
> > + * semaphore associated with it and the same is required for group status
> > + * update.  We choose not to create a semaphore for dummy procs for this
> > + * purpose as the advantage of using this optimization for prepared transactions
> > + * is not clear.
> > + *
>
> I think you should try to break up some of the sentences, one of them
> spans 7 lines.
>

Okay, I will try to do so in next version.

> I'm actually rather unconvinced that it's all that common that all
> subtransactions are on one page. If you have concurrency - otherwise
> there'd be not much point in this patch - they'll usually be heavily
> interleaved, no?  You can argue that you don't care about subxacts,
> because they're more often used in less concurrent scenarios, but if
> that's the argument, it should actually be made.
>

Note that we are doing it only when a transaction has less than or equal to 64 subtransactions.  I am not denying that there will be cases where subtransactions fall on different pages, but I think the chances of such transactions participating in group mode are low, and this patch is mainly targeting scalability for short transactions.

>
> >   * Otherwise API is same as TransactionIdSetTreeStatus()
> >   */
> >  static void
> >  TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
> >                                                  TransactionId *subxids, XidStatus status,
> > -                                                XLogRecPtr lsn, int pageno)
> > +                                                XLogRecPtr lsn, int pageno,
> > +                                                bool all_xact_same_page)
> > +{
> > +     /*
> > +      * If we can immediately acquire CLogControlLock, we update the status
> > +      * of our own XID and release the lock.  If not, use group XID status
> > +      * update to improve efficiency and if still not able to update, then
> > +      * acquire CLogControlLock and update it.
> > +      */
> > +     if (LWLockConditionalAcquire(CLogControlLock, LW_EXCLUSIVE))
> > +     {
> > +             TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status, lsn, pageno);
> > +             LWLockRelease(CLogControlLock);
> > +     }
> > +     else if (!all_xact_same_page ||
> > +                      nsubxids > PGPROC_MAX_CACHED_SUBXIDS ||
> > +                      IsGXactActive() ||
> > +                      !TransactionGroupUpdateXidStatus(xid, status, lsn, pageno))
> > +     {
> > +             LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
> > +
> > +             TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status, lsn, pageno);
> > +
> > +             LWLockRelease(CLogControlLock);
> > +     }
> > +}
> >
>
> This code is a bit arcane. I think it should be restructured to
> a) Directly go for LWLockAcquire if !all_xact_same_page || nsubxids >
>    PGPROC_MAX_CACHED_SUBXIDS || IsGXactActive(). Going for a conditional
>    lock acquire first can be rather expensive.

The previous version (v5 - [1]) had the code that way, but that adds a few extra instructions for the single-client case, and I was seeing a minor performance regression for the single-client case, which is why it has been changed to the current code.

> b) I'd rather see an explicit fallback for the
>    !TransactionGroupUpdateXidStatus case, this way it's too hard to
>    understand. It's also harder to add probes to detect whether that
>

Considering the above reply to (a), do you want to see it done as a separate else-if block in the patch?

>
> > +
> > +/*
> > + * When we cannot immediately acquire CLogControlLock in exclusive mode at
> > + * commit time, add ourselves to a list of processes that need their XIDs
> > + * status update.
>
> At this point my "ABA Problem" alarm goes off. If it's not an actual
> danger, can you please document close by, why not?
>

Why this won't lead to an ABA problem is explained in the comments below.  Refer to:

+ /*
+  * Now that we've got the lock, clear the list of processes waiting for
+  * group XID status update, saving a pointer to the head of the list.
+  * Trying to pop elements one at a time could lead to an ABA problem.
+  */



>
> > The first process to add itself to the list will acquire
> > + * CLogControlLock in exclusive mode and perform TransactionIdSetPageStatusInternal
> > + * on behalf of all group members.  This avoids a great deal of contention
> > + * around CLogControlLock when many processes are trying to commit at once,
> > + * since the lock need not be repeatedly handed off from one committing
> > + * process to the next.
> > + *
> > + * Returns true, if transaction status is updated in clog page, else return
> > + * false.
> > + */
> > +static bool
> > +TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
> > +                                                             XLogRecPtr lsn, int pageno)
> > +{
> > +     volatile PROC_HDR *procglobal = ProcGlobal;
> > +     PGPROC     *proc = MyProc;
> > +     uint32          nextidx;
> > +     uint32          wakeidx;
> > +     int                     extraWaits = -1;
> > +
> > +     /* We should definitely have an XID whose status needs to be updated. */
> > +     Assert(TransactionIdIsValid(xid));
> > +
> > +     /*
> > +      * Add ourselves to the list of processes needing a group XID status
> > +      * update.
> > +      */
> > +     proc->clogGroupMember = true;
> > +     proc->clogGroupMemberXid = xid;
> > +     proc->clogGroupMemberXidStatus = status;
> > +     proc->clogGroupMemberPage = pageno;
> > +     proc->clogGroupMemberLsn = lsn;
> > +     while (true)
> > +     {
> > +             nextidx = pg_atomic_read_u32(&procglobal->clogGroupFirst);
> > +
> > +             /*
> > +              * Add the proc to list, if the clog page where we need to update the
> > +              * current transaction status is same as group leader's clog page.
> > +              * There is a race condition here such that after doing the below
> > +              * check and before adding this proc's clog update to a group, if the
> > +              * group leader already finishes the group update for this page and
> > +              * becomes group leader of another group which updates different clog
> > +              * page, then it will lead to a situation where a single group can
> > +              * have different clog page updates.  Now the chances of such a race
> > +              * condition are less and even if it happens, the only downside is
> > +              * that it could lead to serial access of clog pages from disk if
> > +              * those pages are not in memory.  Tests doesn't indicate any
> > +              * performance hit due to different clog page updates in same group,
> > +              * however in future, if we want to improve the situation, then we can
> > +              * detect the non-group leader transactions that tries to update the
> > +              * different CLOG page after acquiring CLogControlLock and then mark
> > +              * these transactions such that after waking they need to perform CLOG
> > +              * update via normal path.
> > +              */
>
> Needs a good portion of polishing.
>
>
> > +             if (nextidx != INVALID_PGPROCNO &&
> > +                     ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage)
> > +                     return false;
>
> I think we're returning with clogGroupMember = true - that doesn't look
> right.
>

I think it won't create a problem, but surely it is not good to return with it still set to true; I will change this in the next version of the patch.

>
> > +             pg_atomic_write_u32(&proc->clogGroupNext, nextidx);
> > +
> > +             if (pg_atomic_compare_exchange_u32(&procglobal->clogGroupFirst,
> > +                                                                                &nextidx,
> > +                                                                                (uint32) proc->pgprocno))
> > +                     break;
> > +     }
>
> So this indeed has ABA type problems. And you appear to be arguing above
> that that's ok. Need to ponder that for a bit.
>
> So, we enqueue ourselves as the *head* of the wait list, if there's
> other waiters. Seems like it could lead to the first element after the
> leader to be delayed longer than the others.
>

It will not matter because we wake the queued processes only once we are done with the XID status update.

>
> FWIW, you can move the nextidx = part out of the loop;
> pg_atomic_compare_exchange will update the nextidx value from memory, so no
> need for another load afterwards.
>

Not sure if I understood which statement you are referring to here (are you referring to the atomic read operation?), and how can we save the load operation?

>
> > +     /*
> > +      * If the list was not empty, the leader will update the status of our
> > +      * XID. It is impossible to have followers without a leader because the
> > +      * first process that has added itself to the list will always have
> > +      * nextidx as INVALID_PGPROCNO.
> > +      */
> > +     if (nextidx != INVALID_PGPROCNO)
> > +     {
> > +             /* Sleep until the leader updates our XID status. */
> > +             for (;;)
> > +             {
> > +                     /* acts as a read barrier */
> > +                     PGSemaphoreLock(&proc->sem);
> > +                     if (!proc->clogGroupMember)
> > +                             break;
> > +                     extraWaits++;
> > +             }
> > +
> > +             Assert(pg_atomic_read_u32(&proc->clogGroupNext) == INVALID_PGPROCNO);
> > +
> > +             /* Fix semaphore count for any absorbed wakeups */
> > +             while (extraWaits-- > 0)
> > +                     PGSemaphoreUnlock(&proc->sem);
> > +             return true;
> > +     }
> > +
> > +     /* We are the leader.  Acquire the lock on behalf of everyone. */
> > +     LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
> > +
> > +     /*
> > +      * Now that we've got the lock, clear the list of processes waiting for
> > +      * group XID status update, saving a pointer to the head of the list.
> > +      * Trying to pop elements one at a time could lead to an ABA problem.
> > +      */
> > +     while (true)
> > +     {
> > +             nextidx = pg_atomic_read_u32(&procglobal->clogGroupFirst);
> > +             if (pg_atomic_compare_exchange_u32(&procglobal->clogGroupFirst,
> > +                                                                                &nextidx,
> > +                                                                                INVALID_PGPROCNO))
> > +                     break;
> > +     }
>
> Hm. It seems like you should simply use pg_atomic_exchange_u32(),
> rather than compare_exchange?
>

We need to remember the head of the list to wake up the processes, which is why I think the above loop is required.

>
> > diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
> > index c4fd9ef..120b9c0 100644
> > --- a/src/backend/access/transam/twophase.c
> > +++ b/src/backend/access/transam/twophase.c
> > @@ -177,7 +177,7 @@ static TwoPhaseStateData *TwoPhaseState;
> >  /*
> >   * Global transaction entry currently locked by us, if any.
> >   */
> > -static GlobalTransaction MyLockedGxact = NULL;
> > +GlobalTransaction MyLockedGxact = NULL;
>
> Hm, I'm doubtful it's worthwhile to expose this, just so we can use an
> inline function, but whatever.
>

I did it this way considering this to be a hot path, to save an additional function call, but I can change it if you think so.

>
> > +#include "access/clog.h"
> >  #include "access/xlogdefs.h"
> >  #include "lib/ilist.h"
> >  #include "storage/latch.h"
> > @@ -154,6 +155,17 @@ struct PGPROC
> >
> >       uint32          wait_event_info;        /* proc's wait information */
> >
> > +     /* Support for group transaction status update. */
> > +     bool            clogGroupMember;        /* true, if member of clog group */
> > +     pg_atomic_uint32 clogGroupNext;         /* next clog group member */
> > +     TransactionId clogGroupMemberXid;       /* transaction id of clog group member */
> > +     XidStatus       clogGroupMemberXidStatus;               /* transaction status of clog
> > +                                                                                              * group member */
> > +     int                     clogGroupMemberPage;    /* clog page corresponding to
> > +                                                                              * transaction id of clog group member */
> > +     XLogRecPtr      clogGroupMemberLsn;             /* WAL location of commit record for
> > +                                                                              * clog group member */
> > +
>
> Man, we're surely bloating PGPROC at a prodigious rate.
>
>
> That's my first pass over the code itself.
>
>
> Hm. Details aside, what concerns me most is that the whole group
> mechanism, as implemented, only works as long as transactions only span
> a short and regular amount of time.
>

Yes, that's the main case targeted by this patch, and I think there are many such cases in OLTP workloads where transactions are very short.

>
> If I understand correctly, without having followed the thread, the
> reason you came up with this batching on a per-page level is to bound
> the amount of effort spent by the leader; and thus bound the latency?
>

This is mainly to avoid the case where multiple pages are not in memory and the leader needs to perform the I/O serially.  Refer to mail [2] for the point raised by Robert.


> I think it's worthwhile to create a benchmark that does something like
> BEGIN;SELECT ... FOR UPDATE; SELECT pg_sleep(random_time);
> INSERT;COMMIT; you'd find that if random is a bit larger (say 20-200ms,
> completely realistic values for network RTT + application computation),
> the success rate of group updates shrinks noticeably.
>

I think it will happen that way, but what do we want to see with that benchmark?  I think the results will show that for such a workload there is either no benefit, or a much smaller one compared to short transactions.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Andres Freund
Date:
On 2016-03-22 18:19:48 +0530, Amit Kapila wrote:
> > I'm actually rather unconvinced that it's all that common that all
> > subtransactions are on one page. If you have concurrency - otherwise
> > there'd be not much point in this patch - they'll usually be heavily
> > interleaved, no?  You can argue that you don't care about subxacts,
> > because they're more often used in less concurrent scenarios, but if
> > that's the argument, it should actually be made.
> >
> 
> Note, that we are doing it only when a transaction has less than equal to
> 64 sub transactions.

So?

> > This code is a bit arcane. I think it should be restructured to
> > a) Directly go for LWLockAcquire if !all_xact_same_page || nsubxids >
> >    PGPROC_MAX_CACHED_SUBXIDS || IsGXactActive(). Going for a conditional
> >    lock acquire first can be rather expensive.
> 
> The previous version (v5 - [1]) has code that way, but that adds few extra
> instructions for single client case and I was seeing minor performance
> regression for single client case due to which it has been changed as per
> current code.

I don't believe that changing conditions here is likely to cause a
measurable regression.


> > So, we enqueue ourselves as the *head* of the wait list, if there's
> > other waiters. Seems like it could lead to the first element after the
> > leader to be delayed longer than the others.
> >
> 
> It will not matter because we are waking the queued process only once we
> are done with xid status update.

If there are only N cores, process N+1 won't be run immediately. But yea,
it's probably not large.


> > FWIW, you can move the nextidx = part out of the loop;
> > pg_atomic_compare_exchange will update the nextidx value from memory, so no
> > need for another load afterwards.
> >
> 
> Not sure, if I understood which statement you are referring here (are you
> referring to atomic read operation) and how can we save the load operation?

Yes, to the atomic read. And we can save it in the loop, because
compare_exchange returns the current value if it fails.
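
I.e., something like this for the enqueue loop (sketch only, not the actual patch; the same-page check you are planning to remove is omitted here):

	nextidx = pg_atomic_read_u32(&procglobal->clogGroupFirst);

	for (;;)
	{
		pg_atomic_write_u32(&proc->clogGroupNext, nextidx);

		if (pg_atomic_compare_exchange_u32(&procglobal->clogGroupFirst,
										   &nextidx,
										   (uint32) proc->pgprocno))
			break;

		/* on failure nextidx already holds the current head; just retry */
	}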


> > > +      * Now that we've got the lock, clear the list of processes waiting for
> > > +      * group XID status update, saving a pointer to the head of the list.
> > > +      * Trying to pop elements one at a time could lead to an ABA problem.
> > > +      */
> > > +     while (true)
> > > +     {
> > > +             nextidx = pg_atomic_read_u32(&procglobal->clogGroupFirst);
> > > +             if (pg_atomic_compare_exchange_u32(&procglobal->clogGroupFirst,
> > > +                                                &nextidx,
> > > +                                                INVALID_PGPROCNO))
> > > +                     break;
> > > +     }
> >
> > Hm. It seems like you should simply use pg_atomic_exchange_u32(),
> > rather than compare_exchange?
> >
> 
> We need to remember the head of list to wake up the processes due to which
> I think above loop is required.

exchange returns the old value? There's no need for a compare here.
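
I.e., the whole retry loop for detaching the list could become a single call (sketch only):

	/* swap in the empty marker and get back the old head in one step */
	nextidx = pg_atomic_exchange_u32(&procglobal->clogGroupFirst,
									 INVALID_PGPROCNO);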


> > I think it's worthwhile to create a benchmark that does something like
> > BEGIN;SELECT ... FOR UPDATE; SELECT pg_sleep(random_time);
> > INSERT;COMMIT; you'd find that if random is a bit larger (say 20-200ms,
> > completely realistic values for network RTT + application computation),
> > the success rate of group updates shrinks noticeably.
> >
> 
> I think it will happen that way, but what do we want to see with that
> benchmark? I think the results will be that for such a workload either
> there is no benefit or will be very less as compare to short transactions.

Because we want our performance improvements to matter in reality, not
just in unrealistic benchmarks where the benchmarking tool is running on
the same machine as the database, and uses unix sockets.  That's not
actually an all that realistic workload.


Andres



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Tue, Mar 22, 2016 at 6:29 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-03-22 18:19:48 +0530, Amit Kapila wrote:
> > > I'm actually rather unconvinced that it's all that common that all
> > > subtransactions are on one page. If you have concurrency - otherwise
> > > there'd be not much point in this patch - they'll usually be heavily
> > > interleaved, no?  You can argue that you don't care about subxacts,
> > > because they're more often used in less concurrent scenarios, but if
> > > that's the argument, it should actually be made.
> > >
> >
> > Note, that we are doing it only when a transaction has less than equal to
> > 64 sub transactions.
>
> So?
>

They should fall on one page, unless they are heavily interleaved as you pointed out.  I think that whether subtransactions are present or not, this patch won't help for bigger transactions.

I will address your other review comments and send an updated patch.

Thanks for the review.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Tue, Mar 22, 2016 at 6:52 AM, Andres Freund <andres@anarazel.de> wrote:
> I'm actually rather unconvinced that it's all that common that all
> subtransactions are on one page. If you have concurrency - otherwise
> there'd be not much point in this patch - they'll usually be heavily
> interleaved, no?  You can argue that you don't care about subxacts,
> because they're more often used in less concurrent scenarios, but if
> that's the argument, it should actually be made.

But a single clog page holds a lot of transactions - I think it's
~32k.  If you have 100 backends running, and each one allocates an XID
in turn, and then each allocates a sub-XID in turn, and then they all
commit, and then you repeat this pattern, >99% of transactions will be
on a single CLOG page.  And that is a pretty pathological case.
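
(For reference, the ~32k figure follows from CLOG's layout, assuming the default 8 kB block size:

    2 bits per transaction status  ->  4 statuses per byte
    8192 bytes per page * 4 = 32768 transactions per CLOG page.)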

It's true that if you have many short-running transactions interleaved
with occasional long-running transactions, and the latter use
subxacts, the optimization might fail to apply to the long-running
subxacts fairly often.  But who cares?  Those are, by definition, a
small percentage of the overall transaction stream.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Speed up Clog Access by increasing CLOG buffers

From
Andres Freund
Date:
On 2016-03-22 10:40:28 -0400, Robert Haas wrote:
> On Tue, Mar 22, 2016 at 6:52 AM, Andres Freund <andres@anarazel.de> wrote:
> > I'm actually rather unconvinced that it's all that common that all
> > subtransactions are on one page. If you have concurrency - otherwise
> > there'd be not much point in this patch - they'll usually be heavily
> > interleaved, no?  You can argue that you don't care about subxacts,
> > because they're more often used in less concurrent scenarios, but if
> > that's the argument, it should actually be made.
> 
> But a single clog page holds a lot of transactions - I think it's
> ~32k.

At 30-40k TPS that's not actually all that much.
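
(Rough arithmetic, using 32768 transactions per CLOG page:

    32768 xacts / ~35000 xacts per second ≈ 0.9 seconds per CLOG page,

so a busy system moves on to a new page roughly every second.)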


> If you have 100 backends running, and each one allocates an XID
> in turn, and then each allocates a sub-XID in turn, and then they all
> commit, and then you repeat this pattern, >99% of transactions will be
> on a single CLOG page.  And that is a pretty pathological case.

I think it's much more likely that some backends will immediately
allocate and others won't for a short while.


> It's true that if you have many short-running transactions interleaved
> with occasional long-running transactions, and the latter use
> subxacts, the optimization might fail to apply to the long-running
> subxacts fairly often.  But who cares?  Those are, by definition, a
> small percentage of the overall transaction stream.

Leaving subtransactions aside, I think the problem is that if you're
having slightly longer running transactions on a regular basis (and I'm
thinking 100-200ms, very common on OLTP systems due to network and
client processing), the effectiveness of the batching will be greatly
reduced.

I'll play around with the updated patch Amit promised, and see how high
the batching rate is over time, depending on the type of transaction
processed.


Andres



Re: Speed up Clog Access by increasing CLOG buffers

From
Jim Nasby
Date:
On 3/22/16 9:36 AM, Amit Kapila wrote:
>  > > Note, that we are doing it only when a transaction has less than
> equal to
>  > > 64 sub transactions.
>  >
>  > So?
>  >
>
> They should fall on one page, unless they are heavily interleaved as
> pointed by you.  I think either subtransactions are present or not, this
> patch won't help for bigger transactions.

FWIW, the use case that comes to mind here is the "upsert" example in 
the docs. AFAIK that's going to create a subtransaction every time it's 
called, regardless of whether it performs actual DML. I've used that in 
places that would probably have moderately high concurrency, and I 
suspect I'm not alone in that.

That said, it wouldn't surprise me if plpgsql overhead swamps any effect 
this patch has, so perhaps it's a moot point.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Tue, Mar 22, 2016 at 4:22 PM, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2016-03-15 10:47:12 +0530, Amit Kapila wrote:
> > @@ -248,12 +256,67 @@ set_status_by_pages(int nsubxids, TransactionId *subxids,
> >   * Record the final state of transaction entries in the commit log for
> >   * all entries on a single page.  Atomic only on this page.
> >   *
> > + * Group the status update for transactions. This improves the efficiency
> > + * of the transaction status update by reducing the number of lock
> > + * acquisitions required for it.  To achieve the group transaction status
> > + * update, we need to populate the transaction status related information
> > + * in shared memory and doing it for overflowed sub-transactions would need
> > + * a big chunk of shared memory, so we are not doing this optimization for
> > + * such cases. This optimization is only applicable if the transaction and
> > + * all child sub-transactions belong to same page which we presume to be the
> > + * most common case, we might be able to apply this when they are not on same
> > + * page, but that needs us to map sub-transactions in proc's XidCache based
> > + * on pageno for which each time a group leader needs to set the transaction
> > + * status and that can lead to some performance penalty as well because it
> > + * needs to be done after acquiring CLogControlLock, so let's leave that
> > + * case for now.  We don't do this optimization for prepared transactions
> > + * as the dummy proc associated with such transactions doesn't have a
> > + * semaphore associated with it and the same is required for group status
> > + * update.  We choose not to create a semaphore for dummy procs for this
> > + * purpose as the advantage of using this optimization for prepared transactions
> > + * is not clear.
> > + *
>
> I think you should try to break up some of the sentences, one of them
> spans 7 lines.
>

Okay, I have simplified the sentences in the comment.

>
>
> >   * Otherwise API is same as TransactionIdSetTreeStatus()
> >   */
> >  static void
> >  TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
> >                                                  TransactionId *subxids, XidStatus status,
> > -                                                XLogRecPtr lsn, int pageno)
> > +                                                XLogRecPtr lsn, int pageno,
> > +                                                bool all_xact_same_page)
> > +{
> > +     /*
> > +      * If we can immediately acquire CLogControlLock, we update the status
> > +      * of our own XID and release the lock.  If not, use group XID status
> > +      * update to improve efficiency and if still not able to update, then
> > +      * acquire CLogControlLock and update it.
> > +      */
> > +     if (LWLockConditionalAcquire(CLogControlLock, LW_EXCLUSIVE))
> > +     {
> > +             TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status, lsn, pageno);
> > +             LWLockRelease(CLogControlLock);
> > +     }
> > +     else if (!all_xact_same_page ||
> > +                      nsubxids > PGPROC_MAX_CACHED_SUBXIDS ||
> > +                      IsGXactActive() ||
> > +                      !TransactionGroupUpdateXidStatus(xid, status, lsn, pageno))
> > +     {
> > +             LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
> > +
> > +             TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status, lsn, pageno);
> > +
> > +             LWLockRelease(CLogControlLock);
> > +     }
> > +}
> >
>
> This code is a bit arcane. I think it should be restructured to
> a) Directly go for LWLockAcquire if !all_xact_same_page || nsubxids >
>    PGPROC_MAX_CACHED_SUBXIDS || IsGXactActive(). Going for a conditional
>    lock acquire first can be rather expensive.
> b) I'd rather see an explicit fallback for the
>    !TransactionGroupUpdateXidStatus case, this way it's too hard to
>    understand. It's also harder to add probes to detect whether that
>

Changed.
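
For readers of the thread, one possible shape of the restructured function, sketched from the code quoted above (an illustration only, not necessarily the exact code in the updated patch):

static void
TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
                           TransactionId *subxids, XidStatus status,
                           XLogRecPtr lsn, int pageno,
                           bool all_xact_same_page)
{
    /* Cases the group-update path can't handle: take the lock directly. */
    if (!all_xact_same_page ||
        nsubxids > PGPROC_MAX_CACHED_SUBXIDS ||
        IsGXactActive())
    {
        LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
        TransactionIdSetPageStatusInternal(xid, nsubxids, subxids,
                                           status, lsn, pageno);
        LWLockRelease(CLogControlLock);
        return;
    }

    /* Try the group update; if it could not be done, fall back explicitly. */
    if (TransactionGroupUpdateXidStatus(xid, status, lsn, pageno))
        return;

    LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
    TransactionIdSetPageStatusInternal(xid, nsubxids, subxids,
                                       status, lsn, pageno);
    LWLockRelease(CLogControlLock);
}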


>
>
> > The first process to add itself to the list will acquire
> > + * CLogControlLock in exclusive mode and perform TransactionIdSetPageStatusInternal
> > + * on behalf of all group members.  This avoids a great deal of contention
> > + * around CLogControlLock when many processes are trying to commit at once,
> > + * since the lock need not be repeatedly handed off from one committing
> > + * process to the next.
> > + *
> > + * Returns true, if transaction status is updated in clog page, else return
> > + * false.
> > + */
> > +static bool
> > +TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
> > +                                                             XLogRecPtr lsn, int pageno)
> > +{
> > +     volatile PROC_HDR *procglobal = ProcGlobal;
> > +     PGPROC     *proc = MyProc;
> > +     uint32          nextidx;
> > +     uint32          wakeidx;
> > +     int                     extraWaits = -1;
> > +
> > +     /* We should definitely have an XID whose status needs to be updated. */
> > +     Assert(TransactionIdIsValid(xid));
> > +
> > +     /*
> > +      * Add ourselves to the list of processes needing a group XID status
> > +      * update.
> > +      */
> > +     proc->clogGroupMember = true;
> > +     proc->clogGroupMemberXid = xid;
> > +     proc->clogGroupMemberXidStatus = status;
> > +     proc->clogGroupMemberPage = pageno;
> > +     proc->clogGroupMemberLsn = lsn;
> > +     while (true)
> > +     {
> > +             nextidx = pg_atomic_read_u32(&procglobal->clogGroupFirst);
> > +
> > +             /*
> > +              * Add the proc to list, if the clog page where we need to update the
> > +              * current transaction status is same as group leader's clog page.
> > +              * There is a race condition here such that after doing the below
> > +              * check and before adding this proc's clog update to a group, if the
> > +              * group leader already finishes the group update for this page and
> > +              * becomes group leader of another group which updates different clog
> > +              * page, then it will lead to a situation where a single group can
> > +              * have different clog page updates.  Now the chances of such a race
> > +              * condition are less and even if it happens, the only downside is
> > +              * that it could lead to serial access of clog pages from disk if
> > +              * those pages are not in memory.  Tests doesn't indicate any
> > +              * performance hit due to different clog page updates in same group,
> > +              * however in future, if we want to improve the situation, then we can
> > +              * detect the non-group leader transactions that tries to update the
> > +              * different CLOG page after acquiring CLogControlLock and then mark
> > +              * these transactions such that after waking they need to perform CLOG
> > +              * update via normal path.
> > +              */
>
> Needs a good portion of polishing.
>

Okay, I have tried to simplify the comment as well.

>
> > +             if (nextidx != INVALID_PGPROCNO &&
> > +                     ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage)
> > +                     return false;
>
> I think we're returning with clogGroupMember = true - that doesn't look
> right.
>

Changed as per suggestion.

>
> > +             pg_atomic_write_u32(&proc->clogGroupNext, nextidx);
> > +
> > +             if (pg_atomic_compare_exchange_u32(&procglobal->clogGroupFirst,
> > +                                                                                &nextidx,
> > +                                                                                (uint32) proc->pgprocno))
> > +                     break;
> > +     }
>
> So this indeed has ABA type problems. And you appear to be arguing above
> that that's ok. Need to ponder that for a bit.
>
> So, we enqueue ourselves as the *head* of the wait list, if there's
> other waiters. Seems like it could lead to the first element after the
> leader to be delayed longer than the others.
>
>
> FWIW, You can move the nextidx = part of out the loop,
> pgatomic_compare_exchange will update the nextidx value from memory; no
> need for another load afterwards.
>

Changed as per suggestion.

>
> > +     /*
> > +      * If the list was not empty, the leader will update the status of our
> > +      * XID. It is impossible to have followers without a leader because the
> > +      * first process that has added itself to the list will always have
> > +      * nextidx as INVALID_PGPROCNO.
> > +      */
> > +     if (nextidx != INVALID_PGPROCNO)
> > +     {
> > +             /* Sleep until the leader updates our XID status. */
> > +             for (;;)
> > +             {
> > +                     /* acts as a read barrier */
> > +                     PGSemaphoreLock(&proc->sem);
> > +                     if (!proc->clogGroupMember)
> > +                             break;
> > +                     extraWaits++;
> > +             }
> > +
> > +             Assert(pg_atomic_read_u32(&proc->clogGroupNext) == INVALID_PGPROCNO);
> > +
> > +             /* Fix semaphore count for any absorbed wakeups */
> > +             while (extraWaits-- > 0)
> > +                     PGSemaphoreUnlock(&proc->sem);
> > +             return true;
> > +     }
> > +
> > +     /* We are the leader.  Acquire the lock on behalf of everyone. */
> > +     LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
> > +
> > +     /*
> > +      * Now that we've got the lock, clear the list of processes waiting for
> > +      * group XID status update, saving a pointer to the head of the list.
> > +      * Trying to pop elements one at a time could lead to an ABA problem.
> > +      */
> > +     while (true)
> > +     {
> > +             nextidx = pg_atomic_read_u32(&procglobal->clogGroupFirst);
> > +             if (pg_atomic_compare_exchange_u32(&procglobal->clogGroupFirst,
> > +                                                                                &nextidx,
> > +                                                                                INVALID_PGPROCNO))
> > +                     break;
> > +     }
>
> Hm. It seems like you should should simply use pg_atomic_exchange_u32(),
> rather than compare_exchange?
>

Changed as per suggestion.

>
> I think it's worthwhile to create a benchmark that does something like
> BEGIN;SELECT ... FOR UPDATE; SELECT pg_sleep(random_time);
> INSERT;COMMIT; you'd find that if random is a bit larger (say 20-200ms,
> completely realistic values for network RTT + application computation),
> the success rate of group updates shrinks noticeably.
>

Will do some tests based on above test and share results.


Attached patch contains all the changes suggested by you.  Let me know if I have missed anything or you want it differently.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Wed, Mar 23, 2016 at 12:26 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Mar 22, 2016 at 4:22 PM, Andres Freund <andres@anarazel.de> wrote:
> >
> >
> > I think it's worthwhile to create a benchmark that does something like
> > BEGIN;SELECT ... FOR UPDATE; SELECT pg_sleep(random_time);
> > INSERT;COMMIT; you'd find that if random is a bit larger (say 20-200ms,
> > completely realistic values for network RTT + application computation),
> > the success rate of group updates shrinks noticeably.
> >
>
> Will do some tests based on above test and share results.
>

Forgot to mention that the effect of the patch is more visible with unlogged tables, so I will do the test with those and would request you to use the same if you are also planning to perform some tests.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Tue, Mar 22, 2016 at 12:33 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
>
> On Thu, Mar 17, 2016 at 11:39 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>  I have reviewed the patch.. here are some review comments, I will continue to review..
>
> 1.
>
> +
> + /*
> +  * Add the proc to list, if the clog page where we need to update the
>
> +  */
> + if (nextidx != INVALID_PGPROCNO &&
> + ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage)
> + return false;
>
> Should we clear all these structure variable what we set above in case we are not adding our self to group, I can see it will not have any problem even if we don't clear them,
> I think if we don't want to clear we can add some comment mentioning the same.
>

I have updated the patch to just clear clogGroupMember as that is what is done when we wake the processes.
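
So the early-exit path presumably ends up looking roughly like this (a sketch based on the code quoted above):

        if (nextidx != INVALID_PGPROCNO &&
            ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage)
        {
            /* Not joining this group after all; reset the membership flag. */
            proc->clogGroupMember = false;
            return false;
        }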

>
> 2.
>
> Here we are updating in our own proc, I think we don't need atomic operation here, we are not yet added to the list.
>
> + if (nextidx != INVALID_PGPROCNO &&
> + ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage)
> + return false;
> +
> + pg_atomic_write_u32(&proc->clogGroupNext, nextidx);
>
>

We won't be able to assign nextidx to clogGroupNext directly, as clogGroupNext is of type pg_atomic_uint32.

Thanks for the review.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Andres Freund
Date:
On 2016-03-23 12:33:22 +0530, Amit Kapila wrote:
> On Wed, Mar 23, 2016 at 12:26 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Tue, Mar 22, 2016 at 4:22 PM, Andres Freund <andres@anarazel.de> wrote:
> > >
> > >
> > > I think it's worthwhile to create a benchmark that does something like
> > > BEGIN;SELECT ... FOR UPDATE; SELECT pg_sleep(random_time);
> > > INSERT;COMMIT; you'd find that if random is a bit larger (say 20-200ms,
> > > completely realistic values for network RTT + application computation),
> > > the success rate of group updates shrinks noticeably.
> > >
> >
> > Will do some tests based on above test and share results.
> >
> 
> Forgot to mention that the effect of patch is better visible with unlogged
> tables, so will do the test with those and request you to use same if you
> yourself is also planning to perform some tests.

I'm playing around with SELECT txid_current(); right now - that should
be about the most specific load for setting clog bits.

Andres



Re: Speed up Clog Access by increasing CLOG buffers

From
Andres Freund
Date:
On 2016-03-23 21:43:41 +0100, Andres Freund wrote:
> I'm playing around with SELECT txid_current(); right now - that should
> be about the most specific load for setting clog bits.

Or so I thought.

In my testing that showed just about zero performance difference between
the patch and master. And more surprisingly, profiling showed very
little contention on the control lock. Hacking
TransactionIdSetPageStatus() to return without doing anything, actually
only showed minor performance benefits.

[there's also the fact that txid_current() indirectly acquires two
lwlocks twice, which showed up more prominently than the control lock, but
that I could easily hack around by adding an xid_current().]

Similar with an INSERT only workload. And a small scale pgbench.


Looking through the thread showed that the positive results you'd posted
were all with relatively big scale factors. Which made me think. Running
a bigger pgbench showed that most of the interesting (i.e. long) lock waits
were both via TransactionIdSetPageStatus() *and* TransactionIdGetStatus().


So I think what happens is that once you have a big enough table, the
UPDATEs standard pgbench does start to often hit *old* xids (in unhinted
rows). Thus old pages have to be read in, potentially displacing slru
content needed very shortly after.


Have you, in your evaluation of the performance of this patch, done
profiles over time? I.e. whether the performance benefits are there
immediately, or only after a significant amount of test time? Comparing
TPS over time, for both patched/unpatched, looks relevant.


Even after changing to scale 500, the performance benefits on this,
older 2 socket, machine were minor; even though contention on the
ClogControlLock was the second most severe (after ProcArrayLock).

Afaics that squares with Jesper's result, which basically also didn't
show a difference either way?


I'm afraid that this patch might be putting a bandaid on some of the
absolutely worst cases, without actually addressing the core
problem. Simon's patch in [1] seems to come closer to addressing that
(which I don't believe is safe without doing every status
manipulation atomically, as individual status bits are smaller than 4
bytes).  Now it's possible to argue that the bandaid might slow the
bleeding to a survivable level, but I have to admit I'm doubtful.

Here's the stats for a -s 500 run btw:

 Performance counter stats for 'system wide':

            18,747      probe_postgres:TransactionIdSetTreeStatus
            68,884      probe_postgres:TransactionIdGetStatus
             9,718      probe_postgres:PGSemaphoreLock
 
(the PGSemaphoreLock is over 50% ProcArrayLock, followed by ~15%
SimpleLruReadPage_ReadOnly)


My suspicion is that a better approach for now would be to take Simon's
patch, but add a (per-page?) 'ClogModificationLock'; to avoid the need
of doing something fancier in TransactionIdSetStatusBit().
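
A very rough sketch of that idea (the names, the lock count, and the initialization are made up here purely for illustration):

#define NUM_CLOG_MODIFICATION_LOCKS 64

/* Hypothetical array, assumed to be allocated and initialized in shared memory. */
static LWLock *ClogModificationLocks[NUM_CLOG_MODIFICATION_LOCKS];

static void
SetStatusBitWithPageLock(TransactionId xid, XidStatus status,
                         XLogRecPtr lsn, int slotno, int pageno)
{
    LWLock *modlock = ClogModificationLocks[pageno % NUM_CLOG_MODIFICATION_LOCKS];

    /* Writers to the same clog page serialize here; readers are unaffected. */
    LWLockAcquire(modlock, LW_EXCLUSIVE);
    TransactionIdSetStatusBit(xid, status, lsn, slotno);
    LWLockRelease(modlock);
}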

Andres

[1]:
http://archives.postgresql.org/message-id/CANP8%2Bj%2BimQfHxkChFyfnXDyi6k-arAzRV%2BZG-V_OFxEtJjOL2Q%40mail.gmail.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Andres Freund
Date:
Hi,

On 2016-03-24 01:10:55 +0100, Andres Freund wrote:
> I'm afraid that this patch might be putting bandaid on some of the
> absolutely worst cases, without actually addressing the core
> problem. Simon's patch in [1] seems to come closer addressing that
> (which I don't believe it's safe without going doing every status
> manipulation atomically, as individual status bits are smaller than 4
> bytes).  Now it's possibly to argue that the bandaid might slow the
> bleeding to a survivable level, but I have to admit I'm doubtful.
> 
> Here's the stats for a -s 500 run btw:
>  Performance counter stats for 'system wide':
>             18,747      probe_postgres:TransactionIdSetTreeStatus
>             68,884      probe_postgres:TransactionIdGetStatus
>              9,718      probe_postgres:PGSemaphoreLock
> (the PGSemaphoreLock is over 50% ProcArrayLock, followed by ~15%
> SimpleLruReadPage_ReadOnly)
> 
> 
> My suspicion is that a better approach for now would be to take Simon's
> patch, but add a (per-page?) 'ClogModificationLock'; to avoid the need
> of doing something fancier in TransactionIdSetStatusBit().
> 
> Andres
> 
> [1]:
http://archives.postgresql.org/message-id/CANP8%2Bj%2BimQfHxkChFyfnXDyi6k-arAzRV%2BZG-V_OFxEtJjOL2Q%40mail.gmail.com

Simon, would you mind if I took your patch for a spin like roughly
suggested above?


Andres



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Thu, Mar 24, 2016 at 5:40 AM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-03-23 21:43:41 +0100, Andres Freund wrote:
> > I'm playing around with SELECT txid_current(); right now - that should
> > be about the most specific load for setting clog bits.
>
> Or so I thought.
>
> In my testing that showed just about zero performance difference between
> the patch and master. And more surprisingly, profiling showed very
> little contention on the control lock. Hacking
> TransactionIdSetPageStatus() to return without doing anything, actually
> only showed minor performance benefits.
>
> [there's also the fact that txid_current() indirectly acquires two
> lwlock twice, which showed up more prominently than control lock, but
> that I could easily hack around by adding a xid_current().]
>
> Similar with an INSERT only workload. And a small scale pgbench.
>
>
> Looking through the thread showed that the positive results you'd posted
> all were with relativey big scale factors.
>

I have seen smaller benefits at 300 scale factor and somewhat larger benefits at 1000 scale factor.  Also, Mithun has done similar testing with unlogged tables and the results of the same [1] also look good.

>
> Which made me think. Running
> a bigger pgbench showed that most the interesting (i.e. long) lock waits
> were both via TransactionIdSetPageStatus *and* TransactionIdGetStatus().
>

Yes, this is the same as what I have observed as well.

>
> So I think what happens is that once you have a big enough table, the
> UPDATEs standard pgbench does start to often hit *old* xids (in unhinted
> rows). Thus old pages have to be read in, potentially displacing slru
> content needed very shortly after.
>
>
> Have you, in your evaluation of the performance of this patch, done
> profiles over time? I.e. whether the performance benefits are the
> immediately, or only after a significant amount of test time? Comparing
> TPS over time, for both patched/unpatched looks relevant.
>

I have mainly done it with half-hour read-write tests. What do you want to observe via shorter tests? Sometimes they give inconsistent data for read-write tests.

>
> Even after changing to scale 500, the performance benefits on this,
> older 2 socket, machine were minor; even though contention on the
> ClogControlLock was the second most severe (after ProcArrayLock).
>

I have tried this patch mainly on an 8 socket machine with 300 and 1000 scale factor.  I am hoping that you have tried this test on unlogged tables; by the way, at what client count have you seen these results?

> Afaics that squares with Jesper's result, which basically also didn't
> show a difference either way?
>

One difference was that I think Jesper has done testing with synchronous_commit off, whereas my tests were with synchronous_commit on.

>
> I'm afraid that this patch might be putting bandaid on some of the
> absolutely worst cases, without actually addressing the core
> problem. Simon's patch in [1] seems to come closer addressing that
> (which I don't believe it's safe without going doing every status
> manipulation atomically, as individual status bits are smaller than 4
> bytes).  Now it's possibly to argue that the bandaid might slow the
> bleeding to a survivable level, but I have to admit I'm doubtful.
>
> Here's the stats for a -s 500 run btw:
>  Performance counter stats for 'system wide':
>             18,747      probe_postgres:TransactionIdSetTreeStatus
>             68,884      probe_postgres:TransactionIdGetStatus
>              9,718      probe_postgres:PGSemaphoreLock
> (the PGSemaphoreLock is over 50% ProcArrayLock, followed by ~15%
> SimpleLruReadPage_ReadOnly)
>
>
> My suspicion is that a better approach for now would be to take Simon's
> patch, but add a (per-page?) 'ClogModificationLock'; to avoid the need
> of doing something fancier in TransactionIdSetStatusBit().
>

I think we can try that as well, and if you see better results with that approach, then we can use it instead of the current patch.


Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Thu, Mar 24, 2016 at 8:08 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Mar 24, 2016 at 5:40 AM, Andres Freund <andres@anarazel.de> wrote:
> >
> > Even after changing to scale 500, the performance benefits on this,
> > older 2 socket, machine were minor; even though contention on the
> > ClogControlLock was the second most severe (after ProcArrayLock).
> >
>

One more point I wanted to make here: I think the benefit will show up mainly when the contention on CLogControlLock is more than or near that on ProcArrayLock; otherwise, even if the patch reduces contention (visible via LWLock stats), the performance doesn't increase.  From Mithun's data [1] related to LWLocks, it seems that at 88 clients in his test the contention on CLOGControlLock becomes more than that on ProcArrayLock, and that is the point where it starts showing a noticeable performance gain.  I have explained some more about this point on that thread [2].  Is it possible for you to test in a similar situation (e.g. a client count greater than the number of cores) and see the behaviour w.r.t. locking contention and TPS?

Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Thu, Mar 24, 2016 at 8:08 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Mar 24, 2016 at 5:40 AM, Andres Freund <andres@anarazel.de> wrote:
> >
> > Have you, in your evaluation of the performance of this patch, done
> > profiles over time? I.e. whether the performance benefits are the
> > immediately, or only after a significant amount of test time? Comparing
> > TPS over time, for both patched/unpatched looks relevant.
> >
>
> I have mainly done it with half-hour read-write tests. What do you want to observe via smaller tests, sometimes it gives inconsistent data for read-write tests?
>

I have done some tests on both Intel and Power m/c (configurations of which are mentioned at the end of this mail) to see the results at different time intervals, and it always shows greater than 50% improvement on the Power m/c at 128 client-count and greater than 29% improvement on the Intel m/c at 88 client-count.


Non-default parameters
------------------------------------
max_connections = 300
shared_buffers=8GB
min_wal_size=10GB
max_wal_size=15GB
checkpoint_timeout    =35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 256MB

pgbench setup
------------------------
scale factor - 300
used *unlogged* tables : pgbench -i --unlogged-tables -s 300 ..
pgbench -M prepared tpc-b


Results on Intel m/c
--------------------------------
client-count - 88

Time (minutes)    Base     Patch    %
5                 39978    51858    29.71
10                38169    52195    36.74
20                36992    52173    41.03
30                37042    52149    40.78

Results on power m/c
-----------------------------------
Client-count - 128

Time (minutes)    Base     Patch    %
5                 42479    65655    54.55
10                41876    66050    57.72
20                38099    65200    71.13
30                37838    61908    63.61
 
>
> >
> > Even after changing to scale 500, the performance benefits on this,
> > older 2 socket, machine were minor; even though contention on the
> > ClogControlLock was the second most severe (after ProcArrayLock).
> >
>
> I have tried this patch on mainly 8 socket machine with 300 & 1000 scale factor.  I am hoping that you have tried this test on unlogged tables and by the way at what client count, you have seen these results.
>

Do you think we don't see an increase in performance in your tests because of the m/c difference (sockets/CPU cores) or the client count?


Intel m/c config (lscpu)
-------------------------------------
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             8
NUMA node(s):          8
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 47
Model name:            Intel(R) Xeon(R) CPU E7- 8830  @ 2.13GHz
Stepping:              2
CPU MHz:               1064.000
BogoMIPS:              4266.62
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              24576K
NUMA node0 CPU(s):     0,65-71,96-103
NUMA node1 CPU(s):     72-79,104-111
NUMA node2 CPU(s):     80-87,112-119
NUMA node3 CPU(s):     88-95,120-127
NUMA node4 CPU(s):     1-8,33-40
NUMA node5 CPU(s):     9-16,41-48
NUMA node6 CPU(s):     17-24,49-56
NUMA node7 CPU(s):     25-32,57-64

Power m/c config (lscpu)
-------------------------------------
Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                192
On-line CPU(s) list:   0-191
Thread(s) per core:    8
Core(s) per socket:    1
Socket(s):             24
NUMA node(s):          4
Model:                 IBM,8286-42A
L1d cache:             64K
L1i cache:             32K
L2 cache:              512K
L3 cache:              8192K
NUMA node0 CPU(s):     0-47
NUMA node1 CPU(s):     48-95
NUMA node2 CPU(s):     96-143
NUMA node3 CPU(s):     144-191

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Thu, Mar 24, 2016 at 8:08 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Mar 24, 2016 at 5:40 AM, Andres Freund <andres@anarazel.de> wrote:
> >
> > Even after changing to scale 500, the performance benefits on this,
> > older 2 socket, machine were minor; even though contention on the
> > ClogControlLock was the second most severe (after ProcArrayLock).
> >
>
> I have tried this patch on mainly 8 socket machine with 300 & 1000 scale factor.  I am hoping that you have tried this test on unlogged tables and by the way at what client count, you have seen these results.
>
> > Afaics that squares with Jesper's result, which basically also didn't
> > show a difference either way?
> >
>
> One difference was that I think Jesper has done testing with synchronous_commit mode as off whereas my tests were with synchronous commit mode on.
>

Looking again at the results posted by Jesper [1] and Mithun [2], I have one more observation: in HEAD, the performance doesn't dip even at higher client counts (>75) in the tests done by Jesper, whereas the results of the tests done by Mithun indicate that it dips at high client counts (>64) in HEAD, and that is where the patch is helping.  Now there is certainly some difference in the test environments; for example, Jesper has done testing on a 2-socket m/c, whereas mine and Mithun's tests were done on 4- or 8-socket m/c.  So I think the difference in TPS due to reduced contention on CLogControlLock is mainly visible on high-socket-count m/c.

Can anybody having access to 4 or more socket m/c help in testing this patch with --unlogged-tables?

Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Fri, Sep 11, 2015 at 8:01 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Sep 3, 2015 at 5:11 PM, Andres Freund <andres@anarazel.de> wrote:
> >
>
> Updated comments and the patch (increate_clog_bufs_v2.patch)
> containing the same is attached.
>

Andres mentioned to me in an off-list discussion that he thinks we should first try to fix the clog buffers problem, as he sees in his tests that clog buffer replacement is one of the bottlenecks. He also suggested a test to see if the increase in buffers could lead to a regression.  The basic idea of the test was to ensure that every CLOG access is a disk access.  Based on his suggestion, I have written a SQL statement which forces every CLOG access to be a disk access; the query used for the same is below:
With ins AS (INSERT INTO test_clog_access values(default) RETURNING c1) Select * from test_clog_access where c1 = (Select c1 from ins) - 32768 * :client_id;
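
(Roughly why that forces disk access, assuming c1 comes from a sequence and therefore tracks XID assignment, and 32768 transaction statuses per CLOG page:

    c1 - 32768 * :client_id  ->  a row whose xmin lies about :client_id CLOG pages
                                 behind the page currently being written,

so each client's visibility check lands on a different old page, far more pages than the CLOG buffer pool can cache.)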

Test Results
---------------------
HEAD - commit d12e5bb7 Clog Buffers - 32
Patch-1 - Clog Buffers - 64
Patch-2 - Clog Buffers - 128


Patch_Ver/Client_Count    1        64
HEAD                      12677    57470
Patch-1                   12305    58079
Patch-2                   12761    58637

The above data is a median of 3 10-min runs and indicates that there is no substantial dip from increasing clog buffers.

The test scripts used are attached with this mail.  In perf_clog_access.sh, you need to change the data_directory path as per your m/c; also, you might want to change the binary name if you create postgres binaries with different names.

Andres, is this test in line with what you have in mind?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: Speed up Clog Access by increasing CLOG buffers

From
Andres Freund
Date:
On 2016-03-28 22:50:49 +0530, Amit Kapila wrote:
> On Fri, Sep 11, 2015 at 8:01 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Thu, Sep 3, 2015 at 5:11 PM, Andres Freund <andres@anarazel.de> wrote:
> > >
> >
> > Updated comments and the patch (increate_clog_bufs_v2.patch)
> > containing the same is attached.
> >
>
> Andres mentioned to me in off-list discussion, that he thinks we should
> first try to fix the clog buffers problem as he sees in his tests that clog
> buffer replacement is one of the bottlenecks. He also suggested me a test
> to see if the increase in buffers could lead to regression.  The basic idea
> of test was to ensure every access on clog access to be a disk one.  Based
> on his suggestion, I have written a SQL statement which will allow every
> access of CLOG to be a disk access and the query used for same is as below:
> With ins AS (INSERT INTO test_clog_access values(default) RETURNING c1)
> Select * from test_clog_access where c1 = (Select c1 from ins) - 32768 *
> :client_id;
>
> Test Results
> ---------------------
> HEAD - commit d12e5bb7 Clog Buffers - 32
> Patch-1 - Clog Buffers - 64
> Patch-2 - Clog Buffers - 128
>
>
> Patch_Ver/Client_Count 1 64
> HEAD 12677 57470
> Patch-1 12305 58079
> Patch-2 12761 58637
>
> Above data is a median of 3 10-min runs.  Above data indicates that there
> is no substantial dip in increasing clog buffers.
>
> Test scripts used in testing are attached with this mail.  In
> perf_clog_access.sh, you need to change data_directory path as per your
> m/c, also you might want to change the binary name, if you want to create
> postgres binaries with different names.
>
> Andres, Is this test inline with what you have in mind?

Yes. That looks good. My testing shows that increasing the number of
buffers can both increase throughput and reduce latency variance. The
former is a smaller effect with one of the discussed patches applied;
the latter seems to actually increase in scale (with increased
throughput).


I've attached patches to:
0001: Increase the max number of clog buffers
0002: Implement 64bit atomics fallback and optimize read/write
0003: Edited version of Simon's clog scalability patch

WRT 0003 - still clearly WIP - I've:
- made group_lsn pg_atomic_u64*, to allow for tear-free reads
- split content from IO lock
- made SimpleLruReadPage_optShared always return with only share lock
  held
- Implement a different, experimental, concurrency model for
  SetStatusBit using cmpxchg. A define USE_CONTENT_LOCK controls which
  bit is used.
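
For reference, the lockless variant is conceptually something like the following sketch (it glosses over how the real patch maps the 2-bit statuses onto atomically accessible words):

/*
 * Set the 2-bit status of one transaction within the 32-bit word that
 * contains it, without holding the content lock exclusively.
 */
static void
clog_set_status_cmpxchg(pg_atomic_uint32 *word, int bitshift, uint32 status)
{
    uint32      oldval = pg_atomic_read_u32(word);

    for (;;)
    {
        uint32      newval;

        newval = (oldval & ~(0x3u << bitshift)) | (status << bitshift);

        /* On failure, oldval is refreshed with the word's current contents. */
        if (pg_atomic_compare_exchange_u32(word, &oldval, newval))
            break;
    }
}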

I've tested this and saw it outperform Amit's approach, especially
when using a read/write mix rather than only reads. I saw over a 30%
increase on a large EC2 instance with -btpcb-like@1 -bselect-only@3. But
that's in a virtualized environment, not very good for reproducibility.

Amit, could you run benchmarks on your bigger hardware? Both with
USE_CONTENT_LOCK commented out and in?

I think we should go for 1) and 2) unconditionally. And then evaluate
whether to go with yours, or 3) from above. If the latter, we've to do
some cleanup :)

Greetings,

Andres Freund

Attachment

Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Thu, Mar 31, 2016 at 4:39 AM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-03-28 22:50:49 +0530, Amit Kapila wrote:
> > On Fri, Sep 11, 2015 at 8:01 PM, Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > >
>
> Amit, could you run benchmarks on your bigger hardware? Both with
> USE_CONTENT_LOCK commented out and in?
>

Yes.

> I think we should go for 1) and 2) unconditionally.


Yes, that makes sense.  On a 20-min read-write pgbench --unlogged-tables benchmark, I see that with HEAD the TPS is 36241 and with the increase-clog-buffers patch the TPS is 69340 at 128 client count (a very good performance boost), which indicates that we should go ahead with patches 1) and 2).

0002-Increase-max-number-of-buffers-in-clog-SLRU-to-128

 Size
 CLOGShmemBuffers(void)
 {
-	return Min(32, Max(4, NBuffers / 512));
+	return Min(128, Max(4, NBuffers / 512));
 }



I think we should change the comments on top of this function.  I have changed the comments as per my previous patch and attached the modified patch with this mail; see if that makes sense.


0001-Improve-64bit-atomics-support

+#if 0
+#ifndef PG_HAVE_ATOMIC_READ_U64
+#define PG_HAVE_ATOMIC_READ_U64
+static inline uint64

What is the purpose of the above #if 0?  Other than that, the patch looks good to me.


> And then evaluate
> whether to go with your, or 3) from above. If the latter, we've to do
> some cleanup :)
>

Yes, that makes sense to me, so let's go with 1) and 2) first.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: Speed up Clog Access by increasing CLOG buffers

From
Andres Freund
Date:
On 2016-03-31 15:07:22 +0530, Amit Kapila wrote:
> On Thu, Mar 31, 2016 at 4:39 AM, Andres Freund <andres@anarazel.de> wrote:
> >
> > On 2016-03-28 22:50:49 +0530, Amit Kapila wrote:
> > > On Fri, Sep 11, 2015 at 8:01 PM, Amit Kapila <amit.kapila16@gmail.com>
> > > wrote:
> > > >
> >
> > Amit, could you run benchmarks on your bigger hardware? Both with
> > USE_CONTENT_LOCK commented out and in?
> >
> 
> Yes.

Cool.


> > I think we should go for 1) and 2) unconditionally.

> Yes, that makes sense.  On 20 min read-write pgbench --unlogged-tables
> benchmark, I see that with HEAD Tps is 36241 and with increase the clog
> buffers patch, Tps is 69340 at 128 client count (very good performance
> boost) which indicates that we should go ahead with 1) and 2) patches.

Especially considering the line count... I do wonder about going crazy
and increasing to 256 immediately. It otherwise seems likely that we'll
have the same issue in a year.  Could you perhaps run your test
against that as well?


> I think we should change comments on top of this function.

Yes, definitely.


> 0001-Improve-64bit-atomics-support
> 
> +#if 0
> +#ifndef PG_HAVE_ATOMIC_READ_U64
> +#define PG_HAVE_ATOMIC_READ_U64
> +static inline uint64
> 
> What the purpose of above #if 0?  Other than that patch looks good to me.

I think I was investigating something. Other than that obviously there's
no point. Sorry for that.

Greetings,

Andres Freund



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Thu, Mar 31, 2016 at 3:48 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-03-31 15:07:22 +0530, Amit Kapila wrote:
> > On Thu, Mar 31, 2016 at 4:39 AM, Andres Freund <andres@anarazel.de> wrote:
> > >
> > > On 2016-03-28 22:50:49 +0530, Amit Kapila wrote:
> > > > On Fri, Sep 11, 2015 at 8:01 PM, Amit Kapila <amit.kapila16@gmail.com>
> > > > wrote:
> > > > >
> > >
> > > Amit, could you run benchmarks on your bigger hardware? Both with
> > > USE_CONTENT_LOCK commented out and in?
> > >
> >
> > Yes.
>
> Cool.
>
>
> > > I think we should go for 1) and 2) unconditionally.
>
> > Yes, that makes sense.  On 20 min read-write pgbench --unlogged-tables
> > benchmark, I see that with HEAD Tps is 36241 and with increase the clog
> > buffers patch, Tps is 69340 at 128 client count (very good performance
> > boost) which indicates that we should go ahead with 1) and 2) patches.
>
> Especially considering the line count... I do wonder about going crazy
> and increasing to 256 immediately. It otherwise seems likely that we'll
> have the the same issue in a year.  Could you perhaps run your test
> against that as well?
>

Unfortunately, it dipped to 65005 with 256 clog bufs.  So I think 128 is an appropriate number.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Andres Freund
Date:
On 2016-03-31 17:52:12 +0530, Amit Kapila wrote:
> On Thu, Mar 31, 2016 at 3:48 PM, Andres Freund <andres@anarazel.de> wrote:
> >
> > On 2016-03-31 15:07:22 +0530, Amit Kapila wrote:
> > > On Thu, Mar 31, 2016 at 4:39 AM, Andres Freund <andres@anarazel.de>
> wrote:
> > > >
> > > > On 2016-03-28 22:50:49 +0530, Amit Kapila wrote:
> > > > > On Fri, Sep 11, 2015 at 8:01 PM, Amit Kapila <
> amit.kapila16@gmail.com>
> > > > > wrote:
> > > > > >
> > > >
> > > > Amit, could you run benchmarks on your bigger hardware? Both with
> > > > USE_CONTENT_LOCK commented out and in?
> > > >
> > >
> > > Yes.
> >
> > Cool.
> >
> >
> > > > I think we should go for 1) and 2) unconditionally.
> >
> > > Yes, that makes sense.  On 20 min read-write pgbench --unlogged-tables
> > > benchmark, I see that with HEAD Tps is 36241 and with increase the clog
> > > buffers patch, Tps is 69340 at 128 client count (very good performance
> > > boost) which indicates that we should go ahead with 1) and 2) patches.
> >
> > Especially considering the line count... I do wonder about going crazy
> > and increasing to 256 immediately. It otherwise seems likely that we'll
> > have the the same issue in a year.  Could you perhaps run your test
> > against that as well?
> >
> 
> Unfortunately, it dipped to 65005 with 256 clog bufs.  So I think 128 is
> appropriate number.

Ah, interesting. Then let's go with that.



Re: Speed up Clog Access by increasing CLOG buffers

From
Jesper Pedersen
Date:
Hi,

On 03/30/2016 07:09 PM, Andres Freund wrote:
> Yes. That looks good. My testing shows that increasing the number of
> buffers can increase both throughput and reduce latency variance. The
> former is a smaller effect with one of the discussed patches applied,
> the latter seems to actually increase in scale (with increased
> throughput).
>
>
> I've attached patches to:
> 0001: Increase the max number of clog buffers
> 0002: Implement 64bit atomics fallback and optimize read/write
> 0003: Edited version of Simon's clog scalability patch
>
> WRT 0003 - still clearly WIP - I've:
> - made group_lsn pg_atomic_u64*, to allow for tear-free reads
> - split content from IO lock
> - made SimpleLruReadPage_optShared always return with only share lock
>    held
> - Implement a different, experimental, concurrency model for
>    SetStatusBit using cmpxchg. A define USE_CONTENT_LOCK controls which
>    bit is used.
>
> I've tested this and saw this outperform Amit's approach. Especially so
> when using a read/write mix, rather then only reads. I saw over 30%
> increase on a large EC2 instance with -btpcb-like@1 -bselect-only@3. But
> that's in a virtualized environment, not very good for reproducability.
>
> Amit, could you run benchmarks on your bigger hardware? Both with
> USE_CONTENT_LOCK commented out and in?
>
> I think we should go for 1) and 2) unconditionally. And then evaluate
> whether to go with your, or 3) from above. If the latter, we've to do
> some cleanup :)
>

I have been testing Amit's patch in various setups and work loads, with 
up to 400 connections on a 2 x Xeon E5-2683 (28C/56T @ 2 GHz), not 
seeing an improvement, but no regression either.

Testing with 0001 and 0002 does show up to a 5% improvement when using an
HDD for data + wal - about 1% when using 2 x RAID10 SSD - unlogged.

I can do a USE_CONTENT_LOCK run on 0003 if it is something for 9.6.

Thanks for your work on this !

Best regards, Jesper




Re: Speed up Clog Access by increasing CLOG buffers

From
Andres Freund
Date:

On March 31, 2016 11:13:46 PM GMT+02:00, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:

>I can do a USE_CONTENT_LOCK run on 0003 if it is something for 9.6.

Yes please. I think the lock variant is realistic, the lockless did isn't.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Speed up Clog Access by increasing CLOG buffers

From
Jesper Pedersen
Date:
Hi,

On 03/31/2016 06:21 PM, Andres Freund wrote:
> On March 31, 2016 11:13:46 PM GMT+02:00, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>
>> I can do a USE_CONTENT_LOCK run on 0003 if it is something for 9.6.
>
> Yes please. I think the lock variant is realistic, the lockless did isn't.
>

I have done a run with -M prepared on unlogged running 10min per data 
point, up to 300 connections. Using data + wal on HDD.

I'm not seeing a difference between with and without USE_CONTENT_LOCK -- 
all points are within +/- 0.5%.

Let me know if there are other tests I can perform.

Best regards, Jesper




Re: Speed up Clog Access by increasing CLOG buffers

From
Andres Freund
Date:

On April 1, 2016 10:25:51 PM GMT+02:00, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>Hi,
>
>On 03/31/2016 06:21 PM, Andres Freund wrote:
>> On March 31, 2016 11:13:46 PM GMT+02:00, Jesper Pedersen
><jesper.pedersen@redhat.com> wrote:
>>
>>> I can do a USE_CONTENT_LOCK run on 0003 if it is something for 9.6.
>>
>> Yes please. I think the lock variant is realistic, the lockless did
>isn't.
>>
>
>I have done a run with -M prepared on unlogged running 10min per data 
>point, up to 300 connections. Using data + wal on HDD.
>
>I'm not seeing a difference between with and without USE_CONTENT_LOCK
>-- 
>all points are within +/- 0.5%.
>
>Let me know if there are other tests I can perform

How do either compare to just 0002 applied?

Thanks!
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Thu, Mar 31, 2016 at 3:48 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-03-31 15:07:22 +0530, Amit Kapila wrote:
> > On Thu, Mar 31, 2016 at 4:39 AM, Andres Freund <andres@anarazel.de> wrote:
> > >
> > > On 2016-03-28 22:50:49 +0530, Amit Kapila wrote:
> > > > On Fri, Sep 11, 2015 at 8:01 PM, Amit Kapila <amit.kapila16@gmail.com>
> > > > wrote:
> > > > >
> > >
> > > Amit, could you run benchmarks on your bigger hardware? Both with
> > > USE_CONTENT_LOCK commented out and in?
> > >
> >
> > Yes.
>
> Cool.
>

Here is the performance data (configuration of machine used to perform this test is mentioned at end of mail):

Non-default parameters
------------------------------------
max_connections = 300
shared_buffers=8GB
min_wal_size=10GB
max_wal_size=15GB
checkpoint_timeout    =35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 256MB

median of 3, 20-min pgbench tpc-b results for --unlogged-tables
 

Client Count/No. Of Runs (tps)    2       64       128
HEAD+clog_buf_128                 4930    66754    68818
group_clog_v8                     5753    69002    78843
content_lock                      5668    70134    70501
nocontent_lock                    4787    69531    70663


I am not exactly sure why the content lock patch (USE_CONTENT_LOCK defined in 0003-Use-a-much-more-granular-locking-model-for-the-clog-) or the no content lock patch (USE_CONTENT_LOCK not defined) gives poorer performance at 128 clients; it may be due to some bug in the patch, or due to the reason mentioned by Robert [1] (usage of two locks instead of one).  On running it many times with the content lock and no content lock patches, they sometimes give 80 ~ 81K TPS at 128 client count, which is approximately 3% higher than the group_clog_v8 patch; this indicates that the group clog approach is able to address most of the remaining contention (after increasing clog buffers) around CLOGControlLock.  There is one small regression observed with the no content lock patch at a lower client count (2), which might be due to run-to-run variation, or maybe due to the increased number of instructions from atomic ops; that needs to be investigated if we want to follow the no content lock approach.

Note, I have not posted TPS numbers with HEAD, as I have already shown above that increasing clog bufs has increased TPS from ~36K to ~68K at 128 client-count.


M/c details
-----------------
Power m/c config (lscpu)
-------------------------------------
Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                192
On-line CPU(s) list:   0-191
Thread(s) per core:    8
Core(s) per socket:    1
Socket(s):             24
NUMA node(s):          4
Model:                 IBM,8286-42A
L1d cache:             64K
L1i cache:             32K
L2 cache:              512K
L3 cache:              8192K
NUMA node0 CPU(s):     0-47
NUMA node1 CPU(s):     48-95
NUMA node2 CPU(s):     96-143
NUMA node3 CPU(s):     144-191




With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Jesper Pedersen
Date:
On 04/01/2016 04:39 PM, Andres Freund wrote:
> On April 1, 2016 10:25:51 PM GMT+02:00, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>> Hi,
>>
>> On 03/31/2016 06:21 PM, Andres Freund wrote:
>>> On March 31, 2016 11:13:46 PM GMT+02:00, Jesper Pedersen
>> <jesper.pedersen@redhat.com> wrote:
>>>
>>>> I can do a USE_CONTENT_LOCK run on 0003 if it is something for 9.6.
>>>
>>> Yes please. I think the lock variant is realistic, the lockless did
>> isn't.
>>>
>>
>> I have done a run with -M prepared on unlogged running 10min per data
>> point, up to 300 connections. Using data + wal on HDD.
>>
>> I'm not seeing a difference between with and without USE_CONTENT_LOCK
>> --
>> all points are within +/- 0.5%.
>>
>> Let me know if there are other tests I can perform
>
> How do either compare to just 0002 applied?
>

0001 + 0002 compared to 0001 + 0002 + 0003 (either way) were pretty much 
the same +/- 0.5% on the HDD run.

Best regards, Jesper




Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Mon, Apr 4, 2016 at 8:55 PM, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
> On 04/01/2016 04:39 PM, Andres Freund wrote:
>> On April 1, 2016 10:25:51 PM GMT+02:00, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>>> Hi,
>>>
>>> On 03/31/2016 06:21 PM, Andres Freund wrote:
>>>> On March 31, 2016 11:13:46 PM GMT+02:00, Jesper Pedersen
>>>> <jesper.pedersen@redhat.com> wrote:
>>>>
>>>>> I can do a USE_CONTENT_LOCK run on 0003 if it is something for 9.6.
>>>>
>>>> Yes please. I think the lock variant is realistic, the lockless did
>>>> isn't.
>>>>
>>>
>>> I have done a run with -M prepared on unlogged running 10min per data
>>> point, up to 300 connections. Using data + wal on HDD.
>>>
>>> I'm not seeing a difference between with and without USE_CONTENT_LOCK --
>>> all points are within +/- 0.5%.
>>>
>>> Let me know if there are other tests I can perform
>>
>> How do either compare to just 0002 applied?
>>
>
> 0001 + 0002 compared to 0001 + 0002 + 0003 (either way) were pretty much the same +/- 0.5% on the HDD run.


I think the main reason why there is no significant gain shown in your tests is that on the m/c where you are testing, the contention due to CLOGControlLock is not high enough for any reduction of it to help.  To me, it is visible on some of the high-end machines which have 4 or more sockets.  So, I think these results should be taken as an indication that there is no regression in the tests performed by you.

Thanks for doing all the tests for these patches.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Sat, Apr 2, 2016 at 5:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Mar 31, 2016 at 3:48 PM, Andres Freund <andres@anarazel.de> wrote:
>
> Here is the performance data (configuration of machine used to perform this test is mentioned at end of mail):
>
> Non-default parameters
> ------------------------------------
> max_connections = 300
> shared_buffers=8GB
> min_wal_size=10GB
> max_wal_size=15GB
> checkpoint_timeout    =35min
> maintenance_work_mem = 1GB
> checkpoint_completion_target = 0.9
> wal_buffers = 256MB
>
> median of 3, 20-min pgbench tpc-b results for --unlogged-tables

I have run exactly the same test on an Intel x86 m/c and the results are as below:

Client Count/Patch_ver (tps)           2       128      256
HEAD – Commit 2143f5e1                 2832    35001    26756
clog_buf_128                           2909    50685    40998
clog_buf_128 + group_update_clog_v8    2981    53043    50779
clog_buf_128 + content_lock            2843    56261    54059
clog_buf_128 + nocontent_lock          2630    56554    54429


On this m/c, I don't see any run-to-run variation; however, the trend of the results seems similar to the power m/c.  Clearly the first patch, increasing clog bufs to 128, shows up to a 50% performance improvement at 256 client-count.  We can also observe that the group clog patch gives ~24% gain on top of the increase-clog-bufs patch at 256 client count.  Both the content lock and no content lock patches show similar gains and are 6~7% better than the group clog patch.  Also, as on the power m/c, the no content lock patch seems to show some regression at lower client count (2 clients in this case).

Based on the above results, increasing clog bufs to 128 is a clear winner, and I think we might not want to proceed with the no content lock approach as it shows some regression and is no better than the content lock approach.  That leaves a decision between the group clog approach and the content lock approach.  The difference between the two is not large (6~7%), and I think that when sub-transactions are involved (sub-transactions on the same page as the main transaction) the group clog patch should give better performance, as the content lock itself will then start becoming the bottleneck.  We could address that case for the content lock approach by applying a similar grouping technique on the content lock, but I am not sure that is worth the effort.  Also, I see some variation in the performance data with the content lock patch on the power m/c, though that might be attributed to m/c characteristics.  So, I think we can proceed with either the group clog patch or the content lock patch; if we want to proceed with the content lock approach, we need to do some more work on it.


Note - For both content and no content lock, I have applied 0001-Improve-64bit-atomics-support patch.


m/c config (lscpu)
---------------------------
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             8
NUMA node(s):          8
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 47
Model name:            Intel(R) Xeon(R) CPU E7- 8830  @ 2.13GHz
Stepping:              2
CPU MHz:               1064.000
BogoMIPS:              4266.62
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              24576K
NUMA node0 CPU(s):     0,65-71,96-103
NUMA node1 CPU(s):     72-79,104-111
NUMA node2 CPU(s):     80-87,112-119
NUMA node3 CPU(s):     88-95,120-127
NUMA node4 CPU(s):     1-8,33-40
NUMA node5 CPU(s):     9-16,41-48
NUMA node6 CPU(s):     17-24,49-56
NUMA node7 CPU(s):     25-32,57-64

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Andres Freund
Date:
Hi,

On 2016-04-07 09:14:00 +0530, Amit Kapila wrote:
> On Sat, Apr 2, 2016 at 5:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I have ran exactly same test on intel x86 m/c and the results are as below:

Thanks for running these tests!

> Client Count/Patch_ver (tps) 2 128 256
> HEAD – Commit 2143f5e1 2832 35001 26756
> clog_buf_128 2909 50685 40998
> clog_buf_128 +group_update_clog_v8 2981 53043 50779
> clog_buf_128 +content_lock 2843 56261 54059
> clog_buf_128 +nocontent_lock 2630 56554 54429

Interesting.

could you perhaps also run a test with -btpcb-like@1 -bselect-only@3?
That much represents real world loads, and it's where I saw simon's
approach outshining yours considerably...

Greetings,

Andres Freund



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Thu, Apr 7, 2016 at 10:16 AM, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2016-04-07 09:14:00 +0530, Amit Kapila wrote:
> > On Sat, Apr 2, 2016 at 5:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I have ran exactly same test on intel x86 m/c and the results are as below:
>
> Thanks for running these tests!
>
> > Client Count/Patch_ver (tps) 2 128 256
> > HEAD – Commit 2143f5e1 2832 35001 26756
> > clog_buf_128 2909 50685 40998
> > clog_buf_128 +group_update_clog_v8 2981 53043 50779
> > clog_buf_128 +content_lock 2843 56261 54059
> > clog_buf_128 +nocontent_lock 2630 56554 54429
>
> Interesting.
>
> could you perhaps also run a test with -btpcb-like@1 -bselect-only@3?
>

This is the data with -b tpcb-like@1, with a 20-min run for each version; the results are quite similar to the data posted in the previous e-mail.

Client Count/Patch_ver (tps)      256
clog_buf_128                    40617
clog_buf_128 +group_clog_v8     51137
clog_buf_128 +content_lock      54188


For -b select-only@3, I have done a quick test for each version and the number is the same, 62K~63K, for all versions.  Why do you think this will improve a select-only workload?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Andres Freund
Date:
On 2016-04-07 18:40:14 +0530, Amit Kapila wrote:
> On Thu, Apr 7, 2016 at 10:16 AM, Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2016-04-07 09:14:00 +0530, Amit Kapila wrote:
> > > On Sat, Apr 2, 2016 at 5:25 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > > I have ran exactly same test on intel x86 m/c and the results are as
> below:
> >
> > Thanks for running these tests!
> >
> > > Client Count/Patch_ver (tps) 2 128 256
> > > HEAD – Commit 2143f5e1 2832 35001 26756
> > > clog_buf_128 2909 50685 40998
> > > clog_buf_128 +group_update_clog_v8 2981 53043 50779
> > > clog_buf_128 +content_lock 2843 56261 54059
> > > clog_buf_128 +nocontent_lock 2630 56554 54429
> >
> > Interesting.
> >
> > could you perhaps also run a test with -btpcb-like@1 -bselect-only@3?

> This is the data with -b tpcb-like@1 with 20-min run for each version and I
> could see almost similar results as the data posted in previous e-mail.
> 
> Client Count/Patch_ver (tps) 256
> clog_buf_128 40617
> clog_buf_128 +group_clog_v8 51137
> clog_buf_128 +content_lock 54188
> 
> For -b select-only@3,  I have done quicktest for each version and number is
> same 62K~63K for all version, why do you think this will improve
> select-only workload?

What I was looking for was pgbench with both -btpcb-like@1
-bselect-only@3 specified; i.e. a mixed read/write test. In my
measurement that's where Simon's approach shines (not surprising if you
look at the way it works), and it's of immense practical importance -
most workloads are mixed.
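
For reference, such an invocation could look roughly like this (client
count and duration here are placeholders, not values from any of the
runs above):

    pgbench -M prepared -b tpcb-like@1 -b select-only@3 -c 128 -j 128 -T 1200 postgres

i.e. roughly one tpcb-like transaction for every three select-only
ones, so commit traffic and status-read traffic hit the clog at the
same time.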

Regards,

Andres



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Thu, Apr 7, 2016 at 6:48 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-04-07 18:40:14 +0530, Amit Kapila wrote:
> > This is the data with -b tpcb-like@1 with 20-min run for each version and I
> > could see almost similar results as the data posted in previous e-mail.
> >
> > Client Count/Patch_ver (tps) 256
> > clog_buf_128 40617
> > clog_buf_128 +group_clog_v8 51137
> > clog_buf_128 +content_lock 54188
> >
> > For -b select-only@3,  I have done quicktest for each version and number is
> > same 62K~63K for all version, why do you think this will improve
> > select-only workload?
>
> What I was looking for was pgbench with both -btpcb-like@1
> -bselect-only@3 specified; i.e. a mixed read/write test.
>

Okay, I can take the performance data again, but on what basis are we ignoring the variation of results on the power m/c?  Prior to this, I have not seen such variation for read-write tests.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Thu, Apr 7, 2016 at 6:48 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-04-07 18:40:14 +0530, Amit Kapila wrote:

> This is the data with -b tpcb-like@1 with 20-min run for each version and I
> could see almost similar results as the data posted in previous e-mail.
>
> Client Count/Patch_ver (tps) 256
> clog_buf_128 40617
> clog_buf_128 +group_clog_v8 51137
> clog_buf_128 +content_lock 54188
>
> For -b select-only@3,  I have done quicktest for each version and number is
> same 62K~63K for all version, why do you think this will improve
> select-only workload?

What I was looking for was pgbench with both -btpcb-like@1
-bselect-only@3 specified; i.e. a mixed read/write test.

I have taken the data in the suggested way and the performance seems to be neutral for both patches.  Detailed data for all runs of the three versions is attached.

Median of three 20-minute runs.

Client Count/Patch_ver (tps)       256
clog_buf_128                    110630
clog_buf_128 +group_clog_v8     111575
clog_buf_128 +content_lock       96581


Now, from the above data, it appears that the content lock patch has some regression, but if you look at the detailed data attached to this mail, its highest TPS is close to the other patches, though still on the lower side.

 
In my
measurement that's where Simon's approach shines (not surprising if you
look at the way it works), and it's of immense practical importance -
most workloads are mixed.


I have tried the above tests twice, but didn't notice any gain with the content lock approach.


I think by now we have done many tests with both approaches and find that the content lock approach is slightly better in some cases, neutral in most, and worse than the group clog approach in others.  I feel we should go with the group clog approach now, as it has been tested and reviewed multiple times, and if in future we find that the other approach gives a substantial gain, we can change it then.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: Speed up Clog Access by increasing CLOG buffers

From
Andres Freund
Date:
On 2016-03-31 15:07:22 +0530, Amit Kapila wrote:
> I think we should change comments on top of this function.  I have changed
> the comments as per my previous patch and attached the modified patch with
> this mail, see if that makes sense.

I've applied this patch.

Regards,

Andres



Re: Speed up Clog Access by increasing CLOG buffers

From
Andres Freund
Date:
On 2016-04-08 13:07:05 +0530, Amit Kapila wrote:
> I think by now, we have done many tests with both approaches and we find
> that in some cases, it is slightly better and in most cases it is neutral
> and in some cases it is worse than group clog approach.  I feel we should
> go with group clog approach now as that has been tested and reviewed
> multiple times and in future if we find that other approach is giving
> substantial gain, then we can anyway change it.

I think that's a discussion for the 9.7 cycle unfortunately. I've now
pushed the #clog-buffers patch; that's going to help the worst cases.

Greetings,

Andres Freund



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Fri, Apr 8, 2016 at 9:00 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-03-31 15:07:22 +0530, Amit Kapila wrote:
> > I think we should change comments on top of this function.  I have changed
> > the comments as per my previous patch and attached the modified patch with
> > this mail, see if that makes sense.
>
> I've applied this patch.
>

Thanks!


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
Hi,

This thread started a year ago, different people contributed various
patches, some of which already got committed. Can someone please post a
summary of this thread, so that it's a bit more clear what needs
review/testing, what are the main open questions and so on?

I'm interested in doing some tests on the hardware I have available, but
I'm not willing to spend my time untangling the discussion.

thanks

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Mon, Sep 5, 2016 at 3:18 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Hi,
>
> This thread started a year ago, different people contributed various
> patches, some of which already got committed. Can someone please post a
> summary of this thread, so that it's a bit more clear what needs
> review/testing, what are the main open questions and so on?
>

Okay, let me try to summarize this thread.  This thread started off to
ameliorate the CLOGControlLock contention with a patch to increase the
clog buffers to 128 (which got committed in 9.6).   Then the second
patch was developed to use Group mode to further reduce the
CLOGControlLock contention, latest version of which is upthread [1] (I
have checked that version still gets applied).  Then Andres suggested
to compare the Group lock mode approach with an alternative (more
granular) locking model approach for which he has posted patches
upthread [2].  There are three patches on that link, the patches of
interest are 0001-Improve-64bit-atomics-support and
0003-Use-a-much-more-granular-locking-model-for-the-clog-.  I have
checked that the second one of those no longer applies, so I have
rebased it and attached it to this mail.  In the more granular
locking patch, you can comment out USE_CONTENT_LOCK to make it use
atomic operations (I could not compile it with USE_CONTENT_LOCK
disabled on my Windows box; you can try commenting it out as well, if
it works for you).  So, in short, we have to compare three approaches
here.

1) Group mode to reduce CLOGControlLock contention
2) Use granular locking model
3) Use atomic operations

For approach-1, you can use patch [1].  For approach-2, you can use
0001-Improve-64bit-atomics-support patch[2] and the patch attached
with this mail.  For approach-3, you can use
0001-Improve-64bit-atomics-support patch[2] and the patch attached
with this mail by commenting USE_CONTENT_LOCK.  If the third doesn't
work for you then for now we can compare approach-1 and approach-2.

I have done some testing of these patches with a read-write pgbench
workload and didn't find a big difference.  An interesting test case
could be to use a few sub-transactions (maybe 4-8) per transaction,
as with that we can see more contention on CLOGControlLock.

A few points to note for performance testing: one should use
--unlogged-tables, else the WAL writing and WALWriteLock contention
masks the impact of this patch.  The impact is visible at higher
client counts (say 64~128).
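
As a concrete starting point, a run along the following lines should
exercise the contention (scale factor, number of savepoints and client
count below are placeholders, adjust them to the machine); save the
transaction as e.g. subxact.sql:

    \set aid random(1, 30000000)
    BEGIN;
    UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = :aid;
    SAVEPOINT s1;
    UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = :aid;
    SAVEPOINT s2;
    UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = :aid;
    COMMIT;

and then run:

    pgbench -i -s 300 --unlogged-tables postgres
    pgbench -M prepared -c 64 -j 64 -T 600 -f subxact.sql postgres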

> I'm interested in doing some tests on the hardware I have available, but
> I'm not willing spending my time untangling the discussion.
>

Thanks for showing interest, and let me know if something is still
unclear or you need more information to proceed.


[1] - https://www.postgresql.org/message-id/CAA4eK1%2B8gQTyGSZLe1Rb7jeM1Beh4FqA4VNjtpZcmvwizDQ0hw%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/20160330230914.GH13305%40awork2.anarazel.de

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Speed up Clog Access by increasing CLOG buffers

From
Pavan Deolasee
Date:


On Mon, Sep 5, 2016 at 3:18 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hi,

This thread started a year ago, different people contributed various
patches, some of which already got committed. Can someone please post a
summary of this thread, so that it's a bit more clear what needs
review/testing, what are the main open questions and so on?

I'm interested in doing some tests on the hardware I have available, but
I'm not willing spending my time untangling the discussion.


I signed up for reviewing this patch. But as Amit explained later, there are two different and independent implementations to solve the problem. Since Tomas has volunteered to do some benchmarking, I guess I should wait for the results because that might influence which approach we choose. 

Does that sound correct? Or do we already know which implementation is more likely to be pursued, in which case I can start reviewing that patch.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Mon, Sep 5, 2016 at 2:00 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
>
>
> On Mon, Sep 5, 2016 at 3:18 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
> wrote:
>>
>> Hi,
>>
>> This thread started a year ago, different people contributed various
>> patches, some of which already got committed. Can someone please post a
>> summary of this thread, so that it's a bit more clear what needs
>> review/testing, what are the main open questions and so on?
>>
>> I'm interested in doing some tests on the hardware I have available, but
>> I'm not willing spending my time untangling the discussion.
>>
>
> I signed up for reviewing this patch. But as Amit explained later, there are
> two different and independent implementations to solve the problem. Since
> Tomas has volunteered to do some benchmarking, I guess I should wait for the
> results because that might influence which approach we choose.
>
> Does that sound correct?
>

Sounds correct to me.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:

On 09/05/2016 06:03 AM, Amit Kapila wrote:
> On Mon, Sep 5, 2016 at 3:18 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Hi,
>>
>> This thread started a year ago, different people contributed various
>> patches, some of which already got committed. Can someone please post a
>> summary of this thread, so that it's a bit more clear what needs
>> review/testing, what are the main open questions and so on?
>>
> 
> Okay, let me try to summarize this thread.  This thread started off to
> ameliorate the CLOGControlLock contention with a patch to increase the
> clog buffers to 128 (which got committed in 9.6).   Then the second
> patch was developed to use Group mode to further reduce the
> CLOGControlLock contention, latest version of which is upthread [1] (I
> have checked that version still gets applied).  Then Andres suggested
> to compare the Group lock mode approach with an alternative (more
> granular) locking model approach for which he has posted patches
> upthread [2].  There are three patches on that link, the patches of
> interest are 0001-Improve-64bit-atomics-support and
> 0003-Use-a-much-more-granular-locking-model-for-the-clog-.  I have
> checked that second one of those doesn't get applied, so I have
> rebased it and attached it with this mail.  In the more granular
> locking approach, actually, you can comment USE_CONTENT_LOCK to make
> it use atomic operations (I could not compile it by disabling
> USE_CONTENT_LOCK on my windows box, you can try by commenting that as
> well, if it works for you).  So, in short we have to compare three
> approaches here.
> 
> 1) Group mode to reduce CLOGControlLock contention
> 2) Use granular locking model
> 3) Use atomic operations
> 
> For approach-1, you can use patch [1].  For approach-2, you can use
> 0001-Improve-64bit-atomics-support patch[2] and the patch attached
> with this mail.  For approach-3, you can use
> 0001-Improve-64bit-atomics-support patch[2] and the patch attached
> with this mail by commenting USE_CONTENT_LOCK.  If the third doesn't
> work for you then for now we can compare approach-1 and approach-2.
> 

OK, I can compile all three cases - but only with gcc 4.7 or newer. Sadly
the 4-socket 64-core machine runs Debian Jessie with just gcc 4.6 and my
attempts to update to a newer version were unsuccessful so far.

> I have done some testing of these patches for read-write pgbench
> workload and doesn't find big difference.  Now the interesting test
> case could be to use few sub-transactions (may be 4-8) for each
> transaction as with that we can see more contention for
> CLOGControlLock.

Understood. So a bunch of inserts/updates interleaved by savepoints?

I presume you started looking into this based on a real-world
performance issue, right? Would that be a good test case?

> 
> Few points to note for performance testing, one should use --unlogged
> tables, else the WAL writing and WALWriteLock contention masks the
> impact of this patch.  The impact of this patch is visible at
> higher-client counts (say at 64~128).
> 

Even on good hardware (say, PCIe SSD storage that can do thousands of
fsyncs per second)? Does it then make sense to try optimizing this if
the effect can only be observed without the WAL overhead (so almost
never in practice)?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Mon, Sep 5, 2016 at 11:34 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
>
> On 09/05/2016 06:03 AM, Amit Kapila wrote:
>>  So, in short we have to compare three
>> approaches here.
>>
>> 1) Group mode to reduce CLOGControlLock contention
>> 2) Use granular locking model
>> 3) Use atomic operations
>>
>> For approach-1, you can use patch [1].  For approach-2, you can use
>> 0001-Improve-64bit-atomics-support patch[2] and the patch attached
>> with this mail.  For approach-3, you can use
>> 0001-Improve-64bit-atomics-support patch[2] and the patch attached
>> with this mail by commenting USE_CONTENT_LOCK.  If the third doesn't
>> work for you then for now we can compare approach-1 and approach-2.
>>
>
> OK, I can compile all three cases - but onl with gcc 4.7 or newer. Sadly
> the 4-socket 64-core machine runs Debian Jessie with just gcc 4.6 and my
> attempts to update to a newer version were unsuccessful so far.
>

So which of the patches are you able to compile on the 4-socket m/c?  I
think it is better to measure the performance on the bigger machine.

>> I have done some testing of these patches for read-write pgbench
>> workload and doesn't find big difference.  Now the interesting test
>> case could be to use few sub-transactions (may be 4-8) for each
>> transaction as with that we can see more contention for
>> CLOGControlLock.
>
> Understood. So a bunch of inserts/updates interleaved by savepoints?
>

Yes.

> I presume you started looking into this based on a real-world
> performance issue, right? Would that be a good test case?
>

I had started looking into it based on LWLOCK_STATS data for a
read-write workload (pgbench tpc-b).  I think that is representative
of many real-world read-write workloads.
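
(In case it helps anyone reproducing this: LWLOCK_STATS is a
compile-time option, so a rough sketch of gathering such data - the
exact flags and paths are up to you - would be:

    ./configure CPPFLAGS="-DLWLOCK_STATS"    # plus your usual options
    make -j8 && make install

    # run the workload; each backend prints per-lwlock acquisition and
    # blocking counters to stderr (i.e. the server log) when it exits
    grep lwlock $PGDATA/pg_log/*.log         # path depends on logging setup

and then aggregate the counters for the lock of interest.)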

>>
>> Few points to note for performance testing, one should use --unlogged
>> tables, else the WAL writing and WALWriteLock contention masks the
>> impact of this patch.  The impact of this patch is visible at
>> higher-client counts (say at 64~128).
>>
>
> Even on good hardware (say, PCIe SSD storage that can do thousands of
> fsyncs per second)?

Not sure, because it could be masked by WALWriteLock contention.

> Does it then make sense to try optimizing this if
> the effect can only be observed without the WAL overhead (so almost
> never in practice)?
>

It is not that there is no improvement with the WAL overhead (one can
observe it via LWLOCK_STATS, apart from TPS), but it is clearly
visible with unlogged tables.  The situation is not that simple: say
we do nothing about the remaining CLOGControlLock contention; then
when we try to reduce the contention around other locks like
WALWriteLock or maybe ProcArrayLock, there is a chance that contention
will shift to CLOGControlLock.  So the basic idea is that to get the
big benefits, we need to eliminate contention around each of these
locks.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:

On 09/06/2016 04:49 AM, Amit Kapila wrote:
> On Mon, Sep 5, 2016 at 11:34 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>>
>>
>> On 09/05/2016 06:03 AM, Amit Kapila wrote:
>>>  So, in short we have to compare three
>>> approaches here.
>>>
>>> 1) Group mode to reduce CLOGControlLock contention
>>> 2) Use granular locking model
>>> 3) Use atomic operations
>>>
>>> For approach-1, you can use patch [1].  For approach-2, you can use
>>> 0001-Improve-64bit-atomics-support patch[2] and the patch attached
>>> with this mail.  For approach-3, you can use
>>> 0001-Improve-64bit-atomics-support patch[2] and the patch attached
>>> with this mail by commenting USE_CONTENT_LOCK.  If the third doesn't
>>> work for you then for now we can compare approach-1 and approach-2.
>>>
>>
>> OK, I can compile all three cases - but onl with gcc 4.7 or newer. Sadly
>> the 4-socket 64-core machine runs Debian Jessie with just gcc 4.6 and my
>> attempts to update to a newer version were unsuccessful so far.
>>
> 
> So which all patches your are able to compile on 4-socket m/c?  I
> think it is better to measure the performance on bigger machine.

Oh, sorry - I forgot to mention that only the last test (with
USE_CONTENT_LOCK commented out) fails to compile, because the functions
for atomics were added in gcc-4.7.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Wed, Sep 7, 2016 at 1:08 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 09/06/2016 04:49 AM, Amit Kapila wrote:
>> On Mon, Sep 5, 2016 at 11:34 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>>
>>>
>>> On 09/05/2016 06:03 AM, Amit Kapila wrote:
>>>>  So, in short we have to compare three
>>>> approaches here.
>>>>
>>>> 1) Group mode to reduce CLOGControlLock contention
>>>> 2) Use granular locking model
>>>> 3) Use atomic operations
>>>>
>>>> For approach-1, you can use patch [1].  For approach-2, you can use
>>>> 0001-Improve-64bit-atomics-support patch[2] and the patch attached
>>>> with this mail.  For approach-3, you can use
>>>> 0001-Improve-64bit-atomics-support patch[2] and the patch attached
>>>> with this mail by commenting USE_CONTENT_LOCK.  If the third doesn't
>>>> work for you then for now we can compare approach-1 and approach-2.
>>>>
>>>
>>> OK, I can compile all three cases - but onl with gcc 4.7 or newer. Sadly
>>> the 4-socket 64-core machine runs Debian Jessie with just gcc 4.6 and my
>>> attempts to update to a newer version were unsuccessful so far.
>>>
>>
>> So which all patches your are able to compile on 4-socket m/c?  I
>> think it is better to measure the performance on bigger machine.
>
> Oh, sorry - I forgot to mention that only the last test (with
> USE_CONTENT_LOCK commented out) fails to compile, because the functions
> for atomics were added in gcc-4.7.
>

No issues, in that case we can leave the last test for now and do it later.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:

On 09/07/2016 01:13 PM, Amit Kapila wrote:
> On Wed, Sep 7, 2016 at 1:08 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> On 09/06/2016 04:49 AM, Amit Kapila wrote:
>>> On Mon, Sep 5, 2016 at 11:34 PM, Tomas Vondra
>>> <tomas.vondra@2ndquadrant.com> wrote:
>>>>
>>>>
>>>> On 09/05/2016 06:03 AM, Amit Kapila wrote:
>>>>>  So, in short we have to compare three
>>>>> approaches here.
>>>>>
>>>>> 1) Group mode to reduce CLOGControlLock contention
>>>>> 2) Use granular locking model
>>>>> 3) Use atomic operations
>>>>>
>>>>> For approach-1, you can use patch [1].  For approach-2, you can use
>>>>> 0001-Improve-64bit-atomics-support patch[2] and the patch attached
>>>>> with this mail.  For approach-3, you can use
>>>>> 0001-Improve-64bit-atomics-support patch[2] and the patch attached
>>>>> with this mail by commenting USE_CONTENT_LOCK.  If the third doesn't
>>>>> work for you then for now we can compare approach-1 and approach-2.
>>>>>
>>>>
>>>> OK, I can compile all three cases - but onl with gcc 4.7 or newer. Sadly
>>>> the 4-socket 64-core machine runs Debian Jessie with just gcc 4.6 and my
>>>> attempts to update to a newer version were unsuccessful so far.
>>>>
>>>
>>> So which all patches your are able to compile on 4-socket m/c?  I
>>> think it is better to measure the performance on bigger machine.
>>
>> Oh, sorry - I forgot to mention that only the last test (with
>> USE_CONTENT_LOCK commented out) fails to compile, because the functions
>> for atomics were added in gcc-4.7.
>>
> 
> No issues, in that case we can leave the last test for now and do it later.
> 

FWIW I've managed to compile a new GCC on the system (all I had to do
was to actually read the damn manual), so I'm ready to do the test once
I get a bit of time.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Dilip Kumar
Date:
On Mon, Sep 5, 2016 at 9:33 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> USE_CONTENT_LOCK on my windows box, you can try by commenting that as
> well, if it works for you).  So, in short we have to compare three
> approaches here.
>
> 1) Group mode to reduce CLOGControlLock contention
> 2) Use granular locking model
> 3) Use atomic operations

I have tested performance with approach 1 and approach 2.

1. Transaction (script.sql): I have used the below transaction to run
my benchmark.  We can argue that this may not be an ideal workload,
but I chose it to put more load on CLogControlLock during transaction
commit.

-----------
\set aid random (1,30000000)
\set tid random (1,3000)

BEGIN;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
SAVEPOINT s1;
SELECT tbalance FROM pgbench_tellers WHERE tid = :tid for UPDATE;
SAVEPOINT s2;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
END;
-----------

2. Results
./pgbench -c $threads -j $threads -T 10 -M prepared postgres -f script.sql
scale factor: 300
Clients   head(tps)        grouplock(tps)          granular(tps)
-------      ---------               ----------                   -------
128        29367                 39326                    37421
180        29777                 37810                    36469
256        28523                 37418                    35882


grouplock --> 1) Group mode to reduce CLOGControlLock contention
granular  --> 2) Use granular locking model

I will test with 3rd approach also, whenever I get time.

3. Summary:
1. Compared to head, we are gaining almost ~30% performance at higher
client counts (128 and beyond).
2. Group lock is ~5% better compared to granular lock.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Dilip Kumar
Date:
On Wed, Sep 14, 2016 at 10:25 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have tested performance with approach 1 and approach 2.
>
> 1. Transaction (script.sql): I have used below transaction to run my
> bench mark, We can argue that this may not be an ideal workload, but I
> tested this to put more load on ClogControlLock during commit
> transaction.
>
> -----------
> \set aid random (1,30000000)
> \set tid random (1,3000)
>
> BEGIN;
> SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
> SAVEPOINT s1;
> SELECT tbalance FROM pgbench_tellers WHERE tid = :tid for UPDATE;
> SAVEPOINT s2;
> SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
> END;
> -----------
>
> 2. Results
> ./pgbench -c $threads -j $threads -T 10 -M prepared postgres -f script.sql
> scale factor: 300
> Clients   head(tps)        grouplock(tps)          granular(tps)
> -------      ---------               ----------                   -------
> 128        29367                 39326                    37421
> 180        29777                 37810                    36469
> 256        28523                 37418                    35882
>
>
> grouplock --> 1) Group mode to reduce CLOGControlLock contention
> granular  --> 2) Use granular locking model
>
> I will test with 3rd approach also, whenever I get time.
>
> 3. Summary:
> 1. I can see on head we are gaining almost ~30 % performance at higher
> client count (128 and beyond).
> 2. group lock is ~5% better compared to granular lock.

Forgot to mention that this test is on unlogged tables.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Wed, Sep 14, 2016 at 12:55 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> 2. Results
> ./pgbench -c $threads -j $threads -T 10 -M prepared postgres -f script.sql
> scale factor: 300
> Clients   head(tps)        grouplock(tps)          granular(tps)
> -------      ---------               ----------                   -------
> 128        29367                 39326                    37421
> 180        29777                 37810                    36469
> 256        28523                 37418                    35882
>
>
> grouplock --> 1) Group mode to reduce CLOGControlLock contention
> granular  --> 2) Use granular locking model
>
> I will test with 3rd approach also, whenever I get time.
>
> 3. Summary:
> 1. I can see on head we are gaining almost ~30 % performance at higher
> client count (128 and beyond).
> 2. group lock is ~5% better compared to granular lock.

Sure, but you're testing at *really* high client counts here.  Almost
nobody is going to benefit from a 5% improvement at 256 clients.  You
need to test 64 clients and 32 clients and 16 clients and 8 clients
and see what happens there.  Those cases are a lot more likely than
these stratospheric client counts.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Speed up Clog Access by increasing CLOG buffers

From
Dilip Kumar
Date:
On Wed, Sep 14, 2016 at 8:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Sure, but you're testing at *really* high client counts here.  Almost
> nobody is going to benefit from a 5% improvement at 256 clients.

I agree with your point, but here we need to consider one more thing:
compared to head, we are gaining ~30% with both approaches.

So for comparing these two patches we can consider:

A.  Other workloads (one can be as below)
   -> Load on CLogControlLock at commit (exclusive mode) + Load on
CLogControlLock at Transaction status (shared mode).
   I think we can mix (savepoint + updates)

B. Simplicity of the patch (if both are performing almost equal in all
practical scenarios).

C. Based on the algorithm, whichever seems the winner.

I will try to test these patches with other workloads...

>  You
> need to test 64 clients and 32 clients and 16 clients and 8 clients
> and see what happens there.  Those cases are a lot more likely than
> these stratospheric client counts.

I tested with 64 clients as well:
1. Compared to head, we are gaining ~15% with both patches.
2. But group lock vs granular lock is almost the same.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 09/14/2016 06:04 PM, Dilip Kumar wrote:
> On Wed, Sep 14, 2016 at 8:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Sure, but you're testing at *really* high client counts here.  Almost
>> nobody is going to benefit from a 5% improvement at 256 clients.
>
> I agree with your point, but here we need to consider one more thing,
> that on head we are gaining ~30% with both the approaches.
>
> So for comparing these two patches we can consider..
>
> A.  Other workloads (one can be as below)
>    -> Load on CLogControlLock at commit (exclusive mode) + Load on
> CLogControlLock at Transaction status (shared mode).
>    I think we can mix (savepoint + updates)
>
> B. Simplicity of the patch (if both are performing almost equal in all
> practical scenarios).
>
> C. Bases on algorithm whichever seems winner.
>
> I will try to test these patches with other workloads...
>
>>  You
>> need to test 64 clients and 32 clients and 16 clients and 8 clients
>> and see what happens there.  Those cases are a lot more likely than
>> these stratospheric client counts.
>
> I tested with 64 clients as well..
> 1. On head we are gaining ~15% with both the patches.
> 2. But group lock vs granular lock is almost same.
>

I've been doing some testing too, but I haven't managed to measure any
significant difference between master and any of the patches. Not sure
why, I've repeated the test from scratch to make sure I haven't done
anything stupid, but I got the same results (which is one of the main
reasons why the testing took me so long).

Attached is an archive with a script running the benchmark (including
SQL scripts generating the data and custom transaction for pgbench), and
results in a CSV format.

The benchmark is fairly simple - for each case (master + 3 different
patches) we do 10 runs, 5 minutes each, for 32, 64, 128 and 192 clients
(the machine has 32 physical cores).

The transaction is using a single unlogged table initialized like this:

     create unlogged table t(id int, val int);
     insert into t select i, i from generate_series(1,100000) s(i);
     vacuum t;
     create index on t(id);

(I've also ran it with 100M rows, called "large" in the results), and
pgbench is running this transaction:

     \set id random(1, 100000)

     BEGIN;
     UPDATE t SET val = val + 1 WHERE id = :id;
     SAVEPOINT s1;
     UPDATE t SET val = val + 1 WHERE id = :id;
     SAVEPOINT s2;
     UPDATE t SET val = val + 1 WHERE id = :id;
     SAVEPOINT s3;
     UPDATE t SET val = val + 1 WHERE id = :id;
     SAVEPOINT s4;
     UPDATE t SET val = val + 1 WHERE id = :id;
     SAVEPOINT s5;
     UPDATE t SET val = val + 1 WHERE id = :id;
     SAVEPOINT s6;
     UPDATE t SET val = val + 1 WHERE id = :id;
     SAVEPOINT s7;
     UPDATE t SET val = val + 1 WHERE id = :id;
     SAVEPOINT s8;
     COMMIT;

So 8 simple UPDATEs interleaved by savepoints. The benchmark was running
on a machine with 256GB of RAM, 32 cores (4x E5-4620) and a fairly large
SSD array. I'd done some basic tuning on the system, most importantly:

     effective_io_concurrency = 32
     work_mem = 512MB
     maintenance_work_mem = 512MB
     max_connections = 300
     checkpoint_completion_target = 0.9
     checkpoint_timeout = 3600
     max_wal_size = 128GB
     min_wal_size = 16GB
     shared_buffers = 16GB

Although most of the changes probably do not matter much for unlogged
tables (I planned to see how this affects regular tables, but as I see
no difference for unlogged ones, I haven't done that yet).

So the question is why Dilip sees +30% improvement, while my results are
almost exactly the same. Looking at Dilip's benchmark, I see he only ran
the test for 10 seconds, and I'm not sure how many runs he did, warmup
etc. Dilip, can you provide additional info?

I'll ask someone else to redo the benchmark after the weekend to make
sure it's not actually some stupid mistake of mine.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Sat, Sep 17, 2016 at 6:54 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 09/14/2016 06:04 PM, Dilip Kumar wrote:
>>
>> On Wed, Sep 14, 2016 at 8:59 PM, Robert Haas <robertmhaas@gmail.com>
>> wrote:
>>>
>>> Sure, but you're testing at *really* high client counts here.  Almost
>>> nobody is going to benefit from a 5% improvement at 256 clients.
>>
>>
>> I agree with your point, but here we need to consider one more thing,
>> that on head we are gaining ~30% with both the approaches.
>>
>> So for comparing these two patches we can consider..
>>
>> A.  Other workloads (one can be as below)
>>    -> Load on CLogControlLock at commit (exclusive mode) + Load on
>> CLogControlLock at Transaction status (shared mode).
>>    I think we can mix (savepoint + updates)
>>
>> B. Simplicity of the patch (if both are performing almost equal in all
>> practical scenarios).
>>
>> C. Bases on algorithm whichever seems winner.
>>
>> I will try to test these patches with other workloads...
>>
>>>  You
>>> need to test 64 clients and 32 clients and 16 clients and 8 clients
>>> and see what happens there.  Those cases are a lot more likely than
>>> these stratospheric client counts.
>>
>>
>> I tested with 64 clients as well..
>> 1. On head we are gaining ~15% with both the patches.
>> 2. But group lock vs granular lock is almost same.
>>
>
>
> The transaction is using a single unlogged table initialized like this:
>
>     create unlogged table t(id int, val int);
>     insert into t select i, i from generate_series(1,100000) s(i);
>     vacuum t;
>     create index on t(id);
>
> (I've also ran it with 100M rows, called "large" in the results), and
> pgbench is running this transaction:
>
>     \set id random(1, 100000)
>
>     BEGIN;
>     UPDATE t SET val = val + 1 WHERE id = :id;
>     SAVEPOINT s1;
>     UPDATE t SET val = val + 1 WHERE id = :id;
>     SAVEPOINT s2;
>     UPDATE t SET val = val + 1 WHERE id = :id;
>     SAVEPOINT s3;
>     UPDATE t SET val = val + 1 WHERE id = :id;
>     SAVEPOINT s4;
>     UPDATE t SET val = val + 1 WHERE id = :id;
>     SAVEPOINT s5;
>     UPDATE t SET val = val + 1 WHERE id = :id;
>     SAVEPOINT s6;
>     UPDATE t SET val = val + 1 WHERE id = :id;
>     SAVEPOINT s7;
>     UPDATE t SET val = val + 1 WHERE id = :id;
>     SAVEPOINT s8;
>     COMMIT;
>
> So 8 simple UPDATEs interleaved by savepoints.
>

The difference between these and the tests performed by Dilip is that
he has fewer savepoints.  If you want to try it again, can you do it
once with either no savepoints or 1~2 savepoints?  The other thing you
could try out is the same test Dilip has done (with and without 2
savepoints).

> The benchmark was running on
> a machine with 256GB of RAM, 32 cores (4x E5-4620) and a fairly large SSD
> array. I'd done some basic tuning on the system, most importantly:
>
>     effective_io_concurrency = 32
>     work_mem = 512MB
>     maintenance_work_mem = 512MB
>     max_connections = 300
>     checkpoint_completion_target = 0.9
>     checkpoint_timeout = 3600
>     max_wal_size = 128GB
>     min_wal_size = 16GB
>     shared_buffers = 16GB
>
> Although most of the changes probably does not matter much for unlogged
> tables (I planned to see how this affects regular tables, but as I see no
> difference for unlogged ones, I haven't done that yet).
>

You are right.  Unless we see the benefit with unlogged tables,
there is no point in doing it for regular tables.

> So the question is why Dilip sees +30% improvement, while my results are
> almost exactly the same. Looking at Dilip's benchmark, I see he only ran the
> test for 10 seconds, and I'm not sure how many runs he did, warmup etc.
> Dilip, can you provide additional info?
>
> I'll ask someone else to redo the benchmark after the weekend to make sure
> it's not actually some stupid mistake of mine.
>

I think there is not much point in repeating the tests you have done;
rather, it would be better to try the tests done by Dilip in your
environment and see the results.

Thanks for doing the tests.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 09/17/2016 05:23 AM, Amit Kapila wrote:
> On Sat, Sep 17, 2016 at 6:54 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> On 09/14/2016 06:04 PM, Dilip Kumar wrote:
>>>
...
>>
>> (I've also ran it with 100M rows, called "large" in the results), and
>> pgbench is running this transaction:
>>
>>     \set id random(1, 100000)
>>
>>     BEGIN;
>>     UPDATE t SET val = val + 1 WHERE id = :id;
>>     SAVEPOINT s1;
>>     UPDATE t SET val = val + 1 WHERE id = :id;
>>     SAVEPOINT s2;
>>     UPDATE t SET val = val + 1 WHERE id = :id;
>>     SAVEPOINT s3;
>>     UPDATE t SET val = val + 1 WHERE id = :id;
>>     SAVEPOINT s4;
>>     UPDATE t SET val = val + 1 WHERE id = :id;
>>     SAVEPOINT s5;
>>     UPDATE t SET val = val + 1 WHERE id = :id;
>>     SAVEPOINT s6;
>>     UPDATE t SET val = val + 1 WHERE id = :id;
>>     SAVEPOINT s7;
>>     UPDATE t SET val = val + 1 WHERE id = :id;
>>     SAVEPOINT s8;
>>     COMMIT;
>>
>> So 8 simple UPDATEs interleaved by savepoints.
>>
>
> The difference between these and tests performed by Dilip is that he
> has lesser savepoints.  I think if you want to try it again, then can
> you once do it with either no savepoint or 1~2 savepoints.  The other
> thing you could try out is the same test as Dilip has done (with and
> without 2 savepoints).
>

I don't follow. My understanding is the patches should make savepoints 
cheaper - so why would using fewer savepoints increase the effect of the 
patches?

FWIW I've already done a quick test with 2 savepoints, no difference. I 
can do a full test of course.

>> The benchmark was running on
>> a machine with 256GB of RAM, 32 cores (4x E5-4620) and a fairly large SSD
>> array. I'd done some basic tuning on the system, most importantly:
>>
>>     effective_io_concurrency = 32
>>     work_mem = 512MB
>>     maintenance_work_mem = 512MB
>>     max_connections = 300
>>     checkpoint_completion_target = 0.9
>>     checkpoint_timeout = 3600
>>     max_wal_size = 128GB
>>     min_wal_size = 16GB
>>     shared_buffers = 16GB
>>
>> Although most of the changes probably does not matter much for unlogged
>> tables (I planned to see how this affects regular tables, but as I see no
>> difference for unlogged ones, I haven't done that yet).
>>
>
> You are right.  Unless, we don't see the benefit with unlogged tables,
> there is no point in doing it for regular tables.
>
>> So the question is why Dilip sees +30% improvement, while my results are
>> almost exactly the same. Looking at Dilip's benchmark, I see he only ran the
>> test for 10 seconds, and I'm not sure how many runs he did, warmup etc.
>> Dilip, can you provide additional info?
>>
>> I'll ask someone else to redo the benchmark after the weekend to make sure
>> it's not actually some stupid mistake of mine.
>>
>
> I think there is not much point in repeating the tests you have
> done, rather it is better if we can try again the tests done by Dilip
> in your environment to see the results.
>

I'm OK with running Dilip's tests, but I'm not sure why there's not much 
point in running the tests I've done. Or perhaps I'd like to understand 
why "my tests" show no improvement whatsoever first - after all, they're 
not that different from Dilip's.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 09/14/2016 05:29 PM, Robert Haas wrote:
> On Wed, Sep 14, 2016 at 12:55 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> 2. Results
>> ./pgbench -c $threads -j $threads -T 10 -M prepared postgres -f script.sql
>> scale factor: 300
>> Clients   head(tps)        grouplock(tps)          granular(tps)
>> -------      ---------               ----------                   -------
>> 128        29367                 39326                    37421
>> 180        29777                 37810                    36469
>> 256        28523                 37418                    35882
>>
>>
>> grouplock --> 1) Group mode to reduce CLOGControlLock contention
>> granular  --> 2) Use granular locking model
>>
>> I will test with 3rd approach also, whenever I get time.
>>
>> 3. Summary:
>> 1. I can see on head we are gaining almost ~30 % performance at higher
>> client count (128 and beyond).
>> 2. group lock is ~5% better compared to granular lock.
>
> Sure, but you're testing at *really* high client counts here.  Almost
> nobody is going to benefit from a 5% improvement at 256 clients.  You
> need to test 64 clients and 32 clients and 16 clients and 8 clients
> and see what happens there.  Those cases are a lot more likely than
> these stratospheric client counts.
>

Right. My impression from the discussion so far is that the patches only 
improve performance with very many concurrent clients - but as Robert 
points out, almost no one is running with 256 active clients, unless 
they have 128 cores or so. At least not if they value latency more than 
throughput.

So while it's nice to improve throughput in those cases, it's a bit like 
a tree falling in the forest without anyone around.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Sat, Sep 17, 2016 at 9:12 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 09/17/2016 05:23 AM, Amit Kapila wrote:
>>
>> The difference between these and tests performed by Dilip is that he
>> has lesser savepoints.  I think if you want to try it again, then can
>> you once do it with either no savepoint or 1~2 savepoints.  The other
>> thing you could try out is the same test as Dilip has done (with and
>> without 2 savepoints).
>>
>
> I don't follow. My understanding is the patches should make savepoints
> cheaper - so why would using fewer savepoints increase the effect of the
> patches?
>

Oh no, the purpose of the patch is not to make savepoints cheaper (I
know I earlier suggested checking with a few savepoints, but that was
not meant to imply that this patch makes savepoints cheaper; rather,
it might show the difference between the approaches - sorry if that
was not clearly stated earlier).  The purpose of these patches is to
make commits cheaper, in particular updating the transaction status in
CLOG.  Let me briefly explain the CLOG contention and what these
patches try to accomplish.  As of head, when we update the status in
CLOG (TransactionIdSetPageStatus), we take CLOGControlLock in
EXCLUSIVE mode to read the appropriate CLOG page (most of the time it
will be in memory, so that part is cheap) and then update the
transaction status in it.  We take CLOGControlLock in SHARED mode
while reading the transaction status (if the required clog page is in
memory; otherwise the lock is upgraded to Exclusive), which happens
when we access a tuple whose hint bits are not set.

So, we have two different types of contention around CLOGControlLock:
(a) all the transactions that try to commit at the same time have to
do it almost serially, and (b) readers of transaction status contend
with writers.

Now with the patch that went into 9.6 (increasing the clog buffers),
the second type of contention is mostly reduced, as most of the
required pages are in memory, and we are hoping that this patch will
help in reducing the first type (a) of contention as well.
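
(As an aside, with 9.6 one can watch this directly while a benchmark is
running by sampling the new wait_event columns; a rough sketch, with
the database name as a placeholder:

    psql -d postgres -c "SELECT wait_event_type, wait_event, count(*)
                         FROM pg_stat_activity
                         WHERE wait_event IS NOT NULL
                         GROUP BY 1, 2
                         ORDER BY 3 DESC;"

Backends piling up on CLogControlLock there correspond to the
contention described above.)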

>>
>
> I'm OK with running Dilip's tests, but I'm not sure why there's not much
> point in running the tests I've done. Or perhaps I'd like to understand why
> "my tests" show no improvement whatsoever first - after all, they're not
> that different from Dilip's.
>

The test Dilip is doing ("Select ... For Update") is mainly aimed at
the first type (a) of contention, as it doesn't change the hint bits,
so mostly it should not need to read the transaction status when
accessing the tuple.  The tests you are doing, on the other hand, are
mainly focussed on the second type (b) of contention.

One point we have to keep in mind here is that this contention is
visible on bigger multi-socket m/cs.  Last time Jesper also tried
these patches but didn't find much difference, and on further analysis
(IIRC) we found that the reason was that the contention was simply not
visible in his environment.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Sat, Sep 17, 2016 at 9:17 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 09/14/2016 05:29 PM, Robert Haas wrote:
>>
>> On Wed, Sep 14, 2016 at 12:55 AM, Dilip Kumar <dilipbalaut@gmail.com>
>> wrote:
>>>
>>> 2. Results
>>> ./pgbench -c $threads -j $threads -T 10 -M prepared postgres -f
>>> script.sql
>>> scale factor: 300
>>> Clients   head(tps)        grouplock(tps)          granular(tps)
>>> -------      ---------               ----------                   -------
>>> 128        29367                 39326                    37421
>>> 180        29777                 37810                    36469
>>> 256        28523                 37418                    35882
>>>
>>>
>>> grouplock --> 1) Group mode to reduce CLOGControlLock contention
>>> granular  --> 2) Use granular locking model
>>>
>>> I will test with 3rd approach also, whenever I get time.
>>>
>>> 3. Summary:
>>> 1. I can see on head we are gaining almost ~30 % performance at higher
>>> client count (128 and beyond).
>>> 2. group lock is ~5% better compared to granular lock.
>>
>>
>> Sure, but you're testing at *really* high client counts here.  Almost
>> nobody is going to benefit from a 5% improvement at 256 clients.  You
>> need to test 64 clients and 32 clients and 16 clients and 8 clients
>> and see what happens there.  Those cases are a lot more likely than
>> these stratospheric client counts.
>>
>
> Right. My impression from the discussion so far is that the patches only
> improve performance with very many concurrent clients - but as Robert points
> out, almost no one is running with 256 active clients, unless they have 128
> cores or so. At least not if they value latency more than throughput.
>

See, I am also not in favor of going with any of these patches if
they don't help in reducing contention.  However, I think it is
important to understand under what kind of workload and in which
environment they show a benefit or a regression, whichever is
applicable.  Just FYI, a couple of days back one of EDB's partners,
who was doing performance tests using HammerDB (again an OLTP-focussed
workload) on 9.5-based code, found that CLogControlLock has
significantly high contention.  They were using
synchronous_commit=off in their settings.  Now, it is quite possible
that with the improvements done in 9.6 the contention they are seeing
will be eliminated, but we have yet to figure that out.  I shared
this information with the intention of showing that this seems to be a
real problem, and we should try to work on it unless we can convince
ourselves that it is not.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Dilip Kumar
Date:
On Sat, Sep 17, 2016 at 6:54 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Although most of the changes probably does not matter much for unlogged
> tables (I planned to see how this affects regular tables, but as I see no
> difference for unlogged ones, I haven't done that yet).
>
> So the question is why Dilip sees +30% improvement, while my results are
> almost exactly the same. Looking at Dilip's benchmark, I see he only ran the
> test for 10 seconds, and I'm not sure how many runs he did, warmup etc.
> Dilip, can you provide additional info?

Actually I ran the test for 10 minutes.

Sorry for the confusion (I copy-pasted my script, manually replaced
the variable, and made a mistake).

My script is like this

scale_factor=300
shared_bufs=8GB
time_for_reading=600

./postgres -c shared_buffers=8GB -c checkpoint_timeout=40min -c
max_wal_size=20GB -c max_connections=300 -c maintenance_work_mem=1GB&
./pgbench -i -s $scale_factor --unlogged-tables postgres
./pgbench -c $threads -j $threads -T $time_for_reading -M prepared
postgres -f ../../script.sql >> test_results.txt

I am taking the median of three readings.

With the below script, I can repeat my results every time (a 15% gain
over head at 64 clients and a 30% gain over head at 128+ clients).

I will repeat my test with 8, 16 and 32 clients and post the results soon.

> \set aid random (1,30000000)
> \set tid random (1,3000)
>
> BEGIN;
> SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
> SAVEPOINT s1;
> SELECT tbalance FROM pgbench_tellers WHERE tid = :tid for UPDATE;
> SAVEPOINT s2;
> SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
> END;
> -----------


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 09/17/2016 07:05 AM, Amit Kapila wrote:
> On Sat, Sep 17, 2016 at 9:17 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> On 09/14/2016 05:29 PM, Robert Haas wrote:
...
>>> Sure, but you're testing at *really* high client counts here.
>>> Almost nobody is going to benefit from a 5% improvement at 256
>>> clients. You need to test 64 clients and 32 clients and 16
>>> clients and 8 clients and see what happens there. Those cases are
>>> a lot more likely than these stratospheric client counts.
>>>
>>
>> Right. My impression from the discussion so far is that the patches
>> only improve performance with very many concurrent clients - but as
>> Robert points out, almost no one is running with 256 active
>> clients, unless they have 128 cores or so. At least not if they
>> value latency more than throughput.
>>
>
> See, I am also not in favor of going with any of these patches, if
> they doesn't help in reduction of contention. However, I think it is
> important to understand, under what kind of workload and which
> environment it can show the benefit or regression whichever is
> applicable.

Sure. Which is why I initially asked what type of workload I should be 
testing, and then did the testing with multiple savepoints as that's 
what you suggested. But apparently that's not a workload that could 
benefit from this patch, so I'm a bit confused.

> Just FYI, couple of days back one of EDB's partner who was doing the
> performance tests by using HammerDB (which is again OLTP focussed
> workload) on 9.5 based code has found that CLogControlLock has the
> significantly high contention. They were using synchronous_commit=off
> in their settings. Now, it is quite possible that with improvements
> done in 9.6, the contention they are seeing will be eliminated, but
> we have yet to figure that out. I just shared this information to you
> with the intention that this seems to be a real problem and we should
> try to work on it unless we are able to convince ourselves that this
> is not a problem.
>

So, can we approach the problem from this direction instead? That is, 
instead of looking for workloads that might benefit from the patches, 
look at real-world examples of CLOG lock contention and then evaluate 
the impact on those?

Extracting the workload from benchmarks probably is not ideal, but it's 
still better than constructing the workload on our own to fit the patch.

FWIW I'll do a simple pgbench test - first with synchronous_commit=on 
and then with synchronous_commit=off. Probably the workloads we should 
have started with anyway, I guess.
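
A simple way to flip that setting between runs, without editing 
postgresql.conf (assuming the usual superuser access), is something 
like this:

    -- switch the cluster-wide default and reload; sessions pick it up
    -- for their next transactions
    ALTER SYSTEM SET synchronous_commit = off;  -- or 'on' for the other runs
    SELECT pg_reload_conf();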

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Sat, Sep 17, 2016 at 11:25 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 09/17/2016 07:05 AM, Amit Kapila wrote:
>>
>> On Sat, Sep 17, 2016 at 9:17 AM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>>
>>> On 09/14/2016 05:29 PM, Robert Haas wrote:
>
> ...
>>>>
>>>> Sure, but you're testing at *really* high client counts here.
>>>> Almost nobody is going to benefit from a 5% improvement at 256
>>>> clients. You need to test 64 clients and 32 clients and 16
>>>> clients and 8 clients and see what happens there. Those cases are
>>>> a lot more likely than these stratospheric client counts.
>>>>
>>>
>>> Right. My impression from the discussion so far is that the patches
>>> only improve performance with very many concurrent clients - but as
>>> Robert points out, almost no one is running with 256 active
>>> clients, unless they have 128 cores or so. At least not if they
>>> value latency more than throughput.
>>>
>>
>> See, I am also not in favor of going with any of these patches, if
>> they doesn't help in reduction of contention. However, I think it is
>> important to understand, under what kind of workload and which
>> environment it can show the benefit or regression whichever is
>> applicable.
>
>
> Sure. Which is why I initially asked what type of workload should I be
> testing, and then done the testing with multiple savepoints as that's what
> you suggested. But apparently that's not a workload that could benefit from
> this patch, so I'm a bit confused.
>
>> Just FYI, couple of days back one of EDB's partner who was doing the
>> performance tests by using HammerDB (which is again OLTP focussed
>> workload) on 9.5 based code has found that CLogControlLock has the
>> significantly high contention. They were using synchronous_commit=off
>> in their settings. Now, it is quite possible that with improvements
>> done in 9.6, the contention they are seeing will be eliminated, but
>> we have yet to figure that out. I just shared this information to you
>> with the intention that this seems to be a real problem and we should
>> try to work on it unless we are able to convince ourselves that this
>> is not a problem.
>>
>
> So, can we approach the problem from this direction instead? That is,
> instead of looking for workloads that might benefit from the patches, look
> at real-world examples of CLOG lock contention and then evaluate the impact
> on those?
>

Sure, we can go that way as well, but I thought instead of testing
with a new benchmark kit (HammerDB), it is better to first start with
some simple statements.

> Extracting the workload from benchmarks probably is not ideal, but it's
> still better than constructing the workload on our own to fit the patch.
>
> FWIW I'll do a simple pgbench test - first with synchronous_commit=on and
> then with synchronous_commit=off. Probably the workloads we should have
> started with anyway, I guess.
>

Here, the synchronous_commit = off case could be interesting.  Do you
see any problem with first trying a workload where Dilip is seeing a
benefit?  I am not suggesting we should not do any other testing, but
let's first try to reproduce the performance gain which is seen in
Dilip's tests.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 09/18/2016 06:08 AM, Amit Kapila wrote:
> On Sat, Sep 17, 2016 at 11:25 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> On 09/17/2016 07:05 AM, Amit Kapila wrote:
>>>
>>> On Sat, Sep 17, 2016 at 9:17 AM, Tomas Vondra
>>> <tomas.vondra@2ndquadrant.com> wrote:
>>>>
>>>> On 09/14/2016 05:29 PM, Robert Haas wrote:
>>
>> ...
>>>>>
>>>>> Sure, but you're testing at *really* high client counts here.
>>>>> Almost nobody is going to benefit from a 5% improvement at 256
>>>>> clients. You need to test 64 clients and 32 clients and 16
>>>>> clients and 8 clients and see what happens there. Those cases are
>>>>> a lot more likely than these stratospheric client counts.
>>>>>
>>>>
>>>> Right. My impression from the discussion so far is that the patches
>>>> only improve performance with very many concurrent clients - but as
>>>> Robert points out, almost no one is running with 256 active
>>>> clients, unless they have 128 cores or so. At least not if they
>>>> value latency more than throughput.
>>>>
>>>
>>> See, I am also not in favor of going with any of these patches, if
>>> they doesn't help in reduction of contention. However, I think it is
>>> important to understand, under what kind of workload and which
>>> environment it can show the benefit or regression whichever is
>>> applicable.
>>
>>
>> Sure. Which is why I initially asked what type of workload should I be
>> testing, and then done the testing with multiple savepoints as that's what
>> you suggested. But apparently that's not a workload that could benefit from
>> this patch, so I'm a bit confused.
>>
>>> Just FYI, couple of days back one of EDB's partner who was doing the
>>> performance tests by using HammerDB (which is again OLTP focussed
>>> workload) on 9.5 based code has found that CLogControlLock has the
>>> significantly high contention. They were using synchronous_commit=off
>>> in their settings. Now, it is quite possible that with improvements
>>> done in 9.6, the contention they are seeing will be eliminated, but
>>> we have yet to figure that out. I just shared this information to you
>>> with the intention that this seems to be a real problem and we should
>>> try to work on it unless we are able to convince ourselves that this
>>> is not a problem.
>>>
>>
>> So, can we approach the problem from this direction instead? That is,
>> instead of looking for workloads that might benefit from the patches, look
>> at real-world examples of CLOG lock contention and then evaluate the impact
>> on those?
>>
>
> Sure, we can go that way as well, but I thought instead of testing
> with a new benchmark kit (HammerDB), it is better to first get with
> some simple statements.
>

IMHO in the ideal case the first message in this thread would provide a 
test case, demonstrating the effect of the patch. Then we wouldn't have 
the issue of looking for a good workload two years later.

But now that I look at the first post, I see it apparently used a plain 
tpc-b pgbench (with synchronous_commit=on) to show the benefits, which 
is the workload I'm running right now (results sometime tomorrow).

That workload clearly uses no savepoints at all, so I'm wondering why 
you suggested using several of them - I know you said that it's to show 
differences between the approaches, but why should that matter to any of 
the patches (and if it matters, why did I get almost no differences in 
the benchmarks)?

Pardon my ignorance, CLOG is not my area of expertise ...

>> Extracting the workload from benchmarks probably is not ideal, but
>> it's still better than constructing the workload on our own to fit
>> the patch.
>>
>> FWIW I'll do a simple pgbench test - first with
>> synchronous_commit=on and then with synchronous_commit=off.
>> Probably the workloads we should have started with anyway, I
>> guess.
>>
>
> Here, synchronous_commit = off case could be interesting. Do you see
> any problem with first trying a workload where Dilip is seeing
> benefit? I am not suggesting we should not do any other testing, but
> just first lets try to reproduce the performance gain which is seen
> in Dilip's tests.
>

I plan to run Dilip's workload once the current benchmarks complete.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Dilip Kumar
Date:
On Mon, Sep 19, 2016 at 2:41 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> But now that I look at the first post, I see it apparently used a plain
> tpc-b pgbench (with synchronous_commit=on) to show the benefits, which is
> the workload I'm running right now (results sometime tomorrow).

Good option, we can test plain TPC-B also.

I have some more results: I have got the results for "update with no
savepoints".

Below is my script:

\set aid random(1, 30000000)
\set tid random(1, 3000)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
END;


Results (median of three, 10-minute runs):

Clients   Head     GroupLock
16          21452    21589
32          42422    42688
64          42460    52590           ~ 23%
128        22683    56825           ~150%
256       18748     54867

With this workload I observed that the gain is bigger than with my
previous workload (select for update with 2 savepoints).

Just to confirm whether the gain we are seeing is because of removing
the CLog lock contention or because of something else, I ran 128
clients with perf for 5 minutes; below is my result.

I can see that after applying the group lock patch, LWLockAcquire
drops from 28% to just 4%, and that is mostly because of the CLog
lock.

On Head:
------------
-   28.45%     0.24%  postgres  postgres   [.] LWLockAcquire
   - LWLockAcquire
      + 53.49% TransactionIdSetPageStatus
      + 40.83% SimpleLruReadPage_ReadOnly
      +  1.16% BufferAlloc
      +  0.92% GetSnapshotData
      +  0.89% GetNewTransactionId
      +  0.72% LockBuffer
      +  0.70% ProcArrayGroupClearXid


After Group Lock Patch:
-------------------------------
-    4.47%     0.26%  postgres  postgres   [.] LWLockAcquire
   - LWLockAcquire
      + 27.11% GetSnapshotData
      + 21.57% GetNewTransactionId
      + 11.44% SimpleLruReadPage_ReadOnly
      + 10.13% BufferAlloc
      +  7.24% ProcArrayGroupClearXid
      +  4.74% LockBuffer
      +  4.08% LockAcquireExtended
      +  2.91% TransactionGroupUpdateXidStatus
      +  2.71% LockReleaseAll
      +  1.90% WALInsertLockAcquire
      +  0.94% LockRelease
      +  0.91% VirtualXactLockTableInsert
      +  0.90% VirtualXactLockTableCleanup
      +  0.72% MultiXactIdSetOldestMember
      +  0.66% LockRefindAndRelease
 

Next I will test "update with 2 savepoints" and "select for update
with no savepoints".
I will also test the granular lock and atomic lock patches in the next run.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Sun, Sep 18, 2016 at 5:11 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> IMHO in the ideal case the first message in this thread would provide a test
> case, demonstrating the effect of the patch. Then we wouldn't have the issue
> of looking for a good workload two years later.
>
> But now that I look at the first post, I see it apparently used a plain
> tpc-b pgbench (with synchronous_commit=on) to show the benefits, which is
> the workload I'm running right now (results sometime tomorrow).
>
> That workload clearly uses no savepoints at all, so I'm wondering why you
> suggested to use several of them - I know you said that it's to show
> differences between the approaches, but why should that matter to any of the
> patches (and if it matters, why I got almost no differences in the
> benchmarks)?
>
> Pardon my ignorance, CLOG is not my area of expertise ...

It's possible that the effect of this patch depends on the number of
sockets.  EDB test machine cthulhu has 8 sockets, and power2 has 4
sockets.  I assume Dilip's tests were run on one of those two,
although he doesn't seem to have mentioned which one.  Your system is
probably 2 or 4 sockets, which might make a difference.  Results might
also depend on CPU architecture; power2 is, unsurprisingly, a POWER
system, whereas I assume you are testing x86.  Maybe somebody who has
access should test on hydra.pg.osuosl.org, which is a community POWER
resource.  (Send me a private email if you are a known community
member who wants access for benchmarking purposes.)

Personally, I find the results so far posted on this thread thoroughly
unimpressive.  I acknowledge that Dilip's results appear to show that
in a best-case scenario these patches produce a rather large gain.
However, that gain seems to happen in a completely contrived scenario:
astronomical client counts, unlogged tables, and a test script that
maximizes pressure on CLogControlLock.  If you have to work that hard
to find a big win, and tests under more reasonable conditions show no
benefit, it's not clear to me that it's really worth the time we're
all spending benchmarking and reviewing this, or the risk of bugs, or
the damage to the SLRU abstraction layer.  I think there's a very good
chance that we're better off moving on to projects that have a better
chance of helping in the real world.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Speed up Clog Access by increasing CLOG buffers

From
Andres Freund
Date:
On 2016-09-19 15:10:58 -0400, Robert Haas wrote:
> Personally, I find the results so far posted on this thread thoroughly
> unimpressive.  I acknowledge that Dilip's results appear to show that
> in a best-case scenario these patches produce a rather large gain.
> However, that gain seems to happen in a completely contrived scenario:
> astronomical client counts, unlogged tables, and a test script that
> maximizes pressure on CLogControlLock.  If you have to work that hard
> to find a big win, and tests under more reasonable conditions show no
> benefit, it's not clear to me that it's really worth the time we're
> all spending benchmarking and reviewing this, or the risk of bugs, or
> the damage to the SLRU abstraction layer.  I think there's a very good
> chance that we're better off moving on to projects that have a better
> chance of helping in the real world.

+1



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Tue, Sep 20, 2016 at 12:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Sep 18, 2016 at 5:11 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> IMHO in the ideal case the first message in this thread would provide a test
>> case, demonstrating the effect of the patch. Then we wouldn't have the issue
>> of looking for a good workload two years later.
>>
>> But now that I look at the first post, I see it apparently used a plain
>> tpc-b pgbench (with synchronous_commit=on) to show the benefits, which is
>> the workload I'm running right now (results sometime tomorrow).
>>
>> That workload clearly uses no savepoints at all, so I'm wondering why you
>> suggested to use several of them - I know you said that it's to show
>> differences between the approaches, but why should that matter to any of the
>> patches (and if it matters, why I got almost no differences in the
>> benchmarks)?
>>
>> Pardon my ignorance, CLOG is not my area of expertise ...
>
> It's possible that the effect of this patch depends on the number of
> sockets.  EDB test machine cthulhu as 8 sockets, and power2 has 4
> sockets.  I assume Dilip's tests were run on one of those two,
>

I think it is the former (the 8-socket machine).

> although he doesn't seem to have mentioned which one.  Your system is
> probably 2 or 4 sockets, which might make a difference.  Results might
> also depend on CPU architecture; power2 is, unsurprisingly, a POWER
> system, whereas I assume you are testing x86.  Maybe somebody who has
> access should test on hydra.pg.osuosl.org, which is a community POWER
> resource.  (Send me a private email if you are a known community
> member who wants access for benchmarking purposes.)
>
> Personally, I find the results so far posted on this thread thoroughly
> unimpressive.  I acknowledge that Dilip's results appear to show that
> in a best-case scenario these patches produce a rather large gain.
> However, that gain seems to happen in a completely contrived scenario:
> astronomical client counts, unlogged tables, and a test script that
> maximizes pressure on CLogControlLock.
>

You are right that the scenario is somewhat contrived, but he hasn't
posted the results for simple-update or tpc-b kinds of pgbench
scenarios, so we can't conclude that those won't show a benefit.  I
think we can see benefits with synchronous_commit=off as well, though
maybe not as big as with unlogged tables.  The other thing to keep in
mind is that reducing contention on one lock (in this case
CLOGControlLock) also gives benefits when we reduce contention on
other locks (like ProcArrayLock, WALWriteLock, ..).  Last time we
verified this effect with Andres's patch (cache the snapshot), which
reduces the remaining contention on ProcArrayLock.  We have seen that
individually that patch gives some benefit, but after removing the
contention on CLOGControlLock with the patches discussed in this
thread (increasing the clog buffers and the grouping stuff, each one
helps), it gives a much bigger benefit.

Your point about the high client count is valid, and I am sure the
patch won't give a noticeable benefit at lower client counts, as the
CLOGControlLock contention starts mattering only at high client
counts.  I am not sure it is a good idea to reject a patch which helps
in stabilising the performance (i.e. keeps it from falling off the
cliff) when the number of processes exceeds the number of cores (or
hardware threads).

>  If you have to work that hard
> to find a big win, and tests under more reasonable conditions show no
> benefit, it's not clear to me that it's really worth the time we're
> all spending benchmarking and reviewing this, or the risk of bugs, or
> the damage to the SLRU abstraction layer.

I agree with you that unless it shows a benefit in somewhat more
usual scenarios, we should not accept it.  So shouldn't we wait for
the results of other workloads like simple-update or tpc-b on bigger
machines before reaching a conclusion?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Dilip Kumar
Date:
On Tue, Sep 20, 2016 at 8:37 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think it is former (8 socket machine).

I confirm this is the 8-socket machine (cthulhu).
>

>
> You point related to high-client count is valid and I am sure it won't
> give noticeable benefit at lower client-count as the the
> CLOGControlLock contention starts impacting only at high-client count.
> I am not sure if it is good idea to reject a patch which helps in
> stabilising the performance (helps in falling off the cliff) when the
> processes increases the number of cores (or hardware threads)
>
>>  If you have to work that hard
>> to find a big win, and tests under more reasonable conditions show no
>> benefit, it's not clear to me that it's really worth the time we're
>> all spending benchmarking and reviewing this, or the risk of bugs, or
>> the damage to the SLRU abstraction layer.
>
> I agree with you unless it shows benefit on somewhat more usual
> scenario's, we should not accept it.  So shouldn't we wait for results
> of other workloads like simple-update or tpc-b on bigger machines
> before reaching to conclusion?

+1

My tests are under way, I will post the results soon.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
Hi,

On 09/19/2016 09:10 PM, Robert Haas wrote:
 >
> It's possible that the effect of this patch depends on the number of
> sockets. EDB test machine cthulhu as 8 sockets, and power2 has 4
> sockets. I assume Dilip's tests were run on one of those two,
> although he doesn't seem to have mentioned which one. Your system is
> probably 2 or 4 sockets, which might make a difference. Results
> might also depend on CPU architecture; power2 is, unsurprisingly, a
> POWER system, whereas I assume you are testing x86. Maybe somebody
> who has access should test on hydra.pg.osuosl.org, which is a
> community POWER resource. (Send me a private email if you are a known
> community member who wants access for benchmarking purposes.)
>

Yes, I'm using x86 machines:

1) large but slightly old
- 4 sockets, e5-4620 (so a bit old CPU, 32 cores in total)
- kernel 3.2.80

2) smaller but fresh
- 2 sockets, e5-2620 v4 (newest type of Xeons, 16 cores in total)
- kernel 4.8.0

> Personally, I find the results so far posted on this thread
> thoroughly unimpressive. I acknowledge that Dilip's results appear
> to show that in a best-case scenario these patches produce a rather
> large gain. However, that gain seems to happen in a completely
> contrived scenario: astronomical client counts, unlogged tables, and
> a test script that maximizes pressure on CLogControlLock. If you
> have to work that hard to find a big win, and tests under more
> reasonable conditions show no benefit, it's not clear to me that it's
> really worth the time we're all spending benchmarking and reviewing
> this, or the risk of bugs, or the damage to the SLRU abstraction
> layer. I think there's a very good chance that we're better off
> moving on to projects that have a better chance of helping in the
> real world.

I'm posting results from two types of workloads - traditional r/w
pgbench and Dilip's transaction - each with synchronous_commit on/off.

Full results (including script driving the benchmark) are available
here, if needed:

     https://bitbucket.org/tvondra/group-clog-benchmark/src

It'd be good if someone could try to reproduce this on a comparable
machine, to rule out my stupidity.


2 x e5-2620 v4 (16 cores, 32 with HT)
=====================================

On the "smaller" machine the results look like this - I have only tested
up to 64 clients, as higher values seem rather uninteresting on a
machine with only 16 physical cores.

These are averages of 5 runs, where the min/max for each group are
within ~5% in most cases (see the "spread" sheet). The "e5-2620" sheet
also shows the numbers as % compared to master.


  dilip / sync=off      1        4        8       16       32       64
----------------------------------------------------------------------
  master             4756    17672    35542    57303    74596    82138
  granular-locking   4745    17728    35078    56105    72983    77858
  no-content-lock    4646    17650    34887    55794    73273    79000
  group-update       4582    17757    35383    56974    74387    81794

  dilip / sync=on       1        4        8       16       32       64
----------------------------------------------------------------------
  master             4819    17583    35636    57437    74620    82036
  granular-locking   4568    17816    35122    56168    73192    78462
  no-content-lock    4540    17662    34747    55560    73508    79320
  group-update       4495    17612    35474    57095    74409    81874

  pgbench / sync=off    1        4        8       16       32       64
----------------------------------------------------------------------
  master             3791    14368    27806    43369    54472    62956
  granular-locking   3822    14462    27597    43173    56391    64669
  no-content-lock    3725    14212    27471    43041    55431    63589
  group-update       3895    14453    27574    43405    56783    62406

  pgbench / sync=on     1        4        8       16       32       64
----------------------------------------------------------------------
  master             3907    14289    27802    43717    56902    62916
  granular-locking   3770    14503    27636    44107    55205    63903
  no-content-lock    3772    14111    27388    43054    56424    64386
  group-update       3844    14334    27452    43621    55896    62498

There's pretty much no improvement at all - most of the results are
within 1-2% of master, in both directions. Hardly a win.

Actually, with 1 client there seems to be ~5% regression, but it might
also be noise and verifying it would require further testing.


4 x e5-4620 v1 (32 cores, 64 with HT)
=====================================

These are averages of 10 runs, and there are a few strange things here.

Firstly, for Dilip's workload the results get much (much) worse between
64 and 128 clients, for some reason. I suspect this might be due to
fairly old kernel (3.2.80), so I'll reboot the machine with 4.5.x kernel
and try again.

Secondly, the min/max differences get much larger than the ~5% on the
smaller machine - with 128 clients, the (max-min)/average is often
above 100%. See the "spread" or "spread2" sheets in the attached file.

But for some reason this only affects Dilip's workload, and apparently
the patches make it measurably worse (master is ~75%, patches ~120%). If
you look at the tps for individual runs, there are usually 9 runs with
almost the same performance, and then one or two much faster runs. Again,
pgbench seems not to have this issue.

I have no idea what's causing this - it might be related to the kernel,
but I'm not sure why it should affect the patches differently. Let's see
how the new kernel affects this.

  dilip / sync=off       16       32       64      128     192
--------------------------------------------------------------
  master              26198    37901    37211    14441    8315
  granular-locking    25829    38395    40626    14299    8160
  no-content-lock     25872    38994    41053    14058    8169
  group-update        26503    38911    42993    19474    8325

  dilip / sync=on        16       32       64      128     192
--------------------------------------------------------------
  master              26138    37790    38492    13653    8337
  granular-locking    25661    38586    40692    14535    8311
  no-content-lock     25653    39059    41169    14370    8373
  group-update        26472    39170    42126    18923    8366

  pgbench / sync=off     16       32       64      128     192
--------------------------------------------------------------
  master              23001    35762    41202    31789    8005
  granular-locking    23218    36130    42535    45850    8701
  no-content-lock     23322    36553    42772    47394    8204
  group-update        23129    36177    41788    46419    8163

  pgbench / sync=on      16       32       64      128     192
--------------------------------------------------------------
  master              22904    36077    41295    35574    8297
  granular-locking    23323    36254    42446    43909    8959
  no-content-lock     23304    36670    42606    48440    8813
  group-update        23127    36696    41859    46693    8345


So there is some improvement due to the patches for 128 clients (+30% in
some cases), but it's rather useless, as 64 clients either give you
comparable performance (pgbench workload) or a way better one (Dilip's
workload).

Also, pretty much no difference between synchronous_commit on/off,
probably thanks to running on unlogged tables.

I'll repeat the test on the 4-socket machine with a newer kernel, but
that's probably the last benchmark I'll do for this patch for now. I
agree with Robert that the cases the patch is supposed to improve are a
bit contrived because of the very high client counts.

IMHO to continue with the patch (or even with testing it), we really
need a credible / practical example of a real-world workload that
benefits from the patches. The closest we have to that is Amit's
report that someone hit the commit lock when running HammerDB, but we
have absolutely no idea what parameters they were using, except that
they were running with synchronous_commit=off. Pgbench shows no such
improvements (at least for me), at least with reasonable parameters.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Speed up Clog Access by increasing CLOG buffers

From
Dilip Kumar
Date:
On Tue, Sep 20, 2016 at 9:15 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> +1
>
> My test are under run, I will post it soon..

I have some more results now.

8-socket machine
10-minute runs (median of 3 runs)
synchronous_commit = off
scale factor = 300
shared_buffers = 8GB

test1: Simple update(pgbench)

Clients         Head           GroupLock
32              45702          45402
64              46974          51627
128             35056          55362


test2: TPCB (pgbench)

Clients         Head           GroupLock
32              27969          27765
64              33140          34786
128             21555          38848

Summary:
--------------
At 32 clients there is no gain; I think at this workload the CLog lock
is not a problem.
At 64 clients we can see a ~10% gain with simple update and ~5% with TPC-B.
At 128 clients we can see a >50% gain.

Currently I have tested with synchronous_commit=off; later I can try
with it on.  I can also test at 80 clients - I think we will see some
significant gain at this client count as well, but I haven't tested it
yet.

With the above results, what do we think?  Should we continue our testing?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Wed, Sep 21, 2016 at 3:48 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> I have no idea what's causing this - it might be related to the kernel, but
> I'm not sure why it should affect the patches differently. Let's see how the
> new kernel affects this.
>
>  dilip / sync=off       16       32       64      128     192
> --------------------------------------------------------------
>  master              26198    37901    37211    14441    8315
>  granular-locking    25829    38395    40626    14299    8160
>  no-content-lock     25872    38994    41053    14058    8169
>  group-update        26503    38911    42993    19474    8325
>
>  dilip / sync=on        16       32       64      128     192
> --------------------------------------------------------------
>  master              26138    37790    38492    13653    8337
>  granular-locking    25661    38586    40692    14535    8311
>  no-content-lock     25653    39059    41169    14370    8373
>  group-update        26472    39170    42126    18923    8366
>
>  pgbench / sync=off     16       32       64      128     192
> --------------------------------------------------------------
>  master              23001    35762    41202    31789    8005
>  granular-locking    23218    36130    42535    45850    8701
>  no-content-lock     23322    36553    42772    47394    8204
>  group-update        23129    36177    41788    46419    8163
>
>  pgbench / sync=on      16       32       64      128     192
> --------------------------------------------------------------
>  master              22904    36077    41295    35574    8297
>  granular-locking    23323    36254    42446    43909    8959
>  no-content-lock     23304    36670    42606    48440    8813
>  group-update        23127    36696    41859    46693    8345
>
>
> So there is some improvement due to the patches for 128 clients (+30% in
> some cases), but it's rather useless as 64 clients either give you
> comparable performance (pgbench workload) or way better one (Dilip's
> workload).
>

I think these results are somewhat similar to what Dilip has reported.
Here, in both cases, the performance improvement is seen when the
client count is greater than the number of cores (including HT).  As
far as I know, the machine on which Dilip is running the tests also
has 64 hardware threads.  The point here is that the CLOGControlLock
contention is noticeable only at that client count, so it is not the
fault of the patch that it is not improving things at lower client
counts.  I guess we will see a performance improvement between 64 and
128 client counts as well.


> Also, pretty much no difference between synchronous_commit on/off, probably
> thanks to running on unlogged tables.
>

Yeah.

> I'll repeat the test on the 4-socket machine with a newer kernel, but that's
> probably the last benchmark I'll do for this patch for now.
>

Okay, but I think it is better to see the results between 64 and 128
client counts, and maybe at greater than 128 client counts, because it
is clear that the patch won't improve performance below that.

> I agree with
> Robert that the cases the patch is supposed to improve are a bit contrived
> because of the very high client counts.
>

No issues; I have already explained, in yesterday's mail and this one,
why I think it is important to reduce the remaining CLOGControlLock
contention.  If none of you is convinced, then I think we have no
choice but to drop this patch.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 09/21/2016 08:04 AM, Amit Kapila wrote:
> On Wed, Sep 21, 2016 at 3:48 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
...
>
>> I'll repeat the test on the 4-socket machine with a newer kernel,
>> but that's probably the last benchmark I'll do for this patch for
>> now.
>>

Attached are results from benchmarks running on kernel 4.5 (instead of
the old 3.2.80). I've only done synchronous_commit=on, and I've added a
few client counts (mostly at the lower end). The data are pushed to
the git repository, see

     git push --set-upstream origin master

The summary looks like this (showing both the 3.2.80 and 4.5.5 results):

1) Dilip's workload

  3.2.80                             16     32     64    128    192
-------------------------------------------------------------------
  master                          26138  37790  38492  13653   8337
  granular-locking                25661  38586  40692  14535   8311
  no-content-lock                 25653  39059  41169  14370   8373
  group-update                    26472  39170  42126  18923   8366

  4.5.5                 1      8     16     32     64    128    192
-------------------------------------------------------------------
  granular-locking   4050  23048  27969  32076  34874  36555  37710
  no-content-lock    4025  23166  28430  33032  35214  37576  39191
  group-update       4002  23037  28008  32492  35161  36836  38850
  master             3968  22883  27437  32217  34823  36668  38073


2) pgbench

  3.2.80                             16     32     64    128    192
-------------------------------------------------------------------
  master                          22904  36077  41295  35574   8297
  granular-locking                23323  36254  42446  43909   8959
  no-content-lock                 23304  36670  42606  48440   8813
  group-update                    23127  36696  41859  46693   8345

  4.5.5                 1      8     16     32     64    128    192
-------------------------------------------------------------------
  granular-locking   3116  19235  27388  29150  31905  34105  36359
  no-content-lock    3206  19071  27492  29178  32009  34140  36321
  group-update       3195  19104  26888  29236  32140  33953  35901
  master             3136  18650  26249  28731  31515  33328  35243


The 4.5 kernel clearly changed the results significantly:

(a) Compared to the results from 3.2.80 kernel, some numbers improved,
some got worse. For example, on 3.2.80 pgbench did ~23k tps with 16
clients, on 4.5.5 it does 27k tps. With 64 clients the performance
dropped from 41k tps to ~34k (on master).

(b) The drop above 64 clients is gone - on 3.2.80 it dropped very
quickly to only ~8k with 192 clients. On 4.5 the tps actually continues
to increase, and we get ~35k with 192 clients.

(c) Although it's not visible in the results, 4.5.5 almost perfectly
eliminated the fluctuations in the results. For example when 3.2.80
produced this results (10 runs with the same parameters):

     12118 11610 27939 11771 18065
     12152 14375 10983 13614 11077

we get this on 4.5.5

     37354 37650 37371 37190 37233
     38498 37166 36862 37928 38509

Notice how much more even the 4.5.5 results are, compared to 3.2.80.

(d) There's no sign of any benefit from any of the patches (it was only
helpful >= 128 clients, but that's where the tps actually dropped on
3.2.80 - apparently 4.5.5 fixes that and the benefit is gone).

It's a bit annoying that after upgrading from 3.2.80 to 4.5.5, the
performance with 32 and 64 clients dropped quite noticeably (by more
than 10%). I believe that might be a kernel regression, but perhaps it's
a price for improved scalability for higher client counts.

It of course begs the question of what kernel version is running on the
machine used by Dilip (i.e. cthulhu). It's a Power machine, though, so
I'm not sure how much the kernel matters on it.

I'll ask someone else with access to this particular machine to repeat
the tests, as I have a nagging suspicion that I've missed something
important when compiling / running the benchmarks. I'll also retry the
benchmarks on 3.2.80 to see if I get the same numbers.

>
> Okay, but I think it is better to see the results between 64~128
> client count and may be greater than128 client counts, because it is
> clear that patch won't improve performance below that.
>

There are results for 64, 128 and 192 clients. Why should we care about
numbers in between? How likely (and useful) would it be to get
improvement with 96 clients, but no improvement for 64 or 128 clients?

 >>
>> I agree with Robert that the cases the patch is supposed to
>> improve are a bit contrived because of the very high client
>> counts.
>>
>
> No issues, I have already explained why I think it is important to
> reduce the remaining CLOGControlLock contention in yesterday's and
> this mail. If none of you is convinced, then I think we have no
> choice but to drop this patch.
>

I agree it's useful to reduce lock contention in general, but
considering the last set of benchmarks shows no benefit with recent
kernel, I think we really need a better understanding of what's going
on, what workloads / systems it's supposed to improve, etc.

I don't dare to suggest rejecting the patch, but I don't see how we
could commit any of the patches at this point. So perhaps "returned with
feedback" and resubmitting in the next CF (along with analysis of
improved workloads) would be appropriate.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Thu, Sep 22, 2016 at 7:44 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I don't dare to suggest rejecting the patch, but I don't see how we could
> commit any of the patches at this point. So perhaps "returned with feedback"
> and resubmitting in the next CF (along with analysis of improved workloads)
> would be appropriate.

I think it would be useful to have some kind of theoretical analysis
of how much time we're spending waiting for various locks.  So, for
example, suppose we do one run of these tests with various client counts
- say, 1, 8, 16, 32, 64, 96, 128, 192, 256 - and we run "select
wait_event from pg_stat_activity" once per second throughout the test.
Then we see how many times we get each wait event, including NULL (no
wait event).  Now, from this, we can compute the approximate
percentage of time we're spending waiting on CLogControlLock and every
other lock, too, as well as the percentage of time we're not waiting
for lock.  That, it seems to me, would give us a pretty clear idea
what the maximum benefit we could hope for from reducing contention on
any given lock might be.
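
A rough sketch of that kind of sampling (the wait_samples table name
is arbitrary; wait_event shows up in pg_stat_activity from 9.6 on):

    -- one-time setup: a scratch table to collect the samples
    CREATE TABLE wait_samples (ts timestamptz, wait_event text);

    -- run this from a separate psql session once per second for the
    -- duration of the test, e.g. with \watch 1
    INSERT INTO wait_samples
    SELECT now(), coalesce(wait_event, '<not waiting>')
      FROM pg_stat_activity
     WHERE pid <> pg_backend_pid();

    -- afterwards: approximate share of time spent on each wait event
    SELECT wait_event,
           count(*) AS samples,
           round(100.0 * count(*) / sum(count(*)) OVER (), 1) AS pct
      FROM wait_samples
     GROUP BY wait_event
     ORDER BY samples DESC;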

Now, we could also try that experiment with various patches.  If we
can show that some patch reduces CLogControlLock contention without
increasing TPS, it might still be worth committing for that reason.
Otherwise, you could have a chicken-and-egg problem.  If reducing
contention on A doesn't help TPS because of lock B and vice versa,
then does that mean we can never commit any patch to reduce contention
on either lock?  Hopefully not.  But I agree with you that there's
certainly not enough evidence to commit any of these patches now.  To
my mind, these numbers aren't convincing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 09/23/2016 03:20 AM, Robert Haas wrote:
> On Thu, Sep 22, 2016 at 7:44 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> I don't dare to suggest rejecting the patch, but I don't see how
>> we could commit any of the patches at this point. So perhaps
>> "returned with feedback" and resubmitting in the next CF (along
>> with analysis of improvedworkloads) would be appropriate.
>
> I think it would be useful to have some kind of theoretical analysis
> of how much time we're spending waiting for various locks. So, for
> example, suppose we one run of these tests with various client
> counts - say, 1, 8, 16, 32, 64, 96, 128, 192, 256 - and we run
> "select wait_event from pg_stat_activity" once per second throughout
> the test. Then we see how many times we get each wait event,
> including NULL (no wait event). Now, from this, we can compute the
> approximate percentage of time we're spending waiting on
> CLogControlLock and every other lock, too, as well as the percentage
> of time we're not waiting for lock. That, it seems to me, would give
> us a pretty clear idea what the maximum benefit we could hope for
> from reducing contention on any given lock might be.
>

Yeah, I think that might be a good way to analyze the locks in general, 
not just for these patches. A 24h run with per-second samples should give 
us about 86400 samples (well, multiplied by the number of clients), which 
is probably good enough.

We also have LWLOCK_STATS, which might be interesting too, but I'm not 
sure how much it affects the behavior (and AFAIK it also only dumps the 
data to the server log).
>
> Now, we could also try that experiment with various patches. If we
> can show that some patch reduces CLogControlLock contention without
> increasing TPS, they might still be worth committing for that
> reason. Otherwise, you could have a chicken-and-egg problem. If
> reducing contention on A doesn't help TPS because of lock B and
> visca-versa, then does that mean we can never commit any patch to
> reduce contention on either lock? Hopefully not. But I agree with you
> that there's certainly not enough evidence to commit any of these
> patches now. To my mind, these numbers aren't convincing.
>

Yes, the chicken-and-egg problem is why the tests were done with 
unlogged tables (to work around the WAL lock).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Fri, Sep 23, 2016 at 7:17 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 09/23/2016 03:20 AM, Robert Haas wrote:
>>
>> On Thu, Sep 22, 2016 at 7:44 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>>
>>> I don't dare to suggest rejecting the patch, but I don't see how
>>> we could commit any of the patches at this point. So perhaps
>>> "returned with feedback" and resubmitting in the next CF (along
>>> with analysis of improvedworkloads) would be appropriate.
>>
>>
>> I think it would be useful to have some kind of theoretical analysis
>> of how much time we're spending waiting for various locks. So, for
>> example, suppose we one run of these tests with various client
>> counts - say, 1, 8, 16, 32, 64, 96, 128, 192, 256 - and we run
>> "select wait_event from pg_stat_activity" once per second throughout
>> the test. Then we see how many times we get each wait event,
>> including NULL (no wait event). Now, from this, we can compute the
>> approximate percentage of time we're spending waiting on
>> CLogControlLock and every other lock, too, as well as the percentage
>> of time we're not waiting for lock. That, it seems to me, would give
>> us a pretty clear idea what the maximum benefit we could hope for
>> from reducing contention on any given lock might be.
>>
>
> Yeah, I think that might be a good way to analyze the locks in general, not
> just got these patches. 24h run with per-second samples should give us about
> 86400 samples (well, multiplied by number of clients), which is probably
> good enough.
>
> We also have LWLOCK_STATS, that might be interesting too, but I'm not sure
> how much it affects the behavior (and AFAIK it also only dumps the data to
> the server log).
>

Right, I think LWLOCK_STATS gives us the count of how many times we
have blocked on a particular lock, like below, where *blk* gives that
number:

PID 164692 lwlock main 11: shacq 2734189 exacq 146304 blk 73808
spindelay 73 dequeue self 57241

I think doing some experiments with both techniques can help us to
take a call on these patches.

Do we want these experiments on different kernel versions, are we
okay with the current version on cthulhu (3.10), or do we want to
consider only the results with the latest kernel?

>>
>>
>> Now, we could also try that experiment with various patches. If we
>> can show that some patch reduces CLogControlLock contention without
>> increasing TPS, they might still be worth committing for that
>> reason. Otherwise, you could have a chicken-and-egg problem. If
>> reducing contention on A doesn't help TPS because of lock B and
>> visca-versa, then does that mean we can never commit any patch to
>> reduce contention on either lock? Hopefully not. But I agree with you
>> that there's certainly not enough evidence to commit any of these
>> patches now. To my mind, these numbers aren't convincing.
>>
>
> Yes, the chicken-and-egg problem is why the tests were done with unlogged
> tables (to work around the WAL lock).
>

Yeah, but I suspect there was still an impact due to ProcArrayLock.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Fri, Sep 23, 2016 at 5:14 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 09/21/2016 08:04 AM, Amit Kapila wrote:
>>
>
> (c) Although it's not visible in the results, 4.5.5 almost perfectly
> eliminated the fluctuations in the results. For example when 3.2.80 produced
> this results (10 runs with the same parameters):
>
>     12118 11610 27939 11771 18065
>     12152 14375 10983 13614 11077
>
> we get this on 4.5.5
>
>     37354 37650 37371 37190 37233
>     38498 37166 36862 37928 38509
>
> Notice how much more even the 4.5.5 results are, compared to 3.2.80.
>

How long was each run?  Generally, I do half-hour runs to get stable results.

> (d) There's no sign of any benefit from any of the patches (it was only
> helpful >= 128 clients, but that's where the tps actually dropped on 3.2.80
> - apparently 4.5.5 fixes that and the benefit is gone).
>
> It's a bit annoying that after upgrading from 3.2.80 to 4.5.5, the
> performance with 32 and 64 clients dropped quite noticeably (by more than
> 10%). I believe that might be a kernel regression, but perhaps it's a price
> for improved scalability for higher client counts.
>
> It of course begs the question what kernel version is running on the machine
> used by Dilip (i.e. cthulhu)? Although it's a Power machine, so I'm not sure
> how much the kernel matters on it.
>

cthulhu is an x86 machine and the kernel version is 3.10.  Seeing the
above results, I think the kernel version does matter, but does that
mean we should ignore the benefits we are seeing on a somewhat older
kernel version?  I think the right answer here is to do some
experiments which can show the actual contention, as suggested by
Robert and you.

> I'll ask someone else with access to this particular machine to repeat the
> tests, as I have a nagging suspicion that I've missed something important
> when compiling / running the benchmarks. I'll also retry the benchmarks on
> 3.2.80 to see if I get the same numbers.
>
>>
>> Okay, but I think it is better to see the results between 64~128
>> client count and may be greater than128 client counts, because it is
>> clear that patch won't improve performance below that.
>>
>
> There are results for 64, 128 and 192 clients. Why should we care about
> numbers in between? How likely (and useful) would it be to get improvement
> with 96 clients, but no improvement for 64 or 128 clients?
>

The only point was to see where we start seeing an improvement -
saying that the TPS has improved from >=72 clients is different from
saying that it has improved from >=128.

>> No issues, I have already explained why I think it is important to
>> reduce the remaining CLOGControlLock contention in yesterday's and
>> this mail. If none of you is convinced, then I think we have no
>> choice but to drop this patch.
>>
>
> I agree it's useful to reduce lock contention in general, but considering
> the last set of benchmarks shows no benefit with recent kernel, I think we
> really need a better understanding of what's going on, what workloads /
> systems it's supposed to improve, etc.
>
> I don't dare to suggest rejecting the patch, but I don't see how we could
> commit any of the patches at this point. So perhaps "returned with feedback"
> and resubmitting in the next CF (along with analysis of improved workloads)
> would be appropriate.
>

Agreed with your conclusion; I have changed the status of the patch in the CF accordingly.

Many thanks for doing the tests.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 09/23/2016 05:10 AM, Amit Kapila wrote:
> On Fri, Sep 23, 2016 at 5:14 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> On 09/21/2016 08:04 AM, Amit Kapila wrote:
>>>
>>
>> (c) Although it's not visible in the results, 4.5.5 almost perfectly
>> eliminated the fluctuations in the results. For example when 3.2.80 produced
>> this results (10 runs with the same parameters):
>>
>>     12118 11610 27939 11771 18065
>>     12152 14375 10983 13614 11077
>>
>> we get this on 4.5.5
>>
>>     37354 37650 37371 37190 37233
>>     38498 37166 36862 37928 38509
>>
>> Notice how much more even the 4.5.5 results are, compared to 3.2.80.
>>
>
> how long each run was?  Generally, I do half-hour run to get stable results.
>

10 x 5-minute runs for each client count. The full shell script driving 
the benchmark is here: http://bit.ly/2doY6ID and in short it looks like 
this:

    for r in `seq 1 $runs`; do
        for c in 1 8 16 32 64 128 192; do
            psql -c checkpoint
            pgbench -j 8 -c $c ...
        done
    done
 

>>
>> It of course begs the question what kernel version is running on
>> the machine used by Dilip (i.e. cthulhu)? Although it's a Power
>> machine, so I'm not sure how much the kernel matters on it.
>>
>
> cthulhu is a x86 m/c and the kernel version is 3.10. Seeing, the
> above results I think kernel version do matter, but does that mean
> we ignore the benefits we are seeing on somewhat older kernel
> version. I think right answer here is to do some experiments which
> can show the actual contention as suggested by Robert and you.
>

Yes, I think it'd be useful to test a new kernel version. Perhaps try 
4.5.x so that we can compare it to my results. Maybe even try using my 
shell script.

>>
>> There are results for 64, 128 and 192 clients. Why should we care
>> about numbers in between? How likely (and useful) would it be to
>> get improvement with 96 clients, but no improvement for 64 or 128
>> clients?>>
>
> The only point to take was to see from where we have started seeing
> improvement, saying that the TPS has improved from >=72 client count
> is different from saying that it has improved from >=128.
>

I find the exact client count rather uninteresting - it's going to be 
quite dependent on hardware, workload etc.
>>
>> I don't dare to suggest rejecting the patch, but I don't see how
>> we could commit any of the patches at this point. So perhaps
>> "returned with feedback" and resubmitting in the next CF (along
>> with analysis of improvedworkloads) would be appropriate.
>>
>
> Agreed with your conclusion and changed the status of patch in CF
> accordingly.
>

+1


-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 09/23/2016 01:44 AM, Tomas Vondra wrote:
>...
> The 4.5 kernel clearly changed the results significantly:
>
...>
> (c) Although it's not visible in the results, 4.5.5 almost perfectly
> eliminated the fluctuations in the results. For example when 3.2.80
> produced this results (10 runs with the same parameters):
>
>     12118 11610 27939 11771 18065
>     12152 14375 10983 13614 11077
>
> we get this on 4.5.5
>
>     37354 37650 37371 37190 37233
>     38498 37166 36862 37928 38509
>
> Notice how much more even the 4.5.5 results are, compared to 3.2.80.
>

The more I think about these random spikes in pgbench performance on 
3.2.80, the more I find them intriguing. Let me show you another example 
(from Dilip's workload and group-update patch on 64 clients).

This is on 3.2.80:
  44175  34619  51944  38384  49066  37004  47242  36296  46353  36180

and on 4.5.5 it looks like this:
  34400  35559  35436  34890  34626  35233  35756  34876  35347  35486

So the 4.5.5 results are much more even, but overall clearly below 
3.2.80. How does 3.2.80 manage to do ~50k tps in some of the runs? 
Clearly we randomly do something right, but what is it and why doesn't 
it happen on the new kernel? And how could we do it every time?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Pavan Deolasee
Date:


On Fri, Sep 23, 2016 at 6:05 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 09/23/2016 05:10 AM, Amit Kapila wrote:
On Fri, Sep 23, 2016 at 5:14 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
On 09/21/2016 08:04 AM, Amit Kapila wrote:


(c) Although it's not visible in the results, 4.5.5 almost perfectly
eliminated the fluctuations in the results. For example when 3.2.80 produced
this results (10 runs with the same parameters):

    12118 11610 27939 11771 18065
    12152 14375 10983 13614 11077

we get this on 4.5.5

    37354 37650 37371 37190 37233
    38498 37166 36862 37928 38509

Notice how much more even the 4.5.5 results are, compared to 3.2.80.


how long each run was?  Generally, I do half-hour run to get stable results.


10 x 5-minute runs for each client count. The full shell script driving the benchmark is here: http://bit.ly/2doY6ID and in short it looks like this:

    for r in `seq 1 $runs`; do
        for c in 1 8 16 32 64 128 192; do
            psql -c checkpoint
            pgbench -j 8 -c $c ...
        done
    done


I see a couple of problems with the tests:

1. You're running regular pgbench, which also updates the small tables. At scale 300 and higher client counts, there is going to be heavy contention on the pgbench_branches table. Why not test with pgbench -N? As far as this patch is concerned, we are only interested in seeing contention on ClogControlLock. In fact, how about a test which only consumes an XID, but does not do any write activity at all? It's a completely artificial workload, but good enough to tell us if and how much the patch helps in the best case. We can probably do that with a simple txid_current() call, right? (A sketch of such a script is below, after point 2.)

2. Each subsequent pgbench run will bloat the tables. Now that may not be such a big deal given that you're checkpointing between each run. But it still makes results somewhat hard to compare. If a vacuum kicks in, that may have some impact too. Given the scale factor you're testing, why not just start fresh every time?
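
To be concrete, the XID-only workload from point 1 could be driven with a custom pgbench script, something along these lines (just a sketch; file name and run options are illustrative):

    # create a script that only consumes an XID per transaction
    echo "SELECT txid_current();" > xid-only.sql

    # run it against the existing pgbench database (-n skips the vacuum step)
    pgbench -n -M prepared -f xid-only.sql -j 8 -c 64 -T 300 pgbench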

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Fri, Sep 23, 2016 at 6:16 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 09/23/2016 01:44 AM, Tomas Vondra wrote:
>>
>> ...
>> The 4.5 kernel clearly changed the results significantly:
>>
> ...
>>
>>
>> (c) Although it's not visible in the results, 4.5.5 almost perfectly
>> eliminated the fluctuations in the results. For example when 3.2.80
>> produced this results (10 runs with the same parameters):
>>
>>     12118 11610 27939 11771 18065
>>     12152 14375 10983 13614 11077
>>
>> we get this on 4.5.5
>>
>>     37354 37650 37371 37190 37233
>>     38498 37166 36862 37928 38509
>>
>> Notice how much more even the 4.5.5 results are, compared to 3.2.80.
>>
>
> The more I think about these random spikes in pgbench performance on 3.2.80,
> the more I find them intriguing. Let me show you another example (from
> Dilip's workload and group-update patch on 64 clients).
>
> This is on 3.2.80:
>
>   44175  34619  51944  38384  49066
>   37004  47242  36296  46353  36180
>
> and on 4.5.5 it looks like this:
>
>   34400  35559  35436  34890  34626
>   35233  35756  34876  35347  35486
>
> So the 4.5.5 results are much more even, but overall clearly below 3.2.80.
> How does 3.2.80 manage to do ~50k tps in some of the runs? Clearly we
> randomly do something right, but what is it and why doesn't it happen on the
> new kernel? And how could we do it every time?
>

As far as I can see you are using default values of min_wal_size,
max_wal_size, checkpoint related params, have you changed default
shared_buffer settings, because that can have a bigger impact.  Using
default values of mentioned parameters can lead to checkpoints in
between your runs. Also, I think instead of 5 mins, read-write runs
should be run for 15 mins to get consistent data.  For Dilip's
workload where he is using only Select ... For Update, i think it is
okay, but otherwise you need to drop and re-create the database
between each run, otherwise data bloat could impact the readings.

I think in general, the impact should be same for both the kernels
because you are using same parameters, but I think if use appropriate
parameters, then you can get consistent results for 3.2.80.  I have
also seen variation in read-write tests, but the variation you are
showing is really a matter of concern, because it will be difficult to
rely on final data.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Fri, Sep 23, 2016 at 6:29 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> On Fri, Sep 23, 2016 at 6:05 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
> wrote:
>>
>> On 09/23/2016 05:10 AM, Amit Kapila wrote:
>>>
>>> On Fri, Sep 23, 2016 at 5:14 AM, Tomas Vondra
>>> <tomas.vondra@2ndquadrant.com> wrote:
>>>>
>>>> On 09/21/2016 08:04 AM, Amit Kapila wrote:
>>>>>
>>>>>
>>>>
>>>> (c) Although it's not visible in the results, 4.5.5 almost perfectly
>>>> eliminated the fluctuations in the results. For example when 3.2.80
>>>> produced
>>>> this results (10 runs with the same parameters):
>>>>
>>>>     12118 11610 27939 11771 18065
>>>>     12152 14375 10983 13614 11077
>>>>
>>>> we get this on 4.5.5
>>>>
>>>>     37354 37650 37371 37190 37233
>>>>     38498 37166 36862 37928 38509
>>>>
>>>> Notice how much more even the 4.5.5 results are, compared to 3.2.80.
>>>>
>>>
>>> how long each run was?  Generally, I do half-hour run to get stable
>>> results.
>>>
>>
>> 10 x 5-minute runs for each client count. The full shell script driving
>> the benchmark is here: http://bit.ly/2doY6ID and in short it looks like
>> this:
>>
>>     for r in `seq 1 $runs`; do
>>         for c in 1 8 16 32 64 128 192; do
>>             psql -c checkpoint
>>             pgbench -j 8 -c $c ...
>>         done
>>     done
>
>
>
> I see couple of problems with the tests:
>
> 1. You're running regular pgbench, which also updates the small tables. At
> scale 300 and higher clients, there is going to heavy contention on the
> pgbench_branches table. Why not test with pgbench -N? As far as this patch
> is concerned, we are only interested in seeing contention on
> ClogControlLock. In fact, how about a test which only consumes an XID, but
> does not do any write activity at all? Complete artificial workload, but
> good enough to tell us if and how much the patch helps in the best case. We
> can probably do that with a simple txid_current() call, right?
>

Right, that is why in the initial tests done by Dilip, he used
Select ... For Update.  I think using txid_current will generate a lot
of contention on XidGenLock, which will mask the contention around
CLogControlLock; in fact, we have tried that.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Fri, Sep 23, 2016 at 6:50 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Sep 22, 2016 at 7:44 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> I don't dare to suggest rejecting the patch, but I don't see how we could
>> commit any of the patches at this point. So perhaps "returned with feedback"
>> and resubmitting in the next CF (along with analysis of improved workloads)
>> would be appropriate.
>
> I think it would be useful to have some kind of theoretical analysis
> of how much time we're spending waiting for various locks.  So, for
> example, suppose we one run of these tests with various client counts
> - say, 1, 8, 16, 32, 64, 96, 128, 192, 256 - and we run "select
> wait_event from pg_stat_activity" once per second throughout the test.
> Then we see how many times we get each wait event, including NULL (no
> wait event).  Now, from this, we can compute the approximate
> percentage of time we're spending waiting on CLogControlLock and every
> other lock, too, as well as the percentage of time we're not waiting
> for lock.  That, it seems to me, would give us a pretty clear idea
> what the maximum benefit we could hope for from reducing contention on
> any given lock might be.
>

As mentioned earlier, such an activity makes sense; however, re-reading
this thread today, I noticed that Dilip has already posted some
analysis of lock contention upthread [1].  It is clear that the patch
has reduced LWLock contention from ~28% to ~4% (where the major
contributor was TransactionIdSetPageStatus, which has reduced from ~53%
to ~3%).  Isn't that in line with what you are looking for?


[1] - https://www.postgresql.org/message-id/CAFiTN-u-XEzhd%3DhNGW586fmQwdTy6Qy6_SXe09tNB%3DgBcVzZ_A%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 09/23/2016 03:07 PM, Amit Kapila wrote:
> On Fri, Sep 23, 2016 at 6:16 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> On 09/23/2016 01:44 AM, Tomas Vondra wrote:
>>>
>>> ...
>>> The 4.5 kernel clearly changed the results significantly:
>>>
>> ...
>>>
>>>
>>> (c) Although it's not visible in the results, 4.5.5 almost perfectly
>>> eliminated the fluctuations in the results. For example when 3.2.80
>>> produced this results (10 runs with the same parameters):
>>>
>>>     12118 11610 27939 11771 18065
>>>     12152 14375 10983 13614 11077
>>>
>>> we get this on 4.5.5
>>>
>>>     37354 37650 37371 37190 37233
>>>     38498 37166 36862 37928 38509
>>>
>>> Notice how much more even the 4.5.5 results are, compared to 3.2.80.
>>>
>>
>> The more I think about these random spikes in pgbench performance on 3.2.80,
>> the more I find them intriguing. Let me show you another example (from
>> Dilip's workload and group-update patch on 64 clients).
>>
>> This is on 3.2.80:
>>
>>   44175  34619  51944  38384  49066
>>   37004  47242  36296  46353  36180
>>
>> and on 4.5.5 it looks like this:
>>
>>   34400  35559  35436  34890  34626
>>   35233  35756  34876  35347  35486
>>
>> So the 4.5.5 results are much more even, but overall clearly below 3.2.80.
>> How does 3.2.80 manage to do ~50k tps in some of the runs? Clearly we
>> randomly do something right, but what is it and why doesn't it happen on the
>> new kernel? And how could we do it every time?
>>
>
> As far as I can see you are using default values of min_wal_size,
> max_wal_size, checkpoint related params, have you changed default
> shared_buffer settings, because that can have a bigger impact.

Huh? Where do you see me using default values? There are settings.log 
with a dump of pg_settings data, and the modified values are

checkpoint_completion_target = 0.9
checkpoint_timeout = 3600
effective_io_concurrency = 32
log_autovacuum_min_duration = 100
log_checkpoints = on
log_line_prefix = %m
log_timezone = UTC
maintenance_work_mem = 524288
max_connections = 300
max_wal_size = 8192
min_wal_size = 1024
shared_buffers = 2097152
synchronous_commit = on
work_mem = 524288

(ignoring some irrelevant stuff like locales, timezone etc.).
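
For readability, those raw values use the usual GUC units (8kB pages for shared_buffers, 16MB segments for the WAL sizes, kB and seconds elsewhere), so the dump above corresponds roughly to:

    shared_buffers = 16GB
    min_wal_size = 16GB
    max_wal_size = 128GB
    checkpoint_timeout = 1h
    checkpoint_completion_target = 0.9
    work_mem = 512MB
    maintenance_work_mem = 512MB
    max_connections = 300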

> Using default values of mentioned parameters can lead to checkpoints in
> between your runs.

So I'm using 16GB shared buffers (so with scale 300 everything fits into 
shared buffers), min_wal_size=16GB, max_wal_size=128GB, checkpoint 
timeout 1h etc. So no, there are no checkpoints during the 5-minute 
runs, only those triggered explicitly before each run.

> Also, I think instead of 5 mins, read-write runs should be run for 15
> mins to get consistent data.

Where does the inconsistency come from? Lack of warmup? Considering how 
uniform the results from the 10 runs are (at least on 4.5.5), I claim 
this is not an issue.

> For Dilip's workload where he is using only Select ... For Update, i
> think it is okay, but otherwise you need to drop and re-create the
> database between each run, otherwise data bloat could impact the
> readings.

And why should it affect 3.2.80 and 4.5.5 differently?

>
> I think in general, the impact should be same for both the kernels
> because you are using same parameters, but I think if use
> appropriate parameters, then you can get consistent results for
> 3.2.80. I have also seen variation in read-write tests, but the
> variation you are showing is really a matter of concern, because it
> will be difficult to rely on final data.
>

Both kernels use exactly the same parameters (fairly tuned, IMHO).


-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 09/23/2016 02:59 PM, Pavan Deolasee wrote:
>
>
> On Fri, Sep 23, 2016 at 6:05 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote:
>
>     On 09/23/2016 05:10 AM, Amit Kapila wrote:
>
>         On Fri, Sep 23, 2016 at 5:14 AM, Tomas Vondra
>         <tomas.vondra@2ndquadrant.com
>         <mailto:tomas.vondra@2ndquadrant.com>> wrote:
>
>             On 09/21/2016 08:04 AM, Amit Kapila wrote:
>
>
>
>             (c) Although it's not visible in the results, 4.5.5 almost
>             perfectly
>             eliminated the fluctuations in the results. For example when
>             3.2.80 produced
>             this results (10 runs with the same parameters):
>
>                 12118 11610 27939 11771 18065
>                 12152 14375 10983 13614 11077
>
>             we get this on 4.5.5
>
>                 37354 37650 37371 37190 37233
>                 38498 37166 36862 37928 38509
>
>             Notice how much more even the 4.5.5 results are, compared to
>             3.2.80.
>
>
>         how long each run was?  Generally, I do half-hour run to get
>         stable results.
>
>
>     10 x 5-minute runs for each client count. The full shell script
>     driving the benchmark is here: http://bit.ly/2doY6ID and in short it
>     looks like this:
>
>         for r in `seq 1 $runs`; do
>             for c in 1 8 16 32 64 128 192; do
>                 psql -c checkpoint
>                 pgbench -j 8 -c $c ...
>             done
>         done
>
>
>
> I see couple of problems with the tests:
>
> 1. You're running regular pgbench, which also updates the small
> tables. At scale 300 and higher clients, there is going to heavy
> contention on the pgbench_branches table. Why not test with pgbench
> -N?

Sure, I can do a bunch of tests with pgbench -N. Good point.

But notice that I've also done the testing with Dilip's workload, and 
the results are pretty much the same.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Fri, Sep 23, 2016 at 8:22 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 09/23/2016 03:07 PM, Amit Kapila wrote:
>>
>> On Fri, Sep 23, 2016 at 6:16 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>>
>>> On 09/23/2016 01:44 AM, Tomas Vondra wrote:
>>>>
>>>>
>>>> ...
>>>> The 4.5 kernel clearly changed the results significantly:
>>>>
>>> ...
>>>>
>>>>
>>>>
>>>> (c) Although it's not visible in the results, 4.5.5 almost perfectly
>>>> eliminated the fluctuations in the results. For example when 3.2.80
>>>> produced this results (10 runs with the same parameters):
>>>>
>>>>     12118 11610 27939 11771 18065
>>>>     12152 14375 10983 13614 11077
>>>>
>>>> we get this on 4.5.5
>>>>
>>>>     37354 37650 37371 37190 37233
>>>>     38498 37166 36862 37928 38509
>>>>
>>>> Notice how much more even the 4.5.5 results are, compared to 3.2.80.
>>>>
>>>
>>> The more I think about these random spikes in pgbench performance on
>>> 3.2.80,
>>> the more I find them intriguing. Let me show you another example (from
>>> Dilip's workload and group-update patch on 64 clients).
>>>
>>> This is on 3.2.80:
>>>
>>>   44175  34619  51944  38384  49066
>>>   37004  47242  36296  46353  36180
>>>
>>> and on 4.5.5 it looks like this:
>>>
>>>   34400  35559  35436  34890  34626
>>>   35233  35756  34876  35347  35486
>>>
>>> So the 4.5.5 results are much more even, but overall clearly below
>>> 3.2.80.
>>> How does 3.2.80 manage to do ~50k tps in some of the runs? Clearly we
>>> randomly do something right, but what is it and why doesn't it happen on
>>> the
>>> new kernel? And how could we do it every time?
>>>
>>
>> As far as I can see you are using default values of min_wal_size,
>> max_wal_size, checkpoint related params, have you changed default
>> shared_buffer settings, because that can have a bigger impact.
>
>
> Huh? Where do you see me using default values?
>

I was referring to one of your scripts at http://bit.ly/2doY6ID.  I
hadn't noticed that you had changed the default values in
postgresql.conf.

> There are settings.log with a
> dump of pg_settings data, and the modified values are
>
> checkpoint_completion_target = 0.9
> checkpoint_timeout = 3600
> effective_io_concurrency = 32
> log_autovacuum_min_duration = 100
> log_checkpoints = on
> log_line_prefix = %m
> log_timezone = UTC
> maintenance_work_mem = 524288
> max_connections = 300
> max_wal_size = 8192
> min_wal_size = 1024
> shared_buffers = 2097152
> synchronous_commit = on
> work_mem = 524288
>
> (ignoring some irrelevant stuff like locales, timezone etc.).
>
>> Using default values of mentioned parameters can lead to checkpoints in
>> between your runs.
>
>
> So I'm using 16GB shared buffers (so with scale 300 everything fits into
> shared buffers), min_wal_size=16GB, max_wal_size=128GB, checkpoint timeout
> 1h etc. So no, there are no checkpoints during the 5-minute runs, only those
> triggered explicitly before each run.
>

Thanks for the clarification.  Do you think we should try some different
settings for the *_flush_after parameters, as those can help in reducing
spikes in writes?

>> Also, I think instead of 5 mins, read-write runs should be run for 15
>> mins to get consistent data.
>
>
> Where does the inconsistency come from?

That's what I am also curious to know.

> Lack of warmup?

Can't say, but at least we should try to rule out the possibilities.
I think one way to rule them out is to do slightly longer runs for
Dilip's test cases, and for pgbench we might need to drop and re-create
the database after each reading.

> Considering how
> uniform the results from the 10 runs are (at least on 4.5.5), I claim this
> is not an issue.
>

It is quite possible that it is some kernel regression which might be
fixed in a later version.  For instance, we are doing most tests on
cthulhu, which has a 3.10 kernel, and we generally get consistent results.
I am not sure whether a later kernel version, say 4.5.5, is a net win,
because there is a considerable dip in performance in that version,
even though it produces quite stable results.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 09/24/2016 06:06 AM, Amit Kapila wrote:
> On Fri, Sep 23, 2016 at 8:22 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> ...
>>
>> So I'm using 16GB shared buffers (so with scale 300 everything fits into
>> shared buffers), min_wal_size=16GB, max_wal_size=128GB, checkpoint timeout
>> 1h etc. So no, there are no checkpoints during the 5-minute runs, only those
>> triggered explicitly before each run.
>>
>
> Thanks for clarification.  Do you think we should try some different
> settings *_flush_after parameters as those can help in reducing spikes
> in writes?
>

I don't see why those settings would matter. The tests are on unlogged 
tables, so there's almost no WAL traffic and checkpoints (triggered 
explicitly before each run) look like this:

checkpoint complete: wrote 17 buffers (0.0%); 0 transaction log file(s) 
added, 0 removed, 13 recycled; write=0.062 s, sync=0.006 s, total=0.092 
s; sync files=10, longest=0.004 s, average=0.000 s; distance=309223 kB, 
estimate=363742 kB

So I don't see how tuning the flushing would change anything, as we're 
not doing any writes.

Moreover, the machine has a bunch of SSD drives (16 or 24, I don't 
remember at the moment), behind a RAID controller with 2GB of write 
cache on it.

>>> Also, I think instead of 5 mins, read-write runs should be run for 15
>>> mins to get consistent data.
>>
>>
>> Where does the inconsistency come from?
>
> Thats what I am also curious to know.
>
>> Lack of warmup?
>
> Can't say, but at least we should try to rule out the possibilities.
> I think one way to rule out is to do slightly longer runs for
> Dilip's test cases and for pgbench we might need to drop and
> re-create database after each reading.
>

My point is that it's unlikely to be due to insufficient warmup, because 
the inconsistencies appear randomly - generally you get a bunch of slow 
runs, one significantly faster one, then slow ones again.

I believe the runs to be sufficiently long. I don't see why recreating 
the database would be useful - the whole point is to get the database 
and shared buffers into a stable state, and then do measurements on it.

I don't think bloat is a major factor here - I'm collecting some 
additional statistics during this run, including pg_database_size, and I 
can see the size oscillates between 4.8GB and 5.4GB. That's pretty 
negligible, I believe.

I'll let the current set of benchmarks complete - it's running on 4.5.5 
now, I'll do tests on 3.2.80 too.

Then we can re-evaluate if longer runs are needed.

>> Considering how uniform the results from the 10 runs are (at least
>> on 4.5.5), I claim  this is not an issue.
>>
>
> It is quite possible that it is some kernel regression which might
> be fixed in later version. Like we are doing most tests in cthulhu
> which has 3.10 version of kernel and we generally get consistent
> results. I am not sure if later version of kernel say 4.5.5 is a net
> win, because there is a considerable difference (dip) of performance
> in that version, though it produces quite stable results.
>

Well, the thing is - the 4.5.5 behavior is much nicer in general. I'll 
always prefer lower but more consistent performance (in most cases). In 
any case, we're stuck with whatever kernel version the people are using, 
and they're likely to use the newer ones.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 09/24/2016 08:28 PM, Tomas Vondra wrote:
> On 09/24/2016 06:06 AM, Amit Kapila wrote:
>> On Fri, Sep 23, 2016 at 8:22 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>> ...
>>>
>>> So I'm using 16GB shared buffers (so with scale 300 everything fits into
>>> shared buffers), min_wal_size=16GB, max_wal_size=128GB, checkpoint
>>> timeout
>>> 1h etc. So no, there are no checkpoints during the 5-minute runs,
>>> only those
>>> triggered explicitly before each run.
>>>
>>
>> Thanks for clarification.  Do you think we should try some different
>> settings *_flush_after parameters as those can help in reducing spikes
>> in writes?
>>
>
> I don't see why that settings would matter. The tests are on unlogged
> tables, so there's almost no WAL traffic and checkpoints (triggered
> explicitly before each run) look like this:
>
> checkpoint complete: wrote 17 buffers (0.0%); 0 transaction log file(s)
> added, 0 removed, 13 recycled; write=0.062 s, sync=0.006 s, total=0.092
> s; sync files=10, longest=0.004 s, average=0.000 s; distance=309223 kB,
> estimate=363742 kB
>
> So I don't see how tuning the flushing would change anything, as we're
> not doing any writes.
>
> Moreover, the machine has a bunch of SSD drives (16 or 24, I don't
> remember at the moment), behind a RAID controller with 2GB of write
> cache on it.
>
>>>> Also, I think instead of 5 mins, read-write runs should be run for 15
>>>> mins to get consistent data.
>>>
>>>
>>> Where does the inconsistency come from?
>>
>> Thats what I am also curious to know.
>>
>>> Lack of warmup?
>>
>> Can't say, but at least we should try to rule out the possibilities.
>> I think one way to rule out is to do slightly longer runs for
>> Dilip's test cases and for pgbench we might need to drop and
>> re-create database after each reading.
>>
>
> My point is that it's unlikely to be due to insufficient warmup, because
> the inconsistencies appear randomly - generally you get a bunch of slow
> runs, one significantly faster one, then slow ones again.
>
> I believe the runs to be sufficiently long. I don't see why recreating
> the database would be useful - the whole point is to get the database
> and shared buffers into a stable state, and then do measurements on it.
>
> I don't think bloat is a major factor here - I'm collecting some
> additional statistics during this run, including pg_database_size, and I
> can see the size oscillates between 4.8GB and 5.4GB. That's pretty
> negligible, I believe.
>
> I'll let the current set of benchmarks complete - it's running on 4.5.5
> now, I'll do tests on 3.2.80 too.
>
> Then we can re-evaluate if longer runs are needed.
>
>>> Considering how uniform the results from the 10 runs are (at least
>>> on 4.5.5), I claim  this is not an issue.
>>>
>>
>> It is quite possible that it is some kernel regression which might
>> be fixed in later version. Like we are doing most tests in cthulhu
>> which has 3.10 version of kernel and we generally get consistent
>> results. I am not sure if later version of kernel say 4.5.5 is a net
>> win, because there is a considerable difference (dip) of performance
>> in that version, though it produces quite stable results.
>>
>
> Well, the thing is - the 4.5.5 behavior is much nicer in general. I'll
> always prefer lower but more consistent performance (in most cases). In
> any case, we're stuck with whatever kernel version the people are using,
> and they're likely to use the newer ones.
>

So, I have the pgbench results from 3.2.80 and 4.5.5, and in general I
think they match the previous results almost exactly, so it wasn't just
a fluke before.

The full results, including systat data and various database statistics
(pg_stat_* sampled every second) are available here:

     https://bitbucket.org/tvondra/group-clog-kernels

Attached are the per-run results. The averages (over the 10 runs, 5
minute each) look like this:

  3.2.80                 1      8     16     32     64    128    192
--------------------------------------------------------------------
  granular-locking    1567  12146  26341  44188  43263  49590  15042
  no-content-lock     1567  12180  25549  43787  43675  51800  16831
  group-update        1550  12018  26121  44451  42734  51455  15504
  master              1566  12057  25457  42299  42513  42562  10462

  4.5.5                  1      8     16     32     64    128    192
--------------------------------------------------------------------
  granular-locking    3018  19031  27394  29222  32032  34249  36191
  no-content-lock     2988  18871  27384  29260  32120  34456  36216
  group-update        2960  18848  26870  29025  32078  34259  35900
  master              2984  18917  26430  29065  32119  33924  35897

That is:

(1) The 3.2.80 performs a bit better than before, particularly for 128
and 256 clients - I'm not sure if it's thanks to the reboots or so.

(2) 4.5.5 performs measurably worse for >= 32 clients (by ~30%). That's
a pretty significant regression, on a fairly common workload.

(3) The patches somewhat help on 3.2.80, with 128 clients or more.

(4) There's no measurable improvement on 4.5.5.

As for the warmup, possible impact of database bloat etc. Attached are
two charts, illustrating how tps and database looks like over the whole
benchmark on 4.5.5 (~1440 minutes). Clearly, the behavior is very stable
- the database size oscillates around 5GB (which easily fits into
shared_buffers), and the tps is very stable over the 10 runs.  If the
warmup (or run duration) was insufficient, there'd be visible behavior
changes during the benchmark. So I believe the parameters are appropriate.

I've realized there actually is a 3.10.101 kernel available on the
machine, so I'll repeat the pgbench run on it too - perhaps that'll give
us some comparison to cthulhu, which is running a 3.10 kernel too.

Then I'll run Dilip's workload on those three kernels (so far only the
simple pgbench was measured).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Fri, Sep 23, 2016 at 9:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Sep 23, 2016 at 6:50 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Sep 22, 2016 at 7:44 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>> I don't dare to suggest rejecting the patch, but I don't see how we could
>>> commit any of the patches at this point. So perhaps "returned with feedback"
>>> and resubmitting in the next CF (along with analysis of improved workloads)
>>> would be appropriate.
>>
>> I think it would be useful to have some kind of theoretical analysis
>> of how much time we're spending waiting for various locks.  So, for
>> example, suppose we one run of these tests with various client counts
>> - say, 1, 8, 16, 32, 64, 96, 128, 192, 256 - and we run "select
>> wait_event from pg_stat_activity" once per second throughout the test.
>> Then we see how many times we get each wait event, including NULL (no
>> wait event).  Now, from this, we can compute the approximate
>> percentage of time we're spending waiting on CLogControlLock and every
>> other lock, too, as well as the percentage of time we're not waiting
>> for lock.  That, it seems to me, would give us a pretty clear idea
>> what the maximum benefit we could hope for from reducing contention on
>> any given lock might be.
>>
> As mentioned earlier, such an activity makes sense, however today,
> again reading this thread, I noticed that Dilip has already posted
> some analysis of lock contention upthread [1].  It is clear that patch
> has reduced LWLock contention from ~28% to ~4% (where the major
> contributor was TransactionIdSetPageStatus which has reduced from ~53%
> to ~3%).  Isn't it inline with what you are looking for?

Hmm, yes.  But it's a little hard to interpret what that means; I
think the test I proposed in the quoted material above would provide
clearer data.
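
For concreteness, a minimal version of that sampling could look roughly like this (a sketch only; the wait_event column assumes 9.6's pg_stat_activity, and excluding the sampling backend itself is just one way to do it):

    # sample wait events once per second for the duration of a run
    for i in $(seq 1 300); do
        psql -At -c "SELECT coalesce(wait_event, 'none') FROM pg_stat_activity WHERE pid <> pg_backend_pid()" >> wait_events.txt
        sleep 1
    done

    # count samples per wait event; divide by the total to get percentages
    sort wait_events.txt | uniq -c | sort -rn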

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 09/26/2016 07:16 PM, Tomas Vondra wrote:
>
> The averages (over the 10 runs, 5 minute each) look like this:
>
>  3.2.80                 1      8     16     32     64    128    192
> --------------------------------------------------------------------
>  granular-locking    1567  12146  26341  44188  43263  49590  15042
>  no-content-lock     1567  12180  25549  43787  43675  51800  16831
>  group-update        1550  12018  26121  44451  42734  51455  15504
>  master              1566  12057  25457  42299  42513  42562  10462
>
>  4.5.5                  1      8     16     32     64    128    192
> --------------------------------------------------------------------
>  granular-locking    3018  19031  27394  29222  32032  34249  36191
>  no-content-lock     2988  18871  27384  29260  32120  34456  36216
>  group-update        2960  18848  26870  29025  32078  34259  35900
>  master              2984  18917  26430  29065  32119  33924  35897
>
> That is:
>
> (1) The 3.2.80 performs a bit better than before, particularly for 128
> and 256 clients - I'm not sure if it's thanks to the reboots or so.
>
> (2) 4.5.5 performs measurably worse for >= 32 clients (by ~30%). That's
> a pretty significant regression, on a fairly common workload.
>

FWIW, now that I think about this, the regression is roughly in line 
with my findings presented in my recent blog post:
    http://blog.2ndquadrant.com/postgresql-vs-kernel-versions/

Those numbers were collected on a much smaller machine (2/4 cores only), 
which might be why the difference observed on the 32-core machine is much 
more significant.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Dilip Kumar
Date:
On Wed, Sep 21, 2016 at 8:47 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Summary:
> --------------
> At 32 clients no gain, I think at this workload Clog Lock is not a problem.
> At 64 Clients we can see ~10% gain with simple update and ~5% with TPCB.
> At 128 Clients we can see > 50% gain.
>
> Currently I have tested with synchronous commit=off, later I can try
> with on. I can also test at 80 client, I think we will see some
> significant gain at this client count also, but as of now I haven't
> yet tested.
>
> With above results, what we think ? should we continue our testing ?

I have done further testing with the TPC-B workload to see the impact
on the performance gain of increasing the scale factor.

Again, at 32 clients there is no gain, but at 64 clients the gain is 12%
and at 128 clients it is 75%.  This shows that the improvement with the
group lock is better at a higher scale factor (at scale factor 300 the
gain was 5% at 64 clients and 50% at 128 clients).

8 socket machine (kernel 3.10)
10 min runs (median of 3 runs)
synchronous_commit = off
scale factor = 1000
shared_buffers = 40GB

Test results:
----------------

client      head         group lock
32          27496       27178
64          31275       35205
128        20656        34490


LWLOCK_STATS approx. block count on ClogControl Lock ("lwlock main 11")
--------------------------------------------------------------------------------------------------------
client      head      group lock
32          80000      60000
64        150000     100000
128      140000       70000

Note: these are approximate block counts; I have the detailed
LWLOCK_STATS results in case someone wants to look into them.


LWLOCK_STATS shows that the ClogControlLock block count is reduced by
25% at 32 clients, 33% at 64 clients and 50% at 128 clients.
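
For anyone who wants to reproduce the LWLOCK_STATS numbers: it is a compile-time option, enabled roughly like this (a sketch; paths, log location and build options are illustrative):

    # rebuild PostgreSQL with lock statistics enabled
    ./configure --prefix=$HOME/pg-lwstats CPPFLAGS="-DLWLOCK_STATS"
    make -j8 && make install

    # each backend prints its counters to the server log when it exits;
    # pull out the ClogControlLock lines ("lwlock main 11") and sum the blk counts
    grep -h "lwlock main 11" $PGDATA/pg_log/*.log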

Conclusion:
1. I think both the LWLOCK_STATS and the performance data show that we
get a significant contention reduction on ClogControlLock with the patch.
2. They also show that although we are not seeing any performance gain
at 32 clients, there is still a contention reduction with the patch.

I am planning to do some more tests with a higher scale factor (3000 or more).

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 09/26/2016 08:48 PM, Tomas Vondra wrote:
> On 09/26/2016 07:16 PM, Tomas Vondra wrote:
>>
>> The averages (over the 10 runs, 5 minute each) look like this:
>>
>>  3.2.80                 1      8     16     32     64    128    192
>> --------------------------------------------------------------------
>>  granular-locking    1567  12146  26341  44188  43263  49590  15042
>>  no-content-lock     1567  12180  25549  43787  43675  51800  16831
>>  group-update        1550  12018  26121  44451  42734  51455  15504
>>  master              1566  12057  25457  42299  42513  42562  10462
>>
>>  4.5.5                  1      8     16     32     64    128    192
>> --------------------------------------------------------------------
>>  granular-locking    3018  19031  27394  29222  32032  34249  36191
>>  no-content-lock     2988  18871  27384  29260  32120  34456  36216
>>  group-update        2960  18848  26870  29025  32078  34259  35900
>>  master              2984  18917  26430  29065  32119  33924  35897
>>

So, I got the results from 3.10.101 (only the pgbench data), and it 
looks like this:
 3.10.101               1      8     16     32     64    128    192
--------------------------------------------------------------------
 granular-locking    2582  18492  33416  49583  53759  53572  51295
 no-content-lock     2580  18666  33860  49976  54382  54012  51549
 group-update        2635  18877  33806  49525  54787  54117  51718
 master              2630  18783  33630  49451  54104  53199  50497

So 3.10.101 performs even better than 3.2.80 (and much better than 
4.5.5), and there's no sign of any of the patches making a difference.

It also seems there's a major regression in the kernel, somewhere 
between 3.10 and 4.5. With 64 clients, 3.10 does ~54k transactions, 
while 4.5 does only ~32k - that's a helluva difference.

I wonder if this might be due to running the benchmark on unlogged 
tables (and thus not waiting for WAL), but I don't see why that should 
result in such a drop on a new kernel.

In any case, this seems like an issue unrelated to the patch, so I'll 
post further data into a new thread instead of hijacking this one.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Tue, Sep 27, 2016 at 5:15 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> So, I got the results from 3.10.101 (only the pgbench data), and it looks
> like this:
>
>  3.10.101               1      8     16     32     64    128    192
> --------------------------------------------------------------------
>  granular-locking    2582  18492  33416  49583  53759  53572  51295
>  no-content-lock     2580  18666  33860  49976  54382  54012  51549
>  group-update        2635  18877  33806  49525  54787  54117  51718
>  master              2630  18783  33630  49451  54104  53199  50497
>
> So 3.10.101 performs even better than 3.2.80 (and much better than 4.5.5),
> and there's no sign any of the patches making a difference.

I'm sure that you mentioned this upthread somewhere, but I can't
immediately find it.  What scale factor are you testing here?

It strikes me that the larger the scale factor, the more
CLogControlLock contention we expect to have.  We'll pretty much do
one CLOG access per update, and the more rows there are, the more
chance there is that the next update hits an "old" row that hasn't
been updated in a long time.  So a larger scale factor also increases
the number of active CLOG pages and, presumably therefore, the amount
of CLOG paging activity.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 09/28/2016 05:39 PM, Robert Haas wrote:
> On Tue, Sep 27, 2016 at 5:15 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> So, I got the results from 3.10.101 (only the pgbench data), and it looks
>> like this:
>>
>>  3.10.101               1      8     16     32     64    128    192
>> --------------------------------------------------------------------
>>  granular-locking    2582  18492  33416  49583  53759  53572  51295
>>  no-content-lock     2580  18666  33860  49976  54382  54012  51549
>>  group-update        2635  18877  33806  49525  54787  54117  51718
>>  master              2630  18783  33630  49451  54104  53199  50497
>>
>> So 3.10.101 performs even better than 3.2.80 (and much better than 4.5.5),
>> and there's no sign any of the patches making a difference.
>
> I'm sure that you mentioned this upthread somewhere, but I can't
> immediately find it.  What scale factor are you testing here?
>

300, the same scale factor as Dilip.

>
> It strikes me that the larger the scale factor, the more
> CLogControlLock contention we expect to have.  We'll pretty much do
> one CLOG access per update, and the more rows there are, the more
> chance there is that the next update hits an "old" row that hasn't
> been updated in a long time.  So a larger scale factor also
> increases the number of active CLOG pages and, presumably therefore,
> the amount of CLOG paging activity.
>

So, is 300 too little? I don't think so, because Dilip saw some benefit 
from that. Or what scale factor do we think is needed to reproduce the 
benefit? My machine has 256GB of ram, so I can easily go up to 15000 and 
still keep everything in RAM. But is it worth it?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Wed, Sep 28, 2016 at 6:45 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> So, is 300 too little? I don't think so, because Dilip saw some benefit from
> that. Or what scale factor do we think is needed to reproduce the benefit?
> My machine has 256GB of ram, so I can easily go up to 15000 and still keep
> everything in RAM. But is it worth it?

Dunno.  But it might be worth a test or two at, say, 5000, just to see
if that makes any difference.

I feel like we must be missing something here.  If Dilip is seeing
huge speedups and you're seeing nothing, something is different, and
we don't know what it is.  Even if the test case is artificial, it
ought to be the same when one of you runs it as when the other runs
it.  Right?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 09/29/2016 01:59 AM, Robert Haas wrote:
> On Wed, Sep 28, 2016 at 6:45 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> So, is 300 too little? I don't think so, because Dilip saw some benefit from
>> that. Or what scale factor do we think is needed to reproduce the benefit?
>> My machine has 256GB of ram, so I can easily go up to 15000 and still keep
>> everything in RAM. But is it worth it?
>
> Dunno. But it might be worth a test or two at, say, 5000, just to
> see if that makes any difference.
>

OK, I have some benchmarks to run on that machine, but I'll do a few 
tests with scale 5000 - probably sometime next week. I don't think the 
delay matters very much, as it's clear the patch will end up with RwF in 
this CF round.

> I feel like we must be missing something here.  If Dilip is seeing
> huge speedups and you're seeing nothing, something is different, and
> we don't know what it is.  Even if the test case is artificial, it
> ought to be the same when one of you runs it as when the other runs
> it.  Right?
>

Yes, definitely - we're missing something important, I think. One 
difference is that Dilip is using longer runs, but I don't think that's 
a problem (as I demonstrated how stable the results are).

I wonder what CPU model Dilip is using - I know it's x86, but not which 
generation it is. I'm using an E5-4620 v1 Xeon; perhaps Dilip is using a 
newer model and it makes a difference (although that seems unlikely).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Dilip Kumar
Date:
On Thu, Sep 29, 2016 at 6:40 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Yes, definitely - we're missing something important, I think. One difference
> is that Dilip is using longer runs, but I don't think that's a problem (as I
> demonstrated how stable the results are).
>
> I wonder what CPU model is Dilip using - I know it's x86, but not which
> generation it is. I'm using E5-4620 v1 Xeon, perhaps Dilip is using a newer
> model and it makes a difference (although that seems unlikely).

I am using "Intel(R) Xeon(R) CPU E7- 8830  @ 2.13GHz "


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Thu, Sep 29, 2016 at 12:56 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Thu, Sep 29, 2016 at 6:40 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Yes, definitely - we're missing something important, I think. One difference
>> is that Dilip is using longer runs, but I don't think that's a problem (as I
>> demonstrated how stable the results are).
>>
>> I wonder what CPU model is Dilip using - I know it's x86, but not which
>> generation it is. I'm using E5-4620 v1 Xeon, perhaps Dilip is using a newer
>> model and it makes a difference (although that seems unlikely).
>
> I am using "Intel(R) Xeon(R) CPU E7- 8830  @ 2.13GHz "
>

Another difference is that the machine on which Dilip is doing the tests has 8 sockets.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Wed, Sep 28, 2016 at 9:10 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>> I feel like we must be missing something here.  If Dilip is seeing
>> huge speedups and you're seeing nothing, something is different, and
>> we don't know what it is.  Even if the test case is artificial, it
>> ought to be the same when one of you runs it as when the other runs
>> it.  Right?
>>
> Yes, definitely - we're missing something important, I think. One difference
> is that Dilip is using longer runs, but I don't think that's a problem (as I
> demonstrated how stable the results are).

It's not impossible that the longer runs could matter - performance
isn't necessarily stable across time during a pgbench test, and the
longer the run the more CLOG pages it will fill.

> I wonder what CPU model is Dilip using - I know it's x86, but not which
> generation it is. I'm using E5-4620 v1 Xeon, perhaps Dilip is using a newer
> model and it makes a difference (although that seems unlikely).

The fact that he's using an 8-socket machine seems more likely to
matter than the CPU generation, which isn't much different.  Maybe
Dilip should try this on a 2-socket machine and see if he sees the
same kinds of results.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 09/29/2016 03:47 PM, Robert Haas wrote:
> On Wed, Sep 28, 2016 at 9:10 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>>> I feel like we must be missing something here.  If Dilip is seeing
>>> huge speedups and you're seeing nothing, something is different, and
>>> we don't know what it is.  Even if the test case is artificial, it
>>> ought to be the same when one of you runs it as when the other runs
>>> it.  Right?
>>>
>> Yes, definitely - we're missing something important, I think. One difference
>> is that Dilip is using longer runs, but I don't think that's a problem (as I
>> demonstrated how stable the results are).
>
> It's not impossible that the longer runs could matter - performance
> isn't necessarily stable across time during a pgbench test, and the
> longer the run the more CLOG pages it will fill.
>

Sure, but I'm not doing just a single pgbench run. I do a sequence of 
pgbench runs, with different client counts, with ~6h of total runtime. 
There's a checkpoint in between the runs, but as those benchmarks are on 
unlogged tables, that flushes only very few buffers.

Also, the clog SLRU has 128 pages, which is ~1MB of clog data, i.e. ~4M 
transactions. On some kernels (3.10 and 3.12) I can get >50k tps with 64 
clients or more, which means we fill the 128 pages in less than 80 seconds.

So half-way through the run only 50% of the clog pages fit into the SLRU, 
and we have a data set with 30M tuples with uniform random access - so 
it seems rather unlikely we'll hit a transaction that's still in the SLRU.
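
(Spelling out that arithmetic: clog stores 2 status bits per transaction, 
i.e. 4 transactions per byte, so

    128 pages * 8192 B/page          = 1 MB of clog data
    8192 B/page * 4 xacts/B          = 32768 xacts per page
    128 pages * 32768 xacts/page     ~ 4.2M xacts covered by the SLRU
    ~4M xacts / 50k tps              ~ 80 seconds to cycle the whole SLRU

so with uniform access over 30M rows, most of the XIDs we look up are 
long gone from the SLRU.)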

But sure, I can do a run with a larger data set to verify this.

>> I wonder what CPU model is Dilip using - I know it's x86, but not which
>> generation it is. I'm using E5-4620 v1 Xeon, perhaps Dilip is using a newer
>> model and it makes a difference (although that seems unlikely).
>
> The fact that he's using an 8-socket machine seems more likely to
> matter than the CPU generation, which isn't much different.  Maybe
> Dilip should try this on a 2-socket machine and see if he sees the
> same kinds of results.
>

Maybe. I wouldn't expect a major difference between 4 and 8 sockets, but 
I may be wrong.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Thu, Sep 29, 2016 at 10:14 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>> It's not impossible that the longer runs could matter - performance
>> isn't necessarily stable across time during a pgbench test, and the
>> longer the run the more CLOG pages it will fill.
>
> Sure, but I'm not doing just a single pgbench run. I do a sequence of
> pgbench runs, with different client counts, with ~6h of total runtime.
> There's a checkpoint in between the runs, but as those benchmarks are on
> unlogged tables, that flushes only very few buffers.
>
> Also, the clog SLRU has 128 pages, which is ~1MB of clog data, i.e. ~4M
> transactions. On some kernels (3.10 and 3.12) I can get >50k tps with 64
> clients or more, which means we fill the 128 pages in less than 80 seconds.
>
> So half-way through the run only 50% of clog pages fits into the SLRU, and
> we have a data set with 30M tuples, with uniform random access - so it seems
> rather unlikely we'll get transaction that's still in the SLRU.
>
> But sure, I can do a run with larger data set to verify this.

OK, another theory: Dilip is, I believe, reinitializing for each run,
and you are not.  Maybe somehow the effect Dilip is seeing only
happens with a newly-initialized set of pgbench tables.  For example,
maybe the patches cause a huge improvement when all rows have the same
XID, but the effect fades rapidly once the XIDs spread out...

I'm not saying any of what I'm throwing out here is worth the
electrons upon which it is printed, just that there has to be some
explanation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Speed up Clog Access by increasing CLOG buffers

From
Dilip Kumar
Date:
On Thu, Sep 29, 2016 at 8:05 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> OK, another theory: Dilip is, I believe, reinitializing for each run,
> and you are not.

Yes, I am reinitializing for each run.
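
That is, before every run the database is rebuilt from scratch, roughly 
like this (a sketch; the scale factor and options vary with the test):

    dropdb pgbench
    createdb pgbench
    pgbench -i -s 300 pgbench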


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
Hi,

After collecting a lot more results from multiple kernel versions, I can 
confirm that I see a significant improvement with 128 and 192 clients, 
roughly by 30%:
                           64        128        192
    ------------------------------------------------
     master             62482      43181      50985
     granular-locking   61701      59611      47483
     no-content-lock    62650      59819      47895
     group-update       63702      64758      62596
 

But I only see this with Dilip's workload, and only with pre-4.3.0 
kernels (the results above are from kernel 3.19).

With 4.5.5, results for the same benchmark look like this:
                           64        128        192
    ------------------------------------------------
     master             35693      39822      42151
     granular-locking   35370      39409      41353
     no-content-lock    36201      39848      42407
     group-update       35697      39893      42667
 

That seems like a fairly bad regression in kernel, although I have not 
identified the feature/commit causing it (and it's also possible the 
issue lies somewhere else, of course).

With regular pgbench, I see no improvement on any kernel version. For 
example on 3.19 the results look like this:
                           64        128        192
    ------------------------------------------------
     master             54661      61014      59484
     granular-locking   55904      62481      60711
     no-content-lock    56182      62442      61234
     group-update       55019      61587      60485
 

I haven't done much more testing (e.g. with -N to eliminate collisions 
on branches) yet; let's see if it changes anything.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Wed, Oct 5, 2016 at 12:05 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Hi,
>
> After collecting a lot more results from multiple kernel versions, I can
> confirm that I see a significant improvement with 128 and 192 clients,
> roughly by 30%:
>
>                            64        128        192
>     ------------------------------------------------
>      master             62482      43181      50985
>      granular-locking   61701      59611      47483
>      no-content-lock    62650      59819      47895
>      group-update       63702      64758      62596
>
> But I only see this with Dilip's workload, and only with pre-4.3.0 kernels
> (the results above are from kernel 3.19).
>

That appears positive.

> With 4.5.5, results for the same benchmark look like this:
>
>                            64        128        192
>     ------------------------------------------------
>      master             35693      39822      42151
>      granular-locking   35370      39409      41353
>      no-content-lock    36201      39848      42407
>      group-update       35697      39893      42667
>
> That seems like a fairly bad regression in kernel, although I have not
> identified the feature/commit causing it (and it's also possible the issue
> lies somewhere else, of course).
>
> With regular pgbench, I see no improvement on any kernel version. For
> example on 3.19 the results look like this:
>
>                            64        128        192
>     ------------------------------------------------
>      master             54661      61014      59484
>      granular-locking   55904      62481      60711
>      no-content-lock    56182      62442      61234
>      group-update       55019      61587      60485
>

Are the above results with synchronous_commit=off?

> I haven't done much more testing (e.g. with -N to eliminate collisions on
> branches) yet, let's see if it changes anything.
>

Yeah, let us see how it behaves with -N.  Also, I think we could try
a higher scale factor?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 10/05/2016 10:03 AM, Amit Kapila wrote:
> On Wed, Oct 5, 2016 at 12:05 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Hi,
>>
>> After collecting a lot more results from multiple kernel versions, I can
>> confirm that I see a significant improvement with 128 and 192 clients,
>> roughly by 30%:
>>
>>                            64        128        192
>>     ------------------------------------------------
>>      master             62482      43181      50985
>>      granular-locking   61701      59611      47483
>>      no-content-lock    62650      59819      47895
>>      group-update       63702      64758      62596
>>
>> But I only see this with Dilip's workload, and only with pre-4.3.0 kernels
>> (the results above are from kernel 3.19).
>>
>
> That appears positive.
>

I got access to a large machine with 72/144 cores (thanks to Oleg and 
Alexander from Postgres Professional), and I'm running the tests on that 
machine too.

Results from Dilip's workload (with scale 300, unlogged tables) look 
like this:
                        32      64    128     192    224     256    288
  master            104943  128579  72167  100967  66631   97088  63767
  granular-locking  103415  141689  83780  120480  71847  115201  67240
  group-update      105343  144322  92229  130149  81247  126629  76638
  no-content-lock   103153  140568  80101  119185  70004  115386  66199

So there's some 20-30% improvement for >= 128 clients.

But what I find much more intriguing is the zig-zag behavior. I mean, 64 
clients give ~130k tps, 128 clients only give ~70k but 192 clients jump 
up to >100k tps again, etc.

FWIW I don't see any such behavior on pgbench, and all those tests were 
done on the same cluster.

>> With 4.5.5, results for the same benchmark look like this:
>>
>>                            64        128        192
>>     ------------------------------------------------
>>      master             35693      39822      42151
>>      granular-locking   35370      39409      41353
>>      no-content-lock    36201      39848      42407
>>      group-update       35697      39893      42667
>>
>> That seems like a fairly bad regression in kernel, although I have not
>> identified the feature/commit causing it (and it's also possible the issue
>> lies somewhere else, of course).
>>
>> With regular pgbench, I see no improvement on any kernel version. For
>> example on 3.19 the results look like this:
>>
>>                            64        128        192
>>     ------------------------------------------------
>>      master             54661      61014      59484
>>      granular-locking   55904      62481      60711
>>      no-content-lock    56182      62442      61234
>>      group-update       55019      61587      60485
>>
>
> Are the above results with synchronous_commit=off?
>

No, but I can do that.

>> I haven't done much more testing (e.g. with -N to eliminate
>> collisions on branches) yet, let's see if it changes anything.
>>
>
> Yeah, let us see how it behaves with -N. Also, I think we could try
> at higher scale factor?
>

Yes, I plan to do that. In total, I plan to test combinations of:

(a) Dilip's workload and pgbench (regular and -N)
(b) logged and unlogged tables
(c) scale 300 and scale 3000 (both fits into RAM)
(d) sync_commit=on/off

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Fri, Oct 7, 2016 at 3:02 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> I got access to a large machine with 72/144 cores (thanks to Oleg and
> Alexander from Postgres Professional), and I'm running the tests on that
> machine too.
>
> Results from Dilip's workload (with scale 300, unlogged tables) look like
> this:
>
>                         32      64    128     192    224     256    288
>   master            104943  128579  72167  100967  66631   97088  63767
>   granular-locking  103415  141689  83780  120480  71847  115201  67240
>   group-update      105343  144322  92229  130149  81247  126629  76638
>   no-content-lock   103153  140568  80101  119185  70004  115386  66199
>
> So there's some 20-30% improvement for >= 128 clients.
>

So here we see a performance improvement starting at 64 clients, which is
somewhat similar to what Dilip saw in his tests.

> But what I find much more intriguing is the zig-zag behavior. I mean, 64
> clients give ~130k tps, 128 clients only give ~70k but 192 clients jump up
> to >100k tps again, etc.
>

No clear answer.

> FWIW I don't see any such behavior on pgbench, and all those tests were done
> on the same cluster.
>
>>> With 4.5.5, results for the same benchmark look like this:
>>>
>>>                            64        128        192
>>>     ------------------------------------------------
>>>      master             35693      39822      42151
>>>      granular-locking   35370      39409      41353
>>>      no-content-lock    36201      39848      42407
>>>      group-update       35697      39893      42667
>>>
>>> That seems like a fairly bad regression in kernel, although I have not
>>> identified the feature/commit causing it (and it's also possible the
>>> issue
>>> lies somewhere else, of course).
>>>
>>> With regular pgbench, I see no improvement on any kernel version. For
>>> example on 3.19 the results look like this:
>>>
>>>                            64        128        192
>>>     ------------------------------------------------
>>>      master             54661      61014      59484
>>>      granular-locking   55904      62481      60711
>>>      no-content-lock    56182      62442      61234
>>>      group-update       55019      61587      60485
>>>
>>
>> Are the above results with synchronous_commit=off?
>>
>
> No, but I can do that.
>
>>> I haven't done much more testing (e.g. with -N to eliminate
>>> collisions on branches) yet, let's see if it changes anything.
>>>
>>
>> Yeah, let us see how it behaves with -N. Also, I think we could try
>> at higher scale factor?
>>
>
> Yes, I plan to do that. In total, I plan to test combinations of:
>
> (a) Dilip's workload and pgbench (regular and -N)
> (b) logged and unlogged tables
> (c) scale 300 and scale 3000 (both fits into RAM)
> (d) sync_commit=on/off
>

sounds sensible.

Thanks for doing the tests.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 10/08/2016 07:47 AM, Amit Kapila wrote:
> On Fri, Oct 7, 2016 at 3:02 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:>> ...>
>> In total, I plan to test combinations of:
>>
>> (a) Dilip's workload and pgbench (regular and -N)
>> (b) logged and unlogged tables
>> (c) scale 300 and scale 3000 (both fits into RAM)
>> (d) sync_commit=on/off
>>
>
> sounds sensible.
>
> Thanks for doing the tests.
>

FWIW I've started those tests on the big machine provided by Oleg and 
Alexander; the estimate to complete all the benchmarks is 9 days. The 
results will be pushed to
   https://bitbucket.org/tvondra/hp05-results/src

after testing each combination (every ~9 hours). Inspired by Robert's 
wait event post a few days ago, I've added wait event sampling so that 
we can perform similar analysis. (Neat idea!)

While messing with the kernel on the other machine I've managed to 
misconfigure it to the extent that it's not accessible anymore. I'll 
start similar benchmarks once I find someone with console access who can 
fix the boot.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Dilip Kumar
Date:
On Mon, Oct 10, 2016 at 2:17 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> after testing each combination (every ~9 hours). Inspired by Robert's wait
> event post a few days ago, I've added wait event sampling so that we can
> perform similar analysis. (Neat idea!)

I have done a wait event test for head vs the group lock patch.
I have used a script similar to what Robert mentioned in the thread below:

https://www.postgresql.org/message-id/CA+Tgmoav9Q5v5ZGT3+wP_1tQjT6TGYXrwrDcTRrWimC+ZY7RRA@mail.gmail.com
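
The general idea of the sampling is just to poll pg_stat_activity during
the run and count the sampled wait events; a minimal sketch (not the
exact script used) could look like this:

    # sample wait events every half second while pgbench is running
    while true; do
        psql -t -c "SELECT wait_event_type, wait_event FROM pg_stat_activity" >> wait_events.log
        sleep 0.5
    done

    # afterwards, aggregate the samples into per-event counts
    sort wait_events.log | uniq -c | sort -rn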

Test details and Results:
--------------------------------
Machine: POWER, 4-socket (machine details are attached in a file).

30-minute pgbench runs with the following configuration:
max_connections = 200
shared_buffers = 8GB
maintenance_work_mem = 4GB
synchronous_commit = off
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
log_line_prefix = '%t [%p]
max_wal_size = 40GB
log_checkpoints = on

Test1: unlogged table, 192 clients
---------------------------------------------
On Head:
tps = 44898.862257 (including connections establishing)
tps = 44899.761934 (excluding connections establishing)

 262092  LWLockNamed     | CLogControlLock
 224396                  |
 114510  Lock            | transactionid
  42908  Client          | ClientRead
  20610  Lock            | tuple
  13700  LWLockTranche   | buffer_content
   3637
   2562  LWLockNamed     | XidGenLock
   2359  LWLockNamed     | ProcArrayLock
   1037  Lock            | extend
    948  LWLockTranche   | lock_manager
     46  LWLockTranche   | wal_insert
     12  BufferPin       | BufferPin
      4  LWLockTranche   | buffer_mapping

With Patch:

tps = 77846.622956 (including connections establishing)
tps = 77848.234046 (excluding connections establishing)

 101832  Lock            | transactionid
  91358  Client          | ClientRead
  16691  LWLockNamed     | XidGenLock
  12467  Lock            | tuple
   6007  LWLockNamed     | CLogControlLock
   3640
   3531  LWLockNamed     | ProcArrayLock
   3390  LWLockTranche   | lock_manager
   2683  Lock            | extend
   1112  LWLockTranche   | buffer_content
     72  LWLockTranche   | wal_insert
      8  LWLockTranche   | buffer_mapping
      2  LWLockTranche   | proc
      2  BufferPin       | BufferPin


Test2: unlogged table, 96 clients
------------------------------------------
On head:
tps = 58632.065563 (including connections establishing)
tps = 58632.767384 (excluding connections establishing)
  77039  LWLockNamed     | CLogControlLock
  39712  Client          | ClientRead
  18358  Lock            | transactionid
   4238  LWLockNamed     | XidGenLock
   3638
   3518  LWLockTranche   | buffer_content
   2717  LWLockNamed     | ProcArrayLock
   1410  Lock            | tuple
    792  Lock            | extend
    182  LWLockTranche   | lock_manager
     30  LWLockTranche   | wal_insert
      3  LWLockTranche   | buffer_mapping
      1 Tuples only is on.
      1  BufferPin       | BufferPin

With Patch:
tps = 75204.166640 (including connections establishing)
tps = 75204.922105 (excluding connections establishing)
[dilip.kumar@power2 bin]$ cat out_300_96_ul.txt
 261917                  |
  53407  Client          | ClientRead
  14994  Lock            | transactionid
   5258  LWLockNamed     | XidGenLock
   3660
   3604  LWLockNamed     | ProcArrayLock
   2096  LWLockNamed     | CLogControlLock
   1102  Lock            | tuple
    823  Lock            | extend
    481  LWLockTranche   | buffer_content
    372  LWLockTranche   | lock_manager
    192  Lock            | relation
     65  LWLockTranche   | wal_insert
      6  LWLockTranche   | buffer_mapping
      1 Tuples only is on.
      1  LWLockTranche   | proc


Test3: unlogged table, 64 clients
------------------------------------------
On Head:

tps = 66231.203018 (including connections establishing)
tps = 66231.664990 (excluding connections establishing)

  43446  Client          | ClientRead
   6992  LWLockNamed     | CLogControlLock
   4685  Lock            | transactionid
   3650
   3381  LWLockNamed     | ProcArrayLock
    810  LWLockNamed     | XidGenLock
    734  Lock            | extend
    439  LWLockTranche   | buffer_content
    247  Lock            | tuple
    136  LWLockTranche   | lock_manager
     64  Lock            | relation
     24  LWLockTranche   | wal_insert
      2  LWLockTranche   | buffer_mapping
      1 Tuples only is on.


With Patch:
tps = 67294.042602 (including connections establishing)
tps = 67294.532650 (excluding connections establishing)

  28186  Client          | ClientRead
   3655
   1172  LWLockNamed     | ProcArrayLock
    619  Lock            | transactionid
    289  LWLockNamed     | CLogControlLock
    237  Lock            | extend
     81  LWLockTranche   | buffer_content
     48  LWLockNamed     | XidGenLock
     28  LWLockTranche   | lock_manager
     23  Lock            | tuple
      6  LWLockTranche   | wal_insert



Test4:  unlogged table, 32 clients

Head:
tps = 52320.190549 (including connections establishing)
tps = 52320.442694 (excluding connections establishing)

  28564  Client          | ClientRead
   3663
   1320  LWLockNamed     | ProcArrayLock
    742  Lock            | transactionid
    534  LWLockNamed     | CLogControlLock
    255  Lock            | extend
    108  LWLockNamed     | XidGenLock
     81  LWLockTranche   | buffer_content
     44  LWLockTranche   | lock_manager
     29  Lock            | tuple
      6  LWLockTranche   | wal_insert
      1 Tuples only is on.
      1  LWLockTranche   | buffer_mapping

With Patch:
tps = 47505.582315 (including connections establishing)
tps = 47505.773351 (excluding connections establishing)

  28186  Client          | ClientRead
   3655
   1172  LWLockNamed     | ProcArrayLock
    619  Lock            | transactionid
    289  LWLockNamed     | CLogControlLock
    237  Lock            | extend
     81  LWLockTranche   | buffer_content
     48  LWLockNamed     | XidGenLock
     28  LWLockTranche   | lock_manager
     23  Lock            | tuple
      6  LWLockTranche   | wal_insert

I think at higher client count from client count 96 onwards contention
on CLogControlLock is clearly visible and which is completely solved
with group lock patch.

And at lower client count 32,64  contention on CLogControlLock is not
significant hence we can not see any gain with group lock patch.
(though we can see some contention on CLogControlLock is reduced at 64
clients.)

Note: Here I have taken only one set of readings, and at 32 clients my
reading shows some regression with the group lock patch, which may be
run-to-run variance (earlier I never saw this regression; I can
confirm again with multiple runs).


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Wed, Oct 12, 2016 at 3:21 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I think at higher client count from client count 96 onwards contention
> on CLogControlLock is clearly visible and which is completely solved
> with group lock patch.
>
> And at lower client count 32,64  contention on CLogControlLock is not
> significant hence we can not see any gain with group lock patch.
> (though we can see some contention on CLogControlLock is reduced at 64
> clients.)

I agree with these conclusions.  I had a chance to talk with Andres
this morning at Postgres Vision and based on that conversation I'd
like to suggest a couple of additional tests:

1. Repeat this test on x86.  In particular, I think you should test on
the EnterpriseDB server cthulhu, which is an 8-socket x86 server.

2. Repeat this test with a mixed read-write workload, like -b
tpcb-like@1 -b select-only@9
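
For concreteness, such a mixed run can be driven with pgbench's
weighted built-in scripts; apart from the two -b weights suggested
above, the options below are placeholders only:

    pgbench -M prepared -c 192 -j 192 -T 1800 \
            -b tpcb-like@1 -b select-only@9 postgres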

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 10/12/2016 08:55 PM, Robert Haas wrote:
> On Wed, Oct 12, 2016 at 3:21 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> I think at higher client count from client count 96 onwards contention
>> on CLogControlLock is clearly visible and which is completely solved
>> with group lock patch.
>>
>> And at lower client count 32,64  contention on CLogControlLock is not
>> significant hence we can not see any gain with group lock patch.
>> (though we can see some contention on CLogControlLock is reduced at 64
>> clients.)
> 
> I agree with these conclusions.  I had a chance to talk with Andres
> this morning at Postgres Vision and based on that conversation I'd
> like to suggest a couple of additional tests:
> 
> 1. Repeat this test on x86.  In particular, I think you should test on
> the EnterpriseDB server cthulhu, which is an 8-socket x86 server.
> 
> 2. Repeat this test with a mixed read-write workload, like -b
> tpcb-like@1 -b select-only@9
> 

FWIW, I'm already running similar benchmarks on an x86 machine with 72
cores (144 with HT). It's "just" a 4-socket system, but the results I
got so far seem quite interesting. The tooling and results (pushed
incrementally) are available here:
   https://bitbucket.org/tvondra/hp05-results/overview

The tooling is completely automated, and it also collects various stats,
like for example the wait events. So perhaps we could simply run it on
cthulhu and get comparable results, and also more thorough data sets than
just snippets posted to the list?

There's also a bunch of reports for the 5 already completed runs
 - dilip-300-logged-sync
 - dilip-300-unlogged-sync
 - pgbench-300-logged-sync-skip
 - pgbench-300-unlogged-sync-noskip
 - pgbench-300-unlogged-sync-skip

The name identifies the workload type, scale and whether the tables are
wal-logged (for pgbench the "skip" means "-N" while "noskip" does
regular pgbench).
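
In pgbench terms the two variants correspond roughly to the following
invocations (client count and duration here are placeholders; the exact
options live in the tooling in the repository):

    # "noskip" - the regular tpcb-like transaction
    pgbench -M prepared -c $CLIENTS -j $CLIENTS -T 300 test

    # "skip" - the -N variant, which skips the updates to
    # pgbench_branches and pgbench_tellers
    pgbench -M prepared -N -c $CLIENTS -j $CLIENTS -T 300 test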

For example the "reports/wait-events-count-patches.txt" compares the
wait event stats with different patches applied (and master):


https://bitbucket.org/tvondra/hp05-results/src/506d0bee9e6557b015a31d72f6c3506e3f198c17/reports/wait-events-count-patches.txt?at=master&fileviewer=file-view-default

and average tps (from 3 runs, 5 minutes each):


https://bitbucket.org/tvondra/hp05-results/src/506d0bee9e6557b015a31d72f6c3506e3f198c17/reports/tps-avg-patches.txt?at=master&fileviewer=file-view-default

There are certainly interesting bits. For example while the "logged"
case is dominated by WALWriteLock for most client counts, for large
client counts that's no longer true.

Consider for example dilip-300-logged-sync results with 216 clients:
     wait_event      | master  | gran_lock | no_cont_lock | group_upd
 --------------------+---------+-----------+--------------+-----------
 CLogControlLock     |  624566 |    474261 |       458599 |    225338
 WALWriteLock        |  431106 |    623142 |       619596 |    699224
                     |  331542 |    358220 |       371393 |    537076
 buffer_content      |  261308 |    134764 |       138664 |    102057
 ClientRead          |   59826 |    100883 |       103609 |    118379
 transactionid       |   26966 |     23155 |        23815 |     31700
 ProcArrayLock       |    3967 |      3852 |         4070 |      4576
 wal_insert          |    3948 |     10430 |         9513 |     12079
 clog                |    1710 |      4006 |         2443 |       925
 XidGenLock          |    1689 |      3785 |         4229 |      3539
 tuple               |     965 |       617 |          655 |       840
 lock_manager        |     300 |       571 |          619 |       802
 WALBufMappingLock   |     168 |       140 |          158 |       147
 SubtransControlLock |      60 |       115 |          124 |       105

Clearly, CLOG is an issue here, and it's (slightly) improved by all the
patches (group_update performing the best). And with 288 clients (which
is 2x the number of virtual cores in the machine, so not entirely crazy)
you get this:
     wait_event      | master  | gran_lock | no_cont_lock | group_upd
 --------------------+---------+-----------+--------------+-----------
 CLogControlLock     |  901670 |    736822 |       728823 |    398111
 buffer_content      |  492637 |    318129 |       319251 |    270416
 WALWriteLock        |  414371 |    593804 |       589809 |    656613
                     |  380344 |    452936 |       470178 |    745790
 ClientRead          |   60261 |    111367 |       111391 |    126151
 transactionid       |   43627 |     34585 |        35464 |     48679
 wal_insert          |    5423 |     29323 |        25898 |     30191
 ProcArrayLock       |    4379 |      3918 |         4006 |      4582
 clog                |    2952 |      9135 |         5304 |      2514
 XidGenLock          |    2182 |      9488 |         8894 |      8595
 tuple               |    2176 |      1288 |         1409 |      1821
 lock_manager        |     323 |       797 |          827 |      1006
 WALBufMappingLock   |     124 |       124 |          146 |       206
 SubtransControlLock |      85 |       146 |          170 |       120

So even buffer_content gets ahead of the WALWriteLock. I wonder whether
this might be because of only having 128 buffers for clog pages, causing
contention on this system (surely, systems with 144 cores were not that
common when the 128 limit was introduced).

So the patch has positive impact even with WAL, as illustrated by tps
improvements (for large client counts):
  clients | master | gran_locking | no_content_lock | group_update
 ---------+--------+--------------+-----------------+--------------
       36 |  39725 |        39627 |           41203 |        39763
       72 |  70533 |        65795 |           65602 |        66195
      108 |  81664 |        87415 |           86896 |        87199
      144 |  68950 |        98054 |           98266 |       102834
      180 | 105741 |       109827 |          109201 |       113911
      216 |  62789 |        92193 |           90586 |        98995
      252 |  94243 |       102368 |          100663 |       107515
      288 |  57895 |        83608 |           82556 |        91738

I find the tps fluctuation intriguing, and I'd like to see that fixed
before committing any of the patches.

For pgbench-300-logged-sync-skip (the other WAL-logging test already
completed), the CLOG contention is also reduced significantly, but the
tps did not improve as significantly.

For the unlogged case (dilip-300-unlogged-sync), the results are
fairly similar - CLogControlLock and buffer_content dominating the wait
event profiles (WALWriteLock is missing, of course), and the average tps
fluctuates in almost exactly the same way.

Interestingly, no fluctuation for the pgbench tests. For example for
pgbench-300-unlogged-sync-skip (i.e. pgbench -N) the result is this:
  clients | master | gran_locking | no_content_lock | group_update
 ---------+--------+--------------+-----------------+--------------
       36 | 147265 |       148663 |          148985 |       146559
       72 | 162645 |       209070 |          207841 |       204588
      108 | 135785 |       219982 |          218111 |       217588
      144 | 113979 |       228683 |          228953 |       226934
      180 |  96930 |       230161 |          230316 |       227156
      216 |  89068 |       224241 |          226524 |       225805
      252 |  78203 |       222507 |          225636 |       224810
      288 |  63999 |       204524 |          225469 |       220098

That's a fairly significant improvement, and the behavior is very
smooth. Sadly, with WAL logging (pgbench-300-logged-sync-skip) the tps
drops back to master, mostly thanks to WALWriteLock.

Another interesting aspect of the patches is impact on variability of
results - for example looking at dilip-300-unlogged-sync, the overall
average tps (for the three runs combined) and for each of the tree runs,
looks like this:
  clients | avg_tps |   tps_1 |   tps_2 |   tps_3
 ---------+---------+---------+---------+---------
       36 |  117332 |  115042 |  116125 |  120841
       72 |   90917 |   72451 |  119915 |   80319
      108 |   96070 |  106105 |   73606 |  108580
      144 |   81422 |   71094 |  102109 |   71063
      180 |   88537 |   98871 |   67756 |   99021
      216 |   75962 |   65584 |   96365 |   66010
      252 |   59941 |   57771 |   64756 |   57289
      288 |   80851 |   93005 |   56454 |   93313

Notice the variability between the runs - the difference between min and
max is often more than 40%. Now compare it to results with the
"group-update" patch applied:
  clients | avg_tps |   tps_1 |   tps_2 |   tps_3
 ---------+---------+---------+---------+---------
       36 |  116273 |  117031 |  116005 |  115786
       72 |  145273 |  147166 |  144839 |  143821
      108 |   89892 |   89957 |   89585 |   90133
      144 |  130176 |  130310 |  130565 |  129655
      180 |   81944 |   81927 |   81951 |   81953
      216 |  124415 |  124367 |  123228 |  125651
      252 |   76723 |   76467 |   77266 |   76436
      288 |  120072 |  121205 |  119731 |  119283

In this case there's pretty much no cross-run variability, the
differences are usually within 2%, so basically random noise. (There's
of course the variability depending on client count, but that was
already mentioned).


There's certainly much more interesting stuff in the results, but I
don't have time for more thorough analysis now - I only intended to do
some "quick benchmarking" on the patch, and I've already spent days on
this, and I have other things to do.

I'll take care of collecting data for the remaining cases on this
machine (and possibly running the same tests on the other one, if I
manage to get access to it again). But I'll leave further analysis of
the collected data up to the patch authors, or some volunteers.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Thu, Oct 13, 2016 at 7:53 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 10/12/2016 08:55 PM, Robert Haas wrote:
>> On Wed, Oct 12, 2016 at 3:21 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>> I think at higher client count from client count 96 onwards contention
>>> on CLogControlLock is clearly visible and which is completely solved
>>> with group lock patch.
>>>
>>> And at lower client count 32,64  contention on CLogControlLock is not
>>> significant hence we can not see any gain with group lock patch.
>>> (though we can see some contention on CLogControlLock is reduced at 64
>>> clients.)
>>
>> I agree with these conclusions.  I had a chance to talk with Andres
>> this morning at Postgres Vision and based on that conversation I'd
>> like to suggest a couple of additional tests:
>>
>> 1. Repeat this test on x86.  In particular, I think you should test on
>> the EnterpriseDB server cthulhu, which is an 8-socket x86 server.
>>
>> 2. Repeat this test with a mixed read-write workload, like -b
>> tpcb-like@1 -b select-only@9
>>
>
> FWIW, I'm already running similar benchmarks on an x86 machine with 72
> cores (144 with HT). It's "just" a 4-socket system, but the results I
> got so far seem quite interesting. The tooling and results (pushed
> incrementally) are available here:
>
>     https://bitbucket.org/tvondra/hp05-results/overview
>
> The tooling is completely automated, and it also collects various stats,
> like for example the wait event. So perhaps we could simply run it on
> ctulhu and get comparable results, and also more thorough data sets than
> just snippets posted to the list?
>
> There's also a bunch of reports for the 5 already completed runs
>
>  - dilip-300-logged-sync
>  - dilip-300-unlogged-sync
>  - pgbench-300-logged-sync-skip
>  - pgbench-300-unlogged-sync-noskip
>  - pgbench-300-unlogged-sync-skip
>
> The name identifies the workload type, scale and whether the tables are
> wal-logged (for pgbench the "skip" means "-N" while "noskip" does
> regular pgbench).
>
> For example the "reports/wait-events-count-patches.txt" compares the
> wait even stats with different patches applied (and master):
>
>
https://bitbucket.org/tvondra/hp05-results/src/506d0bee9e6557b015a31d72f6c3506e3f198c17/reports/wait-events-count-patches.txt?at=master&fileviewer=file-view-default
>
> and average tps (from 3 runs, 5 minutes each):
>
>
https://bitbucket.org/tvondra/hp05-results/src/506d0bee9e6557b015a31d72f6c3506e3f198c17/reports/tps-avg-patches.txt?at=master&fileviewer=file-view-default
>
> There are certainly interesting bits. For example while the "logged"
> case is dominated y WALWriteLock for most client counts, for large
> client counts that's no longer true.
>
> Consider for example dilip-300-logged-sync results with 216 clients:
>
>      wait_event      | master  | gran_lock | no_cont_lock | group_upd
>  --------------------+---------+-----------+--------------+-----------
>  CLogControlLock     |  624566 |    474261 |       458599 |    225338
>  WALWriteLock        |  431106 |    623142 |       619596 |    699224
>                      |  331542 |    358220 |       371393 |    537076
>  buffer_content      |  261308 |    134764 |       138664 |    102057
>  ClientRead          |   59826 |    100883 |       103609 |    118379
>  transactionid       |   26966 |     23155 |        23815 |     31700
>  ProcArrayLock       |    3967 |      3852 |         4070 |      4576
>  wal_insert          |    3948 |     10430 |         9513 |     12079
>  clog                |    1710 |      4006 |         2443 |       925
>  XidGenLock          |    1689 |      3785 |         4229 |      3539
>  tuple               |     965 |       617 |          655 |       840
>  lock_manager        |     300 |       571 |          619 |       802
>  WALBufMappingLock   |     168 |       140 |          158 |       147
>  SubtransControlLock |      60 |       115 |          124 |       105
>
> Clearly, CLOG is an issue here, and it's (slightly) improved by all the
> patches (group_update performing the best). And with 288 clients (which
> is 2x the number of virtual cores in the machine, so not entirely crazy)
> you get this:
>
>      wait_event      | master  | gran_lock | no_cont_lock | group_upd
>  --------------------+---------+-----------+--------------+-----------
>  CLogControlLock     |  901670 |    736822 |       728823 |    398111
>  buffer_content      |  492637 |    318129 |       319251 |    270416
>  WALWriteLock        |  414371 |    593804 |       589809 |    656613
>                      |  380344 |    452936 |       470178 |    745790
>  ClientRead          |   60261 |    111367 |       111391 |    126151
>  transactionid       |   43627 |     34585 |        35464 |     48679
>  wal_insert          |    5423 |     29323 |        25898 |     30191
>  ProcArrayLock       |    4379 |      3918 |         4006 |      4582
>  clog                |    2952 |      9135 |         5304 |      2514
>  XidGenLock          |    2182 |      9488 |         8894 |      8595
>  tuple               |    2176 |      1288 |         1409 |      1821
>  lock_manager        |     323 |       797 |          827 |      1006
>  WALBufMappingLock   |     124 |       124 |          146 |       206
>  SubtransControlLock |      85 |       146 |          170 |       120
>
> So even buffer_content gets ahead of the WALWriteLock. I wonder whether
> this might be because of only having 128 buffers for clog pages, causing
> contention on this system (surely, systems with 144 cores were not that
> common when the 128 limit was introduced).
>

Not sure, but I have checked that if we increase clog buffers beyond
128, it causes a dip in performance on read-write workloads in some
cases.  Apart from that, from the above results it is quite clear that
the patches help in significantly reducing the CLOGControlLock
contention, with the group-update patch consistently better, probably
because this workload is more contended on writing the transaction
status.

> So the patch has positive impact even with WAL, as illustrated by tps
> improvements (for large client counts):
>
>   clients | master | gran_locking | no_content_lock | group_update
>  ---------+--------+--------------+-----------------+--------------
>        36 |  39725 |        39627 |           41203 |        39763
>        72 |  70533 |        65795 |           65602 |        66195
>       108 |  81664 |        87415 |           86896 |        87199
>       144 |  68950 |        98054 |           98266 |       102834
>       180 | 105741 |       109827 |          109201 |       113911
>       216 |  62789 |        92193 |           90586 |        98995
>       252 |  94243 |       102368 |          100663 |       107515
>       288 |  57895 |        83608 |           82556 |        91738
>
> I find the tps fluctuation intriguing, and I'd like to see that fixed
> before committing any of the patches.
>

I have checked the wait event results where there is more fluctuation:

 test                    | clients | wait_event_type | wait_event      | master | granular_locking | no_content_lock | group_update
-------------------------+---------+-----------------+-----------------+--------+------------------+-----------------+--------------
 dilip-300-unlogged-sync |     108 | LWLockNamed     | CLogControlLock | 343526 |           502127 |          479937 |       301381
 dilip-300-unlogged-sync |     180 | LWLockNamed     | CLogControlLock | 557639 |           835567 |          795403 |       512707


So, if I read the above results correctly, they show that group-update
has helped slightly to reduce the contention, and one probable reason
could be that on such a workload we need to update the clog status on
different clog pages more frequently, and may also need to perform disk
reads for clog pages, so the benefit of grouping will certainly be
less.  This is because page read requests get serialized and only the
leader backend performs all such requests.  Robert has pointed out a
somewhat similar case upthread [1], and I had modified the patch to use
multiple slots (groups) for the group transaction status update [2],
but we didn't pursue it because it didn't show any benefit on the
pgbench workload.  However, maybe here it can show some benefit; if we
could make the above results reproducible and you guys think the above
theory sounds reasonable, then I can again modify the patch based on
that idea.

Now, the story with the granular_locking and no_content_lock patches
seems to be worse, because they seem to be increasing the contention on
CLOGControlLock rather than reducing it.  I think one probable reason,
for both approaches, is that a backend frequently needs to release the
CLogControlLock acquired in Shared mode and reacquire it in Exclusive
mode when the clog page to modify is not in a buffer (a different clog
page than the one currently in the buffer), and then it needs to
release the CLogControlLock once again to read the clog page from disk
and acquire it again in Exclusive mode.  This frequent release-acquire
of CLOGControlLock in different modes could lead to a significant
increase in contention.  It is slightly worse for the granular_locking
patch as it needs one additional lock (buffer_content_lock) in
Exclusive mode after acquiring CLogControlLock.  Offhand, I could not
see a way to reduce the contention with the granular_locking and
no_content_lock patches.

So, the crux is that we are seeing more variability in some of the
results because of frequent accesses to different clog pages, which is
not easy to predict, but I think it is quite possible at ~100,000 tps.

>
> There's certainly much more interesting stuff in the results, but I
> don't have time for more thorough analysis now - I only intended to do
> some "quick benchmarking" on the patch, and I've already spent days on
> this, and I have other things to do.
>

Thanks a ton for doing such a detailed testing.


[1] -
https://www.postgresql.org/message-id/CA%2BTgmoahCx6XgprR%3Dp5%3D%3DcF0g9uhSHsJxVdWdUEHN9H2Mv0gkw%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1%2BSoW3FBrdZV%2B3m34uCByK3DMPy_9QQs34yvN8spByzyA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Dilip Kumar
Date:
On Thu, Oct 13, 2016 at 12:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I agree with these conclusions.  I had a chance to talk with Andres
> this morning at Postgres Vision and based on that conversation I'd
> like to suggest a couple of additional tests:
>
> 1. Repeat this test on x86.  In particular, I think you should test on
> the EnterpriseDB server cthulhu, which is an 8-socket x86 server.

I have done my test on cthulhu, basic difference is that In POWER we
saw ClogControlLock on top at 96 and more client with 300 scale
factor. But, on cthulhu at 300 scale factor transactionid lock is
always on top. So I repeated my test with 1000 scale factor as well on
cthulhu.

All configuration is same as my last test.

Test with 1000 scale factor
-------------------------------------

Test1: number of clients: 192

Head:
tps = 21206.108856 (including connections establishing)
tps = 21206.245441 (excluding connections establishing)
[dilip.kumar@cthulhu bin]$ cat 1000_192_ul.txt
 310489  LWLockNamed     | CLogControlLock
 296152                  |
  35537  Lock            | transactionid
  15821  LWLockTranche   | buffer_mapping
  10342  LWLockTranche   | buffer_content
   8427  LWLockTranche   | clog
   3961
   3165  Lock            | extend
   2861  Lock            | tuple
   2781  LWLockNamed     | ProcArrayLock
   1104  LWLockNamed     | XidGenLock
    745  LWLockTranche   | lock_manager
    371  LWLockNamed     | CheckpointerCommLock
     70  LWLockTranche   | wal_insert
      5  BufferPin       | BufferPin
      3  LWLockTranche   | proc

Patch:
tps = 28725.038933 (including connections establishing)
tps = 28725.367102 (excluding connections establishing)
[dilip.kumar@cthulhu bin]$ cat 1000_192_ul.txt
 540061                  |
  57810  LWLockNamed     | CLogControlLock
  36264  LWLockTranche   | buffer_mapping
  29976  Lock            | transactionid
   4770  Lock            | extend
   4735  LWLockTranche   | clog
   4479  LWLockNamed     | ProcArrayLock
   4006
   3955  LWLockTranche   | buffer_content
   2505  LWLockTranche   | lock_manager
   2179  Lock            | tuple
   1977  LWLockNamed     | XidGenLock
    905  LWLockNamed     | CheckpointerCommLock
    222  LWLockTranche   | wal_insert
      8  LWLockTranche   | proc

Test2: number of clients: 96

Head:
tps = 25447.861572 (including connections establishing)
tps = 25448.012739 (excluding connections establishing)
 261611                  |
  69604  LWLockNamed     | CLogControlLock
   6119  Lock            | transactionid
   4008
   2874  LWLockTranche   | buffer_mapping
   2578  LWLockTranche   | buffer_content
   2355  LWLockNamed     | ProcArrayLock
   1245  Lock            | extend
   1168  LWLockTranche   | clog
    232  Lock            | tuple
    217  LWLockNamed     | CheckpointerCommLock
    160  LWLockNamed     | XidGenLock
    158  LWLockTranche   | lock_manager
     78  LWLockTranche   | wal_insert
      5  BufferPin       | BufferPin

Patch:
tps = 32708.368938 (including connections establishing)
tps = 32708.765989 (excluding connections establishing)
[dilip.kumar@cthulhu bin]$ cat 1000_96_ul.txt
 326601                  |
   7471  LWLockNamed     | CLogControlLock
   5387  Lock            | transactionid
   4018
   3331  LWLockTranche   | buffer_mapping
   3144  LWLockNamed     | ProcArrayLock
   1372  Lock            | extend
    722  LWLockTranche   | buffer_content
    393  LWLockNamed     | XidGenLock
    237  LWLockTranche   | lock_manager
    234  Lock            | tuple
    194  LWLockTranche   | clog
     96  Lock            | relation
     88  LWLockTranche   | wal_insert
     34  LWLockNamed     | CheckpointerCommLock

Test3: number of clients: 64

Head:

tps = 28264.194438 (including connections establishing)
tps = 28264.336270 (excluding connections establishing)
 218264                  |
  10314  LWLockNamed     | CLogControlLock
   4019
   2067  Lock            | transactionid
   1950  LWLockTranche   | buffer_mapping
   1879  LWLockNamed     | ProcArrayLock
    592  Lock            | extend
    565  LWLockTranche   | buffer_content
    222  LWLockNamed     | XidGenLock
    143  LWLockTranche   | clog
    131  LWLockNamed     | CheckpointerCommLock
     63  LWLockTranche   | lock_manager
     52  Lock            | tuple
     35  LWLockTranche   | wal_insert

Patch:
tps = 27906.376194 (including connections establishing)
tps = 27906.531392 (excluding connections establishing)
[dilip.kumar@cthulhu bin]$ cat 1000_64_ul.txt
 228108                  |
   4039
   2294  Lock            | transactionid
   2116  LWLockTranche   | buffer_mapping
   1757  LWLockNamed     | ProcArrayLock
   1553  LWLockNamed     | CLogControlLock
    800  Lock            | extend
    403  LWLockTranche   | buffer_content
     92  LWLockNamed     | XidGenLock
     74  LWLockTranche   | lock_manager
     42  Lock            | tuple
     35  LWLockTranche   | wal_insert
     34  LWLockTranche   | clog
     14  LWLockNamed     | CheckpointerCommLock

Test4: number of clients: 32

Head:
tps = 27587.999912 (including connections establishing)
tps = 27588.119611 (excluding connections establishing)
[dilip.kumar@cthulhu bin]$ cat 1000_32_ul.txt
 117762                  |
   4031
    614  LWLockNamed     | ProcArrayLock
    379  LWLockNamed     | CLogControlLock
    344  Lock            | transactionid
    183  Lock            | extend
    102  LWLockTranche   | buffer_mapping
     71  LWLockTranche   | buffer_content
     39  LWLockNamed     | XidGenLock
     25  LWLockTranche   | lock_manager
      3  LWLockTranche   | wal_insert
      3  LWLockTranche   | clog
      2  LWLockNamed     | CheckpointerCommLock
      2  Lock            | tuple

Patch:
tps = 28291.428848 (including connections establishing)
tps = 28291.586435 (excluding connections establishing)
[dilip.kumar@cthulhu bin]$ cat 1000_32_ul.txt
 116596                  |
   4041
    757  LWLockNamed     | ProcArrayLock
    407  LWLockNamed     | CLogControlLock
    358  Lock            | transactionid
    183  Lock            | extend
    142  LWLockTranche   | buffer_mapping
     77  LWLockTranche   | buffer_content
     68  LWLockNamed     | XidGenLock
     35  LWLockTranche   | lock_manager
     15  LWLockTranche   | wal_insert
      7  LWLockTranche   | clog
      7  Lock            | tuple
      4  LWLockNamed     | CheckpointerCommLock
      1 Tuples only is on.

Summary:
- At 96 and more clients count we can see ClogControlLock at the top.
- With patch contention on ClogControlLock is reduced significantly.
I think these behaviours are same as we saw on power.

With 300 scale factor:
- Contention on ClogControlLock is significant only at 192 client
(still transaction id lock is on top), Which is completely removed
with group lock patch.

For 300 scale factor, I am posting data only at 192 client count (If
anyone interested in other data I can post).

Head:
scaling factor: 300
query mode: prepared
number of clients: 192
number of threads: 192
duration: 1800 s
number of transactions actually processed: 65930726
latency average: 5.242 ms
tps = 36621.827041 (including connections establishing)
tps = 36622.064081 (excluding connections establishing)
[dilip.kumar@cthulhu bin]$ cat 300_192_ul.txt
 437848                  |
 118966  Lock            | transactionid
  88869  LWLockNamed     | CLogControlLock
  18558  Lock            | tuple
   6183  LWLockTranche   | buffer_content
   5664  LWLockTranche   | lock_manager
   3995  LWLockNamed     | ProcArrayLock
   3646
   1748  Lock            | extend
   1635  LWLockNamed     | XidGenLock
    401  LWLockTranche   | wal_insert
     33  BufferPin       | BufferPin
      5  LWLockTranche   | proc
      3  LWLockTranche   | buffer_mapping

Patch:
scaling factor: 300
query mode: prepared
number of clients: 192
number of threads: 192
duration: 1800 s
number of transactions actually processed: 82616270
latency average: 4.183 ms
tps = 45894.737813 (including connections establishing)
tps = 45894.995634 (excluding connections establishing)
 120372  Lock            | transactionid
  16346  Lock            | tuple
   7489  LWLockTranche   | lock_manager
   4514  LWLockNamed     | ProcArrayLock
   3632
   3310  LWLockNamed     | CLogControlLock
   2287  LWLockNamed     | XidGenLock
   2271  Lock            | extend
    709  LWLockTranche   | buffer_content
    490  LWLockTranche   | wal_insert
     30  BufferPin       | BufferPin
     10  LWLockTranche   | proc
      6  LWLockTranche   | buffer_mapping

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 10/20/2016 09:36 AM, Dilip Kumar wrote:
> On Thu, Oct 13, 2016 at 12:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I agree with these conclusions.  I had a chance to talk with Andres
>> this morning at Postgres Vision and based on that conversation I'd
>> like to suggest a couple of additional tests:
>>
>> 1. Repeat this test on x86.  In particular, I think you should test on
>> the EnterpriseDB server cthulhu, which is an 8-socket x86 server.
>
> I have done my test on cthulhu, basic difference is that In POWER we
> saw ClogControlLock on top at 96 and more client with 300 scale
> factor. But, on cthulhu at 300 scale factor transactionid lock is
> always on top. So I repeated my test with 1000 scale factor as well on
> cthulhu.
>
> All configuration is same as my last test.
>
> Test with 1000 scale factor
> -------------------------------------
>
> Test1: number of clients: 192
>
> Head:
> tps = 21206.108856 (including connections establishing)
> tps = 21206.245441 (excluding connections establishing)
> [dilip.kumar@cthulhu bin]$ cat 1000_192_ul.txt
>  310489  LWLockNamed     | CLogControlLock
>  296152                  |
>   35537  Lock            | transactionid
>   15821  LWLockTranche   | buffer_mapping
>   10342  LWLockTranche   | buffer_content
>    8427  LWLockTranche   | clog
>    3961
>    3165  Lock            | extend
>    2861  Lock            | tuple
>    2781  LWLockNamed     | ProcArrayLock
>    1104  LWLockNamed     | XidGenLock
>     745  LWLockTranche   | lock_manager
>     371  LWLockNamed     | CheckpointerCommLock
>      70  LWLockTranche   | wal_insert
>       5  BufferPin       | BufferPin
>       3  LWLockTranche   | proc
>
> Patch:
> tps = 28725.038933 (including connections establishing)
> tps = 28725.367102 (excluding connections establishing)
> [dilip.kumar@cthulhu bin]$ cat 1000_192_ul.txt
>  540061                  |
>   57810  LWLockNamed     | CLogControlLock
>   36264  LWLockTranche   | buffer_mapping
>   29976  Lock            | transactionid
>    4770  Lock            | extend
>    4735  LWLockTranche   | clog
>    4479  LWLockNamed     | ProcArrayLock
>    4006
>    3955  LWLockTranche   | buffer_content
>    2505  LWLockTranche   | lock_manager
>    2179  Lock            | tuple
>    1977  LWLockNamed     | XidGenLock
>     905  LWLockNamed     | CheckpointerCommLock
>     222  LWLockTranche   | wal_insert
>       8  LWLockTranche   | proc
>
> Test2: number of clients: 96
>
> Head:
> tps = 25447.861572 (including connections establishing)
> tps = 25448.012739 (excluding connections establishing)
>  261611                  |
>   69604  LWLockNamed     | CLogControlLock
>    6119  Lock            | transactionid
>    4008
>    2874  LWLockTranche   | buffer_mapping
>    2578  LWLockTranche   | buffer_content
>    2355  LWLockNamed     | ProcArrayLock
>    1245  Lock            | extend
>    1168  LWLockTranche   | clog
>     232  Lock            | tuple
>     217  LWLockNamed     | CheckpointerCommLock
>     160  LWLockNamed     | XidGenLock
>     158  LWLockTranche   | lock_manager
>      78  LWLockTranche   | wal_insert
>       5  BufferPin       | BufferPin
>
> Patch:
> tps = 32708.368938 (including connections establishing)
> tps = 32708.765989 (excluding connections establishing)
> [dilip.kumar@cthulhu bin]$ cat 1000_96_ul.txt
>  326601                  |
>    7471  LWLockNamed     | CLogControlLock
>    5387  Lock            | transactionid
>    4018
>    3331  LWLockTranche   | buffer_mapping
>    3144  LWLockNamed     | ProcArrayLock
>    1372  Lock            | extend
>     722  LWLockTranche   | buffer_content
>     393  LWLockNamed     | XidGenLock
>     237  LWLockTranche   | lock_manager
>     234  Lock            | tuple
>     194  LWLockTranche   | clog
>      96  Lock            | relation
>      88  LWLockTranche   | wal_insert
>      34  LWLockNamed     | CheckpointerCommLock
>
> Test3: number of clients: 64
>
> Head:
>
> tps = 28264.194438 (including connections establishing)
> tps = 28264.336270 (excluding connections establishing)
>
>  218264                  |
>   10314  LWLockNamed     | CLogControlLock
>    4019
>    2067  Lock            | transactionid
>    1950  LWLockTranche   | buffer_mapping
>    1879  LWLockNamed     | ProcArrayLock
>     592  Lock            | extend
>     565  LWLockTranche   | buffer_content
>     222  LWLockNamed     | XidGenLock
>     143  LWLockTranche   | clog
>     131  LWLockNamed     | CheckpointerCommLock
>      63  LWLockTranche   | lock_manager
>      52  Lock            | tuple
>      35  LWLockTranche   | wal_insert
>
> Patch:
> tps = 27906.376194 (including connections establishing)
> tps = 27906.531392 (excluding connections establishing)
> [dilip.kumar@cthulhu bin]$ cat 1000_64_ul.txt
>  228108                  |
>    4039
>    2294  Lock            | transactionid
>    2116  LWLockTranche   | buffer_mapping
>    1757  LWLockNamed     | ProcArrayLock
>    1553  LWLockNamed     | CLogControlLock
>     800  Lock            | extend
>     403  LWLockTranche   | buffer_content
>      92  LWLockNamed     | XidGenLock
>      74  LWLockTranche   | lock_manager
>      42  Lock            | tuple
>      35  LWLockTranche   | wal_insert
>      34  LWLockTranche   | clog
>      14  LWLockNamed     | CheckpointerCommLock
>
> Test4: number of clients: 32
>
> Head:
> tps = 27587.999912 (including connections establishing)
> tps = 27588.119611 (excluding connections establishing)
> [dilip.kumar@cthulhu bin]$ cat 1000_32_ul.txt
>  117762                  |
>    4031
>     614  LWLockNamed     | ProcArrayLock
>     379  LWLockNamed     | CLogControlLock
>     344  Lock            | transactionid
>     183  Lock            | extend
>     102  LWLockTranche   | buffer_mapping
>      71  LWLockTranche   | buffer_content
>      39  LWLockNamed     | XidGenLock
>      25  LWLockTranche   | lock_manager
>       3  LWLockTranche   | wal_insert
>       3  LWLockTranche   | clog
>       2  LWLockNamed     | CheckpointerCommLock
>       2  Lock            | tuple
>
> Patch:
> tps = 28291.428848 (including connections establishing)
> tps = 28291.586435 (excluding connections establishing)
> [dilip.kumar@cthulhu bin]$ cat 1000_32_ul.txt
>  116596                  |
>    4041
>     757  LWLockNamed     | ProcArrayLock
>     407  LWLockNamed     | CLogControlLock
>     358  Lock            | transactionid
>     183  Lock            | extend
>     142  LWLockTranche   | buffer_mapping
>      77  LWLockTranche   | buffer_content
>      68  LWLockNamed     | XidGenLock
>      35  LWLockTranche   | lock_manager
>      15  LWLockTranche   | wal_insert
>       7  LWLockTranche   | clog
>       7  Lock            | tuple
>       4  LWLockNamed     | CheckpointerCommLock
>       1 Tuples only is on.
>
> Summary:
> - At 96 and more clients count we can see ClogControlLock at the top.
> - With patch contention on ClogControlLock is reduced significantly.
> I think these behaviours are same as we saw on power.
>
> With 300 scale factor:
> - Contention on ClogControlLock is significant only at 192 client
> (still transaction id lock is on top), Which is completely removed
> with group lock patch.
>
> For 300 scale factor, I am posting data only at 192 client count (If
> anyone interested in other data I can post).
>

In the results you've posted on 10/12, you've mentioned a regression 
with 32 clients, where you got 52k tps on master but only 48k tps with 
the patch (so ~10% difference). I have no idea what scale was used for 
those tests, and I see no such regression in the current results (but 
you only report results for some of the client counts).

Also, which of the proposed patches have you been testing?

Can you collect and share a more complete set of data, perhaps based on 
the scripts I use to do tests on the large machine with 36/72 cores, 
available at https://bitbucket.org/tvondra/hp05-results ?

I've taken some time to build simple web-based reports from the
results collected so far (also included in the git repository), and 
pushed them here:
    http://tvondra.bitbucket.org

For each of the completed runs, there's a report comparing tps for 
different client counts with master and the three patches (average tps, 
median and stddev), and it's possible to download a more thorough text 
report with wait event stats, comparison of individual runs etc.

If you want to cooperate on this, I'm available - i.e. I can help you 
get the tooling running, customize it etc.


Regarding the results collected on the "big machine" so far, I do have a 
few observations:

pgbench / scale 300 (fits into 16GB shared buffers)
---------------------------------------------------
* in general, those results seem fine

* the results generally fall into 3 categories (I'll show results for 
"pgbench -N" but regular pgbench behaves similarly):

(a) logged, sync_commit=on - no impact

    http://tvondra.bitbucket.org/#pgbench-300-logged-sync-skip

(b) logged, sync_commit=off - improvement

    http://tvondra.bitbucket.org/#pgbench-300-logged-async-skip

    The throughput gets improved by ~20% with 72 clients, and then it
    levels off (but does not drop, unlike on master). With high client
    counts the difference is up to 300%, but people who care about
    throughput won't run with such client counts anyway.

    And not only does this improve throughput, it also significantly
    reduces the variability of the performance (i.e. measure throughput
    each second and compute the STDDEV of that - see the sketch after
    this list). You can imagine this as a much "smoother" chart of tps
    over time.

(c) unlogged, sync_commit=* - improvement

    http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip

    This is actually quite similar to (b).
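
A rough way to measure that per-second variability, assuming the
standard pgbench progress output (the tooling in the repository may do
this differently):

    # -P 1 prints one "progress: ... tps ..." line per second (on stderr);
    # compute the average and stddev of those per-second tps values
    pgbench -M prepared -c 72 -j 72 -T 300 -P 1 test 2>&1 \
      | awk '/^progress:/ { n++; s += $4; ss += $4 * $4 }
             END { m = s / n; printf "avg=%.1f stddev=%.1f\n", m, sqrt(ss / n - m * m) }'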


dilip / scale 300 (fits into 16GB shared buffers)
-------------------------------------------------

* those results seem less OK

* I haven't found any significant regressions (in the sense of 
significant performance drop compared to master), but the behavior in 
some cases seem fairly strange (and it's repeatable)

* consider for example these results:
  http://tvondra.bitbucket.org/#dilip-300-unlogged-async
  http://tvondra.bitbucket.org/#dilip-300-logged-async

* the saw-like pattern is rather suspicious, and I don't think I've seen 
anything like that before - I guess there's some feedback loop and we 
better find it before committing any of the patches, because this is 
something I don't want to see on any production machine (and I bet 
neither do you)

* After looking into wait event details in the full text report at

  http://tvondra.bitbucket.org/by-test/dilip-300-unlogged-async.txt

  (section "wait events for dilip-300-unlogged-async (runs combined)")
  I see that for pg-9.6-group-update, the statistics for 72, 108 and
  144 clients (low - high - low) look like this:

   clients | wait_event_type |   wait_event    | wait_count | wait_pct
  ---------+-----------------+-----------------+------------+----------
        72 |                 |                 |     374845 |    62.87
        72 | Client          | ClientRead      |     136320 |    22.86
        72 | LWLockNamed     | CLogControlLock |      52804 |     8.86
        72 | LWLockTranche   | buffer_content  |      15337 |     2.57
        72 | LWLockNamed     | XidGenLock      |       7352 |     1.23
        72 | LWLockNamed     | ProcArrayLock   |       6630 |     1.11

       108 |                 |                 |     407179 |    46.01
       108 | LWLockNamed     | CLogControlLock |     300452 |    33.95
       108 | LWLockTranche   | buffer_content  |      87597 |     9.90
       108 | Client          | ClientRead      |      80901 |     9.14
       108 | LWLockNamed     | ProcArrayLock   |       3290 |     0.37

       144 |                 |                 |     623057 |    53.44
       144 | LWLockNamed     | CLogControlLock |     175072 |    15.02
       144 | Client          | ClientRead      |     163451 |    14.02
       144 | LWLockTranche   | buffer_content  |     147963 |    12.69
       144 | LWLockNamed     | XidGenLock      |      38361 |     3.29
       144 | Lock            | transactionid   |       8821 |     0.76

  That is, there's a sudden jump on CLogControlLock from 22% to 33% and
  then back to 15% (and for 180 clients it jumps back to ~35%). That's
  pretty strange, and all the patches behave exactly the same.
 


scale 3000 (45GB), shared_buffers=16GB
---------------------------------------

For the small scale, the whole data set fits into 16GB shared buffers, 
so there were pretty much no writes except for WAL and CLOG. For scale 
3000 that's no longer true - the backends will compete for buffers and 
will constantly write dirty buffers to page cache.

I hadn't realized this initially, and the kernel was using the default 
vm.dirty_* limits, i.e. 10% and 20%. As the machine has 3TB of RAM, this 
resulted in rather excessive thresholds (or "insane" if you want), so the 
kernel regularly accumulated up to ~15GB of dirty data and then wrote it 
out in a very short period of time. Even though the machine has fairly 
powerful storage (4GB write cache on the controller, 10 x 12Gbps SAS SSDs), 
this led to pretty bad latency spikes / drops in throughput.
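
The usual workaround on a machine with this much RAM is to replace the
percentage-based limits with absolute byte limits, along these lines
(256MB matches what was used below; the vm.dirty_bytes value is just an
assumed example, not the value actually used):

    # write dirty data out continuously instead of in ~15GB bursts
    sysctl -w vm.dirty_background_bytes=268435456   # 256MB
    sysctl -w vm.dirty_bytes=1073741824             # 1GB (assumed)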

I've only done two runs with this configuration before realizing what's 
happening; the results are illustrated here:

* http://tvondra.bitbucket.org/#dilip-3000-unlogged-sync-high-dirty-bytes
* http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip-high-dirty-bytes

I'm not sure how important those results are (if throughput and smooth 
behavior matters, tuning the kernel thresholds is a must), but what I 
find interesting is that while the patches manage to improve throughput 
by 10-20%, they also (quite significantly) increase variability of the 
results (jitter in the tps over time). It's particularly visible on the 
pgbench results. I'm not sure that's a good tradeoff.

After fixing the kernel page cache thresholds (by setting 
vm.dirty_background_bytes to 256MB to perform smooth write-out), the effect 
differs depending on the workload:

(a) dilip

    http://tvondra.bitbucket.org/#dilip-3000-unlogged-sync

    - eliminates any impact of all the patches

(b) pgbench (-N)

    http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip

    - By far the most severe regression observed during the testing.
      With 36 clients the throughput drops by ~40%, which I think is
      pretty bad. Also the results are much more variable with the
      patches (compared to master).
 


scale 3000 (45GB), shared_buffers=64GB
---------------------------------------

I've also done some tests with increased shared buffers, so that even 
the large data set fits into them. Again, the results slightly depend on 
the workload:

(a) dilip
    * http://tvondra.bitbucket.org/#dilip-3000-unlogged-sync-64
    * http://tvondra.bitbucket.org/#dilip-3000-unlogged-async-64

    Pretty much no impact on throughput or variability. Unlike on the
    small data set, the patches don't even eliminate the performance
    drop above 72 clients - the performance closely matches master.
 

(b) pgbench
    * http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip-64
    * http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-noskip-64

    There's a small benefit (~20% on the same client count), and the
    performance drop only happens after 72 clients. The patches also
    significantly increase variability of the results, particularly for
    large client counts.
 


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Thu, Oct 20, 2016 at 3:36 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Thu, Oct 13, 2016 at 12:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I agree with these conclusions.  I had a chance to talk with Andres
>> this morning at Postgres Vision and based on that conversation I'd
>> like to suggest a couple of additional tests:
>>
>> 1. Repeat this test on x86.  In particular, I think you should test on
>> the EnterpriseDB server cthulhu, which is an 8-socket x86 server.
>
> I have done my test on cthulhu, basic difference is that In POWER we
> saw ClogControlLock on top at 96 and more client with 300 scale
> factor. But, on cthulhu at 300 scale factor transactionid lock is
> always on top. So I repeated my test with 1000 scale factor as well on
> cthulhu.

So the upshot appears to be that this problem is a lot worse on power2
than cthulhu, which suggests that this is architecture-dependent.  I
guess it could also be kernel-dependent, but it doesn't seem likely,
because:

power2: Red Hat Enterprise Linux Server release 7.1 (Maipo),
3.10.0-229.14.1.ael7b.ppc64le
cthulhu: CentOS Linux release 7.2.1511 (Core), 3.10.0-229.7.2.el7.x86_64

So here's my theory.  The whole reason why Tomas is having difficulty
seeing any big effect from these patches is because he's testing on
x86.  When Dilip tests on x86, he doesn't see a big effect either,
regardless of workload.  But when Dilip tests on POWER, which I think
is where he's mostly been testing, he sees a huge effect, because for
some reason POWER has major problems with this lock that don't exist
on x86.

If that's so, then we ought to be able to reproduce the big gains on
hydra, a community POWER server.  In fact, I think I'll go run a quick
test over there right now...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Thu, Oct 20, 2016 at 11:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Oct 20, 2016 at 3:36 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> On Thu, Oct 13, 2016 at 12:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> I agree with these conclusions.  I had a chance to talk with Andres
>>> this morning at Postgres Vision and based on that conversation I'd
>>> like to suggest a couple of additional tests:
>>>
>>> 1. Repeat this test on x86.  In particular, I think you should test on
>>> the EnterpriseDB server cthulhu, which is an 8-socket x86 server.
>>
>> I have done my test on cthulhu, basic difference is that In POWER we
>> saw ClogControlLock on top at 96 and more client with 300 scale
>> factor. But, on cthulhu at 300 scale factor transactionid lock is
>> always on top. So I repeated my test with 1000 scale factor as well on
>> cthulhu.
>
> So the upshot appears to be that this problem is a lot worse on power2
> than cthulhu, which suggests that this is architecture-dependent.  I
> guess it could also be kernel-dependent, but it doesn't seem likely,
> because:
>
> power2: Red Hat Enterprise Linux Server release 7.1 (Maipo),
> 3.10.0-229.14.1.ael7b.ppc64le
> cthulhu: CentOS Linux release 7.2.1511 (Core), 3.10.0-229.7.2.el7.x86_64
>
> So here's my theory.  The whole reason why Tomas is having difficulty
> seeing any big effect from these patches is because he's testing on
> x86.  When Dilip tests on x86, he doesn't see a big effect either,
> regardless of workload.  But when Dilip tests on POWER, which I think
> is where he's mostly been testing, he sees a huge effect, because for
> some reason POWER has major problems with this lock that don't exist
> on x86.
>
> If that's so, then we ought to be able to reproduce the big gains on
> hydra, a community POWER server.  In fact, I think I'll go run a quick
> test over there right now...

And ... nope.  I ran a 30-minute pgbench test on unpatched master
using unlogged tables at scale factor 300 with 64 clients and got
these results:
     14  LWLockTranche   | wal_insert
     36  LWLockTranche   | lock_manager
     45  LWLockTranche   | buffer_content
    223  Lock            | tuple
    527  LWLockNamed     | CLogControlLock
    921  Lock            | extend
   1195  LWLockNamed     | XidGenLock
   1248  LWLockNamed     | ProcArrayLock
   3349  Lock            | transactionid
  85957  Client          | ClientRead
 135935                  |
 

I then started a run at 96 clients which I accidentally killed shortly
before it was scheduled to finish, but the results are not much
different; there is no hint of the runaway CLogControlLock contention
that Dilip sees on power2.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 10/20/2016 07:59 PM, Robert Haas wrote:
> On Thu, Oct 20, 2016 at 11:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Oct 20, 2016 at 3:36 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>> On Thu, Oct 13, 2016 at 12:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> ...
>>
>> So here's my theory.  The whole reason why Tomas is having difficulty
>> seeing any big effect from these patches is because he's testing on
>> x86.  When Dilip tests on x86, he doesn't see a big effect either,
>> regardless of workload.  But when Dilip tests on POWER, which I think
>> is where he's mostly been testing, he sees a huge effect, because for
>> some reason POWER has major problems with this lock that don't exist
>> on x86.
>>
>> If that's so, then we ought to be able to reproduce the big gains on
>> hydra, a community POWER server.  In fact, I think I'll go run a quick
>> test over there right now...
>
> And ... nope.  I ran a 30-minute pgbench test on unpatched master
> using unlogged tables at scale factor 300 with 64 clients and got
> these results:
>
>      14  LWLockTranche   | wal_insert
>      36  LWLockTranche   | lock_manager
>      45  LWLockTranche   | buffer_content
>     223  Lock            | tuple
>     527  LWLockNamed     | CLogControlLock
>     921  Lock            | extend
>    1195  LWLockNamed     | XidGenLock
>    1248  LWLockNamed     | ProcArrayLock
>    3349  Lock            | transactionid
>   85957  Client          | ClientRead
>  135935                  |
>
> I then started a run at 96 clients which I accidentally killed shortly
> before it was scheduled to finish, but the results are not much
> different; there is no hint of the runaway CLogControlLock contention
> that Dilip sees on power2.
>

What shared_buffer size were you using? I assume the data set fit into 
shared buffers, right?

FWIW as I explained in the lengthy post earlier today, I can actually 
reproduce the significant CLogControlLock contention (and the patches do 
reduce it), even on x86_64.

For example consider these two tests:

* http://tvondra.bitbucket.org/#dilip-300-unlogged-sync
* http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip

However, it seems I can also reproduce fairly bad regressions, like for 
example this case with data set exceeding shared_buffers:

* http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Thu, Oct 20, 2016 at 4:04 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>> I then started a run at 96 clients which I accidentally killed shortly
>> before it was scheduled to finish, but the results are not much
>> different; there is no hint of the runaway CLogControlLock contention
>> that Dilip sees on power2.
>>
> What shared_buffer size were you using? I assume the data set fit into
> shared buffers, right?

8GB.

> FWIW as I explained in the lengthy post earlier today, I can actually
> reproduce the significant CLogControlLock contention (and the patches do
> reduce it), even on x86_64.

/me goes back, rereads post.  Sorry, I didn't look at this carefully
the first time.

> For example consider these two tests:
>
> * http://tvondra.bitbucket.org/#dilip-300-unlogged-sync
> * http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip
>
> However, it seems I can also reproduce fairly bad regressions, like for
> example this case with data set exceeding shared_buffers:
>
> * http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip

I'm not sure how seriously we should take the regressions.  I mean,
what I see there is that CLogControlLock contention goes down by about
50% -- which is the point of the patch -- and WALWriteLock contention
goes up dramatically -- which sucks, but can't really be blamed on the
patch except in the indirect sense that a backend can't spend much
time waiting for A if it's already spending all of its time waiting
for B.  It would be nice to know why it happened, but we shouldn't
allow CLogControlLock to act as an admission control facility for
WALWriteLock (I think).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Speed up Clog Access by increasing CLOG buffers

From
Dilip Kumar
Date:
On Thu, Oct 20, 2016 at 9:03 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

> In the results you've posted on 10/12, you've mentioned a regression with 32
> clients, where you got 52k tps on master but only 48k tps with the patch (so
> ~10% difference). I have no idea what scale was used for those tests,

That test was with scale factor 300 on a POWER 4-socket machine. I think
I need to repeat this test with multiple readings to confirm whether it
was a regression or run-to-run variation. I will do that soon and post
the results.

> and I
> see no such regression in the current results (but you only report results
> for some of the client counts).

This test is on the X86 8-socket machine. At 1000 scale factor I have
given readings for all client counts (32, 64, 96, 192), but at 300 scale
factor I posted only 192 because on this machine I did not see much load
on ClogControlLock at 300 scale factor.
>
> Also, which of the proposed patches have you been testing?
I tested with GroupLock patch.

> Can you collect and share a more complete set of data, perhaps based on the
> scripts I use to do tests on the large machine with 36/72 cores, available
> at https://bitbucket.org/tvondra/hp05-results ?

I think from my last run I did not share data for the X86 8-socket
machine at 300 scale factor with 32, 64 and 96 clients. I already have
those data, so I am sharing them. (Please let me know if you want to see
some other client count; for that I need to run another test.)

Head:
scaling factor: 300
query mode: prepared
number of clients: 32
number of threads: 32
duration: 1800 s
number of transactions actually processed: 77233356
latency average: 0.746 ms
tps = 42907.363243 (including connections establishing)
tps = 42907.546190 (excluding connections establishing)
[dilip.kumar@cthulhu bin]$ cat 300_32_ul.txt
 111757                  |
   3666
   1289  LWLockNamed     | ProcArrayLock
   1142  Lock            | transactionid
    318  LWLockNamed     | CLogControlLock
    299  Lock            | extend
    109  LWLockNamed     | XidGenLock
     70  LWLockTranche   | buffer_content
     35  Lock            | tuple
     29  LWLockTranche   | lock_manager
     14  LWLockTranche   | wal_insert
      1  Tuples only is on.
      1  LWLockNamed     | CheckpointerCommLock

Group Lock Patch:

scaling factor: 300
query mode: prepared
number of clients: 32
number of threads: 32
duration: 1800 s
number of transactions actually processed: 77544028
latency average: 0.743 ms
tps = 43079.783906 (including connections establishing)
tps = 43079.960331 (excluding connections establishing)
 112209                  |
   3718
   1402  LWLockNamed     | ProcArrayLock
   1070  Lock            | transactionid
    245  LWLockNamed     | CLogControlLock
    188  Lock            | extend
     80  LWLockNamed     | XidGenLock
     76  LWLockTranche   | buffer_content
     39  LWLockTranche   | lock_manager
     31  Lock            | tuple
      7  LWLockTranche   | wal_insert
      1  Tuples only is on.
      1  LWLockTranche   | buffer_mapping

Head:
number of clients: 64
number of threads: 64
duration: 1800 s
number of transactions actually processed: 76211698
latency average: 1.512 ms
tps = 42339.731054 (including connections establishing)
tps = 42339.930464 (excluding connections establishing)
[dilip.kumar@cthulhu bin]$ cat 300_64_ul.txt
 215734                  |
   5106  Lock            | transactionid
   3754  LWLockNamed     | ProcArrayLock
   3669
   3267  LWLockNamed     | CLogControlLock
    661  Lock            | extend
    339  LWLockNamed     | XidGenLock
    310  Lock            | tuple
    289  LWLockTranche   | buffer_content
    205  LWLockTranche   | lock_manager
     50  LWLockTranche   | wal_insert
      2  LWLockTranche   | buffer_mapping
      1  Tuples only is on.
      1  LWLockTranche   | proc
 

GroupLock patch:
scaling factor: 300
query mode: prepared
number of clients: 64
number of threads: 64
duration: 1800 s
number of transactions actually processed: 76629309
latency average: 1.503 ms
tps = 42571.704635 (including connections establishing)
tps = 42571.905157 (excluding connections establishing)
[dilip.kumar@cthulhu bin]$ cat 300_64_ul.txt
 217840                  |
   5197  Lock            | transactionid
   3744  LWLockNamed     | ProcArrayLock
   3663
    966  Lock            | extend
    849  LWLockNamed     | CLogControlLock
    372  Lock            | tuple
    305  LWLockNamed     | XidGenLock
    199  LWLockTranche   | buffer_content
    184  LWLockTranche   | lock_manager
     35  LWLockTranche   | wal_insert
      1  Tuples only is on.
      1  LWLockTranche   | proc
      1  LWLockTranche   | buffer_mapping
 

Head:
scaling factor: 300
query mode: prepared
number of clients: 96
number of threads: 96
duration: 1800 s
number of transactions actually processed: 77663593
latency average: 2.225 ms
tps = 43145.624864 (including connections establishing)
tps = 43145.838167 (excluding connections establishing)
 302317                  |
  18836  Lock            | transactionid
  12912  LWLockNamed     | CLogControlLock
   4120  LWLockNamed     | ProcArrayLock
   3662
   1700  Lock            | tuple
   1305  Lock            | extend
   1030  LWLockTranche   | buffer_content
    828  LWLockTranche   | lock_manager
    730  LWLockNamed     | XidGenLock
    107  LWLockTranche   | wal_insert
      4  LWLockTranche   | buffer_mapping
      1  Tuples only is on.
      1  LWLockTranche   | proc
      1  BufferPin       | BufferPin
 

Group Lock Patch:
scaling factor: 300
query mode: prepared
number of clients: 96
number of threads: 96
duration: 1800 s
number of transactions actually processed: 61608756
latency average: 2.805 ms
tps = 44385.885080 (including connections establishing)
tps = 44386.297364 (excluding connections establishing)
[dilip.kumar@cthulhu bin]$ cat 300_96_ul.txt
 237842                  |
  14379  Lock            | transactionid
   3335  LWLockNamed     | ProcArrayLock
   2850
   1374  LWLockNamed     | CLogControlLock
   1200  Lock            | tuple
    992  Lock            | extend
    717  LWLockNamed     | XidGenLock
    625  LWLockTranche   | lock_manager
    259  LWLockTranche   | buffer_content
    105  LWLockTranche   | wal_insert
      4  LWLockTranche   | buffer_mapping
      2  LWLockTranche   | proc

Head:
scaling factor: 300
query mode: prepared
number of clients: 192
number of threads: 192
duration: 1800 s
number of transactions actually processed: 65930726
latency average: 5.242 ms
tps = 36621.827041 (including connections establishing)
tps = 36622.064081 (excluding connections establishing)
[dilip.kumar@cthulhu bin]$ cat 300_192_ul.txt
 437848                  |
 118966  Lock            | transactionid
  88869  LWLockNamed     | CLogControlLock
  18558  Lock            | tuple
   6183  LWLockTranche   | buffer_content
   5664  LWLockTranche   | lock_manager
   3995  LWLockNamed     | ProcArrayLock
   3646
   1748  Lock            | extend
   1635  LWLockNamed     | XidGenLock
    401  LWLockTranche   | wal_insert
     33  BufferPin       | BufferPin
      5  LWLockTranche   | proc
      3  LWLockTranche   | buffer_mapping
 

GroupLock Patch:
scaling factor: 300
query mode: prepared
number of clients: 192
number of threads: 192
duration: 1800 s
number of transactions actually processed: 82616270
latency average: 4.183 ms
tps = 45894.737813 (including connections establishing)
tps = 45894.995634 (excluding connections establishing)
 120372  Lock            | transactionid
  16346  Lock            | tuple
   7489  LWLockTranche   | lock_manager
   4514  LWLockNamed     | ProcArrayLock
   3632
   3310  LWLockNamed     | CLogControlLock
   2287  LWLockNamed     | XidGenLock
   2271  Lock            | extend
    709  LWLockTranche   | buffer_content
    490  LWLockTranche   | wal_insert
     30  BufferPin       | BufferPin
     10  LWLockTranche   | proc
      6  LWLockTranche   | buffer_mapping
 

Summary: On the X86 8-socket machine at 300 scale factor, I did not
observe significant waits on ClogControlLock up to 96 clients. However,
at 192 clients we can see significant waits on ClogControlLock, but
still not as bad as we see on POWER.


>
> I've taken some time to build a simple web-based reports from the results
> collected so far (also included in the git repository), and pushed them
> here:
>
>     http://tvondra.bitbucket.org
>
> For each of the completed runs, there's a report comparing tps for different
> client counts with master and the three patches (average tps, median and
> stddev), and it's possible to download a more thorough text report with wait
> event stats, comparison of individual runs etc.

I saw your report; I think presenting it this way gives a very clear idea.
>
> If you want to cooperate on this, I'm available - i.e. I can help you get
> the tooling running, customize it etc.

That will be really helpful; then next time I can also present my
reports in the same format.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Dilip Kumar
Date:
On Thu, Oct 20, 2016 at 9:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> So here's my theory.  The whole reason why Tomas is having difficulty
> seeing any big effect from these patches is because he's testing on
> x86.  When Dilip tests on x86, he doesn't see a big effect either,
> regardless of workload.  But when Dilip tests on POWER, which I think
> is where he's mostly been testing, he sees a huge effect, because for
> some reason POWER has major problems with this lock that don't exist
> on x86.

Right, because on POWER we can see big contention on ClogControlLock
at 300 scale factor, even at 96 clients, but on X86 at 300 scale factor
there is almost no contention on ClogControlLock.

However, at 1000 scale factor we can see significant contention on
ClogControlLock on the X86 machine.

I want to test on POWER with 1000 scale factor to see whether the
contention on ClogControlLock becomes much worse.

I will run this test and post the results.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Fri, Oct 21, 2016 at 6:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Oct 20, 2016 at 4:04 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>>> I then started a run at 96 clients which I accidentally killed shortly
>>> before it was scheduled to finish, but the results are not much
>>> different; there is no hint of the runaway CLogControlLock contention
>>> that Dilip sees on power2.
>>>
>> What shared_buffer size were you using? I assume the data set fit into
>> shared buffers, right?
>
> 8GB.
>
>> FWIW as I explained in the lengthy post earlier today, I can actually
>> reproduce the significant CLogControlLock contention (and the patches do
>> reduce it), even on x86_64.
>
> /me goes back, rereads post.  Sorry, I didn't look at this carefully
> the first time.
>
>> For example consider these two tests:
>>
>> * http://tvondra.bitbucket.org/#dilip-300-unlogged-sync
>> * http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip
>>
>> However, it seems I can also reproduce fairly bad regressions, like for
>> example this case with data set exceeding shared_buffers:
>>
>> * http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip
>
> I'm not sure how seriously we should take the regressions.  I mean,
> what I see there is that CLogControlLock contention goes down by about
> 50% -- which is the point of the patch -- and WALWriteLock contention
> goes up dramatically -- which sucks, but can't really be blamed on the
> patch except in the indirect sense that a backend can't spend much
> time waiting for A if it's already spending all of its time waiting
> for B.
>

Right, I think not only WALWriteLock, but contention on other locks
also goes up, as you can see in the table below.  I think there is
nothing much we can do about that with this patch.  One thing which is
unclear is why the unlogged tests are showing WALWriteLock at all.

               test              | clients | wait_event_type |   wait_event    | master | granular_locking | no_content_lock | group_update
---------------------------------+---------+-----------------+-----------------+--------+------------------+-----------------+--------------
 pgbench-3000-unlogged-sync-skip |      72 | LWLockNamed     | CLogControlLock | 217012 |            37326 |           32288 |        12040
 pgbench-3000-unlogged-sync-skip |      72 | LWLockNamed     | WALWriteLock    |  13188 |           104183 |          123359 |       103267
 pgbench-3000-unlogged-sync-skip |      72 | LWLockTranche   | buffer_content  |  10532 |            65880 |           57007 |        86176
 pgbench-3000-unlogged-sync-skip |      72 | LWLockTranche   | wal_insert      |   9280 |            85917 |          109472 |        99609
 pgbench-3000-unlogged-sync-skip |      72 | LWLockTranche   | clog            |   4623 |            25692 |           10422 |        11755




-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 10/21/2016 08:13 AM, Amit Kapila wrote:
> On Fri, Oct 21, 2016 at 6:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Oct 20, 2016 at 4:04 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>>> I then started a run at 96 clients which I accidentally killed shortly
>>>> before it was scheduled to finish, but the results are not much
>>>> different; there is no hint of the runaway CLogControlLock contention
>>>> that Dilip sees on power2.
>>>>
>>> What shared_buffer size were you using? I assume the data set fit into
>>> shared buffers, right?
>>
>> 8GB.
>>
>>> FWIW as I explained in the lengthy post earlier today, I can actually
>>> reproduce the significant CLogControlLock contention (and the patches do
>>> reduce it), even on x86_64.
>>
>> /me goes back, rereads post.  Sorry, I didn't look at this carefully
>> the first time.
>>
>>> For example consider these two tests:
>>>
>>> * http://tvondra.bitbucket.org/#dilip-300-unlogged-sync
>>> * http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip
>>>
>>> However, it seems I can also reproduce fairly bad regressions, like for
>>> example this case with data set exceeding shared_buffers:
>>>
>>> * http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip
>>
>> I'm not sure how seriously we should take the regressions.  I mean,
>> what I see there is that CLogControlLock contention goes down by about
>> 50% -- which is the point of the patch -- and WALWriteLock contention
>> goes up dramatically -- which sucks, but can't really be blamed on the
>> patch except in the indirect sense that a backend can't spend much
>> time waiting for A if it's already spending all of its time waiting
>> for B.
>>
>
> Right, I think not only WALWriteLock, but contention on other locks
> also goes up as you can see in below table.  I think there is nothing
> much we can do for that with this patch.  One thing which is unclear
> is why on unlogged tests it is showing WALWriteLock?
>

Well, although we don't write the table data to the WAL, we still need 
to write commits and other stuff, right? And on scale 3000 (which 
exceeds the 16GB shared buffers in this case), there's a continuous 
stream of dirty pages (not to WAL, but evicted from shared buffers), so 
iostat looks like this:
      time    tps  wr_sec/s  avgrq-sz  avgqu-sz     await   %util
  08:48:21  81654   1367483     16.75 127264.60   1294.80   97.41
  08:48:31  41514    697516     16.80 103271.11   3015.01   97.64
  08:48:41  78892   1359779     17.24  97308.42    928.36   96.76
  08:48:51  58735    978475     16.66  92303.00   1472.82   95.92
  08:49:01  62441   1068605     17.11  78482.71   1615.56   95.57
  08:49:11  55571    945365     17.01 113672.62   1923.37   98.07
  08:49:21  69016   1161586     16.83  87055.66   1363.05   95.53
  08:49:31  54552    913461     16.74  98695.87   1761.30   97.84

That's ~500-600 MB/s of continuous writes. I'm sure the storage could 
handle more than this (will do some testing after the tests complete), 
but surely the WAL has to compete for bandwidth (it's on the same volume 
/ devices). Another thing is that we only have 8 WAL insert locks, and 
maybe that leads to contention with such high client counts.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Fri, Oct 21, 2016 at 1:07 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 10/21/2016 08:13 AM, Amit Kapila wrote:
>>
>> On Fri, Oct 21, 2016 at 6:31 AM, Robert Haas <robertmhaas@gmail.com>
>> wrote:
>>>
>>> On Thu, Oct 20, 2016 at 4:04 PM, Tomas Vondra
>>> <tomas.vondra@2ndquadrant.com> wrote:
>>>>>
>>>>> I then started a run at 96 clients which I accidentally killed shortly
>>>>> before it was scheduled to finish, but the results are not much
>>>>> different; there is no hint of the runaway CLogControlLock contention
>>>>> that Dilip sees on power2.
>>>>>
>>>> What shared_buffer size were you using? I assume the data set fit into
>>>> shared buffers, right?
>>>
>>>
>>> 8GB.
>>>
>>>> FWIW as I explained in the lengthy post earlier today, I can actually
>>>> reproduce the significant CLogControlLock contention (and the patches do
>>>> reduce it), even on x86_64.
>>>
>>>
>>> /me goes back, rereads post.  Sorry, I didn't look at this carefully
>>> the first time.
>>>
>>>> For example consider these two tests:
>>>>
>>>> * http://tvondra.bitbucket.org/#dilip-300-unlogged-sync
>>>> * http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip
>>>>
>>>> However, it seems I can also reproduce fairly bad regressions, like for
>>>> example this case with data set exceeding shared_buffers:
>>>>
>>>> * http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip
>>>
>>>
>>> I'm not sure how seriously we should take the regressions.  I mean,
>>> what I see there is that CLogControlLock contention goes down by about
>>> 50% -- which is the point of the patch -- and WALWriteLock contention
>>> goes up dramatically -- which sucks, but can't really be blamed on the
>>> patch except in the indirect sense that a backend can't spend much
>>> time waiting for A if it's already spending all of its time waiting
>>> for B.
>>>
>>
>> Right, I think not only WALWriteLock, but contention on other locks
>> also goes up as you can see in below table.  I think there is nothing
>> much we can do for that with this patch.  One thing which is unclear
>> is why on unlogged tests it is showing WALWriteLock?
>>
>
> Well, although we don't write the table data to the WAL, we still need to
> write commits and other stuff, right?
>

We do need to write the commit record, but do we need to flush it to
WAL immediately for unlogged tables?  It seems we allow the WAL writer
to do that; refer to the logic in RecordTransactionCommit.
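
Roughly, that decision looks like this (a simplified, self-contained
sketch of the logic in RecordTransactionCommit, not the actual code; the
flag names here are illustrative):

#include <stdbool.h>

/*
 * Hedged paraphrase of the flush decision in RecordTransactionCommit().
 * If the transaction wrote WAL, has an XID to mark committed and
 * synchronous_commit is above "off" -- or it must be durable for other
 * reasons (forced sync commit, permanent relation files to unlink) --
 * the backend flushes WAL itself.  Otherwise it only registers the
 * async commit LSN and leaves the actual flush to the WAL writer.
 */
static bool
commit_needs_immediate_flush(bool wrote_xlog, bool mark_xid_committed,
                             bool sync_commit_above_off,
                             bool force_sync_commit, int nrels_deleted)
{
    if ((wrote_xlog && mark_xid_committed && sync_commit_above_off) ||
        force_sync_commit || nrels_deleted > 0)
        return true;    /* backend calls XLogFlush() itself */

    return false;       /* only XLogSetAsyncXactLSN(); WAL writer flushes later */
}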
> And on scale 3000 (which exceeds the
> 16GB shared buffers in this case), there's a continuous stream of dirty
> pages (not to WAL, but evicted from shared buffers), so iostat looks like
> this:
>
>       time    tps  wr_sec/s  avgrq-sz  avgqu-sz     await   %util
>   08:48:21  81654   1367483     16.75 127264.60   1294.80   97.41
>   08:48:31  41514    697516     16.80 103271.11   3015.01   97.64
>   08:48:41  78892   1359779     17.24  97308.42    928.36   96.76
>   08:48:51  58735    978475     16.66  92303.00   1472.82   95.92
>   08:49:01  62441   1068605     17.11  78482.71   1615.56   95.57
>   08:49:11  55571    945365     17.01 113672.62   1923.37   98.07
>   08:49:21  69016   1161586     16.83  87055.66   1363.05   95.53
>   08:49:31  54552    913461     16.74  98695.87   1761.30   97.84
>
> That's ~500-600 MB/s of continuous writes. I'm sure the storage could handle
> more than this (will do some testing after the tests complete), but surely
> the WAL has to compete for bandwidth (it's on the same volume / devices).
> Another thing is that we only have 8 WAL insert locks, and maybe that leads
> to contention with such high client counts.
>

Yeah, quite possible, but I don't think increasing that would benefit
in general, because while writing WAL we need to take all the
wal_insert locks. In any case, I think that is a separate problem to
study.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Dilip Kumar
Date:
On Fri, Oct 21, 2016 at 7:57 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Thu, Oct 20, 2016 at 9:03 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>
>> In the results you've posted on 10/12, you've mentioned a regression with 32
>> clients, where you got 52k tps on master but only 48k tps with the patch (so
>> ~10% difference). I have no idea what scale was used for those tests,
>
> That test was with scale factor 300 on POWER 4 socket machine. I think
> I need to repeat this test with multiple reading to confirm it was
> regression or run to run variation. I will do that soon and post the
> results.

As promised, I have rerun my test (3 times), and I did not see any regression.
The median of the 3 runs is the same on head and with the group lock patch.
However, I am posting the results of all three runs.

I think in my earlier reading we saw ~48K TPS with the patch, but I
think over multiple runs we get this reading on head as well as
with the patch.

Head:
--------
run1:

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: prepared
number of clients: 32
number of threads: 32
duration: 1800 s
number of transactions actually processed: 87784836
latency average = 0.656 ms
tps = 48769.327513 (including connections establishing)
tps = 48769.543276 (excluding connections establishing)

run2:
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: prepared
number of clients: 32
number of threads: 32
duration: 1800 s
number of transactions actually processed: 91240374
latency average = 0.631 ms
tps = 50689.069717 (including connections establishing)
tps = 50689.263505 (excluding connections establishing)

run3:
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: prepared
number of clients: 32
number of threads: 32
duration: 1800 s
number of transactions actually processed: 90966003
latency average = 0.633 ms
tps = 50536.639303 (including connections establishing)
tps = 50536.836924 (excluding connections establishing)

With group lock patch:
------------------------------
run1:
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: prepared
number of clients: 32
number of threads: 32
duration: 1800 s
number of transactions actually processed: 87316264
latency average = 0.660 ms
tps = 48509.008040 (including connections establishing)
tps = 48509.194978 (excluding connections establishing)

run2:
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: prepared
number of clients: 32
number of threads: 32
duration: 1800 s
number of transactions actually processed: 91950412
latency average = 0.626 ms
tps = 51083.507790 (including connections establishing)
tps = 51083.704489 (excluding connections establishing)

run3:
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: prepared
number of clients: 32
number of threads: 32
duration: 1800 s
number of transactions actually processed: 90378462
latency average = 0.637 ms
tps = 50210.225983 (including connections establishing)
tps = 50210.405401 (excluding connections establishing)

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Mon, Oct 24, 2016 at 2:48 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Fri, Oct 21, 2016 at 7:57 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> On Thu, Oct 20, 2016 at 9:03 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>
>>> In the results you've posted on 10/12, you've mentioned a regression with 32
>>> clients, where you got 52k tps on master but only 48k tps with the patch (so
>>> ~10% difference). I have no idea what scale was used for those tests,
>>
>> That test was with scale factor 300 on POWER 4 socket machine. I think
>> I need to repeat this test with multiple reading to confirm it was
>> regression or run to run variation. I will do that soon and post the
>> results.
>
> As promised, I have rerun my test (3 times), and I did not see any regression.
>

Thanks Tomas and Dilip for doing detailed performance tests for this
patch.  I would like to summarise the performance testing results.

1. With update intensive workload, we are seeing gains from 23%~192%
at client count >=64 with group_update patch [1].
2. With tpc-b pgbench workload (at 1000 scale factor), we are seeing
gains from 12% to ~70% at client count >=64 [2].  Tests are done on
8-socket intel   m/c.
3. With pgbench workload (both simple-update and tpc-b at 300 scale
factor), we are seeing gain 10% to > 50% at client count >=64 [3].
Tests are done on 8-socket intel m/c.
4. To see why the patch only helps at higher client count, we have
done wait event testing for various workloads [4], [5] and the results
indicate that at lower clients, the waits are mostly due to
transactionid or clientread.  At client-counts where contention due to
CLOGControlLock is significant, this patch helps a lot to reduce that
contention.  These tests are done on the 8-socket intel m/c and the
4-socket power m/c.
5. With pgbench workload (unlogged tables), we are seeing gains from
15% to > 300% at client count >=72 [6].

There are many more tests done for the proposed patches where the gains
are either on similar lines as above or are neutral.  We do see
regressions in some cases.

1. When data doesn't fit in shared buffers, there is regression at
some client counts [7], but on analysis it has been found that it is
mainly due to the shift in contention from CLOGControlLock to
WALWriteLock and or other locks.
2. We do see in some cases that the granular_locking and no_content_lock
patches have shown a significant increase in contention on
CLOGControlLock.  I have already shared my analysis of the same upthread
[8].

Attached is the latest group update clog patch.

In the last commitfest, the patch was returned with feedback to evaluate
the cases where it can show a win, and I think the above results indicate
that the patch has significant benefit on various workloads.  What I
think is pending at this stage is that either a committer or the
reviewers of this patch need to provide feedback on my analysis
[8] for the cases where the patches are not showing a win.

Thoughts?

[1] - https://www.postgresql.org/message-id/CAFiTN-u-XEzhd%3DhNGW586fmQwdTy6Qy6_SXe09tNB%3DgBcVzZ_A%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAFiTN-tr_%3D25EQUFezKNRk%3D4N-V%2BD6WMxo7HWs9BMaNx7S3y6w%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/CAFiTN-v5hm1EO4cLXYmpppYdNQk%2Bn4N-O1m%2B%2B3U9f0Ga1gBzRQ%40mail.gmail.com
[4] - https://www.postgresql.org/message-id/CAFiTN-taV4iVkPHrxg%3DYCicKjBS6%3DQZm_cM4hbS_2q2ryLhUUw%40mail.gmail.com
[5] - https://www.postgresql.org/message-id/CAFiTN-uQ%2BJbd31cXvRbj48Ba6TqDUDpLKSPnsUCCYRju0Y0U8Q%40mail.gmail.com
[6] - http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip
[7] - http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip
[8] - https://www.postgresql.org/message-id/CAA4eK1J9VxJUnpOiQDf0O%3DZ87QUMbw%3DuGcQr4EaGbHSCibx9yA%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 10/25/2016 06:10 AM, Amit Kapila wrote:
> On Mon, Oct 24, 2016 at 2:48 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> On Fri, Oct 21, 2016 at 7:57 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>> On Thu, Oct 20, 2016 at 9:03 PM, Tomas Vondra
>>> <tomas.vondra@2ndquadrant.com> wrote:
>>>
>>>> In the results you've posted on 10/12, you've mentioned a regression with 32
>>>> clients, where you got 52k tps on master but only 48k tps with the patch (so
>>>> ~10% difference). I have no idea what scale was used for those tests,
>>>
>>> That test was with scale factor 300 on POWER 4 socket machine. I think
>>> I need to repeat this test with multiple reading to confirm it was
>>> regression or run to run variation. I will do that soon and post the
>>> results.
>>
>> As promised, I have rerun my test (3 times), and I did not see any regression.
>>
>
> Thanks Tomas and Dilip for doing detailed performance tests for this
> patch.  I would like to summarise the performance testing results.
>
> 1. With update intensive workload, we are seeing gains from 23%~192%
> at client count >=64 with group_update patch [1].
> 2. With tpc-b pgbench workload (at 1000 scale factor), we are seeing
> gains from 12% to ~70% at client count >=64 [2].  Tests are done on
> 8-socket intel   m/c.
> 3. With pgbench workload (both simple-update and tpc-b at 300 scale
> factor), we are seeing gain 10% to > 50% at client count >=64 [3].
> Tests are done on 8-socket intel m/c.
> 4. To see why the patch only helps at higher client count, we have
> done wait event testing for various workloads [4], [5] and the results
> indicate that at lower clients, the waits are mostly due to
> transactionid or clientread.  At client-counts where contention due to
> CLOGControlLock is significant, this patch helps a lot to reduce that
> contention.  These tests are done on on 8-socket intel m/c and
> 4-socket power m/c
> 5. With pgbench workload (unlogged tables), we are seeing gains from
> 15% to > 300% at client count >=72 [6].
>

It's not entirely clear which of the above tests were done on unlogged 
tables, and I don't see that in the referenced e-mails. That would be an 
interesting thing to mention in the summary, I think.

> There are many more tests done for the proposed patches where gains
> are either or similar lines as above or are neutral.  We do see
> regression in some cases.
>
> 1. When data doesn't fit in shared buffers, there is regression at
> some client counts [7], but on analysis it has been found that it is
> mainly due to the shift in contention from CLOGControlLock to
> WALWriteLock and or other locks.

The question is why shifting the lock contention to WALWriteLock should 
cause such a significant performance drop, particularly when the test was 
done on unlogged tables. Or, if that's the case, how that makes the 
performance drop less problematic / acceptable.

FWIW I plan to run the same test with logged tables - if it shows 
similar regression, I'll be much more worried, because that's a fairly 
typical scenario (logged tables, data set > shared buffers), and we 
surely can't just go and break that.

> 2. We do see in some cases that granular_locking and no_content_lock
> patches has shown significant increase in contention on
> CLOGControlLock.  I have already shared my analysis for same upthread
> [8].

I do agree that in some cases this significantly reduces contention on the 
CLogControlLock. I do however think that currently the performance gains 
are limited almost exclusively to cases on unlogged tables, and some 
logged+async cases.

On logged tables it usually looks like this (i.e. modest increase for 
high client counts at the expense of significantly higher variability):
  http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-skip-64

or like this (i.e. only partial recovery for the drop above 36 clients):
  http://tvondra.bitbucket.org/#pgbench-3000-logged-async-skip-64

And of course, there are cases like this:
  http://tvondra.bitbucket.org/#dilip-300-logged-async

I'd really like to understand why the patched results behave that 
differently depending on client count.
> Attached is the latest group update clog patch.

How is that different from the previous versions?
>
> In last commit fest, the patch was returned with feedback to evaluate
> the cases where it can show win and I think above results indicates
> that the patch has significant benefit on various workloads.  What I
> think is pending at this stage is the either one of the committer or
> the reviewers of this patch needs to provide feedback on my analysis
> [8] for the cases where patches are not showing win.
>
> Thoughts?
>

I do agree the patch(es) significantly reduce CLogControlLock, although 
with WAL logging enabled (which is what matters for most production 
deployments) it pretty much only shifts the contention to a different 
lock (so the immediate performance benefit is 0).

Which raises the question why to commit this patch now, before we have a 
patch addressing the WAL locks. I realize this is a chicken-egg problem, 
but my worry is that the increased WALWriteLock contention will cause 
regressions in current workloads.

BTW I've run some tests with the number of clog buffers increased to 
512, and it seems fairly positive. Compare for example these two 
results:
  http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip
http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip-clog-512

The first one is with the default 128 buffers, the other one is with 512 
buffers. The impact on master is pretty obvious - for 72 clients the tps 
jumps from 160k to 197k, and for higher client counts it gives us about 
+50k tps (typically increase from ~80k to ~130k tps). And the tps 
variability is significantly reduced.

For the other workload, the results are less convincing though:
  http://tvondra.bitbucket.org/#dilip-300-unlogged-sync
http://tvondra.bitbucket.org/#dilip-300-unlogged-sync-clog-512

Interesting that the master adopts the zig-zag pattern, but shifted.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Thu, Oct 27, 2016 at 4:15 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 10/25/2016 06:10 AM, Amit Kapila wrote:
>>
>> On Mon, Oct 24, 2016 at 2:48 PM, Dilip Kumar <dilipbalaut@gmail.com>
>> wrote:
>>>
>>> On Fri, Oct 21, 2016 at 7:57 AM, Dilip Kumar <dilipbalaut@gmail.com>
>>> wrote:
>>>>
>>>> On Thu, Oct 20, 2016 at 9:03 PM, Tomas Vondra
>>>> <tomas.vondra@2ndquadrant.com> wrote:
>>>>
>>>>> In the results you've posted on 10/12, you've mentioned a regression
>>>>> with 32
>>>>> clients, where you got 52k tps on master but only 48k tps with the
>>>>> patch (so
>>>>> ~10% difference). I have no idea what scale was used for those tests,
>>>>
>>>>
>>>> That test was with scale factor 300 on POWER 4 socket machine. I think
>>>> I need to repeat this test with multiple reading to confirm it was
>>>> regression or run to run variation. I will do that soon and post the
>>>> results.
>>>
>>>
>>> As promised, I have rerun my test (3 times), and I did not see any
>>> regression.
>>>
>>
>> Thanks Tomas and Dilip for doing detailed performance tests for this
>> patch.  I would like to summarise the performance testing results.
>>
>> 1. With update intensive workload, we are seeing gains from 23%~192%
>> at client count >=64 with group_update patch [1].
>> 2. With tpc-b pgbench workload (at 1000 scale factor), we are seeing
>> gains from 12% to ~70% at client count >=64 [2].  Tests are done on
>> 8-socket intel   m/c.
>> 3. With pgbench workload (both simple-update and tpc-b at 300 scale
>> factor), we are seeing gain 10% to > 50% at client count >=64 [3].
>> Tests are done on 8-socket intel m/c.
>> 4. To see why the patch only helps at higher client count, we have
>> done wait event testing for various workloads [4], [5] and the results
>> indicate that at lower clients, the waits are mostly due to
>> transactionid or clientread.  At client-counts where contention due to
>> CLOGControlLock is significant, this patch helps a lot to reduce that
>> contention.  These tests are done on on 8-socket intel m/c and
>> 4-socket power m/c
>> 5. With pgbench workload (unlogged tables), we are seeing gains from
>> 15% to > 300% at client count >=72 [6].
>>
>
> It's not entirely clear which of the above tests were done on unlogged
> tables, and I don't see that in the referenced e-mails. That would be an
> interesting thing to mention in the summary, I think.
>

One thing is clear: all results are either with synchronous_commit=off
or on unlogged tables.  I think Dilip can answer better which of those
are on unlogged tables and which are with synchronous_commit=off.

>> There are many more tests done for the proposed patches where gains
>> are either or similar lines as above or are neutral.  We do see
>> regression in some cases.
>>
>> 1. When data doesn't fit in shared buffers, there is regression at
>> some client counts [7], but on analysis it has been found that it is
>> mainly due to the shift in contention from CLOGControlLock to
>> WALWriteLock and or other locks.
>
>
> The questions is why shifting the lock contention to WALWriteLock should
> cause such significant performance drop, particularly when the test was done
> on unlogged tables. Or, if that's the case, how it makes the performance
> drop less problematic / acceptable.
>

Whenever the contention shifts to another lock, there is a chance that
it shows up as a performance dip in some cases, and I have seen that
previously as well.  The theory behind that could be like this: say you
have two locks L1 and L2, and there are 100 processes contending on L1
and 50 on L2.  Now say you reduce contention on L1 such that it leads to
120 processes contending on L2; the increased contention on L2 can slow
down the overall throughput of all processes.

> FWIW I plan to run the same test with logged tables - if it shows similar
> regression, I'll be much more worried, because that's a fairly typical
> scenario (logged tables, data set > shared buffers), and we surely can't
> just go and break that.
>

Sure, please do those tests.

>> 2. We do see in some cases that granular_locking and no_content_lock
>> patches has shown significant increase in contention on
>> CLOGControlLock.  I have already shared my analysis for same upthread
>> [8].
>
>
> I do agree that some cases this significantly reduces contention on the
> CLogControlLock. I do however think that currently the performance gains are
> limited almost exclusively to cases on unlogged tables, and some
> logged+async cases.
>

Right, because the contention is mainly visible for those workloads.

> On logged tables it usually looks like this (i.e. modest increase for high
> client counts at the expense of significantly higher variability):
>
>   http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-skip-64
>

What variability are you referring to in those results?

> or like this (i.e. only partial recovery for the drop above 36 clients):
>
>   http://tvondra.bitbucket.org/#pgbench-3000-logged-async-skip-64
>
> And of course, there are cases like this:
>
>   http://tvondra.bitbucket.org/#dilip-300-logged-async
>
> I'd really like to understand why the patched results behave that
> differently depending on client count.
>

I have already explained this upthread [1].  Refer text after line "I
have checked the wait event results where there is more fluctuation:"

>>
>> Attached is the latest group update clog patch.
>>
>
> How is that different from the previous versions?
>

The previous patch was showing some hunks when you tried to apply it.  I
thought it might be better to rebase so that it can be applied
cleanly; otherwise there is no change in the code.

>>
>>
>> In last commit fest, the patch was returned with feedback to evaluate
>> the cases where it can show win and I think above results indicates
>> that the patch has significant benefit on various workloads.  What I
>> think is pending at this stage is the either one of the committer or
>> the reviewers of this patch needs to provide feedback on my analysis
>> [8] for the cases where patches are not showing win.
>>
>> Thoughts?
>>
>
> I do agree the patch(es) significantly reduce CLogControlLock, although with
> WAL logging enabled (which is what matters for most production deployments)
> it pretty much only shifts the contention to a different lock (so the
> immediate performance benefit is 0).
>

Yeah, but I think there are use cases where users can use
synchronous_commit=off.

> Which raises the question why to commit this patch now, before we have a
> patch addressing the WAL locks. I realize this is a chicken-egg problem, but
> my worry is that the increased WALWriteLock contention will cause
> regressions in current workloads.
>

I think if we use that theory, we won't be able to make progress in
terms of reducing lock contention.  I think we have previously
committed the code in such situations.  For example while reducing
contention in buffer management area
(d72731a70450b5e7084991b9caa15cb58a2820df), I have noticed such a
behaviour and reported my analysis [2] as well (In the mail [2], you
can see there is performance improvement at 1000 scale factor and dip
at 5000 scale factor).  Later on, when the contention on dynahash
spinlocks got alleviated (44ca4022f3f9297bab5cbffdd97973dbba1879ed),
the results were much better.  If we had not reduced the contention in
buffer management, then the benefits from the dynahash improvements
wouldn't have been as much in those workloads (if you want, I can find
out and share the results of the dynahash improvements).

> BTW I've ran some tests with the number of clog buffers increases to 512,
> and it seems like a fairly positive. Compare for example these two results:
>
>   http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip
>   http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip-clog-512
>
> The first one is with the default 128 buffers, the other one is with 512
> buffers. The impact on master is pretty obvious - for 72 clients the tps
> jumps from 160k to 197k, and for higher client counts it gives us about +50k
> tps (typically increase from ~80k to ~130k tps). And the tps variability is
> significantly reduced.
>

Interesting, because the last time I did such testing by increasing
clog buffers, it didn't show any improvement; rather, if I remember
correctly, it was showing some regression.  I am not sure what the best
way to handle this is; maybe we can make the number of clog buffers a
GUC variable.
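
For context, the number of clog buffers is currently just a capped
function of shared_buffers, roughly as in this sketch (a paraphrase of
CLOGShmemBuffers(), not the actual code); a GUC would essentially make
the cap configurable:

/*
 * Hedged sketch of how the CLOG buffer count is derived from
 * shared_buffers today (paraphrasing CLOGShmemBuffers()).
 */
static int
clog_buffer_count(int nbuffers)     /* NBuffers, i.e. shared_buffers in pages */
{
    int n = nbuffers / 512;         /* scale with shared_buffers */

    if (n < 4)
        n = 4;                      /* lower bound */
    if (n > 128)
        n = 128;                    /* current cap; the "clog-512" runs above raise this */
    return n;
}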


[1] - https://www.postgresql.org/message-id/CAA4eK1J9VxJUnpOiQDf0O%3DZ87QUMbw%3DuGcQr4EaGbHSCibx9yA%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1JUPn1rV0ep5DR74skcv%2BRRK7i2inM1X01ajG%2BgCX-hMw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Dilip Kumar
Date:
On Thu, Oct 27, 2016 at 5:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Thanks Tomas and Dilip for doing detailed performance tests for this
>>> patch.  I would like to summarise the performance testing results.
>>>
>>> 1. With update intensive workload, we are seeing gains from 23%~192%
>>> at client count >=64 with group_update patch [1].

this is with unlogged table

>>> 2. With tpc-b pgbench workload (at 1000 scale factor), we are seeing
>>> gains from 12% to ~70% at client count >=64 [2].  Tests are done on
>>> 8-socket intel   m/c.

this is with synchronous_commit=off

>>> 3. With pgbench workload (both simple-update and tpc-b at 300 scale
>>> factor), we are seeing gain 10% to > 50% at client count >=64 [3].
>>> Tests are done on 8-socket intel m/c.

this is with synchronous_commit=off

>>> 4. To see why the patch only helps at higher client count, we have
>>> done wait event testing for various workloads [4], [5] and the results
>>> indicate that at lower clients, the waits are mostly due to
>>> transactionid or clientread.  At client-counts where contention due to
>>> CLOGControlLock is significant, this patch helps a lot to reduce that
>>> contention.  These tests are done on on 8-socket intel m/c and
>>> 4-socket power m/c

both of these are with synchronous_commit=off + unlogged tables

>>> 5. With pgbench workload (unlogged tables), we are seeing gains from
>>> 15% to > 300% at client count >=72 [6].
>>>
>>
>> It's not entirely clear which of the above tests were done on unlogged
>> tables, and I don't see that in the referenced e-mails. That would be an
>> interesting thing to mention in the summary, I think.
>>
>
> One thing is clear that all results are on either
> synchronous_commit=off or on unlogged tables.  I think Dilip can
> answer better which of those are on unlogged and which on
> synchronous_commit=off.

I have mentioned this above under each of your test points.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
Hi,

On 10/27/2016 01:44 PM, Amit Kapila wrote:
> On Thu, Oct 27, 2016 at 4:15 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>>
>> FWIW I plan to run the same test with logged tables - if it shows similar
>> regression, I'll be much more worried, because that's a fairly typical
>> scenario (logged tables, data set > shared buffers), and we surely can't
>> just go and break that.
>>
>
> Sure, please do those tests.
>

OK, so I do have results for those tests - that is, scale 3000 with 
shared_buffers=16GB (so continuously writing out dirty buffers). The 
following reports show the results slightly differently - all three "tps 
charts" next to each other, then the speedup charts and tables.

Overall, the results are surprisingly positive - look at these results 
(all ending with "-retest"):

[1] http://tvondra.bitbucket.org/index2.html#dilip-3000-logged-sync-retest

[2] 
http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-noskip-retest

[3] 
http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-skip-retest

All three show significant improvement, even with fairly low client 
counts. For example with 72 clients, the tps improves by 20%, without 
significantly affecting variability of the results (measured as stddev, 
more on this later).

It's however interesting that "no_content_lock" is almost exactly the 
same as master, while the other two cases improve significantly.

The other interesting thing is that "pgbench -N" [3] shows no such 
improvement, unlike regular pgbench and Dilip's workload. Not sure why, 
though - I'd expect to see significant improvement in this case.

I have also repeated those tests with clog buffers increased to 512 (so 
4x the current maximum of 128). I only have results for Dilip's workload 
and "pgbench -N":

[4] 
http://tvondra.bitbucket.org/index2.html#dilip-3000-logged-sync-retest-512

[5] 
http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-skip-retest-512

The results are somewhat surprising, I guess, because the effect is 
wildly different for each workload.

For Dilip's workload, increasing clog buffers to 512 pretty much 
eliminates all benefits of the patches. For example with 288 clients, the 
group_update patch gives ~60k tps on 128 buffers [1] but only 42k tps on 
512 buffers [4].

With "pgbench -N", the effect is exactly the opposite - while with 128 
buffers there was pretty much no benefit from any of the patches [3], 
with 512 buffers we suddenly get almost 2x the throughput, but only for 
group_update and master (while the other two patches show no improvement 
at all).

I don't have results for the regular pgbench ("noskip") with 512 buffers 
yet, but I'm curious what that will show.
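
For completeness: there's no GUC for the number of clog buffers - the 
count is capped in CLOGShmemBuffers() - so the 512-buffer builds need a 
small source tweak and a rebuild. A minimal sketch, assuming the current 
cap of 128 in src/backend/access/transam/clog.c:

    # raise the CLOGShmemBuffers() cap from 128 to 512 and rebuild
    sed -i 's/Min(128, Max(4, NBuffers/Min(512, Max(4, NBuffers/' \
        src/backend/access/transam/clog.c
    make && make install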

In general I however think that the patches don't show any regression in 
any of those workloads (at least not with 128 buffers). Based solely on 
the results, I like the group_update more, because it performs as well 
as master or significantly better.

>>> 2. We do see in some cases that granular_locking and
>>> no_content_lock patches has shown significant increase in
>>> contention on CLOGControlLock. I have already shared my analysis
>>> for same upthread [8].
>>

I've read that analysis, but I'm not sure I see how it explains the "zig 
zag" behavior. I do understand that shifting the contention to some 
other (already busy) lock may negatively impact throughput, or that the 
group_update may result in updating multiple clog pages, but I don't 
understand two things:

(1) Why this should result in the fluctuations we observe in some of the 
cases. For example, why should we see 150k tps on 72 clients, then drop 
to 92k with 108 clients, then back to 130k on 144 clients, then 84k on 
180 clients etc. That seems fairly strange.

(2) Why this should affect all three patches, when only group_update has 
to modify multiple clog pages.

For example consider this:
    http://tvondra.bitbucket.org/index2.html#dilip-300-logged-async

For example looking at % of time spent on different locks with the 
group_update patch, I see this (ignoring locks with ~1%):
 event_type     wait_event       36   72  108  144  180  216  252  288
----------------------------------------------------------------------
                -                60   63   45   53   38   50   33   48
 Client         ClientRead       33   23    9   14    6   10    4    8
 LWLockNamed    CLogControlLock   2    7   33   14   34   14   33   14
 LWLockTranche  buffer_content    0    2    9   13   19   18   26   22

I don't see any sign of contention shifting to other locks, just 
CLogControlLock fluctuating between 14% and 33% for some reason.

Now, maybe this has nothing to do with PostgreSQL itself, but maybe it's 
some sort of CPU / OS scheduling artifact. For example, the system has 
36 physical cores, 72 virtual ones (thanks to HT). I find it strange 
that the "good" client counts are always multiples of 72, while the 
"bad" ones fall in between.
  72 = 72 * 1   (good)
 108 = 72 * 1.5 (bad)
 144 = 72 * 2   (good)
 180 = 72 * 2.5 (bad)
 216 = 72 * 3   (good)
 252 = 72 * 3.5 (bad)
 288 = 72 * 4   (good)

So maybe this has something to do with how OS schedules the tasks, or 
maybe some internal heuristics in the CPU, or something like that.


>> On logged tables it usually looks like this (i.e. modest increase for high
>> client counts at the expense of significantly higher variability):
>>
>>   http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-skip-64
>>
>
> What variability are you referring to in those results?

Good question. What I mean by "variability" is how stable the tps is 
during the benchmark (when measured on per-second granularity). For 
example, let's run a 10-second benchmark, measuring number of 
transactions committed each second.

Then all those runs do 1000 tps on average:
  run 1: 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000
  run 2: 500, 1500, 500, 1500, 500, 1500, 500, 1500, 500, 1500
  run 3: 0, 2000, 0, 2000, 0, 2000, 0, 2000, 0, 2000

I guess we agree those runs behave very differently, despite having the 
same average throughput. This is what STDDEV(tps) measures, and that is 
what the third chart on the reports shows.
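
For the record, a rough sketch of how the per-second tps (and its 
stddev) can be pulled straight out of pgbench, assuming the usual 
"progress: 1.0 s, 12345.6 tps, ..." format of the -P output:

    # progress lines go to stderr; $4 is the per-second tps value
    pgbench -M prepared -c 72 -j 72 -T 300 -P 1 bench 2>&1 \
      | awk '/^progress:/ { n++; s += $4; ss += $4 * $4 }
             END { m = s / n; printf "avg=%.1f stddev=%.1f\n", m, sqrt(ss / n - m * m) }'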

So for example this [6] shows that the patches give us higher throughput 
with >= 180 clients, but we also pay for that with increased variability 
of the results (i.e. the tps chart will have jitter):

[6] 
http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-skip-64

Of course, balancing throughput, latency and variability is one of the 
crucial trade-offs in transaction systems - at some point the resources 
get saturated and higher throughput can only be achieved in exchange for 
latency (e.g. by grouping requests). But still, we'd like to get stable 
tps from the system, not something that gives us 2000 tps one second and 
0 tps the next one.

Of course, this is not perfect - it does not show whether there are 
transactions with significantly higher latency, and so on. It'd be good 
to also measure latency, but I haven't collected that info during the 
runs so far.
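
If needed, latency could be collected in future runs from the 
per-transaction log - a rough sketch, assuming the log format where the 
per-transaction time in microseconds is the third field:

    # -l writes one line per transaction into pgbench_log.<pid>[.<thread>]
    pgbench -M prepared -c 72 -j 72 -T 300 -l bench
    # sort by the latency column and pick the 99th percentile
    sort -n -k3,3 pgbench_log.* \
      | awk '{ lat[NR] = $3 } END { print "p99 latency (us):", lat[int(NR * 0.99)] }'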

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Jim Nasby
Date:
On 10/30/16 1:32 PM, Tomas Vondra wrote:
>
> Now, maybe this has nothing to do with PostgreSQL itself, but maybe it's
> some sort of CPU / OS scheduling artifact. For example, the system has
> 36 physical cores, 72 virtual ones (thanks to HT). I find it strange
> that the "good" client counts are always multiples of 72, while the
> "bad" ones fall in between.
>
>   72 = 72 * 1   (good)
>  108 = 72 * 1.5 (bad)
>  144 = 72 * 2   (good)
>  180 = 72 * 2.5 (bad)
>  216 = 72 * 3   (good)
>  252 = 72 * 3.5 (bad)
>  288 = 72 * 4   (good)
>
> So maybe this has something to do with how OS schedules the tasks, or
> maybe some internal heuristics in the CPU, or something like that.

It might be enlightening to run a series of tests that are 72*.1 or *.2 
apart (say, 72, 79, 86, ..., 137, 144).
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 10/31/2016 05:01 AM, Jim Nasby wrote:
> On 10/30/16 1:32 PM, Tomas Vondra wrote:
>>
>> Now, maybe this has nothing to do with PostgreSQL itself, but maybe it's
>> some sort of CPU / OS scheduling artifact. For example, the system has
>> 36 physical cores, 72 virtual ones (thanks to HT). I find it strange
>> that the "good" client counts are always multiples of 72, while the
>> "bad" ones fall in between.
>>
>>   72 = 72 * 1   (good)
>>  108 = 72 * 1.5 (bad)
>>  144 = 72 * 2   (good)
>>  180 = 72 * 2.5 (bad)
>>  216 = 72 * 3   (good)
>>  252 = 72 * 3.5 (bad)
>>  288 = 72 * 4   (good)
>>
>> So maybe this has something to do with how OS schedules the tasks, or
>> maybe some internal heuristics in the CPU, or something like that.
>
> It might be enlightening to run a series of tests that are 72*.1 or *.2
> apart (say, 72, 79, 86, ..., 137, 144).

Yeah, I've started a benchmark with a step of 6 clients
    36 42 48 54 60 66 72 78 ... 252 258 264 270 276 282 288

instead of just
    36 72 108 144 180 216 252 288

which did a test every 36 clients. To compensate for the 6x longer runs, 
I'm only running tests for "group-update" and "master", so I should have 
the results in ~36h.
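
Not the actual driver script, just a sketch of the stepped runs (the awk 
pattern assumes the usual "tps = ... (excluding connections 
establishing)" summary line):

    for c in $(seq 36 6 288); do
        tps=$(pgbench -M prepared -c "$c" -j "$c" -T 300 bench \
              | awk '/excluding connections establishing/ { print $3 }')
        echo "$c $tps"
    done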

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 10/30/2016 07:32 PM, Tomas Vondra wrote:
> Hi,
>
> On 10/27/2016 01:44 PM, Amit Kapila wrote:
>> On Thu, Oct 27, 2016 at 4:15 AM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>>
>>> FWIW I plan to run the same test with logged tables - if it shows
>>> similar
>>> regression, I'll be much more worried, because that's a fairly typical
>>> scenario (logged tables, data set > shared buffers), and we surely can't
>>> just go and break that.
>>>
>>
>> Sure, please do those tests.
>>
>
> OK, so I do have results for those tests - that is, scale 3000 with
> shared_buffers=16GB (so continuously writing out dirty buffers). The
> following reports show the results slightly differently - all three "tps
> charts" next to each other, then the speedup charts and tables.
>
> Overall, the results are surprisingly positive - look at these results
> (all ending with "-retest"):
>
> [1] http://tvondra.bitbucket.org/index2.html#dilip-3000-logged-sync-retest
>
> [2]
> http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-noskip-retest
>
>
> [3]
> http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-skip-retest
>
>
> All three show significant improvement, even with fairly low client
> counts. For example with 72 clients, the tps improves 20%, without
> significantly affecting variability variability of the results( measured
> as stdddev, more on this later).
>
> It's however interesting that "no_content_lock" is almost exactly the
> same as master, while the other two cases improve significantly.
>
> The other interesting thing is that "pgbench -N" [3] shows no such
> improvement, unlike regular pgbench and Dilip's workload. Not sure why,
> though - I'd expect to see significant improvement in this case.
>
> I have also repeated those tests with clog buffers increased to 512 (so
> 4x the current maximum of 128). I only have results for Dilip's workload
> and "pgbench -N":
>
> [4]
> http://tvondra.bitbucket.org/index2.html#dilip-3000-logged-sync-retest-512
>
> [5]
> http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-skip-retest-512
>
>
> The results are somewhat surprising, I guess, because the effect is
> wildly different for each workload.
>
> For Dilip's workload increasing clog buffers to 512 pretty much
> eliminates all benefits of the patches. For example with 288 client,
> the group_update patch gives ~60k tps on 128 buffers [1] but only 42k
> tps on 512 buffers [4].
>
> With "pgbench -N", the effect is exactly the opposite - while with
> 128 buffers there was pretty much no benefit from any of the patches
> [3], with 512 buffers we suddenly get almost 2x the throughput, but
> only for group_update and master (while the other two patches show no
> improvement at all).
>

The remaining benchmark with 512 clog buffers completed, and the impact 
roughly matches Dilip's benchmark - that is, increasing the number of 
clog buffers eliminates all positive impact of the patches observed on 
128 buffers. Compare these two reports:

[a] http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-noskip-retest

[b] http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-noskip-retest-512

With 128 buffers the group_update and granular_locking patches achieve 
up to 50k tps, while master and no_content_lock do ~30k tps. After 
increasing the number of clog buffers, we get only ~30k in all cases.

I'm not sure what's causing this, whether we're hitting limits of the 
simple LRU cache used for clog buffers, or something else. But maybe 
there's something in the design of clog buffers that makes them work less 
efficiently with more clog buffers? I'm not sure whether that's 
something we need to fix before eventually committing any of them.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Mon, Oct 31, 2016 at 12:02 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Hi,
>
> On 10/27/2016 01:44 PM, Amit Kapila wrote:
>
> I've read that analysis, but I'm not sure I see how it explains the "zig
> zag" behavior. I do understand that shifting the contention to some other
> (already busy) lock may negatively impact throughput, or that the
> group_update may result in updating multiple clog pages, but I don't
> understand two things:
>
> (1) Why this should result in the fluctuations we observe in some of the
> cases. For example, why should we see 150k tps on, 72 clients, then drop to
> 92k with 108 clients, then back to 130k on 144 clients, then 84k on 180
> clients etc. That seems fairly strange.
>

I don't think hitting multiple clog pages has much to do with
client-count.  However, we can wait to see your further detailed test
report.

> (2) Why this should affect all three patches, when only group_update has to
> modify multiple clog pages.
>

No, all three patches can be affected due to multiple clog pages.
Read the second paragraph ("I think one of the probable reasons that could
happen for both the approaches") in the same e-mail [1].  It is basically
due to frequent release-and-reacquire of locks.

>
>
>>> On logged tables it usually looks like this (i.e. modest increase for
>>> high
>>> client counts at the expense of significantly higher variability):
>>>
>>>   http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-skip-64
>>>
>>
>> What variability are you referring to in those results?
>
>>
>
> Good question. What I mean by "variability" is how stable the tps is during
> the benchmark (when measured on per-second granularity). For example, let's
> run a 10-second benchmark, measuring number of transactions committed each
> second.
>
> Then all those runs do 1000 tps on average:
>
>   run 1: 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000
>   run 2: 500, 1500, 500, 1500, 500, 1500, 500, 1500, 500, 1500
>   run 3: 0, 2000, 0, 2000, 0, 2000, 0, 2000, 0, 2000
>

Generally, such behaviours are seen due to writes.  Are WAL and DATA
on the same disk in your tests?


[1] - https://www.postgresql.org/message-id/CAA4eK1J9VxJUnpOiQDf0O%3DZ87QUMbw%3DuGcQr4EaGbHSCibx9yA%40mail.gmail.com


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Mon, Oct 31, 2016 at 7:02 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> The remaining benchmark with 512 clog buffers completed, and the impact
> roughly matches Dilip's benchmark - that is, increasing the number of clog
> buffers eliminates all positive impact of the patches observed on 128
> buffers. Compare these two reports:
>
> [a] http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-noskip-retest
>
> [b] http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-noskip-retest-512
>
> With 128 buffers the group_update and granular_locking patches achieve up to
> 50k tps, while master and no_content_lock do ~30k tps. After increasing
> number of clog buffers, we get only ~30k in all cases.
>
> I'm not sure what's causing this, whether we're hitting limits of the simple
> LRU cache used for clog buffers, or something else.
>

I have also seen previously that increasing clog buffers to 256 can
impact performance negatively.  So, probably here the gains due to the
group_update patch are negated by the impact of increasing clog
buffers.  I am not sure if it is a good idea to evaluate the impact of
increasing clog buffers along with this patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 10/31/2016 02:51 PM, Amit Kapila wrote:
> On Mon, Oct 31, 2016 at 12:02 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Hi,
>>
>> On 10/27/2016 01:44 PM, Amit Kapila wrote:
>>
>> I've read that analysis, but I'm not sure I see how it explains the "zig
>> zag" behavior. I do understand that shifting the contention to some other
>> (already busy) lock may negatively impact throughput, or that the
>> group_update may result in updating multiple clog pages, but I don't
>> understand two things:
>>
>> (1) Why this should result in the fluctuations we observe in some of the
>> cases. For example, why should we see 150k tps on, 72 clients, then drop to
>> 92k with 108 clients, then back to 130k on 144 clients, then 84k on 180
>> clients etc. That seems fairly strange.
>>
>
> I don't think hitting multiple clog pages has much to do with
> client-count.  However, we can wait to see your further detailed test
> report.
>
>> (2) Why this should affect all three patches, when only group_update has to
>> modify multiple clog pages.
>>
>
> No, all three patches can be affected due to multiple clog pages.
> Read second paragraph ("I think one of the probable reasons that could
> happen for both the approaches") in same e-mail [1].  It is basically
> due to frequent release-and-reacquire of locks.
>
>>
>>
>>>> On logged tables it usually looks like this (i.e. modest increase for
>>>> high
>>>> client counts at the expense of significantly higher variability):
>>>>
>>>>   http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-skip-64
>>>>
>>>
>>> What variability are you referring to in those results?
>>
>>>
>>
>> Good question. What I mean by "variability" is how stable the tps is during
>> the benchmark (when measured on per-second granularity). For example, let's
>> run a 10-second benchmark, measuring number of transactions committed each
>> second.
>>
>> Then all those runs do 1000 tps on average:
>>
>>   run 1: 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000
>>   run 2: 500, 1500, 500, 1500, 500, 1500, 500, 1500, 500, 1500
>>   run 3: 0, 2000, 0, 2000, 0, 2000, 0, 2000, 0, 2000
>>
>
> Generally, such behaviours are seen due to writes. Are WAL and DATA
> on same disk in your tests?
>

Yes, there's one RAID device on 10 SSDs, with 4GB of cache on the controller. 
I've done some tests and it easily handles > 1.5GB/s in sequential 
writes, and >500MB/s in sustained random writes.

Also, let me point out that most of the tests were done so that the 
whole data set fits into shared_buffers, and with no checkpoints during 
the runs (so no writes to data files should really happen).

For example these tests were done on scale 3000 (45GB data set) with 
64GB shared buffers:

[a] 
http://tvondra.bitbucket.org/index2.html#pgbench-3000-unlogged-sync-noskip-64

[b] 
http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-async-noskip-64

and I could show similar cases with scale 300 on 16GB shared buffers.

In those cases, there's very little contention between WAL and the rest 
of the database (in terms of I/O).

And moreover, this setup (single device for the whole cluster) is very 
common, we can't just neglect it.

But my main point here really is that the trade-off in those cases may 
not be really all that great, because you get the best performance at 
36/72 clients, and then the tps drops and variability increases. At 
least not right now, before tackling contention on the WAL lock (or 
whatever lock becomes the bottleneck).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Mon, Oct 31, 2016 at 7:58 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 10/31/2016 02:51 PM, Amit Kapila wrote:
> And moreover, this setup (single device for the whole cluster) is very
> common, we can't just neglect it.
>
> But my main point here really is that the trade-off in those cases may not
> be really all that great, because you get the best performance at 36/72
> clients, and then the tps drops and variability increases. At least not
> right now, before tackling contention on the WAL lock (or whatever lock
> becomes the bottleneck).
>

Okay, but do the wait event results show an increase in contention on some
other locks for pgbench-3000-logged-sync-skip-64?  Can you share wait
events for the runs where there is a fluctuation?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 10/31/2016 08:43 PM, Amit Kapila wrote:
> On Mon, Oct 31, 2016 at 7:58 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> On 10/31/2016 02:51 PM, Amit Kapila wrote:
>> And moreover, this setup (single device for the whole cluster) is very
>> common, we can't just neglect it.
>>
>> But my main point here really is that the trade-off in those cases may not
>> be really all that great, because you get the best performance at 36/72
>> clients, and then the tps drops and variability increases. At least not
>> right now, before tackling contention on the WAL lock (or whatever lock
>> becomes the bottleneck).
>>
>
> Okay, but does wait event results show increase in contention on some
> other locks for pgbench-3000-logged-sync-skip-64?  Can you share wait
> events for the runs where there is a fluctuation?
>

Sure, I do have wait event stats, including a summary for different 
client counts - see this:

http://tvondra.bitbucket.org/by-test/pgbench-3000-logged-sync-skip-64.txt

Looking only at group_update patch for three interesting client counts, 
it looks like this:
   wait_event_type |    wait_event     |    108     144      180
  -----------------+-------------------+-------------------------
   LWLockNamed     | WALWriteLock      | 661284  847057  1006061
                   |                   | 126654  191506   265386
   Client          | ClientRead        |  37273   52791    64799
   LWLockTranche   | wal_insert        |  28394   51893    79932
   LWLockNamed     | CLogControlLock   |   7766   14913    23138
   LWLockNamed     | WALBufMappingLock |   3615    3739     3803
   LWLockNamed     | ProcArrayLock     |    913    1776     2685
   Lock            | extend            |    909    2082     2228
   LWLockNamed     | XidGenLock        |    301     349      675
   LWLockTranche   | clog              |    173     331      607
   LWLockTranche   | buffer_content    |    163     468      737
   LWLockTranche   | lock_manager      |     88     140      145

Compared to master, this shows a significant reduction of contention on 
CLogControlLock (which on master has 20k, 83k and 200k samples), and 
moves the contention to WALWriteLock.
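
(For reference, the sampling behind these numbers boils down to polling 
pg_stat_activity once per second and counting samples per wait event - a 
simplified sketch, assuming a 9.6+ server where wait_event_type and 
wait_event are available:)

    for i in $(seq 1 300); do
        psql -At -c "SELECT wait_event_type, wait_event
                     FROM pg_stat_activity
                     WHERE wait_event IS NOT NULL" bench
        sleep 1
    done | sort | uniq -c | sort -rn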

But perhaps you're asking about variability during the benchmark? I 
suppose that could be extracted from the collected data, but I haven't 
done that.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 10/31/2016 02:24 PM, Tomas Vondra wrote:
> On 10/31/2016 05:01 AM, Jim Nasby wrote:
>> On 10/30/16 1:32 PM, Tomas Vondra wrote:
>>>
>>> Now, maybe this has nothing to do with PostgreSQL itself, but maybe it's
>>> some sort of CPU / OS scheduling artifact. For example, the system has
>>> 36 physical cores, 72 virtual ones (thanks to HT). I find it strange
>>> that the "good" client counts are always multiples of 72, while the
>>> "bad" ones fall in between.
>>>
>>>   72 = 72 * 1   (good)
>>>  108 = 72 * 1.5 (bad)
>>>  144 = 72 * 2   (good)
>>>  180 = 72 * 2.5 (bad)
>>>  216 = 72 * 3   (good)
>>>  252 = 72 * 3.5 (bad)
>>>  288 = 72 * 4   (good)
>>>
>>> So maybe this has something to do with how OS schedules the tasks, or
>>> maybe some internal heuristics in the CPU, or something like that.
>>
>> It might be enlightening to run a series of tests that are 72*.1 or *.2
>> apart (say, 72, 79, 86, ..., 137, 144).
>
> Yeah, I've started a benchmark with client a step of 6 clients
>
>     36 42 48 54 60 66 72 78 ... 252 258 264 270 276 282 288
>
> instead of just
>
>     36 72 108 144 180 216 252 288
>
> which did a test every 36 clients. To compensate for the 6x longer runs,
> I'm only running tests for "group-update" and "master", so I should have
> the results in ~36h.
>

So I've been curious and looked at results of the runs executed so far, 
and for the group_update patch it looks like this:
  clients  tps
 -----------------
       36  117663
       42  139791
       48  129331
       54  144970
       60  124174
       66  137227
       72  146064
       78  100267
       84  141538
       90   96607
       96  139290
      102   93976
      108  136421
      114   91848
      120  133563
      126   89801
      132  132607
      138   87912
      144  129688
      150   87221
      156  129608
      162   85403
      168  130193
      174   83863
      180  129337
      186   81968
      192  128571
      198   82053
      204  128020
      210   80768
      216  124153
      222   80493
      228  125503
      234   78950
      240  125670
      246   78418
      252  123532
      258   77623
      264  124366
      270   76726
      276  119054
      282   76960
      288  121819

So, similar saw-like behavior, perfectly periodic. But the really 
strange thing is the peaks/valleys don't match those observed before!

That is, during the previous runs, 72, 144, 216 and 288 were "good" 
while 108, 180 and 252 were "bad". But in those runs, all those client 
counts are "good" ...

Honestly, I have no idea what to think about this ...

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Mon, Oct 31, 2016 at 5:48 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Honestly, I have no idea what to think about this ...

I think a lot of the details here depend on OS scheduler behavior.
For example, here's one of the first scalability graphs I ever did:

http://rhaas.blogspot.com/2011/09/scalability-in-graphical-form-analyzed.html

It's a nice advertisement for fast-path locking, but look at the funny
shape of the red and green lines between 1 and 32 cores.  The curve is
oddly bowl-shaped.  As the post discusses, we actually dip WAY under
linear scalability in the 8-20 core range and then shoot up like a
rocket afterwards so that at 32 cores we actually achieve super-linear
scalability. You can't blame this on anything except Linux.  Someone
shared BSD graphs (I forget which flavor) with me privately and they
don't exhibit this poor behavior.  (They had different poor behaviors
instead - performance collapsed at high client counts.  That was a
long time ago so it's probably fixed now.)

This is why I think it's fundamentally wrong to look at this patch and
say "well, contention goes down, and in some cases that makes
performance go up, but because in other cases it decreases performance
or increases variability we shouldn't commit it".  If we took that
approach, we wouldn't have fast-path locking today, because the early
versions of fast-path locking could exhibit *major* regressions
precisely because of contention shifting to other locks, specifically
SInvalReadLock and msgNumLock.  (cf. commit
b4fbe392f8ff6ff1a66b488eb7197eef9e1770a4).  If we say that because the
contention on those other locks can get worse as a result of
contention on this lock being reduced, or even worse, if we try to
take responsibility for what effect reducing lock contention might
have on the operating system scheduler discipline (which will
certainly differ from system to system and version to version), we're
never going to get anywhere, because there's almost always going to be
some way that reducing contention in one place can bite you someplace
else.

I also believe it's pretty normal for patches that remove lock
contention to increase variability.  If you run an auto race where
every car has a speed governor installed that limits it to 80 kph,
there will be much less variability in the finish times than if you
remove the governor, but that's a stupid way to run a race.  You won't
get much innovation around increasing the top speed of the cars under
those circumstances, either.  Nobody ever bothered optimizing the
contention around msgNumLock before fast-path locking happened,
because the heavyweight lock manager burdened the system so heavily
that you couldn't generate enough contention on it to matter.
Similarly, we're not going to get much traction around optimizing the
other locks to which contention would shift if we applied this patch
unless we apply it.  This is not theoretical: EnterpriseDB staff have
already done work on trying to optimize WALWriteLock, but it's hard to
get a benefit.  The more other contention we eliminate, the
easier it will be to see whether a proposed change to WALWriteLock
helps.  Of course, we'll also be more at the mercy of operating system
scheduler discipline, but that's not all a bad thing either.  The
Linux kernel guys have been known to run PostgreSQL to see whether
proposed changes help or hurt, but they're not going to try those
tests after applying patches that we rejected because they expose us
to existing Linux shortcomings.

I don't want to be perceived as advocating too forcefully for a patch
that was, after all, written by a colleague.  However, I sincerely
believe it's a mistake to say that a patch which reduces lock
contention must show a tangible win or at least no loss on every piece
of hardware, on every kernel, at every client count with no increase
in variability in any configuration.  Very few (if any) patches are
going to be able to meet that bar, and if we make that the bar, people
aren't going to write patches to reduce lock contention in PostgreSQL.
For that to be worth doing, you have to be able to get the patch
committed in finite time.  We've spent an entire release cycle
dithering over this patch.  Several alternative patches have been
written that are not any better (and the people who wrote those
patches don't seem especially interested in doing further work on them
anyway).  There is increasing evidence that the patch is effective at
solving the problem it claims to solve, and that any downsides are
just the result of poor lock-scaling behavior elsewhere which we could
be working on fixing if we weren't still spending time on this.  Is
that really not good enough?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 11/01/2016 08:13 PM, Robert Haas wrote:
> On Mon, Oct 31, 2016 at 5:48 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Honestly, I have no idea what to think about this ...
>
> I think a lot of the details here depend on OS scheduler behavior.
> For example, here's one of the first scalability graphs I ever did:
>
> http://rhaas.blogspot.com/2011/09/scalability-in-graphical-form-analyzed.html
>
> It's a nice advertisement for fast-path locking, but look at the funny
> shape of the red and green lines between 1 and 32 cores.  The curve is
> oddly bowl-shaped.  As the post discusses, we actually dip WAY under
> linear scalability in the 8-20 core range and then shoot up like a
> rocket afterwards so that at 32 cores we actually achieve super-linear
> scalability. You can't blame this on anything except Linux.  Someone
> shared BSD graphs (I forget which flavor) with me privately and they
> don't exhibit this poor behavior.  (They had different poor behaviors
> instead - performance collapsed at high client counts.  That was a
> long time ago so it's probably fixed now.)
>
> This is why I think it's fundamentally wrong to look at this patch and
> say "well, contention goes down, and in some cases that makes
> performance go up, but because in other cases it decreases performance
> or increases variability we shouldn't commit it".  If we took that
> approach, we wouldn't have fast-path locking today, because the early
> versions of fast-path locking could exhibit *major* regressions
> precisely because of contention shifting to other locks, specifically
> SInvalReadLock and msgNumLock.  (cf. commit
> b4fbe392f8ff6ff1a66b488eb7197eef9e1770a4).  If we say that because the
> contention on those other locks can get worse as a result of
> contention on this lock being reduced, or even worse, if we try to
> take responsibility for what effect reducing lock contention might
> have on the operating system scheduler discipline (which will
> certainly differ from system to system and version to version), we're
> never going to get anywhere, because there's almost always going to be
> some way that reducing contention in one place can bite you someplace
> else.
>

I don't think I've suggested not committing any of the clog patches (or 
other patches in general) because shifting the contention somewhere else 
might cause regressions. At the end of the last CF I've however stated 
that we need to better understand the impact on various workloads, and I 
think Amit agreed with that conclusion.

We have that understanding now, I believe - also thanks to your idea of 
sampling wait events data.

You're right we can't fix all the contention points in one patch, and 
that shifting the contention may cause regressions. But we should at 
least understand what workloads might be impacted, how serious the 
regressions may get etc. Which is why all the testing was done.


> I also believe it's pretty normal for patches that remove lock
> contention to increase variability.  If you run an auto race where
> every car has a speed governor installed that limits it to 80 kph,
> there will be much less variability in the finish times than if you
> remove the governor, but that's a stupid way to run a race.  You won't
> get much innovation around increasing the top speed of the cars under
> those circumstances, either.  Nobody ever bothered optimizing the
> contention around msgNumLock before fast-path locking happened,
> because the heavyweight lock manager burdened the system so heavily
> that you couldn't generate enough contention on it to matter.
> Similarly, we're not going to get much traction around optimizing the
> other locks to which contention would shift if we applied this patch
> unless we apply it.  This is not theoretical: EnterpriseDB staff have
> already done work on trying to optimize WALWriteLock, but it's hard to
> get a benefit.  The more contention other contention we eliminate, the
> easier it will be to see whether a proposed change to WALWriteLock
> helps.

Sure, I understand that. My main worry was that people will get worse 
performance with the next major version than what they get now (assuming 
we don't manage to address the other contention points). Which is 
difficult to explain to users & customers, no matter how reasonable it 
seems to us.

The difference is that both the fast-path locks and msgNumLock went into 
9.2, so that end users probably never saw that regression. But we don't 
know if that happens for clog and WAL.

Perhaps you have a working patch addressing the WAL contention, so that 
we could see how that changes the results?

> Of course, we'll also be more at the mercy of operating system
> scheduler discipline, but that's not all a bad thing either.  The
> Linux kernel guys have been known to run PostgreSQL to see whether
> proposed changes help or hurt, but they're not going to try those
> tests after applying patches that we rejected because they expose us
> to existing Linux shortcomings.
>

I might be wrong, but I doubt the kernel guys are running a particularly 
wide set of tests, so how likely is it they will notice issues with 
specific workloads? Wouldn't it be great if we could tell them there's a 
bug and provide a workload that reproduces it?

I don't see how "it's a Linux issue" makes it someone else's problem. 
The kernel guys can't really test everything (and are not obliged to). 
It's up to us to do more testing in this area, and report issues to the 
kernel guys (which is not happening as much as it should).
>
> I don't want to be perceived as advocating too forcefully for a
> patch that was, after all, written by a colleague. However, I
> sincerely believe it's a mistake to say that a patch which reduces
> lock contention must show a tangible win or at least no loss on every
> piece of hardware, on every kernel, at every client count with no
> increase in variability in any configuration.

I don't think anyone suggested that.
>
> Very few (if any) patches are going to be able to meet that bar, and
> if we make that the bar, people aren't going to write patches to
> reduce lock contention in PostgreSQL. For that to be worth doing, you
> have to be able to get the patch committed in finite time. We've
> spent an entire release cycle dithering over this patch. Several
> alternative patches have been written that are not any better (and
> the people who wrote those patches don't seem especially interested
> in doing further work on them anyway). There is increasing evidence
> that the patch is effective at solving the problem it claims to
> solve, and that any downsides are just the result of poor
> lock-scaling behavior elsewhere which we could be working on fixing
> if we weren't still spending time on this. Is that really not good
> enough?
>

Except that a few days ago, after getting results from the last round of 
tests, I've stated that we haven't really found any regressions that 
would matter, and that group_update seems to be performing the best (and 
actually significantly improves results for some of the tests). I 
haven't done any code review, though.

The one remaining thing is the strange zig-zag behavior, but that might 
easily be due to scheduling in the kernel, or something else. I don't 
consider it a blocker for any of the patches, though.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Wed, Nov 2, 2016 at 9:01 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 11/01/2016 08:13 PM, Robert Haas wrote:
>>
>> On Mon, Oct 31, 2016 at 5:48 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>>
>
> The one remaining thing is the strange zig-zag behavior, but that might
> easily be a due to scheduling in kernel, or something else. I don't consider
> it a blocker for any of the patches, though.
>

The only reason I could think of for that zig-zag behaviour is
frequent multiple clog page accesses, which could be due to the reasons
below:

a. A transaction and its subtransactions (IIRC, Dilip's case has one
main transaction and two subtransactions) can't fit into the same page,
in which case the group_update optimization won't apply and I don't
think we can do anything for it (see the sketch just below).
b. In the same group, multiple clog pages are being accessed.  It is
not a likely scenario, but it can happen and we might be able to
improve a bit if that is happening.
c. The transactions at the same time try to update different clog pages.
I think, as mentioned upthread, we can handle it by using slots and
allowing multiple groups to work together instead of a single group.
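
To illustrate (a) and (b): with the default BLCKSZ of 8192 and 2 status
bits per transaction, one clog page covers 32768 xids, so a transaction
and its subtransactions that fall on opposite sides of a 32768 boundary
must touch different clog pages. A back-of-the-envelope sketch:

    xacts_per_page=$(( 8192 * 4 ))   # CLOG_XACTS_PER_PAGE for BLCKSZ=8192
    for xid in 32766 32767 32768 32769; do
        echo "xid $xid -> clog page $(( xid / xacts_per_page ))"
    done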

To check if there is any impact due to (a) or (b), I have added a few
log messages in the code (patch - group_update_clog_v9_log).  The log message
could be "all xacts are not on same page" or  "Group contains
different pages".

Patch group_update_clog_v9_slots tries to address (c). So if there is
any problem due to (c), this patch should improve the situation.

Can you please try to run the test where you saw zig-zag behaviour
with both the patches separately?  I think if there is anything due to
postgres, then you will see either one of the new log messages or the
performance will be improved; OTOH if we see the same behaviour, then I
think we can probably assume it is due to scheduler activity and move on.
Also one point to note here is that even when the performance is down
in that curve, it is equal to or better than HEAD.


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 11/02/2016 05:52 PM, Amit Kapila wrote:
> On Wed, Nov 2, 2016 at 9:01 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> On 11/01/2016 08:13 PM, Robert Haas wrote:
>>>
>>> On Mon, Oct 31, 2016 at 5:48 PM, Tomas Vondra
>>> <tomas.vondra@2ndquadrant.com> wrote:
>>>>
>>
>> The one remaining thing is the strange zig-zag behavior, but that might
>> easily be a due to scheduling in kernel, or something else. I don't consider
>> it a blocker for any of the patches, though.
>>
>
> The only reason I could think of for that zig-zag behaviour is
> frequent multiple clog page accesses and it could be due to below
> reasons:
>
> a. transaction and its subtransactions (IIRC, Dilip's case has one
> main transaction and two subtransactions) can't fit into same page, in
> which case the group_update optimization won't apply and I don't think
> we can do anything for it.
> b. In the same group, multiple clog pages are being accessed.  It is
> not a likely scenario, but it can happen and we might be able to
> improve a bit if that is happening.
> c. The transactions at same time tries to update different clog page.
> I think as mentioned upthread we can handle it by using slots an
> allowing multiple groups to work together instead of a single group.
>
> To check if there is any impact due to (a) or (b), I have added few
> logs in code (patch - group_update_clog_v9_log). The log message
> could be "all xacts are not on same page" or "Group contains
> different pages".
>
> Patch group_update_clog_v9_slots tries to address (c). So if there
> is any problem due to (c), this patch should improve the situation.
>
> Can you please try to run the test where you saw zig-zag behaviour
> with both the patches separately? I think if there is anything due
> to postgres, then you can see either one of the new log message or
> performance will be improved, OTOH if we see same behaviour, then I
> think we can probably assume it due to scheduler activity and move
> on. Also one point to note here is that even when the performance is
> down in that curve, it is equal to or better than HEAD.
>

Will do.

Based on the results with more client counts (increment by 6 clients
instead of 36), I think this really looks like something unrelated to
any of the patches - kernel, CPU, or something already present in
current master.

The attached results show that:

(a) master shows the same zig-zag behavior - No idea why this wasn't
observed on the previous runs.

(b) group_update actually seems to improve the situation, because the
performance keeps stable up to 72 clients, while on master the
fluctuation starts way earlier.

I'll redo the tests with a newer kernel - this was on 3.10.x which is
what Red Hat 7.2 uses, I'll try on 4.8.6. Then I'll try with the patches
you submitted, if the 4.8.6 kernel does not help.

Overall, I'm convinced this issue is unrelated to the patches.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Tue, Nov 1, 2016 at 11:31 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I don't think I've suggested not committing any of the clog patches (or
> other patches in general) because shifting the contention somewhere else
> might cause regressions. At the end of the last CF I've however stated that
> we need to better understand the impact on various wokloads, and I think
> Amit agreed with that conclusion.
>
> We have that understanding now, I believe - also thanks to your idea of
> sampling wait events data.
>
> You're right we can't fix all the contention points in one patch, and that
> shifting the contention may cause regressions. But we should at least
> understand what workloads might be impacted, how serious the regressions may
> get etc. Which is why all the testing was done.

OK.

> Sure, I understand that. My main worry was that people will get worse
> performance with the next major version that what they get now (assuming we
> don't manage to address the other contention points). Which is difficult to
> explain to users & customers, no matter how reasonable it seems to us.
>
> The difference is that both the fast-path locks and msgNumLock went into
> 9.2, so that end users probably never saw that regression. But we don't know
> if that happens for clog and WAL.
>
> Perhaps you have a working patch addressing the WAL contention, so that we
> could see how that changes the results?

I don't think we do, yet.  Amit or Kuntal might know more.  At some
level I think we're just hitting the limits of the hardware's ability
to lay bytes on a platter, and fine-tuning the locking may not help
much.

> I might be wrong, but I doubt the kernel guys are running particularly wide
> set of tests, so how likely is it they will notice issues with specific
> workloads? Wouldn't it be great if we could tell them there's a bug and
> provide a workload that reproduces it?
>
> I don't see how "it's a Linux issue" makes it someone else's problem. The
> kernel guys can't really test everything (and are not obliged to). It's up
> to us to do more testing in this area, and report issues to the kernel guys
> (which is not happening as much as it should).

I don't exactly disagree with any of that.  I just want to find a
course of action that we can agree on and move forward.  This has been
cooking for a long time, and I want to converge on some resolution.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Thu, Nov 3, 2016 at 8:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Nov 1, 2016 at 11:31 PM, Tomas Vondra
>> The difference is that both the fast-path locks and msgNumLock went into
>> 9.2, so that end users probably never saw that regression. But we don't know
>> if that happens for clog and WAL.
>>
>> Perhaps you have a working patch addressing the WAL contention, so that we
>> could see how that changes the results?
>
> I don't think we do, yet.
>

Right.  At this stage, we are just evaluating the ways (basic idea is
to split the OS writes and Flush requests in separate locks) to reduce
it.  It is difficult to speculate results at this stage.  I think
after spending some more time (probably few weeks), we will be in
position to share our findings.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Haribabu Kommi
Date:


On Fri, Nov 4, 2016 at 8:20 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Nov 3, 2016 at 8:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Nov 1, 2016 at 11:31 PM, Tomas Vondra
>> The difference is that both the fast-path locks and msgNumLock went into
>> 9.2, so that end users probably never saw that regression. But we don't know
>> if that happens for clog and WAL.
>>
>> Perhaps you have a working patch addressing the WAL contention, so that we
>> could see how that changes the results?
>
> I don't think we do, yet.
>

Right.  At this stage, we are just evaluating the ways (basic idea is
to split the OS writes and Flush requests in separate locks) to reduce
it.  It is difficult to speculate results at this stage.  I think
after spending some more time (probably few weeks), we will be in
position to share our findings.


As per my understanding the current state of the patch is waiting for the
performance results from the author.

Moved to next CF with "waiting on author" status. Please feel free to
update the status if the current status differs with the actual patch status.

Regards,
Hari Babu
Fujitsu Australia

Re: Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Mon, Dec 5, 2016 at 6:00 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>
>
> On Fri, Nov 4, 2016 at 8:20 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Thu, Nov 3, 2016 at 8:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> > On Tue, Nov 1, 2016 at 11:31 PM, Tomas Vondra
>> >> The difference is that both the fast-path locks and msgNumLock went
>> >> into
>> >> 9.2, so that end users probably never saw that regression. But we don't
>> >> know
>> >> if that happens for clog and WAL.
>> >>
>> >> Perhaps you have a working patch addressing the WAL contention, so that
>> >> we
>> >> could see how that changes the results?
>> >
>> > I don't think we do, yet.
>> >
>>
>> Right.  At this stage, we are just evaluating the ways (basic idea is
>> to split the OS writes and Flush requests in separate locks) to reduce
>> it.  It is difficult to speculate results at this stage.  I think
>> after spending some more time (probably few weeks), we will be in
>> position to share our findings.
>>
>
> As per my understanding the current state of the patch is waiting for the
> performance results from author.
>

No, that is not true.  You have quoted the wrong message; that
discussion was about WALWriteLock contention, not about the patch being
discussed in this thread.  I have posted the latest set of patches
here [1].  Tomas is supposed to share the results of his tests.  He
mentioned to me in PGConf Asia last week that he ran few tests on
Power Box, so let us wait for him to share his findings.

> Moved to next CF with "waiting on author" status. Please feel free to
> update the status if the current status differs with the actual patch
> status.
>

I think we should keep the status as "Needs Review".

[1] - https://www.postgresql.org/message-id/CAA4eK1JjatUZu0%2BHCi%3D5VM1q-hFgN_OhegPAwEUJqxf-7pESbg%40mail.gmail.com


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Speed up Clog Access by increasing CLOG buffers

From
Haribabu Kommi
Date:


On Mon, Dec 5, 2016 at 1:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Dec 5, 2016 at 6:00 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

No, that is not true.  You have quoted the wrong message, that
discussion was about WALWriteLock contention not about the patch being
discussed in this thread.  I have posted the latest set of patches
here [1].  Tomas is supposed to share the results of his tests.  He
mentioned to me in PGConf Asia last week that he ran few tests on
Power Box, so let us wait for him to share his findings.

> Moved to next CF with "waiting on author" status. Please feel free to
> update the status if the current status differs with the actual patch
> status.
>

I think we should keep the status as "Needs Review".

[1] - https://www.postgresql.org/message-id/CAA4eK1JjatUZu0%2BHCi%3D5VM1q-hFgN_OhegPAwEUJqxf-7pESbg%40mail.gmail.com

Thanks for the update.
I changed the status to "needs review" in 2017-01 commitfest.

Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
Hi,

> The attached results show that:
>
> (a) master shows the same zig-zag behavior - No idea why this wasn't
> observed on the previous runs.
>
> (b) group_update actually seems to improve the situation, because the
> performance keeps stable up to 72 clients, while on master the
> fluctuation starts way earlier.
>
> I'll redo the tests with a newer kernel - this was on 3.10.x which is
> what Red Hat 7.2 uses, I'll try on 4.8.6. Then I'll try with the patches
> you submitted, if the 4.8.6 kernel does not help.
>
> Overall, I'm convinced this issue is unrelated to the patches.

I've been unable to rerun the tests on this hardware with a newer 
kernel, so nothing new on the x86 front.

But as discussed with Amit in Tokyo at pgconf.asia, I got access to a 
Power8e machine (IBM 8247-22L to be precise). It's a much smaller 
machine compared to the x86 one, though - it only has 24 cores in 2 
sockets, 128GB of RAM and less powerful storage, for example.

I've repeated a subset of x86 tests and pushed them to
    https://bitbucket.org/tvondra/power8-results-2

The new results are prefixed with "power-" and I've tried to put them 
right next to the "same" x86 tests.

In all cases the patches significantly reduce the contention on 
CLogControlLock, just like on x86. Which is good and expected.

Otherwise the results are rather boring - no major regressions compared 
to master, and all the patches perform almost exactly the same. Compare 
for example this:

* http://tvondra.bitbucket.org/#dilip-300-unlogged-sync

* http://tvondra.bitbucket.org/#power-dilip-300-unlogged-sync

So the results seem much smoother compared to x86, and the performance 
difference is roughly 3x, which matches the 24 vs. 72 cores.

For pgbench, the difference is much more significant, though:

* http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip

* http://tvondra.bitbucket.org/#power-pgbench-300-unlogged-sync-skip

So, we're doing ~40k on Power8, but 220k on x86 (which is ~6x more, so 
double per-core throughput). My first guess was that this is due to the 
x86 machine having a better I/O subsystem, so I've rerun the tests with 
the data directory in tmpfs, but that produced almost the same results.

Of course, this observation is unrelated to this patch.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Thu, Dec 22, 2016 at 6:59 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Hi,
>
> But as discussed with Amit in Tokyo at pgconf.asia, I got access to a
> Power8e machine (IBM 8247-22L to be precise). It's a much smaller machine
> compared to the x86 one, though - it only has 24 cores in 2 sockets, 128GB
> of RAM and less powerful storage, for example.
>
> I've repeated a subset of x86 tests and pushed them to
>
>     https://bitbucket.org/tvondra/power8-results-2
>
> The new results are prefixed with "power-" and I've tried to put them right
> next to the "same" x86 tests.
>
> In all cases the patches significantly reduce the contention on
> CLogControlLock, just like on x86. Which is good and expected.
>

The results look positive.  Do you think we can conclude based on all
the tests you and Dilip have done, that we can move forward with this
patch (in particular group-update) or do you still want to do more
tests?   I am aware that in one of the tests we have observed that
reducing contention on CLOGControlLock has increased the contention on
WALWriteLock, but I feel we can leave that point as a note to
committer and let him take a final call.  From the code perspective
already Robert and Andres have taken one pass of review and I have
addressed all their comments, so surely more review of code can help,
but I think that is not a big deal considering patch size is
relatively small.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Tomas Vondra
Date:
On 12/23/2016 03:58 AM, Amit Kapila wrote:
> On Thu, Dec 22, 2016 at 6:59 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Hi,
>>
>> But as discussed with Amit in Tokyo at pgconf.asia, I got access to a
>> Power8e machine (IBM 8247-22L to be precise). It's a much smaller machine
>> compared to the x86 one, though - it only has 24 cores in 2 sockets, 128GB
>> of RAM and less powerful storage, for example.
>>
>> I've repeated a subset of x86 tests and pushed them to
>>
>>     https://bitbucket.org/tvondra/power8-results-2
>>
>> The new results are prefixed with "power-" and I've tried to put them right
>> next to the "same" x86 tests.
>>
>> In all cases the patches significantly reduce the contention on
>> CLogControlLock, just like on x86. Which is good and expected.
>>
>
> The results look positive.  Do you think we can conclude based on all
> the tests you and Dilip have done, that we can move forward with this
> patch (in particular group-update) or do you still want to do more
> tests?   I am aware that in one of the tests we have observed that
> reducing contention on CLOGControlLock has increased the contention on
> WALWriteLock, but I feel we can leave that point as a note to
> committer and let him take a final call.  From the code perspective
> already Robert and Andres have taken one pass of review and I have
> addressed all their comments, so surely more review of code can help,
> but I think that is not a big deal considering patch size is
> relatively small.
>

Yes, I believe that's a reasonable conclusion. I've done a few more 
tests on the Power machine with the data directory placed on a tmpfs 
filesystem (to minimize the I/O overhead), but the results are the same.

I don't think more testing is needed at this point, at least not with the 
synthetic test cases we've been using. The patch has already received 
far more benchmarking than most other patches.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Dilip Kumar
Date:
On Fri, Dec 23, 2016 at 8:28 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> The results look positive.  Do you think we can conclude based on all
> the tests you and Dilip have done, that we can move forward with this
> patch (in particular group-update) or do you still want to do more
> tests?   I am aware that in one of the tests we have observed that
> reducing contention on CLOGControlLock has increased the contention on
> WALWriteLock, but I feel we can leave that point as a note to
> committer and let him take a final call.  From the code perspective
> already Robert and Andres have taken one pass of review and I have
> addressed all their comments, so surely more review of code can help,
> but I think that is not a big deal considering patch size is
> relatively small.

I have done one more pass of the review today. I have few comments.

+ if (nextidx != INVALID_PGPROCNO)
+ {
+ /* Sleep until the leader updates our XID status. */
+ for (;;)
+ {
+ /* acts as a read barrier */
+ PGSemaphoreLock(&proc->sem);
+ if (!proc->clogGroupMember)
+ break;
+ extraWaits++;
+ }
+
+ Assert(pg_atomic_read_u32(&proc->clogGroupNext) == INVALID_PGPROCNO);
+
+ /* Fix semaphore count for any absorbed wakeups */
+ while (extraWaits-- > 0)
+ PGSemaphoreUnlock(&proc->sem);
+ return true;
+ }

1. extraWaits is used only locally in this block, so I guess we can
declare it inside this block.

2. It seems that we have missed one unlock in the case of absorbed
wakeups. You have initialised extraWaits with -1, and if there is one
extra wakeup then extraWaits will become 0 (it means we have made one
extra call to PGSemaphoreLock, and it's our responsibility to fix that,
as the leader will Unlock only once). But it appears that in such a case
we will not make any call to PGSemaphoreUnlock. Am I missing something?



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Thu, Dec 29, 2016 at 10:41 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I have done one more pass of the review today. I have few comments.
>
> + if (nextidx != INVALID_PGPROCNO)
> + {
> + /* Sleep until the leader updates our XID status. */
> + for (;;)
> + {
> + /* acts as a read barrier */
> + PGSemaphoreLock(&proc->sem);
> + if (!proc->clogGroupMember)
> + break;
> + extraWaits++;
> + }
> +
> + Assert(pg_atomic_read_u32(&proc->clogGroupNext) == INVALID_PGPROCNO);
> +
> + /* Fix semaphore count for any absorbed wakeups */
> + while (extraWaits-- > 0)
> + PGSemaphoreUnlock(&proc->sem);
> + return true;
> + }
>
> 1. extraWaits is used only locally in this block so I guess we can
> declare inside this block only.
>

Agreed and changed accordingly.

> 2. It seems that we have missed one unlock in case of absorbed
> wakeups. You have initialised extraWaits with -1 and if there is one
> extra wake up then extraWaits will become 0 (it means we have made one
> extra call to PGSemaphoreLock and it's our responsibility to fix it as
> the leader will Unlock only once). But it appear in such case we will
> not make any call to PGSemaphoreUnlock.
>

Good catch!  I have fixed it by initialising extraWaits to 0.  The
same issue exists in the group clear xid code, for which I will send a
patch separately.
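
For illustration, the corrected accounting looks roughly like this (a
sketch only, reusing the variables from the hunk quoted above, not the
exact patch text):

    int     extraWaits = 0;     /* starting at -1 would lose one absorbed wakeup */

    for (;;)
    {
        /* acts as a read barrier */
        PGSemaphoreLock(&proc->sem);
        if (!proc->clogGroupMember)
            break;              /* this wakeup came from the leader */
        extraWaits++;           /* absorbed a wakeup meant for something else */
    }

    /* Re-post every absorbed wakeup so the semaphore count stays balanced. */
    while (extraWaits-- > 0)
        PGSemaphoreUnlock(&proc->sem);

Since the break is taken before the increment, only the absorbed wakeups
are counted, and each of them needs a matching PGSemaphoreUnlock.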

Apart from above, the patch needs to be adjusted for commit be7b2848
which has changed the definition of PGSemaphore.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Dilip Kumar
Date:
On Sat, Dec 31, 2016 at 9:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Agreed and changed accordingly.
>
>> 2. It seems that we have missed one unlock in case of absorbed
>> wakeups. You have initialised extraWaits with -1 and if there is one
>> extra wake up then extraWaits will become 0 (it means we have made one
>> extra call to PGSemaphoreLock and it's our responsibility to fix it as
>> the leader will Unlock only once). But it appear in such case we will
>> not make any call to PGSemaphoreUnlock.
>>
>
> Good catch!  I have fixed it by initialising extraWaits to 0.  This
> same issue exists from Group clear xid for which I will send a patch
> separately.
>
> Apart from above, the patch needs to be adjusted for commit be7b2848
> which has changed the definition of PGSemaphore.

I have reviewed the latest patch and I don't have any more comments.
So if there is no objection from other reviewers I can move it to
"Ready For Committer"?


I have performed one more test, with scale factor 3000, because
previously I had tested only up to scale factor 1000. The purpose of
this test is to check whether there is any regression at a higher scale
factor.

Machine: Intel 8 socket machine.
Scale Factor: 3000
Shared Buffer: 8GB
Test: Pgbench RW test.
Run: 30 mins median of 3

Other modified GUC:
-N 300 -c min_wal_size=15GB -c max_wal_size=20GB -c
checkpoint_timeout=900 -c maintenance_work_mem=1GB -c
checkpoint_completion_target=0.9

Summary:
- Did not observe any regression.
- The performance gain is in sync with what we have observed with
other tests at lower scale factors.


Sync_Commit_Off:
client      Head     Patch

8             10065   10009
16           18487   18826
32           28167   28057
64           26655   28712
128         20152   24917
256         16740   22891

Sync_Commit_On:

Client       Head     Patch

8             5102       5110
16           8087       8282
32         12523     12548
64         14701     15112
128       14656      15238
256       13421      16424

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Dilip Kumar
Date:
On Wed, Jan 11, 2017 at 10:55 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have reviewed the latest patch and I don't have any more comments.
> So if there is no objection from other reviewers I can move it to
> "Ready For Committer"?

Seeing no objections, I have moved it to Ready For Committer.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Tue, Jan 17, 2017 at 11:39 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Wed, Jan 11, 2017 at 10:55 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> I have reviewed the latest patch and I don't have any more comments.
>> So if there is no objection from other reviewers I can move it to
>> "Ready For Committer"?
>
> Seeing no objections, I have moved it to Ready For Committer.
>

Thanks for the review.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Michael Paquier
Date:
On Tue, Jan 17, 2017 at 9:18 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Jan 17, 2017 at 11:39 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> On Wed, Jan 11, 2017 at 10:55 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>> I have reviewed the latest patch and I don't have any more comments.
>>> So if there is no objection from other reviewers I can move it to
>>> "Ready For Committer"?
>>
>> Seeing no objections, I have moved it to Ready For Committer.
>>
>
> Thanks for the review.

Moved to CF 2017-03, the 8th commit fest of this patch.
-- 
Michael



Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Tue, Jan 31, 2017 at 11:35 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
>> Thanks for the review.
>
> Moved to CF 2017-03, the 8th commit fest of this patch.

I think eight is enough.  Committed with some cosmetic changes.

I think the turning point for this somewhat-troubled patch was when we
realized that, while results were somewhat mixed on whether it
improved performance, wait event monitoring showed that it definitely
reduced contention significantly.  However, I just realized that in
both this case and in the case of group XID clearing, we weren't
advertising a wait event for the PGSemaphoreLock calls that are part
of the group locking machinery.  I think we should fix that, because a
quick test shows that can happen fairly often -- not, I think, as
often as we would have seen LWLock waits without these patches, but
often enough that you'll want to know.  Patch attached.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Attachment

Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I think eight is enough.  Committed with some cosmetic changes.

Buildfarm thinks eight wasn't enough.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=clam&dt=2017-03-10%2002%3A00%3A01
        regards, tom lane



Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Fri, Mar 10, 2017 at 7:47 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I think eight is enough.  Committed with some cosmetic changes.
>
> Buildfarm thinks eight wasn't enough.
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=clam&dt=2017-03-10%2002%3A00%3A01
>

I will look into this.  I don't have access to that machine, but it
looks to be a Power machine and I have access to a somewhat similar
one.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Thu, Mar 9, 2017 at 9:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I think eight is enough.  Committed with some cosmetic changes.
>
> Buildfarm thinks eight wasn't enough.
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=clam&dt=2017-03-10%2002%3A00%3A01

At first I was confused about how you knew that this was the fault of
this patch, but this seems like a pretty good indicator:

TRAP: FailedAssertion("!(curval == 0 || (curval == 0x03 && status !=
0x00) || curval == status)", File: "clog.c", Line: 574)

I'm not sure whether it's related to this problem or not, but now that
I look at it, this (preexisting) comment looks like entirely wishful
thinking:
     * If we update more than one xid on this page while it is being written
     * out, we might find that some of the bits go to disk and others don't.
     * If we are updating commits on the page with the top-level xid that
     * could break atomicity, so we subcommit the subxids first before we mark
     * the top-level commit.

The problem with that is the word "before".  There are no memory
barriers here, so there's zero guarantee that other processes see the
writes in the order they're performed here.  But it might be a stretch
to suppose that that would cause this symptom.

Maybe we should replace that Assert() with an elog() and dump out the
actual values.
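
Something like this, for instance (just a sketch; the message text and
error level are placeholders):

    if (!(curval == 0 ||
          (curval == 0x03 && status != 0x00) ||
          curval == status))
        elog(PANIC, "unexpected clog status change: curval = %d, status = %d",
             (int) curval, (int) status);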

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Mar 9, 2017 at 9:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Buildfarm thinks eight wasn't enough.
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=clam&dt=2017-03-10%2002%3A00%3A01

> At first I was confused how you knew that this was the fault of this
> patch, but this seems like a pretty indicator:
> TRAP: FailedAssertion("!(curval == 0 || (curval == 0x03 && status !=
> 0x00) || curval == status)", File: "clog.c", Line: 574)

Yeah, that's what led me to blame the clog-group-update patch.

> I'm not sure whether it's related to this problem or not, but now that
> I look at it, this (preexisting) comment looks like entirely wishful
> thinking:
>      * If we update more than one xid on this page while it is being written
>      * out, we might find that some of the bits go to disk and others don't.
>      * If we are updating commits on the page with the top-level xid that
>      * could break atomicity, so we subcommit the subxids first before we mark
>      * the top-level commit.

Maybe, but that comment dates to 2008 according to git, and clam has
been, er, happy as a clam up to now.  My money is on a newly-introduced
memory-access-ordering bug.

Also, I see clam reported in green just now, so it's not 100%
reproducible :-(
        regards, tom lane



Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Fri, Mar 10, 2017 at 10:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Mar 9, 2017 at 9:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Buildfarm thinks eight wasn't enough.
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=clam&dt=2017-03-10%2002%3A00%3A01
>
>> At first I was confused how you knew that this was the fault of this
>> patch, but this seems like a pretty indicator:
>> TRAP: FailedAssertion("!(curval == 0 || (curval == 0x03 && status !=
>> 0x00) || curval == status)", File: "clog.c", Line: 574)
>
> Yeah, that's what led me to blame the clog-group-update patch.
>
>> I'm not sure whether it's related to this problem or not, but now that
>> I look at it, this (preexisting) comment looks like entirely wishful
>> thinking:
>>      * If we update more than one xid on this page while it is being written
>>      * out, we might find that some of the bits go to disk and others don't.
>>      * If we are updating commits on the page with the top-level xid that
>>      * could break atomicity, so we subcommit the subxids first before we mark
>>      * the top-level commit.
>
> Maybe, but that comment dates to 2008 according to git, and clam has
> been, er, happy as a clam up to now.  My money is on a newly-introduced
> memory-access-ordering bug.
>
> Also, I see clam reported in green just now, so it's not 100%
> reproducible :-(
>

Just to let you know that I think I have figured out the reason for the
failure.  If we run the regressions with the attached patch, it will make
the regression tests fail consistently in the same way.  The patch just
makes all transaction status updates go via the group clog update
mechanism.  Now, the reason for the problem is that the patch has
relied on the XidCache in PGPROC for subtransactions when they have not
overflowed, which is okay for Commits, but not for Rollback to
Savepoint and Rollback.  For Rollback to Savepoint, we just pass the
particular (sub)-transaction ids to abort, but the group mechanism will
abort all the sub-transactions in that top transaction.  I am still
analysing what could be the best way to fix this issue; I think there
could be multiple ways.  One way is that we can advertise the fact that
the status update for the transaction involves subtransactions, and
then we can use the XidCache for actually processing the status update.
Second is to advertise all the subtransaction ids for which the status
needs to be updated, but I am sure that is not at all efficient as it
will consume a lot of memory.  The last resort could be that we don't
use the group clog update optimization when the transaction has
sub-transactions.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Tom Lane
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:
> Just to let you know that I think I have figured out the reason of
> failure.  If we run the regressions with attached patch, it will make
> the regression tests fail consistently in same way.  The patch just
> makes all transaction status updates to go via group clog update
> mechanism.

This does *not* give me a warm fuzzy feeling that this patch was
ready to commit.  Or even that it was tested to the claimed degree.
        regards, tom lane



Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Fri, Mar 10, 2017 at 11:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Mar 10, 2017 at 10:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Also, I see clam reported in green just now, so it's not 100%
>> reproducible :-(
>>
>
> Just to let you know that I think I have figured out the reason of
> failure.  If we run the regressions with attached patch, it will make
> the regression tests fail consistently in same way.  The patch just
> makes all transaction status updates to go via group clog update
> mechanism.  Now, the reason of the problem is that the patch has
> relied on XidCache in PGPROC for subtransactions when they are not
> overflowed which is okay for Commits, but not for Rollback to
> Savepoint and Rollback.  For Rollback to Savepoint, we just pass the
> particular (sub)-transaction id to abort, but group mechanism will
> abort all the sub-transactions in that top transaction to Rollback.  I
> am still analysing what could be the best way to fix this issue.  I
> think there could be multiple ways to fix this problem.  One way is
> that we can advertise the fact that the status update for transaction
> involves subtransactions and then we can use xidcache for actually
> processing the status update.  Second is advertise all the
> subtransaction ids for which status needs to be update, but I am sure
> that is not-at all efficient as that will cosume lot of memory.  Last
> resort could be that we don't use group clog update optimization when
> transaction has sub-transactions.
>

On further analysis, I don't think the first way mentioned above can
work for Rollback To Savepoint, because it can pass just a subset of
sub-transactions, in which case we can never identify it by looking at
the subxids in PGPROC unless we advertise all such subxids.  The case I
am talking about is something like:

Begin;
Savepoint one;
Insert ...
Savepoint two
Insert ..
Savepoint three
Insert ...
Rollback to Savepoint two;

Now, for Rollback to Savepoint two, we pass the transaction ids
corresponding to Savepoints three and two.

So, I think we can apply this optimization only for transactions that
commit, which will anyway be the most common case.  Another
alternative, as mentioned above, is to do this optimization only when
there are no subtransactions involved.  The two attached patches
implement these two approaches (fix_clog_group_commit_opt_v1.patch -
allow the optimization only for commits;
fix_clog_group_commit_opt_v2.patch - allow the optimization for
transaction status updates that don't involve subxids).  I think the
first approach is a better way to deal with this; let me know your
thoughts.
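
Roughly speaking, the guards in the two patches have the following shape
(a sketch only, using the variables already present in
TransactionIdSetPageStatus(), not the exact patch text):

    /* v1: attempt the group update only for commits */
    if (all_xact_same_page &&
        nsubxids < PGPROC_MAX_CACHED_SUBXIDS &&
        status == TRANSACTION_STATUS_COMMITTED &&
        !IsGXactActive())
    {
        /* try TransactionGroupUpdateXidStatus() before the plain path */
    }

    /* v2: attempt the group update only when no subxids are involved */
    if (all_xact_same_page &&
        nsubxids == 0 &&
        !IsGXactActive())
    {
        /* try TransactionGroupUpdateXidStatus() before the plain path */
    }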


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Fri, Mar 10, 2017 at 11:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapila16@gmail.com> writes:
>> Just to let you know that I think I have figured out the reason of
>> failure.  If we run the regressions with attached patch, it will make
>> the regression tests fail consistently in same way.  The patch just
>> makes all transaction status updates to go via group clog update
>> mechanism.
>
> This does *not* give me a warm fuzzy feeling that this patch was
> ready to commit.  Or even that it was tested to the claimed degree.
>

I think this is more of an implementation detail missed by me.  We
have done quite a lot of performance/stress testing with different
numbers of savepoints, but this could have been caught only by having
a Rollback to Savepoint followed by a commit.  I agree that we could
have devised some simple way (like the one I shared above) to run a
wide range of tests through this new mechanism earlier.  This is a
lesson learned, and I will try to be more cautious about such
things in the future.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Fri, Mar 10, 2017 at 6:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Mar 10, 2017 at 11:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Amit Kapila <amit.kapila16@gmail.com> writes:
>>> Just to let you know that I think I have figured out the reason of
>>> failure.  If we run the regressions with attached patch, it will make
>>> the regression tests fail consistently in same way.  The patch just
>>> makes all transaction status updates to go via group clog update
>>> mechanism.
>>
>> This does *not* give me a warm fuzzy feeling that this patch was
>> ready to commit.  Or even that it was tested to the claimed degree.
>>
>
> I think this is more of an implementation detail missed by me.  We
> have done quite some performance/stress testing with a different
> number of savepoints, but this could have been caught only by having
> Rollback to Savepoint followed by a commit.  I agree that we could
> have devised some simple way (like the one I shared above) to test the
> wide range of tests with this new mechanism earlier.  This is a
> learning from here and I will try to be more cautious about such
> things in future.

After some study, I don't feel confident that it's this simple.  The
underlying issue here is that TransactionGroupUpdateXidStatus thinks
it can assume that proc->clogGroupMemberXid, pgxact->nxids, and
proc->subxids.xids match the values that were passed to
TransactionIdSetPageStatus, but that's not checked anywhere.  For
example, I thought about adding these assertions:
       Assert(nsubxids == MyPgXact->nxids);
       Assert(memcmp(subxids, MyProc->subxids.xids,
              nsubxids * sizeof(TransactionId)) == 0);


There's not even a comment in the patch anywhere that notes that we're
assuming this, let alone anything that checks that it's actually true,
which seems worrying.

One thing that seems off is that we have this new field
clogGroupMemberXid, which we use to determine the XID that is being
committed, but for the subxids we think it's going to be true in every
case.   Well, that seems a bit odd, right?  I mean, if the contents of
the PGXACT are a valid way to figure out the subxids that we need to
worry about, then why not also use it to get the toplevel XID?

Another point that's kind of bothering me is that this whole approach
now seems to me to be an abstraction violation.  It relies on the set
of subxids for which we're setting status in clog matching the set of
subxids advertised in PGPROC.  But actually there's a fair amount of
separation between those things.  What's getting passed down to clog
is coming from xact.c's transaction state stack, which is completely
separate from the procarray.  Now after going over the logic in some
detail, it does look to me that you're correct that in the case of a
toplevel commit they will always match, but in some sense that looks
accidental.

For example, look at this code from RecordTransactionAbort:
    /*
     * If we're aborting a subtransaction, we can immediately remove failed
     * XIDs from PGPROC's cache of running child XIDs.  We do that here for
     * subxacts, because we already have the child XID array at hand.  For
     * main xacts, the equivalent happens just after this function returns.
     */
    if (isSubXact)
        XidCacheRemoveRunningXids(xid, nchildren, children, latestXid);


That code paints the removal of the aborted subxids from our PGPROC as
an optimization, not a requirement for correctness.  And without this
patch, that's correct: the XIDs are advertised in PGPROC so that we
construct correct snapshots, but they only need to be present there
for so long as there is a possibility that those XIDs might in the
future commit.  Once they've aborted, it's not *necessary* for them to
appear in PGPROC any more, but it doesn't hurt anything if they do.
However, with this patch, removing them from PGPROC becomes a hard
requirement, because otherwise the set of XIDs that are running
according to the transaction state stack and the set that are running
according to the PGPROC might be different.  Yet, neither the original
patch nor your proposed fix patch updated any of the comments here.

One might wonder whether it's even wise to tie these things together
too closely.  For example, you can imagine a future patch for
autonomous transactions stashing their XIDs in the subxids array.
That'd be fine for snapshot purposes, but it would break this.

Finally, I had an unexplained hang during the TAP tests while testing
out your fix patch.  I haven't been able to reproduce that so it
might've just been an artifact of something stupid I did, or of some
unrelated bug, but I think it's best to back up and reconsider a bit
here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Fri, Mar 10, 2017 at 3:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Finally, I had an unexplained hang during the TAP tests while testing
> out your fix patch.  I haven't been able to reproduce that so it
> might've just been an artifact of something stupid I did, or of some
> unrelated bug, but I think it's best to back up and reconsider a bit
> here.

I was able to reproduce this with the following patch:

diff --git a/src/backend/access/transam/clog.c
b/src/backend/access/transam/clog.c
index bff42dc..0546425 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -268,9 +268,11 @@ set_status_by_pages(int nsubxids, TransactionId *subxids,
  * has a race condition (see TransactionGroupUpdateXidStatus) but the
  * worst thing that happens if we mess up is a small loss of efficiency;
  * the intent is to avoid having the leader access pages it wouldn't
- * otherwise need to touch.  Finally, we skip it for prepared transactions,
- * which don't have the semaphore we would need for this optimization,
- * and which are anyway probably not all that common.
+ * otherwise need to touch.  We also skip it if the transaction status is
+ * other than commit, because for rollback and rollback to savepoint, the
+ * list of subxids won't be same as subxids array in PGPROC.  Finally, we skip
+ * it for prepared transactions, which don't have the semaphore we would need
+ * for this optimization, and which are anyway probably not all that common.
  */
 static void
 TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
@@ -280,15 +282,20 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
 {
     if (all_xact_same_page && nsubxids < PGPROC_MAX_CACHED_SUBXIDS &&
+        status == TRANSACTION_STATUS_COMMITTED &&
         !IsGXactActive())
     {
+        Assert(nsubxids == MyPgXact->nxids);
+        Assert(memcmp(subxids, MyProc->subxids.xids,
+               nsubxids * sizeof(TransactionId)) == 0);
+
         /*
          * If we can immediately acquire CLogControlLock, we update the status
          * of our own XID and release the lock.  If not, try use group XID
          * update.  If that doesn't work out, fall back to waiting for the
          * lock to perform an update for this transaction only.
          */
-        if (LWLockConditionalAcquire(CLogControlLock, LW_EXCLUSIVE))
+        if (false && LWLockConditionalAcquire(CLogControlLock, LW_EXCLUSIVE))
         {
             TransactionIdSetPageStatusInternal(xid, nsubxids,
                                                subxids, status, lsn, pageno);
             LWLockRelease(CLogControlLock);

make check-world hung here:

t/009_twophase.pl ..........
1..13
ok 1 - Commit prepared transaction after restart
ok 2 - Rollback prepared transaction after restart

[rhaas pgsql]$ ps uxww | grep postgres
rhaas 72255   0.0  0.0  2447996   1684 s000  S+    3:40PM   0:00.00
/Users/rhaas/pgsql/tmp_install/Users/rhaas/install/dev/bin/psql -XAtq
-d port=64230 host=/var/folders/y8/r2ycj_jj2vd65v71rmyddpr40000gn/T/ZVWy0JGbuw
dbname='postgres' -f - -v ON_ERROR_STOP=1
rhaas 72253   0.0  0.0  2478532   1548   ??  Ss    3:40PM   0:00.00
postgres: bgworker: logical replication launcher
rhaas 72252   0.0  0.0  2483132    740   ??  Ss    3:40PM   0:00.05
postgres: stats collector process
rhaas 72251   0.0  0.0  2486724   1952   ??  Ss    3:40PM   0:00.02
postgres: autovacuum launcher process
rhaas 72250   0.0  0.0  2477508    880   ??  Ss    3:40PM   0:00.03
postgres: wal writer process
rhaas 72249   0.0  0.0  2477508    972   ??  Ss    3:40PM   0:00.03
postgres: writer process
rhaas 72248   0.0  0.0  2477508   1252   ??  Ss    3:40PM   0:00.00
postgres: checkpointer process
rhaas 72246   0.0  0.0  2481604   5076 s000  S+    3:40PM   0:00.03
/Users/rhaas/pgsql/tmp_install/Users/rhaas/install/dev/bin/postgres -D
/Users/rhaas/pgsql/src/test/recovery/tmp_check/data_master_Ylq1/pgdata
rhaas 72337   0.0  0.0  2433796    688 s002  S+    4:14PM   0:00.00
grep postgres
rhaas 72256   0.0  0.0  2478920   2984   ??  Ss    3:40PM   0:00.00
postgres: rhaas postgres [local] COMMIT PREPARED waiting for 0/301D0D0

Backtrace of PID 72256:

#0  0x00007fff8ecc85c2 in poll ()
#1  0x00000001078eb727 in WaitEventSetWaitBlock [inlined] () at
/Users/rhaas/pgsql/src/backend/storage/ipc/latch.c:1118
#2  0x00000001078eb727 in WaitEventSetWait (set=0x7fab3c8366c8,
timeout=-1, occurred_events=0x7fff585e5410, nevents=1,
wait_event_info=<value temporarily unavailable, due to optimizations>)
at latch.c:949
#3  0x00000001078eb409 in WaitLatchOrSocket (latch=<value temporarily
unavailable, due to optimizations>, wakeEvents=<value temporarily
unavailable, due to optimizations>, sock=-1, timeout=<value
temporarily unavailable, due to optimizations>,
wait_event_info=134217741) at latch.c:349
#4  0x00000001078cf077 in SyncRepWaitForLSN (lsn=<value temporarily
unavailable, due to optimizations>, commit=<value temporarily
unavailable, due to optimizations>) at syncrep.c:284
#5  0x00000001076a2dab in FinishPreparedTransaction (gid=<value
temporarily unavailable, due to optimizations>, isCommit=1 '\001') at
twophase.c:2110
#6  0x0000000107919420 in standard_ProcessUtility (pstmt=<value
temporarily unavailable, due to optimizations>, queryString=<value
temporarily unavailable, due to optimizations>,
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x7fab3c853cf8,
completionTag=<value temporarily unavailable, due to optimizations>)
at utility.c:452
#7  0x00000001079186f3 in PortalRunUtility (portal=0x7fab3c874a40,
pstmt=0x7fab3c853c00, isTopLevel=1 '\001', setHoldSnapshot=<value
temporarily unavailable, due to optimizations>, dest=0x7fab3c853cf8,
completionTag=0x7fab3c8366f8 "\n") at pquery.c:1165
#8  0x0000000107917cd6 in PortalRunMulti (portal=<value temporarily
unavailable, due to optimizations>, isTopLevel=1 '\001',
setHoldSnapshot=0 '\0', dest=0x7fab3c853cf8, altdest=0x7fab3c853cf8,
completionTag=<value temporarily unavailable, due to optimizations>)
at pquery.c:1315
#9  0x0000000107917634 in PortalRun (portal=0x7fab3c874a40,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x7fab3c853cf8,
altdest=0x7fab3c853cf8, completionTag=0x7fff585e5a30 "") at
pquery.c:788
#10 0x000000010791586b in PostgresMain (argc=<value temporarily
unavailable, due to optimizations>, argv=<value temporarily
unavailable, due to optimizations>, dbname=<value temporarily
unavailable, due to optimizations>, username=<value temporarily
unavailable, due to optimizations>) at postgres.c:1101
#11 0x0000000107897a68 in PostmasterMain (argc=<value temporarily
unavailable, due to optimizations>, argv=<value temporarily
unavailable, due to optimizations>) at postmaster.c:4317
#12 0x00000001078124cd in main (argc=<value temporarily unavailable,
due to optimizations>, argv=<value temporarily unavailable, due to
optimizations>) at main.c:228

debug_query_string is COMMIT PREPARED 'xact_009_1'

end of regress_log_009_twophase looks like this:

ok 2 - Rollback prepared transaction after restart
### Stopping node "master" using mode immediate
# Running: pg_ctl -D
/Users/rhaas/pgsql/src/test/recovery/tmp_check/data_master_Ylq1/pgdata
-m immediate stop
waiting for server to shut down.... done
server stopped
# No postmaster PID
### Starting node "master"
# Running: pg_ctl -D
/Users/rhaas/pgsql/src/test/recovery/tmp_check/data_master_Ylq1/pgdata
-l /Users/rhaas/pgsql/src/test/recovery/tmp_check/log/009_twophase_master.log
start
waiting for server to start.... done
server started
# Postmaster PID for node "master" is 72246

The smoking gun was in 009_twophase_slave.log:

TRAP: FailedAssertion("!(nsubxids == MyPgXact->nxids)", File:
"clog.c", Line: 288)

...and then the node shuts down, which is why this hangs forever.
(Also... what's up with it hanging forever instead of timing out or
failing or something?)

So evidently on a standby it is in fact possible for the procarray
contents not to match what got passed down to clog.  Now you might say
"well, we shouldn't be using group update on a standby anyway", but
it's possible for a hot standby backend to hold a shared lock on
CLogControlLock, and then the startup process would be pushed into the
group-update path and - boom.

Anyway, this is surely fixable, but I think it's another piece of
evidence that the assumption that the transaction state stack will
match the procarray is fairly fragile.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Alvaro Herrera
Date:
Robert Haas wrote:

> The smoking gun was in 009_twophase_slave.log:
> 
> TRAP: FailedAssertion("!(nsubxids == MyPgXact->nxids)", File:
> "clog.c", Line: 288)
> 
> ...and then the node shuts down, which is why this hangs forever.
> (Also... what's up with it hanging forever instead of timing out or
> failing or something?)

This bit me while messing with 2PC tests recently.  I think it'd be
worth doing something about this, such as causing the test to die if we
request a server to (re)start and it doesn't start or it immediately
crashes.  This doesn't solve the problem of a server crashing at a point
not immediately after start, though.

(It'd be very annoying to have to sprinkle the Perl test code with
"assert $server->islive", but perhaps we can add assertions of some kind
in PostgresNode itself).

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Sat, Mar 11, 2017 at 2:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Mar 10, 2017 at 6:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Fri, Mar 10, 2017 at 11:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Amit Kapila <amit.kapila16@gmail.com> writes:
>>>> Just to let you know that I think I have figured out the reason of
>>>> failure.  If we run the regressions with attached patch, it will make
>>>> the regression tests fail consistently in same way.  The patch just
>>>> makes all transaction status updates to go via group clog update
>>>> mechanism.
>>>
>>> This does *not* give me a warm fuzzy feeling that this patch was
>>> ready to commit.  Or even that it was tested to the claimed degree.
>>>
>>
>> I think this is more of an implementation detail missed by me.  We
>> have done quite some performance/stress testing with a different
>> number of savepoints, but this could have been caught only by having
>> Rollback to Savepoint followed by a commit.  I agree that we could
>> have devised some simple way (like the one I shared above) to test the
>> wide range of tests with this new mechanism earlier.  This is a
>> learning from here and I will try to be more cautious about such
>> things in future.
>
> After some study, I don't feel confident that it's this simple.  The
> underlying issue here is that TransactionGroupUpdateXidStatus thinks
> it can assume that proc->clogGroupMemberXid, pgxact->nxids, and
> proc->subxids.xids match the values that were passed to
> TransactionIdSetPageStatus, but that's not checked anywhere.  For
> example, I thought about adding these assertions:
>
>        Assert(nsubxids == MyPgXact->nxids);
>        Assert(memcmp(subxids, MyProc->subxids.xids,
>               nsubxids * sizeof(TransactionId)) == 0);
>
> There's not even a comment in the patch anywhere that notes that we're
> assuming this, let alone anything that checks that it's actually true,
> which seems worrying.
>
> One thing that seems off is that we have this new field
> clogGroupMemberXid, which we use to determine the XID that is being
> committed, but for the subxids we think it's going to be true in every
> case.   Well, that seems a bit odd, right?  I mean, if the contents of
> the PGXACT are a valid way to figure out the subxids that we need to
> worry about, then why not also it to get the toplevel XID?
>
> Another point that's kind of bothering me is that this whole approach
> now seems to me to be an abstraction violation.  It relies on the set
> of subxids for which we're setting status in clog matching the set of
> subxids advertised in PGPROC.  But actually there's a fair amount of
> separation between those things.  What's getting passed down to clog
> is coming from xact.c's transaction state stack, which is completely
> separate from the procarray.  Now after going over the logic in some
> detail, it does look to me that you're correct that in the case of a
> toplevel commit they will always match, but in some sense that looks
> accidental.
>
> For example, look at this code from RecordTransactionAbort:
>
>     /*
>      * If we're aborting a subtransaction, we can immediately remove failed
>      * XIDs from PGPROC's cache of running child XIDs.  We do that here for
>      * subxacts, because we already have the child XID array at hand.  For
>      * main xacts, the equivalent happens just after this function returns.
>      */
>     if (isSubXact)
>         XidCacheRemoveRunningXids(xid, nchildren, children, latestXid);
>
> That code paints the removal of the aborted subxids from our PGPROC as
> an optimization, not a requirement for correctness.  And without this
> patch, that's correct: the XIDs are advertised in PGPROC so that we
> construct correct snapshots, but they only need to be present there
> for so long as there is a possibility that those XIDs might in the
> future commit.  Once they've aborted, it's not *necessary* for them to
> appear in PGPROC any more, but it doesn't hurt anything if they do.
> However, with this patch, removing them from PGPROC becomes a hard
> requirement, because otherwise the set of XIDs that are running
> according to the transaction state stack and the set that are running
> according to the PGPROC might be different.  Yet, neither the original
> patch nor your proposed fix patch updated any of the comments here.
>

There was a comment in the existing code (proc.h) which states that it
will contain non-aborted transactions.  I agree that having it
explicitly mentioned in the patch would have been much better.

/*
 * Each backend advertises up to PGPROC_MAX_CACHED_SUBXIDS TransactionIds
 * for non-aborted subtransactions of its current top transaction.  These
 * have to be treated as running XIDs by other backends.




> One might wonder whether it's even wise to tie these things together
> too closely.  For example, you can imagine a future patch for
> autonomous transactions stashing their XIDs in the subxids array.
> That'd be fine for snapshot purposes, but it would break this.
>
> Finally, I had an unexplained hang during the TAP tests while testing
> out your fix patch.  I haven't been able to reproduce that so it
> might've just been an artifact of something stupid I did, or of some
> unrelated bug, but I think it's best to back up and reconsider a bit
> here.
>

I agree that more analysis can help us decide whether we can use the
subxids from PGPROC and, if so, under what conditions.  Have you
considered the other patch I have posted to fix the issue, which is to do
this optimization only when subxids are not present?  That patch removes
the dependency on the subxids in PGPROC.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Fri, Mar 10, 2017 at 7:39 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I agree that more analysis can help us to decide if we can use subxids
> from PGPROC and if so under what conditions.  Have you considered the
> another patch I have posted to fix the issue which is to do this
> optimization only when subxids are not present?  In that patch, it
> will remove the dependency of relying on subxids in PGPROC.

Well, that's an option, but it narrows the scope of the optimization
quite a bit.  I think Simon previously opposed handling only the
no-subxid cases (although I may be misremembering) and I'm not that
keen about it either.

I was wondering about doing an explicit test: if the XID being
committed matches the one in the PGPROC, and nsubxids matches, and the
actual list of XIDs matches, then apply the optimization.  That could
replace the logic that you've proposed to exclude non-commit cases,
gxact cases, etc. and it seems fundamentally safer.  But it might be a
more expensive test, too, so I'm not sure.
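
Sketched out, I mean something like this (illustrative only; the subxid
fields are the ones from the assertions upthread, and the toplevel-XID
field name here is an assumption):

    if (all_xact_same_page &&
        TransactionIdEquals(xid, MyPgXact->xid) &&
        nsubxids == MyPgXact->nxids &&
        memcmp(subxids, MyProc->subxids.xids,
               nsubxids * sizeof(TransactionId)) == 0 &&
        !IsGXactActive())
    {
        /* what xact.c passed down matches PGPROC/PGXACT; group update is safe */
    }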

It would be nice to get some other opinions on how (and whether) to
proceed with this.  I'm feeling really nervous about this right at the
moment, because it seems like everybody including me missed some
fairly critical points relating to the safety (or lack thereof) of
this patch, and I want to make sure that if it gets committed again,
we've really got everything nailed down tight.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Sun, Mar 12, 2017 at 8:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Mar 10, 2017 at 7:39 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I agree that more analysis can help us to decide if we can use subxids
>> from PGPROC and if so under what conditions.  Have you considered the
>> another patch I have posted to fix the issue which is to do this
>> optimization only when subxids are not present?  In that patch, it
>> will remove the dependency of relying on subxids in PGPROC.
>
> Well, that's an option, but it narrows the scope of the optimization
> quite a bit.  I think Simon previously opposed handling only the
> no-subxid cases (although I may be misremembering) and I'm not that
> keen about it either.
>
> I was wondering about doing an explicit test: if the XID being
> committed matches the one in the PGPROC, and nsubxids matches, and the
> actual list of XIDs matches, then apply the optimization.  That could
> replace the logic that you've proposed to exclude non-commit cases,
> gxact cases, etc. and it seems fundamentally safer.  But it might be a
> more expensive test, too, so I'm not sure.
>

I think if the number of subxids is very small, let us say under 5 or
so, then such a check might not matter, but otherwise it could be
expensive.

> It would be nice to get some other opinions on how (and whether) to
> proceed with this.  I'm feeling really nervous about this right at the
> moment, because it seems like everybody including me missed some
> fairly critical points relating to the safety (or lack thereof) of
> this patch, and I want to make sure that if it gets committed again,
> we've really got everything nailed down tight.
>

I think the basic thing that is missing in the last patch is that we
can't apply this optimization during WAL replay, as during
recovery/hot standby the xids/subxids are tracked in KnownAssignedXids.
The same is mentioned in the header comments in procarray.c and in
GetSnapshotData (look at the else branch of the check if
(!snapshot->takenDuringRecovery)).  As far as I can see, the patch
considered that in the initial versions, but then the check got dropped
in one of the later revisions by mistake.  Patch version-5 [1] has
the check for recovery, but during some code rearrangement it got
dropped in version-6 [2].  Having said that, I think the improvement
in case there are subtransactions will be smaller, because having
subtransactions means more work under the LWLock and hence fewer
context switches.  This optimization is all about reducing
frequent context switches, so I think even if we don't optimize the
case for subtransactions we are not leaving much on the table, and it
will make this optimization much safer.  To substantiate this theory
with data, see the difference in performance when subtransactions are
used [3] and when they are not used [4].
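
(For reference, the dropped guard was essentially of this shape - a
sketch, not the exact hunk from version-5:)

    /*
     * During WAL replay the startup process tracks xids/subxids in
     * KnownAssignedXids, not in per-backend PGPROC entries, so the
     * group-update path must not be taken there.
     */
    if (InRecovery)
    {
        /* fall back to the plain CLogControlLock path */
    }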

So we have four ways to proceed:
1. Have this optimization for subtransactions and make it safe by
having some additional conditions like check for recovery, explicit
check for if the actual transaction ids match with ids stored in proc.
2. Have this optimization when there are no subtransactions. In this
case, we can have a very simple check for this optimization.
3. Drop this patch and idea.
4. Consider it for next version.

I personally think the second way is okay for this release, as it looks
safe and gets us most of the benefit we can achieve with this
optimization; we can then consider adding the optimization for
subtransactions (the first way) in a future version if we think it is
safe and gives us a benefit.

Thoughts?

[1] - https://www.postgresql.org/message-id/CAA4eK1KUVPxBcGTdOuKyvf5p1sQ0HeUbSMbTxtQc%3DP65OxiZog%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1L4iV-2qe7AyMVsb%2Bnz7SiX8JvCO%2BCqhXwaiXgm3CaBUw%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/CAFiTN-u3%3DXUi7z8dTOgxZ98E7gL1tzL%3Dq9Yd%3DCwWCtTtS6pOZw%40mail.gmail.com
[4] - https://www.postgresql.org/message-id/CAFiTN-u-XEzhd%3DhNGW586fmQwdTy6Qy6_SXe09tNB%3DgBcVzZ_A%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Fri, Mar 17, 2017 at 2:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I was wondering about doing an explicit test: if the XID being
>> committed matches the one in the PGPROC, and nsubxids matches, and the
>> actual list of XIDs matches, then apply the optimization.  That could
>> replace the logic that you've proposed to exclude non-commit cases,
>> gxact cases, etc. and it seems fundamentally safer.  But it might be a
>> more expensive test, too, so I'm not sure.
>
> I think if the number of subxids is very small let us say under 5 or
> so, then such a check might not matter, but otherwise it could be
> expensive.

We could find out by testing it.  We could also restrict the
optimization to cases with just a few subxids, because if you've got a
large number of subxids this optimization probably isn't buying much
anyway.  We're trying to avoid grabbing CLogControlLock to do a very
small amount of work, but if you've got 10 or 20 subxids we're doing
as much work anyway as the group update optimization is attempting to
put into one batch.

> So we have four ways to proceed:
> 1. Have this optimization for subtransactions and make it safe by
> having some additional conditions like check for recovery, explicit
> check for if the actual transaction ids match with ids stored in proc.
> 2. Have this optimization when there are no subtransactions. In this
> case, we can have a very simple check for this optimization.
> 3. Drop this patch and idea.
> 4. Consider it for next version.
>
> I personally think second way is okay for this release as that looks
> safe and gets us the maximum benefit we can achieve by this
> optimization and then consider adding optimization for subtransactions
> (first way) in the future version if we think it is safe and gives us
> the benefit.
>
> Thoughts?

I don't like #2 very much.  Restricting it to a relatively small
number of subtransactions - whatever we can show doesn't hurt performance
- seems OK, but restricting it to the exactly-zero-subtransactions
case seems poor.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Mon, Mar 20, 2017 at 8:27 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Mar 17, 2017 at 2:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> I was wondering about doing an explicit test: if the XID being
>>> committed matches the one in the PGPROC, and nsubxids matches, and the
>>> actual list of XIDs matches, then apply the optimization.  That could
>>> replace the logic that you've proposed to exclude non-commit cases,
>>> gxact cases, etc. and it seems fundamentally safer.  But it might be a
>>> more expensive test, too, so I'm not sure.
>>
>> I think if the number of subxids is very small let us say under 5 or
>> so, then such a check might not matter, but otherwise it could be
>> expensive.
>
> We could find out by testing it.  We could also restrict the
> optimization to cases with just a few subxids, because if you've got a
> large number of subxids this optimization probably isn't buying much
> anyway.
>

Yes, and I have modified the patch to compare xids and subxids for the
group update.  In the initial short tests (with a few client counts), it
seems like up to 3 savepoints we can win, and from 10 savepoints onwards
there is some regression, or at the very least there doesn't appear to
be any benefit.  We need more tests to identify what the safe number is,
but I thought it better to share the patch to see if we agree on the
changes, because if not, then the whole testing needs to be repeated.
Let me know what you think about the attached.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Ashutosh Sharma
Date:
Hi All,

I have tried to test 'group_update_clog_v11.1.patch', shared upthread by Amit, on a high-end machine. I have tested the patch with various numbers of savepoints in my test script. The machine details, along with the test script and the test results, are shown below.

Machine details:
============
24 sockets, 192 CPU(s)
RAM - 500GB

test script:
========

\set aid random (1,30000000)
\set tid random (1,3000)

BEGIN;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
SAVEPOINT s1;
SELECT tbalance FROM pgbench_tellers WHERE tid = :tid for UPDATE;
SAVEPOINT s2;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
SAVEPOINT s3;
SELECT tbalance FROM pgbench_tellers WHERE tid = :tid for UPDATE;
SAVEPOINT s4;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
SAVEPOINT s5;
SELECT tbalance FROM pgbench_tellers WHERE tid = :tid for UPDATE;
END;

Non-default parameters
==================
max_connections = 200
shared_buffers=8GB
min_wal_size=10GB
max_wal_size=15GB
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
checkpoint_timeout=900
synchronous_commit=off


pgbench -M prepared -c $thread -j $thread -T $time_for_reading postgres -f ~/test_script.sql

where, time_for_reading = 10 mins

Test Results:
=========

With 3 savepoints
=============

CLIENT COUNT    TPS (HEAD)    TPS (PATCH)    % IMPROVEMENT
128             50275         53704          6.82048732
64              62860         66561          5.887686923
8               18464         18752          1.559792028


With 5 savepoints
=============

CLIENT COUNT    TPS (HEAD)    TPS (PATCH)    % IMPROVEMENT
128             46559         47715          2.482871196
64              52306         52082          -0.4282491492
8               12289         12852          4.581332899



With 7 savepoints
=============

CLIENT COUNT    TPS (HEAD)    TPS (PATCH)    % IMPROVEMENT
128             41367         41500          0.3215123166
64              42996         41473          -3.542189971
8               9665          9657           -0.0827728919


With 10 savepoints
==============

CLIENT COUNT    TPS (HEAD)    TPS (PATCH)    % IMPROVEMENT
128             34513         34597          0.24338655
64              32581         32035          -1.675823333
8               7293          7622           4.511175099

Conclusion:
As seen from the test results above, there is some performance improvement with 3 SP(s); with 5 SP(s) the results with the patch are slightly better than HEAD; with 7 and 10 SP(s) we do see regression with the patch. Therefore, the threshold value of 4 for the number of subtransactions considered in the patch looks fine to me.


--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com


On Tue, Mar 21, 2017 at 6:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Mar 20, 2017 at 8:27 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Mar 17, 2017 at 2:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> I was wondering about doing an explicit test: if the XID being
>>> committed matches the one in the PGPROC, and nsubxids matches, and the
>>> actual list of XIDs matches, then apply the optimization.  That could
>>> replace the logic that you've proposed to exclude non-commit cases,
>>> gxact cases, etc. and it seems fundamentally safer.  But it might be a
>>> more expensive test, too, so I'm not sure.
>>
>> I think if the number of subxids is very small let us say under 5 or
>> so, then such a check might not matter, but otherwise it could be
>> expensive.
>
> We could find out by testing it.  We could also restrict the
> optimization to cases with just a few subxids, because if you've got a
> large number of subxids this optimization probably isn't buying much
> anyway.
>

Yes, and I have modified the patch to compare xids and subxids for the
group update.  In initial short tests (at a few client counts), it
appears that up to 3 savepoints we win, while from 10 savepoints onwards
there is some regression, or at the very least no apparent benefit.  We
need more tests to identify the safe number, but I thought it better to
share the patch now so we can agree on the changes; otherwise the whole
round of testing would need to be repeated.  Let me know what you think
of the attached patch.



--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Thu, Mar 9, 2017 at 5:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> However, I just realized that in
> both this case and in the case of group XID clearing, we weren't
> advertising a wait event for the PGSemaphoreLock calls that are part
> of the group locking machinery.  I think we should fix that, because a
> quick test shows that can happen fairly often -- not, I think, as
> often as we would have seen LWLock waits without these patches, but
> often enough that you'll want to know.  Patch attached.

I've pushed the portion of this that relates to ProcArrayLock.  (I
know this hasn't been discussed much, but there doesn't really seem to
be any reason for anybody to object, and looking at just the
LWLock/ProcArrayLock wait events gives a highly misleading answer.)
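
For context, the instrumentation in question is roughly this pattern
around the group-membership semaphore wait in procarray.c; a simplified
sketch of the shape of the change, not the literal commit, and the
wait-event identifier shown is an assumption (it is reported in
pg_stat_activity as IPC / ProcArrayGroupUpdate):

	/* advertise the wait so it shows up in pg_stat_activity */
	pgstat_report_wait_start(WAIT_EVENT_PROCARRAY_GROUP_UPDATE);
	for (;;)
	{
		/* acts as a read barrier */
		PGSemaphoreLock(proc->sem);
		if (!proc->procArrayGroupMember)
			break;
	}
	pgstat_report_wait_end();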

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Thu, Mar 23, 2017 at 1:18 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Conclusion:
As seen from the test results above, there is some performance improvement
with 3 savepoints, with 5 savepoints the patched results are only slightly
better than HEAD, and with 7 and 10 savepoints we do see a regression with
the patch.  Therefore, the threshold value of 4 subtransactions used in the
patch looks fine to me.


Thanks for the tests.  Attached, find the patch rebased on HEAD; I have run the
latest pgindent on it.  I have yet to add a wait event for group lock waits in
this patch, as Robert did in commit d4116a771925379c33cf4c6634ca620ed08b551d
for ProcArrayGroupUpdate.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Mon, Jul 3, 2017 at 6:15 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Mar 23, 2017 at 1:18 PM, Ashutosh Sharma <ashu.coek88@gmail.com>
> wrote:
>>
>> Conclusion:
>> As seen from the test results mentioned above, there is some performance
>> improvement with 3 SP(s), with 5 SP(s) the results with patch is slightly
>> better than HEAD, with 7 and 10 SP(s) we do see regression with patch.
>> Therefore, I think the threshold value of 4 for number of subtransactions
>> considered in the patch looks fine to me.
>>
>
> Thanks for the tests.  Attached find the rebased patch on HEAD.  I have ran
> latest pgindent on patch.  I have yet to add wait event for group lock waits
> in this patch as is done by Robert in commit
> d4116a771925379c33cf4c6634ca620ed08b551d for ProcArrayGroupUpdate.
>

I have updated the patch to support wait events and moved it to upcoming CF.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Tue, Jul 4, 2017 at 12:33 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I have updated the patch to support wait events and moved it to upcoming CF.

This patch doesn't apply any more, but I made it apply with a hammer
and then did a little benchmarking (scylla, EDB server, Intel Xeon
E5-2695 v3 @ 2.30GHz, 2 sockets, 14 cores/socket, 2 threads/core).
The results were not impressive.  There's basically no clog contention
to remove, so the patch just doesn't really do anything.  For example,
here's a wait event profile with master and using Ashutosh's test
script with 5 savepoints:
     1  Lock            | tuple
     2  IO              | SLRUSync
     5  LWLock          | wal_insert
     5  LWLock          | XidGenLock
     9  IO              | DataFileRead
    12  LWLock          | lock_manager
    16  IO              | SLRURead
    20  LWLock          | CLogControlLock
    97  LWLock          | buffer_content
   216  Lock            | transactionid
   237  LWLock          | ProcArrayLock
  1238  IPC             | ProcArrayGroupUpdate
  2266  Client          | ClientRead

This is just a 5-minute test; maybe things would change if we ran it
for longer, but if only 0.5% of the samples are blocked on
CLogControlLock without the patch, obviously the patch can't help
much.  I did some other experiments too, but I won't bother
summarizing the results here because they're basically boring.  I
guess I should have used a bigger machine.

Given that we've changed the approach here somewhat, I think we need
to validate that we're still seeing a substantial reduction in
CLogControlLock contention on big machines.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Wed, Aug 30, 2017 at 2:43 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jul 4, 2017 at 12:33 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I have updated the patch to support wait events and moved it to upcoming CF.
>
> This patch doesn't apply any more, but I made it apply with a hammer
> and then did a little benchmarking (scylla, EDB server, Intel Xeon
> E5-2695 v3 @ 2.30GHz, 2 sockets, 14 cores/socket, 2 threads/core).
> The results were not impressive.  There's basically no clog contention
> to remove, so the patch just doesn't really do anything.
>

Yeah, in such a case the patch won't help.

>  For example,
> here's a wait event profile with master and using Ashutosh's test
> script with 5 savepoints:
>
>       1  Lock            | tuple
>       2  IO              | SLRUSync
>       5  LWLock          | wal_insert
>       5  LWLock          | XidGenLock
>       9  IO              | DataFileRead
>      12  LWLock          | lock_manager
>      16  IO              | SLRURead
>      20  LWLock          | CLogControlLock
>      97  LWLock          | buffer_content
>     216  Lock            | transactionid
>     237  LWLock          | ProcArrayLock
>    1238  IPC             | ProcArrayGroupUpdate
>    2266  Client          | ClientRead
>
> This is just a 5-minute test; maybe things would change if we ran it
> for longer, but if only 0.5% of the samples are blocked on
> CLogControlLock without the patch, obviously the patch can't help
> much.  I did some other experiments too, but I won't bother
> summarizing the results here because they're basically boring.  I
> guess I should have used a bigger machine.
>

That would have been better.  In any case, I will run the tests on a
higher-end machine and share the results.

> Given that we've changed the approach here somewhat, I think we need
> to validate that we're still seeing a substantial reduction in
> CLogControlLock contention on big machines.
>

Sure, will do so.  In the meantime, I have rebased the patch.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Dilip Kumar
Date:
On Wed, Aug 30, 2017 at 12:54 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> That would have been better. In any case, will do the tests on some
> higher end machine and will share the results.
>
>> Given that we've changed the approach here somewhat, I think we need
>> to validate that we're still seeing a substantial reduction in
>> CLogControlLock contention on big machines.
>>
>
> Sure will do so.  In the meantime, I have rebased the patch.

I have repeated some of the tests we have performed earlier.

Machine:
Intel 8-socket machine with 128 cores.

Configuration:

shared_buffers=8GB
checkpoint_timeout=40min
max_wal_size=20GB
max_connections=300
maintenance_work_mem=4GB
synchronous_commit=off
checkpoint_completion_target=0.9

I have taken one reading for each test to measure the wait events.
The observation is the same: at higher client counts there is a significant
reduction in contention on CLogControlLock.

Benchmark: pgbench simple_update, 30-minute run:

Head: (64 client) : (TPS 60720)
 53808  Client          | ClientRead
 26147  IPC             | ProcArrayGroupUpdate
  7866  LWLock          | CLogControlLock
  3705  Activity        | LogicalLauncherMain
  3699  Activity        | AutoVacuumMain
  3353  LWLock          | ProcArrayLock
  3099  LWLock          | wal_insert
  2825  Activity        | BgWriterMain
  2688  Lock            | extend
  1436  Activity        | WalWriterMain

Patch: (64 client) : (TPS 67207)
 53235  Client          | ClientRead
 29470  IPC             | ProcArrayGroupUpdate
  4302  LWLock          | wal_insert
  3717  Activity        | LogicalLauncherMain
  3715  Activity        | AutoVacuumMain
  3463  LWLock          | ProcArrayLock
  3140  Lock            | extend
  2934  Activity        | BgWriterMain
  1434  Activity        | WalWriterMain
  1198  Activity        | CheckpointerMain
  1073  LWLock          | XidGenLock
   869  IPC             | ClogGroupUpdate

Head:(72 Client): (TPS 57856)
 55820  Client          | ClientRead
 34318  IPC             | ProcArrayGroupUpdate
 15392  LWLock          | CLogControlLock
  3708  Activity        | LogicalLauncherMain
  3705  Activity        | AutoVacuumMain
  3436  LWLock          | ProcArrayLock

Patch:(72 Client): (TPS 65740)
 60356  Client          | ClientRead
 38545  IPC             | ProcArrayGroupUpdate
  4573  LWLock          | wal_insert
  3708  Activity        | LogicalLauncherMain
  3705  Activity        | AutoVacuumMain
  3508  LWLock          | ProcArrayLock
  3492  Lock            | extend
  2903  Activity        | BgWriterMain
  1903  LWLock          | XidGenLock
  1383  Activity        | WalWriterMain
  1212  Activity        | CheckpointerMain
  1056  IPC             | ClogGroupUpdate


Head:(96 Client): (TPS 52170)
 62841  LWLock          | CLogControlLock
 56150  IPC             | ProcArrayGroupUpdate
 54761  Client          | ClientRead
  7037  LWLock          | wal_insert
  4077  Lock            | extend
  3727  Activity        | LogicalLauncherMain
  3727  Activity        | AutoVacuumMain
  3027  LWLock          | ProcArrayLock

Patch:(96 Client): (TPS 67932)
 87378  IPC             | ProcArrayGroupUpdate
 80201  Client          | ClientRead
 11511  LWLock          | wal_insert
  4102  Lock            | extend
  3971  LWLock          | ProcArrayLock
  3731  Activity        | LogicalLauncherMain
  3731  Activity        | AutoVacuumMain
  2948  Activity        | BgWriterMain
  1763  LWLock          | XidGenLock
  1736  IPC             | ClogGroupUpdate

Head:(128 Client): (TPS 40820)
182569  LWLock          | CLogControlLock
 61484  IPC             | ProcArrayGroupUpdate
 37969  Client          | ClientRead
  5135  LWLock          | wal_insert
  3699  Activity        | LogicalLauncherMain
  3699  Activity        | AutoVacuumMain

Patch:(128 Client): (TPS 67054)
174583  IPC             | ProcArrayGroupUpdate
 66084  Client          | ClientRead
 16738  LWLock          | wal_insert
  4993  IPC             | ClogGroupUpdate
  4893  LWLock          | ProcArrayLock
  4839  Lock            | extend

Benchmark: select for update with 3 savepoints, 10-minute run

Script:
\set aid random (1,30000000)
\set tid random (1,3000)

BEGIN;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
SAVEPOINT s1;
SELECT tbalance FROM pgbench_tellers WHERE tid = :tid for UPDATE;
SAVEPOINT s2;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
SAVEPOINT s3;
SELECT tbalance FROM pgbench_tellers WHERE tid = :tid for UPDATE;
END;

Head:(64 Client): (TPS 44577.1802)
 53808  Client          | ClientRead
 26147  IPC             | ProcArrayGroupUpdate
  7866  LWLock          | CLogControlLock
  3705  Activity        | LogicalLauncherMain
  3699  Activity        | AutoVacuumMain
  3353  LWLock          | ProcArrayLock
  3099  LWLock          | wal_insert

Patch:(64 Client): (TPS 46156.245)
 53235  Client          | ClientRead
 29470  IPC             | ProcArrayGroupUpdate
  4302  LWLock          | wal_insert
  3717  Activity        | LogicalLauncherMain
  3715  Activity        | AutoVacuumMain
  3463  LWLock          | ProcArrayLock
  3140  Lock            | extend
  2934  Activity        | BgWriterMain
  1434  Activity        | WalWriterMain
  1198  Activity        | CheckpointerMain
  1073  LWLock          | XidGenLock
   869  IPC             | ClogGroupUpdate


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Robert Haas
Date:
On Fri, Sep 1, 2017 at 10:03 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> Sure will do so.  In the meantime, I have rebased the patch.
>
> I have repeated some of the tests we have performed earlier.

OK, these tests seem to show that this is still working.  Committed,
again.  Let's hope this attempt goes better than the last one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

From
Amit Kapila
Date:
On Fri, Sep 1, 2017 at 9:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Sep 1, 2017 at 10:03 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>> Sure will do so.  In the meantime, I have rebased the patch.
>>
>> I have repeated some of the tests we have performed earlier.
>

Thanks for repeating the performance tests.

> OK, these tests seem to show that this is still working.  Committed,
> again.  Let's hope this attempt goes better than the last one.
>

Thanks for committing.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com