Thread: Possible fails in pg_stat_statements test

Possible fails in pg_stat_statements test

From
"Anton A. Melnikov"
Date:
Hello,

There are some places in the pg_state_statement's regress test where the 
bool result of comparison between the number of rows obtained and 
wal_records generated by query should be displayed.
Now counting the number of wal_records for some query in 
pg_state_statement is done by the global pgWalUsage.wal_records counter 
difference calculation.
During query execution the extra wal_records may appear that are not 
relate to the query.
There are two reasons why this might happen:
1) Owing to the hit into pruning of some page in optional pruning 
function (heap_page_prune_opt()).
2) When a new page is required for a new xid in clog and 
WriteZeroPageXlogRec() was called.
In both cases an extra wal record with zero xl_xid is generated, so 
wal_records counter gives an incremented value for this query and 
pg_stat_statement test will fail.

This patch introduces an additional counter of wal records not related 
to the query being executed.
Due to this counter pg_stat_statement finds out the number of wal 
records that are not relevant to the query and does not include them in 
the per query statistic.
This removes the possibility of the error described above.

There is a way to reproduce this error when patch is not applied:
1) start server with "shared_preload_libraries = 'pg_stat_statements'" 
string in the postgresql.conf;
2) replace makefile in contrib/pg_stat_statements with attached one;
3) replace test file 
contrib/pg_stat_statements/sql/pg_stat_statements.sql and expected 
results contrib/pg_stat_statements/expected/pg_stat_statements.out
with shorter versions from attached files;
4) copy test.sh to contrib/pg_stat_statements and make sure that PGHOME 
point to your server;
5) cd to contrib/pg_stat_statements and execute:
export ITER=1 && while ./start.sh || break; export ITER=$(($ITER+1)); do 
:; done

Usually 100-200 iterations will be enough.
To catch the error more faster one can add wal_records column to SELECT
in line 26 of contrib/pg_stat_statements/sql/pg_stat_statements.sql as 
followes:
SELECT query, calls, rows, wal_records,
and replace the contrib/pg_stat_statements/expected/pg_stat_statements.out
with attached pg_stat_statements-fast.out

With best regards,
-- 
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

Re: Possible fails in pg_stat_statements test

From
"Anton A. Melnikov"
Date:
Hello!

Here is the second version of the patch rebased onto the current master. 
No logical changes.
All other attached files from previous letter are actual.

With best regards,

-- 
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: Possible fails in pg_stat_statements test

From
Andres Freund
Date:
Hi,

On 2022-01-14 11:11:07 +0300, Anton A. Melnikov wrote:
> This patch introduces an additional counter of wal records not related to
> the query being executed.

They're not unrelated though.


> Due to this counter pg_stat_statement finds out the number of wal records
> that are not relevant to the query and does not include them in the per
> query statistic.

-many. For read-only queries the generated WAL due to on-access pruning can be
a significant factor in performance. Removing that information makes
pg_stat_statments *less* useful.


> This removes the possibility of the error described above.
> 
> There is a way to reproduce this error when patch is not applied:
> 1) start server with "shared_preload_libraries = 'pg_stat_statements'"
> string in the postgresql.conf;
> 2) replace makefile in contrib/pg_stat_statements with attached one;
> 3) replace test file contrib/pg_stat_statements/sql/pg_stat_statements.sql
> and expected results
> contrib/pg_stat_statements/expected/pg_stat_statements.out
> with shorter versions from attached files;
> 4) copy test.sh to contrib/pg_stat_statements and make sure that PGHOME
> point to your server;
> 5) cd to contrib/pg_stat_statements and execute:
> export ITER=1 && while ./start.sh || break; export ITER=$(($ITER+1)); do :;
> done
> 
> Usually 100-200 iterations will be enough.
> To catch the error more faster one can add wal_records column to SELECT
> in line 26 of contrib/pg_stat_statements/sql/pg_stat_statements.sql as
> followes:
> SELECT query, calls, rows, wal_records,
> and replace the contrib/pg_stat_statements/expected/pg_stat_statements.out
> with attached pg_stat_statements-fast.out

Can the test failures be encountered without such an elaborate setup? If not,
then I don't really see why we need to do anything here?

Greetings,

Andres Freund



Re: Possible fails in pg_stat_statements test

From
"Anton A. Melnikov"
Date:
Hello,

thank you much for your attention and for your thought.

On 20.03.2022 20:36, Andres Freund wrote:
>> This patch introduces an additional counter of wal records not related to
>> the query being executed.
> 
> They're not unrelated though.

Yes, i've missformulated here.
Indeed there is a relation but it seems of the some other kind.
It would be nice to clarify the terminology.
Maybe divide WAL records into two kinds:
1) WAL records, the number of which depends on the given query itself. 
(say strong relation)
2) WAL records, the number of which depends on the given query and on 
the previous query history. (say weak relation)

So  modified in the patch wal_records counter will belongs to the first 
kind while the number of wal records due to on-access pruning and new 
clog page generation to the second.

> -many. For read-only queries the generated WAL due to on-access pruning can be
> a significant factor in performance. Removing that information makes
> pg_stat_statments *less* useful.

A separate counter for the second type of records, say, 
extra_wal_records, will not only remove this disadvantage, but on the 
contrary will provide additional information.

The next version of the patch with additional counter is attached.

Really, now it is clearly seen that sometimes
> WAL due to on-access pruning can be a significant factor !
After  pgbench -c10 -t300:
postgres=# SELECT substring(query for 30), wal_records, 
extra_wal_records FROM pg_stat_statements WHERE extra_wal_records != 0;

            substring            | wal_records | extra_wal_records
--------------------------------+-------------+-------------------
  UPDATE pgbench_tellers SET tba |        4557 |                15
  create table pgbench_history(t |          48 |                 1
  create table pgbench_branches( |          40 |                 1
  UPDATE pgbench_accounts SET ab |        5868 |              1567
  drop table if exists pgbench_a |          94 |                 1
  UPDATE pgbench_branches SET bb |        5993 |                14
  SELECT abalance FROM pgbench_a |           0 |                 7
(7 rows)

> Can the test failures be encountered without such an elaborate setup? If not,
> then I don't really see why we need to do anything here?

There was a real bug report from our test department. They do long time 
repetitive tests and sometimes met this failure.
So i suppose there is a non-zero probability that such error can occur 
in the one-shot test as well.
The sequence given in the first letter helps to catch this failure quickly.

With best regards,

-- 
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: Possible fails in pg_stat_statements test

From
Robert Haas
Date:
On Wed, Mar 30, 2022 at 2:20 AM Anton A. Melnikov <aamelnikov@inbox.ru> wrote:
> > Can the test failures be encountered without such an elaborate setup? If not,
> > then I don't really see why we need to do anything here?
>
> There was a real bug report from our test department. They do long time
> repetitive tests and sometimes met this failure.
> So i suppose there is a non-zero probability that such error can occur
> in the one-shot test as well.
> The sequence given in the first letter helps to catch this failure quickly.

I don't think that the idea of "extra" WAL records is very principled.
It's pretty vague what "extra" means, and your definition seems to be
basically "whatever would be needed to make this test case pass." I
think the problem is basically with the test cases's idea that # of
WAL records and # of table rows ought to be equal. I think that's just
false. In general, we'd also have to worry about index insertions,
which would provoke variable numbers of WAL records depending on
whether they cause a page split. And we'd have to worry about TOAST
table insertions, which could produce different numbers of records
depending on the size of the data, the configured block size and TOAST
threshold, and whether the TOAST table index incurs a page split. So
even if we added a mechanism like what you propose here, we would only
be fixing this particular test case, not creating infrastructure of
any general utility.

If it's true that this test case sometimes randomly fails, then we
ought to fix that somehow, maybe by just removing this particular
check from the test case, or changing it to >=, or something like
that. But I don't think adding a new counter is the right idea.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Possible fails in pg_stat_statements test

From
"Anton A. Melnikov"
Date:
Hello!

On 30.03.2022 22:36, Robert Haas wrote:
> I don't think that the idea of "extra" WAL records is very principled.
> It's pretty vague what "extra" means, and your definition seems to be
> basically "whatever would be needed to make this test case pass." I
> think the problem is basically with the test cases's idea that # of
> WAL records and # of table rows ought to be equal. I think that's just
> false. In general, we'd also have to worry about index insertions,
> which would provoke variable numbers of WAL records depending on
> whether they cause a page split. And we'd have to worry about TOAST
> table insertions, which could produce different numbers of records
> depending on the size of the data, the configured block size and TOAST
> threshold, and whether the TOAST table index incurs a page split. 

Thank you very much for this information. I really didn't take it into 
account.

> If it's true that this test case sometimes randomly fails, then we
> ought to fix that somehow, maybe by just removing this particular
> check from the test case, or changing it to >=, or something like
> that. But I don't think adding a new counter is the right idea.

Indeed. Then there is a very simple solution for this particular case as 
wal_records counter may only sometime becomes greater but never less.
The corresponding patch is attached.


With best regards,
-- 
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: Possible fails in pg_stat_statements test

From
Julien Rouhaud
Date:
Hi,

On Thu, Mar 31, 2022 at 06:08:01PM +0300, Anton A. Melnikov wrote:
> Hello!
> 
> On 30.03.2022 22:36, Robert Haas wrote:
> > I don't think that the idea of "extra" WAL records is very principled.
> > It's pretty vague what "extra" means, and your definition seems to be
> > basically "whatever would be needed to make this test case pass."

I agree, and even it there was a better definition there probably isn't much to
learn from it.

> I
> > think the problem is basically with the test cases's idea that # of
> > WAL records and # of table rows ought to be equal. I think that's just
> > false. In general, we'd also have to worry about index insertions,
> > which would provoke variable numbers of WAL records depending on
> > whether they cause a page split. And we'd have to worry about TOAST
> > table insertions, which could produce different numbers of records
> > depending on the size of the data, the configured block size and TOAST
> > threshold, and whether the TOAST table index incurs a page split.

Indeed, we added this test as it was hitting only a few queries with small
rows, which we thought would be stable, but that's apparently not the case.  I
think the reason we never had any problem is that the buildfarm currently
doesn't run pg_stat_statement regression test, as it's marked as
NO_INSTALLCHECK.  Other CI systems like at pgpro evidently have a different
approach.

> > If it's true that this test case sometimes randomly fails, then we
> > ought to fix that somehow, maybe by just removing this particular
> > check from the test case, or changing it to >=, or something like
> > that. But I don't think adding a new counter is the right idea.
> 
> Indeed. Then there is a very simple solution for this particular case as
> wal_records counter may only sometime becomes greater but never less.
> The corresponding patch is attached.

+1 for this approach, and the patch looks good to me.



Re: Possible fails in pg_stat_statements test

From
Robert Haas
Date:
On Thu, Mar 31, 2022 at 12:00 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> > Indeed. Then there is a very simple solution for this particular case as
> > wal_records counter may only sometime becomes greater but never less.
> > The corresponding patch is attached.
>
> +1 for this approach, and the patch looks good to me.

Committed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Possible fails in pg_stat_statements test

From
"Anton A. Melnikov"
Date:

On 06.07.2022 20:11, Robert Haas wrote:
> On Thu, Mar 31, 2022 at 12:00 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>>> Indeed. Then there is a very simple solution for this particular case as
>>> wal_records counter may only sometime becomes greater but never less.
>>> The corresponding patch is attached.
>>
>> +1 for this approach, and the patch looks good to me.
> 
> Committed.
> 

Thanks a lot!

-- 
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company