Thread: testing framework for MVCC & vacuum (freeze) & heap_page_prune etc.

Hi,

There are so many situations in this area, and it is easy to break
something. I want to know if we have an existing testing framework for
this area (design, code, licence etc). I'm willing to design one myself,
but it would be better to ask first and see whether there is an existing
excellent project I can start with and contribute to.

Thanks!

-- 
Best Regards
Andy Fan




Andy Fan <zhihuifan1213@163.com> writes:

> Hi,
>
> I'm willing to design one myself
> but it would be better have a ask first to see if there is some existing
> excellent project I can start with and contribute to. 

Just to show that I'm not a person who takes things for granted, here is
my draft on this topic. I know many things are missing and many details
need more thought, which is why I was asking whether we have an
existing project.

====
Key Concepts:

* MVCC
- xmin  (insert, update)
- xmax  (delete, update, select for update, epq query)
- CommandId
        (insert/update/delete -> query)
- committs (xmin/xmax)

An XID may be in one of these states: in-progress, committed, aborted, or
2PC prepared.

While scanning tuples under a snapshot, there are two challenges.

Challenge 1: the transaction state changes:
(a) new xid. (in-progress)
(b) in-progress -> commit.
(c) in-progress -> abort
(d) new xid -> commit/abort.
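
As a rough illustration (not taken from any existing tool), the four cases
above could be written down as a small coverage table that a test driver
consults. The state names and the classify_transition helper are
hypothetical, and the 2PC prepared state is left out for brevity.

```python
from enum import Enum
from typing import Optional


class XidState(Enum):
    """States of an XID as observed by a scan (hypothetical names)."""
    NOT_ASSIGNED = "not assigned yet"
    IN_PROGRESS = "in-progress"
    COMMITTED = "committed"
    ABORTED = "aborted"


# Cases (a)-(d) from the list above, keyed by
# (state when the scan starts, state when the scan ends).
CASES = {
    (XidState.NOT_ASSIGNED, XidState.IN_PROGRESS): "a",
    (XidState.IN_PROGRESS, XidState.COMMITTED): "b",
    (XidState.IN_PROGRESS, XidState.ABORTED): "c",
    (XidState.NOT_ASSIGNED, XidState.COMMITTED): "d",
    (XidState.NOT_ASSIGNED, XidState.ABORTED): "d",
}


def classify_transition(before: XidState, after: XidState) -> Optional[str]:
    """Return the case label for a state change, or None if it is not (a)-(d)."""
    return CASES.get((before, after))
```

A driver could count how often each label was hit during a run to see
whether the workload actually covered all four cases.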

Challenge 2: the old tuple version is removed by HOT pruning or vacuum.

Exercising HOT pruning requires HOT update queries in the workload.

We may also have sub-transactions, which can be tested with
"statement_level_txn".

We also need to review the above on a standby (hot-standby case).

todo: xid state crash recovery
todo: xid state on standby (read-only).
todo: multiple IDs in each transaction.

* vacuum, freeze.

Besides MVCC working correctly, we also need vacuum and freeze to work
correctly.

vacuum_freeze.sql -- run 'vacuum (freeze) w;' at random times.

* error detection:

check_cnt.sql -- Check that the table always has exactly 10000 rows.
We inject a random sleep after fetching each tuple to simulate a long
query; this leaves enough time for other sessions to change their
transaction state, prune dead tuples, etc.

check_diff.sql -- Check that the result contains exactly the IDs
1...10000. Only used for troubleshooting; the random sleep is already
included.
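
To make the two checks concrete, here is a minimal client-side sketch of
the comparison they describe. It assumes the scanned IDs have already
been fetched into Python; the diff_ids name and its return shape are
made up for illustration, not part of the draft scripts.

```python
from collections import Counter
from typing import Iterable, List, Tuple

EXPECTED_ROWS = 10000  # row count the checks above expect


def diff_ids(seen: Iterable[int],
             expected_rows: int = EXPECTED_ROWS) -> Tuple[List[int], List[int], List[int]]:
    """Compare scanned IDs against 1..expected_rows.

    Returns (missing, duplicated, unexpected), each sorted ascending.
    """
    counts = Counter(seen)
    missing = [i for i in range(1, expected_rows + 1) if i not in counts]
    duplicated = sorted(i for i, n in counts.items() if n > 1)
    unexpected = sorted(i for i in counts if not 1 <= i <= expected_rows)
    return missing, duplicated, unexpected
```

A missing ID would suggest a visible tuple version was lost (for example
pruned too early), while a duplicated ID would suggest two versions of
the same row were both considered visible.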

check_vacuum_freeze.sql -- a PL/pgSQL function that checks
pg_stat_user_tables.n_xxx and pg_class.relfrozenxid/relminmxid. The
latter should advance after some period; if they do not, a WARNING
should be raised.
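
One detail worth sketching: "advance" cannot be a plain integer
comparison, because 32-bit XIDs wrap around. The snippet below uses the
usual modulo-2^32 comparison for normal XIDs and ignores the special
low-numbered XIDs, which is a simplification. The function names and the
idea of sampling pg_class.relfrozenxid into dictionaries are assumptions
for illustration.

```python
from typing import Dict, List


def xid_precedes(a: int, b: int) -> bool:
    """True if 32-bit XID a is logically older than b (wraparound-aware)."""
    diff = (a - b) & 0xFFFFFFFF
    # A "negative" signed 32-bit difference means a is older than b.
    return diff >= 0x80000000


def stalled_tables(before: Dict[str, int], after: Dict[str, int]) -> List[str]:
    """Names of tables whose relfrozenxid did not advance between two samples.

    before/after map table name -> relfrozenxid sampled at two points in time.
    """
    return sorted(
        name
        for name, old in before.items()
        if name in after and not xid_precedes(old, after[name])
    )
```

Any table reported by stalled_tables after a long enough interval would
be a candidate for the WARNING mentioned above.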

-- 
Best Regards
Andy Fan




Re: testing framework for MVCC & vacuum (freeze) & heap_page_prune etc.

From: "Andrey M. Borodin"

> On 10 Dec 2024, at 08:31, Andy Fan <zhihuifan1213@163.com> wrote:
>
> I want to know if we have some existing testing framework for
> this area (design, code, licence etc).

I think isolation tests [0] are what you are looking for. These tests are
designed to test concurrent execution of various queries.
More subtle race conditions require coordination of injection points [1]
or are tested with stochastic tests [2].


Best regards, Andrey Borodin.

[0] https://git.postgresql.org/cgit/postgresql.git/tree/src/test/isolation/README
[1] https://git.postgresql.org/cgit/postgresql.git/tree/src/test/modules/test_misc/t/006_signal_autovacuum.pl#n55
[2] https://git.postgresql.org/cgit/postgresql.git/tree/contrib/amcheck/t/003_cic_2pc.pl


"Andrey M. Borodin" <x4mmm@yandex-team.ru> writes:

>> On 10 Dec 2024, at 08:31, Andy Fan <zhihuifan1213@163.com> wrote:
>> 
>> I want to know if we have some existing testing framework for
>> this area (design, code, licence etc).
>
> I think isolation tests [0] are what you are looking for. These tests are designed to test concurrent execution of
> various queries.
>
> More subtle race conditions require coordination of injection points
> [1] or are tested with stochastic tests [2].

Thanks. I have always thought the isolation tests under the t/ directory
are valuable, and they are a great way to prove that a bug is really
fixed for a known case. However, they are not perfect because:

(a) All the cases are known cases, which is different from a
high-concurrency workload that may find bugs we are not aware of.
(b) Every TAP test needs to create a new instance to run the test, so it
is hard to see this as an efficient approach for a large number of tests.

So for finding unknown bugs, I imagine the testing framework should
contain at least the following modules:

(a) Workload. Produce queries with different characteristics, like
HOT updates, EPQ updates and so on.
(b) Testing engine with pgbench. Run these queries with high
concurrency.
(c) Error detection. Errors, core dumps, wrong results and so on.
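
As a sketch of how modules (a) and (b) might fit together, the helper
below builds a pgbench command line that mixes several custom scripts by
weight, using pgbench's -f file@weight syntax. The helper name, the
hot_update.sql script, the weights and the database name are
illustrative assumptions; check_cnt.sql and vacuum_freeze.sql are the
scripts from the earlier draft.

```python
from typing import Dict, List


def pgbench_command(scripts: Dict[str, int], clients: int, jobs: int,
                    seconds: int, dbname: str) -> List[str]:
    """Build a pgbench argument list that mixes custom scripts by weight."""
    # -n skips the vacuum of the standard pgbench tables, which a
    # custom-script-only run does not have.
    cmd = ["pgbench", "-n", "-c", str(clients), "-j", str(jobs), "-T", str(seconds)]
    for path, weight in scripts.items():
        cmd += ["-f", f"{path}@{weight}"]  # script file with relative weight
    cmd.append(dbname)
    return cmd


# Hypothetical mix: mostly HOT updates, some checks, occasional freeze.
example = pgbench_command(
    {"hot_update.sql": 10, "check_cnt.sql": 2, "vacuum_freeze.sql": 1},
    clients=32, jobs=4, seconds=300, dbname="testdb",
)
```

The resulting list can be passed to subprocess.run; a non-zero exit
status or errors on stderr would then be a first input for module (c).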

Some more modules, like:
relkind module: the same column definitions with different
relkinds (plain table, partitioned table etc.), so that the same
workload can hit different code paths. Different index definitions and
so on.

parallel module: different server configurations (debug_parallel_query
etc.).

Common GUCs module: different enable_xxx configurations, comparing the
results under the Repeatable Read isolation level.
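
One possible reading of the common GUCs module, sketched under
assumptions: enumerate on/off combinations of a few planner GUCs and run
the same read-only query under each combination inside a single
Repeatable Read transaction, expecting identical results. The helper
name and the choice of GUCs in the usage line are illustrative.

```python
from itertools import product
from typing import Iterator, List, Sequence


def guc_matrix(gucs: Sequence[str]) -> Iterator[List[str]]:
    """Yield one list of SET statements per on/off combination of the GUCs."""
    for values in product(("on", "off"), repeat=len(gucs)):
        yield [f"SET {name} = {value};" for name, value in zip(gucs, values)]


# Example: 2^3 = 8 planner configurations to compare against each other.
combos = list(guc_matrix(["enable_seqscan", "enable_indexscan", "enable_bitmapscan"]))
```

Since the snapshot is fixed for the whole transaction, any difference
between the result sets would point at a plan-dependent bug rather than
at concurrent changes.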

Between roughly 2 and 0.7 years ago, we internally developed such a
testing framework and found some unknown bugs in the community version,
like [1] [2]. But that project has stopped, and I think it was kind of
heavy in implementation and test case organization, so I want to
redesign a v2 that covers only the *MVCC & vacuum (freeze) &
heap_page_prune* part (which is more related to my current work).

[1] https://www.postgresql.org/message-id/tencent_A3CE810F59132D8E230475A5F0F7A08C8307%40qq.com
[2] https://www.postgresql.org/message-id/202312121711.tzjyb5yeb3fa%40alvherre.pgsql

-- 
Best Regards
Andy Fan