Thread: testing framework for MVCC & vacuum (freeze) & heap_page_prune etc.
Hi,

There are so many situations in this area, and it is easy to break something. I want to know if we have some existing testing framework for this area (design, code, licence etc). I'm willing to design one myself, but it would be better to ask first to see if there is some existing excellent project I can start with and contribute to.

Thanks!

--
Best Regards
Andy Fan
Andy Fan <zhihuifan1213@163.com> writes:

> Hi,
>
> I'm willing to design one myself
> but it would be better to ask first to see if there is some existing
> excellent project I can start with and contribute to.

Just to show that I'm not a person who takes things for granted, here is my draft on this topic. I know that a lot is missing and that many details need more thought, which is why I was asking whether we have an existing project.

====
Key Concepts:

* MVCC

- xmin (insert, update)
- xmax (delete, update, select for update, epq query)
- CommandId (insert/update/delete -> query)
- committs (xmin/xmax)

An XID may be in one of these states: in-progress, committed/aborted, 2PC prepared.

While scanning tuples within a snapshot:

Challenge 1: the transaction state changes:
(a) new xid (in-progress).
(b) in-progress -> commit.
(c) in-progress -> abort.
(d) new xid -> commit/abort.

Challenge 2: the old version is removed by HOT-update pruning or by vacuum. HOT-update pruning needs a HOT update query.

We may also have sub-transactions; sub-transactions can be tested with "statement_level_txn".

We also need to review the above on a standby (hot-standby case).

todo: xid state crash recovery
todo: xid state on standby (read-only).
todo: multiple IDs in each transaction.

* vacuum, freeze.

Besides MVCC working correctly, we also need vacuum & freeze to work correctly.

vacuum_freeze.sql
-- run 'vacuum (freeze) w;' randomly.

* error detection:

check_cnt.sql
-- Check that the table count is 10000 rows all the time. We inject a random sleep after fetching each tuple to simulate a long query; this leaves enough time for others to change their transaction state or prune the dead tuples etc.

check_diff.sql
-- Check the result and check that it has the IDs 1...10000; only used for troubleshooting. The random sleep is there already.

check_vacuum_freeze.sql
-- a PGSQL function to check pg_stat_user_tables.n_xxx and pg_class.relfrozenxid/relminmxid; they should be able to advance after some period. If they do not, a WARNING should be raised.

--
Best Regards
Andy Fan
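The check_cnt.sql / check_diff.sql idea from the draft above could be sketched roughly as follows. This is only an illustration in Python of the comparison logic, not the SQL scripts themselves: the function name `check_ids` and the shape of its result are hypothetical, and it assumes the IDs returned by one scan have already been collected into a list.

```python
from collections import Counter


def check_ids(observed, expected_n=10000):
    """Compare the IDs seen by one scan against the expected set 1..expected_n.

    Hypothetical helper: 'observed' is the list of IDs a single query returned.
    """
    counts = Counter(observed)
    # IDs that the scan should have seen but did not (a lost tuple version).
    missing = [i for i in range(1, expected_n + 1) if counts[i] == 0]
    # IDs that the scan saw more than once (two visible versions of one row).
    duplicated = sorted(i for i, c in counts.items() if c > 1)
    # IDs outside the expected range.
    unexpected = sorted(i for i in counts if not 1 <= i <= expected_n)
    return {
        "count_ok": len(observed) == expected_n,  # what check_cnt looks at
        "missing": missing,                       # what check_diff adds
        "duplicated": duplicated,
        "unexpected": unexpected,
    }


if __name__ == "__main__":
    # The count matches, yet ID 3 is missing and ID 2 is visible twice.
    print(check_ids([1, 2, 2, 4], expected_n=4))
```

The example at the bottom shows why the draft keeps both checks: a plain row count can still be correct when one row is lost and another is seen twice, so the per-ID comparison is what helps with troubleshooting.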
Re: testing framework for MVCC & vacuum (freeze) & heap_page_prune etc.
From: "Andrey M. Borodin"
> On 10 Dec 2024, at 08:31, Andy Fan <zhihuifan1213@163.com> wrote:
>
> I want to know if we have some existing testing framework for
> this area (design, code, licence etc).

I think isolation tests [0] are what you are looking for. These tests are designed to test concurrent execution of various queries. More subtle race conditions require coordination of injection points [1] or are tested with stochastic tests [2].

Best regards, Andrey Borodin.

[0] https://git.postgresql.org/cgit/postgresql.git/tree/src/test/isolation/README
[1] https://git.postgresql.org/cgit/postgresql.git/tree/src/test/modules/test_misc/t/006_signal_autovacuum.pl#n55
[2] https://git.postgresql.org/cgit/postgresql.git/tree/contrib/amcheck/t/003_cic_2pc.pl
"Andrey M. Borodin" <x4mmm@yandex-team.ru> writes:

>> On 10 Dec 2024, at 08:31, Andy Fan <zhihuifan1213@163.com> wrote:
>>
>> I want to know if we have some existing testing framework for
>> this area (design, code, licence etc).
>
> I think isolation tests [0] are what you are looking for. These tests are designed to test concurrent execution of various queries.
> More subtle race conditions require coordination of injection points
> [1] or are tested with stochastic tests [2].

Thanks. I have always thought the isolation tests under the t/ directory are valuable, and they are a great way to prove that a bug is really fixed for a known case. However, they are not perfect because:

(a). All the cases are known cases, which is different from a high-concurrency workload that may find bugs we are not aware of.
(b). Every TAP test needs to create a new instance and run the test; it is hard to see this as an efficient way to run lots of tests.

So for finding unknown bugs, I imagine the testing framework should contain at least the following modules.

(a). Workload. Produce queries with different characteristics, like hot update, epq update and so on.
(b). Testing engine with pgbench. Run these queries with high concurrency.
(c). Error detection. Errors, core dumps, wrong results and so on.

Some more modules like:

relkind module: the same column definitions with different relkinds (plain table, partitioned table etc.), so that the same workload can hit different code paths; different index definitions and so on.

parallel module: different server configurations (debug_parallel_query etc.).

Common GUCs module: different enable_xxx configurations, comparing the results under the Repeatable Read isolation level.

In the past 2 ~ 0.7 years, we internally developed such a testing framework and found some unknown bugs in the community version, like [1] [2]. But that project has stopped, and I think it was kind of heavy in implementation and test case organization, so I want to redesign a v2 covering only the *MVCC & vacuum (freeze) & heap_page_prune* part (which is more related to my current work).

[1] https://www.postgresql.org/message-id/tencent_A3CE810F59132D8E230475A5F0F7A08C8307%40qq.com
[2] https://www.postgresql.org/message-id/202312121711.tzjyb5yeb3fa%40alvherre.pgsql

--
Best Regards
Andy Fan
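The "Common GUCs module" idea above, running the same query under different enable_xxx settings and comparing the results, might look something like the sketch below. It is a minimal illustration in Python, not part of the framework described in the thread: the function name `configs_differing_from_baseline` and the configuration labels are made up, and it assumes each configuration's rows were already fetched from the same Repeatable Read snapshot.

```python
def configs_differing_from_baseline(results, baseline):
    """Return the configurations whose rows differ from the baseline's rows.

    Hypothetical helper: 'results' maps a configuration label to the list of
    row tuples that the same query returned under that configuration.
    Rows are compared as multisets, because a different plan may legitimately
    return the same rows in a different order.
    """
    expected = sorted(results[baseline])
    return sorted(
        name
        for name, rows in results.items()
        if name != baseline and sorted(rows) != expected
    )


if __name__ == "__main__":
    results = {
        "default": [(1, "a"), (2, "b")],
        # Same rows in a different order: not a wrong result.
        "enable_hashjoin=off": [(2, "b"), (1, "a")],
        # A row is missing: this configuration would be reported.
        "enable_seqscan=off": [(1, "a")],
    }
    print(configs_differing_from_baseline(results, "default"))
```

Comparing sorted rows rather than raw output is a deliberate choice here: without an ORDER BY, a plan change alone can reorder rows, and only a difference in the set of rows should count as a wrong result.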