Re: Testing autovacuum wraparound (including failsafe) - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Testing autovacuum wraparound (including failsafe)
Msg-id CAH2-Wz=f2dfc-0BcM5X5ttXTeRNPg+JtxEMCuSZR82a7XHfRkg@mail.gmail.com
In response to Testing autovacuum wraparound (including failsafe)  (Andres Freund <andres@anarazel.de>)
Responses Re: Testing autovacuum wraparound (including failsafe)  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Fri, Apr 23, 2021 at 1:43 PM Andres Freund <andres@anarazel.de> wrote:
> I started to write a test for $Subject, which I think we sorely need.

+1

> Currently my approach is to:
> - start a cluster, create a few tables with test data
> - acquire SHARE UPDATE EXCLUSIVE in a prepared transaction, to prevent
>   autovacuum from doing anything
> - cause dead tuples to exist
> - restart
> - run pg_resetwal -x 2000027648
> - do things like acquiring pins on pages that block vacuum from progressing
> - commit prepared transaction
> - wait for template0, template1 datfrozenxid to increase
> - wait for relfrozenxid for most relations in postgres to increase
> - release buffer pin
> - wait for postgres datfrozenxid to increase

Just having a standard-ish way to do stress testing like this would
be a worthwhile addition in itself.

> 2) FAILSAFE_MIN_PAGES is 4GB - which seems to make it infeasible to test the
>    failsafe mode, we can't really create 4GB relations on the BF. While
>    writing the tests I've lowered this to 4MB...

The only reason I chose 4GB for FAILSAFE_MIN_PAGES is that the
related VACUUM_FSM_EVERY_PAGES constant was 8GB -- the latter limits
how often we'll consider the failsafe in the single-pass/no-indexes
case.
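
To give a rough sense of scale, here is a minimal sketch of what those
sizes mean in heap pages. It assumes the default 8KB BLCKSZ; the helper
name bytes_to_pages is made up for illustration and is not anything in
the tree.

```python
# Sketch only: convert the byte thresholds discussed in this thread into
# heap page counts. BLCKSZ defaults to 8192 bytes; other builds differ.
GIB = 1024 ** 3
MIB = 1024 ** 2


def bytes_to_pages(nbytes, blcksz=8192):
    """Hypothetical helper: number of blcksz-sized pages in nbytes."""
    return nbytes // blcksz


failsafe_min_pages = bytes_to_pages(4 * GIB)       # current 4GB threshold
vacuum_fsm_every_pages = bytes_to_pages(8 * GIB)   # related 8GB constant
lowered_for_test = bytes_to_pages(4 * MIB)         # the 4MB test setting

print(failsafe_min_pages, vacuum_fsm_every_pages, lowered_for_test)
```

With an 8KB block size the 4MB setting corresponds to only a few hundred
pages, which is why it is practical for small test tables.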

I see no reason why it cannot be changed now. VACUUM_FSM_EVERY_PAGES
also frustrates FSM testing in the single-pass case in about the same
way, so maybe that should be considered as well? Note that the FSM
handling for the single-pass case is actually a bit different from the
two-pass/has-indexes case, since the single-pass case calls
lazy_vacuum_heap_page() directly in its first and only pass over the
heap (that's the whole point of having it, of course).

> 3) pg_resetwal -x requires to carefully choose an xid: It needs to be the
>    first xid on a clog page. It's not hard to determine which xids are but it
>    depends on BLCKSZ and a few constants in clog.c. I've for now hardcoded a
>    value appropriate for 8KB, but ...

Ugh.
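
For what it's worth, the alignment rule can be sketched in a few lines.
This assumes the clog.c layout of 2 status bits per transaction (so 4
transactions per byte); the function names are invented for the example
and are not part of pg_resetwal or any other tool.

```python
# Sketch only: find xids that sit at the start of a clog page.
# Assumes 2 status bits per xact, as in clog.c.
CLOG_BITS_PER_XACT = 2
CLOG_XACTS_PER_BYTE = 8 // CLOG_BITS_PER_XACT  # 4 xacts per byte


def clog_xacts_per_page(blcksz=8192):
    """Number of xids whose status fits on one clog page."""
    return blcksz * CLOG_XACTS_PER_BYTE


def is_first_xid_on_clog_page(xid, blcksz=8192):
    """True if xid is the first xid on its clog page."""
    return xid % clog_xacts_per_page(blcksz) == 0


def first_xid_on_clog_page_at_or_below(xid, blcksz=8192):
    """Round xid down to the first xid of the clog page containing it."""
    per_page = clog_xacts_per_page(blcksz)
    return (xid // per_page) * per_page


# The value passed to pg_resetwal -x in the quoted test plan.
print(is_first_xid_on_clog_page(2000027648))
```

A test harness could compute a suitable value from the block size this
way instead of hardcoding one, though how it learns BLCKSZ is left open
here.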

> For 2), I don't really have a better idea than making that configurable
> somehow?

That could make sense as a developer/testing option, I suppose. I just
doubt that it makes sense as anything else.

> 2021-04-23 13:32:30.899 PDT [2027738] LOG:  automatic aggressive vacuum to prevent wraparound of table "postgres.public.small_trunc": index scans: 1
>         pages: 400 removed, 28 remain, 0 skipped due to pins, 0 skipped frozen
>         tuples: 14000 removed, 1000 remain, 0 are dead but not yet removable, oldest xmin: 2000027651
>         buffer usage: 735 hits, 1262 misses, 874 dirtied
>         index scan needed: 401 pages from table (1432.14% of total) had 14000 dead item identifiers removed
>         index "small_trunc_pkey": pages: 43 in total, 37 newly deleted, 37 currently deleted, 0 reusable
>         avg read rate: 559.048 MB/s, avg write rate: 387.170 MB/s
>         system usage: CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s
>         WAL usage: 1809 records, 474 full page images, 3977538 bytes
>
> '1432.14% of total' - looks like removed pages need to be added before the
> percentage calculation?

Clearly this needs to account for removed heap pages in order to
consistently express the percentage of pages with LP_DEAD items in
terms of a percentage of the original table size. I can fix this
shortly.
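
To spell out the arithmetic using the numbers from the log output above
(a sketch of the calculation only, not the eventual fix; pct_of_total
and the variable names are made up for illustration):

```python
# Sketch only: reproduce the reported percentage and the version that
# uses the original table size (remaining + removed pages).
def pct_of_total(lpdead_pages, rel_pages):
    """Percentage of rel_pages that had LP_DEAD items."""
    if rel_pages == 0:
        return 0.0
    return 100.0 * lpdead_pages / rel_pages


remaining_pages = 28   # "28 remain"
removed_pages = 400    # "400 removed"
lpdead_pages = 401     # "401 pages from table"

# Denominator is the post-truncation size, as in the log line.
reported = pct_of_total(lpdead_pages, remaining_pages)

# Denominator is the original size of the table.
corrected = pct_of_total(lpdead_pages, remaining_pages + removed_pages)

print(f"{reported:.2f}% vs {corrected:.2f}%")
```

Dividing by the post-truncation size is what pushes the figure past
100%; dividing by the original size keeps it within range.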

--
Peter Geoghegan


