Thread: Re: pgsql: Fix "base" snapshot handling in logical decoding
On 2018-Jun-28, Tom Lane wrote: > Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > > Fix "base" snapshot handling in logical decoding > > According to buildfarm member friarbird, and as confirmed here, > the contrib/test_decoding/specs/oldest_xmin.spec test added by this > commit fails under CLOBBER_CACHE_ALWAYS. Hm. Running this test under CLOBBER_CACHE_ALWAYS I see that the VACUUM FULL step takes 31 seconds (and the test succeeds). This is not top-of-the-line hardware by a long shot (Intel(R) Core(TM) i7-4600U CPU @ 2.10GHz) but I can believe that the other machines are slower or busier. Or there might be a completely different explanation ... -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-Jun-29, Alvaro Herrera wrote: > On 2018-Jun-28, Tom Lane wrote: > > > Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > > > Fix "base" snapshot handling in logical decoding > > > > According to buildfarm member friarbird, and as confirmed here, > > the contrib/test_decoding/specs/oldest_xmin.spec test added by this > > commit fails under CLOBBER_CACHE_ALWAYS. > > Hm. Running this test under CLOBBER_CACHE_ALWAYS I see that the VACUUM > FULL step takes 31 seconds (and the test succeeds). This is not > top-of-the-line hardware by a long shot (Intel(R) Core(TM) i7-4600U CPU > @ 2.10GHz) but I can believe that the other machines are slower or > busier. It does fail in the indicated way in CLOBBER_CACHE_RECURSIVELY, but I guess that's expected. It also fails if I reduce the timeout from 60 seconds to 25. I suppose 60 seconds (isolationtester's default timeout) is just not enough time for those machines. We could increase it to 180 seconds and see if that's enough to make them pass ... Another possibility is to examine column 'state' in isolationtester. I thought we only used that timeout for 'waiting' queries, but I see it's not doing that. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: >> On 2018-Jun-28, Tom Lane wrote: >>> According to buildfarm member friarbird, and as confirmed here, >>> the contrib/test_decoding/specs/oldest_xmin.spec test added by this >>> commit fails under CLOBBER_CACHE_ALWAYS. > I suppose 60 seconds (isolationtester's default timeout) is just not > enough time for those machines. We could increase it to 180 seconds and > see if that's enough to make them pass ... What I want to know is why this test is doing a database-wide VACUUM FULL in the first place. If that isn't profligate wastage of testing cycles, why not? regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: >>> On 2018-Jun-28, Tom Lane wrote: >>>> According to buildfarm member friarbird, and as confirmed here, >>>> the contrib/test_decoding/specs/oldest_xmin.spec test added by this >>>> commit fails under CLOBBER_CACHE_ALWAYS. > >> I suppose 60 seconds (isolationtester's default timeout) is just not >> enough time for those machines. We could increase it to 180 seconds and >> see if that's enough to make them pass ... > > What I want to know is why this test is doing a database-wide VACUUM FULL > in the first place. If that isn't profligate wastage of testing cycles, > why not? > > regards, tom lane Oh, that's my fault. I think just VACUUM pg_attribute is enough there -- it takes 42ms on my laptop with CLOBBER_CACHE_ALWAYS ('vacuum full' occupies 1 minute), patch is attached. The test is still steadily fails without the main patch. There is also one thing that puzzles me as I don't know much about vacuum internals. If I do plain VACUUM of pg_attribute in the test, it shouts "catalog is missing 1 attribute(s) for relid" error (which is quite expected), while with 'VACUUM FULL pg_attribute' the tuple is silently (and wrongly, with dropped column missing) decoded. Moreover, if I perform the test manually, and do 'VACUUM FULL;', sometimes test becomes useless -- that is, tuple is successfully decoded with all three columns, as though VACUUM was not actually executed. All this is without the main patch, of course. I think I will look into this soon. -- Arseny Sher Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/contrib/test_decoding/expected/oldest_xmin.out b/contrib/test_decoding/expected/oldest_xmin.out index d09342c4be..a5d7d3c2ca 100644 --- a/contrib/test_decoding/expected/oldest_xmin.out +++ b/contrib/test_decoding/expected/oldest_xmin.out @@ -15,7 +15,7 @@ step s0_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slo data step s1_commit: COMMIT; -step s0_vacuum: VACUUM FULL; +step s0_vacuum: VACUUM pg_attribute; step s0_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts','1'); data diff --git a/contrib/test_decoding/specs/oldest_xmin.spec b/contrib/test_decoding/specs/oldest_xmin.spec index 4f8af70aa2..cd2b69cae2 100644 --- a/contrib/test_decoding/specs/oldest_xmin.spec +++ b/contrib/test_decoding/specs/oldest_xmin.spec @@ -22,7 +22,7 @@ step "s0_getxid" { SELECT txid_current() IS NULL; } step "s0_alter" { ALTER TYPE basket DROP ATTRIBUTE mangos; } step "s0_commit" { COMMIT; } step "s0_checkpoint" { CHECKPOINT; } -step "s0_vacuum" { VACUUM FULL; } +step "s0_vacuum" { VACUUM pg_attribute; } step "s0_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0','skip-empty-xacts', '1'); } session "s1"
Arseny Sher <a.sher@postgrespro.ru> writes: > There is also one thing that puzzles me as I don't know much about > vacuum internals. If I do plain VACUUM of pg_attribute in the test, it > shouts "catalog is missing 1 attribute(s) for relid" error (which is > quite expected), while with 'VACUUM FULL pg_attribute' the tuple is > silently (and wrongly, with dropped column missing) decoded. Moreover, > if I perform the test manually, and do 'VACUUM FULL;', sometimes test > becomes useless -- that is, tuple is successfully decoded with all three > columns, as though VACUUM was not actually executed. All this is without > the main patch, of course. I think I will look into this soon. So I have been jumping around this and learned a few curious things. 1) Test in its current shape sometimes doesn't fulfill its aim indeed -- that is, despite issued VACUUM the tuple is still decoded with all three fields. This happens because during attempt to advance xmin there is a good possiblity to encounter xl_running_xacts record logged before our CHECKPOINT (they are logged each 15 seconds). We do not try to advance xmin twice without client acknowledgment, so in this case xmin will not be advanced far enough to allow vacuum prune entry from pg_attribute. 2) This is not easy to notice because often (but not always) explicit VACUUM is not needed at all: tuple is often pruned by microvacuum (heap_page_prune_opts) right in the final decoding session. If we hadn't bumped xmin far enough during previous get_changes, we do that now, so microvacuum actually purges the entry. But if we were so unfortunate that 1) extra xl_running_xacts was logged and 2) microvacuum was in bad mood and didn't come, pg_attribute is not vacuumed and test becomes useless. To make this bulletproof, in the attached patch I doubled first get_changes: now there are two client acks, so our VACUUM always does the job. 3) As a side note, answer to my question 'why do we get different errors with VACUUM and VACUUM FULL' is the following. With VACUUM FULL, not only old pg_attribute entry is pruned, but also xmin of new entry with attisdropped=true is reset to frozen xid. This means that decoding session (RelationBuildTupleDesc) actually sees 3 attributes, and the fact that one of them is dropped doesn't embarrass this function (apparently relnatts in pg_class is never decremented) -- we just go ahead and decode only live attributes. -- Arseny Sher Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/contrib/test_decoding/expected/oldest_xmin.out b/contrib/test_decoding/expected/oldest_xmin.out index d09342c4be..d1b4f17e3a 100644 --- a/contrib/test_decoding/expected/oldest_xmin.out +++ b/contrib/test_decoding/expected/oldest_xmin.out @@ -1,6 +1,6 @@ Parsed test spec with 2 sessions -starting permutation: s0_begin s0_getxid s1_begin s1_insert s0_alter s0_commit s0_checkpoint s0_get_changes s1_commit s0_vacuums0_get_changes +starting permutation: s0_begin s0_getxid s1_begin s1_insert s0_alter s0_commit s0_checkpoint s0_get_changes s0_get_changess1_commit s0_vacuum s0_get_changes step s0_begin: BEGIN; step s0_getxid: SELECT txid_current() IS NULL; ?column? @@ -14,8 +14,11 @@ step s0_checkpoint: CHECKPOINT; step s0_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts','1'); data +step s0_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts','1'); +data + step s1_commit: COMMIT; -step s0_vacuum: VACUUM FULL; +step s0_vacuum: VACUUM pg_attribute; step s0_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts','1'); data diff --git a/contrib/test_decoding/specs/oldest_xmin.spec b/contrib/test_decoding/specs/oldest_xmin.spec index 4f8af70aa2..141fe2b145 100644 --- a/contrib/test_decoding/specs/oldest_xmin.spec +++ b/contrib/test_decoding/specs/oldest_xmin.spec @@ -22,7 +22,7 @@ step "s0_getxid" { SELECT txid_current() IS NULL; } step "s0_alter" { ALTER TYPE basket DROP ATTRIBUTE mangos; } step "s0_commit" { COMMIT; } step "s0_checkpoint" { CHECKPOINT; } -step "s0_vacuum" { VACUUM FULL; } +step "s0_vacuum" { VACUUM pg_attribute; } step "s0_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0','skip-empty-xacts', '1'); } session "s1" @@ -30,8 +30,11 @@ step "s1_begin" { BEGIN; } step "s1_insert" { INSERT INTO harvest VALUES ((1, 2, 3)); } step "s1_commit" { COMMIT; } -# Checkpoint with following get_changes forces to advance xmin. ALTER of a +# Checkpoint with following get_changes forces xmin advancement. We do +# get_changes twice because if one more xl_running_xacts record had slipped +# before our CHECKPOINT, xmin will be advanced only on this record, thus not +# reaching value needed for vacuuming corresponding pg_attribute entry. ALTER of # composite type is a rare form of DDL which allows T1 to see the tuple which # will be removed (xmax set) before T1 commits. That is, interlocking doesn't # forbid modifying catalog after someone read it (and didn't commit yet). -permutation "s0_begin" "s0_getxid" "s1_begin" "s1_insert" "s0_alter" "s0_commit" "s0_checkpoint" "s0_get_changes" "s1_commit""s0_vacuum" "s0_get_changes" +permutation "s0_begin" "s0_getxid" "s1_begin" "s1_insert" "s0_alter" "s0_commit" "s0_checkpoint" "s0_get_changes" "s0_get_changes""s1_commit""s0_vacuum" "s0_get_changes"
Hi Arseny. I'm writing a commit message to push this test change, and I can't explain this bit: On 2018-Jul-02, Arseny Sher wrote: > 3) As a side note, answer to my question 'why do we get different errors > with VACUUM and VACUUM FULL' is the following. With VACUUM FULL, not > only old pg_attribute entry is pruned, but also xmin of new entry > with attisdropped=true is reset to frozen xid. This means that > decoding session (RelationBuildTupleDesc) actually sees 3 attributes, > and the fact that one of them is dropped doesn't embarrass this > function (apparently relnatts in pg_class is never decremented) -- > we just go ahead and decode only live attributes. I just don't see it that VACUUM FULL would change the xmin of anything to FrozenXid, and in my experiments it doesn't. Did you mean VACUUM FREEZE? PS - sorry about the broken CC I added :-( -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thanks for the detective work! I pushed this test change. (I still don't quite understand what you mean with the freezing part, though.) -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > I just don't see it that VACUUM FULL would change the xmin of anything > to FrozenXid, and in my experiments it doesn't. Did you mean VACUUM > FREEZE? Well, docs for VACUUM say: FREEZE Selects aggressive “freezing” of tuples. Specifying FREEZE is equivalent to performing VACUUM with the vacuum_freeze_min_age and vacuum_freeze_table_age parameters set to zero. Aggressive freezing is always performed when the table is rewritten, so this option is redundant when FULL is specified. So this is literally the same. rewrite_heap_tuple does the job. > Thanks for the detective work! I pushed this test change. Thank you, I appreciate this. -- Arseny Sher Postgres Professional: http://www.postgrespro.com The Russian Postgres Company