Re: logical decoding and replication of sequences - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: logical decoding and replication of sequences
Msg-id: 5709e125-a50c-e386-3dce-1d662b588ac7@enterprisedb.com
In response to: Re: logical decoding and replication of sequences (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: logical decoding and replication of sequences (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers

On 3/7/22 22:25, Tomas Vondra wrote:
> 
> 
> On 3/7/22 17:53, Tomas Vondra wrote:
>> On 2/28/22 12:46, Amit Kapila wrote:
>>> On Sat, Feb 12, 2022 at 6:04 AM Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>
>>>> On 2/10/22 19:17, Tomas Vondra wrote:
>>>>> I've polished & pushed the first part adding sequence decoding
>>>>> infrastructure etc. Attached are the two remaining parts.
>>>>>
>>>>> I plan to wait a day or two and then push the test_decoding part. The
>>>>> last part (for built-in replication) will need more work and maybe
>>>>> rethinking the grammar etc.
>>>>>
>>>>
>>>> I've pushed the second part, adding sequences to test_decoding.
>>>>
>>>
>>> The test_decoding tests have been failing randomly in the last few
>>> days. I am not completely sure, but the failures might be related to
>>> this work. Two of them appear to be due to the same reason:
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-02-25%2018%3A50%3A09
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2022-02-17%2015%3A17%3A07
>>>
>>> TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File:
>>> "reorderbuffer.c", Line: 1173, PID: 35013)
>>> 0   postgres                            0x00593de0 ExceptionalCondition + 160\\0
>>>
>>
>> This might be related to the issue reported by Amit, i.e. that
>> sequence_decode does not call ReorderBufferProcessXid(). If this keeps
>> failing, we'll have to add some extra debug info (logging LSN etc.), at
>> least temporarily. It'd be valuable to inspect the WAL too.
>>
>>> Another:
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2022-02-16%2006%3A21%3A48
>>>
>>> --- /home/nm/farm/xlc32/HEAD/pgsql.build/contrib/test_decoding/expected/rewrite.out
>>> 2022-02-14 20:19:14.000000000 +0000
>>> +++ /home/nm/farm/xlc32/HEAD/pgsql.build/contrib/test_decoding/results/rewrite.out
>>> 2022-02-16 07:42:18.000000000 +0000
>>> @@ -126,6 +126,7 @@
>>>    table public.replication_example: INSERT: id[integer]:4
>>> somedata[integer]:3 text[character varying]:null
>>> testcolumn1[integer]:null
>>>    table public.replication_example: INSERT: id[integer]:5
>>> somedata[integer]:4 text[character varying]:null
>>> testcolumn1[integer]:2 testcolumn2[integer]:1
>>>    COMMIT
>>> +  sequence public.replication_example_id_seq: transactional:0
>>> last_value: 38 log_cnt: 0 is_called:1
>>>    BEGIN
>>>    table public.replication_example: INSERT: id[integer]:6
>>> somedata[integer]:5 text[character varying]:null
>>> testcolumn1[integer]:3 testcolumn2[integer]:null
>>>    COMMIT
>>> @@ -133,7 +134,7 @@
>>>    table public.replication_example: INSERT: id[integer]:7
>>> somedata[integer]:6 text[character varying]:null
>>> testcolumn1[integer]:4 testcolumn2[integer]:null
>>>    table public.replication_example: INSERT: id[integer]:8
>>> somedata[integer]:7 text[character varying]:null
>>> testcolumn1[integer]:5 testcolumn2[integer]:null
>>> testcolumn3[integer]:1
>>>    COMMIT
>>> - (15 rows)
>>> + (16 rows)
>>>
>>
>> Interesting. I can think of one reason that might cause this - we log
>> the first sequence increment after a checkpoint. So if a checkpoint
>> happens in an unfortunate place, there'll be an extra WAL record. On
>> slow / busy machines that's quite possible, I guess.
>>
> 
> I've tweaked the checkpoint_interval to make checkpoints more aggressive
> (set it to 1s), and it seems my hunch was correct - it produces failures
> exactly like this one. The best fix probably is to just disable decoding
> of sequences in those tests that are not aimed at testing sequence decoding.
> 

I've pushed a fix for this, adding "include-sequences=0" to a couple of
test_decoding tests that were failing because of concurrent checkpoints.
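
Just to illustrate, the option is passed like any other test_decoding
option, as a name/value pair to pg_logical_slot_get_changes() (the slot
name here is only an example):

  SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
      'include-xids', '0', 'skip-empty-xacts', '1', 'include-sequences', '0');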

Unfortunately, I realized we have a similar issue in the "sequences"
tests too :-( Imagine you do a series of sequence increments, e.g.

  SELECT nextval('s') FROM generate_series(1, 100);

If there's a concurrent checkpoint, this may add an extra WAL record,
affecting the decoded output (and also the data stored in the sequence
relation itself). Not sure what to do about this ...
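
To make the race a bit more concrete, here is roughly what it looks like
(the object names are made up, this is just a sketch):

  CREATE SEQUENCE s;
  SELECT pg_create_logical_replication_slot('seq_slot', 'test_decoding');

  SELECT nextval('s') FROM generate_series(1, 100);

  -- Normally this yields a fixed number of "sequence ..." rows, but a
  -- checkpoint in the middle of the nextval() loop forces the first
  -- increment after it to be WAL-logged again, producing one extra
  -- decoded record (and a different last_value in the output).
  SELECT count(*) FROM pg_logical_slot_get_changes('seq_slot', NULL, NULL)
   WHERE data LIKE 'sequence%';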


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


