Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date
Msg-id e9fd0b47-4fe9-0c6e-78d1-c288ac000ca7@2ndquadrant.com
In response to Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions  (Erik Rijkers <er@xs4all.nl>)
Responses Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions  (Erik Rijkers <er@xs4all.nl>)
List pgsql-hackers

On 12/24/2017 10:00 AM, Erik Rijkers wrote:
>>>>>
>>>>> logical replication of 2 instances is OK but 3 and up fail with:
>>>>>
>>>>> TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
>>>>> "reorderbuffer.c", Line: 1773)
>>>>>
>>>>> I can cobble up a script but I hope you have enough from the assertion
>>>>> to see what's going wrong...
>>>>
>>>> The assertion says that the iterator produces changes in an order
>>>> that does not correlate with LSN. But I have a hard time
>>>> understanding how that could happen, particularly because according
>>>> to the line number this happens in ReorderBufferCommit(), i.e. the
>>>> current (non-streaming) case.
>>>>
>>>> So instructions to reproduce the issue would be very helpful.
>>>
>>> Using:
>>>
>>> 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch
>>> 0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch
>>> 0003-Issue-individual-invalidations-with-wal_level-log-v2.patch
>>> 0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch
>>> 0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch
>>> 0006-Add-support-for-streaming-to-built-in-replication-v2.patch
>>>
>>> As you expected the problem is the same with these new patches.
>>>
>>> I have now tested more, and seen that it does not always fail.  I
>>> guess that here it fails 3 times out of 4.  But the laptop I'm using
>>> at the moment is old and slow -- it may well be a factor as we've
>>> seen before [1].
>>>
>>> Attached is the bash that I put together.  I tested with
>>> NUM_INSTANCES=2, which yields success, and NUM_INSTANCES=3, which fails
>>> often.  This same program run with HEAD never seems to fail (I tried a
>>> few dozen times).
>>>
>>
>> Thanks. Unfortunately I still can't reproduce the issue. I even tried
>> running it in valgrind, to see if there are some memory access issues
>> (which should also slow it down significantly).
> 
> One wonders again if 2ndquadrant shouldn't invest in some old hardware ;)
> 

Well, I've done tests on various machines, including some really slow
ones, and I still haven't managed to reproduce the failures using your
script. So I don't think that would really help. But I have reproduced
it by using a custom stress test script.

Turns out the asserts are overly strict - instead of

  Assert(prev_lsn < current_lsn);

it should have been

  Assert(prev_lsn <= current_lsn);

because some XLOG records may contain multiple rows (e.g. MULTI_INSERT).
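To illustrate the difference between the two checks, here is a minimal
sketch (not the code from the patch). It assumes a simplified Change
struct carrying only an LSN; the names Change, changes_in_lsn_order and
multi_insert_changes are hypothetical. Rows decoded from a single
multi-insert record share the LSN of that record, so the strict
comparison rejects a sequence that the relaxed one accepts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's WAL position type (an assumption here). */
typedef uint64_t XLogRecPtr;

/* Hypothetical, simplified change: only the LSN of its WAL record. */
typedef struct Change
{
    XLogRecPtr  lsn;
} Change;

/*
 * Return true if the changes are ordered by LSN.  With strict = true
 * this mirrors the old check (prev < cur); with strict = false it
 * mirrors the relaxed check (prev <= cur).
 */
bool
changes_in_lsn_order(const Change *changes, size_t n, bool strict)
{
    for (size_t i = 1; i < n; i++)
    {
        XLogRecPtr  prev = changes[i - 1].lsn;
        XLogRecPtr  cur = changes[i].lsn;

        if (strict ? !(prev < cur) : !(prev <= cur))
            return false;
    }
    return true;
}

/* Assert-style wrapper using the relaxed comparison. */
void
assert_change_lsn_order(const Change *changes, size_t n)
{
    assert(changes_in_lsn_order(changes, n, false));
}

/* Three rows from one multi-insert record, then a later record. */
const Change multi_insert_changes[] = {
    {0x1000}, {0x1000}, {0x1000}, {0x1040}
};
```

With this data the strict variant reports a violation at the second
row, while the relaxed variant passes; a genuinely decreasing LSN still
fails both.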

The attached v3 fixes this issue, and also a couple of other thinkos:

1) The AssertChangeLsnOrder assert check was somewhat broken.

2) We've been sending aborts for all subtransactions, even those not yet
streamed. So downstream got confused and fell over because of an assert.

3) The streamed transactions were written to /tmp, with filenames built
from the subscription OID and the XID of the toplevel transaction. That's fine, as
long as there's just a single replica running - if there are more, the
filenames will clash, causing really strange failures. So move the files
to base/pgsql_tmp where regular temporary files are written. I'm not
claiming this is perfect, perhaps we need to invent another location.
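
The filename clash in (3) can be sketched as follows. This is only an
illustration: the "%u-%u.changes" name format, the function names and
the data directory paths are assumptions, not what the patch actually
uses. The point is that a path depending only on (subscription OID,
toplevel XID) under a shared directory is identical for two replicas on
the same machine, while a path rooted in each instance's data directory
is not.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Old scheme (hypothetical format): shared /tmp, name from subid and xid. */
int
changes_path_shared(char *buf, size_t len, uint32_t subid, uint32_t xid)
{
    return snprintf(buf, len, "/tmp/%u-%u.changes",
                    (unsigned) subid, (unsigned) xid);
}

/* New scheme (hypothetical format): rooted in the instance's data directory. */
int
changes_path_per_instance(char *buf, size_t len, const char *datadir,
                          uint32_t subid, uint32_t xid)
{
    return snprintf(buf, len, "%s/base/pgsql_tmp/%u-%u.changes",
                    datadir, (unsigned) subid, (unsigned) xid);
}

/* Two replicas clash when they would write to the same path. */
bool
paths_clash(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Two replicas that happen to use the same subscription OID and receive
the same toplevel XID get the same path under the first scheme and
distinct paths under the second, as long as their data directories
differ.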

FWIW I believe the relation sync cache is somewhat broken by the
streaming. I thought resetting it would be good enough, but it's more
complicated (and trickier) than that. I'm aware of it, and I'll look
into that next - but probably not before 2018.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
