Re: BRIN summarization vs. WAL logging - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: BRIN summarization vs. WAL logging
Date
Msg-id c049a38c-b124-ebfc-18b7-9edfec746f58@enterprisedb.com
In response to Re: BRIN summarization vs. WAL logging  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
List pgsql-hackers

On 1/26/22 19:14, Alvaro Herrera wrote:
> On 2022-Jan-26, Robert Haas wrote:
> 
>> On Tue, Jan 25, 2022 at 10:12 PM Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
> 
>>> 2) brin_summarize_range()
>>>
>>> Now, the issue I think is more serious, more likely to happen, and
>>> harder to fix. When summarizing a range, we write two WAL records:
>>>
>>> INSERT heapBlk 2 pagesPerRange 2 offnum 2, blkref #0: rel 1663/63 ...
>>> SAMEPAGE_UPDATE offnum 2, blkref #0: rel 1663/63341/73957 blk 2
>>>
>>> So, what happens if we lost the second WAL record, e.g. due to a crash?
>>
>> Ouch. As you say, XLogFlush() won't fix that. The WAL logging scheme
>> needs to be redesigned.
> 
> I'm not sure what a good fix is.  I was thinking that maybe if a
> placeholder tuple is found during index scan, and the corresponding
> process is no longer running, then the index scanner would remove the
> placeholder tuple, making the range unsummarized again.  However, how
> would we know that the process is gone?
> 
> Another idea is to use WAL's rm_cleanup: during replay, remember that a
> placeholder tuple was seen, then remove the info if we see an update
> from the originating process that replaces the placeholder tuple with a
> real one; at cleanup time, if the list of remembered placeholder tuples
> is nonempty, remove them.
> 
> (I vaguely recall we used the WAL rm_cleanup mechanism for something
> like this, but we no longer do AFAICS.)
> 
> ... Oh, but if there is a promotion involved, we may end up with a
> placeholder insertion before the promotion and the update afterwards.
> That would probably not be handled well.
> 

Right, I think we'd miss those. And can't that happen for regular
recovery too? If the placeholder tuple is before the redo LSN, we'd miss
it too, right? But something prevents that.
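
To make the concern concrete, here is a toy Python model of the
"remember placeholders during replay, remove them at cleanup" idea. It is
only a sketch, not PostgreSQL code: the Rec type, the record kinds and
both function names are made up for illustration, and it assumes that
everything before the redo LSN is already on disk and is never replayed.
Whether real recovery can actually hit the last case is exactly the open
question above.

```python
from typing import NamedTuple


class Rec(NamedTuple):
    """Hypothetical, simplified WAL record (not the real BRIN format)."""
    lsn: int
    kind: str      # "INSERT_PLACEHOLDER" or "UPDATE"
    range_no: int  # block range the record touches


def replay(wal, redo_lsn):
    """Replay records at or after redo_lsn, remembering placeholders.

    Returns the ranges an end-of-recovery cleanup step would reset.
    """
    pending = set()
    for rec in wal:
        if rec.lsn < redo_lsn:
            continue  # before the redo point: never seen by replay
        if rec.kind == "INSERT_PLACEHOLDER":
            pending.add(rec.range_no)
        elif rec.kind == "UPDATE":
            pending.discard(rec.range_no)  # placeholder was replaced
    return pending


def stale_placeholders(wal, redo_lsn):
    """Ranges still holding a placeholder that cleanup does not know about."""
    on_disk = set()
    for rec in wal:
        if rec.kind == "INSERT_PLACEHOLDER":
            on_disk.add(rec.range_no)
        elif rec.kind == "UPDATE":
            on_disk.discard(rec.range_no)
    return on_disk - replay(wal, redo_lsn)


if __name__ == "__main__":
    complete = [Rec(100, "INSERT_PLACEHOLDER", 2), Rec(110, "UPDATE", 2)]
    lost_update = [Rec(100, "INSERT_PLACEHOLDER", 2)]  # crash before UPDATE

    print(replay(complete, redo_lsn=50))               # nothing to clean up
    print(replay(lost_update, redo_lsn=50))            # range 2 gets cleaned
    print(stale_placeholders(lost_update, redo_lsn=105))  # range 2 is missed
```

In this model, the tracking works as long as the placeholder insert is
replayed, but a placeholder inserted before the redo LSN never enters the
remembered list, so the cleanup step has nothing to remove.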

I think the root cause is that the two WAL records are not linked
together, so we fail to ensure the atomicity of the operation. Trying to
fix this by tracking placeholder tuples seems like treating the symptoms.

The easiest solution would be to link the records by XID, but of course
that goes against the whole placeholder tuple idea, because then no one
else could modify the placeholder tuple in between. Maybe that's a price
we have to pay.

> 
> One thing not completely clear to me is whether this only affects
> placeholder tuples.  Is it possible to have this problem with regular
> BRIN tuples?  I think it isn't.
> 

Yeah. I've been playing with this for a while, trying to cause issues
with non-placeholder tuples. But I think that's fine - once the tuple
becomes non-placeholder, all subsequent updates are part of the
transaction that modifies the table. So if that transaction fails, we
don't update the index tuple.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
