Hi,
Sorry for the delay, I didn't have time to come back to this until this afternoon.
On Mon, Apr 10, 2023 at 09:18:46AM +0000, Hayato Kuroda (Fujitsu) wrote:
>
> I have analyzed the point, but it seems difficult. This is because some
> additional records like the following may be inserted. PSA the script used
> for testing. Note that the "double CHECKPOINT_SHUTDOWN" issue might be
> wrong, so I want to withdraw it for now. Sorry for the noise.
>
> * HEAP/HEAP2 records. These records may be inserted by the checkpointer.
>
> IIUC, if there are tuples which have not been flushed yet when a shutdown is
> requested, the checkpointer writes all of them back to the heap files. At
> that point many WAL records are generated, and I think we cannot predict the
> number of records beforehand.
>
> * INVALIDATION(S) records. These records may be inserted by VACUUM.
>
> There is a possibility that autovacuum runs and generates WAL records. I
> think we cannot predict the number of records beforehand because it depends
> on the number of objects.
>
> * RUNNING_XACTS record
>
> It might be a timing issue, but I found that the background writer sometimes
> generated an XLOG_RUNNING_XACTS record. According to the function
> BackgroundWriterMain(), it is generated when 15 seconds have passed since
> the last logging and important records have been written in the meantime. I
> think it is difficult to predict whether this record will appear or not.
I don't think that your analysis is correct. Slots are guaranteed to be
stopped after all the normal backends have been stopped, exactly to avoid such
extraneous records.
What is happening here is that the slot's confirmed_flush_lsn is properly
updated in memory and ends up being the same as the current LSN before the
shutdown. But as it's a logical slot and those records aren't decoded, the
slot isn't marked as dirty and therefore isn't saved to disk. You don't see
that behavior when doing a manual checkpoint first (per your script comment):
in that case the checkpoint also tries to save the slot to disk, finds it
marked as dirty, and therefore saves it.
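In code terms, the sequence looks roughly like this (a simplified sketch from
memory, not the actual source):

/*
 * Sketch of LogicalConfirmReceivedLocation(): when a confirmation only
 * advances confirmed_flush (no catalog_xmin / restart_lsn change), the slot
 * is updated in memory but never marked dirty.
 */
void
LogicalConfirmReceivedLocation(XLogRecPtr lsn)
{
	bool		updated_xmin_or_restart = false;	/* set while advancing them */

	/* ... possibly advance catalog_xmin / restart_lsn here ... */

	if (updated_xmin_or_restart)
	{
		ReplicationSlotMarkDirty();
		ReplicationSlotSave();	/* the slot file on disk gets updated */
	}
	else
	{
		/* fast path: in-memory update only, the slot stays clean */
		SpinLockAcquire(&MyReplicationSlot->mutex);
		MyReplicationSlot->data.confirmed_flush = lsn;
		SpinLockRelease(&MyReplicationSlot->mutex);
	}
}

/*
 * And the checkpointer (shutdown checkpoint included) silently skips clean
 * slots, so the on-disk file keeps the old confirmed_flush_lsn, which is
 * what gets restored at the next startup.
 */
static void
SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
{
	if (!slot->dirty)
		return;				/* nothing to write */

	/* ... write and fsync the slot's state file ... */
}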
In your script's scenario, when you restart the server the previous slot data
is restored and the confirmed_flush_lsn goes backward, which explains those
extraneous records.
It's probably harmless to throw away that value for now (and it probably
doesn't lead to a crazy amount of work after a restart either, though I don't
know much about the logical slot code), but it clearly becomes problematic
with your use case. One easy way to fix this is to teach the checkpoint code
to force-save the logical slots to disk during a shutdown checkpoint, even if
they're not marked as dirty, as done in the attached v1 patch (renamed to
.txt so it doesn't interfere with the cfbot). With this patch applied I
reliably see only a final shutdown checkpoint record with your scenario.
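The gist of the change, as a simplified sketch (not the patch itself):

/*
 * Sketch only: pass a flag down from the shutdown checkpoint and mark
 * logical slots dirty so that SaveSlotToPath() doesn't skip them, which
 * persists the in-memory confirmed_flush_lsn.
 */
void
CheckPointReplicationSlots(bool is_shutdown)	/* hypothetical new parameter */
{
	for (int i = 0; i < max_replication_slots; i++)
	{
		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];

		if (!s->in_use)
			continue;

		if (is_shutdown && SlotIsLogical(s))
		{
			SpinLockAcquire(&s->mutex);
			s->just_dirtied = true;
			s->dirty = true;
			SpinLockRelease(&s->mutex);
		}

		/* ... existing code: build the path and call SaveSlotToPath() ... */
	}
}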
Now, such a change will make shutdown a bit more expensive when using logical
replication, even though in 99% of cases there will be no need to save the
confirmed_flush_lsn value, so I don't know whether that's acceptable or not.