RE: archive status ".ready" files may be created too early - Mailing list pgsql-hackers

From matsumura.ryo@fujitsu.com
Subject RE: archive status ".ready" files may be created too early
Date
Msg-id OSAPR01MB502711B9C35BC2BE07943E91E88F0@OSAPR01MB5027.jpnprd01.prod.outlook.com
In response to Re: archive status ".ready" files may be created too early  ("Bossart, Nathan" <bossartn@amazon.com>)
Responses Re: archive status ".ready" files may be created too early
List pgsql-hackers
2020-03-26 18:50:24 Bossart, Nathan <bossartn(at)amazon(dot)com> wrote:
> The v3 patch is a proof-of-concept patch that moves the ready-for-
> archive logic to the WAL writer process.  We mark files as ready-for-
> archive when the WAL flush pointer has advanced beyond a known WAL
> record boundary.


I like such a simple resolution, but I cannot agree with it.

1.
This patch gives wal_writer_delay two meanings. For example,
a user who sets the parameter to a larger value gets archived files
later.

2.
Even if we create a new parameter, neither we nor users can determine
the best value.

3.
PostgreSQL guarantees that if a database cluster is stopped with a smart
shutdown, the cluster has flushed and archived all WAL records, as follows.

 [xlog.c]
  * If archiving is enabled, rotate the last XLOG file so that all the
  * remaining records are archived (postmaster wakes up the archiver
  * process one more time at the end of shutdown). The checkpoint
  * record will go to the next XLOG file and won't be archived (yet).

Therefore, the idea may require end-of-shutdown synchronization between
WalWriter and the archiver (pgarch).  I cannot agree with that, because
shutdown processing is inherently complex and the synchronization makes
it more complicated.  Your idea gives up timeliness of the notification
in exchange for simplicity, but I think the synchronization may ruin
that merit.

4.
I found that the patch misses a chance to notify.  We have to be more careful.
In the following case, WalWriter notifies only after a little less than 3 times
wal_writer_delay in the worst case.  That may be unacceptable depending on the
value of wal_writer_delay. If we create a new parameter, we cannot explain this
behavior to users.

Premise:
- Seg1 has already been notified.
- FlushedPtr is 0/2D00000 (= all WAL records are flushed).

-----
Step 1.
Backend-A updates InsertPtr to 0/2E00000, but has not yet
copied its WAL record to the buffer.

Step 2. (sleep)
WalWriter memorizes InsertPtr 0/2E00000 in a local variable
(LocalInsertPtr) and sleeps because FlushedPtr has not passed
InsertPtr.

Step 3.
Backend-A copies its WAL record to the buffer.

Step 4.
Backend-B updates InsertPtr to 0/3100000,
copies its record to the buffer, commits (flushing it by itself),
and updates FlushedPtr to 0/3100000.

Step 5.
WalWriter detects that FlushedPtr (0/3100000) has passed
LocalInsertPtr (0/2E00000), but WalWriter cannot notify Seg2
even though it should be notified.

This happens because WalWriter does not know which record
crosses the segment boundary.

Seg2 is then notified only after, in the worst case, two more
sleeps spent checking again that InsertPtr passes FlushedPtr.

Step 6. (sleep)
WalWriter sleeps.

Step 7.
Backend-C inserts a WAL record, flushes it, and updates the pointers as follows:
InsertPtr --> 0/3200000
FlushedPtr --> 0/3200000

Step 8.
Backend-D updates InsertPtr to 0/3300000, but has not yet copied
its record to the buffer.

Step 9. (sleep)
WalWriter memorizes InsertPtr 0/3300000 in LocalInsertPtr
and sleeps because FlushedPtr is still 0/3200000.

Step 10.
Backend-D copies its record.

Step 11.
Someone (Backend-X or WalWriter) flushes and updates FlushedPtr
to 0/3300000.

Step 12.
WalWriter detects that FlushedPtr (0/3300000) has reached
LocalInsertPtr (0/3300000) and notifies Seg2.
-----
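
The steps above can be replayed with a small toy model. This is only a
sketch under stated assumptions, not code from the v3 patch: the class
ToyWalWriter, its wake() method, and the rule "memorize InsertPtr on one
wake-up, notify on a later one" are hypothetical names and behavior
inferred from Steps 1-12, and 16 MB segments are assumed.

```python
SEG_SIZE = 0x1000000  # 16 MB WAL segments (the PostgreSQL default), assumed


class ToyWalWriter:
    """Hypothetical model of the wake-up logic in Steps 1-12 (not the patch)."""

    def __init__(self, last_notified_seg):
        self.last_notified_seg = last_notified_seg
        self.local_insert_ptr = None  # LocalInsertPtr in the steps above

    def wake(self, insert_ptr, flushed_ptr):
        """Run one wake-up and return the segment numbers it notifies."""
        if self.local_insert_ptr is None:
            # Memorize a known record boundary, then go back to sleep.
            self.local_insert_ptr = insert_ptr
            return []
        if flushed_ptr < self.local_insert_ptr:
            return []  # the memorized boundary has not been flushed yet
        # Only segments ending at or before the memorized boundary are
        # known to be safe; anything later might hold a partial record.
        last_safe_seg = self.local_insert_ptr // SEG_SIZE - 1
        notified = list(range(self.last_notified_seg + 1, last_safe_seg + 1))
        self.last_notified_seg = max(self.last_notified_seg, last_safe_seg)
        self.local_insert_ptr = None
        return notified


def run_scenario():
    """Replay the four WalWriter wake-ups from the steps above."""
    writer = ToyWalWriter(last_notified_seg=1)  # premise: Seg1 notified
    return [
        writer.wake(0x2E00000, 0x2D00000),  # Step 2: memorize 0/2E00000
        writer.wake(0x3100000, 0x3100000),  # Step 5: Seg2 flushed, but missed
        writer.wake(0x3300000, 0x3200000),  # Step 9: memorize 0/3300000
        writer.wake(0x3300000, 0x3300000),  # Step 12: Seg2 notified at last
    ]
```

In this model run_scenario() returns [[], [], [], [2]]: Seg2 is fully
flushed by the second wake-up, but it is notified only on the fourth,
three sleeps after the first.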


I'm preparing a patch in which a backend inserting a WAL record that
crosses a segment boundary leaves its EndRecPtr, and whoever flushes
that record checks the EndRecPtr and notifies.
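
As a rough illustration of that idea, here is a minimal sketch. It is
not the patch being prepared: BoundaryTracker, on_insert() and
on_flush() are hypothetical names, the per-segment dictionary is just
one possible bookkeeping choice, and 16 MB segments are assumed.

```python
SEG_SIZE = 0x1000000  # 16 MB WAL segments (the PostgreSQL default), assumed


class BoundaryTracker:
    """Hypothetical bookkeeping for records that cross a segment boundary."""

    def __init__(self, last_notified_seg):
        self.last_notified_seg = last_notified_seg
        self.cross_end = {}  # segment number -> EndRecPtr of crossing record

    def on_insert(self, start_ptr, end_ptr):
        """Called by the inserting backend; leaves EndRecPtr if it crosses."""
        first_seg = start_ptr // SEG_SIZE
        last_seg = (end_ptr - 1) // SEG_SIZE
        for seg in range(first_seg, last_seg):
            self.cross_end[seg] = end_ptr

    def on_flush(self, flushed_ptr):
        """Called by whoever flushes; returns segments that became ready."""
        ready = []
        seg = self.last_notified_seg + 1
        while True:
            seg_end = (seg + 1) * SEG_SIZE
            # A segment is ready once the record crossing its end (if any)
            # is flushed completely; otherwise once the segment itself is.
            needed = self.cross_end.get(seg, seg_end)
            if flushed_ptr < needed:
                break
            ready.append(seg)
            self.cross_end.pop(seg, None)
            seg += 1
        self.last_notified_seg += len(ready)
        return ready
```

With the pointers from the scenario above, Backend-B's record runs from
0/2E00000 to 0/3100000 and crosses the end of Seg2. After
on_insert(0x2E00000, 0x3100000), a flush up to 0/3000000 notifies
nothing, and Backend-B's own flush up to 0/3100000 notifies Seg2
immediately, without waiting for a WalWriter wake-up.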


Regards
Ryo Matsumura
