Thread: base backup vs. concurrent truncation

base backup vs. concurrent truncation

From: Robert Haas
Hi,

Apologies if this has already been discussed someplace, but I couldn't
find a previous discussion. It seems to me that base backups are
broken in the face of a concurrent truncation that reduces the number
of segments in a relation.

Suppose we have a relation that is 1.5GB in size, so that we have two
files 23456, which is 1GB, and 23456.1, which is 0.5GB. We'll back
those files up in whichever order the directory scan finds them.
Suppose we back up 23456.1 first. Then the relation is truncated to
0.5GB, so 23456.1 is removed and 23456 gets a lot shorter. Next we
back up the file 23456. Now our backup contains files 23456 and
23456.1, each 0.5GB. But this breaks the invariant in md.c:

 *      On disk, a relation must consist of consecutively numbered segment
 *      files in the pattern
 *              -- Zero or more full segments of exactly RELSEG_SIZE blocks each
 *              -- Exactly one partial segment of size 0 <= size <
RELSEG_SIZE blocks
 *              -- Optionally, any number of inactive segments of size 0 blocks.

basebackup.c's theory about relation truncation is that it doesn't
really matter because WAL replay will fix things up. But in this case,
I don't think it will, because WAL replay relies on the above
invariant holding. As mdnblocks says:

                /*
                 * If segment is exactly RELSEG_SIZE, advance to next one.
                 */
                segno++;

So I think what's going to happen is we're not going to notice 23456.1
when we recover the backup. It will just sit there as an orphaned file
forever, unless we extend 23456 back to a full 1GB, at which point we
might abruptly start considering that file part of the relation again.
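
To make the failure mode concrete, here is a minimal standalone sketch
(plain POSIX C, not the actual md.c code; the helper name sketch_nblocks
and the hard-coded path are made up for illustration) of an
mdnblocks()-style scan. Because it only advances to segment N+1 when
segment N is exactly RELSEG_SIZE blocks long, a stray 23456.1 sitting
behind a short 23456 is simply never visited:

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

#define BLCKSZ      8192        /* default block size */
#define RELSEG_SIZE 131072      /* blocks per 1GB segment */

/* Illustrative helper: relation size as an mdnblocks()-style scan sees it. */
static long
sketch_nblocks(const char *relpath)
{
    long        nblocks = 0;
    int         segno = 0;
    char        path[1024];
    struct stat st;

    for (;;)
    {
        if (segno == 0)
            snprintf(path, sizeof(path), "%s", relpath);
        else
            snprintf(path, sizeof(path), "%s.%d", relpath, segno);

        if (stat(path, &st) != 0)
            return nblocks;     /* no such segment: the relation ends here */

        nblocks += st.st_size / BLCKSZ;

        /*
         * If the segment is exactly RELSEG_SIZE, advance to the next one.
         * A partial (or empty) segment terminates the scan, so any
         * higher-numbered file left behind by a backup or crash is
         * invisible from this point on.
         */
        if (st.st_size / BLCKSZ < RELSEG_SIZE)
            return nblocks;
        segno++;
    }
}

int
main(void)
{
    printf("visible blocks: %ld\n", sketch_nblocks("base/5/23456"));
    return 0;
}
```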

Assuming I'm not wrong about all of this, the question arises: whose
fault is this, and what to do about it? It seems to me that it's a bit
hard to blame basebackup.c, because if you used pg_backup_start() and
pg_backup_stop() and copied the directory yourself, you'd have exactly
the same situation, and while we could (and perhaps should) teach
basebackup.c to do something smarter, it doesn't seem realistic to
impose complex constraints on the user's choice of file copy tool.
Furthermore, I think that the problem could arise without performing a
backup at all: say that the server crashes on the OS level in
mid-truncation, and the truncation of segment 0 reaches disk but the
removal of segment 1 does not.

So I think the problem is with md.c assuming that its invariant must
hold on a cluster that's not guaranteed to be in a consistent state.
But mdnblocks() clearly can't try to open every segment up to whatever
the maximum theoretical possible segment number is every time it's
invoked, because that would be wicked expensive. An idea that occurs
to me is to remove all segment files following the first partial
segment during startup, before we begin WAL replay. If that state
occurs at startup, then either we have a scenario involving
truncation, like those above, or a scenario involving relation
extension, where we added a new segment and that made it to disk but
the prior extension of the previous last segment file to maximum
length did not. But in that case, WAL replay should, I think, fix
things up. However, I'm not completely sure that there isn't some hole
in this theory, and this way forward also doesn't sound particularly
cheap. Nonetheless I don't have another idea right now.
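
Sketched very roughly (plain POSIX C, not a patch; cleanup_orphan_segments,
the probe limit, and the path are invented for illustration), the startup
pass I have in mind would look something like this, though a real version
would presumably be driven by a scan of the database directory rather than
probing segment numbers up to an arbitrary bound:

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

#define BLCKSZ        8192
#define RELSEG_SIZE   131072    /* blocks per 1GB segment */
#define PROBE_LIMIT   64        /* arbitrary bound, for the sketch only */

/* Remove every segment file that follows the first partial segment. */
static void
cleanup_orphan_segments(const char *relpath)
{
    int         first_partial = -1;
    char        path[1024];
    struct stat st;

    for (int segno = 0; segno < PROBE_LIMIT; segno++)
    {
        if (segno == 0)
            snprintf(path, sizeof(path), "%s", relpath);
        else
            snprintf(path, sizeof(path), "%s.%d", relpath, segno);

        if (first_partial < 0)
        {
            if (stat(path, &st) != 0)
                return;         /* relation ends before any partial segment */
            if (st.st_size < (off_t) RELSEG_SIZE * BLCKSZ)
                first_partial = segno;
        }
        else if (stat(path, &st) == 0)
        {
            /* Anything beyond the first partial segment is presumed orphaned. */
            fprintf(stderr, "removing orphaned segment %s\n", path);
            unlink(path);
        }
    }
}

int
main(void)
{
    cleanup_orphan_segments("base/5/23456");
    return 0;
}
```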

Thoughts?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: base backup vs. concurrent truncation

From: Aleksander Alekseev
Hi,

Just a quick observation:

> basebackup.c's theory about relation truncation is that it doesn't
> really matter because WAL replay will fix things up. But in this case,
> I don't think it will, because WAL replay relies on the above
> invariant holding.

Assuming this is the case perhaps we can reduce the scenario and
consider this simpler one:

1. The table is truncated
2. The DBMS is killed before making a checkpoint
3. We are in recovery and presumably see a pair of 0.5 GB segments

Or can't we?

-- 
Best regards,
Aleksander Alekseev



Re: base backup vs. concurrent truncation

From: Aleksander Alekseev
Hi again,

> Assuming this is the case perhaps we can reduce the scenario and
> consider this simpler one:
>
> 1. The table is truncated
> 2. The DBMS is killed before making a checkpoint
> 3. We are in recovery and presumably see a pair of 0.5 GB segments
>
> Or can't we?

Oh, I see. If the process is merely killed, this is perhaps not going to
happen. Whether it can happen if we pull the plug on the machine
probably depends on the design of the particular filesystem and
whether it's journaled.

Hm...

-- 
Best regards,
Aleksander Alekseev



Re: base backup vs. concurrent truncation

From: Robert Haas
On Fri, Apr 21, 2023 at 10:56 AM Aleksander Alekseev
<aleksander@timescale.com> wrote:
> > Assuming this is the case perhaps we can reduce the scenario and
> > consider this simpler one:
> >
> > 1. The table is truncated
> > 2. The DBMS is killed before making a checkpoint
> > 3. We are in recovery and presumably see a pair of 0.5 GB segments
> >
> > Or can't we?
>
> Oh, I see. If the process is merely killed, this is perhaps not going to
> happen. Whether it can happen if we pull the plug on the machine
> probably depends on the design of the particular filesystem and
> whether it's journaled.

Right. I mentioned that scenario in the original email:

"Furthermore, I think that the problem could arise without performing a
backup at all: say that the server crashes on the OS level in
mid-truncation, and the truncation of segment 0 reaches disk but the
removal of segment 1 does not."

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: base backup vs. concurrent truncation

From: Aleksander Alekseev
Hi,

> > Oh, I see. If the process is merely killed, this is perhaps not going to
> > happen. Whether it can happen if we pull the plug on the machine
> > probably depends on the design of the particular filesystem and
> > whether it's journaled.
>
> Right. I mentioned that scenario in the original email:
>
> "Furthermore, I think that the problem could arise without performing a
> backup at all: say that the server crashes on the OS level in
> mid-truncation, and the truncation of segment 0 reaches disk but the
> removal of segment 1 does not."

Right. I didn't fully understand this scenario at first.

I tried to reproduce it to see what actually happens. Here is what I did.

```
create table truncateme(id integer, val varchar(1024));
alter table truncateme set (autovacuum_enabled = off);

-- takes ~30 seconds
insert into truncateme
  select id,
  (
    select string_agg(chr((33+random()*(126-33)) :: integer), '')
    from generate_series(1,1000)
  )
  from generate_series(1,2*1024*1024) as id;

checkpoint;

select relfilenode from pg_class where relname = 'truncateme';
 relfilenode
-------------
       32778
```

We got 3 segments (and no VM fork since I disabled VACUUM):

```
$ cd ~/pginstall/data-master/base/16384/
$ ls -lah 32778*
-rw-------  1 eax  staff   1.0G Apr 21 23:47 32778
-rw-------  1 eax  staff   1.0G Apr 21 23:48 32778.1
-rw-------  1 eax  staff   293M Apr 21 23:48 32778.2
-rw-------  1 eax  staff   608K Apr 21 23:48 32778_fsm
```

Let's backup the last segment:

```
$ cp 32778.2 ~/temp/32778.2_backup
```

Truncate the table:

```
delete from truncateme where id > 1024*1024;
vacuum truncateme;
```

And kill Postgres:

```
$ pkill -9 postgres
```

... before it finishes another checkpoint:

```
LOG:  checkpoints are occurring too frequently (4 seconds apart)
HINT:  Consider increasing the configuration parameter "max_wal_size".
LOG:  checkpoint starting: wal
[and there was no "checkpoint complete"]
```

Next:

```
$ ls -lah 32778*
-rw-------  1 eax  staff   1.0G Apr 21 23:50 32778
-rw-------  1 eax  staff   146M Apr 21 23:50 32778.1
-rw-------  1 eax  staff     0B Apr 21 23:50 32778.2
-rw-------  1 eax  staff   312K Apr 21 23:50 32778_fsm
-rw-------  1 eax  staff    40K Apr 21 23:50 32778_vm

$ cp ~/temp/32778.2_backup ./32778.2
```

I believe this simulates the case when pg_basebackup did a checkpoint,
copied 32778.2 before the rest of the segments, Postgres did a
truncate, and pg_basebackup received the rest of the data including
WAL. The WAL contains the record about the truncation, see
RelationTruncate(). Just as an observation: we keep zero-sized
segments instead of deleting them.

So if I start Postgres now I expect it to return to a consistent
state, ideally the same state it had before the crash in terms of the
segments.

What I actually get is:

```
LOG:  database system was interrupted; last known up at 2023-04-21 23:50:08 MSK
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 9/C4035B18
LOG:  invalid record length at 9/E8FCBF10: expected at least 24, got 0
LOG:  redo done at 9/E8FCBEB0 system usage: CPU: user: 1.58 s, system:
0.96 s, elapsed: 2.61 s
LOG:  checkpoint starting: end-of-recovery immediate wait
LOG:  checkpoint complete: wrote 10 buffers (0.0%); 0 WAL file(s)
added, 0 removed, 36 recycled; write=0.005 s, sync=0.003 s,
total=0.016 s; sync files=9, longest=0.002 s, average=0.001 s;
distance=605784 kB, estimate=605784 kB; lsn=9/E8FCBF10, redo
lsn=9/E8FCBF10
LOG:  database system is ready to accept connections
```

... and:

```
-rw-------  1 eax  staff   1.0G Apr 21 23:50 32778
-rw-------  1 eax  staff   146M Apr 21 23:53 32778.1
-rw-------  1 eax  staff     0B Apr 21 23:53 32778.2
```

Naturally the database is consistent too. So it looks like the
recovery protocol worked as expected after all, at least in the
upcoming PG16 and this specific scenario.

Did I miss anything? If not, perhaps we should just update the comments a bit?

-- 
Best regards,
Aleksander Alekseev



Re: base backup vs. concurrent truncation

From: Robert Haas
On Fri, Apr 21, 2023 at 5:19 PM Aleksander Alekseev
<aleksander@timescale.com> wrote:
> Naturally the database is consistent too. So it looks like the
> recovery protocol worked as expected after all, at least in the
> upcoming PG16 and this specific scenario.
>
> Did I miss anything? If not, perhaps we should just update the comments a bit?

I was confused about what was happening here but after trying to
reproduce I think I figured it out. In my test, I saw the .1 file grow
to a full 1GB and then the truncate happened afterward.
Unsurprisingly, that works. I believe that what's happening here is
that the DELETE statement triggers FPIs for all of the blocks it
touches. Therefore, when the DELETE records are replayed, those blocks
get written back to the relation, extending it. That gets it up to the
required 1GB size after which the .2 file is visible to the smgr
again, so the truncate works fine. I think that to reproduce the
scenario, you want the truncate to happen in its own checkpoint cycle.

I also found out from Andres that he's complained about pretty much
the same problem just a couple of months ago:

https://www.postgresql.org/message-id/20230223010147.32oir7sb66slqnjk@awork3.anarazel.de

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: base backup vs. concurrent truncation

From: Andres Freund
Hi,

On 2023-04-21 09:42:57 -0400, Robert Haas wrote:
> Apologies if this has already been discussed someplace, but I couldn't
> find a previous discussion. It seems to me that base backups are
> broken in the face of a concurrent truncation that reduces the number
> of segments in a relation.

I think
https://www.postgresql.org/message-id/20230223010147.32oir7sb66slqnjk@awork3.anarazel.de
and the commits + discussions referenced from there is relevant for the topic.


> Suppose we have a relation that is 1.5GB in size, so that we have two
> files 23456, which is 1GB, and 23456.1, which is 0.5GB. We'll back
> those files up in whichever order the directory scan finds them.
> Suppose we back up 23456.1 first. Then the relation is truncated to
> 0.5GB, so 23456.1 is removed and 23456 gets a lot shorter. Next we
> back up the file 23456. Now our backup contains files 23456 and
> 23456.1, each 0.5GB. But this breaks the invariant in md.c:


>  *      On disk, a relation must consist of consecutively numbered segment
>  *      files in the pattern
>  *              -- Zero or more full segments of exactly RELSEG_SIZE blocks each
>  *              -- Exactly one partial segment of size 0 <= size <
> RELSEG_SIZE blocks
>  *              -- Optionally, any number of inactive segments of size 0 blocks.
> 
> basebackup.c's theory about relation truncation is that it doesn't
> really matter because WAL replay will fix things up. But in this case,
> I don't think it will, because WAL replay relies on the above
> invariant holding. As mdnblocks says:
> 
>                 /*
>                  * If segment is exactly RELSEG_SIZE, advance to next one.
>                  */
>                 segno++;
> 
> So I think what's going to happen is we're not going to notice 23456.1
> when we recover the backup. It will just sit there as an orphaned file
> forever, unless we extend 23456 back to a full 1GB, at which point we
> might abruptly start considering that file part of the relation again.

One important point is that I think 23456.1 at that point can contain
data, which can lead to deleted rows reappearing, vacuum failing due to rows
from before relfrozenxid existing, etc.


> Assuming I'm not wrong about all of this, the question arises: whose
> fault is this, and what to do about it? It seems to me that it's a bit
> hard to blame basebackup.c, because if you used pg_backup_start() and
> pg_backup_stop() and copied the directory yourself, you'd have exactly
> the same situation, and while we could (and perhaps should) teach
> basebackup.c to do something smarter, it doesn't seem realistic to
> impose complex constraints on the user's choice of file copy tool.

Agreed.


> So I think the problem is with md.c assuming that its invariant must
> hold on a cluster that's not guaranteed to be in a consistent state.
> But mdnblocks() clearly can't try to open every segment up to whatever
> the maximum theoretical possible segment number is every time it's
> invoked, because that would be wicked expensive. An idea that occurs
> to me is to remove all segment files following the first partial
> segment during startup, before we begin WAL replay. If that state
> occurs at startup, then either we have a scenario involving
> truncation, like those above, or a scenario involving relation
> extension, where we added a new segment and that made it to disk but
> the prior extension of the previous last segment file to maximum
> length did not. But in that case, WAL replay should, I think, fix
> things up. However, I'm not completely sure that there isn't some hole
> in this theory, and this way forward also doesn't sound particularly
> cheap. Nonetheless I don't have another idea right now.

What we've discussed somewhere in the past is to always truncate N+1 when
creating the first page in N. I.e. if we extend into 23456.1, we truncate
23456.2 to 0 blocks.  As far as I can tell, that'd solve this issue?
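
A sketch of that rule in plain C (not the real smgr code; the function
name and the call site are hypothetical): whenever extension writes the
first page of segment N, also make sure a stale N+1 is cut back to zero
length if one is lying around:

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Hypothetical hook, called when relation extension writes block 0 of
 * segment 'new_segno' (i.e. when that segment is created or reused).
 */
static void
truncate_next_segment_if_present(const char *relpath, int new_segno)
{
    char        next[1024];
    struct stat st;

    snprintf(next, sizeof(next), "%s.%d", relpath, new_segno + 1);

    if (stat(next, &st) == 0 && st.st_size > 0)
    {
        /* Leftover from an earlier, longer incarnation of the relation. */
        fprintf(stderr, "truncating stale segment %s\n", next);
        if (truncate(next, 0) != 0)
            perror("truncate");
    }
}

int
main(void)
{
    /* e.g. we just extended into 23456.1, so clip any stale 23456.2 */
    truncate_next_segment_if_present("base/5/23456", 1);
    return 0;
}
```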

Greetings,

Andres Freund



Re: base backup vs. concurrent truncation

From: Robert Haas
On Mon, Apr 24, 2023 at 8:03 PM Andres Freund <andres@anarazel.de> wrote:
> What we've discussed somewhere in the past is to always truncate N+1 when
> creating the first page in N. I.e. if we extend into 23456.1, we truncate
> 23456.2 to 0 blocks.  As far as I can tell, that'd solve this issue?

Yeah, although leaving 23456.2 forever unless and until that happens
doesn't sound amazing.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: base backup vs. concurrent truncation

From: Andres Freund
Hi,

On 2023-04-25 11:42:43 -0400, Robert Haas wrote:
> On Mon, Apr 24, 2023 at 8:03 PM Andres Freund <andres@anarazel.de> wrote:
> > What we've discussed somewhere in the past is to always truncate N+1 when
> > creating the first page in N. I.e. if we extend into 23456.1, we truncate
> > 23456.2 to 0 blocks.  As far as I can tell, that'd solve this issue?
> 
> Yeah, although leaving 23456.2 forever unless and until that happens
> doesn't sound amazing.

It isn't - but the alternatives aren't great either. It's not that easy to hit
this scenario, so I think something along these lines is more palatable than
adding a pass through the entire data directory.

I think eventually we'll have to make the WAL logging bulletproof enough
against this to avoid the risk of it. I think that is possible.

I suspect we would need to prevent checkpoints from happening in the wrong
moment, if we were to go down that route.

I guess that eventually we'll need to polish the infrastructure for
determining restartpoints, so that delayChkptFlags doesn't actually prevent
checkpoints, just moves the restart to a safe LSN.  Although I guess that
truncations aren't frequent enough (compared to transaction commits) for that
to be required "now".

Greetings,

Andres Freund



Re: base backup vs. concurrent truncation

From: Aleksander Alekseev
Hi,

> I think that to reproduce the scenario, you want the truncate to happen in
> its own checkpoint cycle.

OK, let's try this again.

In order to effectively disable the checkpointer I added the following lines
to postgresql.conf:

```
checkpoint_timeout = 3600
max_wal_size = 100G
```

I'm also keeping an eye on `logfile` in order to make sure the system doesn't
do anything unexpected.

Then:

```
-- Just double-checking
show checkpoint_timeout;
 checkpoint_timeout
--------------------
 1h

show max_wal_size;
 max_wal_size
--------------
 100GB

create table truncateme(id integer, val varchar(1024));
alter table truncateme set (autovacuum_enabled = off);
select relfilenode from pg_class where relname = 'truncateme';

 relfilenode
-------------
       16385

-- takes ~30 seconds
insert into truncateme
  select id,
  (
    select string_agg(chr((33+random()*(126-33)) :: integer), '')
    from generate_series(1,1000)
  )
  from generate_series(1,2*1024*1024) as id;

delete from truncateme where id > 1024*1024;

select count(*) from truncateme;
  count
---------
 1048576

-- Making a checkpoint as pg_basebackup would do.
-- Also, making sure truncate will happen in its own checkpoint cycle.
checkpoint;
```

Again I see 3 segments:

```
$ ls -lah 16385*
-rw------- 1 eax eax 1.0G May  1 19:24 16385
-rw------- 1 eax eax 1.0G May  1 19:27 16385.1
-rw------- 1 eax eax 293M May  1 19:27 16385.2
-rw------- 1 eax eax 608K May  1 19:24 16385_fsm
```

Making a backup of .2 as if I'm pg_basebackup:

```
cp 16385.2 ~/temp/16385.2
```

Truncating the table:

```
vacuum truncateme;
```

... and killing postgres:

```
$ pkill -9 postgres

```

Now I see:

```
$ ls -lah 16385*
-rw------- 1 eax eax 1.0G May  1 19:30 16385
-rw------- 1 eax eax 147M May  1 19:31 16385.1
-rw------- 1 eax eax    0 May  1 19:31 16385.2
-rw------- 1 eax eax 312K May  1 19:31 16385_fsm
-rw------- 1 eax eax  40K May  1 19:31 16385_vm
$ cp ~/temp/16385.2 ./16385.2
```

Starting postgres:

```
LOG:  starting PostgreSQL 16devel on x86_64-linux, compiled by
gcc-11.3.0, 64-bit
LOG:  listening on IPv4 address "0.0.0.0", port 5432
LOG:  listening on IPv6 address "::", port 5432
LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
LOG:  database system was interrupted; last known up at 2023-05-01 19:27:22 MSK
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/8AAB36B0
LOG:  invalid record length at 0/CE9BDE60: expected at least 24, got 0
LOG:  redo done at 0/CE9BDE28 system usage: CPU: user: 6.51 s, system:
2.45 s, elapsed: 8.97 s
LOG:  checkpoint starting: end-of-recovery immediate wait
LOG:  checkpoint complete: wrote 10 buffers (0.0%); 0 WAL file(s)
added, 0 removed, 68 recycled; write=0.026 s, sync=1.207 s,
total=1.769 s; sync files=10, longest=1.188 s, average=0.121 s;
distance=1113129 kB, estimate=1113129 kB; lsn=0/CE9BDE60, redo
lsn=0/CE9BDE60
LOG:  database system is ready to accept connections
```

```
$ ls -lah 16385*
-rw------- 1 eax eax 1.0G May  1 19:33 16385
-rw------- 1 eax eax 147M May  1 19:33 16385.1
-rw------- 1 eax eax    0 May  1 19:33 16385.2
-rw------- 1 eax eax 312K May  1 19:33 16385_fsm
-rw------- 1 eax eax  40K May  1 19:33 16385_vm
```

```
select count(*) from truncateme;
  count
---------
 1048576
```

So I'm still unable to reproduce the described scenario, at least on PG16.

-- 
Best regards,
Aleksander Alekseev



Re: base backup vs. concurrent truncation

From: Robert Haas
On Mon, May 1, 2023 at 12:54 PM Aleksander Alekseev
<aleksander@timescale.com> wrote:
> So I'm still unable to reproduce the described scenario, at least on PG16.

Well, that proves that either (1) the scenario that I described is
impossible for some unknown reason or (2) something is wrong with your
test scenario. I bet it's (2), but if it's (1), it would be nice to
know what the reason is. One can't feel good about code that appears
on the surface to be broken even if one knows that some unknown
magical thing is preventing disaster.

I find it pretty hard to accept that there's no problem at all here,
especially in view of the fact that Andres independently posted about
the same issue on another thread. It's pretty clear from looking at
the code that mdnblocks() can't open any segments past the first one
that isn't of the maximum possible size. It's also fairly clear that a
crash or a base backup can create such situations. And finally it's
pretty clear that having an old partial segment be rediscovered due to
the relation be re-extended would be quite bad. So how can there not
be a problem? I admit I haven't done the legwork to nail down a test
case where everything comes together just right to show user-visible
breakage, but your success in finding one where it doesn't is no proof
of anything.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: base backup vs. concurrent truncation

From: Andres Freund
Hi,

On 2023-05-08 08:57:08 -0400, Robert Haas wrote:
> On Mon, May 1, 2023 at 12:54 PM Aleksander Alekseev
> <aleksander@timescale.com> wrote:
> > So I'm still unable to reproduce the described scenario, at least on PG16.
> 
> Well, that proves that either (1) the scenario that I described is
> impossible for some unknown reason or (2) something is wrong with your
> test scenario. I bet it's (2), but if it's (1), it would be nice to
> know what the reason is. One can't feel good about code that appears
> on the surface to be broken even if one knows that some unknown
> magical thing is preventing disaster.

It seems pretty easy to create disconnected segments. You don't even need a
basebackup for it.


To make it easier, I rebuilt with segsize_blocks=16. This isn't required, it
just makes it a lot cheaper to play around. To no one's surprise: I'm not a
patient person...


Started server with autovacuum=off.

DROP TABLE IF EXISTS large;
CREATE TABLE large AS SELECT generate_series(1, 100000);
SELECT current_setting('data_directory') || '/' || pg_relation_filepath('large');

ls -l /srv/dev/pgdev-dev/base/5/24585*
shows lots of segments.

attach gdb, set breakpoint on truncate.

DROP TABLE large;

breakpoint will fire. Continue once.

In concurrent session, trigger checkpoint. Due to the checkpoint we'll not
replay any WAL record. And the checkpoint will unlink the first segment.

Kill the server.

After crash recovery, you end up with all but the first segment still
present. As the first segment doesn't exist anymore, nothing prevents that oid
from being recycled in the future. Once it is recycled and the first segment
grows large enough, the later segments will suddenly re-appear.


It's not quite so trivial to reproduce issues with partial truncations /
concurrent base backups. The problem is that it's hard to guarantee the
iteration order of the base backup process.  You'd just need to write a manual
base backup script though.

Consider a script mimicking the filesystem returning directory entries in
"reverse name order". Recipe includes two sessions. One (BB) doing a base
backup, the other (DT) running VACUUM making the table shorter.

BB: Copy <relfilenode>.2
BB: Copy <relfilenode>.1
DT: Truncate relation to < SEGSIZE
BB: Copy <relfilenode>


The replay of the smgrtruncate record will determine the relation size to
figure out what segments to remove. Because <relfilenode> is < SEGSIZE it'll
only truncate <relfilenode>, not <relfilenode>.N.  And boom, a disconnected
segment.

(I'll post a separate email about an evolved proposal about fixing this set of
issues)

Greetings,

Andres Freund



Re: base backup vs. concurrent truncation

From: Andres Freund
Hi,

On 2023-04-25 10:28:58 -0700, Andres Freund wrote:
> On 2023-04-25 11:42:43 -0400, Robert Haas wrote:
> > On Mon, Apr 24, 2023 at 8:03 PM Andres Freund <andres@anarazel.de> wrote:
> > > What we've discussed somewhere in the past is to always truncate N+1 when
> > > creating the first page in N. I.e. if we extend into 23456.1, we truncate
> > > 23456.2 to 0 blocks.  As far as I can tell, that'd solve this issue?
> > 
> > Yeah, although leaving 23456.2 forever unless and until that happens
> > doesn't sound amazing.
> 
> It isn't - but the alternatives aren't great either. It's not that easy to hit
> this scenario, so I think something along these lines is more palatable than
> adding a pass through the entire data directory.

> I think eventually we'll have to make the WAL logging bulletproof enough
> against this to avoid the risk of it. I think that is possible.


I think we should extend my proposal above with improved WAL logging.

Right now the replay of XLOG_SMGR_TRUNCATE records uses the normal
smgrtruncate() path - which essentially relies on smgrnblocks() to determine
the relation size, which in turn iterates over the segments until it finds one
< SEGSIZE.

That's fine during normal running, where we are consistent. But it's bogus
when we're not consistent - in case of a concurrent truncation, the base
backup might have missed intermediate segments, while copying later segments.

We should fix this by including not just the "target" length in the WAL
record, but also the "current" length. Then during WAL replay of such records
we'd not look at the files currently present, we'd just stupidly truncate all
the segments mentioned in the range.
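
As a rough sketch (plain C, not actual replay code; replay_truncate and
the block counts are illustrative), replay of such a record could derive
the affected segments purely from the two logged lengths and touch them
unconditionally, instead of trusting whatever files the backup happened
to capture:

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

#define BLCKSZ        8192
#define RELSEG_SIZE   131072    /* blocks per 1GB segment */

/* Replay: shrink the relation from old_blocks to new_blocks. */
static void
replay_truncate(const char *relpath, long old_blocks, long new_blocks)
{
    int         old_last_seg = (int) (old_blocks / RELSEG_SIZE);
    int         new_last_seg = (int) (new_blocks / RELSEG_SIZE);
    char        path[1024];

    /* Work backwards from the old tail; missing files are not an error. */
    for (int segno = old_last_seg; segno > new_last_seg; segno--)
    {
        snprintf(path, sizeof(path), "%s.%d", relpath, segno);
        if (unlink(path) != 0)
            fprintf(stderr, "note: %s already gone\n", path);
    }

    /* Cut the new last segment down to the remaining blocks. */
    if (new_last_seg == 0)
        snprintf(path, sizeof(path), "%s", relpath);
    else
        snprintf(path, sizeof(path), "%s.%d", relpath, new_last_seg);

    if (truncate(path, (off_t) (new_blocks % RELSEG_SIZE) * BLCKSZ) != 0)
        perror(path);
}

int
main(void)
{
    /* e.g. relation shrank from 1.5GB (196608 blocks) to 0.5GB (65536 blocks) */
    replay_truncate("base/5/23456", 196608, 65536);
    return 0;
}
```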


I think we ought to do the same for mdunlinkfork().



> I suspect we would need to prevent checkpoints from happening in the wrong
> moment, if we were to go down that route.
> 
> I guess that eventually we'll need to polish the infrastructure for
> determining restartpoints, so that delayChkptFlags doesn't actually prevent
> checkpoints, just moves the restart to a safe LSN.  Although I guess that
> truncations aren't frequent enough (compared to transaction commits) for that
> to be required "now".
> 

Unfortunately the current approach of smgrtruncate records is quite racy with
checkpoints. You can end up with a sequence of something like

1) SMGRTRUNCATE record
2) CHECKPOINT;
3) Truncating of segment files

if you crash anywhere in 3), we don't replay the SMGRTRUNCATE record. It's not
a great solution, but I suspect we'll just have to set delayChkptFlags during
truncations to prevent this.


I also think that we need to fix the order in which mdunlink() operates. It
seems *quite* bogus that we unlink in "forward" segment order, rather than in
reverse order (like mdtruncate()). If we crash after unlinking segment.N,
we'll not redo unlinking the later segments after a crash. Nor in WAL
replay. While the improved WAL logging would address this as well, it still
seems pointlessly risky to iterate forward, rather than backward.
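
Roughly like this (plain C sketch, not the real mdunlink(), which as far
as I recall also handles the first segment specially; the helper name and
nsegs parameter are illustrative):

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Unlink a relation's segments from the highest number down.  If we
 * crash part-way through, the surviving files are still a consecutive
 * run starting at segment 0, so a later scan can find and remove them;
 * nothing is left disconnected behind a missing lower segment.
 */
static void
unlink_segments_reverse(const char *relpath, int nsegs)
{
    char        path[1024];

    for (int segno = nsegs - 1; segno >= 1; segno--)
    {
        snprintf(path, sizeof(path), "%s.%d", relpath, segno);
        if (unlink(path) != 0)
            perror(path);
    }

    /* Segment 0 goes last. */
    if (unlink(relpath) != 0)
        perror(relpath);
}

int
main(void)
{
    unlink_segments_reverse("base/5/24585", 3);
    return 0;
}
```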


Even with those changes, I think we might still need something like the
"always truncate the next segment" bit I described in my prior email though.

Greetings,

Andres Freund



Re: base backup vs. concurrent truncation

From: Aleksander Alekseev
Hi Robert,

> I admit I haven't done the legwork to nail down a test
> case where everything comes together just right to show user-visible
> breakage, but your success in finding one where it doesn't is no proof
> of anything.

Respectfully, what made you think this was my intention?

Quite the opposite, personally I am inclined to think that the problem
does exist. In order to fix it however we need a test that reliably
reproduces it first. Otherwise there is no way to figure out whether
the fix was correct or not.

What the experiment showed is that the test scenario you initially
described is probably the wrong one for reasons yet to be understood
and we need to come up with a better one.

-- 
Best regards,
Aleksander Alekseev



Re: base backup vs. concurrent truncation

From: Greg Stark
Including the pre-truncation length in the WAL record is the obviously
solid approach, and none of the below is a good substitute for it.
But....

On Tue, 25 Apr 2023 at 13:30, Andres Freund <andres@anarazel.de> wrote:
>
> It isn't - but the alternatives aren't great either. It's not that easy to hit
> this scenario, so I think something along these lines is more palatable than
> adding a pass through the entire data directory.

Doing one pass through the entire data directory on startup before
deciding the directory is consistent doesn't sound like a crazy idea.
It's pretty easy to imagine bugs in backup software that leave out
files in the middle of tables -- some of us don't even have to
imagine...

Similarly checking for a stray next segment whenever extending a file
to maximum segment size seems like a reasonable thing to check for
too.

These kinds of checks are the kind of paranoia that catches filesystem
bugs, backup software bugs, cron jobs, etc that we don't even know to
watch for.

-- 
greg



Re: base backup vs. concurrent truncation

From: Robert Haas
On Tue, May 9, 2023 at 5:14 AM Aleksander Alekseev
<aleksander@timescale.com> wrote:
> > I admit I haven't done the legwork to nail down a test
> > case where everything comes together just right to show user-visible
> > breakage, but your success in finding one where it doesn't is no proof
> > of anything.
>
> Respectfully, what made you think this was my intention?

Honestly I have no idea what your intention was and didn't mean to
judge it. However, I don't think that troubleshooting the test case
you put together is the thing that I want to spend time on right now,
and I hope that it will still be possible to make some progress on the
underlying issue despite that.

> Quite the opposite, personally I am inclined to think that the problem
> does exist. In order to fix it however we need a test that reliably
> reproduces it first. Otherwise there is no way to figure out whether
> the fix was correct or not.
>
> What the experiment showed is that the test scenario you initially
> described is probably the wrong one for reasons yet to be understood
> and we need to come up with a better one.

Hopefully what Andres posted will help in this regard.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: base backup vs. concurrent truncation

From: Stephen Frost
Greetings,

* Greg Stark (stark@mit.edu) wrote:
> Including the pre-truncation length in the WAL record is the obviously
> solid approach, and none of the below is a good substitute for it.

I tend to agree with the items mentioned in Andres's recent email on
this thread too in terms of improving the WAL logging around this.

> On Tue, 25 Apr 2023 at 13:30, Andres Freund <andres@anarazel.de> wrote:
> > It isn't - but the alternatives aren't great either. It's not that easy to hit
> > this scenario, so I think something along these lines is more palatable than
> > adding a pass through the entire data directory.
>
> Doing one pass through the entire data directory on startup before
> deciding the directory is consistent doesn't sound like a crazy idea.

We're already making a pass through the entire data directory on
crash-restart (and fsync'ing everything too), which includes when
restoring from backup.  See src/backend/access/transam/xlog.c:5155
Extending that to check for oddities like segments following a not-1GB
segment certainly seems like a good idea to me.

> It's pretty easy to imagine bugs in backup software that leave out
> files in the middle of tables -- some of us don't even have to
> imagine...

Yup.

> Similarly checking for a stray next segment whenever extending a file
> to maximum segment size seems like a reasonable thing to check for
> too.

Yeah, that definitely seems like a good idea.  Extending a relation is
already expensive and we've taken steps to deal with that, so
detecting that the file we were expecting to create is already there
certainly seems like a good idea, and I wouldn't expect (?) it to add a lot
of extra time in the normal case.

> These kinds of checks are the kind of paranoia that catches filesystem
> bugs, backup software bugs, cron jobs, etc that we don't even know to
> watch for.

Agreed, and it would also help in cases where such a situation already
exists out there somewhere, which no amount of new WAL records would
make go away...

Thanks,

Stephen
