Thread: base backup vs. concurrent truncation
Hi,

Apologies if this has already been discussed someplace, but I couldn't find a previous discussion. It seems to me that base backups are broken in the face of a concurrent truncation that reduces the number of segments in a relation.

Suppose we have a relation that is 1.5GB in size, so that we have two files 23456, which is 1GB, and 23456.1, which is 0.5GB. We'll back those files up in whichever order the directory scan finds them. Suppose we back up 23456.1 first. Then the relation is truncated to 0.5GB, so 23456.1 is removed and 23456 gets a lot shorter. Next we back up the file 23456. Now our backup contains files 23456 and 23456.1, each 0.5GB. But this breaks the invariant in md.c:

 * On disk, a relation must consist of consecutively numbered segment
 * files in the pattern
 *    -- Zero or more full segments of exactly RELSEG_SIZE blocks each
 *    -- Exactly one partial segment of size 0 <= size < RELSEG_SIZE blocks
 *    -- Optionally, any number of inactive segments of size 0 blocks.

basebackup.c's theory about relation truncation is that it doesn't really matter because WAL replay will fix things up. But in this case, I don't think it will, because WAL replay relies on the above invariant holding. As mdnblocks says:

    /*
     * If segment is exactly RELSEG_SIZE, advance to next one.
     */
    segno++;

So I think what's going to happen is we're not going to notice 23456.1 when we recover the backup. It will just sit there as an orphaned file forever, unless we extend 23456 back to a full 1GB, at which point we might abruptly start considering that file part of the relation again.

Assuming I'm not wrong about all of this, the question arises: whose fault is this, and what to do about it? It seems to me that it's a bit hard to blame basebackup.c, because if you used pg_backup_start() and pg_backup_stop() and copied the directory yourself, you'd have exactly the same situation, and while we could (and perhaps should) teach basebackup.c to do something smarter, it doesn't seem realistic to impose complex constraints on the user's choice of file copy tool. Furthermore, I think that the problem could arise without performing a backup at all: say that the server crashes on the OS level in mid-truncation, and the truncation of segment 0 reaches disk but the removal of segment 1 does not.

So I think the problem is with md.c assuming that its invariant must hold on a cluster that's not guaranteed to be in a consistent state. But mdnblocks() clearly can't try to open every segment up to whatever the maximum theoretical possible segment number is every time it's invoked, because that would be wicked expensive. An idea that occurs to me is to remove all segment files following the first partial segment during startup, before we begin WAL replay. If that state occurs at startup, then either we have a scenario involving truncation, like those above, or a scenario involving relation extension, where we added a new segment and that made it to disk but the prior extension of the previous last segment file to maximum length did not. But in that case, WAL replay should, I think, fix things up. However, I'm not completely sure that there isn't some hole in this theory, and this way forward also doesn't sound particularly cheap. Nonetheless I don't have another idea right now.

Thoughts?

--
Robert Haas
EDB: http://www.enterprisedb.com
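[For illustration, a minimal standalone sketch of the "stop at the first partial segment" scan described above. This is not the actual md.c code; it assumes 8kB blocks and hard-codes the post-truncation segment sizes from the scenario.]

```
#include <stdio.h>

#define RELSEG_BLOCKS 131072            /* blocks per 1GB segment, at 8kB blocks */

/* Simulated on-disk segment lengths in blocks; -1 means "no such file". */
static int segment_blocks[] = {
    65536,      /* 23456:   0.5GB after the truncation */
    65536,      /* 23456.1: stale 0.5GB copy restored from the backup */
    -1
};

/* Sketch of an mdnblocks()-style scan: sum segments, stop at the first partial one. */
static long
sketch_nblocks(void)
{
    long        nblocks = 0;
    int         segno;

    for (segno = 0; segment_blocks[segno] >= 0; segno++)
    {
        nblocks += segment_blocks[segno];

        /* Only a full segment implies that a further segment may follow. */
        if (segment_blocks[segno] < RELSEG_BLOCKS)
            break;
    }
    return nblocks;
}

int
main(void)
{
    printf("relation size seen by the scan: %ld blocks\n", sketch_nblocks());
    printf("blocks hiding in the orphaned segment 1: %d\n", segment_blocks[1]);
    return 0;
}
```

[With both files at 0.5GB the scan stops after segment 0 and reports 65536 blocks, so the 65536 blocks in segment 1 are never seen again until segment 0 grows back to a full 1GB.]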
Hi,

Just a quick observation:

> basebackup.c's theory about relation truncation is that it doesn't
> really matter because WAL replay will fix things up. But in this case,
> I don't think it will, because WAL replay relies on the above
> invariant holding.

Assuming this is the case perhaps we can reduce the scenario and consider this simpler one:

1. The table is truncated
2. The DBMS is killed before making a checkpoint
3. We are in recovery and presumably see a pair of 0.5 GB segments

Or can't we?

--
Best regards,
Aleksander Alekseev
Hi again,

> Assuming this is the case perhaps we can reduce the scenario and
> consider this simpler one:
>
> 1. The table is truncated
> 2. The DBMS is killed before making a checkpoint
> 3. We are in recovery and presumably see a pair of 0.5 GB segments
>
> Or can't we?

Oh, I see. If the process is merely killed, this is probably not going to happen. Whether it can happen if we pull the plug on the machine probably depends on the implementation of the particular filesystem and whether it's journaled. Hm...

--
Best regards,
Aleksander Alekseev
On Fri, Apr 21, 2023 at 10:56 AM Aleksander Alekseev <aleksander@timescale.com> wrote:
> > Assuming this is the case perhaps we can reduce the scenario and
> > consider this simpler one:
> >
> > 1. The table is truncated
> > 2. The DBMS is killed before making a checkpoint
> > 3. We are in recovery and presumably see a pair of 0.5 GB segments
> >
> > Or can't we?
>
> Oh, I see. If the process is merely killed, this is probably not going
> to happen. Whether it can happen if we pull the plug on the machine
> probably depends on the implementation of the particular filesystem
> and whether it's journaled.

Right. I mentioned that scenario in the original email:

"Furthermore, I think that the problem could arise without performing a backup at all: say that the server crashes on the OS level in mid-truncation, and the truncation of segment 0 reaches disk but the removal of segment 1 does not."

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

> > Oh, I see. If the process is merely killed, this is probably not going
> > to happen. Whether it can happen if we pull the plug on the machine
> > probably depends on the implementation of the particular filesystem
> > and whether it's journaled.
>
> Right. I mentioned that scenario in the original email:
>
> "Furthermore, I think that the problem could arise without performing a
> backup at all: say that the server crashes on the OS level in
> mid-truncation, and the truncation of segment 0 reaches disk but the
> removal of segment 1 does not."

Right. I didn't fully understand this scenario at first. I tried to reproduce it to see what actually happens. Here is what I did.

```
create table truncateme(id integer, val varchar(1024));
alter table truncateme set (autovacuum_enabled = off);

-- takes ~30 seconds
insert into truncateme
select
  id,
  (
    select string_agg(chr((33+random()*(126-33)) :: integer), '')
    from generate_series(1,1000)
  )
from generate_series(1,2*1024*1024) as id;

checkpoint;

select relfilenode from pg_class where relname = 'truncateme';
 relfilenode
-------------
       32778
```

We got 3 segments (and no VM fork since I disabled VACUUM):

```
$ cd ~/pginstall/data-master/base/16384/
$ ls -lah 32778*
-rw-------  1 eax  staff   1.0G Apr 21 23:47 32778
-rw-------  1 eax  staff   1.0G Apr 21 23:48 32778.1
-rw-------  1 eax  staff   293M Apr 21 23:48 32778.2
-rw-------  1 eax  staff   608K Apr 21 23:48 32778_fsm
```

Let's back up the last segment:

```
$ cp 32778.2 ~/temp/32778.2_backup
```

Truncate the table:

```
delete from truncateme where id > 1024*1024;
vacuum truncateme;
```

And kill Postgres:

```
$ pkill -9 postgres
```

... before it finishes another checkpoint:

```
LOG:  checkpoints are occurring too frequently (4 seconds apart)
HINT:  Consider increasing the configuration parameter "max_wal_size".
LOG:  checkpoint starting: wal

[and there was no "checkpoint complete"]
```

Next:

```
$ ls -lah 32778*
-rw-------  1 eax  staff   1.0G Apr 21 23:50 32778
-rw-------  1 eax  staff   146M Apr 21 23:50 32778.1
-rw-------  1 eax  staff     0B Apr 21 23:50 32778.2
-rw-------  1 eax  staff   312K Apr 21 23:50 32778_fsm
-rw-------  1 eax  staff    40K Apr 21 23:50 32778_vm

$ cp ~/temp/32778.2_backup ./32778.2
```

I believe this simulates the case when pg_basebackup did a checkpoint, copied 32778.2 before the rest of the segments, Postgres did a truncate, and pg_basebackup received the rest of the data including WAL. The WAL contains the record about the truncation, see RelationTruncate(). Just as an observation: we keep zero-sized segments instead of deleting them.

So if I start Postgres now I expect it to return to a consistent state, ideally the same state it had before the crash in terms of the segments. What I actually get is:

```
LOG:  database system was interrupted; last known up at 2023-04-21 23:50:08 MSK
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 9/C4035B18
LOG:  invalid record length at 9/E8FCBF10: expected at least 24, got 0
LOG:  redo done at 9/E8FCBEB0 system usage: CPU: user: 1.58 s, system: 0.96 s, elapsed: 2.61 s
LOG:  checkpoint starting: end-of-recovery immediate wait
LOG:  checkpoint complete: wrote 10 buffers (0.0%); 0 WAL file(s) added, 0 removed, 36 recycled; write=0.005 s, sync=0.003 s, total=0.016 s; sync files=9, longest=0.002 s, average=0.001 s; distance=605784 kB, estimate=605784 kB; lsn=9/E8FCBF10, redo lsn=9/E8FCBF10
LOG:  database system is ready to accept connections
```

...
and:

```
-rw-------  1 eax  staff   1.0G Apr 21 23:50 32778
-rw-------  1 eax  staff   146M Apr 21 23:53 32778.1
-rw-------  1 eax  staff     0B Apr 21 23:53 32778.2
```

Naturally the database is consistent too. So it looks like the recovery protocol worked as expected after all, at least in the upcoming PG16 and this specific scenario.

Did I miss anything? If not, perhaps we should just update the comments a bit?

--
Best regards,
Aleksander Alekseev
On Fri, Apr 21, 2023 at 5:19 PM Aleksander Alekseev <aleksander@timescale.com> wrote:
> Naturally the database is consistent too. So it looks like the
> recovery protocol worked as expected after all, at least in the
> upcoming PG16 and this specific scenario.
>
> Did I miss anything? If not, perhaps we should just update the comments a bit?

I was confused about what was happening here but after trying to reproduce I think I figured it out. In my test, I saw the .1 file grow to a full 1GB and then the truncate happened afterward. Unsurprisingly, that works.

I believe that what's happening here is that the DELETE statement triggers FPIs for all of the blocks it touches. Therefore, when the DELETE records are replayed, those blocks get written back to the relation, extending it. That gets it up to the required 1GB size, after which the .2 file is visible to the smgr again, so the truncate works fine. I think that to reproduce the scenario, you want the truncate to happen in its own checkpoint cycle.

I also found out from Andres that he's complained about pretty much the same problem just a couple of months ago:

https://www.postgresql.org/message-id/20230223010147.32oir7sb66slqnjk@awork3.anarazel.de

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On 2023-04-21 09:42:57 -0400, Robert Haas wrote:
> Apologies if this has already been discussed someplace, but I couldn't
> find a previous discussion. It seems to me that base backups are
> broken in the face of a concurrent truncation that reduces the number
> of segments in a relation.

I think https://www.postgresql.org/message-id/20230223010147.32oir7sb66slqnjk@awork3.anarazel.de and the commits + discussions referenced from there are relevant for the topic.

> Suppose we have a relation that is 1.5GB in size, so that we have two
> files 23456, which is 1GB, and 23456.1, which is 0.5GB. We'll back
> those files up in whichever order the directory scan finds them.
> Suppose we back up 23456.1 first. Then the relation is truncated to
> 0.5GB, so 23456.1 is removed and 23456 gets a lot shorter. Next we
> back up the file 23456. Now our backup contains files 23456 and
> 23456.1, each 0.5GB. But this breaks the invariant in md.c:
>
>  * On disk, a relation must consist of consecutively numbered segment
>  * files in the pattern
>  *    -- Zero or more full segments of exactly RELSEG_SIZE blocks each
>  *    -- Exactly one partial segment of size 0 <= size < RELSEG_SIZE blocks
>  *    -- Optionally, any number of inactive segments of size 0 blocks.
>
> basebackup.c's theory about relation truncation is that it doesn't
> really matter because WAL replay will fix things up. But in this case,
> I don't think it will, because WAL replay relies on the above
> invariant holding. As mdnblocks says:
>
>     /*
>      * If segment is exactly RELSEG_SIZE, advance to next one.
>      */
>     segno++;
>
> So I think what's going to happen is we're not going to notice 23456.1
> when we recover the backup. It will just sit there as an orphaned file
> forever, unless we extend 23456 back to a full 1GB, at which point we
> might abruptly start considering that file part of the relation again.

One important point is that I think 23456.1 at that point can contain data. Which can lead to deleted rows reappearing, vacuum failing due to rows from before relfrozenxid existing, etc.

> Assuming I'm not wrong about all of this, the question arises: whose
> fault is this, and what to do about it? It seems to me that it's a bit
> hard to blame basebackup.c, because if you used pg_backup_start() and
> pg_backup_stop() and copied the directory yourself, you'd have exactly
> the same situation, and while we could (and perhaps should) teach
> basebackup.c to do something smarter, it doesn't seem realistic to
> impose complex constraints on the user's choice of file copy tool.

Agreed.

> So I think the problem is with md.c assuming that its invariant must
> hold on a cluster that's not guaranteed to be in a consistent state.
> But mdnblocks() clearly can't try to open every segment up to whatever
> the maximum theoretical possible segment number is every time it's
> invoked, because that would be wicked expensive. An idea that occurs
> to me is to remove all segment files following the first partial
> segment during startup, before we begin WAL replay. If that state
> occurs at startup, then either we have a scenario involving
> truncation, like those above, or a scenario involving relation
> extension, where we added a new segment and that made it to disk but
> the prior extension of the previous last segment file to maximum
> length did not. But in that case, WAL replay should, I think, fix
> things up. However, I'm not completely sure that there isn't some hole
> in this theory, and this way forward also doesn't sound particularly
> cheap.
> Nonetheless I don't have another idea right now.

What we've discussed somewhere in the past is to always truncate N+1 when creating the first page in N. I.e. if we extend into 23456.1, we truncate 23456.2 to 0 blocks. As far as I can tell, that'd solve this issue?

Greetings,

Andres Freund
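[As an illustration of that idea only, here is a rough standalone sketch using plain POSIX calls rather than the real md.c/smgr API; the helper name, path format, and the point at which it would be called are all assumptions, not actual PostgreSQL code.]

```
#include <stdio.h>
#include <unistd.h>
#include <errno.h>

/*
 * Hypothetical helper: called right after the first block of segment
 * 'new_segno' has been created for the relation whose segment-0 path is
 * 'relpath'.  It truncates a stale next segment, if one exists, so an old
 * file can never be "rediscovered" once the relation grows back.
 */
static void
truncate_stale_next_segment(const char *relpath, int new_segno)
{
    char        nextpath[1024];

    /* segment N+1 of "base/5/23456" lives in "base/5/23456.<N+1>" */
    snprintf(nextpath, sizeof(nextpath), "%s.%d", relpath, new_segno + 1);

    if (truncate(nextpath, 0) == 0)
        fprintf(stderr, "truncated stale segment %s\n", nextpath);
    else if (errno != ENOENT)
        perror(nextpath);       /* ENOENT is the normal, expected case */
}

int
main(void)
{
    /* e.g. we just extended the relation into segment 1 */
    truncate_stale_next_segment("base/5/23456", 1);
    return 0;
}
```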
On Mon, Apr 24, 2023 at 8:03 PM Andres Freund <andres@anarazel.de> wrote:
> What we've discussed somewhere in the past is to always truncate N+1 when
> creating the first page in N. I.e. if we extend into 23456.1, we truncate
> 23456.2 to 0 blocks. As far as I can tell, that'd solve this issue?

Yeah, although leaving 23456.2 forever unless and until that happens doesn't sound amazing.

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On 2023-04-25 11:42:43 -0400, Robert Haas wrote:
> On Mon, Apr 24, 2023 at 8:03 PM Andres Freund <andres@anarazel.de> wrote:
> > What we've discussed somewhere in the past is to always truncate N+1 when
> > creating the first page in N. I.e. if we extend into 23456.1, we truncate
> > 23456.2 to 0 blocks. As far as I can tell, that'd solve this issue?
>
> Yeah, although leaving 23456.2 forever unless and until that happens
> doesn't sound amazing.

It isn't - but the alternatives aren't great either. It's not that easy to hit this scenario, so I think something along these lines is more palatable than adding a pass through the entire data directory.

I think eventually we'll have to make the WAL logging bulletproof enough against this to avoid the risk of it. I think that is possible. I suspect we would need to prevent checkpoints from happening at the wrong moment, if we were to go down that route.

I guess that eventually we'll need to polish the infrastructure for determining restartpoints, so that delayChkptFlags doesn't actually prevent checkpoints, just moves the restart to a safe LSN. Although I guess that truncations aren't frequent enough (compared to transaction commits) for that to be required "now".

Greetings,

Andres Freund
Hi,

> I think that to reproduce the scenario, you want the truncate to happen in
> its own checkpoint cycle.

OK, let's try this again. In order to effectively disable the checkpointer I added the following lines to postgresql.conf:

```
checkpoint_timeout = 3600
max_wal_size = 100G
```

I'm also keeping an eye on `logfile` in order to make sure the system doesn't do anything unexpected. Then:

```
-- Just double-checking
show checkpoint_timeout;
 checkpoint_timeout
--------------------
 1h

show max_wal_size;
 max_wal_size
--------------
 100GB

create table truncateme(id integer, val varchar(1024));
alter table truncateme set (autovacuum_enabled = off);

select relfilenode from pg_class where relname = 'truncateme';
 relfilenode
-------------
       16385

-- takes ~30 seconds
insert into truncateme
select
  id,
  (
    select string_agg(chr((33+random()*(126-33)) :: integer), '')
    from generate_series(1,1000)
  )
from generate_series(1,2*1024*1024) as id;

delete from truncateme where id > 1024*1024;

select count(*) from truncateme;
  count
---------
 1048576

-- Making a checkpoint as pg_basebackup would do.
-- Also, making sure truncate will happen in its own checkpoint cycle.
checkpoint;
```

Again I see 3 segments:

```
$ ls -lah 16385*
-rw------- 1 eax eax 1.0G May 1 19:24 16385
-rw------- 1 eax eax 1.0G May 1 19:27 16385.1
-rw------- 1 eax eax 293M May 1 19:27 16385.2
-rw------- 1 eax eax 608K May 1 19:24 16385_fsm
```

Making a backup of .2 as if I'm pg_basebackup:

```
cp 16385.2 ~/temp/16385.2
```

Truncating the table:

```
vacuum truncateme;
```

... and killing postgres:

```
$ pkill -9 postgres
```

Now I see:

```
$ ls -lah 16385*
-rw------- 1 eax eax 1.0G May 1 19:30 16385
-rw------- 1 eax eax 147M May 1 19:31 16385.1
-rw------- 1 eax eax    0 May 1 19:31 16385.2
-rw------- 1 eax eax 312K May 1 19:31 16385_fsm
-rw------- 1 eax eax  40K May 1 19:31 16385_vm

$ cp ~/temp/16385.2 ./16385.2
```

Starting postgres:

```
LOG:  starting PostgreSQL 16devel on x86_64-linux, compiled by gcc-11.3.0, 64-bit
LOG:  listening on IPv4 address "0.0.0.0", port 5432
LOG:  listening on IPv6 address "::", port 5432
LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
LOG:  database system was interrupted; last known up at 2023-05-01 19:27:22 MSK
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/8AAB36B0
LOG:  invalid record length at 0/CE9BDE60: expected at least 24, got 0
LOG:  redo done at 0/CE9BDE28 system usage: CPU: user: 6.51 s, system: 2.45 s, elapsed: 8.97 s
LOG:  checkpoint starting: end-of-recovery immediate wait
LOG:  checkpoint complete: wrote 10 buffers (0.0%); 0 WAL file(s) added, 0 removed, 68 recycled; write=0.026 s, sync=1.207 s, total=1.769 s; sync files=10, longest=1.188 s, average=0.121 s; distance=1113129 kB, estimate=1113129 kB; lsn=0/CE9BDE60, redo lsn=0/CE9BDE60
LOG:  database system is ready to accept connections
```

```
$ ls -lah 16385*
-rw------- 1 eax eax 1.0G May 1 19:33 16385
-rw------- 1 eax eax 147M May 1 19:33 16385.1
-rw------- 1 eax eax    0 May 1 19:33 16385.2
-rw------- 1 eax eax 312K May 1 19:33 16385_fsm
-rw------- 1 eax eax  40K May 1 19:33 16385_vm
```

```
select count(*) from truncateme;
  count
---------
 1048576
```

So I'm still unable to reproduce the described scenario, at least on PG16.

--
Best regards,
Aleksander Alekseev
On Mon, May 1, 2023 at 12:54 PM Aleksander Alekseev <aleksander@timescale.com> wrote:
> So I'm still unable to reproduce the described scenario, at least on PG16.

Well, that proves that either (1) the scenario that I described is impossible for some unknown reason or (2) something is wrong with your test scenario. I bet it's (2), but if it's (1), it would be nice to know what the reason is. One can't feel good about code that appears on the surface to be broken even if one knows that some unknown magical thing is preventing disaster.

I find it pretty hard to accept that there's no problem at all here, especially in view of the fact that Andres independently posted about the same issue on another thread. It's pretty clear from looking at the code that mdnblocks() can't open any segments past the first one that isn't of the maximum possible size. It's also fairly clear that a crash or a base backup can create such situations. And finally it's pretty clear that having an old partial segment be rediscovered due to the relation being re-extended would be quite bad. So how can there not be a problem? I admit I haven't done the legwork to nail down a test case where everything comes together just right to show user-visible breakage, but your success in finding one where it doesn't is no proof of anything.

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On 2023-05-08 08:57:08 -0400, Robert Haas wrote:
> On Mon, May 1, 2023 at 12:54 PM Aleksander Alekseev
> <aleksander@timescale.com> wrote:
> > So I'm still unable to reproduce the described scenario, at least on PG16.
>
> Well, that proves that either (1) the scenario that I described is
> impossible for some unknown reason or (2) something is wrong with your
> test scenario. I bet it's (2), but if it's (1), it would be nice to
> know what the reason is. One can't feel good about code that appears
> on the surface to be broken even if one knows that some unknown
> magical thing is preventing disaster.

It seems pretty easy to create disconnected segments. You don't even need a basebackup for it.

To make it easier, I rebuilt with segsize_blocks=16. This isn't required, it just makes it a lot cheaper to play around. To no one's surprise: I'm not a patient person...

Started the server with autovacuum=off.

DROP TABLE IF EXISTS large;
CREATE TABLE large AS SELECT generate_series(1, 100000);
SELECT current_setting('data_directory') || '/' || pg_relation_filepath('large');

ls -l /srv/dev/pgdev-dev/base/5/24585* shows lots of segments.

Attach gdb, set a breakpoint on truncate.

DROP TABLE large;

The breakpoint will fire. Continue once. In a concurrent session, trigger a checkpoint.

Due to the checkpoint we'll not replay any WAL record. And the checkpoint will unlink the first segment. Kill the server.

After crash recovery, you end up with all but the first segment still present. As the first segment doesn't exist anymore, nothing prevents that oid from being recycled in the future. Once it is recycled and the first segment grows large enough, the later segments will suddenly re-appear.

It's not quite so trivial to reproduce issues with partial truncations / concurrent base backups. The problem is that it's hard to guarantee the iteration order of the base backup process. You'd just need to write a manual base backup script though. Consider a script mimicking the filesystem returning directory entries in "reverse name order".

The recipe involves two sessions: one (BB) doing a base backup, the other (DT) running VACUUM, making the table shorter.

BB: Copy <relfilenode>.2
BB: Copy <relfilenode>.1
DT: Truncate relation to < SEGSIZE
BB: Copy <relfilenode>

The replay of the smgrtruncate record will determine the relation size to figure out what segments to remove. Because <relfilenode> is < SEGSIZE, it'll only truncate <relfilenode>, not <relfilenode>.N. And boom, a disconnected segment.

(I'll post a separate email about an evolved proposal for fixing this set of issues)

Greetings,

Andres Freund
Hi,

On 2023-04-25 10:28:58 -0700, Andres Freund wrote:
> On 2023-04-25 11:42:43 -0400, Robert Haas wrote:
> > On Mon, Apr 24, 2023 at 8:03 PM Andres Freund <andres@anarazel.de> wrote:
> > > What we've discussed somewhere in the past is to always truncate N+1 when
> > > creating the first page in N. I.e. if we extend into 23456.1, we truncate
> > > 23456.2 to 0 blocks. As far as I can tell, that'd solve this issue?
> >
> > Yeah, although leaving 23456.2 forever unless and until that happens
> > doesn't sound amazing.
>
> It isn't - but the alternatives aren't great either. It's not that easy to hit
> this scenario, so I think something along these lines is more palatable than
> adding a pass through the entire data directory.
>
> I think eventually we'll have to make the WAL logging bulletproof enough
> against this to avoid the risk of it. I think that is possible.

I think we should extend my proposal above with improved WAL logging.

Right now the replay of XLOG_SMGR_TRUNCATE records uses the normal smgrtruncate() path - which essentially relies on smgrnblocks() to determine the relation size, which in turn iterates over the segments until it finds one < SEGSIZE.

That's fine during normal running, where we are consistent. But it's bogus when we're not consistent - in case of a concurrent truncation, the base backup might have missed intermediate segments, while copying later segments.

We should fix this by including not just the "target" length in the WAL record, but also the "current" length. Then during WAL replay of such records we'd not look at the files currently present, we'd just stupidly truncate all the segments mentioned in the range.

I think we ought to do the same for mdunlinkfork().

> I suspect we would need to prevent checkpoints from happening at the wrong
> moment, if we were to go down that route.
>
> I guess that eventually we'll need to polish the infrastructure for
> determining restartpoints, so that delayChkptFlags doesn't actually prevent
> checkpoints, just moves the restart to a safe LSN. Although I guess that
> truncations aren't frequent enough (compared to transaction commits) for that
> to be required "now".

Unfortunately the current approach of smgrtruncate records is quite racy with checkpoints. You can end up with a sequence of something like

1) SMGRTRUNCATE record
2) CHECKPOINT
3) Truncation of the segment files

If you crash anywhere in 3), we don't replay the SMGRTRUNCATE record. It's not a great solution, but I suspect we'll just have to set delayChkptFlags during truncations to prevent this.

I also think that we need to fix the order in which mdunlink() operates. It seems *quite* bogus that we unlink in "forward" segment order, rather than in reverse order (like mdtruncate()). If we crash after unlinking segment .N, we'll not redo unlinking the later segments after a crash, nor in WAL replay. While the improved WAL logging would address this as well, it still seems pointlessly risky to iterate forward rather than backward.

Even with those changes, I think we might still need something like the "always truncate the next segment" bit I described in my prior email, though.

Greetings,

Andres Freund
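[To make the record-format idea concrete, here is a sketch using plain POSIX calls and invented names; it is not the actual xlog/smgr replay code, and the "old length in the WAL record" is an assumption of the proposal. The point it illustrates: if replay knew both the pre- and post-truncation lengths, it could walk backwards over every segment implied by the old length and truncate it, without trusting whatever files happen to be present on disk.]

```
#include <stdio.h>
#include <unistd.h>
#include <errno.h>

#define RELSEG_BLOCKS 131072L           /* blocks per 1GB segment */
#define BLCKSZ        8192L

/*
 * Hypothetical replay helper: truncate every segment implied by the *old*
 * relation length carried in the record, working backwards as mdtruncate()
 * does, instead of probing which segment files currently exist.
 */
static void
replay_truncate(const char *relpath, long old_nblocks, long new_nblocks)
{
    long        old_last_seg = old_nblocks / RELSEG_BLOCKS;
    long        new_last_seg = new_nblocks / RELSEG_BLOCKS;
    long        segno;

    for (segno = old_last_seg; segno >= new_last_seg; segno--)
    {
        char        path[1024];
        long        keep_blocks;

        if (segno == 0)
            snprintf(path, sizeof(path), "%s", relpath);
        else
            snprintf(path, sizeof(path), "%s.%ld", relpath, segno);

        /* the new last segment keeps the remainder; later segments keep nothing */
        keep_blocks = (segno == new_last_seg) ? new_nblocks % RELSEG_BLOCKS : 0;

        if (truncate(path, keep_blocks * BLCKSZ) != 0 && errno != ENOENT)
            perror(path);
    }
}

int
main(void)
{
    /* old length: 1.5GB worth of blocks; new length: 0.5GB worth of blocks */
    replay_truncate("base/5/23456", 196608, 65536);
    return 0;
}
```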
Hi Robert,

> I admit I haven't done the legwork to nail down a test
> case where everything comes together just right to show user-visible
> breakage, but your success in finding one where it doesn't is no proof
> of anything.

Respectfully, what made you think this was my intention? Quite the opposite, personally I am inclined to think that the problem does exist. In order to fix it however we need a test that reliably reproduces it first. Otherwise there is no way to figure out whether the fix was correct or not.

What the experiment showed is that the test scenario you initially described is probably the wrong one for reasons yet to be understood and we need to come up with a better one.

--
Best regards,
Aleksander Alekseev
Including the pre-truncation length in the WAL record is the obviously solid approach, and none of the below is a good substitute for it. But...

On Tue, 25 Apr 2023 at 13:30, Andres Freund <andres@anarazel.de> wrote:
>
> It isn't - but the alternatives aren't great either. It's not that easy to hit
> this scenario, so I think something along these lines is more palatable than
> adding a pass through the entire data directory.

Doing one pass through the entire data directory on startup before deciding the directory is consistent doesn't sound like a crazy idea. It's pretty easy to imagine bugs in backup software that leave out files in the middle of tables -- some of us don't even have to imagine...

Similarly, checking for a stray next segment whenever extending a file to the maximum segment size seems like a reasonable thing to check for too.

These kinds of checks are the kind of paranoia that catches filesystem bugs, backup software bugs, cron jobs, etc. that we don't even know to watch for.

--
greg
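[For a single relation, such a startup check could look roughly like the sketch below; this is illustrative only, not existing PostgreSQL code, and the path format and scan bound are arbitrary. It stats successive segment files and flags any non-empty segment that follows a partial or missing one, since the normal segment scan would never reach it.]

```
#include <stdio.h>
#include <sys/stat.h>

#define RELSEG_BYTES (131072L * 8192L)  /* bytes per 1GB segment */

/* Hypothetical consistency check, not actual PostgreSQL code. */
static void
check_segments(const char *relpath)
{
    int         saw_partial = 0;
    int         segno;

    for (segno = 0; segno < 64; segno++)    /* arbitrary bound for the sketch */
    {
        char        path[1024];
        struct stat st;

        if (segno == 0)
            snprintf(path, sizeof(path), "%s", relpath);
        else
            snprintf(path, sizeof(path), "%s.%d", relpath, segno);

        if (stat(path, &st) != 0)
        {
            saw_partial = 1;        /* a missing segment also ends the relation */
            continue;
        }

        if (saw_partial && st.st_size > 0)
            fprintf(stderr, "disconnected segment: %s (%lld bytes)\n",
                    path, (long long) st.st_size);
        else if (st.st_size < RELSEG_BYTES)
            saw_partial = 1;        /* partial segment: nothing should follow it */
    }
}

int
main(void)
{
    check_segments("base/5/23456");
    return 0;
}
```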
On Tue, May 9, 2023 at 5:14 AM Aleksander Alekseev <aleksander@timescale.com> wrote:
> > I admit I haven't done the legwork to nail down a test
> > case where everything comes together just right to show user-visible
> > breakage, but your success in finding one where it doesn't is no proof
> > of anything.
>
> Respectfully, what made you think this was my intention?

Honestly I have no idea what your intention was and didn't mean to judge it. However, I don't think that troubleshooting the test case you put together is the thing that I want to spend time on right now, and I hope that it will still be possible to make some progress on the underlying issue despite that.

> Quite the opposite, personally I am inclined to think that the problem
> does exist. In order to fix it however we need a test that reliably
> reproduces it first. Otherwise there is no way to figure out whether
> the fix was correct or not.
>
> What the experiment showed is that the test scenario you initially
> described is probably the wrong one for reasons yet to be understood
> and we need to come up with a better one.

Hopefully what Andres posted will help in this regard.

--
Robert Haas
EDB: http://www.enterprisedb.com
Greetings,

* Greg Stark (stark@mit.edu) wrote:
> Including the pre-truncation length in the WAL record is the obviously
> solid approach, and none of the below is a good substitute for it.

I tend to agree with the items mentioned in Andres's recent email on this thread too, in terms of improving the WAL logging around this.

> On Tue, 25 Apr 2023 at 13:30, Andres Freund <andres@anarazel.de> wrote:
> > It isn't - but the alternatives aren't great either. It's not that easy to hit
> > this scenario, so I think something along these lines is more palatable than
> > adding a pass through the entire data directory.
>
> Doing one pass through the entire data directory on startup before
> deciding the directory is consistent doesn't sound like a crazy idea.

We're already making a pass through the entire data directory on crash-restart (and fsync'ing everything too), which includes when restoring from backup. See src/backend/access/transam/xlog.c:5155.

Extending that to check for oddities like segments following a not-1GB segment certainly seems like a good idea to me.

> It's pretty easy to imagine bugs in backup software that leave out
> files in the middle of tables -- some of us don't even have to
> imagine...

Yup.

> Similarly, checking for a stray next segment whenever extending a file
> to the maximum segment size seems like a reasonable thing to check for
> too.

Yeah, that definitely seems like a good idea. Extending a relation is already expensive and we've taken steps to deal with that, so detecting that the file we were expecting to create is already there certainly seems like a good idea, and I wouldn't expect (?) it to add a lot of extra time in the normal case.

> These kinds of checks are the kind of paranoia that catches filesystem
> bugs, backup software bugs, cron jobs, etc. that we don't even know to
> watch for.

Agreed, and it would also help in cases where such a situation already exists out there somewhere and which no amount of new WAL records would make go away.

Thanks,

Stephen