Thread: "PANIC: could not open critical system index 2662" - twice
On Thu, 2023-04-06 at 16:41 +0000, Evgeny Morozov wrote:
> Our PostgreSQL 15.2 instance running on Ubuntu 18.04 has crashed with this error:
>
> 2023-04-05 09:24:03.448 UTC [15227] ERROR: index "pg_class_oid_index" contains unexpected zero page at block 0
> [...]
>
> We had the same thing happen about a month ago on a different database on the same cluster.
> For a while PG actually ran OK as long as you didn't access that specific DB, but when trying
> to back up that DB with pg_dump it would crash every time. At that time one of the disks
> hosting the ZFS dataset with the PG data directory on it was reporting errors, so we thought
> it was likely due to that.
>
> Unfortunately, before we could replace the disks, PG crashed completely and would not start
> again at all, so I had to rebuild the cluster from scratch and restore from pg_dump backups
> (still onto the old, bad disks). Once the disks were replaced (all of them) I just copied
> the data to them using zfs send | zfs receive and didn't bother restoring pg_dump backups
> again - which was perhaps foolish in hindsight.
>
> Well, yesterday it happened again. The server still restarted OK, so I took fresh pg_dump
> backups of the databases we care about (which ran fine), rebuilt the cluster and restored
> the pg_dump backups again - now onto the new disks, which are not reporting any problems.
>
> So while everything is up and running now this error has me rather concerned. Could the
> error we're seeing now have been caused by some corruption in the PG data that's been there
> for a month (so it could still be attributed to the bad disk), which should now be fixed by
> having restored from backups onto good disks?

Yes, that is entirely possible.

> Could this be a PG bug?

It could be, but data corruption caused by bad hardware is much more likely.

> What can I do to figure out why this is happening and prevent it from happening again?

No idea about the former, but bad hardware is a good enough explanation.
As to keeping it from happening: use good hardware.

Yours,
Laurenz Albe
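(A hedged aside, not something suggested in the thread: periodic verification can at least surface this kind of catalog/index corruption before the next pg_dump trips over it. The port matches the one mentioned later in the thread; everything else is a placeholder.)

# pg_amcheck ships with PostgreSQL 14+ and needs the amcheck extension
# installed in each database it checks; it verifies B-tree indexes and,
# with --heapallindexed, that every heap tuple is found in its indexes.
pg_amcheck --all --port=5434 --heapallindexed --progress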
On Fri, Apr 07, 2023 at 01:04:34PM +0200, Laurenz Albe wrote: > On Thu, 2023-04-06 at 16:41 +0000, Evgeny Morozov wrote: >> Could this be a PG bug? > > It could be, but data corruption caused by bad hardware is much more likely. There is no way to be completely sure here, except if we would be able to put our hands on a reproducible test case that would break the cluster so much that we'd get into this state. I don't recall seeing this error pattern in recent history, though. -- Michael
On 2023-04-07 13:04:34 +0200, Laurenz Albe wrote:
> On Thu, 2023-04-06 at 16:41 +0000, Evgeny Morozov wrote:
> > What can I do to figure out why this is happening and prevent it from happening again?
>
> No idea about the former, but bad hardware is a good enough explanation.
>
> As to keeping it from happening: use good hardware.

Also: Use checksums. PostgreSQL offers data checksums[1]. Some filesystems also offer checksums. This doesn't prevent corruption but at least it will be detected early and can't spread.

        hp

[1] For some reason I thought the Debian/Ubuntu packages enabled this by default. But that doesn't seem to be the case.

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
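(For reference, a minimal sketch of both ways to get data checksums; the data directory path is a placeholder, and pg_checksums requires a cleanly shut down cluster.)

# Enable checksums when creating a new cluster:
initdb --data-checksums -D /var/lib/postgresql/15/main

# Or check/enable them on an existing, stopped cluster (PostgreSQL 12+):
pg_checksums --check  -D /var/lib/postgresql/15/main
pg_checksums --enable -D /var/lib/postgresql/15/main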
> No idea about the former, but bad hardware is a good enough explanation.
> As to keeping it from happening: use good hardware.

Alright, thanks, I'll just keep my fingers crossed that it doesn't happen again then!

> Also: Use checksums. PostgreSQL offers data checksums[1]. Some filesystems also offer checksums.

We have data_checksums=on. (It must be on by default, since I cannot find that in our config files anywhere.) However, the docs say "Only data pages are protected by checksums; internal data structures and temporary files are not.", so I guess pg_class_oid_index might be an "internal data structure"?

We also have checksum=on for the ZFS dataset on which the data is stored (also the default - we didn't change it). ZFS did detect problems (zpool status reported read, write and checksum errors for one of the old disks), but it also said "errors: No known data errors". I understood that to mean that it recovered from the errors, i.e. wrote the data to different disk blocks or read it from another disk in the pool.
On Tue, Apr 11, 2023 at 04:44:54PM +0000, Evgeny Morozov wrote:
> We have data_checksums=on. (It must be on by default, since I cannot
> find that in our config files anywhere.)

initdb does not enable checksums by default, requiring -k/--data-checksums, so this addition likely comes from your environment.

> However, the docs say "Only data pages are protected by checksums; internal data structures
> and temporary files are not.", so I guess pg_class_oid_index might be an
> "internal data structure"?

pg_class_oid_index is a btree index that relies on 8k on-disk pages (default size), so it is subject to the same rules as normal relations regarding checksums for the pages flushed to disk, even if it is on a catalog.
--
Michael
On 12/04/2023 2:35 am, Michael Paquier wrote:
> initdb does not enable checksums by default, requiring a
> -k/--data-checksums, so likely this addition comes from your
> environment.

Indeed, turns out we had it in init_db_options.

>> However, the docs say "Only data pages are protected by checksums; internal data
>> structures and temporary files are not.", so I guess pg_class_oid_index might be an
>> "internal data structure"?
>
> pg_class_oid_index is a btree index that relies on 8k on-disk pages
> (default size), so it is subject to the same rules as normal relations
> regarding checksums for the pages flushed to disk, even if it is on a
> catalog.

OK, so then what does that mean for the error in the subject? At what point should that problem have been detected by the data checksums?
On Thu, 2023-04-13 at 06:56 +0000, Evgeny Morozov wrote: > On 12/04/2023 2:35 am, Michael Paquier wrote: > > initdb does not enable checksums by default, requiring a > > -k/--data-checksums, so likely this addition comes from from your > > environment. > > OK, so then what does that mean for the error in the subject? At what > point should that problem have been detected by the data checksums? It means that if the error is caused by a faulty disk changing your data, you'll notice as soon as you touch the page. That would perhaps not have made a lot of difference in your case, except that the error message would have been different and proof that the disk was the problem. Yours, Laurenz Albe -- Cybertec | https://www.cybertec-postgresql.com
On 13/04/2023 5:02 pm, Laurenz Albe wrote: > It means that if the error is caused by a faulty disk changing your data, > you'll notice as soon as you touch the page. > > That would perhaps not have made a lot of difference in your case, > except that the error message would have been different and proof > that the disk was the problem. OK, but we had data checksums on the whole time. So that means that the disk was NOT the problem in our case?
On Thu, 2023-04-13 at 19:07 +0000, Evgeny Morozov wrote: > On 13/04/2023 5:02 pm, Laurenz Albe wrote: > > It means that if the error is caused by a faulty disk changing your data, > > you'll notice as soon as you touch the page. > > > > That would perhaps not have made a lot of difference in your case, > > except that the error message would have been different and proof > > that the disk was the problem. > > OK, but we had data checksums on the whole time. So that means that the > disk was NOT the problem in our case? Hmm, I am not certain. The block was filled with zeros from your error message, and I think such blocks don't trigger a checksum warning. So if your disk replaces a valid block with zeros (filesystem check after crash?), that could explain what you see. Yours, Laurenz Albe
On Fri, 14 Apr 2023, Laurenz Albe wrote:
> So if your disk replaces a valid block with zeros (filesystem check
> after crash?), that could explain what you see.

Oh, I had that happen on a RAID 1 once. One of the two discs had an intermittent error (write I guess) but didn't fail out of the RAID, and some of the reads from there got zero-filled blocks in some positions. It was a CVS repository, so I was able to identify all such blocks in question and restore them from the rsync slave (whose initial population predated the HDD issue).

Hm, now that I think about it, it could even have been a read error with subsequent block reassignment. Oh well.

Filesystem issues (ext3, and ext4 without/predating auto_da_alloc, in particular) could be it just as well of course.

bye,
//mirabilos
--
15:41⎜<Lo-lan-do:#fusionforge> Somebody write a testsuite for helloworld :-)
> Hmm, I am not certain. The block was filled with zeros from your error
> message, and I think such blocks don't trigger a checksum warning.

OK, so data_checksums=on might not have made any difference in this case?

> So if your disk replaces a valid block with zeros (filesystem check
> after crash?), that could explain what you see.

If by "crash" here you mean the OS crashing - we didn't have that happen. The OS is on separate disks, which have not reported any errors.

When we first ran into this problem the PG data was on a ZFS RAIDZ (i.e. RAID5) volume of 3 disks, and for one of them `zpool status -v` reported read, write and checksum error count > 0, but it also said "errors: No known data errors" and the disk status remained "online" (it did not become "faulted" or "offline"). (Now we have the PG data on a ZFS mirror volume of 2 new disks, which have not reported any errors.)

I don't know whether ZFS zero-fills blocks on disk errors. As I understood, ZFS should have been able to recover from disk errors (that were "unrecoverable" at the hardware level) using the data on the other two disks (which did not report any errors). Thus, PG should not have seen any corrupted data (if ZFS was working correctly). https://unix.stackexchange.com/questions/341614/understanding-the-error-reporting-of-zfs-on-linux seems to confirm this. Am I misunderstanding something?
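(For what it's worth, a scrub is the usual way to make ZFS re-read and verify every allocated block against its checksums, repairing from redundancy where it can; "tank" is a placeholder pool name.)

zpool scrub tank       # re-verify all data in the pool in the background
zpool status -v tank   # afterwards: per-device error counters and any files with unrecoverable errors
zpool clear tank       # optionally reset the error counters once the cause is understood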
> On 14 Apr 2023, at 9:38, Evgeny Morozov <postgresql3@realityexists.net> wrote:

(…)

> I don't know whether ZFS zero-fills blocks on disk errors. As I
> understood, ZFS should have been able to recover from disk errors (that
> were "unrecoverable" at the hardware level) using the data on the other
> two disks (which did not report any errors). Thus, PG should not have
> seen any corrupted data (if ZFS was working correctly).
> https://unix.stackexchange.com/questions/341614/understanding-the-error-reporting-of-zfs-on-linux
> seems to confirm this. Am I misunderstanding something?

Your problem coincides with a thread at freebsd-current with very similar data corruption after a recent OpenZFS import: blocks of all zeroes, but also missing files. So, perhaps these problems are related?

Apparently, there was a recent fix for a data corruption issue with the block_cloning feature enabled, but people are still seeing corruption even when they never enabled that feature.

I couldn’t really find the start of the thread in the archives, so this one kind of jumps into the middle of the thread at a relevant-looking point:

https://lists.freebsd.org/archives/freebsd-current/2023-April/003446.html

Regards,
Alban Hertroys
--
There is always an exception to always.
On 14/04/2023 10:42 am, Alban Hertroys wrote:
> Your problem coincides with a thread at freebsd-current with very
> similar data corruption after a recent OpenZFS import: blocks of all
> zeroes, but also missing files. So, perhaps these problems are related?
> Apparently, there was a recent fix for a data corruption issue with the
> block_cloning feature enabled, but people are still seeing corruption
> even when they never enabled that feature.
>
> I couldn’t really find the start of the thread in the archives, so this one
> kind of jumps into the middle of the thread at a relevant-looking point:
>
> https://lists.freebsd.org/archives/freebsd-current/2023-April/003446.html

That thread was a bit over my head, I'm afraid, so I can't say if it's related. I haven't detected any missing files, anyway.

Well, the problem happened again! Kind of... This time PG has not crashed with the PANIC error in the subject, but pg_dumping certain DBs again fails with

pg_dump: error: connection to server on socket "/var/run/postgresql/.s.PGSQL.5434" failed: FATAL: index "pg_class_oid_index" contains unexpected zero page at block 0

PG server log contains:

2023-05-03 04:31:49.903 UTC [14724] postgres@test_behavior_638186279733138190 FATAL: index "pg_class_oid_index" contains unexpected zero page at block 0
2023-05-03 04:31:49.903 UTC [14724] postgres@test_behavior_638186279733138190 HINT: Please REINDEX it.

The server PID does not change on such a pg_dump attempt, so it appears that only the backend process for the pg_dump connection crashes this time. I don't see any disk errors and there haven't been any OS crashes.

This currently happens for two DBs. Both of them are very small DBs created by automated .NET tests using Npgsql as client. The code creates such a test DB from a template DB and tries to drop it at the end of the test. This times out sometimes and on timeout our test code tries to drop the DB again (while the first drop command is likely still pending on the server). This second attempt to drop the DB also timed out:

[12:40:39] Npgsql.NpgsqlException : Exception while reading from stream
----> System.TimeoutException : Timeout during reading attempt
   at Npgsql.NpgsqlConnector.<ReadMessage>g__ReadMessageLong|194_0(NpgsqlConnector connector, Boolean async, DataRowLoadingMode dataRowLoadingMode, Boolean readingNotifications, Boolean isReadingPrependedMessage)
   at Npgsql.NpgsqlDataReader.NextResult(Boolean async, Boolean isConsuming, CancellationToken cancellationToken)
   at Npgsql.NpgsqlDataReader.NextResult()
   at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior, Boolean async, CancellationToken cancellationToken)
   at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior, Boolean async, CancellationToken cancellationToken)
   at Npgsql.NpgsqlCommand.ExecuteNonQuery(Boolean async, CancellationToken cancellationToken)
   at Npgsql.NpgsqlCommand.ExecuteNonQuery()
...
[12:41:41] (same error again for the same DB)

From looking at old logs it seems like the same thing happened last time (in April) as well. That's quite an unlikely coincidence if a bad disk or bad filesystem was to blame, isn't it? I've tried to reproduce this by re-running those tests over and over, but without success so far.

So what can I do about this? Do I just try to drop those databases again manually? But what about the next time it happens? How do I figure out the cause and prevent this problem?
On Thu, 2023-05-04 at 15:49 +0000, Evgeny Morozov wrote: > Well, the problem happened again! Kind of... This time PG has not > crashed with the PANIC error in the subject, but pg_dumping certain DBs > again fails with > > > pg_dump: error: connection to server on socket > "/var/run/postgresql/.s.PGSQL.5434" failed: FATAL: index > "pg_class_oid_index" contains unexpected zero page at block 0 If you dumped and restored the database after the last time the error happened, there must be a systemic problem. Perhaps you have bad hardware, or a problem with a storage driver, file system or some other low-level software component. It might of course be a PostgreSQL bug too, but it is hard to say without a way to reproduce... Yours, Laurenz Albe
On 4/05/2023 6:42 pm, Laurenz Albe wrote: > On Thu, 2023-05-04 at 15:49 +0000, Evgeny Morozov wrote: >> Well, the problem happened again! Kind of... This time PG has not >> crashed with the PANIC error in the subject, but pg_dumping certain DBs >> again fails with >> >> >> pg_dump: error: connection to server on socket >> "/var/run/postgresql/.s.PGSQL.5434" failed: FATAL: index >> "pg_class_oid_index" contains unexpected zero page at block 0 > If you dumped and restored the database after the last time the error > happened, there must be a systemic problem. I dumped and restored the "real" databases I cared about. The tests databases on which error now happens are new (created 2 days ago). > Perhaps you have bad hardware, or a problem with a storage driver, > file system or some other low-level software component. > It might of course be a PostgreSQL bug too, but it is hard to say > without a way to reproduce... I'm now thinking of setting up a dedicated AWS EC2 instance just for these little DBs that get created by our automated tests. If the problem happens there as well then that would strongly point towards a bug in PostgreSQL, wouldn't it? (And if nothing else, at least it won't affect the more important DBs!) Meanwhile, what do I do with the existing server, though? Just try to drop the problematic DBs again manually?
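(A rough sketch of retrying the drop by hand, assuming nothing else still needs those test databases; WITH (FORCE), available since PostgreSQL 13, also terminates any sessions still connected to the target database. Database names and port are the ones mentioned elsewhere in the thread.)

sudo -u postgres psql -w -p 5434 -d postgres -c "DROP DATABASE test_behavior_638186279733138190 WITH (FORCE);"
sudo -u postgres psql -w -p 5434 -d postgres -c "DROP DATABASE test_behavior_638186280406544656 WITH (FORCE);"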
[snip]

> I'm now thinking of setting up a dedicated AWS EC2 instance just for these little DBs that get created by our automated tests. If the problem happens there as well then that would strongly point towards a bug in PostgreSQL, wouldn't it?

Many other people besides you would have noticed regular corruption of system catalogs.

> Meanwhile, what do I do with the existing server, though? Just try to drop the problematic DBs again manually?

"Fix the hardware" is what you do.

--
Born in Arizona, moved to Babylonia.
On Fri, May 5, 2023 at 6:11 AM Evgeny Morozov <postgresql3@realityexists.net> wrote: > Meanwhile, what do I do with the existing server, though? Just try to > drop the problematic DBs again manually? That earlier link to a FreeBSD thread is surely about bleeding edge new ZFS stuff that was briefly broken then fixed, being discovered by people running code imported from OpenZFS master branch into FreeBSD main branch (ie it's not exactly released, not following the details but I think it might soon be 2.2?), but you're talking about an LTS Ubuntu release from 2018, which shipped "ZFS on Linux" version 0.7.5, unless you installed a newer version somehow? So it doesn't sound like it could be related. That doesn't mean it couldn't be a different ZFS bug though. While looking into file system corruption issues that had similar symptoms on some other file system (which turned out to be a bug in btrfs) I did bump into a claim that ZFS could product unexpected zeroes in some mmap coherency scenario, OpenZFS issue #14548. I don't immediately see how PostgreSQL could get tangled up with that problem though, as we aren't doing that... It seems quite interesting that it's always pg_class_oid_index block 0 (the btree meta-page), which feels more like a PostgreSQL bug, unless the access pattern of that particular file/block is somehow highly unusual compared to every other block and tickling bugs elsewhere in the stack. How does that file look, in terms of size, and how many pages in it are zero? I think it should be called base/5/2662. Oooh, but this is a relation that goes through RelationMapOidToFilenumber. What does select pg_relation_filepath('pg_class_oid_index') show in the corrupted database, base/5/2662 or something else? Now *that* is a piece of logic that changed in PostgreSQL 15. It changed from sector-based atomicity assumptions to a directory entry swizzling trick, in commit d8cd0c6c95c0120168df93aae095df4e0682a08a. Hmm.
On Fri, May 5, 2023 at 11:15 AM Thomas Munro <thomas.munro@gmail.com> wrote: > Now *that* is a piece of > logic that changed in PostgreSQL 15. It changed from sector-based > atomicity assumptions to a directory entry swizzling trick, in commit > d8cd0c6c95c0120168df93aae095df4e0682a08a. Hmm. I spoke too soon, that only changed in 16. But still, it means there are two files that could be corrupted here, pg_filenode.map which might somehow be pointing to the wrong file, and the relation (index) main fork file.
On Fri, May 5, 2023 at 11:15 AM Thomas Munro <thomas.munro@gmail.com> wrote: > What does select > pg_relation_filepath('pg_class_oid_index') show in the corrupted > database, base/5/2662 or something else? Oh, you can't get that far, but perhaps you could share the pg_filenode.map file? Or alternatively strace -f PostgreSQL while it's starting up to see which file it's reading, just to be sure. One way to find clues about whether PostgreSQL did something wrong, once we definitely have the right relfilenode for the index, aside from examining its contents, would be to search the WAL for references to that block with pg_waldump. Maybe you still have enough WAL if it happened recently?
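(A sketch of the strace idea, assuming the external_pid_file shown later in the thread; paths and the trace filter are illustrative, and attaching to the postmaster needs appropriate privileges.)

# Follow the backend forked for the failing connection and log every file it opens.
strace -f -tt -e trace=openat -o /tmp/pg.trace -p "$(head -1 /var/run/postgresql/15-behavior.pid)"
# In another session, trigger the error:
#   psql -p 5434 -d test_behavior_638186279733138190 -c 'select 1'
# Then look for the relation map and the pg_class_oid_index relfilenode:
grep -E 'pg_filenode\.map|/2662' /tmp/pg.trace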
On Fri, May 5, 2023 at 11:15 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>> What does select pg_relation_filepath('pg_class_oid_index') show in the
>> corrupted database, base/5/2662 or something else?
>
> Oh, you can't get that far, but perhaps you could share the pg_filenode.map file?

Hi Thomas, thanks very much for looking into this!

The OID of the bad DB ('test_behavior_638186279733138190') is 1414389 and I've uploaded base/1414389/pg_filenode.map and also base/5/2662 (in case that's helpful) as https://objective.realityexists.net/temp/pgstuff1.zip

> Maybe you still have enough WAL if it happened recently?

Maybe! What should I do with pg_waldump? I've never used it before.
>>>>> "Evgeny" == Evgeny Morozov <postgresql3@realityexists.net> writes: Evgeny> Indeed, I cannot get that far due to the same error. I read Evgeny> about ignore_system_indexes, but... Evgeny> # sudo -u postgres psql -w -p 5434 -c "set ignore_system_indexes=on"; Evgeny> ERROR: parameter "ignore_system_indexes" cannot be set after connection Evgeny> start sudo -u postgres psql -w -p 5434 -d "options='-P'" (make that -d "dbname=whatever options='-P'" if you need to specify some database name; or use PGOPTIONS="-P" in the environment.) -- Andrew (irc:RhodiumToad)
> sudo -u postgres psql -w -p 5434 -d "options='-P'"
>
> (make that -d "dbname=whatever options='-P'" if you need to specify some
> database name; or use PGOPTIONS="-P" in the environment.)

# sudo -u postgres psql -w -p 5434 -d "dbname=test_behavior_638186279733138190 options='-P'"
psql: error: connection to server on socket "/var/run/postgresql/.s.PGSQL.5434" failed: FATAL: could not open relation with OID 2964

(PG server log contains the same error message and no further details.)

select 2964::regclass returns "pg_db_role_setting" FWIW.
On Fri, May 5, 2023 at 7:50 PM Evgeny Morozov <postgresql3@realityexists.net> wrote:
> The OID of the bad DB ('test_behavior_638186279733138190') is 1414389 and I've uploaded base/1414389/pg_filenode.map and also base/5/2662 (in case that's helpful) as https://objective.realityexists.net/temp/pgstuff1.zip

Thanks. That pg_filenode.map looks healthy to me.

tmunro@build1:~/junk $ od -t x1 pg_filenode.map
0000000 17 27 59 00 11 00 00 00 eb 04 00 00 eb 04 00 00
0000020 e1 04 00 00 e1 04 00 00 e7 04 00 00 e7 04 00 00
0000040 df 04 00 00 df 04 00 00 14 0b 00 00 14 0b 00 00
0000060 15 0b 00 00 15 0b 00 00 4b 10 00 00 4b 10 00 00
0000100 4c 10 00 00 4c 10 00 00 82 0a 00 00 82 0a 00 00
0000120 83 0a 00 00 83 0a 00 00 8f 0a 00 00 8f 0a 00 00
0000140 90 0a 00 00 90 0a 00 00 62 0a 00 00 62 0a 00 00
0000160 63 0a 00 00 63 0a 00 00 66 0a 00 00 66 0a 00 00
...

hex(2662) is 0xa66, and we see 66 0a 00 00 followed by 66 0a 00 00 in that last line as expected, so that rules out the idea that it's somehow trashed that map file and points to the wrong relation file.

Next can you share the file base/1414389/2662? ("5" was from the wrong database.)

> > Maybe you still have enough WAL if it happened recently?
>
> Maybe! What should I do with pg_waldump? I've never used it before.

Try something like:

pg_waldump -R 1663/1414389/2662 -F main 000000010000000000000001 000000010000000000000007

... but change that to the range of files you have in your pg_wal.
On 6/05/2023 1:06 am, Thomas Munro wrote:
> Next can you share the file base/1414389/2662? ("5" was from the wrong
> database.)

Right - I should have realised that! base/1414389/2662 is indeed all nulls, 32KB of them. I included the file anyway in https://objective.realityexists.net/temp/pgstuff2.zip

> Try something like:
> pg_waldump -R 1663/1414389/2662 -F main 000000010000000000000001
> 000000010000000000000007
>
> ... but change that to the range of files you have in your pg_wal.

The PG server had only 2 files left from today, and pg_waldumping them gave this error:

pg_waldump -R 1663/1414389/2662 -F main 00000001000000650000005E 00000001000000650000005F
pg_waldump: error: error in WAL record at 65/5F629838: invalid record length at 65/5F62A1E0: wanted 24, got 0

Not sure if that's something to worry about or not! Then I realised we're actually archiving our WAL files with pgBackRest, retrieved the WAL files for the time the DB was created and used (~12:39 UTC on 2023-05-02) and re-ran pg_waldump on those.

pg_waldump -R 1663/1414389/2662 -F main 000000010000005B00000000 000000010000005B0000000F
rmgr: XLOG  len (rec/tot): 51/  108, tx: 242382, lsn: 5B/0222BC68, prev 5B/0222BC38, desc: FPI , blkref #0: rel 1663/1414389/2662 blk 0 FPW
rmgr: XLOG  len (rec/tot): 51/ 3224, tx: 242382, lsn: 5B/0222BCD8, prev 5B/0222BC68, desc: FPI , blkref #0: rel 1663/1414389/2662 blk 1 FPW
rmgr: XLOG  len (rec/tot): 51/ 1544, tx: 242382, lsn: 5B/0222C988, prev 5B/0222BCD8, desc: FPI , blkref #0: rel 1663/1414389/2662 blk 2 FPW
rmgr: XLOG  len (rec/tot): 49/  121, tx: 242382, lsn: 5B/0222CF90, prev 5B/0222C988, desc: FPI , blkref #0: rel 1663/1414389/2662 blk 3 FPW
rmgr: Btree len (rec/tot): 64/   64, tx: 242384, lsn: 5B/02321870, prev 5B/023217A0, desc: INSERT_LEAF off 132, blkref #0: rel 1663/1414389/2662 blk 2
rmgr: Btree len (rec/tot): 64/   64, tx: 242384, lsn: 5B/02322640, prev 5B/02322570, desc: INSERT_LEAF off 133, blkref #0: rel 1663/1414389/2662 blk 2
... (many more entries like that)
pg_waldump: error: error in WAL record at 5B/FFFFF38: missing contrecord at 5B/FFFFF70

The full output is also in https://objective.realityexists.net/temp/pgstuff2.zip
On Sat, May 6, 2023 at 9:58 PM Evgeny Morozov <postgresql3@realityexists.net> wrote:
> Right - I should have realised that! base/1414389/2662 is indeed all
> nulls, 32KB of them. I included the file anyway in
> https://objective.realityexists.net/temp/pgstuff2.zip

OK so it's not just page 0, you have 32KB or 4 pages of all zeroes. That's the expected length of that relation when copied from the initial template, and consistent with the pg_waldump output (it uses FPIs to copy blocks 0-3). We can't see the block contents but we know that block 2 definitely is not all zeroes at that point because there are various modifications to it, which not only write non-zeroes but must surely have required a sane page 0.

So it does indeed look like something unknown has replaced 32KB of data with 32KB of zeroes underneath us. Are there more non-empty files that are all-zeroes? Something like this might find them:

for F in base/1414389/*
do
  if [ -s $F ] && ! xxd -p $F | grep -qEv '^(00)*$' > /dev/null
  then
    echo $F
  fi
done
On 6/05/2023 12:34 pm, Thomas Munro wrote: > So it does indeed look like something unknown has replaced 32KB of > data with 32KB of zeroes underneath us. Are there more non-empty > files that are all-zeroes? Something like this might find them: > > for F in base/1414389/* > do > if [ -s $F ] && ! xxd -p $F | grep -qEv '^(00)*$' > /dev/null > then > echo $F > fi > done Yes, a total of 309 files are all-zeroes (and 52 files are not). I also checked the other DB that reports the same "unexpected zero page at block 0" error, "test_behavior_638186280406544656" (OID 1414967) - similar story there. I uploaded the lists of zeroed and non-zeroed files and the ls -la output for both as https://objective.realityexists.net/temp/pgstuff3.zip I then searched recursively such all-zeroes files in $PGDATA/base and did not find any outside of those two directories (base/1414389 and base/1414967). None in $PGDATA/global, either.
On Sun, May 7, 2023 at 12:29 AM Evgeny Morozov <postgresql3@realityexists.net> wrote:
> On 6/05/2023 12:34 pm, Thomas Munro wrote:
> > So it does indeed look like something unknown has replaced 32KB of
> > data with 32KB of zeroes underneath us. Are there more non-empty
> > files that are all-zeroes? Something like this might find them:
> >
> > for F in base/1414389/*
> > do
> >   if [ -s $F ] && ! xxd -p $F | grep -qEv '^(00)*$' > /dev/null
> >   then
> >     echo $F
> >   fi
> > done
>
> Yes, a total of 309 files are all-zeroes (and 52 files are not).
>
> I also checked the other DB that reports the same "unexpected zero page
> at block 0" error, "test_behavior_638186280406544656" (OID 1414967) -
> similar story there. I uploaded the lists of zeroed and non-zeroed files
> and the ls -la output for both as
> https://objective.realityexists.net/temp/pgstuff3.zip
>
> I then searched recursively such all-zeroes files in $PGDATA/base and
> did not find any outside of those two directories (base/1414389 and
> base/1414967). None in $PGDATA/global, either.

So "diff -u zeroed-files-1414967.txt zeroed-files-1414389.txt" shows that they have the same broken stuff in the range cloned from the template database by CREATE DATABASE STRATEGY=WAL_LOG, and it looks like it's *all* the cloned catalogs, and then they have some non-matching relfilenodes > 1400000, presumably stuff you created directly in the new database (I'm not sure if I can say for sure that those files are broken, without knowing what they are).

Did you previously run this same workload on versions < 15 and never see any problem? 15 gained a new feature CREATE DATABASE ... STRATEGY=WAL_LOG, which is also the default. I wonder if there is a bug somewhere near that, though I have no specific idea. If you explicitly added STRATEGY=FILE_COPY to your CREATE DATABASE commands, you'll get the traditional behaviour.

It seems like you have some kind of high frequency testing workload that creates and tests databases all day long, and just occasionally detects this corruption. Would you like to try requesting FILE_COPY for a while and see if it eventually happens like that too?

My spidey sense is leaning away from filesystem bugs. We've found plenty of filesystem bugs on these mailing lists over the years and of course it's not impossible, but I dunno... it seems quite suspicious that all the system catalogs have apparently been wiped during or moments after the creation of a new database that's running new PostgreSQL 15 code...
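(A sketch of what explicitly requesting the pre-15 behaviour might look like in the test harness; the database and template names here are placeholders, not the thread's actual commands.)

# PostgreSQL 15 defaults to STRATEGY = WAL_LOG; this asks for the traditional file-copy path instead.
sudo -u postgres psql -w -p 5434 -c "CREATE DATABASE test_behavior_example TEMPLATE my_template_db STRATEGY = FILE_COPY;"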
On Sat, May 6, 2023 at 6:35 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Sat, May 6, 2023 at 9:58 PM Evgeny Morozov > <postgresql3@realityexists.net> wrote: > > Right - I should have realised that! base/1414389/2662 is indeed all > > nulls, 32KB of them. I included the file anyway in > > https://objective.realityexists.net/temp/pgstuff2.zip > > OK so it's not just page 0, you have 32KB or 4 pages of all zeroes. > That's the expected length of that relation when copied from the > initial template, and consistent with the pg_waldump output (it uses > FPIs to copy blocks 0-3). We can't see the block contents but we know > that block 2 definitely is not all zeroes at that point because there > are various modifications to it, which not only write non-zeroes but > must surely have required a sane page 0. > > So it does indeed look like something unknown has replaced 32KB of > data with 32KB of zeroes underneath us. This may be related... I seem to recall the GNUlib folks talking about a cp bug on sparse files. It looks like it may be fixed in coreutils release 9.2 (2023-03-20): https://github.com/coreutils/coreutils/blob/master/NEWS#L233 If I recall correctly, it had something to do with the way copy_file_range worked. (Or maybe, it did not work as expected). According to the GNUlib docs (https://www.gnu.org/software/gnulib/manual/html_node/copy_005ffile_005frange.html): This function has many problems on Linux kernel versions before 5.3 > Are there more non-empty > files that are all-zeroes? Something like this might find them: > > for F in base/1414389/* > do > if [ -s $F ] && ! xxd -p $F | grep -qEv '^(00)*$' > /dev/null > then > echo $F > fi > done
On Sun, May 7, 2023 at 10:23 AM Jeffrey Walton <noloader@gmail.com> wrote: > This may be related... I seem to recall the GNUlib folks talking about > a cp bug on sparse files. It looks like it may be fixed in coreutils > release 9.2 (2023-03-20): > https://github.com/coreutils/coreutils/blob/master/NEWS#L233 > > If I recall correctly, it had something to do with the way > copy_file_range worked. (Or maybe, it did not work as expected). > > According to the GNUlib docs > (https://www.gnu.org/software/gnulib/manual/html_node/copy_005ffile_005frange.html): > > This function has many problems on Linux > kernel versions before 5.3 That's quite interesting, thanks (we've been talking about making direct use of copy_file_range() in a few threads, I'll definitely be looking into that history), but we don't currently use copy_file_range() or any coreutils stuff in the relevant code paths here -- this data is copied by plain old pread() and pwrite().
Thomas Munro <thomas.munro@gmail.com> writes:
> Did you previously run this same workload on versions < 15 and never
> see any problem? 15 gained a new feature CREATE DATABASE ...
> STRATEGY=WAL_LOG, which is also the default. I wonder if there is a
> bug somewhere near that, though I have no specific idea.

Per the release notes I was just writing ...

    <listitem>
<!--
Author: Michael Paquier <michael@paquier.xyz>
Branch: master [8a8661828] 2023-02-22 10:14:52 +0900
Branch: REL_15_STABLE [fa5dd460c] 2023-02-22 10:14:56 +0900
-->
     <para>
      Fix potential corruption of the template (source) database after
      <command>CREATE DATABASE</command> with the <literal>STRATEGY
      WAL_LOG</literal> option (Nathan Bossart, Ryo Matsumura)
     </para>

     <para>
      Improper buffer handling created a risk that any later modification
      of the template's <structname>pg_class</structname> catalog would
      be lost.
     </para>
    </listitem>

The comment about only pg_class being affected is my interpretation of what the commit message said, but I might have misunderstood.

			regards, tom lane
On Sun, May 7, 2023 at 1:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Did you previously run this same workload on versions < 15 and never
> > see any problem? 15 gained a new feature CREATE DATABASE ...
> > STRATEGY=WAL_LOG, which is also the default. I wonder if there is a
> > bug somewhere near that, though I have no specific idea.
>
> Per the release notes I was just writing ...
>
>     <listitem>
> <!--
> Author: Michael Paquier <michael@paquier.xyz>
> Branch: master [8a8661828] 2023-02-22 10:14:52 +0900
> Branch: REL_15_STABLE [fa5dd460c] 2023-02-22 10:14:56 +0900
> -->
>      <para>
>       Fix potential corruption of the template (source) database after
>       <command>CREATE DATABASE</command> with the <literal>STRATEGY
>       WAL_LOG</literal> option (Nathan Bossart, Ryo Matsumura)
>      </para>

Hmm. That bug seems to have caused corruption (backwards time travel) of blocks in the *source* DB's pg_class, by failing to write back changes. We seem to have zeroed pages in the *target* database, for all catalogs (apparently everything copied by RelationCopyStorageUsingBuffer()), even though the template is still fine. It is as if RelationCopyStorageUsingBuffer() created the zero-filled file with smgrextend(), but then the buffer data was never written out even though we memcpy'd it into a buffer and set the buffer dirty.

Bug-in-PostgreSQL explanations could include that we forgot it was dirty, or some backend wrote it out to the wrong file; but if we were forgetting something like permanent or dirty, would there be a more systematic failure? Oh, it could require special rare timing if it is similar to 8a8661828's confusion about permanence level or otherwise somehow not setting BM_PERMANENT, but in the target blocks, so I think that'd require a checkpoint AND a crash. It doesn't reproduce for me, but perhaps more unlucky ingredients are needed.

Bug-in-OS/FS explanations could include that a whole lot of writes were mysteriously lost in some time window, so all those files still contain the zeroes we write first in smgrextend(). I guess this previously rare (previously limited to hash indexes?) use of sparse file hole-punching could be a factor in an it's-all-ZFS's-fault explanation:

openat(AT_FDCWD,"base/16390/2662",O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC,0600) = 36 (0x24)
openat(AT_FDCWD,"base/1/2662",O_RDWR|O_CLOEXEC,00) = 37 (0x25)
lseek(37,0x0,SEEK_END) = 32768 (0x8000)
lseek(37,0x0,SEEK_END) = 32768 (0x8000)
pwrite(36,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,8192,0x6000) = 8192 (0x2000) <-- smgrextend(final block)
lseek(36,0x0,SEEK_END) = 32768 (0x8000)

I was trying to think about how I might go about trying to repro the exact system setup. Evgeny, do you mind sharing your "zfs get all /path/to/pgdata" (curious to see block size, compression settings, anything else etc) and your postgresql.conf? And your exact Ubuntu kernel version and ZFS package versions?
On 6/05/2023 11:13 pm, Thomas Munro wrote:
> Did you previously run this same workload on versions < 15 and never
> see any problem?

Yes, kind of. We have a test suite that creates one test DB and runs a bunch of tests on it. Two of these tests, however, create another DB each (also by cloning the same template DB) in order to test copying data between DBs. It's only these "extra" DBs that were corrupted, at least on this occasion. (Hard to say about the last time, because that time it all went south and the whole server crashed, and we may have had some residual corruption from bad disks then - who knows.) I'm not sure whether the tests that created the extra DBs existed before we upgraded to PG 15, but we definitely have not seen such problems on PG 13 or 14.

> It seems like you have some kind of high frequency testing workload that creates and tests databases all day long, and just occasionally detects this corruption.

Maybe 10-30 times per day normally, depending on the day. However, I have tried to repro this by running those two specific tests thousands of times in one day, without success.

> Would you like to try requesting FILE_COPY for a while and see if it eventually happens like that too?

Sure, we can try that.

On 7/05/2023 12:30 pm, Thomas Munro wrote:
> your "zfs get all /path/to/pgdata"

PROPERTY              VALUE                  SOURCE
type                  filesystem             -
creation              Mon Mar  6 17:07 2023  -
used                  166G                   -
available             2.34T                  -
referenced            166G                   -
compressratio         2.40x                  -
mounted               yes                    -
quota                 none                   default
reservation           none                   default
recordsize            16K                    local
mountpoint            /default
sharenfs              off                    default
checksum              on                     default
compression           lz4                    received
atime                 off                    inherited from pgdata
devices               on                     default
exec                  off                    inherited from pgdata
setuid                off                    inherited from pgdata
readonly              off                    default
zoned                 off                    default
snapdir               hidden                 default
aclinherit            restricted             default
createtxg             90                     -
canmount              on                     received
xattr                 on                     default
copies                1                      default
version               5                      -
utf8only              off                    -
normalization         none                   -
casesensitivity       sensitive              -
vscan                 off                    default
nbmand                off                    default
sharesmb              off                    default
refquota              none                   default
refreservation        none                   default
primarycache          all                    default
secondarycache        all                    default
usedbysnapshots       199M                   -
usedbydataset         166G                   -
usedbychildren        0B                     -
usedbyrefreservation  0B                     -
logbias               latency                default
dedup                 off                    default
mlslabel              none                   default
sync                  standard               default
dnodesize             legacy                 default
refcompressratio      2.40x                  -
written               64.9M                  -
logicalused           397G                   -
logicalreferenced     397G                   -
volmode               default                default
filesystem_limit      none                   default
snapshot_limit        none                   default
filesystem_count      none                   default
snapshot_count        none                   default
snapdev               hidden                 default
acltype               off                    default
context               none                   default
fscontext             none                   default
defcontext            none                   default
rootcontext           none                   default
relatime              off                    default
redundant_metadata    all                    default
overlay               off                    default

> your postgresql.conf?

We have a bunch of config files, so I tried to get the resulting config using "select name, setting from pg_settings where source = 'configuration file'" - hopefully that gives what you wanted.
name                        | setting
----------------------------+-------------------------------------------------------
archive_command             | pgbackrest --stanza="behavior-pg15" archive-push "%p"
archive_mode                | on
archive_timeout             | 900
cluster_name                | 15/behavior
DateStyle                   | ISO, MDY
default_text_search_config  | pg_catalog.english
dynamic_shared_memory_type  | posix
external_pid_file           | /var/run/postgresql/15-behavior.pid
full_page_writes            | off
lc_messages                 | C
lc_monetary                 | C
lc_numeric                  | C
lc_time                     | C
listen_addresses            | *
log_checkpoints             | on
log_connections             | on
log_disconnections          | on
log_file_mode               | 0640
log_line_prefix             | %m [%p] %q%u@%d
log_lock_waits              | on
log_min_duration_statement  | 1000
log_temp_files              | 0
log_timezone                | Etc/UTC
maintenance_work_mem        | 1048576
max_connections             | 100
max_slot_wal_keep_size      | 30000
max_wal_size                | 1024
min_wal_size                | 80
port                        | 5434
shared_buffers              | 4194304
ssl                         | on
ssl_cert_file               | (redacted)
ssl_ciphers                 | TLSv1.2:TLSv1.3:!aNULL
ssl_dh_params_file          | (redacted)
ssl_key_file                | (redacted)
ssl_min_protocol_version    | TLSv1.2
temp_buffers                | 10240
TimeZone                    | Etc/UTC
unix_socket_directories     | /var/run/postgresql
wal_compression             | pglz
wal_init_zero               | off
wal_level                   | replica
wal_recycle                 | off
work_mem                    | 262144

> And your exact Ubuntu kernel version and ZFS package versions?

Ubuntu 18.04.6
Kernel 4.15.0-206-generic #217-Ubuntu SMP Fri Feb 3 19:10:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
zfsutils-linux package version 0.7.5-1ubuntu16.12 amd64
On Mon, May 8, 2023 at 4:10 AM Evgeny Morozov <postgresql3@realityexists.net> wrote: > On 6/05/2023 11:13 pm, Thomas Munro wrote: > > Would you like to try requesting FILE_COPY for a while and see if it eventually happens like that too? > Sure, we can try that. Maybe you could do some one way and some the other, so that we try to learn more? > Ubuntu 18.04.6 > Kernel 4.15.0-206-generic #217-Ubuntu SMP Fri Feb 3 19:10:13 UTC 2023 > x86_64 x86_64 x86_64 GNU/Linux > zfsutils-linux package version 0.7.5-1ubuntu16.12 amd64 I tried for a few hours to reproduce this by trying to make as many things as similar to yours as I could based on info in this thread (Linux: up-to-date Ubuntu 18.04 in vagrant which has nearly the same kernel 4.15.0-208-generic and a close but slightly different version of ancient ZFS 0.7.5-1ubuntu15, not sure why, ZFS: mirror (I used a pair of loopback files), recordsize=16kB, compression=lz4, PG: compiled from tag REL_15_2, data_checksums=on, full_page_writes=off, wal_recycle=off, wal_init_zero=off), with what I thought might be roughly what you're doing (creating three DBs, two clones of the first, with various modification at various points, with various overlapping activities, and then checking for catalog corruption). No cigar. Hrmph.
On Sun, May 07, 2023 at 10:30:52PM +1200, Thomas Munro wrote:
> Bug-in-PostgreSQL explanations could include that we forgot it was
> dirty, or some backend wrote it out to the wrong file; but if we were
> forgetting something like permanent or dirty, would there be a more
> systematic failure? Oh, it could require special rare timing if it is
> similar to 8a8661828's confusion about permanence level or otherwise
> somehow not setting BM_PERMANENT, but in the target blocks, so I think
> that'd require a checkpoint AND a crash. It doesn't reproduce for me,
> but perhaps more unlucky ingredients are needed.
>
> Bug-in-OS/FS explanations could include that a whole lot of writes
> were mysteriously lost in some time window, so all those files still
> contain the zeroes we write first in smgrextend(). I guess this
> previously rare (previously limited to hash indexes?) use of sparse
> file hole-punching could be a factor in an it's-all-ZFS's-fault
> explanation:

Yes, you would need a bit of all that.

I can reproduce the same backtrace here. That's just my usual laptop with ext4, so this would be a Postgres bug. First, here are the four things running in parallel so that I can get a failure in loading a critical index when connecting:

1) Create and drop a database with WAL_LOG as strategy and the regression database as template:
while true; do
  createdb --template=regression --strategy=wal_log testdb;
  dropdb testdb;
done

2) Feeding more data to pg_class in the middle, while testing the connection to the database created:
while true; do
  psql -c 'create table popo as select 1 as a;' regression > /dev/null 2>&1 ;
  psql testdb -c "select 1" > /dev/null 2>&1 ;
  psql -c 'drop table popo' regression > /dev/null 2>&1 ;
  psql testdb -c "select 1" > /dev/null 2>&1 ;
done;

3) Force some checkpoints:
while true; do psql -c 'checkpoint' > /dev/null 2>&1; sleep 4; done

4) Force a few crashes and recoveries:
while true ; do pg_ctl stop -m immediate ; pg_ctl start ; sleep 4 ; done

And here is one backtrace:
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x00007f8a395cad2f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x00007f8a3957bef2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007f8a39566472 in __GI_abort () at ./stdlib/abort.c:79
#4  0x000055f0b36b95ac in errfinish (filename=0x55f0b38d4f68 "relcache.c", lineno=4335, funcname=0x55f0b38d6500 <__func__.12> "load_critical_index") at elog.c:604
#5  0x000055f0b36a9c64 in load_critical_index (indexoid=2662, heapoid=1259) at relcache.c:4335
#6  0x000055f0b36a9712 in RelationCacheInitializePhase3 () at relcache.c:4110
(gdb) up 5
#5  0x000055f0b36a9c64 in load_critical_index (indexoid=2662, heapoid=1259) at relcache.c:4335
4335            elog(PANIC, "could not open critical system index %u", indexoid);

You can also get failures with btree deduplication because of the CTAS running above, by the way, but that's just the top of the iceberg once the state is corrupted:

#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x00007fcae38abd2f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x00007fcae385cef2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007fcae3847472 in __GI_abort () at ./stdlib/abort.c:79
#4  0x00005556a809f584 in ExceptionalCondition (conditionName=0x5556a812b639 "false", fileName=0x5556a812abc8 "heapam.c", lineNumber=7882) at assert.c:66
#5  0x00005556a79e13db in index_delete_sort_cmp (deltid1=0x5556a9682158, deltid2=0x7ffdb62c7088) at heapam.c:7882
#6  0x00005556a79e14f6 in index_delete_sort (delstate=0x7ffdb62c8350) at heapam.c:7923
#7  0x00005556a79e0cd0 in heap_index_delete_tuples (rel=0x7fcae0f1a9a8, delstate=0x7ffdb62c8350) at heapam.c:7580
#8  0x00005556a7a07be4 in table_index_delete_tuples (rel=0x7fcae0f1a9a8, delstate=0x7ffdb62c8350) at ../../../../src/include/access/tableam.h:1337
#9  0x00005556a7a0a7a3 in _bt_delitems_delete_check (rel=0x7fcae0f231c8, buf=179, heapRel=0x7fcae0f1a9a8, delstate=0x7ffdb62c8350) at nbtpage.c:1541
#10 0x00005556a7a07545 in _bt_simpledel_pass (rel=0x7fcae0f231c8, buffer=179, heapRel=0x7fcae0f1a9a8, deletable=0x7ffdb62c8430, ndeletable=160, newitem=0x5556a9689260,

Anyway, you would be able to see that with b3e184a (pretty much the same as 15.2), without 8a86618 included. Once 8a86618 is included, all these steps are stable in the backend, at least here.

Or do we have some low-hanging fruit with the WAL_LOG strategy? That could always be possible, of course, but that looks like the same issue to me, just with a different symptom showing up.
--
Michael
On Mon, May 8, 2023 at 2:24 PM Michael Paquier <michael@paquier.xyz> wrote: > I can reproduce the same backtrace here. That's just my usual laptop > with ext4, so this would be a Postgres bug. First, here are the four > things running in parallel so as I can get a failure in loading a > critical index when connecting: That sounds like good news, but I'm still confused: do you see all 0s in the target database (popo)'s catalogs, as reported (and if so can you explain how they got there?), or is it regression that is corrupted in more subtle ways also involving 1s?
On Mon, May 08, 2023 at 02:46:37PM +1200, Thomas Munro wrote: > That sounds like good news, but I'm still confused: do you see all 0s > in the target database (popo)'s catalogs, as reported (and if so can > you explain how they got there?), or is it regression that is > corrupted in more subtle ways also involving 1s? Nope, I have not been able to confirm that yet without 8a86618. The test runs at a high frequency, so that's kind of hard to catch. I have not been able to get things in a state where I could look at a FPW for pg_class or its index, either, in a way similar to Evgeny. -- Michael
On 8/05/2023 4:24 am, Michael Paquier wrote: > here are the four things running in parallel so as I can get a failure > in loading a critical index when connecting Wow, that is some amazing detective work! We do indeed create tables during our tests, specifically partitions of tables copied from the template DB. Checkpoints seem to be happening every few minutes (we don't force them, but there is a big DB with more writes on the same instance - it's probably due to that). PG is not crashing in our case, though - not this time. Do you have any idea why the "drop database" command would have timed out (not completed after 30 seconds) for the corrupted DBs? On 8/05/2023 4:17 am, Thomas Munro wrote: > Maybe you could do some one way and some the other, so that we try to > learn more? Do you still want me to try this given what Michael has found? Or anything else to help narrow this down?
On 2023-May-07, Thomas Munro wrote:
> Did you previously run this same workload on versions < 15 and never
> see any problem? 15 gained a new feature CREATE DATABASE ...
> STRATEGY=WAL_LOG, which is also the default. I wonder if there is a
> bug somewhere near that, though I have no specific idea. If you
> explicitly added STRATEGY=FILE_COPY to your CREATE DATABASE commands,
> you'll get the traditional behaviour.

Maybe it would be sensible to make STRATEGY=FILE_COPY the default again, for branch 15, before today's release.

--
Álvaro Herrera   Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Doing what he did amounts to sticking his fingers under the hood of the
implementation; if he gets his fingers burnt, it's his problem."  (Tom Lane)
On Mon, May 8, 2023 at 7:55 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Sun, May 07, 2023 at 10:30:52PM +1200, Thomas Munro wrote: > > Bug-in-PostgreSQL explanations could include that we forgot it was > > dirty, or some backend wrote it out to the wrong file; but if we were > > forgetting something like permanent or dirty, would there be a more > > systematic failure? Oh, it could require special rare timing if it is > > similar to 8a8661828's confusion about permanence level or otherwise > > somehow not setting BM_PERMANENT, but in the target blocks, so I think > > that'd require a checkpoint AND a crash. It doesn't reproduce for me, > > but perhaps more unlucky ingredients are needed. > > > > Bug-in-OS/FS explanations could include that a whole lot of writes > > were mysteriously lost in some time window, so all those files still > > contain the zeroes we write first in smgrextend(). I guess this > > previously rare (previously limited to hash indexes?) use of sparse > > file hole-punching could be a factor in an it's-all-ZFS's-fault > > explanation: > > Yes, you would need a bit of all that. > > I can reproduce the same backtrace here. That's just my usual laptop > with ext4, so this would be a Postgres bug. First, here are the four > things running in parallel so as I can get a failure in loading a > critical index when connecting: > 1) Create and drop a database with WAL_LOG as strategy and the > regression database as template: > while true; do > createdb --template=regression --strategy=wal_log testdb; > dropdb testdb; > done > 2) Feeding more data to pg_class in the middle, while testing the > connection to the database created: > while true; > do psql -c 'create table popo as select 1 as a;' regression > /dev/null 2>&1 ; > psql testdb -c "select 1" > /dev/null 2>&1 ; > psql -c 'drop table popo' regression > /dev/null 2>&1 ; > psql testdb -c "select 1" > /dev/null 2>&1 ; > done; > 3) Force some checkpoints: > while true; do psql -c 'checkpoint' > /dev/null 2>&1; sleep 4; done > 4) Force a few crashes and recoveries: > while true ; do pg_ctl stop -m immediate ; pg_ctl start ; sleep 4 ; done > I am able to reproduce this using the steps given above, I am also trying to analyze this further. I will send the update once I get some clue. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > Maybe it would be sensible to make STRATEGY_FILE=FILE_COPY the default > again, for branch 15, before today's release. If we had more than one such report, I'd be in favor of that. But I think it's a bit premature to conclude that the copy strategy is to blame. regards, tom lane
Hi,

On 2023-05-07 16:10:28 +0000, Evgeny Morozov wrote:
> Yes, kind of. We have a test suite that creates one test DB and runs a
> bunch of tests on it. Two of these tests, however, create another DB
> each (also by cloning the same template DB) in order to test copying
> data between DBs. It's only these "extra" DBs that were corrupted, at
> least on this occasion.

Did you have any occasions where CREATE or DROP DATABASE was interrupted? Either due to the connection being terminated or a crash?

As described in https://postgr.es/m/20230314174521.74jl6ffqsee5mtug%40awork3.anarazel.de we don't handle that correctly for DROP DATABASE.

I think that might actually fit the symptoms - the DropDatabaseBuffers() will throw away the dirty buffer contents from the WAL strategy CREATE DATABASE, but if you then get cancelled at a point before all the files are removed, the on-disk files with all-zeroes would remain.

Greetings,

Andres Freund
On 8/05/2023 9:47 pm, Andres Freund wrote:
> Did you have any occasions where CREATE or DROP DATABASE was interrupted?
> Either due to the connection being terminated or a crash?

I've uploaded an edited version of the PG log for the time as https://objective.realityexists.net/temp/log-extract-2023-05-02.txt (test_behavior_638186279733138190 and test_behavior_638186280406544656 are the two DBs that got corrupted).

I cannot see any crash in the test logs or the PG logs, but whether it was interrupted is less clear. I don't know whether the tests ran successfully up to the point where they tried to drop the DBs (I've since added logging to show that next time), but DROP DATABASE did not return after 30 seconds and the client library (Npgsql) then tried to cancel the requests. We then tried to drop the DB again, with the same results in both cases. After the second attempts timed out we closed the connections anyway - so maybe that was the interruption?

> As described in
> https://postgr.es/m/20230314174521.74jl6ffqsee5mtug%40awork3.anarazel.de
> we don't handle that correctly for DROP DATABASE.
>
> I think that might actually fit the symptoms - the DropDatabaseBuffers() will
> throw away the dirty buffer contents from the WAL strategy CREATE DATABASE,
> but if you then get cancelled at a point before all the files are removed,
> the on-disk files with all-zeroes would remain.

Oooh, that does seem to fit! Thank you for digging that up.
Hi,

On 2023-05-08 20:27:14 +0000, Evgeny Morozov wrote:
> On 8/05/2023 9:47 pm, Andres Freund wrote:
> > Did you have any occasions where CREATE or DROP DATABASE was interrupted?
> > Either due to the connection being terminated or a crash?
>
> I've uploaded an edited version of the PG log for the time as
> https://objective.realityexists.net/temp/log-extract-2023-05-02.txt
> (test_behavior_638186279733138190 and test_behavior_638186280406544656
> are the two DBs that got corrupted).
>
> I cannot see any crash in the test logs or the PG logs, but whether it
> was interrupted is less clear. I don't know whether the tests ran
> successfully up to the point where they tried to drop the DBs (I've
> since added logging to show that next time), but DROP DATABASE did not
> return after 30 seconds and the client library (Npgsql) then tried to
> cancel the requests. We then tried to drop the DB again, with the same
> results in both cases. After the second attempts timed out we closed the
> connections anyway - so maybe that was the interruption?

Yep, that is the interruption. I suspect that something was not processing interrupts, which then led the WaitForProcSignalBarrier() in dropdb() to block. Are you using any extensions? Do you have any chance to figure out what statements were running concurrently with the DROP DATABASE?

Seems we figured out what the problem is... Just need to figure out how to fix it.

I think the minimal fix will entail moving the commit of the transaction into dropdb(). We need to commit before executing DropDatabaseBuffers(), as the database is inconsistent after that point. Luckily DROP DATABASE already can't be executed in a transaction, so that's easily possible. That means we'd lose track of the files for the database in case of a crash, but that's surely better than a corrupt database. And it's already the case for CREATE DATABASE.

Additionally we probably need to prevent processing cancellations between DropDatabaseBuffers() and remove_dbtablespaces(). But we can't just stop accepting interrupts, because that'd break WaitForProcSignalBarrier(). Ah, what fun. Unfortunately QueryCancelHoldoffCount is not sufficient for this purpose - we don't just need to prevent cancellations, we also can't accept terminations either...

For a bit I thought we might be able to just reorder operations, moving the WaitForProcSignalBarrier() earlier. Unfortunately I don't think that works, because until DropDatabaseBuffers() has executed, backends might write such buffers back, and we need to do WaitForProcSignalBarrier() after that.

I've toyed with the idea of splitting DropDatabaseBuffers() into two. In a first pass, mark all buffers as IO_IN_PROGRESS (plus perhaps some other flag). That'd prevent new IO from being started for those buffers. Then WaitForProcSignalBarrier(), to make sure nobody has an open FD for those files. Then actually drop all those buffers and unlink the files. That now can happen with interrupts held, without any chance of being blocked, afaict. In case of being cancelled during WaitForProcSignalBarrier(), AbortBufferIO() would remove IO_IN_PROGRESS, and everything would be fine.

I don't like the idea of WaitForProcSignalBarrier() while having buffers marked as in-progress. I don't *immediately* see a deadlock danger, but I'd be unsurprised if there were some. But perhaps a similar approach could be the solution? My gut says that the rough direction might allow us to keep dropdb() a single transaction.

Greetings,

Andres Freund
On Mon, May 08, 2023 at 07:15:20PM +0530, Dilip Kumar wrote:
> I am able to reproduce this using the steps given above, I am also
> trying to analyze this further. I will send the update once I get
> some clue.

Have you been able to reproduce this on HEAD or at the top of
REL_15_STABLE, or is that 15.2-ish without fa5dd46?

One thing I was wondering about to improve the odds of a hit is to
be more aggressive with the number of relations created at once, so that
we are much more aggressive with the number of pages extended in
pg_class from the origin database.
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
> One thing I was wondering about to improve the odds of a hit is to
> be more aggressive with the number of relations created at once, so that
> we are much more aggressive with the number of pages extended in
> pg_class from the origin database.

Andres seems to think it's a problem with aborting a DROP DATABASE.
Adding more data might serve to make the window wider, perhaps.

			regards, tom lane
On Tue, May 9, 2023 at 10:04 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael@paquier.xyz> writes:
> > One thing I was wondering about to improve the odds of a hit is to
> > be more aggressive with the number of relations created at once, so that
> > we are much more aggressive with the number of pages extended in
> > pg_class from the origin database.
>
> Andres seems to think it's a problem with aborting a DROP DATABASE.
> Adding more data might serve to make the window wider, perhaps.

Here's an easy way:

@@ -1689,6 +1689,14 @@ dropdb(const char *dbname, bool missing_ok, bool force)

 	/* Close all smgr fds in all backends. */
 	WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));
+
+	/* XXX pretend one of the above steps got interrupted by a statement timeout or ^C */
+	if (random() % 2 == 0)
+	{
+		QueryCancelPending = true;
+		InterruptPending = true;
+		CHECK_FOR_INTERRUPTS();
+	}

postgres=# create database db2;
CREATE DATABASE
postgres=# drop database db2;
ERROR:  canceling statement due to user request

$ psql db2
psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed:
PANIC:  could not open critical system index 2662

$ od -t x1 base/111117/2662
0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0100000
$ od -t x1 base/111117/2837
0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0040000
$ od -t x1 base/111117/2840
0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0100000
On Mon, May 08, 2023 at 06:04:23PM -0400, Tom Lane wrote:
> Andres seems to think it's a problem with aborting a DROP DATABASE.
> Adding more data might serve to make the window wider, perhaps.

And the odds do indeed get much better once I use these two toys:

CREATE OR REPLACE FUNCTION create_tables(num_tables int)
  RETURNS VOID AS
$func$
BEGIN
  FOR i IN 1..num_tables LOOP
    EXECUTE format('
      CREATE TABLE IF NOT EXISTS %I (id int)', 't_' || i);
  END LOOP;
END
$func$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION drop_tables(num_tables int)
  RETURNS VOID AS
$func$
BEGIN
  FOR i IN 1..num_tables LOOP
    EXECUTE format('
      DROP TABLE IF EXISTS %I', 't_' || i);
  END LOOP;
END
$func$ LANGUAGE plpgsql;

And then use this loop with the others I have mentioned upthread (just
create origindb and the functions in them):

while true; do
  psql -c 'select create_tables(1000)' origindb > /dev/null 2>&1 ;
  psql testdb -c "select 1" > /dev/null 2>&1 ;
  psql -c 'select drop_tables(1000)' origindb > /dev/null 2>&1 ;
  psql testdb -c "select 1" > /dev/null 2>&1 ;
done;

On top of that, I have also been able to reproduce the issue on HEAD,
and luckily a pg_class file remains around, full of zeros:

$ hexdump ./base/199634/1259
0000000 0000 0000 0000 0000 0000 0000 0000 0000

The contents of 2662, though, looked OK.

Echoing Alvaro.. Could we, err, revisit the choice of making WAL_LOG
the default strategy even for this set of minor releases? FWIW, I've
mentioned that this choice was too aggressive in the thread of
8a86618..
--
Michael
Hi,

On 2023-05-08 14:04:00 -0700, Andres Freund wrote:
> But perhaps a similar approach could be the solution? My gut says that the
> rough direction might allow us to keep dropdb() a single transaction.

I started to hack on the basic approach of committing after the catalog
changes. But then I started wondering if we're not tackling this all wrong.

We don't actually want to drop the catalog contents early, as that prevents
the database from being dropped again, while potentially leaving behind
contents in the case of a crash / interrupt. Similarly, the fact that we
only commit the transaction at the end of createdb() leads to interrupts /
crashes orphaning the contents of that [partial] database. We also have
similar issues with movedb(), I think. This is a non-transactional
operation.

I think we should copy the approach of the CONCURRENTLY operations. Namely,
add a new column to pg_database, indicating whether the database contents
are valid. An invalid database can be dropped, but not connected to.

Then we could have createdb() commit before starting to create the target
database directory (with invalid = true, of course). After all the
filesystem level stuff is complete, set invalid = false.

For dropping a database we'd use heap_inplace_update() to set invalid =
true just before the DropDatabaseBuffers(), preventing any connections
after that point.

Making movedb() safe is probably a bit harder - I think it'd temporarily
require two pg_database entries?

Of course we can't add a new column in the back branches. IIRC we had a
similar issue with CIC at some point, and just ended up misusing some other
column for the back branches? We could e.g. use datconnlimit == -2 for that
purpose (but would need to make sure that ALTER DATABASE can't unset it).

My current gut feeling is that we should use datconnlimit == -2 to prevent
connections after reaching DropDatabaseBuffers() in dropdb(), and use a new
column in 16, for both createdb() and dropdb().

In some ways handling createdb() properly is a new feature, but it's also
arguably a bug that we leak the space - and I think the code will be better
if we work on both together.

Greetings,

Andres Freund
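(Again not from the thread, just an illustration of the datconnlimit == -2 idea: pg_database.datconnlimit already exists, so once an interrupted DROP DATABASE marks the database that way, the damaged entries could be listed with something like the sketch below.)

-- Sketch, assuming the proposed convention that datconnlimit = -2 marks a
-- database that is no longer valid to connect to and can only be dropped:
SELECT datname, datconnlimit
FROM pg_database
WHERE datconnlimit = -2;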
Hi,

On 2023-05-08 17:46:37 -0700, Andres Freund wrote:
> My current gut feeling is that we should use datconnlimit == -2 to prevent
> connections after reaching DropDatabaseBuffers() in dropdb(), and use a new
> column in 16, for both createdb() and dropdb().

Attached is a rough prototype of that idea (only using datconnlimit == -2
for now).

This would need a fair bit more polish. The tests are crappy and output
stuff to stderr and don't validate enough. The error messages are bad. No
docs changes. More comments about why datconnlimit == -2 is used. Etc. But
I think it should be sufficient to discuss whether this is a viable path.

I guess we need to move this to -hackers. Perhaps I'll post subsequent
versions below
https://www.postgresql.org/message-id/20230314174521.74jl6ffqsee5mtug%40awork3.anarazel.de ?

Greetings,

Andres Freund
On Tue, May 9, 2023 at 3:15 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, May 08, 2023 at 07:15:20PM +0530, Dilip Kumar wrote:
> > I am able to reproduce this using the steps given above, I am also
> > trying to analyze this further. I will send the update once I get
> > some clue.
>
> Have you been able to reproduce this on HEAD or at the top of
> REL_15_STABLE, or is that 15.2-ish without fa5dd46?

I am able to reproduce this on REL_15_STABLE as well with your test case.
The only difference is that without fa5dd46 the issue reproduces within a
couple of seconds and very consistently, whereas on REL_15_STABLE it takes
2-3 minutes to reproduce and the reproduction is not very consistent.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On 8/05/2023 11:04 pm, Andres Freund wrote:
> Are you using any extensions?

Only plpgsql.

> Do you have any chance to figure out what statements were running
> concurrently with the DROP DATABASE?

No. Is there some way to log that, other than just logging all statements
(which seems impractical)? Most of the statements running on that server
should be inserts into another database, some reads. Something could be
creating or dropping partitions in another DB in parallel, too, but not
frequently.
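(One possible answer, not from the thread: rather than logging every statement, concurrent activity can be sampled from another session, or a small cron job, while a DROP DATABASE is hanging. A minimal sketch against pg_stat_activity:)

SELECT pid, datname, state, wait_event_type, wait_event,
       now() - query_start AS running_for, query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY query_start;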
On Mon, May 8, 2023 at 4:10 AM Evgeny Morozov
<postgresql3@realityexists.net> wrote:
> On 6/05/2023 11:13 pm, Thomas Munro wrote:
> > Would you like to try requesting FILE_COPY for a while and see if it eventually happens like that too?
> Sure, we can try that.
Maybe you could do some one way and some the other, so that we try to
learn more?
> Ubuntu 18.04.6
> Kernel 4.15.0-206-generic #217-Ubuntu SMP Fri Feb 3 19:10:13 UTC 2023
> x86_64 x86_64 x86_64 GNU/Linux
> zfsutils-linux package version 0.7.5-1ubuntu16.12 amd64
I tried for a few hours to reproduce this by trying to make as many
things as similar to yours as I could based on info in this thread
(Linux: up-to-date Ubuntu 18.04 in vagrant which has nearly the same
kernel 4.15.0-208-generic and a close but slightly different version
of ancient ZFS 0.7.5-1ubuntu15, not sure why, ZFS: mirror (I used a
pair of loopback files), recordsize=16kB, compression=lz4, PG:
compiled from tag REL_15_2, data_checksums=on, full_page_writes=off,
wal_recycle=off, wal_init_zero=off), with what I thought might be
roughly what you're doing (creating three DBs, two clones of the
first, with various modifications at various points, with various
overlapping activities, and then checking for catalog corruption). No
cigar. Hrmph.
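(Not part of the thread: for anyone who wants to mirror the PostgreSQL side of that setup, the non-default settings listed above can be applied roughly as sketched below. data_checksums is the exception, since it can only be chosen at initdb time via initdb --data-checksums.)

ALTER SYSTEM SET full_page_writes = off;
ALTER SYSTEM SET wal_recycle = off;
ALTER SYSTEM SET wal_init_zero = off;
SELECT pg_reload_conf();  -- reload so the new settings take effect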
On 10/05/2023 6:39 am, Kirk Wolak wrote:
> It could be as simple as creating temp tables in the other database
> (since I believe pg_class was hit).

We do indeed create temp tables, both in other databases and in the ones
being tested. (We also create non-temp tables there.)

> Also, not sure if the OP has a set of things done after he creates the
> DB that may help?

Basically we read rows from the source database, create some partitions of
tables in the target database, insert into a temp table there using BULK
COPY, then using a regular INSERT copy from the temp tables to the new
partitions.

Now that the problem has been reproduced and understood by the PG
developers, could anyone explain why PG crashed entirely with the "PANIC"
error back in April when only specific databases were corrupted, not any
global objects necessary for PG to run? And why did it not crash with the
"PANIC" on this occasion?
Why would it PANIC on non-system data corruption, but not on system data corruption?
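(For concreteness, a rough SQL sketch of the workflow Evgeny describes above: create a partition in the target database, bulk-copy into a temp table, then move the rows with a regular INSERT. Every table name here is hypothetical, and in the real setup the COPY is driven from the client by Npgsql.)

CREATE TABLE reading (
    id       bigint,
    taken_at timestamptz,
    value    double precision
) PARTITION BY RANGE (taken_at);

CREATE TABLE reading_2023_05 PARTITION OF reading
    FOR VALUES FROM ('2023-05-01') TO ('2023-06-01');

CREATE TEMP TABLE reading_staging (LIKE reading INCLUDING DEFAULTS);
COPY reading_staging FROM STDIN (FORMAT binary);  -- bulk load streamed by the client

INSERT INTO reading_2023_05 SELECT * FROM reading_staging;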
On 9/05/2023 3:32 am, Andres Freund wrote:
> Attached is a rough prototype of that idea (only using datconnlimit ==
> -2 for now).
> I guess we need to move this to -hackers. Perhaps I'll post subsequent
> versions below
> https://www.postgresql.org/message-id/20230314174521.74jl6ffqsee5mtug%40awork3.anarazel.de ?
So now that a patch is in the works, can I drop the two corrupted
databases? Is there a workaround I can use to reduce the risk of running
into this issue again until a patch is released? (Which I guess would be
in August?)
Hi,

On 2023-05-16 14:20:46 +0000, Evgeny Morozov wrote:
> On 9/05/2023 3:32 am, Andres Freund wrote:
> > Attached is a rough prototype of that idea (only using datconnlimit ==
> > -2 for now).
> > I guess we need to move this to -hackers. Perhaps I'll post subsequent
> > versions below
> > https://www.postgresql.org/message-id/20230314174521.74jl6ffqsee5mtug%40awork3.anarazel.de ?
>
> So now that a patch is in the works, can I drop the two corrupted
> databases?

Yes.

> Is there a workaround I can use to reduce the risk of running into this
> issue again until a patch is released? (Which I guess would be in August?)

Try to prevent the DROP DATABASE from getting cancelled :/.

If you want to go a bit further, you could rename the database to *_dropped
before dropping it, and then try to do the DROP DATABASE. That way you'd at
least know that it's corrupt because of a failed DROP DATABASE.

Greetings,

Andres Freund
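(A minimal sketch of that rename-then-drop workaround, reusing one of the test database names from earlier in the thread purely as an example.)

ALTER DATABASE test_behavior_638186279733138190
    RENAME TO test_behavior_638186279733138190_dropped;
-- If this DROP is cancelled and leaves a corrupt database behind, the
-- *_dropped suffix makes the cause obvious later:
DROP DATABASE test_behavior_638186279733138190_dropped;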
On 17/05/2023 1:39 am, Andres Freund wrote:
> Try to prevent the DROP DATABASE from getting cancelled :/.

I still don't know why that's happening. I mean, I know why it gets
cancelled (the client timeout we set in Npgsql), but I don't know why the
drop does not succeed within 30 seconds. We could, of course, increase that
timeout, but my gut feeling is that if it hasn't happened in 30 seconds it
won't happen in 5 minutes, either, and the connection will have to be
closed eventually. Can you think of any way to troubleshoot that?

> If you want to go a bit further, you could rename the database to
> *_dropped before dropping it, and then try to do the DROP DATABASE.
> That way you'd at least know that it's corrupt because of a failed
> DROP DATABASE.

That's a good idea and I will do that, but my bigger concern is PG crashing
entirely with the "PANIC" error and refusing to start again. I still don't
know why that happened back in April (but not this time, in May), so I
don't know how to prevent that.
There haven't been any updates posted to
https://www.postgresql.org/message-id/20230509040203.z6mvijumv7wxcuib%40awork3.anarazel.de
so I just wanted to check if there is any update on the status of the
patch. Can we expect it in PostgreSQL 15.4? Thanks.
Hi,

On 2023-06-19 10:04:35 +0000, Evgeny Morozov wrote:
> There haven't been any updates posted to
> https://www.postgresql.org/message-id/20230509040203.z6mvijumv7wxcuib%40awork3.anarazel.de
> so I just wanted to check if there is any update on the status of the
> patch. Can we expect it in PostgreSQL 15.4? Thanks.

I pushed the fixes to all branches just now. Thanks for the report!

Greetings,

Andres Freund