Thread: BUG #15467: The database subdirectory"pg_tblspc/1932420460/PG_10_201707211/16400" is missing.
BUG #15467: The database subdirectory"pg_tblspc/1932420460/PG_10_201707211/16400" is missing.
From
PG Bug reporting form
Date:
The following bug has been logged on the website: Bug reference: 15467 Logged by: Jay Email address: tsuraan@gmail.com PostgreSQL version: 10.1 Operating system: Linux 4.1.18/AWS Description: I have a system running on AWS that had been processing data without any apparent issues, when the main database apparently just stopped existing. The postgres logs (considerably snipped) look like this: ... [2018-10-29 15:51:30 UTC]STATEMENT: INSERT INTO ... RETURNING inserted [2018-10-29 15:52:05 UTC]FATAL: database "db" does not exist [2018-10-29 15:52:05 UTC]DETAIL: The database subdirectory "pg_tblspc/1932420460/PG_10_201707211/16400" is missing. [2018-10-29 15:52:05 UTC]FATAL: database "db" does not exist [2018-10-29 15:52:05 UTC]DETAIL: The database subdirectory "pg_tblspc/1932420460/PG_10_201707211/16400" is missing. [2018-10-29 15:52:05 UTC]FATAL: database "db" does not exist [2018-10-29 15:52:05 UTC]DETAIL: The database subdirectory "pg_tblspc/1932420460/PG_10_201707211/16400" is missing. ... Prior to the INSERT statement there are about six months of other insert, select, update, etc statements (well, not really because of log rotation, but that's the idea). After that final INSERT, postgres apparently decided that it has table spaces in use, and the main db is stored on them. At "SELECT * FROM pg_tablespace" looks like this: psql (10.1) Type "help" for help. root=# select * from pg_tablespace ; spcname | spcowner | spcacl | spcoptions ------------+----------+--------+------------ pg_default | 10 | | pg_global | 10 | | (2 rows) So, I don't know why it thinks it needs to look for the main database in some other tablespace. We've never declared any explicit tablespaces, so that pg_tblspc directory is always empty, on every system we run (including this one). I've stopped postgres, copied the data onto a system running postgres 10.5, and started it there. The behaviour is identical, so I assume the tablespace info must be stored somewhere other than in the pg_tablespace table. Maybe I could clean it up, if I knew where the info was stored? I'm curious how postgres could get into this state, but mostly I'd be happy to get it functional again.
Re: BUG #15467: The database subdirectory "pg_tblspc/1932420460/PG_10_201707211/16400" is missing.
From
Andrew Gierth
Date:
>>>>> "PG" == PG Bug reporting form <noreply@postgresql.org> writes: PG> I have a system running on AWS that had been processing data without any PG> apparent issues, when the main database apparently just stopped existing. PG> The postgres logs (considerably snipped) look like this: PG> ... PG> [2018-10-29 15:51:30 UTC]STATEMENT: INSERT INTO ... RETURNING inserted PG> [2018-10-29 15:52:05 UTC]FATAL: database "db" does not exist PG> [2018-10-29 15:52:05 UTC]DETAIL: The database subdirectory PG> "pg_tblspc/1932420460/PG_10_201707211/16400" is missing. What is in the pg_database row (especially the dattablespace column) for that database? -- Andrew (irc:RhodiumToad)
Re: BUG #15467: The database subdirectory "pg_tblspc/1932420460/PG_10_201707211/16400"is missing.
From
tsuraan
Date:
> PG> [2018-10-29 15:51:30 UTC]STATEMENT: INSERT INTO ... RETURNING inserted > PG> [2018-10-29 15:52:05 UTC]FATAL: database "db" does not exist > PG> [2018-10-29 15:52:05 UTC]DETAIL: The database subdirectory > PG> "pg_tblspc/1932420460/PG_10_201707211/16400" is missing. > > What is in the pg_database row (especially the dattablespace column) for > that database? Well, that can't be good: psql (10.1) Type "help" for help. root=# select * from pg_database; ERROR: cache lookup failed for type 0 I get that both on the original system (running 10.1) and on my up-to-date system that's running 10.5.
Re: BUG #15467: The database subdirectory "pg_tblspc/1932420460/PG_10_201707211/16400" is missing.
From
Andrew Gierth
Date:
>>>>> "tsuraan" == tsuraan <tsuraan@gmail.com> writes: >> What is in the pg_database row (especially the dattablespace column) for >> that database? tsuraan> Well, that can't be good: tsuraan> psql (10.1) tsuraan> Type "help" for help. tsuraan> root=# select * from pg_database; tsuraan> ERROR: cache lookup failed for type 0 tsuraan> I get that both on the original system (running 10.1) and on tsuraan> my up-to-date system that's running 10.5. Does that error persist if you delete all files named pg_internal.init in the data dir? (those files are a cache of the system tables and are regenerated on the next new connection if they are missing) -- Andrew (irc:RhodiumToad)
Re: BUG #15467: The database subdirectory "pg_tblspc/1932420460/PG_10_201707211/16400"is missing.
From
tsuraan
Date:
> Does that error persist if you delete all files named pg_internal.init > in the data dir? (those files are a cache of the system tables and are > regenerated on the next new connection if they are missing) There were a few of those files, so I removed them: $ find data -name pg_internal.init data/base/1/pg_internal.init data/base/16401/pg_internal.init data/base/12292/pg_internal.init data/base/12291/pg_internal.init data/base/16402/pg_internal.init data/global/pg_internal.init $ find data -name pg_internal.init -exec rm {} \; $ find data -name pg_internal.init $ Starting up postgres again and querying pg_database still gives the same cache lookup error. Another person noticed that the missing directory (pg_tblspc/1932420460/PG_10_201707211/16400) corresponded to an existing directory under data/base/16400, so he copied the contents from data/base/16400 into pg_tblspc/1932420460/PG_10_201707211/16400 (making the previously non-existent parent path "pg_tblspc/1932420460/PG_10_201707211/" along the way). Doing that allowed the main database to once again be queried, and all the data appears to be present. That system still can't query its pg_databases table, so I assume it's still pretty badly broken, but at least we can try a pg_dump and recovery onto a new, clean, system. I still have my copy of the broken database though, and I'd love to know if there's a proper fix for it, so I'll keep this discussion going as long as anybody has ideas :)
Re: BUG #15467: The database subdirectory "pg_tblspc/1932420460/PG_10_201707211/16400" is missing.
From
Andrew Gierth
Date:
>>>>> "tsuraan" == tsuraan <tsuraan@gmail.com> writes: tsuraan> Starting up postgres again and querying pg_database still tsuraan> gives the same cache lookup error. Do you still get the error from just select oid from pg_database; or select oid,dattablespace from pg_database; ? tsuraan> Another person noticed that the missing directory tsuraan> (pg_tblspc/1932420460/PG_10_201707211/16400) I just noticed: 1932420460 = 0x732E656C = "le.s" which is suspiciously textual. Any chance you can do this query: select pg_relation_filepath('pg_database'::regclass); and then do hexdump -C on that file? -- Andrew (irc:RhodiumToad)
Re: BUG #15467: The database subdirectory "pg_tblspc/1932420460/PG_10_201707211/16400"is missing.
From
tsuraan
Date:
> Do you still get the error from just > > select oid from pg_database; > > or > > select oid,dattablespace from pg_database; That does work! root=# select oid,dattablespace from pg_database ; oid | dattablespace -------+--------------- 16402 | 1663 1 | 1663 12291 | 1663 16400 | 1932420460 12292 | 1663 16401 | 1663 (6 rows) > I just noticed: 1932420460 = 0x732E656C = "le.s" which is suspiciously > textual. > > Any chance you can do this query: > > select pg_relation_filepath('pg_database'::regclass); > > and then do hexdump -C on that file? Quite a bit there, but the region around there does look weird: 00001d40 00 00 00 00 00 00 00 00 00 01 00 00 ff ff ff ff |................| 00001d50 00 00 00 00 f4 06 00 00 01 00 00 00 6c 65 2e 73 |............le.s| 00001d60 6f 00 70 79 63 00 65 2e 00 00 00 00 00 00 00 00 |o.pyc.e.........| 00001d70 00 ff 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 00001d80 00 00 00 00 00 00 00 00 3a 00 00 ff 00 00 00 00 |........:.......| 00001d90 00 00 00 00 00 00 00 ff 00 00 00 00 00 00 00 00 |................| 00001da0 28 f8 12 00 00 00 00 00 00 00 00 00 00 00 00 00 |(...............| 00001db0 0d 00 0d 80 0a 29 20 00 00 00 00 00 03 30 00 00 |.....) ......0..| 00001dc0 74 65 6d 70 6c 61 74 65 30 00 00 00 00 00 00 00 |template0.......| 00001dd0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| The le.s that you noticed actually looks like le.so, and then there's a pyc and an "e." after it (null-separated). Also weird is that there are six different sections describing(?) template1: hexdump -C global/1262 |grep -B1 -A1 template 00001110 0b 00 0d 80 0a 29 20 00 00 00 00 00 01 00 00 00 |.....) .........| 00001120 74 65 6d 70 6c 61 74 65 31 00 00 00 00 00 00 00 |template1.......| 00001130 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| -- 00001240 0b 00 0d c0 0a 25 20 00 00 00 00 00 01 00 00 00 |.....% .........| 00001250 74 65 6d 70 6c 61 74 65 31 00 00 00 00 00 00 00 |template1.......| 00001260 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| -- 00001360 00 00 00 00 00 00 00 00 09 00 0d c0 0a 25 20 00 |.............% .| 00001370 00 00 00 00 01 00 00 00 74 65 6d 70 6c 61 74 65 |........template| 00001380 31 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |1...............| -- 00001480 08 00 0d c0 0a 25 20 00 00 00 00 00 01 00 00 00 |.....% .........| 00001490 74 65 6d 70 6c 61 74 65 31 00 00 00 00 00 00 00 |template1.......| 000014a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| -- 000015a0 00 00 00 00 00 00 00 00 07 00 0d c0 0a 25 20 00 |.............% .| 000015b0 00 00 00 00 01 00 00 00 74 65 6d 70 6c 61 74 65 |........template| 000015c0 31 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |1...............| -- 00001db0 0d 00 0d 80 0a 29 20 00 00 00 00 00 03 30 00 00 |.....) ......0..| 00001dc0 74 65 6d 70 6c 61 74 65 30 00 00 00 00 00 00 00 |template0.......| 00001dd0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| -- 00001ee0 06 00 0d c0 0a 25 20 00 00 00 00 00 01 00 00 00 |.....% .........| 00001ef0 74 65 6d 70 6c 61 74 65 31 00 00 00 00 00 00 00 |template1.......| 00001f00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| Also, of the two main databases in use, one of them has 8 different sections, and the other has 6. I'm not seeing anything like that on other systems, but maybe it's normal? I can paste in the entire hexdump if it's useful, but there's a whole ton of stuff there, so I'm leaving it out unless somebody really wants to see it.
Re: BUG #15467: The database subdirectory "pg_tblspc/1932420460/PG_10_201707211/16400" is missing.
From
Andrew Gierth
Date:
>>>>> "tsuraan" == tsuraan <tsuraan@gmail.com> writes: tsuraan> Quite a bit there, but the region around there does look weird: tsuraan> 00001d40 00 00 00 00 00 00 00 00 00 01 00 00 ff ff ff ff |................| tsuraan> 00001d50 00 00 00 00 f4 06 00 00 01 00 00 00 6c 65 2e 73 |............le.s| tsuraan> 00001d60 6f 00 70 79 63 00 65 2e 00 00 00 00 00 00 00 00 |o.pyc.e.........| tsuraan> 00001d70 00 ff 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| tsuraan> 00001d80 00 00 00 00 00 00 00 00 3a 00 00 ff 00 00 00 00 |........:.......| tsuraan> 00001d90 00 00 00 00 00 00 00 ff 00 00 00 00 00 00 00 00 |................| tsuraan> 00001da0 28 f8 12 00 00 00 00 00 00 00 00 00 00 00 00 00 |(...............| tsuraan> 00001db0 0d 00 0d 80 0a 29 20 00 00 00 00 00 03 30 00 00 |.....) ......0..| tsuraan> 00001dc0 74 65 6d 70 6c 61 74 65 30 00 00 00 00 00 00 00 |template0.......| tsuraan> 00001dd0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| tsuraan> The le.s that you noticed actually looks like le.so, and then tsuraan> there's a pyc and an "e." after it (null-separated). That's quite odd. But it looks like the data here is corrupt from offset 1d5c for the next 64 bytes (+/- 4 bytes, maybe, but much more likely to be exactly 64). That's not something I recall seeing before, but some kind of hardware corruption issue couldn't be ruled out. I don't suppose you have data checksums enabled? The "cache lookup failed" error is caused by the fact that the column _after_ dattablespace, i.e. datacl, is also corrupt (the database must have had a non-null ACL with an additional grant to exactly one role going by the layout of records). tsuraan> Also weird is that there are six different sections tsuraan> describing(?) template1: That's not really weird - the pg_database rows can get updated, e.g. from grant/revoke, and autovacuum usually won't kick in for a long time because the table is so tiny. So some number of old dead row versions is expected - likewise for your own db. -- Andrew (irc:RhodiumToad)