Re: Unused files in the database directory after crashed VACUUM FULL - Mailing list pgsql-general
From: Hannes Erven <hannes@erven.at>
Subject: Re: Unused files in the database directory after crashed VACUUM FULL
Msg-id: 4ae62ff0-f33e-2a26-79ff-dcaa39ee92ff@erven.at
In response to: Re: Unused files in the database directory after crashed VACUUM FULL (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Unused files in the database directory after crashed VACUUM FULL (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-general
Hi again,

On 10.02.19 at 16:41, Tom Lane wrote:
> Hannes Erven <hannes@erven.at> writes:
>> I've just had a "VACUUM FULL <table>" crash due to 100% disk usage.
>> Clearly my fault, I was expecting the new table to be small enough.
>
> What do you mean by "crash" exactly?  A normal transactional failure
> should've cleaned up orphaned files.  I suppose if the kernel decided to
> kill -9 the vacuum process, that wouldn't happen --- but that's not
> the usual response to out-of-disk-space.

Here's the exact log (the pgadmin process was running the VACUUM FULL):

2019-02-09 23:44:53.516 CET [20341] @/ <> PANIC:  could not write to file "pg_wal/xlogtemp.20341": No space left on device
2019-02-09 23:44:53.516 CET [19181] @/ <> LOG:  WAL writer process (PID 20341) was terminated by signal 6: Aborted
2019-02-09 23:44:53.516 CET [19181] @/ <> LOG:  terminating any other active server processes
2019-02-09 23:44:53.516 CET [20816] postgres@IP/db <pgAdmin 4 - DB:db> WARNING:  terminating connection because of crash of another server process
2019-02-09 23:44:53.516 CET [20816] postgres@IP/db <pgAdmin 4 - DB:db> DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2019-02-09 23:44:53.516 CET [20816] postgres@IP/db <pgAdmin 4 - DB:db> HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2019-02-09 23:44:53.597 CET [19181] @/ <> LOG:  all server processes terminated; reinitializing
2019-02-09 23:44:53.679 CET [11686] @/ <> LOG:  database system was interrupted; last known up at 2019-02-09 23:44:40 CET
2019-02-09 23:44:58.761 CET [11686] @/ <> LOG:  database system was not properly shut down; automatic recovery in progress
2019-02-09 23:44:58.807 CET [11686] @/ <> LOG:  redo starts at 1DA/B3400798
2019-02-09 23:44:58.811 CET [11686] @/ <> FATAL:  could not extend file "base/16400/612716_fsm": No space left on device
2019-02-09 23:44:58.811 CET [11686] @/ <> HINT:  Check free disk space.
2019-02-09 23:44:58.811 CET [11686] @/ <> CONTEXT:  WAL redo at 1DA/B3406310 for Heap/INSERT: off 4

After freeing space and restarting:

2019-02-10 06:28:25.622 CET [14344] @/ <> LOG:  database system was interrupted while in recovery at 2019-02-09 23:44:58 CET
2019-02-10 06:28:25.622 CET [14344] @/ <> HINT:  This probably means that some data is corrupted and you will have to use the last backup for recovery.
2019-02-10 06:28:26.762 CET [14344] @/ <> LOG:  database system was not properly shut down; automatic recovery in progress
2019-02-10 06:28:26.831 CET [14344] @/ <> LOG:  redo starts at 1DA/B3400798
2019-02-10 06:28:27.624 CET [14344] @/ <> LOG:  redo done at 1DA/D5FFF9D8

>> Is there a safe procedure how to
>> check/clean up "unnecessary" files in the cluster directory?
>
> You just described it --- verify with pg_filenode_relation that the
> file doesn't correspond to any pg_class entry, then manually delete.

OK, I see.
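For a single file that check is quick; a minimal sketch, using one of the leftover filenodes that turn up in the listing below (the first argument is the tablespace OID, with 0 standing for the database's default tablespace):

-- NULL means no relation in the current database maps to this filenode
select pg_filenode_relation(0, 612713);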
To check the whole directory at once, I came up with this:

with d0 as (
    select oid from pg_database where datname = current_database()
),
d1 as (
    select pg_ls_dir as fn,
           regexp_match(pg_ls_dir, '^([0-9]+)(.*)$') as match
    from d0, pg_ls_dir('base/'||d0.oid)
    order by 1
),
d2 as (
    select fn, match[1] as base, match[2] as ext
    from d1
    where (fn NOT ILIKE 'pg_%')
),
d3 as (
    select d.*, pg_filenode_relation(0, base::oid) as relation
    from d2 d
)
select fn,
       pg_size_pretty((pg_stat_file('base/'||d0.oid||'/'||fn)).size)
from d0, d3
where relation is null;

     fn     | pg_size_pretty
------------+----------------
 612713     | 25 MB
 612718     | 1024 MB
 612718.1   | 1024 MB
 612718.2   | 180 MB
 t11_621351 | 96 kB
 t11_621354 | 0 bytes
 t11_621356 | 8192 bytes
 t11_621357 | 160 kB
 t11_621360 | 0 bytes
 t11_621362 | 8192 bytes

So there's still 612713 and 612718* left to clean up; judging from the numbers and timestamps, they probably belong to other VACUUM FULLs I manually canceled in the pgadmin UI during my experiments.

Is this interesting enough to investigate further? If so, I'll be happy to work on a test case.

In any case, I've now learned (how) to check for orphaned files when a database cluster's on-disk size is way off from the sum of pg_total_relation_size() + WAL...


Thanks again & best regards,

	-hannes
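PS: by "way off" I mean a gap showing up in a comparison like this one (just a sketch, per database and leaving WAL out of the picture; indexes and TOAST are already counted in their parent table's pg_total_relation_size, so only plain relations are summed to avoid double counting):

-- database size on disk vs. what its relations account for
select pg_size_pretty(pg_database_size(current_database()))       as on_disk,
       pg_size_pretty(sum(pg_total_relation_size(c.oid))::bigint) as relations
from pg_class c
where c.relkind in ('r', 'm', 'S');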