
From Hannes Erven
Subject Re: Unused files in the database directory after crashed VACUUM FULL
Date
Msg-id 4ae62ff0-f33e-2a26-79ff-dcaa39ee92ff@erven.at
In response to Re: Unused files in the database directory after crashed VACUUM FULL  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Unused files in the database directory after crashed VACUUM FULL  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
Hi again,


Am 10.02.19 um 16:41 schrieb Tom Lane:
> Hannes Erven <hannes@erven.at> writes:
>> I've just had a "VACUUM FULL <table>" crash due to 100% disk usage.
>> Clearly my fault, I was expecting the new table to be small enough.
> 
> What do you mean by "crash" exactly?  A normal transactional failure
> should've cleaned up orphaned files.  I suppose if the kernel decided to
> kill -9 the vacuum process, that wouldn't happen --- but that's not
> the usual response to out-of-disk-space.


Here's the exact log (the pgAdmin process was running the VACUUM FULL):

2019-02-09 23:44:53.516 CET [20341] @/ <> PANIC:  could not write to file "pg_wal/xlogtemp.20341": No space left on device
2019-02-09 23:44:53.516 CET [19181] @/ <> LOG:  WAL writer process (PID 20341) was terminated by signal 6: Aborted
2019-02-09 23:44:53.516 CET [19181] @/ <> LOG:  terminating any other active server processes
2019-02-09 23:44:53.516 CET [20816] postgres@IP/db <pgAdmin 4 - DB:db> WARNING:  terminating connection because of crash of another server process
2019-02-09 23:44:53.516 CET [20816] postgres@IP/db <pgAdmin 4 - DB:db> DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2019-02-09 23:44:53.516 CET [20816] postgres@IP/db <pgAdmin 4 - DB:db> HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2019-02-09 23:44:53.597 CET [19181] @/ <> LOG:  all server processes terminated; reinitializing
2019-02-09 23:44:53.679 CET [11686] @/ <> LOG:  database system was interrupted; last known up at 2019-02-09 23:44:40 CET
2019-02-09 23:44:58.761 CET [11686] @/ <> LOG:  database system was not properly shut down; automatic recovery in progress
2019-02-09 23:44:58.807 CET [11686] @/ <> LOG:  redo starts at 1DA/B3400798
2019-02-09 23:44:58.811 CET [11686] @/ <> FATAL:  could not extend file "base/16400/612716_fsm": No space left on device
2019-02-09 23:44:58.811 CET [11686] @/ <> HINT:  Check free disk space.
2019-02-09 23:44:58.811 CET [11686] @/ <> CONTEXT:  WAL redo at 1DA/B3406310 for Heap/INSERT: off 4


After freeing space and restarting:

2019-02-10 06:28:25.622 CET [14344] @/ <> LOG:  database system was interrupted while in recovery at 2019-02-09 23:44:58 CET
2019-02-10 06:28:25.622 CET [14344] @/ <> HINT:  This probably means that some data is corrupted and you will have to use the last backup for recovery.
2019-02-10 06:28:26.762 CET [14344] @/ <> LOG:  database system was not properly shut down; automatic recovery in progress
2019-02-10 06:28:26.831 CET [14344] @/ <> LOG:  redo starts at 1DA/B3400798
2019-02-10 06:28:27.624 CET [14344] @/ <> LOG:  redo done at 1DA/D5FFF9D8


>> Is there a safe procedure how to
>> check/clean up "unnecessary" files in the cluster directory?
> 
> You just described it --- verify with pg_filenode_relation that the
> file doesn't correspond to any pg_class entry, then manually delete.
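
OK, I see; so for a single file the check boils down to one call, with 0
standing for the default tablespace (a NULL result means no pg_class entry
owns that filenode):

select pg_filenode_relation(0, 612718);  -- one of the leftovers listed below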


To check the whole directory in one go, I came up with this:

with d0 as (
    -- OID of the current database = name of its directory under base/
    select oid from pg_database where datname = current_database()
), d1 as (
    -- list the files in that directory and capture the leading filenode number
    select pg_ls_dir as fn,
           regexp_match(pg_ls_dir, '^([0-9]+)(.*)$') as match
    from d0, pg_ls_dir('base/'||d0.oid)
    order by 1
), d2 as (
    select fn, match[1] as base, match[2] as ext
    from d1
    where (fn NOT ILIKE 'pg_%')
), d3 as (
    -- look up the relation owning each filenode (0 = default tablespace);
    -- files with no matching pg_class entry come back as NULL
    select d.*, pg_filenode_relation(0, base::oid) as relation
    from d2 d
)
select fn, pg_size_pretty((pg_stat_file('base/'||d0.oid||'/'||fn)).size)
from d0, d3
where relation is null;



      fn     | pg_size_pretty
------------+----------------
  612713     | 25 MB
  612718     | 1024 MB
  612718.1   | 1024 MB
  612718.2   | 180 MB
  t11_621351 | 96 kB
  t11_621354 | 0 bytes
  t11_621356 | 8192 bytes
  t11_621357 | 160 kB
  t11_621360 | 0 bytes
  t11_621362 | 8192 bytes


So there are still 612713 and 612718* left to clean up; judging from the 
numbers and timestamps, they probably belong to other VACUUM FULLs I 
manually canceled in the pgAdmin UI during my experiments.
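
For the record, the timestamps I'm referring to are the files' modification 
times, which pg_stat_file() exposes as well; e.g. something like this shows 
them for the two big leftovers:

select fn,
       (pg_stat_file('base/'||d.oid||'/'||fn)).modification
from pg_database d,
     unnest(array['612713', '612718']) as fn
where d.datname = current_database();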

Is this interesting enough to investigate further? If so, I'll be happy 
to work on a test case.
In any case, I've now learned how to check for orphaned files when a 
database cluster's on-disk size is way off from the sum of 
pg_total_relation_size() + WAL...
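
To make that comparison a bit more concrete next time, something along 
these lines should do (just a sketch: indexes and TOAST are already counted 
with their owning table, and WAL lives outside base/ anyway):

-- database size vs. everything pg_class accounts for
select pg_size_pretty(pg_database_size(current_database()))       as database_size,
       pg_size_pretty(sum(pg_total_relation_size(c.oid))::bigint) as relations_total
from pg_class c
where c.relkind in ('r', 'm', 'S');  -- tables, matviews, sequences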


Thanks again & best regards,

    -hannes



