Thread: [HACKERS] Orphaned files in base/[oid]

[HACKERS] Orphaned files in base/[oid]

From: Chris Travers

Hi all;

I am trying to track down a problem we are seeing that looks very similar to bug #12050, and would certainly consider trying to contribute a fix if we agree on one.  (I am not sure we can, so absent that, the next question is whether it makes sense to create a utility to fix the problem when it comes up so that a dump/restore is not needed).

The system:
PostgreSQL 9.6.3
Gentoo Linux.

Problem:
The system this came up on is PostgreSQL 9.6.3 and has had repeated trouble with disk space.  Querying pg_database_size, as well as running du on the database's subdirectory of base/, shows total usage to be around 3.8TB.  Summing up the sizes of the relations in pg_class, though, shows around 2.1TB.
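For illustration, a query along these lines (using only the standard size functions) shows the gap; indexes and TOAST are counted with their parent tables by pg_total_relation_size(), so only tables, matviews, and sequences are summed:

-- Compare what the directory holds against what pg_class knows about.
SELECT pg_size_pretty(pg_database_size(current_database()))     AS database_size,
       pg_size_pretty(sum(pg_total_relation_size(oid))::bigint) AS total_of_relations
FROM pg_class
WHERE relkind IN ('r', 'm', 'S');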

Initial troubleshooting found around 150 GB of space in pg_temp which had never been cleared and was at least several days old.  Restarting the server cleared these up.

Poking around the base/[oid] directory, I found a large number of files which did not correspond with a pg_class entry.  One of the apparent relations was nearly 1TB in size.

What I think happened:
I think various pg_temp/* and orphaned relation files (in base/[oid]) were created when PostgreSQL crashed due to running out of space during various operations, including creating materialised views.

So my question is whether there is a way we can safely clean these up on server restart.  If not, does it make sense to try to create a utility that can connect to PostgreSQL, seek out valid files, and delete the rest?

--
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

Re: [HACKERS] Orphaned files in base/[oid]

From: Chris Travers


On Aug 14, 2017 14:12, "Chris Travers" <chris.travers@adjust.com> wrote:
> Hi all;
>
> I am trying to track down a problem we are seeing that looks very similar to bug #12050, and would certainly consider trying to contribute a fix if we agree on one.  (I am not sure we can, so absent that, the next question is whether it makes sense to create a utility to fix the problem when it comes up so that a dump/restore is not needed).
>
> The system:
> PostgreSQL 9.6.3
> Gentoo Linux.
>
> Problem:
> The system this came up on is PostgreSQL 9.6.3 and has had repeated trouble with disk space.  Querying pg_database_size, as well as du on the subdirectory of base/ show total usage to be around 3.8TB.  Summing up the size of the relations in pg_class though shows around 2.1TB.
>
> Initial troubleshooting found around 150 GB of space in pg_temp which had never been cleared and was at least several days old.  Restarting the server cleared these up.
>
> Poking around the base/[oid] directory, I found a large number of files which did not correspond with a pg_class entry.  One of the apparent relations was nearly 1TB in size.
>
> What I think happened:
> I think various pg_temp/* and orphaned relation files (In base/[oid]) were created when PostgreSQL crashed due to running out of space in various operations including creating materialised views.
>
> So my question is if there is a way we can safely clean these up on server restart?  If not does it make sense to try to create a utility that can connect to PostgreSQL, seek out valid files, and delete the rest?

OK, I have identified one case where the symptoms I am seeing can be reproduced.  I am currently working on a Mac, so there may be quirks in my repro.  However....

When the WAL writer runs out of disk space, no cleanup is done.

So I will be looking at possible solutions next.

--
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin


Re: [HACKERS] Orphaned files in base/[oid]

From: Andres Freund

Hi,

On 2017-08-14 14:12:22 +0200, Chris Travers wrote:
> Problem:
> The system this came up on is PostgreSQL 9.6.3 and has had repeated trouble
> with disk space.  Querying pg_database_size, as well as du on the
> subdirectory of base/ show total usage to be around 3.8TB.  Summing up the
> size of the relations in pg_class though shows around 2.1TB.
> 
> Initial troubleshooting found around 150 GB of space in pg_temp which had
> never been cleared and was at least several days old.  Restarting the
> server cleared these up.
> 
> Poking around the base/[oid] directory, I found a large number of files
> which did not correspond with a pg_class entry.  One of the apparent
> relations was nearly 1TB in size.
> 
> What I think happened:
> I think various pg_temp/* and orphaned relation files (In base/[oid]) were
> created when PostgreSQL crashed due to running out of space in various
> operations including creating materialised views.
> 
> So my question is if there is a way we can safely clean these up on server
> restart?  If not does it make sense to try to create a utility that can
> connect to PostgreSQL, seek out valid files, and delete the rest?

I think the fix here is to call RemovePgTempFiles() during
crash-restarts, instead of just full starts. The previously stated need
to be able to inspect temp files after a crash can be less impactfully
fulfilled with restart_after_crash = false.
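For reference, that setting can be changed without a full server restart; something like the following should be enough (restart_after_crash is a reloadable parameter):

ALTER SYSTEM SET restart_after_crash = off;  -- do not reinitialize automatically after a backend crash
SELECT pg_reload_conf();                     -- a reload picks up the change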

Greetings,

Andres Freund



Re: [HACKERS] Orphaned files in base/[oid]

From: Chris Travers


On Mon, Aug 14, 2017 at 6:33 PM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2017-08-14 14:12:22 +0200, Chris Travers wrote:
> Problem:
> The system this came up on is PostgreSQL 9.6.3 and has had repeated trouble
> with disk space.  Querying pg_database_size, as well as du on the
> subdirectory of base/ show total usage to be around 3.8TB.  Summing up the
> size of the relations in pg_class though shows around 2.1TB.
>
> Initial troubleshooting found around 150 GB of space in pg_temp which had
> never been cleared and was at least several days old.  Restarting the
> server cleared these up.
>
> Poking around the base/[oid] directory, I found a large number of files
> which did not correspond with a pg_class entry.  One of the apparent
> relations was nearly 1TB in size.
>
> What I think happened:
> I think various pg_temp/* and orphaned relation files (In base/[oid]) were
> created when PostgreSQL crashed due to running out of space in various
> operations including creating materialised views.
>
> So my question is if there is a way we can safely clean these up on server
> restart?  If not does it make sense to try to create a utility that can
> connect to PostgreSQL, seek out valid files, and delete the rest?

> I think the fix here is to call RemovePgTempFiles() during
> crash-restarts, instead of just full starts. The previously stated need
> to be able to inspect temp files after a crash can be less impactfully
> fulfilled with restart_after_crash = false.

But that only clears temp files, right?

I am less concerned about the temp files because a restart clears them.

The bigger issue I see is with the orphaned base files.  It looks like files in base/[oid] don't get cleaned up either, if I read my output correctly, and that would explain why we saw 1.7TB of discrepancy between relation sizes and database size.  Safety-wise it seems like the best way out of that is a dump/restore, but doing that with a 2.1TB database is annoying.
 



--
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

Re: [HACKERS] Orphaned files in base/[oid]

From: Tom Lane

Chris Travers <chris.travers@adjust.com> writes:
> On Mon, Aug 14, 2017 at 6:33 PM, Andres Freund <andres@anarazel.de> wrote:
>> I think the fix here is to call RemovePgTempFiles() during
>> crash-restarts, instead of just full starts. The previously stated need
>> to be able to inspect temp files after a crash can be less impactfully
>> fulfilled with restart_after_crash = false.

> But that only clears temp files right?
> I am less concerned about the temp files because a restart clears them.

It will clear temp files, and also temp tables.

> The bigger issue I see are with the orphaned base files.

It would be possible to have orphaned non-temp tables if you'd suffered
a crash during the transaction that created those tables.  Ordinarily
a newly-created table file wouldn't be that large, but if your workflow
created tables and shoved boatloads of data into them in the same
transaction, it's not so hard to see this becoming an issue.
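A sketch of that kind of workflow (the table and source names here are made up for illustration):

BEGIN;
-- The new table's relfilenode is created and filled on disk right away,
-- but its pg_class entry only becomes visible to others at COMMIT.
CREATE TABLE report_snapshot AS
    SELECT * FROM big_source_table;   -- hundreds of GB written into a brand-new file
-- A crash before COMMIT (e.g. a PANIC on out-of-WAL-space) can leave that
-- file behind with no catalog row pointing at it.
COMMIT;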

The core problem with zapping non-temp table files is that you can't
do that unless you're sure you have consistent, up-to-date pg_class
data that nobody else is busy adding to.  It's hard to see an external
application being able to do that safely.  You certainly can't do it
at the point in the postmaster startup cycle where we currently do
the other things --- for those, we rely only on filesystem naming
conventions to identify what to zap.
        regards, tom lane



Re: [HACKERS] Orphaned files in base/[oid]

From: Andres Freund

On 2017-08-14 14:40:46 -0400, Tom Lane wrote:
> The core problem with zapping non-temp table files is that you can't
> do that unless you're sure you have consistent, up-to-date pg_class
> data that nobody else is busy adding to.  It's hard to see an external
> application being able to do that safely.  You certainly can't do it
> at the point in the postmaster startup cycle where we currently do
> the other things --- for those, we rely only on filesystem naming
> conventions to identify what to zap.

I think there are some possibilities to close the gap here. We could
e.g. have <relfilenode>.delete_on_crash marker files that get installed
when creating a new persistent relfilenode. If we set up things so they
get deleted post commit, but inside the critical section, we could rely
on them being present in case of crash, but consistently removed during
WAL replay. At the end of recovery, iterate over the whole datadir and
nuke all relations with marker files present.

I first thought that'd cost an additional fsync per relation
created. But I think we actually can delay that to a pre-commit phase,
if we have XLOG_SMGR_CREATE create those markers via a flag, and fsync
them just before checkpoint (via the usual delayed fsync mechanism).
That'd still require an XLogFlush(), but that seems hard to avoid unless
we just don't create relations on FS level until buffers are
evicted and/or BufferSync().


Alternatively we could do something without marker files, with some
added complexity: Keep track of all "uncommitted new files" in memory,
and log them every checkpoint. Commit/abort records clear elements of
that list. Since we always start replay at the beginning of a
checkpoint, we'd always reach a moment with such an up2date list of
pending-action files before reaching end-of-recovery. At end-of-recovery
we can delete all unconfirmed files.  To avoid out-of-memory due to too
many tracked relations, we'd possibly still have to have marker files...

Regards,

Andres



Re: [HACKERS] Orphaned files in base/[oid]

From: Chris Travers


On Mon, Aug 14, 2017 at 8:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:


> It would be possible to have orphaned non-temp tables if you'd suffered
> a crash during the transaction that created those tables.  Ordinarily
> a newly-created table file wouldn't be that large, but if your workflow
> created tables and shoved boatloads of data into them in the same
> transaction, it's not so hard to see this becoming an issue.

I think the working theory is that these were very likely a number of very large (multi-hundred-GB) materialised views.

> The core problem with zapping non-temp table files is that you can't
> do that unless you're sure you have consistent, up-to-date pg_class
> data that nobody else is busy adding to.  It's hard to see an external
> application being able to do that safely.  You certainly can't do it
> at the point in the postmaster startup cycle where we currently do
> the other things --- for those, we rely only on filesystem naming
> conventions to identify what to zap.

Yeah, that occurred to me.  At this point I would settle for something I could run with Postgres in single-user mode, although that is very far from ideal.  So what I wonder is whether at least a short-term solution might be a utility that starts Postgres in single-user mode, with the requirement that PostgreSQL is otherwise not running during the run.

I am certainly not feeling qualified at present for the more advanced solutions, but that much I might be able to do.




--
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

Re: [HACKERS] Orphaned files in base/[oid]

From: Michael Paquier

On Tue, Aug 15, 2017 at 3:56 AM, Andres Freund <andres@anarazel.de> wrote:
> I think there are some possibilities to close the gap here. We could
> e.g. have <relfilenode>.delete_on_crash marker files that get installed
> when creating a new persistent relfilenode. If we set up things so they
> get deleted post commit, but inside the critical section, we could rely
> on them being present in case of crash, but consistently removed during
> WAL replay. At the end of recovery, iterate over the whole datadir and
> nuke all relations with marker files present.

I agree that an approach like that has value for the problem defined
in this thread.

> I first thought that'd cost an additional fsync per relation
> created. But I think we actually can delay that to a pre-commit phase,
> if we have XLOG_SMGR_CREATE create those markers via a flag, and fsync
> them just before checkpoint (via the usual delayed fsync mechanism).
> That'd still require an XLogFlush(), but that seems hard to avoid unless
> we just don't create relations on FS level until buffers are
> evicted and/or BufferSync().

Yeah, that should work as well.
-- 
Michael



Re: [HACKERS] Orphaned files in base/[oid]

From: Chris Travers

There's another side to this: I am not sure it is a backend crash.

Here is what I did to reproduce:

Two virtual disk images: 100 MB for main data, 40 MB for WAL.  work_mem set to 256 MB.  The idea is to test different out-of-space conditions.

Create table as ...; drop table; select pg_size_pretty(pg_current_xlog_location() - '0/0');
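Spelled out a bit more (the table name and row count here are illustrative; pg_current_xlog_location() is the 9.6 spelling):

CREATE TABLE filler AS
    SELECT g, repeat('x', 1000) AS pad
    FROM generate_series(1, 100000) g;   -- enough rows to put pressure on the small data and WAL volumes
DROP TABLE filler;
SELECT pg_size_pretty(pg_current_xlog_location() - '0/0') AS wal_written_so_far;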

I played around with parameters to determine how different kinds of out-of-space errors were handled.

1.  Running out of temp space was cleaned up without a server restart needed.
2.  A relation running out of disk space *seemed* to get cleaned up.
3.  Running out of WAL space left *both* temp and non-temp files behind.

I wonder about a different solution.  Would it be possible to special case vacuum to check for and remove (or just move to where they can be removed) files when vacuuming pg_class?  At the point we are vacuuming pg_class, we ought to be able to know that a relfilenode shouldn't be used anymore, right?

--
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

Re: [HACKERS] Orphaned files in base/[oid]

From: Tom Lane

Chris Travers <chris.travers@adjust.com> writes:
> I wonder about a different solution.  Would it be possible to special case
> vacuum to check for and remove (or just move to where they can be removed)
> files when vacuuming pg_class?  At the point we are vacuuming pg_class, we
> ought to be able to know that a relfilenode shouldn't be used anymore,
> right?

I don't think so.  It's not clear to me whether you have in mind "scan
pg_class, collect relfilenodes from all live tuples, then zap all files
not in that set" or "when removing a dead tuple, zap the relfilenode
it mentions".  But neither one works.  The first case has a race condition
against new pg_class entries.  As for the second, the existence of a dead
tuple bearing relfilenode N isn't evidence that some other live tuple
can't have relfilenode N.

Another problem for the second solution is that in the case you're worried
about (ie, PANIC due to out-of-WAL-space during relation's creating
transaction), there's no very good reason to expect that the relation's
pg_class tuple ever made it to disk at all.

A traditional low-tech answer to this has been to keep the WAL on a
separate volume from the main data store, so that it's protected from
out-of-space conditions in the main store and temp areas.  The space
needs for WAL proper are generally much more predictable than the main
store, so it's easier to keep the dedicated space from overflowing.
(Stalled replication/archiving processes can be hazardous to your
health in this scenario, though, if they prevent prompt recycling of
WAL files.)
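For what it's worth, one rough way to watch for that from SQL (9.6 directory naming, superuser required) is something along these lines:

SELECT count(*) AS segments,
       pg_size_pretty(sum((pg_stat_file('pg_xlog/' || name)).size)) AS wal_on_disk
FROM pg_ls_dir('pg_xlog') AS t(name)
WHERE name ~ '^[0-9A-F]{24}$';   -- WAL segment files only, skip .history/.backup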
        regards, tom lane



Re: [HACKERS] Orphaned files in base/[oid]

From: Chris Travers


On Tue, Aug 15, 2017 at 3:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Chris Travers <chris.travers@adjust.com> writes:
> I wonder about a different solution.  Would it be possible to special case
> vacuum to check for and remove (or just move to where they can be removed)
> files when vacuuming pg_class?  At the point we are vacuuming pg_class, we
> ought to be able to know that a relfilenode shouldn't be used anymore,
> right?

> I don't think so.  It's not clear to me whether you have in mind "scan
> pg_class, collect relfilenodes from all live tuples, then zap all files
> not in that set" or "when removing a dead tuple, zap the relfilenode
> it mentions".  But neither one works.  The first case has a race condition
> against new pg_class entries.  As for the second, the existence of a dead
> tuple bearing relfilenode N isn't evidence that some other live tuple
> can't have relfilenode N.

Ah, because if the file never made it onto disk, the number could be re-used.

> Another problem for the second solution is that in the case you're worried
> about (ie, PANIC due to out-of-WAL-space during relation's creating
> transaction), there's no very good reason to expect that the relation's
> pg_class tuple ever made it to disk at all.
>
> A traditional low-tech answer to this has been to keep the WAL on a
> separate volume from the main data store, so that it's protected from
> out-of-space conditions in the main store and temp areas.  The space
> needs for WAL proper are generally much more predictable than the main
> store, so it's easier to keep the dedicated space from overflowing.
> (Stalled replication/archiving processes can be hazardous to your
> health in this scenario, though, if they prevent prompt recycling of
> WAL files.)

Yeah, most of our dbs here have WAL on a separate volume, but not this system.  This system is also unusual in that disk usage varies wildly (and I am not 100% sure that this is the only case which causes it, though I can reproduce it consistently when the WAL writer runs out of disk space, with symptoms exactly matching what I found).

So for now that leaves my fallback approach as a way to fix it when I see it.

I have written a shell script which does the following:
1.  Starts Postgres in single user mode with a data directory, or dies (it won't run if Postgres seems to be already running).
2.  Gets the oid of the current database.
3.  Lists all files consisting of only digits in the base/[dboid] directory.
4.  Asks Postgres (in single user mode again) for all relfilenodes and oids of tables (in my testing both were required because relfilenode is not set for some system catalogs).
5.  Loops through the file nodes gathered, checks them against the relfilenode entries, and zaps $f, ${f}_*, and ${f}.*.  Currently, for testing, "zapping" means moving the files to a lostnfound folder for inspection after the script runs.  The logic here is not perfect and is very slightly under-inclusive, but better that than the other way.

Then we can start Postgres again.  I cannot find a better way to avoid race conditions, I guess.  At any rate, it sounds like preventing the problem more generally may be something beyond what I would feel comfortable trying to do as a patch at my current level of familiarity with the source code.

The full script is included inline below my signature in case it is of interest to anyone on the list.





--
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 

Saarbrücker Straße 37a, 10405 Berlin

---

#!/bin/bash
#
# Usage: cleanup_orphans.sh <datadir> <database>
#
# Moves apparently orphaned relation files out of base/<dboid> into ./lostnfound
# for later inspection.  PostgreSQL must not be running while this works; the
# script stops it and then uses single-user mode for the catalog queries.

datadir=$1
database=$2

pg_ctl -D "$datadir" stop

dboidfile="$PWD/cleanupdb.oid"
reloidfile="$PWD/relfilenodes.list"

# Ask a single-user backend for the OID of the target database.
echo "COPY (select oid from pg_database where datname = current_database()) TO '$dboidfile'" | postgres --single -D "$datadir" "$database" > /dev/null

if (($?))
then
   echo "FATAL: Could not start Postgres in single user mode"
   exit 1
fi

dboid=`cat "$dboidfile"`
# All-digit file names in base/<dboid> are candidate relfilenodes.
filenodes=`(cd "$datadir/base/$dboid"; ls [0-9]*[0-9] | grep -v '\.' | sort -n)`
#echo $filenodes

# Collect both relfilenodes and pg_class OIDs; mapped catalogs have
# relfilenode = 0, so the OID list is needed as well.
echo "COPY (select relfilenode from pg_class union select oid as relfilenode from pg_class) TO '$reloidfile'" | postgres --single -D "$datadir" "$database" > /dev/null
relfilenodes=`cat "$reloidfile"`
#echo $relfilenodes
if [[ -z "$relfilenodes" ]]
then
   echo "FATAL: did not get any relfilenodes"
   exit 2
fi

mkdir -p lostnfound
for f in $filenodes
do
  # Anything that matches neither a relfilenode nor an OID is moved aside,
  # along with its extra segments (<node>.1, <node>.2, ...) and fork files
  # (<node>_fsm, <node>_vm, ...).
  if [[ -z `echo $relfilenodes | grep -w $f` ]]
  then
      echo "moving $f to lostnfound"
      mv "$datadir/base/$dboid/$f" lostnfound
      mv "$datadir/base/$dboid/${f}_"* lostnfound 2> /dev/null
      mv "$datadir/base/$dboid/${f}."* lostnfound 2> /dev/null
  fi
done
rm "$dboidfile"
rm "$reloidfile"

Re: [HACKERS] Orphaned files in base/[oid]

From: Chris Travers



So having thought about this a bit more, and having had some real-world experience with the script now, I have an idea that might work, and some questions about making it succeed.

My thinking is to add a new form of vacuum called VACUUM FSCK.

This would:
1. Lock pg_class in exclusive mode (or do I need access exclusive?), as this is needed to solve the race conditions.  As I see it, this brings the database to a screeching halt concurrency-wise (but, unlike my script, it would allow other databases to be accessed normally).
2. Read the files whose names consist of only digits out of the filesystem and compare them with the oids and relfilenodes from pg_class (see the sketch after this list).
3. Unlink any file not found in that list, along with any files with that pattern followed by an underscore or period.
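Roughly, the comparison in steps 2 and 3 could be sketched in SQL like this (superuser only; purely a sketch of the check, not the cleanup itself):

SELECT f.file
FROM pg_ls_dir('base/' || (SELECT oid FROM pg_database
                           WHERE datname = current_database())) AS f(file)
WHERE f.file ~ '^[0-9]+$'                      -- all-digit names only
  AND f.file::oid NOT IN (SELECT relfilenode FROM pg_class
                          UNION
                          SELECT oid FROM pg_class);  -- mapped catalogs have relfilenode = 0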

This would mean that the following cases would not be handled:

If the first extent is gone but later extents are present, we check only for the first extent, and so would not see the later ones.  The same goes for visibility maps and other helper files.

If you add a file in the directory which has a name like 34F3A222BC, that would never get cleaned up because it contains non-digits.

So this leads to the following questions:

1.  Is locking pg_class enough to avoid race conditions?  Is exclusive mode sufficient, or do I need access exclusive mode?
2.  Would it be preferable to move the files to a directory rather than unlinking them?
3.  Should I perform any sort of check on the tables at the end to make sure everything is OK?

--
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

Re: [HACKERS] Orphaned files in base/[oid]

From: Robert Haas

On Mon, Aug 14, 2017 at 2:56 PM, Andres Freund <andres@anarazel.de> wrote:
> I think there are some possibilities to close the gap here. We could
> e.g. have <relfilenode>.delete_on_crash marker files that get installed
> when creating a new persistent relfilenode. If we set up things so they
> get deleted post commit, but inside the critical section, we could rely
> on them being present in case of crash, but consistently removed during
> WAL replay. At the end of recovery, iterate over the whole datadir and
> nuke all relations with marker files present.

At the risk of being predictable, I think we should add an undo
subsystem instead of continuing to create ad-hoc solutions to problems
like this.  (Of course, that's being worked on by Thomas, Amit, and
others.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Orphaned files in base/[oid]

From: Andres Freund

On 2017-08-16 14:20:02 +0200, Chris Travers wrote:
> So having throught about this a bit more, and having had some real-world
> experience with the script now, I have an idea that might work and some
> questions to make it succeed.
> 
> My thinking is to add a new form of vacuum called VACUUM FSCK.
> 
> This would:
> 1. lock pg_class in exclusive mode (or do I need exclusive access?), as
> this is needed to solve the race conditions.  As I see, this seems to bring
> the database to a screeching halt concurrency-wise (but unlike my script
> would allow other databases to be accessed normally).
> 2. read the files where the name consists of only digits out of the
> filesystem and compare with oids from pg_class and relfilenodes
> 3.  Any file not found in that list would then unlink it, as well as any
> files with the patter followed by an underscore or period.
> 
> This would mean that the following cases would not be handled:
> 
> If you have the first extent gone but later extents are present we check on
> the first extant, and so would not see the later ones.  Same goes for
> visibility maps and other helper files.
> 
> If you add a file in the directory which has a name like 34F3A222BC, that
> would never get cleaned up because it contains non-digits.
> 
> So this leads to the following questions:
> 
> 1.  Is locking pg_class enough to avoid race conditions?  Is exclusive mode
> sufficient or do I need exclusive access mode?
> 2.  would it be preferable to move the file to a directory rather than
> unlinking it?
> 3.  Should I perform any sort of check on the tables at the end to make
> sure everything is ok?

I think this is entirely the wrong approach.  We shouldn't add weird check commands that require locks on pg_class; we should avoid leaving the orphaned files in the first place.  I've outlined approaches for how to do so upthread.

Greetings,

Andres Freund



Re: [HACKERS] Orphaned files in base/[oid]

From: Chris Travers


On Wed, Aug 16, 2017 at 7:15 PM, Andres Freund <andres@anarazel.de> wrote:


> I think this entirely is the wrong approach.  We shouldn't add weird
> check commands that require locks on pg_class, we should avoid leaving
> the orphaned files in the first place.  I've upthread outlined
> approached how to do so.

And in the meantime I suppose there is no reason the approach I have outlined cannot be used by an external program.  I think it is sane to draw the line at having the db responsible for cleaning up after itself when something goes wrong, and leaving external programs to do after-the-fact review and cleanup.




--
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

Re: Orphaned files in base/[oid]

From: Alexey Gordeev

On 14.08.2017 23:56, Andres Freund wrote:
> Alternatively we could do something without marker files, with some
> added complexity: Keep track of all "uncommitted new files" in memory,
> and log them every checkpoint. Commit/abort records clear elements of
> that list. Since we always start replay at the beginning of a
> checkpoint, we'd always reach a moment with such an up2date list of
> pending-action files before reaching end-of-recovery. At end-of-recovery
> we can delete all unconfirmed files.  To avoid out-of-memory due to too
> many tracked relations, we'd possibly still have to have marker files...
> 

Hi, hackers.

I'm sorry, but I want to bump this thread, because there is still no good solution to the problem. I see there are a few threads with undo-based approaches, which look preferable but have some pitfalls. Is there any chance of returning to the non-undo approaches partially discussed here? What do you think about the following solutions?

1) Make `pendingDeletes` shared and let the postmaster clean up all the garbage when a child process dies. Cons: does not work if the postmaster itself dies; we would also have to take care of `pendingDeletes` pointer validity.

2) Catch and store all records carrying a relfilenode during WAL replay, and delete all orphaned nodes at the end of replay. Cons: the final delete may use an incomplete list of nodes, since files may have been created before the latest checkpoint. There is also a general opacity: we would remove something without a corresponding WAL record (or possibly do it in the wrong place in general).

3) This is close to the approach quoted above and a combination of the two above: `pendingDeletes` is shared, and each checkpoint creates a WAL record with a list of open transactions and created nodes. WAL replay can use this list as a base, adding nodes to it from newer records, so the final delete operation has a complete list of orphaned nodes. Cons: complexity(?). Others(?).

Can this work? Are any of these approaches still relevant?

I'm sorry for the last html-formatted message. Our web-based app is too 
smart.

-- 
Regards,
Alex Go, C developer
goa@arenadata.io, www.arenadata.tech