Thread: Re: [PATCHES] Cleaning up unreferenced table files

Re: [PATCHES] Cleaning up unreferenced table files

From: Heikki Linnakangas
Maybe we should take a different approach to the problem:

1. Create new file with an extension to mark that it's not
    yet committed (eg. 1234.notcommitted)
2. ...
3. Take CheckpointStartLock
4. Write commit record to WAL, with list of created files.
5. rename created file (1234.notcommitted -> 1234).
6. Release CheckpointStartLock

This would guarantee that after successful WAL replay, all files in the
data directory with .notcommitted extension can be safely deleted. No need
to read pg_database or pg_class.

We would take a performance hit because of the additional rename and fsync
step. Also, we must somehow make sure that the new file or the directory
it's in is fsynced on checkpoint to make sure that the rename is flushed
to disk.
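
To spell out what I mean at the plain POSIX level, here is a toy
standalone sketch of steps 1 and 5, plus the directory-flush worry.
This is not backend code; the file names and error handling are just
for illustration:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    /* step 1: create the file under its "not committed" name */
    int fd = open("1234.notcommitted", O_CREAT | O_WRONLY | O_EXCL, 0600);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* ... the transaction writes its data and fsyncs it here ... */
    if (fsync(fd) != 0)
        perror("fsync(file)");
    close(fd);

    /*
     * Steps 3 and 4 would go here: lock out checkpoints, then write and
     * flush the commit record listing the created files.
     */

    /* step 5: commit makes the file visible under its final name */
    if (rename("1234.notcommitted", "1234") != 0)
    {
        perror("rename");
        return 1;
    }

    /*
     * The worry mentioned above: the rename itself has to reach disk,
     * which means flushing the containing directory somehow, e.g. by
     * fsyncing a descriptor opened on it, where the platform allows that.
     */
    int dirfd = open(".", O_RDONLY);

    if (dirfd >= 0)
    {
        if (fsync(dirfd) != 0)
            perror("fsync(directory)");
        close(dirfd);
    }
    return 0;
}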

A variant of the scheme would be to create two files in step 1. One would
be the actual relfile (1234) and the other would be an empty marker file
(1234.notcommitted). That way the smgr code wouldn't have to care whether
the file is new or not when opening it.

- Heikki

On Thu, 5 May 2005, Tom Lane wrote:

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> Applied.
>
> Now that I've had a chance to look at it, this patch is thoroughly
> broken.  Problems observed in a quick review:
>
> 1. It doesn't work at all for non-default tablespaces: it will
> claim that every file in such a tablespace is stale.  The fact
> that it does that rather than failing entirely is accidental.
> It tries to read the database's pg_class in the target tablespace
> whether it's there or not.  Because the system is still in recovery
> mode, the low-level routines allow the access to the nonexistent
> pg_class table to pass --- in fact they think they should create
> the file, so after it runs there's a bogus empty "1259" file in each
> such tablespace (which of course it complains about, too).  The code
> then proceeds to think that pg_class is empty so of course everything
> draws a warning.
>
> 2. It's not robust against stale subdirectories of a tablespace
> (ie, subdirs corresponding to a nonexistent database) --- again,
> it'll try to read a nonexistent pg_class.  Then it'll produce a
> bunch of off-target complaint messages.
>
> 3. It's assuming that relfilenode is unique database-wide, when no
> such assumption is safe.  We only have a guarantee that it's unique
> tablespace-wide.
>
> 4. It fails to examine table segment files (such as "nnn.1").  These
> should be complained of when the "nnn" doesn't match any hash entry.
>
> 5. It will load every relfilenode value in pg_class into the hashtable
> whether it's meaningful or not.  There should be a check on relkind.
>
> 6. I don't think relying on strtol to decide if a filename is entirely
> numeric is very safe.  Note all the extra defenses in pg_atoi against
> various platform-specific misbehaviors of strtol.  Personally I'd use a
> strspn test instead.
>
> 7. There are no checks for readdir failure (compare any other readdir
> loop in the backend).
>
> See also Simon Riggs' complaints that the circumstances under which it's
> done are pretty randomly selected.  (One particular thing that I think
> is a bad idea is to do this in a standalone backend.  Any sort of
> corruption in any db's pg_class would render it impossible to start up.)
>
> To fix the first three problems, and also avoid the performance problem
> of multiply rescanning a database's pg_class for each of its
> tablespaces, I would suggest that the hashtable entries be widened to
> RelFileNode structs (ie, db oid, tablespace oid, relfilenode oid).  Then
> there should be one iteration over pg_database to learn the OIDs and
> default tablespaces of each database; with that you can read pg_class
> from its correct location for each database and load all the entries
> into the hashtable.  Then you iterate through the tablespaces looking
> for stuff not present in the hashtable.  You might also want to build a
> list or hashtable of known database OIDs, so that you can recognize a
> stale subdirectory immediately and issue a direct complaint about it
> without even recursing into it.
>
>             regards, tom lane
>


Re: [PATCHES] Cleaning up unreferenced table files

From: Tom Lane
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> Maybe we should take a different approach to the problem:
> 1. Create new file with an extension to mark that it's not
>     yet committed (eg. 1234.notcommitted)

This is pushing the problem into the wrong place, viz the lowest-level
file access routines, which will now all have to know about
.notcommitted status.  It also creates race conditions --- think about
backend A trying to commit file 1234 at about the same time that
backend B is trying to flush some dirty buffers belonging to that file.
But most importantly, it doesn't handle the file-deletion case.

            regards, tom lane

Re: [PATCHES] Cleaning up unreferenced table files

From: Heikki Linnakangas
On Sat, 7 May 2005, Tom Lane wrote:

> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> Maybe we should take a different approach to the problem:
>> 1. Create new file with an extension to mark that it's not
>>     yet committed (eg. 1234.notcommitted)
>
> This is pushing the problem into the wrong place, viz the lowest-level
> file access routines, which will now all have to know about
> .notcommitted status.  It also creates race conditions --- think about
> backend A trying to commit file 1234 at about the same time that
> backend B is trying to flush some dirty buffers belonging to that file.

True. With the rename variant, it might indeed get messy.

Consider the variant with extra marker files. In that case, backend B 
doesn't have to know about the .notcommitted status to flush the buffers.

> But most importantly, it doesn't handle the file-deletion case.

File-deletions are easy to handle. Just write the list of pending 
deletions to WAL on commit.

To recap, we have two slightly different scenarios:

A) Delete a file, write the commit record, crash
B) Create a file, crash

Just WAL logging the deletions on commit would take care of A. The 
.notcommitted mechanism would take care of B.
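
As a sketch only of case A (the record layout and names below are
invented, not the existing commit-record format), the commit record
would just carry the files to unlink once the commit itself is safely
on disk:

#include <stdio.h>
#include <unistd.h>

typedef struct PendingDeleteRec
{
    int         nfiles;        /* number of entries in paths[] */
    const char *paths[16];     /* relation files to unlink at commit */
} PendingDeleteRec;

static void
apply_pending_deletes(const PendingDeleteRec *rec)
{
    int i;

    for (i = 0; i < rec->nfiles; i++)
    {
        /* the commit record is already flushed, so unlinking is safe */
        if (unlink(rec->paths[i]) != 0)
            perror(rec->paths[i]);
    }
}

During replay, seeing the commit record and applying the same list would
take care of a deletion that happened just before a crash.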

- Heikki


Re: [PATCHES] Cleaning up unreferenced table files

From: Tom Lane
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> Consider the variant with extra marker files. In that case, backend B 
> doesn't have to know about the .notcommitted status to flush the buffers.

[ shrug ]  It's still broken, and the reason is that there's no
equivalent of fsync for directory operations.  Consider:

    A creates 1234 and 1234.notcommitted
    A commits
    B performs a checkpoint
    crash

all before A manages to delete 1234.notcommitted, or at least before
that deletion has made its way to disk.  Upon restart, only WAL
events after the checkpoint will be replayed, so 1234.notcommitted
doesn't go away, and then you've got a problem.

To fix this there would need to be a way (1) for B to be aware of the
pending file deletion and (2) for B to delay committing the checkpoint
until the directory update is surely down on disk.  Your proposal
doesn't provide for (1), and even if we fixed that, I know of no
portable kernel API for (2).  fsync isn't applicable.

While your original patch is buggy, it's at least fixable and has
localized, limited impact.  I don't think these schemes are safe
at all --- they put a great deal more weight on the semantics of
the filesystem than I care to do.

        regards, tom lane


Re: [PATCHES] Cleaning up unreferenced table files

From: Greg Stark
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Heikki Linnakangas <hlinnaka@iki.fi> writes:
> > Consider the variant with extra marker files. In that case, backend B 
> > doesn't have to know about the .notcommitted status to flush the buffers.
> 
> [ shrug ]  It's still broken, and the reason is that there's no
> equivalent of fsync for directory operations.  Consider

Traditionally that's because directory operations were always synchronous,
and hence didn't need to be fsynced. I think this is still true; other
systems, qmail's maildir for example, still depend on it.

-- 
greg



Re: [PATCHES] Cleaning up unreferenced table files

From: Tom Lane
Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> [ shrug ]  It's still broken, and the reason is that there's no
>> equivalent of fsync for directory operations.  Consider

> Traditionally that's because directory operations were always
> synchronous, and hence didn't need to be fsynced.

That might be true with respect to the process requesting the directory
operation ... but I think you missed the point entirely.

        regards, tom lane


Re: [PATCHES] Cleaning up unreferenced table files

From: Heikki Linnakangas
On Sun, 8 May 2005, Tom Lane wrote:

> While your original patch is buggy, it's at least fixable and has
> localized, limited impact.  I don't think these schemes are safe
> at all --- they put a great deal more weight on the semantics of
> the filesystem than I care to do.

I'm going to try this some more, because I feel that a scheme like this 
that doesn't rely on scanning pg_class and the file system would in fact 
be safer.

The key is to A) obey the "WAL first" rule, and B) remember information 
about file creations over a checkpoint. The problem with my previous 
suggestion was that it didn't reliably accomplish either :).

Right now we break the WAL-first rule because the file creation is recorded 
only after the file is created, and the record is not flushed.

The trivial way to fix that is to write and flush the xlog record before 
actually creating the file. (For a more optimized way to do it, see the 
end of this message.) Then we could trust that there aren't any files in the data 
directory that don't have a corresponding record in WAL.
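
As a toy, self-contained illustration of that ordering (not the backend's
xlog code; the "toy_wal" file and its format are made up), the point is
just: log the intent, flush the log, and only then create the file:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    const char *logline = "CREATE FILE 1234\n";
    int         logfd;
    int         fd;

    /* 1. record the creation in the "WAL" ... */
    logfd = open("toy_wal", O_CREAT | O_WRONLY | O_APPEND, 0600);
    if (logfd < 0)
    {
        perror("open toy_wal");
        return 1;
    }
    if (write(logfd, logline, strlen(logline)) < 0)
        perror("write toy_wal");

    /* 2. ... and flush it before the file exists on disk */
    if (fsync(logfd) != 0)
        perror("fsync toy_wal");
    close(logfd);

    /* 3. only now create the relation file itself */
    fd = open("1234", O_CREAT | O_WRONLY | O_EXCL, 0600);
    if (fd < 0)
    {
        perror("open 1234");
        return 1;
    }
    close(fd);
    return 0;
}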

But that's not enough. If a checkpoint occurs after the file is 
created, but before the transaction ends, WAL replay doesn't see the file 
creation record. That's why we need a mechanism to carry the information 
over the checkpoint.

We could do that by extending the ForwardFsyncRequest function or by
creating something similar to that. When a backend writes the file 
creation WAL record, it also sends a message to the bgwriter that says 
"I'm xid 1234, and I have just created file foobar/1234" (while holding 
CheckpointStartLock). Bgwriter keeps a list of xid/file pairs like it 
keeps a list of pending fsync operations. On checkpoint, the checkpointer 
scans the list and removes entries for transactions that have already 
ended, and attaches the remaining list to the checkpoint record.
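
To make the bookkeeping concrete, here's a rough sketch of what the
bgwriter could keep. The struct, the fixed sizes and the function names
are all invented; the real thing would presumably sit next to the
pending-fsync list:

#include <stdbool.h>
#include <stdio.h>

#define MAX_PENDING_CREATES 1024

typedef unsigned int TransactionId;     /* stand-in for the real typedef */

typedef struct PendingCreate
{
    TransactionId xid;                  /* creating transaction */
    char          path[64];             /* e.g. "foobar/1234" */
} PendingCreate;

static PendingCreate pending[MAX_PENDING_CREATES];
static int           npending = 0;

/* a backend reported "I'm xid X and I have just created file P" */
static void
remember_pending_create(TransactionId xid, const char *path)
{
    if (npending < MAX_PENDING_CREATES)
    {
        pending[npending].xid = xid;
        snprintf(pending[npending].path, sizeof(pending[npending].path),
                 "%s", path);
        npending++;
    }
}

/*
 * At checkpoint: drop entries whose transactions have already ended,
 * then attach whatever survives to the checkpoint record.
 */
static void
prune_pending_creates(bool (*xact_still_running) (TransactionId))
{
    int         i;
    int         keep = 0;

    for (i = 0; i < npending; i++)
    {
        if (xact_still_running(pending[i].xid))
            pending[keep++] = pending[i];
    }
    npending = keep;
}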

WAL replay would start with the xid/file list in the checkpoint record, 
and update it during the replay whenever a file creation or a transaction 
commit/rollback record is seen. On a rollback record, files created by 
that transaction are deleted. At the end of WAL replay, the files that are 
left in the list belong to transactions that implicitly aborted, and can 
be deleted.
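
Again only as a sketch (the record tags and helper names are invented,
and the helpers are merely declared here so the snippet stands alone;
they would operate on the list sketched above):

typedef unsigned int TransactionId;     /* same stand-in as above */

typedef enum ReplayRecordType
{
    RECORD_FILE_CREATE,
    RECORD_XACT_COMMIT,
    RECORD_XACT_ABORT
} ReplayRecordType;

void remember_pending_create(TransactionId xid, const char *path);
void forget_pending_creates(TransactionId xid);
void unlink_pending_creates(TransactionId xid);

static void
replay_one_record(ReplayRecordType type, TransactionId xid, const char *path)
{
    switch (type)
    {
        case RECORD_FILE_CREATE:
            remember_pending_create(xid, path);
            break;
        case RECORD_XACT_COMMIT:
            /* the transaction's files are now permanent; stop tracking them */
            forget_pending_creates(xid);
            break;
        case RECORD_XACT_ABORT:
            /* rolled back: its files can be unlinked right away */
            unlink_pending_creates(xid);
            break;
    }
}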

If we don't want to extend the checkpoint record, a separate WAL record 
works too.

Now, the more optimized way to do A:

Delay the actual file creation until it's first written to. The write 
needs to be WAL logged anyway, so we would just piggyback on that.

Implemented this way, I don't think there would be a significant 
performance hit from the scheme. We would create more ForwardFsyncRequest 
traffic, but not much compared to the block fsync requests we have right 
now.

BTW: If we allowed mdopen to create the file if it doesn't exist already, 
would we need the current file creation xlog record for anything? (I'm 
not suggesting we do that, just trying to get more insight.)

- Heikki


Re: [PATCHES] Cleaning up unreferenced table files

From: Bruce Momjian
Heikki Linnakangas wrote:
> On Sun, 8 May 2005, Tom Lane wrote:
> 
> > While your original patch is buggy, it's at least fixable and has
> > localized, limited impact.  I don't think these schemes are safe
> > at all --- they put a great deal more weight on the semantics of
> > the filesystem than I care to do.
> 
> I'm going to try this some more, because I feel that a scheme like this 
> that doesn't rely on scanning pg_class and the file system would in fact 
> be safer.

The current code is nice and localized and doesn't add any burden on our
existing code, which is already complicated enough.  I think we either
fix checkfiles.c, or we remove it and decide it isn't worth checking for
unreferenced files.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: [PATCHES] Cleaning up unreferenced table files

From: Tom Lane
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On Sun, 8 May 2005, Tom Lane wrote:
>> While your original patch is buggy, it's at least fixable and has
>> localized, limited impact.  I don't think these schemes are safe
>> at all --- they put a great deal more weight on the semantics of
>> the filesystem than I care to do.

> I'm going to try this some more, because I feel that a scheme like this 
> that doesn't rely on scanning pg_class and the file system would in fact 
> be safer.

I think this proposal is getting more and more invasive and expensive,
and it's all to solve a problem that we do not even know is worth
spending any time on.  I *really* think this is the wrong direction
to take.  Aside from the required effort and risk of breaking things,
the original patch incurred cost only during crash recovery; this is
pushing costs into the normal code paths.

> Delay the actual file creation until it's first written to. The write 
> needs to be WAL logged anyway, so we would just piggyback on that.

This is a bad idea since by then it's (potentially) too late to roll
back the creating transaction if the creation fails.  Consider for
instance a tablespace directory that's mispermissioned read-only, or
some such.  I'd rather have the CREATE TABLE fail than a later INSERT.
(Admittedly, we can't completely guarantee that an INSERT won't hit
some kind of filesystem-level problem, but it's still something to
try to avoid.)

Also, the "first write" actually comes from mdextend, which is not a
WAL-logged operation AFAIR.  Some rethinking of that would be necessary
before this would have any chance of working.

        regards, tom lane


Re: [PATCHES] Cleaning up unreferenced table files

From: Heikki Linnakangas
On Tue, 10 May 2005, Bruce Momjian wrote:

> The current code is nice and localized and doesn't add any burden on our
> existing code, which is already complicated enough.  I think we either
> fix checkfiles.c, or we remove it and decide it isn't worth checking for
> unreferenced files.

Let's pull the patch for now.

I still think we should check for unreferenced files, but it needs some 
more thought and it's by no means urgent.

If I have time later on, I might write a patch to implement the WAL-based 
scheme to see how much change to existing code it really would need.

- Heikki


Re: [PATCHES] Cleaning up unreferenced table files

From: Bruce Momjian
Heikki Linnakangas wrote:
> On Tue, 10 May 2005, Bruce Momjian wrote:
> 
> > The current code is nice and localized and doesn't add any burden on our
> > existing code, which is already complicated enough.  I think we either
> > fix checkfiles.c, or we remove it and decide it isn't worth checking for
> > unreferenced files.
> 
> Let's pull the patch for now.

Removed, and TODO item restored.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: [PATCHES] Cleaning up unreferenced table files

From: Tom Lane
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On Tue, 10 May 2005, Bruce Momjian wrote:
>> The current code is nice and localized and doesn't add any burden on our
>> existing code, which is already complicated enough.  I think we either
>> fix checkfiles.c, or we remove it and decide it isn't worth checking for
>> unreferenced files.

> Let's pull the patch for now.

FWIW, I was OK with the idea of adding something similar to the given
patch to find out whether we had a problem or not.  With sufficient
evidence that lost files are a big problem, I'd be in favor of a
mechanism of the kind proposed in Heikki's latest messages.  The
disconnect for me at the moment is that there's no evidence to justify
that amount of effort/risk.  A startup-time patch would have provided
that evidence, or else have proven that it's not worth spending more
time on.

        regards, tom lane


Re: [PATCHES] Cleaning up unreferenced table files

From: Bruce Momjian
Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
> > On Tue, 10 May 2005, Bruce Momjian wrote:
> >> The current code is nice and localized and doesn't add any burden on our
> >> existing code, which is already complicated enough.  I think we either
> >> fix checkfiles.c, or we remove it and decide it isn't worth checking for
> >> unreferenced files.
> 
> > Let's pull the patch for now.
> 
> FWIW, I was OK with the idea of adding something similar to the given
> patch to find out whether we had a problem or not.  With sufficient
> evidence that lost files are a big problem, I'd be in favor of a
> mechanism of the kind proposed in Heikki's latest messages.  The
> disconnect for me at the moment is that there's no evidence to justify
> that amount of effort/risk.  A startup-time patch would have provided
> that evidence, or else have proven that it's not worth spending more
> time on.

Agreed.  Imagine a backend creates a table file and then the operating
system crashes before WAL has been fsync'ed; there is no way WAL replay
can discover that unreferenced file.

While I think WAL can correct some cases, I don't think it can correct
them all, so it seems it is necessary to check the file system against
pg_class to catch all the cases.  The transaction and file system
semantics are just different and need to be checked against each other.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073