Re: [PATCHES] Cleaning up unreferenced table files - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: [PATCHES] Cleaning up unreferenced table files
Msg-id Pine.OSF.4.61.0505102211560.368341@kosh.hut.fi
In response to Re: [PATCHES] Cleaning up unreferenced table files  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [PATCHES] Cleaning up unreferenced table files  (Bruce Momjian <pgman@candle.pha.pa.us>)
Re: [PATCHES] Cleaning up unreferenced table files  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Sun, 8 May 2005, Tom Lane wrote:

> While your original patch is buggy, it's at least fixable and has
> localized, limited impact.  I don't think these schemes are safe
> at all --- they put a great deal more weight on the semantics of
> the filesystem than I care to do.

I'm going to try this some more, because I feel that a scheme like this 
that doesn't rely on scanning pg_class and the file system would in fact 
be safer.

The key is to A) obey the "WAL first" rule, and B) remember information 
about file creations across a checkpoint. The problem with my previous 
suggestion was that it didn't reliably accomplish either :).

Right now we break the WAL rule because the file creation is recorded 
after the file is created, and the record is not flushed.

The trivial way to fix that is to write and flush the xlog record before 
actually creating the file. (for a more optimized way to do it, see end of 
message). Then we could trust that there aren't any files in the data 
directory that don't have a corresponding record in WAL.
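
To illustrate the ordering, here is a toy sketch in Python. It is not PostgreSQL code: the log is a plain text file, and the names create_relation_file and logged_creations are made up for this example. The only point is that the record is written and fsynced before the file is created.

```python
import os

def create_relation_file(wal_path, file_path):
    """Log the creation durably, then create the file (toy model)."""
    # 1. Append the creation record to the log and force it to disk first.
    with open(wal_path, "ab") as wal:
        wal.write(f"CREATE {file_path}\n".encode())
        wal.flush()
        os.fsync(wal.fileno())
    # 2. Only now create the file. A crash before this point leaves a log
    #    record without a file, never a file without a log record.
    fd = os.open(file_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    os.close(fd)

def logged_creations(wal_path):
    """Return the set of file paths the log says were created."""
    created = set()
    with open(wal_path, "rb") as wal:
        for line in wal:
            if line.startswith(b"CREATE "):
                created.add(line[len(b"CREATE "):].decode().rstrip("\n"))
    return created
```

With this ordering, any file found in the directory should also appear in logged_creations(), which is the invariant described above.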

But that's not enough. If a checkpoint occurs after the file is 
created, but before the transaction ends, WAL replay doesn't see the file 
creation record. That's why we need a mechanism to carry the information 
over the checkpoint.

We could do that by extending the ForwardFsyncRequest function or by
creating something similar to that. When a backend writes the file 
creation WAL record, it also sends a message to the bgwriter that says 
"I'm xid 1234, and I have just created file foobar/1234" (while holding 
CheckpointStartLock). Bgwriter keeps a list of xid/file pairs like it 
keeps a list of pending fsync operations. On checkpoint, the checkpointer 
scans the list and removes entries for transactions that have already 
ended, and attaches the remaining list to the checkpoint record.
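
A rough sketch of that bookkeeping, again as a toy Python model rather than bgwriter code. The class and method names are hypothetical, and locking (CheckpointStartLock) is left out entirely.

```python
class PendingCreations:
    """Toy stand-in for the xid/file list kept by the bgwriter."""

    def __init__(self):
        self.entries = []  # list of (xid, path) pairs, in arrival order

    def remember(self, xid, path):
        # Called when a backend reports "xid created path".
        self.entries.append((xid, path))

    def checkpoint(self, in_progress_xids):
        # Drop entries whose transaction has already ended, and return the
        # remainder, which would be attached to the checkpoint record.
        self.entries = [(x, p) for x, p in self.entries
                        if x in in_progress_xids]
        return list(self.entries)
```

For example, if xids 1234 and 1235 each created a file and only 1234 is still running at checkpoint time, checkpoint({1234}) returns just the entry for 1234.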

WAL replay would start with the xid/file list in the checkpoint record, 
and update it during the replay whenever a file creation or a transaction 
commit/rollback record is seen. On a rollback record, files created by 
that transaction are deleted. At the end of WAL replay, the files that are 
left in the list belong to transactions that implicitly aborted, and can 
be deleted.
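
The replay logic could look roughly like the following toy Python sketch. The record layout (tuples of kind, xid and an optional path) is an assumption made for this example and is not the actual WAL format. The function only returns the files that would be deleted.

```python
def replay(checkpoint_list, wal_records):
    """Return the files to delete after replaying wal_records (toy model).

    checkpoint_list: (xid, path) pairs carried in the checkpoint record.
    wal_records: tuples ("create", xid, path), ("commit", xid)
                 or ("abort", xid), in log order.
    """
    pending = {}      # xid -> paths created by that still-open transaction
    to_delete = []
    for xid, path in checkpoint_list:
        pending.setdefault(xid, []).append(path)
    for rec in wal_records:
        kind, xid = rec[0], rec[1]
        if kind == "create":
            pending.setdefault(xid, []).append(rec[2])
        elif kind == "commit":
            pending.pop(xid, None)           # files are now permanent
        elif kind == "abort":
            to_delete.extend(pending.pop(xid, []))
    # Whatever is left belongs to transactions that implicitly aborted.
    for xid in sorted(pending):
        to_delete.extend(pending[xid])
    return to_delete
```

Here a transaction that was open at the checkpoint and later committed keeps its file, while files of explicitly or implicitly aborted transactions end up in the returned list.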

If we don't want to extend the checkpoint record, a separate WAL record 
works too.

Now, the more optimized way to do A:

Delay the actual file creation until it's first written to. The write 
needs to be WAL logged anyway, so we would just piggyback on that.
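
As a toy illustration of the idea (Python, hypothetical names, with a plain list standing in for WAL): nothing touches the file system until the first write, and the log record for that first write doubles as the creation record. How the real record would be marked is left open here.

```python
class LazyRelationFile:
    """Toy model: the file is created on first write, not at 'creation'."""

    def __init__(self, path, wal):
        self.path = path
        self.wal = wal          # list standing in for the WAL
        self.created = False    # no file exists on disk yet

    def write_block(self, data):
        if not self.created:
            # Log first; this one record covers both creation and write.
            self.wal.append(("write", self.path, "creates-file"))
            with open(self.path, "xb") as f:
                f.write(data)
            self.created = True
        else:
            self.wal.append(("write", self.path, "existing-file"))
            with open(self.path, "ab") as f:
                f.write(data)
```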

Implemented this way, I don't think there would be a significant 
performance hit from the scheme. We would create more ForwardFsyncRequest 
traffic, but not much compared to the block fsync requests we have right 
now.

BTW: If we allowed mdopen to create the file if it doesn't exist already, 
would we need the current file creation xlog record for anything? (I'm 
not suggesting we do that, just trying to get more insight.)

- Heikki

