Heikki Linnakangas wrote:
> Tom Lane wrote:
>> I had an idea this morning that might be useful: back off the strength
>> of what we try to guarantee. Specifically, does it matter if we leak a
>> file on crash, as long as it isn't occupying a lot of disk space?
>> (I suppose if you had enough crashes to accumulate many thousands of
>> leaked files, the directory entries would start to be a performance drag,
>> but if your DB crashes that much you have other problems.) This leads
>> to the idea that we don't really need to protect the open(O_CREAT) per
>> se. Rather, we can emit a WAL entry *after* successful creation of a
>> file, while it's still empty. This eliminates all the issues about
>> logging an action that might fail. The WAL entry would need to include
>> the relfilenode and the creating XID. Crash recovery would track these
>> until it saw the commit or abort or prepare record for the XID, and if
>> it didn't find any, would remove the file.
>
> That idea, like all other approaches based on tracking WAL records, fails
> if there's a checkpoint after the WAL record (and that's quite likely to
> happen if the file is large). WAL replay wouldn't see the file creation
> WAL entry, and wouldn't know to track the xid. We'd need a way to carry
> the information over checkpoints.

Yes, checkpoints would need to include a list of created-but-not-yet-committed
files. I think the hardest part is figuring out a way to get that information
to the backend doing the checkpoint - my idea was to track them in shared
memory, but that would impose a hard limit on the number of concurrent
file creations. Not nice :-(
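To make the requirement concrete, here is a rough Python sketch of recovery that starts from a checkpoint carrying the pending-creation list. The record layout and every name in it (`replay`, `pending_at_checkpoint`, the "create"/"commit"/"abort"/"prepare" tags) are made up for illustration; this is not PostgreSQL's actual WAL format.

```python
def replay(pending_at_checkpoint, wal_after_checkpoint):
    """Return the relfilenodes that recovery should remove.

    pending_at_checkpoint: dict xid -> set of relfilenodes created by
        transactions still open at checkpoint time (hypothetical layout).
    wal_after_checkpoint: list of tuples, either
        ("create", xid, relfilenode) or ("commit" | "abort" | "prepare", xid).
    """
    # Copy so the caller's checkpoint data is left untouched.
    pending = {xid: set(files) for xid, files in pending_at_checkpoint.items()}

    for record in wal_after_checkpoint:
        kind, xid = record[0], record[1]
        if kind == "create":
            pending.setdefault(xid, set()).add(record[2])
        elif kind in ("commit", "abort", "prepare"):
            # The transaction's fate is recorded, so stop tracking it here.
            pending.pop(xid, None)

    # Anything left was created by a transaction with no outcome record.
    orphans = set()
    for files in pending.values():
        orphans |= files
    return orphans


checkpoint_list = {100: {"16384"}}  # file created before the checkpoint
wal = [("create", 101, "16385"), ("commit", 101)]
print(sorted(replay(checkpoint_list, wal)))  # prints ['16384']
```

If the checkpoint list is passed as `{}`, the file of xid 100 goes unnoticed, which is exactly Heikki's objection.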
But wait... I just had an idea.
We already have such a central list of created-but-uncommitted
files - pg_class itself. There is a small window between file creation
and inserting the name into pg_class - but as Tom says, if we leak it then,
it won't use up much space anyway.

So maybe we should just scan pg_class on VACUUM, and obtain a list of files
that are referenced only from DEAD tuples. Those files we can then safely
delete, no?
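Roughly, the scan I have in mind looks like the Python sketch below. It assumes VACUUM can hand us one (relfilenode, is_dead) pair per pg_class tuple version it sees; the function name and the input shape are invented for illustration, and "dead" is taken to mean dead to all transactions.

```python
def deletable_files(pg_class_tuples):
    """Return relfilenodes referenced only by dead pg_class tuple versions.

    pg_class_tuples: iterable of (relfilenode, is_dead) pairs, one per
        tuple version seen during the scan (hypothetical input shape).
    """
    live = set()
    dead = set()
    for relfilenode, is_dead in pg_class_tuples:
        if is_dead:
            dead.add(relfilenode)
        else:
            live.add(relfilenode)
    # A file that still has any non-dead reference must be kept.
    return dead - live


tuples = [
    ("16384", True),   # old version of a row that was later updated
    ("16384", False),  # current version, same file
    ("16390", True),   # row from an aborted CREATE TABLE
]
print(deletable_files(tuples))  # prints {'16390'}
```

The set difference matters: an updated pg_class row leaves a dead version behind that points at the same file as the live one. A file created in the window before its pg_class row exists is not found by this scan at all; that is the small leak we would be accepting.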
If we *do* want a strict no-leakage guarantee, then we'd have to update pg_class
before creating the file, and flush the WAL. If we take Alvaro's idea of storing
temporary relations in a separate directory, we could skip the flush for those,
because we can just clean out that directory after recovery. Having to flush
the WAL when creating non-temporary relations doesn't sound too bad - those
operations won't occur very often, I'd say.
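Here is a toy Python model of that ordering, just to check the reasoning. Everything in it is an assumption made for the sketch: WAL is a buffer plus a durable list, the directories are sets, and a "leak" is a file that survives a crash with no durable record pointing at it. It says nothing about how the backend would actually implement this.

```python
def leaked_after_crash(temporary, flush, steps_done):
    """Return files that survive a crash with no durable record of them.

    steps_done: how many steps of the creation sequence completed
        before the simulated crash.
    """
    wal_buffer, wal_durable = [], []
    base_dir, temp_dir = set(), set()
    relfilenode = "16384"  # arbitrary example value

    def log_pg_class_insert():
        wal_buffer.append(relfilenode)

    def flush_wal():
        wal_durable.extend(wal_buffer)
        wal_buffer.clear()

    def create_file():
        (temp_dir if temporary else base_dir).add(relfilenode)

    # Order proposed above: pg_class first, then flush, then the file.
    steps = [log_pg_class_insert]
    if flush:
        steps.append(flush_wal)
    steps.append(create_file)

    for step in steps[:steps_done]:
        step()

    # Crash: the unflushed WAL buffer is lost, and recovery empties the
    # temporary directory unconditionally.
    temp_dir.clear()
    return (base_dir | temp_dir) - set(wal_durable)


print(leaked_after_crash(temporary=False, flush=True, steps_done=3))   # set()
print(leaked_after_crash(temporary=False, flush=False, steps_done=2))  # {'16384'}
print(leaked_after_crash(temporary=True, flush=False, steps_done=2))   # set()
```

In this model a permanent relation created with the flush never leaks, whichever step the crash interrupts. Without the flush it leaks once the file exists, and a temporary relation gets away without the flush because its directory is wiped anyway.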
greetings, Florian Pflug