Re: [PATCH] Lazy xid assingment V2 - Mailing list pgsql-hackers

From Florian G. Pflug
Subject Re: [PATCH] Lazy xid assingment V2
Date
Msg-id 46D9C317.70807@phlo.org
In response to Re: [PATCH] Lazy xid assingment V2  ("Heikki Linnakangas" <heikki@enterprisedb.com>)
List pgsql-hackers
Heikki Linnakangas wrote:
> Tom Lane wrote:
>> I had an idea this morning that might be useful: back off the strength
>> of what we try to guarantee.  Specifically, does it matter if we leak a
>> file on crash, as long as it isn't occupying a lot of disk space?
>> (I suppose if you had enough crashes to accumulate many thousands of
>> leaked files, the directory entries would start to be a performance drag,
>> but if your DB crashes that much you have other problems.)  This leads
>> to the idea that we don't really need to protect the open(O_CREAT) per
>> se.  Rather, we can emit a WAL entry *after* successful creation of a
>> file, while it's still empty.  This eliminates all the issues about
>> logging an action that might fail.  The WAL entry would need to include
>> the relfilenode and the creating XID.  Crash recovery would track these
>> until it saw the commit or abort or prepare record for the XID, and if
>> it didn't find any, would remove the file.
> 
> That idea, like all other approaches based on tracking WAL records, fails
> if there's a checkpoint after the WAL record (and that's quite likely to
> happen if the file is large). WAL replay wouldn't see the file creation
> WAL entry, and wouldn't know to track the xid. We'd need a way to carry
> the information over checkpoints.

Yes, checkpoints would need to include a list of created-but-not-yet-committed
files. I think the hardest part is figuring out a way to get that information
to the backend doing the checkpoint - my idea was to track them in shared
memory, but that would impose a hard limit on the number of concurrent
file creations. Not nice :-(

But wait... I just had an idea.
We already have such a central list of created-but-uncommitted
files - pg_class itself. There is a small window between file creation
and inserting the name into pg_class - but as Tom says, if we leak it then,
it won't use up much space anyway.

So maybe we should just scan pg_class on VACUUM, and obtain a list of files
that are referenced only from DEAD tuples. Those files we can then safely
delete, no?
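
To make the VACUUM idea concrete, here is a rough sketch of the selection rule only. It is a toy model in Python, not PostgreSQL code: the function name, the (relfilenode, state) input shape, and the three visibility states are assumptions made up for illustration, and real tuple visibility is more involved than this.

```python
from collections import defaultdict

# Toy visibility states (assumed for illustration); real PostgreSQL
# visibility rules distinguish more cases than these three.
LIVE, DEAD, IN_PROGRESS = "live", "dead", "in_progress"


def removable_relfilenodes(pg_class_tuples):
    """Return the relfilenodes referenced only by DEAD tuple versions.

    pg_class_tuples is an iterable of (relfilenode, state) pairs, one per
    tuple version seen during a hypothetical scan of pg_class.
    """
    states = defaultdict(set)
    for relfilenode, state in pg_class_tuples:
        states[relfilenode].add(state)
    # A file is a deletion candidate only if every tuple version that
    # references it is DEAD; any LIVE or IN_PROGRESS reference keeps it.
    return {node for node, seen in states.items() if seen == {DEAD}}


scan = [
    (1001, LIVE),         # ordinary committed table
    (1002, DEAD),         # creating transaction aborted
    (1003, DEAD),         # old version of an updated row ...
    (1003, LIVE),         # ... whose new version still uses the same file
    (1004, IN_PROGRESS),  # creator has not committed yet
]
candidates = removable_relfilenodes(scan)
```

Note that a file with no pg_class tuple at all (the small window mentioned above) never shows up in this scan, so it stays leaked under this scheme.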

If we *do* want a strict no-leakage guarantee, then we'd have to update pg_class
before creating the file, and flush the WAL. If we take Alvaro's idea of storing
temporary relations in a separate directory, we could skip the flush for those,
because we can just clean out that directory after recovery. Having to flush
the WAL when creating non-temporary relations doesn't sound too bad - those
operations won't occur very often, I'd say.
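
The ordering I have in mind for the strict variant can be sketched like this. Again this is only a toy model in Python: the function names, the "base/" and "temp/" prefixes, and the event log are all hypothetical stand-ins, and nothing here touches a real catalog, WAL, or filesystem.

```python
def create_relation(log, files, name, temporary=False):
    """Record the steps of creating a relation file, in order.

    log   -- list that receives (step, detail) tuples
    files -- set standing in for the files present on disk
    """
    # Step 1: the catalog entry comes first, so the file is never unknown.
    log.append(("catalog_insert", name))
    if temporary:
        # Temporary relations live in their own directory and skip the
        # flush; the whole directory is wiped after recovery anyway.
        path = "temp/" + name
    else:
        # Step 2: make the catalog change durable before the file exists.
        log.append(("wal_flush", name))
        path = "base/" + name
    # Step 3: only now create the file itself.
    files.add(path)
    log.append(("create_file", path))
    return path


def clean_temp_after_recovery(files):
    """Drop everything under temp/, as would happen after crash recovery."""
    return {p for p in files if not p.startswith("temp/")}


log, files = [], set()
create_relation(log, files, "t1")
create_relation(log, files, "scratch", temporary=True)
survivors = clean_temp_after_recovery(files)
```

The point of the sketch is just the cost split: one flush per non-temporary creation, none for temporary ones.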

greetings, Florian Pflug


