Re: Advice on MyXactMade* flags, MyLastRecPtr, pendingDeletes and lazy XID assignment - Mailing list pgsql-hackers

From Florian G. Pflug
Subject Re: Advice on MyXactMade* flags, MyLastRecPtr, pendingDeletes and lazy XID assignment
Date
Msg-id 46D6FEEB.10309@phlo.org
In response to Re: Advice on MyXactMade* flags, MyLastRecPtr, pendingDeletes and lazy XID assignment  (Gregory Stark <stark@enterprisedb.com>)
List pgsql-hackers
Gregory Stark wrote:
> "Florian G. Pflug" <fgp@phlo.org> writes:
> 
>>>> It seems doable, but it's not pretty. One possible scheme would be to
>>>> emit a record *after* choosing a name but *before* creating the file,
>>> No, because the way you know the name is good is a successful
>>> open(O_CREAT).
>> The idea was to log *twice*. Once to say that we're about to create a file,
>> and a second time to say that we succeeded. That way, the filename shows up
>> in the log, even if we crash immediately after physically creating the file,
>> which gives recovery at least a chance to clean up the mess.
> 
> It sounds like if the reason it fails is because someone else created the same
> file name you'll delete the wrong file?

Careful bookkeeping during recovery should be able to eliminate that risk,
I think. I've thought a bit more about this, and came up with the following
idea that also takes checkpoints into account.

We keep a global table of (xid, filename) pairs in shared memory. File creation
becomes
  1) Generate a new filename
  2) Add (CurrentTransactionId, filename) to the list, emit a XLOG record
     saying we did this, and flush the log. If the filename is already on
     the list, start over at (1).
  3) Create the file. If this fails, delete the list entry and the file,
     and start over at (1).
  4) On (sub) transaction ABORT, we remove entries with the xids we abort,
     and delete the files.
  5) On top transaction COMMIT, we remove entries with the xids we commit,
     and keep the files.
  6) During top transaction PREPARE, we record the entries with matching xids
     in the 2PC state file.
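
To make the bookkeeping in steps (1) to (6) concrete, here is a rough toy model
in Python. It is only a sketch under stated assumptions, not backend code: the
class PendingFileTable, the "FILE_PENDING" record name, the relfile_N naming
scheme and the fail_names fault-injection hook are all made up for
illustration, and the "disk" and "wal" are plain in-memory containers.

```python
import itertools


class PendingFileTable:
    """Toy stand-in for the shared (xid, filename) table. All names are hypothetical."""

    def __init__(self, fail_names=()):
        self.entries = {}                  # filename -> xid that is creating it
        self.wal = []                      # stand-in for the XLOG stream
        self.disk = set()                  # stand-in for the filesystem
        self.fail_names = set(fail_names)  # names whose creation is made to fail
        self._counter = itertools.count(1)

    def create_file(self, xid):
        while True:
            name = f"relfile_{next(self._counter)}"       # step 1 (made-up naming)
            if name in self.entries:                      # step 2: already listed
                continue
            self.entries[name] = xid
            # Step 2: log the intent; a real system would flush the log here,
            # before the file is physically created.
            self.wal.append(("FILE_PENDING", xid, name))
            if name in self.fail_names:                   # step 3: creation failed
                del self.entries[name]
                self.disk.discard(name)                   # drop any partial file
                continue
            self.disk.add(name)                           # step 3: file created
            return name

    def _take(self, xids):
        """Remove and return the filenames listed under any of the given xids."""
        names = [n for n, x in self.entries.items() if x in xids]
        for n in names:
            del self.entries[n]
        return names

    def abort(self, xids):
        for name in self._take(xids):      # step 4: forget entries, delete files
            self.disk.discard(name)

    def commit(self, xids):
        self._take(xids)                   # step 5: forget entries, keep files

    def prepare(self, xids):
        # Step 6: entries that would be recorded in the 2PC state file.
        return [(x, n) for n, x in self.entries.items() if x in xids]
```

Note that a failed creation still leaves its log record behind in this model,
so whatever replays the log has to cope with names that never got a file.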
 

When creating a checkpoint, we include the global filelist in the checkpoint. We
might need some interlock to ensure that concurrent global filelist updates 
don't get lost - but maybe doing things in the correct order is sufficient to
guarantee this.

During recovery, we track the fate of the files in a similar (but local) list.
 .) We initialize our local tracking list with the one found in the latest
    CHECKPOINT.
 .) When we encounter a COMMIT record, we remove all files with xids matching
    those in the COMMIT record without deleting them.
 .) When we encounter a PREPARE record, we remove all files with matching xids,
    and record them in the 2PC state file. They are deleted if the PREPARED
    transaction is aborted.
 .) When we encounter an ABORT record, we remove all files with matching xids
    from the list, and delete them.
 .) When we encounter a runtime CHECKPOINT, its list should match our tracking
    list.
 .) When we encounter a shutdown CHECKPOINT, we remove all files from our local
    list that are not in the checkpoint's list, and delete those files.
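
The recovery rules above can be sketched the same way, as a toy replay loop in
Python. Again this is only an illustration under assumptions: the record
shapes, the replay() function and the "FILE_PENDING" record are hypothetical.
Two things are my own guesses rather than part of the list above: that
replaying the "about to create a file" record adds the file to the tracking
list, and that aborting a prepared transaction shows up as an ABORT record
carrying its xid. What happens to entries still on the list at the end of
replay is not spelled out, so the sketch just returns them.

```python
def replay(checkpoint_files, records, disk):
    """Replay toy log records; return (leftover tracking list, 2PC state)."""
    tracking = dict(checkpoint_files)  # filename -> xid, from the latest checkpoint
    twophase = {}                      # stand-in for 2PC state files: xid -> names

    def take(xids):
        """Remove and return (name, xid) pairs listed under any of the xids."""
        names = [n for n, x in tracking.items() if x in xids]
        return [(n, tracking.pop(n)) for n in names]

    for rec in records:
        kind = rec[0]
        if kind == "FILE_PENDING":             # assumed: replay of the intent record
            _, xid, name = rec
            tracking[name] = xid
        elif kind == "COMMIT":
            take(rec[1])                       # forget entries, keep the files
            for xid in rec[1]:
                twophase.pop(xid, None)        # prepared xact committed: keep files
        elif kind == "PREPARE":
            for name, xid in take(rec[1]):     # move entries to the 2PC state
                twophase.setdefault(xid, []).append(name)
        elif kind == "ABORT":
            for name, _ in take(rec[1]):       # forget entries, delete the files
                disk.discard(name)
            for xid in rec[1]:                 # aborted prepared xact: delete too
                for name in twophase.pop(xid, []):
                    disk.discard(name)
        elif kind == "CHECKPOINT":
            if dict(rec[1]) != tracking:       # runtime checkpoint must agree
                raise RuntimeError("checkpoint file list does not match")
        elif kind == "SHUTDOWN_CHECKPOINT":
            listed = dict(rec[1])
            for name in [n for n in tracking if n not in listed]:
                del tracking[name]             # not in the checkpoint: delete
                disk.discard(name)
    return tracking, twophase
```

Anything left in the returned tracking list belongs to transactions that never
logged a COMMIT, ABORT or PREPARE before the crash.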
 

The XLOG flush in step (2) is pretty nasty, but I think any solution that
guarantees to prevent leaks will have to flush something to disk at that
point. The global table isn't too appealing either, because it
will limit how many concurrent transactions will be able to create files. It
could be replaced by some on-disk thing, though.

This solution sounds rather heavy-weight, but I thought I'd share the idea.

Back to work on lazy xid assignment now ;-)

greetings, Florian Pflug

