Thread: [PATCH] Lazy xid assignment V2

[PATCH] Lazy xid assignment V2

From
"Florian G. Pflug"
Date:
Hi

I've updated the lazy xid assignment patch, incorporating the results of the
discussion over the past few days. I'll try to explain what the patch does,
and why. This reasoning got a bit lengthy, but I felt that I should summarize
my own thoughts, plus the relevant parts of the discussions here...
And I might be a bit chatty too ;-)

The patch can be found at: http://soc.phlo.org/lazyxidassign.v2.patch
(Seems that large attachments don't get through on this list - or was that
just bad luck when I tried posting the first version of that patch?)

It includes a new regression test that asserts that we really don't assign xids
for pure selects. The patch got a bit large in the end, though most of it is
just the removal of XLOG_NO_TRAN and the various MyXact* flags and friends.
The interesting parts really aren't that much code.

The patch survives "make installcheck-parallel", and some light testing I did. I
didn't really hammer it yet, though.
I'll post this to the patches list if nobody finds a hole or suggests further
improvements.

XID Assignment, Snapshot creation and oldest XMIN computation
==============================================================
We no longer call GetNewTransactionId during transaction start.
Instead, we leave the xid set to InvalidTransactionId, similar
to what we currently do for subtransactions.
GetTopTransactionId checks if we have an xid assigned, and assigns
one using GetNewTransactionId if it's still InvalidTransactionId -
again, similar to what GetCurrentTransactionId does for subtransactions
already. A new function GetTopTransactionIdIfAny() bears
the same relation to GetTopTransactionId() as GetCurrentTransactionIdIfAny()
does to GetCurrentTransactionId() - it won't force xid assignment,
but rather return InvalidTransactionId if the xid is not set.

Both GetTopTransactionId() and GetCurrentTransactionId() use
AssignTransactionId() to do the actual work - this is a modified
version of what is currently called AssignSubTransactionId().
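
To illustrate the pattern, a simplified sketch (not the patch's actual code;
the real state lives in the TransactionState struct and the real functions
also deal with subtransaction nesting):

    static TransactionId currentTopXid = InvalidTransactionId;

    TransactionId
    GetTopTransactionIdIfAny(void)
    {
        /* never forces assignment; may return InvalidTransactionId */
        return currentTopXid;
    }

    TransactionId
    GetTopTransactionId(void)
    {
        /* assign an xid on first use, just like subtransactions already do */
        if (!TransactionIdIsValid(currentTopXid))
            currentTopXid = AssignTransactionId();  /* ends up calling GetNewTransactionId() */
        return currentTopXid;
    }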

GetOldestXmin() is changed slightly - it used to use
GetTopTransactionId() as a starting point for its xmin
calculation, falling back to ReadNewTransactionId() if called
from outside a transaction. Now, it always starts with ReadNewTransactionId().
It also used to ignore procarray entries without a valid xmin,
which it obviously mustn't do anymore - they might still have
a valid xmin set.

The changes to GetSnapshotData() are similar. Again, we initialize
xmin to ReadNewTransactionId() instead of GetTopTransactionId().
When scanning the procarray, we don't skip our own entry, but
just process it normally.
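
In heavily simplified pseudo-C, the changed xmin computation looks roughly
like this (the real GetOldestXmin() and GetSnapshotData() do a lot more;
treat this purely as an illustration of the two points above):

    /* start from the next xid to be assigned, not from our own xid */
    TransactionId result = ReadNewTransactionId();
    int           i;

    /* ProcArrayLock must be held while scanning the array */
    for (i = 0; i < procArray->numProcs; i++)
    {
        PGPROC *proc = procArray->procs[i];

        /* an entry without a valid xid may still carry a valid xmin */
        if (TransactionIdIsValid(proc->xid) &&
            TransactionIdPrecedes(proc->xid, result))
            result = proc->xid;

        if (TransactionIdIsValid(proc->xmin) &&
            TransactionIdPrecedes(proc->xmin, result))
            result = proc->xmin;
    }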

The function xid_age computes the age of a given xid relative
to the transaction's toplevel xid. Since forcing xid assignment
just to do that computation seemed silly, I modified xid_age
to use MyProc->xmin as a reference instead. I could have used
ReadNewTransactionId(), but that involves locking overhead,
and we probably want to use something that remains constant
during a transaction.
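
The modified function then looks roughly like this (a sketch based on the
existing xid_age() in utils/adt/xid.c; only the reference value changes):

    Datum
    xid_age(PG_FUNCTION_ARGS)
    {
        TransactionId xid = PG_GETARG_TRANSACTIONID(0);
        /* was: TransactionId now = GetTopTransactionId(); */
        TransactionId now = MyProc->xmin;   /* set when we took our snapshot */

        /* permanent XIDs are always infinitely old */
        if (!TransactionIdIsNormal(xid))
            PG_RETURN_INT32(INT_MAX);

        PG_RETURN_INT32((int32) (now - xid));
    }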

Correctness issues:
-------------------
* Users of Get{Top|Current}TransactionId{IfAny} *
Since subtransactions *already* assign their xids on-demand, most of
the code correctly distinguishes between GetCurrentTransactionId()
and GetCurrentTransactionIdIfAny(). So that code is safe. Callers
of GetTopTransactionId() have to be checked - but there are only
very few of them. In fact, all callers are discussed above, with
the exception of some logging code which I modified to use
GetCurrentTransactionIdIfAny() instead.

* Obtaining a snapshot *
The most interesting function from a correctness perspective is
GetSnapshotData(). Its correctness *does* heavily depend on
the assumption that no transaction can *exit* the set of running
transactions while it takes a snapshot - not after it set xmax
in any case. This assumption ensures that if transaction A sees B
as committed, and B sees C as committed, then A will see C
as committed.

The assumption implies the "commit transitivity" because
C can't commit while A and B take their snapshots,
and B can't commit while A takes its snapshot. So for
A to see B as committed, it'll have to take its snapshot
after B committed. And B will have to have taken its snapshot
after C committed, if B is to see C as committed. But since
B will obviously have taken its snapshot before it committed,
the ordering must then be |C commit|B snapshot|B commit|A snapshot|.
Which implies that A will indeed see C as committed.

This reasoning stays valid even if we assign xids lazily (no xids
show up in the reasoning at all), provided that we see *all*
transactions as committed that did so before we took a snapshot.
But even with lazy xid assignment, the xid will surely be assigned
before commit (or not at all, so nobody cares about us anyway) -
so when we take a snapshot after some commit, ReadNewTransactionId()
will return a larger value than the just-committed xid. So our
snapshot's xmax will end up > that xid, the xid won't show up in
our snapshot, and so we'll see it as committed.

* Global xmin computation*
Both GetSnapshotData() and GetOldestXmin() assume that concurrent
callers will end up computing the same GlobalXmin. This is guaranteed
as long as we hold the ProcArrayLock in ExclusiveMode while doing
the GlobalXmin computation - something that my patch doesn't change.

To cut the long story short - the small modifications outlined
above are all that is needed to ensure the correctness of
both GetSnapshotData and GetOldestXmin(),
even with lazy assignment of xids.

Top- and subtransaction commit and abort:
=========================================
* Deciding whether to write COMMIT/ABORT records or not *
The existing code maintains quite a few flags needed to decide
if a COMMIT or ABORT record needs to be written, if the
XLOG needs to be flushed afterwards and if so, how far.
Those flags are
* MyXactMadeTempRelUpdate
* MyXactMadeXLogEntry
* MyLastRecPtr
* ProcLastRecEnd

There is also a historical distinction between transactional and
non-transactional xlog entries - the latter are generated by
passing XLOG_NO_TRAN to XLogInsert. The only difference between
those two types is that MyLastRecPtr is only advanced for
transactional xlog entries, though.

With lazy xid assignment, we have a much cleaner way to decide if
a transaction has done anything that warrants writing a commit
record. We assume that if someone forced xid assignment by calling
GetCurrentTransactionId, then he will also have written that xid
to disk, and so we must write a commit record. If no transaction id
was assigned, then writing a commit record is impossible (there is
no xid to put into that record), and also completely useless. The
same holds true for ABORT records.

A transaction might still have created an xlog entry, even without
having been assigned an xid. An example for this is nextval(). We
still want to force those records to permanent storage, even if we
don't write a commit record. We therefore still track the end of the
latest xlog record a transaction has written. My patch renames
ProcLastRecEnd to XactLastRecEnd, and resets that value after committing
or aborting a transaction.

The rest of the flags (MyXactMadeTempRelUpdate, MyXactMadeXLogEntry
and MyLastRecPtr) are removed because they no longer serve a useful
purpose. Since removing MyLastRecPtr makes XLOG_NO_TRAN utterly
useless, that flag is removed too.

* Files scheduled for deletion on COMMIT & asynchronous commits *
Whenever a transaction drops a relation, it schedules the physical file
for deletion on COMMIT. Obviously we must only delete such a file when
we are *absolutely* sure that our transaction is committed. Since
we can only be sure of that once the COMMIT record has physically hit the disk,
we first *must* not skip writing the COMMIT record if there are files
scheduled for deletion, and second *must* not allow asynchronous commits
in that case.

The first requirement is already satisfied - file creation
will surely be followed by a catalog update, which will force xid assignment,
and thus writing a COMMIT record. There is an assertion in
RecordTransactionCommit that checks that there are no scheduled file deletions
when we don't have an xid assigned.

The second requirement can be relaxed a bit. If all to-be-deleted files are
in fact temporary relations, then they will be gone after a crash anyway. So
we may delete them before the COMMIT record hits the disk.

Note that there is a small window where we might leak a file. If we flush the
COMMIT record to disk, CHECKPOINT immediately afterwards, and then crash before
physically deleting the file, we won't do the deletion during recovery since our
COMMIT record is before the latest CHECKPOINT. For asynchronous commit that
window is actually *smaller*, since the COMMIT record hits the disk later (maybe
even after deleting the files - which is why this is *only* allowed if all files
are temporary relations).

* Files scheduled for deletion on ABORT *
Whenever a transaction creates a file, it puts it onto the on-abort delete list.
There is no protection against leaks here - if we crash after creating the file,
but before the transaction COMMITS or ABORTS, the file is leaked. The current
code attempts to make that window smaller by forcing an xlog flush on ABORT if
there are files to be deleted. But there is not much point in that - the
transaction might have run for an arbitrarily long time after creating the file,
and during that whole time a crash would have caused file leakage. So getting
paranoid at ABORT time seems unjustified. And flushing might even do more harm
than good here - it's entirely possible that the flush will take longer than it
would have taken to get to the file deletion point had we not flushed.

* Summa summarum *
This modifies RecordTransactionCommit and RecordSubTransactionCommit, plus it
joins RecordSubTransactionAbort and RecordTransactionAbort into one function,
implementing the following rules (which seem much clearer than the ones
in the current code). A rough sketch of the toplevel COMMIT path follows
below the list.

Toplevel COMMIT:
.) If we have an XID assigned, write a COMMIT record. If not, assert that
   there are no pending deletions.
.) If we're not committing asynchronously, we flush the XLOG up to
   XactLastRecEnd. Otherwise, we just do XLogSetAsyncCommitLSN(), and let
   the wal writer do the flushing.
.) If we have an XID assigned, we update the CLOG. For asynchronous commits,
   we have to pass the LSN of our COMMIT record to prevent the update from
   hitting the disk before the xlog record does.
.) Reset XactLastRecEnd.

Subtransaction COMMIT:
.) Do nothing if we don't have an xid assigned, otherwise
.) just update the CLOG. No WAL-logging needed.

Toplevel *and* subtransaction ABORT:
.) Do nothing if we don't have an xid assigned, otherwise
.) write an ABORT record, and then
.) update the CLOG. We don't ensure that this doesn't hit the disk early -
   there is not much difference between actively aborting, and just being
   assumed to have crashed.
.) For subtransactions, remove the relevant xids from the procarray; otherwise
   reset XactLastRecEnd.
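
To make the shape of that toplevel COMMIT logic concrete, here's a minimal
pseudo-C sketch (helper names like WriteCommitRecord and asyncCommit are
made up for the illustration; the real RecordTransactionCommit in the patch
handles many more details):

    static void
    RecordTransactionCommit(void)
    {
        TransactionId xid = GetTopTransactionIdIfAny();

        /* 1. A COMMIT record is needed exactly when an xid was assigned. */
        if (TransactionIdIsValid(xid))
            WriteCommitRecord(xid);     /* made-up helper, advances XactLastRecEnd */
        else
        {
            /* no xid: there had better be no files scheduled for deletion */
        }

        /* 2. Make sure our xlog records (if any) reach permanent storage. */
        if (XactLastRecEnd.xrecoff != 0)
        {
            if (!asyncCommit)                           /* made-up flag */
                XLogFlush(XactLastRecEnd);              /* synchronous commit */
            else
                XLogSetAsyncCommitLSN(XactLastRecEnd);  /* walwriter flushes later */
        }

        /* 3. Only an assigned xid needs a CLOG update; pass the COMMIT LSN
         *    when committing asynchronously. */
        if (TransactionIdIsValid(xid))
            MarkXidCommittedInClog(xid);                /* made-up helper */

        /* 4. Forget the last record written by this transaction. */
        XactLastRecEnd.xrecoff = 0;
    }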

Waiting for transactions to exit / ResourceOwnerIds
===================================================
Each transaction currently holds a lock on its xid. This is used for two quite
different purposes - one is waiting for a transaction that modified some
tuple - this still works with lazily assigned xids. But during CREATE INDEX
CONCURRENTLY, this is used to first wait for all transactions that are holding
a lock conflicting with ShareLock on the relation in question, and then to
out-wait *all* currently running transactions. This obviously fails when xids
are assigned lazily. In the first waiting phase because a transaction will
acquire the lock *before* requesting an xid, in the second phase because we have
to out-wait even pure selects.

Therefore, the concept of a ResourceOwnerId is introduced. This id identifies
the toplevel resourceowner of a transaction, and the owning transaction holds
a lock on it just as it does for transaction ids. Porting the first waiting
phase to this is trivial - we just collect the RIDs instead of the XIDs of
the current lock holders, and wait for them in turn.

The second waiting phase is a bit more complicated, because the current code
just waits on all xids in its reference snapshot - which won't work for RIDs,
because they do not show up in snapshots. So, instead, we scan the procarray,
and wait for any transaction with an xmin < the xmax of our reference snapshot.
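
Roughly, and with made-up helper names (ResourceOwnerIdWait standing in for
the RID analogue of XactLockTableWait), the second phase then becomes:

    /* Phase 2: out-wait every transaction whose snapshot might still see
     * tuples that our reference snapshot considers in-progress.
     * (waitRids[] sized for MaxBackends; allocation elided) */
    nwait = 0;
    LWLockAcquire(ProcArrayLock, LW_SHARED);
    for (i = 0; i < procArray->numProcs; i++)
    {
        PGPROC *proc = procArray->procs[i];

        if (TransactionIdIsValid(proc->xmin) &&
            TransactionIdPrecedes(proc->xmin, refSnapshot->xmax))
            waitRids[nwait++] = proc->rid;      /* field added by the patch */
    }
    LWLockRelease(ProcArrayLock);

    for (i = 0; i < nwait; i++)
        ResourceOwnerIdWait(waitRids[i]);       /* made-up name */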

ResourceOwnerIds are a struct composed of two uint32 fields - one sessionId that
is assigned to each backend at backend startup (From a counter in shmem), and
one localTransactionId that is just incremented locally whenever we need a new
unique RID in a backend. This ensures that RIDs create no new locking overhead,
apart from having to hold at least two locks per transaction instead of one.
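
As a sketch (field and function names are illustrative only):

    typedef struct ResourceOwnerId
    {
        uint32  sessionId;           /* from a shared-memory counter, assigned
                                      * once at backend startup */
        uint32  localTransactionId;  /* advanced locally, no locking needed */
    } ResourceOwnerId;

    static uint32 MySessionId;               /* set at backend start */
    static uint32 nextLocalTransactionId = 1;

    static ResourceOwnerId
    NewResourceOwnerId(void)
    {
        ResourceOwnerId rid;

        rid.sessionId = MySessionId;
        rid.localTransactionId = nextLocalTransactionId++;
        return rid;
    }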

Locks on ResourceOwnerIds are not saved to the 2PC state file on PREPARE, since
the RID of a transaction will be reused at some point after it was prepared.

ResourceOwnerIds & pg_locks.
============================
Locking requests on RIDs are shown in pg_locks, just as those on xids are.
Therefore, some representation has to be chosen for RIDs. XID won't work,
because due to the local/global split we need 8 bytes to represent a RID.
After much discussion (a new datatype is too heavy-weight, leaving out bits
might hinder debugging, and using int8 is impossible as long as we support
INT64_IS_BUSTED), we settled on a string representation. This at least has
the advantage that the transition to a new full-blown type for RIDs would
be seamless should we ever choose to go that way.

My patch adds two columns to pg_locks - "resourceownerid" on the 'left'
side (The locking request), and "resourceowner" on the 'right' side
(the holder/waiter). This is similar to how xids are handled in that view.

Possible future Improvements:
=============================
It's not entirely clear that we really need the exclusive ProcArrayLock
when exiting a transaction that didn't hold an xid. Not holding that lock
*will* make two concurrent calculations of globalxmin get two different
results sometimes - but that might actually be OK.

With some effort, we could get rid of the lock we hold on the xid entirely
I think. We'd need to reserve a special sessionId (0xffffffff probably)
for RIDs of prepared transactions. During prepare, we'd then grab a lock
on the RID (0xffffffff, xid) *before* releasing the lock on our old RID.
In XactLockTableWait we'd have to first check if the transaction using
the xid we want to wait for is still running, and get its RID if it is.
We'd then have to wait on that RID, and after getting that lock we'd have
to wait on the RID (0xffffffff, xid). Only after getting that lock too,
we may return. But that doesn't seem much prettier than just holding two
locks per transaction, so for now I think it's best to leave things as
they are.

greetings, Florian Pflug



Re: [PATCH] Lazy xid assignment V2

From
Tom Lane
Date:
"Florian G. Pflug" <fgp@phlo.org> writes:
> The patch can be found at: http://soc.phlo.org/lazyxidassign.v2.patch
> (Seems that large attachments don't get through on this list - or was that
> just bad luck when I tried posting the first version of that patch?)

No, the message size limit on -hackers is relatively small.  You
shouldn't post patches here anyway, they belong on -patches.

A few comments ...

> GetOldestXmin() is changed slightly - it used to use
> GetTopTransactionId() as a starting point for its xmin
> calculation, falling back to ReadNewTransactionId() if called
> from outside a transaction. Now, it always starts with ReadNewTransactionId().

It's kind of annoying to do ReadNewTransactionId() when most of the time
it won't affect the result.  However I don't see much choice; if we
tried to read it only after failing to find any entries in the
procarray, we'd have a race condition, since someone might acquire
an xid and put it into the procarray after we look at their PGPROC
and before we can do ReadNewTransactionId.

It might be worth doing GetTopTransactionIdIfAny and only calling
ReadNewTransactionId when that fails, since the former is far
cheaper than the latter.  Also we could cheaply check to see if
our own xmin is set, and use that to initialize the computation.
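
Something like this, say (only a sketch):

    TransactionId result = GetTopTransactionIdIfAny();

    if (!TransactionIdIsValid(result))
        result = MyProc->xmin;              /* cheap, already ours */
    if (!TransactionIdIsValid(result))
        result = ReadNewTransactionId();    /* only now pay the locking cost */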

> The function xid_age computes the age of a given xid relative
> to the transaction's toplevel xid. Since forcing xid assignment
> just to do that computation seemed silly, I modified xid_age
> to use MyProc->xmin as a reference instead. I could have used
> ReadNewTransactionId(), but that involves locking overhead,
> and we probably want to use something that remains constant
> during a transaction.

I don't much like this, since as I mentioned before I don't think
MyProc->xmin is going to be constant over a whole transaction for
long.  I don't think xid_age is performance-critical in any way,
so I'd vote for letting it force XID assignment.

> The first requirement is already satisfied - file creation
> will surely be followed by a catalog update, which will force xid assignment,
> and thus writing a COMMIT record. There is an assertion in
> RecordTransactionCommit that checks that there are no scheduled file deletions
> when we don't have an xid assigned.

Rather than an Assert (which won't be there in production), I'd suggest
a standard test and elog(ERROR).  This point is sufficiently critical
that I don't want the test to be missing in production builds ...
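
I.e., something along these lines (with nrels standing in for whatever counts
the pending deletions):

    if (!TransactionIdIsValid(xid) && nrels != 0)
        elog(ERROR, "cannot commit a transaction with pending deletions but no xid");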

> Therefore, the concept of a ResourceOwnerId is introduced. This id identifies
> the toplevel resourceowner of a transaction, and the owning transaction holds
> a lock on it just as it does for transaction ids. Porting the first waiting
> phase to this is trivial - we just collect the RIDs instead of the XIDs of
> the current lock holders, and wait for them in turn.

I don't particularly like calling these ResourceOwnerId --
ResourceOwners are an implementation artifact that's purely
backend-local.  If we were to get rid of them in favor of some other
resource management mechanism, we'd still need RIDs for the waiting
purposes you mention.  So I'm in favor of calling them something else,
maybe virtual transaction ids (VXID or VID) --- not wedded to that if
someone has a better idea.

> Locks on ResourceOwnerIds are not saved to the 2PC state file on PREPARE, since
> the RID of a transaction will be reused at some point after it was prepared.

Hmm.  I think this is all right since we don't actually need to wait for
a prepared xact, but it still seems like a wart.

Other than these relatively minor quibbles, the description seems all
right, and it's so complete that I'm not sure I need to read the code
;-)


As I just mentioned over in pgsql-performance, I'm strongly inclined
to push this patch into 8.3, even though we're certainly far past the
feature freeze deadline.  Not allocating XIDs for transactions that
only do SELECTs seems to me like it'll be a big win for a lot of
high-performance applications.  It's not so much that the act of
allocating an XID is really expensive, it's that making a significant
reduction in the rate at which XIDs are used up translates directly
to a reduction in many overhead costs.  For instance, the "live" area of
pg_clog and pg_subtrans gets smaller, reducing pressure on those buffer
areas, and the frequency with which vacuums are needed for
anti-wraparound purposes drops.

Comments?
        regards, tom lane


Re: [PATCH] Lazy xid assignment V2

From
"Florian G. Pflug"
Date:
Tom Lane wrote:
> "Florian G. Pflug" <fgp@phlo.org> writes:
> A few comments ...
> 
>> GetOldestXmin() is changed slightly - it used to use
>> GetTopTransactionId() as a starting point for its xmin
>> calculation, falling back to ReadNewTransactionId() if called
>> from outside a transaction. Now, it always starts with ReadNewTransactionId().
> 
> It's kind of annoying to do ReadNewTransactionId() when most of the time
> it won't affect the result.  However I don't see much choice; if we
> tried to read it only after failing to find any entries in the
> procarray, we'd have a race condition, since someone might acquire
> an xid and put it into the procarray after we look at their PGPROC
> and before we can do ReadNewTransactionId.
> 
> It might be worth doing GetTopTransactionIdIfAny and only calling
> ReadNewTransactionId when that fails, since the former is far
> cheaper than the latter.  Also we could cheaply check to see if
> our own xmin is set, and use that to initialize the computation.
Sounds worthwhile, I'm going to do this.

>> The function xid_age computes the age of a given xid relative
>> to the transaction's toplevel xid. Since forcing xid assignment
>> just to do that computation seemed silly, I modified xid_age
>> to use MyProc->xmin as a reference instead. I could have used
>> ReadNewTransactionId(), but that involves locking overhead,
>> and we probably want to use something that remains constant
>> during a transaction.
> 
> I don't much like this, since as I mentioned before I don't think
> MyProc->xmin is going to be constant over a whole transaction for
> long.  I don't think xid_age is performance-critical in any way,
> so I'd vote for letting it force XID assignment.
Hm... I agree that xid_age is probably not performance-critical.
I guess it's more the complete counter-intuitivity of forcing
xid assignment in some arbitrary function that's bugging me a bit.

Maybe it'd be sufficient to guarantee a constant return value
of this function within one statement for read-committed
transactions? For serializable transactions it'd still be constant.
That doesn't sound completely illogical to me - after all, the age
of an xid *does* really change if my xmin moves forward.

But I can live with just doing GetTopTransactionId() too...

>> The first requirement is already satisfied - file creation
>> will surely be followed by a catalog update, which will force xid assignment,
>> and thus writing a COMMIT record. There is an assertion in
>> RecordTransactionCommit that checks that there are no scheduled file deletions
>> when we don't have an xid assigned.
> 
> Rather than an Assert (which won't be there in production), I'd suggest
> a standard test and elog(ERROR).  This point is sufficiently critical
> that I don't want the test to be missing in production builds ...
I agree - I didn't consider the fact that Asserts are disabled in
production builds when I made it an assertion.

>> Therefore, the concept of a ResourceOwnerId is introduced. This id identifies
>> the toplevel resourceowner of a transaction, and the owning transaction holds
>> a lock on it just as it does for transaction ids. Porting the first waiting
>> phase to this is trivial - we just collect the RIDs instead of the XIDs of
>> the current lock holders, and wait for them in turn.
> 
> I don't particularly like calling these ResourceOwnerId --
> ResourceOwners are an implementation artifact that's purely
> backend-local.  If we were to get rid of them in favor of some other
> resource management mechanism, we'd still need RIDs for the waiting
> purposes you mention.  So I'm in favor of calling them something else,
> maybe virtual transaction ids (VXID or VID) --- not wedded to that if
> someone has a better idea.
I'm not particularly proud of the name "ResourceOwnerId" - it was just the
best I could come up with ;-)

I quite like VirtualTransactionId actually - it emphasizes the point
that they never hit the disk. So I'll rename it to VirtualTransactionId
if no better idea comes up.

>> Locks on ResourceOwnerIds are not saved to the 2PC state file on PREPARE, since
>> the RID of a transaction will be reused at some point after it was prepared.
> 
> Hmm.  I think this is all right since we don't actually need to wait for
> a prepared xact, but it still seems like a wart.
It certainly is a wart, but removing it would create a bigger one
I fear... The only other idea I had was to make locks on RIDs session 
locks, and to drop them explicitly at toplevel commit and abort time,
and that certainly has a quite high warty-ness score...

> Other than these relatively minor quibbles, the description seems all
> right, and it's so complete that I'm not sure I need to read the code
> ;-)
I *did* warn that I tend to be a bit chatty at times, though ;-)

greetings, Florian Pflug



Re: [PATCH] Lazy xid assignment V2

From
Tom Lane
Date:
"Florian G. Pflug" <fgp@phlo.org> writes:
> Tom Lane wrote:
>> I don't much like this, since as I mentioned before I don't think
>> MyProc->xmin is going to be constant over a whole transaction for
>> long.  I don't think xid_age is performance-critical in any way,
>> so I'd vote for letting it force XID assignment.

> Hm... I agree that xid_age is probably not performance-critical.
> I guess it's more the complete counter-intuitivity of forcing
> xid assignment in some arbitrary function that's bugging me a bit.

Well, I don't see it as arbitrary --- we're trying to hide the fact that
XIDs are assigned lazily from user-visible behavior.

>>> Locks on ResourceOwnerIds are not saved to the 2PC state file on PREPARE, since
>>> the RID of a transaction will be reused at some point after it was prepared.
>> 
>> Hmm.  I think this is all right since we don't actually need to wait for
>> a prepared xact, but it still seems like a wart.
>
> It certainly is a wart, but removing it would create a bigger one
> I fear... The only other idea I had was to make locks on RIDs session 
> locks, and to drop them explicitly at toplevel commit and abort time,
> and that certainly has a quite high warty-ness score...

Hmm.  I was thinking that we could just not do anything special here,
but on second thought I see the problem: on crash and restart we are
intending to restart the session ID counter from zero, which creates
a significant probability of some session conflicting with a VXID
lock held by a prepared xact.  So I guess filtering the VXID lock
out of the set sent to disk is probably the best answer.  There's
definitely no need for the prepared xact to continue to hold that
lock after crash/restart.
        regards, tom lane


Re: [PATCH] Lazy xid assignment V2

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> "Florian G. Pflug" <fgp@phlo.org> writes:
>> Tom Lane wrote:
>>> I don't much like this, since as I mentioned before I don't think
>>> MyProc->xmin is going to be constant over a whole transaction for
>>> long.  I don't think xid_age is performance-critical in any way,
>>> so I'd vote for letting it force XID assignment.
>
>> Hm... I agree that xid_age is probably not performance-critical.
>> I guess it's more the complete counter-intuitivity of forcing
>> xid assignment in some arbitrary function that's bugging me a bit.
>
> Well, I don't see it as arbitrary --- we're trying to hide the fact that
> XIDs are assigned lazily from user-visible behavior.

It seems both this and some of the other cases of having to call
ReadNewTransactionId would be eliminated if we invented a new xid treated
similarly to current_time() and others which are calculated once per
transaction and then cached for subsequent accesses. So xid_age() would
measure relative to a fixed xid, it just wouldn't be *our* xid necessarily.

Just a thought.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: [PATCH] Lazy xid assignment V2

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> It seems both this and some of the other cases of having to call
> ReadNewTransactionId would be eliminated if we invented a new xid treated
> similarly to current_time() and others which are calculated once per
> transaction and then cached for subsequent accesses. So xid_age() would
> measure relative to a fixed xid, it just wouldn't be *our* xid necessarily.

Hm, so basically call ReadNewTransactionId the first time one of these
functions is used within a given transaction, and save that?  Or we
could use our own transaction ID if one was already assigned at that
point.  Seems reasonable.  Not sure what to call the thing though.
ArbitraryReferenceXID?
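
In sketch form (names invented; the cached value would have to be reset at
transaction end):

    static TransactionId referenceXid = InvalidTransactionId;

    static TransactionId
    GetReferenceXid(void)
    {
        if (!TransactionIdIsValid(referenceXid))
        {
            /* prefer our own xid if one is already assigned */
            referenceXid = GetTopTransactionIdIfAny();
            if (!TransactionIdIsValid(referenceXid))
                referenceXid = ReadNewTransactionId();
        }
        return referenceXid;
    }
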
        regards, tom lane


Re: [PATCH] Lazy xid assignment V2

From
"Florian G. Pflug"
Date:
August Zajonc wrote:
I assume that you meant that mail to go to pgsql-hackers too, and just
clicked the wrong button by accident. If not, please forgive that
I post the reply to the list.

> Are you going to look at putting temp relations into a special folder 
> for 8.4? I really like that idea for crash cleanup and generally.
According to Alvaro, that is already the case. I didn't check, though.

> For non temp files, can they be created in a pending folder, work done 
> on them, and then moved over to final destination just before commit?
Hm... that doesn't sound bad at first sight. We'd need one such folder
per tablespace, though.

> You could still leak them (ie, you move to final location and bail 
> before commit) but it seems to narrow the window down significantly.
That leak could be prevented I think if the COMMIT record indicated
which files are to be moved, and the actual move happens after the
flushing of the COMMIT record to disk. Probably the move would need to
happen while we're in the commit critical section too, to guard
against concurrent checkpoints - but that all seems doable.

> Actually, as I write this, a bad idea, the temp files are the low 
> hanging fruit, don't think it's worth dealing with non-temp files until 
> you can make it totally leakproof.
There is no fundamental difference between temporary and non-temporary
delete-on-abort files, I think. There is one for delete-on-commit files
(and only because of asynchronous commits),
but the leakage possibility for those is already really small (it'd
take a COMMIT, a crash *and* a CHECKPOINT at just the right time to
get a leak).

The hardest part seems to be making the code cope with moved files.
Though we only need to care about the local backend I believe, because
all others haven't yet seen the files anyway.

greetings, Florian Pflug



Re: [PATCH] Lazy xid assignment V2

From
"Florian G. Pflug"
Date:
Gregory Stark wrote:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
> 
>> "Florian G. Pflug" <fgp@phlo.org> writes:
>>> Tom Lane wrote:
>>>> I don't much like this, since as I mentioned before I don't think
>>>> MyProc->xmin is going to be constant over a whole transaction for
>>>> long.  I don't think xid_age is performance-critical in any way,
>>>> so I'd vote for letting it force XID assignment.
>>> Hm... I agree that xid_age is probably not performance-critical.
>>> I guess it's more the complete counter-intuitivity of forcing
>>> xid assignment in some arbitrary function that's bugging me a bit.
>> Well, I don't see it as arbitrary --- we're trying to hide the fact that
>> XIDs are assigned lazily from user-visible behavior.
> 
> It seems both this and some of the other cases of having to call
> ReadNewTransactionId would be eliminated if we invented a new xid treated
> similarly to current_time() and others which are calculated once per
> transaction and then cached for subsequent accesses. So xid_age() would
> measure relative to a fixed xid, it just wouldn't be *our* xid necessarily.

Hm.. for xid_age that would be OK I think. I'm not so sure about
GetOldestXmin(), though. If we cache the value *before* assigning an
xid, our cached value might very well turn out to be older than any xmin and
xid in the procarray. GetOldestXmin() would then return a value older
than necessary, which effectively means keeping old data around longer
than necessary. If we only cache the value *after* setting an xmin
it's entirely redundant - we might just use the xid, and be done with it.

I've checked who actually uses GetOldestXmin now. The list
is surprisingly short:

  CreateCheckPoint      - To find the pg_subtrans truncation point. Called
                          from bgwriter, and therefore from outside a transaction.
  IndexBuildHeapScan    - Only in the nonconcurrent case. Hardly
                          performance-critical.
  vacuum_set_xid_limits - To decide which tuples are recently dead (only
                          called once per VACUUM). Hardly performance-critical.

The reason this list is so short is that we *already* cache a global xmin
value in RecentGlobalXmin. So doing any more optimizations for
GetOldestXmin() seems like overkill. So I'd say let's do the optimization
Tom suggested (checking GetTopTransactionIdIfAny() before calling
ReadNewTransactionId()), and be done with it.

greetings, Florian Pflug



Re: [PATCH] Lazy xid assignment V2

From
"Florian G. Pflug"
Date:
Tom Lane wrote:
> "Florian G. Pflug" <fgp@phlo.org> writes:
>> Tom Lane wrote:
>>> I don't much like this, since as I mentioned before I don't think
>>> MyProc->xmin is going to be constant over a whole transaction for
>>> long.  I don't think xid_age is performance-critical in any way,
>>> so I'd vote for letting it force XID assignment.
> 
>> Hm... I agree that xid_age is probably not performance-critical.
>> I guess it's more the complete counter-intuitivity of forcing
>> xid assignment in some arbitrary function that's bugging me a bit.
> 
> Well, I don't see it as arbitrary --- we're trying to hide the fact that
> XIDs are assigned lazily from user-visible behavior.

Hm.. but xid_age already exposes an implementation detail (that xid
comparison is circular). Actually, after more thought on this I'd
argue that the only really correct reference for computing a xid's
age is GetOldestXmin(). It's what VACUUM uses to calculate the
freezeLimit, after all. Now, using GetOldestXmin() directly is probably
unwise because it might change at any time. So I now think that
using RecentOldestXmin is the correct thing to do. The fact that it
might change between statements for a read-committed transaction
is quite natural I'd say. After all, read-committed transactions *do*
see changes made by other transactions, so why shouldn't they see
xid age values changing, if this is what those values really do?

Of course, this is very similar to Gregory's suggestion - only that
it uses a cached xid value that already exists, instead of adding
a new one. Plus, it uses a value set once per snapshot instead of
once per transaction.

greetings, Florian Pflug


Re: [PATCH] Lazy xid assignment V2

From
Alvaro Herrera
Date:
Florian G. Pflug wrote:
> August Zajonc wrote:
> I assume that you meant that mail to go to pgsql-hackers too, and just
> clicked the wrong button by accident. If not, please forgive that
> I post the reply to the list.
>
>> Are you going to look at putting temp relations into a special folder for 
>> 8.4? I really like that idea for crash cleanup and generally.
> According to Alvaro, that is already the case. I didn't check, though.

Huh, no, I was just suggesting we could do it :-)

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: [PATCH] Lazy xid assignment V2

From
"Florian G. Pflug"
Date:
Alvaro Herrera wrote:
> Florian G. Pflug wrote:
>> August Zajonc wrote:
>> I assume that you meant that mail to go to pgsql-hackers too, and just
>> clicked the wrong button by accident. If not, please forgive that
>> I post the reply to the list.
>>
>>> Are you going to look at putting temp relations into a special folder for 
>>> 8.4? I really like that idea for crash cleanup and generally.
>> According to Alvaro, that is already the case. I didn't check, though.
> 
> Huh, no, I was just suggesting we could do it :-)
Oops, sorry :-)

I just reread your mail - obviously, my eyes somehow skipped the initial "Why"
when I first read it :-)

greetings, Florian Pflug



Re: [PATCH] Lazy xid assignment V2

From
Tom Lane
Date:
"Florian G. Pflug" <fgp@phlo.org> writes:
> Hm.. but xid_age already exposes an implementation detail (that xid
> comparison is circular).

But the only reason anyone cares about it at all, really, is that
xid comparison is circular ... the function was only invented in
the first place to help manage xid-wraparound prevention.

> Actually, after more thought on this I'd
> argue that the only really correct reference for computing a xid's
> age is GetOldestXmin(). It's what VACUUM uses to calculate the
> freezeLimit, after all. Now, using GetOldestXmin() directly is probably
> unwise because it might change at any time. So I now think that
> using RecentOldestXmin is the correct thing to do. The fact that it
> might change between statements for a read-committed transaction
> is quite natural I'd say.

It might change between statements for a serializable transaction too;
or within statements for either kind, because it depends on the actions
of other xacts not just our own.  That doesn't seem like a good plan at
all.

If you wanted to cache GetOldestXmin()'s result on first use of xid_age
within a transaction, then okay, but is it worth that much trouble?

Another perspective on this is that ReadNewTransactionId is the correct
reference point for xid_age, because what you really want to know about
when you use it is how much daylight there is before xid wraparound.
So maybe we should redefine xid_age as a volatile function that indeed
uses ReadNewTransactionId on every call.  But that is probably going to
complicate use of the function more than is justified.

I note that pg_description defines xid_age as "age of a transaction ID,
in transactions before current transaction", and I'm content to stick
with that definition.  Even if it does force consuming an XID.  I'm not
sure that we could devise a comparably succinct description of any of
these other proposed behaviors, nor that we should even consider
changing that description.
        regards, tom lane


Re: [PATCH] Lazy xid assignment V2

From
"Florian G. Pflug"
Date:
Tom Lane wrote:
> "Florian G. Pflug" <fgp@phlo.org> writes:
>> Actually, after more thought on this I'd
>> argue that the only really correct reference for computing a xid's
>> age is GetOldestXmin(). It's what VACUUM uses to calculate the
>> freezeLimit, after all. Now, using GetOldestXmin() directly is probably
>> unwise because it might change at any time. So I now think that
>> using RecentOldestXmin is the correct thing to do. The fact that it
>> might change between statements for a read-committed transaction
>> is quite natural I'd say.
> 
> It might change between statements for a serializable transaction too;
> or within statements for either kind, because it depends on the actions
> of other xacts not just our own.  That doesn't seem like a good plan at
> all.
> 
> If you wanted to cache GetOldestXmin()'s result on first use of xid_age
> within a transaction, then okay, but is it worth that much trouble?

D'oh - I managed to add confusion here I think, by writing RecentOldestXmin
(which just doesn't exist ;-) where I meant RecentGlobalXmin (which does
exist). RecentGlobalXmin is updated whenever we set a new snapshot.
(And nowhere else, I believe).

> Another perspective on this is that ReadNewTransactionId is the correct
> reference point for xid_age, because what you really want to know about
> when you use it is how much daylight there is before xid wraparound.
> So maybe we should redefine xid_age as a volatile function that indeed
> uses ReadNewTransactionId on every call.  But that is probably going to
> complicate use of the function more than is justified.
I can see your point there. With that perspective in mind, using the
current snapshot's xmax might be an option. That would be the best
approximation of ReadNewTransactionId() that still allows
xid_age to be STABLE.

> I note that pg_description defines xid_age as "age of a transaction ID,
> in transactions before current transaction", and I'm content to stick
> with that definition.  Even if it does force consuming an XID.  I'm not
> sure that we could devise a comparably succinct description of any of
> these other proposed behaviors, nor that we should even consider
> changing that description.
If we use LatestSnapshot->xmax, we could write "age of a transaction ID,
in transactions before (the) start of (the) current transaction".

greetings, Florian Pflug



Re: [PATCH] Lazy xid assignment V2

From
Alvaro Herrera
Date:
Tom Lane wrote:

> I note that pg_description defines xid_age as "age of a transaction ID,
> in transactions before current transaction", and I'm content to stick
> with that definition.  Even if it does force consuming an XID.  I'm not
> sure that we could devise a comparably succinct description of any of
> these other proposed behaviors, nor that we should even consider
> changing that description.

Hmm, what happens if somebody calls 
select age(xmin) from bigtable
?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: [PATCH] Lazy xid assignment V2

From
August Zajonc
Date:
Florian G. Pflug wrote:
>> You could still leak them (ie, you move to final location and bail 
>> before commit) but it seems to narrow the window down significantly.
> That leak could be prevented I think if the COMMIT record indicated
> which files are to be moved, and the actual move happens after the
> flushing of the COMMIT record to disk. Probably the move would need to
> happen while we're in the commit critical section too, to guard
> against concurrent checkpoints - but that all seems doable.
It does fix the crash leak. On recovery you could finish the move if 
necessary based on the commit record.

I do have a question about jamming though. Will the system work if the 
file ended up stuck in this folder? Let's say the move destination has a 
duplicate file that conflicts, or permissions change under you, or disks 
fill.

Clearly one could error out and force the crash / recovery, but avoiding 
that would be better. You could also infer where the file actually is 
(unmoved) based on the commit record, and perhaps on attempts to use 
that relation report an error after re-attempting the move (ie, couldn't 
move xxx, disk full / conflict etc) or simply emit a warning and use it 
in its pre-move state.  This is a *very* narrow case obviously on a well 
maintained system, so I lean towards a clear error message and a higher 
profile notice without too much fancy footwork.

I didn't post to -hackers because I'm in way over my head...but I enjoy 
following the list.

- August






Re: [PATCH] Lazy xid assignment V2

From
Tom Lane
Date:
August Zajonc <augustz@augustz.com> writes:
> I do have a question about jamming though. Will the system work if the 
> file ended up stuck in this folder? Let's say the move destination has a 
> duplicate file that conflicts, or permissions change under you, or disks 
> fill.

Yeah, the move-the-file approach seems to introduce its own set of
failure modes, which hardly seems like something we want.

I had an idea this morning that might be useful: back off the strength
of what we try to guarantee.  Specifically, does it matter if we leak a
file on crash, as long as it isn't occupying a lot of disk space?
(I suppose if you had enough crashes to accumulate many thousands of
leaked files, the directory entries would start to be a performance drag,
but if your DB crashes that much you have other problems.)  This leads
to the idea that we don't really need to protect the open(O_CREAT) per
se.  Rather, we can emit a WAL entry *after* successful creation of a
file, while it's still empty.  This eliminates all the issues about
logging an action that might fail.  The WAL entry would need to include
the relfilenode and the creating XID.  Crash recovery would track these
until it saw the commit or abort or prepare record for the XID, and if
it didn't find any, would remove the file.
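
The payload could be as small as something like this (purely illustrative):

    /* hypothetical payload of the proposed file-creation WAL record */
    typedef struct xl_file_create
    {
        RelFileNode    rnode;   /* which physical file was created */
        TransactionId  xid;     /* the creating transaction */
    } xl_file_create;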

With this approach I think we'd not even need to force-fsync the WAL
entry; instead treat it like an async COMMIT record (pass its LSN to
the walwriter).  Even in the absence of any subsequent WAL activity,
it would reach disk via the walwriter before the new file could be
filled to a size that would bother anyone.  (If the new file is being
filled via WAL-logged insertions, then you can probably make even
stronger statements than that, but we do have operations like COPY
and CREATE INDEX that can fill a file with unlogged insertions.)
So the performance impact would be about nil.

I still don't think that this area is in urgent need of a fix, but
I wanted to get this idea into the archives.
        regards, tom lane


Re: [PATCH] Lazy xid assignment V2

From
"Heikki Linnakangas"
Date:
Tom Lane wrote:
> I had an idea this morning that might be useful: back off the strength
> of what we try to guarantee.  Specifically, does it matter if we leak a
> file on crash, as long as it isn't occupying a lot of disk space?
> (I suppose if you had enough crashes to accumulate many thousands of
> leaked files, the directory entries would start to be a performance drag,
> but if your DB crashes that much you have other problems.)  This leads
> to the idea that we don't really need to protect the open(O_CREAT) per
> se.  Rather, we can emit a WAL entry *after* successful creation of a
> file, while it's still empty.  This eliminates all the issues about
> logging an action that might fail.  The WAL entry would need to include
> the relfilenode and the creating XID.  Crash recovery would track these
> until it saw the commit or abort or prepare record for the XID, and if
> it didn't find any, would remove the file.

That idea, like all other approaches based on tracking WAL records, fails
if there's a checkpoint after the WAL record (and that's quite likely to
happen if the file is large). WAL replay wouldn't see the file creation
WAL entry, and wouldn't know to track the xid. We'd need a way to carry
the information over checkpoints.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: [PATCH] Lazy xid assignment V2

From
"Florian G. Pflug"
Date:
Heikki Linnakangas wrote:
> Tom Lane wrote:
>> I had an idea this morning that might be useful: back off the strength
>> of what we try to guarantee.  Specifically, does it matter if we leak a
>> file on crash, as long as it isn't occupying a lot of disk space?
>> (I suppose if you had enough crashes to accumulate many thousands of
>> leaked files, the directory entries would start to be a performance drag,
>> but if your DB crashes that much you have other problems.)  This leads
>> to the idea that we don't really need to protect the open(O_CREAT) per
>> se.  Rather, we can emit a WAL entry *after* successful creation of a
>> file, while it's still empty.  This eliminates all the issues about
>> logging an action that might fail.  The WAL entry would need to include
>> the relfilenode and the creating XID.  Crash recovery would track these
>> until it saw the commit or abort or prepare record for the XID, and if
>> it didn't find any, would remove the file.
> 
> That idea, like all other approaches based on tracking WAL records, fails
> if there's a checkpoint after the WAL record (and that's quite likely to
> happen if the file is large). WAL replay wouldn't see the file creation
> WAL entry, and wouldn't know to track the xid. We'd need a way to carry
> the information over checkpoints.

Yes, checkpoints would need to include a list of created-but-yet-uncommitted
files. I think the hardest part is figuring out a way to get that information
to the backend doing the checkpoint - my idea was to track them in shared
memory, but that would impose a hard limit on the number of concurrent
file creations. Not nice :-(

But wait... I just had an idea.
We already have such a central list of created-but-uncommitted
files - pg_class itself. There is a small window between file creation
and inserting the name into pg_class - but as Tom says, if we leak it then,
it won't use up much space anyway.

So maybe we should just scan pg_class on VACUUM, and obtain a list of files
that are referenced only from DEAD tuples. Those files we can then safely
delete, no?

If we *do* want a strict no-leakage guarantee, then we'd have to update pg_class
before creating the file, and flush the WAL. If we take Alvaro's idea of storing
temporary relations in a separate directory, we could skip the flush for those,
because we can just clean out that directory after recovery. Having to flush
the WAL when creating non-temporary relations doesn't sound too bad - those
operations won't occur very often, I'd say.

greetings, Florian Pflug



Re: [PATCH] Lazy xid assignment V2

From
"Florian G. Pflug"
Date:
August Zajonc wrote:
>> Yes, checkpoints would need to include a list of 
>> created-but-yet-uncommitted
>> files. I think the hardest part is figuring out a way to get that 
>> information
>> to the backend doing the checkpoint - my idea was to track them in shared
>> memory, but that would impose a hard limit on the number of concurrent
>> file creations. Not nice :-(
> I'm confused about this.
> 
> As long as we assert the rule that the file name can't change on the 
> move, then after commit the file can be in only one of two places. The 
> name of the file is known (ie, pg_class). The directories are known. 
> What needs to be carried forward past a checkpoint? We don't even look 
> at WAL, so checkpoints are irrelevant it seems.
> If there is a crash just after commit and before the move, no harm. You 
> just move on startup. If the move fails, no harm, you can emit warning 
> and open in /pending (or simply error, even easier).
If you're going to open the file from /pending, what's the point of moving
it in the first place?

The idea would have to be that you move on commit (Or on COMMIT-record
replay, in case of a crash), and then, after recovering the whole wal,
you could remove leftover files in /pending.

The main problem is that you have to do the move *after* flushing the COMMIT
record to disk - otherwise you're gonna leak the file if you crash between
moving and flushing.

But that implies that the transaction is *already* committed when you do
the move. Others won't know that yet (You do the move *after* flushing,
but *before* updating the CLOG) - but still, since the COMMIT-record is
on disk, you cannot roll back anymore (since if you crash, and replay the
COMMIT record, the transaction *will* be committed).

So, what are you going to do if the move fails? You cannot roll back, and
you cannot update the CLOG (because then others would see your new table,
but no datafile). The only option is to PANIC. This will lead to a server
restart, WAL recovery, and probably another PANIC once the COMMIT-record
is replayed (Since the move probably still won't be possible).

It might be even worse - I'm not sure that a rename is an atomic operation
on most filesystems. If it's not, then you might end up with two files if
power fails *just* as you rename, or, worse, with no file at all. Even a slight
possibility of the second case seems unacceptable - it means losing
a committed transaction.

I agree that we should eventually find a way to guarantee either no file
leakage, or at least an upper bound on the amount of wasted space. But
doing so at the cost of PANICing if the move fails seems like a bad
tradeoff...

greetings, Florian Pflug


Re: [PATCH] Lazy xid assignment V2

From
Tom Lane
Date:
"Florian G. Pflug" <fgp@phlo.org> writes:
> It might be even worse - I'm not sure that a rename is an atomic operation
> on most filesystems.

rename(2) is specified to be atomic by POSIX, but relinking a file into
a different directory can hardly be --- it's not even provided as a
single kernel call, is it?

And there's still the problem that changing the filename on-the-fly is
going to break tons of low-level stuff, most of which is not supposed to
know about transactions at all, notably bgwriter.

What I was thinking about was a "flag file" separate from the data file
itself, a bit like what we use for archiver signaling.  If nnnn is the
new data file, then "touch nnnn.new" to mark the file as needing to be
deleted on restart.  Remove these files just *before* commit.  This
leaves you with a narrow window between removing the flag file and
actually committing, but there's no risk of having to PANIC --- if the
remove fails, you just abort the transaction.
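
In sketch form (paths and error handling simplified):

    char    flagpath[MAXPGPATH];
    int     fd;

    /* right after successfully creating the new data file "path" ... */
    snprintf(flagpath, sizeof(flagpath), "%s.new", path);
    fd = open(flagpath, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd < 0)
        elog(ERROR, "could not create flag file \"%s\": %m", flagpath);
    close(fd);

    /* ... and just before writing the COMMIT record: */
    if (unlink(flagpath) != 0)
        elog(ERROR, "could not remove flag file \"%s\": %m", flagpath);
    /* an ERROR here just aborts the transaction - no PANIC needed */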

However, this has nonzero overhead compared to the current behavior.
I'm still dubious that we have a problem that needs solving ...
        regards, tom lane


Re: [PATCH] Lazy xid assignment V2

From
August Zajonc
Date:
Florian G. Pflug wrote:
> August Zajonc wrote:
>> I'm confused about this.
>>
>> As long as we assert the rule that the file name can't change on the 
>> move, then after commit the file can be in only one of two places. 
>> The name of the file is known (ie, pg_class). The directories are 
>> known. What needs to be carried forward past a checkpoint? We don't 
>> even look at WAL, so checkpoints are irrelevant it seems
>> If there is a crash just after commit and before the move, no harm. 
>> You just move on startup. If the move fails, no harm, you can emit 
>> warning and open in /pending (or simply error, even easier).
> If you're going to open the file from /pending, what's the point of moving
> it in the first place?
Allow time for someone to sort out the disk situation without impacting 
things. Preserves the concept that after COMMIT a file exists on disk 
that is accessible, which is what I think people expect.
> The idea would have to be that you move on commit (Or on COMMIT-record
> replay, in case of a crash), and then, after recovering the whole wal,
> you could remove leftover files in /pending.
I think so. What I was thinking was

Limit to moves from spclocation/file.new to spclocation/file. So given a 
pg_class filename the datafile can only have two possible names:
relfilenode or relfilenode.new.

you commit, then move.

If crash occurs before commit, you leak a .new table.

If the move fails after commit, you emit a warning. The commit is still valid, 
because fopen can fall back to .new. The data is still there.

On crash recovery, move .new files that show up in pg_class to their 
proper name if the disk has been fixed and the move can succeed. For .new 
files that don't exist in pg_class after log replay, delete them.

Fallback open:

  fopen relfilenode
  if ENOENT: fopen relfilenode.new
             if ENOENT: error
             else: emit warning

>
> So, what are you going to do if the move fails? You cannot roll back, and
> you cannot update the CLOG (because then others would see your new table,
> but no datafile). The only option is to PANIC. This will lead to a server
> restart, WAL recovery, and probably another PANIC once the COMMIT-record
> is replayed (Since the move probably still won't be possible).
That was the general idea of the fallback - avoid this issue. You never 
roll back. If the datafile is not where expected, it can only be in one 
other place.
> It might be even worse - I'm not sure that a rename is an atomic 
> operation
> on most filesystems. If it's not, then you might end up with two files if
> power fails *just* as you rename, or, worse with no file at all. Even 
> a slight
> possibility of the second case seems unacceptable - it means losing
> a committed transaction.
Yes, atomic renames are an assumption.
>
> I agree that we should eventually find a way to guarantee either no file
> leakage, or at least an upper bound on the amount of wasted space. But
> doing so at the cost of PANICing if the move fails seems like a bad
> tradeoff...
>
Agreed...
> greetings, Florian Pflug



Re: [PATCH] Lazy xid assignment V2

From
August Zajonc
Date:
Tom Lane wrote:
> What I was thinking about was a "flag file" separate from the data file
> itself, a bit like what we use for archiver signaling.  If nnnn is the
> new data file, then "touch nnnn.new" to mark the file as needing to be
> deleted on restart.  Remove these files just *before* commit.  This
> leaves you with a narrow window between removing the flag file and
> actually committing, but there's no risk of having to PANIC --- if the
> remove fails, you just abort the transaction.
>
> However, this has nonzero overhead compared to the current behavior.
> I'm still dubious that we have a problem that needs solving ...
>
>             regards, tom lane
Maybe just get back to the beginning:

We are concerned about leaking files on *crashes*.

So during recovery - the only time we are going to care - iterate through 
the list of files that should exist (pg_class) after replay of the log.

Any files that don't belong (but fit the standard naming conventions) 
move to spclocation/trash or similar.

For the admin who had critical data they needed to recover and is going 
to spend the time going to the disk to pull it out, they can still 
access it. This also saves admins who create table-like filenames in 
tablespaces.

For most other folks they can dump /trash.

Does this avoid the dataloss worries?



Re: [PATCH] Lazy xid assingment V2

From
"Florian G. Pflug"
Date:
Tom Lane wrote:
> "Florian G. Pflug" <fgp@phlo.org> writes:
>> It might be even worse - I'm not sure that a rename is an atomic operation
>> on most filesystems.
> 
> rename(2) is specified to be atomic by POSIX, but relinking a file into
> a different directory can hardly be --- it's not even provided as a
> single kernel call, is it?
I'd have thought that it only guarantees that if the new name already
exists, it's atomically replaced. But I might be wrong....

> And there's still the problem that changing the filename on-the-fly is
> going to break tons of low-level stuff, most of which is not supposed to
> know about transactions at all, notably bgwriter.
Good point - I thought that we wouldn't have to care about this, because
we could close the relation before renaming in the committing backend
and be done with it, since other backends won't see the new file
before we update the clog. But you're right, bgwriter is a problem,
and one not easily solved...

So that rename-on-commit idea seems to be quite dead...

> What I was thinking about was a "flag file" separate from the data file
> itself, a bit like what we use for archiver signaling.  If nnnn is the
> new data file, then "touch nnnn.new" to mark the file as needing to be
> deleted on restart.  Remove these files just *before* commit.  This
> leaves you with a narrow window between removing the flag file and
> actually committing, but there's no risk of having to PANIC --- if the
> remove fails, you just abort the transaction.
Hm... we could call the file "nnn.xid.new", and delete it after the commit,
silently ignoring any failures. During both database-wide VACUUM and
after recovery we'd remove any leftover *.xid.new files, but only
if the xid is marked committed in the clog. After that cleanup step,
we'd delete any data files which still have an associated flag file.

Processing those nnn.xid.new files during VACUUM is only needed to
avoid problems caused by xid wraparound - it could perhaps be avoided
by naming the file nnn.epoch.xid.new instead.
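
To pin down the ordering, here's a rough C sketch of that cleanup pass.
list_flag_files() and xid_is_committed() are placeholders invented for
this sketch, not real PostgreSQL APIs, and real code would consult the
clog directly and worry about epochs.

    /*
     * Rough sketch of the flag-file cleanup pass.  Pass 1 removes flag
     * files whose creating transaction committed (the data file stays).
     * Pass 2 removes data files that still have a flag file, i.e. whose
     * creating transaction never committed, together with the flag file.
     * list_flag_files() and xid_is_committed() are assumed helpers.
     */
    #include <stdio.h>

    struct flag_file
    {
        char         data_path[1024];   /* relation data file "nnn"       */
        char         flag_path[1024];   /* flag file "nnn.xid.new"        */
        unsigned int xid;               /* xid parsed from the flag name  */
    };

    extern int list_flag_files(struct flag_file *out, int max);  /* assumed */
    extern int xid_is_committed(unsigned int xid);               /* assumed */

    static void
    cleanup_flag_files(void)
    {
        struct flag_file files[1024];
        int nfiles = list_flag_files(files, 1024);
        int i;

        /* Pass 1: committed xid - keep the data file, drop the flag. */
        for (i = 0; i < nfiles; i++)
        {
            if (xid_is_committed(files[i].xid))
                remove(files[i].flag_path);
        }

        /* Pass 2: still-flagged data files never committed - drop both. */
        for (i = 0; i < nfiles; i++)
        {
            if (!xid_is_committed(files[i].xid))
            {
                remove(files[i].data_path);
                remove(files[i].flag_path);
            }
        }
    }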

> However, this has nonzero overhead compared to the current behavior.
> I'm still dubious that we have a problem that needs solving ...
I agree that file leakage is not a critical problem - if it were, there'd
be many more complaints...

But it's still something that a postgres DBA has to be aware of, because
it might bite you quite badly. Since IMHO admin friendliness is one of
the strengths of postgresql, removing the possibility of leakage would
be nice in the long term.

Nothing that needs any rushing, though - and nothing that we'd want to pay
for in terms of performance.




Re: [PATCH] Lazy xid assingment V2

From
Tom Lane
Date:
"Florian G. Pflug" <fgp@phlo.org> writes:
> Tom Lane wrote:
>> rename(2) is specified to be atomic by POSIX, but relinking a file into
>> a different directory can hardly be --- it's not even provided as a
>> single kernel call, is it?

> I'd have thought that they only guarantee that if the new name already
> exists it's atomically replaced. But I might be wrong....

I reread the spec and realized that rename() does include moving a link
into a different directory --- but it only promises that replacement of
the target filename is atomic, not that (say) the link couldn't appear
in both directories concurrently.  Also it's not clear that the spec
intends to make any hard guarantees about the filesystem state after
crash-and-recovery.

In any case I don't think we can make renaming of active data files work
--- bufmgr and bgwriter need those file names to be stable.  The
flag-file approach seems more promising.

There's also the plan B of scanning pg_class to decide which relfilenode
values are legit.  IIRC Bruce did up a patch for this about a year ago,
which I vetoed because I was afraid of the consequences if it removed
data that someone really needed.  Someone just mentioned doing the same
thing but pushing the unreferenced files into a "trash" directory
instead of actually deleting them.  While that answers the
risk-of-data-loss objection, I'm not sure it does much for the goal of
avoiding useless space consumption: how many DBAs will faithfully
examine and clean out that trash directory?
        regards, tom lane


Re: [PATCH] Lazy xid assingment V2

From
"Heikki Linnakangas"
Date:
Tom Lane wrote:
> There's also the plan B of scanning pg_class to decide which relfilenode
> values are legit.  IIRC Bruce did up a patch for this about a year ago,
> which I vetoed because I was afraid of the consequences if it removed
> data that someone really needed.  

I posted a patch like that, 2-3 years ago I think. IIRC, the consensus
back then was to just write a log message about the stale files, so an
admin can go and delete them manually.  That's safer than just deleting
them, and we'll get an idea of how much of a problem this is in
practice; at the moment a DBA has no way to know if there's any leaked
space, short of doing a manual compare of pg_class against the filesystem.
If it turns out to be reliable enough, and the problem big enough, we
might start deleting the files automatically in future releases.

I never got around to fixing the issues with the patch, but it's been
tickling me a bit for all these years.

> Someone just mentioned doing the same
> thing but pushing the unreferenced files into a "trash" directory
> instead of actually deleting them.  While that answers the
> risk-of-data-loss objection, I'm not sure it does much for the goal of
> avoiding useless space consumption: how many DBAs will faithfully
> examine and clean out that trash directory?

That sounds like a good idea to me. If a DBA finds himself running out
of disk space unexpectedly, he'll start looking around. Doing a "rm
trash/*" surely seems easier and safer than deleting individual files
from base.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: [PATCH] Lazy xid assingment V2

From
Tom Lane
Date:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> Someone just mentioned doing the same
>> thing but pushing the unreferenced files into a "trash" directory
>> instead of actually deleting them.

> That sounds like a good idea to me.

It suddenly strikes me that there's lots of precedent for it: fsck moves
unreferenced files into lost+found, instead of just deleting them.
        regards, tom lane


Re: [PATCH] Lazy xid assingment V2

From
August Zajonc
Date:
Tom Lane wrote:
> There's also the plan B of scanning pg_class to decide which relfilenode
> values are legit.  IIRC Bruce did up a patch for this about a year ago,
> which I vetoed because I was afraid of the consequences if it removed
> data that someone really needed.  Someone just mentioned doing the same
> thing but pushing the unreferenced files into a "trash" directory
> instead of actually deleting them.  While that answers the
> risk-of-data-loss objection, I'm not sure it does much for the goal of
> avoiding useless space consumption: how many DBAs will faithfully
> examine and clean out that trash directory?
>
>   
For the admin who for some reason deletes critical input data before
seeing a COMMIT return from postgresql, we can probably keep the files
around.

The thing is, the leak occurs in situations where a COMMIT hasn't
returned to the user, so we are trying to guarantee no data loss even
when the user doesn't see a successful commit? That's a tall order,
obviously, and hopefully people design their apps to attend to
transaction success / failure.

Plan B certainly won't take more space, and is probably the easiest to
clean up.


Re: [PATCH] Lazy xid assingment V2

From
"Heikki Linnakangas"
Date:
August Zajonc wrote:
> The thing is, the leak occurs in situations where a COMMIT hasn't
> returned to the user, so we are trying to guarantee no data loss even
> when the user doesn't see a successful commit? That's a tall order,
> obviously, and hopefully people design their apps to attend to
> transaction success / failure.

No, by the time we'd delete the files or move them to trash, we know that
they're no longer needed. But because deleting a relation is such a
drastic thing to do, with potential for massive data loss, we're being
extra paranoid. What if there's a bug in PostgreSQL that makes us delete
the wrong file? Or transaction wraparound happens, despite the
protections that we now have in place? Or you have bad RAM in the
server, and a bit flips at an unfortunate place? Those are the kinds of
situations we're worried about.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com