Thread: Exposing the Xact commit order to the user

Exposing the Xact commit order to the user

From
Jan Wieck
Date:
In some systems (data warehousing, replication), the order of commits is
important, since that is the order in which changes have become visible.
This information could theoretically be extracted from the WAL, but
scanning the entire WAL just to extract this tidbit of information would
be excruciatingly painful.

The following is based on ideas that emerged during last weeks PGCon.
Consider it an implementation proposal, if you like.

We introduce a new set of files. The files represent segments of an
infinite array of structures. The present segments are the available
"window" of data. Similar to CLOG files, the individual file name will
represent the high bits of a "serial" number, the offset of the record
inside the file represents the low bits of the "serial".

The system will have postgresql.conf options for enabling/disabling the
whole shebang, how many shared buffers to allocate for managing access
to the data and to define the retention period of the data based on data
volume and/or age of the commit records.

Each record of the Transaction Commit Info consists of
     txid          xci_transaction_id     timestamptz   xci_begin_timestamp     timestamptz   xci_commit_timestamp
int64        xci_total_rowcount
 

32 bytes total.

CommitTransaction() inside of xact.c will call a function, that inserts
a new record into this array. The operation will for most of the time be
nothing than taking a spinlock and adding the record to shared memory.
All the data for the record is readily available, does not require
further locking and can be collected locally before taking the spinlock.
The begin_timestamp is the transactions idea of CURRENT_TIMESTAMP, the
commit_timestamp is what CommitTransaction() just decided to write into
the WAL commit record and the total_rowcount is the sum of inserted,
updated and deleted heap tuples during the transaction, which should be
easily available from the statistics collector, unless row stats are
disabled, in which case the datum would be zero.

The function will return the "sequence" number which CommitTransaction()
in turn will record in the WAL commit record together with the
begin_timestamp. While both, the begin as well as the commit timestamp
are crucial to determine what data a particular transaction should have
seen, the row count is not and will not be recorded in WAL.

Checkpoint handling will call a function to flush the shared buffers.
Together with this, the information from WAL records will be sufficient
to recover this data (except for row counts) during crash recovery.

Exposing the data will be done via a set returning function. The SRF
takes two arguments. The maximum number of rows to return and the last
serial number processed by the reader. The advantage of such SRF is that
the result can be used in a query that right away delivers audit or
replication log information in transaction commit order. The SRF can
return an empty set if no further transactions have committed since, or
an error if data segments needed to answer the request have already been
purged.

Purging of the data will be possible in several different ways.
Autovacuum will call a function that drops segments of the data that are  outside the postgresql.conf configuration
withrespect to maximum age
 
or data volume. There will also be a function reserved for superusers to
explicitly purge the data up to a certain serial number.


Comments, suggestions?


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Exposing the Xact commit order to the user

From
"Greg Sabino Mullane"
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160


> Exposing the data will be done via a set returning function. The SRF
> takes two arguments. The maximum number of rows to return and the last
> serial number processed by the reader. The advantage of such SRF is that
> the result can be used in a query that right away delivers audit or
> replication log information in transaction commit order. The SRF can
> return an empty set if no further transactions have committed since, or
> an error if data segments needed to answer the request have already been
> purged.

In light of the proposed purging scheme, how would it be able to distinguish 
between those two cases (nothing there yet vs. was there but purged)?

- -- 
Greg Sabino Mullane greg@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201005231646
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-----BEGIN PGP SIGNATURE-----

iEYEAREDAAYFAkv5lIAACgkQvJuQZxSWSsiR3gCgvyK/NPd6WmKGUqdo/3fdWIR7
LAQAoJqk3gYpEgtjw10gINDKFXTAnWO5
=sSvK
-----END PGP SIGNATURE-----




Re: Exposing the Xact commit order to the user

From
Robert Haas
Date:
On Sun, May 23, 2010 at 4:21 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
> The system will have postgresql.conf options for enabling/disabling the
> whole shebang, how many shared buffers to allocate for managing access
> to the data and to define the retention period of the data based on data
> volume and/or age of the commit records.

It would be nice if this could just be managed out of shared_buffers
rather than needing to configure a separate pool just for this
feature.  But, I'm not sure how much work that is, and if it turns out
to be too ugly then I'd say it's not a hard requirement.  In general,
I think we talked during the meeting about the desirability of folding
specific pools into shared_buffers rather than managing them
separately, but I'm not aware that we have any cases where we do that
today so it might be hard (or not).

> Each record of the Transaction Commit Info consists of
>
>     txid          xci_transaction_id
>     timestamptz   xci_begin_timestamp
>     timestamptz   xci_commit_timestamp
>     int64         xci_total_rowcount
>
> 32 bytes total.

Are we sure it's worth including the row count?  I wonder if we ought
to leave that out and let individual clients of the mechanism track
that if they're so inclined, especially since it won't be reliable
anyway.

> CommitTransaction() inside of xact.c will call a function, that inserts
> a new record into this array. The operation will for most of the time be
> nothing than taking a spinlock and adding the record to shared memory.
> All the data for the record is readily available, does not require
> further locking and can be collected locally before taking the spinlock.

What happens when you need to switch pages?

> The function will return the "sequence" number which CommitTransaction()
> in turn will record in the WAL commit record together with the
> begin_timestamp. While both, the begin as well as the commit timestamp
> are crucial to determine what data a particular transaction should have
> seen, the row count is not and will not be recorded in WAL.

It would certainly be better if we didn't to bloat the commit xlog
records to do this.  Is there any way to avoid that?

> Checkpoint handling will call a function to flush the shared buffers.
> Together with this, the information from WAL records will be sufficient
> to recover this data (except for row counts) during crash recovery.

Right.

> Exposing the data will be done via a set returning function. The SRF
> takes two arguments. The maximum number of rows to return and the last
> serial number processed by the reader. The advantage of such SRF is that
> the result can be used in a query that right away delivers audit or
> replication log information in transaction commit order. The SRF can
> return an empty set if no further transactions have committed since, or
> an error if data segments needed to answer the request have already been
> purged.
>
> Purging of the data will be possible in several different ways.
> Autovacuum will call a function that drops segments of the data that are
>  outside the postgresql.conf configuration with respect to maximum age
> or data volume. There will also be a function reserved for superusers to
> explicitly purge the data up to a certain serial number.

Dunno if autovacuuming this is the right way to go.  Seems like that
could leave to replication breaks, and it's also more work than not
doing that.  I'd just say that if you turn this on you're responsible
for pruning it, full stop.

> Anyone who trades liberty for security deserves neither
> liberty nor security. -- Benjamin Franklin

+1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: Exposing the Xact commit order to the user

From
Jan Wieck
Date:
On 5/23/2010 4:48 PM, Greg Sabino Mullane wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: RIPEMD160
> 
> 
>> Exposing the data will be done via a set returning function. The SRF
>> takes two arguments. The maximum number of rows to return and the last
>> serial number processed by the reader. The advantage of such SRF is that
>> the result can be used in a query that right away delivers audit or
>> replication log information in transaction commit order. The SRF can
>> return an empty set if no further transactions have committed since, or
>> an error if data segments needed to answer the request have already been
>> purged.
> 
> In light of the proposed purging scheme, how would it be able to distinguish 
> between those two cases (nothing there yet vs. was there but purged)?

There is a difference between an empty result set and an exception.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Exposing the Xact commit order to the user

From
Jan Wieck
Date:
On 5/23/2010 8:38 PM, Robert Haas wrote:
> On Sun, May 23, 2010 at 4:21 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
>> The system will have postgresql.conf options for enabling/disabling the
>> whole shebang, how many shared buffers to allocate for managing access
>> to the data and to define the retention period of the data based on data
>> volume and/or age of the commit records.
> 
> It would be nice if this could just be managed out of shared_buffers
> rather than needing to configure a separate pool just for this
> feature.  But, I'm not sure how much work that is, and if it turns out
> to be too ugly then I'd say it's not a hard requirement.  In general,
> I think we talked during the meeting about the desirability of folding
> specific pools into shared_buffers rather than managing them
> separately, but I'm not aware that we have any cases where we do that
> today so it might be hard (or not).

I'm not sure the retention policies of the shared buffer cache, the WAL 
buffers, CLOG buffers and every other thing we try to cache are that 
easy to fold into one single set of logic. But I'm all ears.

> 
>> Each record of the Transaction Commit Info consists of
>>
>>     txid          xci_transaction_id
>>     timestamptz   xci_begin_timestamp
>>     timestamptz   xci_commit_timestamp
>>     int64         xci_total_rowcount
>>
>> 32 bytes total.
> 
> Are we sure it's worth including the row count?  I wonder if we ought
> to leave that out and let individual clients of the mechanism track
> that if they're so inclined, especially since it won't be reliable
> anyway.

Nope, we (my belly and I) are not sure about the absolute worth of the 
row count. It would be a convenient number to have there, but I can live 
without it.

> 
>> CommitTransaction() inside of xact.c will call a function, that inserts
>> a new record into this array. The operation will for most of the time be
>> nothing than taking a spinlock and adding the record to shared memory.
>> All the data for the record is readily available, does not require
>> further locking and can be collected locally before taking the spinlock.
> 
> What happens when you need to switch pages?

Then the code will have to grab another free buffer or evict one.

> 
>> The function will return the "sequence" number which CommitTransaction()
>> in turn will record in the WAL commit record together with the
>> begin_timestamp. While both, the begin as well as the commit timestamp
>> are crucial to determine what data a particular transaction should have
>> seen, the row count is not and will not be recorded in WAL.
> 
> It would certainly be better if we didn't to bloat the commit xlog
> records to do this.  Is there any way to avoid that?

If you can tell me how a crash recovering system can figure out what the 
exact "sequence" number of the WAL commit record at hand should be, 
let's rip it.

> 
>> Checkpoint handling will call a function to flush the shared buffers.
>> Together with this, the information from WAL records will be sufficient
>> to recover this data (except for row counts) during crash recovery.
> 
> Right.
> 
>> Exposing the data will be done via a set returning function. The SRF
>> takes two arguments. The maximum number of rows to return and the last
>> serial number processed by the reader. The advantage of such SRF is that
>> the result can be used in a query that right away delivers audit or
>> replication log information in transaction commit order. The SRF can
>> return an empty set if no further transactions have committed since, or
>> an error if data segments needed to answer the request have already been
>> purged.
>>
>> Purging of the data will be possible in several different ways.
>> Autovacuum will call a function that drops segments of the data that are
>>  outside the postgresql.conf configuration with respect to maximum age
>> or data volume. There will also be a function reserved for superusers to
>> explicitly purge the data up to a certain serial number.
> 
> Dunno if autovacuuming this is the right way to go.  Seems like that
> could leave to replication breaks, and it's also more work than not
> doing that.  I'd just say that if you turn this on you're responsible
> for pruning it, full stop.

It is an option. "Keep it until I tell you" is a perfectly valid 
configuration option. One you probably don't want to forget about, but 
valid none the less.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Exposing the Xact commit order to the user

From
"Greg Sabino Mullane"
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160


> In light of the proposed purging scheme, how would it be able to distinguish 
> between those two cases (nothing there yet vs. was there but purged)?

> There is a difference between an empty result set and an exception.

No, I meant how will the *function* know, if a superuser and/or some 
background process can purge records at any time?

- -- 
Greg Sabino Mullane greg@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201005240928
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-----BEGIN PGP SIGNATURE-----

iEYEAREDAAYFAkv6f0UACgkQvJuQZxSWSsh0xwCgmXLtKngoBBYX0TxDM2TlJRId
AVIAoMHYa3c9Ej2vUJyFufxBR5vDPzQ+
=e1mh
-----END PGP SIGNATURE-----




Re: Exposing the Xact commit order to the user

From
Robert Haas
Date:
On Sun, May 23, 2010 at 9:44 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
> I'm not sure the retention policies of the shared buffer cache, the WAL
> buffers, CLOG buffers and every other thing we try to cache are that easy to
> fold into one single set of logic. But I'm all ears.

I'm not sure either, although it seems like LRU ought to be good
enough for most things.  I'm more worried about things like whether
the BufferDesc abstraction is going to get in the way.

>>> CommitTransaction() inside of xact.c will call a function, that inserts
>>> a new record into this array. The operation will for most of the time be
>>> nothing than taking a spinlock and adding the record to shared memory.
>>> All the data for the record is readily available, does not require
>>> further locking and can be collected locally before taking the spinlock.
>>
>> What happens when you need to switch pages?
>
> Then the code will have to grab another free buffer or evict one.

Hopefully not while holding a spin lock.  :-)

>>> The function will return the "sequence" number which CommitTransaction()
>>> in turn will record in the WAL commit record together with the
>>> begin_timestamp. While both, the begin as well as the commit timestamp
>>> are crucial to determine what data a particular transaction should have
>>> seen, the row count is not and will not be recorded in WAL.
>>
>> It would certainly be better if we didn't to bloat the commit xlog
>> records to do this.  Is there any way to avoid that?
>
> If you can tell me how a crash recovering system can figure out what the
> exact "sequence" number of the WAL commit record at hand should be, let's
> rip it.

Hmm...  could we get away with WAL-logging the next sequence number
just once per checkpoint?  When you replay the checkpoint record, you
update the control file with the sequence number.  Then all the
commits up through the next checkpoint just use consecutive numbers
starting at that value.

> It is an option. "Keep it until I tell you" is a perfectly valid
> configuration option. One you probably don't want to forget about, but valid
> none the less.

As Tom is fond of saying, if it breaks, you get to keep both pieces.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: Exposing the Xact commit order to the user

From
"Kevin Grittner"
Date:
Jan Wieck wrote:
> In some systems (data warehousing, replication), the order of
> commits is important, since that is the order in which changes
> have become visible.
This issue intersects with the serializable work I've been doing.
While in database transactions using S2PL the above is true, in
snapshot isolation and the SSI implementation of serializable
transactions, it's not. In particular, the snapshot anomalies which
can cause non-serializable behavior happen precisely because the
apparent order of execution doesn't match anything so linear as
order of commit.
I'll raise that receipting example again. You have transactions
which grab the current deposit data and insert it into receipts, as
payments are received. At some point in the afternoon, the deposit
date in a control table is changed to the next day, so that the
receipts up to that point can be deposited during banking hours with
the current date as their deposit date. A report is printed (and
likely a transfer transaction recorded to move "cash in drawer" to
"cash in checking", but I'll ignore that aspect for this example).
Some receipts may not be committed when the update to the date in
the control table is committed.
This is "eventually consistent" -- once all the receipts with the
old date commit or roll back the database is OK, but until then you
might be able to select the new date in the control table and the
set of receipts matching the old date without the database telling
you that you're missing data. The new serializable implementation
fixes this, but there are open R&D items (due to the need to discuss
the issues) on the related Wiki page related to hot standby and
other replication. Will we be able to support transactional
integrity on slave machines?
What if the update to the control table and the insert of receipts
all happen on the master, but someone decides to move the (now
happily working correctly with serializable transactions) reporting
to a slave machine? (And by the way, don't get too hung up on this
particular example, I could generate dozens more on demand -- the
point is that order of commit doesn't always correspond to apparent
order of execution; in this case the receipts *appear* to have
executed first, because they are using a value "later" updated to
something else by a different transaction, even though that other
transaction *committed* first.)
Replicating or recreating the whole predicate locking and conflict
detection on slaves is not feasible for performance reasons. (I
won't elaborate unless someone feels that's not intuitively
obvious.) The only sane way I can see to have a slave database allow
serializable behavior is to WAL-log the acquisition of a snapshot by
a serializable transaction, and the rollback or commit, on the
master, and to have the serializable snapshot build on a slave
exclude any serializable transactions for which there are still
concurrent serializable transactions. Yes, that does mean WAL-
logging the snapshot acquisition even if the transaction doesn't yet
have an xid, and WAL-logging the commit or rollback even if it never
acquires an xid.
I think this solve the issue Jan raises as long as serializable
transactions are used; if they aren't there are no guarantees of
transactional integrity no matter how you track commit sequence,
unless it can be based on S2PL-type blocking locks.  I'll have to
leave that to someone else to sort out.
-Kevin



Re: Exposing the Xact commit order to the user

From
Robert Haas
Date:
On Mon, May 24, 2010 at 11:24 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Jan Wieck wrote:
>
>> In some systems (data warehousing, replication), the order of
>> commits is important, since that is the order in which changes
>> have become visible.
>
> This issue intersects with the serializable work I've been doing.
> While in database transactions using S2PL the above is true, in
> snapshot isolation and the SSI implementation of serializable
> transactions, it's not.

I think you're confusing two subtly different things.  The way to
prove that a set of transactions running under some implementation of
serializability is actually serializable is to construct a serial
order of execution consistent with the view of the database that each
transaction saw.  This may or may not match the commit order, as you
say.  But the commit order is still the order the effects of those
transactions have become visible - if we inserted a new read-only
transaction into the stream at some arbitrary point in time, it would
see all the transactions which committed before it and none of those
that committed afterward.  So I think Jan's statement is correct.

Having said that, I think your concerns about how things will look
from a slave's point of view are possibly valid.  A transaction
running on a slave is essentially a read-only transaction that the
master doesn't know about.  It's not clear to me whether adding such a
transaction to the timeline could result in either (a) that
transaction being rolled back or (b) some impact on which other
transactions got rolled back.  If it did, that would obviously be a
problem for serializability on slaves, though your proposed fix sounds
like it would be prohibitively expensive for many users.  But can this
actually happen?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: Exposing the Xact commit order to the user

From
"Kevin Grittner"
Date:
Robert Haas  wrote:
> I think you're confusing two subtly different things.
The only thing I'm confused about is what benefit anyone expects to
get from looking at data between commits in some way other than our
current snapshot mechanism.  Can someone explain a use case where
what Jan is proposing is better than snapshot isolation?  It doesn't
provide any additional integrity guarantees that I can see.
> But the commit order is still the order the effects of those
> transactions have become visible - if we inserted a new read-only
> transaction into the stream at some arbitrary point in time, it
> would see all the transactions which committed before it and none
> of those that committed afterward.
Isn't that what a snapshot does already?
> your proposed fix sounds like it would be prohibitively expensive
> for many users. But can this actually happen?
How so?  The transaction start/end logging, or looking at that data
when building a snapshot?
-Kevin


Re: Exposing the Xact commit order to the user

From
Heikki Linnakangas
Date:
On 24/05/10 19:51, Kevin Grittner wrote:
> The only thing I'm confused about is what benefit anyone expects to
> get from looking at data between commits in some way other than our
> current snapshot mechanism.  Can someone explain a use case where
> what Jan is proposing is better than snapshot isolation?  It doesn't
> provide any additional integrity guarantees that I can see.

Right, it doesn't. What it provides is a way to reconstruct a snapshot 
at any point in time, after the fact. For example, after transactions A, 
C, D and B have committed in that order, it allows you to reconstruct a 
snapshot just like you would've gotten immediately after the commit of 
A, C, D and B respectively. That's useful replication tools like Slony 
that needs to commit the changes of those transactions in the slave in 
the same order as they were committed in the master.

I don't know enough of Slony et al. to understand why that'd be better 
than the current heartbeat mechanism they use, taking a snapshot every 
few seconds, batching commits.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Exposing the Xact commit order to the user

From
Robert Haas
Date:
On Mon, May 24, 2010 at 12:51 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Robert Haas  wrote:
>> I think you're confusing two subtly different things.
>
> The only thing I'm confused about is what benefit anyone expects to
> get from looking at data between commits in some way other than our
> current snapshot mechanism.  Can someone explain a use case where
> what Jan is proposing is better than snapshot isolation?  It doesn't
> provide any additional integrity guarantees that I can see.

It's a tool for replication solutions to use.

>> But the commit order is still the order the effects of those
>> transactions have become visible - if we inserted a new read-only
>> transaction into the stream at some arbitrary point in time, it
>> would see all the transactions which committed before it and none
>> of those that committed afterward.
>
> Isn't that what a snapshot does already?

Yes, for a particular transaction.  But this is to allow transactions
to be replayed (in order) on another node.

>> your proposed fix sounds like it would be prohibitively expensive
>> for many users. But can this actually happen?
>
> How so?  The transaction start/end logging, or looking at that data
> when building a snapshot?

I guess what I'm asking is - if the reconstructed transaction order
inferred by SSI doesn't match the actual commit order, can we get a
serialization anomaly on the standby by replaying transactions there
in commit order?  Can you give an example and explain how your
proposal would solve it?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: Exposing the Xact commit order to the user

From
Alvaro Herrera
Date:
Excerpts from Robert Haas's message of dom may 23 20:38:14 -0400 2010:
> On Sun, May 23, 2010 at 4:21 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
> > The system will have postgresql.conf options for enabling/disabling the
> > whole shebang, how many shared buffers to allocate for managing access
> > to the data and to define the retention period of the data based on data
> > volume and/or age of the commit records.
> 
> It would be nice if this could just be managed out of shared_buffers
> rather than needing to configure a separate pool just for this
> feature.

FWIW we've talked about this for years -- see old discussions about how
pg_subtrans becomes a bottleneck in certain cases and you want to
enlarge the number of buffers allocated to it (probably easy to find by
searching posts from Jignesh).  I'm guessing the new notify code would
benefit from this as well.

It'd be nice to have as a side effect, but if not, IMHO this proposal
could simply use a fixed buffer pool like all other slru.c callers until
someone gets around to fixing that.  Adding more GUC switches for this
strikes me as overkill.

-- 
Álvaro Herrera <alvherre@alvh.no-ip.org>


Re: Exposing the Xact commit order to the user

From
Robert Haas
Date:
On Mon, May 24, 2010 at 3:07 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> It'd be nice to have as a side effect, but if not, IMHO this proposal
> could simply use a fixed buffer pool like all other slru.c callers until
> someone gets around to fixing that.  Adding more GUC switches for this
> strikes me as overkill.

I agree.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: Exposing the Xact commit order to the user

From
Dan Ports
Date:
On Sun, May 23, 2010 at 04:21:58PM -0400, Jan Wieck wrote:
> In some systems (data warehousing, replication), the order of commits is
> important, since that is the order in which changes have become visible.
> This information could theoretically be extracted from the WAL, but
> scanning the entire WAL just to extract this tidbit of information would
> be excruciatingly painful.

This is very interesting to me as I've been doing some (research --
nowhere near production-level) work on building a transactional
application-level (i.e. memcached-like) cache atop Postgres. One of the
features I needed to support it was basically what you describe.

Without getting too far into the details of what I'm doing, I needed to
make it clear to a higher layer which commits were visible to a given
query. That is, I wanted to know both the order of commits and where
particular snapshots fit into this ordering. (A SnapshotData struct
obviously contains the visibility information, but a representation in
terms of the commit ordering is both more succinct and allows for easy
ordering comparisons).

Something you might want to consider, then, is adding an interface to
find out the timestamp of the current transaction's snapshot, i.e. the
timestamp of the most recent committed transaction visible to it. I
wouldn't expect this to be difficult to implement as transaction
completion/visibility is already synchronized via ProcArrayLock.

> Each record of the Transaction Commit Info consists of
> 
>       txid          xci_transaction_id
>       timestamptz   xci_begin_timestamp
>       timestamptz   xci_commit_timestamp
>       int64         xci_total_rowcount

Another piece of information that seems useful to provide here would be
the logical timestamp of the transaction, i.e. a counter that's
incremented by one for each transaction. But maybe that's implicit in
the log ordering?

I'm not clear on why the total rowcount is useful, but perhaps I'm
missing something obvious.


I've actually implemented some semblance of this on Postgres 8.2, but it
sounds like what you're interested in is more sophisticated. In
particular, I wasn't at all concerned with durability or WAL stuff, and
I had some specific requirements about when it was OK to purge the
data. Because of this (and very limited development time), I just
threw something together with a simple shared buffer.

I don't think I have any useful code to offer, but let me know if
there's some way I can help out.

Dan

-- 
Dan R. K. Ports              MIT CSAIL                http://drkp.net/


Re: Exposing the Xact commit order to the user

From
"Kevin Grittner"
Date:
Robert Haas  wrote:
> It's a tool for replication solutions to use.
I was thrown by the original post referencing "data warehousing".
For replication I definitely see that it would be good to provide
some facility to grab a coherent snapshot out of the transaction
stream, but I'm still not clear on a use case where other solutions
aren't better.  If you want a *particular* past snapshot, something
akin to the transactional caching that Dan Ports mentioned seems
best.  If you just want a coherent snapshot like snapshot isolation,
the current mechanisms seem to work (unless I'm missing something?).
If you want solid data integrity querying the most recent replicated
data, the proposal I posted earlier in the thread is the best I can
see, so far.
> if the reconstructed transaction order inferred by SSI doesn't
> match the actual commit order, can we get a serialization anomaly
> on the standby by replaying transactions there in commit order?
Yes.  If we don't do *something* to address it, the replicas
(slaves) will operate as read-only snapshot isolation, not true
serializable.
> Can you give an example and explain how your proposal would solve
> it?
I gave an example (without rigorous proof accompanying it, granted)
earlier in the thread.  In that example, if you allow a selection
against a snapshot which includes the earlier commit (the update of
the control table) and before the later commits (the receipts which
used the old deposit date) you have exactly the kind of
serialization anomaly which the work in progress prevents on the
source (master) database -- the receipts *appear* to run in earlier
transactions because the see the pre-update deposit date, but they
show up out of order.  As far as I'm concerned this is only a
problem if the user *requested* serializable behavior for all
transactions involved.
If we send the information I suggested in the WAL stream, then any
slave using the WAL stream could build a snapshot for a serializable
transaction which excluded serializable transactions from the source
which overlap with still-pending serializable transactions on the
source.  In this example, the update of the control table would not
be visible to a serializable transaction on the slave until any
overlapping serializable transactions (which would include any
receipts using the old date) had also committed, so you could never
see the writes out of order.
I don't think that passing detailed predicate locking information
would be feasible from a performance perspective, but since the
slaves are read-only, I think it is fine to pass just the minimal
transaction-level information I described.
-Kevin




Re: Exposing the Xact commit order to the user

From
Robert Haas
Date:
On Mon, May 24, 2010 at 4:03 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Robert Haas  wrote:
>
>> It's a tool for replication solutions to use.
>
> I was thrown by the original post referencing "data warehousing".
> For replication I definitely see that it would be good to provide
> some facility to grab a coherent snapshot out of the transaction
> stream, but I'm still not clear on a use case where other solutions
> aren't better.  If you want a *particular* past snapshot, something
> akin to the transactional caching that Dan Ports mentioned seems
> best.  If you just want a coherent snapshot like snapshot isolation,
> the current mechanisms seem to work (unless I'm missing something?).
> If you want solid data integrity querying the most recent replicated
> data, the proposal I posted earlier in the thread is the best I can
> see, so far.

Well, AIUI, what you're really trying to do is derive the delta
between an old snapshot and a newer snapshot.

>> Can you give an example and explain how your proposal would solve
>> it?
>
> I gave an example (without rigorous proof accompanying it, granted)
> earlier in the thread.  In that example, if you allow a selection
> against a snapshot which includes the earlier commit (the update of
> the control table) and before the later commits (the receipts which
> used the old deposit date) you have exactly the kind of
> serialization anomaly which the work in progress prevents on the
> source (master) database -- the receipts *appear* to run in earlier
> transactions because the see the pre-update deposit date, but they
> show up out of order.

Yep, I see it now.

>  As far as I'm concerned this is only a
> problem if the user *requested* serializable behavior for all
> transactions involved.

Agreed.

> If we send the information I suggested in the WAL stream, then any
> slave using the WAL stream could build a snapshot for a serializable
> transaction which excluded serializable transactions from the source
> which overlap with still-pending serializable transactions on the
> source.  In this example, the update of the control table would not
> be visible to a serializable transaction on the slave until any
> overlapping serializable transactions (which would include any
> receipts using the old date) had also committed, so you could never
> see the writes out of order.
>
> I don't think that passing detailed predicate locking information
> would be feasible from a performance perspective, but since the
> slaves are read-only, I think it is fine to pass just the minimal
> transaction-level information I described.

I suspect that's still going to be sort of hard on performance, but
maybe we should get serializable working and committed on one node
first and then worry about how to distribute it.  I think there might
be other approaches to this problem (global transaction coordinator?
standby requests snapshot from primary?).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: Exposing the Xact commit order to the user

From
Jan Wieck
Date:
On 5/24/2010 12:51 PM, Kevin Grittner wrote:
> Robert Haas  wrote:
>  
>> I think you're confusing two subtly different things.
>  
> The only thing I'm confused about is what benefit anyone expects to
> get from looking at data between commits in some way other than our
> current snapshot mechanism.  Can someone explain a use case where
> what Jan is proposing is better than snapshot isolation?  It doesn't
> provide any additional integrity guarantees that I can see.
>  
>> But the commit order is still the order the effects of those
>> transactions have become visible - if we inserted a new read-only
>> transaction into the stream at some arbitrary point in time, it
>> would see all the transactions which committed before it and none
>> of those that committed afterward.
>  
> Isn't that what a snapshot does already?

It does and the proposed is a mere alternative serving the same purpose.

Have you ever looked at one of those queries, that Londiste or Slony 
issue against the provider DB in order to get all the log data that has 
been committed between two snapshots? Is that really the best you can 
think of?


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Exposing the Xact commit order to the user

From
Jan Wieck
Date:
On 5/24/2010 3:10 PM, Dan Ports wrote:
> I'm not clear on why the total rowcount is useful, but perhaps I'm
> missing something obvious.

It is a glimpse into the future. Several years of pain doing replication 
work has taught me that knowing approximately who much work the next 
chunk will be "before you select it all" is a really useful thing.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Exposing the Xact commit order to the user

From
Dan Ports
Date:
On Mon, May 24, 2010 at 10:24:07AM -0500, Kevin Grittner wrote:
> Jan Wieck wrote:
>  
> > In some systems (data warehousing, replication), the order of
> > commits is important, since that is the order in which changes
> > have become visible.
>  
> This issue intersects with the serializable work I've been doing.
> While in database transactions using S2PL the above is true, in
> snapshot isolation and the SSI implementation of serializable
> transactions, it's not. In particular, the snapshot anomalies which
> can cause non-serializable behavior happen precisely because the
> apparent order of execution doesn't match anything so linear as
> order of commit.

All true, but this doesn't pose a problem in snapshot isolation. Maybe
this is obvious to everyone else, but just to be clear: a transaction's
snapshot is determined entirely by which transactions committed before
it snapshotted (and hence are visible to it). Thus, replaying update
transactions in the sae order on a slave makes the same sequence of
states visible to it.

Of course (as in your example) some of these states could expose
snapshot isolation anomalies. But that's true on a single-replica
system too.


Now, stepping into the SSI world...

> Replicating or recreating the whole predicate locking and conflict
> detection on slaves is not feasible for performance reasons. (I
> won't elaborate unless someone feels that's not intuitively
> obvious.) The only sane way I can see to have a slave database allow
> serializable behavior is to WAL-log the acquisition of a snapshot by
> a serializable transaction, and the rollback or commit, on the
> master, and to have the serializable snapshot build on a slave
> exclude any serializable transactions for which there are still
> concurrent serializable transactions. Yes, that does mean WAL-
> logging the snapshot acquisition even if the transaction doesn't yet
> have an xid, and WAL-logging the commit or rollback even if it never
> acquires an xid.

One important observation is that any anomaly that occurs on the slave
can be resolved by aborting a local read-only transaction. This is a
good thing, because the alternatives are too horrible to consider.

You could possibly cut the costs of predicate locking by having the
master ship with each transaction the list of predicate locks it
acquired. But you'd still have to track locks for read-only
transactions, so maybe that's not a significant cost improvement. On
the other hand, if you're willing to pay the price of serializability
on the master, why not the slaves too?

Dan

-- 
Dan R. K. Ports              MIT CSAIL                http://drkp.net/


Re: Exposing the Xact commit order to the user

From
Florian Pflug
Date:
On May 25, 2010, at 0:42 , Dan Ports wrote:
> On Mon, May 24, 2010 at 10:24:07AM -0500, Kevin Grittner wrote:
>> Jan Wieck wrote:
>>
>>> In some systems (data warehousing, replication), the order of
>>> commits is important, since that is the order in which changes
>>> have become visible.
>>
>> This issue intersects with the serializable work I've been doing.
>> While in database transactions using S2PL the above is true, in
>> snapshot isolation and the SSI implementation of serializable
>> transactions, it's not. In particular, the snapshot anomalies which
>> can cause non-serializable behavior happen precisely because the
>> apparent order of execution doesn't match anything so linear as
>> order of commit.
>
> All true, but this doesn't pose a problem in snapshot isolation. Maybe
> this is obvious to everyone else, but just to be clear: a transaction's
> snapshot is determined entirely by which transactions committed before
> it snapshotted (and hence are visible to it). Thus, replaying update
> transactions in the sae order on a slave makes the same sequence of
> states visible to it.

The subtle point here is whether you consider the view from the "outside" (in the sense of what a read-only transaction
startedat an arbitrary time can or cannot observe), or from the "inside" (what updating transactions can observe and
mightbase their updates on). 

The former case is completely determined by the commit ordering of the transactions, while the latter is not -
otherwiseserializability wouldn't be such a hard problem. 

For some problems, like replication, the former ("outside") view is what matters - if slave synthesizes transactions
thatinsert/update/delete the very same tuples as the original transaction did, and commits them in the same order, no
read-onlytransaction can observe the difference. But that is *not* a serial schedule of the original transactions,
sincethe transactions are *not* the same - the merely touch the same tuples. In fact, if you try replaying the original
SQL,you *will* get different results on the slave, and not only because of now() and the like. 

best regards,
Florian Pflug





Re: Exposing the Xact commit order to the user

From
Tom Lane
Date:
Florian Pflug <fgp@phlo.org> writes:
> The subtle point here is whether you consider the view from the "outside" (in the sense of what a read-only
transactionstarted at an arbitrary time can or cannot observe), or from the "inside" (what updating transactions can
observeand might base their updates on).
 

> The former case is completely determined by the commit ordering of the transactions, while the latter is not -
otherwiseserializability wouldn't be such a hard problem.
 

BTW, doesn't all this logic fall in a heap as soon as you consider
read-committed transactions?
        regards, tom lane


Re: Exposing the Xact commit order to the user

From
Florian Pflug
Date:
On May 25, 2010, at 3:21 , Tom Lane wrote:
> Florian Pflug <fgp@phlo.org> writes:
>> The subtle point here is whether you consider the view from the "outside" (in the sense of what a read-only
transactionstarted at an arbitrary time can or cannot observe), or from the "inside" (what updating transactions can
observeand might base their updates on). 
>
>> The former case is completely determined by the commit ordering of the transactions, while the latter is not -
otherwiseserializability wouldn't be such a hard problem. 
>
> BTW, doesn't all this logic fall in a heap as soon as you consider
> read-committed transactions?


Why would it? There's still a well defined point in time at which the transaction's effects become visible, and every
othertransaction commits either before that time or after that time. An observer started between two transactions sees
thefirst's changes but not the second's. One replace observing read committed transactions by a series of smaller
repeatableread transactions, since the observers are read-only anyway. 

This of course says nothing about what state the updating transactions themselves see as the current state. For e.g.
replicationthat is adequate, since you'd not replay the original commands but rather the effects they had in terms of
physicaltuple updates. On replay, the effects of a transaction to therefor not depend on the state the transaction
sees.

best regards,
Florian Pflug



Re: Exposing the Xact commit order to the user

From
Nicolas Barbier
Date:
2010/5/25 Dan Ports <drkp@csail.mit.edu>:

> On Mon, May 24, 2010 at 10:24:07AM -0500, Kevin Grittner wrote:
>
>> Replicating or recreating the whole predicate locking and conflict
>> detection on slaves is not feasible for performance reasons. (I
>> won't elaborate unless someone feels that's not intuitively
>> obvious.) The only sane way I can see to have a slave database allow
>> serializable behavior is to WAL-log the acquisition of a snapshot by
>> a serializable transaction, and the rollback or commit, on the
>> master, and to have the serializable snapshot build on a slave
>> exclude any serializable transactions for which there are still
>> concurrent serializable transactions. Yes, that does mean WAL-
>> logging the snapshot acquisition even if the transaction doesn't yet
>> have an xid, and WAL-logging the commit or rollback even if it never
>> acquires an xid.
>
> One important observation is that any anomaly that occurs on the slave
> can be resolved by aborting a local read-only transaction. This is a
> good thing, because the alternatives are too horrible to consider.
>
> You could possibly cut the costs of predicate locking by having the
> master ship with each transaction the list of predicate locks it
> acquired. But you'd still have to track locks for read-only
> transactions, so maybe that's not a significant cost improvement. On
> the other hand, if you're willing to pay the price of serializability
> on the master, why not the slaves too?

I don't understand the problem. According to me, in the context of
SSI, a read-only slave can just map SERIALIZABLE to the technical
implementation of REPEATABLE READ (i.e., the currently-existing
"SERIALIZABLE"). The union of the transactions on the master and the
slave(s) will still exhibit SERIALIZABLE behavior because the
transactions on the slave cannot write anything and are therefore
irrelevant.

Is anything wrong with that reasoning?

Nicolas


Re: Exposing the Xact commit order to the user

From
Simon Riggs
Date:
On Sun, 2010-05-23 at 16:21 -0400, Jan Wieck wrote:

> In some systems (data warehousing, replication), the order of commits is
> important, since that is the order in which changes have become visible.
> This information could theoretically be extracted from the WAL, but
> scanning the entire WAL just to extract this tidbit of information would
> be excruciatingly painful.

I think it would be quite simple to read WAL. WALSender reads the WAL
file after its been flushed, so it would be simple for it to read a blob
of WAL and then extract the commit order from it.

Overall though, it would be easier and more efficient to *add* info to
WAL and then do all this processing *after* WAL has been transported
elsewhere. Extracting info with DDL triggers, normal triggers, commit
order and everything else seems like too much work to me. Every other
RDBMS has moved away from trigger-based replication and we should give
that serious consideration also.

-- Simon Riggs           www.2ndQuadrant.com



Re: Exposing the Xact commit order to the user

From
Dan Ports
Date:
On Tue, May 25, 2010 at 02:00:42PM +0200, Nicolas Barbier wrote:
> I don't understand the problem. According to me, in the context of
> SSI, a read-only slave can just map SERIALIZABLE to the technical
> implementation of REPEATABLE READ (i.e., the currently-existing
> "SERIALIZABLE"). The union of the transactions on the master and the
> slave(s) will still exhibit SERIALIZABLE behavior because the
> transactions on the slave cannot write anything and are therefore
> irrelevant.

This, unfortunately, isn't true in SSI.

Consider read-only transactions on a single node SSI database -- the
situation is the same for read-only transactions that run on a slave. 
These transactions can be part of anomalies, so they need to be checked
for conflicts and potentially aborted.

Consider Kevin's favorite example, where one table contains the current
date and the other is a list of receipts (initially empty).  T1 inserts (select current_date) into receipts, but
doesn'tcommit T2 increments current_date and commits T3 reads both current_date and the receipt table T1 commits 
 
T3, which is a read-only transaction, sees the incremented date and an
empty list of receipts. But T1 later commits a new entry in the
receipts table with the old date. No serializable ordering allows this.
However, if T3 hadn't performed its read, there'd be no problem; we'd
just serialize T1 before T2 and no one would be the wiser.

SSI would detect a potential conflict here, which we could resolve by
aborting T3. (We could also abort T1, but if this is a replicated
system this isn't always an option -- T3 might be running on the
slave, so only the slave will know about the conflict, and it can't
very well abort an update transaction on the master.)


There's another example of a read-only transaction anomaly that could
cause similar problems at
http://portal.acm.org/citation.cfm?doid=1031570.1031573, but I think
this one is easier to follow.

Dan

-- 
Dan R. K. Ports              MIT CSAIL                http://drkp.net/


Re: Exposing the Xact commit order to the user

From
Florian Pflug
Date:
On May 25, 2010, at 20:18 , Dan Ports wrote:
> On Tue, May 25, 2010 at 02:00:42PM +0200, Nicolas Barbier wrote:
>> I don't understand the problem. According to me, in the context of
>> SSI, a read-only slave can just map SERIALIZABLE to the technical
>> implementation of REPEATABLE READ (i.e., the currently-existing
>> "SERIALIZABLE"). The union of the transactions on the master and the
>> slave(s) will still exhibit SERIALIZABLE behavior because the
>> transactions on the slave cannot write anything and are therefore
>> irrelevant.
>
> This, unfortunately, isn't true in SSI.
>
> Consider read-only transactions on a single node SSI database -- the
> situation is the same for read-only transactions that run on a slave.
> These transactions can be part of anomalies, so they need to be checked
> for conflicts and potentially aborted.
>
> Consider Kevin's favorite example, where one table contains the current
> date and the other is a list of receipts (initially empty).
>  T1 inserts (select current_date) into receipts, but doesn't commit
>  T2 increments current_date and commits
>  T3 reads both current_date and the receipt table
>  T1 commits
>
> T3, which is a read-only transaction, sees the incremented date and an
> empty list of receipts. But T1 later commits a new entry in the
> receipts table with the old date. No serializable ordering allows this.
>
> However, if T3 hadn't performed its read, there'd be no problem; we'd
> just serialize T1 before T2 and no one would be the wiser.

Hm, so in fact SSI sometimes allows the database to be inconsistent, but only as long as nobody tries to observe it?

Btw, I still don't get how this follows from the Cahill paper. For a transaction to lie on a dangerous circle, it needs
incomingand outgoing edges in the conflict graph, right? But I'd have though that conflicts are always between a reader
anda writer or between two writers. So how can a read-only transaction have incoming and outgoing edges? 

best regards,
Florian Pflug




Re: Exposing the Xact commit order to the user

From
"Kevin Grittner"
Date:
Florian Pflug <fgp@phlo.org> wrote:
> Hm, so in fact SSI sometimes allows the database to be
> inconsistent, but only as long as nobody tries to observe it?
Not exactly.  The eventually-persisted state is always consistent,
but there can be a transitory committed state which would violate
user-defined constraints or business rules *if viewed*.  This is
what I've been on about -- the commit sequence is not necessarily
the same as the apparent order of execution.  A read-only
transaction, if run before the overlapping commits "settle", can
view a state which is not consistent with any serial order of
execution, and might therefore break the rules.  SSI detects that
and rolls one of the transactions back if they're all running at
serializable transaction isolation in a single SSI database, but the
question is how to handle this when the read happens in a replica.
> Btw, I still don't get how this follows from the Cahill paper. For
> a transaction to lie on a dangerous circle, it needs incoming and
> outgoing edges in the conflict graph, right?
At least one of the transactions participating in the cycle does. 
There's no requirement that they all do.
-Kevin


Re: Exposing the Xact commit order to the user

From
Dan Ports
Date:
On Tue, May 25, 2010 at 08:35:44PM +0200, Florian Pflug wrote:
> Hm, so in fact SSI sometimes allows the database to be inconsistent, but only as long as nobody tries to observe it?

Yes. Note that even while it's in an inconsistent state, you can still
perform any query that doesn't observe the inconsistency -- hopefully
most queries fall into this category.

> Btw, I still don't get how this follows from the Cahill paper. For a transaction to lie on a dangerous circle, it
needsincoming and outgoing edges in the conflict graph, right? But I'd have though that conflicts are always between a
readerand a writer or between two writers. So how can a read-only transaction have incoming and outgoing edges?
 

Right, the read-only transaction can't have incoming edges, but it can
have outgoing edges. So it can't be the "pivot" itself (the transaction
with both outgoing and incoming edges), but it can cause *another*
transaction to be.

In the example I gave, T3 (the r/o transaction) has an outgoing edge to
T1, because it didn't see T1's concurrent update. T1 already had an
outgoing edge to T2, so adding in this incoming edge from T3 creates
the dangerous structure.

Dan

-- 
Dan R. K. Ports              MIT CSAIL                http://drkp.net/


Re: Exposing the Xact commit order to the user

From
"Kevin Grittner"
Date:
Jan Wieck <JanWieck@Yahoo.com> wrote:
> Have you ever looked at one of those queries, that Londiste or
> Slony issue against the provider DB in order to get all the log
> data that has been committed between two snapshots? Is that really
> the best you can think of?
No, I admit I haven't.  In fact, I was thinking primarily in terms
of log-driven situations, like HS.  What would be the best place for
me to look to come up to speed on your use case?  (I'm relatively
sure that the issue isn't that there's no information to find, but
that a sequential pass over all available information would take a
*long* time.)  I've been working through the issues on WAL-based
replicas, and have some additional ideas and alternatives, but I'd
like to see the "big picture", including trigger-based replication,
before posting.
-Kevin


Re: Exposing the Xact commit order to the user

From
Nicolas Barbier
Date:
2010/5/25 Dan Ports <drkp@csail.mit.edu>:

> On Tue, May 25, 2010 at 02:00:42PM +0200, Nicolas Barbier wrote:
>
>> I don't understand the problem. According to me, in the context of
>> SSI, a read-only slave can just map SERIALIZABLE to the technical
>> implementation of REPEATABLE READ (i.e., the currently-existing
>> "SERIALIZABLE"). The union of the transactions on the master and the
>> slave(s) will still exhibit SERIALIZABLE behavior because the
>> transactions on the slave cannot write anything and are therefore
>> irrelevant.
>
> This, unfortunately, isn't true in SSI.
>
> Consider read-only transactions on a single node SSI database -- the
> situation is the same for read-only transactions that run on a slave.
> These transactions can be part of anomalies, so they need to be checked
> for conflicts and potentially aborted.
>
> Consider Kevin's favorite example, where one table contains the current
> date and the other is a list of receipts (initially empty).
>  T1 inserts (select current_date) into receipts, but doesn't commit
>  T2 increments current_date and commits
>  T3 reads both current_date and the receipt table
>  T1 commits
>
> T3, which is a read-only transaction, sees the incremented date and an
> empty list of receipts. But T1 later commits a new entry in the
> receipts table with the old date. No serializable ordering allows this.
> However, if T3 hadn't performed its read, there'd be no problem; we'd
> just serialize T1 before T2 and no one would be the wiser.
>
> SSI would detect a potential conflict here, which we could resolve by
> aborting T3. (We could also abort T1, but if this is a replicated
> system this isn't always an option -- T3 might be running on the
> slave, so only the slave will know about the conflict, and it can't
> very well abort an update transaction on the master.)

Ah, indeed. I made the same reasoning mistake as Florian (presumably)
did: I didn't think of the fact that the read-only transaction doesn't
need to be the pivot.

Nicolas


Re: Exposing the Xact commit order to the user

From
Florian Pflug
Date:
On May 25, 2010, at 20:48 , Dan Ports wrote:
> On Tue, May 25, 2010 at 08:35:44PM +0200, Florian Pflug wrote:
>> Hm, so in fact SSI sometimes allows the database to be inconsistent, but only as long as nobody tries to observe it?
>
> Yes. Note that even while it's in an inconsistent state, you can still
> perform any query that doesn't observe the inconsistency -- hopefully
> most queries fall into this category.

Yeah, as long as you just walk by without looking, the database is happy ;-)

>> Btw, I still don't get how this follows from the Cahill paper. For a transaction to lie on a dangerous circle, it
needsincoming and outgoing edges in the conflict graph, right? But I'd have though that conflicts are always between a
readerand a writer or between two writers. So how can a read-only transaction have incoming and outgoing edges? 
>
> Right, the read-only transaction can't have incoming edges, but it can
> have outgoing edges. So it can't be the "pivot" itself (the transaction
> with both outgoing and incoming edges), but it can cause *another*
> transaction to be.
>
> In the example I gave, T3 (the r/o transaction) has an outgoing edge to
> T1, because it didn't see T1's concurrent update. T1 already had an
> outgoing edge to T2, so adding in this incoming edge from T3 creates
> the dangerous structure.

Hm, but for there to be an actual problem (and not a false positive), an actual dangerous circle has to exist in the
dependencygraph. The existence of a dangerous structure is just a necessary (but not sufficient) and easily checked-for
conditionfor that, right? Now, if a read-only transaction only ever has outgoing edges, it cannot be part of a
(dangerousor not) circle, and hence any dangerous structure it is part of is a false positive. 

I guess my line of reasoning is flawed somehow, but I cannot figure out why...

best regards,
Florian Pflug



Re: Exposing the Xact commit order to the user

From
Nicolas Barbier
Date:
2010/5/25 Florian Pflug <fgp@phlo.org>:

> On May 25, 2010, at 20:18 , Dan Ports wrote:
>
>> T3, which is a read-only transaction, sees the incremented date and an
>> empty list of receipts. But T1 later commits a new entry in the
>> receipts table with the old date. No serializable ordering allows this.
>>
>> However, if T3 hadn't performed its read, there'd be no problem; we'd
>> just serialize T1 before T2 and no one would be the wiser.
>
> Hm, so in fact SSI sometimes allows the database to be inconsistent, but only as long as nobody tries to observe it?

I would not call this an inconsistent state: it would become
inconsistent only after someone (e.g., T3) has observed it _and_ T1
commits.

Nicolas


Re: Exposing the Xact commit order to the user

From
Nicolas Barbier
Date:
2010/5/25 Florian Pflug <fgp@phlo.org>:

> Hm, but for there to be an actual problem (and not a false positive), an
> actual dangerous circle has to exist in the dependency graph. The
> existence of a dangerous structure is just a necessary (but not
> sufficient) and easily checked-for condition for that, right? Now, if a
> read-only transaction only ever has outgoing edges, it cannot be part
> of a (dangerous or not) circle, and hence any dangerous structure it is
> part of is a false positive.
>
> I guess my line of reasoning is flawed somehow, but I cannot figure out why...

In the general case, "wr" dependencies also create "must be serialized
before" edges. It seems that those edges can be discarded when finding
a pivot, but if you want to go "back to basics":

("<" means "must be serialized before".)

* T1 < T2, because T1 reads a version of a data element for which T2
later creates a newer version (rw between T1 and T2).
* T3 < T1, because T3 reads a version of a data element for which T1
later creates a newer version (rw between T3 and T1).
* T2 < T3, because T2 creates a version of a data element, which is
then read by T3 (wr between T2 and T3).

(As you can see, those 3 edges form a cycle.)

Nicolas


Re: Exposing the Xact commit order to the user

From
"Kevin Grittner"
Date:
Florian Pflug <fgp@phlo.org> wrote:
> Hm, but for there to be an actual problem (and not a false
> positive), an actual dangerous circle has to exist in the
> dependency graph. The existence of a dangerous structure is just a
> necessary (but not sufficient) and easily checked-for condition
> for that, right? Now, if a read-only transaction only ever has
> outgoing edges, it cannot be part of a (dangerous or not) circle,
> and hence any dangerous structure it is part of is a false
> positive.
> 
> I guess my line of reasoning is flawed somehow, but I cannot
> figure out why...
Here's why:
We're tracking rw-dependencies, where the "time-arrow" showing
effective order of execution points from the reader to the writer
(since the reader sees a state prior to the write, it effectively
executes before it).  These are important because there have to be
two such dependencies, one in to the pivot and one out from the
pivot, for a problem to exist.  (See various works by Dr. Alan
Fekete, et al, for details.)  But other dependencies can imply an
order of execution.  In particular, a wr-dependency, where a
transaction *can* see data committed by another transaction, implies
that the *writer* came first in the order of execution.  In this
example, the transaction which lists the receipts successfully reads
the control table update, but is not able to read the receipt
insert.  This completes the cycle, making it a real anomaly and not
a false positive.
Note that the wr-dependency can actually exist outside the database,
making it pretty much impossible to accurately tell a false positive
from a true anomaly when the pivot exists and the transaction
writing data which the pivot can't read commits first.  For example,
let's say that the update to the control table is committed from an
application which, seeing that its update came back without error,
proceeds to list the receipts for the old date in a subsequent
transaction.  You have a wr-dependency which is, in reality, quite
real and solid with no way to notice it within the database engine. 
That's why the techniques used in SSI are pretty hard to improve
upon beyond more detailed and accurate tracking of rw-conflicts.
-Kevin


Re: Exposing the Xact commit order to the user

From
Jan Wieck
Date:
On 5/24/2010 9:30 AM, Greg Sabino Mullane wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: RIPEMD160
> 
> 
>> In light of the proposed purging scheme, how would it be able to distinguish 
>> between those two cases (nothing there yet vs. was there but purged)?
> 
>> There is a difference between an empty result set and an exception.
> 
> No, I meant how will the *function* know, if a superuser and/or some 
> background process can purge records at any time?

The data contains timestamps which are supposedly taken in commit order. 
Checking the age of the last entry in the file should be simple enough 
to determine if the segment matches the "max age" configuration (if 
set). In the case of a superuser telling what to purge he would just 
call a function with a serial number (telling the obsolete segments).


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Exposing the Xact commit order to the user

From
Tom Lane
Date:
Jan Wieck <JanWieck@Yahoo.com> writes:
>> No, I meant how will the *function* know, if a superuser and/or some 
>> background process can purge records at any time?

> The data contains timestamps which are supposedly taken in commit order. 

You can *not* rely on the commit timestamps to be in exact order.
(Perhaps approximate ordering is good enough for what you want here,
but just be careful to not fall into the trap of assuming that they're
exactly ordered.)
        regards, tom lane


Re: Exposing the Xact commit order to the user

From
"Kevin Grittner"
Date:
Robert Haas <robertmhaas@gmail.com> wrote:
> maybe we should get serializable working and committed on one
> node first and then worry about how to distribute it.  I think
> there might be other approaches to this problem
Well, I've got two or three other ideas on how we can manage this
for HS, but since I now realize that I've totally misunderstood the
main use case for this (which is to support trigger-based
replication), I'd like to be clear on something before letting it
drop.  The big question is, do such replicas need to support
serializable access to the data modified by serializable
transactions in the source database?  That is, is there a need for
such replicas to only see states which are possible in some serial
order of execution of serializable transactions on the source
database?  Or to phrase the same question a third way, should there
be a way to run queries on such replicas with confidence that what
is viewed is consistent with user-defined constraints and business
rules?
If not, there's no intersection between this feature and SSI.  If
there is, I think we should think through at least a general
strategy sooner, rather than later.
-Kevin


Re: Exposing the Xact commit order to the user

From
Jan Wieck
Date:
On 5/25/2010 12:03 PM, Simon Riggs wrote:
> On Sun, 2010-05-23 at 16:21 -0400, Jan Wieck wrote:
> 
>> In some systems (data warehousing, replication), the order of commits is
>> important, since that is the order in which changes have become visible.
>> This information could theoretically be extracted from the WAL, but
>> scanning the entire WAL just to extract this tidbit of information would
>> be excruciatingly painful.
> 
> I think it would be quite simple to read WAL. WALSender reads the WAL
> file after its been flushed, so it would be simple for it to read a blob
> of WAL and then extract the commit order from it.
> 
> Overall though, it would be easier and more efficient to *add* info to
> WAL and then do all this processing *after* WAL has been transported
> elsewhere. Extracting info with DDL triggers, normal triggers, commit
> order and everything else seems like too much work to me. Every other
> RDBMS has moved away from trigger-based replication and we should give
> that serious consideration also.

Reading the entire WAL just to find all COMMIT records, then go back to 
the origin database to get the actual replication log you're looking for 
is simpler and more efficient? I don't think so.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Exposing the Xact commit order to the user

From
Simon Riggs
Date:
On Tue, 2010-05-25 at 16:41 -0400, Jan Wieck wrote:
> On 5/25/2010 12:03 PM, Simon Riggs wrote:
> > On Sun, 2010-05-23 at 16:21 -0400, Jan Wieck wrote:
> > 
> >> In some systems (data warehousing, replication), the order of commits is
> >> important, since that is the order in which changes have become visible.
> >> This information could theoretically be extracted from the WAL, but
> >> scanning the entire WAL just to extract this tidbit of information would
> >> be excruciatingly painful.
> > 
> > I think it would be quite simple to read WAL. WALSender reads the WAL
> > file after its been flushed, so it would be simple for it to read a blob
> > of WAL and then extract the commit order from it.
> > 
> > Overall though, it would be easier and more efficient to *add* info to
> > WAL and then do all this processing *after* WAL has been transported
> > elsewhere. Extracting info with DDL triggers, normal triggers, commit
> > order and everything else seems like too much work to me. Every other
> > RDBMS has moved away from trigger-based replication and we should give
> > that serious consideration also.
> 
> Reading the entire WAL just to find all COMMIT records, then go back to 
> the origin database to get the actual replication log you're looking for 
> is simpler and more efficient? I don't think so.

Agreed, but I think I've not explained myself well enough.

I proposed two completely separate ideas; the first one was this:

If you must get commit order, get it from WAL on *origin*, using exact
same code that current WALSender provides, plus some logic to read
through the WAL records and extract commit/aborts. That seems much
simpler than the proposal you outlined and as SR shows, its low latency
as well since commits write to WAL. No need to generate event ticks
either, just use XLogRecPtrs as WALSender already does.

I see no problem with integrating that into core, technically or
philosophically.

-- Simon Riggs           www.2ndQuadrant.com



Re: Exposing the Xact commit order to the user

From
Jan Wieck
Date:
On 5/25/2010 4:50 PM, Simon Riggs wrote:
> On Tue, 2010-05-25 at 16:41 -0400, Jan Wieck wrote:
>> On 5/25/2010 12:03 PM, Simon Riggs wrote:
>> > On Sun, 2010-05-23 at 16:21 -0400, Jan Wieck wrote:
>> > 
>> >> In some systems (data warehousing, replication), the order of commits is
>> >> important, since that is the order in which changes have become visible.
>> >> This information could theoretically be extracted from the WAL, but
>> >> scanning the entire WAL just to extract this tidbit of information would
>> >> be excruciatingly painful.
>> > 
>> > I think it would be quite simple to read WAL. WALSender reads the WAL
>> > file after its been flushed, so it would be simple for it to read a blob
>> > of WAL and then extract the commit order from it.
>> > 
>> > Overall though, it would be easier and more efficient to *add* info to
>> > WAL and then do all this processing *after* WAL has been transported
>> > elsewhere. Extracting info with DDL triggers, normal triggers, commit
>> > order and everything else seems like too much work to me. Every other
>> > RDBMS has moved away from trigger-based replication and we should give
>> > that serious consideration also.
>> 
>> Reading the entire WAL just to find all COMMIT records, then go back to 
>> the origin database to get the actual replication log you're looking for 
>> is simpler and more efficient? I don't think so.
> 
> Agreed, but I think I've not explained myself well enough.
> 
> I proposed two completely separate ideas; the first one was this:
> 
> If you must get commit order, get it from WAL on *origin*, using exact
> same code that current WALSender provides, plus some logic to read
> through the WAL records and extract commit/aborts. That seems much
> simpler than the proposal you outlined and as SR shows, its low latency
> as well since commits write to WAL. No need to generate event ticks
> either, just use XLogRecPtrs as WALSender already does.
> 
> I see no problem with integrating that into core, technically or
> philosophically.
> 

Which means that if I want to allow a consumer of that commit order data 
to go offline for three days or so to replicate the 5 requested, low 
volume tables, the origin needs to hang on to the entire WAL log from 
all 100 other high volume tables?


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Exposing the Xact commit order to the user

From
Jan Wieck
Date:
On 5/25/2010 4:16 PM, Tom Lane wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
>>> No, I meant how will the *function* know, if a superuser and/or some 
>>> background process can purge records at any time?
> 
>> The data contains timestamps which are supposedly taken in commit order. 
> 
> You can *not* rely on the commit timestamps to be in exact order.
> (Perhaps approximate ordering is good enough for what you want here,
> but just be careful to not fall into the trap of assuming that they're
> exactly ordered.)

I am well aware of the fact that commit timestamps within the WAL can go 
backwards and that the serial numbers of this proposed implementation of 
commit order can even be different from what the timestamps AND the WAL 
are saying.

As long as the serial number (record position inside of segment) is 
determined while the transaction still holds all its locks, this is 
going to be good enough for what async replication users today are used 
to. Again, it will not magically make it possible to determine a 
serializable order of actions, that happened from transactions running 
in read committed isolation level, post mortem. I don't even even think 
that is possible at all.

And I don't think anyone proposed a solution for that problem anyways.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Exposing the Xact commit order to the user

From
Jan Wieck
Date:
On 5/25/2010 3:18 PM, Kevin Grittner wrote:
> Jan Wieck <JanWieck@Yahoo.com> wrote:
>  
>> Have you ever looked at one of those queries, that Londiste or
>> Slony issue against the provider DB in order to get all the log
>> data that has been committed between two snapshots? Is that really
>> the best you can think of?
>  
> No, I admit I haven't.  In fact, I was thinking primarily in terms
> of log-driven situations, like HS.  What would be the best place for
> me to look to come up to speed on your use case?  (I'm relatively
> sure that the issue isn't that there's no information to find, but
> that a sequential pass over all available information would take a
> *long* time.)  I've been working through the issues on WAL-based
> replicas, and have some additional ideas and alternatives, but I'd
> like to see the "big picture", including trigger-based replication,
> before posting.

In short, what both systems are doing is as follows. An AFTER ROW 
trigger records the OLD PK and all changed columns, as well as the txid 
and a global, not cached serial number. Some background process 
periodically starts a serializable transaction and records the resulting 
snapshot.

To replicate from one consistent state to the next, the replication 
system now selects all log rows between two snapshots. Between here 
means it simulates MVCC visibility in the sense of that the writing 
transaction was in progress when the first snapshot was taken and had 
committed at the second. The resulting WHERE clause looks something like
    WHERE (xid > s1.xmax OR (xid >= s1.xmin AND xid IN (s1.xip)))      AND (xid < s2.xmin OR (xid <= s2.xmax AND xid
NOTIN (s2.xip)))
 

Note that xip here is a comma separated list of txid's. I think it is 
easy to see that this is not a cheap query.

Anyhow, that set of log rows is now ordered by the serial number and 
applied to the replica.

Without this logic, the replication system could not combine multiple 
origin sessions into one replication session without risking to never 
find a state, in which it can commit.

It may be possible to work with two sessions on the replica and not 
require any reordering of the original actions at all. I need to think 
about that for a little longer since this idea just occurred to me a 
second ago.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Exposing the Xact commit order to the user

From
Jan Wieck
Date:
On 5/26/2010 7:03 AM, Jan Wieck wrote:
> To replicate from one consistent state to the next, the replication 
> system now selects all log rows between two snapshots. Between here 
> means it simulates MVCC visibility in the sense of that the writing 
> transaction was in progress when the first snapshot was taken and had 
> committed at the second. The resulting WHERE clause looks something like

Or it entirely happened between the snapshots, obviously.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Exposing the Xact commit order to the user

From
Greg Stark
Date:
On Sun, May 23, 2010 at 9:21 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
> Each record of the Transaction Commit Info consists of
>
>     txid          xci_transaction_id
>     timestamptz   xci_begin_timestamp
>     timestamptz   xci_commit_timestamp
>     int64         xci_total_rowcount
>

So I think you're going about this backwards.

Instead of discussing implementation I think you should start with the
API the replication system needs. In particular I'm not sure you
really want a server-side query at all. I'm wondering if you wouldn't
be better off with a public machine-parsable text format version of
the WAL. Ie, at the same time as writing out all the nitty gritty to
the binary wal we would write out a summary of public data to an xml
version containing just parts of the data stream that we can promise
won't change, such as transaction id, lsn, timestamp.

--
greg


Re: Exposing the Xact commit order to the user

From
Jan Wieck
Date:
On 5/26/2010 10:04 AM, Greg Stark wrote:
> On Sun, May 23, 2010 at 9:21 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
>> Each record of the Transaction Commit Info consists of
>>
>>     txid          xci_transaction_id
>>     timestamptz   xci_begin_timestamp
>>     timestamptz   xci_commit_timestamp
>>     int64         xci_total_rowcount
>>
> 
> So I think you're going about this backward
> 
> Instead of discussing implementation I think you should start with the
> API the replication system needs. In particular I'm not sure you
> really want a server-side query at all. I'm wondering if you wouldn't
> be better off with a public machine-parsable text format version of
> the WAL. Ie, at the same time as writing out all the nitty gritty to
> the binary wal we would write out a summary of public data to an xml
> version containing just parts of the data stream that we can promise
> won't change, such as transaction id, lsn, timestamp.

Since the actual row level change information and other event data is 
found inside of regular tables, identified by TXID and sequence number, 
I am pretty sure I want that data in a server-side query. What you are 
proposing is to read the xid's and timestamps with an external process, 
that now forcibly needs to reside on the DB server itself (neither 
Londiste nor Slony have that requirement as of today), then bring it 
back into the DB at least inside the WHERE clause of a query.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin



Re: Exposing the Xact commit order to the user

From
Steve Singer
Date:
Jan Wieck wrote:

> Since the actual row level change information and other event data is 
> found inside of regular tables, identified by TXID and sequence number, 
> I am pretty sure I want that data in a server-side query. What you are 
> proposing is to read the xid's and timestamps with an external process, 
> that now forcibly needs to reside on the DB server itself (neither 
> Londiste nor Slony have that requirement as of today), then bring it 
> back into the DB at least inside the WHERE clause of a query.


It depends on how you approach the problem.

If you had a process that could scan WAL files (or a platform/version 
independent representation of these WAL files) you could run that 
process on any server (the origin server, a replica, or some third 
server with the software installed).  Where you run it involves making 
trade-offs on the costs of storing transferring and processing the files 
and would ideally be configurable.

You could then have a process that transfers all of the data logged by 
the triggers to the replicas as soon as it is committed.  Basically 
saying 'copy any rows in sl_log from the origin to the replica that we 
haven't already sent to that replica'

You could then move the work of figuring out the commit order onto the 
replica where you would combine the output of the WAL scanning process 
with the transaction data that has been copied to the replica.


> 
> 
> Jan
> 


-- 
Steve Singer
Afilias Canada
Data Services Developer
416-673-1142


Re: Exposing the Xact commit order to the user

From
Robert Haas
Date:
On Wed, May 26, 2010 at 11:43 AM, Steve Singer <ssinger@ca.afilias.info> wrote:
> Jan Wieck wrote:
>
>> Since the actual row level change information and other event data is
>> found inside of regular tables, identified by TXID and sequence number, I am
>> pretty sure I want that data in a server-side query. What you are proposing
>> is to read the xid's and timestamps with an external process, that now
>> forcibly needs to reside on the DB server itself (neither Londiste nor Slony
>> have that requirement as of today), then bring it back into the DB at least
>> inside the WHERE clause of a query.
>
>
> It depends on how you approach the problem.
>
> If you had a process that could scan WAL files (or a platform/version
> independent representation of these WAL files) you could run that process on
> any server (the origin server, a replica, or some third server with the
> software installed).  Where you run it involves making trade-offs on the
> costs of storing transferring and processing the files and would ideally be
> configurable.
>
> You could then have a process that transfers all of the data logged by the
> triggers to the replicas as soon as it is committed.  Basically saying 'copy
> any rows in sl_log from the origin to the replica that we haven't already
> sent to that replica'
>
> You could then move the work of figuring out the commit order onto the
> replica where you would combine the output of the WAL scanning process with
> the transaction data that has been copied to the replica.

I'm sure it's possible to make this work however you want to do it,
but I don't really see what advantage Greg Stark's proposal has over
Jan's original proposal.  Recording the commits in one extra place at
commit time is practically free, especially compared to the overall
cost of replication.  Rescanning the WAL seems likely to be much more
expensive and potentially introduces more failure paths.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: Exposing the Xact commit order to the user

From
Jan Wieck
Date:
On 5/26/2010 10:04 AM, Greg Stark wrote:
> Instead of discussing implementation I think you should start with the
> API the replication system needs.

... but to answer that request, actually I don't even think we should be 
discussing API specifics.

During PGCon, Marco Kreen, Jim Nasby and I were discussing what the 
requirements of a unified message queue, shared by Londiste and Slony 
may look like. For some use cases of pgq, there isn't even any interest 
in user table changes. These are simply a reliable, database backed 
message passing system.

Today both systems use an "agreeable" order of changes selected by 
rather expensive queries based on serializable snapshot information and 
a global, non cacheable serial number.

This could be replaced with a logic based on the actual commit order of 
the transactions. This order does not need to be 100% accurate. As long 
as the order is recorded after all user actions have been performed 
(trigger queue shut down) and while the transaction is still holding 
onto its locks, that order is good enough. This will not allow a 
conflicting transaction, waiting on locks to be released, to appear 
having committed before the lock conflict winner.

It is obvious that in cases where only small portions or even none of 
the user table changes are needed, holding on to or even parsing the 
ENTIRE WAL sounds suboptimal for this use case.

Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin



Re: Exposing the Xact commit order to the user

From
Greg Stark
Date:
On Wed, May 26, 2010 at 5:10 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
> ... but to answer that request, actually I don't even think we should be
> discussing API specifics.
>

How about just API generalities? Like, where do you need this data, on
the master or on the slave? Would PGXC like it on the transaction
coordinator?

What question do you need to answer, do you need to pull out sets of
commits in certain ranges or look up specific transaction ids and find
out when they committed? Or do you only need to answer which of two
transaction ids committed first?



-- 
greg


Re: Exposing the Xact commit order to the user

From
Heikki Linnakangas
Date:
Could you generate the commit-order log by simply registering a commit 
hook (RegisterXactCallback(XACT_EVENT_COMMIT)) that writes such a log 
somewhere in the data directory? That would work with older versions 
too, no server changes required.

It would not get called during recovery, but I believe that would be 
sufficient for Slony. You could always batch commits that you don't know 
when they committed as if they committed simultaneously.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Exposing the Xact commit order to the user

From
Jan Wieck
Date:
On 5/26/2010 1:17 PM, Heikki Linnakangas wrote:
> Could you generate the commit-order log by simply registering a commit 
> hook (RegisterXactCallback(XACT_EVENT_COMMIT)) that writes such a log 
> somewhere in the data directory? That would work with older versions 
> too, no server changes required.
> 

That would work, as it seems that the backend keeps holding on to its 
locks until after calling the callbacks.

> It would not get called during recovery, but I believe that would be 
> sufficient for Slony. You could always batch commits that you don't know 
> when they committed as if they committed simultaneously.

Here you are mistaken. If the origin crashes but can recover not yet 
flushed to xlog-commit-order transactions, then the consumer has no idea 
about the order of those commits, which throws us back to the point 
where we require a non cacheable global sequence to replay the 
individual actions of those "now batched" transactions in an agreeable 
order.

The commit order data needs to be covered by crash recovery.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Exposing the Xact commit order to the user

From
Heikki Linnakangas
Date:
On 26/05/10 21:43, Jan Wieck wrote:
> On 5/26/2010 1:17 PM, Heikki Linnakangas wrote:
>> It would not get called during recovery, but I believe that would be
>> sufficient for Slony. You could always batch commits that you don't
>> know when they committed as if they committed simultaneously.
>
> Here you are mistaken. If the origin crashes but can recover not yet
> flushed to xlog-commit-order transactions, then the consumer has no idea
> about the order of those commits, which throws us back to the point
> where we require a non cacheable global sequence to replay the
> individual actions of those "now batched" transactions in an agreeable
> order.
>
> The commit order data needs to be covered by crash recovery.

Perhaps I'm missing something, but I thought that Slony currently uses a 
heartbeat, and all transactions committed between two beats are banged 
together and committed as one in the slave so that their relative commit 
order doesn't matter. Can we not do the same for commits missing from 
the commit-order log?

I'm thinking that the commit-order log would contain two kinds of records:

a) Transaction with XID X committed
b) All transactions with XID < X committed

During normal operation we write the 1st kind of record at every commit. 
After crash recovery (perhaps at the first commit after recovery or when 
the slon daemon first polls the server, as there's no hook for 
end-of-recovery), we write the 2nd kind of record.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Exposing the Xact commit order to the user

From
Dimitri Fontaine
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Perhaps I'm missing something, but I thought that Slony currently uses a
> heartbeat, and all transactions committed between two beats are banged
> together and committed as one in the slave so that their relative commit
> order doesn't matter. 

I guess Slony does the same as pgq here: all events of all those
transactions between two given ticks are batched together in the order
of the event commits. (In fact the batches are made at the consumer
request, so possibly spreading more than 2 ticks at a time).

If you skip that event ordering (within transactions), you can't
maintain foreign keys on the slaves, among other things.

The idea of this proposal is to be able to get this commit order
directly from where the information is maintained, rather than use some
sort of user sequence for that.

So even ordering the txid and txid_snapshots with respect to WAL commit
time (LSN) won't be the whole story, for any given transaction
containing more than one event we also need to have them in order. I
know Jan didn't forget about it so it must either be in the proposal or
easily derived, too tired to recheck.

Regards,
-- 
dim


Re: Exposing the Xact commit order to the user

From
Robert Haas
Date:
On Wed, May 26, 2010 at 4:11 PM, Dimitri Fontaine
<dfontaine@hi-media.com> wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Perhaps I'm missing something, but I thought that Slony currently uses a
>> heartbeat, and all transactions committed between two beats are banged
>> together and committed as one in the slave so that their relative commit
>> order doesn't matter.
>
> I guess Slony does the same as pgq here: all events of all those
> transactions between two given ticks are batched together in the order
> of the event commits. (In fact the batches are made at the consumer
> request, so possibly spreading more than 2 ticks at a time).
>
> If you skip that event ordering (within transactions), you can't
> maintain foreign keys on the slaves, among other things.
>
> The idea of this proposal is to be able to get this commit order
> directly from where the information is maintained, rather than use some
> sort of user sequence for that.

Exactly.

> So even ordering the txid and txid_snapshots with respect to WAL commit
> time (LSN) won't be the whole story, for any given transaction
> containing more than one event we also need to have them in order. I
> know Jan didn't forget about it so it must either be in the proposal or
> easily derived, too tired to recheck.

Right, so the point is - with this proposal, he can switch to using a
LOCAL sequence number to order events within the session and then
order the sessions using the commit ordering.  Right now, he has to
use a GLOBAL sequence number because there's no way to know the commit
order.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: Exposing the Xact commit order to the user

From
Jan Wieck
Date:
On 5/26/2010 3:16 PM, Heikki Linnakangas wrote:
> On 26/05/10 21:43, Jan Wieck wrote:
>> On 5/26/2010 1:17 PM, Heikki Linnakangas wrote:
>>> It would not get called during recovery, but I believe that would be
>>> sufficient for Slony. You could always batch commits that you don't
>>> know when they committed as if they committed simultaneously.
>>
>> Here you are mistaken. If the origin crashes but can recover not yet
>> flushed to xlog-commit-order transactions, then the consumer has no idea
>> about the order of those commits, which throws us back to the point
>> where we require a non cacheable global sequence to replay the
>> individual actions of those "now batched" transactions in an agreeable
>> order.
>>
>> The commit order data needs to be covered by crash recovery.
> 
> Perhaps I'm missing something, 

Apparently, more about that at the end.

> I'm thinking that the commit-order log would contain two kinds of records:
> 
> a) Transaction with XID X committed
> b) All transactions with XID < X committed

If that was true then long running transactions would delay all commits 
for transactions that started after them. Do they?

> 
> During normal operation we write the 1st kind of record at every commit. 
> After crash recovery (perhaps at the first commit after recovery or when 
> the slon daemon first polls the server, as there's no hook for 
> end-of-recovery), we write the 2nd kind of record.

I think the callback is also called during backend startup, which means 
that it could record the first XID to come which is known from the 
control file and in that case, all < XID's are committed or aborted.

Which leads us to your missing piece above, the need for the global non 
cacheable sequence.

Consider two transactions A and B that due to transaction batching 
between snapshots get applied together. Let the order of actions be

1. A starts
2. B starts
3. B selects a row for update, then updates the row
4. A tries to do the same and blocks
5. B commits
6. A gets the lock, the row, does the update
7. A commits

If Slony (or Londiste) would not record the exact order of those 
individual row actions, then it would not have any idea if within that 
batch the action of B (higher XID) actually came first. Without that 
knowledge there is a 50/50 chance of getting your replica out of sync 
with that simple conflict.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Exposing the Xact commit order to the user

From
"Kevin Grittner"
Date:
Jan Wieck <JanWieck@Yahoo.com> wrote:
> Without this logic, the replication system could not combine
> multiple origin sessions into one replication session without
> risking to never find a state, in which it can commit.
My latest idea for handling this in WAL-based replication involves
WAL-logging information about the transaction through which a the
committing transaction makes it safe to view.  There are a few
options here at the detail level that I'm still thinking through.
The idea would be that the xmin from read-only queries on the slaves
might be somewhere behind where you would expect based on
transactions committed.  (The details involve such things as where
non-serializable transactions fall into the plan on both sides, and
whether it's worth the effort to special-case read-only transactions
on the master.)
I can't say that I'm 100% sure that some lurking detail won't shoot
this technique down for HS, but it seems good to me at a conceptual
level.
I think, however, that this fails to work for systems like Slony and
Londiste because there could be transactions writing to tables which
are not replication targets, so the snapshot adjustments wouldn't be
safe.  True?  (If not true, I think that adding some sort of xmin
value, depending on the answers to the above questions, to Jan's
proposed structure might support better transactional integrity,
even to the level of full serializable support, at the cost of
delaying visibility of committed data.)
-Kevin


Re: Exposing the Xact commit order to the user

From
Jan Wieck
Date:
On 5/26/2010 4:11 PM, Dimitri Fontaine wrote:
> So even ordering the txid and txid_snapshots with respect to WAL commit
> time (LSN) won't be the whole story, for any given transaction
> containing more than one event we also need to have them in order. I
> know Jan didn't forget about it so it must either be in the proposal or
> easily derived, too tired to recheck.

No, that detail is actually not explained in the proposal. When applying 
all changes in transaction commit order, there is no need for a global 
sequence. A local counter per backend is sufficient because the total 
order of <xact-commit-order>, <local-xact-seq> yields a similarly 
agreeable order of actions.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Exposing the Xact commit order to the user

From
Heikki Linnakangas
Date:
On 26/05/10 23:32, Jan Wieck wrote:
> Consider two transactions A and B that due to transaction batching
> between snapshots get applied together. Let the order of actions be
>
> 1. A starts
> 2. B starts
> 3. B selects a row for update, then updates the row
> 4. A tries to do the same and blocks
> 5. B commits
> 6. A gets the lock, the row, does the update
> 7. A commits
>
> If Slony (or Londiste) would not record the exact order of those
> individual row actions, then it would not have any idea if within that
> batch the action of B (higher XID) actually came first. Without that
> knowledge there is a 50/50 chance of getting your replica out of sync
> with that simple conflict.

Hmm, I don't see how even a fully reliable WAL-logged commit-order log 
would save you then. It seems that you need to not only know the 
relative order of commits, but the order of commits relative to actions 
within the transactions. I.e. in the above example it's not enough to 
know that B committed before A, you also have to know that A updated the 
row only after B committed.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Exposing the Xact commit order to the user

From
Jan Wieck
Date:
On 5/26/2010 12:38 PM, Greg Stark wrote:
> On Wed, May 26, 2010 at 5:10 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
>> ... but to answer that request, actually I don't even think we should be
>> discussing API specifics.
>>
> 
> How about just API generalities? Like, where do you need this data, on
> the master or on the slave? Would PGXC like it on the transaction
> coordinator?
> 
> What question do you need to answer, do you need to pull out sets of
> commits in certain ranges or look up specific transaction ids and find
> out when they committed? Or do you only need to answer which of two
> transaction ids committed first?

The question I want answered is
  "what was the order and xid of the next 0..n transactions, that  committed after transaction X?"

Preferably I would avoid scanning the entire available WAL just to get 
the next n xid's to process.

The proposal assigned a unique serial number (file segment and position 
driven) to each xid and used that for the ordering as well as 
identification of the last known transaction. That is certainly a 
premature implementation detail.

In this implementation it wouldn't even matter if a transaction that was 
recorded actually never made it because it crashed before the WAL flush. 
It would be reported by this "commit order" feature, but there would be 
no traces of whatever it did to be found inside the DB, so that anomaly 
is harmless.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin



Re: Exposing the Xact commit order to the user

From
Heikki Linnakangas
Date:
On 26/05/10 23:45, Heikki Linnakangas wrote:
> On 26/05/10 23:32, Jan Wieck wrote:
>> Consider two transactions A and B that due to transaction batching
>> between snapshots get applied together. Let the order of actions be
>>
>> 1. A starts
>> 2. B starts
>> 3. B selects a row for update, then updates the row
>> 4. A tries to do the same and blocks
>> 5. B commits
>> 6. A gets the lock, the row, does the update
>> 7. A commits
>>
>> If Slony (or Londiste) would not record the exact order of those
>> individual row actions, then it would not have any idea if within that
>> batch the action of B (higher XID) actually came first. Without that
>> knowledge there is a 50/50 chance of getting your replica out of sync
>> with that simple conflict.
>
> Hmm, I don't see how even a fully reliable WAL-logged commit-order log
> would save you then. It seems that you need to not only know the
> relative order of commits, but the order of commits relative to actions
> within the transactions. I.e. in the above example it's not enough to
> know that B committed before A, you also have to know that A updated the
> row only after B committed.

Ok, I think I understand it now. The commit order is enough, because 
replaying the actions in the order "all actions of B, then all actions 
of A" yields the same result.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Exposing the Xact commit order to the user

From
Heikki Linnakangas
Date:
On 26/05/10 23:49, Jan Wieck wrote:
> In this implementation it wouldn't even matter if a transaction that was
> recorded actually never made it because it crashed before the WAL flush.
> It would be reported by this "commit order" feature, but there would be
> no traces of whatever it did to be found inside the DB, so that anomaly
> is harmless.

Hmm, I think it would also not matter if the reported commit order 
doesn't match exactly the order of the commit records, as long as 
there's no dependency between the two transactions.

What I'm after is that I think it would be enough to establish the 
commit order using deferred triggers that are fired during pre-commit 
processing. The trigger could get a number from a global sequence to 
establish the commit order, and write it to a table. So you still need a 
global sequence, but it's only needed once per commit.

(you have to handle deferred triggers that fire after the commit-order 
trigger. perhaps by getting another number from the global sequence and 
replacing the previous number with it)

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Exposing the Xact commit order to the user

From
Jan Wieck
Date:
On 5/26/2010 4:52 PM, Heikki Linnakangas wrote:

> Ok, I think I understand it now. The commit order is enough, because 
> replaying the actions in the order "all actions of B, then all actions 
> of A" yields the same result.

Precisely.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Exposing the Xact commit order to the user

From
Jan Wieck
Date:
On 5/26/2010 5:12 PM, Heikki Linnakangas wrote:
> On 26/05/10 23:49, Jan Wieck wrote:
>> In this implementation it wouldn't even matter if a transaction that was
>> recorded actually never made it because it crashed before the WAL flush.
>> It would be reported by this "commit order" feature, but there would be
>> no traces of whatever it did to be found inside the DB, so that anomaly
>> is harmless.
> 
> Hmm, I think it would also not matter if the reported commit order 
> doesn't match exactly the order of the commit records, as long as 
> there's no dependency between the two transactions.
> 
> What I'm after is that I think it would be enough to establish the 
> commit order using deferred triggers that are fired during pre-commit 
> processing. The trigger could get a number from a global sequence to 
> establish the commit order, and write it to a table. So you still need a 
> global sequence, but it's only needed once per commit.

You're not trying to derail this thread into yet another of our famous 
"commit trigger" battles, are you?

> 
> (you have to handle deferred triggers that fire after the commit-order 
> trigger. perhaps by getting another number from the global sequence and 
> replacing the previous number with it)

I could imagine a commit trigger as a special case that is fired AFTER 
the trigger queue was shut down, so any operation that causes any 
further triggers to fire would automatically abort the transaction. A 
draconian, but reasonable restriction.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin



Re: Exposing the Xact commit order to the user

From
Jan Wieck
Date:
On 5/26/2010 4:34 PM, Kevin Grittner wrote:
> Jan Wieck <JanWieck@Yahoo.com> wrote:
>  
>> Without this logic, the replication system could not combine
>> multiple origin sessions into one replication session without
>> risking to never find a state, in which it can commit.
>  
> My latest idea for handling this in WAL-based replication involves
> WAL-logging information about the transaction through which a the
> committing transaction makes it safe to view.  There are a few
> options here at the detail level that I'm still thinking through.
> The idea would be that the xmin from read-only queries on the slaves
> might be somewhere behind where you would expect based on
> transactions committed.  (The details involve such things as where
> non-serializable transactions fall into the plan on both sides, and
> whether it's worth the effort to special-case read-only transactions
> on the master.)
>  
> I can't say that I'm 100% sure that some lurking detail won't shoot
> this technique down for HS, but it seems good to me at a conceptual
> level.

Without simulating multiple simultaneous transactions during playback, 
how are you going to manage that the tuples, already inserted on behalf 
of the ongoing master transactions, disappear when they abort on the master?


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin



Re: Exposing the Xact commit order to the user

From
"Kevin Grittner"
Date:
Jan Wieck <JanWieck@Yahoo.com> wrote:
> On 5/26/2010 4:34 PM, Kevin Grittner wrote:
>> My latest idea for handling this in WAL-based replication
>> involves WAL-logging information about the transaction through
>> which a the committing transaction makes it safe to view.  There
>> are a few options here at the detail level that I'm still
>> thinking through.  The idea would be that the xmin from read-only
>> queries on the slaves might be somewhere behind where you would
>> expect based on transactions committed.  (The details involve
>> such things as where non-serializable transactions fall into the
>> plan on both sides, and whether it's worth the effort to
>> special-case read-only transactions on the master.)
>>  
>> I can't say that I'm 100% sure that some lurking detail won't
>> shoot this technique down for HS, but it seems good to me at a
>> conceptual level.
> 
> Without simulating multiple simultaneous transactions during
> playback, how are you going to manage that the tuples, already
> inserted on behalf of the ongoing master transactions, disappear
> when they abort on the master?
When do writes ever become visible to a snapshot without having been
committed?  I'm not talking about changing that in any way.  I'm
talking about deferring visibility of committed transactions until
they can be viewed without risking serialization anomalies.  This
requires, at a minimum, that any concurrent serializable
transactions which are not read-only have completed.
(Perhaps I'm not understanding your question....)
-Kevin


Re: Exposing the Xact commit order to the user

From
Greg Stark
Date:
On Wed, May 26, 2010 at 5:38 PM, Greg Stark <gsstark@mit.edu> wrote:
> How about just API generalities? Like, where do you need this data, on
> the master or on the slave? Would PGXC like it on the transaction
> coordinator?
>
> What question do you need to answer, do you need to pull out sets of
> commits in certain ranges or look up specific transaction ids and find
> out when they committed? Or do you only need to answer which of two
> transaction ids committed first?
>

This thread has been hard to follow for me. Were any of these
questions answered?

-- 
greg


Re: Exposing the Xact commit order to the user

From
Jan Wieck
Date:
On 5/27/2010 9:59 AM, Greg Stark wrote:
> On Wed, May 26, 2010 at 5:38 PM, Greg Stark <gsstark@mit.edu> wrote:
>> How about just API generalities? Like, where do you need this data, on
>> the master or on the slave? Would PGXC like it on the transaction
>> coordinator?
>>
>> What question do you need to answer, do you need to pull out sets of
>> commits in certain ranges or look up specific transaction ids and find
>> out when they committed? Or do you only need to answer which of two
>> transaction ids committed first?
>>
> 
> This thread has been hard to follow for me. Were any of these
> questions answered?
> 

Yes.

On 5/26/2010 4:49 PM, Jan Wieck wrote:
> On 5/26/2010 12:38 PM, Greg Stark wrote:
>> > On Wed, May 26, 2010 at 5:10 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
>>> >> ... but to answer that request, actually I don't even think we should be
>>> >> discussing API specifics.
>>> >>
>> > 
>> > How about just API generalities? Like, where do you need this data, on
>> > the master or on the slave? Would PGXC like it on the transaction
>> > coordinator?
>> > 
>> > What question do you need to answer, do you need to pull out sets of
>> > commits in certain ranges or look up specific transaction ids and find
>> > out when they committed? Or do you only need to answer which of two
>> > transaction ids committed first?
> 
> The question I want answered is
> 
>    "what was the order and xid of the next 0..n transactions, that
>    committed after transaction X?"
> 
> Preferably I would avoid scanning the entire available WAL just to get 
> the next n xid's to process.
> 
> The proposal assigned a unique serial number (file segment and position 
> driven) to each xid and used that for the ordering as well as 
> identification of the last known transaction. That is certainly a 
> premature implementation detail.
> 
> In this implementation it wouldn't even matter if a transaction that was 
> recorded actually never made it because it crashed before the WAL flush. 
> It would be reported by this "commit order" feature, but there would be 
> no traces of whatever it did to be found inside the DB, so that anomaly 
> is harmless.
> 
> 
> Jan
> 
> -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
> -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
 



-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Exposing the Xact commit order to the user

From
Jan Wieck
Date:
On 5/27/2010 12:01 PM, Jan Wieck wrote:
> On 5/27/2010 9:59 AM, Greg Stark wrote:
>> This thread has been hard to follow for me. Were any of these
>> questions answered?
>> 
> 
> Yes.

The thing missing is any sort of answer to that problem description.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin



Re: Exposing the Xact commit order to the user

From
Bruce Momjian
Date:
Jan Wieck wrote:
> >> Reading the entire WAL just to find all COMMIT records, then go back to 
> >> the origin database to get the actual replication log you're looking for 
> >> is simpler and more efficient? I don't think so.
> > 
> > Agreed, but I think I've not explained myself well enough.
> > 
> > I proposed two completely separate ideas; the first one was this:
> > 
> > If you must get commit order, get it from WAL on *origin*, using exact
> > same code that current WALSender provides, plus some logic to read
> > through the WAL records and extract commit/aborts. That seems much
> > simpler than the proposal you outlined and as SR shows, its low latency
> > as well since commits write to WAL. No need to generate event ticks
> > either, just use XLogRecPtrs as WALSender already does.
> > 
> > I see no problem with integrating that into core, technically or
> > philosophically.
> > 
> 
> Which means that if I want to allow a consumer of that commit order data 
> to go offline for three days or so to replicate the 5 requested, low 
> volume tables, the origin needs to hang on to the entire WAL log from 
> all 100 other high volume tables?

I suggest writing an external tool that strips out what you need that
can be run at any time, rather than creating a new data format and
overhead for this usecase.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com


Re: Exposing the Xact commit order to the user

From
Robert Haas
Date:
On May 28, 2010, at 7:19 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Jan Wieck wrote:
>>>> Reading the entire WAL just to find all COMMIT records, then go  
>>>> back to
>>>> the origin database to get the actual replication log you're  
>>>> looking for
>>>> is simpler and more efficient? I don't think so.
>>>
>>> Agreed, but I think I've not explained myself well enough.
>>>
>>> I proposed two completely separate ideas; the first one was this:
>>>
>>> If you must get commit order, get it from WAL on *origin*, using  
>>> exact
>>> same code that current WALSender provides, plus some logic to read
>>> through the WAL records and extract commit/aborts. That seems much
>>> simpler than the proposal you outlined and as SR shows, its low  
>>> latency
>>> as well since commits write to WAL. No need to generate event ticks
>>> either, just use XLogRecPtrs as WALSender already does.
>>>
>>> I see no problem with integrating that into core, technically or
>>> philosophically.
>>>
>>
>> Which means that if I want to allow a consumer of that commit order  
>> data
>> to go offline for three days or so to replicate the 5 requested, low
>> volume tables, the origin needs to hang on to the entire WAL log from
>> all 100 other high volume tables?
>
> I suggest writing an external tool that strips out what you need that
> can be run at any time, rather than creating a new data format and
> overhead for this usecase.

That would be FAR more complex, less robust, and less performant -  
whereas doing what Jan has proposed is pretty straightforward and  
should have minimal impact on performance - or none when not enabled.

...Robert


Re: Exposing the Xact commit order to the user

From
Jan Wieck
Date:
On 5/28/2010 7:19 PM, Bruce Momjian wrote:
> Jan Wieck wrote:
>> >> Reading the entire WAL just to find all COMMIT records, then go back to 
>> >> the origin database to get the actual replication log you're looking for 
>> >> is simpler and more efficient? I don't think so.
>> > 
>> > Agreed, but I think I've not explained myself well enough.
>> > 
>> > I proposed two completely separate ideas; the first one was this:
>> > 
>> > If you must get commit order, get it from WAL on *origin*, using exact
>> > same code that current WALSender provides, plus some logic to read
>> > through the WAL records and extract commit/aborts. That seems much
>> > simpler than the proposal you outlined and as SR shows, its low latency
>> > as well since commits write to WAL. No need to generate event ticks
>> > either, just use XLogRecPtrs as WALSender already does.
>> > 
>> > I see no problem with integrating that into core, technically or
>> > philosophically.
>> > 
>> 
>> Which means that if I want to allow a consumer of that commit order data 
>> to go offline for three days or so to replicate the 5 requested, low 
>> volume tables, the origin needs to hang on to the entire WAL log from 
>> all 100 other high volume tables?
> 
> I suggest writing an external tool that strips out what you need that
> can be run at any time, rather than creating a new data format and
> overhead for this usecase.
> 

Stripping it out from what?


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Exposing the Xact commit order to the user

From
Bruce Momjian
Date:
Jan Wieck wrote:
> >> > I see no problem with integrating that into core, technically or
> >> > philosophically.
> >> > 
> >> 
> >> Which means that if I want to allow a consumer of that commit order data 
> >> to go offline for three days or so to replicate the 5 requested, low 
> >> volume tables, the origin needs to hang on to the entire WAL log from 
> >> all 100 other high volume tables?
> > 
> > I suggest writing an external tool that strips out what you need that
> > can be run at any time, rather than creating a new data format and
> > overhead for this usecase.
> > 
> 
> Stripping it out from what?

Stripping it from the WAL.  Your system seems to require double-writes
on a commit, which is something we have avoided in the past.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + None of us is going to be here forever. +


Re: Exposing the Xact commit order to the user

From
Jan Wieck
Date:
On 6/1/2010 11:09 AM, Bruce Momjian wrote:
> Jan Wieck wrote:
>> >> > I see no problem with integrating that into core, technically or
>> >> > philosophically.
>> >> > 
>> >> 
>> >> Which means that if I want to allow a consumer of that commit order data 
>> >> to go offline for three days or so to replicate the 5 requested, low 
>> >> volume tables, the origin needs to hang on to the entire WAL log from 
>> >> all 100 other high volume tables?
>> > 
>> > I suggest writing an external tool that strips out what you need that
>> > can be run at any time, rather than creating a new data format and
>> > overhead for this usecase.
>> > 
>> 
>> Stripping it out from what?
> 
> Stripping it from the WAL.  Your system seems to require double-writes
> on a commit, which is something we have avoided in the past.
> 

Your suggestion seems is based on several false assumptions. This does 
neither require additional physical writes on commit, nor is consuming 
the entire WAL just to filter out commit records anything even remotely 
desirable for systems like Londiste or Slony.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Exposing the Xact commit order to the user

From
Chris Browne
Date:
drkp@csail.mit.edu (Dan Ports) writes:
> I'm not clear on why the total rowcount is useful, but perhaps I'm
> missing something obvious.

It would make it easy to conclude:
  "This next transaction did 8328194 updates.  Maybe we should do  some kind of checkpoint (e.g. - commit transaction
orsuch) before  working on it."
 
   versus
  "This transaction we're thinking of working on had 7 updates.  No  big deal..."
-- 
(reverse (concatenate 'string "ofni.secnanifxunil" "@" "enworbbc"))
http://linuxfinances.info/info/finances.html
Rules of  the Evil Overlord #189. "I  will never tell the  hero "Yes I
was the one who  did it, but you'll never be able  to prove it to that
incompetent  old fool."  Chances  are, that  incompetent  old fool  is
standing behind the curtain."  <http://www.eviloverlord.com/>


Re: Exposing the Xact commit order to the user

From
Chris Browne
Date:
heikki.linnakangas@enterprisedb.com (Heikki Linnakangas) writes:
> On 24/05/10 19:51, Kevin Grittner wrote:
>> The only thing I'm confused about is what benefit anyone expects to
>> get from looking at data between commits in some way other than our
>> current snapshot mechanism.  Can someone explain a use case where
>> what Jan is proposing is better than snapshot isolation?  It doesn't
>> provide any additional integrity guarantees that I can see.
>
> Right, it doesn't. What it provides is a way to reconstruct a snapshot
> at any point in time, after the fact. For example, after transactions
> A, C, D and B have committed in that order, it allows you to
> reconstruct a snapshot just like you would've gotten immediately after
> the commit of A, C, D and B respectively. That's useful replication
> tools like Slony that needs to commit the changes of those
> transactions in the slave in the same order as they were committed in
> the master.
>
> I don't know enough of Slony et al. to understand why that'd be better
> than the current heartbeat mechanism they use, taking a snapshot every
> few seconds, batching commits.

I see two advantages:
a) Identifying things on a transaction-by-transaction basis means that   the snapshots ("syncs") don't need to be
captured,which is   presently an area of fragility.  If the slon daemon falls over on   Friday evening, and nobody
noticesuntil Monday, the "snapshot"   reverts to being all updates between Friday and whenever SYNCs   start to be
collectedagain.
 
   Exposing commit orders eliminates that fragility.  SYNCs don't   need to be captured anymore, so they can't be
missed(which is   today's problem).
 
b) The sequence currently used to control log application ordering is   a bottleneck, as it is a single sequence shared
acrossall   connections.
 
   It could be eliminated in favor of (perhaps) an in-memory variable   defined on a per-connection basis.
   It's not a bottleneck that we hear a lot of complaints about, but   the sequence certainly is a bottleneck.

-- 
select 'cbbrowne' || '@' || 'cbbrowne.com';
http://cbbrowne.com/info/internet.html
"MS  apparently now  has a  team dedicated  to tracking  problems with
Linux  and publicizing them.   I guess  eventually they'll  figure out
this back fires... ;)" -- William Burrow <aa126@DELETE.fan.nb.ca>


Re: Exposing the Xact commit order to the user

From
Greg Stark
Date:
On Wed, Jun 2, 2010 at 6:45 PM, Chris Browne <cbbrowne@acm.org> wrote:
> It would make it easy to conclude:
>
>   "This next transaction did 8328194 updates.  Maybe we should do
>   some kind of checkpoint (e.g. - commit transaction or such) before
>   working on it."
>
>    versus
>
>   "This transaction we're thinking of working on had 7 updates.  No
>   big deal..."

I'm puzzled how you would define this value. How do you add 7 inserts,
7 deletes, and 7 updates? Is that 21 rows modified? Why are the 7
inserts and 7 deletes worth twice as much as the 7 updates when
they're basically the same thing? What if the inserts fired triggers
which inserted 7 more rows, is that 14? What if the 7 updates modified
2 TB of TOAST data but the 8238194 updates were all to the same record
and they were all HOT updates so all it did was change 8kB?

In any case you'll have all the actual data from your triggers or
hooks or whatever so what value does having the system keep track of
this add?



--
greg


Re: Exposing the Xact commit order to the user

From
Jan Wieck
Date:
On 6/2/2010 7:49 PM, Greg Stark wrote:
> On Wed, Jun 2, 2010 at 6:45 PM, Chris Browne <cbbrowne@acm.org> wrote:
>> It would make it easy to conclude:
>>
>>   "This next transaction did 8328194 updates.  Maybe we should do
>>   some kind of checkpoint (e.g. - commit transaction or such) before
>>   working on it."
>>
>>    versus
>>
>>   "This transaction we're thinking of working on had 7 updates.  No
>>   big deal..."
> 
> I'm puzzled how you would define this value. How do you add 7 inserts,
> 7 deletes, and 7 updates? Is that 21 rows modified? Why are the 7
> inserts and 7 deletes worth twice as much as the 7 updates when
> they're basically the same thing? What if the inserts fired triggers
> which inserted 7 more rows, is that 14? What if the 7 updates modified
> 2 TB of TOAST data but the 8238194 updates were all to the same record
> and they were all HOT updates so all it did was change 8kB?
> 
> In any case you'll have all the actual data from your triggers or
> hooks or whatever so what value does having the system keep track of
> this add?

The point is not that we don't have that information now. The point is 
having a hint BEFORE wading through possibly gigabytes of WAL or log data.

If getting that information requires to read all the log data twice or 
the need to read gigabytes of otherwise useless WAL data (as per Bruce's 
suggestion), we better not get it at all and just keep doing what we are 
doing now.

I actually have a hard time understanding why people are so opposed to a 
feature that has zero impact at all unless a DBA actually turns in ON. 
What is the problem with exposing the commit order of transactions?


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Exposing the Xact commit order to the user

From
Bruce Momjian
Date:
Jan Wieck wrote:
> The point is not that we don't have that information now. The point is 
> having a hint BEFORE wading through possibly gigabytes of WAL or log data.
> 
> If getting that information requires to read all the log data twice or 
> the need to read gigabytes of otherwise useless WAL data (as per Bruce's 
> suggestion), we better not get it at all and just keep doing what we are 
> doing now.
> 
> I actually have a hard time understanding why people are so opposed to a 
> feature that has zero impact at all unless a DBA actually turns in ON. 
> What is the problem with exposing the commit order of transactions?

If you want to fork Postgres and add it, go ahead, but if the community
has to maintain the code and document it, we care.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + None of us is going to be here forever. +


Re: Exposing the Xact commit order to the user

From
"Kevin Grittner"
Date:
Jan Wieck <JanWieck@Yahoo.com> wrote:
> I actually have a hard time understanding why people are so
> opposed to a feature that has zero impact at all unless a DBA
> actually turns in ON.  What is the problem with exposing the
> commit order of transactions?
FWIW, once I came to understand the use case, it seems to me a
perfectly reasonable and useful thing to have.  It does strike me
that there may be value to add one more xid to support certain
types of integrity for some use cases, but that's certainly
something which could be added later, if at all.  Once I realized
that, I just dropped out of the discussion; perhaps I should have
bowed out with an endorsement.
Unless my memory is failing me worse than usual, Dan Ports, who is
working on the serializable implementation so he can use the
predicate locking with a transaction-aware caching feature, needs
the ability to track commit order of transactions by xid; so the use
cases go beyond Slony and Londiste.
-Kevin


Re: Exposing the Xact commit order to the user

From
Jan Wieck
Date:
On 6/3/2010 4:04 PM, Bruce Momjian wrote:
> If you want to fork Postgres and add it, go ahead, but if the community
> has to maintain the code and document it, we care.

That comment was rather unprofessional. I think the rest of us still try 
to find the best solution for the problem, not kill the discussion. You 
may want to rejoin that effort.

I care about an efficient, low overhead way to get a certain 
information, that is otherwise extremely difficult, expensive and 
version dependent to get.

I care about cleaning up more of the mistakes, made in the original 
development of Slony. Namely using hacks and kluges to implement 
details, not supported by a current version of PostgreSQL. Londiste and 
Slony made a good leap on that with the txid data type. Slony made 
another step like that with 2.0, switching to the (for that very purpose 
developed and contributed) native trigger configuration instead of 
hacking system catalogs. This would be another step in that direction 
and we would be able to unify Londiste's and Slony's transport mechanism 
and eliminating the tick/sync kluge.

Care to explain what exactly you care about?


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Exposing the Xact commit order to the user

From
Chris Browne
Date:
bruce@momjian.us (Bruce Momjian) writes:
> Jan Wieck wrote:
>> The point is not that we don't have that information now. The point is 
>> having a hint BEFORE wading through possibly gigabytes of WAL or log data.
>> 
>> If getting that information requires to read all the log data twice or 
>> the need to read gigabytes of otherwise useless WAL data (as per Bruce's 
>> suggestion), we better not get it at all and just keep doing what we are 
>> doing now.
>> 
>> I actually have a hard time understanding why people are so opposed to a 
>> feature that has zero impact at all unless a DBA actually turns in ON. 
>> What is the problem with exposing the commit order of transactions?
>
> If you want to fork Postgres and add it, go ahead, but if the community
> has to maintain the code and document it, we care.

Are you "caring" or "opposing"?  It seems rather uncharitable to imply
that Jan doesn't care.

I know *I'm* not interested in a forked Postgres for this - I would
prefer to find out what things could be done that don't involve gross
amounts of WAL file grovelling for data that mayn't necessarily even
be available.
-- 
select 'cbbrowne' || '@' || 'cbbrowne.com';
http://cbbrowne.com/info/internet.html
"MS  apparently now  has a  team dedicated  to tracking  problems with
Linux  and publicizing them.   I guess  eventually they'll  figure out
this back fires... ;)" -- William Burrow <aa126@DELETE.fan.nb.ca>


Re: Exposing the Xact commit order to the user

From
Chris Browne
Date:
gsstark@mit.edu (Greg Stark) writes:
> On Wed, Jun 2, 2010 at 6:45 PM, Chris Browne <cbbrowne@acm.org> wrote:
>> It would make it easy to conclude:
>>
>>   "This next transaction did 8328194 updates.  Maybe we should do
>>   some kind of checkpoint (e.g. - commit transaction or such) before
>>   working on it."
>>
>>    versus
>>
>>   "This transaction we're thinking of working on had 7 updates.  No
>>   big deal..."
>
> I'm puzzled how you would define this value. How do you add 7 inserts,
> 7 deletes, and 7 updates? Is that 21 rows modified? Why are the 7
> inserts and 7 deletes worth twice as much as the 7 updates when
> they're basically the same thing? What if the inserts fired triggers
> which inserted 7 more rows, is that 14? What if the 7 updates modified
> 2 TB of TOAST data but the 8238194 updates were all to the same record
> and they were all HOT updates so all it did was change 8kB?

The presence of those questions (and their ambiguity) is the reason
why there's a little squirming as to whether this is super-useful and
super-necessary.

What this offers is *SOME* idea of how much updating work a particular
transaction did.  It's a bit worse than you suggest:
- If replication triggers have captured tuples, those would get  counted.
- TOAST updates might lead to extra updates being counted.

But back to where you started, I'd anticipate 7 inserts, 7 deletes,
and 7 updates being counted as something around 21 updates.

And if that included 5 TOAST changes, it might bump up to 26.

If there were replication triggers in place, that might bump the count
up to 45 (which I chose arbitrarily).

> In any case you'll have all the actual data from your triggers or
> hooks or whatever so what value does having the system keep track of
> this add?

This means that when we'd pull the list of transactions to consider,
we'd get something like:
select * from next_transactions('4218:23', 50);

[list of 50 transactions returned, each with...  -> txid  -> START timestamp  -> COMMIT timestamp  -> Approximate # of
updates

Then, for each of the 50, I'd pull replication log data for the
corresponding transaction.

If I have the approximate # of updates, that might lead me to stop
short, and say:
 "That next update looks like a doozy!  I'm going to stop and commit what I've got before doing that one."

It's not strictly necessary, but would surely be useful for flow
control.
-- 
select 'cbbrowne' || '@' || 'cbbrowne.com';
http://cbbrowne.com/info/internet.html
"MS  apparently now  has a  team dedicated  to tracking  problems with
Linux  and publicizing them.   I guess  eventually they'll  figure out
this back fires... ;)" -- William Burrow <aa126@DELETE.fan.nb.ca>


Re: Exposing the Xact commit order to the user

From
Greg Stark
Date:
On Thu, Jun 3, 2010 at 8:50 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
>> I'm puzzled how you would define this value. How do you add 7 inserts,
>> 7 deletes, and 7 updates? Is that 21 rows modified?
>
> I actually have a hard time understanding why people are so opposed to a
> feature that has zero impact at all unless a DBA actually turns in ON. What
> is the problem with exposing the commit order of transactions?

The post you were responding to was regarding the meaninglessness of
the "number of records" attribute you wanted. Your response is a non
sequitor.

I think the commit order of transactions would be a good thing to
expose though I've asked repeatedly what kind of interface you need
and never gotten answers to all the questions.




-- 
greg


Re: Exposing the Xact commit order to the user

From
"Kevin Grittner"
Date:
Greg Stark <gsstark@mit.edu> wrote:
> what kind of interface you need
For the potential uses I can see, it would be great to have a SRF
which took two parameters: xid of last known commit and a limit how
many commits past that to return.  Perhaps a negative number could
move earlier in time, if that seems reasonable to others.  I think
that's also consistent with Jan's posts.  A GUC to enable it and
some way to specify retention (or force cleanup) are the only other
user-facing features which come to mind for me.  (Not sure what form
that last should take, but didn't Jan say something about both of
these early in the thread?)
Do you see a need for something else (besides, obviously, docs)?
-Kevin


Re: Exposing the Xact commit order to the user

From
"Kevin Grittner"
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
> Greg Stark <gsstark@mit.edu> wrote:
>  
>> what kind of interface you need
>  
> For the potential uses I can see, it would be great to have a SRF
> which took two parameters: xid of last known commit and a limit
> how many commits past that to return.
Jan's very first post had it right; my idea was flawed:
| Exposing the data will be done via a set returning function. The
| SRF takes two arguments. The maximum number of rows to return and
| the last serial number processed by the reader. The advantage of
| such SRF is that the result can be used in a query that right away
| delivers audit or replication log information in transaction
| commit order. The SRF can return an empty set if no further
| transactions have committed since, or an error if data segments
| needed to answer the request have already been purged.
| 
| Purging of the data will be possible in several different ways.
| Autovacuum will call a function that drops segments of the data
| that are outside the postgresql.conf configuration with respect to
| maximum age or data volume. There will also be a function reserved
| for superusers to explicitly purge the data up to a certain serial
| number.
Apologies for not looking back to the start of the thread before
that last post.  It was all laid out right at the start.
-Kevin


Re: Exposing the Xact commit order to the user

From
Bruce Momjian
Date:
Jan Wieck wrote:
> On 6/3/2010 4:04 PM, Bruce Momjian wrote:
> > If you want to fork Postgres and add it, go ahead, but if the community
> > has to maintain the code and document it, we care.
> 
> That comment was rather unprofessional. I think the rest of us still try 
> to find the best solution for the problem, not kill the discussion. You 
> may want to rejoin that effort.
> 
> I care about an efficient, low overhead way to get a certain 
> information, that is otherwise extremely difficult, expensive and 
> version dependent to get.
> 
> I care about cleaning up more of the mistakes, made in the original 
> development of Slony. Namely using hacks and kluges to implement 
> details, not supported by a current version of PostgreSQL. Londiste and 
> Slony made a good leap on that with the txid data type. Slony made 
> another step like that with 2.0, switching to the (for that very purpose 
> developed and contributed) native trigger configuration instead of 
> hacking system catalogs. This would be another step in that direction 
> and we would be able to unify Londiste's and Slony's transport mechanism 
> and eliminating the tick/sync kluge.
> 
> Care to explain what exactly you care about?

Here is what I was replying to:

> >> I actually have a hard time understanding why people are so opposed t$
> > >> feature that has zero impact at all unless a DBA actually turns in ON.
> >> What is the problem with exposing the commit order of transactions?

Jan's comment is why should others care what he wants because it has
zero impact?  I am saying the community cares because we have to
maintain the code.  I stand by my comment.

I remember a dismissive comment by Jan when 'session_replication_role'
was added, and a similar strong comment from me at that time as well. 
It seems we are doing this again.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + None of us is going to be here forever. +


Re: Exposing the Xact commit order to the user

From
Bruce Momjian
Date:
Bruce Momjian wrote:
> Jan Wieck wrote:
> > On 6/3/2010 4:04 PM, Bruce Momjian wrote:
> > > If you want to fork Postgres and add it, go ahead, but if the community
> > > has to maintain the code and document it, we care.
> > 
> > That comment was rather unprofessional. I think the rest of us still try 
> > to find the best solution for the problem, not kill the discussion. You 
> > may want to rejoin that effort.
> > 
> > I care about an efficient, low overhead way to get a certain 
> > information, that is otherwise extremely difficult, expensive and 
> > version dependent to get.
> > 
> > I care about cleaning up more of the mistakes, made in the original 
> > development of Slony. Namely using hacks and kluges to implement 
> > details, not supported by a current version of PostgreSQL. Londiste and 
> > Slony made a good leap on that with the txid data type. Slony made 
> > another step like that with 2.0, switching to the (for that very purpose 
> > developed and contributed) native trigger configuration instead of 
> > hacking system catalogs. This would be another step in that direction 
> > and we would be able to unify Londiste's and Slony's transport mechanism 
> > and eliminating the tick/sync kluge.
> > 
> > Care to explain what exactly you care about?
> 
> Here is what I was replying to:
> 
> > >> I actually have a hard time understanding why people are so opposed t$
> > > >> feature that has zero impact at all unless a DBA actually turns in ON.
> > >> What is the problem with exposing the commit order of transactions?
> 
> Jan's comment is why should others care what he wants because it has
> zero impact?  I am saying the community cares because we have to
> maintain the code.  I stand by my comment.
> 
> I remember a dismissive comment by Jan when 'session_replication_role'
> was added, and a similar strong comment from me at that time as well. 
> It seems we are doing this again.

Of course, if I am misintepreting what Jan said, please let me know.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + None of us is going to be here forever. +


Re: Exposing the Xact commit order to the user

From
Alvaro Herrera
Date:
Excerpts from Chris Browne's message of jue jun 03 16:21:35 -0400 2010:

> What this offers is *SOME* idea of how much updating work a particular
> transaction did.  It's a bit worse than you suggest:
> 
>  - If replication triggers have captured tuples, those would get
>    counted.
> 
>  - TOAST updates might lead to extra updates being counted.
> 
> But back to where you started, I'd anticipate 7 inserts, 7 deletes,
> and 7 updates being counted as something around 21 updates.
> 
> And if that included 5 TOAST changes, it might bump up to 26.
> 
> If there were replication triggers in place, that might bump the count
> up to 45 (which I chose arbitrarily).

Why not send separate numbers of tuple inserts/updates/deletes, which we
already have from pgstats?

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Exposing the Xact commit order to the user

From
Jan Wieck
Date:
On 6/3/2010 5:58 PM, Greg Stark wrote:
> On Thu, Jun 3, 2010 at 8:50 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
>>> I'm puzzled how you would define this value. How do you add 7 inserts,
>>> 7 deletes, and 7 updates? Is that 21 rows modified?
>>
>> I actually have a hard time understanding why people are so opposed to a
>> feature that has zero impact at all unless a DBA actually turns in ON. What
>> is the problem with exposing the commit order of transactions?
> 
> The post you were responding to was regarding the meaninglessness of
> the "number of records" attribute you wanted. Your response is a non
> sequitor.

I never proposed a "number of records" attribute. I proposed a sum of 
the row counts in the statistics collector. That row count would be a 
mix of insert, update, delete and toast operations. It's not an exact 
indicator of anything, but a good enough hint of how much data may come 
down the pipe if I were to select all replication data belonging to that 
transaction.

> 
> I think the commit order of transactions would be a good thing to
> expose though I've asked repeatedly what kind of interface you need
> and never gotten answers to all the questions.

In the original email that started this whole thread I wrote:

> Exposing the data will be done via a set returning function. The SRF
> takes two arguments. The maximum number of rows to return and the last
> serial number processed by the reader. The advantage of such SRF is that
> the result can be used in a query that right away delivers audit or
> replication log information in transaction commit order. The SRF can
> return an empty set if no further transactions have committed since, or
> an error if data segments needed to answer the request have already been
> purged.

Did that not answer your question?


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Exposing the Xact commit order to the user

From
Jan Wieck
Date:
On 6/3/2010 6:24 PM, Kevin Grittner wrote:

> Apologies for not looking back to the start of the thread before
> that last post.  It was all laid out right at the start.

No need to apologize. Happens.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Exposing the Xact commit order to the user

From
Jan Wieck
Date:
On 6/3/2010 7:11 PM, Alvaro Herrera wrote:
> Excerpts from Chris Browne's message of jue jun 03 16:21:35 -0400 2010:
> 
>> What this offers is *SOME* idea of how much updating work a particular
>> transaction did.  It's a bit worse than you suggest:
>> 
>>  - If replication triggers have captured tuples, those would get
>>    counted.
>> 
>>  - TOAST updates might lead to extra updates being counted.
>> 
>> But back to where you started, I'd anticipate 7 inserts, 7 deletes,
>> and 7 updates being counted as something around 21 updates.
>> 
>> And if that included 5 TOAST changes, it might bump up to 26.
>> 
>> If there were replication triggers in place, that might bump the count
>> up to 45 (which I chose arbitrarily).
> 
> Why not send separate numbers of tuple inserts/updates/deletes, which we
> already have from pgstats?
> 

We only have them for the entire database. The purpose of this is just a 
guesstimate about what data volume to expect if I were to select all log 
from a particular transaction.

This datum isn't critical, just handy for the overall feature to be useful.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Exposing the Xact commit order to the user

From
Robert Haas
Date:
On Thu, Jun 3, 2010 at 6:29 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Jan Wieck wrote:
>> On 6/3/2010 4:04 PM, Bruce Momjian wrote:
>> > If you want to fork Postgres and add it, go ahead, but if the community
>> > has to maintain the code and document it, we care.
>>
>> That comment was rather unprofessional. I think the rest of us still try
>> to find the best solution for the problem, not kill the discussion. You
>> may want to rejoin that effort.
>>
>> I care about an efficient, low overhead way to get a certain
>> information, that is otherwise extremely difficult, expensive and
>> version dependent to get.
>>
>> I care about cleaning up more of the mistakes, made in the original
>> development of Slony. Namely using hacks and kluges to implement
>> details, not supported by a current version of PostgreSQL. Londiste and
>> Slony made a good leap on that with the txid data type. Slony made
>> another step like that with 2.0, switching to the (for that very purpose
>> developed and contributed) native trigger configuration instead of
>> hacking system catalogs. This would be another step in that direction
>> and we would be able to unify Londiste's and Slony's transport mechanism
>> and eliminating the tick/sync kluge.
>>
>> Care to explain what exactly you care about?
>
> Here is what I was replying to:
>
>> >> I actually have a hard time understanding why people are so opposed t$
>> > >> feature that has zero impact at all unless a DBA actually turns in ON.
>> >> What is the problem with exposing the commit order of transactions?
>
> Jan's comment is why should others care what he wants because it has
> zero impact?  I am saying the community cares because we have to
> maintain the code.  I stand by my comment.
>
> I remember a dismissive comment by Jan when 'session_replication_role'
> was added, and a similar strong comment from me at that time as well.
> It seems we are doing this again.

I think it's entirely legitimate and proper for us to make a decision
about whether this feature is worth including in core PostgreSQL.  We
are obviously not in the business of adding random features solely for
the benefit of third-party applications.  That having been said, there
are several reasons why I believe that this particular feature is an
excellent candidate for inclusion in core.

1. It solves a problem for which there is no easy workaround.
Rereading all the WAL to extract the commit records is not an easy
workaround, nor is what Slony and Londiste are doing now.

2. It is usable by multiple projects, not just one.  It may well have
applications beyond replication (e.g. distributed transactions), but
at a very minimum it is usable by and useful to multiple replication
solutions.

3. It has a clear specification which can be easily understood even by
people who do not fully understand how replication solutions will make
use of it, which makes code maintenance much less burdensome.
Obviously, Jan's original email on this topic was just a sketch, but I
find it to be pretty clear.

4. We have an existing precedent of being willing to add limited
support into core to allow replication solutions to do their thing
(session_replication_role, ALTER TABLE ... ENABLE REPLICA TRIGGER,
etc).  Even though we now have built-in replication via HS and SR,
there is still a BIG use case for Slony, Londiste, and other add-on
tools.  Making those tools more successful and performant is good for
PostgreSQL.

5. It does not involve highly invasive changes to the core code.

6. It can be turned off for users who don't want it.

I find the skeptical attitude on this thread altogether unwarranted.
Jan made his case and, at least IMHO, presented it pretty clearly.  He
then answered, multiple times, numerous questions which were already
addressed in the original email, as well as various others.  I think
we should be very careful about assuming that we understand
replication and its needs better than someone who has spent many years
developing one of the major PostgreSQL replication solutions.
Independent of Jan's qualifications, there are clearly several people
on this thread who understand why this is useful and valuable,
including me.  I am obviously not in a position to insist that we
accept this feature (assuming Jan produces a patch rather than getting
discouraged and giving up) but I would like us to think very, very
carefully before rejecting it, and not to do so unless we have a DARN
good reason.  Most patches add code, and therefore require code
maintenance - that is not, by itself, a reason to reject them.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: Exposing the Xact commit order to the user

From
Bruce Momjian
Date:
Bruce Momjian wrote:
> > I care about cleaning up more of the mistakes, made in the original 
> > development of Slony. Namely using hacks and kluges to implement 
> > details, not supported by a current version of PostgreSQL. Londiste and 
> > Slony made a good leap on that with the txid data type. Slony made 
> > another step like that with 2.0, switching to the (for that very purpose 
> > developed and contributed) native trigger configuration instead of 
> > hacking system catalogs. This would be another step in that direction 
> > and we would be able to unify Londiste's and Slony's transport mechanism 
> > and eliminating the tick/sync kluge.
> > 
> > Care to explain what exactly you care about?
> 
> Here is what I was replying to:
> 
> > >> I actually have a hard time understanding why people are so opposed t$
> > > >> feature that has zero impact at all unless a DBA actually turns in ON.
> > >> What is the problem with exposing the commit order of transactions?
> 
> Jan's comment is why should others care what he wants because it has
> zero impact?  I am saying the community cares because we have to
> maintain the code.  I stand by my comment.
> 
> I remember a dismissive comment by Jan when 'session_replication_role'
> was added, and a similar strong comment from me at that time as well. 
> It seems we are doing this again.

FYI, I talked to Jan on the phone and we have resolved this issue.  :-)

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + None of us is going to be here forever. +


Re: Exposing the Xact commit order to the user

From
Greg Stark
Date:
On Fri, Jun 4, 2010 at 2:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I find the skeptical attitude on this thread altogether unwarranted.
> Jan made his case and, at least IMHO, presented it pretty clearly.

Just to be clear I think the idea of exposing commit order is a
no-brainer.  The specific interface is what I was questioning.

A function which takes a starting xid and a number of transactions to
return seems very tied to one particular application. I could easily
see other systems such as a multi-master system instead only wanting
to compare two transactions to find out which committed first. Or
non-replication applications where you have an LSN and want to know
whether a given transaction had committed by that time.

So one possible interface would be to do something like
xids_committed_between(lsn_start, lsn_end) -- and yes, possibly with
an optional argument to limit the number or records returned.

So you could do:

old := select pg_current_xlog_location();
while (1)
{ sleep 60s; new := select pg_current_xlog_location() process(select xids_committed_between(old,new)) old := new
}

This might be more useful for PITR recovery for example where you want
to find out what transactions committed between now and some known
point of corruption.

I could also see it being useful to have a function
pg_xlog_location_of_commit(xid). That would let you run recovery until
a particular transaction committed or test whether your replica is
caught up to a particular commit. It could be useful for monitoring
Hot Standby slaves.


>  He
> then answered, multiple times, numerous questions which were already
> addressed in the original email, as well as various others.

I think I did miss some of the original description. That might have
caused some of the difficulty as I was asking questions about
something he assumed he had already answered.

>  I think
> we should be very careful about assuming that we understand
> replication and its needs better than someone who has spent many years
> developing one of the major PostgreSQL replication solutions.

Well the flip side of that is that we want an interface that's useful
for more than just one replication system. This is something basic
enough that I think it will be useful for more than just replication
if we design it generally enough. It should be useful for
backup/restore processes and monitoring as well as various forms of
replication including master-slave trigger based systems but also
including PITR-based replication, log-parsing systems, multi-master
trigger based systems, 2PC-based systems, etc.



--
greg


Re: Exposing the Xact commit order to the user

From
Robert Haas
Date:
On Fri, Jun 4, 2010 at 10:44 AM, Greg Stark <gsstark@mit.edu> wrote:
> On Fri, Jun 4, 2010 at 2:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I find the skeptical attitude on this thread altogether unwarranted.
>> Jan made his case and, at least IMHO, presented it pretty clearly.
>
> Just to be clear I think the idea of exposing commit order is a
> no-brainer.  The specific interface is what I was questioning.

OK, thanks for that clarification.

> A function which takes a starting xid and a number of transactions to
> return seems very tied to one particular application. I could easily
> see other systems such as a multi-master system instead only wanting
> to compare two transactions to find out which committed first. Or
> non-replication applications where you have an LSN and want to know
> whether a given transaction had committed by that time.
>
> So one possible interface would be to do something like
> xids_committed_between(lsn_start, lsn_end) -- and yes, possibly with
> an optional argument to limit the number or records returned.

I'm imagining that the backend data storage for this would be a file
containing, essentially, a struct for each commit repeated over and
over again, packed tightly.  It's easy to index into such a file using
a sequence number (give me the 1000'th commit) but searching by LSN
would require (a) storing the LSNs and (b) binary search.  Maybe it's
worth adding that complexity, but I'm not sure that it is.  Keeping
the size of this file small is important for ensuring that it has
minimal performance impact (which is also why I'm not sold on trying
to include the tuple counters that Jan proposed - I think we can solve
the problem he's worried about there more cleanly in other ways).

> So you could do:
>
> old := select pg_current_xlog_location();
> while (1)
> {
>  sleep 60s;
>  new := select pg_current_xlog_location()
>  process(select xids_committed_between(old,new))
>  old := new
> }
>
> This might be more useful for PITR recovery for example where you want
> to find out what transactions committed between now and some known
> point of corruption.

This could also be done by selecting the current commit sequence
number, getting the XIDs committed between the two commit sequence
numbers, etc.

> I could also see it being useful to have a function
> pg_xlog_location_of_commit(xid). That would let you run recovery until
> a particular transaction committed or test whether your replica is
> caught up to a particular commit. It could be useful for monitoring
> Hot Standby slaves.

Well, you'd need to index the commit data to make that work, I think,
so that adds a lot of complexity.  The implementation as proposed lets
you find the commits after a known point in order of occurrence, but
it doesn't let you inquire about the location of a particular commit.
If you want to run recovery until a particular transaction commits, we
could teach the recovery code to look for the commit record for that
XID and then pause at that point (or just before that point, if
someone wanted that as an alternative behavior), which would be much
simpler than using this mechanism.  And if you want to check whether
slaves are caught up, it would probably be better to use LSN rather
than commits, because you could be caught up on commits but way behind
on WAL replay.

>>  I think
>> we should be very careful about assuming that we understand
>> replication and its needs better than someone who has spent many years
>> developing one of the major PostgreSQL replication solutions.
>
> Well the flip side of that is that we want an interface that's useful
> for more than just one replication system. This is something basic
> enough that I think it will be useful for more than just replication
> if we design it generally enough. It should be useful for
> backup/restore processes and monitoring as well as various forms of
> replication including master-slave trigger based systems but also
> including PITR-based replication, log-parsing systems, multi-master
> trigger based systems, 2PC-based systems, etc.

Making it general enough to serve multiple needs is good, but we've
got to make sure that the extra complexity is buying us something.
Jan seems pretty confident that this could be used by Londiste also,
though it would be nice to have some confirmation from the Londiste
developer(s) on that.  I think it may also have applications for
distributed transactions and multi-master replication, but I am not
too sure it helps much for PITR-based replication or log-parsing
systems.  We want to design something that is good, but trying to
solve too many problems may end up solving none of them well.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: Exposing the Xact commit order to the user

From
Alvaro Herrera
Date:
Excerpts from Jan Wieck's message of jue jun 03 19:52:19 -0400 2010:
> On 6/3/2010 7:11 PM, Alvaro Herrera wrote:

> > Why not send separate numbers of tuple inserts/updates/deletes, which we
> > already have from pgstats?
> 
> We only have them for the entire database. The purpose of this is just a 
> guesstimate about what data volume to expect if I were to select all log 
> from a particular transaction.

But we already have per table counters.  Couldn't we aggregate them per
transaction as well, if this feature is enabled?  I'm guessing that this
is going to have some uses besides Slony; vague measurements could turn
out to be unusable for some of these.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Exposing the Xact commit order to the user

From
Jan Wieck
Date:
On 6/4/2010 10:44 AM, Greg Stark wrote:
> On Fri, Jun 4, 2010 at 2:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I find the skeptical attitude on this thread altogether unwarranted.
>> Jan made his case and, at least IMHO, presented it pretty clearly.
> 
> Just to be clear I think the idea of exposing commit order is a
> no-brainer.  The specific interface is what I was questioning.
> 
> A function which takes a starting xid and a number of transactions to
> return seems very tied to one particular application. I could easily
> see other systems such as a multi-master system instead only wanting
> to compare two transactions to find out which committed first. Or
> non-replication applications where you have an LSN and want to know
> whether a given transaction had committed by that time.

Read the proposal again. I mean the original mail that started this 
tread. The function does NOT take an xid as argument.

Being able to compare two xid's against each other with respect to their 
commit order is eventually useful. The serial number of the data set, 
returned by the SRF as proposed, would perfectly satisfy that need. But 
not the way you envision for multimaster. Multimaster would ask "did xid 
X from server A commit before or after xid Y from server B?" That is a 
question completely outside the scope of this proposal.

Please keep it real.


Jan




-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Exposing the Xact commit order to the user

From
Jan Wieck
Date:
On 6/4/2010 12:52 PM, Alvaro Herrera wrote:
> Excerpts from Jan Wieck's message of jue jun 03 19:52:19 -0400 2010:
>> On 6/3/2010 7:11 PM, Alvaro Herrera wrote:
> 
>> > Why not send separate numbers of tuple inserts/updates/deletes, which we
>> > already have from pgstats?
>> 
>> We only have them for the entire database. The purpose of this is just a 
>> guesstimate about what data volume to expect if I were to select all log 
>> from a particular transaction.
> 
> But we already have per table counters.  Couldn't we aggregate them per
> transaction as well, if this feature is enabled?  I'm guessing that this
> is going to have some uses besides Slony; vague measurements could turn
> out to be unusable for some of these.

We have them per table and per index, summarized over all transactions. 
It is debatable if bloating this feature with detailed statistics is 
useful or not, but I'd rather not have that bloat at the beginning, 
because otherwise I know exactly what is going to happen. People will 
just come back and say "zero impact my a..".


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Exposing the Xact commit order to the user

From
Marko Kreen
Date:
On 6/4/10, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jun 4, 2010 at 10:44 AM, Greg Stark <gsstark@mit.edu> wrote:
>  > A function which takes a starting xid and a number of transactions to
>  > return seems very tied to one particular application. I could easily
>  > see other systems such as a multi-master system instead only wanting
>  > to compare two transactions to find out which committed first. Or
>  > non-replication applications where you have an LSN and want to know
>  > whether a given transaction had committed by that time.
>  >
>  > So one possible interface would be to do something like
>  > xids_committed_between(lsn_start, lsn_end) -- and yes, possibly with
>  > an optional argument to limit the number or records returned.
>
>
> I'm imagining that the backend data storage for this would be a file
>  containing, essentially, a struct for each commit repeated over and
>  over again, packed tightly.  It's easy to index into such a file using
>  a sequence number (give me the 1000'th commit) but searching by LSN
>  would require (a) storing the LSNs and (b) binary search.  Maybe it's
>  worth adding that complexity, but I'm not sure that it is.  Keeping
>  the size of this file small is important for ensuring that it has
>  minimal performance impact (which is also why I'm not sold on trying
>  to include the tuple counters that Jan proposed - I think we can solve
>  the problem he's worried about there more cleanly in other ways).

AIUI, you index the file by offset.

>  >>  I think
>  >> we should be very careful about assuming that we understand
>  >> replication and its needs better than someone who has spent many years
>  >> developing one of the major PostgreSQL replication solutions.
>  >
>  > Well the flip side of that is that we want an interface that's useful
>  > for more than just one replication system. This is something basic
>  > enough that I think it will be useful for more than just replication
>  > if we design it generally enough. It should be useful for
>  > backup/restore processes and monitoring as well as various forms of
>  > replication including master-slave trigger based systems but also
>  > including PITR-based replication, log-parsing systems, multi-master
>  > trigger based systems, 2PC-based systems, etc.
>
>
> Making it general enough to serve multiple needs is good, but we've
>  got to make sure that the extra complexity is buying us something.
>  Jan seems pretty confident that this could be used by Londiste also,
>  though it would be nice to have some confirmation from the Londiste
>  developer(s) on that.  I think it may also have applications for
>  distributed transactions and multi-master replication, but I am not
>  too sure it helps much for PITR-based replication or log-parsing
>  systems.  We want to design something that is good, but trying to
>  solve too many problems may end up solving none of them well.

The potential for single shared queue implementation, with
the additional potential for merging async replication
implementations sounds attractive.  (Merging ~ having
single one that satisfies broad range of needs.)

Unless the functionality accepted into core will be limited
to replication only and/or performs worse than current
snapshot-based grouping.  Then it is uninteresting, of course.

Jan's proposal of storing small struct into segmented files
sounds like it could work.  Can't say anything more because
I can't imagine it as well as Jan.  Would need to play with
working implementation to say more...

-- 
marko


Re: Exposing the Xact commit order to the user

From
Alvaro Herrera
Date:
Excerpts from Marko Kreen's message of jue jun 10 18:10:50 -0400 2010:

> Jan's proposal of storing small struct into segmented files
> sounds like it could work.  Can't say anything more because
> I can't imagine it as well as Jan.  Would need to play with
> working implementation to say more...

We already have such a thing -- see pg_multixact

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Exposing the Xact commit order to the user

From
Simon Riggs
Date:
On Sun, 2010-05-23 at 16:21 -0400, Jan Wieck wrote:

> In some systems (data warehousing, replication), the order of commits is
> important, since that is the order in which changes have become visible.
> This information could theoretically be extracted from the WAL, but
> scanning the entire WAL just to extract this tidbit of information would
> be excruciatingly painful.

This idea had support from at least 6 hackers. I'm happy to add my own.

Can I suggest it is added as a hook, rather than argue about the details
too much? The main use case is in combination with external systems, so
that way we can maintain the relevant code with the system that cares
about it.

> CommitTransaction() inside of xact.c will call a function, that inserts
> a new record into this array. The operation will for most of the time be
> nothing than taking a spinlock and adding the record to shared memory.
> All the data for the record is readily available, does not require
> further locking and can be collected locally before taking the spinlock.
> The begin_timestamp is the transactions idea of CURRENT_TIMESTAMP, the
> commit_timestamp is what CommitTransaction() just decided to write into
> the WAL commit record and the total_rowcount is the sum of inserted,
> updated and deleted heap tuples during the transaction, which should be
> easily available from the statistics collector, unless row stats are
> disabled, in which case the datum would be zero.

Does this need to be called while in a critical section? Or can we wait
until after the actual marking of the commit before calling this?

> Checkpoint handling will call a function to flush the shared buffers.
> Together with this, the information from WAL records will be sufficient
> to recover this data (except for row counts) during crash recovery.

So it would need to work identically in recovery also?

These two values are not currently stored in the commit WAL record.

timestamptz   xci_begin_timestamp
int64         xci_total_rowcount

Both of those seem optional, so I don't really want them added to WAL.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services