Thread: Proposal for CSN based snapshots

Proposal for CSN based snapshots

From
Ants Aasma
Date:
Given the recent ideas being thrown about changing how freezing and
clog is handled and MVCC catalog access I thought I would write out
the ideas that I have had about speeding up snapshots in case there is
an interesting tie in with the current discussions.

To refresh your memory the basic idea is to change visibility
determination to be based on a commit sequence number (CSN for short)
- an 8 byte number incremented on every commit representing the total
ordering of commits. To take a snapshot in this scheme you only need
to know the value of the last assigned CSN; all transactions with XID
less than or equal to that number were committed at the time of the
snapshot, everything above wasn't committed. Besides speeding up
snapshot taking, this scheme can also be a building block for
consistent snapshots in a multi-master environment with minimal
communication. Google's Spanner database uses snapshots based on a
similar scheme.

The main tricky part about this scheme is finding the CSN that was
assigned to each XID in the face of arbitrarily long transactions and
snapshots, using only a bounded amount of shared memory. The secondary
tricky part is doing this in a way that doesn't need locks for
visibility determination, as that would kill any hope of a performance
gain.

We need to keep around CSN slots for all currently running
transactions and CSN slots of transactions that are concurrent with
any active CSN based snapshot (xid.csn > min(snapshot.csn)). To do
this I propose the following data structures for the XID-to-CSN
mapping. For the most recently assigned XIDs there is a ringbuffer of
slots that contain the CSN values of the XIDs, or special CSN values
for transactions that haven't completed yet, aborted transactions or
subtransactions. I call this the dense CSN map. Looking up the CSN of
an XID in the ringbuffer is trivial direct indexing into the ring
buffer.

For long running transactions the ringbuffer may do a full circle
before a transaction commits. Such CSN slots along with slots that are
needed for older snapshots are evicted from the dense buffer into a
sorted array of XID-CSN pairs, or the sparse mapping. For locking
purposes there are two sparse buffers, one of them being active the
other inactive, more on that later. Looking up the CSN value of a XID
that has been evicted into the sparse mapping is a matter of
performing a binary search to find the slot and reading the CSN value.
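
A sketch of the sparse map lookup, with equally made-up names. Real
XID comparisons would need wraparound-aware ordering (along the lines
of TransactionIdPrecedes), which is glossed over here:

#include <stdbool.h>
#include <stdint.h>

typedef struct XidCSNPair
{
    uint32_t xid;
    uint32_t pad;       /* pads the slot to 16 bytes, see sizing below */
    uint64_t csn;
} XidCSNPair;

/* Binary search in a sparse map kept sorted by xid. Returns false if
   the xid has no slot (evicted to the CSN log, or never tracked). */
static bool
SparseMapLookup(const XidCSNPair *map, int nentries,
                uint32_t xid, uint64_t *csn)
{
    int lo = 0, hi = nentries - 1;

    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;

        if (map[mid].xid == xid)
        {
            *csn = map[mid].csn;
            return true;
        }
        else if (map[mid].xid < xid)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return false;
}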

Because snapshots can remain active for an unbounded amount of time
and there can be an unbounded number of active snapshots, even the
sparse mapping can fill up. To handle that case, each backend
advertises its lowest snapshot number csn_min. When values need to be
evicted from the sparse mapping, they are evicted in CSN order and
written into the CSN log - a series of CSN-XID pairs. Backends that
may still be concerned about those values are then notified that
values they might need have been evicted. Backends with invalidated
snapshots can then convert their snapshots to regular
list-of-concurrent-XIDs snapshots at their leisure.

To convert a CSN based snapshot to an XID based one, a backend would
first scan the shared memory structures for XIDs up to snapshot.xmax
with CSNs that are concurrent to the snapshot and insert those XIDs
into the snapshot, then read the CSN log starting from the snapshot's
CSN, looking for XIDs less than the snapshot's xmax. After this the
snapshot can be handled like current snapshots are handled.
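
In rough C the conversion could look like the sketch below. The flat
arrays stand in for the shared maps and the CSN log; everything here
is illustrative:

#include <stdint.h>

typedef struct { uint32_t xid; uint64_t csn; } XidCsn;

/* Convert a CSN snapshot into a conventional list of concurrent XIDs.
   An xid is concurrent with the snapshot if it committed with a CSN
   greater than the snapshot's CSN (or hasn't committed at all). */
static void
ConvertToXidSnapshot(uint64_t snap_csn, uint32_t snap_xmax,
                     const XidCsn *shared, int nshared, /* dense + sparse */
                     const XidCsn *csnlog, int nlog,    /* from snap_csn on */
                     uint32_t *xip, int *xcnt)
{
    *xcnt = 0;

    /* Step 1: xids still in the shared maps with concurrent CSNs. */
    for (int i = 0; i < nshared; i++)
        if (shared[i].xid < snap_xmax && shared[i].csn > snap_csn)
            xip[(*xcnt)++] = shared[i].xid;

    /* Step 2: xids already evicted into the CSN log. */
    for (int i = 0; i < nlog; i++)
        if (csnlog[i].xid < snap_xmax && csnlog[i].csn > snap_csn)
            xip[(*xcnt)++] = csnlog[i].xid;

    /* From here on the snapshot is checked like any list-of-XIDs
       snapshot (sort xip first if the consumer expects that). */
}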

A more detailed view of synchronization primitives required for common
operations follows.

Taking a new snapshot
---------------------

Taking a CSN based snapshot under this scheme would consist of reading
xmin, csn and xmax from global variables, unlocked and in that order,
with read barriers between each load. If this is our oldest snapshot
we write our csn_min into pgproc, do a full memory barrier and check a
global variable to see whether the CSN we used is still guaranteed not
to be evicted (invalidation here is exceedingly unlikely but cannot be
ruled out).
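
A sketch of that sequence using C11 atomics for the barriers (the
proposal itself doesn't assume any particular barrier API, so all the
names and primitives here are placeholders):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

extern _Atomic uint32_t globalXmin;       /* hypothetical shared globals */
extern _Atomic uint64_t nextCsn;
extern _Atomic uint32_t nextXid;
extern _Atomic uint64_t myCsnMin;         /* this backend's pgproc field */
extern _Atomic uint64_t csnMinGuarantee;  /* published by evictions */

static void
TakeCsnSnapshot(uint32_t *xmin, uint64_t *csn, uint32_t *xmax,
                bool is_oldest_snapshot)
{
    for (;;)
    {
        /* acquire loads double as the read barriers between the loads */
        *xmin = atomic_load_explicit(&globalXmin, memory_order_acquire);
        *csn  = atomic_load_explicit(&nextCsn, memory_order_acquire);
        *xmax = atomic_load_explicit(&nextXid, memory_order_acquire);

        if (!is_oldest_snapshot)
            return;

        /* Publish csn_min, full barrier, then recheck the guarantee. */
        atomic_store_explicit(&myCsnMin, *csn, memory_order_seq_cst);
        if (atomic_load_explicit(&csnMinGuarantee,
                                 memory_order_seq_cst) <= *csn)
            return;
        /* exceedingly unlikely: our CSN may already be evicted, retry */
    }
}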

The read barrier between reading xmin and csn is needed so that the
guarantee applies that no transaction with tx.xid < ss.xmin could have
committed with tx.csn >= ss.csn, so xmin can be used to safely exclude
CSN lookups. The read barrier between reading csn and xmax is needed
to guarantee that if tx.xid >= ss.xmax, then it's known that tx.csn >=
ss.csn. From the write side, there needs to be at least one full
memory barrier between GetNewTransactionId updating nextXid and
CommitTransaction updating nextCsn, which is quite easily satisfied.
Updating the global xmin without races is slightly trickier but doable.

Checking visibility
-------------------

XidInMVCCSnapshot will need to switch on the type of snapshot used
after checking xmin and xmax. For list-of-XIDs snapshots the current
logic applies. CSN based snapshots need to first do an unlocked read
of a backend specific global variable to check whether we have been
instructed to convert our snapshots to XID based ones; if so, the
snapshot is converted (more on that separately). Otherwise we look up
the CSN of the xid.

To map an xid to a csn, first the value of the csn slot corresponding
to the xid is read from the dense map (just a plain denseCSNMap[xid %
denseMapSize]). Then, after issuing a read barrier, the dense map
eviction horizon is checked to verify that the value we read was in
fact valid. If it was, it can be compared to the snapshot's csn value
to get the result; if not, we continue to check the sparse map.
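
The same lookup in the hypothetical C from the earlier sketches, with
an acquire load standing in for the read barrier (wraparound-aware xid
comparison again glossed over):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DENSE_MAP_SIZE 1600

extern _Atomic uint64_t denseCSNMap[DENSE_MAP_SIZE];
extern _Atomic uint32_t denseEvictionHorizon; /* oldest xid still dense */

/* Returns true with *csn set if the dense map still covers the xid;
   false means the slot was recycled and the sparse map is next. */
static bool
DenseMapLookup(uint32_t xid, uint64_t *csn)
{
    *csn = atomic_load_explicit(&denseCSNMap[xid % DENSE_MAP_SIZE],
                                memory_order_acquire);
    /* Read barrier, then validate: if the xid fell below the horizon,
       the value we read may already belong to a newer xid. */
    return xid >= atomic_load_explicit(&denseEvictionHorizon,
                                       memory_order_acquire);
}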

To check the sparse map, we read the sparse map version counter value
and use it to determine which of the two maps is currently active (an
optimistic locking scheme). We use linear or binary search to look up
the slot for the XID. If the XID is not found, we know it wasn't
committed after the snapshot was taken, and we then have to check the
clog to see whether it was committed at all. Otherwise we compare the
slot's value against our snapshot's csn to determine visibility.
Finally, after issuing a read barrier, we check the sparse map version
counter to see if the result is valid.
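
The optimistic read might look roughly like this, reusing XidCSNPair
and SparseMapLookup from the earlier sketch:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SPARSE_MAP_ENTRIES 6500   /* 65 slots x 100 backends, see below */

extern _Atomic uint32_t sparseMapVersion; /* low bit selects active map */
extern XidCSNPair sparseMaps[2][SPARSE_MAP_ENTRIES];

/* Returns true if the lookup result is valid; false means an eviction
   switched the maps underneath us and the caller must retry. */
static bool
SparseMapOptimisticLookup(uint32_t xid, uint64_t *csn, bool *found)
{
    uint32_t ver = atomic_load_explicit(&sparseMapVersion,
                                        memory_order_acquire);

    *found = SparseMapLookup(sparseMaps[ver & 1], SPARSE_MAP_ENTRIES,
                             xid, csn);

    /* Read barrier, then recheck the version counter. */
    return atomic_load_explicit(&sparseMapVersion,
                                memory_order_acquire) == ver;
}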

Checking against the dense map can omit clog checks, as we can use
special values to signify aborted and uncommitted transactions. Better
yet, we can defer clog updates until the corresponding slots are
evicted from the dense map.

Assigning XIDs and evicting from CSN buffers
--------------------------------------------

XID assignment will need an additional step to allocate a CSN slot for
the transaction. If the dense map ring buffer has filled up, this
might require evicting some entries from the dense CSN map. For less
contention and better efficiency it would be a good idea to evict
larger blocks at a time. One 8th or 16th of the dense map at a time
might be a good balance here. Eviction will be done under a new
CSNEvictionLock.

First we publish the point up to which we are evicting from dense to
notify committing backends of the hazard.

We then read the current nextCSN value and publish it as the largest
possible global_csn_min value we can arrive at, so that a backend in
the middle of taking a snapshot, having fetched a CSN but not yet
updated its procarray entry, will notice that it is at risk and will
retry. Using nextCSN is very conservative, but as far as I can see the
invalidations will be rare and cheap. We have to issue a full memory
barrier here so that either the snapshot taker sees our value or we
see its csn_min.

If there is enough space, we just use global_csn_min as the sparse map
eviction horizon. If there is not enough free space in the currently
active sparse map to guarantee that the dense map block will fit, we
scan both the active sparse array and the to-be-evicted block,
collecting the missing space's worth of xid-csn pairs with the
smallest CSN values into a heap. We shrink the heap for every xid-csn
pair we can omit, either because the transaction was aborted or
because it is visible to all active snapshots. When we finish scanning
and the heap isn't empty, the largest value in the heap becomes the
sparse map eviction horizon.

After arriving at the correct sparse map eviction horizon, we iterate
through the sparse map and the dense map block to be evicted, copying
all active or not-all-visible CSN slots to the inactive sparse map. We
also update clog for every committed transaction that we found in the
dense map. If we decided to evict some values, we write them to the
CSN log here and update the sparse map eviction horizon with the CSN
we arrived at. At this point the state in the current active sparse
map and the evictable dense map block are duplicated into the inactive
sparse map and CSN log. We now need to make the new state visible to
visibility checkers in a safe order, issuing a write barrier before
each step so the previous changes are visible:

* Notify all backends that have csn_min <= sparse map eviction horizon
that their snapshots are invalid and at what CSN log location they can
start to read to find concurrent xids.
* Increment the sparse map version counter to switch sparse maps.
* Raise the dense map eviction horizon, to free up the dense map block.
* Overwrite the dense map block with special UncommittedCSN values
that are tagged with their XID (more on that later).
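
Sketched with release fences standing in for the write barriers (the
notification function and the UncommittedCSN encoding are
hypothetical):

#include <stdatomic.h>
#include <stdint.h>

extern _Atomic uint32_t sparseMapVersion;
extern _Atomic uint32_t denseEvictionHorizon;
extern _Atomic uint64_t denseCSNMap[];

extern void NotifyInvalidatedBackends(uint64_t horizonCsn,
                                      uint64_t csnLogLocation);
extern uint64_t UncommittedCSN(uint32_t xid); /* special value + xid tag */

static void
PublishEviction(uint64_t horizonCsn, uint64_t csnLogLocation,
                uint32_t blockStart, uint32_t blockEnd,
                uint32_t denseMapSize)
{
    NotifyInvalidatedBackends(horizonCsn, csnLogLocation);
    atomic_thread_fence(memory_order_release);

    atomic_fetch_add(&sparseMapVersion, 1);        /* switch sparse maps */
    atomic_thread_fence(memory_order_release);

    atomic_store(&denseEvictionHorizon, blockEnd); /* free dense block */
    atomic_thread_fence(memory_order_release);

    for (uint32_t xid = blockStart; xid < blockEnd; xid++)
        atomic_store(&denseCSNMap[xid % denseMapSize],
                     UncommittedCSN(xid));
}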

At this point we can consider the block cleared up for further use.
Because we don't want to lock shared structures for snapshotting, we
need to maintain a global xmin value. To do this we acquire a spinlock
on the global xmin value and check if it's empty; if no other
transaction is running we replace it with our xid.

At this point we know the minimum CSN of any unconverted snapshots, so
it's also a good time to clean up unnecessary CSN log.

Finally we are done, so we can release CSNEvictionLock and XIDGenLock.

Committing a transaction
------------------------

Most of the commit sequence remains the same, but where we currently
do ProcArrayEndTransaction, we will acquire a new LWLock protecting
the nextCSN value; I'll call it the CSNGenLock for now. We first read
the dense array eviction horizon, then stamp all of our non-overflowed
or still-in-the-dense-map subtransaction slots with the special
SubcommittedCSN value, then stamp the top level xid with nextCSN. We
then issue a full memory barrier and check the soon-to-be-evicted
pointer into the dense map. If it overlaps with any of the XIDs we
just tagged, then the backend doing the eviction might have missed our
update; we have to wait for CSNEvictionLock to become free and go
restamp the relevant XIDs in the sparse map.

To stamp an XID with the commit CSN, we compare the XID to the dense
map eviction horizon. If the XID still maps to the dense array, we use
CAS to swap out UncommittedCSN(xid) for the value we need. If the CAS
fails, then between when we read the dense array horizon and the
actual stamping an eviction process replaced our slot with a newer
one. (If we didn't tag the slots with the XID value, we might
accidentally stamp another transaction's slot.) If the XID maps to the
sparse array, we have to take the CSNEvictionLock so the sparse array
doesn't get swapped out underneath us, then look up the slot, stamp
it, and update the clog before releasing the lock. Lock contention
shouldn't be a problem here as only long running transactions map to
the sparse array.

When done with the stamping we check the global xmin value; if it's
our xid, we grab the spinlock, scan through the procarray for the next
xmin value, update it, and release the lock.

Rolling back a transaction
--------------------------

Rolling back a transaction is basically the same as committing, but
the CSN slots need to be stamped with an AbortedCSN.

Subtransactions
---------------

Because of the limited size of the sparse array, we cannot keep open
CSN slots for all of the potentially unbounded number of
subtransactions there. I propose something similar to what is done
currently with PGPROC_MAX_CACHED_SUBXIDS. When we assign xids to
subtransactions above this limit, we tag them in the dense array with
a special OverflowedSubxidCSN value. When evicting subtransactions
from the dense array, non-overflowed subtransaction slots are handled
like regular slots; the overflowed slots are discarded. We also keep
track of the lowest and the highest subxid that was overflowed; the
lowest overflowed subxid is reset when, before an eviction, the
highest overflowed subxid is lower than the smallest xid in the sparse
array (i.e. we know that the XID region covered by the sparse array
doesn't contain any overflowed subxids). When constructing a regular
snapshot we can then detect that we don't have full information about
subtransactions and correctly set the overflowed flag on the snapshot.
Similarly, visibility checks can omit subxid lookups for xids missing
from the sparse array when they know that the xid can't be overflowed.
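
A tiny sketch of that bookkeeping (names hypothetical, single-threaded
for brevity; the real thing would live under the eviction lock):

#include <stdint.h>

static uint32_t lowestOverflowedSubxid;   /* 0 = none tracked */
static uint32_t highestOverflowedSubxid;

static void
TagOverflowedSubxid(uint32_t subxid)
{
    /* the dense slot itself gets the OverflowedSubxidCSN value */
    if (lowestOverflowedSubxid == 0)
        lowestOverflowedSubxid = subxid;
    if (subxid > highestOverflowedSubxid)
        highestOverflowedSubxid = subxid;
}

/* Called before an eviction: once the sparse array's xid range is
   entirely past the highest overflowed subxid, no overflowed subxids
   remain tracked and the range can be reset. */
static void
MaybeResetOverflowTracking(uint32_t sparseSmallestXid)
{
    if (highestOverflowedSubxid != 0 &&
        highestOverflowedSubxid < sparseSmallestXid)
        lowestOverflowedSubxid = highestOverflowedSubxid = 0;
}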

Prepared transactions
---------------------

Prepared transactions are handled basically like regular transactions.
When starting up with prepared transactions present, they are inserted
into the sparse array; when they are committed, they get stamped with
CSNs and become visible as usual. We just need to account for them
when sizing the sparse array.

Crash safety and checkpoints
----------------------------

Because clog updating is delayed for transactions in the dense map,
checkpoints need to flush the dense array before writing out the clog.
Otherwise the data structures are fully transient and don't need any
special treatment.

Hot standby
-----------

I haven't yet worked out how CSN based snapshots best integrate with
hot standby. As far as I can tell, we could just use the current
KnownAssignedXidsGetAndSetXmin mechanism and get regular snapshots.
But I think there is an opportunity here to get rid of most of or all
of the KnownAssignedXids mechanism if we WAL log the CSNs that are
assigned to commits (or use a side channel as proposed by Robert). The
extra write is not great, but might not be a large problem after the
WAL insertion scaling work is done. Another option would be to use the
LSN of commit record as the next CSN value. The locking in that case
requires some further thought to guarantee that commits are stamped in
the same order as WAL records are inserted without holding
WALInsertLock for too long. That seems doable by inserting committing
backends into a queue under WALInsertLock and then having them wait
for the transaction in front of them to commit once WALInsertLock has
been released.

Serializable transactions
-------------------------

I won't pretend to be familiar with SSI code, but as far as I can tell
serializable transactions don't need any modifications to work with
the CSN based snapshot scheme. There actually already is a commit
sequence number in the SSI code that could be replaced by the actual
CSN. IIRC one of the problems with serializable transactions on hot
standby was that transaction visibility order on the standby is
different from the master. If we use CSNs for visibility on the slave
then we can actually provide identical visibility order.

Required atomic primitives
--------------------------

Besides the copious memory barriers required for correctness, we will
need the following lockless primitives:
* 64bit atomic read
* 64bit atomic write
* 64bit CAS

Are there any supported platforms where it would be impossible to have
those? AFAIK everything from 32bit x86 through POWER and MIPS to ARM
can do it. If there are any platforms that can't handle 64bit atomics,
would it be ok to run on them with reduced concurrency/performance?
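
For illustration, the three primitives as they look with C11 atomics
(purely a reference point; on 32-bit targets without native 8-byte
atomics the compiler falls back to lock-based emulation, which is the
reduced-performance mode mentioned above):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static _Atomic uint64_t slot;

static void
demo_primitives(void)
{
    uint64_t v = atomic_load(&slot);        /* 64bit atomic read */

    atomic_store(&slot, v + 1);             /* 64bit atomic write */

    uint64_t expected = v + 1;              /* 64bit CAS */
    bool ok = atomic_compare_exchange_strong(&slot, &expected, v);
    (void) ok;
}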

Sizing the CSN maps
-------------------

CSN maps need to be sized to accommodate the number of backends.

The dense array size should be picked so that most xids commit before
being evicted from the dense map; the sparse array will then contain
the slots necessary for either long running transactions or long
snapshots not yet converted to XID based snapshots. I did a few quick
simulations to measure the dynamics. If we assume a power law
distribution of transaction lengths, and snapshots held for the full
length of transactions with no think time, then 16 slots per backend
is enough to make the probability of eviction before commit less than
1e-6, and of being needed at eviction due to a snapshot about 1:10000.
In the real world very long transactions are more likely than the
model predicts, but at least the normal variation is mostly buffered
by this size. 16 slots = 128 bytes per backend ends up at a 12.5KB
buffer for the default value of 100 backends, or 125KB for 1000
backends.

The sparse buffer needs to be at least big enough to fit CSN slots for
the xids of all active transactions and non-overflowed
subtransactions. At the current level of PGPROC_MAX_CACHED_SUBXIDS=64,
the minimum comes out at 16 bytes * (64 + 1) slots * 100 backends =
101.6KB per buffer, or 203KB total in the default configuration.
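
The arithmetic behind these figures, written out as compile-time
checks (numbers from the text, 1KB = 1024 bytes):

/* dense map: 16 slots * 8 bytes = 128 bytes per backend */
_Static_assert(16 * 8 == 128, "dense map bytes per backend");
/* 100 backends: 16 * 8 * 100 = 12800 bytes = 12.5KB */
_Static_assert(16 * 8 * 100 == 12800, "dense map total");
/* sparse map: 16 bytes * (64 + 1) slots * 100 backends = 104000 bytes,
   i.e. ~101.6KB per buffer and ~203KB for the two buffers */
_Static_assert(16 * (64 + 1) * 100 == 104000, "sparse buffer size");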

Performance discussion
----------------------

Taking a snapshot is extremely cheap in this scheme; I think the cost
will mostly be in publishing the csn_min and rechecking validity after
it. Taking snapshots in the shadow of another snapshot (often the case
for the proposed MVCC catalog access) will be even cheaper, as we
don't have to advertise the new snapshot. The delayed XID based
snapshot construction should be rare, but even when it happens the
costs are similar to GetSnapshotData's, except that we don't need to
hold a lock.

Visibility checking will also be simpler: for the cases where the
snapshot is covered by the dense array it only requires two memory
lookups and comparisons.

The main extra work for writing transactions is the eviction process.
The amortized worst case extra work per xid is dominated by copying
the sparse buffer back and forth and spooling out the CSN log. We need
to write out 16 bytes per xid and, per xid, copy sparse-buffer-size /
eviction-block-size worth of sparse buffer slots. If we evict 1/8th of
the dense map at each eviction (200 of the 1600 default slots), the
~104KB sparse buffer copied per 200 xids works out to about 520 bytes
copied per xid assigned. About the same ballpark as GetSnapshotData is
now.

With the described scheme, long snapshots will cause the sparse buffer
to quickly fill up and then spool out until the backend wakes up,
converts its snapshots and releases the eviction process to free up
the log. It would be more efficient to be slightly more proactive and
tell backends to convert their snapshots earlier; if they manage to be
timely with the conversion we can avoid writing any CSN log.

I'm not particularly pleased about the fact that both xid assignment
and committing can block behind the eviction lock. On the other hand,
plugging in some numbers, with 100 backends doing 100k xid
assignments/s the lock will be acquired 1000 times per second for less
than 100us at a time. The contention might not be bad enough to
warrant extra complexity to deal with it. If it does happen to be a
problem, then I have some ideas how to cope with it.

Having to do CSN log writes while holding an LWLock might not be the
best of ideas. To combat that, we can either add one more buffer so
that we can do the actual write syscall after we release
CSNEvictionLock, or we can reuse the SLRU machinery to handle this.

Overall it looks to be a big win for typical workloads. Workloads
using large amounts of subtransactions might not be as well off, but I
doubt there will be a regression.

At this point I don't see any major issues with this approach. If the
ensuing discussion doesn't find any major showstoppers then I will
start to work on a patch bit-by-bit. It might take a while though as
my free hacking time has been severely cut down since we have a small
one in the family.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Proposal for CSN based snapshots

From
Markus Wanner
Date:
Ants,

On 06/07/2013 12:42 AM, Ants Aasma wrote:
> Given the recent ideas being thrown about changing how freezing and
> clog is handled and MVCC catalog access I thought I would write out
> the ideas that I have had about speeding up snapshots in case there is
> an interesting tie in with the current discussions.

Thanks for this write-up.

> To refresh your memory the basic idea is to change visibility
> determination to be based on a commit sequence number (CSN for short)
> - a 8 byte number incremented on every commit representing the total
> ordering of commits. To take a snapshot in this scheme you only need
> to know the value of last assigned CSN, all transactions with XID less

You mean with a *CSN* less than or equal to that number? There's
certainly no point in comparing a CSN with an XID.

> than or equal to that number were commited at the time of the
> snapshots, everything above wasn't committed. Besides speeding up
> snapshot taking, this scheme can also be a building block for
> consistent snapshots in a multi-master environment with minimal
> communication.

Agreed. Postgres-R uses a CommitOrderId, which is very similar in
concept, for example.

> The main tricky part about this scheme is finding the CSN that was
> assigned to each XIDs in face of arbitrarily long transactions and
> snapshots using only a bounded amount of shared memory. The secondary
> tricky part is doing this in a way that doesn't need locks for
> visibility determination as that would kill any hope of a performance
> gain.

Agreed. Paraphrasing, you can also say that CSNs can only ever identify
completed transactions, whereas XIDs can be used to identify transactions
in progress as well.

[ We cannot possibly get rid of XIDs entirely for that reason. And the
mapping between CSNs and XIDs has some additional cost. ]

> We need to keep around CSN slots for all currently running
> transactions and CSN slots of transactions that are concurrent with
> any active CSN based snapshot (xid.csn > min(snapshot.csn)). To do
> this I propose the following datastructures to do the XID-to-CSN
> mapping. For most recently assigned XIDs there is a ringbuffer of
> slots that contain the CSN values of the XIDs or special CSN values
> for transactions that haven't completed yet, aborted transactions or
> subtransactions. I call this the dense CSN map. Looking up a CSN of a
> XID from the ringbuffer is just a trivial direct indexing into the
> ring buffer.
> 
> For long running transactions the ringbuffer may do a full circle
> before a transaction commits. Such CSN slots along with slots that are
> needed for older snapshots are evicted from the dense buffer into a
> sorted array of XID-CSN pairs, or the sparse mapping. For locking
> purposes there are two sparse buffers, one of them being active the
> other inactive, more on that later. Looking up the CSN value of a XID
> that has been evicted into the sparse mapping is a matter of
> performing a binary search to find the slot and reading the CSN value.

I like this idea of dense and sparse "areas". Seems like a simple
enough, yet reasonably compact representation that might work well in
practice.

> Because snapshots can remain active for an unbounded amount of time
> and there can be unbounded amount of active snapshots, even the sparse
> mapping can fill up.

I don't think the number of active snapshots matters - after all, they
could all refer the same CSN. So that number shouldn't have any
influence on the size of the sparse mapping.

What does matter is the number of transactions referenced by such a
sparse map. You are of course correct in that this number is equally
unbounded.

> To handle that case, each backend advertises its
> lowest snapshot number csn_min. When values need to be evicted from
> the sparse mapping, they are evicted in CSN order and written into the
> CSN log - a series of CSN-XID pairs. Backends that may still be
> concerned about those values are then notified that values that they
> might need to use have been evicted. Backends with invalidated
> snapshots can then convert their snapshots to regular list of
> concurrent XIDs snapshots at their leisure.
> 
> To convert a CSN based snapshot to XID based, a backend would first
> scan the shared memory structures for xids up to snapshot.xmax for
> CSNs that are concurrent to the snapshot and insert the XIDs into the
> snapshot, then read in the CSN log starting from snapshots CSN,
> looking for xid's less than the snapshots xmax. After this the
> snapshot can be handled like current snapshots are handled.

Hm, I dislike the requirement to maintain two different snapshot formats.

Also mind that snapshot conversions - however unlikely you choose to
make them - may well result in bursts as multiple processes may need to
do such a conversion, all starting at the same point in time.

> Rolling back a transaction
> --------------------------
> 
> Rolling back a transaction is basically the same as committing, but
> the CSN slots need to be stamped with a AbortedCSN.

Is that really necessary? After all, an aborted transaction behaves
pretty much the same as a transaction in progress WRT visibility: it's
simply not visible.

Or why do you need to tell apart aborted from in-progress transactions
by CSN?

> Sizing the CSN maps
> -------------------
> 
> CSN maps need to sized to accomodate the number of backends.
> 
> Dense array size should be picked so that most xids commit before
> being evicted from the dense map and sparse array will contain slots
> necessary for either long running transactions or for long snapshots
> not yet converted to XID based snapshots. I did a few quick
> simulations to measure the dynamics. If we assume a power law
> distribution of transaction lengths and snapshots for the full length
> of transactions with no think time, then 16 slots per backend is
> enough to make the probability of eviction before commit less than
> 1e-6 and being needed at eviction due to a snapshot about 1:10000. In
> the real world very long transactions are more likely than predicted
> model, but at least the normal variation is mostly buffered by this
> size. 16 slots = 128bytes per backend ends up at a 12.5KB buffer for
> the default value of 100 backends, or 125KB for 1000 backends.

Sounds reasonable to me.

> Sparse buffer needs to be at least big enough to fit CSN slots for the
> xids of all active transactions and non-overflowed subtransactions. At
> the current level PGPROC_MAX_CACHED_SUBXIDS=64, the minimum comes out
> at 16 bytes * (64 + 1) slots * 100 backends = 101.6KB per buffer,
> or 203KB total in the default configuration.

A CSN is 8 bytes, the XID 4, resulting in 12 bytes per slot. So I guess
the given 16 bytes includes alignment to 8 byte boundaries. Sounds good.

> Performance discussion
> ----------------------
> 
> Taking a snapshot is extremely cheap in this scheme, I think the cost
> will be mostly for publishing the csnmin and rechecking validity after
> it. Taking snapshots in the shadow of another snapshot (often the case
> for the proposed MVCC catalog access) will be even cheaper as we don't
> have to advertise the new snapshot. The delayed XID based snapshot
> construction should be unlikely, but even if it isn't the costs are
> similar to GetSnapshotData, but we don't need to hold a lock.
> 
> Visibility checking will also be simpler as for the cases where the
> snapshot is covered by the dense array it only requires two memory
> lookups and comparisons.

Keep in mind, though, that both of these lookups are into shared memory.
Especially the dense ring buffer may well turn into a point of
contention. Or at least the cache lines holding the most recent XIDs
within that ring buffer.

Whereas currently, the snapshot's xip array resides in process-local
memory. (Granted, often enough, the proc array also is a point of
contention.)

> At this point I don't see any major issues with this approach. If the
> ensuing discussion doesn't find any major showstoppers then I will
> start to work on a patch bit-by-bit.

Bit-by-bit? Reminds me of punched cards. I don't think we accept patches
in that format.  :-)

> we have a small one in the family.

Congratulations on that one.

Regards

Markus Wanner



Re: Proposal for CSN based snapshots

From
Greg Stark
Date:
On Thu, Jun 6, 2013 at 11:42 PM, Ants Aasma <ants@cybertec.at> wrote:
> To refresh your memory the basic idea is to change visibility
> determination to be based on a commit sequence number (CSN for short)
> - a 8 byte number incremented on every commit representing the total
> ordering of commits

I think we would just use the LSN of the commit record which is
effectively the same but doesn't require a new counter.
I don't think this changes anything though.


-- 
greg



Re: Proposal for CSN based snapshots

From
Ants Aasma
Date:
On Fri, Jun 7, 2013 at 2:59 PM, Markus Wanner <markus@bluegap.ch> wrote:
>> To refresh your memory the basic idea is to change visibility
>> determination to be based on a commit sequence number (CSN for short)
>> - a 8 byte number incremented on every commit representing the total
>> ordering of commits. To take a snapshot in this scheme you only need
>> to know the value of last assigned CSN, all transactions with XID less
>
> You mean with a *CSN* less than or equal to that number? There's
> certainly no point in comparing a CSN with an XID.

That was what I meant. I guess my coffee hadn't kicked in yet there.

>> than or equal to that number were commited at the time of the
>> snapshots, everything above wasn't committed. Besides speeding up
>> snapshot taking, this scheme can also be a building block for
>> consistent snapshots in a multi-master environment with minimal
>> communication.
>
> Agreed. Postgres-R uses a CommitOrderId, which is very similar in
> concept, for example.

Do you think having this snapshot scheme would be helpful for Postgres-R?

>> Because snapshots can remain active for an unbounded amount of time
>> and there can be unbounded amount of active snapshots, even the sparse
>> mapping can fill up.
>
> I don't think the number of active snapshots matters - after all, they
> could all refer the same CSN. So that number shouldn't have any
> influence on the size of the sparse mapping.
>
> What does matter is the number of transactions referenced by such a
> sparse map. You are of course correct in that this number is equally
> unbounded.

Yes, that is what I meant to say but for some reason didn't.

>> To handle that case, each backend advertises its
>> lowest snapshot number csn_min. When values need to be evicted from
>> the sparse mapping, they are evicted in CSN order and written into the
>> CSN log - a series of CSN-XID pairs. Backends that may still be
>> concerned about those values are then notified that values that they
>> might need to use have been evicted. Backends with invalidated
>> snapshots can then convert their snapshots to regular list of
>> concurrent XIDs snapshots at their leisure.
>>
>> To convert a CSN based snapshot to XID based, a backend would first
>> scan the shared memory structures for xids up to snapshot.xmax for
>> CSNs that are concurrent to the snapshot and insert the XIDs into the
>> snapshot, then read in the CSN log starting from snapshots CSN,
>> looking for xid's less than the snapshots xmax. After this the
>> snapshot can be handled like current snapshots are handled.
>
> Hm, I dislike the requirement to maintain two different snapshot formats.
>
> Also mind that snapshot conversions - however unlikely you choose to
> make them - may well result in bursts as multiple processes may need to
> do such a conversion, all starting at the same point in time.

I agree that two snapshot formats is not great. On the other hand, the
additional logic is confined to XidInMVCCSnapshot and is reasonably
simple. If we didn't convert the snapshots we would have to keep
spooling out CSN log and look up XIDs for each visibility check. We
could add a cache for XIDs that were deemed concurrent, but that is in
effect just lazily constructing the same datastructure. The work
needed to convert is reasonably well bounded, can be done without
holding global locks and in most circumstances should only be
necessary for snapshots that are used for a long time and will
amortize the cost. I'm not worried about the bursts because the
conversion is done lockless and starting at the same point in time
leads to better cache utilization.

>> Rolling back a transaction
>> --------------------------
>>
>> Rolling back a transaction is basically the same as committing, but
>> the CSN slots need to be stamped with a AbortedCSN.
>
> Is that really necessary? After all, an aborted transaction behaves
> pretty much the same as a transaction in progress WRT visibility: it's
> simply not visible.
>
> Or why do you need to tell apart aborted from in-progress transactions
> by CSN?

I need to detect aborted transactions so they can be discarded during
the eviction process, otherwise the sparse array will fill up. They
could also be filtered out by cross-referencing uncommitted slots with
the procarray. Having the abort case do some additional work to make
xid assignment cheaper looks like a good tradeoff.

>> Sparse buffer needs to be at least big enough to fit CSN slots for the
>> xids of all active transactions and non-overflowed subtransactions. At
>> the current level PGPROC_MAX_CACHED_SUBXIDS=64, the minimum comes out
>> at 16 bytes * (64 + 1) slots * 100 backends = 101.6KB per buffer,
>> or 203KB total in the default configuration.
>
> A CSN is 8 bytes, the XID 4, resulting in 12 bytes per slot. So I guess
> the given 16 bytes includes alignment to 8 byte boundaries. Sounds good.

8 byte alignment for CSNs is needed for atomic access if nothing else.
I think the size could be cut in half by using a base value for CSNs
if we assume that no xid is active for longer than 2B transactions as
is currently the case. I didn't want to include the complication in
the first iteration, so I didn't verify if that would have any
gotchas.

>> Performance discussion
>> ----------------------
>>
>> Taking a snapshot is extremely cheap in this scheme, I think the cost
>> will be mostly for publishing the csnmin and rechecking validity after
>> it. Taking snapshots in the shadow of another snapshot (often the case
>> for the proposed MVCC catalog access) will be even cheaper as we don't
>> have to advertise the new snapshot. The delayed XID based snapshot
>> construction should be unlikely, but even if it isn't the costs are
>> similar to GetSnapshotData, but we don't need to hold a lock.
>>
>> Visibility checking will also be simpler as for the cases where the
>> snapshot is covered by the dense array it only requires two memory
>> lookups and comparisons.
>
> Keep in mind, though, that both of these lookups are into shared memory.
> Especially the dense ring buffer may well turn into a point of
> contention. Or at least the cache lines holding the most recent XIDs
> within that ring buffer.
>
> Where as currently, the snapshot's xip array resides in process-local
> memory. (Granted, often enough, the proc array also is a point of
> contention.)

Visibility checks are done lock free so they don't cause any
contention. The number of times each cache line can be invalidated is
bounded by 8. Overall I think actual performance tests are needed to
see if there is a problem, or perhaps if having the data shared
actually helps with cache hit rates.

>> At this point I don't see any major issues with this approach. If the
>> ensuing discussion doesn't find any major showstoppers then I will
>> start to work on a patch bit-by-bit.
>
> Bit-by-bit? Reminds me of punched cards. I don't think we accept patches
> in that format.  :-)
>
>> we have a small one in the family.
>
> Congratulations on that one.

Thanks,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Proposal for CSN based snapshots

From
Ants Aasma
Date:
On Fri, Jun 7, 2013 at 3:47 PM, Greg Stark <stark@mit.edu> wrote:
> On Thu, Jun 6, 2013 at 11:42 PM, Ants Aasma <ants@cybertec.at> wrote:
>> To refresh your memory the basic idea is to change visibility
>> determination to be based on a commit sequence number (CSN for short)
>> - a 8 byte number incremented on every commit representing the total
>> ordering of commits
>
> I think we would just use the LSN of the commit record which is
> effectively the same but doesn't require a new counter.
> I don't think this changes anything though.

I briefly touched on that point in the Hot Standby section. This has
some consequences for locking in CommitTransaction, but otherwise LSN
is as good as any other monotonically increasing value.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Proposal for CSN based snapshots

From
Kevin Grittner
Date:
Ants Aasma <ants@cybertec.at> wrote:

> Serializable transactions
> -------------------------
>
> I won't pretend to be familiar with SSI code, but as far as I can
> tell serializable transactions don't need any modifications to
> work with the CSN based snapshot scheme. There actually already
> is a commit sequence number in the SSI code that could be
> replaced by the actual CSN.

That seems quite likely to work, and may be good for performance.

> IIRC one of the problems with serializable transactions on hot
> standby was that transaction visibility order on the standby is
> different from the master.

Pretty much.  Technically what SSI does is to ensure that every
serializable transaction's view of the data is consistent with some
serial (one-at-a-time) execution of serializable transactions.  That
"apparent order of execution" does not always match commit order.
The problem is that without further work a hot standby query could
see a state which would not have been possible for it to see on the
master.  For example, a batch is closed, but a transaction that is
part of that batch has not yet committed.  For an example, see:

http://wiki.postgresql.org/wiki/SSI#Read_Only_Transactions

As that example demonstrates, as long as no serializable
transaction *sees* the incomplete batch with knowledge (or at least
potential knowledge) of the batch being closed, the pending
transaction affecting the batch is allowed to commit.  If the
conflicting state is exposed by even a read-only query, the
transaction with the pending change to the batch is canceled.

A hot standby can't cancel the pending transaction -- at least not
without adding additional communications channels and latency.  The
ideas which have been bandied about have had to do with allowing
serializable transactions on a hot standby to use snapshots which
are known to be "safe" -- that is, they cannot expose any such
state.  It might be possible to identify known safe points in the
commit stream on the master and pass that information along in the
WAL stream.  The down side is that the hot standby would need to
either remember the last such safe snapshot or wait for the next
one, and while these usually come up fairly quickly in most
workloads, there is no actual bounding on how long that could take.

A nicer solution, if we can manage it, is to allow snapshots on the
hot standby which are not based exclusively on the commit order,
but use the apparent order of execution from the master.  It seems
non-trivial to get that right.

> If we use CSNs for visibility on the slave then we can actually
> provide identical visibility order.

As the above indicates, that's not really true without using
"apparent order of execution" instead of "commit order".  In the
absence of serializable transactions those are always the same (I
think), but to provide a way to allow serializable transactions on
the hot standby the order would need to be subject to rearrangement
based on read-write conflicts among transactions on the master, or
snapshots which could expose serialization anomalies would need to
be prohibited.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Proposal for CSN based snapshots

From
Markus Wanner
Date:
Ants,

the more I think about this, the more I start to like it.

On 06/07/2013 02:50 PM, Ants Aasma wrote:
> On Fri, Jun 7, 2013 at 2:59 PM, Markus Wanner <markus@bluegap.ch> wrote:
>> Agreed. Postgres-R uses a CommitOrderId, which is very similar in
>> concept, for example.
> 
> Do you think having this snapshot scheme would be helpful for Postgres-R?

Yeah, it could help to reduce patch size, after a rewrite to use such a CSN.

>> Or why do you need to tell apart aborted from in-progress transactions
>> by CSN?
> 
> I need to detect aborted transactions so they can be discared during
> the eviction process, otherwise the sparse array will fill up. They
> could also be filtered out by cross-referencing uncommitted slots with
> the procarray. Having the abort case do some additional work to make
> xid assigment cheaper looks like a good tradeoff.

I see.

>>> Sparse buffer needs to be at least big enough to fit CSN slots for the
>>> xids of all active transactions and non-overflowed subtransactions. At
>>> the current level PGPROC_MAX_CACHED_SUBXIDS=64, the minimum comes out
>>> at 16 bytes * (64 + 1) slots * 100 backends = 101.6KB per buffer,
>>> or 203KB total in the default configuration.
>>
>> A CSN is 8 bytes, the XID 4, resulting in 12 bytes per slot. So I guess
>> the given 16 bytes includes alignment to 8 byte boundaries. Sounds good.
> 
> 8 byte alignment for CSNs is needed for atomic if not something else.

Oh, right, atomic writes.

> I think the size could be cut in half by using a base value for CSNs
> if we assume that no xid is active for longer than 2B transactions as
> is currently the case. I didn't want to include the complication in
> the first iteration, so I didn't verify if that would have any
> gotchas.

In Postgres-R, I effectively used a 32-bit order id which wraps around.

In this case, I guess adjusting the base value will get tricky. Wrapping
could probably be used as well, instead.

> The number of times each cache line can be invalidated is
> bounded by 8.

Hm.. good point.

Regards

Markus Wanner



Re: Proposal for CSN based snapshots

From
Rajeev rastogi
Date:
On 11th June 2013, Markus Wanner wrote:

> >> Agreed. Postgres-R uses a CommitOrderId, which is very similar in
> >> concept, for example.
> >
> > Do you think having this snapshot scheme would be helpful for
> Postgres-R?
>
> Yeah, it could help to reduce patch size, after a rewrite to use such a
> CSN.
>
> >> Or why do you need to tell apart aborted from in-progress
> >> transactions by CSN?
> >
> > I need to detect aborted transactions so they can be discared during
> > the eviction process, otherwise the sparse array will fill up. They
> > could also be filtered out by cross-referencing uncommitted slots
> with
> > the procarray. Having the abort case do some additional work to make
> > xid assigment cheaper looks like a good tradeoff.
>
> I see.
>
> >>> Sparse buffer needs to be at least big enough to fit CSN slots for
> >>> the xids of all active transactions and non-overflowed
> >>> subtransactions. At the current level PGPROC_MAX_CACHED_SUBXIDS=64,
> >>> the minimum comes out at 16 bytes * (64 + 1) slots * 100 backends
> >>> = 101.6KB per buffer, or 203KB total in the default configuration.
> >>
> >> A CSN is 8 bytes, the XID 4, resulting in 12 bytes per slot. So I
> >> guess the given 16 bytes includes alignment to 8 byte boundaries.
> Sounds good.
> >
> > 8 byte alignment for CSNs is needed for atomic if not something else.
>
> Oh, right, atomic writes.
>
> > I think the size could be cut in half by using a base value for CSNs
> > if we assume that no xid is active for longer than 2B transactions as
> > is currently the case. I didn't want to include the complication in
> > the first iteration, so I didn't verify if that would have any
> > gotchas.
>
> In Postgres-R, I effectively used a 32-bit order id which wraps around.
>
> In this case, I guess adjusting the base value will get tricky.
> Wrapping could probably be used as well, instead.
>
> > The number of times each cache line can be invalidated is bounded by
> > 8.
>
> Hm.. good point.
>

We are also planning to implement CSN based snapshot.
So I am curious to know whether any further development is happening on this.
If not then what is the reason?

Am I missing something?

Thanks and Regards,
Kumar Rajeev Rastogi




Re: Proposal for CSN based snapshots

From
Heikki Linnakangas
Date:
On 01/24/2014 02:10 PM, Rajeev rastogi wrote:
> We are also planning to implement CSN based snapshot.
> So I am curious to know whether any further development is happening on this.

I started looking into this, and plan to work on this for 9.5. It's a 
big project, so any help is welcome. The design I have in mind is to use 
the LSN of the commit record as the CSN (as Greg Stark suggested).

Some problems and solutions I have been thinking of:

The core of the design is to store the LSN of the commit record in 
pg_clog. Currently, we only store 2 bits per transaction there, 
indicating if the transaction committed or not, but the patch will 
expand it to 64 bits, to store the LSN. To check the visibility of an 
XID in a snapshot, the XID's commit LSN is looked up in pg_clog, and 
compared with the snapshot's LSN.

Currently, before consulting the clog for an XID's status, it is 
necessary to first check if the transaction is still in progress by 
scanning the proc array. To get rid of that requirement, just before 
writing the commit record in the WAL, the backend will mark the clog 
slot with a magic value that says "I'm just about to commit". After 
writing the commit record, it is replaced with the record's actual LSN. 
If a backend sees the magic value in the clog, it will wait for the 
transaction to finish the insertion, and then check again to get the 
real LSN. I'm thinking of just using XactLockTableWait() for that. This 
mechanism makes the insertion of a commit WAL record and updating the 
clog appear atomic to the rest of the system.

With this mechanism, taking a snapshot is just a matter of reading the 
current WAL insertion point. There is no need to scan the proc array, 
which is good. However, it probably still makes sense to record an xmin 
and an xmax in SnapshotData, for performance reasons. An xmax, in 
particular, will allow us to skip checking the clog for transactions 
that will surely not be visible. We will no longer track the latest 
completed XID or the xmin like we do today, but we can use 
SharedVariableCache->nextXid as a conservative value for xmax, and keep 
a cached global xmin value in shared memory, updated when convenient, 
that can be just copied to the snapshot.

In theory, we could use a snapshot LSN as the cutoff-point for 
HeapTupleSatisfiesVisibility(). Maybe it's just because this is new, but 
that makes me feel uneasy. In any case, I think we'll need a cut-off 
point defined as an XID rather than an LSN for freezing purposes. In 
particular, we need a cut-off XID to determine how far the pg_clog can 
be truncated, and to store in relfrozenxid. So, we will still need the 
concept of a global oldest xmin.

When a snapshot is just an LSN, taking a snapshot can no longer 
calculate an xmin, like we currently do (there will be a snapshot LSN in 
place of an xmin in the proc array). So we will need a new mechanism to 
calculate the global oldest xmin. First scan the proc array to find the 
oldest still in-progress XID. That - 1 will become the new oldest global 
xmin, after all currently active snapshots have finished. We don't want 
to sleep in GetOldestXmin(), waiting for the snapshots to finish, so we 
should periodically advance a system-wide oldest xmin value, for example 
whenever the walwriter process wakes up, so that when we need an 
oldest-xmin value, we will always have a fairly recently calculated 
value ready in shared memory.

- Heikki



Re: Proposal for CSN based snapshots

From
Andres Freund
Date:
On 2014-05-12 16:56:51 +0300, Heikki Linnakangas wrote:
> On 01/24/2014 02:10 PM, Rajeev rastogi wrote:
> >We are also planning to implement CSN based snapshot.
> >So I am curious to know whether any further development is happening on this.
> 
> I started looking into this, and plan to work on this for 9.5. It's a big
> project, so any help is welcome. The design I have in mind is to use the LSN
> of the commit record as the CSN (as Greg Stark suggested).

Cool.

I haven't fully thought it through but I think it should make some of
the decoding code simpler. And it should greatly simplify the hot
standby code.

Some of the stuff in here will influence whether your freezing
replacement patch gets in. Do you plan to further pursue that one?

> The core of the design is to store the LSN of the commit record in pg_clog.
> Currently, we only store 2 bits per transaction there, indicating if the
> transaction committed or not, but the patch will expand it to 64 bits, to
> store the LSN. To check the visibility of an XID in a snapshot, the XID's
> commit LSN is looked up in pg_clog, and compared with the snapshot's LSN.

We'll continue to need some of the old states? You plan to use values
that can never be valid lsns for them? I.e. 0/0 IN_PROGRESS, 0/1 ABORTED
etc?
How do you plan to deal with subtransactions?

> Currently, before consulting the clog for an XID's status, it is necessary
> to first check if the transaction is still in progress by scanning the proc
> array. To get rid of that requirement, just before writing the commit record
> in the WAL, the backend will mark the clog slot with a magic value that says
> "I'm just about to commit". After writing the commit record, it is replaced
> with the record's actual LSN. If a backend sees the magic value in the clog,
> it will wait for the transaction to finish the insertion, and then check
> again to get the real LSN. I'm thinking of just using XactLockTableWait()
> for that. This mechanism makes the insertion of a commit WAL record and
> updating the clog appear atomic to the rest of the system.

So it's quite possible that clog will become more of a contention point
due to the doubled amount of writes.

> In theory, we could use a snapshot LSN as the cutoff-point for
> HeapTupleSatisfiesVisibility(). Maybe it's just because this is new, but
> that makes me feel uneasy.

It'd possibly also end up being less efficient because you'd visit the
clog for potentially quite some transactions to get the LSN.

Greetings,

Andres Freund



Re: Proposal for CSN based snapshots

From
Heikki Linnakangas
Date:
On 05/12/2014 05:41 PM, Andres Freund wrote:
> I haven't fully thought it through but I think it should make some of
> the decoding code simpler. And it should greatly simplify the hot
> standby code.

Cool. I was worried it might conflict with the logical decoding stuff in 
some fundamental way, as I'm not really familiar with it.

> Some of the stuff in here will be influence whether your freezing
> replacement patch gets in. Do you plan to further pursue that one?

Not sure. I got to the point where it seemed to work, but I got a bit of 
a cold feet proceeding with it. I used the page header's LSN field to 
define the "epoch" of the page, but I started to feel uneasy about it. I 
would be much more comfortable with an extra field in the page header, 
even though that uses more disk space. And requires dealing with pg_upgrade.

>> The core of the design is to store the LSN of the commit record in pg_clog.
>> Currently, we only store 2 bits per transaction there, indicating if the
>> transaction committed or not, but the patch will expand it to 64 bits, to
>> store the LSN. To check the visibility of an XID in a snapshot, the XID's
>> commit LSN is looked up in pg_clog, and compared with the snapshot's LSN.
>
> We'll continue to need some of the old states? You plan to use values
> that can never be valid lsns for them? I.e. 0/0 IN_PROGRESS, 0/1 ABORTED
> etc?

Exactly.

Using 64 bits per XID instead of just 2 will obviously require a lot 
more disk space, so we might actually want to still support the old clog 
format too, as an "archive" format. The clog for old transactions could 
be converted to the more compact 2-bits per XID format (or even just 1 bit).

> How do you plan to deal with subtransactions?

pg_subtrans will stay unchanged. We could possibly merge it with 
pg_clog, reserving some 32-bit chunk of values that are not valid LSNs 
to mean an uncommitted subtransaction, with the parent XID. That assumes 
that you never need to look up the parent of an already-committed 
subtransaction. I thought that was true at first, but I think the SSI 
code looks up the parent of a committed subtransaction, to find its 
predicate locks. Perhaps it could be changed, but seems best to leave it 
alone for now; there will be a lot code churn anyway.

I think we can get rid of the sub-XID array in PGPROC. It's currently 
used to speed up TransactionIdIsInProgress(), but with the patch it will 
no longer be necessary to call TransactionIdIsInProgress() every time 
you check the visibility of an XID, so it doesn't need to be so fast 
anymore.

With the new "commit-in-progress" status in clog, we won't need the 
sub-committed clog status anymore. The "commit-in-progress" status will 
achieve the same thing.

>> Currently, before consulting the clog for an XID's status, it is necessary
>> to first check if the transaction is still in progress by scanning the proc
>> array. To get rid of that requirement, just before writing the commit record
>> in the WAL, the backend will mark the clog slot with a magic value that says
>> "I'm just about to commit". After writing the commit record, it is replaced
>> with the record's actual LSN. If a backend sees the magic value in the clog,
>> it will wait for the transaction to finish the insertion, and then check
>> again to get the real LSN. I'm thinking of just using XactLockTableWait()
>> for that. This mechanism makes the insertion of a commit WAL record and
>> updating the clog appear atomic to the rest of the system.
>
> So it's quite possible that clog will become more of a contention point
> due to the doubled amount of writes.

Yeah. OTOH, each transaction will take more space in the clog, which 
will spread the contention across more pages. And I think there are ways 
to mitigate contention in clog, if it becomes a problem. We could make 
the locking more fine-grained than one lock per page, use atomic 64-bit 
reads/writes on platforms that support it, etc.

>> In theory, we could use a snapshot LSN as the cutoff-point for
>> HeapTupleSatisfiesVisibility(). Maybe it's just because this is new, but
>> that makes me feel uneasy.
>
> It'd possibly also end up being less efficient because you'd visit the
> clog for potentially quite some transactions to get the LSN.

True.

- Heikki



Re: Proposal for CSN based snapshots

From
Robert Haas
Date:
On Mon, May 12, 2014 at 10:41 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-05-12 16:56:51 +0300, Heikki Linnakangas wrote:
>> On 01/24/2014 02:10 PM, Rajeev rastogi wrote:
>> >We are also planning to implement CSN based snapshot.
>> >So I am curious to know whether any further development is happening on this.
>>
>> I started looking into this, and plan to work on this for 9.5. It's a big
>> project, so any help is welcome. The design I have in mind is to use the LSN
>> of the commit record as the CSN (as Greg Stark suggested).
>
> Cool.

Yes, very cool.  I remember having some concerns about using the LSN
of the commit record as the CSN.  I think the biggest one was the need
to update clog with the CSN before the commit record had been written,
which your proposal to store a temporary sentinel value there until
the commit has completed might address.  However, I wonder what
happens if you write the commit record and then the attempt to update
pg_clog fails.  I think you'll have to PANIC, which kind of sucks.  It
would be nice to pin the pg_clog page into the SLRU before writing the
commit record so that we don't have to fear needing to re-read it
afterwards, but the SLRU machinery doesn't currently have that notion.

Another thing to think about is that LSN = CSN will make things much
more difficult if we ever want to support multiple WAL streams with a
separate LSN sequence for each.  Perhaps you'll say that's a pipe
dream anyway, and I agree it's probably 5 years out, but I think it
may be something we'll want eventually.  With synthetic CSNs those
systems are more decoupled.  OTOH, one advantage of LSN = CSN is that
the commit order as seen on the standby would always match the commit
order as seen on the master, which is currently not true, and would be
a very nice property to have.

I think we're likely to find that system performance is quite
sensitive to any latency in updating the global-xmin.  One thing about
the present system is that if you take a snapshot while a very "old"
transaction is still running, you're going to use that as your
global-xmin for the entire lifetime of your transaction.  It might be
possible, with some of the rejiggering you're thinking about, to
arrange things so that there are opportunities for processes to roll
forward their notion of the global-xmin, making HOT pruning more
efficient.  Whether anything good happens there or not is sort of a
side issue, but we need to make sure the efficiency of HOT pruning
doesn't regress.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Proposal for CSN based snapshots

From
Ants Aasma
Date:
On Mon, May 12, 2014 at 4:56 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 01/24/2014 02:10 PM, Rajeev rastogi wrote:
>>
>> We are also planning to implement CSN based snapshot.
>> So I am curious to know whether any further development is happening on
>> this.
>
>
> I started looking into this, and plan to work on this for 9.5. It's a big
> project, so any help is welcome. The design I have in mind is to use the LSN
> of the commit record as the CSN (as Greg Stark suggested).

I did do some coding work on this, but the free time I used to work on
this basically disappeared with a child in the family. I guess what I
have has bitrotted beyond recognition. However, I may still have some
insight that may be of use.

From your comments I presume that you are going with the original,
simpler approach proposed by Robert: simply keep the XID-CSN map
around forever and probe it for all visibility lookups that fall
within the snapshot's xmin-xmax range? As opposed to the more complex
hybrid approach I proposed, which keeps a short-term XID-CSN map and
lazily builds conventional list-of-concurrent-XIDs snapshots for
long-lived snapshots. I think that would be prudent, as the simpler
approach needs mostly the same groundwork, and if it turns out to work
well enough, simpler is always better.
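
To make sure we mean the same thing by the simple scheme, a visibility
check would look roughly as below (purely a sketch: the names are
invented, and wraparound, hint bits and subtransactions are ignored):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint64_t CommitSeqNo;       /* here: LSN of the commit record */

#define CSN_IN_PROGRESS ((CommitSeqNo) 0)
#define CSN_ABORTED     ((CommitSeqNo) 1)

/* Hypothetical probe into the permanent XID->CSN map (the clog). */
extern CommitSeqNo ClogGetCommitSeqNo(TransactionId xid);

typedef struct
{
    TransactionId xmin;         /* XIDs below this could skip the probe */
    TransactionId xmax;         /* XIDs at or above this: never visible */
    CommitSeqNo   snapshotcsn;  /* WAL insert position at snapshot time */
} CsnSnapshot;

static bool
XidVisibleInSnapshot(TransactionId xid, const CsnSnapshot *snap)
{
    CommitSeqNo csn;

    if (xid >= snap->xmax)
        return false;
    csn = ClogGetCommitSeqNo(xid);
    if (csn == CSN_IN_PROGRESS || csn == CSN_ABORTED)
        return false;
    /* Committed: visible iff the commit record precedes the snapshot. */
    return csn <= snap->snapshotcsn;
}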

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Proposal for CSN based snapshots

From
Andres Freund
Date:
On 2014-05-12 18:01:59 +0300, Heikki Linnakangas wrote:
> On 05/12/2014 05:41 PM, Andres Freund wrote:
> >I haven't fully thought it through but I think it should make some of
> >the decoding code simpler. And it should greatly simplify the hot
> >standby code.
> 
> Cool. I was worried it might conflict with the logical decoding stuff in
> some fundamental way, as I'm not really familiar with it.

My gut feeling is that it should be possible to make it work. I'm too
deep into the last week of a project to properly analyze it, but I am
sure we'll find a place to grab a drink and discuss it next
week.
Essentially all it needs is to be able to represent snapshots from the
past including (and that's the hard part) a snapshot from somewhere in
the midst of an xact. The latter is done by logging cmin/cmax for all
catalog tuples and building a lookup table when looking inside an
xact. That shouldn't change much for CSN based snapshots. I think.

> >Some of the stuff in here will influence whether your freezing
> >replacement patch gets in. Do you plan to further pursue that one?
>
> Not sure. I got to the point where it seemed to work, but I got a bit of
> cold feet proceeding with it. I used the page header's LSN field to define
> the "epoch" of the page, but I started to feel uneasy about it.

Yea. I don't think the approach is fundamentally broken but it touches a
*lot* of arcane places... Or at least it needs to touch many and the
trick is finding them all :)

> I would be
> much more comfortable with an extra field in the page header, even though
> that uses more disk space. And requires dealing with pg_upgrade.

Maybe we can reclaim pagesize_version and prune_xid in some way? It
seems to me prune_xid could be represented as an LSN with CSN snapshots
combined with your freezing approach and we probably don't need the last
two bytes of the lsn for that purpose...

> Using 64 bits per XID instead of just 2 will obviously require a lot more
> disk space, so we might actually want to still support the old clog format
> too, as an "archive" format. The clog for old transactions could be
> converted to the more compact 2-bits per XID format (or even just 1 bit).

Wouldn't it make more sense to have two SLRUs then? An SLRU with dynamic
width doesn't seem easily doable.

> >How do you plan to deal with subtransactions?
> 
> pg_subtrans will stay unchanged. We could possibly merge it with pg_clog,
> reserving some 32-bit chunk of values that are not valid LSNs to mean an
> uncommitted subtransaction, with the parent XID. That assumes that you never
> need to look up the parent of an already-committed subtransaction. I thought
> that was true at first, but I think the SSI code looks up the parent of a
> committed subtransaction, to find its predicate locks. Perhaps it could be
> changed, but seems best to leave it alone for now; there will be a lot of
> code churn anyway.
> 
> I think we can get rid of the sub-XID array in PGPROC. It's currently used
> to speed up TransactionIdIsInProgress(), but with the patch it will no
> longer be necessary to call TransactionIdIsInProgress() every time you check
> the visibility of an XID, so it doesn't need to be so fast anymore.

Whether it can be removed depends on how the whole hot standby stuff is
dealt with... Also, there are some other callsites that do
TransactionIdIsInProgress() at some frequency. Just think about all the
multixact business :(

> With the new "commit-in-progress" status in clog, we won't need the
> sub-committed clog status anymore. The "commit-in-progress" status will
> achieve the same thing.

Wouldn't that cause many spurious waits? Because commit-in-progress
needs to be waited on, but a sub-committed xact surely not?

> >So it's quite possible that clog will become more of a contention point
> >due to the doubled amount of writes.
> 
> Yeah. OTOH, each transaction will take more space in the clog, which will
> spread the contention across more pages. And I think there are ways to
> mitigate contention in clog, if it becomes a problem.

I am not opposed, more wondering if you'd thought about it.

I don't think spreading the contention works very well with the current
implementation of slru.c. It's already very prone to throwing away the
wrong page. Widening it will just make that worse.

> We could make the
> locking more fine-grained than one lock per page, use atomic 64-bit
> reads/writes on platforms that support it, etc.

We *really* need an atomics abstraction layer... There's more and more
stuff coming that needs it.

This is going to be a *large* patch.

Greetings,

Andres Freund



Re: Proposal for CSN based snapshots

From
Ants Aasma
Date:
On Mon, May 12, 2014 at 6:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> However, I wonder what
> happens if you write the commit record and then the attempt to update
> pg_clog fails.  I think you'll have to PANIC, which kind of sucks.

CLOG IO error while committing is already a PANIC: SimpleLruReadPage()
does SlruReportIOError(), which in turn does ereport(ERROR), while
inside a critical section initiated in RecordTransactionCommit().

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Proposal for CSN based snapshots

From
Greg Stark
Date:
On Mon, May 12, 2014 at 2:56 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Currently, before consulting the clog for an XID's status, it is necessary
> to first check if the transaction is still in progress by scanning the proc
> array. To get rid of that requirement, just before writing the commit record
> in the WAL, the backend will mark the clog slot with a magic value that says
> "I'm just about to commit". After writing the commit record, it is replaced
> with the record's actual LSN. If a backend sees the magic value in the clog,
> it will wait for the transaction to finish the insertion, and then check
> again to get the real LSN. I'm thinking of just using XactLockTableWait()
> for that. This mechanism makes the insertion of a commit WAL record and
> updating the clog appear atomic to the rest of the system.


Would it be useful to store the current WAL insertion point along with
the "about to commit" flag, so it's effectively a promise that this
transaction will commit no earlier than XXX? That should allow most
transactions to decide whether those records are visible or not, except
for very recent transactions that started in the short window while the
committing transaction was in the process of committing.

-- 
greg



Re: Proposal for CSN based snapshots

From
Heikki Linnakangas
Date:
On 05/12/2014 06:26 PM, Andres Freund wrote:
>> >With the new "commit-in-progress" status in clog, we won't need the
>> >sub-committed clog status anymore. The "commit-in-progress" status will
>> >achieve the same thing.
> Wouldn't that cause many spurious waits? Because commit-in-progress
> needs to be waited on, but a sub-committed xact surely not?

Ah, no. Even today, a subxid isn't marked as sub-committed, until you 
commit the top-level transaction. The sub-commit state is a very 
transient state during the commit process, used to make the commit of 
the sub-transactions and the top-level transaction appear atomic. The 
commit-in-progress state would be a similarly short-lived state. You 
mark the subxids and the top xid as commit-in-progress just before the 
XLogInsert() of the commit record, and you replace them with the real 
LSNs right after XLogInsert().
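
In rough pseudo-C, the sequence is something like this (the names are
invented for illustration, not actual patch identifiers):

typedef uint32_t TransactionId;
typedef uint64_t XLogRecPtr;

#define CSN_COMMIT_IN_PROGRESS ((XLogRecPtr) 1)   /* sentinel, not an LSN */

extern void       ClogSetCommitSeqNo(TransactionId xid, XLogRecPtr csn);
extern XLogRecPtr InsertCommitRecord(TransactionId xid,
                                     const TransactionId *subxids, int n);

static void
CommitTransactionWithCsn(TransactionId xid,
                         const TransactionId *subxids, int nsubxids)
{
    XLogRecPtr  lsn;
    int         i;

    /* 1. Advertise the sentinel for the top xid and all subxids. */
    ClogSetCommitSeqNo(xid, CSN_COMMIT_IN_PROGRESS);
    for (i = 0; i < nsubxids; i++)
        ClogSetCommitSeqNo(subxids[i], CSN_COMMIT_IN_PROGRESS);

    /* 2. Write the commit record; its LSN becomes the CSN. */
    lsn = InsertCommitRecord(xid, subxids, nsubxids);

    /* 3. Replace the sentinel with the real LSN.  A reader that saw
     *    the sentinel waits (e.g. XactLockTableWait) and retries. */
    for (i = 0; i < nsubxids; i++)
        ClogSetCommitSeqNo(subxids[i], lsn);
    ClogSetCommitSeqNo(xid, lsn);
}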

- Heikki



Re: Proposal for CSN based snapshots

From
Andres Freund
Date:
On 2014-05-12 19:14:55 +0300, Heikki Linnakangas wrote:
> On 05/12/2014 06:26 PM, Andres Freund wrote:
> >>>With the new "commit-in-progress" status in clog, we won't need the
> >>>sub-committed clog status anymore. The "commit-in-progress" status will
> >>>achieve the same thing.
> >Wouldn't that cause many spurious waits? Because commit-in-progress
> >needs to be waited on, but a sub-committed xact surely not?
> 
> Ah, no. Even today, a subxid isn't marked as sub-committed, until you commit
> the top-level transaction. The sub-commit state is a very transient state
> during the commit process, used to make the commit of the sub-transactions
> and the top-level transaction appear atomic. The commit-in-progress state
> would be a similarly short-lived state. You mark the subxids and the top xid
> as commit-in-progress just before the XLogInsert() of the commit record, and
> you replace them with the real LSNs right after XLogInsert().

Ah, right. Forgot that detail...

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Proposal for CSN based snapshots

From
Ants Aasma
Date:
On Mon, May 12, 2014 at 7:10 PM, Greg Stark <stark@mit.edu> wrote:
> Would it be useful to store the current WAL insertion point along with
> the "about to commit" flag so it's effectively a promise that this
> transaction will commit no earlier than XXX. That should allow most
> transactions to decide if those records are visible or not unless
> they're very recent transactions which started in that short window
> while the committing transaction was in the process of committing.

I don't believe this is worth the complexity. The contention window is
extremely short here.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Proposal for CSN based snapshots

From
Rajeev rastogi
Date:
On 12 May 2014 19:27, Heikki Linnakangas wrote:
> On 01/24/2014 02:10 PM, Rajeev rastogi wrote:
> > We are also planning to implement CSN based snapshot.
> > So I am curious to know whether any further development is happening
> on this.
>
> I started looking into this, and plan to work on this for 9.5. It's a
> big project, so any help is welcome. The design I have in mind is to
> use the LSN of the commit record as the CSN (as Greg Stark suggested).

Great!

> Some problems and solutions I have been thinking of:
>
> The core of the design is to store the LSN of the commit record in
> pg_clog. Currently, we only store 2 bits per transaction there,
> indicating if the transaction committed or not, but the patch will
> expand it to 64 bits, to store the LSN. To check the visibility of an
> XID in a snapshot, the XID's commit LSN is looked up in pg_clog, and
> compared with the snapshot's LSN.


Won't it be a bit inefficient to look into pg_clog to read an XID's
commit LSN for every visibility check?

> With this mechanism, taking a snapshot is just a matter of reading the
> current WAL insertion point. There is no need to scan the proc array,
> which is good. However, it probably still makes sense to record an xmin
> and an xmax in SnapshotData, for performance reasons. An xmax, in
> particular, will allow us to skip checking the clog for transactions
> that will surely not be visible. We will no longer track the latest
> completed XID or the xmin like we do today, but we can use
> SharedVariableCache->nextXid as a conservative value for xmax, and keep
> a cached global xmin value in shared memory, updated when convenient,
> that can be just copied to the snapshot.

I think we can update xmin whenever the transaction whose XID equals
xmin commits (i.e. in ProcArrayEndTransaction).

Thanks and Regards,
Kumar Rajeev Rastogi




Re: Proposal for CSN based snapshots

From
Amit Kapila
Date:
On Mon, May 12, 2014 at 7:26 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> In theory, we could use a snapshot LSN as the cutoff-point for
> HeapTupleSatisfiesVisibility(). Maybe it's just because this is new, but
> that makes me feel uneasy.

To accomplish this, won't an XID-CSN map table be required, and how will
it be maintained (i.e. when to clear and add an entry to that map table)?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Proposal for CSN based snapshots

From
Heikki Linnakangas
Date:
On 05/13/2014 09:44 AM, Amit Kapila wrote:
> On Mon, May 12, 2014 at 7:26 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> In theory, we could use a snapshot LSN as the cutoff-point for
>> HeapTupleSatisfiesVisibility(). Maybe it's just because this is new, but
>> that makes me feel uneasy.
>
> To accomplish this, won't an XID-CSN map table be required, and how will
> it be maintained (i.e. when to clear and add an entry to that map table)?

Not sure I understand. The clog is a mapping from XID to CSN. What 
vacuum needs to know is whether the xmin and/or xmax is visible to 
everyone (and whether they committed or aborted). To determine that, it 
needs the oldest still active snapshot LSN. That can be found by 
scanning the proc array. It's pretty much the same as a regular MVCC 
visibility check, but using the oldest still-active snapshot.
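
In other words, something along these lines (a sketch only, reusing the
invented names from upthread):

/* oldest_snapshot_csn is the minimum snapshot LSN advertised in the
 * proc array. */
static bool
XidVisibleToAll(TransactionId xid, CommitSeqNo oldest_snapshot_csn)
{
    CommitSeqNo csn = ClogGetCommitSeqNo(xid);

    /* Committed, and before every still-active snapshot was taken. */
    return csn != CSN_IN_PROGRESS &&
           csn != CSN_ABORTED &&
           csn <= oldest_snapshot_csn;
}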

- Heikki



Re: Proposal for CSN based snapshots

From
Heikki Linnakangas
Date:
On 05/13/2014 08:08 AM, Rajeev rastogi wrote:
>> >The core of the design is to store the LSN of the commit record in
>> >pg_clog. Currently, we only store 2 bits per transaction there,
>> >indicating if the transaction committed or not, but the patch will
>> >expand it to 64 bits, to store the LSN. To check the visibility of an
>> >XID in a snapshot, the XID's commit LSN is looked up in pg_clog, and
>> >compared with the snapshot's LSN.
> Won't it be a bit inefficient to look into pg_clog to read an XID's
> commit LSN for every visibility check?

Maybe. If no hint bit is set on the tuple, you have to check the clog 
anyway to determine if the tuple is committed. And for XIDs older than 
xmin or newer than xmax, you don't need to check pg_clog. But it's true 
that for tuples with a hint bit set, and xmin < XID < xmax, you have to 
check pg_clog in the new system, when currently you only need to do a 
binary search of the local array in the snapshot. My gut feeling is 
that it won't be significantly slower in practice. If it becomes a 
problem, some rearrangement of the pg_clog code might help, or you 
could build a cache of XID->CSN mappings that you've already looked up 
in SnapshotData. So I don't think that's going to be a show-stopper.
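
For example, a tiny direct-mapped cache hanging off SnapshotData might
look like this (purely illustrative, reusing the invented names from
upthread; note that a still-in-progress answer must not be cached,
because it can still change):

#define XID_CSN_CACHE_SIZE 64           /* power of two */

typedef struct
{
    TransactionId xid;                  /* 0 = empty slot */
    CommitSeqNo   csn;
} XidCsnCacheEntry;

typedef struct
{
    XidCsnCacheEntry entries[XID_CSN_CACHE_SIZE];
} XidCsnCache;

static CommitSeqNo
CachedGetCommitSeqNo(XidCsnCache *cache, TransactionId xid)
{
    XidCsnCacheEntry *e = &cache->entries[xid & (XID_CSN_CACHE_SIZE - 1)];

    if (e->xid != xid)
    {
        CommitSeqNo csn = ClogGetCommitSeqNo(xid);  /* slow path */

        if (csn == CSN_IN_PROGRESS)
            return csn;                 /* don't cache a mutable answer */
        e->xid = xid;
        e->csn = csn;
    }
    return e->csn;
}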

- Heikki



Re: Proposal for CSN based snapshots

From
Amit Kapila
Date:
On Tue, May 13, 2014 at 1:59 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 05/13/2014 09:44 AM, Amit Kapila wrote:
>>
>> To accomplish this, won't an XID-CSN map table be required, and how will
>> it be maintained (i.e. when to clear and add an entry to that map table)?
>
>
> Not sure I understand. The clog is a mapping from XID to CSN.

The case I was referring to is xmin < XID < xmax, for which you have
mentioned below that you are planning to refer directly to pg_clog.  This
is, I think, one of the main places where the new design can have an impact
on performance, but as you said, it is better to first do the implementation
based on pg_clog rather than jumping directly to optimizing by maintaining
an XID to CSN mapping.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Proposal for CSN based snapshots

From
Rajeev rastogi
Date:
On 13 May 2014 14:06, Heikki Linnakangas wrote:

> >> >The core of the design is to store the LSN of the commit record in
> >> >pg_clog. Currently, we only store 2 bits per transaction there,
> >> >indicating if the transaction committed or not, but the patch will
> >> >expand it to 64 bits, to store the LSN. To check the visibility of
> >> >an XID in a snapshot, the XID's commit LSN is looked up in pg_clog,
> >> >and compared with the snapshot's LSN.
> > Won't it be a bit inefficient to look into pg_clog to read an XID's
> > commit LSN for every visibility check?
>
> Maybe. If no hint bit is set on the tuple, you have to check the clog
> anyway to determine if the tuple is committed. And for XIDs older
> than xmin or newer than xmax, you don't need to check pg_clog. But it's
> true that for tuples with a hint bit set, and xmin < XID < xmax, you have
> to check pg_clog in the new system, when currently you only need to
> do a binary search of the local array in the snapshot. My gut feeling
> is that it won't be significantly slower in practice. If it becomes a
> problem, some rearrangement of the pg_clog code might help, or you could
> build a cache of XID->CSN mappings that you've already looked up in
> SnapshotData. So I don't think that's going to be a show-stopper.

Yes, definitely it should not be a show-stopper. This can be optimized
later by the methods you mentioned, and also by some cut-off technique
based on which we can decide that an XID beyond a certain range will
always be visible, thereby avoiding the lookup in pg_clog.

Thanks and Regards,
Kumar Rajeev Rastogi



Re: Proposal for CSN based snapshots

From
Bruce Momjian
Date:
On Mon, May 12, 2014 at 06:01:59PM +0300, Heikki Linnakangas wrote:
> >Some of the stuff in here will influence whether your freezing
> >replacement patch gets in. Do you plan to further pursue that one?
> 
> Not sure. I got to the point where it seemed to work, but I got a
> bit of cold feet proceeding with it. I used the page header's LSN
> field to define the "epoch" of the page, but I started to feel
> uneasy about it. I would be much more comfortable with an extra
> field in the page header, even though that uses more disk space. And
> requires dealing with pg_upgrade.

FYI, pg_upgrade copies pg_clog from the old cluster, so there will be a
pg_upgrade issue anyway.

I am not excited about a 32x increase in clog size, especially since we
already do freezing at 200M transactions to allow for more aggressive
clog trimming.  Extrapolating that out (200M / 32), it means we would
freeze every 6.25M transactions.

-- 
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ Everyone has their own god. +



Re: Proposal for CSN based snapshots

From
Robert Haas
Date:
On Thu, May 15, 2014 at 2:34 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Mon, May 12, 2014 at 06:01:59PM +0300, Heikki Linnakangas wrote:
>> >Some of the stuff in here will influence whether your freezing
>> >replacement patch gets in. Do you plan to further pursue that one?
>>
>> Not sure. I got to the point where it seemed to work, but I got a
>> bit of cold feet proceeding with it. I used the page header's LSN
>> field to define the "epoch" of the page, but I started to feel
>> uneasy about it. I would be much more comfortable with an extra
>> field in the page header, even though that uses more disk space. And
>> requires dealing with pg_upgrade.
>
> FYI, pg_upgrade copies pg_clog from the old cluster, so there will be a
> pg_upgrade issue anyway.
>
> I am not excited about a 32x increase in clog size, especially since we
> already do freezing at 200M transactions to allow for more aggressive
> clog trimming.  Extrapolating that out, it means we would freeze every
> 6.25M transactions.

It seems better to allow clog to grow larger than to force
more-frequent freezing.

If the larger clog size is a show-stopper (and I'm not sure I have an
intelligent opinion on that just yet), one way to get around the
problem would be to summarize CLOG entries after-the-fact.  Once an
XID precedes the xmin of every snapshot, we don't need to know the
commit LSN any more.  So we could read the old pg_clog files and write
new summary files.  Since we don't need to care about subcommitted
transactions either, we could get by with just 1 bit per transaction,
1 = committed, 0 = aborted.  Once we've written and fsync'd the
summary files, we could throw away the original files.  That might
leave us with a smaller pg_clog than what we have today.
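
The summarization step itself would be nearly trivial. As a
back-of-the-envelope sketch (array-based, ignoring the SLRU machinery;
four transactions per clog byte, as today):

#include <stdint.h>
#include <stddef.h>

#define XACT_COMMITTED 0x01             /* one of the 2-bit status values */

/*
 * Condense a 2-bits-per-xid status array into 1 bit per xid
 * (1 = committed, 0 = aborted).  Assumes every xid in the range is
 * already resolved, i.e. precedes every snapshot's xmin.
 */
static void
summarize_clog(const uint8_t *clog, size_t nxids, uint8_t *summary)
{
    for (size_t i = 0; i < nxids; i++)
    {
        uint8_t status = (clog[i / 4] >> (2 * (i % 4))) & 0x03;

        if (status == XACT_COMMITTED)
            summary[i / 8] |= 1 << (i % 8);
        else
            summary[i / 8] &= ~(1 << (i % 8));
    }
}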

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Proposal for CSN based snapshots

From
Andres Freund
Date:
On 2014-05-15 15:40:06 -0400, Robert Haas wrote:
> On Thu, May 15, 2014 at 2:34 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > On Mon, May 12, 2014 at 06:01:59PM +0300, Heikki Linnakangas wrote:
> >> >Some of the stuff in here will influence whether your freezing
> >> >replacement patch gets in. Do you plan to further pursue that one?
> >>
> >> Not sure. I got to the point where it seemed to work, but I got a
> >> bit of cold feet proceeding with it. I used the page header's LSN
> >> field to define the "epoch" of the page, but I started to feel
> >> uneasy about it. I would be much more comfortable with an extra
> >> field in the page header, even though that uses more disk space. And
> >> requires dealing with pg_upgrade.
> >
> > FYI, pg_upgrade copies pg_clog from the old cluster, so there will be a
> > pg_upgrade issue anyway.
> >
> > I am not excited about a 32x increase in clog size, especially since we
> > already do freezing at 200M transactions to allow for more aggressive
> > clog trimming.  Extrapolating that out, it means we would freeze every
> > 6.25M transactions.

The default setting imo is far too low for a database of any relevant
activity. If I had the stomach for the fight around it I'd suggest
increasing it significantly by default. People with small databases
won't be hurt significantly because they simply don't have that many
transactions, and autovacuum will normally get around to cleaning up
long before then.

> It seems better to allow clog to grow larger than to force
> more-frequent freezing.

Yes.

> If the larger clog size is a show-stopper (and I'm not sure I have an
> intelligent opinion on that just yet), one way to get around the
> problem would be to summarize CLOG entries after-the-fact.  Once an
> XID precedes the xmin of every snapshot, we don't need to know the
> commit LSN any more.  So we could read the old pg_clog files and write
> new summary files.  Since we don't need to care about subcommitted
> transactions either, we could get by with just 1 bit per transaction,
> 1 = committed, 0 = aborted.  Once we've written and fsync'd the
> summary files, we could throw away the original files.  That might
> leave us with a smaller pg_clog than what we have today.

I think the easiest way for now would be to have pg_clog with the same
format as today and a rangewise much smaller pg_csn storing the lsns
that are needed. That'll leave us with pg_upgrade'ability without
needing to rewrite pg_clog during the upgrade.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Proposal for CSN based snapshots

From
Bruce Momjian
Date:
On Thu, May 15, 2014 at 10:06:32PM +0200, Andres Freund wrote:
> > If the larger clog size is a show-stopper (and I'm not sure I have an
> > intelligent opinion on that just yet), one way to get around the
> > problem would be to summarize CLOG entries after-the-fact.  Once an
> > XID precedes the xmin of every snapshot, we don't need to know the
> > commit LSN any more.  So we could read the old pg_clog files and write
> > new summary files.  Since we don't need to care about subcommitted
> > transactions either, we could get by with just 1 bit per transaction,
> > 1 = committed, 0 = aborted.  Once we've written and fsync'd the
> > summary files, we could throw away the original files.  That might
> > leave us with a smaller pg_clog than what we have today.
> 
> I think the easiest way for now would be to have pg_clog with the same
> format as today and a rangewise much smaller pg_csn storing the lsns
> that are needed. That'll leave us with pg_upgrade'ability without
> needing to rewrite pg_clog during the upgrade.

Yes, I like the idea of storing the CSN separately.  One reason the
2-bit clog is so good is that we know we have atomic 1-byte writes on
all platforms.  Can we assume atomic 64-bit writes?

-- 
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ Everyone has their own god. +



Re: Proposal for CSN based snapshots

From
Andres Freund
Date:
On 2014-05-15 16:13:49 -0400, Bruce Momjian wrote:
> On Thu, May 15, 2014 at 10:06:32PM +0200, Andres Freund wrote:
> > > If the larger clog size is a show-stopper (and I'm not sure I have an
> > > intelligent opinion on that just yet), one way to get around the
> > > problem would be to summarize CLOG entries after-the-fact.  Once an
> > > XID precedes the xmin of every snapshot, we don't need to know the
> > > commit LSN any more.  So we could read the old pg_clog files and write
> > > new summary files.  Since we don't need to care about subcommitted
> > > transactions either, we could get by with just 1 bit per transaction,
> > > 1 = committed, 0 = aborted.  Once we've written and fsync'd the
> > > summary files, we could throw away the original files.  That might
> > > leave us with a smaller pg_clog than what we have today.
> > 
> > I think the easiest way for now would be to have pg_clog with the same
> > format as today and a rangewise much smaller pg_csn storing the lsns
> > that are needed. That'll leave us with pg_upgrade'ability without
> > needing to rewrite pg_clog during the upgrade.
> 
> Yes, I like the idea of storing the CSN separately.  One reason the
> 2-bit clog is so good is that we know we have atomic 1-byte writes on
> all platforms.

I don't think we rely on that anywhere. And in fact we don't have the
ability to do so for arbitrary bytes - lots of platforms can do that
only on specifically aligned bytes.

We rely on being able to atomically (as in either before/after no torn
value) write/read TransactionIds, but that's it I think?

>  Can we assume atomic 64-bit writes?

Not on 32bit platforms.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Proposal for CSN based snapshots

From
Alvaro Herrera
Date:
Andres Freund wrote:
> On 2014-05-15 15:40:06 -0400, Robert Haas wrote:
> > On Thu, May 15, 2014 at 2:34 PM, Bruce Momjian <bruce@momjian.us> wrote:

> > If the larger clog size is a show-stopper (and I'm not sure I have an
> > intelligent opinion on that just yet), one way to get around the
> > problem would be to summarize CLOG entries after-the-fact.  Once an
> > XID precedes the xmin of every snapshot, we don't need to know the
> > commit LSN any more.  So we could read the old pg_clog files and write
> > new summary files.  Since we don't need to care about subcommitted
> > transactions either, we could get by with just 1 bit per transaction,
> > 1 = committed, 0 = aborted.  Once we've written and fsync'd the
> > summary files, we could throw away the original files.  That might
> > leave us with a smaller pg_clog than what we have today.
> 
> I think the easiest way for now would be to have pg_clog with the same
> format as today and a rangewise much smaller pg_csn storing the lsns
> that are needed. That'll leave us with pg_upgrade'ability without
> needing to rewrite pg_clog during the upgrade.

Err, we're proposing a patch to add timestamps to each commit,
http://www.postgresql.org/message-id/20131022221600.GE4987@eldon.alvh.no-ip.org
which does so in precisely this way.

The idea that pg_csn or pg_committs can be truncated much earlier than
pg_clog has its merit, no doubt.  If we can make sure that the atomicity
is sane, +1 from me.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Proposal for CSN based snapshots

From
Andres Freund
Date:
On 2014-05-15 17:37:14 -0400, Alvaro Herrera wrote:
> Andres Freund wrote:
> > On 2014-05-15 15:40:06 -0400, Robert Haas wrote:
> > > On Thu, May 15, 2014 at 2:34 PM, Bruce Momjian <bruce@momjian.us> wrote:
> 
> > > If the larger clog size is a show-stopper (and I'm not sure I have an
> > > intelligent opinion on that just yet), one way to get around the
> > > problem would be to summarize CLOG entries after-the-fact.  Once an
> > > XID precedes the xmin of every snapshot, we don't need to know the
> > > commit LSN any more.  So we could read the old pg_clog files and write
> > > new summary files.  Since we don't need to care about subcommitted
> > > transactions either, we could get by with just 1 bit per transaction,
> > > 1 = committed, 0 = aborted.  Once we've written and fsync'd the
> > > summary files, we could throw away the original files.  That might
> > > leave us with a smaller pg_clog than what we have today.
> > 
> > I think the easiest way for now would be to have pg_clog with the same
> > format as today and a rangewise much smaller pg_csn storing the lsns
> > that are needed. That'll leave us with pg_upgrade'ability without
> > needing to rewrite pg_clog during the upgrade.
> 
> Err, we're proposing a patch to add timestamps to each commit,
> http://www.postgresql.org/message-id/20131022221600.GE4987@eldon.alvh.no-ip.org
> which does so in precisely this way.

I am not sure where my statements above conflict with committs?

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Proposal for CSN based snapshots

From
Alvaro Herrera
Date:
Andres Freund wrote:
> On 2014-05-15 17:37:14 -0400, Alvaro Herrera wrote:
> > Andres Freund wrote:
> > > On 2014-05-15 15:40:06 -0400, Robert Haas wrote:
> > > > On Thu, May 15, 2014 at 2:34 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > 
> > > > If the larger clog size is a show-stopper (and I'm not sure I have an
> > > > intelligent opinion on that just yet), one way to get around the
> > > > problem would be to summarize CLOG entries after-the-fact.  Once an
> > > > XID precedes the xmin of every snapshot, we don't need to know the
> > > > commit LSN any more.  So we could read the old pg_clog files and write
> > > > new summary files.  Since we don't need to care about subcommitted
> > > > transactions either, we could get by with just 1 bit per transaction,
> > > > 1 = committed, 0 = aborted.  Once we've written and fsync'd the
> > > > summary files, we could throw away the original files.  That might
> > > > leave us with a smaller pg_clog than what we have today.
> > > 
> > > I think the easiest way for now would be to have pg_clog with the same
> > > format as today and a rangewise much smaller pg_csn storing the lsns
> > > that are needed. That'll leave us with pg_upgrade'ability without
> > > needing to rewrite pg_clog during the upgrade.
> > 
> > Err, we're proposing a patch to add timestamps to each commit,
> > http://www.postgresql.org/message-id/20131022221600.GE4987@eldon.alvh.no-ip.org
> > which does so in precisely this way.
> 
> I am not sure where my statements above conflict with committs?

I didn't say it did ...

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Proposal for CSN based snapshots

From
Heikki Linnakangas
Date:
So, here's a first version of the patch. Still very much WIP.

One thorny issue came up in discussions with other hackers on this in PGCon:

When a transaction is committed asynchronously, it becomes visible to
other backends before the commit WAL record is flushed. With CSN-based
snapshots, the order that transactions become visible is always based on
the LSNs of the WAL records. This is a problem when there is a mix of
synchronous and asynchronous commits:

If transaction A commits synchronously with commit LSN 1, and
transaction B commits asynchronously with commit LSN 2, B cannot become
visible before A. And we cannot acknowledge B as committed to the client
until it's visible to other transactions. That means that B will have to
wait for A's commit record to be flushed to disk, before it can return,
even though it was an asynchronous commit.

I personally think that's annoying, but we can live with it. The most
common usage of synchronous_commit=off is to run a lot of transactions
in that mode, setting it in postgresql.conf. And it wouldn't completely
defeat the purpose of mixing synchronous and asynchronous commits
either: an asynchronous commit still only needs to wait for any
already-logged synchronous commits to be flushed to disk, not the commit
record of the asynchronous transaction itself.
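
Concretely, the rule would be something like the sketch below
(illustrative only; how to track the LSN of the newest preceding
synchronous commit record is the hand-wavy part):

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

extern XLogRecPtr GetLastSyncCommitLsnBefore(XLogRecPtr lsn); /* invented */
extern void       XLogFlush(XLogRecPtr record);

static void
AsyncCommitWait(XLogRecPtr my_commit_lsn)
{
    /* Newest synchronous commit record logged before ours, if any. */
    XLogRecPtr  prior_sync = GetLastSyncCommitLsnBefore(my_commit_lsn);

    /* Wait for *that* record to reach disk -- not our own. */
    if (prior_sync != InvalidXLogRecPtr)
        XLogFlush(prior_sync);
}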

Ants' original design with a separate commit-sequence-number that's
different from the commit LSN would not have this problem, because that
would allow the commits to become visible to others out of WAL order.
However, the WAL order == commit order is a nice and simple property,
with other advantages.


Some bigger TODO items:

* Logical decoding is broken. I hacked on it enough that it looks
roughly sane and it compiles, but didn't spend more time to debug.

* I expanded pg_clog to 64-bits per XID, but people suggested keeping
pg_clog as is, with two bits per commit, and adding a new SLRU for the
commit LSNs beside it. Probably will need to do something like that to
avoid bloating the clog.

* Add some kind of backend-private caching of clog, to make it faster to
access. The visibility checks are now hitting the clog a lot more
heavily than before, as you need to check the clog even if the hint bits
are set, if the XID falls between xmin and xmax of the snapshot.

* Transactions currently become visible immediately when a WAL record is
inserted, before it's flushed. That's wrong, but shouldn't be difficult
to fix (except for the async commit issue explained above).

- Heikki


Attachment

Re: Proposal for CSN based snapshots

From
Greg Stark
Date:
On Fri, May 30, 2014 at 3:59 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> If transaction A commits synchronously with commit LSN 1, and transaction B
> commits asynchronously with commit LSN 2, B cannot become visible before A.
> And we cannot acknowledge B as committed to the client until it's visible to
> other transactions. That means that B will have to wait for A's commit
> record to be flushed to disk, before it can return, even though it was an
> asynchronous commit.


I thought that's what happens now.

What's more of a concern is synchronous replication. We currently
have a hack that makes transactions committed locally invisible to
other transactions even though they've committed and synced to disk
until the slave responds that it's received the transaction. (I think
this is bogus personally, it just shifts the failure modes around. If
we wanted to do it properly we would have to do two-phase commit.)

I guess it still works because we don't support having synchronous
replication for just some transactions and not others. It would be
nice to support that but I think it would mean making it work like
local synchronous commit. It would only affect how long the commit
blocks, not when other transactions see the committed data.

-- 
greg



Re: Proposal for CSN based snapshots

From
Heikki Linnakangas
Date:
On 05/30/2014 06:09 PM, Greg Stark wrote:
> On Fri, May 30, 2014 at 3:59 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> If transaction A commits synchronously with commit LSN 1, and transaction B
>> commits asynchronously with commit LSN 2, B cannot become visible before A.
>> And we cannot acknowledge B as committed to the client until it's visible to
>> other transactions. That means that B will have to wait for A's commit
>> record to be flushed to disk, before it can return, even though it was an
>> asynchronous commit.
>
>
> I thought that's what happens now.
>
> What's more of a concern is synchronous replication. We currently
> have a hack that makes transactions committed locally invisible to
> other transactions even though they've committed and synced to disk
> until the slave responds that it's received the transaction. (I think
> this is bogus personally, it just shifts the failure modes around. If
> we wanted to do it properly we would have to do two-phase commit.)

Yeah. To recap, the failure mode is that if the master crashes and 
restarts, the transaction becomes visible in the master even though it 
was never replicated.

> I guess it still works because we don't support having synchronous
> replication for just some transactions and not others. It would be
> nice to support that but I think it would mean making it work like
> local synchronous commit. It would only affect how long the commit
> blocks, not when other transactions see the committed data.

Actually, we do support that, see synchronous_commit=local.

- Heikki



Re: Proposal for CSN based snapshots

From
Andres Freund
Date:
Hi,

On 2014-05-30 17:59:23 +0300, Heikki Linnakangas wrote:
> So, here's a first version of the patch. Still very much WIP.

Cool.

> One thorny issue came up in discussions with other hackers on this in PGCon:
> 
> When a transaction is committed asynchronously, it becomes visible to other
> backends before the commit WAL record is flushed. With CSN-based snapshots,
> the order that transactions become visible is always based on the LSNs of
> the WAL records. This is a problem when there is a mix of synchronous and
> asynchronous commits:
> 
> If transaction A commits synchronously with commit LSN 1, and transaction B
> commits asynchronously with commit LSN 2, B cannot become visible before A.
> And we cannot acknowledge B as committed to the client until it's visible to
> other transactions. That means that B will have to wait for A's commit
> record to be flushed to disk, before it can return, even though it was an
> asynchronous commit.

> I personally think that's annoying, but we can live with it. The most common
> usage of synchronous_commit=off is to run a lot of transactions in that
> mode, setting it in postgresql.conf. And it wouldn't completely defeat the
> purpose of mixing synchronous and asynchronous commits either: an
> asynchronous commit still only needs to wait for any already-logged
> synchronous commits to be flushed to disk, not the commit record of the
> asynchronous transaction itself.

I have a hard time believing that users won't hate us for such a
regression. It's pretty common to mix both sorts of transactions and
this will - by my guesstimate - dramatically reduce throughput for the
async backends.

> * Logical decoding is broken. I hacked on it enough that it looks roughly
> sane and it compiles, but didn't spend more time to debug.

I think we can live with it not working for the first few
iterations. I'll look into it once the patch has stabilized a bit.

> * I expanded pg_clog to 64-bits per XID, but people suggested keeping
> pg_clog as is, with two bits per commit, and adding a new SLRU for the
> commit LSNs beside it. Probably will need to do something like that to avoid
> bloating the clog.

It also influences how on-disk compatibility is dealt with. So: How are
you planning to deal with on-disk compatibility?

> * Add some kind of backend-private caching of clog, to make it faster to
> access. The visibility checks are now hitting the clog a lot more heavily
> than before, as you need to check the clog even if the hint bits are set, if
> the XID falls between xmin and xmax of the snapshot.

That'll hurt a lot in concurrent scenarios :/. Have you measured how
'wide' xmax-xmin usually is? I wonder if we could just copy a range of
values from the clog when we start scanning....

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Proposal for CSN based snapshots

From
Heikki Linnakangas
Date:
On 05/30/2014 06:27 PM, Andres Freund wrote:
> On 2014-05-30 17:59:23 +0300, Heikki Linnakangas wrote:
>> One thorny issue came up in discussions with other hackers on this in PGCon:
>>
>> When a transaction is committed asynchronously, it becomes visible to other
>> backends before the commit WAL record is flushed. With CSN-based snapshots,
>> the order that transactions become visible is always based on the LSNs of
>> the WAL records. This is a problem when there is a mix of synchronous and
>> asynchronous commits:
>>
>> If transaction A commits synchronously with commit LSN 1, and transaction B
>> commits asynchronously with commit LSN 2, B cannot become visible before A.
>> And we cannot acknowledge B as committed to the client until it's visible to
>> other transactions. That means that B will have to wait for A's commit
>> record to be flushed to disk, before it can return, even though it was an
>> asynchronous commit.
>
>> I personally think that's annoying, but we can live with it. The most common
>> usage of synchronous_commit=off is to run a lot of transactions in that
>> mode, setting it in postgresql.conf. And it wouldn't completely defeat the
>> purpose of mixing synchronous and asynchronous commits either: an
>> asynchronous commit still only needs to wait for any already-logged
>> synchronous commits to be flushed to disk, not the commit record of the
>> asynchronous transaction itself.
>
> I have a hard time believing that users won't hate us for such a
> regression. It's pretty common to mix both sorts of transactions and
> this will - by my guesstimate - dramatically reduce throughput for the
> async backends.

Yeah, it probably would. Not sure how many people would care.

For an asynchronous commit, we could store the current WAL flush 
location as the commit LSN, instead of the location of the commit 
record. That would break the property that LSN == commit order, but that 
property is fundamentally incompatible with having async commits become 
visible without flushing previous transactions. Or we could even make it 
configurable, it would be fairly easy to support both behaviors.

>> * Logical decoding is broken. I hacked on it enough that it looks roughly
>> sane and it compiles, but didn't spend more time to debug.
>
> I think we can live with it not working for the first few
> iterations. I'll look into it once the patch has stabilized a bit.

Thanks!

>> * I expanded pg_clog to 64-bits per XID, but people suggested keeping
>> pg_clog as is, with two bits per commit, and adding a new SLRU for the
>> commit LSNs beside it. Probably will need to do something like that to avoid
>> bloating the clog.
>
> It also influences how on-disk compatibility is dealt with. So: How are
> you planning to deal with on-disk compatibility?
>
>> * Add some kind of backend-private caching of clog, to make it faster to
>> access. The visibility checks are now hitting the clog a lot more heavily
>> than before, as you need to check the clog even if the hint bits are set, if
>> the XID falls between xmin and xmax of the snapshot.
>
> That'll hurt a lot in concurrent scenarios :/. Have you measured how
> 'wide' xmax-xmin usually is?

That depends entirely on the workload. The worst case is a mix of a 
long-running transaction and a lot of short transactions. It could grow 
to millions of transactions or more in that case.

> I wonder if we could just copy a range of
> values from the clog when we start scanning....

I don't think that's practical, if the xmin-xmax gap is wide.

Perhaps we should take the bull by the horns and make clog faster to 
look up. If we e.g. mmapped the clog file into backend-private address 
space, we could avoid all the locking overhead of an SLRU. On platforms with 
atomic 64-bit instructions, you could read the clog with just a memory 
barrier. Even on other architectures, you'd only need a spinlock.
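
As a minimal standalone sketch of that (assuming a hypothetical
64-bits-per-xact clog-like file; a 32-bit platform would need the
spinlock fallback instead of the atomic load):

#include <fcntl.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static _Atomic uint64_t *clog_map;      /* 64-bit commit LSN per xact */
static size_t            clog_slots;

/* Map a clog-like segment read-only into this backend's address space. */
static int
clog_mmap(const char *path, size_t bytes)
{
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    clog_map = mmap(NULL, bytes, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (clog_map == MAP_FAILED)
        return -1;
    clog_slots = bytes / sizeof(uint64_t);
    return 0;
}

static uint64_t
clog_get_commit_lsn(uint32_t xid)
{
    /* A single load plus an acquire barrier on 64-bit hardware. */
    return atomic_load_explicit(&clog_map[xid % clog_slots],
                                memory_order_acquire);
}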

- Heikki



Re: Proposal for CSN based snapshots

From
Heikki Linnakangas
Date:
(forgot to answer this question)

On 05/30/2014 06:27 PM, Andres Freund wrote:
> On 2014-05-30 17:59:23 +0300, Heikki Linnakangas wrote:
>> * I expanded pg_clog to 64-bits per XID, but people suggested keeping
>> pg_clog as is, with two bits per commit, and adding a new SLRU for the
>> commit LSNs beside it. Probably will need to do something like that to avoid
>> bloating the clog.
>
> It also influences how on-disk compatibility is dealt with. So: How are
> you planning to deal with on-disk compatibility?

Have pg_upgrade read the old clog and write it out in the new format.

- Heikki



Re: Proposal for CSN based snapshots

From
Robert Haas
Date:
On Fri, May 30, 2014 at 2:38 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> (forgot to answer this question)
>
> On 05/30/2014 06:27 PM, Andres Freund wrote:
>>
>> On 2014-05-30 17:59:23 +0300, Heikki Linnakangas wrote:
>>>
>>> * I expanded pg_clog to 64-bits per XID, but people suggested keeping
>>> pg_clog as is, with two bits per commit, and adding a new SLRU for the
>>> commit LSNs beside it. Probably will need to do something like that to
>>> avoid
>>> bloating the clog.
>>
>>
>> It also influences how on-disk compatibility is dealt with. So: How are
>> you planning to deal with on-disk compatibility?
>
>
> Have pg_upgrade read the old clog and write it out in the new format.

That doesn't address Bruce's concern about CLOG disk consumption.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Proposal for CSN based snapshots

From
Greg Stark
Date:
On Tue, Jun 3, 2014 at 2:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> That doesn't address Bruce's concern about CLOG disk consumption.

Well we would only need the xid->lsn mapping for transactions since
globalxmin. Anything older we would just need the committed bit. So we
could maintain two structures, one like our current clog going back
until the freeze_max_age and one with 32-bits (or 64 bits?) per xid
but only going back as far as the globalxmin. There are a myriad of
compression techniques we could use on a sequence of mostly similar
mostly increasing numbers in a small range too. But I suspect they
wouldn't really be necessary.

Here's another thought. I don't see how to make this practical but it
would be quite convenient if it could be made so. Once an xid is looked
up and determined to be committed, all we'll ever need in the future is
the LSN. If we could replace the xid with the
LSN of the commit record right there in the page then future viewers
would be able to determine if it's visible without looking in the clog
or this new clog xid->lsn mapping. If that was the full LSN it would
never need to be frozen either. The problem with this is that the LSN
is big and actually moves faster than the xid. We could play the same
games with putting the lsn segment in the page header but it's
actually entirely feasible to have a snapshot that extends over
several segments. Way easier than a snapshot that extends over several
xid epochs. (This does make having a CSN like Oracle kind of tempting
after all)

-- 
greg



Re: Proposal for CSN based snapshots

From
Heikki Linnakangas
Date:
On 06/03/2014 05:53 PM, Greg Stark wrote:
> On Tue, Jun 3, 2014 at 2:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> That doesn't address Bruce's concern about CLOG disk consumption.
>
> Well we would only need the xid->lsn mapping for transactions since
> globalxmin. Anything older we would just need the committed bit. So we
> could maintain two structures, one like our current clog going back
> until the freeze_max_age and one with 32-bits (or 64 bits?) per xid
> but only going back as far as the globalxmin. There are a myriad of
> compression techniques we could use on a sequence of mostly similar
> mostly increasing numbers in a small range too. But I suspect they
> wouldn't really be necessary.

Yeah, that seems like a better design, after all.

Attached is a new patch. It now keeps the current pg_clog unchanged, but
adds a new pg_csnlog besides it. pg_csnlog is more similar to
pg_subtrans than pg_clog: it's not WAL-logged, is reset at startup, and
segments older than GlobalXmin can be truncated.
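
A reader then combines the two structures roughly like this (sketch
only, reusing the invented names from upthread; the function names and
the CSN_FROZEN sentinel are made up, and wraparound is ignored):

extern TransactionId GetGlobalXmin(void);
extern bool          ClogXidCommitted(TransactionId xid);     /* 2-bit clog */
extern CommitSeqNo   CsnlogGetCommitSeqNo(TransactionId xid); /* new SLRU */

#define CSN_FROZEN ((CommitSeqNo) 2)    /* committed, visible to everyone */

static CommitSeqNo
GetCommitSeqNo(TransactionId xid)
{
    /*
     * XIDs older than GlobalXmin are fully resolved and their pg_csnlog
     * segments may already be truncated, so pg_clog's permanent 2-bit
     * status is enough for them.
     */
    if (xid < GetGlobalXmin())
        return ClogXidCommitted(xid) ? CSN_FROZEN : CSN_ABORTED;
    return CsnlogGetCommitSeqNo(xid);
}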

This addresses the disk space consumption, and simplifies pg_upgrade.

There are no other significant changes in this new version, so it's
still very much WIP. But please take a look!

- Heikki


Attachment

Re: Proposal for CSN based snapshots

From
Craig Ringer
Date:
On 05/30/2014 11:14 PM, Heikki Linnakangas wrote:
> 
> Yeah. To recap, the failure mode is that if the master crashes and
> restarts, the transaction becomes visible in the master even though it
> was never replicated.

Wouldn't another pg_clog bit for the transaction be able to sort that out?

-- 
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Proposal for CSN based snapshots

From
Robert Haas
Date:
On Mon, Jun 16, 2014 at 12:58 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 05/30/2014 11:14 PM, Heikki Linnakangas wrote:
>> Yeah. To recap, the failure mode is that if the master crashes and
>> restarts, the transaction becomes visible in the master even though it
>> was never replicated.
>
> Wouldn't another pg_clog bit for the transaction be able to sort that out?

How?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Proposal for CSN based snapshots

From
Craig Ringer
Date:
On 06/18/2014 12:41 AM, Robert Haas wrote:
> On Mon, Jun 16, 2014 at 12:58 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>> > On 05/30/2014 11:14 PM, Heikki Linnakangas wrote:
>>> >> Yeah. To recap, the failure mode is that if the master crashes and
>>> >> restarts, the transaction becomes visible in the master even though it
>>> >> was never replicated.
>> >
>> > Wouldn't another pg_clog bit for the transaction be able to sort that out?
> How?

A flag to indicate that the tx is locally committed but hasn't been
confirmed by a streaming synchronous replica, so it must not become
visible until the replica confirms it or SR is disabled.

Then scan pg_clog on start / replica connect and ask the replica to
confirm local commit for those tx's.

No?

-- 
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Proposal for CSN based snapshots

From
Robert Haas
Date:
On Tue, Jun 17, 2014 at 9:00 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 06/18/2014 12:41 AM, Robert Haas wrote:
>> On Mon, Jun 16, 2014 at 12:58 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>>> > On 05/30/2014 11:14 PM, Heikki Linnakangas wrote:
>>>> >> Yeah. To recap, the failure mode is that if the master crashes and
>>>> >> restarts, the transaction becomes visible in the master even though it
>>>> >> was never replicated.
>>> >
>>> > Wouldn't another pg_clog bit for the transaction be able to sort that out?
>> How?
>
> A flag to indicate that the tx is locally committed but hasn't been
> confirmed by a streaming synchronous replica, so it must not become
> visible until the replica confirms it or SR is disabled.
>
> Then scan pg_clog on start / replica connect and ask the replica to
> confirm local commit for those tx's.
>
> No?

No.  Otherwise, one of those bits could get changed after a backend
takes a snapshot and before it finishes using it - so that the
transaction snapshot is in effect changing underneath it.  You could
avoid that by memorizing the contents of CLOG when taking a snapshot,
but that would defeat the whole purpose of CSN-based snapshots, which
is to make them small and fixed-size.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Proposal for CSN based snapshots

From
Craig Ringer
Date:
On 06/18/2014 09:12 PM, Robert Haas wrote:
> No.  Otherwise, one of those bits could get changed after a backend
> takes a snapshot and before it finishes using it - so that the
> transaction snapshot is in effect changing underneath it.  You could
> avoid that by memorizing the contents of CLOG when taking a snapshot,
> but that would defeat the whole purpose of CSN-based snapshots, which
> is to make the small and fixed-size.

Ah.

Thank you. I appreciate the explanation.

-- 
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Proposal for CSN based snapshots

From
Amit Langote
Date:
Hi,

On Fri, Jun 13, 2014 at 7:24 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Yeah, that seems like a better design, after all.
>
> Attached is a new patch. It now keeps the current pg_clog unchanged, but
> adds a new pg_csnlog besides it. pg_csnlog is more similar to pg_subtrans
> than pg_clog: it's not WAL-logged, is reset at startup, and segments older
> than GlobalXmin can be truncated.
>
> This addresses the disk space consumption, and simplifies pg_upgrade.
>
> There are no other significant changes in this new version, so it's still
> very much WIP. But please take a look!
>

Thanks for working on this important patch. I know this patch is still
largely a WIP but I would like to report an observation.

I applied this patch and did a few pgbench runs with 32 clients (this
is on a not-so-powerful VM, by the way). Perhaps you suspect such a
thing already, but I observed a relatively large percentage of time
being spent in XLogInsert().

Perhaps XLogCtlInsert.insertpos_lck contention via GetXLogInsertRecPtr()?

--
Amit



Re: Proposal for CSN based snapshots

From
Simon Riggs
Date:
On 12 May 2014 17:14, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 05/12/2014 06:26 PM, Andres Freund wrote:
>>>
>>> >With the new "commit-in-progress" status in clog, we won't need the
>>> >sub-committed clog status anymore. The "commit-in-progress" status will
>>> >achieve the same thing.
>>
>> Wouldn't that cause many spurious waits? Because commit-in-progress
>> needs to be waited on, but a sub-committed xact surely not?
>
>
> Ah, no. Even today, a subxid isn't marked as sub-committed, until you commit
> the top-level transaction. The sub-commit state is a very transient state
> during the commit process, used to make the commit of the sub-transactions
> and the top-level transaction appear atomic. The commit-in-progress state
> would be a similarly short-lived state. You mark the subxids and the top xid
> as commit-in-progress just before the XLogInsert() of the commit record, and
> you replace them with the real LSNs right after XLogInsert().

My understanding is that we aim to speed up the rate of snapshot
reads, which may make it feasible to store autonomous transactions in
shared memory, hopefully DSM.

The above mechanism sounds like it might slow down transaction commit.
Could we prototype that and measure the speed?

More generally, do we have any explicit performance goals or
acceptance criteria for this patch?

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Proposal for CSN based snapshots

From
Alvaro Herrera
Date:
Heikki Linnakangas wrote:

> Attached is a new patch. It now keeps the current pg_clog unchanged,
> but adds a new pg_csnlog besides it. pg_csnlog is more similar to
> pg_subtrans than pg_clog: it's not WAL-logged, is reset at startup,
> and segments older than GlobalXmin can be truncated.
>
> This addresses the disk space consumption, and simplifies pg_upgrade.
>
> There are no other significant changes in this new version, so it's
> still very much WIP. But please take a look!

Thanks.  I've been looking at this.  It needs a rebase; I had to apply
it to commit 89cf2d52030 (Wed Jul 2 13:11:05 2014 -0400) because of
conflicts with the commit after that; it applies with some fuzz.  I read
the discussion in this thread and the README, to try to understand how
this is supposed to work.  It looks pretty good.

One thing not quite clear to me is the now-static RecentXmin,
GetRecentXmin(), AdvanceGlobalRecentXmin, and the like.  I found the
whole thing pretty confusing.  I noticed that ShmemVariableCache->recentXmin
is read without a lock in GetSnapshotData(), but exclusive ProcArrayLock
is taken to write to it in GetRecentXmin().  I think we can write it
without a lock also, since we assume that storing an xid is atomic.
With that change, I think you can make the acquisition of ProcArrayLock
to walk the pgproc list use shared mode, not exclusive.

Another point is that RecentXmin is no longer used directly, except in
TransactionIdIsActive().  But that routine is only used for an Assert()
now, so I think it's fine to just have GetRecentXmin() return the value
and not set RecentXmin as a side effect; my point is that ISTM
RecentXmin can be removed altogether, which makes that business simpler.


As far as GetOldestXmin() is concerned, that routine now ignores its
arguments.  Is that okay?  I imagine it's just a natural consequence of
how things work now.   [ ... reads some more code ... ]  Oh, now it's
only used to determine a freeze limit.  I think you should just remove
the arguments and make that whole thing simpler.  I was going to suggest
that AdvanceRecentGlobalXmin() could receive a possibly-NULL Relation
argument to pass down to GetOldestSnapshotGuts() which can make use of
it for a tighter limit, but since OldestXmin is only for freezing, maybe
there is no point in this extra complication.

Regarding the pendingGlobalXmin stuff, I didn't find any documentation.
I think you just need to write a very long comment on top of
AdvanceRecentGlobalXmin() to explain what it's doing.  After going over
the code half a dozen times I think I understand the idea, and it
seems sound.

Not sure about the advancexmin_counter thing.  Maybe in some cases
sessions will only run 9 transactions or less and then finish, in which
case it will only get advanced at checkpoint time, which would be way
too seldom.  Maybe it would work to place the counter in shared memory:

acquire(spinlock);
if (ShmemVariableCache->counter++ >= some_number)
{
   ShmemVariableCache->counter = 0;
   do_update = true;
}
release(spinlock);
if (do_update)
   AdvanceRecentGlobalXmin();
(Maybe we can forgo the spinlock altogether?  If the value gets out of
hand once in a while, it shouldn't really matter)  Perhaps the "some
number" should be scaled to some fraction of MaxBackends, but not less
than X (32?) and no more than Y (128?).  Or maybe just use constant 10,
as the current code does.  But it'd be the total number of transactions,
not transactions in the current backend.
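
For illustration, the spinlock-free variant suggested above could be as
simple as this sketch (lost or doubled increments merely make the
trigger point imprecise, which should be harmless here):

if (++ShmemVariableCache->counter >= some_number)
{
   ShmemVariableCache->counter = 0;
   AdvanceRecentGlobalXmin();
}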


I think it'd be cleaner to have a distinct typedef to use when
XLogRecPtr is used as a snapshot representation.  Right now there is no
clarity on whether we're interested in the xlog position itself or in
its use as a snapshot value.


HeapTupleSatisfiesVacuum[X]() and various callers need updates to their
comments: where OldestXmin is mentioned, it should be OldestSnapshot.

I noticed that SubTransGetTopmostTransaction() is now only called from
predicate.c, and it's pretty constrained in what it wants.  I'm not
implying that we want to do anything in your patch about it, other than
perhaps add a note to it that we may want to examine it later for
possible changes.

I haven't gone over the transam.c, clog.c changes yet.

I attach a couple of minor tweaks to the README, mostly for clarity (but
also update clog -> csnlog in a couple of places).

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Proposal for CSN based snapshots

From
Jeff Davis
Date:
On Fri, 2014-06-13 at 13:24 +0300, Heikki Linnakangas wrote:
> Attached is a new patch. It now keeps the current pg_clog unchanged, but 
> adds a new pg_csnlog besides it. pg_csnlog is more similar to 
> pg_subtrans than pg_clog: it's not WAL-logged, is reset at startup, and 
> segments older than GlobalXmin can be truncated.

It appears that this patch weakens the idea of hint bits. Even if
HEAP_XMIN_COMMITTED is set, it still needs to find out if it's in the
snapshot.

That's fast if the xid is less than snap->xmin, but otherwise it needs
to do a fetch from the csnlog, which is exactly what the hint bits are
designed to avoid. And we can't get around this, because the whole point
of this patch is to remove the xip array from the snapshot.
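
For illustration, a minimal sketch of the lookup path being described
(a hypothetical rendering; XidVisibleInSnapshot, CSNLogGetCommitSeqNo
and the snapshotcsn field are assumed names, not necessarily what the
patch uses):

/* assumes the tuple's xmin is already hinted as committed */
static bool
XidVisibleInSnapshot(TransactionId xid, Snapshot snapshot)
{
    if (TransactionIdPrecedes(xid, snapshot->xmin))
        return true;    /* fast path: no csnlog access needed */

    /* slow path: even with HEAP_XMIN_COMMITTED set, we must fetch
     * the CSN and compare it against the snapshot */
    return CSNLogGetCommitSeqNo(xid) <= snapshot->snapshotcsn;
}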

If the transaction was committed a long time ago, then we could set
PD_ALL_VISIBLE and the VM bit, and a scan wouldn't even look at the hint
bit. If it was committed recently, then it's probably greater than the
recentXmin. I think there's still room for a hint bit to technically be
useful, but it seems quite narrow unless I'm missing something (and a
narrowly-useful hint bit doesn't seem to be useful at all).

I'm not complaining, and I hope this is not a showstopper for this
patch, but I think it's worth discussing.

Regards,
Jeff Davis





Re: Proposal for CSN based snapshots

From
Heikki Linnakangas
Date:
On 08/26/2014 12:03 PM, Jeff Davis wrote:
> On Fri, 2014-06-13 at 13:24 +0300, Heikki Linnakangas wrote:
>> Attached is a new patch. It now keeps the current pg_clog unchanged, but
>> adds a new pg_csnlog besides it. pg_csnlog is more similar to
>> pg_subtrans than pg_clog: it's not WAL-logged, is reset at startup, and
>> segments older than GlobalXmin can be truncated.
>
> It appears that this patch weakens the idea of hint bits. Even if
> HEAP_XMIN_COMMITTED is set, it still needs to find out if it's in the
> snapshot.
>
> That's fast if the xid is less than snap->xmin, but otherwise it needs
> to do a fetch from the csnlog, which is exactly what the hint bits are
> designed to avoid. And we can't get around this, because the whole point
> of this patch is to remove the xip array from the snapshot.

Yeah. This patch in the current state is likely much much slower than 
unpatched master, except in extreme cases where you have thousands of 
connections and short transactions so that without the patch, you spend 
most of the time acquiring snapshots.

My current thinking is that access to the csnlog will need to be made 
faster. Currently, it's just another SLRU, but I'm sure we can do better 
than that. Or add a backend-private cache on top of it; that might 
already alleviate it enough..

- Heikki




Re: Proposal for CSN based snapshots

From
Jeff Davis
Date:
On Tue, 2014-08-26 at 13:45 +0300, Heikki Linnakangas wrote:
> My current thinking is that access to the csnlog will need to be made 
> faster. Currently, it's just another SLRU, but I'm sure we can do better 
> than that. Or add a backend-private cache on top of it; that might 
> already alleviate it enough..

Aren't those the same ideas that have been proposed for eliminating hint
bits?

If this patch makes taking snapshots much faster, then perhaps it's
enough incentive to follow through on one of those ideas. I am certainly
open to removing hint bits (which was probably clear from earlier
discussions) if there are significant wins elsewhere and the penalty is
not too great.

To continue the discussion on this patch, let's assume that we make
TransactionIdGetCommitLSN sufficiently fast. If that's the case, can't
we make HeapTupleSatisfiesMVCC look more like:
  LsnMin = TransactionIdGetCommitLSN(xmin);
  if (!COMMITLSN_IS_COMMITTED(LsnMin))
      LsnMin = BIG_LSN;
  LsnMax = TransactionIdGetCommitLSN(xmax);
  if (!COMMITLSN_IS_COMMITTED(LsnMax))
      LsnMax = BIG_LSN;
  return (snapshot->snapshotlsn >= LsnMin &&
          snapshot->snapshotlsn <  LsnMax);

There would need to be some handling for locked tuples, or tuples
related to the current transaction, of course. But I still think it
would turn out simpler; perhaps by enough to save a few cycles.

Regards,
Jeff Davis







Re: Proposal for CSN based snapshots

From
Greg Stark
Date:
On Tue, Aug 26, 2014 at 11:45 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>> It appears that this patch weakens the idea of hint bits. Even if
>> HEAP_XMIN_COMMITTED is set, it still needs to find out if it's in the
>> snapshot.
>>
>> That's fast if the xid is less than snap->xmin, but otherwise it needs
>> to do a fetch from the csnlog, which is exactly what the hint bits are
>> designed to avoid. And we can't get around this, because the whole point
>> of this patch is to remove the xip array from the snapshot.
>
>
> Yeah. This patch in the current state is likely much much slower than
> unpatched master, except in extreme cases where you have thousands of
> connections and short transactions so that without the patch, you spend most
> of the time acquiring snapshots.


Interesting analysis.

I suppose the equivalent of hint bits would be to actually write the
CSN of the transaction into the record when the hint bit is set.

I don't immediately see how to make that practical. One thought would
be to have a list of xids in the page header with their corresponding
csn -- which starts to sound a lot like Oracle's "Interested
Transaction List". But I don't see how to make that work for the
hundreds of possible xids on the page.

The worst case for visibility resolution is a narrow table with
random-access DML happening all the time: each update is a short
transaction and there is a very high rate of such transactions spread
out uniformly over a very large table. That means any given page has
over 200 rows with random xids spread over a very large range of xids.

Currently the invariant that hint bits give us is that each xid needs
to be looked up in the clog only a more or less fixed number of times;
in that scenario only once, since the table is very large and the
transactions short lived.

-- 
greg



Re: Proposal for CSN based snapshots

From
Jeff Davis
Date:
On Tue, 2014-08-26 at 19:25 +0100, Greg Stark wrote:
> I don't immediately see how to make that practical. One thought would
> be to have a list of xids in the page header with their corresponding
> csn -- which starts to sound a lot like Oracle's "Interested
> Transaction List". But I don't see how to make that work for the
> hundreds of possible xids on the page.

I feel like that's moving in the wrong direction. That's still causing a
lot of modifications to a data page when the data is not changing, and
that's bad for a lot of cases that I'm interested in (checksums are one
example).

We are mixing two kinds of data: user data and visibility information.
Each is changed under different circumstances and has different
characteristics, and I'm beginning to think they shouldn't be mixed at
all.

What if we just devised a structure specially designed to hold
visibility information, put all of the visibility information there, and
data pages would only change where there is a real, user-initiated
I/U/D. Vacuum could still clear out dead tuples from the data area, but
it would do the rest of its work on the visibility structure. It could
even be a clever structure that could compress away large static areas
until they become active again.

Maybe this wouldn't work for all tables, but could be an option for big
tables with low update rates.

> The worst case for visibility resolution is a narrow table with
> random-access DML happening all the time: each update is a short
> transaction and there is a very high rate of such transactions spread
> out uniformly over a very large table. That means any given page has
> over 200 rows with random xids spread over a very large range of xids.

That's not necessarily a bad case, unless the CLOG/CSNLOG lookup is a
significant fraction of the effort to update a tuple.

That would be a bad case if you introduce scans into the equation as
well, but that's not a problem if the all-visible bit is set.

> Currently the invariant that hint bits give us is that each xid needs
> to be looked up in the clog only a more or less fixed number of times;
> in that scenario only once, since the table is very large and the
> transactions short lived.

A backend-local cache might accomplish that, as well (would still need
to do a lookup, but no locks or contention). There would be some
challenges around invalidation (for xid wraparound) and pre-warming the
cache (so establishing a lot of connections doesn't cause a lot of CLOG
access).
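
A minimal sketch of such a backend-local cache (the size, the names, and
CSNLogGetCommitSeqNo/COMMITSEQNO_IS_COMMITTED are assumptions for
illustration, not taken from the patch):

#define CSN_CACHE_SIZE 1024     /* power of two, direct-mapped */

typedef struct CsnCacheEntry
{
    TransactionId xid;
    CommitSeqNo   csn;
} CsnCacheEntry;

static CsnCacheEntry csn_cache[CSN_CACHE_SIZE];

static CommitSeqNo
GetCommitSeqNoCached(TransactionId xid)
{
    CsnCacheEntry *e = &csn_cache[xid & (CSN_CACHE_SIZE - 1)];
    CommitSeqNo csn;

    if (e->xid == xid)
        return e->csn;          /* hit: no lock, no shared access */

    csn = CSNLogGetCommitSeqNo(xid);    /* the contended lookup */
    if (COMMITSEQNO_IS_COMMITTED(csn))
    {
        /* cache only final states; the array must be wiped at xid
         * wraparound, which is the invalidation problem noted above */
        e->xid = xid;
        e->csn = csn;
    }
    return csn;
}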

Regards,
Jeff Davis





Re: Proposal for CSN based snapshots

From
Greg Stark
Date:
On Tue, Aug 26, 2014 at 8:32 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> We are mixing two kinds of data: user data and visibility information.
> Each is changed under different circumstances and has different
> characteristics, and I'm beginning to think they shouldn't be mixed at
> all.
>
> What if we just devised a structure specially designed to hold
> visibility information, put all of the visibility information there, and
> data pages would only change where there is a real, user-initiated
> I/U/D. Vacuum could still clear out dead tuples from the data area, but
> it would do the rest of its work on the visibility structure. It could
> even be a clever structure that could compress away large static areas
> until they become active again.

Well fundamentally the reason the visibility information is with the
user data is so that we don't need to do two i/os to access the data.
The whole point of hint bits is to guarantee that after some amount of
time you can read data directly out of the heap page and use it
without doing any additional I/O.

-- 
greg



Re: Proposal for CSN based snapshots

From
Jeff Davis
Date:
On Tue, 2014-08-26 at 21:00 +0100, Greg Stark wrote:
> Well fundamentally the reason the visibility information is with the
> user data is so that we don't need to do two i/os to access the data.
> The whole point of hint bits is to guarantee that after some amount of
> time you can read data directly out of the heap page and use it
> without doing any additional I/O.

If the data is that static, then the visibility information would be
highly compressible, and surely in shared_buffers already.

(Yes, it would need to be pinned, which has a cost.)

Regards,
Jeff Davis





Re: Proposal for CSN based snapshots

From
Amit Kapila
Date:
On Tue, Aug 26, 2014 at 11:55 PM, Greg Stark <stark@mit.edu> wrote:
> Interesting analysis.
>
> I suppose the equivalent of hint bits would be to actually write the
> CSN of the transaction into the record when the hint bit is set.
>
> I don't immediately see how to make that practical. One thought would
> be to have a list of xids in the page header with their corresponding
> csn -- which starts to sound a lot like Oracle's "Interested
> Transaction List". But I don't see how to make that work for the
> hundreds of possible xids on the page.

Here we could limit the number of such entries (xid-csn pairs) that
can exist in the block header, and in case it starts to overflow, use
some alternate location.  Another idea could be to reserve the space
whenever a new transaction starts to operate on a block, and update
the CSN for it later; in case the maximum number of available slots
gets filled up, we could either use an alternate location or maybe
block the new transaction until such a slot becomes available.  Here I
am assuming that vacuum would clear the slots in blocks as and when
such transactions become visible to everyone, or maybe the transaction
that didn't find any slot could also try to clear such slots.
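
A rough sketch of what such a bounded in-block slot array might look
like (all names and the slot count are purely illustrative):

#define MAX_PAGE_CSN_SLOTS 4

typedef struct PageCsnSlot
{
    TransactionId xid;      /* transaction operating on this block */
    CommitSeqNo   csn;      /* zero while pending, set after commit */
} PageCsnSlot;

typedef struct PageCsnArray
{
    uint8        nused;
    PageCsnSlot  slots[MAX_PAGE_CSN_SLOTS];
    /* on overflow, entries spill to an alternate location */
} PageCsnArray;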


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Proposal for CSN based snapshots

From
Jeff Davis
Date:
On Tue, 2014-08-26 at 13:45 +0300, Heikki Linnakangas wrote:
> Yeah. This patch in the current state is likely much much slower than 
> unpatched master, except in extreme cases where you have thousands of 
> connections and short transactions so that without the patch, you spend 
> most of the time acquiring snapshots.

What else are you looking to accomplish with this patch during this
'fest? Bug finding? Design review? Performance testing?

I haven't had a good track record with my performance testing recently,
so I'm unlikely to be much help there.

It seems a bit early for bug hunting, unless you think it will raise
possible design issues.

I think there's already at least one design issue to consider, which is
whether we care about CLOG/CSNLOG access for hinted records where the
xid > snapshot->xmin (that is, accesses that previously would have
looked at xip). Would more discussion help here or do we need to wait
for performance numbers?

Regards,
Jeff Davis





Re: Proposal for CSN based snapshots

From
Heikki Linnakangas
Date:
On 08/27/2014 08:23 AM, Jeff Davis wrote:
> On Tue, 2014-08-26 at 13:45 +0300, Heikki Linnakangas wrote:
>> Yeah. This patch in the current state is likely much much slower than
>> unpatched master, except in extreme cases where you have thousands of
>> connections and short transactions so that without the patch, you spend
>> most of the time acquiring snapshots.
>
> What else are you looking to accomplish with this patch during this
> 'fest? Bug finding? Design review? Performance testing?

Design review, mostly. I know the performance still sucks. Although if 
you can foresee some performance problems, aside from the extra CSNLOG 
lookups, it would be good to know.

> I think there's already at least one design issue to consider, which is
> whether we care about CLOG/CSNLOG access for hinted records where the
> xid > snapshot->xmin (that is, accesses that previously would have
> looked at xip). Would more discussion help here or do we need to wait
> for performance numbers?

I think my current plan is to try to make that CSNLOG lookup fast. In 
essence, rewrite SLRUs to be more performant. That would help with the 
current clog, too, which would make it more feasible to set hint bits 
less often. In particular, avoid dirtying pages just to set hint bits. 
I'm not sure if that's enough - you can't beat checking a single hint 
bit in the same cache line as the XID - but I think it's worth a try.

- Heikki




Re: Proposal for CSN based snapshots

From
Heikki Linnakangas
Date:
On 08/27/2014 09:40 AM, Heikki Linnakangas wrote:
> On 08/27/2014 08:23 AM, Jeff Davis wrote:
>> On Tue, 2014-08-26 at 13:45 +0300, Heikki Linnakangas wrote:
>>> Yeah. This patch in the current state is likely much much slower than
>>> unpatched master, except in extreme cases where you have thousands of
>>> connections and short transactions so that without the patch, you spend
>>> most of the time acquiring snapshots.
>>
>> What else are you looking to accomplish with this patch during this
>> 'fest? Bug finding? Design review? Performance testing?
>
> Design review, mostly. I know the performance still sucks. Although if
> you can foresee some performance problems, aside from the extra CSNLOG
> lookups, it would be good to know.

I think for this commitfest, I've gotten as much review of this patch 
as I can hope for. Marking as "Returned with Feedback". But of course, 
feel free to continue reviewing and commenting ;-).

- Heikki




Re: Proposal for CSN based snapshots

From
Alexander Korotkov
Date:
On Wed, Aug 27, 2014 at 10:42 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 08/27/2014 09:40 AM, Heikki Linnakangas wrote:
On 08/27/2014 08:23 AM, Jeff Davis wrote:
On Tue, 2014-08-26 at 13:45 +0300, Heikki Linnakangas wrote:
Yeah. This patch in the current state is likely much much slower than
unpatched master, except in extreme cases where you have thousands of
connections and short transactions so that without the patch, you spend
most of the time acquiring snapshots.

What else are you looking to accomplish with this patch during this
'fest? Bug finding? Design review? Performance testing?

Design review, mostly. I know the performance still sucks. Although if
you can foresee some performance problems, aside from the extra CSNLOG
lookups, it would be good to know.

I think for this commitfest, I've gotten as much review of this patch as I can hope for. Marking as "Returned with Feedback". But of course, feel free to continue reviewing and commenting ;-).

What is the current state of this patch? Does the community want CSN based snapshots?
The last email in the lists was in August 2014. Huawei talked about their further research on this idea at PGCon and promised to publish their patch as open source.
However, it isn't published yet, and we don't know how long we might have to wait for it.
Now, our company has resources to work on CSN based snapshots for 9.6. If Huawei doesn't publish their patch, we can pick up this work from the last community version of this patch.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: Proposal for CSN based snapshots

From
Simon Riggs
Date:
On 24 July 2015 at 14:43, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
 
What is the current state of this patch? Does the community want CSN based snapshots?

CSN snapshots could give us

1. More scalable GetSnapshotData() calls
2. Faster GetSnapshotData() calls
3. Smaller snapshots which could be passed around more usefully between distributed servers

It depends on the exact design we use to get that. Certainly we do not want them if they cause a significant performance regression.
 
Now, our company has resources to work on CSN based snapshots for 9.6. If Huawei doesn't publish their patch, we can pick up this work from the last community version of this patch.

Sounds good. 

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Proposal for CSN based snapshots

From
Robert Haas
Date:
On Fri, Jul 24, 2015 at 1:00 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> It depends on the exact design we use to get that. Certainly we do not want
> them if they cause a significant performance regression.

Yeah.  I think the performance worries expressed so far are:

- Currently, if you see an XID that is between the XMIN and XMAX of
the snapshot, you hit CLOG only on first access.  After that, the
tuple is hinted.  With this approach, the hint bit doesn't avoid
needing to hit CLOG anymore, because it's not enough to know whether
or not the tuple committed; you have to know the CSN at which it
committed, which means you have to look that up in CLOG (or whatever
SLRU stores this data).  Heikki mentioned adding some caching to
ameliorate this problem, but it sounds like he was worried that the
impact might still be significant.

- Mixing synchronous_commit=on and synchronous_commit=off won't work
as well, because if the LSN ordering of commit records matches the
order in which transactions become visible, then an async-commit
transaction can't become visible before a later sync-commit
transaction.  I expect we might just decide we can live with this, but
it's worth discussing in case people feel otherwise.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Proposal for CSN based snapshots

From
Andres Freund
Date:
On 2015-07-24 14:21:39 -0400, Robert Haas wrote:
> - Mixing synchronous_commit=on and synchronous_commit=off won't work
> as well, because if the LSN ordering of commit records matches the
> order in which transactions become visible, then an async-commit
> transaction can't become visible before a later sync-commit
> transaction.  I expect we might just decide we can live with this, but
> it's worth discussing in case people feel otherwise.

I'm not following this anymore. Even though a couple months back I
apparently did. Heikki's description of the problem was:

On 2014-05-30 17:59:23 +0300, Heikki Linnakangas wrote:
> One thorny issue came up in discussions with other hackers on this in PGCon:
> 
> When a transaction is committed asynchronously, it becomes visible to other
> backends before the commit WAL record is flushed. With CSN-based snapshots,
> the order that transactions become visible is always based on the LSNs of
> the WAL records. This is a problem when there is a mix of synchronous and
> asynchronous commits:
> 
> If transaction A commits synchronously with commit LSN 1, and transaction B
> commits asynchronously with commit LSN 2, B cannot become visible before A.
> And we cannot acknowledge B as committed to the client until it's visible to
> other transactions. That means that B will have to wait for A's commit
> record to be flushed to disk, before it can return, even though it was an
> asynchronous commit.


AFAICS the xlog insertion order isn't changed relevantly by the patch?

Right now the order for sync commit = on is:

XactLogCommitRecord();
XLogFlush();
TransactionIdCommitTree();
ProcArrayEndTransaction();

and for sync commit = off it's:
XactLogCommitRecord();
TransactionIdAsyncCommitTree();
ProcArrayEndTransaction();

with the CSN patch it's:

sync on:
TransactionIdSetCommitting();
XactLogCommitRecord();
XLogFlush(XactLastRecEnd);
TransactionIdCommitTree();

sync off:
TransactionIdSetCommitting();
XactLogCommitRecord();
TransactionIdCommitTree();

How does this cause the above described problem? Yes, mixing sync/async
commits can give somewhat weird visibility semantics, but that's not
really new, is it?

Regards,

Andres



Re: Proposal for CSN based snapshots

From
Simon Riggs
Date:
On 24 July 2015 at 19:21, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jul 24, 2015 at 1:00 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> It depends on the exact design we use to get that. Certainly we do not want
> them if they cause a significant performance regression.

Yeah.  I think the performance worries expressed so far are:

- Currently, if you see an XID that is between the XMIN and XMAX of
the snapshot, you hit CLOG only on first access.  After that, the
tuple is hinted.  With this approach, the hint bit doesn't avoid
needing to hit CLOG anymore, because it's not enough to know whether
or not the tuple committed; you have to know the CSN at which it
committed, which means you have to look that up in CLOG (or whatever
SLRU stores this data).  Heikki mentioned adding some caching to
ameliorate this problem, but it sounds like he was worried that the
impact might still be significant.

This seems like the heart of the problem. Changing a snapshot from a list of xids into one number is easy. Making XidInMVCCSnapshot() work is the hard part, because there needs to be a translation/lookup from the CSN to determine whether the snapshot contains the xid.

That turns CSN into a reference to a cached snapshot, or a reference by which a snapshot can be derived on demand.
 
- Mixing synchronous_commit=on and synchronous_commit=off won't work
as well, because if the LSN ordering of commit records matches the
order in which transactions become visible, then an async-commit
transaction can't become visible before a later sync-commit
transaction.  I expect we might just decide we can live with this, but
it's worth discussing in case people feel otherwise.

Using the Commit LSN as CSN seems interesting, but it is not the only choice.

Commit LSN is not the precise order in which commits become visible because of the race condition between marking commit in WAL and marking commit in clog. That problem is accentuated by mixing async and sync, but that is not the only source of racing. 

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Proposal for CSN based snapshots

From
Alexander Korotkov
Date:
On Sat, Jul 25, 2015 at 11:39 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 24 July 2015 at 19:21, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jul 24, 2015 at 1:00 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> It depends on the exact design we use to get that. Certainly we do not want
> them if they cause a significant performance regression.

Yeah.  I think the performance worries expressed so far are:

- Currently, if you see an XID that is between the XMIN and XMAX of
the snapshot, you hit CLOG only on first access.  After that, the
tuple is hinted.  With this approach, the hint bit doesn't avoid
needing to hit CLOG anymore, because it's not enough to know whether
or not the tuple committed; you have to know the CSN at which it
committed, which means you have to look that up in CLOG (or whatever
SLRU stores this data).  Heikki mentioned adding some caching to
ameliorate this problem, but it sounds like he was worried that the
impact might still be significant.

This seems like the heart of the problem. Changing a snapshot from a list of xids into one number is easy. Making XidInMVCCSnapshot() work is the hard part, because there needs to be a translation/lookup from the CSN to determine whether the snapshot contains the xid.

That turns CSN into a reference to a cached snapshot, or a reference by which a snapshot can be derived on demand.

I got the problem. Currently, once we set hint bits, we don't have to visit CLOG anymore. With CSN snapshots that is not so: we still have to translate the XID into a CSN in order to compare it with the snapshot CSN. In the version of the CSN patch in this thread we still have XMIN and XMAX in the snapshot. AFAICS with CSN snapshots XMIN and XMAX are not necessarily required to express the snapshot; they were kept as an optimization that restricts usage of the XID => CSN map to the given range of XIDs. However, with long running transactions the [XMIN; XMAX] range could be very wide, and we could end up using the XID => CSN map heavily over a wide range of XIDs.

As I can see from the Huawei PGCon talk, a "Dense Map" in shared memory is proposed for the XID => CSN transformation. With a large enough "Dense Map" we can do most XID => CSN transformations with a single shared memory access. The PGCon talk gives pgbench results; however, pgbench doesn't run any long transactions. With long running transactions we could run out of the "Dense Map" for a significant part of the XID => CSN transformations. Dilip, did you test your patch with long transactions?

I'm also thinking about a different option for optimizing this. When we set hint bits we could also replace XMIN/XMAX with the CSN. In that case we wouldn't need to do the XID => CSN transformation for this tuple anymore. However, we have 32-bit XIDs for now. We could also have 32-bit CSNs, but that would double our troubles with wraparound: we would have 2 counters that could wrap around. That could return us to thoughts about 64-bit XIDs.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: Proposal for CSN based snapshots

From
Heikki Linnakangas
Date:
(Reviving an old thread)

I spent some time dusting off this old patch, to implement CSN
snapshots. Attached is a new patch, rebased over current master, and
with tons of comments etc. cleaned up. There's more work to be done
here, I'm posting this to let people know I'm working on this again. And
to have a backup on the 'net :-).

I switched to using a separate counter for CSNs. CSN is no longer the
same as the commit WAL record's LSN. While I liked the conceptual
simplicity of CSN == LSN a lot, and the fact that the standby would see
the same commit order as master, I couldn't figure out how to make async
commits to work.

Next steps:

* Hot standby feedback is broken, now that CSN != LSN again. Will have
to switch this back to using an "oldest XID", rather than a CSN.

* I plan to replace pg_subtrans with a special range of CSNs in the
csnlog. Something like, start the CSN counter at 2^32 + 1, and use CSNs
< 2^32 to mean "this is a subtransaction, parent is XXX". One less SLRU
to maintain. (See the sketch after this list.)

* Put per-proc xmin back into procarray. I removed it, because it's not
necessary for snapshots or GetOldestSnapshot() (which replaces
GetOldestXmin()) anymore. But on second thoughts, we still need it for
deciding when it's safe to truncate the csnlog.

* In this patch, HeapTupleSatisfiesVacuum() is rewritten to use an
"oldest CSN", instead of "oldest xmin", but that's not strictly
necessary. To limit the size of the patch, I might revert those changes
for now.

* Rewrite the way RecentGlobalXmin is updated. As Alvaro pointed out in
his review comments two years ago, that was quite complicated. And I'm
worried that the lazy scheme I had might not allow pruning fast enough.
I plan to make it more aggressive, so that whenever the currently oldest
transaction finishes, it's responsible for advancing the "global xmin"
in shared memory. And the way it does that, is by scanning the csnlog,
starting from the current "global xmin", until the next still
in-progress XID. That could be a lot, if you have a very long-running
transaction that ends, but we'll see how it performs.

* Performance testing. Clearly this should have a performance benefit,
at least under some workloads, to be worthwhile. And not regress.
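
As a sketch of the csnlog encoding mentioned in the pg_subtrans item
above (names invented for illustration):

/* csnlog values below 2^32 encode a subtransaction's parent xid;
 * real CSNs start at 2^32 + 1 */
#define FIRST_NORMAL_CSN    (UINT64CONST(1) << 32)

static inline bool
CsnIsSubtransEntry(uint64 csnlogval)
{
    return csnlogval < FIRST_NORMAL_CSN;
}

static inline TransactionId
CsnGetParentXid(uint64 csnlogval)
{
    Assert(csnlogval < FIRST_NORMAL_CSN);
    return (TransactionId) csnlogval;
}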

- Heikki


Attachment

Re: Proposal for CSN based snapshots

From
Ants Aasma
Date:
On Tue, Aug 9, 2016 at 3:16 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> (Reviving an old thread)
>
> I spent some time dusting off this old patch, to implement CSN snapshots.
> Attached is a new patch, rebased over current master, and with tons of
> comments etc. cleaned up. There's more work to be done here, I'm posting
> this to let people know I'm working on this again. And to have a backup on
> the 'net :-).

Great to hear. I hope to find some time to review this. In the
meanwhile here is a brain dump of my thoughts on the subject.

> I switched to using a separate counter for CSNs. CSN is no longer the same
> as the commit WAL record's LSN. While I liked the conceptual simplicity of
> CSN == LSN a lot, and the fact that the standby would see the same commit
> order as master, I couldn't figure out how to make async commits to work.

I had an idea how that would work. In short, lastCommitSeqNo would be a
vector clock. Each tier of synchronization requiring a different amount
of waiting between xlog insert and visibility would get a position in
the vector. Commits then need to be ordered only with respect to other
commits within the same tier. The vector position also needs to be
stored in the CSN log, probably by appropriating a couple of bits from
the CSN value.
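
For illustration, a minimal sketch of such a vector clock (the tier
count and all names are assumptions):

#define CSN_TIERS 2     /* e.g. 0 = synchronous, 1 = asynchronous */

typedef struct CsnVector
{
    uint64 component[CSN_TIERS];    /* per-tier commit counters */
} CsnVector;

/* a commit in a given tier is visible iff its counter value is
 * covered by the snapshot's component for that same tier */
static bool
CommitVisibleInTier(const CsnVector *snap, int tier, uint64 commit_csn)
{
    return commit_csn <= snap->component[tier];
}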

I'm not sure if doing it this way would be worth the complexity, but
as far as I can see it would work. However I think keeping CSN and LSN
separate is a good idea, as it keeps the door open to using an
external source for the CSN value, enabling stuff like cheap multi
master globally consistent snapshots.

> Next steps:
>
> * Hot standby feedback is broken, now that CSN != LSN again. Will have to
> switch this back to using an "oldest XID", rather than a CSN.
>
> * I plan to replace pg_subtrans with a special range of CSNs in the csnlog.
> Something like, start the CSN counter at 2^32 + 1, and use CSNs < 2^32 to
> mean "this is a subtransaction, parent is XXX". One less SLRU to maintain.

That's a great idea. I had a similar idea to enable lock-free writing
into data structures: have 2^32 COMMITSEQNO_INPROGRESS values that
embed the xid owning the slot. This way a writer can just blindly CAS
into data structures without worrying about overwriting useful stuff
due to race conditions.
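
A minimal sketch of that CAS pattern (names assumed;
pg_atomic_compare_exchange_u64() is PostgreSQL's real primitive):

/* in-progress markers below 2^32 embed the owning xid */
#define CSN_INPROGRESS(xid)     ((uint64) (xid))

static void
SetCommitCsn(pg_atomic_uint64 *slot, TransactionId xid, uint64 csn)
{
    uint64  expected = CSN_INPROGRESS(xid);

    /* succeeds only while the slot still carries our own in-progress
     * marker, so a slot recycled by another xid can't be clobbered */
    pg_atomic_compare_exchange_u64(slot, &expected, csn);
}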

> * Put per-proc xmin back into procarray. I removed it, because it's not
> necessary for snapshots or GetOldestSnapshot() (which replaces
> GetOldestXmin()) anymore. But on second thoughts, we still need it for
> deciding when it's safe to truncate the csnlog.
>
> * In this patch, HeapTupleSatisfiesVacuum() is rewritten to use an "oldest
> CSN", instead of "oldest xmin", but that's not strictly necessary. To limit
> the size of the patch, I might revert those changes for now.
>
> * Rewrite the way RecentGlobalXmin is updated. As Alvaro pointed out in his
> review comments two years ago, that was quite complicated. And I'm worried
> that the lazy scheme I had might not allow pruning fast enough. I plan to
> make it more aggressive, so that whenever the currently oldest transaction
> finishes, it's responsible for advancing the "global xmin" in shared memory.
> And the way it does that, is by scanning the csnlog, starting from the
> current "global xmin", until the next still in-progress XID. That could be a
> lot, if you have a very long-running transaction that ends, but we'll see
> how it performs.

I had a similar idea. This would be greatly improved by a hybrid
snapshot scheme that could make the tail end of the CSN log a lot
sparser. More on that below.

> * Performance testing. Clearly this should have a performance benefit, at
> least under some workloads, to be worthwhile. And not regress.

I guess this is the main question. GetSnapshotData is probably going
to be faster, but the second important benefit that I saw was the
possibility of reducing locking around common activities. This of
course needs to guided by profiling - evils of premature optimization
and all that. But that said, my gut tells me that the most important
spots are:

* Looking up CSNs for XidVisibleInSnapshot. Current version gets a
shared lock on CSNLogControlLock. That might get nasty, for example
when two concurrent sequential scans running on different cores or
sockets hit a patch of transactions within their xmin-xmax interval.

* GetSnapshotData. With tps on read only queries potentially pushing
into the millions, it would be awfully nice if contending on a single
cache line could be avoided. Currently it acquires ProcArrayLock and
CommitSeqNoLock to atomically update MyPgXact->snapshotcsn. The
pattern I had in mind to make this lock free would be to read a CSN,
publish value, check against latest OldestGlobalSnapshot value and
loop if necessary, and on the other side, calculate optimistic oldest
snapshot, publish and then recheck, and if necessary, republish (see
the sketch after this list).

* While commits are not as performance critical it may still be
beneficial to defer clog updates and perform them in larger batches to
reduce clog lock traffic.
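
A sketch of that reader-side publish-and-recheck loop for
GetSnapshotData (names invented; assuming the published snapshotcsn
lives in an atomic):

CommitSeqNo csn;

do
{
    /* read the latest CSN and publish it as our snapshot */
    csn = pg_atomic_read_u64(&csnState->lastCommitSeqNo);
    pg_atomic_write_u64(&MyPgXact->snapshotcsn, csn);
    pg_memory_barrier();

    /* retry if the oldest-snapshot horizon already moved past us */
} while (csn < pg_atomic_read_u64(&csnState->oldestGlobalSnapshot));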

I think I have achieved some clarity on my idea of snapshots that
migrate between being CSN and XID based. The essential problem with
old CSN based snapshots is that the CSN log for their xmin-xmax
interval needs to be kept around in a quick to access datastructure.
And the problem with long running transactions is twofold, first is
that they need a well defined location where to publish the visibility
info for their xids and secondly, they enlarge the xmin-xmax interval
of all concurrent snapshots, needing a potentially huge amount of CSN
log to be kept around. One way to deal with it is to just accept it
and use a SLRU as in this patch.

My new insight is that a snapshot doesn't need to be either-or CSN or
XIP (xid in progress) array based, but it can also be a hybrid. There
would be a sorted array of in progress xids and a non-overlapping CSN
xmin-xmax interval where CSN log needs to be consulted. As snapshots
age they scan the CSN log and incrementally build their XIP array,
basically lazily constructing same data structure used in snapshots
today. Old in progress xids need a slot for themselves to be kept
around, but all new snapshots being taken would immediately classify
them as old, store them in XIP and not include them in their CSN xmin.
Under this scheme the amount of CSN log strictly needed is a
reasonably sized ring buffer for recent xids (probably sized based on
max conn), a sparse map for long transactions (bounded by max conn)
and some slack for snapshots still being converted. A SLRU or similar
is still needed because there is no way to ensure timely conversion of
all snapshots, but by properly timing the notification the probability
of actually using it should be extremely low. Datastructure
maintenance operations occur naturally on xid assignment and are
easily batched. Long running transactions still affect all snapshots,
but the effect is small and not dependent on transaction age, old
snapshots only affect themselves. Subtransactions are still a pain,
and would need something similar to the current suboverflowed scheme.
On the plus side, it seems like the number of subxids to overflow
could be global, not per backend.
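
To make the shape of such a hybrid snapshot concrete, a minimal sketch
(all field names are invented, not from any posted patch):

typedef struct HybridSnapshot
{
    CommitSeqNo    snapshotcsn; /* commits with CSN <= this are visible */
    TransactionId  csn_xmin;    /* xids below this are resolved via xip[] */
    TransactionId  xmax;        /* first as-yet-unassigned xid */
    TransactionId *xip;         /* sorted in-progress xids below csn_xmin */
    int            xipcnt;
} HybridSnapshot;

As the snapshot ages, entries scanned from the CSN log move into xip[]
and csn_xmin advances, until the CSN interval is empty and the snapshot
has lazily become a plain XID-list snapshot.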

In short:

* A directly mapped ring buffer for recent xids could be accessed lock
free, and would take care of most of the traffic in most of the cases.
Looks like it would be a good trade-off for complexity/performance.

* To keep workloads with wildly varying transaction lengths in bounded
amount of memory, a significantly more complex hybrid snapshot scheme
is needed. It remains to be seen if these workloads are actually a
significant issue. But if they are, the approach described might
provide a way out.

Regards,
Ants Aasma



Re: Proposal for CSN based snapshots

From
Heikki Linnakangas
Date:
On 08/10/2016 01:03 PM, Ants Aasma wrote:
> On Tue, Aug 9, 2016 at 3:16 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> (Reviving an old thread)
>>
>> I spent some time dusting off this old patch, to implement CSN snapshots.
>> Attached is a new patch, rebased over current master, and with tons of
>> comments etc. cleaned up. There's more work to be done here, I'm posting
>> this to let people know I'm working on this again. And to have a backup on
>> the 'net :-).
>
> Great to hear. I hope to find some time to review this. In the
> meanwhile here is a brain dump of my thoughts on the subject.

Thanks!

>> * Performance testing. Clearly this should have a performance benefit, at
>> least under some workloads, to be worthwhile. And not regress.
>
> I guess this is the main question. GetSnapshotData is probably going
> to be faster, but the second important benefit that I saw was the
> possibility of reducing locking around common activities. This of
> course needs to guided by profiling - evils of premature optimization
> and all that.

Yep.

For the record, one big reason I'm excited about this is that a snapshot 
can be represented as small fixed-size value. And the reason that's 
exciting is that it makes it more feasible for each backend to publish 
their snapshots, which in turn makes it possible to vacuum old tuples 
more aggressively. In particular, if you have one very long-running 
transaction (think pg_dump), and a stream of transactions, you could 
vacuum away tuples that were inserted after the long-running transaction 
started and deleted before any of the shorter transactions started. You 
could do that without CSN snapshots, but it's a lot simpler with them. 
But that's a separate story.

> But that said, my gut tells me that the most important spots are:
>
> * Looking up CSNs for XidVisibleInSnapshot. Current version gets a
> shared lock on CSNLogControlLock. That might get nasty, for example
> when two concurrent sequential scans running on different cores or
> sockets hit a patch of transactions within their xmin-xmax interval.

I'm not too worried about that particular case. Shared LWLocks scale 
pretty well these days, thanks to Andres Freund's work in this area. A 
mix of updates and reads on the csnlog might become a scalability 
problem though. We'll find out once we start testing.

> * GetSnapshotData. With tps on read only queries potentially pushing
> into the millions, it would be awfully nice if contending on a single
> cache line could be avoided. Currently it acquires ProcArrayLock and
> CommitSeqNoLock to atomically update MyPgXact->snapshotcsn. The
> pattern I had in mind to make this lock free would be to read a CSN,
> publish value, check against latest OldestGlobalSnapshot value and
> loop if necessary, and on the other side, calculate optimistic oldest
> snapshot, publish and then recheck, if necessary, republish.

Yeah, I haven't spent any serious effort into optimizing this yet. For 
the final version, should definitely do something like that.

> * While commits are not as performance critical it may still be
> beneficial to defer clog updates and perform them in larger batches to
> reduce clog lock traffic.

Hmm. I doubt batching them is going to help much. But it should be 
straightforward to use atomic operations here too, to reduce locking.

> I think I have achieved some clarity on my idea of snapshots that
> migrate between being CSN and XID based. The essential problem with
> old CSN based snapshots is that the CSN log for their xmin-xmax
> interval needs to be kept around in a quick to access datastructure.
> And the problem with long running transactions is twofold, first is
> that they need a well defined location where to publish the visibility
> info for their xids and secondly, they enlarge the xmin-xmax interval
> of all concurrent snapshots, needing a potentially huge amount of CSN
> log to be kept around. One way to deal with it is to just accept it
> and use a SLRU as in this patch.
>
> My new insight is that a snapshot doesn't need to be either-or CSN or
> XIP (xid in progress) array based, but it can also be a hybrid. There
> would be a sorted array of in progress xids and a non-overlapping CSN
> xmin-xmax interval where CSN log needs to be consulted. As snapshots
> age they scan the CSN log and incrementally build their XIP array,
> basically lazily constructing same data structure used in snapshots
> today. Old in progress xids need a slot for themselves to be kept
> around, but all new snapshots being taken would immediately classify
> them as old, store them in XIP and not include them in their CSN xmin.
> Under this scheme the amount of CSN log strictly needed is a
> reasonably sized ring buffer for recent xids (probably sized based on
> max conn), a sparse map for long transactions (bounded by max conn)
> and some slack for snapshots still being converted. A SLRU or similar
> is still needed because there is no way to ensure timely conversion of
> all snapshots, but by properly timing the notification the probability
> of actually using it should be extremely low. Datastructure
> maintenance operations occur naturally on xid assignment and are
> easily batched. Long running transactions still affect all snapshots,
> but the effect is small and not dependent on transaction age, old
> snapshots only affect themselves. Subtransactions are still a pain,
> and would need something similar to the current suboverflowed scheme.
> On the plus side, it seems like the number of subxids to overflow
> could be global, not per backend.

Yeah, if the csnlog access turns out to be too expensive, we could do 
something like this. In theory, you can always convert a CSN snapshot 
into an old-style list of XIDs, by scanning the csnlog between the xmin 
and xmax. That could be expensive if the distance between xmin and xmax 
is large, of course. But as you said, you can have various hybrid forms, 
where you use a list of XIDs of some range as a cache, for example.

I'm hopeful that we can simply make the csnlog access fast enough, 
though. Looking up an XID in a sorted array is O(log n), while looking 
up an XID in the csnlog is O(1). That ignores all locking and different 
constant factors, of course, but it's not a given that accessing the 
csnlog has to be slower than a binary search of an XID array.
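
For reference, the O(1) csnlog lookup is plain array indexing within an
SLRU page, roughly (constants illustrative):

/* each csnlog page holds a fixed number of 8-byte CSNs */
#define CSNS_PER_PAGE   (BLCKSZ / sizeof(CommitSeqNo))

int     pageno = xid / CSNS_PER_PAGE;   /* which csnlog page to read */
int     slotno = xid % CSNS_PER_PAGE;   /* offset of the CSN in the page */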

- Heikki



Re: Proposal for CSN based snapshots

From
Alexander Korotkov
Date:
On Tue, Aug 9, 2016 at 3:16 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
(Reviving an old thread)

I spent some time dusting off this old patch, to implement CSN snapshots. Attached is a new patch, rebased over current master, and with tons of comments etc. cleaned up. There's more work to be done here, I'm posting this to let people know I'm working on this again. And to have a backup on the 'net :-).

Great! It's very nice seeing that you're working on this patch.
After PGCon I had rebased your patch onto master, but you already did much more.
 
I switched to using a separate counter for CSNs. CSN is no longer the same as the commit WAL record's LSN. While I liked the conceptual simplicity of CSN == LSN a lot, and the fact that the standby would see the same commit order as master, I couldn't figure out how to make async commits to work.

I didn't get the async commits problem at first glance.  AFAICS, the difference between sync commit and async is only that an async commit doesn't wait for the WAL to flush.  But an async commit still receives an LSN.
Could you describe it in more detail?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
 

Re: Proposal for CSN based snapshots

From
Alexander Korotkov
Date:
On Wed, Aug 10, 2016 at 2:10 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Yeah, if the csnlog access turns out to be too expensive, we could do something like this. In theory, you can always convert a CSN snapshot into an old-style list of XIDs, by scanning the csnlog between the xmin and xmax. That could be expensive if the distance between xmin and xmax is large, of course. But as you said, you can have various hybrid forms, where you use a list of XIDs of some range as a cache, for example.

I'm hopeful that we can simply make the csnlog access fast enough, though. Looking up an XID in a sorted array is O(log n), while looking up an XID in the csnlog is O(1). That ignores all locking and different constant factors, of course, but it's not a given that accessing the csnlog has to be slower than a binary search of an XID array.

FYI, I'm still a fan of the idea of rewriting the XID with the CSN in the tuple, the same way we're writing hint bits now.
I'm going to make a prototype of this approach, which should be enough for performance measurements.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: Proposal for CSN based snapshots

From
Heikki Linnakangas
Date:
On 08/10/2016 04:34 PM, Alexander Korotkov wrote:
> On Tue, Aug 9, 2016 at 3:16 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
>> I switched to using a separate counter for CSNs. CSN is no longer the same
>> as the commit WAL record's LSN. While I liked the conceptual simplicity of
>> CSN == LSN a lot, and the fact that the standby would see the same commit
>> order as master, I couldn't figure out how to make async commits to work.
>
> I didn't get the async commits problem at first glance.  AFAICS, the difference
> between sync commit and async is only that an async commit doesn't wait for
> the WAL to flush.  But an async commit still receives an LSN.
> Could you describe it in more detail?

Imagine that you have a stream of normal, synchronous, commits. They get 
assigned LSNs: 1, 2, 3, 4. They become visible to other transactions in 
that order.

The way I described this scheme in the first emails on this thread, was 
to use the current WAL insertion position as the snapshot. That's not 
correct, though: you have to use the current WAL *flush* position as the 
snapshot. Otherwise you would see the results of a transaction that 
hasn't been flushed to disk yet, i.e. which might still get lost, if you 
pull the power plug before the flush happens. So you have to use the 
last flush position as the snapshot.

Now, if you do an asynchronous commit, the effects of that should become 
visible immediately, without waiting for the next flush. You can't do 
that, if its CSN == LSN.

Perhaps you could make an exception for async commits, so that the CSN 
of an async commit is not the commit record's LSN, but the current WAL 
flush position, so that it becomes visible to others immediately. But 
then, how do you distinguish two async transactions that commit roughly 
at the same time, so that the flush position at time of commit is the 
same for both? And now you've given up on the CSN == LSN property, for 
async commits, anyway.

- Heikki




Re: Proposal for CSN based snapshots

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> Imagine that you have a stream of normal, synchronous, commits. They get 
> assigned LSNs: 1, 2, 3, 4. They become visible to other transactions in 
> that order.

> The way I described this scheme in the first emails on this thread, was 
> to use the current WAL insertion position as the snapshot. That's not 
> correct, though: you have to use the current WAL *flush* position as the 
> snapshot. Otherwise you would see the results of a transaction that 
> hasn't been flushed to disk yet, i.e. which might still get lost, if you 
> pull the power plug before the flush happens. So you have to use the 
> last flush position as the snapshot.

Uh, what?  That's not the semantics we have today, and I don't see why
it's necessary or a good idea.  Once the commit is in the WAL stream,
any action taken on the basis of seeing the commit must be later in
the WAL stream.  So what's the problem?

> Now, if you do an asynchronous commit, the effects of that should become 
> visible immediately, without waiting for the next flush. You can't do 
> that, if its CSN == LSN.

This distinction is completely arbitrary, and unlike the way it works now.
        regards, tom lane



Re: Proposal for CSN based snapshots

From
Heikki Linnakangas
Date:
On 08/10/2016 05:09 PM, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> Imagine that you have a stream of normal, synchronous, commits. They get
>> assigned LSNs: 1, 2, 3, 4. They become visible to other transactions in
>> that order.
>
>> The way I described this scheme in the first emails on this thread, was
>> to use the current WAL insertion position as the snapshot. That's not
>> correct, though: you have to use the current WAL *flush* position as the
>> snapshot. Otherwise you would see the results of a transaction that
>> hasn't been flushed to disk yet, i.e. which might still get lost, if you
>> pull the power plug before the flush happens. So you have to use the
>> last flush position as the snapshot.
>
> Uh, what?  That's not the semantics we have today, and I don't see why
> it's necessary or a good idea.  Once the commit is in the WAL stream,
> any action taken on the basis of seeing the commit must be later in
> the WAL stream.  So what's the problem?

I was talking about synchronous commits in the above. A synchronous 
commit is not made visible to other transactions, until the commit WAL 
record is flushed to disk.

You could argue that that doesn't need to be so, because indeed any 
action taken on the basis of seeing the commit must be later in the WAL 
stream. But that's what asynchronous commits are for. For synchronous 
commits, we have a higher standard.

- Heikki




Re: Proposal for CSN based snapshots

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 08/10/2016 05:09 PM, Tom Lane wrote:
>> Uh, what?  That's not the semantics we have today, and I don't see why
>> it's necessary or a good idea.  Once the commit is in the WAL stream,
>> any action taken on the basis of seeing the commit must be later in
>> the WAL stream.  So what's the problem?

> I was talking about synchronous commits in the above. A synchronous 
> commit is not made visible to other transactions, until the commit WAL 
> record is flushed to disk.

[ thinks for a bit... ]  Oh, OK, that's because we don't treat a
transaction as committed until its clog bit is set *and* it's not
marked as running in the PGPROC array.  And sync transactions will
flush WAL in between.

Still, having to invent CSNs seems like a huge loss for this design.
Personally I'd give up async commit first.  If we had only sync commit,
the rule could be "xact LSN less than snapshot threshold and less than
WAL flush position", and we'd not need CSNs.  I know some people like
async commit, but it's there because it was easy and cheap in our old
design, not because it's the world's greatest feature and worth giving
up performance for.
        regards, tom lane



Re: Proposal for CSN based snapshots

From
Heikki Linnakangas
Date:
On 08/10/2016 05:51 PM, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> On 08/10/2016 05:09 PM, Tom Lane wrote:
>>> Uh, what?  That's not the semantics we have today, and I don't see why
>>> it's necessary or a good idea.  Once the commit is in the WAL stream,
>>> any action taken on the basis of seeing the commit must be later in
>>> the WAL stream.  So what's the problem?
>
>> I was talking about synchronous commits in the above. A synchronous
>> commit is not made visible to other transactions, until the commit WAL
>> record is flushed to disk.
>
> [ thinks for a bit... ]  Oh, OK, that's because we don't treat a
> transaction as committed until its clog bit is set *and* it's not
> marked as running in the PGPROC array.  And sync transactions will
> flush WAL in between.

Right.

> Still, having to invent CSNs seems like a huge loss for this design.
> Personally I'd give up async commit first.  If we had only sync commit,
> the rule could be "xact LSN less than snapshot threshold and less than
> WAL flush position", and we'd not need CSNs.  I know some people like
> async commit, but it's there because it was easy and cheap in our old
> design, not because it's the world's greatest feature and worth giving
> up performance for.

I don't think that's a very popular opinion (I disagree, for one). 
Asynchronous commits are a huge performance boost for some applications. 
The alternative is fsync=off, and I don't want to see more people doing 
that. SSDs have made the penalty of an fsync much smaller, but it's 
still there.

Hmm. There's one more possible way this could all work. Let's have CSN 
== LSN, also for asynchronous commits. A snapshot is the current insert 
position, but also make note of the current flush position, when you 
take a snapshot. Now, when you use the snapshot, if you ever see an XID 
that committed between the snapshot's insert position and the flush 
position, wait for the WAL to be flushed up to the snapshot's insert 
position at that point. With that scheme, an asynchronous commit could 
return to the application without waiting for a flush, but if someone 
actually looks at the changes the transaction made, then that 
transaction would have to wait. Furthermore, we could probably skip that 
waiting too, if the reading transaction is also using 
synchronous_commit=off.

That's slightly different from the current behaviour. A transaction that 
runs with synchronous_commit=on, and reads data that was modified by an 
asynchronous transaction, would take a hit. But I think that would be 
acceptable.
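
To illustrate, the visibility check could look roughly like this (a
sketch only: the struct and function are invented names; XLogFlush() is
the real WAL-flush call):

/* Sketch, assuming CSN == LSN. */
typedef struct CSNSnapshotData
{
    XLogRecPtr  insertpos;      /* WAL insert position at snapshot time */
    XLogRecPtr  flushpos;       /* WAL flush position at snapshot time */
} CSNSnapshotData;

static bool
CSNIsVisible(CSNSnapshotData *snap, XLogRecPtr xidcsn)
{
    if (xidcsn > snap->insertpos)
        return false;           /* committed after the snapshot was taken */

    /*
     * Committed before the snapshot, but its commit record may not be
     * durable yet: wait for WAL to be flushed up to the snapshot's
     * insert position before treating it as visible.
     */
    if (xidcsn > snap->flushpos)
    {
        XLogFlush(snap->insertpos);
        snap->flushpos = snap->insertpos;       /* no need to wait again */
    }

    return true;
}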

- Heikki




Re: Proposal for CSN based snapshots

From
Robert Haas
Date:
On Wed, Aug 10, 2016 at 11:09 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Still, having to invent CSNs seems like a huge loss for this design.
>> Personally I'd give up async commit first.  If we had only sync commit,
>> the rule could be "xact LSN less than snapshot threshold and less than
>> WAL flush position", and we'd not need CSNs.  I know some people like
>> async commit, but it's there because it was easy and cheap in our old
>> design, not because it's the world's greatest feature and worth giving
>> up performance for.
>
> I don't think that's a very popular opinion (I disagree, for one).
> Asynchronous commits are a huge performance boost for some applications. The
> alternative is fsync=off, and I don't want to see more people doing that.
> SSDs have made the penalty of an fsync much smaller, but it's still there.

Uh, yeah.  Asynchronous commit can be 100 times faster on some
realistic workloads.  If we remove it, many people will have to decide
between running with fsync=off and abandoning PostgreSQL altogether.
That doesn't strike me as a remotely possible line of attack.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Proposal for CSN based snapshots

From
"Joshua D. Drake"
Date:
On 08/10/2016 08:28 AM, Robert Haas wrote:
> On Wed, Aug 10, 2016 at 11:09 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> Still, having to invent CSNs seems like a huge loss for this design.
>>> Personally I'd give up async commit first.  If we had only sync commit,
>>> the rule could be "xact LSN less than snapshot threshold and less than
>>> WAL flush position", and we'd not need CSNs.  I know some people like
>>> async commit, but it's there because it was easy and cheap in our old
>>> design, not because it's the world's greatest feature and worth giving
>>> up performance for.
>>
>> I don't think that's a very popular opinion (I disagree, for one).
>> Asynchronous commits are a huge performance boost for some applications. The
>> alternative is fsync=off, and I don't want to see more people doing that.
>> SSDs have made the penalty of an fsync much smaller, but it's still there.
>
> Uh, yeah.  Asynchronous commit can be 100 times faster on some
> realistic workloads.  If we remove it, many people will have to decide
> between running with fsync=off and abandoning PostgreSQL altogether.
> That doesn't strike me as a remotely possible line of attack.

+1 for Robert here, removing async commit is a non-starter. It is 
PostgreSQL performance 101 that you disable synchronous commit unless 
you have a specific data/business requirement that needs it. 
Specifically because of how much faster Pg is with async commit.

Sincerely,

jD




-- 
Command Prompt, Inc.                  http://the.postgres.company/                        +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
Unless otherwise stated, opinions are my own.



Re: Proposal for CSN based snapshots

From
Stephen Frost
Date:
* Joshua D. Drake (jd@commandprompt.com) wrote:
> +1 for Robert here, removing async commit is a non-starter. It is
> PostgreSQL performance 101 that you disable synchronous commit
> unless you have a specific data/business requirement that needs it.
> Specifically because of how much faster Pg is with async commit.

I agree that we don't want to get rid of async commit, but, for the
archive, I wouldn't recommend using it unless you specifically understand
and accept that trade-off, so I wouldn't lump it into a "PostgreSQL
performance 101" group- that's increasing work_mem, shared_buffers, WAL
size, etc.  Accepting that you're going to lose *committed* transactions
on a crash requires careful thought and consideration of what you're
going to do when that happens, not the other way around.

Thanks!

Stephen

Re: Proposal for CSN based snapshots

From
"Joshua D. Drake"
Date:
On 08/10/2016 09:04 AM, Stephen Frost wrote:
> * Joshua D. Drake (jd@commandprompt.com) wrote:
>> +1 for Robert here, removing async commit is a non-starter. It is
>> PostgreSQL performance 101 that you disable synchronous commit
>> unless you have a specific data/business requirement that needs it.
>> Specifically because of how much faster Pg is with async commit.
>
> I agree that we don't want to get rid of async commit, but, for the
> archive, I wouldn't recommend using it unless you specifically understand
> and accept that trade-off, so I wouldn't lump it into a "PostgreSQL
> performance 101" group- that's increasing work_mem, shared_buffers, WAL
> size, etc.  Accepting that you're going to lose *committed* transactions
> on a crash requires careful thought and consideration of what you're
> going to do when that happens, not the other way around.

Yes Stephen, you are correct, which is why I said, "unless you have a 
specific data/business requirement that needs it".

Thanks!

jD

>
> Thanks!
>
> Stephen
>


-- 
Command Prompt, Inc.                  http://the.postgres.company/                        +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
Unless otherwise stated, opinions are my own.



Re: Proposal for CSN based snapshots

From
Ants Aasma
Date:
On Wed, Aug 10, 2016 at 6:09 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Hmm. There's one more possible way this could all work. Let's have CSN ==
> LSN, also for asynchronous commits. A snapshot is the current insert
> position, but also make note of the current flush position, when you take a
> snapshot. Now, when you use the snapshot, if you ever see an XID that
> committed between the snapshot's insert position and the flush position,
> wait for the WAL to be flushed up to the snapshot's insert position at that
> point. With that scheme, an asynchronous commit could return to the
> application without waiting for a flush, but if someone actually looks at
> the changes the transaction made, then that transaction would have to wait.
> Furthermore, we could probably skip that waiting too, if the reading
> transaction is also using synchronous_commit=off.
>
> That's slightly different from the current behaviour. A transaction that
> runs with synchronous_commit=on, and reads data that was modified by an
> asynchronous transaction, would take a hit. But I think that would be
> acceptable.

My proposal of vector clocks would allow for pretty much exactly the
current behavior.

To simplify, there would be lastSyncCommitSeqNo and
lastAsyncCommitSeqNo variables in ShmemVariableCache. Transaction
commit would choose which one to update based on the synchronous_commit
setting, and embed the value of that setting into the CSN log. Snapshots
would contain both values; when checking CSN visibility, the looked-up
synchronous_commit setting would decide which value to compare against.
Standbys replaying commit records would just update both values,
resulting in transactions becoming visible in xlog order, as they do
today. The scheme would also allow inventing a new xlog
record/replication message communicating visibility ordering.
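
A rough sketch of that bookkeeping (all names are invented for
illustration; this is not code from any patch):

typedef uint64 CommitSeqNo;

typedef struct CSNVectorSnapshot
{
    CommitSeqNo syncCSN;        /* copy of lastSyncCommitSeqNo */
    CommitSeqNo asyncCSN;       /* copy of lastAsyncCommitSeqNo */
} CSNVectorSnapshot;

/*
 * The committing backend embedded its synchronous_commit setting in the
 * CSN log, so the reader knows which horizon applies to this xact.
 */
static bool
CSNIsVisibleInSnapshot(CommitSeqNo xidcsn, bool xact_was_async,
                       CSNVectorSnapshot *snap)
{
    CommitSeqNo horizon = xact_was_async ? snap->asyncCSN : snap->syncCSN;

    return xidcsn <= horizon;
}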

However, I don't see why inventing a separate CSN concept is a large
problem. Quite the opposite: unless there is a good reason that I'm
missing, it seems better not to unnecessarily conflate commit record
durability with transaction visibility ordering. Not having them tied
together allows an external source to provide CSN values, enabling
interesting distributed transaction implementations, e.g. using a
timestamp as the CSN à la Google Spanner and the TrueTime API.

Regards,
Ants Aasma



Re: Proposal for CSN based snapshots

From
Alexander Korotkov
Date:
On Wed, Aug 10, 2016 at 6:09 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 08/10/2016 05:51 PM, Tom Lane wrote:
>> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>>> On 08/10/2016 05:09 PM, Tom Lane wrote:
>>>> Uh, what?  That's not the semantics we have today, and I don't see why
>>>> it's necessary or a good idea.  Once the commit is in the WAL stream,
>>>> any action taken on the basis of seeing the commit must be later in
>>>> the WAL stream.  So what's the problem?
>>>
>>> I was talking about synchronous commits in the above. A synchronous
>>> commit is not made visible to other transactions, until the commit WAL
>>> record is flushed to disk.
>>
>> [ thinks for a bit... ]  Oh, OK, that's because we don't treat a
>> transaction as committed until its clog bit is set *and* it's not
>> marked as running in the PGPROC array.  And sync transactions will
>> flush WAL in between.
>
> Right.
>
>> Still, having to invent CSNs seems like a huge loss for this design.
>> Personally I'd give up async commit first.  If we had only sync commit,
>> the rule could be "xact LSN less than snapshot threshold and less than
>> WAL flush position", and we'd not need CSNs.  I know some people like
>> async commit, but it's there because it was easy and cheap in our old
>> design, not because it's the world's greatest feature and worth giving
>> up performance for.
>
> I don't think that's a very popular opinion (I disagree, for one). Asynchronous commits are a huge performance boost for some applications. The alternative is fsync=off, and I don't want to see more people doing that. SSDs have made the penalty of an fsync much smaller, but it's still there.
>
> Hmm. There's one more possible way this could all work. Let's have CSN == LSN, also for asynchronous commits. A snapshot is the current insert position, but also make note of the current flush position, when you take a snapshot. Now, when you use the snapshot, if you ever see an XID that committed between the snapshot's insert position and the flush position, wait for the WAL to be flushed up to the snapshot's insert position at that point. With that scheme, an asynchronous commit could return to the application without waiting for a flush, but if someone actually looks at the changes the transaction made, then that transaction would have to wait. Furthermore, we could probably skip that waiting too, if the reading transaction is also using synchronous_commit=off.
>
> That's slightly different from the current behaviour. A transaction that runs with synchronous_commit=on, and reads data that was modified by an asynchronous transaction, would take a hit. But I think that would be acceptable.

Oh, I found that I underestimated the complexity of async commit...  :)

Do I understand right that async commit now works as follows?
1) An async transaction confirms the commit before flushing WAL.
2) Other transactions see the effects of an async transaction only once its WAL is flushed.
3) In the session that just committed the async transaction, its effects are visible immediately (before the WAL is flushed). Is that true?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Proposal for CSN based snapshots

From
Heikki Linnakangas
Date:
On 08/10/2016 07:54 PM, Alexander Korotkov wrote:
> Do I understand right that async commit now works as follows?
> 1) An async transaction confirms the commit before flushing WAL.

Yes.

> 2) Other transactions see the effects of an async transaction only once
> its WAL is flushed.

No. Other transactions also see the effects of the async transaction 
before it's flushed.

> 3) In the session that just committed the async transaction, its effects
> are visible immediately (before the WAL is flushed). Is that true?

Yes. (The same session is not a special case, per previous point. All 
sessions see the async transaction as committed, even before it's flushed.)

- Heikki




Re: Proposal for CSN based snapshots

From
Greg Stark
Date:
On Wed, Aug 10, 2016 at 5:54 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Oh, I found that I underestimated the complexity of async commit...  :)

Indeed. I think Tom's attitude was right even if the specific
conclusion was wrong. While I don't think removing async commit is
viable I think it would be a laudable goal if we can remove some of
the complication in it. I normally describe async commit as "just like
a normal commit but don't block on the commit" but it's actually a bit
more complicated.

AIUI the additional complexity is that while async commits are visible
to everyone immediately, other non-async transactions can be committed
but then be in limbo for a while before they are visible to others. So
other sessions will see the async commit "jump ahead" of any non-async
transactions even if those other transactions were committed first.
Any standbys will see the non-async transaction in the logs before the
async transaction, and in a crash it's possible to lose the async
transaction (even though it was visible) while keeping the sync
transaction that wasn't yet visible.

Complexity like this makes it hard to implement other features such as
CSNs. IIRC this already bit hot standby as well. I think it would be a
big improvement if we had a clear, well defined commit order that was
easy to explain and easy to reason about when new changes are being
made.


-- 
greg



Re: Proposal for CSN based snapshots

From
Ants Aasma
Date:
On Wed, Aug 10, 2016 at 7:54 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Oh, I found that I underestimated the complexity of async commit...  :)
>
> Do I understand right that async commit now works as follows?
> 1) An async transaction confirms the commit before flushing WAL.
> 2) Other transactions see the effects of an async transaction only once
> its WAL is flushed.
> 3) In the session that just committed the async transaction, its effects
> are visible immediately (before the WAL is flushed). Is that true?

Current code, simplified:

XactLogCommitRecord()
if (synchronous_commit)
    XLogFlush()
ProcArrayEndTransaction()      /* become visible */

The issue we are discussing is that with CSNs, the "become visible"
step must occur in CSN order. If CSN == LSN, then an async transaction
whose commit record falls after a sync commit record must wait for the
sync record to be flushed before it can become visible. The simplest
solution is to not require CSN == LSN, and just assign a CSN value
immediately before becoming visible.
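
In the same simplified pseudocode, that might look like (nextCommitSeqNo
and CSNLogSetCSN are invented names, sketching the idea rather than any
actual patch):

XactLogCommitRecord()
if (synchronous_commit)
    XLogFlush()
csn = ++nextCommitSeqNo        /* CSN assigned only now, not at WAL insert */
CSNLogSetCSN(xid, csn)         /* record the XID -> CSN mapping */
ProcArrayEndTransaction()      /* become visible, in CSN order */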

Regards,
Ants Aasma



Re: Proposal for CSN based snapshots

From
Kevin Grittner
Date:
On Wed, Aug 10, 2016 at 12:26 PM, Greg Stark <stark@mit.edu> wrote:

> Complexity like this makes it hard to implement other features such as
> CSNs. IIRC this already bit hot standby as well. I think it would be a
> big improvement if we had a clear, well defined commit order that was
> easy to explain and easy to reason about when new changes are being
> made.

And here I was getting concerned that there was no mention of
"apparent order of execution" for serializable transactions --
which does not necessarily match either the order of LSNs from
commit records or the order of CSNs.  The order in which transactions become
visible is clearly a large factor in determining AOoE, but it is
secondary to looking at whether a transaction modified data based
on reading the "before" image of a data set modified by a
concurrent transaction.

I still think that our best bet for avoiding anomalies when using
logical replication in complex environments is for logical
replication to apply transactions in apparent order of execution.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Proposal for CSN based snapshots

From
Alexander Korotkov
Date:
On Wed, Aug 10, 2016 at 8:26 PM, Greg Stark <stark@mit.edu> wrote:
> On Wed, Aug 10, 2016 at 5:54 PM, Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
>> Oh, I found that I underestimated the complexity of async commit...  :)
>
> Indeed. I think Tom's attitude was right even if the specific
> conclusion was wrong. While I don't think removing async commit is
> viable, I think it would be a laudable goal if we can remove some of
> the complication in it. I normally describe async commit as "just like
> a normal commit but don't block on the commit" but it's actually a bit
> more complicated.
>
> AIUI the additional complexity is that while async commits are visible
> to everyone immediately, other non-async transactions can be committed
> but then be in limbo for a while before they are visible to others. So
> other sessions will see the async commit "jump ahead" of any non-async
> transactions even if those other transactions were committed first.
> Any standbys will see the non-async transaction in the logs before the
> async transaction, and in a crash it's possible to lose the async
> transaction (even though it was visible) while keeping the sync
> transaction that wasn't yet visible.
>
> Complexity like this makes it hard to implement other features such as
> CSNs. IIRC this already bit hot standby as well. I think it would be a
> big improvement if we had a clear, well defined commit order that was
> easy to explain and easy to reason about when new changes are being
> made.

Heikki, Ants, Greg, thank you for the explanation.  You restored order in my personal world.
Now I see that introducing a separate CSN sequence, not equal to the LSN, makes sense.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: Proposal for CSN based snapshots

From
Heikki Linnakangas
Date:
And here's a new patch version. Still lots of work to do, especially in 
performance testing, and minimizing the worst-case performance hit.

On 08/09/2016 03:16 PM, Heikki Linnakangas wrote:
> Next steps:
>
> * Hot standby feedback is broken, now that CSN != LSN again. Will have
> to switch this back to using an "oldest XID", rather than a CSN.
>
> * I plan to replace pg_subtrans with a special range of CSNs in the
> csnlog. Something like, start the CSN counter at 2^32 + 1, and use CSNs
> < 2^32 to mean "this is a subtransaction, parent is XXX". One less SLRU
> to maintain.
>
> * Put per-proc xmin back into procarray. I removed it, because it's not
> necessary for snapshots or GetOldestSnapshot() (which replaces
> GetOldestXmin()) anymore. But on second thoughts, we still need it for
> deciding when it's safe to truncate the csnlog.
>
> * In this patch, HeapTupleSatisfiesVacuum() is rewritten to use an
> "oldest CSN", instead of "oldest xmin", but that's not strictly
> necessary. To limit the size of the patch, I might revert those changes
> for now.

I did all of the above. This patch is now much smaller, as I didn't 
change all the places that used to deal with global-xmin's, like I did 
earlier. The oldest-xmin is now computed pretty much like it always has been.
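
To illustrate the csnlog encoding for subtransactions from the quoted
list, a sketch (macro and function names invented, not from the patch):

/* CSNs start above 2^32; smaller values store a subxact's parent XID. */
#define FIRST_NORMAL_CSN        (UINT64CONST(1) << 32)

static inline bool
CSNIsSubtransEntry(uint64 csn)
{
    return csn != 0 && csn < FIRST_NORMAL_CSN;
}

static inline TransactionId
CSNGetParentXid(uint64 csn)
{
    Assert(CSNIsSubtransEntry(csn));
    return (TransactionId) csn;     /* parent XID in the low 32 bits */
}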

> * Rewrite the way RecentGlobalXmin is updated. As Alvaro pointed out in
> his review comments two years ago, that was quite complicated. And I'm
> worried that the lazy scheme I had might not allow pruning fast enough.
> I plan to make it more aggressive, so that whenever the currently oldest
> transaction finishes, it's responsible for advancing the "global xmin"
> in shared memory. And the way it does that, is by scanning the csnlog,
> starting from the current "global xmin", until the next still
> in-progress XID. That could be a lot, if you have a very long-running
> transaction that ends, but we'll see how it performs.

I ripped out all that, and created a GetRecentGlobalXmin() function that 
computes a global-xmin value when needed, like GetOldestXmin() does. 
Seems most straightforward. Since we no longer get a RecentGlobalXmin 
value essentially for free in GetSnapshotData(), as we no longer scan 
the proc array, it's better to compute the value only when needed.

> * Performance testing. Clearly this should have a performance benefit,
> at least under some workloads, to be worthwhile. And not regress.

I wrote a little C module to create a "worst-case" table. Every row in 
the table has a different xmin, and the xmin values are shuffled across 
the table, to defeat any caching.

A sequential scan of a table like that with 10 million rows took about 
700 ms on my laptop, when the hint bits are set, without this patch. 
With this patch, if there's a snapshot holding back the xmin horizon, so 
that we need to check the CSN log for every XID, it took about 30000 ms. 
So we have some optimization work to do :-). I'm not overly worried 
about that right now, as I think there's a lot of room for improvement 
in the SLRU code. But that's the next thing I'm going to work on.

- Heikki




Re: Proposal for CSN based snapshots

From
Heikki Linnakangas
Date:
On 08/22/2016 07:35 PM, Heikki Linnakangas wrote:
> And here's a new patch version...

And here's the attachment I forgot, *sigh*..

- Heikki



Re: Proposal for CSN based snapshots

From
Robert Haas
Date:
Nice to see you working on this again.

On Mon, Aug 22, 2016 at 12:35 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> A sequential scan of a table like that with 10 million rows took about 700
> ms on my laptop, when the hint bits are set, without this patch. With this
> patch, if there's a snapshot holding back the xmin horizon, so that we need
> to check the CSN log for every XID, it took about 30000 ms. So we have some
> optimization work to do :-). I'm not overly worried about that right now, as
> I think there's a lot of room for improvement in the SLRU code. But that's
> the next thing I'm going to work on.

So the worst case for this patch is obviously bad right now and, as
you say, that means that some optimization work is needed.

But what about the best case?  If we create a scenario where there are
no open read-write transactions at all and (somehow) lots and lots of
ProcArrayLock contention, how much does this help?

Because there's only a purpose to trying to minimize the losses if
there are some gains to which we can look forward.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Proposal for CSN based snapshots

From
Heikki Linnakangas
Date:
On 08/22/2016 07:49 PM, Robert Haas wrote:
> Nice to see you working on this again.
>
> On Mon, Aug 22, 2016 at 12:35 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> A sequential scan of a table like that with 10 million rows took about 700
>> ms on my laptop, when the hint bits are set, without this patch. With this
>> patch, if there's a snapshot holding back the xmin horizon, so that we need
>> to check the CSN log for every XID, it took about 30000 ms. So we have some
>> optimization work to do :-). I'm not overly worried about that right now, as
>> I think there's a lot of room for improvement in the SLRU code. But that's
>> the next thing I'm going to work on.
>
> So the worst case for this patch is obviously bad right now and, as
> you say, that means that some optimization work is needed.
>
> But what about the best case?  If we create a scenario where there are
> no open read-write transactions at all and (somehow) lots and lots of
> ProcArrayLock contention, how much does this help?

I ran some quick pgbench tests on my laptop, but didn't see any 
meaningful benefit. I think the best I could see is about 5% speedup, 
when running "pgbench -S", with 900 idle connections sitting in the 
background. On the positive side, I didn't see much slowdown either. 
(Sorry, I didn't record the details of those tests, as I was testing 
many different options and I didn't see a clear difference either way.)

It seems that Amit's PGPROC batch clearing patch was very effective. I 
remember seeing ProcArrayLock contention very visible earlier, but I 
can't hit that now. I suspect you'd still see contention on bigger 
hardware, though, my laptop has only 4 cores. I'll have to find a real 
server for the next round of testing.

> Because there's only a purpose to trying to minimize the losses if
> there are some gains to which we can look forward.

Aside from the potential performance gains, this slashes a lot of 
complicated code:
 70 files changed, 2429 insertions(+), 6066 deletions(-)

That removed code is quite mature at this point, and I'm sure we'll add 
some code back to this patch as it evolves, but still.

Also, I'm looking forward to a follow-up patch, to track snapshots in 
backends at a finer level, so that vacuum could remove tuples more 
aggressively if you have pg_dump running for days. CSN snapshots aren't 
a strict requirement for that, but they make it simpler, since you can 
represent a snapshot with a small fixed-size integer.

Yes, seeing some direct performance gains would be nice too.

- Heikki



Re: Proposal for CSN based snapshots

From
Andres Freund
Date:
Hi,

On 2016-08-22 20:32:42 +0300, Heikki Linnakangas wrote:
> I ran some quick pgbench tests on my laptop, but didn't see any meaningful
> benefit. I think the best I could see is about 5% speedup, when running
> "pgbench -S", with 900 idle connections sitting in the background. On the
> positive side, I didn't see much slowdown either. (Sorry, I didn't record
> the details of those tests, as I was testing many different options and I
> didn't see a clear difference either way.)

Hm. Does the picture change if those are idle in transaction, after
assigning an xid?


> It seems that Amit's PGPROC batch clearing patch was very effective.

It usually breaks down if you have a mixed read/write workload - might
be worthwhile prototyping that.


> I
> remember seeing ProcArrayLock contention very visible earlier, but I can't
> hit that now. I suspect you'd still see contention on bigger hardware,
> though, my laptop has only 4 cores. I'll have to find a real server for the
> next round of testing.

Yea, I think that's true. I can just about see ProcArrayLock contention
on my more powerful laptop; to see it really bad you need bigger
hardware / higher concurrency.


Greetings,

Andres Freund



Re: Proposal for CSN based snapshots

From
Robert Haas
Date:
On Mon, Aug 22, 2016 at 1:32 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> But what about the best case?  If we create a scenario where there are
>> no open read-write transactions at all and (somehow) lots and lots of
>> ProcArrayLock contention, how much does this help?
>
> I ran some quick pgbench tests on my laptop, but didn't see any meaningful
> benefit. I think the best I could see is about 5% speedup, when running
> "pgbench -S", with 900 idle connections sitting in the background. On the
> positive side, I didn't see much slowdown either. (Sorry, I didn't record
> the details of those tests, as I was testing many different options and I
> didn't see a clear difference either way.)

That's not very exciting.

> It seems that Amit's PGPROC batch clearing patch was very effective. I
> remember seeing ProcArrayLock contention very visible earlier, but I can't
> hit that now. I suspect you'd still see contention on bigger hardware,
> though, my laptop has only 4 cores. I'll have to find a real server for the
> next round of testing.

It's good that those patches were effective, and I bet that approach
could be further refined, too.  However, I think Amit may have
mentioned in an internal meeting that he was able to generate some
pretty serious ProcArrayLock contention with some of his hash index
patches applied.  I don't remember the details, though.

>> Because there's only a purpose to trying to minimize the losses if
>> there are some gains to which we can look forward.
>
> Aside from the potential performance gains, this slashes a lot of
> complicated code:
>
>  70 files changed, 2429 insertions(+), 6066 deletions(-)
>
> That removed code is quite mature at this point, and I'm sure we'll add some
> code back to this patch as it evolves, but still.

That's interesting, but it might just mean we're replacing well-tested
code with new, buggy code.  By the time you fix all the performance
regressions, those numbers could be a lot closer together.

> Also, I'm looking forward to a follow-up patch, to track snapshots in
> backends at a finer level, so that vacuum could remove tuples more
> aggressively if you have pg_dump running for days. CSN snapshots aren't a
> strict requirement for that, but they make it simpler, since you can
> represent a snapshot with a small fixed-size integer.

That would certainly be nice, but I think we need to be careful not to
sacrifice too much trying to get there.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Proposal for CSN based snapshots

From
Andres Freund
Date:
On 2016-08-22 13:41:57 -0400, Robert Haas wrote:
> On Mon, Aug 22, 2016 at 1:32 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >> But what about the best case?  If we create a scenario where there are
> >> no open read-write transactions at all and (somehow) lots and lots of
> >> ProcArrayLock contention, how much does this help?
> >
> > I ran some quick pgbench tests on my laptop, but didn't see any meaningful
> > benefit. I think the best I could see is about 5% speedup, when running
> > "pgbench -S", with 900 idle connections sitting in the background. On the
> > positive side, I didn't see much slowdown either. (Sorry, I didn't record
> > the details of those tests, as I was testing many different options and I
> > didn't see a clear difference either way.)
> 
> That's not very exciting.

I think it's neither exciting nor worrying - the benefit of the pgproc
batch clearing itself wasn't apparent on small hardware either...



Re: Proposal for CSN based snapshots

From
Amit Kapila
Date:
On Mon, Aug 22, 2016 at 11:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Aug 22, 2016 at 1:32 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> But what about the best case?  If we create a scenario where there are
>>> no open read-write transactions at all and (somehow) lots and lots of
>>> ProcArrayLock contention, how much does this help?
>>
>> I ran some quick pgbench tests on my laptop, but didn't see any meaningful
>> benefit. I think the best I could see is about 5% speedup, when running
>> "pgbench -S", with 900 idle connections sitting in the background. On the
>> positive side, I didn't see much slowdown either. (Sorry, I didn't record
>> the details of those tests, as I was testing many different options and I
>> didn't see a clear difference either way.)
>
> That's not very exciting.
>
>> It seems that Amit's PGPROC batch clearing patch was very effective. I
>> remember seeing ProcArrayLock contention very visible earlier, but I can't
>> hit that now. I suspect you'd still see contention on bigger hardware,
>> though, my laptop has only 4 cores. I'll have to find a real server for the
>> next round of testing.
>
> It's good that those patches were effective, and I bet that approach
> could be further refined, too.  However, I think Amit may have
> mentioned in an internal meeting that he was able to generate some
> pretty serious ProcArrayLock contention with some of his hash index
> patches applied.  I don't remember the details, though.
>

Yes, that's right.  We are seeing ProcArrayLock as a bottleneck at
high concurrency for hash-index reads.  On a quick look, it seems the
patch still acquires ProcArrayLock in shared mode in GetSnapshotData(),
although the duration is reduced.  I think it is worth trying the
workload where we see ProcArrayLock as a bottleneck, but I am not sure
it can drastically improve the situation, because per our initial
analysis, the lock acquisition and release itself causes the problem.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Proposal for CSN based snapshots

From
Heikki Linnakangas
Date:
On 08/22/2016 08:38 PM, Andres Freund wrote:
> On 2016-08-22 20:32:42 +0300, Heikki Linnakangas wrote:
>> I
>> remember seeing ProcArrayLock contention very visible earlier, but I can't
>> hit that now. I suspect you'd still see contention on bigger hardware,
>> though, my laptop has only 4 cores. I'll have to find a real server for the
>> next round of testing.
>
> Yea, I think that's true. I can just about see ProcArrayLock contention
> on my more powerful laptop; to see it really bad you need bigger
> hardware / higher concurrency.

As soon as I sent my previous post, Vladimir Borodin kindly offered
access to a 32-core server for performance testing. Thanks Vladimir!

I installed Greg Smith's pgbench-tools kit on that server, and ran some
tests. I'm seeing some benefit on "pgbench -N" workload, but only after
modifying the test script to use "-M prepared", and using Unix domain
sockets instead of TCP to connect. Apparently those things add enough
overhead to mask out the little difference.

Attached is a graph with the results. Full results are available at
https://hlinnaka.iki.fi/temp/csn-4-results/. In short, the patch
improved throughput, measured in TPS, with >= 32 or so clients. The
biggest difference was with 44 clients, which saw about 5% improvement.

So, not phenomenal, but it's something. I suspect that with more cores,
the difference would become more clear.

Like on cue, Alexander Korotkov just offered access to a 72-core
system :-). Thanks! I'll run the same tests on that.

- Heikki



Re: Proposal for CSN based snapshots

From
Andres Freund
Date:
On 2016-08-23 18:18:57 +0300, Heikki Linnakangas wrote:
> I installed Greg Smith's pgbench-tools kit on that server, and ran some
> tests. I'm seeing some benefit on "pgbench -N" workload, but only after
> modifying the test script to use "-M prepared", and using Unix domain
> sockets instead of TCP to connect. Apparently those things add enough
> overhead to mask out the little difference.

To make the problem more apparent, how are the differences for something
as extreme as just a
SELECT 1;
or a
SELECT 1 FROM pgbench_accounts WHERE aid = 1;

Greetings,

Andres Freund



Re: Proposal for CSN based snapshots

From
Heikki Linnakangas
Date:
On 08/23/2016 06:18 PM, Heikki Linnakangas wrote:
> On 08/22/2016 08:38 PM, Andres Freund wrote:
>> On 2016-08-22 20:32:42 +0300, Heikki Linnakangas wrote:
>>> I
>>> remember seeing ProcArrayLock contention very visible earlier, but I can't
>>> hit that now. I suspect you'd still see contention on bigger hardware,
>>> though, my laptop has only 4 cores. I'll have to find a real server for the
>>> next round of testing.
>>
>> Yea, I think that's true. I can just about see ProcArrayLock contention
>> on my more powerful laptop; to see it really bad you need bigger
>> hardware / higher concurrency.
>
> As soon as I sent my previous post, Vladimir Borodin kindly offered
> access to a 32-core server for performance testing. Thanks Vladimir!
>
> I installed Greg Smith's pgbench-tools kit on that server, and ran some
> tests. I'm seeing some benefit on "pgbench -N" workload, but only after
> modifying the test script to use "-M prepared", and using Unix domain
> sockets instead of TCP to connect. Apparently those things add enough
> overhead to mask out the little difference.
>
> Attached is a graph with the results. Full results are available at
> https://hlinnaka.iki.fi/temp/csn-4-results/. In short, the patch
> improved throughput, measured in TPS, with >= 32 or so clients. The
> biggest difference was with 44 clients, which saw about 5% improvement.
>
> So, not phenomenal, but it's something. I suspect that with more cores,
> the difference would become more clear.
>
> Like on cue, Alexander Korotkov just offered access to a 72-core
> system :-). Thanks! I'll run the same tests on that.

And here are the results on the 72 core machine (thanks again,
Alexander!). The test setup was the same as on the 32-core machine,
except that I ran it with more clients since the system has more CPU
cores. In summary, in the best case, the patch increases throughput by
about 10%. That peak is with 64 clients. Interestingly, as the number of
clients increases further, the gain evaporates, and the CSN version
actually performs worse than unpatched master. I don't know why that is.
One theory is that by eliminating one bottleneck, we're now hitting another
bottleneck which doesn't degrade as gracefully when there's contention.

Full results are available at
https://hlinnaka.iki.fi/temp/csn-4-72core-results/.

- Heikki



Re: Proposal for CSN based snapshots

From
Alexander Korotkov
Date:
On Wed, Aug 24, 2016 at 11:54 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 08/23/2016 06:18 PM, Heikki Linnakangas wrote:
>> On 08/22/2016 08:38 PM, Andres Freund wrote:
>>> On 2016-08-22 20:32:42 +0300, Heikki Linnakangas wrote:
>>>> I
>>>> remember seeing ProcArrayLock contention very visible earlier, but I can't
>>>> hit that now. I suspect you'd still see contention on bigger hardware,
>>>> though, my laptop has only 4 cores. I'll have to find a real server for the
>>>> next round of testing.
>>>
>>> Yea, I think that's true. I can just about see ProcArrayLock contention
>>> on my more powerful laptop; to see it really bad you need bigger
>>> hardware / higher concurrency.
>>
>> As soon as I sent my previous post, Vladimir Borodin kindly offered
>> access to a 32-core server for performance testing. Thanks Vladimir!
>>
>> I installed Greg Smith's pgbench-tools kit on that server, and ran some
>> tests. I'm seeing some benefit on "pgbench -N" workload, but only after
>> modifying the test script to use "-M prepared", and using Unix domain
>> sockets instead of TCP to connect. Apparently those things add enough
>> overhead to mask out the little difference.
>>
>> Attached is a graph with the results. Full results are available at
>> https://hlinnaka.iki.fi/temp/csn-4-results/. In short, the patch
>> improved throughput, measured in TPS, with >= 32 or so clients. The
>> biggest difference was with 44 clients, which saw about 5% improvement.
>>
>> So, not phenomenal, but it's something. I suspect that with more cores,
>> the difference would become more clear.
>>
>> Like on cue, Alexander Korotkov just offered access to a 72-core
>> system :-). Thanks! I'll run the same tests on that.
>
> And here are the results on the 72 core machine (thanks again, Alexander!). The test setup was the same as on the 32-core machine, except that I ran it with more clients since the system has more CPU cores. In summary, in the best case, the patch increases throughput by about 10%. That peak is with 64 clients. Interestingly, as the number of clients increases further, the gain evaporates, and the CSN version actually performs worse than unpatched master. I don't know why that is. One theory is that by eliminating one bottleneck, we're now hitting another bottleneck which doesn't degrade as gracefully when there's contention.

Did you try to identify this second bottleneck with perf or something?
It would be nice to also run pgbench -S.  Also, it would be nice to check something like 10% writes, 90% reads (which I believe is quite a typical workload in real life).

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] Proposal for CSN based snapshots

From
Amit Kapila
Date:
On Wed, Aug 24, 2016 at 2:24 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> And here are the results on the 72 core machine (thanks again, Alexander!).
> The test setup was the same as on the 32-core machine, except that I ran it
> with more clients since the system has more CPU cores. In summary, in the
> best case, the patch increases throughput by about 10%. That peak is with 64
> clients. Interestingly, as the number of clients increases further, the gain
> evaporates, and the CSN version actually performs worse than unpatched
> master. I don't know why that is. One theory is that by eliminating one
> bottleneck, we're now hitting another bottleneck which doesn't degrade as
> gracefully when there's contention.
>

Quite possible, and I think it could be due either to CLOGControlLock or
to WALWriteLock.  Are you planning to work on this for this release? If
not, do you think we should meanwhile pursue the "caching the snapshot"
idea of Andres?  Last time Mithun did some benchmarks [1] after fixing
some issues in the patch, and he found a noticeable performance increase.

[1] - https://www.postgresql.org/message-id/CAD__Ouic1Tvnwqm6Wf6j7Cz1Kk1DQgmy0isC7%3DOgX%2B3JtfGk9g%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Proposal for CSN based snapshots

From
Michael Paquier
Date:
On Wed, Aug 24, 2016 at 5:54 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> And here are the results on the 72 core machine (thanks again, Alexander!).
> The test setup was the same as on the 32-core machine, except that I ran it
> with more clients since the system has more CPU cores. In summary, in the
> best case, the patch increases throughput by about 10%. That peak is with 64
> clients. Interestingly, as the number of clients increases further, the gain
> evaporates, and the CSN version actually performs worse than unpatched
> master. I don't know why that is. One theory is that by eliminating one
> bottleneck, we're now hitting another bottleneck which doesn't degrade as
> gracefully when there's contention.
>
> Full results are available at
> https://hlinnaka.iki.fi/temp/csn-4-72core-results/.

There has not been much activity on this thread for some time, and I
mentioned my intentions to some developers at the last PGCon. But I am
planning to study the work that has been done here in more depth, with
the goal of presenting a patch for the first CF of PG11. Lots of fun
ahead.
-- 
Michael



Re: [HACKERS] Proposal for CSN based snapshots

From
Alexander Kuzmenkov
Date:
On 21.06.2017 04:48, Michael Paquier wrote:
> There has not been much activity on this thread for some time, and I
> mentioned my intentions to some developers at the last PGCon. But I am
> planning to study the work that has been done here in more depth, with
> the goal of presenting a patch for the first CF of PG11. Lots of fun
> ahead.

Hi Michael,

Glad to see you working on this! I've been studying this topic too. 
Attached you can find a recently rebased version of Heikki's v4 patch.
I also fixed a bug that appeared in the report-receipts isolation test: 
XidIsConcurrent should report that a transaction is concurrent to ours 
when its csn equals our snapshotcsn.
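
In other words, something like this sketch (the real function's
signature surely differs; CommitSeqNo is just an assumed 64-bit type
here):

typedef uint64 CommitSeqNo;

static bool
XidIsConcurrent(CommitSeqNo csn, CommitSeqNo snapshotcsn)
{
    /* was ">", which wrongly treated the equal-CSN case as visible */
    return csn >= snapshotcsn;
}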

There is another bug where multiple tuples with the same primary key 
appear after a pgbench read-write test. Querying pgbench_branches with 
index scans disabled, I see several tuples with the same 'bid' value. 
Usually the updates do not touch the index, so there are no errors. But 
sometimes a HOT update is not possible, and it errors out with a 
violation of the unique constraint, which is how I noticed it. I can't 
reproduce it on my machine and have to use a 72-core one.

For now I can conclude that oldestActiveXid is not always updated 
correctly. In TransactionIdAsyncCommitTree, just before the transaction 
sets its clog status, its xid can already be less than oldestActiveXid. 
Other transactions see it as aborted for some time before it writes to 
clog (see TransactionIdGetCommitSeqNo). They can update a tuple it 
deleted, and that leads to duplication. Unfortunately, I haven't had 
time yet to investigate this further.
-- 

Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] Proposal for CSN based snapshots

From
Michael Paquier
Date:
On Wed, Jun 21, 2017 at 1:48 PM, Alexander Kuzmenkov
<a.kuzmenkov@postgrespro.ru> wrote:
> Glad to see you working on this! I've been studying this topic too. Attached
> you can find a recently rebased version of Heikki's v4 patch.
> I also fixed a bug that appeared in the report-receipts isolation test:
> XidIsConcurrent should report that a transaction is concurrent to ours when
> its csn equals our snapshotcsn.

I am finally looking at rebasing things properly now, and getting more
familiar with the code in order to come up with a patch for the upcoming
commitfest (I can also see some diff simplifications that would impact
existing applications less), and this v5 is very suspicious. You are
adding code that should actually be removed. One such block can be found
at the beginning of the patch:

   <indexterm>
+    <primary>txid_snapshot_xip</primary>
+   </indexterm>
+
+   <indexterm>

That's not actually a problem as I am reusing an older v4 from Heikki
now for the future, but I wanted to let you know about that first.
-- 
Michael



Re: [HACKERS] Proposal for CSN based snapshots

From
Alexander Kuzmenkov
Date:
> That's not actually a problem as I am reusing an older v4 from Heikki
> now for the future, but I wanted to let you know about that first.
Thank you, I'll look into that.

-- 
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] Proposal for CSN based snapshots

From
Alexander Kuzmenkov
Date:
Hi all,

So I did some more experiments on this patch.

* I fixed the bug with duplicate tuples I mentioned in the previous 
letter. Indeed, the oldestActiveXid could be advanced past the 
transaction's xid before it set the clog status. This happened because 
oldestActiveXid is calculated based on the CSN log contents, and we 
wrote to the CSN log before writing to clog. The fix is to write to clog 
before the CSN log (TransactionIdAsyncCommitTree); see the sketch after 
this list.

* We can remove the exclusive locking on CSNLogControlLock when setting 
the CSN for a transaction (CSNLogSetPageStatus). When we assign a CSN to 
a transaction and its children, the atomicity is guaranteed by using an 
intermediate state (COMMITSEQNO_COMMITTING), so it doesn't matter if 
this function is not atomic in itself. The shared lock should suffice here.

* At throughputs of about 100k TPS, we allocate ~1k CSN log pages per 
second. This is done under an exclusive lock on the CSN control lock, 
and noticeably increases contention. To alleviate this, I allocate new 
pages in batches (ExtendCSNLOG).

* When advancing oldestActiveXid, we scan the CSN log to find an xid 
that is still in progress. To do that, the code would increment the xid 
and query its CSN using the high-level function, acquiring and releasing 
the lock and looking up the log page for each xid. I wrote some code to 
acquire the lock only once and then scan the pages (CSNLogSetPageStatus).

* With bigger buffers, the linear page lookup code that the SLRU 
currently uses becomes slow. I added a shared dynahash table to speed up 
this lookup.

* I merged in recent changes from master (up to 7e1fb4). Unfortunately I 
didn't have enough time to fix logical replication and snapshot import, 
so those are completely broken for now.
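
A sketch of the commit-ordering fix from the first item above
(simplified: CSNLogSetCommitSeqNo and GetNextCommitSeqNo are stand-in
names; TransactionIdSetTreeStatus is the real clog call):

static void
TransactionIdAsyncCommitTree(TransactionId xid, int nxids,
                             TransactionId *xids, XLogRecPtr lsn)
{
    /* Set the clog status first... */
    TransactionIdSetTreeStatus(xid, nxids, xids,
                               TRANSACTION_STATUS_COMMITTED, lsn);

    /*
     * ...and only then publish the CSN.  oldestActiveXid is derived
     * from the CSN log contents, so once our CSN appears there, other
     * backends may consult clog for this xid and must find it
     * committed rather than still in progress.
     */
    CSNLogSetCommitSeqNo(xid, nxids, xids, GetNextCommitSeqNo());
}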

I ran some pgbench with these tweaks (tpcb-like, 72 cores, scale 500). 
The throughput is good at lower numbers of clients (at 50 clients it's 
35% higher than on master), but then it degrades steadily. After 250 
clients it's already lower than master; see the attached graph. In perf 
reports the CSN-related things have almost vanished, and I see lots of 
time spent working with clog. This is probably a situation where, by 
making some parts faster, the contention in other parts becomes worse 
and overall we have a performance loss. Hilariously, at some point I saw 
a big performance increase after adding some debug printfs. I wanted to 
try some things with the clog next, but for now I'm out of time.

The new version of the patch is attached. Last time I apparently diff'ed 
it the other way around; now it should apply fine.

-- 
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] Proposal for CSN based snapshots

From
Amit Kapila
Date:
On Tue, Aug 1, 2017 at 7:41 PM, Alexander Kuzmenkov
<a.kuzmenkov@postgrespro.ru> wrote:
> Hi all,
>
> So I did some more experiments on this patch.
>
> * I fixed the bug with duplicate tuples I mentioned in the previous letter.
> Indeed, the oldestActiveXid could be advanced past the transaction's xid
> before it set the clog status. This happened because the oldestActiveXid is
> calculated based on the CSN log contents, and we wrote to CSN log before
> writing to clog. The fix is to write to clog before CSN log
> (TransactionIdAsyncCommitTree)
>
> * We can remove the exclusive locking on CSNLogControlLock when setting the
> CSN for a transaction (CSNLogSetPageStatus). When we assign a CSN to a
> transaction and its children, the atomicity is guaranteed by using an
> intermediate state (COMMITSEQNO_COMMITTING), so it doesn't matter if this
> function is not atomic in itself. The shared lock should suffice here.
>
> * On throughputs of about 100k TPS, we allocate ~1k CSN log pages per
> second. This is done with exclusive locking on CSN control lock, and
> noticeably increases contention. To alleviate this, I allocate new pages in
> batches (ExtendCSNLOG).
>
> * When advancing oldestActiveXid, we scan CSN log to find an xid that is
> still in progress. To do that, we increment the xid and query its CSN using
> the high level function, acquiring and releasing the lock and looking up the
> log page for each xid. I wrote some code to acquire the lock only once and
> then scan the pages (CSNLogSetPageStatus).
>
> * On bigger buffers the linear page lookup code that the SLRU uses now
> becomes slow. I added a shared dynahash table to speed up this lookup.
>
> * I merged in recent changes from master (up to 7e1fb4). Unfortunately I
> didn't have enough time to fix the logical replication and snapshot import,
> so now it's completely broken.
>
> I ran some pgbench with these tweaks (tpcb-like, 72 cores, scale 500). The
> throughput is good on lower number of clients (on 50 clients it's 35% higher
> than on the master), but then it degrades steadily. After 250 clients it's
> already lower than master; see the attached graph. In perf reports the
> CSN-related things have almost vanished, and I see lots of time spent
> working with clog. This is probably the situation where by making some parts
> faster, the contention in other parts becomes worse and overall we have a
> performance loss.

Yeah, this happens sometimes and I have also observed this behavior.

> Hilariously, at some point I saw a big performance
> increase after adding some debug printfs. I wanted to try some things with
> the clog next, but for now I'm out of time.
>

What problem exactly are you seeing in the clog: is it contention
around CLOGControlLock, or is accessing CLOG generally slower?  If the
former, then we already have a patch [1] to address it.


[1] - https://commitfest.postgresql.org/14/358/

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Proposal for CSN based snapshots

From
Alexander Kuzmenkov
Date:
> What problem exactly are you seeing in the clog: is it contention
> around CLOGControlLock, or is accessing CLOG generally slower?  If the
> former, then we already have a patch [1] to address it.
It's the contention around CLogControlLock. Thank you for the pointer, 
next time I'll try it with the group clog update patch.


-- 
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] Proposal for CSN based snapshots

From
Alexander Kuzmenkov
Date:
Here is some news about the CSN patch.

* I merged it with master (58bd60995f), which now has the clog group 
update. With this optimization, CSN is now faster than the master by 
about 15% on 100 to 400 clients (72 cores, pgbench tpcb-like, scale 
500). It does not degrade faster than master as it did before. The 
numbers of clients greater than 400 were not measured.

* Querying for CSN of subtransactions was not implemented in the 
previous version of the patch, so I added it. I tested the performance 
on the tpcb-like pgbench script with some savepoints added, and it was 
significantly worse than on the master. The main culprit seems to be the 
ProcArrayLock taken in GetSnapshotData, GetRecentGlobalXmin, 
ProcArrayEndTransaction. Although it is only taken in shared mode, just 
reading the current lock mode and writing the same value back takes 
about 10% CPU. Maybe we could do away with some of these locks, but 
there is some interplay with imported snapshots and replication slots 
which I don't understand well. I plan to investigate this next.

I am attaching the latest version of the patch and the graphs of 
tps/clients for tpcb and tpcb with savepoints.

-- 
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] Proposal for CSN based snapshots

From
Robert Haas
Date:
On Mon, Sep 25, 2017 at 10:17 AM, Alexander Kuzmenkov
<a.kuzmenkov@postgrespro.ru> wrote:
> Here is some news about the CSN patch.
>
> * I merged it with master (58bd60995f), which now has the clog group update.
> With this optimization, CSN is now faster than the master by about 15% on
> 100 to 400 clients (72 cores, pgbench tpcb-like, scale 500). It does not
> degrade faster than master as it did before. The numbers of clients greater
> than 400 were not measured.

Hmm, that's gratifying.

> * Querying for CSN of subtransactions was not implemented in the previous
> version of the patch, so I added it. I tested the performance on the
> tpcb-like pgbench script with some savepoints added, and it was
> significantly worse than on the master. The main culprit seems to be the
> ProcArrayLock taken in GetSnapshotData, GetRecentGlobalXmin,
> ProcArrayEndTransaction. Although it is only taken in shared mode, just
> reading the current lock mode and writing the same value back takes about
> 10% CPU. Maybe we could do away with some of these locks, but there is some
> interplay with imported snapshots and replication slots which I don't
> understand well. I plan to investigate this next.

That's not so good, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Proposal for CSN based snapshots

From
Alexander Kuzmenkov
Date:
Hi hackers,

Here is a new version of the patch with some improvements, rebased to 
117469006b.

Performance on pgbench tpcb with subtransactions is now slightly better 
than master. See the picture 'savepoints2'. This was achieved by 
removing unnecessary exclusive locking on CSNLogControlLock in 
SubTransSetParent. After that change, both versions are mostly waiting 
on XidGenLock in GetNewTransactionId.
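
For illustration, a change along these lines might look roughly as 
follows, modeled on the usual SLRU idioms (SimpleLruReadPage_ReadOnly 
takes the control lock in shared mode, upgrading to exclusive only if 
the page has to be read in, and returns with it held). CsnlogCtl and 
CSNLogControlLock are assumed names here, and the actual patch may well 
differ:

void
SubTransSetParent(TransactionId xid, TransactionId parent)
{
	int			pageno = TransactionIdToPage(xid);
	int			entryno = TransactionIdToEntry(xid);
	int			slotno;
	TransactionId *ptr;

	/*
	 * Shared control lock instead of the unconditional LW_EXCLUSIVE
	 * acquisition used before.
	 */
	slotno = SimpleLruReadPage_ReadOnly(CsnlogCtl, pageno, xid);

	/*
	 * Writing under the shared lock is safe if each parent pointer is
	 * stored exactly once, by the owning backend, before any other
	 * backend can have a reason to read it.
	 */
	ptr = (TransactionId *) CsnlogCtl->shared->page_buffer[slotno];
	ptr += entryno;
	*ptr = parent;
	CsnlogCtl->shared->page_dirty[slotno] = true;

	LWLockRelease(CSNLogControlLock);
}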

Performance on default pgbench tpcb is also improved. At scale 500, CSN 
is at best 30% faster than master; see the picture 'tpcb500'. These 
improvements come from slightly optimizing GetSnapshotData and 
refreshing RecentGlobalXmin less often. At scale 1500, CSN is slightly 
faster at up to 200 clients, but then degrades steadily; see the picture 
'tpcb1500'. Nevertheless, CSN-related code paths do not show up in perf 
profiles or LWLock wait statistics [1]. I think what we are seeing here 
is again that once some bottlenecks are removed, the fast degradation of 
LWLocks under contention leads to a net drop in performance. With this 
in mind, I tried running the same benchmarks with the patch from Yura 
Sokolov [2], which should improve LWLock performance on NUMA machines. 
Indeed, with this patch CSN starts outperforming master at all measured 
client counts, as you can see in the picture 'tpcb1500'. This LWLock 
change affects the CSN build much more than master, which also suggests 
that we are observing a superlinear degradation of LWLocks under 
increasing contention.

After this I plan to improve the comments, since many of them have 
become out of date, and work on logical replication.

[1] To collect LWLock wait statistics, I sample pg_stat_activity, and 
also use a bcc script by Andres Freund: 

https://www.postgresql.org/message-id/flat/20170622210845.d2hsbqv6rxu2tiye%40alap3.anarazel.de#20170622210845.d2hsbqv6rxu2tiye@alap3.anarazel.de

[2] 

https://www.postgresql.org/message-id/flat/2968c0be065baab8865c4c95de3f435c@postgrespro.ru#2968c0be065baab8865c4c95de3f435c@postgrespro.ru

-- 
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Proposal for CSN based snapshots

From
Alexander Korotkov
Date:
On Mon, Dec 4, 2017 at 6:07 PM, Alexander Kuzmenkov <a.kuzmenkov@postgrespro.ru> wrote:
> Performance on pgbench tpcb with subtransactions is now slightly better than master. See the picture 'savepoints2'. This was achieved by removing unnecessary exclusive locking on CSNLogControlLock in SubTransSetParent. After that change, both versions are mostly waiting on XidGenLock in GetNewTransactionId.
>
> Performance on default pgbench tpcb is also improved. At scale 500, CSN is at best 30% faster than master; see the picture 'tpcb500'. These improvements come from slightly optimizing GetSnapshotData and refreshing RecentGlobalXmin less often. At scale 1500, CSN is slightly faster at up to 200 clients, but then degrades steadily; see the picture 'tpcb1500'. Nevertheless, CSN-related code paths do not show up in perf profiles or LWLock wait statistics [1]. I think what we are seeing here is again that once some bottlenecks are removed, the fast degradation of LWLocks under contention leads to a net drop in performance. With this in mind, I tried running the same benchmarks with the patch from Yura Sokolov [2], which should improve LWLock performance on NUMA machines. Indeed, with this patch CSN starts outperforming master at all measured client counts, as you can see in the picture 'tpcb1500'. This LWLock change affects the CSN build much more than master, which also suggests that we are observing a superlinear degradation of LWLocks under increasing contention.

These results look promising to me.  Could you try benchmarking with more workloads, including read-only and mixed mostly-read workloads?
You can try the same benchmarks I used in my talk about CSN at pgconf.eu [1], slides 19-25 (and you're welcome to invent more benchmarks yourself).

Also, I wonder how the current version of the CSN patch behaves in the worst case, when we have to scan a table with a lot of unique xids (and correspondingly have to do a lot of csnlog lookups)?  See [1], slide 18.  This worst case was a significant part of my motivation to try the "rewrite xid with csn" approach.  Please find the simple extension I used to fill a table with random xmins in the attachment.


------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] Proposal for CSN based snapshots

From
Alexander Kuzmenkov
Date:
On 08/12/17 at 14:59, Alexander Korotkov wrote:
> These results look promising to me.  Could you try benchmarking with more workloads, including read-only and mixed mostly-read workloads?
> You can try the same benchmarks I used in my talk about CSN at pgconf.eu [1], slides 19-25 (and you're welcome to invent more benchmarks yourself).


Sure, here are some more benchmarks.

I had already measured the "skip some updates" and "select-only" pgbench variants, and also a simple "select 1" query. These are for scale 1500 and 20 to 1000 connections. The graphs are attached.
"select 1", which basically benchmarks snapshot taking, shows an impressive twofold increase in TPS over master, but this is to be expected. "select-only" stabilizes at 20% higher than master. It is interesting to note that these select-only scenarios barely degrade with growing client count.
For the "skip some updates" scenario, CSN is slightly slower than master, but this is improved by the LWLock patch I mentioned upthread.

I also replicated the setup from your slides 23 and 25. I used scale 500, client counts of 20-300, and probably the same 72-core Xeon.
Slide 23 shows 22% write and 78% read queries, that is, "-b select-only@9 -b tpcb-like@1". The corresponding picture is called "random.png". The absolute numbers are somewhat lower for my run, but CSN is about 40% faster than master, like the CSN-rewrite variant.
Slide 25 is a custom script called "rrw" with 20 extra read queries. We can see that master has improved a lot since your run, and that the current CSN patch shows the same general behaviour as CSN-rewrite, although it is slower in absolute numbers.

> Also, I wonder how the current version of the CSN patch behaves in the worst case, when we have to scan a table with a lot of unique xids (and correspondingly have to do a lot of csnlog lookups)?  See [1], slide 18.  This worst case was a significant part of my motivation to try the "rewrite xid with csn" approach.  Please find the simple extension I used to fill a table with random xmins in the attachment.


OK, let's summarize how the worst case in question works. It happens when the tuples have a wide range of xmins and no hint bits. All visibility checks then have to go to clog. Fortunately, on master we also set hint bits during these checks, and subsequent scans can determine visibility using just the hint bits and the current snapshot. This makes the subsequent scans run much faster. With CSN, this is not enough: the visibility checks always go to the CSN log for all transactions newer than the global xmin. This can become very slow if there is a long-running transaction that holds back the global xmin.
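
A rough sketch of the resulting check (not the patch's actual code: CSNLogGetCommitCSN and the snapshot's csn field are assumed names, and aborted or in-progress xmins are taken to be filtered out before this point):

static bool
XminVisibleInCSNSnapshot(TransactionId xmin, Snapshot snapshot)
{
	/* Transactions below the global xmin are visible to everyone. */
	if (TransactionIdPrecedes(xmin, RecentGlobalXmin))
		return true;

	/*
	 * On master, a committed hint bit plus the snapshot is enough to
	 * decide visibility.  Here the hint bit only says that xmin
	 * committed, not *when*, so we still need a csnlog lookup to
	 * compare its commit CSN with the snapshot's CSN -- on every scan,
	 * for as long as a long-running transaction holds back the global
	 * xmin.
	 */
	return CSNLogGetCommitCSN(xmin) <= snapshot->csn;
}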

I made a simple test to see these effects. The procedure is as follows. I start a transaction in psql; that will be our long-running transaction. Next, I run pgbench for a minute, and randomize the tuple xmins in "pgbench_accounts" using your extension. Then I run "select sum(abalance) from pgbench_accounts" twice, and record the durations.
Here is a table with the results (50M tuples, 1M transactions for master and 400k for CSN):

Branch    scan 1, s   scan 2, s
--------------------------------
CSN          80          80   
master       13          3.5

So, we are indeed seeing the expected results. The significant slowdown in the presence of a long-running transaction is an important problem for this patch. Even the first scan is much slower, because a CSN log page holds 16 times fewer transactions than a clog page, and we have the same number of buffers (max 128) for them. When the range of xids is wide, we spend most of the time loading and evicting pages. I believe this can be improved by using more buffers for the CSN log. I did some work in this direction, namely made SLRUs use a dynahash table instead of a linear search for page->buffer lookups. This is included in the v8 I posted earlier, but it should probably be done as a separate patch.
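
For reference, here is a minimal sketch of such a lookup using the existing dynahash API (ShmemInitHash, hash_search); the structure and function names are illustrative, not the ones in the patch:

#include "utils/hsearch.h"
#include "storage/shmem.h"

typedef struct SlruPageMapping
{
	int			pageno;		/* hash key: SLRU page number */
	int			slotno;		/* buffer slot currently holding the page */
} SlruPageMapping;

static HTAB *SlruPageHash;

static void
SlruPageHashInit(int nbuffers)
{
	HASHCTL		info;

	memset(&info, 0, sizeof(info));
	info.keysize = sizeof(int);
	info.entrysize = sizeof(SlruPageMapping);

	/* Shared-memory hash sized to the number of SLRU buffers. */
	SlruPageHash = ShmemInitHash("SLRU page->buffer map",
								 nbuffers, nbuffers,
								 &info, HASH_ELEM | HASH_BLOBS);
}

/*
 * Replace the linear scan over shared->page_number[] with an O(1)
 * lookup.  Returns the buffer slot for pageno, or -1 if the page is
 * not resident.
 */
static int
SlruPageLookup(int pageno)
{
	SlruPageMapping *entry;

	entry = (SlruPageMapping *) hash_search(SlruPageHash, &pageno,
											HASH_FIND, NULL);
	return entry ? entry->slotno : -1;
}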

-- 
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company