Thread: UNDO and in-place update

UNDO and in-place update

From
Robert Haas
Date:
PostgreSQL tables and indexes not infrequently become bloated, and
this bloat is difficult to reverse once it has occurred.  When a row
is updated, a new row version is created and the space occupied by the
old version can be eventually be recycled.  Since there is always a
short delay and sometimes a very long delay (hours) between the time
when the new row version is created and the time the old row version
can be removed, tables grow to accommodate storing both the old and
new row versions.  When the old versions are reclaimed, there is no
easy way to shrink the table, because the free space is spread through
the table's data blocks rather than concentrated at the end of the
file.  Moreover, since the space used by old row versions are only
reclaimed lazily -- either by HOT pruning or by VACUUM -- the table
may be further extended even space for the new row version could have
been made available by more timely cleanup.  Indexes have similar
issues: even though in theory there is more latitude to consolidate
free space, we don't always do so, and lazy reclamation of row
versions may cause relation extension.  One factor contributing to
index bloat is that any non-HOT update to a heap tuple requires an
update to every index even if the indexed columns are not updated.

How can we get ahead of this problem?  One approach is to make it
easier to reorganize the table; for example, a variant of CLUSTER that
doesn't block concurrent DML would be useful.  However, that's really
only a stopgap, because it only makes it easier to salvage a table
where internal free space has become badly fragmented. What we want to
do is prevent free space from becoming fragmented in the first place.
Most other relational database systems do this by placing old row
versions in a separate data structure from the table itself where they
can easily be removed when no longer needed.  A row version becomes
"old" when it is updated or deleted; the "new" version enters the main
heap while the "old" version moves elsewhere.

To implement this in PostgreSQL, I believe we would need support for
UNDO.  Most database systems have both REDO -- what PostgreSQL
typically calls the write-ahead log (WAL) -- and UNDO.  REDO is
responsible for restoring physical consistency after a database crash,
while UNDO is responsible for reversing the effects of aborted
transactions.  When a transaction performs an operation (DO), it also
writes it to the write-ahead log (REDO) and records the information
needed to reverse it (UNDO).  If the transaction aborts, UNDO is used
to reverse all changes made by the transaction.  Both the actions
performed (DO) and the UNDO created for those actions are protected by
REDO.  Thus, UNDO must be written to disk at checkpoint time; UNDO
since the most recent checkpoint prior to a crash is regenerated by
WAL replay (REDO).  So, after a crash, we first replay WAL (REDO) and
then reverse out any transactions implicitly aborted by the crash
(UNDO). This basic DO-UNDO-REDO protocol has been well-understood for
decades.  It would likely be worthwhile to implement DO-UNDO-REDO in
PostgreSQL independently of any design for reducing bloat, because it
would provide a systematic framework for handling cleanup actions that
must be performed at the end of recovery. For example, if a
transaction creates a table and, while that transaction is still in
progress, there is an operating system or PostgreSQL crash, the
storage used by the table is permanently leaked.  This could be fixed
by having the operation that creates the table also emit UNDO which
would unlink the backing storage in the event of an abort.

We could have a single shared undo log to which all transactions
write, but the insertion point would almost certainly become a point
of contention. Moreover, it's not very desirable to interleave UNDO
from different transactions, because it makes cleanup difficult.
Suppose one transaction aborts while another continues to run; if
their UNDO records are interleaved, it will be difficult to recover
any space if one set of UNDO records can be discarded but the other
must be kept.  Instead, it seems best to have a separate UNDO log for
each transaction.  Each of these UNDO logs individually will be
written sequentially and will consist of a series of variable-length
records.  When a particular UNDO log is no longer needed, we can
remove the whole thing at once.  Unless the UNDO log spilled to disk
because it was large or because of a checkpoint during the
transaction, this is just a matter of deallocating or recycling
whatever shared memory we're using to store it; otherwise, we'll need
to remove or recycle the file, or the portion of a file, that was used
to persist it on disk.

Once we have UNDO, we could create a new type of heap where we emit
UNDO for each INSERT, UPDATE, or DELETE.  For an INSERT, we will
insert the new write and emit UNDO which will remove it.  For a
DELETE, we will immediately remove the old row version but emit UNDO
which will put it back.  For an UPDATE, we can proceed in two ways,
depending on the situation.  First, we can handle an UPDATE much as if
it were a DELETE combined with an INSERT, removing the old version,
adding the new one, and emitting UNDO to reverse both changes.
Second, in some cases, we can instead perform an in-place update,
modifying the existing record in place and emitting UNDO to restore
the old version.  It will be desirable to use the second strategy as
often as possible, because an in-place update does not bloat the heap.

Of course, there a few problems to solve.  Since the only way to find
an old row version is to look at UNDO, we will need to retain UNDO for
as long as it might contain row versions that some overlapping
transaction needs to see.  This means that we must retain UNDO for (1)
all transactions which are in progress, (2) for aborted transactions
until such time as all of the UNDO actions have been performed, and
(3) for committed transactions until they are all-visible.  Note that
(1) and (2) would be required no matter what; UNDO is fundamentally a
system for tracking actions to performed during transaction abort.
(3) is only required because of our desire to keep old row versions
out of the heap. We could limit the length of time for which UNDO in
category (3) is kept in order to implement "snapshot too old" for this
type of heap.  (Or we could keep it for an extra-long time to
implement time travel.)

Our current heap format requires each tuple to carry a very wide
header: 24-byte header -- or wider if the null bitmap is large. 18 of
those bytes plus assorted flag bits are only there for MVCC purposes.
For static data, all of that space is wasted.  Moreover, we can easily
end up dirtying the same page multiple times to set hint bits and,
later, to freeze XMIN and XMAX, resulting in repeated page writes.
With UNDO, we can do better.  If all UNDO for a given page have been
removed, then all aborted transactions have been undone and all
committed transactions are all-visible; thus, every tuple on that page
is all-visible.  Therefore, the page itself does not need to contain
any per-tuple visibility information; it only needs to be possible to
find any UNDO pertaining to that page.  Since the UNDO will be removed
at exactly the same time that the visibility information is no longer
needed, so we can store any required visibility information in the
UNDO itself.

Just as a particular WAL record is referenced using a byte position, a
particular UNDO record can be referenced by uniquely identifying the
transaction and the byte position within that transaction's UNDO log.
To uniquely identify a particular transaction, it seems we will need 8
bytes: the 4-byte epoch and the 4-byte XID.  To uniquely identify a
position within that transaction's UNDO log, we will perhaps need
another 8 bytes.  Thus, in total, an UNDO pointer will be 16 bytes.
We could perhaps reduce the size to 12 bytes by stipulating that a
transaction can only write 4GB of UNDO using a particular XID; if it
needed to generate more UNDO than that, it would need to allocate 1
additional XID for every 4GB of UNDO generated.  Pages in this new
type of heap will have space to store an UNDO pointer (epoch + XID +
byte offset) in the page header.  UNDO pointers where the XID is 0, 1,
or 2 cannot refer to a real UNDO log, because those XIDs are reserved
by PostgreSQL. Accordingly, we can reserve the all-zeroes UNDO pointer
as an invalid UNDO pointer.  Also, let's reserve all UNDO pointers
where the XID is 1 as temporary page data (TPD) pointers.  When a page
has been recently modified by only one transaction, the page's UNDO
pointer points to the most recent UNDO record generated by that
transaction for that page.  There's no need to reset the pointer when
the UNDO log is removed; instead, we interpret a pointer to an UNDO
log which no longer exists as a sure sign that the page is all-frozen.
When a page has been recently modified by more than one transaction,
the page's UNDO pointer is a TPD pointer.

A TPD pointer is a 64-bit reference to a TPD record.  An UNDO pointer
will be either 12 or 16 bytes, so there's definitely room to store 64
additional bits there in addition to the XID of 1.  TPD records are
stored in the TPD log, which is managed much like the write-ahead log:
records are accessed by byte offset, the log begins at byte position 0
and grows upward forever, and the older portions of the log can
discarded once they are no longer needed. Unlike WAL, however, an
already-written TPD record can be modified in place. New TPD records
and changes to TPD records must be protected by WAL.  A TPD will
contain one UNDO pointer per recent transaction (in progress, aborted
but not yet cleaned up, committed but not yet all-visible), indicating
the most recent change to the page by that transaction. Therefore,
it's always possible to find all of the recent transactions which have
modified a page and the most recent change by each one.  Each UNDO
record for a particular page should contain a back-pointer to the
prior change to that same page by the same transaction, if any, so
once we've found the most recent change to the page by each recent
transaction we can use these back-pointers to find all of the other
ones as well.  And that lets us find any old tuple versions we need.

If a transaction modifies a page whose existing UNDO pointer is
invalid, or one whose existing UNDO pointer points into its own UNDO
log, it can just overwrite the existing UNDO pointer with a pointer to
the new change. Likewise, if the UNDO pointer is a TPD pointer and the
modifying transaction is already present in the TPD record, it can
just update the TPD record with the UNDO pointer for the new
modification.  Alternatively, if the UNDO pointer is a TPD pointer and
one of the transactions in the TPD is no longer recent, it can grab
that slot in the existing TPD for its own UNDO pointer.  However, if
the page points to some other transaction's UNDO log and that
transaction is still recent, or if the page points to a TPD all of
whose members are still recent, then the modifying transaction must
create a new and larger TPD for the page.  The old TPD can then be
marked as garbage and reclaimed. Otherwise, a TPD entry can be
reclaimed when every UNDO pointer it contains points to an UNDO log
which has already been discarded.  We can treat a page with a pointer
to a TPD entry which has been discarded just like a page with a
pointer to an UNDO log that no longer exists: the page is all-visible.

In-place updates are fairly obviously safe and possible when no index
columns have been modified and the new tuple is no larger than the old
one.  The indexes don't need to know anything about the change, and
the new tuple will definitely fit.  In other cases, a bit more thought
is needed.  If the new tuple is larger than the old one, it may still
be possible to do the update in place.  The simplest case is when
there happens to be unused space following the existing tuple, and the
new and larger tuple will fit into that space.  In that case, we do
not need to do anything special. Alternatively, it may be that there's
enough free space elsewhere on the page to accommodate the new tuple.
In that case, we could take a cleanup lock and reorganize the page.
However, if we do, we must be careful to leave enough space to permit
a successful UNDO.  For example, imagine a completely full page where
we try to shrink one tuple in one transaction and enlarge a different
tuple in some other transaction.  If we allow both updates to be in
place, we will have a serious problem if the shrinking transaction
aborts and the enlarging transaction commits.

If any indexed columns are changed, an in-place update might confuse
the index AM.  Index AMs like BRIN which do not store TIDs don't care,
but those that do will get confused, because there will now be entries
for multiple index values pointing to the same TID.  That could
possibly be handled by forbidding index-only scans and requiring
rechecks in all cases, but that would be quite inefficient.  Instead,
we would likely wish to adopt the solution used by other database
systems which work on principles similar to those described here:
whenever we update or delete a tuple, use the old tuple to re-find any
index entries that might no longer be valid, and apply a delete-mark
to them.  When scanning an index, perform rechecks for all
delete-marked index items.  Claudio's proposal to keep otherwise-equal
btreek ys sorted by TID is probably essential for delete-marking to
perform well.  Note that delete-marking isn't strictly required: we
could simply choose not to do in-place updates any time an indexed
column is modified.  However, that leaves a lot of performance on the
table, because HOT updates are already fairly efficient; we want to
also reduce the bloat associated with updates that DO touch indexed
columns, and the write amplification that comes from touching every
index.

So, what are the pros and cons of this system?

- Performing updates in place where possible prevents bloat from being
created. It does not prevent all bloat: aside from any updates that
are not dead in place, a transaction that inserts many rows and aborts
or deletes many rows and commits will still leave the table larger
than the optimum size.  However, many workloads have more updates than
they do inserts or deletes, so this is probably a significant win.

- Old tuple versions are removed from the heap eagerly (by UNDO
cleanup at the end of a transaction) rather than lazily (by HOT
pruning and VACUUM).  This increases the chance of being able to reuse
internal free space rather than extending the relation.  Also, it's
probably better to incur the overhead of cleaning up after an abort
immediately rather than hours, days, weeks, or months later when
VACUUM kicks in; VACUUM behavior often surprises users.  Better yet,
in the case where most transactions commit, we really don't really
need to do any after-the-fact cleanup at all; the only thing that we
have to do in that case is throw away the UNDO logs for transactions
that have become all-visible, which is hopefully a cheap operation.

- Reading a page that has been recently modified gets significantly
more expensive; it is necessary to read the associated UNDO entries
and do a bunch of calculation that is significantly more complex than
what is required today. The problem is more severe when TPD entries
are required.  There are probably some steps that can be taken to
mitigate this problem, such as setting aside a bit per tuple to mean
"this tuple has not been recently modified", or caching additional
details in the TPD when one is used, but it's unlikely to be possible
to eliminate it completely.

- Most things that could cause a page to be rewritten multiple times
are eliminated.  Tuples no longer need to be frozen; instead, pages
are implicitly frozen by the removal of associated UNDO.  Similarly,
hint bits are gone.  The page-level all-visible bit needs some
thought; we'll still need a visibility map to support index-only
scans, but we probably don't want the page level bit any more, because
rewriting the whole heap just to set that one bit would be terrible.

- Delete-marking makes deletes more expensive, because they have to
delete-mark all of the index entries.  Updates are more expensive when
all of the index columns change, because now each index needs to be
updated in two ways (new index entries and delete-marking of old ones)
rather than one.  However, if some indexed columns are updated but
only a small minority of the indexes on the table include those
columns, updates are substantially cheaper, because new index entries
and delete-marking of old ones are required only for those indexes
where some column has been updated.  On the whole, it seems likely
that we'd come out ahead here on many workloads, because updates that
change only one or a few indexed columns are probably much more common
than updates which change them all.

- Delete-marking assumes that we’re able to re-find index tuples.  The
system today does not require such an assumption.

- Tuple headers become much smaller.  The page header becomes bigger,
but there's only one page header per page, and there are many tuple
headers per page, so on average we should come out ahead.

- Transaction abort can be lengthy if much UNDO is required.  This
problem can be mitigated by pushing the UNDO steps into the
background.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: UNDO and in-place update

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> [ Let's invent Oracle-style UNDO logs ]

I dunno.  I remember being told years ago, by an ex-Oracle engineer,
that he thought our approach was better.  I don't recall all the details
of the conversation but I think his key point was basically this:

> - Reading a page that has been recently modified gets significantly
> more expensive; it is necessary to read the associated UNDO entries
> and do a bunch of calculation that is significantly more complex than
> what is required today.

Oracle spends a lot of time on this, and it's really cache-inefficient
because the data is spread all over.  This was what this guy felt in
circa 2001; I'd have to think that the cache unfriendliness problem is
much worse for modern hardware.

Which is not to say that it's not worth experimenting with.  But what you
describe is going to be a huge amount of work even to get to the point
where we could try to measure the actual costs and benefits :-(

Heikki's been fooling with some ideas that I think have more promise.
I wish he'd get to the point of presenting them publicly rather than
just over beers at conferences.
        regards, tom lane



Re: UNDO and in-place update

From
Robert Haas
Date:
On Tue, Nov 22, 2016 at 10:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> [ Let's invent Oracle-style UNDO logs ]
>
> I dunno.  I remember being told years ago, by an ex-Oracle engineer,
> that he thought our approach was better.  I don't recall all the details
> of the conversation but I think his key point was basically this:
>
>> - Reading a page that has been recently modified gets significantly
>> more expensive; it is necessary to read the associated UNDO entries
>> and do a bunch of calculation that is significantly more complex than
>> what is required today.
>
> Oracle spends a lot of time on this, and it's really cache-inefficient
> because the data is spread all over.  This was what this guy felt in
> circa 2001; I'd have to think that the cache unfriendliness problem is
> much worse for modern hardware.

Yes, that's the main thing I'm worried about with this design.  Now,
bloat is pretty cache-inefficient too.  If our table is 2x the size of
Oracle's table, they can spend more cycles per page and still do
fairly well.  If that 2x increment pushes us out of RAM, they can
spend basically all the CPU time they want and still win handily.  But
that isn't to say that this concern isn't justified.

> Which is not to say that it's not worth experimenting with.  But what you
> describe is going to be a huge amount of work even to get to the point
> where we could try to measure the actual costs and benefits :-(

Yeah.

> Heikki's been fooling with some ideas that I think have more promise.
> I wish he'd get to the point of presenting them publicly rather than
> just over beers at conferences.

That would be good, too!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: UNDO and in-place update

From
Peter Geoghegan
Date:
On Tue, Nov 22, 2016 at 7:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Heikki's been fooling with some ideas that I think have more promise.
>> I wish he'd get to the point of presenting them publicly rather than
>> just over beers at conferences.
>
> That would be good, too!

I was told that TED was kind of formally proposed in the "MultiXact
hindsight design" thread.


-- 
Peter Geoghegan



Re: UNDO and in-place update

From
Robert Haas
Date:
On Tue, Nov 22, 2016 at 10:41 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Nov 22, 2016 at 7:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Heikki's been fooling with some ideas that I think have more promise.
>>> I wish he'd get to the point of presenting them publicly rather than
>>> just over beers at conferences.
>>
>> That would be good, too!
>
> I was told that TED was kind of formally proposed in the "MultiXact
> hindsight design" thread.

I read that, but:

1. Nothing's really come of that, AFAICS, and

2. It doesn't allow for in-place update, and

3. It doesn't eliminate the need for freezing.

Implementing TED would probably have been better than doing fkeylocks
the way we did, but now that we've mostly stabilized the multixact
work that was actually done, I have a hard time imagining that we
really want to revamp the whole thing without fixing any of the more
fundamental problems with our on-disk format.  In my mind, the bloat
created by storing both the old and new versions of an updated tuple
in the heap, and the need to rewrite pages multiple times to set hint
bits, set all-visible, and ultimately freeze are the three-ton
gorillas in the corner.  I am not arguing that what I've outlined here
is the only way to fix those problems or even the best way, just that
it isn't really worth the trouble of doing major surgery if we aren't
going to make a dent in those problems.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: UNDO and in-place update

From
Peter Geoghegan
Date:
On Tue, Nov 22, 2016 at 7:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> - Reading a page that has been recently modified gets significantly
>> more expensive; it is necessary to read the associated UNDO entries
>> and do a bunch of calculation that is significantly more complex than
>> what is required today.

Someone told me that there is something called an interested
transaction list stored in the page header, and from that I infer that
isn't *really* true. I think that unless you're interested in an
affected row, rather than just some row that happens to be on the same
page, you don't really have to worry about it.

> Oracle spends a lot of time on this, and it's really cache-inefficient
> because the data is spread all over.  This was what this guy felt in
> circa 2001; I'd have to think that the cache unfriendliness problem is
> much worse for modern hardware.

I imagine that temporal locality helps a lot. Most snapshots will be
interested in the same row version.

-- 
Peter Geoghegan



Re: UNDO and in-place update

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> On Tue, Nov 22, 2016 at 7:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Oracle spends a lot of time on this, and it's really cache-inefficient
>> because the data is spread all over.  This was what this guy felt in
>> circa 2001; I'd have to think that the cache unfriendliness problem is
>> much worse for modern hardware.

> I imagine that temporal locality helps a lot. Most snapshots will be
> interested in the same row version.

Yeah, but the problem is that all those transactions have to scavenge
through a big chunk of UNDO log to find out what they need to know.

Ultimately, I doubt that update-in-place buys much that we don't already
have with HOT updates (which postdate this old conversation, btw).
If you want MVCC semantics, you need to hold both versions of the tuple
*somewhere*.  Oracle's solution is different from ours, but I'm not
exactly convinced that it's better.  It makes some aspects better and
others worse.
        regards, tom lane



Re: UNDO and in-place update

From
Amit Kapila
Date:
On Wed, Nov 23, 2016 at 9:32 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Nov 22, 2016 at 7:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> - Reading a page that has been recently modified gets significantly
>>> more expensive; it is necessary to read the associated UNDO entries
>>> and do a bunch of calculation that is significantly more complex than
>>> what is required today.
>
> Someone told me that there is something called an interested
> transaction list stored in the page header, and from that I infer that
> isn't *really* true. I think that unless you're interested in an
> affected row, rather than just some row that happens to be on the same
> page, you don't really have to worry about it.
>

Yeah, so basically if there is an effect of any transaction which is
not visible to the snapshot of transaction reading the page, you need
to do something to read the old row/rows present on that page.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: UNDO and in-place update

From
Peter Geoghegan
Date:
On Tue, Nov 22, 2016 at 7:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> This basic DO-UNDO-REDO protocol has been well-understood for
> decades.

FWIW, while this is basically true, the idea of repurposing UNDO to be
usable for MVCC is definitely an Oracleism. Mohan's ARIES paper says
nothing about MVCC.


-- 
Peter Geoghegan



Re: UNDO and in-place update

From
Amit Kapila
Date:
On Wed, Nov 23, 2016 at 9:48 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <pg@heroku.com> writes:
>> On Tue, Nov 22, 2016 at 7:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Oracle spends a lot of time on this, and it's really cache-inefficient
>>> because the data is spread all over.  This was what this guy felt in
>>> circa 2001; I'd have to think that the cache unfriendliness problem is
>>> much worse for modern hardware.
>
>> I imagine that temporal locality helps a lot. Most snapshots will be
>> interested in the same row version.
>
> Yeah, but the problem is that all those transactions have to scavenge
> through a big chunk of UNDO log to find out what they need to know.
>

I think that depends on how we design to read the old data, if every
transaction has to go through UNDO log then it can impact the
performance, however, if we design such that the transaction can take
help from the previous transaction which has gone through undo log,
then it will be much better. For that to happen, I think we need some
kind of page level MVCC and that should be possible in design proposed
above where the latest transactions information is present at the page
level.

> Ultimately, I doubt that update-in-place buys much that we don't already
> have with HOT updates (which postdate this old conversation, btw).
> If you want MVCC semantics, you need to hold both versions of the tuple
> *somewhere*.  Oracle's solution is different from ours, but I'm not
> exactly convinced that it's better.  It makes some aspects better and
> others worse.
>

I think not only Oracle but some of the other database systems like
SQL Server (and IIRC MySQL) does a better job in terms of bloat
management (by keeping it separate from main Heap) which is why in
many cases they are preferred over PostgreSQL.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: UNDO and in-place update

From
Peter Geoghegan
Date:
On Tue, Nov 22, 2016 at 8:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ultimately, I doubt that update-in-place buys much that we don't already
> have with HOT updates (which postdate this old conversation, btw).
> If you want MVCC semantics, you need to hold both versions of the tuple
> *somewhere*.  Oracle's solution is different from ours, but I'm not
> exactly convinced that it's better.  It makes some aspects better and
> others worse.

I agree that HOT is roughly as effective, at least when you get mostly
HOT updates.

I think that there is an under-appreciated issue with index bloat and
VACUUM. Bloat in the sense of a B-Tree becoming unbalanced. I think
that short term bursts of non-HOT updates bloat the *keyspace* of
B-Trees, making them unbalanced. A short term burst (possibly
associated with one long running transaction that prevents pruning)
can cause long term damage to the key space.

The best thing by far about an alternative design like this is that it
performs *consistently*. I believe that we will lose on real-world
cases that play to the strengths of the current design. That doesn't
mean that it isn't worth it, but it does make it exceptional
difficult.

-- 
Peter Geoghegan



Re: UNDO and in-place update

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> The best thing by far about an alternative design like this is that it
> performs *consistently*.

Really?  I think it just moves the issues somewhere else.
        regards, tom lane



Re: UNDO and in-place update

From
Mark Kirkwood
Date:
On 23/11/16 16:31, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> [ Let's invent Oracle-style UNDO logs ]
>
> I dunno.  I remember being told years ago, by an ex-Oracle engineer,
> that he thought our approach was better.  I don't recall all the details
> of the conversation but I think his key point was basically this:
>
>> - Reading a page that has been recently modified gets significantly
>> more expensive; it is necessary to read the associated UNDO entries
>> and do a bunch of calculation that is significantly more complex than
>> what is required today.
>

Also ROLLBACK becomes vastly more expensive than COMMIT (I can recall 
many years ago when I used to be an Oracle DBA reading whole chapters of 
novels waiting for failed batch jobs to roll back).

However I'd like to add that I agree this is worth looking at, as 
ideally it would be great to be able to choose whether to have No-UNDO 
or UNDO on a table by table basis...

regards

Mark




Re: UNDO and in-place update

From
Pavan Deolasee
Date:


On Wed, Nov 23, 2016 at 10:07 AM, Peter Geoghegan <pg@heroku.com> wrote:
On Tue, Nov 22, 2016 at 8:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ultimately, I doubt that update-in-place buys much that we don't already
> have with HOT updates (which postdate this old conversation, btw).
> If you want MVCC semantics, you need to hold both versions of the tuple
> *somewhere*.  Oracle's solution is different from ours, but I'm not
> exactly convinced that it's better.  It makes some aspects better and
> others worse.

I agree that HOT is roughly as effective, at least when you get mostly
HOT updates.


I agree. We'd tried update-in-place before writing HOT and it turned out to be complex and error-prone [1]. I am not suggesting that we can't do a better job this time, but as Tom said, it's going to be lot of work. If WARM makes any progress, it will also help addressing some of the issues around index bloats when HOT updates can not be done.

One key issue that I see is that unrelated long running transactions end up holding up OldestXmin advances and that leads to significant bloat to update-heavy tables. If we could somehow address that situation, that may still go a long way in containing bloat. Of course, there may still be cases where bloat will be unavoidable, but it seems far lesser work to me that a complete overhaul of the MVCC working.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: UNDO and in-place update

From
Peter Geoghegan
Date:
On Tue, Nov 22, 2016 at 8:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <pg@heroku.com> writes:
>> The best thing by far about an alternative design like this is that it
>> performs *consistently*.
>
> Really?  I think it just moves the issues somewhere else.

Definitely, yes.

* HOT is great and all, but HOT-safety is a pretty onerous thing to
ask users to worry about. They don't know anything about it. Maybe
WARM eventually helps with this -- I don't know about that.

* It's bad that the effectiveness of index-only scans is strongly
influenced by the visibility map. And, autovacuum doesn't care about
index-only scans (nor should it, I suppose).

* The high watermark size of a table is much higher for an
update-mostly workload. When bloat is localized to certain values in
indexes, it's a lot worse.

* Our behavior with many duplicates in secondary indexes is pretty bad
in general, I suspect. I think we do badly at amortizing inserts into
indexes from updates (we dirty more leaf pages than we really need
to). I think we could do better at making LP_DEAD style IndexTuple
recycling more effective.

* Most obviously, autovacuum can create significant load at
inconvenient times. Paying that cost in a piecemeal fashion has a
certain appeal.

For what it's worth, I am not planning to work on this.

-- 
Peter Geoghegan



Re: UNDO and in-place update

From
Albe Laurenz
Date:
Robert Haas wrote:
> To implement this in PostgreSQL, I believe we would need support for
> UNDO.

> - Reading a page that has been recently modified gets significantly
> more expensive; it is necessary to read the associated UNDO entries
> and do a bunch of calculation that is significantly more complex than
> what is required today.

As others have said, that is the main drawback.

Also, and related to this, ROLLBACK would become an expensive operation.

Yours,
Laurenz Albe

Re: UNDO and in-place update

From
Robert Haas
Date:
On Tue, Nov 22, 2016 at 11:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <pg@heroku.com> writes:
>> On Tue, Nov 22, 2016 at 7:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Oracle spends a lot of time on this, and it's really cache-inefficient
>>> because the data is spread all over.  This was what this guy felt in
>>> circa 2001; I'd have to think that the cache unfriendliness problem is
>>> much worse for modern hardware.
>
>> I imagine that temporal locality helps a lot. Most snapshots will be
>> interested in the same row version.
>
> Yeah, but the problem is that all those transactions have to scavenge
> through a big chunk of UNDO log to find out what they need to know.

In my design, there's one UNDO log per transaction, so if a page has
been recently modified only by a single transaction whose effects are
visible to your snapshot, you can very quickly decide that you don't
need to look at UNDO at all.  If it's been recently modified by
multiple transactions all of which are visible to your snapshot, you
can consult the TPD and then decide that none of those transactions
are interesting and therefore their UNDO doesn't need to be examined.

When there are transactions that whose effects are not visible to your
snapshot, the cost is proportional to the number of times those
transactions have modified the page.  Therefore, the "bad" case for
this design is when the same page gets updated a large number of times
by one or more concurrent transactions.  If the transaction you can't
see has made 437 separate modifications to the page, you're going to
need to look at all 437 UNDO records to figure out what you can see.
That is likely to suck.

I think that one of the crucial questions for this design is how often
that will happen.  For people with mostly static data, it should be
rare.  In something like a pgbench workload, it boils down to how
often multiple transactions hit the same page before the previous
transactions that hit that page become all-visible.  Of course, if you
have too many backends relative to the scale factor, pgbench is going
to become lock-bound anyway and it probably won't matter much, but
it's about two orders of magnitude easier to hit the same page than to
hit the same tuple, so it could be that the point where you start
incurring significant costs for visibility checking and page
reconstruction happens well before the tuple lock contention becomes
noticeable.  It might be worth doing some mathematical modeling to try
to estimate how serious that effect is likely to be.

There are also some things that you can do to mitigate this, although
they add even more complexity.  For example:

- If the same transaction touches the same page repeatedly, you could
make a rule that every 5th or every 10th or every 20th modification to
the same page within the same transaction emits a larger summary
record into the UNDO that is a consolidated record of all changes made
to the page by that transaction thus far.  That penalizes writers, but
helps readers; when a reader reaches a summary record, it need look no
further.

- Peter alluded to the possibility - or so I thought - of caching
reconstructed page versions.  That would make a lot of sense for
sequential scans or cases where the same page is being accessed
repeatedly.

- If you're doing an index scan and you only care about one tuple, the
page could contain one bit per tuple which is set if the tuple has
been modified by any transaction the page views as recent.  If the bit
is clear for the one tuple you care about, you can ignore the UNDO
even if is for a transaction not visible to your snapshot, because you
know it didn't touch that tuple.

- If you create a TPD entry for the page because there are multiple
recent transactions that have modified it, you can put additional
information into the TPD entry to speed up access.  For example, you
could store an array indexed by itemid which indicates which of the
recent transactions have touched any given itemid.  This is mostly
useful for index scans, which can then ignore all the transactions
that don't pertain to the one TID they care about.

Which of those ideas are best?  Are they enough, even if we do them
all?  Will there be other down sides?  I don't know.  But I think it's
clear that bloat is an ongoing and serious concern for our users (cf.
snapshot too old), and that vacuuming is too (cf. freeze map) and I
believe we are reaching a point where we can't make much further
improvement in those areas without some kind of major architectural
change.  WARM might qualify; this idea certainly does.  I also believe
that we NEED to improve further.  At least for EnterpriseDB, the
problems I'm trying to address through this design are big problems.
If this design isn't good, let's come up with something else.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: UNDO and in-place update

From
Thomas Munro
Date:
On Wed, Nov 23, 2016 at 6:01 PM, Peter Geoghegan <pg@heroku.com> wrote:
> * Our behavior with many duplicates in secondary indexes is pretty bad
> in general, I suspect.

From the pie-in-the-sky department:  I believe there are
snapshot-based systems that don't ever have more than one entry for a
logical row in a btree.  Instead, they do UNDO-log based snapshot
reconstruction for btrees too, generalising what is being discussed
for heap tables in this thread.  That means that whenever you descend
a btree, at each page you have to be prepared to go and look in the
UNDO log if necessary on a page-by-page basis.  Alternatively, some
systems use a shadow table rather than an UNDO log for the old
entries, but still privilege queries that need the latest version by
keeping only those in the btree.  I don't know the details but suspect
that both of those approaches might require CSN (= LSN?) based
snapshots, so that you can make page-level visibility determinations
while traversing the 'current' btree based on a page header.  That
would reduce both kinds of btree bloat: the primary kind of being
entries for dead tuples that still exist in the heap, and the
secondary kind being the resultant permanent btree fragmentation that
remains even after vacuum removes them.  Presumably once you have all
that, you can allow index-only-scans without a visibility map.  Also,
I suspect that such a "3D time travelling" btree would be a
requirement for index ordered tables in a snapshot-based RDBMS.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: UNDO and in-place update

From
"Tsunakawa, Takayuki"
Date:
IMHO, overall, there should be pros and cons of the current approach and the new UNDo one (like Oracle?), depending on
theworkload.  Under update-heavy workload, the UNDO method may be better.  OTOH, under the mostly-INSERT workload (like
datawarehouse?), the current method will be better because it writes no log for UNDO.
 

Furthermore, it maybe the best to be able to switch the method for each table and/or tablespace.  For example, in
pgbench,history table uses the current method,
 
and other tables use the UNDO method.  Is it time to introduce a pluggable storage system?

Because PostgreSQL is a follower in the UNDO approach, I think it will be better to study other DBMSs well (Oracle and
MySQL?). That includes not only their manuals, but also whitepapers and books.  Especially, I expect good books to give
deepknowledge on performance tuning and troubleshooting, from which we will be able to know the cons that Oracle's
materialsdon't state.
 

Regards
Takayuki Tsunakawa






Re: UNDO and in-place update

From
Peter Geoghegan
Date:
On Wed, Nov 23, 2016 at 11:32 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> IMHO, overall, there should be pros and cons of the current approach and the new UNDo one (like Oracle?), depending
onthe workload.  Under update-heavy workload, the UNDO method may be better.  OTOH, under the mostly-INSERT workload
(likedata warehouse?), the current method will be better because it writes no log for UNDO. 

I believe that you are correct about that.


--
Peter Geoghegan



Re: UNDO and in-place update

From
Bruce Momjian
Date:
On Wed, Nov 23, 2016 at 11:35:38PM -0800, Peter Geoghegan wrote:
> On Wed, Nov 23, 2016 at 11:32 PM, Tsunakawa, Takayuki
> <tsunakawa.takay@jp.fujitsu.com> wrote:
> > IMHO, overall, there should be pros and cons of the current approach and the new UNDo one (like Oracle?), depending
onthe workload.  Under update-heavy workload, the UNDO method may be better.  OTOH, under the mostly-INSERT workload
(likedata warehouse?), the current method will be better because it writes no log for UNDO.
 
> 
> I believe that you are correct about that.

This is probably similar to our use of double-buffering, using
shared_buffers and the kernel cache --- in some cases, the
double-buffering is better (variable workloads), while in other cases a
single cache is best.

We have had trouble figuring out if we need to support both single and
double caching methods, and when to recommend one over the other.  Seems
UNDO would have the same complexity.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: UNDO and in-place update

From
Greg Stark
Date:
On 23 November 2016 at 04:28, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Nov 22, 2016 at 7:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> This basic DO-UNDO-REDO protocol has been well-understood for
>> decades.
>
> FWIW, while this is basically true, the idea of repurposing UNDO to be
> usable for MVCC is definitely an Oracleism. Mohan's ARIES paper says
> nothing about MVCC.


Fwiw, Oracle does not use the undo log for snapshot fetches. It's used
only for transaction rollback and recovery.

For snapshot isolation Oracle has yet a *third* copy of the data in a
space called the "rollback segment(s)". When you update a row in a
block you save the whole block in the rollback segment. When you try
to access a block you check if the CSN -- which is basically
equivalent to our LSN -- is newer than your snapshot and if it is you
fetch the old version of the block from the rollback.

Essentially their MVCC is done on a per-block level rather than a
per-row level and they keep only the newest version of the block in
the table, the rest are in the rollback segment.  For what it's worth
I think our approach is cleaner and more flexible. They had a lot of
trouble with their approach over the years and it works well only
because they invested an enormous amount of development in it and also
because people throw a lot of hardware at it too.

I think the main use case we have trouble with is actually the "update
every row in the table" type of update which requires we write to
every block, plus a second copy of every block, plus write full pages
of both copies, then later set hint bits dirtying pages again and
generating more full pages writes, then later come along and vacuum
which requires two more writes of every block, etc. If we had a
solution for the special case of an update that replaces every row in
a page that I think would complement HOT nicely and go a long way
towards fixing our issues.

Incidentally the "Interested transaction list" is for locking rows for
updates and it's basically similar to what we've discussed before of
having a "most frequent xmin" in the header and then a bit indicating
the xmin is missing from the row header. Except in their case they
don't need it for the actual xmin/xmax because their visibility is
done per-block, only the transient lock state

-- 
greg



Re: UNDO and in-place update

From
Robert Haas
Date:
On Wed, Nov 23, 2016 at 5:18 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Wed, Nov 23, 2016 at 6:01 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> * Our behavior with many duplicates in secondary indexes is pretty bad
>> in general, I suspect.
>
> From the pie-in-the-sky department:  I believe there are
> snapshot-based systems that don't ever have more than one entry for a
> logical row in a btree.  Instead, they do UNDO-log based snapshot
> reconstruction for btrees too, generalising what is being discussed
> for heap tables in this thread.  That means that whenever you descend
> a btree, at each page you have to be prepared to go and look in the
> UNDO log if necessary on a page-by-page basis.  Alternatively, some
> systems use a shadow table rather than an UNDO log for the old
> entries, but still privilege queries that need the latest version by
> keeping only those in the btree.  I don't know the details but suspect
> that both of those approaches might require CSN (= LSN?) based
> snapshots, so that you can make page-level visibility determinations
> while traversing the 'current' btree based on a page header.  That
> would reduce both kinds of btree bloat: the primary kind of being
> entries for dead tuples that still exist in the heap, and the
> secondary kind being the resultant permanent btree fragmentation that
> remains even after vacuum removes them.  Presumably once you have all
> that, you can allow index-only-scans without a visibility map.  Also,
> I suspect that such a "3D time travelling" btree would be a
> requirement for index ordered tables in a snapshot-based RDBMS.

I don't really understand how a system of this type copes with page
splits.  Suppose there are entries 1000, 2000, ..., 1e6 in the btree.
I now start two overlapping transactions, one inserting all the
positive odd values < 1e6 and the other inserting all the positive
even values < 1e6 that are not already present.  The insertions are
interleaved with each other.  After they get done inserting, what does
the 3D time traveling btree look like at this point?

The problem I'm getting at here is that the heap, unlike an index, is
unordered.  So when I've got all values 1..1e6, they've got to be in a
certain order.  Necessarily, the values from the in-flight
transactions are interleaved with each other and with the old data.
If they're not, I can no longer support efficient searches by the
in-flight transactions, plus I have an unbounded amount of cleanup
work to do at commit time.  But if I *do* have all of those values
interleaved in the proper order, then an abort leaves fragmented free
space.  I can go and kill all of the inserted tuples during UNDO of an
aborted transactions, but because page splits have happened, the index
tuples I need to kill may not be on the same pages where I inserted
them, so I might need to just search from the top of the tree, so it's
expensive.  And even if I do it anyway, the pages are left half-full.
The point is that the splits are the joint result of the two
concurrent transactions taken together - you can't blame individual
splits on individual transactions.

Another way to put this is - if you could make all of the index
operations appear to happen instantaneously at commit time, then I see
how we could make what you're talking about work.  And if you never
needed to split pages, then you could do that in the same way that you
can do it for the heap.  But that's not realistic.

Even for an UNDO-based heap, we have a mild form of this problem.
Suppose we have a full page and delete a tuple, creating free space.
Now another transaction inserts a tuple, using up that free space.
Well, if the first transaction aborts and the second one commits,
we've got trouble.  So we can't allow that.  We have to keep track of
the amount of space that could potentially be gobbled up by UNDO
actions and leave enough space on the page to guarantee that such
actions will succeed.  If that means that some new tuple has to be
placed on a different page instead, so be it.  (This could create some
minor bloat, but I think it's not really that bad.)  For an index, we
can't manage so easily, because we don't have the same kind of
flexibility about where to put inserted tuples.

Do you see some trick here that I'm missing?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: UNDO and in-place update

From
Bruce Momjian
Date:
On Thu, Nov 24, 2016 at 04:06:14PM +0000, Greg Stark wrote:
> For snapshot isolation Oracle has yet a *third* copy of the data in a
> space called the "rollback segment(s)". When you update a row in a
> block you save the whole block in the rollback segment. When you try
> to access a block you check if the CSN -- which is basically
> equivalent to our LSN -- is newer than your snapshot and if it is you
> fetch the old version of the block from the rollback.

I assume they can do this with good performance because, unlike UNDO,
they don't need to fsync the pages kept for snapshot visibility --- if
the server crashes, you don't need them.

> Essentially their MVCC is done on a per-block level rather than a
> per-row level and they keep only the newest version of the block in
> the table, the rest are in the rollback segment.  For what it's worth
> I think our approach is cleaner and more flexible. They had a lot of
> trouble with their approach over the years and it works well only
> because they invested an enormous amount of development in it and also
> because people throw a lot of hardware at it too.
> 
> I think the main use case we have trouble with is actually the "update
> every row in the table" type of update which requires we write to
> every block, plus a second copy of every block, plus write full pages
> of both copies, then later set hint bits dirtying pages again and
> generating more full pages writes, then later come along and vacuum
> which requires two more writes of every block, etc. If we had a
> solution for the special case of an update that replaces every row in
> a page that I think would complement HOT nicely and go a long way
> towards fixing our issues.

Any ideas how we could optimize the update-all workload?  Create a new
heap file and indexes?

Our previous worst-case workload was a single row that was updated
repeatedly, but HOT fixed that for non-indexed rows, and WARM will
improve it for indexed rows.

One of the big takeaways when writing my MVCC talk is that no matter how
much we optimize UPDATE, we still need cleanup for deletes and aborted
transactions.

I am hopeful we can continue reducing the number of times we write a
page  for maintenance purposes.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: UNDO and in-place update

From
Thomas Kellerer
Date:
> FWIW, while this is basically true, the idea of repurposing UNDO to be
> usable for MVCC is definitely an Oracleism. Mohan's ARIES paper says
> nothing about MVCC.
> For snapshot isolation Oracle has yet a *third* copy of the data in a
> space called the "rollback segment(s)". 

UNDO and rollback segment are the same thing in Oracle. 

In older versions it was just called "rollback segment" I think it started
with Oracle 10g that they called it UNDO. 

> Fwiw, Oracle does not use the undo log for snapshot fetches. 
> It's used only for transaction rollback and recovery.

As UNDO and "rollback" are the same they do use the UNDO information for
MVCC:

http://docs.oracle.com/database/121/CNCPT/consist.htm#GUID-00A3688F-1219-423C-A5ED-4B8F25BEEAFB__BABFDBAJ






--
View this message in context: http://postgresql.nabble.com/UNDO-and-in-place-update-tp5931575p5931844.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.



Re: UNDO and in-place update

From
Robert Haas
Date:
On Thu, Nov 24, 2016 at 2:32 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> IMHO, overall, there should be pros and cons of the current approach and the new UNDo one (like Oracle?), depending
onthe workload.  Under update-heavy workload, the UNDO method may be better.  OTOH, under the mostly-INSERT workload
(likedata warehouse?), the current method will be better because it writes no log for UNDO. 

The foreground operation will complete more quickly, because it won't
have to write UNDO.  On the other hand, you'll have to set hint bits
later, as well as freeze, which may be more expensive than writing
UNDO by the time all is said and done.  Whether it's better to do pay
a foreground tax immediately or to do deferred work at a later time
depends on things like whether you have quiet times during which you
can catch up on the deferred work ... but the number of users who have
gotten unpleasant surprises due to autovacuum kicking in during a busy
period is not small.

> Furthermore, it maybe the best to be able to switch the method for each table and/or tablespace.  For example, in
pgbench,history table uses the current method, 
> and other tables use the UNDO method.  Is it time to introduce a pluggable storage system?

IMHO, it's past time for that.

> Because PostgreSQL is a follower in the UNDO approach, I think it will be better to study other DBMSs well (Oracle
andMySQL?).  That includes not only their manuals, but also whitepapers and books.  Especially, I expect good books to
givedeep knowledge on performance tuning and troubleshooting, from which we will be able to know the cons that Oracle's
materialsdon't state. 

I agree up to a point.  I think we need to design our own system as
well as we can, not just copy what others have done.  For example, the
design I sketched will work with all of PostgreSQL's existing index
types.  You need to modify each AM in order to support in-place
updates when a column indexed by that AM has been modified, and that's
probably highly desirable, but it's not a hard requirement.  I believe
that's a better approach for us than insisting that we have to do it
in exactly the same way as some other system.  Now, that doesn't mean
we shouldn't learn from what works well and poorly in other systems,
but I think our goal here should be to chart the best way forward
given PostgreSQL's existing architecture and its existing strengths
and weaknesses, rather than to make it exactly like Oracle or MySQL or
anything else.  Few people on this mailing list would say that either
of those systems are categorically better than PostgreSQL; most, I
suspect, would disagree somewhat vigorously.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: UNDO and in-place update

From
Robert Haas
Date:
On Thu, Nov 24, 2016 at 11:06 AM, Greg Stark <stark@mit.edu> wrote:
> Fwiw, Oracle does not use the undo log for snapshot fetches. It's used
> only for transaction rollback and recovery.
>
> For snapshot isolation Oracle has yet a *third* copy of the data in a
> space called the "rollback segment(s)". When you update a row in a
> block you save the whole block in the rollback segment. When you try
> to access a block you check if the CSN -- which is basically
> equivalent to our LSN -- is newer than your snapshot and if it is you
> fetch the old version of the block from the rollback.

My understanding is that this isn't correct.  I think the rollback
segments are what they call the thing that stores UNDO.  See e.g.
http://ss64.com/ora/syntax-redo.html

Having said that, I'm by no means an Oracle expert.

> Essentially their MVCC is done on a per-block level rather than a
> per-row level and they keep only the newest version of the block in
> the table, the rest are in the rollback segment.  For what it's worth
> I think our approach is cleaner and more flexible. They had a lot of
> trouble with their approach over the years and it works well only
> because they invested an enormous amount of development in it and also
> because people throw a lot of hardware at it too.

People do throw a lot of hardware at Oracle, but I don't think I'd
accept the contention that Oracle generally gets less performance out
of the same amount of hardware than PostgreSQL.  There are certainly
some things we often do faster -- including inserts, deletes, and
transaction aborts -- but their implementation of updates seems to be
very fast.  Obviously, a lot depends on configuration and workload, so
it's hard to make general statements, but I think it's more correct to
imagine that people throw a lot of hardware at Oracle because that's
where their big, critical databases are than to imagine that it's
because Oracle is a resource hog.

> I think the main use case we have trouble with is actually the "update
> every row in the table" type of update which requires we write to
> every block, plus a second copy of every block, plus write full pages
> of both copies, then later set hint bits dirtying pages again and
> generating more full pages writes, then later come along and vacuum
> which requires two more writes of every block, etc. If we had a
> solution for the special case of an update that replaces every row in
> a page that I think would complement HOT nicely and go a long way
> towards fixing our issues.

This case is certainly a problem, but I don't believe it's anywhere
close to being our only problem.  My experience is that our system
works fairly well when there are only a few indexes.  However, with
heavily indexed tables, all of your updates become non-HOT and then
bad things happen.  So you can either leave out indexes that you need
for good query performance and then you're hosed, or you can add those
indexes and then you're hosed anyway because of the bloat and write
amplification associated with non-HOT updates.

Also, write-heavy workloads that perform OK as long as there are no
long-lived snapshots often enter a death spiral when you run reporting
queries on the same system.  Finer-grained snapshot tracking would
help with that problem, but it doesn't solve it: now a single
long-lived snapshot can "only" cause the database to double in size
instead of increasing without bound, a scant consolation.  Nor does it
change the fundamental calculus that once a table gets bloated, it's
very painful to de-bloat it.  The appeal of a design that supports
in-place update is that you don't bloat the table in the first place.
You still have the bloat, of course, but it's off in a separate data
structure that is engineered for efficient deletion.

I think that the whole emphasis on whether and to what degree this is
like Oracle is somewhat misplaced.  I would look at it a different
way.  We've talked many times over the years about how PostgreSQL is
optimized for aborts.  Everybody that I've heard comment on that issue
thinks that is a bad thing.  I am proposing a design that is optimized
for commits; that is, if the transaction commits, none of the pages it
modified need to be dirtied again afterwards at any point.  I think
that's an extremely important property and it's one that we are very
far from having today.  It necessarily implies that you cannot store
the old row versions in the heap, because if you do, then you are
going to have to dirty the pages again to get rid of them (unless you
prefer to just leak the space forever).  Now there is plenty of room
for argument about whether the specific design I proposed is going to
be any good, and I think that would be quite an interesting discussion
to have.  But I think if you say, well, you know, the fact that we may
rewrite the same page 5 or 6 times after a modification to set hint
bits (a few at a time) and HOT prune and set all-visible and freeze
isn't really any big deal, you must live in a world where the write
bandwidth of the I/O channel is a lot less of a problem than it is in
mine.  And we've been around and around on all of that stuff and
people have come up with various ideas to improve the situation - some
of which have been implemented - but many of those ideas involve
unpleasant trade-offs and so the core problems remain.  If we don't do
something more fundamental, we're still going to be up against the
same basic issues ten years from now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: UNDO and in-place update

From
Pavel Stehule
Date:



I think that the whole emphasis on whether and to what degree this is
like Oracle is somewhat misplaced.  I would look at it a different
way.  We've talked many times over the years about how PostgreSQL is
optimized for aborts.  Everybody that I've heard comment on that issue
thinks that is a bad thing. 

again this depends on usage - when you have a possibility to run VACUUM, then this strategy is better.

The fast aborts is one pretty good feature for stable usage.

Isn't possible to use UNDO log (or something similar) for VACUUM? ROLLBACK should be fast, but
VACUUM can do more work?
 
I am proposing a design that is optimized
for commits; that is, if the transaction commits, none of the pages it
modified need to be dirtied again afterwards at any point.  I think
that's an extremely important property and it's one that we are very
far from having today.  It necessarily implies that you cannot store
the old row versions in the heap, because if you do, then you are
going to have to dirty the pages again to get rid of them (unless you
prefer to just leak the space forever).  Now there is plenty of room
for argument about whether the specific design I proposed is going to
be any good, and I think that would be quite an interesting discussion
to have.  But I think if you say, well, you know, the fact that we may
rewrite the same page 5 or 6 times after a modification to set hint
bits (a few at a time) and HOT prune and set all-visible and freeze
isn't really any big deal, you must live in a world where the write
bandwidth of the I/O channel is a lot less of a problem than it is in
mine.  And we've been around and around on all of that stuff and
people have come up with various ideas to improve the situation - some
of which have been implemented - but many of those ideas involve
unpleasant trade-offs and so the core problems remain.  If we don't do
something more fundamental, we're still going to be up against the
same basic issues ten years from now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: UNDO and in-place update

From
Robert Haas
Date:
On Thu, Nov 24, 2016 at 6:20 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>> I think that the whole emphasis on whether and to what degree this is
>> like Oracle is somewhat misplaced.  I would look at it a different
>> way.  We've talked many times over the years about how PostgreSQL is
>> optimized for aborts.  Everybody that I've heard comment on that issue
>> thinks that is a bad thing.
>
>
> again this depends on usage - when you have a possibility to run VACUUM,
> then this strategy is better.
>
> The fast aborts is one pretty good feature for stable usage.
>
> Isn't possible to use UNDO log (or something similar) for VACUUM? ROLLBACK
> should be fast, but
> VACUUM can do more work?

I think that in this design we wouldn't use VACUUM at all.  However,
if what you are saying is that we should try to make aborts
near-instantaneous by pushing UNDO actions into the background, I
agree entirely.  InnoDB already does that, IIUC.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: UNDO and in-place update

From
Pavel Stehule
Date:


2016-11-25 1:44 GMT+01:00 Robert Haas <robertmhaas@gmail.com>:
On Thu, Nov 24, 2016 at 6:20 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>> I think that the whole emphasis on whether and to what degree this is
>> like Oracle is somewhat misplaced.  I would look at it a different
>> way.  We've talked many times over the years about how PostgreSQL is
>> optimized for aborts.  Everybody that I've heard comment on that issue
>> thinks that is a bad thing.
>
>
> again this depends on usage - when you have a possibility to run VACUUM,
> then this strategy is better.
>
> The fast aborts is one pretty good feature for stable usage.
>
> Isn't possible to use UNDO log (or something similar) for VACUUM? ROLLBACK
> should be fast, but
> VACUUM can do more work?

I think that in this design we wouldn't use VACUUM at all.  However,
if what you are saying is that we should try to make aborts
near-instantaneous by pushing UNDO actions into the background, I
agree entirely.  InnoDB already does that, IIUC.

ok, it can be another process - that can be more aggressive and less limited than vacuum.

Regards

Pavel
 

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: UNDO and in-place update

From
Amit Kapila
Date:
On Thu, Nov 24, 2016 at 10:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> I don't really understand how a system of this type copes with page
> splits.  Suppose there are entries 1000, 2000, ..., 1e6 in the btree.
> I now start two overlapping transactions, one inserting all the
> positive odd values < 1e6 and the other inserting all the positive
> even values < 1e6 that are not already present.  The insertions are
> interleaved with each other.  After they get done inserting, what does
> the 3D time traveling btree look like at this point?
>
> The problem I'm getting at here is that the heap, unlike an index, is
> unordered.  So when I've got all values 1..1e6, they've got to be in a
> certain order.  Necessarily, the values from the in-flight
> transactions are interleaved with each other and with the old data.
> If they're not, I can no longer support efficient searches by the
> in-flight transactions, plus I have an unbounded amount of cleanup
> work to do at commit time.  But if I *do* have all of those values
> interleaved in the proper order, then an abort leaves fragmented free
> space.
>

Yeah, that's right, but at next insert, on the same page, we can
defragment the page and claim all the space.  We might want to perform
such a cleanup only in certain cases like when the remaining space in
the page is below a certain threshold.

>  I can go and kill all of the inserted tuples during UNDO of an
> aborted transactions, but because page splits have happened, the index
> tuples I need to kill may not be on the same pages where I inserted
> them, so I might need to just search from the top of the tree, so it's
> expensive.
>

Another way to do is to write UNDO log for split operation (with exact
record locations on pages that are split) so that you don't need to
perform this expensive search operation.  I can understand writing
UNDO for split could be costly but so is writing WAL for it.  I think
for UNDO, we should follow a generic rule as for heap which is that
any change in the page should have it's corresponding UNDO.

>  And even if I do it anyway, the pages are left half-full.
> The point is that the splits are the joint result of the two
> concurrent transactions taken together - you can't blame individual
> splits on individual transactions.
>

Sure, but you are talking about an extreme case, so it might be okay
to leave pages half-full in such cases, next inserts will consume this
space.  Also aborts are less frequent operations, so the impact should
be minimal.

>
> Do you see some trick here that I'm missing?
>

I think we should be able to find some way to tackle the problems you
have listed above as there are databases (including Oracle) which
write UNDO for indexes (btree's) as well.  However, I think here the
crucial question is whether the only-heap-based-undo design can give
us consistent data for all possible cases if so, then it is better to
do that in the first phase and then later we can implement UNDO for
indexes as well.   As of now, nobody has reported any such problem, so
maybe we can stick with the only-heap-based-undo design.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: UNDO and in-place update

From
Amit Kapila
Date:
On Fri, Nov 25, 2016 at 8:03 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>
> Another way to do is to write UNDO log for split operation (with exact
> record locations on pages that are split) so that you don't need to
> perform this expensive search operation.  I can understand writing
> UNDO for split could be costly but so is writing WAL for it.  I think
> for UNDO, we should follow a generic rule as for heap
>

typo.
/heap/WAL

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: UNDO and in-place update

From
Bruce Momjian
Date:
On Thu, Nov 24, 2016 at 12:23:28PM -0500, Robert Haas wrote:
> I agree up to a point.  I think we need to design our own system as
> well as we can, not just copy what others have done.  For example, the
> design I sketched will work with all of PostgreSQL's existing index
> types.  You need to modify each AM in order to support in-place
> updates when a column indexed by that AM has been modified, and that's
> probably highly desirable, but it's not a hard requirement.

I feel you are going to get into the problem of finding the index entry
for the old row --- the same thing that is holding back more aggressive
WARM updates.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: UNDO and in-place update

From
Greg Stark
Date:
On 24 November 2016 at 23:03, Robert Haas <robertmhaas@gmail.com> wrote:
>> For snapshot isolation Oracle has yet a *third* copy of the data in a
>> space called the "rollback segment(s)".
>
> My understanding is that this isn't correct.  I think the rollback
> segments are what they call the thing that stores UNDO.  See e.g.
> http://ss64.com/ora/syntax-redo.html

It looks like you're right. Rollback segments and Undo segments are
two different pieces of code but one is just the old way and the other
the new way of managing the same data. You can't have both active in
the same database at the same time. I'm a bit confused because I
distinctly remembered an UNDO log back in the 8i days as well but
apparently that's just me imagining things. UNDO segments were
introduced in 9i.

This explained a bunch
http://satya-dba.blogspot.ie/2009/09/undo-tablespace-undo-management.html


-- 
greg



Re: UNDO and in-place update

From
Amit Kapila
Date:
On Fri, Nov 25, 2016 at 11:23 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Thu, Nov 24, 2016 at 12:23:28PM -0500, Robert Haas wrote:
>> I agree up to a point.  I think we need to design our own system as
>> well as we can, not just copy what others have done.  For example, the
>> design I sketched will work with all of PostgreSQL's existing index
>> types.  You need to modify each AM in order to support in-place
>> updates when a column indexed by that AM has been modified, and that's
>> probably highly desirable, but it's not a hard requirement.
>
> I feel you are going to get into the problem of finding the index entry
> for the old row --- the same thing that is holding back more aggressive
> WARM updates.
>

I think I see a problem in that regards when the index column in
updated twice with the same value.  For example,

table -  test_undo(c1 int, c2 char(10));
index -  idx_tu on test_undo(c1);

Step-1
Insert (2, 'abc')
Heap - 2, abc
Index - 2
Commit;

Step -2
Update (3,def) where c1 = 2;
Heap - 3, def
Index - 3 (another value will be 2)
Undo - 2, abc
Commit;

At this point a new query which scans the index for value 2 will reach
to the tuple (3,def).  As scan started after all commits, tuple
(3,def) appears okay, but it can compare the key value in the tuple to
detect that this is not the right tuple.  So, all is well till here.

Step -3
Update (2) where c1 = 3;
Heap - 2, def
Index - 2 (another values will be 3,2)
Undo - 3;
Commit

At this point, index scan for value 2 will find index tuple of step-1
(2) and will conclude 2,def as a right tuple, but actually, that is
wrong as the step-1 (2) index tuple should not be visible to the user.
Do you also this as a problem or am I missing something?  If this a
problem, then I think we need some form of delete marking system for
the index, probably transaction visibility information as well.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: UNDO and in-place update

From
Robert Haas
Date:
On Sat, Nov 26, 2016 at 10:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Nov 25, 2016 at 11:23 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> On Thu, Nov 24, 2016 at 12:23:28PM -0500, Robert Haas wrote:
>>> I agree up to a point.  I think we need to design our own system as
>>> well as we can, not just copy what others have done.  For example, the
>>> design I sketched will work with all of PostgreSQL's existing index
>>> types.  You need to modify each AM in order to support in-place
>>> updates when a column indexed by that AM has been modified, and that's
>>> probably highly desirable, but it's not a hard requirement.
>>
>> I feel you are going to get into the problem of finding the index entry
>> for the old row --- the same thing that is holding back more aggressive
>> WARM updates.
>>
>
> I think I see a problem in that regards when the index column in
> updated twice with the same value.  For example,
>
> table -  test_undo(c1 int, c2 char(10));
> index -  idx_tu on test_undo(c1);
>
> Step-1
> Insert (2, 'abc')
> Heap - 2, abc
> Index - 2
> Commit;
>
> Step -2
> Update (3,def) where c1 = 2;
> Heap - 3, def
> Index - 3 (another value will be 2)
> Undo - 2, abc
> Commit;
>
> At this point a new query which scans the index for value 2 will reach
> to the tuple (3,def).  As scan started after all commits, tuple
> (3,def) appears okay, but it can compare the key value in the tuple to
> detect that this is not the right tuple.  So, all is well till here.
>
> Step -3
> Update (2) where c1 = 3;
> Heap - 2, def
> Index - 2 (another values will be 3,2)
> Undo - 3;
> Commit
>
> At this point, index scan for value 2 will find index tuple of step-1
> (2) and will conclude 2,def as a right tuple, but actually, that is
> wrong as the step-1 (2) index tuple should not be visible to the user.
> Do you also this as a problem or am I missing something?  If this a
> problem, then I think we need some form of delete marking system for
> the index, probably transaction visibility information as well.

Well, my original email did contain a discussion of the need for
delete-marking.  I said that you couldn't do in-place updates when
indexed columns were modified unless the index AMs had support for
delete-marking, which is the same point you are making here.  However,
I agree that the case where the indexed column gets set back to a
previous value while the old index tuple for that value still exists
is particularly tricky.  I think that what we need to do there is
avoid inserting an index entry for a given value/TID pair if an index
entry for that value/TID pair already exists.

That's a little different from your analysis.  In your step-3
analysis, you say that an index scan for 2 will find the step-1 tuple,
but that's not really right.  The index scan will find the index tuple
which points to the whole chain of tuples, step-3 => step-2 => step-1,
and it will decide which heap tuple from that chain the user can see
based on the user's snapshot.  That's OK, but we're in real trouble if
step-3 inserted an additional index tuple pointing to that chain
rather than simply noting that one already exists.  If it adds an
additional one, then we'll visit two index tuples for that TID.  Each
time, the visibility information in UNDO will cause us to return the
correct tuple, but we've erred by returning it twice (once per index
entry) rather than just once.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: UNDO and in-place update

From
"Tsunakawa, Takayuki"
Date:
From: Robert Haas [mailto:robertmhaas@gmail.com]
> On Thu, Nov 24, 2016 at 2:32 AM, Tsunakawa, Takayuki
> <tsunakawa.takay@jp.fujitsu.com> wrote:
> > IMHO, overall, there should be pros and cons of the current approach and
> the new UNDo one (like Oracle?), depending on the workload.  Under
> update-heavy workload, the UNDO method may be better.  OTOH, under the
> mostly-INSERT workload (like data warehouse?), the current method will be
> better because it writes no log for UNDO.
> 
> The foreground operation will complete more quickly, because it won't have
> to write UNDO.  On the other hand, you'll have to set hint bits later, as
> well as freeze, which may be more expensive than writing UNDO by the time
> all is said and done.  Whether it's better to do pay a foreground tax
> immediately or to do deferred work at a later time depends on things like
> whether you have quiet times during which you can catch up on the deferred
> work ... but the number of users who have gotten unpleasant surprises due
> to autovacuum kicking in during a busy period is not small.

I see.  autovacuum is certainly almost unpredictable, at least for those who are not aware of its existence and tuning.
Recently, one of our customers faced the inability to perform INSERTs because of xid wraparound.  
 
Their workload is INSERT-heavy, and (inefficiently) used autocommit to insert each row, which resulted in the xid
consumptionfaster than the slow xid wraparound autovacuum.
 


> > Furthermore, it maybe the best to be able to switch the method for
> > each table and/or tablespace.  For example, in pgbench, history table
> uses the current method, and other tables use the UNDO method.  Is it time
> to introduce a pluggable storage system?
> 
> IMHO, it's past time for that.

Do you mean by "past time" that the community decided not to introduce pluggable storage manager?  If it's true, that's
apity.  But I remember that there was a discussion about pluggable storage manager at PGConf or some other event this
year. Or, do you mean that the current approach should be abandoned and the UNDO approach replace it?
 


> > Because PostgreSQL is a follower in the UNDO approach, I think it will
> be better to study other DBMSs well (Oracle and MySQL?).  That includes
> not only their manuals, but also whitepapers and books.  Especially, I
> expect good books to give deep knowledge on performance tuning and
> troubleshooting, from which we will be able to know the cons that Oracle's
> materials don't state.
> 
> I agree up to a point.  I think we need to design our own system as well
> as we can, not just copy what others have done.  For example, the design
> I sketched will work with all of PostgreSQL's existing index types.  You
> need to modify each AM in order to support in-place updates when a column
> indexed by that AM has been modified, and that's probably highly desirable,
> but it's not a hard requirement.  I believe that's a better approach for
> us than insisting that we have to do it in exactly the same way as some
> other system.  Now, that doesn't mean we shouldn't learn from what works
> well and poorly in other systems, but I think our goal here should be to
> chart the best way forward given PostgreSQL's existing architecture and
> its existing strengths and weaknesses, rather than to make it exactly like
> Oracle or MySQL or anything else.  Few people on this mailing list would
> say that either of those systems are categorically better than PostgreSQL;
> most, I suspect, would disagree somewhat vigorously.

Yes, agreed.  I didn't intend to just imitate Oracle/MySQL design.  I meant that it will be better to study in advance
whattrouble Oracle/MySQL design has caused their users, and avoid pitfalls as much as possible.  For example, when I
ranTPC-B benchmark against Oracle and PostgreSQL, I was embarrassed by frequent deadlocks in Oracle.  It took some time
forme to find out that INITRANS needs to be tuned with ALTER TABLE.  PostgreSQL ran smoothly without any tuning.
 

I find your UNDO approach attractive.  On the other hand, I sometimes wonder where PostgreSQL is headed for.  I'm
sometimesasked by database users "How different is PostgreSQL from MySQL?"  If the UNDO approach is taken, PostgreSQL
wouldappear more similar to MySQL.  I don't say that's bad, but I wonder whether we can appeal the new feature in a big
picture. For example, the current (VACUUM) approach would prevent PostgreSQL from becoming a database for
OLTP/analyticsmixed workload, because long-running analytics queries cause tale and index bloat regardless of whether
thosequeries access the same data as the OLTP workload, wouldn't it?  Can we appeal the future of PostgreSQL and the
differencefrom MySQL as "PostgreSQL is pursuing to handle multiple workloads in the same database to better utilize
dataand IT resources.  The new UNDO approach is one step toward that."
 

Regards
Takayuki Tsunakawa



Re: UNDO and in-place update

From
Robert Haas
Date:
On Sun, Nov 27, 2016 at 8:26 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> I see.  autovacuum is certainly almost unpredictable, at least for those who are not aware of its existence and
tuning. Recently, one of our customers faced the inability to perform INSERTs because of xid wraparound. 
> Their workload is INSERT-heavy, and (inefficiently) used autocommit to insert each row, which resulted in the xid
consumptionfaster than the slow xid wraparound autovacuum. 

Yes, that can happen.  The new freeze map stuff in 9.6 should help
with it, even without doing anything like what I am proposing here.
But with this, there is no deferred work at all: once the transaction
is all-visible, the UNDO can be discarded and there's nothing further
to do.

>> > Furthermore, it maybe the best to be able to switch the method for
>> > each table and/or tablespace.  For example, in pgbench, history table
>> uses the current method, and other tables use the UNDO method.  Is it time
>> to introduce a pluggable storage system?
>>
>> IMHO, it's past time for that.
>
> Do you mean by "past time" that the community decided not to introduce pluggable storage manager?  If it's true,
that'sa pity.  But I remember that there was a discussion about pluggable storage manager at PGConf or some other event
thisyear.  Or, do you mean that the current approach should be abandoned and the UNDO approach replace it? 

I mean that we should most definitely introduce a pluggable storage
manager and that I think we should even have done it sooner, before
now.

>> I agree up to a point.  I think we need to design our own system as well
>> as we can, not just copy what others have done.  For example, the design
>> I sketched will work with all of PostgreSQL's existing index types.  You
>> need to modify each AM in order to support in-place updates when a column
>> indexed by that AM has been modified, and that's probably highly desirable,
>> but it's not a hard requirement.  I believe that's a better approach for
>> us than insisting that we have to do it in exactly the same way as some
>> other system.  Now, that doesn't mean we shouldn't learn from what works
>> well and poorly in other systems, but I think our goal here should be to
>> chart the best way forward given PostgreSQL's existing architecture and
>> its existing strengths and weaknesses, rather than to make it exactly like
>> Oracle or MySQL or anything else.  Few people on this mailing list would
>> say that either of those systems are categorically better than PostgreSQL;
>> most, I suspect, would disagree somewhat vigorously.
>
> Yes, agreed.  I didn't intend to just imitate Oracle/MySQL design.  I meant that it will be better to study in
advancewhat trouble Oracle/MySQL design has caused their users, and avoid pitfalls as much as possible.  For example,
whenI ran TPC-B benchmark against Oracle and PostgreSQL, I was embarrassed by frequent deadlocks in Oracle.  It took
sometime for me to find out that INITRANS needs to be tuned with ALTER TABLE.  PostgreSQL ran smoothly without any
tuning.

I agree that we want to stick with our principal of requiring as
little tuning as possible -- and where possible reduce tuning that is
required today.  I have only limited experience with it, but some of
those experiences were frustrating.

> I find your UNDO approach attractive.  On the other hand, I sometimes wonder where PostgreSQL is headed for.  I'm
sometimesasked by database users "How different is PostgreSQL from MySQL?"  If the UNDO approach is taken, PostgreSQL
wouldappear more similar to MySQL.  I don't say that's bad, but I wonder whether we can appeal the new feature in a big
picture. For example, the current (VACUUM) approach would prevent PostgreSQL from becoming a database for
OLTP/analyticsmixed workload, because long-running analytics queries cause tale and index bloat regardless of whether
thosequeries access the same data as the OLTP workload, wouldn't it?  Can we appeal the future of PostgreSQL and the
differencefrom MySQL as "PostgreSQL is pursuing to handle multiple workloads in the same database to better utilize
dataand IT resources.  The new UNDO approach is one step toward that." 

I think there are a lot of differences between PostgreSQL and MySQL
other than whether UNDO is used.  For example, just in the area of
storage, MySQL uses index-organized tables, while PostgreSQL's current
heap and this proposed new type of heap are both flat.  There are
differences in the types of indexing that are offered, the datatypes
available, the features available in every area of the system, and of
course the licensing.  If we get to a point where there are no real
differences in licensing or extensibility or features or any other
area between PostgreSQL and MySQL, then of course it makes no
difference which one people use, but I don't think there is any
prospect of such a thing happening.

PostgreSQL is a relatively feature-complete product at this point.  In
my view, the major thing missing is logical replication, and that's
being worked on.  However, there are two things that, IMHO, are really
hurting us.  The first is tooling, for example around backup and
recovery.  Users familiar with other systems expect things to be
easier to set up and administer than they are in PostgreSQL, and they
expect better tools for automating administrative tasks than are in
fact available.  The second is performance.  The work Andres is doing
to speed up the executor, in which Heikki has expressed interest and
shared ideas; the work on parallel query; Amit Langote's work on
partitioning; Kevin's work on materialized views; Amit Kapila's work
on hash indexes; Tomas Vondra's work on multivariate statistics; and
many other patches currently in flight are all attacks, from various
angles, on the problem of performance, and this idea is, too.  The
assumptions underlying every layer of the system need to be questioned
and challenged, and alternatives need to be explored.  Sometimes that
will mean trying things that don't work; often it will mean building
new things that provide users with choices they don't have today;
other times it will mean rewriting the functionality in place just so
it gets faster.  Some people want CTEs or window functions or XML
manipulation inside the server or more functionality in pgbench and
other people don't care about those things, but the one thing that
virtually everybody cares about is whether queries (and utility
commands) run fast.  I don't think that our performance today sucks -
we wouldn't have gotten this far if that were the case - but there's
clearly room for it to be a lot better, sometimes even by surprisingly
simple techniques (e.g. 5dfc198146b49ce7ecc8a1fc9d5e171fb75f6ba5).
Whether the techniques are simple or complicated, we need to put work
into making the necessary things happen; if other products optimize
and we don't, we'll fall behind.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: UNDO and in-place update

From
Amit Kapila
Date:
On Mon, Nov 28, 2016 at 4:50 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Well, my original email did contain a discussion of the need for
> delete-marking.  I said that you couldn't do in-place updates when
> indexed columns were modified unless the index AMs had support for
> delete-marking, which is the same point you are making here.
>

Sorry, I had not read that part earlier, but now that I read it, I
think there is a slight difference in what I am saying.   I thought
along with delete-marking, we might need transaction visibility
information in the index as well.  BTW, it is not completely clear
whether you want a delete-marking system or you think we could do
without that by avoiding in-place updates, it seems to me from what
you have written that you are inclined towards having a delete-marking
system.

>  However,
> I agree that the case where the indexed column gets set back to a
> previous value while the old index tuple for that value still exists
> is particularly tricky.  I think that what we need to do there is
> avoid inserting an index entry for a given value/TID pair if an index
> entry for that value/TID pair already exists.
>

Are you saying this for indexes with a delete-marking system or for
indexes without that or for both?

> That's a little different from your analysis.  In your step-3
> analysis, you say that an index scan for 2 will find the step-1 tuple,
> but that's not really right.  The index scan will find the index tuple
> which points to the whole chain of tuples, step-3 => step-2 => step-1,
> and it will decide which heap tuple from that chain the user can see
> based on the user's snapshot.
>

I think the scan will not traverse the chain if it starts after
step-3's commit and that's what I intend to say.

>  That's OK, but we're in real trouble if
> step-3 inserted an additional index tuple pointing to that chain
> rather than simply noting that one already exists.  If it adds an
> additional one, then we'll visit two index tuples for that TID.  Each
> time, the visibility information in UNDO will cause us to return the
> correct tuple, but we've erred by returning it twice (once per index
> entry) rather than just once.
>

Why will scan traverse the UNDO chain if it starts after step-3 commit?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: UNDO and in-place update

From
Bruce Momjian
Date:
On Sun, Nov 27, 2016 at 09:19:06AM +0530, Amit Kapila wrote:
> At this point, index scan for value 2 will find index tuple of step-1
> (2) and will conclude 2,def as a right tuple, but actually, that is
> wrong as the step-1 (2) index tuple should not be visible to the user.
> Do you also this as a problem or am I missing something?  If this a
> problem, then I think we need some form of delete marking system for
> the index, probably transaction visibility information as well.

Yes, very similar to the problems with WARM already discussed.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: UNDO and in-place update

From
Robert Haas
Date:
On Sun, Nov 27, 2016 at 10:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Nov 28, 2016 at 4:50 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Well, my original email did contain a discussion of the need for
>> delete-marking.  I said that you couldn't do in-place updates when
>> indexed columns were modified unless the index AMs had support for
>> delete-marking, which is the same point you are making here.
>
> Sorry, I had not read that part earlier, but now that I read it, I
> think there is a slight difference in what I am saying.   I thought
> along with delete-marking, we might need transaction visibility
> information in the index as well.

I think we need to avoid putting the visibility information in the
index because that will make the index much bigger.  It would be a
shame to get the visibility index out of the heap (as this design
does) only to be forced to add it to the index.  Note that,
percentage-wise, it's quite a bit worse to have visibility information
in the index than in the heap, because the index tuples are going to
be much narrower than the heap tuples, often just one column.  (The
cost of having this information in the heap can be amortized across
the whole width of the table.)

I don't have the whole design for the delete-marking stuff worked out
yet.  I'm thinking we could have a visibility map where the bit for
each page is 1 if the page certainly has no pending UNDO and 0 if it
might have some.  In addition, if a tuple is deleted or the indexed
column value is changed, we delete-mark the associated index entries.
If we later discover that the page has no current UNDO (i.e. is
all-visible) and the tuple at a given TID matches our index entry, we
can clear the delete-mark for that index entry.  So, an index-only
scan rechecks the heap if the tuples is delete-marked OR the
visibility-map bit for the page is not set; if neither is the case, it
can assume the heap tuple is visible.

Another option would be to get rid of the visibility map and rely only
on the delete-marks.  If we did that, then tuples would have to be
delete-marked when first inserted since they wouldn't be all-visible
until sometime after the commit of the inserting transaction.

> BTW, it is not completely clear
> whether you want a delete-marking system or you think we could do
> without that by avoiding in-place updates, it seems to me from what
> you have written that you are inclined towards having a delete-marking
> system.

Yes, that's my inclination.  I don't think it would be necessary to
have the delete-marking in order to produce a committable patch, but
the benefit of this approach would be much reduced without that.

>>  However,
>> I agree that the case where the indexed column gets set back to a
>> previous value while the old index tuple for that value still exists
>> is particularly tricky.  I think that what we need to do there is
>> avoid inserting an index entry for a given value/TID pair if an index
>> entry for that value/TID pair already exists.
>
> Are you saying this for indexes with a delete-marking system or for
> indexes without that or for both?

I'm saying that in any case in which you allow in-place update, you
have to make sure you don't get multiple entries pointing at the same
TID unless they have different values in the index tuple.

>> That's a little different from your analysis.  In your step-3
>> analysis, you say that an index scan for 2 will find the step-1 tuple,
>> but that's not really right.  The index scan will find the index tuple
>> which points to the whole chain of tuples, step-3 => step-2 => step-1,
>> and it will decide which heap tuple from that chain the user can see
>> based on the user's snapshot.
>
> I think the scan will not traverse the chain if it starts after
> step-3's commit and that's what I intend to say.

I don't see why that would be true.

>>  That's OK, but we're in real trouble if
>> step-3 inserted an additional index tuple pointing to that chain
>> rather than simply noting that one already exists.  If it adds an
>> additional one, then we'll visit two index tuples for that TID.  Each
>> time, the visibility information in UNDO will cause us to return the
>> correct tuple, but we've erred by returning it twice (once per index
>> entry) rather than just once.
>
> Why will scan traverse the UNDO chain if it starts after step-3 commit?

Why wouldn't it?  I think that if an index scan hits a page with an
UNDO chain, it always need to traverse the whole thing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: UNDO and in-place update

From
Amit Kapila
Date:
On Mon, Nov 28, 2016 at 11:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Nov 27, 2016 at 10:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Mon, Nov 28, 2016 at 4:50 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Well, my original email did contain a discussion of the need for
>>> delete-marking.  I said that you couldn't do in-place updates when
>>> indexed columns were modified unless the index AMs had support for
>>> delete-marking, which is the same point you are making here.
>>
>> Sorry, I had not read that part earlier, but now that I read it, I
>> think there is a slight difference in what I am saying.   I thought
>> along with delete-marking, we might need transaction visibility
>> information in the index as well.
>
> I think we need to avoid putting the visibility information in the
> index because that will make the index much bigger.
>

I agree that there will be an increase in index size, but it shouldn't
be much if we have transaction information (and that too only active
transactions) at page level instead of at tuple level.  I think the
number of concurrent writes on the index will be lesser as compared to
the heap.  There are a couple of benefits of having visibility
information in the index.

a. Heap and index could be independently cleaned up and in most cases
by foreground transactions only.  The work for vacuum will be quite
less as compared to now.  I think this will help us in avoiding the
bloat to a great degree.

b. Improved index-only scans, because now we don't need visibility map
of the heap to check tuple visibility.

c. Some speed improvements for index scans can also be expected
because with this we don't need to perform heap fetch for invisible
index tuples.

>  It would be a
> shame to get the visibility index out of the heap (as this design
> does) only to be forced to add it to the index.  Note that,
> percentage-wise, it's quite a bit worse to have visibility information
> in the index than in the heap, because the index tuples are going to
> be much narrower than the heap tuples, often just one column.  (The
> cost of having this information in the heap can be amortized across
> the whole width of the table.)
>

I think with proposed design there is a good chance that for many of
the index scans, there is a need for the extra hop to decide
visibility of index tuple. I think as of today, during index scan if
the corresponding heap page is not-all-visible, then we check the
corresponding heap tuple to decide about the visibility of index
tuple, whereas with proposed design, I think it first needs to check
heap page and then TPD, if there is more than one transaction
operating on page.

> I don't have the whole design for the delete-marking stuff worked out
> yet.  I'm thinking we could have a visibility map where the bit for
> each page is 1 if the page certainly has no pending UNDO and 0 if it
> might have some.  In addition, if a tuple is deleted or the indexed
> column value is changed, we delete-mark the associated index entries.
> If we later discover that the page has no current UNDO (i.e. is
> all-visible) and the tuple at a given TID matches our index entry, we
> can clear the delete-mark for that index entry.  So, an index-only
> scan rechecks the heap if the tuples is delete-marked OR the
> visibility-map bit for the page is not set; if neither is the case, it
> can assume the heap tuple is visible.
>

Such a design will work, but I think this is more work to mark the
tuple as Dead as compared to current system.

> Another option would be to get rid of the visibility map and rely only
> on the delete-marks.  If we did that, then tuples would have to be
> delete-marked when first inserted since they wouldn't be all-visible
> until sometime after the commit of the inserting transaction.
>

The previous idea sounds better.

>
>>>  That's OK, but we're in real trouble if
>>> step-3 inserted an additional index tuple pointing to that chain
>>> rather than simply noting that one already exists.  If it adds an
>>> additional one, then we'll visit two index tuples for that TID.  Each
>>> time, the visibility information in UNDO will cause us to return the
>>> correct tuple, but we've erred by returning it twice (once per index
>>> entry) rather than just once.
>>
>> Why will scan traverse the UNDO chain if it starts after step-3 commit?
>
> Why wouldn't it?  I think that if an index scan hits a page with an
> UNDO chain, it always need to traverse the whole thing.
>

If you notice the scan has started after all the writes have
committed, so what is the guarantee that there will be a UNDO chain?
Yes, it could be there if there is some live transaction which started
before step, otherwise, I am not sure if we can guarantee that UNDO
chain will be present.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: UNDO and in-place update

From
Alexander Korotkov
Date:
On Tue, Nov 29, 2016 at 8:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Nov 28, 2016 at 11:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Nov 27, 2016 at 10:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Mon, Nov 28, 2016 at 4:50 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Well, my original email did contain a discussion of the need for
>>> delete-marking.  I said that you couldn't do in-place updates when
>>> indexed columns were modified unless the index AMs had support for
>>> delete-marking, which is the same point you are making here.
>>
>> Sorry, I had not read that part earlier, but now that I read it, I
>> think there is a slight difference in what I am saying.   I thought
>> along with delete-marking, we might need transaction visibility
>> information in the index as well.
>
> I think we need to avoid putting the visibility information in the
> index because that will make the index much bigger.
>

I agree that there will be an increase in index size, but it shouldn't
be much if we have transaction information (and that too only active
transactions) at page level instead of at tuple level.  I think the
number of concurrent writes on the index will be lesser as compared to
the heap.  There are a couple of benefits of having visibility
information in the index.

a. Heap and index could be independently cleaned up and in most cases
by foreground transactions only.  The work for vacuum will be quite
less as compared to now.  I think this will help us in avoiding the
bloat to a great degree.

b. Improved index-only scans, because now we don't need visibility map
of the heap to check tuple visibility.

c. Some speed improvements for index scans can also be expected
because with this we don't need to perform heap fetch for invisible
index tuples.

+1
I think once we're considering marking deleted index tuples, we should provide an option of visibility-aware indexes.
Probably, it shouldn't be the only option for UNDO-based table engine.  But it definitely should be one of options.

d. This is the way to eventually have index-organized tables.  Once index is visiblity-aware, it becomes possible
to store date there without heap but with snapshots and transactions.  Also, it would be possible to achieve more
unification between heap and index access methods.  Imagine that heap is TID => tuple map and index is index_key => tuple
map.  I think the reason why there is distinguishing between heap (which is hardcoded) access method and index access method is that
heap is visiblity-aware and index is not visiblity aware.  Once they both are visibility-aware, there is no much difference between them:
they are just different kind of maps.  And they could implement the same interface.  Imagine you can select between heap-organized table
and index-organized table just by choosing its primary access method.  If you select heap for primary access method, indexes would
refer TID.  If you select btree on could id as primary access method, indexes would refer id could.  That would be great extendability.
Way better than what we have now.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: UNDO and in-place update

From
Robert Haas
Date:
On Tue, Nov 29, 2016 at 12:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I think we need to avoid putting the visibility information in the
>> index because that will make the index much bigger.
>
> I agree that there will be an increase in index size, but it shouldn't
> be much if we have transaction information (and that too only active
> transactions) at page level instead of at tuple level.

It depends on the workload.  If you have something where the number of
transactions per page is small, like a bulk load, that scheme may work
out great.  But if you have something like a pgbench workload with an
extra index on the abalance column, you could have a different
transaction for almost every tuple.  Of course the question of how
many of those will be recent is a good one; if it's not too many,
maybe it's not that bad.

> I think the
> number of concurrent writes on the index will be lesser as compared to
> the heap.  There are a couple of benefits of having visibility
> information in the index.
>
> a. Heap and index could be independently cleaned up and in most cases
> by foreground transactions only.  The work for vacuum will be quite
> less as compared to now.  I think this will help us in avoiding the
> bloat to a great degree.
>
> b. Improved index-only scans, because now we don't need visibility map
> of the heap to check tuple visibility.
>
> c. Some speed improvements for index scans can also be expected
> because with this we don't need to perform heap fetch for invisible
> index tuples.

I see what you're going for, but I'm not sure it's worth it.  I mean,
say you just have one bit per index tuple.  If it's set, the heap
tuple is definitely all-visible; if it's clear, then the tuple may or
may not be visible to your snapshot.  You insert tuples with the bit
clear, and clear the bit again when the tuple is deleted or the
indexed column is updated.  You set the bit when you follow the index
pointer and discover that the tuple is all-visible, so that next time
you don't need to follow it again.  Of the advantages you mention
above, this still allows (a), but less efficiently.  It allows (b).
It does not allow (c).  So, it's not as good.  But, it saves you the
overhead of storing xmin and/or xmax for each tuple in the page.  Even
if you have an optimization to list the XIDs once per page and then
refer to them from the tuples, you are going to need a byte in each
tuple for the xmin-index and a byte for the xmax-index, or something
like that, plus the space for the XID list itself.  That's not a ton
of space, maybe, but it's definitely more.  It's also a significantly
bigger change to the tuple format, I think.  What I'm proposing could
be done by repurposing the LP_* bits available in the line pointer.
So I think the questions are (1) will the extra space consumption hurt
us more than the benefit we will get? and (2) is it really worth the
work to change the page format to that degree?

As you can probably tell, I'm leaning toward "no".  I think a bit
per-tuple -- whether we call it delete-mark or definitely-all-visible
-- should be enough to get most of the benefit that is available here
for substantially less work and no extra space.

> I think with proposed design there is a good chance that for many of
> the index scans, there is a need for the extra hop to decide
> visibility of index tuple. I think as of today, during index scan if
> the corresponding heap page is not-all-visible, then we check the
> corresponding heap tuple to decide about the visibility of index
> tuple, whereas with proposed design, I think it first needs to check
> heap page and then TPD, if there is more than one transaction
> operating on page.

There's a couple of possible designs here, but there is the
possibility for extra hops in some cases.  But there are things we can
do to mitigate that.

1. If each tuple has a definitely-all-visible bit, you can check that
first; if it's set, the tuple is definitely all-visible you don't need
to do anything further.

2. If we still maintain a visibility map, you can check that next.  If
the bit is set for the target page, then the tuple is definitely
all-visible and you don't need to do anything further.

3. Otherwise, you need to look at the heap page.  But if the heap
page's UNDO pointer is invalid, then the whole page is all-visible, so
you're done.  (You can also set the visibility map bit and/or the
index tuple's definitely-all-visible bit to avoid having to recheck
the heap page in the future.)

4. Each tuple will have a bit indicating whether it's been recently
modified; that is, whether the transaction whose UNDO log is
referenced by the page's UNDO pointer did anything to that tuple; or
if the page points to a TPD entry, whether any of the transactions in
the TPD modified that tuple.  If that bit is not set for that
particular tuple, it's all-visible and you're done.

5. Otherwise you have to look at the TPD entry (if any) and UNDO logs.

If we put in the visibility information in the index as you are
proposing, then we can always resolve the visibility of the index
tuple at step 1; we just test xmin and xmax against the snapshot.  For
index-only scans, that's great, because we never need to consult the
heap at all.  For regular index scans, it's not nearly as great,
because we're still going to need to consult the TPD and UNDO logs to
determine which version of the tuple is visible to that snapshot.  You
can avoid looking at the heap in the case where no version of the
tuple is visible to the scan, but that's probably mostly going to
happen only in cases where the transaction is still in flight so in
most cases the heap will be in shared_buffers anyway and you won't
save very much.

> Such a design will work, but I think this is more work to mark the
> tuple as Dead as compared to current system.

Anything that allows update-in-place requires touching the index
entries when they are invalidated, right?  That's the main cost.

>>>>  That's OK, but we're in real trouble if
>>>> step-3 inserted an additional index tuple pointing to that chain
>>>> rather than simply noting that one already exists.  If it adds an
>>>> additional one, then we'll visit two index tuples for that TID.  Each
>>>> time, the visibility information in UNDO will cause us to return the
>>>> correct tuple, but we've erred by returning it twice (once per index
>>>> entry) rather than just once.
>>>
>>> Why will scan traverse the UNDO chain if it starts after step-3 commit?
>>
>> Why wouldn't it?  I think that if an index scan hits a page with an
>> UNDO chain, it always need to traverse the whole thing.
>
> If you notice the scan has started after all the writes have
> committed, so what is the guarantee that there will be a UNDO chain?
> Yes, it could be there if there is some live transaction which started
> before step, otherwise, I am not sure if we can guarantee that UNDO
> chain will be present.

The UNDO chain has to remain until all transactions that modified the
page are all-visible, not just until the transactions are committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: UNDO and in-place update

From
Alexander Korotkov
Date:
On Thu, Dec 1, 2016 at 6:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
There's a couple of possible designs here, but there is the
possibility for extra hops in some cases.  But there are things we can
do to mitigate that.

1. If each tuple has a definitely-all-visible bit, you can check that
first; if it's set, the tuple is definitely all-visible you don't need
to do anything further.

2. If we still maintain a visibility map, you can check that next.  If
the bit is set for the target page, then the tuple is definitely
all-visible and you don't need to do anything further.

3. Otherwise, you need to look at the heap page.  But if the heap
page's UNDO pointer is invalid, then the whole page is all-visible, so
you're done.  (You can also set the visibility map bit and/or the
index tuple's definitely-all-visible bit to avoid having to recheck
the heap page in the future.)

4. Each tuple will have a bit indicating whether it's been recently
modified; that is, whether the transaction whose UNDO log is
referenced by the page's UNDO pointer did anything to that tuple; or
if the page points to a TPD entry, whether any of the transactions in
the TPD modified that tuple.  If that bit is not set for that
particular tuple, it's all-visible and you're done.

5. Otherwise you have to look at the TPD entry (if any) and UNDO logs.

If we put in the visibility information in the index as you are
proposing, then we can always resolve the visibility of the index
tuple at step 1; we just test xmin and xmax against the snapshot.  For
index-only scans, that's great, because we never need to consult the
heap at all.  For regular index scans, it's not nearly as great,
because we're still going to need to consult the TPD and UNDO logs to
determine which version of the tuple is visible to that snapshot.  You
can avoid looking at the heap in the case where no version of the
tuple is visible to the scan, but that's probably mostly going to
happen only in cases where the transaction is still in flight so in
most cases the heap will be in shared_buffers anyway and you won't
save very much.

Idea of storing just one visibility bit in index tuple is a subject of serious doubts for me.

1. When definitely-all-visible isn't set then we have to recheck during scanning heap, right?
But our current recheck (i.e. rechecking scan quals) is not enough.  Imagine you have a range
index scan (val >= 100 AND val < 200), and val field was updated from val = 50 and val = 60
in some row.  Then you might have this row duplicated in the output.  Removing index scan and
sticking to only bitmap index scan doesn't seem to be fair.  Thus, we need to introduce another
kind of recheck that heapTuple.indexKey = indexTuple.indexKey.

2. Another question is how it could work while index key being intensively updated.  There is a risk
that we would have to traverse same UNDO-chains multiple times.  In worst case, we would have
to spend quadratic time of number of row versions since our snapshot taken.  We might try to mitigate this
by caching TID => heap tuple map for our snapshot.  But I don't like this approach.
If we have large enough index scan, then we could either run out of cache or consume too much memory
for that cache.

So, I think storing visibility information in index tuple is great for regular index scans too.  At least,
it allow us to evade #2 problem.  That is great by itself.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: UNDO and in-place update

From
Robert Haas
Date:
On Fri, Dec 2, 2016 at 5:01 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Idea of storing just one visibility bit in index tuple is a subject of
> serious doubts for me.
>
> 1. When definitely-all-visible isn't set then we have to recheck during
> scanning heap, right?
> But our current recheck (i.e. rechecking scan quals) is not enough.  Imagine
> you have a range
> index scan (val >= 100 AND val < 200), and val field was updated from val =
> 50 and val = 60
> in some row.  Then you might have this row duplicated in the output.
> Removing index scan and
> sticking to only bitmap index scan doesn't seem to be fair.  Thus, we need
> to introduce another
> kind of recheck that heapTuple.indexKey = indexTuple.indexKey.

Yes.

> 2. Another question is how it could work while index key being intensively
> updated.  There is a risk
> that we would have to traverse same UNDO-chains multiple times.  In worst
> case, we would have
> to spend quadratic time of number of row versions since our snapshot taken.
> We might try to mitigate this
> by caching TID => heap tuple map for our snapshot.  But I don't like this
> approach.
> If we have large enough index scan, then we could either run out of cache or
> consume too much memory
> for that cache.

I agree with the concern, but I don't think that's necessarily the
only mitigation strategy.  The details of what goes into UNDO and what
goes into the TPD aren't really defined yet, and if we structure those
so that you can efficiently find out what happened to a particular TID
with only a bounded number of accesses, then it might not be too bad.
If you imagine having to walk a chain of 1000 UNDO entries once per
TID, and there are 100 TIDs on the page, that sounds pretty bad.  But
maybe we can design the UNDO format in such a way that you never
actually need to walk the entire chain, or only in extremely rare
corner cases.

It strikes me that there are three possibilities.  One is that we can
design the UNDO-based visibility reconstruction so that it is
blazingly fast.  In that case, the only benefit of putting the
visibility information in the index is that index-only scans will be
able to avoid touching the heap in some cases where they presently
cannot.  The second possibility is that despite our best efforts the
UNDO-based visibility reconstruction is doomed to be cripplingly slow.
In that case, even in "good" cases like sequential scans and bitmap
heap scans where we can process the entire page at once instead of
possibly having to make a separate visit per TID, we'll still be
painfully slow.  In that case, we might as well forget this project
altogether.  The third is that we're somewhere in the middle.  If
UNDO-based visibility reconstruction is only a minor annoyance in
sequential scan and bitmap-heap scan cases but, despite our best
efforts, becomes highly painful in index scan cases, then the case for
putting XIDs in the index suddenly becomes a lot more interesting in
my mind.

We may well end up in exactly that place, but I think it's a little
too early to decide yet.  I think we need to write a detailed design
for how UNDO-based visibility reconstruction would actually work, and
maybe even build a prototype to see how that performs, before we can
decide on this.  I kind of hope we don't need to do XIDs-in-the-index;
it sounds like a lot of work, and this project is bound to be
extremely difficult even without that additional complication.
However, if it becomes clear that it's the only way for a system like
this to perform acceptably, then it'll have to be done (unless we give
up on the whole thing).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: UNDO and in-place update

From
Amit Kapila
Date:
On Thu, Dec 1, 2016 at 8:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Nov 29, 2016 at 12:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I see what you're going for, but I'm not sure it's worth it.  I mean,
> say you just have one bit per index tuple.  If it's set, the heap
> tuple is definitely all-visible; if it's clear, then the tuple may or
> may not be visible to your snapshot.  You insert tuples with the bit
> clear, and clear the bit again when the tuple is deleted or the
> indexed column is updated.  You set the bit when you follow the index
> pointer and discover that the tuple is all-visible, so that next time
> you don't need to follow it again.  Of the advantages you mention
> above, this still allows (a), but less efficiently.  It allows (b).
> It does not allow (c).  So, it's not as good.  But, it saves you the
> overhead of storing xmin and/or xmax for each tuple in the page.  Even
> if you have an optimization to list the XIDs once per page and then
> refer to them from the tuples, you are going to need a byte in each
> tuple for the xmin-index and a byte for the xmax-index, or something
> like that, plus the space for the XID list itself.  That's not a ton
> of space, maybe, but it's definitely more.  It's also a significantly
> bigger change to the tuple format, I think.  What I'm proposing could
> be done by repurposing the LP_* bits available in the line pointer.
> So I think the questions are (1) will the extra space consumption hurt
> us more than the benefit we will get? and (2) is it really worth the
> work to change the page format to that degree?
>
> As you can probably tell, I'm leaning toward "no".  I think a bit
> per-tuple -- whether we call it delete-mark or definitely-all-visible
> -- should be enough to get most of the benefit that is available here
> for substantially less work and no extra space.
>

I can very well understand the reason for not doing so (IIUC, it is
about complexity and time to get it right when we are already trying
to solve one big and complex problem of the system), but saying most
of the benefit can be realized by having simple optimization seems
less convincing.  I think having simple optimizations won't solve the
multi-pass vacuum problem.  However, having the visibility information
in the index will solve that and avoid index bloat much aggressively.

>> I think with proposed design there is a good chance that for many of
>> the index scans, there is a need for the extra hop to decide
>> visibility of index tuple. I think as of today, during index scan if
>> the corresponding heap page is not-all-visible, then we check the
>> corresponding heap tuple to decide about the visibility of index
>> tuple, whereas with proposed design, I think it first needs to check
>> heap page and then TPD, if there is more than one transaction
>> operating on page.
>
> There's a couple of possible designs here, but there is the
> possibility for extra hops in some cases.  But there are things we can
> do to mitigate that.
>
> 1. If each tuple has a definitely-all-visible bit, you can check that
> first; if it's set, the tuple is definitely all-visible you don't need
> to do anything further.
>
> 2. If we still maintain a visibility map, you can check that next.  If
> the bit is set for the target page, then the tuple is definitely
> all-visible and you don't need to do anything further.
>
> 3. Otherwise, you need to look at the heap page.  But if the heap
> page's UNDO pointer is invalid, then the whole page is all-visible, so
> you're done.  (You can also set the visibility map bit and/or the
> index tuple's definitely-all-visible bit to avoid having to recheck
> the heap page in the future.)
>
> 4. Each tuple will have a bit indicating whether it's been recently
> modified; that is, whether the transaction whose UNDO log is
> referenced by the page's UNDO pointer did anything to that tuple; or
> if the page points to a TPD entry, whether any of the transactions in
> the TPD modified that tuple.  If that bit is not set for that
> particular tuple, it's all-visible and you're done.
>

I think 4th optimization is especially good and can cover many cases,
but I am not sure how do we unset it once it is set.  For example,
once the effect of transaction that has touched that row is
all-visible, how will we unset it.   It might be better to store the
offset of transaction that has last modified the tuple, if the last
transaction that has touched the row is visible to snapshot, then we
don't need to process the UNDO.

> 5. Otherwise you have to look at the TPD entry (if any) and UNDO logs.
>
> If we put in the visibility information in the index as you are
> proposing, then we can always resolve the visibility of the index
> tuple at step 1; we just test xmin and xmax against the snapshot.  For
> index-only scans, that's great, because we never need to consult the
> heap at all.  For regular index scans, it's not nearly as great,
> because we're still going to need to consult the TPD and UNDO logs to
> determine which version of the tuple is visible to that snapshot.
>

Yeah, but we might think of extending it in future so that index also
has UNDO logs which can help such a case.  Also, I think the cases
where the tuple is visible but not marked as definitely-all-visible
will be much cheaper.

One more thing that needs some thoughts is how will the rollback work
if keep just one bit for delete marking.  I mean for heap we can
directly apply from UNDO, but for the index, I think we need to also
take care of undoing it in some way.  For example, search the tree to
remove the deleting marking from old value and delete marking the new
value.  Now if that is what we need to do then first we need to
traverse the tree twice which is okay considering rollback is a seldom
operation, the second challenge would be how to make such an operation
atomic with respect to WAL.

>>>>>  That's OK, but we're in real trouble if
>>>>> step-3 inserted an additional index tuple pointing to that chain
>>>>> rather than simply noting that one already exists.  If it adds an
>>>>> additional one, then we'll visit two index tuples for that TID.  Each
>>>>> time, the visibility information in UNDO will cause us to return the
>>>>> correct tuple, but we've erred by returning it twice (once per index
>>>>> entry) rather than just once.
>>>>
>>>> Why will scan traverse the UNDO chain if it starts after step-3 commit?
>>>
>>> Why wouldn't it?  I think that if an index scan hits a page with an
>>> UNDO chain, it always need to traverse the whole thing.
>>
>> If you notice the scan has started after all the writes have
>> committed, so what is the guarantee that there will be a UNDO chain?
>> Yes, it could be there if there is some live transaction which started
>> before step, otherwise, I am not sure if we can guarantee that UNDO
>> chain will be present.
>
> The UNDO chain has to remain until all transactions that modified the
> page are all-visible, not just until the transactions are committed.
>

Okay, I think this will work if we avoid the second insertion of the
same value-TID pair in the index as you mentioned above.   Without
that, I think we might not distinguish which of the two rows are
visible for a transaction to which the latest (step-3) row is visible.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: UNDO and in-place update

From
Robert Haas
Date:
On Mon, Dec 5, 2016 at 4:49 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I can very well understand the reason for not doing so (IIUC, it is
> about complexity and time to get it right when we are already trying
> to solve one big and complex problem of the system), but saying most
> of the benefit can be realized by having simple optimization seems
> less convincing.  I think having simple optimizations won't solve the
> multi-pass vacuum problem.  However, having the visibility information
> in the index will solve that and avoid index bloat much aggressively.

I don't think that multi-pass vacuum would exist in this system,
regardless of any of the details we're talking about here.  If you are
doing in-place updates of tuples and changing the indexed columns,
then you can't rely on the current system for removing index entries
at all.  The indexed entries don't point to a "dead" line pointer;
they point to a tuple that no longer matches the index entry.

>> 3. Otherwise, you need to look at the heap page.  But if the heap
>> page's UNDO pointer is invalid, then the whole page is all-visible, so
>> you're done.  (You can also set the visibility map bit and/or the
>> index tuple's definitely-all-visible bit to avoid having to recheck
>> the heap page in the future.)
>>
>> 4. Each tuple will have a bit indicating whether it's been recently
>> modified; that is, whether the transaction whose UNDO log is
>> referenced by the page's UNDO pointer did anything to that tuple; or
>> if the page points to a TPD entry, whether any of the transactions in
>> the TPD modified that tuple.  If that bit is not set for that
>> particular tuple, it's all-visible and you're done.
>
> I think 4th optimization is especially good and can cover many cases,
> but I am not sure how do we unset it once it is set.  For example,
> once the effect of transaction that has touched that row is
> all-visible, how will we unset it.   It might be better to store the
> offset of transaction that has last modified the tuple, if the last
> transaction that has touched the row is visible to snapshot, then we
> don't need to process the UNDO.

If the page has an invalid UNDO pointer and you change it to a valid
UNDO pointer, you unset all the bits at the same time, except of
course for the bit pertaining to the tuple you are modifying.  If the
page's UNDO pointer is still valid when you update it, then you set
the bits for any tuples you are modifying for which it is not already
set.

> One more thing that needs some thoughts is how will the rollback work
> if keep just one bit for delete marking.  I mean for heap we can
> directly apply from UNDO, but for the index, I think we need to also
> take care of undoing it in some way.  For example, search the tree to
> remove the deleting marking from old value and delete marking the new
> value.  Now if that is what we need to do then first we need to
> traverse the tree twice which is okay considering rollback is a seldom
> operation, the second challenge would be how to make such an operation
> atomic with respect to WAL.

We don't necessarily need UNDO to clean up the indexes, although it
might be a good idea.  It would *work* to just periodically scan the
index for delete-marked items.  For each one, visit the heap and see
if there's a version of that tuple present in the heap or current UNDO
that matches the index entry.  If not, remove the index entry.
However, we could try to be more aggressive: when you UNDO an UPDATE
or DELETE in the heap, also go search the index and remove any
delete-marked entries that are no longer needed.  The advantage of
this is that we'd be more likely to visit the index entries while they
are still in cache.  The disadvantage is that it is complex and might
fail if the index is corrupted.  We can't have the whole system melt
down in that case; UNDO has to always succeed.

>> The UNDO chain has to remain until all transactions that modified the
>> page are all-visible, not just until the transactions are committed.
>
> Okay, I think this will work if we avoid the second insertion of the
> same value-TID pair in the index as you mentioned above.   Without
> that, I think we might not distinguish which of the two rows are
> visible for a transaction to which the latest (step-3) row is visible.

Agreed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UNDO and in-place update

From
Amit Kapila
Date:
On Tue, Dec 6, 2016 at 9:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Dec 5, 2016 at 4:49 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I can very well understand the reason for not doing so (IIUC, it is
>> about complexity and time to get it right when we are already trying
>> to solve one big and complex problem of the system), but saying most
>> of the benefit can be realized by having simple optimization seems
>> less convincing.  I think having simple optimizations won't solve the
>> multi-pass vacuum problem.  However, having the visibility information
>> in the index will solve that and avoid index bloat much aggressively.
>
> I don't think that multi-pass vacuum would exist in this system,
> regardless of any of the details we're talking about here.  If you are
> doing in-place updates of tuples and changing the indexed columns,
> then you can't rely on the current system for removing index entries
> at all.

Sure.  The point I was trying to make was that index entries can't be
removed independently apart from the case where some previous index
scan has marked it dead after consulting heap.  In this new system, I
think we can't remove undo entries of heap page till we clear
corresponding index entries. I think we need to somehow collect the
old values from undo corresponding to index and then scan the index
remove the index entry and after that corresponding undo entry can be
removed.

>  The indexed entries don't point to a "dead" line pointer;
> they point to a tuple that no longer matches the index entry.
>
>>> 3. Otherwise, you need to look at the heap page.  But if the heap
>>> page's UNDO pointer is invalid, then the whole page is all-visible, so
>>> you're done.  (You can also set the visibility map bit and/or the
>>> index tuple's definitely-all-visible bit to avoid having to recheck
>>> the heap page in the future.)
>>>
>>> 4. Each tuple will have a bit indicating whether it's been recently
>>> modified; that is, whether the transaction whose UNDO log is
>>> referenced by the page's UNDO pointer did anything to that tuple; or
>>> if the page points to a TPD entry, whether any of the transactions in
>>> the TPD modified that tuple.  If that bit is not set for that
>>> particular tuple, it's all-visible and you're done.
>>
>> I think 4th optimization is especially good and can cover many cases,
>> but I am not sure how do we unset it once it is set.  For example,
>> once the effect of transaction that has touched that row is
>> all-visible, how will we unset it.   It might be better to store the
>> offset of transaction that has last modified the tuple, if the last
>> transaction that has touched the row is visible to snapshot, then we
>> don't need to process the UNDO.
>
> If the page has an invalid UNDO pointer and you change it to a valid
> UNDO pointer, you unset all the bits at the same time, except of
> course for the bit pertaining to the tuple you are modifying.  If the
> page's UNDO pointer is still valid when you update it, then you set
> the bits for any tuples you are modifying for which it is not already
> set.
>

Okay, so this optimization can work only after all the active
transactions operating on a page are finished.  If that is true, in
some cases such a design can consume a lot of CPU traversing all the
tuples in a page for un-setting the bit, especially when such tuples
are less.


>> One more thing that needs some thoughts is how will the rollback work
>> if keep just one bit for delete marking.  I mean for heap we can
>> directly apply from UNDO, but for the index, I think we need to also
>> take care of undoing it in some way.  For example, search the tree to
>> remove the deleting marking from old value and delete marking the new
>> value.  Now if that is what we need to do then first we need to
>> traverse the tree twice which is okay considering rollback is a seldom
>> operation, the second challenge would be how to make such an operation
>> atomic with respect to WAL.
>
> We don't necessarily need UNDO to clean up the indexes, although it
> might be a good idea.  It would *work* to just periodically scan the
> index for delete-marked items.  For each one, visit the heap and see
> if there's a version of that tuple present in the heap or current UNDO
> that matches the index entry.  If not, remove the index entry.
>

I think this is somewhat similar to how we clean the index now and
seems to be a workable design.  However, don't you think that it will
have somewhat similar characteristics for index-bloat as we have now?
OTOH for heap, it will help us to take away old versions away from the
main heap, but still, we can't get rid of that space or reuse that
space till we can clean corresponding index entries.

> However, we could try to be more aggressive: when you UNDO an UPDATE
> or DELETE in the heap, also go search the index and remove any
> delete-marked entries that are no longer needed.  The advantage of
> this is that we'd be more likely to visit the index entries while they
> are still in cache.  The disadvantage is that it is complex and might
> fail if the index is corrupted.  We can't have the whole system melt
> down in that case; UNDO has to always succeed.
>

Agreed.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UNDO and in-place update

From
Dilip Kumar
Date:
On Wed, Jan 4, 2017 at 4:35 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> In this new system, I
> think we can't remove undo entries of heap page till we clear
> corresponding index entries. I think we need to somehow collect the
> old values from undo corresponding to index and then scan the index
> remove the index entry and after that corresponding undo entry can be
> removed.

Do we really need to keep undo for heap until index entry is not
removed? IIUC, we anyway need to revalidate the index key with heap
tuple. What I am trying the say is that if we no longer needed UNDO
for the heap page (e.g because of rollback) then we can apply the UNDO
and remove it. I agree that there will be multiple index entries will
be pointing to this tuple, but only one of them can pass the key
revalidation with the heap. isn't it?


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UNDO and in-place update

From
Amit Kapila
Date:
On Wed, Jan 4, 2017 at 8:35 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Wed, Jan 4, 2017 at 4:35 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> In this new system, I
>> think we can't remove undo entries of heap page till we clear
>> corresponding index entries. I think we need to somehow collect the
>> old values from undo corresponding to index and then scan the index
>> remove the index entry and after that corresponding undo entry can be
>> removed.
>
> Do we really need to keep undo for heap until index entry is not
> removed?
>

UNDO has to be kept till heap page is marked as all visible.  This is
required to check the visibility of index.  Now, I think the page can
be marked as all visible when we removed corresponding dead entries in
heap.   I think the main point of discussion here is how vacuum will
clean heap and index entries in this new system.  Another idea could
be that we allow undo entries to be removed (we can allow heap entry
to be removed) based on if they are visible or not and then while
traversing index we cross check in heap if the entry can be removed.
Basically,  check the TID and traverse undo contents if any to verify
if the index entry can be removed.  I think this sounds to be more
complicated and less efficient than what I suggested earlier.

> IIUC, we anyway need to revalidate the index key with heap
> tuple. What I am trying the say is that if we no longer needed UNDO
> for the heap page (e.g because of rollback) then we can apply the UNDO
> and remove it. I agree that there will be multiple index entries will
> be pointing to this tuple, but only one of them can pass the key
> revalidation with the heap. isn't it?
>

Yeah, but I think for that, you need to grovel through all the undo
chain if present for a tuple.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UNDO and in-place update

From
Robert Haas
Date:
On Wed, Jan 4, 2017 at 6:05 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Okay, so this optimization can work only after all the active
> transactions operating on a page are finished.  If that is true, in
> some cases such a design can consume a lot of CPU traversing all the
> tuples in a page for un-setting the bit, especially when such tuples
> are less.

I suppose.  I didn't think that cost was likely to be big enough to
worry about, but I might be wrong.  The worst case would be when you
modify one tuple on a page, let the transaction that did the
modification become all-visible, modify one tuple on the page again,
etc. and, at the same time, the page is entirely full of tuples.  So
you keep having to loop over all the bits to clear them (they are all
clear except one, but you don't know that) and then re-set just one of
them.  That's not free, but keep in mind that the existing system
would be forced to perform non-HOT updates in that situation, which
isn't free either.

Also, I'm thinking the bit could be stored in the line pointer rather
than the tuple, because with this design we don't need
LP_UNUSED/LP_NORMAL/LP_REDIRECT/LP_DEAD any more.  We could use one
bit to indicate dead or not-dead and the second bit to indicate
recently-modified or not-recently-modified.  With that approach,
clearing the bits only requires iterating over the line pointer array,
not the tuples themselves.

>> We don't necessarily need UNDO to clean up the indexes, although it
>> might be a good idea.  It would *work* to just periodically scan the
>> index for delete-marked items.  For each one, visit the heap and see
>> if there's a version of that tuple present in the heap or current UNDO
>> that matches the index entry.  If not, remove the index entry.
>
> I think this is somewhat similar to how we clean the index now and
> seems to be a workable design.  However, don't you think that it will
> have somewhat similar characteristics for index-bloat as we have now?

Yes, it would have similar characteristics.  Thus we might want to do better.

> OTOH for heap, it will help us to take away old versions away from the
> main heap, but still, we can't get rid of that space or reuse that
> space till we can clean corresponding index entries.

I don't think that's true.  If in-place update is ever allowed in
cases where indexed columns have been modified, then the index already
has to cope with the possibility that the heap tuple it can see
doesn't match the index.  And if it can cope with that, then why do we
have to remove the index entry before reusing the heap TID?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UNDO and in-place update

From
Robert Haas
Date:
On Thu, Jan 5, 2017 at 6:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> UNDO has to be kept till heap page is marked as all visible.  This is
> required to check the visibility of index.  Now, I think the page can
> be marked as all visible when we removed corresponding dead entries in
> heap.   I think the main point of discussion here is how vacuum will
> clean heap and index entries in this new system.  Another idea could
> be that we allow undo entries to be removed (we can allow heap entry
> to be removed) based on if they are visible or not and then while
> traversing index we cross check in heap if the entry can be removed.
> Basically,  check the TID and traverse undo contents if any to verify
> if the index entry can be removed.  I think this sounds to be more
> complicated and less efficient than what I suggested earlier.

In my proposal, when a tuple gets updated or deleted in the heap, we
go find the corresponding index entries and delete-mark them.  When an
index tuple is delete-marked, we have to cross-check it against the
heap, because the index tuple might not match the version of the heap
tuple that our snapshot can see.  Therefore, it's OK to discard the
heap page's UNDO before cleaning the indexes, because the index tuples
are already delete-marked and rechecks will therefore do the right
thing.

In short, I agree with Dilip.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UNDO and in-place update

From
Amit Kapila
Date:
On Thu, Jan 5, 2017 at 9:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jan 4, 2017 at 6:05 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Okay, so this optimization can work only after all the active
>> transactions operating on a page are finished.  If that is true, in
>> some cases such a design can consume a lot of CPU traversing all the
>> tuples in a page for un-setting the bit, especially when such tuples
>> are less.
>
> I suppose.  I didn't think that cost was likely to be big enough to
> worry about, but I might be wrong.  The worst case would be when you
> modify one tuple on a page, let the transaction that did the
> modification become all-visible, modify one tuple on the page again,
> etc. and, at the same time, the page is entirely full of tuples.  So
> you keep having to loop over all the bits to clear them (they are all
> clear except one, but you don't know that) and then re-set just one of
> them.  That's not free, but keep in mind that the existing system
> would be forced to perform non-HOT updates in that situation, which
> isn't free either.
>
> Also, I'm thinking the bit could be stored in the line pointer rather
> than the tuple, because with this design we don't need
> LP_UNUSED/LP_NORMAL/LP_REDIRECT/LP_DEAD any more.  We could use one
> bit to indicate dead or not-dead and the second bit to indicate
> recently-modified or not-recently-modified.  With that approach,
> clearing the bits only requires iterating over the line pointer array,
> not the tuples themselves.
>

I think this can help in mitigating the overhead.  However, now
another question is if we just unset it when there is no other active
transaction operating on the current page except for current
transaction, then will that tuple be considered all-visible?  I think
no transaction operating on a page can't be taken as a guarantee for
tuple to be marked as all-visible. If that is true, then what is
advantage of clearing the bit?

>>> We don't necessarily need UNDO to clean up the indexes, although it
>>> might be a good idea.  It would *work* to just periodically scan the
>>> index for delete-marked items.  For each one, visit the heap and see
>>> if there's a version of that tuple present in the heap or current UNDO
>>> that matches the index entry.  If not, remove the index entry.
>>
>> I think this is somewhat similar to how we clean the index now and
>> seems to be a workable design.  However, don't you think that it will
>> have somewhat similar characteristics for index-bloat as we have now?
>
> Yes, it would have similar characteristics.  Thus we might want to do better.
>
>> OTOH for heap, it will help us to take away old versions away from the
>> main heap, but still, we can't get rid of that space or reuse that
>> space till we can clean corresponding index entries.
>
> I don't think that's true.  If in-place update is ever allowed in
> cases where indexed columns have been modified, then the index already
> has to cope with the possibility that the heap tuple it can see
> doesn't match the index.  And if it can cope with that, then why do we
> have to remove the index entry before reusing the heap TID?
>

No need. I think I can see what you are saying.

>> UNDO has to be kept till heap page is marked as all visible.  This is
>> required to check the visibility of index.  Now, I think the page can
>> be marked as all visible when we removed corresponding dead entries in
>> heap.   I think the main point of discussion here is how vacuum will
>> clean heap and index entries in this new system.  Another idea could
>> be that we allow undo entries to be removed (we can allow heap entry
>> to be removed) based on if they are visible or not and then while
>> traversing index we cross check in heap if the entry can be removed.
>> Basically,  check the TID and traverse undo contents if any to verify
>> if the index entry can be removed.  I think this sounds to be more
>> complicated and less efficient than what I suggested earlier.
>
> In my proposal, when a tuple gets updated or deleted in the heap, we
> go find the corresponding index entries and delete-mark them.  When an
> index tuple is delete-marked, we have to cross-check it against the
> heap, because the index tuple might not match the version of the heap
> tuple that our snapshot can see.  Therefore, it's OK to discard the
> heap page's UNDO before cleaning the indexes, because the index tuples
> are already delete-marked and rechecks will therefore do the right
> thing.
>

Okay, I think we could clean undo and heap without caring for the
index.  Basically, the idea works on the premise that we won't allow
same value rows in the index for same TID and using rechecks we can
identify the deleted tuple.  On rethinking about the working of vacuum
in this system,  it seems we can clean the heap independently and then
for index we crosscheck each of the entry in the heap that is marked
as deleted.  Now this has the advantage that we don't need to do two
passes of the heap, but for the index, we might need to re-fetch heap
pages randomly to detect if the delete marked row is actually deleted.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UNDO and in-place update

From
Robert Haas
Date:
On Fri, Jan 6, 2017 at 6:28 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Also, I'm thinking the bit could be stored in the line pointer rather
>> than the tuple, because with this design we don't need
>> LP_UNUSED/LP_NORMAL/LP_REDIRECT/LP_DEAD any more.  We could use one
>> bit to indicate dead or not-dead and the second bit to indicate
>> recently-modified or not-recently-modified.  With that approach,
>> clearing the bits only requires iterating over the line pointer array,
>> not the tuples themselves.
>
> I think this can help in mitigating the overhead.  However, now
> another question is if we just unset it when there is no other active
> transaction operating on the current page except for current
> transaction, then will that tuple be considered all-visible?  I think
> no transaction operating on a page can't be taken as a guarantee for
> tuple to be marked as all-visible. If that is true, then what is
> advantage of clearing the bit?

That's kind of a strange question.  The mission of this proposed bit
is to tell you whether the transaction(s) referenced in the page's
UNDO pointer have modified that tuple.  If you didn't clear it when
you reset the UNDO pointer to a new transaction, then the bit would
always be set for every tuple on the page and it would be completely
useless.  (I mean, the tuples had to be inserted originally, right?
So the bit was set then.  And if you never clear it, it will just stay
set.)

As to your other questions: If the page's UNDO pointer isn't valid,
then the whole page is all-visible.  If the page's UNDO pointer is
valid, then any tuple for which the bit is not set is all-visible.

> Okay, I think we could clean undo and heap without caring for the
> index.  Basically, the idea works on the premise that we won't allow
> same value rows in the index for same TID and using rechecks we can
> identify the deleted tuple.

Right, seems we are on the same page now.  Just to be clear, I'm not
saying there aren't other possible designs, and one of those designs
might be better.  But at least now we both seem to have the same
understanding of what I proposed, which seems like progress.

> On rethinking about the working of vacuum
> in this system,  it seems we can clean the heap independently and then
> for index we crosscheck each of the entry in the heap that is marked
> as deleted.

Right.

> Now this has the advantage that we don't need to do two
> passes of the heap, but for the index, we might need to re-fetch heap
> pages randomly to detect if the delete marked row is actually deleted.

Yes.  There are some performance trade-offs here no matter what
decision you make.  If you decide to try to remove delete-marked index
entries during UNDO, then (1) you have to somehow make sure that those
index entries aren't still needed by some previous transaction that
isn't yet all-visible (e.g. imagine key is updated 1->2->1, first
update commits but second one rolls back before first is all-visible)
which might be expensive and (2) if those index pages have been
evicted from cache you will incur additional I/O to bring them back
into cache, which might be more expensive than just cleaning them
later.  On the other hand, if you don't try to remove index pointers
during UNDO, then you're leaving bloat in your index.  We might want
to consider some combination of these things - e.g. if the page has
only one recent transaction (no TPD) and the index pages are still in
memory, have UNDO try to remove them, otherwise clean them later.  Or
some other rule.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UNDO and in-place update

From
Amit Kapila
Date:
On Fri, Jan 6, 2017 at 7:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 6, 2017 at 6:28 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Also, I'm thinking the bit could be stored in the line pointer rather
>>> than the tuple, because with this design we don't need
>>> LP_UNUSED/LP_NORMAL/LP_REDIRECT/LP_DEAD any more.  We could use one
>>> bit to indicate dead or not-dead and the second bit to indicate
>>> recently-modified or not-recently-modified.  With that approach,
>>> clearing the bits only requires iterating over the line pointer array,
>>> not the tuples themselves.
>>
>> I think this can help in mitigating the overhead.  However, now
>> another question is if we just unset it when there is no other active
>> transaction operating on the current page except for current
>> transaction, then will that tuple be considered all-visible?  I think
>> no transaction operating on a page can't be taken as a guarantee for
>> tuple to be marked as all-visible. If that is true, then what is
>> advantage of clearing the bit?
>
> That's kind of a strange question.  The mission of this proposed bit
> is to tell you whether the transaction(s) referenced in the page's
> UNDO pointer have modified that tuple.  If you didn't clear it when
> you reset the UNDO pointer to a new transaction, then the bit would
> always be set for every tuple on the page and it would be completely
> useless.  (I mean, the tuples had to be inserted originally, right?
> So the bit was set then.  And if you never clear it, it will just stay
> set.)
>
> As to your other questions: If the page's UNDO pointer isn't valid,
> then the whole page is all-visible.  If the page's UNDO pointer is
> valid, then any tuple for which the bit is not set is all-visible.
>

Okay, I see the point.  I think here UNDO pointer can be marked
invalid either during page-pruning or vacuum.

Another point which I think is not discussed till now is, how the
locking will work.  As of now, we first lock the tuple and then xid on
which we have to wait. In the new design, we don't know the xid which
has updated the tuple and I am not sure there is any easy way to find
that information.  One idea could be that we have some fixed number of
slots (i think we can make it variable as well, but for simplicity,
lets consider it as fixed) in the page header which will store the
offset to the transaction id inside a TPD entry of the page.  Consider
a TPD entry of page contains four transactions, so we will just store
enough information in heap page header to reach the transaction id for
these four transactions. I think each such page header slot could be
three or four bits long depending upon how many concurrent
transactions we want to support on a page after which a new
transaction has to wait (I think in most workloads supporting
simultaneous eight transactions on a page should be sufficient).
Then we can have an additional byte (or less than byte) in the tuple
header to store lock info which is nothing but an offset to the slot
in the page header.   We might find some other locking technique as
well, but I think keeping it same as current has benefit.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UNDO and in-place update

From
Robert Haas
Date:
On Mon, Jan 9, 2017 at 7:50 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Okay, I see the point.  I think here UNDO pointer can be marked
> invalid either during page-pruning or vacuum.

I explicitly want to avoid that, because it means re-dirtying the
page.  The UNDO pointer becomes invalid by removing the UNDO log to
which it points; the page itself does not need to be dirtied for the
UNDO pointer to become invalid.

> Another point which I think is not discussed till now is, how the
> locking will work.  As of now, we first lock the tuple and then xid on
> which we have to wait. In the new design, we don't know the xid which
> has updated the tuple and I am not sure there is any easy way to find
> that information.

Yes, there is: you can get it from the UNDO pointer.  Each UNDO
pointer is of the form <epoch, XID, offset>. If the page has a regular
UNDO pointer, then you can get the XID for any modified table right
out of the UNDO pointer itself.  If the page has a TPD pointer, the
TPD contains a series of UNDO pointers and you can get all of the XIDs
that have touched the page by iterating through those UNDO pointers.
There is some work to do to make it cheap to find which XID pertains
to which updated tuple, but I think that problem can be solved by
choosing carefully what to store in the TPD.  I think, in fact, that
for performance it's absolutely essential to solve that problem well.

> One idea could be that we have some fixed number of
> slots (i think we can make it variable as well, but for simplicity,
> lets consider it as fixed) in the page header which will store the
> offset to the transaction id inside a TPD entry of the page.  Consider
> a TPD entry of page contains four transactions, so we will just store
> enough information in heap page header to reach the transaction id for
> these four transactions. I think each such page header slot could be
> three or four bits long depending upon how many concurrent
> transactions we want to support on a page after which a new
> transaction has to wait (I think in most workloads supporting
> simultaneous eight transactions on a page should be sufficient).
> Then we can have an additional byte (or less than byte) in the tuple
> header to store lock info which is nothing but an offset to the slot
> in the page header.   We might find some other locking technique as
> well, but I think keeping it same as current has benefit.

Yes, something like this can be done.  You don't really need any new
page-level header data, because you can get the XIDs from the TPD
entry (or from the page itself if there's only one).  But you could
expand the single "is-modified" bit that I've proposed adding to each
tuple to multiple bits.  0 means not recently modified.  1 means
modified by the first or only transaction that has recently modified
the page.  2 means modified by the second transaction that has
recently modified the page.  Etc.

What I was thinking about doing instead is storing an array in the TPD
containing the same information.  There would be one byte or one half
a byte or whatever per TID and it would contain the index of the XID
in the TPD that had most recently modified or locked that TID.  Your
solution might be better, though, at least for cases where the number
of tuples that have modified the page is small.  However, I'm not
totally sure.  I think it's important to keep the tuple headers VERY
small, like 3 bytes.  Or 2 bytes.  Or maybe even variable size but
only 1 byte in common cases.  So I expect bit space in those places to
be fairly scarce and precious.

Now that might be the wrong idea -- maybe it's worth expanding that
header in order to speed things up.  On the other hand, having to read
more database pages in order to process the same amount of data is
*really* expensive, especially when you start talking about data sets
that probably don't fit in memory, like a 10 TB or 100 TB table.  If
you've got 100 tuples per page (~81 bytes per tuple), increasing the
size of a tuple by 1 byte causes that 100 TB table to increase in size
by about 1.2 TB (modulo padding effects).  An extra byte of space (or
even an extra ten bytes) doesn't matter much for a table with a
million tuples in it because the whole table is going to fit in memory
either way, but when you have 10+ billion tuples those extra bytes
start to matter a LOT.  And the problem is not only or even primarily
the space consumption - the issue is that all of your queries run that
much slower because of the extra I/O.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UNDO and in-place update

From
Amit Kapila
Date:
On Mon, Jan 9, 2017 at 11:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jan 9, 2017 at 7:50 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> One idea could be that we have some fixed number of
>> slots (i think we can make it variable as well, but for simplicity,
>> lets consider it as fixed) in the page header which will store the
>> offset to the transaction id inside a TPD entry of the page.  Consider
>> a TPD entry of page contains four transactions, so we will just store
>> enough information in heap page header to reach the transaction id for
>> these four transactions. I think each such page header slot could be
>> three or four bits long depending upon how many concurrent
>> transactions we want to support on a page after which a new
>> transaction has to wait (I think in most workloads supporting
>> simultaneous eight transactions on a page should be sufficient).
>> Then we can have an additional byte (or less than byte) in the tuple
>> header to store lock info which is nothing but an offset to the slot
>> in the page header.   We might find some other locking technique as
>> well, but I think keeping it same as current has benefit.
>
> Yes, something like this can be done.  You don't really need any new
> page-level header data, because you can get the XIDs from the TPD
> entry (or from the page itself if there's only one).  But you could
> expand the single "is-modified" bit that I've proposed adding to each
> tuple to multiple bits.  0 means not recently modified.  1 means
> modified by the first or only transaction that has recently modified
> the page.  2 means modified by the second transaction that has
> recently modified the page.  Etc.
>

makes sense.

> What I was thinking about doing instead is storing an array in the TPD
> containing the same information.  There would be one byte or one half
> a byte or whatever per TID and it would contain the index of the XID
> in the TPD that had most recently modified or locked that TID.  Your
> solution might be better, though, at least for cases where the number
> of tuples that have modified the page is small.
>

I think we also need to prevent multiple backends trying to reserve a
slot in this array which can be a point of contention.  Another point
is during pruning, if due to row movement TIDs are changed, we need to
keep this array in sync.

>  However, I'm not
> totally sure.  I think it's important to keep the tuple headers VERY
> small, like 3 bytes.  Or 2 bytes.  Or maybe even variable size but
> only 1 byte in common cases.  So I expect bit space in those places to
> be fairly scarce and precious.
>

I agree that we should carefully choose the format so as to keep a
trade-off between performance and space savings.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UNDO and in-place update

From
Amit Kapila
Date:
On Tue, Jan 10, 2017 at 7:28 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Jan 9, 2017 at 11:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> Yes, something like this can be done.  You don't really need any new
>> page-level header data, because you can get the XIDs from the TPD
>> entry (or from the page itself if there's only one).  But you could
>> expand the single "is-modified" bit that I've proposed adding to each
>> tuple to multiple bits.  0 means not recently modified.  1 means
>> modified by the first or only transaction that has recently modified
>> the page.  2 means modified by the second transaction that has
>> recently modified the page.  Etc.
>>
>
> makes sense.
>
>> What I was thinking about doing instead is storing an array in the TPD
>> containing the same information.  There would be one byte or one half
>> a byte or whatever per TID and it would contain the index of the XID
>> in the TPD that had most recently modified or locked that TID.  Your
>> solution might be better, though, at least for cases where the number
>> of tuples that have modified the page is small.
>>
>
> I think we also need to prevent multiple backends trying to reserve a
> slot in this array which can be a point of contention.  Another point
> is during pruning, if due to row movement TIDs are changed, we need to
> keep this array in sync.
>

Moving further, I have thought about consistent reads and different
formats for storing undo with pros and cons of each format.

Consistent Reads
----------------------------
If the page is modified by any transaction that is not visible to read
transaction's snapshot, we have to reconstruct a copy of the page.
Now there could be multiple ways to achieve it:

1. Apply the undo for the transactions that are not visible to read
transaction and keep the copy of the page.
2. Apply the undo for all the transactions in TPD entry of a page.

I think we can't do second because this can create rows which are not
consistent due to apply of undo log for transactions which are visible
to read transaction.  Now, where the first approach will work, but it
might turn out to be inefficient if not implemented carefully, as for
each read transaction we need to create separate copies of the page
where a different set of transactions are not visible to different
read transactions. Even for read transactions to which the same set of
transactions are not visible, we need some way to recognize the
version of the page that can be used, probably try with all the
version of pages and see if any of the existing versions can serve the
purpose.  I think we also need some way to link the different versions
of the page in shared buffers so that before trying to create a new
version of the page we can look at linked versions.

Yet another consideration in this area could be to perform consistent
reads differently for index scans.  For index scan, we will receive a
particular TID to read, so, in this case, we can try to reconstruct
the old image of that particular row.  Now, whereas this sounds
efficient, but we might need to reconstruct the old image of rows
multiple times for different read transactions. The effect will be
magnificent when we have to perform this for multiple rows of a page.
I think it might be better to always construct a complete copy of page
if one or more transactions that have changed the page are not visible
to read snapshot.

Undo Page Format
-------------------------------
I think we can organize undo in multiple ways

(a) One undo file per transaction - We can name each undo file as
epoch+xid or epch+xid.undo. These files can be kept in pg_undo
directory under PGDATA.  In this approach, we have to store
latest_undo offset in file header to perform the rollback operation.
Also, this offset can be used to allocate the location for new undo
record.  Apart from that undo entries can be organized one after
another.  Each undo entry will have two back pointers, one to reach
the previous undo location which will be used for rollback and the
second one to maintain the prior undo chain for changes to a
particular page which will be used to reconstruct the copy of page.
During recovery,  we need to read each undo file and check the
transaction status in 'clog', if it is not committed, then apply all
the undo starting from latest_undo position.  I think we should
organize undo's in form of pages with a size similar to heap pages.
This will be helpful both for writing WAL and doing checkpoints for
undo where we expect pages.  The access to UNDO can be via shared
buffers. I think we might want to keep the access to shared buffers
containing undo separate from heap/index pages so that they can't be
evicted based on access of heap pages.  Different TPD entries
containing the same XID will point to different locations in undo.
Undo location will be just a page offset for a particular page entry.
Vacuum can delete or make the undo file reusable when the
corresponding transaction precedes RecentGlobalXmin.

To decide which files can be removed/reused, Vacuum either needs to
traverse the TPD entries or need to find by traversing the files in
pg_undo directory.  Here, along with the removal of undo file, we also
need some way to indicate that TPD entry/entries for that undo are
invalid. A simple way could be that if the corresponding undo file
doesn't exist, then we can consider it as invalid.  However, that
might not work if we choose to store undo of multiple transactions in
one undo file.

The main advantage of this approach is that this will allow removing
the *not* required undo space efficiently.  The disadvantage is that
creating different files for each transaction can be costly and it can
lead to lot of undo files when there are some long running
transactions, of course, we can optimize by allowing to remove the
files in such cases and give user error "snapshot_too_old", but I
think still it can be a matter of concern.  To alleviate this concern,
we can try to keep multiple transactions undo data in one file.

(b) Multiple transactions in one undo file - Here we have to choose
some arbitrary name for undo files, let's say undo_1, undo_2, etc. for
the sake of discussion.  These files can be kept in pg_undo directory
under PGDATA.  In this approach, we can choose to select set of four
or eight 8KB pages (page of the same size as heap page) per
transaction, we can call each such set of pages as one undo segment.
If one segment is not sufficient to store undo for one transaction,
then we can allocate next segment.  Each segment can have segment
header in the first page which will contain previous segment address.
For the very first segment, previous segment address will be Nil, the
second segment will contain the address of the first segment, third,
will contain the address of second and so on.  Previous segment
address is needed to replay undo for rollback.  I have explicitly
avoided storing the next segment address as we don't need it.  We also
need some form of freespace map to keep track of undo segments that
are freed.  Once the vacuum decided to free the undo of a particular
transaction, it can add all the undo segments for that particular
transaction in freespace map.  For allocating a new segment for
existing transaction or new transaction, we need to first search in
freespace map for the available segment, if not available, then
allocate a new segment. In the file, the first segment will contain
slots to hold the latest_undo location and XID (probably (epoch +
XID)) for each of the transaction which has undo in the file.  We need
latest_undo location to rollback the transaction and XID is required
to rollback the transaction at the time of recovery.  Let's call the
first segment as a HEAD segment. We can extend this HEAD segment if
there is more space in the file, but for simplicity and matter of
discussion, let's consider one undo file can hold the transactions
that can get a slot in the HEAD segment.  Each slot can be 6 or 8
bytes long.  Considering each segment of size 32KB, we can hold undo
of ~4096 transactions in one file. We can keep some book-keeping
information to examine whether there is any empty slot in the first
segment of the file. So whenever a new transaction has to allocate an
undo, it checks in a HEAD segment of undo_1, if there is a space to
allocate new slot, allocate it, else try in next file undo_2.  We do
need to consider what to do when the undo file is chosen to write undo
for a transaction is finished and transaction still has to write more
data. There are a couple of options like we can first try to free the
space for any existing transaction which is already committed and if
there is no more freespace available we can either choose to wait or
allocate an additional XID to write UNDO as proposed in the initial
e-mail in this thread.  Like the previous approach, undo can be
accessed via shared buffers.  Similarly, each undo entry will have two
back pointers as described in the previous approach.  There are some
loose points like how freespace map will be organized which needs more
thoughts, but I think the core of the idea can be made to work.

This will overcome the disadvantages of previous approach (a) there
could be too many undo files (b) the cost associated with creating and
managing too many files.  I think this might be slightly lesser
efficient as compared to the previous approach in terms of purging the
space for unused undo space as it can't return the freed space in undo
file back to OS until all the transactions in undo file precedes
RecentGlobalXmin.  One disadvantage of this approach as compared to
the previous approach is that there can be some contention in finding
free undo slot in undo file, but I think the advantage it can bring by
having lesser files to manage will surpass such a disadvantage.

This still doesn't cover UNDO record contents for each operation.  I
think we can design that once the file format is fixed, although there
is no strict dependency between the two.

Thanks to Dilip Kumar for having offlist discussions on this topic to
figure out some of the design points.

Thoughts?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UNDO and in-place update

From
Robert Haas
Date:
On Wed, Jan 11, 2017 at 5:18 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Moving further, I have thought about consistent reads and different
> formats for storing undo with pros and cons of each format.
>
> Consistent Reads
> ----------------------------
> If the page is modified by any transaction that is not visible to read
> transaction's snapshot, we have to reconstruct a copy of the page.
> Now there could be multiple ways to achieve it:
>
> 1. Apply the undo for the transactions that are not visible to read
> transaction and keep the copy of the page.
> 2. Apply the undo for all the transactions in TPD entry of a page.
>
> I think we can't do second because this can create rows which are not
> consistent due to apply of undo log for transactions which are visible
> to read transaction.  Now, where the first approach will work, but it
> might turn out to be inefficient if not implemented carefully, as for
> each read transaction we need to create separate copies of the page
> where a different set of transactions are not visible to different
> read transactions. Even for read transactions to which the same set of
> transactions are not visible, we need some way to recognize the
> version of the page that can be used, probably try with all the
> version of pages and see if any of the existing versions can serve the
> purpose.  I think we also need some way to link the different versions
> of the page in shared buffers so that before trying to create a new
> version of the page we can look at linked versions.
>
> Yet another consideration in this area could be to perform consistent
> reads differently for index scans.  For index scan, we will receive a
> particular TID to read, so, in this case, we can try to reconstruct
> the old image of that particular row.  Now, whereas this sounds
> efficient, but we might need to reconstruct the old image of rows
> multiple times for different read transactions. The effect will be
> magnificent when we have to perform this for multiple rows of a page.
> I think it might be better to always construct a complete copy of page
> if one or more transactions that have changed the page are not visible
> to read snapshot.

I was thinking about this whole area somewhat differently.  We don't
really need a complete imagine of the page.  We just need to find the
correct version of each tuple that is visible.  So I would suggest
that we consider avoiding page reconstruction altogether.  A
sequential scan iterates through each TID on the page and says "is
this tuple visible?".  Right now the answer is either "yes" or "no",
but maybe the answer could be "yes", "no", or "look at this other
version in the UNDO segment instead".  The advantage of this approach
is that we don't end up copying tuples into a new "reconstructed"
page.  That copying will take CPU time and the resulting page will
consume cache space.

Now, the problem with this is that we want to avoid walking through
the whole UNDO chain for the page multiple times.  But maybe there are
ways to avoid that -- for starters, if the recently-modified bit on
the page isn't set, we don't need to walk through the UNDO chain at
all.  For a sequential scan, we could maintain a backend-local cache
of information that gets updated as we walk the chain.  Perhaps
there's no way to make these ideas work and we're going to end up
doing some form of page reconstruction, but I think it's worth some
thought anyway.

> Undo Page Format
> -------------------------------
> I think we can organize undo in multiple ways
>
> (a) One undo file per transaction - We can name each undo file as
> epoch+xid or epch+xid.undo. These files can be kept in pg_undo
> directory under PGDATA.  In this approach, we have to store
> latest_undo offset in file header to perform the rollback operation.
> Also, this offset can be used to allocate the location for new undo
> record.  Apart from that undo entries can be organized one after
> another.  Each undo entry will have two back pointers, one to reach
> the previous undo location which will be used for rollback and the
> second one to maintain the prior undo chain for changes to a
> particular page which will be used to reconstruct the copy of page.

Sounds good so far.

> During recovery,  we need to read each undo file and check the
> transaction status in 'clog', if it is not committed, then apply all
> the undo starting from latest_undo position.

I mostly agree, but let's make this a little more precise.  During
normal running, when a transaction aborts, we will perform all of the
undo actions associated with that transaction, and we will generate
WAL for those changes.  For example, if an insert adds a tuple to a
page, and the containing transaction aborts, then there will be an
undo entry to remove the tuple.  But removing that tuple has to be
WAL-logged, just as adding it was.  When all actions for the
transaction have been undone, we will remove the undo log for that
transaction and the removal of the undo log will also be a WAL-logged
event.  Therefore, *during* recovery, we don't need to perform any
undo actions, because by replaying the existing WAL, we are also
replaying any undo actions which were done before the crash (or on the
master).  However, at the end of recovery, we might have some left
over undo logs.  Those represent transactions that were in progress at
the moment of the crash (or at the moment the standby was promoted) or
perhaps transactions which had already been aborted but for which undo
was not yet complete (if undo had been complete, the log itself would
have been removed).  And the UNDO for those transactions must still be
performed.  So the protocol is: first redo, then undo.  Jim Gray
discusses this in his classic book:
https://www.amazon.com/dp/1558601902

> I think we should
> organize undo's in form of pages with a size similar to heap pages.
> This will be helpful both for writing WAL and doing checkpoints for
> undo where we expect pages.  The access to UNDO can be via shared
> buffers.

Sounds good.

> I think we might want to keep the access to shared buffers
> containing undo separate from heap/index pages so that they can't be
> evicted based on access of heap pages.

I don't see a reason to do that.  I think one of the advantages of
storing the UNDO pages in the same pool as shared_buffers is that the
amount of UNDO that we store in there can float up and down according
to demand.  If the heap pages are being accessed more frequently than
the UNDO pages then let the UNDO pages get evicted.  If the reverse is
the case, that's OK too.

> Different TPD entries
> containing the same XID will point to different locations in undo.
> Undo location will be just a page offset for a particular page entry.

I think an undo location should be a byte offset from the beginning of
UNDO for that transaction.  Which page that's on can be figured out by
dividing by BLCKSZ.

> Vacuum can delete or make the undo file reusable when the
> corresponding transaction precedes RecentGlobalXmin.

If the associated transaction commits, yes.  If it aborts, then we can
recycle it once undo is complete.  But I think this should be the job
of a dedicated process which also executes undo actions; I don't think
it should be associated with vacuum.

> To decide which files can be removed/reused, Vacuum either needs to
> traverse the TPD entries or need to find by traversing the files in
> pg_undo directory.  Here, along with the removal of undo file, we also
> need some way to indicate that TPD entry/entries for that undo are
> invalid. A simple way could be that if the corresponding undo file
> doesn't exist, then we can consider it as invalid.  However, that
> might not work if we choose to store undo of multiple transactions in
> one undo file.

I think we want something close to what you describe as "a simple
way".  For a TPD entry that references an undo log, then the pointer
is invalid if that undo log no longer exists.  But let's not confuse
the files that are used to store the undo logs with the logs
themselves.  Logically, an undo log comes into existence the first
time we perform an operation that generates undo, and ceases to exist
when we explicitly destroy it (either after replaying all the entries
in the case of an abort, or after it becomes all-visible in the case
of a commit).  So we should be able to have data in shared memory at
all times (including during recovery) that says which undo logs exist
as of that moment in time -- and if appropriate, which files on disk
contain data from those undo logs.  In a lot of cases, the UNDO data
will hopefully be stored only in shared_buffers and not in ANY disk
file at all, because a short transaction should be able to create an
UNDO log and write some data to it and commit and become all-visible
and have the UNDO log disappear before a checkpoint happens.  In such
a case, we never need to talk to the file system at all.

> The main advantage of this approach is that this will allow removing
> the *not* required undo space efficiently.  The disadvantage is that
> creating different files for each transaction can be costly and it can
> lead to lot of undo files when there are some long running
> transactions, of course, we can optimize by allowing to remove the
> files in such cases and give user error "snapshot_too_old", but I
> think still it can be a matter of concern.  To alleviate this concern,
> we can try to keep multiple transactions undo data in one file.

Right.  I think the trade-off here changes as the amount of undo for a
single transaction increases.  If a single transaction generates 50GB
of undo, we don't want that undo to be mixed with undo from other
transactions, because otherwise we might end up with a form of bloat
in our undo space.  That transaction, because it's big, should get its
own file (or series of files).  But if you have a lot of transactions
which are each generating only a few kB of undo, it's likely better to
keep all of that undo in a single file.  Hopefully most of the time we
will never write that undo to the filesystem at all, but at checkpoint
time we will need to do so.  If we're running a pgbench workload with
200 concurrent transactions and a checkpoint hits, we do not want to
have to create, write, and fsync several hundred files.  We want to
write all of that undo to a single file or smaller number of larger
files.

> (b) Multiple transactions in one undo file - Here we have to choose
> some arbitrary name for undo files, let's say undo_1, undo_2, etc. for
> the sake of discussion.  These files can be kept in pg_undo directory
> under PGDATA.  In this approach, we can choose to select set of four
> or eight 8KB pages (page of the same size as heap page) per
> transaction, we can call each such set of pages as one undo segment.
> If one segment is not sufficient to store undo for one transaction,
> then we can allocate next segment.  Each segment can have segment
> header in the first page which will contain previous segment address.
> For the very first segment, previous segment address will be Nil, the
> second segment will contain the address of the first segment, third,
> will contain the address of second and so on.  Previous segment
> address is needed to replay undo for rollback.  I have explicitly
> avoided storing the next segment address as we don't need it.

You need some efficient way of looking up an undo pointer and finding
the corresponding record.  For example, suppose you have an undo
pointer that says <epoch 0, XID 123456, byte-offset 78901>.  (You
might get such a pointer from a heap page, or from a TPD entry, or
from a back-pointer in some other UNDO entry.)  You now need to find
the page 9 of the UNDO log for that transaction (because 78901/8192 =
9).  If that's still in shared_buffers, great; but if it's been dumped
to the filesystem, you need a very quick way of finding it and getting
it back into shared_buffers.

> We also
> need some form of freespace map to keep track of undo segments that
> are freed.  Once the vacuum decided to free the undo of a particular
> transaction, it can add all the undo segments for that particular
> transaction in freespace map.  For allocating a new segment for
> existing transaction or new transaction, we need to first search in
> freespace map for the available segment, if not available, then
> allocate a new segment. In the file, the first segment will contain
> slots to hold the latest_undo location and XID (probably (epoch +
> XID)) for each of the transaction which has undo in the file.  We need
> latest_undo location to rollback the transaction and XID is required
> to rollback the transaction at the time of recovery.  Let's call the
> first segment as a HEAD segment. We can extend this HEAD segment if
> there is more space in the file, but for simplicity and matter of
> discussion, let's consider one undo file can hold the transactions
> that can get a slot in the HEAD segment.  Each slot can be 6 or 8
> bytes long.  Considering each segment of size 32KB, we can hold undo
> of ~4096 transactions in one file. We can keep some book-keeping
> information to examine whether there is any empty slot in the first
> segment of the file. So whenever a new transaction has to allocate an
> undo, it checks in a HEAD segment of undo_1, if there is a space to
> allocate new slot, allocate it, else try in next file undo_2.  We do
> need to consider what to do when the undo file is chosen to write undo
> for a transaction is finished and transaction still has to write more
> data. There are a couple of options like we can first try to free the
> space for any existing transaction which is already committed and if
> there is no more freespace available we can either choose to wait or
> allocate an additional XID to write UNDO as proposed in the initial
> e-mail in this thread.  Like the previous approach, undo can be
> accessed via shared buffers.  Similarly, each undo entry will have two
> back pointers as described in the previous approach.  There are some
> loose points like how freespace map will be organized which needs more
> thoughts, but I think the core of the idea can be made to work.

Yeah, I think the core of the idea is OK, but (1) I still think vacuum
is irrelevant; it is the undo processes that matter here and (2) I
still think that you're thinking to have too much connection between
the logical concept of undo and where that undo is physically stored
in files.  I think the logical undo pointer should just be <epoch,
XID, byte-offset> and then where that gets stored is a detail.  So
there should be no need to allocate a new XID when the file fills up;
that's just a bookkeeping detail of how byte-offsets get mapped to
physical file positions.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UNDO and in-place update

From
Amit Kapila
Date:
On Thu, Jan 12, 2017 at 8:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jan 11, 2017 at 5:18 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Moving further, I have thought about consistent reads and different
>> formats for storing undo with pros and cons of each format.
>>
>> Consistent Reads
>> ----------------------------
>> If the page is modified by any transaction that is not visible to read
>> transaction's snapshot, we have to reconstruct a copy of the page.
>> Now there could be multiple ways to achieve it:
>>
>> 1. Apply the undo for the transactions that are not visible to read
>> transaction and keep the copy of the page.
>> 2. Apply the undo for all the transactions in TPD entry of a page.
>>
>> I think we can't do second because this can create rows which are not
>> consistent due to apply of undo log for transactions which are visible
>> to read transaction.  Now, where the first approach will work, but it
>> might turn out to be inefficient if not implemented carefully, as for
>> each read transaction we need to create separate copies of the page
>> where a different set of transactions are not visible to different
>> read transactions. Even for read transactions to which the same set of
>> transactions are not visible, we need some way to recognize the
>> version of the page that can be used, probably try with all the
>> version of pages and see if any of the existing versions can serve the
>> purpose.  I think we also need some way to link the different versions
>> of the page in shared buffers so that before trying to create a new
>> version of the page we can look at linked versions.
>>
>> Yet another consideration in this area could be to perform consistent
>> reads differently for index scans.  For index scan, we will receive a
>> particular TID to read, so, in this case, we can try to reconstruct
>> the old image of that particular row.  Now, whereas this sounds
>> efficient, but we might need to reconstruct the old image of rows
>> multiple times for different read transactions. The effect will be
>> magnificent when we have to perform this for multiple rows of a page.
>> I think it might be better to always construct a complete copy of page
>> if one or more transactions that have changed the page are not visible
>> to read snapshot.
>
> I was thinking about this whole area somewhat differently.  We don't
> really need a complete imagine of the page.  We just need to find the
> correct version of each tuple that is visible.  So I would suggest
> that we consider avoiding page reconstruction altogether.  A
> sequential scan iterates through each TID on the page and says "is
> this tuple visible?".  Right now the answer is either "yes" or "no",
> but maybe the answer could be "yes", "no", or "look at this other
> version in the UNDO segment instead".  The advantage of this approach
> is that we don't end up copying tuples into a new "reconstructed"
> page.  That copying will take CPU time and the resulting page will
> consume cache space.
>
> Now, the problem with this is that we want to avoid walking through
> the whole UNDO chain for the page multiple times.  But maybe there are
> ways to avoid that -- for starters, if the recently-modified bit on
> the page isn't set, we don't need to walk through the UNDO chain at
> all.  For a sequential scan, we could maintain a backend-local cache
> of information that gets updated as we walk the chain.  Perhaps
> there's no way to make these ideas work and we're going to end up
> doing some form of page reconstruction, but I think it's worth some
> thought anyway.
>

Sure, we can do that way and I agree that it is worth considering.
Few cases where it can be really costly is when undo chain overflows
to multiple pages and those pages doesn't exist in cache.  I think the
reason why it is probable is that undo chain is not maintained for row
update, rather it is maintained for a page level changes for each
transaction.  So, what this means is that if we have to traverse the
chain for record X, then I think it is quite possible that during
chain traversal, we will encounter undo of record Y, ofcourse we can
identify and ignore it, but certainly the chain can be much longer.
Another kind of cost could be construction of actual record from undo
record, for example if we consider to undo log only changed parts of
Update as we do for WAL, then we have to incur some record
construction cost as well. I think what can make it really costly is
repeating this work.  Even, if we consider the above as a worse case,
I am not sure whether it will be cheap for short transactions like
pgbench as there could be cases where concurrent updates/reads needs
to traverse undo chains.

Having said that, if you don't feel strongly about caching the
information in such a way that it can be reused by different sessions,
then we can initially implement the system as outlined by you, then we
can extend it based on the performance characteristics.


>
> I mostly agree, but let's make this a little more precise.  During
> normal running, when a transaction aborts, we will perform all of the
> undo actions associated with that transaction, and we will generate
> WAL for those changes.  For example, if an insert adds a tuple to a
> page, and the containing transaction aborts, then there will be an
> undo entry to remove the tuple.  But removing that tuple has to be
> WAL-logged, just as adding it was.  When all actions for the
> transaction have been undone, we will remove the undo log for that
> transaction and the removal of the undo log will also be a WAL-logged
> event.  Therefore, *during* recovery, we don't need to perform any
> undo actions, because by replaying the existing WAL, we are also
> replaying any undo actions which were done before the crash (or on the
> master).  However, at the end of recovery, we might have some left
> over undo logs.  Those represent transactions that were in progress at
> the moment of the crash (or at the moment the standby was promoted) or
> perhaps transactions which had already been aborted but for which undo
> was not yet complete (if undo had been complete, the log itself would
> have been removed).  And the UNDO for those transactions must still be
> performed.  So the protocol is: first redo, then undo.  Jim Gray
> discusses this in his classic book:
> https://www.amazon.com/dp/1558601902
>

Agreed.

>> I think we should
>> organize undo's in form of pages with a size similar to heap pages.
>> This will be helpful both for writing WAL and doing checkpoints for
>> undo where we expect pages.  The access to UNDO can be via shared
>> buffers.
>
> Sounds good.
>
>> I think we might want to keep the access to shared buffers
>> containing undo separate from heap/index pages so that they can't be
>> evicted based on access of heap pages.
>
> I don't see a reason to do that.  I think one of the advantages of
> storing the UNDO pages in the same pool as shared_buffers is that the
> amount of UNDO that we store in there can float up and down according
> to demand.  If the heap pages are being accessed more frequently than
> the UNDO pages then let the UNDO pages get evicted.  If the reverse is
> the case, that's OK too.
>

Okay, not a problem. I was thinking to priortise UNDO buffers, so that
they can only be evicted during checkpoint.  However, I think there
might not be much value in doing so.

>> Vacuum can delete or make the undo file reusable when the
>> corresponding transaction precedes RecentGlobalXmin.
>
> If the associated transaction commits, yes.  If it aborts, then we can
> recycle it once undo is complete.  But I think this should be the job
> of a dedicated process which also executes undo actions; I don't think
> it should be associated with vacuum.
>

Agreed that it will be good to keep a separate undo process.  However,
I am not exactly clear about your point of executing undo actions in
background. Do you mean to say that the main backend will mark
transaction as aborted in clog, but undo actions will be done in
backend or you have something else in mind?

>> To decide which files can be removed/reused, Vacuum either needs to
>> traverse the TPD entries or need to find by traversing the files in
>> pg_undo directory.  Here, along with the removal of undo file, we also
>> need some way to indicate that TPD entry/entries for that undo are
>> invalid. A simple way could be that if the corresponding undo file
>> doesn't exist, then we can consider it as invalid.  However, that
>> might not work if we choose to store undo of multiple transactions in
>> one undo file.
>
> I think we want something close to what you describe as "a simple
> way".  For a TPD entry that references an undo log, then the pointer
> is invalid if that undo log no longer exists.  But let's not confuse
> the files that are used to store the undo logs with the logs
> themselves.  Logically, an undo log comes into existence the first
> time we perform an operation that generates undo, and ceases to exist
> when we explicitly destroy it (either after replaying all the entries
> in the case of an abort, or after it becomes all-visible in the case
> of a commit).  So we should be able to have data in shared memory at
> all times (including during recovery) that says which undo logs exist
> as of that moment in time -- and if appropriate, which files on disk
> contain data from those undo logs.  In a lot of cases, the UNDO data
> will hopefully be stored only in shared_buffers and not in ANY disk
> file at all, because a short transaction should be able to create an
> UNDO log and write some data to it and commit and become all-visible
> and have the UNDO log disappear before a checkpoint happens.  In such
> a case, we never need to talk to the file system at all.
>

Okay, it seems we can deduce it from trasnction status. If the
transaction is aborted, then we know undo log is invalid.  If it is
in-progress, then there will be a valid undo log. If it is committed,
and all-visble (precedes RecentGlobalXmin), then the undo log will be
invalid.

>> The main advantage of this approach is that this will allow removing
>> the *not* required undo space efficiently.  The disadvantage is that
>> creating different files for each transaction can be costly and it can
>> lead to lot of undo files when there are some long running
>> transactions, of course, we can optimize by allowing to remove the
>> files in such cases and give user error "snapshot_too_old", but I
>> think still it can be a matter of concern.  To alleviate this concern,
>> we can try to keep multiple transactions undo data in one file.
>
> Right.  I think the trade-off here changes as the amount of undo for a
> single transaction increases.  If a single transaction generates 50GB
> of undo, we don't want that undo to be mixed with undo from other
> transactions, because otherwise we might end up with a form of bloat
> in our undo space.  That transaction, because it's big, should get its
> own file (or series of files).  But if you have a lot of transactions
> which are each generating only a few kB of undo, it's likely better to
> keep all of that undo in a single file.  Hopefully most of the time we
> will never write that undo to the filesystem at all, but at checkpoint
> time we will need to do so.  If we're running a pgbench workload with
> 200 concurrent transactions and a checkpoint hits, we do not want to
> have to create, write, and fsync several hundred files.

By create, it seems you are thinking about having undo files as some
in-memory thing (space reserved in memory) and if required, then only
create the file?  If so, then the point we have discussed above for
giving equal priority to undo pages in shared_buffers as heap pages is
relevant, because backend evicting undo page could be costly as
compare to heap page.

>
>> (b) Multiple transactions in one undo file - Here we have to choose
>> some arbitrary name for undo files, let's say undo_1, undo_2, etc. for
>> the sake of discussion.  These files can be kept in pg_undo directory
>> under PGDATA.  In this approach, we can choose to select set of four
>> or eight 8KB pages (page of the same size as heap page) per
>> transaction, we can call each such set of pages as one undo segment.
>> If one segment is not sufficient to store undo for one transaction,
>> then we can allocate next segment.  Each segment can have segment
>> header in the first page which will contain previous segment address.
>> For the very first segment, previous segment address will be Nil, the
>> second segment will contain the address of the first segment, third,
>> will contain the address of second and so on.  Previous segment
>> address is needed to replay undo for rollback.  I have explicitly
>> avoided storing the next segment address as we don't need it.
>
> You need some efficient way of looking up an undo pointer and finding
> the corresponding record.  For example, suppose you have an undo
> pointer that says <epoch 0, XID 123456, byte-offset 78901>.  (You
> might get such a pointer from a heap page, or from a TPD entry, or
> from a back-pointer in some other UNDO entry.)  You now need to find
> the page 9 of the UNDO log for that transaction (because 78901/8192 =
> 9).  If that's still in shared_buffers, great; but if it's been dumped
> to the filesystem, you need a very quick way of finding it and getting
> it back into shared_buffers.
>

Won't it be efficient if we can do the same mapping between undo
byte-offset and undo file as we do for WAL ((WALWrite offset to file).
We can keep file size bigger than what we have for each WAL Segment?
Do we need something more?

Do we need epoch,xid in back-pointer for undo or only byte-offset will do?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UNDO and in-place update

From
Robert Haas
Date:
On Fri, Jan 13, 2017 at 5:57 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Sure, we can do that way and I agree that it is worth considering.
> Few cases where it can be really costly is when undo chain overflows
> to multiple pages and those pages doesn't exist in cache.  I think the
> reason why it is probable is that undo chain is not maintained for row
> update, rather it is maintained for a page level changes for each
> transaction.  So, what this means is that if we have to traverse the
> chain for record X, then I think it is quite possible that during
> chain traversal, we will encounter undo of record Y, ofcourse we can
> identify and ignore it, but certainly the chain can be much longer.
> Another kind of cost could be construction of actual record from undo
> record, for example if we consider to undo log only changed parts of
> Update as we do for WAL, then we have to incur some record
> construction cost as well. I think what can make it really costly is
> repeating this work.  Even, if we consider the above as a worse case,
> I am not sure whether it will be cheap for short transactions like
> pgbench as there could be cases where concurrent updates/reads needs
> to traverse undo chains.
>
> Having said that, if you don't feel strongly about caching the
> information in such a way that it can be reused by different sessions,
> then we can initially implement the system as outlined by you, then we
> can extend it based on the performance characteristics.

Sounds fine.  Also, one idea I've been considering is the idea of
allowing each shared buffer to have some associated "scratch space"
which would act as a cache that lasts as long as that page doesn't get
evicted.  I think that might have some applications for hash indexes
as well.  Of course deciding how much scratch space to allocate and
whether you wouldn't be better off using that memory some other way is
tricky, but it's a thought.

>> I don't see a reason to do that.  I think one of the advantages of
>> storing the UNDO pages in the same pool as shared_buffers is that the
>> amount of UNDO that we store in there can float up and down according
>> to demand.  If the heap pages are being accessed more frequently than
>> the UNDO pages then let the UNDO pages get evicted.  If the reverse is
>> the case, that's OK too.
>
> Okay, not a problem. I was thinking to priortise UNDO buffers, so that
> they can only be evicted during checkpoint.  However, I think there
> might not be much value in doing so.

Seems like we can leave the decision to later.  If that turns out to
be the thing that is hurting performance, we can fix it then.

>>> Vacuum can delete or make the undo file reusable when the
>>> corresponding transaction precedes RecentGlobalXmin.
>>
>> If the associated transaction commits, yes.  If it aborts, then we can
>> recycle it once undo is complete.  But I think this should be the job
>> of a dedicated process which also executes undo actions; I don't think
>> it should be associated with vacuum.
>
> Agreed that it will be good to keep a separate undo process.  However,
> I am not exactly clear about your point of executing undo actions in
> background. Do you mean to say that the main backend will mark
> transaction as aborted in clog, but undo actions will be done in
> backend or you have something else in mind?

I think it's not good to have the main backend perform the UNDO
actions because they might ERROR.  That's not something we can support
when we're already in the abort path.  If it happens in the background
process, it can catch the ERROR and return to the toplevel and retry
after a pause or do whatever needs to be done.  For example, imagine
an I/O error.

> Okay, it seems we can deduce it from trasnction status. If the
> transaction is aborted, then we know undo log is invalid.  If it is
> in-progress, then there will be a valid undo log. If it is committed,
> and all-visble (precedes RecentGlobalXmin), then the undo log will be
> invalid.

For MVCC purposes, that might work, but for other purposes, maybe not.
We actually need to replay the undo log on abort before removing it.

> By create, it seems you are thinking about having undo files as some
> in-memory thing (space reserved in memory) and if required, then only
> create the file?

Yes.

> If so, then the point we have discussed above for
> giving equal priority to undo pages in shared_buffers as heap pages is
> relevant, because backend evicting undo page could be costly as
> compare to heap page.

It's possible, but I don't think that's a key part of the design so
I'd say skip it for now and we'll deal with it if it becomes an issue.

> Won't it be efficient if we can do the same mapping between undo
> byte-offset and undo file as we do for WAL ((WALWrite offset to file).
> We can keep file size bigger than what we have for each WAL Segment?
> Do we need something more?

Well you need something more if you want to combine multiple undo logs
in a single file.  If each file gets its own undo log, then that's all
you need.  I think.

> Do we need epoch,xid in back-pointer for undo or only byte-offset will do?

I think just byte-offset.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company