Thread: Frequent Update Project: Design Overview of HOT Updates

Frequent Update Project: Design Overview of HOT Updates

From

"Simon Riggs"

Date:

09 November 2006, 13:13:48

Design Overview of HOT Updates
------------------------------

The objective is to increase the speed of the UPDATE case, while
minimizing the overall negative effects of the UPDATE. We refer to the
general requirement as *Frequent Update Optimization*, though this
design proposal is for Heap Overflow Tuple (HOT) Updates. It is similar
in some ways to the design for SITC already proposed, though has a
number of additional features drawn from other designs to make it a
practical and effective implementation. 

EnterpriseDB have a working, performant prototype of this design. There
are still a number of issues to resolve and the intention is to follow
an open community process to find the best way forward. All required
detail will be provided for the work conducted so far.

Current PGSQL behaviour is for UPDATEs to create a new tuple version
within the heap, so acts from many perspectives as if it were an INSERT.
All of the tuple versions are chained together, so that whichever of the
tuples is visible to your Snapshot, you can walk the chain to find the
most recent tuple version to update.

The HOT proposal splits the heap into two areas: the main heap and an
overflow relation, both of which are regular, transactional relations.
INSERT and DELETE effect only the main heap exactly as they do now.
UPDATE places new tuple versions into the overflow relation in most
cases, maintaining all of the current capabilities of the MVCC model:
all tuple versions remain accessible, plus the chain of updated tuple
versions is maintained. 

UPDATEs do not insert into indexes at all - no index pointers exist at
all into the overflow relation. So during a stream of UPDATEs the main
heap and indexes stay the same size, while the indexes are used
read-only, avoiding additional database writes and the need for eventual
index rebuilding.

The current tuple version within the main heap is referred to as the
"root tuple" from now on. When reading the main heap, if the root tuple
is not visible we "walk the chain" into the overflow relation until we
find a tuple version that is visible - we follow the t_ctid pointer from
tuple to tuple, testing for MVCC visibility as we go.

As more UPDATEs take place these tuple chains would grow, making
locating the latest tuple take progressively longer. Intuitively this
sounds like a bad design, though an additional feature turns that from
bad to good performance:

- when anybody walks the chain into the overflow relation, if we find
that the root tuple is vacuumable we copy back the first visible tuple
version over the now-dead root tuple.

This allows the length of a typical tuple chain to be extremely short in
practice. For a single connection issuing a stream of UPDATEs the chain
length will no more than 1 at any time. Even under heavy concurrent
UPDATEs the modal chain length remains 1, with the chain length varying
in approximately Poisson distribution.

The overflow relation is similar in concept to a TOAST table, so we
might describe this approach as TOAST-for-updated-versions. The code
implementation is similar also, with reasonably similar modularity. HOT
does sound radical, but no more so than TOAST was when first discussed.

We can locate any tuple version and thereby allow MVCC to work
correctly, while at the same time preserving the crucial property of the
Postgres non-overwriting storage manager. This works very well for
indexed tuple access since the index and main heap never grow, thus
maintaining access speeds. HOT supports both Read Committed and
Serializable transaction isolation levels, with transactions of any
length, just as with current MVCC.

Performance results against the previously described test cases are
approximately:

1. pgbench (TPC-B) - Around 200-300% better

2. DBT-2 (TPC-C) - Signficicant improvement - possibly 10% improvement,
somewhat difficult to measure exactly because of the way the test itself
works.

3. truckin - Around 500% improvement

The performance gain remains better than REL8_2 base even in the
presence of longer running transactions.

[There is also considerable synergy with changes to b-tree indexes that
are both useful in their own right, and even better when HOT is added,
more on that on a separate thread.]

We expect that people will want to measure this for themselves, so we
don't publish all of the detailed internal tests here since YMMV.

Using HOT
---------

HOT can only work in cases where a tuple does not modify one of the
columns defined in an index on the table, and when we do not alter the
row length of the tuple. [We'll be able to do that more efficiently when
we have plan invalidation]

If we perform an update that meets the HOT criteria then we put the
new version into the overflow relation; we describe this as a HOT
UPDATE. If we perform an update that does not meet the criteria, then we
carry on with the existing/old MVCC behaviour; we describe this as a
non-HOT UPDATE.

So with HOT we must support both HOT and non-HOT UPDATEs. Each tuple
whether it is in the main heap or the overflow relation can point to
another tuple with the t_ctid, which is extended by using a previously
unused bit to flag whether the destination is a block from the main heap
or the overflow relation.

Just to reiterate, since a HOT update doesn't touch index values that
means that all members of a tuple chain have identical primary keys and
index values. So we only ever need to grovel through the overflow
relation *if* we have already met the index search criteria - so we
don't have to re-check the index criteria for each tuple version we
touch. All normal updates of a table will be read-only on the index,
plus writes to heap/overflow.

Simple Example of Tuple Chaining
--------------------------------

Lets look at a simple case:

1. INSERT INTO table (1,....)
Tuple is inserted into main heap (Version1 or V1)
  Heap Page P1            Overflow Relation  ----------------        -----------------  |              |        |
       |  |     ______   |        |               |  |    |  V1  |  |        |               |  |    |______|  |
|              |  |              |        |               |  |--------------|        |---------------|    
 
2. UPDATE table SET nonindexcol = 6 WHERE pkcol = 1;
Same length tuple, so V2 written using overflow
  Heap Page P1             Overflow Relation  ----------------        -----------------  |              |        |
______     |  |     ______   |   |------>|  V2  |     |  |    |  V1  |------|    |  |______|     |  |    |______|  |
   |               |  |              |        |               |  |--------------|        |---------------|    
 
3. UPDATE table SET nonindexcol = 7 WHERE pkcol = 1;
Same length tuple, so V3 written using overflow
We update v2.t_ctid to maintain the forward tuple chain
  Heap Page P1             Overflow Relation  ----------------        -----------------  |              |        |
______     |  |     ______   |   |------>|  V2  |     |  |    |  V1  |------|    |  |______|--|  |  |    |______|  |
   |   ______   |  |  |              |        |  |  V3  |<-|  |  |              |        |  |______|     |
|--------------|       |---------------|
 


Simple Example of Copying-back
------------------------------

4. UPDATE table SET nonindexcol = 8 WHERE pkcol = 1;
Same length tuple, so V4 will be written using overflow.
While selecting for UPDATE, we look at V1, we see it is now vacuumable,
so we first copy back the first visible tuple over V1 before proceeding.
[Assume V1 is dead, V2 is still visible]
  Heap Page P1             Overflow Relation  ----------------        -----------------  |              |        |
______     |  |     ______   |        |  |  V2* |     |  |    |  V2  |------|    |  |______|     |  |    |______|  |
|   |   ______      |  |              |   |------>|  V3  |     |  |              |        |  |______|     |
|--------------|       |---------------| 

The copy back operation leaves V2* as the now-unwritable old copy of V2.
We update it also to break the chain to V3. Now orphaned, it can be
removed later.

5. Now we continue with the UPDATE as planned
Same length tuple, so V4 will be written using overflow.
We update v3.t_ctid to maintain the forward tuple chain
  Heap Page P1             Overflow Relation  ----------------        -----------------  |              |        |
______     |  |     ______   |        |  |  V2* |     |  |    |  V2  |------|    |  |______|     |  |    |______|  |
|   |   ______      |  |              |   |------>|  V3  |     |  |              |        |  |______|--|  |  |
   |        |   ______   |  |  |              |        |  |  V4  |<-|  |  |              |        |  |______|     |
|--------------|       |---------------|
 
    
Copying Back
------------

The copy back operation is required to keep the tuple chains short and
efficient. When we copy back we leave a tuple version and possibly a
long part of the tuple chain orphaned. These are then marked for removal
by the next VACUUM.

To support copying back, we add an additional header fields onto
every tuple in the overflow relation (only). We generate WAL
appropriately for this action.

VACUUMing becomes significantly easier with HOT than with non-HOT
UPDATEs. Specifically, since there are no indexes on the overflow
relation we should be able to VACUUM it more easily and possibly
piece-by-piece (i.e. "retail vacuum").

Behavioural Characteristics
---------------------------

With HOT, it is easily possible that the chain of prior versions spans
many blocks. The chain always starts with the block of the root tuple
but possibly includes many overflow relation blocks.

A SeqScan of a HOT table will turn into a large number of
random accesses to the overflow relation, which could be considerably
more expensive than sequential I/O. Clearly very large tables would not
be able to be HOT enabled without additional cost, so we make HOT a
table-level option: WITH (freqUPDATE = on)   [discuss...]

This means that a SeqScan of a heap with heavy concurrent update
activity *could* be fairly costly. Most times it won't be, because the
appropriate overflow relation blocks will most likely be in-memory. That
is the price we pay for having this optimization - a suitable warning
would apply, though this would not be a surprise to anybody since Oracle
people know that reading heavily-updated data takes more CPU and
DB2/SQLServer know that it takes ages to get around each of the row
level locks. So this isn't bad *at all* in comparison.

We hope/expect/recommend HOT to be used in conjunction with UPDATE
RETURNING, so that fewer SELECTs and non-index operations will be used.

[There's an extra prize if you can think of a *correct* way of
SeqScanning with HOT more efficiently; we've not focused on that yet.]

In the presence of a long running transaction, the overflow relation
could become very large since VACUUM becomes less effective. (This same
situation occurs with current MVCC). The overflow relation will then
start to page out to disk and could grow large. Oracle has this problem
also and this is why Oracle 10g has moved away from fixed-size Rollback
Segments to the use of an overflow Tablespace. Some size controls are
available to limit the growth there. That would be a useful optimization
for HOT also and one much easier than any such enhancement of existing
MVCC.

Next Steps
----------

Deeper technical details will be published shortly; the design goes much
deeper than the discussion here, but we wanted to present the overall
summary first.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com

Re: Frequent Update Project: Design Overview of HOT Updates

From

Martijn van Oosterhout

Date:

09 November 2006, 13:49:56

Nice idea, just one question:

On Thu, Nov 09, 2006 at 05:13:16PM +0000, Simon Riggs wrote:
> Behavioural Characteristics
> ---------------------------
>
> With HOT, it is easily possible that the chain of prior versions spans
> many blocks. The chain always starts with the block of the root tuple
> but possibly includes many overflow relation blocks.
>
> A SeqScan of a HOT table will turn into a large number of
> random accesses to the overflow relation, which could be considerably
> more expensive than sequential I/O. Clearly very large tables would not
> be able to be HOT enabled without additional cost, so we make HOT a
> table-level option: WITH (freqUPDATE = on)   [discuss...]

It seems to me that bitmap index scans will get these same
characteristics also, right? The bitmap scan will have to follow the
chain of any possibly matching tuple in any of the blocks that are in
the bitmap, right?

Have a ncie day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: Frequent Update Project: Design Overview of HOTUpdates

From

"Simon Riggs"

Date:

09 November 2006, 13:57:58

On Thu, 2006-11-09 at 18:49 +0100, Martijn van Oosterhout wrote:
> Nice idea, just one question:

> It seems to me that bitmap index scans will get these same
> characteristics also, right? The bitmap scan will have to follow the
> chain of any possibly matching tuple in any of the blocks that are in
> the bitmap, right?

Yes, they would identify the root tuples. The whole chain has matching
index values, by definition, so the re-evaluation for lossy bitmaps will
work just the same before the actual tuple is retrieved by walking the
chain (if required).

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com

Re: Frequent Update Project: Design Overview of HOT Updates

From

Josh Berkus

Date:

09 November 2006, 17:19:46

Simon,

> If we perform an update that meets the HOT criteria then we put the
> new version into the overflow relation; we describe this as a HOT
> UPDATE. If we perform an update that does not meet the criteria, then we
> carry on with the existing/old MVCC behaviour; we describe this as a
> non-HOT UPDATE.

Making the essential performance analysis question, "Am I HOT or Not?"  

Sorry, but someone had to say it.  ;-)

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

Re: Frequent Update Project: Design Overview of HOTUpdates

From

"Simon Riggs"

Date:

09 November 2006, 18:45:36

On Thu, 2006-11-09 at 13:21 -0800, Josh Berkus wrote:
> Simon,
> 
> > If we perform an update that meets the HOT criteria then we put the
> > new version into the overflow relation; we describe this as a HOT
> > UPDATE. If we perform an update that does not meet the criteria, then we
> > carry on with the existing/old MVCC behaviour; we describe this as a
> > non-HOT UPDATE.
> 
> Making the essential performance analysis question, "Am I HOT or Not?"  

Very good. ;-)

Well, we had Overflow Update CHaining as an alternative name... :-)

The naming sounds silly, but we had a few alternate designs, so we
needed to be able to tell them apart sensibly. We've had TVR, SITC, UIP
and now HOT. Software research...

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com

Re: Frequent Update Project: Design Overview of HOT Updates

From

Tom Lane

Date:

09 November 2006, 19:28:44

"Simon Riggs" <simon@2ndquadrant.com> writes:
> As more UPDATEs take place these tuple chains would grow, making
> locating the latest tuple take progressively longer.

This is the part that bothers me --- particularly the random-access
nature of the search.  I wonder whether you couldn't do something
involving an initial table fill-factor of less than 50%, and having
the first updated version living on the same heap page as its parent.
Only when the active chain length was more than one (which you
hypothesize is rare) would it actually be necessary to do a random
access into the overflow table.

More generally, do we need an overflow table at all, rather than having
these overflow tuples living in the same file as the root tuples?  As
long as there's a bit available to mark a tuple as being this special
not-separately-indexed type, you don't need a special location to know
what it is.  This might break down in the presence of seqscans though.

Actually, you omitted to mention the locking aspects of moving tuples
around --- exactly how are you going to make that work without breaking
concurrent scans?

> This allows the length of a typical tuple chain to be extremely short in
> practice. For a single connection issuing a stream of UPDATEs the chain
> length will no more than 1 at any time.

Only if there are no other transactions being held open, which makes
this claim a lot weaker.

> HOT can only work in cases where a tuple does not modify one of the
> columns defined in an index on the table, and when we do not alter the
> row length of the tuple.

Seems like "altering the row length" isn't the issue, it's just "is
there room on the page for the new version".  Again, a generous
fillfactor would give you more flexibility.

> [We'll be able to do that more efficiently when
> we have plan invalidation]

Uh, what's that got to do with it?
        regards, tom lane

Re: Frequent Update Project: Design Overview of HOT Updates

From

Josh Berkus

Date:

09 November 2006, 19:31:57

Tom,

> Actually, you omitted to mention the locking aspects of moving tuples
> around --- exactly how are you going to make that work without breaking
> concurrent scans?

I believe that's the "unsolved technical issue" in the prototype, unless 
Pavan has solved it in the last two weeks.   Pavan?

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

Re: Frequent Update Project: Design Overview of HOT Updates

From

"Pavan Deolasee"

Date:

09 November 2006, 22:24:37

On 11/10/06, Josh Berkus <josh@agliodbs.com> wrote:

Tom,

> Actually, you omitted to mention the locking aspects of moving tuples
> around --- exactly how are you going to make that work without breaking
> concurrent scans?

I believe that's the "unsolved technical issue" in the prototype, unless
Pavan has solved it in the last two weeks. Pavan?

When an overflow tuple is copied back to the main heap, the overflow tuple is
marked with a special flag (HEAP_OVERFLOW_MOVEDBACK). Subsequently,
when a backend tries to lock the overflow version of the tuple, it checks the flag
and jumps to the main heap if the flag is set.

We use the same technique to update the correct version of a tuple when a
tuple is moved back to the main heap.

Regards,
Pavan

Re: Frequent Update Project: Design Overview of HOT Updates

From

Tom Lane

Date:

10 November 2006, 01:28:06

"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> On 11/10/06, Josh Berkus <josh@agliodbs.com> wrote:
>> I believe that's the "unsolved technical issue" in the prototype, unless
>> Pavan has solved it in the last two weeks.   Pavan?
>> 
> When an overflow tuple is copied back to the main heap, the overflow tuple
> is
> marked with a special flag (HEAP_OVERFLOW_MOVEDBACK). Subsequently,
> when a backend tries to lock the overflow version of the tuple, it checks
> the flag
> and jumps to the main heap if the flag is set.

(1) How does it "jump to the main heap"?  The links go the other
direction.

(2) Isn't this full of race conditions?

(3) I thought you already used up the one remaining t_infomask bit.
        regards, tom lane

Re: Frequent Update Project: Design Overview of HOT Updates

From

"Pavan Deolasee"

Date:

10 November 2006, 02:57:30

On 11/10/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> On 11/10/06, Josh Berkus < josh@agliodbs.com> wrote:
>> I believe that's the "unsolved technical issue" in the prototype, unless
>> Pavan has solved it in the last two weeks. Pavan?
>>
> When an overflow tuple is copied back to the main heap, the overflow tuple
> is
> marked with a special flag (HEAP_OVERFLOW_MOVEDBACK). Subsequently,
> when a backend tries to lock the overflow version of the tuple, it checks
> the flag
> and jumps to the main heap if the flag is set.

(1) How does it "jump to the main heap"? The links go the other
direction.

The overflow tuple has a special header to store the back pointer to the main heap.
This increases the tuple header size by 6 bytes, but the overhead is restricted only to the overflow
tuples.

(2) Isn't this full of race conditions?

I agree, there could be race conditions. But IMO we can handle those. When we
follow the tuple chain, we hold a SHARE lock on the main heap buffer. Also, when
the root tuple is vacuumable and needs to be overwritten, we acquire and keep EXCLUSIVE
lock on the main heap buffer.

This reduces the race conditions to a great extent.

(3) I thought you already used up the one remaining t_infomask bit.

Yes. The last bit in the t_infomask is used up to mark presence of overflow tuple header. But I believe there are few more bits that can be reused. There are three bits available in the t_ctid field as well (since ip_posid needs maximum 13 bits). One bit is used to identify whether a given tid points to the main heap or the overflow heap. This helps when tids are passed around in the code.

Since the back pointer from the overflow tuple always points to the main heap, the same bit can be used to mark copied-back tuples (we are doing it in a slight different way in the current prototype though).

Regards,
Pavan

Re: Frequent Update Project: Design Overview of HOT

From

Hannu Krosing

Date:

10 November 2006, 03:25:55

Ühel kenal päeval, N, 2006-11-09 kell 18:28, kirjutas Tom Lane:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > As more UPDATEs take place these tuple chains would grow, making
> > locating the latest tuple take progressively longer.
> 
> This is the part that bothers me --- particularly the random-access
> nature of the search.  I wonder whether you couldn't do something
> involving an initial table fill-factor of less than 50%, and having
> the first updated version living on the same heap page as its parent.
> Only when the active chain length was more than one (which you
> hypothesize is rare) would it actually be necessary to do a random
> access into the overflow table.
> 
> More generally, do we need an overflow table at all, rather than having
> these overflow tuples living in the same file as the root tuples?  As
> long as there's a bit available to mark a tuple as being this special
> not-separately-indexed type, you don't need a special location to know
> what it is.  This might break down in the presence of seqscans though.

And why do you need to mark it as not-separately-indexed at all ?

We already cope with missing index pointers in VACUUM and I can't see
any other reason to have it.

What are the advantages of HOT over SITC (other than cool name) ?

Maybe just make HOT an extended SITC which can span pages.

In case of HOT together with reusing index tuples with DELETED bit set
we don't actually need copyback, but the same index pointer will follow
the head of live data automatically, maybe lagging only a small number
of versions.

> Actually, you omitted to mention the locking aspects of moving tuples
> around --- exactly how are you going to make that work without breaking
> concurrent scans?
> 
> > This allows the length of a typical tuple chain to be extremely short in
> > practice. For a single connection issuing a stream of UPDATEs the chain
> > length will no more than 1 at any time.
> 
> Only if there are no other transactions being held open, which makes
> this claim a lot weaker.
> 
> > HOT can only work in cases where a tuple does not modify one of the
> > columns defined in an index on the table, and when we do not alter the
> > row length of the tuple.
> 
> Seems like "altering the row length" isn't the issue, it's just "is
> there room on the page for the new version".  Again, a generous
> fillfactor would give you more flexibility.

Maybe they hoped to take very light locks when new chaoin head is copied
iver the old one in the same-length case.

>                 http://www.postgresql.org/about/donate
-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com

NOTICE: This communication contains privileged or other confidential
information. If you have received it in error, please advise the sender
by reply email and immediately delete the message and any attachments
without copying or disclosing the contents.

Re: Frequent Update Project: Design Overview of HOT Updates

From

Tom Lane

Date:

10 November 2006, 03:32:25

"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> On 11/10/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> (2) Isn't this full of race conditions?

> I agree, there  could be race  conditions. But IMO we can handle those.

Doubtless you can prevent races by introducing a bunch of additional
locking.  The question was really directed to how much concurrent
performance is left, once you get done locking it down.

> (3) I thought you already used up the one remaining t_infomask bit.

> Yes. The last bit in the t_infomask is used up to mark presence of overflow
> tuple header. But I believe there are few more bits that can be reused.
> There are three bits available in the t_ctid field as well (since ip_posid
> needs maximum 13 bits).

No, you cannot have those bits --- BLCKSZ is not limited to 8K, and even
if it were, we will not hold still for sticking flag bits into an
unrelated datatype.

You can probably fix this by inventing multiple context-dependent
interpretations of t_infomask bits, but that strikes me as a mighty
ugly and fragile way to proceed.

(Actually, the assumption that you can throw an additional back-pointer
into overflow tuple headers is the worst feature of this proposal in
that regard --- it's really not that easy to support multiple header
formats.)
        regards, tom lane

Re: Frequent Update Project: Design Overview of HOT

From

Hannu Krosing

Date:

10 November 2006, 03:51:46

Ühel kenal päeval, R, 2006-11-10 kell 09:06, kirjutas Hannu Krosing:
> Ühel kenal päeval, N, 2006-11-09 kell 18:28, kirjutas Tom Lane:
> > "Simon Riggs" <simon@2ndquadrant.com> writes:
> > > As more UPDATEs take place these tuple chains would grow, making
> > > locating the latest tuple take progressively longer.
> > 
> > This is the part that bothers me --- particularly the random-access
> > nature of the search.  I wonder whether you couldn't do something
> > involving an initial table fill-factor of less than 50%, and having
> > the first updated version living on the same heap page as its parent.
> > Only when the active chain length was more than one (which you
> > hypothesize is rare) would it actually be necessary to do a random
> > access into the overflow table.
> > 
> > More generally, do we need an overflow table at all, rather than having
> > these overflow tuples living in the same file as the root tuples?  As
> > long as there's a bit available to mark a tuple as being this special
> > not-separately-indexed type, you don't need a special location to know
> > what it is.  This might break down in the presence of seqscans though.
> 
> And why do you need to mark it as not-separately-indexed at all ?
> 
> We already cope with missing index pointers in VACUUM and I can't see
> any other reason to have it.

Ok, now I see it - we can't VACUUM a tuple, if next versions of it are
accessible by t_ctid chain only. That is vacuum must not free tuples,
which have t_ctid pointing to a tuple that has not-separately-indexed
bit set. This seems to make vacuum quite complicated, as it has to
examine c_tid chains to detect if it can free a tuple, and what's even
worse, it has to examine these chains backwards.

> What are the advantages of HOT over SITC (other than cool name) ?

still wondering this, is it just the abilty to span multiple pages ?

> Maybe just make HOT an extended SITC which can span pages.
> 
> In case of HOT together with reusing index tuples with DELETED bit set
> we don't actually need copyback, but the same index pointer will follow
> the head of live data automatically, maybe lagging only a small number
> of versions.

> > Actually, you omitted to mention the locking aspects of moving tuples
> > around --- exactly how are you going to make that work without breaking
> > concurrent scans?
> > 
> > > This allows the length of a typical tuple chain to be extremely short in
> > > practice. For a single connection issuing a stream of UPDATEs the chain
> > > length will no more than 1 at any time.
> > 
> > Only if there are no other transactions being held open, which makes
> > this claim a lot weaker.
> > 
> > > HOT can only work in cases where a tuple does not modify one of the
> > > columns defined in an index on the table, and when we do not alter the
> > > row length of the tuple.
> > 
> > Seems like "altering the row length" isn't the issue, it's just "is
> > there room on the page for the new version".  Again, a generous
> > fillfactor would give you more flexibility.
> 
> Maybe they hoped to take very light locks when new chaoin head is copied
> iver the old one in the same-length case.
> 
> >                 http://www.postgresql.org/about/donate
-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com

NOTICE: This communication contains privileged or other confidential
information. If you have received it in error, please advise the sender
by reply email and immediately delete the message and any attachments
without copying or disclosing the contents.

Re: Frequent Update Project: Design Overview of HOT

From

"Simon Riggs"

Date:

10 November 2006, 05:35:40

On Fri, 2006-11-10 at 09:51 +0200, Hannu Krosing wrote:

> > What are the advantages of HOT over SITC (other than cool name) ?
> 
> still wondering this, is it just the abilty to span multiple pages ?

Multiple page spanning, copy back/VACUUM support, separate overflow
relation to prevent heap growth were the main ones. 

There's ideas in HOT from lots of people and places, SITC was just one
of those. There wasn't a dust-off between different designs, just an
attempt to take the best designs from wherever/whoever they came from,
so HOT isn't one person's design. 

If anybody wants to change the name, we can, to whatever you like.

> > Maybe just make HOT an extended SITC which can span pages.

There is a strong willingness to get the design optimal, which will
require change. 
> > > http://www.postgresql.org/about/donate

That's exactly what HOT is all about....

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com

Re: Frequent Update Project: Design Overview of HOT Updates

From

"Pavan Deolasee"

Date:

10 November 2006, 06:02:48

On 11/10/06, Tom Lane <tgl@sss.pgh.pa.us > wrote:

"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:

> Yes. The last bit in the t_infomask is used up to mark presence of overflow
> tuple header. But I believe there are few more bits that can be reused.
> There are three bits available in the t_ctid field as well (since ip_posid
> needs maximum 13 bits).

No, you cannot have those bits --- BLCKSZ is not limited to 8K, and even
if it were, we will not hold still for sticking flag bits into an
unrelated datatype.

BLCKSZ is not limited to 8K, but it is limited to 32K because of lp_off and lp_len
which are 15 bits in size.

OffsetNumber is limited to (BLCKSZ / sizeof(ItemIdData)). Since sizeof(ItemIdData) is 4 bytes, MaxOffsetNumber is 8192. So we need only 13 bits to represent the MaxOffsetNumber.

Regards,
Pavan

Re: Frequent Update Project: Design Overview of HOT Updates

From

"Heikki Linnakangas"

Date:

10 November 2006, 06:12:46

Tom Lane wrote:
> "Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
>> Yes. The last bit in the t_infomask is used up to mark presence of overflow
>> tuple header. But I believe there are few more bits that can be reused.
>> There are three bits available in the t_ctid field as well (since ip_posid
>> needs maximum 13 bits).
> 
> No, you cannot have those bits --- BLCKSZ is not limited to 8K, and even
> if it were, we will not hold still for sticking flag bits into an
> unrelated datatype.
> 
> You can probably fix this by inventing multiple context-dependent
> interpretations of t_infomask bits, but that strikes me as a mighty
> ugly and fragile way to proceed.

Yeah, that seems ugly.

I think the best place to grab more bits is the t_natts field. The max 
value for that is MaxTupleAttributeNumber == 1664, which fits in 11 
bits. That leaves 5 bits for other use. We'll have to replace all direct 
access to that field with an accessor macro, but according to grep there 
isn't that many files that need to be changed.

> (Actually, the assumption that you can throw an additional back-pointer
> into overflow tuple headers is the worst feature of this proposal in
> that regard --- it's really not that easy to support multiple header
> formats.)

Well, we already have a variable length null bitmap in the header. It 
seems quite straightforward to me to add the new field before the null 
bitmap. It certainly requires some changes, in particular to places that 
access the null bitmap, but it's not an insurmountable effort. Or am I 
missing some less obvious consequences?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

Re: Frequent Update Project: Design Overview of HOT Updates

From

"Zeugswetter Andreas ADI SD"

Date:

10 November 2006, 07:32:31

> > As more UPDATEs take place these tuple chains would grow, making
> > locating the latest tuple take progressively longer.
> More generally, do we need an overflow table at all, rather
> than having these overflow tuples living in the same file as
> the root tuples?  As long as there's a bit available to mark
> a tuple as being this special not-separately-indexed type,
> you don't need a special location to know what it is.

Yes, I have that same impression.

1. It doubles the IO (original page + hot page), if the new row would have fit into the original page.

2. locking should be easier if only the original heap page is involved.

3. It makes the overflow pages really hot because all concurrent updates
competefor the few overflow pages.

4. although at first it might seem so I see no advantage for vacuum with
overflow

5. the size reduction of heap is imho moot because you trade it for a
growing overflow(size reduction only comes from reusing dead tuples and not
adding index tuples --> SITC)

Could you try to explain the reasoning behind separate overflow storage
?
What has been stated so far was not really conclusive to me in this
regard.
e.g. a different header seems no easier in overflow than in heap

Andreas

Re: Frequent Update Project: Design Overview of HOT Updates

From

"Pavan Deolasee"

Date:

10 November 2006, 07:36:23

On 11/10/06, Heikki Linnakangas <heikki@enterprisedb.com> wrote:

Tom Lane wrote:

> (Actually, the assumption that you can throw an additional back-pointer
> into overflow tuple headers is the worst feature of this proposal in
> that regard --- it's really not that easy to support multiple header
> formats.)

Well, we already have a variable length null bitmap in the header. It
seems quite straightforward to me to add the new field before the null
bitmap. It certainly requires some changes, in particular to places that
access the null bitmap, but it's not an insurmountable effort. Or am I
missing some less obvious consequences?

We have added the overflow header (which right now contains a single entry i.e.
the back pointer) on the very similar lines to optional Oid field in the tuple header.
A flag (the last free in the t_infomask) is used to check if there is an additional
overflow header and if so t_hoff is adjusted appropriately.

So in the current prototype, the overflow header is after the null bitmap
and before the Oid, if it exists.

Regards,
Pavan

Re: Frequent Update Project: Design Overview of HOTUpdates

From

"Simon Riggs"

Date:

10 November 2006, 08:20:30

On Thu, 2006-11-09 at 18:28 -0500, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > As more UPDATEs take place these tuple chains would grow, making
> > locating the latest tuple take progressively longer.
> 
> This is the part that bothers me --- particularly the random-access
> nature of the search.  I wonder whether you couldn't do something
> involving an initial table fill-factor of less than 50%, and having
> the first updated version living on the same heap page as its parent.
> Only when the active chain length was more than one (which you
> hypothesize is rare) would it actually be necessary to do a random
> access into the overflow table.

Thats appropriate sometimes, not others, but I'll investigate this
further so that its possible to take advantage of non-zero fillfactors
when they exist.

There's a number of distinct use-cases here:

If you have a very small, heavily updated table it makes a lot of sense
to use lower fillfactors as well.

If you have a larger table, using fillfactor 50% immediately doubles the
size of the table. If the updates are uneven, as they mostly are because
of the Benfold distribution/Pareto principle, then it has been found
that leaving space on block doesn't help the heavily updated portions of
a table, whereas it hinders the lightly updated portions of a table.

TPC-C and TPC-B both have uniformly distributed UPDATEs, so its easy to
use the fillfactor to great advantage there.

> More generally, do we need an overflow table at all, rather than having
> these overflow tuples living in the same file as the root tuples?  As
> long as there's a bit available to mark a tuple as being this special
> not-separately-indexed type, you don't need a special location to know
> what it is.  This might break down in the presence of seqscans though.

HOT currently attempts to place a subsequent UPDATE on the same page of
the overflow relation, but this doesn't happen (yet) for placing
multiple versions on same page. IMHO it could, but will think about it.

> > This allows the length of a typical tuple chain to be extremely short in
> > practice. For a single connection issuing a stream of UPDATEs the chain
> > length will no more than 1 at any time.
> 
> Only if there are no other transactions being held open, which makes
> this claim a lot weaker.

True, but Nikhil has run tests that clearly show HOT outperforming
current situation in the case of long running transactions. The need to
optimise HeapTupleSatisfiesVacuum() and avoid long chains does still
remain a difficulty for both HOT and the current situation.

> > HOT can only work in cases where a tuple does not modify one of the
> > columns defined in an index on the table, and when we do not alter the
> > row length of the tuple.
> 
> Seems like "altering the row length" isn't the issue, it's just "is
> there room on the page for the new version".  Again, a generous
> fillfactor would give you more flexibility.

The copy-back operation can only work if the tuple fits in the same
space as the root tuple. If it doesn't you end up with a tuple
permanently in the overflow relation. That might not worry us, I guess.

Also, my understanding was that an overwrite operation could not vary
the length of a tuple (at least according to code comments).

> > [We'll be able to do that more efficiently when
> > we have plan invalidation]
> 
> Uh, what's that got to do with it?

Currently the HOT code dynamically tests to see if the index columns
have been touched. If we had plan invalidation that would be able to be
assessed more easily at planning time, in cases where there is no BEFORE
trigger.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com

Re: Frequent Update Project: Design Overview of HOTUpdates

From

"Heikki Linnakangas"

Date:

10 November 2006, 08:32:44

Simon Riggs wrote:
> On Thu, 2006-11-09 at 18:28 -0500, Tom Lane wrote:
>> "Simon Riggs" <simon@2ndquadrant.com> writes:
>>> HOT can only work in cases where a tuple does not modify one of the
>>> columns defined in an index on the table, and when we do not alter the
>>> row length of the tuple.
>> Seems like "altering the row length" isn't the issue, it's just "is
>> there room on the page for the new version".  Again, a generous
>> fillfactor would give you more flexibility.
> 
> The copy-back operation can only work if the tuple fits in the same
> space as the root tuple. If it doesn't you end up with a tuple
> permanently in the overflow relation. That might not worry us, I guess.
> 
> Also, my understanding was that an overwrite operation could not vary
> the length of a tuple (at least according to code comments).

You can't If someone else has the page pinned,

> 
>>> [We'll be able to do that more efficiently when
>>> we have plan invalidation]
>> Uh, what's that got to do with it?
> 
> Currently the HOT code dynamically tests to see if the index columns
> have been touched. If we had plan invalidation that would be able to be
> assessed more easily at planning time, in cases where there is no BEFORE
> trigger.
> 


--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

Re: Frequent Update Project: Design Overview of HOTUpdates

From

"Heikki Linnakangas"

Date:

10 November 2006, 08:38:19

Oops, pressed send too early. Ignore the one-line reply I just sent...

Simon Riggs wrote:
> On Thu, 2006-11-09 at 18:28 -0500, Tom Lane wrote:
>> "Simon Riggs" <simon@2ndquadrant.com> writes:
>>> HOT can only work in cases where a tuple does not modify one of the
>>> columns defined in an index on the table, and when we do not alter the
>>> row length of the tuple.
>> Seems like "altering the row length" isn't the issue, it's just "is
>> there room on the page for the new version".  Again, a generous
>> fillfactor would give you more flexibility.
> 
> The copy-back operation can only work if the tuple fits in the same
> space as the root tuple. If it doesn't you end up with a tuple
> permanently in the overflow relation. That might not worry us, I guess.

You can't move tuples around in a page without holding a Vacuum lock on 
the page. Some other backend might have the page pinned and have a 
pointer to a tuple on the page that would get bogus if the tuple is 
moved. Maybe you could try to get a vacuum lock when doing the update 
and prereserve space for the new version if you can get one.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

Re: Frequent Update Project: Design Overview of HOT Updates

From

Gregory Stark

Date:

10 November 2006, 08:45:47

"Zeugswetter Andreas ADI SD" <ZeugswetterA@spardat.at> writes:

> 1. It doubles the IO (original page + hot page), if the new row would 
>     have fit into the original page.

That's an awfully big IF there. Even if you use a fillfactor of 50% in which
case you're paying a 100% performance penalty *all* the time, not just when
dealing with a table that's been bloated by multiple versions you still have
no guarantee the extra versions will fit on the same page.

> 4. although at first it might seem so I see no advantage for vacuum with
> overflow 

The main problem with vacuum now is that it must scan the entire table (and
the entire index) even if only a few records are garbage. If we isolate the
garbage in a separate area then vacuum doesn't have to scan unrelated tuples.

I'm not sure this really solves that problem because there are still DELETEs
to consider but it does remove one factor that exacerbates it unnecessarily.

I think the vision is that the overflow table would never be very large
because it can be vacuumed very aggressively. It has only tuples that are busy
and will need vacuuming as soon as a transaction ends. Unlike the main table
which is mostly tuples that don't need vacuuming. 

> 5. the size reduction of heap is imho moot because you trade it for a
> growing overflow
>     (size reduction only comes from reusing dead tuples and not
> adding index tuples --> SITC)

I think you're comparing the wrong thing. Size isn't a problem in itself, size
is a problem because it causes extra i/o. So a heap that's double in size
necessary takes twice as long as necessary to scan. The fact that the overflow
tables are taking up space isn't interesting if they don't have to be scanned.
Hitting the overflow tables should be quite rare, it only comes into play when
looking at concurrently updated tuples. It certainly happens but most tuples
in the table will be committed and not being concurrently updated by anyone
else.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com

Re: Frequent Update Project: Design Overview of HOTUpdates

From

Gregory Stark

Date:

10 November 2006, 08:52:53

"Simon Riggs" <simon@2ndquadrant.com> writes:

>> Seems like "altering the row length" isn't the issue, it's just "is
>> there room on the page for the new version".  Again, a generous
>> fillfactor would give you more flexibility.
>
> The copy-back operation can only work if the tuple fits in the same
> space as the root tuple. If it doesn't you end up with a tuple
> permanently in the overflow relation. That might not worry us, I guess.

I think he's suggesting that you can put the new version in the available
space rather than use the space from the existing tuple. You can keep the same
line pointer so index entries still refer to the correct tuple.

The only problem I see is that if you determine that there's space available
when you do the update that space may have disappeared by the table you come
along to do the move-back. 

Perhaps you can do something clever with reserving the space at that time for
the later move-back but I fear that'll complicate vacuum and open up risks if
the system crashes in that state.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com

Re: Frequent Update Project: Design Overview of HOTUpdates

From

NikhilS

Date:

10 November 2006, 08:57:49

Hi,

> > This allows the length of a typical tuple chain to be extremely short in
> > practice. For a single connection issuing a stream of UPDATEs the chain
> > length will no more than 1 at any time.
>
> Only if there are no other transactions being held open, which makes
> this claim a lot weaker.

True, but Nikhil has run tests that clearly show HOT outperforming
current situation in the case of long running transactions. The need to
optimise HeapTupleSatisfiesVacuum() and avoid long chains does still
remain a difficulty for both HOT and the current situation.

Yes, I carried out some pgbench runs comparing our current HOT update
patch with PG82BETA2 sources for the long running transaction case. For
an apples to apples comparison we got roughly 170% improvement with the
HOT update patch over BETA2.

In case of BETA2, since all versions are in the main heap, we end up
doing multiple index scans for them. In case of HOT updates, we have a
single index entry with the chains getting traversed from the overflow
relation. So as Simon has mentioned the need to avoid long chains remains
a difficulty for both the situations.

Regards,
Nikhils

--
EnterpriseDB http://www.enterprisedb.com

Re: Frequent Update Project: Design Overview of HOT Updates

From

NikhilS

Date:

10 November 2006, 09:08:19

Hi,

I think the vision is that the overflow table would never be very large
because it can be vacuumed very aggressively. It has only tuples that are busy
and will need vacuuming as soon as a transaction ends. Unlike the main table
which is mostly tuples that don't need vacuuming.

Thats right. vacuum if it gets a chance to work on the overflow relation seems to be doing a decent job in our runs. If autovacuum/vacuum gets to run optimally, the FSM information generated for the overflow relations will be able to serve a lot of new tuple requests avoiding undue/large bloat in the overflow relations.

Regards,
Nikhils

--
EnterpriseDB http://www.enterprisedb.com

Re: Frequent Update Project: Design Overview of HOTUpdates

From

"Simon Riggs"

Date:

10 November 2006, 09:54:26

On Fri, 2006-11-10 at 12:32 +0100, Zeugswetter Andreas ADI SD wrote:
> > > As more UPDATEs take place these tuple chains would grow, making 
> > > locating the latest tuple take progressively longer.
>  
> > More generally, do we need an overflow table at all, rather 
> > than having these overflow tuples living in the same file as 
> > the root tuples?  As long as there's a bit available to mark 
> > a tuple as being this special not-separately-indexed type, 
> > you don't need a special location to know what it is.
> 
> Yes, I have that same impression.

I think this might work, I'll think about it.

If you can come up with a test case where we would need this
optimisation then I'm sure that can be prototyped.

In many cases though an overflow relation is desirable, so ISTM that we
might want to be able to do both: have an updated tuple with not-indexed
bit set in the heap if on same page, else go to overflow relation.

> 1. It doubles the IO (original page + hot page), if the new row would 
>     have fit into the original page.

Yes, but thats a big if, and once it is full that optimisation goes
away. Some thoughts:

First, it presumes that the second block is not in cache (see later).

Second, my observation is that this can happen for some part of the
time, but under heavy updates this window of opportunity is not the
common case and so this optimisation would not make much difference.
However, in some cases, it would be appropriate, so I'll investigate.

HOT optimises the case where a tuple in the overflow relation is
UPDATEd, so it can place the subsequent tuple version on the same page.
So if you perform 100 UPDATEs on a single tuple, the first will write to
the overflow relation in a new block, then the further 99 will attempt
to write to the same block, if they can. So in many cases we would do
only 101 block accesses and no real I/O.

> 2. locking should be easier if only the original heap page is involved.

Yes, but multi-page update already happens now, so HOT is not different
on that point.

> 3. It makes the overflow pages really hot because all concurrent updates
> compete
>     for the few overflow pages.

Which ensures they are in shared_buffers, rather than causing I/O. The
way FSM works, it will cause concurrent updaters to spread out their
writes to many blocks. So in the case of a single high frequency updater
all of the updates go into the same block of the overflow relation, so
the optimisation you referred to in (1) does take effect strongly in
that case, yet without causing contention with other updaters. The FSM
doesn't change with HOT, but the effects of having inserted additional
tuples into the main heap are much harder to undo afterwards.

The overflow relation varies in size according to the number of updates,
not the actual number of tuples, as does the main heap, so VACUUMing
will focus on the hotspots and be more efficient, especially since no
indexes need be scanned. [Sure we can use heapblock-need-vacuum bitmaps,
but there will still be a mix of updated/not-updated tuples in there, so
VACUUM would still be less efficient than with HOT]. So VACUUM can
operate more frequently on the overflow relation and keep the size
reasonable for more of the time, avoiding I/O.

Contention is and will remain a HOT topic ;-)
I understand your concerns and we should continue to monitor this on the
various performance tests that will be run.

> 4. although at first it might seem so I see no advantage for vacuum with
> overflow 

No need to VACUUM the indexes, which is the most expensive part. The
more indexes you have, the more VACUUM costs, not so with HOT.

> 5. the size reduction of heap is imho moot because you trade it for a
> growing overflow
>     (size reduction only comes from reusing dead tuples and not
> adding index tuples --> SITC)

HOT doesn't claim to reduce the size of the heap. In the presence of a
long running transaction, SizeOfHOT(heap + overflow) =
SizeOfCurrent(Heap).

VACUUM is still required in both cases to manage total heap size. If we
have solely UPDATEs and no deletes, then only the overflow relation need
be VACUUMed.

> Could you try to explain the reasoning behind separate overflow storage
> ?

I think the answers above cover the main points, which seem to make the
case clearly enough from a design rationale perspective, even without
the performance test results to confirm them.

> What has been stated so far was not really conclusive to me in this
> regard.

Personally, I understand. I argued against them for about a month after
I first heard of the idea, but they make sense for me now. HOT has
evolved considerably from the various ideas of each of the original idea
team (Jonah, Bruce, Jan, myself) and will continue to do so as better
ideas replace poor ones, based on performance tests. All of the ideas
within it need to be strongly challenged to ensure we arrive at the best
solution.

> e.g. a different header seems no easier in overflow than in heap 

True. The idea there is that we can turn frequent update on/off fairly
easily for normal tables since there are no tuple format changes in the
main heap. It also allows HOT to avoid wasting space when a table is
heavily updated in certain places only.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com

Re: Frequent Update Project: Design Overview of HOTUpdates

From

"Pavan Deolasee"

Date:

10 November 2006, 10:12:39

On 11/10/06, Simon Riggs <simon@2ndquadrant.com> wrote:

On Fri, 2006-11-10 at 12:32 +0100, Zeugswetter Andreas ADI SD wrote:

> e.g. a different header seems no easier in overflow than in heap

True. The idea there is that we can turn frequent update on/off fairly
easily for normal tables since there are no tuple format changes in the
main heap. It also allows HOT to avoid wasting space when a table is
heavily updated in certain places only.

I general though, it would make implementation a bit simpler when tuples with
different header are isolated in a different relation.

Regards,
Pavan

Re: Frequent Update Project: Design Overview of HOT Updates

From

"Zeugswetter Andreas ADI SD"

Date:

10 November 2006, 10:34:33

> > I think the vision is that the overflow table would never be very
> > large because it can be vacuumed very aggressively. It has only
tuples
> > that are busy and will need vacuuming as soon as a transaction ends.

> > Unlike the main table which is mostly tuples that don't need
> > vacuuming.

Except when deleted :-)

> Thats right. vacuum if it gets a chance to work on the
> overflow relation seems to be doing a decent job in our runs.
> If autovacuum/vacuum gets to run optimally, the FSM
> information generated for the overflow relations will be able
> to serve a lot of new tuple requests  avoiding  undue/large
> bloat in the overflow relations.

It seems like we would want to create a chain into overflow for deleted
rows also (header + all cols null), so we can vacuum those too only by
looking
at overflow, at least optionally.

I think the overflow would really need to solve deletes too, or the
bitmap
idea is more generally useful to vacuum.

Generally for clear distinction I think we are talking about two things
here.
1. reduce index bloat and maintenance work
2. allow vaccuum a cheaper focus on what needs to be done

Andreas

Re: Frequent Update Project: Design Overview of HOTUpdates

From

"Nikhil S"

Date:

10 November 2006, 11:07:30

Hi,> True, but Nikhil has run tests that clearly show HOT outperforming> current situation in the case of long running
transactions.The need to> optimise HeapTupleSatisfiesVacuum() and avoid long chains does still> remain a difficulty for
bothHOT and the current situation.>
 

Yes, I carried out some pgbench runs comparing our current HOT update
patch with PG82BETA2 sources for the long running transaction case. For
an apples to apples comparison we got roughly 170% improvement with the
HOT update patch over BETA2.

In case of BETA2, since all versions are in the main heap, we end up
doing multiple index scans for them. In case of HOT updates, we have a
single index entry with the chains getting traversed from the overflow
relation. So as Simon has mentioned the need to avoid long chains remains
a difficulty for both the situations.

Regards,
Nikhils

Re: Frequent Update Project: Design Overview of HOT Updates

From

"Pavan Deolasee"

Date:

10 November 2006, 11:22:41

On 11/10/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> On 11/10/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> (2) Isn't this full of race conditions?

> I agree, there could be race conditions. But IMO we can handle those.

Doubtless you can prevent races by introducing a bunch of additional
locking. The question was really directed to how much concurrent
performance is left, once you get done locking it down.

I understand your point and I can clearly see a chance to improve upon the current
locking implementation in the prototype even though we are seeing a good performance
boost for 50 clients and 50 scaling factor with pgbench runs as mentioned by Nikhil.

Regards,
Pavan

Re: Frequent Update Project: Design Overview of HOT Updates

From

"Zeugswetter Andreas ADI SD"

Date:

10 November 2006, 11:46:54

> > 1. It doubles the IO (original page + hot page), if the new row
would
> >     have fit into the original page.
>
> That's an awfully big IF there. Even if you use a fillfactor
> of 50% in which case you're paying a 100% performance penalty

I don't see where the 50% come from ? That's only needed if you update
all
rows on the page. And that in a timeframe, that does not allow reuse of
other
outdated tuples.
> > 4. although at first it might seem so I see no advantage for vacuum
> > with overflow
>
> The main problem with vacuum now is that it must scan the
> entire table (and the entire index) even if only a few
> records are garbage. If we isolate the garbage in a separate
> area then vacuum doesn't have to scan unrelated tuples.
>
> I'm not sure this really solves that problem because there
> are still DELETEs to consider but it does remove one factor
> that exacerbates it unnecessarily.

Yea, so you still need to vaccum the large table regularly.

> I think the vision is that the overflow table would never be
> very large because it can be vacuumed very aggressively. It
> has only tuples that are busy and will need vacuuming as soon
> as a transaction ends. Unlike the main table which is mostly
> tuples that don't need vacuuming.

Ok, but you have to provide an extra vacuum that does only that then
(and it randomly touches heap pages, and only does partial work there).

> > 5. the size reduction of heap is imho moot because you trade it for
a
> > growing overflow
> >     (size reduction only comes from reusing dead tuples and
> not adding
> > index tuples --> SITC)
>
> I think you're comparing the wrong thing.

I mean unless you do individually vacuum the overflow more frequently

> Size isn't a problem in itself, size is a problem because it causes
extra
> i/o.

Yes, and I state that at all possible occations :-) OnDisk size is a
problem, really.

> So a heap that's double in size necessary takes twice as
> long as necessary to scan. The fact that the overflow tables
> are taking up space isn't interesting if they don't have to
> be scanned.

The overflow does have to be read for each seq scan. And it was stated
that it would
be accessed with random access (follow tuple chain).
But maybe we can read the overflow same as if it where an additional
segment file ?

> Hitting the overflow tables should be quite rare, it only
> comes into play when looking at concurrently updated tuples.
> It certainly happens but most tuples in the table will be
> committed and not being concurrently updated by anyone else.

The first update moves the row to overflow, only the 2nd next might be
able to pull it back.
So on average you would have at least 66% of all updated rows after last
vacuum in the overflow.

The problem with needing very frequent vacuums is, that you might not be
able to do any work because of long transactions.

Andreas

Re: Frequent Update Project: Design Overview of HOTUpdates

From

"Zeugswetter Andreas ADI SD"

Date:

10 November 2006, 12:00:44

> > 2. locking should be easier if only the original heap page is
involved.
>
> Yes, but multi-page update already happens now, so HOT is not
> different on that point.

I was thinking about the case when you "pull back" a tuple, which seems
to be more
difficult than what we have now.

Andreas

PS: I think it is great that you are doing all this work and explaining
it for us. Thanks.

Re: Frequent Update Project: Design Overview of HOTUpdates

From

"Zeugswetter Andreas ADI SD"

Date:

10 November 2006, 12:05:01

> > True, but Nikhil has run tests that clearly show HOT outperforming
> > current situation in the case of long running transactions. The need

> > to optimise HeapTupleSatisfiesVacuum() and avoid long chains does
> > still remain a difficulty for both HOT and the current situation.
>
>
> Yes, I carried out some pgbench runs comparing our current
> HOT update patch with PG82BETA2 sources for the long running
> transaction case. For an apples to apples comparison we got

Vaccuums every 5 minutes, or no vaccuums ?
> roughly 170% improvement with the HOT update patch over BETA2.

Wow, must be smaller indexes and generally less index maintenance.
What this also states imho, is that following tuple chains
is not so expensive as maintaining indexes (at least in a heavy update
scenario like pgbench).

Maybe we should try a version, where the only difference to now is,
that when the index keys stay the same the indexes are not updated, and
the tuple
chain is followed instead when selecting with index. (Maybe like the
current alive flag the index pointer can even be refreshed to the oldest
visible
tuple by readers)

Andreas

Re: Frequent Update Project: Design Overview ofHOTUpdates

From

"Simon Riggs"

Date:

10 November 2006, 13:37:28

On Fri, 2006-11-10 at 17:00 +0100, Zeugswetter Andreas ADI SD wrote:
> > > 2. locking should be easier if only the original heap page is
> involved.
> > 
> > Yes, but multi-page update already happens now, so HOT is not 
> > different on that point.
> 
> I was thinking about the case when you "pull back" a tuple, which seems
> to be more
> difficult than what we have now.

Well, it sounds like it would be really bad, at least thats what I
thought at first. We tried it anyway and performance shows it aint that
bad.

> PS: I think it is great that you are doing all this work and explaining
> it for us. Thanks.

Thanks for your feedback; I'm certain we'll be able to improve on where
HOT is now with some objective thinking.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com

Re: Frequent Update Project: Design Overview of

From

Hannu Krosing

Date:

10 November 2006, 14:38:41

Ühel kenal päeval, R, 2006-11-10 kell 12:19, kirjutas Simon Riggs:
> On Thu, 2006-11-09 at 18:28 -0500, Tom Lane wrote:

> > > HOT can only work in cases where a tuple does not modify one of the
> > > columns defined in an index on the table, and when we do not alter the
> > > row length of the tuple.
> > 
> > Seems like "altering the row length" isn't the issue, it's just "is
> > there room on the page for the new version".  Again, a generous
> > fillfactor would give you more flexibility.
> 
> The copy-back operation can only work if the tuple fits in the same
> space as the root tuple. If it doesn't you end up with a tuple
> permanently in the overflow relation. That might not worry us, I guess.
> 
> Also, my understanding was that an overwrite operation could not vary
> the length of a tuple (at least according to code comments).

But you can change the t_ctid pointer in the heap tuple to point to
oldest visible tuple in overflow.

What I did not understand very well, is how do you determine, which
overflow tuples are safe to remove. Do you mark/remove them when doing
the copyback ?

> 
-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com

NOTICE: This communication contains privileged or other confidential
information. If you have received it in error, please advise the sender
by reply email and immediately delete the message and any attachments
without copying or disclosing the contents.

Re: Frequent Update Project: Design Overview ofHOTUpdates

From

"Simon Riggs"

Date:

10 November 2006, 15:06:57

On Fri, 2006-11-10 at 20:38 +0200, Hannu Krosing wrote:
> Ühel kenal päeval, R, 2006-11-10 kell 12:19, kirjutas Simon Riggs:
> > On Thu, 2006-11-09 at 18:28 -0500, Tom Lane wrote:
>
> > > > HOT can only work in cases where a tuple does not modify one of the
> > > > columns defined in an index on the table, and when we do not alter the
> > > > row length of the tuple.
> > >
> > > Seems like "altering the row length" isn't the issue, it's just "is
> > > there room on the page for the new version".  Again, a generous
> > > fillfactor would give you more flexibility.
> >
> > The copy-back operation can only work if the tuple fits in the same
> > space as the root tuple. If it doesn't you end up with a tuple
> > permanently in the overflow relation. That might not worry us, I guess.
> >
> > Also, my understanding was that an overwrite operation could not vary
> > the length of a tuple (at least according to code comments).
>
> But you can change the t_ctid pointer in the heap tuple to point to
> oldest visible tuple in overflow.
>
> What I did not understand very well, is how do you determine, which
> overflow tuples are safe to remove. Do you mark/remove them when doing
> the copyback ?

Yes, dead chain is marked as we walk it to the next visible tuple prior
to copyback. Thats done with an exclusive lock held on the root block.
Pavan can answer those details better than me, so I'll leave that to
him, but the copyback operation needs to be closely examined.

We hope to post the code by Tuesday. It's being cleaned now to remove
the remains of previous prototypes to allow it be reviewed without
confusion.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com

Re: Frequent Update Project: Design Overview of HOT Updates

From

NikhilS

Date:

10 November 2006, 15:23:33

Hi,

On 11/10/06, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:

On 11/10/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Pavan Deolasee" < pavan.deolasee@gmail.com> writes:
> On 11/10/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> (2) Isn't this full of race conditions?

> I agree, there could be race conditions. But IMO we can handle those.

Doubtless you can prevent races by introducing a bunch of additional
locking. The question was really directed to how much concurrent
performance is left, once you get done locking it down.

I understand your point and I can clearly see a chance to improve upon the current
locking implementation in the prototype even though we are seeing a good performance
boost for 50 clients and 50 scaling factor with pgbench runs as mentioned by Nikhil.

Regards,
Pavan

Yes, we have done a number of runs with and without autovacuum with parameters like 50 clients, 50 scaling factor and 25000 transactions per client. 50 clients should introduce a decent amount of concurrency. The tps values observed with the HOT update patch (850 tps) were approximately 200+% better than PG82 sources (270).

Runs with 25 clients, 25 scaling factor and 25000 transactions produce similar percentage increases with the HOT update patch.

Regards,

Nikhils

--
EnterpriseDB http://www.enterprisedb.com

Re: Frequent Update Project: Design Overview of HOTUpdates

From

NikhilS

Date:

10 November 2006, 15:29:26

Hi,

On 11/10/06, Zeugswetter Andreas ADI SD <ZeugswetterA@spardat.at> wrote:

> > True, but Nikhil has run tests that clearly show HOT outperforming
> > current situation in the case of long running transactions. The need

> > to optimise HeapTupleSatisfiesVacuum() and avoid long chains does
> > still remain a difficulty for both HOT and the current situation.
>
>
> Yes, I carried out some pgbench runs comparing our current
> HOT update patch with PG82BETA2 sources for the long running
> transaction case. For an apples to apples comparison we got

Vaccuums every 5 minutes, or no vaccuums ?

We tried with both. Vacuum seems to do little to help in a long running transaction case. Generally in most of the pgbench runs that we carried out, autovacuum did not seem to be of much help even to PG82.

Regards,

Nikhils

--
EnterpriseDB http://www.enterprisedb.com

Re: Frequent Update Project: Design Overview ofHOTUpdates

From

"Simon Riggs"

Date:

10 November 2006, 16:40:20

On Fri, 2006-11-10 at 17:04 +0100, Zeugswetter Andreas ADI SD wrote:
> > > True, but Nikhil has run tests that clearly show HOT outperforming 
> > > current situation in the case of long running transactions. The need
> 
> > > to optimise HeapTupleSatisfiesVacuum() and avoid long chains does 
> > > still remain a difficulty for both HOT and the current situation.
> > 
> > 
> > Yes, I carried out some pgbench runs comparing our current 
> > HOT update patch with PG82BETA2 sources for the long running 
> > transaction case. For an apples to apples comparison we got
> 
> Vaccuums every 5 minutes, or no vaccuums ?

Same either way, since vacuum would not remove any tuples.

> > roughly 170% improvement with the HOT update patch over BETA2.
> 
> Wow, must be smaller indexes and generally less index maintenance. 
> What this also states imho, is that following tuple chains
> is not so expensive as maintaining indexes (at least in a heavy update 
> scenario like pgbench).
> 
> Maybe we should try a version, where the only difference to now is,
> that when the index keys stay the same the indexes are not updated, and
> the tuple
> chain is followed instead when selecting with index. (Maybe like the
> current alive flag the index pointer can even be refreshed to the oldest
> visible
> tuple by readers)

I think that is SITC, nearly.

It's also exactly what HOT does, with the exception that the updated
tuples goes to a block in the overflow, rather than a block later in the
heap. The overflow relation isn't critical to the HOT mechanism, but it
does allow the two optimisations of not requiring all tuple headers to
be modified with the back pointer and improving the locality of VACUUM.
We might do that with some other structuring, but if it works for TOAST,
I figure its OK for HOT also.


--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com

Re: Frequent Update Project: Design Overview of HOTUpdates

From

"Simon Riggs"

Date:

10 November 2006, 16:40:34

On Fri, 2006-11-10 at 16:46 +0100, Zeugswetter Andreas ADI SD wrote:

> > I'm not sure this really solves that problem because there 
> > are still DELETEs to consider but it does remove one factor 
> > that exacerbates it unnecessarily.
> 
> Yea, so you still need to vaccum the large table regularly.

HOT covers the use-case of heavy updating, which in many common cases
occurs on tables with few inserts/deletes. HOT would significantly
reduce the need to vacuum since deletes and wraparound issues would be
the only remaining reasons to do this.

[I have some ideas for how to optimize tables with heavy INSERT/DELETE
activity, but that case is much less prevalent than heavy UPDATEs.]

> > I think the vision is that the overflow table would never be 
> > very large because it can be vacuumed very aggressively. It 
> > has only tuples that are busy and will need vacuuming as soon 
> > as a transaction ends. Unlike the main table which is mostly 
> > tuples that don't need vacuuming. 
> 
> Ok, but you have to provide an extra vacuum that does only that then
> (and it randomly touches heap pages, and only does partial work there).

Sure, HOT needs a specially optimised VACUUM.

> > So a heap that's double in size necessary takes twice as 
> > long as necessary to scan. The fact that the overflow tables 
> > are taking up space isn't interesting if they don't have to 
> > be scanned.
> 
> The overflow does have to be read for each seq scan. And it was stated
> that it would
> be accessed with random access (follow tuple chain).
> But maybe we can read the overflow same as if it where an additional
> segment file ?

Not without taking a write-avoiding lock on the table, unfortunately.

> > Hitting the overflow tables should be quite rare, it only 
> > comes into play when looking at concurrently updated tuples. 
> > It certainly happens but most tuples in the table will be 
> > committed and not being concurrently updated by anyone else.
> 
> The first update moves the row to overflow, only the 2nd next might be
> able to pull it back.
> So on average you would have at least 66% of all updated rows after last
> vacuum in the overflow.
> 
> The problem with needing very frequent vacuums is, that you might not be
> able to do any work because of long transactions.

HOT doesn't need more frequent VACUUMs, it is just more efficient and so
can allow them, when needed to avoid I/O. Space usage in the overflow
relation is at its worst in the case of an enormous table with low
volume random updates, but note that it is *never* worse than current
space usage. In the best case, which is actually fairly common in
practice: a small number of rows of a large table are being updated by a
steady stream of concurrent updates, we find the overflow relation needs
only a few 100 tuples, so regular vacuuming will be both easy and
effective.

As an aside, note that HOT works best in real-world situations, not
benchmarks such as TPC where the I/Os are deliberately randomised to
test the scalability of the RDBMS. But even then, HOT works better.

The long-running transaction issue remains unsolved in this proposal,
but I have some ideas for later.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com

Re: Frequent Update Project: Design Overview of HOTUpdates

From

Robert Treat

Date:

12 November 2006, 14:01:43

On Friday 10 November 2006 08:53, Simon Riggs wrote:
> On Fri, 2006-11-10 at 12:32 +0100, Zeugswetter Andreas ADI SD wrote:
> > 4. although at first it might seem so I see no advantage for vacuum with
> > overflow
>
> No need to VACUUM the indexes, which is the most expensive part. The
> more indexes you have, the more VACUUM costs, not so with HOT.
>

This isn't exactly true though right?  Since the more indexes you have, the 
more likely it is that your updating an indexed column, which means HOT isn't 
going to work for you.  One common use case that seems problematic is the 
indexed, frequently updated timestamp field.

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

Re: Frequent Update Project: Design Overview ofHOTUpdates

From

"Simon Riggs"

Date:

12 November 2006, 17:23:39

On Sun, 2006-11-12 at 13:01 -0500, Robert Treat wrote:
> On Friday 10 November 2006 08:53, Simon Riggs wrote:
> > On Fri, 2006-11-10 at 12:32 +0100, Zeugswetter Andreas ADI SD wrote:
> > > 4. although at first it might seem so I see no advantage for vacuum with
> > > overflow
> >
> > No need to VACUUM the indexes, which is the most expensive part. The
> > more indexes you have, the more VACUUM costs, not so with HOT.
> >
> 
> This isn't exactly true though right? 

The above statement is completely true; please don't say I aim to
mislead. I've been clear about the pre-conditions for the optimization.
This is a straight-up attempt to improve some important use cases.

>  Since the more indexes you have, the 
> more likely it is that your updating an indexed column, which means HOT isn't 
> going to work for you.  

Well its not a chance thing is it?  It's clear that the pre-conditions
could in some circumstances be an annoyance, but that in itself isn't an
argument against it. I'm especially keen to hear of an optimisation that
would work in all cases for heavy updates. (It was I that originally
suggested the fillfactor approach to optimising UPDATEs, but regret that
although it applies no matter how many indexes you have its not very
effective and even that reduces after the first batch of UPDATEs have
happened).

> One common use case that seems problematic is the 
> indexed, frequently updated timestamp field.

Not sure of the use case for that? I understand using a timestamp field
for optimistic locking; why would you index that rather than the PK?

Locating things via coordinates was a use-case that would be non-HOT,
are you thinking of something similar? It's important to understand
which types of things HOT would optimize/not.

HOT probably would change the way you design if you need such a thing.
Rather than indexing the co-ordinate you'd end up binning the values so
the index value would change less often, so most would be HOT with a few
non-HOT UPDATEs. Maybe the same would be true with the timestamp, I'm
not sure.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com

Re: Frequent Update Project: Design Overview ofHOTUpdates

From

Robert Treat

Date:

12 November 2006, 19:32:28

On Sunday 12 November 2006 16:23, Simon Riggs wrote:
> On Sun, 2006-11-12 at 13:01 -0500, Robert Treat wrote:
> > On Friday 10 November 2006 08:53, Simon Riggs wrote:
> > > On Fri, 2006-11-10 at 12:32 +0100, Zeugswetter Andreas ADI SD wrote:
> > > > 4. although at first it might seem so I see no advantage for vacuum
> > > > with overflow
> > >
> > > No need to VACUUM the indexes, which is the most expensive part. The
> > > more indexes you have, the more VACUUM costs, not so with HOT.
> >
> > This isn't exactly true though right?
>
> The above statement is completely true; please don't say I aim to
> mislead. I've been clear about the pre-conditions for the optimization.
> This is a straight-up attempt to improve some important use cases.
>

I don't think you were trying to mislead, just my interpretation of the scheme 
requires a qualifier for that statement, namely that you are not updating an 
indexed column. Your statements above didn't include that qualfier, so I just 
wanted to make sure I wasn't overlooking something.  Actually I think I was, 
for example if your not updating all of the indexes on a table (which isn't 
likely) you're going to be better off with HOT, but in any case my apologies 
if I worded it badly.  

> >  Since the more indexes you have, the
> > more likely it is that your updating an indexed column, which means HOT
> > isn't going to work for you.
>
> Well its not a chance thing is it?  It's clear that the pre-conditions
> could in some circumstances be an annoyance, but that in itself isn't an
> argument against it. I'm especially keen to hear of an optimisation that
> would work in all cases for heavy updates. (It was I that originally
> suggested the fillfactor approach to optimising UPDATEs, but regret that
> although it applies no matter how many indexes you have its not very
> effective and even that reduces after the first batch of UPDATEs have
> happened).
>

I'd be keen to tell you such a plan if I had one, but obviously it isn't an 
easy problem to solve. :-)

> > One common use case that seems problematic is the
> > indexed, frequently updated timestamp field.
>
> Not sure of the use case for that? I understand using a timestamp field
> for optimistic locking; why would you index that rather than the PK?
>

Let's say you are doing system monitoring and you are updating last contact 
times fairly regularly. Sometimes you need to look at specific systems (the 
pk) and sometimes you need to query based on a time range (the indexed time 
field).   

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

Re: Frequent Update Project: Design Overview ofHOTUpdates

From

"Simon Riggs"

Date:

13 November 2006, 08:04:13

On Sun, 2006-11-12 at 18:31 -0500, Robert Treat wrote:

> if your not updating all of the indexes on a table (which isn't 
> likely) you're going to be better off with HOT

Do you mean *any* rather than all?

> (which isn't likely)

There is no chance involved, unless the DBA adding indexes is unaware of
the HOT optimization; that would be regrettable, but we would aim to
make it fairly clear in the docs.

IMHO *most* UPDATEs occur on non-indexed fields. I guess any analysis
anybody has of the profile of UPDATEs in specific applications would be
interesting, especially if those are widely available applications. My
own analysis covers TPC-B, TPC-C, TPC_E and the truckin use case, plus
my own experience of other applications.

If my assumption is badly wrong on that then perhaps HOT would not be
useful after all. If we find that the majority of UPDATEs meet the HOT
pre-conditions, then I would continue to advocate it.

> > > One common use case that seems problematic is the
> > > indexed, frequently updated timestamp field.
> >
> > Not sure of the use case for that? I understand using a timestamp field
> > for optimistic locking; why would you index that rather than the PK?
> >
> 
> Let's say you are doing system monitoring and you are updating last contact 
> times fairly regularly. Sometimes you need to look at specific systems (the 
> pk) and sometimes you need to query based on a time range (the indexed time 
> field).   

OK, thanks.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com

Re: Frequent Update Project: Design Overview ofHOTUpdates

From

Csaba Nagy

Date:

13 November 2006, 08:42:32

[snip]
> IMHO *most* UPDATEs occur on non-indexed fields. [snip]
> 
> If my assumption is badly wrong on that then perhaps HOT would not be
> useful after all. If we find that the majority of UPDATEs meet the HOT
> pre-conditions, then I would continue to advocate it.

Just to confirm that the scenario is valid: our application has almost
all it's updates affecting only non-indexed columns. There are a few
exceptions, but the vast majority is non-indexed, and that holds to the
execution frequency too, not just for the count of tables/queries.

Cheers,
Csaba.

Re: Frequent Update Project: Design Overview ofHOTUpdates

From

August Zajonc

Date:

13 November 2006, 15:18:03

Simon Riggs wrote:
> If my assumption is badly wrong on that then perhaps HOT would not be
> useful after all. If we find that the majority of UPDATEs meet the HOT
> pre-conditions, then I would continue to advocate it.

This is exactly my situation. All updated hit non-indexed fields, with a
lot of lookups on indexes

What's interesting for me is that I might want to move away from some
heavy INSERT/DELETE cases to simply marking records as deleted on the
application side with this. The benefit being that I retain the archive
of "processed" items without having to move them, but get the advantage
of good throughput for the smaller set of items being "worked on".

Will be interesting to see how the design pans out.

- August

Re: Frequent Update Project: Design Overview ofHOTUpdates

From

Hannu Krosing

Date:

17 November 2006, 07:30:31

Ühel kenal päeval, E, 2006-11-13 kell 13:42, kirjutas Csaba Nagy:
> [snip]
> > IMHO *most* UPDATEs occur on non-indexed fields. [snip]
> > 
> > If my assumption is badly wrong on that then perhaps HOT would not be
> > useful after all. If we find that the majority of UPDATEs meet the HOT
> > pre-conditions, then I would continue to advocate it.
> 
> Just to confirm that the scenario is valid: our application has almost
> all it's updates affecting only non-indexed columns. There are a few
> exceptions, but the vast majority is non-indexed, and that holds to the
> execution frequency too, not just for the count of tables/queries.

One interesting case which should also be considered is conditional
indexes:

create index on payments(payment_id) where status = 'waiting';

here the payment_id is not changed when processing the payment, but when
status is changed to 'processed' it still should be removed from the
index.

How would this interact with HOT ?

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com

Re: Frequent Update Project: Design Overview ofHOTUpdates

From

"Simon Riggs"

Date:

17 November 2006, 09:03:38

On Fri, 2006-11-17 at 13:30 +0200, Hannu Krosing wrote:
> Ühel kenal päeval, E, 2006-11-13 kell 13:42, kirjutas Csaba Nagy:
> > [snip]
> > > IMHO *most* UPDATEs occur on non-indexed fields. [snip]
> > >
> > > If my assumption is badly wrong on that then perhaps HOT would not be
> > > useful after all. If we find that the majority of UPDATEs meet the HOT
> > > pre-conditions, then I would continue to advocate it.
> >
> > Just to confirm that the scenario is valid: our application has almost
> > all it's updates affecting only non-indexed columns. There are a few
> > exceptions, but the vast majority is non-indexed, and that holds to the
> > execution frequency too, not just for the count of tables/queries.
>
> One interesting case which should also be considered is conditional
> indexes:
>
> create index on payments(payment_id) where status = 'waiting';
>
> here the payment_id is not changed when processing the payment, but when
> status is changed to 'processed' it still should be removed from the
> index.
>
> How would this interact with HOT ?

This can work, but isn't yet implemented in the current patch. (Neither
are expressional indexes).

For UPDATEs we currently assume that this will always result in an
inserted row, followed by inserts into appropriate indexes. So in the
executor we need to decide whether the newly inserted tuple version
should have a partial index tuple or not. So in after the call to
heap_update() we call ExecInsertIndexTuples() where we call ExecQual()
to evaluate the situation for the partial index.

For HOT we need to test the Partial Index predicate to decide whether
the UPDATE was HOT or COLD (i.e. non-HOT). We would then cache that
result so that if it is a COLD update and we need to insert a partial
index tuple then we do not have to evaluate it twice. We need to test
whether the old *and* the new tuple versions were indexed; thats an
additional cost discussed below.

In code we would need to refactor the ExecQual call so that the call can
be made slightly earlier than it is now. But same code, same caller,
same code files, just slightly modified functions.

So the answer is Yes, it can work with Partial Indexes, but caveats
would be:
*  it may only work partially ;-) - in your example the first Payment
INSERT would create an index row, the next UPDATE would not and so HOT
would not be usable for that UPDATE. OTOH the UPDATE isn't going to
produce a new index row for the partial index, so we're not making the
situation any worse either.
*  the cost of evals would be higher with HOT, but how much higher
depends upon the cost of the evaluation of your chosen index predicate.

Those caveats in themselves don't prevent HOT from being useful with
Partial Indexes. Remember, the more indexes you have the more you
benefit from HOT and none of the above changes that.

I think I'd be the first to say that the value of indexes *does* change
with HOT. In many use-cases having more indexes on a heavily updated
table is now feasible with HOT. In some use-cases, some existing indexes
could become more expensive to maintain and their value may not be
worthwhile any longer. That's why the proposal is for HOT to be an
option, not a fix for all cases.

The specific use-case you mention is fairly common, which is
disappointing, but a slightly different design is easily achieved.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com

Re: Frequent Update Project: Design Overview

From

"Kevin Grittner"

Date:

17 November 2006, 11:26:40

>>> On Fri, Nov 17, 2006 at  5:30 AM, in message
<1163763016.2941.92.camel@localhost.localdomain>, Hannu Krosing
<hannu@skype.net> wrote:
> Ühel kenal päeval, E, 2006-11-13 kell 13:42, kirjutas Csaba Nagy:
>> [snip]
>> > IMHO *most* UPDATEs occur on non-indexed fields. [snip]
>> >
>> > If my assumption is badly wrong on that then perhaps HOT would not be
>> > useful after all. If we find that the majority of UPDATEs meet the HOT
>> > pre-conditions, then I would continue to advocate it.
>>
>> Just to confirm that the scenario is valid: our application has almost
>> all it's updates affecting only non-indexed columns. There are a few
>> exceptions, but the vast majority is non-indexed, and that holds to the
>> execution frequency too, not just for the count of tables/queries.
>
> One interesting case which should also be considered is conditional
> indexes:
>
> create index on payments(payment_id) where status = 'waiting';
>
> here the payment_id is not changed when processing the payment, but when
> status is changed to 'processed' it still should be removed from the
> index.
>
> How would this interact with HOT ?
I would say that at least 80% of our updates (probably higher) do not modify indexed columns.  We have a few very small
tables(under 100 rows) which have high update rates (often exceeding 100 updates per second) which are not against
indexedcolumns.  These quickly degraded our performance until we set pretty aggressive autovacuum parameters (20% + 1
rowevery 10 seconds) and added a daily cluster to our maintenance crontab runs. 
At the other extreme, we have a table which tracks the last modification timestamp of each court case, indexed by
timestamp,to support our SOAP subscribers who want to stay up-to-date on all active court cases.  Updates in this table
areboth high volume and always involve an indexed column. 
Like Hannu, we do use conditional indexes with high updates on columns in the WHERE clause, although these columns are
notpart of the index sequence.  For example, we have a receivables table which contains a balance due.  For audit trail
purposesthese rows remain for many years after the balance hits zero, but they're not something you want to look at
whensomeone is standing at the counter with their checkbook.  We index by name where the balance is non-zero.  The
balanceis updated frequently, with most eventually hitting zero.  (The reason for the frequent updates is that the
receivableis maintained by triggers from the supporting assessment detail, so a receivable will be initially added with
azero balance and may immediately be updated dozens of times as the assessment detail is added.)  Infrequently, the
balancemay hit zero and subsequently become non-zero again. 
I hope this is helpful.
-Kevin

Re: Frequent Update Project: Design Overview ofHOTUpdates

From

"Simon Riggs"

Date:

17 November 2006, 16:18:49

On Fri, 2006-11-17 at 09:25 -0600, Kevin Grittner wrote:

> Like Hannu, we do use conditional indexes with high updates on columns
> in the WHERE clause, although these columns are not part of the index
> sequence.  For example, we have a receivables table which contains a
> balance due.  For audit trail purposes these rows remain for many
> years after the balance hits zero, but they're not something you want
> to look at when someone is standing at the counter with their
> checkbook.  We index by name where the balance is non-zero.  The
> balance is updated frequently, with most eventually hitting zero.
> (The reason for the frequent updates is that the receivable is
> maintained by triggers from the supporting assessment detail, so a
> receivable will be initially added with a zero balance and may
> immediately be updated dozens of times as the assessment detail is
> added.)  Infrequently, the balance may hit zero and subsequently
> become non-zero again.

In that case its clear that you would remove the index WHERE clause and
utilise the HOT option. That would gain you considerably on most of the
UPDATEs at the slight expense of holding some additional rows in the
index. If that is a problem, move the balance==zero rows to an older
history table, perhaps using partitioning to keep them together. That
way you've reduced the size of all indexes without using partials.

So, HOT changes the way we'd think about indexes, yet would provide
considerable benefit in this case.

> I hope this is helpful.

I think it has been very helpful. I guess I'm interested in your overall
assessment of whether HOT would be beneficial for you, or not, given
what's been discussed. Remembering that you can turn it on for specific
tables at least as easily as you can set up autovacuum for them.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com