Thread: HOT - whats next ?

HOT - whats next ?

From
"Pavan Deolasee"
Date:
Hi All,

Version 4.0 of the HOT patch is very close to the state where
we can start considering it for correctness testing as well
as benchmarking, if there is sufficient interest in giving it a
chance for 8.3.

I have very little idea of what the community thinks about
HOT and the patch, but I am ready to do whatever it takes
to improve the quality of the patch. To do that, though, I need
reviews and feedback.

IMHO there are two things that need to be done to make HOT
feature-complete:

- Support for VACUUM FULL
- Support for CREATE INDEX

Simon has expressed interest in working on these items. I have
some ideas for supporting CREATE INDEX, but I will wait
for Simon's thoughts on this.

Then there is various optimization and tuning work that
needs to be done. I am planning to start on the following
two items and need suggestions/comments to make sure that
I am following the right path:

- With Tom/Heikki's recent patch for tracking free line pointers, we have some bits available in the page header for
book-keeping. I plan to use one bit to track whether there are any LP_DELETEd items on the page. This information would
help us quickly check whether it is worth searching through the line pointers to find an LP_DELETEd item. The flag will
be set whenever a tuple is marked LP_DELETEd, and reset during page-VACUUM and whenever we fail to find an LP_DELETEd
item for reuse. (A rough sketch of this follows below.)


- Another problem with the current HOT patch is that it generates tuple-level fragmentation while reusing LP_DELETEd
items when the new tuple is of smaller size than the original one. Heikki supported using a best-fit strategy to reduce
the fragmentation, and that's worth trying. But ISTM that we can also correct row-level fragmentation whenever we run
out of free space and LP_DELETEd tuples while doing an UPDATE. Since this does not require moving tuples around, we can
do it with a simple EXCLUSIVE lock on the page. A bit to track row-level fragmentation would help.
 

These would just be hint bits, so the changes need not be WAL-logged.
We could have done better with some sort of counters, but that
would require heap-page-specific data, and that may not
necessarily be a good idea.
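
To make the first item concrete, here is a rough, untested sketch.
The PD_HAS_DELETED_LINES bit and the Page*DeletedLines macros are
names I am inventing here, modelled on the PD_HAS_FREE_LINES bit
from the free-line-pointer patch, and PageGetDeletedItem is likewise
only illustrative:

#include "postgres.h"
#include "storage/bufpage.h"

/* Hypothetical hint bit; assumes this pd_flags bit is still unused */
#define PD_HAS_DELETED_LINES	0x0002

#define PageHasDeletedLines(page) \
	(((PageHeader) (page))->pd_flags & PD_HAS_DELETED_LINES)
#define PageSetHasDeletedLines(page) \
	(((PageHeader) (page))->pd_flags |= PD_HAS_DELETED_LINES)
#define PageClearHasDeletedLines(page) \
	(((PageHeader) (page))->pd_flags &= ~PD_HAS_DELETED_LINES)

/*
 * Look for an LP_DELETEd item with enough room for a new tuple of
 * neededlen bytes.  The page-level flag lets us skip the scan of the
 * line pointer array when there is clearly nothing to find.
 */
static OffsetNumber
PageGetDeletedItem(Page page, Size neededlen)
{
	OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
	OffsetNumber offnum;

	if (!PageHasDeletedLines(page))
		return InvalidOffsetNumber;

	for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum++)
	{
		ItemId		lp = PageGetItemId(page, offnum);

		if (ItemIdDeleted(lp) && ItemIdGetLength(lp) >= neededlen)
			return offnum;
	}

	/*
	 * The search failed, so clear the flag.  It is only a hint;
	 * page-VACUUM or the next LP_DELETE will set it again.
	 */
	PageClearHasDeletedLines(page);
	return InvalidOffsetNumber;
}

The matching PageSetHasDeletedLines() call would go wherever we mark
an item LP_DELETEd.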

Comments ?

Thanks,
Pavan






Re: HOT - whats next ?

From
Tom Lane
Date:
"Pavan Deolasee" <pavan.deolasee@enterprisedb.com> writes:
> - Another problem with the current HOT patch is that it generates
>   tuple level fragmentation while reusing LP_DELETEd items when
>   the new tuple is of smaller size than the original one. Heikki
>   supported using best-fit strategy to reduce the fragmentation
>   and that's worth trying. But ISTM that we can also correct
>   row-level fragmentation whenever we run out of free space
>   and LP_DELETEd tuples while doing UPDATE. Since this does not
>   require moving tuples around, we can do this by a simple EXCLUSIVE
>   lock on the page.

You are mistaken.  To move existing tuples requires
LockBufferForCleanup, the same as VACUUM needs; otherwise some other
backend might continue to access a tuple it found previously.

How much testing of this patch's concurrent behavior has been done?
I'm wondering if any other locking thinkos are in there ...
        regards, tom lane


Re: HOT - whats next ?

From
"Pavan Deolasee"
Date:

On 3/2/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Pavan Deolasee" <pavan.deolasee@enterprisedb.com> writes:
> - Another problem with the current HOT patch is that it generates
>   tuple level fragmentation while reusing LP_DELETEd items when
>   the new tuple is of smaller size than the original one. Heikki
>   supported using best-fit strategy to reduce the fragmentation
>   and thats worth trying. But ISTM that we can also correct
>   row-level defragmentation whenever we run out of free space
>   and LP_DELETEd tuples while doing UPDATE. Since this does not
>   require moving tuples around, we can do this by a simple EXCLUSIVE
>   lock on the page.

You are mistaken.  To move existing tuples requires
LockBufferForCleanup, the same as VACUUM needs; otherwise some other
backend might continue to access a tuple it found previously.

I am not suggesting moving tuples around. This is a specific case
of reusing LP_DELETEd tuples. For example, say the HOT-update
chain had two tuples, the first of length 100 and the next of
length 125. When the first becomes dead, we remove it from the
chain and set its LP_DELETE flag. Now, say this item is reused
to store a tuple of length 80; this results in tuple-level fragmentation
of 20 bytes. The information about the original size of the item is
lost. Later on, when this tuple is also LP_DELETEd, we cannot
use it to store a tuple larger than 80 bytes, even though there
are another 20 bytes of unused free space.
 
What I am suggesting is to clean up this fragmentation (only
for LP_DELETEd tuples) by resetting the lp_len of these
items to the maximum possible value. None of the live tuples are
touched.

Btw, I haven't yet implemented this stuff, so I am seeking
opinions.
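
Roughly along these lines (an untested sketch, just to show what I
mean; PageRepairDeletedItemLengths is a made-up name and the loop is
the naive O(n^2) version):

/*
 * Reset lp_len of each LP_DELETEd item to the maximum length it can
 * occupy, i.e. up to the start of the nearest item stored above it
 * (or pd_special for the topmost item).  No tuple data is read or
 * moved and live tuples are left alone, which is why I believe an
 * EXCLUSIVE buffer lock should be sufficient here.
 */
static void
PageRepairDeletedItemLengths(Page page)
{
	OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
	OffsetNumber offnum;

	for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum++)
	{
		ItemId		lp = PageGetItemId(page, offnum);
		unsigned	limit;
		OffsetNumber i;

		if (!ItemIdDeleted(lp))
			continue;

		/* Find where the next item above this one begins */
		limit = ((PageHeader) page)->pd_special;
		for (i = FirstOffsetNumber; i <= maxoff; i++)
		{
			ItemId		other = PageGetItemId(page, i);

			if (i == offnum || !ItemIdIsUsed(other))
				continue;
			if (ItemIdGetOffset(other) > ItemIdGetOffset(lp) &&
				ItemIdGetOffset(other) < limit)
				limit = ItemIdGetOffset(other);
		}

		/* Give the deleted item back its full slot */
		lp->lp_len = limit - ItemIdGetOffset(lp);
	}
}

In the 100/125/80 example above, this would give the reused item its
full 100 bytes back once it is LP_DELETEd again.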

> How much testing of this patch's concurrent behavior has been done?
> I'm wondering if any other locking thinkos are in there ...

I have tested it with pgbench with up to 90 clients and a
scaling factor of 90, with 50000 txns/client (please see my other
post with preliminary results). I have done this quite a few times.
Not that I am saying there are no bugs, but I have good
confidence in the patch. These tests were done on SMP
machines. I also ran data consistency checks at the end
of the pgbench runs to validate the UPDATEs.
 
I also ran 4-hour DBT2 tests 3-4 times and have not seen any failures.

I would appreciate any independent testing, perhaps
in different setups.

Thanks,
Pavan


--

EnterpriseDB     http://www.enterprisedb.com

Re: HOT - whats next ?

From
"Simon Riggs"
Date:
On Fri, 2007-03-02 at 10:08 -0500, Tom Lane wrote:

> How much testing of this patch's concurrent behavior has been done?
> I'm wondering if any other locking thinkos are in there ...

This version of HOT is being developed from scratch, with as much
feedback from the community as possible. The idea was to build it up
brick by brick, so that each assumption/decision could be challenged as
we go. The idea was to avoid a huge review at the end, which could lead
to a fatal flaw being discovered too late to make the release.

An earlier version had extensive analysis of locking to confirm it
worked, but this current version is aiming for minimal invasiveness. So
this version hasn't had extensive testing - yet. But we learned lots of
lessons along the way and that thinking goes into what we have now -
locking is an area of continual concern.

Intermediate reviews would be very useful, if that's possible.

The right kind of testing is clearly going to be important to getting
HOT right. Back in July, we spent some time building concurrent psql
specifically to allow test cases to be written that referenced multiple
sessions. Even if we don't like that thought for production, it would be
great to be able to have a tool that allowed multi-session test cases to
be written. Experience was that it was much, much easier to get a test
case written in a single script where you could easily read the
statement history.

It would also be very useful to have a version of pgstattuple that
worked with heaps, so test cases can be written that examine the header
fields, info flags etc. It would be useful to be able to specify the
basic behaviour in terms of explicit test cases.

Would those two approaches to test execution be desirable in the
regression tests?

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: HOT - whats next ?

From
Bruce Momjian
Date:
Simon Riggs wrote:
> On Fri, 2007-03-02 at 10:08 -0500, Tom Lane wrote:
> 
> > How much testing of this patch's concurrent behavior has been done?
> > I'm wondering if any other locking thinkos are in there ...
> 
> This version of HOT is being developed from scratch, with as much
> feedback from the community as possible. The idea was to build it up
> brick by brick, so that each assumption/decision could be challenged as
> we go. The idea was to avoid a huge review at the end, which could lead
> to a fatal flaw being discovered too late to make the release.

Yes, as Joshua Drake said, HOT is a model of how to develop complex
patches in the community.

> The right kind of testing is clearly going to be important to getting
> HOT right. Back in July, we spent some time building concurrent psql
> specifically to allow test cases to be written that referenced multiple
> sessions. Even if we don't like that thought for production, it would be
> great to be able to have a tool that allowed multi-session test cases to
> be written. Experience was that it was much, much easier to get a test
> case written in a single script where you could easily read the
> statement history.

Yes, I am assuming we are getting the concurrent psql patch into 8.3.  It
was stalled because we were waiting for regression tests and
illustrations of its use.

> 
> It would also be very useful to have a version of pgstattuple that
> worked with heaps, so test cases can be written that examine the header
> fields, info flags etc. It would be useful to be able to specify the
> basic behaviour in terms of explicit test cases.
> 
> Would those two approaches to test execution be desirable in the
> regression tests?

Sure.

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: HOT - whats next ?

From
"Joshua D. Drake"
Date:
Bruce Momjian wrote:
> Simon Riggs wrote:
>> On Fri, 2007-03-02 at 10:08 -0500, Tom Lane wrote:
>>
>>> How much testing of this patch's concurrent behavior has been done?
>>> I'm wondering if any other locking thinkos are in there ...
>> This version of HOT is being developed from scratch, with as much
>> feedback from the community as possible. The idea was to build it up
>> brick by brick, so that each assumption/decision could be challenged as
>> we go. The idea was to avoid a huge review at the end, which could lead
>> to a fatal flaw being discovered too late to make the release.
> 
> Yes, as Joshua Drake said, HOT is a model of how to develop complex
> patches in the community.

It certainly was better than many we have seen. Hats off to Simon and
Pavan.

Joshua D. Drake


-- 
     === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997            http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/



Re: HOT - whats next ?

From
"Simon Riggs"
Date:
On Fri, 2007-03-02 at 21:53 -0500, Bruce Momjian wrote:
> Simon Riggs wrote:
>  
> > It would also be very useful to have a version of pgstattuple that
> > worked with heaps, so test cases can be written that examine the header
> > fields, info flags etc. It would be useful to be able to specify the
> > basic behaviour in terms of explicit test cases.
> > 
> > Would those two approaches to test execution be desirable in the
> > regression tests?
> 
> Sure.

I've written some utility functions that will help us look inside heap
blocks to examine headers and the like. I'd like to add these to core
(not contrib) so we can write regression tests directly using them.

I'll post what I have now to -patches, under the title: Heap page
diagnostic/test functions (WIP). (I have extended the pgstattuple
contrib module, but that's not its eventual destination, I hope.)


The first function reads a single block from a file, returning the
complete page as a bytea of length BLCKSZ. 
CREATE OR REPLACE FUNCTION bufpage_get_raw_page(text, int4)
RETURNS bytea ...

We do this to ensure that we get a time-consistent view of all the data
on the page, rather than returning to it repeatedly to read individual
items. This is a similar idea to heapgetpage() for page-at-a-time scans,
except that we may want to inspect all aspects of the page, not just the
visible tuples.

Returning a bytea means we can also dump that out as text easily, so
this is also a useful tool for retrieving the contents of damaged blocks
and emailing them to people. We can also save the page data in a table.

Other functions then work from the bytea version of the page.

-- A simple function for checking page header validity
CREATE OR REPLACE FUNCTION heap_raw_page_header_is_valid(bytea)
RETURNS boolean ...
 

-- An SRF for showing the details of tuple headers on a page
CREATE OR REPLACE FUNCTION heap_raw_page_tuple_headers(bytea)
RETURNS SETOF heap_page_tuple_header_type ...
 

Example output:

select * from heap_raw_page_tuple_headers(bufpage_get_raw_page('foo',6));

 itemid | ok | len |  tctid  |  xmn  | cn |  xmx  | cx | cid | natts | toid | info_flag_text
--------+----+-----+---------+-------+----+-------+----+-----+-------+------+----------------
      1 | t  |  33 | (6,1)   | 51800 |    |     0 | i  |   0 |     2 |    0 | HEAP_UPDATED
      2 | t  |  33 | (6,2)   |   602 | c  |     0 | i  |   0 |     2 |    0 |
      3 | t  |  33 | (6,1)   |   602 | c  | 51800 |    |   0 |     2 |    0 |

cn = xmin hint bits, cx = xmax hint bits
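
As an illustration of the kind of regression test I have in mind (the
table, block number and checks here are made up for the example, using
the function names proposed above):

-- create some HOT-update activity on a small table
CREATE TABLE foo (a int PRIMARY KEY, b text);
INSERT INTO foo SELECT g, 'x' FROM generate_series(1, 10) g;
UPDATE foo SET b = 'y' WHERE a = 1;

-- the page header should still be valid after the UPDATE
SELECT heap_raw_page_header_is_valid(bufpage_get_raw_page('foo', 0));

-- and the new tuple version should show HEAP_UPDATED in its infomask
SELECT itemid, info_flag_text
  FROM heap_raw_page_tuple_headers(bufpage_get_raw_page('foo', 0))
 ORDER BY itemid;

The expected output files would then pin down the exact header state we
expect HOT to produce.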

Can I have some additional feature requests for this, so I can
submit it as a patch to core? What else do we want?

Thanks,

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: HOT - whats next ?

From
Tom Lane
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> The first function reads a single block from a file, returning the
> complete page as a bytea of length BLCKSZ. 
>  CREATE OR REPLACE FUNCTION bufpage_get_raw_page(text, int4)
>  RETURNS bytea ...

Directly from the file?  What if the version in buffers is completely
different?  OTOH, if you try to pull from shared buffers then you won't
be able to deal with corrupted pages, so I think you are claiming that
the function can serve purposes that it can't really fulfill
simultaneously.

As for putting it in core, we already had that discussion w.r.t. the
adminpack functions, and you have not provided any argument adequate
to override the concerns expressed about those.
        regards, tom lane


Re: HOT - whats next ?

From
"Simon Riggs"
Date:
On Mon, 2007-03-05 at 11:39 -0500, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > The first function reads a single block from a file, returning the
> > complete page as a bytea of length BLCKSZ. 
> >  CREATE OR REPLACE FUNCTION bufpage_get_raw_page(text, int4)
> >  RETURNS bytea ...
> 
> Directly from the file?  What if the version in buffers is completely
> different? 

No, I was reading from shared buffers.

>  OTOH, if you try to pull from shared buffers then you won't
> be able to deal with corrupted pages, so I think you are claiming that
> the function can serve purposes that it can't really fulfill
> simultaneously.

Here's the code I was using...
Is there any validation in this path? I thought not, but maybe I'm wrong.

buf = ReadBuffer(rel, blkno);
bufpage = BufferGetPage(buf);

raw_page = (bytea *) palloc(BLCKSZ + VARHDRSZ);
SET_VARSIZE(raw_page, BLCKSZ + VARHDRSZ);
raw_page_data = VARDATA(raw_page);

LockBuffer(buf, BUFFER_LOCK_SHARE);
/* SnapshotEverything */
memcpy(raw_page_data, bufpage, BLCKSZ);
LockBuffer(buf, BUFFER_LOCK_UNLOCK);

ReleaseBuffer(buf);


> As for putting it in core, we already had that discussion w.r.t. the
> adminpack functions, and you have not provided any argument adequate
> to override the concerns expressed about those.

The main point is to get a set of functions that can be used directly in
additional regression tests as well as diagnostics. ISTM we need to
*prove* HOT works, not just claim it. I'm very open to different
approaches as to how we might do this.

If we say these functions can't be used for block corruption, that's
fine - that was just an additional point, not the main thought. I'll
still want a good way to produce regression tests that show HOT working
correctly.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: HOT - whats next ?

From
Andrew Dunstan
Date:
Simon Riggs wrote:
>
> The main point is to get a set of functions that can be used directly in
> additional regression tests as well as diagnostics. ISTM we need to
> *prove* HOT works, not just claim it. I'm very open to different
> approaches as to how we might do this.
>
>   

Functions to support regression tests don't need to be built-ins. We 
already load some extra stuff for regression tests.

cheers

andrew


Re: HOT - whats next ?

From
"Simon Riggs"
Date:
On Mon, 2007-03-05 at 12:29 -0500, Andrew Dunstan wrote:
> Simon Riggs wrote:
> >
> > The main point is to get a set of functions that can be used directly in
> > additional regression tests as well as diagnostics. ISTM we need to
> > *prove* HOT works, not just claim it. I'm very open to different
> > approaches as to how we might do this.
> >

> Functions to support regression tests don't need to be built-ins. We 
> already load some extra stuff for regression tests.

Oh good, thanks.

There is still merit in including these in core because they'll be
useful in lots of cases. We have functions for esoteric things like the
current WAL insert pointer, and we have SRFs for portals, locks, etc. Why
not for heap tuple headers? In 8.3 we are aiming to include a number of
features that will directly affect tuple representation, such as HOT,
combo CIDs, or features that alter the way VACUUM works. ISTM this is a
great time to have some diagnostic functions that relate to heaps.

The earlier objections to AdminPack were about functions that write to
files. These functions only read data; they don't write anything. So
there's no objection there, AFAICS.


--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: HOT - whats next ?

From
Tom Lane
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> The earlier objections to AdminPack were about functions that write to
> files. These functions just read data, not write them. So there's no
> objection there, AFAICS.

Au contraire, both reading and writing are issues.  But I had
misunderstood your original proposal as being for functions that would
read/write arbitrary files, so limiting it to access to database tables
(and making it superuser-only) probably is sufficient to cure that complaint.
        regards, tom lane