Thread: HOT patch - version 15

HOT patch - version 15

From
"Pavan Deolasee"
Date:

On 9/1/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:


I see this the other way around: if it doesn't work on system catalogs,
it probably doesn't work, period.  I'm not in favor of treating the
catalogs differently.



Please see the revised patches attached. The combo patch applies cleanly over
current HEAD. The incremental patch contains changes since the last patch.
The patch passes all but three regression tests. The failures are
cosmetic in nature.

Although I am comfortable with the patch, I wanted to send this out today
because tomorrow is my last work day before I go on leave for the rest of
the week. I shall try to respond to mails and work a bit during that period,
but I won't be as active as usual. I can fix any minor leftover things tomorrow
though.

The patch now supports HOT update on system catalogs. This requires us
to wait for the deleting/inserting transaction if we encounter a
DELETE_IN_PROGRESS/INSERT_IN_PROGRESS tuple while building an index.
I was worried that this could add a new deadlock scenario, but I believe
system catalog reindexing is not very common, so maybe we can live
with that.

Another thing that worries me is the handling of RECENTLY_DEAD tuples
while rebuilding system catalog indexes. For non-system relations, we
don't make the new index available for queries if any RECENTLY_DEAD
tuples are skipped while building the index. We can do the same for
system catalogs, though we may need more work to make that happen.
But since system catalogs are always read with SnapshotNow, do we need
to worry about RECENTLY_DEAD tuples? ISTM that we should never return
RECENTLY_DEAD tuples with SnapshotNow. Does this sound OK?

The changes to the heap_update() signature are reverted. The caller can check
whether the new tuple satisfies HeapTupleIsHeapOnly, which signifies that the
update was a HOT update. This has also helped us limit the changes needed to
support system catalogs, because there are several invocations of
simple_heap_update() followed by CatalogUpdateIndexes(). We can
now check for a heap-only tuple in CatalogUpdateIndexes() and skip
index inserts for HOT tuples.

As per Tom's suggestion, the relcache is now responsible for building the list
of HOT attributes. This needs to happen only once unless there are schema
changes. Also, we use just a single bitmap to track normal and system
attributes, offset by FirstLowInvalidHeapAttributeNumber.
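For illustration, the offset trick might look like this minimal C sketch (the -8 value matches the 8.3-era headers, but everything else here — names, sizes — is invented, not taken from the patch):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for FirstLowInvalidHeapAttributeNumber (-8 in
 * the 8.3-era headers).  System attributes use negative attnos, so
 * offsetting by this constant maps every valid attno, system or user,
 * to a non-negative bitmap index. */
#define FIRST_LOW_INVALID_HEAP_ATTNUM (-8)
#define MAX_ATTNUM 32   /* invented cap for the sketch */

typedef struct {
    bool bits[MAX_ATTNUM - FIRST_LOW_INVALID_HEAP_ATTNUM];
} AttrBitmap;

/* One bitmap covers both system attributes (e.g. ctid = -1) and
 * ordinary user attributes (1, 2, 3, ...). */
static void attrmap_set(AttrBitmap *m, int attno)
{
    m->bits[attno - FIRST_LOW_INVALID_HEAP_ATTNUM] = true;
}

static bool attrmap_test(const AttrBitmap *m, int attno)
{
    return m->bits[attno - FIRST_LOW_INVALID_HEAP_ATTNUM];
}
```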

As per discussion, page pruning and fragmentation repair are now clubbed
together. In the SELECT path, we conditionally try for an exclusive lock. If
the exclusive lock is also a cleanup lock, we prune and defrag the page.
As the patch stands, during an index fetch we try to prune only the chain
originating at that tuple. I am wondering if we should change that
and prune all the tuple chains on the page?
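The "conditionally try for an exclusive lock that is also a cleanup lock" decision can be mimicked with a toy model (a sketch only; real PostgreSQL buffer pins and content locks are far more involved, and every name below is invented):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy buffer: pin_count counts the backends holding a pin; a cleanup
 * (vacuum) lock requires the exclusive content lock PLUS being the
 * only pinner of the page. */
typedef struct {
    int  pin_count;
    bool exclusively_locked;   /* content lock held by someone else */
} ToyBuffer;

/* Try to get a cleanup lock without waiting.  Returns true only if the
 * exclusive lock is free and our pin is the sole pin -- the condition
 * under which the page may be pruned and defragmented.  Otherwise the
 * SELECT just proceeds and leaves the page alone. */
static bool try_cleanup_lock(ToyBuffer *buf)
{
    if (buf->exclusively_locked)
        return false;          /* someone holds the content lock: skip */
    if (buf->pin_count != 1)
        return false;          /* other backends hold pins: no cleanup */
    buf->exclusively_locked = true;
    return true;
}
```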

I haven't removed the avgFSM logic used to time the pruning of
the page. But we can remove that if we feel it's not worth the
complexity. Maybe a simple test of free space as a fraction of
BLCKSZ would suffice.
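The simpler alternative floated above could be as small as this (BLCKSZ is PostgreSQL's default 8192-byte block size; the 10% threshold is an invented placeholder, not a number from the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BLCKSZ 8192   /* PostgreSQL's default block size */

/* Prune when the page's free space has dropped below some fraction of
 * the block size.  The divisor 10 (i.e. under ~10% free) is purely
 * illustrative. */
static bool prune_by_free_space(size_t free_bytes)
{
    return free_bytes < BLCKSZ / 10;
}
```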

MaxHeapTuplesPerPage is reverted back to the original value.

I could have removed hot_update from xl_heap_update, but that
would have complicated the heap_xlog_update() code, so I have
left it unchanged for now. If we still feel that's worth doing, I
will make that change.

IsPlanValidForThisTransaction() now uses PlannerGlobal
as per Tom's suggestion, thus making it reentrant. I am not sure if
I have got it completely right though.

Apart from the changes based on recent feedback, I have also
added the previously discussed changes to track dead_space in
the relation and trigger autovacuum when the percentage of
dead_space goes beyond the threshold. vac_update_scale
is still a percentage of dead_space relative to the relation size.
vac_update_threshold is currently a number of blocks, but we need
to discuss and finalize the changes. One good thing is that if we do
it this way, the autoanalyze trigger will be based on the total number of
HOT and non-HOT updates (along with inserts/deletes).

Thanks,
Pavan


--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: HOT patch - version 15

From
Bruce Momjian
Date:
Pavan Deolasee wrote:
> together. In the SELECT path, we conditionally try for exclusive lock. If
> the exclusive lock is also cleanup lock, we prune and defrag the page.
> As the patch stands, during index fetch we try to prune only the chain
> originating at that tuple. I am wondering if we should change that
> and prune all the tuple chains in the page ?

I am thinking we are best just doing the index chain we already have to
read.

If you are modifying the table (like with UPDATE) it makes sense to be
more aggressive and do the whole page because you know there are
probably other table modifications, but for an index lookup there isn't
any knowledge of whether the table is being modified so looking at more
than we need seems like overkill.

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: HOT patch - version 15

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> I am thinking we are best just doing the index chain we already have to
> read.

> If you are modifying the table (like with UPDATE) it makes sense to be
> more aggressive and do the whole page because you know there are
> probably other table modifications, but for an index lookup there isn't
> any knowledge of whether the table is being modified so looking at more
> than we need seems like overkill.

Uh, why would any of this code at all execute during a pure lookup?

            regards, tom lane

Re: HOT patch - version 15

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > I am thinking we are best just doing the index chain we already have to
> > read.
>
> > If you are modifying the table (like with UPDATE) it makes sense to be
> > more aggressive and do the whole page because you know there are
> > probably other table modifications, but for an index lookup there isn't
> > any knowledge of whether the table is being modified so looking at more
> > than we need seems like overkill.
>
> Uh, why would any of this code at all execute during a pure lookup?

No idea.  It seems an index lookup tries to prune a heap chain, and he
was asking if it should look at other chains on the page;  I said not.
Whether the index lookup should prune the heap chain is another issue.

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: HOT patch - version 15

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Tom Lane wrote:
>> Uh, why would any of this code at all execute during a pure lookup?

> No idea.  It seems an index lookup tries to prune a heap chain, and he
> was asking if it should look at other chains on the page;  I said not.
> Whether the index lookup should prune the heap chain is another issue.

ISTM the only time we should be doing HOT cleanup work is when we are
trying to insert a new tuple on that page --- and maybe even only when
we try and it doesn't fit straight off.  Otherwise you're pushing
maintenance work into foreground query pathways, which is exactly where
I don't think we should be headed.

            regards, tom lane

Re: HOT patch - version 15

From
Gregory Stark
Date:
"Bruce Momjian" <bruce@momjian.us> writes:

> Tom Lane wrote:
>> Bruce Momjian <bruce@momjian.us> writes:
>> > I am thinking we are best just doing the index chain we already have to
>> > read.
>>
>> > If you are modifying the table (like with UPDATE) it makes sense to be
>> > more aggressive and do the whole page because you know there are
>> > probably other table modifications, but for an index lookup there isn't
>> > any knowledge of whether the table is being modified so looking at more
>> > than we need seems like overkill.
>>
>> Uh, why would any of this code at all execute during a pure lookup?
>
> No idea.  It seems an index lookup tries to prune a heap chain, and he
> was asking if it should look at other chains on the page;  I said not.
> Whether the index lookup should prune the heap chain is another issue.

Pruning chains is kind of the whole point of the exercise, no?

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: HOT patch - version 15

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Bruce Momjian <bruce@momjian.us> writes:
>> Tom Lane wrote:
>>> Uh, why would any of this code at all execute during a pure lookup?
>
>> No idea.  It seems an index lookup tries to prune a heap chain, and he
>> was asking if it should look at other chains on the page;  I said not.
>> Whether the index lookup should prune the heap chain is another issue.
>
> ISTM the only time we should be doing HOT cleanup work is when we are
> trying to insert a new tuple on that page --- and maybe even only when
> we try and it doesn't fit straight off.  Otherwise you're pushing
> maintenance work into foreground query pathways, which is exactly where
> I don't think we should be headed.

Ah, as I understand it you can't actually do the pruning then, because the
executor holds references to the source tuple on the page. In other words you
can never "get the vacuum lock" there, because you already have the page
pinned yourself.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: HOT patch - version 15

From
"Pavan Deolasee"
Date:

On 9/6/07, Gregory Stark <stark@enterprisedb.com> wrote:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

>
> ISTM the only time we should be doing HOT cleanup work is when we are
> trying to insert a new tuple on that page --- and maybe even only when
> we try and it doesn't fit straight off.  Otherwise you're pushing
> maintenance work into foreground query pathways, which is exactly where
> I don't think we should be headed.

Ah, as I understand it you can't actually do the pruning then because the
executor holds references to source tuple on the page. In other words you can
never "get the vacuum lock" there because you already have the page pinned
yourself.


I don't think the executor holding a reference is a problem, because when
we check for the vacuum lock, we have already pinned the page anyway.
But moving the old tuple around deep down in the UPDATE code path
(when we realize that there is no free space) is an issue. I know Heikki
tried to do it this way, but then moved the pruning code to the lookup
code. Heikki?

Another real problem with doing pruning only in the UPDATE path is that
we may end up with long HOT chains if the page does not receive an
UPDATE after many consecutive HOT updates. Every lookup of the
visible tuple in such a chain would be CPU-expensive, since it would
require following a long chain.

The downside is of course that SELECT may occasionally do more work.
But since we ensure that we do the extra work only when there is no
contention on the page, ISTM that the downside is limited.


Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: HOT patch - version 15

From
Gregory Stark
Date:
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:

> On 9/6/07, Gregory Stark <stark@enterprisedb.com> wrote:
>
>> Ah, as I understand it you can't actually do the pruning then because the
>> executor holds references to source tuple on the page. In other words you
>> can never "get the vacuum lock" there because you already have the page
>> pinned yourself.
>
> I don't think executor holding a reference is a problem because when
> we check for vacuum lock, we have already pinned the page anyways.

But that's the point. Even though the pin-count is 1 and it may look like we
have the vacuum lock, we really don't. The fact that the buffer was already
pinned by us means that there are already pointers around referencing tuples
on that page. If we move them around, those pointers become garbage. In fact
Postgres avoids copying data if it can get by with a pointer to the original
tuple on disk, so some of the pointers will be Datums pointing at individual
columns of the tuples in the page.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: HOT patch - version 15

From
"Heikki Linnakangas"
Date:
Pavan Deolasee wrote:
> On 9/6/07, Gregory Stark <stark@enterprisedb.com> wrote:
>> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>>
>>> ISTM the only time we should be doing HOT cleanup work is when we are
>>> trying to insert a new tuple on that page --- and maybe even only when
>>> we try and it doesn't fit straight off.  Otherwise you're pushing
>>> maintenance work into foreground query pathways, which is exactly where
>>> I don't think we should be headed.
>> Ah, as I understand it you can't actually do the pruning then because the
>> executor holds references to source tuple on the page. In other words you
>> can
>> never "get the vacuum lock" there because you already have the page pinned
>> yourself.
>>
>>
> I don't think executor holding a reference is a problem because when
> we check for vacuum lock, we have already pinned the page anyways.
> But moving the old tuple around deep down in the UPDATE code path
> (when we realize that there is no free space) is an issue. I know Heikki
> tried to do it this way - but then moved the pruning code to lookup
> code. Heikki ?

When I suggested that we get rid of the LP_DELETE flag for heap tuples,
the tuple-level fragmentation and all that, and just take the vacuum
lock and call PageRepairFragmentation, I was thinking that we'd do it in
heap_update and only when we run out of space on the page. But as Greg
said, it doesn't work because you're already holding a reference to at
least one tuple on the page, the one you're updating, by the time you
get to heap_update. That's why I put the pruning code in heap_fetch
instead. Yes, though the amortized cost is the same, it does push the
pruning work into the foreground query path.

That was a reason for separating the pruning and the defragmenting of a page.
We could do the pruning in heap_update, and only do the defragmenting in
heap_fetch. I was also thinking of an optimization to the pruning, so
that while we scan the page and remove dead tuples, we also check if we
leave behind an empty gap that's big enough to accommodate the new tuple
we're inserting, and reuse that space immediately, without defragmenting.
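Heikki's gap-reuse idea can be sketched as follows (a toy model where each slot just records its tuple's length and a dead flag; all names are invented, not patch code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy page slot: after pruning, a dead slot's space becomes a gap
 * that could be reused in place, without defragmenting. */
typedef struct {
    size_t len;    /* length of the tuple occupying this slot */
    bool   dead;   /* pruning found this tuple dead */
} ToySlot;

/* Return the index of a pruned-away gap big enough for a new tuple of
 * length 'needed', or -1 if none exists -- in which case a full
 * defragmentation would still be required. */
static int find_reusable_gap(const ToySlot *slots, int nslots, size_t needed)
{
    for (int i = 0; i < nslots; i++)
        if (slots[i].dead && slots[i].len >= needed)
            return i;
    return -1;
}
```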

> Another real problem with doing pruning only in UPDATE path is that
> we may end up with long HOT chains if the page does not receive a
> UPDATE, after many consecutive HOT updates. Every lookup to the
> visible tuple in this chain would be CPU expensive since it would require
> long chain following.

Yes. For that as well, we could prune only, but not defragment, the page
in the lookup path.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: HOT patch - version 15

From
Tom Lane
Date:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> When I suggested that we get rid of the LP_DELETE flag for heap tuples,
> the tuple-level fragmentation and all that, and just take the vacuum
> lock and call PageRepairFragmentation, I was thinking that we'd do it in
> heap_update and only when we run out of space on the page. But as Greg
> said, it doesn't work because you're already holding a reference to at
> least one tuple on the page, the one you're updating, by the time you
> get to heap_update. That's why I put the pruning code to heap_fetch
> instead. Yes, though the amortized cost is the same, it does push the
> pruning work to the foreground query path.

The amortized cost is only "the same" if every heap_fetch is associated
with a heap update.  I feel pretty urgently unhappy about this choice.
Have you tested the impact of the patch on read-mostly workloads?

>> Another real problem with doing pruning only in UPDATE path is that
>> we may end up with long HOT chains if the page does not receive a
>> UPDATE, after many consecutive HOT updates.

How is that, if the same number of prune attempts would occur?

            regards, tom lane

Re: HOT patch - version 15

From
"Heikki Linnakangas"
Date:
Tom Lane wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> When I suggested that we get rid of the LP_DELETE flag for heap tuples,
>> the tuple-level fragmentation and all that, and just take the vacuum
>> lock and call PageRepairFragmentation, I was thinking that we'd do it in
>> heap_update and only when we run out of space on the page. But as Greg
>> said, it doesn't work because you're already holding a reference to at
>> least one tuple on the page, the one you're updating, by the time you
>> get to heap_update. That's why I put the pruning code to heap_fetch
>> instead. Yes, though the amortized cost is the same, it does push the
>> pruning work to the foreground query path.
>
> The amortized cost is only "the same" if every heap_fetch is associated
> with a heap update.  I feel pretty urgently unhappy about this choice.
> Have you tested the impact of the patch on read-mostly workloads?

I haven't. Someone should. We have a tester working on a test suite with
many small CPU-bound performance test cases; hopefully we'll get those
test cases and results out soon.

Assuming the rule for when to prune would be the same whether we do it
in heap_fetch or heap_update, I don't see how the total cost would be
different. (That's a bad assumption, though; see below.)

>>> Another real problem with doing pruning only in UPDATE path is that
>>> we may end up with long HOT chains if the page does not receive a
>>> UPDATE, after many consecutive HOT updates.
>
> How is that, if the same number of prune attempts would occur?

It wouldn't. To avoid the long HOT chains, we want to prune more often
than what's needed to just make room for updates. I'm not sure what the
exact rules are in the current patch.

That's a pretty sensitive tradeoff, we want to prune often to cut the
long HOT chains, but not too often because it's pretty expensive to
acquire the vacuum lock and move tuples around. I don't think we've
found the optimal solution yet. Separating the pruning and defragmenting
might help.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: HOT patch - version 15

From
Florian Pflug
Date:
Heikki Linnakangas wrote:
> That's a pretty sensitive tradeoff, we want to prune often to cut the
> long HOT chains, but not too often because it's pretty expensive to
> acquire the vacuum lock and move tuples around. I don't think we've
> found the optimal solution yet. Separating the pruning and defragmenting
> might help.

Does defragmenting force writing a full page image to the WAL afterwards?
Or does it just log the fact that the page was defragmented, and the actual
work is redone on recovery?

In the first case, over-zealous defragmenting might be costly in terms of
WAL traffic too, not only in term of CPU usage.

greetings, Florian Pflug


Re: HOT patch - version 15

From
"Heikki Linnakangas"
Date:
Tom Lane wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> Tom Lane wrote:
>>>> Another real problem with doing pruning only in UPDATE path is that
>>>> we may end up with long HOT chains if the page does not receive a
>>>> UPDATE, after many consecutive HOT updates.
>>> How is that, if the same number of prune attempts would occur?
>
>> It wouldn't. To avoid the long HOT chains, we want to prune more often
>> than what's needed to just make room for updates.
>
> I don't follow.  HOT chains can only get longer by updates.

Yes. We don't seem to be on the same wavelength...

Imagine a page with just one tuple on it:

1

After a bunch of updates, it looks like this

1 -> 2 -> 3 -> 4 -> 5

1 is the tuple the indexes point to, others are heap only.

Every time you do an index lookup, you have to follow that chain from 1
to 5. Which gets expensive. That's why we want to prune the chain to
make it shorter, even if there's still plenty of room on the page for
updates, and even if we're never going to update it again.

That's the theory. I don't know *how* expensive following the chain
really is, compared to index scans skipping over entries marked as
killed. I don't remember seeing any measurements of that.
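The chain-following cost described above can be modelled in a few lines of C (a toy sketch, not patch code: each slot stores the index of its successor, and pruning redirects the root line pointer at the live tuple):

```c
#include <assert.h>

#define CHAIN_END (-1)

/* Toy HOT chain: next[i] holds the page slot of tuple i's successor,
 * or CHAIN_END at the live chain tip.  Index scans always enter at
 * the root, the only member the indexes point to. */
static int chain_hops(const int *next, int root)
{
    int hops = 0;
    for (int i = root; next[i] != CHAIN_END; i = next[i])
        hops++;
    return hops;
}

/* Pruning collapses the chain: the root is redirected straight at the
 * live tuple, so later lookups pay at most one hop.  (Reclaiming the
 * intermediate slots' space is the separate defragmenting step.) */
static void prune_chain(int *next, int root)
{
    int tip = root;
    while (next[tip] != CHAIN_END)
        tip = next[tip];
    if (tip != root)
        next[root] = tip;   /* root becomes a redirect to the tip */
}
```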

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: HOT patch - version 15

From
"Heikki Linnakangas"
Date:
Florian Pflug wrote:
> Heikki Linnakangas wrote:
>> That's a pretty sensitive tradeoff, we want to prune often to cut the
>> long HOT chains, but not too often because it's pretty expensive to
>> acquire the vacuum lock and move tuples around. I don't think we've
>> found the optimal solution yet. Separating the pruning and defragmenting
>> might help.
>
> Does defragmenting force writing a full page image to the WAL afterwards?
> Or does it just log the fact that the page was defragmented, and the actual
> work is redone on recovery?

No, you just log a note that it was defragmented.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: HOT patch - version 15

From
Tom Lane
Date:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>>> Another real problem with doing pruning only in UPDATE path is that
>>> we may end up with long HOT chains if the page does not receive a
>>> UPDATE, after many consecutive HOT updates.
>>
>> How is that, if the same number of prune attempts would occur?

> It wouldn't. To avoid the long HOT chains, we want to prune more often
> than what's needed to just make room for updates.

I don't follow.  HOT chains can only get longer by updates.

            regards, tom lane

Re: HOT patch - version 15

From
"Heikki Linnakangas"
Date:
Tom Lane wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> Imagine a page with just one tuple on it:
>
>> 1
>
>> After a bunch of updates, it looks like this
>
>> 1 -> 2 -> 3 -> 4 -> 5
>
>> 1 is the tuple the indexes point to, others are heap only.
>
> But if we were attempting prune at every update, at least some of the
> later updates should have managed to prune.  "2" should certainly be
> gone at this point, unless there's lots of contention for the page,
> in which case pruning at select won't make things better.

Oh I see. Yeah, hopefully you don't end up with long chains too often.
You don't need contention for it, though, a long-running transaction
will cause it. Or a transaction that inserts a tuple and updates it
multiple times in the same transaction.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: HOT patch - version 15

From
"Heikki Linnakangas"
Date:
Alvaro Herrera wrote:
> Pruning is going to take place on next vacuum anyway, isn't it?

Yes. If HOT is working well, you're not going to run vacuum very often,
though.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: HOT patch - version 15

From
"Heikki Linnakangas"
Date:
Alvaro Herrera wrote:
> Pruning is going to take place on next vacuum anyway, isn't it?
>
> AFAIUI the point is that you can't do cleanup selectively only on UPDATE
> because of pointers that your process may already have to the page,
> which is why you wanted to do it on every heap_fetch.  I suggest it is
> optimization which can be left for a later version (8.4?)

Wait, did you mean that we don't do pruning at all in 8.3? That's a bad
idea; the main purpose of HOT is to be able to vacuum a page at a time,
reducing the need to do regular vacuums.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: HOT patch - version 15

From
Florian Pflug
Date:
Tom Lane wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> Tom Lane wrote:
>>>> Another real problem with doing pruning only in UPDATE path is that
>>>> we may end up with long HOT chains if the page does not receive a
>>>> UPDATE, after many consecutive HOT updates.
>>> How is that, if the same number of prune attempts would occur?
>
>> It wouldn't. To avoid the long HOT chains, we want to prune more often
>> than what's needed to just make room for updates.
>
> I don't follow.  HOT chains can only get longer by updates.

I believe the case pruning-on-select protects against is
where you'd have a period of heavy updating and a long-running
transaction, followed by a period of (mostly) read-only access.

If you didn't manage to prune most chains during the first phase
due to a rather old OldestXmin, the following selects will all
spend cycles on following the long HOT chains.

Still, it sounds like the real reason is more the technical
difficulties of doing it on update, rather than preventing
that scenario.

A rather wild idea: Could we maybe pin individual tuples, instead
of the whole page? Then we'd just have to be careful not to move
those when pruning during the update.

greetings, Florian Pflug

Re: HOT patch - version 15

From
"Heikki Linnakangas"
Date:
Greg and I just had a little chat, and came up with this scheme:

1. On heap_update, if the page is full, you prune the page (but don't
defragment it, because you can't get the vacuum lock). That hopefully
leaves behind a large enough gap to put the new tuple in. Insert the new
tuple in the gap, and mark the page as Fragmented. Also make a note in
some backend-private data structure that we've left that page in a
fragmented state.

2. In UnpinBuffer, if the pin count falls to zero and it's a page we've
pruned (check the backend-private data structure), defragment it.

Under low contention, all the cost of pruning will be carried by
transactions that do updates. Whether we need to prune in heap_fetch in
addition to that to keep the chains short, I don't know.

One problem with this scheme is that when the page gets full, you have
to hope that you can create a wide enough gap by pruning. It will work
well with fixed-size tuples, but not so well otherwise.
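The two steps of the scheme might be sketched like this (all names are invented; real defragmentation, WAL logging and locking are elided):

```c
#include <assert.h>

#define MAX_TRACKED 8

/* Step 1's bookkeeping: a backend-private list of pages left in a
 * "pruned but fragmented" state. */
static int fragmented_pages[MAX_TRACKED];
static int n_fragmented = 0;
static int defrag_count = 0;   /* how many pages got defragmented */

static void mark_fragmented(int page)
{
    if (n_fragmented < MAX_TRACKED)
        fragmented_pages[n_fragmented++] = page;
}

/* Step 2: when the pin count falls to zero and we had pruned the page,
 * defragment it now -- no local pointers into the page can remain. */
static void unpin_page(int page, int *pin_count)
{
    if (--*pin_count > 0)
        return;                /* someone (maybe us) still holds a pin */
    for (int i = 0; i < n_fragmented; i++) {
        if (fragmented_pages[i] == page) {
            defrag_count++;    /* stand-in for the real defragmentation */
            fragmented_pages[i] = fragmented_pages[--n_fragmented];
            break;
        }
    }
}
```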

Hmm. I wonder if we could prune/defragment in bgwriter?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: HOT patch - version 15

From
"Heikki Linnakangas"
Date:
Florian Pflug wrote:
> Tom Lane wrote:
> A rather wild idea: Could we maybe pin individual tuples, instead
> of the whole page? Then we'd just have to be careful not to move
> those when pruning during the update.

Uh, that's a huge change. We might be able to keep track of tuples that
we have references to in our own backend, but even that seems like a
non-starter to me.

Yet another idea is to add an "intent" argument (or somehow pass it out
of line) to heap_fetch. You would prune the page in heap_fetch, but only
if you're fetching for the purpose of updating the tuple.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: HOT patch - version 15

From
Simon Riggs
Date:
On Wed, 2007-09-05 at 18:15 -0400, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Tom Lane wrote:
> >> Uh, why would any of this code at all execute during a pure lookup?
>
> > No idea.  It seems an index lookup tries to prune a heap chain, and he
> > was asking if it should look at other chains on the page;  I said not.
> > Whether the index lookup should prune the heap chain is another issue.
>
> ISTM the only time we should be doing HOT cleanup work is when we are
> trying to insert a new tuple on that page --- and maybe even only when
> we try and it doesn't fit straight off.  Otherwise you're pushing
> maintenance work into foreground query pathways, which is exactly where
> I don't think we should be headed.

This is one of the many heuristics for tuning HOT.

We can choose to do this immediately, when the page fills or somewhere
in between.

The rationale was: In other circumstances we dirty a page to set a hint
bit on the first time we see the need for it. That avoids the clog
lookup on all later reads of that tuple. By analogy, we do the same
thing with HOT: when we see a tuple chain that can be pruned, we prune
it. If we dirty the page for the hint bit it seems like we may as well
dirty the page some more and prune it also at the same time.

We could begin pruning only when the chain is N long. Currently N=2, but
we could set N=3+ easily enough. There's no code yet to actually count
that, but we can do that easily as we do each lookup. We should also be
able to remember the visibility result for each tuple in the chain to
decide whether pruning will be effective or not.
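A decision function for this heuristic might look like the following sketch (the dirty-page shortcut mirrors the hint-bit analogy above; the names and the exact rule are illustrative, not from the patch):

```c
#include <assert.h>
#include <stdbool.h>

/* Invented decision function: prune a chain when it has reached N
 * links, or opportunistically as soon as it is prunable if the page
 * is already dirty (we are paying for the write anyway, exactly as
 * with hint bits). */
static bool should_prune(int chain_len, int threshold_n,
                         bool page_dirty, bool chain_prunable)
{
    if (!chain_prunable)
        return false;          /* nothing removable: never bother */
    if (page_dirty)
        return true;           /* page will be written out regardless */
    return chain_len >= threshold_n;
}
```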

I would say that if the buffer is already dirty and the chain is
prunable, we should prune it at the first possible chance.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: HOT patch - version 15

From
Alvaro Herrera
Date:
Heikki Linnakangas escribió:

> Hmm. I wonder if we could prune/defragment in bgwriter?

That would be best, if at all possible.  You can prune without accessing
anything outside the page itself, right?

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: HOT patch - version 15

From
"Heikki Linnakangas"
Date:
Alvaro Herrera wrote:
> Heikki Linnakangas escribió:
>
>> Hmm. I wonder if we could prune/defragment in bgwriter?
>
> That would be best, if at all possible.  You can prune without accessing
> anything outside the page itself, right?

Yes, though you do need to have an oldest xmin to determine which tuples
are dead, and the removed tuples need to be WAL-logged.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: HOT patch - version 15

From
Alvaro Herrera
Date:
Heikki Linnakangas escribió:

> Every time you do an index lookup, you have to follow that chain from 1
> to 5. Which gets expensive. That's why we want to prune the chain to
> make it shorter, even if there's still plenty of room on the page for
> updates, and even if we're never going to update it again.
>
> That's the theory. I don't know *how* expensive following the chain
> really is, compared to index scans skipping over entries marked as
> killed. I don't remember seeing any measurements of that.

I suggest you let that be.  Chain following can't be *that* expensive --
and if it is, then pruning would be a lot more expensive, so it's not a
thing you want to be doing for every heap_fetch.

Pruning is going to take place on next vacuum anyway, isn't it?

AFAIUI the point is that you can't do cleanup selectively only on UPDATE
because of pointers that your process may already have to the page,
which is why you wanted to do it on every heap_fetch.  I suggest this is an
optimization which can be left for a later version (8.4?).

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: HOT patch - version 15

From
Simon Riggs
Date:
On Thu, 2007-09-06 at 16:15 +0100, Heikki Linnakangas wrote:
> Tom Lane wrote:
> > "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> >> Imagine a page with just one tuple on it:
> >
> >> 1
> >
> >> After a bunch of updates, it looks like this
> >
> >> 1 -> 2 -> 3 -> 4 -> 5
> >
> >> 1 is the tuple the indexes point to, others are heap only.
> >
> > But if we were attempting prune at every update, at least some of the
> > later updates should have managed to prune.  "2" should certainly be
> > gone at this point, unless there's lots of contention for the page,
> > in which case pruning at select won't make things better.
>
> Oh I see. Yeah, hopefully you don't end up with long chains too often.
> You don't need contention for it, though, a long-running transaction
> will cause it. Or a transaction that inserts a tuple and updates it
> multiple times in the same transaction.

Yes, the main point is that an UPDATE doesn't always allow you to prune.
If it did, that would be the right place. Since it doesn't, the best
place to prune is surely the first time we see we *can* prune.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: HOT patch - version 15

From
"Heikki Linnakangas"
Date:
Simon Riggs wrote:
> Yes, the main point is that an UPDATE doesn't always allow you to prune.

You can always remove dead HOT tuples in heap_update. But you can never
defragment the page at that point.

> If it did, that would be the right place. Since it doesn't, the best
> place to prune is surely the first time we see we *can* prune.

Not necessarily. Pruning is expensive, you need to scan all tuples on
the page and write a WAL record. And defragment the page if you consider
that part of pruning. You don't want to do it too aggressively.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: HOT patch - version 15

From
Tom Lane
Date:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> I don't follow.  HOT chains can only get longer by updates.

> Yes. We don't seem to be on the same wavelength...

> Imagine a page with just one tuple on it:

> 1

> After a bunch of updates, it looks like this

> 1 -> 2 -> 3 -> 4 -> 5

> 1 is the tuple the indexes point to, others are heap only.

But if we were attempting prune at every update, at least some of the
later updates should have managed to prune.  "2" should certainly be
gone at this point, unless there's lots of contention for the page,
in which case pruning at select won't make things better.

            regards, tom lane

Re: HOT patch - version 15

From
Alvaro Herrera
Date:
Heikki Linnakangas escribió:
> Alvaro Herrera wrote:
> > Heikki Linnakangas escribió:
> >
> >> Hmm. I wonder if we could prune/defragment in bgwriter?
> >
> > That would be best, if at all possible.  You can prune without accessing
> > anything outside the page itself, right?
>
> Yes, though you do need to have an oldest xmin to determine which tuples
> are dead,

Hmm, well, I think that with the VXID patch it might actually be
possible to get an Xmin without being in a transaction.

> and the removed tuples need to be WAL-logged.

Is this a problem for bgwriter?

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: HOT patch - version 15

From
Tom Lane
Date:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> Alvaro Herrera wrote:
>> That would be best, if at all possible.  You can prune without accessing
>> anything outside the page itself, right?

> Yes, though you do need to have an oldest xmin to determine which tuples
> are dead, and the removed tuples need to be WAL-logged.

Hmm ... also check commit status (pg_clog access).  I've always thought
that those things couldn't be done in bgwriter, because it wasn't
running real transactions, but right at the moment I can't see that
there is any obstacle.  Perhaps that meme dates from a time when
GetOldestXmin didn't work outside a transaction?

Pushing the prune/defrag work into bgwriter would certainly fix my
worries about having it in foreground query paths.

(And screw up all Greg Smith's work on how much bgwriter should run...
or maybe this should be a third independent task within bgwriter?)

            regards, tom lane

Re: HOT patch - version 15

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Heikki Linnakangas escribió:
>> and the removed tuples need to be WAL-logged.

> Is this a problem for bgwriter?

bgwriter does checkpoints, which emit WAL records; and they also call
GetOldestXmin.  So we know a-priori that those two things work.  There
may be some other showstopper but I'm not sure what it would be.

            regards, tom lane

Re: HOT patch - version 15

From
Simon Riggs
Date:
On Thu, 2007-09-06 at 19:05 +0100, Heikki Linnakangas wrote:

> Pruning is expensive, you need to scan all tuples on
> the page and write a WAL record. And defragment the page if you consider
> that part of pruning. You don't want to do it too aggressively.

OK, so my understanding of the meaning of the word has changed since we
decided on that initial behaviour. So yes, not too aggressively.

Elsewhere on this thread, I said:

On Thu, 2007-09-06 at 18:05 +0100, Simon Riggs wrote:

> We could begin pruning only when the chain is N long. Currently N=2, but
> we could set N=3+ easily enough. There's no code yet to actually count
> that, but we can do that easily as we do each lookup. We should also be
> able to remember the visibility result for each tuple in the chain to
> decide whether pruning will be effective or not.
>
> I would say that if the buffer is already dirty and the chain is
> prunable, we should prune it at the first possible chance.

If we defer pruning until the page is full, worst case we could end
up with a chain ~240 tuples long, which might need to be scanned
repeatedly. That won't happen often, but it would be better to prune
whenever we hit one of these conditions
- when the block is full
- when we reach the 16th tuple in a chain

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com
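
The two trigger conditions Simon lists above ("block is full", "16th
tuple in a chain") could be sketched as a simple predicate. The names
and thresholds below are illustrative stand-ins, not code from the patch:

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative threshold for the "16th tuple in a chain" trigger. */
#define PRUNE_CHAIN_THRESHOLD 16

/*
 * Decide whether a chain walk should trigger pruning: either the page
 * can no longer fit an average-sized tuple ("the block is full"), or
 * the chain we just walked has grown past the threshold.
 */
static bool
should_prune(size_t free_space, size_t avg_tuple_size, int chain_length)
{
    if (free_space < avg_tuple_size)
        return true;    /* block effectively full */
    if (chain_length >= PRUNE_CHAIN_THRESHOLD)
        return true;    /* chain long enough to be worth shortening */
    return false;
}
```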


Re: HOT patch - version 15

From
Tom Lane
Date:
I wrote:
> Hmm ... also check commit status (pg_clog access).  I've always thought
> that those things couldn't be done in bgwriter, because it wasn't
> running real transactions, but right at the moment I can't see that
> there is any obstacle.  Perhaps that meme dates from a time when
> GetOldestXmin didn't work outside a transaction?

On further thought, I think what I'm remembering is that full-scale
VACUUM can't work inside bgwriter, because you need to take table-level
locks and worry about index vs heap consistency.  But as long as HOT
pruning involves only a single heap page I see no need for it to take
more than the buffer-level locks on that page.

            regards, tom lane

Re: HOT patch - version 15

From
"Heikki Linnakangas"
Date:
Tom Lane wrote:
> I wrote:
>> Hmm ... also check commit status (pg_clog access).  I've always thought
>> that those things couldn't be done in bgwriter, because it wasn't
>> running real transactions, but right at the moment I can't see that
>> there is any obstacle.  Perhaps that meme dates from a time when
>> GetOldestXmin didn't work outside a transaction?
>
> On further thought, I think what I'm remembering is that full-scale
> VACUUM can't work inside bgwriter, because you need to take table-level
> locks and worry about index vs heap consistency.  But as long as HOT
> pruning involves only a single heap page I see no need for it to take
> more than the buffer-level locks on that page.

One complication is that you need to identify heap pages; you don't want
to start pruning index pages. We could mark the buffer with a new
Prunable-flag when it's updated, to signal the bgwriter that it can
prune it.

I wonder if pruning in bgwriter only is enough to make HOT work
efficiently. On a very frequently updated page, bgwriter will have to
work hard to keep up.

Another problematic scenario is a big table and small shared_buffers
(like on Windows). Page can easily fall out of the buffer cache, before
bgwriter has the chance to prune anything on it.

But if it works reasonably well in typical scenarios, we can go with
that for 8.3 and improve later.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: HOT patch - version 15

From
Greg Smith
Date:
On Thu, 6 Sep 2007, Heikki Linnakangas wrote:

> I wonder if pruning in bgwriter only is enough to make HOT work
> efficiently. On a very frequently updated page, bgwriter will have to
> work hard to keep up.

One of the goals for how I rebuilt the just-in-time BGW was to try and
make smaller values for bgwriter_delay feasible.  None of the tunables
*should* have to be adjusted if you need to run the bgwriter much more
often to support some new HOT-related activity in there as well; this is
actually the next area I want to do a bit more testing on myself.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: HOT patch - version 15

From
Bruce Momjian
Date:
Heikki Linnakangas wrote:
> Tom Lane wrote:
> > I wrote:
> >> Hmm ... also check commit status (pg_clog access).  I've always thought
> >> that those things couldn't be done in bgwriter, because it wasn't
> >> running real transactions, but right at the moment I can't see that
> >> there is any obstacle.  Perhaps that meme dates from a time when
> >> GetOldestXmin didn't work outside a transaction?
> >
> > On further thought, I think what I'm remembering is that full-scale
> > VACUUM can't work inside bgwriter, because you need to take table-level
> > locks and worry about index vs heap consistency.  But as long as HOT
> > pruning involves only a single heap page I see no need for it to take
> > more than the buffer-level locks on that page.
>
> One complication is that you need to identify heap pages; you don't want
> to start pruning index pages. We could mark the buffer with a new
> Prunable-flag when it's updated, to signal the bgwriter that it can
> prune it.
>
> I wonder if pruning in bgwriter only is enough to make HOT work
> efficiently. On a very frequently updated page, bgwriter will have to
> work hard to keep up.
>
> Another problematic scenario is a big table and small shared_buffers
> (like on Windows). Page can easily fall out of the buffer cache, before
> bgwriter has the chance to prune anything on it.
>
> But if it works reasonably well in typical scenarios, we can go with
> that for 8.3 and improve later.

I don't like pushing pruning to the bgwriter.  I am afraid we get into
the same problem with autovacuum where polling via the bgwriter will
frequently not happen just-in-time, when we need it.

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: HOT patch - version 15

From
Bruce Momjian
Date:
Simon Riggs wrote:
> > We could begin pruning only when the chain is N long. Currently N=2, but
> > we could set N=3+ easily enough. There's no code yet to actually count
> > that, but we can do that easily as we do each lookup. We should also be
> > able to remember the visibility result for each tuple in the chain to
> > decide whether pruning will be effective or not.
> >
> > I would say that if the buffer is already dirty and the chain is
> > prunable, we should prune it at the first possible chance.
>
> If we defer pruning until the page is full, worst case we could end
> up with a chain ~240 tuples long, which might need to be scanned
> repeatedly. That won't happen often, but it would be better to prune
> whenever we hit one of these conditions
> - when the block is full
> - when we reach the 16th tuple in a chain

I don't see how following a HOT chain is any slower than following an
UPDATE chain like we do now.  In fact, what we do now is to have an
index row for every row version, while with HOT we will have one index
entry and all the row versions on the same page.  That has to be better
than what we have now, even without pruning.

Let me define two terms:

    prune -- modify ctid pointers to reduce the length of one HOT chain

    defragment -- reduce the length of all HOT chains on the page
        and collect free space

I think we all agree defragmenting should only happen when we need free
space on the page.  But it seems by the time we _know_ we are going to
be adding to the page we can't defragment.  I am thinking we should go in
the direction of passing a boolean down into the routines so they know
to defragment before they get into a case where they can't.

As for pruning, I am thinking Simon's idea of just saying don't prune
unless the chain is longer than X is correct.  If there are a lot of
updates to the page the page will fill and the chain pruned during a
defragment, and if not the chains are guaranteed to be shorter than X.

Ideally you would want to say that if the chain has stayed at length Y
for a certain amount of time (meaning it isn't growing and defragmentation
hasn't happened in a while) you would start to prune more aggressively,
but I see no easy way to track that.

Also, why all the talk of index lookups doing pruning?  Can't a
sequential scan do pruning?

Would someone tell us exactly when pruning and defragmenting happens in
the current version of the patch?  If we don't nail this issue down soon
PostgreSQL 8.3 is going to sail without this patch.

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: HOT patch - version 15

From
Simon Riggs
Date:
On Fri, 2007-09-07 at 22:08 -0400, Bruce Momjian wrote:
> Simon Riggs wrote:
> > > We could begin pruning only when the chain is N long. Currently N=2, but
> > > we could set N=3+ easily enough. There's no code yet to actually count
> > > that, but we can do that easily as we do each lookup. We should also be
> > > able to remember the visibility result for each tuple in the chain to
> > > decide whether pruning will be effective or not.
> > >
> > > I would say that if the buffer is already dirty and the chain is
> > > prunable, we should prune it at the first possible chance.
> >
> > If we defer pruning until the page is full, worst case we could end
> > up with a chain ~240 tuples long, which might need to be scanned
> > repeatedly. That won't happen often, but it would be better to prune
> > whenever we hit one of these conditions
> > - when the block is full
> > - when we reach the 16th tuple in a chain

Thanks for defining terms.

Just to answer a few of your points:

> I don't see how following a HOT chain is any slower than following an
> UPDATE chain like we do now.

The speed/cost is the same. The issue is *when* we do this. Normal
SELECTs will follow the chain each time we do an index lookup.

So if our read/write ratio is 1000:1, then we will waste many cycles and
yet the chain is never pruned on SELECT. So there really must be some
point at which we prune on a SELECT.

Perhaps we could say if Xid % 16 == 0 then we prune, i.e. every 16th
transaction walking the chain will prune it.

> Also, why all the talk of index lookups doing pruning?  Can't a
> sequential scan do pruning?

The SeqScan doesn't follow the chains, so can't easily determine whether
there are any long chains that need pruning. It's only when we come in
via an index that we need to walk the chain to the latest tuple version
and in that case we learn how long the chain is.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com
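
The every-16th-transaction heuristic Simon suggests ("if Xid % 16 == 0
then we prune") is trivial to express. `TransactionId` here is a
stand-in typedef for illustration, not PostgreSQL's own:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;  /* stand-in for PostgreSQL's TransactionId */

/*
 * Let roughly one in sixteen chain-walking transactions attempt a
 * prune, so that SELECT-only workloads still prune occasionally
 * without paying the cost on every scan.
 */
static bool
prune_on_this_xid(TransactionId xid)
{
    return (xid % 16) == 0;
}
```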


Re: HOT patch - version 15

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> On Fri, 2007-09-07 at 22:08 -0400, Bruce Momjian wrote:
>> I don't see how following a HOT chain is any slower than following an
>> UPDATE chain like we do now.

> The speed/cost is the same. The issue is *when* we do this. Normal
> SELECTs will follow the chain each time we do an index lookup.

> So if our read/write ratio is 1000:1, then we will waste many cycles and
> yet the chain is never pruned on SELECT. So there really must be some
> point at which we prune on a SELECT.

This argument is too weak to stand up by itself.

Once a tuple is marked as committed dead (which it certainly must be
before you could even consider pruning it) the cost of skipping over
it in a HOT-chain search is going to be very very small.  You have
already paid all the costs of searching the index, fetching and
pinning the heap page, etc.  The incremental cost of rejecting one
more HOT-chain entry is the cost to
    * fetch the line pointer;
    * check HEAP_XMIN_COMMITTED and see that it's set;
    * check xmin and see that it's valid per snapshot;
    * check HEAP_XMAX_COMMITTED and see that it's set;
    * check xmax and see that it's valid per snapshot;
    * fetch t_ctid to chain to next tuple.
Now the snapshot checks could be relatively expensive --- but in any
situation where a tuple is potentially prunable, they'll reduce to *one*
comparison: is xid less than snapshot's xmin?  So you've got two or at
most three shared-memory cache line fetches, no write to shared memory,
no locks taken or released, and a trivial number of CPU operations.

Compared to what it currently takes to check the same tuple (a separate
index entry fetch and traversal to the heap page), this is already an
enormous performance improvement.  Insisting that we add complexity and
pay performance penalties elsewhere to reduce this cost still more does
not strike me as a sane tradeoff --- certainly not if it's being made
without conclusive performance measurements to back it up.

            regards, tom lane
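
The cheap rejection path Tom enumerates could be sketched as below. The
struct fields are simplified stand-ins for the hint bits and snapshot
fields he names, not the real HeapTupleHeader or SnapshotData layout:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Simplified stand-ins; not the real HeapTupleHeader / SnapshotData. */
typedef struct MiniTuple
{
    TransactionId xmin;
    TransactionId xmax;
    bool        xmin_committed;     /* HEAP_XMIN_COMMITTED hint */
    bool        xmax_committed;     /* HEAP_XMAX_COMMITTED hint */
} MiniTuple;

typedef struct MiniSnapshot
{
    TransactionId xmin;             /* oldest xid considered running */
} MiniSnapshot;

/*
 * Once both hint bits are already set, rejecting a dead chain member
 * reduces to two comparisons against the snapshot's xmin: no locks,
 * no writes to shared memory, a handful of cache line fetches.
 */
static bool
tuple_is_skippable_dead(const MiniTuple *tup, const MiniSnapshot *snap)
{
    return tup->xmin_committed && tup->xmax_committed &&
           tup->xmin < snap->xmin && tup->xmax < snap->xmin;
}
```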

Re: HOT patch - version 15

From
"Heikki Linnakangas"
Date:
Tom Lane wrote:
> Compared to what it currently takes to check the same tuple (a separate
> index entry fetch and traversal to the heap page), this is already an
> enormous performance improvement.

Though keep in mind that we kill index tuples as soon as they're deemed
to be dead. Nevertheless, I'm not very worried about the cost of
following the chain either. But that's something we can quite easily
measure if we want to.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: HOT patch - version 15

From
Florian Pflug
Date:
Heikki Linnakangas wrote:
> Tom Lane wrote:
>> Compared to what it currently takes to check the same tuple (a separate
>> index entry fetch and traversal to the heap page), this is already an
>> enormous performance improvement.
>
> Though keep in mind that we kill index tuples as soon as they're deemed
> to be dead. Nevertheless, I'm not very worried about the cost of
> following the chain either. But that's something we can quite easily
> measure if we want to.

I'm confused now. I thought that pruning would be enough to shorten HOT-Chains -
because the root line pointer afterwards points directly to the first live
tuple. But we can *prune* (without actually defragmenting) without holding
a VACUUM-strength lock, right? Or did I get that wrong?

greetings, Florian Pflug



Re: HOT patch - version 15

From
"Heikki Linnakangas"
Date:
Florian Pflug wrote:
> Heikki Linnakangas wrote:
>> Tom Lane wrote:
>>> Compared to what it currently takes to check the same tuple (a separate
>>> index entry fetch and traversal to the heap page), this is already an
>>> enormous performance improvement.
>>
>> Though keep in mind that we kill index tuples as soon as they're deemed
>> to be dead. Nevertheless, I'm not very worried about the cost of
>> following the chain either. But that's something we can quite easily
>> measure if we want to.
>
> I'm confused now. I thought that pruning would be enough to shorten
> HOT-Chains -
> because the root line pointer afterwards points directly to the first live
> tuple. But we can *prune* (without actually defragmenting) without holding
> a VACUUM-strength lock, right? Or did I get that wrong?

Yes, that's right. You don't seem to be confused at all.

Tom argued that following the tuple chain is cheap enough, and might
even be cheaper than what we have now, that we don't need to prune just
for the purpose of keeping the chains short. To which I pointed out that
currently, without HOT, we mark index tuples pointing to dead tuples as
killed to avoid following them in the future, so HOT without pruning is
not cheaper than what we have now.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: HOT patch - version 15

From
Simon Riggs
Date:
On Sat, 2007-09-08 at 17:46 +0100, Heikki Linnakangas wrote:
> Tom Lane wrote:
> > Compared to what it currently takes to check the same tuple (a separate
> > index entry fetch and traversal to the heap page), this is already an
> > enormous performance improvement.
>
> Though keep in mind that we kill index tuples as soon as they're deemed
> to be dead. Nevertheless, I'm not very worried about the cost of
> following the chain either. But that's something we can quite easily
> measure if we want to.

OK, I'm good about it if you guys are.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: HOT patch - version 15

From
Tom Lane
Date:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> Tom argued that following the tuple chain is cheap enough, and might
> even be cheaper than what we have now, that we don't need to prune just
> for the purpose of keeping the chains short. To which I pointed out that
> currently, without HOT, we mark index tuples pointing to dead tuples as
> killed to avoid following them in the future, so HOT without pruning is
> not cheaper than what we have now.

That hack only works in plain indexscans, though, not bitmapped scans.
Anyway, I remain unconvinced that the chains would normally get very
long in the first place, if we could prune when updating.

The we-already-pinned-the-page problem is a bit nasty but may not be
insurmountable.

            regards, tom lane

Re: HOT patch - version 15

From
Florian Pflug
Date:
Heikki Linnakangas wrote:
> Florian Pflug wrote:
>> Heikki Linnakangas wrote:
>>> Tom Lane wrote:
>>>> Compared to what it currently takes to check the same tuple (a separate
>>>> index entry fetch and traversal to the heap page), this is already an
>>>> enormous performance improvement.
>>> Though keep in mind that we kill index tuples as soon as they're deemed
>>> to be dead. Nevertheless, I'm not very worried about the cost of
>>> following the chain either. But that's something we can quite easily
>>> measure if we want to.
>> I'm confused now. I thought that pruning would be enough to shorten
>> HOT-Chains -
>> because the root line pointer afterwards points directly to the first live
>> tuple. But we can *prune* (without actually defragmenting) without holding
>> a VACUUM-strength lock, right? Or did I get that wrong?
>
> Yes, that's right. You don't seem to be confused at all.
>
> Tom argued that following the tuple chain is cheap enough, and might
> even be cheaper than what we have now, that we don't need to prune just
> for the purpose of keeping the chains short. To which I pointed out that
> currently, without HOT, we mark index tuples pointing to dead tuples as
> killed to avoid following them in the future, so HOT without pruning is
> not cheaper than what we have now.

Hm.. In that case, couldn't we prune the chain while we follow it, with
nearly zero overhead? It seems that we'd just have to always update
the root line pointer / tuples to point at the first RECENTLY_DEAD tuple in the
chain, together with maybe compacting the orphaned tuples to just a line pointer.

I'm not sure how HOT handles this currently, but it seems that this sort of
minimal pruning shouldn't need WAL.

greetings, Florian Pflug

Re: HOT patch - version 15

From
Bruce Momjian
Date:
Heikki Linnakangas wrote:
> Florian Pflug wrote:
> > Heikki Linnakangas wrote:
> >> Tom Lane wrote:
> >>> Compared to what it currently takes to check the same tuple (a separate
> >>> index entry fetch and traversal to the heap page), this is already an
> >>> enormous performance improvement.
> >>
> >> Though keep in mind that we kill index tuples as soon as they're deemed
> >> to be dead. Nevertheless, I'm not very worried about the cost of
> >> following the chain either. But that's something we can quite easily
> >> measure if we want to.
> >
> > I'm confused now. I thought that pruning would be enough to shorten
> > HOT-Chains -
> > because the root line pointer afterwards points directly to the first live
> > tuple. But we can *prune* (without actually defragmenting) without holding
> > a VACUUM-strength lock, right? Or did I get that wrong?
>
> Yes, that's right. You don't seem to be confused at all.
>
> Tom argued that following the tuple chain is cheap enough, and might
> even be cheaper than what we have now, that we don't need to prune just
> for the purpose of keeping the chains short. To which I pointed out that
> currently, without HOT, we mark index tuples pointing to dead tuples as
> killed to avoid following them in the future, so HOT without pruning is
> not cheaper than what we have now.

The central idea I now understand is that pruning only shrinks HOT
chains.  It does not allow reuse of space.  Only defragmentation does
that, and that is triggered when the page is almost full.

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: HOT patch - version 15

From
Bruce Momjian
Date:
Simon Riggs wrote:
> > > If we defer pruning until the page is full, worst case we could end
> > > up with a chain ~240 tuples long, which might need to be scanned
> > > repeatedly. That won't happen often, but it would be better to prune
> > > whenever we hit one of these conditions
> > > - when the block is full
> > > - when we reach the 16th tuple in a chain
>
> Thanks for defining terms.
>
> Just to answer a few of your points:
>
> > I don't see how following a HOT chain is any slower than following an
> > UPDATE chain like we do now.
>
> The speed/cost is the same. The issue is *when* we do this. Normal
> SELECTs will follow the chain each time we do an index lookup.

But a sequential scan still follows the chain, and that isn't going to
prune.  Why are we more worried about index chain lookups than
sequential scan lookups?

As I understand it, the pruning is only to reduce the chains.  It
doesn't allow space reuse.  I have updated my README to reflect that:

    ftp://momjian.us/pub/postgresql/mypatches/README.HOT

> So if our read/write ratio is 1000:1, then we will waste many cycles and
> yet the chain is never pruned on SELECT. So there really must be some
> point at which we prune on a SELECT.
>
> Perhaps we could say if Xid % 16 == 0 then we prune, i.e. every 16th
> transaction walking the chain will prune it.
>
> > Also, why all the talk of index lookups doing pruning?  Can't a
> > sequential scan do pruning?
>
> The SeqScan doesn't follow the chains, so can't easily determine whether
> there are any long chains that need pruning. It's only when we come in
> via an index that we need to walk the chain to the latest tuple version
> and in that case we learn how long the chain is.

Uh, as I understand it, every access follows the ctid, so why doesn't a
sequential scan follow the chain?

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: HOT patch - version 15

From
Florian Pflug
Date:
Bruce Momjian wrote:
> Heikki Linnakangas wrote:
>> Florian Pflug wrote:
>>> Heikki Linnakangas wrote:
>>>> Tom Lane wrote:
>>>>> Compared to what it currently takes to check the same tuple (a separate
>>>>> index entry fetch and traversal to the heap page), this is already an
>>>>> enormous performance improvement.
>>>> Though keep in mind that we kill index tuples as soon as they're deemed
>>>> to be dead. Nevertheless, I'm not very worried about the cost of
>>>> following the chain either. But that's something we can quite easily
>>>> measure if we want to.
>>> I'm confused now. I thought that pruning would be enough to shorten
>>> HOT-Chains -
>>> because the root line pointer afterwards points directly to the first live
>>> tuple. But we can *prune* (without actually defragmenting) without holding
>>> a VACUUM-strength lock, right? Or did I get that wrong?
>> Yes, that's right. You don't seem to be confused at all.
>>
>> Tom argued that following the tuple chain is cheap enough, and might
>> even be cheaper than what we have now, that we don't need to prune just
>> for the purpose of keeping the chains short. To which I pointed out that
>> currently, without HOT, we mark index tuples pointing to dead tuples as
>> killed to avoid following them in the future, so HOT without pruning is
>> not cheaper than what we have now.
>
> The central idea I now understand is that pruning only shrinks HOT
> chains.  It does not allow reuse of space.  Only defragmentation does
> that, and that is triggered when the page is almost full.

I was under the impression that pruning *does* free the space occupied
by DEAD HOT-Tuples (minus the size of a redirection line pointer). It
just doesn't defragment, so how much of that freed space you can actually
use to store new tuples is another question...

Or is that wrong - do we need a VACUUM-strength lock to turn a tuple
into a redirecting line pointer, and can pruning therefore only really
free *no* space?

greetings, Florian Pflug

Re: HOT patch - version 15

From
Tom Lane
Date:
Florian Pflug <fgp.phlo.org@gmail.com> writes:
> I was under the impression that pruning *does* free the space occupied
> by DEAD HOT-Tuples (minus the size of a redirection line pointer). It
> just doesn't defragment, so how much of that freed space you can actually
> use to store new tuples is another question...

None, unless the freed tuple happens to be directly adjacent to the
pd_lower --- pd_upper hole, or unless the patch has done a lot more to
the page-free-space-management code than I thought.  We have never
looked for free space anywhere except the main "hole".

            regards, tom lane
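
The point Tom makes about the single "hole" can be illustrated with a
simplified version of the page free-space calculation. The struct below
is a sketch, not the real PageHeaderData:

```c
#include <stddef.h>

/*
 * Single-hole model: new tuples are only allocated from the gap
 * between pd_lower (end of the line pointer array) and pd_upper
 * (start of tuple data).  Space freed mid-page by a pruned tuple is
 * invisible to this calculation until the page is defragmented.
 * Simplified stand-in for PageHeaderData.
 */
typedef struct MiniPageHeader
{
    size_t pd_lower;    /* offset to start of free space */
    size_t pd_upper;    /* offset to end of free space */
} MiniPageHeader;

static size_t
page_get_free_space(const MiniPageHeader *page)
{
    if (page->pd_upper <= page->pd_lower)
        return 0;
    return page->pd_upper - page->pd_lower;
}
```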

Re: HOT patch - version 15

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Simon Riggs wrote:
>> The speed/cost is the same. The issue is *when* we do this. Normal
>> SELECTs will follow the chain each time we do an index lookup.

> But a sequential scan still follows the chain, and that isn't going to
> prune.  Why are we more worried about index chain lookups than
> sequential scan lookups?

No, a seqscan *doesn't* follow HOT chains.  It just looks at every live
line pointer on the page, sequentially.  An indexscan will have to
follow HOT chains, or it will miss tuples it needs to return.

>> The SeqScan doesn't follow the chains, so can't easily determine whether
> there are any long chains that need pruning. It's only when we come in
>> via an index that we need to walk the chain to the latest tuple version
>> and in that case we learn how long the chain is.

> Uh, as I understand it, every access follows the ctid, so why doesn't a
> sequential scan follow the chain?

Up to now, normal searches (whether seq or index) have never paid the
slightest attention to t_ctid.  That's only used when a READ COMMITTED
update realizes it's trying to update a non-latest version of a row.

            regards, tom lane
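
The distinction Tom draws can be sketched with a toy page, where `next`
stands in for the t_ctid link; everything here is illustrative:

```c
#include <stdbool.h>

#define NO_NEXT (-1)

/* Toy tuple: "next" stands in for the t_ctid chain link. */
typedef struct ToyTuple
{
    int  next;      /* index of next chain member, or NO_NEXT */
    bool live;      /* visible to the scan? */
} ToyTuple;

/* An index scan enters at the chain root and must walk the t_ctid
 * links to reach the visible version, so it learns the chain length. */
static int
chain_length_from_root(const ToyTuple *tuples, int root)
{
    int len = 0;
    for (int i = root; i != NO_NEXT; i = tuples[i].next)
        len++;
    return len;
}

/* A seqscan never follows t_ctid: it visits every live line pointer
 * in page order, and so never learns how long any chain is. */
static int
seqscan_visible_count(const ToyTuple *tuples, int ntuples)
{
    int count = 0;
    for (int i = 0; i < ntuples; i++)
        if (tuples[i].live)
            count++;
    return count;
}
```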

Re: HOT patch - version 15

From
Bruce Momjian
Date:
Tom Lane wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> > Tom argued that following the tuple chain is cheap enough, and might
> > even be cheaper than what we have now, that we don't need to prune just
> > for the purpose of keeping the chains short. To which I pointed out that
> > currently, without HOT, we mark index tuples pointing to dead tuples as
> > killed to avoid following them in the future, so HOT without pruning is
> > not cheaper than what we have now.
>
> That hack only works in plain indexscans, though, not bitmapped scans.
> Anyway, I remain unconvinced that the chains would normally get very
> long in the first place, if we could prune when updating.
>
> The we-already-pinned-the-page problem is a bit nasty but may not be
> insurmountable.

As I understand it, there are two HOT features:

    Single-chain pruning, which trims HOT chains but doesn't reuse
        the space

    Defragmentation, which prunes the entire page and reuses space
        and handles deleted rows, etc.

Defragmentation is the HOT feature we really want.  Single-chain
pruning is kind of nice because it speeds things up, but isn't
necessary.  The fact that the entire chain is on the same page makes me
think that we could just leave single-chain pruning for 8.4 if
necessary.

I think allowing the chain to be more often on the same page via
defragmentation and having a single index entry for the chain is going
to be a win, and the fact we don't have marked-dead index entries for
some of the chain isn't going to be a problem.  My guess is that the
marked-dead index entries were a win only when the chain was on several
pages, which isn't the case for HOT chains.

FYI, I saw this comment in the patch:

+   /*
+    * If the free space left in the page is less than the average FSM
+    * request size (or a percentage of it), prune all the tuples or
+    * tuple chains in the page. Since the operation requires exclusive
+    * access to the page and needs to be WAL logged, we want to do as
+    * much as possible. At the same time, since the function may be
+    * called from a critical path, we want it to be as fast as
+    * possible.
+    *
+    * Disregard the free space if PAGE_PRUNE_DEFRAG_FORCE option is set.
+    *
+    * XXX The value of 120% is an ad-hoc choice and we may want to
+    * tweak it if required.
+    *
+    * XXX The average request size for a relation is currently
+    * initialized to a small value such as 256. So for a table with
+    * large size tuples, during the initial few UPDATEs we may not prune
+    * a page even if the free space available is less than the new
+    * tuple size - resulting in unnecessary extension of the relation.
+    * Add a temporary hack to prune the page if the free space goes
+    * below a certain percentage of the block size (set to 12.5% here).
+    */

So this is how the system determines whether it should defragment the
whole page.  The defragmentation function is heap_page_prune_defrag().
The big downside of this function is that it has to take a lock just to
survey the page, and it often has to guess whether to activate, because
it has no idea whether free space is actually needed on this page or not.

In summary, I feel we have the HOT mechanics down well, but the open
issue is _when_ to activate each operation.

(Can someone time the access time for following a chain that fills an
entire page (the worst case) vs. having a single tuple on the page?)

In an ideal world, we would prune single chains only when they were long
enough to cause a performance impact, and would defragment only when a
new row will not fit on the page.  Other than those two cases, we don't
care how much dead space there is on a page.

However, there are two complexities to this.  One, we can't be sure we
can defragment when we need it because we might not get the lock, and
second, we are only going to try to put a row on a page if we are
updating a row on that page.  If the page is 90% dead but no rows are
being updated on that page no one will try to add a row to the page
because FSM thinks it is full.  That might be OK, it might not.

Another issue.  My guess is that it will take 2-3 weeks to get HOT
applied, meaning we aren't going to go to beta before October 1.

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: HOT patch - version 15

From
Simon Riggs
Date:
On Sat, 2007-09-08 at 23:13 -0400, Bruce Momjian wrote:

> Defragmentation is the HOT feature we really want.  Single-chain
> pruning is kind of nice because it speeds things up, but isn't
> necessary.  The fact that the entire chain is on the same page makes me
> think that we could just leave single-chain pruning for 8.4 if
> necessary.

Agreed

> I think allowing the chain to be more often on the same page via
> defragmentation and having a single index entry for the chain is going
> to be a win, and the fact we don't have marked-dead index entries for
> some of the chain isn't going to be a problem.  My guess is that the
> marked-dead index entries were a win only when the chain was on several
> pages, which isn't the case for HOT chains.

Agreed

> In summary, I feel we have the HOT mechanics down well, but the open
> issue is _when_ to activate each operation.

That has been the only open issue for months. The interplay of behaviour
is where HOT succeeds and fails. Some behaviours sound like they will be
good, but aren't.

> However, there are two complexities to this.  One, we can't be sure we
> can defragment when we need it because we might not get the lock, and

That is a probability call. Heikki measured that, and it's a good bet.

If we do fail to defragment, then one of the rows will migrate to another
block. There will likely be fewer backends contending for that block, and
defragmentation will then succeed. So this behaviour is self-balancing.

> second, we are only going to try to put a row on a page if we are
> updating a row on that page.  If the page is 90% dead but no rows are
> being updated on that page no one will try to add a row to the page
> because FSM thinks it is full.  That might be OK, it might not.

We must consider data row contention as well as space efficiency.

In the current system UPDATE contention is not an issue because we treat
them as INSERTs and there is a good scheme in place to deal with INSERT
contention. It would be easy to neglect UPDATE contention issues with
HOT because of this.

If we actively send more backends towards a full block, we may increase
contention which is a bad thing, and we may cause defragmentation to
fail, which would be even worse. The net result will be that some
backends go to a different block again, so it seems simpler to not send
backends towards other blocks just because they are full.

> Another issue.  My guess is that it will take 2-3 weeks to get HOT
> applied, meaning we aren't going to go to beta before October 1.

Seems reasonable.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: HOT patch - version 15

From
Bruce Momjian
Date:
Simon Riggs wrote:
> On Sat, 2007-09-08 at 23:13 -0400, Bruce Momjian wrote:
> > In summary, I feel we have the HOT mechanics down well, but the open
> > issue is _when_ to activate each operation.
>
> That has been the only open issue for months. The interplay of behaviour
> is where HOT succeeds and fails. Some behaviours sound like they will be
> good, but aren't.

Looking at the patch, it seems defragmentation is checked every time a
heap page is pinned in heapam.c by calling heap_page_prune_defrag().  If
page free space < 1.2 * average_tuple_size the page is defragmented.  It
would seem defragmentation is checked as often as possible.  Pruning
also seems to be as aggressive as possible.
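To make the heuristic concrete, here is a rough sketch in Python rather than the patch's C; the function name and constants are assumptions, but the thresholds follow the patch comment quoted earlier (120% of the average FSM request size, with a BLCKSZ/8 fallback):

```python
# Illustrative sketch only -- not the patch's code. Names are assumptions.
BLCKSZ = 8192
PRUNE_DEFRAG_FACTOR = 1.2   # the ad-hoc "120%" choice from the patch comment

def should_prune_defrag(page_free_space, avg_fsm_request, force=False):
    """Decide whether a page lookup should attempt prune + defragment."""
    if force:  # the PAGE_PRUNE_DEFRAG_FORCE behaviour: disregard free space
        return True
    if page_free_space < PRUNE_DEFRAG_FACTOR * avg_fsm_request:
        return True
    # Temporary hack from the comment: the average request size starts
    # small (e.g. 256), so also trigger below 12.5% of the block size.
    return page_free_space < BLCKSZ / 8
```

With avg_fsm_request = 256, a page with 1000 bytes free would still be defragmented via the BLCKSZ/8 fallback, which is exactly the wide-tuple scenario the comment's "temporary hack" addresses.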

I think an open question is whether defragmentation should happen _only_
when we are trying to add something to the page and it doesn't fit.
Another open question is whether pruning should happen only when the
chain gets to a certain length.

The larger question is how are we going to get these answers?

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: HOT patch - version 15

From
Simon Riggs
Date:
On Sun, 2007-09-09 at 07:55 -0400, Bruce Momjian wrote:
> Simon Riggs wrote:
> > On Sat, 2007-09-08 at 23:13 -0400, Bruce Momjian wrote:
> > > In summary, I feel we have the HOT mechanics down well, but the open
> > > issue is _when_ to activate each operation.
> >
> > That has been the only open issue for months. The interplay of behaviour
> > is where HOT succeeds and fails. Some behaviours sound like they will be
> > good, but aren't.
>
> Looking at the patch, it seems defragmentation is checked every time a
> heap page is pinned in heapam.c by calling heap_page_prune_defrag().  If
> page free space < 1.2 * average_tuple_size the page is defragmented.  It
> would seem defragmentation is checked as much as possible.  Seems
> pruning is also as aggressive as possible.

I think the factor should be 1.0. A setting of 1.2 will lead to twice as
many defragmentation attempts when we have a fixed row length.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: HOT patch - version 15

From
"Heikki Linnakangas"
Date:
Bruce Momjian wrote:
> Tom Lane wrote:
>> The we-already-pinned-the-page problem is a bit nasty but may not be
>> insurmountable.

If you find a way, that would be great.

> As I understand it, there are two HOT features:
>
>     Single-chain pruning, which trims HOT chains but doesn't reuse
>         the space
>
>     Defragmentation, which prunes the entire page and reuses space
>         and handles deleted rows, etc.

I'm repeating myself, but I would split that second operation into two
parts while we think about this: pruning the entire page, and
defragmenting (PageRepairFragmentation). Tom wondered why they're
separated in the patch. As the patch stands, there is no reason, but I
feel that separating them and doing them at different times might be an
important piece in the puzzle.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: HOT patch - version 15

From
Simon Riggs
Date:
On Mon, 2007-09-10 at 09:03 +0100, Heikki Linnakangas wrote:

> > As I understand it, there are two HOT features:
> >
> >     Single-chain pruning, which trims HOT chains but doesn't reuse
> >         the space
> >
> >     Defragmentation, which prunes the entire page and reuses space
> >         and handles deleted rows, etc.
>
> I'm repeating myself, but I would split that second operation into two
> parts while we think about this: pruning the entire page, and
> defragmenting (PageRepairFragmentation). Tom wondered why they're
> separated in the patch. As the patch stands, there is no reason, but I
> feel that separating them and doing them at different times might be an
> important piece in the puzzle.

Yes, it does seem important to keep that separation, since it is
possible to do these things at different times, even if, for now, we
choose not to.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: HOT patch - version 15

From
"Pavan Deolasee"
Date:


On 9/10/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:


I'm repeating myself, but I would split that second operation into two
parts while we think about this: pruning the entire page, and
defragmenting (PageRepairFragmentation). Tom wondered why they're
separated in the patch. As the patch stands, there is no reason, but I
feel that separating them and doing them at different times might be an
important piece in the puzzle.



I think we all agree that defragmentation should be done less aggressively,
ideally when we are going to insert a tuple in the page and are running
low on free space there. If we could really do it in heap_update(), after
we figure out that there is not enough free space available in the page,
that would be great. We already have an exclusive lock on the page, and
hopefully we also have the vacuum lock. (This is how earlier versions of
the patch used to work, before we took out all the complexity associated
with tracking LP_DELETEd tuples and fragmented tuples and replaced it
with simple PageRepairFragmentation. Since the earlier version never
called PageRepairFragmentation, we could easily do pruning in
heap_update().)

If we can't find a way to defragment inside heap_update(), then we have three
choices:

1. Pass a hint to heap_fetch that an UPDATE may follow. If we are running
low on free space, we try to prune and defragment at that point.

2. Move some of the work to bgwriter.

3. Use some heuristic to decide whether to defragment or not.

I am not sure how easy it is to know a priori that we are fetching a tuple
for UPDATE. Assuming we can, this seems like a good idea.

Eventually we may want to move some work to bgwriter or some other
background process, but my take would be to save that for 8.4.

As a starting point, we are doing 3 in the patch. We always try to
keep just one tuple's worth of space free in the page. So if an UPDATE
takes up the remaining free space in the page, the page will be
pruned/defragmented on the subsequent lookup (index or seq). In a
read-mostly scenario, only the first lookup would trigger prune/defrag,
and the next many lookups would be quick. In an update-mostly scenario,
most of the heap_fetches would end up doing an update anyway, so doing
pruning/defragmentation in heap_fetch shouldn't be too bad. For a
balanced scenario, we might be moving the cost of pruning/defragmentation
to the SELECT path, but UPDATEs would be quick.
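A toy model of that behaviour (Python, purely illustrative; the real logic is C built around the PageIsPrunable hint) shows why read-mostly workloads only pay the cost once:

```python
# Toy model: only the first lookup after an update pays the pruning cost.
class Page:
    def __init__(self):
        self.prunable = False   # stands in for the PageIsPrunable hint
        self.prunes = 0

    def update_tuple(self):
        self.prunable = True    # an UPDATE/DELETE sets the hint

    def lookup(self):
        # heap_fetch path: attempt prune/defrag only if the hint is set
        if self.prunable:
            self.prunes += 1
            self.prunable = False

page = Page()
page.update_tuple()
for _ in range(100):    # read-mostly: one UPDATE, then many SELECTs
    page.lookup()
assert page.prunes == 1  # the other 99 lookups were no-ops
```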


The other issue is when to prune the HOT chain. I tend to agree that
the chances of having long HOT chains are low and that the cost of
following the chain won't be too significant. At the same time, we
probably don't want to live with very long chains forever. An example
would be a tuple that gets HOT updated several times in a transaction:
if the page is never updated again, the long HOT chain would never be
pruned. I like Simon's idea of pruning the chain if it goes beyond a
limit. Instead of adding any significant complexity to track the length
of each HOT chain, we can just mark the page with a flag if
heap_hot_fetch() detects a long chain. The next lookup (index or seq)
of the page will then try to prune the page and clear the flag if
successful.
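Sketched in Python (the names, the dict standing in for a page header, and the threshold are all assumptions, not the patch's code):

```python
LONG_CHAIN_THRESHOLD = 16   # an assumed limit on acceptable chain length

def hot_fetch(page, chain_length):
    """Walking the chain is where we learn its length, so flag it here."""
    if chain_length > LONG_CHAIN_THRESHOLD:
        page["long_chain_hint"] = True   # cheap: just set a page flag

def next_lookup(page, try_conditional_prune):
    """try_conditional_prune() returns True if the lock was obtained
    and the page was pruned; the flag is cleared only on success."""
    if page.get("long_chain_hint") and try_conditional_prune():
        page["long_chain_hint"] = False
```

If the conditional lock cannot be obtained, the flag stays set and a later lookup retries, so a long chain is eventually pruned without any per-chain bookkeeping.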

I also agree that we should avoid taking the exclusive lock before
checking free space in heap_page_prune_defrag(). Reading the free space
without the lock might be slightly unsafe, but we don't care if we
occasionally skip the maintenance work (or do it a little early).

Thanks,
Pavan


--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: HOT patch - version 15

From
"Heikki Linnakangas"
Date:
Bruce Momjian wrote:
> (Can someone time the access time for following a chain that fills an
> entire page (the worst case) vs. having a single tuple on the page?)

Here are some results, from my laptop. The test case is a table with a
single integer column, and a single row in the table. The row is updated
250 times to make a 251-tuple-long chain. The page can hold 255 tuples,
so this is pretty much the worst case. After that, the row is fetched
1000000 times, in a PL/pgSQL function. I ran the tests with
enable_seqscan on and off, see the "seqscan" and "idxscan" rows below.
Numbers are the times spent doing the fetches, in seconds. I repeated
each test a few times to make sure that the results are repeatable; they
seem to be repeatable to ~0.1s precision. The test script used is
attached. Autovacuum was disabled, and shared_buffers=320MB; otherwise
all settings were left at defaults.

            HEAD    HOT     HOT-opt  HOT-pruned
seqscan     19.9    21.1    20.1     11.5
idxscan     27.8    31.4    30.4     13.7

Explanations of the columns:

HEAD:        CVS HEAD
HOT:         CVS HEAD + HOT patch v15, but with heap_page_prune_defrag
             short-circuited to do nothing
HOT-opt:     CVS HEAD + HOT patch v15, but with static XidInMVCCSnapshot
             like in CVS HEAD
HOT-pruned:  CVS HEAD + HOT patch v15

I didn't expect a difference in seqscan performance between HEAD and
HOT. I oprofiled it, and figured out that it's because the HOT patch
removed the static qualifier from XidInMVCCSnapshot, since it's needed
in plancat.c. I changed it back to static, dummying out the call in
plancat.c, and the results are now closer to each other (HOT-opt column).

Comparing the idxscan columns, it looks like following the chain *is*
more expensive than having to go through killed index pointers. Pruning
clearly does help.

Given that this test is pretty much the worst case scenario, I'm ok with
not pruning for the purpose of keeping chains short.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
set enable_seqscan=on;

DROP TABLE hottest;

CREATE TABLE hottest (id integer);

INSERT INTO hottest VALUES (1);

CREATE INDEX i_hottest ON hottest(id);

-- Fill the table

CREATE OR REPLACE FUNCTION longchain (n integer) RETURNS void LANGUAGE PLPGSQL AS
$$
DECLARE
  i integer;
BEGIN
  FOR i IN 1..n LOOP
    UPDATE hottest SET id=id WHERE id = 1;
  END LOOP;
END;
$$;

SELECT longchain(250);

CREATE OR REPLACE FUNCTION fetchchain (n integer) RETURNS void LANGUAGE PLPGSQL AS
$$
DECLARE
  i integer;
  foo integer;
BEGIN
  FOR i IN 1..n LOOP
    SELECT id INTO foo FROM hottest WHERE id = 1;
  END LOOP;
END;
$$;

\timing
SELECT fetchchain(1000000);
\timing

Re: HOT patch - version 15

From
"Heikki Linnakangas"
Date:
Heikki Linnakangas wrote:
> Explanations of the columns:
>
> HEAD:        CVS HEAD
> HOT-pruned:    CVS HEAD + HOT patch v15
> HOT:        CVS HEAD + HOT patch v15, but with heap_page_prune_defrag short
> circuited to do nothing
> HOT-opt:    CVS HEAD + HOT patch v15, but with static XidInMVCCSnapshot
> like in CVS HEAD

One correction: HOT-opt was also short-circuited like HOT. IOW, HOT and
HOT-opt are the same, except for the static qualifier in XidInMVCCSnapshot.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: HOT patch - version 15

From
"Pavan Deolasee"
Date:


On 9/6/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> When I suggested that we get rid of the LP_DELETE flag for heap tuples,
> the tuple-level fragmentation and all that, and just take the vacuum
> lock and call PageRepairFragmentation, I was thinking that we'd do it in
> heap_update and only when we run out of space on the page. But as Greg
> said, it doesn't work because you're already holding a reference to at
> least one tuple on the page, the one you're updating, by the time you
> get to heap_update. That's why I put the pruning code to heap_fetch
> instead. Yes, though the amortized cost is the same, it does push the
> pruning work to the foreground query path.

The amortized cost is only "the same" if every heap_fetch is associated
with a heap update.  I feel pretty urgently unhappy about this choice.
Have you tested the impact of the patch on read-mostly workloads?


For read-mostly workloads, only the first SELECT after an UPDATE
would trigger pruning/defragmentation. heap_page_prune_defrag()
would be a no-op for subsequent SELECTs (PageIsPrunable() would
be false until the next UPDATE).

I think Heikki's recent test confirms this.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: HOT patch - version 15

From
Simon Riggs
Date:
On Mon, 2007-09-10 at 12:17 +0100, Heikki Linnakangas wrote:
> Bruce Momjian wrote:
> > (Can someone time the access time for following a chain that fills an
> > entire page (the worst case) vs. having a single tuple on the page?)
>
> Here is some results, on my laptop.

>             HEAD    HOT     HOT-opt  HOT-pruned
> seqscan     19.9    21.1    20.1     11.5
> idxscan     27.8    31.4    30.4     13.7
>

> Comparing the idxscan columns, it looks like following the chain *is*
> more expensive than having to go through killed index pointers. Pruning
> clearly does help.
>
> Given that this test is pretty much the worst case scenario, I'm ok with
> not pruning for the purpose of keeping chains short.

I wasn't expecting that result and had accepted the counter-argument.
IMHO it overturns the argument, or at least calls for more test results.
Perhaps we need to compare this with a situation of 50% reads and 50%
writes where we try to prune aggressively on reads. There are clearly
big wins to be had in comparison with HEAD, but also the possibility of
regressions (in multiple directions) for various worst cases.

ISTM that if we made 1 in every 16 index scans prune chains longer than
N (=16?) then we would avoid worst cases like this for almost no
additional (amortized) cost.
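As a sketch (Python; the 1-in-16 sampling and the chain limit are the guesses from the paragraph above, not measured values):

```python
SCAN_SAMPLE = 16   # attempt pruning on 1 in every 16 index scans
CHAIN_LIMIT = 16   # ...and only for chains longer than this

def should_prune_chain(scan_counter, chain_length):
    """Amortize the pruning cost across scans, instead of paying it on
    every lookup (expensive) or never (unbounded chains)."""
    return scan_counter % SCAN_SAMPLE == 0 and chain_length > CHAIN_LIMIT
```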

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: HOT patch - version 15

From
"Pavan Deolasee"
Date:


On 9/8/07, Bruce Momjian <bruce@momjian.us> wrote:


Would someone tell us exactly when pruning and defragmenting happens in
the current version of the patch?  If we don't nail this issue down soon
PostgreSQL 8.3 is going to sail without this patch.




I guess you already have the answers by now, but let me repeat here:

Until patch version 14, defragmentation and pruning were separate
operations, though we used to invoke both at the same time. The only
difference was that pruning can work with a simple exclusive lock
whereas defragmentation needs a vacuum-strength lock. Tom argued
that it's not worth the complexity, so I changed that and clubbed
both operations together. What it means: pruning also now needs a
vacuum-strength lock and is always followed by defragmentation.

So as the patch stands (version 15), we call heap_page_prune_defrag()
inside heap_release_fetch() and heapgetpage(), apart from the vacuum
loop. It checks for PageIsPrunable() before proceeding. PageIsPrunable
is set when a tuple is updated/deleted. So for read-mostly workloads
heap_page_prune_defrag() will mostly be a no-op.

If PageIsPrunable is set, the cleanup lock is available (the exclusive
lock is tried conditionally, so we don't wait if there is contention),
and we are running low on free space (right now, if free space is less
than 1.2 times the average tuple size or less than BLCKSZ/8), the
entire page is pruned and the fragmentation is repaired.
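The "tried conditionally, so we don't wait" part behaves like a non-blocking lock acquire. Here is a Python analogue (illustrative only; the server uses buffer cleanup locks, not Python locks, and the names are assumptions):

```python
import threading

def try_prune_defrag(page_lock, page_is_prunable, prune_fn):
    """Attempt prune/defrag without ever waiting for the lock."""
    if not page_is_prunable:
        return False                 # nothing updated/deleted on this page
    if not page_lock.acquire(blocking=False):
        return False                 # contention: skip, never block
    try:
        prune_fn()                   # prune and repair fragmentation
        return True
    finally:
        page_lock.release()
```

The design choice is that maintenance work is strictly opportunistic: a SELECT never stalls behind another backend just to clean a page.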

We also prune/defrag in lazy vacuum and vacuum full, but I assume we
are not worried about these maintenance activities.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: HOT patch - version 15

From
"Pavan Deolasee"
Date:


On 9/10/07, Simon Riggs <simon@2ndquadrant.com> wrote:
On Mon, 2007-09-10 at 12:17 +0100, Heikki Linnakangas wrote:
> Bruce Momjian wrote:
> > (Can someone time the access time for following a chain that fills an
> > entire page (the worst case) vs. having a single tuple on the page?)
>
> Here is some results, on my laptop.

>             HEAD    HOT     HOT-opt  HOT-pruned
> seqscan     19.9    21.1    20.1     11.5
> idxscan     27.8    31.4    30.4     13.7
>

> Comparing the idxscan columns, it looks like following the chain *is*
> more expensive than having to go through killed index pointers. Pruning
> clearly does help.
>
> Given that this test is pretty much the worst case scenario, I'm ok with
> not pruning for the purpose of keeping chains short.

I wasn't expecting that result and had accepted the counter argument.



Yes, I agree with Simon. I am also surprised that HOT-pruned did
so well in this setup. I always thought that HOT would do well
in update-intensive scenarios, but from the results it seems that
HOT is also doing well for read-mostly queries.

In this particular example, the first SELECT after the 250 UPDATEs
would have pruned all the dead tuples and reduced HOT chain
to a single tuple. Hence the total time for subsequent SELECTs improved
tremendously.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: HOT patch - version 15

From
Bruce Momjian
Date:
Heikki Linnakangas wrote:
>             HEAD    HOT     HOT-opt  HOT-pruned
> seqscan     19.9    21.1    20.1     11.5
> idxscan     27.8    31.4    30.4     13.7
>
> Explanations of the columns:
>
> HEAD:        CVS HEAD
> HOT-pruned:    CVS HEAD + HOT patch v15
> HOT:        CVS HEAD + HOT patch v15, but with heap_page_prune_defrag short
> circuited to do nothing
> HOT-opt:    CVS HEAD + HOT patch v15, but with static XidInMVCCSnapshot
> like in CVS HEAD
>
> I didn't expect a difference in seqscan performance between HEAD and
> HOT. I oprofiled it, and figured out that it's because HOT patch removed
> the static-qualifier XidInMVCCSnapshot, because it's needed in
> plancat.c. I changed it back to static, dummying out the call in
> plancat.c, and the results are now closer to each other (HOT-opt column).
>
> Comparing the idxscan columns, it looks like following the chain *is*
> more expensive than having to go through killed index pointers. Pruning
> clearly does help.
>
> Given that this test is pretty much the worst case scenario, I'm ok with
> not pruning for the purpose of keeping chains short.

Wow, that is very helpful.  You have shown that pruning a single chain
is indeed a valuable operation.  My guess is that once you prune a tuple
you no longer have to check its visibility, and that is where the win is
coming from.

If we check a tuple in a chain and the tuple is dead is it possible the
pruning operation is cheaper than having to check the tuple again for
visibility the next time we see it?  If so, we can just always prune
when we see a dead tuple in the chain, which I believe was the original
design.  Pruning becomes an operation similar to marking an index entry
as dead.  (I assume pruning does not generate WAL traffic.)

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: HOT patch - version 15

From
"Heikki Linnakangas"
Date:
Bruce Momjian wrote:
> My guess is that once you prune a tuple
> you no longer have to check its visibility, and that is where the win is
> coming from.

Yes, I think that's right.

Oh, one more thing occurred to me. Without HOT, we not only mark index
tuples pointing to dead tuples as killed, we remove them altogether if
the index page gets full. If you modify the test case so that, after
doing the updates, you insert a bunch of tuples with a different key to
fill the index page, you should see CVS HEAD beating HOT-without-pruning
hands down.

> If we check a tuple in a chain and the tuple is dead is it possible the
> pruning operation is cheaper than having to check the tuple again for
> visibility the next time we see it?  If so, we can just always prune
> when we see a dead tuple in the chain, which I believe was the original
> design.  Pruning becomes an operation similar to marking an index entry
> as dead.  (I assume pruning does not generate WAL traffic.)

Pruning does generate a WAL record at the moment. Maybe you could do
some kind of a quick pruning without a WAL record. Like just modify the
ctid of the oldest dead tuple in the chain, or the redirecting line
pointer if there is one, to point to the latest live tuple, without
removing the dead tuples or the line pointers.
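A toy illustration of that shortcut (Python, with a dict standing in for the t_ctid links; this is an assumed sketch of the idea, not the patch):

```python
def quick_prune(ctid_link, head, is_dead):
    """Shortcut the chain: repoint the head (or the redirecting line
    pointer) at the first live tuple, leaving the dead tuples and their
    line pointers in place for a later full, WAL-logged prune."""
    tid = head
    while is_dead(tid) and tid in ctid_link:
        tid = ctid_link[tid]          # follow the t_ctid link
    ctid_link[head] = tid             # one in-place pointer update
    return tid

chain = {1: 2, 2: 3, 3: 4}            # tuple 1 -> 2 -> 3 -> 4
dead = {1, 2, 3}                       # only tuple 4 is still live
assert quick_prune(chain, 1, lambda t: t in dead) == 4
assert chain[1] == 4                   # later fetches skip straight to 4
```

Nothing is reclaimed here, so (the idea goes) no WAL record would be needed; the space comes back only when a real prune/defragment runs.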

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: HOT patch - version 15

From
Bruce Momjian
Date:
Pavan Deolasee wrote:
> Until patch version 14, defragmentation and pruning were separate
> operations, though we used to invoke both at the same time. The only
> difference was that pruning can work with a simple exclusive lock
> whereas defragmentation needs a vacuum-strength lock. Tom argued
> that it's not worth the complexity, so I changed that and clubbed
> both the operations together. What it means: pruning also now needs
> vacuum-strength lock and is always followed by defragmentation.

Interesting.  See my previous post to Heikki stating we might need to
change that based on Heikki test results.

> So as the patch stands (version 15), we call heap_page_prune_defrag()
> inside heap_release_fetch() and heapgetpage(), apart from the vacuum
> loop. It checks for PageIsPrunable() before proceeding. PageIsPrunable
> is set when a tuple is updated/deleted. So for read-mostly workloads
> heap_page_prune_defrag will mostly be no-op.
>
> If PageIsPrunable is set, cleanup lock is available (exclusive lock is tried
> conditionally, so we don't wait if there is contention) and we are running
> low on free space (right now if free space is less than 1.2 times the
> average
> tuple size or less than BLCKSZ/8), the entire page is pruned and
> fragmentation
> is repaired.

Looking at the patch I see:

+       /*
+        * Mark the page as clear of prunable tuples. If we find a tuple which
+        * may become prunable, we shall set the hint again.
+        */
+       PageClearPrunable(page);

I like the idea of the page hint bit, but my question is if there is a
long-running transaction, isn't each SELECT going to try to defragment a
page over and over again because there is still something prunable on
the page?

I think we need to find out how hard it would be to try the
defragmentation only on INSERT or UPDATE.  The hint bit might still be
useful.

> We also prune/defrag in lazy vacuum and vacuum full, but I assume we
> are not worried about these maintenance activities.

Right.

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: HOT patch - version 15

From
Gregory Stark
Date:
"Bruce Momjian" <bruce@momjian.us> writes:

> Looking at the patch I see:
>
> +       /*
> +        * Mark the page as clear of prunable tuples. If we find a tuple which
> +        * may become prunable, we shall set the hint again.
> +        */
> +       PageClearPrunable(page);
>
> I like the idea of the page hint bit, but my question is if there is a
> long-running transaction, isn't each SELECT going to try to defragment a
> page over and over again because there is still something prunable on
> the page?

Well it'll try to prune the chains over and over again. If it doesn't find
anything it won't defragment, but yes.

I think we could tackle that by storing on the page the oldest xmax which
would allow us to prune a tuple. Then you could compare RecentGlobalXmin
against that and not bother looking at any other chains if it hasn't been
passed yet.
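In sketch form (Python; the prune_xid field name and the plain `<` comparison are assumptions, and transaction-id wraparound is ignored for simplicity):

```python
def record_prunable(page, tuple_xmax):
    """Remember the oldest xmax that blocks pruning on this page."""
    if page.get("prune_xid") is None or tuple_xmax < page["prune_xid"]:
        page["prune_xid"] = tuple_xmax

def worth_scanning(page, recent_global_xmin):
    """Skip the whole page until RecentGlobalXmin has passed prune_xid."""
    prune_xid = page.get("prune_xid")
    return prune_xid is not None and prune_xid < recent_global_xmin
```

A single comparison against RecentGlobalXmin then replaces re-walking every chain on each SELECT while the long-running transaction is still open.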

It would be hard to do that with single-chain pruning though; once you
prune the limiting tuple you wouldn't know what the next limiting xmax
is until the next time you do a full-page prune. Still, that gets us
down to at most two full-page prunes per update instead of a potentially
unbounded number of prunes.

This seems like a further optimization to think about after we have a place to
trigger the pruning where it'll do the most good.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: HOT patch - version 15

From
Florian Pflug
Date:
Bruce Momjian wrote:
> Looking at the patch I see:
>
> +       /*
> +        * Mark the page as clear of prunable tuples. If we find a tuple which
> +        * may become prunable, we shall set the hint again.
> +        */
> +       PageClearPrunable(page);
>
> I like the idea of the page hint bit, but my question is if there is a
> long-running transaction, isn't each SELECT going to try to defragment a
> page over and over again because there is still something prunable on
> the page?

Maybe that risk could be lowered if instead of a flag, we stored the
minimal global xmin needed to prune at least one tuple.

greetings, Florian Pflug



Re: HOT patch - version 15

From
Simon Riggs
Date:
On Mon, 2007-09-10 at 18:23 +0100, Heikki Linnakangas wrote:

> > If we check a tuple in a chain and the tuple is dead is it possible the
> > pruning operation is cheaper than having to check the tuple again for
> > visibility the next time we see it?  If so, we can just always prune
> > when we see a dead tuple in the chain, which I believe was the original
> > design.  Pruning becomes an operation similar to marking an index entry
> > as dead.  (I assume pruning does not generate WAL traffic.)
>
> Pruning does generate a WAL record at the moment. Maybe you could do
> some kind of a quick pruning without a WAL record. Like just modify the
> ctid of the oldest dead tuple in the chain, or the redirecting line
> pointer if there is one, to point to the latest live tuple, without
> removing the dead tuples or the line pointers.

Sounds like a great idea.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: HOT patch - version 15

From
"Pavan Deolasee"
Date:


On 9/10/07, Florian Pflug <fgp.phlo.org@gmail.com> wrote:

Maybe that risk could be lowered if instead of a flag, we stored the
minimal global xmin needed to prune at least one tuple.



I like the idea. The question is whether the chances of a Prunable
page being looked up again and again in the presence of a long-running
transaction are high enough to justify adding 4 bytes to the page
header.
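To make the idea concrete, a rough sketch of the check in C (all names here are illustrative, not actual patch code; the stored xmin field and the helper are assumptions for the sake of the example):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical page-header field: the oldest xmax that would let us prune
 * at least one tuple on this page; 0 means nothing is prunable. */
struct page_hdr
{
    TransactionId pd_prunable_xmin;
};

/* Modulo-2^32 XID comparison, in the spirit of TransactionIdPrecedes. */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Attempt pruning only when some tuple's xmax is already older than the
 * oldest xmin any running transaction can see.  With a long-running
 * transaction holding back recent_global_xmin, this returns false and we
 * skip the page instead of rescanning it on every lookup.
 */
static bool
page_prune_worthwhile(const struct page_hdr *hdr, TransactionId recent_global_xmin)
{
    return hdr->pd_prunable_xmin != 0 &&
           xid_precedes(hdr->pd_prunable_xmin, recent_global_xmin);
}
```
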

Thanks,
Pavan


--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: HOT patch - version 15

From
"Pavan Deolasee"
Date:


On 9/10/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:

Oh, one more thing occured to me. Without HOT, we not only mark index
tuples pointing to dead tuples as killed, we remove them altogether if
the index page gets full. If you modify the test case so that after
doing the updates, you insert a bunch of tuples with a different key to
fill the index page, you should see CVS HEAD winning HOT without pruning
hands down.


Um, I am not sure I follow that. With HOT, even if the HOT chain is long,
there is still just one index entry for the entire chain, so I don't see
why CVS HEAD would do better than HOT in the above case. Net-net, there
will be an equal number of index entries after the inserts.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: HOT patch - version 15

From
Bruce Momjian
Date:
Heikki Linnakangas wrote:
> Bruce Momjian wrote:
> > My guess is that once you prune a tuple
> > you no longer have to check its visibility, and that is where the win is
> > coming from.
>
> Yes, I think that's right.
>
> Oh, one more thing occured to me. Without HOT, we not only mark index
> tuples pointing to dead tuples as killed, we remove them altogether if
> the index page gets full. If you modify the test case so that after
> doing the updates, you insert a bunch of tuples with a different key to
> fill the index page, you should see CVS HEAD winning HOT without pruning
> hands down.
>
> > If we check a tuple in a chain and the tuple is dead is it possible the
> > pruning operation is cheaper than having to check the tuple again for
> > visibility the next time we see it?  If so, we can just always prune
> > when we see a dead tuple in the chain, which I believe was the original
> > design.  Pruning becomes an operation similar to marking an index entry
> as dead.  (I'm assuming pruning does not generate WAL traffic.)
>
> Pruning does generate a WAL record at the moment. Maybe you could do
> some kind of a quick pruning without a WAL record. Like just modify the
> ctid of the oldest dead tuple in the chain, or the redirecting line
> pointer if there is one, to point to the latest live tuple, without
> removing the dead tuples or the line pointers.

I am wondering what you even mean by removing the dead tuples when
pruning.  I thought only defragmentation removed tuples.

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: HOT patch - version 15

From
Bruce Momjian
Date:
Gregory Stark wrote:
> "Bruce Momjian" <bruce@momjian.us> writes:
>
> > Looking at the patch I see:
> >
> > +       /*
> > +        * Mark the page as clear of prunable tuples. If we find a tuple which
> > +        * may become prunable, we shall set the hint again.
> > +        */
> > +       PageClearPrunable(page);
> >
> > I like the idea of the page hint bit, but my question is if there is a
> > long-running transaction, isn't each SELECT going to try to defragment a
> > page over and over again because there is still something prunable on
> > the page?
>
> Well it'll try to prune the chains over and over again. If it doesn't find
> anything it won't defragment, but yes.
>
> I think we could tackle that by storing on the page the oldest xmax which
> would allow us to prune a tuple. Then you could compare RecentGlobalXmin
> against that and not bother looking at any other chains if it hasn't been
> passed yet.

Yea, that might work if we have space on the page.

> It would be hard to do that with single-chain pruning though; once you prune
> the limiting tuple you wouldn't know what the next limiting xmax is
> until the next time you do a full-page prune. Still, that gets us down to at
> most two full-page prunes per update instead of a potentially unbounded
> number of prunes.
>
> This seems like a further optimization to think about after we have a place to
> trigger the pruning where it'll do the most good.

I was thinking that if we only tried defragmenting when we are doing an
INSERT or UPDATE, then the defragmentation would either free enough space
so we could stay on the page, or the new row would go on another page and
we would probably not return to that page again anytime soon.

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: HOT patch - version 15

From
"Pavan Deolasee"
Date:


On 9/11/07, Bruce Momjian <bruce@momjian.us> wrote:
Heikki Linnakangas wrote:

>
> Pruning does generate a WAL record at the moment. Maybe you could do
> some kind of a quick pruning without a WAL record. Like just modify the
> ctid of the oldest dead tuple in the chain, or the redirecting line
> pointer if there is one, to point to the latest live tuple, without
> removing the dead tuples or the line pointers.

I am wondering what you even mean by removing the dead tuples when
pruning.  I thought only defragmentation removed tuples.


Pruning removes intermediate dead tuples by marking their line pointers
~LP_USED and redirecting the root line pointer to the first live/recently_dead
tuple in the chain. OTOH defragmentation moves tuples around to repair
fragmentation. IOW defragmentation is nothing but calling
PageRepairFragmentation on the page.

For example, if we have a HOT chain of 5 tuples:

1 --> 2 --> 3 --> 4 --> 5

of which, 1, 2 and 3 are DEAD, 4 is RECENTLY_DEAD and 5 is LIVE

At the end of pruning:

- line pointer of 1 is redirected to 4
- line pointers of 2 and 3 are marked ~LP_USED
- the offset of 4 and 5 is unchanged

At the end of defragmentation:

- 4 and 5 are moved to the end of the page to create a larger
contiguous free space in the page.
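As a rough sketch of the above (a toy model with invented names; the real code operates on ItemIdData, and defragmentation is a call to PageRepairFragmentation):

```c
#include <assert.h>

/* Toy model of the pruning step described above; the states and layout
 * are invented for illustration and do not match the real ItemIdData. */
enum lp_state { LP_STATE_UNUSED, LP_STATE_NORMAL, LP_STATE_REDIRECT };

struct line_ptr
{
    enum lp_state state;
    int redirect_to;    /* valid only when state == LP_STATE_REDIRECT */
};

/*
 * Prune a HOT chain occupying offsets chain[0..len-1], where every member
 * before chain[first_live] is DEAD: redirect the root line pointer to the
 * first live/recently-dead member and mark the intermediate dead members
 * unused.  The surviving tuples themselves are not moved; only
 * defragmentation moves tuple data.
 */
static void
prune_chain(struct line_ptr *lp, const int *chain, int len, int first_live)
{
    (void) len;
    lp[chain[0]].state = LP_STATE_REDIRECT;
    lp[chain[0]].redirect_to = chain[first_live];
    for (int i = 1; i < first_live; i++)
        lp[chain[i]].state = LP_STATE_UNUSED;
}
```

With the five-tuple chain above (1, 2, 3 DEAD; 4 RECENTLY_DEAD; 5 LIVE), this leaves 1 redirected to 4, marks 2 and 3 unused, and leaves 4 and 5 at their original offsets.
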

Thanks,
Pavan


--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: HOT patch - version 15

From
Bruce Momjian
Date:
Pavan Deolasee wrote:
> > > Pruning does generate a WAL record at the moment. Maybe you could do
> > > some kind of a quick pruning without a WAL record. Like just modify the
> > > ctid of the oldest dead tuple in the chain, or the redirecting line
> > > pointer if there is one, to point to the latest live tuple, without
> > > removing the dead tuples or the line pointers.
> >
> > I am wondering what you even mean by removing the dead tuples when
> > pruning.  I thought only defragmentation removed tuples.
> >
> >
> Pruning removes intermediate dead tuples by marking their line pointers
> ~LP_USED and redirecting the root line pointer to the first
> live/recently_dead
> tuple in the chain. OTOH defragmentation moves tuples around to repair
> fragmentation. IOW defragmentation is nothing but calling
> PageRepairFragmentation on the page.
>
> For example, if we have a HOT chain of 5 tuples:
>
> 1 --> 2 --> 3 --> 4 --> 5
>
> of which, 1, 2 and 3 are DEAD, 4 is RECENTLY_DEAD and 5 is LIVE
>
> At the end of pruning:
>
> - line pointer of 1 is redirected to 4
> - line pointers of 2 and 3 are marked ~LP_USED
> - the offset of 4 and 5 is unchanged
>
> At the end of defragmentation:
>
> - 4 and 5 are moved to the end of the page to create a larger
> contiguous free space in the page.

Right.  My point is that pruning only modifies item pointers.  It does
not change the actual heap tuples.  In the quote above, how is Heikki's
"quick pruning" different from the pruning you are describing?

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: HOT patch - version 15

From
"Heikki Linnakangas"
Date:
Bruce Momjian wrote:
> Pavan Deolasee wrote:
>>>> Pruning does generate a WAL record at the moment. Maybe you could do
>>>> some kind of a quick pruning without a WAL record. Like just modify the
>>>> ctid of the oldest dead tuple in the chain, or the redirecting line
>>>> pointer if there is one, to point to the latest live tuple, without
>>>> removing the dead tuples or the line pointers.
>>> I am wondering what you even mean by removing the dead tuples when
>>> pruning.  I thought only defragmentation removed tuples.
>>>
>>>
>> Pruning removes intermediate dead tuples by marking their line pointers
>> ~LP_USED and redirecting the root line pointer to the first
>> live/recently_dead
>> tuple in the chain. OTOH defragmentation moves tuples around to repair
>> fragmentation. IOW defragmentation is nothing but calling
>> PageRepairFragmentation on the page.
>>
>> For example, if we have a HOT chain of 5 tuples:
>>
>> 1 --> 2 --> 3 --> 4 --> 5
>>
>> of which, 1, 2 and 3 are DEAD, 4 is RECENTLY_DEAD and 5 is LIVE
>>
>> At the end of pruning:
>>
>> - line pointer of 1 is redirected to 4
>> - line pointers of 2 and 3 are marked ~LP_USED
>> - the offset of 4 and 5 is unchanged
>>
>> At the end of defragmentation:
>>
>> - 4 and 5 are moved to the end of the page to create a larger
>> contiguous free space in the page.
>
> Right.  My point is that pruning only modifies item pointers.  It does
> not change the actual heap tuples.  In the quote above, how is Heikki's
> "quick pruning" different from the pruning you are describing?

In "quick pruning", you don't mark the line pointer as not used.
Otherwise someone might reuse the line pointer for another tuple, which
is a problem if you then crash. WAL replay would see an insert to a line
pointer that's already in use (the quick pruning wouldn't be WAL
logged), which would raise an error.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: HOT patch - version 15

From
"Pavan Deolasee"
Date:


On 9/11/07, Bruce Momjian <bruce@momjian.us> wrote:


Right.  My point is that pruning only modifies item pointers.  It does
not change the actual heap tuples.  In the quote above, how is Heikki's
"quick pruning" different from the pruning you are describing?



I think the only difference is that the quick pruning does not mark
intermediate tuples ~LP_USED and hence we may avoid WAL logging.

Btw, I am not too enthusiastic about quick pruning because it leaves
behind dead heap-only tuples which are not reachable from any root
heap tuples. Not that we can't handle them, but I am worried about
making such changes right now unless we are sure about the benefits.
We can always tune and tweak in 8.4.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: HOT patch - version 15

From
Simon Riggs
Date:
On Mon, 2007-09-10 at 20:46 +0100, Heikki Linnakangas wrote:
> Bruce Momjian wrote:
> > Pavan Deolasee wrote:
> >> At the end of pruning:
> >>
> >> - line pointer of 1 is redirected to 4
> >> - line pointers of 2 and 3 are marked ~LP_USED
> >> - the offset of 4 and 5 is unchanged
> >>

> > In the quote above, how is Heikki's
> > "quick pruning" different from the pruning you are describing?

> In "quick pruning", you don't mark the line pointer as not used.
> Otherwise someone might reuse the line pointer for another tuple, which
> is a problem if you then crash. WAL replay would see an insert to a line
> pointer that's already in use (the quick pruning wouldn't be WAL
> logged), which would raise an error.

That seems like the best way forward to me. It's just a hint that says
where the first live tuple is in the chain.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: HOT patch - version 15

From
Simon Riggs
Date:
On Tue, 2007-09-11 at 01:22 +0530, Pavan Deolasee wrote:

> I think the only difference is that the quick pruning does not mark
> intermediate tuples ~LP_USED and hence we may avoid WAL logging.

Sounds great.

> Btw, I am not too enthusiastic about quick pruning because it leaves
> behind dead heap-only tuples which are not reachable from any root
> heap tuples. Not that we can't handle them, but I am worried about
> making such changes right now unless we are sure about the benefits.
> We can always tune and tweak in 8.4

...but we don't want to reach them. They are waiting to be removed.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: HOT patch - version 15

From
Gregory Stark
Date:
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:

> On 9/11/07, Bruce Momjian <bruce@momjian.us> wrote:
>
>> Right.  My point is that pruning only modifies item pointers.  It does
>> not change the actual heap tuples.  In the quote above, how is Heikki's
>> "quick pruning" different from the pruning you are describing?
>
> I think the only difference is that the quick pruning does not mark
> intermediate tuples ~LP_USED and hence we may avoid WAL logging.
>
> Btw, I am not too enthusiastic about quick pruning because it leaves
> behind dead heap-only tuples which are not reachable from any root
> heap tuples. Not that we can't handle them, but I am worried about
> making such changes right now unless we are sure about the benefits.
> We can always tune and tweak in 8.4

You could mark such tuples with LP_DELETE. That would also let other
transactions quickly tot up how much space would be available if they were to
run PageRepairFragmentation.

But if you don't WAL-log setting LP_DELETE then you would still have to deal
with unreachable HOT tuples that lost their LP_DELETE flag. I suppose that as
long as PageRepairFragmentation first looks for any dead HOT tuples without
depending on LP_DELETE, that would be enough.

I wonder if you could do the quick prune without even taking the exclusive
page lock? You're overwriting 16 bits, and you know nobody else will be
modifying any of the line pointers in question to anything else (because your
pin prevents the vacuum lock from coming in and trying to mark them unused).

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: HOT patch - version 15

From
Tom Lane
Date:
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> Pruning removes intermediate dead tuples by marking their line pointers
> ~LP_USED and redirecting the root line pointer to the first
> live/recently_dead tuple in the chain.

It seems utterly unsafe to do this without a vacuum-grade lock on the
page.  Someone might be examining the root tuple (eg, in a SnapshotAny
scan) and I think they are entitled to assume that the line pointer
stays valid while they are looking at the tuple.

            regards, tom lane

Re: HOT patch - version 15

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> On Tue, 2007-09-11 at 01:22 +0530, Pavan Deolasee wrote:
>> I think the only difference is that the quick pruning does not mark
>> intermediate tuples ~LP_USED and hence we may avoid WAL logging.

> Sounds great.

What it sounds is utterly unsafe.  You can get away with not WAL-logging
individual bit flips (that is, hint-bit-setting) because either state of
the page is valid.  If I read this proposal correctly it is to change
t_ctid without WAL-logging, which means that a partial page write (torn
page syndrome) could leave the page undetectably corrupted --- t_ctid
is 6 bytes and could easily cross a hardware sector boundary.

            regards, tom lane

Re: HOT patch - version 15

From
Bruce Momjian
Date:
Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Tue, 2007-09-11 at 01:22 +0530, Pavan Deolasee wrote:
> >> I think the only difference is that the quick pruning does not mark
> >> intermediate tuples ~LP_USED and hence we may avoid WAL logging.
>
> > Sounds great.
>
> What it sounds is utterly unsafe.  You can get away with not WAL-logging
> individual bit flips (that is, hint-bit-setting) because either state of
> the page is valid.  If I read this proposal correctly it is to change
> t_ctid without WAL-logging, which means that a partial page write (torn
> page syndrome) could leave the page undetectably corrupted --- t_ctid
> is 6 bytes and could easily cross a hardware sector boundary.

OK, we certainly want to avoid WAL writes for pruning a single chain if
possible.  One idea would be to just mark the item pointer as dead but
not repoint the first item pointer to point to the first live tuple.

Heikki, would that give us speedups similar to what you saw in your
earlier tests?  It would if the majority of the cost of scanning the
page looking for the first live tuple was in doing the visibility tests
on every tuple.

Basically, when we are walking the chain via an index and we see that a
tuple is dead, we mark it as dead but don't repoint.  We would repoint
and reclaim on a defragmentation.

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: HOT patch - version 15

From
"Pavan Deolasee"
Date:


On 9/11/07, Gregory Stark <stark@enterprisedb.com> wrote:



You could mark such tuples with LP_DELETE. That would also let other
transactions quickly tot up how much space would be available if they were to
run PageRepairFragmentation.





IMHO we are going in circles here. We have already tried LP_DELETE techniques
and moved away from them to simplify things. We also tried reusing dead space
without running PageRepairFragmentation. Each of these techniques worked just
fine, with slightly different performance characteristics. What we now have is a
simplified algorithm which is much easier to follow and is safer, yet gives
us a very good performance boost. I am not sure this is the right time to
throw in new ideas, because we can never be sure that what we are doing
is the most optimal solution. Would it help if we went with some
solution right now, got the rest of the review process done, and then used
the feedback during beta testing to tune things? We may have far more
data points at that time to choose one technique over another. And we
would also know what areas to focus on.

I am also worried that by focusing too much on this issue we may
overlook some other correctness issue in the patch.

From whatever we have discussed so far, IMHO we should do the following
things and let the rest of the review process proceed:

- Defragment a page only when the free space left in the page is not
enough to accommodate even a single tuple (using the average tuple length
for this decision). This would mean we might defragment even though
there is no immediate UPDATE to the page, but we can treat this like
fillfactor, which provisions for the next UPDATE coming to
the page. Since we are defragmenting when the page is almost full,
hopefully we will reclaim a good amount of space in the page and
won't call defragmentation for the next few UPDATEs.

We already have a mechanism to track the average tuple request size in the
relcache. Maybe we can use relcache invalidation to keep the
information in sync (send an invalidation when the average request size
changes by, say, 5%).

- Avoid pruning chains on every index or seq scan lookup, but if a chain
becomes longer than X tuples, mark the page to be pruned on the
next lookup. We could choose to separate pruning and defragmentation
and only do pruning in this case, but I would prefer to keep them
together for now.

- Track the minimum xmin in the page header to avoid repeated
(wasted) attempts to prune a Prunable page in the presence of long-running
transactions.

We can save the rest of the techniques for the beta testing period or for 8.4.
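To make the first two rules concrete, a minimal sketch (illustrative names only; avg_tuple_len is assumed to come from the relcache tracking described above):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Rule 1: defragment only once the page can no longer fit one
 * average-sized tuple.  free_space would come from the page header,
 * avg_tuple_len from the (hypothetical) relcache-tracked average. */
static bool
should_defragment(size_t free_space, size_t avg_tuple_len)
{
    return free_space < avg_tuple_len;
}

/* Rule 2: instead of pruning on every lookup, mark the page to be pruned
 * on the next lookup once a HOT chain exceeds X tuples. */
static bool
should_mark_for_prune(int chain_len, int max_chain_len)
{
    return chain_len > max_chain_len;
}
```
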


Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: HOT patch - version 15

From
Bruce Momjian
Date:
Pavan Deolasee wrote:
> IMHO we are going in circles here. We have already tried LP_DELETE
> techniques and moved away from them to simplify things. We also tried
> reusing dead space without running PageRepairFragmentation. Each of these
> techniques worked just fine, with slightly different performance
> characteristics. What we now have is a simplified algorithm which is much
> easier to follow and is safer, yet gives us a very good performance boost.
> I am not sure this is the right time to throw in new ideas, because we can
> never be sure that what we are doing is the most optimal solution. Would it
> help if we went with some solution right now, got the rest of the review
> process done, and then used the feedback during beta testing to tune
> things? We may have far more data points at that time to choose one
> technique over another. And we would also know what areas to focus on.

I totally understand.  I was just throwing out ideas to make sure I
fully understood it and to explore various options.  I do agree we
should just wait for the review, knowing we have these trade-offs.

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: HOT patch - version 15

From
Simon Riggs
Date:
On Mon, 2007-09-10 at 22:24 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Tue, 2007-09-11 at 01:22 +0530, Pavan Deolasee wrote:
> >> I think the only difference is that the quick pruning does not mark
> >> intermediate tuples ~LP_USED and hence we may avoid WAL logging.
>
> > Sounds great.
>
> What it sounds is utterly unsafe.  You can get away with not WAL-logging
> individual bit flips (that is, hint-bit-setting) because either state of
> the page is valid.  If I read this proposal correctly it is to change
> t_ctid without WAL-logging, which means that a partial page write (torn
> page syndrome) could leave the page undetectably corrupted --- t_ctid
> is 6 bytes and could easily cross a hardware sector boundary.

We can calculate where they are, if that's the objection.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: HOT patch - version 15

From
Simon Riggs
Date:
On Tue, 2007-09-11 at 09:21 +0530, Pavan Deolasee wrote:

> IMHO we are going in circles here. We have already tried LP_DELETE
> techniques and moved away from them to simplify things.

> We can save the rest of the techniques for the beta testing period or for 8.4.

Agreed, we're just fine tuning by reinventing old ideas. HOT works, but
everybody has their own opinion of which subtle factors are important.
Seems like time to find out in Beta.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: HOT patch - version 15

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> What it sounds is utterly unsafe.  You can get away with not WAL-logging
> individual bit flips (that is, hint-bit-setting) because either state of
> the page is valid.  If I read this proposal correctly it is to change
> t_ctid without WAL-logging, which means that a partial page write (torn
> page syndrome) could leave the page undetectably corrupted --- t_ctid
> is 6 bytes and could easily cross a hardware sector boundary.

Well, we would never be overwriting the blockid, only the posid, which is 2
bytes. And the ctid (and posid) should always be 4-byte aligned, so it
would never actually cross a hardware sector boundary.
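For illustration, a simplified copy of the on-disk layout (the real definitions live in the storage headers; this stand-in is only for checking the sizes and offsets claimed above):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for BlockIdData and ItemPointerData, used here
 * only to check sizes and offsets; not the real declarations. */
typedef struct
{
    uint16_t bi_hi;
    uint16_t bi_lo;
} BlockIdData;

typedef struct
{
    BlockIdData ip_blkid;   /* 4 bytes: block number */
    uint16_t    ip_posid;   /* 2 bytes: offset number -- the only part
                             * quick pruning would overwrite */
} ItemPointerData;
```

If the ctid starts on a 4-byte boundary, ip_posid sits at offset 4 and is itself 4-byte aligned, so a 2-byte store to it cannot straddle a 512-byte sector boundary.
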

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: HOT patch - version 15

From
"Heikki Linnakangas"
Date:
Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
>> On Tue, 2007-09-11 at 01:22 +0530, Pavan Deolasee wrote:
>>> I think the only difference is that the quick pruning does not mark
>>> intermediate tuples ~LP_USED and hence we may avoid WAL logging.
>
>> Sounds great.
>
> What it sounds is utterly unsafe.  You can get away with not WAL-logging
> individual bit flips (that is, hint-bit-setting) because either state of
> the page is valid.  If I read this proposal correctly it is to change
> t_ctid without WAL-logging, which means that a partial page write (torn
> page syndrome) could leave the page undetectably corrupted --- t_ctid
> is 6 bytes and could easily cross a hardware sector boundary.

We're only changing the offsetnumber part of it, which is 2 bytes. That
shouldn't cross a hardware sector boundary on any reasonable hardware.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: HOT patch - version 15

From
"Pavan Deolasee"
Date:


On 9/11/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:


We're only changing the offsetnumber part of it, which is 2 bytes. That
shouldn't cross a hardware sector boundary on any reasonable hardware.


Not entirely true if we consider line pointer redirection, which involves
changing 4 bytes. You could argue that for quick pruning we don't do any
line pointer redirection, though.

Also, my worry about dealing with unreachable heap-only dead tuples
remains. We may need to do far more work to identify that they are not part
of any reachable HOT chain and hence can be removed.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: HOT patch - version 15

From
"Pavan Deolasee"
Date:


On 9/11/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> Pruning removes intermediate dead tuples by marking their line pointers
> ~LP_USED and redirecting the root line pointer to the first
> live/recently_dead tuple in the chain.

It seems utterly unsafe to do this without a vacuum-grade lock on the
page.  Someone might be examining the root tuple (eg, in a SnapshotAny
scan) and I think they are entitled to assume that the line pointer
stays valid while they are looking at the tuple.


As the patch stands, we prune holding a vacuum-grade lock. But ISTM
that there are only a limited number of users of SnapshotAny, and we make
them recheck the line pointer after acquiring the buffer lock, so hopefully
we should be fine.

Anyway, it's not a concern right now because we hold a vacuum-grade lock
while pruning. But if we decide to do what Heikki/Greg suggested about quick
pruning, then we will have to make sure that your concern is addressed.


Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: HOT patch - version 15

From
"Pavan Deolasee"
Date:


On 9/11/07, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:

- Track the minimum xmin in the page header to avoid repeated
(wasted) attempts to prune a Prunable page in the presence of long-running
transactions.


I would actually think twice before even doing this, because it would lead
to a complete change in the heap page structure and stop people from upgrading
to 8.3 without a complete dump/restore. I don't remember 8.3 introducing any
other significant change which already enforces that.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: HOT patch - version 15

From
Tom Lane
Date:
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> I would actually think twice before even doing this, because it would
> lead to a complete change in the heap page structure and stop people from
> upgrading to 8.3 without a complete dump/restore. I don't remember 8.3
> introducing any other significant change which already enforces that.

Then you don't remember far enough --- either the HeapTupleHeader change
or the varvarlena change would be enough to prevent that.  For that
matter, the pd_flags header field that the patch is already relying on
is new for 8.3.

Adding an xmin field might or might not be a good solution, but to
complain about it on compatibility grounds is silly.

            regards, tom lane

Re: HOT patch - version 15

From
"Pavan Deolasee"
Date:


On 9/11/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> I would actually think twice before even doing this, because it would
> lead to a complete change in the heap page structure and stop people from
> upgrading to 8.3 without a complete dump/restore. I don't remember 8.3
> introducing any other significant change which already enforces that.

Then you don't remember far enough --- either the HeapTupleHeader change
or the varvarlena change would be enough to prevent that.  For that
matter, the pd_flags header field that the patch is already relying on
is new for 8.3.


Oops! I forgot the combo command id. That went into the last major release of
EDB, and that confused me. Regarding pd_flags, I thought somebody could just
fix that in place and avoid a dump/restore (of course it's too invasive, but
isn't it feasible with pg_migrator?).

Anyway, given this, should we go for storing xmin in the page header?
For that matter, should there be a separate heap page header?

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: HOT patch - version 15

From
"Heikki Linnakangas"
Date:
Tom Lane wrote:
> "Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
>> I would actually think twice before even doing this, because it would
>> lead to a complete change in the heap page structure and stop people from
>> upgrading to 8.3 without a complete dump/restore. I don't remember 8.3
>> introducing any other significant change which already enforces that.
>
> Then you don't remember far enough --- either the HeapTupleHeader change
> or the varvarlena change would be enough to prevent that.  For that
> matter, the pd_flags header field that the patch is already relying on
> is new for 8.3.

Yeah, those changes will need to be dealt with in the conversion. But
none of the changes this far have increased the on-disk size. Adding a
new field to page header means that in some corner cases it might not be
possible to convert a page from old format to the new one because the
data no longer fits.

But let's not hang ourselves on that. I'm sure there's a way to deal
with the page format conversion if we have to. The crucial design
decision for HOT is when to prune and when to defragment the page, so
that when we're doing the UPDATE, there's room in the page.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: HOT patch - version 15

From
Tom Lane
Date:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> Then you don't remember far enough --- either the HeapTupleHeader change
>> or the varvarlena change would be enough to prevent that.  For that
>> matter, the pd_flags header field that the patch is already relying on
>> is new for 8.3.

> Yeah, those changes will need to be dealt with in the conversion. But
> none of the changes this far have increased the on-disk size. Adding a
> new field to page header means that in some corner cases it might not be
> possible to convert a page from old format to the new one because the
> data no longer fits.

Counting on my fingers, it seems that the 4-byte savings in
HeapTupleHeader size would guarantee that adding 4 bytes to the page
header wouldn't kill you.

            regards, tom lane

Re: HOT patch - version 15

From
"Heikki Linnakangas"
Date:
Tom Lane wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> Then you don't remember far enough --- either the HeapTupleHeader change
>>> or the varvarlena change would be enough to prevent that.  For that
>>> matter, the pd_flags header field that the patch is already relying on
>>> is new for 8.3.
>
>> Yeah, those changes will need to be dealt with in the conversion. But
>> none of the changes this far have increased the on-disk size. Adding a
>> new field to page header means that in some corner cases it might not be
>> possible to convert a page from old format to the new one because the
>> data no longer fits.
>
> Counting on my fingers, it seems that the 4-byte savings in
> HeapTupleHeader size would guarantee that adding 4 bytes to the page
> header wouldn't kill you.

It depends on the alignment of the data and the null bitmap. For example:
- a platform with 8-byte alignment
- the table has 9 columns
- every tuple on the page has a null bitmap
- there are zero bytes left on the page
- no tuples where varvarlena frees up space

The header size after alignment is MAXALIGN(23 + 2) = 32, which is the
same as before combo cid, MAXALIGN(27 + 2) = 32. It's a corner case, but
it's possible.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: HOT patch - version 15

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> On Tue, 2007-09-11 at 09:21 +0530, Pavan Deolasee wrote:
>> IMHO we are making full circles here. We have already tried LP_DELETE
>> techniques and moved away to simplify things.

>> We can save rest of the techniques for beta testing period or 8.4.

> Agreed, we're just fine tuning by reinventing old ideas. HOT works, but
> everybody has their own opinion of which subtle factors are important.
> Seems like time to find out in Beta.

I am worried by this idea that it's fine to make major changes in HOT
after beta starts, or that the attentions of beta testers will allow us
to collect data on variant implementations.  Neither is the case.

            regards, tom lane

Re: HOT patch - version 15

From
Simon Riggs
Date:
On Tue, 2007-09-11 at 16:10 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Tue, 2007-09-11 at 09:21 +0530, Pavan Deolasee wrote:
> >> IMHO we are making full circles here. We have already tried LP_DELETE
> >> techniques and moved away to simplify things.
>
> >> We can save rest of the techniques for beta testing period or 8.4.
>
> > Agreed, we're just fine tuning by reinventing old ideas. HOT works, but
> > everybody has their own opinion of which subtle factors are important.
> > Seems like time to find out in Beta.
>
> I am worried by this idea that it's fine to make major changes in HOT
> after beta starts, or that the attentions of beta testers will allow us
> to collect data on variant implementations.  Neither is the case.

Understood.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: HOT patch - version 15

From
"Pavan Deolasee"
Date:


On 9/11/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:


The crucial design
decision for HOT is when to prune and when to defragment the page, so
that when we're doing the UPDATE, there's room in the page.


Right.

For defragmentation, I am still inclined towards doing it when we see
that the free space is less than (or slightly more than) the average tuple
size of the table - unless we have a better solution.

For pruning, we can do the quick pruning. But I won't feel comfortable
doing it without an exclusive lock on the buffer. I would also avoid
line pointer redirection during a quick prune. A simpler solution would
be to flag the page whenever a HOT chain becomes longer than, say,
5 (which can easily be detected in heap_hot_fetch) and prune it in
the next lookup. Hopefully we would only rarely have long HOT chains,
and if we do get them, we will have some mechanism to recover
from them.

Any other ideas ?

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: HOT patch - version 15

From
Tom Lane
Date:
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> Please see the revised patches attached.

I'm curious how much the WAL-recovery aspects of this patch have been
tested, because heap_xlog_clean seems quite broken.  You have apparently
decided to redefine the WAL record format as using one-based rather than
zero-based item offsets, which would be fine if any of the rest of the
code had been changed to agree ...

            regards, tom lane

Re: HOT patch - version 15

From
"Pavan Deolasee"
Date:


On 9/13/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm curious how much the WAL-recovery aspects of this patch have been
tested, because heap_xlog_clean seems quite broken.


There are quite a few crash recovery tests that one of our QA persons
(Dharmendra Goyal) has written. I can post them if necessary. We run
these tests very regularly.

Apart from these regression crash tests, I had mostly tested by
running lots of concurrent clients on UP/SMP machines, crashing
and recovering the database. We fixed quite a few issues with
these tests. I have tried crashing in the middle of
UPDATEs/INSERTs/DELETEs and VACUUM/VACUUM FULL.
 

You have apparently
decided to redefine the WAL record format as using one-based rather than
zero-based item offsets, which would be fine if any of the rest of the
code had been changed to agree ...


I know Heikki changed that when he did the initial refactoring, but
I'm not sure why. Maybe he wanted to make it more consistent.
But I don't think it's broken, because we collect the offsets in
one-based format in PageRepairFragmentation, WAL-log them in that
format, and redo them the same way. Am I missing something ?

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: HOT patch - version 15

From
"Heikki Linnakangas"
Date:
Pavan Deolasee wrote:
> On 9/13/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> You have apparently
>> decided to redefine the WAL record format as using one-based rather than
>> zero-based item offsets, which would be fine if any of the rest of the
>> code had been changed to agree ...
>>
>>
> I know Heikki changed that when he did the initial refactoring, but not
> sure why. May be he wanted to make it more consistent.

Yeah, I just checked the work-in-progress patch I sent you back in July.
I refactored it to use one-based offsets for consistency, since I
modified log_heap_clean quite heavily anyway.

It's possible that I broke it in the process, I was only interested in
testing the performance characteristics of the simplified pruning scheme.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: HOT patch - version 15

From
"Pavan Deolasee"
Date:


On 9/13/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:


Yeah, I just checked the work-in-progress patch I sent you back in July.
I refactored it to use one-based offsets for consistency, since I
modified log_heap_clean quite heavily anyway.

It's possible that I broke it in the process, I was only interested in
testing the performance characteristics of the simplified pruning scheme.


I don't think so -- at least I couldn't figure out why it's broken. But
I would take Tom's comment seriously and look into it more tomorrow.
Or we can just revert it back to zero-based offsets.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: HOT patch - version 15

From
Tom Lane
Date:
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> On 9/13/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> You have apparently
>> decided to redefine the WAL record format as using one-based rather than
>> zero-based item offsets, which would be fine if any of the rest of the
>> code had been changed to agree ...
>>
> I know Heikki changed that when he did the initial refactoring, but not
> sure why. May be he wanted to make it more consistent.
> But I don't think its broken because we collect the offsets in one-based
> format in PageRepairFragmentation, WAL log in that format and redo
> the same way. Am I missing something ?

Hmm, I had been thinking that vacuum.c and vacuumlazy.c worked directly
with that info, but it looks like you're right, only
PageRepairFragmentation touches that array.  Never mind ... though my
suspicions would probably not have been aroused if anyone had bothered
to fix the comments.

            regards, tom lane

Re: HOT patch - version 15

From
"Pavan Deolasee"
Date:


On 9/13/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:


 Never mind ... though my
suspicions would probably not have been aroused if anyone had bothered
to fix the comments.


Yeah, my fault. I should have fixed that. Sorry about that.


Thanks,
Pavan


--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: HOT patch - version 15

From
Tom Lane
Date:
Okay, something else (a real problem this time ;-)):
HeapTupleSatisfiesAbort is bogus because it has failed to track recent
changes in tqual.c.

Rather than fix it, though, I question why we need it at all.  The only
use of it is in heap_prune_tuplechain() and ISTM that code is redundant,
wrong, or both.

In the case of a full-page prune, it's redundant because the tuple would
be found anyway by searching its chain from the root tuple.  Indeed I
suspect that such a tuple is entered into the newly-dead list twice,
perhaps risking overflow of that array.

In the case of a single-chain prune, it still seems wrong since you'll
eliminate only one of what might be a chain with multiple aborted
entries.  If it's OK to leave those other entries for future collection,
then it should be OK to leave this one too.  If it's not OK then the
approach needs to be redesigned.

I'm fairly unclear on what the design intention is for recovering from
the case where the last item(s) on a HOT chain are aborted.  Please
clarify.

            regards, tom lane

Re: HOT patch - version 15

From
"Pavan Deolasee"
Date:


On 9/14/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:

HeapTupleSatisfiesAbort is bogus because it has failed to track recent
changes in tqual.c.


Oh. I should have been aware.
 

Rather than fix it, though, I question why we need it at all.  The only
use of it is in heap_prune_tuplechain() and ISTM that code is redundant,
wrong, or both.

In the case of a full-page prune, it's redundant because the tuple would
be found anyway by searching its chain from the root tuple.  Indeed I
suspect that such a tuple is entered into the newly-dead list twice,
perhaps risking overflow of that array.


No, we would never reach an aborted dead tuple from the chain, because
we would see a live tuple before it. Or the tuple preceding the aborted
one (the first aborted one, if there are many) may have been updated
again, so the chain never reaches the aborted tuples. That's the reason
we have HeapTupleSatisfiesAbort: to collect any aborted heap-only tuples.

 

In the case of a single-chain prune, it still seems wrong since you'll
eliminate only one of what might be a chain with multiple aborted
entries.  If it's OK to leave those other entries for future collection,
then it should be OK to leave this one too.  If it's not OK then the
approach needs to be redesigned.


Again, since we check every heap-only tuple for the aborted-dead
case, we shall collect all such tuples. I think one place where you
are confused (or IOW where the code might have confused you ;-))
is that we never reach aborted dead heap-only tuples by
following the HOT chain from the root tuple, and that's why they
need special treatment.

I'm fairly unclear on what the design intention is for recovering from
the case where the last item(s) on a HOT chain are aborted.  Please
clarify.


We don't do anything special. Such a tuple is never reached during
HOT chain following, because either we would see a visible tuple before
it, or the older tuple might have been updated again and the chain
moved in a different direction. We only need some special treatment
to prune such tuples, and hence the business of HeapTupleSatisfiesAbort.

Having said that, based on our recent discussion about pruning a chain
up to and including the latest DEAD tuple in the chain, ISTM that we can
get rid of giving any special treatment to aborted heap-only tuples.
Earlier we could not do so for HOT-updated aborted heap-only tuples
because we did not know whether such a tuple was part of any valid HOT
chain. Now (assuming we change the pruning code to always prune
everything up to and including the latest DEAD tuple) we guarantee that
all DEAD tuples in the chain will be pruned, and hence we can collect
any DEAD tuple seen while pruning.

I am not sure if I explained this well; maybe I should post an example.
Need to run now.

Thanks,
Pavan

--

Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com