Thread: HOT patch - version 15
On 9/1/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I see this the other way around: if it doesn't work on system catalogs,
> it probably doesn't work, period. I'm not in favor of treating the
> catalogs differently.
Please see the revised patches attached. The combo patch applies cleanly over
current HEAD. The incremental patch contains changes since the last patch.
The patch passes all but three regression tests. The failures are
cosmetic in nature.
Although I am comfortable with the patch, I wanted to send this out today
because tomorrow is my last working day before I go on leave for the rest of
the week. I shall try to respond to mail and work a bit during that period,
but I won't be as active as usual. I can fix any minor leftover items
tomorrow, though.
The patch now supports HOT updates on system catalogs. This requires us
to wait for the deleting/inserting transaction if we encounter a
DELETE_IN_PROGRESS/INSERT_IN_PROGRESS tuple while building an index.
I was worried that this could add a new deadlock scenario, but system
catalog reindexing is not very common, so maybe we can live with that.
Another thing that worries me is the handling of RECENTLY_DEAD tuples
while rebuilding system catalog indexes. For non-system relations, we
don't make the new index available to queries if any RECENTLY_DEAD
tuples were skipped while building the index. We could do the same for
system catalogs, though we may need more work to make that happen.
But since system catalogs are always scanned with SnapshotNow, do we need
to worry about RECENTLY_DEAD tuples? ISTM that we should never return
RECENTLY_DEAD tuples with SnapshotNow. Does this sound OK?
The changes to the heap_update() signature are reverted. The caller can check
whether the new tuple satisfies HeapTupleIsHeapOnly, which signifies that the
update was a HOT update. This has also helped us limit the changes needed to
support system catalogs, because there are several invocations of
simple_heap_update() followed by CatalogUpdateIndexes(). We can
now check for a heap-only tuple in CatalogUpdateIndexes() and skip the
index inserts for HOT tuples.
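To illustrate the shape of that change (just a sketch of the idea, not the
exact patch text):

void
CatalogUpdateIndexes(Relation heapRel, HeapTuple heapTuple)
{
	CatalogIndexState indstate;

	/*
	 * A heap-only tuple is reachable through the index entries that point
	 * at the root of its HOT chain, so no new index entries are needed.
	 */
	if (HeapTupleIsHeapOnly(heapTuple))
		return;

	indstate = CatalogOpenIndexes(heapRel);
	CatalogIndexInsert(indstate, heapTuple);
	CatalogCloseIndexes(indstate);
}
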
As per Tom's suggestion, the relcache is now responsible for building the list
of HOT attributes. This needs to happen only once unless there are schema
changes. Also, we use just a single bitmap to track normal and system
attributes, offset by FirstLowInvalidHeapAttributeNumber.
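Roughly, the relcache-side computation looks like this (a sketch with
approximate names, not the patch code; expression and partial indexes
would need extra handling):

static Bitmapset *
build_index_attr_bitmap(Relation relation)
{
	Bitmapset  *attrs = NULL;
	List	   *indexoids = RelationGetIndexList(relation);
	ListCell   *lc;

	foreach(lc, indexoids)
	{
		Relation	indexDesc = index_open(lfirst_oid(lc), AccessShareLock);
		int			i;

		for (i = 0; i < indexDesc->rd_index->indnatts; i++)
		{
			int			attno = indexDesc->rd_index->indkey.values[i];

			/*
			 * System attributes have negative attnos, so shift everything
			 * by FirstLowInvalidHeapAttributeNumber to get a valid
			 * (non-negative) bitmapset member.
			 */
			if (attno != 0)		/* attno 0 means an expression column */
				attrs = bms_add_member(attrs,
							attno - FirstLowInvalidHeapAttributeNumber);
		}
		index_close(indexDesc, AccessShareLock);
	}
	list_free(indexoids);

	return attrs;				/* cached until relcache invalidation */
}
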
As per discussion, page pruning and repairing fragmentation are now clubbed
together. In the SELECT path, we conditionally try for an exclusive lock. If
the exclusive lock is also a cleanup lock, we prune and defrag the page.
As the patch stands, during an index fetch we try to prune only the chain
originating at that tuple. I am wondering if we should change that
and prune all the tuple chains in the page?
I haven't removed the avgFSM logic used to decide when to prune
the page. But we can remove that if we feel it's not worth the
complexity. Maybe a simple test of free space as a fraction of
BLCKSZ would suffice.
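The simpler alternative would be something like this (the 1/8 fraction is
only an illustration, not a tuned value):

static bool
page_prune_needed(Page page)
{
	/*
	 * Prune/defragment once the hole between pd_lower and pd_upper has
	 * shrunk below a fixed fraction of the block size.
	 */
	return PageGetFreeSpace(page) < BLCKSZ / 8;
}
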
MaxHeapTuplesPerPage is reverted to its original value.
I could have removed hot_update from xl_heap_update. But that
would have complicated the heap_xlog_update() code, so I have
left it unchanged for now. But if we still feel that's worth doing, I
will make that change.
IsPlanValidForThisTransaction() now uses PlannerGlobal,
as per Tom's suggestion, thus making it reentrant. I am not sure whether
I have got it completely right, though.
Apart from the changes based on recent feedback, I have also
added the previously discussed changes to track dead_space in
the relation and trigger autovacuum when the percentage of
dead_space goes beyond the threshold. vac_update_scale
is still a percentage of dead_space relative to the relation size.
vac_update_threshold is right now a number of blocks, but we need
to discuss and finalize these changes. One good thing is that if we do
it this way, the autoanalyze trigger will be based on the total number of
HOT and non-HOT updates (along with inserts/deletes).
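As a sketch, the trigger condition would be along these lines (variable
names are illustrative; dead_space is the new per-relation counter the
patch maintains):

static bool
relation_needs_vacuum(double dead_space, BlockNumber rel_blocks,
					  int threshold_blocks, double scale_factor)
{
	/*
	 * Vacuum once the tracked dead space exceeds a fixed number of blocks
	 * plus a fraction of the current relation size (all in bytes here).
	 */
	double		limit = (double) threshold_blocks * BLCKSZ +
						scale_factor * rel_blocks * BLCKSZ;

	return dead_space > limit;
}
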
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
Pavan Deolasee wrote:
> together. In the SELECT path, we conditionally try for exclusive lock. If
> the exclusive lock is also cleanup lock, we prune and defrag the page.
> As the patch stands, during index fetch we try to prune only the chain
> originating at that tuple. I am wondering if we should change that
> and prune all the tuple chains in the page ?

I am thinking we are best just doing the index chain we already have to
read. If you are modifying the table (like with UPDATE) it makes sense to
be more aggressive and do the whole page because you know there are
probably other table modifications, but for an index lookup there isn't
any knowledge of whether the table is being modified so looking at more
than we need seems like overkill.

--
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB  http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

Bruce Momjian <bruce@momjian.us> writes:
> I am thinking we are best just doing the index chain we already have to
> read.

> If you are modifying the table (like with UPDATE) it makes sense to be
> more aggressive and do the whole page because you know there are
> probably other table modifications, but for an index lookup there isn't
> any knowledge of whether the table is being modified so looking at more
> than we need seems like overkill.

Uh, why would any of this code at all execute during a pure lookup?

			regards, tom lane

Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > I am thinking we are best just doing the index chain we already have to
> > read.
>
> > If you are modifying the table (like with UPDATE) it makes sense to be
> > more aggressive and do the whole page because you know there are
> > probably other table modifications, but for an index lookup there isn't
> > any knowledge of whether the table is being modified so looking at more
> > than we need seems like overkill.
>
> Uh, why would any of this code at all execute during a pure lookup?

No idea. It seems an index lookup tries to prune a heap chain, and he
was asking if it should look at other chains on the page; I said not.
Whether the index lookup should prune the heap chain is another issue.

--
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB  http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

Bruce Momjian <bruce@momjian.us> writes:
> Tom Lane wrote:
>> Uh, why would any of this code at all execute during a pure lookup?

> No idea. It seems an index lookup tries to prune a heap chain, and he
> was asking if it should look at other chains on the page; I said not.
> Whether the index lookup should prune the heap chain is another issue.

ISTM the only time we should be doing HOT cleanup work is when we are
trying to insert a new tuple on that page --- and maybe even only when
we try and it doesn't fit straight off. Otherwise you're pushing
maintenance work into foreground query pathways, which is exactly where
I don't think we should be headed.

			regards, tom lane

"Bruce Momjian" <bruce@momjian.us> writes:
> Tom Lane wrote:
>> Bruce Momjian <bruce@momjian.us> writes:
>> > I am thinking we are best just doing the index chain we already have to
>> > read.
>>
>> > If you are modifying the table (like with UPDATE) it makes sense to be
>> > more aggressive and do the whole page because you know there are
>> > probably other table modifications, but for an index lookup there isn't
>> > any knowledge of whether the table is being modified so looking at more
>> > than we need seems like overkill.
>>
>> Uh, why would any of this code at all execute during a pure lookup?
>
> No idea. It seems an index lookup tries to prune a heap chain, and he
> was asking if it should look at other chains on the page; I said not.
> Whether the index lookup should prune the heap chain is another issue.

Pruning chains is kind of the whole point of the exercise no?

--
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

"Tom Lane" <tgl@sss.pgh.pa.us> writes:
> Bruce Momjian <bruce@momjian.us> writes:
>> Tom Lane wrote:
>>> Uh, why would any of this code at all execute during a pure lookup?
>
>> No idea. It seems an index lookup tries to prune a heap chain, and he
>> was asking if it should look at other chains on the page; I said not.
>> Whether the index lookup should prune the heap chain is another issue.
>
> ISTM the only time we should be doing HOT cleanup work is when we are
> trying to insert a new tuple on that page --- and maybe even only when
> we try and it doesn't fit straight off. Otherwise you're pushing
> maintenance work into foreground query pathways, which is exactly where
> I don't think we should be headed.

Ah, as I understand it you can't actually do the pruning then because the
executor holds references to source tuple on the page. In other words you
can never "get the vacuum lock" there because you already have the page
pinned yourself.

--
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

On 9/6/07, Gregory Stark <stark@enterprisedb.com> wrote:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>>
>> ISTM the only time we should be doing HOT cleanup work is when we are
>> trying to insert a new tuple on that page --- and maybe even only when
>> we try and it doesn't fit straight off. Otherwise you're pushing
>> maintenance work into foreground query pathways, which is exactly where
>> I don't think we should be headed.
>
> Ah, as I understand it you can't actually do the pruning then because the
> executor holds references to source tuple on the page. In other words you
> can never "get the vacuum lock" there because you already have the page
> pinned yourself.
I don't think the executor holding a reference is a problem, because when
we check for the vacuum lock, we have already pinned the page anyway.
But moving the old tuple around deep down in the UPDATE code path
(when we realize that there is no free space) is an issue. I know Heikki
tried to do it this way, but then moved the pruning code to the lookup
code. Heikki?
Another real problem with doing pruning only in the UPDATE path is that
we may end up with long HOT chains if the page does not receive an
UPDATE after many consecutive HOT updates. Every lookup of the
visible tuple in such a chain would be CPU-expensive since it would
require following a long chain.
The downside, of course, is that a SELECT may occasionally do more work.
But since we ensure that we do the extra work only when there is no
contention on the page, ISTM that the downside is limited.
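To make the point concrete, the cleanup in the read path is opportunistic
in roughly this way (names and the prune routine's signature here are
illustrative, not quoted from the patch):

static void
maybe_prune_page(Relation relation, Buffer buffer)
{
	/* Only do the cleanup if the cleanup lock is free right now. */
	if (ConditionalLockBufferForCleanup(buffer))
	{
		/* we hold the only pin, so it is safe to move tuples around */
		heap_page_prune_defrag(relation, buffer);
		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
	}
	/* otherwise someone else has the page pinned; skip the maintenance */
}
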
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com

"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> On 9/6/07, Gregory Stark <stark@enterprisedb.com> wrote:
>
>> Ah, as I understand it you can't actually do the pruning then because the
>> executor holds references to source tuple on the page. In other words you
>> can never "get the vacuum lock" there because you already have the page
>> pinned yourself.
>
> I don't think executor holding a reference is a problem because when
> we check for vacuum lock, we have already pinned the page anyways.

But that's the point. Even though the pin-count is 1 and it may look like
we have the vacuum lock we really don't. The fact that the buffer was
already pinned by us means that there are already pointers around
referencing tuples on that page. If we move them around those pointers
become garbage. In fact Postgres avoids copying data if it can get by
with a pointer at the original tuple on disk so some of the pointers will
be Datums pointing at individual columns in the tuples in the page.

--
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

Pavan Deolasee wrote:
> On 9/6/07, Gregory Stark <stark@enterprisedb.com> wrote:
>> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>>
>>> ISTM the only time we should be doing HOT cleanup work is when we are
>>> trying to insert a new tuple on that page --- and maybe even only when
>>> we try and it doesn't fit straight off. Otherwise you're pushing
>>> maintenance work into foreground query pathways, which is exactly where
>>> I don't think we should be headed.
>> Ah, as I understand it you can't actually do the pruning then because the
>> executor holds references to source tuple on the page. In other words you
>> can never "get the vacuum lock" there because you already have the page
>> pinned yourself.
>
> I don't think executor holding a reference is a problem because when
> we check for vacuum lock, we have already pinned the page anyways.
> But moving the old tuple around deep down in the UPDATE code path
> (when we realize that there is no free space) is an issue. I know Heikki
> tried to do it this way - but then moved the pruning code to lookup
> code. Heikki ?

When I suggested that we get rid of the LP_DELETE flag for heap tuples,
the tuple-level fragmentation and all that, and just take the vacuum
lock and call PageRepairFragmentation, I was thinking that we'd do it in
heap_update and only when we run out of space on the page. But as Greg
said, it doesn't work because you're already holding a reference to at
least one tuple on the page, the one you're updating, by the time you
get to heap_update. That's why I put the pruning code to heap_fetch
instead. Yes, though the amortized cost is the same, it does push the
pruning work to the foreground query path.

That was a reason for separating the pruning and defragmenting a page.
We could do the pruning in heap_update, and only do the defragmenting in
heap_fetch.

I was also thinking of an optimization to the pruning, so that while we
scan the page and remove dead tuples, we also check if we leave behind an
empty gap that's big enough to accommodate the new tuple we're inserting,
and reuse that space immediately, without defragmenting.

> Another real problem with doing pruning only in UPDATE path is that
> we may end up with long HOT chains if the page does not receive a
> UPDATE, after many consecutive HOT updates. Every lookup to the
> visible tuple in this chain would be CPU expensive since it would require
> long chain following.

Yes. For that as well, we could prune only, but not defragment, the page
in the lookup path.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> When I suggested that we get rid of the LP_DELETE flag for heap tuples,
> the tuple-level fragmentation and all that, and just take the vacuum
> lock and call PageRepairFragmentation, I was thinking that we'd do it in
> heap_update and only when we run out of space on the page. But as Greg
> said, it doesn't work because you're already holding a reference to at
> least one tuple on the page, the one you're updating, by the time you
> get to heap_update. That's why I put the pruning code to heap_fetch
> instead. Yes, though the amortized cost is the same, it does push the
> pruning work to the foreground query path.

The amortized cost is only "the same" if every heap_fetch is associated
with a heap update. I feel pretty urgently unhappy about this choice.
Have you tested the impact of the patch on read-mostly workloads?

>> Another real problem with doing pruning only in UPDATE path is that
>> we may end up with long HOT chains if the page does not receive a
>> UPDATE, after many consecutive HOT updates.

How is that, if the same number of prune attempts would occur?

			regards, tom lane

Tom Lane wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> When I suggested that we get rid of the LP_DELETE flag for heap tuples,
>> the tuple-level fragmentation and all that, and just take the vacuum
>> lock and call PageRepairFragmentation, I was thinking that we'd do it in
>> heap_update and only when we run out of space on the page. But as Greg
>> said, it doesn't work because you're already holding a reference to at
>> least one tuple on the page, the one you're updating, by the time you
>> get to heap_update. That's why I put the pruning code to heap_fetch
>> instead. Yes, though the amortized cost is the same, it does push the
>> pruning work to the foreground query path.
>
> The amortized cost is only "the same" if every heap_fetch is associated
> with a heap update. I feel pretty urgently unhappy about this choice.
> Have you tested the impact of the patch on read-mostly workloads?

I haven't. Someone should. We have a tester working on a test suite with
many small CPU-bound performance test cases; hopefully we'll get those
test cases and results out soon.

Assuming the rule for when to prune would be the same whether we do it in
heap_fetch or heap_update, I don't see how the total cost would be
different. (that's a bad assumption, though, see below)

>>> Another real problem with doing pruning only in UPDATE path is that
>>> we may end up with long HOT chains if the page does not receive a
>>> UPDATE, after many consecutive HOT updates.
>
> How is that, if the same number of prune attempts would occur?

It wouldn't. To avoid the long HOT chains, we want to prune more often
than what's needed to just make room for updates. I'm not sure what the
exact rules are in the current patch.

That's a pretty sensitive tradeoff, we want to prune often to cut the
long HOT chains, but not too often because it's pretty expensive to
acquire the vacuum lock and move tuples around. I don't think we've
found the optimal solution yet. Separating the pruning and defragmenting
might help.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Heikki Linnakangas wrote:
> That's a pretty sensitive tradeoff, we want to prune often to cut the
> long HOT chains, but not too often because it's pretty expensive to
> acquire the vacuum lock and move tuples around. I don't think we've
> found the optimal solution yet. Separating the pruning and defragmenting
> might help.

Does defragmenting force writing a full page image to the WAL afterwards?
Or does it just log the fact that the page was defragmented, and the
actual work is redone on recovery?

In the first case, over-zealous defragmenting might be costly in terms of
WAL traffic too, not only in terms of CPU usage.

greetings, Florian Pflug

Tom Lane wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> Tom Lane wrote:
>>>> Another real problem with doing pruning only in UPDATE path is that
>>>> we may end up with long HOT chains if the page does not receive a
>>>> UPDATE, after many consecutive HOT updates.
>>> How is that, if the same number of prune attempts would occur?
>
>> It wouldn't. To avoid the long HOT chains, we want to prune more often
>> than what's needed to just make room for updates.
>
> I don't follow. HOT chains can only get longer by updates.

Yes. We don't seem to be on the same wavelength...

Imagine a page with just one tuple on it:

1

After a bunch of updates, it looks like this

1 -> 2 -> 3 -> 4 -> 5

1 is the tuple the indexes point to, others are heap only.

Every time you do an index lookup, you have to follow that chain from 1
to 5. Which gets expensive. That's why we want to prune the chain to
make it shorter, even if there's still plenty of room on the page for
updates, and even if we're never going to update it again.

That's the theory. I don't know *how* expensive following the chain
really is, compared to index scans skipping over entries marked as
killed. I don't remember seeing any measurements of that.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Florian Pflug wrote:
> Heikki Linnakangas wrote:
>> That's a pretty sensitive tradeoff, we want to prune often to cut the
>> long HOT chains, but not too often because it's pretty expensive to
>> acquire the vacuum lock and move tuples around. I don't think we've
>> found the optimal solution yet. Separating the pruning and defragmenting
>> might help.
>
> Does defragmenting force writing a full page image to the WAL afterwards?
> Or does it just log the fact that the page was defragmented, and the
> actual work is redone on recovery?

No, you just log a note that it was defragmented.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>>> Another real problem with doing pruning only in UPDATE path is that
>>> we may end up with long HOT chains if the page does not receive a
>>> UPDATE, after many consecutive HOT updates.
>>
>> How is that, if the same number of prune attempts would occur?

> It wouldn't. To avoid the long HOT chains, we want to prune more often
> than what's needed to just make room for updates.

I don't follow. HOT chains can only get longer by updates.

			regards, tom lane

Tom Lane wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> Imagine a page with just one tuple on it:
>
>> 1
>
>> After a bunch of updates, it looks like this
>
>> 1 -> 2 -> 3 -> 4 -> 5
>
>> 1 is the tuple the indexes point to, others are heap only.
>
> But if we were attempting prune at every update, at least some of the
> later updates should have managed to prune. "2" should certainly be
> gone at this point, unless there's lots of contention for the page,
> in which case pruning at select won't make things better.

Oh I see. Yeah, hopefully you don't end up with long chains too often.
You don't need contention for it, though, a long-running transaction
will cause it. Or a transaction that inserts a tuple and updates it
multiple times in the same transaction.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Alvaro Herrera wrote:
> Pruning is going to take place on next vacuum anyway, isn't it?

Yes. If HOT is working well, you're not going to run vacuum very often,
though.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Alvaro Herrera wrote:
> Pruning is going to take place on next vacuum anyway, isn't it?
>
> AFAIUI the point is that you can't do cleanup selectively only on UPDATE
> because of pointers that your process may already have to the page,
> which is why you wanted to do it on every heap_fetch. I suggest it is
> optimization which can be left for a later version (8.4?)

Wait, did you mean that we don't do pruning at all in 8.3? That's a bad
idea; the main purpose of HOT is to be able to vacuum a page at a time,
reducing the need to do regular vacuums.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Tom Lane wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> Tom Lane wrote:
>>>> Another real problem with doing pruning only in UPDATE path is that
>>>> we may end up with long HOT chains if the page does not receive a
>>>> UPDATE, after many consecutive HOT updates.
>>> How is that, if the same number of prune attempts would occur?
>
>> It wouldn't. To avoid the long HOT chains, we want to prune more often
>> than what's needed to just make room for updates.
>
> I don't follow. HOT chains can only get longer by updates.

I believe the case pruning-on-select protects against is where you'd have
a period of heavy updating and a long-running transaction, followed by a
period of (mostly) read-only access. If you didn't manage to prune most
chains during the first phase due to a rather old OldestXmin, the
following selects will all spend cycles on following the long HOT chains.

Still, it sounds like the real reason is more the technical difficulties
of doing it on update, rather than preventing that scenario.

A rather wild idea: Could we maybe pin individual tuples, instead
of the whole page? Then we'd just have to be careful not to move
those when pruning during the update.

greetings, Florian Pflug

Me & Greg just had a little chat, and came up with this scheme:

1. On heap_update, if the page is full, you prune the page (but don't
defragment it, because you can't get the vacuum lock). That hopefully
leaves behind a large enough gap to put the new tuple in. Insert the new
tuple in the gap, and mark the page as Fragmented. Also make a note in
some backend-private data structure that we've left that page in
fragmented state.

2. In UnpinBuffer, if the pin count falls to zero and it's a page we've
pruned (check the backend-private data structure), defragment it.

Under little contention, all the cost of pruning will be carried by
transactions that do updates. Whether we need to prune in heap_fetch in
addition to that to keep the chains short, I don't know.

One problem with this scheme is that when the page gets full, you have to
hope that you can create a wide enough gap by pruning. It will work well
with fixed-size tuples, but not so well otherwise.

Hmm. I wonder if we could prune/defragment in bgwriter?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Florian Pflug wrote:
> Tom Lane wrote:
>
> A rather wild idea: Could we maybe pin individual tuples, instead
> of the whole page? Then we'd just have to be careful not to move
> those when pruning during the update.

Uh, that's a huge change. We might be able to keep track of tuples that
we have references to in our own backend, but even that seems like a
non-starter to me.

Yet another idea is to add an "intent" argument (or somehow pass it out
of line) to heap_fetch. You would prune the page in heap_fetch, but only
if you're fetching for the purpose of updating the tuple.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

On Wed, 2007-09-05 at 18:15 -0400, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Tom Lane wrote:
> >> Uh, why would any of this code at all execute during a pure lookup?
>
> > No idea. It seems an index lookup tries to prune a heap chain, and he
> > was asking if it should look at other chains on the page; I said not.
> > Whether the index lookup should prune the heap chain is another issue.
>
> ISTM the only time we should be doing HOT cleanup work is when we are
> trying to insert a new tuple on that page --- and maybe even only when
> we try and it doesn't fit straight off. Otherwise you're pushing
> maintenance work into foreground query pathways, which is exactly where
> I don't think we should be headed.

This is one of the many heuristics for tuning HOT. We can choose to do
this immediately, when the page fills, or somewhere in between.

The rationale was: In other circumstances we dirty a page to set a hint
bit the first time we see the need for it. That avoids the clog lookup on
all later reads of that tuple. By analogy, we do the same thing with HOT:
when we see a tuple chain that can be pruned, we prune it. If we dirty
the page for the hint bit it seems like we may as well dirty the page
some more and prune it also at the same time.

We could begin pruning only when the chain is N long. Currently N=2, but
we could set N=3+ easily enough. There's no code yet to actually count
that, but we can do that easily as we do each lookup. We should also be
able to remember the visibility result for each tuple in the chain to
decide whether pruning will be effective or not.

I would say that if the buffer is already dirty and the chain is
prunable, we should prune it at the first possible chance.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com

Heikki Linnakangas escribió:

> Hmm. I wonder if we could prune/defragment in bgwriter?

That would be best, if at all possible. You can prune without accessing
anything outside the page itself, right?

--
Alvaro Herrera                 http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Alvaro Herrera wrote:
> Heikki Linnakangas escribió:
>
>> Hmm. I wonder if we could prune/defragment in bgwriter?
>
> That would be best, if at all possible. You can prune without accessing
> anything outside the page itself, right?

Yes, though you do need to have an oldest xmin to determine which tuples
are dead, and the removed tuples need to be WAL-logged.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Heikki Linnakangas escribió:

> Every time you do an index lookup, you have to follow that chain from 1
> to 5. Which gets expensive. That's why we want to prune the chain to
> make it shorter, even if there's still plenty of room on the page for
> updates, and even if we're never going to update it again.
>
> That's the theory. I don't know *how* expensive following the chain
> really is, compared to index scans skipping over entries marked as
> killed. I don't remember seeing any measurements of that.

I suggest you let that be. Chain following can't be *that* expensive --
and if it is, then pruning would be a lot more expensive, so it's not a
thing you want to be doing for every heap_fetch. Pruning is going to take
place on next vacuum anyway, isn't it?

AFAIUI the point is that you can't do cleanup selectively only on UPDATE
because of pointers that your process may already have to the page, which
is why you wanted to do it on every heap_fetch. I suggest it is an
optimization which can be left for a later version (8.4?)

--
Alvaro Herrera                 http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

On Thu, 2007-09-06 at 16:15 +0100, Heikki Linnakangas wrote:
> Tom Lane wrote:
> > "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> >> Imagine a page with just one tuple on it:
> >
> >> 1
> >
> >> After a bunch of updates, it looks like this
> >
> >> 1 -> 2 -> 3 -> 4 -> 5
> >
> >> 1 is the tuple the indexes point to, others are heap only.
> >
> > But if we were attempting prune at every update, at least some of the
> > later updates should have managed to prune. "2" should certainly be
> > gone at this point, unless there's lots of contention for the page,
> > in which case pruning at select won't make things better.
>
> Oh I see. Yeah, hopefully you don't end up with long chains too often.
> You don't need contention for it, though, a long-running transaction
> will cause it. Or a transaction that inserts a tuple and updates it
> multiple times in the same transaction.

Yes, the main point is that an UPDATE doesn't always allow you to prune.
If it did, that would be the right place. Since it doesn't, the best
place to prune is surely the first time we see we *can* prune.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com

Simon Riggs wrote:
> Yes, the main point is that an UPDATE doesn't always allow you to prune.

You can always remove dead HOT tuples in heap_update. But you can never
defragment the page at that point.

> If it did, that would be the right place. Since it doesn't the best
> place to prune is surely the first time we see we *can* prune.

Not necessarily. Pruning is expensive, you need to scan all tuples on
the page and write WAL record. And defragment the page if you consider
that part of pruning. You don't want to do it too aggressively.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> I don't follow. HOT chains can only get longer by updates.

> Yes. We don't seem to be on the same wavelength...

> Imagine a page with just one tuple on it:

> 1

> After a bunch of updates, it looks like this

> 1 -> 2 -> 3 -> 4 -> 5

> 1 is the tuple the indexes point to, others are heap only.

But if we were attempting prune at every update, at least some of the
later updates should have managed to prune. "2" should certainly be
gone at this point, unless there's lots of contention for the page,
in which case pruning at select won't make things better.

			regards, tom lane

Heikki Linnakangas escribió:
> Alvaro Herrera wrote:
> > Heikki Linnakangas escribió:
> >
> >> Hmm. I wonder if we could prune/defragment in bgwriter?
> >
> > That would be best, if at all possible. You can prune without accessing
> > anything outside the page itself, right?
>
> Yes, though you do need to have an oldest xmin to determine which tuples
> are dead,

Hmm, well, I think that with the VXID patch it might actually be possible
to get a Xmin without being in a transaction.

> and the removed tuples need to be WAL-logged.

Is this a problem for bgwriter?

--
Alvaro Herrera                 http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> Alvaro Herrera wrote:
>> That would be best, if at all possible. You can prune without accessing
>> anything outside the page itself, right?

> Yes, though you do need to have an oldest xmin to determine which tuples
> are dead, and the removed tuples need to be WAL-logged.

Hmm ... also check commit status (pg_clog access). I've always thought
that those things couldn't be done in bgwriter, because it wasn't
running real transactions, but right at the moment I can't see that
there is any obstacle. Perhaps that meme dates from a time when
GetOldestXmin didn't work outside a transaction?

Pushing the prune/defrag work into bgwriter would certainly fix my
worries about having it in foreground query paths. (And screw up all
Greg Smith's work on how much bgwriter should run... or maybe this should
be a third independent task within bgwriter?)

			regards, tom lane

Alvaro Herrera <alvherre@commandprompt.com> writes:
> Heikki Linnakangas escribió:
>> and the removed tuples need to be WAL-logged.

> Is this a problem for bgwriter?

bgwriter does checkpoints, which emit WAL records; and they also call
GetOldestXmin. So we know a-priori that those two things work. There may
be some other showstopper but I'm not sure what it would be.

			regards, tom lane

On Thu, 2007-09-06 at 19:05 +0100, Heikki Linnakangas wrote:
> Pruning is expensive, you need to scan all tuples on
> the page and write WAL record. And defragment the page if you consider
> that part of pruning. You don't want to do it too aggressively.

OK, so my understanding of the meaning of the word has changed since we
decided that initial behaviour. So yes, not too aggressively.

Elsewhere on this thread, I said:

On Thu, 2007-09-06 at 18:05 +0100, Simon Riggs wrote:
> We could begin pruning only when the chain is N long. Currently N=2, but
> we could set N=3+ easily enough. There's no code yet to actually count
> that, but we can do that easily as we do each lookup. We should also be
> able to remember the visibility result for each tuple in the chain to
> decide whether pruning will be effective or not.
>
> I would say that if the buffer is already dirty and the chain is
> prunable, we should prune it at the first possible chance.

If we defer pruning until the page is full, worst case we could end up
with a chain ~240 tuples long, which might need to be scanned repeatedly.
That won't happen often, but it would be better to prune whenever we hit
one of these conditions:
- when the block is full
- when we reach the 16th tuple in a chain

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com

I wrote:
> Hmm ... also check commit status (pg_clog access). I've always thought
> that those things couldn't be done in bgwriter, because it wasn't
> running real transactions, but right at the moment I can't see that
> there is any obstacle. Perhaps that meme dates from a time when
> GetOldestXmin didn't work outside a transaction?

On further thought, I think what I'm remembering is that full-scale
VACUUM can't work inside bgwriter, because you need to take table-level
locks and worry about index vs heap consistency. But as long as HOT
pruning involves only a single heap page I see no need for it to take
more than the buffer-level locks on that page.

			regards, tom lane

Tom Lane wrote:
> I wrote:
>> Hmm ... also check commit status (pg_clog access). I've always thought
>> that those things couldn't be done in bgwriter, because it wasn't
>> running real transactions, but right at the moment I can't see that
>> there is any obstacle. Perhaps that meme dates from a time when
>> GetOldestXmin didn't work outside a transaction?
>
> On further thought, I think what I'm remembering is that full-scale
> VACUUM can't work inside bgwriter, because you need to take table-level
> locks and worry about index vs heap consistency. But as long as HOT
> pruning involves only a single heap page I see no need for it to take
> more than the buffer-level locks on that page.

One complication is that you need to identify heap pages; you don't want
to start pruning index pages. We could mark the buffer with a new
Prunable flag when it's updated, to signal the bgwriter that it can prune
it.

I wonder if pruning in bgwriter only is enough to make HOT work
efficiently. On a very frequently updated page, bgwriter will have to
work hard to keep up.

Another problematic scenario is a big table and small shared_buffers
(like on Windows). A page can easily fall out of the buffer cache before
bgwriter has the chance to prune anything on it.

But if it works reasonably well in typical scenarios, we can go with that
for 8.3 and improve later.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

On Thu, 6 Sep 2007, Heikki Linnakangas wrote:

> I wonder if pruning in bgwriter only is enough to make HOT work
> efficiently. On a very frequently updated page, bgwriter will have to
> work hard to keep up.

One of the goals for how I rebuilt the just-in-time BGW was to try and
make smaller values for bgwriter_delay feasible. None of the tunables
*should* have to be adjusted if you need to run the bgwriter much more
often to support some new HOT-related activity in there as well; this is
actually the next area I wanted to do a bit more testing on myself.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Heikki Linnakangas wrote:
> Tom Lane wrote:
> > I wrote:
> >> Hmm ... also check commit status (pg_clog access). I've always thought
> >> that those things couldn't be done in bgwriter, because it wasn't
> >> running real transactions, but right at the moment I can't see that
> >> there is any obstacle. Perhaps that meme dates from a time when
> >> GetOldestXmin didn't work outside a transaction?
> >
> > On further thought, I think what I'm remembering is that full-scale
> > VACUUM can't work inside bgwriter, because you need to take table-level
> > locks and worry about index vs heap consistency. But as long as HOT
> > pruning involves only a single heap page I see no need for it to take
> > more than the buffer-level locks on that page.
>
> One complication is that you need to identify heap pages; you don't want
> to start pruning index pages. We could mark the buffer with a new
> Prunable-flag when it's updated, to signal the bgwriter that it can
> prune it.
>
> I wonder if pruning in bgwriter only is enough to make HOT work
> efficiently. On a very frequently updated page, bgwriter will have to
> work hard to keep up.
>
> Another problematic scenario is a big table and small shared_buffers
> (like on Windows). Page can easily fall out of the buffer cache, before
> bgwriter has the chance to prune anything on it.
>
> But if it works reasonably well in typical scenarios, we can go with
> that for 8.3 and improve later.

I don't like pushing pruning to the bgwriter. I am afraid we will get
into the same problem as with autovacuum, where polling via the bgwriter
will frequently not happen just in time, when we need it.

--
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB  http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

Simon Riggs wrote:
> > We could begin pruning only when the chain is N long. Currently N=2, but
> > we could set N=3+ easily enough. There's no code yet to actually count
> > that, but we can do that easily as we do each lookup. We should also be
> > able to remember the visibility result for each tuple in the chain to
> > decide whether pruning will be effective or not.
> >
> > I would say that if the buffer is already dirty and the chain is
> > prunable, we should prune it at the first possible chance.
>
> If we defer pruning until the page is full, worst case we may could end
> up with a chain ~240 tuples long, which might need to be scanned
> repeatedly. That won't happen often, but it would be better to prune
> whenever we hit one of these conditions
> - when the block is full
> - when we reach the 16th tuple in a chain

I don't see how following a HOT chain is any slower than following an
UPDATE chain like we do now. In fact, what we do now is to have an index
row for every row version, while with HOT we will have one index entry
and all the row versions on the same page. That has to be better than
what we have now, even without pruning.

Let me define two terms:

	prune -- modify ctid pointers to reduce the length of one HOT chain

	defragment -- reduce the length of all HOT chains on the page and
	collect free space

I think we all agree defragmenting should only happen when we need free
space on the page. But it seems by the time we _know_ we are going to be
adding to the page we can't defragment. I am thinking we should go the
direction of passing a boolean down into the routines so they know to
defragment before they get into a case where they can't.

As for pruning, I am thinking Simon's idea of just saying don't prune
unless the chain is longer than X is correct. If there are a lot of
updates to the page, the page will fill and the chain will be pruned
during a defragment, and if not, the chains are guaranteed to be shorter
than X. Ideally you would want to say that if the chain has been Y for a
certain length of time (meaning it isn't growing and defragmentation
hasn't happened in a while) you would start to prune more aggressively,
but I see no easy way to track that.

Also, why all the talk of index lookups doing pruning? Can't a
sequential scan do pruning?

Would someone tell us exactly when pruning and defragmenting happen in
the current version of the patch? If we don't nail this issue down soon,
PostgreSQL 8.3 is going to sail without this patch.

--
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB  http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

On Fri, 2007-09-07 at 22:08 -0400, Bruce Momjian wrote:
> Simon Riggs wrote:
> > > We could begin pruning only when the chain is N long. Currently N=2, but
> > > we could set N=3+ easily enough. There's no code yet to actually count
> > > that, but we can do that easily as we do each lookup. We should also be
> > > able to remember the visibility result for each tuple in the chain to
> > > decide whether pruning will be effective or not.
> > >
> > > I would say that if the buffer is already dirty and the chain is
> > > prunable, we should prune it at the first possible chance.
> >
> > If we defer pruning until the page is full, worst case we may could end
> > up with a chain ~240 tuples long, which might need to be scanned
> > repeatedly. That won't happen often, but it would be better to prune
> > whenever we hit one of these conditions
> > - when the block is full
> > - when we reach the 16th tuple in a chain

Thanks for defining terms. Just to answer a few of your points:

> I don't see how following a HOT chain is any slower than following an
> UPDATE chain like we do now.

The speed/cost is the same. The issue is *when* we do this. Normal
SELECTs will follow the chain each time we do an index lookup.

So if our read/write ratio is 1000:1, then we will waste many cycles and
yet the chain is never pruned on SELECT. So there really must be some
point at which we prune on a SELECT.

Perhaps we could say if Xid % 16 == 0 then we prune, i.e. every 16th
transaction walking the chain will prune it.

> Also, why all the talk of index lookups doing pruning? Can't a
> sequential scan do pruning?

The SeqScan doesn't follow the chains, so can't easily determine whether
there are any long chains that need pruning. It's only when we come in
via an index that we need to walk the chain to the latest tuple version,
and in that case we learn how long the chain is.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com

Simon Riggs <simon@2ndquadrant.com> writes:
> On Fri, 2007-09-07 at 22:08 -0400, Bruce Momjian wrote:
>> I don't see how following a HOT chain is any slower than following an
>> UPDATE chain like we do now.

> The speed/cost is the same. The issue is *when* we do this. Normal
> SELECTs will follow the chain each time we do an index lookup.

> So if our read/write ratio is 1000:1, then we will waste many cycles and
> yet the chain is never pruned on SELECT. So there really must be some
> point at which we prune on a SELECT.

This argument is too weak to stand up by itself. Once a tuple is marked
as committed dead (which it certainly must be before you could even
consider pruning it) the cost of skipping over it in a HOT-chain search
is going to be very very small. You have already paid all the costs of
searching the index, fetching and pinning the heap page, etc. The
incremental cost of rejecting one more HOT-chain entry is the cost to

* fetch the line pointer;
* check HEAP_XMIN_COMMITTED and see that it's set;
* check xmin and see that it's valid per snapshot;
* check HEAP_XMAX_COMMITTED and see that it's set;
* check xmax and see that it's valid per snapshot;
* fetch t_ctid to chain to next tuple.

Now the snapshot checks could be relatively expensive --- but in any
situation where a tuple is potentially prunable, they'll reduce to *one*
comparison: is xid less than snapshot's xmin? So you've got two or at
most three shared-memory cache line fetches, no write to shared memory,
no locks taken or released, and a trivial number of CPU operations.

Compared to what it currently takes to check the same tuple (a separate
index entry fetch and traversal to the heap page), this is already an
enormous performance improvement. Insisting that we add complexity and
pay performance penalties elsewhere to reduce this cost still more does
not strike me as a sane tradeoff --- certainly not if it's being made
without conclusive performance measurements to back it up.

			regards, tom lane

Tom Lane wrote:
> Compared to what it currently takes to check the same tuple (a separate
> index entry fetch and traversal to the heap page), this is already an
> enormous performance improvement.

Though keep in mind that we kill index tuples as soon as they're deemed
to be dead. Nevertheless, I'm not very worried about the cost of
following the chain either. But that's something we can quite easily
measure if we want to.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Heikki Linnakangas wrote:
> Tom Lane wrote:
>> Compared to what it currently takes to check the same tuple (a separate
>> index entry fetch and traversal to the heap page), this is already an
>> enormous performance improvement.
>
> Though keep in mind that we kill index tuples as soon as they're deemed
> to be dead. Nevertheless, I'm not very worried about the cost of
> following the chain either. But that's something we can quite easily
> measure if we want to.

I'm confused now. I thought that pruning would be enough to shorten
HOT chains, because the root line pointer afterwards points directly to
the first live tuple. But we can *prune* (without actually defragmenting)
without holding a VACUUM-strength lock, right? Or did I get that wrong?

greetings, Florian Pflug

Florian Pflug wrote:
> Heikki Linnakangas wrote:
>> Tom Lane wrote:
>>> Compared to what it currently takes to check the same tuple (a separate
>>> index entry fetch and traversal to the heap page), this is already an
>>> enormous performance improvement.
>>
>> Though keep in mind that we kill index tuples as soon as they're deemed
>> to be dead. Nevertheless, I'm not very worried about the cost of
>> following the chain either. But that's something we can quite easily
>> measure if we want to.
>
> I'm confused now. I though that pruning would be enough to shorten
> HOT-Chains - because the root line pointer afterwards points directly
> to the first live tuple. But we can *prune* (without actually
> defragmenting) without holding a VACUUM-strength lock, right? Or did I
> get that wrong?

Yes, that's right. You don't seem to be confused at all.

Tom argued that following the tuple chain is cheap enough, and might even
be cheaper than what we have now, that we don't need to prune just for
the purpose of keeping the chains short. To which I pointed out that
currently, without HOT, we mark index tuples pointing to dead tuples as
killed to avoid following them in the future, so HOT without pruning is
not cheaper than what we have now.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

On Sat, 2007-09-08 at 17:46 +0100, Heikki Linnakangas wrote:
> Tom Lane wrote:
> > Compared to what it currently takes to check the same tuple (a separate
> > index entry fetch and traversal to the heap page), this is already an
> > enormous performance improvement.
>
> Though keep in mind that we kill index tuples as soon as they're deemed
> to be dead. Nevertheless, I'm not very worried about the cost of
> following the chain either. But that's something we can quite easily
> measure if we want to.

OK, I'm good about it if you guys are.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com

"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> Tom argued that following the tuple chain is cheap enough, and might
> even be cheaper than what we have now, that we don't need to prune just
> for the purpose of keeping the chains short. To which I pointed out that
> currently, without HOT, we mark index tuples pointing to dead tuples as
> killed to avoid following them in the future, so HOT without pruning is
> not cheaper than what we have now.

That hack only works in plain indexscans, though, not bitmapped scans.
Anyway, I remain unconvinced that the chains would normally get very
long in the first place, if we could prune when updating.

The we-already-pinned-the-page problem is a bit nasty but may not be
insurmountable.

			regards, tom lane

Heikki Linnakangas wrote:
> Florian Pflug wrote:
>> Heikki Linnakangas wrote:
>>> Tom Lane wrote:
>>>> Compared to what it currently takes to check the same tuple (a separate
>>>> index entry fetch and traversal to the heap page), this is already an
>>>> enormous performance improvement.
>>> Though keep in mind that we kill index tuples as soon as they're deemed
>>> to be dead. Nevertheless, I'm not very worried about the cost of
>>> following the chain either. But that's something we can quite easily
>>> measure if we want to.
>> I'm confused now. I though that pruning would be enough to shorten
>> HOT-Chains - because the root line pointer afterwards points directly
>> to the first live tuple. But we can *prune* (without actually
>> defragmenting) without holding a VACUUM-strength lock, right? Or did I
>> get that wrong?
>
> Yes, that's right. You don't seem to be confused at all.
>
> Tom argued that following the tuple chain is cheap enough, and might
> even be cheaper than what we have now, that we don't need to prune just
> for the purpose of keeping the chains short. To which I pointed out that
> currently, without HOT, we mark index tuples pointing to dead tuples as
> killed to avoid following them in the future, so HOT without pruning is
> not cheaper than what we have now.

Hmm.. In that case, couldn't we prune the chain while we followed it,
with nearly zero overhead at all? It seems that we'd just have to always
update the root line pointer / tuples to point at the first RECENTLY_DEAD
tuple in the chain, together with maybe compacting the orphaned tuples to
just a line pointer.

I'm not sure how HOT handles this currently, but it seems that this sort
of minimal pruning shouldn't need WAL.

greetings, Florian Pflug

Heikki Linnakangas wrote:
> Florian Pflug wrote:
> > Heikki Linnakangas wrote:
> >> Tom Lane wrote:
> >>> Compared to what it currently takes to check the same tuple (a separate
> >>> index entry fetch and traversal to the heap page), this is already an
> >>> enormous performance improvement.
> >>
> >> Though keep in mind that we kill index tuples as soon as they're deemed
> >> to be dead. Nevertheless, I'm not very worried about the cost of
> >> following the chain either. But that's something we can quite easily
> >> measure if we want to.
> >
> > I'm confused now. I though that pruning would be enough to shorten
> > HOT-Chains - because the root line pointer afterwards points directly
> > to the first live tuple. But we can *prune* (without actually
> > defragmenting) without holding a VACUUM-strength lock, right? Or did I
> > get that wrong?
>
> Yes, that's right. You don't seem to be confused at all.
>
> Tom argued that following the tuple chain is cheap enough, and might
> even be cheaper than what we have now, that we don't need to prune just
> for the purpose of keeping the chains short. To which I pointed out that
> currently, without HOT, we mark index tuples pointing to dead tuples as
> killed to avoid following them in the future, so HOT without pruning is
> not cheaper than what we have now.

The central idea I now understand is that pruning only shrinks HOT
chains. It does not allow reuse of space. Only defragmentation does
that, and that is triggered when the page is almost full.

--
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB  http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

Simon Riggs wrote:
> > > If we defer pruning until the page is full, worst case we may could end
> > > up with a chain ~240 tuples long, which might need to be scanned
> > > repeatedly. That won't happen often, but it would be better to prune
> > > whenever we hit one of these conditions
> > > - when the block is full
> > > - when we reach the 16th tuple in a chain
>
> Thanks for defining terms.
>
> Just to answer a few of your points:
>
> > I don't see how following a HOT chain is any slower than following an
> > UPDATE chain like we do now.
>
> The speed/cost is the same. The issue is *when* we do this. Normal
> SELECTs will follow the chain each time we do an index lookup.

But a sequential scan still follows the chain, and that isn't going to
prune. Why are we more worried about index chain lookups than
sequential scan lookups?

As I understand it, the pruning is only to reduce the chains. It doesn't
allow space reuse. I have updated my README to reflect that:

	ftp://momjian.us/pub/postgresql/mypatches/README.HOT

> So if our read/write ratio is 1000:1, then we will waste many cycles and
> yet the chain is never pruned on SELECT. So there really must be some
> point at which we prune on a SELECT.
>
> Perhaps we could say if Xid % 16 == 0 then we prune, i.e. every 16th
> transaction walking the chain will prune it.
>
> > Also, why all the talk of index lookups doing pruning? Can't a
> > sequential scan do pruning?
>
> The SeqScan doesn't follow the chains, so can't easily determine whether
> there are any long chains that need pruning. Its only when we come in
> via an index that we need to walk the chain to the latest tuple version
> and in that case we learn how long the chain is.

Uh, as I understand it, every access follows the ctid, so why doesn't a
sequential scan follow the chain?

--
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB  http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

Bruce Momjian wrote:
> Heikki Linnakangas wrote:
>> Florian Pflug wrote:
>>> Heikki Linnakangas wrote:
>>>> Tom Lane wrote:
>>>>> Compared to what it currently takes to check the same tuple (a separate
>>>>> index entry fetch and traversal to the heap page), this is already an
>>>>> enormous performance improvement.
>>>> Though keep in mind that we kill index tuples as soon as they're deemed
>>>> to be dead. Nevertheless, I'm not very worried about the cost of
>>>> following the chain either. But that's something we can quite easily
>>>> measure if we want to.
>>> I'm confused now. I though that pruning would be enough to shorten
>>> HOT-Chains - because the root line pointer afterwards points directly
>>> to the first live tuple. But we can *prune* (without actually
>>> defragmenting) without holding a VACUUM-strength lock, right? Or did I
>>> get that wrong?
>> Yes, that's right. You don't seem to be confused at all.
>>
>> Tom argued that following the tuple chain is cheap enough, and might
>> even be cheaper than what we have now, that we don't need to prune just
>> for the purpose of keeping the chains short. To which I pointed out that
>> currently, without HOT, we mark index tuples pointing to dead tuples as
>> killed to avoid following them in the future, so HOT without pruning is
>> not cheaper than what we have now.
>
> The central idea I now understand is that pruning only shrinks HOT
> chains. It does not allow reuse of space. Only defragmentation does
> that, and that is triggered when the page is almost full.

I was under the impression that pruning *does* free the space occupied
by DEAD HOT-Tuples (minus the size of a redirection line pointer). It
just doesn't defragment, so how much of that freed space you can actually
use to store new tuples is another question...

Or is that wrong - do we need a VACUUM-strength lock to turn a tuple into
a redirecting line pointer, and can pruning therefore only really free
*no* space?

greetings, Florian Pflug

Florian Pflug <fgp.phlo.org@gmail.com> writes: > I was under the impression that pruning *does* free the space occupied > by DEAD HOT-Tuples (minus the size of a redirection line pointer). It > just doesn't defragment, so how much of that freed space you can actually > use to store new tuples is another question... None, unless the freed tuple happens to be directly adjacent to the pd_lower --- pd_upper hole, or unless the patch has done a lot more to the page-free-space-management code than I thought. We have never looked for free space anywhere except the main "hole". regards, tom lane
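To make the point concrete: the only free space the page code hands out is the single hole between pd_lower (the end of the line pointer array) and pd_upper (the start of tuple data), so a tuple freed in the middle of the data area does not enlarge that hole until the page is defragmented. A minimal sketch with simplified types, not the real page header macros:

    #include <stdint.h>

    /*
     * Sketch only: simplified page header.  Free space is just the
     * pd_lower..pd_upper hole; a dead tuple sitting above pd_upper does
     * not make this number any bigger.
     */
    typedef struct SketchPageHeader
    {
        uint16_t pd_lower;   /* offset to start of free space */
        uint16_t pd_upper;   /* offset to end of free space */
    } SketchPageHeader;

    static int
    sketch_page_get_free_space(const SketchPageHeader *phdr)
    {
        int space = (int) phdr->pd_upper - (int) phdr->pd_lower;

        return space > 0 ? space : 0;
    }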
Bruce Momjian <bruce@momjian.us> writes: > Simon Riggs wrote: >> The speed/cost is the same. The issue is *when* we do this. Normal >> SELECTs will follow the chain each time we do an index lookup. > But a sequential scan still follows the chain, and that isn't going to > prune. Why are we more worried about index chain lookups than > sequential scan lookups? No, a seqscan *doesn't* follow HOT chains. It just looks at every live line pointer on the page, sequentially. An indexscan will have to follow HOT chains, or it will miss tuples it needs to return. >> The SeqScan doesn't follow the chains, so can't easily determine whether >> there are any long chains that need pruning. Its only when we come in >> via an index that we need to walk the chain to the latest tuple version >> and in that case we learn how long the chain is. > Uh, as I understand it, every access follows the ctid, so why doesn't a > sequential scan follow the chain? Up to now, normal searches (whether seq or index) have never paid the slightest attention to t_ctid. That's only used when a READ COMMITTED update realizes it's trying to update a non-latest version of a row. regards, tom lane
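The distinction Tom draws can be sketched as two loops: a sequential scan visits every used line pointer on the page and never follows t_ctid, while an index fetch lands on one root line pointer and has to follow the chain until it finds a visible version. The structs and functions below are invented for illustration and are not the executor's actual code:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct SketchTuple
    {
        bool     used;        /* line pointer in use? */
        bool     visible;     /* visible to our snapshot? */
        uint16_t next_offnum; /* t_ctid offset within the page, 0 = end of chain */
    } SketchTuple;

    /* Sequential scan: look at every used line pointer, never follow t_ctid. */
    static int
    sketch_seqscan_page(const SketchTuple *tuples, int ntuples, int *out, int maxout)
    {
        int found = 0;

        for (int off = 1; off <= ntuples && found < maxout; off++)
            if (tuples[off].used && tuples[off].visible)
                out[found++] = off;
        return found;
    }

    /* Index fetch: start at the root offset and walk the HOT chain. */
    static int
    sketch_index_fetch(const SketchTuple *tuples, int root_offnum)
    {
        int off = root_offnum;

        while (off != 0)
        {
            if (tuples[off].visible)
                return off;                 /* found the visible version */
            off = tuples[off].next_offnum;  /* follow the chain */
        }
        return 0;                           /* nothing visible in this chain */
    }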
Tom Lane wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> > Tom argued that following the tuple chain is cheap enough, and might
> > even be cheaper than what we have now, that we don't need to prune just
> > for the purpose of keeping the chains short. To which I pointed out that
> > currently, without HOT, we mark index tuples pointing to dead tuples as
> > killed to avoid following them in the future, so HOT without pruning is
> > not cheaper than what we have now.
>
> That hack only works in plain indexscans, though, not bitmapped scans.
> Anyway, I remain unconvinced that the chains would normally get very
> long in the first place, if we could prune when updating.
>
> The we-already-pinned-the-page problem is a bit nasty but may not be
> insurmountable.

As I understand it, there are two HOT features:

	Single-chain pruning, which trims HOT chains but doesn't reuse
	the space

	Defragmentation, which prunes the entire page and reuses space
	and handles deleted rows, etc.

Defragmentation is the HOT feature we really want. Single-chain pruning is kind of nice because it speeds things up, but isn't necessary. The fact that the entire chain is on the same page makes me think that we could just leave single-chain pruning for 8.4 if necessary.

I think allowing the chain to be more often on the same page via defragmentation and having a single index entry for the chain is going to be a win, and the fact we don't have marked-dead index entries for some of the chain isn't going to be a problem. My guess is that the marked-dead index entries were a win only when the chain was on several pages, which isn't the case for HOT chains.

FYI, I saw this comment in the patch:

+ /*
+  * If the free space left in the page is less than the average FSM
+  * request size (or a percentage of it), prune all the tuples or
+  * tuple chains in the page. Since the operation requires exclusive
+  * access to the page and needs to be WAL logged, we want to do as
+  * much as possible. At the same time, since the function may be
+  * called from a critical path, we want it to be as fast as
+  * possible.
+  *
+  * Disregard the free space if PAGE_PRUNE_DEFRAG_FORCE option is set.
+  *
+  * XXX The value of 120% is a ad-hoc choice and we may want to
+  * tweak it if required:
+  *
+  * XXX The average request size for a relation is currently
+  * initialized to a small value such as 256. So for a table with
+  * large size tuples, during initial few UPDATEs we may not prune
+  * a page even if the free space available is less than the new
+  * tuple size - resulting in unnecessary extention of the relation.
+  * Add a temporary hack to prune the page if the free space goes
+  * below a certain percentage of the block size (set to 12.5% here))
+  */

So this is how the system determines if it should defrag the whole page. The defrag function is heap_page_prune_defrag(). The big downside of this function is it has to get a lock to survey things and it often has to guess if it should activate or not, meaning it has no idea if free space is needed on this page or not.

In summary, I feel we have the HOT mechanics down well, but the open issue is _when_ to activate each operation. (Can someone time the access time for following a chain that fills an entire page (the worst case) vs. having a single tuple on the page?)

In an ideal world, we would prune single chains only when they were long enough to cause a performance impact, and would defragment only when a new row will not fit on the page.
Other than those two cases, we don't care how much dead space there is on a page. However, there are two complexities to this. One, we can't be sure we can defragment when we need it because we might not get the lock, and second, we are only going to try to put a row on a page if we are updating a row on that page. If the page is 90% dead but no rows are being updated on that page no one will try to add a row to the page because FSM thinks it is full. That might be OK, it might not. Another issue. My guess is that it will take 2-3 weeks to get HOT applied, meaning we aren't going to go to beta before October 1. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Sat, 2007-09-08 at 23:13 -0400, Bruce Momjian wrote: > Defragementation is the HOT feature we really want. Single-chain > pruning is kind of nice because it speeds things up, but isn't > necessary. The fact that the entire chain is on the same page makes me > think that we could just leave single-chain pruning for 8.4 if > necessary. Agreed > I think allowing the chain to be more often on the same page via > defragmentation and having a single index entry for the chain is going > to be a win, and the fact we don't have marked-dead index entries for > some of the chain isn't going to be a problem. My guess is that the > marked-dead index entries were a win only when the chain was on several > pages, which isn't the case for HOT chains. Agreed > In summary, I feel we have the HOT mechanics down well, but the open > issue is _when_ to activate each operation. That has been the only open issue for months. The interplay of behaviour is where HOT succeeds and fails. Some behaviours sound like they will be good, but aren't. > However, there are two complexities to this. One, we can't be sure we > can defragment when we need it because we might not get the lock, and That is a probability call. Heikki measured that, and its a good bet. If we do fail to fragment, then one of the rows will migrate to another block. There will likely be fewer backends contending for that block and then fragmentation will succeed. So this behaviour is self-balancing. > second, we are only going to try to put a row on a page if we are > updating a row on that page. If the page is 90% dead but no rows are > being updated on that page no one will try to add a row to the page > because FSM thinks it is full. That might be OK, it might not. We must consider data row contention as well as space efficiency. In the current system UPDATE contention is not an issue because we treat them as INSERTs and there is a good scheme in place to deal with INSERT contention. It would be easy to neglect UPDATE contention issues with HOT because of this. If we actively send more backends towards a full block, we may increase contention which is a bad thing, and we may cause defragmentation to fail, which would be even worse. The net result will be that some backends go to a different block again, so it seems simpler to not send backends towards other blocks just because they are full. > Another issue. My guess is that it will take 2-3 weeks to get HOT > applied, meaning we aren't going to go to beta before October 1. Seems reasonable. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Simon Riggs wrote: > On Sat, 2007-09-08 at 23:13 -0400, Bruce Momjian wrote: > > In summary, I feel we have the HOT mechanics down well, but the open > > issue is _when_ to activate each operation. > > That has been the only open issue for months. The interplay of behaviour > is where HOT succeeds and fails. Some behaviours sound like they will be > good, but aren't. Looking at the patch, it seems defragmentation is checked every time a heap page is pinned in heapam.c by calling heap_page_prune_defrag(). If page free space < 1.2 * average_tuple_size the page is defragmented. It would seem defragmentation is checked as much as possible. Seems pruning is also as aggressive as possible. I think an open question is whether defragmentation should happen _only_ when we are trying to add something to the page and it doesn't fit. Another open question is whether pruning should happen only when the chain gets to a certain length. The larger question is how are we going to get these answers? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Sun, 2007-09-09 at 07:55 -0400, Bruce Momjian wrote: > Simon Riggs wrote: > > On Sat, 2007-09-08 at 23:13 -0400, Bruce Momjian wrote: > > > In summary, I feel we have the HOT mechanics down well, but the open > > > issue is _when_ to activate each operation. > > > > That has been the only open issue for months. The interplay of behaviour > > is where HOT succeeds and fails. Some behaviours sound like they will be > > good, but aren't. > > Looking at the patch, it seems defragmentation is checked every time a > heap page is pinned in heapam.c by calling heap_page_prune_defrag(). If > page free space < 1.2 * average_tuple_size the page is defragmented. It > would seem defragmentation is checked as much as possible. Seems > pruning is also as aggressive as possible. I think the factor should be 1.0. A Setting of 1.2 will lead to twice as many defragmentation attempts when we have a fixed row length. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
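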
Bruce Momjian wrote: > Tom Lane wrote: >> The we-already-pinned-the-page problem is a bit nasty but may not be >> insurmountable. If you find a way, that would be great. > As I understand it, there are two HOT features: > > Single-chain pruning, which trims HOT chains but doesn't reuse > the space > > Defragementation, which prunes the entire page and reuses space > and handles deleted rows, etc. I'm repeating myself, but I would split that second operation into two parts while we think about this: pruning the entire page, and defragmenting (PageRepairFragmentation). Tom wondered why they're separated in the patch. As the patch stands, there is no reason, but I feel that separating them and doing them at different times might be an important piece in the puzzle. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Mon, 2007-09-10 at 09:03 +0100, Heikki Linnakangas wrote: > > As I understand it, there are two HOT features: > > > > Single-chain pruning, which trims HOT chains but doesn't reuse > > the space > > > > Defragementation, which prunes the entire page and reuses space > > and handles deleted rows, etc. > > I'm repeating myself, but I would split that second operation into two > parts while we think about this: pruning the entire page, and > defragmenting (PageRepairFragmentation). Tom wondered why they're > separated in the patch. As the patch stands, there is no reason, but I > feel that separating them and doing them at different times might be an > important piece in the puzzle. Yes, it does seem important to keep that separation, since it is possible to do these things at different times, even if, for now, we choose not to. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
On 9/10/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
I'm repeating myself, but I would split that second operation into two
parts while we think about this: pruning the entire page, and
defragmenting (PageRepairFragmentation). Tom wondered why they're
separated in the patch. As the patch stands, there is no reason, but I
feel that separating them and doing them at different times might be an
important piece in the puzzle.
I think we all agree that defragmentation should be done less aggressively,
ideally when we are going to insert a tuple in the page and are running
low on free space in the page. If we could really do it in heap_update(), after
we figure out that there is not enough free space available in the page,
that would be great. We already have an exclusive lock on the page and hopefully
we also have the vacuum lock. (This is how an earlier version of the patch used
to work, before we took out all the complexity associated with tracking
LP_DELETEd tuples and tracking fragmented tuples and replaced it with simple
PageRepairFragmentation. Since in the earlier version we never used to call
PageRepairFragmentation, we could easily do pruning in heap_update().)
If we can't find a way to defragment inside heap_update(), then we have three
choices:
1. Pass a hint to heap_fetch that a UPDATE may follow. If we are running low
on free space, we try to prune and defragment at that point.
2. Move some of the work to bgwriter.
3. Use some heuristic to decide whether to defragment or not.
I am not sure how easy it is to know a priori that we are fetching a tuple
for UPDATE. Assuming we can, this seems like a good idea.
Eventually we may want to move some work to bgwriter or some other
background process. But my take would be to save that for 8.4.
As a starting point, we are doing 3 in the patch. We always try to
keep just one tuple's worth of space free in the page. So if an UPDATE
takes up the remaining free space in the page, the page will be
pruned/defragged in the subsequent lookup (index or seq). In a read-mostly scenario,
only the first lookup would trigger prune/defrag, and the next many lookups
would be quick. In an update-mostly scenario, most of the heap_fetches
would end up doing an update anyway, so doing pruning/defragmentation
in heap_fetch shouldn't be too bad. For a balanced scenario, we might
be moving the cost of pruning/defragmentation to the SELECT path, but
UPDATEs would be quick.
The other issue is when to prune the HOT chain. I tend to agree that
the chances of having long HOT chains are low and the cost of following the
chain won't be too significant. At the same time, we probably don't want to live
with very long chains forever. An example would be: a tuple gets HOT updated
several times in a transaction. If the page is never updated again, the long HOT
chain would never be pruned. I like Simon's idea of pruning the chain if it
goes beyond a limit. Instead of adding any significant complexity to
track length of each HOT chain, we can just mark the page with a flag
if heap_hot_fetch() detects a long chain. The next lookup (index or seq)
of the page will try to prune the page and clear the flag if successful.
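One way the "mark the page, prune it on the next lookup" idea could look, with an invented pd_flags bit and threshold; this is not what patch v15 does, only an illustration of the suggestion:

    #include <stdbool.h>
    #include <stdint.h>

    #define SKETCH_PD_PRUNE_HINT   0x0004   /* hypothetical pd_flags bit */
    #define SKETCH_CHAIN_THRESHOLD 16       /* "long chain" cutoff, also hypothetical */

    typedef struct SketchPage
    {
        uint16_t pd_flags;
        /* ... line pointers and tuple data would follow ... */
    } SketchPage;

    /* Called from the fetch path after walking a HOT chain of chain_len tuples. */
    static void
    sketch_note_chain_length(SketchPage *page, int chain_len)
    {
        if (chain_len > SKETCH_CHAIN_THRESHOLD)
            page->pd_flags |= SKETCH_PD_PRUNE_HINT;     /* cheap: just a hint bit */
    }

    /* Checked on the next lookup of the page; pruning clears the hint on success. */
    static bool
    sketch_should_prune(const SketchPage *page)
    {
        return (page->pd_flags & SKETCH_PD_PRUNE_HINT) != 0;
    }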
I also agree that we should avoid taking the exclusive lock before checking
free space in heap_page_prune_defrag(). That might be unsafe, but we
don't care if we occasionally skip the maintenance work (or do it a
little early).
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
Bruce Momjian wrote:
> (Can someone time the access time for following a chain that fills an
> entire page (the worst case) vs. having a single tuple on the page?)

Here are some results, on my laptop. The test case is a table with a single integer column, and a single row in the table. The row is updated 250 times to make a 251-tuple-long chain. The page can hold 255 tuples, so this is pretty much the worst case. After that, the row is fetched 1000000 times, in a PL/pgSQL function. I ran the tests with enable_seqscan on and off, see "seqscan" and "idxscan" rows below. Numbers are the times spent doing the fetches, in seconds. I repeated each test a few times to make sure that the results are repeatable; they seem to be repeatable to ~0.1s precision. The test script used is attached. Autovacuum was disabled, and shared_buffers=320MB; otherwise all settings were left to defaults.

              HEAD    HOT     HOT-opt    HOT-pruned
    seqscan   19.9    21.1    20.1       11.5
    idxscan   27.8    31.4    30.4       13.7

Explanations of the columns:

    HEAD:        CVS HEAD
    HOT-pruned:  CVS HEAD + HOT patch v15
    HOT:         CVS HEAD + HOT patch v15, but with heap_page_prune_defrag
                 short circuited to do nothing
    HOT-opt:     CVS HEAD + HOT patch v15, but with static XidInMVCCSnapshot
                 like in CVS HEAD

I didn't expect a difference in seqscan performance between HEAD and HOT. I oprofiled it, and figured out that it's because the HOT patch removed the static qualifier from XidInMVCCSnapshot, because it's needed in plancat.c. I changed it back to static, dummying out the call in plancat.c, and the results are now closer to each other (HOT-opt column).

Comparing the idxscan columns, it looks like following the chain *is* more expensive than having to go through killed index pointers. Pruning clearly does help.

Given that this test is pretty much the worst case scenario, I'm ok with not pruning for the purpose of keeping chains short.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com

set enable_seqscan=on;

DROP TABLE hottest;
CREATE TABLE hottest (id integer);
INSERT INTO hottest VALUES (1);
CREATE INDEX i_hottest ON hottest(id);

-- Fill the table
CREATE OR REPLACE FUNCTION longchain (n integer) RETURNS void LANGUAGE PLPGSQL AS $$
DECLARE
  i integer;
BEGIN
  FOR i IN 1..n LOOP
    UPDATE hottest SET id=id WHERE id = 1;
  END LOOP;
END;
$$;

SELECT longchain(250);

CREATE OR REPLACE FUNCTION fetchchain (n integer) RETURNS void LANGUAGE PLPGSQL AS $$
DECLARE
  i integer;
  foo integer;
BEGIN
  FOR i IN 1..n LOOP
    SELECT id INTO foo FROM hottest WHERE id = 1;
  END LOOP;
END;
$$;

\timing
SELECT fetchchain(1000000);
\timing
Heikki Linnakangas wrote: > Explanations of the columns: > > HEAD: CVS HEAD > HOT-pruned: CVS HEAD + HOT patch v15 > HOT: CVS HEAD + HOT patch v15, but with heap_page_prune_defrag short > circuited to do nothing > HOT-opt: CVS HEAD + HOT patch v15, but with static XidInMVCCSnapshot > like in CVS HEAD One correction: HOT-opt was also short circuited like HOT. IOW, HOT and HOT-opt are the same, except for the static qualifier in XidInMVCCSnapshot -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 9/6/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> When I suggested that we get rid of the LP_DELETE flag for heap tuples,
> the tuple-level fragmentation and all that, and just take the vacuum
> lock and call PageRepairFragmentation, I was thinking that we'd do it in
> heap_update and only when we run out of space on the page. But as Greg
> said, it doesn't work because you're already holding a reference to at
> least one tuple on the page, the one you're updating, by the time you
> get to heap_update. That's why I put the pruning code to heap_fetch
> instead. Yes, though the amortized cost is the same, it does push the
> pruning work to the foreground query path.
The amortized cost is only "the same" if every heap_fetch is associated
with a heap update. I feel pretty urgently unhappy about this choice.
Have you tested the impact of the patch on read-mostly workloads?
For read-mostly workloads, only the first SELECT after an UPDATE
would trigger pruning/defragmentation. heap_page_prune_defrag()
would be a no-op for subsequent SELECTs (PageIsPrunable() would
be false until the next UPDATE)
I think Heikki's recent test confirms this.
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
On Mon, 2007-09-10 at 12:17 +0100, Heikki Linnakangas wrote: > Bruce Momjian wrote: > > (Can someone time the access time for following a chain that fills an > > entire page (the worst case) vs. having a single tuple on the page?) > > Here is some results, on my laptop. > HEAD HOT HOT-opt HOT-pruned > seqscan 19.9 21.1 20.1 11.5 > idxscan 27.8 31.4 30.4 13.7 > > Comparing the idxscan columns, it looks like following the chain *is* > more expensive than having to go through killed index pointers. Pruning > clearly does help. > > Given that this test is pretty much the worst case scenario, I'm ok with > not pruning for the purpose of keeping chains short. I wasn't expecting that result and had accepted the counter argument. IMHO it overturns the argument, or at least begs more test results. Perhaps we need to compare this with a situation where we do 50% reads and 50% writes where we try to prune aggressively on reads. There are clearly big wins to be had in comparison with HEAD, but also the possibility of regressions (in multiple directions) for various worst cases. ISTM that if we made 1 in every 16 index scans prune chains longer than N (=16?) then we would avoid worst cases like this for almost no additional (amortized) cost. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
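Simon's rate-limiting suggestion could be sketched as a single test: prune only on a fraction of index scans, and only when the chain that was just walked is long. The modulus and the threshold are the numbers floated in this thread, nothing more:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t SketchTransactionId;

    /*
     * Should this index scan try to prune the chain it just walked?
     * Roughly "1 in every 16 scans, and only for chains longer than 16
     * tuples", per the suggestion above.
     */
    static bool
    sketch_should_prune_chain(SketchTransactionId xid, int chain_len)
    {
        return chain_len > 16 && (xid % 16) == 0;
    }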
On 9/8/07, Bruce Momjian <bruce@momjian.us> wrote:
Would someone tell us exactly when pruning and defragmenting happens in
the current version of the patch? If we don't nail this issue down soon
PostgreSQL 8.3 is going to sail without this patch.
I guess you already have the answers by now, but let me repeat here:
Until patch version 14, defragmentation and pruning were separate
operations, though we used to invoke both at the same time. The only
difference was that pruning can work with a simple exclusive lock
whereas defragmentation needs vacuum-strength lock. Tom argued
that it's not worth the complexity, so I changed that and combined
the two operations. What it means is that pruning now also needs a
vacuum-strength lock and is always followed by defragmentation.
So as the patch stands (version 15), we call heap_page_prune_defrag()
inside heap_release_fetch() and heapgetpage(), apart from the vacuum
loop. It checks for PageIsPrunable() before proceeding. PageIsPrunable
is set when a tuple is updated/deleted. So for read-mostly workloads
heap_page_prune_defrag will mostly be no-op.
If PageIsPrunable is set, cleanup lock is available (exclusive lock is tried
conditionally, so we don't wait if there is contention) and we are running
low on free space (right now if free space is less than 1.2 times the average
tuple size or less than BLCKSZ/8), the entire page is pruned and fragmentation
is repaired.
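Pulling the conditions in this mail together, the gate in front of the prune/defrag work can be sketched as below. The thresholds (1.2 times the average request size, BLCKSZ/8) are the ones quoted above; the struct and the cleanup-lock parameter are simplified stand-ins rather than the patch's actual buffer-manager calls:

    #include <stdbool.h>
    #include <stddef.h>

    #define SKETCH_BLCKSZ 8192

    typedef struct SketchPageState
    {
        bool   prunable;           /* PageIsPrunable(): an UPDATE/DELETE touched it */
        size_t free_space;         /* current pd_upper - pd_lower */
        size_t avg_request_size;   /* tracked average tuple request size */
    } SketchPageState;

    static bool
    sketch_maybe_prune_defrag(SketchPageState *page, bool got_cleanup_lock)
    {
        if (!page->prunable)
            return false;           /* read-mostly case: cheap no-op */

        bool low_space = page->free_space < page->avg_request_size * 12 / 10 ||
                         page->free_space < SKETCH_BLCKSZ / 8;

        if (!low_space)
            return false;           /* still enough room, don't bother */

        if (!got_cleanup_lock)
            return false;           /* lock is tried conditionally; skip on contention */

        /* ... prune all chains on the page, then repair fragmentation ... */
        page->prunable = false;
        return true;
    }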
We also prune-defrag is vacuum lazy and vacuum full. But I assume we
are not worried about these maintenance activities.
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
On 9/10/07, Simon Riggs <simon@2ndquadrant.com> wrote:
On Mon, 2007-09-10 at 12:17 +0100, Heikki Linnakangas wrote:
> Bruce Momjian wrote:
> > (Can someone time the access time for following a chain that fills an
> > entire page (the worst case) vs. having a single tuple on the page?)
>
> Here is some results, on my laptop.
> HEAD HOT HOT-opt HOT-pruned
> seqscan 19.9 21.1 20.1 11.5
> idxscan 27.8 31.4 30.4 13.7
>
> Comparing the idxscan columns, it looks like following the chain *is*
> more expensive than having to go through killed index pointers. Pruning
> clearly does help.
>
> Given that this test is pretty much the worst case scenario, I'm ok with
> not pruning for the purpose of keeping chains short.
I wasn't expecting that result and had accepted the counter argument.
Yes, I agree with Simon. I am also surprised that HOT-pruned did
so well in this setup. I always thought that HOT would do well
in update-intensive scenarios, but from the results it seems that
HOT is also doing well for read-mostly queries.
In this particular example, the first SELECT after the 250 UPDATEs
would have pruned all the dead tuples and reduced HOT chain
to a single tuple. Hence the total time for subsequent SELECTs improved
tremendously.
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote:
>               HEAD    HOT     HOT-opt    HOT-pruned
>     seqscan   19.9    21.1    20.1       11.5
>     idxscan   27.8    31.4    30.4       13.7
>
> Explanations of the columns:
>
> HEAD: CVS HEAD
> HOT-pruned: CVS HEAD + HOT patch v15
> HOT: CVS HEAD + HOT patch v15, but with heap_page_prune_defrag short
> circuited to do nothing
> HOT-opt: CVS HEAD + HOT patch v15, but with static XidInMVCCSnapshot
> like in CVS HEAD
>
> I didn't expect a difference in seqscan performance between HEAD and
> HOT. I oprofiled it, and figured out that it's because the HOT patch removed
> the static qualifier from XidInMVCCSnapshot, because it's needed in
> plancat.c. I changed it back to static, dummying out the call in
> plancat.c, and the results are now closer to each other (HOT-opt column).
>
> Comparing the idxscan columns, it looks like following the chain *is*
> more expensive than having to go through killed index pointers. Pruning
> clearly does help.
>
> Given that this test is pretty much the worst case scenario, I'm ok with
> not pruning for the purpose of keeping chains short.

Wow, that is very helpful. You have shown that pruning a single chain is indeed a valuable operation. My guess is that once you prune a tuple you no longer have to check its visibility, and that is where the win is coming from.

If we check a tuple in a chain and the tuple is dead, is it possible the pruning operation is cheaper than having to check the tuple again for visibility the next time we see it? If so, we can just always prune when we see a dead tuple in the chain, which I believe was the original design. Pruning becomes an operation similar to marking an index entry as dead. (I assume pruning does not generate WAL traffic.)

--
Bruce Momjian  <bruce@momjian.us>          http://momjian.us
EnterpriseDB                               http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > My guess is that once you prune a tuple > you no longer have to check its visibility, and that is where the win is > coming from. Yes, I think that's right. Oh, one more thing occured to me. Without HOT, we not only mark index tuples pointing to dead tuples as killed, we remove them altogether if the index page gets full. If you modify the test case so that after doing the updates, you insert a bunch of tuples with a different key to fill the index page, you should see CVS HEAD winning HOT without pruning hands down. > If we check a tuple in a chain and the tuple is dead is it possible the > pruning operation is cheaper than having to check the tuple again for > visibility the next time we see it? If so, we can just always prune > when we see a dead tuple in the chain, which I believe was the original > design. Pruning becomes an operation similar to marking an index entry > as dead. (I assuming pruing does not generate WAL traffic.) Pruning does generate a WAL record at the moment. Maybe you could do some kind of a quick pruning without a WAL record. Like just modify the ctid of the oldest dead tuple in the chain, or the redirecting line pointer if there is one, to point to the latest live tuple, without removing the dead tuples or the line pointers. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
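A sketch of the quick pruning idea: walk the chain, find the first member that is not safely dead, and overwrite only the two-byte offset in the root (or redirecting) pointer so it points there, leaving the dead tuples and their line pointers alone so nothing needs to be WAL-logged. Types and names are simplified stand-ins, not the patch's data structures:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint16_t SketchOffsetNumber;

    typedef struct SketchChainMember
    {
        bool               dead;   /* safely dead to every snapshot? */
        SketchOffsetNumber next;   /* ctid offset of the next version, 0 = none */
    } SketchChainMember;

    /*
     * Quick prune: find the first non-dead member of the chain and make the
     * root/redirect offset point straight at it.  Dead members and their
     * line pointers are left untouched (so no line pointer can be reused and
     * nothing needs to be WAL-logged); only a 2-byte, aligned offset changes.
     */
    static void
    sketch_quick_prune(const SketchChainMember *members, SketchOffsetNumber *root_offset)
    {
        SketchOffsetNumber off = *root_offset;

        while (off != 0 && members[off].dead)
            off = members[off].next;
        if (off != 0)
            *root_offset = off;     /* single aligned 2-byte store */
    }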
Pavan Deolasee wrote: > Until patch version 14, defragmentation and pruning were separate > operations, though we used to invoke both at the same time. The only > difference was that pruning can work with a simple exclusive lock > whereas defragmentation needs vacuum-strength lock. Tom argued > that its not worth the complexity, so I changed that and clubbed > both the operations together. What it means: pruning also now needs > vacuum-strength lock and is always followed by defragmentation. Interesting. See my previous post to Heikki stating we might need to change that based on Heikki test results. > So as the patch stands (version 15), we call heap_page_prune_defrag() > inside heap_release_fetch() and heapgetpage(), apart from the vacuum > loop. It checks for PageIsPrunable() before proceeding. PageIsPrunable > is set when a tuple is updated/deleted. So for read-mostly workloads > heap_page_prune_defrag will mostly be no-op. > > If PageIsPrunable is set, cleanup lock is available (exclusive lock is tried > conditionally, so we don't wait if there is contention) and we are running > low on free space (right now if free space is less than 1.2 times the > average > tuple size or less than BLCKSZ/8), the entire page is pruned and > fragmentation > is repaired. Looking at the patch I see: + /* + * Mark the page as clear of prunable tuples. If we find a tuple which + * may become prunable, we shall set the hint again. + */ + PageClearPrunable(page); I like the idea of the page hint bit, but my question is if there is a long-running transaction, isn't each SELECT going to try to defragment a page over and over again because there is still something prunable on the page? I think we need to find out how hard it would be to try the defragmentation only on INSERT or UPDATE. The hint bit might still be useful. > We also prune-defrag is vacuum lazy and vacuum full. But I assume we > are not worried about these maintenance activities. Right. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
"Bruce Momjian" <bruce@momjian.us> writes: > Looking at the patch I see: > > + /* > + * Mark the page as clear of prunable tuples. If we find a tuple which > + * may become prunable, we shall set the hint again. > + */ > + PageClearPrunable(page); > > I like the idea of the page hint bit, but my question is if there is a > long-running transaction, isn't each SELECT going to try to defragment a > page over and over again because there is still something prunable on > the page? Well it'll try to prune the chains over and over again. If it doesn't find anything it won't defragment, but yes. I think we could tackle that by storing on the page the oldest xmax which would allow us to prune a tuple. Then you could compare RecentGlobalXmin against that and not bother looking at any other chains if it hasn't been passed yet. It would be hard to do that with single-chain pruning though, once you the limiting tuple you would then wouldn't know what the next limiting xmax is until the next time you do a full-page prune. Still that gets us down to at most two full-page prunes per update instead of a potentially unbounded number of prunes. This seems like a further optimization to think about after we have a place to trigger the pruning where it'll do the most good. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Bruce Momjian wrote: > Looking at the patch I see: > > + /* > + * Mark the page as clear of prunable tuples. If we find a tuple which > + * may become prunable, we shall set the hint again. > + */ > + PageClearPrunable(page); > > I like the idea of the page hint bit, but my question is if there is a > long-running transaction, isn't each SELECT going to try to defragment a > page over and over again because there is still something prunable on > the page? Maybe that risk could be lowered if instead of a flag, we stored the minimal global xmin needed to prune at least one tuple. greetings, Florian Pflug
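The idea Greg and Florian describe, remembering per page the oldest xid that must fall behind RecentGlobalXmin before anything on the page becomes prunable, might look roughly like this. The field name pd_prune_xid and the helper are invented for the example, and the xid comparison glosses over wraparound handling:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t SketchTransactionId;

    typedef struct SketchPageHeader
    {
        /* hypothetical extra field: oldest xid whose passing makes a tuple prunable */
        SketchTransactionId pd_prune_xid;
    } SketchPageHeader;

    /* Simplified "xid a precedes xid b"; real code must handle special xids too. */
    static bool
    sketch_xid_precedes(SketchTransactionId a, SketchTransactionId b)
    {
        return (int32_t) (a - b) < 0;
    }

    /*
     * Cheap test before doing any per-chain work: if RecentGlobalXmin has not
     * yet passed pd_prune_xid, a prune attempt cannot remove anything, so
     * don't even look at the chains.
     */
    static bool
    sketch_page_worth_pruning(const SketchPageHeader *phdr,
                              SketchTransactionId recent_global_xmin)
    {
        return sketch_xid_precedes(phdr->pd_prune_xid, recent_global_xmin);
    }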
On Mon, 2007-09-10 at 18:23 +0100, Heikki Linnakangas wrote: > > If we check a tuple in a chain and the tuple is dead is it possible the > > pruning operation is cheaper than having to check the tuple again for > > visibility the next time we see it? If so, we can just always prune > > when we see a dead tuple in the chain, which I believe was the original > > design. Pruning becomes an operation similar to marking an index entry > > as dead. (I assuming pruing does not generate WAL traffic.) > > Pruning does generate a WAL record at the moment. Maybe you could do > some kind of a quick pruning without a WAL record. Like just modify the > ctid of the oldest dead tuple in the chain, or the redirecting line > pointer if there is one, to point to the latest live tuple, without > removing the dead tuples or the line pointers. Sounds like a great idea. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
On 9/10/07, Florian Pflug <fgp.phlo.org@gmail.com> wrote:
Maybe that risk could be lowered if instead of a flag, we stored the
minimal global xmin needed to prune at least one tuple.
I like the idea. The question is whether the chances of a prunable
page being looked up again and again in the presence of a long-running
transaction are high enough to justify adding 4 bytes
to the page header.
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
On 9/10/07, Heikki Linnakangas <heikki@enterprisedb.com > wrote:
Oh, one more thing occured to me. Without HOT, we not only mark index
tuples pointing to dead tuples as killed, we remove them altogether if
the index page gets full. If you modify the test case so that after
doing the updates, you insert a bunch of tuples with a different key to
fill the index page, you should see CVS HEAD winning HOT without pruning
hands down.
Um. I am not sure I follow that. With HOT even if the HOT chain is long,
there is still just one index entry for the entire chain. So I don't see
why CVS HEAD would do better than HOT in the above case. Net-net
there will be equal number of index keys after the inserts.
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > Bruce Momjian wrote: > > My guess is that once you prune a tuple > > you no longer have to check its visibility, and that is where the win is > > coming from. > > Yes, I think that's right. > > Oh, one more thing occured to me. Without HOT, we not only mark index > tuples pointing to dead tuples as killed, we remove them altogether if > the index page gets full. If you modify the test case so that after > doing the updates, you insert a bunch of tuples with a different key to > fill the index page, you should see CVS HEAD winning HOT without pruning > hands down. > > > If we check a tuple in a chain and the tuple is dead is it possible the > > pruning operation is cheaper than having to check the tuple again for > > visibility the next time we see it? If so, we can just always prune > > when we see a dead tuple in the chain, which I believe was the original > > design. Pruning becomes an operation similar to marking an index entry > > as dead. (I assuming pruing does not generate WAL traffic.) > > Pruning does generate a WAL record at the moment. Maybe you could do > some kind of a quick pruning without a WAL record. Like just modify the > ctid of the oldest dead tuple in the chain, or the redirecting line > pointer if there is one, to point to the latest live tuple, without > removing the dead tuples or the line pointers. I am wondering what you even mean by removing the dead tuples when pruning. I thought only defragmentation removed tuples. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Gregory Stark wrote: > "Bruce Momjian" <bruce@momjian.us> writes: > > > Looking at the patch I see: > > > > + /* > > + * Mark the page as clear of prunable tuples. If we find a tuple which > > + * may become prunable, we shall set the hint again. > > + */ > > + PageClearPrunable(page); > > > > I like the idea of the page hint bit, but my question is if there is a > > long-running transaction, isn't each SELECT going to try to defragment a > > page over and over again because there is still something prunable on > > the page? > > Well it'll try to prune the chains over and over again. If it doesn't find > anything it won't defragment, but yes. > > I think we could tackle that by storing on the page the oldest xmax which > would allow us to prune a tuple. Then you could compare RecentGlobalXmin > against that and not bother looking at any other chains if it hasn't been > passed yet. Yea, that might work if we have space on the page. > It would be hard to do that with single-chain pruning though, once you the > limiting tuple you would then wouldn't know what the next limiting xmax is > until the next time you do a full-page prune. Still that gets us down to at > most two full-page prunes per update instead of a potentially unbounded number > of prunes. > > This seems like a further optimization to think about after we have a place to > trigger the pruning where it'll do the most good. I was thinking if we could only try defragementing when we are doing an INSERT or UPDATE, then the defragment would either free enough space so we could stay on the page or the new row would go on another page and we would probably not return to that page again anytime soon. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On 9/11/07, Bruce Momjian <bruce@momjian.us > wrote:
Heikki Linnakangas wrote:
>
> Pruning does generate a WAL record at the moment. Maybe you could do
> some kind of a quick pruning without a WAL record. Like just modify the
> ctid of the oldest dead tuple in the chain, or the redirecting line
> pointer if there is one, to point to the latest live tuple, without
> removing the dead tuples or the line pointers.
I am wondering what you even mean by removing the dead tuples when
pruning. I thought only defragmentation removed tuples.
Pruning removes intermediate dead tuples by marking their line pointers
~LP_USED and redirecting the root line pointer to the first live/recently_dead
tuple in the chain. OTOH defragmentation moves tuples around to repair
fragmentation. IOW defragmentation is nothing but calling
PageRepairFragmentation on the page.
For example, if we have a HOT chain of 5 tuples:
1 --> 2 --> 3 --> 4 --> 5
of which, 1, 2 and 3 are DEAD, 4 is RECENTLY_DEAD and 5 is LIVE
At the end of pruning:
- line pointer of 1 is redirected to 4
- line pointers of 2 and 3 are marked ~LP_USED
- the offsets of 4 and 5 are unchanged
At the end of defragmentation:
- 4 and 5 are moved to the end of the page to create a larger
contiguous free space in the page.
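The example can be written out as a tiny state change on the line pointer array. This sketch encodes exactly the five-member chain above; the types are simplified stand-ins for the real line pointers, and defragmentation is only described in a comment because it just moves tuple data around:

    #include <stdint.h>

    enum { SK_UNUSED, SK_NORMAL, SK_REDIRECT };

    typedef struct SketchLp
    {
        int      state;      /* SK_UNUSED / SK_NORMAL / SK_REDIRECT */
        uint16_t target;     /* redirect target when SK_REDIRECT */
    } SketchLp;

    /*
     * Chain 1 -> 2 -> 3 -> 4 -> 5, where 1..3 are DEAD, 4 is RECENTLY_DEAD
     * and 5 is LIVE.  After pruning:
     *   lp[1] becomes a redirect to 4, lp[2] and lp[3] become unused,
     *   lp[4] and lp[5] keep their tuples at the old offsets.
     */
    static void
    sketch_prune_example(SketchLp lp[6])
    {
        lp[1].state = SK_REDIRECT;
        lp[1].target = 4;
        lp[2].state = SK_UNUSED;
        lp[3].state = SK_UNUSED;
        /* lp[4] and lp[5] untouched; their tuple data stays where it is */
    }

    /*
     * Defragmentation (PageRepairFragmentation) would then move the tuples
     * of 4 and 5 to the end of the page so the freed space becomes one
     * contiguous hole; the line pointer states above do not change.
     */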
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
Pavan Deolasee wrote: > > > Pruning does generate a WAL record at the moment. Maybe you could do > > > some kind of a quick pruning without a WAL record. Like just modify the > > > ctid of the oldest dead tuple in the chain, or the redirecting line > > > pointer if there is one, to point to the latest live tuple, without > > > removing the dead tuples or the line pointers. > > > > I am wondering what you even mean by removing the dead tuples when > > pruning. I thought only defragmentation removed tuples. > > > > > Pruning removes intermediate dead tuples by marking their line pointers > ~LP_USED and redirecting the root line pointer to the first > live/recently_dead > tuple in the chain. OTOH defragmentation moves tuples around to repair > fragmentation. IOW defragmentation is nothing but calling > PageRepairFragmentation on the page. > > For example, if we have a HOT chain of 5 tuples: > > 1 --> 2 --> 3 --> 4 --> 5 > > of which, 1, 2 and 3 are DEAD, 4 is RECENTLY_DEAD and 5 is LIVE > > At the end of pruning: > > - line pointer of 1 is redirected to 4 > - line pointers of 2 and 3 are marked ~LP_USED > - the offset of 4 and 5 is unchanged > > At the end of defragmentation: > > - 4 and 5 are moved to the end of the page to create a larger > contiguous free space in the page. Right. My point is that pruning only modifies item pointers. It does not change the actual heap tuples. In the quote above, how is Heikki's "quick pruning" different from the pruning you are describing? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > Pavan Deolasee wrote: >>>> Pruning does generate a WAL record at the moment. Maybe you could do >>>> some kind of a quick pruning without a WAL record. Like just modify the >>>> ctid of the oldest dead tuple in the chain, or the redirecting line >>>> pointer if there is one, to point to the latest live tuple, without >>>> removing the dead tuples or the line pointers. >>> I am wondering what you even mean by removing the dead tuples when >>> pruning. I thought only defragmentation removed tuples. >>> >>> >> Pruning removes intermediate dead tuples by marking their line pointers >> ~LP_USED and redirecting the root line pointer to the first >> live/recently_dead >> tuple in the chain. OTOH defragmentation moves tuples around to repair >> fragmentation. IOW defragmentation is nothing but calling >> PageRepairFragmentation on the page. >> >> For example, if we have a HOT chain of 5 tuples: >> >> 1 --> 2 --> 3 --> 4 --> 5 >> >> of which, 1, 2 and 3 are DEAD, 4 is RECENTLY_DEAD and 5 is LIVE >> >> At the end of pruning: >> >> - line pointer of 1 is redirected to 4 >> - line pointers of 2 and 3 are marked ~LP_USED >> - the offset of 4 and 5 is unchanged >> >> At the end of defragmentation: >> >> - 4 and 5 are moved to the end of the page to create a larger >> contiguous free space in the page. > > Right. My point is that pruning only modifies item pointers. It does > not change the actual heap tuples. In the quote above, how is Heikki's > "quick pruning" different from the pruning you are describing? In "quick pruning", you don't mark the line pointer as not used. Otherwise someone might reuse the line pointer for another tuple, which is a problem if you then crash. WAL replay would see an insert to a line pointer that's already in use (the quick pruning wouldn't be WAL logged), which would raise an error. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 9/11/07, Bruce Momjian <bruce@momjian.us> wrote:
Right. My point is that pruning only modifies item pointers. It does
not change the actual heap tuples. In the quote above, how is Heikki's
"quick pruning" different from the pruning you are describing?
I think the only difference is that the quick pruning does not mark
intermediate tuples ~LP_USED and hence we may avoid WAL logging.
Btw, I am not too enthusiastic about quick pruning because it leaves
behind dead heap-only tuples which are not reachable from any root
heap tuples. Not that we can't handle them, but I am worried about
making such changes right now unless we are sure about the benefits.
We can always tune and tweak in 8.4
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
On Mon, 2007-09-10 at 20:46 +0100, Heikki Linnakangas wrote: > Bruce Momjian wrote: > > Pavan Deolasee wrote: > >> At the end of pruning: > >> > >> - line pointer of 1 is redirected to 4 > >> - line pointers of 2 and 3 are marked ~LP_USED > >> - the offset of 4 and 5 is unchanged > >> > > In the quote above, how is Heikki's > > "quick pruning" different from the pruning you are describing? > In "quick pruning", you don't mark the line pointer as not used. > Otherwise someone might reuse the line pointer for another tuple, which > is a problem if you then crash. WAL replay would see an insert to a line > pointer that's already in use (the quick pruning wouldn't be WAL > logged), which would raise an error. That seems like the best way forward to me. It's just a hint that says where the first live tuple is in the chain. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
On Tue, 2007-09-11 at 01:22 +0530, Pavan Deolasee wrote: > I think the only difference is that the quick pruning does not mark > intermediate tuples ~LP_USED and hence we may avoid WAL logging. Sounds great. > Btw, I am not too enthusiastic about quick pruning because it leaves > behind dead heap-only tuples which are not reachable from any root > heap tuples. Not that we can't handle them, but I am worried about > making such changes right now unless we are sure about the benefits. > We can always tune and tweak in 8.4 ...but we don't want to reach them. They are waiting to be removed. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes: > On 9/11/07, Bruce Momjian <bruce@momjian.us> wrote: > >> Right. My point is that pruning only modifies item pointers. It does >> not change the actual heap tuples. In the quote above, how is Heikki's >> "quick pruning" different from the pruning you are describing? > > I think the only difference is that the quick pruning does not mark > intermediate tuples ~LP_USED and hence we may avoid WAL logging. > > Btw, I am not too enthusiastic about quick pruning because it leaves > behind dead heap-only tuples which are not reachable from any root > heap tuples. Not that we can't handle them, but I am worried about > making such changes right now unless we are sure about the benefits. > We can always tune and tweak in 8.4 You could mark such tuples with LP_DELETE. That would also let other transactions quickly tot up how much space would be available if they were to run PageRepairFragmentation. But if you don't WAL log setting LP_DELETE then you would still have to deal with unreachable HOT tuples who lost their LP_DELETE flag. I suppose that as long as PageRepairFragmentation first looks for any dead HOT tuples without depending on LP_DELETE that would be enough. I wonder if you could do the quick prune without even taking the exclusive page lock? You're overwriting 16 bits and you know nobody else will be modifying any of the line pointers in question to anything else. (because your pin prevents the vacuum lock from coming in and trying to mark it unused). -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes: > Pruning removes intermediate dead tuples by marking their line pointers > ~LP_USED and redirecting the root line pointer to the first > live/recently_dead tuple in the chain. It seems utterly unsafe to do this without a vacuum-grade lock on the page. Someone might be examining the root tuple (eg, in a SnapshotAny scan) and I think they are entitled to assume that the line pointer stays valid while they are looking at the tuple. regards, tom lane
Simon Riggs <simon@2ndquadrant.com> writes: > On Tue, 2007-09-11 at 01:22 +0530, Pavan Deolasee wrote: >> I think the only difference is that the quick pruning does not mark >> intermediate tuples ~LP_USED and hence we may avoid WAL logging. > Sounds great. What it sounds is utterly unsafe. You can get away with not WAL-logging individual bit flips (that is, hint-bit-setting) because either state of the page is valid. If I read this proposal correctly it is to change t_ctid without WAL-logging, which means that a partial page write (torn page syndrome) could leave the page undetectably corrupted --- t_ctid is 6 bytes and could easily cross a hardware sector boundary. regards, tom lane
Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > On Tue, 2007-09-11 at 01:22 +0530, Pavan Deolasee wrote: > >> I think the only difference is that the quick pruning does not mark > >> intermediate tuples ~LP_USED and hence we may avoid WAL logging. > > > Sounds great. > > What it sounds is utterly unsafe. You can get away with not WAL-logging > individual bit flips (that is, hint-bit-setting) because either state of > the page is valid. If I read this proposal correctly it is to change > t_ctid without WAL-logging, which means that a partial page write (torn > page syndrome) could leave the page undetectably corrupted --- t_ctid > is 6 bytes and could easily cross a hardware sector boundary. OK, we certainly want to avoid WAL writes for pruning a single chain if possible. One idea would be to just mark the item pointer as dead but not repoint the first item pointer to point to the first live tuple. Heikki, would that give us speedups similar to what you saw in your earlier tests? It would if the majority of the cost of scanning the page looking for the first live tuple was in doing the visibility tests on every tuple. Basically, when we are walking the chain via an index and we see the tuple is dead we mark it as dead but don't repoint. We would repoint and reclaim on a defragementation. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On 9/11/07, Gregory Stark <stark@enterprisedb.com> wrote:
You could mark such tuples with LP_DELETE. That would also let other
transactions quickly tot up how much space would be available if they were to
run PageRepairFragmentation.
IMHO we are going around in circles here. We have already tried LP_DELETE techniques
and moved away from them to simplify things. We also tried reusing dead space without
running PageRepairFragmentation. Each of these techniques worked just fine
with slightly different performance characteristics. What we now have is a
simplified algorithm which is much easier to follow and is safer, yet gives
us a very good performance boost. I am not sure this is the right time to
throw in new ideas, because we would never be sure that what we are doing
is the most optimal solution. Would it help if we go with some
solution right now, get the rest of the review process done and then use
the feedback during beta testing to tune things? We may have far more
data points at that time to choose one technique over another. And we
would also know what areas to focus on.
I am also worried that by focusing too much on this issue we may
overlook some other correctness issue in the patch.
From what we have discussed so far, IMHO we should do the following
things and let the rest of the review process proceed:
- Defragment a page only when the free space left in the page is not
enough to accommodate even a single tuple (use the average tuple length for
this decision). This would mean we might be defragmenting even though
there is no immediate UPDATE to the page. But we can treat this like
fillfactor, which allows us to provision for the next UPDATE coming to
the page. Since we are defragmenting when the page is almost full,
hopefully we would reclaim a good amount of space in the page and
won't call defragmentation for the next few UPDATEs (see the sketch
after this list).
We already have a mechanism to track the average tuple request size in
the relcache. Maybe we can have some relcache invalidation to keep the
information in sync (send an invalidation when the average request size
changes by, say, 5%).
- Avoid pruning chains in every index or seq lookup. But if the chain
becomes longer than X tuples, mark the page to be pruned in the
next lookup. We can choose to separate prune and defragmentation
and only do pruning in this case. But I would prefer to keep them
together for now.
- Track the minimum xmin in the page header to avoid repeated
(wasted) attempts to prune a Prunable page in the presence of long running
transactions.
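A minimal sketch of the first item in the list above, assuming the average tuple length is available from something like the relcache mechanism mentioned: defragment only when one more average-sized tuple no longer fits. The names are illustrative, not the patch's:

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct SketchPageSpace
    {
        size_t free_space;        /* current pd_upper - pd_lower */
        size_t avg_tuple_len;     /* average tuple length for this relation */
    } SketchPageSpace;

    /*
     * Proposed policy: defragment only if even a single average-sized tuple
     * would no longer fit.  This provisions the page for the next UPDATE
     * without defragmenting on every lookup.
     */
    static bool
    sketch_needs_defrag(const SketchPageSpace *p)
    {
        return p->free_space < p->avg_tuple_len;
    }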
We can save the rest of the techniques for the beta testing period or 8.4.
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
Pavan Deolasee wrote: > IMHO we are making full circles here. We have already tried LP_DELETE > techniques > and moved away to simplify things. We also tried reusing dead space without > running PageRepairFragmentation. Each of these techniques worked just fine > with slightly different performance characteristics. What we now have is a > simplified algorithm which is much easier to follow and is safer, yet giving > us a very good performance boost. I am not sure if this is the right time to > throw new ideas because we would never be sure as what we are doing > would be the most optimal solution. Would it help if we go with some > solution right now, get rest of the review process done and then use > the feedback during beta testing to tune things ? We may have far more > data points at that time to choose one technique over other. And we > would also know what areas to focus on. I totally understand. I was just throwing out ideas to make sure I fully understood it and to explore various options. I do agree we should just wait for the review, knowing we have these trade-offs. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Mon, 2007-09-10 at 22:24 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > On Tue, 2007-09-11 at 01:22 +0530, Pavan Deolasee wrote: > >> I think the only difference is that the quick pruning does not mark > >> intermediate tuples ~LP_USED and hence we may avoid WAL logging. > > > Sounds great. > > What it sounds is utterly unsafe. You can get away with not WAL-logging > individual bit flips (that is, hint-bit-setting) because either state of > the page is valid. If I read this proposal correctly it is to change > t_ctid without WAL-logging, which means that a partial page write (torn > page syndrome) could leave the page undetectably corrupted --- t_ctid > is 6 bytes and could easily cross a hardware sector boundary. We can calculate where they are, if thats the objection. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
On Tue, 2007-09-11 at 09:21 +0530, Pavan Deolasee wrote: > IMHO we are making full circles here. We have already tried LP_DELETE > techniques and moved away to simplify things. > We can save rest of the techniques for beta testing period or 8.4. Agreed, we're just fine tuning by reinventing old ideas. HOT works, but everybody has their own opinion of which subtle factors are important. Seems like time to find out in Beta. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > What it sounds is utterly unsafe. You can get away with not WAL-logging > individual bit flips (that is, hint-bit-setting) because either state of > the page is valid. If I read this proposal correctly it is to change > t_ctid without WAL-logging, which means that a partial page write (torn > page syndrome) could leave the page undetectably corrupted --- t_ctid > is 6 bytes and could easily cross a hardware sector boundary. Well we would never be overwriting the blockid, only the posid which is 2 bytes. And the ctid (and posid) should always be 4-byte aligned. So actually it would never cross a hardware sector boundary. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: >> On Tue, 2007-09-11 at 01:22 +0530, Pavan Deolasee wrote: >>> I think the only difference is that the quick pruning does not mark >>> intermediate tuples ~LP_USED and hence we may avoid WAL logging. > >> Sounds great. > > What it sounds is utterly unsafe. You can get away with not WAL-logging > individual bit flips (that is, hint-bit-setting) because either state of > the page is valid. If I read this proposal correctly it is to change > t_ctid without WAL-logging, which means that a partial page write (torn > page syndrome) could leave the page undetectably corrupted --- t_ctid > is 6 bytes and could easily cross a hardware sector boundary. We're only changing the offsetnumber part of it, which is 2 bytes. That shouldn't cross a hardware sector boundary on any reasonable hardware. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
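For reference, the layout being argued about: a ctid has the same field split as ItemPointerData, four bytes of block number followed by a two-byte offset. The sketch below only restates that layout with local type names; whether the containing structures are always 4-byte aligned is the point under discussion above, not something the sketch settles:

    #include <stddef.h>
    #include <stdint.h>

    /* Same field split as ItemPointerData: 4 bytes of block id, 2 bytes of offset. */
    typedef struct SketchBlockId
    {
        uint16_t bi_hi;
        uint16_t bi_lo;
    } SketchBlockId;

    typedef struct SketchItemPointer
    {
        SketchBlockId ip_blkid;   /* never touched by quick pruning */
        uint16_t      ip_posid;   /* the only 2 bytes quick pruning overwrites */
    } SketchItemPointer;

    /*
     * ip_posid starts 4 bytes into the struct, so if the struct is at least
     * 2-byte aligned (4-byte aligned in the argument above), the 2-byte store
     * is aligned and cannot straddle a 512-byte sector.
     */
    _Static_assert(offsetof(SketchItemPointer, ip_posid) == 4,
                   "posid is the trailing 2 bytes");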
On 9/11/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
We're only changing the offsetnumber part of it, which is 2 bytes. That
shouldn't cross a hardware sector boundary on any reasonable hardware.
Not entirely true if we consider line pointer redirection, which involves
changing 4 bytes. You can argue that for quick pruning we don't do any
line pointer redirection though.
Also, my worry about dealing with unreachable heap-only dead tuples
remains. We may need to do far more work to identify that they are not part of
any reachable HOT chain and hence can be removed.
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
On 9/11/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> Pruning removes intermediate dead tuples by marking their line pointers
> ~LP_USED and redirecting the root line pointer to the first
> live/recently_dead tuple in the chain.
It seems utterly unsafe to do this without a vacuum-grade lock on the
page. Someone might be examining the root tuple (eg, in a SnapshotAny
scan) and I think they are entitled to assume that the line pointer
stays valid while they are looking at the tuple.
As the patch stands, we prune holding a vacuum-grade lock. But ISTM
that there are only limited users of SnapshotAny and we make them
recheck the line pointer after acquiring the buffer lock. So hopefully we
should be fine.
Anyway, it's not a concern right now because we hold a vacuum-grade lock
while pruning. But if we decide to do what Heikki/Greg suggested about
quick pruning, then we will have to make sure that your concern is addressed.
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
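The recheck pattern referred to above could look roughly like the sketch
below. This is illustrative only (the function name and exact placement are
not the patch's code); it just shows re-validating the line pointer under
the buffer content lock using the standard page-access helpers:

#include "postgres.h"
#include "access/htup.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"

/*
 * Illustrative sketch: re-fetch a tuple by TID, but only trust the line
 * pointer after the buffer content lock is held, so a concurrent prune
 * cannot pull it out from under us.
 */
static bool
refetch_tuple_checked(Buffer buffer, ItemPointer tid, HeapTuple tuple)
{
    Page    page;
    ItemId  lp;

    LockBuffer(buffer, BUFFER_LOCK_SHARE);
    page = BufferGetPage(buffer);
    lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid));

    if (!ItemIdIsUsed(lp))
    {
        /* the slot was pruned away; the caller treats the tuple as gone */
        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
        return false;
    }

    tuple->t_data = (HeapTupleHeader) PageGetItem(page, lp);
    tuple->t_len = ItemIdGetLength(lp);
    tuple->t_self = *tid;

    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
    return true;
}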
On 9/11/07, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
- Track the minimum xmin in the page header to avoid repeated
(wasted) attempts to prune a Prunable page in the presence of long running
transactions.
I would actually think twice before even doing this because this would
lead to a complete change in the heap page structure and stop people from
upgrading to 8.3 without a complete dump/restore. I don't remember 8.3
introducing any other significant change which already enforces that.
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
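To make the proposal concrete, the sketch below shows where such a hint
might sit, abridged from src/include/storage/bufpage.h; the added field and
its name (pd_prune_xid here) are purely placeholders for the idea being
discussed, not settled design:

typedef struct PageHeaderData
{
    XLogRecPtr      pd_lsn;         /* LSN of last change to this page */
    uint16          pd_tli;         /* TimeLineID (truncated) */
    uint16          pd_flags;       /* flag bits (new for 8.3) */
    LocationIndex   pd_lower;       /* offset to start of free space */
    LocationIndex   pd_upper;       /* offset to end of free space */
    LocationIndex   pd_special;     /* offset to start of special space */
    uint16          pd_pagesize_version;
    TransactionId   pd_prune_xid;   /* PROPOSED (placeholder name): oldest
                                     * xid that could make something on this
                                     * page prunable; InvalidTransactionId
                                     * if there is nothing worth re-checking */
    ItemIdData      pd_linp[1];     /* line pointer array starts here */
} PageHeaderData;

Such a field would be 4 bytes (a TransactionId), which is what the
compatibility discussion below is about.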
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes: > I would actually think twice before even doing this because this would > lead to complete change in heap page structure and stop people from > upgrading to 8.3 without a complete dump/restore. I don't remember 8.3 > introduces any other significant change which already enforces that. Then you don't remember far enough --- either the HeapTupleHeader change or the varvarlena change would be enough to prevent that. For that matter, the pd_flags header field that the patch is already relying on is new for 8.3. Adding an xmin field might or might not be a good solution, but to complain about it on compatibility grounds is silly. regards, tom lane
On 9/11/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> I would actually think twice before even doing this because this would
> lead to a complete change in the heap page structure and stop people from
> upgrading to 8.3 without a complete dump/restore. I don't remember 8.3
> introducing any other significant change which already enforces that.
Then you don't remember far enough --- either the HeapTupleHeader change
or the varvarlena change would be enough to prevent that. For that
matter, the pd_flags header field that the patch is already relying on
is new for 8.3.
Oops! I forgot the combo command id. That went into the last major release
of EDB, and that confused me. Regarding pd_flags, I thought somebody could
just fix that in place and avoid a dump/restore (of course, it's too
invasive, but isn't it feasible with pg_migrator?).
Anyway, given this, should we go for storing xmin in the page header?
For that matter, should there be a separate heap page header?
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
Tom Lane wrote:
> "Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
>> I would actually think twice before even doing this because this would
>> lead to a complete change in the heap page structure and stop people from
>> upgrading to 8.3 without a complete dump/restore. I don't remember 8.3
>> introducing any other significant change which already enforces that.
>
> Then you don't remember far enough --- either the HeapTupleHeader change
> or the varvarlena change would be enough to prevent that. For that
> matter, the pd_flags header field that the patch is already relying on
> is new for 8.3.

Yeah, those changes will need to be dealt with in the conversion. But
none of the changes this far have increased the on-disk size. Adding a
new field to page header means that in some corner cases it might not be
possible to convert a page from old format to the new one because the
data no longer fits.

But let's not hang ourselves on that. I'm sure there's a way to deal with
the page format conversion if we have to. The crucial design decision for
HOT is when to prune and when to defragment the page, so that when we're
doing the UPDATE, there's room in the page.

--
Heikki Linnakangas
EnterpriseDB  http://www.enterprisedb.com
"Heikki Linnakangas" <heikki@enterprisedb.com> writes: > Tom Lane wrote: >> Then you don't remember far enough --- either the HeapTupleHeader change >> or the varvarlena change would be enough to prevent that. For that >> matter, the pd_flags header field that the patch is already relying on >> is new for 8.3. > Yeah, those changes will need to be dealt with in the conversion. But > none of the changes this far have increased the on-disk size. Adding a > new field to page header means that in some corner cases it might not be > possible to convert a page from old format to the new one because the > data no longer fits. Counting on my fingers, it seems that the 4-byte savings in HeapTupleHeader size would guarantee that adding 4 bytes to the page header wouldn't kill you. regards, tom lane
Tom Lane wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> Then you don't remember far enough --- either the HeapTupleHeader change
>>> or the varvarlena change would be enough to prevent that. For that
>>> matter, the pd_flags header field that the patch is already relying on
>>> is new for 8.3.
>
>> Yeah, those changes will need to be dealt with in the conversion. But
>> none of the changes this far have increased the on-disk size. Adding a
>> new field to page header means that in some corner cases it might not be
>> possible to convert a page from old format to the new one because the
>> data no longer fits.
>
> Counting on my fingers, it seems that the 4-byte savings in
> HeapTupleHeader size would guarantee that adding 4 bytes to the page
> header wouldn't kill you.

It depends on the alignment of the data and the null bitmap. For example:
- a platform with 8-byte alignment
- the table has 9 columns
- every tuple on the page has a null bitmap
- there are zero bytes left on the page
- no tuples where varvarlena frees up space

The header size after alignment is MAXALIGN(23+2) = 32, which is the same
as before combo cid, MAXALIGN(27+2) = 32. It's a corner case, but it's
possible.

--
Heikki Linnakangas
EnterpriseDB  http://www.enterprisedb.com
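Spelling out that arithmetic, a quick stand-alone check (assuming an 8-byte
MAXIMUM_ALIGNOF and the simplified MAXALIGN definition below, which stands
in for the one in src/include/c.h) confirms that both header sizes round up
to 32 bytes in this scenario:

#include <stdio.h>
#include <stdint.h>

#define MAXIMUM_ALIGNOF 8
#define MAXALIGN(LEN) \
    (((uintptr_t) (LEN) + (MAXIMUM_ALIGNOF - 1)) & ~((uintptr_t) (MAXIMUM_ALIGNOF - 1)))

int
main(void)
{
    /* 8.3 tuple header (23 bytes) plus a 2-byte null bitmap for 9 columns */
    printf("8.3: %lu\n", (unsigned long) MAXALIGN(23 + 2));   /* prints 32 */

    /* pre-combo-cid header (27 bytes) plus the same 2-byte null bitmap */
    printf("8.2: %lu\n", (unsigned long) MAXALIGN(27 + 2));   /* prints 32 */

    return 0;
}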
Simon Riggs <simon@2ndquadrant.com> writes:
> On Tue, 2007-09-11 at 09:21 +0530, Pavan Deolasee wrote:
>> IMHO we are making full circles here. We have already tried LP_DELETE
>> techniques and moved away to simplify things.
>> We can save rest of the techniques for beta testing period or 8.4.

> Agreed, we're just fine tuning by reinventing old ideas. HOT works, but
> everybody has their own opinion of which subtle factors are important.
> Seems like time to find out in Beta.

I am worried by this idea that it's fine to make major changes in HOT
after beta starts, or that the attentions of beta testers will allow us
to collect data on variant implementations. Neither is the case.

			regards, tom lane
On Tue, 2007-09-11 at 16:10 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Tue, 2007-09-11 at 09:21 +0530, Pavan Deolasee wrote:
> >> IMHO we are making full circles here. We have already tried LP_DELETE
> >> techniques and moved away to simplify things.
>
> >> We can save rest of the techniques for beta testing period or 8.4.
>
> > Agreed, we're just fine tuning by reinventing old ideas. HOT works, but
> > everybody has their own opinion of which subtle factors are important.
> > Seems like time to find out in Beta.
>
> I am worried by this idea that it's fine to make major changes in HOT
> after beta starts, or that the attentions of beta testers will allow us
> to collect data on variant implementations. Neither is the case.

Understood.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
On 9/11/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
The crucial design
decision for HOT is when to prune and when to defragment the page, so
that when we're doing the UPDATE, there's room in the page.
Right.
For defragmentation, I am still inclined towards doing it when we see
that the free space is less than (or slightly more than) the average tuple
size of the table - unless we have a better solution.
For pruning, we can do the quick pruning. But I won't feel comfortable
doing it without an exclusive lock on the buffer. Also I would avoid
line pointer redirection during a quick prune. A simpler solution would
be to flag the page whenever a HOT chain becomes longer than, say, 5
(which can easily be detected in heap_hot_fetch) and prune it in the
next lookup. Hopefully we would only rarely have long HOT chains, and
if at all we have them, we will have some mechanism to recover from
them.
Any other ideas?
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
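As one way to picture the free-space heuristic described above, here is a
hedged sketch; page_defrag_needed and avg_tuple_size are illustrative names
rather than the patch's code, while PageGetFreeSpace is the standard
bufpage.h helper:

#include "postgres.h"
#include "storage/bufpage.h"

/*
 * Illustrative sketch: defragment when the page can no longer hold roughly
 * one more average-sized tuple, so that the next UPDATE of a tuple on this
 * page still has room to place the new version on the same page.
 */
static bool
page_defrag_needed(Page page, Size avg_tuple_size)
{
    return PageGetFreeSpace(page) < avg_tuple_size;
}

The chain-length idea would be an additional trigger on top of this,
flagging a page for pruning when a fetch walks a HOT chain past some
threshold.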
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes: > Please see the revised patches attached. I'm curious how much the WAL-recovery aspects of this patch have been tested, because heap_xlog_clean seems quite broken. You have apparently decided to redefine the WAL record format as using one-based rather than zero-based item offsets, which would be fine if any of the rest of the code had been changed to agree ... regards, tom lane
On 9/13/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm curious how much the WAL-recovery aspects of this patch have been
tested, because heap_xlog_clean seems quite broken.
There are quite a few crash recovery tests that one of our QA persons
(Dharmendra Goyal) has written. I can post them if necessary. We run
these tests very regularly.
Apart from these regression crash tests, I had mostly tested by
running lots of concurrent clients on UP/SMP machines, crashing
and recovering the database. We fixed quite a few issues with
these tests. I have tried crashing in the middle of UPDATEs/INSERTs/DELETEs
and VACUUM/VACUUM FULL.
You have apparently
decided to redefine the WAL record format as using one-based rather than
zero-based item offsets, which would be fine if any of the rest of the
code had been changed to agree ...
I know Heikki changed that when he did the initial refactoring, but I'm not
sure why. Maybe he wanted to make it more consistent.
But I don't think it's broken, because we collect the offsets in one-based
format in PageRepairFragmentation, WAL-log them in that format, and redo
them the same way. Am I missing something?
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
Pavan Deolasee wrote:
> On 9/13/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> You have apparently
>> decided to redefine the WAL record format as using one-based rather than
>> zero-based item offsets, which would be fine if any of the rest of the
>> code had been changed to agree ...
>
> I know Heikki changed that when he did the initial refactoring, but I'm not
> sure why. Maybe he wanted to make it more consistent.

Yeah, I just checked the work-in-progress patch I sent you back in July.
I refactored it to use one-based offsets for consistency, since I
modified log_heap_clean quite heavily anyway.

It's possible that I broke it in the process, I was only interested in
testing the performance characteristics of the simplified pruning scheme.

--
Heikki Linnakangas
EnterpriseDB  http://www.enterprisedb.com
On 9/13/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
Yeah, I just checked the work-in-progress patch I sent you back in July.
I refactored it to use one-based offsets for consistency, since I
modified log_heap_clean quite heavily anyway.
It's possible that I broke it in the process, I was only interested in
testing the performance characteristics of the simplified pruning scheme.
I don't think so -- at least I couldn't figure out why it's broken. But I
would take Tom's comment seriously and look into it more tomorrow.
Or we can just revert it back to zero-based offsets.
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes: > On 9/13/07, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> You have apparently >> decided to redefine the WAL record format as using one-based rather than >> zero-based item offsets, which would be fine if any of the rest of the >> code had been changed to agree ... >> > I know Heikki changed that when he did the initial refactoring, but not > sure why. May be he wanted to make it more consistent. > But I don't think its broken because we collect the offsets in one-based > format in PageRepairFragmentation, WAL log in that format and redo > the same way. Am I missing something ? Hmm, I had been thinking that vacuum.c and vacuumlazy.c worked directly with that info, but it looks like you're right, only PageRepairFragmentation touches that array. Never mind ... though my suspicions would probably not have been aroused if anyone had bothered to fix the comments. regards, tom lane
On 9/13/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Never mind ... though my
suspicions would probably not have been aroused if anyone had bothered
to fix the comments.
Yeah, my fault. I should have fixed that. Sorry about that.
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
Okay, something else (a real problem this time ;-)):

HeapTupleSatisfiesAbort is bogus because it has failed to track recent
changes in tqual.c.

Rather than fix it, though, I question why we need it at all. The only
use of it is in heap_prune_tuplechain() and ISTM that code is redundant,
wrong, or both.

In the case of a full-page prune, it's redundant because the tuple would
be found anyway by searching its chain from the root tuple. Indeed I
suspect that such a tuple is entered into the newly-dead list twice,
perhaps risking overflow of that array.

In the case of a single-chain prune, it still seems wrong since you'll
eliminate only one of what might be a chain with multiple aborted
entries. If it's OK to leave those other entries for future collection,
then it should be OK to leave this one too. If it's not OK then the
approach needs to be redesigned.

I'm fairly unclear on what the design intention is for recovering from
the case where the last item(s) on a HOT chain are aborted. Please
clarify.

			regards, tom lane
On 9/14/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
HeapTupleSatisfiesAbort is bogus because it has failed to track recent
changes in tqual.c.
Oh. I should have been aware.
Rather than fix it, though, I question why we need it at all. The only
use of it is in heap_prune_tuplechain() and ISTM that code is redundant,
wrong, or both.
In the case of a full-page prune, it's redundant because the tuple would
be found anyway by searching its chain from the root tuple. Indeed I
suspect that such a tuple is entered into the newly-dead list twice,
perhaps risking overflow of that array.
No, we would never reach an aborted dead tuple from the chain because
we would see a live tuple before that. Or the tuple preceding the aborted
one (the first aborted, if there are many) may have been updated again and
the chain never reaches the aborted tuples. That's the reason why we have
HeapTupleSatisfiesAbort: to collect any aborted heap-only tuples.
In the case of a single-chain prune, it still seems wrong since you'll
eliminate only one of what might be a chain with multiple aborted
entries. If it's OK to leave those other entries for future collection,
then it should be OK to leave this one too. If it's not OK then the
approach needs to be redesigned.
Again, since we check every heap-only tuple for the aborted-dead
case, we shall collect all such tuples. I think one place where you
may be getting confused (or IOW, where the code might have confused you ;-))
is that we never reach aborted dead heap-only tuples by
following the HOT chain from the root tuple, and that's why they
need special treatment.
I'm fairly unclear on what the design intention is for recovering from
the case where the last item(s) on a HOT chain are aborted. Please
clarify.
We don't do anything special. Such a tuple is never reached during
HOT chain following, because either we would see a visible tuple before
it or the older tuple might have been updated again and the chain
moved in a different direction. We only need some special treatment
to prune such a tuple, and hence the business of HeapTupleSatisfiesAbort.

Having said that, based on our recent discussion about pruning a chain
up to and including the latest DEAD tuple in the chain, ISTM that we can
get rid of giving any special treatment to aborted heap-only
tuples. Earlier we could not do so for a "HOT updated aborted heap-only"
tuple because we did not know if it is part of any valid HOT chain. Now
(assuming we change the pruning code to always prune everything including
the latest DEAD tuple) we guarantee that all DEAD tuples in the chain will
be pruned, and hence we can collect any DEAD tuple seen while pruning.

I am not sure if I have explained this well; maybe I should post an example.
Need to run now.
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
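Since the scenario may be easier to see with a concrete sequence, here is
one possible example (block/offset numbers and transaction ids are invented
for illustration):

1. Tuple A sits at line pointer 1 (the root of its chain), xmin = 100.
2. Transaction 105 HOT-updates A, creating heap-only tuple B at line
   pointer 2; A now has xmax = 105 and t_ctid = (blk, 2).
3. Transaction 105 aborts, so A is effectively live again, but its t_ctid
   still points at B.
4. Transaction 110 HOT-updates A again, creating heap-only tuple C at line
   pointer 3; A's xmax becomes 110 and its t_ctid is overwritten to
   (blk, 3). Transaction 110 commits.

Walking the chain from the root now goes A -> C and never visits B. The
aborted heap-only tuple B is not referenced by any chain, so the only way
to reclaim it during pruning is to notice directly that its xmin belongs to
an aborted transaction -- which is what the HeapTupleSatisfiesAbort check
was being used for.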