Thread: An idle thought
A few days ago there was a thread on one of our lists where someone was suprised it took as much i/o to delete data as it took to insert it in the first place. At first this does seem surprising but the fact that Postgres stores its transaction information inline with the data and does all i/o in blocks makes this inevitable. http://archives.postgresql.org/pgsql-performance/2010-03/msg00141.php However then I started thinking about this case and wondered if it wouldn't be possible to optimize. One of the suggested optimizations was to look at using TRUNCATE. But I wonder why it's necessary to use a dedicated command. Shouldn't it be possible for the system to notice this situation and do effectively the same thing itself? I'm picturing storing a bit in the visibility map indicating that *no* records are visible in a given page. If that bit is set then the page is virtually empty even if it contains data. If anyone tries to load the page they should just initialize a new page instead which will overwrite it. That way a big batch delete just has to set a bunch of flags in the visibility map to do the bulk delete. There are a couple problems with the way I've described this idea here. Firstly, if the deleting transaction hasn't committed yet or later aborts then of course the records need to still be visible. So either the visibility map would need an xmax for the page as a whole or it would need to only be set after the page has actually been vacuumed. Secondly there's the whole retail vacuum problem -- any index entries referring to this page would be left dangling unless there's some kind of retail vacuum or perhaps a page version number. I'm not 100% sure this idea is workable but if it is it might make batch deletes a lot less painful. Several orders of magnitude less i/o whenever a single transaction deletes many rows. -- greg
Greg Stark <stark@mit.edu> writes: > However then I started thinking about this case and wondered if it > wouldn't be possible to optimize. One of the suggested optimizations > was to look at using TRUNCATE. But I wonder why it's necessary to use > a dedicated command. Shouldn't it be possible for the system to notice > this situation and do effectively the same thing itself? Not unless you'd like DELETE to always acquire exclusive lock... > There are a couple problems with the way I've described this idea > here. Precisely because of the lack of lock. regards, tom lane
On Tue, 2010-03-16 at 15:29 +0000, Greg Stark wrote: > big batch delete Is one of the reasons for partitioning, allowing the use of truncate. -- Simon Riggs www.2ndQuadrant.com
On Tue, 2010-03-16 at 15:29 +0000, Greg Stark wrote: > I'm picturing storing a bit in the visibility map indicating that *no* > records are visible in a given page. I've been thinking for a while that we could store the visibility information in a structure separate from the heap -- sort of like the visibility map, but per-tuple and authoritative rather than a per-page hint. There are all kinds of challenges there, but it might be worth thinking about. Visibility information is highly compressible, and requires constant maintenance (updates, deletes, freezing, etc.). It also might make it possible to move to 64-bit xids, if we wanted to. Regards,Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > There are all kinds of challenges there, but it might be worth thinking > about. Visibility information is highly compressible, and requires > constant maintenance (updates, deletes, freezing, etc.). It also might > make it possible to move to 64-bit xids, if we wanted to. If you want it to be cheaply updatable (or even cheaply readable), compression is not what you're going to do. regards, tom lane
On Wed, 2010-03-17 at 17:20 -0400, Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: > > There are all kinds of challenges there, but it might be worth thinking > > about. Visibility information is highly compressible, and requires > > constant maintenance (updates, deletes, freezing, etc.). It also might > > make it possible to move to 64-bit xids, if we wanted to. > > If you want it to be cheaply updatable (or even cheaply readable), > compression is not what you're going to do. I didn't mean that we'd want to compress it to the absolute minimum size. I had envisioned that it would be a simple scheme designed only to eliminate long runs of identical visibility information (perhaps only the frozen and always visible regions would be compressed). The extra level of indirection would be slower, but if we freeze tuples more aggressively (which would be much cheaper if we didn't have to go to the heap), there might be a small number of tuples with interesting visibility information at any particular time. Regards,Jeff Davis
On Wed, 2010-03-17 at 14:09 -0700, Jeff Davis wrote: > I've been thinking for a while that we could store the visibility > information in a structure separate from the heap -- sort of like the > visibility map, but per-tuple and authoritative rather than a per-page > hint. A lot of people have been thinking that for a while, but I've yet to see a fully costed assessment of whether its worth doing. It might be. It seems easier to focus on outcomes rather than specifics, then order the possible solutions to those outcomes in cost-benefit sequence. -- Simon Riggs www.2ndQuadrant.com
On Thu, Mar 18, 2010 at 2:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Davis <pgsql@j-davis.com> writes:If you want it to be cheaply updatable (or even cheaply readable),
> There are all kinds of challenges there, but it might be worth thinking
> about. Visibility information is highly compressible, and requires
> constant maintenance (updates, deletes, freezing, etc.). It also might
> make it possible to move to 64-bit xids, if we wanted to.
compression is not what you're going to do.
regards, tom lane
+1..
<div class="gmail_quote"><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt0pt 0.8ex; padding-left: 1ex;"> Secondly there's the whole retail vacuum problem -- any<br /> index entries referringto this page would be left dangling unless<br /> there's some kind of retail vacuum or perhaps a page version number.<br/><br /></blockquote></div><br />The issue, we can divide into two<br />a)volatile functions <br />b)broken datatypes<br/><br />For a) I think volatile function issue can be resolved by using hidden columns in the heap itself. Thiswill create a duplication of data, but since the index will get created on it, it will always point to the right heaptuple<br /><br />For b) We are already suffering with this issue in any index lookups, index based updates/deletes, uniqueconstraints, referential integrity maintenance etc. So hopefully one day we can extend this list to include more :))<br/><br />Gokul.<br /><br />
I didn't mean that we'd want to compress it to the absolute minimum
size. I had envisioned that it would be a simple scheme designed only to
eliminate long runs of identical visibility information (perhaps only
the frozen and always visible regions would be compressed).
The extra level of indirection would be slower, but if we freeze tuples
more aggressively (which would be much cheaper if we didn't have to go
to the heap), there might be a small number of tuples with interesting
visibility information at any particular time.
Thanks,
Gokul.
simon@2ndQuadrant.com (Simon Riggs) writes: > On Tue, 2010-03-16 at 15:29 +0000, Greg Stark wrote: > >> big batch delete > > Is one of the reasons for partitioning, allowing the use of truncate. Sure, but it would be even nicer if DELETE could be thus made cheaper without needing to interfere with the schema. The concurrency issue might be resolved (*might!*) by the following complication... - A delete request is looking at a page, and concludes, "oh, all the tuples here are now marked dead!". - It flags the page as *possibly* dead. Almost what Greg suggests for the visibility map, but this is just marking it as"proposed dead." - It throws the page number, along with xid, into a side map. When something wants to do something with the page (e.g. - vacuum), it sees that it's "possibly dead," and looks at the "side map" for the list of xids that wanted to mark the page dead. for each xid: if xid is still active do nothing with it else remove xid entry from the map if all xids were failed remove flag from page if any xid committed empty the page; the tuples are all dead I'm less confident about that last "clause" - I *think* that if *any* page-clearing XID is found, that means the page is well and truly clear, doesn't it? The extra "map" mayn't be a nice thing. It's food for thought, anyways. -- let name="cbbrowne" and tld="linuxfinances.info" in String.concat "@" [name;tld];; The real problem with the the year 2000 is that there are too many zero bits and that adversely affects the global bit density. -- Boyd Roberts <boyd@france3.fr>
On Thu, 2010-03-18 at 14:29 +0530, Gokulakannan Somasundaram wrote: > If you want it to be cheaply updatable (or even cheaply > readable), > compression is not what you're going to do. > > regards, tom lane > > > > +1.. The visibility map itself is already an example of compression. If visibility information were randomly distributed among tuples, the visibility map would be nearly useless. Regards,Jeff Davis
The visibility map itself is already an example of compression. If
visibility information were randomly distributed among tuples, the
visibility map would be nearly useless.
I believe it is very difficult to make visibility map update friendly without compromising durability. But such a functionality is very much wanted in PG still.
On Fri, 2010-03-19 at 01:59 +0530, Gokulakannan Somasundaram wrote: > > The visibility map itself is already an example of > compression. If > visibility information were randomly distributed among tuples, > the > visibility map would be nearly useless. > > > I believe it is very difficult to make visibility map update friendly > without compromising durability. But such a functionality is very > much wanted in PG still. Surely the VM is already update-friendly. If you update a tuple in a page with the visibility bit set, the bit must be unset or you will get wrong results. Regards,Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > On Fri, 2010-03-19 at 01:59 +0530, Gokulakannan Somasundaram wrote: >> I believe it is very difficult to make visibility map update friendly >> without compromising durability. But such a functionality is very >> much wanted in PG still. > Surely the VM is already update-friendly. If you update a tuple in a > page with the visibility bit set, the bit must be unset or you will get > wrong results. The VM is (a) not compressed and (b) not correctness-critical. Wrong bit values don't do any serious damage. regards, tom lane
On Thu, 2010-03-18 at 16:50 -0400, Tom Lane wrote: > The VM is (a) not compressed and (b) not correctness-critical. > Wrong bit values don't do any serious damage. The VM cause wrong results if a bit is set that's not supposed to be -- right? Am I missing something? How does a seq scan skip visibility checks and still produce right results, if it doesn't rely on the bit? The visibility map would obviously not be very useful if visibility information was randomly distributed among tuples. Whether that qualifies as "compression" or not was not my primary point. The point is that it may be possible to use some structure that is significantly smaller than holding xmin/xmax for every tuple in the heap, and at the same time may be acceptably fast to update. Regards,Jeff Davis
On Thu, Mar 18, 2010 at 15:07, Jeff Davis <pgsql@j-davis.com> wrote: > On Thu, 2010-03-18 at 16:50 -0400, Tom Lane wrote: >> The VM is (a) not compressed and (b) not correctness-critical. >> Wrong bit values don't do any serious damage. > > The VM cause wrong results if a bit is set that's not supposed to be -- > right? Am I missing something? How does a seq scan skip visibility > checks and still produce right results, if it doesn't rely on the bit? Isn't it only really used for VACUUM at this point?
Jeff Davis <pgsql@j-davis.com> writes: > On Thu, 2010-03-18 at 16:50 -0400, Tom Lane wrote: >> The VM is (a) not compressed and (b) not correctness-critical. >> Wrong bit values don't do any serious damage. > The VM cause wrong results if a bit is set that's not supposed to be -- > right? Am I missing something? How does a seq scan skip visibility > checks and still produce right results, if it doesn't rely on the bit? It doesn't. The only thing we currently rely on the VM for is deciding whether a page needs vacuuming --- and even that we don't trust it for when doing anti-wraparound vacuuming. The worst-case consequence of a wrong bit is failure to free some dead tuples until the vacuum freeze limit expires. In order to do things like not visiting a page during scans, we'll have to solve the reliability issues. regards, tom lane
On Thu, Mar 18, 2010 at 9:07 PM, Jeff Davis <pgsql@j-davis.com> wrote: > The VM cause wrong results if a bit is set that's not supposed to be -- > right? Am I missing something? How does a seq scan skip visibility > checks and still produce right results, if it doesn't rely on the bit? > There's also a PD_ALL_VISIBLE flag on the page header. We wal log when we clear that bit so we can trust if it's set then all the tuples really are visible. I forget whether we can trust it if it's *not* set but there's not much point -- all tuples could become visible spontaneously even the page is sitting on disk. -- greg
On Thu, 2010-03-18 at 17:17 -0400, Tom Lane wrote: > > The VM cause wrong results if a bit is set that's not supposed to be -- > > right? Am I missing something? How does a seq scan skip visibility > > checks and still produce right results, if it doesn't rely on the bit? > > It doesn't. The only thing we currently rely on the VM for is deciding > whether a page needs vacuuming Oh, my mistake. I misremembered the discussion and I thought the seq scan optimization made it in. > In order to do things like not visiting a page during scans, we'll have > to solve the reliability issues. Yeah, and also for the index-only scans. Regards,Jeff Davis
On Thu, 2010-03-18 at 14:48 -0700, Jeff Davis wrote: > On Thu, 2010-03-18 at 17:17 -0400, Tom Lane wrote: > > > The VM cause wrong results if a bit is set that's not supposed to be -- > > > right? Am I missing something? How does a seq scan skip visibility > > > checks and still produce right results, if it doesn't rely on the bit? > > > > It doesn't. The only thing we currently rely on the VM for is deciding > > whether a page needs vacuuming > > Oh, my mistake. I misremembered the discussion and I thought the seq > scan optimization made it in. > Hmm... >From heapgetpage() in heapam.c, CVS HEAD: /* * If the all-visible flag indicates that all tuples on the page are * visible to everyone, we can skip the per-tuple visibility tests. But I tested in gdb, and it calls HeapTupleSatisfiesMVCC, until I VACUUM a few times, and then it doesn't call it any more. So, apparently the seq scan optimization _is_ there. And that means it is correctness-critical. Regards,Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > I tested in gdb, and it calls HeapTupleSatisfiesMVCC, until I VACUUM a > few times, and then it doesn't call it any more. So, apparently the seq > scan optimization _is_ there. And that means it is correctness-critical. The page header bit is critical. Not the VM. regards, tom lane
Surely the VM is already update-friendly. If you update a tuple in a
page with the visibility bit set, the bit must be unset or you will get
wrong results.
I was referring in the context of index only scans to skip visibility checks. I doubt, whether the visibility map feature to skip visibility checks at the heap can be created without any extra cost to updates/inserts. When a data is compressed then there is more contention for the same block and hence would likely affect DMLs. I hope that's what Tom was also referring to, but not in the visibility map context.
Gokul.
Gokul.
Jeff Davis wrote: > On Tue, 2010-03-16 at 15:29 +0000, Greg Stark wrote: > > I'm picturing storing a bit in the visibility map indicating that *no* > > records are visible in a given page. > > I've been thinking for a while that we could store the visibility > information in a structure separate from the heap -- sort of like the > visibility map, but per-tuple and authoritative rather than a per-page > hint. > > There are all kinds of challenges there, but it might be worth thinking > about. Visibility information is highly compressible, and requires > constant maintenance (updates, deletes, freezing, etc.). It also might > make it possible to move to 64-bit xids, if we wanted to. I don't think we want to move to 64-bit xids becasue we would still need to do vacuum freeze to trim the clog. In fact we do vacuum freeze much more frequently than required for 32-bit xids for this very reason. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do
On Mon, 2010-03-22 at 16:48 -0400, Bruce Momjian wrote: > I don't think we want to move to 64-bit xids becasue we would still need > to do vacuum freeze to trim the clog. In fact we do vacuum freeze much > more frequently than required for 32-bit xids for this very reason. Good point. I think there are a lot of issues like this, which would make storing the visibility information separate from the heap a huge undertaking. Still worth thinking about, though. Regards,Jeff Davis