Thread: An idle thought

An idle thought

From
Greg Stark
Date:
A few days ago there was a thread on one of our lists where someone
was suprised it took as much i/o to delete data as it took to insert
it in the first place. At first this does seem surprising but the fact
that Postgres stores its transaction information inline with the data
and does all i/o in blocks makes this inevitable.

http://archives.postgresql.org/pgsql-performance/2010-03/msg00141.php

However then I started thinking about this case and wondered if it
wouldn't be possible to optimize. One of the suggested optimizations
was to look at using TRUNCATE. But I wonder why it's necessary to use
a dedicated command. Shouldn't it be possible for the system to notice
this situation and do effectively the same thing itself?

I'm picturing storing a bit in the visibility map indicating that *no*
records are visible in a given page. If that bit is set then the page
is virtually empty even if it contains data. If anyone tries to load
the page they should just initialize a new page instead which will
overwrite it. That way a big batch delete just has to set a bunch of
flags in the visibility map to do the bulk delete.

There are a couple problems with the way I've described this idea
here. Firstly, if the deleting transaction hasn't committed yet or
later aborts then of course the records need to still be visible. So
either the visibility map would need an xmax for the page as a whole
or it would need to only be set after the page has actually been
vacuumed. Secondly there's the whole retail vacuum problem -- any
index entries referring to this page would be left dangling unless
there's some kind of retail vacuum or perhaps a page version number.

I'm not 100% sure this idea is workable but if it is it might make
batch deletes a lot less painful. Several orders of magnitude less i/o
whenever a single transaction deletes many rows.

-- 
greg


Re: An idle thought

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> However then I started thinking about this case and wondered if it
> wouldn't be possible to optimize. One of the suggested optimizations
> was to look at using TRUNCATE. But I wonder why it's necessary to use
> a dedicated command. Shouldn't it be possible for the system to notice
> this situation and do effectively the same thing itself?

Not unless you'd like DELETE to always acquire exclusive lock...

> There are a couple problems with the way I've described this idea
> here.

Precisely because of the lack of lock.
        regards, tom lane


Re: An idle thought

From
Simon Riggs
Date:
On Tue, 2010-03-16 at 15:29 +0000, Greg Stark wrote:

> big batch delete

Is one of the reasons for partitioning, allowing the use of truncate.

-- Simon Riggs           www.2ndQuadrant.com



Re: An idle thought

From
Jeff Davis
Date:
On Tue, 2010-03-16 at 15:29 +0000, Greg Stark wrote:
> I'm picturing storing a bit in the visibility map indicating that *no*
> records are visible in a given page.

I've been thinking for a while that we could store the visibility
information in a structure separate from the heap -- sort of like the
visibility map, but per-tuple and authoritative rather than a per-page
hint.

There are all kinds of challenges there, but it might be worth thinking
about. Visibility information is highly compressible, and requires
constant maintenance (updates, deletes, freezing, etc.). It also might
make it possible to move to 64-bit xids, if we wanted to.

Regards,Jeff Davis



Re: An idle thought

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> There are all kinds of challenges there, but it might be worth thinking
> about. Visibility information is highly compressible, and requires
> constant maintenance (updates, deletes, freezing, etc.). It also might
> make it possible to move to 64-bit xids, if we wanted to.

If you want it to be cheaply updatable (or even cheaply readable),
compression is not what you're going to do.
        regards, tom lane


Re: An idle thought

From
Jeff Davis
Date:
On Wed, 2010-03-17 at 17:20 -0400, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > There are all kinds of challenges there, but it might be worth thinking
> > about. Visibility information is highly compressible, and requires
> > constant maintenance (updates, deletes, freezing, etc.). It also might
> > make it possible to move to 64-bit xids, if we wanted to.
> 
> If you want it to be cheaply updatable (or even cheaply readable),
> compression is not what you're going to do.

I didn't mean that we'd want to compress it to the absolute minimum
size. I had envisioned that it would be a simple scheme designed only to
eliminate long runs of identical visibility information (perhaps only
the frozen and always visible regions would be compressed).

The extra level of indirection would be slower, but if we freeze tuples
more aggressively (which would be much cheaper if we didn't have to go
to the heap), there might be a small number of tuples with interesting
visibility information at any particular time.

Regards,Jeff Davis



Re: An idle thought

From
Simon Riggs
Date:
On Wed, 2010-03-17 at 14:09 -0700, Jeff Davis wrote:

> I've been thinking for a while that we could store the visibility
> information in a structure separate from the heap -- sort of like the
> visibility map, but per-tuple and authoritative rather than a per-page
> hint.

A lot of people have been thinking that for a while, but I've yet to see
a fully costed assessment of whether its worth doing. It might be.

It seems easier to focus on outcomes rather than specifics, then order
the possible solutions to those outcomes in cost-benefit sequence.

-- Simon Riggs           www.2ndQuadrant.com



Re: An idle thought

From
Gokulakannan Somasundaram
Date:


On Thu, Mar 18, 2010 at 2:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Davis <pgsql@j-davis.com> writes:
> There are all kinds of challenges there, but it might be worth thinking
> about. Visibility information is highly compressible, and requires
> constant maintenance (updates, deletes, freezing, etc.). It also might
> make it possible to move to 64-bit xids, if we wanted to.

If you want it to be cheaply updatable (or even cheaply readable),
compression is not what you're going to do.

                       regards, tom lane

+1..

Re: An idle thought

From
Gokulakannan Somasundaram
Date:
<div class="gmail_quote"><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt
0pt0pt 0.8ex; padding-left: 1ex;"> Secondly there's the whole retail vacuum problem -- any<br /> index entries
referringto this page would be left dangling unless<br /> there's some kind of retail vacuum or perhaps a page version
number.<br/><br /></blockquote></div><br />The issue, we can divide into two<br />a)volatile functions <br />b)broken
datatypes<br/><br />For a) I think volatile function issue can be resolved by using hidden columns in the heap itself.
Thiswill create a duplication of data, but since the index will get created on it, it will always point to the right
heaptuple<br /><br />For b) We are already suffering with this issue in any index lookups, index based updates/deletes,
uniqueconstraints, referential integrity maintenance etc. So hopefully one day we can extend this list to include more
:))<br/><br />Gokul.<br /><br /> 

Re: An idle thought

From
Gokulakannan Somasundaram
Date:

I didn't mean that we'd want to compress it to the absolute minimum
size. I had envisioned that it would be a simple scheme designed only to
eliminate long runs of identical visibility information (perhaps only
the frozen and always visible regions would be compressed).

The extra level of indirection would be slower, but if we freeze tuples
more aggressively (which would be much cheaper if we didn't have to go
to the heap), there might be a small number of tuples with interesting
visibility information at any particular time.


This should be achievable with current proposal of Heikki, but I think it is useful only for tables which won't have more concurrent operations and on databases without any long running queries. So if we have an option to create a visibiliy map ( on the lines of materialized view ), whenever we want for a table, it would help a good number of use cases, i suppose.

Thanks,
Gokul.

Re: An idle thought

From
Chris Browne
Date:
simon@2ndQuadrant.com (Simon Riggs) writes:
> On Tue, 2010-03-16 at 15:29 +0000, Greg Stark wrote:
>
>> big batch delete
>
> Is one of the reasons for partitioning, allowing the use of truncate.

Sure, but it would be even nicer if DELETE could be thus made cheaper
without needing to interfere with the schema.

The concurrency issue might be resolved (*might!*) by the following
complication...
- A delete request is looking at a page, and concludes, "oh, all the  tuples here are now marked dead!".
- It flags the page as *possibly* dead.  Almost what Greg suggests for  the visibility map, but this is just marking it
as"proposed dead."
 
- It throws the page number, along with xid, into a side map.

When something wants to do something with the page (e.g. - vacuum), it
sees that it's "possibly dead," and looks at the "side map" for the list
of xids that wanted to mark the page dead.

for each xid:  if xid is still active      do nothing with it  else       remove xid entry from the map

if all xids were failed  remove flag from page
if any xid committed  empty the page; the tuples are all dead

I'm less confident about that last "clause" - I *think* that if *any*
page-clearing XID is found, that means the page is well and truly clear,
doesn't it?

The extra "map" mayn't be a nice thing.

It's food for thought, anyways.
-- 
let name="cbbrowne" and tld="linuxfinances.info" in String.concat "@" [name;tld];;
The real  problem with the  the year 2000  is that there are  too many
zero bits and that adversely affects the global bit density.
-- Boyd Roberts <boyd@france3.fr>


Re: An idle thought

From
Jeff Davis
Date:
On Thu, 2010-03-18 at 14:29 +0530, Gokulakannan Somasundaram wrote:

>         If you want it to be cheaply updatable (or even cheaply
>         readable),
>         compression is not what you're going to do.
>         
>                                regards, tom lane
>         
>         
>         
> +1.. 

The visibility map itself is already an example of compression. If
visibility information were randomly distributed among tuples, the
visibility map would be nearly useless.

Regards,Jeff Davis




Re: An idle thought

From
Gokulakannan Somasundaram
Date:

The visibility map itself is already an example of compression. If
visibility information were randomly distributed among tuples, the
visibility map would be nearly useless.


I believe it is very difficult to make visibility map update friendly without compromising durability.  But such a functionality is very much wanted in PG still.



Re: An idle thought

From
Jeff Davis
Date:
On Fri, 2010-03-19 at 01:59 +0530, Gokulakannan Somasundaram wrote:
> 
>         The visibility map itself is already an example of
>         compression. If
>         visibility information were randomly distributed among tuples,
>         the
>         visibility map would be nearly useless.
>         
>         
> I believe it is very difficult to make visibility map update friendly
> without compromising durability.  But such a functionality is very
> much wanted in PG still.

Surely the VM is already update-friendly. If you update a tuple in a
page with the visibility bit set, the bit must be unset or you will get
wrong results.

Regards,Jeff Davis



Re: An idle thought

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Fri, 2010-03-19 at 01:59 +0530, Gokulakannan Somasundaram wrote:
>> I believe it is very difficult to make visibility map update friendly
>> without compromising durability.  But such a functionality is very
>> much wanted in PG still.

> Surely the VM is already update-friendly. If you update a tuple in a
> page with the visibility bit set, the bit must be unset or you will get
> wrong results.

The VM is (a) not compressed and (b) not correctness-critical.
Wrong bit values don't do any serious damage.
        regards, tom lane


Re: An idle thought

From
Jeff Davis
Date:
On Thu, 2010-03-18 at 16:50 -0400, Tom Lane wrote:
> The VM is (a) not compressed and (b) not correctness-critical.
> Wrong bit values don't do any serious damage.

The VM cause wrong results if a bit is set that's not supposed to be --
right? Am I missing something? How does a seq scan skip visibility
checks and still produce right results, if it doesn't rely on the bit?

The visibility map would obviously not be very useful if visibility
information was randomly distributed among tuples. Whether that
qualifies as "compression" or not was not my primary point. The point is
that it may be possible to use some structure that is significantly
smaller than holding xmin/xmax for every tuple in the heap, and at the
same time may be acceptably fast to update.

Regards,Jeff Davis



Re: An idle thought

From
Alex Hunsaker
Date:
On Thu, Mar 18, 2010 at 15:07, Jeff Davis <pgsql@j-davis.com> wrote:
> On Thu, 2010-03-18 at 16:50 -0400, Tom Lane wrote:
>> The VM is (a) not compressed and (b) not correctness-critical.
>> Wrong bit values don't do any serious damage.
>
> The VM cause wrong results if a bit is set that's not supposed to be --
> right? Am I missing something? How does a seq scan skip visibility
> checks and still produce right results, if it doesn't rely on the bit?

Isn't it only really used for VACUUM at this point?


Re: An idle thought

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Thu, 2010-03-18 at 16:50 -0400, Tom Lane wrote:
>> The VM is (a) not compressed and (b) not correctness-critical.
>> Wrong bit values don't do any serious damage.

> The VM cause wrong results if a bit is set that's not supposed to be --
> right? Am I missing something? How does a seq scan skip visibility
> checks and still produce right results, if it doesn't rely on the bit?

It doesn't.  The only thing we currently rely on the VM for is deciding
whether a page needs vacuuming --- and even that we don't trust it for
when doing anti-wraparound vacuuming.  The worst-case consequence of a
wrong bit is failure to free some dead tuples until the vacuum freeze
limit expires.

In order to do things like not visiting a page during scans, we'll have
to solve the reliability issues.
        regards, tom lane


Re: An idle thought

From
Greg Stark
Date:
On Thu, Mar 18, 2010 at 9:07 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> The VM cause wrong results if a bit is set that's not supposed to be --
> right? Am I missing something? How does a seq scan skip visibility
> checks and still produce right results, if it doesn't rely on the bit?
>

There's also a PD_ALL_VISIBLE flag on the page header. We wal log when
we clear that bit so we can trust if it's set then all the tuples
really are visible. I forget whether we can trust it if it's *not* set
but there's not much point -- all tuples could become visible
spontaneously even the page is sitting on disk.
-- 
greg


Re: An idle thought

From
Jeff Davis
Date:
On Thu, 2010-03-18 at 17:17 -0400, Tom Lane wrote:
> > The VM cause wrong results if a bit is set that's not supposed to be --
> > right? Am I missing something? How does a seq scan skip visibility
> > checks and still produce right results, if it doesn't rely on the bit?
> 
> It doesn't.  The only thing we currently rely on the VM for is deciding
> whether a page needs vacuuming

Oh, my mistake. I misremembered the discussion and I thought the seq
scan optimization made it in.

> In order to do things like not visiting a page during scans, we'll have
> to solve the reliability issues.

Yeah, and also for the index-only scans.

Regards,Jeff Davis



Re: An idle thought

From
Jeff Davis
Date:
On Thu, 2010-03-18 at 14:48 -0700, Jeff Davis wrote:
> On Thu, 2010-03-18 at 17:17 -0400, Tom Lane wrote:
> > > The VM cause wrong results if a bit is set that's not supposed to be --
> > > right? Am I missing something? How does a seq scan skip visibility
> > > checks and still produce right results, if it doesn't rely on the bit?
> > 
> > It doesn't.  The only thing we currently rely on the VM for is deciding
> > whether a page needs vacuuming
> 
> Oh, my mistake. I misremembered the discussion and I thought the seq
> scan optimization made it in.
> 

Hmm...

>From heapgetpage() in heapam.c, CVS HEAD:
/*                                                                                                              
* If the all-visible flag indicates that all tuples on the page
are                                            
* visible to everyone, we can skip the per-tuple visibility tests.
But                                         

I tested in gdb, and it calls HeapTupleSatisfiesMVCC, until I VACUUM a
few times, and then it doesn't call it any more. So, apparently the seq
scan optimization _is_ there. And that means it is correctness-critical.

Regards,Jeff Davis



Re: An idle thought

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> I tested in gdb, and it calls HeapTupleSatisfiesMVCC, until I VACUUM a
> few times, and then it doesn't call it any more. So, apparently the seq
> scan optimization _is_ there. And that means it is correctness-critical.

The page header bit is critical.  Not the VM.
        regards, tom lane


Re: An idle thought

From
Gokulakannan Somasundaram
Date:
Surely the VM is already update-friendly. If you update a tuple in a
page with the visibility bit set, the bit must be unset or you will get
wrong results.

 
I was referring in the context of index only scans to skip visibility checks. I doubt, whether the visibility map feature to skip visibility checks at the heap can be created without any extra cost to updates/inserts. When a data is compressed then there is more contention for the same block and hence would likely affect DMLs. I hope that's what Tom was also referring to, but not in the visibility map context.

Gokul.

Re: An idle thought

From
Bruce Momjian
Date:
Jeff Davis wrote:
> On Tue, 2010-03-16 at 15:29 +0000, Greg Stark wrote:
> > I'm picturing storing a bit in the visibility map indicating that *no*
> > records are visible in a given page.
> 
> I've been thinking for a while that we could store the visibility
> information in a structure separate from the heap -- sort of like the
> visibility map, but per-tuple and authoritative rather than a per-page
> hint.
> 
> There are all kinds of challenges there, but it might be worth thinking
> about. Visibility information is highly compressible, and requires
> constant maintenance (updates, deletes, freezing, etc.). It also might
> make it possible to move to 64-bit xids, if we wanted to.

I don't think we want to move to 64-bit xids becasue we would still need
to do vacuum freeze to trim the clog.  In fact we do vacuum freeze much
more frequently than required for 32-bit xids for this very reason.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do


Re: An idle thought

From
Jeff Davis
Date:
On Mon, 2010-03-22 at 16:48 -0400, Bruce Momjian wrote:
> I don't think we want to move to 64-bit xids becasue we would still need
> to do vacuum freeze to trim the clog.  In fact we do vacuum freeze much
> more frequently than required for 32-bit xids for this very reason.

Good point. I think there are a lot of issues like this, which would
make storing the visibility information separate from the heap a huge
undertaking. Still worth thinking about, though.

Regards,Jeff Davis