Thread: Remove xmin and cmin from frozen tuples

Remove xmin and cmin from frozen tuples

From
ITAGAKI Takahiro
Date:
Hi Hackers,

I think it would be a waste to retain xmin and cmin for frozen tuples
because their values represent only 'visible for all transactions'.
Additionally, most tuples in database can be frozen potentially.

I wrote a makeshift patch to compress xmin and cmin (8bytes) to
1-bit flag, using tuple overlaping.

Is this idea worth trying?

Also, it will be useful to combine it and more aggressive freeze vacuum,
for example, a freezer integrated with bgwriter.



(The following is test of the attached patch)

* Test query
  1. create table test (a int);
  2. insert into test select * from generate_series(1, 100000);
  3. update test set a = a where a % 100 = 0; # to force defrag
  4. select * from pgstattuple('test');
  5. vacuum freeze test;
  6. select * from pgstattuple('test');

* Results of pgstattuple

-[ before vacuum ]-+--------
table_len          | 3645440
tuple_count        | 100000
tuple_len          | 3200000
tuple_percent      | 87.78
dead_tuple_count   | 1000
dead_tuple_len     | 32000
dead_tuple_percent | 0.88
free_space         | 536
free_percent       | 0.01

-[ 8.1beta1 orig ]-+--------
table_len          | 3645440
tuple_count        | 100000
tuple_len          | 3200000
tuple_percent      | 87.78
dead_tuple_count   | 0
dead_tuple_len     | 0
dead_tuple_percent | 0
free_space         | 30772   <-- about 32byte * 1000 (dead tuples)
free_percent       | 0.84

-[ patched ]-------+--------
table_len          | 3645440
tuple_count        | 100000
tuple_len          | 3200000
tuple_percent      | 87.78
dead_tuple_count   | 0
dead_tuple_len     | 0
dead_tuple_percent | 0
free_space         | 823628  <-- + 8byte * 100000 (whole tuples)
free_percent       | 22.59

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories

Attachment

Re: Remove xmin and cmin from frozen tuples

From
Alvaro Herrera
Date:
On Thu, Sep 01, 2005 at 10:45:44AM +0900, ITAGAKI Takahiro wrote:

Hi,

> I think it would be a waste to retain xmin and cmin for frozen tuples
> because their values represent only 'visible for all transactions'.
> Additionally, most tuples in database can be frozen potentially.

I think this is an interesting idea.  I was thinking that when the tuple
needs to be obsoleted it would need to grow to accomodate the Xmax, but
you are not actually proposing to remove that, so it seems sensible.  In
fact, it is perfectly reasonable to remove Xmin and Cmin, because after
the tuple is frozen, the Xmin never changes again.

Now, one thing of note is that you need to "compress" the page in order
to actually be able to use the just-freed space.  VACUUM could do that,
but maybe it would be better to do it on-line -- the freezing process is
going to have to write the page regardless.  I wonder if with your patch
the page is compressed on the same VACUUM execution that freezes the
tuple?

One thing that comes to mind is that this makes somewhat easier to build
a tool to write pre-built tables, for bulk-loading purposes.  You just
construct the binary file with the HEAP_FROZEN bit set, and then attach
the file to a dummy table.  (Then again, you can do it today, using a
Xmin of FrozenTransactionId.  I wonder why the Bizgres people isn't
advocating a tool to do that.  It is very hard to do with user-defined
types, but for BI/DW you mostly don't need those, do you?)

-- 
Alvaro Herrera -- Valdivia, Chile         Architect, www.EnterpriseDB.com
"Cuando no hay humildad las personas se degradan" (A. Christie)


Re: Remove xmin and cmin from frozen tuples

From
Tom Lane
Date:
ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp> writes:
> I think it would be a waste to retain xmin and cmin for frozen tuples
> because their values represent only 'visible for all transactions'.

True, but the hard part is getting rid of the storage for them.

> I wrote a makeshift patch to compress xmin and cmin (8bytes) to
> 1-bit flag, using tuple overlaping.
> Is this idea worth trying?

I think this is incredibly ugly :-(.  It eliminates a fairly basic
assumption which is that items on a page don't overlap.  The space
savings cannot be worth the loss in testability and reliability.
To take just one problem, it is no longer possible to check an item
offset for validity against pd_upper.  If we're going to do this,
we need a more invasive patch that changes the structure of heaptuple
headers in a more fundamental way, and avoids breaking the page layout
representation.  (Something like the way Oids are now handled might
work, although there are alignment issues to worry about, and it'd
take more work on VACUUM's part to convert a tuple to frozen state.)

I'm also less than enthused about using up our last infomask bit for
a relatively unimportant purpose.  We might need that for something
bigger someday... though I can't presently guess what.
        regards, tom lane


Re: Remove xmin and cmin from frozen tuples

From
Josh Berkus
Date:
Alvaro,

> One thing that comes to mind is that this makes somewhat easier to build
> a tool to write pre-built tables, for bulk-loading purposes.  You just
> construct the binary file with the HEAP_FROZEN bit set, and then attach
> the file to a dummy table.  (Then again, you can do it today, using a
> Xmin of FrozenTransactionId.  I wonder why the Bizgres people isn't
> advocating a tool to do that.  It is very hard to do with user-defined
> types, but for BI/DW you mostly don't need those, do you?)

Hmmm ... can you expand on this a little?  We'd discussed "frozen partitions" 
but hadn't thought to get around to them for a while, expecting the kind of 
issues which Tom just raised.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco


Re: Remove xmin and cmin from frozen tuples

From
ITAGAKI Takahiro
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> Now, one thing of note is that you need to "compress" the page in order
> to actually be able to use the just-freed space.  VACUUM could do that,
> but maybe it would be better to do it on-line -- the freezing process is
> going to have to write the page regardless.

I agree. I think an good position of freezer is on bgwriter.
My idea is: 1. Just before bgwriter writes an dirty page in LRU order, 2. Freeze tuples in the page and repair
fragmentation.3. (Replace the fsm page that has least freespace.) 4. Flush the page.
 


> I wonder if with your patch
> the page is compressed on the same VACUUM execution that freezes the tuple?

Yes, defragmentation is performed after freezing, but the page has at least
one dead tuple. In current VACUUM implementation, pages that have no dead
tuples will not be defraged. So you cannot "compress" just after bulk-load.

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories




Re: Remove xmin and cmin from frozen tuples

From
ITAGAKI Takahiro
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I think this is incredibly ugly :-(.

Yes, I think so, too :-(    My patch is product of the thought that
I don't want to modify codes widely. So if we want to do it more cool way,
lots of changes are needed as you said.


> I'm also less than enthused about using up our last infomask bit for
> a relatively unimportant purpose.  We might need that for something
> bigger someday... though I can't presently guess what.

I think it is not a problem, because the header still has rooms for several
bits. I assume that the combination of HEAP_XMIN_COMMITTED + HEAP_XMIN_INVALID
has currently no meaning, right? If so, HEAP_FROZEN can be assigned here.
Also, t_natts is currently 16-bits, but it can be cut to 11-bits
because MaxTupleAttributeNumber is 1664 < 2^11.

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories




Re: Remove xmin and cmin from frozen tuples

From
Tom Lane
Date:
ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp> writes:
> I agree. I think an good position of freezer is on bgwriter.
> My idea is:
>   1. Just before bgwriter writes an dirty page in LRU order,
>   2. Freeze tuples in the page and repair fragmentation.
>   3. (Replace the fsm page that has least freespace.)
>   4. Flush the page.

This is a bad idea.  The bgwriter isn't the place to be doing freezing,
because there is no reasonable way for it to guarantee that all old
tuples in a table (or any larger unit) have been frozen.  So you'd still
need VACUUM to ensure no wraparound.  Plus, you can't do such changes
without emitting an XLOG record, which is something we don't want
happening in the bgwriter's inner loop.  Even more to the point, you
can't do such changes without getting a superexclusive lock on the page
(not only locked, but no one else has it pinned), which is a real
nonstarter for the bgwriter, both for performance and possible deadlock
issues.
        regards, tom lane


Re: Remove xmin and cmin from frozen tuples

From
Alvaro Herrera
Date:
On Wed, Aug 31, 2005 at 09:14:42PM -0700, Josh Berkus wrote:

> > One thing that comes to mind is that this makes somewhat easier to build
> > a tool to write pre-built tables, for bulk-loading purposes.  You just
> > construct the binary file with the HEAP_FROZEN bit set, and then attach
> > the file to a dummy table.  (Then again, you can do it today, using a
> > Xmin of FrozenTransactionId.  I wonder why the Bizgres people isn't
> > advocating a tool to do that.  It is very hard to do with user-defined
> > types, but for BI/DW you mostly don't need those, do you?)
> 
> Hmmm ... can you expand on this a little?  We'd discussed "frozen partitions" 
> but hadn't thought to get around to them for a while, expecting the kind of 
> issues which Tom just raised.

What issues did he raise on this?

What I'm saying is that you can write a heap file, on which the tuples
would all have xmin=FrozenTransactionId, xmax=Invalid, and the
corresponding bits set in the infomask.  This ensures that no matter the
state of the server, you can plug the file in and all tuples will be
valid.

The "only" problem is figuring out how to lay the data in the tuples
themselves, w.r.t endianness and such.  This is platform-dependent, so
you have to write code to do it correctly.  In absence of user-defined
types, this should not be _too_ hard to do.  Of course, such a program
would in general also be Postgres-version-dependent.

Note that this is a very different business from skipping the Xmin and
Cmin from the tuple header -- in fact, there's no relation to that at
all.

-- 
Alvaro Herrera -- Valdivia, Chile         Architect, www.EnterpriseDB.com
FOO MANE PADME HUM


Re: Remove xmin and cmin from frozen tuples

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> What I'm saying is that you can write a heap file, on which the tuples
> would all have xmin=FrozenTransactionId, xmax=Invalid, and the
> corresponding bits set in the infomask.  This ensures that no matter the
> state of the server, you can plug the file in and all tuples will be
> valid.

> The "only" problem is figuring out how to lay the data in the tuples
> themselves, w.r.t endianness and such.  This is platform-dependent, so
> you have to write code to do it correctly.  In absence of user-defined
> types, this should not be _too_ hard to do.  Of course, such a program
> would in general also be Postgres-version-dependent.

Of course, it's fair to ask whether such a program would be any faster
than binary-mode COPY by the time you got done ... or enough faster to
justify your effort, anyway.

THe only fundamental disadvantage that COPY labors under is having to
write WAL records.  It might be interesting to do something similar to
the recent hacks for CREATE TABLE AS, so that a COPY into a table just
created in the current transaction would skip writing WAL and instead
fsync the table at the end.
        regards, tom lane


Re: Remove xmin and cmin from frozen tuples

From
Simon Riggs
Date:
On Wed, 2005-08-31 at 22:25 -0400, Alvaro Herrera wrote:
> On Thu, Sep 01, 2005 at 10:45:44AM +0900, ITAGAKI Takahiro wrote:
> 
> Hi,
> 
> > I think it would be a waste to retain xmin and cmin for frozen tuples
> > because their values represent only 'visible for all transactions'.
> > Additionally, most tuples in database can be frozen potentially.
> 
> I think this is an interesting idea.  

Agreed, especially since it would avoid the need to vacuum altogether.

> I was thinking that when the tuple
> needs to be obsoleted it would need to grow to accomodate the Xmax, but
> you are not actually proposing to remove that, so it seems sensible.  In
> fact, it is perfectly reasonable to remove Xmin and Cmin, because after
> the tuple is frozen, the Xmin never changes again.

It's a good idea, but the Xmin is set to FrozenTransactionId, which is
how we know it is frozen, so how can we remove Xmin? The way to do this
is surely by using a row version id that is different for this format.

Getting 8 or 16 bytes per row back would be a very useful gain.

> Now, one thing of note is that you need to "compress" the page in order
> to actually be able to use the just-freed space.  VACUUM could do that,
> but maybe it would be better to do it on-line -- the freezing process is
> going to have to write the page regardless.  I wonder if with your patch
> the page is compressed on the same VACUUM execution that freezes the
> tuple?

Only if you do a FULL, which is currently incompatible with a FREEZE. 

There's no point in compressing a block if you can't also redistribute
rows between blocks to fill up the spaces, so another reason why it has
to be a FULL. Unless you do this at load time, which is why I guess you
mention....

> One thing that comes to mind is that this makes somewhat easier to build
> a tool to write pre-built tables, for bulk-loading purposes.  You just
> construct the binary file with the HEAP_FROZEN bit set, and then attach
> the file to a dummy table.  (Then again, you can do it today, using a
> Xmin of FrozenTransactionId.  I wonder why the Bizgres people isn't
> advocating a tool to do that.  It is very hard to do with user-defined
> types, but for BI/DW you mostly don't need those, do you?)

Loading a table using COPY with frozen bits set was suggested in May, so
yeh... it was suggested. At that time it was rejected, since earlier
transactions would then be able to see rows they ought not be able to
see. Thinking some more about this, this is only the inverse situation
of a TRUNCATE. With truncate we remove tuples that ought to still be
visible to pre-existing transactions. So there shouldn't really be an
issue with loading pre-frozen tuples - as long as you accept the
consequences for row visibility. 

Externally writing blocks is possible, but it bypasses a lot of other
features. My current preference would be to have bulk_heap_insert()
function to add a whole page at a time rather than inserting rows one at
at a time. The main objective for a load is to make it disk bound; once
we've achieved that by some further tuning, writing an external file
would cost around the same as writing it internally from the DBMS.
Oracle (direct path loader) and Teradata (Fastload) load data in
complete blocks using a reduced code pathway, so I guess I was just
following on, but I'm genuinely open to further persuasion if there is a
better way.

Having a table marked as INSERT ONLY would allow us to save 8 bytes/row,
loading it pre-frozen (in some way) would save another 8 bytes/row and
allow us to permanently avoid VACUUMing the table. That would be even
better when we have per-table XID wrap avoidance.

Best Regards, Simon Riggs



Re: Remove xmin and cmin from frozen tuples

From
Alvaro Herrera
Date:
On Thu, Sep 01, 2005 at 11:08:36AM -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > What I'm saying is that you can write a heap file, on which the tuples
> > would all have xmin=FrozenTransactionId, xmax=Invalid, and the
> > corresponding bits set in the infomask.  This ensures that no matter the
> > state of the server, you can plug the file in and all tuples will be
> > valid.
> 
> > The "only" problem is figuring out how to lay the data in the tuples
> > themselves, w.r.t endianness and such.  This is platform-dependent, so
> > you have to write code to do it correctly.  In absence of user-defined
> > types, this should not be _too_ hard to do.  Of course, such a program
> > would in general also be Postgres-version-dependent.
> 
> Of course, it's fair to ask whether such a program would be any faster
> than binary-mode COPY by the time you got done ... or enough faster to
> justify your effort, anyway.

It may not be faster generating the data in the first place, but you
don't have to vacuum the table, nor you are subject to hint bits
changing, resulting in more unnecessary I/O.

This can't be avoided with COPY, because there's always the chance that
it will fail partway through, so you can't write frozen tuples.  With an
external program, you can just dump the invalid line somewhere else and
continue with the rest.

-- 
Alvaro Herrera -- Valdivia, Chile         Architect, www.EnterpriseDB.com
"Just treat us the way you want to be treated + some extra allowancefor ignorance."
(MichaelBrusser)
 


Re: Remove xmin and cmin from frozen tuples

From
Josh Berkus
Date:
Alvaro,

> What issues did he raise on this?

On having no Xmin.

> What I'm saying is that you can write a heap file, on which the tuples
> would all have xmin=FrozenTransactionId, xmax=Invalid, and the
> corresponding bits set in the infomask.  This ensures that no matter the
> state of the server, you can plug the file in and all tuples will be
> valid.
>
> The "only" problem is figuring out how to lay the data in the tuples
> themselves, w.r.t endianness and such.  This is platform-dependent, so
> you have to write code to do it correctly.  In absence of user-defined
> types, this should not be _too_ hard to do.  Of course, such a program
> would in general also be Postgres-version-dependent.

So, bulk loading by file generation?   So the idea is that you would generate 
a properly formatted PostgreSQL table file, and then in one transaction 
create the table and attach it?

Seems like this would have the additional limitation of being useful only for 
loading new partitions/new tables.  However, it would have some significant 
advantages for bulk loading ... chiefly that the data page generation and 
associated computations could be done *off* the database server.   This might 
help considerably in getting around the 100mb/s data computation ceiling 
we're hitting ...

-- 
Josh Berkus
Aglio Database Solutions
San Francisco


Re: Remove xmin and cmin from frozen tuples

From
Josh Berkus
Date:
Tom,

> THe only fundamental disadvantage that COPY labors under is having to
> write WAL records.  It might be interesting to do something similar to
> the recent hacks for CREATE TABLE AS, so that a COPY into a table just
> created in the current transaction would skip writing WAL and instead
> fsync the table at the end.

Yes, I thought we discussed doing this for empty tables -- it would be, per 
our tests, a +10% to +30% boost to COPY.

But there was some problem the patch?

-- 
Josh Berkus
Aglio Database Solutions
San Francisco


Re: Remove xmin and cmin from frozen tuples

From
Alvaro Herrera
Date:
On Thu, Sep 01, 2005 at 09:20:48AM -0700, Josh Berkus wrote:

> > What I'm saying is that you can write a heap file, on which the tuples
> > would all have xmin=FrozenTransactionId, xmax=Invalid, and the
> > corresponding bits set in the infomask.  This ensures that no matter the
> > state of the server, you can plug the file in and all tuples will be
> > valid.
> 
> So, bulk loading by file generation?   So the idea is that you would generate 
> a properly formatted PostgreSQL table file, and then in one transaction 
> create the table and attach it?

Exactly.

-- 
Alvaro Herrera -- Valdivia, Chile         Architect, www.EnterpriseDB.com
"Changing the world ... one keyboard at a time!"                        (www.DVzine.org)


Re: Remove xmin and cmin from frozen tuples

From
Alvaro Herrera
Date:
On Thu, Sep 01, 2005 at 04:34:09PM +0100, Simon Riggs wrote:
> On Wed, 2005-08-31 at 22:25 -0400, Alvaro Herrera wrote:

> > I was thinking that when the tuple
> > needs to be obsoleted it would need to grow to accomodate the Xmax, but
> > you are not actually proposing to remove that, so it seems sensible.  In
> > fact, it is perfectly reasonable to remove Xmin and Cmin, because after
> > the tuple is frozen, the Xmin never changes again.
> 
> It's a good idea, but the Xmin is set to FrozenTransactionId, which is
> how we know it is frozen, so how can we remove Xmin? The way to do this
> is surely by using a row version id that is different for this format.

Per Takahiro's patch, you don't need to set the Xmin to
FrozenTransactionId -- what you do instead is set a bit in the infomask.


> > Now, one thing of note is that you need to "compress" the page in order
> > to actually be able to use the just-freed space.  VACUUM could do that,
> > but maybe it would be better to do it on-line -- the freezing process is
> > going to have to write the page regardless.  I wonder if with your patch
> > the page is compressed on the same VACUUM execution that freezes the
> > tuple?
> 
> Only if you do a FULL, which is currently incompatible with a FREEZE. 

Well, if we are going to mess with what FREEZE is doing, we can as well
make it compress the page.  Note that to compress the page you don't
need to touch the indexes.

I don't remember the exact reason why FULL is incompatible with FREEZE,
but AFAIR it's not fundamentally unsolvable (just very hard.)

> There's no point in compressing a block if you can't also redistribute
> rows between blocks to fill up the spaces, so another reason why it has
> to be a FULL.

That's a good point.

> > One thing that comes to mind is that this makes somewhat easier to build
> > a tool to write pre-built tables, for bulk-loading purposes.  You just
> > construct the binary file with the HEAP_FROZEN bit set, and then attach
> > the file to a dummy table.
> 
> Loading a table using COPY with frozen bits set was suggested in May, so
> yeh... it was suggested.

I'm not proposing to use COPY for that.  It has loads of problems, which
is why the patch was rejected.  Using an external program is a different
matter.

> Externally writing blocks is possible, but it bypasses a lot of other
> features.

Like what?

I don't really care for this feature, mind you -- I was merely
mentioning the idea as it crossed my mind.

-- 
Alvaro Herrera -- Valdivia, Chile         Architect, www.EnterpriseDB.com
"The problem with the facetime model is not just that it's demoralizing, but
that the people pretending to work interrupt the ones actually working."
         (Paul Graham)
 


Re: Remove xmin and cmin from frozen tuples

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
>> THe only fundamental disadvantage that COPY labors under is having to
>> write WAL records.  It might be interesting to do something similar to
>> the recent hacks for CREATE TABLE AS, so that a COPY into a table just
>> created in the current transaction would skip writing WAL and instead
>> fsync the table at the end.

> Yes, I thought we discussed doing this for empty tables -- it would be, per 
> our tests, a +10% to +30% boost to COPY.

> But there was some problem the patch?

I have seen no such patch AFAIR.
        regards, tom lane


Re: Remove xmin and cmin from frozen tuples

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On Thu, Sep 01, 2005 at 04:34:09PM +0100, Simon Riggs wrote:
>> On Wed, 2005-08-31 at 22:25 -0400, Alvaro Herrera wrote:
>>> Now, one thing of note is that you need to "compress" the page in order
>>> to actually be able to use the just-freed space.  VACUUM could do that,
>>> but maybe it would be better to do it on-line -- the freezing process is
>>> going to have to write the page regardless.  I wonder if with your patch
>>> the page is compressed on the same VACUUM execution that freezes the
>>> tuple?
>> 
>> Only if you do a FULL, which is currently incompatible with a FREEZE. 

> Well, if we are going to mess with what FREEZE is doing, we can as well
> make it compress the page.

Anyone looked at the code lately???

PageRepairFragmentation is part of any kind of vacuum.  As long as you
don't reassign tuple IDs (which it doesn't) there's no impact on indexes.
        regards, tom lane


Additional background daemon (was: Remove xmin and cmin from frozen tuples)

From
"Jim C. Nasby"
Date:
On Thu, Sep 01, 2005 at 09:22:35AM -0400, Tom Lane wrote:
> ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp> writes:
> > I agree. I think an good position of freezer is on bgwriter.
> > My idea is:
> >   1. Just before bgwriter writes an dirty page in LRU order,
> >   2. Freeze tuples in the page and repair fragmentation.
> >   3. (Replace the fsm page that has least freespace.)
> >   4. Flush the page.
> 
> This is a bad idea.  The bgwriter isn't the place to be doing freezing,
> because there is no reasonable way for it to guarantee that all old
> tuples in a table (or any larger unit) have been frozen.  So you'd still
> need VACUUM to ensure no wraparound.  Plus, you can't do such changes
> without emitting an XLOG record, which is something we don't want
> happening in the bgwriter's inner loop.  Even more to the point, you
> can't do such changes without getting a superexclusive lock on the page
> (not only locked, but no one else has it pinned), which is a real
> nonstarter for the bgwriter, both for performance and possible deadlock
> issues.

So is this something that another daemon could handle? Presumably one
that would operate on pages before they were written out by bgwriter.

Basically, right now any time someone thinks of something that could be
done in the background, bgwriter is the automatic candidate because it's
the only daemon in the backend. And it's often rejected for valid
technical reasons, but that doesn't mean we can't have additional
daemons that operate either before or after bgwriter.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software        http://pervasive.com        512-569-9461


"Jim C. Nasby" <jnasby@pervasive.com> writes:
> On Thu, Sep 01, 2005 at 09:22:35AM -0400, Tom Lane wrote:
>> This is a bad idea.  The bgwriter isn't the place to be doing freezing,

> So is this something that another daemon could handle?

Possibly, but I'd be inclined to think of it as autovacuum's problem.
        regards, tom lane


Re: Additional background daemon (was: Remove xmin and cmin from frozen tuples)

From
"Jim C. Nasby"
Date:
On Thu, Sep 01, 2005 at 11:22:07PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > On Thu, Sep 01, 2005 at 09:22:35AM -0400, Tom Lane wrote:
> >> This is a bad idea.  The bgwriter isn't the place to be doing freezing,
> 
> > So is this something that another daemon could handle?
> 
> Possibly, but I'd be inclined to think of it as autovacuum's problem.

Possibly, although what tends to make bgwriter interesting for these
things is that we want to perform some operation on pages between when
they get modified and when they get written out to disk. AFAIK
autovacuum wouldn't currently serve that purpose (though I could be
wrong). In any case, the big point is that there are ideas out there
that might warrant an additional daemon besides bgwriter (my
recollection is that this isn't the first proposal that's been objected
to on the basis of bgwriter being the wrong place to do something).
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software        http://pervasive.com        512-569-9461


Re: Remove xmin and cmin from frozen tuples

From
Pailloncy Jean-Gerard
Date:
> [skip]

> happening in the bgwriter's inner loop.  Even more to the point, you
> can't do such changes without getting a superexclusive lock on the
> page
> (not only locked, but no one else has it pinned), which is a real
> nonstarter for the bgwriter, both for performance and possible
> deadlock
> issues.
Hi Tom,

I do not want to discuss in deep the place to do this job, it is
really over my head.

But, you said you need a "super-exclusive lock", and I wonder if a
wait-free algorithm would be good for pg in a general manner.

I have given some references about it already on the list.
http://archives.postgresql.org/pgsql-hackers/2005-02/msg00263.php

Cordialement,
Jean-Gérard Pailloncy



Re: Remove xmin and cmin from frozen tuples

From
Bruce Momjian
Date:
Tom Lane wrote:
> Of course, it's fair to ask whether such a program would be any faster
> than binary-mode COPY by the time you got done ... or enough faster to
> justify your effort, anyway.
> 
> THe only fundamental disadvantage that COPY labors under is having to
> write WAL records.  It might be interesting to do something similar to
> the recent hacks for CREATE TABLE AS, so that a COPY into a table just
> created in the current transaction would skip writing WAL and instead
> fsync the table at the end.

Added to TODO:
       o Allow COPY into an empty table to skip WAL logging

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Remove xmin and cmin from frozen tuples

From
Bruce Momjian
Date:
Tom Lane wrote:
> ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp> writes:
> > I think it would be a waste to retain xmin and cmin for frozen tuples
> > because their values represent only 'visible for all transactions'.
> 
> True, but the hard part is getting rid of the storage for them.
> 
> > I wrote a makeshift patch to compress xmin and cmin (8bytes) to
> > 1-bit flag, using tuple overlaping.
> > Is this idea worth trying?
> 
> I think this is incredibly ugly :-(.  It eliminates a fairly basic
> assumption which is that items on a page don't overlap.  The space
> savings cannot be worth the loss in testability and reliability.
> To take just one problem, it is no longer possible to check an item
> offset for validity against pd_upper.  If we're going to do this,
> we need a more invasive patch that changes the structure of heaptuple
> headers in a more fundamental way, and avoids breaking the page layout
> representation.  (Something like the way Oids are now handled might
> work, although there are alignment issues to worry about, and it'd
> take more work on VACUUM's part to convert a tuple to frozen state.)
> 
> I'm also less than enthused about using up our last infomask bit for
> a relatively unimportant purpose.  We might need that for something
> bigger someday... though I can't presently guess what.

Considering the cost/benefits, rather than doing some optimization for
long-lived tuples, I would like to see us merge the existing
xmin/xmax/cmin/cmax values back into three storage fields like we had in
7.4 and had to expand to a full four in 8.0 to support subtransactions.
The benefit is that every row would be reduced in size by 4 bytes or 14%
for all rows:
* Merge xmin/xmax/cmin/cmax back into three header fields  Before subtransactions, there used to be only three fields
neededto  store these four values. This was possible because only the current  transaction looks at the cmin/cmax
values.If the current transaction  created and expired the row the fields stored where xmin (same as  xmax), cmin,
cmax,and if the transaction was expiring a row from a  another transaction, the fields stored were xmin (cmin was not
needed),xmax, and cmax. Such a system worked because a transaction  could only see rows from another completed
transaction.
  However, subtransactions can see rows from outer transactions, and  once the subtransaction completes, the outer
transactioncontinues,  requiring the storage of all four fields. With subtransactions, an  outer transaction can create
arow, a subtransaction expire it, and  when the subtransaction completes, the outer transaction still has  to have
propervisibility of the row's cmin, for example, for   cursors.  One possible solution is to create a phantom cid which
representsa  cmin/cmax pair and is stored in local memory.
 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Remove xmin and cmin from frozen tuples

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom Lane wrote:
>> THe only fundamental disadvantage that COPY labors under is having to
>> write WAL records.  It might be interesting to do something similar to
>> the recent hacks for CREATE TABLE AS, so that a COPY into a table just
>> created in the current transaction would skip writing WAL and instead
>> fsync the table at the end.

> Added to TODO:
>         o Allow COPY into an empty table to skip WAL logging

It has to be a *new* table, not an *empty* table.  If it's already
visible to other xacts then somebody else could insert into it in
parallel with you, because COPY doesn't take an exclusive lock.
Contrariwise, it doesn't really matter (I think) if there are WAL-logged
records already in the table and COPY is adding more that aren't logged.
(You might have to force COPY to start adding the rows on freshly added
pages ... hmm ... all of a sudden I think we had this discussion already?
I for sure remember the fresh-pages trick from some other thread.)
        regards, tom lane


Re: Remove xmin and cmin from frozen tuples

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Considering the cost/benefits, rather than doing some optimization for
> long-lived tuples, I would like to see us merge the existing
> xmin/xmax/cmin/cmax values back into three storage fields like we had in
> 7.4 and had to expand to a full four in 8.0 to support subtransactions.

There is another reason for trying to do that rather than the frozen-row
optimization, which is that to get it down to two visibility-related
fields, we'd have to find another representation for tuples that are
Datums in memory.  The current Datum representation overlays three int32
fields where the visibility fields are for a tuple on-disk.  This works
fine now, and would still work fine if we could revert to the 7.4
approach, but it doesn't play nicely with a scheme to remove 2 of the 4
fields.
        regards, tom lane


Re: Remove xmin and cmin from frozen tuples

From
Alvaro Herrera
Date:
On Fri, Sep 02, 2005 at 04:02:08PM -0400, Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> THe only fundamental disadvantage that COPY labors under is having to
> >> write WAL records.  It might be interesting to do something similar to
> >> the recent hacks for CREATE TABLE AS, so that a COPY into a table just
> >> created in the current transaction would skip writing WAL and instead
> >> fsync the table at the end.
> 
> > Added to TODO:
> >         o Allow COPY into an empty table to skip WAL logging
> 
> It has to be a *new* table, not an *empty* table.  If it's already
> visible to other xacts then somebody else could insert into it in
> parallel with you, because COPY doesn't take an exclusive lock.

What about the indexes?  Logging one of the inserters and not the other
is certain to corrupt the whole thing.  (Logging index insertion but not
the heap itself is silly, but perhaps an easy way out is to disable the
feature for tables with indexes.)

> Contrariwise, it doesn't really matter (I think) if there are WAL-logged
> records already in the table and COPY is adding more that aren't logged.

Only if the page is locked in a fashion that the bulk loader can't
insert tuples into a page that the other transaction is using.  (Not
sure if this can happen in reality.)  Else we risk both inserting a
tuple in the same page, and on recovery finding out that somebody else
used the tuple slot.

-- 
Alvaro Herrera -- Valdivia, Chile         Architect, www.EnterpriseDB.com
"Los románticos son seres que mueren de deseos de vida"


Re: Remove xmin and cmin from frozen tuples

From
Josh Berkus
Date:
Tom, Alvaro,

> > It has to be a *new* table, not an *empty* table.  If it's already
> > visible to other xacts then somebody else could insert into it in
> > parallel with you, because COPY doesn't take an exclusive lock.

There's still major gains to be had, for ETL, in being able to disable 
logging on new tables/partitions.  *particularly* partitions.

> Contrariwise, it doesn't really matter (I think) if there are WAL-logged
> records already in the table and COPY is adding more that aren't logged.
> (You might have to force COPY to start adding the rows on freshly added
> pages ... hmm ... all of a sudden I think we had this discussion
> already? I for sure remember the fresh-pages trick from some other
> thread.)

Yes, and that's what shot the proposal down before.   But I don't think we 
devoted sufficient discussion to the "new table" case.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco


Re: Remove xmin and cmin from frozen tuples

From
Josh Berkus
Date:
People:

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Considering the cost/benefits, rather than doing some optimization for
> > long-lived tuples, I would like to see us merge the existing
> > xmin/xmax/cmin/cmax values back into three storage fields like we had
> > in 7.4 and had to expand to a full four in 8.0 to support
> > subtransactions.

Hmmm.   I personally don't see a whole lot of value in trimming 4 bytes per 
row off an archive table, particularly if the table would need to go 
through some kind of I/O intensive operation to do it.

Where I do see value is in enabling index-only access for "frozen" tables.  
That would be a *huge* gain, especially with bitmaps.   I think we've 
discussed this before,though.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco


Re: Remove xmin and cmin from frozen tuples

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On Fri, Sep 02, 2005 at 04:02:08PM -0400, Tom Lane wrote:
>> It has to be a *new* table, not an *empty* table.  If it's already
>> visible to other xacts then somebody else could insert into it in
>> parallel with you, because COPY doesn't take an exclusive lock.

> What about the indexes?  Logging one of the inserters and not the other
> is certain to corrupt the whole thing.

Good point, but that fits in just fine with the restriction to
just-created tables.

>> Contrariwise, it doesn't really matter (I think) if there are WAL-logged
>> records already in the table and COPY is adding more that aren't logged.

> Only if the page is locked in a fashion that the bulk loader can't
> insert tuples into a page that the other transaction is using.

What other transaction?  The point I was making is thatBEGIN;CREATE TABLE ...INSERT ...COPY ...
is still optimizable.  There isn't going to be anyone competing with
the COPY while it runs.
        regards, tom lane


Re: Remove xmin and cmin from frozen tuples

From
Alvaro Herrera
Date:
On Fri, Sep 02, 2005 at 04:27:59PM -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

> >> Contrariwise, it doesn't really matter (I think) if there are WAL-logged
> >> records already in the table and COPY is adding more that aren't logged.
> 
> > Only if the page is locked in a fashion that the bulk loader can't
> > insert tuples into a page that the other transaction is using.
> 
> What other transaction?  The point I was making is that
>     BEGIN;
>     CREATE TABLE ...
>     INSERT ...
>     COPY ...
> is still optimizable.  There isn't going to be anyone competing with
> the COPY while it runs.

Sure.  I was thinking that you were looking for a mechanism to relax the
other restriction.

-- 
Alvaro Herrera -- Valdivia, Chile         Architect, www.EnterpriseDB.com
Dios hizo a Adán, pero fue Eva quien lo hizo hombre.


Re: Remove xmin and cmin from frozen tuples

From
Alvaro Herrera
Date:
On Fri, Sep 02, 2005 at 01:35:42PM -0700, Josh Berkus wrote:

> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > Considering the cost/benefits, rather than doing some optimization for
> > > long-lived tuples, I would like to see us merge the existing
> > > xmin/xmax/cmin/cmax values back into three storage fields like we had
> > > in 7.4 and had to expand to a full four in 8.0 to support
> > > subtransactions.
> 
> Hmmm.   I personally don't see a whole lot of value in trimming 4 bytes per 
> row off an archive table, particularly if the table would need to go 
> through some kind of I/O intensive operation to do it.

I think you are missing something.  These 4 bytes are not trimmed by an
I/O-intensive operation, they are not written in the first place.

Now, I agree for a very wide table those 4 bytes per tuple may not be a
lot.  But the optimization could be significant for not-wide (uh, sorry,
I don't remember the word) tables. 

> Where I do see value is in enabling index-only access for "frozen" tables.  
> That would be a *huge* gain, especially with bitmaps.   I think we've 
> discussed this before, though.

That's a completely different discussion.  Btree-organized heaps may
help you there.

-- 
Alvaro Herrera -- Valdivia, Chile         Architect, www.EnterpriseDB.com
"Having your biases confirmed independently is how scientific progress is
made, and hence made our great society what it is today" (Mary Gardiner)


Re: Remove xmin and cmin from frozen tuples

From
Alvaro Herrera
Date:
On Fri, Sep 02, 2005 at 01:30:58PM -0700, Josh Berkus wrote:

> > Contrariwise, it doesn't really matter (I think) if there are WAL-logged
> > records already in the table and COPY is adding more that aren't logged.
> > (You might have to force COPY to start adding the rows on freshly added
> > pages ... hmm ... all of a sudden I think we had this discussion
> > already? I for sure remember the fresh-pages trick from some other
> > thread.)
> 
> Yes, and that's what shot the proposal down before.   But I don't think we 
> devoted sufficient discussion to the "new table" case.

If we are going to have real partitioning sometime soon, I don't think
the restriction is a problem.  You may have to load a whole partition
again, which may be faster than using logged COPY to an already-filled
partition.  The point is, it's not the whole table, just a partition.

-- 
Alvaro Herrera -- Valdivia, Chile         Architect, www.EnterpriseDB.com
"Acepta los honores y aplausos y perderás tu libertad"


Re: Remove xmin and cmin from frozen tuples

From
Bruce Momjian
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > On Fri, Sep 02, 2005 at 04:02:08PM -0400, Tom Lane wrote:
> >> It has to be a *new* table, not an *empty* table.  If it's already
> >> visible to other xacts then somebody else could insert into it in
> >> parallel with you, because COPY doesn't take an exclusive lock.
> 
> > What about the indexes?  Logging one of the inserters and not the other
> > is certain to corrupt the whole thing.
> 
> Good point, but that fits in just fine with the restriction to
> just-created tables.

Seem the newly created table could have an index, but we would skip
logging on that too and create a zero-length file on crash restore.

> >> Contrariwise, it doesn't really matter (I think) if there are WAL-logged
> >> records already in the table and COPY is adding more that aren't logged.
> 
> > Only if the page is locked in a fashion that the bulk loader can't
> > insert tuples into a page that the other transaction is using.
> 
> What other transaction?  The point I was making is that
>     BEGIN;
>     CREATE TABLE ...
>     INSERT ...
>     COPY ...
> is still optimizable.  There isn't going to be anyone competing with
> the COPY while it runs.

Updated TODO:
       o Allow COPY on a newly-created table to skip WAL logging
         On crash recovery, the table involved in the COPY would         have its heap and index files truncated.  One
issueis         that no other backend should be able to add to the table         at the same time, which is something
thatis currently         allowed.
 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Remove xmin and cmin from frozen tuples

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Updated TODO:

>         o Allow COPY on a newly-created table to skip WAL logging

>           On crash recovery, the table involved in the COPY would
>           have its heap and index files truncated.  One issue is
>           that no other backend should be able to add to the table
>           at the same time, which is something that is currently
>           allowed.

This is simply wrong.  (1) a table created in the current transaction
isn't visible to anyone else, (2) the correct rollback state is for
it not to be there, rather than be there and empty.
        regards, tom lane


Re: Remove xmin and cmin from frozen tuples

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On Fri, Sep 02, 2005 at 01:35:42PM -0700, Josh Berkus wrote:
>> Where I do see value is in enabling index-only access for "frozen" tables.  
>> That would be a *huge* gain, especially with bitmaps.   I think we've 
>> discussed this before, though.

> That's a completely different discussion.  Btree-organized heaps may
> help you there.

There was some talk of using a spare bit in index entries to mark "known
good" index entries (xmin committed and less than GlobalXmin, and xmax
invalid) but the cost of maintaining such bits seems nontrivial.  In any
case I agree that's an independent issue.
        regards, tom lane


Re: Remove xmin and cmin from frozen tuples

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Updated TODO:
> 
> >         o Allow COPY on a newly-created table to skip WAL logging
> 
> >           On crash recovery, the table involved in the COPY would
> >           have its heap and index files truncated.  One issue is
> >           that no other backend should be able to add to the table
> >           at the same time, which is something that is currently
> >           allowed.
> 
> This is simply wrong.  (1) a table created in the current transaction
> isn't visible to anyone else, (2) the correct rollback state is for
> it not to be there, rather than be there and empty.

New text:
       o Allow COPY on a newly-created table to skip WAL logging
         On crash recovery, the table involved in the COPY would         removed or have its heap and index files
truncated. One         issue is that no other backend should be able to add to         the table at the same time,
whichis something that is         currently allowed.
 

I think we can lock a zero-length table and do this optimization.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Remove xmin and cmin from frozen tuples

From
Alvaro Herrera
Date:
On Fri, Sep 02, 2005 at 05:18:09PM -0400, Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Updated TODO:
> 
> >         o Allow COPY on a newly-created table to skip WAL logging
> 
> >           On crash recovery, the table involved in the COPY would
> >           have its heap and index files truncated.  One issue is
> >           that no other backend should be able to add to the table
> >           at the same time, which is something that is currently
> >           allowed.
> 
> This is simply wrong.  (1) a table created in the current transaction
> isn't visible to anyone else, (2) the correct rollback state is for
> it not to be there, rather than be there and empty.

As a related note:

I remember somebody mentioned some time ago that if you create a table
and then crash before ending the transaction, the tuple in pg_class is
no longer valid, but the file remains.  I think this will be a much
worse problem if we allow a table that's being COPY'ed to remain after a
crash.

-- 
Alvaro Herrera -- Valdivia, Chile         Architect, www.EnterpriseDB.com
Oh, oh, las chicas galacianas, lo harán por las perlas,
¡Y las de Arrakis por el agua! Pero si buscas damas
Que se consuman como llamas, ¡Prueba una hija de Caladan! (Gurney Halleck)


Re: Remove xmin and cmin from frozen tuples

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> I remember somebody mentioned some time ago that if you create a table
> and then crash before ending the transaction, the tuple in pg_class is
> no longer valid, but the file remains.

Right --- it will be removed on transaction rollback, but not if the
backend crashes first.  There was a patch submitted earlier this year to
try to clean out such files, but it got rejected (as too messy IIRC).
I think we still have a TODO item about it.
        regards, tom lane


Re: Remove xmin and cmin from frozen tuples

From
Manfred Koizar
Date:
On Fri, 2 Sep 2005 15:51:15 -0400 (EDT), Bruce Momjian
<pgman@candle.pha.pa.us> wrote:
>    * Merge xmin/xmax/cmin/cmax back into three header fields

And don't forget xvac, please.
>      Before subtransactions, there used to be only three fields needed to
>      store these four values.

... five values.

> This was possible because only the current
> transaction looks at the cmin/cmax values.

Which is a reason to get rid of cmin/cmax in tuple headers entirely.
Once I had a patch based on 7.4 that stored cmin and cmax in
backend-local memory.  It passed make check and some volume tests, but
I felt it was not ready to be applied without any spill-to-disk
mechanism.  Development stalled when I tried to eliminate xvac as
well, which would have required deep cuts into VACUUM code :-(

ServusManfred



Re: Remove xmin and cmin from frozen tuples

From
Bruce Momjian
Date:
Manfred Koizar wrote:
> On Fri, 2 Sep 2005 15:51:15 -0400 (EDT), Bruce Momjian
> <pgman@candle.pha.pa.us> wrote:
> >    * Merge xmin/xmax/cmin/cmax back into three header fields
> 
> And don't forget xvac, please.
>     
> >      Before subtransactions, there used to be only three fields needed to
> >      store these four values.
> 
> ... five values.
> 
> > This was possible because only the current
> > transaction looks at the cmin/cmax values.
> 
> Which is a reason to get rid of cmin/cmax in tuple headers entirely.
> Once I had a patch based on 7.4 that stored cmin and cmax in
> backend-local memory.  It passed make check and some volume tests, but
> I felt it was not ready to be applied without any spill-to-disk
> mechanism.  Development stalled when I tried to eliminate xvac as
> well, which would have required deep cuts into VACUUM code :-(

Interesting idea, but how would you record the cmin/xmin values without
requiring unlimited memory?

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Remove xmin and cmin from frozen tuples

From
Manfred Koizar
Date:
On Fri, 2 Sep 2005 20:41:48 -0400 (EDT), Bruce Momjian
<pgman@candle.pha.pa.us> wrote:
>> Once I had a patch based on 7.4 that stored cmin and cmax in
>> backend-local memory.

>Interesting idea, but how would you record the cmin/xmin values without
>requiring unlimited memory?

That's exactly the reason for not sending it to -patches.  Without
spilling to disk this is just not ready for real life.  The problem is
that -- unlike other data structures that build up during a
transaction, e.g. trigger queues -- cmin/cmax lookup requires random
access, so we'd need some form of tree or hash.  Unfornunately I never
got beyond brainstorming :-(

BTW, is there anything else that'd need spilling to disk during long
transactions?

ServusManfred



Re: Remove xmin and cmin from frozen tuples

From
Alvaro Herrera
Date:
On Sat, Sep 03, 2005 at 10:59:31AM +0200, Manfred Koizar wrote:
> On Fri, 2 Sep 2005 20:41:48 -0400 (EDT), Bruce Momjian
> <pgman@candle.pha.pa.us> wrote:
> >> Once I had a patch based on 7.4 that stored cmin and cmax in
> >> backend-local memory.
> 
> >Interesting idea, but how would you record the cmin/xmin values without
> >requiring unlimited memory?
> 
> That's exactly the reason for not sending it to -patches.  Without
> spilling to disk this is just not ready for real life.  The problem is
> that -- unlike other data structures that build up during a
> transaction, e.g. trigger queues -- cmin/cmax lookup requires random
> access, so we'd need some form of tree or hash.  Unfornunately I never
> got beyond brainstorming :-(
> 
> BTW, is there anything else that'd need spilling to disk during long
> transactions?

Yes, the queue of pending deferred triggers.

-- 
Alvaro Herrera -- Valdivia, Chile         Architect, www.EnterpriseDB.com
"El hombre nunca sabe de lo que es capaz hasta que lo intenta" (C. Dickens)


Re: Remove xmin and cmin from frozen tuples

From
Bruce Momjian
Date:
Manfred Koizar wrote:
> On Fri, 2 Sep 2005 20:41:48 -0400 (EDT), Bruce Momjian
> <pgman@candle.pha.pa.us> wrote:
> >> Once I had a patch based on 7.4 that stored cmin and cmax in
> >> backend-local memory.
> 
> >Interesting idea, but how would you record the cmin/xmin values without
> >requiring unlimited memory?
> 
> That's exactly the reason for not sending it to -patches.  Without
> spilling to disk this is just not ready for real life.  The problem is
> that -- unlike other data structures that build up during a
> transaction, e.g. trigger queues -- cmin/cmax lookup requires random
> access, so we'd need some form of tree or hash.  Unfornunately I never
> got beyond brainstorming :-(

What we do for shared row locks is much more compact because we create a
phantom xid for combination of xids and store that in a hash.  I think
the problem with having cmin/cmax in local memory is that without a way
to stamp a row with some fake cid (as is suggested in the TODO item now)
there is really no way to _group_ rows together to store their values as
a single item in local memory, meaning we would have to store ctids or
something for each row, and that seems unscalable.

I have added your memory-only idea to the TODO item because it
highlights that cmin/cmax is never looked at outside the backend that is
modifying the row.

If there was a tuple header field we can use just while the row is being
created or expired we could use that, but I don't see one.  Could we
recalculated one of the fields after it is created/expired?  I don't
know.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Remove xmin and cmin from frozen tuples

From
"Jim C. Nasby"
Date:
On Fri, Sep 02, 2005 at 03:51:15PM -0400, Bruce Momjian wrote:
>       One possible solution is to create a phantom cid which represents a
>       cmin/cmax pair and is stored in local memory.

If we're going to look at doing that I think it would also be good to
consider including xmin and xmax as well. This might require persisting
to disk, but for transactions that touch a number of tuples it could
potentially be a big win (imagine being able to shrink all 4 fields down
to a single int; a 45% space reduction).
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Remove xmin and cmin from frozen tuples

From
Alvaro Herrera
Date:
On Tue, Sep 06, 2005 at 03:58:28PM -0500, Jim C. Nasby wrote:
> On Fri, Sep 02, 2005 at 03:51:15PM -0400, Bruce Momjian wrote:
> >       One possible solution is to create a phantom cid which represents a
> >       cmin/cmax pair and is stored in local memory.
> 
> If we're going to look at doing that I think it would also be good to
> consider including xmin and xmax as well.

If you do that, you'll never be able to delete or update the tuple.


> This might require persisting to disk, but for transactions that touch
> a number of tuples it could potentially be a big win (imagine being
> able to shrink all 4 fields down to a single int; a 45% space
> reduction).

Yeah, I've heard about compression algorithms that managed to fit
megabytes of data in 8 bytes and even less.  They were very cool.  No
one managed to write decompression algorithms however.  Imagine a whole
data warehouse could be stored in a single disk block!!  I imagine the
development of decompressors was boycotted by SAN vendors and the like.

-- 
Alvaro Herrera -- Valdivia, Chile         Architect, www.EnterpriseDB.com
"Si un desconocido se acerca y te regala un CD de Ubuntu ...                                    Eso es ...  Eau de
Tux"


Re: Remove xmin and cmin from frozen tuples

From
"Jim C. Nasby"
Date:
On Tue, Sep 06, 2005 at 05:37:06PM -0400, Alvaro Herrera wrote:
> On Tue, Sep 06, 2005 at 03:58:28PM -0500, Jim C. Nasby wrote:
> > On Fri, Sep 02, 2005 at 03:51:15PM -0400, Bruce Momjian wrote:
> > >       One possible solution is to create a phantom cid which represents a
> > >       cmin/cmax pair and is stored in local memory.
> > 
> > If we're going to look at doing that I think it would also be good to
> > consider including xmin and xmax as well.
> 
> If you do that, you'll never be able to delete or update the tuple.

My idea was to use an int to represent combinations of (c|x)(min|max),
probably on a per-table basis. Essentially, it would normalize these
values. I don't see how this would eliminate the ability to update or
delete.

Of course this would actually hurt on tables that had mostly single-row
operations, so it would have to be a table-creation option. I believe it
could substantially reduce the size of tables that don't see a lot of
transactions. It would be good just to have a quick and dirty patch just
to see if this is the case or not.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Remove xmin and cmin from frozen tuples

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
>>> If we're going to look at doing that I think it would also be good to
>>> consider including xmin and xmax as well.
>> 
>> If you do that, you'll never be able to delete or update the tuple.

> My idea was to use an int to represent combinations of (c|x)(min|max),
> probably on a per-table basis. Essentially, it would normalize these
> values. I don't see how this would eliminate the ability to update or
> delete.

How will other transactions know whether the tuple is good (yet) or not?
How will you recover if the backend that does know this crashes before
transaction end?  How will you lock tuples for update/delete?
        regards, tom lane


Re: Remove xmin and cmin from frozen tuples

From
"Jim C. Nasby"
Date:
On Tue, Sep 06, 2005 at 06:02:27PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> >>> If we're going to look at doing that I think it would also be good to
> >>> consider including xmin and xmax as well.
> >> 
> >> If you do that, you'll never be able to delete or update the tuple.
> 
> > My idea was to use an int to represent combinations of (c|x)(min|max),
> > probably on a per-table basis. Essentially, it would normalize these
> > values. I don't see how this would eliminate the ability to update or
> > delete.
> 
> How will other transactions know whether the tuple is good (yet) or not?
> How will you recover if the backend that does know this crashes before
> transaction end?  How will you lock tuples for update/delete?

If the 4 header fields in question were just normalized out, wouldn't
all the semantics continue to work the same? All I'm envisioning is
replacing them in each tuple with a pointer (vis_id) to another
datastore that would be roughly equivalent to:

CREATE TABLE visibility (   vis_id      SERIAL,   xmin        int,   xmax        int,   cmin        int,   cmax_xmax
int
)

Of course you wouldn't use an actual table to do this, but hopefully
this clarifies my idea. Any time the backend would update any of those
fields it would now insert a new row into visibility containing the
proper values and use vis_id in the tuples.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Remove xmin and cmin from frozen tuples

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> If the 4 header fields in question were just normalized out, wouldn't
> all the semantics continue to work the same? All I'm envisioning is
> replacing them in each tuple with a pointer (vis_id) to another
> datastore that would be roughly equivalent to:

> CREATE TABLE visibility (
>     vis_id      SERIAL,
>     xmin        int,
>     xmax        int,
>     cmin        int,
>     cmax_xmax   int
> )

> Of course you wouldn't use an actual table to do this, but hopefully
> this clarifies my idea.

Let's see ... replace every tuple access with TWO tuple accesses
... yes, that should improve performance nicely.  And no, I'm not
sure the semantics are the same, particularly w.r.t. atomicity of
state changes.
        regards, tom lane


Re: Remove xmin and cmin from frozen tuples

From
"Jim C. Nasby"
Date:
On Tue, Sep 06, 2005 at 07:02:20PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > If the 4 header fields in question were just normalized out, wouldn't
> > all the semantics continue to work the same? All I'm envisioning is
> > replacing them in each tuple with a pointer (vis_id) to another
> > datastore that would be roughly equivalent to:
> 
> > CREATE TABLE visibility (
> >     vis_id      SERIAL,
> >     xmin        int,
> >     xmax        int,
> >     cmin        int,
> >     cmax_xmax   int
> > )
> 
> > Of course you wouldn't use an actual table to do this, but hopefully
> > this clarifies my idea.
> 
> Let's see ... replace every tuple access with TWO tuple accesses
> ... yes, that should improve performance nicely.  And no, I'm not
> sure the semantics are the same, particularly w.r.t. atomicity of
> state changes.

Well, like I said, I'm not envisioning using a table to store that info.
Since we'd be storing 4 fixed-length fields, you wouldn't need to
actually store vis_id per entry, just use the offset into the page.
Assuming you use one 'slot' to store the id of the first set, you could
store 511 tuples per page, which doesn't sound very bad.

But this is why I'm hoping a quick patch could be done so that this
could be tested.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Remove xmin and cmin from frozen tuples

From
Alvaro Herrera
Date:
On Wed, Sep 07, 2005 at 12:31:01AM -0500, Jim C. Nasby wrote:
> On Tue, Sep 06, 2005 at 07:02:20PM -0400, Tom Lane wrote:
> > "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > > If the 4 header fields in question were just normalized out, wouldn't
> > > all the semantics continue to work the same? All I'm envisioning is
> > > replacing them in each tuple with a pointer (vis_id) to another
> > > datastore that would be roughly equivalent to:
> > 
> > > CREATE TABLE visibility (
> > >     vis_id      SERIAL,
> > >     xmin        int,
> > >     xmax        int,
> > >     cmin        int,
> > >     cmax_xmax   int
> > > )
> > 
> > > Of course you wouldn't use an actual table to do this, but hopefully
> > > this clarifies my idea.

> Well, like I said, I'm not envisioning using a table to store that info.
> Since we'd be storing 4 fixed-length fields, you wouldn't need to
> actually store vis_id per entry, just use the offset into the page.
> Assuming you use one 'slot' to store the id of the first set, you could
> store 511 tuples per page, which doesn't sound very bad.

I think this could be done with our "SLRU" mechanism, just like pg_clog,
pg_subtrans and pg_multixact do.  Whether it's really a good idea or
not, it's another story.  The problem is that you would be creating new
ones all the time, it would become a huge source of contention, and it
would use a lot of memory.

Anyway you are just moving the storage somewhere else -- instead of
having 4 fields in the tuple itself, you have one field which points
the same 4 fields elsewhere.  I don't see how is that a win; it's
actually worse because you have to do two lookups.  (And actually you
have just enlarged the storage requirements because you need to store
the "vis_id" twice.)

-- 
Alvaro Herrera -- Valdivia, Chile         Architect, www.EnterpriseDB.com
"La realidad se compone de muchos sueños, todos ellos diferentes,
pero en cierto aspecto, parecidos..." (Yo, hablando de sueños eróticos)


Re: Remove xmin and cmin from frozen tuples

From
"Jim C. Nasby"
Date:
On Wed, Sep 07, 2005 at 01:05:52PM -0400, Alvaro Herrera wrote:
> Anyway you are just moving the storage somewhere else -- instead of
> having 4 fields in the tuple itself, you have one field which points
> the same 4 fields elsewhere.  I don't see how is that a win; it's
> actually worse because you have to do two lookups.  (And actually you
> have just enlarged the storage requirements because you need to store
> the "vis_id" twice.)

It would only be of use if the table had few transactions in it; in
other words, if it was mostly read-only. For a true read-only table
there are other options people have suggested that are probably better.

BTW, this becomes even more attractive if vis_id is int2; in that case
you can keep the entire mapping in memory in ~1MB.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Remove xmin and cmin from frozen tuples

From
Alvaro Herrera
Date:
On Wed, Sep 07, 2005 at 01:20:27PM -0400, Tom Lane wrote:

> Anyway the fundamental insight has been completely lost here.  The
> original point was that cmin and cmax are only interesting within the
> originating transaction, and not to anyone else, and thus perhaps don't
> need to be kept in permanent storage; while xmin/xmax are different
> animals because they *are* of interest to other transactions.

I'm curious to know how can you store the cmin/cmax pair completely out
of the tuple.  It's easy to see how to store a single identifier in each
tuple that would be an index to a structure in local memory.  However,
to eliminate both you'd have to keep a list of all tuples you have
created or obsoleted, with the cmin and cmax of each.  This seems like
an awful amount of memory.

-- 
Alvaro Herrera -- Valdivia, Chile         Architect, www.EnterpriseDB.com
"I can't go to a restaurant and order food because I keep looking at the
fonts on the menu.  Five minutes later I realize that it's also talking
about food" (Donald Knuth)


Re: Remove xmin and cmin from frozen tuples

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> I'm curious to know how can you store the cmin/cmax pair completely out
> of the tuple.  It's easy to see how to store a single identifier in each
> tuple that would be an index to a structure in local memory.  However,
> to eliminate both you'd have to keep a list of all tuples you have
> created or obsoleted, with the cmin and cmax of each.  This seems like
> an awful amount of memory.

Yeah.  I think a reasonable compromise scheme is to try to get down to
three fields per tuple:
xmin    same as nowxmax    same as nowcid/xvac

xvac can share storage with the command ID info as long as VACUUM FULL
never tries to move a tuple whose originating or deleting transaction
is still running ... which is pretty much the same restriction we had
before.

For the command IDs, I am imagining:

if created in current transaction: use cid to store cmin

if deleted in current transaction: use cid to store cmax

if both created and deleted in current transaction: cid is an index
into an in-memory data structure that contains cmin and cmax.

"current transaction" would have to have the loose definition that
includes any subxact of the current top xact, but still, I think that
the case where you need both fields is relatively uncommon.

The in-memory data structure would only need to contain an entry for
each distinct combination of cmin and cmax used in the current xact,
so I think we could assume that it would never get unreasonably large.
The entries would be created "on demand" much like we do for
multixact ids (I guess you'd want a hash table to map requested
cmin/cmax to an existing entry ID quickly).
        regards, tom lane


Re: Remove xmin and cmin from frozen tuples

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> I think this could be done with our "SLRU" mechanism, just like pg_clog,
> pg_subtrans and pg_multixact do.  Whether it's really a good idea or
> not, it's another story.  The problem is that you would be creating new
> ones all the time, it would become a huge source of contention, and it
> would use a lot of memory.

... and you couldn't expire the data in a reasonable period of time.
pg_subtrans and pg_multixact have only very short active ranges.
pg_clog is longer-lived, but at only 2 bits per transaction, we can
stand it.  16 bytes per tuple is a whole lot more data.

Anyway the fundamental insight has been completely lost here.  The
original point was that cmin and cmax are only interesting within the
originating transaction, and not to anyone else, and thus perhaps don't
need to be kept in permanent storage; while xmin/xmax are different
animals because they *are* of interest to other transactions.
The storage scheme Jim proposes takes no advantage of that whatever.
        regards, tom lane


Re: Remove xmin and cmin from frozen tuples

From
Bruce Momjian
Date:
This is now in the TODO list:

* Merge xmin/xmax/cmin/cmax back into three header fields
 Before subtransactions, there used to be only three fields needed to store these four values. This was possible
becauseonly the current transaction looks at the cmin/cmax values. If the current transaction created and expired the
rowthe fields stored where xmin (same as xmax), cmin, cmax, and if the transaction was expiring a row from a another
transaction,the fields stored were xmin (cmin was not needed), xmax, and cmax. Such a system worked because a
transactioncould only see rows from another completed transaction. However, subtransactions can see rows from outer
transactions,and once the subtransaction completes, the outer transaction continues, requiring the storage of all four
fields.With subtransactions, an outer transaction can create a row, a subtransaction expire it, and when the
subtransactioncompletes, the outer transaction still has to have proper visibility of the row's cmin, for example, for
cursors.
 One possible solution is to create a phantom cid which represents a cmin/cmax pair and is stored in local memory.
Anotheridea is to store both cmin and cmax only in local memory.
 


---------------------------------------------------------------------------

Tom Lane wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > I'm curious to know how can you store the cmin/cmax pair completely out
> > of the tuple.  It's easy to see how to store a single identifier in each
> > tuple that would be an index to a structure in local memory.  However,
> > to eliminate both you'd have to keep a list of all tuples you have
> > created or obsoleted, with the cmin and cmax of each.  This seems like
> > an awful amount of memory.
> 
> Yeah.  I think a reasonable compromise scheme is to try to get down to
> three fields per tuple:
> 
>     xmin    same as now
>     xmax    same as now
>     cid/xvac
> 
> xvac can share storage with the command ID info as long as VACUUM FULL
> never tries to move a tuple whose originating or deleting transaction
> is still running ... which is pretty much the same restriction we had
> before.
> 
> For the command IDs, I am imagining:
> 
> if created in current transaction: use cid to store cmin
> 
> if deleted in current transaction: use cid to store cmax
> 
> if both created and deleted in current transaction: cid is an index
> into an in-memory data structure that contains cmin and cmax.
> 
> "current transaction" would have to have the loose definition that
> includes any subxact of the current top xact, but still, I think that
> the case where you need both fields is relatively uncommon.
> 
> The in-memory data structure would only need to contain an entry for
> each distinct combination of cmin and cmax used in the current xact,
> so I think we could assume that it would never get unreasonably large.
> The entries would be created "on demand" much like we do for
> multixact ids (I guess you'd want a hash table to map requested
> cmin/cmax to an existing entry ID quickly).
> 
>             regards, tom lane
> 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073