Thread: Another idea for dealing with cmin/cmax

Another idea for dealing with cmin/cmax

From
"Jim C. Nasby"
Date:
In addition to/instead of abstracting cmin/cmax to a phantom ID, what
about allowing for two versions of the tuple header, one with cid info
and one without? That would allow for cid info to be stripped out when
pages were written to disk.
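
Roughly what I have in mind (just a sketch; these are not the real
structs, and the field split is invented):

    /* Hypothetical on-disk header: no command IDs. */
    typedef struct DiskTupleHeader
    {
        TransactionId   xmin;        /* inserting transaction        */
        TransactionId   xmax;        /* deleting/locking transaction */
        ItemPointerData t_ctid;      /* TID of this or newer version */
        uint16          t_infomask;  /* flag bits                    */
        uint8           t_hoff;      /* header size incl. bitmap     */
        /* null bitmap and user data follow */
    } DiskTupleHeader;

    /* Hypothetical in-memory header: the same plus command IDs. */
    typedef struct MemTupleHeader
    {
        DiskTupleHeader disk;        /* everything that goes to disk */
        CommandId       cmin;        /* stripped when written out    */
        CommandId       cmax;
    } MemTupleHeader;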

The downside to this is that we'd have to be able to deal with pages
in-memory potentially being larger than pages on-disk. Since there's
been discussion of separating on-disk and in-memory page formats, maybe
that doesn't kill the proposal outright.
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: Another idea for dealing with cmin/cmax

From
Heikki Linnakangas
Date:
Jim C. Nasby wrote:
> In addition to/instead of abstracting cmin/cmax to a phantom ID, what
> about allowing for two versions of the tuple header, one with cid info
> and one without? That would allow for cid info to be stripped out when
> pages were written to disk.
>   

How exactly would that help? You can't just strip out cid info when 
writing to disk, if you don't want to lose the information.

And it's certainly a lot more complicated than the phantom id thing.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com



Re: Another idea for dealing with cmin/cmax

From
"Jim C. Nasby"
Date:
On Thu, Sep 28, 2006 at 05:13:11PM +0100, Heikki Linnakangas wrote:
> Jim C. Nasby wrote:
> >In addition to/instead of abstracting cmin/cmax to a phantom ID, what
> >about allowing for two versions of the tuple header, one with cid info
> >and one without? That would allow for cid info to be stripped out when
> >pages were written to disk.
> >  
> 
> How exactly would that help? You can't just strip out cid info when 
> writing to disk, if you don't want to lose the information.

Erm, sorry, brainfart... yeah, we'd need to be able to write the info
out to disk.

The reason I thought of this is because once the transaction commits, we
have no use for the cid info. So we could do something like have
bgwriter look for tuples that belong to committed transactions before it
writes a page, and strip the cid out of them.

The problem with *that* is that (AFAIK) you'll need cid info again once
you go to update or delete that tuple. And that might obviously need to
spill to disk before the transaction commits.

Back to the drawing board...
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: Another idea for dealing with cmin/cmax

From
ITAGAKI Takahiro
Date:
"Jim C. Nasby" <jim@nasby.net> wrote:

> The reason I thought of this is because once the transaction commits, we
> have no use for the cid info. So we could do something like have
> bgwriter look for tuples that belong to committed transactions before it
> writes a page, and strip the cid out of them.

Your concept is much like the experimental method I suggested before
in http://archives.postgresql.org/pgsql-hackers/2005-08/msg01193.php
We can remove cmin and cmax from committed tuples and xmin from frozen
tuples, and we might save some bytes in the tuple headers.

However, I think our next goal for shrinking the headers is 16 bytes. The
headers become 23 bytes with phantom cids, and because of alignment we
gain nothing further unless we remove another 7 bytes from the headers...
and that seems very difficult.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: Another idea for dealing with cmin/cmax

From
Heikki Linnakangas
Date:
ITAGAKI Takahiro wrote:
> However, I think our next goal for shrinking the headers is 16 bytes.
> The headers become 23 bytes with phantom cids, and because of alignment
> we gain nothing further unless we remove another 7 bytes from the
> headers... and that seems very difficult.

Yeah, I thought about that too earlier.

If we get rid of VACUUM FULL, or replace it with something that doesn't 
need xvac, and keep cmin and cmax in backend-private storage, we could 
get rid of the overlayed t_field4, which is 4 bytes. Then we're down to 
19 bytes.

We could get rid of t_hoff, because we should always be able to 
calculate the header size. Then we're down to 18 bytes.

There are currently 15 bits in use in the infomask. After we remove the 
HEAP_MOVED_* bits that we don't need without VACUUM FULL, that's down 
to 13 bits. t_natts only needs 11 bits, because MaxHeapAttributeNumber 
is 1600. We could move 5 of the bits in infomask to the high 5 bits of 
t_natts, and save one byte.

We're now down to 17 bytes. That's as far as I got.
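
Just to spell out where that would leave the header (a sketch only, not
working code; t_natts_info is an invented combined field):

    typedef struct SlimTupleHeaderData    /* hypothetical 17-byte header */
    {
        TransactionId   t_xmin;        /* 4 bytes: inserting xact          */
        TransactionId   t_xmax;        /* 4 bytes: deleting xact           */
        ItemPointerData t_ctid;        /* 6 bytes: TID of newer version    */
        uint16          t_natts_info;  /* 2 bytes: 11 bits natts + 5 flags */
        uint8           t_infomask;    /* 1 byte: the remaining flag bits  */
        /* no t_hoff: header length is computed; null bitmap follows */
    } SlimTupleHeaderData;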

So it seems we could shave off some bytes, but we still can't get down 
to 16. And the changes needed in total would be quite massive.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


Re: Another idea for dealing with cmin/cmax

From
Martijn van Oosterhout
Date:
On Fri, Sep 29, 2006 at 09:35:31AM +0100, Heikki Linnakangas wrote:
> We could get rid of t_hoff, because we should always be able to
> calculate the header size. Then we're down to 18 bytes.

Without t_hoff, how do you know the size of the null bitmap? You could
probably do it only if you assume the null bitmap, if present, always
covers all the columns...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.

Re: Another idea for dealing with cmin/cmax

From
Heikki Linnakangas
Date:
Martijn van Oosterhout wrote:
> On Fri, Sep 29, 2006 at 09:35:31AM +0100, Heikki Linnakangas wrote:
>> We could get rid of t_hoff, because we should always be able to
>> calculate the header size. Then we're down to 18 bytes.
>
> Without t_hoff, how do you know the size of the null bitmap? You could
> probably do it only if you assume the null bitmap, if present, always
> covers all the columns...

I think we assume that already. heap_form_tuple reserves space for the 
bitmap like this:
   if (hasnull)
       len += BITMAPLEN(numberOfAttributes);

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


Re: Another idea for dealing with cmin/cmax

From
Martijn van Oosterhout
Date:
On Fri, Sep 29, 2006 at 10:59:13AM +0100, Heikki Linnakangas wrote:
> Martijn van Oosterhout wrote:
> >On Fri, Sep 29, 2006 at 09:35:31AM +0100, Heikki Linnakangas wrote:
> >>We could get rid of t_hoff, because we should always be able to
> >>calculate the header size. Then we're down to 18 bytes.
> >
> >Without t_hoff, how do you know the size of the null bitmap? You could
> >probably do it only if you assume the null bitmap, if present, always
> >covers all the columns...
>
> I think we assume that already. heap_form_tuple reserves space for the
> bitmap like this:
>
>    if (hasnull)
>        len += BITMAPLEN(numberOfAttributes);

Ok, now we do an ALTER TABLE blah ADD COLUMN ..., and we have to expand
the bitmaps for the entire table?

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.

Re: Another idea for dealing with cmin/cmax

From
Heikki Linnakangas
Date:
Martijn van Oosterhout wrote:
> On Fri, Sep 29, 2006 at 10:59:13AM +0100, Heikki Linnakangas wrote:
>> Martijn van Oosterhout wrote:
>>> On Fri, Sep 29, 2006 at 09:35:31AM +0100, Heikki Linnakangas wrote:
>>>> We could get rid of t_hoff, because we should always be able to
>>>> calculate the header size. Then we're down to 18 bytes.
>>> Without t_hoff, how do you know the size of the null bitmap? You could
>>> probably do it only if you assume the null bitmap, if present, always
>>> covers all the columns...
>> I think we assume that already. heap_form_tuple reserves space for the
>> bitmap like this:
>>
>> if (hasnull)
>> len += BITMAPLEN(numberOfAttributes);
>
> Ok, now we do an ALTER TABLE blah ADD COLUMN ..., and we have to expand
> the bitmaps for the entire table?
No, you'd still have the number of attributes (t_natts) in the header.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


Re: Another idea for dealing with cmin/cmax

From
"Jim C. Nasby"
Date:
On Fri, Sep 29, 2006 at 01:15:06PM +0900, ITAGAKI Takahiro wrote:
> 
> "Jim C. Nasby" <jim@nasby.net> wrote:
> 
> > The reason I thought of this is because once the transaction commits, we
> > have no use for the cid info. So we could do something like have
> > bgwriter look for tuples that belong to committed transactions before it
> > writes a page, and strip the cid out of them.
> 
> Your concept is much like the experimental method I suggested before
> in http://archives.postgresql.org/pgsql-hackers/2005-08/msg01193.php
> We can remove cmin and cmax from committed tuples and xmin from frozen
> tuples, and we might save some bytes in the tuple headers.
> 
> However, I think our next goal for shrinking the headers is 16 bytes.
> The headers become 23 bytes with phantom cids, and because of alignment
> we gain nothing further unless we remove another 7 bytes from the
> headers... and that seems very difficult.

Dumb question... wouldn't getting down to 20 bytes buy us something?
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: Another idea for dealing with cmin/cmax

From
Tom Lane
Date:
"Jim C. Nasby" <jim@nasby.net> writes:
> Dumb question... wouldn't getting down to 20 bytes buy us something?

Only on 32-bit machines, which are getting less interesting as database
servers every day.  (Just last night I was reading somebody opining that
the transition to 64-bit hardware would be effectively complete by 2008
... and he was talking about desktop PCs, not serious iron.)

BTW, the apparently useless byte after the 27- or 23-byte header
actually has some good use: in a table of up to 8 columns, you can
fit a null bitmap there "for free".  In a scheme that took us down
to 20 rather than 19 bytes, even a narrow table would pay the full
maxalign price for having a null.
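
To spell out the alignment arithmetic (assuming a machine where maxalign
is 4 bytes, as on the 32-bit boxes in question):

    /*
     * MAXALIGN(23)     = 24     MAXALIGN(23 + 1) = 24
     *   -> the one-byte null bitmap (covers up to 8 columns) rides for
     *      free in the padding after a 23-byte header.
     *
     * MAXALIGN(20)     = 20     MAXALIGN(20 + 1) = 24
     *   -> with a 20-byte header, the first null on a narrow table costs
     *      a full maxalign step on every tuple.
     */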

I'm in favor of combining cmin/cmax/xvac to get us down to 23 bytes,
but I think anything beyond that is going to face a serious problem
of greatly increased cost for diminishing returns.
        regards, tom lane


Re: Another idea for dealing with cmin/cmax

From
ITAGAKI Takahiro
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> "Jim C. Nasby" <jim@nasby.net> writes:
> > Dumb question... wouldn't getting down to 20 bytes buy us something?
> 
> BTW, the apparently useless byte after the 27- or 23-byte header
> actually has some good use: in a table of up to 8 columns, you can
> fit a null bitmap there "for free".  In a scheme that took us down
> to 20 rather than 19 bytes, even a narrow table would pay the full
> maxalign price for having a null.

We may use the "free" bytes for other purposes. For example, if we increase
the size of XIDs from 4 to 6 bytes, we can get rid of the transaction
wraparound problem. There may be some other uses.


> I'm in favor of combining cmin/cmax/xvac to get us down to 23 bytes,
> but I think anything beyond that is going to face a serious problem
> of greatly increased cost for diminishing returns.

If we really want to reduce the size of the headers to 16 bytes,
we may need to do something with HeapTupleHeaderData.t_ctid.

Under the assumption that tuples are often updated within the same page,
we only need an offset within the page to link an old tuple to the new
one, so we could reduce t_ctid from 6 to 2 bytes. When tuples are
updated across pages, we would probably need a "phantom ctid".

Under another assumption, that tuples are often updated after being
frozen, we could overlay ctid and xmin onto one physical field using a
union, but "phantom xids" would be needed to update unfrozen tuples.

However, I think both assumptions hold less often than the one behind
phantom cids, so the above ideas probably do not work well.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: Another idea for dealing with cmin/cmax

From
"Jim C. Nasby"
Date:
On Mon, Oct 02, 2006 at 10:52:48AM +0900, ITAGAKI Takahiro wrote:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 
> > "Jim C. Nasby" <jim@nasby.net> writes:
> > > Dumb question... wouldn't getting down to 20 bytes buy us something?
> > 
> > BTW, the apparently useless byte after the 27- or 23-byte header
> > actually has some good use: in a table of up to 8 columns, you can
> > fit a null bitmap there "for free".  In a scheme that took us down
> > to 20 rather than 19 bytes, even a narrow table would pay the full
> > maxalign price for having a null.
> 
> We may use the "free" bytes for other purposes. For example, if we increase
> the size of XIDs from 4 to 6 bytes, we can get rid of the transaction
> wraparound problem. There may be some other uses.
> 
> 
> > I'm in favor of combining cmin/cmax/xvac to get us down to 23 bytes,
> > but I think anything beyond that is going to face a serious problem
> > of greatly increased cost for diminishing returns.
> 
> If we really want to reduce the size of the headers to 16 bytes,
> we may need to do something with HeapTupleHeaderData.t_ctid.
> 
> Under the assumption that tuples are often updated within the same page,
> we only need an offset within the page to link an old tuple to the new
> one, so we could reduce t_ctid from 6 to 2 bytes. When tuples are
> updated across pages, we would probably need a "phantom ctid".
> 
> Under another assumption, that tuples are often updated after being
> frozen, we could overlay ctid and xmin onto one physical field using a
> union, but "phantom xids" would be needed to update unfrozen tuples.
> 
> However, I think both assumptions hold less often than the one behind
> phantom cids, so the above ideas probably do not work well.

Well, for data warehousing, phantom XIDs of some sort would probably be
extremely valuable. Instead of storing a number of 4-byte XIDs in each
tuple, place a limit on the number of transactions that can be live in a
table at once. Granted, this means needing to vacuum-freeze more often,
but since that can be done as a low-load, low-priority background process
it doesn't seem too horrible. If the same technique were applied to
cmin/cmax, you could shrink all the visibility info to 1 byte if you
wanted to.
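
As a very rough sketch of what I mean (entirely hypothetical; the names
and structs are invented):

    #define MAX_LIVE_XACTS_PER_TABLE 255   /* what fits in one byte */

    /* Per-table array of the real 4-byte XIDs currently live in the table. */
    typedef struct TableXidSlots
    {
        TransactionId slots[MAX_LIVE_XACTS_PER_TABLE];
    } TableXidSlots;

    /* Per-tuple visibility shrinks to slot numbers instead of full XIDs;
     * pack the two slots into 4 bits each to get down to a single byte. */
    typedef struct TinyVisibility
    {
        uint8 xmin_slot;    /* 0 = frozen, else index into TableXidSlots */
        uint8 xmax_slot;    /* 0 = not deleted                           */
    } TinyVisibility;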
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: Another idea for dealing with cmin/cmax

From
Tom Lane
Date:
"Jim C. Nasby" <jim@nasby.net> writes:
> ... place a limit on the number of transactions that can be live in a table
> at once.

Urk, well maybe, but ...

> you could shrink all the visibility info to 1 byte if you
> wanted to.

... 256 of 'em is surely not an acceptable limit.
        regards, tom lane


Re: Another idea for dealing with cmin/cmax

From
Hannu Krosing
Date:
On Mon, 2006-10-02 at 01:30, Tom Lane wrote:
> "Jim C. Nasby" <jim@nasby.net> writes:
> > ... place a limit on the number of transactions that can be live in a table
> > at once.
> 
> Urk, well maybe, but ...
> 
> > you could shrink all the visibility info to 1 byte if you
> > wanted to.
> 
> ... 256 of 'em is surely not an acceptable limit.

I have been thinking about this, and it seems that, especially for OLAP
loads, it would be much better to keep tuple visibility info in a
separate file; let's call it a Tuple Visibility Map (TVM).

A TVM would have the following benefits:

1) The TVM could be used for index-only lookups as well as heap-only
lookups; other index lookups could also be filtered against it quickly
before going to the heap.

2) The TVM could be heavily compressed; for bulk loads in particular,
something like a single (xmin, xmax, cmin, cmax) tuple plus an
RLE-encoded list of pointers to it would do (see the sketch below).

3) If TVM space is needed in any page, it would be easy to just throw
away cmin/cmax from tuples of committed/aborted transactions.

4) The first pass of VACUUM would be much faster, as it would only have
to scan the TVM; pages with no expired tuples would not need to be
touched.
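
To illustrate (2), a very rough sketch of one compressed TVM record (the
names and layout are invented, just to show the RLE idea):

    /* One visibility record shared by a run of consecutive tuples,
     * e.g. a whole bulk-loaded range.  Purely illustrative. */
    typedef struct TvmEntry
    {
        TransactionId   xmin;
        TransactionId   xmax;
        CommandId       cmin;
        CommandId       cmax;
        BlockNumber     first_block;    /* where the run starts */
        OffsetNumber    first_offset;
        uint32          ntuples;        /* RLE run length       */
    } TvmEntry;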


If we can come up with a good design for the TVM, it may also be an
overall win for many kinds of OLTP queries, as it may result in fewer
writes to disk and almost the same amount of writing to WAL.

Maybe a bitmap or btree index would be something to use as a starting
point when designing the TVM.

Another idea to consider would be to merge the FSM and the TVM and then
also use them for keeping data in cluster order.

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com




Re: Another idea for dealing with cmin/cmax

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> "Jim C. Nasby" <jim@nasby.net> writes:
> > ... place a limit on the number of transactions that can be live in a table
> > at once.
> 
> Urk, well maybe, but ...
> 
> > you could shrink all the visibility info to 1 byte if you
> > wanted to.
> 
> ... 256 of 'em is surely not an acceptable limit.

The plan Gavin Sherry and I were discussing at the Code Sprint was to
store a single "most common xmin" in the per-page special area, and have
a bit on each tuple indicating that the xmin isn't stored in the tuple
and that the xmin from the page footer should be used instead. Any
tuples with other values of xmin would have to store their xmin normally.
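
In code terms, roughly (just a sketch; the special-area struct and the
infomask bit are invented):

    /* Hypothetical per-page special area. */
    typedef struct HeapPageSpecial
    {
        TransactionId common_xmin;    /* first xmin stored on this page */
    } HeapPageSpecial;

    #define HEAP_XMIN_IS_COMMON 0x8000    /* invented infomask bit */

    /* Sketch of how fetching a tuple's xmin would work. */
    static TransactionId
    tuple_get_xmin(Page page, HeapTupleHeader tup)
    {
        if (tup->t_infomask & HEAP_XMIN_IS_COMMON)
            return ((HeapPageSpecial *) PageGetSpecialPointer(page))->common_xmin;
        return HeapTupleHeaderGetXmin(tup);   /* the normally stored xmin */
    }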

The use case here is primarily tables loaded in large batch jobs that have
large swaths of the table, possibly the entire table, loaded in the same
transaction.

My thinking was that the "most common xmin" could be very approximate. In
fact my plan was to just store the first xmin the page sees there. This
catches the
common use cases of pg_restore or COPY loading many records and even catches
most cases of large inserts into existing tables whenever they extend the
table.

A further refinement could be to have vacuum look for the actual most common
xmin and rewrite the page using it. If there's enough free space it may be
worthwhile storing InvalidTransactionId and allowing the next insert to set
the "most common xmin".

However I'm not convinced of the importance of this refinement. The thing
to remember is that shaving bits off tuples is not a goal in itself; it's
a means to an end, namely packing more tuples on a page. If we shave 4
bytes off every tuple when we're loading thousands of tuples, then we get
to put more of them on a page. If we only save space on tuples when we're
running vacuum, that just gives us more free space in the free space map,
and we only see a benefit down the line if we end up eventually filling
up that space.

-- 
greg



Re: Another idea for dealing with cmin/cmax

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> The plan Gavin Sherry and I were discussing at the Code Sprint was to
> store a single "most common xmin" in the per-page special area, and have
> a bit on each tuple indicating that the xmin isn't stored in the tuple
> and that the xmin from the page footer should be used instead. Any
> tuples with other values of xmin would have to store their xmin normally.

This seems pretty unworkable --- anything that involves more than one
field layout for tuple headers is going to be a huge PITA.
        regards, tom lane


Re: Another idea for dealing with cmin/cmax

From
ITAGAKI Takahiro
Date:
Greg Stark <gsstark@mit.edu> wrote:

> The plan Gavin Sherry and I were discussing at the Code Sprint was to
> store a single "most common xmin" in the per-page special area, and have
> a bit on each tuple indicating that the xmin isn't stored in the tuple
> and that the xmin from the page footer should be used instead. Any
> tuples with other values of xmin would have to store their xmin normally.

Is this a similar approach to the Interested Transaction List (ITL) in Oracle?
http://www.dbazine.com/oracle/or-articles/nanda3

An ITL-like approach is more efficient than per-tuple XIDs
unless all tuples in a page are locked at the same time.
However, MAXTRANS and PCTFREE issues may bother us.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: Another idea for dealing with cmin/cmax

From
Gregory Stark
Date:
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:

> An ITL-like approach is more efficient than per-tuple XIDs
> unless all tuples in a page are locked at the same time.
> However, MAXTRANS and PCTFREE issues may bother us.

I'm not sure how Oracle gets away with MAXTRANS. Somehow it never seems
to arise as a practical problem, yet it seems like there must be
instances where it would cause serious problems.

It's worse for Postgres since as I understand it Oracle only needs to track
transaction ids that are actively locking the record. Postgres needs to track
transaction ids for the inserting and deleting transaction even when there's
no lock (or no lock remaining).

I can imagine having a scheme where there's a short list of transaction ids in
the footer and then each tuple just stores an index into that list. If you had
space for 16 transaction ids set aside then you could store xmin and xmax in a
single byte for example.
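
Something like this, purely as a sketch (the names are invented):

    #define PAGE_XID_SLOTS 16

    /* Hypothetical per-page slot table for transaction ids. */
    typedef struct PageXidFooter
    {
        TransactionId xids[PAGE_XID_SLOTS];   /* 16 * 4 = 64 bytes per page */
    } PageXidFooter;

    /* With 16 slots, an xmin slot and an xmax slot fit in one byte per tuple. */
    typedef struct TupleXidRef
    {
        uint8 ref;    /* high 4 bits: xmin slot, low 4 bits: xmax slot */
    } TupleXidRef;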

If the space set aside for these transaction ids is full when you're
inserting, I suppose you could just go back to the FSM for another page.
But I don't see any way out when you're deleting: you have to mark xmax
one way or another, and if there's no space left in the footer and you
only have 4 bits in the tuple, what are you going to do?

As an aside, doing vacuum freeze more aggressively might reduce the
pressure on these ITL slots.

But I don't see any way to guarantee a slot is available for xmax when
deleting. We would need some sort of scheme where the space for transaction
ids is able to grow but we're already growing from both ends of the page. We
would either have to interleave transaction ids with line pointers or store
them on another special page somewhere.

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com


Re: Another idea for dealing with cmin/cmax

From
Jim Nasby
Date:
On Oct 3, 2006, at 2:23 PM, Gregory Stark wrote:
> If the space set aside for these transaction ids is full when you're
> inserting, I suppose you could just go back to the FSM for another page.
> But I don't see any way out when you're deleting: you have to mark xmax
> one way or another, and if there's no space left in the footer and you
> only have 4 bits in the tuple, what are you going to do?
>
> As an aside, doing vacuum freeze more aggressively might reduce the
> pressure on these ITL slots.
>
> But I don't see any way to guarantee a slot is available for xmax when
> deleting. We would need some sort of scheme where the space for
> transaction ids is able to grow but we're already growing from both ends
> of the page. We would either have to interleave transaction ids with
> line pointers or store them on another special page somewhere.

Well, worst-case you could just redo the whole page if you need to expand
the list of transaction slots; I don't think that's a huge deal. What did
have me baffled was how to deal with xmax, though, since (as you
mentioned) you can end up in a situation where you can't delete a tuple
because there's no more room on the page for another xmax.

But I just thought of a way around that which might be better than a
separate store for transaction info: allow for moving a tuple off the
current page by placing a link to its new location, similar to how ctid
works. We probably wouldn't want to try to cram that into the item list,
but I think we should be able to create a special version of a tuple
header (AddressForwardingHeader) that simply states "the tuple has moved
to this new ctid; go there".
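
Roughly what I'm picturing (a hypothetical struct, nothing that exists
today):

    /* Hypothetical "tuple moved" stub left in place of the full tuple. */
    typedef struct AddressForwardingHeader
    {
        uint16          afh_magic;     /* marks this item as a forwarding stub */
        ItemPointerData afh_newtid;    /* where the tuple actually lives now   */
        TransactionId   afh_xmax;      /* optional visibility hint so that     */
        uint16          afh_infomask;  /* most readers never follow the link   */
    } AddressForwardingHeader;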

Of course, anytime you have to follow that link you're going to pay a
penalty, but I think this should only be needed when trying to delete a
tuple on a page that's basically full. Theoretically, there shouldn't be
too many people trying to hit that deleted tuple, but to further reduce
the number of people hitting it, we could include the visibility info
(or a pointer to it) in the AddressForwardingHeader.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)




Re: Another idea for dealing with cmin/cmax

From
"Jonah H. Harris"
Date:
On 10/3/06, ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> wrote:
> ITL-like approach is more efficient than per-tuple XIDs
> unless all tuples in a page are locked at the same time.
> However, MAXTRANS and PCTFREE issues may bother us.

IIRC, the last time I checked Oracle's patents, they pretty much had
the ITL covered.  So any work in this area would need a significant
amount of research.

While ITLs are generally better than our current approach, there are
often tuning-related issues due to improper settings of initial ITL
slots (INITRANS) and freespace.  Many times a block doesn't have
enough freespace to fulfill MAXTRANS and you'll get contention issues.
However, setting INITRANS too high will waste a significant amount of
storage space... so you have to really know the application you plan to
run on the database and what hotspots may exist in order to effectively
tune these parameters on a table-by-table basis.

Again, it goes back to how much PostgreSQL expects people to become
database tuning experts.  Oracle lets you modify almost any parameter
which is both good and bad.  In Oracle, you can easily fix almost any
bottleneck if you know enough about the system; whereas PostgreSQL's
tuning options are fairly basic and oriented for more generalized
use-cases.  There are obvious pros/cons to each type of system
architecture which, in this case, pretty much relate to design
complexity and the amount of end-user training required to make good
use of the software.

If anyone ever wanted to play with Oracle and understand some of the
issues related to their design considerations, running Quest's
Spotlight on Oracle while concurrently running Benchmark Factory gives
you a nice overview.  I haven't played with either tool in about a
year and a half now, but I think you can still get trial versions.

-- 
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation            | fax: 732.331.1301
33 Wood Ave S, 2nd Floor            | jharris@enterprisedb.com
Iselin, New Jersey 08830            | http://www.enterprisedb.com/