Proposal: Multiversion page api (inplace upgrade)

From
Zdenek Kotala
Date:
1) Overview

This proposal is part of the in-place upgrade project. PostgreSQL should be able to 
read any page in an old version's layout; this is a prerequisite for every possible upgrade method.


2) Background

We have several macros for manipulating the page structures, but this list is 
not complete: many parts of the code access these structures directly, and 
several parts do not use the existing macros. The idea is to use only a specified 
API for manipulating/accessing data structures on a page. This API will recognize 
the page layout version and process the data correctly.


3) API

The proposed API is an extended version of the current macros, which do not cover 
all page header manipulation. I plan to use functions in the first implementation, 
because they offer better type checking and debugging capability, but some 
functions could be converted into macros (or inline functions) in the final 
solution to improve performance. All changes are related to bufpage.h and page.c.


4) Implementation

The main point of the implementation is to have several versions of the PageHeader 
structure (e.g. PageHeader_04, PageHeader_03, ...); the correct structure is 
selected in a version-specific branch (see the examples below).

A possible improvement is to use a union combining the different PageHeader 
versions; because most PageHeader fields are the same across all page layout 
versions, this would reduce the number of switches. But I'm not sure whether a 
union has the same data layout as the separate structures on all supported platforms.

Here are examples:

void PageSetFull(Page page)
{
    switch ( PageGetPageLayoutVersion(page) )
    {
        case 4 : ((PageHeader_04) (page))->pd_flags |= PD_PAGE_FULL;
                  break;
        default: elog(PANIC, "PageSetFull is not supported on page layout version %i",
                PageGetPageLayoutVersion(page));
    }
}

LocationIndex PageGetLower(Page page)
{
    switch ( PageGetPageLayoutVersion(page) )
    {
        case 4 : return ((PageHeader_04) (page))->pd_lower;
    }
    elog(PANIC, "Unsupported page layout in function PageGetLower.");
}


5) Issues
 a) The hash index has a hardcoded PageHeader in its meta page
    structure -> the hash index implementation needs to be rewritten
    to be multi-header friendly.

 b) All *ItemSize macros (+ toast chunk size) depend on
    sizeof(PageHeader) -> a separate proposal will follow soon.

All comments are welcome.
    Zdenek



Re: Proposal: Multiversion page api (inplace upgrade)

From
Tom Lane
Date:
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
> There are examples:

> void PageSetFull(Page page)
> {
>     switch ( PageGetPageLayoutVersion(page) )
>     {
>         case 4 : ((PageHeader_04) (page))->pd_flags |= PD_PAGE_FULL;
>                   break;
>         default: elog(PANIC, "PageSetFull is not supported on page layout version %i",
>                 PageGetPageLayoutVersion(page));
>     }
> }

> LocationIndex PageGetLower(Page page)
> {
>     switch ( PageGetPageLayoutVersion(page) )
>     {
>         case 4 : return ((PageHeader_04) (page))->pd_lower;
>     }
>     elog(PANIC, "Unsupported page layout in function PageGetLower.");
> }

I'm fairly concerned about the performance impact of turning what had
been simple field accesses into function calls.  I argue also that since
none of the PageHeader fields have actually moved in any version that's
likely to be supported, the above functions are actually of exactly
zero value.

The proposed PANIC in PageSetFull seems like it requires more thought as
well: surely we don't want that ever to happen.  Which means that
callers need to be careful not to invoke such an operation on an
un-updated page, but this proposed coding offers no aid in making sure
that won't happen.  What is needed there, I think, is some more global
policy about what operations are permitted on old (un-converted) pages
and a high-level approach to ensuring that unsafe operations aren't
attempted.
        regards, tom lane


Re: Proposal: Multiversion page api (inplace upgrade)

From
"Heikki Linnakangas"
Date:
Zdenek Kotala wrote:
> 4) Implementation
> 
> The main point of implementation is to have several version of 
> PageHeader structure (e.g. PageHeader_04, PageHeader_03 ...) and correct 
> structure will be handled in special branch (see examples).

(this won't come as a surprise as we talked about this in PGCon, but) I 
think we should rather convert the page structure to new format in 
ReadBuffer the first time a page is read in. That would keep the changes 
a lot more isolated.

Note that you need to handle not only page header changes, but changes 
to internal representations of different data types, and changes like 
varvarlen and combocid. Those are things that have happened in the past; 
in the future, I'm foreseeing changes to the toast header, for example, 
as there's been a lot of ideas related to toast options compression.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Proposal: Multiversion page api (inplace upgrade)

From
Tom Lane
Date:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> (this won't come as a surprise as we talked about this in PGCon, but) I 
> think we should rather convert the page structure to new format in 
> ReadBuffer the first time a page is read in. That would keep the changes 
> a lot more isolated.

The problem is that ReadBuffer is an extremely low-level environment,
and it's not clear that it's possible (let alone practical) to do a
conversion at that level in every case.  In particular it hardly seems
sane to expect ReadBuffer to do tuple content conversion, which is going
to be practically impossible to perform without any catalog accesses.

Another issue is that it might not be possible to update a page for
lack of space.  Are we prepared to assume that there will never be a
transformation we need to apply that makes the data bigger?  (Likely
counterexample: adding collation info to text values.)  In such a
situation an in-place update might be impossible, and that certainly
takes it outside the bounds of what ReadBuffer can be expected to manage.
        regards, tom lane


Re: Proposal: Multiversion page api (inplace upgrade)

From
Zdenek Kotala
Date:
Tom Lane napsal(a):
> Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
>> There are examples:
> 
>> void PageSetFull(Page page)
>> {
>>     switch ( PageGetPageLayoutVersion(page) )
>>     {
>>         case 4 : ((PageHeader_04) (page))->pd_flags |= PD_PAGE_FULL;
>>                   break;
>>         default: elog(PANIC, "PageSetFull is not supported on page layout version %i",
>>                 PageGetPageLayoutVersion(page));
>>     }
>> }
> 
>> LocationIndex PageGetLower(Page page)
>> {
>>     switch ( PageGetPageLayoutVersion(page) )
>>     {
>>         case 4 : return ((PageHeader_04) (page))->pd_lower;
>>     }
>>     elog(PANIC, "Unsupported page layout in function PageGetLower.");
>> }
> 
> I'm fairly concerned about the performance impact of turning what had
> been simple field accesses into function calls.  

I use functions for now because it makes it easy to track what's going on. In the 
end they should (mostly) become macros.

> I argue also that since
> none of the PageHeader fields have actually moved in any version that's
> likely to be supported, the above functions are actually of exactly
> zero value.

Yeah, that is why I'm thinking of using a page header with unions inside (for 
example for the TSL/flags field) and using a switch only in cases like the TSL or 
flags fields. What I don't know is whether the fields in this structure will be 
placed at the same offsets on all platforms.

> The proposed PANIC in PageSetFull seems like it requires more thought as
> well: surely we don't want that ever to happen.  Which means that
> callers need to be careful not to invoke such an operation on an
> un-updated page, but this proposed coding offers no aid in making sure
> that won't happen.  What is needed there, I think, is some more global
> policy about what operations are permitted on old (un-converted) pages
> and a high-level approach to ensuring that unsafe operations aren't
> attempted.

ad) PANIC
The PANIC shouldn't happen, because page validation in ReadBuffer should check 
for a supported page version.

ad) policy - good catch. I think all page read operations should be allowed on 
old page versions; only tuple, LSN, TSL, and special-area modifications should be 
allowed for writing. PageAddItem should invoke page conversion before any action 
happens (if there is free space for the tuple, it may be possible to convert the 
page to the new format, but after conversion the remaining space could be smaller 
than the tuple).
    Zdenek


Re: Proposal: Multiversion page api (inplace upgrade)

From
"Heikki Linnakangas"
Date:
Tom Lane wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> (this won't come as a surprise as we talked about this in PGCon, but) I 
>> think we should rather convert the page structure to new format in 
>> ReadBuffer the first time a page is read in. That would keep the changes 
>> a lot more isolated.
> 
> The problem is that ReadBuffer is an extremely low-level environment,
> and it's not clear that it's possible (let alone practical) to do a
> conversion at that level in every case.

Well, we can't predict the future, and can't guarantee that it's 
possible or practical to do the things we need to do in the future no 
matter what approach we choose.

>  In particular it hardly seems
> sane to expect ReadBuffer to do tuple content conversion, which is going
> to be practically impossible to perform without any catalog accesses.

ReadBuffer has access to Relation, which has information about what kind 
of a relation it's dealing with, and TupleDesc. That should get us 
pretty far. It would be a modularity violation, for sure, but I could 
live with that for the purpose of page version conversion.

> Another issue is that it might not be possible to update a page for
> lack of space.  Are we prepared to assume that there will never be a
> transformation we need to apply that makes the data bigger?

We do need some solution to that. One idea is to run a pre-upgrade 
script in the old version that scans the database and moves tuples that 
would no longer fit on their pages in the new version. This could be run 
before the upgrade, while the old database is still running, so it would 
be acceptable for that to take some time.

No doubt people would prefer something better than that. Another idea 
would be to have some over-sized buffers that can be used as the target 
of conversion, until some tuples are moved off to another page. Perhaps 
the over-sized buffer wouldn't need to be in shared memory, if they're 
read-only until some tuples are moved.

This is pretty hand-wavy, I know. The point is, I don't think these 
problems are insurmountable.

>  (Likely counterexample: adding collation info to text values.)

I doubt it, as collation is not a property of text values, but 
operations. But that's off-topic...

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Proposal: Multiversion page api (inplace upgrade)

From
Zdenek Kotala
Date:
Heikki Linnakangas napsal(a):
> Zdenek Kotala wrote:
>> 4) Implementation
>>
>> The main point of implementation is to have several version of 
>> PageHeader structure (e.g. PageHeader_04, PageHeader_03 ...) and 
>> correct structure will be handled in special branch (see examples).
> 
> (this won't come as a surprise as we talked about this in PGCon, but) I 
> think we should rather convert the page structure to new format in 
> ReadBuffer the first time a page is read in. That would keep the changes 
> a lot more isolated.

I agree with Tom's reply. In any case, this approach will be mostly isolated in 
page.c, and you need to be able to read old pages in both cases.

> Note that you need to handle not only page header changes, but changes 
> to internal representations of different data types, and changes like 
> varvarlen and combocid. Those are things that have happened in the past; 
> in the future, I'm foreseeing changes to the toast header, for example, 
> as there's been a lot of ideas related to toast options compression.

I know; this is a first small step for in-place upgrade. The tuple header will follow. 
The page structure is the foundation. I want to split the development into small 
steps, because that makes it easier to review.
     Zdenek



Re: Proposal: Multiversion page api (inplace upgrade)

From
Zdenek Kotala
Date:
Heikki Linnakangas napsal(a):
> Tom Lane wrote:
>> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>>> (this won't come as a surprise as we talked about this in PGCon, but) 
>>> I think we should rather convert the page structure to new format in 
>>> ReadBuffer the first time a page is read in. That would keep the 
>>> changes a lot more isolated.
>>
>> The problem is that ReadBuffer is an extremely low-level environment,
>> and it's not clear that it's possible (let alone practical) to do a
>> conversion at that level in every case.
> 
> Well, we can't predict the future, and can't guarantee that it's 
> possible or practical to do the things we need to do in the future no 
> matter what approach we choose.
> 
>>  In particular it hardly seems
>> sane to expect ReadBuffer to do tuple content conversion, which is going
>> to be practically impossible to perform without any catalog accesses.
> 
> ReadBuffer has access to Relation, which has information about what kind 
> of a relation it's dealing with, and TupleDesc. That should get us 
> pretty far. It would be a modularity violation, for sure, but I could 
> live with that for the purpose of page version conversion.

But if you look, for example, into the hash index implementation, some pages are 
not in the regular format, and conversion could need more information than we 
necessarily have in ReadBuffer.

>> Another issue is that it might not be possible to update a page for
>> lack of space.  Are we prepared to assume that there will never be a
>> transformation we need to apply that makes the data bigger?
> 
> We do need some solution to that. One idea is to run a pre-upgrade 
> script in the old version that scans the database and moves tuples that 
> would no longer fit on their pages in the new version. This could be run 
> before the upgrade, while the old database is still running, so it would 
> be acceptable for that to take some time.

It would not work for indexes, and do not forget TOAST chunks. I think in some 
cases you could end up with an unused quarter of each page in the TOAST table.

> No doubt people would prefer something better than that. Another idea 
> would be to have some over-sized buffers that can be used as the target 
> of conversion, until some tuples are moved off to another page. Perhaps 
> the over-sized buffer wouldn't need to be in shared memory, if they're 
> read-only until some tuples are moved.

Anyway, you need a mechanism to mark such a page read-only, which itself requires 
a lot of modification, and some mechanism to decide when the page gets converted. 
I guess this approach will require modifications similar to convert-on-write.

> This is pretty hand-wavy, I know. The point is, I don't think these 
> problems are insurmountable.
> 
>>  (Likely counterexample: adding collation info to text values.)
> 
> I doubt it, as collation is not a property of text values, but 
> operations. But that's off-topic...

Yes, it is offtopic, however I think Tom is right :-).
    Zdenek





Re: Proposal: Multiversion page api (inplace upgrade)

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> (Likely counterexample: adding collation info to text values.)

I don't think the argument really needs an example, but I would be pretty
upset if we proposed tagging every text datum with a collation. Encoding
perhaps, though that seems like a bad idea to me on performance grounds, but
collation is not a property of the data at all.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!


Re: Proposal: Multiversion page api (inplace upgrade)

From
"Stephen Denne"
Date:
Gregory Stark wrote:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>
> > (Likely counterexample: adding collation info to text values.)
>
> I don't think the argument really needs an example, but I
> would be pretty
> upset if we proposed tagging every text datum with a
> collation. Encoding
> perhaps, though that seems like a bad idea to me on
> performance grounds, but
> collation is not a property of the data at all.

Again not directly related to difficulties upgrading pages...

The recent discussion ...
http://archives.postgresql.org/pgsql-hackers/2008-06/msg00102.php
... mentions keeping collation information together with text data,
however it is referring to keeping it together when processing it,
not when storing the text.

Regards,
Stephen Denne.




Re: Proposal: Multiversion page api (inplace upgrade)

From
Bruce Momjian
Date:
Heikki Linnakangas wrote:
> Zdenek Kotala wrote:
> > 4) Implementation
> > 
> > The main point of implementation is to have several version of 
> > PageHeader structure (e.g. PageHeader_04, PageHeader_03 ...) and correct 
> > structure will be handled in special branch (see examples).
> 
> (this won't come as a surprise as we talked about this in PGCon, but) I 
> think we should rather convert the page structure to new format in 
> ReadBuffer the first time a page is read in. That would keep the changes 
> a lot more isolated.
> 
> Note that you need to handle not only page header changes, but changes 
> to internal representations of different data types, and changes like 
> varvarlen and combocid. Those are things that have happened in the past; 
> in the future, I'm foreseeing changes to the toast header, for example, 
> as there's been a lot of ideas related to toast options compression.

I understand the goal of having good modularity (not having ReadBuffer
modify the page), but I am worried that doing multi-version page
processing in a modular way is going to spread version-specific
information all over the backend code, making it harder to understand.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Proposal: Multiversion page api (inplace upgrade)

From
Decibel!
Date:
On Jun 11, 2008, at 10:42 AM, Heikki Linnakangas wrote:

>> Another issue is that it might not be possible to update a page for
>> lack of space.  Are we prepared to assume that there will never be a
>> transformation we need to apply that makes the data bigger?
>
> We do need some solution to that. One idea is to run a pre-upgrade  
> script in the old version that scans the database and moves tuples  
> that would no longer fit on their pages in the new version. This  
> could be run before the upgrade, while the old database is still  
> running, so it would be acceptable for that to take some time.

That means old versions have to have some knowledge of new versions.  
There's also a big race condition unless the old version starts  
taking size requirements into account every time a page is dirtied.

> No doubt people would prefer something better than that. Another  
> idea would be to have some over-sized buffers that can be used as  
> the target of conversion, until some tuples are moved off to  
> another page. Perhaps the over-sized buffer wouldn't need to be in  
> shared memory, if they're read-only until some tuples are moved.
>
> This is pretty hand-wavy, I know. The point is, I don't think these  
> problems are insurmountable.

-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828



Re: Proposal: Multiversion page api (inplace upgrade)

From
Ron Mayer
Date:
Tom Lane wrote:
> Another issue is that it might not be possible to update a page for
> lack of space.  Are we prepared to assume that there will never be a
> transformation we need to apply that makes the data bigger?   In such a
> situation an in-place update might be impossible, and that certainly
> takes it outside the bounds of what ReadBuffer can be expected to manage.

Would a possible solution to this be that you could

  1. Upgrade to the newest minor version of the old release
     (which has knowledge of the space requirements of the new one).

  2. Run some new maintenance command like "vacuum expand" or
     "vacuum prepare_for_upgrade" or something that would split
     any too-full pages, leaving only pages with enough space.

  3. Only then shut down the old server and start the
     new major-version server.



Re: Proposal: Multiversion page api (inplace upgrade)

From
Zdenek Kotala
Date:
Ron Mayer napsal(a):
> Tom Lane wrote:
>> Another issue is that it might not be possible to update a page for
>> lack of space.  Are we prepared to assume that there will never be a
>> transformation we need to apply that makes the data bigger?   In such a
>> situation an in-place update might be impossible, and that certainly
>> takes it outside the bounds of what ReadBuffer can be expected to manage.
> 
> Would a possible solution to this be that you could
> 

<snip>

> 
>   2. Run some new maintenance command like "vacuum expand" or
>      "vacuum prepare_for_upgrade" or something that would split
>      any too-full pages, leaving only pages with enough space.

It does not solve problems, for example, with TOAST tables. If the chunks do not 
fit on a page in the new layout, one of the chunk tuples has to be moved to a free 
page. That means you get a lot of pages with ~2kB of free unused space. And if the 
max chunk size differs between versions, you have another problem as well.

There is also an idea to change the compression algorithm for 8.4 (or offer more 
variants). That also means you need to understand the old algorithm in the new 
version, or you need to repack everything on the old version.

    Zdenek


Re: Proposal: Multiversion page api (inplace upgrade)

From
Zdenek Kotala
Date:
Bruce Momjian napsal(a):
> Heikki Linnakangas wrote:
>> Zdenek Kotala wrote:
>>> 4) Implementation
>>>
>>> The main point of implementation is to have several version of 
>>> PageHeader structure (e.g. PageHeader_04, PageHeader_03 ...) and correct 
>>> structure will be handled in special branch (see examples).
>> (this won't come as a surprise as we talked about this in PGCon, but) I 
>> think we should rather convert the page structure to new format in 
>> ReadBuffer the first time a page is read in. That would keep the changes 
>> a lot more isolated.
>>
>> Note that you need to handle not only page header changes, but changes 
>> to internal representations of different data types, and changes like 
>> varvarlen and combocid. Those are things that have happened in the past; 
>> in the future, I'm foreseeing changes to the toast header, for example, 
>> as there's been a lot of ideas related to toast options compression.
> 
> I understand the goal of having good modularity (not having ReadBuffer
> modify the page), but I am worried that doing multi-version page
> processing in a modular way is going to spread version-specific
> information all over the backend code, making is harder to understand.

I don't think so. A page already contains its version information, and currently 
we have macros like PageSetLSN; a caller needn't know anything about the 
PageHeader representation. It is the responsibility of the page API to handle 
multiple versions correctly.

We can use the same approach for tuple access. It is more complicated, but I think 
it is possible. Currently we have several macros (e.g. HeapTupleGetOid) which work 
on the TupleData structure. "Only" what we need is to extend this API as well.

I think in the end we will get more readable code.
    Zdenek



Re: Proposal: Multiversion page api (inplace upgrade)

From
Tom Lane
Date:
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
> It does not solve problems, for example, with TOAST tables. If the chunks do not fit 
> on a page in the new layout, one of the chunk tuples has to be moved to a free page. 
> That means you get a lot of pages with ~2kB of free unused space. And if the max chunk 
> size differs between versions, you have another problem as well.

> There is also an idea to change the compression algorithm for 8.4 (or offer more 
> variants). That also means you need to understand the old algorithm in the new 
> version, or you need to repack everything on the old version.

I don't have any problem at all with the idea that in-place update isn't
going to support arbitrary changes of parameters, such as modifying the
toast chunk size.  In particular anything that is locked down by
pg_control isn't a problem.
        regards, tom lane