Thread: Prototype: In-place upgrade

Prototype: In-place upgrade

From
Zdenek Kotala
Date:
The attached patch is a prototype of in-place upgrade, as presented at PGCon this year.

The main idea is to teach postgres to handle different versions of the page
and tuple structures.

1) page - The patch contains a new page API, and all code accesses pages
through this API. The functions check the page version and return the correct
data to the caller. It is mostly complete now; only the ItemId flags need finishing.

2) tuple - The HeapTuple structure has been extended with a t_ver attribute
which contains the page layout version, and direct access to HeapTupleHeader
is forbidden. It is now possible only through the HeapTuple* functions (see htup.c).
(HeapTupleHeader access still remains in several functions like heap_form_tuple.)
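The version dispatch described above might be sketched like this (a
simplified illustration with hypothetical struct layouts and function names,
not the actual patch code):

```c
#include <assert.h>
#include <stdint.h>

/* Two hypothetical page-header layouts; in the real patch these would be
 * the PageHeaderData variants for the different layout versions. */
typedef struct { uint16_t pd_lower; uint16_t pd_upper; } PageHeader_03;
typedef struct { uint16_t pd_flags; uint16_t pd_lower; uint16_t pd_upper; } PageHeader_04;

/* Every accessor checks the layout version and reads the field from
 * the matching struct layout, so callers never touch raw bytes. */
uint16_t PageGetLower(const void *page, int version)
{
    switch (version)
    {
        case 3: return ((const PageHeader_03 *) page)->pd_lower;
        case 4: return ((const PageHeader_04 *) page)->pd_lower;
        default: assert(!"unknown page layout version"); return 0;
    }
}
```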

This patch version still does not allow reading an old database, but it shows
how it should work. The main disadvantage of this approach is the performance penalty.

Please, let me know your opinion about this approach.

Future work:
1) teach WAL to process different tuple structure versions
2) tuple conversion to the new version, hooked into the executor (ExecStoreTuple)
3) a multiversion MaxItemSize constant

        Thanks for your comments, Zdenek


--
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql


Attachment

Re: Prototype: In-place upgrade

From
Heikki Linnakangas
Date:
The patch seems to be missing the new htup.c file.

Zdenek Kotala wrote:
> Attached patch is prototype of in-place upgrade as was presented on
> PGCon this year.
> 
> Main idea is to learn postgres to handle different version of page and 
> tuple structures.
> 
> 1) page - Patch contains new page API and all code access page through 
> this API. Functions check page version and return correct data to 
> caller. It is mostly complete now. Only ItemId flags need finish.
> 
> 2) tuple - HeapTuple structure has been extended with t_ver attribute 
> which contains page layout version and direct access to HeapTupleHeader 
> is forbidden. It is possible now only through HeapTuple* functions (see 
> htup.c). (HeapTupleHeader access still stays in a several functions like 
> heap_form_tuple).
> 
> This patch version still does not allow to read old database, but it 
> shows how it should work. Main disadvantage of this approach is 
> performance penalty.
> 
> Please, let me know your opinion about this approach.
> 
> Future work:
> 1) learn WAL to process different tuple structure version
> 2) tuple conversion to new version and put it into executor 
> (ExecStoreTuple)
> 3) multiversion MaxItemSize constant


--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Prototype: In-place upgrade

From
Zdenek Kotala
Date:
Heikki Linnakangas napsal(a):
> The patch seems to be missing the new htup.c file.

Oops, I'm sorry. I'm going to fix it and will send a new version ASAP.
Zdenek
Zdenek


-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: Prototype: In-place upgrade v02

From
Zdenek Kotala
Date:
Heikki Linnakangas napsal(a):
> The patch seems to be missing the new htup.c file.

I'm sorry. I attached a new version which is synchronized with the current
head. I would like to add some more comments as well.

1) The patch also contains changes which were discussed during the July
commit fest:
- the PageGetTempPage modification suggested by Tom
- another hash.h backward-compatibility cleanup

2) I added a tuplimits.h header file which contains the tuple limits for the
different access methods. It is not finished yet, but the idea is to keep all
limits in one file and easily add limits for different page layout versions -
for example, replacing the static computation with a dynamic one based on the
relation (maxtuplesize could be stored in pg_class for each relation).

I also need this header because I ran into a cycle in the header dependencies.

3) I already sent the Page API performance results in
http://archives.postgresql.org/pgsql-hackers/2008-08/msg00398.php

I replaced the call sequence PageGetItemId, PageGetItem with the
PageGetIndexTuple and PageGetHeapTuple functions. That is the main difference
in this patch. PageGetHeapTuple fills t_ver in the HeapTuple to identify the
correct tuple header version.

It is worth mentioning that the Page API (and tuple API) implementation is
only a prototype, without any performance optimization.

4) This patch contains several topics for decision. The first is the general
one: whether this approach is acceptable at all. The second is about the new
Page API: whether we replace all page access with the newly proposed
macros/(inline) functions. The third is how to name the different data
structure versions and where to store them. My idea is to use a suffix with
an underscore and the page layout version, and to keep all versions in the
same header file.
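The naming scheme proposed in point 4 might look like this (the field sets
below are simplified illustrations, not the real catalog of changes):

```c
#include <stdint.h>

/* All versions of a structure live in one header, each carrying an
 * underscore-and-layout-version suffix.  Hypothetical layout version 3: */
typedef struct HeapTupleFields_03
{
    uint32_t t_xmin;
    uint32_t t_xmax;
    uint32_t t_cid;
} HeapTupleFields_03;

/* Hypothetical layout version 4, with one new field: */
typedef struct HeapTupleFields_04
{
    uint32_t t_xmin;
    uint32_t t_xmax;
    uint32_t t_cid;
    uint16_t t_infomask2;   /* new in layout version 4 */
} HeapTupleFields_04;

/* The unsuffixed name always refers to the current layout, so most
 * code compiles unchanged against the newest version. */
typedef HeapTupleFields_04 HeapTupleFields;
```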

5) I got another idea about usage of the page API. I call it "3 in 1".
Because all page access will go through the new API, it could also be used
for WAL logging, and other WAL recording could be reduced. Replication could
easily be added based on page modifications. It is just an idea to think about.

6) That is probably all for a Friday evening.

        Zdenek


--
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql


Attachment

Re: Prototype: In-place upgrade v02

From
Heikki Linnakangas
Date:
Zdenek Kotala wrote:
> Heikki Linnakangas napsal(a):
>> The patch seems to be missing the new htup.c file.
> 
> I'm sorry. I attached new version which is synchronized with current
> head. I would like to say more comments as well.
> 
> 1) The patch contains also changes which was discussed during July 
> commit fest.     - PageGetTempPage modification suggested by Tom
> - another hash.h backward compatible cleanup

It might be a good idea to split that into a separate patch. The sheer 
size of this patch is quite daunting, even though the bulk of it is 
straightforward search&replace.

> 2) I add tuplimits.h header file which contains tuple limits for 
> different access method. It is not finished yet, but idea is to keep all 
> limits in one file and easily add limits for different page layout 
> version - for example replace static computing with dynamic based on 
> relation (maxtuplesize could be store in pg_class for each relation).
> 
> I need this header also because I fallen in a cycle in header dependency.
> 
> 3) I already sent Page API performance result in 
> http://archives.postgresql.org/pgsql-hackers/2008-08/msg00398.php
> 
> I replaced call sequence PagetGetItemId, PageGetItemId with 
> PageGetIndexTuple and PageGetHeapTuple function. It is main difference 
> in this patch. PAgeGetHeap Tuple fills t_ver in HeapTuple to identify 
> correct tupleheader version.
>
> It would be good to mention that PageAPI (and tuple API) implementation 
> is only prototype without any performance optimization.

You mentioned a 5% performance degradation in that thread. What test case 
was that? What would be a worst-case scenario, and how bad is it?

5% is a pretty hefty price, especially when it's paid by not only 
upgraded installations, but also freshly initialized clusters. I think 
you'll need to pursue those performance optimizations.

> 4) This patch contains more topics for decision. First is general if 
> this approach is acceptable.

I don't like the invasiveness of this approach. It's pretty invasive 
already, and ISTM you'll need similar switch-case handling of all data 
types that have changed the internal representation as well.

We've talked about this before, so you'll remember that I favor the 
approach of converting the page format, page at a time, when the pages 
are read in. I grant you that there are non-trivial issues with that as 
well, like if the converted data takes more space and doesn't fit in the 
page anymore.

I wonder if we could go with some sort of a hybrid approach? Convert the 
whole page when it's read in, but if it doesn't fit, fall back to 
tricks like loosening the alignment requirements on platforms that can 
handle non-aligned data, or support a special truncated page header, 
without pd_tli and pd_prune_xid fields. Just a thought, not sure how 
feasible those particular tricks are, but something along those lines..
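The convert-on-read decision could be sketched with a toy model like the
following (all names and the 4-bytes-per-tuple growth figure are
illustrative assumptions, not measurements from the patch):

```c
#include <stdbool.h>

#define BLCKSZ 8192
#define CURRENT_PAGE_VERSION 4

/* Toy model of a page: just enough state to show the decision. */
typedef struct
{
    int version;      /* on-disk layout version */
    int ntuples;      /* tuples on the page */
    int used_bytes;   /* bytes currently occupied */
} ToyPage;

/* Assume converting an old page grows each tuple by ~4 bytes of header;
 * decide at read-in time whether the page can be rewritten in place,
 * or whether we must fall back (truncated header, looser alignment, ...). */
bool convert_on_read(ToyPage *page)
{
    if (page->version == CURRENT_PAGE_VERSION)
        return true;                      /* already current */

    int new_size = page->used_bytes + 4 * page->ntuples;
    if (new_size > BLCKSZ)
        return false;                     /* overflow: needs a fallback */

    page->used_bytes = new_size;          /* rewrite in place */
    page->version = CURRENT_PAGE_VERSION;
    return true;
}
```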

All in all, though. I find it a bit hard to see the big picture. For 
upgrade-in-place, what are all the pieces that we need? To keep this 
concrete, let's focus on PG 8.2 -> PG 8.3 (or are you focusing on PG 8.3 
-> 8.4? That's fine with me as well, but let's pick one) and forget 
about hypothetical changes that might occur in a future version. I can see:
1. Handling page layout changes (pd_prune_xid, pd_flags)
2. Handling tuple header changes (infomask2, HOT bits, combocid)
3. Handling changes in data type representation (packed varlens)
4. Toast chunk size
5. Catalogs

After putting all those together, how large a patch are we talking 
about, and what's the performance penalty then? How much of all that 
needs to be in core, and how much can live in a pgfoundry project or an 
extra binary in src/bin or contrib? I realize that none of us have a 
crystal ball, and one has to start somewhere, but I feel uneasy 
committing to an approach until we have a full plan.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Prototype: In-place upgrade v02

From
Greg Smith
Date:
On Fri, 5 Sep 2008, Heikki Linnakangas wrote:

> All in all, though. I find it a bit hard to see the big picture.

I've been working on trying to see that myself lately, have been dumping 
links to all the interesting material at 
http://wiki.postgresql.org/wiki/In-place_upgrade if there's any of that 
you haven't seen before.

> To keep this concrete, let's focus on PG 8.2 -> PG 8.3 (or are you 
> focusing on PG 8.3 -> 8.4? That's fine with me as well, but let's pick 
> one)

From a complexity perspective, the changes needed to go from 8.2->8.3 seem 
much larger than what's needed for 8.3->8.4.  There's also a huge PR win 
if 8.4 goes out the door saying that in-place upgrades are available from 
the previous version starting at the 8.4 release.  Given the limited time 
left, I would think a focus on nailing the 8.3->8.4 conversion down first 
and then slipping in support for earlier revs later would be one way to 
get this into more manageable chunks.  Obviously if you can fit 
infrastructure that makes the 8.2 conversion easier that's worth doing, 
but I'd hate to see this get bogged down worrying too much about things 
that haven't actually changed since 8.3.

The specific areas I am getting up to speed to help out with here are 
catalog updates and working on integration/testing.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Prototype: In-place upgrade v02

From
Bruce Momjian
Date:
Heikki Linnakangas wrote:
> > 4) This patch contains more topics for decision. First is general if 
> > this approach is acceptable.
> 
> I don't like the invasiveness of this approach. It's pretty invasive 
> already, and ISTM you'll need similar switch-case handling of all data 
> types that have changed the internal representation as well.
> 
> We've talked about this before, so you'll remember that I favor teh 
> approach is to convert the page format, page at a time, when the pages 
> are read in. I grant you that there's non-trivial issues with that as 
> well, like if the converted data takes more space and don't fit in the 
> page anymore.

I 100% agree with Heikki here;  having the conversion spill out into the
main backend is very expensive and adds lots of complexity.  The only
argument for Zdenek's approach is that it allows conversion to happen at a
more natural time than when the page is read in, but frankly I think the
conversion needs are going to be pretty limited and are better done in a
localized way at page read-in time.

As far as the page not fitting after conversion goes, what about some user
command that will convert an entire table to the new format if page
expansion fails?

> I wonder if we could go with some sort of a hybrid approach? Convert the 
>   whole page when it's read in, but if it doesn't fit, fall back to 
> tricks like loosening the alignment requirements on platforms that can 
> handle non-aligned data, or support a special truncated page header, 
> without pd_tli and pd_prune_xid fields. Just a thought, not sure how 
> feasible those particular tricks are, but something along those lines..
> 
> All in all, though. I find it a bit hard to see the big picture. For 
> upgrade-in-place, what are all the pieces that we need? To keep this 
> concrete, let's focus on PG 8.2 -> PG 8.3 (or are you focusing on PG 8.3 
> -> 8.4? That's fine with me as well, but let's pick one) and forget 
> about hypothetical changes that might occur in a future version. I can see:
> 1. Handling page layout changes (pd_prune_xid, pd_flags)
> 2. Handling tuple header changes (infomask2, HOT bits, combocid)
> 3. Handling changes in data type representation (packed varlens)
> 4. Toast chunk size
> 5. Catalogs
> 
> After putting all those together, how large a patch are we talking 
> about, and what's the performance penalty then? How much of all that 
> needs to be in core, and how much can live in a pgfoundry project or an 
> extra binary in src/bin or contrib? I realize that none of us have a 
> crystal ball, and one has to start somewhere, but I feel uneasy 
> committing to an approach until we have a full plan.

Yes, another very good point.

I am ready to focus on these issues for 8.4;  all this needs to be
fleshed out, perhaps on a wiki.  As a starting point, what would be
really nice is to start a wiki that lists all data format changes for
every major release.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Prototype: In-place upgrade v02

From
Heikki Linnakangas
Date:
Bruce Momjian wrote:
> As far as the page not fitting after conversion, what about some user
> command that will convert an entire table to the new format if page
> expansion fails.

VACUUM?

Having to run a manual command defeats the purpose somewhat, though. 
Especially if you have no way of knowing which tables it needs to be 
run on.

> I am ready to focus on these issues for 8.4;  all this needs to be
> fleshed out, perhaps on a wiki.  As a starting point, what would be
> really nice is to start a wiki that lists all data format changes for
> every major release.

Have you looked at http://wiki.postgresql.org/wiki/In-place_upgrade 
already, that Greg Smith mentioned elsewhere in this thread? That's a 
good starting point.

In fact, I don't think there's any low-level data format changes yet 
between 8.3 and 8.4, so this would be a comparatively easy release to 
implement upgrade-in-place. There's just the catalog changes, but AFAICS 
nothing that would require scanning through relations.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Prototype: In-place upgrade v02

From
Bruce Momjian
Date:
Heikki Linnakangas wrote:
> Bruce Momjian wrote:
> > As far as the page not fitting after conversion, what about some user
> > command that will convert an entire table to the new format if page
> > expansion fails.
> 
> VACUUM?
> 
> Having to run a manual command defeats the purpose somewhat, though. 
> Especially if you have no way of knowing on what tables it needs to be 
> run on.

My assumption is that the page not fitting would be a rare case so
requiring something like vacuum to fix it would be OK.

What I don't want to do is add lots of complexity to the code just to
handle the page expansion case, when such a case is rare and can perhaps
be fixed by a vacuum.

> > I am ready to focus on these issues for 8.4;  all this needs to be
> > fleshed out, perhaps on a wiki.  As a starting point, what would be
> > really nice is to start a wiki that lists all data format changes for
> > every major release.
> 
> Have you looked at http://wiki.postgresql.org/wiki/In-place_upgrade 
> already, that Greg Smith mentioned elsewhere in this thread? That's a 
> good starting point.

Agreed.

> In fact, I don't think there's any low-level data format changes yet 
> between 8.3 and 8.4, so this would be a comparatively easy release to 
> implement upgrade-in-place. There's just the catalog changes, but AFAICS 
> nothing that would require scanning through relations.

Yep.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Prototype: In-place upgrade v02

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> In fact, I don't think there's any low-level data format changes yet 
> between 8.3 and 8.4, so this would be a comparatively easy release to 
> implement upgrade-in-place. There's just the catalog changes, but AFAICS 
> nothing that would require scanning through relations.

After a quick scan of the catversion.h changelog (which hopefully covers
any such changes): we changed sequences incompatibly, we changed hash
indexes incompatibly (even without the pending patch that would change
their contents beyond recognition), and Teodor did some stuff to GIN
indexes that might or might not represent an on-disk format change,
you'd have to ask him.  We also whacked around the sort order of
bpchar_pattern_ops btree indexes.

I didn't see anything that looked like an immediate change in user table
contents, unless they used the "name" type; but what of relation forks?
        regards, tom lane


Re: Prototype: In-place upgrade v02

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> I didn't see anything that looked like an immediate change in user table
> contents, unless they used the "name" type; but what of relation forks?

Relation forks didn't change anything inside relation files, so no 
scanning of relations is required because of that. Neither will the FSM 
rewrite. Not sure about DSM yet.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Prototype: In-place upgrade v02

From
Gregory Stark
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

> Tom Lane wrote:
>> I didn't see anything that looked like an immediate change in user table
>> contents, unless they used the "name" type; but what of relation forks?
>
> Relation forks didn't change anything inside relation files, so no scanning of
> relations is required because of that. Neither will the FSM rewrite. Not sure
> about DSM yet.

And just to confirm -- they don't change the name of the files the postmaster
expects to find in its data directory, right?

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!


Re: Prototype: In-place upgrade v02

From
Heikki Linnakangas
Date:
Gregory Stark wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Relation forks didn't change anything inside relation files, so no scanning of
>> relations is required because of that. Neither will the FSM rewrite. Not sure
>> about DSM yet.
> 
> And just to confirm -- they don't change the name of the files the postmaster
> expects to find in its data directory, right?

Right. But it wouldn't be a big issue anyway: renaming would be quick 
regardless of the relation sizes. FSM and DSM will introduce new files, 
though, that probably need to be created as part of the upgrade, but 
again they're not very big.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Prototype: In-place upgrade v02

From
Zdenek Kotala
Date:
Heikki Linnakangas napsal(a):
> Zdenek Kotala wrote:
>> Heikki Linnakangas napsal(a):
>>> The patch seems to be missing the new htup.c file.
>>
>> I'm sorry. I attached new version which is synchronized with current
>> head. I would like to say more comments as well.
>>
>> 1) The patch contains also changes which was discussed during July 
>> commit fest.     - PageGetTempPage modification suggested by Tom
>> - another hash.h backward compatible cleanup
> 
> It might be a good idea to split that into a separate patch. The sheer 
> size of this patch is quite daunting, even though the bulk of it is 
> straightforward search&replace.

Yes, I will do it.

>> 2) I add tuplimits.h header file which contains tuple limits for 
>> different access method. It is not finished yet, but idea is to keep 
>> all limits in one file and easily add limits for different page layout 
>> version - for example replace static computing with dynamic based on 
>> relation (maxtuplesize could be store in pg_class for each relation).
>>
>> I need this header also because I fallen in a cycle in header dependency.
>>
>> 3) I already sent Page API performance result in 
>> http://archives.postgresql.org/pgsql-hackers/2008-08/msg00398.php
>>
>> I replaced call sequence PagetGetItemId, PageGetItemId with 
>> PageGetIndexTuple and PageGetHeapTuple function. It is main difference 
>> in this patch. PAgeGetHeap Tuple fills t_ver in HeapTuple to identify 
>> correct tupleheader version.
>>
>> It would be good to mention that PageAPI (and tuple API) 
>> implementation is only prototype without any performance optimization.
> 
> You mentioned 5% performance degradation in that thread. What test case 
> was that? What would be a worst-case scanario, and how bad is it?

Paul van den Bogaart tested a long-running OLTP workload on it. He used the iGen test.

> 5% is a pretty hefty price, especially when it's paid by not only 
> upgraded installations, but also freshly initialized clusters. I think 
> you'll need to pursue those performance optimizations.

5% is the worst-case scenario. The current version is not optimized; it is 
written for easy debugging and (D)tracing. The page header structures are 
very similar, and we can easily remove the switches for most of the 
attributes and replace the functions with macros or inline functions.
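The optimization Zdenek describes relies on a field sitting at the same
offset in every layout version, so the version switch can be dropped
entirely (an illustrative sketch with hypothetical, simplified struct
definitions):

```c
#include <stddef.h>
#include <stdint.h>

/* Two hypothetical header layouts whose leading fields coincide. */
typedef struct { uint32_t pd_lsn; uint16_t pd_lower; uint16_t pd_upper; } Hdr_03;
typedef struct { uint32_t pd_lsn; uint16_t pd_lower; uint16_t pd_upper;
                 uint16_t pd_flags; } Hdr_04;

/* pd_lower lives at the same offset in both layouts, so this accessor
 * needs no version switch at all -- it can become a macro or inline
 * function with zero dispatch cost. */
static inline uint16_t FastGetLower(const void *page)
{
    return ((const Hdr_03 *) page)->pd_lower;
}
```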

>> 4) This patch contains more topics for decision. First is general if 
>> this approach is acceptable.
> 
> I don't like the invasiveness of this approach. It's pretty invasive 
> already, and ISTM you'll need similar switch-case handling of all data 
> types that have changed the internal representation as well.

I agree in general. But, for example, the new page API is not so invasive, 
and in my opinion it should be implemented (with or without multiversion 
support), because it cleans up the code. HeapTuple processing is easy too, 
but unfortunately it requires a lot of modifications in many places. I was 
surprised how many pieces of code access HeapTupleHeader directly and do not 
use the HeapTuple data structure. I think we should reach a conclusion on 
what the recommended usage of HeapTupleHeader and HeapTuple is. Most of the 
changes in the code are like replacing 
HeapTupleHeaderGetXmax(tuple->t_data) with HeapTupleGetXmax(tuple) and so 
on. I think it should be cleaned up anyway.

You mentioned data types, but that is not a problem. You can easily extend 
the data type attributes with version information and call the correct 
in/out functions, or use a different Oid for the new data type version. 
There are more possible easy solutions for data types, and for conversion 
you can use the ALTER TABLE command. The main idea is to keep data in all 
formats in a relation. This approach could also be used for the 
integer/float datetime problem.

> We've talked about this before, so you'll remember that I favor teh 
> approach is to convert the page format, page at a time, when the pages 
> are read in. I grant you that there's non-trivial issues with that as 
> well, like if the converted data takes more space and don't fit in the 
> page anymore.

I like conversion on read too, because it is easy, but there are more problems.

The non-fitting page is one of them. Other problems are with indexes. For 
example, the hash index stores a bitmap into a page, and this is not 
recorded anywhere; only the hash AM knows which pages contain this kind of 
data. It is probably impossible to convert such a page during a read. :(

> I wonder if we could go with some sort of a hybrid approach? Convert the 
>  whole page when it's read in, but if it doesn't fit, fall back to 
> tricks like loosening the alignment requirements on platforms that can 
> handle non-aligned data, or support a special truncated page header, 
> without pd_tli and pd_prune_xid fields. Just a thought, not sure how 
> feasible those particular tricks are, but something along those lines..

OK, I have backup idea :-). Stay tuned :-)

> All in all, though. I find it a bit hard to see the big picture. For 
> upgrade-in-place, what are all the pieces that we need? To keep this 
> concrete, let's focus on PG 8.2 -> PG 8.3 (or are you focusing on PG 8.3 
> -> 8.4? That's fine with me as well, but let's pick one) and forget 
> about hypothetical changes that might occur in a future version. I can see:
> 1. Handling page layout changes (pd_prune_xid, pd_flags)
> 2. Handling tuple header changes (infomask2, HOT bits, combocid)
2.5 + composite data type
> 3. Handling changes in data type representation (packed varlens)
3.5 Data types generally (cidr/inet)
> 4. Toast chunk size
4.5 general MaxTupleSize for each different AM
> 5. Catalogs
6. AM methods

> 
> After putting all those together, how large a patch are we talking 
> about, and what's the performance penalty then? How much of all that 
> needs to be in core, and how much can live in a pgfoundry project or an 
> extra binary in src/bin or contrib? I realize that none of us have a 
> crystal ball, and one has to start somewhere, but I feel uneasy 
> committing to an approach until we have a full plan.

Unfortunately, I'm still in the analysis phase. The presented patch is a 
prototype of one possible approach. I hit a lot of problems, and I still 
don't have answers to all of them. I'm going to update the wiki page to 
share all this information.

At this moment, I think that I can implement offline heap conversion 
(8.2->8.4), with all indexes being reindexed. That is what we can have for 
8.4. Online conversion has a lot of problems which we are not able to 
answer at this moment.
    Zdenek



-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: Prototype: In-place upgrade v02

From
Zdenek Kotala
Date:
Bruce Momjian napsal(a):
> 
> As far as the page not fitting after conversion, what about some user
> command that will convert an entire table to the new format if page
> expansion fails.

Keep in mind that there are more kinds of pages. Heap is easy, but each 
index AM has its own specifics :(. A better approach is to move the tuple to 
a new page and invalidate all related table indexes; the following reindex 
then automatically converts the whole table.

>> After putting all those together, how large a patch are we talking 
>> about, and what's the performance penalty then? How much of all that 
>> needs to be in core, and how much can live in a pgfoundry project or an 
>> extra binary in src/bin or contrib? I realize that none of us have a 
>> crystal ball, and one has to start somewhere, but I feel uneasy 
>> committing to an approach until we have a full plan.
> 
> Yes, another very good point.
> 
> I am ready to focus on these issues for 8.4;  all this needs to be
> fleshed out, perhaps on a wiki.  As a starting point, what would be
> really nice is to start a wiki that lists all data format changes for
> every major release.

As Greg mentioned in his mail, the wiki page is already there. 
Unfortunately, I did not have time to put current information there. I'm 
going to do so soon.
    Zdenek


-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: Prototype: In-place upgrade v02

From
Zdenek Kotala
Date:
Bruce Momjian napsal(a):
> Heikki Linnakangas wrote:
>> Bruce Momjian wrote:
>>> As far as the page not fitting after conversion, what about some user
>>> command that will convert an entire table to the new format if page
>>> expansion fails.
>> VACUUM?
>>
>> Having to run a manual command defeats the purpose somewhat, though. 
>> Especially if you have no way of knowing on what tables it needs to be 
>> run on.
> 
> My assumption is that the page not fitting would be a rare case so
> requiring something like vacuum to fix it would be OK.

It is 1-2% of records per heap. I assume that it is more for B-tree.

> What I don't want to do it to add lots of complexity to the code just to
> handle the page expansion case, when such a case is rare and perhaps can
> be fixed by a vacuum.

Unfortunately it is not so rare. Only heap on the 32-bit x86 platform 
(4-byte max alignment) is no problem; all index pages are affected.

> 
>> In fact, I don't think there's any low-level data format changes yet 
>> between 8.3 and 8.4, so this would be a comparatively easy release to 
>> implement upgrade-in-place. There's just the catalog changes, but AFAICS 
>> nothing that would require scanning through relations.
> 
> Yep.

I did not test it just now, but the pg_upgrade.sh script worked fine in May 
without any modification for the 8.3->8.4 conversion.
    Zdenek



-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: Prototype: In-place upgrade v02

From
Zdenek Kotala
Date:
Tom Lane napsal(a):
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> In fact, I don't think there's any low-level data format changes yet 
>> between 8.3 and 8.4, so this would be a comparatively easy release to 
>> implement upgrade-in-place. There's just the catalog changes, but AFAICS 
>> nothing that would require scanning through relations.
> 
> After a quick scan of the catversion.h changelog (which hopefully covers
> any such changes): we changed sequences incompatibly, we changed hash
> indexes incompatibly (even without the pending patch that would change
> their contents beyond recognition), and Teodor did some stuff to GIN
> indexes that might or might not represent an on-disk format change,
> you'd have to ask him.  We also whacked around the sort order of
> bpchar_pattern_ops btree indexes.

Hmm, it seems that reindexing is the only good answer to all these changes. 
Sequences should be converted during the catalog conversion.

Another idea is to create backward-compatible AMs and put them into a 
separate library. If these AMs also work with the old page structure, then 
there should be no reason for reindexing or index page conversion after an 
upgrade.
    Zdenek



-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: Prototype: In-place upgrade v02

From
Zdenek Kotala
Date:
Heikki Linnakangas napsal(a):
> Tom Lane wrote:
>> I didn't see anything that looked like an immediate change in user table
>> contents, unless they used the "name" type; but what of relation forks?
> 
> Relation forks didn't change anything inside relation files, so no 
> scanning of relations is required because of that. Neither will the FSM 
> rewrite. Not sure about DSM yet.
> 

Does that mean that if you "inject" an old data file after the catalog 
upgrade, the FSM will work without any problem?
    Zdenek

PS: I plan to review FSM this week.



-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: Prototype: In-place upgrade v02

From
Heikki Linnakangas
Date:
Zdenek Kotala wrote:
> Heikki Linnakangas napsal(a):
>> Relation forks didn't change anything inside relation files, so no 
>> scanning of relations is required because of that. Neither will the 
>> FSM rewrite. Not sure about DSM yet.
> 
> Does it mean, that if you "inject" old data file after catalog upgrade, 
> then FSM will works without any problem?

Yes. You'll need to construct an FSM, but it doesn't necessarily need to 
reflect the reality. You could just fill it with zeros, meaning that 
there's no free space anywhere, and let the next vacuum fill it with 
real information. Or you could read the old pg_fsm.cache file and fill 
the new FSM accordingly.
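The fill-with-zeros option above amounts to writing an empty FSM file during
the upgrade, something like the sketch below (the file name, single-page
size, and the idea that one zeroed block suffices are all illustrative
assumptions, not the real FSM fork format):

```c
#include <stdio.h>

#define BLCKSZ 8192

/* Write a minimal, all-zeros FSM file: every page reports no free
 * space, and the next VACUUM fills in real values.  Returns 0 on
 * success, -1 on failure. */
int write_empty_fsm(const char *path)
{
    static const char zeros[BLCKSZ];    /* zero-initialized */
    FILE *f = fopen(path, "wb");
    if (!f)
        return -1;
    size_t n = fwrite(zeros, 1, BLCKSZ, f);
    fclose(f);
    return n == BLCKSZ ? 0 : -1;
}
```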

> PS: I plan to review FSM this week.

Thanks!

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Prototype: In-place upgrade v02

From
Heikki Linnakangas
Date:
Zdenek Kotala wrote:
> Tom Lane napsal(a):
>> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>>> In fact, I don't think there's any low-level data format changes yet 
>>> between 8.3 and 8.4, so this would be a comparatively easy release to 
>>> implement upgrade-in-place. There's just the catalog changes, but 
>>> AFAICS nothing that would require scanning through relations.
>>
>> After a quick scan of the catversion.h changelog (which hopefully covers
>> any such changes): we changed sequences incompatibly, we changed hash
>> indexes incompatibly (even without the pending patch that would change
>> their contents beyond recognition), and Teodor did some stuff to GIN
>> indexes that might or might not represent an on-disk format change,
>> you'd have to ask him.  We also whacked around the sort order of
>> bpchar_pattern_ops btree indexes.
> 
> Hmm, it seems that reindexing is the only good answer to all these changes.

Isn't that exactly what we want to avoid with upgrade-in-place? As long 
as the conversion can be done page-at-a-time, without consulting other 
pages, we can do it when the page is read in.
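A conversion hooked into page read could look roughly like the sketch below. This is a simplified stand-alone model: FakePage, the version numbers, and convert_page_on_read are invented for illustration; the real page header (PageHeaderData) and buffer manager differ.

```c
#include <stdint.h>

#define OLD_PAGE_VERSION     3   /* hypothetical layout version numbers */
#define CURRENT_PAGE_VERSION 4

/* Simplified stand-in for a page header carrying a layout version. */
typedef struct
{
    uint16_t pd_version;
    /* ... line pointers and tuples would follow ... */
} FakePage;

/* Hypothetical hook run right after a page is read from disk: if the
 * page is in the old format, rewrite it in place.  Only this one page
 * is touched, so no cross-page bookkeeping is needed. */
static void convert_page_on_read(FakePage *page)
{
    if (page->pd_version == OLD_PAGE_VERSION)
    {
        /* ... rearrange line pointers / tuple headers here ... */
        page->pd_version = CURRENT_PAGE_VERSION;
    }
}
```

The key property is idempotence: a page already in the current format passes through untouched, so the hook can run unconditionally on every read.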

I'm not sure what the GIN changes were, but I didn't see any changes to 
the page layout at a quick glance.

The bpchar_pattern_ops change you mentioned must be this one:
>     A not-immediately-obvious incompatibility is that the sort order within
>     bpchar_pattern_ops indexes changes --- it had been identical to plain
>     strcmp, but is now trailing-blank-insensitive.  This will impact
>     in-place upgrades, if those ever happen.

The way I read that, bpchar_pattern_ops just became less sensitive. Some 
values are now considered equal that weren't before, and thus can now be 
stored in any order. That's not an incompatible change, right?

> Sequences should be converted during catalog conversion.

Agreed.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Prototype: In-place upgrade v02

From
Tom Lane
Date:
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
> Another idea is to create backward-compatible AMs and put them into a separate
> library. If these AMs also work with the old page structure, there should be
> no reason for reindexing or index page conversion after the upgrade.

I don't think that'd be real workable.  It would require duplicating all
the entries for that AM in pg_opfamily, pg_amop, etc.  Which we could do
for the built-in entries, I suppose, but what happens to user-defined
operator classes?

At least for the index changes proposed so far for 8.4, it seems to me
that the best solution is to mark affected indexes as not "indisvalid"
and require a post-conversion REINDEX to fix 'em.  Obviously a better
solution would be nice later, but we have to avoid putting huge amounts
of work into noncritical problems, else the whole feature is just not
going to get finished.
        regards, tom lane


Re: Prototype: In-place upgrade v02

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> The bpchar_pattern_ops change you mentioned must be this one:
>> A not-immediately-obvious incompatibility is that the sort order within
>> bpchar_pattern_ops indexes changes --- it had been identical to plain
>> strcmp, but is now trailing-blank-insensitive.  This will impact
>> in-place upgrades, if those ever happen.

Yup.

> The way I read that, bpchar_pattern_ops just became less sensitive. Some 
> values are now considered equal that weren't before, and thus can now be 
> stored in any order. That's not an incompatible change, right?

No, consider 'abc^I' vs 'abc ' (^I denoting a tab character).  These are
unequal in either case, but the sort order has flipped.
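The flip can be demonstrated with two toy comparators. These model the two behaviours in stand-alone C; they are not the actual bpchar_pattern_ops support functions:

```c
#include <string.h>

/* Old behaviour: plain byte-wise comparison. */
static int cmp_plain(const char *a, const char *b)
{
    return strcmp(a, b);
}

/* New behaviour, modelled as "strip trailing blanks, then compare". */
static int cmp_blank_insensitive(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);

    while (la > 0 && a[la - 1] == ' ')
        la--;
    while (lb > 0 && b[lb - 1] == ' ')
        lb--;

    int c = memcmp(a, b, la < lb ? la : lb);
    if (c != 0)
        return c;
    return (la > lb) - (la < lb);
}
```

With a = "abc\t" and b = "abc ", cmp_plain() sorts a first (tab is byte 0x09, space 0x20), while cmp_blank_insensitive() sorts b first (it compares as "abc", a proper prefix of "abc\t"), so an index built under one ordering is out of order under the other.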
        regards, tom lane


Re: Prototype: In-place upgrade v02

From
Zdenek Kotala
Date:
Heikki Linnakangas napsal(a):
> Zdenek Kotala wrote:
>> Tom Lane napsal(a):
>>> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>>>> In fact, I don't think there's any low-level data format changes yet 
>>>> between 8.3 and 8.4, so this would be a comparatively easy release 
>>>> to implement upgrade-in-place. There's just the catalog changes, but 
>>>> AFAICS nothing that would require scanning through relations.
>>>
>>> After a quick scan of the catversion.h changelog (which hopefully covers
>>> any such changes): we changed sequences incompatibly, we changed hash
>>> indexes incompatibly (even without the pending patch that would change
>>> their contents beyond recognition), and Teodor did some stuff to GIN
>>> indexes that might or might not represent an on-disk format change,
>>> you'd have to ask him.  We also whacked around the sort order of
>>> bpchar_pattern_ops btree indexes.
>>
>> Hmm, it seems that reindexing is the only good answer to all these changes.
> 
> Isn't that exactly what we want to avoid with upgrade-in-place? As long 
> as the conversion can be done page-at-a-time, without consulting other 
> pages, we can do it when the page is read in.

Yes, but I meant what we can do for 8.4.
    Zdenek


Re: Prototype: In-place upgrade v02

From
Zdenek Kotala
Date:
Heikki Linnakangas napsal(a):
> Zdenek Kotala wrote:
>> Heikki Linnakangas napsal(a):
>>> Relation forks didn't change anything inside relation files, so no 
>>> scanning of relations is required because of that. Neither will the 
>>> FSM rewrite. Not sure about DSM yet.
>>
>> Does that mean that if you "inject" an old data file after the catalog
>> upgrade, the FSM will work without any problem?
> 
> Yes. You'll need to construct an FSM, but it doesn't necessarily need to 
> reflect the reality. You could just fill it with zeros, meaning that 
> there's no free space anywhere, and let the next vacuum fill it with 
> real information. Or you could read the old pg_fsm.cache file and fill 
> the new FSM accordingly.

I think a zeroed FSM is good, because new items should not be added onto old pages.
    Zdenek


-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: Prototype: In-place upgrade v02

From
Zdenek Kotala
Date:
Tom Lane napsal(a):
> Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
>> Another idea is to create backward-compatible AMs and put them into a separate
>> library. If these AMs also work with the old page structure, there should be
>> no reason for reindexing or index page conversion after the upgrade.
> 
> I don't think that'd be real workable.  It would require duplicating all
> the entries for that AM in pg_opfamily, pg_amop, etc.  Which we could do
> for the built-in entries, I suppose, but what happens to user-defined
> operator classes?

When the catalog upgrade is performed directly, user-defined op classes should
stay in the catalog. But the question is what happens with regproc records, and
whether all the functions will be compatible with a new server ... It suggests
that we need a stable API for implementing operators and data types. Any data
type that uses only this API could then be used on new PostgreSQL versions
without recompilation.
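One way such a stable API could look is a per-version dispatch table. The sketch below is purely illustrative; the struct, function names, version numbers, and toy formats are all invented, not an existing PostgreSQL interface:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-version I/O dispatch for one data type: the server
 * looks up the entry matching the stored version and calls through a
 * stable function-pointer interface, so the type's shared library need
 * not be recompiled for a new server release. */
typedef struct
{
    uint16_t version;
    long (*input)(const char *str);     /* parse the external form */
    /* output / receive / send pointers would go here as well */
} TypeIOVersion;

static long input_v1(const char *s) { return s[0] - '0'; }          /* toy old format */
static long input_v2(const char *s) { return strtol(s, NULL, 10); } /* toy new format */

static const TypeIOVersion io_versions[] =
{
    { 1, input_v1 },
    { 2, input_v2 },
};

static long type_input(uint16_t version, const char *s)
{
    for (size_t i = 0; i < sizeof(io_versions) / sizeof(io_versions[0]); i++)
        if (io_versions[i].version == version)
            return io_versions[i].input(s);
    return -1;                          /* unknown version */
}
```

The point of the indirection is that old entries are never removed, so values written under any past version stay readable.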

> At least for the index changes proposed so far for 8.4, it seems to me
> that the best solution is to mark affected indexes as not "indisvalid"
> and require a post-conversion REINDEX to fix 'em.  Obviously a better
> solution would be nice later, but we have to avoid putting huge amounts
> of work into noncritical problems, else the whole feature is just not
> going to get finished.

Agreed.
Zdenek



-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: Prototype: In-place upgrade v02

From
Bruce Momjian
Date:
Zdenek Kotala wrote:
> You mentioned data types, but they are not a problem. You can easily extend
> the data type attributes with version information and call the correct in/out
> functions, or use a different Oid for the new data type version. There are
> several easy solutions for data types, and for conversion you can use the
> ALTER TABLE command. The main idea is to keep data in all formats in a
> relation. This approach could also be used for the integer/float datetime
> problem.

This kind of code structure scares me: our system may become so complex that
it hinders our ability to continue making improvements.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +