Thread: Proposal: Multiversion page API (in-place upgrade)
1) Overview

This proposal is part of the in-place upgrade project. PostgreSQL should be able to read any page in an old version's layout. This is the basis for every possible upgrade method.

2) Background

We have several macros for manipulating the page structures, but the list is not complete: many parts of the code access these structures directly, and several parts do not use the existing macros. The idea is to use only a specified API for manipulating/accessing data structures on a page. This API will recognize the page layout version and process the data correctly.

3) API

The proposed API is an extended version of the current macros, which do not cover all page header manipulation. I plan to use functions in the first implementation, because they offer better type checking and debugging capability, but some functions could be converted into macros (or into inline functions) in the final solution to improve performance. All changes are confined to bufpage.h and page.c.

4) Implementation

The main point of the implementation is to have several versions of the PageHeader structure (e.g. PageHeader_04, PageHeader_03, ...); the correct structure is selected in a version-specific branch (see the examples below). A possible improvement is to use a union which combines the different PageHeader versions; because most PageHeader fields are the same in all page layout versions, it would reduce the number of switches (see the sketch below). But I'm not sure whether a union has the same data layout as the separate structures on all supported platforms.

There are examples:

void
PageSetFull(Page page)
{
	switch (PageGetPageLayoutVersion(page))
	{
		case 4:
			((PageHeader_04) (page))->pd_flags |= PD_PAGE_FULL;
			break;
		default:
			elog(PANIC, "PageSetFull is not supported on page layout version %i",
				 PageGetPageLayoutVersion(page));
	}
}

LocationIndex
PageGetLower(Page page)
{
	switch (PageGetPageLayoutVersion(page))
	{
		case 4:
			return ((PageHeader_04) (page))->pd_lower;
	}
	elog(PANIC, "Unsupported page layout in function PageGetLower.");
}

5) Issues

a) The hash index has the PageHeader hardcoded into its meta page structure -> the hash index implementation needs to be rewritten to be friendly to multiple header versions.

b) All *ItemSize macros (plus the toast chunk size) depend on sizeof(PageHeader) -> a separate proposal will follow soon.

All comments are welcome.

Zdenek
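A minimal sketch of the union idea from section 4, assuming hypothetical PageHeaderData_03/PageHeaderData_04 struct names; note that C guarantees a common layout only for the shared initial sequence of structs inside a union, which is exactly the portability question raised above:

/*
 * Sketch: one union covers all supported header layouts.  Fields in the
 * common initial sequence (pd_lower is assumed here to be one of them)
 * can be read through any member without a version switch.
 */
typedef union MVPageHeaderData
{
	PageHeaderData_03	v03;	/* hypothetical layout version 3 header */
	PageHeaderData_04	v04;	/* hypothetical layout version 4 header */
} MVPageHeaderData;

typedef MVPageHeaderData *MVPageHeader;

static inline LocationIndex
MVPageGetLower(Page page)
{
	/* pd_lower sits at the same offset in both layouts, so no switch */
	return ((MVPageHeader) page)->v04.pd_lower;
}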
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
> There are examples:

> void
> PageSetFull(Page page)
> {
> 	switch (PageGetPageLayoutVersion(page))
> 	{
> 		case 4:
> 			((PageHeader_04) (page))->pd_flags |= PD_PAGE_FULL;
> 			break;
> 		default:
> 			elog(PANIC, "PageSetFull is not supported on page layout version %i",
> 				 PageGetPageLayoutVersion(page));
> 	}
> }

> LocationIndex
> PageGetLower(Page page)
> {
> 	switch (PageGetPageLayoutVersion(page))
> 	{
> 		case 4:
> 			return ((PageHeader_04) (page))->pd_lower;
> 	}
> 	elog(PANIC, "Unsupported page layout in function PageGetLower.");
> }

I'm fairly concerned about the performance impact of turning what had been simple field accesses into function calls. I argue also that since none of the PageHeader fields have actually moved in any version that's likely to be supported, the above functions are of exactly zero value.

The proposed PANIC in PageSetFull seems like it requires more thought as well: surely we don't want that ever to happen. Which means that callers need to be careful not to invoke such an operation on an un-updated page, but this proposed coding offers no aid in making sure that won't happen. What is needed there, I think, is some more global policy about what operations are permitted on old (un-converted) pages and a high-level approach to ensuring that unsafe operations aren't attempted.

			regards, tom lane
Zdenek Kotala wrote:
> 4) Implementation
>
> The main point of the implementation is to have several versions of the
> PageHeader structure (e.g. PageHeader_04, PageHeader_03, ...); the
> correct structure is selected in a version-specific branch (see the
> examples below).

(this won't come as a surprise as we talked about this in PGCon, but) I think we should rather convert the page structure to the new format in ReadBuffer the first time a page is read in. That would keep the changes a lot more isolated.

Note that you need to handle not only page header changes, but changes to internal representations of different data types, and changes like varvarlen and combocid. Those are things that have happened in the past; in the future, I'm foreseeing changes to the toast header, for example, as there have been a lot of ideas related to toast options and compression.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
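A rough, hypothetical sketch of what conversion-on-read could look like; convert_page_03_to_04() is an assumed helper, while PageGetPageLayoutVersion and PG_PAGE_LAYOUT_VERSION are the existing bufpage.h interfaces. The check would run once, right after the page comes in from disk, so the rest of the backend sees only current-format pages:

/*
 * Sketch: called from ReadBuffer after a page is read from disk.
 */
static void
PageConvertToCurrentVersion(Page page)
{
	switch (PageGetPageLayoutVersion(page))
	{
		case PG_PAGE_LAYOUT_VERSION:
			break;				/* already current, nothing to do */

		case 3:
			convert_page_03_to_04(page);	/* assumed in-place rewrite */
			break;

		default:
			elog(ERROR, "unsupported page layout version %d",
				 PageGetPageLayoutVersion(page));
	}
}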
"Heikki Linnakangas" <heikki@enterprisedb.com> writes: > (this won't come as a surprise as we talked about this in PGCon, but) I > think we should rather convert the page structure to new format in > ReadBuffer the first time a page is read in. That would keep the changes > a lot more isolated. The problem is that ReadBuffer is an extremely low-level environment, and it's not clear that it's possible (let alone practical) to do a conversion at that level in every case. In particular it hardly seems sane to expect ReadBuffer to do tuple content conversion, which is going to be practically impossible to perform without any catalog accesses. Another issue is that it might not be possible to update a page for lack of space. Are we prepared to assume that there will never be a transformation we need to apply that makes the data bigger? (Likely counterexample: adding collation info to text values.) In such a situation an in-place update might be impossible, and that certainly takes it outside the bounds of what ReadBuffer can be expected to manage. regards, tom lane
Tom Lane napsal(a):
> Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
>> There are examples:
>
>> void
>> PageSetFull(Page page)
>> {
>> 	switch (PageGetPageLayoutVersion(page))
>> 	{
>> 		case 4:
>> 			((PageHeader_04) (page))->pd_flags |= PD_PAGE_FULL;
>> 			break;
>> 		default:
>> 			elog(PANIC, "PageSetFull is not supported on page layout version %i",
>> 				 PageGetPageLayoutVersion(page));
>> 	}
>> }
>
>> LocationIndex
>> PageGetLower(Page page)
>> {
>> 	switch (PageGetPageLayoutVersion(page))
>> 	{
>> 		case 4:
>> 			return ((PageHeader_04) (page))->pd_lower;
>> 	}
>> 	elog(PANIC, "Unsupported page layout in function PageGetLower.");
>> }
>
> I'm fairly concerned about the performance impact of turning what had
> been simple field accesses into function calls.

I use functions now because that makes it easy to track what's going on. In the final version these should be (mostly) macros.

> I argue also that since
> none of the PageHeader fields have actually moved in any version that's
> likely to be supported, the above functions are of exactly zero value.

Yeah, that is why I'm thinking of using a page header with unions inside (for example for the TLI and flags fields) and using a switch only for fields like the TLI or flags. What I don't know is whether the fields of such a structure will be placed at the same offsets on all platforms.

> The proposed PANIC in PageSetFull seems like it requires more thought as
> well: surely we don't want that ever to happen. Which means that
> callers need to be careful not to invoke such an operation on an
> un-updated page, but this proposed coding offers no aid in making sure
> that won't happen. What is needed there, I think, is some more global
> policy about what operations are permitted on old (un-converted) pages
> and a high-level approach to ensuring that unsafe operations aren't
> attempted.

ad PANIC) The PANIC shouldn't happen, because page validation in ReadBuffer should check for a supported page version.

ad policy) Good catch. I think all page read operations should be allowed on old page versions. Only tuple, LSN, TLI, and special-space modifications should be allowed for writing. PageAddItem should invoke page conversion before any action happens (if there is free space for the tuple, it is possible to convert the page into the new format, but after conversion the free space could be smaller than the tuple). See the sketch below.

Zdenek
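A hypothetical sketch of that convert-before-write policy; PageConvertInplace() is an assumed helper, not existing code, while PageGetPageLayoutVersion, PG_PAGE_LAYOUT_VERSION, and PageGetFreeSpace are the existing bufpage.h interfaces:

/*
 * Sketch: a write path calls this before modifying a page.  Read paths
 * skip it, so reads remain allowed on old-format pages.  Conversion can
 * shrink the usable space, hence the re-check after converting.
 */
static bool
PagePrepareForWrite(Page page, Size itemsize)
{
	if (PageGetPageLayoutVersion(page) != PG_PAGE_LAYOUT_VERSION)
	{
		PageConvertInplace(page);	/* assumed in-place layout upgrade */

		/* conversion may have consumed free space; re-check the fit */
		if (PageGetFreeSpace(page) < MAXALIGN(itemsize))
			return false;		/* caller must choose another page */
	}
	return true;
}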
Tom Lane wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> (this won't come as a surprise as we talked about this in PGCon, but) I
>> think we should rather convert the page structure to the new format in
>> ReadBuffer the first time a page is read in. That would keep the changes
>> a lot more isolated.
>
> The problem is that ReadBuffer is an extremely low-level environment,
> and it's not clear that it's possible (let alone practical) to do a
> conversion at that level in every case.

Well, we can't predict the future, and can't guarantee that it's possible or practical to do the things we need to do in the future no matter what approach we choose.

> In particular it hardly seems
> sane to expect ReadBuffer to do tuple content conversion, which is going
> to be practically impossible to perform without any catalog accesses.

ReadBuffer has access to the Relation, which has information about what kind of relation it's dealing with, and the TupleDesc. That should get us pretty far. It would be a modularity violation, for sure, but I could live with that for the purpose of page version conversion.

> Another issue is that it might not be possible to update a page for
> lack of space. Are we prepared to assume that there will never be a
> transformation we need to apply that makes the data bigger?

We do need some solution to that. One idea is to run a pre-upgrade script in the old version that scans the database and moves tuples that would no longer fit on their pages in the new version. This could be run before the upgrade, while the old database is still running, so it would be acceptable for it to take some time.

No doubt people would prefer something better than that. Another idea would be to have some over-sized buffers that can be used as the target of conversion, until some tuples are moved off to another page. Perhaps the over-sized buffers wouldn't need to be in shared memory, if they're read-only until some tuples are moved.

This is pretty hand-wavy, I know. The point is, I don't think these problems are insurmountable.

> (Likely counterexample: adding collation info to text values.)

I doubt it, as collation is not a property of text values, but of operations. But that's off-topic...

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas napsal(a):
> Zdenek Kotala wrote:
>> 4) Implementation
>>
>> The main point of the implementation is to have several versions of the
>> PageHeader structure (e.g. PageHeader_04, PageHeader_03, ...); the
>> correct structure is selected in a version-specific branch (see the
>> examples below).
>
> (this won't come as a surprise as we talked about this in PGCon, but) I
> think we should rather convert the page structure to the new format in
> ReadBuffer the first time a page is read in. That would keep the changes
> a lot more isolated.

I agree with Tom's reply. And anyway, this approach would be mostly isolated in page.c, and you need to be able to read old pages in both cases.

> Note that you need to handle not only page header changes, but changes
> to internal representations of different data types, and changes like
> varvarlen and combocid. Those are things that have happened in the past;
> in the future, I'm foreseeing changes to the toast header, for example,
> as there have been a lot of ideas related to toast options and compression.

I know; this is the first small step for in-place upgrade. The tuple header will follow. The page structure is the basis. I want to split the development into small steps, because they are easier to review.

Zdenek
Heikki Linnakangas napsal(a):
> Tom Lane wrote:
>> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>>> (this won't come as a surprise as we talked about this in PGCon, but)
>>> I think we should rather convert the page structure to the new format
>>> in ReadBuffer the first time a page is read in. That would keep the
>>> changes a lot more isolated.
>>
>> The problem is that ReadBuffer is an extremely low-level environment,
>> and it's not clear that it's possible (let alone practical) to do a
>> conversion at that level in every case.
>
> Well, we can't predict the future, and can't guarantee that it's
> possible or practical to do the things we need to do in the future no
> matter what approach we choose.
>
>> In particular it hardly seems
>> sane to expect ReadBuffer to do tuple content conversion, which is going
>> to be practically impossible to perform without any catalog accesses.
>
> ReadBuffer has access to the Relation, which has information about what
> kind of relation it's dealing with, and the TupleDesc. That should get
> us pretty far. It would be a modularity violation, for sure, but I could
> live with that for the purpose of page version conversion.

But if you look, for example, into the hash index implementation, some pages are not in the regular format, and the conversion could need more information than we have available in ReadBuffer.

>> Another issue is that it might not be possible to update a page for
>> lack of space. Are we prepared to assume that there will never be a
>> transformation we need to apply that makes the data bigger?
>
> We do need some solution to that. One idea is to run a pre-upgrade
> script in the old version that scans the database and moves tuples that
> would no longer fit on their pages in the new version. This could be run
> before the upgrade, while the old database is still running, so it would
> be acceptable for it to take some time.

That would not work for indexes, and don't forget TOAST chunks. I think in some cases you could end up with an unused quarter of each page in a TOAST table.

> No doubt people would prefer something better than that. Another idea
> would be to have some over-sized buffers that can be used as the target
> of conversion, until some tuples are moved off to another page. Perhaps
> the over-sized buffers wouldn't need to be in shared memory, if they're
> read-only until some tuples are moved.

Anyway, you need a mechanism to mark such a page read-only, which also requires a lot of modification, and some mechanism to decide when the page gets converted. I guess this approach would require modifications similar to convert-on-write.

> This is pretty hand-wavy, I know. The point is, I don't think these
> problems are insurmountable.
>
>> (Likely counterexample: adding collation info to text values.)
>
> I doubt it, as collation is not a property of text values, but of
> operations. But that's off-topic...

Yes, it is off-topic; however, I think Tom is right :-).

Zdenek
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > (Likely counterexample: adding collation info to text values.) I don't think the argument really needs an example, but I would be pretty upset if we proposed tagging every text datum with a collation. Encoding perhaps, though that seems like a bad idea to me on performance grounds, but collation is not a property of the data at all. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support!
Gregory Stark wrote:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>
>> (Likely counterexample: adding collation info to text values.)
>
> I don't think the argument really needs an example, but I would be pretty
> upset if we proposed tagging every text datum with a collation. Encoding
> perhaps, though that seems like a bad idea to me on performance grounds,
> but collation is not a property of the data at all.

Again, not directly related to difficulties upgrading pages... The recent discussion

http://archives.postgresql.org/pgsql-hackers/2008-06/msg00102.php

mentions keeping collation information together with text data; however, it is referring to keeping it together when processing the text, not when storing it.

Regards,
Stephen Denne.
Heikki Linnakangas wrote:
> Zdenek Kotala wrote:
>> 4) Implementation
>>
>> The main point of the implementation is to have several versions of the
>> PageHeader structure (e.g. PageHeader_04, PageHeader_03, ...); the
>> correct structure is selected in a version-specific branch (see the
>> examples below).
>
> (this won't come as a surprise as we talked about this in PGCon, but) I
> think we should rather convert the page structure to the new format in
> ReadBuffer the first time a page is read in. That would keep the changes
> a lot more isolated.
>
> Note that you need to handle not only page header changes, but changes
> to internal representations of different data types, and changes like
> varvarlen and combocid. Those are things that have happened in the past;
> in the future, I'm foreseeing changes to the toast header, for example,
> as there have been a lot of ideas related to toast options and compression.

I understand the goal of having good modularity (not having ReadBuffer modify the page), but I am worried that doing multi-version page processing in a modular way is going to spread version-specific information all over the backend code, making it harder to understand.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
On Jun 11, 2008, at 10:42 AM, Heikki Linnakangas wrote:
>> Another issue is that it might not be possible to update a page for
>> lack of space. Are we prepared to assume that there will never be a
>> transformation we need to apply that makes the data bigger?
>
> We do need some solution to that. One idea is to run a pre-upgrade
> script in the old version that scans the database and moves tuples
> that would no longer fit on their pages in the new version. This
> could be run before the upgrade, while the old database is still
> running, so it would be acceptable for it to take some time.

That means old versions have to have some knowledge of new versions. There's also a big race condition unless the old version starts taking size requirements into account every time a page is dirtied.

> No doubt people would prefer something better than that. Another
> idea would be to have some over-sized buffers that can be used as
> the target of conversion, until some tuples are moved off to
> another page. Perhaps the over-sized buffers wouldn't need to be in
> shared memory, if they're read-only until some tuples are moved.
>
> This is pretty hand-wavy, I know. The point is, I don't think these
> problems are insurmountable.

--
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
Tom Lane wrote:
> Another issue is that it might not be possible to update a page for
> lack of space. Are we prepared to assume that there will never be a
> transformation we need to apply that makes the data bigger? In such a
> situation an in-place update might be impossible, and that certainly
> takes it outside the bounds of what ReadBuffer can be expected to manage.

Would a possible solution to this be that you could

1. Upgrade to the newest minor version of the old release (which has knowledge of the space requirements of the new one).

2. Run some new maintenance command like "vacuum expand" or "vacuum prepare_for_upgrade" or something that would split any too-full pages, leaving only pages with enough space.

3. Only then shut down the old server and start the new major-version server.
Ron Mayer napsal(a):
> Tom Lane wrote:
>> Another issue is that it might not be possible to update a page for
>> lack of space. Are we prepared to assume that there will never be a
>> transformation we need to apply that makes the data bigger? In such a
>> situation an in-place update might be impossible, and that certainly
>> takes it outside the bounds of what ReadBuffer can be expected to manage.
>
> Would a possible solution to this be that you could
<snip>
>
> 2. Run some new maintenance command like "vacuum expand" or
>    "vacuum prepare_for_upgrade" or something that would split
>    any too-full pages, leaving only pages with enough space.

It does not solve problems with, for example, TOAST tables. If the chunks do not fit on a page in the new layout, one of the chunk tuples has to be moved to a free page. That means you get a lot of pages with ~2kB of free, unused space. And if the max chunk size differs between versions, you have another problem as well.

There is also an idea to change the compression algorithm for 8.4 (or to offer more variants). That also means that you need to understand the old algorithm in the new version, or you need to repack everything on the old version.

Zdenek
Bruce Momjian napsal(a):
> Heikki Linnakangas wrote:
>> Zdenek Kotala wrote:
>>> 4) Implementation
>>>
>>> The main point of the implementation is to have several versions of the
>>> PageHeader structure (e.g. PageHeader_04, PageHeader_03, ...); the
>>> correct structure is selected in a version-specific branch (see the
>>> examples below).
>> (this won't come as a surprise as we talked about this in PGCon, but) I
>> think we should rather convert the page structure to the new format in
>> ReadBuffer the first time a page is read in. That would keep the changes
>> a lot more isolated.
>>
>> Note that you need to handle not only page header changes, but changes
>> to internal representations of different data types, and changes like
>> varvarlen and combocid. Those are things that have happened in the past;
>> in the future, I'm foreseeing changes to the toast header, for example,
>> as there have been a lot of ideas related to toast options and compression.
>
> I understand the goal of having good modularity (not having ReadBuffer
> modify the page), but I am worried that doing multi-version page
> processing in a modular way is going to spread version-specific
> information all over the backend code, making it harder to understand.

I don't think so. A page already contains the page version information, and we currently have macros like PageSetLSN. The caller needn't know anything about the PageHeader representation; it is the responsibility of the page API to handle multiple versions correctly.

We can use the same approach for tuple access. It is more complicated, but I think it is possible. We currently have several macros (e.g. HeapTupleGetOid) which work on the tuple data structure. "Only" what we need is to extend this API as well (see the sketch below). I think in the end we will get more readable code.

Zdenek
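A hypothetical sketch of extending the versioned-accessor pattern to tuple headers; HeapTupleHeader_03, HeapTupleHeaderGetOid_03, and the explicit tuple_version argument are assumed names, not existing code, while HeapTupleHeaderGetOid is the existing macro:

/*
 * Sketch: version-aware wrapper around the existing tuple-header
 * accessor.  The caller derives tuple_version from the page that the
 * tuple was read from.
 */
Oid
MVHeapTupleHeaderGetOid(HeapTupleHeader tup, int tuple_version)
{
	switch (tuple_version)
	{
		case 4:
			return HeapTupleHeaderGetOid(tup);	/* existing macro */
		case 3:
			return HeapTupleHeaderGetOid_03((HeapTupleHeader_03) tup);
		default:
			elog(ERROR, "unsupported tuple header version %d",
				 tuple_version);
			return InvalidOid;	/* not reached; keeps compiler quiet */
	}
}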
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
> It does not solve problems with, for example, TOAST tables. If the
> chunks do not fit on a page in the new layout, one of the chunk tuples
> has to be moved to a free page. That means you get a lot of pages with
> ~2kB of free, unused space. And if the max chunk size differs between
> versions, you have another problem as well.

> There is also an idea to change the compression algorithm for 8.4 (or
> to offer more variants). That also means that you need to understand
> the old algorithm in the new version, or you need to repack everything
> on the old version.

I don't have any problem at all with the idea that in-place update isn't going to support arbitrary changes of parameters, such as modifying the toast chunk size. In particular, anything that is locked down by pg_control isn't a problem.

			regards, tom lane