Thread: [WIP] In-place upgrade
This is the first patch that is not just cleanup; it adds the actual in-place upgrade functionality. It depends on the other cleanup patches I already sent. You can also find a GIT repository with a "workable" version. The main point is that tuples are converted to the latest version in the SeqScan and IndexScan nodes. The whole storage/access module is able to process 8.1-8.4 databases (page layouts 3 and 4). What works: - select - heap scan is OK, but index scan does not work on varlena datatypes. I need to convert the index key somewhere in the index access code. What does not work: - conversion of tuples containing arrays, composite datatypes and TOAST - vacuum - it tries to clean up old pages; it would probably be better to convert them to the new format during processing... - insert/delete/update The patch still contains a lot of extra comments and rubbish, but cleanup is in progress. What I need to know/solve: 1) yes/no on this kind of online upgrade method 2) I'm not sure whether the ExecStoreTuple calls are correct. 3) I'm still looking for the best place to store the old data structures and conversion functions. My idea is to create new directories: src/include/odf/v03/... 
src/backend/storage/upgrade/ src/backend/access/upgrade (odf = On Disk Format) Links: http://git.postgresql.org/?p=~davidfetter/upgrade_in_place/.git;a=summary http://src.opensolaris.org/source/xref/sfw/usr/src/cmd/postgres/postgresql-upgrade/ Thanks for your comments Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/backend/access/heap/htup_03.c pgsql_master_upgrade.13a47c410da7/src/backend/access/heap/htup_03.c *** pgsql_master_upgrade.751eb7c6969f/src/backend/access/heap/htup_03.c 1970-01-01 01:00:00.000000000 +0100 --- pgsql_master_upgrade.13a47c410da7/src/backend/access/heap/htup_03.c 2008-10-31 21:45:33.281134312 +0100 *************** *** 0 **** --- 1,223 ---- + #include "postgres.h" + #include "access/htup_03.h" + #include "access/heapam.h" + #include "utils/rel.h" + + #define VARATT_FLAG_EXTERNAL 0x80000000 + #define VARATT_FLAG_COMPRESSED 0x40000000 + #define VARATT_MASK_FLAGS 0xc0000000 + #define VARATT_MASK_SIZE 0x3fffffff + #define VARATT_SIZEP(_PTR) (((varattrib_03 *)(_PTR))->va_header) + #define VARATT_SIZE(PTR) (VARATT_SIZEP(PTR) & VARATT_MASK_SIZE) + + typedef struct varattrib_03 + { + int32 va_header; /* External/compressed storage */ + /* flags and item size */ + union + { + struct + { + int32 va_rawsize; /* Plain data size */ + char va_data[1]; /* Compressed data */ + } va_compressed; /* Compressed stored attribute */ + + struct + { + int32 va_rawsize; /* Plain data size */ + int32 va_extsize; /* External saved size */ + Oid va_valueid; /* Unique identifier of value */ + Oid va_toastrelid; /* RelID where to find chunks */ + } va_external; /* External stored attribute */ + + char va_data[1]; /* Plain stored attribute */ + } va_content; + } varattrib_03; + + /* + * att_align aligns the given offset as needed for a datum of alignment + * requirement attalign. The cases are tested in what is hopefully something + * like their frequency of occurrence. 
+ */ + static + long att_align_03(long cur_offset, char attalign) + { + switch(attalign) + { + case 'i' : return INTALIGN(cur_offset); + case 'c' : return cur_offset; + case 'd' : return DOUBLEALIGN(cur_offset); + case 's' : return SHORTALIGN(cur_offset); + default: elog(ERROR, "unsupported alignment (%c).", attalign); + } + } + + /* + * att_addlength increments the given offset by the length of the attribute. + * attval is only accessed if we are dealing with a variable-length attribute. + */ + static + long att_addlength_03(long cur_offset, int attlen, Datum attval) + { + if(attlen > 0) + return cur_offset + attlen; + + if(attlen == -1) + return cur_offset + (*((uint32*) DatumGetPointer(attval)) & 0x3fffffff); + + if(attlen != -2) + elog(ERROR, "unsupported attlen (%i).", attlen); + + return cur_offset + strlen(DatumGetCString(attval)) + 1; + } + + + /* deform tuple from version 03 including varlena and + * composite type handling */ + void + heap_deform_tuple_03(HeapTuple tuple, TupleDesc tupleDesc, + Datum *values, bool *isnull) + { + HeapTupleHeader_03 tup = (HeapTupleHeader_03) tuple->t_data; + bool hasnulls = (tup->t_infomask & 0x01); + Form_pg_attribute *att = tupleDesc->attrs; + int tdesc_natts = tupleDesc->natts; + int natts; /* number of atts to extract */ + int attnum; + Pointer tp_data; + long off; /* offset in tuple data */ + bits8 *bp = tup->t_bits; /* ptr to null bitmap in tuple */ + + natts = tup->t_natts; + + /* + * In inheritance situations, it is possible that the given tuple actually + * has more fields than the caller is expecting. Don't run off the end of + * the caller's arrays. 
+ */ + natts = Min(natts, tdesc_natts); + + tp_data = ((Pointer)tup) + tup->t_hoff; + + off = 0; + + for (attnum = 0; attnum < natts; attnum++) + { + Form_pg_attribute thisatt = att[attnum]; + + if (hasnulls && att_isnull(attnum, bp)) + { + values[attnum] = (Datum) 0; + isnull[attnum] = true; + continue; + } + + isnull[attnum] = false; + + off = att_align_03(off, thisatt->attalign); + + values[attnum] = fetchatt(thisatt, tp_data + off); /* fetchatt looks compatible */ + + off = att_addlength_03(off, thisatt->attlen, (Datum)(tp_data + off)); + } + + /* + * If tuple doesn't have all the atts indicated by tupleDesc, read the + * rest as null + */ + for (; attnum < tdesc_natts; attnum++) + { + values[attnum] = (Datum) 0; + isnull[attnum] = true; + } + } + + HeapTuple heap_tuple_upgrade_03(Relation rel, HeapTuple tuple) + { + TupleDesc tupleDesc = RelationGetDescr(rel); + int natts; + Datum *values; + bool *isnull; + bool *isalloc; + HeapTuple newTuple; + int n; + + /* Preallocate values/isnull arrays */ + natts = tupleDesc->natts; + values = (Datum *) palloc0(natts * sizeof(Datum)); + isnull = (bool *) palloc0(natts * sizeof(bool)); + isalloc = (bool *) palloc0(natts * sizeof(bool)); + + heap_deform_tuple_03(tuple, tupleDesc, values, isnull); + + /* now we need to go through values and convert varlen and composite types */ + for( n = 0; n < natts; n++) + { + if(isnull[n]) + continue; + + if(tupleDesc->attrs[n]->attlen == -1) + { + varattrib_03* varhdr_03; + varattrib_4b* varhdr_04; + char *data; + + // elog(NOTICE,"attname %s", tupleDesc->attrs[n]->attname); + + /* varlena conversion */ + varhdr_03 = (varattrib_03*) DatumGetPointer(values[n]); + data = palloc(VARATT_SIZE(varhdr_03)); + varhdr_04 = (varattrib_4b*) data; + + if( (varhdr_03->va_header & VARATT_MASK_FLAGS) == 0 ) + { /* TODO short varlena - but form_tuple should convert it anyway */ + + SET_VARSIZE(varhdr_04, VARATT_SIZE(varhdr_03)); + memcpy( VARDATA(varhdr_04), varhdr_03->va_content.va_data, + 
VARATT_SIZE(varhdr_03)- offsetof(varattrib_03, va_content.va_data) ); + } else + if( (varhdr_03->va_header & VARATT_FLAG_EXTERNAL) != 0) + { + SET_VARSIZE_EXTERNAL(varhdr_04, + VARHDRSZ_EXTERNAL + sizeof(struct varatt_external)); + memcpy( VARDATA_EXTERNAL(varhdr_04), + &(varhdr_03->va_content.va_external.va_rawsize), sizeof(struct varatt_external)); + } else + if( (varhdr_03->va_header & VARATT_FLAG_COMPRESSED ) != 0) + { + + SET_VARSIZE_COMPRESSED(varhdr_04, VARATT_SIZE(varhdr_03)); + varhdr_04->va_compressed.va_rawsize = varhdr_03->va_content.va_compressed.va_rawsize; + + memcpy( VARDATA_4B_C(varhdr_04), varhdr_03->va_content.va_compressed.va_data, + VARATT_SIZE(varhdr_03)- offsetof(varattrib_03, va_content.va_compressed.va_data) ); + } + + values[n] = PointerGetDatum(data); + isalloc[n] = true; + } + } + + newTuple = heap_form_tuple(tupleDesc, values, isnull); + + /* free allocated memory */ + for( n = 0; n < natts; n++) + { + if(isalloc[n]) + pfree(DatumGetPointer(values[n])); + } + + /* Preserve OID, if any */ + if(rel->rd_rel->relhasoids) + { + Oid oid; + oid = *((Oid *) ((char *)(tuple->t_data) + ((HeapTupleHeader_03)(tuple->t_data))->t_hoff - sizeof(Oid))); + HeapTupleSetOid(newTuple, oid); + } + return newTuple; + } + + + + + diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/backend/access/heap/htup.c pgsql_master_upgrade.13a47c410da7/src/backend/access/heap/htup.c *** pgsql_master_upgrade.751eb7c6969f/src/backend/access/heap/htup.c 2008-10-31 21:45:33.114200837 +0100 --- pgsql_master_upgrade.13a47c410da7/src/backend/access/heap/htup.c 2008-10-31 21:45:33.218887161 +0100 *************** *** 2,10 **** --- 2,15 ---- #include "fmgr.h" #include "access/htup.h" + #include "access/htup_03.h" #include "access/transam.h" #include "storage/bufpage.h" + + #define TPH03(tup) \ + ((HeapTupleHeader_03)tuple->t_data) + /* * HeapTupleHeader accessor macros * *************** *** 135,251 **** */ bool HeapTupleIsHotUpdated(HeapTuple tuple) { ! 
return ((tuple->t_data->t_infomask2 & HEAP_HOT_UPDATED) != 0 && ! (tuple->t_data->t_infomask & (HEAP_XMIN_INVALID | HEAP_XMAX_INVALID)) == 0); } void HeapTupleSetHotUpdated(HeapTuple tuple) { ! tuple->t_data->t_infomask2 |= HEAP_HOT_UPDATED; } void HeapTupleClearHotUpdated(HeapTuple tuple) { ! tuple->t_data->t_infomask2 &= ~HEAP_HOT_UPDATED; } bool HeapTupleIsHeapOnly(HeapTuple tuple) { ! return (tuple->t_data->t_infomask2 & HEAP_ONLY_TUPLE) != 0; } void HeapTupleSetHeapOnly(HeapTuple tuple) { ! tuple->t_data->t_infomask2 |= HEAP_ONLY_TUPLE; } void HeapTupleClearHeapOnly(HeapTuple tuple) { ! tuple->t_data->t_infomask2 &= ~HEAP_ONLY_TUPLE; } Oid HeapTupleGetOid(HeapTuple tuple) { if(!HeapTupleIs(tuple, HEAP_HASOID)) return InvalidOid; ! return *((Oid *) ((char *)tuple->t_data + HeapTupleGetHoff(tuple) - sizeof(Oid))); } void HeapTupleSetOid(HeapTuple tuple, Oid oid) { Assert(HeapTupleIs(tuple, HEAP_HASOID)); ! *((Oid *) ((char *)(tuple->t_data) + HeapTupleGetHoff(tuple) - sizeof(Oid))) = oid; ! } ! ! bool HeapTupleHasOid(HeapTuple tuple) ! { ! return HeapTupleIs(tuple, HEAP_HASOID); } TransactionId HeapTupleGetXmax(HeapTuple tuple) { ! return tuple->t_data->t_choice.t_heap.t_xmax; } void HeapTupleSetXmax(HeapTuple tuple, TransactionId xmax) { ! tuple->t_data->t_choice.t_heap.t_xmax = xmax; } TransactionId HeapTupleGetXmin(HeapTuple tuple) { ! return tuple->t_data->t_choice.t_heap.t_xmin; } void HeapTupleSetXmin(HeapTuple tuple, TransactionId xmin) { ! tuple->t_data->t_choice.t_heap.t_xmin = xmin; } TransactionId HeapTupleGetXvac(HeapTuple tuple) { ! return (HeapTupleIs(tuple, HEAP_MOVED)) ? ! tuple->t_data->t_choice.t_heap.t_field3.t_xvac : ! InvalidTransactionId; } void HeapTupleSetXvac(HeapTuple tuple, TransactionId Xvac) { ! Assert(HeapTupleIs(tuple, HEAP_MOVED)); ! tuple->t_data->t_choice.t_heap.t_field3.t_xvac = Xvac; } void HeapTupleSetCmax(HeapTuple tuple, CommandId cid, bool iscombo) { ! Assert(!(HeapTupleIs(tuple, HEAP_MOVED))); ! 
tuple->t_data->t_choice.t_heap.t_field3.t_cid = cid; ! if(iscombo) ! HeapTupleSet(tuple, HEAP_COMBOCID); ! else ! HeapTupleClear(tuple, HEAP_COMBOCID); } void HeapTupleSetCmin(HeapTuple tuple, CommandId cid) { ! Assert(!(HeapTupleIs(tuple, HEAP_MOVED))); ! tuple->t_data->t_choice.t_heap.t_field3.t_cid = cid; ! HeapTupleClear(tuple, HEAP_COMBOCID); } uint16 HeapTupleGetInfoMask(HeapTuple tuple) { ! return ((tuple)->t_data->t_infomask); } void HeapTupleSetInfoMask(HeapTuple tuple, uint16 infomask) { ! ((tuple)->t_data->t_infomask = (infomask)); } uint16 HeapTupleGetInfoMask2(HeapTuple tuple) { ! return ((tuple)->t_data->t_infomask2); } bool HeapTupleIs(HeapTuple tuple, uint16 mask) --- 140,361 ---- */ bool HeapTupleIsHotUpdated(HeapTuple tuple) { ! switch(tuple->t_ver) ! { ! case 4 : return ((tuple->t_data->t_infomask2 & HEAP_HOT_UPDATED) != 0 && ! (tuple->t_data->t_infomask & (HEAP_XMIN_INVALID | HEAP_XMAX_INVALID)) == 0); ! case 3 : return false; ! } ! Assert(false); ! return false; } void HeapTupleSetHotUpdated(HeapTuple tuple) { ! switch(tuple->t_ver) ! { ! case 4 : tuple->t_data->t_infomask2 |= HEAP_HOT_UPDATED; ! return; ! } ! elog(PANIC,"Tuple cannot be HOT updated"); } void HeapTupleClearHotUpdated(HeapTuple tuple) { ! switch(tuple->t_ver) ! { ! case 4 : tuple->t_data->t_infomask2 &= ~HEAP_HOT_UPDATED; ! return; ! } ! elog(PANIC,"Tuple cannot be HOT updated"); } bool HeapTupleIsHeapOnly(HeapTuple tuple) { ! switch(tuple->t_ver) ! { ! case 4 : return (tuple->t_data->t_infomask2 & HEAP_ONLY_TUPLE) != 0; ! case 3 : return false; ! } ! Assert(false); ! return false; } void HeapTupleSetHeapOnly(HeapTuple tuple) { ! switch(tuple->t_ver) ! { ! case 4 : tuple->t_data->t_infomask2 |= HEAP_ONLY_TUPLE; ! return; ! } ! elog(PANIC, "HeapOnly flag is not supported."); } void HeapTupleClearHeapOnly(HeapTuple tuple) { ! switch(tuple->t_ver) ! { ! case 4 : tuple->t_data->t_infomask2 &= ~HEAP_ONLY_TUPLE; ! return; ! } ! 
elog(PANIC, "HeapOnly flag is not supported."); } + bool HeapTupleHasOid(HeapTuple tuple) + { + return HeapTupleIs(tuple, HEAP_HASOID); + } Oid HeapTupleGetOid(HeapTuple tuple) { if(!HeapTupleIs(tuple, HEAP_HASOID)) return InvalidOid; ! switch(tuple->t_ver) ! { ! case 4 : return *((Oid *) ((char *)tuple->t_data + HeapTupleGetHoff(tuple) - sizeof(Oid))); ! case 3 : return *((Oid *) ((char *)TPH03(tuple) + HeapTupleGetHoff(tuple) - sizeof(Oid))); ! } ! elog(PANIC, "HeapTupleGetOid is not supported."); } void HeapTupleSetOid(HeapTuple tuple, Oid oid) { Assert(HeapTupleIs(tuple, HEAP_HASOID)); ! switch(tuple->t_ver) ! { ! case 4 : *((Oid *) ((char *)(tuple->t_data) + HeapTupleGetHoff(tuple) - sizeof(Oid))) = oid; ! break; ! case 3 : *((Oid *) ((char *)TPH03(tuple) + HeapTupleGetHoff(tuple) - sizeof(Oid))) = oid; ! break; ! default: elog(PANIC, "HeapTupleSetOid is not supported."); ! } } TransactionId HeapTupleGetXmax(HeapTuple tuple) { ! switch(tuple->t_ver) ! { ! case 4 : return tuple->t_data->t_choice.t_heap.t_xmax; ! case 3 : return TPH03(tuple)->t_choice.t_heap.t_xmax; ! } ! elog(PANIC, "HeapTupleGetXmax is not supported."); ! return 0; } void HeapTupleSetXmax(HeapTuple tuple, TransactionId xmax) { ! switch(tuple->t_ver) ! { ! case 4 : tuple->t_data->t_choice.t_heap.t_xmax = xmax; ! break; ! case 3 : TPH03(tuple)->t_choice.t_heap.t_xmax = xmax; ! break; ! default: elog(PANIC, "HeapTupleSetXmax is not supported."); ! } } TransactionId HeapTupleGetXmin(HeapTuple tuple) { ! switch(tuple->t_ver) ! { ! case 4 : return tuple->t_data->t_choice.t_heap.t_xmin; ! case 3 : return TPH03(tuple)->t_choice.t_heap.t_xmin; ! } ! elog(PANIC, "HeapTupleGetXmin is not supported."); ! return 0; } void HeapTupleSetXmin(HeapTuple tuple, TransactionId xmin) { ! switch(tuple->t_ver) ! { ! case 4 : tuple->t_data->t_choice.t_heap.t_xmin = xmin; ! break; ! case 3 : TPH03(tuple)->t_choice.t_heap.t_xmin = xmin; ! break; ! default: elog(PANIC, "HeapTupleSetXmin is not supported."); ! 
} } TransactionId HeapTupleGetXvac(HeapTuple tuple) { ! switch(tuple->t_ver) ! { ! case 4 : return (HeapTupleIs(tuple, HEAP_MOVED)) ? ! tuple->t_data->t_choice.t_heap.t_field3.t_xvac : ! InvalidTransactionId; ! } ! Assert(false); ! return InvalidTransactionId; } void HeapTupleSetXvac(HeapTuple tuple, TransactionId Xvac) { ! switch(tuple->t_ver) ! { ! case 4 : Assert(HeapTupleIs(tuple, HEAP_MOVED)); ! tuple->t_data->t_choice.t_heap.t_field3.t_xvac = Xvac; ! break; ! default: Assert(false); ! } } void HeapTupleSetCmax(HeapTuple tuple, CommandId cid, bool iscombo) { ! switch(tuple->t_ver) ! { ! case 4 : Assert(!(HeapTupleIs(tuple, HEAP_MOVED))); ! tuple->t_data->t_choice.t_heap.t_field3.t_cid = cid; ! if(iscombo) ! HeapTupleSet(tuple, HEAP_COMBOCID); ! else ! HeapTupleClear(tuple, HEAP_COMBOCID); ! break; ! default: Assert(false); ! } } void HeapTupleSetCmin(HeapTuple tuple, CommandId cid) { ! switch(tuple->t_ver) ! { ! case 4 : Assert(!(HeapTupleIs(tuple, HEAP_MOVED))); ! tuple->t_data->t_choice.t_heap.t_field3.t_cid = cid; ! HeapTupleClear(tuple, HEAP_COMBOCID); ! break; ! default: Assert(false); ! } } uint16 HeapTupleGetInfoMask(HeapTuple tuple) { ! uint16 infomask; ! switch(tuple->t_ver) ! { ! case 4: return ((tuple)->t_data->t_infomask); ! case 3: infomask = TPH03(tuple)->t_infomask & 0xFFB7; /* reset 3 (HASOID), 4 (UNUSED), 5 (COMBOCID) bit */ ! infomask |= ((TPH03(tuple)->t_infomask& 0x0010) << 1 ); /* copy HASOID */ ! return infomask; ! } ! elog(PANIC, "HeapTupleGetInfoMask is not supported."); } void HeapTupleSetInfoMask(HeapTuple tuple, uint16 infomask) { ! switch(tuple->t_ver) ! { ! case 4: ((tuple)->t_data->t_infomask = (infomask)); ! break; ! default: Assert(false); ! } } uint16 HeapTupleGetInfoMask2(HeapTuple tuple) { ! switch(tuple->t_ver) ! { ! case 4 :return ((tuple)->t_data->t_infomask2); ! default: return 0; ! } } bool HeapTupleIs(HeapTuple tuple, uint16 mask) *************** *** 265,271 **** void HeapTupleClear2(HeapTuple tuple, uint16 mask) { ! 
((tuple)->t_data->t_infomask2 &= ~(mask)); } CommandId HeapTupleGetRawCommandId(HeapTuple tuple) --- 375,386 ---- void HeapTupleClear2(HeapTuple tuple, uint16 mask) { ! switch(tuple->t_ver) ! { ! case 4: ((tuple)->t_data->t_infomask2 &= ~(mask)); ! break; ! } ! /* silently ignore on older versions */ } CommandId HeapTupleGetRawCommandId(HeapTuple tuple) *************** *** 275,281 **** int HeapTupleGetNatts(HeapTuple tuple) { ! return (tuple->t_data->t_infomask2 & HEAP_NATTS_MASK); } ItemPointer HeapTupleGetCtid(HeapTuple tuple) --- 390,401 ---- int HeapTupleGetNatts(HeapTuple tuple) { ! switch(tuple->t_ver) ! { ! case 4: return (tuple->t_data->t_infomask2 & HEAP_NATTS_MASK); ! case 3: return TPH03(tuple)->t_natts; ! } ! elog(PANIC, "HeapTupleGetNatts is not supported."); } ItemPointer HeapTupleGetCtid(HeapTuple tuple) *************** *** 290,306 **** uint8 HeapTupleGetHoff(HeapTuple tuple) { ! return (tuple->t_data->t_hoff); } Pointer HeapTupleGetBits(HeapTuple tuple) { ! return (Pointer)(tuple->t_data->t_bits); } Pointer HeapTupleGetData(HeapTuple tuple) { ! return (((Pointer)tuple->t_data) + tuple->t_data->t_hoff); } void HeapTupleInit(HeapTuple tuple, int32 len, Oid typid, int32 typmod, --- 410,438 ---- uint8 HeapTupleGetHoff(HeapTuple tuple) { ! switch(tuple->t_ver) ! { ! case 4: return (tuple->t_data->t_hoff); ! } ! elog(PANIC, "HeapTupleGetHoff is not supported."); } Pointer HeapTupleGetBits(HeapTuple tuple) { ! switch(tuple->t_ver) ! { ! case 4: return (Pointer)(tuple->t_data->t_bits); ! } ! elog(PANIC, "HeapTupleGetBits is not supported."); } Pointer HeapTupleGetData(HeapTuple tuple) { ! switch(tuple->t_ver) ! { ! case 4: return (((Pointer)tuple->t_data) + tuple->t_data->t_hoff); ! } ! 
elog(PANIC, "HeapTupleGetData is not supported."); } void HeapTupleInit(HeapTuple tuple, int32 len, Oid typid, int32 typmod, diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/backend/access/heap/Makefile pgsql_master_upgrade.13a47c410da7/src/backend/access/heap/Makefile *** pgsql_master_upgrade.751eb7c6969f/src/backend/access/heap/Makefile 2008-10-31 21:45:33.112796571 +0100 --- pgsql_master_upgrade.13a47c410da7/src/backend/access/heap/Makefile 2008-10-31 21:45:33.217276252 +0100 *************** *** 12,17 **** top_builddir = ../../../.. include $(top_builddir)/src/Makefile.global ! OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o htup.o include $(top_srcdir)/src/backend/common.mk --- 12,17 ---- top_builddir = ../../../.. include $(top_builddir)/src/Makefile.global ! OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o htup.o htup_03.o include $(top_srcdir)/src/backend/common.mk diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/backend/access/nbtree/nbtinsert.c pgsql_master_upgrade.13a47c410da7/src/backend/access/nbtree/nbtinsert.c *** pgsql_master_upgrade.751eb7c6969f/src/backend/access/nbtree/nbtinsert.c 2008-10-31 21:45:33.136480748 +0100 --- pgsql_master_upgrade.13a47c410da7/src/backend/access/nbtree/nbtinsert.c 2008-10-31 21:45:33.231075233 +0100 *************** *** 1203,1209 **** /* Total free space available on a btree page, after fixed overhead */ leftspace = rightspace = ! PageGetPageSize(page) - SizeOfPageHeaderData - MAXALIGN(sizeof(BTPageOpaqueData)); /* The right page will have the same high key as the old page */ --- 1203,1209 ---- /* Total free space available on a btree page, after fixed overhead */ leftspace = rightspace = ! 
PageGetPageSize(page) - SizeOfPageHeaderData04 - MAXALIGN(sizeof(BTPageOpaqueData)); /* The right page will have the same high key as the old page */ diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/backend/executor/nodeIndexscan.c pgsql_master_upgrade.13a47c410da7/src/backend/executor/nodeIndexscan.c *** pgsql_master_upgrade.751eb7c6969f/src/backend/executor/nodeIndexscan.c 2008-10-31 21:45:33.144606136 +0100 --- pgsql_master_upgrade.13a47c410da7/src/backend/executor/nodeIndexscan.c 2008-10-31 21:45:33.238969129 +0100 *************** *** 27,32 **** --- 27,33 ---- #include "access/genam.h" #include "access/nbtree.h" #include "access/relscan.h" + #include "access/htup_03.h" #include "executor/execdebug.h" #include "executor/nodeIndexscan.h" #include "optimizer/clauses.h" *************** *** 113,122 **** * Note: we pass 'false' because tuples returned by amgetnext are * pointers onto disk pages and must not be pfree()'d. */ ! ExecStoreTuple(tuple, /* tuple to store */ ! slot, /* slot to store in */ ! scandesc->xs_cbuf, /* buffer containing tuple */ ! false); /* don't pfree */ /* * If the index was lossy, we have to recheck the index quals using --- 114,138 ---- * Note: we pass 'false' because tuples returned by amgetnext are * pointers onto disk pages and must not be pfree()'d. */ ! if(tuple->t_ver == 4) ! { ! ExecStoreTuple(tuple, /* tuple to store */ ! slot, /* slot to store in */ ! scandesc->xs_cbuf, /* buffer containing tuple */ ! false); /* don't pfree */ ! } else ! if(tuple->t_ver == 3) ! { ! HeapTuple newtup; ! newtup = heap_tuple_upgrade_03(scandesc->heapRelation, tuple); ! ExecStoreTuple(newtup, /* tuple to store */ ! slot, /* slot to store in */ ! InvalidBuffer, /* buffer associated with this ! * tuple */ ! true); /* pfree this pointer */ ! } ! else ! 
elog(ERROR,"Unsupported tuple version (%i).",tuple->t_ver); /* * If the index was lossy, we have to recheck the index quals using diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/backend/executor/nodeSeqscan.c pgsql_master_upgrade.13a47c410da7/src/backend/executor/nodeSeqscan.c *** pgsql_master_upgrade.751eb7c6969f/src/backend/executor/nodeSeqscan.c 2008-10-31 21:45:33.148191833 +0100 --- pgsql_master_upgrade.13a47c410da7/src/backend/executor/nodeSeqscan.c 2008-10-31 21:45:33.242644971 +0100 *************** *** 25,30 **** --- 25,31 ---- #include "postgres.h" #include "access/heapam.h" + #include "access/htup_03.h" #include "access/relscan.h" #include "executor/execdebug.h" #include "executor/nodeSeqscan.h" *************** *** 101,111 **** * refcount will not be dropped until the tuple table slot is cleared. */ if (tuple) ! ExecStoreTuple(tuple, /* tuple to store */ ! slot, /* slot to store in */ ! scandesc->rs_cbuf, /* buffer associated with this ! * tuple */ ! false); /* don't pfree this pointer */ else ExecClearTuple(slot); --- 102,129 ---- * refcount will not be dropped until the tuple table slot is cleared. */ if (tuple) ! { ! if(tuple->t_ver == 4) ! { ! ExecStoreTuple(tuple, /* tuple to store */ ! slot, /* slot to store in */ ! scandesc->rs_cbuf, /* buffer associated with this ! * tuple */ ! false); /* don't pfree this pointer */ ! } else ! if(tuple->t_ver == 3) ! { ! HeapTuple newtup; ! newtup = heap_tuple_upgrade_03(scandesc->rs_rd, tuple); ! ExecStoreTuple(newtup, /* tuple to store */ ! slot, /* slot to store in */ ! InvalidBuffer, /* buffer associated with this ! * tuple */ ! true); /* pfree this pointer */ ! } ! else ! elog(ERROR,"Unsupported tuple version (%i).",tuple->t_ver); ! 
} else ExecClearTuple(slot); diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/backend/optimizer/util/plancat.c pgsql_master_upgrade.13a47c410da7/src/backend/optimizer/util/plancat.c *** pgsql_master_upgrade.751eb7c6969f/src/backend/optimizer/util/plancat.c 2008-10-31 21:45:33.157104094 +0100 --- pgsql_master_upgrade.13a47c410da7/src/backend/optimizer/util/plancat.c 2008-10-31 21:45:33.251184277 +0100 *************** *** 429,435 **** tuple_width += sizeof(HeapTupleHeaderData); tuple_width += sizeof(ItemPointerData); /* note: integer division is intentional here */ ! density = (BLCKSZ - SizeOfPageHeaderData) / tuple_width; } *tuples = rint(density * (double) curpages); break; --- 429,435 ---- tuple_width += sizeof(HeapTupleHeaderData); tuple_width += sizeof(ItemPointerData); /* note: integer division is intentional here */ ! density = (BLCKSZ - SizeOfPageHeaderData04) / tuple_width; } *tuples = rint(density * (double) curpages); break; diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/backend/storage/page/bufpage.c pgsql_master_upgrade.13a47c410da7/src/backend/storage/page/bufpage.c *** pgsql_master_upgrade.751eb7c6969f/src/backend/storage/page/bufpage.c 2008-10-31 21:45:33.168097249 +0100 --- pgsql_master_upgrade.13a47c410da7/src/backend/storage/page/bufpage.c 2008-10-31 21:45:33.262190876 +0100 *************** *** 19,24 **** --- 19,28 ---- #include "access/transam.h" #include "storage/bufpage.h" + + static bool PageLayoutIsValid_04(Page page); + static bool PageLayoutIsValid_03(Page page); + static bool PageIsZeroed(Page page); static Item PageGetItem(Page page, OffsetNumber offsetNumber); /* ---------------------------------------------------------------- *************** *** 28,50 **** /* * PageInit ! * Initializes the contents of a page. */ void PageInit(Page page, Size pageSize, Size specialSize) { ! PageHeader p = (PageHeader) page; specialSize = MAXALIGN(specialSize); Assert(pageSize == BLCKSZ); ! 
Assert(pageSize > specialSize + SizeOfPageHeaderData); /* Make sure all fields of page are zero, as well as unused space */ MemSet(p, 0, pageSize); /* p->pd_flags = 0; done by above MemSet */ ! p->pd_lower = SizeOfPageHeaderData; p->pd_upper = pageSize - specialSize; p->pd_special = pageSize - specialSize; PageSetPageSizeAndVersion(page, pageSize, PG_PAGE_LAYOUT_VERSION); --- 32,55 ---- /* * PageInit ! * Initializes the contents of a page. Pages may only be initialized ! * in the latest page layout version. */ void PageInit(Page page, Size pageSize, Size specialSize) { ! PageHeader_04 p = (PageHeader_04) page; specialSize = MAXALIGN(specialSize); Assert(pageSize == BLCKSZ); ! Assert(pageSize > specialSize + SizeOfPageHeaderData04); /* Make sure all fields of page are zero, as well as unused space */ MemSet(p, 0, pageSize); /* p->pd_flags = 0; done by above MemSet */ ! p->pd_lower = SizeOfPageHeaderData04; p->pd_upper = pageSize - specialSize; p->pd_special = pageSize - specialSize; PageSetPageSizeAndVersion(page, pageSize, PG_PAGE_LAYOUT_VERSION); *************** *** 53,59 **** /* ! * PageHeaderIsValid * Check that the header fields of a page appear valid. * * This is called when a page has just been read in from disk. The idea is --- 58,64 ---- /* ! * PageLayoutIsValid * Check that the header fields of a page appear valid. * * This is called when a page has just been read in from disk. 
The idea is *************** *** 73,94 **** bool PageLayoutIsValid(Page page) { char *pagebytes; int i; - PageHeader ph = (PageHeader)page; - - /* Check normal case */ - if (PageGetPageSize(page) == BLCKSZ && - PageGetPageLayoutVersion(page) == PG_PAGE_LAYOUT_VERSION && - (ph->pd_flags & ~PD_VALID_FLAG_BITS) == 0 && - ph->pd_lower >= SizeOfPageHeaderData && - ph->pd_lower <= ph->pd_upper && - ph->pd_upper <= ph->pd_special && - ph->pd_special <= BLCKSZ && - ph->pd_special == MAXALIGN(ph->pd_special)) - return true; - /* Check all-zeroes case */ pagebytes = (char *) page; for (i = 0; i < BLCKSZ; i++) { --- 78,102 ---- bool PageLayoutIsValid(Page page) { + /* Check normal case */ + switch(PageGetPageLayoutVersion(page)) + { + case 4 : return(PageLayoutIsValid_04(page)); + case 3 : return(PageLayoutIsValid_03(page)); + case 0 : return(PageIsZeroed(page)); + } + return false; + } + + /* + * Check all-zeroes case + */ + bool + PageIsZeroed(Page page) + { char *pagebytes; int i; pagebytes = (char *) page; for (i = 0; i < BLCKSZ; i++) { *************** *** 98,103 **** --- 106,141 ---- return true; } + bool PageLayoutIsValid_04(Page page) + { + PageHeader_04 phdr = (PageHeader_04)page; + if( + PageGetPageSize(page) == BLCKSZ && + (phdr->pd_flags & ~PD_VALID_FLAG_BITS) == 0 && + phdr->pd_lower >= SizeOfPageHeaderData04 && + phdr->pd_lower <= phdr->pd_upper && + phdr->pd_upper <= phdr->pd_special && + phdr->pd_special <= BLCKSZ && + phdr->pd_special == MAXALIGN(phdr->pd_special)) + return true; + return false; + } + + bool PageLayoutIsValid_03(Page page) + { + PageHeader_03 phdr = (PageHeader_03)page; + if( + PageGetPageSize(page) == BLCKSZ && + phdr->pd_lower >= SizeOfPageHeaderData03 && + phdr->pd_lower <= phdr->pd_upper && + phdr->pd_upper <= phdr->pd_special && + phdr->pd_special <= BLCKSZ && + phdr->pd_special == MAXALIGN(phdr->pd_special)) + return true; + return false; + } + + /* * PageAddItem *************** *** 127,133 **** bool overwrite, bool is_heap) { ! 
PageHeader phdr = (PageHeader) page; Size alignedSize; int lower; int upper; --- 165,171 ---- bool overwrite, bool is_heap) { ! PageHeader_04 phdr = (PageHeader_04) page; Size alignedSize; int lower; int upper; *************** *** 135,144 **** OffsetNumber limit; bool needshuffle = false; /* * Be wary about corrupted page pointers */ ! if (phdr->pd_lower < SizeOfPageHeaderData || phdr->pd_lower > phdr->pd_upper || phdr->pd_upper > phdr->pd_special || phdr->pd_special > BLCKSZ) --- 173,185 ---- OffsetNumber limit; bool needshuffle = false; + /* We allow add new items only on the new page layout - TODO indexes? */ + if( PageGetPageLayoutVersion(page) != PG_PAGE_LAYOUT_VERSION ) + elog(PANIC, "Add item on old page layout version is forbidden."); /* * Be wary about corrupted page pointers */ ! if (phdr->pd_lower < SizeOfPageHeaderData04 || phdr->pd_lower > phdr->pd_upper || phdr->pd_upper > phdr->pd_special || phdr->pd_special > BLCKSZ) *************** *** 265,281 **** { Size pageSize; Page temp; ! PageHeader thdr; pageSize = PageGetPageSize(page); temp = (Page) palloc(pageSize); ! thdr = (PageHeader) temp; /* copy old page in */ memcpy(temp, page, pageSize); /* set high, low water marks */ ! thdr->pd_lower = SizeOfPageHeaderData; thdr->pd_upper = pageSize - MAXALIGN(specialSize); /* clear out the middle */ --- 306,322 ---- { Size pageSize; Page temp; ! PageHeader_04 thdr; pageSize = PageGetPageSize(page); temp = (Page) palloc(pageSize); ! thdr = (PageHeader_04) temp; /* copy old page in */ memcpy(temp, page, pageSize); /* set high, low water marks */ ! thdr->pd_lower = SizeOfPageHeaderData04; thdr->pd_upper = pageSize - MAXALIGN(specialSize); /* clear out the middle */ *************** *** 333,341 **** void PageRepairFragmentation(Page page) { ! Offset pd_lower = ((PageHeader) page)->pd_lower; ! Offset pd_upper = ((PageHeader) page)->pd_upper; ! 
Offset pd_special = ((PageHeader) page)->pd_special; itemIdSort itemidbase, itemidptr; ItemId lp; --- 374,382 ---- void PageRepairFragmentation(Page page) { ! Offset pd_lower = PageGetLower(page); ! Offset pd_upper = PageGetUpper(page); ! Offset pd_special = PageGetSpecial(page); itemIdSort itemidbase, itemidptr; ItemId lp; *************** *** 353,359 **** * etc could cause us to clobber adjacent disk buffers, spreading the data * loss further. So, check everything. */ ! if (pd_lower < SizeOfPageHeaderData || pd_lower > pd_upper || pd_upper > pd_special || pd_special > BLCKSZ || --- 394,400 ---- * etc could cause us to clobber adjacent disk buffers, spreading the data * loss further. So, check everything. */ ! if (pd_lower < SizeOfPageHeaderData04 || pd_lower > pd_upper || pd_upper > pd_special || pd_special > BLCKSZ || *************** *** 384,390 **** if (nstorage == 0) { /* Page is completely empty, so just reset it quickly */ ! ((PageHeader) page)->pd_upper = pd_special; } else { /* nstorage != 0 */ --- 425,431 ---- if (nstorage == 0) { /* Page is completely empty, so just reset it quickly */ ! PageSetUpper(page, pd_special); } else { /* nstorage != 0 */ *************** *** 434,440 **** lp->lp_off = upper; } ! ((PageHeader) page)->pd_upper = upper; pfree(itemidbase); } --- 475,481 ---- lp->lp_off = upper; } ! PageSetUpper(page, upper); pfree(itemidbase); } *************** *** 463,470 **** * Use signed arithmetic here so that we behave sensibly if pd_lower > * pd_upper. */ ! space = (int) ((PageHeader) page)->pd_upper - ! (int) ((PageHeader) page)->pd_lower; if (space < (int) sizeof(ItemIdData)) return 0; --- 504,510 ---- * Use signed arithmetic here so that we behave sensibly if pd_lower > * pd_upper. */ ! space = PageGetExactFreeSpace(page); if (space < (int) sizeof(ItemIdData)) return 0; *************** *** 487,494 **** * Use signed arithmetic here so that we behave sensibly if pd_lower > * pd_upper. */ ! space = (int) ((PageHeader) page)->pd_upper - ! 
(int) ((PageHeader) page)->pd_lower; if (space < 0) return 0; --- 527,533 ---- * Use signed arithmetic here so that we behave sensibly if pd_lower > * pd_upper. */ ! space = (int)PageGetUpper(page) - (int)PageGetLower(page); if (space < 0) return 0; *************** *** 575,581 **** void PageIndexTupleDelete(Page page, OffsetNumber offnum) { ! PageHeader phdr = (PageHeader) page; char *addr; ItemId tup; Size size; --- 614,620 ---- void PageIndexTupleDelete(Page page, OffsetNumber offnum) { ! PageHeader_04 phdr = (PageHeader_04) page; /* TODO PGU */ char *addr; ItemId tup; Size size; *************** *** 587,593 **** /* * As with PageRepairFragmentation, paranoia seems justified. */ ! if (phdr->pd_lower < SizeOfPageHeaderData || phdr->pd_lower > phdr->pd_upper || phdr->pd_upper > phdr->pd_special || phdr->pd_special > BLCKSZ) --- 626,632 ---- /* * As with PageRepairFragmentation, paranoia seems justified. */ ! if (phdr->pd_lower < SizeOfPageHeaderData04 || phdr->pd_lower > phdr->pd_upper || phdr->pd_upper > phdr->pd_special || phdr->pd_special > BLCKSZ) *************** *** 681,687 **** void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems) { ! PageHeader phdr = (PageHeader) page; Offset pd_lower = phdr->pd_lower; Offset pd_upper = phdr->pd_upper; Offset pd_special = phdr->pd_special; --- 720,726 ---- void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems) { ! PageHeader_04 phdr = (PageHeader_04) page; /* TODO PGU */ Offset pd_lower = phdr->pd_lower; Offset pd_upper = phdr->pd_upper; Offset pd_special = phdr->pd_special; *************** *** 716,722 **** /* * As with PageRepairFragmentation, paranoia seems justified. */ ! if (pd_lower < SizeOfPageHeaderData || pd_lower > pd_upper || pd_upper > pd_special || pd_special > BLCKSZ || --- 755,761 ---- /* * As with PageRepairFragmentation, paranoia seems justified. */ ! 
if (pd_lower < SizeOfPageHeaderData04 || pd_lower > pd_upper || pd_upper > pd_special || pd_special > BLCKSZ || *************** *** 796,815 **** lp->lp_off = upper; } ! phdr->pd_lower = SizeOfPageHeaderData + nused * sizeof(ItemIdData); phdr->pd_upper = upper; pfree(itemidbase); } /* * PageGetItemId * Returns an item identifier of a page. */ ! static ItemId PageGetItemId(Page page, OffsetNumber offsetNumber) { AssertMacro(offsetNumber > 0); ! return (ItemId) (& ((PageHeader) page)->pd_linp[(offsetNumber) - 1]) ; } /* --- 835,861 ---- lp->lp_off = upper; } ! phdr->pd_lower = SizeOfPageHeaderData04 + nused * sizeof(ItemIdData); phdr->pd_upper = upper; pfree(itemidbase); } + + /* * PageGetItemId * Returns an item identifier of a page. */ ! ItemId PageGetItemId(Page page, OffsetNumber offsetNumber) { AssertMacro(offsetNumber > 0); ! switch ( PageGetPageLayoutVersion(page) ) ! { ! case 4 : return (ItemId) (& ((PageHeader_04) page)->pd_linp[(offsetNumber) - 1]) ; ! case 3 : return (ItemId) (& ((PageHeader_03) page)->pd_linp[(offsetNumber) - 1]) ; ! } ! elog(PANIC, "Unsupported page layout in function PageGetItemId."); } /* *************** *** 824,836 **** Item PageGetItem(Page page, OffsetNumber offsetNumber) { AssertMacro(PageIsValid(page)); ! return (Item) (page + ((PageHeader) page)->pd_linp[(offsetNumber) - 1].lp_off); } ItemLength PageItemGetSize(Page page, OffsetNumber offsetNumber) { ! return (ItemLength) ! ((PageHeader) page)->pd_linp[(offsetNumber) - 1].lp_len; } IndexTuple PageGetIndexTuple(Page page, OffsetNumber offsetNumber) --- 870,896 ---- Item PageGetItem(Page page, OffsetNumber offsetNumber) { AssertMacro(PageIsValid(page)); ! // AssertMacro(ItemIdHasStorage(itemId)); ! switch ( PageGetPageLayoutVersion(page) ) ! { ! case 4 : return (Item) (page + ! ((PageHeader_04) page)->pd_linp[(offsetNumber) - 1].lp_off); ! case 3 : return (Item) (page + ! ((PageHeader_03) page)->pd_linp[(offsetNumber) - 1].lp_off); ! } ! 
elog(PANIC, "Unsupported page layout in function PageGetItem."); } ItemLength PageItemGetSize(Page page, OffsetNumber offsetNumber) { ! switch ( PageGetPageLayoutVersion(page) ) ! { ! case 4 : return (ItemLength) ! ((PageHeader_04) page)->pd_linp[(offsetNumber) - 1].lp_len; ! case 3 : return (ItemLength) ! ((PageHeader_03) page)->pd_linp[(offsetNumber) - 1].lp_len; ! } ! elog(PANIC, "Unsupported page layout in function PageItemGetSize."); } IndexTuple PageGetIndexTuple(Page page, OffsetNumber offsetNumber) *************** *** 848,889 **** bool PageItemIsDead(Page page, OffsetNumber offsetNumber) { ! return ItemIdIsDead(PageGetItemId(page, offsetNumber)); } void PageItemMarkDead(Page page, OffsetNumber offsetNumber) { ! ItemIdMarkDead(PageGetItemId(page, offsetNumber)); } bool PageItemIsNormal(Page page, OffsetNumber offsetNumber) { ! return ItemIdIsNormal(PageGetItemId(page, offsetNumber)); } bool PageItemIsUsed(Page page, OffsetNumber offsetNumber) { ! return ItemIdIsUsed(PageGetItemId(page, offsetNumber)); } void PageItemSetUnused(Page page, OffsetNumber offsetNumber) { ! ItemIdSetUnused(PageGetItemId(page, offsetNumber)); } bool PageItemIsRedirected(Page page, OffsetNumber offsetNumber) { ! return ItemIdIsRedirected(PageGetItemId(page, offsetNumber)); } OffsetNumber PageItemGetRedirect(Page page, OffsetNumber offsetNumber) { ! return ItemIdGetRedirect(PageGetItemId(page, offsetNumber)); } void PageItemSetRedirect(Page page, OffsetNumber fromoff, OffsetNumber tooff) { ! ItemIdSetRedirect( PageGetItemId(page, fromoff), tooff); } void PageItemMove(Page page, OffsetNumber dstoff, OffsetNumber srcoff) --- 908,949 ---- bool PageItemIsDead(Page page, OffsetNumber offsetNumber) { ! return ItemIdIsDead(PageGetItemId(page, offsetNumber)); // TODO multi version } void PageItemMarkDead(Page page, OffsetNumber offsetNumber) { ! ItemIdMarkDead(PageGetItemId(page, offsetNumber)); // TODO multi version } bool PageItemIsNormal(Page page, OffsetNumber offsetNumber) { ! 
return ItemIdIsNormal(PageGetItemId(page, offsetNumber)); // TODO multi version } bool PageItemIsUsed(Page page, OffsetNumber offsetNumber) { ! return ItemIdIsUsed(PageGetItemId(page, offsetNumber)); // TODO multi version } void PageItemSetUnused(Page page, OffsetNumber offsetNumber) { ! ItemIdSetUnused(PageGetItemId(page, offsetNumber)); // TODO multi version } bool PageItemIsRedirected(Page page, OffsetNumber offsetNumber) { ! return ItemIdIsRedirected(PageGetItemId(page, offsetNumber)); // TODO multi version } OffsetNumber PageItemGetRedirect(Page page, OffsetNumber offsetNumber) { ! return ItemIdGetRedirect(PageGetItemId(page, offsetNumber)); // TODO multi version } void PageItemSetRedirect(Page page, OffsetNumber fromoff, OffsetNumber tooff) { ! ItemIdSetRedirect(PageGetItemId(page, fromoff), tooff); // TODO multi version } void PageItemMove(Page page, OffsetNumber dstoff, OffsetNumber srcoff) *************** *** 900,906 **** */ Pointer PageGetContents(Page page) { ! return (Pointer) (&((PageHeader) (page))->pd_linp[0]); } /* ---------------- --- 960,971 ---- */ Pointer PageGetContents(Page page) { ! switch ( PageGetPageLayoutVersion(page) ) ! { ! case 4 : return (Pointer) (&((PageHeader_04) (page))->pd_linp[0]); ! case 3 : return (Pointer) (&((PageHeader_03) (page))->pd_linp[0]); ! } ! elog(PANIC, "Unsupported page layout in function PageGetContents."); } /* ---------------- *************** *** 913,924 **** */ Size PageGetSpecialSize(Page page) { ! return (Size) PageGetPageSize(page) - ((PageHeader)(page))->pd_special; } Size PageGetDataSize(Page page) { ! return (Size) ((PageHeader)(page))->pd_special - ((PageHeader)(page))->pd_upper; } /* --- 978,1000 ---- */ Size PageGetSpecialSize(Page page) { ! switch ( PageGetPageLayoutVersion(page) ) ! { ! case 4 : return (Size) PageGetPageSize(page) - ((PageHeader_04)(page))->pd_special; ! case 3 : return (Size) PageGetPageSize(page) - ((PageHeader_03)(page))->pd_special; ! ! } ! 
elog(PANIC, "Unsupported page layout in function PageGetSpecialSize."); } Size PageGetDataSize(Page page) { ! switch ( PageGetPageLayoutVersion(page) ) ! { ! case 4 : return (Size) ((PageHeader_04)(page))->pd_special - ((PageHeader_04)(page))->pd_upper; ! case 3 : return (Size) ((PageHeader_03)(page))->pd_special - ((PageHeader_03)(page))->pd_upper; ! } ! elog(PANIC, "Unsupported page layout in function PageGetDataSize."); } /* *************** *** 928,934 **** Pointer PageGetSpecialPointer(Page page) { AssertMacro(PageIsValid(page)); ! return page + ((PageHeader)(page))->pd_special; } /* --- 1004,1015 ---- Pointer PageGetSpecialPointer(Page page) { AssertMacro(PageIsValid(page)); ! switch ( PageGetPageLayoutVersion(page) ) ! { ! case 4 : return page + ((PageHeader_04)(page))->pd_special; ! case 3 : return page + ((PageHeader_03)(page))->pd_special; ! } ! elog(PANIC, "Unsupported page layout in function PageGetSpecialPointer."); } /* *************** *** 938,970 **** Pointer PageGetUpperPointer(Page page) { AssertMacro(PageIsValid(page)); ! return page + ((PageHeader)(page))->pd_upper; } void PageSetLower(Page page, LocationIndex lower) { ! ((PageHeader) page)->pd_lower = lower; } void PageSetUpper(Page page, LocationIndex upper) { ! ((PageHeader) page)->pd_upper = upper; } void PageReserveLinp(Page page) { AssertMacro(PageIsValid(page)); ! ((PageHeader) page)->pd_lower += sizeof(ItemIdData); ! AssertMacro(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper ); } void PageReleaseLinp(Page page) { AssertMacro(PageIsValid(page)); ! ((PageHeader) page)->pd_lower -= sizeof(ItemIdData); ! AssertMacro(((PageHeader) page)->pd_lower >= SizeOfPageHeaderData); } /* * PageGetMaxOffsetNumber * Returns the maximum offset number used by the given page. --- 1019,1087 ---- Pointer PageGetUpperPointer(Page page) { AssertMacro(PageIsValid(page)); ! switch ( PageGetPageLayoutVersion(page) ) ! { ! case 4 : return page + ((PageHeader_04)(page))->pd_upper; ! 
case 3 : return page + ((PageHeader_03)(page))->pd_upper; ! } ! elog(PANIC, "Unsupported page layout in function PageGetUpperPointer."); } void PageSetLower(Page page, LocationIndex lower) { ! switch ( PageGetPageLayoutVersion(page) ) ! { ! case 4 : ((PageHeader_04) page)->pd_lower = lower; ! break; ! case 3 : ((PageHeader_03) page)->pd_lower = lower; ! break; ! default: elog(PANIC, "Unsupported page layout in function PageSetLower."); ! } } void PageSetUpper(Page page, LocationIndex upper) { ! switch ( PageGetPageLayoutVersion(page) ) ! { ! case 4 : ((PageHeader_04) page)->pd_upper = upper; ! break; ! case 3 : ((PageHeader_03) page)->pd_upper = upper; ! break; ! default: elog(PANIC, "Unsupported page layout in function PageSetUpper."); ! } } void PageReserveLinp(Page page) { AssertMacro(PageIsValid(page)); ! switch ( PageGetPageLayoutVersion(page) ) ! { ! case 4 : ((PageHeader_04) page)->pd_lower += sizeof(ItemIdData); ! AssertMacro(((PageHeader_04) page)->pd_lower <= ((PageHeader_04) page)->pd_upper ); ! break; ! case 3 : ((PageHeader_03) page)->pd_lower += sizeof(ItemIdData); ! AssertMacro(((PageHeader_03) page)->pd_lower <= ((PageHeader_03) page)->pd_upper ); ! break; ! default: elog(PANIC, "Unsupported page layout in function PageReserveLinp."); ! } } void PageReleaseLinp(Page page) { AssertMacro(PageIsValid(page)); ! switch ( PageGetPageLayoutVersion(page) ) ! { ! case 4 : ((PageHeader_04) page)->pd_lower -= sizeof(ItemIdData); ! AssertMacro(((PageHeader_04) page)->pd_lower >= SizeOfPageHeaderData04); ! break; ! case 3 : ((PageHeader_03) page)->pd_lower -= sizeof(ItemIdData); ! AssertMacro(((PageHeader_03) page)->pd_lower >= SizeOfPageHeaderData03); ! break; ! default: elog(PANIC, "Unsupported page layout in function PageReleaseLinp."); ! } } + /* * PageGetMaxOffsetNumber * Returns the maximum offset number used by the given page. *************** *** 977,985 **** */ int PageGetMaxOffsetNumber(Page page) { ! PageHeader header = (PageHeader) (page); ! 
return header->pd_lower <= SizeOfPageHeaderData ? 0 : ! (header->pd_lower - SizeOfPageHeaderData) / sizeof(ItemIdData); } /* --- 1094,1115 ---- */ int PageGetMaxOffsetNumber(Page page) { ! switch ( PageGetPageLayoutVersion(page) ) ! { ! case 4 : { ! PageHeader_04 header = (PageHeader_04) (page); ! return header->pd_lower <= SizeOfPageHeaderData04 ? 0 : ! (header->pd_lower - SizeOfPageHeaderData04) / sizeof(ItemIdData); ! } ! case 3 : { ! PageHeader_03 header = (PageHeader_03) (page); ! return header->pd_lower <= SizeOfPageHeaderData03 ? 0 : ! (header->pd_lower - SizeOfPageHeaderData03) / sizeof(ItemIdData); ! } ! ! } ! elog(PANIC, "Unsupported page layout in function PageGetMaxOffsetNumber. (%i)", PageGetPageLayoutVersion(page) ); ! return 0; } /* *************** *** 987,1089 **** */ XLogRecPtr PageGetLSN(Page page) { ! return ((PageHeader) page)->pd_lsn; } LocationIndex PageGetLower(Page page) { ! return ((PageHeader) page)->pd_lower; } LocationIndex PageGetUpper(Page page) { ! return ((PageHeader) page)->pd_upper; } LocationIndex PageGetSpecial(Page page) { ! return ((PageHeader) page)->pd_special; } void PageSetLSN(Page page, XLogRecPtr lsn) { ! ((PageHeader) page)->pd_lsn = lsn; } /* NOTE: only the 16 least significant bits are stored */ TimeLineID PageGetTLI(Page page) { ! return ((PageHeader) (page))->pd_tli; } void PageSetTLI(Page page, TimeLineID tli) { ! ((PageHeader) (page))->pd_tli = (uint16) (tli); } bool PageHasFreeLinePointers(Page page) { ! return ((PageHeader) (page))->pd_flags & PD_HAS_FREE_LINES; } void PageSetHasFreeLinePointers(Page page) { ! ((PageHeader) (page))->pd_flags |= PD_HAS_FREE_LINES; } void PageClearHasFreeLinePointers(Page page) { ! ((PageHeader) (page))->pd_flags &= ~PD_HAS_FREE_LINES; } bool PageIsFull(Page page) { ! return ((PageHeader) (page))->pd_flags & PD_PAGE_FULL; } void PageSetFull(Page page) { ! ((PageHeader) (page))->pd_flags |= PD_PAGE_FULL; } void PageClearFull(Page page) { ! 
((PageHeader) (page))->pd_flags &= ~PD_PAGE_FULL; } bool PageIsPrunable(Page page, TransactionId oldestxmin) { AssertMacro(TransactionIdIsNormal(oldestxmin)); ! return ( ! TransactionIdIsValid(((PageHeader) page)->pd_prune_xid) && ! TransactionIdPrecedes(((PageHeader) page)->pd_prune_xid, oldestxmin) ); } TransactionId PageGetPrunable(Page page) { ! return ((PageHeader) page)->pd_prune_xid; } void PageSetPrunable(Page page, TransactionId xid) { Assert(TransactionIdIsNormal(xid)); ! if (!TransactionIdIsValid(((PageHeader) (page))->pd_prune_xid) || ! TransactionIdPrecedes(xid, ((PageHeader) (page))->pd_prune_xid)) ! ((PageHeader) (page))->pd_prune_xid = (xid); } void PageClearPrunable(Page page) { ! ((PageHeader) page)->pd_prune_xid = InvalidTransactionId; } bool PageIsComprimable(Page page) { ! PageHeader ph = (PageHeader) page; ! return(ph->pd_lower >= SizeOfPageHeaderData && ! ph->pd_upper > ph->pd_lower && ! ph->pd_upper <= BLCKSZ); } /* --- 1117,1335 ---- */ XLogRecPtr PageGetLSN(Page page) { ! switch ( PageGetPageLayoutVersion(page) ) ! { ! case 4 : return ((PageHeader_04) page)->pd_lsn; ! case 3 : return ((PageHeader_03) page)->pd_lsn; ! } ! elog(PANIC, "Unsupported page layout in function PageGetLSN."); } LocationIndex PageGetLower(Page page) { ! switch ( PageGetPageLayoutVersion(page) ) ! { ! case 4 : return ((PageHeader_04) page)->pd_lower; ! case 3 : return ((PageHeader_03) page)->pd_lower; ! case 0 : return 0; ! } ! elog(PANIC, "Unsupported page layout in function PageGetLower."); ! return 0; } LocationIndex PageGetUpper(Page page) { ! switch ( PageGetPageLayoutVersion(page) ) ! { ! case 4 : return ((PageHeader_04) page)->pd_upper; ! case 3 : return ((PageHeader_03) page)->pd_upper; ! case 0 : return 0; ! } ! elog(PANIC, "Unsupported page layout in function PageGetUpper."); ! return 0; } LocationIndex PageGetSpecial(Page page) { ! switch ( PageGetPageLayoutVersion(page) ) ! { ! case 4 : return ((PageHeader_04) page)->pd_special; ! 
case 3 : return ((PageHeader_03) page)->pd_special; ! } ! elog(PANIC, "Unsupported page layout in function PageGetSpecial."); ! return 0; } + void PageSetLSN(Page page, XLogRecPtr lsn) { ! switch ( PageGetPageLayoutVersion(page) ) ! { ! case 4 : ((PageHeader_04) page)->pd_lsn = lsn; ! break; ! case 3 : ((PageHeader_03) page)->pd_lsn = lsn; ! break; ! default: elog(PANIC, "Unsupported page layout in function PageSetLSN."); ! } } /* NOTE: only the 16 least significant bits are stored */ TimeLineID PageGetTLI(Page page) { ! switch ( PageGetPageLayoutVersion(page) ) ! { ! case 4 : return ((PageHeader_04) (page))->pd_tli; ! case 3 : return ((PageHeader_03) (page))->pd_tli; ! } ! elog(PANIC, "Unsupported page layout in function PageGetTLI."); } void PageSetTLI(Page page, TimeLineID tli) { ! switch ( PageGetPageLayoutVersion(page) ) ! { ! case 4 : ((PageHeader_04) (page))->pd_tli = (uint16) (tli); ! break; ! case 3 : ((PageHeader_03) (page))->pd_tli = tli; ! break; ! default: elog(PANIC, "Unsupported page layout in function PageSetTLI."); ! } } bool PageHasFreeLinePointers(Page page) { ! switch ( PageGetPageLayoutVersion(page) ) ! { ! case 4 : return ((PageHeader_04) (page))->pd_flags & PD_HAS_FREE_LINES; ! default: return false; ! } } void PageSetHasFreeLinePointers(Page page) { ! switch ( PageGetPageLayoutVersion(page) ) ! { ! case 4 : ((PageHeader_04) (page))->pd_flags |= PD_HAS_FREE_LINES; ! break; ! default: elog(PANIC, "HasFreeLinePointers is not supported on page layout version %i", ! PageGetPageLayoutVersion(page)); ! } } void PageClearHasFreeLinePointers(Page page) { ! switch ( PageGetPageLayoutVersion(page) ) ! { ! case 4 : ((PageHeader_04) (page))->pd_flags &= ~PD_HAS_FREE_LINES; ! break; ! default: elog(PANIC, "HasFreeLinePointers is not supported on page layout version %i", ! PageGetPageLayoutVersion(page)); ! } } bool PageIsFull(Page page) { ! switch ( PageGetPageLayoutVersion(page) ) ! { ! case 4 : return ((PageHeader_04) (page))->pd_flags & PD_PAGE_FULL; ! 
default : return true; ! } ! return true; /* no space on old data page */ } void PageSetFull(Page page) { ! switch ( PageGetPageLayoutVersion(page) ) ! { ! case 4 : ((PageHeader_04) (page))->pd_flags |= PD_PAGE_FULL; ! break; ! default: elog(PANIC, "PageSetFull is not supported on page layout version %i", ! PageGetPageLayoutVersion(page)); ! } } void PageClearFull(Page page) { ! switch( PageGetPageLayoutVersion(page) ) ! { ! case 4 : ((PageHeader_04) (page))->pd_flags &= ~PD_PAGE_FULL; ! break; ! default: elog(PANIC, "PageClearFull is not supported on page layout version %i", ! PageGetPageLayoutVersion(page)); ! } } bool PageIsPrunable(Page page, TransactionId oldestxmin) { AssertMacro(TransactionIdIsNormal(oldestxmin)); ! switch( PageGetPageLayoutVersion(page) ) ! { ! case 4 : return ( ! TransactionIdIsValid(((PageHeader_04) page)->pd_prune_xid) && ! TransactionIdPrecedes(((PageHeader_04) page)->pd_prune_xid, oldestxmin) ); ! case 3 : return false; ! } ! elog(PANIC, "PageIsPrunable is not supported on page layout version %i", ! PageGetPageLayoutVersion(page)); } TransactionId PageGetPrunable(Page page) { ! switch( PageGetPageLayoutVersion(page) ) ! { ! case 4 : return ((PageHeader_04) page)->pd_prune_xid; ! case 3 : return 0; ! } ! elog(PANIC, "PageGetPrunable is not supported on page layout version %i", ! PageGetPageLayoutVersion(page)); } void PageSetPrunable(Page page, TransactionId xid) { Assert(TransactionIdIsNormal(xid)); ! switch ( PageGetPageLayoutVersion(page) ) ! { ! case 4 : if (!TransactionIdIsValid(((PageHeader_04) (page))->pd_prune_xid) || ! TransactionIdPrecedes(xid, ((PageHeader_04) (page))->pd_prune_xid)) ! ((PageHeader_04) (page))->pd_prune_xid = (xid); ! break; ! default: elog(PANIC, "PageSetPrunable is not supported on page layout version %i", ! PageGetPageLayoutVersion(page)); ! } } void PageClearPrunable(Page page) { ! switch ( PageGetPageLayoutVersion(page) ) ! { ! case 4 : ((PageHeader_04) page)->pd_prune_xid = InvalidTransactionId; ! 
break; ! // default: elog(PANIC, "PageClearPrunable is not supported on page layout version %i", ! // PageGetPageLayoutVersion(page)); ! // Silently ignore this request ! } } bool PageIsComprimable(Page page) { ! switch ( PageGetPageLayoutVersion(page) ) ! { ! case 4 : { ! PageHeader_04 ph = (PageHeader_04) page; ! return(ph->pd_lower >= SizeOfPageHeaderData04 && ! ph->pd_upper > ph->pd_lower && ! ph->pd_upper <= BLCKSZ); ! } ! case 3 : { ! PageHeader_03 ph = (PageHeader_03) page; ! return(ph->pd_lower >= SizeOfPageHeaderData03 && ! ph->pd_upper > ph->pd_lower && ! ph->pd_upper <= BLCKSZ); ! } ! default: elog(PANIC, "PageIsComprimable is not supported on page layout version %i", ! PageGetPageLayoutVersion(page)); ! } } /* *************** *** 1092,1097 **** */ bool PageIsEmpty(Page page) { ! return (((PageHeader) (page))->pd_lower <= SizeOfPageHeaderData); } --- 1338,1349 ---- */ bool PageIsEmpty(Page page) { ! switch ( PageGetPageLayoutVersion(page) ) ! { ! case 4 : return (((PageHeader_04) (page))->pd_lower <= SizeOfPageHeaderData04); ! case 3 : return (((PageHeader_03) (page))->pd_lower <= SizeOfPageHeaderData03); ! default: elog(PANIC, "PageIsEmpty is not supported on page layout version %i", ! PageGetPageLayoutVersion(page)); ! } } diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/include/access/gin.h pgsql_master_upgrade.13a47c410da7/src/include/access/gin.h *** pgsql_master_upgrade.751eb7c6969f/src/include/access/gin.h 2008-10-31 21:45:33.172329319 +0100 --- pgsql_master_upgrade.13a47c410da7/src/include/access/gin.h 2008-10-31 21:45:33.265898951 +0100 *************** *** 115,121 **** #define GinGetPosting(itup) ( (ItemPointer)(( ((char*)(itup)) + SHORTALIGN(GinGetOrigSizePosting(itup)) )) ) #define GinMaxItemSize \ ! 
((BLCKSZ - SizeOfPageHeaderData - \ MAXALIGN(sizeof(GinPageOpaqueData))) / 3 - sizeof(ItemIdData)) --- 115,121 ---- #define GinGetPosting(itup) ( (ItemPointer)(( ((char*)(itup)) + SHORTALIGN(GinGetOrigSizePosting(itup)) )) ) #define GinMaxItemSize \ ! ((BLCKSZ - SizeOfPageHeaderData04 - \ MAXALIGN(sizeof(GinPageOpaqueData))) / 3 - sizeof(ItemIdData)) *************** *** 131,137 **** (GinDataPageGetData(page) + ((i)-1) * GinSizeOfItem(page)) #define GinDataPageGetFreeSpace(page) \ ! (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \ - MAXALIGN(sizeof(ItemPointerData)) \ - GinPageGetOpaque(page)->maxoff * GinSizeOfItem(page) \ - MAXALIGN(sizeof(GinPageOpaqueData))) --- 131,137 ---- (GinDataPageGetData(page) + ((i)-1) * GinSizeOfItem(page)) #define GinDataPageGetFreeSpace(page) \ ! (BLCKSZ - MAXALIGN(SizeOfPageHeaderData04) \ - MAXALIGN(sizeof(ItemPointerData)) \ - GinPageGetOpaque(page)->maxoff * GinSizeOfItem(page) \ - MAXALIGN(sizeof(GinPageOpaqueData))) diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/include/access/gist_private.h pgsql_master_upgrade.13a47c410da7/src/include/access/gist_private.h *** pgsql_master_upgrade.751eb7c6969f/src/include/access/gist_private.h 2008-10-31 21:45:33.176541012 +0100 --- pgsql_master_upgrade.13a47c410da7/src/include/access/gist_private.h 2008-10-31 21:45:33.270105878 +0100 *************** *** 272,278 **** /* gistutil.c */ #define GiSTPageSize \ ! ( BLCKSZ - SizeOfPageHeaderData - MAXALIGN(sizeof(GISTPageOpaqueData)) ) #define GIST_MIN_FILLFACTOR 10 #define GIST_DEFAULT_FILLFACTOR 90 --- 272,278 ---- /* gistutil.c */ #define GiSTPageSize \ ! 
( BLCKSZ - SizeOfPageHeaderData04 - MAXALIGN(sizeof(GISTPageOpaqueData)) ) #define GIST_MIN_FILLFACTOR 10 #define GIST_DEFAULT_FILLFACTOR 90 diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/include/access/hash.h pgsql_master_upgrade.13a47c410da7/src/include/access/hash.h *** pgsql_master_upgrade.751eb7c6969f/src/include/access/hash.h 2008-10-31 21:45:33.180773015 +0100 --- pgsql_master_upgrade.13a47c410da7/src/include/access/hash.h 2008-10-31 21:45:33.274260668 +0100 *************** *** 168,174 **** */ #define HashMaxItemSize(page) \ MAXALIGN_DOWN(PageGetPageSize(page) - \ ! SizeOfPageHeaderData - \ sizeof(ItemIdData) - \ MAXALIGN(sizeof(HashPageOpaqueData))) --- 168,174 ---- */ #define HashMaxItemSize(page) \ MAXALIGN_DOWN(PageGetPageSize(page) - \ ! SizeOfPageHeaderData04 - \ sizeof(ItemIdData) - \ MAXALIGN(sizeof(HashPageOpaqueData))) *************** *** 198,204 **** #define HashGetMaxBitmapSize(page) \ (PageGetPageSize((Page) page) - \ ! (MAXALIGN(SizeOfPageHeaderData) + MAXALIGN(sizeof(HashPageOpaqueData)))) #define HashPageGetMeta(page) \ ((HashMetaPage) PageGetContents(page)) --- 198,204 ---- #define HashGetMaxBitmapSize(page) \ (PageGetPageSize((Page) page) - \ ! (MAXALIGN(SizeOfPageHeaderData04) + MAXALIGN(sizeof(HashPageOpaqueData)))) #define HashPageGetMeta(page) \ ((HashMetaPage) PageGetContents(page)) diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/include/access/htup_03.h pgsql_master_upgrade.13a47c410da7/src/include/access/htup_03.h *** pgsql_master_upgrade.751eb7c6969f/src/include/access/htup_03.h 1970-01-01 01:00:00.000000000 +0100 --- pgsql_master_upgrade.13a47c410da7/src/include/access/htup_03.h 2008-10-31 21:45:33.282414095 +0100 *************** *** 0 **** --- 1,311 ---- + /*------------------------------------------------------------------------- + * + * htup.h + * POSTGRES heap tuple definitions. 
+ * + * + * Portions Copyright (c) 1996-2006, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * $PostgreSQL: pgsql/src/include/access/htup.h,v 1.87 2006/11/05 22:42:10 tgl Exp $ + * + *------------------------------------------------------------------------- + */ + #ifndef HTUP_03_H + #define HTUP_03_H + + #include "access/htup.h" + #include "storage/itemptr.h" + #include "storage/relfilenode.h" + #include "utils/rel.h" + + /* + * Heap tuple header. To avoid wasting space, the fields should be + * layed out in such a way to avoid structure padding. + * + * Datums of composite types (row types) share the same general structure + * as on-disk tuples, so that the same routines can be used to build and + * examine them. However the requirements are slightly different: a Datum + * does not need any transaction visibility information, and it does need + * a length word and some embedded type information. We can achieve this + * by overlaying the xmin/cmin/xmax/cmax/xvac fields of a heap tuple + * with the fields needed in the Datum case. Typically, all tuples built + * in-memory will be initialized with the Datum fields; but when a tuple is + * about to be inserted in a table, the transaction fields will be filled, + * overwriting the datum fields. + * + * The overall structure of a heap tuple looks like: + * fixed fields (HeapTupleHeaderData struct) + * nulls bitmap (if HEAP_HASNULL is set in t_infomask) + * alignment padding (as needed to make user data MAXALIGN'd) + * object ID (if HEAP_HASOID is set in t_infomask) + * user data fields + * + * We store five "virtual" fields Xmin, Cmin, Xmax, Cmax, and Xvac in four + * physical fields. Xmin, Cmin and Xmax are always really stored, but + * Cmax and Xvac share a field. This works because we know that there are + * only a limited number of states that a tuple can be in, and that Cmax + * is only interesting for the lifetime of the deleting transaction. 
+ * This assumes that VACUUM FULL never tries to move a tuple whose Cmax + * is still interesting (ie, delete-in-progress). + * + * Note that in 7.3 and 7.4 a similar idea was applied to Xmax and Cmin. + * However, with the advent of subtransactions, a tuple may need both Xmax + * and Cmin simultaneously, so this is no longer possible. + * + * A word about t_ctid: whenever a new tuple is stored on disk, its t_ctid + * is initialized with its own TID (location). If the tuple is ever updated, + * its t_ctid is changed to point to the replacement version of the tuple. + * Thus, a tuple is the latest version of its row iff XMAX is invalid or + * t_ctid points to itself (in which case, if XMAX is valid, the tuple is + * either locked or deleted). One can follow the chain of t_ctid links + * to find the newest version of the row. Beware however that VACUUM might + * erase the pointed-to (newer) tuple before erasing the pointing (older) + * tuple. Hence, when following a t_ctid link, it is necessary to check + * to see if the referenced slot is empty or contains an unrelated tuple. + * Check that the referenced tuple has XMIN equal to the referencing tuple's + * XMAX to verify that it is actually the descendant version and not an + * unrelated tuple stored into a slot recently freed by VACUUM. If either + * check fails, one may assume that there is no live descendant version. + * + * Following the fixed header fields, the nulls bitmap is stored (beginning + * at t_bits). The bitmap is *not* stored if t_infomask shows that there + * are no nulls in the tuple. If an OID field is present (as indicated by + * t_infomask), then it is stored just before the user data, which begins at + * the offset shown by t_hoff. Note that t_hoff must be a multiple of + * MAXALIGN. 
+ */ + + typedef struct HeapTupleFields_03 + { + TransactionId t_xmin; /* inserting xact ID */ + CommandId t_cmin; /* inserting command ID */ + TransactionId t_xmax; /* deleting or locking xact ID */ + + union + { + CommandId t_cmax; /* deleting or locking command ID */ + TransactionId t_xvac; /* VACUUM FULL xact ID */ + } t_field4; + } HeapTupleFields_03; + + typedef struct DatumTupleFields_03 + { + int32 datum_len; /* required to be a varlena type */ + + int32 datum_typmod; /* -1, or identifier of a record type */ + + Oid datum_typeid; /* composite type OID, or RECORDOID */ + + /* + * Note: field ordering is chosen with thought that Oid might someday + * widen to 64 bits. + */ + } DatumTupleFields_03; + + typedef struct HeapTupleHeaderData_03 + { + union + { + HeapTupleFields_03 t_heap; + DatumTupleFields_03 t_datum; + } t_choice; + + ItemPointerData t_ctid; /* current TID of this or newer tuple */ + + /* Fields below here must match MinimalTupleData! */ + + int16 t_natts; /* number of attributes */ + + uint16 t_infomask; /* various flag bits, see below */ + + uint8 t_hoff; /* sizeof header incl. 
bitmap, padding */ + + /* ^ - 27 bytes - ^ */ + + bits8 t_bits[1]; /* bitmap of NULLs -- VARIABLE LENGTH */ + + /* MORE DATA FOLLOWS AT END OF STRUCT */ + } HeapTupleHeaderData_03; + + typedef HeapTupleHeaderData_03 *HeapTupleHeader_03; + + /* + * information stored in t_infomask: + */ + #define HEAP03_HASNULL 0x0001 /* has null attribute(s) */ + #define HEAP03_HASVARWIDTH 0x0002 /* has variable-width attribute(s) */ + #define HEAP03_HASEXTERNAL 0x0004 /* has external stored attribute(s) */ + #define HEAP03_HASCOMPRESSED 0x0008 /* has compressed stored attribute(s) */ + #define HEAP03_HASEXTENDED 0x000C /* the two above combined */ + #define HEAP03_HASOID 0x0010 /* has an object-id field */ + /* 0x0020 is presently unused */ + #define HEAP03_XMAX_EXCL_LOCK 0x0040 /* xmax is exclusive locker */ + #define HEAP03_XMAX_SHARED_LOCK 0x0080 /* xmax is shared locker */ + /* if either LOCK bit is set, xmax hasn't deleted the tuple, only locked it */ + #define HEAP03_IS_LOCKED (HEAP03_XMAX_EXCL_LOCK | HEAP03_XMAX_SHARED_LOCK) + #define HEAP03_XMIN_COMMITTED 0x0100 /* t_xmin committed */ + #define HEAP03_XMIN_INVALID 0x0200 /* t_xmin invalid/aborted */ + #define HEAP03_XMAX_COMMITTED 0x0400 /* t_xmax committed */ + #define HEAP03_XMAX_INVALID 0x0800 /* t_xmax invalid/aborted */ + #define HEAP03_XMAX_IS_MULTI 0x1000 /* t_xmax is a MultiXactId */ + #define HEAP03_UPDATED 0x2000 /* this is UPDATEd version of row */ + #define HEAP03_MOVED_OFF 0x4000 /* moved to another place by VACUUM + * FULL */ + #define HEAP03_MOVED_IN 0x8000 /* moved from another place by VACUUM + * FULL */ + #define HEAP03_MOVED (HEAP03_MOVED_OFF | HEAP03_MOVED_IN) + + #define HEAP03_XACT_MASK 0xFFC0 /* visibility-related bits */ + + + /* + * HeapTupleHeader accessor macros + * + * Note: beware of multiple evaluations of "tup" argument. But the Set + * macros evaluate their other argument only once. 
+ */ + /* + #define HeapTupleHeaderGetXmin(tup) \ + ( \ + (tup)->t_choice.t_heap.t_xmin \ + ) + + #define HeapTupleHeaderSetXmin(tup, xid) \ + ( \ + TransactionIdStore((xid), &(tup)->t_choice.t_heap.t_xmin) \ + ) + + #define HeapTupleHeaderGetXmax(tup) \ + ( \ + (tup)->t_choice.t_heap.t_xmax \ + ) + + #define HeapTupleHeaderSetXmax(tup, xid) \ + ( \ + TransactionIdStore((xid), &(tup)->t_choice.t_heap.t_xmax) \ + ) + + #define HeapTupleHeaderGetCmin(tup) \ + ( \ + (tup)->t_choice.t_heap.t_cmin \ + ) + + #define HeapTupleHeaderSetCmin(tup, cid) \ + ( \ + (tup)->t_choice.t_heap.t_cmin = (cid) \ + ) + */ + /* + * Note: GetCmax will produce wrong answers after SetXvac has been executed + * by a transaction other than the inserting one. We could check + * HEAP_XMAX_INVALID and return FirstCommandId if it's clear, but since that + * bit will be set again if the deleting transaction aborts, there'd be no + * real gain in safety from the extra test. So, just rely on the caller not + * to trust the value unless it's meaningful. + */ + /* + #define HeapTupleHeaderGetCmax(tup) \ + ( \ + (tup)->t_choice.t_heap.t_field4.t_cmax \ + ) + + #define HeapTupleHeaderSetCmax(tup, cid) \ + do { \ + Assert(!((tup)->t_infomask & HEAP_MOVED)); \ + (tup)->t_choice.t_heap.t_field4.t_cmax = (cid); \ + } while (0) + + #define HeapTupleHeaderGetXvac(tup) \ + ( \ + ((tup)->t_infomask & HEAP_MOVED) ? 
\ + (tup)->t_choice.t_heap.t_field4.t_xvac \ + : \ + InvalidTransactionId \ + ) + + #define HeapTupleHeaderSetXvac(tup, xid) \ + do { \ + Assert((tup)->t_infomask & HEAP_MOVED); \ + TransactionIdStore((xid), &(tup)->t_choice.t_heap.t_field4.t_xvac); \ + } while (0) + + #define HeapTupleHeaderGetDatumLength(tup) \ + ( \ + (tup)->t_choice.t_datum.datum_len \ + ) + + #define HeapTupleHeaderSetDatumLength(tup, len) \ + ( \ + (tup)->t_choice.t_datum.datum_len = (len) \ + ) + + #define HeapTupleHeaderGetTypeId(tup) \ + ( \ + (tup)->t_choice.t_datum.datum_typeid \ + ) + + #define HeapTupleHeaderSetTypeId(tup, typeid) \ + ( \ + (tup)->t_choice.t_datum.datum_typeid = (typeid) \ + ) + + #define HeapTupleHeaderGetTypMod(tup) \ + ( \ + (tup)->t_choice.t_datum.datum_typmod \ + ) + + #define HeapTupleHeaderSetTypMod(tup, typmod) \ + ( \ + (tup)->t_choice.t_datum.datum_typmod = (typmod) \ + ) + + #define HeapTupleHeaderGetOid(tup) \ + ( \ + ((tup)->t_infomask & HEAP_HASOID) ? \ + *((Oid *) ((char *)(tup) + (tup)->t_hoff - sizeof(Oid))) \ + : \ + InvalidOid \ + ) + + #define HeapTupleHeaderSetOid(tup, oid) \ + do { \ + Assert((tup)->t_infomask & HEAP_HASOID); \ + *((Oid *) ((char *)(tup) + (tup)->t_hoff - sizeof(Oid))) = (oid); \ + } while (0) + + */ + /* + * BITMAPLEN(NATTS) - + * Computes size of null bitmap given number of data columns. + */ + //#define BITMAPLEN(NATTS) (((int)(NATTS) + 7) / 8) + + /* + * MaxTupleSize is the maximum allowed size of a tuple, including header and + * MAXALIGN alignment padding. Basically it's BLCKSZ minus the other stuff + * that has to be on a disk page. The "other stuff" includes access-method- + * dependent "special space", which we assume will be no more than + * MaxSpecialSpace bytes (currently, on heap pages it's actually zero). + * + * NOTE: we do not need to count an ItemId for the tuple because + * sizeof(PageHeaderData) includes the first ItemId on the page. 
+ */ + //#define MaxSpecialSpace 32 + + //#define MaxTupleSize \ + // (BLCKSZ - MAXALIGN(sizeof(PageHeaderData) + MaxSpecialSpace)) + + /* + * MaxHeapTuplesPerPage is an upper bound on the number of tuples that can + * fit on one heap page. (Note that indexes could have more, because they + * use a smaller tuple header.) We arrive at the divisor because each tuple + * must be maxaligned, and it must have an associated item pointer. + */ + //#define MaxHeapTuplesPerPage \ + // ((int) ((BLCKSZ - offsetof(PageHeaderData, pd_linp)) / \ + // (MAXALIGN(offsetof(HeapTupleHeaderData, t_bits)) + sizeof(ItemIdData)))) + + extern HeapTuple heap_tuple_upgrade_03(Relation rel, HeapTuple tuple); + + #endif /* HTUP_H */ diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/include/access/tuplimits.h pgsql_master_upgrade.13a47c410da7/src/include/access/tuplimits.h *** pgsql_master_upgrade.751eb7c6969f/src/include/access/tuplimits.h 2008-10-31 21:45:33.181679309 +0100 --- pgsql_master_upgrade.13a47c410da7/src/include/access/tuplimits.h 2008-10-31 21:45:33.275179286 +0100 *************** *** 29,35 **** * ItemIds and tuples have different alignment requirements, don't assume that * you can, say, fit 2 tuples of size MaxHeapTupleSize/2 on the same page. */ ! #define MaxHeapTupleSize (BLCKSZ - MAXALIGN(SizeOfPageHeaderData + sizeof(ItemIdData))) /* * MaxHeapTuplesPerPage is an upper bound on the number of tuples that can --- 29,35 ---- * ItemIds and tuples have different alignment requirements, don't assume that * you can, say, fit 2 tuples of size MaxHeapTupleSize/2 on the same page. */ ! #define MaxHeapTupleSize (BLCKSZ - MAXALIGN(SizeOfPageHeaderData04 + sizeof(ItemIdData))) /* * MaxHeapTuplesPerPage is an upper bound on the number of tuples that can *************** *** 43,49 **** * require increases in the size of work arrays. */ #define MaxHeapTuplesPerPage \ ! 
((int) ((BLCKSZ - SizeOfPageHeaderData) / \ (MAXALIGN(offsetof(HeapTupleHeaderData, t_bits)) + sizeof(ItemIdData)))) --- 43,49 ---- * require increases in the size of work arrays. */ #define MaxHeapTuplesPerPage \ ! ((int) ((BLCKSZ - SizeOfPageHeaderData04) / \ (MAXALIGN(offsetof(HeapTupleHeaderData, t_bits)) + sizeof(ItemIdData)))) *************** *** 55,61 **** * must be maxaligned, and it must have an associated item pointer. */ #define MaxIndexTuplesPerPage \ ! ((int) ((BLCKSZ - SizeOfPageHeaderData) / \ (MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData)))) /* --- 55,61 ---- * must be maxaligned, and it must have an associated item pointer. */ #define MaxIndexTuplesPerPage \ ! ((int) ((BLCKSZ - SizeOfPageHeaderData04) / \ (MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData)))) /* *************** *** 66,72 **** */ #define BTMaxItemSize(page) \ MAXALIGN_DOWN((PageGetPageSize(page) - \ ! MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \ MAXALIGN(sizeof(BTPageOpaqueData))) / 3) --- 66,72 ---- */ #define BTMaxItemSize(page) \ MAXALIGN_DOWN((PageGetPageSize(page) - \ ! MAXALIGN(SizeOfPageHeaderData04 + 3*sizeof(ItemIdData)) - \ MAXALIGN(sizeof(BTPageOpaqueData))) / 3) diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/include/access/tuptoaster.h pgsql_master_upgrade.13a47c410da7/src/include/access/tuptoaster.h *** pgsql_master_upgrade.751eb7c6969f/src/include/access/tuptoaster.h 2008-10-31 21:45:33.183208112 +0100 --- pgsql_master_upgrade.13a47c410da7/src/include/access/tuptoaster.h 2008-10-31 21:45:33.276709073 +0100 *************** *** 49,55 **** #define TOAST_TUPLE_THRESHOLD \ MAXALIGN_DOWN((BLCKSZ - \ ! MAXALIGN(SizeOfPageHeaderData + TOAST_TUPLES_PER_PAGE * sizeof(ItemIdData))) \ / TOAST_TUPLES_PER_PAGE) #define TOAST_TUPLE_TARGET TOAST_TUPLE_THRESHOLD --- 49,55 ---- #define TOAST_TUPLE_THRESHOLD \ MAXALIGN_DOWN((BLCKSZ - \ ! 
MAXALIGN(SizeOfPageHeaderData04 + TOAST_TUPLES_PER_PAGE * sizeof(ItemIdData))) \ / TOAST_TUPLES_PER_PAGE) #define TOAST_TUPLE_TARGET TOAST_TUPLE_THRESHOLD *************** *** 75,81 **** #define EXTERN_TUPLE_MAX_SIZE \ MAXALIGN_DOWN((BLCKSZ - \ ! MAXALIGN(SizeOfPageHeaderData + EXTERN_TUPLES_PER_PAGE * sizeof(ItemIdData))) \ / EXTERN_TUPLES_PER_PAGE) #define TOAST_MAX_CHUNK_SIZE \ --- 75,81 ---- #define EXTERN_TUPLE_MAX_SIZE \ MAXALIGN_DOWN((BLCKSZ - \ ! MAXALIGN(SizeOfPageHeaderData04 + EXTERN_TUPLES_PER_PAGE * sizeof(ItemIdData))) \ / EXTERN_TUPLES_PER_PAGE) #define TOAST_MAX_CHUNK_SIZE \ diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/include/storage/bufpage.h pgsql_master_upgrade.13a47c410da7/src/include/storage/bufpage.h *** pgsql_master_upgrade.751eb7c6969f/src/include/storage/bufpage.h 2008-10-31 21:45:33.185832325 +0100 --- pgsql_master_upgrade.13a47c410da7/src/include/storage/bufpage.h 2008-10-31 21:45:33.279178878 +0100 *************** *** 121,127 **** * On the high end, we can only support pages up to 32KB because lp_off/lp_len * are 15 bits. */ ! typedef struct PageHeaderData { /* XXX LSN is member of *any* block, not only page-organized ones */ XLogRecPtr pd_lsn; /* LSN: next byte after last byte of xlog --- 121,127 ---- * On the high end, we can only support pages up to 32KB because lp_off/lp_len * are 15 bits. */ ! typedef struct PageHeaderData_04 { /* XXX LSN is member of *any* block, not only page-organized ones */ XLogRecPtr pd_lsn; /* LSN: next byte after last byte of xlog *************** *** 135,143 **** uint16 pd_pagesize_version; TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */ ItemIdData pd_linp[1]; /* beginning of line pointer array */ ! } PageHeaderData; - typedef PageHeaderData *PageHeader; /* * pd_flags contains the following flag bits. 
Undefined bits are initialized --- 135,160 ---- uint16 pd_pagesize_version; TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */ ItemIdData pd_linp[1]; /* beginning of line pointer array */ ! } PageHeaderData_04; ! ! typedef PageHeaderData_04 *PageHeader_04; ! ! typedef struct PageHeaderData_03 ! { ! /* XXX LSN is member of *any* block, not only page-organized ones */ ! XLogRecPtr pd_lsn; /* LSN: next byte after last byte of xlog ! * record for last change to this page */ ! TimeLineID pd_tli; /* TLI of last change */ ! LocationIndex pd_lower; /* offset to start of free space */ ! LocationIndex pd_upper; /* offset to end of free space */ ! LocationIndex pd_special; /* offset to start of special space */ ! uint16 pd_pagesize_version; ! ItemIdData pd_linp[1]; /* beginning of line pointer array */ ! } PageHeaderData_03; ! ! typedef PageHeaderData_03 *PageHeader_03; ! /* * pd_flags contains the following flag bits. Undefined bits are initialized *************** *** 181,195 **** #define PageIsValid(page) PointerIsValid(page) /* ! * line pointer(s) do not count as part of header */ ! #define SizeOfPageHeaderData (offsetof(PageHeaderData, pd_linp)) /* * PageIsNew * returns true iff page has not been initialized (by PageInit) */ ! #define PageIsNew(page) (((PageHeader) (page))->pd_upper == 0) /* ---------------- * macros to access page size info --- 198,213 ---- #define PageIsValid(page) PointerIsValid(page) /* ! * line pointer does not count as part of header */ ! #define SizeOfPageHeaderData04 offsetof(PageHeaderData_04, pd_linp[0]) ! #define SizeOfPageHeaderData03 offsetof(PageHeaderData_03, pd_linp[0]) /* * PageIsNew * returns true iff page has not been initialized (by PageInit) */ ! #define PageIsNew(page) (((PageHeader_04) (page))->pd_upper == 0) /* ---------------- * macros to access page size info *************** *** 211,224 **** * however, it can be called on a page that is not stored in a buffer. */ #define PageGetPageSize(page) \ ! 
((Size) (((PageHeader) (page))->pd_pagesize_version & (uint16) 0xFF00)) /* * PageGetPageLayoutVersion * Returns the page layout version of a page. */ #define PageGetPageLayoutVersion(page) \ ! (((PageHeader) (page))->pd_pagesize_version & 0x00FF) /* * PageSetPageSizeAndVersion --- 229,242 ---- * however, it can be called on a page that is not stored in a buffer. */ #define PageGetPageSize(page) \ ! ((Size) (((PageHeader_04) (page))->pd_pagesize_version & (uint16) 0xFF00)) /* * PageGetPageLayoutVersion * Returns the page layout version of a page. */ #define PageGetPageLayoutVersion(page) \ ! (((PageHeader_04) (page))->pd_pagesize_version & 0x00FF) /* * PageSetPageSizeAndVersion *************** *** 231,239 **** ( \ AssertMacro(((size) & 0xFF00) == (size)), \ AssertMacro(((version) & 0x00FF) == (version)), \ ! ((PageHeader) (page))->pd_pagesize_version = (size) | (version) \ ) /* ---------------------------------------------------------------- * extern declarations * ---------------------------------------------------------------- --- 249,259 ---- ( \ AssertMacro(((size) & 0xFF00) == (size)), \ AssertMacro(((version) & 0x00FF) == (version)), \ ! 
((PageHeader_04) (page))->pd_pagesize_version = (size) | (version) \ ) + + /* ---------------------------------------------------------------- * extern declarations * ---------------------------------------------------------------- *************** *** 303,307 **** --- 323,328 ---- extern OffsetNumber PageItemGetRedirect(Page, OffsetNumber offsetNumber); extern ItemId PageGetItemId(Page page, OffsetNumber offsetNumber);/* do not use it - only for pg_inspect contrib modul*/ + #endif /* BUFPAGE_H */ diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/include/storage/fsm_internals.h pgsql_master_upgrade.13a47c410da7/src/include/storage/fsm_internals.h *** pgsql_master_upgrade.751eb7c6969f/src/include/storage/fsm_internals.h 2008-10-31 21:45:33.186717311 +0100 --- pgsql_master_upgrade.13a47c410da7/src/include/storage/fsm_internals.h 2008-10-31 21:45:33.280068969 +0100 *************** *** 49,55 **** * Number of non-leaf and leaf nodes, and nodes in total, on an FSM page. * These definitions are internal to fsmpage.c. */ ! #define NodesPerPage (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - \ offsetof(FSMPageData, fp_nodes)) #define NonLeafNodesPerPage (BLCKSZ / 2 - 1) --- 49,55 ---- * Number of non-leaf and leaf nodes, and nodes in total, on an FSM page. * These definitions are internal to fsmpage.c. */ ! #define NodesPerPage (BLCKSZ - MAXALIGN(SizeOfPageHeaderData04) - \ offsetof(FSMPageData, fp_nodes)) #define NonLeafNodesPerPage (BLCKSZ / 2 - 1)
I tried to apply this patch to CVS HEAD and it blew up all over the place. It doesn't seem to be intended to apply against CVS HEAD; for example, I don't have backend/access/heap/htup.c at all, so can't apply changes to that file. I was able to clone the GIT repository with the following command... git clone http://git.postgresql.org/git/~davidfetter/upgrade_in_place/.git ...but now I'm confused, because I don't see the changes from the diff reflected in the resulting tree. As you can see, I am not a git wizard. Any help would be appreciated. Here are a few initial thoughts based mostly on reading the diff: In the minor nit department, I don't really like the idea of PageHeaderData_04, SizeOfPageHeaderData04, PageLayoutIsValid_04, etc. I think the latest version should just be PageHeaderData and SizeOfPageHeaderData, and previous versions should be, e.g. PageHeaderDataV3. It looks to me like this would cut a few hunks out of this and maybe make it a bit easier to understand what is going on. At any rate, if we are going to stick with an explicit version number in both versions, it should be marked in a consistent way, not _04 sometimes and just 04 other times. My suggestion is e.g. "V4" but YMMV. The changes to nodeIndexscan.c and nodeSeqscan.c are worrisome to me. It looks like the added code is (nearly?) identical in both places, so probably it needs to be refactored to avoid code duplication. I'm also a bit skeptical about the idea of doing the tuple conversion here. Why here rather than ExecStoreTuple()? If you decide to convert the tuple, you can palloc the new one, pfree the old one if shouldFree is set, and reset shouldFree to true. I am pretty skeptical of the idea that all of the HeapTuple* functions can just be conditionalized on the page version and everything will Just Work. It seems like that is too low a level to be worrying about such things.
Even if it happens to work for the changes between V3 and V4, what happens when V5 or V6 is changed in such a way that the answer to HeapTupleIsWhatever is neither "Yes" nor "No", but rather "Maybe" or "Seven"? The performance hit also sounds painful. I don't have a better idea right now though... I think it's going to be absolutely imperative to begin vacuuming away old V3 pages as quickly as possible after the upgrade. If you go with the approach of converting the tuple in, or just before, ExecStoreTuple, then you're going to introduce a lot of overhead when working with V3 pages. I think that's fine. You should plan to do your in-place upgrade at 1AM on Christmas morning (or whenever your load hits rock bottom...) and immediately start converting the database, starting with your most important and smallest tables. In fact, I would look whenever possible for ways to make the V4 case a fast-path and just accept that the system is going to labor a bit when dealing with V3 stuff. Any overhead you introduce when dealing with V3 pages can go away; any V4 overhead is permanent and therefore much more difficult to accept. That's about all I have for now... if you can give me some pointers on working with this git repository, or provide a complete patch that applies cleanly to CVS HEAD, I will try to look at this in more detail. ...Robert
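The conversion rule suggested above for ExecStoreTuple — palloc the new tuple, pfree the old one if shouldFree is set, and reset shouldFree to true — can be illustrated with a minimal standalone sketch. All the types here (SlotTuple, TupleHeaderV3/V4) are simplified stand-ins, not the real PostgreSQL structures, and malloc/free stand in for palloc/pfree:

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-ins for the real tuple header layouts; the actual
 * HeapTupleHeader structures are different. */
typedef struct { int t_xmin; int t_xmax; int t_natts; } TupleHeaderV3;
typedef struct { int t_xmin; int t_xmax; short t_infomask2; } TupleHeaderV4;

typedef struct
{
    int   page_version;  /* layout version of the page the tuple came from */
    void *t_data;        /* points into the buffer, or is allocated */
    int   shouldFree;    /* nonzero if the slot owns t_data */
} SlotTuple;

/* If the stored tuple is V3, replace it with a freshly allocated V4
 * copy: free the old copy only if the slot owned it, then mark the
 * slot as owner of the converted tuple. */
static void
store_tuple_converting(SlotTuple *slot)
{
    TupleHeaderV3 *old;
    TupleHeaderV4 *converted;

    if (slot->page_version == 4)
        return;                     /* fast path: V4 needs no work */

    old = (TupleHeaderV3 *) slot->t_data;
    converted = (TupleHeaderV4 *) malloc(sizeof(TupleHeaderV4));
    converted->t_xmin = old->t_xmin;
    converted->t_xmax = old->t_xmax;
    converted->t_infomask2 = (short) old->t_natts;

    if (slot->shouldFree)
        free(slot->t_data);
    slot->t_data = converted;
    slot->shouldFree = 1;           /* slot now owns the converted copy */
    slot->page_version = 4;
}
```

The point of the sketch is the ownership rule: a V4 tuple keeps whatever ownership it had, while a converted V3 tuple is always slot-owned.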
Big thanks for review. Robert Haas napsal(a): > I tried to apply this patch to CVS HEAD and it blew up all over the > place. It doesn't seem to be intended to apply against CVS HEAD; for > example, I don't have backend/access/heap/htup.c at all, so can't > apply changes to that file. You also need to apply two other patches, which are located here: http://wiki.postgresql.org/wiki/CommitFestInProgress#Upgrade-in-place_and_related_issues I moved one related patch from another category here to the correct place. The problem is that it is difficult to keep it in sync with HEAD, because they change a lot of things. That is the reason why I also put everything into the GIT repository, but ... > I was able to clone the GIT repository > with the following command... > > git clone http://git.postgresql.org/git/~davidfetter/upgrade_in_place/.git > > ...but now I'm confused, because I don't see the changes from the diff > reflected in the resulting tree. As you can see, I am not a git > wizard. Any help would be appreciated. I'm a GIT newbie; I use mercurial for development and I manually applied the changes into GIT. I asked David Fetter for help with how to get a correct clone back. In the meantime you can download a tarball. http://git.postgresql.org/?p=~davidfetter/upgrade_in_place/.git;a=snapshot;h=c72bafada59ed278ffac59657c913bc375f77808;sf=tgz It should contain everything, including yesterday's improvements (delete, insert, update work - insert/update only on tables without an index). > Here are a few initial thoughts based mostly on reading the diff: > > In the minor nit department, I don't really like the idea of > PageHeaderData_04, SizeOfPageHeaderData04, PageLayoutIsValid_04, etc. > I think the latest version should just be PageHeaderData and > SizeOfPageHeaderData, and previous versions should be, e.g. > PageHeaderDataV3. It looks to me like this would cut a few hunks out > of this and maybe make it a bit easier to understand what is going on.
> At any rate, if we are going to stick with an explicit version number > in both versions, it should be marked in a consistent way, not _04 > sometimes and just 04 other times. My suggestion is e.g. "V4" but > YMMV. Yeah, it is the most difficult part :-) to find correct names for it. I think that each version of a structure should have a version suffix, including the last one. And of course the last one should also have a general name without a suffix - see example: typedef struct PageHeaderData_04 { ...} PageHeaderData_04 typedef struct PageHeaderData_03 { ...} PageHeaderData_03 typedef PageHeaderData_04 PageHeaderData This allows you to specify the version exactly in places where you need it and keep the general name where the version is not relevant. How the suffix should look is another question. I prefer to have 04, not only 4. What about PageHeaderData_V04? By the way, what does YMMV mean? > The changes to nodeIndexscan.c and nodeSeqscan.c are worrisome to me. > It looks like the added code is (nearly?) identical in both places, so > probably it needs to be refactored to avoid code duplication. I'm > also a bit skeptical about the idea of doing the tuple conversion > here. Why here rather than ExecStoreTuple()? If you decide to > convert the tuple, you can palloc the new one, pfree the old one if > ShouldFree is set, and reset shouldFree to true. Good point. I thought about it as one variant. And if I look at it closely now, it is really a much better place. It should fix the problem of why REINDEX does not work. I will move it. > I am pretty skeptical of the idea that all of the HeapTuple* functions > can just be conditionalized on the page version and everything will > Just Work. It seems like that is too low a level to be worrying about > such things. Even if it happens to work for the changes between V3 > and V4, what happens when V5 or V6 is changed in such a way that the > answer to HeapTupleIsWhatever is neither "Yes" nor "No", but rather > "Maybe" or "Seven"? The performance hit also sounds painful.
> I don't have a better idea right now though... OK. Currently it works (or I hope that it works). If somebody in the future invents some special change, I think in most (maybe all) cases a mapping will be possible. Speed is the key point. When I checked it last time I got a 1% performance drop in a fresh database. I think 1% is a good price for in-place online upgrade. > I think it's going to be absolutely imperative to begin vacuuming away > old V3 pages as quickly as possible after the upgrade. If you go with > the approach of converting the tuple in, or just before, > ExecStoreTuple, then you're going to introduce a lot of overhead when > working with V3 pages. I think that's fine. You should plan to do > your in-place upgrade at 1AM on Christmas morning (or whenever your > load hits rock bottom...) and immediately start converting the > database, starting with your most important and smallest tables. In > fact, I would look whenever possible for ways to make the V4 case a > fast-path and just accept that the system is going to labor a bit when > dealing with V3 stuff. Any overhead you introduce when dealing with > V3 pages can go away; any V4 overhead is permanent and therefore much > more difficult to accept. Yes, the plan is to improve vacuum to convert old pages to new ones, but as a second step. I already have page converter code. With some modification it could be integrated easily into the vacuum code. > That's about all I have for now... if you can give me some pointers on > working with this git repository, or provide a complete patch that > applies cleanly to CVS HEAD, I will try to look at this in more > detail. Thanks for your comments. Try the snapshot link. I hope that it will work. Zdenek PS: I'm sorry about the response time, but I'm on training this week. -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql
> You need to apply also two other patches: > which are located here: > http://wiki.postgresql.org/wiki/CommitFestInProgress#Upgrade-in-place_and_related_issues > I moved one related patch from another category here to correct place. Just to confirm, which two? > http://git.postgresql.org/?p=~davidfetter/upgrade_in_place/.git;a=snapshot;h=c72bafada59ed278ffac59657c913bc375f77808;sf=tgz > > It should contains every think including yesterdays improvements (delete, > insert, update works - inser/update only on table without index). Wow, sounds like great improvements. I understand your difficulties in keeping up with HEAD, but I hope we can figure out some solution, because right now I have a diff (that I can't apply) and a tarball (that I can't diff) and that is not ideal for reviewing. > Yeah, it is most difficult part :-) find correct names for it. I think that > each version of structure should have version suffix including lastone. And > of cource the last one we should have a general name without suffix - see > example: > > typedef struct PageHeaderData_04 { ...} PageHeaderData_04 > typedef struct PageHeaderData_03 { ...} PageHeaderData_03 > typedef PageHeaderData_04 PageHeaderData > > This allows you exactly specify version on places where you need it and keep > general name where version is not relevant. That doesn't make sense to me. If PageHeaderData and PageHeaderData_04 are the same type, how do you decide which one to use in any particular place in the code? > How suffix should looks it another question. I prefer to have 04 not only 4. > What's about PageHeaderData_V04? I prefer "V" as a delimiter rather than "_" because that makes it more clear that the number which follows is a version number, but I think "_V" is overkill. However, I don't really want to argue the point; I'm just throwing in my $0.02 and I am sure others will have their own views as well. > By the way what YMMV means? "Your Mileage May Vary." 
http://www.urbandictionary.com/define.php?term=YMMV >> I am pretty skeptical of the idea that all of the HeapTuple* functions >> can just be conditionalized on the page version and everything will >> Just Work. It seems like that is too low a level to be worrying about >> such things. Even if it happens to work for the changes between V3 >> and V4, what happens when V5 or V6 is changed in such a way that the >> answer to HeapTupleIsWhatever is neither "Yes" nor "No", but rather >> "Maybe" or "Seven"? The performance hit also sounds painful. I don't >> have a better idea right now though... > > OK. Currently it works (or I hope that it works). If somebody in a future > invent some special change, i think in most (maybe all) cases there will be > possible mapping. > > The speed is key point. When I check it last time I go 1% performance drop > in fresh database. I think 1% is good price for in-place online upgrade. I think that's arguable and something that needs to be more broadly discussed. I wouldn't be keen to pay a 1% performance drop for this feature, because it's not a feature I really need. Sure, in-place upgrade would be nice to have, but for me, dump and reload isn't a huge problem. It's a lot better than the 5% number you quoted previously, but I'm not sure whether it is good enough. I would feel more comfortable if the feature could be completely disabled via compile-time defines. Then you could build the system either with or without in-place upgrade, according to your needs. But I don't think that's very practical with HeapTuple* as functions. You could conditionalize away the switch, but the function call overhead would remain. To get rid of that, you'd need some enormous, fragile hack that I don't even want to contemplate. Really, what I'd ideally like to see here is a system where the V3 code is in essence error-recovery code.
Everything should be V4-only unless you detect a V3 page, and then you error out (if in-place upgrade is not enabled) or jump to the appropriate V3-aware code (if in-place upgrade is enabled). In theory, with a system like this, it seems like the overhead for V4 ought to be no more than the cost of checking the page version on each page read, which is a cheap sanity check we'd be willing to pay for anyway, and trivial in cost. But I think we probably need some input from -core on this topic as well. ...Robert
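The "V4 fast path, V3 as error recovery" dispatch described above can be sketched in a few lines. PageStub is a hypothetical stand-in for the real page header (only the version word), and the returned strings stand in for the code paths taken; only the low-byte version check mirrors the actual PageGetPageLayoutVersion convention:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for PageHeaderData: only the version word. */
typedef struct { unsigned short pd_pagesize_version; } PageStub;

/* Low byte holds the layout version, as in PageGetPageLayoutVersion. */
static int
page_layout_version(const PageStub *page)
{
    return page->pd_pagesize_version & 0x00FF;
}

/* V4 pages take the fast path; a V3 page either errors out (in-place
 * upgrade disabled) or jumps to the V3-aware recovery code. */
static const char *
dispatch_on_layout(const PageStub *page, int upgrade_enabled)
{
    if (page_layout_version(page) == 4)
        return "v4-fast-path";
    if (!upgrade_enabled)
        return "error";             /* would be elog(ERROR, ...) */
    return "v3-recovery-path";
}
```

The cost on the common path is a single byte-mask comparison per page, which is the "cheap sanity check" being argued for.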
"Robert Haas" <robertmhaas@gmail.com> writes: > Really, what I'd ideally like to see here is a system where the V3 > code is in essence error-recovery code. Everything should be V4-only > unless you detect a V3 page, and then you error out (if in-place > upgrade is not enabled) or jump to the appropriate V3-aware code (if > in-place upgrade is enabled). In theory, with a system like this, it > seems like the overhead for V4 ought to be no more than the cost of > checking the page version on each page read, which is a cheap sanity > check we'd be willing to pay for anyway, and trivial in cost. We already do check the page version on read-in --- see PageHeaderIsValid. > But I think we probably need some input from -core on this topic as well. I concur that I don't want to see this patch adding more than the absolute unavoidable minimum of overhead for data that meets the "current" layout definition. I'm disturbed by the proposal to stick overhead into tuple header access, for example. regards, tom lane
> We already do check the page version on read-in --- see PageHeaderIsValid. Right, but the only place this is called is in ReadBuffer_common, which doesn't seem like a suitable place to deal with the possibility of a V3 page since you don't yet know what you plan to do with it. I'm not quite sure what the right solution to that problem is... >> But I think we probably need some input from -core on this topic as well. > I concur that I don't want to see this patch adding more than the > absolute unavoidable minimum of overhead for data that meets the > "current" layout definition. I'm disturbed by the proposal to stick > overhead into tuple header access, for example. ...but it seems like we both agree that conditionalizing heap tuple header access on page version is not the right answer. Based on that, I'm going to move the "htup and bufpage API clean up" patch to "Returned with feedback" and continue reviewing the remainder of these patches. As I'm looking at this, I'm realizing another problem - there is a lot of code that looks like this: void HeapTupleSetXmax(HeapTuple tuple, TransactionId xmax) { switch(tuple->t_ver) { case 4 : tuple->t_data->t_choice.t_heap.t_xmax = xmax; break; case 3 : TPH03(tuple)->t_choice.t_heap.t_xmax = xmax; break; default: elog(PANIC, "HeapTupleSetXmax is not supported."); } } TPH03 is a macro that is casting tuple->t_data to HeapTupleHeader_03. Unless I'm missing something, that means that given an arbitrary pointer to HeapTuple, there is absolutely no guarantee that tuple->t_data->t_choice actually points to that field at all. It will if tuple->t_ver happens to be 4 OR if HeapTupleHeader and HeapTupleHeader_03 happen to agree on where t_choice is; otherwise it points to some other member of HeapTupleHeader_03, or off the end of the structure. To me that seems unacceptably fragile, because it means the compiler can't warn us that we're using a pointer inappropriately. 
If we truly want to be safe here then we need to create an opaque HeapTupleHeader structure that contains only those elements that HeapTupleHeader_03 and HeapTupleHeader_04 have in common, and cast BOTH of them after checking the version. That way if someone writes a function that attempts to dereference a HeapTupleHeader without going through the API, it will fail to compile rather than mostly working but possibly failing on a V3 page. ...Robert
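The opaque-type idea can be pushed even further than a common-prefix struct: declare the shared type as an incomplete type, so any access that bypasses the version-checking accessors is a compile error. A hedged sketch with invented stand-in layouts (HeaderV3/HeaderV4 are not the real HeapTupleHeader structures):

```c
#include <assert.h>
#include <stdlib.h>

/* The opaque type exposes no members at all; code must go through
 * the accessors.  Dereferencing an OpaqueTupleHeader* directly fails
 * to compile because the type is incomplete. */
typedef struct OpaqueTupleHeader OpaqueTupleHeader;

/* Simplified stand-ins for the two on-disk layouts. */
typedef struct { int t_xmin; int t_xmax; int t_cmin; } HeaderV3;
typedef struct { int t_xmin; int t_xmax; short t_infomask2; } HeaderV4;

/* Both casts happen only after the version has been checked. */
static int
tuple_get_xmax(const OpaqueTupleHeader *hdr, int version)
{
    switch (version)
    {
        case 4:
            return ((const HeaderV4 *) hdr)->t_xmax;
        case 3:
            return ((const HeaderV3 *) hdr)->t_xmax;
        default:
            abort();                /* would be elog(PANIC, ...) */
    }
}
```

With this shape, the compiler enforces exactly the property Robert asks for: a function taking an OpaqueTupleHeader* cannot touch any field without naming a version-specific layout first.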
Robert Haas napsal(a): > > Really, what I'd ideally like to see here is a system where the V3 > code is in essence error-recovery code. Everything should be V4-only > unless you detect a V3 page, and then you error out (if in-place > upgrade is not enabled) or jump to the appropriate V3-aware code (if > in-place upgrade is enabled). In theory, with a system like this, it > seems like the overhead for V4 ought to be no more than the cost of > checking the page version on each page read, which is a cheap sanity > check we'd be willing to pay for anyway, and trivial in cost. OK. The original idea was to make "convert on read", which has several problems with no easy solution. One is that the new data may not fit on the page, and a second big problem is how to convert TOAST table data. Another, more general, problem is how to convert indexes... Convert on read has minimal impact on core when the latest version is processed. But the problem is what happens when you need to migrate a tuple from a page to a new one, modify the index, and also convert toast value(s)... The problem is that the response could be long for some queries, because it invokes a lot of changes and conversions. I think in a corner case it could require converting all indexes when you request one record. Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql
> OK. It was original idea to make "Convert on read" which has several > problems with no easy solution. One is that new data does not fit on the > page and second big problem is how to convert TOAST table data. Another > problem which is general is how to convert indexes... > > Convert on read has minimal impact on core when latest version is processed. > But problem is what happen when you need to migrate tuple form page to new > one modify index and also needs convert toast value(s)... Problem is that > response could be long in some query, because it invokes a lot of changes > and conversion. I think in corner case it could requires converts all index > when you request one record. I don't think I'm proposing convert on read, exactly. If you actually try to convert the entire page when you read it in, I think you're doomed to failure, because, as you rightly point out, there is absolutely no guarantee that the page contents in their new format will still fit into one block. I think what you want to do is convert the structures within the page one by one as you read them out of the page. The proposed refactoring of ExecStoreTuple will do exactly this, for example. HEAD uses a pointer into the actual buffer for a V4 tuple that comes from an existing relation, and a pointer to a palloc'd structure for a tuple that is generated during query execution. The proposed refactoring will keep these rules, plus add a new rule that if you happen to read a V3 page, you will palloc space for a new V4 tuple that is semantically equivalent to the V3 tuple on the page, and use that pointer instead. That, it seems to me, is exactly the right balance - the PAGE is still a V3 page, but all of the tuples that the upper-level code ever sees are V4 tuples. I'm not sure how far this particular approach can be generalized. 
ExecStoreTuple has the advantage that it already has to deal with both direct buffer pointers and palloc'd structures, so the code doesn't need to be much more complex to handle this case as well. I think the thing to do is go through and scrutinize all of the ReadBuffer call sites and figure out an approach to each one. I haven't looked at your latest code yet, so you may have already done this, but just for example, RelationGetBufferForTuple should probably just reject any V3 pages encountered as if they were full, including updating the FSM where appropriate. I would think that it would be possible to implement that with almost zero performance impact. I'm happy to look at and discuss the problem cases with you, and hopefully others will chime in as well since my knowledge of the code is far from exhaustive. ...Robert
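The RelationGetBufferForTuple suggestion reduces to a one-line check on the insert path. A minimal sketch, with an invented page struct (the real code consults PageGetFreeSpace and the FSM): an old-format page simply reports zero free space, so it is never chosen as an insertion target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy page: version plus the usual lower/upper free-space bounds. */
typedef struct {
    uint8_t  version;    /* on-page layout version, 3 or 4 */
    uint16_t pd_lower;   /* end of the line-pointer array */
    uint16_t pd_upper;   /* start of the tuple area */
} PageModel;

/* Free space as the insert path would see it: a V3 page is treated
 * as completely full, so no new tuple ever lands on it. */
size_t page_free_space(const PageModel *page)
{
    if (page->version != 4)
        return 0;
    return (size_t) (page->pd_upper - page->pd_lower);
}
```

The same result would also be reported to the FSM, so the page stops being offered for inserts at all, at essentially zero cost on the V4 path.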
Robert Haas napsal(a): >> OK. It was original idea to make "Convert on read" which has several >> problems with no easy solution. One is that new data does not fit on the >> page and second big problem is how to convert TOAST table data. Another >> problem which is general is how to convert indexes... >> >> Convert on read has minimal impact on core when latest version is processed. >> But problem is what happen when you need to migrate tuple form page to new >> one modify index and also needs convert toast value(s)... Problem is that >> response could be long in some query, because it invokes a lot of changes >> and conversion. I think in corner case it could requires converts all index >> when you request one record. > > I don't think I'm proposing convert on read, exactly. If you actually > try to convert the entire page when you read it in, I think you're > doomed to failure, because, as you rightly point out, there is > absolutely no guarantee that the page contents in their new format > will still fit into one block. I think what you want to do is convert > the structures within the page one by one as you read them out of the > page. The proposed refactoring of ExecStoreTuple will do exactly > this, for example. I see. But VACUUM and other internal functions access heap pages directly, without ExecStoreTuple. However, you point to one idea which I am currently thinking about too. Here is my version: if you look into the new page API, it has PageGetHeapTuple, which could do the conversion job. The problem is that you don't have the relation info there, so you cannot convert the data, but the transaction information can be converted. I am thinking about a HeapTupleData structure modification: it would have a pointer to the transaction info, t_transinfo, which would point to the page tuple for V4. For V3, the PageGetHeapTuple function would allocate memory and put the converted data there. ExecStoreTuple would finally convert the data, because it knows about the relation, and it does not make sense to convert the data earlier.
Who wants to convert invisible or dead data? With this approach, tuples will be processed the same way as V4 tuples, without any overhead (there will be a small overhead from allocating and freeing HeapTupleData in some places, mostly vacuum). Only multi-version access will be driven on a page basis. Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql
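The t_transinfo idea above can be sketched as follows. Everything here is an invented model (TransInfoModel, PageSlotModel, and the field names are illustrative, not the real HeapTupleData layout): the tuple carries a pointer that either aliases the on-page header (V4, zero overhead) or owns an allocated copy holding the converted transaction info (V3).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint32_t xmin, xmax; } TransInfoModel;

/* One item on a page; on_page stands in for the raw tuple header. */
typedef struct {
    uint8_t        page_version;
    TransInfoModel on_page;
} PageSlotModel;

/* The proposed HeapTupleData extension: t_transinfo either aliases the
 * on-page header (V4) or owns an allocated, converted copy (V3). */
typedef struct {
    TransInfoModel *t_transinfo;
    int             needs_free;
} HeapTupleModel;

HeapTupleModel page_get_heap_tuple(PageSlotModel *slot)
{
    HeapTupleModel ht;
    if (slot->page_version == 4) {
        ht.t_transinfo = &slot->on_page;   /* zero-copy, no overhead */
        ht.needs_free = 0;
    } else {
        /* V3: convert only the transaction info here; user data is
         * converted later, in ExecStoreTuple, where the relation's
         * tuple descriptor is available. */
        ht.t_transinfo = malloc(sizeof(TransInfoModel));
        *ht.t_transinfo = slot->on_page;
        ht.needs_free = 1;
    }
    return ht;
}
```

This splits the work in two: visibility checks need only the transaction info, so vacuum and scans can run on V3 pages cheaply, and the expensive data conversion is deferred until a tuple is actually returned to the executor.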
> I see. But Vacuum and other internals function access heap pages directly > without ExecStoreTuple. Right. I don't think there's any getting around the fact that any function which accesses heap pages directly is going to need modification. The key is to make those modifications as non-invasive as possible. For example, in the case of vacuum, as soon as it detects that a V3 page has been read, it should call a special function whose only purpose in life is to move the data out of that V3 page and onto one or more V4 pages, and return. What you shouldn't do is try to make the regular vacuum code handle both V3 and V4 pages, because that will lead to code that may be slow and will almost certainly be complicated and difficult to maintain. I'll read through the rest of this when I have a bit more time. ...Robert
Zdenek Kotala wrote: > Robert Haas napsal(a): >> Really, what I'd ideally like to see here is a system where the V3 >> code is in essence error-recovery code. Everything should be V4-only >> unless you detect a V3 page, and then you error out (if in-place >> upgrade is not enabled) or jump to the appropriate V3-aware code (if >> in-place upgrade is enabled). In theory, with a system like this, it >> seems like the overhead for V4 ought to be no more than the cost of >> checking the page version on each page read, which is a cheap sanity >> check we'd be willing to pay for anyway, and trivial in cost. > > OK. It was original idea to make "Convert on read" which has several > problems with no easy solution. One is that new data does not fit on the > page and second big problem is how to convert TOAST table data. Another > problem which is general is how to convert indexes... We've talked about this many times before, so I'm sure you know what my opinion is. Let me phrase it one more time: 1. You *will* need a function to convert a page from old format to new format. We do want to get rid of the old format pages eventually, whether it's during VACUUM, whenever a page is read in, or by using an extra utility. And that process needs to be online. Please speak up now if you disagree with that. 2. It follows from point 1 that you *will* need to solve the problems with pages where the data doesn't fit on the page in new format, as well as converting TOAST data. We've discussed various solutions to those problems; it's not insurmountable. For the "data doesn't fit anymore" problem, a fairly simple solution is to run a pre-upgrade utility in the old version, that reserves some free space on each page, to make sure everything fits after converting to new format. For TOAST, you can retoast tuples when the heap page is read in. I'm not sure what the problem with indexes is, but you can split pages if necessary, for example. 
Assuming everyone agrees with point 1, could we focus on these issues? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
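The pre-upgrade "reserve free space" idea amounts to simple arithmetic the old server would apply on its insert path. A sketch under assumed numbers (the 4-byte growth figures are placeholders for illustration, not the real V3-to-V4 deltas):

```c
#include <assert.h>
#include <stddef.h>

/* Assumed growth figures, for illustration only. */
#define PAGE_HEADER_GROWTH  4   /* extra bytes in the new page header */
#define TUPLE_HEADER_GROWTH 4   /* extra bytes per converted tuple    */

/* Free space a page must keep so an in-place conversion cannot overflow. */
size_t required_reserve(size_t ntuples)
{
    return PAGE_HEADER_GROWTH + ntuples * TUPLE_HEADER_GROWTH;
}

/* Old-version insert check: admit a tuple only if the page would still
 * convert cleanly afterwards. */
int can_insert(size_t free_bytes, size_t tuple_size, size_t ntuples)
{
    if (tuple_size > free_bytes)
        return 0;
    return free_bytes - tuple_size >= required_reserve(ntuples + 1);
}
```

The catch, raised downthread, is that this check has to be backpatched into the old release and run for long enough that every page satisfies it before the upgrade.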
> We've talked about this many times before, so I'm sure you know what my > opinion is. Let me phrase it one more time: > > 1. You *will* need a function to convert a page from old format to new > format. We do want to get rid of the old format pages eventually, whether > it's during VACUUM, whenever a page is read in, or by using an extra > utility. And that process needs to online. Please speak up now if you > disagree with that. Well, I just proposed an approach that doesn't work this way, so I guess I'll have to put myself in the disagree category, or anyway yet to be convinced. As long as you can move individual tuples onto new pages, you can eventually empty V3 pages and reinitialize them as new, empty V4 pages. You can force that process along via, say, VACUUM, but in the meantime you can still continue to read the old pages without being forced to change them to the new format. That's not the only possible approach, but it's not obvious to me that it's insane. If you think it's a non-starter, it would be good to know why. ...Robert
"Robert Haas" <robertmhaas@gmail.com> writes: > Well, I just proposed an approach that doesn't work this way, so I > guess I'll have to put myself in the disagree category, or anyway yet > to be convinced. As long as you can move individual tuples onto new > pages, you can eventually empty V3 pages and reinitialize them as new, > empty V4 pages. You can force that process along via, say, VACUUM, > but in the meantime you can still continue to read the old pages > without being forced to change them to the new format. That's not the > only possible approach, but it's not obvious to me that it's insane. > If you think it's a non-starter, it would be good to know why. That's sane *if* you can guarantee that only negligible overhead is added for accessing data that is in the up-to-date format. I don't think that will be the case if we start putting version checks into every tuple access macro. regards, tom lane
> That's sane *if* you can guarantee that only negligible overhead is > added for accessing data that is in the up-to-date format. I don't > think that will be the case if we start putting version checks into > every tuple access macro. Yes, the point is that you'll read the page as V3 or V4, whichever it is, but if it's V3, you'll convert the tuples to V4 format before you try to do anything with them (for example by modifying ExecStoreTuple to copy any V3 tuple into a palloc'd buffer, which fits nicely into what that function already does). ...Robert
>> Well, I just proposed an approach that doesn't work this way, so I >> guess I'll have to put myself in the disagree category, or anyway yet >> to be convinced. As long as you can move individual tuples onto new >> pages, you can eventually empty V3 pages and reinitialize them as new, >> empty V4 pages. You can force that process along via, say, VACUUM, > > No, if you can force that process along via some command, whatever it is, then > you're still in the category he described. Maybe. The difference is that I'm talking about converting tuples, not pages, so "What happens when the data doesn't fit on the new page?" is a meaningless question. Since that seemed to be Heikki's main concern, I thought we must be talking about different things. My thought was that the code path for converting a tuple would be very similar to what heap_update does today, and large tuples would be handled via TOAST just as they are now - by converting the relation one tuple at a time, you might end up with a new relation that has either more or fewer pages than the old relation, and it really doesn't matter which. I haven't really thought through all of the other kinds of things that might need to be converted, though. That's where it would be useful for someone more experienced to weigh in on indexes, etc. ...Robert
"Robert Haas" <robertmhaas@gmail.com> writes: >> We've talked about this many times before, so I'm sure you know what my >> opinion is. Let me phrase it one more time: >> >> 1. You *will* need a function to convert a page from old format to new >> format. We do want to get rid of the old format pages eventually, whether >> it's during VACUUM, whenever a page is read in, or by using an extra >> utility. And that process needs to online. Please speak up now if you >> disagree with that. > > Well, I just proposed an approach that doesn't work this way, so I > guess I'll have to put myself in the disagree category, or anyway yet > to be convinced. As long as you can move individual tuples onto new > pages, you can eventually empty V3 pages and reinitialize them as new, > empty V4 pages. You can force that process along via, say, VACUUM, No, if you can force that process along via some command, whatever it is, then you're still in the category he described. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!
"Robert Haas" <robertmhaas@gmail.com> writes: >>> Well, I just proposed an approach that doesn't work this way, so I >>> guess I'll have to put myself in the disagree category, or anyway yet >>> to be convinced. As long as you can move individual tuples onto new >>> pages, you can eventually empty V3 pages and reinitialize them as new, >>> empty V4 pages. You can force that process along via, say, VACUUM, >> >> No, if you can force that process along via some command, whatever it is, then >> you're still in the category he described. > > Maybe. The difference is that I'm talking about converting tuples, > not pages, so "What happens when the data doesn't fit on the new > page?" is a meaningless question. No it's not, because as you pointed out you still need a way for the user to force it to happen sometime. Unless you're going to be happy with telling users they need to update all their tuples which would not be an online process. In any case it sounds like you're saying you want to allow multiple versions of tuples on the same page -- which a) would be much harder and b) doesn't solve the problem since the page still has to be converted sometime anyways. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!
>> Maybe. The difference is that I'm talking about converting tuples, >> not pages, so "What happens when the data doesn't fit on the new >> page?" is a meaningless question. > > No it's not, because as you pointed out you still need a way for the user to > force it to happen sometime. Unless you're going to be happy with telling > users they need to update all their tuples which would not be an online > process. > > In any case it sounds like you're saying you want to allow multiple versions > of tuples on the same page -- which a) would be much harder and b) doesn't > solve the problem since the page still has to be converted sometime anyways. No, that's not what I'm suggesting. My thought was that any V3 page would be treated as if it were completely full, with the exception of a completely empty page which can be reinitialized as a V4 page. So you would never add any tuples to a V3 page, but you would need to update xmax, hint bits, etc. Eventually when all the tuples were dead you could reuse the page. ...Robert
"Robert Haas" <robertmhaas@gmail.com> writes: >>> Maybe. The difference is that I'm talking about converting tuples, >>> not pages, so "What happens when the data doesn't fit on the new >>> page?" is a meaningless question. >> >> No it's not, because as you pointed out you still need a way for the user to >> force it to happen sometime. Unless you're going to be happy with telling >> users they need to update all their tuples which would not be an online >> process. >> >> In any case it sounds like you're saying you want to allow multiple versions >> of tuples on the same page -- which a) would be much harder and b) doesn't >> solve the problem since the page still has to be converted sometime anyways. > > No, that's not what I'm suggesting. My thought was that any V3 page > would be treated as if it were completely full, with the exception of > a completely empty page which can be reinitialized as a V4 page. So > you would never add any tuples to a V3 page, but you would need to > update xmax, hint bits, etc. Eventually when all the tuples were dead > you could reuse the page. But there's no guarantee that will ever happen. Heikki claimed you would need a mechanism to convert the page some day and you said you proposed a system where that wasn't true. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support!
>> No, that's not what I'm suggesting. My thought was that any V3 page >> would be treated as if it were completely full, with the exception of >> a completely empty page which can be reinitialized as a V4 page. So >> you would never add any tuples to a V3 page, but you would need to >> update xmax, hint bits, etc. Eventually when all the tuples were dead >> you could reuse the page. > > But there's no guarantee that will ever happen. Heikki claimed you would need > a mechanism to convert the page some day and you said you proposed a system > where that wasn't true. What's the scenario you're concerned about? An old snapshot that never goes away? Can we lock the old and new pages, move the tuple to a V4 page, and update index entries without changing xmin/xmax? ...Robert
"Robert Haas" <robertmhaas@gmail.com> writes: >>> No, that's not what I'm suggesting. My thought was that any V3 page >>> would be treated as if it were completely full, with the exception of >>> a completely empty page which can be reinitialized as a V4 page. So >>> you would never add any tuples to a V3 page, but you would need to >>> update xmax, hint bits, etc. Eventually when all the tuples were dead >>> you could reuse the page. >> >> But there's no guarantee that will ever happen. Heikki claimed you would need >> a mechanism to convert the page some day and you said you proposed a system >> where that wasn't true. > > What's the scenario you're concerned about? An old snapshot that > never goes away? An old page which never goes away. New page formats are introduced for a reason -- to support new features. An old page lying around indefinitely means some pages can't support those new features. Just as an example, DBAs may be surprised to find out that large swathes of their database are still not protected by CRC checksums months or years after having upgraded to 8.4 (or even 8.5 or 8.6 or ...). They would certainly want a way to ensure all their data is upgraded. > Can we lock the old and new pages, move the tuple to a V4 page, and > update index entries without changing xmin/xmax? Not exactly. But regardless -- the point is we need to do something. (And then the argument goes that since we *have* to do that then we needn't bother with doing anything else. At least if we do it's just an optimization over just doing the whole page right away.) -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
Gregory Stark wrote: > "Robert Haas" <robertmhaas@gmail.com> writes: > An old page which never goes away. New page formats are introduced for a > reason -- to support new features. An old page lying around indefinitely means > some pages can't support those new features. Just as an example, DBAs may be > surprised to find out that large swathes of their database are still not > protected by CRC checksums months or years after having upgraded to 8.4 (or > even 8.5 or 8.6 or ...). They would certainly want a way to ensure all their > data is upgraded. Then provide a manual mechanism to convert all pages? Joshua D. Drake
"Joshua D. Drake" <jd@commandprompt.com> writes: > Gregory Stark wrote: >> "Robert Haas" <robertmhaas@gmail.com> writes: > >> An old page which never goes away. New page formats are introduced for a >> reason -- to support new features. An old page lying around indefinitely means >> some pages can't support those new features. Just as an example, DBAs may be >> surprised to find out that large swathes of their database are still not >> protected by CRC checksums months or years after having upgraded to 8.4 (or >> even 8.5 or 8.6 or ...). They would certainly want a way to ensure all their >> data is upgraded. > > Then provide a manual mechanism to convert all pages? The origin of this thread was the dispute over this claim: 1. You *will* need a function to convert a page from old format to new format. We do want to get rid of the old format pages eventually, whether it's during VACUUM, whenever a page is read in, or by using an extra utility. And that process needs to be online. Please speak up now if you disagree with that. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
Gregory Stark wrote: > "Joshua D. Drake" <jd@commandprompt.com> writes: > >> Gregory Stark wrote: >>> "Robert Haas" <robertmhaas@gmail.com> writes: >>> An old page which never goes away. New page formats are introduced for a >>> reason -- to support new features. An old page lying around indefinitely means >>> some pages can't support those new features. Just as an example, DBAs may be >>> surprised to find out that large swathes of their database are still not >>> protected by CRC checksums months or years after having upgraded to 8.4 (or >>> even 8.5 or 8.6 or ...). They would certainly want a way to ensure all their >>> data is upgraded. >> Then provide a manual mechanism to convert all pages? > > The origin of this thread was the dispute over this claim: > > 1. You *will* need a function to convert a page from old format to new > format. We do want to get rid of the old format pages eventually, whether > it's during VACUUM, whenever a page is read in, or by using an extra > utility. And that process needs to online. Please speak up now if you > disagree with that. > I agree. Joshua D. Drake
> An old page which never goes away. New page formats are introduced for a > reason -- to support new features. An old page lying around indefinitely means > some pages can't support those new features. Just as an example, DBAs may be > surprised to find out that large swathes of their database are still not > protected by CRC checksums months or years after having upgraded to 8.4 (or > even 8.5 or 8.6 or ...). They would certainly want a way to ensure all their > data is upgraded. OK, I see your point. In the absence of any old snapshots, convert-on-write allows you to forcibly upgrade the whole table by rewriting all of the tuples into new pages: UPDATE table SET col = col In the absence of page expansion, you can put logic into VACUUM to upgrade each page in place. If you have both old snapshots that you can't get rid of, and page expansion, then you have a big problem, which I guess brings us back to Heikki's point. ...Robert
Heikki Linnakangas napsal(a): > Zdenek Kotala wrote: > > We've talked about this many times before, so I'm sure you know what my > opinion is. Let me phrase it one more time: > > 1. You *will* need a function to convert a page from old format to new > format. We do want to get rid of the old format pages eventually, > whether it's during VACUUM, whenever a page is read in, or by using an > extra utility. And that process needs to online. Please speak up now if > you disagree with that. Yes, I agree. The basic idea is to create a new empty page and copy+convert tuples into the new page; this new page then overwrites the old one. I already have code which converts heap tables (excluding arrays and composite datatypes). > 2. It follows from point 1, that you *will* need to solve the problems > with pages where the data doesn't fit on the page in new format, as well > as converting TOAST data. Yes or no. It depends on whether we want to live with old pages forever, but I think converting all pages to the newest version is a good idea. > We've discussed various solutions to those problems; it's not > insurmountable. For the "data doesn't fit anymore" problem, a fairly > simple solution is to run a pre-upgrade utility in the old version, that > reserves some free space on each page, to make sure everything fits > after converting to new format. I think that alone will not work: you also need to prevent PostgreSQL from putting any extra data on a page, which requires modifying the PostgreSQL code in the old branches. > For TOAST, you can retoast tuples when > the heap page is read in. Yes, you have to re-toast it, which is the only possible method, but the problem is that you need a working TOAST index ... yeah, indexes are a different story. > I'm not sure what the problem with indexes is, > but you can split pages if necessary, for example. Indexes are a different story. As a first step I prefer to use REINDEX, but in the future I would prefer to extend pg_am and add an ampageconvert column which will point to a conversion function. 
Maybe we can extend it now and keep this column empty. > Assuming everyone agrees with point 1, could we focus on these issues? Yes, OK. I am going to clean up the code I have and I will send it soon. Tuple conversion is already part of the patch which I already sent; see access/heapam/htup_03.c. Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql
Tom Lane napsal(a): > > I concur that I don't want to see this patch adding more than the > absolute unavoidable minimum of overhead for data that meets the > "current" layout definition. I'm disturbed by the proposal to stick > overhead into tuple header access, for example. OK. I agree that it adds overhead. However, the patch also contains a Tuple and Page API cleanup, which is a general improvement: all functions should use HeapTuple access, not HeapTupleHeader. I used functions in the patch because I added multi-version access, but they can be macros. The main change to the page API is to add two functions, PageGetHeapTuple and PageGetIndexTuple. I also add functions like PageItemIsDead and so on. These changes are not only related to upgrade. I accept your complaint about tuples, but I think we should have a multi-version page access method. The main advantage is that indexes are then readable without any problem; it helps mostly with TOAST chunk data access, and it is necessary for re-toasting. OK, it only works until somebody changes the btree on-disk format, but for now it helps. Zdenek
I don't think this really qualifies as "in place upgrade" since it would mean creating a whole second copy of all your data. And it's only online for read-only queries too. I think we need a way to upgrade the pages in place and deal with any overflow data as exceptional cases or else there's hardly much point in the exercise. greg On 5 Nov 2008, at 07:32 AM, "Robert Haas" <robertmhaas@gmail.com> wrote: >> An old page which never goes away. New page formats are introduced >> for a >> reason -- to support new features. An old page lying around >> indefinitely means >> some pages can't support those new features. Just as an example, >> DBAs may be >> surprised to find out that large swathes of their database are >> still not >> protected by CRC checksums months or years after having upgraded to >> 8.4 (or >> even 8.5 or 8.6 or ...). They would certainly want a way to ensure >> all their >> data is upgraded. > > OK, I see your point. In the absence of any old snapshots, > convert-on-write allows you to forcibly upgrade the whole table by > rewriting all of the tuples into new pages: > > UPDATE table SET col = col > > In the absence of page expansion, you can put logic into VACUUM to > upgrade each page in place. > > If you have both old snapshots that you can't get rid of, and page > expansion, then you have a big problem, which I guess brings us back > to Heikki's point. > > ...Robert
Greg Stark napsal(a): > I don't think this really qualifies as "in place upgrade" since it would > mean creating a whole second copy of all your data. And it's only online > got read-only queries too. > > I think we need a way to upgrade the pages in place and deal with any > overflow data as exceptional cases or else there's hardly much point in > the exercise. It is an exceptional case between V3 and V4, and only on the heap, because you save space in varlena. But between V4 and V5 we will lose another 4 bytes in the page header -> the page header will be 28 bytes long while the tuple size stays the same. Try to get the raw free space on each page in an 8.3 database and you will probably see a lot of pages where the free space is 0. In my last experience it was something like 1-2% of pages. Zdenek
On Wed, Nov 05, 2008 at 03:04:42PM +0100, Zdenek Kotala wrote: > Greg Stark napsal(a): > It is exceptional case between V3 and V4 and only on heap, because you save > in varlena. But between V4 and V5 we will lost another 4 bytes in a page > header -> page header will be 28 bytes long but tuple size is same. > > Try to get raw free space on each page in 8.3 database and you probably see > a lot of pages where free space is 0. My last experience is something about > 1-2% of pages. Is this really such a big deal? You do the null-update on the last tuple of the page and then you do have enough room. So Phase 1 moves a few tuples to make room. Phase 2 actually converts the pages in place. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Please line up in a tree and maintain the heap invariant while > boarding. Thank you for flying nlogn airlines.
Martijn van Oosterhout napsal(a): > On Wed, Nov 05, 2008 at 03:04:42PM +0100, Zdenek Kotala wrote: >> Greg Stark napsal(a): >> It is exceptional case between V3 and V4 and only on heap, because you save >> in varlena. But between V4 and V5 we will lost another 4 bytes in a page >> header -> page header will be 28 bytes long but tuple size is same. >> >> Try to get raw free space on each page in 8.3 database and you probably see >> a lot of pages where free space is 0. My last experience is something about >> 1-2% of pages. > > Is this really such a big deal? You do the null-update on the last > tuple of the page and then you do have enough room. So Phase one moves > a few tuples to make room. Phase 2 actually converts the pages inplace. The problem is how to move a tuple from one page to another and keep the indexes in sync. One solution is to perform something like an "update" operation on the tuple, but you need an exclusive lock on the page and the pin counter has to be zero. And the question is where that is a safe operation. Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes: > Martijn van Oosterhout napsal(a): >> Is this really such a big deal? You do the null-update on the last >> tuple of the page and then you do have enough room. So Phase one moves >> a few tuples to make room. Phase 2 actually converts the pages inplace. > Problem is how to move tuple from page to another and keep indexes in > sync. One solution is to perform some think like "update" operation on > the tuple. But you need exclusive lock on the page and pin counter > have to be zero. And question is where it is safe operation. Hmm. Well, it may be a nasty problem but you have to find a solution. We're not going to guarantee that no update ever expands the data ... regards, tom lane
> Problem is how to move tuple from page to another and keep indexes in sync. > One solution is to perform some think like "update" operation on the tuple. > But you need exclusive lock on the page and pin counter have to be zero. And > question is where it is safe operation. But doesn't this problem go away if you do it in a transaction? You set xmax on the old tuple, write the new tuple, and add index entries just as you would for a normal update. When the old tuple is no longer visible to any transaction, you nuke it. ...Robert
"Robert Haas" <robertmhaas@gmail.com> writes: >> Problem is how to move tuple from page to another and keep indexes in sync. >> One solution is to perform some think like "update" operation on the tuple. >> But you need exclusive lock on the page and pin counter have to be zero. And >> question is where it is safe operation. > > But doesn't this problem go away if you do it in a transaction? You > set xmax on the old tuple, write the new tuple, and add index entries > just as you would for a normal update. But that doesn't actually solve the overflow problem on the old page... -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!
On Wed, Nov 05, 2008 at 09:41:52PM +0000, Gregory Stark wrote: > "Robert Haas" <robertmhaas@gmail.com> writes: > > >> Problem is how to move tuple from page to another and keep indexes in sync. > >> One solution is to perform some think like "update" operation on the tuple. > >> But you need exclusive lock on the page and pin counter have to be zero. And > >> question is where it is safe operation. > > > > But doesn't this problem go away if you do it in a transaction? You > > set xmax on the old tuple, write the new tuple, and add index entries > > just as you would for a normal update. > > But that doesn't actually solve the overflow problem on the old page... Sure it does. You move just enough tuples that you can convert the page without an overflow. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Please line up in a tree and maintain the heap invariant while > boarding. Thank you for flying nlogn airlines.
Martijn van Oosterhout <kleptog@svana.org> writes: > On Wed, Nov 05, 2008 at 09:41:52PM +0000, Gregory Stark wrote: >> "Robert Haas" <robertmhaas@gmail.com> writes: >> >> >> Problem is how to move tuple from page to another and keep indexes in sync. >> >> One solution is to perform some think like "update" operation on the tuple. >> >> But you need exclusive lock on the page and pin counter have to be zero. And >> >> question is where it is safe operation. >> > >> > But doesn't this problem go away if you do it in a transaction? You >> > set xmax on the old tuple, write the new tuple, and add index entries >> > just as you would for a normal update. >> >> But that doesn't actually solve the overflow problem on the old page... > > Sure it does. You move just enough tuples that you can convert the page > without an overflow. setting the xmax on a tuple doesn't "move" the tuple -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
>>> >> Problem is how to move tuple from page to another and keep indexes in sync. >>> >> One solution is to perform some think like "update" operation on the tuple. >>> >> But you need exclusive lock on the page and pin counter have to be zero. And >>> >> question is where it is safe operation. >>> > >>> > But doesn't this problem go away if you do it in a transaction? You >>> > set xmax on the old tuple, write the new tuple, and add index entries >>> > just as you would for a normal update. >>> >>> But that doesn't actually solve the overflow problem on the old page... >> >> Sure it does. You move just enough tuples that you can convert the page >> without an overflow. > > setting the xmax on a tuple doesn't "move" the tuple Nobody said it did. I think this would have been more clear if you had quoted my whole email instead of stopping in the middle: >> But doesn't this problem go away if you do it in a transaction? You >> set xmax on the old tuple, write the new tuple, and add index entries >> just as you would for a normal update. >> >> When the old tuple is no longer visible to any transaction, you nuke it. To spell this out in more detail: Suppose page 123 is a V3 page containing 6 tuples A, B, C, D, E, and F. We examine the page and determine that if we convert this to a V4 page, only five tuples will fit. So we need to get rid of one of the tuples. We begin a transaction and choose F as the victim. Searching the FSM, we discover that page 456 is a V4 page with available free space. We pin and lock pages 123 and 456 just as if we were doing a heap_update. We create F', the V4 version of F, and write it onto page 456. We set xmax on the original F. We perform the corresponding index updates and commit the transaction. Time passes. Eventually F becomes dead. We reclaim the space previously used by F, and page 123 now contains only 5 tuples. This is exactly what we needed in order to convert page 123 to a V4 page, so we do. ...Robert
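[Editor's note] Robert's scenario above can be sketched as a toy simulation (plain Python, not PostgreSQL code; the capacity constant and the page/tuple representation are invented for illustration only):

```python
# Toy model: a V3 page holds six tuples, but only five fit after
# conversion to V4, so one tuple must first be moved away via an
# update-like operation, as described in the email above.

V4_CAPACITY = 5   # invented: tuples that fit on a page in the new format

def convert_page(page, fsm_pages):
    """Move overflow tuples to V4 pages with free space (as in a normal
    heap_update), then mark the now-fitting page as converted."""
    while len(page["tuples"]) > V4_CAPACITY:
        victim = page["tuples"].pop()          # choose F as the victim
        target = next(p for p in fsm_pages     # consult the FSM
                      if p["version"] == 4
                      and len(p["tuples"]) < V4_CAPACITY)
        target["tuples"].append(victim)        # write F' onto page 456
        # in the real system: set xmax on F, update the indexes, commit,
        # and wait for F to become dead before reclaiming its space
    page["version"] = 4                        # page 123 now converts

page123 = {"version": 3, "tuples": list("ABCDEF")}
page456 = {"version": 4, "tuples": []}
convert_page(page123, [page456])
# page123 is now a V4 page with five tuples; "F" lives on page456
```

The sketch skips the MVCC wait entirely; it only shows the bookkeeping shape of "move just enough tuples, then convert".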
"Robert Haas" <robertmhaas@gmail.com> writes: > To spell this out in more detail: > Suppose page 123 is a V3 page containing 6 tuples A, B, C, D, E, and > F. We examine the page and determine that if we convert this to a V4 > page, only five tuples will fit. So we need to get rid of one of the > tuples. We begin a transaction and choose F as the victim. Searching > the FSM, we discover that page 456 is a V4 page with available free > space. We pin and lock pages 123 and 456 just as if we were doing a > heap_update. We create F', the V4 version of F, and write it onto > page 456. We set xmax on the original F. We peform the corresponding > index updates and commit the transaction. > Time passes. Eventually F becomes dead. We reclaim the space > previously used by F, and page 123 now contains only 5 tuples. This > is exactly what we needed in order to convert page F to a V4 page, so > we do. That's all fine and dandy, except that it presumes that you can perform SELECT/UPDATE/DELETE on V3 tuple versions; you can't just pretend that A-E aren't there until they get converted. Which is exactly the overhead we were looking to avoid. (Another small issue is exactly when you convert the index entries, should you be faced with an upgrade that requires that.) regards, tom lane
Tom Lane napsal(a): > "Robert Haas" <robertmhaas@gmail.com> writes: >> To spell this out in more detail: > >> Suppose page 123 is a V3 page containing 6 tuples A, B, C, D, E, and >> F. We examine the page and determine that if we convert this to a V4 >> page, only five tuples will fit. So we need to get rid of one of the >> tuples. We begin a transaction and choose F as the victim. Searching >> the FSM, we discover that page 456 is a V4 page with available free >> space. We pin and lock pages 123 and 456 just as if we were doing a >> heap_update. We create F', the V4 version of F, and write it onto >> page 456. We set xmax on the original F. We peform the corresponding >> index updates and commit the transaction. > >> Time passes. Eventually F becomes dead. We reclaim the space >> previously used by F, and page 123 now contains only 5 tuples. This >> is exactly what we needed in order to convert page F to a V4 page, so >> we do. > > That's all fine and dandy, except that it presumes that you can perform > SELECT/UPDATE/DELETE on V3 tuple versions; you can't just pretend that > A-E aren't there until they get converted. Which is exactly the > overhead we were looking to avoid. We want to avoid overhead on V$latest$ tuples, but I guess a small performance gap on old tuples is acceptable. The only way I currently see to make this work is to have multi-page-version processing, where an old tuple is converted when PageGetHeapTuple is called. However, as Heikki mentioned, tuple and page conversion is basic and the same for all upgrade methods, and it should be done first. Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql
> That's all fine and dandy, except that it presumes that you can perform > SELECT/UPDATE/DELETE on V3 tuple versions; you can't just pretend that > A-E aren't there until they get converted. Which is exactly the > overhead we were looking to avoid. I don't understand this comment at all. Unless you have some sort of magical wand in your back pocket that will instantaneously transform the entire database, there is going to be a period of time when you have to cope with both V3 and V4 pages. ISTM that what we should be talking about here is: (1) How are we going to do that in a way that imposes near-zero overhead once the entire database has been converted? (2) How are we going to do that in a way that is minimally invasive to the code? (3) Can we accomplish (1) and (2) while still retaining somewhat reasonable performance for V3 pages? Zdenek's initial proposal did this by replacing all of the tuple header macros with functions that were conditionalized on page version. I think we agree that's not going to work. That doesn't mean that there is no approach that can work, and we were discussing possible ways to make it work upthread until the thread got hijacked to discuss the right way of handling page expansion. Now that it seems we agree that a transaction can be used to move tuples onto new pages, I think we'd be well served to stop talking about page expansion and get back to the original topic: where and how to insert the hooks for V3 tuple handling. > (Another small issue is exactly when you convert the index entries, > should you be faced with an upgrade that requires that.) Zdenek set out his thoughts on this point upthread, no need to rehash here. ...Robert
Robert Haas wrote: > > That's all fine and dandy, except that it presumes that you can perform > > SELECT/UPDATE/DELETE on V3 tuple versions; you can't just pretend that > > A-E aren't there until they get converted. Which is exactly the > > overhead we were looking to avoid. > > I don't understand this comment at all. Unless you have some sort of > magical wand in your back pocket that will instantaneously transform > the entire database, there is going to be a period of time when you > have to cope with both V3 and V4 pages. ISTM that what we should be > talking about here is: > > (1) How are we going to do that in a way that imposes near-zero > overhead once the entire database has been converted? > (2) How are we going to do that in a way that is minimally invasive to the code? > (3) Can we accomplish (1) and (2) while still retaining somewhat > reasonable performance for V3 pages? > > Zdenek's initial proposal did this by replacing all of the tuple > header macros with functions that were conditionalized on page > version. I think we agree that's not going to work. That doesn't > mean that there is no approach that can work, and we were discussing > possible ways to make it work upthread until the thread got hijacked > to discuss the right way of handling page expansion. Now that it > seems we agree that a transaction can be used to move tuples onto new > pages, I think we'd be well served to stop talking about page > expansion and get back to the original topic: where and how to insert > the hooks for V3 tuple handling. I think the above is a good summary. For me, the problem with any approach that has information about prior-version block formats in the main code path is code complexity, and secondarily performance. I know there is concern that converting all blocks on read-in might expand the page beyond 8k in size. 
One idea Heikki had was to require that some tool be run on minor releases before a major upgrade to guarantee there is enough free space to convert the block to the current format on read-in, which would localize the information about prior block formats. We could release the tool in minor branches around the same time as a major release. Also consider that there are very few releases that expand the page size. For these reasons, the expand-the-page-beyond-8k problem should not be dictating what approach we take for upgrade-in-place because there are workarounds for the problem, and the problem is rare. I would like us to again focus on converting the pages to the current version format on read-in, and perhaps a tool to convert all old pages to the new format. FYI, we are also going to need the ability to convert all pages to the current format for multi-release upgrades. For example, if you did upgrade-in-place from 8.2 to 8.3, you are going to need to update all pages to the 8.3 format before doing upgrade-in-place to 8.4; perhaps vacuum can do something like this on a per-table basis, and we can record that status in a pg_class column. Also, consider that when we did PITR, we required commands before and after the tar so that there was a consistent API for PITR, and later had to add capabilities to those functions, but the user API didn't change. I envision a similar system where we have utilities to guarantee all pages have enough free space, and all pages are the current version, before allowing an upgrade-in-place to the next version. Such a consistent API will make the job for users easier and our job simpler, and with upgrade-in-place, where we have limited time and resources to code this for each release, simplicity is important. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes: > I envision a similar system where we have utilities to guarantee all > pages have enough free space, and all pages are the current version, > before allowing an upgrade-in-place to the next version. Such a > consistent API will make the job for users easier and our job simpler, > and with upgrade-in-place, where we have limited time and resources to > code this for each release, simplicity is important. An external utility doesn't seem like the right way to approach it. For example, given the need to ensure X amount of free space in each page, the only way to guarantee that would be to shut down the database while you run the utility over all the pages --- otherwise somebody might fill some page up again. And that completely defeats the purpose, which is to have minimal downtime during upgrade. I think we can have a notion of pre-upgrade maintenance, but it would have to be integrated into normal operations. For instance, if conversion to 8.4 requires extra free space, we'd make late releases of 8.3.x not only be able to force that to occur, but also tweak the normal code paths to maintain that minimum free space. The full concept as I understood it (dunno why Bruce left all these details out of his message) went like this: * Add a "format serial number" column to pg_class, and probably also pg_database. Rather like the frozenxid columns, this would have the semantics that all pages in a relation or database are known to have at least the specified format number. * There would actually be two serial numbers per release, at least for releases where pre-update prep work is involved --- for instance, between 8.3 and 8.4 there'd be an "8.3-and-a-half" format which is 8.3 but known ready to update to 8.4 (eg, enough free space available). Minor releases of 8.3 that appear with or subsequent to 8.4 release understand the "half" format number and how to upgrade to it. 
* VACUUM would be empowered, in the same way as it handles frozenxid maintenance, to update any less-than-the-latest-version pages and then fix the pg_class and pg_database entries. * We could mechanically enforce that you not update until the database is ready for it by checking pg_database.datformatversion during postmaster startup. So the update process would require users to install a suitably late version of 8.3, vacuum everything over a suitable maintenance window, then install 8.4, then perhaps vacuum everything again if they want to try to push page update work into specific maintenance windows. But the DB is up and functioning the whole time. regards, tom lane
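[Editor's note] The gating Tom describes can be illustrated roughly like this (a hedged sketch only; the serial-number constants, the `datformatversion` field placement, and all function names are invented for illustration, not actual PostgreSQL identifiers):

```python
# Sketch of the "format serial number" idea: each release understands
# pages at or above some minimum format, VACUUM raises a relation's
# recorded minimum (like frozenxid maintenance), and startup refuses
# to run against data that has not been prepared.

FORMAT_83 = 30        # hypothetical serial for the 8.3 layout
FORMAT_83_HALF = 35   # "8.3-and-a-half": 8.3 layout + reserved free space
FORMAT_84 = 40        # hypothetical serial for the 8.4 layout

MIN_FORMAT_FOR_84 = FORMAT_83_HALF   # 8.4 starts only on prepared data

def vacuum_relation(relation, target_format):
    """Upgrade any older pages, then record the new minimum format
    in the (hypothetical) catalog entry."""
    for page in relation["pages"]:
        if page["format"] < target_format:
            page["format"] = target_format   # rewrite page in newer layout
    relation["datformatversion"] = target_format

def postmaster_startup_check(relations):
    """Refuse to start 8.4 until everything is at least 8.3-and-a-half."""
    return all(rel["datformatversion"] >= MIN_FORMAT_FOR_84
               for rel in relations)

rel = {"pages": [{"format": FORMAT_83}], "datformatversion": FORMAT_83}
assert not postmaster_startup_check([rel])   # not yet prepared
vacuum_relation(rel, FORMAT_83_HALF)         # run under a late 8.3.x
assert postmaster_startup_check([rel])       # now safe to install 8.4
```

The point of the sketch is only the ordering: prep vacuums happen under the old release, and the new release mechanically verifies them before starting.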
Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > I envision a similar system where we have utilities to guarantee all > > pages have enough free space, and all pages are the current version, > > before allowing an upgrade-in-place to the next version. Such a > > consistent API will make the job for users easier and our job simpler, > > and with upgrade-in-place, where we have limited time and resources to > > code this for each release, simplicity is important. > > An external utility doesn't seem like the right way to approach it. > For example, given the need to ensure X amount of free space in each > page, the only way to guarantee that would be to shut down the database > while you run the utility over all the pages --- otherwise somebody > might fill some page up again. And that completely defeats the purpose, > which is to have minimal downtime during upgrade. > > I think we can have a notion of pre-upgrade maintenance, but it would > have to be integrated into normal operations. For instance, if > conversion to 8.4 requires extra free space, we'd make late releases > of 8.3.x not only be able to force that to occur, but also tweak the > normal code paths to maintain that minimum free space. > > The full concept as I understood it (dunno why Bruce left all these > details out of his message) went like this: Exactly. I didn't go into the implementation details to make it easier for people to see my general goals. Tom's implementation steps are the correct approach, assuming we can get agreement on the general goals. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
> An external utility doesn't seem like the right way to approach it. > For example, given the need to ensure X amount of free space in each > page, the only way to guarantee that would be to shut down the database > while you run the utility over all the pages --- otherwise somebody > might fill some page up again. And that completely defeats the purpose, > which is to have minimal downtime during upgrade. Agreed. > I think we can have a notion of pre-upgrade maintenance, but it would > have to be integrated into normal operations. For instance, if > conversion to 8.4 requires extra free space, we'd make late releases > of 8.3.x not only be able to force that to occur, but also tweak the > normal code paths to maintain that minimum free space. 1. This seems to fly in the face of the sort of thing we've traditionally back-patched. The code to make pages ready for upgrade to the next major release will not necessarily be straightforward (in fact it probably isn't, otherwise we wouldn't have insisted on a two-stage conversion process), which turns a seemingly safe minor upgrade into a potentially dangerous operation. 2. Just because I want to upgrade to 8.3.47 and get the latest bug fixes does not mean that I have any intention of upgrading to 8.4, and yet you've rearranged all of my pages to have useless free space in them (possibly at considerable and unexpected I/O cost for at least as long as the conversion is running). The second point could probably be addressed with a GUC but the first one certainly can't. 3. What about multi-release upgrades? Say someone wants to upgrade from 8.3 to 8.6. 8.6 only knows how to read pages that are 8.5-and-a-half or better, 8.5 only knows how to read pages that are 8.4-and-a-half or better, and 8.4 only knows how to read pages that are 8.3-and-a-half or better. So the user will have to upgrade to 8.3.MAX, then 8.4.MAX, then 8.5.MAX, and then 8.6. 
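[Editor's note] Robert's point 3 (the forced upgrade chain) amounts to a simple path computation; the sketch below is illustrative only, using the version numbers from the discussion and an invented helper:

```python
# If each major release can only read pages prepared by the previous
# release's final minor, a direct jump is impossible and upgrades must
# be chained through every intermediate major's final minor release.

MAJORS = ["8.3", "8.4", "8.5", "8.6"]

def upgrade_path(src, dst):
    """List the releases to pass through: src.MAX, each intermediate
    major's final minor (its .MAX), and finally dst itself."""
    i, j = MAJORS.index(src), MAJORS.index(dst)
    return [f"{v}.MAX" for v in MAJORS[i:j]] + [dst]

print(upgrade_path("8.3", "8.6"))
# ['8.3.MAX', '8.4.MAX', '8.5.MAX', '8.6']
```

This matches the scenario in the email: 8.3.MAX, then 8.4.MAX, then 8.5.MAX, then 8.6 — one full vacuum-and-prepare cycle per hop.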
It seems to me that if there is any way to put all of the logic to handle old page versions in the new code that would be much better, especially if it's an optional feature that can be compiled in or not. Then when it's time to upgrade from 8.3 to 8.6 you could do: ./configure --with-upgrade-83 --with-upgrade-84 --with-upgrade-85 but if you don't need the code to handle old page versions you can: ./configure --without-upgrade-85 Admittedly, this requires making the new code capable of rearranging pages to create free space when necessary, and to be able to continue to execute queries while doing it, but ways of doing this have been proposed. The only uncertainty is as to whether the performance and code complexity can be kept manageable, but I don't believe that question has been explored to the point where we should be ready to declare defeat. ...Robert
Robert Haas wrote: > The second point could probably be addressed with a GUC but the first > one certainly can't. > > 3. What about multi-release upgrades? Say someone wants to upgrade > from 8.3 to 8.6. 8.6 only knows how to read pages that are > 8.5-and-a-half or better, 8.5 only knows how to read pages that are > 8.4-and-a-half or better, and 8.4 only knows how to read pages that > are 8.3-and-a-half or better. So the user will have to upgrade to > 8.3.MAX, then 8.4.MAX, then 8.5.MAX, and then 8.6. Yes. > It seems to me that if there is any way to put all of the logic to > handle old page versions in the new code that would be much better, > especially if it's an optional feature that can be compiled in or not. > Then when it's time to upgrade from 8.3 to 8.6 you could do: > > ./configure --with-upgrade-83 --with-upgrade-84 --with-upgrade-85 > > but if you don't need the code to handle old page versions you can: > > ./configure --without-upgrade-85 > > Admittedly, this requires making the new code capable of rearranging > pages to create free space when necessary, and to be able to continue > to execute queries while doing it, but ways of doing this have been > proposed. The only uncertainty is as to whether the performance and > code complexity can be kept manageable, but I don't believe that > question has been explored to the point where we should be ready to > declare defeat. And almost guarantee that the job will never be completed, or tested fully. Remember that in-place upgrades would be pretty painless so doing multiple major upgrades should not be a difficult requirement, or they can dump/reload their data to skip it. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Tom Lane wrote: > I think we can have a notion of pre-upgrade maintenance, but it would > have to be integrated into normal operations. For instance, if > conversion to 8.4 requires extra free space, we'd make late releases > of 8.3.x not only be able to force that to occur, but also tweak the > normal code paths to maintain that minimum free space. Agreed, the backend needs to be modified to reserve the space. > The full concept as I understood it (dunno why Bruce left all these > details out of his message) went like this: > > * Add a "format serial number" column to pg_class, and probably also > pg_database. Rather like the frozenxid columns, this would have the > semantics that all pages in a relation or database are known to have at > least the specified format number. > > * There would actually be two serial numbers per release, at least for > releases where pre-update prep work is involved --- for instance, > between 8.3 and 8.4 there'd be an "8.3-and-a-half" format which is > 8.3 but known ready to update to 8.4 (eg, enough free space available). > Minor releases of 8.3 that appear with or subsequent to 8.4 release > understand the "half" format number and how to upgrade to it. > > * VACUUM would be empowered, in the same way as it handles frozenxid > maintenance, to update any less-than-the-latest-version pages and then > fix the pg_class and pg_database entries. > > * We could mechanically enforce that you not update until the database > is ready for it by checking pg_database.datformatversion during > postmaster startup. Adding catalog columns seems rather complicated, and not back-patchable. Not backpatchable means that we'd need to be sure now that the format serial numbers are enough for the upcoming 8.4-8.5 upgrade. I imagined that you would have just a single cluster-wide variable, a GUC perhaps, indicating how much space should be reserved by updates/inserts. 
Then you'd have an additional program, perhaps a new contrib module, that sets the variable to the right value for the version you're upgrading, and scans through all tables, moving tuples so that every page has enough free space for the upgrade. After that's done, it'd set a flag in the data directory indicating that the cluster is ready for upgrade. The tool could run concurrently with normal activity, so you could just let it run for as long as it takes. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
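[Editor's note] Heikki's scheme — a cluster-wide reserved-space setting honored by the normal insert path, plus a concurrent tool that shuffles tuples until every page has that much slack — can be sketched as follows (plain Python; page sizes and every name here are made up for illustration, not PostgreSQL APIs):

```python
# Pages are modeled as lists of tuple sizes (bytes). The hypothetical
# GUC `upgrade_reserved_space` makes the normal insert/update path keep
# slack on each page; `prepare_cluster` is the contrib-style tool that
# moves tuples until every page satisfies the reservation.

PAGE_SIZE = 8192
upgrade_reserved_space = 0   # hypothetical GUC, in bytes

def page_free(page):
    return PAGE_SIZE - sum(page)

def page_has_room(page, tuple_size):
    """Normal code path: keep `upgrade_reserved_space` bytes free."""
    return page_free(page) - tuple_size >= upgrade_reserved_space

def prepare_cluster(pages, reserve):
    """Set the reservation, then move tuples until every page has
    `reserve` bytes free; returns the ready-for-upgrade flag."""
    global upgrade_reserved_space
    upgrade_reserved_space = reserve
    for page in pages:
        while page_free(page) < reserve:
            victim = page.pop()               # move a tuple elsewhere
            dest = next(p for p in pages if p is not page
                        and page_has_room(p, victim))
            dest.append(victim)
    return all(page_free(p) >= reserve for p in pages)

pages = [[4000, 4000], [1000]]                # first page is nearly full
assert prepare_cluster(pages, 500)            # cluster now ready to upgrade
```

Because the reservation is also enforced on the normal code path, the tool can run concurrently with regular activity without its work being undone — the property Tom identified as essential.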
> And almost guarantee that the job will never be completed, or tested > fully. Remember that in-place upgrades would be pretty painless so > doing multiple major upgrades should not be a difficult requirement, or > they can dump/reload their data to skip it. Regardless of what design is chosen, there's no requirement that we support in-place upgrade from 8.3 to 8.6, or even 8.4 to 8.6, in one shot. But the design that you and Tom are proposing pretty much ensures that it will be impossible. But that's certainly the least important reason not to do it this way. I think this comment from Heikki is pretty revealing: > Adding catalog columns seems rather complicated, and not back-patchable. Not backpatchable means that we'd need to be sure now > that the format serial numbers are enough for the upcoming 8.4-8.5 upgrade. That means, in essence, that the earliest possible version that could be in-place upgraded would be an 8.4 system - we are giving up completely on in-place upgrade to 8.4 from any earlier version (which personally I thought was the whole point of this feature in the first place). And we'll only be able to in-place upgrade to 8.5 if the unproven assumption that these catalog changes are sufficient turns out to be true, or if whatever other changes turn out to be necessary are back-patchable. ...Robert
"Robert Haas" <robertmhaas@gmail.com> writes: > That means, in essence, that the earliest possible version that could > be in-place upgraded would be an 8.4 system - we are giving up > completely on in-place upgrade to 8.4 from any earlier version (which > personally I thought was the whole point of this feature in the first > place). Quite honestly, given where we are in the schedule and the lack of consensus about how to do this, I think we would be well advised to decide right now to forget about supporting in-place upgrade to 8.4, and instead work on allowing in-place upgrades from 8.4 onwards. Shooting for a general-purpose does-it-all scheme that can handle old versions that had no thought of supporting such updates is likely to ensure that we end up with *NOTHING*. What Bruce is proposing, I think, is that we intentionally restrict what we want to accomplish to something that might be within reach now and also sustainable over the long term. Planning to update any version to any other version is *not* sustainable --- we haven't got the resources nor the interest to create large amounts of conversion code. regards, tom lane
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Adding catalog columns seems rather complicated, and not back-patchable. Agreed, we'd not be able to make them retroactively appear in 8.3. > I imagined that you would have just a single cluster-wide variable, a > GUC perhaps, indicating how much space should be reserved by > updates/inserts. Then you'd have an additional program, perhaps a new > contrib module, that sets the variable to the right value for the > version you're upgrading, and scans through all tables, moving tuples so > that every page has enough free space for the upgrade. After that's > done, it'd set a flag in the data directory indicating that the cluster > is ready for upgrade. Possibly that could work. The main thing is to have a way of being sure that the prep work has been completed on every page of the database. The disadvantage of not having catalog support is that you'd have to complete the entire scan operation in one go to be sure you'd hit everything. Another thought here is that I don't think we are yet committed to any changes that require extra space between 8.3 and 8.4, are we? The proposed addition of CRC words could be put off to 8.5, for instance. So it seems at least within reach to not require any preparatory steps for 8.3-to-8.4, and put the infrastructure in place now to support such steps in future go-rounds. regards, tom lane
On Thu, 6 Nov 2008, Tom Lane wrote: > Another thought here is that I don't think we are yet committed to any > changes that require extra space between 8.3 and 8.4, are we? The > proposed addition of CRC words could be put off to 8.5, for instance. I was just staring at that code as you wrote this thinking about the same thing. CRCs are a great feature I'd really like to see. On the other hand, announcing that 8.4 features in-place upgrades for 8.3 databases, and that the project has laid the infrastructure such that future releases will also upgrade in-place, would IMHO be the biggest positive announcement of the new release by a large margin. At least then new large (>1TB) installs could kick off on either the stable 8.3 or 8.4 knowing they'd never be forced to deal with dump/reload, whereas right now there is no reasonable solution for them that involves PostgreSQL (I just crossed 3TB on a system last month and I'm not looking forward to its future upgrades). Two questions come to mind here: -If you reduce the page layout upgrade problem to "convert from V4 to V5 adding support for CRCs", is there a worthwhile simpler path to handling that without dragging the full complexity of the older page layout changes in? -Is it worth considering making CRCs an optional compile-time feature, and that (for now at least) you couldn't get them and the in-place upgrade at the same time? Stepping back for a second, the idea that in-place upgrade is only worthwhile if it yields zero downtime isn't necessarily the case. Even having an offline-only upgrade tool to handle the more complicated situations where tuples have to be squeezed onto another page would still be a major improvement over the current situation. The thing that you have to recognize here is that dump/reload is extremely slow because of bottlenecks in the COPY process. That makes for a large amount of downtime--many hours isn't unusual. 
If older version upgrade downtime was reduced to how long it takes to run a "must scan every page and fiddle with it if full" tool, that would still be a giant improvement over the current state of things. If Zdenek's figures that only a small percentage of pages will need such adjustment hold up, that should take only some factor longer than a sequential scan of the whole database. That's not instant, but it's at least an order of magnitude faster than a dump/reload on a big system. The idea that you're going to get in-place upgrade all the way back to 8.2 without taking the database down for even a little bit to run such a utility is hard to pull off, and it's impressive that Zdenek and everyone else involved has gotten so close to doing it. I personally am on the fence as to whether it's worth paying even the 1% penalty for that implementation all the time just to get in-place upgrades. If an offline utility with reasonable (scan instead of dump/reload) downtime and closer to zero overhead when finished was available instead, that might be a more reasonable trade-off to make for handling older releases. There are so many bottlenecks in the older versions that you're less likely to find a database too large to dump and reload there anyway. It would also be the case that improvements to that offline utility could continue after 8.4 proper was completely frozen. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
> The idea that you're going to get in-place upgrade all the way back to 8.2 > without taking the database down for a even little bit to run such a utility > is hard to pull off, and it's impressive that Zdenek and everyone else > involved has gotten so close to doing it. I think we should at least wait to see what the next version of his patch looks like before making any final decisions. ...Robert
Greg Smith <gsmith@gregsmith.com> writes: > On Thu, 6 Nov 2008, Tom Lane wrote: >> Another thought here is that I don't think we are yet committed to any >> changes that require extra space between 8.3 and 8.4, are we? The >> proposed addition of CRC words could be put off to 8.5, for instance. > I was just staring at that code as you wrote this thinking about the same > thing. ... > -Is it worth considering making CRCs an optional compile-time feature, and > that (for now at least) you couldn't get them and the in-place upgrade at > the same time? Hmm ... might be better than not offering them in 8.4 at all, but the thing is that then you are asking packagers to decide for their customers which is more important. And I'd bet you anything you want that in-place upgrade would be their choice. Also, having such an option would create extra complexity for 8.4-to-8.5 upgrades. regards, tom lane
On Thu, 6 Nov 2008, Tom Lane wrote: >> -Is it worth considering making CRCs an optional compile-time feature, and >> that (for now at least) you couldn't get them and the in-place upgrade at >> the same time? > > Hmm ... might be better than not offering them in 8.4 at all, but the > thing is that then you are asking packagers to decide for their > customers which is more important. And I'd bet you anything you want > that in-place upgrade would be their choice. I was thinking of something similar to how --enable-thread-safety has been rolled out. It could be hanging around there and available to those who want it in their build, even though it might not be available by default in a typical mainstream distribution. Since there's already a GUC for toggling the checksums in the code, internally it could work like debug_assertions where you only get that option if support was compiled in appropriately. Just a thought I wanted to throw out there, if it makes eventual upgrades from 8.4 more complicated it may not be worth even considering. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Heikki Linnakangas napsal(a): > Tom Lane wrote: >> I think we can have a notion of pre-upgrade maintenance, but it would >> have to be integrated into normal operations. For instance, if >> conversion to 8.4 requires extra free space, we'd make late releases >> of 8.3.x not only be able to force that to occur, but also tweak the >> normal code paths to maintain that minimum free space. > > Agreed, the backend needs to be modified to reserve the space. > >> The full concept as I understood it (dunno why Bruce left all these >> details out of his message) went like this: >> >> * Add a "format serial number" column to pg_class, and probably also >> pg_database. Rather like the frozenxid columns, this would have the >> semantics that all pages in a relation or database are known to have at >> least the specified format number. >> >> * There would actually be two serial numbers per release, at least for >> releases where pre-update prep work is involved --- for instance, >> between 8.3 and 8.4 there'd be an "8.3-and-a-half" format which is >> 8.3 but known ready to update to 8.4 (eg, enough free space available). >> Minor releases of 8.3 that appear with or subsequent to 8.4 release >> understand the "half" format number and how to upgrade to it. >> >> * VACUUM would be empowered, in the same way as it handles frozenxid >> maintenance, to update any less-than-the-latest-version pages and then >> fix the pg_class and pg_database entries. >> >> * We could mechanically enforce that you not update until the database >> is ready for it by checking pg_database.datformatversion during >> postmaster startup. > > Adding catalog columns seems rather complicated, and not back-patchable. > Not backpatchable means that we'd need to be sure now that the format > serial numbers are enough for the upcoming 8.4-8.5 upgrade. Reloptions are suitable for keeping the amount of reserved space, and they can be backported into 8.3 and 8.2. And of course there is no problem converting 8.1->8.2.
For the backported branches it would be better to combine an internal modification - reserving the space - with e.g. a stored procedure which checks all relations. In 8.4 and newer, pg_class could be extended with new attributes. > I imagined that you would have just a single cluster-wide variable, a > GUC perhaps, indicating how much space should be reserved by > updates/inserts. You sometimes need a different reserved size for different types of relations. For example, on 32-bit x86 you don't need to reserve space for heap pages, but you do need to for indexes (between v3->v4). It is better to use reloptions and have a pre-upgrade procedure set this information correctly. > Then you'd have an additional program, perhaps a new > contrib module, that sets the variable to the right value for the > version you're upgrading, and scans through all tables, moving tuples so > that every page has enough free space for the upgrade. After that's > done, it'd set a flag in the data directory indicating that the cluster > is ready for upgrade. I prefer to have this information in pg_class, where it is accessible by SQL commands. pg_class should also contain information about the last checked page, to prevent repeated checks of very large tables. > The tool could run concurrently with normal activity, so you could just > let it run for as long as it takes. Agreed. Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql
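The per-relation reservation Zdenek describes amounts to a simple invariant in the insert path: a tuple may only be placed on a page if the page would still have the reserved headroom afterwards. A minimal model of that check, assuming the reserved amount comes from the relation's reloptions (all names here are illustrative, not backend APIs):

```c
#include <stdbool.h>
#include <stddef.h>

/* Would a tuple of tuple_len bytes fit on a page with page_free bytes
 * free, while still leaving upgrade_reserved_space bytes untouched?
 * upgrade_reserved_space would be read from reloptions, set per relkind
 * by a pre-upgrade procedure (e.g. indexes but not heap on 32-bit x86). */
static bool
page_has_room(size_t page_free, size_t tuple_len, size_t upgrade_reserved_space)
{
    if (tuple_len > page_free)
        return false;
    return (page_free - tuple_len) >= upgrade_reserved_space;
}
```

With the reservation set to zero the check degrades to the ordinary free-space test, which is why back-porting it into 8.2/8.3 minor releases is plausible: it changes nothing until the pre-upgrade procedure turns it on.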
Tom Lane napsal(a): > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> Adding catalog columns seems rather complicated, and not back-patchable. > > Agreed, we'd not be able to make them retroactively appear in 8.3. > >> I imagined that you would have just a single cluster-wide variable, a >> GUC perhaps, indicating how much space should be reserved by >> updates/inserts. Then you'd have an additional program, perhaps a new >> contrib module, that sets the variable to the right value for the >> version you're upgrading, and scans through all tables, moving tuples so >> that every page has enough free space for the upgrade. After that's >> done, it'd set a flag in the data directory indicating that the cluster >> is ready for upgrade. > > Possibly that could work. The main thing is to have a way of being sure > that the prep work has been completed on every page of the database. > The disadvantage of not having catalog support is that you'd have to > complete the entire scan operation in one go to be sure you'd hit > everything. I prefer to have catalog support. Especially on very large tables it helps when somebody stops the pre-upgrade script for some reason. > Another thought here is that I don't think we are yet committed to any > changes that require extra space between 8.3 and 8.4, are we? The > proposed addition of CRC words could be put off to 8.5, for instance. > So it seems at least within reach to not require any preparatory steps > for 8.3-to-8.4, and put the infrastructure in place now to support such > steps in future go-rounds. Yeah. We still have V4 without any storage modification (except the HASH index). However, I think if reloptions are used for storing information about reserved space, then it shouldn't be a problem. But we need to be sure that it is possible. Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql
Tom Lane napsal(a): > I think we can have a notion of pre-upgrade maintenance, but it would > have to be integrated into normal operations. For instance, if > conversion to 8.4 requires extra free space, we'd make late releases > of 8.3.x not only be able to force that to occur, but also tweak the > normal code paths to maintain that minimum free space. OK. I will focus on this. I guess this approach revives my hook patch: http://archives.postgresql.org/pgsql-hackers/2008-04/msg00990.php > The full concept as I understood it (dunno why Bruce left all these > details out of his message) went like this: > > * Add a "format serial number" column to pg_class, and probably also > pg_database. Rather like the frozenxid columns, this would have the > semantics that all pages in a relation or database are known to have at > least the specified format number. > > * There would actually be two serial numbers per release, at least for > releases where pre-update prep work is involved --- for instance, > between 8.3 and 8.4 there'd be an "8.3-and-a-half" format which is > 8.3 but known ready to update to 8.4 (eg, enough free space available). > Minor releases of 8.3 that appear with or subsequent to 8.4 release > understand the "half" format number and how to upgrade to it. I prefer to store the latest processed block. InvalidBlockNumber would mean nothing is processed, and 0 would mean everything is already reserved. I suggest processing the relation backward; that should avoid re-checking newly extended blocks, which will already be set up correctly. > * VACUUM would be empowered, in the same way as it handles frozenxid > maintenance, to update any less-than-the-latest-version pages and then > fix the pg_class and pg_database entries. > > * We could mechanically enforce that you not update until the database > is ready for it by checking pg_database.datformatversion during > postmaster startup. I don't understand you here. Do you mean the old server version or the new server version?
Or who will perform this check? Do not forget that we currently do catalog conversion via dump and import, which loses all extended information. Thanks Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql
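Zdenek's last-processed-block bookkeeping can be modeled as below: scan from the end of the relation toward block 0, recording how far the scan has gotten, so a stopped run can resume and blocks extended after the scan started (which are created in the new layout) are never visited. This is only a sketch of the proposed semantics; the type and sentinel mimic the backend's BlockNumber but the function itself is hypothetical.

```c
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Run (or resume) the backward pre-upgrade scan.
 * last_processed == InvalidBlockNumber: nothing processed yet.
 * Return value 0: every block has its space reserved. */
static BlockNumber
preupgrade_scan(BlockNumber nblocks, BlockNumber last_processed)
{
    BlockNumber blkno;

    if (nblocks == 0)
        return 0;                       /* empty relation: trivially done */

    /* resume where an interrupted run left off */
    blkno = (last_processed == InvalidBlockNumber) ? nblocks : last_processed;

    while (blkno > 0)
    {
        blkno--;
        /* reserve_space_on_block(rel, blkno);  -- the actual per-page work,
         * elided here; the stored counter would be updated as we go */
    }
    return blkno;                       /* 0 == everything reserved */
}
```

Note the asymmetry the scheme relies on: forward growth of the relation cannot invalidate a backward scan, because new blocks are born already conforming.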
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes: > Tom Lane napsal(a): >> * Add a "format serial number" column to pg_class, and probably also >> pg_database. Rather like the frozenxid columns, this would have the >> semantics that all pages in a relation or database are known to have at >> least the specified format number. > I prefer to have latest processed block. InvalidBlockNumber would mean > nothing is processed and 0 means everything is already reserved. I > suggest to process it backward. It should prevent to check new > extended block which will be already correctly setup. That seems bizarre and not very helpful. In the first place, if we're driving it off vacuum there would be no opportunity for recording a half-processed state value. In the second place, this formulation fails to provide any evidence of *what* processing you completed or didn't complete. In a multi-step upgrade sequence I think it's going to be a mess if we aren't explicit about that. regards, tom lane
On Nov 6, 2008, at 1:31 PM, Bruce Momjian wrote: >> 3. What about multi-release upgrades? Say someone wants to upgrade >> from 8.3 to 8.6. 8.6 only knows how to read pages that are >> 8.5-and-a-half or better, 8.5 only knows how to read pages that are >> 8.4-and-a-half or better, and 8.4 only knows how to read pages that >> are 8.3-and-a-half or better. So the user will have to upgrade to >> 8.3.MAX, then 8.4.MAX, then 8.5.MAX, and then 8.6. > > Yes. I think that's pretty seriously un-desirable. It's not at all uncommon for databases to stick around for a very long time and then jump ahead many versions. I don't think we want to tell people they can't do that. More importantly, I think we're barking up the wrong tree by putting migration knowledge into old versions. All that the old versions need to do is guarantee a specific amount of free space per page. We should provide a mechanism to tell a cluster what that free space requirement is, and not hard-code it into the backend. Unless I'm mistaken, there are only two cases we care about for additional space: per-page and per-tuple. Those requirements could also vary for different types of pg_class objects. What we need is an API that allows an administrator to tell the database to start setting this space aside. One possibility: pg_min_free_space( version, relkind, bytes_per_page, bytes_per_tuple ); pg_min_free_space_index( version, indexkind, bytes_per_page, bytes_per_tuple ); version: This would be provided as a safety mechanism. You would have to provide the major version that matches what the backend is running. See below for an example. relkind: Essentially, heap vs toast, though I suppose it's possible we might need this for sequences. indexkind: Because we support different types of indexes, I think we need to handle them differently than heap/toast. If we wanted, we could have a single function that demands that indexkind is NULL if relkind != 'index'. bytes_per_(page|tuple): obvious. 
:) Once we have an API, we need to get users to make use of it. I'm thinking add something like the following to the release notes: "To upgrade from a prior version to 8.4, you will need to run some of the following commands, depending on what version you are currently using: For version 8.3: SELECT pg_min_free_space( '8.3', 'heap', 4, 12 ); SELECT pg_min_free_space( '8.3', 'toast', 4, 12 ); For version 8.2: SELECT pg_min_free_space( '8.2', 'heap', 14, 12 ); SELECT pg_min_free_space( '8.2', 'toast', 14, 12 ); SELECT pg_min_free_space_index( '8.2', 'b-tree', 4, 4);" (Note I'm just pulling numbers out of thin air in this example.) As you can see, we pass in the version number to ensure that if someone accidentally cut and pastes the wrong stuff they know what they did wrong immediately. One downside to this scheme is that it doesn't provide a mechanism to ensure that all required minimum free space requirements were passed in. Perhaps we want a function that takes an array of complex types and forces you to supply information for all known storage mechanisms. Another possibility would be to pass in some kind of binary format that contains a checksum. Even if we do come up with a pretty fool-proof way to tell the old version what free space it needs to set aside, I think we should still have a mechanism for the new version to know exactly what the old version has set aside, and if it's actually been accomplished or not. One option that comes to mind is to add min_free_space_per_page and min_free_space_per_tuple to pg_class. Normally these fields would be NULL; the old version would only set them once it had verified that all pages in a given relation met those requirements (presumably via vacuum). The new version would check all these values on startup to ensure they made sense. OTOH, we might not want to go mucking around with changing the catalog for older versions (I'm not even sure if we can). 
So perhaps it would be better to store this information in a separate table, or maybe a separate file. That might be best anyway; we generally wouldn't need this information, so it would be nice if it wasn't bloating pg_class all the time. -- Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828
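The version argument in Decibel!'s proposed pg_min_free_space() is purely a safety interlock: a cut-and-pasted command for the wrong release should fail immediately. A small model of that behavior, hedged: no such function exists in the backend, the names and numbers are taken from the example above, and toast/index handling is elided.

```c
#include <string.h>
#include <stdbool.h>

#define RUNNING_MAJOR_VERSION "8.3"     /* what this server is running */

struct free_space_req
{
    int bytes_per_page;
    int bytes_per_tuple;
};

static struct free_space_req heap_req;  /* would live in reloptions/catalog */

/* Model of the proposed pg_min_free_space(version, relkind, per_page,
 * per_tuple): reject the call unless the caller names the running major
 * version, then record the requirement for the given relkind. */
static bool
pg_min_free_space(const char *version, const char *relkind,
                  int bytes_per_page, int bytes_per_tuple)
{
    if (strcmp(version, RUNNING_MAJOR_VERSION) != 0)
        return false;                   /* wrong release: refuse outright */
    if (strcmp(relkind, "heap") == 0)
    {
        heap_req.bytes_per_page = bytes_per_page;
        heap_req.bytes_per_tuple = bytes_per_tuple;
        return true;
    }
    return false;                       /* toast/sequence/index cases elided */
}
```

The checksum-array variant Decibel! mentions would tighten this further, making it impossible to supply a complete-looking but partial set of requirements.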
Decibel! <decibel@decibel.org> writes: > I think that's pretty seriously un-desirable. It's not at all > uncommon for databases to stick around for a very long time and then > jump ahead many versions. I don't think we want to tell people they > can't do that. Of course they can do that --- they just have to do it one version at a time. I think it's time for people to stop asking for the moon and realize that if we don't constrain this feature pretty darn tightly, we will have *nothing at all* for 8.4. Again. regards, tom lane
On Sun, 2008-11-09 at 20:02 -0500, Tom Lane wrote: > Decibel! <decibel@decibel.org> writes: > > I think that's pretty seriously un-desirable. It's not at all > > uncommon for databases to stick around for a very long time and then > > jump ahead many versions. I don't think we want to tell people they > > can't do that. > > Of course they can do that --- they just have to do it one version at a > time. > > I think it's time for people to stop asking for the moon and realize > that if we don't constrain this feature pretty darn tightly, we will > have *nothing at all* for 8.4. Again. Gotta go with Tom on this one. The idea that we would somehow upgrade from 8.1 to 8.4 is silly. Yes, it will be unfortunate for those running 8.1, but keeping track of multiple versions like that is going to be entirely too expensive. At some point it won't matter but right now it really does. Joshua D. Drake > > regards, tom lane > --
Decibel! napsal(a): > Unless I'm mistaken, there are only two cases we care about for > additional space: per-page and per-tuple. Yes. And maybe the special space of indexes could be extended, but that is covered by the per-page setting. > Those requirements could also > vary for different types of pg_class objects. What we need is an API > that allows an administrator to tell the database to start setting this > space aside. One possibility: We need an API or mechanism for how in-place upgrade will set this up; it must be driven by the in-place upgrade itself. <snip> > relkind: Essentially, heap vs toast, though I suppose it's possible we > might need this for sequences. Sequences are converted during the catalog upgrade. <snip> > Once we have an API, we need to get users to make use of it. I'm > thinking add something like the following to the release notes: > > "To upgrade from a prior version to 8.4, you will need to run some of > the following commands, depending on what version you are currently using: > <snip> It is too complicated. For one thing, it also depends on the architecture, and it can easily be computed by the in-place upgrade script. All you need to do is run a script which does the whole setup for you. You could obtain it from the next version (IIRC Oracle does it this way), or we could add this configuration script to the previous version during a minor update. > > OTOH, we might not want to go mucking around with changing the catalog > for older versions (I'm not even sure if we can). So perhaps it would be > better to store this information in a separate table, or maybe a > separate file. That might be best anyway; we generally wouldn't need > this information, so it would be nice if it wasn't bloating pg_class all > the time. That is why I selected reloptions for storing this configuration parameter: they are supported from 8.2 onward, and the upgrade from 8.1->8.2 works fine. Zdenek
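Zdenek's point that the numbers depend on the architecture can be made concrete: the space a page needs to reserve is whatever the on-disk structures grow by between layout versions, and that growth depends on alignment on the platform. A sketch under stated assumptions, with placeholder figures rather than the real v3->v4 deltas:

```c
#include <stddef.h>

enum relkind { RK_HEAP, RK_INDEX };

/* How many bytes per page must be kept free before the upgrade?
 * This is what an upgrade-supplied script would compute, instead of
 * hard-coding numbers in release notes.  The figures are placeholders:
 * we pretend the index special area grows by one MAXALIGN'd word in the
 * new layout while heap pages need nothing, which models the 32-bit x86
 * case mentioned above (indexes need headroom, heap does not). */
static size_t
reserved_bytes_per_page(enum relkind kind, size_t maxalign)
{
    if (kind == RK_INDEX)
        return maxalign;    /* placeholder growth of the special area */
    return 0;               /* heap pages: no reservation on this platform */
}
```

A pre-upgrade script shipped with (or fetched from) the new version would call something like this for every relation and write the result into its reloptions, so administrators never have to transcribe per-platform tables by hand.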
Tom Lane wrote: > Decibel! <decibel@decibel.org> writes: > >> I think that's pretty seriously un-desirable. It's not at all >> uncommon for databases to stick around for a very long time and then >> jump ahead many versions. I don't think we want to tell people they >> can't do that. >> > > Of course they can do that --- they just have to do it one version at a > time. Also, people may be less likely to stick with an old outdated version for years and years if the upgrade process is easier.
On Mon, 2008-11-10 at 09:14 -0500, Matthew T. O'Connor wrote: > Tom Lane wrote: > > Decibel! <decibel@decibel.org> writes: > > > >> I think that's pretty seriously un-desirable. It's not at all > >> uncommon for databases to stick around for a very long time and then > >> jump ahead many versions. I don't think we want to tell people they > >> can't do that. > >> > > > > Of course they can do that --- they just have to do it one version at a > > time. > > Also, people may be less likely to stick with an old outdated version > for years and years if the upgrade process is easier. Kind of OT, but I don't agree with this. There will always be those who are willing to upgrade just because they can, but the smart play is to upgrade because you need to. If anything, in-place upgrade is just going to remove the last real business and technical barrier to using PostgreSQL in enterprises. Joshua D. Drake > > --
On Nov 9, 2008, at 11:09 PM, Joshua D. Drake wrote: >> I think it's time for people to stop asking for the moon and realize >> that if we don't constrain this feature pretty darn tightly, we will >> have *nothing at all* for 8.4. Again. > > Gotta go with Tom on this one. The idea that we would somehow upgrade > from 8.1 to 8.4 is silly. Yes it will be unfortunate for those running > 8.1 but keeping track of multi version like that is going to be > entirely > too expensive. > I agree as well. If we can get at least the base-level stuff into 8.4 so that 8.5 and beyond are in-place upgradable, then that is a huge win. If we could support 8.2 or 8.3 or 6.5 :) that would be nice, but I think dealing with everything retroactively will cause our heads to explode and a mountain of awful code to arise. If we say "8.4 and beyond will be upgradable" we can toss in everything we think we'll need to deal with it and not worry about the retroactive case (unless someone has a really clever(tm) idea!) This can't be an original problem to solve; too many other databases do it as well. -- Jeff Trout <jeff@jefftrout.com> http://www.stuarthamm.net/ http://www.dellsmartexitin.com/
Zdenek - I am a bit murky on where we stand with upgrade-in-place in terms of reviewing. Initially, you had submitted four patches for this commitfest: 1. htup and bufpage API clean up 2. HeapTuple version extension + code cleanup 3. In-place online upgrade 4. Extending pg_class info + more flexible TOAST chunk size I think that it was decided that replacing the heap tuple access macros with function calls was not acceptable, so I have moved patches #1 and #2 to the "Returned with feedback" section. I thought that perhaps the third patch could be salvaged, but the consensus seemed to be to go in a new direction, so I'm thinking that one should probably be moved to "Returned with feedback" as well. However, I'm not clear on whether you will be submitting something else instead and whether that thing should be considered material for this commitfest. Can you let me know how you are thinking about this? With respect to #4, I know that Alvaro submitted a draft patch, but I'm not clear on whether that needs to be reviewed, because: - I'm not sure whether it's close enough to being finished for a review to be a good use of time. - I'm not sure how much you and Heikki have already reviewed it. - I'm not sure whether this patch buys us anything by itself. Thoughts? ...Robert
Robert, big thanks for your review. I think #1 is still partially valid, because it contains general cleanups, but part of it is no longer necessary. #2, #3 and #4 you can move to the "Returned with feedback" section. Thanks Zdenek Robert Haas napsal(a): > Zdenek - > > I am a bit murky on where we stand with upgrade-in-place in terms of > reviewing. Initially, you had submitted four patches for this > commitfest: > > 1. htup and bufpage API clean up > 2. HeapTuple version extension + code cleanup > 3. In-place online upgrade > 4. Extending pg_class info + more flexible TOAST chunk size > > I think that it was decided that replacing the heap tuple access > macros with function calls was not acceptable, so I have moved patches > #1 and #2 to the "Returned with feedback" section. I thought that > perhaps the third patch could be salvaged, but the consensus seemed to > be to go in a new direction, so I'm thinking that one should probably > be moved to "Returned with feedback" as well. However, I'm not clear > on whether you will be submitting something else instead and whether > that thing should be considered material for this commitfest. Can you > let me know how you are thinking about this? > > With respect to #4, I know that Alvaro submitted a draft patch, but > I'm not clear on whether that needs to be reviewed, because: > > - I'm not sure whether it's close enough to being finished for a > review to be a good use of time. > - I'm not sure how much you and Heikki have already reviewed it. > - I'm not sure whether this patch buys us anything by itself. > > Thoughts? > > ...Robert
>> 1. htup and bufpage API clean up >> 2. HeapTuple version extension + code cleanup >> 3. In-place online upgrade >> 4. Extending pg_class info + more flexible TOAST chunk size > big thanks for your review. I think #1 is still partially valid, because it > contains general cleanups, but part of it is not necessary now. #2, #3 and > #4 you can move to return with feedback section. OK, when can you submit a new version of #1 with the parts that are still valid, updated to CVS HEAD, etc? Thanks, ...Robert
Robert Haas escribió: > With respect to #4, I know that Alvaro submitted a draft patch, but > I'm not clear on whether that needs to be reviewed, because: > > - I'm not sure whether it's close enough to being finished for a > review to be a good use of time. > - I'm not sure how much you and Heikki have already reviewed it. > - I'm not sure whether this patch buys us anything by itself. I finished that patch, but I didn't submit it because in later discussion it turned out (at least as I read it) that it's considered to be unnecessary. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera napsal(a): > Robert Haas escribió: > >> With respect to #4, I know that Alvaro submitted a draft patch, but >> I'm not clear on whether that needs to be reviewed, because: >> >> - I'm not sure whether it's close enough to being finished for a >> review to be a good use of time. >> - I'm not sure how much you and Heikki have already reviewed it. >> - I'm not sure whether this patch buys us anything by itself. > > I finished that patch, but I didn't submit it because in later > discussion it turned out (at least as I read it) that it's considered to > be unnecessary. > From the pg_upgrade perspective, it is something we will need to do anyway, because TOAST_MAX_CHUNK_SIZE will be different in 8.5 (if you commit the CRC patch). Then we will need the patch for 8.5. It is not necessary for the 8.3->8.4 upgrade because TOAST_MAX_CHUNK_SIZE is the same, and making this change to the toast tables now would add unnecessary complexity. Zdenek
Robert Haas napsal(a): >>> 1. htup and bufpage API clean up >>> 2. HeapTuple version extension + code cleanup >>> 3. In-place online upgrade >>> 4. Extending pg_class info + more flexible TOAST chunk size >> big thanks for your review. I think #1 is still partially valid, because it >> contains general cleanups, but part of it is not necessary now. #2, #3 and >> #4 you can move to return with feedback section. > > OK, when can you submit a new version of #1 with the parts that are > still valid, updated to CVS HEAD, etc? > It is not a priority right now. I'm working on space reservation first. Thanks Zdenek