Thread: [WIP] In-place upgrade

[WIP] In-place upgrade

From

Zdenek Kotala

Date:

31 October 2008, 18:44:01

This is really first patch which is not clean up, but it add in-place upgrade
functionality. The patch requires other clean up patches which I already send.
You can find aslo GIT repository with "workable" version.

Main point is that tuples are converted to latest version in SeqScan and
IndexScan node. All storage/access module is able process database 8.1-8.4.
(Page Layout 3 and 4).

What works:
- select - heap scan is ok, but index scan does not work on varlena datatypes. I
need to convert index key somewhere in index access.

What does not work:
- tuple conversion which contains arrays, composite datatypes and toast
- vacuum - it tries to cleanup old pages - probably better could be converted
them to the new format during processing...
- insert/delete/update

The Patch contains lot of extra comments and rubbish, but it is in process of
cleanup.

What I need to know/solve:

1) yes/no for this kind of online upgrade method
2) I'm not sure if the calling ExecStoreTuple correct.
3) I'm still looking best place to store old data structures and conversion
functions. My idea is to create new directories:
src/include/odf/v03/...
src/backend/storage/upgrade/
src/backend/access/upgrade
(odf = On Disk Format)

Links:
http://git.postgresql.org/?p=~davidfetter/upgrade_in_place/.git;a=summary
http://src.opensolaris.org/source/xref/sfw/usr/src/cmd/postgres/postgresql-upgrade/

        Thanks for your comments

                Zdenek


--
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql

diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/backend/access/heap/htup_03.c
pgsql_master_upgrade.13a47c410da7/src/backend/access/heap/htup_03.c
*** pgsql_master_upgrade.751eb7c6969f/src/backend/access/heap/htup_03.c    1970-01-01 01:00:00.000000000 +0100
--- pgsql_master_upgrade.13a47c410da7/src/backend/access/heap/htup_03.c    2008-10-31 21:45:33.281134312 +0100
***************
*** 0 ****
--- 1,223 ----
+ #include "postgres.h"
+ #include "access/htup_03.h"
+ #include "access/heapam.h"
+ #include "utils/rel.h"
+
+ #define VARATT_FLAG_EXTERNAL    0x80000000
+ #define VARATT_FLAG_COMPRESSED    0x40000000
+ #define VARATT_MASK_FLAGS        0xc0000000
+ #define VARATT_MASK_SIZE        0x3fffffff
+ #define VARATT_SIZEP(_PTR)    (((varattrib_03 *)(_PTR))->va_header)
+ #define VARATT_SIZE(PTR)        (VARATT_SIZEP(PTR) & VARATT_MASK_SIZE)
+
+ typedef struct varattrib_03
+ {
+     int32        va_header;        /* External/compressed storage */
+     /* flags and item size */
+     union
+     {
+         struct
+         {
+             int32        va_rawsize;        /* Plain data size */
+             char        va_data[1];        /* Compressed data */
+         }            va_compressed;        /* Compressed stored attribute */
+
+         struct
+         {
+             int32        va_rawsize;        /* Plain data size */
+             int32        va_extsize;        /* External saved size */
+             Oid            va_valueid;        /* Unique identifier of value */
+             Oid            va_toastrelid;    /* RelID where to find chunks */
+         }            va_external;    /* External stored attribute */
+
+         char        va_data[1]; /* Plain stored attribute */
+     }            va_content;
+ } varattrib_03;
+
+ /*
+  * att_align aligns the given offset as needed for a datum of alignment
+  * requirement attalign.  The cases are tested in what is hopefully something
+  * like their frequency of occurrence.
+  */
+ static
+ long att_align_03(long cur_offset, char attalign)
+ {
+     switch(attalign)
+     {
+         case 'i' : return INTALIGN(cur_offset);
+         case 'c' : return cur_offset;
+         case 'd' : return DOUBLEALIGN(cur_offset);
+         case 's' : return SHORTALIGN(cur_offset);
+         default: elog(ERROR, "unsupported alligment (%c).", attalign);
+     }
+ }
+
+ /*
+  * att_addlength increments the given offset by the length of the attribute.
+  * attval is only accessed if we are dealing with a variable-length attribute.
+  */
+ static
+ long att_addlength_03(long cur_offset, int attlen, Datum attval)
+ {
+     if(attlen > 0)
+         return cur_offset + attlen;
+
+     if(attlen == -1)
+         return cur_offset + (*((uint32*) DatumGetPointer(attval)) & 0x3fffffff);
+
+     if(attlen != -2)
+         elog(ERROR, "not supported attlen (%i).", attlen);
+
+     return cur_offset + strlen(DatumGetCString(attval)) + 1;
+ }
+
+
+ /* deform tuple from version 03 including varlena and
+  * composite type handling */
+ void
+ heap_deform_tuple_03(HeapTuple tuple, TupleDesc tupleDesc,
+                   Datum *values, bool *isnull)
+ {
+     HeapTupleHeader_03 tup = (HeapTupleHeader_03) tuple->t_data;
+     bool        hasnulls = (tup->t_infomask & 0x01);
+     Form_pg_attribute *att = tupleDesc->attrs;
+     int            tdesc_natts = tupleDesc->natts;
+     int            natts;            /* number of atts to extract */
+     int            attnum;
+     Pointer        tp_data;
+     long        off;            /* offset in tuple data */
+     bits8       *bp = tup->t_bits;        /* ptr to null bitmap in tuple */
+
+     natts = tup->t_natts;
+
+     /*
+      * In inheritance situations, it is possible that the given tuple actually
+      * has more fields than the caller is expecting.  Don't run off the end of
+      * the caller's arrays.
+      */
+     natts = Min(natts, tdesc_natts);
+
+     tp_data = ((Pointer)tup) + tup->t_hoff;
+
+     off = 0;
+
+     for (attnum = 0; attnum < natts; attnum++)
+     {
+         Form_pg_attribute thisatt = att[attnum];
+
+         if (hasnulls && att_isnull(attnum, bp))
+         {
+             values[attnum] = (Datum) 0;
+             isnull[attnum] = true;
+             continue;
+         }
+
+         isnull[attnum] = false;
+
+         off = att_align_03(off, thisatt->attalign);
+
+         values[attnum] = fetchatt(thisatt, tp_data + off); /* fetchatt looks compatible */
+
+         off = att_addlength_03(off, thisatt->attlen, (Datum)(tp_data + off));
+     }
+
+     /*
+      * If tuple doesn't have all the atts indicated by tupleDesc, read the
+      * rest as null
+      */
+     for (; attnum < tdesc_natts; attnum++)
+     {
+         values[attnum] = (Datum) 0;
+         isnull[attnum] = true;
+     }
+ }
+
+ HeapTuple heap_tuple_upgrade_03(Relation rel, HeapTuple tuple)
+ {
+     TupleDesc    tupleDesc = RelationGetDescr(rel);
+     int            natts;
+     Datum        *values;
+     bool        *isnull;
+     bool        *isalloc;
+     HeapTuple    newTuple;
+     int            n;
+
+     /* Preallocate values/isnull arrays */
+     natts = tupleDesc->natts;
+     values = (Datum *) palloc0(natts * sizeof(Datum));
+     isnull = (bool *) palloc0(natts * sizeof(bool));
+     isalloc = (bool *) palloc0(natts * sizeof(bool));
+
+     heap_deform_tuple_03(tuple, tupleDesc, values, isnull);
+
+     /* now we need to go through values and convert varlen and composite types */
+     for( n = 0; n < natts; n++)
+     {
+         if(isnull[n])
+             continue;
+
+         if(tupleDesc->attrs[n]->attlen == -1)
+         {
+             varattrib_03* varhdr_03;
+             varattrib_4b* varhdr_04;
+             char *data;
+
+ //            elog(NOTICE,"attname %s", tupleDesc->attrs[n]->attname);
+
+             /* varlena conversion */
+             varhdr_03 = (varattrib_03*) DatumGetPointer(values[n]);
+             data = palloc(VARATT_SIZE(varhdr_03));
+             varhdr_04 = (varattrib_4b*) data;
+
+             if( (varhdr_03->va_header & VARATT_MASK_FLAGS) == 0 )
+             {     /* TODO short varlena - but form_tuple should convert it anyway */
+
+                 SET_VARSIZE(varhdr_04, VARATT_SIZE(varhdr_03));
+                 memcpy( VARDATA(varhdr_04), varhdr_03->va_content.va_data,
+                     VARATT_SIZE(varhdr_03)- offsetof(varattrib_03, va_content.va_data) );
+             } else
+             if( (varhdr_03->va_header & VARATT_FLAG_EXTERNAL) != 0)
+             {
+                 SET_VARSIZE_EXTERNAL(varhdr_04,
+                     VARHDRSZ_EXTERNAL + sizeof(struct varatt_external));
+                 memcpy( VARDATA_EXTERNAL(varhdr_04),
+                         &(varhdr_03->va_content.va_external.va_rawsize), sizeof(struct varatt_external));
+             } else
+             if( (varhdr_03->va_header & VARATT_FLAG_COMPRESSED ) != 0)
+             {
+
+                 SET_VARSIZE_COMPRESSED(varhdr_04, VARATT_SIZE(varhdr_03));
+                 varhdr_04->va_compressed.va_rawsize = varhdr_03->va_content.va_compressed.va_rawsize;
+
+                 memcpy( VARDATA_4B_C(varhdr_04), varhdr_03->va_content.va_compressed.va_data,
+                     VARATT_SIZE(varhdr_03)- offsetof(varattrib_03, va_content.va_compressed.va_data) );
+             }
+
+             values[n] = PointerGetDatum(data);
+             isalloc[n] = true;
+         }
+     }
+
+     newTuple = heap_form_tuple(tupleDesc, values, isnull);
+
+     /* free allocated memory */
+     for( n = 0; n < natts; n++)
+     {
+         if(isalloc[n])
+             pfree(DatumGetPointer(values[n]));
+     }
+
+     /* Preserve OID, if any */
+     if(rel->rd_rel->relhasoids)
+     {
+         Oid oid;
+         oid = *((Oid *) ((char *)(tuple->t_data) + ((HeapTupleHeader_03)(tuple->t_data))->t_hoff - sizeof(Oid)));
+         HeapTupleSetOid(newTuple, oid);
+     }
+     return newTuple;
+ }
+
+
+
+
+
diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/backend/access/heap/htup.c
pgsql_master_upgrade.13a47c410da7/src/backend/access/heap/htup.c
*** pgsql_master_upgrade.751eb7c6969f/src/backend/access/heap/htup.c    2008-10-31 21:45:33.114200837 +0100
--- pgsql_master_upgrade.13a47c410da7/src/backend/access/heap/htup.c    2008-10-31 21:45:33.218887161 +0100
***************
*** 2,10 ****
--- 2,15 ----

  #include "fmgr.h"
  #include "access/htup.h"
+ #include "access/htup_03.h"
  #include "access/transam.h"
  #include "storage/bufpage.h"

+
+ #define TPH03(tup) \
+     ((HeapTupleHeader_03)tuple->t_data)
+
  /*
   * HeapTupleHeader accessor macros
   *
***************
*** 135,251 ****
   */
  bool HeapTupleIsHotUpdated(HeapTuple tuple)
  {
!     return ((tuple->t_data->t_infomask2 & HEAP_HOT_UPDATED) != 0 &&
!             (tuple->t_data->t_infomask & (HEAP_XMIN_INVALID | HEAP_XMAX_INVALID)) == 0);
  }
  void HeapTupleSetHotUpdated(HeapTuple tuple)
  {
!     tuple->t_data->t_infomask2 |= HEAP_HOT_UPDATED;
  }

  void HeapTupleClearHotUpdated(HeapTuple tuple)
  {
!     tuple->t_data->t_infomask2 &= ~HEAP_HOT_UPDATED;
  }

  bool HeapTupleIsHeapOnly(HeapTuple tuple)
  {
!     return (tuple->t_data->t_infomask2 & HEAP_ONLY_TUPLE) != 0;
  }

  void HeapTupleSetHeapOnly(HeapTuple tuple)
  {
!     tuple->t_data->t_infomask2 |= HEAP_ONLY_TUPLE;
  }

  void HeapTupleClearHeapOnly(HeapTuple tuple)
  {
!     tuple->t_data->t_infomask2 &= ~HEAP_ONLY_TUPLE;
  }


  Oid HeapTupleGetOid(HeapTuple tuple)
  {
      if(!HeapTupleIs(tuple, HEAP_HASOID))
          return InvalidOid;

!     return  *((Oid *) ((char *)tuple->t_data + HeapTupleGetHoff(tuple) - sizeof(Oid)));
  }

  void HeapTupleSetOid(HeapTuple tuple, Oid oid)
  {
      Assert(HeapTupleIs(tuple, HEAP_HASOID));
!     *((Oid *) ((char *)(tuple->t_data) + HeapTupleGetHoff(tuple) - sizeof(Oid))) = oid;
! }
!
! bool HeapTupleHasOid(HeapTuple tuple)
! {
!     return HeapTupleIs(tuple, HEAP_HASOID);
  }

  TransactionId HeapTupleGetXmax(HeapTuple tuple)
  {
!     return tuple->t_data->t_choice.t_heap.t_xmax;
  }

  void HeapTupleSetXmax(HeapTuple tuple, TransactionId xmax)
  {
!     tuple->t_data->t_choice.t_heap.t_xmax = xmax;
  }

  TransactionId HeapTupleGetXmin(HeapTuple tuple)
  {
!     return tuple->t_data->t_choice.t_heap.t_xmin;
  }

  void HeapTupleSetXmin(HeapTuple tuple, TransactionId xmin)
  {
!     tuple->t_data->t_choice.t_heap.t_xmin = xmin;
  }

  TransactionId HeapTupleGetXvac(HeapTuple tuple)
  {
!     return (HeapTupleIs(tuple, HEAP_MOVED)) ?
!             tuple->t_data->t_choice.t_heap.t_field3.t_xvac :
!             InvalidTransactionId;
  }

  void HeapTupleSetXvac(HeapTuple tuple, TransactionId Xvac)
  {
!     Assert(HeapTupleIs(tuple, HEAP_MOVED));
!          tuple->t_data->t_choice.t_heap.t_field3.t_xvac = Xvac;
  }

  void HeapTupleSetCmax(HeapTuple tuple, CommandId cid, bool iscombo)
  {
!     Assert(!(HeapTupleIs(tuple, HEAP_MOVED)));
!     tuple->t_data->t_choice.t_heap.t_field3.t_cid = cid;
!     if(iscombo)
!         HeapTupleSet(tuple, HEAP_COMBOCID);
!     else
!         HeapTupleClear(tuple, HEAP_COMBOCID);
  }

  void HeapTupleSetCmin(HeapTuple tuple, CommandId cid)
  {
!     Assert(!(HeapTupleIs(tuple, HEAP_MOVED)));
!     tuple->t_data->t_choice.t_heap.t_field3.t_cid = cid;
!     HeapTupleClear(tuple, HEAP_COMBOCID);
  }

  uint16 HeapTupleGetInfoMask(HeapTuple tuple)
  {
!     return ((tuple)->t_data->t_infomask);
  }

  void HeapTupleSetInfoMask(HeapTuple tuple, uint16 infomask)
  {
!     ((tuple)->t_data->t_infomask = (infomask));
  }

  uint16 HeapTupleGetInfoMask2(HeapTuple tuple)
  {
!     return ((tuple)->t_data->t_infomask2);
  }

  bool HeapTupleIs(HeapTuple tuple, uint16 mask)
--- 140,361 ----
   */
  bool HeapTupleIsHotUpdated(HeapTuple tuple)
  {
!     switch(tuple->t_ver)
!     {
!         case 4 : return ((tuple->t_data->t_infomask2 & HEAP_HOT_UPDATED) != 0 &&
!                     (tuple->t_data->t_infomask & (HEAP_XMIN_INVALID | HEAP_XMAX_INVALID)) == 0);
!         case 3 : return false;
!     }
!     Assert(false);
!     return false;
  }
  void HeapTupleSetHotUpdated(HeapTuple tuple)
  {
!     switch(tuple->t_ver)
!     {
!         case 4 : tuple->t_data->t_infomask2 |= HEAP_HOT_UPDATED;
!                  return;
!     }
!     elog(PANIC,"Tuple cannot be HOT updated");
  }

  void HeapTupleClearHotUpdated(HeapTuple tuple)
  {
!     switch(tuple->t_ver)
!     {
!         case 4 : tuple->t_data->t_infomask2 &= ~HEAP_HOT_UPDATED;
!                  return;
!     }
!     elog(PANIC,"Tuple cannot be HOT updated");
  }

  bool HeapTupleIsHeapOnly(HeapTuple tuple)
  {
!     switch(tuple->t_ver)
!     {
!         case 4 : return (tuple->t_data->t_infomask2 & HEAP_ONLY_TUPLE) != 0;
!         case 3 : return false;
!     }
!     Assert(false);
!     return false;
  }

  void HeapTupleSetHeapOnly(HeapTuple tuple)
  {
!     switch(tuple->t_ver)
!     {
!         case 4 : tuple->t_data->t_infomask2 |= HEAP_ONLY_TUPLE;
!                  return;
!     }
!     elog(PANIC, "HeapOnly flag is not supported.");
  }

  void HeapTupleClearHeapOnly(HeapTuple tuple)
  {
!     switch(tuple->t_ver)
!     {
!         case 4 : tuple->t_data->t_infomask2 &= ~HEAP_ONLY_TUPLE;
!                  return;
!     }
!     elog(PANIC, "HeapOnly flag is not supported.");
  }

+ bool HeapTupleHasOid(HeapTuple tuple)
+ {
+     return HeapTupleIs(tuple, HEAP_HASOID);
+ }

  Oid HeapTupleGetOid(HeapTuple tuple)
  {
      if(!HeapTupleIs(tuple, HEAP_HASOID))
          return InvalidOid;

!     switch(tuple->t_ver)
!     {
!         case 4 : return  *((Oid *) ((char *)tuple->t_data + HeapTupleGetHoff(tuple) - sizeof(Oid)));
!         case 3 : return  *((Oid *) ((char *)TPH03(tuple) + HeapTupleGetHoff(tuple) - sizeof(Oid)));
!     }
!     elog(PANIC, "HeapTupleGetOid is not supported.");
  }

  void HeapTupleSetOid(HeapTuple tuple, Oid oid)
  {
      Assert(HeapTupleIs(tuple, HEAP_HASOID));
!     switch(tuple->t_ver)
!     {
!         case 4 : *((Oid *) ((char *)(tuple->t_data) + HeapTupleGetHoff(tuple) - sizeof(Oid))) = oid;
!                  break;
!         case 3 : *((Oid *) ((char *)TPH03(tuple) + HeapTupleGetHoff(tuple) - sizeof(Oid))) = oid;
!                  break;
!         default: elog(PANIC, "HeapTupleSetOid is not supported.");
!     }
  }

  TransactionId HeapTupleGetXmax(HeapTuple tuple)
  {
!     switch(tuple->t_ver)
!     {
!         case 4 : return tuple->t_data->t_choice.t_heap.t_xmax;
!         case 3 : return TPH03(tuple)->t_choice.t_heap.t_xmax;
!     }
!     elog(PANIC, "HeapTupleGetXmax is not supported.");
!     return 0;
  }

  void HeapTupleSetXmax(HeapTuple tuple, TransactionId xmax)
  {
!     switch(tuple->t_ver)
!     {
!         case 4 : tuple->t_data->t_choice.t_heap.t_xmax = xmax;
!                  break;
!         case 3 : TPH03(tuple)->t_choice.t_heap.t_xmax = xmax;
!                  break;
!         default: elog(PANIC, "HeapTupleSetXmax is not supported.");
!     }
  }

  TransactionId HeapTupleGetXmin(HeapTuple tuple)
  {
!     switch(tuple->t_ver)
!     {
!         case 4 : return tuple->t_data->t_choice.t_heap.t_xmin;
!         case 3 : return TPH03(tuple)->t_choice.t_heap.t_xmin;
!     }
!     elog(PANIC, "HeapTupleSetXmin is not supported.");
!     return 0;
  }

  void HeapTupleSetXmin(HeapTuple tuple, TransactionId xmin)
  {
!     switch(tuple->t_ver)
!     {
!         case 4 : tuple->t_data->t_choice.t_heap.t_xmin = xmin;
!                  break;
!         case 3 : TPH03(tuple)->t_choice.t_heap.t_xmin = xmin;
!         default: elog(PANIC, "HeapTupleSetXmin is not supported.");
!     }
  }

  TransactionId HeapTupleGetXvac(HeapTuple tuple)
  {
!     switch(tuple->t_ver)
!     {
!         case 4 : return (HeapTupleIs(tuple, HEAP_MOVED)) ?
!                         tuple->t_data->t_choice.t_heap.t_field3.t_xvac :
!                         InvalidTransactionId;
!     }
!     Assert(false);
!     return InvalidTransactionId;
  }

  void HeapTupleSetXvac(HeapTuple tuple, TransactionId Xvac)
  {
!     switch(tuple->t_ver)
!     {
!         case 4 : Assert(HeapTupleIs(tuple, HEAP_MOVED));
!                  tuple->t_data->t_choice.t_heap.t_field3.t_xvac = Xvac;
!                  break;
!         default: Assert(false);
!     }
  }

  void HeapTupleSetCmax(HeapTuple tuple, CommandId cid, bool iscombo)
  {
!     switch(tuple->t_ver)
!     {
!         case 4 : Assert(!(HeapTupleIs(tuple, HEAP_MOVED)));
!                  tuple->t_data->t_choice.t_heap.t_field3.t_cid = cid;
!                  if(iscombo)
!                     HeapTupleSet(tuple, HEAP_COMBOCID);
!                  else
!                     HeapTupleClear(tuple, HEAP_COMBOCID);
!                  break;
!         default: Assert(false);
!     }
  }

  void HeapTupleSetCmin(HeapTuple tuple, CommandId cid)
  {
!     switch(tuple->t_ver)
!     {
!         case 4 : Assert(!(HeapTupleIs(tuple, HEAP_MOVED)));
!                  tuple->t_data->t_choice.t_heap.t_field3.t_cid = cid;
!                  HeapTupleClear(tuple, HEAP_COMBOCID);
!                  break;
!         default: Assert(false);
!     }
  }

  uint16 HeapTupleGetInfoMask(HeapTuple tuple)
  {
!     uint16 infomask;
!     switch(tuple->t_ver)
!     {
!         case 4: return ((tuple)->t_data->t_infomask);
!         case 3: infomask = TPH03(tuple)->t_infomask & 0xFFB7; /* reset 3 (HASOID), 4 (UNUSED), 5 (COMBOCID) bit */
!                 infomask |= ((TPH03(tuple)->t_infomask& 0x0010) << 1 ); /* copy HASOID */
!                 return infomask;
!     }
!     elog(PANIC, "HeapTupleGetInfoMask is not supported.");
  }

  void HeapTupleSetInfoMask(HeapTuple tuple, uint16 infomask)
  {
!     switch(tuple->t_ver)
!     {
!         case 4:    ((tuple)->t_data->t_infomask = (infomask));
!                 break;
!         default: Assert(false);
!     }
  }

  uint16 HeapTupleGetInfoMask2(HeapTuple tuple)
  {
!     switch(tuple->t_ver)
!     {
!         case 4 :return ((tuple)->t_data->t_infomask2);
!         default: return 0;
!     }
  }

  bool HeapTupleIs(HeapTuple tuple, uint16 mask)
***************
*** 265,271 ****

  void HeapTupleClear2(HeapTuple tuple, uint16 mask)
  {
!     ((tuple)->t_data->t_infomask2 &= ~(mask));
  }

  CommandId HeapTupleGetRawCommandId(HeapTuple tuple)
--- 375,386 ----

  void HeapTupleClear2(HeapTuple tuple, uint16 mask)
  {
!     switch(tuple->t_ver)
!     {
!         case 4:    ((tuple)->t_data->t_infomask2 &= ~(mask));
!                 break;
!     }
!     /* silently ignore on older versions */
  }

  CommandId HeapTupleGetRawCommandId(HeapTuple tuple)
***************
*** 275,281 ****

  int HeapTupleGetNatts(HeapTuple tuple)
  {
!     return (tuple->t_data->t_infomask2 & HEAP_NATTS_MASK);
  }

  ItemPointer HeapTupleGetCtid(HeapTuple tuple)
--- 390,401 ----

  int HeapTupleGetNatts(HeapTuple tuple)
  {
!     switch(tuple->t_ver)
!     {
!         case 4: return (tuple->t_data->t_infomask2 & HEAP_NATTS_MASK);
!         case 3: return TPH03(tuple)->t_natts;
!     }
!     elog(PANIC, "HeapTupleGetNatts is not supported.");
  }

  ItemPointer HeapTupleGetCtid(HeapTuple tuple)
***************
*** 290,306 ****

  uint8 HeapTupleGetHoff(HeapTuple tuple)
  {
!     return (tuple->t_data->t_hoff);
  }

  Pointer HeapTupleGetBits(HeapTuple tuple)
  {
!     return (Pointer)(tuple->t_data->t_bits);
  }

  Pointer HeapTupleGetData(HeapTuple tuple)
  {
!     return (((Pointer)tuple->t_data) + tuple->t_data->t_hoff);
  }

  void HeapTupleInit(HeapTuple tuple, int32 len, Oid typid, int32 typmod,
--- 410,438 ----

  uint8 HeapTupleGetHoff(HeapTuple tuple)
  {
!     switch(tuple->t_ver)
!     {
!         case 4: return (tuple->t_data->t_hoff);
!     }
!     elog(PANIC, "HeapTupleGetHoff is not supported.");
  }

  Pointer HeapTupleGetBits(HeapTuple tuple)
  {
!     switch(tuple->t_ver)
!     {
!         case 4: return (Pointer)(tuple->t_data->t_bits);
!     }
!     elog(PANIC, "HeapTupleGetBits is not supported.");
  }

  Pointer HeapTupleGetData(HeapTuple tuple)
  {
!     switch(tuple->t_ver)
!     {
!         case 4: return (((Pointer)tuple->t_data) + tuple->t_data->t_hoff);
!     }
!     elog(PANIC, "HeapTupleGetData is not supported.");
  }

  void HeapTupleInit(HeapTuple tuple, int32 len, Oid typid, int32 typmod,
diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/backend/access/heap/Makefile
pgsql_master_upgrade.13a47c410da7/src/backend/access/heap/Makefile
*** pgsql_master_upgrade.751eb7c6969f/src/backend/access/heap/Makefile    2008-10-31 21:45:33.112796571 +0100
--- pgsql_master_upgrade.13a47c410da7/src/backend/access/heap/Makefile    2008-10-31 21:45:33.217276252 +0100
***************
*** 12,17 ****
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global

! OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o htup.o

  include $(top_srcdir)/src/backend/common.mk
--- 12,17 ----
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global

! OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o htup.o htup_03.o

  include $(top_srcdir)/src/backend/common.mk
diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/backend/access/nbtree/nbtinsert.c
pgsql_master_upgrade.13a47c410da7/src/backend/access/nbtree/nbtinsert.c
*** pgsql_master_upgrade.751eb7c6969f/src/backend/access/nbtree/nbtinsert.c    2008-10-31 21:45:33.136480748 +0100
--- pgsql_master_upgrade.13a47c410da7/src/backend/access/nbtree/nbtinsert.c    2008-10-31 21:45:33.231075233 +0100
***************
*** 1203,1209 ****

      /* Total free space available on a btree page, after fixed overhead */
      leftspace = rightspace =
!         PageGetPageSize(page) - SizeOfPageHeaderData -
          MAXALIGN(sizeof(BTPageOpaqueData));

      /* The right page will have the same high key as the old page */
--- 1203,1209 ----

      /* Total free space available on a btree page, after fixed overhead */
      leftspace = rightspace =
!         PageGetPageSize(page) - SizeOfPageHeaderData04 -
          MAXALIGN(sizeof(BTPageOpaqueData));

      /* The right page will have the same high key as the old page */
diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/backend/executor/nodeIndexscan.c
pgsql_master_upgrade.13a47c410da7/src/backend/executor/nodeIndexscan.c
*** pgsql_master_upgrade.751eb7c6969f/src/backend/executor/nodeIndexscan.c    2008-10-31 21:45:33.144606136 +0100
--- pgsql_master_upgrade.13a47c410da7/src/backend/executor/nodeIndexscan.c    2008-10-31 21:45:33.238969129 +0100
***************
*** 27,32 ****
--- 27,33 ----
  #include "access/genam.h"
  #include "access/nbtree.h"
  #include "access/relscan.h"
+ #include "access/htup_03.h"
  #include "executor/execdebug.h"
  #include "executor/nodeIndexscan.h"
  #include "optimizer/clauses.h"
***************
*** 113,122 ****
           * Note: we pass 'false' because tuples returned by amgetnext are
           * pointers onto disk pages and must not be pfree()'d.
           */
!         ExecStoreTuple(tuple,    /* tuple to store */
!                        slot,    /* slot to store in */
!                        scandesc->xs_cbuf,        /* buffer containing tuple */
!                        false);    /* don't pfree */

          /*
           * If the index was lossy, we have to recheck the index quals using
--- 114,138 ----
           * Note: we pass 'false' because tuples returned by amgetnext are
           * pointers onto disk pages and must not be pfree()'d.
           */
!         if(tuple->t_ver == 4)
!         {
!             ExecStoreTuple(tuple,    /* tuple to store */
!                            slot,    /* slot to store in */
!                            scandesc->xs_cbuf,        /* buffer containing tuple */
!                            false);    /* don't pfree */
!         } else
!         if(tuple->t_ver == 3)
!         {
!             HeapTuple newtup;
!             newtup = heap_tuple_upgrade_03(scandesc->heapRelation, tuple);
!             ExecStoreTuple(newtup,    /* tuple to store */
!                            slot,    /* slot to store in */
!                            InvalidBuffer,        /* buffer associated with this
!                                                      * tuple */
!                            true);    /* pfree this pointer */
!         }
!         else
!             elog(ERROR,"Unsupported tuple version (%i).",tuple->t_ver);

          /*
           * If the index was lossy, we have to recheck the index quals using
diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/backend/executor/nodeSeqscan.c
pgsql_master_upgrade.13a47c410da7/src/backend/executor/nodeSeqscan.c
*** pgsql_master_upgrade.751eb7c6969f/src/backend/executor/nodeSeqscan.c    2008-10-31 21:45:33.148191833 +0100
--- pgsql_master_upgrade.13a47c410da7/src/backend/executor/nodeSeqscan.c    2008-10-31 21:45:33.242644971 +0100
***************
*** 25,30 ****
--- 25,31 ----
  #include "postgres.h"

  #include "access/heapam.h"
+ #include "access/htup_03.h"
  #include "access/relscan.h"
  #include "executor/execdebug.h"
  #include "executor/nodeSeqscan.h"
***************
*** 101,111 ****
       * refcount will not be dropped until the tuple table slot is cleared.
       */
      if (tuple)
!         ExecStoreTuple(tuple,    /* tuple to store */
!                        slot,    /* slot to store in */
!                        scandesc->rs_cbuf,        /* buffer associated with this
!                                                  * tuple */
!                        false);    /* don't pfree this pointer */
      else
          ExecClearTuple(slot);

--- 102,129 ----
       * refcount will not be dropped until the tuple table slot is cleared.
       */
      if (tuple)
!     {
!         if(tuple->t_ver == 4)
!         {
!             ExecStoreTuple(tuple,    /* tuple to store */
!                            slot,    /* slot to store in */
!                            scandesc->rs_cbuf,        /* buffer associated with this
!                                                      * tuple */
!                            false);    /* don't pfree this pointer */
!         } else
!         if(tuple->t_ver == 3)
!         {
!             HeapTuple newtup;
!             newtup = heap_tuple_upgrade_03(scandesc->rs_rd, tuple);
!             ExecStoreTuple(newtup,    /* tuple to store */
!                            slot,    /* slot to store in */
!                            InvalidBuffer,        /* buffer associated with this
!                                                      * tuple */
!                            true);    /* pfree this pointer */
!         }
!         else
!             elog(ERROR,"Unsupported tuple version (%i).",tuple->t_ver);
!     }
      else
          ExecClearTuple(slot);

diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/backend/optimizer/util/plancat.c
pgsql_master_upgrade.13a47c410da7/src/backend/optimizer/util/plancat.c
*** pgsql_master_upgrade.751eb7c6969f/src/backend/optimizer/util/plancat.c    2008-10-31 21:45:33.157104094 +0100
--- pgsql_master_upgrade.13a47c410da7/src/backend/optimizer/util/plancat.c    2008-10-31 21:45:33.251184277 +0100
***************
*** 429,435 ****
                  tuple_width += sizeof(HeapTupleHeaderData);
                  tuple_width += sizeof(ItemPointerData);
                  /* note: integer division is intentional here */
!                 density = (BLCKSZ - SizeOfPageHeaderData) / tuple_width;
              }
              *tuples = rint(density * (double) curpages);
              break;
--- 429,435 ----
                  tuple_width += sizeof(HeapTupleHeaderData);
                  tuple_width += sizeof(ItemPointerData);
                  /* note: integer division is intentional here */
!                 density = (BLCKSZ - SizeOfPageHeaderData04) / tuple_width;
              }
              *tuples = rint(density * (double) curpages);
              break;
diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/backend/storage/page/bufpage.c
pgsql_master_upgrade.13a47c410da7/src/backend/storage/page/bufpage.c
*** pgsql_master_upgrade.751eb7c6969f/src/backend/storage/page/bufpage.c    2008-10-31 21:45:33.168097249 +0100
--- pgsql_master_upgrade.13a47c410da7/src/backend/storage/page/bufpage.c    2008-10-31 21:45:33.262190876 +0100
***************
*** 19,24 ****
--- 19,28 ----
  #include "access/transam.h"
  #include "storage/bufpage.h"

+
+ static bool PageLayoutIsValid_04(Page page);
+ static bool PageLayoutIsValid_03(Page page);
+ static bool PageIsZeroed(Page page);
  static Item PageGetItem(Page page, OffsetNumber offsetNumber);

  /* ----------------------------------------------------------------
***************
*** 28,50 ****

  /*
   * PageInit
!  *        Initializes the contents of a page.
   */
  void
  PageInit(Page page, Size pageSize, Size specialSize)
  {
!     PageHeader    p = (PageHeader) page;

      specialSize = MAXALIGN(specialSize);

      Assert(pageSize == BLCKSZ);
!     Assert(pageSize > specialSize + SizeOfPageHeaderData);

      /* Make sure all fields of page are zero, as well as unused space */
      MemSet(p, 0, pageSize);

      /* p->pd_flags = 0;                                done by above MemSet */
!     p->pd_lower = SizeOfPageHeaderData;
      p->pd_upper = pageSize - specialSize;
      p->pd_special = pageSize - specialSize;
      PageSetPageSizeAndVersion(page, pageSize, PG_PAGE_LAYOUT_VERSION);
--- 32,55 ----

  /*
   * PageInit
!  *        Initializes the contents of a page. We allow to initialize page only
!  *      in latest Page Layout Version.
   */
  void
  PageInit(Page page, Size pageSize, Size specialSize)
  {
!     PageHeader_04    p = (PageHeader_04) page;

      specialSize = MAXALIGN(specialSize);

      Assert(pageSize == BLCKSZ);
!     Assert(pageSize > specialSize + SizeOfPageHeaderData04);

      /* Make sure all fields of page are zero, as well as unused space */
      MemSet(p, 0, pageSize);

      /* p->pd_flags = 0;                                done by above MemSet */
!     p->pd_lower = SizeOfPageHeaderData04;
      p->pd_upper = pageSize - specialSize;
      p->pd_special = pageSize - specialSize;
      PageSetPageSizeAndVersion(page, pageSize, PG_PAGE_LAYOUT_VERSION);
***************
*** 53,59 ****


  /*
!  * PageHeaderIsValid
   *        Check that the header fields of a page appear valid.
   *
   * This is called when a page has just been read in from disk.    The idea is
--- 58,64 ----


  /*
!  * PageLayoutIsValid
   *        Check that the header fields of a page appear valid.
   *
   * This is called when a page has just been read in from disk.    The idea is
***************
*** 73,94 ****
  bool
  PageLayoutIsValid(Page page)
  {
      char       *pagebytes;
      int            i;
-     PageHeader    ph = (PageHeader)page;
-
-     /* Check normal case */
-     if (PageGetPageSize(page) == BLCKSZ &&
-         PageGetPageLayoutVersion(page) == PG_PAGE_LAYOUT_VERSION &&
-         (ph->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
-         ph->pd_lower >= SizeOfPageHeaderData &&
-         ph->pd_lower <= ph->pd_upper &&
-         ph->pd_upper <= ph->pd_special &&
-         ph->pd_special <= BLCKSZ &&
-         ph->pd_special == MAXALIGN(ph->pd_special))
-         return true;

-     /* Check all-zeroes case */
      pagebytes = (char *) page;
      for (i = 0; i < BLCKSZ; i++)
      {
--- 78,102 ----
  bool
  PageLayoutIsValid(Page page)
  {
+     /* Check normal case */
+     switch(PageGetPageLayoutVersion(page))
+     {
+         case 4 : return(PageLayoutIsValid_04(page));
+         case 3 : return(PageLayoutIsValid_03(page));
+         case 0 : return(PageIsZeroed(page));
+     }
+     return false;
+ }
+
+ /*
+  * Check all-zeroes case
+  */
+ bool
+ PageIsZeroed(Page page)
+ {
      char       *pagebytes;
      int            i;

      pagebytes = (char *) page;
      for (i = 0; i < BLCKSZ; i++)
      {
***************
*** 98,103 ****
--- 106,141 ----
      return true;
  }

+ bool PageLayoutIsValid_04(Page page)
+ {
+     PageHeader_04 phdr = (PageHeader_04)page;
+      if(
+          PageGetPageSize(page) == BLCKSZ &&
+         (phdr->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
+         phdr->pd_lower >= SizeOfPageHeaderData04 &&
+         phdr->pd_lower <= phdr->pd_upper &&
+         phdr->pd_upper <= phdr->pd_special &&
+         phdr->pd_special <= BLCKSZ &&
+         phdr->pd_special == MAXALIGN(phdr->pd_special))
+         return true;
+     return false;
+ }
+
+ bool PageLayoutIsValid_03(Page page)
+ {
+     PageHeader_03 phdr = (PageHeader_03)page;
+      if(
+          PageGetPageSize(page) == BLCKSZ &&
+         phdr->pd_lower >= SizeOfPageHeaderData03 &&
+         phdr->pd_lower <= phdr->pd_upper &&
+         phdr->pd_upper <= phdr->pd_special &&
+         phdr->pd_special <= BLCKSZ &&
+         phdr->pd_special == MAXALIGN(phdr->pd_special))
+         return true;
+     return false;
+ }
+
+

  /*
   *    PageAddItem
***************
*** 127,133 ****
              bool overwrite,
              bool is_heap)
  {
!     PageHeader    phdr = (PageHeader) page;
      Size        alignedSize;
      int            lower;
      int            upper;
--- 165,171 ----
              bool overwrite,
              bool is_heap)
  {
!     PageHeader_04    phdr = (PageHeader_04) page;
      Size        alignedSize;
      int            lower;
      int            upper;
***************
*** 135,144 ****
      OffsetNumber limit;
      bool        needshuffle = false;

      /*
       * Be wary about corrupted page pointers
       */
!     if (phdr->pd_lower < SizeOfPageHeaderData ||
          phdr->pd_lower > phdr->pd_upper ||
          phdr->pd_upper > phdr->pd_special ||
          phdr->pd_special > BLCKSZ)
--- 173,185 ----
      OffsetNumber limit;
      bool        needshuffle = false;

+     /* We allow add new items only on the new page layout - TODO indexes? */
+     if( PageGetPageLayoutVersion(page) != PG_PAGE_LAYOUT_VERSION )
+         elog(PANIC, "Add item on old page layout version is forbidden.");
      /*
       * Be wary about corrupted page pointers
       */
!     if (phdr->pd_lower < SizeOfPageHeaderData04 ||
          phdr->pd_lower > phdr->pd_upper ||
          phdr->pd_upper > phdr->pd_special ||
          phdr->pd_special > BLCKSZ)
***************
*** 265,281 ****
  {
      Size        pageSize;
      Page        temp;
!     PageHeader    thdr;

      pageSize = PageGetPageSize(page);
      temp = (Page) palloc(pageSize);
!     thdr = (PageHeader) temp;

      /* copy old page in */
      memcpy(temp, page, pageSize);

      /* set high, low water marks */
!     thdr->pd_lower = SizeOfPageHeaderData;
      thdr->pd_upper = pageSize - MAXALIGN(specialSize);

      /* clear out the middle */
--- 306,322 ----
  {
      Size        pageSize;
      Page        temp;
!     PageHeader_04    thdr;

      pageSize = PageGetPageSize(page);
      temp = (Page) palloc(pageSize);
!     thdr = (PageHeader_04) temp;

      /* copy old page in */
      memcpy(temp, page, pageSize);

      /* set high, low water marks */
!     thdr->pd_lower = SizeOfPageHeaderData04;
      thdr->pd_upper = pageSize - MAXALIGN(specialSize);

      /* clear out the middle */
***************
*** 333,341 ****
  void
  PageRepairFragmentation(Page page)
  {
!     Offset        pd_lower = ((PageHeader) page)->pd_lower;
!     Offset        pd_upper = ((PageHeader) page)->pd_upper;
!     Offset        pd_special = ((PageHeader) page)->pd_special;
      itemIdSort    itemidbase,
                  itemidptr;
      ItemId        lp;
--- 374,382 ----
  void
  PageRepairFragmentation(Page page)
  {
!     Offset        pd_lower = PageGetLower(page);
!     Offset        pd_upper = PageGetUpper(page);
!     Offset        pd_special = PageGetSpecial(page);
      itemIdSort    itemidbase,
                  itemidptr;
      ItemId        lp;
***************
*** 353,359 ****
       * etc could cause us to clobber adjacent disk buffers, spreading the data
       * loss further.  So, check everything.
       */
!     if (pd_lower < SizeOfPageHeaderData ||
          pd_lower > pd_upper ||
          pd_upper > pd_special ||
          pd_special > BLCKSZ ||
--- 394,400 ----
       * etc could cause us to clobber adjacent disk buffers, spreading the data
       * loss further.  So, check everything.
       */
!     if (pd_lower < SizeOfPageHeaderData04 ||
          pd_lower > pd_upper ||
          pd_upper > pd_special ||
          pd_special > BLCKSZ ||
***************
*** 384,390 ****
      if (nstorage == 0)
      {
          /* Page is completely empty, so just reset it quickly */
!         ((PageHeader) page)->pd_upper = pd_special;
      }
      else
      {                            /* nstorage != 0 */
--- 425,431 ----
      if (nstorage == 0)
      {
          /* Page is completely empty, so just reset it quickly */
!         PageSetUpper(page, pd_special);
      }
      else
      {                            /* nstorage != 0 */
***************
*** 434,440 ****
              lp->lp_off = upper;
          }

!         ((PageHeader) page)->pd_upper = upper;

          pfree(itemidbase);
      }
--- 475,481 ----
              lp->lp_off = upper;
          }

!         PageSetUpper(page, upper);

          pfree(itemidbase);
      }
***************
*** 463,470 ****
       * Use signed arithmetic here so that we behave sensibly if pd_lower >
       * pd_upper.
       */
!     space = (int) ((PageHeader) page)->pd_upper -
!         (int) ((PageHeader) page)->pd_lower;

      if (space < (int) sizeof(ItemIdData))
          return 0;
--- 504,510 ----
       * Use signed arithmetic here so that we behave sensibly if pd_lower >
       * pd_upper.
       */
!     space = PageGetExactFreeSpace(page);

      if (space < (int) sizeof(ItemIdData))
          return 0;
***************
*** 487,494 ****
       * Use signed arithmetic here so that we behave sensibly if pd_lower >
       * pd_upper.
       */
!     space = (int) ((PageHeader) page)->pd_upper -
!         (int) ((PageHeader) page)->pd_lower;

      if (space < 0)
          return 0;
--- 527,533 ----
       * Use signed arithmetic here so that we behave sensibly if pd_lower >
       * pd_upper.
       */
!     space = (int)PageGetUpper(page) - (int)PageGetLower(page);

      if (space < 0)
          return 0;
***************
*** 575,581 ****
  void
  PageIndexTupleDelete(Page page, OffsetNumber offnum)
  {
!     PageHeader    phdr = (PageHeader) page;
      char       *addr;
      ItemId        tup;
      Size        size;
--- 614,620 ----
  void
  PageIndexTupleDelete(Page page, OffsetNumber offnum)
  {
!     PageHeader_04    phdr = (PageHeader_04) page; /* TODO PGU */
      char       *addr;
      ItemId        tup;
      Size        size;
***************
*** 587,593 ****
      /*
       * As with PageRepairFragmentation, paranoia seems justified.
       */
!     if (phdr->pd_lower < SizeOfPageHeaderData ||
          phdr->pd_lower > phdr->pd_upper ||
          phdr->pd_upper > phdr->pd_special ||
          phdr->pd_special > BLCKSZ)
--- 626,632 ----
      /*
       * As with PageRepairFragmentation, paranoia seems justified.
       */
!     if (phdr->pd_lower < SizeOfPageHeaderData04 ||
          phdr->pd_lower > phdr->pd_upper ||
          phdr->pd_upper > phdr->pd_special ||
          phdr->pd_special > BLCKSZ)
***************
*** 681,687 ****
  void
  PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
  {
!     PageHeader    phdr = (PageHeader) page;
      Offset        pd_lower = phdr->pd_lower;
      Offset        pd_upper = phdr->pd_upper;
      Offset        pd_special = phdr->pd_special;
--- 720,726 ----
  void
  PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
  {
!     PageHeader_04    phdr = (PageHeader_04) page; /* TODO PGU */
      Offset        pd_lower = phdr->pd_lower;
      Offset        pd_upper = phdr->pd_upper;
      Offset        pd_special = phdr->pd_special;
***************
*** 716,722 ****
      /*
       * As with PageRepairFragmentation, paranoia seems justified.
       */
!     if (pd_lower < SizeOfPageHeaderData ||
          pd_lower > pd_upper ||
          pd_upper > pd_special ||
          pd_special > BLCKSZ ||
--- 755,761 ----
      /*
       * As with PageRepairFragmentation, paranoia seems justified.
       */
!     if (pd_lower < SizeOfPageHeaderData04 ||
          pd_lower > pd_upper ||
          pd_upper > pd_special ||
          pd_special > BLCKSZ ||
***************
*** 796,815 ****
          lp->lp_off = upper;
      }

!     phdr->pd_lower = SizeOfPageHeaderData + nused * sizeof(ItemIdData);
      phdr->pd_upper = upper;

      pfree(itemidbase);
  }

  /*
   * PageGetItemId
   *        Returns an item identifier of a page.
   */
! static ItemId PageGetItemId(Page page, OffsetNumber offsetNumber)
  {
      AssertMacro(offsetNumber > 0);
!     return (ItemId) (& ((PageHeader) page)->pd_linp[(offsetNumber) - 1]) ;
  }

  /*
--- 835,861 ----
          lp->lp_off = upper;
      }

!     phdr->pd_lower = SizeOfPageHeaderData04 + nused * sizeof(ItemIdData);
      phdr->pd_upper = upper;

      pfree(itemidbase);
  }

+
+
  /*
   * PageGetItemId
   *        Returns an item identifier of a page.
   */
! ItemId PageGetItemId(Page page, OffsetNumber offsetNumber)
  {
      AssertMacro(offsetNumber > 0);
!     switch ( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : return (ItemId) (& ((PageHeader_04) page)->pd_linp[(offsetNumber) - 1]) ;
!         case 3 : return (ItemId) (& ((PageHeader_03) page)->pd_linp[(offsetNumber) - 1]) ;
!     }
!     elog(PANIC, "Unsupported page layout in function PageGetItemId.");
  }

  /*
***************
*** 824,836 ****
  Item PageGetItem(Page page, OffsetNumber offsetNumber)
  {
      AssertMacro(PageIsValid(page));
!     return (Item) (page + ((PageHeader) page)->pd_linp[(offsetNumber) - 1].lp_off);
  }

  ItemLength PageItemGetSize(Page page, OffsetNumber offsetNumber)
  {
!     return (ItemLength)
!             ((PageHeader) page)->pd_linp[(offsetNumber) - 1].lp_len;
  }

  IndexTuple PageGetIndexTuple(Page page, OffsetNumber offsetNumber)
--- 870,896 ----
  Item PageGetItem(Page page, OffsetNumber offsetNumber)
  {
      AssertMacro(PageIsValid(page));
! //    AssertMacro(ItemIdHasStorage(itemId));
!     switch ( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : return (Item) (page +
!                    ((PageHeader_04) page)->pd_linp[(offsetNumber) - 1].lp_off);
!         case 3 : return (Item) (page +
!                    ((PageHeader_03) page)->pd_linp[(offsetNumber) - 1].lp_off);
!     }
!     elog(PANIC, "Unsupported page layout in function PageGetItem.");
  }

  ItemLength PageItemGetSize(Page page, OffsetNumber offsetNumber)
  {
!     switch ( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : return (ItemLength)
!                     ((PageHeader_04) page)->pd_linp[(offsetNumber) - 1].lp_len;
!         case 3 : return (ItemLength)
!                     ((PageHeader_03) page)->pd_linp[(offsetNumber) - 1].lp_len;
!     }
!     elog(PANIC, "Unsupported page layout in function PageItemGetSize.");
  }

  IndexTuple PageGetIndexTuple(Page page, OffsetNumber offsetNumber)
***************
*** 848,889 ****

  bool PageItemIsDead(Page page, OffsetNumber offsetNumber)
  {
!   return ItemIdIsDead(PageGetItemId(page, offsetNumber));
  }

  void PageItemMarkDead(Page page, OffsetNumber offsetNumber)
  {
!     ItemIdMarkDead(PageGetItemId(page, offsetNumber));
  }

  bool PageItemIsNormal(Page page, OffsetNumber offsetNumber)
  {
!   return ItemIdIsNormal(PageGetItemId(page, offsetNumber));
  }

  bool PageItemIsUsed(Page page, OffsetNumber offsetNumber)
  {
!   return ItemIdIsUsed(PageGetItemId(page, offsetNumber));
  }

  void PageItemSetUnused(Page page, OffsetNumber offsetNumber)
  {
!     ItemIdSetUnused(PageGetItemId(page, offsetNumber));
  }

  bool PageItemIsRedirected(Page page, OffsetNumber offsetNumber)
  {
!   return ItemIdIsRedirected(PageGetItemId(page, offsetNumber));
  }

  OffsetNumber PageItemGetRedirect(Page page, OffsetNumber offsetNumber)
  {
!   return ItemIdGetRedirect(PageGetItemId(page, offsetNumber));
  }

  void PageItemSetRedirect(Page page, OffsetNumber fromoff, OffsetNumber tooff)
  {
!     ItemIdSetRedirect( PageGetItemId(page, fromoff), tooff);
  }

  void PageItemMove(Page page, OffsetNumber dstoff, OffsetNumber srcoff)
--- 908,949 ----

  bool PageItemIsDead(Page page, OffsetNumber offsetNumber)
  {
!   return ItemIdIsDead(PageGetItemId(page, offsetNumber)); // TODO multi version
  }

  void PageItemMarkDead(Page page, OffsetNumber offsetNumber)
  {
!     ItemIdMarkDead(PageGetItemId(page, offsetNumber)); // TODO multi version
  }

  bool PageItemIsNormal(Page page, OffsetNumber offsetNumber)
  {
!   return ItemIdIsNormal(PageGetItemId(page, offsetNumber)); // TODO multi version
  }

  bool PageItemIsUsed(Page page, OffsetNumber offsetNumber)
  {
!   return ItemIdIsUsed(PageGetItemId(page, offsetNumber)); // TODO multi version
  }

  void PageItemSetUnused(Page page, OffsetNumber offsetNumber)
  {
!     ItemIdSetUnused(PageGetItemId(page, offsetNumber)); // TODO multi version
  }

  bool PageItemIsRedirected(Page page, OffsetNumber offsetNumber)
  {
!   return ItemIdIsRedirected(PageGetItemId(page, offsetNumber)); // TODO multi version
  }

  OffsetNumber PageItemGetRedirect(Page page, OffsetNumber offsetNumber)
  {
!   return ItemIdGetRedirect(PageGetItemId(page, offsetNumber)); // TODO multi version
  }

  void PageItemSetRedirect(Page page, OffsetNumber fromoff, OffsetNumber tooff)
  {
!     ItemIdSetRedirect(PageGetItemId(page, fromoff), tooff); // TODO multi version
  }

  void PageItemMove(Page page, OffsetNumber dstoff, OffsetNumber srcoff)
***************
*** 900,906 ****
   */
  Pointer PageGetContents(Page page)
  {
!     return (Pointer) (&((PageHeader) (page))->pd_linp[0]);
  }

  /* ----------------
--- 960,971 ----
   */
  Pointer PageGetContents(Page page)
  {
!     switch ( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : return (Pointer) (&((PageHeader_04) (page))->pd_linp[0]);
!         case 3 : return (Pointer) (&((PageHeader_03) (page))->pd_linp[0]);
!     }
!     elog(PANIC, "Unsupported page layout in function PageGetContents.");
  }

  /* ----------------
***************
*** 913,924 ****
   */
  Size PageGetSpecialSize(Page page)
  {
!     return (Size) PageGetPageSize(page) - ((PageHeader)(page))->pd_special;
  }

  Size PageGetDataSize(Page page)
  {
!     return (Size) ((PageHeader)(page))->pd_special - ((PageHeader)(page))->pd_upper;
  }

  /*
--- 978,1000 ----
   */
  Size PageGetSpecialSize(Page page)
  {
!     switch ( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : return (Size) PageGetPageSize(page) - ((PageHeader_04)(page))->pd_special;
!         case 3 : return (Size) PageGetPageSize(page) - ((PageHeader_03)(page))->pd_special;
!
!     }
!     elog(PANIC, "Unsupported page layout in function PageGetSpecialSize.");
  }

  Size PageGetDataSize(Page page)
  {
!     switch ( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : return (Size) ((PageHeader_04)(page))->pd_special - ((PageHeader_04)(page))->pd_upper;
!         case 3 : return (Size) ((PageHeader_03)(page))->pd_special - ((PageHeader_03)(page))->pd_upper;
!     }
!     elog(PANIC, "Unsupported page layout in function PageGetDataSize.");
  }

  /*
***************
*** 928,934 ****
  Pointer PageGetSpecialPointer(Page page)
  {
      AssertMacro(PageIsValid(page));
!     return page + ((PageHeader)(page))->pd_special;
  }

  /*
--- 1004,1015 ----
  Pointer PageGetSpecialPointer(Page page)
  {
      AssertMacro(PageIsValid(page));
!     switch ( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : return page + ((PageHeader_04)(page))->pd_special;
!         case 3 : return page + ((PageHeader_03)(page))->pd_special;
!     }
!     elog(PANIC, "Unsupported page layout in function PageGetSpecialPointer.");
  }

  /*
***************
*** 938,970 ****
  Pointer PageGetUpperPointer(Page page)
  {
      AssertMacro(PageIsValid(page));
!     return page + ((PageHeader)(page))->pd_upper;
  }

  void PageSetLower(Page page, LocationIndex lower)
  {
!     ((PageHeader) page)->pd_lower = lower;
  }

  void PageSetUpper(Page page, LocationIndex upper)
  {
!     ((PageHeader) page)->pd_upper = upper;
  }

  void PageReserveLinp(Page page)
  {
      AssertMacro(PageIsValid(page));
!     ((PageHeader) page)->pd_lower += sizeof(ItemIdData);
!     AssertMacro(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper );
  }

  void PageReleaseLinp(Page page)
  {
      AssertMacro(PageIsValid(page));
!     ((PageHeader) page)->pd_lower -= sizeof(ItemIdData);
!     AssertMacro(((PageHeader) page)->pd_lower >= SizeOfPageHeaderData);
  }

  /*
   * PageGetMaxOffsetNumber
   *        Returns the maximum offset number used by the given page.
--- 1019,1087 ----
  Pointer PageGetUpperPointer(Page page)
  {
      AssertMacro(PageIsValid(page));
!     switch ( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : return page + ((PageHeader_04)(page))->pd_upper;
!         case 3 : return page + ((PageHeader_03)(page))->pd_upper;
!     }
!     elog(PANIC, "Unsupported page layout in function PageGetUpperPointer.");
  }

  void PageSetLower(Page page, LocationIndex lower)
  {
!     switch ( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : ((PageHeader_04) page)->pd_lower = lower;
!                 break;
!         case 3 : ((PageHeader_03) page)->pd_lower = lower;
!                 break;
!         default:  elog(PANIC, "Unsupported page layout in function PageSetLower.");
!     }
  }

  void PageSetUpper(Page page, LocationIndex upper)
  {
!     switch ( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : ((PageHeader_04) page)->pd_upper = upper;
!                 break;
!         case 3 : ((PageHeader_03) page)->pd_upper = upper;
!                 break;
!         default:  elog(PANIC, "Unsupported page layout in function PageSetLower.");
!     }
  }

  void PageReserveLinp(Page page)
  {
      AssertMacro(PageIsValid(page));
!     switch ( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : ((PageHeader_04) page)->pd_lower += sizeof(ItemIdData);
!                 AssertMacro(((PageHeader_04) page)->pd_lower <= ((PageHeader_04) page)->pd_upper );
!                 break;
!         case 3 : ((PageHeader_03) page)->pd_lower += sizeof(ItemIdData);
!                 AssertMacro(((PageHeader_03) page)->pd_lower <= ((PageHeader_03) page)->pd_upper );
!                 break;
!         default: elog(PANIC, "Unsupported page layout in function PageReserveLinp.");
!     }
  }

  void PageReleaseLinp(Page page)
  {
      AssertMacro(PageIsValid(page));
!     switch ( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : ((PageHeader_04) page)->pd_lower -= sizeof(ItemIdData);
!                 AssertMacro(((PageHeader_04) page)->pd_lower >= SizeOfPageHeaderData04);
!                 break;
!         case 3 : ((PageHeader_03) page)->pd_lower -= sizeof(ItemIdData);
!                 AssertMacro(((PageHeader_03) page)->pd_lower >= SizeOfPageHeaderData03);
!                 break;
!         default: elog(PANIC, "Unsupported page layout in function PageReleaseLinp.");
!     }
  }

+
  /*
   * PageGetMaxOffsetNumber
   *        Returns the maximum offset number used by the given page.
***************
*** 977,985 ****
   */
  int PageGetMaxOffsetNumber(Page page)
  {
!     PageHeader header = (PageHeader) (page);
!     return header->pd_lower <= SizeOfPageHeaderData ? 0 :
!              (header->pd_lower - SizeOfPageHeaderData) / sizeof(ItemIdData);
  }

  /*
--- 1094,1115 ----
   */
  int PageGetMaxOffsetNumber(Page page)
  {
!     switch ( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : {
!                     PageHeader_04 header = (PageHeader_04) (page);
!                      return header->pd_lower <= SizeOfPageHeaderData04 ? 0 :
!                      (header->pd_lower - SizeOfPageHeaderData04) / sizeof(ItemIdData);
!                  }
!         case 3 : {
!                     PageHeader_03 header = (PageHeader_03) (page);
!                      return header->pd_lower <= SizeOfPageHeaderData03 ? 0 :
!                      (header->pd_lower - SizeOfPageHeaderData03) / sizeof(ItemIdData);
!                  }
!
!     }
!     elog(PANIC, "Unsupported page layout in function PageGetMaxOffsetNumber. (%i)", PageGetPageLayoutVersion(page) );
!     return 0;
  }

  /*
***************
*** 987,1089 ****
   */
  XLogRecPtr PageGetLSN(Page page)
  {
!     return ((PageHeader) page)->pd_lsn;
  }

  LocationIndex PageGetLower(Page page)
  {
!     return ((PageHeader) page)->pd_lower;
  }

  LocationIndex PageGetUpper(Page page)
  {
!     return ((PageHeader) page)->pd_upper;
  }

  LocationIndex PageGetSpecial(Page page)
  {
!     return ((PageHeader) page)->pd_special;
  }

  void PageSetLSN(Page page, XLogRecPtr lsn)
  {
!     ((PageHeader) page)->pd_lsn = lsn;
  }

  /* NOTE: only the 16 least significant bits are stored */
  TimeLineID PageGetTLI(Page page)
  {
!     return ((PageHeader) (page))->pd_tli;
  }

  void PageSetTLI(Page page, TimeLineID tli)
  {
!     ((PageHeader) (page))->pd_tli = (uint16) (tli);
  }

  bool PageHasFreeLinePointers(Page page)
  {
!     return ((PageHeader) (page))->pd_flags & PD_HAS_FREE_LINES;
  }

  void PageSetHasFreeLinePointers(Page page)
  {
!     ((PageHeader) (page))->pd_flags |= PD_HAS_FREE_LINES;
  }

  void PageClearHasFreeLinePointers(Page page)
  {
!     ((PageHeader) (page))->pd_flags &= ~PD_HAS_FREE_LINES;
  }

  bool PageIsFull(Page page)
  {
!     return ((PageHeader) (page))->pd_flags & PD_PAGE_FULL;
  }

  void PageSetFull(Page page)
  {
!     ((PageHeader) (page))->pd_flags |= PD_PAGE_FULL;
  }


  void PageClearFull(Page page)
  {
!     ((PageHeader) (page))->pd_flags &= ~PD_PAGE_FULL;
  }

  bool PageIsPrunable(Page page, TransactionId oldestxmin)
  {
      AssertMacro(TransactionIdIsNormal(oldestxmin));
!     return (
!             TransactionIdIsValid(((PageHeader) page)->pd_prune_xid) &&
!             TransactionIdPrecedes(((PageHeader) page)->pd_prune_xid, oldestxmin) );
  }

  TransactionId PageGetPrunable(Page page)
  {
!     return ((PageHeader) page)->pd_prune_xid;
  }

  void PageSetPrunable(Page page, TransactionId xid)
  {
      Assert(TransactionIdIsNormal(xid));
!     if (!TransactionIdIsValid(((PageHeader) (page))->pd_prune_xid) ||
!                 TransactionIdPrecedes(xid, ((PageHeader) (page))->pd_prune_xid))
!                 ((PageHeader) (page))->pd_prune_xid = (xid);
  }

  void PageClearPrunable(Page page)
  {
!     ((PageHeader) page)->pd_prune_xid = InvalidTransactionId;
  }

  bool PageIsComprimable(Page page)
  {
!     PageHeader ph = (PageHeader) page;
!     return(ph->pd_lower >= SizeOfPageHeaderData &&
!             ph->pd_upper > ph->pd_lower &&
!             ph->pd_upper <= BLCKSZ);
  }

  /*
--- 1117,1335 ----
   */
  XLogRecPtr PageGetLSN(Page page)
  {
!     switch ( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : return ((PageHeader_04) page)->pd_lsn;
!         case 3 : return ((PageHeader_03) page)->pd_lsn;
!     }
!     elog(PANIC, "Unsupported page layout in function PageGetLSN.");
  }

  LocationIndex PageGetLower(Page page)
  {
!     switch ( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : return ((PageHeader_04) page)->pd_lower;
!         case 3 : return ((PageHeader_03) page)->pd_lower;
!         case 0 : return 0;
!     }
!     elog(PANIC, "Unsupported page layout in function PageGetLower.");
!     return 0;
  }

  LocationIndex PageGetUpper(Page page)
  {
!     switch ( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : return ((PageHeader_04) page)->pd_upper;
!         case 3 : return ((PageHeader_03) page)->pd_upper;
!         case 0 : return 0;
!     }
!     elog(PANIC, "Unsupported page layout in function PageGetUpper.");
!     return 0;
  }

  LocationIndex PageGetSpecial(Page page)
  {
!     switch ( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : return ((PageHeader_04) page)->pd_special;
!         case 3 : return ((PageHeader_03) page)->pd_special;
!     }
!     elog(PANIC, "Unsupported page layout in function PageGetUpper.");
!     return 0;
  }

+
  void PageSetLSN(Page page, XLogRecPtr lsn)
  {
!     switch ( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : ((PageHeader_04) page)->pd_lsn = lsn;
!                 break;
!         case 3 : ((PageHeader_03) page)->pd_lsn = lsn;
!                 break;
!         default: elog(PANIC, "Unsupported page layout in function PageSetLSN.");
!     }
  }

  /* NOTE: only the 16 least significant bits are stored */
  TimeLineID PageGetTLI(Page page)
  {
!     switch ( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : return ((PageHeader_04) (page))->pd_tli;
!         case 3 : return ((PageHeader_03) (page))->pd_tli;
!     }
!     elog(PANIC, "Unsupported page layout in function PageGetTLI.");
  }

  void PageSetTLI(Page page, TimeLineID tli)
  {
!     switch ( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : ((PageHeader_04) (page))->pd_tli = (uint16) (tli);
!                 break;
!         case 3 : ((PageHeader_03) (page))->pd_tli = tli;
!                 break;
!         default: elog(PANIC, "Unsupported page layout in function PageSetTLI.");
!     }
  }

  bool PageHasFreeLinePointers(Page page)
  {
!     switch ( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : return ((PageHeader_04) (page))->pd_flags & PD_HAS_FREE_LINES;
!         default: return false;
!     }
  }

  void PageSetHasFreeLinePointers(Page page)
  {
!     switch ( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : ((PageHeader_04) (page))->pd_flags |= PD_HAS_FREE_LINES;
!                  break;
!         default: elog(PANIC, "HasFreeLinePointers is not supported on page layout version %i",
!                 PageGetPageLayoutVersion(page));
!     }
  }

  void PageClearHasFreeLinePointers(Page page)
  {
!     switch ( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : ((PageHeader_04) (page))->pd_flags &= ~PD_HAS_FREE_LINES;
!                   break;
!         default: elog(PANIC, "HasFreeLinePointers is not supported on page layout version %i",
!                 PageGetPageLayoutVersion(page));
!     }
  }

  bool PageIsFull(Page page)
  {
!     switch ( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : return ((PageHeader_04) (page))->pd_flags & PD_PAGE_FULL;
!         default : return true;
!     }
!     return true; /* no space on old data page */
  }

  void PageSetFull(Page page)
  {
!     switch ( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : ((PageHeader_04) (page))->pd_flags |= PD_PAGE_FULL;
!                   break;
!         default: elog(PANIC, "PageSetFull is not supported on page layout version %i",
!                 PageGetPageLayoutVersion(page));
!     }
  }


  void PageClearFull(Page page)
  {
!     switch( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : ((PageHeader_04) (page))->pd_flags &= ~PD_PAGE_FULL;
!                   break;
!         default: elog(PANIC, "PageClearFull is not supported on page layout version %i",
!                 PageGetPageLayoutVersion(page));
!     }
  }

  bool PageIsPrunable(Page page, TransactionId oldestxmin)
  {
      AssertMacro(TransactionIdIsNormal(oldestxmin));
!     switch( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : return (
!             TransactionIdIsValid(((PageHeader_04) page)->pd_prune_xid) &&
!             TransactionIdPrecedes(((PageHeader_04) page)->pd_prune_xid, oldestxmin) );
!         case 3 : return false;
!     }
!     elog(PANIC, "PageIsPrunable is not supported on page layout version %i",
!         PageGetPageLayoutVersion(page));
  }

  TransactionId PageGetPrunable(Page page)
  {
!     switch( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : return ((PageHeader_04) page)->pd_prune_xid;
!         case 3 : return 0;
!     }
!     elog(PANIC, "PageGetPrunable is not supported on page layout version %i",
!         PageGetPageLayoutVersion(page));
  }

  void PageSetPrunable(Page page, TransactionId xid)
  {
      Assert(TransactionIdIsNormal(xid));
!     switch ( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : if (!TransactionIdIsValid(((PageHeader_04) (page))->pd_prune_xid) ||
!                 TransactionIdPrecedes(xid, ((PageHeader_04) (page))->pd_prune_xid))
!                 ((PageHeader_04) (page))->pd_prune_xid = (xid);
!                 break;
!         default: elog(PANIC, "PageSetPrunable is not supported on page layout version %i",
!                 PageGetPageLayoutVersion(page));
!     }
  }

  void PageClearPrunable(Page page)
  {
!     switch ( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : ((PageHeader_04) page)->pd_prune_xid = InvalidTransactionId;
!                 break;
! //        default: elog(PANIC, "PageClearPrunable is not supported on page layout version %i",
! //                PageGetPageLayoutVersion(page));
! //    Silently ignore this request
!     }
  }

  bool PageIsComprimable(Page page)
  {
!     switch ( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : {
!                     PageHeader_04 ph = (PageHeader_04) page;
!                     return(ph->pd_lower >= SizeOfPageHeaderData04 &&
!                         ph->pd_upper > ph->pd_lower &&
!                         ph->pd_upper <= BLCKSZ);
!                   }
!         case 3 : {
!                     PageHeader_03 ph = (PageHeader_03) page;
!                     return(ph->pd_lower >= SizeOfPageHeaderData03 &&
!                         ph->pd_upper > ph->pd_lower &&
!                         ph->pd_upper <= BLCKSZ);
!                   }
!         default: elog(PANIC, "PageIsComprimable is not supported on page layout version %i",
!                 PageGetPageLayoutVersion(page));
!     }
  }

  /*
***************
*** 1092,1097 ****
   */
  bool PageIsEmpty(Page page)
  {
!     return (((PageHeader) (page))->pd_lower <= SizeOfPageHeaderData);
  }

--- 1338,1349 ----
   */
  bool PageIsEmpty(Page page)
  {
!     switch ( PageGetPageLayoutVersion(page) )
!     {
!         case 4 : return (((PageHeader_04) (page))->pd_lower <= SizeOfPageHeaderData04);
!         case 3 : return (((PageHeader_04) (page))->pd_lower <= SizeOfPageHeaderData03);
!         default: elog(PANIC, "PageIsEmpty is not supported on page layout version %i",
!                 PageGetPageLayoutVersion(page));
!     }
  }

diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/include/access/gin.h
pgsql_master_upgrade.13a47c410da7/src/include/access/gin.h
*** pgsql_master_upgrade.751eb7c6969f/src/include/access/gin.h    2008-10-31 21:45:33.172329319 +0100
--- pgsql_master_upgrade.13a47c410da7/src/include/access/gin.h    2008-10-31 21:45:33.265898951 +0100
***************
*** 115,121 ****
  #define GinGetPosting(itup)            ( (ItemPointer)(( ((char*)(itup)) + SHORTALIGN(GinGetOrigSizePosting(itup)) ))
)

  #define GinMaxItemSize \
!     ((BLCKSZ - SizeOfPageHeaderData - \
          MAXALIGN(sizeof(GinPageOpaqueData))) / 3 - sizeof(ItemIdData))


--- 115,121 ----
  #define GinGetPosting(itup)            ( (ItemPointer)(( ((char*)(itup)) + SHORTALIGN(GinGetOrigSizePosting(itup)) ))
)

  #define GinMaxItemSize \
!     ((BLCKSZ - SizeOfPageHeaderData04 - \
          MAXALIGN(sizeof(GinPageOpaqueData))) / 3 - sizeof(ItemIdData))


***************
*** 131,137 ****
      (GinDataPageGetData(page) + ((i)-1) * GinSizeOfItem(page))

  #define GinDataPageGetFreeSpace(page)    \
!     (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
       - MAXALIGN(sizeof(ItemPointerData)) \
       - GinPageGetOpaque(page)->maxoff * GinSizeOfItem(page) \
       - MAXALIGN(sizeof(GinPageOpaqueData)))
--- 131,137 ----
      (GinDataPageGetData(page) + ((i)-1) * GinSizeOfItem(page))

  #define GinDataPageGetFreeSpace(page)    \
!     (BLCKSZ - MAXALIGN(SizeOfPageHeaderData04) \
       - MAXALIGN(sizeof(ItemPointerData)) \
       - GinPageGetOpaque(page)->maxoff * GinSizeOfItem(page) \
       - MAXALIGN(sizeof(GinPageOpaqueData)))
diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/include/access/gist_private.h
pgsql_master_upgrade.13a47c410da7/src/include/access/gist_private.h
*** pgsql_master_upgrade.751eb7c6969f/src/include/access/gist_private.h    2008-10-31 21:45:33.176541012 +0100
--- pgsql_master_upgrade.13a47c410da7/src/include/access/gist_private.h    2008-10-31 21:45:33.270105878 +0100
***************
*** 272,278 ****
  /* gistutil.c */

  #define GiSTPageSize   \
!     ( BLCKSZ - SizeOfPageHeaderData - MAXALIGN(sizeof(GISTPageOpaqueData)) )

  #define GIST_MIN_FILLFACTOR            10
  #define GIST_DEFAULT_FILLFACTOR        90
--- 272,278 ----
  /* gistutil.c */

  #define GiSTPageSize   \
!     ( BLCKSZ - SizeOfPageHeaderData04 - MAXALIGN(sizeof(GISTPageOpaqueData)) )

  #define GIST_MIN_FILLFACTOR            10
  #define GIST_DEFAULT_FILLFACTOR        90
diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/include/access/hash.h
pgsql_master_upgrade.13a47c410da7/src/include/access/hash.h
*** pgsql_master_upgrade.751eb7c6969f/src/include/access/hash.h    2008-10-31 21:45:33.180773015 +0100
--- pgsql_master_upgrade.13a47c410da7/src/include/access/hash.h    2008-10-31 21:45:33.274260668 +0100
***************
*** 168,174 ****
   */
  #define HashMaxItemSize(page) \
      MAXALIGN_DOWN(PageGetPageSize(page) - \
!                   SizeOfPageHeaderData - \
                    sizeof(ItemIdData) - \
                    MAXALIGN(sizeof(HashPageOpaqueData)))

--- 168,174 ----
   */
  #define HashMaxItemSize(page) \
      MAXALIGN_DOWN(PageGetPageSize(page) - \
!                   SizeOfPageHeaderData04 - \
                    sizeof(ItemIdData) - \
                    MAXALIGN(sizeof(HashPageOpaqueData)))

***************
*** 198,204 ****

  #define HashGetMaxBitmapSize(page) \
      (PageGetPageSize((Page) page) - \
!      (MAXALIGN(SizeOfPageHeaderData) + MAXALIGN(sizeof(HashPageOpaqueData))))

  #define HashPageGetMeta(page) \
      ((HashMetaPage) PageGetContents(page))
--- 198,204 ----

  #define HashGetMaxBitmapSize(page) \
      (PageGetPageSize((Page) page) - \
!      (MAXALIGN(SizeOfPageHeaderData04) + MAXALIGN(sizeof(HashPageOpaqueData))))

  #define HashPageGetMeta(page) \
      ((HashMetaPage) PageGetContents(page))
diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/include/access/htup_03.h
pgsql_master_upgrade.13a47c410da7/src/include/access/htup_03.h
*** pgsql_master_upgrade.751eb7c6969f/src/include/access/htup_03.h    1970-01-01 01:00:00.000000000 +0100
--- pgsql_master_upgrade.13a47c410da7/src/include/access/htup_03.h    2008-10-31 21:45:33.282414095 +0100
***************
*** 0 ****
--- 1,311 ----
+ /*-------------------------------------------------------------------------
+  *
+  * htup.h
+  *      POSTGRES heap tuple definitions.
+  *
+  *
+  * Portions Copyright (c) 1996-2006, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * $PostgreSQL: pgsql/src/include/access/htup.h,v 1.87 2006/11/05 22:42:10 tgl Exp $
+  *
+  *-------------------------------------------------------------------------
+  */
+ #ifndef HTUP_03_H
+ #define HTUP_03_H
+
+ #include "access/htup.h"
+ #include "storage/itemptr.h"
+ #include "storage/relfilenode.h"
+ #include "utils/rel.h"
+
+ /*
+  * Heap tuple header.  To avoid wasting space, the fields should be
+  * layed out in such a way to avoid structure padding.
+  *
+  * Datums of composite types (row types) share the same general structure
+  * as on-disk tuples, so that the same routines can be used to build and
+  * examine them.  However the requirements are slightly different: a Datum
+  * does not need any transaction visibility information, and it does need
+  * a length word and some embedded type information.  We can achieve this
+  * by overlaying the xmin/cmin/xmax/cmax/xvac fields of a heap tuple
+  * with the fields needed in the Datum case.  Typically, all tuples built
+  * in-memory will be initialized with the Datum fields; but when a tuple is
+  * about to be inserted in a table, the transaction fields will be filled,
+  * overwriting the datum fields.
+  *
+  * The overall structure of a heap tuple looks like:
+  *            fixed fields (HeapTupleHeaderData struct)
+  *            nulls bitmap (if HEAP_HASNULL is set in t_infomask)
+  *            alignment padding (as needed to make user data MAXALIGN'd)
+  *            object ID (if HEAP_HASOID is set in t_infomask)
+  *            user data fields
+  *
+  * We store five "virtual" fields Xmin, Cmin, Xmax, Cmax, and Xvac in four
+  * physical fields.  Xmin, Cmin and Xmax are always really stored, but
+  * Cmax and Xvac share a field.  This works because we know that there are
+  * only a limited number of states that a tuple can be in, and that Cmax
+  * is only interesting for the lifetime of the deleting transaction.
+  * This assumes that VACUUM FULL never tries to move a tuple whose Cmax
+  * is still interesting (ie, delete-in-progress).
+  *
+  * Note that in 7.3 and 7.4 a similar idea was applied to Xmax and Cmin.
+  * However, with the advent of subtransactions, a tuple may need both Xmax
+  * and Cmin simultaneously, so this is no longer possible.
+  *
+  * A word about t_ctid: whenever a new tuple is stored on disk, its t_ctid
+  * is initialized with its own TID (location).    If the tuple is ever updated,
+  * its t_ctid is changed to point to the replacement version of the tuple.
+  * Thus, a tuple is the latest version of its row iff XMAX is invalid or
+  * t_ctid points to itself (in which case, if XMAX is valid, the tuple is
+  * either locked or deleted).  One can follow the chain of t_ctid links
+  * to find the newest version of the row.  Beware however that VACUUM might
+  * erase the pointed-to (newer) tuple before erasing the pointing (older)
+  * tuple.  Hence, when following a t_ctid link, it is necessary to check
+  * to see if the referenced slot is empty or contains an unrelated tuple.
+  * Check that the referenced tuple has XMIN equal to the referencing tuple's
+  * XMAX to verify that it is actually the descendant version and not an
+  * unrelated tuple stored into a slot recently freed by VACUUM.  If either
+  * check fails, one may assume that there is no live descendant version.
+  *
+  * Following the fixed header fields, the nulls bitmap is stored (beginning
+  * at t_bits).    The bitmap is *not* stored if t_infomask shows that there
+  * are no nulls in the tuple.  If an OID field is present (as indicated by
+  * t_infomask), then it is stored just before the user data, which begins at
+  * the offset shown by t_hoff.    Note that t_hoff must be a multiple of
+  * MAXALIGN.
+  */
+
+ typedef struct HeapTupleFields_03
+ {
+     TransactionId t_xmin;        /* inserting xact ID */
+     CommandId    t_cmin;            /* inserting command ID */
+     TransactionId t_xmax;        /* deleting or locking xact ID */
+
+     union
+     {
+         CommandId    t_cmax;        /* deleting or locking command ID */
+         TransactionId t_xvac;    /* VACUUM FULL xact ID */
+     }            t_field4;
+ } HeapTupleFields_03;
+
+ typedef struct DatumTupleFields_03
+ {
+     int32        datum_len;        /* required to be a varlena type */
+
+     int32        datum_typmod;    /* -1, or identifier of a record type */
+
+     Oid            datum_typeid;    /* composite type OID, or RECORDOID */
+
+     /*
+      * Note: field ordering is chosen with thought that Oid might someday
+      * widen to 64 bits.
+      */
+ } DatumTupleFields_03;
+
+ typedef struct HeapTupleHeaderData_03
+ {
+     union
+     {
+         HeapTupleFields_03 t_heap;
+         DatumTupleFields_03 t_datum;
+     }            t_choice;
+
+     ItemPointerData t_ctid;        /* current TID of this or newer tuple */
+
+     /* Fields below here must match MinimalTupleData! */
+
+     int16        t_natts;        /* number of attributes */
+
+     uint16        t_infomask;        /* various flag bits, see below */
+
+     uint8        t_hoff;            /* sizeof header incl. bitmap, padding */
+
+     /* ^ - 27 bytes - ^ */
+
+     bits8        t_bits[1];        /* bitmap of NULLs -- VARIABLE LENGTH */
+
+     /* MORE DATA FOLLOWS AT END OF STRUCT */
+ } HeapTupleHeaderData_03;
+
+ typedef HeapTupleHeaderData_03 *HeapTupleHeader_03;
+
+ /*
+  * information stored in t_infomask:
+  */
+ #define HEAP03_HASNULL            0x0001    /* has null attribute(s) */
+ #define HEAP03_HASVARWIDTH        0x0002    /* has variable-width attribute(s) */
+ #define HEAP03_HASEXTERNAL        0x0004    /* has external stored attribute(s) */
+ #define HEAP03_HASCOMPRESSED    0x0008    /* has compressed stored attribute(s) */
+ #define HEAP03_HASEXTENDED        0x000C    /* the two above combined */
+ #define HEAP03_HASOID            0x0010    /* has an object-id field */
+ /* 0x0020 is presently unused */
+ #define HEAP03_XMAX_EXCL_LOCK    0x0040    /* xmax is exclusive locker */
+ #define HEAP03_XMAX_SHARED_LOCK    0x0080    /* xmax is shared locker */
+ /* if either LOCK bit is set, xmax hasn't deleted the tuple, only locked it */
+ #define HEAP03_IS_LOCKED    (HEAP03_XMAX_EXCL_LOCK | HEAP03_XMAX_SHARED_LOCK)
+ #define HEAP03_XMIN_COMMITTED    0x0100    /* t_xmin committed */
+ #define HEAP03_XMIN_INVALID        0x0200    /* t_xmin invalid/aborted */
+ #define HEAP03_XMAX_COMMITTED    0x0400    /* t_xmax committed */
+ #define HEAP03_XMAX_INVALID        0x0800    /* t_xmax invalid/aborted */
+ #define HEAP03_XMAX_IS_MULTI    0x1000    /* t_xmax is a MultiXactId */
+ #define HEAP03_UPDATED            0x2000    /* this is UPDATEd version of row */
+ #define HEAP03_MOVED_OFF        0x4000    /* moved to another place by VACUUM
+                                          * FULL */
+ #define HEAP03_MOVED_IN            0x8000    /* moved from another place by VACUUM
+                                          * FULL */
+ #define HEAP03_MOVED (HEAP_MOVED_OFF | HEAP_MOVED_IN)
+
+ #define HEAP03_XACT_MASK            0xFFC0    /* visibility-related bits */
+
+
+ /*
+  * HeapTupleHeader accessor macros
+  *
+  * Note: beware of multiple evaluations of "tup" argument.    But the Set
+  * macros evaluate their other argument only once.
+  */
+ /*
+ #define HeapTupleHeaderGetXmin(tup) \
+ ( \
+     (tup)->t_choice.t_heap.t_xmin \
+ )
+
+ #define HeapTupleHeaderSetXmin(tup, xid) \
+ ( \
+     TransactionIdStore((xid), &(tup)->t_choice.t_heap.t_xmin) \
+ )
+
+ #define HeapTupleHeaderGetXmax(tup) \
+ ( \
+     (tup)->t_choice.t_heap.t_xmax \
+ )
+
+ #define HeapTupleHeaderSetXmax(tup, xid) \
+ ( \
+     TransactionIdStore((xid), &(tup)->t_choice.t_heap.t_xmax) \
+ )
+
+ #define HeapTupleHeaderGetCmin(tup) \
+ ( \
+     (tup)->t_choice.t_heap.t_cmin \
+ )
+
+ #define HeapTupleHeaderSetCmin(tup, cid) \
+ ( \
+     (tup)->t_choice.t_heap.t_cmin = (cid) \
+ )
+ */
+ /*
+  * Note: GetCmax will produce wrong answers after SetXvac has been executed
+  * by a transaction other than the inserting one.  We could check
+  * HEAP_XMAX_INVALID and return FirstCommandId if it's clear, but since that
+  * bit will be set again if the deleting transaction aborts, there'd be no
+  * real gain in safety from the extra test.  So, just rely on the caller not
+  * to trust the value unless it's meaningful.
+  */
+ /*
+ #define HeapTupleHeaderGetCmax(tup) \
+ ( \
+     (tup)->t_choice.t_heap.t_field4.t_cmax \
+ )
+
+ #define HeapTupleHeaderSetCmax(tup, cid) \
+ do { \
+     Assert(!((tup)->t_infomask & HEAP_MOVED)); \
+     (tup)->t_choice.t_heap.t_field4.t_cmax = (cid); \
+ } while (0)
+
+ #define HeapTupleHeaderGetXvac(tup) \
+ ( \
+     ((tup)->t_infomask & HEAP_MOVED) ? \
+         (tup)->t_choice.t_heap.t_field4.t_xvac \
+     : \
+         InvalidTransactionId \
+ )
+
+ #define HeapTupleHeaderSetXvac(tup, xid) \
+ do { \
+     Assert((tup)->t_infomask & HEAP_MOVED); \
+     TransactionIdStore((xid), &(tup)->t_choice.t_heap.t_field4.t_xvac); \
+ } while (0)
+
+ #define HeapTupleHeaderGetDatumLength(tup) \
+ ( \
+     (tup)->t_choice.t_datum.datum_len \
+ )
+
+ #define HeapTupleHeaderSetDatumLength(tup, len) \
+ ( \
+     (tup)->t_choice.t_datum.datum_len = (len) \
+ )
+
+ #define HeapTupleHeaderGetTypeId(tup) \
+ ( \
+     (tup)->t_choice.t_datum.datum_typeid \
+ )
+
+ #define HeapTupleHeaderSetTypeId(tup, typeid) \
+ ( \
+     (tup)->t_choice.t_datum.datum_typeid = (typeid) \
+ )
+
+ #define HeapTupleHeaderGetTypMod(tup) \
+ ( \
+     (tup)->t_choice.t_datum.datum_typmod \
+ )
+
+ #define HeapTupleHeaderSetTypMod(tup, typmod) \
+ ( \
+     (tup)->t_choice.t_datum.datum_typmod = (typmod) \
+ )
+
+ #define HeapTupleHeaderGetOid(tup) \
+ ( \
+     ((tup)->t_infomask & HEAP_HASOID) ? \
+         *((Oid *) ((char *)(tup) + (tup)->t_hoff - sizeof(Oid))) \
+     : \
+         InvalidOid \
+ )
+
+ #define HeapTupleHeaderSetOid(tup, oid) \
+ do { \
+     Assert((tup)->t_infomask & HEAP_HASOID); \
+     *((Oid *) ((char *)(tup) + (tup)->t_hoff - sizeof(Oid))) = (oid); \
+ } while (0)
+
+ */
+ /*
+  * BITMAPLEN(NATTS) -
+  *        Computes size of null bitmap given number of data columns.
+  */
+ //#define BITMAPLEN(NATTS)    (((int)(NATTS) + 7) / 8)
+
+ /*
+  * MaxTupleSize is the maximum allowed size of a tuple, including header and
+  * MAXALIGN alignment padding.    Basically it's BLCKSZ minus the other stuff
+  * that has to be on a disk page.  The "other stuff" includes access-method-
+  * dependent "special space", which we assume will be no more than
+  * MaxSpecialSpace bytes (currently, on heap pages it's actually zero).
+  *
+  * NOTE: we do not need to count an ItemId for the tuple because
+  * sizeof(PageHeaderData) includes the first ItemId on the page.
+  */
+ //#define MaxSpecialSpace  32
+
+ //#define MaxTupleSize    \
+ //    (BLCKSZ - MAXALIGN(sizeof(PageHeaderData) + MaxSpecialSpace))
+
+ /*
+  * MaxHeapTuplesPerPage is an upper bound on the number of tuples that can
+  * fit on one heap page.  (Note that indexes could have more, because they
+  * use a smaller tuple header.)  We arrive at the divisor because each tuple
+  * must be maxaligned, and it must have an associated item pointer.
+  */
+ //#define MaxHeapTuplesPerPage    \
+ //    ((int) ((BLCKSZ - offsetof(PageHeaderData, pd_linp)) / \
+ //            (MAXALIGN(offsetof(HeapTupleHeaderData, t_bits)) + sizeof(ItemIdData))))
+
+ extern HeapTuple heap_tuple_upgrade_03(Relation rel, HeapTuple tuple);
+
+ #endif   /* HTUP_H */
diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/include/access/tuplimits.h
pgsql_master_upgrade.13a47c410da7/src/include/access/tuplimits.h
*** pgsql_master_upgrade.751eb7c6969f/src/include/access/tuplimits.h    2008-10-31 21:45:33.181679309 +0100
--- pgsql_master_upgrade.13a47c410da7/src/include/access/tuplimits.h    2008-10-31 21:45:33.275179286 +0100
***************
*** 29,35 ****
   * ItemIds and tuples have different alignment requirements, don't assume that
   * you can, say, fit 2 tuples of size MaxHeapTupleSize/2 on the same page.
   */
! #define MaxHeapTupleSize  (BLCKSZ - MAXALIGN(SizeOfPageHeaderData + sizeof(ItemIdData)))

  /*
   * MaxHeapTuplesPerPage is an upper bound on the number of tuples that can
--- 29,35 ----
   * ItemIds and tuples have different alignment requirements, don't assume that
   * you can, say, fit 2 tuples of size MaxHeapTupleSize/2 on the same page.
   */
! #define MaxHeapTupleSize  (BLCKSZ - MAXALIGN(SizeOfPageHeaderData04 + sizeof(ItemIdData)))

  /*
   * MaxHeapTuplesPerPage is an upper bound on the number of tuples that can
***************
*** 43,49 ****
   * require increases in the size of work arrays.
   */
  #define MaxHeapTuplesPerPage    \
!     ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
              (MAXALIGN(offsetof(HeapTupleHeaderData, t_bits)) + sizeof(ItemIdData))))


--- 43,49 ----
   * require increases in the size of work arrays.
   */
  #define MaxHeapTuplesPerPage    \
!     ((int) ((BLCKSZ - SizeOfPageHeaderData04) / \
              (MAXALIGN(offsetof(HeapTupleHeaderData, t_bits)) + sizeof(ItemIdData))))


***************
*** 55,61 ****
   * must be maxaligned, and it must have an associated item pointer.
   */
  #define MaxIndexTuplesPerPage    \
!     ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
              (MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))))

  /*
--- 55,61 ----
   * must be maxaligned, and it must have an associated item pointer.
   */
  #define MaxIndexTuplesPerPage    \
!     ((int) ((BLCKSZ - SizeOfPageHeaderData04) / \
              (MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))))

  /*
***************
*** 66,72 ****
   */
  #define BTMaxItemSize(page) \
      MAXALIGN_DOWN((PageGetPageSize(page) - \
!                    MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
                     MAXALIGN(sizeof(BTPageOpaqueData))) / 3)


--- 66,72 ----
   */
  #define BTMaxItemSize(page) \
      MAXALIGN_DOWN((PageGetPageSize(page) - \
!                    MAXALIGN(SizeOfPageHeaderData04 + 3*sizeof(ItemIdData)) - \
                     MAXALIGN(sizeof(BTPageOpaqueData))) / 3)


diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/include/access/tuptoaster.h
pgsql_master_upgrade.13a47c410da7/src/include/access/tuptoaster.h
*** pgsql_master_upgrade.751eb7c6969f/src/include/access/tuptoaster.h    2008-10-31 21:45:33.183208112 +0100
--- pgsql_master_upgrade.13a47c410da7/src/include/access/tuptoaster.h    2008-10-31 21:45:33.276709073 +0100
***************
*** 49,55 ****

  #define TOAST_TUPLE_THRESHOLD    \
      MAXALIGN_DOWN((BLCKSZ - \
!                    MAXALIGN(SizeOfPageHeaderData + TOAST_TUPLES_PER_PAGE * sizeof(ItemIdData))) \
                    / TOAST_TUPLES_PER_PAGE)

  #define TOAST_TUPLE_TARGET        TOAST_TUPLE_THRESHOLD
--- 49,55 ----

  #define TOAST_TUPLE_THRESHOLD    \
      MAXALIGN_DOWN((BLCKSZ - \
!                    MAXALIGN(SizeOfPageHeaderData04 + TOAST_TUPLES_PER_PAGE * sizeof(ItemIdData))) \
                    / TOAST_TUPLES_PER_PAGE)

  #define TOAST_TUPLE_TARGET        TOAST_TUPLE_THRESHOLD
***************
*** 75,81 ****

  #define EXTERN_TUPLE_MAX_SIZE    \
      MAXALIGN_DOWN((BLCKSZ - \
!                    MAXALIGN(SizeOfPageHeaderData + EXTERN_TUPLES_PER_PAGE * sizeof(ItemIdData))) \
                    / EXTERN_TUPLES_PER_PAGE)

  #define TOAST_MAX_CHUNK_SIZE    \
--- 75,81 ----

  #define EXTERN_TUPLE_MAX_SIZE    \
      MAXALIGN_DOWN((BLCKSZ - \
!                    MAXALIGN(SizeOfPageHeaderData04 + EXTERN_TUPLES_PER_PAGE * sizeof(ItemIdData))) \
                    / EXTERN_TUPLES_PER_PAGE)

  #define TOAST_MAX_CHUNK_SIZE    \
diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/include/storage/bufpage.h
pgsql_master_upgrade.13a47c410da7/src/include/storage/bufpage.h
*** pgsql_master_upgrade.751eb7c6969f/src/include/storage/bufpage.h    2008-10-31 21:45:33.185832325 +0100
--- pgsql_master_upgrade.13a47c410da7/src/include/storage/bufpage.h    2008-10-31 21:45:33.279178878 +0100
***************
*** 121,127 ****
   * On the high end, we can only support pages up to 32KB because lp_off/lp_len
   * are 15 bits.
   */
! typedef struct PageHeaderData
  {
      /* XXX LSN is member of *any* block, not only page-organized ones */
      XLogRecPtr    pd_lsn;            /* LSN: next byte after last byte of xlog
--- 121,127 ----
   * On the high end, we can only support pages up to 32KB because lp_off/lp_len
   * are 15 bits.
   */
! typedef struct PageHeaderData_04
  {
      /* XXX LSN is member of *any* block, not only page-organized ones */
      XLogRecPtr    pd_lsn;            /* LSN: next byte after last byte of xlog
***************
*** 135,143 ****
      uint16        pd_pagesize_version;
      TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
      ItemIdData    pd_linp[1];        /* beginning of line pointer array */
! } PageHeaderData;

- typedef PageHeaderData *PageHeader;

  /*
   * pd_flags contains the following flag bits.  Undefined bits are initialized
--- 135,160 ----
      uint16        pd_pagesize_version;
      TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
      ItemIdData    pd_linp[1];        /* beginning of line pointer array */
! } PageHeaderData_04;
!
! typedef PageHeaderData_04 *PageHeader_04;
!
! typedef struct PageHeaderData_03
! {
!     /* XXX LSN is member of *any* block, not only page-organized ones */
!     XLogRecPtr    pd_lsn;            /* LSN: next byte after last byte of xlog
!                                  * record for last change to this page */
!     TimeLineID    pd_tli;            /* TLI of last change */
!     LocationIndex pd_lower;        /* offset to start of free space */
!     LocationIndex pd_upper;        /* offset to end of free space */
!     LocationIndex pd_special;    /* offset to start of special space */
!     uint16        pd_pagesize_version;
!     ItemIdData    pd_linp[1];        /* beginning of line pointer array */
! } PageHeaderData_03;
!
! typedef PageHeaderData_03 *PageHeader_03;
!


  /*
   * pd_flags contains the following flag bits.  Undefined bits are initialized
***************
*** 181,195 ****
  #define PageIsValid(page) PointerIsValid(page)

  /*
!  * line pointer(s) do not count as part of header
   */
! #define SizeOfPageHeaderData (offsetof(PageHeaderData, pd_linp))

  /*
   * PageIsNew
   *        returns true iff page has not been initialized (by PageInit)
   */
! #define PageIsNew(page) (((PageHeader) (page))->pd_upper == 0)

  /* ----------------
   *        macros to access page size info
--- 198,213 ----
  #define PageIsValid(page) PointerIsValid(page)

  /*
!  * line pointer does not count as part of header
   */
! #define SizeOfPageHeaderData04 offsetof(PageHeaderData_04, pd_linp[0])
! #define SizeOfPageHeaderData03 offsetof(PageHeaderData_03, pd_linp[0])

  /*
   * PageIsNew
   *        returns true iff page has not been initialized (by PageInit)
   */
! #define PageIsNew(page) (((PageHeader_04) (page))->pd_upper == 0)

  /* ----------------
   *        macros to access page size info
***************
*** 211,224 ****
   * however, it can be called on a page that is not stored in a buffer.
   */
  #define PageGetPageSize(page) \
!     ((Size) (((PageHeader) (page))->pd_pagesize_version & (uint16) 0xFF00))

  /*
   * PageGetPageLayoutVersion
   *        Returns the page layout version of a page.
   */
  #define PageGetPageLayoutVersion(page) \
!     (((PageHeader) (page))->pd_pagesize_version & 0x00FF)

  /*
   * PageSetPageSizeAndVersion
--- 229,242 ----
   * however, it can be called on a page that is not stored in a buffer.
   */
  #define PageGetPageSize(page) \
!     ((Size) (((PageHeader_04) (page))->pd_pagesize_version & (uint16) 0xFF00))

  /*
   * PageGetPageLayoutVersion
   *        Returns the page layout version of a page.
   */
  #define PageGetPageLayoutVersion(page) \
!     (((PageHeader_04) (page))->pd_pagesize_version & 0x00FF)

  /*
   * PageSetPageSizeAndVersion
***************
*** 231,239 ****
  ( \
      AssertMacro(((size) & 0xFF00) == (size)), \
      AssertMacro(((version) & 0x00FF) == (version)), \
!     ((PageHeader) (page))->pd_pagesize_version = (size) | (version) \
  )

  /* ----------------------------------------------------------------
   *        extern declarations
   * ----------------------------------------------------------------
--- 249,259 ----
  ( \
      AssertMacro(((size) & 0xFF00) == (size)), \
      AssertMacro(((version) & 0x00FF) == (version)), \
!     ((PageHeader_04) (page))->pd_pagesize_version = (size) | (version) \
  )

+
+
  /* ----------------------------------------------------------------
   *        extern declarations
   * ----------------------------------------------------------------
***************
*** 303,307 ****
--- 323,328 ----
  extern OffsetNumber    PageItemGetRedirect(Page, OffsetNumber offsetNumber);

  extern ItemId PageGetItemId(Page page, OffsetNumber offsetNumber);/* do not use it - only for pg_inspect contrib
modul*/ 
+

  #endif   /* BUFPAGE_H */
diff -Nrc pgsql_master_upgrade.751eb7c6969f/src/include/storage/fsm_internals.h
pgsql_master_upgrade.13a47c410da7/src/include/storage/fsm_internals.h
*** pgsql_master_upgrade.751eb7c6969f/src/include/storage/fsm_internals.h    2008-10-31 21:45:33.186717311 +0100
--- pgsql_master_upgrade.13a47c410da7/src/include/storage/fsm_internals.h    2008-10-31 21:45:33.280068969 +0100
***************
*** 49,55 ****
   * Number of non-leaf and leaf nodes, and nodes in total, on an FSM page.
   * These definitions are internal to fsmpage.c.
   */
! #define NodesPerPage (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - \
                        offsetof(FSMPageData, fp_nodes))

  #define NonLeafNodesPerPage (BLCKSZ / 2 - 1)
--- 49,55 ----
   * Number of non-leaf and leaf nodes, and nodes in total, on an FSM page.
   * These definitions are internal to fsmpage.c.
   */
! #define NodesPerPage (BLCKSZ - MAXALIGN(SizeOfPageHeaderData04) - \
                        offsetof(FSMPageData, fp_nodes))

  #define NonLeafNodesPerPage (BLCKSZ / 2 - 1)

Re: [WIP] In-place upgrade

From

"Robert Haas"

Date:

03 November 2008, 00:26:29

I tried to apply this patch to CVS HEAD and it blew up all over the
place. It doesn't seem to be intended to apply against CVS HEAD; for
example, I don't have backend/access/heap/htup.c at all, so can't
apply changes to that file. I was able to clone the GIT repository
with the following command...

git clone http://git.postgresql.org/git/~davidfetter/upgrade_in_place/.git

...but now I'm confused, because I don't see the changes from the diff
reflected in the resulting tree. As you can see, I am not a git
wizard. Any help would be appreciated.

Here are a few initial thoughts based mostly on reading the diff:

In the minor nit department, I don't really like the idea of
PageHeaderData_04, SizeOfPageHeaderData04, PageLayoutIsValid_04, etc.
I think the latest version should just be PageHeaderData and
SizeOfPageHeaderData, and previous versions should be, e.g.
PageHeaderDataV3. It looks to me like this would cut a few hunks out
of this and maybe make it a bit easier to understand what is going on.At any rate, if we are going to stick with an
explicitversion number

in both versions, it should be marked in a consistent way, not _04
sometimes and just 04 other times. My suggestion is e.g. "V4" but
YMMV.

The changes to nodeIndexscan.c and nodeSeqscan.c are worrisome to me.
It looks like the added code is (nearly?) identical in both places, so
probably it needs to be refactored to avoid code duplication. I'm
also a bit skeptical about the idea of doing the tuple conversion
here. Why here rather than ExecStoreTuple()? If you decide to
convert the tuple, you can palloc the new one, pfree the old one if
ShouldFree is set, and reset shouldFree to true.

I am pretty skeptical of the idea that all of the HeapTuple* functions
can just be conditionalized on the page version and everything will
Just Work. It seems like that is too low a level to be worrying about
such things. Even if it happens to work for the changes between V3
and V4, what happens when V5 or V6 is changed in such a way that the
answer to HeapTupleIsWhatever is neither "Yes" nor "No", but rather
"Maybe" or "Seven"? The performance hit also sounds painful. I don't
have a better idea right now though...

I think it's going to be absolutely imperative to begin vacuuming away
old V3 pages as quickly as possible after the upgrade. If you go with
the approach of converting the tuple in, or just before,
ExecStoreTuple, then you're going to introduce a lot of overhead when
working with V3 pages. I think that's fine. You should plan to do
your in-place upgrade at 1AM on Christmas morning (or whenever your
load hits rock bottom...) and immediately start converting the
database, starting with your most important and smallest tables. In
fact, I would look whenever possible for ways to make the V4 case a
fast-path and just accept that the system is going to labor a bit when
dealing with V3 stuff. Any overhead you introduce when dealing with
V3 pages can go away; any V4 overhead is permanent and therefore much
more difficult to accept.

That's about all I have for now... if you can give me some pointers on
working with this git repository, or provide a complete patch that
applies cleanly to CVS HEAD, I will try to look at this in more
detail.

...Robert

Re: [WIP] In-place upgrade

From

Zdenek Kotala

Date:

03 November 2008, 16:36:27

Big thanks for review.

Robert Haas napsal(a):
> I tried to apply this patch to CVS HEAD and it blew up all over the
> place.  It doesn't seem to be intended to apply against CVS HEAD; for
> example, I don't have backend/access/heap/htup.c at all, so can't
> apply changes to that file.  

You need to apply also two other patches:
which are located here:
http://wiki.postgresql.org/wiki/CommitFestInProgress#Upgrade-in-place_and_related_issues
I moved one related patch from another category here to correct place.

The problem is that it is difficult to keep it in sync with head, because they 
change a lot of things. It the reason why I put all also into GIT repository, 
but ...

> I was able to clone the GIT repository
> with the following command...
> 
> git clone http://git.postgresql.org/git/~davidfetter/upgrade_in_place/.git
> 
> ...but now I'm confused, because I don't see the changes from the diff
> reflected in the resulting tree.  As you can see, I am not a git
> wizard.  Any help would be appreciated.

I'm GIT newbie I use mercurial for development and I manually applied changes 
into GIT. I asked David Fetter with help how to get back the correct clone. In 
meantime you can download a tarball.


http://git.postgresql.org/?p=~davidfetter/upgrade_in_place/.git;a=snapshot;h=c72bafada59ed278ffac59657c913bc375f77808;sf=tgz

It should contains every think including yesterdays improvements (delete, 
insert, update works - inser/update only on table without index).

> Here are a few initial thoughts based mostly on reading the diff:
> 
> In the minor nit department, I don't really like the idea of
> PageHeaderData_04, SizeOfPageHeaderData04, PageLayoutIsValid_04, etc.
> I think the latest version should just be PageHeaderData and
> SizeOfPageHeaderData, and previous versions should be, e.g.
> PageHeaderDataV3.  It looks to me like this would cut a few hunks out
> of this and maybe make it a bit easier to understand what is going on.
>  At any rate, if we are going to stick with an explicit version number
> in both versions, it should be marked in a consistent way, not _04
> sometimes and just 04 other times.  My suggestion is e.g. "V4" but
> YMMV.

Yeah, it is most difficult part :-) find correct names for it. I think that each  version of structure should have
versionsuffix including lastone. And of 
 
cource the last one we should have a general name without suffix - see example:

typedef struct PageHeaderData_04 { ...} PageHeaderData_04
typedef struct PageHeaderData_03 { ...} PageHeaderData_03
typedef PageHeaderData_04 PageHeaderData

This allows you exactly specify version on places where you need it and keep 
general name where version is not relevant.

How suffix should looks it another question. I prefer to have 04 not only 4.
What's about PageHeaderData_V04?

By the way what YMMV means?

> The changes to nodeIndexscan.c and nodeSeqscan.c are worrisome to me.
> It looks like the added code is (nearly?) identical in both places, so
> probably it needs to be refactored to avoid code duplication.  I'm
> also a bit skeptical about the idea of doing the tuple conversion
> here.  Why here rather than ExecStoreTuple()?  If you decide to
> convert the tuple, you can palloc the new one, pfree the old one if
> ShouldFree is set, and reset shouldFree to true.

Good point. I thought about it as a one variant. And if I look it close now it 
is really much better place. It should fix a problem why REINDEX does not work. 
I will move it.

> I am pretty skeptical of the idea that all of the HeapTuple* functions
> can just be conditionalized on the page version and everything will
> Just Work.  It seems like that is too low a level to be worrying about
> such things.  Even if it happens to work for the changes between V3
> and V4, what happens when V5 or V6 is changed in such a way that the
> answer to HeapTupleIsWhatever is neither "Yes" nor "No", but rather
> "Maybe" or "Seven"?  The performance hit also sounds painful.  I don't
> have a better idea right now though...

OK. Currently it works (or I hope that it works). If somebody in a future invent 
some special change, i think in most (maybe all) cases there will be possible 
mapping.

The speed is key point. When I check it last time I go 1% performance drop in 
fresh database. I think 1% is good price for in-place online upgrade.

> I think it's going to be absolutely imperative to begin vacuuming away
> old V3 pages as quickly as possible after the upgrade.  If you go with
> the approach of converting the tuple in, or just before,
> ExecStoreTuple, then you're going to introduce a lot of overhead when
> working with V3 pages.  I think that's fine.  You should plan to do
> your in-place upgrade at 1AM on Christmas morning (or whenever your
> load hits rock bottom...) and immediately start converting the
> database, starting with your most important and smallest tables.  In
> fact, I would look whenever possible for ways to make the V4 case a
> fast-path and just accept that the system is going to labor a bit when
> dealing with V3 stuff.  Any overhead you introduce when dealing with
> V3 pages can go away; any V4 overhead is permanent and therefore much
> more difficult to accept.

Yes, it is a plan to improve vacuum to convert old page to new one. But in as a 
second step. I have already page converter code. With some modification it could 
be integrated easily into vacuum code.

> That's about all I have for now... if you can give me some pointers on
> working with this git repository, or provide a complete patch that
> applies cleanly to CVS HEAD, I will try to look at this in more
> detail.

Thanks for your comments. Try snapshot link. I hope that it will work.
    Zdenek

PS: I'm sorry about response time, but I'm on training this week.


-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql

Re: [WIP] In-place upgrade

From

"Robert Haas"

Date:

03 November 2008, 21:20:34

> You need to apply also two other patches:
> which are located here:
> http://wiki.postgresql.org/wiki/CommitFestInProgress#Upgrade-in-place_and_related_issues
> I moved one related patch from another category here to correct place.

Just to confirm, which two?

>
http://git.postgresql.org/?p=~davidfetter/upgrade_in_place/.git;a=snapshot;h=c72bafada59ed278ffac59657c913bc375f77808;sf=tgz
>
> It should contains every think including yesterdays improvements (delete,
> insert, update works - inser/update only on table without index).

Wow, sounds like great improvements.  I understand your difficulties
in keeping up with HEAD, but I hope we can figure out some solution,
because right now I have a diff (that I can't apply) and a tarball
(that I can't diff) and that is not ideal for reviewing.

> Yeah, it is most difficult part :-) find correct names for it. I think that
> each  version of structure should have version suffix including lastone. And
> of cource the last one we should have a general name without suffix - see
> example:
>
> typedef struct PageHeaderData_04 { ...} PageHeaderData_04
> typedef struct PageHeaderData_03 { ...} PageHeaderData_03
> typedef PageHeaderData_04 PageHeaderData
>
> This allows you exactly specify version on places where you need it and keep
> general name where version is not relevant.

That doesn't make sense to me.  If PageHeaderData and
PageHeaderData_04 are the same type, how do you decide which one to
use in any particular place in the code?

> How suffix should looks it another question. I prefer to have 04 not only 4.
> What's about PageHeaderData_V04?

I prefer "V" as a delimiter rather than "_" because that makes it more
clear that the number which follows is a version number, but I think
"_V" is overkill.  However, I don't really want to argue the point;
I'm just throwing in my $0.02 and I am sure others will have their own
views as well.

> By the way what YMMV means?

"Your Mileage May Vary."
http://www.urbandictionary.com/define.php?term=YMMV

>> I am pretty skeptical of the idea that all of the HeapTuple* functions
>> can just be conditionalized on the page version and everything will
>> Just Work.  It seems like that is too low a level to be worrying about
>> such things.  Even if it happens to work for the changes between V3
>> and V4, what happens when V5 or V6 is changed in such a way that the
>> answer to HeapTupleIsWhatever is neither "Yes" nor "No", but rather
>> "Maybe" or "Seven"?  The performance hit also sounds painful.  I don't
>> have a better idea right now though...
>
> OK. Currently it works (or I hope that it works). If somebody in a future
> invent some special change, i think in most (maybe all) cases there will be
> possible mapping.
>
> The speed is key point. When I check it last time I go 1% performance drop
> in fresh database. I think 1% is good price for in-place online upgrade.

I think that's arguable and something that needs to be more broadly
discussed.  I wouldn't be keen to pay a 1% performance drop for this
feature, because it's not a feature I really need.  Sure, in-place
upgrade would be nice to have, but for me, dump and reload isn't a
huge problem.  It's a lot better than the 5% number you quoted
previously, but I'm not sure whether it is good enough,

I would feel more comfortable if the feature could be completely
disabled via compile-time defines.  Then you could build the system
either with or without in-place upgrade, according to your needs.  But
I don't think that's very practical with HeapTuple* as functions.  You
could conditionalize away the switch, but the function call overhead
would remain.  To get rid of that, you'd need some enormous, fragile
hack that I don't even want to contemplate.

Really, what I'd ideally like to see here is a system where the V3
code is in essence error-recovery code.  Everything should be V4-only
unless you detect a V3 page, and then you error out (if in-place
upgrade is not enabled) or jump to the appropriate V3-aware code (if
in-place upgrade is enabled).  In theory, with a system like this, it
seems like the overhead for V4 ought to be no more than the cost of
checking the page version on each page read, which is a cheap sanity
check we'd be willing to pay for anyway, and trivial in cost.

But I think we probably need some input from -core on this topic as well.

...Robert

Re: [WIP] In-place upgrade

From

Tom Lane

Date:

03 November 2008, 21:27:41

"Robert Haas" <robertmhaas@gmail.com> writes:
> Really, what I'd ideally like to see here is a system where the V3
> code is in essence error-recovery code.  Everything should be V4-only
> unless you detect a V3 page, and then you error out (if in-place
> upgrade is not enabled) or jump to the appropriate V3-aware code (if
> in-place upgrade is enabled).  In theory, with a system like this, it
> seems like the overhead for V4 ought to be no more than the cost of
> checking the page version on each page read, which is a cheap sanity
> check we'd be willing to pay for anyway, and trivial in cost.

We already do check the page version on read-in --- see PageHeaderIsValid.

> But I think we probably need some input from -core on this topic as well.

I concur that I don't want to see this patch adding more than the
absolute unavoidable minimum of overhead for data that meets the
"current" layout definition.  I'm disturbed by the proposal to stick
overhead into tuple header access, for example.
        regards, tom lane

Re: [WIP] In-place upgrade

From

"Robert Haas"

Date:

03 November 2008, 22:22:39

> We already do check the page version on read-in --- see PageHeaderIsValid.

Right, but the only place this is called is in ReadBuffer_common,
which doesn't seem like a suitable place to deal with the possibility
of a V3 page since you don't yet know what you plan to do with it.
I'm not quite sure what the right solution to that problem is...

>> But I think we probably need some input from -core on this topic as well.
> I concur that I don't want to see this patch adding more than the
> absolute unavoidable minimum of overhead for data that meets the
> "current" layout definition.  I'm disturbed by the proposal to stick
> overhead into tuple header access, for example.

...but it seems like we both agree that conditionalizing heap tuple
header access on page version is not the right answer.  Based on that,
I'm going to move the "htup and bufpage API clean up" patch to
"Returned with feedback" and continue reviewing the remainder of these
patches.

As I'm looking at this, I'm realizing another problem - there is a lot
of code that looks like this:

void HeapTupleSetXmax(HeapTuple tuple, TransactionId xmax)
{     switch(tuple->t_ver)     {             case 4 : tuple->t_data->t_choice.t_heap.t_xmax = xmax;
        break;             case 3 : TPH03(tuple)->t_choice.t_heap.t_xmax = xmax;                              break;
        default: elog(PANIC, "HeapTupleSetXmax is not supported.");     }
 
}

TPH03 is a macro that is casting tuple->t_data to HeapTupleHeader_03.
Unless I'm missing something, that means that given an arbitrary
pointer to HeapTuple, there is absolutely no guarantee that
tuple->t_data->t_choice actually points to that field at all.  It will
if tuple->t_ver happens to be 4 OR if HeapTupleHeader and
HeapTupleHeader_03 happen to agree on where t_choice is; otherwise it
points to some other member of HeapTupleHeader_03, or off the end of
the structure.  To me that seems unacceptably fragile, because it
means the compiler can't warn us that we're using a pointer
inappropriately.  If we truly want to be safe here then we need to
create an opaque HeapTupleHeader structure that contains only those
elements that HeapTupleHeader_03 and HeapTupleHeader_04 have in
common, and cast BOTH of them after checking the version.  That way if
somone writes a function that attempts to deference a HeapTupleHeader
without going through the API, it will fail to compile rather than
mostly working but possibly failing on a V3 page.

...Robert

Re: [WIP] In-place upgrade

From

Zdenek Kotala

Date:

04 November 2008, 10:06:38

Robert Haas napsal(a):

> 
> Really, what I'd ideally like to see here is a system where the V3
> code is in essence error-recovery code.  Everything should be V4-only
> unless you detect a V3 page, and then you error out (if in-place
> upgrade is not enabled) or jump to the appropriate V3-aware code (if
> in-place upgrade is enabled).  In theory, with a system like this, it
> seems like the overhead for V4 ought to be no more than the cost of
> checking the page version on each page read, which is a cheap sanity
> check we'd be willing to pay for anyway, and trivial in cost.

OK. It was original idea to make "Convert on read" which has several problems 
with no easy solution. One is that new data does not fit on the page and second 
big problem is how to convert TOAST table data. Another problem which is general 
is how to convert indexes...

Convert on read has minimal impact on core when latest version is processed. But 
problem is what happen when you need to migrate tuple form page to new one 
modify index and also needs convert toast value(s)... Problem is that response 
could be long in some query, because it invokes a lot of changes and conversion.  I think in corner case it could
requiresconverts all index when you request 
 
one record.
Zdenek




-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql

Re: [WIP] In-place upgrade

From

"Robert Haas"

Date:

04 November 2008, 10:33:03

> OK. It was original idea to make "Convert on read" which has several
> problems with no easy solution. One is that new data does not fit on the
> page and second big problem is how to convert TOAST table data. Another
> problem which is general is how to convert indexes...
>
> Convert on read has minimal impact on core when latest version is processed.
> But problem is what happen when you need to migrate tuple form page to new
> one modify index and also needs convert toast value(s)... Problem is that
> response could be long in some query, because it invokes a lot of changes
> and conversion.  I think in corner case it could requires converts all index
> when you request one record.

I don't think I'm proposing convert on read, exactly.  If you actually
try to convert the entire page when you read it in, I think you're
doomed to failure, because, as you rightly point out, there is
absolutely no guarantee that the page contents in their new format
will still fit into one block.  I think what you want to do is convert
the structures within the page one by one as you read them out of the
page.  The proposed refactoring of ExecStoreTuple will do exactly
this, for example.

HEAD uses a pointer into the actual buffer for a V4 tuple that comes
from an existing relation, and a pointer to a palloc'd structure for a
tuple that is generated during query execution.  The proposed
refactoring will keep these rules, plus add a new rule that if you
happen to read a V3 page, you will palloc space for a new V4 tuple
that is semantically equivalent to the V3 tuple on the page, and use
that pointer instead.  That, it seems to me, is exactly the right
balance - the PAGE is still a V3 page, but all of the tuples that the
upper-level code ever sees are V4 tuples.

I'm not sure how far this particular approach can be generalized.
ExecStoreTuple has the advantage that it already has to deal with both
direct buffer pointers and palloc'd structures, so the code doesn't
need to be much more complex to handle this case as well.  I think the
thing to do is go through and scrutinize all of the ReadBuffer call
sites and figure out an approach to each one.  I haven't looked at
your latest code yet, so you may have already done this, but just for
example, RelationGetBufferForTuple should probably just reject any V3
pages encountered as if they were full, including updating the FSM
where appropriate.  I would think that it would be possible to
implement that with almost zero performance impact.  I'm happy to look
at and discuss the problem cases with you, and hopefully others will
chime in as well since my knowledge of the code is far from
exhaustive.

...Robert

Re: [WIP] In-place upgrade

From

Zdenek Kotala

Date:

04 November 2008, 11:09:29

Robert Haas napsal(a):
>> OK. It was original idea to make "Convert on read" which has several
>> problems with no easy solution. One is that new data does not fit on the
>> page and second big problem is how to convert TOAST table data. Another
>> problem which is general is how to convert indexes...
>>
>> Convert on read has minimal impact on core when latest version is processed.
>> But problem is what happen when you need to migrate tuple form page to new
>> one modify index and also needs convert toast value(s)... Problem is that
>> response could be long in some query, because it invokes a lot of changes
>> and conversion.  I think in corner case it could requires converts all index
>> when you request one record.
> 
> I don't think I'm proposing convert on read, exactly.  If you actually
> try to convert the entire page when you read it in, I think you're
> doomed to failure, because, as you rightly point out, there is
> absolutely no guarantee that the page contents in their new format
> will still fit into one block.  I think what you want to do is convert
> the structures within the page one by one as you read them out of the
> page.  The proposed refactoring of ExecStoreTuple will do exactly
> this, for example.

I see. But Vacuum and other internals function access heap pages directly 
without ExecStoreTuple. however you point to one idea which I'm currently 
thinking about it too. There is my version:

If you look into new page API it has PageGetHeapTuple. It could do the 
conversion job. Problem is that you don't have relation info there and you 
cannot convert data, but transaction information can be converted.

I think about HeapTupleData structure modification. It will have pointer to 
transaction info t_transinfo, which will point to the page tuple for V4. For V3 
PageGetHeapTuple function will allocate memory and put converted data here.

ExecStoreTuple will finally convert data. Because it know about relation and It 
does not make sense convert data early. Who wants to convert invisible or dead data.

With this approach tuple will be processed same way with V4 without any overhead 
(they will be small overhead with allocating and free heaptupledata in some 
places - mostly vacuum).

Only multi version access will be driven on page basis.
    Zdenek

-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql

Re: [WIP] In-place upgrade

From

"Robert Haas"

Date:

04 November 2008, 11:38:36

> I see. But Vacuum and other internals function access heap pages directly
> without ExecStoreTuple.

Right.  I don't think there's any getting around the fact that any
function which accesses heap pages directly is going to need
modification.  The key is to make those modifications as non-invasive
as possible.  For example, in the case of vacuum, as soon as it
detects that a V3 page has been read, it should call a special
function whose only purpose in life is to move the data out of that V3
page and onto one or more V4 pages, and return.  What you shouldn't do
is try to make the regular vacuum code handle both V3 and V4 pages,
because that will lead to code that may be slow and will almost
certainly be complicated and difficult to maintain.

I'll read through the rest of this when I have a bit more time.

...Robert

Re: [WIP] In-place upgrade

From

Heikki Linnakangas

Date:

04 November 2008, 11:46:16

Zdenek Kotala wrote:
> Robert Haas napsal(a):
>> Really, what I'd ideally like to see here is a system where the V3
>> code is in essence error-recovery code.  Everything should be V4-only
>> unless you detect a V3 page, and then you error out (if in-place
>> upgrade is not enabled) or jump to the appropriate V3-aware code (if
>> in-place upgrade is enabled).  In theory, with a system like this, it
>> seems like the overhead for V4 ought to be no more than the cost of
>> checking the page version on each page read, which is a cheap sanity
>> check we'd be willing to pay for anyway, and trivial in cost.
> 
> OK. It was original idea to make "Convert on read" which has several 
> problems with no easy solution. One is that new data does not fit on the 
> page and second big problem is how to convert TOAST table data. Another 
> problem which is general is how to convert indexes...

We've talked about this many times before, so I'm sure you know what my 
opinion is. Let me phrase it one more time:

1. You *will* need a function to convert a page from old format to new 
format. We do want to get rid of the old format pages eventually, 
whether it's during VACUUM, whenever a page is read in, or by using an 
extra utility. And that process needs to online. Please speak up now if 
you disagree with that.

2. It follows from point 1, that you *will* need to solve the problems 
with pages where the data doesn't fit on the page in new format, as well 
as converting TOAST data.

We've discussed various solutions to those problems; it's not 
insurmountable. For the "data doesn't fit anymore" problem, a fairly 
simple solution is to run a pre-upgrade utility in the old version, that 
reserves some free space on each page, to make sure everything fits 
after converting to new format. For TOAST, you can retoast tuples when 
the heap page is read in. I'm not sure what the problem with indexes is, 
but you can split pages if necessary, for example.

Assuming everyone agrees with point 1, could we focus on these issues?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

Re: [WIP] In-place upgrade

From

"Robert Haas"

Date:

04 November 2008, 12:01:11

> We've talked about this many times before, so I'm sure you know what my
> opinion is. Let me phrase it one more time:
>
> 1. You *will* need a function to convert a page from old format to new
> format. We do want to get rid of the old format pages eventually, whether
> it's during VACUUM, whenever a page is read in, or by using an extra
> utility. And that process needs to online. Please speak up now if you
> disagree with that.

Well, I just proposed an approach that doesn't work this way, so I
guess I'll have to put myself in the disagree category, or anyway yet
to be convinced.  As long as you can move individual tuples onto new
pages, you can eventually empty V3 pages and reinitialize them as new,
empty V4 pages.  You can force that process along via, say, VACUUM,
but in the meantime you can still continue to read the old pages
without being forced to change them to the new format.  That's not the
only possible approach, but it's not obvious to me that it's insane.
If you think it's a non-starter, it would be good to know why.

...Robert

Re: [WIP] In-place upgrade

From

Tom Lane

Date:

04 November 2008, 12:14:50

"Robert Haas" <robertmhaas@gmail.com> writes:
> Well, I just proposed an approach that doesn't work this way, so I
> guess I'll have to put myself in the disagree category, or anyway yet
> to be convinced.  As long as you can move individual tuples onto new
> pages, you can eventually empty V3 pages and reinitialize them as new,
> empty V4 pages.  You can force that process along via, say, VACUUM,
> but in the meantime you can still continue to read the old pages
> without being forced to change them to the new format.  That's not the
> only possible approach, but it's not obvious to me that it's insane.
> If you think it's a non-starter, it would be good to know why.

That's sane *if* you can guarantee that only negligible overhead is
added for accessing data that is in the up-to-date format.  I don't
think that will be the case if we start putting version checks into
every tuple access macro.
        regards, tom lane

Re: [WIP] In-place upgrade

From

"Robert Haas"

Date:

04 November 2008, 12:43:09

> That's sane *if* you can guarantee that only negligible overhead is
> added for accessing data that is in the up-to-date format.  I don't
> think that will be the case if we start putting version checks into
> every tuple access macro.

Yes, the point is that you'll read the page as V3 or V4, whichever it
is, but if it's V3, you'll convert the tuples to V4 format before you
try to doing anything with them (for example by modifying
ExecStoreTuple to copy any V3 tuple into a palloc'd buffer, which fits
nicely into what that function already does).

...Robert

Re: [WIP] In-place upgrade

From

"Robert Haas"

Date:

04 November 2008, 16:57:37

>> Well, I just proposed an approach that doesn't work this way, so I
>> guess I'll have to put myself in the disagree category, or anyway yet
>> to be convinced.  As long as you can move individual tuples onto new
>> pages, you can eventually empty V3 pages and reinitialize them as new,
>> empty V4 pages.  You can force that process along via, say, VACUUM,
>
> No, if you can force that process along via some command, whatever it is, then
> you're still in the category he described.

Maybe.  The difference is that I'm talking about converting tuples,
not pages, so "What happens when the data doesn't fit on the new
page?" is a meaningless question.  Since that seemed to be Heikki's
main concern, I thought we must be talking about different things.  My
thought was that the code path for converting a tuple would be very
similar to what heap_update does today, and large tuples would be
handled via TOAST just as they are now - by converting the relation
one tuple at a time, you might end up with a new relation that has
either more or fewer pages than the old relation, and it really
doesn't matter which.

I haven't really thought through all of the other kinds of things that
might need to be converted, though.  That's where it would be useful
for someone more experienced to weigh in on indexes, etc.

...Robert

Re: [WIP] In-place upgrade

From

Gregory Stark

Date:

04 November 2008, 17:29:05

"Robert Haas" <robertmhaas@gmail.com> writes:

>> We've talked about this many times before, so I'm sure you know what my
>> opinion is. Let me phrase it one more time:
>>
>> 1. You *will* need a function to convert a page from old format to new
>> format. We do want to get rid of the old format pages eventually, whether
>> it's during VACUUM, whenever a page is read in, or by using an extra
>> utility. And that process needs to online. Please speak up now if you
>> disagree with that.
>
> Well, I just proposed an approach that doesn't work this way, so I
> guess I'll have to put myself in the disagree category, or anyway yet
> to be convinced.  As long as you can move individual tuples onto new
> pages, you can eventually empty V3 pages and reinitialize them as new,
> empty V4 pages.  You can force that process along via, say, VACUUM,

No, if you can force that process along via some command, whatever it is, then
you're still in the category he described.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!

Re: [WIP] In-place upgrade

From

Gregory Stark

Date:

04 November 2008, 18:18:59

"Robert Haas" <robertmhaas@gmail.com> writes:

>>> Well, I just proposed an approach that doesn't work this way, so I
>>> guess I'll have to put myself in the disagree category, or anyway yet
>>> to be convinced.  As long as you can move individual tuples onto new
>>> pages, you can eventually empty V3 pages and reinitialize them as new,
>>> empty V4 pages.  You can force that process along via, say, VACUUM,
>>
>> No, if you can force that process along via some command, whatever it is, then
>> you're still in the category he described.
>
> Maybe.  The difference is that I'm talking about converting tuples,
> not pages, so "What happens when the data doesn't fit on the new
> page?" is a meaningless question.  

No it's not, because as you pointed out you still need a way for the user to
force it to happen sometime. Unless you're going to be happy with telling
users they need to update all their tuples which would not be an online
process.

In any case it sounds like you're saying you want to allow multiple versions
of tuples on the same page -- which a) would be much harder and b) doesn't
solve the problem since the page still has to be converted sometime anyways.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!

Re: [WIP] In-place upgrade

From

"Robert Haas"

Date:

04 November 2008, 23:50:01

>> Maybe.  The difference is that I'm talking about converting tuples,
>> not pages, so "What happens when the data doesn't fit on the new
>> page?" is a meaningless question.
>
> No it's not, because as you pointed out you still need a way for the user to
> force it to happen sometime. Unless you're going to be happy with telling
> users they need to update all their tuples which would not be an online
> process.
>
> In any case it sounds like you're saying you want to allow multiple versions
> of tuples on the same page -- which a) would be much harder and b) doesn't
> solve the problem since the page still has to be converted sometime anyways.

No, that's not what I'm suggesting.  My thought was that any V3 page
would be treated as if it were completely full, with the exception of
a completely empty page which can be reinitialized as a V4 page.  So
you would never add any tuples to a V3 page, but you would need to
update xmax, hint bits, etc.  Eventually when all the tuples were dead
you could reuse the page.

...Robert

Re: [WIP] In-place upgrade

From

Gregory Stark

Date:

05 November 2008, 00:25:35

"Robert Haas" <robertmhaas@gmail.com> writes:

>>> Maybe.  The difference is that I'm talking about converting tuples,
>>> not pages, so "What happens when the data doesn't fit on the new
>>> page?" is a meaningless question.
>>
>> No it's not, because as you pointed out you still need a way for the user to
>> force it to happen sometime. Unless you're going to be happy with telling
>> users they need to update all their tuples which would not be an online
>> process.
>>
>> In any case it sounds like you're saying you want to allow multiple versions
>> of tuples on the same page -- which a) would be much harder and b) doesn't
>> solve the problem since the page still has to be converted sometime anyways.
>
> No, that's not what I'm suggesting.  My thought was that any V3 page
> would be treated as if it were completely full, with the exception of
> a completely empty page which can be reinitialized as a V4 page.  So
> you would never add any tuples to a V3 page, but you would need to
> update xmax, hint bits, etc.  Eventually when all the tuples were dead
> you could reuse the page.

But there's no guarantee that will ever happen. Heikki claimed you would need
a mechanism to convert the page some day and you said you proposed a system
where that wasn't true.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication
support!

Re: [WIP] In-place upgrade

From

"Robert Haas"

Date:

05 November 2008, 00:51:34

>> No, that's not what I'm suggesting.  My thought was that any V3 page
>> would be treated as if it were completely full, with the exception of
>> a completely empty page which can be reinitialized as a V4 page.  So
>> you would never add any tuples to a V3 page, but you would need to
>> update xmax, hint bits, etc.  Eventually when all the tuples were dead
>> you could reuse the page.
>
> But there's no guarantee that will ever happen. Heikki claimed you would need
> a mechanism to convert the page some day and you said you proposed a system
> where that wasn't true.

What's the scenario you're concerned about?  An old snapshot that
never goes away?

Can we lock the old and new pages, move the tuple to a V4 page, and
update index entries without changing xmin/xmax?

...Robert

Re: [WIP] In-place upgrade

From

Gregory Stark

Date:

05 November 2008, 01:11:09

"Robert Haas" <robertmhaas@gmail.com> writes:

>>> No, that's not what I'm suggesting.  My thought was that any V3 page
>>> would be treated as if it were completely full, with the exception of
>>> a completely empty page which can be reinitialized as a V4 page.  So
>>> you would never add any tuples to a V3 page, but you would need to
>>> update xmax, hint bits, etc.  Eventually when all the tuples were dead
>>> you could reuse the page.
>>
>> But there's no guarantee that will ever happen. Heikki claimed you would need
>> a mechanism to convert the page some day and you said you proposed a system
>> where that wasn't true.
>
> What's the scenario you're concerned about?  An old snapshot that
> never goes away?

An old page which never goes away. New page formats are introduced for a
reason -- to support new features. An old page lying around indefinitely means
some pages can't support those new features. Just as an example, DBAs may be
surprised to find out that large swathes of their database are still not
protected by CRC checksums months or years after having upgraded to 8.4 (or
even 8.5 or 8.6 or ...). They would certainly want a way to ensure all their
data is upgraded.

> Can we lock the old and new pages, move the tuple to a V4 page, and
> update index entries without changing xmin/xmax?

Not exactly. But regardless -- the point is we need to do something.

(And then the argument goes that since we *have* to do that then we needn't
bother with doing anything else. At least if we do it's just an optimization
over just doing the whole page right away.)

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about
EnterpriseDB'sPostgreSQL training!

Re: [WIP] In-place upgrade

From

"Joshua D. Drake"

Date:

05 November 2008, 01:52:21

Gregory Stark wrote:
> "Robert Haas" <robertmhaas@gmail.com> writes:

> An old page which never goes away. New page formats are introduced for a
> reason -- to support new features. An old page lying around indefinitely means
> some pages can't support those new features. Just as an example, DBAs may be
> surprised to find out that large swathes of their database are still not
> protected by CRC checksums months or years after having upgraded to 8.4 (or
> even 8.5 or 8.6 or ...). They would certainly want a way to ensure all their
> data is upgraded.

Then provide a manual mechanism to convert all pages?

Joshua D. Drake

Re: [WIP] In-place upgrade

From

Gregory Stark

Date:

05 November 2008, 02:04:49

"Joshua D. Drake" <jd@commandprompt.com> writes:

> Gregory Stark wrote:
>> "Robert Haas" <robertmhaas@gmail.com> writes:
>
>> An old page which never goes away. New page formats are introduced for a
>> reason -- to support new features. An old page lying around indefinitely means
>> some pages can't support those new features. Just as an example, DBAs may be
>> surprised to find out that large swathes of their database are still not
>> protected by CRC checksums months or years after having upgraded to 8.4 (or
>> even 8.5 or 8.6 or ...). They would certainly want a way to ensure all their
>> data is upgraded.
>
> Then provide a manual mechanism to convert all pages?

The origin of this thread was the dispute over this claim:
   1. You *will* need a function to convert a page from old format to new   format. We do want to get rid of the old
formatpages eventually, whether   it's during VACUUM, whenever a page is read in, or by using an extra   utility. And
thatprocess needs to online. Please speak up now if you   disagree with that.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about
EnterpriseDB'sPostgreSQL training!

Re: [WIP] In-place upgrade

From

"Joshua D. Drake"

Date:

05 November 2008, 02:31:09

Gregory Stark wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
> 
>> Gregory Stark wrote:
>>> "Robert Haas" <robertmhaas@gmail.com> writes:
>>> An old page which never goes away. New page formats are introduced for a
>>> reason -- to support new features. An old page lying around indefinitely means
>>> some pages can't support those new features. Just as an example, DBAs may be
>>> surprised to find out that large swathes of their database are still not
>>> protected by CRC checksums months or years after having upgraded to 8.4 (or
>>> even 8.5 or 8.6 or ...). They would certainly want a way to ensure all their
>>> data is upgraded.
>> Then provide a manual mechanism to convert all pages?
> 
> The origin of this thread was the dispute over this claim:
> 
>     1. You *will* need a function to convert a page from old format to new
>     format. We do want to get rid of the old format pages eventually, whether
>     it's during VACUUM, whenever a page is read in, or by using an extra
>     utility. And that process needs to online. Please speak up now if you
>     disagree with that.
> 

I agree.

Joshua D. Drake

Re: [WIP] In-place upgrade

From

"Robert Haas"

Date:

05 November 2008, 08:32:10

> An old page which never goes away. New page formats are introduced for a
> reason -- to support new features. An old page lying around indefinitely means
> some pages can't support those new features. Just as an example, DBAs may be
> surprised to find out that large swathes of their database are still not
> protected by CRC checksums months or years after having upgraded to 8.4 (or
> even 8.5 or 8.6 or ...). They would certainly want a way to ensure all their
> data is upgraded.

OK, I see your point.  In the absence of any old snapshots,
convert-on-write allows you to forcibly upgrade the whole table by
rewriting all of the tuples into new pages:

UPDATE table SET col = col

In the absence of page expansion, you can put logic into VACUUM to
upgrade each page in place.

If you have both old snapshots that you can't get rid of, and page
expansion, then you have a big problem, which I guess brings us back
to Heikki's point.

...Robert

Re: [WIP] In-place upgrade

From

Zdenek Kotala

Date:

05 November 2008, 08:43:27

Heikki Linnakangas napsal(a):
> Zdenek Kotala wrote:

> 
> We've talked about this many times before, so I'm sure you know what my 
> opinion is. Let me phrase it one more time:
> 
> 1. You *will* need a function to convert a page from old format to new 
> format. We do want to get rid of the old format pages eventually, 
> whether it's during VACUUM, whenever a page is read in, or by using an 
> extra utility. And that process needs to online. Please speak up now if 
> you disagree with that.

Yes. Agree. The basic idea is to create new empty page and copy+convert tuples 
into new page. This new page will overwrite old one I have already code which 
converts heap table (excluding arrays and composite datatype).

> 2. It follows from point 1, that you *will* need to solve the problems 
> with pages where the data doesn't fit on the page in new format, as well 
> as converting TOAST data.

Yes or no. It depends if we will want live with old pages forever. But I think 
convert all pages to the newest version is good idea.

> We've discussed various solutions to those problems; it's not 
> insurmountable. For the "data doesn't fit anymore" problem, a fairly 
> simple solution is to run a pre-upgrade utility in the old version, that 
> reserves some free space on each page, to make sure everything fits 
> after converting to new format. 

I think it will not work. you need protect also PotgreSQL to put any data extra 
data on a page. Which requires modification into PostgreSQL code in old branches.

> For TOAST, you can retoast tuples when 
> the heap page is read in. 

Yes you have to retosted it which is only possible method but problem is thet 
you need workinig toastable index ... yeah, indexes are different story.
> I'm not sure what the problem with indexes is,> but you can split pages if necessary, for example.

Indexes is different story. In first step I prefer to use reindex. But in the 
future a prefer to extend pg_am and add ampageconvert which will point to 
conversion function. Maybe we can extend it now and keep this column empty.

> Assuming everyone agrees with point 1, could we focus on these issues?

Yes, OK I'm going to cleanup code which I have and I will send it soon. Tuple 
conversion is already part of patch which I already send. See 
access/heapam/htup_03.c.
Zdenek

-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql

Re: [WIP] In-place upgrade

From

Zdenek Kotala

Date:

05 November 2008, 09:26:44

Tom Lane napsal(a):

> 
> I concur that I don't want to see this patch adding more than the
> absolute unavoidable minimum of overhead for data that meets the
> "current" layout definition.  I'm disturbed by the proposal to stick
> overhead into tuple header access, for example.

OK. I agree that it is overhead. However the patch contains also Tuple and Page 
API cleanup which is general thing. All function should use HeapTuple access not 
HeapTupleHeader. I used function in the patch because I added multi version 
access, but they can be macro.

The main change of page API is to add two function PageGetHeapTuple and 
PageGetIndexTuple. I also add function like PageItemIsDead and so on. These 
change are not only related to upgrade.

I accepting your complains about Tuples, but I think we should have multi page 
version access method. The main advantage is that indexes are ready for reading 
without any problem. It helps mostly in TOAST chunk data access and it is 
necessary for retoasting. OK it will works until somebody change btree ondisk 
format, but now it helps.
    Zdenek

Re: [WIP] In-place upgrade

From

Greg Stark

Date:

05 November 2008, 09:44:29

I don't think this really qualifies as "in place upgrade" since it  
would mean creating a whole second copy of all your data. And it's  
only online got read-only queries too.

I think we need a way to upgrade the pages in place and deal with any  
overflow data as exceptional cases or else there's hardly much point  
in the exercise.

greg

On 5 Nov 2008, at 07:32 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:

>> An old page which never goes away. New page formats are introduced  
>> for a
>> reason -- to support new features. An old page lying around  
>> indefinitely means
>> some pages can't support those new features. Just as an example,  
>> DBAs may be
>> surprised to find out that large swathes of their database are  
>> still not
>> protected by CRC checksums months or years after having upgraded to  
>> 8.4 (or
>> even 8.5 or 8.6 or ...). They would certainly want a way to ensure  
>> all their
>> data is upgraded.
>
> OK, I see your point.  In the absence of any old snapshots,
> convert-on-write allows you to forcibly upgrade the whole table by
> rewriting all of the tuples into new pages:
>
> UPDATE table SET col = col
>
> In the absence of page expansion, you can put logic into VACUUM to
> upgrade each page in place.
>
> If you have both old snapshots that you can't get rid of, and page
> expansion, then you have a big problem, which I guess brings us back
> to Heikki's point.
>
> ...Robert

Re: [WIP] In-place upgrade

From

Zdenek Kotala

Date:

05 November 2008, 10:05:06

Greg Stark napsal(a):
> I don't think this really qualifies as "in place upgrade" since it would 
> mean creating a whole second copy of all your data. And it's only online 
> got read-only queries too.
> 
> I think we need a way to upgrade the pages in place and deal with any 
> overflow data as exceptional cases or else there's hardly much point in 
> the exercise.

It is exceptional case between V3 and V4 and only on heap, because you save in 
varlena. But between V4 and V5 we will lost another 4 bytes in a page header -> 
page header will be 28 bytes long but tuple size is same.

Try to get raw free space on each page in 8.3 database and you probably see a 
lot of pages where free space is 0. My last experience is something about 1-2% 
of pages.
    Zdenek

Re: [WIP] In-place upgrade

From

Martijn van Oosterhout

Date:

05 November 2008, 14:16:22

On Wed, Nov 05, 2008 at 03:04:42PM +0100, Zdenek Kotala wrote:
> Greg Stark napsal(a):
> It is exceptional case between V3 and V4 and only on heap, because you save
> in varlena. But between V4 and V5 we will lost another 4 bytes in a page
> header -> page header will be 28 bytes long but tuple size is same.
>
> Try to get raw free space on each page in 8.3 database and you probably see
> a lot of pages where free space is 0. My last experience is something about
> 1-2% of pages.

Is this really such a big deal? You do the null-update on the last
tuple of the page and then you do have enough room. So Phase one moves
a few tuples to make room. Phase 2 actually converts the pages inplace.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: [WIP] In-place upgrade

From

Zdenek Kotala

Date:

05 November 2008, 15:55:23

Martijn van Oosterhout napsal(a):
> On Wed, Nov 05, 2008 at 03:04:42PM +0100, Zdenek Kotala wrote:
>> Greg Stark napsal(a):
>> It is exceptional case between V3 and V4 and only on heap, because you save 
>> in varlena. But between V4 and V5 we will lost another 4 bytes in a page 
>> header -> page header will be 28 bytes long but tuple size is same.
>>
>> Try to get raw free space on each page in 8.3 database and you probably see 
>> a lot of pages where free space is 0. My last experience is something about 
>> 1-2% of pages.
> 
> Is this really such a big deal? You do the null-update on the last
> tuple of the page and then you do have enough room. So Phase one moves
> a few tuples to make room. Phase 2 actually converts the pages inplace.

Problem is how to move tuple from page to another and keep indexes in sync. One 
solution is to perform some think like "update" operation on the tuple. But you 
need exclusive lock on the page and pin counter have to be zero. And question is 
where it is safe operation.
Zdenek


-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql

Re: [WIP] In-place upgrade

From

Tom Lane

Date:

05 November 2008, 16:45:16

Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
> Martijn van Oosterhout napsal(a):
>> Is this really such a big deal? You do the null-update on the last
>> tuple of the page and then you do have enough room. So Phase one moves
>> a few tuples to make room. Phase 2 actually converts the pages inplace.

> Problem is how to move tuple from page to another and keep indexes in
> sync. One solution is to perform some think like "update" operation on
> the tuple. But you need exclusive lock on the page and pin counter
> have to be zero. And question is where it is safe operation.

Hmm.  Well, it may be a nasty problem but you have to find a solution.
We're not going to guarantee that no update ever expands the data ...
        regards, tom lane

Re: [WIP] In-place upgrade

From

"Robert Haas"

Date:

05 November 2008, 17:05:55

> Problem is how to move tuple from page to another and keep indexes in sync.
> One solution is to perform some think like "update" operation on the tuple.
> But you need exclusive lock on the page and pin counter have to be zero. And
> question is where it is safe operation.

But doesn't this problem go away if you do it in a transaction?  You
set xmax on the old tuple, write the new tuple, and add index entries
just as you would for a normal update.

When the old tuple is no longer visible to any transaction, you nuke it.

...Robert

Re: [WIP] In-place upgrade

From

Gregory Stark

Date:

05 November 2008, 17:41:57

"Robert Haas" <robertmhaas@gmail.com> writes:

>> Problem is how to move tuple from page to another and keep indexes in sync.
>> One solution is to perform some think like "update" operation on the tuple.
>> But you need exclusive lock on the page and pin counter have to be zero. And
>> question is where it is safe operation.
>
> But doesn't this problem go away if you do it in a transaction?  You
> set xmax on the old tuple, write the new tuple, and add index entries
> just as you would for a normal update.

But that doesn't actually solve the overflow problem on the old page...

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!

Re: [WIP] In-place upgrade

From

Martijn van Oosterhout

Date:

05 November 2008, 18:07:48

On Wed, Nov 05, 2008 at 09:41:52PM +0000, Gregory Stark wrote:
> "Robert Haas" <robertmhaas@gmail.com> writes:
>
> >> Problem is how to move tuple from page to another and keep indexes in sync.
> >> One solution is to perform some think like "update" operation on the tuple.
> >> But you need exclusive lock on the page and pin counter have to be zero. And
> >> question is where it is safe operation.
> >
> > But doesn't this problem go away if you do it in a transaction?  You
> > set xmax on the old tuple, write the new tuple, and add index entries
> > just as you would for a normal update.
>
> But that doesn't actually solve the overflow problem on the old page...

Sure it does. You move just enough tuples that you can convert the page
without an overflow.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: [WIP] In-place upgrade

From

Gregory Stark

Date:

05 November 2008, 21:38:31

Martijn van Oosterhout <kleptog@svana.org> writes:

> On Wed, Nov 05, 2008 at 09:41:52PM +0000, Gregory Stark wrote:
>> "Robert Haas" <robertmhaas@gmail.com> writes:
>> 
>> >> Problem is how to move tuple from page to another and keep indexes in sync.
>> >> One solution is to perform some think like "update" operation on the tuple.
>> >> But you need exclusive lock on the page and pin counter have to be zero. And
>> >> question is where it is safe operation.
>> >
>> > But doesn't this problem go away if you do it in a transaction?  You
>> > set xmax on the old tuple, write the new tuple, and add index entries
>> > just as you would for a normal update.
>> 
>> But that doesn't actually solve the overflow problem on the old page...
>
> Sure it does. You move just enough tuples that you can convert the page
> without an overflow.

setting the xmax on a tuple doesn't "move" the tuple


--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!

Re: [WIP] In-place upgrade

From

"Robert Haas"

Date:

05 November 2008, 23:12:23

>>> >> Problem is how to move tuple from page to another and keep indexes in sync.
>>> >> One solution is to perform some think like "update" operation on the tuple.
>>> >> But you need exclusive lock on the page and pin counter have to be zero. And
>>> >> question is where it is safe operation.
>>> >
>>> > But doesn't this problem go away if you do it in a transaction?  You
>>> > set xmax on the old tuple, write the new tuple, and add index entries
>>> > just as you would for a normal update.
>>>
>>> But that doesn't actually solve the overflow problem on the old page...
>>
>> Sure it does. You move just enough tuples that you can convert the page
>> without an overflow.
>
> setting the xmax on a tuple doesn't "move" the tuple

Nobody said it did.  I think this would have been more clear if you
had quoted my whole email instead of stopping in the middle:

>> But doesn't this problem go away if you do it in a transaction?  You
>> set xmax on the old tuple, write the new tuple, and add index entries
>> just as you would for a normal update.
>>
>> When the old tuple is no longer visible to any transaction, you nuke it.

To spell this out in more detail:

Suppose page 123 is a V3 page containing 6 tuples A, B, C, D, E, and
F.  We examine the page and determine that if we convert this to a V4
page, only five tuples will fit.  So we need to get rid of one of the
tuples.  We begin a transaction and choose F as the victim.  Searching
the FSM, we discover that page 456 is a V4 page with available free
space.  We pin and lock pages 123 and 456 just as if we were doing a
heap_update.  We create F', the V4 version of F, and write it onto
page 456.  We set xmax on the original F.  We peform the corresponding
index updates and commit the transaction.

Time passes.  Eventually F becomes dead.  We reclaim the space
previously used by F, and page 123 now contains only 5 tuples.  This
is exactly what we needed in order to convert page F to a V4 page, so
we do.

...Robert

Re: [WIP] In-place upgrade

From

Tom Lane

Date:

06 November 2008, 09:52:43

"Robert Haas" <robertmhaas@gmail.com> writes:
> To spell this out in more detail:

> Suppose page 123 is a V3 page containing 6 tuples A, B, C, D, E, and
> F.  We examine the page and determine that if we convert this to a V4
> page, only five tuples will fit.  So we need to get rid of one of the
> tuples.  We begin a transaction and choose F as the victim.  Searching
> the FSM, we discover that page 456 is a V4 page with available free
> space.  We pin and lock pages 123 and 456 just as if we were doing a
> heap_update.  We create F', the V4 version of F, and write it onto
> page 456.  We set xmax on the original F.  We peform the corresponding
> index updates and commit the transaction.

> Time passes.  Eventually F becomes dead.  We reclaim the space
> previously used by F, and page 123 now contains only 5 tuples.  This
> is exactly what we needed in order to convert page F to a V4 page, so
> we do.

That's all fine and dandy, except that it presumes that you can perform
SELECT/UPDATE/DELETE on V3 tuple versions; you can't just pretend that
A-E aren't there until they get converted.  Which is exactly the
overhead we were looking to avoid.

(Another small issue is exactly when you convert the index entries,
should you be faced with an upgrade that requires that.)
        regards, tom lane

Re: [WIP] In-place upgrade

From

Zdenek Kotala

Date:

06 November 2008, 10:30:18

Tom Lane napsal(a):
> "Robert Haas" <robertmhaas@gmail.com> writes:
>> To spell this out in more detail:
> 
>> Suppose page 123 is a V3 page containing 6 tuples A, B, C, D, E, and
>> F.  We examine the page and determine that if we convert this to a V4
>> page, only five tuples will fit.  So we need to get rid of one of the
>> tuples.  We begin a transaction and choose F as the victim.  Searching
>> the FSM, we discover that page 456 is a V4 page with available free
>> space.  We pin and lock pages 123 and 456 just as if we were doing a
>> heap_update.  We create F', the V4 version of F, and write it onto
>> page 456.  We set xmax on the original F.  We peform the corresponding
>> index updates and commit the transaction.
> 
>> Time passes.  Eventually F becomes dead.  We reclaim the space
>> previously used by F, and page 123 now contains only 5 tuples.  This
>> is exactly what we needed in order to convert page F to a V4 page, so
>> we do.
> 
> That's all fine and dandy, except that it presumes that you can perform
> SELECT/UPDATE/DELETE on V3 tuple versions; you can't just pretend that
> A-E aren't there until they get converted.  Which is exactly the
> overhead we were looking to avoid.

We want to avoid overhead on V$lastest$ tuples, but I guess small performance 
gap on old tuple is acceptable. The only way (which I see now) how it should 
work is to have multi page version processing. And old tuple will be converted 
when PageGetHepaTuple will be called.

However, how Heikki mentioned tuple and page conversion is basic and same for 
all upgrade method and it should be done first.
    Zdenek




-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql

Re: [WIP] In-place upgrade

From

"Robert Haas"

Date:

06 November 2008, 10:39:33

> That's all fine and dandy, except that it presumes that you can perform
> SELECT/UPDATE/DELETE on V3 tuple versions; you can't just pretend that
> A-E aren't there until they get converted.  Which is exactly the
> overhead we were looking to avoid.

I don't understand this comment at all.  Unless you have some sort of
magical wand in your back pocket that will instantaneously transform
the entire database, there is going to be a period of time when you
have to cope with both V3 and V4 pages.  ISTM that what we should be
talking about here is:

(1) How are we going to do that in a way that imposes near-zero
overhead once the entire database has been converted?
(2) How are we going to do that in a way that is minimally invasive to the code?
(3) Can we accomplish (1) and (2) while still retaining somewhat
reasonable performance for V3 pages?

Zdenek's initial proposal did this by replacing all of the tuple
header macros with functions that were conditionalized on page
version.  I think we agree that's not going to work.  That doesn't
mean that there is no approach that can work, and we were discussing
possible ways to make it work upthread until the thread got hijacked
to discuss the right way of handling page expansion.  Now that it
seems we agree that a transaction can be used to move tuples onto new
pages, I think we'd be well served to stop talking about page
expansion and get back to the original topic: where and how to insert
the hooks for V3 tuple handling.

> (Another small issue is exactly when you convert the index entries,
> should you be faced with an upgrade that requires that.)

Zdenek set out his thoughts on this point upthread, no need to rehash here.

...Robert

Re: [WIP] In-place upgrade

From

Bruce Momjian

Date:

06 November 2008, 13:15:58

Robert Haas wrote:
> > That's all fine and dandy, except that it presumes that you can perform
> > SELECT/UPDATE/DELETE on V3 tuple versions; you can't just pretend that
> > A-E aren't there until they get converted.  Which is exactly the
> > overhead we were looking to avoid.
> 
> I don't understand this comment at all.  Unless you have some sort of
> magical wand in your back pocket that will instantaneously transform
> the entire database, there is going to be a period of time when you
> have to cope with both V3 and V4 pages.  ISTM that what we should be
> talking about here is:
> 
> (1) How are we going to do that in a way that imposes near-zero
> overhead once the entire database has been converted?
> (2) How are we going to do that in a way that is minimally invasive to the code?
> (3) Can we accomplish (1) and (2) while still retaining somewhat
> reasonable performance for V3 pages?
> 
> Zdenek's initial proposal did this by replacing all of the tuple
> header macros with functions that were conditionalized on page
> version.  I think we agree that's not going to work.  That doesn't
> mean that there is no approach that can work, and we were discussing
> possible ways to make it work upthread until the thread got hijacked
> to discuss the right way of handling page expansion.  Now that it
> seems we agree that a transaction can be used to move tuples onto new
> pages, I think we'd be well served to stop talking about page
> expansion and get back to the original topic: where and how to insert
> the hooks for V3 tuple handling.

I think the above is a good summary.  For me, the problem with any
approach that has information about prior-version block formats in the
main code path is code complexity, and secondarily performance.

I know there is concern that converting all blocks on read-in might
expand the page beyond 8k in size.  One idea Heikki had was to require
some tool must be run on minor releases before a major upgrade to
guarantee there is enough free space to convert the block to the current
format on read-in, which would localize the information about prior
block formats.  We could release the tool in minor branches around the
time as a major release.  Also consider that there are very few releases
that expand the page size.

For these reasons, the expand-the-page-beyond-8k problem should not be
dictating what approach we take for upgrade-in-place because there are
workarounds for the problem, and the problem is rare.  I would like us
to again focus on converting the pages to the current version format on
read-in, and perhaps a tool to convert all old pages to the new format.

FYI, we are also going to need the ability to convert all pages to the
current format for multi-release upgrades.  For example, if you did
upgrade-in-place from 8.2 to 8.3, you are going to need to update all
pages to the 8.3 format before doing upgrade-in-place to 8.4;  perhaps
vacuum can do something like this on a per-table basis, and we can
record that status a pg_class column.

Also, consider that when we did PITR, we required commands before and
after the tar so that there was a consistent API for PITR, and later had
to add capabilities to those functions, but the user API didn't change.

I envision a similar system where we have utilities to guarantee all
pages have enough free space, and all pages are the current version,
before allowing an upgrade-in-place to the next version.  Such a
consistent API will make the job for users easier and our job simpler,
and with upgrade-in-place, where we have limited time and resources to
code this for each release, simplicity is important.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +

Re: [WIP] In-place upgrade

From

Tom Lane

Date:

06 November 2008, 14:09:57

Bruce Momjian <bruce@momjian.us> writes:
> I envision a similar system where we have utilities to guarantee all
> pages have enough free space, and all pages are the current version,
> before allowing an upgrade-in-place to the next version.  Such a
> consistent API will make the job for users easier and our job simpler,
> and with upgrade-in-place, where we have limited time and resources to
> code this for each release, simplicity is important.

An external utility doesn't seem like the right way to approach it.
For example, given the need to ensure X amount of free space in each
page, the only way to guarantee that would be to shut down the database
while you run the utility over all the pages --- otherwise somebody
might fill some page up again.  And that completely defeats the purpose,
which is to have minimal downtime during upgrade.

I think we can have a notion of pre-upgrade maintenance, but it would
have to be integrated into normal operations.  For instance, if
conversion to 8.4 requires extra free space, we'd make late releases
of 8.3.x not only be able to force that to occur, but also tweak the
normal code paths to maintain that minimum free space.

The full concept as I understood it (dunno why Bruce left all these
details out of his message) went like this:

* Add a "format serial number" column to pg_class, and probably also
pg_database.  Rather like the frozenxid columns, this would have the
semantics that all pages in a relation or database are known to have at
least the specified format number.

* There would actually be two serial numbers per release, at least for
releases where pre-update prep work is involved --- for instance,
between 8.3 and 8.4 there'd be an "8.3-and-a-half" format which is
8.3 but known ready to update to 8.4 (eg, enough free space available).
Minor releases of 8.3 that appear with or subsequent to 8.4 release
understand the "half" format number and how to upgrade to it.

* VACUUM would be empowered, in the same way as it handles frozenxid
maintenance, to update any less-than-the-latest-version pages and then
fix the pg_class and pg_database entries.

* We could mechanically enforce that you not update until the database
is ready for it by checking pg_database.datformatversion during
postmaster startup.

So the update process would require users to install a suitably late
version of 8.3, vacuum everything over a suitable maintenance window,
then install 8.4, then perhaps vacuum everything again if they want to
try to push page update work into specific maintenance windows.  But
the DB is up and functioning the whole time.
        regards, tom lane

Re: [WIP] In-place upgrade

From

Bruce Momjian

Date:

06 November 2008, 14:36:51

Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > I envision a similar system where we have utilities to guarantee all
> > pages have enough free space, and all pages are the current version,
> > before allowing an upgrade-in-place to the next version.  Such a
> > consistent API will make the job for users easier and our job simpler,
> > and with upgrade-in-place, where we have limited time and resources to
> > code this for each release, simplicity is important.
> 
> An external utility doesn't seem like the right way to approach it.
> For example, given the need to ensure X amount of free space in each
> page, the only way to guarantee that would be to shut down the database
> while you run the utility over all the pages --- otherwise somebody
> might fill some page up again.  And that completely defeats the purpose,
> which is to have minimal downtime during upgrade.
> 
> I think we can have a notion of pre-upgrade maintenance, but it would
> have to be integrated into normal operations.  For instance, if
> conversion to 8.4 requires extra free space, we'd make late releases
> of 8.3.x not only be able to force that to occur, but also tweak the
> normal code paths to maintain that minimum free space.
> 
> The full concept as I understood it (dunno why Bruce left all these
> details out of his message) went like this:

Exactly.  I didn't go into the implementation details to make it easer
for people to see my general goals.  Tom's implementation steps are the
correct approach, assuming we can get agreement on the general goals.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +

Re: [WIP] In-place upgrade

From

"Robert Haas"

Date:

06 November 2008, 15:17:45

> An external utility doesn't seem like the right way to approach it.
> For example, given the need to ensure X amount of free space in each
> page, the only way to guarantee that would be to shut down the database
> while you run the utility over all the pages --- otherwise somebody
> might fill some page up again.  And that completely defeats the purpose,
> which is to have minimal downtime during upgrade.

Agreed.

> I think we can have a notion of pre-upgrade maintenance, but it would
> have to be integrated into normal operations.  For instance, if
> conversion to 8.4 requires extra free space, we'd make late releases
> of 8.3.x not only be able to force that to occur, but also tweak the
> normal code paths to maintain that minimum free space.

1. This seems to fly in the face of the sort of thing we've
traditionally back-patched.  The code to make pages ready for upgrade
to the next major release will not necessarily be straightforward (in
fact it probably isn't, otherwise we wouldn't have insisted on a
two-stage conversion process), which turns a seemingly safe minor
upgrade into a potentially dangerous operation.

2. Just because I want to upgrade to 8.3.47 and get the latest bug
fixes does not mean that I have any intention of upgrading to 8.4, and
yet you've rearranged all of my pages to have useless free space in
them (possibly at considerable and unexpected I/O cost for at least as
long as the conversion is running).

The second point could probably be addressed with a GUC but the first
one certainly can't.

3. What about multi-release upgrades?  Say someone wants to upgrade
from 8.3 to 8.6.  8.6 only knows how to read pages that are
8.5-and-a-half or better, 8.5 only knows how to read pages that are
8.4-and-a-half or better, and 8.4 only knows how to read pages that
are 8.3-and-a-half or better.  So the user will have to upgrade to
8.3.MAX, then 8.4.MAX, then 8.5.MAX, and then 8.6.

It seems to me that if there is any way to put all of the logic to
handle old page versions in the new code that would be much better,
especially if it's an optional feature that can be compiled in or not.Then when it's time to upgrade from 8.3 to 8.6
youcould do:
 

./configure --with-upgrade-83 --with-upgrade-84 --with-upgrade85

but if you don't need the code to handle old page versions you can:

./configure --without-upgrade85

Admittedly, this requires making the new code capable of rearranging
pages to create free space when necessary, and to be able to continue
to execute queries while doing it, but ways of doing this have been
proposed.  The only uncertainty is as to whether the performance and
code complexity can be kept manageable, but I don't believe that
question has been explored to the point where we should be ready to
declare defeat.

...Robert

Re: [WIP] In-place upgrade

From

Bruce Momjian

Date:

06 November 2008, 15:32:02

Robert Haas wrote:
> The second point could probably be addressed with a GUC but the first
> one certainly can't.
> 
> 3. What about multi-release upgrades?  Say someone wants to upgrade
> from 8.3 to 8.6.  8.6 only knows how to read pages that are
> 8.5-and-a-half or better, 8.5 only knows how to read pages that are
> 8.4-and-a-half or better, and 8.4 only knows how to read pages that
> are 8.3-and-a-half or better.  So the user will have to upgrade to
> 8.3.MAX, then 8.4.MAX, then 8.5.MAX, and then 8.6.

Yes.

> It seems to me that if there is any way to put all of the logic to
> handle old page versions in the new code that would be much better,
> especially if it's an optional feature that can be compiled in or not.
>  Then when it's time to upgrade from 8.3 to 8.6 you could do:
> 
> ./configure --with-upgrade-83 --with-upgrade-84 --with-upgrade85
> 
> but if you don't need the code to handle old page versions you can:
> 
> ./configure --without-upgrade85
> 
> Admittedly, this requires making the new code capable of rearranging
> pages to create free space when necessary, and to be able to continue
> to execute queries while doing it, but ways of doing this have been
> proposed.  The only uncertainty is as to whether the performance and
> code complexity can be kept manageable, but I don't believe that
> question has been explored to the point where we should be ready to
> declare defeat.

And almost guarantee that the job will never be completed, or tested
fully.  Remember that in-place upgrades would be pretty painless so
doing multiple major upgrades should not be a difficult requiremnt, or
they can dump/reload their data to skip it.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +

Re: [WIP] In-place upgrade

From

Heikki Linnakangas

Date:

06 November 2008, 15:49:18

Tom Lane wrote:
> I think we can have a notion of pre-upgrade maintenance, but it would
> have to be integrated into normal operations.  For instance, if
> conversion to 8.4 requires extra free space, we'd make late releases
> of 8.3.x not only be able to force that to occur, but also tweak the
> normal code paths to maintain that minimum free space.

Agreed, the backend needs to be modified to reserve the space.

> The full concept as I understood it (dunno why Bruce left all these
> details out of his message) went like this:
> 
> * Add a "format serial number" column to pg_class, and probably also
> pg_database.  Rather like the frozenxid columns, this would have the
> semantics that all pages in a relation or database are known to have at
> least the specified format number.
> 
> * There would actually be two serial numbers per release, at least for
> releases where pre-update prep work is involved --- for instance,
> between 8.3 and 8.4 there'd be an "8.3-and-a-half" format which is
> 8.3 but known ready to update to 8.4 (eg, enough free space available).
> Minor releases of 8.3 that appear with or subsequent to 8.4 release
> understand the "half" format number and how to upgrade to it.
> 
> * VACUUM would be empowered, in the same way as it handles frozenxid
> maintenance, to update any less-than-the-latest-version pages and then
> fix the pg_class and pg_database entries.
> 
> * We could mechanically enforce that you not update until the database
> is ready for it by checking pg_database.datformatversion during
> postmaster startup.

Adding catalog columns seems rather complicated, and not back-patchable. 
Not backpatchable means that we'd need to be sure now that the format 
serial numbers are enough for the upcoming 8.4-8.5 upgrade.

I imagined that you would have just a single cluster-wide variable, a 
GUC perhaps, indicating how much space should be reserved by 
updates/inserts. Then you'd have an additional program, perhaps a new 
contrib module, that sets the variable to the right value for the 
version you're upgrading, and scans through all tables, moving tuples so 
that every page has enough free space for the upgrade. After that's 
done, it'd set a flag in the data directory indicating that the cluster 
is ready for upgrade.

The tool could run concurrently with normal activity, so you could just 
let it run for as long as it takes.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

Re: [WIP] In-place upgrade

From

"Robert Haas"

Date:

06 November 2008, 16:40:38

> And almost guarantee that the job will never be completed, or tested
> fully.  Remember that in-place upgrades would be pretty painless so
> doing multiple major upgrades should not be a difficult requiremnt, or
> they can dump/reload their data to skip it.

Regardless of what design is chosen, there's no requirement that we
support in-place upgrade from 8.3 to 8.6, or even 8.4 to 8.6, in one
shot.  But the design that you and Tom are proposing pretty much
ensures that it will be impossible.

But that's certainly the least important reason not to do it this way.I think this comment from Heikki is pretty
revealing:

> Adding catalog columns seems rather complicated, and not back-patchable. Not backpatchable means that we'd need to be
surenow
 
> that the format serial numbers are enough for the upcoming 8.4-8.5 upgrade.

That means, in essence, that the earliest possible version that could
be in-place upgraded would be an 8.4 system - we are giving up
completely on in-place upgrade to 8.4 from any earlier version (which
personally I thought was the whole point of this feature in the first
place).  And we'll only be able to in-place upgrade to 8.5 if the
unproven assumption that these catalog changes are sufficient turns
out to be true, or if whatever other changes turn out to be necessary
are back-patchable.

...Robert

Re: [WIP] In-place upgrade

From

Tom Lane

Date:

06 November 2008, 17:27:04

"Robert Haas" <robertmhaas@gmail.com> writes:
> That means, in essence, that the earliest possible version that could
> be in-place upgraded would be an 8.4 system - we are giving up
> completely on in-place upgrade to 8.4 from any earlier version (which
> personally I thought was the whole point of this feature in the first
> place).

Quite honestly, given where we are in the schedule and the lack of
consensus about how to do this, I think we would be well advised to
decide right now to forget about supporting in-place upgrade to 8.4,
and instead work on allowing in-place upgrades from 8.4 onwards.
Shooting for a general-purpose does-it-all scheme that can handle
old versions that had no thought of supporting such updates is likely
to ensure that we end up with *NOTHING*.

What Bruce is proposing, I think, is that we intentionally restrict what
we want to accomplish to something that might be within reach now and
also sustainable over the long term.  Planning to update any version to
any other version is *not* sustainable --- we haven't got the resources
nor the interest to create large amounts of conversion code.
        regards, tom lane

Re: [WIP] In-place upgrade

From

Tom Lane

Date:

06 November 2008, 17:42:50

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Adding catalog columns seems rather complicated, and not back-patchable. 

Agreed, we'd not be able to make them retroactively appear in 8.3.

> I imagined that you would have just a single cluster-wide variable, a 
> GUC perhaps, indicating how much space should be reserved by 
> updates/inserts. Then you'd have an additional program, perhaps a new 
> contrib module, that sets the variable to the right value for the 
> version you're upgrading, and scans through all tables, moving tuples so 
> that every page has enough free space for the upgrade. After that's 
> done, it'd set a flag in the data directory indicating that the cluster 
> is ready for upgrade.

Possibly that could work.  The main thing is to have a way of being sure
that the prep work has been completed on every page of the database.
The disadvantage of not having catalog support is that you'd have to
complete the entire scan operation in one go to be sure you'd hit
everything.

Another thought here is that I don't think we are yet committed to any
changes that require extra space between 8.3 and 8.4, are we?  The
proposed addition of CRC words could be put off to 8.5, for instance.
So it seems at least within reach to not require any preparatory steps
for 8.3-to-8.4, and put the infrastructure in place now to support such
steps in future go-rounds.
        regards, tom lane

Re: [WIP] In-place upgrade

From

Greg Smith

Date:

06 November 2008, 19:52:02

On Thu, 6 Nov 2008, Tom Lane wrote:

> Another thought here is that I don't think we are yet committed to any
> changes that require extra space between 8.3 and 8.4, are we?  The
> proposed addition of CRC words could be put off to 8.5, for instance.

I was just staring at that code as you wrote this thinking about the same 
thing.  CRCs are a great feature I'd really like to see.  On the other 
hand, announcing that 8.4 features in-place upgrades for 8.3 databases, 
and that the project has laid the infrastructure such that future releases 
will also upgrade in-place, would IMHO be the biggest positive 
announcement of the new release by a large margin.  At least then new 
large (>1TB) installs could kick off on either the stable 8.3 or 8.4 
knowing they'd never be forced to deal with dump/reload, whereas right now 
there is no reasonable solution for them that involves PostgreSQL (I just 
crossed 3TB on a system last month and I'm not looking forward to its 
future upgrades).

Two questions come to mind here:

-If you reduce the page layout upgrade problem to "convert from V4 to V5 
adding support for CRCs", is there a worthwhile simpler path to handling 
that without dragging the full complexity of the older page layout changes 
in?

-Is it worth considering making CRCs an optional compile-time feature, and 
that (for now at least) you couldn't get them and the in-place upgrade at 
the same time?

Stepping back for a second, the idea that in-place upgrade is only 
worthwhile if it yields zero downtime isn't necessarily the case.  Even 
having an offline-only upgrade tool to handle the more complicated 
situations where tuples have to be squeezed onto another page would still 
be a major improvement over the current situation.  The thing that you 
have to recognize here is that dump/reload is extremely slow because of 
bottlenecks in the COPY process.  That makes for a large amount of 
downtime--many hours isn't unusual.

If older version upgrade downtime was reduced to how long it takes to run 
a "must scan every page and fiddle with it if full" tool, that would still 
be a giant improvement over the current state of things.  If Zdenek's 
figures that only a small percentages of pages will need such adjustment 
holds up, that should take only some factor longer than a sequential scan 
of the whole database.  That's not instant, but it's at least an order of 
magnitude faster than a dump/reload on a big system.

The idea that you're going to get in-place upgrade all the way back to 8.2 
without taking the database down for a even little bit to run such a 
utility is hard to pull off, and it's impressive that Zdenek and everyone 
else involved has gotten so close to doing it.  I personally am on the 
fence as to whether it's worth paying even the 1% penalty for that 
implementation all the time just to get in-place upgrades.  If an offline 
utility with reasonable (scan instead of dump/reload) downtime and closer 
to zero overhead when finished was available instead, that might be a more 
reasonable trade-off to make for handling older releases.  There are so 
many bottlenecks in the older versions that you're less likely to find a 
database too large to dump and reload there anyway.  It would also be the 
case that improvements to that offline utility could continue after 8.4 
proper was completely frozen.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: [WIP] In-place upgrade

From

"Robert Haas"

Date:

06 November 2008, 20:25:40

> The idea that you're going to get in-place upgrade all the way back to 8.2
> without taking the database down for a even little bit to run such a utility
> is hard to pull off, and it's impressive that Zdenek and everyone else
> involved has gotten so close to doing it.

I think we should at least wait to see what the next version of his
patch looks like before making any final decisions.

...Robert

Re: [WIP] In-place upgrade

From

Tom Lane

Date:

06 November 2008, 21:44:06

Greg Smith <gsmith@gregsmith.com> writes:
> On Thu, 6 Nov 2008, Tom Lane wrote:
>> Another thought here is that I don't think we are yet committed to any
>> changes that require extra space between 8.3 and 8.4, are we?  The
>> proposed addition of CRC words could be put off to 8.5, for instance.

> I was just staring at that code as you wrote this thinking about the same 
> thing. ...

> -Is it worth considering making CRCs an optional compile-time feature, and 
> that (for now at least) you couldn't get them and the in-place upgrade at 
> the same time?

Hmm ... might be better than not offering them in 8.4 at all, but the
thing is that then you are asking packagers to decide for their
customers which is more important.  And I'd bet you anything you want
that in-place upgrade would be their choice.

Also, having such an option would create extra complexity for 8.4-to-8.5
upgrades.
        regards, tom lane

Re: [WIP] In-place upgrade

From

Greg Smith

Date:

06 November 2008, 22:31:50

On Thu, 6 Nov 2008, Tom Lane wrote:

>> -Is it worth considering making CRCs an optional compile-time feature, and
>> that (for now at least) you couldn't get them and the in-place upgrade at
>> the same time?
>
> Hmm ... might be better than not offering them in 8.4 at all, but the
> thing is that then you are asking packagers to decide for their
> customers which is more important.  And I'd bet you anything you want
> that in-place upgrade would be their choice.

I was thinking of something similar to how --enable-thread-safety has been 
rolled out.  It could be hanging around there and available to those who 
want it in their build, even though it might not be available by default 
in a typical mainstream distribution.  Since there's already a GUC for 
toggling the checksums in the code, internally it could work like 
debug_assertions where you only get that option if support was compiled in 
appropriately.  Just a thought I wanted to throw out there, if it makes 
eventual upgrades from 8.4 more complicated it may not be worth even 
considering.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: [WIP] In-place upgrade

From

Zdenek Kotala

Date:

07 November 2008, 16:22:30

Heikki Linnakangas napsal(a):
> Tom Lane wrote:
>> I think we can have a notion of pre-upgrade maintenance, but it would
>> have to be integrated into normal operations.  For instance, if
>> conversion to 8.4 requires extra free space, we'd make late releases
>> of 8.3.x not only be able to force that to occur, but also tweak the
>> normal code paths to maintain that minimum free space.
> 
> Agreed, the backend needs to be modified to reserve the space.
> 
>> The full concept as I understood it (dunno why Bruce left all these
>> details out of his message) went like this:
>>
>> * Add a "format serial number" column to pg_class, and probably also
>> pg_database.  Rather like the frozenxid columns, this would have the
>> semantics that all pages in a relation or database are known to have at
>> least the specified format number.
>>
>> * There would actually be two serial numbers per release, at least for
>> releases where pre-update prep work is involved --- for instance,
>> between 8.3 and 8.4 there'd be an "8.3-and-a-half" format which is
>> 8.3 but known ready to update to 8.4 (eg, enough free space available).
>> Minor releases of 8.3 that appear with or subsequent to 8.4 release
>> understand the "half" format number and how to upgrade to it.
>>
>> * VACUUM would be empowered, in the same way as it handles frozenxid
>> maintenance, to update any less-than-the-latest-version pages and then
>> fix the pg_class and pg_database entries.
>>
>> * We could mechanically enforce that you not update until the database
>> is ready for it by checking pg_database.datformatversion during
>> postmaster startup.
> 
> Adding catalog columns seems rather complicated, and not back-patchable. 
> Not backpatchable means that we'd need to be sure now that the format 
> serial numbers are enough for the upcoming 8.4-8.5 upgrade.

Reloptions is suitable for keeping amount of reserver space. And it can be back 
ported into 8.3 and 8.2. And of course there is no problem to convert 8.1->8.2.

For backported branch would be better to combine internal modification - 
preserve space and e.g. store procedure which check all relations.

In the 8.4 and newer pg_class could be extended for new attributes.

> I imagined that you would have just a single cluster-wide variable, a 
> GUC perhaps, indicating how much space should be reserved by 
> updates/inserts. 

You sometimes need different reserved size for different type of relation. For 
example on 32bit x86 you don't need reserve space for heap but you need do it 
for indexes (between v3->v4). Better is to use reloptions and pre-upgrade 
procedure sets this information correctly.

> Then you'd have an additional program, perhaps a new 
> contrib module, that sets the variable to the right value for the 
> version you're upgrading, and scans through all tables, moving tuples so 
> that every page has enough free space for the upgrade. After that's 
> done, it'd set a flag in the data directory indicating that the cluster 
> is ready for upgrade.

I prefer to have this information in pg_class. It is accessible by SQL commands. 
pg_class should also contains information about last checked page to prevent 
repeatable check on very large tables.

> The tool could run concurrently with normal activity, so you could just 
> let it run for as long as it takes.

Agree.
    Zdenek

-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql

Re: [WIP] In-place upgrade

From

Zdenek Kotala

Date:

07 November 2008, 16:27:37

Tom Lane napsal(a):
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Adding catalog columns seems rather complicated, and not back-patchable. 
> 
> Agreed, we'd not be able to make them retroactively appear in 8.3.
> 
>> I imagined that you would have just a single cluster-wide variable, a 
>> GUC perhaps, indicating how much space should be reserved by 
>> updates/inserts. Then you'd have an additional program, perhaps a new 
>> contrib module, that sets the variable to the right value for the 
>> version you're upgrading, and scans through all tables, moving tuples so 
>> that every page has enough free space for the upgrade. After that's 
>> done, it'd set a flag in the data directory indicating that the cluster 
>> is ready for upgrade.
> 
> Possibly that could work.  The main thing is to have a way of being sure
> that the prep work has been completed on every page of the database.
> The disadvantage of not having catalog support is that you'd have to
> complete the entire scan operation in one go to be sure you'd hit
> everything.

I prefer to have catalog support. Special on very long tables it helps when 
somebody stop preupgrade script for some reason.

> Another thought here is that I don't think we are yet committed to any
> changes that require extra space between 8.3 and 8.4, are we?  The
> proposed addition of CRC words could be put off to 8.5, for instance.
> So it seems at least within reach to not require any preparatory steps
> for 8.3-to-8.4, and put the infrastructure in place now to support such
> steps in future go-rounds.

Yeah. We still have V4 without any storage modification (exclude HASH index). 
However I think if reloptions will be use for storing information about reserved 
space then It shouldn't be a problem. But we need to be sure if it is possible.
    Zdenek

-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql

Re: [WIP] In-place upgrade

From

Zdenek Kotala

Date:

07 November 2008, 16:41:25

Tom Lane napsal(a):

> I think we can have a notion of pre-upgrade maintenance, but it would
> have to be integrated into normal operations.  For instance, if
> conversion to 8.4 requires extra free space, we'd make late releases
> of 8.3.x not only be able to force that to occur, but also tweak the
> normal code paths to maintain that minimum free space.

OK. I will focus on this. I guess this approach revival my hook patch:

http://archives.postgresql.org/pgsql-hackers/2008-04/msg00990.php

> The full concept as I understood it (dunno why Bruce left all these
> details out of his message) went like this:
> 
> * Add a "format serial number" column to pg_class, and probably also
> pg_database.  Rather like the frozenxid columns, this would have the
> semantics that all pages in a relation or database are known to have at
> least the specified format number.
> 
> * There would actually be two serial numbers per release, at least for
> releases where pre-update prep work is involved --- for instance,
> between 8.3 and 8.4 there'd be an "8.3-and-a-half" format which is
> 8.3 but known ready to update to 8.4 (eg, enough free space available).
> Minor releases of 8.3 that appear with or subsequent to 8.4 release
> understand the "half" format number and how to upgrade to it.

I prefer to have latest processed block. InvalidBlockNumber would mean nothing 
is processed and 0 means everything is already reserved. I suggest to process it 
backward. It should prevent to check new extended block which will be already 
correctly setup.

> * VACUUM would be empowered, in the same way as it handles frozenxid
> maintenance, to update any less-than-the-latest-version pages and then
> fix the pg_class and pg_database entries.
> 
> * We could mechanically enforce that you not update until the database
> is ready for it by checking pg_database.datformatversion during
> postmaster startup.

I'm don't understand you here? Do you mean on old server version or new server 
version. Or who will perform this check? Do not remember that we currently do 
catalog conversion by dump and import which lost all extended information.
Thanks Zdenek


-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql

Re: [WIP] In-place upgrade

From

Tom Lane

Date:

07 November 2008, 16:52:40

Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
> Tom Lane napsal(a):
>> * Add a "format serial number" column to pg_class, and probably also
>> pg_database.  Rather like the frozenxid columns, this would have the
>> semantics that all pages in a relation or database are known to have at
>> least the specified format number.

> I prefer to have latest processed block. InvalidBlockNumber would mean
> nothing is processed and 0 means everything is already reserved. I
> suggest to process it backward. It should prevent to check new
> extended block which will be already correctly setup.

That seems bizarre and not very helpful.  In the first place, if we're
driving it off vacuum there would be no opportunity for recording a
half-processed state value.  In the second place, this formulation fails
to provide any evidence of *what* processing you completed or didn't
complete.  In a multi-step upgrade sequence I think it's going to be a
mess if we aren't explicit about that.
        regards, tom lane

Re: [WIP] In-place upgrade

From

Decibel!

Date:

09 November 2008, 20:12:25

On Nov 6, 2008, at 1:31 PM, Bruce Momjian wrote:
>> 3. What about multi-release upgrades?  Say someone wants to upgrade
>> from 8.3 to 8.6.  8.6 only knows how to read pages that are
>> 8.5-and-a-half or better, 8.5 only knows how to read pages that are
>> 8.4-and-a-half or better, and 8.4 only knows how to read pages that
>> are 8.3-and-a-half or better.  So the user will have to upgrade to
>> 8.3.MAX, then 8.4.MAX, then 8.5.MAX, and then 8.6.
>
> Yes.

I think that's pretty seriously un-desirable. It's not at all  
uncommon for databases to stick around for a very long time and then  
jump ahead many versions. I don't think we want to tell people they  
can't do that.

More importantly, I think we're barking up the wrong tree by putting  
migration knowledge into old versions. All that the old versions need  
to do is guarantee a specific amount of free space per page. We  
should provide a mechanism to tell a cluster what that free space  
requirement is, and not hard-code it into the backend.

Unless I'm mistaken, there are only two cases we care about for  
additional space: per-page and per-tuple. Those requirements could  
also vary for different types of pg_class objects. What we need is an  
API that allows an administrator to tell the database to start  
setting this space aside. One possibility:

pg_min_free_space( version, relkind, bytes_per_page, bytes_per_tuple );
pg_min_free_space_index( version, indexkind, bytes_per_page,  
bytes_per_tuple );

version: This would be provided as a safety mechanism. You would have  
to provide the major version that matches what the backend is  
running. See below for an example.

relkind: Essentially, heap vs toast, though I suppose it's possible  
we might need this for sequences.

indexkind: Because we support different types of indexes, I think we  
need to handle them differently than heap/toast. If we wanted, we  
could have a single function that demands that indexkind is NULL if  
relkind != 'index'.

bytes_per_(page|tuple): obvious. :)

Once we have an API, we need to get users to make use of it. I'm  
thinking add something like the following to the release notes:

"To upgrade from a prior version to 8.4, you will need to run some of  
the following commands, depending on what version you are currently  
using:

For version 8.3:
SELECT pg_min_free_space( '8.3', 'heap', 4, 12 );
SELECT pg_min_free_space( '8.3', 'toast', 4, 12 );

For version 8.2:
SELECT pg_min_free_space( '8.2', 'heap', 14, 12 );
SELECT pg_min_free_space( '8.2', 'toast', 14, 12 );
SELECT pg_min_free_space_index( '8.2', 'b-tree', 4, 4);"

(Note I'm just pulling numbers out of thin air in this example.)

As you can see, we pass in the version number to ensure that if  
someone accidentally cut and pastes the wrong stuff they know what  
they did wrong immediately.

One downside to this scheme is that it doesn't provide a mechanism to  
ensure that all required minimum free space requirements were passed  
in. Perhaps we want a function that takes an array of complex types  
and forces you to supply information for all known storage  
mechanisms. Another possibility would be to pass in some kind of  
binary format that contains a checksum.

Even if we do come up with a pretty fool-proof way to tell the old  
version what free space it needs to set aside, I think we should  
still have a mechanism for the new version to know exactly what the  
old version has set aside, and if it's actually been accomplished or  
not. One option that comes to mind is to add min_free_space_per_page  
and min_free_space_per_tuple to pg_class. Normally these fields would  
be NULL; the old version would only set them once it had verified  
that all pages in a given relation met those requirements (presumably  
via vacuum). The new version would check all these values on startup  
to ensure they made sense.

OTOH, we might not want to go mucking around with changing the  
catalog for older versions (I'm not even sure if we can). So perhaps  
it would be better to store this information in a separate table, or  
maybe a separate file. That might be best anyway; we generally  
wouldn't need this information, so it would be nice if it wasn't  
bloating pg_class all the time.
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Re: [WIP] In-place upgrade

From

Tom Lane

Date:

09 November 2008, 21:03:09

Decibel! <decibel@decibel.org> writes:
> I think that's pretty seriously un-desirable. It's not at all  
> uncommon for databases to stick around for a very long time and then  
> jump ahead many versions. I don't think we want to tell people they  
> can't do that.

Of course they can do that --- they just have to do it one version at a
time.

I think it's time for people to stop asking for the moon and realize
that if we don't constrain this feature pretty darn tightly, we will
have *nothing at all* for 8.4.  Again.
        regards, tom lane

Re: [WIP] In-place upgrade

From

"Joshua D. Drake"

Date:

10 November 2008, 00:10:21

On Sun, 2008-11-09 at 20:02 -0500, Tom Lane wrote:
> Decibel! <decibel@decibel.org> writes:
> > I think that's pretty seriously un-desirable. It's not at all  
> > uncommon for databases to stick around for a very long time and then  
> > jump ahead many versions. I don't think we want to tell people they  
> > can't do that.
> 
> Of course they can do that --- they just have to do it one version at a
> time.
> 
> I think it's time for people to stop asking for the moon and realize
> that if we don't constrain this feature pretty darn tightly, we will
> have *nothing at all* for 8.4.  Again.

Gotta go with Tom on this one. The idea that we would somehow upgrade
from 8.1 to 8.4 is silly. Yes it will be unfortunate for those running
8.1 but keeping track of multi version like that is going to be entirely
too expensive.

At some point it won't matter but right now it really does.

Joshua D. Drake

> 
>             regards, tom lane
> 
--

Re: [WIP] In-place upgrade

From

Zdenek Kotala

Date:

10 November 2008, 04:15:33

Decibel! napsal(a):

> Unless I'm mistaken, there are only two cases we care about for 
> additional space: per-page and per-tuple. 

Yes. And maybe special space indexes could be extended, but it is covered in 
per-page setting.

> Those requirements could also 
> vary for different types of pg_class objects. What we need is an API 
> that allows an administrator to tell the database to start setting this 
> space aside. One possibility:

We need API or mechanism how in-place upgrade will setup it. It must be done by 
in-place upgrade.

<snip>

> relkind: Essentially, heap vs toast, though I suppose it's possible we 
> might need this for sequences.

Sequences are converted during catalog upgrade.


<snip>
> Once we have an API, we need to get users to make use of it. I'm 
> thinking add something like the following to the release notes:
> 
> "To upgrade from a prior version to 8.4, you will need to run some of 
> the following commands, depending on what version you are currently using:
>
<snip>

It is too complicated. At first it depends also on architecture and it is 
possible to easily compute by in-place upgrade script. What you need is only run 
script which do all setting for you. You can obtain it from next version (IIRC 
Oracle do it this way) or we can add this configuration script into previous 
version during a minor update.

> 
> OTOH, we might not want to go mucking around with changing the catalog 
> for older versions (I'm not even sure if we can). So perhaps it would be 
> better to store this information in a separate table, or maybe a 
> separate file. That might be best anyway; we generally wouldn't need 
> this information, so it would be nice if it wasn't bloating pg_class all 
> the time.

It is why I selected relopt for storing this configuration parameter, which is 
supported from 8.2 and upgrade from 8.1->8.2 works fine.
    Zdenek

Re: [WIP] In-place upgrade

From

"Matthew T. O'Connor"

Date:

10 November 2008, 10:14:30

Tom Lane wrote:
> Decibel! <decibel@decibel.org> writes:
>   
>> I think that's pretty seriously un-desirable. It's not at all  
>> uncommon for databases to stick around for a very long time and then  
>> jump ahead many versions. I don't think we want to tell people they  
>> can't do that.
>>     
>
> Of course they can do that --- they just have to do it one version at a
> time.

Also, people may be less likely to stick with an old outdated version 
for years and years if the upgrade process is easier.

Re: [WIP] In-place upgrade

From

"Joshua D. Drake"

Date:

10 November 2008, 14:17:26

On Mon, 2008-11-10 at 09:14 -0500, Matthew T. O'Connor wrote:
> Tom Lane wrote:
> > Decibel! <decibel@decibel.org> writes:
> >   
> >> I think that's pretty seriously un-desirable. It's not at all  
> >> uncommon for databases to stick around for a very long time and then  
> >> jump ahead many versions. I don't think we want to tell people they  
> >> can't do that.
> >>     
> >
> > Of course they can do that --- they just have to do it one version at a
> > time.
> 
> Also, people may be less likely to stick with an old outdated version 
> for years and years if the upgrade process is easier.

Kind of OT but, I don't agree with this. There will always be those who
are willing to just upgrade because they can but the smart play is to
upgrade because you need to. If anything in place upgrades is just going
to remove the last real business and technical barrier to using
postgresql for enterprises.

Joshua D. Drake

> 
> 
--

Re: [WIP] In-place upgrade

From

Jeff

Date:

10 November 2008, 16:36:37

On Nov 9, 2008, at 11:09 PM, Joshua D. Drake wrote:
>> I think it's time for people to stop asking for the moon and realize
>> that if we don't constrain this feature pretty darn tightly, we will
>> have *nothing at all* for 8.4.  Again.
>
> Gotta go with Tom on this one. The idea that we would somehow upgrade
> from 8.1 to 8.4 is silly. Yes it will be unfortunate for those running
> 8.1 but keeping track of multi version like that is going to be  
> entirely
> too expensive.
>

I agree as well. If we can get the at least the base level stuff in  
8.4 so that 8.5 and beyond is in-place upgradable then that is a huge  
win.   If we could support 8.2 or 8.3 or 6.5 :) that would be nice,  
but I think dealing with everything retroactively will cause our heads  
to explode and a mountain of awful code to arise.   If we say "8.4 and  
beyond will be upgradable" we can toss everything in we think we'll  
need to deal with it and not worry about the retroactive case (unless  
someone has a really clever(tm) idea!)

This can't be an original problem to solve, too many other databases  
do it as well.

--
Jeff Trout <jeff@jefftrout.com>
http://www.stuarthamm.net/
http://www.dellsmartexitin.com/

Re: [WIP] In-place upgrade

From

"Robert Haas"

Date:

26 November 2008, 00:40:35

Zdenek -

I am a bit murky on where we stand with upgrade-in-place in terms of
reviewing.  Initially, you had submitted four patches for this
commitfest:

1. htup and bufpage API clean up
2. HeapTuple version extension + code cleanup
3. In-place online upgrade
4. Extending pg_class info + more flexible TOAST chunk size

I think that it was decided that replacing the heap tuple access
macros with function calls was not acceptable, so I have moved patches
#1 and #2 to the "Returned with feedback" section.  I thought that
perhaps the third patch could be salvaged, but the consensus seemed to
be to go in a new direction, so I'm thinking that one should probably
be moved to "Returned with feedback" as well.  However, I'm not clear
on whether you will be submitting something else instead and whether
that thing should be considered material for this commitfest.  Can you
let me know how you are thinking about this?

With respect to #4, I know that Alvaro submitted a draft patch, but
I'm not clear on whether that needs to be reviewed, because:

- I'm not sure whether it's close enough to being finished for a
review to be a good use of time.
- I'm not sure how much you and Heikki have already reviewed it.
- I'm not sure whether this patch buys us anything by itself.

Thoughts?

...Robert

Re: [WIP] In-place upgrade

From

Zdenek Kotala

Date:

26 November 2008, 05:03:27

Robert,

big thanks for your review. I think #1 is still partially valid, because it 
contains general cleanups, but part of it is not necessary now. #2, #3 and #4 
you can move to return with feedback section.
    Thanks Zdenek

Robert Haas napsal(a):
> Zdenek -
> 
> I am a bit murky on where we stand with upgrade-in-place in terms of
> reviewing.  Initially, you had submitted four patches for this
> commitfest:
> 
> 1. htup and bufpage API clean up
> 2. HeapTuple version extension + code cleanup
> 3. In-place online upgrade
> 4. Extending pg_class info + more flexible TOAST chunk size
> 
> I think that it was decided that replacing the heap tuple access
> macros with function calls was not acceptable, so I have moved patches
> #1 and #2 to the "Returned with feedback" section.  I thought that
> perhaps the third patch could be salvaged, but the consensus seemed to
> be to go in a new direction, so I'm thinking that one should probably
> be moved to "Returned with feedback" as well.  However, I'm not clear
> on whether you will be submitting something else instead and whether
> that thing should be considered material for this commitfest.  Can you
> let me know how you are thinking about this?
> 
> With respect to #4, I know that Alvaro submitted a draft patch, but
> I'm not clear on whether that needs to be reviewed, because:
> 
> - I'm not sure whether it's close enough to being finished for a
> review to be a good use of time.
> - I'm not sure how much you and Heikki have already reviewed it.
> - I'm not sure whether this patch buys us anything by itself.
> 
> Thoughts?
> 
> ...Robert

Re: [WIP] In-place upgrade

From

"Robert Haas"

Date:

26 November 2008, 10:45:02

>> 1. htup and bufpage API clean up
>> 2. HeapTuple version extension + code cleanup
>> 3. In-place online upgrade
>> 4. Extending pg_class info + more flexible TOAST chunk size
> big thanks for your review. I think #1 is still partially valid, because it
> contains general cleanups, but part of it is not necessary now. #2, #3 and
> #4 you can move to return with feedback section.

OK, when can you submit a new version of #1 with the parts that are
still valid, updated to CVS HEAD, etc?

Thanks,

...Robert

Re: [WIP] In-place upgrade

From

Alvaro Herrera

Date:

26 November 2008, 10:56:08

Robert Haas escribió:

> With respect to #4, I know that Alvaro submitted a draft patch, but
> I'm not clear on whether that needs to be reviewed, because:
> 
> - I'm not sure whether it's close enough to being finished for a
> review to be a good use of time.
> - I'm not sure how much you and Heikki have already reviewed it.
> - I'm not sure whether this patch buys us anything by itself.

I finished that patch, but I didn't submit it because in later
discussion it turned out (at least as I read it) that it's considered to
be unnecessary.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: [WIP] In-place upgrade

From

Zdenek Kotala

Date:

26 November 2008, 11:19:11

Alvaro Herrera napsal(a):
> Robert Haas escribió:
> 
>> With respect to #4, I know that Alvaro submitted a draft patch, but
>> I'm not clear on whether that needs to be reviewed, because:
>>
>> - I'm not sure whether it's close enough to being finished for a
>> review to be a good use of time.
>> - I'm not sure how much you and Heikki have already reviewed it.
>> - I'm not sure whether this patch buys us anything by itself.
> 
> I finished that patch, but I didn't submit it because in later
> discussion it turned out (at least as I read it) that it's considered to
> be unnecessary.
> 
From pg_upgrade perspective, it is something what we will need do anyway. 
Because TOAST_MAX_CHUNK_SIZE will be different in 8.5 (if you commit CRC). Then 
we will need the patch for 8.5. It is not necessary for 8.3->8.4 upgrade because  TOAST_MAX_CHUNK_SIZE is same. And
makethis change into toast table now will 
 
add unnecessary complexity.
    Zdenek

Re: [WIP] In-place upgrade

From

Zdenek Kotala

Date:

27 November 2008, 07:55:50

Robert Haas napsal(a):
>>> 1. htup and bufpage API clean up
>>> 2. HeapTuple version extension + code cleanup
>>> 3. In-place online upgrade
>>> 4. Extending pg_class info + more flexible TOAST chunk size
>> big thanks for your review. I think #1 is still partially valid, because it
>> contains general cleanups, but part of it is not necessary now. #2, #3 and
>> #4 you can move to return with feedback section.
> 
> OK, when can you submit a new version of #1 with the parts that are
> still valid, updated to CVS HEAD, etc?
> 

It does not have priority now. I'm working on space reservation first.
    Thanks Zdenek