Removing NONSEG mode - Mailing list pgsql-patches

From Zdenek Kotala
Subject Removing NONSEG mode
Msg-id 480DCAEF.6040106@sun.com
Responses Re: Removing NONSEG mode  (Alvaro Herrera <alvherre@commandprompt.com>)
Re: Removing NONSEG mode  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-patches
The attached patch removes support for nonsegmented mode, as discussed during
the last commit fest. Nonsegmented mode is usable only on a couple of
filesystems (ZFS, XFS), and it is not safe as a default on any OS, because every
OS supports more than one filesystem.

I added a --with-segsize option to the configure script so that it is easy to
compile with a different segment size (1TB is a safe value on most filesystems).
As a bonus, I also added a --with-blocksize option that sets BLCKSZ. It is not
important for this patch, but it could be useful, e.g. for buildfarm testing
with different BLCKSZ values.
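
To make the block size handling concrete, here is a minimal C sketch of the mapping that the new configure `case` statement performs, from the value given in kB to BLCKSZ in bytes. The function names are hypothetical and exist only for illustration; the real logic lives in configure.in, not in C. The validity check covers only the two constraints stated in the BLCKSZ comment (power of 2, at most 2^15).

```c
#include <assert.h>

/*
 * Map a --with-blocksize value (in kB) to a BLCKSZ value in bytes.
 * Returns 0 for values the configure check would reject.
 * Hypothetical helper, mirroring the case statement in configure.in.
 */
static int
blocksize_kb_to_bytes(int kb)
{
    switch (kb)
    {
        case 1:
        case 2:
        case 4:
        case 8:
        case 16:
        case 32:
            return kb * 1024;
        default:
            return 0;
    }
}

/* BLCKSZ must be a power of 2 and no larger than 2^15 (32768). */
static int
is_valid_blcksz(int blcksz)
{
    return blcksz > 0 && blcksz <= 32768 && (blcksz & (blcksz - 1)) == 0;
}

/* The default of 8 kB maps to the traditional BLCKSZ of 8192. */
static int
default_blcksz(void)
{
    int blcksz = blocksize_kb_to_bytes(8);

    assert(is_valid_blcksz(blcksz));
    return blcksz;
}
```

With these helpers, `blocksize_kb_to_bytes(16)` yields 16384, while an unsupported value such as 3 yields 0, which corresponds to the "Invalid block size" error in configure.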

The patch requires running autoconf and autoheader.

        Zdenek

PS: --with-segsize=1/1024 sets the segment size to 1MB, which is good for testing.
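
The fractional value works because configure pastes the option text straight into the expression 1024*1024*1024LL*${default_segsize}/BLCKSZ. The C sketch below spells out that expansion for three sample values, assuming the default block size of 8192. The macro and function names are hypothetical and used only for illustration.

```c
#include <assert.h>

/* Stand-in for BLCKSZ; 8192 is the default block size in the patch. */
#define EXAMPLE_BLCKSZ 8192

/*
 * Textual expansions of 1024*1024*1024LL*${default_segsize}/BLCKSZ
 * for --with-segsize=1, --with-segsize=1/1024 and --with-segsize=1024.
 * Each result is a number of blocks per segment file.
 */
#define EXAMPLE_RELSEG_SIZE_1GB (1024*1024*1024LL*1/EXAMPLE_BLCKSZ)
#define EXAMPLE_RELSEG_SIZE_1MB (1024*1024*1024LL*1/1024/EXAMPLE_BLCKSZ)
#define EXAMPLE_RELSEG_SIZE_1TB (1024*1024*1024LL*1024/EXAMPLE_BLCKSZ)

/* Maximum size in bytes of one segment file: RELSEG_SIZE * BLCKSZ. */
static long long
segment_bytes(long long relseg_size)
{
    assert(relseg_size > 0);
    return relseg_size * EXAMPLE_BLCKSZ;
}
```

Under these assumptions the default gives 131072 blocks per segment, the 1/1024 setting gives 128 blocks (1MB), and a 1TB segment holds 134217728 blocks.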


Index: configure.in
===================================================================
RCS file: /zfs_data/cvs_pgsql/cvsroot/pgsql/configure.in,v
retrieving revision 1.555
diff -c -r1.555 configure.in
*** configure.in    30 Mar 2008 04:08:14 -0000    1.555
--- configure.in    21 Apr 2008 15:19:59 -0000
***************
*** 220,233 ****
  #
  # Data file segmentation
  #
! PGAC_ARG_BOOL(enable, segmented-files, yes,
!               [  --disable-segmented-files disable data file segmentation (requires largefile support)])

  #
  # C compiler
  #

  # For historical reasons you can also use --with-CC to specify the C compiler
  # to use, although the standard way to do this is to set the CC environment
  # variable.
  PGAC_ARG_REQ(with, CC, [], [CC=$with_CC])
--- 220,287 ----
  #
  # Data file segmentation
  #
! AC_MSG_CHECKING([for default relation segment size])
! PGAC_ARG_REQ(with, segsize, [  --with-segsize=RELSEG_SIZE  change default relation segment size in GB [[1]]],
!              [default_segsize=$withval],
!              [default_segsize=1])
! AC_MSG_RESULT([${default_segsize}GB])
! AC_DEFINE_UNQUOTED([RELSEG_SIZE], 1024*1024*1024LL*${default_segsize}/BLCKSZ, [
!  RELSEG_SIZE is the maximum number of blocks allowed in one disk
!  file. Thus, the maximum size of a single file is RELSEG_SIZE * BLCKSZ;
!  relations bigger than that are divided into multiple files.
!
!  RELSEG_SIZE * BLCKSZ must be less than your OS' limit on file size.
!  This is often 2 GB or 4GB in a 32-bit operating system, unless you
!  have large file support enabled.  By default, we make the limit 1
!  GB to avoid any possible integer-overflow problems within the OS.
!  A limit smaller than necessary only means we divide a large
!  relation into more chunks than necessary, so it seems best to err
!  in the direction of a small limit.  (Besides, a power-of-2 value
!  saves a few cycles in md.c.)

+  Changing RELSEG_SIZE requires an initdb.
+ ])
+ AC_SUBST(default_segsize)
+
+ #
+ # Block size
  #
+ AC_MSG_CHECKING([for default block size])
+ PGAC_ARG_REQ(with, blocksize, [  --with-blocksize=BLCKSZ change default block size (1,2,4,8,16,32 are allowed values).[[8]]],
+              [default_blocksize=$withval],
+              [default_blocksize=8])
+ case ${default_blocksize} in
+   1) default_blocksize=1024;;
+   2) default_blocksize=2048;;
+   4) default_blocksize=4096;;
+   8) default_blocksize=8192;;
+  16) default_blocksize=16384;;
+  32) default_blocksize=32768;;
+   *) AC_MSG_ERROR([Invalid block size. Allowed values are 1,2,4,8,16,32.])
+ esac
+
+ AC_MSG_RESULT([${default_blocksize}B])
+ AC_DEFINE_UNQUOTED([BLCKSZ], ${default_blocksize}, [
+  Size of a disk block --- this also limits the size of a tuple.  You
+  can set it bigger if you need bigger tuples (although TOAST should
+  reduce the need to have large tuples, since fields can be spread
+  across multiple tuples).
+
+  BLCKSZ must be a power of 2.  The maximum possible value of BLCKSZ
+  is currently 2^15 (32768).  This is determined by the 15-bit widths
+  of the lp_off and lp_len fields in ItemIdData (see
+  include/storage/itemid.h).
+
+  Changing BLCKSZ requires an initdb.
+ ])
+ AC_SUBST(default_blocksize)
+
+
  # C compiler
  #

  # For historical reasons you can also use --with-CC to specify the C compiler
+
  # to use, although the standard way to do this is to set the CC environment
  # variable.
  PGAC_ARG_REQ(with, CC, [], [CC=$with_CC])
***************
*** 1435,1443 ****

  # Check for largefile support (must be after AC_SYS_LARGEFILE)
  AC_CHECK_SIZEOF([off_t])
!
! if test "$ac_cv_sizeof_off_t" -lt 8 -o "$enable_segmented_files" = "yes"; then
!   AC_DEFINE([USE_SEGMENTED_FILES], 1, [Define to split data files into 1GB segments.])
  fi

  # SunOS doesn't handle negative byte comparisons properly with +/- return
--- 1489,1496 ----

  # Check for largefile support (must be after AC_SYS_LARGEFILE)
  AC_CHECK_SIZEOF([off_t])
! if test "$ac_cv_sizeof_off_t" -lt 8 -a "$default_segsize" != "1"; then
!    AC_MSG_ERROR([Large file support is not enabled. Segment size cannot be larger than 1GB.])
  fi

  # SunOS doesn't handle negative byte comparisons properly with +/- return
Index: src/backend/storage/file/buffile.c
===================================================================
RCS file: /zfs_data/cvs_pgsql/cvsroot/pgsql/src/backend/storage/file/buffile.c,v
retrieving revision 1.30
diff -c -r1.30 buffile.c
*** src/backend/storage/file/buffile.c    10 Mar 2008 20:06:27 -0000    1.30
--- src/backend/storage/file/buffile.c    18 Apr 2008 08:13:45 -0000
***************
*** 38,45 ****
  #include "storage/buffile.h"

  /*
!  * We break BufFiles into gigabyte-sized segments, whether or not
!  * USE_SEGMENTED_FILES is defined.  The reason is that we'd like large
   * temporary BufFiles to be spread across multiple tablespaces when available.
   */
  #define MAX_PHYSICAL_FILESIZE    0x40000000
--- 38,44 ----
  #include "storage/buffile.h"

  /*
!  * We break BufFiles into gigabyte-sized segments. The reason is that we'd like large
   * temporary BufFiles to be spread across multiple tablespaces when available.
   */
  #define MAX_PHYSICAL_FILESIZE    0x40000000
Index: src/backend/storage/smgr/md.c
===================================================================
RCS file: /zfs_data/cvs_pgsql/cvsroot/pgsql/src/backend/storage/smgr/md.c,v
retrieving revision 1.137
diff -c -r1.137 md.c
*** src/backend/storage/smgr/md.c    18 Apr 2008 06:48:38 -0000    1.137
--- src/backend/storage/smgr/md.c    18 Apr 2008 08:12:02 -0000
***************
*** 89,106 ****
   *
   *    All MdfdVec objects are palloc'd in the MdCxt memory context.
   *
-  *    On platforms that support large files, USE_SEGMENTED_FILES can be
-  *    #undef'd to disable the segmentation logic.  In that case each
-  *    relation is a single operating-system file.
   */

  typedef struct _MdfdVec
  {
      File        mdfd_vfd;        /* fd number in fd.c's pool */
      BlockNumber mdfd_segno;        /* segment number, from 0 */
- #ifdef USE_SEGMENTED_FILES
      struct _MdfdVec *mdfd_chain;    /* next segment, or NULL */
- #endif
  } MdfdVec;

  static MemoryContext MdCxt;        /* context for all md.c allocations */
--- 89,101 ----
***************
*** 162,171 ****
  static void register_unlink(RelFileNode rnode);
  static MdfdVec *_fdvec_alloc(void);

- #ifdef USE_SEGMENTED_FILES
  static MdfdVec *_mdfd_openseg(SMgrRelation reln, BlockNumber segno,
                int oflags);
- #endif
  static MdfdVec *_mdfd_getseg(SMgrRelation reln, BlockNumber blkno,
               bool isTemp, ExtensionBehavior behavior);
  static BlockNumber _mdnblocks(SMgrRelation reln, MdfdVec *seg);
--- 157,164 ----
***************
*** 258,266 ****

      reln->md_fd->mdfd_vfd = fd;
      reln->md_fd->mdfd_segno = 0;
- #ifdef USE_SEGMENTED_FILES
      reln->md_fd->mdfd_chain = NULL;
- #endif
  }

  /*
--- 251,257 ----
***************
*** 344,350 ****
                              rnode.relNode)));
      }

- #ifdef USE_SEGMENTED_FILES
      /* Delete the additional segments, if any */
      else
      {
--- 335,340 ----
***************
*** 374,380 ****
          }
          pfree(segpath);
      }
- #endif

      pfree(path);

--- 364,369 ----
***************
*** 420,431 ****

      v = _mdfd_getseg(reln, blocknum, isTemp, EXTENSION_CREATE);

- #ifdef USE_SEGMENTED_FILES
      seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
      Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- #else
-     seekpos = (off_t) BLCKSZ * blocknum;
- #endif

      /*
       * Note: because caller usually obtained blocknum by calling mdnblocks,
--- 409,416 ----
***************
*** 469,477 ****
      if (!isTemp)
          register_dirty_segment(reln, v);

- #ifdef USE_SEGMENTED_FILES
      Assert(_mdnblocks(reln, v) <= ((BlockNumber) RELSEG_SIZE));
- #endif
  }

  /*
--- 454,460 ----
***************
*** 530,539 ****

      mdfd->mdfd_vfd = fd;
      mdfd->mdfd_segno = 0;
- #ifdef USE_SEGMENTED_FILES
      mdfd->mdfd_chain = NULL;
      Assert(_mdnblocks(reln, mdfd) <= ((BlockNumber) RELSEG_SIZE));
- #endif

      return mdfd;
  }
--- 513,520 ----
***************
*** 552,558 ****

      reln->md_fd = NULL;            /* prevent dangling pointer after error */

- #ifdef USE_SEGMENTED_FILES
      while (v != NULL)
      {
          MdfdVec    *ov = v;
--- 533,538 ----
***************
*** 564,574 ****
          v = v->mdfd_chain;
          pfree(ov);
      }
- #else
-     if (v->mdfd_vfd >= 0)
-         FileClose(v->mdfd_vfd);
-     pfree(v);
- #endif
  }

  /*
--- 544,549 ----
***************
*** 583,594 ****

      v = _mdfd_getseg(reln, blocknum, false, EXTENSION_FAIL);

- #ifdef USE_SEGMENTED_FILES
      seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
      Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- #else
-     seekpos = (off_t) BLCKSZ * blocknum;
- #endif

      if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
          ereport(ERROR,
--- 558,565 ----
***************
*** 653,664 ****

      v = _mdfd_getseg(reln, blocknum, isTemp, EXTENSION_FAIL);

- #ifdef USE_SEGMENTED_FILES
      seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
      Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- #else
-     seekpos = (off_t) BLCKSZ * blocknum;
- #endif

      if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
          ereport(ERROR,
--- 624,631 ----
***************
*** 708,714 ****
  {
      MdfdVec    *v = mdopen(reln, EXTENSION_FAIL);

- #ifdef USE_SEGMENTED_FILES
      BlockNumber nblocks;
      BlockNumber segno = 0;

--- 675,680 ----
***************
*** 764,772 ****

          v = v->mdfd_chain;
      }
- #else
-     return _mdnblocks(reln, v);
- #endif
  }

  /*
--- 730,735 ----
***************
*** 777,786 ****
  {
      MdfdVec    *v;
      BlockNumber curnblk;
-
- #ifdef USE_SEGMENTED_FILES
      BlockNumber priorblocks;
- #endif

      /*
       * NOTE: mdnblocks makes sure we have opened all active segments, so that
--- 740,746 ----
***************
*** 804,810 ****

      v = mdopen(reln, EXTENSION_FAIL);

- #ifdef USE_SEGMENTED_FILES
      priorblocks = 0;
      while (v != NULL)
      {
--- 764,769 ----
***************
*** 866,884 ****
          }
          priorblocks += RELSEG_SIZE;
      }
- #else
-     /* For unsegmented files, it's a lot easier */
-     if (FileTruncate(v->mdfd_vfd, (off_t) nblocks * BLCKSZ) < 0)
-         ereport(ERROR,
-                 (errcode_for_file_access(),
-               errmsg("could not truncate relation %u/%u/%u to %u blocks: %m",
-                      reln->smgr_rnode.spcNode,
-                      reln->smgr_rnode.dbNode,
-                      reln->smgr_rnode.relNode,
-                      nblocks)));
-     if (!isTemp)
-         register_dirty_segment(reln, v);
- #endif
  }

  /*
--- 825,830 ----
***************
*** 901,907 ****

      v = mdopen(reln, EXTENSION_FAIL);

- #ifdef USE_SEGMENTED_FILES
      while (v != NULL)
      {
          if (FileSync(v->mdfd_vfd) < 0)
--- 847,852 ----
***************
*** 914,928 ****
                         reln->smgr_rnode.relNode)));
          v = v->mdfd_chain;
      }
- #else
-     if (FileSync(v->mdfd_vfd) < 0)
-         ereport(ERROR,
-                 (errcode_for_file_access(),
-                  errmsg("could not fsync relation %u/%u/%u: %m",
-                         reln->smgr_rnode.spcNode,
-                         reln->smgr_rnode.dbNode,
-                         reln->smgr_rnode.relNode)));
- #endif
  }

  /*
--- 859,864 ----
***************
*** 1476,1483 ****
      return (MdfdVec *) MemoryContextAlloc(MdCxt, sizeof(MdfdVec));
  }

- #ifdef USE_SEGMENTED_FILES
-
  /*
   * Open the specified segment of the relation,
   * and make a MdfdVec object for it.  Returns NULL on failure.
--- 1412,1417 ----
***************
*** 1522,1528 ****
      /* all done */
      return v;
  }
- #endif   /* USE_SEGMENTED_FILES */

  /*
   *    _mdfd_getseg() -- Find the segment of the relation holding the
--- 1456,1461 ----
***************
*** 1538,1544 ****
  {
      MdfdVec    *v = mdopen(reln, behavior);

- #ifdef USE_SEGMENTED_FILES
      BlockNumber targetseg;
      BlockNumber nextsegno;

--- 1471,1476 ----
***************
*** 1600,1607 ****
          }
          v = v->mdfd_chain;
      }
- #endif
-
      return v;
  }

--- 1532,1537 ----
Index: src/include/pg_config_manual.h
===================================================================
RCS file: /zfs_data/cvs_pgsql/cvsroot/pgsql/src/include/pg_config_manual.h,v
retrieving revision 1.31
diff -c -r1.31 pg_config_manual.h
*** src/include/pg_config_manual.h    11 Apr 2008 22:54:23 -0000    1.31
--- src/include/pg_config_manual.h    21 Apr 2008 15:17:07 -0000
***************
*** 11,57 ****
   */

  /*
-  * Size of a disk block --- this also limits the size of a tuple.  You
-  * can set it bigger if you need bigger tuples (although TOAST should
-  * reduce the need to have large tuples, since fields can be spread
-  * across multiple tuples).
-  *
-  * BLCKSZ must be a power of 2.  The maximum possible value of BLCKSZ
-  * is currently 2^15 (32768).  This is determined by the 15-bit widths
-  * of the lp_off and lp_len fields in ItemIdData (see
-  * include/storage/itemid.h).
-  *
-  * Changing BLCKSZ requires an initdb.
-  */
- #define BLCKSZ    8192
-
- /*
-  * RELSEG_SIZE is the maximum number of blocks allowed in one disk
-  * file when USE_SEGMENTED_FILES is defined.  Thus, the maximum size
-  * of a single file is RELSEG_SIZE * BLCKSZ; relations bigger than that
-  * are divided into multiple files.
-  *
-  * RELSEG_SIZE * BLCKSZ must be less than your OS' limit on file size.
-  * This is often 2 GB or 4GB in a 32-bit operating system, unless you
-  * have large file support enabled.  By default, we make the limit 1
-  * GB to avoid any possible integer-overflow problems within the OS.
-  * A limit smaller than necessary only means we divide a large
-  * relation into more chunks than necessary, so it seems best to err
-  * in the direction of a small limit.  (Besides, a power-of-2 value
-  * saves a few cycles in md.c.)
-  *
-  * When not using segmented files, RELSEG_SIZE is set to zero so that
-  * this behavior can be distinguished in pg_control.
-  *
-  * Changing RELSEG_SIZE requires an initdb.
-  */
- #ifdef USE_SEGMENTED_FILES
- #define RELSEG_SIZE (0x40000000 / BLCKSZ)
- #else
- #define RELSEG_SIZE 0
- #endif
-
- /*
   * Size of a WAL file block.  This need have no particular relation to BLCKSZ.
   * XLOG_BLCKSZ must be a power of 2, and if your system supports O_DIRECT I/O,
   * XLOG_BLCKSZ must be a multiple of the alignment requirement for direct-I/O
--- 11,16 ----
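
For readers following the md.c hunks, the segment arithmetic that becomes unconditional can be sketched in C as follows. The seek position formula is taken from the patch. The segment number division is how md.c picks the target segment, although that line does not appear in the hunks above. The struct, the function name, the EXAMPLE_ constants, and the use of int64_t in place of off_t are assumptions made so the sketch stands alone.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;       /* PostgreSQL block numbers are 32-bit */

#define EXAMPLE_BLCKSZ      8192    /* default block size */
#define EXAMPLE_RELSEG_SIZE 131072  /* blocks per 1GB segment at 8192 bytes */

/* Hypothetical result type: which segment file, and where inside it. */
typedef struct SegmentLocation
{
    BlockNumber segno;      /* segment number, from 0 */
    int64_t     seekpos;    /* byte offset within that segment file */
} SegmentLocation;

/* Translate a relation-wide block number into segment and offset. */
static SegmentLocation
locate_block(BlockNumber blocknum)
{
    SegmentLocation loc;

    loc.segno = blocknum / ((BlockNumber) EXAMPLE_RELSEG_SIZE);
    loc.seekpos = (int64_t) EXAMPLE_BLCKSZ *
        (blocknum % ((BlockNumber) EXAMPLE_RELSEG_SIZE));

    /* Same invariant that md.c asserts after computing seekpos. */
    assert(loc.seekpos < (int64_t) EXAMPLE_BLCKSZ * EXAMPLE_RELSEG_SIZE);
    return loc;
}
```

With these values, block 131071 is the last block of segment 0, and block 131072 is the first block of segment 1 at offset 0.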
