Thread: Re: [HACKERS] O_DIRECT for WAL writes

Re: [HACKERS] O_DIRECT for WAL writes

From
Bruce Momjian
Date:
I have modified and attached your patch for your review.  I didn't see
any value to adding new fsync_method values because, to me, O_DIRECT is
basically just like O_SYNC except it doesn't keep a copy of the buffer
in the kernel cache.  If you are doing fsync(), I don't see how O_DIRECT
makes any sense because O_DIRECT is writing to disk on every write, and
then what is the fsync() actually doing.  This might explain why your
fsync/direct and open/direct performance numbers are almost identical.
Basically, if you are going to use O_DIRECT, why not use open_sync.

What I did was to add O_DIRECT unconditionally for all uses of O_SYNC
and O_DSYNC, so it is automatically used in those cases.  And of course,
if your operating system doens't support O_DIRECT, it isn't used.

With your posted performance numbers, perhaps we should favor
fsync_method O_SYNC on platforms that have O_DIRECT even if we don't
support OPEN_DATASYNC, but I bet most platforms that have O_DIRECT also
have O_DATASYNC.  Perhaps some folks can run testes once the patch is
applied.

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > Yeah, this is about what I was afraid of: if you're actually fsyncing
> > then you get at best one commit per disk revolution, and the negotiation
> > with the OS is down in the noise.
>
> If we disable writeback-cache and use open_sync, the per-page writing
> behavior in WAL module will show up as bad result. O_DIRECT is similar
> to O_DSYNC (at least on linux), so that the benefit of it will disappear
> behind the slow disk revolution.
>
> In the current source, WAL is written as:
>     for (i = 0; i < N; i++) { write(&buffers[i], BLCKSZ); }
> Is this intentional? Can we rewrite it as follows?
>    write(&buffers[0], N * BLCKSZ);
>
> In order to achieve it, I wrote a 'gather-write' patch (xlog.gw.diff).
> Aside from this, I'll also send the fixed direct io patch (xlog.dio.diff).
> These two patches are independent, so they can be applied either or both.
>
>
> I tested them on my machine and the results as follows. It shows that
> direct-io and gather-write is the best choice when writeback-cache is off.
> Are these two patches worth trying if they are used together?
>
>
>             | writeback | fsync= | fdata | open_ | fsync_ | open_
> patch       | cache     |  false |  sync |  sync | direct | direct
> ------------+-----------+--------+-------+-------+--------+---------
> direct io   | off       |  124.2 | 105.7 |  48.3 |   48.3 |  48.2
> direct io   | on        |  129.1 | 112.3 | 114.1 |  142.9 | 144.5
> gather-write| off       |  124.3 | 108.7 | 105.4 |  (N/A) | (N/A)
> both        | off       |  131.5 | 115.5 | 114.4 |  145.4 | 145.2
>
> - 20runs * pgbench -s 100 -c 50 -t 200
>    - with tuning (wal_buffers=64, commit_delay=500, checkpoint_segments=8)
> - using 2 ATA disks:
>    - hda(reiserfs) includes system and wal.
>    - hdc(jfs) includes database files. writeback-cache is always on.
>
> ---
> ITAGAKI Takahiro
> NTT Cyber Space Laboratories
>

[ Attachment, skipping... ]

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faq

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.210
diff -c -c -r1.210 xlog.c
*** src/backend/access/transam/xlog.c    23 Jul 2005 15:31:16 -0000    1.210
--- src/backend/access/transam/xlog.c    23 Jul 2005 16:09:12 -0000
***************
*** 48,77 ****


  /*
   * This chunk of hackery attempts to determine which file sync methods
   * are available on the current platform, and to choose an appropriate
   * default method.    We assume that fsync() is always available, and that
   * configure determined whether fdatasync() is.
   */
  #if defined(O_SYNC)
! #define OPEN_SYNC_FLAG            O_SYNC
  #else
  #if defined(O_FSYNC)
! #define OPEN_SYNC_FLAG            O_FSYNC
  #endif
  #endif

  #if defined(O_DSYNC)
  #if defined(OPEN_SYNC_FLAG)
! #if O_DSYNC != OPEN_SYNC_FLAG
! #define OPEN_DATASYNC_FLAG        O_DSYNC
  #endif
  #else /* !defined(OPEN_SYNC_FLAG) */
  /* Win32 only has O_DSYNC */
! #define OPEN_DATASYNC_FLAG        O_DSYNC
  #endif
  #endif

  #if defined(OPEN_DATASYNC_FLAG)
  #define DEFAULT_SYNC_METHOD_STR    "open_datasync"
  #define DEFAULT_SYNC_METHOD        SYNC_METHOD_OPEN
--- 48,114 ----


  /*
+  *    Becauase O_DIRECT bypasses the kernel buffers, and because we never
+  *    read those buffers except during crash recovery, it seems like
+  *    a win to use it in all cases where we sync on each write().
+  */
+ #ifdef O_DIRECT
+ #define PG_O_DIRECT                O_DIRECT
+ #else
+ #define PG_O_DIRECT                0
+ #endif
+
+ /*
   * This chunk of hackery attempts to determine which file sync methods
   * are available on the current platform, and to choose an appropriate
   * default method.    We assume that fsync() is always available, and that
   * configure determined whether fdatasync() is.
   */
  #if defined(O_SYNC)
! #define CMP_OPEN_SYNC_FLAG        O_SYNC
  #else
  #if defined(O_FSYNC)
! #define CMP_OPEN_SYNC_FLAG        O_FSYNC
  #endif
  #endif
+ #define OPEN_SYNC_FLAG            (CMP_OPEN_SYNC_FLAG | PG_O_DIRECT)

  #if defined(O_DSYNC)
  #if defined(OPEN_SYNC_FLAG)
! #if O_DSYNC != CMP_OPEN_SYNC_FLAG
! #define OPEN_DATASYNC_FLAG        (O_DSYNC | PG_O_DIRECT)
  #endif
  #else /* !defined(OPEN_SYNC_FLAG) */
  /* Win32 only has O_DSYNC */
! #define OPEN_DATASYNC_FLAG        (O_DSYNC | PG_O_DIRECT)
  #endif
  #endif

+ /*
+  * Limitation of buffer-alignment for direct io depend on OS and filesystem,
+  * but BLCKSZ is assumed to be enough for it.
+  */
+ #ifdef O_DIRECT
+ #define ALIGNOF_XLOG_BUFFER        BLCKSZ
+ #else
+ #define ALIGNOF_XLOG_BUFFER        MAXIMUM_ALIGNOF
+ #endif
+
+ /*
+  * Switch the alignment routine because ShmemAlloc() returns a max-aligned
+  * buffer and ALIGNOF_XLOG_BUFFER may be greater than MAXIMUM_ALIGNOF.
+  */
+ #if ALIGNOF_XLOG_BUFFER <= MAXIMUM_ALIGNOF
+ #define XLOG_BUFFER_ALIGN(LEN)    MAXALIGN((LEN))
+ #else
+ #define XLOG_BUFFER_ALIGN(LEN)    ((LEN) + (ALIGNOF_XLOG_BUFFER))
+ #endif
+ /* assume sizeof(ptrdiff_t) == sizeof(void*) */
+ #define POINTERALIGN(ALIGNVAL,PTR)    \
+     ((char *)(((ptrdiff_t) (PTR) + (ALIGNVAL-1)) & ~((ptrdiff_t) (ALIGNVAL-1))))
+ #define XLOG_BUFFER_POINTERALIGN(PTR)    \
+     POINTERALIGN((ALIGNOF_XLOG_BUFFER), (PTR))
+
  #if defined(OPEN_DATASYNC_FLAG)
  #define DEFAULT_SYNC_METHOD_STR    "open_datasync"
  #define DEFAULT_SYNC_METHOD        SYNC_METHOD_OPEN
***************
*** 469,474 ****
--- 506,522 ----
  static char *str_time(time_t tnow);
  static void issue_xlog_fsync(void);

+ /* XLog gather-write staffs */
+ typedef struct XLogPages
+ {
+     char    *head;        /* Head of first page */
+     int         size;        /* Total bytes of pages == count(pages) * BLCKSZ */
+     int         offset;    /* Offset in xlog segment file  */
+ } XLogPages;
+ static void XLogPageReset(XLogPages *pages);
+ static void XLogPageWrite(XLogPages *pages, int index);
+ static void XLogPageFlush(XLogPages *pages, int index);
+
  #ifdef WAL_DEBUG
  static void xlog_outrec(char *buf, XLogRecord *record);
  #endif
***************
*** 1245,1253 ****
  XLogWrite(XLogwrtRqst WriteRqst)
  {
      XLogCtlWrite *Write = &XLogCtl->Write;
-     char       *from;
      bool        ispartialpage;
      bool        use_existent;

      /* We should always be inside a critical section here */
      Assert(CritSectionCount > 0);
--- 1293,1302 ----
  XLogWrite(XLogwrtRqst WriteRqst)
  {
      XLogCtlWrite *Write = &XLogCtl->Write;
      bool        ispartialpage;
      bool        use_existent;
+     int            currentIndex = Write->curridx;
+     XLogPages    pages;

      /* We should always be inside a critical section here */
      Assert(CritSectionCount > 0);
***************
*** 1258,1263 ****
--- 1307,1314 ----
       */
      LogwrtResult = Write->LogwrtResult;

+     XLogPageReset(&pages);
+
      while (XLByteLT(LogwrtResult.Write, WriteRqst.Write))
      {
          /*
***************
*** 1266,1279 ****
           * end of the last page that's been initialized by
           * AdvanceXLInsertBuffer.
           */
!         if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[Write->curridx]))
              elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
                   LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
!                  XLogCtl->xlblocks[Write->curridx].xlogid,
!                  XLogCtl->xlblocks[Write->curridx].xrecoff);

          /* Advance LogwrtResult.Write to end of current buffer page */
!         LogwrtResult.Write = XLogCtl->xlblocks[Write->curridx];
          ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);

          if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
--- 1317,1330 ----
           * end of the last page that's been initialized by
           * AdvanceXLInsertBuffer.
           */
!         if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[currentIndex]))
              elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
                   LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
!                  XLogCtl->xlblocks[currentIndex].xlogid,
!                  XLogCtl->xlblocks[currentIndex].xrecoff);

          /* Advance LogwrtResult.Write to end of current buffer page */
!         LogwrtResult.Write = XLogCtl->xlblocks[currentIndex];
          ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);

          if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
***************
*** 1281,1286 ****
--- 1332,1338 ----
              /*
               * Switch to new logfile segment.
               */
+             XLogPageFlush(&pages, currentIndex);
              if (openLogFile >= 0)
              {
                  if (close(openLogFile))
***************
*** 1354,1384 ****
              openLogOff = 0;
          }

!         /* Need to seek in the file? */
!         if (openLogOff != (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize)
!         {
!             openLogOff = (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize;
!             if (lseek(openLogFile, (off_t) openLogOff, SEEK_SET) < 0)
!                 ereport(PANIC,
!                         (errcode_for_file_access(),
!                          errmsg("could not seek in log file %u, segment %u to offset %u: %m",
!                                 openLogId, openLogSeg, openLogOff)));
!         }
!
!         /* OK to write the page */
!         from = XLogCtl->pages + Write->curridx * BLCKSZ;
!         errno = 0;
!         if (write(openLogFile, from, BLCKSZ) != BLCKSZ)
!         {
!             /* if write didn't set errno, assume problem is no disk space */
!             if (errno == 0)
!                 errno = ENOSPC;
!             ereport(PANIC,
!                     (errcode_for_file_access(),
!                      errmsg("could not write to log file %u, segment %u at offset %u: %m",
!                             openLogId, openLogSeg, openLogOff)));
!         }
!         openLogOff += BLCKSZ;

          /*
           * If we just wrote the whole last page of a logfile segment,
--- 1406,1413 ----
              openLogOff = 0;
          }

!         /* Add a page to buffer */
!         XLogPageWrite(&pages, currentIndex);

          /*
           * If we just wrote the whole last page of a logfile segment,
***************
*** 1390,1397 ****
           * This is also the right place to notify the Archiver that the
           * segment is ready to copy to archival storage.
           */
!         if (openLogOff >= XLogSegSize && !ispartialpage)
          {
              issue_xlog_fsync();
              LogwrtResult.Flush = LogwrtResult.Write;    /* end of current page */

--- 1419,1427 ----
           * This is also the right place to notify the Archiver that the
           * segment is ready to copy to archival storage.
           */
!         if (openLogOff + pages.size >= XLogSegSize && !ispartialpage)
          {
+             XLogPageFlush(&pages, currentIndex);
              issue_xlog_fsync();
              LogwrtResult.Flush = LogwrtResult.Write;    /* end of current page */

***************
*** 1405,1412 ****
              LogwrtResult.Write = WriteRqst.Write;
              break;
          }
!         Write->curridx = NextBufIdx(Write->curridx);
      }

      /*
       * If asked to flush, do so
--- 1435,1443 ----
              LogwrtResult.Write = WriteRqst.Write;
              break;
          }
!         currentIndex = NextBufIdx(currentIndex);
      }
+     XLogPageFlush(&pages, currentIndex);

      /*
       * If asked to flush, do so
***************
*** 3584,3590 ****
      if (XLOGbuffers < MinXLOGbuffers)
          XLOGbuffers = MinXLOGbuffers;

!     return MAXALIGN(sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers)
          + BLCKSZ * XLOGbuffers +
          MAXALIGN(sizeof(ControlFileData));
  }
--- 3615,3621 ----
      if (XLOGbuffers < MinXLOGbuffers)
          XLOGbuffers = MinXLOGbuffers;

!     return XLOG_BUFFER_ALIGN(sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers)
          + BLCKSZ * XLOGbuffers +
          MAXALIGN(sizeof(ControlFileData));
  }
***************
*** 3601,3607 ****

      XLogCtl = (XLogCtlData *)
          ShmemInitStruct("XLOG Ctl",
!                         MAXALIGN(sizeof(XLogCtlData) +
                                   sizeof(XLogRecPtr) * XLOGbuffers)
                          + BLCKSZ * XLOGbuffers,
                          &foundXLog);
--- 3632,3638 ----

      XLogCtl = (XLogCtlData *)
          ShmemInitStruct("XLOG Ctl",
!                         XLOG_BUFFER_ALIGN(sizeof(XLogCtlData) +
                                   sizeof(XLogRecPtr) * XLOGbuffers)
                          + BLCKSZ * XLOGbuffers,
                          &foundXLog);
***************
*** 3630,3638 ****
       * Here, on the other hand, we must MAXALIGN to ensure the page
       * buffers have worst-case alignment.
       */
!     XLogCtl->pages =
!         ((char *) XLogCtl) + MAXALIGN(sizeof(XLogCtlData) +
!                                       sizeof(XLogRecPtr) * XLOGbuffers);
      memset(XLogCtl->pages, 0, BLCKSZ * XLOGbuffers);

      /*
--- 3661,3669 ----
       * Here, on the other hand, we must MAXALIGN to ensure the page
       * buffers have worst-case alignment.
       */
!     XLogCtl->pages = XLOG_BUFFER_POINTERALIGN(
!         ((char *) XLogCtl)
!         + sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers);
      memset(XLogCtl->pages, 0, BLCKSZ * XLOGbuffers);

      /*
***************
*** 3662,3667 ****
--- 3693,3699 ----
  {
      CheckPoint    checkPoint;
      char       *buffer;
+     char        buffer0[BLCKSZ + ALIGNOF_XLOG_BUFFER];
      XLogPageHeader page;
      XLogLongPageHeader longpage;
      XLogRecord *record;
***************
*** 3690,3697 ****
      /* First timeline ID is always 1 */
      ThisTimeLineID = 1;

!     /* Use malloc() to ensure buffer is MAXALIGNED */
!     buffer = (char *) malloc(BLCKSZ);
      page = (XLogPageHeader) buffer;
      memset(buffer, 0, BLCKSZ);

--- 3722,3728 ----
      /* First timeline ID is always 1 */
      ThisTimeLineID = 1;

!     buffer = XLOG_BUFFER_POINTERALIGN(buffer0);
      page = (XLogPageHeader) buffer;
      memset(buffer, 0, BLCKSZ);

***************
*** 5837,5839 ****
--- 5868,5938 ----
                       errmsg("could not remove file \"%s\": %m",
                              BACKUP_LABEL_FILE)));
  }
+
+
+ /* XLog gather-write staffs */
+
+ static void
+ XLogPageReset(XLogPages *pages)
+ {
+     memset(pages, 0, sizeof(*pages));
+ }
+
+ static void
+ XLogPageWrite(XLogPages *pages, int index)
+ {
+     char *page = XLogCtl->pages + index * BLCKSZ;
+     int size = BLCKSZ;
+     int offset = (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize;
+
+     if (pages->head + pages->size == page
+         && pages->offset + pages->size == offset)
+     {    /* Pages are continuous. Append new page. */
+         pages->size += size;
+     }
+     else
+     {    /* Pages are not continuous. Flush and clear. */
+         XLogPageFlush(pages, PrevBufIdx(index));
+         pages->head = page;
+         pages->size = size;
+         pages->offset = offset;
+     }
+ }
+
+ static void
+ XLogPageFlush(XLogPages *pages, int index)
+ {
+     if (!pages->head)
+     {    /* No needs to write pages. */
+         XLogCtl->Write.curridx = index;
+         return;
+     }
+
+     /* Need to seek in the file? */
+     if (openLogOff != pages->offset)
+     {
+         openLogOff = pages->offset;
+         if (lseek(openLogFile, (off_t) openLogOff, SEEK_SET) < 0)
+             ereport(PANIC,
+                     (errcode_for_file_access(),
+                      errmsg("could not seek in log file %u, segment %u to offset %u: %m",
+                             openLogId, openLogSeg, openLogOff)));
+     }
+
+     /* OK to write the page */
+     errno = 0;
+     if (write(openLogFile, pages->head, pages->size) != pages->size)
+     {
+         /* if write didn't set errno, assume problem is no disk space */
+         if (errno == 0)
+             errno = ENOSPC;
+         ereport(PANIC,
+                 (errcode_for_file_access(),
+                  errmsg("could not write to log file %u, segment %u at offset %u: %m",
+                         openLogId, openLogSeg, openLogOff)));
+     }
+
+     openLogOff += pages->size;
+     XLogCtl->Write.curridx = index;
+     XLogPageReset(pages);
+ }

Re: [HACKERS] O_DIRECT for WAL writes

From
ITAGAKI Takahiro
Date:
Thanks for reviewing!
But the patch does not work on HEAD, because of the changes in BootStrapXLOG().
I send the patch with a fix for it.


Bruce Momjian <pgman@candle.pha.pa.us> wrote:

> If you are doing fsync(), I don't see how O_DIRECT
> makes any sense because O_DIRECT is writing to disk on every write, and
> then what is the fsync() actually doing.

It's depends on OSes. Manpage of Linux says,
  http://linux.com.hk/PenguinWeb/manpage.jsp?name=open§ion=2
    File I/O is done directly to/from user space buffers. The I/O is
    synchronous, i.e., at the completion of the read(2) or write(2) system
    call, data is **guaranteed to have been transferred**.
But manpage of FreeBSD says,
  http://www.manpages.info/freebsd/open.2.html
    O_DIRECT may be used to minimize or eliminate the cache effects of read-
    ing and writing.  The system will attempt to avoid caching the data you
    read or write.  If it cannot avoid caching the data,
    it will **minimize the impact the data has on the cache**.

In my understanding, the completion of write() with O_DIRECT does not always
assure an actual write. So there may be difference between O_DIRECT+O_SYNC
and O_DIRECT+fsync(), but I think that is not very often.


> What I did was to add O_DIRECT unconditionally for all uses of O_SYNC
> and O_DSYNC, so it is automatically used in those cases.  And of course,
> if your operating system doens't support O_DIRECT, it isn't used.

I agree with your way, where O_DIRECT is automatically used.
I bet the combination of O_DIRECT and O_SYNC is always better than
the case O_SYNC only used.

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories


Attachment

Re: [HACKERS] O_DIRECT for WAL writes

From
Bruce Momjian
Date:
ITAGAKI Takahiro wrote:
> Thanks for reviewing!
> But the patch does not work on HEAD, because of the changes in BootStrapXLOG().
> I send the patch with a fix for it.

Thanks.

> > If you are doing fsync(), I don't see how O_DIRECT
> > makes any sense because O_DIRECT is writing to disk on every write, and
> > then what is the fsync() actually doing.
>
> It's depends on OSes. Manpage of Linux says,
>   http://linux.com.hk/PenguinWeb/manpage.jsp?name=open§ion=2
>     File I/O is done directly to/from user space buffers. The I/O is
>     synchronous, i.e., at the completion of the read(2) or write(2) system
>     call, data is **guaranteed to have been transferred**.
> But manpage of FreeBSD says,
>   http://www.manpages.info/freebsd/open.2.html
>     O_DIRECT may be used to minimize or eliminate the cache effects of read-
>     ing and writing.  The system will attempt to avoid caching the data you
>     read or write.  If it cannot avoid caching the data,
>     it will **minimize the impact the data has on the cache**.
>
> In my understanding, the completion of write() with O_DIRECT does not always
> assure an actual write. So there may be difference between O_DIRECT+O_SYNC
> and O_DIRECT+fsync(), but I think that is not very often.

Yes, I do remember that.  I know we _need_ fsync when using O_DIRECT,
but the downside of O_DIRECT (force every write to disk) is the same as
O_SYNC, so it seems if we are using O_DIRECT, we might as well use
O_SYNC too and skip the fsync().

I will add a comment mentioning this.

> > What I did was to add O_DIRECT unconditionally for all uses of O_SYNC
> > and O_DSYNC, so it is automatically used in those cases.  And of course,
> > if your operating system doens't support O_DIRECT, it isn't used.
>
> I agree with your way, where O_DIRECT is automatically used.
> I bet the combination of O_DIRECT and O_SYNC is always better than
> the case O_SYNC only used.

OK.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: [HACKERS] O_DIRECT for WAL writes

From
Bruce Momjian
Date:
Patch applied.  Thanks.

---------------------------------------------------------------------------


ITAGAKI Takahiro wrote:
> Thanks for reviewing!
> But the patch does not work on HEAD, because of the changes in BootStrapXLOG().
> I send the patch with a fix for it.
>
>
> Bruce Momjian <pgman@candle.pha.pa.us> wrote:
>
> > If you are doing fsync(), I don't see how O_DIRECT
> > makes any sense because O_DIRECT is writing to disk on every write, and
> > then what is the fsync() actually doing.
>
> It's depends on OSes. Manpage of Linux says,
>   http://linux.com.hk/PenguinWeb/manpage.jsp?name=open§ion=2
>     File I/O is done directly to/from user space buffers. The I/O is
>     synchronous, i.e., at the completion of the read(2) or write(2) system
>     call, data is **guaranteed to have been transferred**.
> But manpage of FreeBSD says,
>   http://www.manpages.info/freebsd/open.2.html
>     O_DIRECT may be used to minimize or eliminate the cache effects of read-
>     ing and writing.  The system will attempt to avoid caching the data you
>     read or write.  If it cannot avoid caching the data,
>     it will **minimize the impact the data has on the cache**.
>
> In my understanding, the completion of write() with O_DIRECT does not always
> assure an actual write. So there may be difference between O_DIRECT+O_SYNC
> and O_DIRECT+fsync(), but I think that is not very often.
>
>
> > What I did was to add O_DIRECT unconditionally for all uses of O_SYNC
> > and O_DSYNC, so it is automatically used in those cases.  And of course,
> > if your operating system doens't support O_DIRECT, it isn't used.
>
> I agree with your way, where O_DIRECT is automatically used.
> I bet the combination of O_DIRECT and O_SYNC is always better than
> the case O_SYNC only used.
>
> ---
> ITAGAKI Takahiro
> NTT Cyber Space Laboratories
>

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>        subscribe-nomail command to majordomo@postgresql.org so that your
>        message can get through to the mailing list cleanly

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: [HACKERS] O_DIRECT for WAL writes

From
Mark Wong
Date:
Here are comments that Daniel McNeil made earlier, which I've neglected
to forward earlier.  I've cc'ed him and Mark Havercamp, which some of
you got to meet the other day.

Mark

-----

With O_DIRECT on Linux, when the write() returns the i/o has been
transferred to the disk.

Normally, this i/o will be DMAed directly from user-space to the
device.  The current exception is when doing an O_DIRECT write to a
hole in a file.  (If an program does a truncate() or lseek()/write()
that makes a file larger, the file system does not allocated space
between the old end of file and the new end of file.)  An O_DIRECT
write to hole like this, requires the file system to allocated space,
but there is a race condition between the O_DIRECT write doing the
allocate and then write to initialized the newly allocated data and
any other process that attempts a buffered (page cache) read of the
same area in the file -- it was possible for the read to data from
the allocated region before the O_DIRECT write().  The fix in Linux
is for the O_DIRECT write() to fall back to use buffer i/o to do
the write() and flush the data from the page cache to the disk.

A write() with O_DIRECT only means the data has been transferred to
the disk.   Depending on the file system and mount options, it does
not mean the meta data for the file has been written to disk (see
fsync man page).  Fsync() will guarantee the data and metadata have
been written to disk.

Lastly, if a disk has a write back cache, an O_DIRECT write() does not
guarantee that the disk has put the data on the physical media.
I think some of the journal file systems now support i/o barriers
on commit which will flush the disk write back cache.  (I'm still
looking the kernel code to see how this is done).

Conclusion:

O_DIRECT + fsync() can make sense.  It avoids the copying of data
to the page cache before being written and will also guarantee
that the file's metadata is also written to disk.  It also
prevents the page cache from filling up with write data that
will never be read (I assume it is only read if a recovery
is necessary - which should be rare).  It can also
helps disks with write back cache when using the journaling
file system that use i/o barriers.  You would want to use
large writes, since the kernel page cache won't be writing
multiple pages for you.

I need to look at the kernel code more to comment on O_DIRECT with
O_SYNC.

Questions:

Does the database transaction logger preallocate the log file?

Does the logger care about the order in which each write hits the disk?

Now someone else can comment on my comments.

Daniel

Re: [HACKERS] O_DIRECT for WAL writes

From
Bruce Momjian
Date:
Mark Wong wrote:
> O_DIRECT + fsync() can make sense.  It avoids the copying of data
> to the page cache before being written and will also guarantee
> that the file's metadata is also written to disk.  It also
> prevents the page cache from filling up with write data that
> will never be read (I assume it is only read if a recovery
> is necessary - which should be rare).  It can also
> helps disks with write back cache when using the journaling
> file system that use i/o barriers.  You would want to use
> large writes, since the kernel page cache won't be writing
> multiple pages for you.

Right, but it seems O_DIRECT is pretty much the same as O_DIRECT with
O_DSYNC because the data is always written to disk on write().  Our
logic is that there is nothing for fdatasync to do in most cases after
using O_DIRECT, so the O_DIRECT/fdatasync() combination doesn't make
sense.

And FreeBSD, and perhaps others, need O_SYNC or fdatasync with O_DIRECT
because O_DIRECT doesn't force stuff to disk in all cases.

> I need to look at the kernel code more to comment on O_DIRECT with
> O_SYNC.
>
> Questions:
>
> Does the database transaction logger preallocate the log file?

Yes.

> Does the logger care about the order in which each write hits the disk?

Not really.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: [HACKERS] O_DIRECT for WAL writes

From
Mark Wong
Date:
Ok, I finally got a couple of tests done against CVS from Aug 3, 2005.
I'm not sure if I'm showing anything insightful though.  I've learned
that fdatasync and O_DSYNC are simply fsync and O_SYNC respectively on
Linux, which you guys may have already known.  There appears to be a
fair performance decrease in using open_sync.  Just to double check, am
I correct in understanding only open_sync uses O_DIRECT?

fdatasync
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-015/38/
5462 notpm

open_sync
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-015/40/
4860 notpm

Mark

Re: [HACKERS] O_DIRECT for WAL writes

From
Bruce Momjian
Date:
Mark Wong wrote:
> Ok, I finally got a couple of tests done against CVS from Aug 3, 2005.
> I'm not sure if I'm showing anything insightful though.  I've learned
> that fdatasync and O_DSYNC are simply fsync and O_SYNC respectively on
> Linux, which you guys may have already known.  There appears to be a

That is not what we thought for Linux, but many other OS's behave that
way.

> fair performance decrease in using open_sync.  Just to double check, am
> I correct in understanding only open_sync uses O_DIRECT?

Right.

> fdatasync
> http://www.testing.osdl.org/projects/dbt2dev/results/dev4-015/38/
> 5462 notpm
>
> open_sync
> http://www.testing.osdl.org/projects/dbt2dev/results/dev4-015/40/
> 4860 notpm

Right now open_sync is our last choice, which seems to still be valid
for Linux, at least.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073