Re: [PERFORM] Direct I/O issues - Mailing list pgsql-patches

From Bruce Momjian
Subject Re: [PERFORM] Direct I/O issues
Msg-id 200611231641.kANGfae01113@momjian.us
List pgsql-patches
I have applied your test_fsync patch for 8.2.  Thanks.

---------------------------------------------------------------------------

Greg Smith wrote:
> I've been trying to optimize a Linux system where benchmarking suggests
> large performance differences between the various wal_sync_method options
> (with o_sync being the big winner).  I started that by using
> src/tools/fsync/test_fsync to get an idea what I was dealing with (and to
> spot which drives had write caching turned on).  Since those results
> didn't match what I was seeing in the benchmarks, I've been browsing the
> backend source to figure out why.  I noticed test_fsync appears to be,
> ahem, out of sync with what the engine is doing.
>
> It looks like V8.1 introduced O_DIRECT writes to the WAL, determined at
> compile time by a series of preprocessor tests in
> src/backend/access/transam/xlog.c.  When O_DIRECT is available,
> O_SYNC/O_FSYNC/O_DSYNC writes use it.  test_fsync doesn't do that.
>
> I moved the new code (in 8.2 beta 3, lines 61-92 in xlog.c) into
> test_fsync; all the flags had the same name so it dropped right in.  You
> can get the version I made at http://www.westnet.com/~gsmith/test_fsync.c
> (fixed a compiler warning, too)
>
> The results I get now look fishy.  I'm not sure if I screwed up a step, or
> if I'm seeing a real problem.  The system here is running RedHat Linux,
> RHEL ES 4.0 kernel 2.6.9, and the disk I'm writing to is a standard
> 7200RPM IDE drive.  I turned off write caching with hdparm -W 0
>
> Here's an excerpt from the stock test_fsync:
>
> Compare one o_sync write to two:
>          one 16k o_sync write     8.717944
>          two 8k o_sync writes    17.501980
>
> Compare file sync methods with 2 8k writes:
>          (o_dsync unavailable)
>          open o_sync, write      17.018495
>          write, fdatasync         8.842473
>          write, fsync,            8.809117
>
> And here's the version I tried to modify to include O_DIRECT support:
>
> Compare one o_sync write to two:
>          one 16k o_sync write     0.004995
>          two 8k o_sync writes     0.003027
>
> Compare file sync methods with 2 8k writes:
>          (o_dsync unavailable)
>          open o_sync, write       0.004978
>          write, fdatasync         8.845498
>          write, fsync,            8.834037
>
> Obviously the o_sync writes aren't waiting for the disk.  Is this a
> problem with O_DIRECT under Linux?  Or is my code just not correctly
> testing this behavior?
>
> Just as a sanity check, I did try this on another system, running SuSE
> with drives connected to a cciss SCSI device, and I got exactly the same
> results.  I'm concerned that Linux users who use O_SYNC because they
> notice it's faster will be losing their WAL integrity without being aware
> of the problem, especially as the whole O_DIRECT business isn't even
> mentioned in the WAL documentation--it really deserves to be brought up in
> the wal_sync_method notes at
> http://developer.postgresql.org/pgdocs/postgres/runtime-config-wal.html
>
> And while I'm mentioning improvements to that particular documentation
> page...the wal_buffers notes there are so sparse they misled me initially.
> They suggest only bumping it up for situations with very large
> transactions; since I was testing with small ones I left it woefully
> undersized initially.  I would suggest copying the text from
> http://developer.postgresql.org/pgdocs/postgres/wal-configuration.html to
> here: "When full_page_writes is set and the system is very busy, setting
> this value higher will help smooth response times during the period
> immediately following each checkpoint."  That seems to match what I found
> in testing.
>
> --
> * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
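
A quick sanity check on the quoted timings: with write caching off, a synchronous write to a 7200RPM drive cannot finish much faster than the platter rotates, about 8.3 ms per revolution. The sketch below does that arithmetic. It assumes each reported figure is total seconds over 1000 loop iterations (an assumption about test_fsync, not stated in the message); the function names and the half-rotation threshold are illustrative only.

```c
#include <assert.h>

/* Time for one full platter rotation, in milliseconds. */
static double rotation_ms(double rpm)
{
    assert(rpm > 0.0);
    return 60000.0 / rpm;
}

/* Average time per loop iteration in milliseconds, given the total
 * seconds reported and the (assumed) number of iterations. */
static double per_iteration_ms(double total_seconds, int iterations)
{
    assert(iterations > 0);
    return total_seconds * 1000.0 / iterations;
}

/*
 * Returns 1 if each iteration took at least half a rotation (the average
 * rotational latency), 0 otherwise.  The half-rotation cutoff is a rough
 * heuristic chosen for this sketch, not a value from test_fsync.
 */
static int plausibly_hit_disk(double total_seconds, int iterations, double rpm)
{
    return per_iteration_ms(total_seconds, iterations) >= rotation_ms(rpm) / 2.0;
}
```

Under those assumptions, the stock result of 8.717944 works out to roughly 8.7 ms per iteration, about one rotation, while the O_DIRECT result of 0.004995 works out to about 5 microseconds, far too fast for a write that waited on the disk.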
*** /pg/tools/fsync/test_fsync.c    Fri Oct 13 10:18:33 2006
--- test_fsync.c    Thu Nov 23 00:24:49 2006
***************
*** 14,19 ****
--- 14,20 ----
  #include <time.h>
  #include <sys/time.h>
  #include <unistd.h>
+ #include <string.h>

  #ifdef WIN32
  #define FSYNC_FILENAME    "./test_fsync.out"
***************
*** 21,40 ****
  #define FSYNC_FILENAME    "/var/tmp/test_fsync.out"
  #endif

! /* O_SYNC and O_FSYNC are the same */
  #if defined(O_SYNC)
! #define OPEN_SYNC_FLAG        O_SYNC
  #elif defined(O_FSYNC)
! #define OPEN_SYNC_FLAG        O_FSYNC
! #elif defined(O_DSYNC)
! #define OPEN_DATASYNC_FLAG    O_DSYNC
  #endif

  #if defined(OPEN_SYNC_FLAG)
! #if defined(O_DSYNC) && (O_DSYNC != OPEN_SYNC_FLAG)
! #define OPEN_DATASYNC_FLAG    O_DSYNC
  #endif
  #endif

  #define WAL_FILE_SIZE    (16 * 1024 * 1024)

--- 22,54 ----
  #define FSYNC_FILENAME    "/var/tmp/test_fsync.out"
  #endif

! /* This logic comes from src/backend/access/transam/xlog.c where it's
!    better documented */
! #ifdef O_DIRECT
! #define PG_O_DIRECT                             O_DIRECT
! #else
! #define PG_O_DIRECT                             0
! #endif
!
  #if defined(O_SYNC)
! #define BARE_OPEN_SYNC_FLAG             O_SYNC
  #elif defined(O_FSYNC)
! #define BARE_OPEN_SYNC_FLAG             O_FSYNC
! #endif
! #ifdef BARE_OPEN_SYNC_FLAG
! #define OPEN_SYNC_FLAG                  (BARE_OPEN_SYNC_FLAG | PG_O_DIRECT)
  #endif

+ #if defined(O_DSYNC)
  #if defined(OPEN_SYNC_FLAG)
! #if O_DSYNC != BARE_OPEN_SYNC_FLAG
! #define OPEN_DATASYNC_FLAG              (O_DSYNC | PG_O_DIRECT)
! #endif
! #else
! #define OPEN_DATASYNC_FLAG              (O_DSYNC | PG_O_DIRECT)
  #endif
  #endif
+

  #define WAL_FILE_SIZE    (16 * 1024 * 1024)
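
One possibility worth ruling out when O_DIRECT timings look too good, and not something this message confirms: on Linux, write() on an O_DIRECT descriptor generally requires a suitably aligned buffer, length, and offset, and fails with EINVAL otherwise. A test that ignores write()'s return value would then time failed calls. The sketch below shows an aligned allocation and a checked write. DEMO_ALIGN, alloc_aligned, and checked_write are hypothetical names for illustration, and the 512-byte alignment is an assumption that may not fit every device or filesystem.

```c
#define _GNU_SOURCE             /* exposes O_DIRECT on glibc; must precede includes */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Same fallback as the patch: O_DIRECT becomes a no-op where undefined. */
#ifdef O_DIRECT
#define PG_O_DIRECT O_DIRECT
#else
#define PG_O_DIRECT 0
#endif

#define DEMO_ALIGN 512          /* hypothetical: assumed alignment requirement */

/* Allocate a zeroed buffer aligned to DEMO_ALIGN bytes; NULL on failure. */
static void *alloc_aligned(size_t len)
{
    void *p = NULL;

    if (posix_memalign(&p, DEMO_ALIGN, len) != 0)
        return NULL;
    memset(p, 0, len);
    return p;
}

/*
 * Write exactly len bytes.  Returns 0 on success, -1 with errno set on
 * failure.  A short write is reported as EIO so it is never mistaken
 * for a completed write.
 */
static int checked_write(int fd, const void *buf, size_t len)
{
    ssize_t n = write(fd, buf, len);

    if (n < 0)
        return -1;
    if ((size_t) n != len)
    {
        errno = EIO;
        return -1;
    }
    return 0;
}
```

To try it against the flags in the patch, open the test file with O_RDWR | O_SYNC | PG_O_DIRECT, pass a buffer from alloc_aligned(8192) to checked_write, and report errno on failure. If an unaligned buffer produces EINVAL there, the near-zero timings are measuring failed writes rather than fast ones.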

