I have modified and attached your patch for your review. I didn't see
any value to adding new fsync_method values because, to me, O_DIRECT is
basically just like O_SYNC except it doesn't keep a copy of the buffer
in the kernel cache. If you are doing fsync(), I don't see how O_DIRECT
makes any sense because O_DIRECT is writing to disk on every write, and
then what is the fsync() actually doing. This might explain why your
fsync/direct and open/direct performance numbers are almost identical.
Basically, if you are going to use O_DIRECT, why not use open_sync.
What I did was to add O_DIRECT unconditionally for all uses of O_SYNC
and O_DSYNC, so it is automatically used in those cases. And of course,
if your operating system doens't support O_DIRECT, it isn't used.
With your posted performance numbers, perhaps we should favor
fsync_method O_SYNC on platforms that have O_DIRECT even if we don't
support OPEN_DATASYNC, but I bet most platforms that have O_DIRECT also
have O_DATASYNC. Perhaps some folks can run testes once the patch is
applied.
---------------------------------------------------------------------------
ITAGAKI Takahiro wrote:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > Yeah, this is about what I was afraid of: if you're actually fsyncing
> > then you get at best one commit per disk revolution, and the negotiation
> > with the OS is down in the noise.
>
> If we disable writeback-cache and use open_sync, the per-page writing
> behavior in WAL module will show up as bad result. O_DIRECT is similar
> to O_DSYNC (at least on linux), so that the benefit of it will disappear
> behind the slow disk revolution.
>
> In the current source, WAL is written as:
> for (i = 0; i < N; i++) { write(&buffers[i], BLCKSZ); }
> Is this intentional? Can we rewrite it as follows?
> write(&buffers[0], N * BLCKSZ);
>
> In order to achieve it, I wrote a 'gather-write' patch (xlog.gw.diff).
> Aside from this, I'll also send the fixed direct io patch (xlog.dio.diff).
> These two patches are independent, so they can be applied either or both.
>
>
> I tested them on my machine and the results as follows. It shows that
> direct-io and gather-write is the best choice when writeback-cache is off.
> Are these two patches worth trying if they are used together?
>
>
> | writeback | fsync= | fdata | open_ | fsync_ | open_
> patch | cache | false | sync | sync | direct | direct
> ------------+-----------+--------+-------+-------+--------+---------
> direct io | off | 124.2 | 105.7 | 48.3 | 48.3 | 48.2
> direct io | on | 129.1 | 112.3 | 114.1 | 142.9 | 144.5
> gather-write| off | 124.3 | 108.7 | 105.4 | (N/A) | (N/A)
> both | off | 131.5 | 115.5 | 114.4 | 145.4 | 145.2
>
> - 20runs * pgbench -s 100 -c 50 -t 200
> - with tuning (wal_buffers=64, commit_delay=500, checkpoint_segments=8)
> - using 2 ATA disks:
> - hda(reiserfs) includes system and wal.
> - hdc(jfs) includes database files. writeback-cache is always on.
>
> ---
> ITAGAKI Takahiro
> NTT Cyber Space Laboratories
>
[ Attachment, skipping... ]
[ Attachment, skipping... ]
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faq
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.210
diff -c -c -r1.210 xlog.c
*** src/backend/access/transam/xlog.c 23 Jul 2005 15:31:16 -0000 1.210
--- src/backend/access/transam/xlog.c 23 Jul 2005 16:09:12 -0000
***************
*** 48,77 ****
/*
* This chunk of hackery attempts to determine which file sync methods
* are available on the current platform, and to choose an appropriate
* default method. We assume that fsync() is always available, and that
* configure determined whether fdatasync() is.
*/
#if defined(O_SYNC)
! #define OPEN_SYNC_FLAG O_SYNC
#else
#if defined(O_FSYNC)
! #define OPEN_SYNC_FLAG O_FSYNC
#endif
#endif
#if defined(O_DSYNC)
#if defined(OPEN_SYNC_FLAG)
! #if O_DSYNC != OPEN_SYNC_FLAG
! #define OPEN_DATASYNC_FLAG O_DSYNC
#endif
#else /* !defined(OPEN_SYNC_FLAG) */
/* Win32 only has O_DSYNC */
! #define OPEN_DATASYNC_FLAG O_DSYNC
#endif
#endif
#if defined(OPEN_DATASYNC_FLAG)
#define DEFAULT_SYNC_METHOD_STR "open_datasync"
#define DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN
--- 48,114 ----
/*
+ * Becauase O_DIRECT bypasses the kernel buffers, and because we never
+ * read those buffers except during crash recovery, it seems like
+ * a win to use it in all cases where we sync on each write().
+ */
+ #ifdef O_DIRECT
+ #define PG_O_DIRECT O_DIRECT
+ #else
+ #define PG_O_DIRECT 0
+ #endif
+
+ /*
* This chunk of hackery attempts to determine which file sync methods
* are available on the current platform, and to choose an appropriate
* default method. We assume that fsync() is always available, and that
* configure determined whether fdatasync() is.
*/
#if defined(O_SYNC)
! #define CMP_OPEN_SYNC_FLAG O_SYNC
#else
#if defined(O_FSYNC)
! #define CMP_OPEN_SYNC_FLAG O_FSYNC
#endif
#endif
+ #define OPEN_SYNC_FLAG (CMP_OPEN_SYNC_FLAG | PG_O_DIRECT)
#if defined(O_DSYNC)
#if defined(OPEN_SYNC_FLAG)
! #if O_DSYNC != CMP_OPEN_SYNC_FLAG
! #define OPEN_DATASYNC_FLAG (O_DSYNC | PG_O_DIRECT)
#endif
#else /* !defined(OPEN_SYNC_FLAG) */
/* Win32 only has O_DSYNC */
! #define OPEN_DATASYNC_FLAG (O_DSYNC | PG_O_DIRECT)
#endif
#endif
+ /*
+ * Limitation of buffer-alignment for direct io depend on OS and filesystem,
+ * but BLCKSZ is assumed to be enough for it.
+ */
+ #ifdef O_DIRECT
+ #define ALIGNOF_XLOG_BUFFER BLCKSZ
+ #else
+ #define ALIGNOF_XLOG_BUFFER MAXIMUM_ALIGNOF
+ #endif
+
+ /*
+ * Switch the alignment routine because ShmemAlloc() returns a max-aligned
+ * buffer and ALIGNOF_XLOG_BUFFER may be greater than MAXIMUM_ALIGNOF.
+ */
+ #if ALIGNOF_XLOG_BUFFER <= MAXIMUM_ALIGNOF
+ #define XLOG_BUFFER_ALIGN(LEN) MAXALIGN((LEN))
+ #else
+ #define XLOG_BUFFER_ALIGN(LEN) ((LEN) + (ALIGNOF_XLOG_BUFFER))
+ #endif
+ /* assume sizeof(ptrdiff_t) == sizeof(void*) */
+ #define POINTERALIGN(ALIGNVAL,PTR) \
+ ((char *)(((ptrdiff_t) (PTR) + (ALIGNVAL-1)) & ~((ptrdiff_t) (ALIGNVAL-1))))
+ #define XLOG_BUFFER_POINTERALIGN(PTR) \
+ POINTERALIGN((ALIGNOF_XLOG_BUFFER), (PTR))
+
#if defined(OPEN_DATASYNC_FLAG)
#define DEFAULT_SYNC_METHOD_STR "open_datasync"
#define DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN
***************
*** 469,474 ****
--- 506,522 ----
static char *str_time(time_t tnow);
static void issue_xlog_fsync(void);
+ /* XLog gather-write staffs */
+ typedef struct XLogPages
+ {
+ char *head; /* Head of first page */
+ int size; /* Total bytes of pages == count(pages) * BLCKSZ */
+ int offset; /* Offset in xlog segment file */
+ } XLogPages;
+ static void XLogPageReset(XLogPages *pages);
+ static void XLogPageWrite(XLogPages *pages, int index);
+ static void XLogPageFlush(XLogPages *pages, int index);
+
#ifdef WAL_DEBUG
static void xlog_outrec(char *buf, XLogRecord *record);
#endif
***************
*** 1245,1253 ****
XLogWrite(XLogwrtRqst WriteRqst)
{
XLogCtlWrite *Write = &XLogCtl->Write;
- char *from;
bool ispartialpage;
bool use_existent;
/* We should always be inside a critical section here */
Assert(CritSectionCount > 0);
--- 1293,1302 ----
XLogWrite(XLogwrtRqst WriteRqst)
{
XLogCtlWrite *Write = &XLogCtl->Write;
bool ispartialpage;
bool use_existent;
+ int currentIndex = Write->curridx;
+ XLogPages pages;
/* We should always be inside a critical section here */
Assert(CritSectionCount > 0);
***************
*** 1258,1263 ****
--- 1307,1314 ----
*/
LogwrtResult = Write->LogwrtResult;
+ XLogPageReset(&pages);
+
while (XLByteLT(LogwrtResult.Write, WriteRqst.Write))
{
/*
***************
*** 1266,1279 ****
* end of the last page that's been initialized by
* AdvanceXLInsertBuffer.
*/
! if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[Write->curridx]))
elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
! XLogCtl->xlblocks[Write->curridx].xlogid,
! XLogCtl->xlblocks[Write->curridx].xrecoff);
/* Advance LogwrtResult.Write to end of current buffer page */
! LogwrtResult.Write = XLogCtl->xlblocks[Write->curridx];
ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
--- 1317,1330 ----
* end of the last page that's been initialized by
* AdvanceXLInsertBuffer.
*/
! if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[currentIndex]))
elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
! XLogCtl->xlblocks[currentIndex].xlogid,
! XLogCtl->xlblocks[currentIndex].xrecoff);
/* Advance LogwrtResult.Write to end of current buffer page */
! LogwrtResult.Write = XLogCtl->xlblocks[currentIndex];
ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
***************
*** 1281,1286 ****
--- 1332,1338 ----
/*
* Switch to new logfile segment.
*/
+ XLogPageFlush(&pages, currentIndex);
if (openLogFile >= 0)
{
if (close(openLogFile))
***************
*** 1354,1384 ****
openLogOff = 0;
}
! /* Need to seek in the file? */
! if (openLogOff != (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize)
! {
! openLogOff = (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize;
! if (lseek(openLogFile, (off_t) openLogOff, SEEK_SET) < 0)
! ereport(PANIC,
! (errcode_for_file_access(),
! errmsg("could not seek in log file %u, segment %u to offset %u: %m",
! openLogId, openLogSeg, openLogOff)));
! }
!
! /* OK to write the page */
! from = XLogCtl->pages + Write->curridx * BLCKSZ;
! errno = 0;
! if (write(openLogFile, from, BLCKSZ) != BLCKSZ)
! {
! /* if write didn't set errno, assume problem is no disk space */
! if (errno == 0)
! errno = ENOSPC;
! ereport(PANIC,
! (errcode_for_file_access(),
! errmsg("could not write to log file %u, segment %u at offset %u: %m",
! openLogId, openLogSeg, openLogOff)));
! }
! openLogOff += BLCKSZ;
/*
* If we just wrote the whole last page of a logfile segment,
--- 1406,1413 ----
openLogOff = 0;
}
! /* Add a page to buffer */
! XLogPageWrite(&pages, currentIndex);
/*
* If we just wrote the whole last page of a logfile segment,
***************
*** 1390,1397 ****
* This is also the right place to notify the Archiver that the
* segment is ready to copy to archival storage.
*/
! if (openLogOff >= XLogSegSize && !ispartialpage)
{
issue_xlog_fsync();
LogwrtResult.Flush = LogwrtResult.Write; /* end of current page */
--- 1419,1427 ----
* This is also the right place to notify the Archiver that the
* segment is ready to copy to archival storage.
*/
! if (openLogOff + pages.size >= XLogSegSize && !ispartialpage)
{
+ XLogPageFlush(&pages, currentIndex);
issue_xlog_fsync();
LogwrtResult.Flush = LogwrtResult.Write; /* end of current page */
***************
*** 1405,1412 ****
LogwrtResult.Write = WriteRqst.Write;
break;
}
! Write->curridx = NextBufIdx(Write->curridx);
}
/*
* If asked to flush, do so
--- 1435,1443 ----
LogwrtResult.Write = WriteRqst.Write;
break;
}
! currentIndex = NextBufIdx(currentIndex);
}
+ XLogPageFlush(&pages, currentIndex);
/*
* If asked to flush, do so
***************
*** 3584,3590 ****
if (XLOGbuffers < MinXLOGbuffers)
XLOGbuffers = MinXLOGbuffers;
! return MAXALIGN(sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers)
+ BLCKSZ * XLOGbuffers +
MAXALIGN(sizeof(ControlFileData));
}
--- 3615,3621 ----
if (XLOGbuffers < MinXLOGbuffers)
XLOGbuffers = MinXLOGbuffers;
! return XLOG_BUFFER_ALIGN(sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers)
+ BLCKSZ * XLOGbuffers +
MAXALIGN(sizeof(ControlFileData));
}
***************
*** 3601,3607 ****
XLogCtl = (XLogCtlData *)
ShmemInitStruct("XLOG Ctl",
! MAXALIGN(sizeof(XLogCtlData) +
sizeof(XLogRecPtr) * XLOGbuffers)
+ BLCKSZ * XLOGbuffers,
&foundXLog);
--- 3632,3638 ----
XLogCtl = (XLogCtlData *)
ShmemInitStruct("XLOG Ctl",
! XLOG_BUFFER_ALIGN(sizeof(XLogCtlData) +
sizeof(XLogRecPtr) * XLOGbuffers)
+ BLCKSZ * XLOGbuffers,
&foundXLog);
***************
*** 3630,3638 ****
* Here, on the other hand, we must MAXALIGN to ensure the page
* buffers have worst-case alignment.
*/
! XLogCtl->pages =
! ((char *) XLogCtl) + MAXALIGN(sizeof(XLogCtlData) +
! sizeof(XLogRecPtr) * XLOGbuffers);
memset(XLogCtl->pages, 0, BLCKSZ * XLOGbuffers);
/*
--- 3661,3669 ----
* Here, on the other hand, we must MAXALIGN to ensure the page
* buffers have worst-case alignment.
*/
! XLogCtl->pages = XLOG_BUFFER_POINTERALIGN(
! ((char *) XLogCtl)
! + sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers);
memset(XLogCtl->pages, 0, BLCKSZ * XLOGbuffers);
/*
***************
*** 3662,3667 ****
--- 3693,3699 ----
{
CheckPoint checkPoint;
char *buffer;
+ char buffer0[BLCKSZ + ALIGNOF_XLOG_BUFFER];
XLogPageHeader page;
XLogLongPageHeader longpage;
XLogRecord *record;
***************
*** 3690,3697 ****
/* First timeline ID is always 1 */
ThisTimeLineID = 1;
! /* Use malloc() to ensure buffer is MAXALIGNED */
! buffer = (char *) malloc(BLCKSZ);
page = (XLogPageHeader) buffer;
memset(buffer, 0, BLCKSZ);
--- 3722,3728 ----
/* First timeline ID is always 1 */
ThisTimeLineID = 1;
! buffer = XLOG_BUFFER_POINTERALIGN(buffer0);
page = (XLogPageHeader) buffer;
memset(buffer, 0, BLCKSZ);
***************
*** 5837,5839 ****
--- 5868,5938 ----
errmsg("could not remove file \"%s\": %m",
BACKUP_LABEL_FILE)));
}
+
+
+ /* XLog gather-write staffs */
+
+ static void
+ XLogPageReset(XLogPages *pages)
+ {
+ memset(pages, 0, sizeof(*pages));
+ }
+
+ static void
+ XLogPageWrite(XLogPages *pages, int index)
+ {
+ char *page = XLogCtl->pages + index * BLCKSZ;
+ int size = BLCKSZ;
+ int offset = (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize;
+
+ if (pages->head + pages->size == page
+ && pages->offset + pages->size == offset)
+ { /* Pages are continuous. Append new page. */
+ pages->size += size;
+ }
+ else
+ { /* Pages are not continuous. Flush and clear. */
+ XLogPageFlush(pages, PrevBufIdx(index));
+ pages->head = page;
+ pages->size = size;
+ pages->offset = offset;
+ }
+ }
+
+ static void
+ XLogPageFlush(XLogPages *pages, int index)
+ {
+ if (!pages->head)
+ { /* No needs to write pages. */
+ XLogCtl->Write.curridx = index;
+ return;
+ }
+
+ /* Need to seek in the file? */
+ if (openLogOff != pages->offset)
+ {
+ openLogOff = pages->offset;
+ if (lseek(openLogFile, (off_t) openLogOff, SEEK_SET) < 0)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not seek in log file %u, segment %u to offset %u: %m",
+ openLogId, openLogSeg, openLogOff)));
+ }
+
+ /* OK to write the page */
+ errno = 0;
+ if (write(openLogFile, pages->head, pages->size) != pages->size)
+ {
+ /* if write didn't set errno, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not write to log file %u, segment %u at offset %u: %m",
+ openLogId, openLogSeg, openLogOff)));
+ }
+
+ openLogOff += pages->size;
+ XLogCtl->Write.curridx = index;
+ XLogPageReset(pages);
+ }