Re: AW: AW: AW: WAL does not recover gracefully from out-of -dis k-sp ace - Mailing list pgsql-hackers

From Tom Lane
Subject Re: AW: AW: AW: WAL does not recover gracefully from out-of -dis k-sp ace
Date
Msg-id 7642.984163344@sss.pgh.pa.us
Whole thread Raw
In response to AW: AW: AW: WAL does not recover gracefully from out-of -dis k-sp ace  (Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at>)
List pgsql-hackers
Zeugswetter Andreas SB  <ZeugswetterA@wien.spardat.at> writes:
>> This seems odd.  As near as I can tell, O_SYNC is simply a command to do
>> fsync implicitly during each write call.  It cannot save any I/O unless
>> I'm missing something significant.  Where is the performance difference
>> coming from?

> Yes, odd, but sure very reproducible here.

I tried this on HPUX 10.20, which has not only O_SYNC but also O_DSYNC
(defined to do the equivalent of fdatasync()), and got truly fascinating
results.  Apparently, on this platform these flags change the kernel's
buffering behavior!  Observe:

$ gcc -Wall -O tfsync.c
$ time a.out

real    1m0.32s
user    0m0.02s
sys     0m16.16s
$ gcc -Wall -O -DINIT_WRITE tfsync.c
$ time a.out

real    1m15.11s
user    0m0.04s
sys     0m32.76s

Note the large amount of system time here, and the fact that the extra
time in INIT_WRITE is all system time.  I have previously observed that
fsync() on HPUX 10.20 appears to iterate through every kernel disk
buffer belonging to the file, presumably checking their dirtybits one by
one.  The INIT_WRITE form loses because each fsync in the second loop
has to iterate through a full 16Mb worth of buffers, whereas without
INIT_WRITE there will only be as many buffers as the amount of file
we've filled so far.  (On this platform, it'd probably be a win to use
log segments smaller than 16Mb...)  It's interesting that there's no
visible I/O cost here for the extra write pass --- the extra I/O must be
completely overlapped with the extra system time.

$ gcc -Wall -O -DINIT_WRITE -DUSE_OSYNC tfsync.c
$ time a.out

real    0m45.04s
user    0m0.02s
sys     0m0.83s

We just bought back almost all the system time.  The only possible
explanation is that this way either doesn't keep the buffers from prior
blocks, or does not scan them for dirtybits.  I note that the open(2)
man page is phrased so that O_SYNC is actually defined not to fsync the
whole file, but only the part you just wrote --- I wonder if it's
actually implemented that way?

$ gcc -Wall -O -DINIT_WRITE -DUSE_SPARSE tfsync.c
$ time a.out

real    1m2.96s
user    0m0.02s
sys     0m27.11s
$ gcc -Wall -O -DINIT_WRITE -DUSE_OSYNC -DUSE_SPARSE tfsync.c
$ time a.out

real    1m1.34s
user    0m0.01s
sys     0m0.59s

Sparse initialization wins a little in the non-O_SYNC case, but loses
when compared with O_SYNC on.  Not sure why; perhaps it alters the
amount of I/O that has to be done for indirect blocks?

$ gcc -Wall -O -DINIT_WRITE -DUSE_ODSYNC tfsync.c
$ time a.out

real    0m21.40s
user    0m0.02s
sys     0m0.60s

And the piece de resistance: O_DSYNC *actually works* here, even though
I previously found that the fdatasync() call is stubbed to fsync() in
libc!  This old HP box is built like a tank and has a similar lack of
attention to noise level ;-) so I can very easily tell by ear that I am
not getting back-and-forth seeks in this last case, even if the timing
didn't prove it to be true.

$ gcc -Wall -O  -DUSE_ODSYNC tfsync.c
$ time a.out

real    1m1.56s
user    0m0.02s
sys     0m0.67s

Without INIT_WRITE, we are back to essentially the performance of fsync
even though we use DSYNC.  This is expected since the inode must be
written to change the EOF value.  Interestingly, the system time is
small, whereas in my first example it was large; but the elapsed time
is the same.  Evidently the system time is nearly all overlapped with
I/O in the first example.

At least on this platform, it would be definitely worthwhile to use
O_DSYNC even if that meant fsync per write rather than per transaction.
Can anyone else reproduce these results?

I attach my modified version of Andreas' program.  Note I do not believe
his assertion that close() implies fsync() --- on the machines I've
used, it demonstrably does not sync.  You'll also note that I made the
lseek optional in the second loop.  This appears to make no real
difference, so I didn't include timings with the lseek enabled.

            regards, tom lane


#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
/* #define USE_SPARSE */
// #define INIT_WRITE
// #define USE_OSYNC

int main()
{
    char zbuffer[8192];
    int nbytes;
    int fd;
    char *tpath="xlog_testfile";

    fd = open(tpath, O_RDWR | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR);
    if (fd < 0)
        exit(1);
#ifdef INIT_WRITE
#ifdef USE_SPARSE
/*  the sparse file makes things worse here */
    lseek(fd, 16*1024*1024 - 1, SEEK_SET);
    write(fd, 0, 1);
#else
    memset(zbuffer, '\0', sizeof(zbuffer));
    for (nbytes = 0; nbytes < 16*1024*1024; nbytes += sizeof(zbuffer))
    {
        write(fd, zbuffer, sizeof(zbuffer));
    }
#endif
    fsync(fd);
#endif
    /* no fsync here, since close does it for us */
    /* You think so? */
    close (fd);

#ifdef USE_OSYNC
    fd = open(tpath, O_RDWR | O_SYNC, S_IRUSR | S_IWUSR);
#else
#ifdef USE_ODSYNC
    fd = open(tpath, O_RDWR | O_DSYNC, S_IRUSR | S_IWUSR);
#else
    fd = open(tpath, O_RDWR , S_IRUSR | S_IWUSR);
#endif
#endif
    if (fd < 0)
        exit(1);

    memset(zbuffer, 'X', sizeof(zbuffer));
    nbytes = 0;
    do
    {
#ifdef USE_LSEEK
        lseek(fd, nbytes, SEEK_SET);
#endif
        write(fd, zbuffer, sizeof(zbuffer));
#ifndef USE_OSYNC
#ifndef USE_ODSYNC
        fsync(fd);
#endif
#endif
        nbytes += sizeof(zbuffer);
    } while (nbytes < 16*1024*1024);

    close(fd);

    return 0;

}

pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Performance monitor ready
Next
From: Zeugswetter Andreas SB
Date:
Subject: AW: AW: AW: WAL does not recover gracefully from out-of -dis k-sp ace