Re: WAL fsync scheduling - Mailing list pgsql-hackers

From Vadim Mikheev
Subject Re: WAL fsync scheduling
Date
Msg-id 002101c0525e$2d964480$b97a30d0@sectorbase.com
In response to WAL fsync scheduling  (Bruce Momjian <pgman@candle.pha.pa.us>)
Responses Re: WAL fsync scheduling
List pgsql-hackers
> There are two parts to transaction commit.  The first is writing all
> dirty buffers or log changes to the kernel, and second is fsync of the
  ^^^^^^^^^^^^
A backend doesn't write any dirty buffers to the kernel at commit time.

> log file.

The first part is writing the commit record into the WAL buffers in shmem.
This is what XLogInsert does.  After that, XLogFlush is called to ensure
that the entire commit record is on disk.  XLogFlush does *both* write() and
fsync() (a single slock is used for both writing and fsyncing), if it needs
to do anything at all.
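
A minimal sketch of that two-step flow, in Python rather than the backend's
C. ToyWal, its fields and demo() are hypothetical names invented for
illustration; they are not PostgreSQL's structures. insert() only copies the
record into an in-memory buffer, and flush() takes one lock, returns at once
if the requested position is already on disk, and otherwise does both the
write() and the fsync().

```python
import os
import tempfile
import threading


class ToyWal:
    """Toy model of the commit path; not PostgreSQL's real structures."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.insert_lock = threading.Lock()  # protects the in-memory buffer
        self.flush_lock = threading.Lock()   # one lock for both write() and fsync()
        self.buffer = bytearray()            # records not yet written to the file
        self.inserted = 0                    # end position of the last inserted record
        self.flushed = 0                     # everything before this is on disk
        self.fsync_calls = 0

    def insert(self, record):
        """Copy a record into the buffer only; return its end position."""
        with self.insert_lock:
            self.buffer += record
            self.inserted += len(record)
            return self.inserted

    def flush(self, upto):
        """Ensure the log is on disk up to `upto`; do nothing if it already is."""
        with self.flush_lock:
            if self.flushed >= upto:
                return
            with self.insert_lock:
                pending = bytes(self.buffer)
                self.buffer.clear()
                end = self.inserted
            view = memoryview(pending)
            while view:                      # os.write may write fewer bytes than asked
                view = view[os.write(self.fd, view):]
            os.fsync(self.fd)
            self.fsync_calls += 1
            self.flushed = end


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.wal")
        wal = ToyWal(path)
        first = wal.insert(b"commit-1;")
        second = wal.insert(b"commit-2;")
        wal.flush(second)   # writes and fsyncs both records
        wal.flush(first)    # already on disk: no write(), no fsync()
        os.close(wal.fd)
        with open(path, "rb") as f:
            return wal.fsync_calls, f.read()
```

In demo(), the second flush() call is the "if it needs to do it at all"
case: the earlier flush already covered that record, so no further fsync
happens.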

> I suggest having a per-backend shared memory byte that has the following
> values:
> 
> START_LOG_WRITE
> WAIT_ON_FSYNC
> NOT_IN_COMMIT
> backend_number_doing_fsync
> 
> I suggest that when each backend starts a commit, it sets its byte to
> START_LOG_WRITE.  ^^^^^^^^^^^^^^^^^^^^^^^
Isn't START_COMMIT more meaningful?

> When it gets ready to fsync, it checks all backends.
  ^^^^^^^^^^^^^^^^^^^^^^^^^^
What do you mean by this? The moment just after XLogInsert?

> If all are NOT_IN_COMMIT, it does fsync and continues.

1st edition:
> If one or more are in START_LOG_WRITE, it waits until no one is in
> START_LOG_WRITE.  It then checks all WAIT_ON_FSYNC, and if it is the
> lowest backend in WAIT_ON_FSYNC, marks all others with its backend
> number, and does fsync.  It then clears all backends with its number to
> NOT_IN_COMMIT.  Other backends will see they are not the lowest
> WAIT_ON_FSYNC and will wait for their byte to be set to NOT_IN_COMMIT
> so they can then continue, knowing their data was synced.
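
The decision rule quoted in the 1st edition can be sketched as a pure
function. This is only an illustration in Python: Status, next_action and
the returned action strings are made-up names, and the step of marking the
other backends with the leader's backend number is left out.

```python
from enum import Enum


class Status(Enum):
    NOT_IN_COMMIT = 0
    START_LOG_WRITE = 1
    WAIT_ON_FSYNC = 2


def next_action(me, statuses):
    """Decide what backend `me` does once it is ready to fsync.

    Assumes statuses[me] is already WAIT_ON_FSYNC.
    """
    others = [s for i, s in enumerate(statuses) if i != me]
    if all(s is Status.NOT_IN_COMMIT for s in others):
        return "fsync"                  # nobody else is committing
    if any(s is Status.START_LOG_WRITE for s in others):
        return "wait_for_writers"       # someone is still writing its log record
    # Everyone else in commit is waiting on fsync: the lowest backend does it.
    lowest = min(i for i, s in enumerate(statuses) if s is Status.WAIT_ON_FSYNC)
    return "fsync_for_group" if lowest == me else "wait_for_leader"
```

For example, with backends 0 and 2 both in WAIT_ON_FSYNC and backend 1 idle,
backend 0 gets "fsync_for_group" and backend 2 gets "wait_for_leader".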

2nd edition:
> I have another idea.  If a backend gets to the point that it needs
> fsync, and there is another backend in START_LOG_WRITE, it can go to an
> interruptible sleep, knowing another backend will perform the fsync and
> wake it up.  Therefore, there is no busy-wait or timed sleep.
> 
> Of course, a backend must set its status to WAIT_ON_FSYNC to avoid a
> race condition.

The 2nd edition is much better. But I'm not sure we really need these
per-backend bytes in shmem. Why not just have some counters?
We can use a semaphore to wake up all waiters at once.
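
One possible reading of the counters idea, as a rough Python sketch. It is
not a worked-out design: GroupFlusher, do_fsync and demo() are hypothetical
names, and a threading.Condition with notify_all() stands in for the
semaphore because it gives the same "wake everyone at once" behaviour in a
few lines. Two counters replace the per-backend bytes: the highest position
anyone has asked for, and the highest position known to be on disk. A woken
backend just rechecks the second counter.

```python
import threading
import time


class GroupFlusher:
    def __init__(self, do_fsync):
        self.cond = threading.Condition()
        self.do_fsync = do_fsync   # hypothetical callable doing write() + fsync()
        self.requested = 0         # counter: highest log position anyone wants on disk
        self.flushed = 0           # counter: highest log position known to be on disk
        self.flushing = False      # True while one backend is inside do_fsync

    def flush(self, upto):
        while True:
            with self.cond:
                self.requested = max(self.requested, upto)
                while self.flushing and self.flushed < upto:
                    self.cond.wait()          # sleep until the flusher wakes everyone
                if self.flushed >= upto:
                    return                    # someone else's fsync covered us
                self.flushing = True
                target = self.requested
            done = False
            try:
                self.do_fsync(target)         # outside the lock, so others can queue up
                done = True
            finally:
                with self.cond:
                    if done:
                        self.flushed = max(self.flushed, target)
                    self.flushing = False
                    self.cond.notify_all()    # wake all waiters at once


def demo():
    calls, entered, gate = [], threading.Event(), threading.Event()

    def slow_fsync(target):
        calls.append(target)
        if len(calls) == 1:
            entered.set()
            gate.wait(5)           # hold the first fsync open so others pile up

    flusher = GroupFlusher(slow_fsync)
    first = threading.Thread(target=flusher.flush, args=(1,))
    first.start()
    entered.wait(5)
    others = [threading.Thread(target=flusher.flush, args=(n,)) for n in (2, 3)]
    for t in others:
        t.start()
    while flusher.requested < 3:   # wait until position 3 has been requested
        time.sleep(0.001)
    gate.set()
    for t in [first] + others:
        t.join(5)
    return calls
```

In demo(), three backends commit but only two fsyncs happen: one for
position 1, and one shared fsync covering positions 2 and 3. A lone backend
never sleeps, since it finds nobody flushing and does the fsync itself.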

> This allows a single backend not to sleep, and allows multiple backends
> to bunch up only when they are all about to commit.
> 
> The reason backend numbers are written is so other backends entering the
> commit code will not interfere with the backends performing fsync.

Once woken up, a backend can check what has been written/fsynced by calling
XLogFlush.

Vadim



