Re: Maximum transaction rate - Mailing list pgsql-general

From Marco Colombo
Subject Re: Maximum transaction rate
Msg-id 49C16E8F.7020100@esiway.net
In response to Re: Maximum transaction rate  (Greg Smith <gsmith@gregsmith.com>)
Responses Re: Maximum transaction rate
List pgsql-general
Greg Smith wrote:
> On Wed, 18 Mar 2009, Marco Colombo wrote:
>
>> If you fsync() after each write you want ordered, there can't be any
>> "subsequent I/O" (unless there are many different processes
>> concurrently writing to the file w/o synchronization).
>
> Inside PostgreSQL, each of the database backend processes ends up
> writing blocks to the database disk, if they need to allocate a new
> buffer and the one they are handed is dirty.  You can easily have
> several of those writing to the same 1GB underlying file on disk.  So
> that prerequisite is there.  The main potential for a problem here would
> be if a stray unsynchronized write from one of those backends happened
> in a way that wasn't accounted for by the WAL+checkpoint design.

Wow, that would be quite a bug. That's why I wrote "w/o synchronization".
"stray" + "unaccounted" + "concurrent" smells like the recipe for an
explosive to me :)

> What I
> was suggesting is that the way that synchronization happens in the
> database provides some defense from running into problems in this area.

I hope it's "full defence". If you have two processes doing at the
same time write(); fsync(); on the same file, either there are no order
requirements, or it will boom sooner or later... fsync() works inside
a single process, but any system call may put the process to sleep, and
who knows when it will be awakened and what other processes did to that
file meanwhile. I'm pretty confident that PG code protects access to
shared resources with synchronization primitives.
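
To illustrate the point about synchronization, here is a minimal sketch
(not PostgreSQL's actual code) of two or more processes appending to the
same file, each doing write() and fsync() inside an exclusive lock. The
function names locked_append and write_all are made up for this example,
and flock() stands in for whatever primitive a real program would use.

```python
import fcntl
import os


def write_all(fd: int, data: bytes) -> None:
    """Keep calling write() until every byte has been handed to the kernel."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def locked_append(path: str, record: bytes) -> None:
    """Append one record and fsync it while holding an exclusive lock.

    Hypothetical helper: the lock serializes write()+fsync() across
    processes, so no other writer can slip in between the two calls.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)      # blocks while another process holds it
        try:
            write_all(fd, record)
            os.fsync(fd)                    # flush request issued before unlocking
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

Without the lock, a process put to sleep between write() and fsync() has
no say over what the other writers do to the file in the meantime.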

Anyway, I was referring to WAL writes... due to the nature of a log,
it's hard to think of many unordered writes and of concurrent access
w/o synchronization. But inside a critical region, there can be more
than one single write, and you may need to enforce an order, but no
more than that before the final fsync(). If so, userland-originated
barriers instead of full fsync()'s may help with performance.
But I'm speculating.
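
A rough sketch of the pattern described above, under the assumption that
the critical region writes a payload followed by a commit marker and that
the marker must not reach the disk before the payload. The name
ordered_commit and the marker value are invented for the example. No
userland barrier call is assumed to exist, so fsync() is used at both
points; only the second one is there for durability.

```python
import os


def ordered_commit(fd: int, payload: bytes, marker: bytes = b"COMMIT\n") -> None:
    """Write payload, then marker, enforcing their order with fsync().

    Hypothetical helper. The first fsync() is used purely as an ordering
    point; this is the call a userland barrier could, in principle, replace.
    """
    os.write(fd, payload)
    os.fsync(fd)          # ordering only: payload must precede the marker
    os.write(fd, marker)
    os.fsync(fd)          # durability: the commit is not done until this returns
```

The first call pays the full cost of a flush although only ordering is
required there, which is where the speculated saving would come from.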

> The way backends handle writes themselves is also why your suggestion
> about the database being able to utilize barriers isn't really helpful.
> Those trickle out all the time, and normally you don't even have to care
> about ordering them.  The only time you do need to care, at checkpoint time,
> only a hard line is really practical--all writes up to that point,
> period. Trying to implement ordered writes for everything that happened
> before then would complicate the code base, which isn't going to happen
> for such a platform+filesystem specific feature, one that really doesn't
> offer much acceleration from the database's perspective.

I don't know the internals of WAL writing, I can't really reply on that.

>> only when the journal wraps around there's an (extremely) small window
>> of vulnerability. You need to write a carefully crafted torture program
>> to get any chance to observe that... such program exists, and triggers
>> the problem
>
> Yeah, I've been following all that.  The PostgreSQL WAL design works on
> ext2 filesystems with no journal at all.  Some people even put their
> pg_xlog directory onto ext2 filesystems for best performance, relying on
> the WAL to be the journal.  As long as fsync is honored correctly, the
> WAL writes should be re-writing already allocated space, which makes
> this category of journal mayhem not so much of a problem.  But when I
> read about fsync doing unexpected things, that gets me more concerned.

Well, that's highly dependent on your expectations :) I don't expect
an fsync to trigger a journal commit, if metadata hasn't changed. That's
obviously true for metadata-only journals (like most of them, with
notable exceptions of ext3 in data=journal mode).
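
As an illustration of "re-writing already allocated space", here is a
minimal sketch, not taken from PostgreSQL: a segment file is filled with
zeros once, and later writes overwrite it in place, so the file size and
block allocation do not change. The names preallocate, overwrite_in_place
and SEGMENT_SIZE are made up, and fdatasync() is used where the platform
provides it, falling back to fsync() otherwise.

```python
import os

SEGMENT_SIZE = 1024 * 1024  # hypothetical segment size for the example

# fdatasync() is not available everywhere; fall back to fsync().
data_sync = getattr(os, "fdatasync", os.fsync)


def preallocate(path: str, size: int = SEGMENT_SIZE) -> int:
    """Create a zero-filled file of the given size and return its descriptor."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)
    chunk = b"\0" * 65536
    remaining = size
    while remaining > 0:
        remaining -= os.write(fd, chunk[:remaining])
    os.fsync(fd)  # size and allocation are settled here, once
    return fd


def overwrite_in_place(fd: int, offset: int, data: bytes) -> None:
    """Overwrite bytes inside the preallocated area, then sync the data."""
    os.pwrite(fd, data, offset)
    data_sync(fd)  # file size is unchanged by this write
```

Whether the sync in overwrite_in_place ends up touching the journal still
depends on the filesystem and its mount options.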

Yet, if you're referring to this
http://article.gmane.org/gmane.linux.file-systems/21373

well, that seems to me the same usual thing/bug: fsync() allows disks to
lie when it comes to caching writes. Nothing new under the sun.

Barriers don't change much, because they don't replace a flush. They're
about consistency, not durability. So even with full barrier support,
an fsync implementation needs to end up in a disk cache flush, to be fully
compliant with its own semantics.

.TM.
