Dirty Buffer Writing [was Proposed LogWriter Scheme] - Mailing list pgsql-hackers

From Curtis Faith
Subject Dirty Buffer Writing [was Proposed LogWriter Scheme]
Date
Msg-id DMEEJMCDOJAKPPFACMPMKEFCCEAA.curtis@galtair.com
In response to Re: Proposed LogWriter Scheme, WAS: Potential Large  (Greg Copeland <greg@CopelandConsulting.Net>)
Responses Re: Dirty Buffer Writing [was Proposed LogWriter Scheme]  (Bruce Momjian <pgman@candle.pha.pa.us>)
List pgsql-hackers
> On Sun, 2002-10-06 at 11:46, Tom Lane wrote:
> > I can't personally get excited about something that only helps if your
> > server is starved for RAM --- who runs servers that aren't fat on RAM
> > anymore?  But give it a shot if you like.  Perhaps your analysis is
> > pessimistic.
>
> <snipped> I don't find it far fetched to
> imagine situations where people may commit large amounts of memory for
> the database yet marginally starve available memory for file system
> buffers.  Especially so on heavily I/O bound systems or where sporadically
> other types of non-database file activity may occur.
>
> <snipped> Of course, that opens the door for simply adding more memory
> and/or slightly reducing the amount of memory available to the database
> (thus making it available elsewhere).  Now, after all that's said and
> done, having something like aio in use would seemingly allow it to be
> somewhat more "self-tuning" from a potential performance perspective.

Good points.

Now for some surprising news (at least it surprised me).

I researched the file system source on my system (FreeBSD 4.6) and found
that the behavior was optimized for non-database access to eliminate
unnecessary writes when temp files are created and deleted rapidly. It was
not optimized to get data to the disk in the most efficient manner.

The syncer on FreeBSD appears to place dirtied filesystem buffers into
work queues that range from 1 to SYNCER_MAXDELAY. Each second the syncer
processes one of the queues and increments a counter syncer_delayno.

On my system SYNCER_MAXDELAY is set to 32, so each second roughly 1/32nd
of the buffered writes are processed. If the syncer gets behind, i.e. the
writes queued for a given second take more than one second to process, it
does not wait but moves on to the next queue.

AFAICT this means there is no opportunity for the disk to combine writes,
since they are processed in buckets based on the time the writes came in
rather than in disk order.
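To see why this defeats write combining, consider writes to adjacent disk
blocks that arrive in different seconds: the wheel issues them a full tick
apart, whereas sorting the whole backlog by block number would present them
back to back so the drive (or driver elevator) could merge them. A sketch of
that comparison, with made-up block numbers:

```python
# Writes arrive as (arrival_second, block_number); adjacent blocks 100 and
# 101 are dirtied one second apart, so a 30-second syncer delay puts them
# into different wheel buckets.
writes = [(0, 100), (0, 500), (1, 101), (1, 501)]
DELAY = 30
MAXDELAY = 32

# Time-bucketed issue order (the syncer's view): grouped by arrival time.
buckets = {}
for sec, block in writes:
    buckets.setdefault((sec + DELAY) % MAXDELAY, []).append(block)
wheel_order = [blk for slot in sorted(buckets) for blk in buckets[slot]]

# Elevator-friendly order: the whole backlog sorted by block number, which
# would let blocks 100 and 101 coalesce into a single larger I/O.
sorted_order = sorted(block for _, block in writes)

print(wheel_order)   # [100, 500, 101, 501] -- adjacent blocks split apart
print(sorted_order)  # [100, 101, 500, 501] -- adjacent blocks together
```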

Also, it seems very likely that many installations won't have enough
buffers for 30 seconds' worth of changes, in which case there would be some
level of SYNCHRONOUS writing caused by this delay and the syncer process
getting backed up. This might happen once per second, as the buffers fill
before the syncer has started on that second's queue.
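A back-of-the-envelope check of that claim (every number here is
hypothetical, not taken from any real installation): with a 30-second
delay, the steady-state backlog held on the wheel is 30 seconds of dirty
pages, and the buffer cache must hold at least that much or writers start
blocking, i.e. writes become effectively synchronous.

```python
# Hypothetical workload numbers for illustration only.
dirty_pages_per_sec = 500      # sustained dirtying rate
page_size = 8192               # bytes per buffer
syncer_delay = 30              # seconds a dirty buffer waits on the wheel

# Steady-state backlog the wheel keeps in memory before anything is written.
backlog_pages = dirty_pages_per_sec * syncer_delay
backlog_bytes = backlog_pages * page_size

print(backlog_pages)                    # 15000 pages
print(backlog_bytes // (1024 * 1024))   # 117 (MB of cache tied up in backlog)
```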

Linux might handle this better. I saw some emails exchanged a year or so
ago about starting writes immediately in a low-priority way, but I'm not
sure whether those patches ever made it into the Linux kernel. The source I
had access to seems to do something analogous to FreeBSD, but using fixed
percentages of the dirty blocks or a minimum number of blocks; those appear
to be handled in LRU order, however.

On-disk caches are much, much larger these days, so some way of getting the
data out sooner should result in better write performance from the cache.
My newer drive is a 10K RPM IBM Ultrastar SCSI with a 4MB cache. I don't
see these caches getting smaller over time, so not letting the disk see
writes will become more and more of a performance drain.

- Curtis


