Re: Index Scans become Seq Scans after VACUUM ANALYSE - Mailing list pgsql-hackers

From Curt Sampson
Subject Re: Index Scans become Seq Scans after VACUUM ANALYSE
Date
Msg-id Pine.NEB.4.43.0206240057070.2100-100000@angelic.cynic.net
In response to Re: Index Scans become Seq Scans after VACUUM ANALYSE  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Index Scans become Seq Scans after VACUUM ANALYSE  ("J. R. Nield" <jrnield@usol.com>)
List pgsql-hackers
On Sun, 23 Jun 2002, Tom Lane wrote:

> Curt Sampson <cjs@cynic.net> writes:
> > This should also allow us to disable completely the ping-pong writes
> > if we have a disk subsystem that we trust.
>
> If we have a disk subsystem we trust, we just disable fsync on the
> WAL and the performance issue largely goes away.

No, you can't do this. If you don't fsync(), there's no guarantee
that the write ever got out of the computer's buffer cache and to
the disk subsystem in the first place.
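
To make that concrete, here is a minimal sketch of the write-then-fsync pattern. It is not PostgreSQL code; the helper name durable_append and the paths used with it are made up for illustration, and it assumes a POSIX system.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from PostgreSQL): append len bytes to the file
 * at path, and report success only after fsync() has returned.
 * Returns 0 on success, -1 on any error.
 */
int durable_append(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing; retry */
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t) n;
    }

    /*
     * write() returning only means the kernel accepted the data. Until
     * fsync() returns, it may still be sitting in the buffer cache and
     * never have been handed to the disk subsystem at all.
     */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

A caller would treat a transaction as committed only when something like durable_append("wal.log", rec, rec_len) returns 0. How far the data travels once fsync() returns is then up to the disk subsystem.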

> I concur with Bruce: the reason we keep page images in WAL is to
> minimize the number of places we have to fsync, and thus the amount of
> head movement required for a commit.

An fsync() does not necessarily cause head movement, or any real
disk writes at all. If you're writing to many external disk arrays,
for example, the fsync() ensures that the data are in the disk array's
non-volatile or UPS-backed RAM, no more. The array might hold the data
for quite some time before it actually writes it to disk.

But you're right that if you're going to write out changed pages anyway,
and the ping-pong file and the transaction log are on the same disk,
it's faster just to write the entire page to the transaction log.

So what we would really need to implement, if we wanted to be more
efficient with trusted disk subsystems, would be the option of writing
to the log either only the changed row (or the changed part of the
row) or the entire changed page. I don't know how hard this would be....
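
Roughly, the choice might look like the sketch below. Everything in it is an assumption made for illustration: the record layout, the 8192-byte page size, the trusted_storage flag and the function names are invented here and are not PostgreSQL's actual WAL format or settings.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 8192   /* assumed page size for this sketch */

enum log_kind { LOG_FULL_PAGE, LOG_ROW_DELTA };

/* Hypothetical log record: either a whole page image or just the changed bytes. */
struct log_record {
    enum log_kind kind;
    uint32_t page_no;
    uint16_t offset;    /* where the payload starts within the page */
    uint16_t length;    /* number of valid payload bytes */
    unsigned char payload[SKETCH_PAGE_SIZE];
};

/*
 * Build a record for a change to bytes [offset, offset + length) of page.
 * With trusted_storage set, only the changed bytes are logged; otherwise
 * the whole page image is logged. Returns 0 on success, -1 if the range
 * does not fit within a page.
 */
int build_log_record(struct log_record *rec, int trusted_storage,
                     uint32_t page_no, const unsigned char *page,
                     uint16_t offset, uint16_t length)
{
    if ((size_t) offset + length > SKETCH_PAGE_SIZE)
        return -1;

    rec->page_no = page_no;
    if (trusted_storage) {
        rec->kind = LOG_ROW_DELTA;
        rec->offset = offset;
        rec->length = length;
        memcpy(rec->payload, page + offset, length);
    } else {
        rec->kind = LOG_FULL_PAGE;
        rec->offset = 0;
        rec->length = SKETCH_PAGE_SIZE;
        memcpy(rec->payload, page, SKETCH_PAGE_SIZE);
    }
    return 0;
}

/* Bytes that would actually have to be written to the log for this record. */
size_t log_record_bytes(const struct log_record *rec)
{
    return offsetof(struct log_record, payload) + rec->length;
}
```

For a 40-byte row change, the delta record carries 40 payload bytes against 8192 for the full page image, which is where the saving would come from on storage we trust not to tear pages.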

> > Well, whether or not there's a cheap way depends on whether you consider
> > fsync to be cheap. :-)
>
> It's never cheap :-(

Actually, with a good external RAID system with non-volatile RAM,
it's a good two to four orders of magnitude cheaper than writing to a
directly connected disk that doesn't claim the write is complete until
it's physically on disk. I'd say that it qualifies as at least "not
expensive." Not that you want to do it more often than you have to
anyway....
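
Anyone who wants to see where their own storage falls can measure it. The following is a rough sketch, assuming a POSIX system; the function name mean_fsync_seconds, the 512-byte block size and the file path are arbitrary choices for illustration, and the result will vary with the filesystem and hardware.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical measurement helper: write a small block and fsync() it,
 * iterations times, and return the mean wall-clock seconds per
 * write+fsync pair. Returns -1.0 on any error.
 */
double mean_fsync_seconds(const char *path, int iterations)
{
    if (iterations <= 0)
        return -1.0;

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1.0;

    char block[512] = {0};
    struct timespec start, end;

    if (clock_gettime(CLOCK_MONOTONIC, &start) != 0) {
        close(fd);
        return -1.0;
    }

    for (int i = 0; i < iterations; i++) {
        /* Each pass forces the block out of the buffer cache. */
        if (write(fd, block, sizeof block) != (ssize_t) sizeof block ||
            fsync(fd) != 0) {
            close(fd);
            return -1.0;
        }
    }

    if (clock_gettime(CLOCK_MONOTONIC, &end) != 0) {
        close(fd);
        return -1.0;
    }
    close(fd);

    double elapsed = (double) (end.tv_sec - start.tv_sec)
                   + (double) (end.tv_nsec - start.tv_nsec) / 1e9;
    return elapsed / iterations;
}
```

Running it with a few hundred iterations against a file on a directly connected disk, and again on an array with non-volatile cache, gives a direct comparison of the per-fsync cost on each.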

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
 


