Re: Index Scans become Seq Scans after VACUUM ANALYSE - Mailing list pgsql-hackers

From J. R. Nield
Subject Re: Index Scans become Seq Scans after VACUUM ANALYSE
Date
Msg-id 1024855044.1793.414.camel@localhost.localdomain
In response to Re: Index Scans become Seq Scans after VACUUM ANALYSE  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Index Scans become Seq Scans after VACUUM ANALYSE  (Curt Sampson <cjs@cynic.net>)
Re: Index Scans become Seq Scans after VACUUM ANALYSE  (Bruce Momjian <pgman@candle.pha.pa.us>)
List pgsql-hackers
On Sun, 2002-06-23 at 11:19, Tom Lane wrote: 
> Curt Sampson <cjs@cynic.net> writes:
> > This should also allow us to disable completely the ping-pong writes
> > if we have a disk subsystem that we trust.
> 
> If we have a disk subsystem we trust, we just disable fsync on the
> WAL and the performance issue largely goes away.

It wouldn't work because the OS buffering interferes, and we need those
WAL records on disk up to the greatest LSN of the Buffer we will be writing.
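To make the ordering constraint concrete, here is a minimal sketch of the rule that the WAL must be durable up to a page's LSN before the page itself is written. `ToyWal` and `write_buffer` are hypothetical names invented for illustration, not PostgreSQL functions, and the "flush" is only simulated in memory.

```python
class ToyWal:
    """Hypothetical in-memory stand-in for a WAL; not PostgreSQL's API."""

    def __init__(self):
        self.records = []      # appended records, in LSN order
        self.flushed_lsn = 0   # highest LSN assumed to be durable
        self.flush_calls = 0   # how many simulated flushes actually happened

    def append(self, payload):
        """Add a record and return its LSN (1, 2, 3, ...)."""
        self.records.append(payload)
        return len(self.records)

    def flush_up_to(self, lsn):
        """Simulate forcing the log to disk through `lsn`; no-op if already done."""
        if lsn > self.flushed_lsn:
            self.flush_calls += 1
            self.flushed_lsn = lsn


def write_buffer(wal, page_lsn, do_write):
    """Write a data page only after the WAL is durable up to the page's LSN."""
    wal.flush_up_to(page_lsn)
    do_write()
```

With OS buffering between us and the disk, `flush_up_to` is only as good as the sync call underneath it, which is the point of the objection above.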


We already buffer WAL ourselves. We also already buffer regular pages.
Whenever we write a Buffer out of the buffer cache, it is because we
really want that page on disk and want to start an IO. If that's not
the case, then we should have more block buffers! 

So since we have all this buffering designed especially to meet our
needs, and since the OS buffering is in the way, can someone explain to
me why postgresql would ever open a file without the O_DSYNC flag if the
platform supports it? 
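As a rough illustration of what "open with O_DSYNC if the platform supports it" could look like, here is a small sketch. The helper names `open_sync` and `write_page` are made up for this example; the fallback to an explicit fsync() when O_DSYNC is not defined is my assumption about a reasonable default, not a description of what PostgreSQL does.

```python
import os


def open_sync(path):
    """Open `path` for writing, adding O_DSYNC when the platform defines it.

    Returns (fd, used_dsync) so the caller knows whether writes are
    already synchronous or need an explicit sync afterwards.
    """
    dsync = getattr(os, "O_DSYNC", 0)  # 0 where the flag does not exist
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | dsync, 0o600)
    return fd, bool(dsync)


def write_page(fd, used_dsync, offset, page):
    """Write one page at `offset`; sync explicitly if O_DSYNC was unavailable."""
    os.lseek(fd, offset, os.SEEK_SET)
    written = os.write(fd, page)
    if not used_dsync:
        os.fsync(fd)  # fallback: force the data out ourselves
    return written
```

A real caller would also have to handle short writes and errors, which this sketch ignores.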



> 
> I concur with Bruce: the reason we keep page images in WAL is to
> minimize the number of places we have to fsync, and thus the amount of
> head movement required for a commit.  Putting the page images elsewhere
> cannot be a win AFAICS.


Why not put all the page images in a single pre-allocated file and treat
it as a ring? How could this be any worse than flushing them to the
WAL? 

Maybe fsync would be slower with two files, but I don't see how
fdatasync would be, and most platforms support that. 
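Here is a toy version of the ring idea, just to show the shape of it: a fixed-size file of page slots, written in order with wrap-around, and one fdatasync() per batch. `PageImageRing` and the 8192-byte `PAGE_SIZE` are assumptions for the sketch. It does not track which slots are still needed for recovery, and ftruncate() may leave the file sparse, so a real pre-allocation would have to write the blocks out.

```python
import os

PAGE_SIZE = 8192  # assumed page size for this sketch


class PageImageRing:
    """Hypothetical fixed-size file of page-image slots, reused in a ring."""

    def __init__(self, path, slots):
        self.slots = slots
        self.next_slot = 0
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        # Sets the file length only; the file may still be sparse.
        os.ftruncate(self.fd, slots * PAGE_SIZE)

    def append(self, pages):
        """Write pages to consecutive slots (wrapping), then sync once."""
        used = []
        for page in pages:
            if len(page) != PAGE_SIZE:
                raise ValueError("page image must be exactly PAGE_SIZE bytes")
            os.lseek(self.fd, self.next_slot * PAGE_SIZE, os.SEEK_SET)
            os.write(self.fd, page)
            used.append(self.next_slot)
            self.next_slot = (self.next_slot + 1) % self.slots
        # fdatasync where available, otherwise fall back to fsync.
        getattr(os, "fdatasync", os.fsync)(self.fd)
        return used

    def read_slot(self, slot):
        """Read back the page image stored in `slot`."""
        os.lseek(self.fd, slot * PAGE_SIZE, os.SEEK_SET)
        return os.read(self.fd, PAGE_SIZE)

    def close(self):
        os.close(self.fd)
```

Since the file length never changes after creation, the sync should not need to update the file size, which is why I would expect fdatasync to be enough here.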

What would improve performance would be to have a dbflush process that
would work in the background flushing buffers in groups and trying to
stay ahead of ReadBuffer requests. That would let you do the temporary
side of the ping-pong as a huge O_DSYNC writev(2) request (or
fdatasync() once) and then write out the other buffers. It would also
tend to prevent the other backends from blocking on write requests. 
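A sketch of the grouping part of such a dbflush pass: sort the dirty blocks, write each contiguous run with a single writev(2), and sync once at the end. The function names and the dict-of-pages representation are invented for the example. It assumes each run stays below the platform's IOV_MAX, and it leaves out the background scheduling and the ping-pong copy entirely.

```python
import os

PAGE_SIZE = 8192  # assumed page size for this sketch


def contiguous_runs(block_numbers):
    """Group block numbers into sorted runs of consecutive blocks."""
    runs = []
    for blk in sorted(block_numbers):
        if runs and blk == runs[-1][-1] + 1:
            runs[-1].append(blk)
        else:
            runs.append([blk])
    return runs


def flush_dirty(fd, dirty):
    """Flush `dirty` (block number -> page bytes) to `fd`.

    Issues one writev per contiguous run and one sync for the whole
    batch. Returns the number of writev calls made and empties `dirty`.
    """
    calls = 0
    for run in contiguous_runs(dirty):
        os.lseek(fd, run[0] * PAGE_SIZE, os.SEEK_SET)
        written = os.writev(fd, [dirty[b] for b in run])
        if written != len(run) * PAGE_SIZE:
            raise OSError("short writev; a real flusher would retry")
        calls += 1
    # One sync for the batch: fdatasync where available, else fsync.
    getattr(os, "fdatasync", os.fsync)(fd)
    dirty.clear()
    return calls
```

For example, dirty blocks 0, 1, 2, 5, 7 and 8 turn into three writev calls instead of six separate writes.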

A dbflush could also support aio_read/aio_write on platforms like
Solaris and Windows NT that support it. 

Am I correct that right now, buffers only get written when they get
removed from the free list for reuse? So a released dirty buffer will
sit in the buffer free list until it becomes the Least Recently Used
buffer, and will then cause a backend to block for IO in a call to
BufferAlloc? 
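If that reading is right, the behaviour would look roughly like this toy model: dirty buffers are written only when they reach the LRU end and someone needs the slot, so the write lands on whichever caller is allocating. `ToyBufferPool` is a made-up class for illustration and is not meant to mirror the real bufmgr structures.

```python
from collections import OrderedDict


class ToyBufferPool:
    """Hypothetical LRU pool that writes dirty pages only on eviction."""

    def __init__(self, nbuffers, write_page):
        self.nbuffers = nbuffers
        self.write_page = write_page   # callback(block, data); the blocking IO
        self.buffers = OrderedDict()   # block -> [data, dirty], oldest first
        self.evict_writes = 0          # writes paid for by allocating callers

    def alloc(self, block, data=b""):
        """Return the buffer for `block`, evicting the LRU buffer if full."""
        if block in self.buffers:
            self.buffers.move_to_end(block)  # mark as most recently used
            return self.buffers[block]
        if len(self.buffers) >= self.nbuffers:
            victim, (vdata, vdirty) = self.buffers.popitem(last=False)
            if vdirty:
                # The caller of alloc() waits here for someone else's write.
                self.write_page(victim, vdata)
                self.evict_writes += 1
        entry = [data, False]
        self.buffers[block] = entry
        return entry

    def mark_dirty(self, block, data):
        """Store new contents for `block` and flag the buffer dirty."""
        entry = self.alloc(block)
        entry[0] = data
        entry[1] = True
```

After one backend dirties every buffer in the pool, each later alloc() from any backend pays for one write until the dirty buffers are gone.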

This would explain why we like using the OS buffer cache, and why our
performance is troublesome when we have to do synchronous IO writes, and
why fsync() takes so long to complete. All of the backends block for
each call to BufferAlloc() after a large table update by a single
backend, and then the OS buffers are always full of our "written" data. 

Am I reading the bufmgr code correctly? I already found an imaginary
race condition there once :-) 

;jnield 


> 
> > Well, whether or not there's a cheap way depends on whether you consider
> > fsync to be cheap. :-)
> 
> It's never cheap :-(
> 
-- 
J. R. Nield
jrnield@usol.com


