Re: Index Scans become Seq Scans after VACUUM ANALYSE - Mailing list pgsql-hackers
From:           J. R. Nield
Subject:        Re: Index Scans become Seq Scans after VACUUM ANALYSE
Date:
Msg-id:         1024855044.1793.414.camel@localhost.localdomain
In response to: Re: Index Scans become Seq Scans after VACUUM ANALYSE (Tom Lane <tgl@sss.pgh.pa.us>)
Responses:      Re: Index Scans become Seq Scans after VACUUM ANALYSE
                Re: Index Scans become Seq Scans after VACUUM ANALYSE
List:           pgsql-hackers
On Sun, 2002-06-23 at 11:19, Tom Lane wrote:
> Curt Sampson <cjs@cynic.net> writes:
> > This should also allow us to disable completely the ping-pong writes
> > if we have a disk subsystem that we trust.
>
> If we have a disk subsystem we trust, we just disable fsync on the
> WAL and the performance issue largely goes away.

It wouldn't work, because the OS buffering interferes: we need those WAL
records on disk up to the greatest LSN of the buffer we are about to
write.

We already buffer WAL ourselves. We also already buffer regular pages.
Whenever we write a buffer out of the buffer cache, it is because we
really want that page on disk and wanted to start an I/O. If that's not
the case, then we should have more block buffers!

So since we have all this buffering designed especially to meet our
needs, and since the OS buffering is in the way, can someone explain to
me why PostgreSQL would ever open a file without the O_DSYNC flag, if
the platform supports it?

>
> I concur with Bruce: the reason we keep page images in WAL is to
> minimize the number of places we have to fsync, and thus the amount of
> head movement required for a commit. Putting the page images elsewhere
> cannot be a win AFAICS.

Why not put all the page images in a single pre-allocated file and treat
it as a ring? How could this be any worse than flushing them in the WAL
log? Maybe fsync would be slower with two files, but I don't see how
fdatasync would be, and most platforms support that.

What would improve performance is a dbflush process that works in the
background, flushing buffers in groups and trying to stay ahead of
ReadBuffer requests. That would let you do the temporary side of the
ping-pong as one huge O_DSYNC writev(2) request (or one fdatasync())
and then write out the other buffers. It would also tend to prevent the
other backends from blocking on write requests. A dbflush could also
support aio_read/aio_write on platforms like Solaris and Windows NT
that support it.

Am I correct that right now, buffers only get written when they are
removed from the free list for reuse? So a released dirty buffer sits
in the buffer free list until it becomes the least recently used
buffer, and then causes a backend to block for I/O in a call to
BufferAlloc?

That would explain why we like using the OS buffer cache, why our
performance is troublesome when we have to do synchronous writes, and
why fsync() takes so long to complete: all of the backends block on
each call to BufferAlloc() after a large table update by a single
backend, and the OS buffers are always full of our "written" data.

Am I reading the bufmgr code correctly? I already found an imaginary
race condition there once :-)

;jnield

>
> > Well, whether or not there's a cheap way depends on whether you
> > consider fsync to be cheap. :-)
>
> It's never cheap :-(
>

--
J. R. Nield
jrnield@usol.com
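To make the O_DSYNC point above concrete, here is a rough, hypothetical
sketch (not PostgreSQL source): open a log file data-synchronously where
the platform provides O_DSYNC, and fall back to an explicit fdatasync()
otherwise. The file name and page size are made up for illustration.

    /*
     * Hypothetical sketch: write one log page so that it is on stable
     * storage when write() returns, instead of buffering in the OS
     * cache and flushing later.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define PAGESZ 8192

    int
    main(void)
    {
        static char page[PAGESZ];          /* one zero-filled log page */
        int         flags = O_RDWR | O_CREAT;
        int         fd;

    #ifdef O_DSYNC
        flags |= O_DSYNC;                  /* data-synchronous writes */
    #endif
        fd = open("walfile.test", flags, 0600);
        if (fd < 0)
        {
            perror("open");
            return 1;
        }

        if (write(fd, page, PAGESZ) != PAGESZ)
        {
            perror("write");
            return 1;
        }
    #ifndef O_DSYNC
        if (fdatasync(fd) != 0)            /* fall back to an explicit flush */
            perror("fdatasync");
    #endif
        close(fd);
        return 0;
    }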
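And a similarly hypothetical sketch of the "single pre-allocated file
treated as a ring" idea combined with the batched flush: a background
flusher pushes a whole group of page images in one synchronous
writev(2). RING_PAGES, flush_batch(), and the wrap-around policy are
all invented here for illustration; nothing below reflects actual
PostgreSQL code.

    #include <fcntl.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #define BLCKSZ     8192
    #define RING_PAGES 1024        /* size of the pre-allocated ring file */
    #define MAX_BATCH  16

    static off_t ring_pos = 0;     /* next free slot in the ring, in pages */

    /*
     * Write up to MAX_BATCH page images in one writev().  If the ring
     * file was opened with O_DSYNC, the data is on disk when writev()
     * returns, so no separate fsync() of this file is needed.
     */
    int
    flush_batch(int ring_fd, char *pages[], int npages)
    {
        struct iovec iov[MAX_BATCH];
        int          i;

        if (npages > MAX_BATCH)
            npages = MAX_BATCH;
        if (ring_pos + npages > RING_PAGES)   /* wrap instead of splitting */
            ring_pos = 0;

        for (i = 0; i < npages; i++)
        {
            iov[i].iov_base = pages[i];
            iov[i].iov_len = BLCKSZ;
        }

        if (lseek(ring_fd, ring_pos * BLCKSZ, SEEK_SET) < 0)
            return -1;
        if (writev(ring_fd, iov, npages) != (ssize_t) (npages * BLCKSZ))
            return -1;

        ring_pos += npages;
        return 0;
    }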