Re: Index Scans become Seq Scans after VACUUM ANALYSE - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: Index Scans become Seq Scans after VACUUM ANALYSE
Date
Msg-id 200206242125.g5OLPFG26140@candle.pha.pa.us
Whole thread Raw
In response to Re: Index Scans become Seq Scans after VACUUM ANALYSE  ("J. R. Nield" <jrnield@usol.com>)
Responses Re: Index Scans become Seq Scans after VACUUM ANALYSE  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
J. R. Nield wrote:
> > This I don't quite understand. Assuming you're using a SCSI drive
> > (and this mostly applies to ATAPI/IDE, too), you can do naught but
> > align block writes to the structure of the underlying device. When you
> > initiate a SCSI WRITE command, you start by telling the device at which
> > block to start writing and how many blocks you intend to write. Then you
> > start passing the data.
> > 
> 
> All I'm saying is that the entire postgresql block write must be
> converted into exactly one SCSI write command in all cases, and I don't
> know a portable way to ensure this. 

...

> I agree with this. My point was only that you need to know what
> guarantees your operating system/hardware combination provides on a
> case-by-case basis, and there is no standard way for a program to
> discover this. Most system administrators are not going to know this
> either, unless databases are their main responsibility.

Yes, agreed.  >1% are going to know the answer to this question so we
have to assume worst case.

> > It seems to me that it is doing this right now. In fact, it's more
> > reliable than some commerial systems (such as SQL Server) because it can
> > recover from a torn block with just the logfile.
> 
> Again, what I meant to say is that the commercial systems can recover
> with an old file backup + logs. How old the backup can be depends only
> on how much time you are willing to spend playing the logs forward. So
> if you do a full backup once a week, and multiplex and backup the logs,
> then even if a backup tape gets destroyed you can still survive. It just
> takes longer.
> 
> Also, postgreSQL can't recover from any other type of block corruption,
> while the commercial systems can. That's what I meant by the "critical
> production use" comment, which was sort-of unfair.
> 
> So I would say they are equally reliable for torn pages (but not bad
> blocks), and the commercial systems let you trade potential recovery
> time for not having to write the blocks twice. You do need to back-up
> the log archives though.

Yes, good tradeoff analysis.  We recover from partial writes quicker,
and don't require saving of log files, _but_ we don't recover from bad
disk blocks.  Good summary.

> I'll back off on that. I don't know if we want to use the OS buffer
> manager, but shouldn't we try to have our buffer manager group writes
> together by files, and pro-actively get them out to disk? Right now, it
> looks like all our write requests are delayed as long as possible and
> the order in which they are written is pretty-much random, as is the
> backend that writes the block, so there is no locality of reference even
> when the blocks are adjacent on disk, and the write calls are spread-out
> over all the backends.
> 
> Would it not be the case that things like read-ahead, grouping writes,
> and caching written data are probably best done by PostgreSQL, because
> only our buffer manager can understand when they will be useful or when
> they will thrash the cache?

The OS should handle all of this.  We are doing main table writes but no
sync until checkpoint, so the OS can keep those blocks around and write
them at its convenience.  It knows the size of the buffer cache and when
stuff is forced to disk.  We can't second-guess that.

> I may likely be wrong on this, and I haven't done any performance
> testing. I shouldn't have brought this up alongside the logging issues,
> but there seemed to be some question about whether the OS was actually
> doing all these things behind the scene.

It had better.  Looking at the kernel source is the way to know.

> Does anyone know what the major barriers to infinite log replay are in
> PostgreSQL? I'm trying to look for everything that might need to be
> changed outside xlog.c, but surely this has come up before. Searching
> the archives hasn't revealed much.

This has been brought up.  Could we just save WAL files and get replay? 
I believe some things have to be added to WAL to allow this, but it
seems possible.  However, the pg_dump is just a data dump and does not
have the file offsets and things.  Somehow you would need a tar-type
backup of the database, and with a running db, it is hard to get a valid
snapshot of that.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 




pgsql-hackers by date:

Previous
From: "J. R. Nield"
Date:
Subject: Re: Index Scans become Seq Scans after VACUUM ANALYSE
Next
From: Tom Lane
Date:
Subject: Nonrecursive ALTER TABLE ADD/RENAME COLUMN is wrong