Re: Index Scans become Seq Scans after VACUUM ANALYSE - Mailing list pgsql-hackers

From: J. R. Nield
Subject: Re: Index Scans become Seq Scans after VACUUM ANALYSE
Date:
Msg-id: 1024951786.1793.865.camel@localhost.localdomain
In response to: Re: Index Scans become Seq Scans after VACUUM ANALYSE  (Curt Sampson <cjs@cynic.net>)
Responses: Re: Index Scans become Seq Scans after VACUUM ANALYSE  (Bruce Momjian <pgman@candle.pha.pa.us>)
           Re: Index Scans become Seq Scans after VACUUM ANALYSE  (Curt Sampson <cjs@cynic.net>)
           Buffer Management  (Curt Sampson <cjs@cynic.net>)
List: pgsql-hackers
On Sun, 2002-06-23 at 23:40, Curt Sampson wrote:
> On 23 Jun 2002, J. R. Nield wrote:
> 
> > It is impossible to do what you want. You cannot protect against
> > partial writes without writing pages twice and calling fdatasync
> > between them while going through a generic filesystem.
> 
> I agree with this.
> 
> > The best disk array will not protect you if the operating system does
> > not align block writes to the structure of the underlying device.
> 
> This I don't quite understand. Assuming you're using a SCSI drive
> (and this mostly applies to ATAPI/IDE, too), you can do naught but
> align block writes to the structure of the underlying device. When you
> initiate a SCSI WRITE command, you start by telling the device at which
> block to start writing and how many blocks you intend to write. Then you
> start passing the data.
> 

All I'm saying is that the entire PostgreSQL block write must be
converted into exactly one SCSI write command in every case, and I don't
know of a portable way to ensure this.
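
To make that concrete, the only portable scheme I know of is a careful
double write through the filesystem, along the lines of the sketch below.
The function and the scratch ("ping-pong") area are invented for
illustration; a real implementation would also have to record which block
the scratch copy belongs to.

#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ 8192

/*
 * Careful write (sketch): put the page in a scratch area and force it
 * to disk before overwriting the real block.  If the in-place write is
 * torn by a crash, recovery can restore the page from the scratch copy.
 */
int
careful_write(int scratch_fd, int data_fd, off_t blkno, const char *page)
{
    if (pwrite(scratch_fd, page, BLCKSZ, 0) != BLCKSZ)
        return -1;
    if (fdatasync(scratch_fd) != 0)     /* scratch copy is now durable */
        return -1;
    if (pwrite(data_fd, page, BLCKSZ, blkno * BLCKSZ) != BLCKSZ)
        return -1;
    if (fdatasync(data_fd) != 0)        /* in-place copy is now durable */
        return -1;
    return 0;
}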

> > Even with raw devices, you need special support or knowledge of the
> > operating system and/or the disk device to ensure that each write
> > request will be atomic to the underlying hardware.
> 
> Well, so here I guess you're talking about two things:
> 
>     1. When you request, say, an 8K block write, will the OS really
>     write it to disk in a single 8K or multiple of 8K SCSI write
>     command?
> 
>     2. Does the SCSI device you're writing to consider these writes to
>     be transactional. That is, if the write is interrupted before being
>     completed, does the SCSI device guarantee that the partially-sent
>     data is not written, and the old data is maintained? And of course,
>     does it guarantee that, when it acknowledges a write, that write is
>     now in stable storage and will never go away?
> 
> Both of these are not hard to guarantee, actually. For a BSD-based OS,
> for example, just make sure that your filesystem block size is the
> same as or a multiple of the database block size. BSD will never write
> anything other than a block or a sequence of blocks to a disk in a
> single SCSI transaction (unless you've got a really odd SCSI driver).
> And for your disk, buy a Baydel or Clarion disk array, or something
> similar.
> 
> Given that it's not hard to set up a system that meets these criteria,
> and this is in fact commonly done for database servers, it would seem a
> good idea for postgres to have the option to take advantage of the time
> and money spent and adjust its performance upward appropriately.

I agree with this. My point was only that you need to know what
guarantees your operating system/hardware combination provides on a
case-by-case basis, and there is no standard way for a program to
discover this. Most system administrators are not going to know this
either, unless databases are their main responsibility.

> 
> > All other systems rely on the fact that you can recover a damaged file
> > using the log archive.
> 
> Not exactly. For MS SQL Server, at any rate, if it detects a page tear
> you cannot restore based on the log file alone. You need a full or
> partial backup that includes that entire torn block.
> 

I should have been more specific: you need a backup of the file from
some time ago, plus all the archived logs from then until the current
log sequence number.

> > This means downtime in the rare case, but no data loss. Until
> > PostgreSQL can do this, then it will not be acceptable for real
> > critical production use.
> 
> It seems to me that it is doing this right now. In fact, it's more
> reliable than some commercial systems (such as SQL Server) because it can
> recover from a torn block with just the logfile.

Again, what I meant to say is that the commercial systems can recover
with an old file backup plus the logs. How old the backup can be depends
only on how much time you are willing to spend replaying the logs
forward. So if you do a full backup once a week, and multiplex and back
up the logs, then even if a backup tape gets destroyed you can still
survive; it just takes longer.

Also, PostgreSQL can't recover from any other type of block corruption,
while the commercial systems can. That's what I meant by the "critical
production use" comment, which was somewhat unfair.

So I would say they are equally reliable for torn pages (but not for bad
blocks), and the commercial systems let you trade potential recovery
time for not having to write the blocks twice. You do need to back up
the log archives, though.

> 
> > But at the end of the day, unless you have complete understanding of
> > the I/O system from write(2) through to the disk system, the only sure
> > ways to protect against partial writes are by "careful writes" (in
> > the WAL log or elsewhere, writing pages twice), or by requiring (and
> > allowing) users to do log-replay recovery when a file is corrupted by
> > a partial write.
> 
> I don't understand how, without a copy of the old data that was in the
> torn block, you can restore that block from just log file entries. Can
> you explain this to me? Take, as an example, a block with ten tuples,
> only one of which has been changed "recently." (I.e., only that change
> is in the log files.)
>
> 
> > If we log pages to WAL, they are useless when archived (after a
> > checkpoint). So either we have a separate "log" for them (the
> > ping-pong file), or we should at least remove them when archived,
> > which makes log archiving more complex but is perfectly doable.
> 
> Right. That seems to me a better option, since we've now got only one
> write point on the disk rather than two.

OK. I agree with this now.

> 
> > Finally, I would love to hear why we are using the operating system
> > buffer manager at all. The OS is acting as a secondary buffer manager
> > for us. Why is that? What flaw in our I/O system does this reveal?
> 
> It's acting as a "second-level" buffer manager, yes, but to say it's
> "secondary" may be a bit misleading. On most of the systems I've set
> up, the OS buffer cache is doing the vast majority of the work, and the
> postgres buffering is fairly minimal.
> 
> There are some good (and some perhaps not-so-good) reasons to do it this
> way. I'll list them more or less in the order of best to worst:
> 
>     1. The OS knows where the blocks physically reside on disk, and
>     postgres does not. Therefore it's in the interest of postgresql to
>     dispatch write responsibility back to the OS as quickly as possible
>     so that the OS can prioritize requests appropriately. Most operating
>     systems use an "elevator" algorithm to minimize disk head movement;
>     but if the OS does not have a block that it could write while the
>     head is "on the way" to another request, it can't write it in that
>     head pass.
> 
>     2. Postgres does not know about any "bank-switching" tricks for
>     mapping more physical memory than it has address space. Thus, on
>     32-bit machines, postgres might be limited to mapping 2 or 3 GB of
>     memory, even though the machine has, say, 6 GB of physical RAM. The
>     OS can use all of the available memory for caching; postgres cannot.
> 
>     3. A lot of work has been put into the seek algorithms, read-ahead
>     algorithms, block allocation algorithms, etc. in the OS. Why
>     duplicate all that work again in postgres?
> 
> When you say things like the following:
> 
> > We should only be writing blocks when they need to be on disk. We
> > should not be expecting the OS to write them "sometime later" and
> > avoid blocking (as long) for the write. If we need that, then our
> > buffer management is wrong and we need to fix it.
> 
> you appear to be making the argument that we should take the route of
> other database systems, and use raw devices and our own management of
> disk block allocation. If so, you might want first to look back through
> the archives at the discussion I and several others had about this a
> month or two ago. After looking in detail at what NetBSD, at least, does
> in terms of its disk I/O algorithms and buffering, I've pretty much come
> around, at least for the moment, to the attitude that we should stick
> with using the OS. I wouldn't mind seeing postgres be able to manage all
> of this stuff, but it's a *lot* of work for not all that much benefit
> that I can see.

I'll back off on that. I don't know if we want to bypass the OS buffer
manager, but shouldn't our own buffer manager group writes together by
file and proactively get them out to disk? Right now, every write
request is delayed as long as possible, the order in which blocks are
written is pretty much random, and so is the backend that ends up
writing each block. There is no locality of reference even when the
blocks are adjacent on disk, and the write calls are spread out over all
the backends.

Would it not be the case that things like read-ahead, grouping writes,
and caching written data are probably best done by PostgreSQL, because
only our buffer manager can understand when they will be useful or when
they will thrash the cache?
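
To be concrete about the write-grouping part, I imagine something like
sorting the dirty buffers by file and block number before issuing the
writes, so adjacent blocks go out together. A minimal sketch, with an
invented structure rather than our real buffer descriptors:

#include <stdlib.h>

/* Invented, simplified stand-in for a dirty-buffer entry. */
typedef struct DirtyBuf
{
    unsigned    relfile;        /* which relation file */
    unsigned    blocknum;       /* block within that file */
    char       *page;           /* BLCKSZ bytes of data */
} DirtyBuf;

static int
dirtybuf_cmp(const void *a, const void *b)
{
    const DirtyBuf *x = a;
    const DirtyBuf *y = b;

    if (x->relfile != y->relfile)
        return (x->relfile < y->relfile) ? -1 : 1;
    if (x->blocknum != y->blocknum)
        return (x->blocknum < y->blocknum) ? -1 : 1;
    return 0;
}

/* Sort so that writes to the same file come out in block order. */
void
sort_dirty_buffers(DirtyBuf *bufs, size_t n)
{
    qsort(bufs, n, sizeof(DirtyBuf), dirtybuf_cmp);
}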

I may well be wrong on this, and I haven't done any performance testing.
I shouldn't have brought this up alongside the logging issues, but there
seemed to be some question about whether the OS was actually doing all
of these things behind the scenes.


> 
> > The ORACLE people were not kidding when they said that they could not
> > certify Linux for production use until it supported O_DSYNC. Can you
> > explain why that was the case?
> 
> I'm suspecting it's because Linux at the time had no raw devices, so
> O_DSYNC was the only other possible method of making sure that disk
> writes actually got to disk.
> 
> You certainly don't want to use O_DSYNC if you can use another method,
> because O_DSYNC still goes through the operating system's buffer
> cache, wasting memory and double-caching things. If you're doing your
> own management, you need either to use a raw device or open files with
> the flag that indicates that the buffer cache should not be used at all
> for reads from and writes to that file.

Would O_DSYNC|O_RSYNC turn off the cache? 
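
To make the question concrete, here is the sort of open call I have in
mind (a sketch only). O_DIRECT is a platform-specific flag that may not
exist at all, which is why it is wrapped in an #ifdef:

#include <fcntl.h>

/*
 * Open a log file so that each write() does not return until the data
 * is on stable storage (O_DSYNC).  Whether anything here bypasses the
 * OS buffer cache is exactly the open question above.
 */
int
open_log_sync(const char *path)
{
    int         flags = O_WRONLY | O_CREAT | O_DSYNC;

#ifdef O_DIRECT
    /* Platform-specific assumption: also bypass the buffer cache. */
    flags |= O_DIRECT;
#endif

    return open(path, flags, 0600);
}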

> 
> > However, this discussion and a search of the pgsql-hackers archives
> > reveals this problem to be the KEY area of PostgreSQL's failing, and
> > general misunderstanding, when compared to its commercial competitors.
> 
> No, I think it's just that you're under a few minor misapprehensions
> here about what postgres and the OS are actually doing. As I said, I
> went through this whole exact argument a month or two ago, on this very
> list, and I came around to the idea that what postgres is doing now
> works quite well, at least on NetBSD. (Most other OSes have disk I/O
> algorithms that are pretty much as good or better.) There might be a
> very slight advantage to doing all one's own I/O management, but it's
> a huge amount of work, and I think that much effort could be much more
> usefully applied to other areas.

I will look for that discussion in the archives.

The logging issue is a key one, I think. At least, I would be very
nervous as a DBA if I were running a system where any damaged file could
cause data loss.

Does anyone know what the major barriers to infinite log replay are in
PostgreSQL? I'm trying to look for everything that might need to be
changed outside xlog.c, but surely this has come up before. Searching
the archives hasn't revealed much.
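
The first piece I can see is that we would have to copy each completed
log segment somewhere safe before it is recycled, along these lines.
This is only a sketch: the archive path is invented, and shelling out to
cp is just the simplest way to show the idea, not a proposal for how
xlog.c should actually do it.

#include <stdio.h>
#include <stdlib.h>

/*
 * Sketch: copy a completed WAL segment to an archive area before the
 * segment file is recycled, so old logs remain available for replay.
 */
int
archive_wal_segment(const char *segname)
{
    char        cmd[1024];

    snprintf(cmd, sizeof(cmd),
             "cp pg_xlog/%s /wal_archive/%s", segname, segname);
    return (system(cmd) == 0) ? 0 : -1;
}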



As to the I/O issue:

Since you know a lot about NetBSD internals, I'd be interested in
hearing about what PostgreSQL looks like to the NetBSD buffer manager.
Am I right that strings of successive writes get randomized? What do our
cache-hit percentages look like? I'm going to do some experimenting with
this.
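
The sort of experiment I have in mind is roughly this: time the same set
of 8K block writes issued in file order and then in random order, and
see how much the ordering matters once the OS and the disk get involved.
A rough sketch only:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

#define BLCKSZ  8192
#define NBLOCKS 4096

static double
now_sec(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1000000.0;
}

/* Time NBLOCKS 8K block writes, either in file order or at random. */
static double
time_writes(int fd, int randomize)
{
    static char page[BLCKSZ];
    double      start = now_sec();
    int         i;

    for (i = 0; i < NBLOCKS; i++)
    {
        off_t       blk = randomize ? (rand() % NBLOCKS) : i;

        pwrite(fd, page, BLCKSZ, blk * (off_t) BLCKSZ);
    }
    fdatasync(fd);
    return now_sec() - start;
}

int
main(void)
{
    int         fd = open("scratch.dat", O_RDWR | O_CREAT, 0600);

    if (fd < 0)
        return 1;
    ftruncate(fd, (off_t) NBLOCKS * BLCKSZ);
    printf("sequential: %.2f s\n", time_writes(fd, 0));
    printf("random:     %.2f s\n", time_writes(fd, 1));
    close(fd);
    return 0;
}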

> 
> Just as a side note, I've been a NetBSD developer since about '96,
> and have been delving into the details of OS design since well before
> that time, so I'm coming to this with what I hope is reasonably good
> knowledge of how disks work and how operating systems use them. (Not
> that this should stop you from pointing out holes in my arguments. :-))
> 

This stuff is very difficult to get right. Glad to know you follow this
list.


> cjs
> -- 
> Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
>     Don't you know, in this new Dark Age, we're all light.  --XTC
> 
-- 
J. R. Nield
jrnield@usol.com