Home > mailing lists
Re: Index Scans become Seq Scans after VACUUM ANALYSE - Mailing list pgsql-hackers

From	Curt Sampson
Subject	Re: Index Scans become Seq Scans after VACUUM ANALYSE
Date	June 24, 2002 12:18:29
Msg-id	Pine.NEB.4.43.0206241150500.7326-100000@angelic.cynic.net Whole thread Raw
In response to	Re: Index Scans become Seq Scans after VACUUM ANALYSE ("J. R. Nield" <jrnield@usol.com>)
Responses	Re: Index Scans become Seq Scans after VACUUM ANALYSE
List	pgsql-hackers
Tree view
On 23 Jun 2002, J. R. Nield wrote:

> If is impossible to do what you want. You can not protect against
> partial writes without writing pages twice and calling fdatasync
> between them while going through a generic filesystem.

I agree with this.

> The best disk array will not protect you if the operating system does
> not align block writes to the structure of the underlying device.

This I don't quite understand. Assuming you're using a SCSI drive
(and this mostly applies to ATAPI/IDE, too), you can do naught but
align block writes to the structure of the underlying device. When you
initiate a SCSI WRITE command, you start by telling the device at which
block to start writing and how many blocks you intend to write. Then you
start passing the data.

(See http://www.danbbs.dk/~dino/SCSI/SCSI2-09.html#9.2.21 for parameter
details for the SCSI WRITE(10) command. You may find the SCSI 2
specification, at http://www.danbbs.dk/~dino/SCSI/ to be a useful
reference here.)

> Even with raw devices, you need special support or knowledge of the
> operating system and/or the disk device to ensure that each write
> request will be atomic to the underlying hardware.

Well, so here I guess you're talking about two things:
   1. When you request, say, an 8K block write, will the OS really   write it to disk in a single 8K or multiple of 8K
SCSIwrite   command?
 
   2. Does the SCSI device you're writing to consider these writes to   be transactional. That is, if the write is
interruptedbefore being   completed, does the SCSI device guarantee that the partially-sent   data is not written, and
theold data is maintained? And of course,   does it guarantee that, when it acknowledges a write, that write is   now
instable storage and will never go away?
 

Both of these are not hard to guarantee, actually. For a BSD-based OS,
for example, just make sure that your filesystem block size is the
same as or a multiple of the database block size. BSD will never write
anything other than a block or a sequence of blocks to a disk in a
single SCSI transaction (unless you've got a really odd SCSI driver).
And for your disk, buy a Baydel or Clarion disk array, or something
similar.

Given that it's not hard to set up a system that meets these criteria,
and this is in fact commonly done for database servers, it would seem a
good idea for postgres to have the option to take advantage of the time
and money spent and adjust its performance upward appropriately.

> All other systems rely on the fact that you can recover a damaged file
> using the log archive.

Not exactly. For MS SQL Server, at any rate, if it detects a page tear
you cannot restore based on the log file alone. You need a full or
partial backup that includes that entire torn block.

> This means downtime in the rare case, but no data loss. Until
> PostgreSQL can do this, then it will not be acceptable for real
> critical production use.

It seems to me that it is doing this right now. In fact, it's more
reliable than some commerial systems (such as SQL Server) because it can
recover from a torn block with just the logfile.

> But at the end of the day, unless you have complete understanding of
> the I/O system from write(2) through to the disk system, the only sure
> ways to protect against partial writes are by "careful writes" (in
> the WAL log or elsewhere, writing pages twice), or by requiring (and
> allowing) users to do log-replay recovery when a file is corrupted by
> a partial write.

I don't understand how, without a copy of the old data that was in the
torn block, you can restore that block from just log file entries. Can
you explain this to me? Take, as an example, a block with ten tuples,
only one of which has been changed "recently." (I.e., only that change
is in the log files.)

> If we log pages to WAL, they are useless when archived (after a
> checkpoint). So either we have a separate "log" for them (the
> ping-pong file), or we should at least remove them when archived,
> which makes log archiving more complex but is perfectly doable.

Right. That seems to me a better option, since we've now got only one
write point on the disk rather than two.

> Finally, I would love to hear why we are using the operating system
> buffer manager at all. The OS is acting as a secondary buffer manager
> for us. Why is that? What flaw in our I/O system does this reveal?

It's acting as a "second-level" buffer manager, yes, but to say it's
"secondary" may be a bit misleading. On most of the systems I've set
up, the OS buffer cache is doing the vast majority of the work, and the
postgres buffering is fairly minimal.

There are some good (and some perhaps not-so-good) reasons to do it this
way. I'll list them more or less in the order of best to worst:
   1. The OS knows where the blocks physically reside on disk, and   postgres does not. Therefore it's in the interest
ofpostgresql to   dispatch write responsibility back to the OS as quickly as possible   so that the OS can prioritize
requestsappropriately. Most operating   systems use an "elevator" algorithm to minimize disk head movement;   but if
theOS does not have a block that it could write while the   head is "on the way" to another request, it can't write it
inthat   head pass.
 
   2. Postgres does not know about any "bank-switching" tricks for   mapping more physical memory than it has address
space.Thus, on   32-bit machines, postgres might be limited to mapping 2 or 3 GB of   memory, even though the machine
has,say, 6 GB of physical RAM. The   OS can use all of the available memory for caching; postgres cannot.
 
   3. A lot of work has been put into the seek algorithms, read-ahead   algorithms, block allocation algorithms, etc.
inthe OS. Why   duplicate all that work again in postgres?
 

When you say things like the following:

> We should only be writing blocks when they need to be on disk. We
> should not be expecting the OS to write them "sometime later" and
> avoid blocking (as long) for the write. If we need that, then our
> buffer management is wrong and we need to fix it.

you appear to be making the arugment that we should take the route of
other database systems, and use raw devices and our own management of
disk block allocation. If so, you might want first to look back through
the archives at the discussion I and several others had about this a
month or two ago. After looking in detail at what NetBSD, at least, does
in terms of its disk I/O algorithms and buffering, I've pretty much come
around, at least for the moment, to the attitude that we should stick
with using the OS. I wouldn't mind seeing postgres be able to manage all
of this stuff, but it's a *lot* of work for not all that much benefit
that I can see.

> The ORACLE people were not kidding when they said that they could not
> certify Linux for production use until it supported O_DSYNC. Can you
> explain why that was the case?

I'm suspecting it's because Linux at the time had no raw devices, so
O_DSYNC was the only other possible method of making sure that disk
writes actually got to disk.

You certainly don't want to use O_DSYNC if you can use another method,
because O_DSYNC still goes through the the operating system's buffer
cache, wasting memory and double-caching things. If you're doing your
own management, you need either to use a raw device or open files with
the flag that indicates that the buffer cache should not be used at all
for reads from and writes to that file.

> However, this discussion and a search of the pgsql-hackers archives
> reveals this problem to be the KEY area of PostgreSQL's failing, and
> general misunderstanding, when compared to its commercial competitors.

No, I think it's just that you're under a few minor misapprehensions
here about what postgres and the OS are actually doing. As I said, I
went through this whole exact argument a month or two ago, on this very
list, and I came around to the idea that what postgres is doing now
works quite well, at least on NetBSD. (Most other OSes have disk I/O
algorithms that are pretty much as good or better.) There might be a
very slight advantage to doing all one's own I/O management, but it's
a huge amount of work, and I think that much effort could be much more
usefully applied to other areas.

Just as a side note, I've been a NetBSD developer since about '96,
and have been delving into the details of OS design since well before
that time, so I'm coming to this with what I hope is reasonably good
knowledge of how disks work and how operating systems use them. (Not
that this should stop you from pointing out holes in my arguments. :-))

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org   Don't you know, in this new Dark Age, we're
alllight.  --XTC
pgsql-hackers by date:
From: Stephan Szabo
Date: 24 June 2002, 12:12:22
Subject: Re: pg_dump and ALTER TABLE / ADD FOREIGN KEY
From: Peter Eisentraut
Date: 24 June 2002, 12:20:06
Subject: Re: SQL99, CREATE CAST, and initdb
Re: Index Scans become Seq Scans after VACUUM ANALYSE - Mailing list pgsql-hackers

Previous

Next