Re: Index Scans become Seq Scans after VACUUM ANALYSE - Mailing list pgsql-hackers
From:           Curt Sampson
Subject:        Re: Index Scans become Seq Scans after VACUUM ANALYSE
Date:
Msg-id:         Pine.NEB.4.43.0206241150500.7326-100000@angelic.cynic.net
In response to: Re: Index Scans become Seq Scans after VACUUM ANALYSE ("J. R. Nield" <jrnield@usol.com>)
Responses:      Re: Index Scans become Seq Scans after VACUUM ANALYSE
List:           pgsql-hackers
On 23 Jun 2002, J. R. Nield wrote:

> It is impossible to do what you want. You can not protect against
> partial writes without writing pages twice and calling fdatasync
> between them while going through a generic filesystem.

I agree with this.

> The best disk array will not protect you if the operating system does
> not align block writes to the structure of the underlying device.

This I don't quite understand. Assuming you're using a SCSI drive (and
this mostly applies to ATAPI/IDE, too), you can do naught but align
block writes to the structure of the underlying device. When you
initiate a SCSI WRITE command, you start by telling the device at which
block to start writing and how many blocks you intend to write. Then
you start passing the data.

(See http://www.danbbs.dk/~dino/SCSI/SCSI2-09.html#9.2.21 for parameter
details for the SCSI WRITE(10) command. You may find the SCSI 2
specification, at http://www.danbbs.dk/~dino/SCSI/, to be a useful
reference here.)

> Even with raw devices, you need special support or knowledge of the
> operating system and/or the disk device to ensure that each write
> request will be atomic to the underlying hardware.

Well, so here I guess you're talking about two things:

1. When you request, say, an 8K block write, will the OS really write
   it to disk in a single 8K or multiple-of-8K SCSI write command?

2. Does the SCSI device you're writing to consider these writes to be
   transactional? That is, if the write is interrupted before being
   completed, does the SCSI device guarantee that the partially-sent
   data is not written, and the old data is maintained? And of course,
   does it guarantee that, when it acknowledges a write, that write is
   now in stable storage and will never go away?

Both of these are not hard to guarantee, actually. For a BSD-based OS,
for example, just make sure that your filesystem block size is the same
as or a multiple of the database block size.
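The invariant being relied on above is that the database only ever
issues whole pages at page-aligned offsets, so each write maps onto a
whole number of filesystem (and hence device) blocks. A minimal sketch
of that discipline, assuming a hypothetical 8K page size (the file name
and helper names are illustrative, not from any real system):

```python
import os
import tempfile

PAGE_SIZE = 8192  # assumed database block size

def write_page(fd, page_no, data):
    """Write exactly one page at a page-aligned offset.

    Never issues a partial page, so the OS can map the request onto
    whole filesystem blocks (given fs block size divides PAGE_SIZE
    or vice versa)."""
    assert len(data) == PAGE_SIZE, "only whole pages are ever written"
    os.pwrite(fd, data, page_no * PAGE_SIZE)  # offset is a PAGE_SIZE multiple

def read_page(fd, page_no):
    return os.pread(fd, PAGE_SIZE, page_no * PAGE_SIZE)

# Usage: write page 3 of a scratch file, read it back.
fd, _path = tempfile.mkstemp()
page = bytes([0xAB]) * PAGE_SIZE
write_page(fd, 3, page)
assert read_page(fd, 3) == page
os.close(fd)
```

Whether each such write then reaches the platter atomically is the
drive/array's side of the bargain, as discussed above.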
BSD will never write anything other than a block or a sequence of
blocks to a disk in a single SCSI transaction (unless you've got a
really odd SCSI driver). And for your disk, buy a Baydel or Clarion
disk array, or something similar.

Given that it's not hard to set up a system that meets these criteria,
and this is in fact commonly done for database servers, it would seem a
good idea for postgres to have the option to take advantage of the time
and money spent, and adjust its performance upward appropriately.

> All other systems rely on the fact that you can recover a damaged file
> using the log archive.

Not exactly. For MS SQL Server, at any rate, if it detects a page tear
you cannot restore based on the log file alone. You need a full or
partial backup that includes that entire torn block.

> This means downtime in the rare case, but no data loss. Until
> PostgreSQL can do this, then it will not be acceptable for real
> critical production use.

It seems to me that it is doing this right now. In fact, it's more
reliable than some commercial systems (such as SQL Server) because it
can recover from a torn block with just the log file.

> But at the end of the day, unless you have complete understanding of
> the I/O system from write(2) through to the disk system, the only sure
> ways to protect against partial writes are by "careful writes" (in
> the WAL log or elsewhere, writing pages twice), or by requiring (and
> allowing) users to do log-replay recovery when a file is corrupted by
> a partial write.

I don't understand how, without a copy of the old data that was in the
torn block, you can restore that block from just log file entries. Can
you explain this to me? Take, as an example, a block with ten tuples,
only one of which has been changed "recently." (I.e., only that change
is in the log files.)

> If we log pages to WAL, they are useless when archived (after a
> checkpoint).
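The ten-tuple question can be made concrete with a small simulation.
The sketch below (all names and the slotted-page layout are hypothetical,
chosen only for illustration) shows why a tuple-level log record is not
enough on its own: an update that shifts other tuples on the page leaves
a torn page whose lost half the logical log record says nothing about,
whereas a saved full-page image recovers it trivially.

```python
import hashlib

PAGE_SIZE = 8192

def page_checksum(page: bytes) -> bytes:
    return hashlib.md5(page).digest()

# Old page: ten 800-byte tuples packed toward the end of the page.
def pack(tuples):
    body = b"".join(tuples)
    return bytes(PAGE_SIZE - len(body)) + body  # pad front, tuples at end

old_tuples = [bytes([i]) * 800 for i in range(10)]
old_page = pack(old_tuples)

# Tuple 7 grows from 800 to 900 bytes, so *every* tuple's position on
# the page shifts -- a byte-level consequence the logical log record
# ("tuple 7 := X") does not describe.
new_tuples = list(old_tuples)
new_tuples[7] = bytes([0x77]) * 900
new_page = pack(new_tuples)

# Torn write: the first half of the new page hit disk, the second half
# is still the old contents.
torn = new_page[:PAGE_SIZE // 2] + old_page[PAGE_SIZE // 2:]

# The tear is detectable, but not repairable from the logical record
# alone: the record doesn't say where the other nine tuples now live.
assert page_checksum(torn) != page_checksum(new_page)

# With a copy of the whole page (a full-page image saved at first
# modification, or a backup of the old block), recovery is trivial.
full_page_image = new_page
assert full_page_image == new_page
```

This is consistent with the claim above that SQL Server needs a backup
containing the torn block, not just its log.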
> So either we have a separate "log" for them (the ping-pong file), or
> we should at least remove them when archived, which makes log
> archiving more complex but is perfectly doable.

Right. That seems to me a better option, since we've now got only one
write point on the disk rather than two.

> Finally, I would love to hear why we are using the operating system
> buffer manager at all. The OS is acting as a secondary buffer manager
> for us. Why is that? What flaw in our I/O system does this reveal?

It's acting as a "second-level" buffer manager, yes, but to say it's
"secondary" may be a bit misleading. On most of the systems I've set
up, the OS buffer cache is doing the vast majority of the work, and the
postgres buffering is fairly minimal.

There are some good (and some perhaps not-so-good) reasons to do it
this way. I'll list them more or less in order from best to worst:

1. The OS knows where the blocks physically reside on disk, and
   postgres does not. Therefore it's in the interest of postgres to
   dispatch write responsibility back to the OS as quickly as possible,
   so that the OS can prioritize requests appropriately. Most operating
   systems use an "elevator" algorithm to minimize disk head movement;
   but if the OS does not have a block that it could write while the
   head is "on the way" to another request, it can't write it in that
   head pass.

2. Postgres does not know about any "bank-switching" tricks for mapping
   more physical memory than it has address space. Thus, on 32-bit
   machines, postgres might be limited to mapping 2 or 3 GB of memory,
   even though the machine has, say, 6 GB of physical RAM. The OS can
   use all of the available memory for caching; postgres cannot.

3. A lot of work has been put into the seek algorithms, read-ahead
   algorithms, block allocation algorithms, etc. in the OS. Why
   duplicate all that work again in postgres?

When you say things like the following:

> We should only be writing blocks when they need to be on disk.
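The "careful write" / ping-pong idea mentioned in the quote above can be
sketched in a few lines. This is a toy in-memory simulation under my own
hypothetical names (`CarefulStore`, the checksum framing), not any real
system's implementation: the page is first written and synced to a
scratch slot, then written in place, so a tear can damage at most one of
the two copies and recovery always has an intact one to fall back on.

```python
import hashlib

PAGE_SIZE = 8192

def with_checksum(payload: bytes) -> bytes:
    """Prefix the payload with its digest so tears are detectable."""
    return hashlib.md5(payload).digest() + payload

def is_intact(page: bytes) -> bool:
    return page[:16] == hashlib.md5(page[16:]).digest()

class CarefulStore:
    """Ping-pong ("careful") writes: sync the page to a scratch slot
    first, then write it in place. A tear hits at most one copy."""
    def __init__(self):
        self.scratch = None   # the ping-pong slot
        self.main = None      # the page's home location

    def write_page(self, payload, tear_main=False):
        page = with_checksum(payload)
        self.scratch = page             # step 1: scratch copy, then "fsync"
        if tear_main:                   # simulate a crash mid step 2
            half = len(page) // 2
            self.main = page[:half] + bytes(len(page) - half)
        else:
            self.main = page            # step 2: in-place write

    def recover(self):
        if self.main is None or not is_intact(self.main):
            self.main = self.scratch    # repair from the scratch copy
        return self.main

# Usage: a write torn mid-flight is repaired on recovery.
store = CarefulStore()
store.write_page(b"x" * (PAGE_SIZE - 16), tear_main=True)
recovered = store.recover()
assert is_intact(recovered) and recovered[16:] == b"x" * (PAGE_SIZE - 16)
```

The cost, as noted above, is the second write point on the disk, which
is why removing archived page images instead may be attractive.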
> We should not be expecting the OS to write them "sometime later" and
> avoid blocking (as long) for the write. If we need that, then our
> buffer management is wrong and we need to fix it.

you appear to be making the argument that we should take the route of
other database systems, and use raw devices and our own management of
disk block allocation. If so, you might first want to look back through
the archives at the discussion I and several others had about this a
month or two ago. After looking in detail at what NetBSD, at least,
does in terms of its disk I/O algorithms and buffering, I've pretty
much come around, at least for the moment, to the attitude that we
should stick with using the OS. I wouldn't mind seeing postgres be able
to manage all of this stuff, but it's a *lot* of work for not all that
much benefit that I can see.

> The ORACLE people were not kidding when they said that they could not
> certify Linux for production use until it supported O_DSYNC. Can you
> explain why that was the case?

I'm suspecting it's because Linux at the time had no raw devices, so
O_DSYNC was the only other possible method of making sure that disk
writes actually got to disk. You certainly don't want to use O_DSYNC if
you can use another method, because O_DSYNC still goes through the
operating system's buffer cache, wasting memory and double-caching
things. If you're doing your own management, you need either to use a
raw device or open files with the flag that indicates that the buffer
cache should not be used at all for reads from and writes to that file.

> However, this discussion and a search of the pgsql-hackers archives
> reveals this problem to be the KEY area of PostgreSQL's failing, and
> general misunderstanding, when compared to its commercial competitors.

No, I think it's just that you're under a few minor misapprehensions
here about what postgres and the OS are actually doing.
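For reference, the O_DSYNC technique being discussed looks roughly like
this. The sketch assumes a POSIX-ish platform; `O_DSYNC` is not exposed
everywhere, so it falls back to an explicit fdatasync after each write,
which has the same durability effect (the helper names are mine, purely
illustrative):

```python
import os
import tempfile

def open_sync(path):
    """Open for synchronous data writes: O_DSYNC where available,
    otherwise plain open plus explicit fdatasync after each write."""
    flags = os.O_RDWR | os.O_CREAT
    dsync = getattr(os, "O_DSYNC", 0)   # not defined on every platform
    return os.open(path, flags | dsync, 0o600), dsync != 0

def durable_write(fd, have_dsync, data, offset=0):
    os.pwrite(fd, data, offset)
    if not have_dsync:
        os.fdatasync(fd)                # force the data (not metadata) out

# Usage: the write has reached stable storage before durable_write returns
# (to the extent the drive honors the sync, per the discussion above).
fd, have_dsync = open_sync(tempfile.mkstemp()[1])
durable_write(fd, have_dsync, b"committed record\n")
assert os.pread(fd, 17, 0) == b"committed record\n"
os.close(fd)
```

Note that either way the data still passes through the OS buffer cache,
which is exactly the double-caching complaint made above.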
As I said, I went through this whole exact argument a month or two ago,
on this very list, and I came around to the idea that what postgres is
doing now works quite well, at least on NetBSD. (Most other OSes have
disk I/O algorithms that are pretty much as good or better.) There
might be a very slight advantage to doing all one's own I/O management,
but it's a huge amount of work, and I think that much effort could be
much more usefully applied to other areas.

Just as a side note, I've been a NetBSD developer since about '96, and
have been delving into the details of OS design since well before that
time, so I'm coming to this with what I hope is reasonably good
knowledge of how disks work and how operating systems use them. (Not
that this should stop you from pointing out holes in my arguments. :-))

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light. --XTC