Re: Index Scans become Seq Scans after VACUUM ANALYSE - Mailing list pgsql-hackers
From | Bruce Momjian |
---|---|
Subject | Re: Index Scans become Seq Scans after VACUUM ANALYSE |
Date | |
Msg-id | 200206240246.g5O2ki712992@candle.pha.pa.us Whole thread Raw |
In response to | Re: Index Scans become Seq Scans after VACUUM ANALYSE ("J. R. Nield" <jrnield@usol.com>) |
List | pgsql-hackers |
J. R. Nield wrote: > On Sun, 2002-06-23 at 15:36, Bruce Momjian wrote: > > Yes, I don't see writing to two files vs. one to be any win, especially > > when we need to fsync both of them. What I would really like is to > > avoid the double I/O of writing to WAL and to the data file; improving > > that would be a huge win. > > > > If is impossible to do what you want. You can not protect against > partial writes without writing pages twice and calling fdatasync between > them while going through a generic filesystem. The best disk array will > not protect you if the operating system does not align block writes to > the structure of the underlying device. Even with raw devices, you need > special support or knowledge of the operating system and/or the disk > device to ensure that each write request will be atomic to the > underlying hardware. Yes, I suspected it was impossible, but that doesn't mean I want it any less. ;-) > All other systems rely on the fact that you can recover a damaged file > using the log archive. This means downtime in the rare case, but no data > loss. Until PostgreSQL can do this, then it will not be acceptable for > real critical production use. This is not to knock PostgreSQL, because > it is a very good database system, and clearly the best open-source one. > It even has feature advantages over the commercial systems. But at the > end of the day, unless you have complete understanding of the I/O system > from write(2) through to the disk system, the only sure ways to protect > against partial writes are by "careful writes" (in the WAL log or > elsewhere, writing pages twice), or by requiring (and allowing) users to > do log-replay recovery when a file is corrupted by a partial write. As > long as there is a UPS, and the operating system doesn't crash, then > there still should be no partial writes. You are talking point-in-time recovery, a major missing feature right next to replication, and I agree it makes PostgreSQL unacceptable for some applications. Point taken. And the interesting thing you are saying is that with point-in-time recovery, we don't need to write pre-write images of pages because if we detect a partial page write, we then abort the database and tell the user to do a point-in-time recovery, basically meaning we are using the previous full backup as our pre-write page image and roll forward using the logical logs. This is clearly a nice thing to be able to do because it let's you take a pre-write image of the page once during full backup, keep it offline, and bring it back in the rare case of a full page write failure. I now can see how the MSSQL tearoff-bits would be used, not for recovery, but to detect a partial write and force a point-in-time recovery from the administrator. > If we log pages to WAL, they are useless when archived (after a > checkpoint). So either we have a separate "log" for them (the ping-pong > file), or we should at least remove them when archived, which makes log > archiving more complex but is perfectly doable. Yes, that is how we will do point-in-time recovery; remove the pre-write page images and archive the rest. It is more complex, but having the fsync all in one file is too big a win. > Finally, I would love to hear why we are using the operating system > buffer manager at all. The OS is acting as a secondary buffer manager > for us. Why is that? What flaw in our I/O system does this reveal? I > know that: > > >We sync only WAL, not the other pages, except for the sync() call we do > > during checkpoint when we discard old WAL files. > > But this is probably not a good thing. We should only be writing blocks > when they need to be on disk. We should not be expecting the OS to write > them "sometime later" and avoid blocking (as long) for the write. If we > need that, then our buffer management is wrong and we need to fix it. > The reason we are doing this is because we expect the OS buffer manager > to do asynchronous I/O for us, but then we don't control the order. That > is the reason why we have to call fdatasync(), to create "sequence > points". Yes. I think I understand. It is true we have to fsync WAL because we can't control the individual writes by the OS. > The reason we have performance problems with either D_OSYNC or fdatasync > on the normal relations is because we have no dbflush process. This > causes an unacceptable amount of I/O blocking by other transactions. Uh, that would force writes all over the disk. Why do we really care how the OS writes them? If we are going to fsync, let's just do the one file and be done with it. What would a separate flusher process really buy us if it has to use fsync too. The main backend doesn't have to wait for the fsync, but then again, we can't say the transaction is committed until it hits the disk, so how does a flusher help? > The ORACLE people were not kidding when they said that they could not > certify Linux for production use until it supported O_DSYNC. Can you > explain why that was the case? I don't see O_DSYNC as very different from write/fsync(or fdatasync). > Finally, let me apologize if the above comes across as somewhat > belligerent. I know very well that I can't compete with you guys for > knowledge of the PostgreSQL system. I am still at a loss when I look at > the optimizer and executor modules, and it will take some time before I > can follow discussion of that area. Even then, I doubt my ability to > compare with people like Mr. Lane and Mr. Momjian in experience and > general intelligence, or in the field of database programming and > software development in particular. However, this discussion and a > search of the pgsql-hackers archives reveals this problem to be the KEY > area of PostgreSQL's failing, and general misunderstanding, when > compared to its commercial competitors. We appreciate your ideas. Few of us are professional db folks so we are always looking for good ideas. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
pgsql-hackers by date: