Re: Index Scans become Seq Scans after VACUUM ANALYSE

From: J. R. Nield
Subject: Re: Index Scans become Seq Scans after VACUUM ANALYSE
Msg-id: 1024882167.1793.733.camel@localhost.localdomain
In response to: Index Scans become Seq Scans after VACUUM ANALYSE (Louis-David Mitterrand <vindex@apartia.org>)
List: pgsql-hackers
On Sun, 2002-06-23 at 15:36, Bruce Momjian wrote:
> Yes, I don't see writing to two files vs. one to be any win, especially
> when we need to fsync both of them. What I would really like is to
> avoid the double I/O of writing to WAL and to the data file; improving
> that would be a huge win.

It is impossible to do what you want. You cannot protect against partial writes without writing pages twice and calling fdatasync between them while going through a generic filesystem. The best disk array will not protect you if the operating system does not align block writes to the structure of the underlying device. Even with raw devices, you need special support or knowledge of the operating system and/or the disk device to ensure that each write request will be atomic to the underlying hardware.

All other systems rely on the fact that you can recover a damaged file using the log archive. This means downtime in the rare case, but no data loss. Until PostgreSQL can do this, it will not be acceptable for truly critical production use.

This is not to knock PostgreSQL, because it is a very good database system, and clearly the best open-source one. It even has feature advantages over the commercial systems. But at the end of the day, unless you have a complete understanding of the I/O system from write(2) all the way down to the disk, the only sure ways to protect against partial writes are "careful writes" (in the WAL log or elsewhere, writing pages twice), or requiring (and allowing) users to do log-replay recovery when a file is corrupted by a partial write. As long as there is a UPS and the operating system doesn't crash, there should still be no partial writes.

If we log pages to WAL, they are useless once archived (after a checkpoint). So either we have a separate "log" for them (the ping-pong file), or we should at least remove them when archived, which makes log archiving more complex but is perfectly doable.

Finally, I would love to hear why we are using the operating system buffer manager at all. The OS is acting as a secondary buffer manager for us. Why is that? What flaw in our I/O system does this reveal? I know that:

> We sync only WAL, not the other pages, except for the sync() call we do
> during checkpoint when we discard old WAL files.

But this is probably not a good thing. We should only be writing blocks when they need to be on disk. We should not be expecting the OS to write them "sometime later" to avoid blocking (as long) for the write. If we need that, then our buffer management is wrong and we need to fix it.

The reason we are doing this is that we expect the OS buffer manager to do asynchronous I/O for us, but then we don't control the order. That is why we have to call fdatasync(): to create "sequence points".

The reason we have performance problems with either O_DSYNC or fdatasync on the normal relations is that we have no dbflush process; this causes an unacceptable amount of I/O blocking for other transactions. The ORACLE people were not kidding when they said that they could not certify Linux for production use until it supported O_DSYNC. Can you explain why that was the case?
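To make the careful-write idea above concrete, here is a minimal C sketch under POSIX pwrite/fdatasync with an 8 kB page; the function name, the file descriptors, and the fixed scratch slot are illustrative assumptions, not PostgreSQL's actual code. The page image is first made durable in a known scratch location, and only then written to its home location:

```c
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE 8192

/* Hypothetical "careful write": two writes with a sync barrier
 * between them, so a crash can tear at most one of the copies. */
int careful_write_page(int scratch_fd, int data_fd,
                       off_t offset, const char *page)
{
    /* 1. Write the page to the scratch slot and force it to disk. */
    if (pwrite(scratch_fd, page, PAGE_SIZE, 0) != PAGE_SIZE)
        return -1;
    if (fdatasync(scratch_fd) != 0)
        return -1;

    /* 2. Only now overwrite the home location.  If this write is
     *    torn by a crash, the scratch copy is already durable and
     *    recovery can restore the page from it. */
    if (pwrite(data_fd, page, PAGE_SIZE, offset) != PAGE_SIZE)
        return -1;
    if (fdatasync(data_fd) != 0)
        return -1;

    return 0;
}
```

On recovery, if the home page is found torn, it is rewritten from the scratch slot before the file is used; that durability ordering is exactly why the pages must be written twice, which is the cost being discussed above.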
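The dbflush point can also be sketched. The buffer-pool helpers below (next_dirty_buffer, write_and_sync_buffer) are hypothetical stand-ins for whatever a real buffer manager would provide; the point is only that a dedicated process absorbs the synchronous writes so foreground transactions never block on them:

```c
#include <unistd.h>

typedef struct Buffer Buffer;
extern Buffer *next_dirty_buffer(void);          /* hypothetical */
extern void    write_and_sync_buffer(Buffer *);  /* pwrite + fdatasync */

/* Sketch of a dbflush-style background writer loop. */
void dbflush_main(void)
{
    for (;;)
    {
        Buffer *buf = next_dirty_buffer();
        if (buf != NULL)
            write_and_sync_buffer(buf);  /* only this process waits
                                            on the synchronous I/O */
        else
            sleep(1);                    /* idle until more dirty pages */
    }
}
```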
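And for the O_DSYNC remark, a sketch under the same POSIX assumptions (the helper name and path are illustrative). With this flag, every write(2) on the descriptor returns only once the data is on stable storage, giving the ordering guarantee that fdatasync() "sequence points" otherwise provide; it is also why using it on ordinary relations without a dbflush-style writer would stall foreground transactions on every dirty-page write:

```c
#include <fcntl.h>

/* Hypothetical helper: open a log file for synchronous writes. */
int open_log_dsync(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_DSYNC, 0600);
}
```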
Finally, let me apologize if the above comes across as somewhat belligerent. I know very well that I can't compete with you guys for knowledge of the PostgreSQL system. I am still at a loss when I look at the optimizer and executor modules, and it will take some time before I can follow discussion of that area. Even then, I doubt my ability to compare with people like Mr. Lane and Mr. Momjian in experience and general intelligence, or in the field of database programming and software development in particular.

However, this discussion and a search of the pgsql-hackers archives reveal this problem to be the KEY area of PostgreSQL's failing, and of general misunderstanding, when compared to its commercial competitors.

Sincerely,

J. R. Nield

--
J. R. Nield
jrnield@usol.com