Re: Index Scans become Seq Scans after VACUUM ANALYSE - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: Index Scans become Seq Scans after VACUUM ANALYSE
Date
Msg-id 200206240246.g5O2ki712992@candle.pha.pa.us
Whole thread Raw
In response to Re: Index Scans become Seq Scans after VACUUM ANALYSE  ("J. R. Nield" <jrnield@usol.com>)
List pgsql-hackers
J. R. Nield wrote:
> On Sun, 2002-06-23 at 15:36, Bruce Momjian wrote:
> > Yes, I don't see writing to two files vs. one to be any win, especially
> > when we need to fsync both of them.  What I would really like is to
> > avoid the double I/O of writing to WAL and to the data file;  improving
> > that would be a huge win.
> > 
> 
> If is impossible to do what you want. You can not protect against
> partial writes without writing pages twice and calling fdatasync between
> them while going through a generic filesystem. The best disk array will
> not protect you if the operating system does not align block writes to
> the structure of the underlying device. Even with raw devices, you need
> special support or knowledge of the operating system and/or the disk
> device to ensure that each write request will be atomic to the
> underlying hardware. 

Yes, I suspected it was impossible, but that doesn't mean I want it any
less.  ;-)

> All other systems rely on the fact that you can recover a damaged file
> using the log archive. This means downtime in the rare case, but no data
> loss. Until PostgreSQL can do this, then it will not be acceptable for
> real critical production use. This is not to knock PostgreSQL, because
> it is a very good database system, and clearly the best open-source one.
> It even has feature advantages over the commercial systems. But at the
> end of the day, unless you have complete understanding of the I/O system
> from write(2) through to the disk system, the only sure ways to protect
> against partial writes are by "careful writes" (in the WAL log or
> elsewhere, writing pages twice), or by requiring (and allowing) users to
> do log-replay recovery when a file is corrupted by a partial write. As
> long as there is a UPS, and the operating system doesn't crash, then
> there still should be no partial writes.

You are talking point-in-time recovery, a major missing feature right
next to replication, and I agree it makes PostgreSQL unacceptable for
some applications.  Point taken.

And the interesting thing you are saying is that with point-in-time
recovery, we don't need to write pre-write images of pages because if we
detect a partial page write, we then abort the database and tell the
user to do a point-in-time recovery, basically meaning we are using the
previous full backup as our pre-write page image and roll forward using
the logical logs.  This is clearly a nice thing to be able to do because
it let's you take a pre-write image of the page once during full backup,
keep it offline, and bring it back in the rare case of a full page write
failure.  I now can see how the MSSQL tearoff-bits would be used, not
for recovery, but to detect a partial write and force a point-in-time
recovery from the administrator.


> If we log pages to WAL, they are useless when archived (after a
> checkpoint). So either we have a separate "log" for them (the ping-pong
> file), or we should at least remove them when archived, which makes log
> archiving more complex but is perfectly doable.

Yes, that is how we will do point-in-time recovery;  remove the
pre-write page images and archive the rest.  It is more complex, but
having the fsync all in one file is too big a win.

> Finally, I would love to hear why we are using the operating system
> buffer manager at all. The OS is acting as a secondary buffer manager
> for us. Why is that? What flaw in our I/O system does this reveal? I
> know that:
> 
> >We sync only WAL, not the other pages, except for the sync() call we do
> > during checkpoint when we discard old WAL files.
> 
> But this is probably not a good thing. We should only be writing blocks
> when they need to be on disk. We should not be expecting the OS to write
> them "sometime later" and avoid blocking (as long) for the write. If we
> need that, then our buffer management is wrong and we need to fix it.
> The reason we are doing this is because we expect the OS buffer manager
> to do asynchronous I/O for us, but then we don't control the order. That
> is the reason why we have to call fdatasync(), to create "sequence
> points".

Yes.  I think I understand.  It is true we have to fsync WAL because we
can't control the individual writes by the OS.

> The reason we have performance problems with either D_OSYNC or fdatasync
> on the normal relations is because we have no dbflush process. This
> causes an unacceptable amount of I/O blocking by other transactions.

Uh, that would force writes all over the disk. Why do we really care how
the OS writes them?  If we are going to fsync, let's just do the one
file and be done with it.  What would a separate flusher process really
buy us  if it has to use fsync too. The main backend doesn't have to
wait for the fsync, but then again, we can't say the transaction is
committed until it hits the disk, so how does a flusher help?

> The ORACLE people were not kidding when they said that they could not
> certify Linux for production use until it supported O_DSYNC. Can you
> explain why that was the case?

I don't see O_DSYNC as very different from write/fsync(or fdatasync).

> Finally, let me apologize if the above comes across as somewhat
> belligerent. I know very well that I can't compete with you guys for
> knowledge of the PostgreSQL system. I am still at a loss when I look at
> the optimizer and executor modules, and it will take some time before I
> can follow discussion of that area. Even then, I doubt my ability to
> compare with people like Mr. Lane and Mr. Momjian in experience and
> general intelligence, or in the field of database programming and
> software development in particular. However, this discussion and a
> search of the pgsql-hackers archives reveals this problem to be the KEY
> area of PostgreSQL's failing, and general misunderstanding, when
> compared to its commercial competitors.

We appreciate your ideas.  Few of us are professional db folks so we are
always looking for good ideas.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 




pgsql-hackers by date:

Previous
From: "J. R. Nield"
Date:
Subject: Re: Index Scans become Seq Scans after VACUUM ANALYSE
Next
From: Bruce Momjian
Date:
Subject: Re: Index Scans become Seq Scans after VACUUM ANALYSE