"Jacky Leng" <lengjianquan@163.com> writes:
> Shouldn't we write xlog record before we do a physical operation?
The reasoning for not doing it that way was that we can't be sure
beforehand that the filesystem operation will succeed. If we xlog
the truncate first, the truncate then fails, and then we crash, we're
in deep trouble
because WAL replay will try to do the truncate and likewise fail,
preventing the system from restarting. Other non-rollbackable
filesystem ops (I think just CREATE/DROP DATABASE/TABLESPACE) are done
the same way. CREATE DATABASE would be particularly nasty to reverse
the order for, since there are obvious cases like out-of-disk-space
that will make it fail.
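To make the ordering argument concrete, here is a minimal toy model in C. It is a sketch only, not PostgreSQL code: ToyFile, ToyWal, toy_truncate, log_then_truncate, truncate_then_log and toy_replay are all hypothetical names, and the "filesystem failure" is just a flag. It shows that with log-first ordering a failed operation leaves a record that replay cannot apply, while with operate-first ordering nothing is logged unless the operation succeeded.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model for illustration; none of this is PostgreSQL code. */
typedef struct {
    int  nblocks;         /* current length of the relation file, in blocks */
    bool truncate_fails;  /* simulate a filesystem op that cannot succeed */
} ToyFile;

typedef struct {
    bool has_truncate;    /* a truncate record made it into the log */
    int  target_blocks;   /* length the record asks for */
} ToyWal;

/* Truncate the file; on failure the file is left unchanged. */
bool toy_truncate(ToyFile *f, int nblocks)
{
    if (f->truncate_fails)
        return false;
    assert(nblocks >= 0);
    if (nblocks < f->nblocks)
        f->nblocks = nblocks;
    return true;
}

/* Log first, then operate: the record exists even if the op fails. */
bool log_then_truncate(ToyFile *f, ToyWal *w, int nblocks)
{
    w->has_truncate = true;
    w->target_blocks = nblocks;
    return toy_truncate(f, nblocks);
}

/* Operate first, log only on success: a failed op leaves no record. */
bool truncate_then_log(ToyFile *f, ToyWal *w, int nblocks)
{
    if (!toy_truncate(f, nblocks))
        return false;
    w->has_truncate = true;
    w->target_blocks = nblocks;
    return true;
}

/* Crash recovery: returns false if replay hits an op it cannot redo. */
bool toy_replay(ToyFile *f, const ToyWal *w)
{
    if (!w->has_truncate)
        return true;
    return toy_truncate(f, w->target_blocks);
}
```

With truncate_fails set, log_then_truncate followed by toy_replay fails on every restart, whereas truncate_then_log followed by toy_replay succeeds because there is nothing to redo. The cost of the second ordering is the window described in the test case below, where the operation has happened but its record has not been written.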
> A test case:
> 1. set full_page_writes off;
> 2. startup database; create a table; insert 100000 rows in it; shutdown
> database;
> 3. startup database again; delete all rows from this table;
> 4. vacuum this table, and it will come into smgrtruncate; kill postmaster
> before smgrtruncate does the xlog stuff (set a breakpoint before the xlog stuff);
> 5. startup database the 3rd time, during the recovery, the database will
> crash with:
> PANIC: WAL contains references to invalid pages
Hmm. Maybe we need something like: xlog a "tentative truncate", do the
truncate, then xlog the "real truncate"? The tentative truncate would
merely tell replay
not to be surprised if those blocks aren't there anymore. Seems a bit
grotty though.
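A rough sketch of how replay might treat such records, again as a toy model in C rather than anything resembling the real xlog code. The record types, Rec, and toy_recover are hypothetical, and the model assumes replay simply remembers references to missing blocks and complains at the end if no truncate record (tentative or real) accounts for them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record types for illustration only. */
typedef enum {
    REC_PAGE_UPDATE,         /* touches one block */
    REC_TENTATIVE_TRUNCATE,  /* blocks >= block may already be gone */
    REC_REAL_TRUNCATE        /* file was truncated to block blocks */
} RecType;

typedef struct {
    RecType type;
    int     block;  /* block touched, or new length for the truncates */
} Rec;

/*
 * Replay the log against a file that has file_blocks blocks on disk.
 * Returns true if recovery ends cleanly, false where the toy model
 * would report "WAL contains references to invalid pages".
 */
bool toy_recover(const Rec *log, size_t nrecs, int file_blocks)
{
    /* Lowest missing block still unexplained; -1 means none. */
    int lowest_missing = -1;

    for (size_t i = 0; i < nrecs; i++) {
        const Rec *r = &log[i];

        assert(r->block >= 0);
        if (r->type == REC_PAGE_UPDATE) {
            /* Remember a reference to a block that is not there. */
            if (r->block >= file_blocks &&
                (lowest_missing < 0 || r->block < lowest_missing))
                lowest_missing = r->block;
            continue;
        }
        /* Only the real truncate changes the file during replay. */
        if (r->type == REC_REAL_TRUNCATE && r->block < file_blocks)
            file_blocks = r->block;
        /* Either kind explains every missing block at or past block. */
        if (lowest_missing >= r->block)
            lowest_missing = -1;
    }
    return lowest_missing < 0;
}
```

In this model the log from the test case (page updates, then a crash after the physical truncate but before any truncate record) makes toy_recover return false, while the same log with a tentative truncate record written before the physical truncate returns true.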
regards, tom lane