Thread: Fast index build vs. PITR
I was just about to commit a patch that revises the btree index build procedure as discussed here:
http://archives.postgresql.org/pgsql-general/2004-05/msg00480.php
specifically, not using shared buffers during index build and bypassing WAL-logging in favor of just fsyncing the index file before commit.

I was actually writing the commit message when it occurred to me that this would seriously break PITR. If the WAL datastream doesn't contain enough info to rebuild the index then rolling forward from a past backup isn't gonna work.

I thought for a little bit about a magic "reconstruct the index" WAL entry that would invoke the index build procedure in toto, but that doesn't look like it will fly either. (Two problems: during crash recovery, you couldn't be sure that what's on disk for the underlying table exactly matches the index you need to build --- it could be a later state of the table; and besides, the environment of the WAL replay process isn't capable of running user-defined functions, so it couldn't work for functional indexes.)

So AFAICS, we've got to dump the index contents into WAL to support PITR. This is a tad annoying.

What I'm thinking about right now is tweaking the index-build code to write to WAL only if it sees that PITR is actually in use. It would have to look at the GUC variables to determine whether WAL archiving is enabled. If archiving isn't turned on, then we could assume that rollforward from a past backup isn't needed in this installation, and use the WAL-less index build method.

Comments?

			regards, tom lane
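Tom's proposal boils down to a single runtime branch on the archiving GUC. A minimal C sketch of that decision follows; the flag and function names here are illustrative stand-ins, not the actual PostgreSQL symbols:

```c
#include <stdbool.h>

/* Stand-in for the WAL-archiving GUC; in a real backend this would be
 * read from postgresql.conf at startup.  (Hypothetical name.) */
static bool wal_archiving_enabled = false;

/* Choose the build strategy Tom describes: WAL-log the index build only
 * when archiving (PITR) is on; otherwise skip WAL entirely and rely on
 * fsyncing the finished index file before commit for crash safety. */
static bool
index_build_uses_wal(void)
{
    return wal_archiving_enabled;
}
```

With archiving off, the builder would take the fast fsync-only path; with archiving on, it would emit the usual WAL traffic so rollforward from a past backup still works.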
On Mon, 31 May 2004, Tom Lane wrote:

[snip]

> I thought for a little bit about a magic "reconstruct the index" WAL
> entry that would invoke the index build procedure in toto, but that
> doesn't look like it will fly either. (Two problems: during crash
> recovery, you couldn't be sure that what's on disk for the underlying
> table exactly matches the index you need to build --- it could be a
> later state of the table; and besides, the environment of the WAL replay
> process isn't capable of running user-defined functions, so it couldn't
> work for functional indexes.)
>
> So AFAICS, we've got to dump the index contents into WAL to support
> PITR. This is a tad annoying.

Is it possible in this case to dump the index block by block into the log after it has been generated? It seems to me this should reduce the amount of data we write to WAL and (therefore) speed up the rebuild.

Gavin
> What I'm thinking about right now is tweaking the index-build code to
> write to WAL only if it sees that PITR is actually in use. It would
> have to look at the GUC variables to determine whether WAL archiving
> is enabled. If archiving isn't turned on, then we could assume that
> rollforward from a past backup isn't needed in this installation, and
> use the WAL-less index build method.

Seems reasonable.

Chris
Christopher Kings-Lynne wrote:
> > What I'm thinking about right now is tweaking the index-build code to
> > write to WAL only if it sees that PITR is actually in use. It would
> > have to look at the GUC variables to determine whether WAL archiving
> > is enabled. If archiving isn't turned on, then we could assume that
> > rollforward from a past backup isn't needed in this installation, and
> > use the WAL-less index build method.
>
> Seems reasonable.

What happens if someone turns on archiving while the index is being built? Is that possible?

I assume if someone turns on archiving in postgresql.conf, sighups the postmaster, then does a tar backup, they should be able to do archiving, no?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  + If your life is a hard drive,      |  13 Roberts Road
  + Christ can be your backup.         |  Newtown Square, Pennsylvania 19073
On Tue, Jun 01, 2004 at 11:34:15AM +1000, Gavin Sherry wrote:
> On Mon, 31 May 2004, Tom Lane wrote:
>
> > I thought for a little bit about a magic "reconstruct the index" WAL
> > entry that would invoke the index build procedure in toto, but that
> > doesn't look like it will fly either. (Two problems: during crash
> > recovery, you couldn't be sure that what's on disk for the underlying
> > table exactly matches the index you need to build --- it could be a
> > later state of the table; and besides, the environment of the WAL replay
> > process isn't capable of running user-defined functions, so it couldn't
> > work for functional indexes.)
> >
> > So AFAICS, we've got to dump the index contents into WAL to support
> > PITR. This is a tad annoying.
>
> Is it possible in this case to dump the index block by block into the log
> after it has been generated? It seems to me this should reduce the
> amount of data we write to WAL and (therefore) speed up the rebuild.

Is this less log traffic? You save a lot of per-record overhead, but you have to save internal pages (which are not saved with the standard code). Plus it would be a lot different from standard btree WAL traffic, so it'd be more code.

Maybe there's more to be saved by logging only leaf pages. But then there would be even more code to reconstruct the index from only leaf pages, and there are not that many internal pages anyway.

A completely different idea would be to log a "logical index creation", so that during normal recovery those entries are saved somewhere; after the rest of WAL recovery is done, the system is taken into a more normal post-recovery, pre-usable state, in which those indexes are recreated from user data. This would be cheapest in WAL traffic, but it would probably also require more code and new hooks in the startup mechanism. Also, it'd require examining later WAL entries that refer to the index and acting accordingly (e.g. ignoring an entry that modifies the index, and forgetting the creation if a DROP INDEX command follows).

Not that I really like either of those ideas ... issuing normal WAL index creation traffic if PITR is active is certainly the easiest way.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"In Europe they call me Niklaus Wirth; in the US they call me Nickel's worth.
 That's because in Europe they call me by name, and in the US by value!"
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I assume if someone turns on archiving in postgresql.conf, sighups the
> postmaster, then does a tar backup, they should be able to do archiving,
> no?

I would have zero problem with labeling the archive parameter as changeable only at postmaster start.

			regards, tom lane
Gavin Sherry <swm@linuxworld.com.au> writes:
>> So AFAICS, we've got to dump the index contents into WAL to support
>> PITR. This is a tad annoying.

> Is it possible in this case to dump the index block by block into the log
> after it has been generated?

That's what we do now, and it more or less doubles the amount of I/O needed to create an index ...

			regards, tom lane
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I assume if someone turns on archiving in postgresql.conf, sighups the
> > postmaster, then does a tar backup, they should be able to do archiving,
> > no?
>
> I would have zero problem with labeling the archive parameter as
> changeable only at postmaster start.

I guess the question is whether it would be possible to start/stop it at other times. And what process are we going to use to do a tar backup? Do they turn off archiving before doing the tar, or tell the system to tar to another location?

And you brought up the issue of how do we feed multiple archive files back into the xlog directory during restore if they don't all fit on the disk.

I think we need to explore the procedures we are going to use for PITR. Also, do we need to do tar in a special way, like tar up a specific file first?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  + If your life is a hard drive,      |  13 Roberts Road
  + Christ can be your backup.         |  Newtown Square, Pennsylvania 19073
Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> A completely different idea would be to log a "logical index creation",
> so that during normal recovery those entries are saved somewhere; after
> the rest of WAL recovery is done, the system is taken into a more normal
> post-recovery pre-usable state, on which those indexes are recreated
> from user data.

I think an actually implementable version of this would be:

1. Don't log any index operations at all in WAL.

2. When recovering from WAL, restore all the table contents by WAL replay. (This would of course include the system catalog contents that describe the indexes.) Then sit there and do a global REINDEX to rebuild all the indexes.

This would gain a reduction of some percentage in WAL traffic, at the cost of a hugely expensive recovery cycle any time you actually needed to use the WAL. I guess this could be attractive to some installations, but I'm not sure very many people would want it ...

			regards, tom lane
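The two-phase recovery cycle described above can be sketched as follows. This is only an illustrative outline of the control flow, not PostgreSQL internals; the struct and function names are invented for the example:

```c
/* Sketch of the recovery cycle when index operations are never
 * WAL-logged: phase 1 replays table contents from WAL (including the
 * catalog rows that describe each index), phase 2 performs the global
 * REINDEX that rebuilds every index from the recovered heap data.
 * All names here are hypothetical. */
typedef struct RecoveryStats
{
    int tables_replayed;
    int indexes_rebuilt;
} RecoveryStats;

static RecoveryStats
recover_with_global_reindex(int ntables, int nindexes)
{
    RecoveryStats stats = {0, 0};

    /* Phase 1: WAL replay restores heap pages only. */
    for (int t = 0; t < ntables; t++)
        stats.tables_replayed++;

    /* Phase 2: the expensive part -- every index is re-sorted and
     * rebuilt from user data before the system is usable. */
    for (int i = 0; i < nindexes; i++)
        stats.indexes_rebuilt++;

    return stats;
}
```

The sketch makes the trade-off visible: phase 2 scales with the number (and size) of indexes, which is exactly the recovery-time cost being weighed against the WAL-traffic savings.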
On Tue, 2004-06-01 at 01:24, Tom Lane wrote:
> I was just about to commit a patch that revises the btree index build
> procedure as discussed here:
> http://archives.postgresql.org/pgsql-general/2004-05/msg00480.php
> specifically, not using shared buffers during index build and bypassing
> WAL-logging in favor of just fsyncing the index file before commit.
>
> I was actually writing the commit message when it occurred to me that
> this would seriously break PITR. If the WAL datastream doesn't contain
> enough info to rebuild the index then rolling forward from a past backup
> isn't gonna work.
>
> I thought for a little bit about a magic "reconstruct the index" WAL
> entry that would invoke the index build procedure in toto, but that
> doesn't look like it will fly either. (Two problems: during crash
> recovery, you couldn't be sure that what's on disk for the underlying
> table exactly matches the index you need to build --- it could be a
> later state of the table; and besides, the environment of the WAL replay
> process isn't capable of running user-defined functions, so it couldn't
> work for functional indexes.)
>
> So AFAICS, we've got to dump the index contents into WAL to support
> PITR. This is a tad annoying.
>
> What I'm thinking about right now is tweaking the index-build code to
> write to WAL only if it sees that PITR is actually in use. It would
> have to look at the GUC variables to determine whether WAL archiving
> is enabled. If archiving isn't turned on, then we could assume that
> rollforward from a past backup isn't needed in this installation, and
> use the WAL-less index build method.
>
> Comments?

The mechanism you suggest would also break crash recovery, not just PITR - though the avoidance of shared buffers seems like a gain either way.

..You raise the whole subject of UNRECOVERABLE data objects, also known as NOT LOGGED etc. There are many significant performance gains to be had by turning off recoverability for certain large operations. Oracle and Teradata make extensive use of such features, especially in Data Warehousing. Examples of such operations might be:
- index builds
- INSERT SELECTs into previously empty tables

This is an important area for performance... not just index builds. A suggestion would be:
- add the "don't send to xlog" functionality as a user option on each statement, default=LOGGING - this could be Oracle compatible, or not, but the concept is similar. Put the hooks in now and we can add this to all appropriate statement syntax later, e.g.
  CREATE INDEX blah ... NO LOGGING ... ;
  INSERT INTO blah ... NO LOGGING ... SELECT ... ;
- if the conf file says don't use fsync, then don't write to the log - clearly they don't mind losing data in the event of a crash... i.e. default=NOLOGGING on all statements

Best regards, Simon Riggs
> I think an actually implementable version of this would be:
>
> 1. Don't log any index operations at all in WAL.
>
> 2. When recovering from WAL, restore all the table contents by WAL
> replay. (This would of course include the system catalog contents that
> describe the indexes.) Then sit there and do a global REINDEX to
> rebuild all the indexes.
>
> This would gain a reduction of some percentage in WAL traffic, at the
> cost of a hugely expensive recovery cycle any time you actually needed
> to use the WAL. I guess this could be attractive to some installations,
> but I'm not sure very many people would want it ...

I think only the "global" part of it is not really acceptable. If we had a flag for each index that marks it "inconsistent", reindexing only those that are marked would be great.

Could we log a WAL record that basically only marks an index for deferred reindex after WAL recovery? During WAL replay all records for this index could be ignored (this is not a must, because of the post-update page images in WAL; the index would still stay inconsistent until reindex, of course).

I think such a reindex step could also be responsible for those non-btree indexes that don't fully support WAL (gist?).

Andreas
Simon Riggs <simon@2ndquadrant.com> writes:
> The mechanism you suggest would also break crash recovery, not just PITR

No it wouldn't.

			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
> I thought for a little bit about a magic "reconstruct the index" WAL
> entry that would invoke the index build procedure in toto, but that
> doesn't look like it will fly either. (Two problems: during crash
> recovery, you couldn't be sure that what's on disk for the underlying
> table exactly matches the index you need to build --- it could be a
> later state of the table; and besides, the environment of the WAL replay
> process isn't capable of running user-defined functions, so it couldn't
> work for functional indexes.)

Could you just mark the index as unusable? Have the optimizer ignore such indexes, and PITR recovery can notify the user of these indexes and/or invoke a rebuild automatically?

It wouldn't happen unless the user had done an index rebuild since the last complete backup, so it wouldn't even be a performance issue. Restoring the index from the WAL replay of an index rebuild must take a long time anyways.

--
greg
On Tue, Jun 01, 2004 at 12:52:32PM -0400, Greg Stark wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>
> > I thought for a little bit about a magic "reconstruct the index" WAL
> > entry that would invoke the index build procedure in toto, but that
> > doesn't look like it will fly either. (Two problems: during crash
> > recovery, you couldn't be sure that what's on disk for the underlying
> > table exactly matches the index you need to build --- it could be a
> > later state of the table; and besides, the environment of the WAL replay
> > process isn't capable of running user-defined functions, so it couldn't
> > work for functional indexes.)
>
> Could you just mark the index as unusable? Have the optimizer ignore such
> indexes and PITR recovery can notify the user of these indexes and/or invoke a
> rebuild automatically?

The big problem I see with this kind of approach is that building an index from scratch can take a huge amount of time, because you have to sort the data. Building from WAL does not have this problem, so it can be much faster.

Of course, when you are restoring using a PITR approach you probably want it to be very fast, and have the DB running with as few quirks as possible, as soon as possible.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"The realist knows what he wants; the idealist wants what he knows" (Anonymous)
Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> The big problem I see with this kind of approach is that building an
> index from scratch can take a huge amount of time, because you have to
> sort the data. Building from WAL does not have this problem, so it can
> be much faster.

I'm not clear that building from WAL is really going to be that much faster. A) algorithmically it's only the factor of log(n) that you're talking about, and B) the WAL will have records for every write, not just the final product, so it might potentially have a lot more writes to do.

I thought part of the original problem was specifically that going through WAL slowed down the index rebuild much more than a factor of 2, which would tend to imply that in fact rebuilding from WAL isn't going to be as fast as you might expect.

Another possibility is doing the complete index build without going through WAL and then inserting a complete copy of the index into the WAL, without syncing or activating the rebuilt index until the copy to WAL is done. That kind of sucks since it's equivalent to just taking another backup of the data files immediately after the rebuild, but might be a more direct solution using the existing tools.

> Of course, when you are restoring using a PITR approach you probably want it
> to be very fast, and have the DB running with as few quirks as possible,
> as soon as possible.

This is certainly true.

--
greg
Greg Stark <gsstark@mit.edu> writes:
> I'm not clear that building from WAL is really going to be that much faster.
> A) algorithmically it's only the factor of log(n) that you're talking about.
> and B) the WAL will have records for every write, not just the final product,
> so it might potentially have a lot more writes to do.

Wrong ... what we log in WAL for a btree index build is just the series of completed index page images. Recreation of the index would proceed at whatever your disk read/write bandwidth is.

Like Alvaro, I suspect that people who are using PITR will be concerned about recovery time, and would not be thrilled with any scenario that involves REINDEX to get the system back on its feet.

			regards, tom lane
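Since only completed page images are logged, the extra WAL volume for a build is roughly the size of the finished index, which lines up with the "more or less doubles the amount of I/O" figure mentioned upthread. A back-of-envelope helper (the 8 KB block size matches PostgreSQL's default BLCKSZ; the function names are made up for illustration):

```c
/* Estimate write volume for a WAL-logged btree build, assuming one
 * full image is logged per completed index page.  WAL bytes then
 * equal the index size, so total write I/O roughly doubles. */
#define BLCKSZ 8192L            /* PostgreSQL default page size */

static long
index_build_wal_bytes(long index_pages)
{
    /* one full page image per completed index page */
    return index_pages * BLCKSZ;
}

static long
total_build_write_bytes(long index_pages)
{
    /* the index file itself + the page images written to WAL */
    return index_pages * BLCKSZ + index_build_wal_bytes(index_pages);
}
```

For a 1000-page index (about 8 MB), the build writes roughly another 8 MB of WAL on top of the index file itself; recovery then replays those images at sequential-copy speed rather than redoing the sort.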
On Tue, Jun 01, 2004 at 01:55:38PM -0400, Greg Stark wrote:
> Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
>
> > The big problem I see with this kind of approach is that building an
> > index from scratch can take a huge amount of time, because you have to
> > sort the data. Building from WAL does not have this problem, so it can
> > be much faster.
>
> I'm not clear that building from WAL is really going to be that much faster.
> A) algorithmically it's only the factor of log(n) that you're talking about.
> and B) the WAL will have records for every write, not just the final product,
> so it might potentially have a lot more writes to do.

Maybe it is log(n) algorithmically, but the constants are big because there's a lot of non-sequential I/O involved. With the WAL approach you only read the pages from the log and copy them somewhere else. It's not nearly the same amount of I/O, and it's mostly sequential. And if I understood correctly what Tom said, after index construction the whole index pages are dumped to WAL, so there's no need to redo each node split operation.

> I thought part of the original problem was specifically that going through WAL
> slowed down the index rebuild much more than a factor of 2, which would tend
> to imply that in fact rebuilding from WAL isn't going to be as fast as you
> might expect.

I think there was more than one problem in the code. I expect at least those not related to WAL (such as some apparent locking issue) are solved. And I'd expect WAL construction to be heavier than recovery from WAL, because some WAL writes have to be fsync()ed, and this is a heavy burden, while recovery does not need fsync on the new files AFAIK (but I could be wrong on this).

One of the things that bothered me was that for some reason you couldn't get the whole performance benefit you would expect from simultaneous index builds when using a multiprocessor machine.

> Another possibility is doing the complete index build without going through
> WAL and then inserting a complete copy of the index into the WAL without
> syncing or activating the rebuilt index until the copy to WAL is done. That
> kind of sucks since it's equivalent to just taking another backup of the data
> files immediately after the rebuild, but might be a more direct solution using
> the existing tools.

Apparently this is the current state of affairs, though I'm not sure.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"The person who did not want to sin / was obliged to sit on hard, steep chairs / devoid, indeed, of soft mitigations" (Patricio Vogel)
Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
> > I'm not clear that building from WAL is really going to be that much faster.
> > A) algorithmically it's only the factor of log(n) that you're talking about.
> > and B) the WAL will have records for every write, not just the final product,
> > so it might potentially have a lot more writes to do.
>
> Wrong ... what we log in WAL for a btree index build is just the series
> of completed index page images. Recreation of the index would proceed
> at whatever your disk read/write bandwidth is.
>
> Like Alvaro, I suspect that people who are using PITR will be concerned
> about recovery time, and would not be thrilled with any scenario that
> involves REINDEX to get the system back on its feet.

Agreed.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  + If your life is a hard drive,      |  13 Roberts Road
  + Christ can be your backup.         |  Newtown Square, Pennsylvania 19073
On Tue, 2004-06-01 at 03:13, Alvaro Herrera wrote:
> A completely different idea would be to log a "logical index creation",
> so that during normal recovery those entries are saved somewhere; after
> the rest of WAL recovery is done, the system is taken into a more normal
> post-recovery pre-usable state, on which those indexes are recreated
> from user data. This would be cheapest in WAL traffic, but probably
> it'll also require more code and new hooks in the startup mechanism.
> Also, it'd require examining later WAL entries that refer to the index
> and act accordingly (e.g. ignore the entry if it modifies the index, and
> forget the creation if it's a DROP INDEX command.)

There will be many ways to optimise recovery once we have PITR working... The current code does a straight replay of all changes. We can imagine lots of different multi-pass or lookahead strategies for replaying xlog records, but please let's wait awhile...

> Not that I really like either of those ideas ... issuing normal WAL
> index creation traffic if PITR is active is certainly the easiest way.

I agree, certainly for now.

Best Regards, Simon Riggs
On Tue, 2004-06-01 at 03:21, Bruce Momjian wrote:
> Tom Lane wrote:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > I assume if someone turns on archiving in postgresql.conf, sighups the
> > > postmaster, then does a tar backup, they should be able to do archiving,
> > > no?
> >
> > I would have zero problem with labeling the archive parameter as
> > changeable only at postmaster start.
>
> I guess the question is whether it would be possible to start/stop it at
> other times. And what process are we going to use to do a tar backup?
> Do they turn off archiving before doing the tar, or tell the system to
> tar to another location?

This situation has been discussed and agreed twice already. First we discussed it as SUSET, then SIGHUP, now we talk about postmaster startup. I'm not sure I'm too bothered either way, but the code has now been written to make it a SIGHUP operation.

Making it SIGHUP affects the way we invoke the archiver process at postmaster startup, so if we want to change things again we must do so real soon. Postmaster startup is the simplest scenario at run-time, so I'd suggest we move to that NOW and then MAYBE back to SIGHUP at a later time, when we are more certain everything works in production.

> And you brought up the issue of how do we feed multiple archive files
> back into the xlog directory during restore if they don't all fit on the
> disk.

That has already been requested by Tom and agreed as being on the PITR feature list, as an embellishment of the general recover-to-a-point scenario. It *MIGHT* make it into this release, if we get the other stuff done first.

> I think we need to explore the procedures we are going to use for PITR.

Much of that has already been discussed.

Best Regards, Simon Riggs