Thread: Fast index build vs. PITR
I was just about to commit a patch that revises the btree index build procedure as discussed here:
http://archives.postgresql.org/pgsql-general/2004-05/msg00480.php
specifically, not using shared buffers during index build and bypassing WAL-logging in favor of just fsyncing the index file before commit.

I was actually writing the commit message when it occurred to me that this would seriously break PITR. If the WAL datastream doesn't contain enough info to rebuild the index then rolling forward from a past backup isn't gonna work.

I thought for a little bit about a magic "reconstruct the index" WAL entry that would invoke the index build procedure in toto, but that doesn't look like it will fly either. (Two problems: during crash recovery, you couldn't be sure that what's on disk for the underlying table exactly matches the index you need to build --- it could be a later state of the table; and besides, the environment of the WAL replay process isn't capable of running user-defined functions, so it couldn't work for functional indexes.)

So AFAICS, we've got to dump the index contents into WAL to support PITR. This is a tad annoying.

What I'm thinking about right now is tweaking the index-build code to write to WAL only if it sees that PITR is actually in use. It would have to look at the GUC variables to determine whether WAL archiving is enabled. If archiving isn't turned on, then we could assume that rollforward from a past backup isn't needed in this installation, and use the WAL-less index build method.

Comments?

			regards, tom lane
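Tom's proposal boils down to a single runtime branch on the archiving GUC. A minimal C sketch of that decision follows; the flag and function names here are illustrative stand-ins, not the actual PostgreSQL symbols:

```c
#include <stdbool.h>

/* Stand-in for the WAL-archiving GUC; in a real backend this would be
 * read from postgresql.conf at startup.  (Hypothetical name.) */
static bool wal_archiving_enabled = false;

/* Choose the build strategy Tom describes: WAL-log the index build only
 * when archiving (PITR) is on; otherwise skip WAL entirely and rely on
 * fsyncing the finished index file before commit for crash safety. */
static bool
index_build_uses_wal(void)
{
    return wal_archiving_enabled;
}
```

With archiving off, the builder would take the fast fsync-only path; with archiving on, it would emit the usual WAL traffic so rollforward from a past backup still works.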
On Mon, 31 May 2004, Tom Lane wrote:

[snip]

> I thought for a little bit about a magic "reconstruct the index" WAL
> entry that would invoke the index build procedure in toto, but that
> doesn't look like it will fly either. (Two problems: during crash
> recovery, you couldn't be sure that what's on disk for the underlying
> table exactly matches the index you need to build --- it could be a
> later state of the table; and besides, the environment of the WAL replay
> process isn't capable of running user-defined functions, so it couldn't
> work for functional indexes.)
>
> So AFAICS, we've got to dump the index contents into WAL to support
> PITR. This is a tad annoying.

Is it possible in this case to dump the index block by block into the log after it has been generated? It seems to me this should reduce the amount of data we write to WAL and (therefore) speed up the rebuild.

Gavin
> What I'm thinking about right now is tweaking the index-build code to
> write to WAL only if it sees that PITR is actually in use. It would
> have to look at the GUC variables to determine whether WAL archiving
> is enabled. If archiving isn't turned on, then we could assume that
> rollforward from a past backup isn't needed in this installation, and
> use the WAL-less index build method.

Seems reasonable.

Chris
Christopher Kings-Lynne wrote:
> > What I'm thinking about right now is tweaking the index-build code to
> > write to WAL only if it sees that PITR is actually in use. It would
> > have to look at the GUC variables to determine whether WAL archiving
> > is enabled. If archiving isn't turned on, then we could assume that
> > rollforward from a past backup isn't needed in this installation, and
> > use the WAL-less index build method.
>
> Seems reasonable.

What happens if someone turns on archiving while the index is being built? Is that possible?

I assume if someone turns on archiving in postgresql.conf, sighups the postmaster, then does a tar backup, they should be able to do archiving, no?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  + If your life is a hard drive,      |  13 Roberts Road
  + Christ can be your backup.         |  Newtown Square, Pennsylvania 19073
On Tue, Jun 01, 2004 at 11:34:15AM +1000, Gavin Sherry wrote:
> On Mon, 31 May 2004, Tom Lane wrote:
>
> > I thought for a little bit about a magic "reconstruct the index" WAL
> > entry that would invoke the index build procedure in toto, but that
> > doesn't look like it will fly either. (Two problems: during crash
> > recovery, you couldn't be sure that what's on disk for the underlying
> > table exactly matches the index you need to build --- it could be a
> > later state of the table; and besides, the environment of the WAL replay
> > process isn't capable of running user-defined functions, so it couldn't
> > work for functional indexes.)
> >
> > So AFAICS, we've got to dump the index contents into WAL to support
> > PITR. This is a tad annoying.
>
> Is it possible in this case to dump the index block by block into the log
> after it has been generated? It seems to me this should reduce the
> amount of data we write to WAL and (therefore) speed up the rebuild.

Is this less log traffic? You save a lot of per-record overhead, but you have to save internal pages (which are not saved with the standard code). Plus it would be a lot different from standard btree WAL traffic, so it'd be more code.

Maybe there's more to be saved by logging only leaf pages. But then there would be even more code to reconstruct the index from only leaf pages, and there are not that many internal pages anyway.

A completely different idea would be to log a "logical index creation", so that during normal recovery those entries are saved somewhere; after the rest of WAL recovery is done, the system is taken into a more normal post-recovery, pre-usable state, in which those indexes are recreated from user data. This would be cheapest in WAL traffic, but it would probably also require more code and new hooks in the startup mechanism. Also, it'd require examining later WAL entries that refer to the index and acting accordingly (e.g. ignoring an entry that modifies the index, and forgetting the creation if a DROP INDEX command follows).

Not that I really like either of those ideas ... issuing normal WAL index creation traffic if PITR is active is certainly the easiest way.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"In Europe they call me Niklaus Wirth; in the US they call me Nickel's worth.
 That's because in Europe they call me by name, and in the US by value!"
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I assume if someone turns on archiving in postgresql.conf, sighups the
> postmaster, then does a tar backup, they should be able to do archiving,
> no?

I would have zero problem with labeling the archive parameter as changeable only at postmaster start.

			regards, tom lane
Gavin Sherry <swm@linuxworld.com.au> writes:
>> So AFAICS, we've got to dump the index contents into WAL to support
>> PITR. This is a tad annoying.

> Is it possible in this case to dump the index block by block into the log
> after it has been generated?

That's what we do now, and it more or less doubles the amount of I/O needed to create an index ...

			regards, tom lane
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I assume if someone turns on archiving in postgresql.conf, sighups the
> > postmaster, then does a tar backup, they should be able to do archiving,
> > no?
>
> I would have zero problem with labeling the archive parameter as
> changeable only at postmaster start.

I guess the question is whether it would be possible to start/stop it at other times. And what process are we going to use to do a tar backup? Do they turn off archiving before doing the tar, or tell the system to tar to another location?

And you brought up the issue of how do we feed multiple archive files back into the xlog directory during restore if they don't all fit on the disk.

I think we need to explore the procedures we are going to use for PITR. Also, do we need to do tar in a special way, like tar up a specific file first?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  + If your life is a hard drive,      |  13 Roberts Road
  + Christ can be your backup.         |  Newtown Square, Pennsylvania 19073
Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> A completely different idea would be to log a "logical index creation",
> so that during normal recovery those entries are saved somewhere; after
> the rest of WAL recovery is done, the system is taken into a more normal
> post-recovery pre-usable state, on which those indexes are recreated
> from user data.

I think an actually implementable version of this would be:

1. Don't log any index operations at all in WAL.

2. When recovering from WAL, restore all the table contents by WAL replay. (This would of course include the system catalog contents that describe the indexes.) Then sit there and do a global REINDEX to rebuild all the indexes.

This would gain a reduction of some percentage in WAL traffic, at the cost of a hugely expensive recovery cycle any time you actually needed to use the WAL. I guess this could be attractive to some installations, but I'm not sure very many people would want it ...

			regards, tom lane
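The two-phase recovery cycle described above can be sketched as follows. This is only an illustrative outline of the control flow, not PostgreSQL internals; the struct and function names are invented for the example:

```c
/* Sketch of the recovery cycle when index operations are never
 * WAL-logged: phase 1 replays table contents from WAL (including the
 * catalog rows that describe each index), phase 2 performs the global
 * REINDEX that rebuilds every index from the recovered heap data.
 * All names here are hypothetical. */
typedef struct RecoveryStats
{
    int tables_replayed;
    int indexes_rebuilt;
} RecoveryStats;

static RecoveryStats
recover_with_global_reindex(int ntables, int nindexes)
{
    RecoveryStats stats = {0, 0};

    /* Phase 1: WAL replay restores heap pages only. */
    for (int t = 0; t < ntables; t++)
        stats.tables_replayed++;

    /* Phase 2: the expensive part -- every index is re-sorted and
     * rebuilt from user data before the system is usable. */
    for (int i = 0; i < nindexes; i++)
        stats.indexes_rebuilt++;

    return stats;
}
```

The sketch makes the trade-off visible: phase 2 scales with the number (and size) of indexes, which is exactly the recovery-time cost being weighed against the WAL-traffic savings.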
On Tue, 2004-06-01 at 01:24, Tom Lane wrote:
> I was just about to commit a patch that revises the btree index build
> procedure as discussed here:
> http://archives.postgresql.org/pgsql-general/2004-05/msg00480.php
> specifically, not using shared buffers during index build and bypassing
> WAL-logging in favor of just fsyncing the index file before commit.
>
> I was actually writing the commit message when it occurred to me that
> this would seriously break PITR. If the WAL datastream doesn't contain
> enough info to rebuild the index then rolling forward from a past backup
> isn't gonna work.
>
> I thought for a little bit about a magic "reconstruct the index" WAL
> entry that would invoke the index build procedure in toto, but that
> doesn't look like it will fly either. (Two problems: during crash
> recovery, you couldn't be sure that what's on disk for the underlying
> table exactly matches the index you need to build --- it could be a
> later state of the table; and besides, the environment of the WAL replay
> process isn't capable of running user-defined functions, so it couldn't
> work for functional indexes.)
>
> So AFAICS, we've got to dump the index contents into WAL to support
> PITR. This is a tad annoying.
>
> What I'm thinking about right now is tweaking the index-build code to
> write to WAL only if it sees that PITR is actually in use. It would
> have to look at the GUC variables to determine whether WAL archiving
> is enabled. If archiving isn't turned on, then we could assume that
> rollforward from a past backup isn't needed in this installation, and
> use the WAL-less index build method.
>
> Comments?

The mechanism you suggest would also break crash recovery, not just PITR - though the avoidance of shared buffers seems like a gain either way.

..You raise the whole subject of UNRECOVERABLE data objects, also known as NOT LOGGED etc. There are many significant performance gains to be had by turning off recoverability for certain large operations. Oracle and Teradata make extensive use of such features, especially in Data Warehousing. Examples of such operations might be:
- index builds
- INSERT SELECTs into previously empty tables

This is an important area for performance... not just index builds. A suggestion would be:
- add the "don't send to xlog" functionality as a user option on each statement, default=LOGGING - this could be Oracle compatible, or not, but the concept is similar. Put the hooks in now and we can add this to all appropriate statement syntax later, e.g.
  CREATE INDEX blah ... NO LOGGING ... ;
  INSERT INTO blah ... NO LOGGING ... SELECT ... ;
- if the conf file says don't use fsync, then don't write to the log - clearly they don't mind losing data in the event of a crash... i.e. default=NOLOGGING on all statements

Best regards, Simon Riggs
> I think an actually implementable version of this would be:
>
> 1. Don't log any index operations at all in WAL.
>
> 2. When recovering from WAL, restore all the table contents by WAL
> replay. (This would of course include the system catalog contents that
> describe the indexes.) Then sit there and do a global REINDEX to
> rebuild all the indexes.
>
> This would gain a reduction of some percentage in WAL traffic, at the
> cost of a hugely expensive recovery cycle any time you actually needed
> to use the WAL. I guess this could be attractive to some installations,
> but I'm not sure very many people would want it ...

I think only the "global" part of it is not really acceptable. If we had a flag for each index that marks it "inconsistent", reindexing only those that are marked would be great.

Could we log a WAL record that basically only marks an index for deferred reindex after WAL recovery? During WAL replay all records for this index could be ignored (this is not a must, because of the post-update page images in WAL; the index would still stay inconsistent until reindex, of course).

I think such a reindex step could also be responsible for those non-btree indexes that don't fully support WAL (gist?).

Andreas
Simon Riggs <simon@2ndquadrant.com> writes:
> The mechanism you suggest would also break crash recovery, not just PITR

No it wouldn't.

			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
> I thought for a little bit about a magic "reconstruct the index" WAL
> entry that would invoke the index build procedure in toto, but that
> doesn't look like it will fly either. (Two problems: during crash
> recovery, you couldn't be sure that what's on disk for the underlying
> table exactly matches the index you need to build --- it could be a
> later state of the table; and besides, the environment of the WAL replay
> process isn't capable of running user-defined functions, so it couldn't
> work for functional indexes.)

Could you just mark the index as unusable? Have the optimizer ignore such indexes, and PITR recovery can notify the user of these indexes and/or invoke a rebuild automatically?

It wouldn't happen unless the user had done an index rebuild since the last complete backup, so it wouldn't even be a performance issue. Restoring the index from the WAL replay of an index rebuild must take a long time anyways.

--
greg
On Tue, Jun 01, 2004 at 12:52:32PM -0400, Greg Stark wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>
> > I thought for a little bit about a magic "reconstruct the index" WAL
> > entry that would invoke the index build procedure in toto, but that
> > doesn't look like it will fly either. (Two problems: during crash
> > recovery, you couldn't be sure that what's on disk for the underlying
> > table exactly matches the index you need to build --- it could be a
> > later state of the table; and besides, the environment of the WAL replay
> > process isn't capable of running user-defined functions, so it couldn't
> > work for functional indexes.)
>
> Could you just mark the index as unusable? Have the optimizer ignore such
> indexes and PITR recovery can notify the user of these indexes and/or invoke a
> rebuild automatically?

The big problem I see with this kind of approach is that building an index from scratch can take a huge amount of time, because you have to sort the data. Building from WAL does not have this problem, so it can be much faster.

Of course, when you are restoring using a PITR approach you probably want it to be very fast, and have the DB running with as few quirks as possible, as soon as possible.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"The realist knows what he wants; the idealist wants what he knows" (Anonymous)
Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> The big problem I see with this kind of approach is that building an
> index from scratch can take a huge amount of time, because you have to
> sort the data. Building from WAL does not have this problem, so it can
> be much faster.

I'm not clear that building from WAL is really going to be that much faster. A) algorithmically it's only the factor of log(n) that you're talking about, and B) the WAL will have records for every write, not just the final product, so it might potentially have a lot more writes to do.

I thought part of the original problem was specifically that going through WAL slowed down the index rebuild much more than a factor of 2, which would tend to imply that in fact rebuilding from WAL isn't going to be as fast as you might expect.

Another possibility is doing the complete index build without going through WAL and then inserting a complete copy of the index into the WAL, without syncing or activating the rebuilt index until the copy to WAL is done. That kind of sucks since it's equivalent to just taking another backup of the data files immediately after the rebuild, but might be a more direct solution using the existing tools.

> Of course, when you are restoring using a PITR approach you probably want it
> to be very fast, and have the DB running with as few quirks as possible,
> as soon as possible.

This is certainly true.

--
greg
Greg Stark <gsstark@mit.edu> writes:
> I'm not clear that building from WAL is really going to be that much faster.
> A) algorithmically it's only the factor of log(n) that you're talking about.
> and B) the WAL will have records for every write, not just the final product,
> so it might potentially have a lot more writes to do.

Wrong ... what we log in WAL for a btree index build is just the series of completed index page images. Recreation of the index would proceed at whatever your disk read/write bandwidth is.

Like Alvaro, I suspect that people who are using PITR will be concerned about recovery time, and would not be thrilled with any scenario that involves REINDEX to get the system back on its feet.

			regards, tom lane
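Since only completed page images are logged, the extra WAL volume for a build is roughly the size of the finished index, which lines up with the "more or less doubles the amount of I/O" figure mentioned upthread. A back-of-envelope helper (the 8 KB block size matches PostgreSQL's default BLCKSZ; the function names are made up for illustration):

```c
/* Estimate write volume for a WAL-logged btree build, assuming one
 * full image is logged per completed index page.  WAL bytes then
 * equal the index size, so total write I/O roughly doubles. */
#define BLCKSZ 8192L            /* PostgreSQL default page size */

static long
index_build_wal_bytes(long index_pages)
{
    /* one full page image per completed index page */
    return index_pages * BLCKSZ;
}

static long
total_build_write_bytes(long index_pages)
{
    /* the index file itself + the page images written to WAL */
    return index_pages * BLCKSZ + index_build_wal_bytes(index_pages);
}
```

For a 1000-page index (about 8 MB), the build writes roughly another 8 MB of WAL on top of the index file itself; recovery then replays those images at sequential-copy speed rather than redoing the sort.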
On Tue, Jun 01, 2004 at 01:55:38PM -0400, Greg Stark wrote:
> Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
>
> > The big problem I see with this kind of approach is that building an
> > index from scratch can take a huge amount of time, because you have to
> > sort the data. Building from WAL does not have this problem, so it can
> > be much faster.
>
> I'm not clear that building from WAL is really going to be that much faster.
> A) algorithmically it's only the factor of log(n) that you're talking about.
> and B) the WAL will have records for every write, not just the final product,
> so it might potentially have a lot more writes to do.

Maybe it is log(n) algorithmically, but the constants are big because there's a lot of non-sequential I/O involved. With the WAL approach you only read the pages from the log and copy them somewhere else. It's not nearly the same amount of I/O, and it's mostly sequential. And if I understood correctly what Tom said, after index construction the whole index pages are dumped to WAL, so there's no need to redo each node split operation.

> I thought part of the original problem was specifically that going through WAL
> slowed down the index rebuild much more than a factor of 2, which would tend
> to imply that in fact rebuilding from WAL isn't going to be as fast as you
> might expect.

I think there was more than one problem in the code. I expect at least those not related to WAL (such as some apparent locking issue) are solved. And I'd expect WAL construction to be heavier than recovery from WAL, because some WAL writes have to be fsync()ed, and this is a heavy burden, while recovery does not need fsync on the new files AFAIK (but I could be wrong on this).

One of the things that bothered me was that for some reason you couldn't get the whole performance benefit you would expect from simultaneous index builds when using a multiprocessor machine.

> Another possibility is doing the complete index build without going through
> WAL and then inserting a complete copy of the index into the WAL without
> syncing or activating the rebuilt index until the copy to WAL is done. That
> kind of sucks since it's equivalent to just taking another backup of the data
> files immediately after the rebuild, but might be a more direct solution using
> the existing tools.

Apparently this is the current state of affairs, though I'm not sure.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"The person who did not want to sin / was obliged to sit on hard, steep chairs / devoid, indeed, of soft mitigations" (Patricio Vogel)
Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
> > I'm not clear that building from WAL is really going to be that much faster.
> > A) algorithmically it's only the factor of log(n) that you're talking about.
> > and B) the WAL will have records for every write, not just the final product,
> > so it might potentially have a lot more writes to do.
>
> Wrong ... what we log in WAL for a btree index build is just the series
> of completed index page images. Recreation of the index would proceed
> at whatever your disk read/write bandwidth is.
>
> Like Alvaro, I suspect that people who are using PITR will be concerned
> about recovery time, and would not be thrilled with any scenario that
> involves REINDEX to get the system back on its feet.

Agreed.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  + If your life is a hard drive,      |  13 Roberts Road
  + Christ can be your backup.         |  Newtown Square, Pennsylvania 19073
On Tue, 2004-06-01 at 03:13, Alvaro Herrera wrote:
> A completely different idea would be to log a "logical index creation",
> so that during normal recovery those entries are saved somewhere; after
> the rest of WAL recovery is done, the system is taken into a more normal
> post-recovery pre-usable state, on which those indexes are recreated
> from user data. This would be cheapest in WAL traffic, but probably
> it'll also require more code and new hooks in the startup mechanism.
> Also, it'd require examining later WAL entries that refer to the index
> and act accordingly (e.g. ignore the entry if it modifies the index, and
> forget the creation if it's a DROP INDEX command.)

There will be many ways to optimise recovery once we have PITR working... The current code does a straight replay of all changes. We can imagine lots of different multi-pass or lookahead strategies for replaying xlog records, but please let's wait awhile...

> Not that I really like either of those ideas ... issuing normal WAL
> index creation traffic if PITR is active is certainly the easiest way.

I agree, certainly for now.

Best Regards, Simon Riggs
On Tue, 2004-06-01 at 03:21, Bruce Momjian wrote:
> Tom Lane wrote:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > I assume if someone turns on archiving in postgresql.conf, sighups the
> > > postmaster, then does a tar backup, they should be able to do archiving,
> > > no?
> >
> > I would have zero problem with labeling the archive parameter as
> > changeable only at postmaster start.
>
> I guess the question is whether it would be possible to start/stop it at
> other times. And what process are we going to use to do a tar backup?
> Do they turn off archiving before doing the tar, or tell the system to
> tar to another location?

This situation has been discussed and agreed twice already. First we discussed it as SUSET, then SIGHUP, now we talk about postmaster startup. I'm not sure I'm too bothered either way, but the code has now been written to make it a SIGHUP operation.

Making it SIGHUP affects the way we invoke the archiver process at postmaster startup, so if we want to change things again we must do so real soon. Postmaster startup is the simplest scenario at run-time, so I'd suggest we move to that NOW and then MAYBE back to SIGHUP at a later time, when we are more certain everything works in production.

> And you brought up the issue of how do we feed multiple archive files
> back into the xlog directory during restore if they don't all fit on the
> disk.

That has already been requested by Tom and agreed as being on the PITR feature list, as an embellishment of the general recover-to-a-point scenario. It *MIGHT* make it into this release, if we get the other stuff done first.

> I think we need to explore the procedures we are going to use for PITR.

Much of that has already been discussed.

Best Regards, Simon Riggs