Thread: Move tablespace
I just noticed that ALTER TABLE foo SET TABLESPACE new_tablespace; doesn't optimise writing WAL, so when streaming enabled it will copy the whole table to WAL and across the wire. My understanding was that ALTER TABLE had been optimised, though not as much as could be in this case. Following patch writes a new WAL record that just says "copy foo to newts" and during replay we flush buffers and then re-execute the copy (but only when InArchiveRecovery). So the copy happens locally on the standby, not copying from primary to standby. We do this just with a little refactoring and a simple new WAL message. So about 64 bytes rather than potentially gigabytes of data, which seems important when using SR. Simple patch, no new concepts, so I figure to apply this now. (Yes, I need to bump WAL format id as well). Objections? -- Simon Riggs www.2ndQuadrant.com
Attachment
Simon Riggs <simon@2ndQuadrant.com> writes: > Following patch writes a new WAL record that just says "copy foo to > newts" and during replay we flush buffers and then re-execute the copy > (but only when InArchiveRecovery). So the copy happens locally on the > standby, not copying from primary to standby. We do this just with a > little refactoring and a simple new WAL message. And what happens to crash-recovery replay? You can't have it both ways, either the data is in WAL or it's missing. > Objections? This is NOT the time to be rushing in marginal performance optimizations. I don't think you've thought through all the corner cases anyway. regards, tom lane
On Tue, 2010-04-20 at 21:03 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > Following patch writes a new WAL record that just says "copy foo to > > newts" and during replay we flush buffers and then re-execute the copy > > (but only when InArchiveRecovery). So the copy happens locally on the > > standby, not copying from primary to standby. We do this just with a > > little refactoring and a simple new WAL message. > > And what happens to crash-recovery replay? You can't have it both ways, > either the data is in WAL or it's missing. The patch changes nothing in the case of crash recovery. There is no WAL written if !XLogIsNeeded, so we *must* have already made the decision that the absence of WAL is not a problem for crash recovery. Note that currently we flush the new table to disk just like we do for heap_sync(), whether or not WAL is written. > > Objections? > > This is NOT the time to be rushing in marginal performance > optimizations. I don't think you've thought through all the corner > cases anyway. The performance gain isn't marginal, otherwise I wouldn't have bothered writing it * We avoid writing GB of unnecessary WAL data on primary * We avoid streaming that WAL data to the standby If you can see a corner case that I do not, please say. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote: > On Tue, 2010-04-20 at 21:03 -0400, Tom Lane wrote: >> Simon Riggs <simon@2ndQuadrant.com> writes: >>> Following patch writes a new WAL record that just says "copy foo to >>> newts" and during replay we flush buffers and then re-execute the copy >>> (but only when InArchiveRecovery). So the copy happens locally on the >>> standby, not copying from primary to standby. We do this just with a >>> little refactoring and a simple new WAL message. >> And what happens to crash-recovery replay? You can't have it both ways, >> either the data is in WAL or it's missing. > > The patch changes nothing in the case of crash recovery. What happens if the record is replayed twice in archive recovery? For example if you stop and restart a standby server after it has replayed that record. What does the 2nd redo attempt do if the source file was already deleted by the 1st recovery. I also think we shouldn't be fiddling with this at this stage in the release cycle. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, 2010-04-21 at 14:37 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Tue, 2010-04-20 at 21:03 -0400, Tom Lane wrote: > >> Simon Riggs <simon@2ndQuadrant.com> writes: > >>> Following patch writes a new WAL record that just says "copy foo to > >>> newts" and during replay we flush buffers and then re-execute the copy > >>> (but only when InArchiveRecovery). So the copy happens locally on the > >>> standby, not copying from primary to standby. We do this just with a > >>> little refactoring and a simple new WAL message. > >> And what happens to crash-recovery replay? You can't have it both ways, > >> either the data is in WAL or it's missing. > > > > The patch changes nothing in the case of crash recovery. > > What happens if the record is replayed twice in archive recovery? For > example if you stop and restart a standby server after it has replayed > that record. What does the 2nd redo attempt do if the source file was > already deleted by the 1st recovery. If the source is absent then we know that replay had successfully copied the file and synced it, so we can just skip the record. > I also think we shouldn't be fiddling with this at this stage in the > release cycle. OK, but not because I see a problem with the technique. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs <simon@2ndQuadrant.com> writes: > On Wed, 2010-04-21 at 14:37 +0300, Heikki Linnakangas wrote: >> I also think we shouldn't be fiddling with this at this stage in the >> release cycle. > OK, but not because I see a problem with the technique. You made that plain already, but you have not convinced anyone else. More to the point, ALTER SET TABLESPACE is not an operation that happens so much that we ought to take any risks to optimize it. regards, tom lane
On Wed, 2010-04-21 at 11:53 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > On Wed, 2010-04-21 at 14:37 +0300, Heikki Linnakangas wrote: > >> I also think we shouldn't be fiddling with this at this stage in the > >> release cycle. > > > OK, but not because I see a problem with the technique. > > You made that plain already, but you have not convinced anyone else. Not pouting, just recording the fact that its an ongoing discussion. > More to the point, ALTER SET TABLESPACE is not an operation that > happens so much that we ought to take any risks to optimize it. I think its the opposite: people don't run it because they can't. But I'm happy not to press further at this stage of 9.0. I am always optimistic about convincing you ;-) -- Simon Riggs www.2ndQuadrant.com