Thread: xlogdump fixups and WAL log question.
Not sure who cares, so xzilla indicated I should drop a note here. I just made the xlogdump stuff work for 8.1 (trivial) and fixed a few other small issues that caused it to not work right both generally and in our environment. http://pgfoundry.org/tracker/index.php? func=detail&aid=1000760&group_id=1000202&atid=772 We're using it to track down what's causing some wal log ruckus. We're generating about a quarter terabyte of WAL logs a day (on bad days) which is posing some PITR backup pains. That amount isn't a severe challenge to backup, but our previous install was on Oracle and it generated substantially less archive redo logs (10-20 gigs per day). Is it possible to create tables in fashion that will not write info to the WAL log -- knowingly and intentionally making them unrecoverable? This is very desirable for us. We snapshot tables from a production environment. If the database goes down and we recover, the old snapshots are out of date anyway and serve no useful purpose. The periodic snapshot procedure would re-snap them in short order anyway. I'd like to do: INSERT INTO TABLE tblfoo_snap1 AS SELECT * from <table on remote database> NO LOGGING; (NO LOGGING being the only part we're currently missing) Is something like this possible? Cheers ;-) Theo // Theo Schlossnagle // CTO -- http://www.omniti.com/~jesus/ // OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
Theo Schlossnagle <jesus@omniti.com> writes: > Is it possible to create tables in fashion that will not write info > to the WAL log -- knowingly and intentionally making them > unrecoverable? Use temp tables? Also, it's likely that much of the WAL volume is full-page images. While you can't safely turn those off in 8.1, you can dial down the frequency of occurrence by increasing checkpoint_segments and checkpoint_timeout as much as you can stand. (The tradeoffs are amount of space occupied by pg_xlog/ and time to recover from a crash.) regards, tom lane
On Oct 20, 2006, at 1:58 PM, Tom Lane wrote: > Theo Schlossnagle <jesus@omniti.com> writes: >> Is it possible to create tables in fashion that will not write info >> to the WAL log -- knowingly and intentionally making them >> unrecoverable? > > Use temp tables? temp tables won't work too well -- unless I can make a whole tablespace "temp" and multiple backends can see it. They work fine for small tables we snapshot (couple hundred or even a few thousand rows), but many of the tables are a few hundred thousand rows and several processes on the system all need them. > Also, it's likely that much of the WAL volume is full-page images. > While you can't safely turn those off in 8.1, you can dial down the > frequency of occurrence by increasing checkpoint_segments and > checkpoint_timeout as much as you can stand. (The tradeoffs are > amount of space occupied by pg_xlog/ and time to recover from a > crash.) Our pg_xlog is currently at 9.6GB. Not sure I can reasonably tune it up much higher. // Theo Schlossnagle // CTO -- http://www.omniti.com/~jesus/ // OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
On Fri, 2006-10-20 at 13:18 -0400, Theo Schlossnagle wrote: > Not sure who cares, so xzilla indicated I should drop a note here. I > just made the xlogdump stuff work for 8.1 (trivial) and fixed a few > other small issues that caused it to not work right both generally > and in our environment. > > http://pgfoundry.org/tracker/index.php? > func=detail&aid=1000760&group_id=1000202&atid=772 Diogo Biazus was working on that; I care also. > We're using it to track down what's causing some wal log ruckus. > We're generating about a quarter terabyte of WAL logs a day (on bad > days) which is posing some PITR backup pains. That amount isn't a > severe challenge to backup, but our previous install was on Oracle > and it generated substantially less archive redo logs (10-20 gigs per > day). As Tom says, definitely because of full_page_writes=on > Is it possible to create tables in fashion that will not write info > to the WAL log -- knowingly and intentionally making them > unrecoverable? This is very desirable for us. We snapshot tables > from a production environment. If the database goes down and we > recover, the old snapshots are out of date anyway and serve no useful > purpose. The periodic snapshot procedure would re-snap them in short > order anyway. I'd like to do: > > INSERT INTO TABLE tblfoo_snap1 AS SELECT * from <table on remote > database> NO LOGGING; > > (NO LOGGING being the only part we're currently missing) Is something > like this possible? Do you want this because of: 1) performance? 2) to reduce the WAL volume of PITR backups? If you're thinking (1), then I guess I'd ask whether you've considered what will happen when the reporting environment includes data from other sources as it inevitably will. At that point, data loss would be much more annoying. My experience is that the success of your current implementation will lead quickly to a greatly increased user requirement. I've been looking at ways of reducing the WAL volume for PITR backups. Here's a few ideas: 1. Provide a filter that can be easily used by archive_command to remove full page writes from WAL files. This would require us to disable the file size test when we begin recovery on a new WAL files, plus would need to redesign initial location of the checkpoint record since we could no longer rely on the XLogRecPtr being a byte offset within the file. e.g. archive_command = 'pg_WAL_filter -f | ... ' 2. Include tablespaceid within the header of xlog records. This would allow us to filter out WAL from one or more tablespaces, similarly to (1), plus it would also allow single tablespace recovery. e.g. archive_command = 'pg_WAL_filter -x 35456 | ... ' There are some other ideas for generally reducing WAL volume also. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
On Oct 20, 2006, at 4:24 PM, Simon Riggs wrote: > On Fri, 2006-10-20 at 13:18 -0400, Theo Schlossnagle wrote: >> Not sure who cares, so xzilla indicated I should drop a note here. I >> just made the xlogdump stuff work for 8.1 (trivial) and fixed a few >> other small issues that caused it to not work right both generally >> and in our environment. >> >> http://pgfoundry.org/tracker/index.php? >> func=detail&aid=1000760&group_id=1000202&atid=772 > > Diogo Biazus was working on that; I care also. Cool. Patch is short. >> We're using it to track down what's causing some wal log ruckus. >> We're generating about a quarter terabyte of WAL logs a day (on bad >> days) which is posing some PITR backup pains. That amount isn't a >> severe challenge to backup, but our previous install was on Oracle >> and it generated substantially less archive redo logs (10-20 gigs per >> day). > > As Tom says, definitely because of full_page_writes=on Can I turn that off in 8.1? >> Is it possible to create tables in fashion that will not write info >> to the WAL log -- knowingly and intentionally making them >> unrecoverable? This is very desirable for us. We snapshot tables >> from a production environment. If the database goes down and we >> recover, the old snapshots are out of date anyway and serve no useful >> purpose. The periodic snapshot procedure would re-snap them in short >> order anyway. I'd like to do: >> >> INSERT INTO TABLE tblfoo_snap1 AS SELECT * from <table on remote >> database> NO LOGGING; >> >> (NO LOGGING being the only part we're currently missing) Is something >> like this possible? > > Do you want this because of: > 1) performance? performance in that a substantial portion of my time is spent writing to pg_xlog > 2) to reduce the WAL volume of PITR backups? Yes. Main concern. > > e.g. archive_command = 'pg_WAL_filter -f | ... ' > e.g. archive_command = 'pg_WAL_filter -x 35456 | ... ' > > There are some other ideas for generally reducing WAL volume also. I'd like to see them not written to the xlogs at all (if possible). Seems rather unnecessary unless I'm missing something. // Theo Schlossnagle // CTO -- http://www.omniti.com/~jesus/ // OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
On Fri, 2006-10-20 at 17:04 -0400, Theo Schlossnagle wrote: > On Oct 20, 2006, at 4:24 PM, Simon Riggs wrote: > >> Is it possible to create tables in fashion that will not write info > >> to the WAL log -- knowingly and intentionally making them > >> unrecoverable? This is very desirable for us. We snapshot tables > >> from a production environment. If the database goes down and we > >> recover, the old snapshots are out of date anyway and serve no useful > >> purpose. The periodic snapshot procedure would re-snap them in short > >> order anyway. I'd like to do: > >> > >> INSERT INTO TABLE tblfoo_snap1 AS SELECT * from <table on remote > >> database> NO LOGGING; > >> > >> (NO LOGGING being the only part we're currently missing) Is something > >> like this possible? > > Do you want this because of: > > 1) performance? > > performance in that a substantial portion of my time is spent writing > to pg_xlog > > > 2) to reduce the WAL volume of PITR backups? > > Yes. Main concern. > > > > > e.g. archive_command = 'pg_WAL_filter -f | ... ' > > e.g. archive_command = 'pg_WAL_filter -x 35456 | ... ' > > > > There are some other ideas for generally reducing WAL volume also. > > I'd like to see them not written to the xlogs at all (if possible). > Seems rather unnecessary unless I'm missing something. You aware you can turn archive_command off until after the load? That way you'll have nothing to back up at all until its done. In 8.1, when archive_command is off, CREATE TABLE AS SELECT is optimised to produce no WAL. Turning off WAL is a difficult topic. Without it you have no crash recovery, which IMHO everybody says they don't care about until they crash, then they realise. It's hard to be selective about writing WAL for specific operations also. However, there may be two cases not discussed before: 1. A newly created database into which a full load and/or pg_dump restore is being run. In that case we could have a mode where we turn off WAL completely during initdb via pg_control and then turn it back on again permanently (i.e. a one way switch) once the server is fully loaded. That covers the "restore database" use case. 2. Turn off WAL for one or more tablespaces, though never the main data directory. This allows for data which is "externally recoverable" to be isolated from things like the catalog and other more normal data. That would be a tablespace level option that would propagate to each object. That covers the "regular snapshot load" use case you describe. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
On Sat, Oct 21, 2006 at 10:37:51AM +0100, Simon Riggs wrote: > Turning off WAL is a difficult topic. Without it you have no crash > recovery, which IMHO everybody says they don't care about until they > crash, then they realise. It's hard to be selective about writing WAL > for specific operations also. It's been discussed before. One idea is to declare tables without logging. The idea being that during recovery those tables and related indexes are simply truncated. No foreign keys allowed. Obviously they will not be saved via PITR either. Put another way, the table structure is saved in WAL, but the data isn't. Have a ncie day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
On Oct 21, 2006, at 6:08 AM, Martijn van Oosterhout wrote: > On Sat, Oct 21, 2006 at 10:37:51AM +0100, Simon Riggs wrote: >> Turning off WAL is a difficult topic. Without it you have no crash >> recovery, which IMHO everybody says they don't care about until they >> crash, then they realise. It's hard to be selective about writing WAL >> for specific operations also. > > It's been discussed before. One idea is to declare tables without > logging. The idea being that during recovery those tables and related > indexes are simply truncated. No foreign keys allowed. Obviously they > will not be saved via PITR either. > > Put another way, the table structure is saved in WAL, but the data > isn't. This is exactly what I'd like. Simon suggested turning off WAL during the loads as a possible hack solution. The reason this won't work is that we snap all the time, lots of tables. We have between 2000 and 4000 snapshot operations per day (throughout). At the same time we have reporting queries running (that create and/or populate other tables) that last from 5 minutes to 18 hours. It is important that we run everything but the snapshots with WAL on (as we must have PITR -- sans snapshots) // Theo Schlossnagle // CTO -- http://www.omniti.com/~jesus/ // OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
On Sat, 2006-10-21 at 09:00 -0400, Theo Schlossnagle wrote: > On Oct 21, 2006, at 6:08 AM, Martijn van Oosterhout wrote: > > > On Sat, Oct 21, 2006 at 10:37:51AM +0100, Simon Riggs wrote: > >> Turning off WAL is a difficult topic. Without it you have no crash > >> recovery, which IMHO everybody says they don't care about until they > >> crash, then they realise. It's hard to be selective about writing WAL > >> for specific operations also. > > > > It's been discussed before. One idea is to declare tables without > > logging. The idea being that during recovery those tables and related > > indexes are simply truncated. No foreign keys allowed. Obviously they > > will not be saved via PITR either. > > > > Put another way, the table structure is saved in WAL, but the data > > isn't. > > This is exactly what I'd like. Simon suggested turning off WAL > during the loads as a possible hack solution. The reason this won't > work is that we snap all the time, lots of tables. We have between > 2000 and 4000 snapshot operations per day (throughout). At the same > time we have reporting queries running (that create and/or populate > other tables) that last from 5 minutes to 18 hours. It is important > that we run everything but the snapshots with WAL on (as we must have > PITR -- sans snapshots) These tables are loaded once then read-only, yes? -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
On Oct 21, 2006, at 3:12 PM, Simon Riggs wrote: > On Sat, 2006-10-21 at 09:00 -0400, Theo Schlossnagle wrote: >> On Oct 21, 2006, at 6:08 AM, Martijn van Oosterhout wrote: >> >>> On Sat, Oct 21, 2006 at 10:37:51AM +0100, Simon Riggs wrote: >>>> Turning off WAL is a difficult topic. Without it you have no crash >>>> recovery, which IMHO everybody says they don't care about until >>>> they >>>> crash, then they realise. It's hard to be selective about >>>> writing WAL >>>> for specific operations also. >>> >>> It's been discussed before. One idea is to declare tables without >>> logging. The idea being that during recovery those tables and >>> related >>> indexes are simply truncated. No foreign keys allowed. Obviously >>> they >>> will not be saved via PITR either. >>> >>> Put another way, the table structure is saved in WAL, but the data >>> isn't. >> >> This is exactly what I'd like. Simon suggested turning off WAL >> during the loads as a possible hack solution. The reason this won't >> work is that we snap all the time, lots of tables. We have between >> 2000 and 4000 snapshot operations per day (throughout). At the same >> time we have reporting queries running (that create and/or populate >> other tables) that last from 5 minutes to 18 hours. It is important >> that we run everything but the snapshots with WAL on (as we must have >> PITR -- sans snapshots) > > These tables are loaded once then read-only, yes? No, they are loaded, and then reloaded, and then reloaded. Queries that use them will get the most recently loaded version of them. It meets a business rule like: table foo on the warehouse should be representative of version of table foo on OLTP no older than 30 minutes. // Theo Schlossnagle // CTO -- http://www.omniti.com/~jesus/ // OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
On Sat, 2006-10-21 at 15:17 -0400, Theo Schlossnagle wrote: > On Oct 21, 2006, at 3:12 PM, Simon Riggs wrote: > > > On Sat, 2006-10-21 at 09:00 -0400, Theo Schlossnagle wrote: > >> On Oct 21, 2006, at 6:08 AM, Martijn van Oosterhout wrote: > >> > >>> On Sat, Oct 21, 2006 at 10:37:51AM +0100, Simon Riggs wrote: > >>>> Turning off WAL is a difficult topic. Without it you have no crash > >>>> recovery, which IMHO everybody says they don't care about until > >>>> they > >>>> crash, then they realise. It's hard to be selective about > >>>> writing WAL > >>>> for specific operations also. > >>> > >>> It's been discussed before. One idea is to declare tables without > >>> logging. The idea being that during recovery those tables and > >>> related > >>> indexes are simply truncated. No foreign keys allowed. Obviously > >>> they > >>> will not be saved via PITR either. > >>> > >>> Put another way, the table structure is saved in WAL, but the data > >>> isn't. > >> > >> This is exactly what I'd like. Simon suggested turning off WAL > >> during the loads as a possible hack solution. The reason this won't > >> work is that we snap all the time, lots of tables. We have between > >> 2000 and 4000 snapshot operations per day (throughout). At the same > >> time we have reporting queries running (that create and/or populate > >> other tables) that last from 5 minutes to 18 hours. It is important > >> that we run everything but the snapshots with WAL on (as we must have > >> PITR -- sans snapshots) > > > > These tables are loaded once then read-only, yes? > > No, they are loaded, and then reloaded, and then reloaded. Queries > that use them will get the most recently loaded version of them. It > meets a business rule like: table foo on the warehouse should be > representative of version of table foo on OLTP no older than 30 minutes. But they can be re-created anew with the same name each time? Or I guess not, but you redefine a view every 30 minutes to point to the latest one? If so, then I have a patch that will speed up COPY when in the same transaction as the table that created it. I've finally fixed a bug in my earlier prototypes that seems to make that work now, in all cases. I was being slightly slow before; I thought this was a new requirement but its just the old one slightly restated. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
On Oct 21, 2006, at 4:40 PM, Simon Riggs wrote: > On Sat, 2006-10-21 at 15:17 -0400, Theo Schlossnagle wrote: >> On Oct 21, 2006, at 3:12 PM, Simon Riggs wrote: >> >>> On Sat, 2006-10-21 at 09:00 -0400, Theo Schlossnagle wrote: >>>> On Oct 21, 2006, at 6:08 AM, Martijn van Oosterhout wrote: >>>> >>>>> On Sat, Oct 21, 2006 at 10:37:51AM +0100, Simon Riggs wrote: >>>>>> Turning off WAL is a difficult topic. Without it you have no >>>>>> crash >>>>>> recovery, which IMHO everybody says they don't care about until >>>>>> they >>>>>> crash, then they realise. It's hard to be selective about >>>>>> writing WAL >>>>>> for specific operations also. >>>>> >>>>> It's been discussed before. One idea is to declare tables without >>>>> logging. The idea being that during recovery those tables and >>>>> related >>>>> indexes are simply truncated. No foreign keys allowed. Obviously >>>>> they >>>>> will not be saved via PITR either. >>>>> >>>>> Put another way, the table structure is saved in WAL, but the data >>>>> isn't. >>>> >>>> This is exactly what I'd like. Simon suggested turning off WAL >>>> during the loads as a possible hack solution. The reason this >>>> won't >>>> work is that we snap all the time, lots of tables. We have between >>>> 2000 and 4000 snapshot operations per day (throughout). At the >>>> same >>>> time we have reporting queries running (that create and/or populate >>>> other tables) that last from 5 minutes to 18 hours. It is >>>> important >>>> that we run everything but the snapshots with WAL on (as we must >>>> have >>>> PITR -- sans snapshots) >>> >>> These tables are loaded once then read-only, yes? >> >> No, they are loaded, and then reloaded, and then reloaded. Queries >> that use them will get the most recently loaded version of them. It >> meets a business rule like: table foo on the warehouse should be >> representative of version of table foo on OLTP no older than 30 >> minutes. > > But they can be re-created anew with the same name each time? Or I > guess > not, but you redefine a view every 30 minutes to point to the latest > one? closest to the latter. A view is redefined when the new snapshot is complete. > If so, then I have a patch that will speed up COPY when in the same > transaction as the table that created it. I've finally fixed a bug > in my > earlier prototypes that seems to make that work now, in all cases. > > I was being slightly slow before; I thought this was a new requirement > but its just the old one slightly restated. We don't use COPY. We directly INSERT INTO target_snap SELECT * from remote_select(...) t(cast); remote_select is part of dbi-link. // Theo Schlossnagle // CTO -- http://www.omniti.com/~jesus/ // OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
"Simon Riggs" <simon@2ndquadrant.com> writes: > But they can be re-created anew with the same name each time? Or I guess > not, but you redefine a view every 30 minutes to point to the latest > one? > If so, then I have a patch that will speed up COPY when in the same > transaction as the table that created it. I've finally fixed a bug in my > earlier prototypes that seems to make that work now, in all cases. Can you make the patch cover the case of begin;truncate foo;copy foo from ...commit; It might be infeasible to detect this case, but if it's not ... regards, tom lane
On Sat, 2006-10-21 at 19:24 -0400, Tom Lane wrote: > "Simon Riggs" <simon@2ndquadrant.com> writes: > > But they can be re-created anew with the same name each time? Or I guess > > not, but you redefine a view every 30 minutes to point to the latest > > one? > > > If so, then I have a patch that will speed up COPY when in the same > > transaction as the table that created it. I've finally fixed a bug in my > > earlier prototypes that seems to make that work now, in all cases. > > Can you make the patch cover the case of > > begin; > truncate foo; > copy foo from ... > commit; > > It might be infeasible to detect this case, but if it's not ... I think it is possible to detect this case without making catalog entries, so I'll give this a try. Methinks that the truncate *must* be the immediately preceding command, otherwise we might have a trigger executing to put rows back into the table before we COPY. I should also be able to get that to work with Insert Select without much bother too. We might also add this capability to COPY itself by providing a WITH TRUNCATE option, which would be even cleaner code-wise. We can add that to pg_dump so it will work with/without --single-transaction mode. That will change the permissions reqd slightly, but seems OK. I'll go for all of the above changes unless there are objections. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Simon Riggs wrote: > On Sat, 2006-10-21 at 19:24 -0400, Tom Lane wrote: > > Can you make the patch cover the case of > > > > begin; > > truncate foo; > > copy foo from ... > > commit; > > > > It might be infeasible to detect this case, but if it's not ... > > I think it is possible to detect this case without making catalog > entries, so I'll give this a try. Methinks that the truncate *must* be > the immediately preceding command, otherwise we might have a trigger > executing to put rows back into the table before we COPY. Hmm, is it possible to save the "is empty" info somewhere in local memory, perhaps the relcache (not necessarily propagated), and have heap_insert turn it off? That would help when you do things like begin; truncate foo, bar; copy foo from ... copy bar from ... commit; On the other hand, what happens if you do begin; truncate foo; -- another sessioncopy foo from ... -- original session copy foo from ... commit; How do you detect that the table is no longer empty? -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > Simon Riggs wrote: >> I think it is possible to detect this case without making catalog >> entries, so I'll give this a try. Methinks that the truncate *must* be >> the immediately preceding command, otherwise we might have a trigger >> executing to put rows back into the table before we COPY. > Hmm, is it possible to save the "is empty" info somewhere in local > memory, perhaps the relcache (not necessarily propagated), and have > heap_insert turn it off? The relcache isn't a very safe place to store state --- it's a cache, not stable storage. However, I don't understand why Simon is on about "empty". ISTM the important state is "new relfilenode assigned in this transaction". regards, tom lane
On Sun, 2006-10-22 at 12:12 -0400, Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > Simon Riggs wrote: > >> I think it is possible to detect this case without making catalog > >> entries, so I'll give this a try. Methinks that the truncate *must* be > >> the immediately preceding command, otherwise we might have a trigger > >> executing to put rows back into the table before we COPY. > > > Hmm, is it possible to save the "is empty" info somewhere in local > > memory, perhaps the relcache (not necessarily propagated), and have > > heap_insert turn it off? > > The relcache isn't a very safe place to store state --- it's a cache, > not stable storage. I was imagining adding this onto the Relation struct, just as we do with rd_createSubid and rd_targblock. That isn't a safe place to store that state so we can't do this across multiple backends. We wouldn't want that anyway since otherwise various statements would need to access shared state before they can act, which is not good. > However, I don't understand why Simon is on about "empty". ISTM the > important state is "new relfilenode assigned in this transaction". Thank you for supplying clarity of thought; empty would be important if we were taking full table locks, which we don't want to do (Christmas Past...). So we need not have the DML immediately following a truncate, only that a truncate has previously occurred within the top level transaction or an unaborted subtransaction. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Hi, Simon, Simon Riggs wrote: > 1. Provide a filter that can be easily used by archive_command to remove > full page writes from WAL files. This would require us to disable the > file size test when we begin recovery on a new WAL files, plus would > need to redesign initial location of the checkpoint record since we > could no longer rely on the XLogRecPtr being a byte offset within the > file. pg_WAL_filter could "correct" the XLogRecPtr and file sizes during the filter run. HTH, Markus -- Markus Schaber | Logical Tracking&Tracing International AG Dipl. Inf. | Software Development GIS Fight against software patents in Europe! www.ffii.org www.nosoftwarepatents.org