Thread: WAL logging of SELECT ... INTO command
Currently, it appears that SELECT * INTO new_table FROM old_table logs each page as it's written to WAL. Is this actually needed? Couldn't the database simply log that the SELECT ... INTO statement was executed instead? Doing so would likely result in a large performance improvement in most installs. Is there no provision for writing anything but data page changes (or whole pages) to WAL? -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Tue, 2006-03-21 at 06:22 -0600, Jim C. Nasby wrote: > Currently, it appears that SELECT * INTO new_table FROM old_table logs > each page as it's written to WAL. Is this actually needed? Couldn't the > database simply log that the SELECT ... INTO statement was executed > instead? Doing so would likely result in a large performance improvement > in most installs. Is there no provision for writing anything but data > page changes (or whole pages) to WAL? AFAIK it takes the same code path as CREATE TABLE AS SELECT, which already does exactly what you suggest (except when using PITR). Best Regards, Simon Riggs
On Tue, Mar 21, 2006 at 08:33:50PM +0000, Simon Riggs wrote: > On Tue, 2006-03-21 at 06:22 -0600, Jim C. Nasby wrote: > > Currently, it appears that SELECT * INTO new_table FROM old_table logs > > each page as it's written to WAL. Is this actually needed? Couldn't the > > database simply log that the SELECT ... INTO statement was executed > > instead? Doing so would likely result in a large performance improvement > > in most installs. Is there no provision for writing anything but data > > page changes (or whole pages) to WAL? > > AFAIK it takes the same code path as CREATE TABLE AS SELECT, which > already does exactly what you suggest (except when using PITR). Ok, I saw disk activity on the base directory and assumed it was pg_xlog stuff. Turns out that both SELECT INTO and CREATE TABLE AS ignore default_tablepsace and create the new tables in the base directory. I'm guessing that's a bug... (this is on 8.1.2, btw). Also, why do we log rows for CTAS/SELECT INTO when PITR is in use for simple SELECTs (ones that don't call non-deterministic functions)? The data should alread be available AFAICS... -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Wed, 2006-03-22 at 06:47 -0600, Jim C. Nasby wrote: > Also, why do we log rows for CTAS/SELECT INTO when PITR is in use for > simple SELECTs (ones that don't call non-deterministic functions)? The > data should alread be available AFAICS... Not sure what you're asking... SELECTs don't produce WAL. PITR wants all changes. Without PITR we can optimise certain logging actions. Best Regards, Simon Riggs
On Wed, Mar 22, 2006 at 01:08:34PM +0000, Simon Riggs wrote: > On Wed, 2006-03-22 at 06:47 -0600, Jim C. Nasby wrote: > > > Also, why do we log rows for CTAS/SELECT INTO when PITR is in use for > > simple SELECTs (ones that don't call non-deterministic functions)? The > > data should alread be available AFAICS... > > Not sure what you're asking... SELECTs don't produce WAL. Yes, there'd have to be some special kind of WAL entry that specifies what select statement was used in CTAS. > PITR wants all changes. Without PITR we can optimise certain logging > actions. The only change here is that we're creating a new table based on the results of a SELECT. If that SELECT doesn't use anything that's non-deterministic, then the machine doing the recovery should already have all the data it needs, provided that we log the SELECT that was used in the CTAS. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes: >> PITR wants all changes. Without PITR we can optimise certain logging >> actions. > The only change here is that we're creating a new table based on the > results of a SELECT. If that SELECT doesn't use anything that's > non-deterministic, then the machine doing the recovery should already > have all the data it needs, provided that we log the SELECT that was > used in the CTAS. This is based on a fundamental misconception about the way PITR log-shipping works. We log actions at the physical level (put this tuple here), not the logical here's-the-statement-we-executed level. The two approaches cannot mix, because as soon as there's any physical discrepancy at all, physical-level actions would be incorrectly applied to the slave database. regards, tom lane
On Wed, Mar 22, 2006 at 10:06:05AM -0500, Tom Lane wrote: > "Jim C. Nasby" <jnasby@pervasive.com> writes: > >> PITR wants all changes. Without PITR we can optimize certain logging > >> actions. > > > The only change here is that we're creating a new table based on the > > results of a SELECT. If that SELECT doesn't use anything that's > > non-deterministic, then the machine doing the recovery should already > > have all the data it needs, provided that we log the SELECT that was > > used in the CTAS. > > This is based on a fundamental misconception about the way PITR > log-shipping works. We log actions at the physical level (put this > tuple here), not the logical here's-the-statement-we-executed level. > The two approaches cannot mix, because as soon as there's any physical > discrepancy at all, physical-level actions would be incorrectly applied > to the slave database. Oh, so in other words, SELECT * INTO temp FROM table is inherently non-deterministic at the physical level, so the only way to be able to allow PITR to work is to duplicate all the physical changes. Darn. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes: > Oh, so in other words, SELECT * INTO temp FROM table is inherently > non-deterministic at the physical level, so the only way to be able to > allow PITR to work is to duplicate all the physical changes. Darn. Well, lemme put it this way: I'm not prepared to require that PG be deterministic at the physical level. One obvious source of non-determinancy is the FSM, which is likely to hand out different free space to different transactions depending on what else is going on at the same time. There are others, such as deliberately random tie-breaking during btree index insertion. regards, tom lane
On Wed, 2006-03-22 at 16:35, Tom Lane wrote: > "Jim C. Nasby" <jnasby@pervasive.com> writes: > > Oh, so in other words, SELECT * INTO temp FROM table is inherently > > non-deterministic at the physical level, so the only way to be able to > > allow PITR to work is to duplicate all the physical changes. Darn. > > Well, lemme put it this way: I'm not prepared to require that PG be > deterministic at the physical level. One obvious source of > non-determinancy is the FSM, which is likely to hand out different free > space to different transactions depending on what else is going on at > the same time. There are others, such as deliberately random > tie-breaking during btree index insertion. While you're at talking about WAL and PITR... I see from the aboce discussion that PITR is already demanding special handling in the code (I hope I got this one right, as the following are based on this). What if the PITR logging would be disconnected from the WAL logging completely ? What I mean is to introduce a WAL subscription mechanism, which basically means some incoming connections where we stream the log records. We don't need to write them to disk at all in the normal case, I guess usually PITR will store the records on some other machine so it means network, not disk. And it doesn't need to be done synchronously, it can lag behind the running transactions, and we can do it in batches of WAL records. It also would mean that the local WAL does not need to log the things which are only needed for the PITR... that would likely mean some spared WAL disk activity. Of course it also would mean that the local WAL and PITR WAL are not the same, but that is not an issue I guess. It would also permit immediate recycling of the WAL files if the current archiving style is not used. The drawbacks I can see (please add yours): 1) the need for the subscription management code with the added complexity it implies; 2) problems if the WAL stream lags too much behind; 3) problems if the subscribed client's connection is interrupted; Nr. 2 could be solved by saving the PITR WAL separately if the lag grows over a threshold, and issue a warning. This could still be acceptable, as the writing doesn't have to be synchronous and can be made in relatively large blocks. There could be a second bigger lag threshold which completely cancels the subscription. All these thresholds should be configurable, as it depends on the application what's more important, to have the standby available all the time or have the primary faster if loaded... Nr. 3. can be solved by either canceling the subscription on connection drop, or by allowing a certain amount of time after which the subscription is canceled. The client can reconnect before this timeout expires. In the meantime the primary can store the PITR WAL on disk as mentioned above... Cheers, Csaba.
On Wed, 22 Mar 2006, Jim C. Nasby wrote: > Ok, I saw disk activity on the base directory and assumed it was pg_xlog > stuff. Turns out that both SELECT INTO and CREATE TABLE AS ignore > default_tablepsace and create the new tables in the base directory. I'm > guessing that's a bug... (this is on 8.1.2, btw). This has been fixed in CVS HEAD as part of a patch to allow additional options to CREATE TABLE AS. http://archives.postgresql.org/pgsql-patches/2006-02/msg00211.php Kris Jurka
"Simon Riggs" <simon@2ndquadrant.com> wrote > On Tue, 2006-03-21 at 06:22 -0600, Jim C. Nasby wrote: > > Currently, it appears that SELECT * INTO new_table FROM old_table logs > > each page as it's written to WAL. Is this actually needed? Couldn't the > > database simply log that the SELECT ... INTO statement was executed > > instead? Doing so would likely result in a large performance improvement > > in most installs. Is there no provision for writing anything but data > > page changes (or whole pages) to WAL? > > AFAIK it takes the same code path as CREATE TABLE AS SELECT, which > already does exactly what you suggest (except when using PITR). > As I read, they did take the same code path, but did they "simply log that the SELECT ... INTO statement was executed"? If so, how can we rely on the unreliable content of the old_table to do recovery? Regards, Qingqing
On Wed, Mar 22, 2006 at 02:37:28PM -0500, Kris Jurka wrote: > > > On Wed, 22 Mar 2006, Jim C. Nasby wrote: > > >Ok, I saw disk activity on the base directory and assumed it was pg_xlog > >stuff. Turns out that both SELECT INTO and CREATE TABLE AS ignore > >default_tablepsace and create the new tables in the base directory. I'm > >guessing that's a bug... (this is on 8.1.2, btw). > > This has been fixed in CVS HEAD as part of a patch to allow additional > options to CREATE TABLE AS. > > http://archives.postgresql.org/pgsql-patches/2006-02/msg00211.php I'll argue that the current behavior is still a bug and should be fixed. Would it be difficult to patch 8.1 (and 8.0 if there were tablespaces then...) to honor default_tablespace? -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Wed, Mar 22, 2006 at 02:20:39PM +0800, Qingqing Zhou wrote: > > "Simon Riggs" <simon@2ndquadrant.com> wrote > > On Tue, 2006-03-21 at 06:22 -0600, Jim C. Nasby wrote: > > > Currently, it appears that SELECT * INTO new_table FROM old_table logs > > > each page as it's written to WAL. Is this actually needed? Couldn't the > > > database simply log that the SELECT ... INTO statement was executed > > > instead? Doing so would likely result in a large performance improvement > > > in most installs. Is there no provision for writing anything but data > > > page changes (or whole pages) to WAL? > > > > AFAIK it takes the same code path as CREATE TABLE AS SELECT, which > > already does exactly what you suggest (except when using PITR). > > > > As I read, they did take the same code path, but did they "simply log that > the SELECT ... INTO statement was executed"? If so, how can we rely on the > unreliable content of the old_table to do recovery? Why would the content of the old_table be unreliable? If we've replayed logs up to the point of the CTAS then any data that would be visible to the CTAS should be fine, no? Though, the way Tom put it in one of his replies it sounds like WAL doesn't do any kind of statement logging, only data logging. If that's the case I'm not sure that the CTAS would actually get replayed. But I suspect I'm just misunderstanding... -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Jim C. Nasby wrote: > Why would the content of the old_table be unreliable? If we've replayed > logs up to the point of the CTAS then any data that would be visible to > the CTAS should be fine, no? > > Though, the way Tom put it in one of his replies it sounds like WAL > doesn't do any kind of statement logging, only data logging. If that's > the case I'm not sure that the CTAS would actually get replayed. But I > suspect I'm just misunderstanding... The CTAS doesn't get logged (nor replayed obviously). What happens is that the involved files are fsync'ed before transaction commit, AFAIR. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Fri, Mar 24, 2006 at 08:39:02AM -0400, Alvaro Herrera wrote: > Jim C. Nasby wrote: > > > Why would the content of the old_table be unreliable? If we've replayed > > logs up to the point of the CTAS then any data that would be visible to > > the CTAS should be fine, no? > > > > Though, the way Tom put it in one of his replies it sounds like WAL > > doesn't do any kind of statement logging, only data logging. If that's > > the case I'm not sure that the CTAS would actually get replayed. But I > > suspect I'm just misunderstanding... > > The CTAS doesn't get logged (nor replayed obviously). What happens is > that the involved files are fsync'ed before transaction commit, AFAIR. Ahh, yes, that sounds right. Might be a nice gain to be had if there was some way to log the statement, but I suspect getting WAL to support that would be extremely non-trivial. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Jim C. Nasby wrote: > On Fri, Mar 24, 2006 at 08:39:02AM -0400, Alvaro Herrera wrote: > > Jim C. Nasby wrote: > > > > > Why would the content of the old_table be unreliable? If we've replayed > > > logs up to the point of the CTAS then any data that would be visible to > > > the CTAS should be fine, no? > > > > > > Though, the way Tom put it in one of his replies it sounds like WAL > > > doesn't do any kind of statement logging, only data logging. If that's > > > the case I'm not sure that the CTAS would actually get replayed. But I > > > suspect I'm just misunderstanding... > > > > The CTAS doesn't get logged (nor replayed obviously). What happens is > > that the involved files are fsync'ed before transaction commit, AFAIR. > > Ahh, yes, that sounds right. Might be a nice gain to be had if there was > some way to log the statement, but I suspect getting WAL to support that > would be extremely non-trivial. None at all, at least in the current incarnation, I think, because said query execution is dependent on the contents of the FSM, which is itself dependent on the timing of VACUUM and other stuff. Such an action, running with a different FSM content, can very trivially cause data corruption. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Fri, Mar 24, 2006 at 09:47:20AM -0400, Alvaro Herrera wrote: > Jim C. Nasby wrote: > > On Fri, Mar 24, 2006 at 08:39:02AM -0400, Alvaro Herrera wrote: > > > Jim C. Nasby wrote: > > > > > > > Why would the content of the old_table be unreliable? If we've replayed > > > > logs up to the point of the CTAS then any data that would be visible to > > > > the CTAS should be fine, no? > > > > > > > > Though, the way Tom put it in one of his replies it sounds like WAL > > > > doesn't do any kind of statement logging, only data logging. If that's > > > > the case I'm not sure that the CTAS would actually get replayed. But I > > > > suspect I'm just misunderstanding... > > > > > > The CTAS doesn't get logged (nor replayed obviously). What happens is > > > that the involved files are fsync'ed before transaction commit, AFAIR. > > > > Ahh, yes, that sounds right. Might be a nice gain to be had if there was > > some way to log the statement, but I suspect getting WAL to support that > > would be extremely non-trivial. > > None at all, at least in the current incarnation, I think, because said > query execution is dependent on the contents of the FSM, which is itself > dependent on the timing of VACUUM and other stuff. Such an action, > running with a different FSM content, can very trivially cause data > corruption. Oh, duh, because subsiquent operations will depend on the heap being in a very specific state. Oh well. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Fri, 24 Mar 2006, Jim C. Nasby wrote: > On Wed, Mar 22, 2006 at 02:37:28PM -0500, Kris Jurka wrote: >> >> On Wed, 22 Mar 2006, Jim C. Nasby wrote: >> >>> Ok, I saw disk activity on the base directory and assumed it was pg_xlog >>> stuff. Turns out that both SELECT INTO and CREATE TABLE AS ignore >>> default_tablepsace and create the new tables in the base directory. I'm >>> guessing that's a bug... (this is on 8.1.2, btw). >> >> This has been fixed in CVS HEAD as part of a patch to allow additional >> options to CREATE TABLE AS. >> >> http://archives.postgresql.org/pgsql-patches/2006-02/msg00211.php > > I'll argue that the current behavior is still a bug and should be fixed. > Would it be difficult to patch 8.1 (and 8.0 if there were tablespaces > then...) to honor default_tablespace? Here are patches that fix this for 8.0 and 8.1. Kris Jurka
Attachment
Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches It will be applied as soon as one of the PostgreSQL committers reviews and approves it. --------------------------------------------------------------------------- Kris Jurka wrote: > > > On Fri, 24 Mar 2006, Jim C. Nasby wrote: > > > On Wed, Mar 22, 2006 at 02:37:28PM -0500, Kris Jurka wrote: > >> > >> On Wed, 22 Mar 2006, Jim C. Nasby wrote: > >> > >>> Ok, I saw disk activity on the base directory and assumed it was pg_xlog > >>> stuff. Turns out that both SELECT INTO and CREATE TABLE AS ignore > >>> default_tablepsace and create the new tables in the base directory. I'm > >>> guessing that's a bug... (this is on 8.1.2, btw). > >> > >> This has been fixed in CVS HEAD as part of a patch to allow additional > >> options to CREATE TABLE AS. > >> > >> http://archives.postgresql.org/pgsql-patches/2006-02/msg00211.php > > > > I'll argue that the current behavior is still a bug and should be fixed. > > Would it be difficult to patch 8.1 (and 8.0 if there were tablespaces > > then...) to honor default_tablespace? > > Here are patches that fix this for 8.0 and 8.1. > > Kris Jurka Content-Description: [ Attachment, skipping... ] Content-Description: [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster -- Bruce Momjian http://candle.pha.pa.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Fri, 2006-04-21 at 19:56 -0400, Bruce Momjian wrote: > Your patch has been added to the PostgreSQL unapplied patches list at: > > http://momjian.postgresql.org/cgi-bin/pgpatches > > It will be applied as soon as one of the PostgreSQL committers reviews > and approves it. This patch should now be referred to as allow CREATE TABLE AS/SELECT INTO to use default_tablespace or something similar. The name of the original thread no longer bears any resemblance to the intention of this patch as submitted in its final form. I've no objection to the patch, which seems to fill a functional gap/bug. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com/
Backpatched to 8.0.X and 8.1.X. --------------------------------------------------------------------------- Kris Jurka wrote: > > > On Fri, 24 Mar 2006, Jim C. Nasby wrote: > > > On Wed, Mar 22, 2006 at 02:37:28PM -0500, Kris Jurka wrote: > >> > >> On Wed, 22 Mar 2006, Jim C. Nasby wrote: > >> > >>> Ok, I saw disk activity on the base directory and assumed it was pg_xlog > >>> stuff. Turns out that both SELECT INTO and CREATE TABLE AS ignore > >>> default_tablepsace and create the new tables in the base directory. I'm > >>> guessing that's a bug... (this is on 8.1.2, btw). > >> > >> This has been fixed in CVS HEAD as part of a patch to allow additional > >> options to CREATE TABLE AS. > >> > >> http://archives.postgresql.org/pgsql-patches/2006-02/msg00211.php > > > > I'll argue that the current behavior is still a bug and should be fixed. > > Would it be difficult to patch 8.1 (and 8.0 if there were tablespaces > > then...) to honor default_tablespace? > > Here are patches that fix this for 8.0 and 8.1. > > Kris Jurka Content-Description: [ Attachment, skipping... ] Content-Description: [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster -- Bruce Momjian http://candle.pha.pa.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +