Thread: WAL logging of SELECT ... INTO command

WAL logging of SELECT ... INTO command

From
"Jim C. Nasby"
Date:
Currently, it appears that SELECT * INTO new_table FROM old_table logs
each page as it's written to WAL. Is this actually needed? Couldn't the
database simply log that the SELECT ... INTO statement was executed
instead? Doing so would likely result in a large performance improvement
in most installs. Is there no provision for writing anything but data
page changes (or whole pages) to WAL?
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: WAL logging of SELECT ... INTO command

From
Simon Riggs
Date:
On Tue, 2006-03-21 at 06:22 -0600, Jim C. Nasby wrote:
> Currently, it appears that SELECT * INTO new_table FROM old_table logs
> each page as it's written to WAL. Is this actually needed? Couldn't the
> database simply log that the SELECT ... INTO statement was executed
> instead? Doing so would likely result in a large performance improvement
> in most installs. Is there no provision for writing anything but data
> page changes (or whole pages) to WAL?

AFAIK it takes the same code path as CREATE TABLE AS SELECT, which
already does exactly what you suggest (except when using PITR).

Best Regards, Simon Riggs


Re: WAL logging of SELECT ... INTO command

From
"Jim C. Nasby"
Date:
On Tue, Mar 21, 2006 at 08:33:50PM +0000, Simon Riggs wrote:
> On Tue, 2006-03-21 at 06:22 -0600, Jim C. Nasby wrote:
> > Currently, it appears that SELECT * INTO new_table FROM old_table logs
> > each page as it's written to WAL. Is this actually needed? Couldn't the
> > database simply log that the SELECT ... INTO statement was executed
> > instead? Doing so would likely result in a large performance improvement
> > in most installs. Is there no provision for writing anything but data
> > page changes (or whole pages) to WAL?
>
> AFAIK it takes the same code path as CREATE TABLE AS SELECT, which
> already does exactly what you suggest (except when using PITR).

Ok, I saw disk activity on the base directory and assumed it was pg_xlog
stuff. Turns out that both SELECT INTO and CREATE TABLE AS ignore
default_tablepsace and create the new tables in the base directory. I'm
guessing that's a bug... (this is on 8.1.2, btw).

Also, why do we log rows for CTAS/SELECT INTO when PITR is in use for
simple SELECTs (ones that don't call non-deterministic functions)? The
data should alread be available AFAICS...
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: WAL logging of SELECT ... INTO command

From
Simon Riggs
Date:
On Wed, 2006-03-22 at 06:47 -0600, Jim C. Nasby wrote:

> Also, why do we log rows for CTAS/SELECT INTO when PITR is in use for
> simple SELECTs (ones that don't call non-deterministic functions)? The
> data should alread be available AFAICS...

Not sure what you're asking... SELECTs don't produce WAL.

PITR wants all changes. Without PITR we can optimise certain logging
actions.

Best Regards, Simon Riggs


Re: WAL logging of SELECT ... INTO command

From
"Jim C. Nasby"
Date:
On Wed, Mar 22, 2006 at 01:08:34PM +0000, Simon Riggs wrote:
> On Wed, 2006-03-22 at 06:47 -0600, Jim C. Nasby wrote:
>
> > Also, why do we log rows for CTAS/SELECT INTO when PITR is in use for
> > simple SELECTs (ones that don't call non-deterministic functions)? The
> > data should alread be available AFAICS...
>
> Not sure what you're asking... SELECTs don't produce WAL.

Yes, there'd have to be some special kind of WAL entry that specifies
what select statement was used in CTAS.

> PITR wants all changes. Without PITR we can optimise certain logging
> actions.

The only change here is that we're creating a new table based on the
results of a SELECT. If that SELECT doesn't use anything that's
non-deterministic, then the machine doing the recovery should already
have all the data it needs, provided that we log the SELECT that was
used in the CTAS.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: WAL logging of SELECT ... INTO command

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
>> PITR wants all changes. Without PITR we can optimise certain logging
>> actions.

> The only change here is that we're creating a new table based on the
> results of a SELECT. If that SELECT doesn't use anything that's
> non-deterministic, then the machine doing the recovery should already
> have all the data it needs, provided that we log the SELECT that was
> used in the CTAS.

This is based on a fundamental misconception about the way PITR
log-shipping works.  We log actions at the physical level (put this
tuple here), not the logical here's-the-statement-we-executed level.
The two approaches cannot mix, because as soon as there's any physical
discrepancy at all, physical-level actions would be incorrectly applied
to the slave database.

            regards, tom lane

Re: WAL logging of SELECT ... INTO command

From
"Jim C. Nasby"
Date:
On Wed, Mar 22, 2006 at 10:06:05AM -0500, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> >> PITR wants all changes. Without PITR we can optimize certain logging
> >> actions.
>
> > The only change here is that we're creating a new table based on the
> > results of a SELECT. If that SELECT doesn't use anything that's
> > non-deterministic, then the machine doing the recovery should already
> > have all the data it needs, provided that we log the SELECT that was
> > used in the CTAS.
>
> This is based on a fundamental misconception about the way PITR
> log-shipping works.  We log actions at the physical level (put this
> tuple here), not the logical here's-the-statement-we-executed level.
> The two approaches cannot mix, because as soon as there's any physical
> discrepancy at all, physical-level actions would be incorrectly applied
> to the slave database.

Oh, so in other words, SELECT * INTO temp FROM table is inherently
non-deterministic at the physical level, so the only way to be able to
allow PITR to work is to duplicate all the physical changes. Darn.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: WAL logging of SELECT ... INTO command

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> Oh, so in other words, SELECT * INTO temp FROM table is inherently
> non-deterministic at the physical level, so the only way to be able to
> allow PITR to work is to duplicate all the physical changes. Darn.

Well, lemme put it this way: I'm not prepared to require that PG be
deterministic at the physical level.  One obvious source of
non-determinancy is the FSM, which is likely to hand out different free
space to different transactions depending on what else is going on at
the same time.  There are others, such as deliberately random
tie-breaking during btree index insertion.

            regards, tom lane

Re: WAL logging of SELECT ... INTO command

From
Csaba Nagy
Date:
On Wed, 2006-03-22 at 16:35, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > Oh, so in other words, SELECT * INTO temp FROM table is inherently
> > non-deterministic at the physical level, so the only way to be able to
> > allow PITR to work is to duplicate all the physical changes. Darn.
>
> Well, lemme put it this way: I'm not prepared to require that PG be
> deterministic at the physical level.  One obvious source of
> non-determinancy is the FSM, which is likely to hand out different free
> space to different transactions depending on what else is going on at
> the same time.  There are others, such as deliberately random
> tie-breaking during btree index insertion.

While you're at talking about WAL and PITR... I see from the aboce
discussion that PITR is already demanding special handling in the code
(I hope I got this one right, as the following are based on this).

What if the PITR logging would be disconnected from the WAL logging
completely ?

What I mean is to introduce a WAL subscription mechanism, which
basically means some incoming connections where we stream the log
records. We don't need to write them to disk at all in the normal case,
I guess usually PITR will store the records on some other machine so it
means network, not disk. And it doesn't need to be done synchronously,
it can lag behind the running transactions, and we can do it in batches
of WAL records.

It also would mean that the local WAL does not need to log the things
which are only needed for the PITR... that would likely mean some spared
WAL disk activity. Of course it also would mean that the local WAL and
PITR WAL are not the same, but that is not an issue I guess.

It would also permit immediate recycling of the WAL files if the current
archiving style is not used.

The drawbacks I can see (please add yours):
1) the need for the subscription management code with the added
complexity it implies;
2) problems if the WAL stream lags too much behind;
3) problems if the subscribed client's connection is interrupted;

Nr. 2 could be solved by saving the PITR WAL separately if the lag grows
over a threshold, and issue a warning. This could still be acceptable,
as the writing doesn't have to be synchronous and can be made in
relatively large blocks.
There could be a second bigger lag threshold which completely cancels
the subscription. All these thresholds should be configurable, as it
depends on the application what's more important, to have the standby
available all the time or have the primary faster if loaded...

Nr. 3. can be solved by either canceling the subscription on connection
drop, or by allowing a certain amount of time after which the
subscription is canceled. The client can reconnect before this timeout
expires. In the meantime the primary can store the PITR WAL on disk as
mentioned above...

Cheers,
Csaba.



Re: WAL logging of SELECT ... INTO command

From
Kris Jurka
Date:

On Wed, 22 Mar 2006, Jim C. Nasby wrote:

> Ok, I saw disk activity on the base directory and assumed it was pg_xlog
> stuff. Turns out that both SELECT INTO and CREATE TABLE AS ignore
> default_tablepsace and create the new tables in the base directory. I'm
> guessing that's a bug... (this is on 8.1.2, btw).

This has been fixed in CVS HEAD as part of a patch to allow additional
options to CREATE TABLE AS.

http://archives.postgresql.org/pgsql-patches/2006-02/msg00211.php

Kris Jurka


Re: WAL logging of SELECT ... INTO command

From
"Qingqing Zhou"
Date:
"Simon Riggs" <simon@2ndquadrant.com> wrote
> On Tue, 2006-03-21 at 06:22 -0600, Jim C. Nasby wrote:
> > Currently, it appears that SELECT * INTO new_table FROM old_table logs
> > each page as it's written to WAL. Is this actually needed? Couldn't the
> > database simply log that the SELECT ... INTO statement was executed
> > instead? Doing so would likely result in a large performance improvement
> > in most installs. Is there no provision for writing anything but data
> > page changes (or whole pages) to WAL?
>
> AFAIK it takes the same code path as CREATE TABLE AS SELECT, which
> already does exactly what you suggest (except when using PITR).
>

As I read, they did take the same code path, but did they "simply log that
the SELECT ... INTO statement was executed"? If so, how can we rely on the
unreliable content of the old_table to do recovery?

Regards,
Qingqing


Re: WAL logging of SELECT ... INTO command

From
"Jim C. Nasby"
Date:
On Wed, Mar 22, 2006 at 02:37:28PM -0500, Kris Jurka wrote:
>
>
> On Wed, 22 Mar 2006, Jim C. Nasby wrote:
>
> >Ok, I saw disk activity on the base directory and assumed it was pg_xlog
> >stuff. Turns out that both SELECT INTO and CREATE TABLE AS ignore
> >default_tablepsace and create the new tables in the base directory. I'm
> >guessing that's a bug... (this is on 8.1.2, btw).
>
> This has been fixed in CVS HEAD as part of a patch to allow additional
> options to CREATE TABLE AS.
>
> http://archives.postgresql.org/pgsql-patches/2006-02/msg00211.php

I'll argue that the current behavior is still a bug and should be fixed.
Would it be difficult to patch 8.1 (and 8.0 if there were tablespaces
then...) to honor default_tablespace?
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: WAL logging of SELECT ... INTO command

From
"Jim C. Nasby"
Date:
On Wed, Mar 22, 2006 at 02:20:39PM +0800, Qingqing Zhou wrote:
>
> "Simon Riggs" <simon@2ndquadrant.com> wrote
> > On Tue, 2006-03-21 at 06:22 -0600, Jim C. Nasby wrote:
> > > Currently, it appears that SELECT * INTO new_table FROM old_table logs
> > > each page as it's written to WAL. Is this actually needed? Couldn't the
> > > database simply log that the SELECT ... INTO statement was executed
> > > instead? Doing so would likely result in a large performance improvement
> > > in most installs. Is there no provision for writing anything but data
> > > page changes (or whole pages) to WAL?
> >
> > AFAIK it takes the same code path as CREATE TABLE AS SELECT, which
> > already does exactly what you suggest (except when using PITR).
> >
>
> As I read, they did take the same code path, but did they "simply log that
> the SELECT ... INTO statement was executed"? If so, how can we rely on the
> unreliable content of the old_table to do recovery?

Why would the content of the old_table be unreliable? If we've replayed
logs up to the point of the CTAS then any data that would be visible to
the CTAS should be fine, no?

Though, the way Tom put it in one of his replies it sounds like WAL
doesn't do any kind of statement logging, only data logging. If that's
the case I'm not sure that the CTAS would actually get replayed. But I
suspect I'm just misunderstanding...
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: WAL logging of SELECT ... INTO command

From
Alvaro Herrera
Date:
Jim C. Nasby wrote:

> Why would the content of the old_table be unreliable? If we've replayed
> logs up to the point of the CTAS then any data that would be visible to
> the CTAS should be fine, no?
>
> Though, the way Tom put it in one of his replies it sounds like WAL
> doesn't do any kind of statement logging, only data logging. If that's
> the case I'm not sure that the CTAS would actually get replayed. But I
> suspect I'm just misunderstanding...

The CTAS doesn't get logged (nor replayed obviously).  What happens is
that the involved files are fsync'ed before transaction commit, AFAIR.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: WAL logging of SELECT ... INTO command

From
"Jim C. Nasby"
Date:
On Fri, Mar 24, 2006 at 08:39:02AM -0400, Alvaro Herrera wrote:
> Jim C. Nasby wrote:
>
> > Why would the content of the old_table be unreliable? If we've replayed
> > logs up to the point of the CTAS then any data that would be visible to
> > the CTAS should be fine, no?
> >
> > Though, the way Tom put it in one of his replies it sounds like WAL
> > doesn't do any kind of statement logging, only data logging. If that's
> > the case I'm not sure that the CTAS would actually get replayed. But I
> > suspect I'm just misunderstanding...
>
> The CTAS doesn't get logged (nor replayed obviously).  What happens is
> that the involved files are fsync'ed before transaction commit, AFAIR.

Ahh, yes, that sounds right. Might be a nice gain to be had if there was
some way to log the statement, but I suspect getting WAL to support that
would be extremely non-trivial.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: WAL logging of SELECT ... INTO command

From
Alvaro Herrera
Date:
Jim C. Nasby wrote:
> On Fri, Mar 24, 2006 at 08:39:02AM -0400, Alvaro Herrera wrote:
> > Jim C. Nasby wrote:
> >
> > > Why would the content of the old_table be unreliable? If we've replayed
> > > logs up to the point of the CTAS then any data that would be visible to
> > > the CTAS should be fine, no?
> > >
> > > Though, the way Tom put it in one of his replies it sounds like WAL
> > > doesn't do any kind of statement logging, only data logging. If that's
> > > the case I'm not sure that the CTAS would actually get replayed. But I
> > > suspect I'm just misunderstanding...
> >
> > The CTAS doesn't get logged (nor replayed obviously).  What happens is
> > that the involved files are fsync'ed before transaction commit, AFAIR.
>
> Ahh, yes, that sounds right. Might be a nice gain to be had if there was
> some way to log the statement, but I suspect getting WAL to support that
> would be extremely non-trivial.

None at all, at least in the current incarnation, I think, because said
query execution is dependent on the contents of the FSM, which is itself
dependent on the timing of VACUUM and other stuff.  Such an action,
running with a different FSM content, can very trivially cause data
corruption.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: WAL logging of SELECT ... INTO command

From
"Jim C. Nasby"
Date:
On Fri, Mar 24, 2006 at 09:47:20AM -0400, Alvaro Herrera wrote:
> Jim C. Nasby wrote:
> > On Fri, Mar 24, 2006 at 08:39:02AM -0400, Alvaro Herrera wrote:
> > > Jim C. Nasby wrote:
> > >
> > > > Why would the content of the old_table be unreliable? If we've replayed
> > > > logs up to the point of the CTAS then any data that would be visible to
> > > > the CTAS should be fine, no?
> > > >
> > > > Though, the way Tom put it in one of his replies it sounds like WAL
> > > > doesn't do any kind of statement logging, only data logging. If that's
> > > > the case I'm not sure that the CTAS would actually get replayed. But I
> > > > suspect I'm just misunderstanding...
> > >
> > > The CTAS doesn't get logged (nor replayed obviously).  What happens is
> > > that the involved files are fsync'ed before transaction commit, AFAIR.
> >
> > Ahh, yes, that sounds right. Might be a nice gain to be had if there was
> > some way to log the statement, but I suspect getting WAL to support that
> > would be extremely non-trivial.
>
> None at all, at least in the current incarnation, I think, because said
> query execution is dependent on the contents of the FSM, which is itself
> dependent on the timing of VACUUM and other stuff.  Such an action,
> running with a different FSM content, can very trivially cause data
> corruption.

Oh, duh, because subsiquent operations will depend on the heap being in
a very specific state. Oh well.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: WAL logging of SELECT ... INTO command

From
Kris Jurka
Date:

On Fri, 24 Mar 2006, Jim C. Nasby wrote:

> On Wed, Mar 22, 2006 at 02:37:28PM -0500, Kris Jurka wrote:
>>
>> On Wed, 22 Mar 2006, Jim C. Nasby wrote:
>>
>>> Ok, I saw disk activity on the base directory and assumed it was pg_xlog
>>> stuff. Turns out that both SELECT INTO and CREATE TABLE AS ignore
>>> default_tablepsace and create the new tables in the base directory. I'm
>>> guessing that's a bug... (this is on 8.1.2, btw).
>>
>> This has been fixed in CVS HEAD as part of a patch to allow additional
>> options to CREATE TABLE AS.
>>
>> http://archives.postgresql.org/pgsql-patches/2006-02/msg00211.php
>
> I'll argue that the current behavior is still a bug and should be fixed.
> Would it be difficult to patch 8.1 (and 8.0 if there were tablespaces
> then...) to honor default_tablespace?

Here are patches that fix this for 8.0 and 8.1.

Kris Jurka

Attachment

Re: WAL logging of SELECT ... INTO command

From
Bruce Momjian
Date:
Your patch has been added to the PostgreSQL unapplied patches list at:

    http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------


Kris Jurka wrote:
>
>
> On Fri, 24 Mar 2006, Jim C. Nasby wrote:
>
> > On Wed, Mar 22, 2006 at 02:37:28PM -0500, Kris Jurka wrote:
> >>
> >> On Wed, 22 Mar 2006, Jim C. Nasby wrote:
> >>
> >>> Ok, I saw disk activity on the base directory and assumed it was pg_xlog
> >>> stuff. Turns out that both SELECT INTO and CREATE TABLE AS ignore
> >>> default_tablepsace and create the new tables in the base directory. I'm
> >>> guessing that's a bug... (this is on 8.1.2, btw).
> >>
> >> This has been fixed in CVS HEAD as part of a patch to allow additional
> >> options to CREATE TABLE AS.
> >>
> >> http://archives.postgresql.org/pgsql-patches/2006-02/msg00211.php
> >
> > I'll argue that the current behavior is still a bug and should be fixed.
> > Would it be difficult to patch 8.1 (and 8.0 if there were tablespaces
> > then...) to honor default_tablespace?
>
> Here are patches that fix this for 8.0 and 8.1.
>
> Kris Jurka

Content-Description:

[ Attachment, skipping... ]

Content-Description:

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster

--
  Bruce Momjian   http://candle.pha.pa.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: WAL logging of SELECT ... INTO command

From
Simon Riggs
Date:
On Fri, 2006-04-21 at 19:56 -0400, Bruce Momjian wrote:
> Your patch has been added to the PostgreSQL unapplied patches list at:
>
>     http://momjian.postgresql.org/cgi-bin/pgpatches
>
> It will be applied as soon as one of the PostgreSQL committers reviews
> and approves it.

This patch should now be referred to as
    allow CREATE TABLE AS/SELECT INTO to use default_tablespace
or something similar.

The name of the original thread no longer bears any resemblance to the
intention of this patch as submitted in its final form.

I've no objection to the patch, which seems to fill a functional
gap/bug.

--
  Simon Riggs
  EnterpriseDB          http://www.enterprisedb.com/


Re: WAL logging of SELECT ... INTO command

From
Bruce Momjian
Date:
Backpatched to 8.0.X and 8.1.X.

---------------------------------------------------------------------------

Kris Jurka wrote:
>
>
> On Fri, 24 Mar 2006, Jim C. Nasby wrote:
>
> > On Wed, Mar 22, 2006 at 02:37:28PM -0500, Kris Jurka wrote:
> >>
> >> On Wed, 22 Mar 2006, Jim C. Nasby wrote:
> >>
> >>> Ok, I saw disk activity on the base directory and assumed it was pg_xlog
> >>> stuff. Turns out that both SELECT INTO and CREATE TABLE AS ignore
> >>> default_tablepsace and create the new tables in the base directory. I'm
> >>> guessing that's a bug... (this is on 8.1.2, btw).
> >>
> >> This has been fixed in CVS HEAD as part of a patch to allow additional
> >> options to CREATE TABLE AS.
> >>
> >> http://archives.postgresql.org/pgsql-patches/2006-02/msg00211.php
> >
> > I'll argue that the current behavior is still a bug and should be fixed.
> > Would it be difficult to patch 8.1 (and 8.0 if there were tablespaces
> > then...) to honor default_tablespace?
>
> Here are patches that fix this for 8.0 and 8.1.
>
> Kris Jurka

Content-Description:

[ Attachment, skipping... ]

Content-Description:

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster

--
  Bruce Momjian   http://candle.pha.pa.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +