Thread: Reduce WAL logging of INSERT SELECT

Reduce WAL logging of INSERT SELECT

From
Bruce Momjian
Date:
We currently have reduced WAL logging for wal_level = minimal for these
commands:

    CREATE TABLE AS
    CREATE INDEX
    CLUSTER
    COPY into tables that were created or truncated in the same
    transaction

One thing we don't optimize is INSERT ... SELECT when the table is
created or truncated in the same transaction.  Seems we could.

We optimize CREATE TABLE AS which is effectively SELECT ... INTO using a
different syntax.  Is this a TODO?
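
For concreteness, the case in question would look like this (a sketch;
wal_level = minimal assumed, table names made up):

    BEGIN;
    CREATE TABLE orders_copy (LIKE orders);
    INSERT INTO orders_copy SELECT * FROM orders;  -- WAL-logged in full today
    COMMIT;

whereas the equivalent CREATE TABLE AS already skips WAL:

    CREATE TABLE orders_copy AS SELECT * FROM orders;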

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: Reduce WAL logging of INSERT SELECT

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> One thing we don't optimize is INSERT ... SELECT when the table is
> created or truncated in the same transaction.  Seems we could.

> We optimize CREATE TABLE AS which is effectively SELECT ... INTO using a
> different syntax.  Is this a TODO?

Considering that SELECT INTO is deprecated, I don't think we should be
expending effort to encourage people to use it.
        regards, tom lane


Re: Reduce WAL logging of INSERT SELECT

From
Andrew Dunstan
Date:

On 08/04/2011 04:55 PM, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>> One thing we don't optimize is INSERT ... SELECT when the table is
>> created or truncated in the same transaction.  Seems we could.
>> We optimize CREATE TABLE AS which is effectively SELECT ... INTO using a
>> different syntax.  Is this a TODO?
> Considering that SELECT INTO is deprecated, I don't think we should be
> expending effort to encourage people to use it.
>
>             


Right, but the original point about INSERT ... SELECT seems reasonable, no?

cheers

andrew


Re: Reduce WAL logging of INSERT SELECT

From
Bruce Momjian
Date:
Andrew Dunstan wrote:
> 
> 
> On 08/04/2011 04:55 PM, Tom Lane wrote:
> > Bruce Momjian <bruce@momjian.us> writes:
> >> One thing we don't optimize is INSERT ... SELECT when the table is
> >> created or truncated in the same transaction.  Seems we could.
> >> We optimize CREATE TABLE AS which is effectively SELECT ... INTO using a
> >> different syntax.  Is this a TODO?
> > Considering that SELECT INTO is deprecated, I don't think we should be
> > expending effort to encourage people to use it.
> >
> >             
> 
> 
> Right, but the original point about INSERT ... SELECT seems reasonable, no?

Right.  I brought up SELECT INTO because you could make the argument
that INSERT ... SELECT is not a utility command like the other ones and
therefore can't be done easily, but CREATE TABLE AS is internal SELECT
INTO and implemented in execMain.c, which I think is where INSERT ...
SELECT would also be implemented.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: Reduce WAL logging of INSERT SELECT

From
Simon Riggs
Date:
On Thu, Aug 4, 2011 at 10:46 PM, Bruce Momjian <bruce@momjian.us> wrote:

> Right.  I brought up SELECT INTO because you could make the argument
> that INSERT ... SELECT is not a utility command like the other ones and
> therefore can't be done easily, but CREATE TABLE AS is internal SELECT
> INTO and implemented in execMain.c, which I think is where INSERT ...
> SELECT would also be implemented.

What you should be asking is whether the optimisation would be
effective for INSERT SELECT, or even test it.

My observation is that the optimisation is only effective for very
large loads that cause I/O. As RAM sizes get bigger, I'm inclined to
remove the optimisation and make it optional since it is ineffective
in many cases and negative benefit for smaller cases.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Reduce WAL logging of INSERT SELECT

From
Bruce Momjian
Date:
Simon Riggs wrote:
> On Thu, Aug 4, 2011 at 10:46 PM, Bruce Momjian <bruce@momjian.us> wrote:
> 
> > Right.  I brought up SELECT INTO because you could make the argument
> > that INSERT ... SELECT is not a utility command like the other ones and
> > therefore can't be done easily, but CREATE TABLE AS is internal SELECT
> > INTO and implemented in execMain.c, which I think is where INSERT ...
> > SELECT would also be implemented.
> 
> What you should be asking is whether the optimisation would be
> effective for INSERT SELECT, or even test it.
> 
> My observation is that the optimisation is only effective for very
> large loads that cause I/O. As RAM sizes get bigger, I'm inclined to
> remove the optimisation and make it optional since it is ineffective
> in many cases and negative benefit for smaller cases.

I am confused how generating WAL traffic that is larger than the heap
file we are fsync'ing can possibly be slower.  Are you just throwing out
an idea to try to make me prove it?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: Reduce WAL logging of INSERT SELECT

From
Jeff Davis
Date:
On Thu, 2011-08-04 at 17:46 -0400, Bruce Momjian wrote:
> Right.  I brought up SELECT INTO because you could make the argument
> that INSERT ... SELECT is not a utility command like the other ones and
> therefore can't be done easily, but CREATE TABLE AS is internal SELECT
> INTO and implemented in execMain.c, which I think is where INSERT ...
> SELECT would also be implemented.

The above statement is a little confusing, so let me start from the
beginning:

How could we avoid WAL logging for INSERT ... SELECT?

The way we do it for CREATE TABLE AS is because nobody would even *see*
the table if our transaction doesn't commit. Therefore we don't need to
bother logging it. Same can be said for SELECT INTO.

INSERT ... SELECT is just an insert. It needs just as much logging as
inserting tuples any other way. For instance, it will potentially share
pages with other inserts, and we'd better properly record all such page
modifications so that they can be returned to a consistent state.

Regards,
    Jeff Davis





Re: Reduce WAL logging of INSERT SELECT

From
Bruce Momjian
Date:
Jeff Davis wrote:
> On Thu, 2011-08-04 at 17:46 -0400, Bruce Momjian wrote:
> > Right.  I brought up SELECT INTO because you could make the argument
> > that INSERT ... SELECT is not a utility command like the other ones and
> > therefore can't be done easily, but CREATE TABLE AS is internal SELECT
> > INTO and implemented in execMain.c, which I think is where INSERT ...
> > SELECT would also be implemented.
> 
> The above statement is a little confusing, so let me start from the
> beginning:
> 
> How could we avoid WAL logging for INSERT ... SELECT?
> 
> The way we do it for CREATE TABLE AS is because nobody would even *see*
> the table if our transaction doesn't commit. Therefore we don't need to
> bother logging it. Same can be said for SELECT INTO.
> 
> INSERT ... SELECT is just an insert. It needs just as much logging as
> inserting tuples any other way. For instance, it will potentially share
> pages with other inserts, and we'd better properly record all such page
> modifications so that they can be returned to a consistent state.

It would act like COPY, meaning the table would have to be truncated or
created in the same transaction.
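
That is, the same rule COPY already follows (a sketch, with wal_level =
minimal and made-up names):

    BEGIN;
    TRUNCATE sales_staging;                             -- truncated in this transaction
    COPY sales_staging FROM '/tmp/sales.dat';           -- already skips WAL today
    INSERT INTO sales_staging SELECT * FROM sales_old;  -- the proposal: skip WAL here too
    COMMIT;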

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: Reduce WAL logging of INSERT SELECT

From
Robert Haas
Date:
On Thu, Aug 4, 2011 at 8:55 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Jeff Davis wrote:
>> On Thu, 2011-08-04 at 17:46 -0400, Bruce Momjian wrote:
>> > Right.  I brought up SELECT INTO because you could make the argument
>> > that INSERT ... SELECT is not a utility command like the other ones and
>> > therefore can't be done easily, but CREATE TABLE AS is internal SELECT
>> > INTO and implemented in execMain.c, which I think is where INSERT ...
>> > SELECT would also be implemented.
>>
>> The above statement is a little confusing, so let me start from the
>> beginning:
>>
>> How could we avoid WAL logging for INSERT ... SELECT?
>>
>> The way we do it for CREATE TABLE AS is because nobody would even *see*
>> the table if our transaction doesn't commit. Therefore we don't need to
>> bother logging it. Same can be said for SELECT INTO.
>>
>> INSERT ... SELECT is just an insert. It needs just as much logging as
>> inserting tuples any other way. For instance, it will potentially share
>> pages with other inserts, and we'd better properly record all such page
>> modifications so that they can be returned to a consistent state.
>
> It would act like COPY, meaning the table would have to be truncated or
> created in the same transaction.

It seems to me that, if we know the relation was created or truncated
in the current transaction, and if wal_level = minimal, then we don't
need to WAL-log *anything* until transaction commit (provided we fsync
at commit).
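
A sketch of that broader form of the idea (nothing that exists today;
wal_level = minimal, names made up):

    BEGIN;
    CREATE TABLE staging (id int, payload text);
    COPY staging FROM '/tmp/batch.dat';           -- already skips WAL
    UPDATE staging SET payload = upper(payload);  -- would skip WAL under this idea
    DELETE FROM staging WHERE payload = '';       -- likewise
    COMMIT;                                       -- heap fsync'd here instead of WAL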

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Reduce WAL logging of INSERT SELECT

From
Jeff Davis
Date:
On Thu, 2011-08-04 at 20:55 -0400, Bruce Momjian wrote:
> It would act like COPY, meaning the table would have to be truncated or
> created in the same transaction.

Well, in that case it could work for any INSERT. No need for a SELECT to
be involved. For that matter, why not make it work for DELETE and
UPDATE, too?

However, I think this is all just a workaround for not having a faster
loading path. I don't object to applying this optimization to inserts,
but I think it might be more productive to figure out if we can support
loading data efficiently -- i.e. also set hint bits and frozenxid during
the load.
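
A sketch of that kind of load path, for what it's worth: COPY later
gained a FREEZE option that does this (it did not exist at the time of
this thread, and it likewise requires the table to be created or
truncated in the same transaction):

    BEGIN;
    TRUNCATE measurements;
    COPY measurements FROM '/tmp/measurements.dat' WITH (FREEZE);  -- rows written already frozen
    COMMIT;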

Regards,
    Jeff Davis




Re: Reduce WAL logging of INSERT SELECT

From
Heikki Linnakangas
Date:
On 05.08.2011 04:23, Jeff Davis wrote:
> On Thu, 2011-08-04 at 20:55 -0400, Bruce Momjian wrote:
>> It would act like COPY, meaning the table would have to be truncated or
>> created in the same transaction.
>
> Well, in that case it could work for any INSERT. No need for a SELECT to
> be involved. For that matter, why not make it work for DELETE and
> UPDATE, too?

Yep. If we are to expand it, we should make it work for any operation.

However, for small operations it's a net loss - you avoid writing a WAL 
record, but you have to fsync() the heap instead. If you only modify a 
few rows, the extra fsync (or fsyncs if there are indexes too) is more 
expensive than writing the WAL.

We'd need a heuristic to decide whether to write WAL or fsync at the 
end. For regular INSERTs, UPDATEs and DELETEs, you have the planner's 
estimate of number of rows affected. Another thing we should do is move 
the fsync call from the end of COPY (and other such operations) to the 
end of transaction. That way if you do e.g one COPY followed by a bunch 
of smaller INSERTs or UPDATEs, you only need to fsync the files once.
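
The estimate such a heuristic could consult is the one EXPLAIN already
shows for the insert node (hypothetical table names and numbers):

    EXPLAIN INSERT INTO staging SELECT * FROM big_source;
    --  Insert on staging  (cost=0.00..154053.00 rows=10000000 width=36)
    --    ->  Seq Scan on big_source  (cost=0.00..154053.00 rows=10000000 width=36)

If the estimated row count is above some threshold, skip WAL and sync
the heap at the end; otherwise WAL-log as usual.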

> However, I think this is all just a workaround for not having a faster
> loading path. I don't object to applying this optimization to inserts,
> but I think it might be more productive to figure out if we can support
> loading data efficiently -- i.e. also set hint bits and frozenxid during
> the load.

Yeah, that would make a much bigger impact in practice.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Reduce WAL logging of INSERT SELECT

From
Simon Riggs
Date:
On Thu, Aug 4, 2011 at 11:07 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Simon Riggs wrote:
>> On Thu, Aug 4, 2011 at 10:46 PM, Bruce Momjian <bruce@momjian.us> wrote:
>>
>> > Right.  I brought up SELECT INTO because you could make the argument
>> > that INSERT ... SELECT is not a utility command like the other ones and
>> > therefore can't be done easily, but CREATE TABLE AS is internal SELECT
>> > INTO and implemented in execMain.c, which I think is where INSERT ...
>> > SELECT would also be implemented.
>>
>> What you should be asking is whether the optimisation would be
>> effective for INSERT SELECT, or even test it.
>>
>> My observation is that the optimisation is only effective for very
>> large loads that cause I/O. As RAM sizes get bigger, I'm inclined to
>> remove the optimisation and make it optional since it is ineffective
>> in many cases and negative benefit for smaller cases.
>
> I am confused how generating WAL traffic that is larger than the heap
> file we are fsync'ing can possibly be slower.

I'm telling you what I know to be true as an assistance to you.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Reduce WAL logging of INSERT SELECT

From
Jeff Davis
Date:
On Thu, 2011-08-04 at 18:07 -0400, Bruce Momjian wrote:
> I am confused how generating WAL traffic that is larger than the heap
> file we are fsync'ing can possibly be slower.  Are you just throwing out
> an idea to try to make me prove it?

That's worded in a slightly confusing way, but here is the trade-off:

1. If you are using WAL, then regardless of what your transaction does,
only the WAL needs to be fsync'd at commit time. Conveniently, that's
being written sequentially, so it's a single fairly cheap fsync; and all
the data page changes are deferred, collected together, and fsync'd at
checkpoint time (rather than commit time). The cost is that you
double-write the data.

2. If you aren't using WAL, you need to fsync every data file the
transaction touched, which are probably not localized with other
activity. Also, the _entire_ data file needs to be sync'd, so perhaps
many other transactions have made changes to one data file all over, and
it may require _many_ seeks to accomplish the one fsync. The benefit is
that you don't double-write the data.

So, fundamentally, WAL is (in the OLTP case, where a transaction is much
shorter than a checkpoint interval) a big performance _win_, because it
allows us to do nice sequential writing in a single place for all
activities of all transactions; and defer all those random writes to
data pages until the next checkpoint. So we shouldn't treat WAL like a
cost burden that we want to avoid in every case we can.

But in the data load case (where many checkpoints may happen during a
single transaction anyway), it happens that avoiding WAL is a
performance win, because the seeks are not the dominant cost.

Regards,
    Jeff Davis



Re: Reduce WAL logging of INSERT SELECT

From
Bruce Momjian
Date:
Jeff Davis wrote:
> On Thu, 2011-08-04 at 18:07 -0400, Bruce Momjian wrote:
> > I am confused how generating WAL traffic that is larger than the heap
> > file we are fsync'ing can possibly be slower.  Are you just throwing out
> > an idea to try to make me prove it?
> 
> That's worded in a slightly confusing way, but here is the trade-off:
> 
> 1. If you are using WAL, then regardless of what your transaction does,
> only the WAL needs to be fsync'd at commit time. Conveniently, that's
> being written sequentially, so it's a single fairly cheap fsync; and all
> the data page changes are deferred, collected together, and fsync'd at
> checkpoint time (rather than commit time). The cost is that you
> double-write the data.
> 
> 2. If you aren't using WAL, you need to fsync every data file the
> transaction touched, which are probably not localized with other
> activity. Also, the _entire_ data file needs to be sync'd, so perhaps
> many other transactions have made changes to one data file all over, and
> it may require _many_ seeks to accomplish the one fsync. The benefit is
> that you don't double-write the data.
> 
> So, fundamentally, WAL is (in the OLTP case, where a transaction is much
> shorter than a checkpoint interval) a big performance _win_, because it
> allows us to do nice sequential writing in a single place for all
> activities of all transactions; and defer all those random writes to
> data pages until the next checkpoint. So we shouldn't treat WAL like a
> cost burden that we want to avoid in every case we can.
> 
> But in the data load case (where many checkpoints may happen during a
> single transaction anyway), it happens that avoiding WAL is a
> performance win, because the seeks are not the dominant cost.

Well, if the table is created in the same transaction (which is the only
case under consideration), no other sessions can write to the table so
you are just writing the entire table on commit, rather than to the WAL.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: Reduce WAL logging of INSERT SELECT

From
Gokulakannan Somasundaram
Date:


> However, for small operations it's a net loss - you avoid writing a WAL
> record, but you have to fsync() the heap instead. If you only modify a few
> rows, the extra fsync (or fsyncs if there are indexes too) is more expensive
> than writing the WAL.
>
> We'd need a heuristic to decide whether to write WAL or fsync at the end.
> For regular INSERTs, UPDATEs and DELETEs, you have the planner's estimate of
> number of rows affected. Another thing we should do is move the fsync call
> from the end of COPY (and other such operations) to the end of transaction.
> That way if you do e.g one COPY followed by a bunch of smaller INSERTs or
> UPDATEs, you only need to fsync the files once.

Have you thought about recovery, especially when the page size is greater
than the disk block size (> 4K)? With WAL, there is a way to restore the
pages to their original state during recovery if there are partial page
writes. Is it possible to do the same with a direct fsync, without WAL?

Gokul.

Re: Reduce WAL logging of INSERT SELECT

From
Heikki Linnakangas
Date:
On 06.08.2011 06:32, Gokulakannan Somasundaram wrote:
>>
>> However, for small operations it's a net loss - you avoid writing a WAL
>> record, but you have to fsync() the heap instead. If you only modify a few
>> rows, the extra fsync (or fsyncs if there are indexes too) is more expensive
>> than writing the WAL.
>>
>> We'd need a heuristic to decide whether to write WAL or fsync at the end.
>> For regular INSERTs, UPDATEs and DELETEs, you have the planner's estimate of
>> number of rows affected. Another thing we should do is move the fsync call
>> from the end of COPY (and other such operations) to the end of transaction.
>> That way if you do e.g one COPY followed by a bunch of smaller INSERTs or
>> UPDATEs, you only need to fsync the files once.
>
>
> Have you thought about recovery, especially when the page size is greater
> than the disk block size (> 4K)? With WAL, there is a way to restore the
> pages to their original state during recovery if there are partial page
> writes. Is it possible to do the same with a direct fsync, without WAL?

The point of the optimization is that you can only skip WAL when the
table has been created (or truncated) in the same transaction. In that
case, if the transaction aborts because of a crash, you don't care about
the contents of the table anyway.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Reduce WAL logging of INSERT SELECT

From
Simon Riggs
Date:
On Sat, Aug 6, 2011 at 4:16 AM, Bruce Momjian <bruce@momjian.us> wrote:

> Well, if the table is created in the same transaction (which is the only
> case under consideration), no other sessions can write to the table so
> you are just writing the entire table on commit, rather than to the WAL.

Below a certain point, skipping WAL is slower, and over an intermediate
range there is no benefit. So small amounts of data on large servers
go slower.

heap_fsync() requires a scan of shared buffers, which may not be cheap.

There is a difficulty because you would need to calculate where the
cut-off is for a particular database, and then predict ahead of time
whether the number of rows that will be handled by the statement is
high enough to warrant using the optimisation. Both of which I call a
hard problem.

I think we should remove the COPY optimisation because of this and
definitely not extend INSERT SELECT to perform it automatically.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Reduce WAL logging of INSERT SELECT

From
Heikki Linnakangas
Date:
On 06.08.2011 13:13, Simon Riggs wrote:
> I think we should remove the COPY optimisation because of this and
> definitely not extend INSERT SELECT to perform it automatically.

It can be very helpful when loading a lot of data, so I'm not in favor 
of removing it altogether. Maybe WAL-log the first 10000 rows or such 
normally, and skip WAL after that. Of course, loading 10001 rows becomes 
the worst case then, but something along those lines...

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Reduce WAL logging of INSERT SELECT

From
Jaime Casanova
Date:
On Sat, Aug 6, 2011 at 11:05 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 06.08.2011 13:13, Simon Riggs wrote:
>>
>> I think we should remove the COPY optimisation because of this and
>> definitely not extend INSERT SELECT to perform it automatically.
>
> It can be very helpful when loading a lot of data, so I'm not in favor of
> removing it altogether. Maybe WAL-log the first 10000 rows or such normally,
> and skip WAL after that. Of course, loading 10001 rows becomes the worst
> case then, but something along those lines...
>

why 10000 rows? Maybe the right solution is to move towards making a
normal table unlogged and vice versa... that's probably harder to do,
but we would have better control and fewer odd heuristics.
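
Something along the lines of the following (hypothetical syntax at this
point; ALTER TABLE ... SET LOGGED / SET UNLOGGED only appeared much
later, in 9.5):

    ALTER TABLE staging SET UNLOGGED;    -- changes to the table now skip WAL
    COPY staging FROM '/tmp/batch.dat';
    ALTER TABLE staging SET LOGGED;      -- back to a crash-safe table; this rewrites it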

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: 24x7 support and training


Re: Reduce WAL logging of INSERT SELECT

From
"Kevin Grittner"
Date:
Heikki Linnakangas  wrote:
> On 06.08.2011 13:13, Simon Riggs wrote:
>> I think we should remove the COPY optimisation because of this and
>> definitely not extend INSERT SELECT to perform it automatically.
> 
> It can be very helpful when loading a lot of data, so I'm not in
> favor of removing it altogether.

Yeah, it can currently help a lot.  Of course, if WAL-logging could in
any way facilitate hint bit and frozen xmin setting during bulk loads,
I'm sure the WAL-logged version would win easily.

-Kevin


Re: Reduce WAL logging of INSERT SELECT

From
Tom Lane
Date:
Jaime Casanova <jaime@2ndquadrant.com> writes:
> On Sat, Aug 6, 2011 at 11:05 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> It can be very helpful when loading a lot of data, so I'm not in favor of
>> removing it altogether. Maybe WAL-log the first 10000 rows or such normally,
>> and skip WAL after that. Of course, loading 10001 rows becomes the worst
>> case then, but something along those lines...

> why 10000 rows?

Yeah; any particular number is wrong.  Perhaps it'd be better to put the
behavior under user control.  In the case of COPY, where we already have
a place to stick random options, you could imagine writing something
like
COPY ... WITH (bulk)

to cue the system that a lot of data is coming in.  But I don't see any
nice way to do something similar for INSERT/SELECT.  I hesitate to
suggest a GUC, but something like "SET bulk_load = on" would be pretty
straightforward to use in pg_dump for instance.
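
A dump/restore script could then simply wrap the load, e.g. (bulk_load
being the hypothetical GUC from this proposal, never an actual setting):

    SET bulk_load = on;
    BEGIN;
    TRUNCATE big_table;
    COPY big_table FROM '/tmp/big_table.dat';
    INSERT INTO big_table SELECT * FROM incoming;
    COMMIT;
    SET bulk_load = off;
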
        regards, tom lane


Re: Reduce WAL logging of INSERT SELECT

From
Jeff Davis
Date:
On Fri, 2011-08-05 at 23:16 -0400, Bruce Momjian wrote:
> Well, if the table is created in the same transaction (which is the only
> case under consideration), no other sessions can write to the table so
> you are just writing the entire table on commit, rather than to the WAL.

The transaction can still write to many tables that way, and that could
mean many fsyncs.

Also, there may be a bunch of other transactions trying to write to the
WAL that have to wait as your transaction has to seek out to fsync the
data file and then seek back to the WAL.

Regards,
    Jeff Davis




Re: Reduce WAL logging of INSERT SELECT

From
Bruce Momjian
Date:
Simon Riggs wrote:
> On Sat, Aug 6, 2011 at 4:16 AM, Bruce Momjian <bruce@momjian.us> wrote:
>
> > Well, if the table is created in the same transaction (which is the only
> > case under consideration), no other sessions can write to the table so
> > you are just writing the entire table on commit, rather than to the WAL.
>
> Below a certain point, skipping WAL is slower, and over an intermediate
> range there is no benefit. So small amounts of data on large servers
> go slower.
>
> heap_fsync() requires a scan of shared buffers, which may not be cheap.
>
> There is a difficulty because you would need to calculate where the
> cut-off is for a particular database, and then predict ahead of time
> whether the number of rows that will be handled by the statement is
> high enough to warrant using the optimisation. Both of which I call a
> hard problem.
>
> I think we should remove the COPY optimisation because of this and
> definitely not extend INSERT SELECT to perform it automatically.

I ran some tests and Simon was correct and I was wrong.  There is a
measurable overhead to the optimization of avoiding WAL traffic for
small tables.

I tested git head with COPY, creating the table both inside and outside
the COPY transaction, with the attached script.  It ran in 11 seconds
without the optimization (CREATE TABLE outside the transaction), and 12
seconds with the CREATE TABLE inside the COPY transaction (i.e. with the
optimization active).

With these results, I withdraw my idea of adding this optimization to
other commands.  I think COPY is usually used in bulk mode, but the
other commands are often used in smaller batches that would be worse
with this optimization.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
:

. traprm    # local helper (not shown); presumably sets $TMP and a cleanup trap

echo 1 > $TMP/1    # one-row data file to COPY from

echo "DROP TABLE copytest;" > $TMP/0
# build an SQL script with 1000 create/copy/drop cycles
for X in $(jot 1000)
do
    cat >> $TMP/0 <<END
-- move the next line into the transaction block to test the optimization overhead
CREATE TABLE copytest(x int);
BEGIN WORK;
COPY copytest FROM '$TMP/1';
COMMIT;
DROP TABLE copytest;
END
done
# 'sql' here is a local wrapper (presumably around psql); time the whole run
time sql test < $TMP/0 > /dev/null