Thread: Using multi-row technique with COPY

Using multi-row technique with COPY

From: Simon Riggs
Since we just discussed using a multi-row per buffer lock technique with
Seq Scans, it seems appropriate to discuss a similar technique with COPY
FROM that I had been mulling over.

COPY FROM can read in sufficient rows until it has a whole block worth
of data, then get a new block and write it all with one pair of
BufferLock calls. This would only be possible, AFAICS, when a table does
not have any unique indexes or row triggers. It should be possible to
take advantage of this automatically when those requirements are met,
without any new options. Just as it was with Seq Scans, this is worth
about 10% reduction in CPU for a COPY FROM.
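
In rough pseudocode (every name below is illustrative, not existing
code), the difference is:

    /* today: one buffer lock/unlock pair per row */
    for (each input row)
    {
        tuple = form_tuple(row);
        heap_insert(rel, tuple, ...);       /* locks, writes, unlocks */
    }

    /* proposed: batch rows locally until they would fill one block */
    for (each input row)
    {
        tuple = form_tuple(row);
        if (buffered_size + tuple_size(tuple) > usable_space_per_block)
            flush_buffered_tuples(rel);     /* ONE lock pair, one new block */
        append_to_local_buffer(tuple);
        buffered_size += tuple_size(tuple);
    }
    flush_buffered_tuples(rel);             /* write out the remainder */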

The main use case for this is initial table loads or restores from
dumps. However, it applies more widely since there is no requirement
that the table is empty, nor that it has only just been created within
the current transaction.

FSM access would need to change slightly to allow for whole-block-only
requests to be made for heaps, without damaging the average row length
calculation. It might be simpler to ignore FSM entirely?

As a further enhancement, I would also return to the NOLOGGING option
for COPY. Previously we had said that COPY LOCK was the way to go -
taking a full table lock to prevent concurrent inserts to a block from a
COPY that didn't write WAL and another backend which wanted to write WAL
about that block. With the above suggested all-inserts-at-once
optimization, it would no longer be a requirement to lock the table.
That means we can continue to take advantage of the ability to run
multiple COPY loads into the same table. Avoiding writing WAL will
further reduce CPU by about 15% and I/O by about 50%. 

I would also suggest that pg_dump be changed to use the NOLOGGING option
by default, with an option to work as previously.

Comments?

Best Regards, Simon Riggs




Re: Using multi-row technique with COPY

From: Tom Lane
Simon Riggs <simon@2ndquadrant.com> writes:
> COPY FROM can read in sufficient rows until it has a whole block worth
> of data, then get a new block and write it all with one pair of
> BufferLock calls.

> Comments?

I don't see any way to do this without horrible modularity violations.
The COPY code has no business going anywhere near individual buffers;
for that matter, it doesn't even really know what "a block worth" of
data is, since the tuples it's dealing with aren't toasted yet.
        regards, tom lane


Re: Using multi-row technique with COPY

From: Martijn van Oosterhout
On Sun, Nov 27, 2005 at 05:45:31PM -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > COPY FROM can read in sufficient rows until it has a whole block worth
> > of data, then get a new block and write it all with one pair of
> > BufferLock calls.
>
> > Comments?
>
> I don't see any way to do this without horrible modularity violations.
> The COPY code has no business going anywhere near individual buffers;
> for that matter, it doesn't even really know what "a block worth" of
> data is, since the tuples it's dealing with aren't toasted yet.

I wonder if you could get the same effect by having COPY copy into a
temporary table (thus no WAL, no bufferlocks) and then every hundred
rows do a SELECT INTO. Mind you, that just moves the modularity
violation elsewhere since only the bufmgr knows about the size of
buffers really.

Whatever happened to that idea to build an entire datafile with COPY or
some external tool and simply copy it into place and update the
catalog?

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: Using multi-row technique with COPY

From: Simon Riggs
On Mon, 2005-11-28 at 09:40 +0100, Martijn van Oosterhout wrote:
> On Sun, Nov 27, 2005 at 05:45:31PM -0500, Tom Lane wrote:
> > Simon Riggs <simon@2ndquadrant.com> writes:
> > > COPY FROM can read in sufficient rows until it has a whole block worth
> > > of data, then get a new block and write it all with one pair of
> > > BufferLock calls.
> > 
> > > Comments?

> Whatever happened to that idea to build an entire datafile with COPY or
> some external tool and simply copy it into place and update the
> catalog?

What's wrong with tuning the server to do this? 

Zapping the catalog as a normal operation is the wrong approach if you
want a robust system. All actions on the catalog must be under tight
control.

Most other RDBMS support a "fast path" loader, but all of them include
strong hooks into the main server to maintain the catalog correctly. That
is one approach, but it requires creation of an external API - which
seems more work, plus a security risk. Copying data in a block at a time
is the basic technique they all use.

I never discuss implementing features that other RDBMS have for any
other reason than that a similar use case exists for both. There are
many features where PostgreSQL is already ahead.

Best Regards, Simon Riggs



Re: Using multi-row technique with COPY

From: Simon Riggs
On Sun, 2005-11-27 at 17:45 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > COPY FROM can read in sufficient rows until it has a whole block worth
> > of data, then get a new block and write it all with one pair of
> > BufferLock calls.
> 
> > Comments?
> 
> I don't see any way to do this without horrible modularity violations.
> The COPY code has no business going anywhere near individual buffers;
> for that matter, it doesn't even really know what "a block worth" of
> data is, since the tuples it's dealing with aren't toasted yet.

I've taken on board your comments about modularity issues from earlier.
[I've not included anything on unique indexes, notice]

I was expecting to buffer this in the heap access method with a new
call, say, heap_bulk_insert() rather than have all that code hanging
around in COPY. A lower level routine RelationGetBufferForTupleArray can
handle the actual grunt. It can work, without ugliness.

We'd need to handle a buffer bigger than a single tuple anyway, so you
keep adding tuples until the last one tips over the edge, which then
gets saved for the next block. Heap access method knows about blocks.

We could reasonably do a test for would-be-toasted within those
routines. I should have said that this wouldn't apply if any of the
tuples require toasting, which of course has to be a dynamic test.

COPY would only need to know whether it was invoking the normal or the
bulk mode, which is reasonable, since it knows about indexes, triggers
etc.

Best Regards, Simon Riggs



Re: Using multi-row technique with COPY

From: Alvaro Herrera
Simon Riggs wrote:
> On Sun, 2005-11-27 at 17:45 -0500, Tom Lane wrote:
> > Simon Riggs <simon@2ndquadrant.com> writes:
> > > COPY FROM can read in sufficient rows until it has a whole block worth
> > > of data, then get a new block and write it all with one pair of
> > > BufferLock calls.
> > 
> > > Comments?
> > 
> > I don't see any way to do this without horrible modularity violations.
> > The COPY code has no business going anywhere near individual buffers;
> > for that matter, it doesn't even really know what "a block worth" of
> > data is, since the tuples it's dealing with aren't toasted yet.
> 
> I've taken on board your comments about modularity issues from earlier.
> [I've not included anything on unique indexes, notice]

I don't see why we couldn't have an additional index access method entry
point to insert multiple rows on one call.  For a unique index, we
could return an array of entries that violated the condition (if we were
to implement such a feature for single-entry insertion.)
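
Purely as a sketch of the shape such an entry point might take (nothing
like this exists in the index AM API today; every name below is
invented):

    typedef struct IndexBulkInsertResult
    {
        int         nrejected;  /* entries that violated uniqueness */
        int        *rejected;   /* positions of those entries in the input */
    } IndexBulkInsertResult;

    IndexBulkInsertResult *
    index_bulk_insert(Relation indexRelation,
                      int nentries,
                      Datum **values,        /* key datums, one row per entry */
                      bool **isnull,
                      ItemPointer heap_tids, /* one heap TID per entry */
                      Relation heapRelation,
                      bool check_uniqueness);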

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Using multi-row technique with COPY

From: Hannu Krosing
On Mon, 2005-11-28 at 00:56 +0000, Simon Riggs wrote:
> On Sun, 2005-11-27 at 17:45 -0500, Tom Lane wrote:
> > Simon Riggs <simon@2ndquadrant.com> writes:
> > > COPY FROM can read in sufficient rows until it has a whole block worth
> > > of data, then get a new block and write it all with one pair of
> > > BufferLock calls.
> > 
> > > Comments?
> > 
> > I don't see any way to do this without horrible modularity violations.
> > The COPY code has no business going anywhere near individual buffers;
> > for that matter, it doesn't even really know what "a block worth" of
> > data is, since the tuples it's dealing with aren't toasted yet.
> 
> I've taken on board your comments about modularity issues from earlier.
> [I've not included anything on unique indexes, notice]
> 
> I was expecting to buffer this in the heap access method with a new
> call, say, heap_bulk_insert() rather than have all that code hanging
> around in COPY. A lower level routine RelationGetBufferForTupleArray can
> handle the actual grunt. It can work, without ugliness.
> 
> We'd need to handle a buffer bigger than a single tuple anyway, so you
> keep adding tuples until the last one tips over the edge, which then
> gets saved for the next block. Heap access method knows about blocks.
> 
> We could reasonably do a test for would-be-toasted within those
> routines. I should have said that this wouldn't apply if any of the
> tuples require toasting, which of course has to be a dynamic test.

If we had a buffer big enough (say 10-100x the page size), then we would
not actually need to test toasting. We can just pass the big buffer to
heap_bulk_insert(), which inserts the whole buffer in chunks as big as
needed to fill the free space on pages (with a single page lock each).
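
Very roughly, something like this (all names invented):

    #define BULK_BUFFER_SIZE    (32 * BLCKSZ)   /* say, 32 pages worth */

    char    bulk_buffer[BULK_BUFFER_SIZE];      /* COPY appends formed */
    Size    bulk_used = 0;                      /* tuples in here      */

    /* hand the whole thing to the heap AM, which carves it into
     * page-sized chunks itself, one buffer lock pair per page filled */
    heap_bulk_insert(rel, bulk_buffer, bulk_used);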

--------------
Hannu




Re: Using multi-row technique with COPY

From: Martijn van Oosterhout
On Mon, Nov 28, 2005 at 09:39:36AM +0000, Simon Riggs wrote:
> On Mon, 2005-11-28 at 09:40 +0100, Martijn van Oosterhout wrote:
> > Whatever happened to that idea to build an entire datafile with COPY or
> > some external tool and simply copy it into place and update the
> > catalog?
>
> What's wrong with tuning the server to do this?

It was just a question. I remember it being mentioned but didn't hear
if it had been rejected or just not implemented.

I was wondering if maybe we can approach this another way. Whenever you
create a new block by fetching with P_NEW you pretty much know that no
other backends are going to be interested in this block. Doesn't this
mean you can simply hold this block exclusively until the end of the
transaction? Hence you can avoid all the locking on this block.

If this achieves the same reduction in block locking then it would
be active in far more places such as SELECT INTO or any large copy (I
think index creation is already special cased). Once a block has been
written once it would revert to standard locking.

Is this feasible?

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: Using multi-row technique with COPY

From: Tom Lane
Alvaro Herrera <alvherre@commandprompt.com> writes:
> I don't see why we couldn't have an additional index access method entry
> point to insert multiple rows on one call.

I think Simon was mainly on about the idea of inserting multiple *heap*
entries with one call, ie, only one cycle of locking a heap buffer.
It's not apparent to me that there'd be any noticeable win from batching
index insertions, because even if the heap entries are all on one page,
it's unlikely their index entries would be.

I'm a bit dubious about the whole thing from a concurrency standpoint,
too: holding locks longer is not a great thing in general.  The recent
patch to avoid repeated locking cycles during a read seems OK, because
(1) checking visibility of a tuple is ordinarily pretty cheap, and
(2) it's only a shared lock so other people can be doing the same thing
concurrently.  However, heap insertion can make neither of those claims.
You're talking about an exclusive lock, and you're talking about holding
it while copying data around and then making multiple WAL entries.
        regards, tom lane


Re: Using multi-row technique with COPY

From: "Jim C. Nasby"
On Sun, Nov 27, 2005 at 07:44:55PM +0000, Simon Riggs wrote:
> not have any unique indexes or row triggers. It should be possible to
> take advantage of this automatically when those requirements are met,
> without any new options. Just as it was with Seq Scans, this is worth
> about 10% reduction in CPU for a COPY FROM.
<snip> 
> FSM access would need to change slightly to allow for whole-block-only
> requests to be made for heaps, without damaging the average row length
> calculation. It might be simpler to ignore FSM entirely?

Does that mean that this fast copy would end up not re-using space on
pages that have space available? ISTM that's something users would want
to be able to over-ride. In fact, it seems like it shouldn't be a
default behavior...
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Using multi-row technique with COPY

From: Harald Fuchs
In article <20051129004435.GR78939@pervasive.com>,
"Jim C. Nasby" <jnasby@pervasive.com> writes:

> On Sun, Nov 27, 2005 at 07:44:55PM +0000, Simon Riggs wrote:
>> not have any unique indexes or row triggers. It should be possible to
>> take advantage of this automatically when those requirements are met,
>> without any new options. Just as it was with Seq Scans, this is worth
>> about 10% reduction in CPU for a COPY FROM.
> <snip> 
>> FSM access would need to change slightly to allow for whole-block-only
>> requests to be made for heaps, without damaging the average row length
>> calculation. It might be simpler to ignore FSM entirely?

> Does that mean that this fast copy would end up not re-using space on
> pages that have space available? ISTM that's something users would want
> to be able to over-ride. In fact, it seems like it shouldn't be a
> default behavior...

Why not?  If you later do INSERT/UPDATE/DELETE, space will get reused
anyway, and if you don't (i.e. one-time bulk load on a constant
table), you should afterwards do a VACUUM FULL ANALYZE anyway.



Re: Using multi-row technique with COPY

From: Tom Lane
Harald Fuchs <hf0923x@protecting.net> writes:
> In article <20051129004435.GR78939@pervasive.com>,
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
>> Does that mean that this fast copy would end up not re-using space on
>> pages that have space available? ISTM that's something users would want
>> to be able to over-ride. In fact, it seems like it shouldn't be a
>> default behavior...

> Why not?  If you later do INSERT/UPDATE/DELETE, space will get reused
> anyway, and if you don't (i.e. one-time bulk load on a constant
> table), you should afterwards do a VACUUM FULL ANALYZE anyway.

You lost me at the "FULL" part.

In general, any "optimization" proposal that requires VACUUM FULL as
part of normal operation is a lost cause.  You should instead be
thinking about procedures that won't require a VACUUM FULL, ever.
        regards, tom lane


Re: Using multi-row technique with COPY

From: Bruce Momjian
Simon Riggs wrote:
> As a further enhancement, I would also return to the NOLOGGING option
> for COPY. Previously we had said that COPY LOCK was the way to go -
> taking a full table lock to prevent concurrent inserts to a block from a
> COPY that didn't write WAL and another backend which wanted to write WAL
> about that block. With the above suggested all-inserts-at-once
> optimization, it would no longer be a requirement to lock the table.
> That means we can continue to take advantage of the ability to run
> multiple COPY loads into the same table. Avoiding writing WAL will
> further reduce CPU by about 15% and I/O by about 50%. 
> 
> I would also suggest that pg_dump be changed to use the NOLOGGING option
> by default, with an option to work as previously.

For those who have been around, they know I dislike it when the
behavior that 95% of our users desire is not the default.  I think the COPY
NOLOGGING idea falls in that category.  I would like to explore if there
is a way to have COPY automatically do no logging where possible by
default.

First, I think NOLOGGING is probably the wrong keyword.  I am thinking
SHARE/EXCLUSIVE is best because they are already keywords, and they
explain the effect of the flag on other applications, rather than the
LOGGING capability, which is invisible to applications.

I am thinking we would have COPY WITH [ [ EXCLUSIVE | SHARE ] [ LOCK ]] ...
EXCLUSIVE lock would be NOLOGGING, SHARE would do logging because other
applications could insert into the table at the same time (and do
UPDATES/DELETES of the inserted rows).

One idea for default behavior would be to use EXCLUSIVE when the table
is zero size.  I think that would do pg_dump and most of the user cases,
and of course users could override the default by using a keyword.  We
could emit a NOTICE if an exclusive lock is used without an EXCLUSIVE
keyword.  One problem I see is that there is no way to ensure zero size
without a lock that blocks other writers.  Is that reliable?
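
For concreteness, the test would be just something like the sketch
below, though as asked above it is not clear it can be made race-free
(COPY_MODE_* is an invented flag, not existing code):

    /* at COPY start: pick the no-WAL path only if the table looks empty */
    if (RelationGetNumberOfBlocks(rel) == 0)
        copy_mode = COPY_MODE_EXCLUSIVE;    /* no WAL; blocks other writers */
    else
        copy_mode = COPY_MODE_SHARE;        /* normal WAL-logged inserts */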

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: Using multi-row technique with COPY

From: Tom Lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> One idea for default behavior would be to use EXCLUSIVE when the table
> is zero size.  I think that would do pg_dump and most of the user cases,
> and of course users could override the default by using a keyword.  We
> could emit a NOTICE if an exclusive lock is used without an EXCLUSIVE
> keyword.  One problem I see is that there is no way to ensure zero size
> without a lock that blocks other writers.  Is that reliable?

No, and if you try to upgrade your lock after checking, you create a
deadlock problem.

Something that would probably be reasonable, and require *no* weird new
syntax, is to shortcut in a COPY into a table created in the current
transaction.  I believe we still keep a flag in the relcache indicating
whether that's the case ...
        regards, tom lane


Re: Using multi-row technique with COPY

From: Bruce Momjian
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > One idea for default behavior would be to use EXCLUSIVE when the table
> > is zero size.  I think that would do pg_dump and most of the user cases,
> > and of course users could override the default by using a keyword.  We
> > could emit a NOTICE if an exclusive lock is used without an EXCLUSIVE
> > keyword.  One problem I see is that there is no way to ensure zero size
> > without a lock that blocks other writers.  Is that reliable?
> 
> No, and if you try to upgrade your lock after checking, you create a
> deadlock problem.
> 
> Something that would probably be reasonable, and require *no* weird new
> syntax, is to shortcut in a COPY into a table created in the current
> transaction.  I believe we still keep a flag in the relcache indicating
> whether that's the case ...

So if the table is created in the current transaction, we don't log? 
Yes, I guess, but do we want to propagate that into pg_dump output?  I
would think not.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: Using multi-row technique with COPY

From: Tom Lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom Lane wrote:
>> Something that would probably be reasonable, and require *no* weird new
>> syntax, is to shortcut in a COPY into a table created in the current
>> transaction.  I believe we still keep a flag in the relcache indicating
>> whether that's the case ...

> So if the table is created in the current transaction, we don't log? 

Log, yes, unless it's a temp table.  The point is we could avoid taking
buffer content locks.  Come to think of it, we could implement that
trivially in the heapam.c routines; it would then apply to any table
update whether generated by COPY or otherwise.
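
(Roughly like this in the heapam.c insert path; the relcache field name
below is from memory, so treat it as illustrative:)

    /* a table created by our own transaction is invisible to every
     * other backend, so nobody else can be touching its buffers */
    bool    need_lock = (relation->rd_createSubid == InvalidSubTransactionId);

    if (need_lock)
        LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
    /* ... place the tuple on the page as we do now ... */
    if (need_lock)
        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);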

> Yes, I guess, but do we want to propagate that into pg_dump output?  I
> would think not.

Exactly my point; we don't have to change any syntax, so pg_dump
doesn't care.
        regards, tom lane


Re: Using multi-row technique with COPY

From: Bruce Momjian
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> Something that would probably be reasonable, and require *no* weird new
> >> syntax, is to shortcut in a COPY into a table created in the current
> >> transaction.  I believe we still keep a flag in the relcache indicating
> >> whether that's the case ...
> 
> > So if the table is created in the current transaction, we don't log? 
> 
> Log, yes, unless it's a temp table.  The point is we could avoid taking
> buffer content locks.  Come to think of it, we could implement that
> trivially in the heapam.c routines; it would then apply to any table
> update whether generated by COPY or otherwise.

I am confused.  This optimization prevents locking, but still requires
logging?

> > Yes, I guess, but do we want to propagate that into pg_dump output?  I
> > would think not.
> 
> Exactly my point; we don't have to change any syntax, so pg_dump
> doesn't care.

But can pg_dump take advantage of this optimization?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: Using multi-row technique with COPY

From: Simon Riggs
On Tue, 2005-11-29 at 14:17 -0500, Bruce Momjian wrote:
> Simon Riggs wrote:
> > As a further enhancement, I would also return to the NOLOGGING option
> > for COPY. Previously we had said that COPY LOCK was the way to go -
> > taking a full table lock to prevent concurrent inserts to a block from a
> > COPY that didn't write WAL and another backend which wanted to write WAL
> > about that block. With the above suggested all-inserts-at-once
> > optimization, it would no longer be a requirement to lock the table.
> > That means we can continue to take advantage of the ability to run
> > multiple COPY loads into the same table. Avoiding writing WAL will
> > further reduce CPU by about 15% and I/O by about 50%. 
> > 
> > I would also suggest that pg_dump be changed to use the NOLOGGING option
> > by default, with an option to work as previously.
> 
> For those who have been around, they know I dislike it when the
> behavior that 95% of our users desire is not the default.

I like that approach and it makes for a great product.

> I think the COPY
> NOLOGGING idea falls in that category.  I would like to explore if there
> is a way to have COPY automatically do no logging where possible by
> default.

see last, below

> First, I think NOLOGGING is probably the wrong keyword.  I am thinking
> SHARE/EXCLUSIVE is best because they are already keywords, and they
> explain the effect of the flag on other applications, rather than the
> LOGGING capability, which is invisible to applications.
> 
> I am thinking we would have COPY WITH [ [ EXCLUSIVE | SHARE ] [ LOCK ]] ...
> EXCLUSIVE lock would be NOLOGGING, SHARE would do logging because other
> applications could insert into the table at the same time (and do
> UPDATES/DELETES of the inserted rows).

I wasn't trying to suggest new wording for COPY LOCK, or whatever it was
called previously. My suggestion was to alter the underlying mechanism
so that it would not actually need to take a lock at all. So I was
trying to find a name that matched that thought.... so your comments
above don't really apply.... but I'm more than happy to have others
suggest names and will go with the flow.

> One idea for default behavior would be to use EXCLUSIVE when the table
> is zero size.  I think that would do pg_dump and most of the user cases,
> and of course users could override the default by using a keyword.  We
> could emit a NOTICE if an exclusive lock is used without an EXCLUSIVE
> keyword.  One problem I see is that there is no way to ensure zero size
> without a lock that blocks other writers.  Is that reliable?

I think it would be possible to say: if, when a COPY starts, the table
has zero blocks, then we do a NOLOGGING (change_the_name) mode operation.

That gives a problem for recovery. If you lose the server and recover
with PITR, that table would come back empty. I'd want to make very sure
that I knew whether it was going to be there, or not, and I want it to
be my choice. So, I agree with the thought from an automatic performance
perspective, but not from a recovery perspective.

(There is some irony in being the person to both advocate PITR and to
try so hard not to log...)

I think of PITR as the default mode of operation now; almost all Oracle,
DB2, SQL Server etc. users make consistent use of archive log mode /
archive recovery.

Best Regards, Simon Riggs



Re: Using multi-row technique with COPY

From: Simon Riggs
On Tue, 2005-11-29 at 15:15 -0500, Bruce Momjian wrote:
> Tom Lane wrote:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > Tom Lane wrote:
> > >> Something that would probably be reasonable, and require *no* weird new
> > >> syntax, is to shortcut in a COPY into a table created in the current
> > >> transaction.  I believe we still keep a flag in the relcache indicating
> > >> whether that's the case ...
> > 
> > > So if the table is created in the current transaction, we don't log? 
> > 
> > Log, yes, unless it's a temp table.  The point is we could avoid taking
> > buffer content locks.  Come to think of it, we could implement that
> > trivially in the heapam.c routines; it would then apply to any table
> > update whether generated by COPY or otherwise.
> 
> I am confused.  This optimization prevents locking, but still requires
> logging?

It minimises *buffer* lwlocks, not data locks.

Best Regards, Simon Riggs



Re: Using multi-row technique with COPY

From: Simon Riggs
On Tue, 2005-11-29 at 14:56 -0500, Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> Something that would probably be reasonable, and require *no* weird new
> >> syntax, is to shortcut in a COPY into a table created in the current
> >> transaction.  I believe we still keep a flag in the relcache indicating
> >> whether that's the case ...
> 
> > So if the table is created in the current transaction, we don't log? 
> 
> Log, yes, unless it's a temp table.  The point is we could avoid taking
> buffer content locks.  Come to think of it, we could implement that
> trivially in the heapam.c routines; it would then apply to any table
> update whether generated by COPY or otherwise.

We should do that, yes, we've discussed that before. I'm happy to blat
that out.

...but I do still want to improve COPY in other ways.

> > Yes, I guess, but do we want to propagate that into pg_dump output?  I
> > would think not.
> 
> Exactly my point; we don't have to change any syntax, so pg_dump
> doesn't care.

Does pg_dump wrap the CREATE TABLE and COPY into the same transaction?

Best Regards, Simon Riggs



Re: Using multi-row technique with COPY

From: Tom Lane
Simon Riggs <simon@2ndquadrant.com> writes:
> Tom Lane wrote:
>> Log, yes, unless it's a temp table.  The point is we could avoid taking
>> buffer content locks.  Come to think of it, we could implement that
>> trivially in the heapam.c routines; it would then apply to any table
>> update whether generated by COPY or otherwise.

> We should do that, yes, we've discussed that before.

Not that I recall ... anyway, I had second thoughts about this while
eating lunch.  A buffer for a new table can reasonably be assumed not to
be accessed by any other backend, but we can *not* make that assumption
for the bgwriter.  This means that the bgwriter might come along and
dump the buffer to disk while we are in the midst of scribbling on it.
Since we'll mark the thing dirty after we finish scribbling, no harm
is done --- unless there is a crash after we commit and before the next
write of the buffer occurs.  In that case we have a corrupt copy of the
buffer on disk, which is going to pose a problem for recovery.  (As an
example, the corrupt copy might have an invalid LSN that is larger than
it really should be, which will cause us to ignore the WAL record that
might have otherwise let us fix the problem.)

As far as I can see at the moment, such a problem will be repaired
during recovery *as long as you have the full-page-writes option on*.
Otherwise not.  So we'd probably have to restrict the optimization to
happen only if that's on, as well as only for new tables.


>> Exactly my point; we don't have to change any syntax, so pg_dump
>> doesn't care.

> Does pg_dump wrap the CREATE TABLE and COPY into the same transaction?

No, but you can manually put BEGIN and END around the whole script.
Some people do that anyway.  (It wouldn't be a bad idea to add an option
to pg_restore to do it, since there's no very convenient way to add
begin/end manually when piping pg_restore right into a database.)
        regards, tom lane


Re: Using multi-row technique with COPY

From: Simon Riggs
On Mon, 2005-11-28 at 09:26 -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > I don't see why we couldn't have an additional index access method entry
> > point to insert multiple rows on one call.
> 
> I think Simon was mainly on about the idea of inserting multiple *heap*
> entries with one call, ie, only one cycle of locking a heap buffer.
> It's not apparent to me that there'd be any noticeable win from batching
> index insertions, because even if the heap entries are all on one page,
> it's unlikely their index entries would be.

It might be a valid idea for monotonically increasing key indexes, but
I'll leave that idea alone for now. Tom's right: I was discussing heap
blocks only (for now).

> I'm a bit dubious about the whole thing from a concurrency standpoint,
> too: holding locks longer is not a great thing in general.  The recent
> patch to avoid repeated locking cycles during a read seems OK, because
> (1) checking visibility of a tuple is ordinarily pretty cheap, and
> (2) it's only a shared lock so other people can be doing the same thing
> concurrently.  However, heap insertion can make neither of those claims.
> You're talking about an exclusive lock, and you're talking about holding
> it while copying data around and then making multiple WAL entries.

I wasn't talking about holding locks for any longer than normal.

Here's the rough code, in heapam.c:

heap_bulk_insert(tuple, bool needsToasting)
{
    /* if tuple needs toasting, unpick the buffer */
    if (needsToasting)
    {
        for (i = 0; i < nTuplesBuffered; i++)
            heap_insert(tupleBuffer[i]);
        heap_insert(tuple);
        nTuplesBuffered = 0;
        sizeTupleBuffer = 0;
        return;
    }

    /*
     * if buffer overflows, write the whole tuplebuffer in one go to a
     * freshly allocated shared_buffer/table block so that we do only one
     * buffer manager lock pair per block rather than one per row
     */
    tuple_length = length(tuple);
    if ((tuple_length + sizeTupleBuffer) > MAXDATAINBLOCK)
    {
        RelationGetBufferForTupleArray(nTuplesBuffered, tupleBuffer);
        tupleBuffer[0] = tuple;
        nTuplesBuffered = 1;
        sizeTupleBuffer = tuple_length;
    }
    else
    {
        tupleBuffer[nTuplesBuffered++] = tuple;     /* append to the buffer */
        sizeTupleBuffer += tuple_length;
    }

    return;
}

(needsToasting would be set for a tuple during parsing by COPY, which is
fine since it does actually know how long columns are. Actual decision
to toast or not can be wrapped in a modular call to heapam.c)

RelationGetBufferForTupleArray(), in hio.c, would be very similar to
RelationGetBufferForTuple() but less complex, because it never needs to
handle the update case.

If we do choose to write WAL records for these inserts, then we can
simply log the whole block, rather than making repeated individual
inserts into WAL buffers, so we save on WALInsertLock overhead too. That's
no more than we do after a checkpoint anyway, so no problems.

If the transaction fails, then we lose the buffer: so what? Slightly
better than writing them to the table then failing.    
Most importantly, we don't actually allocate a new data block until we
decide to flush the tupleBuffer. So the exclusive lock is only held
momentarily while we insert all the rows from the tupleBuffer into the
fresh block.

There's no modularity weirdness, but we do need some code in COPY to
decide whether it should use fast mode or not.

Best Regards, Simon Riggs



Re: Using multi-row technique with COPY

From
Simon Riggs
Date:
On Tue, 2005-11-29 at 15:58 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Tom Lane wrote:
> >> Log, yes, unless it's a temp table.  The point is we could avoid taking
> >> buffer content locks.  Come to think of it, we could implement that
> >> trivially in the heapam.c routines; it would then apply to any table
> >> update whether generated by COPY or otherwise.
> 
> > We should do that, yes, we've discussed that before.
> 
> Not that I recall ... 

It was a quick comment of yours only, not a long discussion. It's a good
idea and we should do it, but it's not everything I want (other posts
pursue the other ideas).

> anyway, I had second thoughts about this while
> eating lunch.  A buffer for a new table can reasonably be assumed not to
> be accessed by any other backend, but we can *not* make that assumption
> for the bgwriter.  This means that the bgwriter might come along and
> dump the buffer to disk while we are in the midst of scribbling on it.
> Since we'll mark the thing dirty after we finish scribbling, no harm
> is done --- unless there is a crash after we commit and before the next
> write of the buffer occurs.  In that case we have a corrupt copy of the
> buffer on disk, which is going to pose a problem for recovery.  (As an
> example, the corrupt copy might have an invalid LSN that is larger than
> it really should be, which will cause us to ignore the WAL record that
> might have otherwise let us fix the problem.)
> 
> As far as I can see at the moment, such a problem will be repaired
> during recovery *as long as you have the full-page-writes option on*.
> Otherwise not.  So we'd probably have to restrict the optimization to
> happen only if that's on, as well as only for new tables.

Not sure I understand that. If there's a crash then the transaction
failed so any data on the partially written block is invisible. Why else
would the block become corrupt? 

Whether full page writes is on or not, we still fill the block. After
which we never log the full page because the block doesn't change again.

Should we checkpoint halfway through filling the block, then it's still
written to disk whether we do full page writes or not.

Maybe it's just late here...

> >> Exactly my point; we don't have to change any syntax, so pg_dump
> >> doesn't care.
> 
> > Does pg_dump wrap the CREATE TABLE and COPY into the same transaction?
> 
> No, but you can manually put BEGIN and END around the whole script.
> Some people do that anyway.  (It wouldn't be a bad idea to add an option
> to pg_restore to do it, since there's no very convenient way to add
> begin/end manually when piping pg_restore right into a database.)

Can do.

Best Regards, Simon Riggs



Re: Using multi-row technique with COPY

From: Tom Lane
Simon Riggs <simon@2ndquadrant.com> writes:
> On Tue, 2005-11-29 at 15:58 -0500, Tom Lane wrote:
>> anyway, I had second thoughts about this while
>> eating lunch.  A buffer for a new table can reasonably be assumed not to
>> be accessed by any other backend, but we can *not* make that assumption
>> for the bgwriter.  This means that the bgwriter might come along and
>> dump the buffer to disk while we are in the midst of scribbling on it.
>> Since we'll mark the thing dirty after we finish scribbling, no harm
>> is done --- unless there is a crash after we commit and before the next
>> write of the buffer occurs.

> Not sure I understand that. If there's a crash then the transaction
> failed so any data on the partially written block is invisible.

I said "after we commit".  In any case, corrupt data on-disk could still
cause WAL recovery to fail: at the time we process a WAL record, we have
no way to know that the transaction that generated it will ultimately
not commit.

> Whether full page writes is on or not, we still fill the block. After
> which we never log the full page because the block doesn't change again.

You seem to be confusing this idea with some of your own...
        regards, tom lane


Re: Using multi-row technique with COPY

From: Bruce Momjian
Please let me back up and ask a more simplistic question.  I understand
the idea of allowing COPY to insert rows with less locking, but I am
wondering about the NOLOGGING idea.  On commit, we must guarantee that
all the rows are in the table, so what advantage is there to a NOLOGGING
option?


--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: Using multi-row technique with COPY

From: Simon Riggs
On Tue, 2005-11-29 at 18:51 -0500, Bruce Momjian wrote:
> Please let me back up and ask a more simplistic question.  I understand
> the idea of allowing COPY to insert rows with less locking, but I am
> wondering about the NOLOGGING idea.  On commit, we must guarantee that
> all the rows are in the table, so what advantage is there to a NOLOGGING
> option?

We would need to flush all the blocks in the table out of cache at
commit time, for that table only. (As with CTAS, CIDX).
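
(This is loosely the same commit-time step that CREATE TABLE AS SELECT
already uses; exact calls from memory, so take them as illustrative:)

    /* at commit, for each no-WAL COPY target: the data must reach disk
     * before the commit record, since there is no WAL to replay it from */
    FlushRelationBuffers(rel);      /* write out its dirty buffers */
    smgrimmedsync(rel->rd_smgr);    /* fsync the underlying file */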


To allow a full discussion, I'll separate the various ideas:
1. COPY using bulk copy mode
2. NOLOGGING
3. Created in this transaction
4. ERRORTABLES
5. Uniqueness violations

Right now, I think you have reasonable objections/input to (2) that we
should discuss more before agreeing on a way forward. I would aim to do
(1) first, then return with a full and much better explained proposal for
(2) for us to discuss, since (2) depends upon (1) somewhat.

(3) and (4) seem to have been generally accepted, but (5) seems not
viable with present thinking.

Best Regards, Simon Riggs




Re: Using multi-row technique with COPY

From: Tom Lane
Simon Riggs <simon@2ndquadrant.com> writes:
> [...context omitted...]
> We would need to flush all the blocks in the table out of cache at
> commit time, for that table only.

This seems strikingly close to the Old Concept of temp tables, which
we got rid of for good-and-sufficient reasons.  You might want to
spend some time reviewing the archives before proposing that we
reintroduce it.
        regards, tom lane


Re: Using multi-row technique with COPY

From: Bruce Momjian
Simon Riggs wrote:
> On Tue, 2005-11-29 at 18:51 -0500, Bruce Momjian wrote:
> > Please let me back up and ask a more simplistic question.  I understand
> > the idea of allowing COPY to insert rows with less locking, but I am
> > wondering about the NOLOGGING idea.  On commit, we must guarantee that
> > all the rows are in the table, so what advantage is there to a NOLOGGING
> > option?
> 
> We would need to flush all the blocks in the table out of cache at
> commit time, for that table only. (As with CTAS, CIDX).
> 
> 
> To allow a full discussion, I'll separate the various ideas:
> 1. COPY using bulk copy mode

What is "bulk copy mode"?  Full page loading?

> 2. NOLOGGING

Means flush/fsync table pages on commit.

> 3. Created in this transaction

Reduces locking?

> 4. ERRORTABLES
> 5. Uniqueness violations
> 
> Right now, I think you have reasonable objections/input to (2) that we
> should discuss more before agreeing a way forward. I would aim to do (1)
> first, then return with a full and much better explained proposal for
> (2) for us to discuss, since (2) depends upon (1) somewhat.
> 
> (3) and (4) seem to have been generally accepted, but (5) seems not
> viable with present thinking.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: Using multi-row technique with COPY

From: "Jim C. Nasby"
On Tue, Nov 29, 2005 at 01:35:09PM +0100, Harald Fuchs wrote:
> In article <20051129004435.GR78939@pervasive.com>,
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> 
> > On Sun, Nov 27, 2005 at 07:44:55PM +0000, Simon Riggs wrote:
> >> not have any unique indexes or row triggers. It should be possible to
> >> take advantage of this automatically when those requirements are met,
> >> without any new options. Just as it was with Seq Scans, this is worth
> >> about 10% reduction in CPU for a COPY FROM.
> > <snip> 
> >> FSM access would need to change slightly to allow for whole-block-only
> >> requests to be made for heaps, without damaging the average row length
> >> calculation. It might be simpler to ignore FSM entirely?
> 
> > Does that mean that this fast copy would end up not re-using space on
> > pages that have space available? ISTM that's something users would want
> > to be able to over-ride. In fact, it seems like it shouldn't be a
> > default behavior...
> 
> Why not?  If you later do INSERT/UPDATE/DELETE, space will get reused
> anyway, and if you don't (i.e. one-time bulk load on a constant
> table), you should afterwards do a VACUUM FULL ANALYZE anyway.

Because the proposal is to fill entirely new pages from COPY, which
pretty much makes FSM useless since you could only use pages in the FSM
if they were entirely empty.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Using multi-row technique with COPY

From: Simon Riggs
On Wed, 2005-11-30 at 02:10 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > [...context omitted...]
> > We would need to flush all the blocks in the table out of cache at
> > commit time, for that table only.
> 
> This seems striking close to the Old Concept of temp tables, which
> we got rid of for good-and-sufficient reasons.  You might want to
> spend some time reviewing the archives before proposing that we
> reintroduce it.

My suggestion is that I use the same mechanism as was introduced for 8.1
with the CREATE TABLE AS SELECT statement.

Best Regards, Simon Riggs