Thread: Using multi-row technique with COPY
Since we just discussed using a multi-row per buffer lock technique with
Seq Scans, it seems appropriate to discuss a similar technique with COPY
FROM that I had been mulling over.

COPY FROM can read in sufficient rows until it has a whole block worth
of data, then get a new block and write it all with one pair of
BufferLock calls. This would only be possible, AFAICS, when a table does
not have any unique indexes or row triggers. It should be possible to
take advantage of this automatically when those requirements are met,
without any new options. Just as it was with Seq Scans, this is worth
about a 10% reduction in CPU for a COPY FROM.

The main use case for this is initial table loads or restores from
dumps. However, it applies more widely, since there is no requirement
that the table is empty, nor that it has only just been created within
the current transaction.

FSM access would need to change slightly to allow whole-block-only
requests to be made for heaps, without damaging the average row length
calculation. It might be simpler to ignore the FSM entirely?

As a further enhancement, I would also return to the NOLOGGING option
for COPY. Previously we had said that COPY LOCK was the way to go -
taking a full table lock to prevent concurrent inserts to a block from a
COPY that didn't write WAL and another backend which wanted to write WAL
about that block. With the above suggested all-inserts-at-once
optimization, it would no longer be a requirement to lock the table.
That means we can continue to take advantage of the ability to run
multiple COPY loads into the same table. Avoiding writing WAL will
further reduce CPU by about 15% and I/O by about 50%.

I would also suggest that pg_dump be changed to use the NOLOGGING option
by default, with an option to work as previously.

Comments?

Best Regards, Simon Riggs
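To make the claimed saving concrete, here is a minimal standalone sketch
of the batching idea - not PostgreSQL code; BLCKSZ and the lock calls
below are stand-ins for the real server primitives. Rows are staged into
a block-sized buffer, so the simulated buffer lock is paid once per
block rather than once per row:

    #include <stdio.h>
    #include <string.h>

    #define BLCKSZ 8192            /* PostgreSQL's default block size */

    static char block_buf[BLCKSZ]; /* staging area for one heap block */
    static int  used = 0;          /* bytes consumed in the staging area */

    static void lock_buffer(void)   { puts("LockBuffer(exclusive)"); }
    static void unlock_buffer(void) { puts("LockBuffer(unlock)"); }

    /* Flush the staged rows with ONE lock/unlock pair. */
    static void flush_block(void)
    {
        if (used == 0)
            return;
        lock_buffer();
        printf("write %d bytes into a fresh block\n", used);
        unlock_buffer();
        used = 0;
    }

    /* Stage one row; flush first when this row would overflow the block. */
    static void bulk_insert(const char *row, int len)
    {
        if (used + len > BLCKSZ)
            flush_block();         /* one lock pair per block, not per row */
        memcpy(block_buf + used, row, len);
        used += len;
    }

    int main(void)
    {
        char row[100];

        memset(row, 'x', sizeof(row));
        for (int i = 0; i < 200; i++)   /* ~200 rows fill ~3 blocks */
            bulk_insert(row, sizeof(row));
        flush_block();                  /* flush the final, partly full block */
        return 0;
    }

With roughly 80 rows of 100 bytes per 8 KB block, the lock traffic drops
by about that factor, which is where the CPU saving would come from.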
Simon Riggs <simon@2ndquadrant.com> writes:
> COPY FROM can read in sufficient rows until it has a whole block worth
> of data, then get a new block and write it all with one pair of
> BufferLock calls.

> Comments?

I don't see any way to do this without horrible modularity violations.
The COPY code has no business going anywhere near individual buffers;
for that matter, it doesn't even really know what "a block worth" of
data is, since the tuples it's dealing with aren't toasted yet.

			regards, tom lane
On Sun, Nov 27, 2005 at 05:45:31PM -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > COPY FROM can read in sufficient rows until it has a whole block worth
> > of data, then get a new block and write it all with one pair of
> > BufferLock calls.
>
> > Comments?
>
> I don't see any way to do this without horrible modularity violations.
> The COPY code has no business going anywhere near individual buffers;
> for that matter, it doesn't even really know what "a block worth" of
> data is, since the tuples it's dealing with aren't toasted yet.

I wonder if you could get the same effect by having COPY copy into a
temporary table (thus no WAL, no buffer locks) and then every hundred
rows do a SELECT INTO. Mind you, that just moves the modularity
violation elsewhere, since only the bufmgr really knows about the size
of buffers.

Whatever happened to that idea to build an entire datafile with COPY or
some external tool and simply copy it into place and update the
catalog?

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
On Mon, 2005-11-28 at 09:40 +0100, Martijn van Oosterhout wrote:
> On Sun, Nov 27, 2005 at 05:45:31PM -0500, Tom Lane wrote:
> > Simon Riggs <simon@2ndquadrant.com> writes:
> > > COPY FROM can read in sufficient rows until it has a whole block worth
> > > of data, then get a new block and write it all with one pair of
> > > BufferLock calls.
> >
> > > Comments?
>
> Whatever happened to that idea to build an entire datafile with COPY or
> some external tool and simply copy it into place and update the
> catalog?

What's wrong with tuning the server to do this?

Zapping the catalog as a normal operation is the wrong approach if you
want a robust system. All actions on the catalog must be under tight
control.

Most other RDBMS support a "fast path" loader, but all of them include
strong hooks into the main server to maintain the catalog correctly.
That is one approach, but it requires creation of an external API -
which seems more work, plus a security risk. Copying data in a block at
a time is the basic technique all use.

I never discuss implementing features that other RDBMS have for any
other reason than that a similar use case exists for both. There are
many features where PostgreSQL is already ahead.

Best Regards, Simon Riggs
On Sun, 2005-11-27 at 17:45 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > COPY FROM can read in sufficient rows until it has a whole block worth
> > of data, then get a new block and write it all with one pair of
> > BufferLock calls.
>
> > Comments?
>
> I don't see any way to do this without horrible modularity violations.
> The COPY code has no business going anywhere near individual buffers;
> for that matter, it doesn't even really know what "a block worth" of
> data is, since the tuples it's dealing with aren't toasted yet.

I've taken on board your comments about modularity issues from earlier.
[I've not included anything on unique indexes, notice.]

I was expecting to buffer this in the heap access method with a new
call, say, heap_bulk_insert(), rather than have all that code hanging
around in COPY. A lower-level routine, RelationGetBufferForTupleArray,
can handle the actual grunt work. It can work, without ugliness.

We'd need to handle a buffer bigger than a single tuple anyway, so you
keep adding tuples until the last one tips over the edge, which then
gets saved for the next block. The heap access method knows about
blocks.

We could reasonably do a test for would-be-toasted within those
routines. I should have said that this wouldn't apply if any of the
tuples require toasting, which of course has to be a dynamic test.

COPY would only need to know whether it was invoking the normal or the
bulk mode, which is reasonable, since it knows about indexes, triggers,
etc.

Best Regards, Simon Riggs
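Read as a rough interface sketch, Simon's description implies something
like the declarations below. None of these exist in the PostgreSQL
tree; the typedefs are stubs standing in for the real server types:

    /* Hypothetical API sketch only - not PostgreSQL code. */
    typedef struct RelationData *Relation;
    typedef struct HeapTupleData *HeapTuple;
    typedef int Buffer;

    /*
     * Buffer one tuple inside the heap access method; when a block's
     * worth has accumulated, hand the array to the routine below.
     * Tuples that would need toasting fall back to plain heap_insert().
     */
    void heap_bulk_insert(Relation rel, HeapTuple tup);

    /* Flush any tuples still buffered (call once when COPY finishes). */
    void heap_bulk_insert_finish(Relation rel);

    /*
     * hio.c counterpart: allocate a fresh page, place all ntuples on it
     * under a single exclusive lock pair, and return the buffer.
     * Simpler than RelationGetBufferForTuple because it never has to
     * handle the UPDATE same-page case.
     */
    Buffer RelationGetBufferForTupleArray(Relation rel,
                                          HeapTuple *tuples,
                                          int ntuples);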
Simon Riggs wrote:
> On Sun, 2005-11-27 at 17:45 -0500, Tom Lane wrote:
> > Simon Riggs <simon@2ndquadrant.com> writes:
> > > COPY FROM can read in sufficient rows until it has a whole block worth
> > > of data, then get a new block and write it all with one pair of
> > > BufferLock calls.
> >
> > > Comments?
> >
> > I don't see any way to do this without horrible modularity violations.
> > The COPY code has no business going anywhere near individual buffers;
> > for that matter, it doesn't even really know what "a block worth" of
> > data is, since the tuples it's dealing with aren't toasted yet.
>
> I've taken on board your comments about modularity issues from earlier.
> [I've not included anything on unique indexes, notice.]

I don't see why we couldn't have an additional index access method
entry point to insert multiple rows in one call. For a unique index, we
could return an array of entries that violated the condition (if we
were to implement such a feature for single-entry insertion).

--
Alvaro Herrera                 http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
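What Alvaro describes might look roughly like this. It is purely
hypothetical - no such entry point exists in the pg_am interface - and
the typedefs are stubs for the server types:

    /* Hypothetical access-method entry point; not part of PostgreSQL. */
    typedef struct RelationData *Relation;
    typedef struct IndexTupleData *IndexTuple;
    typedef struct ItemPointerData *ItemPointer;

    /*
     * Insert ntuples index entries in one call.  For a unique index,
     * the heap TIDs of entries that violated uniqueness are copied
     * into the caller-supplied violators[] array (sized ntuples) and
     * their count is returned; zero means every entry went in cleanly.
     */
    int index_bulk_insert(Relation index,
                          IndexTuple *itups, ItemPointer *heap_tids,
                          int ntuples, ItemPointer *violators);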
On Mon, 2005-11-28 at 00:56 +0000, Simon Riggs wrote:
> On Sun, 2005-11-27 at 17:45 -0500, Tom Lane wrote:
> > Simon Riggs <simon@2ndquadrant.com> writes:
> > > COPY FROM can read in sufficient rows until it has a whole block worth
> > > of data, then get a new block and write it all with one pair of
> > > BufferLock calls.
> >
> > > Comments?
> >
> > I don't see any way to do this without horrible modularity violations.
> > The COPY code has no business going anywhere near individual buffers;
> > for that matter, it doesn't even really know what "a block worth" of
> > data is, since the tuples it's dealing with aren't toasted yet.
>
> I've taken on board your comments about modularity issues from earlier.
> [I've not included anything on unique indexes, notice.]
>
> I was expecting to buffer this in the heap access method with a new
> call, say, heap_bulk_insert(), rather than have all that code hanging
> around in COPY. A lower-level routine, RelationGetBufferForTupleArray,
> can handle the actual grunt work. It can work, without ugliness.
>
> We'd need to handle a buffer bigger than a single tuple anyway, so you
> keep adding tuples until the last one tips over the edge, which then
> gets saved for the next block. The heap access method knows about
> blocks.
>
> We could reasonably do a test for would-be-toasted within those
> routines. I should have said that this wouldn't apply if any of the
> tuples require toasting, which of course has to be a dynamic test.

If we had a buffer big enough (say 10-100x the page size), then we
would not actually need to test for toasting. We can just pass the big
buffer to heap_bulk_insert(), which inserts the whole buffer in as big
chunks as needed to fill the free space on pages (with a single page
lock per page).

--------------
Hannu
On Mon, Nov 28, 2005 at 09:39:36AM +0000, Simon Riggs wrote:
> On Mon, 2005-11-28 at 09:40 +0100, Martijn van Oosterhout wrote:
> > Whatever happened to that idea to build an entire datafile with COPY or
> > some external tool and simply copy it into place and update the
> > catalog?
>
> What's wrong with tuning the server to do this?

It was just a question. I remember it being mentioned but didn't hear
whether it had been rejected or just not implemented.

I was wondering if maybe we can approach this another way. Whenever you
create a new block by fetching with P_NEW, you pretty much know that no
other backends are going to be interested in this block. Doesn't this
mean you can simply hold this block exclusively until the end of the
transaction? Hence you can avoid all the locking on this block.

If this achieves the same reduction in block locking, then it would be
active in far more places, such as SELECT INTO or any large copy (I
think index creation is already special-cased). Once a block has been
written once, it would revert to standard locking.

Is this feasible?

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
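A standalone toy model of Martijn's suggestion (not server code): blocks
obtained via the stand-in for P_NEW are remembered as private to the
transaction, locking is skipped for them, and at commit they revert to
normal locking:

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_PRIVATE 64

    static int private_blocks[MAX_PRIVATE]; /* blocks we created via P_NEW */
    static int n_private = 0;

    static bool block_is_private(int blkno)
    {
        for (int i = 0; i < n_private; i++)
            if (private_blocks[i] == blkno)
                return true;
        return false;
    }

    static int extend_relation(void)        /* stand-in for P_NEW */
    {
        static int next_blkno = 0;
        int blkno = next_blkno++;

        if (n_private < MAX_PRIVATE)
            private_blocks[n_private++] = blkno;
        return blkno;
    }

    static void lock_for_write(int blkno)
    {
        if (block_is_private(blkno))
            return;                         /* no one else can see it yet */
        printf("LockBuffer(%d, exclusive)\n", blkno);
    }

    static void at_commit(void)
    {
        n_private = 0;                      /* blocks become public again */
    }

    int main(void)
    {
        int b = extend_relation();

        lock_for_write(b);                  /* skipped: block is private */
        at_commit();
        lock_for_write(b);                  /* now takes the real lock */
        return 0;
    }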
Alvaro Herrera <alvherre@commandprompt.com> writes:
> I don't see why we couldn't have an additional index access method
> entry point to insert multiple rows in one call.

I think Simon was mainly on about the idea of inserting multiple *heap*
entries with one call, ie, only one cycle of locking a heap buffer.
It's not apparent to me that there'd be any noticeable win from
batching index insertions, because even if the heap entries are all on
one page, it's unlikely their index entries would be.

I'm a bit dubious about the whole thing from a concurrency standpoint,
too: holding locks longer is not a great thing in general. The recent
patch to avoid repeated locking cycles during a read seems OK, because
(1) checking visibility of a tuple is ordinarily pretty cheap, and
(2) it's only a shared lock, so other people can be doing the same
thing concurrently. However, heap insertion can make neither of those
claims. You're talking about an exclusive lock, and you're talking
about holding it while copying data around and then making multiple WAL
entries.

			regards, tom lane
On Sun, Nov 27, 2005 at 07:44:55PM +0000, Simon Riggs wrote:
> not have any unique indexes or row triggers. It should be possible to
> take advantage of this automatically when those requirements are met,
> without any new options. Just as it was with Seq Scans, this is worth
> about a 10% reduction in CPU for a COPY FROM.

<snip>

> FSM access would need to change slightly to allow whole-block-only
> requests to be made for heaps, without damaging the average row length
> calculation. It might be simpler to ignore the FSM entirely?

Does that mean that this fast copy would end up not re-using space on
pages that have space available? ISTM that's something users would want
to be able to override. In fact, it seems like it shouldn't be a
default behavior...

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
In article <20051129004435.GR78939@pervasive.com>,
"Jim C. Nasby" <jnasby@pervasive.com> writes:

> On Sun, Nov 27, 2005 at 07:44:55PM +0000, Simon Riggs wrote:
>> not have any unique indexes or row triggers. It should be possible to
>> take advantage of this automatically when those requirements are met,
>> without any new options. Just as it was with Seq Scans, this is worth
>> about a 10% reduction in CPU for a COPY FROM.

> <snip>

>> FSM access would need to change slightly to allow whole-block-only
>> requests to be made for heaps, without damaging the average row length
>> calculation. It might be simpler to ignore the FSM entirely?

> Does that mean that this fast copy would end up not re-using space on
> pages that have space available? ISTM that's something users would want
> to be able to override. In fact, it seems like it shouldn't be a
> default behavior...

Why not? If you later do INSERT/UPDATE/DELETE, space will get reused
anyway, and if you don't (i.e. a one-time bulk load on a constant
table), you should afterwards do a VACUUM FULL ANALYZE anyway.
Harald Fuchs <hf0923x@protecting.net> writes:
> In article <20051129004435.GR78939@pervasive.com>,
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
>> Does that mean that this fast copy would end up not re-using space on
>> pages that have space available? ISTM that's something users would want
>> to be able to override. In fact, it seems like it shouldn't be a
>> default behavior...

> Why not? If you later do INSERT/UPDATE/DELETE, space will get reused
> anyway, and if you don't (i.e. a one-time bulk load on a constant
> table), you should afterwards do a VACUUM FULL ANALYZE anyway.

You lost me at the "FULL" part. In general, any "optimization" proposal
that requires VACUUM FULL as part of normal operation is a lost cause.
You should instead be thinking about procedures that won't require a
VACUUM FULL, ever.

			regards, tom lane
Simon Riggs wrote:
> As a further enhancement, I would also return to the NOLOGGING option
> for COPY. Previously we had said that COPY LOCK was the way to go -
> taking a full table lock to prevent concurrent inserts to a block from a
> COPY that didn't write WAL and another backend which wanted to write WAL
> about that block. With the above suggested all-inserts-at-once
> optimization, it would no longer be a requirement to lock the table.
> That means we can continue to take advantage of the ability to run
> multiple COPY loads into the same table. Avoiding writing WAL will
> further reduce CPU by about 15% and I/O by about 50%.
>
> I would also suggest that pg_dump be changed to use the NOLOGGING option
> by default, with an option to work as previously.

For those who have been around, they know I dislike having the behavior
that 95% of our users desire not be the default. I think the COPY
NOLOGGING idea falls into that category. I would like to explore
whether there is a way to have COPY automatically do no logging where
possible by default.

First, I think NOLOGGING is probably the wrong keyword. I am thinking
SHARE/EXCLUSIVE is best because they are already keywords, and they
explain the effect of the flag on other applications, rather than the
LOGGING capability, which is invisible to applications.

I am thinking we would have COPY WITH [ [ EXCLUSIVE | SHARE ] [ LOCK ] ] ...
An EXCLUSIVE lock would mean NOLOGGING; SHARE would do logging because
other applications could insert into the table at the same time (and do
UPDATEs/DELETEs of the inserted rows).

One idea for default behavior would be to use EXCLUSIVE when the table
is zero size. I think that would cover pg_dump and most of the user
cases, and of course users could override the default by using a
keyword. We could emit a NOTICE if an exclusive lock is used without an
EXCLUSIVE keyword. One problem I see is that there is no way to ensure
zero size without a lock that blocks other writers. Is that reliable?

--
  Bruce Momjian                     |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us            |  (610) 359-1001
  +  If your life is a hard drive,  |  13 Roberts Road
  +  Christ can be your backup.     |  Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> One idea for default behavior would be to use EXCLUSIVE when the table
> is zero size. I think that would cover pg_dump and most of the user
> cases, and of course users could override the default by using a
> keyword. We could emit a NOTICE if an exclusive lock is used without an
> EXCLUSIVE keyword. One problem I see is that there is no way to ensure
> zero size without a lock that blocks other writers. Is that reliable?

No, and if you try to upgrade your lock after checking, you create a
deadlock problem.

Something that would probably be reasonable, and require *no* weird new
syntax, is to shortcut in a COPY into a table created in the current
transaction. I believe we still keep a flag in the relcache indicating
whether that's the case ...

			regards, tom lane
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > One idea for default behavior would be to use EXCLUSIVE when the table
> > is zero size. I think that would cover pg_dump and most of the user
> > cases, and of course users could override the default by using a
> > keyword. We could emit a NOTICE if an exclusive lock is used without an
> > EXCLUSIVE keyword. One problem I see is that there is no way to ensure
> > zero size without a lock that blocks other writers. Is that reliable?
>
> No, and if you try to upgrade your lock after checking, you create a
> deadlock problem.
>
> Something that would probably be reasonable, and require *no* weird new
> syntax, is to shortcut in a COPY into a table created in the current
> transaction. I believe we still keep a flag in the relcache indicating
> whether that's the case ...

So if the table is created in the current transaction, we don't log?
Yes, I guess, but do we want to propagate that into pg_dump output? I
would think not.

--
  Bruce Momjian                     |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us            |  (610) 359-1001
  +  If your life is a hard drive,  |  13 Roberts Road
  +  Christ can be your backup.     |  Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom Lane wrote:
>> Something that would probably be reasonable, and require *no* weird new
>> syntax, is to shortcut in a COPY into a table created in the current
>> transaction. I believe we still keep a flag in the relcache indicating
>> whether that's the case ...

> So if the table is created in the current transaction, we don't log?

Log, yes, unless it's a temp table. The point is we could avoid taking
buffer content locks. Come to think of it, we could implement that
trivially in the heapam.c routines; it would then apply to any table
update whether generated by COPY or otherwise.

> Yes, I guess, but do we want to propagate that into pg_dump output? I
> would think not.

Exactly my point; we don't have to change any syntax, so pg_dump
doesn't care.

			regards, tom lane
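The test Tom calls trivial could look like the sketch below. The
rd_createSubid relcache field does exist (it is nonzero while the
relation was created by the current transaction or subtransaction), but
the helper and the one-field struct here are illustrative stubs, not
server code:

    #include <stdbool.h>

    typedef unsigned int SubTransactionId;
    #define InvalidSubTransactionId ((SubTransactionId) 0)

    /* Stub: the real RelationData in utils/rel.h has many more fields. */
    typedef struct RelationData
    {
        SubTransactionId rd_createSubid; /* creating (sub)xact, or 0 */
    } RelationData, *Relation;

    /*
     * Hypothetical helper: true when the relation was created by the
     * current transaction, so no other backend can see it and heapam
     * routines could skip taking buffer content locks on its pages.
     */
    bool
    relation_is_new_in_xact(Relation rel)
    {
        return rel->rd_createSubid != InvalidSubTransactionId;
    }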
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> Something that would probably be reasonable, and require *no* weird new
> >> syntax, is to shortcut in a COPY into a table created in the current
> >> transaction. I believe we still keep a flag in the relcache indicating
> >> whether that's the case ...
>
> > So if the table is created in the current transaction, we don't log?
>
> Log, yes, unless it's a temp table. The point is we could avoid taking
> buffer content locks. Come to think of it, we could implement that
> trivially in the heapam.c routines; it would then apply to any table
> update whether generated by COPY or otherwise.

I am confused. This optimization prevents locking, but still requires
logging?

> > Yes, I guess, but do we want to propagate that into pg_dump output? I
> > would think not.
>
> Exactly my point; we don't have to change any syntax, so pg_dump
> doesn't care.

But can pg_dump take advantage of this optimization?

--
  Bruce Momjian                     |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us            |  (610) 359-1001
  +  If your life is a hard drive,  |  13 Roberts Road
  +  Christ can be your backup.     |  Newtown Square, Pennsylvania 19073
On Tue, 2005-11-29 at 14:17 -0500, Bruce Momjian wrote:
> Simon Riggs wrote:
> > As a further enhancement, I would also return to the NOLOGGING option
> > for COPY. Previously we had said that COPY LOCK was the way to go -
> > taking a full table lock to prevent concurrent inserts to a block from a
> > COPY that didn't write WAL and another backend which wanted to write WAL
> > about that block. With the above suggested all-inserts-at-once
> > optimization, it would no longer be a requirement to lock the table.
> > That means we can continue to take advantage of the ability to run
> > multiple COPY loads into the same table. Avoiding writing WAL will
> > further reduce CPU by about 15% and I/O by about 50%.
> >
> > I would also suggest that pg_dump be changed to use the NOLOGGING option
> > by default, with an option to work as previously.
>
> For those who have been around, they know I dislike having the behavior
> that 95% of our users desire not be the default.

I like that approach and it makes for a great product.

> I think the COPY
> NOLOGGING idea falls into that category. I would like to explore
> whether there is a way to have COPY automatically do no logging where
> possible by default.

see last, below

> First, I think NOLOGGING is probably the wrong keyword. I am thinking
> SHARE/EXCLUSIVE is best because they are already keywords, and they
> explain the effect of the flag on other applications, rather than the
> LOGGING capability, which is invisible to applications.
>
> I am thinking we would have COPY WITH [ [ EXCLUSIVE | SHARE ] [ LOCK ] ] ...
> An EXCLUSIVE lock would mean NOLOGGING; SHARE would do logging because
> other applications could insert into the table at the same time (and do
> UPDATEs/DELETEs of the inserted rows).

I wasn't trying to suggest new wording for COPY LOCK, or whatever it
was called previously. My suggestion was to alter the underlying
mechanism so that it would not actually need to take a lock at all. So
I was trying to find a name that matched that thought... so your
comments above don't really apply... but I'm more than happy to have
others suggest names and will go with the flow.

> One idea for default behavior would be to use EXCLUSIVE when the table
> is zero size. I think that would cover pg_dump and most of the user
> cases, and of course users could override the default by using a
> keyword. We could emit a NOTICE if an exclusive lock is used without an
> EXCLUSIVE keyword. One problem I see is that there is no way to ensure
> zero size without a lock that blocks other writers. Is that reliable?

I think it would be possible to say: if, when a COPY starts, the table
has zero blocks, then we do a NOLOGGING (change_the_name) mode
operation.

That gives a problem for recovery. If you lose the server and recover
with PITR, that table would come back empty. I'd want to make very sure
that I knew whether it was going to be there, or not, and I want it to
be my choice. So, I agree with the thought from an automatic
performance perspective, but not from a recovery perspective.

(There is some irony in being the person to both advocate PITR and to
try so hard not to log...)

I think of PITR as the default mode of operation now; almost all
Oracle, DB2, SQL Server etc. users make consistent use of archive log
mode/archive recovery.

Best Regards, Simon Riggs
On Tue, 2005-11-29 at 15:15 -0500, Bruce Momjian wrote:
> Tom Lane wrote:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > Tom Lane wrote:
> > >> Something that would probably be reasonable, and require *no* weird new
> > >> syntax, is to shortcut in a COPY into a table created in the current
> > >> transaction. I believe we still keep a flag in the relcache indicating
> > >> whether that's the case ...
> >
> > > So if the table is created in the current transaction, we don't log?
> >
> > Log, yes, unless it's a temp table. The point is we could avoid taking
> > buffer content locks. Come to think of it, we could implement that
> > trivially in the heapam.c routines; it would then apply to any table
> > update whether generated by COPY or otherwise.
>
> I am confused. This optimization prevents locking, but still requires
> logging?

It minimises *buffer* lwlocks, not data locks.

Best Regards, Simon Riggs
On Tue, 2005-11-29 at 14:56 -0500, Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> Something that would probably be reasonable, and require *no* weird new
> >> syntax, is to shortcut in a COPY into a table created in the current
> >> transaction. I believe we still keep a flag in the relcache indicating
> >> whether that's the case ...
>
> > So if the table is created in the current transaction, we don't log?
>
> Log, yes, unless it's a temp table. The point is we could avoid taking
> buffer content locks. Come to think of it, we could implement that
> trivially in the heapam.c routines; it would then apply to any table
> update whether generated by COPY or otherwise.

We should do that, yes; we've discussed that before. I'm happy to blat
that out.

...but I do still want to improve COPY in other ways.

> > Yes, I guess, but do we want to propagate that into pg_dump output? I
> > would think not.
>
> Exactly my point; we don't have to change any syntax, so pg_dump
> doesn't care.

Does pg_dump wrap the CREATE TABLE and COPY into the same transaction?

Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> Tom Lane wrote:
>> Log, yes, unless it's a temp table. The point is we could avoid taking
>> buffer content locks. Come to think of it, we could implement that
>> trivially in the heapam.c routines; it would then apply to any table
>> update whether generated by COPY or otherwise.

> We should do that, yes; we've discussed that before.

Not that I recall ... anyway, I had second thoughts about this while
eating lunch. A buffer for a new table can reasonably be assumed not to
be accessed by any other backend, but we can *not* make that assumption
for the bgwriter. This means that the bgwriter might come along and
dump the buffer to disk while we are in the midst of scribbling on it.
Since we'll mark the thing dirty after we finish scribbling, no harm
is done --- unless there is a crash after we commit and before the next
write of the buffer occurs. In that case we have a corrupt copy of the
buffer on disk, which is going to pose a problem for recovery. (As an
example, the corrupt copy might have an invalid LSN that is larger than
it really should be, which will cause us to ignore the WAL record that
might have otherwise let us fix the problem.)

As far as I can see at the moment, such a problem will be repaired
during recovery *as long as you have the full-page-writes option on*.
Otherwise not. So we'd probably have to restrict the optimization to
happen only if that's on, as well as only for new tables.

>> Exactly my point; we don't have to change any syntax, so pg_dump
>> doesn't care.

> Does pg_dump wrap the CREATE TABLE and COPY into the same transaction?

No, but you can manually put BEGIN and END around the whole script.
Some people do that anyway. (It wouldn't be a bad idea to add an option
to pg_restore to do it, since there's no very convenient way to add
begin/end manually when piping pg_restore right into a database.)

			regards, tom lane
On Mon, 2005-11-28 at 09:26 -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > I don't see why we couldn't have an additional index access method
> > entry point to insert multiple rows in one call.
>
> I think Simon was mainly on about the idea of inserting multiple *heap*
> entries with one call, ie, only one cycle of locking a heap buffer.
> It's not apparent to me that there'd be any noticeable win from
> batching index insertions, because even if the heap entries are all on
> one page, it's unlikely their index entries would be.

It might be a valid idea for monotonically increasing key indexes, but
I'll leave that idea alone for now. Tom's right: I was discussing heap
blocks only (for now).

> I'm a bit dubious about the whole thing from a concurrency standpoint,
> too: holding locks longer is not a great thing in general. The recent
> patch to avoid repeated locking cycles during a read seems OK, because
> (1) checking visibility of a tuple is ordinarily pretty cheap, and
> (2) it's only a shared lock, so other people can be doing the same
> thing concurrently. However, heap insertion can make neither of those
> claims. You're talking about an exclusive lock, and you're talking
> about holding it while copying data around and then making multiple WAL
> entries.

I wasn't talking about holding locks for any longer than normal. Here's
the rough code, in heapam.c:

    heap_bulk_insert(tuple, bool needsToasting)
    {
        /* if the tuple needs toasting, unpick the buffer */
        if (needsToasting)
        {
            for (i = 0; i < nTuplesBuffered; i++)
                heap_insert(tupleBuffer[i]);
            heap_insert(tuple);
            nTuplesBuffered = 0;
            sizeTupleBuffer = 0;
            return;
        }

        /*
         * If the buffer would overflow, write the whole tuple buffer in
         * one go to a freshly allocated shared buffer/table block, so
         * that we do only one buffer manager lock pair per block rather
         * than one per row.
         */
        tuple_length = length(tuple);
        if ((tuple_length + sizeTupleBuffer) > MAXDATAINBLOCK)
        {
            RelationGetBufferForTupleArray(nTuplesBuffered, tupleBuffer);
            tupleBuffer[0] = tuple;
            nTuplesBuffered = 1;
            sizeTupleBuffer = tuple_length;
        }
        else
        {
            tupleBuffer[nTuplesBuffered++] = tuple;
            sizeTupleBuffer += tuple_length;
        }
        return;
    }

(needsToasting would be set for a tuple during parsing by COPY, which
is fine since it does actually know how long columns are. The actual
decision to toast or not can be wrapped in a modular call to heapam.c.)

RelationGetBufferForTupleArray() in hio.c would be very similar to
RelationGetBufferForTuple(), but less complex because it never needs to
handle the UPDATE case.

If we do choose to write WAL records for these inserts, then we can
simply log the whole block, rather than making repeated individual
inserts into WAL buffers, so we save on WALInsertLock overhead too.
That's no more than we do after a checkpoint anyway, so no problems. If
the transaction fails, then we lose the buffer: so what? Slightly
better than writing the tuples to the table and then failing.

Most importantly, we don't actually allocate a new data block until we
decide to flush the tupleBuffer. So the exclusive lock is only held
momentarily while we insert all the rows from the tupleBuffer into the
fresh block.

There's no modularity weirdness, but we do need some code in COPY to
decide whether it should use fast mode or not.

Best Regards, Simon Riggs
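As a standalone illustration of the WALInsertLock point (not server
code), compare simulated lock-cycle counts for per-row records versus
one record covering the freshly filled block; the counter is a
stand-in for the real lock traffic:

    #include <stdio.h>

    static int wal_lock_cycles = 0;

    /* Simulated WAL insertion: one WALInsertLock cycle per record. */
    static void xlog_insert_sim(const char *what)
    {
        wal_lock_cycles++;
        printf("xlog: %s\n", what);
    }

    int main(void)
    {
        int ntuples = 80;   /* roughly the rows that fit on one page */

        /* per-row logging: one record (one lock cycle) per tuple */
        for (int i = 0; i < ntuples; i++)
            xlog_insert_sim("heap_insert record");
        printf("per-row: %d lock cycles\n", wal_lock_cycles);

        /* bulk mode: one record for the whole freshly filled block */
        wal_lock_cycles = 0;
        xlog_insert_sim("full block image");
        printf("bulk: %d lock cycle(s)\n", wal_lock_cycles);
        return 0;
    }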
On Tue, 2005-11-29 at 15:58 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Tom Lane wrote:
> >> Log, yes, unless it's a temp table. The point is we could avoid taking
> >> buffer content locks. Come to think of it, we could implement that
> >> trivially in the heapam.c routines; it would then apply to any table
> >> update whether generated by COPY or otherwise.
>
> > We should do that, yes; we've discussed that before.
>
> Not that I recall ...

It was a quick comment of yours only, not a long discussion. It's a
good idea and we should do it, but it's not everything I want (other
posts pursue the other ideas).

> anyway, I had second thoughts about this while
> eating lunch. A buffer for a new table can reasonably be assumed not to
> be accessed by any other backend, but we can *not* make that assumption
> for the bgwriter. This means that the bgwriter might come along and
> dump the buffer to disk while we are in the midst of scribbling on it.
> Since we'll mark the thing dirty after we finish scribbling, no harm
> is done --- unless there is a crash after we commit and before the next
> write of the buffer occurs. In that case we have a corrupt copy of the
> buffer on disk, which is going to pose a problem for recovery. (As an
> example, the corrupt copy might have an invalid LSN that is larger than
> it really should be, which will cause us to ignore the WAL record that
> might have otherwise let us fix the problem.)
>
> As far as I can see at the moment, such a problem will be repaired
> during recovery *as long as you have the full-page-writes option on*.
> Otherwise not. So we'd probably have to restrict the optimization to
> happen only if that's on, as well as only for new tables.

Not sure I understand that. If there's a crash then the transaction
failed, so any data on the partially written block is invisible. Why
else would the block become corrupt?

Whether full page writes is on or not, we still fill the block. After
which we never log the full page, because the block doesn't change
again. Should we checkpoint halfway through filling the block, then
it's still written to disk whether we do full page writes or not.

Maybe it's just late here...

> >> Exactly my point; we don't have to change any syntax, so pg_dump
> >> doesn't care.
>
> > Does pg_dump wrap the CREATE TABLE and COPY into the same transaction?
>
> No, but you can manually put BEGIN and END around the whole script.
> Some people do that anyway. (It wouldn't be a bad idea to add an option
> to pg_restore to do it, since there's no very convenient way to add
> begin/end manually when piping pg_restore right into a database.)

Can do.

Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> On Tue, 2005-11-29 at 15:58 -0500, Tom Lane wrote:
>> anyway, I had second thoughts about this while
>> eating lunch. A buffer for a new table can reasonably be assumed not to
>> be accessed by any other backend, but we can *not* make that assumption
>> for the bgwriter. This means that the bgwriter might come along and
>> dump the buffer to disk while we are in the midst of scribbling on it.
>> Since we'll mark the thing dirty after we finish scribbling, no harm
>> is done --- unless there is a crash after we commit and before the next
>> write of the buffer occurs.

> Not sure I understand that. If there's a crash then the transaction
> failed, so any data on the partially written block is invisible.

I said "after we commit". In any case, corrupt data on-disk could still
cause WAL recovery to fail: at the time we process a WAL record, we
have no way to know that the transaction that generated it will
ultimately not commit.

> Whether full page writes is on or not, we still fill the block. After
> which we never log the full page, because the block doesn't change
> again.

You seem to be confusing this idea with some of your own...

			regards, tom lane
Please let me back up and ask a more simplistic question. I understand
the idea of allowing COPY to insert rows with less locking, but I am
wondering about the NOLOGGING idea. On commit, we must guarantee that
all the rows are in the table, so what advantage is there to a
NOLOGGING option?

---------------------------------------------------------------------------

Simon Riggs wrote:
> On Tue, 2005-11-29 at 14:17 -0500, Bruce Momjian wrote:
> > Simon Riggs wrote:
> > > As a further enhancement, I would also return to the NOLOGGING option
> > > for COPY. Previously we had said that COPY LOCK was the way to go -
> > > taking a full table lock to prevent concurrent inserts to a block from a
> > > COPY that didn't write WAL and another backend which wanted to write WAL
> > > about that block. With the above suggested all-inserts-at-once
> > > optimization, it would no longer be a requirement to lock the table.
> > > That means we can continue to take advantage of the ability to run
> > > multiple COPY loads into the same table. Avoiding writing WAL will
> > > further reduce CPU by about 15% and I/O by about 50%.
> > >
> > > I would also suggest that pg_dump be changed to use the NOLOGGING option
> > > by default, with an option to work as previously.
> >
> > For those who have been around, they know I dislike having the behavior
> > that 95% of our users desire not be the default.
>
> I like that approach and it makes for a great product.
>
> > I think the COPY
> > NOLOGGING idea falls into that category. I would like to explore
> > whether there is a way to have COPY automatically do no logging where
> > possible by default.
>
> see last, below
>
> > First, I think NOLOGGING is probably the wrong keyword. I am thinking
> > SHARE/EXCLUSIVE is best because they are already keywords, and they
> > explain the effect of the flag on other applications, rather than the
> > LOGGING capability, which is invisible to applications.
> >
> > I am thinking we would have COPY WITH [ [ EXCLUSIVE | SHARE ] [ LOCK ] ] ...
> > An EXCLUSIVE lock would mean NOLOGGING; SHARE would do logging because
> > other applications could insert into the table at the same time (and do
> > UPDATEs/DELETEs of the inserted rows).
>
> I wasn't trying to suggest new wording for COPY LOCK, or whatever it
> was called previously. My suggestion was to alter the underlying
> mechanism so that it would not actually need to take a lock at all. So
> I was trying to find a name that matched that thought... so your
> comments above don't really apply... but I'm more than happy to have
> others suggest names and will go with the flow.
>
> > One idea for default behavior would be to use EXCLUSIVE when the table
> > is zero size. I think that would cover pg_dump and most of the user
> > cases, and of course users could override the default by using a
> > keyword. We could emit a NOTICE if an exclusive lock is used without an
> > EXCLUSIVE keyword. One problem I see is that there is no way to ensure
> > zero size without a lock that blocks other writers. Is that reliable?
>
> I think it would be possible to say: if, when a COPY starts, the table
> has zero blocks, then we do a NOLOGGING (change_the_name) mode
> operation.
>
> That gives a problem for recovery. If you lose the server and recover
> with PITR, that table would come back empty. I'd want to make very sure
> that I knew whether it was going to be there, or not, and I want it to
> be my choice. So, I agree with the thought from an automatic
> performance perspective, but not from a recovery perspective.
>
> (There is some irony in being the person to both advocate PITR and to
> try so hard not to log...)
>
> I think of PITR as the default mode of operation now; almost all
> Oracle, DB2, SQL Server etc. users make consistent use of archive log
> mode/archive recovery.
>
> Best Regards, Simon Riggs

--
  Bruce Momjian                     |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us            |  (610) 359-1001
  +  If your life is a hard drive,  |  13 Roberts Road
  +  Christ can be your backup.     |  Newtown Square, Pennsylvania 19073
On Tue, 2005-11-29 at 18:51 -0500, Bruce Momjian wrote:
> Please let me back up and ask a more simplistic question. I understand
> the idea of allowing COPY to insert rows with less locking, but I am
> wondering about the NOLOGGING idea. On commit, we must guarantee that
> all the rows are in the table, so what advantage is there to a
> NOLOGGING option?

We would need to flush all the blocks in the table out of cache at
commit time, for that table only. (As with CTAS, CIDX.)

To allow a full discussion, I'll separate the various ideas:

1. COPY using bulk copy mode
2. NOLOGGING
3. Created in this transaction
4. ERRORTABLES
5. Uniqueness violations

Right now, I think you have reasonable objections/input to (2) that we
should discuss more before agreeing a way forward. I would aim to do
(1) first, then return with a full and much better explained proposal
for (2) for us to discuss, since (2) depends upon (1) somewhat.

(3) and (4) seem to have been generally accepted, but (5) seems not
viable with present thinking.

Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> [...context omitted...]
> We would need to flush all the blocks in the table out of cache at
> commit time, for that table only.

This seems strikingly close to the Old Concept of temp tables, which we
got rid of for good-and-sufficient reasons. You might want to spend
some time reviewing the archives before proposing that we reintroduce
it.

			regards, tom lane
Simon Riggs wrote:
> On Tue, 2005-11-29 at 18:51 -0500, Bruce Momjian wrote:
> > Please let me back up and ask a more simplistic question. I understand
> > the idea of allowing COPY to insert rows with less locking, but I am
> > wondering about the NOLOGGING idea. On commit, we must guarantee that
> > all the rows are in the table, so what advantage is there to a
> > NOLOGGING option?
>
> We would need to flush all the blocks in the table out of cache at
> commit time, for that table only. (As with CTAS, CIDX.)
>
> To allow a full discussion, I'll separate the various ideas:
>
> 1. COPY using bulk copy mode

What is "bulk copy mode"? Full page loading?

> 2. NOLOGGING

Means flush/fsync table pages on commit.

> 3. Created in this transaction

Reduces locking?

> 4. ERRORTABLES
> 5. Uniqueness violations
>
> Right now, I think you have reasonable objections/input to (2) that we
> should discuss more before agreeing a way forward. I would aim to do
> (1) first, then return with a full and much better explained proposal
> for (2) for us to discuss, since (2) depends upon (1) somewhat.
>
> (3) and (4) seem to have been generally accepted, but (5) seems not
> viable with present thinking.

--
  Bruce Momjian                     |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us            |  (610) 359-1001
  +  If your life is a hard drive,  |  13 Roberts Road
  +  Christ can be your backup.     |  Newtown Square, Pennsylvania 19073
On Tue, Nov 29, 2005 at 01:35:09PM +0100, Harald Fuchs wrote:
> In article <20051129004435.GR78939@pervasive.com>,
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
>
> > On Sun, Nov 27, 2005 at 07:44:55PM +0000, Simon Riggs wrote:
> >> not have any unique indexes or row triggers. It should be possible to
> >> take advantage of this automatically when those requirements are met,
> >> without any new options. Just as it was with Seq Scans, this is worth
> >> about a 10% reduction in CPU for a COPY FROM.
>
> > <snip>
>
> >> FSM access would need to change slightly to allow whole-block-only
> >> requests to be made for heaps, without damaging the average row length
> >> calculation. It might be simpler to ignore the FSM entirely?
>
> > Does that mean that this fast copy would end up not re-using space on
> > pages that have space available? ISTM that's something users would want
> > to be able to override. In fact, it seems like it shouldn't be a
> > default behavior...
>
> Why not? If you later do INSERT/UPDATE/DELETE, space will get reused
> anyway, and if you don't (i.e. a one-time bulk load on a constant
> table), you should afterwards do a VACUUM FULL ANALYZE anyway.

Because the proposal is to fill entirely new pages from COPY, which
pretty much makes the FSM useless, since you could only use pages in
the FSM if they were entirely empty.

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
On Wed, 2005-11-30 at 02:10 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > [...context omitted...]
> > We would need to flush all the blocks in the table out of cache at
> > commit time, for that table only.
>
> This seems strikingly close to the Old Concept of temp tables, which we
> got rid of for good-and-sufficient reasons. You might want to spend
> some time reviewing the archives before proposing that we reintroduce
> it.

My suggestion is that I use the same mechanism as was introduced for
8.1 with the CREATE TABLE AS SELECT statement.

Best Regards, Simon Riggs