Thread: BufferAccessStrategy for bulk insert

BufferAccessStrategy for bulk insert

From: "Robert Haas"
I'm taking a look at doing the refactoring Tom Lane and Simon Riggs
discussed here:

http://archives.postgresql.org/pgsql-patches/2008-02/msg00155.php

In terms of the buffer manager, I think we can simply introduce a new
strategy type BAS_BULKWRITE and make it behave identically to
BAS_VACUUM.  Anyone see a reason to do anything else?
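
i.e., GetAccessStrategy() in freelist.c would just grow one more case
reusing the vacuum ring size - a sketch (I'm quoting the 256kB figure
from memory):

    case BAS_VACUUM:
    case BAS_BULKWRITE:
        /* keep the ring small so a bulk load can't swamp the cache */
        ring_size = 256 * 1024 / BLCKSZ;
        break;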

The trickier part is to handle the communication between CopyFrom (or
the CTAS machinery), heap_insert, and RelationGetBufferForTuple.
There are basically three things we need to keep track of here:

(1) a BufferAccessStrategy (that is, the ring of buffers we're using
for this bulk insert)
(2) the last-pinned page (to implement Simon Riggs's proposed
optimization of keeping the most-recently-written page pinned)
(3) use_wal and use_fsm (to implement Tom Lane's suggestion of
reducing the number of options to heap_insert by rolling everything
into an options object)
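
Concretely, (1) and (2) could travel together in a small struct, with
(3) handled by the flags proposed below.  A sketch (field names
invented):

typedef struct BulkInsertStateData
{
    BufferAccessStrategy strategy;   /* (1) the ring of buffers */
    Buffer               last_pin;   /* (2) most-recently-written page,
                                      * or InvalidBuffer if none */
} BulkInsertState;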

Tom's email seemed to suggest that we might want to roll everything
into the BufferAccessStrategy itself, but that seems to require
exposing the internals of BufferAccessStrategy to quite a few places
that currently don't know about them, so I think that's a bad idea.  I
am kind of inclined to define flags like this:

#define HEAP_INSERT_SKIP_WAL 0x0001
#define HEAP_INSERT_SKIP_FSM 0x0002
#define HEAP_INSERT_BULK 0x0004 /* do we even need this one? */

And then:

Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid,
unsigned options, BulkInsertState *bistate);
BulkInsertState *GetBulkInsertState(void);
void FreeBulkInsertState(BulkInsertState *);
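
To make the intended calling pattern concrete, here's a rough sketch
of how CopyFrom would drive this (rel_is_new and read_next_tuple are
invented placeholders):

    BulkInsertState *bistate = GetBulkInsertState();
    HeapTuple   tup;
    int         options = 0;

    /* e.g. when the target table was created in this transaction */
    if (rel_is_new)
        options |= HEAP_INSERT_SKIP_WAL | HEAP_INSERT_SKIP_FSM;

    while ((tup = read_next_tuple()) != NULL)
        heap_insert(relation, tup, mycid, options, bistate);

    FreeBulkInsertState(bistate);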

I'm always wary of reversing the sense of a boolean, but I think it
makes sense here; it doesn't really matter whether you call
heap_insert(relation, tup, cid, true, true) or heap_insert(relation,
tup, cid, false, false), but heap_insert(relation, tup, cid,
HEAP_INSERT_USE_WAL|HEAP_INSERT_USE_FSM, NULL) is a lot uglier than
heap_insert(relation, tup, cid, 0, NULL), and there aren't that many
places that need to be checked for correctness in making the change.

Admittedly, we could make the calling sequence for heap_insert shorter
by putting the options (and maybe even the CommandId) into
BulkInsertState and calling it HeapInsertOptions, but that forces
several callers of heap_insert who don't care at all about bulk
inserts to uselessly create and destroy a HeapInsertOptions object
just to pass a couple of boolean flags (and maybe the CommandId),
which seems like a loser.

Comments?

...Robert


Re: BufferAccessStrategy for bulk insert

From: Tom Lane
"Robert Haas" <robertmhaas@gmail.com> writes:
> I am kind of inclined to define flags like this:

> #define HEAP_INSERT_SKIP_WAL 0x0001
> #define HEAP_INSERT_SKIP_FSM 0x0002
> #define HEAP_INSERT_BULK 0x0004 /* do we even need this one? */

> And then:

> Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid,
> unsigned options, BulkInsertState *bistate);
> BulkInsertState *GetBulkInsertState(void);
> void FreeBulkInsertState(BulkInsertState *);

Seems sane to me.  I don't see the point of the HEAP_INSERT_BULK flag
bit --- providing or not providing bistate would cover that, and if
you have a bit as well then you have to define what the inconsistent
combinations mean.  I concur with making all-zeroes be the typical
state of the flag bits, too.

FWIW, we generally declare bitmask flag variables as int, unless
there's some really good reason to do otherwise.
        regards, tom lane


Re: BufferAccessStrategy for bulk insert

From: "Robert Haas"
> Seems sane to me.  I don't see the point of the HEAP_INSERT_BULK flag
> bit --- providing or not providing bistate would cover that, and if
> you have a bit as well then you have to define what the inconsistent
> combinations mean.  I concur with making all-zeroes be the typical
> state of the flag bits, too.

Thanks for the design review.  I had thought to make the inconsistent
combinations fail an assertion, but I'm just as happy to leave it out
altogether.

> FWIW, we generally declare bitmask flag variables as int, unless
> there's some really good reason to do otherwise.

OK, thanks for the tip.

...Robert


Re: BufferAccessStrategy for bulk insert

From: "Robert Haas"
And here's the patch, which, based on comments thus far, does the following:

- Replaces the use_wal, use_fsm arguments in various places with a
single options argument.
- Creates a BAS_BULKWRITE buffer access strategy.
- Creates a BulkInsertState object so that COPY and CTAS can use
BAS_BULKWRITE and also keep the most recent page pinned.

Note that the original purpose of this exercise was to implement the
optimization that COPY and CTAS would keep the most recent page pinned
to avoid repeated pin/unpin cycles.  This change shows a small but
measurable performance improvement on short rows.  The remaining items
were added based on reviewer comments.

One concern that I have about this approach is that the situation in
which people are probably most concerned about COPY performance is
restoring a dump.  In that case, the COPY will be the only thing
running, and using a BufferAccessStrategy is an anti-optimization.  I
don't think it's a very big effect (any testing anyone can do on real
hardware rather than what I have would be appreciated), but I'm somewhat
unsold on optimizing for what I believe to be the less-common use
case.  If the consensus is to reverse course on this point I'm happy
to rip those changes back out and resubmit; they are a relatively
small proportion of the patch.

...Robert

On Sun, Oct 26, 2008 at 8:37 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Seems sane to me.  I don't see the point of the HEAP_INSERT_BULK flag
>> bit --- providing or not providing bistate would cover that, and if
>> you have a bit as well then you have to define what the inconsistent
>> combinations mean.  I concur with making all-zeroes be the typical
>> state of the flag bits, too.
>
> Thanks for the design review.  I had thought to make the inconsistent
> combinations fail an assertion, but I'm just as happy to leave it out
> altogether.
>
>> FWIW, we generally declare bitmask flag variables as int, unless
>> there's some really good reason to do otherwise.
>
> OK, thanks for the tip.
>
> ...Robert
>

Attachment

Re: BufferAccessStrategy for bulk insert

From: Simon Riggs
On Tue, 2008-10-28 at 23:45 -0400, Robert Haas wrote:

> One concern that I have about this approach is that the situation in
> which people are probably most concerned about COPY performance is
> restoring a dump.  In that case, the COPY will be the only thing
> running, and using a BufferAccessStrategy is an anti-optimization.  I
> don't think it's a very big effect (any testing anyone can do on real
> hardware rather than what I have would be appreciated), but I'm somewhat
> unsold on optimizing for what I believe to be the less-common use
> case.  If the consensus is to reverse course on this point I'm happy
> to rip those changes back out and resubmit; they are a relatively
> small proportion of the patch.

Having COPY use a BAS is mainly to ensure it doesn't swamp the cache,
which is a gain in itself.

If you say it's a loss, you should publish timings to support that.  Using
a BAS for VACUUM was a performance gain, not a loss.

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: BufferAccessStrategy for bulk insert

From: "Robert Haas"
> If you say it's a loss, you should publish timings to support that.  Using
> a BAS for VACUUM was a performance gain, not a loss.

Well, I can dig up and publish the timings from my laptop, but I'm not
sure where that will get us.  Trust me, the numbers were higher with
BAS, otherwise I wouldn't be worrying about this.  But I pretty much
doubt anyone cares how my laptop runs PostgreSQL anyway, which is why
I think someone should test this on good hardware and see what happens
there.  The only change I made to disable the BAS was a one-line
change in GetBulkInsertState to replace BAS_BULKWRITE with BAS_NORMAL,
so it should be easy for someone to try it both ways.
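
For reference, the whole function is only about this much code
(sketched from memory, so the details may be slightly off):

BulkInsertState *
GetBulkInsertState(void)
{
    BulkInsertState *bistate;

    bistate = (BulkInsertState *) palloc(sizeof(BulkInsertState));
    /* the one-line toggle: use BAS_NORMAL here to disable the ring */
    bistate->strategy = GetAccessStrategy(BAS_BULKWRITE);
    bistate->last_pin = InvalidBuffer;
    return bistate;
}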

Not at any point in the development of this patch was I able to match
the 15-17% COPY speedup and 20% CTAS speedup that you cited in your
original email.  I did get speedups, but they were considerably
smaller.  So either my testing methodology is different, or my
hardware is different, or there is something wrong with my patch.  I
don't think we're going to find out which it is until someone other
than me looks at this.

In any event, VACUUM is a read-write workload, and specifically, it
tends to write pages that have been written by other writers, and are
therefore potentially already in shared buffers.  COPY and CTAS are
basically write-only workloads, though with COPY on an existing table
the FSM might guide you to free space on a page already in shared
buffers, or you might find an index page you need there.  Still, if
you are doing a large bulk data load, those effects are probably
pretty small.  So, the profile is somewhat different.

I'm not really trying to argue that the BAS is a bad idea, but it is
certainly true that I do not have the data to prove that it is a good
idea.

...Robert


Re: BufferAccessStrategy for bulk insert

From: Simon Riggs
On Wed, 2008-10-29 at 21:58 -0400, Robert Haas wrote:
> > If you say it's a loss, you should publish timings to support that.  Using
> > a BAS for VACUUM was a performance gain, not a loss.
> 
> Well, I can dig up and publish the timings from my laptop, but I'm not
> sure where that will get us.  Trust me, the numbers were higher with
> BAS, otherwise I wouldn't be worrying about this.  But I pretty much
> doubt anyone cares how my laptop runs PostgreSQL anyway, which is why
> I think someone should test this on good hardware and see what happens
> there.  The only change I made to disable the BAS was a one-line
> change in GetBulkInsertState to replace BAS_BULKWRITE with BAS_NORMAL,
> so it should be easy for someone to try it both ways.
> 
> Not at any point in the development of this patch was I able to match
> the 15-17% COPY speedup and 20% CTAS speedup that you cited in your
> original email.  I did get speedups, but they were considerably
> smaller.  So either my testing methodology is different, or my
> hardware is different, or there is something wrong with my patch.  I
> don't think we're going to find out which it is until someone other
> than me looks at this.
> 
> In any event, VACUUM is a read-write workload, and specifically, it
> tends to write pages that have been written by other writers, and are
> therefore potentially already in shared buffers.  COPY and CTAS are
> basically write-only workloads, though with COPY on an existing table
> the FSM might guide you to free space on a page already in shared
> buffers, or you might find an index page you need there.  Still, if
> you are doing a large bulk data load, those effects are probably
> pretty small.  So, the profile is somewhat different.
> 
> I'm not really trying to argue that the BAS is a bad idea, but it is
> certainly true that I do not have the data to prove that it is a good
> idea.

You should try profiling the patch. You can count the invocations of the
buffer access routines to check it's all working in the right ratios.

Whatever timings you have are worth publishing. 

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: BufferAccessStrategy for bulk insert

From: "Robert Haas"
> You should try profiling the patch. You can count the invocations of the
> buffer access routines to check it's all working in the right ratios.

*goes and learns how to profile PostgreSQL*

OK, that was a good suggestion.  It looks like part of my problem here
is that I didn't put the CREATE TABLE and the COPY into the same
transaction.  As a result, a lot of time was spent on XLogInsert.
Modified the test case, new profiling results attached.

...Robert

Attachment

Re: BufferAccessStrategy for bulk insert

From: "Robert Haas"
> Whatever timings you have are worth publishing.

Here are the timings for copying the first ten million integers into a
one-column table created in the same transaction, with and without the
patch.  As you can see, now that I've corrected my previous error of
not putting CREATE TABLE and COPY in the same transaction, the savings
are quite substantial, about 15%.  Nice!

Trunk:
Time: 18931.516 ms
Time: 18251.732 ms
Time: 17284.274 ms
Time: 15900.131 ms
Time: 16439.617 ms

Patch:
Time: 14852.123 ms
Time: 15673.759 ms
Time: 15776.450 ms
Time: 14160.266 ms
Time: 13374.243 ms

...Robert


Re: BufferAccessStrategy for bulk insert

From: Simon Riggs
On Thu, 2008-10-30 at 23:05 -0400, Robert Haas wrote:
> > Whatever timings you have are worth publishing.
> 
> Here are the timings for copying the first ten million integers into a
> one-column table created in the same transaction, with and without the
> patch.  As you can see, now that I've corrected my previous error of
> not putting CREATE TABLE and COPY in the same transaction, the savings
> are quite substantial, about 15%.  Nice!

I had faith. ;-)

Can you test whether using the buffer access strategy is a win or a
loss? Most of that gain is probably coming from the reduction in
pinning.

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: BufferAccessStrategy for bulk insert

From: Simon Riggs
On Thu, 2008-10-30 at 22:46 -0400, Robert Haas wrote:
> > You should try profiling the patch. You can count the invocations of the
> > buffer access routines to check it's all working in the right ratios.
> 
> *goes and learns how to profile PostgreSQL*
> 
> OK, that was a good suggestion.  It looks like part of my problem here
> is that I didn't put the CREATE TABLE and the COPY into the same
> transaction.  As a result, a lot of time was spent on XLogInsert.
> Modified the test case, new profiling results attached.

The CPU time in XLogInsert can be confusing. The WAL writes can make
COPY I/O bound and so any savings on CPU may have been masked in the
earlier tests.

Patched profile shows we can still save a further 20% by writing data
block-at-a-time. That's more complex because we'd need to buffer the
index inserts also, or it would optimise only for the no-index (initial
load) case. So I think this is definitely enough for this release.

Using the buffer access strategy is going to be a big win for people
running large data loads in production and it will also help with people
running parallel load tasks (e.g. Dimitri's pg_loader). That effect is
more subtle and harder to measure, but it's an important consideration.

Thanks very much for finishing the patch in time for commitfest.

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: BufferAccessStrategy for bulk insert

From: "Robert Haas"
> Can you test whether using the buffer access strategy is a win or a
> loss? Most of that gain is probably coming from the reduction in
> pinning.

Patch resnapped to HEAD, with straightforward adjustments to
compensate for Heikki's changes to the ReadBuffer interface.  See
attached.

New testing results, now with and without BAS:

--TRUNK--
Time: 17945.523 ms
Time: 18682.172 ms
Time: 17047.841 ms
Time: 16344.442 ms
Time: 18727.417 ms

--PATCHED--
Time: 13323.772 ms
Time: 13869.724 ms
Time: 14043.666 ms
Time: 13934.132 ms
Time: 13193.702 ms

--PATCHED with BAS disabled--
Time: 14460.432 ms
Time: 14745.206 ms
Time: 14345.973 ms
Time: 14601.448 ms
Time: 16535.167 ms

I'm not sure why the BAS seemed to be slowing things down before.
Maybe it's different if we're copying into a pre-existing table, so
that WAL is enabled?  Or it could have just been a fluke - the numbers
were close.  I'll try to run some additional tests if time permits.

...Robert

Attachment

Re: BufferAccessStrategy for bulk insert

From: Simon Riggs
On Sat, 2008-11-01 at 13:23 -0400, Robert Haas wrote:
> > Can you test whether using the buffer access strategy is a win or a
> > loss? Most of that gain is probably coming from the reduction in
> > pinning.
> 
> --PATCHED--
> Time: 13869.724 ms (median)

> --PATCHED with BAS disabled--
> Time: 14460.432 ms (median with outlier removed)

That seems a conclusive argument in favour: a small additional
performance gain, plus generally beneficial behaviour for concurrent
loads.

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: BufferAccessStrategy for bulk insert

From: Tom Lane
"Robert Haas" <robertmhaas@gmail.com> writes:
> Patch resnapped to HEAD, with straightforward adjustments to
> compensate for Heikki's changes to the ReadBuffer interface.  See
> attached.

I looked this over a bit.  A couple of suggestions:

1. You could probably simplify life a bit by treating the
BulkInsertState as having an *extra* pin on the buffer, ie, do
IncrBufferRefCount when saving a buffer reference in BulkInsertState and
ReleaseBuffer when removing one.  Changing a buffer's local pin count
from 1 to 2 or back again is quite cheap, so you wouldn't need to
special-case things to avoid the existing pin and release operations.
For instance this diff hunk goes away:

***************
*** 1963,1969 ****
      END_CRIT_SECTION();
 
!     UnlockReleaseBuffer(buffer);
 
      /*
       * If tuple is cachable, mark it for invalidation from the caches in case
--- 1987,1996 ----
      END_CRIT_SECTION();
 
!     /* Release the lock, but keep the buffer pinned if doing bulk insert. */
!     LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
!     if (!bistate)
!         ReleaseBuffer(buffer);
 
      /*
       * If tuple is cachable, mark it for invalidation from the caches in case
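
Then saving a buffer reference in the bistate is just this (a sketch;
the helper name is invented):

static void
bistate_set_buffer(BulkInsertState *bistate, Buffer buffer)
{
    /* swap the bistate's own pin from the old buffer to the new one */
    if (bistate->last_pin != InvalidBuffer)
        ReleaseBuffer(bistate->last_pin);
    IncrBufferRefCount(buffer);
    bistate->last_pin = buffer;
}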


2. The logic changes in RelationGetBufferForTuple seem bizarre and
overcomplicated.  ISTM that the buffer saved by the bistate ought to
be about equivalent to relation->rd_targblock, ie, it's your first
trial location and also a place to save the located buffer on the way
out.  I'd suggest tossing that part of the patch and starting over.
        regards, tom lane


Re: BufferAccessStrategy for bulk insert

From: "Robert Haas"
> 2. The logic changes in RelationGetBufferForTuple seem bizarre and
> overcomplicated.  ISTM that the buffer saved by the bistate ought to
> be about equivalent to relation->rd_targblock, ie, it's your first
> trial location and also a place to save the located buffer on the way
> out.  I'd suggest tossing that part of the patch and starting over.

Hmm, would that be safe in the presence of concurrent or recursive
bulk inserts into the same relation?

...Robert


Re: BufferAccessStrategy for bulk insert

From: Tom Lane
"Robert Haas" <robertmhaas@gmail.com> writes:
>> 2. The logic changes in RelationGetBufferForTuple seem bizarre and
>> overcomplicated.  ISTM that the buffer saved by the bistate ought to
>> be about equivalent to relation->rd_targblock, ie, it's your first
>> trial location and also a place to save the located buffer on the way
>> out.  I'd suggest tossing that part of the patch and starting over.

> Hmm, would that be safe in the presence of concurrent or recursive
> bulk inserts into the same relation?

As safe as it is now --- you're relying on the bistate to carry the
query-local state.  Probably the best design is to just ignore
rd_targblock when a bistate is provided, and use the bistate's buffer
instead.
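
That is, something like this (sketch):

    /* in RelationGetBufferForTuple: choose the first trial block */
    if (bistate && bistate->last_pin != InvalidBuffer)
        targetBlock = BufferGetBlockNumber(bistate->last_pin);
    else
        targetBlock = relation->rd_targblock;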
        regards, tom lane


Re: BufferAccessStrategy for bulk insert

From: "Robert Haas"
OK, here's an updated version...

1. Use IncrBufferRefCount() so that we can do unconditional
ReleaseBuffer calls elsewhere.  I'm not sure this is really any simpler,
and although IncrBufferRefCount() is pretty cheap, it's certainly not
as cheap as a NULL pointer test.

2. Consolidate a bunch of logic into a new function
RelationReadBuffer (sketched below).  This simplifies the logic in
RelationGetBufferForTuple() considerably.

3. Make RelationGetBufferForTuple ignore relation->rd_targblock in
favor of bistate->last_pin whenever possible.  Changing this to also
not bother setting relation->rd_targblock didn't seem worthwhile, so I
didn't.
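
For the archives, RelationReadBuffer boils down to this (a condensed
sketch; see the attachment for the real thing):

static Buffer
RelationReadBuffer(Relation relation, BlockNumber targetBlock,
                   BulkInsertState *bistate)
{
    Buffer      buffer;

    /* without a bistate, this is just ReadBuffer */
    if (!bistate)
        return ReadBuffer(relation, targetBlock);

    /* re-use the pinned buffer if it's the block we want */
    if (bistate->last_pin != InvalidBuffer)
    {
        if (BufferGetBlockNumber(bistate->last_pin) == targetBlock)
        {
            IncrBufferRefCount(bistate->last_pin);
            return bistate->last_pin;
        }
        ReleaseBuffer(bistate->last_pin);
        bistate->last_pin = InvalidBuffer;
    }

    /* otherwise read through the bulk-insert ring */
    buffer = ReadBufferExtended(relation, MAIN_FORKNUM, targetBlock,
                                RBM_NORMAL, bistate->strategy);
    IncrBufferRefCount(buffer);
    bistate->last_pin = buffer;
    return buffer;
}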

...Robert

On Tue, Nov 4, 2008 at 4:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Robert Haas" <robertmhaas@gmail.com> writes:
>>> 2. The logic changes in RelationGetBufferForTuple seem bizarre and
>>> overcomplicated.  ISTM that the buffer saved by the bistate ought to
>>> be about equivalent to relation->rd_targblock, ie, it's your first
>>> trial location and also a place to save the located buffer on the way
>>> out.  I'd suggest tossing that part of the patch and starting over.
>
>> Hmm, would that be safe in the presence of concurrent or recursive
>> bulk inserts into the same relation?
>
> As safe as it is now --- you're relying on the bistate to carry the
> query-local state.  Probably the best design is to just ignore
> rd_targblock when a bistate is provided, and use the bistate's buffer
> instead.
>
>                        regards, tom lane
>

Attachment

Re: BufferAccessStrategy for bulk insert

From: Tom Lane
"Robert Haas" <robertmhaas@gmail.com> writes:
> OK, here's an updated version...

Applied with some small stylistic revisions.
        regards, tom lane