Re: Inserting heap tuples in bulk in COPY - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Inserting heap tuples in bulk in COPY
Date
Msg-id CA+U5nMJ_fHuwFLDCDvCBJt1jsVw4oenn46zGwaXt0-FEQj85Jw@mail.gmail.com
In response to Re: Inserting heap tuples in bulk in COPY  (Jeff Janes <jeff.janes@gmail.com>)
Responses Re: Inserting heap tuples in bulk in COPY
Re: Inserting heap tuples in bulk in COPY
List pgsql-hackers
On 7 August 2012 20:58, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Fri, Aug 12, 2011 at 2:59 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> On 13.08.2011 00:17, Simon Riggs wrote:
>>>
>>> Also, we discussed that you would work on buffering the index inserts,
>>> which is where the main problem lies. The main heap is only a small
>>> part of the overhead if we have multiple indexes already built on a
>>> table - which is the use case that causes the most problem.
>>
>>
>> Sure, if you have indexes on the table already, then the overhead of
>> updating them is more significant. I am actually working on that too, I will
>> make a separate post about that.
>
> Hi Heikki,
>
> Is the bulk index insert still an active area for you?
>
> If not, is there some kind of summary of design or analysis work
> already done, which would be a useful for someone else interested in
> picking it up?

The main cost comes from repeatedly re-seeking down the index tree to
find the insertion point, but repeated lock and pin operations on
index buffers are also expensive. That is optimisable if the index
inserts are grouped, as they will be with monotonic indexes (e.g.
SERIAL), or with partially monotonic keys (i.e. with a Parent/Child
relationship, many inserts into the Child table will be of the form
(x,1), (x,2), (x,3) etc., e.g. Order/OrderLine).

All we need do is buffer the index inserts before inserting them into
the main index. As long as we flush the buffer before end of
transaction, we're good.

Incidentally, we can also optimise repeated inserts within a normal
transaction using this method, by implementing deferred unique
constraints. At present we say that unique constraints aren't
deferrable, but there's no reason they can't be, if we allow buffering
in the index. (Implementing a deferred constraint that won't fail if
we do UPDATE foo SET pk = pk+1 requires a buffer the size of foo,
which is probably a bad plan, plus you'd need to sort the inputs, so
that particular nut is another issue altogether, AFAICS).

Suggested implementation is to buffer index tuples at the generic
index API, then implement a bulk insert index API call that can then
be implemented for each AM separately. Suggested buffer size is
work_mem. Note we must extract index tuples from heap tuples -
refetching heap blocks to get rows is too costly.

I think we need to implement buffering both to end of statement and to
end of transaction, not just one or the other.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

