Re: Bulk Inserts - Mailing list pgsql-hackers

From Pierre Frédéric Caillaud
Subject Re: Bulk Inserts
Date
Msg-id op.u0ae1rg8cke6l8@soyouz
In response to Re: Bulk Inserts  (Jeff Janes <jeff.janes@gmail.com>)
Responses Re: Bulk Inserts
List pgsql-hackers
> Does that heuristic change the timings much?  If not, it seems like it
> would be better to keep it simple and always do the same thing, like log
> the tuples (if it is done under one WALInsertLock, which I am assuming it
> is..)

It is the logging of whole pages that makes it faster.
If you fill a page with tuples in one operation (while holding exclusive  
lock) and then insert WAL records for each tuple, there is no speed gain.

Inserting a full-page WAL record (since you just filled the page
completely):

- only takes WALInsertLock once instead of once per tuple
- reduces WAL traffic
- is about 2x faster in my benchmark

And inserting a "clear new page" record (if the page was previously
new/empty and the relation is fsync'd at the end; both variants are
sketched below):

- only takes WALInsertLock once instead of once per tuple
- reduces WAL traffic a lot
- is about 4x faster in my benchmark
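
To make the two variants concrete, here is a minimal sketch, assuming
8.4-era backend APIs. fill_page_with_tuples() and log_clear_new_page() are
hypothetical placeholders; log_newpage(), XLOG_HEAP_NEWPAGE, and the buffer
calls are the real existing bits:

#include "postgres.h"
#include "access/heapam.h"
#include "storage/bufmgr.h"

/* buf is pinned and exclusively locked; tuples[] is the batch to insert */
static void
bulk_insert_one_page(Relation rel, Buffer buf,
                     HeapTuple *tuples, int ntuples,
                     bool page_was_new, bool fsync_rel_at_end)
{
    Page    page = BufferGetPage(buf);

    /* Fill the whole page under the lock, with no per-tuple WAL. */
    fill_page_with_tuples(rel, page, tuples, ntuples);  /* hypothetical */
    MarkBufferDirty(buf);

    if (page_was_new && fsync_rel_at_end)
    {
        /*
         * "Clear new page" record: names only the block, so WAL traffic
         * is tiny; safe because the relation is fsync'd before commit.
         */
        log_clear_new_page(rel, BufferGetBlockNumber(buf)); /* hypothetical */
    }
    else
    {
        /*
         * Full-page image (XLOG_HEAP_NEWPAGE style): one WAL record and
         * one WALInsertLock acquisition for the whole page.
         */
        log_newpage(&rel->rd_node, MAIN_FORKNUM,
                    BufferGetBlockNumber(buf), page);
    }
}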

> Do you even need the new empty page record?  I think a zero page will be
> handled correctly next time it is read into shared buffers, won't it?

I have no idea ;)

> But I
> guess it is needed to avoid problems with partial page writes that would
> leave it in a state that is neither all zeros nor consistent.

Plus, empty page records make for very small WAL traffic and I didn't see  
any performance difference with or without them.

> If the entire page is logged, would it have to be marked as not removable
> by the log compression tool?  Or can the tool recreate the needed delta?

No, the tool cannot recreate the data, since the idea is precisely to  
replace a lot of "tuple insert" messages with one "entire page" message,  
which takes both less space and less time. The warm-standby replicators  
that get this WAL need to know the page contents to replicate it... (also,  
it will probably be faster for them to redo a page write than redo all the  
tuple inserts).

Here is what I'm thinking about now:

* have some kind of BulkInsertState which contains:
- info about the relation, indexes, triggers, etc.
- a tuple queue

The tuple queue may be a tuplestore, or simply tuple copies in a local
memory context.

You'd have functions to (see the sketch after this list):

- Set up the BulkInsertState
- Add a tuple to the BulkInsertState
- Finish the operation and clear the BulkInsertState
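
A minimal sketch of what that interface could look like; every name here
is hypothetical (and would clash with the existing heapam BulkInsertState,
which is a much smaller buffer-access hint, so it would need renaming):

typedef struct BulkInsertQueueData
{
    Relation        rel;            /* target heap relation */
    ResultRelInfo  *resultRelInfo;  /* indexes, triggers, ... */
    MemoryContext   queue_cxt;      /* owns the queued tuple copies */
    HeapTuple      *queue;          /* tuples awaiting a bulk write */
    int             nqueued;
    int             maxqueued;      /* flush threshold, e.g. a page's worth */
} BulkInsertQueueData;

typedef BulkInsertQueueData *BulkInsertQueue;

extern BulkInsertQueue BeginBulkInsert(Relation rel);
extern void BulkInsertTuple(BulkInsertQueue state, HeapTuple tup);
extern void EndBulkInsert(BulkInsertQueue state);   /* flushes leftovers */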

When adding a tuple, it is stored in the queue. When the queue is full, a
bulk insert operation takes place (hopefully filling an entire page) and
returns. Post-insert triggers and index updates are also handled at this
point.

When finished, the function that clears the state also inserts all  
remaining tuples in the queue.
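
Continuing the hypothetical sketch above, the add path might look like
this, where FlushBulkInsert() would do the fill-a-page-and-log-it work from
the earlier sketch and then run the batched index updates and after-row
triggers (heap_copytuple() and the memory context calls are real APIs):

void
BulkInsertTuple(BulkInsertQueue state, HeapTuple tup)
{
    MemoryContext old = MemoryContextSwitchTo(state->queue_cxt);

    /* Copy: the caller's tuple may live in a short-lived context. */
    state->queue[state->nqueued++] = heap_copytuple(tup);
    MemoryContextSwitchTo(old);

    if (state->nqueued == state->maxqueued)
    {
        /* Write the whole batch, ideally filling entire pages. */
        FlushBulkInsert(state);             /* hypothetical */
        state->nqueued = 0;
        MemoryContextReset(state->queue_cxt);
    }
}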

With this you could also do something *really* interesting: bulk index
updates...

Bulk index updates are probably mutually exclusive with after-row triggers  
though.

Another angle of attack would be to make WAL writing more efficient...
