Re: should INSERT SELECT use a BulkInsertState? - Mailing list pgsql-hackers
From | Luc Vlaming |
---|---|
Subject | Re: should INSERT SELECT use a BulkInsertState? |
Date | |
Msg-id | bd0dc980-c4c9-29c9-8cb6-8d7787ae9fef@swarm64.com Whole thread Raw |
In response to | Re: should INSERT SELECT use a BulkInsertState? (Justin Pryzby <pryzby@telsasoft.com>) |
List | pgsql-hackers |
On 30.10.20 05:51, Justin Pryzby wrote: > On Thu, Oct 22, 2020 at 01:29:53PM +0100, Simon Riggs wrote: >> On Fri, 16 Oct 2020 at 22:05, Justin Pryzby <pryzby@telsasoft.com> wrote: >> >>>>> I made this conditional on BEGIN BULK/SET bulk, so I'll solicit comments on that. >> >> I think it would be better if this was self-tuning. So that we don't >> allocate a bulkinsert state until we've done say 100 (?) rows >> inserted. > > I made it an optional, non-default behavior in response to the legitimate > concern for performance regression for the cases where a loader needs to be as > fast as possible - as compared with our case, where we want instead to optimize > for our reports by making the loaders responsible for their own writes, rather > than leaving behind many dirty pages, and clobbering the cache, too. > > Also, INSERT SELECT doesn't immediately help us (telsasoft), since we use > INSERT .. VALUES () .. ON CONFLICT. This would handle that case, which is > great, even though that wasn't a design goal. It could also be an integer GUC > to allow configuring the size of the ring buffer. > >> You should also use table_multi_insert() since that will give further >> performance gains by reducing block access overheads. Switching from >> single row to multi-row should also only happen once we've loaded a >> few rows, so we don't introduce overahads for smaller SQL statements. > > Good idea...multi_insert (which reduces the overhead of individual inserts) is > mostly independent from BulkInsert state (which uses a ring-buffer to avoid > dirtying the cache). I made this 0002. > > This makes INSERT SELECT several times faster, and not clobber the cache too. > > Time: 4700.606 ms (00:04.701) > 123 | 1 > 37 | 2 > 20 | 3 > 11 | 4 > 4537 | 5 > 11656 | > > Time: 1125.302 ms (00:01.125) > 2171 | 1 > 37 | 2 > 20 | 3 > 11 | 4 > 111 | 5 > 14034 | > > When enabled, this passes nearly all regression tests, and all but 2 of the > changes are easily understood. The 2nd patch still needs work. > Hi, Came across this thread because I'm working on an improvement for the relation extension to improve the speed of the bulkinsert itself in (highly) parallel cases and would like to make sure that our approaches work nicely together. Given what I've seen and tried so far with various benchmarks I would also really like to see a different approach here. The "BEGIN BULK" can be problematic for example if you mix small amounts of inserts and big amounts in the same transaction, or if your application possibly does a bulk insert but otherwise mostly OLTP transactions. To me the idea from Simon sounds good to only use a bulk insert state after inserting e.g. a 1000 rows, and this also seems more applicable to most applications compared to requiring a change to any application that wishes to have faster ingest. Another approach could be to combine this, for example, with a few extra requirements to limit the amount of regressions and first learn more how this behaves in the field. We could, for example, only (just throwing out some ideas), require that: - the relation has a certain size - a BufferStrategy a maximum certain size is used - there is a certain amount of lock waiters on relation extension. (like we do with bulk extend) - we have extended the relation for at least e.g. 4 MB and not used the FSM anymore thereby proving that we are doing bulk operations instead of random small extensions everywhere into the relation that use the FSM. Another thing is that we first try to improve the bulk operation facilities in general and then have another shot at this? Not sure if there is some benchmark / query that shows where such a 10x slowdown would appear but maybe that would be worth a look as well possibly. Regards, Luc
pgsql-hackers by date: