Re: Batch update of indexes on data loading - Mailing list pgsql-hackers
From | Simon Riggs |
---|---|
Subject | Re: Batch update of indexes on data loading |
Date | |
Msg-id | 1204016917.4252.171.camel@ebony.site Whole thread Raw |
In response to | Re: Batch update of indexes on data loading (ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp>) |
Responses |
Re: Batch update of indexes on data loading
Re: Batch update of indexes on data loading |
List | pgsql-hackers |
On Tue, 2008-02-26 at 15:19 +0900, ITAGAKI Takahiro wrote: > Simon Riggs <simon@2ndquadrant.com> wrote: > > > One of the reasons why I hadn't wanted to pursue earlier ideas to use > > LOCK was that applying a lock will prevent running in parallel, which > > ultimately may prevent further performance gains. > > > > Is there a way of doing this that will allow multiple concurrent COPYs? > > I think there is same difficulty as parallel queries. It requires tighter > communication among COPY threads whether we will use multi-process model > or multi-thread model. > > We have independent concurrent COPYs now; COPYs are not aware of each > other because no intermediate status during COPY. However, COPY will > have "phases" if we use bulkbuild. Therefore, we will need joining > COPY threads and passing each working memories between threads. The use case for this seems to be * an existing table - since we have indexes * large indexes - larger than shared_buffers since we are worried about the effects of index inserts causing random I/O * a table that can be LOCKed for periods of time * load performance is critical I'm worried that the last two items are often mutually exclusive for many people, making this look like a very narrow use case. My experience at Teradata with complicated load mechanisms is that they take a long time to write, are bug-prone and have serious restrictions on how and when we can use them. Even very large data warehouses frequently use a trickle loader or "normal SQL" mechanism because the business requirement for immediate access to data is just as high as the need for load speed. Faster loading is a requirement for most people however. (Dimitri Fontaine is working on parallel COPY statements from pgloader). So I feel we must try very hard to avoid the LOCK. The LOCK is only required because we defer the inserts into unique indexes, yes? Perhaps we might get good performance by making the inserts into the unique index immediately, but deferring the insert of other indexes? Unique index inserts tend to go into the rightmost blocks, which are almost always in memory as a result. So all the random I/O is caused by non-unique indexes. I very much like the idea of index merging, or put another way: batch index inserts. How big do the batch of index inserts have to be for us to gain benefit from this technique? Would it be possible to just buffer the index inserts inside the indexam module so that we perform a batch of index inserts every N rows? Maybe use work_mem? Or specify a batch size as a parameter on COPY? Do we really need to pass data between COPY sessions to gain maximum benefit? Feels like there is a way that is faster in many cases but simpler than the fully parallel route described. The SQL Standard allows us to define a table as GENERATED ALWAYS AS IDENTITY, so in that case we would be able to defer unique index inserts also, since we know they are already unique. Unique index values arriving from outside the database might be able to be checked against each other as a batch first before adding to the index. For example, deferring unique index checks until we have filled a whole block, then checking that all values on the block are unique with respect to each other and then inserting into the unique index. Some other aspects to this, slightly OT * variable load speed - it is often a requirement to have the data load slow down at busy times and speed up again when less busy. Maybe automatically, but definitely manually. If we avoid locks and large batches then we should be able to do this eventually. * driving UPDATEs and DELETEs - we often want to perform more than just INSERTs. If we over-specialise code for INSERTing data then we may miss out on opportunities to improve the overall data maintenance workload. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
pgsql-hackers by date: