Thread: Improve COPY performance into table with indexes.

Improve COPY performance into table with indexes.

From

James Brauman

Date:

03 April 2020, 04:42:39

I am using the COPY command to insert 10 million rows from a CSV file
into a database table and I am finding the performance is
unacceptable. When the COPY command is executed the disk I/O is
extremely high which leads to degraded query performance for other
queries being executed on the database.

I have tried removing the indexes from the database table and this
dramatically improved performance (both reducing the execution time
and reducing disk I/O).

The constraints I am working with are:
 -  Indexes must be in place to ensure that other queries that are
executed on the table while the COPY command is executing have
acceptable performance.
 - The table should not be locked for reads during the COPY (it is
fine to lock for writes).

I don't know how COPY is implemented, but it seems that the indexes
are updated as data is inserted into the table. Ideally I would like
to delay updating the indexes until the COPY command has completed and
just update the indexes a single time, I think this would lead to much
better performance. Is this possible?

Another idea that I had was to start a transaction, drop the indexes,
run the COPY command and then recreate the indexes. The problem with
this approach is that DROP INDEX acquires an exclusive lock on the
table, which is not acceptable in this system (otherwise other queries
won't be able to execute). I tried using DROP INDEX CONCURRENTLY, but
that cannot be executed inside a transaction.

What are my options in this scenario?

Thanks

Re: Improve COPY performance into table with indexes.

From

Tim Cross

Date:

03 April 2020, 07:27:11

James Brauman <james.brauman@envato.com> writes:

> I am using the COPY command to insert 10 million rows from a CSV file
> into a database table and I am finding the performance is
> unacceptable. When the COPY command is executed the disk I/O is
> extremely high which leads to degraded query performance for other
> queries being executed on the database.
>
> I have tried removing the indexes from the database table and this
> dramatically improved performance (both reducing the execution time
> and reducing disk I/O).
>
> The constraints I am working with are:
>  -  Indexes must be in place to ensure that other queries that are
> executed on the table while the COPY command is executing have
> acceptable performance.
>  - The table should not be locked for reads during the COPY (it is
> fine to lock for writes).
>
> I don't know how COPY is implemented, but it seems that the indexes
> are updated as data is inserted into the table. Ideally I would like
> to delay updating the indexes until the COPY command has completed and
> just update the indexes a single time, I think this would lead to much
> better performance. Is this possible?
>
> Another idea that I had was to start a transaction, drop the indexes,
> run the COPY command and then recreate the indexes. The problem with
> this approach is that DROP INDEX acquires an exclusive lock on the
> table, which is not acceptable in this system (otherwise other queries
> won't be able to execute). I tried using DROP INDEX CONCURRENTLY, but
> that cannot be executed inside a transaction.
>
> What are my options in this scenario?
>
> Thanks

I don't think there is any way you can disable/drop indexes just for one
transaction and not impact on other activities, such as queries. What
options you have depends on a number of factors. Some
questions/suggestions.

The main problem with any solution which relies on dropping indexes is
that you still have to pay that cost at some point. If you drop the
indexes, you then have to recreate them and do an analyze to update the
stats. Depending on the table, amount of data and number/type of
indexes, this can be very slow and also impact your queries. 

We use the copy command a lot - inserting approx 400M records per day
(from Javascript rather than CSV file). Our tables are large and our
queries need to be very fast. Initial implementations were not perform
particularly well. Below are some of the things we did that helped. Note
that all of these things did not always help - it really depends a lot
on the tables, types of queries, indexes and column/row definitions. 

- You don't indicate what current performance is and what would be an
  acceptable level of performance. How long does it take to insert your
  10M records? What is the acceptable performance time for queries while
  the update is occurring?

- Not all indexes are equal. Have you tried dropping some rather than
  all of the indexes? Have you verified all the indexes are being used
  by the queries that run during the update. Sometimes, dropping or
  modifying just one or a couple of the indexes can have significant
  impact. You may be able to get a better balance where inserts are
  faster and only a small drop in performance for queries. It is very
  common for there to be indexes which are unnecessary or unused,
  especially if you have a number of developers. As queries are
  refined/changed, people often forget to remove indexes which are no
  longer needed. Verifying index use is something If find to be a
  valuable maintenance task.

- Have you tried using a temporary staging table. We were able to get
  significant performance improvement by copying into a temporary table
  and then having a database function which inserted the data from the
  temp table into the target table. While the overall time for the
  process was slightly longer, the actual amount of time spent updating
  the target table was significantly less, reducing impact on other
  activity like queries. This can be especially useful if you also need
  to do some other data transformation or validation prior to inserting
  the data and some protection against bad CSV data resulting in failed
  copies and having to rollback on a core prod table.

- Do you have any triggers on the table. Personally, I don't like
  triggers - they are a side effect which often have performance
  implications, easily overlooked and a frequent source of bugs. It is
  rare that there isn't an alternative solution which is easier to
  maintain (I know this is a 'touchy' topic and others will disagree).
  At any rate, if there are triggers, make sure to analyze them and
  understand their impact on the process.

- CSV quality. Make sure the CSV files are good quality and you are not
  doing any implicit data conversion. Consider whether breaking up your
  CSV into smaller units might help (sometimes, multiple but shorter
  performance impacts can be more acceptable than one long one).

- Make sure you have good logging enabled and check for things like
  overly frequent writing of WALs. This can have significant impact on
  performance. If your rows are large, you may be adversely impacting
  performance writing the WAL cache etc. 
-- 
Tim Cross

Re: Improve COPY performance into table with indexes.

From

Laurenz Albe

Date:

03 April 2020, 08:47:40

On Fri, 2020-04-03 at 18:27 +1100, Tim Cross wrote:
> > I am using the COPY command to insert 10 million rows from a CSV file
> > into a database table and I am finding the performance is
> > unacceptable. When the COPY command is executed the disk I/O is
> > extremely high which leads to degraded query performance for other
> > queries being executed on the database.
> > 
> > What are my options in this scenario?
> 
> I don't think there is any way you can disable/drop indexes just for one
> transaction and not impact on other activities, such as queries. What
> options you have depends on a number of factors. Some
> questions/suggestions.

These are all good suggestions.

Let me add that there is a trade-off between fast queries and fast data
modifications.  You may drop those indexes that are not absolutely required
and accept that some queries become slower.

Another way to save one index is to partition on one column and drop the
index on that column.  Then queries can use a sequential scan on one
partition rather than an index scan, which might still be acceptable.

Speaking of partitioning: perhaps it is an option to use a partitioned
table and load the new data in a new partition. Then you can add the index
after loading and only then you turn the new table into a table partition.

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com

Re: Improve COPY performance into table with indexes.

From

Adrian Klaver

Date:

03 April 2020, 15:52:21

On 4/2/20 9:42 PM, James Brauman wrote:
> I am using the COPY command to insert 10 million rows from a CSV file
> into a database table and I am finding the performance is
> unacceptable. When the COPY command is executed the disk I/O is
> extremely high which leads to degraded query performance for other
> queries being executed on the database.
> 
> I have tried removing the indexes from the database table and this
> dramatically improved performance (both reducing the execution time
> and reducing disk I/O).

Is there the option to use tablespaces to move the index(s) to another 
disk?:

https://www.postgresql.org/docs/12/sql-alterindex.html

"
SET TABLESPACE

     This form changes the index's tablespace to the specified 
tablespace and moves the data file(s) associated with the index to the 
new tablespace.  ..."


> 
> The constraints I am working with are:
>   -  Indexes must be in place to ensure that other queries that are
> executed on the table while the COPY command is executing have
> acceptable performance.
>   - The table should not be locked for reads during the COPY (it is
> fine to lock for writes).
> 
> I don't know how COPY is implemented, but it seems that the indexes
> are updated as data is inserted into the table. Ideally I would like
> to delay updating the indexes until the COPY command has completed and
> just update the indexes a single time, I think this would lead to much
> better performance. Is this possible?
> 
> Another idea that I had was to start a transaction, drop the indexes,
> run the COPY command and then recreate the indexes. The problem with
> this approach is that DROP INDEX acquires an exclusive lock on the
> table, which is not acceptable in this system (otherwise other queries
> won't be able to execute). I tried using DROP INDEX CONCURRENTLY, but
> that cannot be executed inside a transaction.
> 
> What are my options in this scenario?
> 
> Thanks
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com

Re: Improve COPY performance into table with indexes.

From

Rob Sargent

Date:

03 April 2020, 16:08:04


On 4/3/20 9:52 AM, Adrian Klaver wrote:
> On 4/2/20 9:42 PM, James Brauman wrote:
>> I am using the COPY command to insert 10 million rows from a CSV file
>> into a database table and I am finding the performance is
>> unacceptable. When the COPY command is executed the disk I/O is
>> extremely high which leads to degraded query performance for other
>> queries being executed on the database.
>>
>> I have tried removing the indexes from the database table and this
>> dramatically improved performance (both reducing the execution time
>> and reducing disk I/O).
>
I've had excellent results dealing similar data quantity with the 
earlier suggestion of writing to a staging table.  Simply splitting the 
insert statements to cover 1/16[1] of the data turned untenable into 
real-time.

[1] I split on an id column which is of UUID type.  Any arbitrary 
splitting that reduces the footprint sufficiently will do.