
From: Renan Alves Fonseca
Subject: Re: Bulk DML performance
Msg-id: CAN_p2QjrVE1bUW5qoewmX-5jckJiUK=gCigkgdorgZpJrY-f8g@mail.gmail.com
In response to: RE: Bulk DML performance (<bill.poole@ymail.com>)
List: pgsql-performance
Hi,
Here are some observations.

On Mon, Mar 17, 2025 at 09:19, <bill.poole@ymail.com> wrote:
>> PostgreSQL has a lot of overhead per row.

> Okay, thanks. I'm not actually too worried about this since in my scenario, each row is about 1.5 kB, so the % overhead is negligible.

>> It is probably not the lookup, but the *modification* of the index that is slow.

> Yes, that makes sense for the original 3 million inserts, but when I perform the update of the 3 million rows, the index doesn't change - they are all HOT updates.

Using "perf" I can see that the overhead is indeed due to index lookup when we do HOT updates.
 
>> Then the best you can do is to use COPY rather than INSERT. It will perform better (but [not] vastly better).

> I need to perform a merge (INSERT ... ON CONFLICT ... DO UPDATE) on the data, so sadly I cannot use COPY.

> I have discovered that, for some reason, performing the original insert without the ON CONFLICT clause is twice as fast as performing it with an ON CONFLICT ... DO UPDATE clause, completing in 4 seconds instead of 8. That seems strange to me: with a unique constraint on the primary key, each inserted value has to be checked in either case, so I wouldn't have thought ON CONFLICT would be doing any additional work.


In the plain INSERT case, there is no separate conflict check per row: uniqueness is enforced as part of the index insertion itself, and inserting a duplicate simply raises an error that aborts the whole operation. With ON CONFLICT ... DO UPDATE, each row must first be checked against the index for an existing conflicting tuple, which is extra work.
 
INSERT INTO test (id, text1)
SELECT generate_series, 'x'
FROM generate_series(1, 3000000);

The runtime stays at 4 seconds even when we add a clause that skips duplicates:

INSERT INTO test (id, text1)
SELECT generate_series, 'x'
FROM generate_series(1, 3000000)
WHERE NOT EXISTS (
  SELECT 1
  FROM test
  WHERE id = generate_series
);


In this case, we are not checking for duplicates within the input dataset itself. If you can guarantee, at the application level, that there are no duplicates, this looks like a good speedup. Perhaps the MERGE command is worth trying as well.
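
As a rough sketch (assuming PostgreSQL 15 or later, and the same test table as above), the upsert could be expressed with MERGE like this:

MERGE INTO test AS t
USING (
  SELECT generate_series AS id, 'x' AS text1
  FROM generate_series(1, 3000000)
) AS s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET text1 = s.text1
WHEN NOT MATCHED THEN
  INSERT (id, text1) VALUES (s.id, s.text1);

Note that, unlike INSERT ... ON CONFLICT, MERGE does not handle concurrent insertions of the same key, so it assumes no other session is writing the same rows at the same time.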
 
> Furthermore, I have found that performing an UPDATE rather than an INSERT ... ON CONFLICT ... DO UPDATE is twice as slow, completing in 16 seconds instead of 8 seconds.

> UPDATE test
> SET text1 = 'x'
> FROM generate_series(1, 3000000)
> WHERE test.id = generate_series;

> This also now means that updating 3 million rows takes 4x longer than inserting those rows. Do we expect updates to be 4x slower than inserts?


It is not the UPDATE itself that is slower; it is the attached WHERE clause (the join against the generate_series set) that makes it slow. Try:
UPDATE test SET text1 = 'x';

In my tests, the update of non-indexed columns is slightly faster than an insert.
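
For anyone reproducing these numbers, a simple way to time a single statement is:

-- EXPLAIN (ANALYZE) actually executes the UPDATE and reports the runtime.
-- Rolling back lets you repeat the measurement on the same data, although
-- even a rolled-back UPDATE leaves dead tuples behind, so VACUUM between runs.
BEGIN;
EXPLAIN (ANALYZE, BUFFERS) UPDATE test SET text1 = 'x';
ROLLBACK;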
 
Regards,
Renan Fonseca
