Re: Bulk DML performance - Mailing list pgsql-performance

From Greg Sabino Mullane
Subject Re: Bulk DML performance
Date
Msg-id CAKAnmmKvxMG7t8pv8FZPkrgU9Sh8V=6Y8HaK4QymK-ZA+i+mcA@mail.gmail.com
In response to RE: Bulk DML performance  (<bill.poole@ymail.com>)
List pgsql-performance
On Mon, Mar 17, 2025 at 4:19 AM <bill.poole@ymail.com> wrote:

> Can you help me understand why performing 3 million lookups on a b-tree index with all pages cached in memory takes so long?

It's not the lookup; it's writing the 3 million rows (and in this particular upsert case, deleting 3 million, then inserting 3 million).
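If you want to see that write volume directly (just a generic check, not something from this thread), pg_stat_user_tables should show roughly 3 million updated tuples and roughly 3 million dead row versions right after the upsert, until vacuum cleans them up:

-- rough check of the row rewriting caused by the upsert (table name "test" assumed)
select n_tup_ins, n_tup_upd, n_dead_tup
from pg_stat_user_tables
where relname = 'test';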

>> Well, that is not a great statement.
>
> Understood, but I was highlighting the performance of deleting 3 million rows identified by 3 million IDs, as opposed to deleting rows in a given range of IDs or deleting the whole table. It seems like deleting 3 million rows identified by 3 million IDs should be faster than updating 3 million rows (also identified by 3 million IDs).

It should indeed be faster. But keep in mind that a delete run immediately after that upsert has twice as many row versions to walk through as the upsert did, since the upsert's dead versions are still there until vacuum removes them. Also, a subselect like your original query can lead to a large nested loop. Try another variant such as this one:

with ids as (select x from generate_series(1, 3_000_000) x) delete from test using ids where id=x;
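If it helps to compare plans between that and your original subselect form, something like this works (a generic suggestion, not from the thread; note that EXPLAIN ANALYZE on a DELETE really executes it, hence the rollback):

begin;
explain (analyze, buffers)
with ids as (select x from generate_series(1, 3_000_000) x)
delete from test using ids where id = x;
rollback;  -- the delete actually ran during explain analyze, so undo it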

>> With the table as it is you won't get better performance if you want the features that a relational database provides.
>
> Sorry to hear that. I had hoped there was room to improve this performance.

If pure upsert performance is the goal, remove the unique index and store a timestamp along with your inserted data. Back to pure inserts again! (and a few new downsides). When querying, only use the version of the row with the highest timestamp.
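A rough sketch of that layout, with made-up table and column names just to show the idea (the unique index on id is gone; reads pick the newest version per id):

-- hypothetical insert-only layout: no unique constraint on id
create table test_versions (
    id          bigint not null,
    payload     text,
    inserted_at timestamptz not null default clock_timestamp()
);
create index on test_versions (id, inserted_at desc);

-- reads take only the newest version of each id
select distinct on (id) *
from test_versions
order by id, inserted_at desc;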

Other random ideas:

* remove or consolidate columns you don't need, or can store in another table
* pre-filter the rows in the app, so you can do a pure-insert (or COPY) of known-to-be-new rows, then upsert the remaining rows
* use the smallest data types possible
* avoid or minimize toasted values
* pack your columns efficiently (e.g. order 8-byte-aligned columns first to minimize alignment padding; see the sketch after this list)
* put the indexes on a ram-based tablespace
* boost your work_mem (for things like giant deletes which build hashes)
* revisit unlogged tables and partitioning
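On the column-packing idea, a small made-up example: PostgreSQL pads each column out to its alignment boundary, so listing the 8-byte columns first, then 4-byte ones, then smaller ones, wastes less space per row:

-- wider, 8-byte-aligned columns first; narrower ones last
create table test_packed (
    id         bigint      not null,  -- 8 bytes
    updated_at timestamptz not null,  -- 8 bytes
    amount     integer     not null,  -- 4 bytes
    status     smallint    not null,  -- 2 bytes
    active     boolean     not null   -- 1 byte
);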

Cheers,
Greg

--
Enterprise Postgres Software Products & Tech Support
