It's not the lookup; it's writing the 3 million rows (and in this particular upsert case, deleting 3 million rows, then inserting 3 million).
> Well, that is not a great statement.
Understood, but I was highlighting the performance of deleting 3 million rows identified by 3 million IDs, as opposed to deleting rows in a given range of IDs or deleting the whole table. It seems like deleting 3 million rows identified by 3 million IDs should be faster than updating 3 million rows (also identified by 3 million IDs).
It should indeed be faster. But keep in mind that a delete run immediately after that upsert has twice as many row versions to walk through as the upsert did, since the rows removed by the upsert's delete step are still sitting there as dead tuples until vacuum cleans them up. Also, a subselect like your original query can lead to a large nested loop. Try another variant such as this one:
with ids as (select x from generate_series(1, 3_000_000) x) delete from test using ids where id = x;  -- note: the 3_000_000 underscore spelling needs PostgreSQL 16+; write 3000000 on older versions
> With the table as it is you won't get better performance if you want the features that a relational database provides.
Sorry to hear that. I had hoped there was room to improve this performance.
If pure upsert performance is the goal, remove the unique index and store a timestamp along with your inserted data. Back to pure inserts again (with a few new downsides): when querying, use only the version of each row with the highest timestamp.
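Very roughly, that layout could look like this (table and column names here are made up, not from your schema):

  create table test_log (
      id        bigint      not null,
      payload   text,
      loaded_at timestamptz not null default clock_timestamp()
  );
  create index on test_log (id, loaded_at desc);   -- non-unique, so inserts never conflict

  -- every write is a plain INSERT; reads keep only the newest version per id
  select distinct on (id) id, payload, loaded_at
    from test_log
   order by id, loaded_at desc;

The downsides are the usual ones: the table keeps growing until you prune old versions, and every reader pays for the distinct-on step.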
Other random ideas:
* remove or consolidate columns you don't need, or can store in another table
* pre-filter the rows in the app, so you can do a pure insert (or COPY) of known-to-be-new rows, then upsert the remaining rows (first sketch below the list)
* use the smallest data types possible
* avoid or minimize toasted values
* pack your columns efficiently (e.g. reorder for 8-byte alignment; second sketch below)
* put the indexes on a ram-based tablespace
* boost your work_mem (for things like giant deletes which build hashes)
* revisit unlogged tables and partitioning (these last three ideas are sketched below as well)
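For the pre-filter idea, the same split can also be done server-side from a staging table. Roughly (names are invented, and this assumes the batch has no duplicate ids and there are no concurrent writers racing the two statements):

  create temp table incoming (id bigint primary key, payload text);
  copy incoming from stdin;   -- or \copy from a file on the client

  -- upsert only the rows that already exist ...
  insert into test (id, payload)
  select i.id, i.payload from incoming i
  where exists (select 1 from test t where t.id = i.id)
  on conflict (id) do update set payload = excluded.payload;

  -- ... then a conflict-free pure insert for the rest
  insert into test (id, payload)
  select i.id, i.payload from incoming i
  where not exists (select 1 from test t where t.id = i.id);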
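On packing: fixed-width columns get alignment padding, so listing 8-byte types first, then 4-byte, then the small stuff shaves bytes off every row. Invented columns for illustration:

  -- wasteful ordering: roughly 32 bytes of column data per row
  create table t_sloppy (
      flag    boolean,      -- 1 byte + 7 bytes padding before the bigint
      id      bigint,       -- 8 bytes
      hits    integer,      -- 4 bytes + 4 bytes padding before the timestamptz
      seen_at timestamptz   -- 8 bytes
  );

  -- same columns, widest first: roughly 24 bytes per row
  create table t_packed (
      id      bigint,
      seen_at timestamptz,
      hits    integer,
      flag    boolean
  );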
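And very roughly for the last three bullets (the tmpfs path is an assumption, creating a tablespace needs superuser, and anything on a RAM-backed tablespace has to be rebuilt after a restart that clears it):

  -- index on a RAM-backed tablespace (path is hypothetical)
  create tablespace ramdisk location '/mnt/pg_ram';
  create unique index test_id_ram_idx on test (id) tablespace ramdisk;

  -- more memory for the hashes and sorts built by big deletes/joins, per session
  set work_mem = '512MB';

  -- unlogged table: skips WAL, much faster writes, but emptied after a crash
  create unlogged table test_unlogged (like test including all);

  -- range partitioning by id, so old batches can be dropped instead of deleted row by row
  create table test_part (id bigint, payload text) partition by range (id);
  create table test_part_p1 partition of test_part for values from (1) to (1000001);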
Cheers,