Re: Duplicate deletion optimizations - Mailing list pgsql-performance

From antoine@inaps.org
Subject Re: Duplicate deletion optimizations
Date
Msg-id 1b467beb79cae86d93ab8e55d3e79292@inaps.org
Whole thread Raw
In response to Duplicate deletion optimizations  (antoine@inaps.org)
List pgsql-performance
Hello,

Thanks for your numerous and complete answers!

For those who have asked for more information about the process and
hardware:

The goal of the process is to compute data from a nosql cluster and
write results in a PostgreSQL database. This process is triggered every
5 minutes for the latest 5 minutes data. 80% of data can be wrote in the
database with a simple copy, which is the fastest solution we found for
bulk insertion. But for some data, duplicates are possible (but very
unusual), and the new data must replace the old one in database. I'm
looking for the fastest solution to do this upsert.

About the hardware:

The PostgreSQL database run on a KVM virtual machine, configured with
8GB of ram and 4 cores of a L5640 CPU. The hypervisor have two 7,2k
standard SAS disks working in linux software raid 1. Disks are shared by
VMs, and obviously, this PostgreSQL VM doesn't share its hypervisor with
another "write-intensive" VM.

Also, this database is dedicated to store the data outgoing the
process, so I'm really free for its configuration and tuning. I also
plan to add a replicated slave database for read operations, and maybe
do a partitioning of data, if needed.

If I summarize your solutions:

  - Add an "order by" statement to my initial query can help the planner
to use the index.
  - Temporary tables, with a COPY of new data to the temporary table and
a merge of data (you proposed different ways for the merge).
  - Use EXISTS statement in the delete (but not recommended by another
reply)

I'll try your ideas this week, and I'll give you results.

Antoine.

pgsql-performance by date:

Previous
From: Jochen Erwied
Date:
Subject: Re: Duplicate deletion optimizations
Next
From: darklow
Date:
Subject: Query planner doesn't use index scan on tsvector GIN index if LIMIT is specifiedQuery planner doesn't use index scan on tsvector GIN index if LIMIT is specified