Re: Duplicate deletion optimizations - Mailing list pgsql-performance
From | Misa Simic |
---|---|
Subject | Re: Duplicate deletion optimizations |
Date | |
Msg-id | 527717384066155926@iso-8859-1msgid Whole thread Raw |
In response to | Duplicate deletion optimizations (antoine@inaps.org) |
List | pgsql-performance |
If solution with temp table is acceptable - i think steps could be reduced... • copy to temp_imp ( temp table does not have id column) • update live set count = temp_imp.count from temp_imp using ( col1,col2,col3) • insert into live from temp where col1, col2 and col3 not exists in live Kind Regards, Misa Sent from my Windows Phone From: Jochen Erwied Sent: 07/01/2012 12:58 To: antoine@inaps.org Cc: pgsql-performance@postgresql.org Subject: Re: [PERFORM] Duplicate deletion optimizations Friday, January 6, 2012, 4:21:06 PM you wrote: >> Every 5 minutes, a process have to insert a few thousand of rows in this >> table, but sometime, the process have to insert an already existing row >> (based on values in the triplet (t_value, t_record, output_id). In this >> case, the row must be updated with the new count value. I've tried some >> solution given on this stackoverflow question [1] but the insertion rate >> is always too low for my needs. I did check the following in a loop, starting with an empty table, and inserting/updating 50000 random unique entries. After 15 minutes I've got about 10 million records, each loop takes about 3 seconds. After 30 minutes the table contains approx. 18 million entries, time per loop only slightly increased. After 90 minutes the database has about 30 million entries. The speed has dropped to about 15-20 seconds per loop, but the server is doing lots of other queries in parallel, so with an unloaded server the updates should still take less than 10 seconds. The generator runs in perl, and generates records for a maximum of 100 million different entries: use strict; srand time; my $i = 0; open FD, ">data.in"; for (1..50000) { $i += rand(2000); print FD sprintf("%d\t%d\t%d\t%d\n", $i/65536, ($i/256)%255, $i%255, rand(1000)); } close FD; The SQL-script looks like this: \timing on begin; create temp table t_imp(id bigint,t_value integer,t_record integer,output_id integer,count bigint); \copy t_imp (t_value, t_record, output_id, count) from 'data.in' --an index is not really needed, table is in memory anyway --create index t_imp_ix on t_imp(t_value,t_record,output_id); -- find matching rows update t_imp set id=test.id from test where (t_imp.t_value,t_imp.t_record,t_imp.output_id)=(test.t_value,test.t_record,test.output_id); -- update matching rows using primary key update test set count=t_imp.count from t_imp where t_imp.id is null and test.id=t_imp.id; -- insert missing rows insert into test(t_value,t_record,output_id,count) select t_value,t_record,output_id,count from t_imp where id is null; commit; Advantages of this solution: - all updates are done in-place, no index modifications (except for the inserts, of course) - big table only gets inserts - no dead tuples from deletes - completely avoids sequential scans on the big table Tested on my home server (8GB RAM, 3GB shared memory, Dual-Xeon 5110, 1.6 GHz, table and indices stored on a SSD) Table statistics: relid | 14332525 schemaname | public relname | test seq_scan | 8 seq_tup_read | 111541821 idx_scan | 149240169 idx_tup_fetch | 117901695 n_tup_ins | 30280175 n_tup_upd | 0 n_tup_del | 0 n_tup_hot_upd | 0 n_live_tup | 30264431 n_dead_tup | 0 last_vacuum | last_autovacuum | last_analyze | last_autoanalyze | 2012-01-07 12:38:49.593651+01 vacuum_count | 0 autovacuum_count | 0 analyze_count | 0 autoanalyze_count | 31 The sequential scans were from some 'select count(*)' in between. HTH. -- Jochen Erwied | home: jochen@erwied.eu +49-208-38800-18, FAX: -19 Sauerbruchstr. 17 | work: joe@mbs-software.de +49-2151-7294-24, FAX: -50 D-45470 Muelheim | mobile: jochen.erwied@vodafone.de +49-173-5404164 -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
pgsql-performance by date: