Thread: Duplicate deletion optimizations
Hello, I've a table with approximately 50 million rows with a schema like this: id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass), t_value integer NOT NULL DEFAULT 0, t_record integer NOT NULL DEFAULT 0, output_id integer NOT NULL DEFAULT 0, count bigint NOT NULL DEFAULT 0, CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id) Every 5 minutes, a process have to insert a few thousand of rows in this table, but sometime, the process have to insert an already existing row (based on values in the triplet (t_value, t_record, output_id). In this case, the row must be updated with the new count value. I've tried some solution given on this stackoverflow question [1] but the insertion rate is always too low for my needs. So, I've decided to do it in two times: - I insert all my new data with a COPY command - When it's done, I run a delete query to remove oldest duplicates Right now, my delete query look like this: SELECT min(id) FROM stats_5mn GROUP BY t_value, t_record, output_id HAVING count(*) > 1; The duration of the query on my test machine with approx. 16 million rows is ~18s. To reduce this duration, I've tried to add an index on my triplet: CREATE INDEX test ON stats_5mn USING btree (t_value , t_record , output_id ); By default, the PostgreSQL planner doesn't want to use my index and do a sequential scan [2], but if I force it with "SET enable_seqscan = off", the index is used [3] and query duration is lowered to ~5s. My questions: - Why the planner refuse to use my index? - Is there a better method for my problem? Thanks by advance for your help, Antoine Millet. [1] http://stackoverflow.com/questions/1109061/insert-on-duplicate-update-postgresql http://stackoverflow.com/questions/3464750/postgres-upsert-insert-or-update-only-if-value-is-different [2] http://explain.depesz.com/s/UzW : GroupAggregate (cost=1167282.380..1294947.770 rows=762182 width=20) (actual time=20067.661..20067.661 rows=0 loops=1) Filter: (five(*) > 1) -> Sort (cost=1167282.380..1186336.910 rows=7621814 width=20) (actual time=15663.549..17463.458 rows=7621805 loops=1) Sort Key: delta, kilo, four Sort Method: external merge Disk: 223512kB -> Seq Scan on three (cost=0.000..139734.140 rows=7621814 width=20) (actual time=0.041..2093.434 rows=7621805 loops=1) [3] http://explain.depesz.com/s/o9P : GroupAggregate (cost=0.000..11531349.190 rows=762182 width=20) (actual time=5307.734..5307.734 rows=0 loops=1) Filter: (five(*) > 1) -> Index Scan using charlie on three (cost=0.000..11422738.330 rows=7621814 width=20) (actual time=0.046..2062.952 rows=7621805 loops=1)
On Fri, 06 Jan 2012 15:35:36 +0100, antoine@inaps.org wrote: > Hello, > > I've a table with approximately 50 million rows with a schema like > this: > > id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass), > t_value integer NOT NULL DEFAULT 0, > t_record integer NOT NULL DEFAULT 0, > output_id integer NOT NULL DEFAULT 0, > count bigint NOT NULL DEFAULT 0, > CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id) > > Every 5 minutes, a process have to insert a few thousand of rows in > this table, > but sometime, the process have to insert an already existing row > (based on > values in the triplet (t_value, t_record, output_id). In this case, > the row > must be updated with the new count value. I've tried some solution > given on this > stackoverflow question [1] but the insertion rate is always too low > for my needs. > > So, I've decided to do it in two times: > > - I insert all my new data with a COPY command > - When it's done, I run a delete query to remove oldest duplicates > > Right now, my delete query look like this: > > SELECT min(id) FROM stats_5mn > GROUP BY t_value, t_record, output_id > HAVING count(*) > 1; Correction: DELETE FROM stats_5mn WHERE id in ( SELECT min(id) FROM stats_5mn GROUP BY t_value, t_record, output_id HAVING count(*) > 1; ); Sorry :-) > > The duration of the query on my test machine with approx. 16 million > rows is ~18s. > > To reduce this duration, I've tried to add an index on my triplet: > > CREATE INDEX test > ON stats_5mn > USING btree > (t_value , t_record , output_id ); > > By default, the PostgreSQL planner doesn't want to use my index and > do a sequential > scan [2], but if I force it with "SET enable_seqscan = off", the > index is used [3] > and query duration is lowered to ~5s. > > > My questions: > > - Why the planner refuse to use my index? > - Is there a better method for my problem? > > > Thanks by advance for your help, > Antoine Millet. > > > [1] > > http://stackoverflow.com/questions/1109061/insert-on-duplicate-update-postgresql > > > http://stackoverflow.com/questions/3464750/postgres-upsert-insert-or-update-only-if-value-is-different > > [2] http://explain.depesz.com/s/UzW : > GroupAggregate (cost=1167282.380..1294947.770 rows=762182 > width=20) (actual time=20067.661..20067.661 rows=0 loops=1) > Filter: (five(*) > 1) > -> Sort (cost=1167282.380..1186336.910 rows=7621814 width=20) > (actual time=15663.549..17463.458 rows=7621805 loops=1) > Sort Key: delta, kilo, four > Sort Method: external merge Disk: 223512kB > -> Seq Scan on three (cost=0.000..139734.140 > rows=7621814 width=20) (actual time=0.041..2093.434 rows=7621805 > loops=1) > > [3] http://explain.depesz.com/s/o9P : > GroupAggregate (cost=0.000..11531349.190 rows=762182 width=20) > (actual time=5307.734..5307.734 rows=0 loops=1) > Filter: (five(*) > 1) > -> Index Scan using charlie on three > (cost=0.000..11422738.330 rows=7621814 width=20) (actual > time=0.046..2062.952 rows=7621805 loops=1)
Hello,
I've a table with approximately 50 million rows with a schema like this:
id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass),
t_value integer NOT NULL DEFAULT 0,
t_record integer NOT NULL DEFAULT 0,
output_id integer NOT NULL DEFAULT 0,
count bigint NOT NULL DEFAULT 0,
CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id)
Every 5 minutes, a process have to insert a few thousand of rows in this table,
but sometime, the process have to insert an already existing row (based on
values in the triplet (t_value, t_record, output_id). In this case, the row
must be updated with the new count value. I've tried some solution given on this
stackoverflow question [1] but the insertion rate is always too low for my needs.
So, I've decided to do it in two times:
- I insert all my new data with a COPY command
- When it's done, I run a delete query to remove oldest duplicates
Right now, my delete query look like this:
SELECT min(id) FROM stats_5mn
GROUP BY t_value, t_record, output_id
HAVING count(*) > 1;
The duration of the query on my test machine with approx. 16 million rows is ~18s.
Hi Samuel! On 6 January 2012 20:02, Samuel Gendler <sgendler@ideasculptor.com> wrote: > Have you considered doing the insert by doing a bulk insert into a temp > table and then pulling rows that don't exist across to the final table in > one query and updating rows that do exist in another query? I did a very > brief scan of the SO thread and didn't see it suggested. Something like > this: > > update stats_5mn set count = count + t.count > from temp_table t > where stats_5mn.t_value = t.t_value and stats_5mn.t_record and > stats_5mn.output_id = t.output_id; > > insert into stats_5mn > select * from temp_table t > where not exists ( > select 1 from stats_5mn s > where s.t_value = t.t_value and s.t_record = t.t_record and s.output_id = > t.output_id > ); > > drop table temp_table; Am I right to assume that the update/insert needs to be placed into a begin / end transaction block if such batch uploads might happen concurrently? Doesn't seem to be the case for this question here, but I like the solution and wonder if it works under more general circumstances. What's the overhead of creating and dropping a temporary table? Is it only worth doing this for a large number of inserted/updated elements? What if the number of inserts/updates is only a dozen at a time for a large table (>10M entries)? Thanks, Marc
Hi Samuel!Am I right to assume that the update/insert needs to be placed into a
On 6 January 2012 20:02, Samuel Gendler <sgendler@ideasculptor.com> wrote:
> Have you considered doing the insert by doing a bulk insert into a temp
> table and then pulling rows that don't exist across to the final table in
> one query and updating rows that do exist in another query? I did a very
> brief scan of the SO thread and didn't see it suggested. Something like
> this:
>
> update stats_5mn set count = count + t.count
> from temp_table t
> where stats_5mn.t_value = t.t_value and stats_5mn.t_record and
> stats_5mn.output_id = t.output_id;
>
> insert into stats_5mn
> select * from temp_table t
> where not exists (
> select 1 from stats_5mn s
> where s.t_value = t.t_value and s.t_record = t.t_record and s.output_id =
> t.output_id
> );
>
> drop table temp_table;
begin / end transaction block if such batch uploads might happen
concurrently? Doesn't seem to be the case for this question here, but
I like the solution and wonder if it works under more general
circumstances.
What's the overhead of creating and dropping a temporary table? Is it
only worth doing this for a large number of inserted/updated elements?
What if the number of inserts/updates is only a dozen at a time for a
large table (>10M entries)?
On 6 January 2012 20:38, Samuel Gendler <sgendler@ideasculptor.com> wrote: > On Fri, Jan 6, 2012 at 12:22 PM, Marc Eberhard <eberhardma@googlemail.com> > wrote: >> On 6 January 2012 20:02, Samuel Gendler <sgendler@ideasculptor.com> wrote: >> > Have you considered doing the insert by doing a bulk insert into a temp >> > table and then pulling rows that don't exist across to the final table >> > in >> > one query and updating rows that do exist in another query? I did a >> > very >> > brief scan of the SO thread and didn't see it suggested. Something like >> > this: >> > >> > update stats_5mn set count = count + t.count >> > from temp_table t >> > where stats_5mn.t_value = t.t_value and stats_5mn.t_record and >> > stats_5mn.output_id = t.output_id; >> > >> > insert into stats_5mn >> > select * from temp_table t >> > where not exists ( >> > select 1 from stats_5mn s >> > where s.t_value = t.t_value and s.t_record = t.t_record and s.output_id >> > = >> > t.output_id >> > ); >> > >> > drop table temp_table; >> >> Am I right to assume that the update/insert needs to be placed into a >> begin / end transaction block if such batch uploads might happen >> concurrently? Doesn't seem to be the case for this question here, but >> I like the solution and wonder if it works under more general >> circumstances. > > > yes, assuming you are concerned about making the insertion atomic. > Obviously, a failure in the second query after success in the 1st query > would be problematic outside of a transaction, since any attempt to repeat > the entire operation would result in repeated updates. True, but I was more concerned about concurrency, where a second upsert inserts an element between update/insert from the first. That would then skip the element in the first upsert as it is neither updated (doesn't exist at that point in time) nor inserted (does exists at that later point). Or would that be impossible anyway? >> What's the overhead of creating and dropping a temporary table? Is it >> only worth doing this for a large number of inserted/updated elements? >> What if the number of inserts/updates is only a dozen at a time for a >> large table (>10M entries)? > > pretty minimal, but enough that doing a handful of rows at a time probably > wouldn't be worth it. You'd surely get index usage on a plain insert in > such a case, so I'd probably just use an upsert stored proc for doing small > numbers of rows - unless you are doing large numbers of inserts, just a few > at a time. In that case, I'd try to accumulate them and then do them in > bulk. Those are tough questions to answer without a specific context. My > real answer is 'try it and see.' You'll always get an answer that is > specific to your exact circumstance that way. It's a fairly tricky problem. I have a number of sensors producing energy data about every 5 minutes, but at random times between 1 and 15 minutes. I can't change that as that's the way the hardware of the sensors works. These feed into another unit, which accumulates them and forwards them in batches over the Internet to my PostgreSQL database server every few minutes (again at random times outside my control and with random batch sizes). To make things worse, if the Internet connection between the unit and the database server fails, it will send the latest data first to provide a quick update to the current values and then send the backlog of stored values. Thus, data do not always arrive in correct time order. At the moment I only look at the latest data for each sensor and these should be as close to real time as possible. Thus, collecting data for some time to get a larger size for a batch update isn't preferable. What I want to do, and this is where the upsert problem starts, is to build a table with energy values at fixed times. These should be calculated as a linear interpolation between the nearest reported values from the sensors. Please note each sensor is reporting a measured energy value (not instant power), which always increases monotonically with time. To compare the performance of the different devices that are measured, I need to have the energy values at the same time and not at the random times when the sensors report. This also allows the calculation of average power for the devices by taking the difference of the energy values over longer periods, like 30 minutes. What I simply haven't got my head around is how to do this in an efficient way. When new values arrive, the table of interpolated values needs to be updated. For some times, there will already be values in the table, but for other times there won't. Thus, the upsert. If there was a communication failure, the reported sensor times will go backwards as the most recent is transmitted first until the backlog is cleared. In that case the interpolation table will be populated with intermediate values from the first insert with the latest timestamp and then these values will be refined by the backlog data as they trickle in. Under normal circumstances, reported timestamps will be monotonically increasing and the interpolation table will simply extend to later times. There are more reads from the interpolation table than updates as there are many clients watching the data live via a COMET web frontend (or better will be once I get this working). I could try to code all of this in the application code (Tomcat servlets in my case), but I'd much rather like to find an elegant way to let the database server populated the interpolation table from the inserted sensor values. I can find the nearest relevant entries in the interpolation table to be upserted by using date_trunc() on the timestamp from the sensor value. But I then also need to find out the closest sensor value in the database with an earlier and possibly later timestamp around the fixed times in the interpolation table. Sometimes a new value will result in an update and sometimes not. Sometimes a new value needs to be added to the interpolation table and sometimes not. I know I'm pushing SQL a bit hard with this type of problem, but doing it in the application logic would result in quite a few round trips between the database server and the application code. It's sort of an intellectual challenge for me to see how much I can offload onto the database server. Thus, my interest in batch upserts. Another reason is that I don't want to hold any state or intermediate data in the application code. I want this in the database as it is much better in storing things persistently than my own code could ever be. It was designed to do that properly after all! > By the way, there is definitely a difference between creating a temp table > and creating a table temporarily. See the postgres docs about temp tables Yes, I'm aware of that and meant a temporary/temp table, but being old fashioned I prefer the long form, which is also valid syntax. From the docs (v9.1.2): CREATE ... { TEMPORARY | TEMP } ... TABLE Thanks, Marc
hi, Maybe these thoughts could help.... 1) order by those three columns in your select min query could force index usage... 2) or DELETE FROM table WHERE EXISTS(SELECT id FROM table t WHERE t.id > table.id AND t.col1 = table.col1 AND t.col2 = table.col2 AND col3 = table.col3) Sent from my Windows Phone From: antoine@inaps.org Sent: 06/01/2012 15:36 To: pgsql-performance@postgresql.org Subject: [PERFORM] Duplicate deletion optimizations Hello, I've a table with approximately 50 million rows with a schema like this: id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass), t_value integer NOT NULL DEFAULT 0, t_record integer NOT NULL DEFAULT 0, output_id integer NOT NULL DEFAULT 0, count bigint NOT NULL DEFAULT 0, CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id) Every 5 minutes, a process have to insert a few thousand of rows in this table, but sometime, the process have to insert an already existing row (based on values in the triplet (t_value, t_record, output_id). In this case, the row must be updated with the new count value. I've tried some solution given on this stackoverflow question [1] but the insertion rate is always too low for my needs. So, I've decided to do it in two times: - I insert all my new data with a COPY command - When it's done, I run a delete query to remove oldest duplicates Right now, my delete query look like this: SELECT min(id) FROM stats_5mn GROUP BY t_value, t_record, output_id HAVING count(*) > 1; The duration of the query on my test machine with approx. 16 million rows is ~18s. To reduce this duration, I've tried to add an index on my triplet: CREATE INDEX test ON stats_5mn USING btree (t_value , t_record , output_id ); By default, the PostgreSQL planner doesn't want to use my index and do a sequential scan [2], but if I force it with "SET enable_seqscan = off", the index is used [3] and query duration is lowered to ~5s. My questions: - Why the planner refuse to use my index? - Is there a better method for my problem? Thanks by advance for your help, Antoine Millet. [1] http://stackoverflow.com/questions/1109061/insert-on-duplicate-update-postgresql http://stackoverflow.com/questions/3464750/postgres-upsert-insert-or-update-only-if-value-is-different [2] http://explain.depesz.com/s/UzW : GroupAggregate (cost=1167282.380..1294947.770 rows=762182 width=20) (actual time=20067.661..20067.661 rows=0 loops=1) Filter: (five(*) > 1) -> Sort (cost=1167282.380..1186336.910 rows=7621814 width=20) (actual time=15663.549..17463.458 rows=7621805 loops=1) Sort Key: delta, kilo, four Sort Method: external merge Disk: 223512kB -> Seq Scan on three (cost=0.000..139734.140 rows=7621814 width=20) (actual time=0.041..2093.434 rows=7621805 loops=1) [3] http://explain.depesz.com/s/o9P : GroupAggregate (cost=0.000..11531349.190 rows=762182 width=20) (actual time=5307.734..5307.734 rows=0 loops=1) Filter: (five(*) > 1) -> Index Scan using charlie on three (cost=0.000..11422738.330 rows=7621814 width=20) (actual time=0.046..2062.952 rows=7621805 loops=1) -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
hi, Maybe these thoughts could help.... 1) order by those three columns in your select min query could force index usage... 2) or DELETE FROM table WHERE EXISTS(SELECT id FROM table t WHERE t.id > table.id AND t.col1 = table.col1 AND t.col2 = table.col2 AND col3 = table.col3) Sent from my Windows Phone From: antoine@inaps.org Sent: 06/01/2012 15:36 To: pgsql-performance@postgresql.org Subject: [PERFORM] Duplicate deletion optimizations Hello, I've a table with approximately 50 million rows with a schema like this: id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass), t_value integer NOT NULL DEFAULT 0, t_record integer NOT NULL DEFAULT 0, output_id integer NOT NULL DEFAULT 0, count bigint NOT NULL DEFAULT 0, CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id) Every 5 minutes, a process have to insert a few thousand of rows in this table, but sometime, the process have to insert an already existing row (based on values in the triplet (t_value, t_record, output_id). In this case, the row must be updated with the new count value. I've tried some solution given on this stackoverflow question [1] but the insertion rate is always too low for my needs. So, I've decided to do it in two times: - I insert all my new data with a COPY command - When it's done, I run a delete query to remove oldest duplicates Right now, my delete query look like this: SELECT min(id) FROM stats_5mn GROUP BY t_value, t_record, output_id HAVING count(*) > 1; The duration of the query on my test machine with approx. 16 million rows is ~18s. To reduce this duration, I've tried to add an index on my triplet: CREATE INDEX test ON stats_5mn USING btree (t_value , t_record , output_id ); By default, the PostgreSQL planner doesn't want to use my index and do a sequential scan [2], but if I force it with "SET enable_seqscan = off", the index is used [3] and query duration is lowered to ~5s. My questions: - Why the planner refuse to use my index? - Is there a better method for my problem? Thanks by advance for your help, Antoine Millet. [1] http://stackoverflow.com/questions/1109061/insert-on-duplicate-update-postgresql http://stackoverflow.com/questions/3464750/postgres-upsert-insert-or-update-only-if-value-is-different [2] http://explain.depesz.com/s/UzW : GroupAggregate (cost=1167282.380..1294947.770 rows=762182 width=20) (actual time=20067.661..20067.661 rows=0 loops=1) Filter: (five(*) > 1) -> Sort (cost=1167282.380..1186336.910 rows=7621814 width=20) (actual time=15663.549..17463.458 rows=7621805 loops=1) Sort Key: delta, kilo, four Sort Method: external merge Disk: 223512kB -> Seq Scan on three (cost=0.000..139734.140 rows=7621814 width=20) (actual time=0.041..2093.434 rows=7621805 loops=1) [3] http://explain.depesz.com/s/o9P : GroupAggregate (cost=0.000..11531349.190 rows=762182 width=20) (actual time=5307.734..5307.734 rows=0 loops=1) Filter: (five(*) > 1) -> Index Scan using charlie on three (cost=0.000..11422738.330 rows=7621814 width=20) (actual time=0.046..2062.952 rows=7621805 loops=1) -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Friday, January 6, 2012, 4:21:06 PM you wrote: >> Every 5 minutes, a process have to insert a few thousand of rows in this >> table, but sometime, the process have to insert an already existing row >> (based on values in the triplet (t_value, t_record, output_id). In this >> case, the row must be updated with the new count value. I've tried some >> solution given on this stackoverflow question [1] but the insertion rate >> is always too low for my needs. I did check the following in a loop, starting with an empty table, and inserting/updating 50000 random unique entries. After 15 minutes I've got about 10 million records, each loop takes about 3 seconds. After 30 minutes the table contains approx. 18 million entries, time per loop only slightly increased. After 90 minutes the database has about 30 million entries. The speed has dropped to about 15-20 seconds per loop, but the server is doing lots of other queries in parallel, so with an unloaded server the updates should still take less than 10 seconds. The generator runs in perl, and generates records for a maximum of 100 million different entries: use strict; srand time; my $i = 0; open FD, ">data.in"; for (1..50000) { $i += rand(2000); print FD sprintf("%d\t%d\t%d\t%d\n", $i/65536, ($i/256)%255, $i%255, rand(1000)); } close FD; The SQL-script looks like this: \timing on begin; create temp table t_imp(id bigint,t_value integer,t_record integer,output_id integer,count bigint); \copy t_imp (t_value, t_record, output_id, count) from 'data.in' --an index is not really needed, table is in memory anyway --create index t_imp_ix on t_imp(t_value,t_record,output_id); -- find matching rows update t_imp set id=test.id from test where (t_imp.t_value,t_imp.t_record,t_imp.output_id)=(test.t_value,test.t_record,test.output_id); -- update matching rows using primary key update test set count=t_imp.count from t_imp where t_imp.id is null and test.id=t_imp.id; -- insert missing rows insert into test(t_value,t_record,output_id,count) select t_value,t_record,output_id,count from t_imp where id is null; commit; Advantages of this solution: - all updates are done in-place, no index modifications (except for the inserts, of course) - big table only gets inserts - no dead tuples from deletes - completely avoids sequential scans on the big table Tested on my home server (8GB RAM, 3GB shared memory, Dual-Xeon 5110, 1.6 GHz, table and indices stored on a SSD) Table statistics: relid | 14332525 schemaname | public relname | test seq_scan | 8 seq_tup_read | 111541821 idx_scan | 149240169 idx_tup_fetch | 117901695 n_tup_ins | 30280175 n_tup_upd | 0 n_tup_del | 0 n_tup_hot_upd | 0 n_live_tup | 30264431 n_dead_tup | 0 last_vacuum | last_autovacuum | last_analyze | last_autoanalyze | 2012-01-07 12:38:49.593651+01 vacuum_count | 0 autovacuum_count | 0 analyze_count | 0 autoanalyze_count | 31 The sequential scans were from some 'select count(*)' in between. HTH. -- Jochen Erwied | home: jochen@erwied.eu +49-208-38800-18, FAX: -19 Sauerbruchstr. 17 | work: joe@mbs-software.de +49-2151-7294-24, FAX: -50 D-45470 Muelheim | mobile: jochen.erwied@vodafone.de +49-173-5404164
> It's a fairly tricky problem. I have a number of sensors producing > energy data about every 5 minutes, but at random times between 1 and > 15 minutes. I can't change that as that's the way the hardware of the > sensors works. These feed into another unit, which accumulates them > and forwards them in batches over the Internet to my PostgreSQL > database server every few minutes (again at random times outside my > control and with random batch sizes). To make things worse, if the > Internet connection between the unit and the database server fails, it > will send the latest data first to provide a quick update to the > current values and then send the backlog of stored values. Thus, data > do not always arrive in correct time order. I'm stuck home with flu, so I'm happy to help ;) I'll build an example setup to make it clearer... -- A list of all sensors create table sensors( sensor_id integer primary key ); insert into sensors select generate_series(1,100); -- A table to contain raw sensor data create table log( sensor_id integer not null references sensors(sensor_id), time integer not null, value float not null ); -- Fill it up with test data insert into log select sensor_id, time, time from ( select distinct sensor_id, (n+random()*10)::INTEGER as time from generate_series(0,50000,5) n cross join sensors ) d; -- index it alter table log add primary key( time, sensor_id ); create index log_sensor_time on log( sensor_id, time ); select * from log where sensor_id=1 order by time; sensor_id | time | value -----------+-------+------- 1 | 12 | 12 1 | 14 | 14 1 | 21 | 21 1 | 29 | 29 1 | 30 | 30 (....) 1 | 49996 | 49996 1 | 50001 | 50001 -- create a table which will contain the time ticks -- which will be used as x-axis for interpolation -- (in this example, one tick every 10 time units) create table ticks( time integer primary key, check( time%10 = 0 ) ); insert into ticks select generate_series( 0, (select max(time) from log), 10 ); -- create interpolated values table create table interp( sensor_id integer not null references sensors( sensor_id ), time integer not null references ticks( time ), value float, distance integer not null ); -- fill interpolated values table -- (pretty slow) insert into interp select sensor_id, t.time, start_value + (end_value-start_value)*(t.time-start_time)/(end_time-start_time), greatest( t.time - start_time, end_time-t.time ) from (select sensor_id, lag(time) over (partition by sensor_id order by time) as start_time, time as end_time, lag(value) over (partition by sensor_id order by time) as start_value, value as end_value from log ) as l join ticks t on (t.time >= start_time and t.time < end_time); -- alternate query if you don't like the ticks table (same sesult) : insert into interp select sensor_id, time, start_value + (end_value-start_value)*(time-start_time)/(end_time-start_time), greatest( time - start_time, end_time-time ) from (select *, generate_series( ((start_time+9)/10)*10, ((end_time-1)/10)*10, 10 ) AS time from (select sensor_id, lag(time) over (partition by sensor_id order by time) as start_time, time as end_time, lag(value) over (partition by sensor_id order by time) as start_value, value as end_value from log ) as l ) l; alter table interp add primary key( time,sensor_id ); create index interp_sensor_time on interp( sensor_id, time ); For each interval in the log table that contains a time tick, this query generates the interpolated data at that tick. Note that the "distance" field represents the distance (in time) between the interpolated value and the farthest real data point that was used to calculate it. Therefore, it can be used as a measure of the quality of the interpolated point ; if the distance is greater than some threshold, the value might not be that precise. Now, suppose we receive a bunch of data. The data isn't ordered according to time. There are two possibilities : - the new data starts right where we left off (ie, just after the last time for each sensor in table log) - the new data starts later in time, and we want to process the results right away, expecting to receive, at some later point, older data to fill the holes The second one is hairier, lets' do that. Anyway, let's create a packet : -- A table to contain raw sensor data create temporary table packet( sensor_id integer not null, time integer not null, value float not null ); -- Fill it up with test data insert into packet select sensor_id, time, time from ( select distinct sensor_id, (n+random()*10)::INTEGER as time from generate_series(50200,50400) n cross join sensors ) d; Note that I deliberately inserted a hole : the log table contains times 0-50000 and the packet contains times 50200-50400. We'll need to decide if we want the hole to appear in the "interp" table or not. Let's say we don't want it to appear, we'll just interpolate over the hole. he "distance" column will be there so we don't forget this data is some sort of guess. If we receive data to fill that hole later, we can always use it. For each sensor in the packet, we need to grab some entries from table "log", at least the most recent one, to be able to do some interpolation with the first (oldest) value in the packet. To be more general, in case we receive old data that will plug a hole, we'll also grab the oldest log entry that is more recent than the most recent one in the packet for this sensor (hum... i have to re-read that...) Anyway, first let's create the missing ticks : INSERT INTO ticks SELECT generate_series( (SELECT max(time) FROM ticks)+10, (SELECT max(time) FROM packet), 10); And ... CREATE TEMPORARY TABLE new_interp( sensor_id INTEGER NOT NULL, time INTEGER NOT NULL, value FLOAT NOT NULL, distance INTEGER NOT NULL ); -- time range in the packet for each sensor WITH ranges AS ( SELECT sensor_id, min(time) AS packet_start_time, max(time) AS packet_end_time FROM packet GROUP BY sensor_id ), -- time ranges for records already in table log that will be needed to interpolate packet records log_boundaries AS ( SELECT sensor_id, COALESCE( (SELECT max(l.time) FROM log l WHERE l.sensor_id=r.sensor_id AND l.time < r.packet_start_time), r.packet_start_time ) AS packet_start_time, COALESCE( (SELECT min(l.time) FROM log l WHERE l.sensor_id=r.sensor_id AND l.time > r.packet_end_time), r.packet_end_time ) AS packet_end_time FROM ranges r ), -- merge existing and new data extended_packet AS ( SELECT log.* FROM log JOIN log_boundaries USING (sensor_id) WHERE log.time BETWEEN packet_start_time AND packet_end_time UNION ALL SELECT * FROM packet ), -- zip current and next records pre_interp AS ( SELECT sensor_id, lag(time) OVER (PARTITION BY sensor_id ORDER BY time) AS start_time, time AS end_time, lag(value) over (PARTITION BY sensor_id ORDER BY time) AS start_value, value AS end_value FROM extended_packet ), -- add tick info pre_interp2 AS ( SELECT *, generate_series( ((start_time+9)/10)*10, ((end_time-1)/10)*10, 10 ) AS time FROM pre_interp ) -- interpolate INSERT INTO new_interp SELECT sensor_id, time, start_value + (end_value-start_value)*(time-start_time)/(end_time-start_time) AS value, greatest( time - start_time, end_time-time ) AS distance FROM pre_interp2; Although this query is huge, it's very fast, since it doesn't hit the big tables with any seq scans (hence the max() and min() tricks to use the indexes instead). I love how postgres can blast that huge pile of SQL in, like, 50 ms... If there is some overlap between packet data and data already in the log, you might get some division by zero errors, in this case you'll need to apply a DISTINCT somewhere (or simply replace the UNION ALL with an UNION, which might be wiser anyway...) Anyway, that doesn't solve the "upsert" problem, so here we go : -- Update the existing rows UPDATE interp SET value = new_interp.value, distance = new_interp.distance FROM new_interp WHERE interp.sensor_id = new_interp.sensor_id AND interp.time = new_interp.time AND interp.distance > new_interp.distance; -- insert new rows INSERT INTO interp SELECT new_interp.* FROM new_interp LEFT JOIN interp USING (sensor_id,time) WHERE interp.sensor_id IS NULL; -- also insert data into log (don't forget this !) INSERT INTO log SELECT * FROM packet; Tada. select * from interp where sensor_id=1 and time > 49950 order by time; sensor_id | time | value | distance -----------+-------+-------+---------- 1 | 49960 | 49960 | 7 1 | 49970 | 49970 | 4 1 | 49980 | 49980 | 3 1 | 49990 | 49990 | 5 1 | 50000 | 50000 | 2 1 | 50010 | 50010 | 190 1 | 50020 | 50020 | 180 1 | 50030 | 50030 | 170 (...) 1 | 50180 | 50180 | 178 1 | 50190 | 50190 | 188 1 | 50200 | 50200 | 2 1 | 50210 | 50210 | 1 1 | 50220 | 50220 | 1 1 | 50230 | 50230 | 1 1 | 50240 | 50240 | 2 Note that the hole was interpolated over, but the "distance" column shows this data is a guess, not real. What happens if we receive some data later to plug the hole ? -- plug the previously left hole truncate packet; truncate new_interp; insert into packet select sensor_id, time, time from ( select distinct sensor_id, (n+random()*10)::INTEGER as time from generate_series(50050,50150) n cross join sensors ) d; (re-run huge query and upsert) select * from interp where sensor_id=1 and time > 49950 order by time; sensor_id | time | value | distance -----------+-------+-------+---------- 1 | 49960 | 49960 | 7 1 | 49970 | 49970 | 4 1 | 49980 | 49980 | 3 1 | 49990 | 49990 | 5 1 | 50000 | 50000 | 2 1 | 50010 | 50010 | 45 1 | 50020 | 50020 | 35 1 | 50030 | 50030 | 28 1 | 50040 | 50040 | 38 1 | 50050 | 50050 | 48 1 | 50060 | 50060 | 1 1 | 50070 | 50070 | 1 1 | 50080 | 50080 | 2 (...) 1 | 50130 | 50130 | 1 1 | 50140 | 50140 | 3 1 | 50150 | 50150 | 1 1 | 50160 | 50160 | 40 1 | 50170 | 50170 | 30 1 | 50180 | 50180 | 26 1 | 50190 | 50190 | 36 1 | 50200 | 50200 | 2 1 | 50210 | 50210 | 1 1 | 50220 | 50220 | 1 1 | 50230 | 50230 | 1 1 | 50240 | 50240 | 2 It has used the new data to rewrite new values over the entire hole, and those values should have better precision. Enjoy !
Yes, but it should become a bit slower if you fix your code :-)
where t_imp.id is null and test.id=t_imp.id;
=>
where t_imp.id is not null and test.id=t_imp.id;
and a partial index on matching rows might help (should be tested):
(after the first updat)
create index t_imp_ix on t_imp(t_value,t_record,output_id) where t_imp.id is not null.
regards,
Marc Mamin
-----Ursprüngliche Nachricht-----
Von: pgsql-performance-owner@postgresql.org im Auftrag von Jochen Erwied
Gesendet: Sa 1/7/2012 12:57
An: antoine@inaps.org
Cc: pgsql-performance@postgresql.org
Betreff: Re: [PERFORM] Duplicate deletion optimizations
Friday, January 6, 2012, 4:21:06 PM you wrote:
>> Every 5 minutes, a process have to insert a few thousand of rows in this
>> table, but sometime, the process have to insert an already existing row
>> (based on values in the triplet (t_value, t_record, output_id). In this
>> case, the row must be updated with the new count value. I've tried some
>> solution given on this stackoverflow question [1] but the insertion rate
>> is always too low for my needs.
I did check the following in a loop, starting with an empty table, and
inserting/updating 50000 random unique entries. After 15 minutes I've got
about 10 million records, each loop takes about 3 seconds. After 30 minutes
the table contains approx. 18 million entries, time per loop only slightly
increased. After 90 minutes the database has about 30 million entries. The
speed has dropped to about 15-20 seconds per loop, but the server is doing
lots of other queries in parallel, so with an unloaded server the updates
should still take less than 10 seconds.
The generator runs in perl, and generates records for a maximum of 100
million different entries:
use strict;
srand time;
my $i = 0;
open FD, ">data.in";
for (1..50000)
{
$i += rand(2000);
print FD sprintf("%d\t%d\t%d\t%d\n", $i/65536, ($i/256)%255, $i%255, rand(1000));
}
close FD;
The SQL-script looks like this:
\timing on
begin;
create temp table t_imp(id bigint,t_value integer,t_record integer,output_id integer,count bigint);
\copy t_imp (t_value, t_record, output_id, count) from 'data.in'
--an index is not really needed, table is in memory anyway
--create index t_imp_ix on t_imp(t_value,t_record,output_id);
-- find matching rows
update t_imp
set id=test.id
from test
where (t_imp.t_value,t_imp.t_record,t_imp.output_id)=(test.t_value,test.t_record,test.output_id);
-- update matching rows using primary key
update test
set count=t_imp.count
from t_imp
where t_imp.id is null and test.id=t_imp.id;
-- insert missing rows
insert into test(t_value,t_record,output_id,count)
select t_value,t_record,output_id,count
from t_imp
where id is null;
commit;
Advantages of this solution:
- all updates are done in-place, no index modifications (except for the
inserts, of course)
- big table only gets inserts
- no dead tuples from deletes
- completely avoids sequential scans on the big table
Tested on my home server (8GB RAM, 3GB shared memory, Dual-Xeon 5110, 1.6
GHz, table and indices stored on a SSD)
Table statistics:
relid | 14332525
schemaname | public
relname | test
seq_scan | 8
seq_tup_read | 111541821
idx_scan | 149240169
idx_tup_fetch | 117901695
n_tup_ins | 30280175
n_tup_upd | 0
n_tup_del | 0
n_tup_hot_upd | 0
n_live_tup | 30264431
n_dead_tup | 0
last_vacuum |
last_autovacuum |
last_analyze |
last_autoanalyze | 2012-01-07 12:38:49.593651+01
vacuum_count | 0
autovacuum_count | 0
analyze_count | 0
autoanalyze_count | 31
The sequential scans were from some 'select count(*)' in between.
HTH.
--
Jochen Erwied | home: jochen@erwied.eu +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 | work: joe@mbs-software.de +49-2151-7294-24, FAX: -50
D-45470 Muelheim | mobile: jochen.erwied@vodafone.de +49-173-5404164
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
If solution with temp table is acceptable - i think steps could be reduced... =E2=80=A2 copy to temp_imp ( temp table does not have id column) =E2=80=A2 update live set count =3D temp_imp.count from temp_imp using ( col1,col2,col3) =E2=80=A2 insert into live from temp where col1, col2 and col3 not exists in live Kind Regards, Misa Sent from my Windows Phone From: Jochen Erwied Sent: 07/01/2012 12:58 To: antoine@inaps.org Cc: pgsql-performance@postgresql.org Subject: Re: [PERFORM] Duplicate deletion optimizations Friday, January 6, 2012, 4:21:06 PM you wrote: >> Every 5 minutes, a process have to insert a few thousand of rows in this >> table, but sometime, the process have to insert an already existing row >> (based on values in the triplet (t_value, t_record, output_id). In this >> case, the row must be updated with the new count value. I've tried some >> solution given on this stackoverflow question [1] but the insertion rate >> is always too low for my needs. I did check the following in a loop, starting with an empty table, and inserting/updating 50000 random unique entries. After 15 minutes I've got about 10 million records, each loop takes about 3 seconds. After 30 minutes the table contains approx. 18 million entries, time per loop only slightly increased. After 90 minutes the database has about 30 million entries. The speed has dropped to about 15-20 seconds per loop, but the server is doing lots of other queries in parallel, so with an unloaded server the updates should still take less than 10 seconds. The generator runs in perl, and generates records for a maximum of 100 million different entries: use strict; srand time; my $i =3D 0; open FD, ">data.in"; for (1..50000) { $i +=3D rand(2000); print FD sprintf("%d\t%d\t%d\t%d\n", $i/65536, ($i/256)%255, $i%255, rand(1000)); } close FD; The SQL-script looks like this: \timing on begin; create temp table t_imp(id bigint,t_value integer,t_record integer,output_id integer,count bigint); \copy t_imp (t_value, t_record, output_id, count) from 'data.in' --an index is not really needed, table is in memory anyway --create index t_imp_ix on t_imp(t_value,t_record,output_id); -- find matching rows update t_imp set id=3Dtest.id from test where (t_imp.t_value,t_imp.t_record,t_imp.output_id)=3D(test.t_value= ,test.t_record,test.output_id); -- update matching rows using primary key update test set count=3Dt_imp.count from t_imp where t_imp.id is null and test.id=3Dt_imp.id; -- insert missing rows insert into test(t_value,t_record,output_id,count) select t_value,t_record,output_id,count from t_imp where id is null; commit; Advantages of this solution: - all updates are done in-place, no index modifications (except for the inserts, of course) - big table only gets inserts - no dead tuples from deletes - completely avoids sequential scans on the big table Tested on my home server (8GB RAM, 3GB shared memory, Dual-Xeon 5110, 1.6 GHz, table and indices stored on a SSD) Table statistics: relid | 14332525 schemaname | public relname | test seq_scan | 8 seq_tup_read | 111541821 idx_scan | 149240169 idx_tup_fetch | 117901695 n_tup_ins | 30280175 n_tup_upd | 0 n_tup_del | 0 n_tup_hot_upd | 0 n_live_tup | 30264431 n_dead_tup | 0 last_vacuum | last_autovacuum | last_analyze | last_autoanalyze | 2012-01-07 12:38:49.593651+01 vacuum_count | 0 autovacuum_count | 0 analyze_count | 0 autoanalyze_count | 31 The sequential scans were from some 'select count(*)' in between. HTH. --=20 Jochen Erwied | home: jochen@erwied.eu +49-208-38800-18, FAX: -19 Sauerbruchstr. 17 | work: joe@mbs-software.de +49-2151-7294-24, FAX: -50 D-45470 Muelheim | mobile: jochen.erwied@vodafone.de +49-173-5404164 --=20 Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Saturday, January 7, 2012, 3:02:10 PM you wrote: > • insert into live from temp where col1, col2 and col3 not exists in > live 'not exists' is something I'm trying to avoid, even if the optimizer is able to handle it. -- Jochen Erwied | home: jochen@erwied.eu +49-208-38800-18, FAX: -19 Sauerbruchstr. 17 | work: joe@mbs-software.de +49-2151-7294-24, FAX: -50 D-45470 Muelheim | mobile: jochen.erwied@vodafone.de +49-173-5404164
> It's a fairly tricky problem. I have a number of sensors producing > energy data about every 5 minutes, but at random times between 1 and > 15 minutes. I can't change that as that's the way the hardware of the > sensors works. These feed into another unit, which accumulates them > and forwards them in batches over the Internet to my PostgreSQL > database server every few minutes (again at random times outside my > control and with random batch sizes). To make things worse, if the > Internet connection between the unit and the database server fails, it > will send the latest data first to provide a quick update to the > current values and then send the backlog of stored values. Thus, data > do not always arrive in correct time order. I'm stuck home with flu, so I'm happy to help ;) I'll build an example setup to make it clearer... -- A list of all sensors create table sensors( sensor_id integer primary key ); insert into sensors select generate_series(1,100); -- A table to contain raw sensor data create table log( sensor_id integer not null references sensors(sensor_id), time integer not null, value float not null ); -- Fill it up with test data insert into log select sensor_id, time, time from ( select distinct sensor_id, (n+random()*10)::INTEGER as time from generate_series(0,50000,5) n cross join sensors ) d; -- index it alter table log add primary key( time, sensor_id ); create index log_sensor_time on log( sensor_id, time ); select * from log where sensor_id=1 order by time; sensor_id | time | value -----------+-------+------- 1 | 12 | 12 1 | 14 | 14 1 | 21 | 21 1 | 29 | 29 1 | 30 | 30 (....) 1 | 49996 | 49996 1 | 50001 | 50001 -- create a table which will contain the time ticks -- which will be used as x-axis for interpolation -- (in this example, one tick every 10 time units) create table ticks( time integer primary key, check( time%10 = 0 ) ); insert into ticks select generate_series( 0, (select max(time) from log), 10 ); -- create interpolated values table create table interp( sensor_id integer not null references sensors( sensor_id ), time integer not null references ticks( time ), value float, distance integer not null ); -- fill interpolated values table -- (pretty slow) insert into interp select sensor_id, t.time, start_value + (end_value-start_value)*(t.time-start_time)/(end_time-start_time), greatest( t.time - start_time, end_time-t.time ) from (select sensor_id, lag(time) over (partition by sensor_id order by time) as start_time, time as end_time, lag(value) over (partition by sensor_id order by time) as start_value, value as end_value from log ) as l join ticks t on (t.time >= start_time and t.time < end_time); -- alternate query if you don't like the ticks table (same sesult) : insert into interp select sensor_id, time, start_value + (end_value-start_value)*(time-start_time)/(end_time-start_time), greatest( time - start_time, end_time-time ) from (select *, generate_series( ((start_time+9)/10)*10, ((end_time-1)/10)*10, 10 ) AS time from (select sensor_id, lag(time) over (partition by sensor_id order by time) as start_time, time as end_time, lag(value) over (partition by sensor_id order by time) as start_value, value as end_value from log ) as l ) l; alter table interp add primary key( time,sensor_id ); create index interp_sensor_time on interp( sensor_id, time ); For each interval in the log table that contains a time tick, this query generates the interpolated data at that tick. Note that the "distance" field represents the distance (in time) between the interpolated value and the farthest real data point that was used to calculate it. Therefore, it can be used as a measure of the quality of the interpolated point ; if the distance is greater than some threshold, the value might not be that precise. Now, suppose we receive a bunch of data. The data isn't ordered according to time. There are two possibilities : - the new data starts right where we left off (ie, just after the last time for each sensor in table log) - the new data starts later in time, and we want to process the results right away, expecting to receive, at some later point, older data to fill the holes The second one is hairier, lets' do that. Anyway, let's create a packet : -- A table to contain raw sensor data create temporary table packet( sensor_id integer not null, time integer not null, value float not null ); -- Fill it up with test data insert into packet select sensor_id, time, time from ( select distinct sensor_id, (n+random()*10)::INTEGER as time from generate_series(50200,50400) n cross join sensors ) d; Note that I deliberately inserted a hole : the log table contains times 0-50000 and the packet contains times 50200-50400. We'll need to decide if we want the hole to appear in the "interp" table or not. Let's say we don't want it to appear, we'll just interpolate over the hole. he "distance" column will be there so we don't forget this data is some sort of guess. If we receive data to fill that hole later, we can always use it. For each sensor in the packet, we need to grab some entries from table "log", at least the most recent one, to be able to do some interpolation with the first (oldest) value in the packet. To be more general, in case we receive old data that will plug a hole, we'll also grab the oldest log entry that is more recent than the most recent one in the packet for this sensor (hum... i have to re-read that...) Anyway, first let's create the missing ticks : INSERT INTO ticks SELECT generate_series( (SELECT max(time) FROM ticks)+10, (SELECT max(time) FROM packet), 10); And ... CREATE TEMPORARY TABLE new_interp( sensor_id INTEGER NOT NULL, time INTEGER NOT NULL, value FLOAT NOT NULL, distance INTEGER NOT NULL ); -- time range in the packet for each sensor WITH ranges AS ( SELECT sensor_id, min(time) AS packet_start_time, max(time) AS packet_end_time FROM packet GROUP BY sensor_id ), -- time ranges for records already in table log that will be needed to interpolate packet records log_boundaries AS ( SELECT sensor_id, COALESCE( (SELECT max(l.time) FROM log l WHERE l.sensor_id=r.sensor_id AND l.time < r.packet_start_time), r.packet_start_time ) AS packet_start_time, COALESCE( (SELECT min(l.time) FROM log l WHERE l.sensor_id=r.sensor_id AND l.time > r.packet_end_time), r.packet_end_time ) AS packet_end_time FROM ranges r ), -- merge existing and new data extended_packet AS ( SELECT log.* FROM log JOIN log_boundaries USING (sensor_id) WHERE log.time BETWEEN packet_start_time AND packet_end_time UNION ALL SELECT * FROM packet ), -- zip current and next records pre_interp AS ( SELECT sensor_id, lag(time) OVER (PARTITION BY sensor_id ORDER BY time) AS start_time, time AS end_time, lag(value) over (PARTITION BY sensor_id ORDER BY time) AS start_value, value AS end_value FROM extended_packet ), -- add tick info pre_interp2 AS ( SELECT *, generate_series( ((start_time+9)/10)*10, ((end_time-1)/10)*10, 10 ) AS time FROM pre_interp ) -- interpolate INSERT INTO new_interp SELECT sensor_id, time, start_value + (end_value-start_value)*(time-start_time)/(end_time-start_time) AS value, greatest( time - start_time, end_time-time ) AS distance FROM pre_interp2; Although this query is huge, it's very fast, since it doesn't hit the big tables with any seq scans (hence the max() and min() tricks to use the indexes instead). I love how postgres can blast that huge pile of SQL in, like, 50 ms... If there is some overlap between packet data and data already in the log, you might get some division by zero errors, in this case you'll need to apply a DISTINCT somewhere (or simply replace the UNION ALL with an UNION, which might be wiser anyway...) Anyway, that doesn't solve the "upsert" problem, so here we go : -- Update the existing rows UPDATE interp SET value = new_interp.value, distance = new_interp.distance FROM new_interp WHERE interp.sensor_id = new_interp.sensor_id AND interp.time = new_interp.time AND interp.distance > new_interp.distance; -- insert new rows INSERT INTO interp SELECT new_interp.* FROM new_interp LEFT JOIN interp USING (sensor_id,time) WHERE interp.sensor_id IS NULL; -- also insert data into log (don't forget this !) INSERT INTO log SELECT * FROM packet; Tada. select * from interp where sensor_id=1 and time > 49950 order by time; sensor_id | time | value | distance -----------+-------+-------+---------- 1 | 49960 | 49960 | 7 1 | 49970 | 49970 | 4 1 | 49980 | 49980 | 3 1 | 49990 | 49990 | 5 1 | 50000 | 50000 | 2 1 | 50010 | 50010 | 190 1 | 50020 | 50020 | 180 1 | 50030 | 50030 | 170 (...) 1 | 50180 | 50180 | 178 1 | 50190 | 50190 | 188 1 | 50200 | 50200 | 2 1 | 50210 | 50210 | 1 1 | 50220 | 50220 | 1 1 | 50230 | 50230 | 1 1 | 50240 | 50240 | 2 Note that the hole was interpolated over, but the "distance" column shows this data is a guess, not real. What happens if we receive some data later to plug the hole ? -- plug the previously left hole truncate packet; truncate new_interp; insert into packet select sensor_id, time, time from ( select distinct sensor_id, (n+random()*10)::INTEGER as time from generate_series(50050,50150) n cross join sensors ) d; (re-run huge query and upsert) select * from interp where sensor_id=1 and time > 49950 order by time; sensor_id | time | value | distance -----------+-------+-------+---------- 1 | 49960 | 49960 | 7 1 | 49970 | 49970 | 4 1 | 49980 | 49980 | 3 1 | 49990 | 49990 | 5 1 | 50000 | 50000 | 2 1 | 50010 | 50010 | 45 1 | 50020 | 50020 | 35 1 | 50030 | 50030 | 28 1 | 50040 | 50040 | 38 1 | 50050 | 50050 | 48 1 | 50060 | 50060 | 1 1 | 50070 | 50070 | 1 1 | 50080 | 50080 | 2 (...) 1 | 50130 | 50130 | 1 1 | 50140 | 50140 | 3 1 | 50150 | 50150 | 1 1 | 50160 | 50160 | 40 1 | 50170 | 50170 | 30 1 | 50180 | 50180 | 26 1 | 50190 | 50190 | 36 1 | 50200 | 50200 | 2 1 | 50210 | 50210 | 1 1 | 50220 | 50220 | 1 1 | 50230 | 50230 | 1 1 | 50240 | 50240 | 2 It has used the new data to rewrite new values over the entire hole, and those values should have better precision. Enjoy !
On Fri, Jan 6, 2012 at 6:35 AM, <antoine@inaps.org> wrote: > Hello, > > I've a table with approximately 50 million rows with a schema like this: > > id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass), > t_value integer NOT NULL DEFAULT 0, > t_record integer NOT NULL DEFAULT 0, > output_id integer NOT NULL DEFAULT 0, > count bigint NOT NULL DEFAULT 0, > CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id) > > Every 5 minutes, a process have to insert a few thousand of rows in this > table, > but sometime, the process have to insert an already existing row (based on > values in the triplet (t_value, t_record, output_id). In this case, the row > must be updated with the new count value. I've tried some solution given on > this > stackoverflow question [1] but the insertion rate is always too low for my > needs. What are your needs? It should take no special hardware or coding to be able to manage a few thousand rows over 5 minutes. > So, I've decided to do it in two times: > > - I insert all my new data with a COPY command > - When it's done, I run a delete query to remove oldest duplicates > > Right now, my delete query look like this: > > SELECT min(id) FROM stats_5mn > GROUP BY t_value, t_record, output_id > HAVING count(*) > 1; > > The duration of the query on my test machine with approx. 16 million rows is > ~18s. > > To reduce this duration, I've tried to add an index on my triplet: > > CREATE INDEX test > ON stats_5mn > USING btree > (t_value , t_record , output_id ); > > By default, the PostgreSQL planner doesn't want to use my index and do a > sequential > scan [2], but if I force it with "SET enable_seqscan = off", the index is > used [3] > and query duration is lowered to ~5s. > > > My questions: > > - Why the planner refuse to use my index? It thinks that using the index will be about 9 times more expensive than the full scan. Probably your settings for seq_page_cost and random_page_cost are such that the planner thinks that nearly every buffer read is going to be from disk. But in reality (in this case) your data is all in memory. So the planner is mis-estimating. (It would help verify this if you did your EXPLAIN ANALYZE with BUFFERS as well). But before trying to fix this by tweaking settings, will the real case always be like your test case? If the data stops being all in memory, either because the problem size increases or because you have to compete for buffer space with other things going on, then using the index scan could be catastrophic. Cheers, Jeff
It was not query... Just sentence where some index values in one table not exist in another... So query could be with: =E2=80=A2 WHERE (col1,col2,col2) NOT IN =E2=80=A2 WHERE NOT EXISTS =E2=80=A2 LEFT JOIN live USING (col1,col2,col2) WHERE live.id IS NULL what ever whoever prefer more or what gives better results... But I think it is more personal feelings which is better then real... Sent from my Windows Phone From: Jochen Erwied Sent: 07/01/2012 15:18 To: Misa Simic Cc: pgsql-performance@postgresql.org Subject: Re: [PERFORM] Duplicate deletion optimizations Saturday, January 7, 2012, 3:02:10 PM you wrote: > =E2=80=A2 insert into live from temp where col1, col2 and col3 not exists= in > live 'not exists' is something I'm trying to avoid, even if the optimizer is able to handle it. --=20 Jochen Erwied | home: jochen@erwied.eu +49-208-38800-18, FAX: -19 Sauerbruchstr. 17 | work: joe@mbs-software.de +49-2151-7294-24, FAX: -50 D-45470 Muelheim | mobile: jochen.erwied@vodafone.de +49-173-5404164
If solution with temp table is acceptable - i think steps could be reduced... • copy to temp_imp ( temp table does not have id column) • update live set count = temp_imp.count from temp_imp using ( col1,col2,col3) • insert into live from temp where col1, col2 and col3 not exists in live Kind Regards, Misa Sent from my Windows Phone From: Jochen Erwied Sent: 07/01/2012 12:58 To: antoine@inaps.org Cc: pgsql-performance@postgresql.org Subject: Re: [PERFORM] Duplicate deletion optimizations Friday, January 6, 2012, 4:21:06 PM you wrote: >> Every 5 minutes, a process have to insert a few thousand of rows in this >> table, but sometime, the process have to insert an already existing row >> (based on values in the triplet (t_value, t_record, output_id). In this >> case, the row must be updated with the new count value. I've tried some >> solution given on this stackoverflow question [1] but the insertion rate >> is always too low for my needs. I did check the following in a loop, starting with an empty table, and inserting/updating 50000 random unique entries. After 15 minutes I've got about 10 million records, each loop takes about 3 seconds. After 30 minutes the table contains approx. 18 million entries, time per loop only slightly increased. After 90 minutes the database has about 30 million entries. The speed has dropped to about 15-20 seconds per loop, but the server is doing lots of other queries in parallel, so with an unloaded server the updates should still take less than 10 seconds. The generator runs in perl, and generates records for a maximum of 100 million different entries: use strict; srand time; my $i = 0; open FD, ">data.in"; for (1..50000) { $i += rand(2000); print FD sprintf("%d\t%d\t%d\t%d\n", $i/65536, ($i/256)%255, $i%255, rand(1000)); } close FD; The SQL-script looks like this: \timing on begin; create temp table t_imp(id bigint,t_value integer,t_record integer,output_id integer,count bigint); \copy t_imp (t_value, t_record, output_id, count) from 'data.in' --an index is not really needed, table is in memory anyway --create index t_imp_ix on t_imp(t_value,t_record,output_id); -- find matching rows update t_imp set id=test.id from test where (t_imp.t_value,t_imp.t_record,t_imp.output_id)=(test.t_value,test.t_record,test.output_id); -- update matching rows using primary key update test set count=t_imp.count from t_imp where t_imp.id is null and test.id=t_imp.id; -- insert missing rows insert into test(t_value,t_record,output_id,count) select t_value,t_record,output_id,count from t_imp where id is null; commit; Advantages of this solution: - all updates are done in-place, no index modifications (except for the inserts, of course) - big table only gets inserts - no dead tuples from deletes - completely avoids sequential scans on the big table Tested on my home server (8GB RAM, 3GB shared memory, Dual-Xeon 5110, 1.6 GHz, table and indices stored on a SSD) Table statistics: relid | 14332525 schemaname | public relname | test seq_scan | 8 seq_tup_read | 111541821 idx_scan | 149240169 idx_tup_fetch | 117901695 n_tup_ins | 30280175 n_tup_upd | 0 n_tup_del | 0 n_tup_hot_upd | 0 n_live_tup | 30264431 n_dead_tup | 0 last_vacuum | last_autovacuum | last_analyze | last_autoanalyze | 2012-01-07 12:38:49.593651+01 vacuum_count | 0 autovacuum_count | 0 analyze_count | 0 autoanalyze_count | 31 The sequential scans were from some 'select count(*)' in between. HTH. -- Jochen Erwied | home: jochen@erwied.eu +49-208-38800-18, FAX: -19 Sauerbruchstr. 17 | work: joe@mbs-software.de +49-2151-7294-24, FAX: -50 D-45470 Muelheim | mobile: jochen.erwied@vodafone.de +49-173-5404164 -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Hi Pierre! On 7 January 2012 12:20, Pierre C <lists@peufeu.com> wrote: > I'm stuck home with flu, so I'm happy to help ;) [...] > I'll build an example setup to make it clearer... [...] That's almost identical to my tables. :-) > Note that the "distance" field represents the distance (in time) between the > interpolated value and the farthest real data point that was used to > calculate it. Therefore, it can be used as a measure of the quality of the > interpolated point ; if the distance is greater than some threshold, the > value might not be that precise. Nice idea! > Although this query is huge, it's very fast, since it doesn't hit the big > tables with any seq scans (hence the max() and min() tricks to use the > indexes instead). And it can easily be tamed by putting parts of it into stored pgpsql functions. > I love how postgres can blast that huge pile of SQL in, like, 50 ms... Yes, indeed. It's incredible fast. Brilliant! > If there is some overlap between packet data and data already in the log, > you might get some division by zero errors, in this case you'll need to > apply a DISTINCT somewhere (or simply replace the UNION ALL with an UNION, > which might be wiser anyway...) I do have a unique constraint on the actual table to prevent duplicate data in case of retransmission after a failed connect. It's easy enough to delete the rows from packet that already exist in the main table with a short one line SQL delete statement before the interpolation and merge. > Tada. :-)))) > Enjoy ! I certainly will. Many thanks for those great lines of SQL! Hope you recover from your flu quickly! All the best, Marc
It was not query... Just sentence where some index values in one table not exist in another... So query could be with: • WHERE (col1,col2,col2) NOT IN • WHERE NOT EXISTS • LEFT JOIN live USING (col1,col2,col2) WHERE live.id IS NULL what ever whoever prefer more or what gives better results... But I think it is more personal feelings which is better then real... Sent from my Windows Phone From: Jochen Erwied Sent: 07/01/2012 15:18 To: Misa Simic Cc: pgsql-performance@postgresql.org Subject: Re: [PERFORM] Duplicate deletion optimizations Saturday, January 7, 2012, 3:02:10 PM you wrote: > • insert into live from temp where col1, col2 and col3 not exists in > live 'not exists' is something I'm trying to avoid, even if the optimizer is able to handle it. -- Jochen Erwied | home: jochen@erwied.eu +49-208-38800-18, FAX: -19 Sauerbruchstr. 17 | work: joe@mbs-software.de +49-2151-7294-24, FAX: -50 D-45470 Muelheim | mobile: jochen.erwied@vodafone.de +49-173-5404164
> That's almost identical to my tables. You explained your problem very well ;) > I certainly will. Many thanks for those great lines of SQL! You're welcome ! Strangely I didn't receive the mail I posted to the list (received yours though).
Are your stats updated on the table after you added the index? - run the bad query with explain verbose on (you should send this anyways) - check to see what the difference is in expected rows vs. actual rows - make sure that your work_mem is high enough if you are sorting, if not you'll see it write out a temp file which will beslow. - if there is different analyze the table and rerun the query to see if you get the expected results. - I do believe having COUNT(*) > 1 will never use an index, but someone more experience can comment here. -----Original Message----- From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of antoine@inaps.org Sent: Friday, January 06, 2012 8:36 AM To: pgsql-performance@postgresql.org Subject: [PERFORM] Duplicate deletion optimizations Hello, I've a table with approximately 50 million rows with a schema like this: id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass), t_value integer NOT NULL DEFAULT 0, t_record integer NOT NULL DEFAULT 0, output_id integer NOT NULL DEFAULT 0, count bigint NOT NULL DEFAULT 0, CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id) Every 5 minutes, a process have to insert a few thousand of rows in this table, but sometime, the process have to insertan already existing row (based on values in the triplet (t_value, t_record, output_id). In this case, the row mustbe updated with the new count value. I've tried some solution given on this stackoverflow question [1] but the insertionrate is always too low for my needs. So, I've decided to do it in two times: - I insert all my new data with a COPY command - When it's done, I run a delete query to remove oldest duplicates Right now, my delete query look like this: SELECT min(id) FROM stats_5mn GROUP BY t_value, t_record, output_id HAVING count(*) > 1; The duration of the query on my test machine with approx. 16 million rows is ~18s. To reduce this duration, I've tried to add an index on my triplet: CREATE INDEX test ON stats_5mn USING btree (t_value , t_record , output_id ); By default, the PostgreSQL planner doesn't want to use my index and do a sequential scan [2], but if I force it with "SETenable_seqscan = off", the index is used [3] and query duration is lowered to ~5s. My questions: - Why the planner refuse to use my index? - Is there a better method for my problem? Thanks by advance for your help, Antoine Millet. [1] http://stackoverflow.com/questions/1109061/insert-on-duplicate-update-postgresql http://stackoverflow.com/questions/3464750/postgres-upsert-insert-or-update-only-if-value-is-different [2] http://explain.depesz.com/s/UzW : GroupAggregate (cost=1167282.380..1294947.770 rows=762182 width=20) (actual time=20067.661..20067.661 rows=0 loops=1) Filter: (five(*) > 1) -> Sort (cost=1167282.380..1186336.910 rows=7621814 width=20) (actual time=15663.549..17463.458 rows=7621805 loops=1) Sort Key: delta, kilo, four Sort Method: external merge Disk: 223512kB -> Seq Scan on three (cost=0.000..139734.140 rows=7621814 width=20) (actual time=0.041..2093.434 rows=7621805 loops=1) [3] http://explain.depesz.com/s/o9P : GroupAggregate (cost=0.000..11531349.190 rows=762182 width=20) (actual time=5307.734..5307.734 rows=0 loops=1) Filter: (five(*) > 1) -> Index Scan using charlie on three (cost=0.000..11422738.330 rows=7621814 width=20) (actual time=0.046..2062.952 rows=7621805 loops=1) -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance This email is confidential and subject to important disclaimers and conditions including on offers for the purchase or sale of securities, accuracy and completeness of information, viruses, confidentiality, legal privilege, and legal entity disclaimers, available at http://www.jpmorgan.com/pages/disclosures/email.
Saturday, January 7, 2012, 1:21:02 PM you wrote: > where t_imp.id is null and test.id=t_imp.id; > => > where t_imp.id is not null and test.id=t_imp.id; You're right, overlooked that one. But the increase to execute the query is - maybe not completely - suprisingly minimal. Because the query updating the id-column of t_imp fetches all rows from test to be updated, they are already cached, and the second query is run completely from cache. I suppose you will get a severe performance hit when the table cannot be cached... I ran the loop again, after 30 minutes I'm at about 3-5 seconds per loop, as long as the server isn't doing something else. Under load it's at about 10-20 seconds, with a ratio of 40% updates, 60% inserts. > and a partial index on matching rows might help (should be tested): > (after the first updat) > create index t_imp_ix on t_imp(t_value,t_record,output_id) where t_imp.id is not null. I don't think this will help much since t_imp is scanned sequentially anyway, so creating an index is just unneeded overhead. -- Jochen Erwied | home: jochen@erwied.eu +49-208-38800-18, FAX: -19 Sauerbruchstr. 17 | work: joe@mbs-software.de +49-2151-7294-24, FAX: -50 D-45470 Muelheim | mobile: jochen.erwied@vodafone.de +49-173-5404164
Hello, Thanks for your numerous and complete answers! For those who have asked for more information about the process and hardware: The goal of the process is to compute data from a nosql cluster and write results in a PostgreSQL database. This process is triggered every 5 minutes for the latest 5 minutes data. 80% of data can be wrote in the database with a simple copy, which is the fastest solution we found for bulk insertion. But for some data, duplicates are possible (but very unusual), and the new data must replace the old one in database. I'm looking for the fastest solution to do this upsert. About the hardware: The PostgreSQL database run on a KVM virtual machine, configured with 8GB of ram and 4 cores of a L5640 CPU. The hypervisor have two 7,2k standard SAS disks working in linux software raid 1. Disks are shared by VMs, and obviously, this PostgreSQL VM doesn't share its hypervisor with another "write-intensive" VM. Also, this database is dedicated to store the data outgoing the process, so I'm really free for its configuration and tuning. I also plan to add a replicated slave database for read operations, and maybe do a partitioning of data, if needed. If I summarize your solutions: - Add an "order by" statement to my initial query can help the planner to use the index. - Temporary tables, with a COPY of new data to the temporary table and a merge of data (you proposed different ways for the merge). - Use EXISTS statement in the delete (but not recommended by another reply) I'll try your ideas this week, and I'll give you results. Antoine.