Thread: Duplicate deletion optimizations

Duplicate deletion optimizations

From

antoine@inaps.org

Date:

06 January 2012, 10:36:16

Hello,

I've a table with approximately 50 million rows with a schema like
this:

     id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass),
     t_value integer NOT NULL DEFAULT 0,
     t_record integer NOT NULL DEFAULT 0,
     output_id integer NOT NULL DEFAULT 0,
     count bigint NOT NULL DEFAULT 0,
     CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id)

Every 5 minutes, a process have to insert a few thousand of rows in
this table,
but sometime, the process have to insert an already existing row (based
on
values in the triplet (t_value, t_record, output_id). In this case, the
row
must be updated with the new count value. I've tried some solution
given on this
stackoverflow question [1] but the insertion rate is always too low for
my needs.

So, I've decided to do it in two times:

  - I insert all my new data with a COPY command
  - When it's done, I run a delete query to remove oldest duplicates

Right now, my delete query look like this:

     SELECT min(id) FROM stats_5mn
     GROUP BY t_value, t_record, output_id
     HAVING count(*) > 1;

The duration of the query on my test machine with approx. 16 million
rows is ~18s.

To reduce this duration, I've tried to add an index on my triplet:

     CREATE INDEX test
       ON stats_5mn
       USING btree
       (t_value , t_record , output_id );

By default, the PostgreSQL planner doesn't want to use my index and do
a sequential
scan [2], but if I force it with "SET enable_seqscan = off", the index
is used [3]
and query duration is lowered to ~5s.


My questions:

  - Why the planner refuse to use my index?
  - Is there a better method for my problem?


Thanks by advance for your help,
Antoine Millet.


[1]
http://stackoverflow.com/questions/1109061/insert-on-duplicate-update-postgresql

http://stackoverflow.com/questions/3464750/postgres-upsert-insert-or-update-only-if-value-is-different

[2] http://explain.depesz.com/s/UzW :
     GroupAggregate  (cost=1167282.380..1294947.770 rows=762182
width=20) (actual time=20067.661..20067.661 rows=0 loops=1)
         Filter: (five(*) > 1)
       ->  Sort  (cost=1167282.380..1186336.910 rows=7621814 width=20)
(actual time=15663.549..17463.458 rows=7621805 loops=1)
               Sort Key: delta, kilo, four
               Sort Method:  external merge  Disk: 223512kB
             ->  Seq Scan on three  (cost=0.000..139734.140 rows=7621814
width=20) (actual time=0.041..2093.434 rows=7621805 loops=1)

[3] http://explain.depesz.com/s/o9P :
     GroupAggregate  (cost=0.000..11531349.190 rows=762182 width=20)
(actual time=5307.734..5307.734 rows=0 loops=1)
         Filter: (five(*) > 1)
       ->  Index Scan using charlie on three  (cost=0.000..11422738.330
rows=7621814 width=20) (actual time=0.046..2062.952 rows=7621805
loops=1)

Re: Duplicate deletion optimizations

From

antoine@inaps.org

Date:

06 January 2012, 11:21:35

On Fri, 06 Jan 2012 15:35:36 +0100, antoine@inaps.org wrote:
> Hello,
>
> I've a table with approximately 50 million rows with a schema like
> this:
>
>     id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass),
>     t_value integer NOT NULL DEFAULT 0,
>     t_record integer NOT NULL DEFAULT 0,
>     output_id integer NOT NULL DEFAULT 0,
>     count bigint NOT NULL DEFAULT 0,
>     CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id)
>
> Every 5 minutes, a process have to insert a few thousand of rows in
> this table,
> but sometime, the process have to insert an already existing row
> (based on
> values in the triplet (t_value, t_record, output_id). In this case,
> the row
> must be updated with the new count value. I've tried some solution
> given on this
> stackoverflow question [1] but the insertion rate is always too low
> for my needs.
>
> So, I've decided to do it in two times:
>
>  - I insert all my new data with a COPY command
>  - When it's done, I run a delete query to remove oldest duplicates
>
> Right now, my delete query look like this:
>
>     SELECT min(id) FROM stats_5mn
>     GROUP BY t_value, t_record, output_id
>     HAVING count(*) > 1;

Correction:

     DELETE FROM stats_5mn WHERE id in (
         SELECT min(id) FROM stats_5mn
         GROUP BY t_value, t_record, output_id
         HAVING count(*) > 1;
     );

Sorry :-)

>
> The duration of the query on my test machine with approx. 16 million
> rows is ~18s.
>
> To reduce this duration, I've tried to add an index on my triplet:
>
>     CREATE INDEX test
>       ON stats_5mn
>       USING btree
>       (t_value , t_record , output_id );
>
> By default, the PostgreSQL planner doesn't want to use my index and
> do a sequential
> scan [2], but if I force it with "SET enable_seqscan = off", the
> index is used [3]
> and query duration is lowered to ~5s.
>
>
> My questions:
>
>  - Why the planner refuse to use my index?
>  - Is there a better method for my problem?
>
>
> Thanks by advance for your help,
> Antoine Millet.
>
>
> [1]
>
> http://stackoverflow.com/questions/1109061/insert-on-duplicate-update-postgresql
>
>
> http://stackoverflow.com/questions/3464750/postgres-upsert-insert-or-update-only-if-value-is-different
>
> [2] http://explain.depesz.com/s/UzW :
>     GroupAggregate  (cost=1167282.380..1294947.770 rows=762182
> width=20) (actual time=20067.661..20067.661 rows=0 loops=1)
>         Filter: (five(*) > 1)
>       ->  Sort  (cost=1167282.380..1186336.910 rows=7621814 width=20)
> (actual time=15663.549..17463.458 rows=7621805 loops=1)
>               Sort Key: delta, kilo, four
>               Sort Method:  external merge  Disk: 223512kB
>             ->  Seq Scan on three  (cost=0.000..139734.140
> rows=7621814 width=20) (actual time=0.041..2093.434 rows=7621805
> loops=1)
>
> [3] http://explain.depesz.com/s/o9P :
>     GroupAggregate  (cost=0.000..11531349.190 rows=762182 width=20)
> (actual time=5307.734..5307.734 rows=0 loops=1)
>         Filter: (five(*) > 1)
>       ->  Index Scan using charlie on three
> (cost=0.000..11422738.330 rows=7621814 width=20) (actual
> time=0.046..2062.952 rows=7621805 loops=1)

Re: Duplicate deletion optimizations

From

Samuel Gendler

Date:

06 January 2012, 16:02:45

On Fri, Jan 6, 2012 at 6:35 AM, <antoine@inaps.org> wrote:

Hello,

I've a table with approximately 50 million rows with a schema like this:

id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass),
t_value integer NOT NULL DEFAULT 0,
t_record integer NOT NULL DEFAULT 0,
output_id integer NOT NULL DEFAULT 0,
count bigint NOT NULL DEFAULT 0,
CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id)

Every 5 minutes, a process have to insert a few thousand of rows in this table,
but sometime, the process have to insert an already existing row (based on
values in the triplet (t_value, t_record, output_id). In this case, the row
must be updated with the new count value. I've tried some solution given on this
stackoverflow question [1] but the insertion rate is always too low for my needs.

So, I've decided to do it in two times:

- I insert all my new data with a COPY command
- When it's done, I run a delete query to remove oldest duplicates

Right now, my delete query look like this:

SELECT min(id) FROM stats_5mn
GROUP BY t_value, t_record, output_id
HAVING count(*) > 1;

The duration of the query on my test machine with approx. 16 million rows is ~18s.

Have you considered doing the insert by doing a bulk insert into a temp table and then pulling rows that don't exist across to the final table in one query and updating rows that do exist in another query? I did a very brief scan of the SO thread and didn't see it suggested. Something like this:

update stats_5mn set count = count + t.count

from temp_table t

where stats_5mn.t_value = t.t_value and stats_5mn.t_record and stats_5mn.output_id = t.output_id;

insert into stats_5mn

select * from temp_table t

where not exists (

select 1 from stats_5mn s

where s.t_value = t.t_value and s.t_record = t.t_record and s.output_id = t.output_id

);

drop table temp_table;

Note - you must do the update before the insert because doing it the other way around will cause every row you just inserted to also be updated.

I'm not sure it'd be markedly faster, but you'd at least be able to retain a unique constraint on the triplet, if desired. And, to my eye, the logic is easier to comprehend. The different query structure may make better use of your index, but I imagine that it is not using it currently because your db isn't configured to accurately reflect the real cost of index use vs sequential scan, so it is incorrectly determining the cost of looking up 7.5 million rows. Its estimate of the row count is correct, so the estimate of the cost must be the problem. We'd need to know more about your current config and hardware specs to be able to even start making suggestions about config changes to correct the problem.

Re: Duplicate deletion optimizations

From

Marc Eberhard

Date:

06 January 2012, 16:23:06

Hi Samuel!

On 6 January 2012 20:02, Samuel Gendler <sgendler@ideasculptor.com> wrote:
> Have you considered doing the insert by doing a bulk insert into a temp
> table and then pulling rows that don't exist across to the final table in
> one query and updating rows that do exist in another query?  I did a very
> brief scan of the SO thread and didn't see it suggested.  Something like
> this:
>
> update stats_5mn set count = count + t.count
> from temp_table t
> where stats_5mn.t_value = t.t_value and stats_5mn.t_record and
> stats_5mn.output_id = t.output_id;
>
> insert into stats_5mn
> select * from temp_table t
> where not exists (
> select 1 from stats_5mn s
> where s.t_value = t.t_value and s.t_record = t.t_record and s.output_id =
> t.output_id
> );
>
> drop table temp_table;

Am I right to assume that the update/insert needs to be placed into a
begin / end transaction block if such batch uploads might happen
concurrently? Doesn't seem to be the case for this question here, but
I like the solution and wonder if it works under more general
circumstances.

What's the overhead of creating and dropping a temporary table? Is it
only worth doing this for a large number of inserted/updated elements?
What if the number of inserts/updates is only a dozen at a time for a
large table (>10M entries)?

Thanks,
Marc

Re: Duplicate deletion optimizations

From

Samuel Gendler

Date:

06 January 2012, 16:38:49

On Fri, Jan 6, 2012 at 12:22 PM, Marc Eberhard <eberhardma@googlemail.com> wrote:

Hi Samuel!

On 6 January 2012 20:02, Samuel Gendler <sgendler@ideasculptor.com> wrote:
> Have you considered doing the insert by doing a bulk insert into a temp
> table and then pulling rows that don't exist across to the final table in
> one query and updating rows that do exist in another query? I did a very
> brief scan of the SO thread and didn't see it suggested. Something like
> this:
>
> update stats_5mn set count = count + t.count
> from temp_table t
> where stats_5mn.t_value = t.t_value and stats_5mn.t_record and
> stats_5mn.output_id = t.output_id;
>
> insert into stats_5mn
> select * from temp_table t
> where not exists (
> select 1 from stats_5mn s
> where s.t_value = t.t_value and s.t_record = t.t_record and s.output_id =
> t.output_id
> );
>
> drop table temp_table;

Am I right to assume that the update/insert needs to be placed into a
begin / end transaction block if such batch uploads might happen
concurrently? Doesn't seem to be the case for this question here, but
I like the solution and wonder if it works under more general
circumstances.

yes, assuming you are concerned about making the insertion atomic. Obviously, a failure in the second query after success in the 1st query would be problematic outside of a transaction, since any attempt to repeat the entire operation would result in repeated updates.

What's the overhead of creating and dropping a temporary table? Is it
only worth doing this for a large number of inserted/updated elements?
What if the number of inserts/updates is only a dozen at a time for a
large table (>10M entries)?

pretty minimal, but enough that doing a handful of rows at a time probably wouldn't be worth it. You'd surely get index usage on a plain insert in such a case, so I'd probably just use an upsert stored proc for doing small numbers of rows - unless you are doing large numbers of inserts, just a few at a time. In that case, I'd try to accumulate them and then do them in bulk. Those are tough questions to answer without a specific context. My real answer is 'try it and see.' You'll always get an answer that is specific to your exact circumstance that way.

By the way, there is definitely a difference between creating a temp table and creating a table temporarily. See the postgres docs about temp tables for specifics, but many databases treat temp tables differently from ordinary tables, so it is worth understanding what those differences are. Temp tables are automatically dropped when a connection (or transaction) is closed. Temp table names are local to the connection, so multiple connections can each create a temp table with the same name without conflict, which is convenient. I believe they are also created in a specific tablespace on disk, etc.

Re: Duplicate deletion optimizations

From

Marc Eberhard

Date:

06 January 2012, 18:20:53

On 6 January 2012 20:38, Samuel Gendler <sgendler@ideasculptor.com> wrote:
> On Fri, Jan 6, 2012 at 12:22 PM, Marc Eberhard <eberhardma@googlemail.com>
> wrote:
>> On 6 January 2012 20:02, Samuel Gendler <sgendler@ideasculptor.com> wrote:
>> > Have you considered doing the insert by doing a bulk insert into a temp
>> > table and then pulling rows that don't exist across to the final table
>> > in
>> > one query and updating rows that do exist in another query?  I did a
>> > very
>> > brief scan of the SO thread and didn't see it suggested.  Something like
>> > this:
>> >
>> > update stats_5mn set count = count + t.count
>> > from temp_table t
>> > where stats_5mn.t_value = t.t_value and stats_5mn.t_record and
>> > stats_5mn.output_id = t.output_id;
>> >
>> > insert into stats_5mn
>> > select * from temp_table t
>> > where not exists (
>> > select 1 from stats_5mn s
>> > where s.t_value = t.t_value and s.t_record = t.t_record and s.output_id
>> > =
>> > t.output_id
>> > );
>> >
>> > drop table temp_table;
>>
>> Am I right to assume that the update/insert needs to be placed into a
>> begin / end transaction block if such batch uploads might happen
>> concurrently? Doesn't seem to be the case for this question here, but
>> I like the solution and wonder if it works under more general
>> circumstances.
>
>
> yes, assuming you are concerned about making the insertion atomic.
>  Obviously, a failure in the second query after success in the 1st query
> would be problematic outside of a transaction, since any attempt to repeat
> the entire operation would result in repeated updates.

True, but I was more concerned about concurrency, where a second
upsert inserts an element between update/insert from the first. That
would then skip the element in the first upsert as it is neither
updated (doesn't exist at that point in time) nor inserted (does
exists at that later point). Or would that be impossible anyway?

>> What's the overhead of creating and dropping a temporary table? Is it
>> only worth doing this for a large number of inserted/updated elements?
>> What if the number of inserts/updates is only a dozen at a time for a
>> large table (>10M entries)?
>
> pretty minimal, but enough that doing a handful of rows at a time probably
> wouldn't be worth it.  You'd surely get index usage on a plain insert in
> such a case, so I'd probably just use an upsert stored proc for doing small
> numbers of rows - unless you are doing large numbers of inserts, just a few
> at a time.  In that case, I'd try to accumulate them and then do them in
> bulk.  Those are tough questions to answer without a specific context.  My
> real answer is 'try it and see.'  You'll always get an answer that is
> specific to your exact circumstance that way.

It's a fairly tricky problem. I have a number of sensors producing
energy data about every 5 minutes, but at random times between 1 and
15 minutes. I can't change that as that's the way the hardware of the
sensors works. These feed into another unit, which accumulates them
and forwards them in batches over the Internet to my PostgreSQL
database server every few minutes (again at random times outside my
control and with random batch sizes). To make things worse, if the
Internet connection between the unit and the database server fails, it
will send the latest data first to provide a quick update to the
current values and then send the backlog of stored values. Thus, data
do not always arrive in correct time order.

At the moment I only look at the latest data for each sensor and these
should be as close to real time as possible. Thus, collecting data for
some time to get a larger size for a batch update isn't preferable.
What I want to do, and this is where the upsert problem starts, is to
build a table with energy values at fixed times. These should be
calculated as a linear interpolation between the nearest reported
values from the sensors. Please note each sensor is reporting a
measured energy value (not instant power), which always increases
monotonically with time. To compare the performance of the different
devices that are measured, I need to have the energy values at the
same time and not at the random times when the sensors report. This
also allows the calculation of average power for the devices by taking
the difference of the energy values over longer periods, like 30
minutes.

What I simply haven't got my head around is how to do this in an
efficient way. When new values arrive, the table of interpolated
values needs to be updated. For some times, there will already be
values in the table, but for other times there won't. Thus, the
upsert. If there was a communication failure, the reported sensor
times will go backwards as the most recent is transmitted first until
the backlog is cleared. In that case the interpolation table will be
populated with intermediate values from the first insert with the
latest timestamp and then these values will be refined by the backlog
data as they trickle in. Under normal circumstances, reported
timestamps will be monotonically increasing and the interpolation
table will simply extend to later times. There are more reads from the
interpolation table than updates as there are many clients watching
the data live via a COMET web frontend (or better will be once I get
this working).

I could try to code all of this in the application code (Tomcat
servlets in my case), but I'd much rather like to find an elegant way
to let the database server populated the interpolation table from the
inserted sensor values. I can find the nearest relevant entries in the
interpolation table to be upserted by using date_trunc() on the
timestamp from the sensor value. But I then also need to find out the
closest sensor value in the database with an earlier and possibly
later timestamp around the fixed times in the interpolation table.
Sometimes a new value will result in an update and sometimes not.
Sometimes a new value needs to be added to the interpolation table and
sometimes not.

I know I'm pushing SQL a bit hard with this type of problem, but doing
it in the application logic would result in quite a few round trips
between the database server and the application code. It's sort of an
intellectual challenge for me to see how much I can offload onto the
database server. Thus, my interest in batch upserts.

Another reason is that I don't want to hold any state or intermediate
data in the application code. I want this in the database as it is
much better in storing things persistently than my own code could ever
be. It was designed to do that properly after all!

> By the way, there is definitely a difference between creating a temp table
> and creating a table temporarily.  See the postgres docs about temp tables

Yes, I'm aware of that and meant a temporary/temp table, but being old
fashioned I prefer the long form, which is also valid syntax.

From the docs (v9.1.2): CREATE ... { TEMPORARY | TEMP } ... TABLE

Thanks,
Marc

Re: Duplicate deletion optimizations

From

Misa Simic

Date:

06 January 2012, 23:28:28

hi,

Maybe these thoughts could help....

1) order by those three columns in your select min query could force
index usage...

2) or

DELETE FROM table
WHERE EXISTS(SELECT id FROM table t WHERE t.id > table.id AND t.col1 =
table.col1 AND t.col2 = table.col2 AND col3 = table.col3)



Sent from my Windows Phone
From: antoine@inaps.org
Sent: 06/01/2012 15:36
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Duplicate deletion optimizations
Hello,

I've a table with approximately 50 million rows with a schema like
this:

     id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass),
     t_value integer NOT NULL DEFAULT 0,
     t_record integer NOT NULL DEFAULT 0,
     output_id integer NOT NULL DEFAULT 0,
     count bigint NOT NULL DEFAULT 0,
     CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id)

Every 5 minutes, a process have to insert a few thousand of rows in
this table,
but sometime, the process have to insert an already existing row (based
on
values in the triplet (t_value, t_record, output_id). In this case, the
row
must be updated with the new count value. I've tried some solution
given on this
stackoverflow question [1] but the insertion rate is always too low for
my needs.

So, I've decided to do it in two times:

  - I insert all my new data with a COPY command
  - When it's done, I run a delete query to remove oldest duplicates

Right now, my delete query look like this:

     SELECT min(id) FROM stats_5mn
     GROUP BY t_value, t_record, output_id
     HAVING count(*) > 1;

The duration of the query on my test machine with approx. 16 million
rows is ~18s.

To reduce this duration, I've tried to add an index on my triplet:

     CREATE INDEX test
       ON stats_5mn
       USING btree
       (t_value , t_record , output_id );

By default, the PostgreSQL planner doesn't want to use my index and do
a sequential
scan [2], but if I force it with "SET enable_seqscan = off", the index
is used [3]
and query duration is lowered to ~5s.


My questions:

  - Why the planner refuse to use my index?
  - Is there a better method for my problem?


Thanks by advance for your help,
Antoine Millet.


[1]
http://stackoverflow.com/questions/1109061/insert-on-duplicate-update-postgresql

http://stackoverflow.com/questions/3464750/postgres-upsert-insert-or-update-only-if-value-is-different

[2] http://explain.depesz.com/s/UzW :
     GroupAggregate  (cost=1167282.380..1294947.770 rows=762182
width=20) (actual time=20067.661..20067.661 rows=0 loops=1)
         Filter: (five(*) > 1)
       ->  Sort  (cost=1167282.380..1186336.910 rows=7621814 width=20)
(actual time=15663.549..17463.458 rows=7621805 loops=1)
               Sort Key: delta, kilo, four
               Sort Method:  external merge  Disk: 223512kB
             ->  Seq Scan on three  (cost=0.000..139734.140 rows=7621814
width=20) (actual time=0.041..2093.434 rows=7621805 loops=1)

[3] http://explain.depesz.com/s/o9P :
     GroupAggregate  (cost=0.000..11531349.190 rows=762182 width=20)
(actual time=5307.734..5307.734 rows=0 loops=1)
         Filter: (five(*) > 1)
       ->  Index Scan using charlie on three  (cost=0.000..11422738.330
rows=7621814 width=20) (actual time=0.046..2062.952 rows=7621805
loops=1)

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: Duplicate deletion optimizations

From

Misa Simic

Date:

07 January 2012, 07:28:06

hi,

Maybe these thoughts could help....

1) order by those three columns in your select min query could force
index usage...

2) or

DELETE FROM table
WHERE EXISTS(SELECT id FROM table t WHERE t.id > table.id AND t.col1 =
table.col1 AND t.col2 = table.col2 AND col3 = table.col3)



Sent from my Windows Phone
From: antoine@inaps.org
Sent: 06/01/2012 15:36
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Duplicate deletion optimizations
Hello,

I've a table with approximately 50 million rows with a schema like
this:

     id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass),
     t_value integer NOT NULL DEFAULT 0,
     t_record integer NOT NULL DEFAULT 0,
     output_id integer NOT NULL DEFAULT 0,
     count bigint NOT NULL DEFAULT 0,
     CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id)

Every 5 minutes, a process have to insert a few thousand of rows in
this table,
but sometime, the process have to insert an already existing row (based
on
values in the triplet (t_value, t_record, output_id). In this case, the
row
must be updated with the new count value. I've tried some solution
given on this
stackoverflow question [1] but the insertion rate is always too low for
my needs.

So, I've decided to do it in two times:

  - I insert all my new data with a COPY command
  - When it's done, I run a delete query to remove oldest duplicates

Right now, my delete query look like this:

     SELECT min(id) FROM stats_5mn
     GROUP BY t_value, t_record, output_id
     HAVING count(*) > 1;

The duration of the query on my test machine with approx. 16 million
rows is ~18s.

To reduce this duration, I've tried to add an index on my triplet:

     CREATE INDEX test
       ON stats_5mn
       USING btree
       (t_value , t_record , output_id );

By default, the PostgreSQL planner doesn't want to use my index and do
a sequential
scan [2], but if I force it with "SET enable_seqscan = off", the index
is used [3]
and query duration is lowered to ~5s.


My questions:

  - Why the planner refuse to use my index?
  - Is there a better method for my problem?


Thanks by advance for your help,
Antoine Millet.


[1]
http://stackoverflow.com/questions/1109061/insert-on-duplicate-update-postgresql

http://stackoverflow.com/questions/3464750/postgres-upsert-insert-or-update-only-if-value-is-different

[2] http://explain.depesz.com/s/UzW :
     GroupAggregate  (cost=1167282.380..1294947.770 rows=762182
width=20) (actual time=20067.661..20067.661 rows=0 loops=1)
         Filter: (five(*) > 1)
       ->  Sort  (cost=1167282.380..1186336.910 rows=7621814 width=20)
(actual time=15663.549..17463.458 rows=7621805 loops=1)
               Sort Key: delta, kilo, four
               Sort Method:  external merge  Disk: 223512kB
             ->  Seq Scan on three  (cost=0.000..139734.140 rows=7621814
width=20) (actual time=0.041..2093.434 rows=7621805 loops=1)

[3] http://explain.depesz.com/s/o9P :
     GroupAggregate  (cost=0.000..11531349.190 rows=762182 width=20)
(actual time=5307.734..5307.734 rows=0 loops=1)
         Filter: (five(*) > 1)
       ->  Index Scan using charlie on three  (cost=0.000..11422738.330
rows=7621814 width=20) (actual time=0.046..2062.952 rows=7621805
loops=1)

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: Duplicate deletion optimizations

From

Jochen Erwied

Date:

07 January 2012, 07:58:14

Friday, January 6, 2012, 4:21:06 PM you wrote:

>> Every 5 minutes, a process have to insert a few thousand of rows in this
>> table, but sometime, the process have to insert an already existing row
>> (based on values in the triplet (t_value, t_record, output_id). In this
>> case, the row must be updated with the new count value. I've tried some
>> solution given on this stackoverflow question [1] but the insertion rate
>> is always too low for my needs.

I did check the following in a loop, starting with an empty table, and
inserting/updating 50000 random unique entries. After 15 minutes I've got
about 10 million records, each loop takes about 3 seconds. After 30 minutes
the table contains approx. 18 million entries, time per loop only slightly
increased. After 90 minutes the database has about 30 million entries. The
speed has dropped to about 15-20 seconds per loop, but the server is doing
lots of other queries in parallel, so with an unloaded server the updates
should still take less than 10 seconds.

The generator runs in perl, and generates records for a maximum of 100
million different entries:

use strict;

srand time;
my $i = 0;
open FD, ">data.in";
for (1..50000)
{
        $i += rand(2000);
        print FD sprintf("%d\t%d\t%d\t%d\n", $i/65536, ($i/256)%255, $i%255, rand(1000));
}
close FD;

The SQL-script looks like this:

\timing on
begin;
create temp table t_imp(id bigint,t_value integer,t_record integer,output_id integer,count bigint);
\copy t_imp (t_value, t_record, output_id, count) from 'data.in'
--an index is not really needed, table is in memory anyway
--create index t_imp_ix on t_imp(t_value,t_record,output_id);

-- find matching rows
update t_imp
       set id=test.id
       from test
       where (t_imp.t_value,t_imp.t_record,t_imp.output_id)=(test.t_value,test.t_record,test.output_id);
-- update matching rows using primary key
update test
       set count=t_imp.count
       from t_imp
       where t_imp.id is null and test.id=t_imp.id;
-- insert missing rows
insert into test(t_value,t_record,output_id,count)
       select t_value,t_record,output_id,count
              from t_imp
              where id is null;
commit;

Advantages of this solution:

- all updates are done in-place, no index modifications (except for the
  inserts, of course)
- big table only gets inserts
- no dead tuples from deletes
- completely avoids sequential scans on the big table

Tested on my home server (8GB RAM, 3GB shared memory, Dual-Xeon 5110, 1.6
GHz, table and indices stored on a SSD)

Table statistics:

relid             | 14332525
schemaname        | public
relname           | test
seq_scan          | 8
seq_tup_read      | 111541821
idx_scan          | 149240169
idx_tup_fetch     | 117901695
n_tup_ins         | 30280175
n_tup_upd         | 0
n_tup_del         | 0
n_tup_hot_upd     | 0
n_live_tup        | 30264431
n_dead_tup        | 0
last_vacuum       |
last_autovacuum   |
last_analyze      |
last_autoanalyze  | 2012-01-07 12:38:49.593651+01
vacuum_count      | 0
autovacuum_count  | 0
analyze_count     | 0
autoanalyze_count | 31

The sequential scans were from some 'select count(*)' in between.

HTH.

--
Jochen Erwied     |   home: jochen@erwied.eu     +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work: joe@mbs-software.de  +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile: jochen.erwied@vodafone.de       +49-173-5404164

Re: Duplicate deletion optimizations

From

"Pierre C"

Date:

07 January 2012, 08:20:24

> It's a fairly tricky problem. I have a number of sensors producing
> energy data about every 5 minutes, but at random times between 1 and
> 15 minutes. I can't change that as that's the way the hardware of the
> sensors works. These feed into another unit, which accumulates them
> and forwards them in batches over the Internet to my PostgreSQL
> database server every few minutes (again at random times outside my
> control and with random batch sizes). To make things worse, if the
> Internet connection between the unit and the database server fails, it
> will send the latest data first to provide a quick update to the
> current values and then send the backlog of stored values. Thus, data
> do not always arrive in correct time order.

I'm stuck home with flu, so I'm happy to help ;)

I'll build an example setup to make it clearer...

-- A list of all sensors
create table sensors( sensor_id integer primary key );
insert into sensors select generate_series(1,100);

-- A table to contain raw sensor data
create table log(
   sensor_id integer not null references sensors(sensor_id),
   time integer not null,
   value float not null
);

-- Fill it up with test data
insert into log
select sensor_id, time, time from (
   select distinct sensor_id,
     (n+random()*10)::INTEGER as time
   from generate_series(0,50000,5) n
    cross join sensors
) d;

-- index it
alter table log add primary key( time, sensor_id );
create index log_sensor_time on log( sensor_id, time );

select * from log where sensor_id=1 order by time;
  sensor_id | time  | value
-----------+-------+-------
          1 |    12 |    12
          1 |    14 |    14
          1 |    21 |    21
          1 |    29 |    29
          1 |    30 |    30
(....)
          1 | 49996 | 49996
          1 | 50001 | 50001

-- create a table which will contain the time ticks
-- which will be used as x-axis for interpolation
-- (in this example, one tick every 10 time units)

create table ticks( time integer primary key,
    check( time%10 = 0 ) );
insert into ticks select
   generate_series( 0, (select max(time) from log), 10 );

-- create interpolated values table
create table interp(
   sensor_id integer not null references sensors( sensor_id ),
   time      integer not null references ticks( time ),
   value     float,
   distance  integer not null
);

-- fill interpolated values table
-- (pretty slow)

insert into interp
select
     sensor_id,
     t.time,
     start_value +
(end_value-start_value)*(t.time-start_time)/(end_time-start_time),
     greatest( t.time - start_time, end_time-t.time )
   from
     (select
       sensor_id,
       lag(time) over (partition by sensor_id order by time) as start_time,
       time as end_time,
       lag(value) over (partition by sensor_id order by time) as
start_value,
       value as end_value
     from log
     ) as l
   join ticks t on (t.time >= start_time and t.time < end_time);

-- alternate query if you don't like the ticks table (same sesult) :
insert into interp
select
     sensor_id,
     time,
     start_value +
(end_value-start_value)*(time-start_time)/(end_time-start_time),
     greatest( time - start_time, end_time-time )
   from
   (select
     *,
     generate_series( ((start_time+9)/10)*10, ((end_time-1)/10)*10, 10 ) AS
time
     from
       (select
         sensor_id,
         lag(time) over (partition by sensor_id order by time) as
start_time,
         time as end_time,
         lag(value) over (partition by sensor_id order by time) as
start_value,
         value as end_value
       from log
       ) as l
     ) l;

alter table interp add primary key( time,sensor_id );
create index interp_sensor_time on interp( sensor_id, time );

For each interval in the log table that contains a time tick, this query
generates the interpolated data at that tick.

Note that the "distance" field represents the distance (in time) between
the interpolated value and the farthest real data point that was used to
calculate it. Therefore, it can be used as a measure of the quality of the
interpolated point ; if the distance is greater than some threshold, the
value might not be that precise.

Now, suppose we receive a bunch of data. The data isn't ordered according
to time.
There are two possibilities :

- the new data starts right where we left off (ie, just after the last
time for each sensor in table log)
- the new data starts later in time, and we want to process the results
right away, expecting to receive, at some later point, older data to fill
the holes

The second one is hairier, lets' do that.

Anyway, let's create a packet :

-- A table to contain raw sensor data
create temporary table packet(
   sensor_id integer not null,
   time integer not null,
   value float not null
);

-- Fill it up with test data
insert into packet
select sensor_id, time, time from (
   select distinct sensor_id,
     (n+random()*10)::INTEGER as time
   from generate_series(50200,50400) n
    cross join sensors
) d;

Note that I deliberately inserted a hole : the log table contains times
0-50000 and the packet contains times 50200-50400.

We'll need to decide if we want the hole to appear in the "interp" table
or not. Let's say we don't want it to appear, we'll just interpolate over
the hole. he "distance" column will be there so we don't forget this data
is some sort of guess. If we receive data to fill that hole later, we can
always use it.

For each sensor in the packet, we need to grab some entries from table
"log", at least the most recent one, to be able to do some interpolation
with the first (oldest) value in the packet. To be more general, in case
we receive old data that will plug a hole, we'll also grab the oldest log
entry that is more recent than the most recent one in the packet for this
sensor (hum... i have to re-read that...)

Anyway, first let's create the missing ticks :

INSERT INTO ticks
   SELECT generate_series(
     (SELECT max(time) FROM ticks)+10,
     (SELECT max(time) FROM packet),
     10);

And ...

CREATE TEMPORARY TABLE new_interp(
   sensor_id INTEGER NOT NULL,
   time INTEGER NOT NULL,
   value FLOAT NOT NULL,
   distance INTEGER NOT NULL
);

-- time range in the packet for each sensor
WITH ranges AS (
   SELECT sensor_id, min(time) AS packet_start_time, max(time) AS
packet_end_time
   FROM packet
   GROUP BY sensor_id
),
-- time ranges for records already in table log that will be needed to
interpolate packet records
log_boundaries AS (
   SELECT
     sensor_id,
     COALESCE(
       (SELECT max(l.time) FROM log l WHERE l.sensor_id=r.sensor_id AND
l.time < r.packet_start_time),
       r.packet_start_time
     ) AS packet_start_time,
     COALESCE(
       (SELECT min(l.time) FROM log l WHERE l.sensor_id=r.sensor_id AND
l.time > r.packet_end_time),
       r.packet_end_time
     ) AS packet_end_time
   FROM ranges r
),
-- merge existing and new data
extended_packet AS (
   SELECT log.* FROM log JOIN log_boundaries USING (sensor_id)
     WHERE log.time BETWEEN packet_start_time AND packet_end_time
   UNION ALL
   SELECT * FROM packet
),
-- zip current and next records
pre_interp AS (
   SELECT
         sensor_id,
         lag(time) OVER (PARTITION BY sensor_id ORDER BY time) AS
start_time,
         time AS end_time,
         lag(value) over (PARTITION BY sensor_id ORDER BY time) AS
start_value,
         value AS end_value
       FROM extended_packet
),
-- add tick info
pre_interp2 AS (
   SELECT *, generate_series( ((start_time+9)/10)*10, ((end_time-1)/10)*10,
10 ) AS time
   FROM pre_interp
)
-- interpolate
INSERT INTO new_interp SELECT
     sensor_id,
     time,
     start_value +
(end_value-start_value)*(time-start_time)/(end_time-start_time) AS value,
     greatest( time - start_time, end_time-time ) AS distance
 FROM pre_interp2;

Although this query is huge, it's very fast, since it doesn't hit the big
tables with any seq scans (hence the max() and min() tricks to use the
indexes instead).

I love how postgres can blast that huge pile of SQL in, like, 50 ms...

If there is some overlap between packet data and data already in the log,
you might get some division by zero errors, in this case you'll need to
apply a DISTINCT somewhere (or simply replace the UNION ALL with an UNION,
which might be wiser anyway...)

Anyway, that doesn't solve the "upsert" problem, so here we go :

-- Update the existing rows
UPDATE interp
   SET value = new_interp.value, distance = new_interp.distance
   FROM new_interp
   WHERE interp.sensor_id = new_interp.sensor_id
     AND interp.time = new_interp.time
     AND interp.distance > new_interp.distance;

-- insert new rows
INSERT INTO interp
SELECT new_interp.* FROM new_interp
   LEFT JOIN interp USING (sensor_id,time)
   WHERE interp.sensor_id IS NULL;

-- also insert data into log (don't forget this !)
INSERT INTO log SELECT * FROM packet;

Tada.

select * from interp where sensor_id=1 and time > 49950 order by time;
  sensor_id | time  | value | distance
-----------+-------+-------+----------
          1 | 49960 | 49960 |        7
          1 | 49970 | 49970 |        4
          1 | 49980 | 49980 |        3
          1 | 49990 | 49990 |        5
          1 | 50000 | 50000 |        2
          1 | 50010 | 50010 |      190
          1 | 50020 | 50020 |      180
          1 | 50030 | 50030 |      170
(...)
          1 | 50180 | 50180 |      178
          1 | 50190 | 50190 |      188
          1 | 50200 | 50200 |        2
          1 | 50210 | 50210 |        1
          1 | 50220 | 50220 |        1
          1 | 50230 | 50230 |        1
          1 | 50240 | 50240 |        2

Note that the hole was interpolated over, but the "distance" column shows
this data is a guess, not real.

What happens if we receive some data later to plug the hole ?

-- plug the previously left hole
truncate packet;
truncate new_interp;
insert into packet
select sensor_id, time, time from (
   select distinct sensor_id,
     (n+random()*10)::INTEGER as time
   from generate_series(50050,50150) n
    cross join sensors
) d;

(re-run huge query and upsert)

select * from interp where sensor_id=1 and time > 49950 order by time;
sensor_id | time  | value | distance
-----------+-------+-------+----------
          1 | 49960 | 49960 |        7
          1 | 49970 | 49970 |        4
          1 | 49980 | 49980 |        3
          1 | 49990 | 49990 |        5
          1 | 50000 | 50000 |        2
          1 | 50010 | 50010 |       45
          1 | 50020 | 50020 |       35
          1 | 50030 | 50030 |       28
          1 | 50040 | 50040 |       38
          1 | 50050 | 50050 |       48
          1 | 50060 | 50060 |        1
          1 | 50070 | 50070 |        1
          1 | 50080 | 50080 |        2
(...)
          1 | 50130 | 50130 |        1
          1 | 50140 | 50140 |        3
          1 | 50150 | 50150 |        1
          1 | 50160 | 50160 |       40
          1 | 50170 | 50170 |       30
          1 | 50180 | 50180 |       26
          1 | 50190 | 50190 |       36
          1 | 50200 | 50200 |        2
          1 | 50210 | 50210 |        1
          1 | 50220 | 50220 |        1
          1 | 50230 | 50230 |        1
          1 | 50240 | 50240 |        2

It has used the new data to rewrite new values over the entire hole, and
those values should have better precision.

Enjoy !

Re: Duplicate deletion optimizations

From

"Marc Mamin"

Date:

07 January 2012, 08:51:16

Yes, but it should become a bit slower if you fix your code :-)

where t_imp.id is null and test.id=t_imp.id;
=>
where t_imp.id is not null and test.id=t_imp.id;

and a partial index on matching rows might help (should be tested):

(after the first updat)
create index t_imp_ix on t_imp(t_value,t_record,output_id) where t_imp.id is not null.

regards,
Marc Mamin

-----Ursprüngliche Nachricht-----
Von: pgsql-performance-owner@postgresql.org im Auftrag von Jochen Erwied
Gesendet: Sa 1/7/2012 12:57
An: antoine@inaps.org
Cc: pgsql-performance@postgresql.org
Betreff: Re: [PERFORM] Duplicate deletion optimizations

Friday, January 6, 2012, 4:21:06 PM you wrote:

>> Every 5 minutes, a process have to insert a few thousand of rows in this
>> table, but sometime, the process have to insert an already existing row
>> (based on values in the triplet (t_value, t_record, output_id). In this
>> case, the row must be updated with the new count value. I've tried some
>> solution given on this stackoverflow question [1] but the insertion rate
>> is always too low for my needs.

I did check the following in a loop, starting with an empty table, and
inserting/updating 50000 random unique entries. After 15 minutes I've got
about 10 million records, each loop takes about 3 seconds. After 30 minutes
the table contains approx. 18 million entries, time per loop only slightly
increased. After 90 minutes the database has about 30 million entries. The
speed has dropped to about 15-20 seconds per loop, but the server is doing
lots of other queries in parallel, so with an unloaded server the updates
should still take less than 10 seconds.

The generator runs in perl, and generates records for a maximum of 100
million different entries:

use strict;

srand time;
my $i = 0;
open FD, ">data.in";
for (1..50000)
{
        $i += rand(2000);
        print FD sprintf("%d\t%d\t%d\t%d\n", $i/65536, ($i/256)%255, $i%255, rand(1000));
}
close FD;

The SQL-script looks like this:

\timing on
begin;
create temp table t_imp(id bigint,t_value integer,t_record integer,output_id integer,count bigint);
\copy t_imp (t_value, t_record, output_id, count) from 'data.in'
--an index is not really needed, table is in memory anyway
--create index t_imp_ix on t_imp(t_value,t_record,output_id);

-- find matching rows
update t_imp
       set id=test.id
       from test
       where (t_imp.t_value,t_imp.t_record,t_imp.output_id)=(test.t_value,test.t_record,test.output_id);
-- update matching rows using primary key
update test
       set count=t_imp.count
       from t_imp
       where t_imp.id is null and test.id=t_imp.id;
-- insert missing rows
insert into test(t_value,t_record,output_id,count)
       select t_value,t_record,output_id,count
              from t_imp
              where id is null;
commit;

Advantages of this solution:

- all updates are done in-place, no index modifications (except for the
inserts, of course)
- big table only gets inserts
- no dead tuples from deletes
- completely avoids sequential scans on the big table

Tested on my home server (8GB RAM, 3GB shared memory, Dual-Xeon 5110, 1.6
GHz, table and indices stored on a SSD)

Table statistics:

relid             | 14332525
schemaname        | public
relname           | test
seq_scan          | 8
seq_tup_read      | 111541821
idx_scan          | 149240169
idx_tup_fetch     | 117901695
n_tup_ins         | 30280175
n_tup_upd         | 0
n_tup_del         | 0
n_tup_hot_upd     | 0
n_live_tup        | 30264431
n_dead_tup        | 0
last_vacuum       |
last_autovacuum   |
last_analyze      |
last_autoanalyze | 2012-01-07 12:38:49.593651+01
vacuum_count      | 0
autovacuum_count | 0
analyze_count     | 0
autoanalyze_count | 31

The sequential scans were from some 'select count(*)' in between.

HTH.

--
Jochen Erwied     |   home: jochen@erwied.eu     +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work: joe@mbs-software.de +49-2151-7294-24, FAX: -50
D-45470 Muelheim | mobile: jochen.erwied@vodafone.de       +49-173-5404164

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: Duplicate deletion optimizations

From

Misa Simic

Date:

07 January 2012, 10:02:32

If solution with temp table is acceptable - i think steps could be
reduced...

=E2=80=A2 copy to temp_imp ( temp table does not have id column)

=E2=80=A2 update live set count =3D temp_imp.count from temp_imp using (
col1,col2,col3)

=E2=80=A2 insert into live from temp where col1, col2 and col3 not exists in
live

Kind Regards,

Misa

Sent from my Windows Phone
From: Jochen Erwied
Sent: 07/01/2012 12:58
To: antoine@inaps.org
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Duplicate deletion optimizations
Friday, January 6, 2012, 4:21:06 PM you wrote:

>> Every 5 minutes, a process have to insert a few thousand of rows in this
>> table, but sometime, the process have to insert an already existing row
>> (based on values in the triplet (t_value, t_record, output_id). In this
>> case, the row must be updated with the new count value. I've tried some
>> solution given on this stackoverflow question [1] but the insertion rate
>> is always too low for my needs.

I did check the following in a loop, starting with an empty table, and
inserting/updating 50000 random unique entries. After 15 minutes I've got
about 10 million records, each loop takes about 3 seconds. After 30 minutes
the table contains approx. 18 million entries, time per loop only slightly
increased. After 90 minutes the database has about 30 million entries. The
speed has dropped to about 15-20 seconds per loop, but the server is doing
lots of other queries in parallel, so with an unloaded server the updates
should still take less than 10 seconds.

The generator runs in perl, and generates records for a maximum of 100
million different entries:

use strict;

srand time;
my $i =3D 0;
open FD, ">data.in";
for (1..50000)
{
        $i +=3D rand(2000);
        print FD sprintf("%d\t%d\t%d\t%d\n", $i/65536, ($i/256)%255,
$i%255, rand(1000));
}
close FD;

The SQL-script looks like this:

\timing on
begin;
create temp table t_imp(id bigint,t_value integer,t_record
integer,output_id integer,count bigint);
\copy t_imp (t_value, t_record, output_id, count) from 'data.in'
--an index is not really needed, table is in memory anyway
--create index t_imp_ix on t_imp(t_value,t_record,output_id);

-- find matching rows
update t_imp
       set id=3Dtest.id
       from test
       where (t_imp.t_value,t_imp.t_record,t_imp.output_id)=3D(test.t_value=
,test.t_record,test.output_id);
-- update matching rows using primary key
update test
       set count=3Dt_imp.count
       from t_imp
       where t_imp.id is null and test.id=3Dt_imp.id;
-- insert missing rows
insert into test(t_value,t_record,output_id,count)
       select t_value,t_record,output_id,count
              from t_imp
              where id is null;
commit;

Advantages of this solution:

- all updates are done in-place, no index modifications (except for the
  inserts, of course)
- big table only gets inserts
- no dead tuples from deletes
- completely avoids sequential scans on the big table

Tested on my home server (8GB RAM, 3GB shared memory, Dual-Xeon 5110, 1.6
GHz, table and indices stored on a SSD)

Table statistics:

relid             | 14332525
schemaname        | public
relname           | test
seq_scan          | 8
seq_tup_read      | 111541821
idx_scan          | 149240169
idx_tup_fetch     | 117901695
n_tup_ins         | 30280175
n_tup_upd         | 0
n_tup_del         | 0
n_tup_hot_upd     | 0
n_live_tup        | 30264431
n_dead_tup        | 0
last_vacuum       |
last_autovacuum   |
last_analyze      |
last_autoanalyze  | 2012-01-07 12:38:49.593651+01
vacuum_count      | 0
autovacuum_count  | 0
analyze_count     | 0
autoanalyze_count | 31

The sequential scans were from some 'select count(*)' in between.

HTH.

--=20
Jochen Erwied     |   home: jochen@erwied.eu     +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work: joe@mbs-software.de  +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile: jochen.erwied@vodafone.de       +49-173-5404164


--=20
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: Duplicate deletion optimizations

From

Jochen Erwied

Date:

07 January 2012, 10:18:46

Saturday, January 7, 2012, 3:02:10 PM you wrote:

> • insert into live from temp where col1, col2 and col3 not exists in
> live

'not exists' is something I'm trying to avoid, even if the optimizer is
able to handle it.

--
Jochen Erwied     |   home: jochen@erwied.eu     +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work: joe@mbs-software.de  +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile: jochen.erwied@vodafone.de       +49-173-5404164

Re: Duplicate deletion optimizations

From

"Pierre C"

Date:

07 January 2012, 11:31:44

> It's a fairly tricky problem. I have a number of sensors producing
> energy data about every 5 minutes, but at random times between 1 and
> 15 minutes. I can't change that as that's the way the hardware of the
> sensors works. These feed into another unit, which accumulates them
> and forwards them in batches over the Internet to my PostgreSQL
> database server every few minutes (again at random times outside my
> control and with random batch sizes). To make things worse, if the
> Internet connection between the unit and the database server fails, it
> will send the latest data first to provide a quick update to the
> current values and then send the backlog of stored values. Thus, data
> do not always arrive in correct time order.

I'm stuck home with flu, so I'm happy to help ;)

I'll build an example setup to make it clearer...

-- A list of all sensors
create table sensors( sensor_id integer primary key );
insert into sensors select generate_series(1,100);

-- A table to contain raw sensor data
create table log(
     sensor_id integer not null references sensors(sensor_id),
     time integer not null,
     value float not null
);

-- Fill it up with test data
insert into log
select sensor_id, time, time from (
     select distinct sensor_id,
       (n+random()*10)::INTEGER as time
     from generate_series(0,50000,5) n
      cross join sensors
) d;

-- index it
alter table log add primary key( time, sensor_id );
create index log_sensor_time on log( sensor_id, time );

select * from log where sensor_id=1 order by time;
    sensor_id | time  | value
-----------+-------+-------
            1 |    12 |    12
            1 |    14 |    14
            1 |    21 |    21
            1 |    29 |    29
            1 |    30 |    30
(....)
            1 | 49996 | 49996
            1 | 50001 | 50001

-- create a table which will contain the time ticks
-- which will be used as x-axis for interpolation
-- (in this example, one tick every 10 time units)

create table ticks( time integer primary key,
      check( time%10 = 0 ) );
insert into ticks select
     generate_series( 0, (select max(time) from log), 10 );

-- create interpolated values table
create table interp(
     sensor_id integer not null references sensors( sensor_id ),
     time      integer not null references ticks( time ),
     value     float,
     distance  integer not null
);

-- fill interpolated values table
-- (pretty slow)

insert into interp
select
       sensor_id,
       t.time,
       start_value +
(end_value-start_value)*(t.time-start_time)/(end_time-start_time),
       greatest( t.time - start_time, end_time-t.time )
     from
       (select
         sensor_id,
         lag(time) over (partition by sensor_id order by time) as
start_time,
         time as end_time,
         lag(value) over (partition by sensor_id order by time) as
start_value,
         value as end_value
       from log
       ) as l
     join ticks t on (t.time >= start_time and t.time < end_time);

-- alternate query if you don't like the ticks table (same sesult) :
insert into interp
select
       sensor_id,
       time,
       start_value +
(end_value-start_value)*(time-start_time)/(end_time-start_time),
       greatest( time - start_time, end_time-time )
     from
     (select
       *,
       generate_series( ((start_time+9)/10)*10, ((end_time-1)/10)*10, 10 )
AS
time
       from
         (select
           sensor_id,
           lag(time) over (partition by sensor_id order by time) as
start_time,
           time as end_time,
           lag(value) over (partition by sensor_id order by time) as
start_value,
           value as end_value
         from log
         ) as l
       ) l;

alter table interp add primary key( time,sensor_id );
create index interp_sensor_time on interp( sensor_id, time );

For each interval in the log table that contains a time tick, this query
generates the interpolated data at that tick.

Note that the "distance" field represents the distance (in time) between
the interpolated value and the farthest real data point that was used to
calculate it. Therefore, it can be used as a measure of the quality of the
interpolated point ; if the distance is greater than some threshold, the
value might not be that precise.

Now, suppose we receive a bunch of data. The data isn't ordered according
to time.
There are two possibilities :

- the new data starts right where we left off (ie, just after the last
time for each sensor in table log)
- the new data starts later in time, and we want to process the results
right away, expecting to receive, at some later point, older data to fill
the holes

The second one is hairier, lets' do that.

Anyway, let's create a packet :

-- A table to contain raw sensor data
create temporary table packet(
     sensor_id integer not null,
     time integer not null,
     value float not null
);

-- Fill it up with test data
insert into packet
select sensor_id, time, time from (
     select distinct sensor_id,
       (n+random()*10)::INTEGER as time
     from generate_series(50200,50400) n
      cross join sensors
) d;

Note that I deliberately inserted a hole : the log table contains times
0-50000 and the packet contains times 50200-50400.

We'll need to decide if we want the hole to appear in the "interp" table
or not. Let's say we don't want it to appear, we'll just interpolate over
the hole. he "distance" column will be there so we don't forget this data
is some sort of guess. If we receive data to fill that hole later, we can
always use it.

For each sensor in the packet, we need to grab some entries from table
"log", at least the most recent one, to be able to do some interpolation
with the first (oldest) value in the packet. To be more general, in case
we receive old data that will plug a hole, we'll also grab the oldest log
entry that is more recent than the most recent one in the packet for this
sensor (hum... i have to re-read that...)

Anyway, first let's create the missing ticks :

INSERT INTO ticks
     SELECT generate_series(
       (SELECT max(time) FROM ticks)+10,
       (SELECT max(time) FROM packet),
       10);

And ...

CREATE TEMPORARY TABLE new_interp(
     sensor_id INTEGER NOT NULL,
     time INTEGER NOT NULL,
     value FLOAT NOT NULL,
     distance INTEGER NOT NULL
);

-- time range in the packet for each sensor
WITH ranges AS (
     SELECT sensor_id, min(time) AS packet_start_time, max(time) AS
packet_end_time
     FROM packet
     GROUP BY sensor_id
),
-- time ranges for records already in table log that will be needed to
interpolate packet records
log_boundaries AS (
     SELECT
       sensor_id,
       COALESCE(
         (SELECT max(l.time) FROM log l WHERE l.sensor_id=r.sensor_id AND
l.time < r.packet_start_time),
         r.packet_start_time
       ) AS packet_start_time,
       COALESCE(
         (SELECT min(l.time) FROM log l WHERE l.sensor_id=r.sensor_id AND
l.time > r.packet_end_time),
         r.packet_end_time
       ) AS packet_end_time
     FROM ranges r
),
-- merge existing and new data
extended_packet AS (
     SELECT log.* FROM log JOIN log_boundaries USING (sensor_id)
       WHERE log.time BETWEEN packet_start_time AND packet_end_time
     UNION ALL
     SELECT * FROM packet
),
-- zip current and next records
pre_interp AS (
     SELECT
           sensor_id,
           lag(time) OVER (PARTITION BY sensor_id ORDER BY time) AS
start_time,
           time AS end_time,
           lag(value) over (PARTITION BY sensor_id ORDER BY time) AS
start_value,
           value AS end_value
         FROM extended_packet
),
-- add tick info
pre_interp2 AS (
     SELECT *, generate_series( ((start_time+9)/10)*10,
((end_time-1)/10)*10,
10 ) AS time
     FROM pre_interp
)
-- interpolate
INSERT INTO new_interp SELECT
       sensor_id,
       time,
       start_value +
(end_value-start_value)*(time-start_time)/(end_time-start_time) AS value,
       greatest( time - start_time, end_time-time ) AS distance
   FROM pre_interp2;

Although this query is huge, it's very fast, since it doesn't hit the big
tables with any seq scans (hence the max() and min() tricks to use the
indexes instead).

I love how postgres can blast that huge pile of SQL in, like, 50 ms...

If there is some overlap between packet data and data already in the log,
you might get some division by zero errors, in this case you'll need to
apply a DISTINCT somewhere (or simply replace the UNION ALL with an UNION,
which might be wiser anyway...)

Anyway, that doesn't solve the "upsert" problem, so here we go :

-- Update the existing rows
UPDATE interp
     SET value = new_interp.value, distance = new_interp.distance
     FROM new_interp
     WHERE interp.sensor_id = new_interp.sensor_id
       AND interp.time = new_interp.time
       AND interp.distance > new_interp.distance;

-- insert new rows
INSERT INTO interp
SELECT new_interp.* FROM new_interp
     LEFT JOIN interp USING (sensor_id,time)
     WHERE interp.sensor_id IS NULL;

-- also insert data into log (don't forget this !)
INSERT INTO log SELECT * FROM packet;

Tada.

select * from interp where sensor_id=1 and time > 49950 order by time;
    sensor_id | time  | value | distance
-----------+-------+-------+----------
            1 | 49960 | 49960 |        7
            1 | 49970 | 49970 |        4
            1 | 49980 | 49980 |        3
            1 | 49990 | 49990 |        5
            1 | 50000 | 50000 |        2
            1 | 50010 | 50010 |      190
            1 | 50020 | 50020 |      180
            1 | 50030 | 50030 |      170
(...)
            1 | 50180 | 50180 |      178
            1 | 50190 | 50190 |      188
            1 | 50200 | 50200 |        2
            1 | 50210 | 50210 |        1
            1 | 50220 | 50220 |        1
            1 | 50230 | 50230 |        1
            1 | 50240 | 50240 |        2

Note that the hole was interpolated over, but the "distance" column shows
this data is a guess, not real.

What happens if we receive some data later to plug the hole ?

-- plug the previously left hole
truncate packet;
truncate new_interp;
insert into packet
select sensor_id, time, time from (
     select distinct sensor_id,
       (n+random()*10)::INTEGER as time
     from generate_series(50050,50150) n
      cross join sensors
) d;

(re-run huge query and upsert)

select * from interp where sensor_id=1 and time > 49950 order by time;
sensor_id | time  | value | distance
-----------+-------+-------+----------
            1 | 49960 | 49960 |        7
            1 | 49970 | 49970 |        4
            1 | 49980 | 49980 |        3
            1 | 49990 | 49990 |        5
            1 | 50000 | 50000 |        2
            1 | 50010 | 50010 |       45
            1 | 50020 | 50020 |       35
            1 | 50030 | 50030 |       28
            1 | 50040 | 50040 |       38
            1 | 50050 | 50050 |       48
            1 | 50060 | 50060 |        1
            1 | 50070 | 50070 |        1
            1 | 50080 | 50080 |        2
(...)
            1 | 50130 | 50130 |        1
            1 | 50140 | 50140 |        3
            1 | 50150 | 50150 |        1
            1 | 50160 | 50160 |       40
            1 | 50170 | 50170 |       30
            1 | 50180 | 50180 |       26
            1 | 50190 | 50190 |       36
            1 | 50200 | 50200 |        2
            1 | 50210 | 50210 |        1
            1 | 50220 | 50220 |        1
            1 | 50230 | 50230 |        1
            1 | 50240 | 50240 |        2

It has used the new data to rewrite new values over the entire hole, and
those values should have better precision.

Enjoy !

Re: Duplicate deletion optimizations

From

Jeff Janes

Date:

07 January 2012, 14:55:10

On Fri, Jan 6, 2012 at 6:35 AM,  <antoine@inaps.org> wrote:
> Hello,
>
> I've a table with approximately 50 million rows with a schema like this:
>
>    id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass),
>    t_value integer NOT NULL DEFAULT 0,
>    t_record integer NOT NULL DEFAULT 0,
>    output_id integer NOT NULL DEFAULT 0,
>    count bigint NOT NULL DEFAULT 0,
>    CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id)
>
> Every 5 minutes, a process have to insert a few thousand of rows in this
> table,
> but sometime, the process have to insert an already existing row (based on
> values in the triplet (t_value, t_record, output_id). In this case, the row
> must be updated with the new count value. I've tried some solution given on
> this
> stackoverflow question [1] but the insertion rate is always too low for my
> needs.

What are your needs?  It should take no special hardware or coding to
be able to manage a few thousand rows over 5 minutes.

> So, I've decided to do it in two times:
>
>  - I insert all my new data with a COPY command
>  - When it's done, I run a delete query to remove oldest duplicates
>
> Right now, my delete query look like this:
>
>    SELECT min(id) FROM stats_5mn
>    GROUP BY t_value, t_record, output_id
>    HAVING count(*) > 1;
>
> The duration of the query on my test machine with approx. 16 million rows is
> ~18s.
>
> To reduce this duration, I've tried to add an index on my triplet:
>
>    CREATE INDEX test
>      ON stats_5mn
>      USING btree
>      (t_value , t_record , output_id );
>
> By default, the PostgreSQL planner doesn't want to use my index and do a
> sequential
> scan [2], but if I force it with "SET enable_seqscan = off", the index is
> used [3]
> and query duration is lowered to ~5s.
>
>
> My questions:
>
>  - Why the planner refuse to use my index?

It thinks that using the index will be about 9 times more expensive
than the full scan.  Probably your settings for seq_page_cost and
random_page_cost are such that the planner thinks that nearly every
buffer read is going to be from disk.  But in reality (in this case)
your data is all in memory.  So the planner is mis-estimating.  (It
would help verify this if you did your EXPLAIN ANALYZE with BUFFERS as
well).  But before trying to fix this by tweaking settings, will the
real case always be like your test case?  If the data stops being all
in memory, either because the problem size increases or because you
have to compete for buffer space with other things going on, then
using the index scan could be catastrophic.

Cheers,

Jeff

Re: Duplicate deletion optimizations

From

Misa Simic

Date:

07 January 2012, 16:16:34

It was not query... Just sentence where some index values in one table
not exist in another...

So query could be with:
=E2=80=A2 WHERE (col1,col2,col2) NOT IN
=E2=80=A2 WHERE NOT EXISTS
=E2=80=A2 LEFT JOIN live USING (col1,col2,col2) WHERE live.id IS NULL

what ever whoever prefer more or what gives better results... But I
think it is more personal feelings which is better then real...

Sent from my Windows Phone
From: Jochen Erwied
Sent: 07/01/2012 15:18
To: Misa Simic
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Duplicate deletion optimizations
Saturday, January 7, 2012, 3:02:10 PM you wrote:

> =E2=80=A2 insert into live from temp where col1, col2 and col3 not exists=
 in
> live

'not exists' is something I'm trying to avoid, even if the optimizer is
able to handle it.

--=20
Jochen Erwied     |   home: jochen@erwied.eu     +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work: joe@mbs-software.de  +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile: jochen.erwied@vodafone.de       +49-173-5404164

Re: Duplicate deletion optimizations

From

Misa Simic

Date:

07 January 2012, 18:02:10

If solution with temp table is acceptable - i think steps could be
reduced...

• copy to temp_imp ( temp table does not have id column)

• update live set count = temp_imp.count from temp_imp using (
col1,col2,col3)

• insert into live from temp where col1, col2 and col3 not exists in
live

Kind Regards,

Misa

Sent from my Windows Phone
From: Jochen Erwied
Sent: 07/01/2012 12:58
To: antoine@inaps.org
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Duplicate deletion optimizations
Friday, January 6, 2012, 4:21:06 PM you wrote:

>> Every 5 minutes, a process have to insert a few thousand of rows in this
>> table, but sometime, the process have to insert an already existing row
>> (based on values in the triplet (t_value, t_record, output_id). In this
>> case, the row must be updated with the new count value. I've tried some
>> solution given on this stackoverflow question [1] but the insertion rate
>> is always too low for my needs.

I did check the following in a loop, starting with an empty table, and
inserting/updating 50000 random unique entries. After 15 minutes I've got
about 10 million records, each loop takes about 3 seconds. After 30 minutes
the table contains approx. 18 million entries, time per loop only slightly
increased. After 90 minutes the database has about 30 million entries. The
speed has dropped to about 15-20 seconds per loop, but the server is doing
lots of other queries in parallel, so with an unloaded server the updates
should still take less than 10 seconds.

The generator runs in perl, and generates records for a maximum of 100
million different entries:

use strict;

srand time;
my $i = 0;
open FD, ">data.in";
for (1..50000)
{
        $i += rand(2000);
        print FD sprintf("%d\t%d\t%d\t%d\n", $i/65536, ($i/256)%255,
$i%255, rand(1000));
}
close FD;

The SQL-script looks like this:

\timing on
begin;
create temp table t_imp(id bigint,t_value integer,t_record
integer,output_id integer,count bigint);
\copy t_imp (t_value, t_record, output_id, count) from 'data.in'
--an index is not really needed, table is in memory anyway
--create index t_imp_ix on t_imp(t_value,t_record,output_id);

-- find matching rows
update t_imp
       set id=test.id
       from test
       where (t_imp.t_value,t_imp.t_record,t_imp.output_id)=(test.t_value,test.t_record,test.output_id);
-- update matching rows using primary key
update test
       set count=t_imp.count
       from t_imp
       where t_imp.id is null and test.id=t_imp.id;
-- insert missing rows
insert into test(t_value,t_record,output_id,count)
       select t_value,t_record,output_id,count
              from t_imp
              where id is null;
commit;

Advantages of this solution:

- all updates are done in-place, no index modifications (except for the
  inserts, of course)
- big table only gets inserts
- no dead tuples from deletes
- completely avoids sequential scans on the big table

Tested on my home server (8GB RAM, 3GB shared memory, Dual-Xeon 5110, 1.6
GHz, table and indices stored on a SSD)

Table statistics:

relid             | 14332525
schemaname        | public
relname           | test
seq_scan          | 8
seq_tup_read      | 111541821
idx_scan          | 149240169
idx_tup_fetch     | 117901695
n_tup_ins         | 30280175
n_tup_upd         | 0
n_tup_del         | 0
n_tup_hot_upd     | 0
n_live_tup        | 30264431
n_dead_tup        | 0
last_vacuum       |
last_autovacuum   |
last_analyze      |
last_autoanalyze  | 2012-01-07 12:38:49.593651+01
vacuum_count      | 0
autovacuum_count  | 0
analyze_count     | 0
autoanalyze_count | 31

The sequential scans were from some 'select count(*)' in between.

HTH.

--
Jochen Erwied     |   home: jochen@erwied.eu     +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work: joe@mbs-software.de  +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile: jochen.erwied@vodafone.de       +49-173-5404164


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: Duplicate deletion optimizations

From

Marc Eberhard

Date:

07 January 2012, 18:54:39

Hi Pierre!

On 7 January 2012 12:20, Pierre C <lists@peufeu.com> wrote:
> I'm stuck home with flu, so I'm happy to help ;)
[...]
> I'll build an example setup to make it clearer...
[...]

That's almost identical to my tables. :-)

> Note that the "distance" field represents the distance (in time) between the
> interpolated value and the farthest real data point that was used to
> calculate it. Therefore, it can be used as a measure of the quality of the
> interpolated point ; if the distance is greater than some threshold, the
> value might not be that precise.

Nice idea!

> Although this query is huge, it's very fast, since it doesn't hit the big
> tables with any seq scans (hence the max() and min() tricks to use the
> indexes instead).

And it can easily be tamed by putting parts of it into stored pgpsql functions.

> I love how postgres can blast that huge pile of SQL in, like, 50 ms...

Yes, indeed. It's incredible fast. Brilliant!

> If there is some overlap between packet data and data already in the log,
> you might get some division by zero errors, in this case you'll need to
> apply a DISTINCT somewhere (or simply replace the UNION ALL with an UNION,
> which might be wiser anyway...)

I do have a unique constraint on the actual table to prevent duplicate
data in case of retransmission after a failed connect. It's easy
enough to delete the rows from packet that already exist in the main
table with a short one line SQL delete statement before the
interpolation and merge.

> Tada.

:-))))

> Enjoy !

I certainly will. Many thanks for those great lines of SQL!

Hope you recover from your flu quickly!

All the best,
Marc

Re: Duplicate deletion optimizations

From

Misa Simic

Date:

08 January 2012, 00:16:13

It was not query... Just sentence where some index values in one table
not exist in another...

So query could be with:
• WHERE (col1,col2,col2) NOT IN
• WHERE NOT EXISTS
• LEFT JOIN live USING (col1,col2,col2) WHERE live.id IS NULL

what ever whoever prefer more or what gives better results... But I
think it is more personal feelings which is better then real...

Sent from my Windows Phone
From: Jochen Erwied
Sent: 07/01/2012 15:18
To: Misa Simic
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Duplicate deletion optimizations
Saturday, January 7, 2012, 3:02:10 PM you wrote:

> • insert into live from temp where col1, col2 and col3 not exists in
> live

'not exists' is something I'm trying to avoid, even if the optimizer is
able to handle it.

--
Jochen Erwied     |   home: jochen@erwied.eu     +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work: joe@mbs-software.de  +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile: jochen.erwied@vodafone.de       +49-173-5404164

Re: Duplicate deletion optimizations

From

"Pierre C"

Date:

08 January 2012, 14:10:22

> That's almost identical to my tables.

You explained your problem very well ;)

> I certainly will. Many thanks for those great lines of SQL!

You're welcome !
Strangely I didn't receive the mail I posted to the list (received yours
though).

Re: Duplicate deletion optimizations

From

"Strange, John W"

Date:

09 January 2012, 03:42:18

Are your stats updated on the table after you added the index?

- run the bad query with explain verbose on (you should send this anyways)
- check to see what the difference is in expected rows vs. actual rows
- make sure that your work_mem is high enough if you are sorting, if not you'll see it write out a temp file which will
beslow.
 
- if there is different analyze the table and rerun the query to see if you get the expected results.
- I do believe having COUNT(*) > 1 will never use an index, but someone more experience can comment here.


-----Original Message-----
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of
antoine@inaps.org
Sent: Friday, January 06, 2012 8:36 AM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Duplicate deletion optimizations

Hello,

I've a table with approximately 50 million rows with a schema like
this:

     id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass),
     t_value integer NOT NULL DEFAULT 0,
     t_record integer NOT NULL DEFAULT 0,
     output_id integer NOT NULL DEFAULT 0,
     count bigint NOT NULL DEFAULT 0,
     CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id)

Every 5 minutes, a process have to insert a few thousand of rows in this table, but sometime, the process have to
insertan already existing row (based on values in the triplet (t_value, t_record, output_id). In this case, the row
mustbe updated with the new count value. I've tried some solution given on this stackoverflow question [1] but the
insertionrate is always too low for my needs.
 

So, I've decided to do it in two times:

  - I insert all my new data with a COPY command
  - When it's done, I run a delete query to remove oldest duplicates

Right now, my delete query look like this:

     SELECT min(id) FROM stats_5mn
     GROUP BY t_value, t_record, output_id
     HAVING count(*) > 1;

The duration of the query on my test machine with approx. 16 million rows is ~18s.

To reduce this duration, I've tried to add an index on my triplet:

     CREATE INDEX test
       ON stats_5mn
       USING btree
       (t_value , t_record , output_id );

By default, the PostgreSQL planner doesn't want to use my index and do a sequential scan [2], but if I force it with
"SETenable_seqscan = off", the index is used [3] and query duration is lowered to ~5s.
 


My questions:

  - Why the planner refuse to use my index?
  - Is there a better method for my problem?


Thanks by advance for your help,
Antoine Millet.


[1]
http://stackoverflow.com/questions/1109061/insert-on-duplicate-update-postgresql
     
http://stackoverflow.com/questions/3464750/postgres-upsert-insert-or-update-only-if-value-is-different

[2] http://explain.depesz.com/s/UzW :
     GroupAggregate  (cost=1167282.380..1294947.770 rows=762182
width=20) (actual time=20067.661..20067.661 rows=0 loops=1)
         Filter: (five(*) > 1)
       ->  Sort  (cost=1167282.380..1186336.910 rows=7621814 width=20) (actual time=15663.549..17463.458 rows=7621805
loops=1)
               Sort Key: delta, kilo, four
               Sort Method:  external merge  Disk: 223512kB
             ->  Seq Scan on three  (cost=0.000..139734.140 rows=7621814
width=20) (actual time=0.041..2093.434 rows=7621805 loops=1)

[3] http://explain.depesz.com/s/o9P :
     GroupAggregate  (cost=0.000..11531349.190 rows=762182 width=20) (actual time=5307.734..5307.734 rows=0 loops=1)
         Filter: (five(*) > 1)
       ->  Index Scan using charlie on three  (cost=0.000..11422738.330
rows=7621814 width=20) (actual time=0.046..2062.952 rows=7621805
loops=1)

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

This email is confidential and subject to important disclaimers and
conditions including on offers for the purchase or sale of
securities, accuracy and completeness of information, viruses,
confidentiality, legal privilege, and legal entity disclaimers,
available at http://www.jpmorgan.com/pages/disclosures/email.

Re: Duplicate deletion optimizations

From

Jochen Erwied

Date:

09 January 2012, 03:42:24

Saturday, January 7, 2012, 1:21:02 PM you wrote:

>   where t_imp.id is null and test.id=t_imp.id;
>   =>
>   where t_imp.id is not null and test.id=t_imp.id;

You're right, overlooked that one. But the increase to execute the query is
- maybe not completely - suprisingly minimal.

Because the query updating the id-column of t_imp fetches all rows from
test to be updated, they are already cached, and the second query is run
completely from cache. I suppose you will get a severe performance hit when
the table cannot be cached...

I ran the loop again, after 30 minutes I'm at about 3-5 seconds per loop,
as long as the server isn't doing something else. Under load it's at about
10-20 seconds, with a ratio of 40% updates, 60% inserts.

> and a partial index on matching rows might help (should be tested):

>  (after the first updat)
>  create index t_imp_ix on t_imp(t_value,t_record,output_id) where t_imp.id is not null.

I don't think this will help much since t_imp is scanned sequentially
anyway, so creating an index is just unneeded overhead.

--
Jochen Erwied     |   home: jochen@erwied.eu     +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work: joe@mbs-software.de  +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile: jochen.erwied@vodafone.de       +49-173-5404164

Re: Duplicate deletion optimizations

From

antoine@inaps.org

Date:

09 January 2012, 09:00:21

Hello,

Thanks for your numerous and complete answers!

For those who have asked for more information about the process and
hardware:

The goal of the process is to compute data from a nosql cluster and
write results in a PostgreSQL database. This process is triggered every
5 minutes for the latest 5 minutes data. 80% of data can be wrote in the
database with a simple copy, which is the fastest solution we found for
bulk insertion. But for some data, duplicates are possible (but very
unusual), and the new data must replace the old one in database. I'm
looking for the fastest solution to do this upsert.

About the hardware:

The PostgreSQL database run on a KVM virtual machine, configured with
8GB of ram and 4 cores of a L5640 CPU. The hypervisor have two 7,2k
standard SAS disks working in linux software raid 1. Disks are shared by
VMs, and obviously, this PostgreSQL VM doesn't share its hypervisor with
another "write-intensive" VM.

Also, this database is dedicated to store the data outgoing the
process, so I'm really free for its configuration and tuning. I also
plan to add a replicated slave database for read operations, and maybe
do a partitioning of data, if needed.

If I summarize your solutions:

  - Add an "order by" statement to my initial query can help the planner
to use the index.
  - Temporary tables, with a COPY of new data to the temporary table and
a merge of data (you proposed different ways for the merge).
  - Use EXISTS statement in the delete (but not recommended by another
reply)

I'll try your ideas this week, and I'll give you results.

Antoine.