SELECT INTO large FKyed table is slow

The database for monitoring certain drone statuses is quite simple:

CREATE TABLE samples (
    sample_id integer not null primary key,
    sample_timestamp timestamp not null default now()
);

CREATE TABLE drones (
    drone_id integer not null primary key,
    drone_log_notice character varying,
    drone_coordinates point not null,
    drone_temperature float,
    drone_pressure float
);

CREATE TABLE drones_history (
    drone_id integer not null,
    sample_id integer not null,
    drone_log_notice character varying,
    drone_temperature float,
    drone_pressure float,
    constraint drones_history_pk primary key (drone_id, sample_id),
    constraint drones_history_fk__samples foreign key (sample_id)
references samples(sample_id),
    constraint drones_history_fk__drones foreign key (drone_id) references
drones(drone_id)
);

Every ten to twenty minutes I receive a CSV file with the statuses of most of
the drones. The CSV file also includes data for new drones when they're put
into use. When I receive new data I load the whole CSV file into the database,
then call a stored procedure that 'deals' with that data.
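
The load itself is just a COPY into the staging table, along these lines (the
file path here is made up, just to illustrate):

COPY temporary.drones (drone_id, log_notice, coord_x, coord_y, temp, press)
FROM '/path/to/drones_batch.csv' WITH CSV;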

So far I have around 6000 samples and around 160k drones, and drones_history
holds around 25M rows.

The CSV file contains around 15k-20k 'rows', mostly data about existing
drones. Every now and then (on every 5th - 10th CSV insert) there is data
for around 1000-5000 new drones.

Here is what I do in the stored procedure, after I have COPYed the data from
the CSV into the temporary.drones table.
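
One note on the a_sample_id parameter that shows up below: every batch first
gets its own row in samples, and the procedure works with that id. Roughly:

-- sketch; how the new sample_id is generated doesn't matter here
INSERT INTO samples (sample_id) VALUES (a_sample_id);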

First, inside the procedure, I create a temporary table that holds the rows
for the new drones:

CREATE TEMPORARY TABLE tmpNew ON COMMIT DROP AS
SELECT drone_id, log_notice, coord_x, coord_y, temp, press
FROM temporary.drones WHERE NOT EXISTS (SELECT 1 FROM public.drones WHERE
public.drones.drone_id = temporary.drones.drone_id);

This is done in milliseconds, even if the count of new drones is large (I've
tested it with 10k new drones, although in real-life operation I'd never get
more than 5k new drones per CSV).

INSERT INTO public.drones (drone_id, drone_log_notice, drone_coordinates,
drone_temperature, drone_pressure)
SELECT drone_id, log_notice, point(coord_x, coord_y) as coordinates,
temp, press FROM tmpNew;
INSERT INTO public.drones_history (sample_id, drone_id,
drone_log_notice, drone_temperature, drone_pressure)
SELECT a_sample_id, drone_id, log_notice, temp, press FROM tmpNew;

This is also done in milliseconds.

Now, I 'update' the data for the existing drones, and fill in the history
table for those drones. First I create a temporary table with just the
changed rows:

CREATE TEMPORARY TABLE tmpUpdate ON COMMIT DROP AS
SELECT a_sample_id, t.drone_id, t.log_notice, t.temp, t.press
    FROM temporary.drones t
    JOIN public.drones p
    ON t.drone_id = p.drone_id
WHERE p.drone_log_notice != t.log_notice OR p.drone_temperature != t.temp
OR p.drone_pressure != t.press;

Now, that part is also fast. I usually have around 100-1000 drones that
changed 'state', but sometimes even half of the drones change state (around
50k), and creating tmpUpdate still takes no more than ten to twenty
milliseconds.

This is the slow part:
INSERT INTO drones_history (sample_id, drone_id, drone_log_notice,
drone_temperature, drone_pressure)
SELECT * FROM tmpUpdate;

For 100 rows this takes around 2 seconds. For 1000 rows it takes around 40
seconds. For 5000 rows it takes around 5 minutes, and for 50k rows around 30
minutes! This is where I start to lag behind, because I get a new CSV every
10 minutes or so.
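
To narrow down where the time goes I can rerun the statement by hand under
EXPLAIN ANALYZE, which also breaks out the time spent in each FK-check trigger
(a sketch; EXPLAIN ANALYZE really executes the INSERT, hence the rollback):

BEGIN;
-- rebuild a test copy of tmpUpdate by hand first, then:
EXPLAIN ANALYZE
INSERT INTO drones_history (sample_id, drone_id, drone_log_notice,
drone_temperature, drone_pressure)
SELECT * FROM tmpUpdate;
ROLLBACK;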

And the last part is to update the actual drones table:
UPDATE public.drones p
SET drone_log_notice = t.log_notice, drone_temperature = t.temp,
drone_pressure = t.press
FROM temporary.drones t
WHERE t.drone_id = p.drone_id
AND (t.log_notice != p.drone_log_notice OR t.temp != p.drone_temperature
OR t.press != p.drone_pressure);

This is also very fast; even when almost half the table is updated, the
UPDATE takes around 10 seconds. Usually it's around 30-50 ms.

The machine I'm doing this on has 4 GB of RAM and a dual-Xeon something
(3 GHz). Two SAS drives in a mirror, capable of around 100 MB/s in sequential
r/w (I know it means nothing, but just to give an idea).

The database is around 2 GB in size (pg_database_size). When I dump/recreate
the database I can speed things up a bit, but after half a day of operation
the INSERTs are slow again.
When I do a dump/restore of the database I get around 40-50 MB/s
reading/writing from the disk (COPYing data, PK/FK constraint creation),
but when that INSERT gets stuck the I/O wait goes sky-high - iostat shows
that Postgres is mainly reading from the disks, around 800-1024 kB/s.
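
My guess is those reads are the per-row FK lookups on samples and drones
missing the cache. This shows how big the indexes behind those lookups are
(the *_pkey index names are assumed defaults; drones_history_pk is named in
the DDL):

SELECT pg_size_pretty(pg_relation_size('public.samples_pkey')) AS samples_pk,
       pg_size_pretty(pg_relation_size('public.drones_pkey')) AS drones_pk,
       pg_size_pretty(pg_relation_size('public.drones_history_pk')) AS history_pk;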

I've set shared_buffers to 256MB, work_mem to 96MB, wal_buffers to 16MB and
checkpoint_segments to 16. I've turned off autovacuum; instead I do an
analyze/vacuum after each insert job is done, after TRUNCATEing the
temporary.drones table.
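
Concretely, the end of every load job looks roughly like this (VACUUM has to
run outside the procedure, since it can't run inside a transaction):

TRUNCATE temporary.drones;
VACUUM ANALYZE public.drones;
VACUUM ANALYZE public.drones_history;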

Out of despair I tried to set fsync=off, but that gave me just a small
performance improvement.

When I remove the foreign key constraints (drones_history_fk__samples and
drones_history_fk__drones) but leave the primary key on drones_history, that
INSERT, even for 50k rows, takes no more than a second.
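
One middle ground I'm considering is dropping the FKs just for the bulk insert
and re-creating them afterwards, so the checking is done as one set operation
instead of one index probe per inserted row. The catch is that ADD CONSTRAINT
re-validates the whole 25M-row history table, so it may not be a win; a sketch:

ALTER TABLE drones_history DROP CONSTRAINT drones_history_fk__samples;
ALTER TABLE drones_history DROP CONSTRAINT drones_history_fk__drones;

-- ... run the INSERTs into drones_history ...

ALTER TABLE drones_history ADD CONSTRAINT drones_history_fk__samples
    FOREIGN KEY (sample_id) REFERENCES samples(sample_id);
ALTER TABLE drones_history ADD CONSTRAINT drones_history_fk__drones
    FOREIGN KEY (drone_id) REFERENCES drones(drone_id);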

So, my question is - is there anything I can do to make INSERTs into an FKed
table faster? Or, since all the reference checking is done inside the
procedure for loading the data, should I abandon those constraints entirely?

    Mario

