Running update in chunks? - Mailing list pgsql-general

From Tim Uckun
Subject Running update in chunks?
Date
Msg-id CAGuHJrPHRNmmoWOQD7YfArgGHRrUkBzC33ipsJr1tODYJCmMFg@mail.gmail.com
Responses Re: Running update in chunks?  (Richard Huxton <dev@archonet.com>)
Re: Running update in chunks?  (Tim Uckun <timuckun@gmail.com>)
List pgsql-general
I have seen a lot of slow-update questions asked both here and on
Stack Overflow, but they usually involve large tables. In my case the
dataset is fairly small.

I have an app in which I import data and then merge the records with
an existing table. Currently I do most of the heavy lifting in code,
and it works well enough, but as the imports get bigger I thought I
would rewrite the code to speed it up using Postgres. Basically I take
the data, which I consider dirty, and put it into a table, using
hstore to store the fields. I then run a series of update queries to
locate the "real" records in the various tables. The import data
looks like this https://gist.github.com/4584366 and has about 98K
records in it. The lookup table is very small, with only a couple of
hundred records in it.

This is the query I am running:

update cars.imports i
    set make_id = md.make_id
    from cars.models md where i.model_id = md.id;
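As an aside, one variant I have seen suggested is to skip rows that already have the right value, so re-runs don't rewrite every tuple. This is just a sketch; it assumes make_id may already be populated from an earlier run:

```sql
-- Sketch: only touch rows whose make_id actually changes.
-- IS DISTINCT FROM handles NULLs, so freshly-imported rows still match.
UPDATE cars.imports i
SET make_id = md.make_id
FROM cars.models md
WHERE i.model_id = md.id
  AND i.make_id IS DISTINCT FROM md.make_id;
```

On a first run this updates the same 98K rows, but on subsequent runs it avoids generating dead tuples for rows that are already correct.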


Here is the EXPLAIN ANALYZE output:

Update on imports i  (cost=2.46..49720.34 rows=138858 width=526) (actual time=51968.553..51968.553 rows=0 loops=1)
  ->  Hash Join  (cost=2.46..49720.34 rows=138858 width=526) (actual time=0.044..408.170 rows=98834 loops=1)
        Hash Cond: (i.model_id = md.id)
        ->  Seq Scan on imports i  (cost=0.00..47808.58 rows=138858 width=516) (actual time=0.010..323.616 rows=98834 loops=1)
        ->  Hash  (cost=1.65..1.65 rows=65 width=14) (actual time=0.026..0.026 rows=65 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 4kB
              ->  Seq Scan on models md  (cost=0.00..1.65 rows=65 width=14) (actual time=0.002..0.012 rows=65 loops=1)
Total runtime: 51968.602 ms


This query takes fifty seconds on a MacBook Air with an i7 processor,
eight gigs of RAM, and an SSD. I am using Postgres 9.2 installed with
Homebrew, using the standard conf file.

So it seems to me that this query is running as fast as it can, but
it's still much slower than doing things in code one record at a
time (using some memoization).

Anyway... Presuming I can't really do anything to speed up this query,
does it make sense to try to do this in chunks, and if so, what is the
best technique for doing that?
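For context, the chunking idea I had in mind is something like the following sketch. It assumes the imports table has an integer primary key named "id" (my assumption; the gist doesn't show the schema), and each batch would run in its own transaction:

```sql
-- Sketch of chunked updating, assuming an integer PK "id" on cars.imports.
-- Run each batch in a separate transaction, stepping the range forward:
UPDATE cars.imports i
SET make_id = md.make_id
FROM cars.models md
WHERE i.model_id = md.id
  AND i.id >= 1 AND i.id < 10001;

-- next batch: AND i.id >= 10001 AND i.id < 20001, and so on,
-- driven by a loop in the application or a DO block.
```

The idea being that smaller transactions keep each UPDATE's row rewrite and WAL burst bounded, at the cost of scanning the range predicate repeatedly unless "id" is indexed.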

