Re: Insert performance for large transaction with multiple COPY FROM - Mailing list pgsql-performance

From: Horst Dehmer
Subject: Re: Insert performance for large transaction with multiple COPY FROM
Date:
Msg-id: 273CF8C2-159B-44FF-B382-61A76541F435@gmail.com
In response to: Re: Insert performance for large transaction with multiple COPY FROM (Jeff Janes <jeff.janes@gmail.com>)
List: pgsql-performance
Hey Jeff (and others)!

First of all: Thanks for your detailed explanations and guidelines.


On 17.01.2013, at 18:12, Jeff Janes <jeff.janes@gmail.com> wrote:

> So the theory is that the presence of idx_4 is causing the trigger to
> pick a poor plan (i.e. one using idx_4) while its absence removes that
> temptation?

Yes. And auto_explain confirms this for the first record (obj_item_loc_obj_item_id_idx = idx_4 from my last mail):

2013-01-18 22:50:21 CET LOG:  duration: 0.021 ms  plan:
    Query Text: SELECT * FROM obj_item_loc WHERE obj_item_loc.obj_item_id = NEW.obj_item_id AND obj_item_loc.loc_id = NEW.loc_id AND obj_item_loc.obj_item_loc_ix = NEW.obj_item_loc_ix
    Index Scan using obj_item_loc_obj_item_id_idx on obj_item_loc  (cost=0.00..8.27 rows=1 width=382)
      Index Cond: (obj_item_id = $15)
      Filter: ((loc_id = $16) AND (obj_item_loc_ix = $17))
2013-01-18 22:50:21 CET CONTEXT:  SQL statement "SELECT * FROM obj_item_loc WHERE obj_item_loc.obj_item_id = NEW.obj_item_id AND obj_item_loc.loc_id = NEW.loc_id AND obj_item_loc.obj_item_loc_ix = NEW.obj_item_loc_ix"
    PL/pgSQL function obj_item_loc_before_insert() line 5 at SQL statement
    COPY obj_item_loc, line 1: "10903011224100007276    10903010224100015110    10900024100000029720    \N    \N    \N    \N    \N    \N    \N    \N    \N    \N    \N    \N    \..."

and for one of the last records:

2013-01-18 22:53:20 CET LOG:  duration: 16.088 ms  plan:
    Query Text: SELECT * FROM obj_item_loc WHERE obj_item_loc.obj_item_id = NEW.obj_item_id AND obj_item_loc.loc_id = NEW.loc_id AND obj_item_loc.obj_item_loc_ix = NEW.obj_item_loc_ix
    Index Scan using obj_item_loc_obj_item_id_idx on obj_item_loc  (cost=0.00..8.27 rows=1 width=382)
      Index Cond: (obj_item_id = $15)
      Filter: ((loc_id = $16) AND (obj_item_loc_ix = $17))

I see a linear increase of the duration from 0.0x ms to over 16 ms (apart from a few nasty outliers at about 22 ms).
Even at the end there are still a few durations below 0.03 ms, but most are 15 ms and above.
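
If the cached generic plan is what goes stale here, one workaround I might try is routing line 5 of
obj_item_loc_before_insert() through EXECUTE, which plans the query afresh on every call instead of reusing
the cached plan. Just a sketch (rec is a hypothetical record variable, and the extra planning per row may or
may not pay off):

    EXECUTE 'SELECT * FROM obj_item_loc WHERE obj_item_id = $1 AND loc_id = $2 AND obj_item_loc_ix = $3'
        INTO rec
        USING NEW.obj_item_id, NEW.loc_id, NEW.obj_item_loc_ix;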

> True disk reads are much more expensive, but given how few reads you
> have relative to hits, I now think that in aggregate the hits are more
> of a concern than the reads are.  In other words, you seem to be CPU
> bound, not IO bound.

Yes, definitely CPU bound, as top shows 99+% CPU utilization.
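
(If I need to separate cache hits from true reads per statement next time, auto_explain can log buffer usage
as well; per the docs this only takes effect together with log_analyze:

    SET auto_explain.log_analyze = on;
    SET auto_explain.log_buffers = on;  -- shared hit/read counts per logged plan
)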

> Even more so I think (but not with much confidence) that most of your
> "reads" are actually coming from the OS cache and not from the disk.
> PG cannot distinguish true disk reads from OS cache reads.
>
> When was the last time you reset the stats?  That is, are your
> reported numbers accumulated over several loads, with some having idx4
> and some not?

I set up a fresh database before each test run. So the stats should be clean.
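
(For completeness: the counters could also be cleared in place instead of rebuilding the database, e.g.

    SELECT pg_stat_reset();  -- resets statistics counters for the current database

but a fresh setup per run seems the cleaner baseline anyway.)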

> More formally, use auto_explain and set
> auto_explain.log_nested_statements to true.  I haven't verified this
> works with triggers, just going by the description I think it should.

Nice tip! Works for triggers as well.
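
For anyone trying this at home, a minimal session setup along these lines is enough (parameter names per the
auto_explain docs):

    LOAD 'auto_explain';                          -- per-session load
    SET auto_explain.log_min_duration = 0;        -- log the plan of every statement
    SET auto_explain.log_nested_statements = on;  -- include statements run inside the trigger function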

> Your use case is a little unusual.  If you are bulk loading into an
> initially empty table, usually you would remove the trigger and add it
> after the load (with some kind of bulk operation to make up for
> whatever it was the trigger would have been doing).  On the other
> hand, if you are bulk loading into a "live" table and so can't drop
> the trigger, then the live table should have good-enough preexisting
> statistics to make the trigger choose a good plan.
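
(For the record, a rough sketch of the drop-the-trigger variant in my schema, plus a middle ground I may
still try; the trigger name is only a guess derived from the function name in the log:

    ALTER TABLE obj_item_loc DISABLE TRIGGER obj_item_loc_before_insert;  -- hypothetical trigger name
    COPY obj_item_loc FROM STDIN;
    -- ... bulk equivalent of the trigger's per-row work here ...
    ALTER TABLE obj_item_loc ENABLE TRIGGER obj_item_loc_before_insert;

    -- middle ground: keep the trigger but refresh statistics between COPY
    -- batches, so later trigger calls can replan against realistic row counts
    ANALYZE obj_item_loc;
)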

My case is indeed unusual as for the whole model of 276 tables there will never be an update or a delete on any row.
The model is rather short-lived, from a few hours to a few months. COPY FROM/TO are the only ways to get data into the
database and back out. And in between there is lots of graph traversal and calculation of convex hulls. But the lengthy
transactions are by far not the common case.

Having said that, I'm no longer sure if an RDBMS is the right tool for the backend. Maybe indexing and storing with a
plain full-text search engine is. Dunno...

Thanks again!

--
Horst
