Re: Insert performance for large transaction with multiple COPY FROM - Mailing list pgsql-performance

From Jeff Janes
Subject Re: Insert performance for large transaction with multiple COPY FROM
Date
Msg-id CAMkU=1ykVF4Ry6guyhFDsHr-JJhtwC35+fsPfxUQMri-7bw_+g@mail.gmail.com
In response to Re: Insert performance for large transaction with multiple COPY FROM  (Horst Dehmer <horst.dehmer@gmail.com>)
Responses Re: Insert performance for large transaction with multiple COPY FROM  (Horst Dehmer <horst.dehmer@gmail.com>)
List pgsql-performance
On Tue, Jan 15, 2013 at 3:44 PM, Horst Dehmer <horst.dehmer@gmail.com> wrote:


> idx_4 together with a simple select in the table's on-insert trigger is
> slowing things down considerably.

So the theory is that the presence of idx_4 is causing the trigger to
pick a poor plan (i.e. one using idx_4) while its absence removes that
temptation?


> pg_statio comes up with same big numbers (reads = bad, hits = not so bad?):

True disk reads are much more expensive, but given how few reads you
have relative to hits, I now think that in aggregate the hits are more
of a concern than the reads are.  In other words, you seem to be CPU
bound, not IO bound.

Even more so I think (but not with much confidence) that most of your
"reads" are actually coming from the OS cache and not from the disk.
PG cannot distinguish true disk reads from OS cache reads.
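For example, something like this query against the statistics views shows hits versus reads per table (the ratio column is just illustrative arithmetic):

```sql
-- Shared-buffer hits vs. blocks read from outside PostgreSQL's cache.
-- Note: "reads" may still be served by the OS page cache, not the disk.
SELECT relname,
       heap_blks_read,
       heap_blks_hit,
       round(heap_blks_hit::numeric
             / nullif(heap_blks_hit + heap_blks_read, 0), 3) AS hit_ratio
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC;
```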

When was the last time you reset the stats?  That is, are your
reported numbers accumulated over several loads, with some having
idx_4 and some not?
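If they are accumulated, resetting before the next load makes the counters easier to interpret:

```sql
-- Reset statistics counters for the current database so that
-- pg_statio_* reflects only the next load.
SELECT pg_stat_reset();
```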

...
>
> Now I have some (more) questions:
>
> 1. How do I know which index (if any) is chosen for a select statement
> inside a trigger during a bulk load transaction? (or for that matter: a
> series of recursive plpgsql functions)

Informally, reset your database to the state it was in before the
load, analyze it, and do the explain again before you do the load.
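As a sketch (table, column, and value here are placeholders, not your actual schema), that means running the trigger's lookup by hand against the pre-load state:

```sql
-- Refresh statistics on the pre-load table, then ask the planner
-- which index it would use for the trigger's query.
ANALYZE some_table;
EXPLAIN
SELECT 1 FROM some_table WHERE some_col = 42;  -- stand-in for the trigger's select
```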

More formally, use auto_explain and set
auto_explain.log_nested_statements to true.  I haven't verified that
this works with triggers, but going by the description I think it
should.
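For a single session that would look roughly like this (a superuser can instead preload the module in postgresql.conf):

```sql
-- Load the auto_explain module for this session only.
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;          -- log the plan of every statement
SET auto_explain.log_nested_statements = true;  -- include statements inside functions/triggers
```

With these settings, the plans chosen inside the trigger during the load should show up in the server log.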

> 2. The query planner depends on stats collected by auto-vacuum/vacuum
> analyze, right? Does stats collecting also happen during a lengthy
> transaction?

My understanding is that a transaction will not dump its stats until
the commit, so the auto analyze will not occur *due to* the lengthy
transaction until after it is over.  But if the table was already due
for analyze anyway due to previous or concurrent shorter transactions,
the analyze will happen.  However, the lengthy transaction might not
see the results of the analyze (I'm not clear on the transaction
snapshot semantics of the statistics tables) and even if it did see
them, it might just be using cached plans and so would not change the
plan in the middle.

> 3. Is it possible (or even advisable) to trigger vacuum analyze inside an
> ongoing transaction. Let's say load 10,000 rows of table A, analyze table A,
> insert the next 10,000 rows, analyze again, ...

You can't vacuum inside a transaction.  You can analyze, but I don't
know if it would be advisable.
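If you wanted to try it, the shape would be something like the following (table and file names are made up); note that plpgsql functions may keep using cached plans even after the ANALYZE:

```sql
BEGIN;
COPY table_a FROM '/path/batch1.csv' WITH (FORMAT csv);
ANALYZE table_a;   -- ANALYZE is allowed inside a transaction; VACUUM is not
COPY table_a FROM '/path/batch2.csv' WITH (FORMAT csv);
ANALYZE table_a;
COMMIT;
```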

Your use case is a little unusual.  If you are bulk loading into an
initially empty table, usually you would remove the trigger and add it
after the load (with some kind of bulk operation to make up for
whatever it was the trigger would have been doing).  On the other
hand, if you are bulk loading into a "live" table and so can't drop
the trigger, then the live table should have good-enough preexisting
statistics to make the trigger choose a good plan.
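The empty-table pattern would look roughly like this (trigger and table names are hypothetical):

```sql
-- Hypothetical bulk-load pattern for an initially empty table:
ALTER TABLE table_a DISABLE TRIGGER trg_check;   -- skip per-row trigger work
COPY table_a FROM '/path/data.csv' WITH (FORMAT csv);
-- ...one set-based statement doing whatever the trigger would have done...
ALTER TABLE table_a ENABLE TRIGGER trg_check;
```

A single set-based UPDATE or INSERT ... SELECT after the COPY is usually far cheaper than firing the trigger once per row.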

Cheers,

Jeff

