Re: performance problem - Mailing list pgsql-general

From Mike Mascari
Subject Re: performance problem
Msg-id 3FBD2E6F.20405@mascari.com
In response to Re: performance problem  ("Rick Gigger" <rick@alpinenetworking.com>)
List pgsql-general
Rick Gigger wrote:

> Ok, adding the index back in worked the first time but then I tried
> generating the database from scratch again, this time adding the index right
> from the start.  It added the 45000 rows in about a minute but then was
> going dog slow on the updates again.  So I did an explain and sure enough it
> was not using the index.  After some investigation I determined that it was
> not using the index because when the transaction started there were only 4
> rows in the table so at that point it didn't want to use it.  It apparently
> doesn't gather analysis data fast enough to handle this kind of transaction.

The statistics are collected as a result of an ANALYZE command:

http://www.postgresql.org/docs/current/static/sql-analyze.html

This does not happen automatically. EXPLAIN output will show a default
assumption of 1000 rows, IIRC.
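
To illustrate (the table and column names below are made up), you would
refresh the statistics and then check which plan the planner picks:

ANALYZE foo;
EXPLAIN SELECT * FROM foo WHERE bar = 'something';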

> I worked around this by starting the transaction and inserting the 45,000
> rows and then killing it.  Then I removed the index and readded it which
> apparently gathered some stats and since there were all of the dead tuples
> in there from the failed transaction it now decided that it should use the
> index.  I reran the script and this time it took 5 minutes again instead of
> 1 1/2 hours.

If you examine the behavior of pg_dump output, you'll notice that it
doesn't build indexes until after the COPY command has completed the
data import. It's a waste of CPU cycles and disk bandwidth to update
indexes upon every insert. Your script should:

1) Drop all indexes on the relation
2) Use COPY if possible instead of INSERT to import the data
3) Recreate the indexes
4) Use UPDATE to update as necessary
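
A rough sketch of that sequence, with made-up table, column, index, and
file names:

DROP INDEX foo_bar_idx;
COPY foo FROM '/path/to/data.copy';
CREATE INDEX foo_bar_idx ON foo (bar);
UPDATE foo SET baz = 1 WHERE bar = 'something';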

> I am using 7.2.4.  Has this improved in later versions?  I'm not concerned
> since this is a very rare thing to need to do and it's obviously possible to
> work around but it would be nice if postgres could figure things like that
> out on its own.  (It certainly would have saved me a lot of time and
> confusion last night at about 3 am).  Is there a way to force the use of a
> specific index on a query?

You can force the use of an index scan by turning sequential scans off:

SET ENABLE_SEQSCAN TO OFF;
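
For example, you might turn it off just around the one statement and
then turn it back on (the query itself is only illustrative):

SET ENABLE_SEQSCAN TO OFF;
UPDATE foo SET baz = 1 WHERE bar = 'something';
SET ENABLE_SEQSCAN TO ON;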

But the problem is that the statistics didn't match the data. You
could have:

1) Used INSERTs to insert the data into a relation with an index
2) Executed ANALYZE <foo> to update the statistics
3) Performed the UPDATE
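
In other words, something along these lines (names are hypothetical):

INSERT INTO foo (bar, baz) VALUES ('something', 0);
-- ... the rest of the 45,000 INSERTs ...
ANALYZE foo;
UPDATE foo SET baz = 1 WHERE bar = 'something';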

After the UPDATE, you'll still have dead tuples (the original versions
of the rows), so you should occasionally run VACUUM to mark their space
as reusable, VACUUM FULL to reclaim the dead space, or VACUUM FULL
ANALYZE to also update the relation's statistics.
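
In SQL terms (foo is again just a placeholder):

VACUUM foo;               -- mark dead tuples so their space can be reused
VACUUM FULL foo;          -- compact the table and reclaim the space
VACUUM FULL ANALYZE foo;  -- do the above and refresh the statistics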

I would just execute the steps I outlined above with COPY and not
worry about ANALYZEs and VACUUMs in a script. People often run VACUUM
ANALYZE in a 1/day cron job and VACUUM FULL ANALYZE in a 1/week cron
job. Then, of course, there's REINDEX...
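
A sketch of those cron entries, assuming psql is in the PATH and the
database is named mydb:

# once a day at 3 am
0 3 * * * psql -c "VACUUM ANALYZE" mydb
# once a week, Sunday at 4 am
0 4 * * 0 psql -c "VACUUM FULL ANALYZE" mydb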

Hope that helps,

Mike Mascari
mascarm@mascari.com


