Thread: One query run twice in parallel results in huge performance decrease

One query run twice in parallel results in huge performance decrease

From
Jan Michel
Date:
Dear all,
I have a quite strange problem running an extensive query on geo data
checking for crossing ways. I don't know if this is a postgres or postgis
problem, but I hope you can help. Running one thread is no problem, it
finishes within 10-15 minutes. Run two of those queries in parallel and
they will not finish within 24 hours. It is definitely not a caching or I/O
problem.

First, the environment:
Running on a large server (32 cores, 128 GB RAM, fast RAID disks)
I tested psql 8.1 / 9.1 / 9.3 and postgis 1.5 and 2.1.0 on Debian 6 and
OpenSuse 12.3. All behave similar. The pgsql server settings were
optimized using pgtune, wal logging and autovacuum is off.

I'm working on a set of databases, each 5-10 GB big filled with OSM
geo data. I run many different queries, and I know the server can handle
up to 8 parallel tasks without a decrease in performance compared to a
single thread. Most data is kept in the cache and almost no read access
to the disk needs to be done.
Everything works well, despite one query, that runs on a table with ~ 1M
entries. It searches for ways crossing each other:
http://etherpad.netluchs.de/pgquery
(The definition of the source table is included as well)

Here is the explain analyze of the query:
http://explain.depesz.com/s/fAcV
As you can see, the row estimate is far off, but the runtime of 11 minutes
is acceptable, I think.

When I run a second instance of this query in a unrelated database on the
same server, they take 100% CPU, no iowait and they do not finish even
after more than a day.
An explain done directly before executing the query shows a huge cost
estimate and varying different plans:
http://explain.depesz.com/s/XDR
http://explain.depesz.com/s/SeG

How can two queries have such a strong influence on each other? Especially
when the host server could handle even ten queries without problems?
And most important: What can I do?

Thank you all in advance for your help!
Jan


Re: One query run twice in parallel results in huge performance decrease

From
Jeff Janes
Date:
On Fri, Nov 29, 2013 at 2:07 PM, Jan Michel <Jan@mueschelsoft.de> wrote:

When I run a second instance of this query in a unrelated database on the
same server, they take 100% CPU, no iowait and they do not finish even
after more than a day.

The planner is not aware of what else is going on in the server, so it can't change plans with that in mind.  So I think that that is a red herring.  I'd guess that the 2nd database is missing the geometry index, or has it defined in some different way such that the database doesn't think it can be used.

Are you sure that you get good plans when you run the exact same queries on the exact same database/schema one at a time?

Cheers,

Jeff

Re: One query run twice in parallel results in huge performance decrease

From
Jan Michel
Date:
Hi Jeff,
thanks for the answer.

On 29.11.2013 23:42, Jeff Janes wrote:
> The planner is not aware of what else is going on in the server
I was not aware of this as well.

> I'd guess that the 2nd database is missing the geometry index, or has
> it defined in some different way such that the database doesn't think
> it can be used.
Unfortunately - no. E.g. the first problematic plan I posted is from the
same schema loaded with the same data as the one that works well.
All tables are generated freshly from scratch by the same script only
minutes before this query is run. I tested them all individually and
never saw any problem, all use the same plan. As soon as I run two in
parallel it happens. I also did a test by feeding two tables with
identical data - again the same problem.

First I used tables in different schemas, then I tested to run them in
different databases. It had no influence. The thing is 100% reproducable
on three different machines with different hardware, different OS and
different pgsql versions. A single query is fast, as soon as a second
one comes in parallel it gets stuck. Every other query I have in the
toolchain does not show this behavior - and there are some quite
expensive ones as well.

Jan






Re: One query run twice in parallel results in huge performance decrease

From
Tom Lane
Date:
Jan Michel <Jan@mueschelsoft.de> writes:
> All tables are generated freshly from scratch by the same script only
> minutes before this query is run. I tested them all individually and
> never saw any problem, all use the same plan. As soon as I run two in
> parallel it happens. I also did a test by feeding two tables with
> identical data - again the same problem.

Hm.  Are you explicitly ANALYZE'ing the newly-built tables in your script,
or are you just trusting auto-analyze to get the job done?  It seems
possible that auto-analyze manages to finish before you start your big
query if there's just one set of tables to analyze, but not if there's two
sets.  That would explain bad choices of plans ...

            regards, tom lane


Re: One query run twice in parallel results in huge performance decrease

From
Jan Michel
Date:
On 30.11.2013 00:48, Tom Lane wrote:
> Are you explicitly ANALYZE'ing the newly-built tables in your script,
> or are you just trusting auto-analyze to get the job done?
Hi Tom,
there is an explicit analyze of the table being done between filling the
table with values and running this query. I sketched the process of
creating and filling the table here:
http://etherpad.netluchs.de/pgquery

Jan



Re: One query run twice in parallel results in huge performance decrease

From
Jeff Janes
Date:
On Fri, Nov 29, 2013 at 3:03 PM, Jan Michel <Jan@mueschelsoft.de> wrote:
Hi Jeff,
thanks for the answer.


On 29.11.2013 23:42, Jeff Janes wrote:
The planner is not aware of what else is going on in the server
I was not aware of this as well.


I'd guess that the 2nd database is missing the geometry index, or has it defined in some different way such that the database doesn't think it can be used.
Unfortunately - no. E.g. the first problematic plan I posted is from the same schema loaded with the same data as the one that works well.
All tables are generated freshly from scratch by the same script only minutes before this query is run. I tested them all individually and never saw any problem, all use the same plan. As soon as I run two in parallel it happens. I also did a test by feeding two tables with identical data - again the same problem.

First I used tables in different schemas, then I tested to run them in different databases. It had no influence. The thing is 100% reproducable on three different machines with different hardware, different OS and different pgsql versions. A single query is fast, as soon as a second one comes in parallel it gets stuck. Every other query I have in the toolchain does not show this behavior - and there are some quite expensive ones as well.


I think what I would do next is EXPLAIN (without ANALYZE) one of the queries repeatedly, say once a second, while the other query either runs or doesn't run repeatedly, that is the other query runs for 11 minutes (or however it takes to run), and then sleeps for 11 minutes in a loop.  Then you can see if the explain plan differs very reliably, and if the transition is exactly aligned with the other starting and stopping or if it is offset.

Cheers,

Jeff 

Re: One query run twice in parallel results in huge performance decrease

From
Scott Marlowe
Date:
On Fri, Nov 29, 2013 at 3:07 PM, Jan Michel <Jan@mueschelsoft.de> wrote:
> Dear all,
> I have a quite strange problem running an extensive query on geo data
> checking for crossing ways. I don't know if this is a postgres or postgis
> problem, but I hope you can help. Running one thread is no problem, it
> finishes within 10-15 minutes. Run two of those queries in parallel and
> they will not finish within 24 hours. It is definitely not a caching or I/O
> problem.

What does your IO subsystem look like when you're running the query
both once and twice?

iostat, vmstat, iotop etc area ll useful here.


Re: One query run twice in parallel results in huge performance decrease

From
Jan Michel
Date:

Jeff Janes wrote:
I think what I would do next is EXPLAIN (without ANALYZE) one of the queries repeatedly, say once a second, while the other query either runs or doesn't run repeatedly, that is the other query runs for 11 minutes (or however it takes to run), and then sleeps for 11 minutes in a loop.  Then you can see if the explain plan differs very reliably, and if the transition is exactly aligned with the other starting and stopping or if it is offset.

Hi Jeff,
I ran the one analyze over and over again as you proposed - but the result never changed.
But I think I found a solution for the problem. While browsing through the manual I found a statement about GIN indexes:
"For tables with GIN indexes, VACUUM (in any form) also completes any pending index insertions, by moving pending index entries to the appropriate places in the main GIN index structure". I use a gist and no gin index, but I tried to vacuum the (freshly filled) table, and it helped. It seems that the planer is simply not aware of the existence of the index although I run an analyze on the table right before the query.

Thank you all for your suggestions!
Jan