Questions on query planner, join types, and work_mem - Mailing list pgsql-performance

From Peter Hussey
Subject Questions on query planner, join types, and work_mem
Date
Msg-id AANLkTikMwVZKDYLitnsK31R=VNOuTH8NtmDHDGkBR7k8@mail.gmail.com
List pgsql-performance
I have spent the last couple of weeks digging into a Postgres performance problem that ultimately boiled down to this:  the planner was choosing to use hash joins on a set of join keys that were much larger than the configured work_mem.  We found we could make the performance much better by either
1) increasing work_mem to 500MB or more, or
2) forcing the planner to choose index-backed nested loops by turning off hash and merge joins as well as bitmap and sequential scans.

Now we are trying to decide which of these paths to choose, and asking why the planner doesn't handle this for us.
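
For reference, the nested loop numbers in option 2) were obtained by disabling the other plan types at the session level, roughly like this (a sketch of our test setup, not the production configuration):

SET enable_hashjoin = off;
SET enable_mergejoin = off;
SET enable_bitmapscan = off;
SET enable_seqscan = off;
-- then run the snapshot query in the same session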

Background:  LabKey builds an open source platform for biomedical research data.  The platform consists of a Tomcat web application and a relational database.  We support two databases, Postgres and SQL Server.  We started with SQL Server because we were very familiar with it; two of our technical team came from the SQL Server development team.  We chose Postgres because we assessed that it was the open source database most likely to be able to handle our application requirements for capacity and for the complex, nested, generated SQL we produce.  Postgres is now the default database for our platform and most of our key customers use it.  In general we've been very satisfied with Postgres' performance and compatibility, but our customers are starting to hit situations where we really need to be able to understand why a particular operation is slow.  We are currently recommending version 8.4 and using that ourselves.

The core of the problem query was:

SELECT * INTO snapshot_table FROM
  (SELECT ... FROM tableA A LEFT OUTER JOIN tableB B ON (A.lsid = B.lsid) AND A.datasetid = ?) query1

The join column, lsid, is a poor choice for a join column: it is a long varchar value (average length 101 characters) that only becomes unique far out toward the right-hand end of the string.  But we are stuck with this choice.  I can post the SQL query and table definitions if it will help, but changes to either of those would be risky and difficult, whereas setting the work_mem value or forcing nested loop joins is less risky.
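
(For reference, the average-length figure above was measured with a simple query along these lines, against the table and column named above:)

SELECT avg(length(lsid)) FROM tableA;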

The performance curve looks something like this:

Join Type      work_mem (MB)    time to populate snapshot (min)
_______________________________________________________________
Hash                 50                       85
Hash                200                       38
Hash                400                       21
Hash                500                       12
Hash               1000                       12
_______________________________________________________________
NestedLoop           50                       15
NestedLoop          200                       11
NestedLoop          400                       11
NestedLoop          500                       10
NestedLoop         1000                       10
_______________________________________________________________

Table A contains about 3.5 million rows, and table B contains about 4.4 million rows.  By looking at the EXPLAIN ANALYZE reports I concluded that the planner seemed to be accurately estimating the approximate number of rows returned on each side of the join node.  I also noticed that in the work_mem = 50MB test, the hash join execution was using over a GB of space in pgsql_tmp, space that grew and shrank slowly over the course of the test.
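
For what it's worth, the plan data came from running EXPLAIN ANALYZE on the inner query, along these lines (the column list is elided as above; when testing by hand we substitute a real datasetid value for the ? bind parameter):

EXPLAIN ANALYZE
SELECT ... FROM tableA A
  LEFT OUTER JOIN tableB B ON (A.lsid = B.lsid) AND A.datasetid = ?;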

Now for the questions:
1) If we tell the customer to set his work_mem value to 500MB or 1GB in postgresql.conf, what problems might they see?  The documentation and the guidelines we received from Rupinder Singh in support suggest a much lower value, e.g. a max work_mem of 10MB.  Other documentation, such as the "Guide to Posting Slow Query Questions", suggests at least testing up to 1GB.  What is a reasonable maximum to configure for all connections?
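
For concreteness, the setting we are talking about is equivalent to the following at the session level (value illustrative; in the customer's case it would go in postgresql.conf):

SET work_mem = '500MB';
SHOW work_mem;  -- confirm the value in effect for this session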

2) How is work_mem used by a query execution?  For example, does each hash table in an execution get allocated a full work_mem's worth of memory?  Is this memory released when the query is finished, or does it stay attached to the connection or some other object?

3) Is there a reason why the planner doesn't seem to recognize the case where the hash table won't fit in the current work_mem, and choose a lower-memory plan instead?

Excuse the long-winded post; I was trying to give the facts and nothing but the facts.

Thanks,
Peter Hussey
LabKey Software
