Home > mailing lists

Re: Too Many OR's? - Mailing list pgsql-performance

From	Scott Marlowe
Subject	Re: Too Many OR's?
Date	November 15, 2005 16:01:57
Msg-id	1132084908.3582.46.camel@state.g2switchworks.com Whole thread Raw
In response to	Too Many OR's? (Bill McGonigle <bill@bfccomputing.com>)
List	pgsql-performance

Tree view

On Tue, 2005-11-15 at 13:12, Bill McGonigle wrote:
> I have a query that's making the planner do the wrong thing (for my
> definition of wrong) and I'm looking for advice on what to tune to make
> it do what I want.
>
> The query consists or SELECT'ing a few fields from a table for a large
> number of rows.  The table has about seventy thousand rows and the user
> is selecting some subset of them.  I first do a SELECT...WHERE to
> determine the unique identifiers I want (works fine) and then I do a
> SELECT WHERE IN giving the list of id's I need additional data on
> (which I see from EXPLAIN just gets translated into a very long list of
> OR's).
>
> Everything works perfectly until I get to 65301 rows.  At 65300 rows,
> it does an index scan and takes 2197.193 ms.  At 65301 rows it switches
> to a sequential scan and takes 778951.556 ms.  Values known not to
> affect this are: work_mem, effective_cache_size.  Setting
> random_page_cost from 4 to 1 helps (79543.214 ms) but I'm not really
> sure what '1' means, except it's relative.  Of course, setting
> 'enable_seqscan false' helps immensely (2337.289 ms) but that's as
> inelegant of a solution as I've found - if there were other databases
> on this install that wouldn't be the right approach.
>
> Now I can break this down into multiple SELECT's in code, capping each
> query at 65300 rows, and that's a usable workaround, but academically
> I'd like to know how to convince the planner to do it my way.  It's
> making a bad guess about something but I'm not sure what.  I didn't see
> any hard-coded limits grepping through the source (though it is close
> to the 16-bit unsigned boundry - probably coincidental) so if anyone
> has ideas or pointers to how I might figure out what's going wrong that
> would be helpful.

OK, there IS a point at which switching to a sequential scan will be
fast.  I.e. when you're getting everything in the table.  But the
database is picking a number where to switch that is too low.

First, we need to know if the statistics are giving the query planner a
good enough idea of how many rows it's really gonna get versus how many
it expects.

Do an explain <your query here> and see how many it thinks it's gonna
get.  Since you've actually run it, you know how many it really is going
to get, so there's no need for an explain analyze <your query here> just
yet.

Now, as long as the approximation is pretty close, fine.  But if it's
off by factors, then we need to increase the statistics target on that
column, with:

ALTER TABLE name ALTER columnname SET STATISTICS xxx

where xxx is the new number.  The default is set in your postgresql.conf
file, and is usually pretty low, say 10.  You can go up to 1000, but
that makes query planning take longer.  Try some incremental increase to
say 20 or 40 or even 100, and run analyze on that table then do an
explain on it again until the estimate is close.

Once the estimate is close, you use change random_page_cost to get the
query planner to switch at the "right" time.  Change the number of in()
numbers and play with random_page_cost and see where that sweet spot
is.  note that what seems right on a single table for a single user may
not be best as you increase load or access other tables.

random_page_cost represents the increase in a random access versus a
sequential access.  As long as your data fit into ram, the difference is
pretty much none (i.e. random_page_cost=1) so don't set it too low, or
accessing REALLY large data sets could become REALLY slow, as it uses
indexes when it should have been sequentially scanning.

Also, check what you've got effective_cache set to.  This tells
postgresql how much memory your kernel is using for cache, and so lets
it know about how likely it is that your current data set under your
query is to be in there.

Also, read this:

http://www.varlena.com/varlena/GeneralBits/Tidbits/perf.html

pgsql-performance by date:

From: Joost Kraaijeveld
Date: 15 November 2005, 15:51:32
Subject: Re: Performance PG 8.0 on dual opteron / 4GB / 3ware

From: "James Mello"
Date: 15 November 2005, 18:08:31
Subject: Re: Hardware/OS recommendations for large databases ( 5TB)

Re: Too Many OR's? - Mailing list pgsql-performance

Previous

Next