Re: index usage on queries on inherited tables - Mailing list pgsql-performance

From Samuel Gendler
Subject Re: index usage on queries on inherited tables
Date
Msg-id BANLkTiktd3hKAV9ntLWTxPCdz0Lh7usSUw@mail.gmail.com
Whole thread Raw
In response to Re: index usage on queries on inherited tables  (Joseph Shraibman <jks@selectacast.net>)
List pgsql-performance


On Wed, Apr 27, 2011 at 2:11 PM, Joseph Shraibman <jks@selectacast.net> wrote:
On 04/27/2011 04:32 PM, Robert Haas wrote:
> In the first case, PostgreSQL evidently thinks that using the indexes
> will be slower than just ignoring them.  You could find out whether
> it's right by trying it with enable_seqscan=off.

My point is that this is just a problem with inherited tables.  It
should be obvious to postgres that few rows are being returned, but in
the inherited tables case it doesn't use indexes.  This was just an
example.  In a 52 gig table I have a "select id from table limit 1 order
by id desc" returns instantly, but as soon as you declare a child table
it tries to seq scan all the tables.


If I'm understanding correctly, this kind of obviates the utility of partitioning if you structure a warehouse in a traditional manner.  Assuming a fact table partitioned by time, but with foreign keys to a time dimension, it is now not possible to gain any advantage from the partitioning if selecting on columns in the time dimension.

"select * from fact_table f join time_dimension t on f.time_id = t.time_id where t.quarter=3 and t.year = 2010" will scan all partitions of the fact table despite the fact that all of the rows would come from 3 partitions, assuming a partitioning schema that uses one partition for each month.  

I use a time id that is calculable from the from the timestamp so it doesn't need to be looked up, and partitioning on time_id directly is easy enough to handle, but if I'm understanding the problem, it sounds like nothing short of computing the appropriate time ids before issuing the query and then including a 'where f.time_id between x and y' clause to the query will result in the partitions being correctly excluded.  Is that what people are doing to solve this problem?  The alternative is to leave a timestamp column in the fact table (something I tend to do since it makes typing ad-hoc queries in psql much easier) and partition on that column and then always include a where clause for that column that is at least as large as the requested row range.  Both result in fairly ugly queries, though I can certainly see how I might structure my code to always build queries which adhere to this.

I'm just in the process of designing a star schema for a project and was intending to use exactly the structure I described at the top of the email. Is there a postgres best-practices for solving this problem? There's no way I can get away without partitioning.  I'm looking at a worst case table of 100,000 rows being written every 5 minutes, 24x7 - 29 million rows per day, a billion rows per month - with most queries running over a single month or comparing same months from differing years and quarters - so a month based partitioning.  Normal case is closer to 10K rows per 5 minutes.

Suggestions?

--sam


pgsql-performance by date:

Previous
From: "Kevin Grittner"
Date:
Subject: Re: reducing random_page_cost from 4 to 2 to force index scan
Next
From: Greg Smith
Date:
Subject: Re: index usage on queries on inherited tables