Re: Query performance help with 'shadow table' approach. - Mailing list pgsql-general

From Alban Hertroys
Subject Re: Query performance help with 'shadow table' approach.
Date
Msg-id 940EF626-2EB1-44D2-93BC-44ED42F68DEC@gmail.com
In response to Query performance help with 'shadow table' approach.  (Brian Fehrle <brianf@consistentstate.com>)
Responses Re: Query performance help with 'shadow table' approach.  (Brian Fehrle <brianf@consistentstate.com>)
List pgsql-general
On 13 Sep 2011, at 23:44, Brian Fehrle wrote:

> These queries basically do a 'select max(primary_key_column) from table group by column1, column2.' Because of the group by, we would result in a sequential scan of the entire table which proves to be costly.
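Spelled out, the pattern under discussion looks like this (table and column names are placeholders; the second form is one common PostgreSQL rewrite of the same question, not something from the original post, and whether it actually helps depends on the available indexes and the planner):

```sql
-- The per-group max query as described:
SELECT column1, column2, max(primary_key_column)
FROM main_table
GROUP BY column1, column2;

-- Equivalent DISTINCT ON form; with an index on
-- (column1, column2, primary_key_column DESC) this can avoid the
-- hash/sort step, though it still has to visit every group:
SELECT DISTINCT ON (column1, column2)
       column1, column2, primary_key_column
FROM main_table
ORDER BY column1, column2, primary_key_column DESC;
```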

That seems to suggest that the row holding the max primary key value is "special" in some way. Making those rows more easily distinguishable from "normal" rows seems like a good idea here.

> Since the table has a ton of columns, I set up a smaller table that will house a copy of some of the data that the query uses, the primary key column, and the two columns I do my 'group by' on.

That's one way to distinguish these special rows from the rest. You could also mark them as special using an extra column and/or create an expression-based index over just those rows.
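For instance, a sketch along these lines (the flag column and all names are hypothetical, and the flag would have to be maintained by the application or a trigger):

```sql
-- Hypothetical flag marking the row holding each group's max primary key.
ALTER TABLE main_table ADD COLUMN is_latest boolean NOT NULL DEFAULT false;

-- Partial index covering only those "special" rows, so lookups on
-- (column1, column2) for the latest row stay small and cheap:
CREATE INDEX main_table_latest_idx
    ON main_table (column1, column2)
    WHERE is_latest;
```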

However, especially with the below section in mind, it would appear your data could be normalised a bit more (your splitting off that shadow table is a step in doing so, in fact).

I'm also wondering, does your primary key have actual meaning? It would appear to just indicate the order in which the records were created (I'm assuming it's a serial-type surrogate PK, and not a natural one).

> This shadow table will also only contain one row for every column1 and column2 combination (due to the group by), and for those rows, will have the max of the primary key. Even with this, the 'shadow' table will have about 14 million rows, compared to the 15 million in the main table.

Don't (column1, column2) make up a key then? I get the feeling you should split your table in 3 sections:
Table 1: main lookup (PK: pkey_sid)
Table 2: Variation lookup (PK: (column1, column2), FK: pkey_sid)
Table 3: Data (FK: the above)
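In DDL terms, something along these lines (a rough sketch only; the column types are guesses and the table names just follow the list above):

```sql
CREATE TABLE main_lookup (
    pkey_sid serial PRIMARY KEY
    -- ... columns that belong to the entity itself
);

CREATE TABLE variation_lookup (
    column1  integer NOT NULL,
    column2  integer NOT NULL,
    pkey_sid integer NOT NULL REFERENCES main_lookup,
    PRIMARY KEY (column1, column2)
);

CREATE TABLE data (
    column1 integer NOT NULL,
    column2 integer NOT NULL,
    -- ... the many wide columns from the current main table
    FOREIGN KEY (column1, column2) REFERENCES variation_lookup
);
```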

> So the issue here comes in retrieving the needed data from my main table. The resulting rows is estimated to be 409,600, and the retrieving of the primary keys that are associated with those rows is actually really easy. However, when we take those 409,600 rows back to the main table to retrieve the other columns I need, the planner is just doing a sequential scan as it's most likely going to be faster than hitting the index then retrieving the columns I need for all 400K+ rows.

Is that estimate accurate? If not, see Ondrej's suggestions.

That is only about 1/30th of your table. I don't think a seqscan makes sense here unless your data is distributed
badly.

> Things to note:
> 1. If I reduce my where clause's range, then the sequential scan turns into an index scan, but sadly this can't always be done.

Does it make sense to CLUSTER your data in some sense? That would improve the data distribution issue and would probably push the threshold for a seqscan up some.
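For example (the index name is made up; note that CLUSTER takes an exclusive lock and rewrites the whole table, so it needs a maintenance window on a table this size):

```sql
-- Rewrite the table physically ordered by the grouping columns:
CREATE INDEX main_table_c1_c2_idx ON main_table (column1, column2);
CLUSTER main_table USING main_table_c1_c2_idx;
ANALYZE main_table;  -- refresh planner statistics after the rewrite
```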

Cheers,

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.

