Thread: general question on two-partition table

general question on two-partition table

From: Janet Jacobsen
Hi.  We have a table with 30 M records that is growing by
about 100 K records per day.

The experimentalists, whose data are in the table, have
decided that they will focus on the records for which the
value of one field, rbscore, is greater than a cutoff.
However, they want to continue to store all of the data
- even the records for which rbscore is less than the cutoff
- in the database.

For the current table, there are about 400 K (out of 30 M)
records that meet the 'good' criteria.

Each record in the table has about 40 fields, and the
experimentalists do in fact write queries that use many of
those fields, some more than others.  (They are building a
model and have not pinned down exactly which fields are
more important than others, so that's why they store and
query by so many fields.)

If they are going to spend 95% of their time querying the
records that meet the 'good' criteria, what are the good
strategies for ensuring good performance for those queries?
(1) Should I partition the table into two partitions based on
the value of rbscore?
(2) Should I create two separate tables?

Are (1) and (2) more or less equivalent in terms of
performance?

I think that partitioning the table is a more flexible option
(i.e., what if the cutoff value changes, no need to change
the name of the table being queried, etc.), but would (2)
give better performance given that 95% of their queries
are for rbscores greater than a threshold value?

Can you suggest other strategies?

Thank you,
Janet




Re: general question on two-partition table

From: David Wilson
On Mon, Jul 27, 2009 at 7:52 PM, Janet Jacobsen<jsjacobsen@lbl.gov> wrote:

> Can you suggest other strategies?

Something that might be easier to play with is to create a functional
index (or several, to speed up other queries) on the comparison
between rbscore and the cutoff. It won't buy you anything on seq
scans, but if most of the queries are done on small subsets of the
tuples which meet the "good" criteria, it could be a big win that's
very easy to implement.

--
- David T. Wilson
david.t.wilson@gmail.com

Re: general question on two-partition table

From: Greg Stark
On Tue, Jul 28, 2009 at 1:08 AM, David Wilson<david.t.wilson@gmail.com> wrote:
> On Mon, Jul 27, 2009 at 7:52 PM, Janet Jacobsen<jsjacobsen@lbl.gov> wrote:
>
>> Can you suggest other strategies?
>
> Something that might be easier to play with is to create a (or
> several, to speed up other queries) functional index on the comparison
> between rbscore and the cutoff.

I think it would be even more interesting to have partial indexes --
i.e., specified with "WHERE rbscore > cutoff".
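
Roughly (the table name "candidates" and the 0.5 cutoff here are just
placeholders for whatever your schema actually uses):

    -- Index only the "good" rows; the planner can use it for queries
    -- whose WHERE clause implies rbscore > 0.5.
    CREATE INDEX candidates_good_rbscore_idx
        ON candidates (rbscore)
        WHERE rbscore > 0.5;

Note that a query's WHERE clause has to imply the index predicate for
the partial index to be considered.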

I'm actually wondering if partitioning is really what you want. You
might prefer to just keep two entirely separate tables: one that has
all the data and one that has a second copy of the desirable subset.
That would be kind of like a "materialized view" of a simple query
with the WHERE clause "rbscore > cutoff".
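
Keeping that copy up to date by hand might look something like this
(again, the table names and the 0.5 cutoff are placeholders, and your
load script would have to keep the copy in sync):

    -- One-time copy of the "good" subset into its own table.
    CREATE TABLE candidates_good AS
        SELECT * FROM candidates WHERE rbscore > 0.5;

    -- Each new batch then gets appended to both tables, with only
    -- the qualifying rows going into the copy.
    INSERT INTO candidates_good
        SELECT * FROM candidates_new_batch WHERE rbscore > 0.5;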


--
greg
http://mit.edu/~gsstark/resume.pdf

Re: general question on two-partition table

From: David Wilson
On Mon, Jul 27, 2009 at 8:24 PM, Greg Stark<gsstark@mit.edu> wrote:

> I think it would be even more interesting to have partial indexes --
> i.e., specified with "WHERE rbscore > cutoff".

Yes - that's what I actually meant. Words got scrambled between brain
and fingers...

--
- David T. Wilson
david.t.wilson@gmail.com

Re: general question on two-partition table

From: Stephen Frost
* Janet Jacobsen (jsjacobsen@lbl.gov) wrote:
> If they are going to spend 95% of their time querying the
> records that meet the 'good' criteria, what are the good
> strategies for ensuring good performance for those queries?
> (1) Should I partition the table into two partitions based on
> the value of rbscore?
> (2) Should I create two separate tables?
>
> Are (1) and (2) more or less equivalent in terms of
> performance?

It's not clear to me what you plan here...  How would you handle (2) for
the users?  Would you construct a view across them, or expect them to
query the right table(s)?  Options, as I see them, in order of
best-to-worst with respect to user friendliness and performance, are:

1- Partitioning (with CHECK constraints and constraint_exclusion)
2- View across two tables (with appropriate WHERE clauses)
3- Functional index (as suggested by someone else)
4- separate tables (users have to figure out how to use them)
5- single table with everything

My recommendation would be #1, followed by #2.  Be sure to look up how
to do partitioning by using inheritance in PG, and, if you need to,
look at how to implement a trigger to handle inserts on the parent
table.  Make sure you create your CHECK() constraints correctly, and
that you have constraint_exclusion enabled, and that it *works*.
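
A rough sketch of that setup with 8.x inheritance partitioning --
table and column names other than rbscore, and the 0.5 cutoff, are
placeholders for your real schema:

    -- Parent table holds no rows itself.
    CREATE TABLE candidates (
        id       bigint,
        rbscore  real
        -- plus the other ~40 columns
    );

    -- Two children with non-overlapping CHECK constraints.
    CREATE TABLE candidates_good (CHECK (rbscore >  0.5)) INHERITS (candidates);
    CREATE TABLE candidates_rest (CHECK (rbscore <= 0.5)) INHERITS (candidates);

    -- Route inserts on the parent into the right child.
    CREATE OR REPLACE FUNCTION candidates_insert() RETURNS trigger AS $$
    BEGIN
        IF NEW.rbscore > 0.5 THEN
            INSERT INTO candidates_good VALUES (NEW.*);
        ELSE
            INSERT INTO candidates_rest VALUES (NEW.*);
        END IF;
        RETURN NULL;  -- the row has already been stored in a child
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER candidates_insert_trig
        BEFORE INSERT ON candidates
        FOR EACH ROW EXECUTE PROCEDURE candidates_insert();

    -- Needed so the planner can skip the child whose CHECK
    -- contradicts the query's WHERE clause.
    SET constraint_exclusion = on;

The SET can of course go into postgresql.conf instead of being issued
per session.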

> I think that partitioning the table is a more flexible option
> (i.e., what if the cutoff value changes, no need to change
> the name of the table being queried, etc.), but would (2)
> give better performance given that 95% of their queries
> are for rbscores greater than a threshold value?

If you have your partitioning set up correctly, I don't believe having
actual separate tables would be that much of a performance gain for
queries.  It would help some with inserts, though if you know which
table to insert into, you could just insert into that child directly
rather than inserting into the parent and relying on a trigger.
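
For example, if your loading script already knows each row's rbscore
(the child and staging table names are placeholders):

    INSERT INTO candidates_good SELECT * FROM staging WHERE rbscore >  0.5;
    INSERT INTO candidates_rest SELECT * FROM staging WHERE rbscore <= 0.5;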

    Enjoy,

        Stephen


Re: general question on two-partition table

From: Janet Jacobsen
Hi.  Thanks for your reply.

I looked at the documentation for partitioning. Is it the case that I
have to create the master table and the two partition tables (split on
the value of rbscore) and then copy the records from the existing
table into the two partitions?

Stephen Frost wrote:
> * Janet Jacobsen (jsjacobsen@lbl.gov) wrote:
>
>> If they are going to spend 95% of their time querying the
>> records that meet the 'good' criteria, what are the good
>> strategies for ensuring good performance for those queries?
>> (1) Should I partition the table into two partitions based on
>> the value of rbscore?
>> (2) Should I create two separate tables?
>>
>> Are (1) and (2) more or less equivalent in terms of
>> performance?
>>
>
> It's not clear to me what you plan here...  How would you handle (2) for
> the users?  Would you construct a view across them, or expect them to
> query the right table(s)?  Options, as I see them, in order of
> best-to-worst with respect to user friendliness and performance, are:
>
> 1- Partitioning (with CHECK constraints and constraint_exclusion)
> 2- View across two tables (with appropriate WHERE clauses)
> 3- Functional index (as suggested by someone else)
> 4- separate tables (users have to figure out how to use them)
> 5- single table with everything
>
> My recommendation would be #1, followed by #2.  Be sure to look up how
> to do partitioning by using inheritance in PG, and, if you need to,
> look at how to implement a trigger to handle inserts on the parent
> table.  Make sure you create your CHECK() constraints correctly, and
> that you have constraint_exclusion enabled, and that it *works*.
>
Is there some reason why you emphasized "works"?  Would I use
EXPLAIN to verify that constraint exclusion is actually being applied?
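For instance, would checking a query plan along these lines be the
right approach (table names and cutoff as placeholders)?

    SET constraint_exclusion = on;
    EXPLAIN SELECT * FROM candidates WHERE rbscore > 0.9;

I would expect the plan to show only the parent and the "good" child
being scanned if exclusion is working.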
>> I think that partitioning the table is a more flexible option
>> (i.e., what if the cutoff value changes, no need to change
>> the name of the table being queried, etc.), but would (2)
>> give better performance given that 95% of their queries
>> are for rbscores greater than a threshold value?
>>
>
> If you have your partitioning set up correctly, I don't believe having
> actual separate tables would be that much of a performance gain for
> queries.  It would help some with inserts, though if you know which
> table to insert into, you could just insert into that child directly
> rather than inserting into the parent and relying on a trigger.
>
>     Enjoy,
>
>         Stephen
>
Since the insert would be based on the value of rbscore, and there is
only one cut-off value, I could skip using a trigger.

At this point it seems like it would be fairly easy to try both the
partial index on rbscore and partitioning to see whether there are
performance differences.

Does anyone have any experience with performance differences
between using a partial index and partitions?  (Or is it not so usual
to have a two-partition table?)

Also, does anyone have experience with the performance of partitions
as the size of the table increases, say from 10 million rows to 10 billion
rows?

Janet





Re: general question on two-partition table

From: Stephen Frost
* Janet Jacobsen (jsjacobsen@lbl.gov) wrote:
> I looked at the documentation for partitioning. Is it the case that I
> have to create the master table and the two partition tables (split on
> the value of rbscore) and then copy the records from the existing
> table into the two partitions?

That's probably what I'd do.  You might be able to ALTER the existing
table to be inherited instead; I don't recall if that was a new
capability or something that already existed (as in, it might only be
in 8.4).
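
The copy itself could be as simple as (all names here are placeholders;
I'm calling the existing table old_candidates):

    INSERT INTO candidates_good SELECT * FROM old_candidates WHERE rbscore >  0.5;
    INSERT INTO candidates_rest SELECT * FROM old_candidates WHERE rbscore <= 0.5;

Alternatively, if the existing table could serve as one of the children
as-is, something like ALTER TABLE old_candidates INHERIT candidates
(plus the matching CHECK constraint) might avoid copying that portion,
but check which PG version supports it.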

    Thanks,

        Stephen
