Thread: Should I partition this table?

Should I partition this table?

From
AlexK
Date:
My table currently uses up 62 GB of storage, and it has 450 M rows. This
narrow table has a PK on (ParentID, ChildNumber), and it has between 20K and
50K of child rows per parent.

The data is inserted daily, rarely modified, never deleted. The performance
of modifications is not an issue. The only select from it is as follows:

SELECT <column_lis> FROM MyChildTable WHERE ParentID=?
ORDER BY ChildNumber;

The selects are frequent, and their performance is essential.

Would you advice me to partition this table?

TIA



--
View this message in context: http://postgresql.1045698.n5.nabble.com/Should-I-partition-this-table-tp5811130.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: Should I partition this table?

From
Bill Moran
Date:
On Thu, 10 Jul 2014 07:59:20 -0700 (PDT) AlexK <alkuzo@gmail.com> wrote:

> My table currently uses up 62 GB of storage, and it has 450 M rows. This
> narrow table has a PK on (ParentID, ChildNumber), and it has between 20K and
> 50K of child rows per parent.
>
> The data is inserted daily, rarely modified, never deleted. The performance
> of modifications is not an issue. The only select from it is as follows:
>
> SELECT <column_lis> FROM MyChildTable WHERE ParentID=?
> ORDER BY ChildNumber;
>
> The selects are frequent, and their performance is essential.
>
> Would you advise me to partition this table?

In general, yes, given the information you provided.  A parition on
ParentID % $something should improve performance.  Exactly what
$something is will take some experimenting on your part to determine.
In my experience, the scenario you describe is likely to see SELECT
performance improve nearly linerally to the number of partitions, up
to some point that will take exerpimenting to determine.

However, there may other methods of partition and/or rearranging the
data that would be even better, depending on a lot of information you
did not provide.

As an example, how are your selects distributed?  Are they fairly even
across the entire data set?  Or do ParentID become less accessed the
older they get?  If the latter, you'll probably be better served by
lazily archiving infrequently accessed ParentID rows to an archive
table and adjusting the application to search that table only if the
rows weren't found in the primary table.  This is an improvement over
modulous partitioning becuase it's more likely that the frequently
accessed data will be in memory and stay there.

Going even further, since the data is infrequently modified, you might be
better served by putting some sort of cache (memcache, or a custom in-
app cache) in front of the DB and checking it first.  The in-app cache
is always the best because it incurs no network traffic to access, but
the feasibility of doing that depends on the exact nature of the
application.  Say it with me: An RDBMS is not RAM, and trying to use
it like RAM will probably lead to disappointing performance.

--
Bill Moran <wmoran@potentialtech.com>


Re: Should I partition this table?

From
Kevin Grittner
Date:
AlexK <alkuzo@gmail.com> wrote:

> My table currently uses up 62 GB of storage, and it has 450 M
> rows. This narrow table has a PK on (ParentID, ChildNumber), and
> it has between 20K and 50K of child rows per parent.
>
> The data is inserted daily, rarely modified, never deleted. The
> performance of modifications is not an issue. The only select
> from it is as follows:
>
> SELECT <column_lis> FROM MyChildTable WHERE ParentID=?
> ORDER BY ChildNumber;
>
> The selects are frequent, and their performance is essential.
>
> Would you advice me to partition this table?

You didn't actually tell us about the most salient facts for
whether partitioning will improve or degrade performance.

If data does not all fit in cache and parents are added over time
with increasing ID values and the vast majority of queries only
reference recent parents, then partitioning by ranges of parentID
will improve your cache hit ratio and thereby improve performance.

Even if all data fits in cache, if children are only added to
recently added parents you could partition by parentID and CLUSTER
partitions when they reach the point where there are few if any new
children or updates.  This will reduce the number of pages
referenced per scan, and may allow partitioning to be a win.

Otherwise I would expect partitioning to hurt performance.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Should I partition this table?

From
AlexK
Date:
Bill,

Regarding "SELECT performance improve nearly linerally to the number of
partitions," - can you elaborate why? If I split my table into several
partitions, even the index depth may stay the same, because the PK is
narrow, it only consists of 2 4-byte integers.

My selects are distributed more or less evenly, so we really like your
suggestion to split on ParentID%SomeNumber, thank you!

At this time we would rather not introduce bugs related to cache
invalidation. Although we do need to read data fast, we do not want stale
data, and we cannot afford any bugs in this application.



--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Should-I-partition-this-table-tp5811130p5811141.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: Should I partition this table?

From
AlexK
Date:
Kevin,

For now, all the data fits in the cache: the box has 384GB of RAM. But I
want to be ready for later, when we have more data. It is easier to refactor
my table now, when it is still smallish.

Children are only added to recently added parents, and they are all
added/updated/deleted at once. These child rows represent an object which
changes as a whole.

Parents are added over time at a steady pace, with increasing ID values. But
we frequently read history as well as recent rows. Also we sometimes remove,
always the parent and all its child rows.

I hope this is enough information. Thank you!



--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Should-I-partition-this-table-tp5811130p5811142.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: Should I partition this table?

From
Bill Moran
Date:
On Thu, 10 Jul 2014 08:40:59 -0700 (PDT) AlexK <alkuzo@gmail.com> wrote:

> Bill,
>
> Regarding "SELECT performance improve nearly linerally to the number of
> partitions," - can you elaborate why? If I split my table into several
> partitions, even the index depth may stay the same, because the PK is
> narrow, it only consists of 2 4-byte integers.

That statement is based on experimentation.  About a year ago, I did a
research project for a former employer to determine the best way to
store a large amount of data.  Like you, we were talking about narrow
rows, but we were looking at the total # of rows exceeding 10 billion.
Also different, we didn't expect the number of unique "ParentID"s to
ever exceed 100,000.

I managed to borrow some beefy hardware from another project that
wouldn't need it for a few weeks and do some experimenting with
different partition configurations, all compared to a baseline of
an unpartitioned table.  I don't remember exactly, but I believe I
was populating the test databases with 1 billion rows.

That's where I came up with the "linear" determination.  A query that
averaged 100ms on a single table averaged 10ms on 10 partitions and
1ms on 100 partitions (etc).  Every test I concocted seemed to support
that the improvement was linear.

Another point that I expermimented with, and is worth noting: not all
queries are able to benefit from the partitioning, but the test queries
that I ran that could not, saw only a few percent of performance hit.
Since our app design would take advantage of the partitioning for 99%
of its queries, it was a no-brainer.

As a result, I can only theorize, but my theory is that since each
partition is actually a table, and has indexes independently of the
other table/partitions, that both indexes searches and table scans
required less work with smaller tables.  That theory doesn't 100%
make sense, since index performance should not degrade linerally with
index size, but that was the behavior I observed.  It's likely that
because of the significant size of the hardware, that the performance
curve was simply so close to flat that it looked linear at the sizes
I was working with.

> At this time we would rather not introduce bugs related to cache
> invalidation. Although we do need to read data fast, we do not want stale
> data, and we cannot afford any bugs in this application.

I can't make your decisions for you.  But I'll make an attempt here
to try to help you avoid the same mistake I made.

On that same project, we made a similar decision: caches are sources
of bugs so we'll just make sure the DB is tuned well enough that we
don't need a cache.

That decision led to a lot of late night work under duress to get
caching reliably implemented at the last minute.  There are a lot of
things that partitioning doesn't speed up: parsing and planning the
queries, the time it takes the network to move data back and forth,
and the time it takes the application to demarshall the results
provided by the SQL server.  Caching data in RAM completely removes
all of that overhead.

Of course, only you and your developers can make the determination as
to whether you need it.  I'm just recommending that you don't fail
to consider it simply because it's difficult to do reliably.  If
you're use the "can't risk bugs" argument, you might benefit more
by improving your QA process than anything else.

--
Bill Moran <wmoran@potentialtech.com>


Re: Should I partition this table?

From
Kevin Grittner
Date:
AlexK <alkuzo@gmail.com> wrote:

> For now, all the data fits in the cache: the box has 384GB of
> RAM. But I want to be ready for later, when we have more data. It
> is easier to refactor my table now, when it is still smallish.

Makes sense.

> Children are only added to recently added parents, and they are
> all added/updated/deleted at once. These child rows represent an
> object which changes as a whole.
>
> Parents are added over time at a steady pace, with increasing ID
> values. But we frequently read history as well as recent rows.
> Also we sometimes remove, always the parent and all its child
> rows.

That suggests to me that a partition based on ranges of parent IDs
would be optimal, with a CLUSTER of each partition as it reaches a
fairly stable state.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Should I partition this table?

From
AlexK
Date:
Kevin,

What would be the advantages of partitioning on ranges of ParentID? Each
query will touch at most one partition. I might or might not get PK indexes
one level of depth less.

I understand that I will CLUSTER these smaller tables and benefit from that.
Other than clustering, what are other advantages?



--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Should-I-partition-this-table-tp5811130p5811157.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: Should I partition this table?

From
Bill Moran
Date:
On Thu, 10 Jul 2014 11:36:27 -0700 (PDT) AlexK <alkuzo@gmail.com> wrote:

> What would be the advantages of partitioning on ranges of ParentID? Each
> query will touch at most one partition. I might or might not get PK indexes
> one level of depth less.

You need to partition by ParentID in order for the example queries you
provided to benefit from the partitioning.  You should abstain from removing
the previous comments from each email reply ... I'm not going to dig back
through this thread to find specific examples.

> I understand that I will CLUSTER these smaller tables and benefit from that.
> Other than clustering, what are other advantages?

Personally, I find the benefits of CLUSTER to be debatable.  I haven't done
a lot of investigation, but my experience has been that keeping things
CLUSTERed is more work than it's worth.  Certainly, if you have a table
that _never_ changes, and he access patterns dictate it, there's no reason
not to CLUSTER, but I'm not convinced that you'll benefit (again, there's
a lot of information about your application use that hasn't been provided
that's necessary to make such a determination)

--
Bill Moran <wmoran@potentialtech.com>


Re: Should I partition this table?

From
Jeff Janes
Date:
On Thu, Jul 10, 2014 at 8:20 AM, Bill Moran <wmoran@potentialtech.com> wrote:
On Thu, 10 Jul 2014 07:59:20 -0700 (PDT) AlexK <alkuzo@gmail.com> wrote:

> My table currently uses up 62 GB of storage, and it has 450 M rows. This
> narrow table has a PK on (ParentID, ChildNumber), and it has between 20K and
> 50K of child rows per parent.
>
> The data is inserted daily, rarely modified, never deleted. The performance
> of modifications is not an issue. The only select from it is as follows:
>
> SELECT <column_lis> FROM MyChildTable WHERE ParentID=?
> ORDER BY ChildNumber;
>
> The selects are frequent, and their performance is essential.

How is their performance currently?
 
>
> Would you advise me to partition this table?


No, not based on the current info.  There is no reason to think partitioning would improve the performance that matters to you.  You said that child rows for the same parent are all inserted at the same time, so they should naturally be well-clustered.  That will be important for performance once the data exceeds what can be cached.  If that clustering did not occur naturally then you might benefit from imposing it, and partitioning might be an import part of doing that. But it sounds like you will not need to worry about that.
 

In general, yes, given the information you provided.  A parition on
ParentID % $something should improve performance.

PostgresSQL's constraint exclusion logic is not smart enough to turn a simple equality into a mod equality.  Which means every select query would have to include "AND ParentID % something = (:1 % somthing)" in addition to the primary clause "ParentID=:1", in order to benefit from constraint exclusion.  That would be very unnatural, annoying, and error prone.  Range partitioning would be better, if any partitioning is needed at all.

Cheers,

Jeff

Re: Should I partition this table?

From
Jeff Janes
Date:
On Thu, Jul 10, 2014 at 11:36 AM, AlexK <alkuzo@gmail.com> wrote:
Kevin,

What would be the advantages of partitioning on ranges of ParentID? Each
query will touch at most one partition. I might or might not get PK indexes
one level of depth less.

I understand that I will CLUSTER these smaller tables and benefit from that.
Other than clustering, what are other advantages?

If you don't partition, it will take an unacceptably long time to run CLUSTER on the entire table.  If you do partition, you can CLUSTER one partition at a time, and only need to CLUSTER the fast-changing partitions more than once.  But based on your description, you probably don't need to run explicit CLUSTERs anyway as your data would end up naturally clustered.

Cheers,

Jeff

Re: Should I partition this table?

From
Bill Moran
Date:
On Thu, 10 Jul 2014 13:16:05 -0700 Jeff Janes <jeff.janes@gmail.com> wrote:
> >
> > In general, yes, given the information you provided.  A parition on
> > ParentID % $something should improve performance.
>
> PostgresSQL's constraint exclusion logic is not smart enough to turn a
> simple equality into a mod equality.  Which means every select query would
> have to include "AND ParentID % something = (:1 % somthing)" in addition to
> the primary clause "ParentID=:1", in order to benefit from constraint
> exclusion.  That would be very unnatural, annoying, and error prone.  Range
> partitioning would be better, if any partitioning is needed at all.

I was remiss in pointing out the additional WHERE requirement -- I tend to
assume that people are already aware of that, but it's likely that not
everyone is.  Thank you for pointing it out.

As far as your comments against adding them: I'm not going to speculate as
to what kind of queries people do or do not find annoying.  The term
"unnatural" is an odd choice of words, and the only thing I can think to
respond with is "platypus."  As far as error-prone is concerned, it's going
to have to be the OPs decision on whether the additional work is worth the
improvement.  In the end, software isn't error-prone, programmers are error-
prone.  If you have a good QA process in place, then you don't worry about
error-prone programmers, as the QA process catches their mistakes.  If you
don't have such a process in place, or you don't trust it; then things get
harder and you make tradeoff decisions like, "I don't want to write complex
code, even if it's better, because we don't have the ability to ensure it's
error free."  And I can't make those kinds of judgments because I don't know
what your environment is like.

There's no silver bullet.  The OP doesn't seem to have any information about
what he's planning for: How big is the data predicted to get?  What is
an acceptable level of performance?  Has he even tested to see if the existing
layout will scale acceptably to the expected data volume?  (perhaps nothing
needs to be changed at all)  We don't know, so we can only speculate.

--
Bill Moran <wmoran@potentialtech.com>