Thread: Smaller multiple tables or one large table?

Smaller multiple tables or one large table?

From
Benedict Holland
Date:
Hi All,

I am on postgres 9.0. I don't know the answer to what should be a fairly straightforward question. I have several static tables which are very large (on the order of 14 million rows and about 10 GB). They are all linked together through foreign keys and indexed on the columns which are queried most often. While they are more or less static, update operations do occur. This is not a super fast computer: it has 2 cores with 8 GB of RAM, so I am not expecting queries against these tables to be very fast. Structurally, though, I am wondering whether I should divide the tables into 1-million-row tables through constraints and a view. The potential speedup could be quite large if postgresql split each query into n table chunks running on k cores and then aggregated all of the data for display or operation. Is there any documentation on making postgresql do this, and is it worth it?
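(For concreteness, a miniature sketch of the kind of layout described above; all table, column, and index names here are made up:)

    CREATE TABLE parts (
        part_id  integer PRIMARY KEY,
        name     text
    );

    CREATE TABLE readings (
        reading_id  integer PRIMARY KEY,
        part_id     integer NOT NULL REFERENCES parts (part_id),  -- FK link
        taken_at    timestamp,
        value       numeric
    );

    -- index the column that queries filter on most often:
    CREATE INDEX readings_part_id_idx ON readings (part_id);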

Also, is there a benefit to having one large table or many small tables as far as indexes go?

Thanks,
~Ben

Re: Smaller multiple tables or one large table?

From
John R Pierce
Date:
On 06/15/12 11:34 AM, Benedict Holland wrote:
> I am on postgres 9.0. I don't know the answer to what should be a
> fairly straightforward question. I have several static tables which
> are very large (on the order of 14 million rows and about 10 GB).
> They are all linked together through foreign keys and indexed on the
> columns which are queried most often. While they are more or less
> static, update operations do occur. This is not a super fast
> computer: it has 2 cores with 8 GB of RAM, so I am not expecting
> queries against these tables to be very fast. Structurally, though,
> I am wondering whether I should divide the tables into 1-million-row
> tables through constraints and a view. The potential speedup could
> be quite large if postgresql split each query into n table chunks
> running on k cores and then aggregated all of the data for display
> or operation. Is there any documentation on making postgresql do
> this, and is it worth it?

postgres won't do that; one query is one process.  Your application
could conceivably run multiple threads, each with a separate postgres
connection, and execute multiple queries in parallel, but it would
have to do any aggregation of the results itself.
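A minimal sketch of that application-side approach, assuming the data
had already been split by primary-key range into hypothetical chunk
tables big_table_1 and big_table_2, with each statement issued on its
own connection:

    -- connection 1 runs:
    SELECT count(*) AS n, sum(value) AS s FROM big_table_1;

    -- connection 2 runs, at the same time:
    SELECT count(*) AS n, sum(value) AS s FROM big_table_2;

    -- the application then combines the partial results itself:
    --   total rows = n1 + n2,  total sum = s1 + s2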

>
> Also, is there a benefit to having one large table or many small
> tables as far as indexes go?

small tables only help if you can query the specific table you 'know'
has your data. For instance, if you have time-based data and you put
each month in its own table, and you know that a given query only
needs to look at the current month, then you just query that one
month's table.
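For example, a sketch with made-up table and column names:

    -- one small table per month:
    CREATE TABLE readings_2012_05 (
        sensor_id  integer,
        taken_at   timestamp,
        value      numeric
    );
    CREATE TABLE readings_2012_06 (
        sensor_id  integer,
        taken_at   timestamp,
        value      numeric
    );

    -- a query known to need only the current month touches one small
    -- table and its small index instead of the whole data set:
    SELECT * FROM readings_2012_06 WHERE sensor_id = 42;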



--
john r pierce                            N 37, W 122
santa cruz ca                         mid-left coast


Re: Smaller multiple tables or one large table?

From
Benedict Holland
Date:
Will the processes know that I have n tables whose definitions constrain their primary keys? I am thinking of a table constraint specifying that the primary key on each table falls within some boundary. That way a single process could spawn one thread per table and leave the thread management to the OS. Assuming it is well behaved, this should use every ounce of resource I throw at it: instead of sequentially going through one large table, it would sequentially go through one of n short tables in parallel with the k others. The results would then have to be aggregated, but with a large enough table the aggregation would pale in comparison to the run time of a query split across several smaller tables.

The tables would have to be specified with a CHECK constraint restricting each table's primary key to a range. A view would then be created over all of the small tables, with triggers handling insert and update operations. Selects would have to go through the view, but that is really cheap compared to updates. That should have the additional benefit of only hitting the specific table(s) affected by an update.
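A sketch of that layout, with hypothetical names and range boundaries:

    -- each small table carries a CHECK constraint bounding its keys:
    CREATE TABLE big_table_1 (
        id    integer PRIMARY KEY,
        data  text,
        CHECK (id >= 1 AND id < 1000001)
    );
    CREATE TABLE big_table_2 (
        id    integer PRIMARY KEY,
        data  text,
        CHECK (id >= 1000001 AND id < 2000001)
    );

    -- a view presents the small tables as one:
    CREATE VIEW big_table AS
        SELECT * FROM big_table_1
        UNION ALL
        SELECT * FROM big_table_2;

Note that on 9.0 a view cannot have triggers (INSTEAD OF triggers on views arrived in 9.1), so the insert/update routing described above would have to be done with rules instead.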

Basically, I don't see how this particular configuration breaks, and I wonder whether PostgreSQL already has this ability, as it seems very useful for managing very large data sets.

Thanks,
~Ben

On Fri, Jun 15, 2012 at 2:42 PM, John R Pierce <pierce@hogranch.com> wrote:
On 06/15/12 11:34 AM, Benedict Holland wrote:
I am on postgres 9.0. I don't know the answer to what should be a fairly straightforward question. I have several static tables which are very large (on the order of 14 million rows and about 10 GB). They are all linked together through foreign keys and indexed on the columns which are queried most often. While they are more or less static, update operations do occur. This is not a super fast computer: it has 2 cores with 8 GB of RAM, so I am not expecting queries against these tables to be very fast. Structurally, though, I am wondering whether I should divide the tables into 1-million-row tables through constraints and a view. The potential speedup could be quite large if postgresql split each query into n table chunks running on k cores and then aggregated all of the data for display or operation. Is there any documentation on making postgresql do this, and is it worth it?

postgres won't do that; one query is one process.  Your application could conceivably run multiple threads, each with a separate postgres connection, and execute multiple queries in parallel, but it would have to do any aggregation of the results itself.



Also, is there a benefit to having one large table or many small tables as far as indexes go?

small tables only help if you can query the specific table you 'know' has your data. For instance, if you have time-based data and you put each month in its own table, and you know that a given query only needs to look at the current month, then you just query that one month's table.



--
john r pierce                            N 37, W 122
santa cruz ca                         mid-left coast



Re: Smaller multiple tables or one large table?

From
Gabriele Bartolini
Date:
Hi Benedict,

On 15/06/12 20:58, Benedict Holland wrote:
> The tables would have to be specified with a CHECK constraint
> restricting each table's primary key to a range. A view would then
> be created over all of the small tables, with triggers handling
> insert and update operations. Selects would have to go through the
> view, but that is really cheap compared to updates. That should have
> the additional benefit of only hitting the specific table(s)
> affected by an update.
>
> Basically, I don't see how this particular configuration breaks,
> and I wonder whether PostgreSQL already has this ability, as it
> seems very useful for managing very large data sets.

What you are looking for is called 'partitioning' (horizontal
partitioning). I suggest that you read this chapter:
http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html
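In outline, the scheme described in that chapter uses table inheritance plus CHECK constraints rather than an explicit view; a minimal sketch with made-up names:

    -- the parent table holds no rows itself:
    CREATE TABLE big_table (
        id    integer NOT NULL,
        data  text
    );

    -- child tables inherit its columns and bound their key ranges:
    CREATE TABLE big_table_p1 (
        CHECK (id >= 1 AND id < 1000001)
    ) INHERITS (big_table);
    CREATE TABLE big_table_p2 (
        CHECK (id >= 1000001 AND id < 2000001)
    ) INHERITS (big_table);

    -- with constraint exclusion on, the planner skips children whose
    -- CHECK constraint proves they cannot match the WHERE clause:
    SET constraint_exclusion = partition;
    SELECT * FROM big_table
    WHERE id = 123456;   -- reads big_table_p1 (plus the empty parent)

A trigger on the parent table (or per-partition inserts in the application) would route new rows to the right child.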

Cheers,
Gabriele

--
  Gabriele Bartolini - 2ndQuadrant Italia
  PostgreSQL Training, Services and Support
  gabriele.bartolini@2ndQuadrant.it | www.2ndQuadrant.it


Re: Smaller multiple tables or one large table?

From
Benedict Holland
Date:
Hi all,

I am curious whether there is a significant speedup from doing this if most of the queries run against the data are table-wide. I won't drop the data, and the data won't really grow. Do I get a better speedup with one large table and large indexes, or with many small tables and many small indexes?

Thanks,
~Ben

On Sat, Jun 16, 2012 at 2:13 AM, Gabriele Bartolini <gabriele.bartolini@2ndquadrant.it> wrote:
Hi Benedict,

On 15/06/12 20:58, Benedict Holland wrote:

The tables would have to be specified with a CHECK constraint restricting each table's primary key to a range. A view would then be created over all of the small tables, with triggers handling insert and update operations. Selects would have to go through the view, but that is really cheap compared to updates. That should have the additional benefit of only hitting the specific table(s) affected by an update.

Basically, I don't see how this particular configuration breaks, and I wonder whether PostgreSQL already has this ability, as it seems very useful for managing very large data sets.

What you are looking for is called 'partitioning' (horizontal partitioning). I suggest that you read this chapter: http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html

Cheers,
Gabriele

--
 Gabriele Bartolini - 2ndQuadrant Italia
 PostgreSQL Training, Services and Support
 gabriele.bartolini@2ndQuadrant.it | www.2ndQuadrant.it