Thread: many tables in db

many tables in db

From
Kenneth Been
Date:
Background:

I am considering a db schema with many tables (in the
thousands).  I once read a message on this list that said
that having a lot of tables could affect performance,
depending on how the OS handles having many files in a
directory.

Questions:

1. Does anyone have more specific information about the
perils of having many tables in a db?  (I notice that with
7.1 there are even more files, because of the toast files.)

2. One of the reasons I am considering breaking the data
into multiple tables is that I want to cluster a large table
(on an rtree index) and I am running out of disk space
during the clustering operation, since all of the data is
being duplicated.  Is there some other way to get around
this problem?  (Other than installing another disk, which
isn't as easy as it sounds, because I'm not the sys admin.)

3. Another reason for splitting up the data is that I am
worried about the size of the rtree.  I have some suspicion
that the rtree size grows more than linearly with the size
of the table.  Is that the case?

4. Another possible reason is query speed.  Any guesses on
which would be faster, two or three queries on tables of
about 40K records, or one query on a table of about 10M
records?  (All queries are geometric range searches on an
rtree index.)

Thanks.

Ken


Re: many tables in db

From
Tom Lane
Date:
Kenneth Been <kennethb@telocity.com> writes:
> I am considering a db schema with many tables (in the
> thousands).

Just out of curiosity, why would you need so many tables?  Couldn't you
represent the same info in fewer tables, maybe with more key columns?
If every table has a different set of columns, that wouldn't work, but
I'm wondering what the organizing principle really is.

> 2. One of the reasons I am considering breaking the data
> into multiple tables is that I want to cluster a large table
> (on an rtree index)

This isn't really relevant to your main point, but: since an rtree
doesn't have an associated sort order, it's not clear to me that this
operation makes any sense.  Have you determined that you'll actually get
any performance improvement as a result of the clustering?  I suspect
you may find that you're just rearranging the table into a different
random order.

> 3. Another reason for splitting up the data is that I am
> worried about the size of the rtree.  I have some suspicion
> that the rtree size grows more than linearly with the size
> of the table.  Is that the case?

Dunno.  The rtree code isn't very well maintained, because none of
the current crop of developers uses it or knows much about it.
It could well have some bugs that cause it to waste space.  (I realized
only a few days ago that it probably behaves completely horribly when
asked to store NULLs, for example.  Do you have any NULLs in the column
you're indexing?)

> 4. Another possible reason is query speed.  Any guesses on
> which would be faster, two or three queries on tables of
> about 40K records, or one query on a table of about 10M
> records?  (All queries are geometric range searches on an
> rtree index.)

Given the size difference, I'd suppose that the 10M table would actually
be the equivalent of a materialized join of the smaller tables?  I'd
guess that recomputing the join on the fly is better because it takes
so much less I/O, but that's just a guess.  You'd be well advised to
test both ways and see.

            regards, tom lane

Re: many tables in db

From
Kenneth Been
Date:
>>I am considering a db schema with many tables (in the
>>thousands).
>>
>
> Just out of curiosity, why would you need so many tables?  Couldn't you
> represent the same info in fewer tables, maybe with more key columns?
> If every table has a different set of columns, that wouldn't work, but
> I'm wondering what the organizing principle really is.


I would love to just use one table, but I am worried about
performance, both in building the db and in querying it.
(Once built, the db is static.)  So each table contains a
subset of the rows that would be in the hypothetical single
table.

Specifically, I am making an interactive map server for the
web.  So the tables store the geographic information, lines
and polygons, indexed by the (latitude,longitude) range of
each feature (geometric type box).  So a one table solution
would have a query like

select stuff from lines where range && query_range

which would be replaced in the multi-table solution with

(select stuff from lines_43 where range && query_range)
union
(select stuff from lines_46 where range && query_range)

I can easily store in memory the range (bounding box) for
each table, so no disk accesses are required to know which
tables need to be searched.

>>2. One of the reasons I am considering breaking the data
>>into multiple tables is that I want to cluster a large table
>>(on an rtree index)
>>
>
> This isn't really relevant to your main point, but: since an rtree
> doesn't have an associated sort order, it's not clear to me that this
> operation makes any sense.  Have you determined that you'll actually get
> any performance improvement as a result of the clustering?  I suspect
> you may find that you're just rearranging the table into a different
> random order.


The rtree doesn't have an exact order, but things that are
closer to each other in the tree (on the same branch) will
generally be closer to each other in the geography.  The
documentation for the cluster command says that the table is
copied to a temporary order "in index order" and then
renamed.  If this is really what is happening, then it
should put things that are close to each other in geography
close to each other on disk.

I haven't done any rigorous experiments, but I've eyeballed
it and I think the clustering helps.  But you're right, I
should do some more serious experiments to be sure.

>>3. Another reason for splitting up the data is that I am
>>worried about the size of the rtree.  I have some suspicion
>>that the rtree size grows more than linearly with the size
>>of the table.  Is that the case?
>>
>
> Dunno.  The rtree code isn't very well maintained, because none of
> the current crop of developers uses it or knows much about it.
> It could well have some bugs that cause it to waste space.  (I realized
> only a few days ago that it probably behaves completely horribly when
> asked to store NULLs, for example.  Do you have any NULLs in the column
> you're indexing?)


No, no nulls.

>>4. Another possible reason is query speed.  Any guesses on
>>which would be faster, two or three queries on tables of
>>about 40K records, or one query on a table of about 10M
>>records?  (All queries are geometric range searches on an
>>rtree index.)
>>
>
> Given the size difference, I'd suppose that the 10M table would actually
> be the equivalent of a materialized join of the smaller tables?  I'd
> guess that recomputing the join on the fly is better because it takes
> so much less I/O, but that's just a guess.  You'd be well advised to
> test both ways and see.

I'm not sure what you mean by "materialized join".  The big
table would be the union of the smaller tables.

Again, I'm sure you're right, I just need to try it.  But in
this case it is harder to try, because in order to build the
  big table, and cluster it, I have to get past that problem
of running out of disk space during the cluster!  I'll do
the experiment to see if clustering does anything first.

Thanks.

Ken


Re: many tables in db

From
Tom Lane
Date:
Kenneth Been <kennethb@telocity.com> writes:
> ... So a one table solution would have a query like

> select stuff from lines where range && query_range

> which would be replaced in the multi-table solution with

> (select stuff from lines_43 where range && query_range)
> union
> (select stuff from lines_46 where range && query_range)

> I can easily store in memory the range (bounding box) for
> each table, so no disk accesses are required to know which
> tables need to be searched.

Hmm.  My immediate reaction: if the rtree index can't do that for you a
lot better than you can do it for yourself, then the rtree concept is
seriously flawed, or the implementation needs lots of work.  What you're
essentially proposing is to do the top level of rtree indexing yourself.
That should definitely not be a win.

Getting around the not-enough-space-to-cluster problem might take some
creativity, however.  I'd suggest that before you spend too much time on
it, you run some experiments on a table with only half the entries, and
see if it's really worth the trouble to try to accomplish the
clustering.

            regards, tom lane

Re: many tables in db

From
Kenneth Been
Date:
I ran some experiments, and with the queries that I was
testing with, the clustered rtree was about 30% faster than
the unclustered one.

> This isn't really relevant to your main point, but: since an rtree
> doesn't have an associated sort order, it's not clear to me that this
> operation makes any sense.  Have you determined that you'll actually get
> any performance improvement as a result of the clustering?  I suspect
> you may find that you're just rearranging the table into a different
> random order.