Thread: many tables in db
Background: I am considering a db schema with many tables (in the thousands). I once read a message on this list that said that having a lot of tables could affect performance, depending on how the OS handles having many files in a directory. Questions: 1. Does anyone have more specific information about the perils of having many tables in a db? (I notice that with 7.1 there are even more files, because of the toast files.) 2. One of the reasons I am considering breaking the data into multiple tables is that I want to cluster a large table (on an rtree index) and I am running out of disk space during the clustering operation, since all of the data is being duplicated. Is there some other way to get around this problem? (Other than installing another disk, which isn't as easy as it sounds, because I'm not the sys admin.) 3. Another reason for splitting up the data is that I am worried about the size of the rtree. I have some suspicion that the rtree size grows more than linearly with the size of the table. Is that the case? 4. Another possible reason is query speed. Any guesses on which would be faster, two or three queries on tables of about 40K records, or one query on a table of about 10M records? (All queries are geometric range searches on an rtree index.) Thanks. Ken
Kenneth Been <kennethb@telocity.com> writes: > I am considering a db schema with many tables (in the > thousands). Just out of curiosity, why would you need so many tables? Couldn't you represent the same info in fewer tables, maybe with more key columns? If every table has a different set of columns, that wouldn't work, but I'm wondering what the organizing principle really is. > 2. One of the reasons I am considering breaking the data > into multiple tables is that I want to cluster a large table > (on an rtree index) This isn't really relevant to your main point, but: since an rtree doesn't have an associated sort order, it's not clear to me that this operation makes any sense. Have you determined that you'll actually get any performance improvement as a result of the clustering? I suspect you may find that you're just rearranging the table into a different random order. > 3. Another reason for splitting up the data is that I am > worried about the size of the rtree. I have some suspicion > that the rtree size grows more than linearly with the size > of the table. Is that the case? Dunno. The rtree code isn't very well maintained, because none of the current crop of developers uses it or knows much about it. It could well have some bugs that cause it to waste space. (I realized only a few days ago that it probably behaves completely horribly when asked to store NULLs, for example. Do you have any NULLs in the column you're indexing?) > 4. Another possible reason is query speed. Any guesses on > which would be faster, two or three queries on tables of > about 40K records, or one query on a table of about 10M > records? (All queries are geometric range searches on an > rtree index.) Given the size difference, I'd suppose that the 10M table would actually be the equivalent of a materialized join of the smaller tables? I'd guess that recomputing the join on the fly is better because it takes so much less I/O, but that's just a guess. You'd be well advised to test both ways and see. regards, tom lane
>>I am considering a db schema with many tables (in the >>thousands). >> > > Just out of curiosity, why would you need so many tables? Couldn't you > represent the same info in fewer tables, maybe with more key columns? > If every table has a different set of columns, that wouldn't work, but > I'm wondering what the organizing principle really is. I would love to just use one table, but I am worried about performance, both in building the db and in querying it. (Once built, the db is static.) So each table contains a subset of the rows that would be in the hypothetical single table. Specifically, I am making an interactive map server for the web. So the tables store the geographic information, lines and polygons, indexed by the (latitude,longitude) range of each feature (geometric type box). So a one table solution would have a query like select stuff from lines where range && query_range which would be replaced in the multi-table solution with (select stuff from lines_43 where range && query_range) union (select stuff from lines_46 where range && query_range) I can easily store in memory the range (bounding box) for each table, so no disk accesses are required to know which tables need to be searched. >>2. One of the reasons I am considering breaking the data >>into multiple tables is that I want to cluster a large table >>(on an rtree index) >> > > This isn't really relevant to your main point, but: since an rtree > doesn't have an associated sort order, it's not clear to me that this > operation makes any sense. Have you determined that you'll actually get > any performance improvement as a result of the clustering? I suspect > you may find that you're just rearranging the table into a different > random order. The rtree doesn't have an exact order, but things that are closer to each other in the tree (on the same branch) will generally be closer to each other in the geography. The documentation for the cluster command says that the table is copied to a temporary order "in index order" and then renamed. If this is really what is happening, then it should put things that are close to each other in geography close to each other on disk. I haven't done any rigorous experiments, but I've eyeballed it and I think the clustering helps. But you're right, I should do some more serious experiments to be sure. >>3. Another reason for splitting up the data is that I am >>worried about the size of the rtree. I have some suspicion >>that the rtree size grows more than linearly with the size >>of the table. Is that the case? >> > > Dunno. The rtree code isn't very well maintained, because none of > the current crop of developers uses it or knows much about it. > It could well have some bugs that cause it to waste space. (I realized > only a few days ago that it probably behaves completely horribly when > asked to store NULLs, for example. Do you have any NULLs in the column > you're indexing?) No, no nulls. >>4. Another possible reason is query speed. Any guesses on >>which would be faster, two or three queries on tables of >>about 40K records, or one query on a table of about 10M >>records? (All queries are geometric range searches on an >>rtree index.) >> > > Given the size difference, I'd suppose that the 10M table would actually > be the equivalent of a materialized join of the smaller tables? I'd > guess that recomputing the join on the fly is better because it takes > so much less I/O, but that's just a guess. You'd be well advised to > test both ways and see. I'm not sure what you mean by "materialized join". The big table would be the union of the smaller tables. Again, I'm sure you're right, I just need to try it. But in this case it is harder to try, because in order to build the big table, and cluster it, I have to get past that problem of running out of disk space during the cluster! I'll do the experiment to see if clustering does anything first. Thanks. Ken
Kenneth Been <kennethb@telocity.com> writes: > ... So a one table solution would have a query like > select stuff from lines where range && query_range > which would be replaced in the multi-table solution with > (select stuff from lines_43 where range && query_range) > union > (select stuff from lines_46 where range && query_range) > I can easily store in memory the range (bounding box) for > each table, so no disk accesses are required to know which > tables need to be searched. Hmm. My immediate reaction: if the rtree index can't do that for you a lot better than you can do it for yourself, then the rtree concept is seriously flawed, or the implementation needs lots of work. What you're essentially proposing is to do the top level of rtree indexing yourself. That should definitely not be a win. Getting around the not-enough-space-to-cluster problem might take some creativity, however. I'd suggest that before you spend too much time on it, you run some experiments on a table with only half the entries, and see if it's really worth the trouble to try to accomplish the clustering. regards, tom lane
I ran some experiments, and with the queries that I was testing with, the clustered rtree was about 30% faster than the unclustered one. > This isn't really relevant to your main point, but: since an rtree > doesn't have an associated sort order, it's not clear to me that this > operation makes any sense. Have you determined that you'll actually get > any performance improvement as a result of the clustering? I suspect > you may find that you're just rearranging the table into a different > random order.