Thread: INHERITS and planning
Greetings!

Is there an issue when a large number of INHERITS tables exist for
planning? We have 2 base tables, and use INHERITS to partition the data.
When we get around 2000-2200 sub-tables (approx 1000-1100 per base
table), planning a SELECT statement on the base table (ie, to search all
sub-tables) will start slowing down dramatically (ie, it feels like
something exponential OR some kind of in-memory to on-disk transition).

I haven't done enough to really plot out the planning times, but
definitely around 1600 tables we were getting sub-second plans, and
around 2200 we were above 30 seconds.

Also, are there any plans to support proper partitioning/binning of
data rather than through INHERITS? I know something similar to Oracle's
has been mentioned as upcoming. I would like to put in a vote to support
"auto-binning", in which a function is called to define the bin. The
Oracle model really only supports: (1) explicit partitioning (ie, every
new partition must be defined), or (2) hash binning. What we deal with
is temporal data, and we would like to bin on the hour or day
"automatically", hopefully to support truncating whole bins. This helps
us in 2 ways: (1) data deletion is done in bulk (we currently drop a
full inherited table), and (2) cancelling a VACUUM/SELECT doesn't take
forever while the execution engine finishes "this table" (we have had
cancels take 2 hours because the VACUUM was on a very large single
table).

Regards!
Ed
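[For illustration, a daily-binned INHERITS layout of the kind described
above might look roughly like this. The table and column names here are
hypothetical, just a sketch of the scheme, not Ed's actual schema:]

    -- parent holds no data itself; each day gets its own child table
    create table event_log (
        logged_at timestamptz not null,
        payload   text
    );

    create table event_log_2005_06_09 () inherits (event_log);  -- one bin per day
    create table event_log_2005_06_10 () inherits (event_log);

    -- expiring a whole bin is a quick catalog operation rather than a
    -- long DELETE or VACUUM against one huge table
    drop table event_log_2005_06_09;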
Edmund Dengler <edmundd@eSentire.com> writes:
> Is there an issue when a large number of INHERITS tables exist for
> planning?

Well, there are a number of issues whenever a single query references a
whole lot of tables in any fashion. It's only with Neil Conway's rewrite
of the List package in 8.0 that we had any hope of less than O(N^2)
behavior for practically any measure of query complexity N. I have been
spending some time over the past week or so attacking other O(N^2)
behaviors, but it's not a finished project yet.

I tried to reproduce your issue by doing

create table p1 (f1 int, f2 bigint);
create table c0() inherits (p1);
create table c1() inherits (p1);
create table c2() inherits (p1);
...
create table c2298() inherits (p1);
create table c2299() inherits (p1);

and then profiling

select * from p1;

With no data in the tables, of course this is just measuring planning
time and executor startup/shutdown overhead. But I suppose that you
don't have a whole lot of data in the tables either, because the data
fetching stage is surely pretty linear and you'd not be complaining
about overhead if there were much data to be read.

What I see in the profile is

  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 42.04     15.58    15.58     9214     0.00     0.00  list_nth_cell
 20.29     23.10     7.52 34524302     0.00     0.00  SHMQueueNext
  8.34     26.19     3.09    29939     0.00     0.00  LockCountMyLocks
  5.64     28.28     2.09  2960617     0.00     0.00  AllocSetAlloc
  2.37     29.16     0.88     2354     0.00     0.00  AllocSetCheck
  2.29     30.01     0.85   302960     0.00     0.00  hash_search
  2.13     30.80     0.79  2902873     0.00     0.00  MemoryContextAlloc

The list_nth operations are all coming from rt_fetch() macros, so we
could probably fix that by replacing rangetable Lists by arrays. This
seems doable, but also tedious and bug-prone; there are too many places
that figure they can randomly add onto rtable lists.

What I'm more interested in at the moment are the next two entries,
SHMQueueNext and LockCountMyLocks --- it turns out that almost all the
SHMQueueNext calls are coming from LockCountMyLocks, which is invoked
during LockAcquire. This is another O(N^2) loop, and it's really a whole
lot nastier than the rangetable ones, because it executes with the
LockMgrLock held.

I spent a little time trying to see if we could avoid doing
LockCountMyLocks altogether, but it didn't look very promising. What I
am thinking though is that we could implement LockCountMyLocks as either
a scan through all the proclocks attached to the target proc (the
current way) or as a scan through all the proclocks attached to the
target lock (proclocks are threaded both ways). There is no hard upper
bound on the number of locks a proc holds, whereas there is a bound of
MaxBackends on the number of procs linked to a lock. (Well,
theoretically it could be 2*MaxBackends considering the possibility of
session locks, but that could only happen if all the backends are trying
to vacuum the same relation.) So it seems like it might be a win to scan
over the per-lock list instead. But I'm very unsure about what the
*average* case is, instead of the worst case.

I'm also thinking that the shared memory lock structure may be
overdesigned now that we've introduced the backend-private LocalLock
table --- in particular, it's not clear why we still include transaction
IDs in PROCLOCKTAG rather than leaving the backend to track all the
different reasons why it wants to hold a lock. If we could get rid of
that then LockCountMyLocks reduces to a single PROCLOCK hashtable
lookup.

Thoughts?

			regards, tom lane
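[For anyone who wants to repeat the experiment without pasting a few
thousand statements, a small plpgsql helper can generate the children.
This is just an illustrative sketch: the make_children name is made up
here, and it assumes the plpgsql language is installed in the database.]

    create or replace function make_children(n integer) returns void as $$
    declare
        i integer;
    begin
        for i in 0 .. n - 1 loop
            -- child names follow the c0 .. cNNNN pattern of the test above
            execute 'create table c' || i::text || '() inherits (p1)';
        end loop;
    end;
    $$ language plpgsql;

    select make_children(2300);   -- creates c0 through c2299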
On Thu, 2005-06-09 at 21:30 -0400, Edmund Dengler wrote:
> We have 2 base tables, and use INHERITS to partition the data. When we
> get around 2000-2200 sub-tables (approx 1000-1100 per base table),
> planning a SELECT statement on the base table (ie, to search all
> sub-tables) will start slowing down dramatically (ie, it feels like
> something exponential OR some kind of in-memory to on-disk
> transition).
>
> I haven't done enough to really plot out the planning times, but
> definitely around 1600 tables we were getting sub-second plans, and
> around 2200 we were above 30 seconds.

Interesting... I tested up to 1000 and found the performance acceptable,
as you suggest. I'd question why you have so many partitions.

There is a known issue here to do with the lack of an index on the
pg_inherits catalog relation. Here is the comment from
backend/optimizer/util/plancat.c's find_inheritance_children, lines
565-568:

     * XXX might be a good idea to create an index on pg_inherits' inhparent
     * field, so that we can use an indexscan instead of sequential scan here
     * However, in typical databases pg_inherits won't have enough entries to
     * justify an indexscan...

In other places in the code there are comments noting that having more
than 1000 catalog entries usually requires an index.

> Also, are there any plans to support proper partitioning/binning of
> data rather than through INHERITS? I know something similar to
> Oracle's has been mentioned as upcoming.

Working on it now.

> I would like to put in a vote to support "auto-binning", in which a
> function is called to define the bin. The Oracle model really only
> supports: (1) explicit partitioning (ie, every new partition must be
> defined), or (2) hash binning. What we deal with is temporal data, and
> we would like to bin on the hour or day "automatically", hopefully to
> support truncating whole bins.

Unlikely in the 8.1 version...

Best Regards,
Simon Riggs
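[As a side note, the size of that pg_inherits scan is easy to check for
a given parent. Using Tom's p1 test table as the example:]

    -- how many child tables does the planner have to expand for this parent?
    select count(*) from pg_inherits where inhparent = 'p1'::regclass;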
On Fri, 2005-06-10 at 02:10 -0400, Tom Lane wrote:
> What I see in the profile is
>
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls   s/call   s/call  name
>  42.04     15.58    15.58     9214     0.00     0.00  list_nth_cell
>  20.29     23.10     7.52 34524302     0.00     0.00  SHMQueueNext
>   8.34     26.19     3.09    29939     0.00     0.00  LockCountMyLocks
>   5.64     28.28     2.09  2960617     0.00     0.00  AllocSetAlloc
>   2.37     29.16     0.88     2354     0.00     0.00  AllocSetCheck
>   2.29     30.01     0.85   302960     0.00     0.00  hash_search
>   2.13     30.80     0.79  2902873     0.00     0.00  MemoryContextAlloc

Looks bad... but how does it look for 1000 inherited relations? My
feeling is that we should not be optimizing the case above 1000
relations. That many partitions is very unwieldy. If you really do need
that many, you can go to the trouble of grouping them in two levels of
nesting, so you have a root table, multiple month tables and then each
month table with multiple day tables (etc).

> What I'm more interested in at the moment are the next two entries,
> SHMQueueNext and LockCountMyLocks --- it turns out that almost all the
> SHMQueueNext calls are coming from LockCountMyLocks, which is invoked
> during LockAcquire. This is another O(N^2) loop, and it's really a
> whole lot nastier than the rangetable ones, because it executes with
> the LockMgrLock held.

ISTM that having LockAcquire as a stateless call isn't much use here.
Surely caching the number of locks, so that we can avoid the call
entirely when making repeated calls, is the way to go...

> I spent a little time trying to see if we could avoid doing
> LockCountMyLocks altogether, but it didn't look very promising.

Or is that what you meant?

> What I am thinking though is that we could implement LockCountMyLocks
> as either a scan through all the proclocks attached to the target proc
> (the current way) or as a scan through all the proclocks attached to
> the target lock (proclocks are threaded both ways). There is no hard
> upper bound on the number of locks a proc holds, whereas there is a
> bound of MaxBackends on the number of procs linked to a lock. (Well,
> theoretically it could be 2*MaxBackends considering the possibility of
> session locks, but that could only happen if all the backends are
> trying to vacuum the same relation.) So it seems like it might be a
> win to scan over the per-lock list instead. But I'm very unsure about
> what the *average* case is, instead of the worst case.

Changing that behaviour would affect all other call locations, so I'm
not sure I'd want an optimization of this rare case to have such a
far-reaching effect.

Best Regards,
Simon Riggs
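[For concreteness, a two-level layout along the lines Simon describes
above might look like this. The names are purely illustrative; the point
is that a query against a month table only has to expand that month's
children, not the whole tree.]

    create table events (logged_at timestamptz, payload text);       -- root
    create table events_2005_06 () inherits (events);                -- one per month
    create table events_2005_06_09 () inherits (events_2005_06);     -- one per day
    create table events_2005_06_10 () inherits (events_2005_06);

    select count(*) from events_2005_06;   -- expands only June's day tables
    select count(*) from events;           -- expands the entire hierarchy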
Simon Riggs <simon@2ndquadrant.com> writes:
> Looks bad... but how does it look for 1000 inherited relations? My
> feeling is that we should not be optimizing the case above 1000
> relations. That many partitions is very unwieldy.

Well, it's not so much that I care about queries with 1000+ relations,
as that this is a good way to stress-test the code and find out where
the performance issues are. There are many thousand lines of code that
can never be performance-sensitive, but to expose the ones that are
it helps to push the envelope a bit.

Until Neil fixed the list.c package in 8.0, we had pretty much zero
chance of avoiding O(N^2) or worse behavior on almost any measure of
query size N that you cared to name; because most of the internal data
structures depend on lists. (You do know that Postgres was once written
in Lisp, right?) Now that that basic issue is taken care of, it's worth
looking at secondary bad behaviors ... I've been doing some hacking in
this area lately, but it's not all fixed yet.

			regards, tom lane
> Well, it's not so much that I care about queries with 1000+ relations,
> as that this is a good way to stress-test the code and find out where
> the performance issues are. There are many thousand lines of code that
> can never be performance-sensitive, but to expose the ones that are
> it helps to push the envelope a bit.

Once we have partitioning and people set up automated scripts to
partition off stuff, we may well end up with 1000+ table queries...

Chris
Simon Riggs <simon@2ndquadrant.com> writes:
> If you really do need that many, you can go to the trouble of grouping
> them in two levels of nesting, so you have a root table, multiple month
> tables and then each month table with multiple day tables (etc).

I wonder if testing deeply nested inheritance graphs would show up an
entirely different set of problem areas.

--
greg
On Thu, 2005-06-16 at 01:10 -0400, Greg Stark wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
>
> > If you really do need that many, you can go to the trouble of
> > grouping them in two levels of nesting, so you have a root table,
> > multiple month tables and then each month table with multiple day
> > tables (etc).
>
> I wonder if testing deeply nested inheritance graphs would show up an
> entirely different set of problem areas.

I'm not sure two or three levels is "deeply nested", but I suspect you
are correct.

Best Regards,
Simon Riggs
On Thu, 2005-06-16 at 12:59 +0800, Christopher Kings-Lynne wrote:
> > Well, it's not so much that I care about queries with 1000+
> > relations, as that this is a good way to stress-test the code and
> > find out where the performance issues are. There are many thousand
> > lines of code that can never be performance-sensitive, but to expose
> > the ones that are it helps to push the envelope a bit.
>
> Once we have partitioning and people set up automated scripts to
> partition off stuff, we may well end up with 1000+ table queries...

I can see why you think that, but there will always be pressure to
reduce the number of partitions, for a variety of reasons. IMHO that
will lead to an optimum range of values. It seems likely to me that the
recommendation would be along the lines of: divide the table up
naturally, in a way that gives between 10 and 500 partitions that are
roughly equally sized. Using more than that could lead to some fairly
strange designs.

Anyway, let's wait and see.

Best Regards,
Simon Riggs
On Thu, 2005-06-16 at 00:55 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Looks bad... but how does it look for 1000 inherited relations? My
> > feeling is that we should not be optimizing the case above 1000
> > relations. That many partitions is very unwieldy.
>
> Well, it's not so much that I care about queries with 1000+ relations,
> as that this is a good way to stress-test the code and find out where
> the performance issues are. There are many thousand lines of code that
> can never be performance-sensitive, but to expose the ones that are
> it helps to push the envelope a bit.

I very much agree, and I appreciate you taking the time to look into
this, since it clearly has an effect on the usefulness of partitioning.
I just wanted to give my opinion that more than 1000 partitions is not a
frequent use case.

> Until Neil fixed the list.c package in 8.0, we had pretty much zero
> chance of avoiding O(N^2) or worse behavior on almost any measure of
> query size N that you cared to name; because most of the internal data
> structures depend on lists. (You do know that Postgres was once
> written in Lisp, right?) Now that that basic issue is taken care of,
> it's worth looking at secondary bad behaviors ... I've been doing some
> hacking in this area lately, but it's not all fixed yet.

Yes, I know about that, and I agree with the general principles you
discuss. My concern was that your suggested fix to the 2000+ inherited
relation problem seemed like it would apply to an area that most people
would never use, yet would have an effect on anybody using LockAcquire.
IMHO that is not worth the effort or risk, and that is coming from
somebody who, as you know, has been involved in tracking down O(N^2)
behaviour in other parts of the code.

Best Regards,
Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> Your suggested fix to the 2000+ inherited relation problem seemed like
> it would apply to an area that most people would never use, yet would
> have an effect on anybody using LockAcquire.

Just FYI, the proposed fix is already in, and I think it's a net win for
everyone. LockCountMyLocks was really an artifact of a lock API that's
been superseded by events --- namely the assumption that we want to take
locks in the names of particular transactions rather than in the names
of particular backends. I put that in around 7.1 or so, primarily to
support "session locks" for VACUUM, but designed it the way I did with
the idea that subtransactions would someday want it. In the event,
subtransactions didn't want it --- it was a lot cheaper to add the
backend-private LOCALLOCK tables and make all the subtransaction
bookkeeping happen internally to a backend.

Now that we have LOCALLOCK, the obvious next step is to manage session
locks entirely within LOCALLOCK too, and reduce the shared-memory state
to essentially one bit per lock per backend: "I hold it" or "I don't
hold it". When you know there is only one proclock per backend, there's
no need to search for other ones, and thus LockCountMyLocks goes away
again.

			regards, tom lane
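[As a rough way to see how many shared lock entries a single backend
accumulates for this kind of query, one can watch the standard pg_locks
view from inside the transaction. This is just an observation query
against the p1 test schema used earlier in the thread, not part of the
patch being discussed:]

    begin;
    select count(*) from p1;            -- takes AccessShareLock on the parent
                                        -- and on every child table
    select count(*) from pg_locks
    where  pid = pg_backend_pid();      -- roughly one row per table locked
    commit;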