Thread: Constraint Exclusion + Joins?
I was wondering (for planning purposes) if anyone knew the status of constraint exclusions moving up to query runtime and working for joins. Is this something that's coming down the pipe in the foreseeable future, or just on a back-burner to-do list, or probably never happening, or... ? I have a painful work-around for my particular case that's good enough for now, but it would be helpful to know whether there's a good probability I can convert my code to do things the easy way somewhere in the foreseeable future if/when this feature goes in, or whether I should consider design changes now before my problems grow. Thanks, Brandon
"Brandon Black" <blblack@gmail.com> writes: > I was wondering (for planning purposes) if anyone knew the status of > constraint exclusions moving up to query runtime and working for > joins. The latter, done; the former, not on the radar screen IMHO. regards, tom lane
On 4/27/06, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Brandon Black" <blblack@gmail.com> writes: > > I was wondering (for planning purposes) if anyone knew the status of > > constraint exclusions moving up to query runtime and working for > > joins. > > The latter, done; the former, not on the radar screen IMHO. I didn't really care about the former, I was just under the impression it had to happen first based on some old list posts on the subject. I dug around in CVS to have a look for this, and I did eventually find it (well, I found the corresponding docs patch that removed the note about not working for joins). I see it's in MAIN but not in 8_1_STABLE. Does that mean it's headed for 8.2.x when that comes about? (Sorry, I'm not terribly familiar with how you guys handle all of this). Thanks, -- Brandon
On Fri, 28 Apr 2006, Brandon Black wrote: > I dug around in CVS to have a look for this, and I did eventually find > it (well, I found the corresponding docs patch that removed the note > about not working for joins). I see it's in MAIN but not in > 8_1_STABLE. Does that mean it's headed for 8.2.x when that comes > about? (Sorry, I'm not terribly familiar with how you guys handle all > of this). Yes. - Heikki
On 4/30/06, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > On Fri, 28 Apr 2006, Brandon Black wrote: > > > I dug around in CVS to have a look for this, and I did eventually find > > it (well, I found the corresponding docs patch that removed the note > > about not working for joins). I see it's in MAIN but not in > > 8_1_STABLE. Does that mean it's headed for 8.2.x when that comes > > about? (Sorry, I'm not terribly familiar with how you guys handle all > > of this). > > Yes. > Perhaps I'm confused about the meaning of the removal of the JOINs-related caveat from the constraint exclusion docs in MAIN. What I was intending to ask about was constraint exclusion kicking in where the constrained column is being joined to a column of another table, with no constants involved. For a contrived example: -------------- CREATE TABLE basic ( basic_id INTEGER NOT NULL PRIMARY KEY, basic_data TEXT ); CREATE TABLE basic_sub1 ( PRIMARY KEY (basic_id), CHECK ( basic_id >= 0 AND basic_id < 100 ) ) INHERITS (basic); CREATE TABLE basic_sub2 ( PRIMARY KEY (basic_id), CHECK ( basic_id >= 100 AND basic_id < 200 ) ) INHERITS (basic); [...] CREATE TABLE jstuff ( jstuff_id INTEGER NOT NULL PRIMARY KEY, jstuff_data TEXT ); EXPLAIN ANALYZE SELECT basic.* FROM basic JOIN jstuff ON (basic.basic_id = jstuff.jstuff_id) WHERE jstuff_data = 'foo'; ------------------ I tried things like the above with small test data sets against cvs just now on my home machine, and constraint exclusion doesn't seem to apply here (even if all of the joined jstuff rows have ids which only match the constraint for basic_sub3, all basic_subX's seem to get scanned, as is the case I'm seeing in my real code against 8.1.3). Is this sort of dynamic constraint exclusion on the radar? Without it, some inheritance-based partitioning constructs which could otherwise be written as a single query have to be done as loops with seperate nested statements (to extract constants and then manually plug them into the next statement down the chain), which seems to make it more difficult (or nearly impossible) to support varying where/grouping/ordering/count of the resultant inner query rows from client code efficiently without writing a seperate plpgsql function for every possible variation. -- Brandon
"Brandon Black" <blblack@gmail.com> writes: > Is this sort of dynamic constraint exclusion on the radar? What sort of fantasy have you got in mind? It doesn't seem likely to me that testing constraints explicitly against each row from the other table would be a win compared to letting the indexes do it. regards, tom lane
On Tue, 2006-05-02 at 00:27 -0500, Brandon Black wrote: > I tried things like the above with small test data sets against cvs > just now on my home machine, and constraint exclusion doesn't seem to > apply here (even if all of the joined jstuff rows have ids which only > match the constraint for basic_sub3, all basic_subX's seem to get > scanned, as is the case I'm seeing in my real code against 8.1.3). Is > this sort of dynamic constraint exclusion on the radar? (You should use life size data for your testing.) If you think the plan you get can be improved, post the EXPLAIN (maybe EXPLAIN ANALYZE) and say how you think it can be improved. It's a lot easier to consider concrete requirements. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
On 5/2/06, Simon Riggs <simon@2ndquadrant.com> wrote: > On Tue, 2006-05-02 at 00:27 -0500, Brandon Black wrote: > > > I tried things like the above with small test data sets against cvs > > just now on my home machine, and constraint exclusion doesn't seem to > > apply here (even if all of the joined jstuff rows have ids which only > > match the constraint for basic_sub3, all basic_subX's seem to get > > scanned, as is the case I'm seeing in my real code against 8.1.3). Is > > this sort of dynamic constraint exclusion on the radar? > > (You should use life size data for your testing.) > > If you think the plan you get can be improved, post the EXPLAIN (maybe > EXPLAIN ANALYZE) and say how you think it can be improved. It's a lot > easier to consider concrete requirements. > Perhaps I should rewind a bit here then since clearly, as Tom pointed out, dynamic constraint exclusion is fantasy-land :) My real life-size situation is a bit more complicated, hence my trying to just break things down into a simple example for the earlier email, as I didn't want to bog you down reading all the details. It's really easier to explain in english than trying to post my pages upon pages of DDL and explain analyze outputs (but I can try to summarize some good example cases of those if neccesary later from work): In my real code, I've got a number of seperate tables logging time-series numerical statistics of various kinds. The PKs on these tables are normally composed of two fields, a resource identifier (there's about 4,500 resources), and a timestamp (the statistics come in once per minute). I'm using inheritance-based partitioning to split each stat table up into 1-hour chunks, and I keep about 72 hours of "current" data around (dropping old subtables and adding new ones as appropriate), which means there's about ~270k rows per subtable, ~19.4 million for the whole set at any given time. A few of the (top-level, seperate) tables in this scheme have a third PK component which is a sub-resource identifier, which there might be ~2-6 of per resource, multiplying the row counts further in those cases. The timestamp field I'm partitioning on is called t_stamp, and it's just an integer (unix epoch time, easier for my application). For a lot of my queries on these tables, I already have an explicit constant time-span to work with, so the constraint exclusion works well. For certain queries however, what I really want is to join into these statistical tables for a number of resource_ids, and for only the rows that are at the max(t_stamp) for each resource - the last thing inserted for each resource (they aren't neccesarily in perfect time synch with each other). My (quite possibly flawed) current solution for this is to maintain a seperate singular "resource_status" table which tracks the max timestamp for each resource as the data rows are inserted (it gets updated with the new timestamp being inserted if the new timestamp is greater than the max_t_stamp I had recorded earlier - sometimes data arruves not in timestamp order), so that I could quickly determine the current per-resource max t_stamp. I'm happy with the design in the overall, as my problematically huge insert (and transaction) rate is humming along nicely, and most of my queries (some of which are very complex) are running in excellent time. The only big performance issue I face now is those cases described above, where I need to get one datum per resource from a stat table for a number of resources using each individual resource's max t_stamp. When I do this by joining from the table where I recorded the max t_stamps into the top-level stat table, the plan does index scans of every child (obviously) to match the stamps, even though in the most likely case all of the stamps fall in a narrow window within just 1-2 child tables. By manually rewriting the query as a loop in plpgsql which first extracts the max t_stamp per resource from my "status" table, then inside the loop runs an individual select statement per-resource with the t_stamp as a constant, I see huge performance increases over the join (in the neighboorhood of an order of magnitude, but it varies up or down in different cases), to the point where things are acceptably fast. As I mentioned though, this (the plpgsql looping construct) is difficult to work with for dynamically constructed user queries (in terms of further where-clause constraints, joins to other tables afterwards, limit/offset counts, ordering, etc), whereas plugging those things in is easy if the client code is executing a single literal dynamic sql statement for the whole thing. There aren't any ugly sequential scans going on here or anything in the join case, it appears it's just the difference between checking ~72 indexes versus checking 1. I could break up the tables in larger chunks: if I did 12 hours instead of 1 hour, I'd be looking at more like 6 subtables/indexes, but each growing up to the vicinity of 1.5-9 million (in some rare cases, perhaps 20 million) rows per table. It is quite difficult and time-consuming to experiment with this, but clearly it's an avenue I need to look into. Another option I considered was that on each statistic insert into the big partitioned history table, I should also update a single row per-resource in a seperate table (or the top-level table), which tracks only the most recent data. This decreased my insertion-transaction performance substantially of course (I tried a few different logical ways of doing it), so I've been trying to work with the max_stamp storing method instead lately. It could turn out that the "update a table of latest-data when doing the inserts" is the least painful way to pay for these queries of course. And finally, a relatively-easy option I'm going to be trying out later today is to have the client code do a dynamic "create function" to make a custom plpgsql loop for itself, use it once, and toss it. For the current schema design this may turn out to be the best combination of fast and flexible, even though it's a bit ugly. Anyways, thanks for your time and efforts on postgres in general, I love it and don't mean this to sound as "complaining" as it probably does. I'm just finding it hard to come up with a way out of this latest corner I've painted myself into :) -- Brandon
Ühel kenal päeval, T, 2006-05-02 kell 00:27, kirjutas Brandon Black: > On 4/30/06, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > On Fri, 28 Apr 2006, Brandon Black wrote: > > > > > I dug around in CVS to have a look for this, and I did eventually find > > > it (well, I found the corresponding docs patch that removed the note > > > about not working for joins). I see it's in MAIN but not in > > > 8_1_STABLE. Does that mean it's headed for 8.2.x when that comes > > > about? (Sorry, I'm not terribly familiar with how you guys handle all > > > of this). > > > > Yes. > > > > Perhaps I'm confused about the meaning of the removal of the > JOINs-related caveat from the constraint exclusion docs in MAIN. What > I was intending to ask about was constraint exclusion kicking in where > the constrained column is being joined to a column of another table, > with no constants involved. > > For a contrived example: > > -------------- > > CREATE TABLE basic ( > basic_id INTEGER NOT NULL PRIMARY KEY, > basic_data TEXT > ); > > CREATE TABLE basic_sub1 ( > PRIMARY KEY (basic_id), > CHECK ( basic_id >= 0 AND basic_id < 100 ) > ) INHERITS (basic); > > CREATE TABLE basic_sub2 ( > PRIMARY KEY (basic_id), > CHECK ( basic_id >= 100 AND basic_id < 200 ) > ) INHERITS (basic); > > [...] > > CREATE TABLE jstuff ( > jstuff_id INTEGER NOT NULL PRIMARY KEY, > jstuff_data TEXT > ); try putting a constraint on jstuff.jstuff_id so the CE mechanism has something to work on. > > EXPLAIN ANALYZE SELECT basic.* FROM basic JOIN jstuff ON > (basic.basic_id = jstuff.jstuff_id) WHERE jstuff_data = 'foo'; ------------ Hannu
On 5/2/06, Brandon Black <blblack@gmail.com> wrote:
If you only need 1 column from basic, then a subquery will do it for you:
SELECT jstuff_id AS basic_id, (SELECT basic_data FROM basic WHERE
basic.basic_id = jstuff.jstuff_id) FROM jstuff WHERE jstuff_data = 'foo';
If you need more than one column you can use ROW() constructors but that
gets pretty dirty.
On 4/30/06, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On Fri, 28 Apr 2006, Brandon Black wrote:
>
> > I dug around in CVS to have a look for this, and I did eventually find
> > it (well, I found the corresponding docs patch that removed the note
> > about not working for joins). I see it's in MAIN but not in
> > 8_1_STABLE. Does that mean it's headed for 8.2.x when that comes
> > about? (Sorry, I'm not terribly familiar with how you guys handle all
> > of this).
>
> Yes.
>
Perhaps I'm confused about the meaning of the removal of the
JOINs-related caveat from the constraint exclusion docs in MAIN. What
I was intending to ask about was constraint exclusion kicking in where
the constrained column is being joined to a column of another table,
with no constants involved.
For a contrived example:
--------------
CREATE TABLE basic (
basic_id INTEGER NOT NULL PRIMARY KEY,
basic_data TEXT
);
CREATE TABLE basic_sub1 (
PRIMARY KEY (basic_id),
CHECK ( basic_id >= 0 AND basic_id < 100 )
) INHERITS (basic);
CREATE TABLE basic_sub2 (
PRIMARY KEY (basic_id),
CHECK ( basic_id >= 100 AND basic_id < 200 )
) INHERITS (basic);
[...]
CREATE TABLE jstuff (
jstuff_id INTEGER NOT NULL PRIMARY KEY,
jstuff_data TEXT
);
EXPLAIN ANALYZE SELECT basic.* FROM basic JOIN jstuff ON
(basic.basic_id = jstuff.jstuff_id) WHERE jstuff_data = 'foo';
------------------
If you only need 1 column from basic, then a subquery will do it for you:
SELECT jstuff_id AS basic_id, (SELECT basic_data FROM basic WHERE
basic.basic_id = jstuff.jstuff_id) FROM jstuff WHERE jstuff_data = 'foo';
If you need more than one column you can use ROW() constructors but that
gets pretty dirty.