Thread: Re: COUNT(*) to find records which have a certain number of
I figured it eventually. (The only thing I don't know is where to put the ORDER BY.) > I want to select only those BRAND/MODEL combinations, where the MODEL > has more than one TYPE, but only where one of those has TYPE_NAME='xyz'. > I am not interested in MODELs with multiple TYPEs where none of them are > called 'xyz'. SELECT BRAND_NAME,MODEL_NAME FROM TYPE left outer join MODEL on MODEL_PK =TYPE.MODEL_FK left outer join BRAND on BRAND_PK =MODEL.BRAND_FK group by BRAND.BRAND_NAME,MODEL_NAME having count(*)>1 intersect SELECT BRAND_NAME,MODEL.MODEL_NAME FROM TYPE left outer join MODEL on MODEL_PK =TYPE.MODEL_FK left outer join BRAND on BRAND_PK =MODEL.BRAND_FK where TYPE_NAME='xyz' group by BRAND.BRAND_NAME,MODEL_NAME -- Regards/Gruß, Tarlika Elisabeth Schmitz -- Regards/Gruß, Tarlika Elisabeth Schmitz
On Mon, 2004-09-20 at 12:19, T E Schmitz wrote: > I figured it eventually. (The only thing I don't know is where to put > the ORDER BY.) Try this: SELECT brand_name, model_name FROM (SELECT ... INTERSECT SELECT ...) AS t ORDER BY ...
Hi Rod, Rod Taylor wrote: > On Mon, 2004-09-20 at 12:19, T E Schmitz wrote: > >>I figured it eventually. (The only thing I don't know is where to put >>the ORDER BY.) > > > Try this: > > SELECT brand_name, model_name > FROM (SELECT ... INTERSECT SELECT ...) AS t > ORDER BY ... That does the trick. You're a genius ;-) -- Regards/Gruß, Tarlika
T E Schmitz <mailreg@numerixtechnology.de> writes: > SELECT > BRAND_NAME,MODEL_NAME ... > intersect ... Huh, I never think of the set operation solutions. I'm curious how it compares speed-wise. -- greg
I have a view that when used, is slow: CREATE VIEW "Data".genlic_a4avail AS SELECT genlic_a4.*, last_action_date, end_date, canceled FROM genlic_a4 LEFT JOIN lic_hd USING( sys_id) WHERE status != 'A'; Here is the EXPLAIN output: Merge Join (cost=155360.47..159965.70 rows=13063 width=75) Merge Cond: ("outer".sys_id = "inner".sys_id) -> Sort (cost=3912.51..3916.48 rows=1589 width=62) Sort Key: "_GenLicGroupA4".sys_id -> Nested Loop (cost=0.00..3828.04rows=1589 width=62) Join Filter: ("outer".callsign ~ ("inner".pattern)::text) -> Seq Scan on "_GenLicGroupA4" (cost=0.00..1667.40 rows=1589 width=21) Filter: ((status <> 'R'::bpchar) AND (status <> 'A'::bpchar) AND (geo_region = 12)) -> Seq Scan on "_GeoRestrict" (cost=0.00..1.16 rows=16 width=41) -> Sort (cost=262032.96..264249.96 rows=886799 width=72) Sort Key: lic_hd.sys_id -> SubqueryScan lic_hd (cost=0.00..24529.99 rows=886799 width=72) -> Seq Scan on "_LicHD" (cost=0.00..24529.99rows=886799 width=72) If I change the view to this: CREATE VIEW "Data".genlic_a4avail AS SELECT genlic_a4.*, (SELECT last_action_dateFROM lic_hd WHERE sys_id = genlic_a4.sys_id LIMIT 1) AS last_action_date, (SELECT end_date FROM lic_hd WHERE sys_id= genlic_a4.sys_id LIMIT 1) AS end_date, (SELECT canceled FROM lic_hd WHERE sys_id = genlic_a4.sys_id LIMIT 1) AS canceled FROM genlic_a4 WHERE status != 'A'; Then the performance is MUCH better: Subquery Scan genlic_a4avail (cost=0.00..3828.04 rows=1589 width=62) -> Nested Loop (cost=0.00..3828.04 rows=1589 width=62) Join Filter: ("outer".callsign ~ ("inner".pattern)::text) -> Seq Scan on "_GenLicGroupA4" (cost=0.00..1667.40rows=1589 width=21) Filter: ((status <> 'R'::bpchar) AND (status <> 'A'::bpchar) AND (geo_region = 12)) -> Seq Scan on "_GeoRestrict" (cost=0.00..1.16 rows=16 width=41) SubPlan -> Limit (cost=0.00..3.01 rows=1 width=4) -> Index Scan using "_LicHD_pkey" on "_LicHD" (cost=0.00..3.01 rows=1 width=4) Index Cond: (unique_system_identifier = $0) -> Limit (cost=0.00..3.01 rows=1 width=8) -> Index Scan using "_LicHD_pkey" on "_LicHD" (cost=0.00..3.01 rows=1 width=8) Index Cond: (unique_system_identifier = $0) -> Limit (cost=0.00..3.01 rows=1 width=8) -> Index Scan using "_LicHD_pkey" on "_LicHD" (cost=0.00..3.01 rows=1 width=8) Index Cond: (unique_system_identifier = $0) Note that genlic_a4 is small (4519), and lic_hd is large (886799), and lic_hd has sys_id as its PRIMARY KEY. Is there a better way to write the LEFT JOIN so as to achieve the performance of the second VIEW without the clumsiness of the three (SELECT ... LIMIT 1) ???
"Dean Gibson (DB Administrator)" <postgresql3@ultimeth.com> writes: > I have a view that when used, is slow: It's obvious that you haven't given us anything remotely like full information here. AFAICT the "tables" in your view are actually other views, plus it looks like your call to the view is a query that adds some other conditions you didn't mention. (Or are those conditions coming from the sub-views? Hard to tell.) If you want useful help you need to be more complete. regards, tom lane
Tom Lane wrote on 2004-09-20 16:06: >"Dean Gibson (DB Administrator)" <postgresql3@ultimeth.com> writes: > > I have a view that when used, is slow: > >... If you want useful help you need to be more complete. I use views to "hide" tables so that I can populate new tables and then atomically switch to them with "CREATE OR REPLACE ...". Here is the same data with the raw tables: => explain select * from "20040920_070010"."_GenLicGroupA4" AS x LEFT JOIN "20040919_070713"."_LicHD" AS y ON x.sys_id = y.unique_system_identifier; QUERY PLAN ------------------------------------------------------------------------------------------------- Merge Join (cost=5235.14..35123.51rows=43680 width=365) Merge Cond: ("outer".unique_system_identifier = "inner".sys_id) -> IndexScan using "_LicHD_pkey" on "_LicHD" y (cost=0.00..27361.79 rows=886799 width=344) -> Sort (cost=5235.14..5344.34 rows=43680 width=21) Sort Key: x.sys_id -> SeqScan on "_GenLicGroupA4" x (cost=0.00..1339.80 rows=43680 width=21) Using first level views, as mentioned above, the results are the same: => explain select * from "Base"."GenLicGroupA4" AS x LEFT JOIN "Base"."LicHD" AS y ON x.sys_id = y.unique_system_identifier; QUERY PLAN ----------------------------------------------------------------------------------------------- Merge Join (cost=5235.14..35123.51rows=43680 width=365) Merge Cond: ("outer".unique_system_identifier = "inner".sys_id) -> IndexScan using "_LicHD_pkey" on "_LicHD" (cost=0.00..27361.79 rows=886799 width=344) -> Sort (cost=5235.14..5344.34 rows=43680 width=21) Sort Key: "_GenLicGroupA4".sys_id -> Seq Scan on "_GenLicGroupA4" (cost=0.00..1339.80 rows=43680 width=21) However, when I introduce a second-level view for the second table of: CREATE VIEW "Data".lic_hd AS SELECT unique_system_identifier AS sys_id, callsign AS callsign, uls_file_number AS uls_file_num, applicant_type_code AS applicant_type, radio_service_code AS radio_service, license_status AS license_status, grant_date AS grant_date, effective_date AS effective_date, cancellation_date AS cancel_date, expired_date AS expire_date, last_action_date AS last_action_date, CASE WHEN cancellation_date < expired_date THEN cancellation_date ELSE expired_date END AS end_date, cancellation_date < expired_date AS canceled FROM "Base"."LicHD"; And then change the query to use it, I get: => explain select * from "Base"."GenLicGroupA4" AS x LEFT JOIN lic_hd AS y ON x.sys_id = y.sys_id; QUERY PLAN ------------------------------------------------------------------------------------ Merge Join (cost=280258.11..289399.92rows=359154 width=98) Merge Cond: ("outer".sys_id = "inner".sys_id) -> Sort (cost=5235.14..5344.34rows=43680 width=21) Sort Key: "_GenLicGroupA4".sys_id -> Seq Scan on "_GenLicGroupA4" (cost=0.00..1339.80 rows=43680 width=21) -> Sort (cost=262032.96..264249.96 rows=886799 width=72) Sort Key: y.sys_id -> Subquery Scany (cost=0.00..24529.99 rows=886799 width=72) -> Seq Scan on "_LicHD" (cost=0.00..24529.99 rows=886799 width=72) Note that the scan on _LicHD is now sequential. If I change the above view to remove the last two columns, I get: QUERY PLAN ------------------------------------------------------------------------------------ Merge Join (cost=5235.14..35123.51rows=43680 width=93) Merge Cond: ("outer".unique_system_identifier = "inner".sys_id) -> IndexScan using "_LicHD_pkey" on "_LicHD" (cost=0.00..27361.79 rows=886799 width=72) -> Sort (cost=5235.14..5344.34 rows=43680 width=21) Sort Key: x.sys_id -> SeqScan on "_GenLicGroupA4" x (cost=0.00..1339.80 rows=43680 width=21) Which is back to my original (good) performance. Question: why do the last two column definitions in the second VIEW change the scan on _LicHD from indexed to sequential ?? -- Dean
"Dean Gibson (DB Administrator)" <postgresql3@ultimeth.com> writes: > Question: why do the last two column definitions in the second VIEW change > the scan on _LicHD from indexed to sequential ?? It's the CASE that's getting you. The poor plan is basically because the sub-view isn't getting "flattened" into the upper query, and so it's not possible to choose a plan for it that's dependent on the upper query context. And the reason it's not getting flattened is that subselects that are on the nullable side of an outer join can't be flattened unless they have nullable targetlists --- otherwise the results might not go to NULL when they are supposed to. A CASE construct is always going to be treated as non-nullable. Fixing this properly is a research project, and I haven't thought of any quick-and-dirty hacks that aren't too ugly to consider :-( In the meantime, you could easily replace that CASE construct with a min() function that's declared strict. I think date_smaller would do nicely, assuming the columns are actually of type date. regards, tom lane
Ahh, that explains why a "plain" JOIN (which doesn't do what I need) gave much better performance than the LEFT JOIN. I could ask why a CASE statement is always non-nullable, but I don't think the answer would help be solve my problem. <grin> I suppose it's that even though my particular CASE statement has WHEN/ELSE values that come from the nullable side of the JOIN, in general that's not true ... Okay, now for my big question: I searched high and low for a function that would return the minimum of two dates, and found none. Now you come up with "date_smaller", which works fine (as does "date_larger"), but where are those documented? More importantly, where are other functions like them documented? -- Dean ps: Who dreamt up the names "date_smaller" and "date_larger" ??? Much more intuitive are "min_date" and "max_date". pps: I realize that "date_smaller" isn't exactly equivalent to my CASE statement; a NULL value for one of the CASE operands causes the result of the ELSE clause to be returned, whereas "date_smaller" just returns NULL in that case. In my data, that's significant. I suppose that COALESCE has the same problem as CASE ... Tom Lane wrote on 2004-09-20 17:54: >"Dean Gibson (DB Administrator)" <postgresql3@ultimeth.com> writes: >Question: Why do the last two column definitions in the second VIEW >change the scan on _LicHD from indexed to sequential ?? > >It's the CASE that's getting you. The poor plan is basically because the >sub-view isn't getting "flattened" into the upper query, and so it's not >possible to choose a plan for it that's dependent on the upper query >context. And the reason it's not getting flattened is that subselects >that are on the nullable side of an outer join can't be flattened unless >they have nullable targetlists --- otherwise the results might not go to >NULL when they are supposed to. A CASE construct is always going to be >treated as non-nullable. > >Fixing this properly is a research project, and I haven't thought of any >quick-and-dirty hacks that aren't too ugly to consider :-( > >In the meantime, you could easily replace that CASE construct with a min() >function that's declared strict. I think date_smaller would do nicely, >assuming the columns are actually of type date.
On Mon, 20 Sep 2004, Dean Gibson (DB Administrator) wrote: > Okay, now for my big question: I searched high and low for a function that > would return the minimum of two dates, and found none. Now you come up > with "date_smaller", which works fine (as does "date_larger"), but where > are those documented? More importantly, where are other functions like > them documented? You can make them: http://www.postgresql.org/docs/7.4/interactive/server-programming.html "strict" means it will return NULL when the input is NULL.
"Dean Gibson (DB Administrator)" <postgresql3@ultimeth.com> writes: > I could ask why a CASE statement is always non-nullable, but I don't think > the answer would help be solve my problem. <grin> I suppose it's that even > though my particular CASE statement has WHEN/ELSE values that come from the > nullable side of the JOIN, in general that's not true ... Right, the code just sees CASE and barfs. A finer grain of analysis could conclude that this CASE is actually safe, but I'm unconvinced that it's worth doing. (Feel free to pursue this on -hackers if you care.) > Okay, now for my big question: I searched high and low for a function that > would return the minimum of two dates, and found none. Now you come up > with "date_smaller", which works fine (as does "date_larger"), but where > are those documented? They aren't; they are actually only intended as support functions for MIN (resp. MAX) on dates. But they're there, and there's nothing to stop you using 'em. (You do of course realize that you could have implemented these functions for yourself in a one-liner sql or plpgsql function. Finding the function in the system saves you a few minutes at most.) > ps: Who dreamt up the names "date_smaller" and "date_larger" ??? [ shrug... ] Some now-forgotten grad student at Berkeley. All the support functions for MIN and MAX are named 'foosmaller' and 'foolarger'. Try select aggtransfn from pg_aggregate, pg_proc where pg_proc.oid = aggfnoid and proname = 'min'; > pps: I realize that "date_smaller" isn't exactly equivalent to my CASE > statement; a NULL value for one of the CASE operands causes the result of > the ELSE clause to be returned, whereas "date_smaller" just returns NULL in > that case. In my data, that's significant. Well, in that case you may have a problem here. The point of the test in question is that the expression has to return NULL if *either* input is null. Now it's true that the LEFT JOIN only cares that it goes to NULL when *both* inputs are NULL, but we don't have any way to declare that particular property of a function. To make the planner happy you will have to declare the function as STRICT, which will force it to behave in the first fashion. > I suppose that COALESCE has the same problem as CASE ... Yup, of course. regards, tom lane
It turns out that even when I removed the CASE statement from the VIEW, the performance problem remained. I had to remove the conditional as well from the VIEW. To refresh your memory, there was this line in the VIEW (said VIEW being the subject of the LEFT JOIN): cancel_date < expire_date AS canceled, (cancel_date & expire_date are DATEs in the base view) When I removed the above line from the VIEW, the performance problem disappeared. Do you know why? I'm running 7.3.4, by the way. -- Dean
"Dean Gibson (DB Administrator)" <postgresql3@ultimeth.com> writes: > It turns out that even when I removed the CASE statement from the VIEW, the > performance problem remained. I had to remove the conditional as well from > the VIEW. > cancel_date < expire_date AS canceled, Yuck :-( > Do you know why? I'm running 7.3.4, by the way. 7.3 is even stupider than 7.4, is why ... In 7.3 only a view whose targetlist consists of simple column references can be pulled up into the nullable side of an outer join. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > Fixing this properly is a research project, and I haven't thought of any > quick-and-dirty hacks that aren't too ugly to consider :-( Just thinking out loud here. Instead of trying to peek inside the CASE couldn't the optimizer just wrap the non-strict expression in a conditional that tests whether the row was found? -- greg
Am 20.09.2004 18:19 schrieb T E Schmitz: > I figured it eventually. (The only thing I don't know is where to put > the ORDER BY.) > > >>I want to select only those BRAND/MODEL combinations, where the MODEL >>has more than one TYPE, but only where one of those has TYPE_NAME='xyz'. >>I am not interested in MODELs with multiple TYPEs where none of them are >>called 'xyz'. > > Try this: select b.brand_name,m.model_name,count(*) from model m inner join brand b on m.brand_fk=b.brand_pk inner join type t on m.model_pk=t.model_fk where t.type_name='xyz' group by b.brand_name,m.model_name having count(m.model_name)>0; Greetings, Martin -- Martin Knipper www : http://www.mk-os.de Mail : knipper@mk-os.de
Tom Lane wrote on 2004-09-20 22:19: >In 7.3 only a view whose targetlist consists of simple column references >can be pulled up into the nullable side of an outer join. OK, well you dragged me kicking and screaming into writing and 'end_date' function (IMMUTABLE but non-STRICT). I then use that in the JOINed VIEW (and other strategic places) rather than in the subject of the LEFT JOIN, and that works with no performance problems. Since we mentioned PostgreSQL versions, one of the things I'd like to see in v8.0 is support for "DELETE FROM ... AS ..." and "UPDATE ... AS ...". For example, I need to do things like: DELETE FROM xxx AS x WHERE field > xxx.field; In other words, a DELETE that has criteria similar to a self JOIN. Right now, I do this via: CREATE VIEW yyy AS SELECT * FROM xxx; DELETE FROM xxx WHERE field > yyy.field; And that works fine. But it's probably too late for v8.0 ... -- Dean
Hallo Martin, Martin Knipper wrote: > Am 20.09.2004 18:19 schrieb T E Schmitz: > >>>I want to select only those BRAND/MODEL combinations, where the MODEL >>>has more than one TYPE, but only where one of those has TYPE_NAME='xyz'. >>>I am not interested in MODELs with multiple TYPEs where none of them are >>>called 'xyz'. >> >> > > Try this: > > select b.brand_name,m.model_name,count(*) from model m inner join > brand b on m.brand_fk=b.brand_pk inner join type t on > m.model_pk=t.model_fk where t.type_name='xyz' group by > b.brand_name,m.model_name having count(m.model_name)>0; > Sorry, no, this doesn't work. It definitely doesn't return the required resultset. It just returns all BRAND_NAME,MODEL_NAME for all Models that have a Type with TYPE_NAME='xyz'. -- Regards/Gruß, Tarlika Elisabeth Schmitz
Recently I asked about why a field from the nullable side of an OUTER JOIN was causing the JOIN to be inefficient, and was told that it was because that field had a CASE statement as part of its definition, and that CASE (and by extension COALESCE) were non-nullable constructs. Is NULLIF nullable, in that sense? Reason: I'd like to define a VIEW with a field definition thusly: 'P.O. Box' || NULLIF( po_box, ' '::char ) AS ... And I would like the field thusly defined to be nullable. -- Dean
You can ignore my question below, since I just put the field definition below in an SQL FUNCTION and marked it STRICT. -- Dean Dean Gibson (DB Administrator) wrote on 2004-11-09 19:29: >Recently I asked about why a field from the nullable side of an OUTER JOIN >was causing the JOIN to be inefficient, and was told that it was because >that field had a CASE statement as part of its definition, and that CASE >(and by extension COALESCE) were non-nullable constructs. > >Is NULLIF nullable, in that sense? > >Reason: I'd like to define a VIEW with a field definition thusly: > >'P.O. Box' || NULLIF( po_box, ' '::char ) AS ... > >And I would like the field thusly defined to be nullable. > >-- Dean