Thread: SQL Challenge: Arbitrary Cross-tab
Folks, I have a wierd business case. Annoyingly it has to be written in *portable* SQL92, which means no arrays or custom aggregates. I think it may be impossible to do in SQL which is why I thought I'd give the people on this list a crack at it. Solver gets a free drink/lunch on me if we ever meet at a convention. The Problem: for each "case" there are from zero to eight "timekeepers" authorized to work on the "case", out of a pool of 150 "timekeepers". This data is stored vertically: authorized_timekeepers: case_id | timekeeper_id 213447 | 047 132113 | 021 132113 | 115 132113 | 106 etc. But, a client's e-billing application wants to see these timekeepers displayed in the following horizontal format: case_id | tk1 | tk2 | tk3 | tk4 | tk5 | tk6 | tk7 | tk8 213447 | 047 | | | | | | | | 132113 | 021 | 115 | 106 | 034 | 109 | 112 | 087 | etc. Order does not matter for timekeepers 1-8. This is a daunting problem because traditional crosstab solutions do not work; timekeepers 1-8 are coming out of a pool of 150. Can it be done? Or are we going to build this with a row-by-row procedural loop? (to reiterate: I'm not allowed to use a custom aggregate or other PostgreSQL "advanced feature") -- Josh Berkus Aglio Database Solutions San Francisco
I would use my report writer, but in any case you'd want at least 2 separate queries, maybe three to keep it simple and readable. If you are allowed to use stored procedures you can build up the output by using simple concats instead of text aggregation (which is a procedure of simple concats). Using loops and subqueries you should be to construct the heading (count distinct timekeeper_id) and then select the data row by row concatenating results before you send it out. This is a non-solution which effectively hides the aggregation in a function. Or write it in a client perl app if you must. You can't really do it w/o loops or aggregates. (I wish (hope?) I were wrong about this.) --elein On Tue, Aug 17, 2004 at 07:55:11PM -0700, Josh Berkus wrote: > Folks, > > I have a wierd business case. Annoyingly it has to be written in *portable* > SQL92, which means no arrays or custom aggregates. I think it may be > impossible to do in SQL which is why I thought I'd give the people on this > list a crack at it. Solver gets a free drink/lunch on me if we ever meet at > a convention. > > The Problem: for each "case" there are from zero to eight "timekeepers" > authorized to work on the "case", out of a pool of 150 "timekeepers". This > data is stored vertically: > > authorized_timekeepers: > case_id | timekeeper_id > 213447 | 047 > 132113 | 021 > 132113 | 115 > 132113 | 106 > etc. > > But, a client's e-billing application wants to see these timekeepers displayed > in the following horizontal format: > > case_id | tk1 | tk2 | tk3 | tk4 | tk5 | tk6 | tk7 | tk8 > 213447 | 047 | | | | | | | | > 132113 | 021 | 115 | 106 | 034 | 109 | 112 | 087 | > etc. > > Order does not matter for timekeepers 1-8. > > This is a daunting problem because traditional crosstab solutions do not work; > timekeepers 1-8 are coming out of a pool of 150. > > Can it be done? Or are we going to build this with a row-by-row procedural > loop? (to reiterate: I'm not allowed to use a custom aggregate or other > PostgreSQL "advanced feature") > > -- > Josh Berkus > Aglio Database Solutions > San Francisco > > ---------------------------(end of broadcast)--------------------------- > TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
On Tue, 17 Aug 2004, Josh Berkus wrote: > I have a wierd business case. Annoyingly it has to be written in *portable* > SQL92, which means no arrays or custom aggregates. I think it may be > impossible to do in SQL which is why I thought I'd give the people on this > list a crack at it. Solver gets a free drink/lunch on me if we ever meet at > a convention. > > The Problem: for each "case" there are from zero to eight "timekeepers" > authorized to work on the "case", out of a pool of 150 "timekeepers". This > data is stored vertically: > > authorized_timekeepers: > case_id | timekeeper_id > 213447 | 047 > 132113 | 021 > 132113 | 115 > 132113 | 106 > etc. > > But, a client's e-billing application wants to see these timekeepers displayed > in the following horizontal format: > > case_id | tk1 | tk2 | tk3 | tk4 | tk5 | tk6 | tk7 | tk8 > 213447 | 047 | | | | | | | | > 132113 | 021 | 115 | 106 | 034 | 109 | 112 | 087 | > etc. > > Order does not matter for timekeepers 1-8. > > This is a daunting problem because traditional crosstab solutions do not work; > timekeepers 1-8 are coming out of a pool of 150. > > Can it be done? Or are we going to build this with a row-by-row procedural If you know it's max 8, I think it may be possible, but I can't think of a way that'd be better than just writing code yourself. Just maybe something like the following would give you three timekeepers: select foo.case_id, foo.v1, foo.v2, (select min(timekeeper_id) from authorized_timekeepers where authorized_timekeepers.case_id=foo.case_id and timekeeper_id > foo.v2) as v3 from (select foo.case_id, foo.v1, (select min(timekeeper_id) from authorized_timekeepers whereauthorized_timekeepers.case_id=foo.case_id and timekeeper_id > foo.v1) as v2 from (select foo.case_id, foo.v1 from (select foo.case_id, (select min(timekeeper_id) from authorized_timekeeperswhere authorized_timekeepers.case_id=foo.case_id) as v1 from (select distinct case_id fromauthorized_timekeepers) foo ) foo) foo) foo; If that works for 3 (and I think that's standard behavior), then you should be able to extend it to any fixed number using the pattern.
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Names shortened to spare the line lengths: SELECT bob.cid, (SELECT tid FROM ats WHERE cid=bob.cid LIMIT 1) AS tk1, (SELECT tid FROM ats WHERE cid=bob.cid LIMIT 1 OFFSET1) AS tk2, (SELECT tid FROM ats WHERE cid=bob.cid LIMIT 1 OFFSET 2) AS tk3, (SELECT tid FROM ats WHERE cid=bob.cidLIMIT 1 OFFSET 3) AS tk4, (SELECT tid FROM ats WHERE cid=bob.cid LIMIT 1 OFFSET 4) AS tk5, (SELECT tid FROM atsWHERE cid=bob.cid LIMIT 1 OFFSET 5) AS tk6, (SELECT tid FROM ats WHERE cid=bob.cid LIMIT 1 OFFSET 6) AS tk7, (SELECT tidFROM ats WHERE cid=bob.cid LIMIT 1 OFFSET 7) AS tk8, FROM (SELECT DISTINCT cid FROM ats) AS bob; - -- Greg Sabino Mullane greg@turnstep.com PGP Key: 0x14964AC8 200408172335 -----BEGIN PGP SIGNATURE----- iD8DBQFBIs7AvJuQZxSWSsgRAkglAJ9mNEmOYlLPynygMmelvzlqkYoHlwCeJqTb g5gyh9LztONPCZj32aOEuGI= =Yy7m -----END PGP SIGNATURE-----
Greg, Stephan, > (SELECT tid FROM ats WHERE cid=bob.cid LIMIT 1 OFFSET 1) AS tk2, bzzzzz! Thanks for playing. LIMIT and OFFSET, sadly, are not SQL standard. They're only portable to MySQL. This has to port to SQL Server and Oracle. > If that works for 3 (and I think that's standard behavior), then you > should be able to extend it to any fixed number using the pattern. Hmmm... that might work. I'll have to test that the nesting doesn't kill SQL Server (a serious danger) but it's worth a try. Performance will really suck but fortunately we only run this bill once a month. -- Josh Berkus Aglio Database Solutions San Francisco
Josh Berkus wrote: > The Problem: for each "case" there are from zero to eight "timekeepers" > authorized to work on the "case", out of a pool of 150 "timekeepers". This > data is stored vertically: > > authorized_timekeepers: > case_id | timekeeper_id > 213447 | 047 > 132113 | 021 > 132113 | 115 > 132113 | 106 > etc. > > But, a client's e-billing application wants to see these timekeepers displayed > in the following horizontal format: > > case_id | tk1 | tk2 | tk3 | tk4 | tk5 | tk6 | tk7 | tk8 > 213447 | 047 | | | | | | | | > 132113 | 021 | 115 | 106 | 034 | 109 | 112 | 087 | > etc. > > Order does not matter for timekeepers 1-8. > > This is a daunting problem because traditional crosstab solutions do not work; > timekeepers 1-8 are coming out of a pool of 150. > > Can it be done? Or are we going to build this with a row-by-row procedural > loop? (to reiterate: I'm not allowed to use a custom aggregate or other > PostgreSQL "advanced feature") > This is pretty much exactly how contrib/tablefunc's crosstab (non-hashed version; crosstab(sourcesql, ncols)) works. If you really need it to be portable, though, application layer procedural code is likely to be the easiest and fastest way to go. crosstab just wraps the procedural code in an SRF for you. Joe
Greg Sabino Mullane wrote: > > Names shortened to spare the line lengths: > > SELECT bob.cid, > (SELECT tid FROM ats WHERE cid=bob.cid LIMIT 1) AS tk1, > (SELECT tid FROM ats WHERE cid=bob.cid LIMIT 1 OFFSET 1) AS tk2, > (SELECT tid FROM ats WHERE cid=bob.cid LIMIT 1 OFFSET 2) AS tk3, > (SELECT tid FROM ats WHERE cid=bob.cid LIMIT 1 OFFSET 3) AS tk4, > (SELECT tid FROM ats WHERE cid=bob.cid LIMIT 1 OFFSET 4) AS tk5, > (SELECT tid FROM ats WHERE cid=bob.cid LIMIT 1 OFFSET 5) AS tk6, > (SELECT tid FROM ats WHERE cid=bob.cid LIMIT 1 OFFSET 6) AS tk7, > (SELECT tid FROM ats WHERE cid=bob.cid LIMIT 1 OFFSET 7) AS tk8, > FROM (SELECT DISTINCT cid FROM ats) AS bob; > Don't you miss for each subselect an order by tid ? Regards Gaetano Mendola
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 > Don't you miss for each subselect an order by tid ? No: since all the SELECTs are part of one statement, they will have the same (pseudo-random) implicit order. Since Josh's requirement said the order of the tids* was not important, I can be lazy and get away with it in this case. An order by would not hurt, of course. * As far as PG goes, this is not an ideal abbreviation! :) - -- Greg Sabino Mullane greg@turnstep.com PGP Key: 0x14964AC8 200408180745 -----BEGIN PGP SIGNATURE----- iD8DBQFBI0InvJuQZxSWSsgRApW7AKCpFN6TMQ3WjcJgZse5f+Ap6/Y7RACfSSlc MZusqEadF2xZrE4PLOhmMek= =pnQz -----END PGP SIGNATURE-----
Greg Sabino Mullane wrote: > > >>>Don't you miss for each subselect an order by tid ? > > > No: since all the SELECTs are part of one statement, they > will have the same (pseudo-random) implicit order. Is this guaranted ? Regards Gaetano Mendola
"Greg Sabino Mullane" <greg@turnstep.com> writes: >> Don't you miss for each subselect an order by tid ? > No: since all the SELECTs are part of one statement, they > will have the same (pseudo-random) implicit order. Nope; Gaetano's right, you cannot assume that. It's entirely possible for the planner to choose different plans depending on the OFFSET. (Maybe not very likely, with such small offsets, but could happen.) regards, tom lane
Josh Berkus wrote: >Folks, > >I have a wierd business case. Annoyingly it has to be written in *portable* >SQL92, which means no arrays or custom aggregates. I think it may be >impossible to do in SQL which is why I thought I'd give the people on this >list a crack at it. Solver gets a free drink/lunch on me if we ever meet at >a convention. > > > Might be possible. Would certainly be ugly. >The Problem: for each "case" there are from zero to eight "timekeepers" >authorized to work on the "case", out of a pool of 150 "timekeepers". This >data is stored vertically: > >authorized_timekeepers: >case_id | timekeeper_id >213447 | 047 >132113 | 021 >132113 | 115 >132113 | 106 >etc. > >But, a client's e-billing application wants to see these timekeepers displayed >in the following horizontal format: > >case_id | tk1 | tk2 | tk3 | tk4 | tk5 | tk6 | tk7 | tk8 >213447 | 047 | | | | | | | | >132113 | 021 | 115 | 106 | 034 | 109 | 112 | 087 | >etc. > >Order does not matter for timekeepers 1-8. > >This is a daunting problem because traditional crosstab solutions do not work; >timekeepers 1-8 are coming out of a pool of 150. > >Can it be done? Or are we going to build this with a row-by-row procedural >loop? (to reiterate: I'm not allowed to use a custom aggregate or other >PostgreSQL "advanced feature") > > > If it can be done, it might be extremely ugly. I am thinking a massive set of left self joins (since there could be between 0 and 8). Something like: select case_id FROM authorized_timekeeper t0 LEFT JOIN (SELECT case_id, min(timekeeper_id) FROM authorized_timekeeper GROUP BY case_id) t1 ON case_id LEFT JOIN (SELECT case_id, min(timekeeper_id) FROM authorized_timekeeper WHERE timekeeper_id <> t1.timekeeper GROUP BY case_id) t2 ON case_id LEFT JOIN (SELECT case_id, min(timekeeper_id) FROM authorized_timekeeper WHERE timekeeper_id NOT IN (t1.timekeeper,t2.timekeeper) GROUP BY case_id) t3 etc.... If this is not an option, instead I would create a series of views. Something like: CREATE VIEW t1 AS select case_id, min(timekeeper_id) AS tk_id from authorized_timekeepers group by case_id; CREATE VIEW t2 AS select case_id, min(timekeeper_id) AS tk_id from authorized_timekeepers WHERE tk_id NOT IN (SELECTtk_id FROM t1) group by case_id; CREATE VIEW t3 AS select case_id, min(timekeeper_id) AS tk_id FROM authorized_timekeepers WHERE tk_id NOT IN (SELECTtk_id FROM t1) AND tk_id NOT IN (SELECT tk_id FROM t2) GROUP BY case_id; Etc. Then you do a left join among the views. Hope that this helps. Best Wishes, Chris Travers
Joe, Elein: > This is pretty much exactly how contrib/tablefunc's crosstab (non-hashed > version; crosstab(sourcesql, ncols)) works. If you really need it to be > portable, though, application layer procedural code is likely to be the > easiest and fastest way to go. crosstab just wraps the procedural code > in an SRF for you. No, you're missing one factor in the spec. Timekeeper_1 for case_id = 182738 is not the same timekeeper as Timekeeper_1 for case_id = 217437. That's why traditional crosstab plans don't work. Anyway, here's the wrap-up: I tried Stephan's idea, it works, but it's so slow that we're going to to the procedural loop. Thanks, all! -- Josh Berkus Aglio Database Solutions San Francisco
Josh Berkus wrote: >>This is pretty much exactly how contrib/tablefunc's crosstab (non-hashed >>version; crosstab(sourcesql, ncols)) works. If you really need it to be >>portable, though, application layer procedural code is likely to be the >>easiest and fastest way to go. crosstab just wraps the procedural code >>in an SRF for you. > > No, you're missing one factor in the spec. Timekeeper_1 for case_id = 182738 > is not the same timekeeper as Timekeeper_1 for case_id = 217437. That's why > traditional crosstab plans don't work. No, I understood. E.g. create table authorized_timekeepers ( case_id int, timekeeper_id text ); insert into authorized_timekeepers values(213447,'047'); insert into authorized_timekeepers values(132113,'021'); insert into authorized_timekeepers values(132113,'115'); insert into authorized_timekeepers values(132113,'106'); insert into authorized_timekeepers values(140000,'106'); insert into authorized_timekeepers values(140000,'021'); insert into authorized_timekeepers values(140000,'115'); insert into authorized_timekeepers values(140000,'108'); insert into authorized_timekeepers values(140000,'006'); insert into authorized_timekeepers values(140000,'042'); insert into authorized_timekeepers values(140000,'142'); insert into authorized_timekeepers values(140000,'064'); insert into authorized_timekeepers values(140000,'999'); select * from crosstab('select case_id, ''cat'' as cat, timekeeper_id from authorized_timekeepers order by 1',8) as t(case_id int, tk1 text, tk2 text, tk3 text, tk4 text, tk5 text, tk6 text, tk7 text, tk8 text); case_id | tk1 | tk2 | tk3 | tk4 | tk5 | tk6 | tk7 | tk8 ---------+-----+-----+-----+-----+-----+-----+-----+----- 132113 | 021 | 115 | 106 | | | | | 140000 | 106| 021 | 115 | 108 | 006 | 042 | 142 | 064 213447 | 047 | | | | | | | (3 rows) Or even: select * from crosstab('select case_id, ''cat'' as cat, timekeeper_id from authorized_timekeepers order by 1',4) as t(case_id int, tk1 text, tk2 text, tk3 text, tk4 text); case_id | tk1 | tk2 | tk3 | tk4 ---------+-----+-----+-----+----- 132113 | 021 | 115 | 106 | 140000 | 106 | 021 | 115 | 108 213447 | 047 | | | (3 rows) But I know that doesn't help you with portability. Joe
Joe, > case_id | tk1 | tk2 | tk3 | tk4 | tk5 | tk6 | tk7 | tk8 > ---------+-----+-----+-----+-----+-----+-----+-----+----- > 132113 | 021 | 115 | 106 | | | | | > 140000 | 106 | 021 | 115 | 108 | 006 | 042 | 142 | 064 > 213447 | 047 | | | | | | | Darn I wish this didn't have to be portable .... -- --Josh Josh Berkus Aglio Database Solutions San Francisco
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 > Nope; Gaetano's right, you cannot assume that. It's entirely possible > for the planner to choose different plans depending on the OFFSET. > (Maybe not very likely, with such small offsets, but could happen.) Interesting. I realized that there was a finite chance of different plans being chosen, but it seemed neigh-impossible since there is no WHERE clause and the offsets only vary from 0-7. What sort of different plans would it choose, out of curiosity? Seq-scan vs. index-scan? Are there any particular cases where the same plan is guaranteed to be used? - -- Greg Sabino Mullane greg@turnstep.com PGP Key: 0x14964AC8 200408192216 -----BEGIN PGP SIGNATURE----- iD8DBQFBJV9xvJuQZxSWSsgRAp74AJ96mtrKC1J53y0TPqTPdq2Xost0fACg4DnJ 7P+dgpHWBazGNE9+SR7uxLY= =MZuM -----END PGP SIGNATURE-----
"Greg Sabino Mullane" <greg@turnstep.com> writes: >> Nope; Gaetano's right, you cannot assume that. It's entirely possible >> for the planner to choose different plans depending on the OFFSET. >> (Maybe not very likely, with such small offsets, but could happen.) > Interesting. I realized that there was a finite chance of different > plans being chosen, but it seemed neigh-impossible since there is > no WHERE clause and the offsets only vary from 0-7. What sort of > different plans would it choose, out of curiosity? For the particular trivial case you were showing (no WHERE, no GROUP BY, no ORDER BY, no DISTINCT, no nada) it's probably true that only a seqscan plan would ever be chosen. I was worrying that people would take this example and try to add "just that other little thing" to it and get burnt. > Are there any particular cases where the same plan is guaranteed to be > used? No, I'd never care to make such a guarantee. The planner is entirely driven by cost estimates. Even if I could say something definitive about the behavior with the default cost parameters, it wouldn't necessary hold up when someone had taken an axe to random_page_cost or something like that. (It's not impossible that the thing would pick an indexscan plan for even this trivial case, were you to set random_page_cost below 1.) regards, tom lane