Thread: SQL Challenge: Arbitrary Cross-tab

SQL Challenge: Arbitrary Cross-tab

From

Josh Berkus

Date:

17 August 2004, 23:56:28

Folks,

I have a wierd business case.  Annoyingly it has to be written in *portable* 
SQL92, which means no arrays or custom aggregates.   I think it may be 
impossible to do in SQL which is why I thought I'd give the people on this 
list a crack at it.   Solver gets a free drink/lunch on me if we ever meet at 
a convention.

The Problem:  for each "case" there are from zero to eight "timekeepers" 
authorized to work on the "case", out of a pool of 150 "timekeepers".  This 
data is stored vertically:

authorized_timekeepers:
case_id     | timekeeper_id
213447    | 047
132113    | 021
132113    | 115
132113    | 106
etc.

But, a client's e-billing application wants to see these timekeepers displayed 
in the following horizontal format:

case_id    | tk1    | tk2 | tk3 | tk4 | tk5 | tk6 | tk7 | tk8
213447    | 047 |     |     |     |     |     |     |     |
132113      | 021 | 115 | 106 | 034 | 109 | 112 | 087 |
etc.

Order does not matter for timekeepers 1-8.

This is a daunting problem because traditional crosstab solutions do not work; 
timekeepers 1-8 are coming out of a pool of 150.

Can it be done?  Or are we going to build this with a row-by-row procedural 
loop? (to reiterate: I'm not allowed to use a custom aggregate or other 
PostgreSQL "advanced feature")

-- 
Josh Berkus
Aglio Database Solutions
San Francisco

Re: SQL Challenge: Arbitrary Cross-tab

From

elein

Date:

18 August 2004, 00:21:22

I would use my report writer, but in any case you'd
want at least 2 separate queries, maybe three to
keep it simple and readable.

If you are allowed to use stored procedures you can
build up the output by using simple concats instead
of text aggregation (which is a procedure of simple
concats).  

Using loops and subqueries you should be to construct
the heading (count distinct timekeeper_id) and then select
the data row by row concatenating results before you
send it out.

This is a non-solution which effectively hides the
aggregation in a function.

Or write it in a client perl app if you must.

You can't really do it w/o loops or aggregates.
(I wish (hope?) I were wrong about this.)

--elein


On Tue, Aug 17, 2004 at 07:55:11PM -0700, Josh Berkus wrote:
> Folks,
> 
> I have a wierd business case.  Annoyingly it has to be written in *portable* 
> SQL92, which means no arrays or custom aggregates.   I think it may be 
> impossible to do in SQL which is why I thought I'd give the people on this 
> list a crack at it.   Solver gets a free drink/lunch on me if we ever meet at 
> a convention.
> 
> The Problem:  for each "case" there are from zero to eight "timekeepers" 
> authorized to work on the "case", out of a pool of 150 "timekeepers".  This 
> data is stored vertically:
> 
> authorized_timekeepers:
> case_id     | timekeeper_id
> 213447    | 047
> 132113    | 021
> 132113    | 115
> 132113    | 106
> etc.
> 
> But, a client's e-billing application wants to see these timekeepers displayed 
> in the following horizontal format:
> 
> case_id    | tk1    | tk2 | tk3 | tk4 | tk5 | tk6 | tk7 | tk8
> 213447    | 047 |     |     |     |     |     |     |     |
> 132113      | 021 | 115 | 106 | 034 | 109 | 112 | 087 |
> etc.
> 
> Order does not matter for timekeepers 1-8.
> 
> This is a daunting problem because traditional crosstab solutions do not work; 
> timekeepers 1-8 are coming out of a pool of 150.
> 
> Can it be done?  Or are we going to build this with a row-by-row procedural 
> loop? (to reiterate: I'm not allowed to use a custom aggregate or other 
> PostgreSQL "advanced feature")
> 
> -- 
> Josh Berkus
> Aglio Database Solutions
> San Francisco
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

Re: SQL Challenge: Arbitrary Cross-tab

From

Stephan Szabo

Date:

18 August 2004, 00:36:03

On Tue, 17 Aug 2004, Josh Berkus wrote:

> I have a wierd business case.  Annoyingly it has to be written in *portable*
> SQL92, which means no arrays or custom aggregates.   I think it may be
> impossible to do in SQL which is why I thought I'd give the people on this
> list a crack at it.   Solver gets a free drink/lunch on me if we ever meet at
> a convention.
>
> The Problem:  for each "case" there are from zero to eight "timekeepers"
> authorized to work on the "case", out of a pool of 150 "timekeepers".  This
> data is stored vertically:
>
> authorized_timekeepers:
> case_id     | timekeeper_id
> 213447    | 047
> 132113    | 021
> 132113    | 115
> 132113    | 106
> etc.
>
> But, a client's e-billing application wants to see these timekeepers displayed
> in the following horizontal format:
>
> case_id    | tk1    | tk2 | tk3 | tk4 | tk5 | tk6 | tk7 | tk8
> 213447    | 047 |     |     |     |     |     |     |     |
> 132113      | 021 | 115 | 106 | 034 | 109 | 112 | 087 |
> etc.
>
> Order does not matter for timekeepers 1-8.
>
> This is a daunting problem because traditional crosstab solutions do not work;
> timekeepers 1-8 are coming out of a pool of 150.
>
> Can it be done?  Or are we going to build this with a row-by-row procedural

If you know it's max 8, I think it may be possible, but I can't think of a
way that'd be better than just writing code yourself.

Just maybe something like the following would give you three timekeepers:
select foo.case_id, foo.v1, foo.v2, (select min(timekeeper_id) from
authorized_timekeepers where
authorized_timekeepers.case_id=foo.case_id and timekeeper_id > foo.v2) as
v3 from (select foo.case_id, foo.v1, (select min(timekeeper_id) from
authorized_timekeepers whereauthorized_timekeepers.case_id=foo.case_id and timekeeper_id > foo.v1) as
v2 from  (select foo.case_id, foo.v1 from   (select foo.case_id,    (select min(timekeeper_id) from
authorized_timekeeperswhere      authorized_timekeepers.case_id=foo.case_id) as v1    from (select distinct case_id
fromauthorized_timekeepers) foo  ) foo) foo) foo;
 

If that works for 3 (and I think that's standard behavior), then you
should be able to extend it to any fixed number using the pattern.

Re: SQL Challenge: Arbitrary Cross-tab

From

"Greg Sabino Mullane"

Date:

18 August 2004, 00:36:41

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Names shortened to spare the line lengths:
SELECT bob.cid, (SELECT tid FROM ats WHERE cid=bob.cid LIMIT 1) AS tk1, (SELECT tid FROM ats WHERE cid=bob.cid LIMIT 1
OFFSET1) AS tk2, (SELECT tid FROM ats WHERE cid=bob.cid LIMIT 1 OFFSET 2) AS tk3, (SELECT tid FROM ats WHERE
cid=bob.cidLIMIT 1 OFFSET 3) AS tk4, (SELECT tid FROM ats WHERE cid=bob.cid LIMIT 1 OFFSET 4) AS tk5, (SELECT tid FROM
atsWHERE cid=bob.cid LIMIT 1 OFFSET 5) AS tk6, (SELECT tid FROM ats WHERE cid=bob.cid LIMIT 1 OFFSET 6) AS tk7, (SELECT
tidFROM ats WHERE cid=bob.cid LIMIT 1 OFFSET 7) AS tk8,
 
FROM (SELECT DISTINCT cid FROM ats) AS bob;
- --
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200408172335
-----BEGIN PGP SIGNATURE-----
iD8DBQFBIs7AvJuQZxSWSsgRAkglAJ9mNEmOYlLPynygMmelvzlqkYoHlwCeJqTb
g5gyh9LztONPCZj32aOEuGI=
=Yy7m
-----END PGP SIGNATURE-----

Re: SQL Challenge: Arbitrary Cross-tab

From

Josh Berkus

Date:

18 August 2004, 00:42:28

Greg, Stephan,

>   (SELECT tid FROM ats WHERE cid=bob.cid LIMIT 1 OFFSET 1) AS tk2,

bzzzzz!  Thanks for playing.   LIMIT and OFFSET, sadly, are not SQL standard.  
They're only portable to MySQL.  This has to port to SQL Server and Oracle.

> If that works for 3 (and I think that's standard behavior), then you
> should be able to extend it to any fixed number using the pattern.

Hmmm... that might work.  I'll have to test that the nesting doesn't kill SQL 
Server (a serious danger) but it's worth a try.  Performance will really suck 
but fortunately we only run this bill once a month.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco

Re: SQL Challenge: Arbitrary Cross-tab

From

Joe Conway

Date:

18 August 2004, 02:16:08

Josh Berkus wrote:
> The Problem:  for each "case" there are from zero to eight "timekeepers" 
> authorized to work on the "case", out of a pool of 150 "timekeepers".  This 
> data is stored vertically:
> 
> authorized_timekeepers:
> case_id     | timekeeper_id
> 213447    | 047
> 132113    | 021
> 132113    | 115
> 132113    | 106
> etc.
> 
> But, a client's e-billing application wants to see these timekeepers displayed 
> in the following horizontal format:
> 
> case_id    | tk1    | tk2 | tk3 | tk4 | tk5 | tk6 | tk7 | tk8
> 213447    | 047 |     |     |     |     |     |     |     |
> 132113      | 021 | 115 | 106 | 034 | 109 | 112 | 087 |
> etc.
> 
> Order does not matter for timekeepers 1-8.
> 
> This is a daunting problem because traditional crosstab solutions do not work; 
> timekeepers 1-8 are coming out of a pool of 150.
> 
> Can it be done?  Or are we going to build this with a row-by-row procedural 
> loop? (to reiterate: I'm not allowed to use a custom aggregate or other 
> PostgreSQL "advanced feature")
> 

This is pretty much exactly how contrib/tablefunc's crosstab (non-hashed 
version; crosstab(sourcesql, ncols)) works. If you really need it to be 
portable, though, application layer procedural code is likely to be the 
easiest and fastest way to go. crosstab just wraps the procedural code 
in an SRF for you.

Joe

Re: SQL Challenge: Arbitrary Cross-tab

From

Gaetano Mendola

Date:

18 August 2004, 08:12:47

Greg Sabino Mullane wrote:

> 
> Names shortened to spare the line lengths:
> 
> SELECT bob.cid,
>   (SELECT tid FROM ats WHERE cid=bob.cid LIMIT 1) AS tk1,
>   (SELECT tid FROM ats WHERE cid=bob.cid LIMIT 1 OFFSET 1) AS tk2,
>   (SELECT tid FROM ats WHERE cid=bob.cid LIMIT 1 OFFSET 2) AS tk3,
>   (SELECT tid FROM ats WHERE cid=bob.cid LIMIT 1 OFFSET 3) AS tk4,
>   (SELECT tid FROM ats WHERE cid=bob.cid LIMIT 1 OFFSET 4) AS tk5,
>   (SELECT tid FROM ats WHERE cid=bob.cid LIMIT 1 OFFSET 5) AS tk6,
>   (SELECT tid FROM ats WHERE cid=bob.cid LIMIT 1 OFFSET 6) AS tk7,
>   (SELECT tid FROM ats WHERE cid=bob.cid LIMIT 1 OFFSET 7) AS tk8,
> FROM (SELECT DISTINCT cid FROM ats) AS bob;
> 

Don't you miss for each subselect an order by tid ?


Regards
Gaetano Mendola

Re: SQL Challenge: Arbitrary Cross-tab

From

"Greg Sabino Mullane"

Date:

18 August 2004, 08:47:45

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
> Don't you miss for each subselect an order by tid ?
No: since all the SELECTs are part of one statement, they
will have the same (pseudo-random) implicit order. Since
Josh's requirement said the order of the tids* was not important,
I can be lazy and get away with it in this case. An order by
would not hurt, of course.
* As far as PG goes, this is not an ideal abbreviation! :)
- --
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200408180745
-----BEGIN PGP SIGNATURE-----
iD8DBQFBI0InvJuQZxSWSsgRApW7AKCpFN6TMQ3WjcJgZse5f+Ap6/Y7RACfSSlc
MZusqEadF2xZrE4PLOhmMek=
=pnQz
-----END PGP SIGNATURE-----

Re: SQL Challenge: Arbitrary Cross-tab

From

Gaetano Mendola

Date:

18 August 2004, 12:12:39

Greg Sabino Mullane wrote:

>  
> 
>>>Don't you miss for each subselect an order by tid ?
> 
>  
> No: since all the SELECTs are part of one statement, they
> will have the same (pseudo-random) implicit order. 

Is this guaranted ?

Regards
Gaetano Mendola

Re: SQL Challenge: Arbitrary Cross-tab

From

Tom Lane

Date:

18 August 2004, 12:13:33

"Greg Sabino Mullane" <greg@turnstep.com> writes:
>> Don't you miss for each subselect an order by tid ?
> No: since all the SELECTs are part of one statement, they
> will have the same (pseudo-random) implicit order.

Nope; Gaetano's right, you cannot assume that.  It's entirely possible
for the planner to choose different plans depending on the OFFSET.
(Maybe not very likely, with such small offsets, but could happen.)
        regards, tom lane

Re: SQL Challenge: Arbitrary Cross-tab

From

Chris Travers

Date:

18 August 2004, 13:14:10

Josh Berkus wrote:

>Folks,
>
>I have a wierd business case.  Annoyingly it has to be written in *portable* 
>SQL92, which means no arrays or custom aggregates.   I think it may be 
>impossible to do in SQL which is why I thought I'd give the people on this 
>list a crack at it.   Solver gets a free drink/lunch on me if we ever meet at 
>a convention.
>
>  
>
Might be possible.  Would certainly be ugly.

>The Problem:  for each "case" there are from zero to eight "timekeepers" 
>authorized to work on the "case", out of a pool of 150 "timekeepers".  This 
>data is stored vertically:
>
>authorized_timekeepers:
>case_id     | timekeeper_id
>213447    | 047
>132113    | 021
>132113    | 115
>132113    | 106
>etc.
>
>But, a client's e-billing application wants to see these timekeepers displayed 
>in the following horizontal format:
>
>case_id    | tk1    | tk2 | tk3 | tk4 | tk5 | tk6 | tk7 | tk8
>213447    | 047 |     |     |     |     |     |     |     |
>132113      | 021 | 115 | 106 | 034 | 109 | 112 | 087 |
>etc.
>
>Order does not matter for timekeepers 1-8.
>
>This is a daunting problem because traditional crosstab solutions do not work; 
>timekeepers 1-8 are coming out of a pool of 150.
>
>Can it be done?  Or are we going to build this with a row-by-row procedural 
>loop? (to reiterate: I'm not allowed to use a custom aggregate or other 
>PostgreSQL "advanced feature")
>
>  
>
If it can be done, it might be extremely ugly.  I am thinking a massive
set of left self joins (since there could be between 0 and 8).

Something like:
select case_id FROM authorized_timekeeper t0
LEFT JOIN    (SELECT case_id, min(timekeeper_id) FROM authorized_timekeeper       GROUP BY case_id) t1   ON case_id
LEFT JOIN   (SELECT case_id, min(timekeeper_id) FROM authorized_timekeeper      WHERE timekeeper_id <> t1.timekeeper
 GROUP BY case_id) t2   ON case_id
 
LEFT JOIN   (SELECT case_id, min(timekeeper_id) FROM authorized_timekeeper      WHERE timekeeper_id NOT IN
(t1.timekeeper,t2.timekeeper)      GROUP BY case_id) t3
 
etc....

If this is not an option, instead I would create a series of views.
Something like:
CREATE VIEW t1 AS select case_id, min(timekeeper_id) AS tk_id   from authorized_timekeepers   group by case_id;
CREATE VIEW t2 AS select case_id, min(timekeeper_id) AS tk_id   from authorized_timekeepers   WHERE tk_id NOT IN
(SELECTtk_id FROM t1)   group by case_id;
 
CREATE VIEW t3 AS select case_id, min(timekeeper_id) AS tk_id   FROM authorized_timekeepers   WHERE tk_id NOT IN
(SELECTtk_id FROM t1)       AND tk_id NOT IN (SELECT tk_id FROM t2)   GROUP BY case_id;
 
Etc.
Then you do a left join among the views.

Hope that this helps.

Best Wishes,
Chris Travers

Re: SQL Challenge: Arbitrary Cross-tab

From

Josh Berkus

Date:

18 August 2004, 13:57:33

Joe, Elein:

> This is pretty much exactly how contrib/tablefunc's crosstab (non-hashed
> version; crosstab(sourcesql, ncols)) works. If you really need it to be
> portable, though, application layer procedural code is likely to be the
> easiest and fastest way to go. crosstab just wraps the procedural code
> in an SRF for you.

No, you're missing one factor in the spec.   Timekeeper_1 for case_id = 182738 
is not the same timekeeper as Timekeeper_1 for case_id = 217437.  That's why 
traditional crosstab plans don't work.

Anyway, here's the wrap-up:   I tried Stephan's idea, it works, but it's so 
slow that we're going to to the procedural loop.   Thanks, all!

-- 
Josh Berkus
Aglio Database Solutions
San Francisco

Re: SQL Challenge: Arbitrary Cross-tab

From

Joe Conway

Date:

18 August 2004, 14:54:38

Josh Berkus wrote:
>>This is pretty much exactly how contrib/tablefunc's crosstab (non-hashed
>>version; crosstab(sourcesql, ncols)) works. If you really need it to be
>>portable, though, application layer procedural code is likely to be the
>>easiest and fastest way to go. crosstab just wraps the procedural code
>>in an SRF for you.
> 
> No, you're missing one factor in the spec.   Timekeeper_1 for case_id = 182738 
> is not the same timekeeper as Timekeeper_1 for case_id = 217437.  That's why 
> traditional crosstab plans don't work.

No, I understood. E.g.

create table authorized_timekeepers (
case_id int,
timekeeper_id text
);

insert into authorized_timekeepers values(213447,'047');
insert into authorized_timekeepers values(132113,'021');
insert into authorized_timekeepers values(132113,'115');
insert into authorized_timekeepers values(132113,'106');
insert into authorized_timekeepers values(140000,'106');
insert into authorized_timekeepers values(140000,'021');
insert into authorized_timekeepers values(140000,'115');
insert into authorized_timekeepers values(140000,'108');
insert into authorized_timekeepers values(140000,'006');
insert into authorized_timekeepers values(140000,'042');
insert into authorized_timekeepers values(140000,'142');
insert into authorized_timekeepers values(140000,'064');
insert into authorized_timekeepers values(140000,'999');

select * from crosstab('select case_id, ''cat'' as cat, timekeeper_id
from authorized_timekeepers order by 1',8)
as t(case_id int, tk1 text, tk2 text, tk3 text, tk4 text, tk5 text, tk6 
text, tk7 text, tk8 text);
 case_id | tk1 | tk2 | tk3 | tk4 | tk5 | tk6 | tk7 | tk8
---------+-----+-----+-----+-----+-----+-----+-----+-----  132113 | 021 | 115 | 106 |     |     |     |     |  140000 |
106| 021 | 115 | 108 | 006 | 042 | 142 | 064  213447 | 047 |     |     |     |     |     |     |
 
(3 rows)

Or even:

select * from crosstab('select case_id, ''cat'' as cat, timekeeper_id
from authorized_timekeepers order by 1',4)
as t(case_id int, tk1 text, tk2 text, tk3 text, tk4 text);
 case_id | tk1 | tk2 | tk3 | tk4
---------+-----+-----+-----+-----  132113 | 021 | 115 | 106 |  140000 | 106 | 021 | 115 | 108  213447 | 047 |     |
|
(3 rows)


But I know that doesn't help you with portability.

Joe

Re: SQL Challenge: Arbitrary Cross-tab

From

Josh Berkus

Date:

18 August 2004, 15:52:24

Joe,

>   case_id | tk1 | tk2 | tk3 | tk4 | tk5 | tk6 | tk7 | tk8
> ---------+-----+-----+-----+-----+-----+-----+-----+-----
>    132113 | 021 | 115 | 106 |     |     |     |     |
>    140000 | 106 | 021 | 115 | 108 | 006 | 042 | 142 | 064
>    213447 | 047 |     |     |     |     |     |     |

Darn I wish this didn't have to be portable ....

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

Re: SQL Challenge: Arbitrary Cross-tab

From

"Greg Sabino Mullane"

Date:

19 August 2004, 23:16:49

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
> Nope; Gaetano's right, you cannot assume that.  It's entirely possible
> for the planner to choose different plans depending on the OFFSET.
> (Maybe not very likely, with such small offsets, but could happen.)
Interesting. I realized that there was a finite chance of different
plans being chosen, but it seemed neigh-impossible since there is
no WHERE clause and the offsets only vary from 0-7. What sort of
different plans would it choose, out of curiosity?
Seq-scan vs. index-scan? Are there any particular cases where the
same plan is guaranteed to be used?
- --
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200408192216
-----BEGIN PGP SIGNATURE-----
iD8DBQFBJV9xvJuQZxSWSsgRAp74AJ96mtrKC1J53y0TPqTPdq2Xost0fACg4DnJ
7P+dgpHWBazGNE9+SR7uxLY=
=MZuM
-----END PGP SIGNATURE-----

Re: SQL Challenge: Arbitrary Cross-tab

From

Tom Lane

Date:

20 August 2004, 00:25:16

"Greg Sabino Mullane" <greg@turnstep.com> writes:
>> Nope; Gaetano's right, you cannot assume that.  It's entirely possible
>> for the planner to choose different plans depending on the OFFSET.
>> (Maybe not very likely, with such small offsets, but could happen.)
> Interesting. I realized that there was a finite chance of different
> plans being chosen, but it seemed neigh-impossible since there is
> no WHERE clause and the offsets only vary from 0-7. What sort of
> different plans would it choose, out of curiosity?

For the particular trivial case you were showing (no WHERE, no GROUP BY,
no ORDER BY, no DISTINCT, no nada) it's probably true that only a
seqscan plan would ever be chosen.  I was worrying that people would
take this example and try to add "just that other little thing" to it
and get burnt.

> Are there any particular cases where the same plan is guaranteed to be
> used?

No, I'd never care to make such a guarantee.  The planner is entirely
driven by cost estimates.  Even if I could say something definitive
about the behavior with the default cost parameters, it wouldn't
necessary hold up when someone had taken an axe to random_page_cost
or something like that.  (It's not impossible that the thing would pick
an indexscan plan for even this trivial case, were you to set
random_page_cost below 1.)
        regards, tom lane