Thread: SQL challenge--top 10 for each key value?

SQL challenge--top 10 for each key value?

From
Jeff Boes
Date:
Offered up for anyone with time on their hands. I fiddled around with 
this for half an afternoon, then gave up and did it programmatically in 
Perl.

Given a table that looks something like this:

id       | INTEGER
query    | INTEGER
checksum | char(32)
score    | INTEGER
include  | BOOLEAN


The table is unique by "id". "Checksum" may be repeated, but I only care 
if it is repeated within a given group by "query". ("query" is non-null.)

I can get the top scorer for each "query" row by something like this:

SELECT * FROM (  SELECT DISTINCT ON (checksum) *  FROM my_table  ORDER BY checksum, score DESC)
ORDER BY query;

How would you go about getting the top N (say, the top 10) for each query?

And then, if that's too easy for you--consider a further case where I 
want every row for a given "query" that has "include" TRUE, and enough 
non-"include" rows to make N. I might end up with more than N rows for a 
given value of "query" if there were more than N with "include" set.

I headed off in the direction of groups of SELECTs and UNIONs, and quit 
when I got to something like four levels of "SELECT ... AS FOO" ...

-- 
Jeff Boes                                      vox 269.226.9550 ext 24
Database Engineer                                     fax 269.349.9076
Nexcerpt, Inc.                                 http://www.nexcerpt.com           ...Nexcerpt... Extend your Expertise



Re: SQL challenge--top 10 for each key value?

From
Greg Stark
Date:
Jeff Boes <jboes@nexcerpt.com> writes:

> I headed off in the direction of groups of SELECTs and UNIONs, and quit when I
> got to something like four levels of "SELECT ... AS FOO" ...

four? wimp, that's nothing!

ok, seriously I think there's no way to do this directly with straight SQL.
You would have to define a non-immutable function that has some temporary
storage where it keeps track of how many it has seen. 

The generic function that would help here would be some kind of rank(value)
that would give you the equivalent of rownum except with a level break every
time value changes. I've been hoping to see something like this on the list
for a long time but haven't yet.

If the value of n is constant and small you could cheat with an aggregate
function with an array of the top n values.

db=> create function first3_accum(integer[],integer) returns integer[] as 'select case when array_upper($1,1) >= 3 then
$1else array_append($1,$2) end' language sql strict immutable;
 
CREATE FUNCTION
db=> create aggregate first3 (basetype = integer, sfunc = first3_accum, stype = integer[], initcond = '{}');
CREATE AGGREGATE

then something like:

SELECT first3(id) FROM (SELECT id         FROM my_table        ORDER BY query,                 CASE WHEN include THEN 1
ELSE2 END ASC,                 score DESC)GROUP BY query
 

But then you'll have to go back to the table to refetch the original records
that you've found. The best way I find to do that is with the int_array_enum()
function from the int_aggregate contrib module.

SELECT *  FROM my_table WHERE id IN (          SELECT int_array_enum(f3)            FROM (                  SELECT
first3(id)as f3                    FROM (SELECT id                            FROM my_table
ORDERBY query,                                    CASE WHEN include THEN 1 ELSE 2 END ASC,
     score DESC) as x                   GROUP BY query                ) as x          )
 


This last step is kind of annoying since you've already seen all those
records. And it requires writing a new aggregate function every time the value
of n changes though, which kind of sucks.

In theory if the new work in 7.5 handling structured datatypes is as cool as
it sounds you could have an array of complete records and when UNNEST is
eventually incorporated into the array code then you could expand those
instead of using the int_array_enum function. Neither of those things are
ready yet as far as I know though.

-- 
greg



Re: SQL challenge--top 10 for each key value?

From
Rod Taylor
Date:
On Thu, 2004-04-08 at 19:33, Greg Stark wrote:
> Jeff Boes <jboes@nexcerpt.com> writes:
> 
> > I headed off in the direction of groups of SELECTs and UNIONs, and quit when I
> > got to something like four levels of "SELECT ... AS FOO" ...
> 
> four? wimp, that's nothing!
> 
> ok, seriously I think there's no way to do this directly with straight SQL.
> You would have to define a non-immutable function that has some temporary
> storage where it keeps track of how many it has seen. 

I don't believe that is true, though it is certainly is in PostgreSQL.

The spec has the ability to apply a progressive aggregate on the results
of a query (window function). Meaning you can accomplish things like
counting (ROW_NUMBER) or running totals.

Something along the lines of the below would accomplish what you want
according to spec. ROW_NUMBER() is a spec defined function. (6.10 of
SQL200N)
       SELECT *          FROM (SELECT ROW_NUMBER() OVER (DISTINCT query) AS counter                 <rest of query>
        )         WHERE counter > 10;
 



Re: SQL challenge--top 10 for each key value?

From
Tom Lane
Date:
Rod Taylor <pg@rbt.ca> writes:
> ROW_NUMBER() is a spec defined function. (6.10 of SQL200N)

If the spec doesn't even have a year number yet, you can hardly expect
real implementations to support it ;-).  There is no such thing in the
extant specs SQL92 or SQL99.
        regards, tom lane


Re: SQL challenge--top 10 for each key value?

From
Josh Berkus
Date:
Rod,

> Something along the lines of the below would accomplish what you want
> according to spec. ROW_NUMBER() is a spec defined function. (6.10 of
> SQL200N)

Great leaping little gods!   They added something called "row number" to the 
spec? 

Boy howdy, folks were right ... the ANSI committee really has completly blown 
off the relational model completely.   First there was the addition of 
network-database functions so that IBM could make DB2 look more like a real 
database, now this ....

When a standards committee becomes hostage to a handful of vendors, kiss real 
standards goodbye.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco


Re: SQL challenge--top 10 for each key value?

From
Greg Stark
Date:
Josh Berkus <josh@agliodbs.com> writes:

> Rod,
> 
> > Something along the lines of the below would accomplish what you want
> > according to spec. ROW_NUMBER() is a spec defined function. (6.10 of
> > SQL200N)
> 
> Great leaping little gods!   They added something called "row number" to the 
> spec? 
> 
> Boy howdy, folks were right ... the ANSI committee really has completly blown 
> off the relational model completely.   

If it's like Oracle's rownum then it's the row number of the *output*, not the
position on disk. So it's not entirely blowing off the relational model any
more than ORDER BY does.

The weird thing is the number of cases where you want ORDER BY or rownum
inside subselects. Which the solution to the original question needed.

> When a standards committee becomes hostage to a handful of vendors, kiss
> real standards goodbye.

In the case of SQL was there ever any pretension otherwise? Was the SQL
standard ever really useful as a "real standard"? I can write useful ANSI C89
code that will compile and work on any C compiler. Trying to write portable
SQL92 code that does any useful work is about as productive as stapling bagels
to your forehead.

-- 
greg



Re: SQL challenge--top 10 for each key value?

From
Rod Taylor
Date:
On Fri, 2004-04-09 at 18:43, Greg Stark wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> 
> > Rod,
> > 
> > > Something along the lines of the below would accomplish what you want
> > > according to spec. ROW_NUMBER() is a spec defined function. (6.10 of
> > > SQL200N)
> > 
> > Great leaping little gods!   They added something called "row number" to the 
> > spec? 
> > 
> > Boy howdy, folks were right ... the ANSI committee really has completly blown 
> > off the relational model completely.   
> 
> If it's like Oracle's rownum then it's the row number of the *output*, not the
> position on disk. So it's not entirely blowing off the relational model any
> more than ORDER BY does.
> 
> The weird thing is the number of cases where you want ORDER BY or rownum
> inside subselects. Which the solution to the original question needed.

It's not really like Oracles row num at all, though I suppose you can
emulate rownum using it. The intention is that you will use it for
"aggregates" like running totals, moving averages, counting, etc.

http://www.devx.com/getHelpOn/10MinuteSolution/16573/1954?pf=true




Re: SQL challenge--top 10 for each key value?

From
"Greg Sabino Mullane"
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
> How would you go about getting the top N (say, the top 10) for each query?
Assume you have a table "ch" and three sequences 'aa', 'bb', and 'cc'.
(Only 'aa' and 'bb' need to be initially set)
SELECT setval('aa',1,'f');
SELECT setval('bb',1,'f');
SELECT nextval('cc') AS rating,q2 AS query, s2 AS score FROM
(SELECT 0 AS q1, 0 AS s1, NULL AS cs, nextval('aa') AS v1UNION ALL(SELECT *, nextval('aa') AS v1 FROM (SELECT query AS
q1,MAX(score) AS s1, checksum AS cs FROM ch GROUP BY 1,3 ORDER BY 1 ASC, 2 DESC) AS foo)
 
) AS uno,
((SELECT *, nextval('bb') AS v2 FROM (SELECT query AS q2, MAX(score) AS s2, checksum AS cs FROM ch GROUP BY 1,3 ORDER
BY1 ASC, 2 DESC) AS foo)UNION ALLSELECT NULL AS q2, 0 AS s2, NULL AS cs, nextval('bb') AS v2
 
) AS dos
WHERE v1 = v2 AND q2 IS NOT NULL
AND ((CASE WHEN q1 != q2 THEN setval('cc',1,'f') ELSE 0 END > 0)OR(CASE WHEN currval('cc')<10 THEN 1 ELSE 0 END >0)
);
- --
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200404101029
-----BEGIN PGP SIGNATURE-----
iD8DBQFAeAZ1vJuQZxSWSsgRAqYuAJ9HaYLotPYkyi1U76I9xnvi8AhLTQCfUyJq
+iVdbz5U7HKep89z0kp49U0=
=6+OH
-----END PGP SIGNATURE-----




Re: SQL challenge--top 10 for each key value?

From
Josh Berkus
Date:
Rod, Greg

> It's not really like Oracles row num at all, though I suppose you can
> emulate rownum using it. The intention is that you will use it for
> "aggregates" like running totals, moving averages, counting, etc.

Yes, that makes a certain amount of sense.  I just take exception to the name 
"Row Number" becuase it confuses newbies about the actual nature of the data 
being returned and gets them back in the bad space of believing in fixed data 
ordering -- something which RDBMSes are supposed to avoid.

I'm also disgusted at the vendor partiality that this shows.  PostgreSQL and 
MySQL both have working implementations of a very similar concept using the 
non-confusing term "Limit".   However, since Oracle is on the committee, they 
had to use a more "Oracle-friendly" term, I guess.

> In the case of SQL was there ever any pretension otherwise? Was the SQL
> standard ever really useful as a "real standard"? I can write useful ANSI
> C89 code that will compile and work on any C compiler. Trying to write
> portable SQL92 code that does any useful work is about as productive as
> stapling bagels to your forehead.

Well, there *was* a pretense otherwise, which ended about 1994, just as we 
were getting SQL on this project.  Now the big vendors -- mostly IBM and 
Oracle since Informix and Sybase are dying -- run everything and adapt the 
standard to what features their products already have.

So, yes, SQL92 needed development and expansion.   But we didn't need SQL99.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco


Re: SQL challenge--top 10 for each key value?

From
Troels Arvin
Date:
On Fri, 09 Apr 2004 02:11:44 -0400, Tom Lane wrote:

>> ROW_NUMBER() is a spec defined function. (6.10 of SQL200N)
> 
> If the spec doesn't even have a year number yet, you can hardly expect
> real implementations to support it ;-)

SQL:2003 is finished. Among its new (non-core) OLAP features are a set of
"windows functions" (spec section 6.10), which include

feature ID T611 (elementary OLAP):- ROW_NUMBER() OVER (...)- RANK() OVER (...)- DENSE_RANK() OVER (...)

feature ID T612 (extended OLAP):- PERCENT_RANK() OVER (...)- CUME_DIST() OVER (...)

See http://www.acm.org/sigmod/record/issues/0403/index.html#standards for
an article which summarizes the news in SQL:2003.

ROW_NUMBER() OVER may be used in queries where a PostgreSQL user which use
LIMIT.

RANK() OVER may be used in queries where a PostgreSQL user would have to
come up with a somewhat strange query in order to get acceptable
performance, see http://troels.arvin.dk/db/rdbms/#select-top-n-postgresql

-- 
Greetings from Troels Arvin, Copenhagen, Denmark




Re: SQL challenge--top 10 for each key value?

From
Troels Arvin
Date:
On Sun, 11 Apr 2004 00:13:17 +0200, I wrote:

> Among its new (non-core) OLAP features are a set of
> "windows functions"

Sorry - I meant "window functions"... (Microsoft don't seem to have had
much influence in SQL:2003's OLAP-specifications; IBM seems to be the big
influencer in those parts of the standard.)

-- 
Greetings from Troels Arvin, Copenhagen, Denmark




Re: SQL challenge--top 10 for each key value?

From
elein
Date:
Welcome to the real world, Josh.  There are people who
have full time salaried positions soley to attend 
standards meetings.

Note that ROW_NUMBER() really is handy, regardless of the
silly name.  And there was a little python function of mine 
that did it fairly simply, except that you needed to
initialize the counter in the connection before use.

create or replace function pycounter(integer)
returns integer as
'  if args[0] == 0:     SD["nextno"] = 1     return SD["nextno"]  try:     SD["nextno"] += 1  except:     SD["nextno"]
=1  return SD["nextno"]
 
' language 'plpythonu';

And clearly it can be done faster as a little
C function.

elein


On Fri, Apr 09, 2004 at 09:06:39AM -0700, Josh Berkus wrote:
> Rod,
> 
> > Something along the lines of the below would accomplish what you want
> > according to spec. ROW_NUMBER() is a spec defined function. (6.10 of
> > SQL200N)
> 
> Great leaping little gods!   They added something called "row number" to the 
> spec? 
> 
> Boy howdy, folks were right ... the ANSI committee really has completly blown 
> off the relational model completely.   First there was the addition of 
> network-database functions so that IBM could make DB2 look more like a real 
> database, now this ....
> 
> When a standards committee becomes hostage to a handful of vendors, kiss real 
> standards goodbye.
> 
> -- 
> Josh Berkus
> Aglio Database Solutions
> San Francisco
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend


Re: SQL challenge--top 10 for each key value?

From
elein
Date:
This solution will be in Monday's edition of
PostgreSQL General Bits (http://www.varlena.com/GeneralBits).
(In other words, if it doesn't do what you mean, let me know now!)


CREATE TYPE topscores AS  (id integer, query integer, checksum char(32), score integer);

CREATE OR REPLACE FUNCTION topscores(integer) RETURNS SETOF topscores AS
'
DECLARE  t topscores%ROWTYPE;  r RECORD;  q RECORD;  n alias for $1;
BEGIN  FOR q IN SELECT distinct query from table70 order by query LOOP     FOR t IN SELECT id , query, checksum, score
     FROM table70        where query = q.query        ORDER BY query, score DESC LIMIT n LOOP           RETURN NEXT t;
  END LOOP;  END LOOP;  RETURN;
 
END;
' language 'plpgsql';

select * from topscores(1) ;
select * from topscores(2) ;
select * from topscores(3) ;


On Thu, Apr 08, 2004 at 07:55:33PM +0000, Jeff Boes wrote:
> Offered up for anyone with time on their hands. I fiddled around with 
> this for half an afternoon, then gave up and did it programmatically in 
> Perl.
> 
> Given a table that looks something like this:
> 
> id       | INTEGER
> query    | INTEGER
> checksum | char(32)
> score    | INTEGER
> include  | BOOLEAN
> 
> 
> The table is unique by "id". "Checksum" may be repeated, but I only care 
> if it is repeated within a given group by "query". ("query" is non-null.)
> 
> I can get the top scorer for each "query" row by something like this:
> 
> SELECT * FROM (
>   SELECT DISTINCT ON (checksum) *
>   FROM my_table
>   ORDER BY checksum, score DESC)
> ORDER BY query;
> 
> How would you go about getting the top N (say, the top 10) for each query?
> 
> And then, if that's too easy for you--consider a further case where I 
> want every row for a given "query" that has "include" TRUE, and enough 
> non-"include" rows to make N. I might end up with more than N rows for a 
> given value of "query" if there were more than N with "include" set.
> 
> I headed off in the direction of groups of SELECTs and UNIONs, and quit 
> when I got to something like four levels of "SELECT ... AS FOO" ...
> 
> -- 
> Jeff Boes                                      vox 269.226.9550 ext 24
> Database Engineer                                     fax 269.349.9076
> Nexcerpt, Inc.                                 http://www.nexcerpt.com
>            ...Nexcerpt... Extend your Expertise
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org


Re: SQL challenge--top 10 for each key value?

From
Greg Stark
Date:
elein <elein@varlena.com> writes:

> create or replace function pycounter(integer)
> returns integer as
> '
>    if args[0] == 0:
>       SD["nextno"] = 1
>       return SD["nextno"]
>    try:
>       SD["nextno"] += 1
>    except:
>       SD["nextno"] = 1
>    return SD["nextno"]
> ' language 'plpythonu';
> 
> And clearly it can be done faster as a little
> C function.

Does this approach have a hope of working if it's used twice in the same
query?


-- 
greg



Re: SQL challenge--top 10 for each key value?

From
elein
Date:
No, it will not work twice in the same query as is.

If you want to code two counter buckets and pass in
some way to distinguish between the two yada yada yada
it is possible.  It is also possible to code this to
do multi-level counting/breaks/calculations, etc.

But the SD dictionary is by connection. So any values
stored in it need to be initialized at the appropriate
time *outside* of the first use.

elein

On Sun, Apr 11, 2004 at 12:38:20AM -0400, Greg Stark wrote:
> 
> elein <elein@varlena.com> writes:
> 
> > create or replace function pycounter(integer)
> > returns integer as
> > '
> >    if args[0] == 0:
> >       SD["nextno"] = 1
> >       return SD["nextno"]
> >    try:
> >       SD["nextno"] += 1
> >    except:
> >       SD["nextno"] = 1
> >    return SD["nextno"]
> > ' language 'plpythonu';
> > 
> > And clearly it can be done faster as a little
> > C function.
> 
> Does this approach have a hope of working if it's used twice in the same
> query?
> 
> 
> -- 
> greg
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 9: the planner will ignore your desire to choose an index scan if your
>       joining column's datatypes do not match


Re: SQL challenge--top 10 for each key value?

From
Jeff Boes
Date:
Troels Arvin wrote:
> 
> See http://www.acm.org/sigmod/record/issues/0403/index.html#standards for
> an article which summarizes the news in SQL:2003.
> 

This is a very useful page; thank you for creating it and for noting it 
in this thread!

-- 
(Posted from an account used as a SPAM dump. If you really want to get
in touch with me, dump the 'jboes' and substitute 'mur'.)
________
Jeffery Boes <>< jboes@qtm.net