Thread: funny view/temp table problem with query

funny view/temp table problem with query

From
Grzegorz Jaśkiewicz
Date:
So I have a 'accounts' table, with id and name, and than some
hypothetical 'packages' table, containing some info per customer.

I need to retrive distinct pairs , of random packages assigned per customer.
Packages table contains 10 packages, id:=[1:10], there's 1M customers
for testing purposes.

I could name the tables foo/bar again, but decided for something more
creative this time ;)

Anyways, I have this query:


 select count(distinct (v,id)) from (
  select heh.id, v[i] from
   (
     SELECT ss.id, ARRAY
      (
        SELECT id FROM packages where ss.id>0 and id between 2 and 6
ORDER BY random() limit 5
      ) as v FROM
         (
          SELECT id FROM accounts ORDER BY random() limit 100000
         ) ss
   ) heh,generate_series(1, 5 ) i order by heh.id,v
) ziew;


So in theory, that should return me random array of packages, per
account. Since id's in both tables are primary keys, I expect the pair
of accountId/packageId to be unique as well.

The query above doesn't deliver, so I tried to divide it up:



create view hehview as SELECT ss.id, ARRAY
      (
        SELECT id FROM packages where ss.id>0 and id between 2 and 6
ORDER BY random() limit 5
      ) as v FROM
         (
          SELECT id FROM accounts ORDER BY random() limit 100000
         ) ss

select count( distinct (id, v[i])) from hehview, generate_series(1, 5) i;


That doesn't work either, because postgresql 'merges' view into query
(which is a good way to chop large queries btw, and still keep them up
to speed).

But if I store intermediate result in temporary table, all values are
nicely unique - as I want them.
Now, that's the solution I will use. But for sake of my conciousness,
I want to know what has failed here.
Btw, the count(distinct(x,y)) works that way only on 8.4, but I tested
it on 8.3, and I get same results.


with temp table:


create temp table hehtable as select * from hehview;

select count( distinct (id, v[i])) from hehtable, generate_series(1, 5) i;


Thanks folks.
--
GJ

Re: funny view/temp table problem with query

From
Grzegorz Jaśkiewicz
Date:
all explains:



Query without view:

                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=94419553.37..94419553.38 rows=1 width=16)
   ->  Sort  (cost=94269553.37..94294553.37 rows=10000000 width=12)
         Sort Key: ss.id, (((subplan))[i.i])
         ->  Nested Loop  (cost=93414.56..92953067.54 rows=10000000 width=12)
               ->  Function Scan on generate_series i
(cost=0.00..12.50 rows=1000 width=4)
               ->  Materialize  (cost=93414.56..93514.56 rows=10000 width=8)
                     ->  Subquery Scan ss  (cost=93279.56..93404.56
rows=10000 width=8)
                           ->  Limit  (cost=93279.56..93304.56
rows=10000 width=8)
                                 ->  Sort  (cost=93279.56..95779.56
rows=1000000 width=8)
                                       Sort Key: (random())
                                       ->  Seq Scan on accounts
(cost=0.00..21841.00 rows=1000000 width=8)
               SubPlan
                 ->  Limit  (cost=9.25..9.27 rows=5 width=8)
                       ->  Sort  (cost=9.25..9.50 rows=100 width=8)
                             Sort Key: (random())
                             ->  Result  (cost=0.00..7.59 rows=100 width=8)
                                   One-Time Filter: ($0 > (-1))
                                   ->  Seq Scan on packages
(cost=0.00..7.34 rows=100 width=8)
                                         Filter: ((id >= 1) AND (id <= 100))
(19 rows)



With view used (notice, it is a bit different plan!)

                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2361251.70..2361260.98 rows=1 width=12)
   ->  Nested Loop  (cost=111239.20..2111251.70 rows=100000000 width=12)
         ->  Function Scan on generate_series i  (cost=0.00..12.50
rows=1000 width=4)
         ->  Materialize  (cost=111239.20..112239.20 rows=100000 width=8)
               ->  Subquery Scan ss  (cost=109889.20..111139.20
rows=100000 width=8)
                     ->  Limit  (cost=109889.20..110139.20 rows=100000 width=8)
                           ->  Sort  (cost=109889.20..112389.20
rows=1000000 width=8)
                                 Sort Key: (random())
                                 ->  Seq Scan on accounts
(cost=0.00..21841.00 rows=1000000 width=8)
   SubPlan
     ->  Limit  (cost=9.25..9.27 rows=5 width=8)
           ->  Sort  (cost=9.25..9.50 rows=100 width=8)
                 Sort Key: (random())
                 ->  Result  (cost=0.00..7.59 rows=100 width=8)
                       One-Time Filter: ($0 > (-1))
                       ->  Seq Scan on packages  (cost=0.00..7.34
rows=100 width=8)
                             Filter: ((id >= 1) AND (id <= 100))
(17 rows)



Create temp table based on view:

explain create temp table fooheh as select * from heh;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Subquery Scan ss  (cost=109889.20..1037735.61 rows=100000 width=8)
   ->  Limit  (cost=109889.20..110139.20 rows=100000 width=8)
         ->  Sort  (cost=109889.20..112389.20 rows=1000000 width=8)
               Sort Key: (random())
               ->  Seq Scan on accounts  (cost=0.00..21841.00
rows=1000000 width=8)
   SubPlan
     ->  Limit  (cost=9.25..9.27 rows=5 width=8)
           ->  Sort  (cost=9.25..9.50 rows=100 width=8)
                 Sort Key: (random())
                 ->  Result  (cost=0.00..7.59 rows=100 width=8)
                       One-Time Filter: ($0 > (-1))
                       ->  Seq Scan on packages  (cost=0.00..7.34
rows=100 width=8)
                             Filter: ((id >= 1) AND (id <= 100))
(13 rows)


and run simple query against temp table:

explain select count( distinct (id, v[i])) from fooheh, generate_series(1, 5) i;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Aggregate  (cost=3226173.36..3226173.37 rows=1 width=44)
   ->  Nested Loop  (cost=2810.86..2868023.36 rows=143260000 width=44)
         ->  Function Scan on generate_series i  (cost=0.00..12.50
rows=1000 width=4)
         ->  Materialize  (cost=2810.86..4243.46 rows=143260 width=40)
               ->  Seq Scan on fooheh  (cost=0.00..2667.60 rows=143260 width=40)
(5 rows)

Re: funny view/temp table problem with query

From
Alban Hertroys
Date:
On Feb 25, 2009, at 11:02 AM, Grzegorz Jaśkiewicz wrote:

> So I have a 'accounts' table, with id and name, and than some
> hypothetical 'packages' table, containing some info per customer.
>
> I need to retrive distinct pairs , of random packages assigned per
> customer.
> Packages table contains 10 packages, id:=[1:10], there's 1M customers
> for testing purposes.
>
> I could name the tables foo/bar again, but decided for something more
> creative this time ;)
>
> Anyways, I have this query:
>
>
> select count(distinct (v,id)) from (
>  select heh.id, v[i] from
>   (
>     SELECT ss.id, ARRAY
>      (
>        SELECT id FROM packages where ss.id>0 and id between 2 and 6
> ORDER BY random() limit 5
>      ) as v FROM
>         (
>          SELECT id FROM accounts ORDER BY random() limit 100000
>         ) ss
>   ) heh,generate_series(1, 5 ) i order by heh.id,v
> ) ziew;


An alternative solution is to NOT order by random and not to limit,
but to use a scrollable cursor. Having to order your entire result set
by random is a fairly expensive operation and you only want 5 random
rows anyway, not 100000, so it is an inefficient approach as well: In
a good solution you should be calculating random() 5 times, not 100000.

Normal cursors just pick the next row from the result set as you
request them. Scrollable ones allow you to pick specific rows from
that result set. As soon as you know how many rows you have, picking 5
random ones isn't that hard. The idea is to calculate 5 random row
numbers from your result set and retrieve only those rows.

To do this you'll first need to know how many rows there are. That can
be determined by scrolling to the last row and reading the instruction
result (not the record itself) of that instruction; it contains a row
number (mind the one-off difference with a row count).

That row number you can feed to the random() function so it returns
numbers from 1..(lastRow+1). Scroll to that row and read the result,
repeat as often as you like (5 times in your case).

You may have realised that there is a chance to get duplicates here if
you happen to calculate the same random row number more than once.
That's not very hard to fix of course, you only need to keep track of
which row numbers you already used and recalculate the random number
if it's already in your set.

You can put the code to do this in your application (if your
connection interface allows for scrollable cursors), or since pg8.3
you can create a stored procedure to do this. I believe before 8.3
scrollable cursors weren't usable in pl/pgsql. Then again, maybe other
pl-languages are more suitable for a general solution... Back when I
had this problem I was using PHP and pg8.1, putting the code in a
function in my application worked fine, but it felt like it didn't
belong there.

The general opinion seems to be that picking random rows isn't a
relational operation, and for that reason a relational database isn't
particularly good at that. I think my approach works as well as it
does because it's a procedural approach to a procedural problem.

If you'd like to see some code, I have posted about this in the past
and that contained some code examples. Just search the archives.

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.


!DSPAM:737,49a664fc129742059914308!



Re: funny view/temp table problem with query

From
Grzegorz Jaśkiewicz
Date:
looks like you completely misunderstood my question.
First of all, I wonder why the same query divided up in half - and
using temporary table works as expected, and with everything together
doesn't. And about rand(), it was tested on large enough set of runs,
that I don't think it is to blame.
The queries do everything I wanted it to do, and - no - doing it in
software is just baaad, and doesn't do.

Re: funny view/temp table problem with query

From
Alban Hertroys
Date:
On Feb 26, 2009, at 11:02 AM, Grzegorz Jaśkiewicz wrote:

> looks like you completely misunderstood my question.

I'm not surprised. What do you expect with random capitalisation,
random table alias names and random indentation combined with queries
getting wrapped by the mailing-list software? With some proper
formatting and meaningful alias-names some people might actually
understand what you're trying to get at. You're not exactly helping
here.
You're the one who's asking a question, it's your responsibility that
we can understand your problem.

With respect to your "original" naming scheme... indeed, foo, bar and
baz aren't the most elaborate names for tables or aliases, but at
least we are used to them. More meaningful names are still preferred
of course. Those meta-table-names are better reserved for theoretical
situations where no meaningful names are available. I'm pretty sure in
your case more meaningful names are easy to come up with, so please do.

> First of all, I wonder why the same query divided up in half - and
> using temporary table works as expected, and with everything together
> doesn't. And about rand(), it was tested on large enough set of runs,
> that I don't think it is to blame.

Well, as hard as I try reading that SQL, I lose track somewhere
halfway due to the above issues. I don't feel like rewriting your
queries to make them readable (I have no obligation to do that, after
all), and even then I'm not sure what you're trying to show with them.
They do look overly complicated, but without knowing their purpose it
is kind of hard to see what you're trying to tell.

> The queries do everything I wanted it to do, and - no - doing it in
> software is just baaad, and doesn't do.


I figured you were complaining about the performance, hence I gave you
a better performing solution. Apparently that wasn't what your
question was about, but it's still good advice IMO.

Your comment about the solution I gave you borders on insulting. The
method I showed you isn't any worse than your solution using temp
tables, as both solutions move logic to the application. It's hardly
any code in the application in either case, I wonder why you'd be so
set against using a cursor that you'd prefer a much more inefficient
solution that uses about as much application-side code as what I
proposed.
Besides, I showed that it's possible to put the logic in the database,
but apparently you didn't bother to read that far.

(What argument are you trying to make there anyway? X is bad and just
doesn't do... How is that an argument? - That's a rhetorical question,
it isn't).

Goodness, look at all the time I wasted trying to get a proper
question out of you...

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.


!DSPAM:737,49a7359f129748797120425!



Re: funny view/temp table problem with query

From
"Joshua D. Drake"
Date:
On Fri, 2009-02-27 at 01:36 +0100, Alban Hertroys wrote:
> On Feb 26, 2009, at 11:02 AM, Grzegorz Jaśkiewicz wrote:

> You're the one who's asking a question, it's your responsibility that
> we can understand your problem.

Woah... ease up cowboy.

Joshua D. Drake


--
PostgreSQL - XMPP: jdrake@jabber.postgresql.org
   Consulting, Development, Support, Training
   503-667-4564 - http://www.commandprompt.com/
   The PostgreSQL Company, serving since 1997


Re: funny view/temp table problem with query

From
Scott Marlowe
Date:
On Thu, Feb 26, 2009 at 3:02 AM, Grzegorz Jaśkiewicz <gryzman@gmail.com> wrote:

> First of all, I wonder why the same query divided up in half - and
> using temporary table works as expected, and with everything together

I'm betting it's your use of generate_series().  You can get some
weird side effects because it sometimes gets run multiple times not
just the once you expect.  I'm guessing that's what's biting you.

Re: funny view/temp table problem with query

From
Grzegorz Jaśkiewicz
Date:
2009/2/27 Scott Marlowe <scott.marlowe@gmail.com>:
> On Thu, Feb 26, 2009 at 3:02 AM, Grzegorz Jaśkiewicz <gryzman@gmail.com> wrote:
>
>> First of all, I wonder why the same query divided up in half - and
>> using temporary table works as expected, and with everything together
>
> I'm betting it's your use of generate_series().  You can get some
> weird side effects because it sometimes gets run multiple times not
> just the once you expect.  I'm guessing that's what's biting you.

in which case, wouldn't that be a postgresql's bug ?




--
GJ

Re: funny view/temp table problem with query

From
Scott Marlowe
Date:
On Fri, Feb 27, 2009 at 3:10 AM, Grzegorz Jaśkiewicz <gryzman@gmail.com> wrote:
> 2009/2/27 Scott Marlowe <scott.marlowe@gmail.com>:
>> On Thu, Feb 26, 2009 at 3:02 AM, Grzegorz Jaśkiewicz <gryzman@gmail.com> wrote:
>>
>>> First of all, I wonder why the same query divided up in half - and
>>> using temporary table works as expected, and with everything together
>>
>> I'm betting it's your use of generate_series().  You can get some
>> weird side effects because it sometimes gets run multiple times not
>> just the once you expect.  I'm guessing that's what's biting you.
>
> in which case, wouldn't that be a postgresql's bug ?

Nope.

Re: funny view/temp table problem with query

From
Grzegorz Jaśkiewicz
Date:
2009/2/27 Scott Marlowe <scott.marlowe@gmail.com>:
> Nope.

as far as I can understand it, if I do the same thing in two steps,
and in one step. And the latter is broken, because of some internal
process/optimization/whatever - that's a bug to me.

Unless I am expecting it to work, and it was just pure luck that it
worked in two steps..



--
GJ

Re: funny view/temp table problem with query

From
Scott Marlowe
Date:
On Fri, Feb 27, 2009 at 3:16 AM, Grzegorz Jaśkiewicz <gryzman@gmail.com> wrote:
> 2009/2/27 Scott Marlowe <scott.marlowe@gmail.com>:
>> Nope.
>
> as far as I can understand it, if I do the same thing in two steps,
> and in one step. And the latter is broken, because of some internal
> process/optimization/whatever - that's a bug to me.

You're assuming that creating a temp data set in a table and join
semantics are the same.  they are not.

Re: funny view/temp table problem with query

From
Gregory Stark
Date:
Alban Hertroys <dalroi@solfertje.student.utwente.nl> writes:

> On Feb 26, 2009, at 11:02 AM, Grzegorz Jaśkiewicz wrote:
>
>> looks like you completely misunderstood my question.
>
> I'm not surprised. What do you expect with random capitalisation,  random table
> alias names and random indentation combined with queries  getting wrapped by
> the mailing-list software?

Uh, we get a lot of really mangled SQL and explain plans -- I don't see
anything wrong with these. If the question was unclear it sounds like it's
just because it's a fairly subtle problem and was hard to describe. Needing
two cracks at describing the problem is pretty much par for the course here.

I haven't tested the query to see what's going on but if the problem is due to
random() then in 8.4 you could use WITH to guarantee that the subquery is
executed precisely once and the results reused as-is subsequently.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's PostGIS support!

Re: funny view/temp table problem with query

From
Grzegorz Jaśkiewicz
Date:
On Fri, Feb 27, 2009 at 11:57 AM, Gregory Stark <stark@enterprisedb.com> wrote:

> Uh, we get a lot of really mangled SQL and explain plans -- I don't see
> anything wrong with these. If the question was unclear it sounds like it's
> just because it's a fairly subtle problem and was hard to describe. Needing
> two cracks at describing the problem is pretty much par for the course here.
>
> I haven't tested the query to see what's going on but if the problem is due to
> random() then in 8.4 you could use WITH to guarantee that the subquery is
> executed precisely once and the results reused as-is subsequently.

In that case, actually I need to get a random array of packages that's
different as much as possible for every account-id.
I'll try to create simpler example, with some script that would
generate data for you.



--
GJ