Thread: funny view/temp table problem with query
So I have an 'accounts' table, with id and name, and then some hypothetical 'packages' table, containing some info per customer. I need to retrieve distinct pairs of random packages assigned per customer. The packages table contains 10 packages, id := [1:10], and there are 1M customers for testing purposes. I could name the tables foo/bar again, but decided on something more creative this time ;)

Anyway, I have this query:

select count(distinct (v, id)) from (
    select heh.id, v[i] from
    (
        SELECT ss.id, ARRAY
        (
            SELECT id FROM packages WHERE ss.id > 0 AND id BETWEEN 2 AND 6
            ORDER BY random() LIMIT 5
        ) AS v FROM
        (
            SELECT id FROM accounts ORDER BY random() LIMIT 100000
        ) ss
    ) heh, generate_series(1, 5) i ORDER BY heh.id, v
) ziew;

So in theory, that should return a random array of packages per account. Since the ids in both tables are primary keys, I expect each accountId/packageId pair to be unique as well. The query above doesn't deliver, so I tried to divide it up:

create view hehview as
SELECT ss.id, ARRAY
(
    SELECT id FROM packages WHERE ss.id > 0 AND id BETWEEN 2 AND 6
    ORDER BY random() LIMIT 5
) AS v FROM
(
    SELECT id FROM accounts ORDER BY random() LIMIT 100000
) ss;

select count(distinct (id, v[i])) from hehview, generate_series(1, 5) i;

That doesn't work either, because PostgreSQL 'merges' the view into the query (which is otherwise a good way to chop up large queries, btw, and still keep them up to speed). But if I store the intermediate result in a temporary table, all values are nicely unique - just as I want them. Now, that's the solution I will use, but for the sake of my conscience I want to know what has failed here. Btw, count(distinct (x, y)) works that way only on 8.4, but I tested this on 8.3 as well, and I get the same results.

With the temp table:

create temp table hehtable as select * from hehview;

select count(distinct (id, v[i])) from hehtable, generate_series(1, 5) i;

Thanks folks.

-- GJ
all explains:

Query without view:

                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=94419553.37..94419553.38 rows=1 width=16)
   ->  Sort  (cost=94269553.37..94294553.37 rows=10000000 width=12)
         Sort Key: ss.id, (((subplan))[i.i])
         ->  Nested Loop  (cost=93414.56..92953067.54 rows=10000000 width=12)
               ->  Function Scan on generate_series i  (cost=0.00..12.50 rows=1000 width=4)
               ->  Materialize  (cost=93414.56..93514.56 rows=10000 width=8)
                     ->  Subquery Scan ss  (cost=93279.56..93404.56 rows=10000 width=8)
                           ->  Limit  (cost=93279.56..93304.56 rows=10000 width=8)
                                 ->  Sort  (cost=93279.56..95779.56 rows=1000000 width=8)
                                       Sort Key: (random())
                                       ->  Seq Scan on accounts  (cost=0.00..21841.00 rows=1000000 width=8)
         SubPlan
           ->  Limit  (cost=9.25..9.27 rows=5 width=8)
                 ->  Sort  (cost=9.25..9.50 rows=100 width=8)
                       Sort Key: (random())
                       ->  Result  (cost=0.00..7.59 rows=100 width=8)
                             One-Time Filter: ($0 > (-1))
                             ->  Seq Scan on packages  (cost=0.00..7.34 rows=100 width=8)
                                   Filter: ((id >= 1) AND (id <= 100))
(19 rows)

With the view used (notice, it is a slightly different plan!):
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2361251.70..2361260.98 rows=1 width=12)
   ->  Nested Loop  (cost=111239.20..2111251.70 rows=100000000 width=12)
         ->  Function Scan on generate_series i  (cost=0.00..12.50 rows=1000 width=4)
         ->  Materialize  (cost=111239.20..112239.20 rows=100000 width=8)
               ->  Subquery Scan ss  (cost=109889.20..111139.20 rows=100000 width=8)
                     ->  Limit  (cost=109889.20..110139.20 rows=100000 width=8)
                           ->  Sort  (cost=109889.20..112389.20 rows=1000000 width=8)
                                 Sort Key: (random())
                                 ->  Seq Scan on accounts  (cost=0.00..21841.00 rows=1000000 width=8)
   SubPlan
     ->  Limit  (cost=9.25..9.27 rows=5 width=8)
           ->  Sort  (cost=9.25..9.50 rows=100 width=8)
                 Sort Key: (random())
                 ->  Result  (cost=0.00..7.59 rows=100 width=8)
                       One-Time Filter: ($0 > (-1))
                       ->  Seq Scan on packages  (cost=0.00..7.34 rows=100 width=8)
                             Filter: ((id >= 1) AND (id <= 100))
(17 rows)

Create temp table based on the view:

explain create temp table fooheh as select * from heh;

                                     QUERY PLAN
------------------------------------------------------------------------------------
 Subquery Scan ss  (cost=109889.20..1037735.61 rows=100000 width=8)
   ->  Limit  (cost=109889.20..110139.20 rows=100000 width=8)
         ->  Sort  (cost=109889.20..112389.20 rows=1000000 width=8)
               Sort Key: (random())
               ->  Seq Scan on accounts  (cost=0.00..21841.00 rows=1000000 width=8)
   SubPlan
     ->  Limit  (cost=9.25..9.27 rows=5 width=8)
           ->  Sort  (cost=9.25..9.50 rows=100 width=8)
                 Sort Key: (random())
                 ->  Result  (cost=0.00..7.59 rows=100 width=8)
                       One-Time Filter: ($0 > (-1))
                       ->  Seq Scan on packages  (cost=0.00..7.34 rows=100 width=8)
                             Filter: ((id >= 1) AND (id <= 100))
(13 rows)

and run a simple query against the temp table:

explain select count(distinct (id, v[i])) from fooheh, generate_series(1, 5) i;

                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Aggregate  (cost=3226173.36..3226173.37 rows=1 width=44)
   ->  Nested Loop  (cost=2810.86..2868023.36 rows=143260000 width=44)
         ->  Function Scan on generate_series i  (cost=0.00..12.50 rows=1000 width=4)
         ->  Materialize  (cost=2810.86..4243.46 rows=143260 width=40)
               ->  Seq Scan on fooheh  (cost=0.00..2667.60 rows=143260 width=40)
(5 rows)
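The plans above show the ARRAY(...) SubPlan sitting above the join against generate_series, where it can be re-evaluated per joined row. One workaround often suggested for this kind of subquery/view flattening (a hedged sketch only — not tested in this thread) is an OFFSET 0 optimization fence around the subquery:

```sql
SELECT count(DISTINCT (heh.id, v[i]))
FROM (
    SELECT ss.id,
           ARRAY(SELECT p.id
                 FROM packages p
                 WHERE ss.id > 0 AND p.id BETWEEN 2 AND 6
                 ORDER BY random()
                 LIMIT 5) AS v
    FROM (SELECT id FROM accounts ORDER BY random() LIMIT 100000) ss
    OFFSET 0  -- optimization fence: the planner keeps the subquery as-is,
              -- so the ARRAY(...) is evaluated once per account row
) heh,
generate_series(1, 5) i;
```

If this works, it would make the one-statement form behave like the temp-table version without the extra table.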
On Feb 25, 2009, at 11:02 AM, Grzegorz Jaśkiewicz wrote:

> So I have a 'accounts' table, with id and name, and than some
> hypothetical 'packages' table, containing some info per customer.
>
> I need to retrive distinct pairs, of random packages assigned per
> customer. Packages table contains 10 packages, id:=[1:10], there's
> 1M customers for testing purposes.
>
> Anyways, I have this query:
>
> select count(distinct (v,id)) from (
>   select heh.id, v[i] from
>   (
>     SELECT ss.id, ARRAY
>     (
>       SELECT id FROM packages where ss.id>0 and id between 2 and 6
>       ORDER BY random() limit 5
>     ) as v FROM
>     (
>       SELECT id FROM accounts ORDER BY random() limit 100000
>     ) ss
>   ) heh, generate_series(1, 5) i order by heh.id, v
> ) ziew;

An alternative solution is to NOT order by random and not to limit, but to use a scrollable cursor. Ordering your entire result set by random() is a fairly expensive operation, and you only want 5 random rows anyway, not 100000, so it is an inefficient approach as well: in a good solution you should be calculating random() 5 times, not 100000 times.

Normal cursors just pick the next row from the result set as you request them. Scrollable ones allow you to pick specific rows from that result set. As soon as you know how many rows you have, picking 5 random ones isn't that hard.

The idea is to calculate 5 random row numbers from your result set and retrieve only those rows. To do this you'll first need to know how many rows there are. That can be determined by scrolling to the last row and reading the command result (not the record itself); it contains a row number (mind the off-by-one difference with a row count). That row number you can feed to the random() function so it returns numbers from 1..(lastRow+1). Scroll to that row and read the result; repeat as often as you like (5 times in your case).
You may have realised that there is a chance of duplicates here if you happen to calculate the same random row number more than once. That's not very hard to fix, of course: keep track of which row numbers you already used and recalculate the random number if it's already in your set.

You can put the code to do this in your application (if your connection interface allows for scrollable cursors), or since pg 8.3 you can create a stored procedure to do it. I believe before 8.3 scrollable cursors weren't usable in pl/pgsql. Then again, maybe other pl-languages are more suitable for a general solution...

Back when I had this problem I was using PHP and pg 8.1; putting the code in a function in my application worked fine, but it felt like it didn't belong there.

The general opinion seems to be that picking random rows isn't a relational operation, and for that reason a relational database isn't particularly good at it. I think my approach works as well as it does because it's a procedural approach to a procedural problem.

If you'd like to see some code, I have posted about this in the past and those posts contained some code examples. Just search the archives.

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.
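At the SQL level, the scrollable-cursor dance described above looks roughly like this (a sketch; the row number 42 stands in for one generated client-side with random(), and the duplicate check also lives client-side):

```sql
BEGIN;

-- No ORDER BY random(), no LIMIT: just a scrollable cursor over the result set.
DECLARE c SCROLL CURSOR FOR SELECT id FROM accounts;

-- Scroll past the last row; the command tag (e.g. "MOVE 1000000")
-- tells the client how many rows there are.
MOVE FORWARD ALL IN c;

-- For each of the 5 picks: compute a random row number in the client,
-- skip it if already used, then fetch exactly that row.
FETCH ABSOLUTE 42 IN c;

CLOSE c;
COMMIT;
```

This way random() is effectively computed 5 times instead of once per candidate row.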
Looks like you completely misunderstood my question.

First of all, I wonder why the same query works as expected when divided in two steps using a temporary table, but not with everything together. As for random(), it was tested on a large enough set of runs that I don't think it is to blame.

The queries do everything I wanted them to do, and - no - doing it in software is just baaad, and doesn't do.
On Feb 26, 2009, at 11:02 AM, Grzegorz Jaśkiewicz wrote:

> looks like you completely misunderstood my question.

I'm not surprised. What do you expect with random capitalisation, random table alias names and random indentation, combined with queries getting wrapped by the mailing-list software? With some proper formatting and meaningful alias names some people might actually understand what you're trying to get at. You're not exactly helping here. You're the one who's asking a question; it's your responsibility that we can understand your problem.

With respect to your "original" naming scheme... indeed, foo, bar and baz aren't the most elaborate names for tables or aliases, but at least we are used to them. More meaningful names are still preferred, of course. Those meta-table-names are better reserved for theoretical situations where no meaningful names are available. I'm pretty sure in your case more meaningful names are easy to come up with, so please do.

> First of all, I wonder why the same query divided up in half - and
> using temporary table works as expected, and with everything together
> doesn't. And about rand(), it was tested on large enough set of runs,
> that I don't think it is to blame.

Well, as hard as I try reading that SQL, I lose track somewhere halfway due to the above issues. I don't feel like rewriting your queries to make them readable (I have no obligation to do that, after all), and even then I'm not sure what you're trying to show with them. They do look overly complicated, but without knowing their purpose it is kind of hard to see what you're trying to tell.

> The queries do everything I wanted it to do, and - no - doing it in
> software is just baaad, and doesn't do.

I figured you were complaining about the performance, hence I gave you a better-performing solution. Apparently that wasn't what your question was about, but it's still good advice IMO. Your comment about the solution I gave you borders on insulting.
The method I showed you isn't any worse than your solution using temp tables, as both solutions move logic to the application. It's hardly any code in the application in either case; I wonder why you'd be so set against using a cursor that you'd prefer a much more inefficient solution that uses about as much application-side code as what I proposed. Besides, I showed that it's possible to put the logic in the database, but apparently you didn't bother to read that far.

(What argument are you trying to make there anyway? "X is bad and just doesn't do"... How is that an argument? That's a rhetorical question, it isn't.)

Goodness, look at all the time I wasted trying to get a proper question out of you...

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.
On Fri, 2009-02-27 at 01:36 +0100, Alban Hertroys wrote:
> On Feb 26, 2009, at 11:02 AM, Grzegorz Jaśkiewicz wrote:
> You're the one who's asking a question, it's your responsibility that
> we can understand your problem.

Woah... ease up, cowboy.

Joshua D. Drake

--
PostgreSQL - XMPP: jdrake@jabber.postgresql.org
   Consulting, Development, Support, Training
   503-667-4564 - http://www.commandprompt.com/
   The PostgreSQL Company, serving since 1997
On Thu, Feb 26, 2009 at 3:02 AM, Grzegorz Jaśkiewicz <gryzman@gmail.com> wrote:
> First of all, I wonder why the same query divided up in half - and
> using temporary table works as expected, and with everything together

I'm betting it's your use of generate_series(). You can get some weird side effects because it sometimes gets run multiple times, not just the once you expect. I'm guessing that's what's biting you.
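A quick way to see the kind of multiple evaluation being described here — volatile functions such as random() are computed per row, not once per query — is a session along these lines (an illustrative sketch, any reasonably recent server):

```sql
-- random() is volatile, so it is computed anew for every output row:
SELECT i, random() AS r
FROM generate_series(1, 3) AS i;
-- returns three rows with three *different* values of r

-- By contrast, materializing into a temp table freezes the random values:
CREATE TEMP TABLE frozen AS
SELECT i, random() AS r FROM generate_series(1, 3) AS i;

SELECT i, r FROM frozen;  -- same values on every subsequent read
```

That is essentially the difference between the inlined view (re-evaluation allowed) and the temp-table version (values frozen once).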
2009/2/27 Scott Marlowe <scott.marlowe@gmail.com>:
> On Thu, Feb 26, 2009 at 3:02 AM, Grzegorz Jaśkiewicz <gryzman@gmail.com> wrote:
>
>> First of all, I wonder why the same query divided up in half - and
>> using temporary table works as expected, and with everything together
>
> I'm betting it's your use of generate_series(). You can get some
> weird side effects because it sometimes gets run multiple times not
> just the once you expect. I'm guessing that's what's biting you.

In which case, wouldn't that be a PostgreSQL bug?

-- GJ
On Fri, Feb 27, 2009 at 3:10 AM, Grzegorz Jaśkiewicz <gryzman@gmail.com> wrote:
> 2009/2/27 Scott Marlowe <scott.marlowe@gmail.com>:
>> On Thu, Feb 26, 2009 at 3:02 AM, Grzegorz Jaśkiewicz <gryzman@gmail.com> wrote:
>>
>>> First of all, I wonder why the same query divided up in half - and
>>> using temporary table works as expected, and with everything together
>>
>> I'm betting it's your use of generate_series(). You can get some
>> weird side effects because it sometimes gets run multiple times not
>> just the once you expect. I'm guessing that's what's biting you.
>
> in which case, wouldn't that be a postgresql's bug ?

Nope.
2009/2/27 Scott Marlowe <scott.marlowe@gmail.com>:
> Nope.

As far as I can understand it, I do the same thing in two steps and in one step, and the latter is broken because of some internal process/optimization/whatever - that's a bug to me. Unless I was expecting too much, and it was just pure luck that it worked in two steps...

-- GJ
On Fri, Feb 27, 2009 at 3:16 AM, Grzegorz Jaśkiewicz <gryzman@gmail.com> wrote:
> 2009/2/27 Scott Marlowe <scott.marlowe@gmail.com>:
>> Nope.
>
> as far as I can understand it, if I do the same thing in two steps,
> and in one step. And the latter is broken, because of some internal
> process/optimization/whatever - that's a bug to me.

You're assuming that creating a temp data set in a table and join semantics are the same. They are not.
Alban Hertroys <dalroi@solfertje.student.utwente.nl> writes:
> On Feb 26, 2009, at 11:02 AM, Grzegorz Jaśkiewicz wrote:
>
>> looks like you completely misunderstood my question.
>
> I'm not surprised. What do you expect with random capitalisation, random table
> alias names and random indentation combined with queries getting wrapped by
> the mailing-list software?

Uh, we get a lot of really mangled SQL and explain plans -- I don't see anything wrong with these. If the question was unclear it sounds like it's just because it's a fairly subtle problem and was hard to describe. Needing two cracks at describing the problem is pretty much par for the course here.

I haven't tested the query to see what's going on, but if the problem is due to random() then in 8.4 you could use WITH to guarantee that the subquery is executed precisely once and the results reused as-is subsequently.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's PostGIS support!
On Fri, Feb 27, 2009 at 11:57 AM, Gregory Stark <stark@enterprisedb.com> wrote:
> Uh, we get a lot of really mangled SQL and explain plans -- I don't see
> anything wrong with these. If the question was unclear it sounds like it's
> just because it's a fairly subtle problem and was hard to describe. Needing
> two cracks at describing the problem is pretty much par for the course here.
>
> I haven't tested the query to see what's going on but if the problem is due to
> random() then in 8.4 you could use WITH to guarantee that the subquery is
> executed precisely once and the results reused as-is subsequently.

In that case: what I actually need is a random array of packages that differs as much as possible for every account id. I'll try to create a simpler example, with a script that generates the test data for you.

-- GJ
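For reference, the WITH-based rewrite Gregory suggests could look something like this on 8.4 (an untested sketch of the original query; in 8.4 a CTE is evaluated only once and its result reused, so the random arrays are frozen before the join against generate_series):

```sql
WITH heh AS (
    SELECT ss.id,
           ARRAY(SELECT p.id
                 FROM packages p
                 WHERE ss.id > 0 AND p.id BETWEEN 2 AND 6
                 ORDER BY random()
                 LIMIT 5) AS v
    FROM (SELECT id FROM accounts ORDER BY random() LIMIT 100000) ss
)
SELECT count(DISTINCT (id, v[i]))
FROM heh, generate_series(1, 5) i;
```

This keeps everything in one statement while giving the same "evaluate once, then join" behaviour as the temp-table workaround.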