Thread: DISTINCT/Optimizer question
Hello,
My name is Beth Jen and I'm a Summer of Code student currently adding a hash-based implementation of DISTINCT to PostgreSQL. My prototype is near completion, and the current design is as follows:
I've created a new exec node that uses hash table functions in execGrouping. The node simply sticks distinct values into the hash table and then returns them, else it discards the value. The idea was to then have the optimizer consider using this node instead of the current sort -> unique combination.
However, I've run into an issue where Jonah and Bruce suggested I post the question to the hackers list.
Right now, the distinct clause adds its targets to the sort clause list when it is parsed. This causes an automatic insertion of the sort node into the query plan before the application of the unique node. The hash-based implementation however is meant to bypass the need to sort. I could just remove this action, but the optimizer should only consider using the hash-based implementation, it should still use the sort -> unique implementation when necessary. (Not to mention, this sort -> unique implementation is used in other cases where unique values are needed, such as in unions.) Therefore, I need to be able to somehow either move this component from the parser into the optimizer, or prevent the creation of a sort node when the query planner chooses to use the hash-based implementation.
What are your suggestions for going about this? Are these approaches feasible without a significant restructuring of the code? Are there any other approaches I should consider?
Thank you for your assistance.
-Beth
My name is Beth Jen and I'm a Summer of Code student currently adding a hash-based implementation of DISTINCT to PostgreSQL. My prototype is near completion, and the current design is as follows:
I've created a new exec node that uses hash table functions in execGrouping. The node simply sticks distinct values into the hash table and then returns them, else it discards the value. The idea was to then have the optimizer consider using this node instead of the current sort -> unique combination.
However, I've run into an issue where Jonah and Bruce suggested I post the question to the hackers list.
Right now, the distinct clause adds its targets to the sort clause list when it is parsed. This causes an automatic insertion of the sort node into the query plan before the application of the unique node. The hash-based implementation however is meant to bypass the need to sort. I could just remove this action, but the optimizer should only consider using the hash-based implementation, it should still use the sort -> unique implementation when necessary. (Not to mention, this sort -> unique implementation is used in other cases where unique values are needed, such as in unions.) Therefore, I need to be able to somehow either move this component from the parser into the optimizer, or prevent the creation of a sort node when the query planner chooses to use the hash-based implementation.
What are your suggestions for going about this? Are these approaches feasible without a significant restructuring of the code? Are there any other approaches I should consider?
Thank you for your assistance.
-Beth
On Fri, Jul 07, 2006 at 01:25:53PM -0400, Beth Jen wrote: > Right now, the distinct clause adds its targets to the sort clause list when > it is parsed. This causes an automatic insertion of the sort node into the > query plan before the application of the unique node. The hash-based > implementation however is meant to bypass the need to sort. I could just > remove this action, but the optimizer should only consider using the <snip> My laymans opinion suggests that this needs a new specific "distinct clause" which looks a lot like a sort clause only isn't. And then in the planner this clause would either be converted to your new node type or the traditional sort node. > What are your suggestions for going about this? Are these approaches > feasible without a significant restructuring of the code? Are there any > other approaches I should consider? I think it should be possible without too much changes, since much would be shared. For example you could have the distinct node look exactly like the sort, so they could share code. Or perhaps just a flag to distinguish them. I admit I havn't looked carefully though... Have you considered how your code interacts with DISTINCT ON ()? Perhaps a clue lies there... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Martijn van Oosterhout <kleptog@svana.org> writes: > On Fri, Jul 07, 2006 at 01:25:53PM -0400, Beth Jen wrote: > > Right now, the distinct clause adds its targets to the sort clause list when > > it is parsed. This causes an automatic insertion of the sort node into the > > query plan before the application of the unique node. The hash-based > > implementation however is meant to bypass the need to sort. I could just > > remove this action, but the optimizer should only consider using the > > <snip> > > My laymans opinion suggests that this needs a new specific "distinct > clause" which looks a lot like a sort clause only isn't. And then in > the planner this clause would either be converted to your new node type > or the traditional sort node. I had always assumed that the way forward here was just to convert the DISTINCT query into the equivalent GROUP BY query. No sense in having two separate code paths that handle precisely the same behaviour. > Have you considered how your code interacts with DISTINCT ON ()? > Perhaps a clue lies there... Therein lies the rub. There are equivalent GROUP BY forms for DISTINCT ON queries but they aren't optimized effectively currently. Until they are DISTINCT ON can't be translated into GROUP BY queries. I would suggest working on optimizing those cases (min(), max(), first(), last() with GROUP BY over a sorted subquery) and then translating DISTINCT ON queries as well. But afaict there's nothing stopping Postgres from converting plain old standard DISTINCT queries into GROUP BY queries currently. -- greg
Hi Beth, "Beth Jen" <raelys@gmail.com> writes: > Right now, the distinct clause adds its targets to the sort clause list when > it is parsed. Yeah, the DISTINCT/DISTINCT ON implementation is currently rather tightly tied to sorting :-(. This is ancient code and badly in need of redesign, but it's not clear how to clean it up without breaking the expected behavior of DISTINCT ON. There may not be any alternative except to divorce DISTINCT from DISTINCT ON and make them two separate code paths, but that's hardly appealing. On the other side of the coin, there's the analogy to GROUP BY that Greg points out --- there's some duplicated functionality there, but again it doesn't carry over to DISTINCT ON, AFAICS. It might work to have parse analysis not add the DISTINCT list to the ORDER BY list, but instead store them as separate Query fields, and have the planner add DISTINCT to ORDER BY if it decides to use sort-based distinct-ing. I'm not sure if there's any good way to merge all three constructs (DISTINCT, DISTINCT ON, GROUP BY). regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > On the other side of the coin, there's the analogy to GROUP BY that Greg > points out --- there's some duplicated functionality there, but again it > doesn't carry over to DISTINCT ON, AFAICS. The equivalent query for: SELECT DISTINCT ON (x,y) a,b,c FROM ... ORDER BY x,y,z is: SELECT x,y,z,first(a),first(b),first(c) FROM ( SELECT x,y,z,a,b,c FROM ... ORDER BY x,y,z ) GROUP BY x,y Getting the optimizer to treat this as well as DISTINCT ON would be quite a trick. It would probably require the same machinery as getting min() and max() to take maximum advantage of indexes in the face of a GROUP BY clause. That is some sort of field for aggregate functions indicating what subset of the records is sufficient for them and what path they would need for that to be the case. In any case I don't see how you get DISTINCT ON to work without sorting. For min() and max() they could indicate they only need the first field if the input is sorted and the optimizer could decide it's cheaper to pass them every record that do the sort. For first() and last() they would tell the optimizer they only need the first or last record with no particular path but that only works because the rewritten query has an explicit ORDER BY clause. That's about as far as I've thought about it. At the time I thought it would likely be too hard for a first project. I suspect it's too hard for a second project for that matter. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com