Thread: SELECT DISTINCT scans the table?
Dear list, This feels like a FAQ, but neither the postgres docs nor web searches got me a single step towards some sort of insight. Maximally stripped down, my problem is that select distinct 300 from <bigtable> seqscans <bigtable> (at least in PostgreSQL 11.14). To me, it seems obvious that this ought be be just one row containing 300 once Postgres has established that <bigtable> is nonempty. Why do things not work like this? Am I missing something major? The reason I'm interested in this is of course a bit more involved. I have a view that looks somewhat like this: CREATE VIEW a_view AS ( SELECT 'abc' as coll, ... FROM table1 UNION SELECT 'def' as coll, ... FROM table2 UNION SELECT coll, ... FROM table3 ...) and so on for perhaps 50 tables; where, as for table3 in this example, the coll column is not simply a constant, there is an index on the source column (these then are actually fast). I'd now like to be able to say SELECT DISTINCT coll FROM a_view -- which takes forever once some of the tables involved are sufficiently large. I'd assume in an ideal world the query would be essentally instantaneous -- merging the constants and doing a few index scans for the tables that have non-constant coll. In reality, it's not. The tables with constant coll are all completely scanned, either sequentially or index-only. Assuming this is the expected behaviour: Is there perhaps some trick I could use to make postgres use the information that there are constants in the select clauses? Thanks, Markus
Markus Demleitner <msdemlei@ari.uni-heidelberg.de> writes: > Maximally stripped down, my problem is that > select distinct 300 from <bigtable> > seqscans <bigtable> (at least in PostgreSQL 11.14). To me, it seems > obvious that this ought be be just one row containing 300 once > Postgres has established that <bigtable> is nonempty. > Why do things not work like this? Am I missing something major? That seems like the sort of optimization that we absolutely should not spend cycles looking for. If it were a trivial change consuming no detectable number of planning cycles, maybe it would be worth the development and maintenance effort; though I'd be dubious about the value. But the fact that it'd have to be transformed into something testing whether the table is nonempty makes it fairly nontrivial. I doubt it's worth the development cost plus the cycles imposed on every other query. regards, tom lane
Dear Tom, On Thu, Dec 16, 2021 at 03:47:57PM -0500, Tom Lane wrote: > Markus Demleitner <msdemlei@ari.uni-heidelberg.de> writes: > > Maximally stripped down, my problem is that > > > select distinct 300 from <bigtable> > > > seqscans <bigtable> (at least in PostgreSQL 11.14). To me, it seems > > obvious that this ought be be just one row containing 300 once > > Postgres has established that <bigtable> is nonempty. > > That seems like the sort of optimization that we absolutely should > not spend cycles looking for. If it were a trivial change consuming > no detectable number of planning cycles, maybe it would be worth the > development and maintenance effort; though I'd be dubious about the > value. But the fact that it'd have to be transformed into something > testing whether the table is nonempty makes it fairly nontrivial. > I doubt it's worth the development cost plus the cycles imposed > on every other query. I certainly understand that reasoning for this particular example. However, in my actual use case, the one with the view consisting of a large union containing constants from the original mail, CREATE VIEW a_view AS ( SELECT 'abc' as coll, ... FROM table1 -- with perhaps 1e6 rows UNION SELECT 'def' as coll, ... FROM table2 -- with perhaps another 1e7 rows UNION SELECT coll, ... -- with an index on table3.coll FROM table3 ...) being able to factor out constants would make a difference of milliseconds versus a long time (~ a minute in my case, with about 1e8 total rows) when running SELECT DISTINCT coll FROM a_view. Is there, perhaps, a non-obvious way to give the planner a nudge to exploit the constant-ness of coll in table1 and table2? Thanks, Markus