Re: Doing better at HINTing an appropriate column within errorMissingColumn() - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: Doing better at HINTing an appropriate column within errorMissingColumn() |
Date | |
Msg-id | CA+TgmoY7C8=3SaEaii4extc5NT6-z6qe2p0JGL++iy2oZSTZ0g@mail.gmail.com |
In response to | Re: Doing better at HINTing an appropriate column within errorMissingColumn() (Peter Geoghegan <pg@heroku.com>) |
Responses | Re: Doing better at HINTing an appropriate column within errorMissingColumn() |
List | pgsql-hackers |
On Tue, Jun 17, 2014 at 12:51 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Jun 16, 2014 at 8:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Not having looked at the patch, but: I think the probability of
>> useless-noise HINTs could be substantially reduced if the code prints a
>> HINT only when there is a single available alternative that is clearly
>> better than the others in Levenshtein distance.  I'm not sure how much
>> better is "clearly better", but I exclude "zero" from that.  I see that
>> the original description of the patch says that it will arbitrarily
>> choose one alternative when there are several with equal Levenshtein
>> distance, and I'd say that's a bad idea.
>
> I disagree. I happen to think that making some guess is better than no
> guess at all here, given the fact that there aren't too many
> possibilities to choose from. I think that it might be particularly
> annoying to not show some suggestion in the event of a would-be
> ambiguous column reference where the column name is itself wrong,
> since both mistakes are common. For example, "order_id" was specified
> instead of one of either "o.orderid" or "ol.orderid", as in my
> original examples. If some correct alias was specified, that would
> make the new code prefer the appropriate Var, but it might not be, and
> that should be okay in my view.
>
> I'm not trying to remove the need for human judgement here. We've all
> heard stories about people who did things like input "Portland" into
> their GPS only to end up in Maine rather than Oregon, but I think in
> general you can only go so far in worrying about those cases.

Emitting a suggestion with a large distance seems like it could be
rather irritating.  If the user types in SELECT prodct_id FROM orders,
and that column does not exist, suggesting "product_id", if such a
column exists, will likely be well-received.  Suggesting a column
named, say, "price", however, will likely make at least some users say
"no I didn't mean that you stupid @%!#" - because probably the issue
there is that the user selected from the completely wrong table,
rather than getting 6 of the 9 characters they typed incorrect.

One existing tool that does something along these lines is 'git',
which seems to have some kind of a heuristic to know when to give up:

[rhaas pgsql]$ git gorp
git: 'gorp' is not a git command. See 'git --help'.

Did you mean this?
    grep
[rhaas pgsql]$ git goop
git: 'goop' is not a git command. See 'git --help'.

Did you mean this?
    grep
[rhaas pgsql]$ git good
git: 'good' is not a git command. See 'git --help'.
[rhaas pgsql]$ git puma
git: 'puma' is not a git command. See 'git --help'.

Did you mean one of these?
    pull
    push

I suspect that the maximum useful distance is a function of the string
length.  Certainly, if the distance is greater than or equal to the
length of one of the strings involved, it's just a totally unrelated
string and thus not worth suggesting.  A useful heuristic might be
something like "distance at most 3, or at most half the string length,
whichever is less".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
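
The cutoff proposed above ("distance at most 3, or at most half the string length, whichever is less") could be sketched roughly as follows. This is an illustration in Python, not code from the patch, and it rests on a few assumptions the message does not settle: "the string length" is taken to mean the length of the identifier the user typed, halved with integer division; candidates tied at the best distance are all returned, in the style of git's "Did you mean one of these?"; and the names levenshtein, max_useful_distance, and suggest are hypothetical.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Plain edit distance: insert, delete, and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def max_useful_distance(typed: str) -> int:
    """Cutoff from the message: at most 3, or half the length, whichever is less.

    Assumption: "string length" means the typed identifier, with integer division.
    """
    return min(3, len(typed) // 2)


def suggest(typed: str, candidates: List[str]) -> List[str]:
    """Return the closest candidates within the cutoff, or [] to give up.

    Assumption: ties at the best distance are all returned rather than
    one being picked arbitrarily.
    """
    limit = max_useful_distance(typed)
    scored = [(levenshtein(typed, c), c) for c in candidates]
    in_range = [(d, c) for d, c in scored if d <= limit]
    if not in_range:
        return []
    best = min(d for d, _ in in_range)
    return [c for d, c in in_range if d == best]
```

With the candidates "product_id" and "price", suggest("prodct_id", ...) returns only "product_id" (distance 1); "price" sits at distance 6 and is rejected by the cutoff of 3. For a four-character input such as "good", the cutoff drops to 2, so "grep" (distance 3) is not offered.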