Re: Doing better at HINTing an appropriate column within errorMissingColumn() - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Doing better at HINTing an appropriate column within errorMissingColumn()
Date
Msg-id CA+TgmoaR-XPHkbe4DA2xOx+B6gqqTz10p48=HEKbQfGx++mbyA@mail.gmail.com
Whole thread Raw
In response to Re: Doing better at HINTing an appropriate column within errorMissingColumn()  (Peter Geoghegan <pg@heroku.com>)
Responses Re: Doing better at HINTing an appropriate column within errorMissingColumn()  (Peter Geoghegan <pg@heroku.com>)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()  (Peter Geoghegan <pg@heroku.com>)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-hackers
On Wed, Nov 19, 2014 at 1:22 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Nov 19, 2014 at 9:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> If you agree, then I'm not being clear enough.  I don't think think
>> that tinkering with the Levenshtein cost factors is a good idea, and I
>> think it's unhelpful to suggest something when the suggestion and the
>> original word differ by more than 50% of the number characters in the
>> shorter word.
>
> I agree - except for very short strings, where there is insufficient
> information to go on.

That's precisely the time I think it's *most* important.  In a very
long string, the threshold should be LESS than 50%.  My original
proposal was "no more than 3 characters of difference, but in any
event not more than half the length of the shorter string".

>> Suggesting "col" for "oid" or "x" for "xmax", as crops
>> up in the regression tests with this patch applied, shows the folly of
>> this: the user didn't mean the other named column; rather, the user
>> was confused about whether a particular system column existed for that
>> table.
>
> Those are all very terse strings. What you're overlooking is what is
> broken by using straight Levenshtein distance, which includes things
> in the regression test that are reasonable and helpful. As I mentioned
> before, requiring a greater than 50% of total string size distance
> breaks this, just within the regression tests:
>
> """
>   ERROR:  column "f1" does not exist
>   LINE 1: select f1 from c1;
>                  ^
> - HINT:  Perhaps you meant to reference the column "c1"."f2".
> """

That's exactly 50%, not more than 50%.

(I'm also on the fence about whether the hint is actually helpful in
that case, but the rule I proposed wouldn't prohibit it.)

> And:
>
> """
>   ERROR:  column atts.relid does not exist
>   LINE 1: select atts.relid::regclass, s.* from pg_stats s join
>                  ^
> - HINT:  Perhaps you meant to reference the column "atts"."indexrelid".
> """
>
> Those are really useful suggestions! And, they're much more
> representative of real user error.

That one's right at 50% too, but it's certainly more than 3 characters
of difference.  I think it's going to be pretty hard to emit a
suggestion in that case but not in a whole lot of cases that don't
make any sense.

> How about git as a kind of precedent? It is not at all conservative
> about showing *some* suggestion:
>
> """
> $ git aa
> git: 'aa' is not a git command. See 'git --help'.
>
> Did you mean this?
> am
>
> $ git d
> git: 'd' is not a git command. See 'git --help'.
>
> Did you mean one of these?
> diff
> add
>
> $ git ddd
> git: 'ddd' is not a git command. See 'git --help'.
>
> Did you mean this?
> add
> """
>
> And why wouldn't git be? As far as its concerned, you can only have
> meant one of those small number of things. Similarly, with the patch,
> the number of things we can pick from is fairly limited at this stage,
> since we are actually fairly far along with parse analysis.

I went and found the actual code git uses for this.  It's here:

https://github.com/git/git/blob/d29e9c89dbbf0876145dc88615b99308cab5f187/help.c

And the underlying Levenshtein implementation is here:

https://github.com/git/git/blob/398dd4bd039680ba98497fbedffa415a43583c16/levenshtein.c

Apparently what they're doing is charging 0 for a transposition (which
we don't have as a separate concept), 2 for a substitution, 1 for an
insertion, and 3 for a deletion, with the constraint that anything
with a total distance of more than 6 isn't considered.  And that does
overall seem to give pretty good suggestions.  However, an interesting
point about the git algorithm is that it's not hard to make it do
stupid things on short strings:

[rhaas pgsql]$ git xy
git: 'xy' is not a git command. See 'git --help'.

Did you mean one of these?   am   gc   mv   rm

Maybe they should adopt my idea of a lower cutoff for short strings.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: Doing better at HINTing an appropriate column within errorMissingColumn()
Next
From: Peter Geoghegan
Date:
Subject: Re: Doing better at HINTing an appropriate column within errorMissingColumn()