Re: Doing better at HINTing an appropriate column within errorMissingColumn() - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Doing better at HINTing an appropriate column within errorMissingColumn()
Date
Msg-id CA+TgmoZLwzgyv=JAYfi6XfAK8OcBuTPYYhP5TbOqsS=YWVvzUw@mail.gmail.com
Whole thread Raw
In response to Re: Doing better at HINTing an appropriate column within errorMissingColumn()  (Peter Geoghegan <pg@heroku.com>)
Responses Re: Doing better at HINTing an appropriate column within errorMissingColumn()  (Peter Geoghegan <pg@heroku.com>)
List pgsql-hackers
On Sat, Nov 15, 2014 at 7:36 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Attached patch incorporates this feedback.
>
> The only user-visible difference between this revision and the
> previous revision is that it's quite possible for two suggestion to
> originate from the same RTE (there is exactly one change in the
> regression test's expected output as compared to the last revision for
> this reason. The regression tests are otherwise unchanged). It's still
> not possible to see more than 2 suggestions under any circumstances,
> no matter where they might have originated from, which I think is
> appropriate -- we continue to not present any HINT in the event of 3
> or more equidistant matches.
>
> I think that the restructuring required to pass around a state
> variable has resulted in somewhat clearer code.

Cool!

I'm grumpy about the distanceName() function.  That seems too generic.
If we're going to keep this as it is, I suggest something like
computeRTEColumnDistance().  But see below.

On a related note, I'm also grumpy about this comment:

+    /* Charge half as much per deletion as per insertion or per substitution */
+    return varstr_levenshtein_less_equal(actual, len, match, match_len,
+                                         2, 1, 2, max);

The purpose of a code comment is to articulate WHY we did something,
rather than simply to restate what the code quite obviously does.  I
haven't heard a compelling argument for why this should be 2, 1, 2
rather than the default 1, 1, 1; and I'm inclined to do the latter
unless you can make some very good argument for this combination of
weights.  And if you can make such an argument, then there should be
comments so that the next person to come along and look at this code
doesn't go, huh, that's whacky, and change it.

+                 int location, FuzzyAttrMatchState *rtestate)

I suggest calling this "fuzzystate" rather than "rtestate"; it's not
the state of the RTE, but the state of the fuzzy matching.

Within the scanRTEForColumn block, we have a rather large chunk of
code protected by if (rtestate), which contains the only call to
distanceName().  I suggest that we move all of this logic to a
separate, static function, and merge distanceName into it.  I also
suggest testing against NULL explicitly instead of implicitly.  So
this block of code would end up as something like:

if (fuzzystate != NULL)   updateFuzzyAttrMatchState(rte, attcolname, colname, &fuzzystate);

In searchRangeTableForCol, I'm fairly certain that you've changed the
behavior by adding a check for if (rte->rtekind == RTE_JOIN) before
the call to scanRTEForColumn().  Why not instead put this check into
updateFuzzyAttrMatchState?  Then you can be sure you're not changing
the behavior in any other case.

On a similar note, I think the dropped-column test should happen early
as well, probably again in updateFuzzyAttrMatchState().  There's
little point in adding a suggestion only to throw it away again.

+            /*
+             * Charge extra (for inexact matches only) when an alias was
+             * specified that differs from what might have been used to
+             * correctly qualify this RTE's closest column
+             */
+            if (wrongalias)
+                rtestate.distance += 3;

I don't understand what situation this is catering to.  Can you
explain?  It seems to account for a good deal of complexity.
ERROR:  column "oid" does not existLINE 1: select oid > 0, * from altwithoid;               ^
+HINT:  Perhaps you meant to reference the column "altwithoid"."col".

That seems like a stretch.  I think I suggested before using a
distance threshold of at most 3 or half the word length, whichever is
less.  For a three-letter column name that means not suggesting
anything if more than one character is different.  What you
implemented here is close to that, yet somehow we've got a suggestion
slipping through that has 2 out of 3 characters different.  I'm not
quite sure I see how that's getting through, but I think it shouldn't.
ERROR:  column fullname.text does not existLINE 1: select fullname.text from fullname;               ^
+HINT:  Perhaps you meant to reference the column "fullname"."last".

Same problem, only worse!  They've only got one letter of four in common.
ERROR:  column "xmin" does not existLINE 1: select xmin, * from fooview;               ^
+HINT:  Perhaps you meant to reference the column "fooview"."x".

Basically the same problem again.  I think the distance threshold in
this case should be half the shorter column name, i.e. 0.

Your new test cases include no negative test cases; that is, cases
where the machinery declines to suggest a hint because of, say, 3
equally good possibilities.  They probably should have something like
that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Fujii Masao
Date:
Subject: Re: Review of Refactoring code for sync node detection
Next
From: Robert Haas
Date:
Subject: Re: New Event Trigger: table_rewrite