On Sat, Dec 20, 2014 at 7:30 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Sat, Dec 20, 2014 at 3:17 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Attached patch implements this scheme.
>
> I had another thought: "NAMEDATALEN + 1" is a better representation of
> "infinity" for matching purposes than INT_MAX. I probably should have
> made that change, too. It would then not have been necessary to
> "#include <limits.h>". I think that this is a useful
> belt-and-suspenders precaution against integer overflow. It almost
> certainly won't matter, since it's very unlikely that the best match
> within an RTE will end up being a dropped column, but we might as well
> do it that way (Levenshtein distance is costed in multiples of code
> point changes, but the maximum density is 1 byte per codepoint).

Good idea.

Looking over the latest patch, I think we could simplify the code so
that you don't need multiple FuzzyAttrMatchState objects.  Instead of
creating a separate one for each RTE and then merging them, just have
one.  When you find an inexact RTE-name match, set a field inside the
FuzzyAttrMatchState -- maybe with a name like rte_penalty -- to the
Levenshtein distance between the name given and the RTE's name.  Then
call scanRTEForColumn() and pass down the same state object.  Now let
updateFuzzyAttrMatchState() work out what it needs to do based on the
observed inter-column distance and the currently-in-force RTE penalty.
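
To make that concrete, here is a rough standalone sketch of the shape I
have in mind.  It is not the patch: the struct layout, SketchRTE,
name_distance(), update_fuzzy_state() and find_best_match() are all
made-up names for illustration, the distance function uses plain unit
costs, and MAX_NAME_LEN merely stands in for NAMEDATALEN.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_NAME_LEN 64                         /* stand-in for NAMEDATALEN */
#define DISTANCE_INFINITY (MAX_NAME_LEN + 1)    /* "infinity" for matching */

/* Hypothetical single state object shared across all RTEs. */
typedef struct FuzzyAttrMatchState
{
    int         rte_penalty;    /* distance of the RTE name now in force */
    int         best_distance;  /* best combined distance seen so far */
    const char *best_rte;       /* RTE owning the best column, or NULL */
    const char *best_col;       /* best column name, or NULL */
} FuzzyAttrMatchState;

/* Hypothetical, heavily simplified range table entry. */
typedef struct SketchRTE
{
    const char *name;
    const char *const *cols;
    size_t      ncols;
} SketchRTE;

/* Unit-cost Levenshtein distance, one row of the DP table at a time. */
static int
name_distance(const char *a, const char *b)
{
    size_t  la = strlen(a), lb = strlen(b), i, j;
    int     row[MAX_NAME_LEN + 1];

    if (la > MAX_NAME_LEN || lb > MAX_NAME_LEN)
        return DISTANCE_INFINITY;
    for (j = 0; j <= lb; j++)
        row[j] = (int) j;
    for (i = 1; i <= la; i++)
    {
        int diag = row[0];              /* value of D[i-1][j-1] */

        row[0] = (int) i;
        for (j = 1; j <= lb; j++)
        {
            int up = row[j];            /* value of D[i-1][j] */
            int best = diag + (a[i - 1] == b[j - 1] ? 0 : 1);

            if (up + 1 < best)
                best = up + 1;
            if (row[j - 1] + 1 < best)
                best = row[j - 1] + 1;
            row[j] = best;
            diag = up;
        }
    }
    return row[lb];
}

/* Combine the column distance with the RTE penalty currently in force. */
static void
update_fuzzy_state(FuzzyAttrMatchState *state, const char *rte_name,
                   const char *col_name, int col_distance)
{
    int total;

    assert(state != NULL);
    total = state->rte_penalty + col_distance;
    if (total < state->best_distance)
    {
        state->best_distance = total;
        state->best_rte = rte_name;
        state->best_col = col_name;
    }
}

/* Walk every RTE, setting the penalty once and reusing the same state. */
static void
find_best_match(FuzzyAttrMatchState *state, const SketchRTE *rtes,
                size_t nrtes, const char *given_rte, const char *given_col)
{
    size_t r, c;

    state->rte_penalty = 0;
    state->best_distance = DISTANCE_INFINITY;
    state->best_rte = state->best_col = NULL;
    for (r = 0; r < nrtes; r++)
    {
        /* No penalty when the reference was not qualified at all. */
        state->rte_penalty =
            given_rte ? name_distance(given_rte, rtes[r].name) : 0;
        for (c = 0; c < rtes[r].ncols; c++)
            update_fuzzy_state(state, rtes[r].name, rtes[r].cols[c],
                               name_distance(given_col, rtes[r].cols[c]));
    }
}
```

The point is only that there is nothing to merge afterwards: the
penalty is a field that gets overwritten per RTE, and the comparison
against the best match so far happens in one place.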
-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company