Thank you for your time looking into this.
Here's a v7 patch, rebased over my recent hacking, and addressing
most of the complaints I raised in <31691.1579648824@sss.pgh.pa.us>.
However, I've got some new complaints now that I've looked harder
at the code:
* It's not exactly apparent to me why this code should be concerned
about non-normalized characters when noplace else in the backend is.
I think we should either rip that out entirely, or move the logic
into str_tolower (and hence also str_toupper etc). I'd tend to favor
the former, but of course I don't speak any languages where this would
be an issue. Perhaps a better idea is to invent a new SQL function
that normalizes a given string, and expect users to call that first
if they'd like to_date() to match unnormalized text.
There is an open patch that will make the normalization functionality user visible [1]. So, if a user can call to_date(normalize('01 ŞUB 2010'), 'DD TMMON YYYY') I would vote to drop the normalization logic inside this patch altogether.
* I have no faith in this calculation that decides how long the match
length was:
*len = element_len + name_len - norm_len;
I seriously doubt that this does the right thing if either the
normalization or the downcasing changed the byte length of the
string. I'm not actually sure how we can do that correctly.
There's no good way to know whether the changes happened within
the part of the "name" string that we matched, or the part beyond
what we matched, but we only want to count the former. So this
seems like a pretty hard problem, and even if this logic is somehow
correct as it stands, it needs a comment explaining why.
The proper logic would come from do_to_timestamp() receiving a normalized "date_txt" input, so we do not operate with unnormalize and normalize strings at the same time.
* I'm still concerned about the risk of integer overflow in the
string allocations in seq_search_localized(). Those need to be
adjusted to be more paranoid, similar to the code in e.g. str_tolower.
Please find attached a patch with the normalization logic removed, thus no direct allocations in seq_search_localized().
I would like to rise a couple of questions myself:
* When compiled with DEBUG_TO_FROM_CHAR, there is a warning "‘dump_node’ defined but not used". Should we drop this function or uncomment its usage?
* Would it be worth moving str_tolower(localized_name) from seq_search_localized() into cache_locale_time()?
Regards,
Juan José Santamaría Flecha