Re: Unaccent extension python script Issue in Windows - Mailing list pgsql-hackers

From Kyotaro HORIGUCHI
Subject Re: Unaccent extension python script Issue in Windows
Msg-id 20190318.141334.186469242.horiguchi.kyotaro@lab.ntt.co.jp
In response to Re: Unaccent extension python script Issue in Windows  (Hugh Ranalli <hugh@whtc.ca>)
Responses Re: Unaccent extension python script Issue in Windows
Re: Unaccent extension python script Issue in Windows
List pgsql-hackers
Hello.

At Sun, 17 Mar 2019 20:23:05 -0400, Hugh Ranalli <hugh@whtc.ca> wrote in
<CAAhbUMNoBLu7jAbyK5MK0LXEyt03PzNQt_Apkg0z9bsAjcLV4g@mail.gmail.com>
> Hi Ram,
> Thanks for doing this; I've been overestimating my ability to get to things
> over the last couple of weeks.
> 
> I've looked at the patch and have made one minor change. I had moved all
> the imports up to the top, to keep them in one place (and I think some had
> originally been used only by the Python 2 code). You added them there, but
> didn't remove them from their original positions. So I've incorporated that
> into your patch, attached as v2. I've tested this under Python 2 and 3 on
> Linux, not Windows.

Though I'm not sure it is necessary to run the script on
Windows, the problem is not specific to Windows. It is a general
one that just happened not to surface in non-Windows environments.

On CentOS7:
> export LANG="ja_JP.EUCJP"
> python <..snipped..>
..
> UnicodeEncodeError: 'euc_jp' codec can't encode character '\xab' in position 0: illegal multibyte sequence

So this is not an issue with Windows but with python3.
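
The failure can be reproduced without the script. The following is a
minimal sketch, assuming Python 3 and only the standard library.
io.BytesIO stands in for sys.stdout.buffer, and the helper names
can_encode and write_utf8 are hypothetical, not part of the patch:

```python
import codecs
import io


def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def write_utf8(text):
    """Write text through a UTF-8 writer, similar to what the patch does
    for sys.stdout, and return the bytes that were produced."""
    raw = io.BytesIO()  # stand-in for sys.stdout.buffer (assumption)
    writer = codecs.getwriter('utf8')(raw)
    writer.write(text)
    return raw.getvalue()


# U+00AB has no mapping in euc_jp, which is what the traceback reports.
print(can_encode('\xab', 'euc_jp'))
# Forcing UTF-8 on the output side is independent of LANG.
print(write_utf8('\xab'))
```

Because the writer encodes explicitly, the result no longer depends on
the locale that Python 3 would otherwise pick up for stdout.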

With the patch, the script generates identical files with both
versions of Python on Linux and Windows 7. Python 3 on Windows
emits CRLF as the line terminator, but that doesn't seem to do any
harm. (I haven't confirmed that, because the build is currently
extremely slow for reasons I don't yet understand.)

This patch contains unrelated changes. The minimal required
change would be the attached. If you want to refactor the
UnicodeData reader or rearrange the imports, that should go in
separate patches.

It would be better to use IOBase for Python 3, especially for the
stdout replacement, but I didn't since it *is* working.
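
One possible reading of the IOBase remark is io.TextIOWrapper from the
standard io module. The sketch below is an assumption about what such a
replacement could look like, not what the attached patch does. The
helper name utf8_text_stream is hypothetical, and io.BytesIO again
stands in for sys.stdout.buffer:

```python
import io


def utf8_text_stream(binary_stream):
    """Wrap a binary stream in a UTF-8 text layer.

    Passing newline='\\n' disables newline translation on write, so LF
    is not rewritten to the platform line separator.
    """
    return io.TextIOWrapper(binary_stream, encoding='utf-8', newline='\n')


raw = io.BytesIO()  # stand-in for sys.stdout.buffer (assumption)
out = utf8_text_stream(raw)
out.write('\xab\n')
out.flush()  # push the buffered text down to the binary stream
print(raw.getvalue())
```

With newline='\n' the wrapper would also keep LF line endings on
Windows, which may or may not be wanted here.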

> Everything else looks correct. I apologise for not having replied to your
> question in the original bug report. I had intended to, but as I said,
> there's been an increase in the things I need to juggle at the moment.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/contrib/unaccent/generate_unaccent_rules.py b/contrib/unaccent/generate_unaccent_rules.py
index 58b6e7deb7..0d645567b7 100644
--- a/contrib/unaccent/generate_unaccent_rules.py
+++ b/contrib/unaccent/generate_unaccent_rules.py
@@ -45,7 +45,9 @@ if sys.version_info[0] <= 2:
     # Python 2 and 3 compatible bytes call
     def bytes(source, encoding='ascii', errors='strict'):
         return source.encode(encoding=encoding, errors=errors)
+else:
 # END: Python 2/3 compatibility - remove when Python 2 compatibility dropped
+    sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)
 
 import re
 import argparse
@@ -233,7 +235,8 @@ def main(args):
     charactersSet = set()
 
     # read file UnicodeData.txt
-    unicodeDataFile = open(args.unicodeDataFilePath, 'r')
+    unicodeDataFile = codecs.open(
+        args.unicodeDataFilePath, mode='r', encoding='UTF-8')
 
     # read everything we need into memory
     for line in unicodeDataFile:
