Re: Unaccent extension python script Issue in Windows - Mailing list pgsql-hackers

From	Kyotaro HORIGUCHI
Subject	Re: Unaccent extension python script Issue in Windows
Date	March 18, 2019 08:13:34
Msg-id	20190318.141334.186469242.horiguchi.kyotaro@lab.ntt.co.jp
In response to	Re: Unaccent extension python script Issue in Windows (Hugh Ranalli <hugh@whtc.ca>)
List	pgsql-hackers

Hello.

At Sun, 17 Mar 2019 20:23:05 -0400, Hugh Ranalli <hugh@whtc.ca> wrote in
<CAAhbUMNoBLu7jAbyK5MK0LXEyt03PzNQt_Apkg0z9bsAjcLV4g@mail.gmail.com>
> Hi Ram,
> Thanks for doing this; I've been overestimating my ability to get to things
> over the last couple of weeks.
> 
> I've looked at the patch and have made one minor change. I had moved all
> the imports up to the top, to keep them in one place (and I think some had
> originally been used only by the Python 2 code). You added them there, but
> didn't remove them from their original positions. So I've incorporated that
> into your patch, attached as v2. I've tested this under Python 2 and 3 on
> Linux, not Windows.

Though I'm not sure it is necessary to run the script on Windows,
the problem is not specific to Windows. It is a general one that
just happened not to have been noticed in non-Windows environments.

On CentOS7:
> export LANG="ja_JP.EUCJP"
> python <..snipped..>
..
> UnicodeEncodeError: 'euc_jp' codec can't encode character '\xab' in position 0: illegal multibyte sequence

So this is not an issue with Windows but with Python 3.
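
A minimal sketch of the failure that does not depend on the locale
setup: under Python 3 with LANG=ja_JP.EUCJP, stdout encodes text as
EUC-JP, and U+00AB cannot be encoded by that codec. The helper name
can_encode is hypothetical and not part of the script.

```python
def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# U+00AB is the character named in the traceback above.
LEFT_GUILLEMET = '\u00ab'
for name in ('euc_jp', 'utf-8'):
    print(name, can_encode(LEFT_GUILLEMET, name))
```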

With the patch, the script generates identical files with both
versions of Python on Linux and Windows 7. Python 3 on Windows
emits CRLF as the newline, but that doesn't seem to do any harm.
(I haven't confirmed that, because the build is extremely slow
right now for reasons I haven't identified.)

This patch contains unrelated changes. The minimal required
change would be the attached. If you want to refactor the
UnicodeData reader or rearrange the imports, that should go in
separate patches.

It would be better to use IOBase for Python 3, especially for the
stdout replacement, but I didn't since the current code *is* working.
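
One way the stdout replacement could look with the io module,
assuming that is what "IOBase" refers to here. This is only a sketch
and not part of the attached patch; utf8_text_stream is a
hypothetical helper name, and newline='\n' is an extra assumption
that turns off newline translation in the wrapper.

```python
import io

def utf8_text_stream(binary_stream):
    """Wrap a binary stream so that text written to it is encoded as UTF-8."""
    # newline='\n' disables newline translation on output (assumption:
    # LF-only output is what is wanted).
    return io.TextIOWrapper(binary_stream, encoding='utf-8', newline='\n')

# Demonstration against an in-memory buffer instead of the real stdout.
buf = io.BytesIO()
out = utf8_text_stream(buf)
out.write('\u00ab\n')
out.flush()
print(buf.getvalue())

# In the script this would be used as (Python 3 only):
#   sys.stdout = utf8_text_stream(sys.stdout.buffer)
```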

> Everything else looks correct. I apologise for not having replied to your
> question in the original bug report. I had intended to, but as I said,
> there's been an increase in the things I need to juggle at the moment.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/contrib/unaccent/generate_unaccent_rules.py b/contrib/unaccent/generate_unaccent_rules.py
index 58b6e7deb7..0d645567b7 100644
--- a/contrib/unaccent/generate_unaccent_rules.py
+++ b/contrib/unaccent/generate_unaccent_rules.py
@@ -45,7 +45,9 @@ if sys.version_info[0] <= 2:
     # Python 2 and 3 compatible bytes call
     def bytes(source, encoding='ascii', errors='strict'):
         return source.encode(encoding=encoding, errors=errors)
+else:
 # END: Python 2/3 compatibility - remove when Python 2 compatibility dropped
+    sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)
 
 import re
 import argparse
@@ -233,7 +235,8 @@ def main(args):
     charactersSet = set()
 
     # read file UnicodeData.txt
-    unicodeDataFile = open(args.unicodeDataFilePath, 'r')
+    unicodeDataFile = codecs.open(
+        args.unicodeDataFilePath, mode='r', encoding='UTF-8')
 
     # read everything we need into memory
     for line in unicodeDataFile:
