Thread: Re: BUG #15548: Unaccent does not remove combining diacritical characters

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
raam narayana
Date:
Hi,

After the latest commit on the master branch, I tried to test the Python script. I still see that the output
from the script is completely different from the contents of the unaccent.rules file. Am I missing anything?
My testing included the following:

Downloaded the following files

http://unicode.org/Public/8.0.0/ucd/UnicodeData.txt
 
http://unicode.org/cldr/trac/export/14746/tags/release-34/common/transforms/Latin-ASCII.xml

Executed the below python script

python generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml > unaccent.rules
 

I am using Python 3.7.1 on Windows 10.

The new status of this patch is: Needs review

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Thomas Munro
Date:
On Mon, Feb 11, 2019 at 7:07 AM raam narayana <raam.soft@gmail.com> wrote:
> After the latest commit in master branch, I was trying to test the python script. Ironically I still see that the
> output from the script is completely different from the unaccent.rules file content. Am I missing anything. My testing
> includes the following
>
> Downloaded the following files
>
> http://unicode.org/Public/8.0.0/ucd/UnicodeData.txt
>
> http://unicode.org/cldr/trac/export/14746/tags/release-34/common/transforms/Latin-ASCII.xml
>
> Executed the below python script
>
> python generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml > unaccent.rules
>
> I am using python 3.7.1 and running on Windows 10 Platform
>
> The new status of this patch is: Needs review

Hi Raam,

How does it differ?  Can you please share the output you get?  I used
Python 2.7 on a Mac, exactly those input files, and my output matched
Hugh's.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Hugh Ranalli
Date:

On Sun, 10 Feb 2019 at 15:07, raam narayana <raam.soft@gmail.com> wrote:
> Hi,
>
> After the latest commit in master branch, I was trying to test the python script. Ironically I still see that the output from the script is completely different from the unaccent.rules file content. Am I missing anything. My testing includes the following
>
> Downloaded the following files
>
> http://unicode.org/Public/8.0.0/ucd/UnicodeData.txt
>
> http://unicode.org/cldr/trac/export/14746/tags/release-34/common/transforms/Latin-ASCII.xml
>
> Executed the below python script
>
> python generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml > unaccent.rules
>
> I am using python 3.7.1 and running on Windows 10 Platform
>
> The new status of this patch is: Needs review

Hi Raam,
I just ran generate_unaccent_rules.py under two environments, using the data files given above:
  - Python 3.4.3 on Linux Mint 17.3 (equivalent to Ubuntu 14.04)
  - Python 3.6.7 on Ubuntu 18.04

In both cases, the output was identical to that generated by the program under Python 2.7. So yes, more information would help. Unfortunately I don't have a Windows Python environment readily available, but could set one up if I had to.

Thanks,
Hugh

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Ramanarayana
Date:
Hi Hugh,

I tested the script with Python 2.7 and it works perfectly. The problem occurs with Python 3.7 (and maybe only on Windows, since you did not see the issue), where I got the following error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u0100' in position 0: character maps to <undefined>

I went through the Python script and found that the stdout encoding is set to UTF-8 only if the Python version is <= 2.

I have made the same change for Python 3 as well. Please find the patch attached. Let me know if it makes sense.
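
The error itself can be reproduced without the script. The following is a minimal sketch, not the attached patch. It assumes the Windows default code page is cp1252, which was not reported in this thread, and the helper name try_encode is made up for illustration:

```python
# Sketch only: reproduce the encoding failure outside the script.
# cp1252 is an assumed stand-in for the Windows default code page.
ch = u"\u0100"  # LATIN CAPITAL LETTER A WITH MACRON, the character in the error


def try_encode(text, encoding):
    """Return the encoded bytes, or None if the codec cannot represent text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


# cp1252 has no mapping for U+0100, so encoding fails.
print(try_encode(ch, "cp1252"))
# UTF-8 can represent every code point, so encoding succeeds.
print(try_encode(ch, "utf-8"))
```

This is the same failure print() hits when stdout is bound to a legacy code page.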

Regards,
Ram.



--
Cheers
Ram 4.0
Attachment

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Michael Paquier
Date:
On Tue, Feb 12, 2019 at 02:27:31AM +0530, Ramanarayana wrote:
> I tested the script in python 2.7 and it works perfect. The problem is in
> python 3.7 (and may be only in windows as you were not getting the issue)
> and I was getting the following error
>
> UnicodeEncodeError: 'charmap' codec can't encode character '\u0100' in
> position 0: character maps to <undefined>
>
>  I went through the python script and found that the stdout encoding is set
> to utf-8 only  if python version is <=2.
>
> I have made the same change for python version 3 as well. Please find the
> patch for the same. Let me know if it makes sense

Isn't that because Windows encoding becomes cp1252, utf16 or such?
FWIW, on Debian SID with Python 3.7, I get the correct output, and no
diffs on HEAD.  Perhaps it would make sense to use open() on the
different files with encoding='utf-8' to avoid any kind of problems?
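
A rough sketch of what passing an explicit encoding to open() buys, under stated assumptions: the sample line is a made-up rule in the style of Latin-ASCII.xml, and cp1252 stands in for whatever the platform default happens to be. io.open() is used because it accepts encoding= on both Python 2 and 3.

```python
import io
import os
import tempfile

# Hypothetical rule line containing non-ASCII characters (U+0100 and U+2192).
sample = u"\u0100 \u2192 A ;\n"

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # Write the file as UTF-8, the encoding the real input files use.
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Reading with the right encoding round-trips the text.
    with io.open(path, "r", encoding="utf-8") as f:
        as_utf8 = f.read()

    # Reading the same bytes with a legacy code page silently yields mojibake.
    with io.open(path, "r", encoding="cp1252", errors="replace") as f:
        as_cp1252 = f.read()
finally:
    os.remove(path)

print(as_utf8 == sample)
print(as_cp1252 == sample)
```

Without encoding=, which of the two results you get depends on the platform default.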
--
Michael

Attachment

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Ramanarayana
Date:
Hi Michael,
The issue was that the Python script worked with Python 2 but not with Python 3 on Windows. This is because the script writes its final output to stdout, and the stdout encoding is set to UTF-8 only for Python 2, not for Python 3. If no encoding is set for stdout, Python takes the encoding from the operating system. The default encoding on Linux and Windows can differ, hence this issue.
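
To see which encoding a given environment picks, something like the following can be run on each platform. It is only a diagnostic sketch; the function name stdout_encoding_report is made up for illustration, and the values it prints vary by system.

```python
import locale
import sys


def stdout_encoding_report():
    """Collect the encodings Python would use by default on this system."""
    return {
        # Encoding attached to the current stdout stream (may be None
        # if stdout has been replaced by an object without one).
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding Python derives from the operating system locale.
        "locale": locale.getpreferredencoding(False),
    }


print(stdout_encoding_report())
```

On Python 3, setting the environment variable PYTHONIOENCODING=utf-8 before running a script is another way to override the stdout encoding without changing the code.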
Regards,
Ram.



--
Cheers
Ram 4.0

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Hugh Ranalli
Date:
On Tue, 12 Feb 2019 at 08:54, Ramanarayana <raam.soft@gmail.com> wrote:
> Hi Michael,
> The issue was that the python script was working in python 2 but not in python 3 in Windows. This is because the python script writes the final output to stdout and stdout encoding is set to utf-8 only for python 2 but not python 3. If no encoding is set for stdout it takes the encoding from the Operating system. Default encoding in linux and windows might be different. Hence this issue.
> Regards,
> Ram.
>
> On Tue, 12 Feb 2019 at 09:48, Michael Paquier <michael@paquier.xyz> wrote:
> > On Tue, Feb 12, 2019 at 02:27:31AM +0530, Ramanarayana wrote:
> > > I tested the script in python 2.7 and it works perfect. The problem is in
> > > python 3.7 (and may be only in windows as you were not getting the issue)
> > > and I was getting the following error
> > >
> > > UnicodeEncodeError: 'charmap' codec can't encode character '\u0100' in
> > > position 0: character maps to <undefined>
> > >
> > > I went through the python script and found that the stdout encoding is set
> > > to utf-8 only if python version is <=2.
> > >
> > > I have made the same change for python version 3 as well. Please find the
> > > patch for the same. Let me know if it makes sense
> >
> > Isn't that because Windows encoding becomes cp1252, utf16 or such?
> > FWIW, on Debian SID with Python 3.7, I get the correct output, and no
> > diffs on HEAD.  Perhaps it would make sense to use open() on the
> > different files with encoding='utf-8' to avoid any kind of problems?
> > --
> > Michael

I can't look at this today, but will fire up Windows and Python tomorrow, look at Ram's patch, and see what is going on. I'll also look at how we open the input files, to see if we should supply an encoding. Those input files only make sense as UTF-8 anyway.

Ram, thanks for catching this issue.

Hugh

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Hugh Ranalli
Date:
On Mon, 11 Feb 2019 at 15:57, Ramanarayana <raam.soft@gmail.com> wrote:
> Hi Hugh,
>
> I tested the script in python 2.7 and it works perfect. The problem is in python 3.7 (and may be only in windows as you were not getting the issue) and I was getting the following error
>
> UnicodeEncodeError: 'charmap' codec can't encode character '\u0100' in position 0: character maps to <undefined>
>
> I went through the python script and found that the stdout encoding is set to utf-8 only if python version is <=2.
>
> I have made the same change for python version 3 as well. Please find the patch for the same. Let me know if it makes sense
>
> Regards,
> Ram

Hi Ram,
I took a look at this, and unfortunately the proposed fix breaks Python 2 (sys.stdout.encoding isn't a writable attribute in Python 2) :-(. I've attached a patch which is compatible with both versions, and have confirmed that the output is identical across Python 2 and 3, and across Windows and Linux once the difference in line endings is accounted for.
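
For readers without the attachment, one common way to get UTF-8 output on both Python versions is to wrap the byte stream with a codecs writer. This is a sketch of that general technique, not necessarily what the attached patch does; the helper name utf8_writer and the sample rule line are assumptions for illustration.

```python
import codecs
import io


def utf8_writer(stream):
    """Wrap a stream so that text written to it is encoded as UTF-8."""
    # Python 3 text streams expose the underlying byte stream as .buffer;
    # Python 2's sys.stdout accepts bytes directly and has no such attribute.
    raw = getattr(stream, "buffer", stream)
    return codecs.getwriter("utf-8")(raw)


# Demonstration against an in-memory byte stream instead of the real stdout.
sink = io.BytesIO()
out = utf8_writer(sink)
out.write(u"\u0100\tA\n")  # sample rule line: accented character, tab, replacement
print(repr(sink.getvalue()))
```

In a script, the same helper would be applied once near the top as sys.stdout = utf8_writer(sys.stdout), after which the output bytes no longer depend on the platform default encoding.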

I've also opened the Unicode data file in UTF-8 and added a "with" block which ensures we close the file when we are done with it. The change makes the Python 2 compatibility a little more complex (2 blocks to remove), but it's the cleanest I could achieve.
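
The "with" pattern described above looks roughly like this. It is a sketch under assumptions: the function name read_decompositions is hypothetical, and only the first and sixth semicolon-separated fields of a UnicodeData.txt-style line are used (code point and decomposition).

```python
import io
import os
import tempfile

# One sample line in UnicodeData.txt format, used here only for illustration.
SAMPLE = (u"0100;LATIN CAPITAL LETTER A WITH MACRON;Lu;0;L;0041 0304;;;;N;"
          u"LATIN CAPITAL LETTER A MACRON;;;0101;\n")


def read_decompositions(path):
    """Map each code point to the code points in its decomposition field."""
    table = {}
    # The with block closes the file even if parsing raises an exception.
    with io.open(path, "r", encoding="utf-8") as data_file:
        for line in data_file:
            fields = line.split(";")
            if len(fields) > 5 and fields[5]:
                # Skip compatibility tags such as <compat>; keep hex code points.
                parts = [p for p in fields[5].split() if not p.startswith("<")]
                table[int(fields[0], 16)] = [int(p, 16) for p in parts]
    return table


fd, sample_path = tempfile.mkstemp()
os.close(fd)
try:
    with io.open(sample_path, "w", encoding="utf-8") as f:
        f.write(SAMPLE)
    decompositions = read_decompositions(sample_path)
finally:
    os.remove(sample_path)

print(decompositions)
```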

The attached patch goes on top of patch 02 (not on top of the broken, committed 03). I'm hoping that's not a problem. If it is, let me know and I'll factor out the changes.

Please let me know if you have any questions.

Best wishes,
Hugh

Attachment

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Ramanarayana
Date:
Hi Hugh,

The patch I submitted was tested with both Python 2 and 3, and it worked for me. The single line of code
added in the patch runs only under Python 3, so I don't think it can break Python 2. I would like to see the error you got with Python 2. Good to know that the reported issue is a valid one on Windows. I tested your patch as well, and it also works fine.
-- 
Cheers
Ram 4.0

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From
Michael Paquier
Date:
On Sun, Feb 17, 2019 at 12:45:39PM +0530, Ramanarayana wrote:
> The patch I submitted was tested both in python 2 and 3 and it worked for
> me. The single line of code
> added in the patch runs only in python 3. I dont think it can break
> python2. Would like to see the error you got in python 2. Good to know the
> reported issue is a valid one in windows. I tested your patch as well and
> it is also working fine.

I can see that the commit fest entry associated with this thread has
been switched back from "Committed" to "Needs review", with Thomas
Munro still associated as committer.  The thing is that we have
already committed all the bits discussed here, so I am switching
the status back to "Committed", which reflects the state of the thread.  If
you have a set of fixes for what has been pushed regarding Windows and
Python 2/3 capabilities, I would suggest creating a new entry with
yourself as the author.  Starting a new thread would also be nice, so
that you attract the correct audience; this thread, initially about
diacritical character support for unaccent, has been used more than
enough now.

Python 2/3 support for this script is easy enough to check on Linux,
and now you are adding Windows in the mix...

Thanks,
--
Michael

Attachment