Thread: Initial ugly reverse-translator

Initial ugly reverse-translator

From
Craig Ringer
Date:
Hi all

I've chucked together a quick and very ugly script to read the .po files
from the backend and produce a simple database to map translations back
to the original strings and their source locations. It's a very dirty
.po reader that doesn't try to parse the format properly, but it does
the job. There's no search interface yet, this is just intended to get
to the point where useful queries can be run on the data and the most
effective queries can be figured out.

Right now queries against errors without format-string substitutions
work ok, if not great, with pg_tgrm based lookups, eg:

test=# SELECT message_id, is_format, message, translation
test-# FROM po_translation INNER JOIN po_message ON
po_translation.message_id = po_message.id INNER JOIN
test-# WHERE  'el valor de array debe comenzar con «{» o información de
dimensión' % translation
test-# ORDER BY similarity('el valor de array debe comenzar con «{» o
información de dimensión', translation) desc;

 message_id | is_format |
message                           |                             translation

------------+-----------+------------------------------------------------------------+---------------------------------------------------------------------
       4470 | f         | array value must start with \"{\" or dimension
information | el valor de array debe comenzar con «{» o información de
dimensión"
       4437 | f         | argument must be empty or one-dimensional
array            | el argumento debe ser vacío o un array unidimensional"
(2 rows)

test=# SELECT DISTINCT srcfile, srcline FROM po_location WHERE
message_id = 4437;
                           srcfile                           | srcline
-------------------------------------------------------------+---------
 /a/pgsql/HEAD/pgtst/src/backend/utils/adt/array_userfuncs.c |     121
 utils/adt/array_userfuncs.c                                 |      99
 utils/adt/array_userfuncs.c                                 |     121
 utils/adt/array_userfuncs.c                                 |     124
(4 rows)

It's also useful for format-string based messages, but more thought is
needed on how best to handle them. A LIKE query using the format-string
message as the pattern (after converting the pattern syntax to SQL
style) would be (a) slow and (b) very sensitive to formatting and other
variation. I haven't spent any time on that bit yet, but if anybody has
any ideas I'd be glad to hear them.

Anyway, the initial version of the script can be found at:

http://www.postnewspapers.com.au/~craig/poread.py

Consider running it in a new database as it's extremely poorly tested,
written very quickly and dirtily, and contains DDL commands. The schema
can be found inline in the script. The psycopg2 Python module is
required, and the pg_tgrm contrib module must be loaded in the database
you use the script with.

Once I'm happy with the queries for translation lookups I'll bang
together a quick web interface for the script and clean it up. At that
point it might start being useful to people here.

--
Craig Ringer

Re: Initial ugly reverse-translator

From
Tom Lane
Date:
Craig Ringer <craig@postnewspapers.com.au> writes:
> It's also useful for format-string based messages, but more thought is
> needed on how best to handle them. A LIKE query using the format-string
> message as the pattern (after converting the pattern syntax to SQL
> style) would be (a) slow and (b) very sensitive to formatting and other
> variation. I haven't spent any time on that bit yet, but if anybody has
> any ideas I'd be glad to hear them.

I don't really see the problem.  I assume from your reference to pg_trgm
that you're using trigram similarity as the prefilter for potential
matches, so a slow final LIKE match shouldn't be an issue really.
(And besides, speed doesn't seem like the be-all and end-all here.)

AFAICS you just need to translate %-string format escapes to %, quote
any other % or _, and away you go.

One thing that might be worth doing is avoiding spacing sensitivity,
since whitespace is frequently mangled in copy-and-paste.  Perhaps
strip all spaces from both strings before matching?

            regards, tom lane

Re: Initial ugly reverse-translator

From
Craig Ringer
Date:
Tom Lane wrote:

> I don't really see the problem.  I assume from your reference to pg_trgm
> that you're using trigram similarity as the prefilter for potential
> matches

It turns out that's no good anyway, as it appears to ignore characters
outside the ASCII range. Rather less than useful for searching a
database of translated strings ;-)

> so a slow final LIKE match shouldn't be an issue really.
> (And besides, speed doesn't seem like the be-all and end-all here.)

True. It's not so much the speed as the fragility when faced with small
changes to formatting. In addition to whitespace, some clients mangle
punctuation with features like automatic "curly"-quoting.

> AFAICS you just need to translate %-string format escapes to %, quote
> any other % or _, and away you go.
>
> One thing that might be worth doing is avoiding spacing sensitivity,
> since whitespace is frequently mangled in copy-and-paste.  Perhaps
> strip all spaces from both strings before matching?

Yep, that sounds pretty reasonable. As usual I'm making things more
complicated than they need to be. I suspect it'll be necessary to strip
quotes and some other punctuation too, but that's not a big deal.

--
Craig Ringer

Re: Initial ugly reverse-translator

From
Tom Lane
Date:
Craig Ringer <craig@postnewspapers.com.au> writes:
> Tom Lane wrote:
>> I don't really see the problem.  I assume from your reference to pg_trgm
>> that you're using trigram similarity as the prefilter for potential
>> matches

> It turns out that's no good anyway, as it appears to ignore characters
> outside the ASCII range. Rather less than useful for searching a
> database of translated strings ;-)

A quick look at the pg_trgm code suggests that it is only prepared to
deal with single-byte encodings; if you're working in UTF8, which I
suppose you'd have to be, it's dead in the water :-(.  Perhaps fixing
that should be on the TODO list.

But in any case maybe the full-text-search stuff would be more useful
as a prefilter?  Although honestly, for the speed we need here, I'm
not sure a prefilter is needed at all.  Full text might be useful
if a LIKE-based match fails, though.

>> (And besides, speed doesn't seem like the be-all and end-all here.)

> True. It's not so much the speed as the fragility when faced with small
> changes to formatting. In addition to whitespace, some clients mangle
> punctuation with features like automatic "curly"-quoting.

Yeah.  I was wondering whether encoding differences wouldn't be a huge
problem in practice, as well.

            regards, tom lane

Re: Initial ugly reverse-translator

From
Oleg Bartunov
Date:
On Sat, 19 Apr 2008, Tom Lane wrote:

> Craig Ringer <craig@postnewspapers.com.au> writes:
>> Tom Lane wrote:
>>> I don't really see the problem.  I assume from your reference to pg_trgm
>>> that you're using trigram similarity as the prefilter for potential
>>> matches
>
>> It turns out that's no good anyway, as it appears to ignore characters
>> outside the ASCII range. Rather less than useful for searching a
>> database of translated strings ;-)
>
> A quick look at the pg_trgm code suggests that it is only prepared to
> deal with single-byte encodings; if you're working in UTF8, which I
> suppose you'd have to be, it's dead in the water :-(.  Perhaps fixing
> that should be on the TODO list.

as well as ltree. they are in our todo list:
http://www.sai.msu.su/~megera/wiki/TODO

>
> But in any case maybe the full-text-search stuff would be more useful
> as a prefilter?  Although honestly, for the speed we need here, I'm
> not sure a prefilter is needed at all.  Full text might be useful
> if a LIKE-based match fails, though.
>
>>> (And besides, speed doesn't seem like the be-all and end-all here.)
>
>> True. It's not so much the speed as the fragility when faced with small
>> changes to formatting. In addition to whitespace, some clients mangle
>> punctuation with features like automatic "curly"-quoting.
>
> Yeah.  I was wondering whether encoding differences wouldn't be a huge
> problem in practice, as well.
>
>             regards, tom lane
>
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: Initial ugly reverse-translator

From
Craig Ringer
Date:
Tom Lane wrote:

>> True. It's not so much the speed as the fragility when faced with small
>> changes to formatting. In addition to whitespace, some clients mangle
>> punctuation with features like automatic "curly"-quoting.
>
> Yeah.  I was wondering whether encoding differences wouldn't be a huge
> problem in practice, as well.

I'm not *too* worried about text encoding issues. In general it's very
obvious when text has been mangled due to bad encoding handling, and
it's extremely rare to see anything subtle like an app that transforms
accented chars to their base variants. Demangling strings damaged by bad
encoding handling is way out of scope, and sometimes not possible anyway.

I guess that UTF-8's delightful support for various composed and
decomposed forms of same glyph might be a problem. It's something I may
face in some other works I'm doing too, so I might have to see how hard
it'd be to put together a DB function that normalizes a UTF-8 string to
its fully composed variant. I don't think the decomposed forms see much
use in the wild though; they mostly come up as a security issue for
path/URL matching and the like.

http://unicode.org/reports/tr15/
http://msdn2.microsoft.com/en-us/library/ms776393(VS.85).aspx
http://earthlingsoft.net/ssp/blog/2006/07/unicode_normalisation

I don't know much about the CJK text representations, though, either in
Unicode or in other encodings like Big5 . I *hope* the Unicode
normalization rules will be enough there but I'm not sure.

All strings must be converted from their original encoding to utf-8 for
queries of course. That might be troublesome when using something like a
web form where it might be hard to know the encoding of the input text
(and where browser bugs are the rule rather than the exception) but it's
thankfully not necessary to cater to every weird and broken browser.

So in this case I don't think encodings will be *too* much trouble
unless alternate unicode normalization forms turn out to be more common
than I think they are.

--
Craig Ringer

Re: Initial ugly reverse-translator

From
pepone.onrez
Date:
On Sat, Apr 19, 2008 at 6:10 PM, Oleg Bartunov <oleg@sai.msu.su> wrote:
> On Sat, 19 Apr 2008, Tom Lane wrote:
>
>> Craig Ringer <craig@postnewspapers.com.au> writes:
>>>
>>> Tom Lane wrote:
>>>>
>>>> I don't really see the problem.  I assume from your reference to pg_trgm
>>>> that you're using trigram similarity as the prefilter for potential
>>>> matches
>>
>>> It turns out that's no good anyway, as it appears to ignore characters
>>> outside the ASCII range. Rather less than useful for searching a
>>> database of translated strings ;-)
>>
>> A quick look at the pg_trgm code suggests that it is only prepared to
>> deal with single-byte encodings; if you're working in UTF8, which I
>> suppose you'd have to be, it's dead in the water :-(.  Perhaps fixing
>> that should be on the TODO list.
>
> as well as ltree. they are in our todo list:
> http://www.sai.msu.su/~megera/wiki/TODO
>

Hi Oleg

In your TODO list says that UTF8 was added to ltree, is this code
currently available for download?

Regards,
José
>>
>> But in any case maybe the full-text-search stuff would be more useful
>> as a prefilter?  Although honestly, for the speed we need here, I'm
>> not sure a prefilter is needed at all.  Full text might be useful
>> if a LIKE-based match fails, though.
>>
>>>> (And besides, speed doesn't seem like the be-all and end-all here.)
>>
>>> True. It's not so much the speed as the fragility when faced with small
>>> changes to formatting. In addition to whitespace, some clients mangle
>>> punctuation with features like automatic "curly"-quoting.
>>
>> Yeah.  I was wondering whether encoding differences wouldn't be a huge
>> problem in practice, as well.
>>
>>                        regards, tom lane
>>
>>
>
>        Regards,
>                Oleg
> _____________________________________________________________
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>

Re: Initial ugly reverse-translator

From
Oleg Bartunov
Date:
Hi,

ltree and pg_trgm with UTF8 support are available from CVS HEAD, see
See http://archives.postgresql.org/pgsql-committers/2008-06/msg00356.php
http://archives.postgresql.org/pgsql-committers/2008-11/msg00139.php

Oleg
On Fri, 16 Jan 2009, pepone.onrez wrote:

> On Sat, Apr 19, 2008 at 6:10 PM, Oleg Bartunov <oleg@sai.msu.su> wrote:
>> On Sat, 19 Apr 2008, Tom Lane wrote:
>>
>>> Craig Ringer <craig@postnewspapers.com.au> writes:
>>>>
>>>> Tom Lane wrote:
>>>>>
>>>>> I don't really see the problem.  I assume from your reference to pg_trgm
>>>>> that you're using trigram similarity as the prefilter for potential
>>>>> matches
>>>
>>>> It turns out that's no good anyway, as it appears to ignore characters
>>>> outside the ASCII range. Rather less than useful for searching a
>>>> database of translated strings ;-)
>>>
>>> A quick look at the pg_trgm code suggests that it is only prepared to
>>> deal with single-byte encodings; if you're working in UTF8, which I
>>> suppose you'd have to be, it's dead in the water :-(.  Perhaps fixing
>>> that should be on the TODO list.
>>
>> as well as ltree. they are in our todo list:
>> http://www.sai.msu.su/~megera/wiki/TODO
>>
>
> Hi Oleg
>
> In your TODO list says that UTF8 was added to ltree, is this code
> currently available for download?
>
> Regards,
> JosЪЪ
>>>
>>> But in any case maybe the full-text-search stuff would be more useful
>>> as a prefilter?  Although honestly, for the speed we need here, I'm
>>> not sure a prefilter is needed at all.  Full text might be useful
>>> if a LIKE-based match fails, though.
>>>
>>>>> (And besides, speed doesn't seem like the be-all and end-all here.)
>>>
>>>> True. It's not so much the speed as the fragility when faced with small
>>>> changes to formatting. In addition to whitespace, some clients mangle
>>>> punctuation with features like automatic "curly"-quoting.
>>>
>>> Yeah.  I was wondering whether encoding differences wouldn't be a huge
>>> problem in practice, as well.
>>>
>>>                        regards, tom lane
>>>
>>>
>>
>>        Regards,
>>                Oleg
>> _____________________________________________________________
>> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
>> Sternberg Astronomical Institute, Moscow University, Russia
>> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
>> phone: +007(495)939-16-83, +007(495)939-23-83
>>
>> --
>> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-general
>>
>
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83