Re: [PROPOSAL] Improvements of Hunspell dictionaries support - Mailing list pgsql-hackers

From Artur Zakirov
Subject Re: [PROPOSAL] Improvements of Hunspell dictionaries support
Date
Msg-id 5691440F.2030502@postgrespro.ru
Whole thread Raw
In response to Re: [PROPOSAL] Improvements of Hunspell dictionaries support  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-hackers
<div class="moz-cite-prefix">Thanks for review.<br /><br /> On 09.01.2016 02:04, Alvaro Herrera wrote:<br
/></div><blockquotecite="mid:20160108230434.GA652291@alvherre.pgsql" type="cite"><pre wrap="">Artur Zakirov wrote:
 
</pre><blockquote type="cite"><pre wrap="">--- 74,80 ----  typedef struct aff_struct {
!     uint32        flag:16,                 type:1,                 flagflags:7,                 issimple:1,
</pre></blockquote><pre wrap="">
By doing this, you're using 40 bits of a 32-bits-wide field.  What does
this mean?  Are the final 8 bits lost?  Does the compiler allocate a
second uint32 member for those additional bits?  I don't know, but I
don't think this is a very clean idea.</pre></blockquote> No, 8 bits are not lost. This 8 bits are used if a dictionary
usesdouble extended ASCII character flag type (Conf->flagMode == FM_LONG) or decimal number flag type
(Conf->flagMode== FM_NUM). If a dictionary uses single extended ASCII character flag type (Conf->flagMode ==
FM_CHAR),then 8 bits lost.<br /><br /> You can see it in decodeFlag function. This function is used in
NIImportOOAffixesfunction, decode affix flag from string type and store in flag field (flag:16).<br /><blockquote
cite="mid:20160108230434.GA652291@alvherre.pgsql"type="cite"><pre wrap="">
 

</pre><blockquote type="cite"><pre wrap="">  typedef struct spell_struct {
!     struct     {
!         int            affix;
!         int            len;
!     }            d;
!     /*
!      * flag is filled in by NIImportDictionary. After NISortDictionary, d
!      * is valid and flag is invalid.
!      */
!     char       *flag;     char        word[FLEXIBLE_ARRAY_MEMBER]; } SPELL;
</pre></blockquote><pre wrap="">
Here you removed the union, with no rationale for doing so.  Why did you
do it?  Can it be avoided?  Because of the comment, I'd imagine that d
and flag are valid at different times, so at any time we care about only
one of them; but you haven't updated the comment stating the reason for
that no longer to be the case.  I suspect you need to keep flag valid
after NISortDictionary has been called, but if so why?  If "flag" is
invalid as the comment says, what's the reason for keeping it?
</pre></blockquote> Union was removed because the flag field need to be dynamically sized. It had 16 size in the
previousversion. In this field flag set are stored. For example, if .dict file has the entry:<br /><br />
mitigate/NDSGny<br/><br /> Then the "NDSGny" is stored in the flag field.<br /><br /> But in some cases a flag set can
havesize bigger than 16. I added this changes after this message <a class="moz-txt-link-freetext"
href="http://www.postgresql.org/message-id/CAE2gYzwom3=11U9G8ZxMT5PLkZrwb12BWzxh4dB3HUd89FOSrg@mail.gmail.com">http://www.postgresql.org/message-id/CAE2gYzwom3=11U9G8ZxMT5PLkZrwb12BWzxh4dB3HUd89FOSrg@mail.gmail.com</a><br
/>In that Turkish dictionary there are can be large flag set. For example:<br /><br />
özek/2240,852,749,5026,2242,4455,2594,2597,4963,1608,494,2409<br/><br /> This flag set has 56 size.<br /> This "flag"
isvalid all the time. It is used in NISortDictionary and it is not used after NISortDictionary function has been
called.Maybe you right and there are no reason for keeping it, and it is necessary to store all flags in separate
variable,that will be deleted after NISortDictionary has been called.<br /><br /><blockquote
cite="mid:20160108230434.GA652291@alvherre.pgsql"type="cite"><pre wrap="">The routines decodeFlag and isAffixFlagInUse
coulddo with more
 
comments.  Your patch adds zero.  Actually the whole file has not nearly
enough comments; adding some more would be very good.

Actually, after some more reading, I think this code is pretty terrible.
I have a hard time figuring out how the original works, which makes it
even more difficult to figure out whether your changes make sense.  I
would have to take your patch on faith, which doesn't sound so great an
idea.

palloc / cpalloc / tmpalloc make the whole mess even more confusing.
Why does this file have three ways to allocate memory?

Not sure what's a good way to go about this.  I am certainly not going
to commit this as is, because if I do whatever bugs you have are going
to become my problem; and with the severe lack of documentation and
given how fiddly this stuff is, I bet there are going to be a bunch of
bugs.  I suspect most committers are going to be in the same position.
I think you should start by adding a few comments here and there on top
of the original to explain how it works, then your patch on top.  I
suppose it's going to be a lot of work for you but I don't see any other
way.

A top-level overview about it would be good, too.  The current comment
at top of file states:
* spell.c*        Normalizing word with ISpell

which is, err, somewhat laconic.

</pre></blockquote> I will provide comments and explain how it works in comments. Maybe I will add some explanation
aboutdictionaries structure. I will update the patch soon.<br /><br /><pre class="moz-signature" cols="72">-- 
 
Artur Zakirov
Postgres Professional: <a class="moz-txt-link-freetext"
href="http://www.postgrespro.com">http://www.postgrespro.com</a>
Russian Postgres Company</pre>

pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: [COMMITTERS] pgsql: Blind attempt at a Cygwin fix
Next
From: Artur Zakirov
Date:
Subject: Re: [PROPOSAL] Improvements of Hunspell dictionaries support