Thread: Preliminary patch for tsearch example dictionaries/parsers in contrib

Preliminary patch for tsearch example dictionaries/parsers in contrib

From
Tom Lane
Date:
Given all the flap about txid, this surely mustn't go in without public
review first ;-).  So, here is a submission from Sergey Karpov to fill
in the lack of any working code examples for user-written tsearch
parsers and dictionaries.

I will be mostly off-line for the next day or so and don't have time to
work on this more now, but here are a few comments:

* It seems a bit odd to put multiple independent contrib modules under a
single subfolder.  I'd be inclined to drop the ts_pack layer and just
make the dictionaries and parser be top-level contrib modules.

* Depending on PCRE, when we have an at-least-equally-good regex engine
built in, is silly.  It's an unnecessary dependency and to the (minor)
extent that the regex syntax is different, we'd have to document the
discrepancies.

* dict_regex is not nearly up to speed on encoding or locale issues.
I didn't look at the other ones too closely, they may or may not need
similar adjustments.

* Allowing config files to be read from anywhere is not acceptable.
We have dealt with this in the core code and the contrib examples
*must* follow the same rules.

* The whole "utils" part of dict_regex should probably go away; it
is reinventing wheels that already exist in the Postgres backend
environment.  Since these are meant to be code examples, they should
show the best ways of doing things within Postgres.

            regards, tom lane


Attachment

Re: Preliminary patch for tsearch example dictionaries/parsers in contrib

From
karpov@sao.ru (Sergey V. Karpov)
Date:
Hi Tom,

Thank you for starting the discussion

> Given all the flap about txid, this surely mustn't go in without public
> review first ;-).  So, here is a submission from Sergey Karpov to fill
> in the lack of any working code examples for user-written tsearch
> parsers and dictionaries.
>
> I will be mostly off-line for the next day or so and don't have time to
> work on this more now, but here are a few comments:
>
> * It seems a bit odd to put multiple independent contrib modules under a
> single subfolder.  I'd be inclined to drop the ts_pack layer and just
> make the dictionaries and parser be top-level contrib modules.

Yes, I understand your position, as well as Magnus' complaints. However,
putting all the code to its own contribs is not the best solution, as
the majority of it is no more than examples. dict_regex, on the
contrary, is an add-on very useful in some situations (and we actually
use in in our projects). Also, its requirements differ from the rest of
the dictionaries, see below.

So, what about the following layout:

 - contrib/ts_examples - single module which contains all the example
   stuff in a single folder, to be built together

 - contrib/dict_regex - separate contrib

> * Depending on PCRE, when we have an at-least-equally-good regex engine
> built in, is silly.  It's an unnecessary dependency and to the (minor)
> extent that the regex syntax is different, we'd have to document the
> discrepancies.

Built-in regex engine seems to not support the one feature critical to
the dict_regex operation - it is not able to report the "partial match"
in a case when the matching fails solely due to premature end of input
string (i.e. when matching may possibly succeed after adding some data
to the string).

If it is possible to achieve this behaviour with built-in engine, please
point me to the right direction.

> * dict_regex is not nearly up to speed on encoding or locale issues.
> I didn't look at the other ones too closely, they may or may not need
> similar adjustments.
>
> * Allowing config files to be read from anywhere is not acceptable.
> We have dealt with this in the core code and the contrib examples
> *must* follow the same rules.

Is it necessary to require this behaviour from each contrib module? They
are not core code, and usually solve application-level tasks - is it
optimal to store the application config files in postgres tree?

Also, these dictionaries need some example config files at the
regression test time, and these configs are of no sense for anyone - is
it good to pullute the system tree with them?

On the other hand, to prevent reading arbitrary files we may require the
specific header line which identifies these dictionary configs.

> * The whole "utils" part of dict_regex should probably go away; it
> is reinventing wheels that already exist in the Postgres backend
> environment.  Since these are meant to be code examples, they should
> show the best ways of doing things within Postgres.

Yes, you are right. I'll rewrite it using StringInfo (the "official"
string-handling layer, right?).

Sincerely your,

Sergey Karpov


Re: Preliminary patch for tsearch example dictionaries/parsers in contrib

From
Magnus Hagander
Date:
On Tue, Oct 09, 2007 at 02:02:03PM +0400, Sergey V. Karpov wrote:
>
> Hi Tom,
>
> Thank you for starting the discussion
>
> > Given all the flap about txid, this surely mustn't go in without public
> > review first ;-).  So, here is a submission from Sergey Karpov to fill
> > in the lack of any working code examples for user-written tsearch
> > parsers and dictionaries.
> >
> > I will be mostly off-line for the next day or so and don't have time to
> > work on this more now, but here are a few comments:
> >
> > * It seems a bit odd to put multiple independent contrib modules under a
> > single subfolder.  I'd be inclined to drop the ts_pack layer and just
> > make the dictionaries and parser be top-level contrib modules.
>
> Yes, I understand your position, as well as Magnus' complaints. However,
> putting all the code to its own contribs is not the best solution, as
> the majority of it is no more than examples. dict_regex, on the
> contrary, is an add-on very useful in some situations (and we actually
> use in in our projects). Also, its requirements differ from the rest of
> the dictionaries, see below.
>
> So, what about the following layout:
>
>  - contrib/ts_examples - single module which contains all the example
>    stuff in a single folder, to be built together
>
>  - contrib/dict_regex - separate contrib

I haven't actually looked at the code. The problem from the MSVC
perspective is that it generates one output file per directory. So we'll
need to special-case it somehow. But that can certainly be done.

But since we're on the topic (not here, but in the other thread really) of
what should be in /contrib and what shouldn't - shuld contrib/ts_examples
realy be in contrib or could they live on pgfoundry? With just dict_regex
in contrib?

> > * Depending on PCRE, when we have an at-least-equally-good regex engine
> > built in, is silly.  It's an unnecessary dependency and to the (minor)
> > extent that the regex syntax is different, we'd have to document the
> > discrepancies.
>
> Built-in regex engine seems to not support the one feature critical to
> the dict_regex operation - it is not able to report the "partial match"
> in a case when the matching fails solely due to premature end of input
> string (i.e. when matching may possibly succeed after adding some data
> to the string).
>
> If it is possible to achieve this behaviour with built-in engine, please
> point me to the right direction.

Can't comment on that since I don't know it, but perhaps it's something we
can pull in from upstream and add to our regex engine? (But that would
probably make this 8.4 material for sure, no?)

> > * dict_regex is not nearly up to speed on encoding or locale issues.
> > I didn't look at the other ones too closely, they may or may not need
> > similar adjustments.
> >
> > * Allowing config files to be read from anywhere is not acceptable.
> > We have dealt with this in the core code and the contrib examples
> > *must* follow the same rules.
>
> Is it necessary to require this behaviour from each contrib module? They
> are not core code, and usually solve application-level tasks - is it
> optimal to store the application config files in postgres tree?

Yes. We've been thruogh that many times wrt adminpack, and we don't want to
do that again :-)
See convert_and_check_filename() in adminpack.c.


> Also, these dictionaries need some example config files at the
> regression test time, and these configs are of no sense for anyone - is
> it good to pullute the system tree with them?
>
> On the other hand, to prevent reading arbitrary files we may require the
> specific header line which identifies these dictionary configs.

Depending on how you deal with it, that will let the user know if a file
exists (you'd have to give the "file not found" error or similar even if
the file is there but doesn't have the header, which is likely to confuse
the user). It can be worked around, but isn't necessarily easy.

//Magnus

Re: Preliminary patch for tsearch example dictionaries/parsers in contrib

From
karpov@sao.ru (Sergey V. Karpov)
Date:
Magnus Hagander <magnus@hagander.net> writes:

>> > * Allowing config files to be read from anywhere is not acceptable.
>> > We have dealt with this in the core code and the contrib examples
>> > *must* follow the same rules.
>>
>> Is it necessary to require this behaviour from each contrib module? They
>> are not core code, and usually solve application-level tasks - is it
>> optimal to store the application config files in postgres tree?
>
> Yes. We've been thruogh that many times wrt adminpack, and we don't want to
> do that again :-)
> See convert_and_check_filename() in adminpack.c.

Ok, I understand.

In this case, the simplest way for me is to use
get_tsearch_config_filename(), which assumes files in
share/tsearch_data/, like standard tsearch dictionaries do. Or contrib
modules have to keep their files strictly in share/contrib/?

Sergey Karpov

Re: Preliminary patch for tsearch example dictionaries/parsers in contrib

From
Andrew Dunstan
Date:

Sergey V. Karpov wrote:
> Built-in regex engine seems to not support the one feature critical to
> the dict_regex operation - it is not able to report the "partial match"
> in a case when the matching fails solely due to premature end of input
> string (i.e. when matching may possibly succeed after adding some data
> to the string).
>
> If it is possible to achieve this behaviour with built-in engine, please
> point me to the right direction.
>
>

Adding new code at this stage of the process is bad enough. Adding new
code which adds a library dependency we have not previously had at this
stage of the process is quite unacceptable IMNSHO, and I will protest
very loudly about it. Quite apart from anything else it will almost
certainly break many buildfarm members.

Are we in beta or not? To me, beta means nothing but bug fixes go in,
period. No ifs, no buts, no maybes, no exceptions. And that should
definitely go for contrib as well. We need a bit of self-discipline
around here.

cheers

andrew

Re: Preliminary patch for tsearch example dictionaries/parsers in contrib

From
Oleg Bartunov
Date:
On Tue, 9 Oct 2007, Andrew Dunstan wrote:

>
>
> Sergey V. Karpov wrote:
>> Built-in regex engine seems to not support the one feature critical to
>> the dict_regex operation - it is not able to report the "partial match"
>> in a case when the matching fails solely due to premature end of input
>> string (i.e. when matching may possibly succeed after adding some data
>> to the string).
>>
>> If it is possible to achieve this behaviour with built-in engine, please
>> point me to the right direction.
>>
>>
>
> Adding new code at this stage of the process is bad enough. Adding new code
> which adds a library dependency we have not previously had at this stage of
> the process is quite unacceptable IMNSHO, and I will protest very loudly
> about it. Quite apart from anything else it will almost certainly break many
> buildfarm members.

good point.

>
> Are we in beta or not? To me, beta means nothing but bug fixes go in,
> period. No ifs, no buts, no maybes, no exceptions. And that should
> definitely go for contrib as well. We need a bit of self-discipline around
> here.

the original intention was to have text search companion with code examples
for dictionaries and parser API and ability to keep them in workable
condition, instead of having them in SGML documentation.
Of course, buildfarm checking is  very important for this purpose and
I think we should remove dict_regex from ts_pack, but leave
README.dict_regex, so interested users could find it.

>
> cheers
>
> andrew
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: Preliminary patch for tsearch example dictionaries/parsers in contrib

From
Tom Lane
Date:
Oleg Bartunov <oleg@sai.msu.su> writes:
> On Tue, 9 Oct 2007, Andrew Dunstan wrote:
>> Are we in beta or not? To me, beta means nothing but bug fixes go in,
>> period. No ifs, no buts, no maybes, no exceptions. And that should
>> definitely go for contrib as well. We need a bit of self-discipline around
>> here.

> the original intention was to have text search companion with code examples
> for dictionaries and parser API and ability to keep them in workable
> condition, instead of having them in SGML documentation.

Yes.  I think that we can justify new contrib code here as a documentation
bug fix: right now, the examples in sections 12.9 and 12.10 are wrong
(obsolete), impossible to maintain (which is why they're obsolete;
they failed to track code changes), and not in a format that's directly
useful as a template for new code.

However, the dict_regex code is not ready for a number of reasons,
quite aside from the question of whether we want to add a dependency.
I concur with the idea of dropping it for now.

I'm a bit inclined to drop dict_roman as well, as it seems of
questionable real use, and code-example-wise it adds nothing over
dict_intdict.  That would leave us with two dictionary examples and one
parser example, and I think both of the dictionaries look useful enough
to be worth keeping.  (In particular I'd not vote for having only
dict_intdict since it does not illustrate how to use a config file
... and we surely don't want to encourage people to not do that
correctly.)

            regards, tom lane

Re: Preliminary patch for tsearch example dictionaries/parsers in contrib

From
karpov@sao.ru (Sergey V. Karpov)
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Oleg Bartunov <oleg@sai.msu.su> writes:
>> On Tue, 9 Oct 2007, Andrew Dunstan wrote:
>>> Are we in beta or not? To me, beta means nothing but bug fixes go in,
>>> period. No ifs, no buts, no maybes, no exceptions. And that should
>>> definitely go for contrib as well. We need a bit of self-discipline around
>>> here.
>
>> the original intention was to have text search companion with code examples
>> for dictionaries and parser API and ability to keep them in workable
>> condition, instead of having them in SGML documentation.
>
> Yes.  I think that we can justify new contrib code here as a documentation
> bug fix: right now, the examples in sections 12.9 and 12.10 are wrong
> (obsolete), impossible to maintain (which is why they're obsolete;
> they failed to track code changes), and not in a format that's directly
> useful as a template for new code.
>
> However, the dict_regex code is not ready for a number of reasons,
> quite aside from the question of whether we want to add a dependency.
> I concur with the idea of dropping it for now.
>
> I'm a bit inclined to drop dict_roman as well, as it seems of
> questionable real use, and code-example-wise it adds nothing over
> dict_intdict.  That would leave us with two dictionary examples and one
> parser example, and I think both of the dictionaries look useful enough
> to be worth keeping.  (In particular I'd not vote for having only
> dict_intdict since it does not illustrate how to use a config file
> ... and we surely don't want to encourage people to not do that
> correctly.)

I've prepared reduced and renamed version of the examples. It may be
downloaded at

http://lynx.sao.ru/~karpov/tmp/ts_examples.tar.gz

Changes:

 - renamed to ts_examples to better reflect its purpose

 - dict_regex and dict_roman excluded, README corrected correspondingly

 - dict_xsyn now uses the same code to locate its config as built-in
   dictionaries, and can't access files outside
   $(prefix)/share/tsearch_data/. Example config file (which is also
   used in regression test) is installed to this
   folder. README.dict_xsyn mentions it.

 - source reformatted to better match PostgreSQL coding style

It still has a complicated source tree with subfolders for each
example. Have I to break it into three separate contrib modules? Or
combine into one with three build targets (this will break MSVC build)?

Sincerely your,

Sergey Karpov

Re: Preliminary patch for tsearch example dictionaries/parsers in contrib

From
Tom Lane
Date:
karpov@sao.ru (Sergey V. Karpov) writes:
> I've prepared reduced and renamed version of the examples. It may be
> downloaded at
> http://lynx.sao.ru/~karpov/tmp/ts_examples.tar.gz

>  - dict_xsyn now uses the same code to locate its config as built-in
>    dictionaries, and can't access files outside
>    $(prefix)/share/tsearch_data/.

Good, did you also fix it to do encoding conversion while reading the file?

> It still has a complicated source tree with subfolders for each
> example. Have I to break it into three separate contrib modules? Or
> combine into one with three build targets (this will break MSVC build)?

I think the consensus is for one module per top-level contrib directory.

            regards, tom lane

Re: Preliminary patch for tsearch example dictionaries/parsers in contrib

From
karpov@sao.ru (Sergey V. Karpov)
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> karpov@sao.ru (Sergey V. Karpov) writes:
>> I've prepared reduced and renamed version of the examples. It may be
>> downloaded at
>> http://lynx.sao.ru/~karpov/tmp/ts_examples.tar.gz
>
>>  - dict_xsyn now uses the same code to locate its config as built-in
>>    dictionaries, and can't access files outside
>>    $(prefix)/share/tsearch_data/.
>
> Good, did you also fix it to do encoding conversion while reading the file?

It has not been so in previous code. I've replaced the download file
with fixed version.

>
>> It still has a complicated source tree with subfolders for each
>> example. Have I to break it into three separate contrib modules? Or
>> combine into one with three build targets (this will break MSVC build)?
>
> I think the consensus is for one module per top-level contrib directory.

Please excuse me, but English is not my native language, and I don't
understand your phrase. Do you mean one module with two dictionaries and
a parser at once, or three different modules in contrib/?

Sincerely your,

Sergey Karpov

Re: Preliminary patch for tsearch example dictionaries/parsers in contrib

From
Tom Lane
Date:
karpov@sao.ru (Sergey V. Karpov) writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> I think the consensus is for one module per top-level contrib directory.

> Please excuse me, but English is not my native language, and I don't
> understand your phrase. Do you mean one module with two dictionaries and
> a parser at once, or three different modules in contrib/?

Sorry --- I meant they should be set up as three independent contrib
modules.

            regards, tom lane

Re: Preliminary patch for tsearch example dictionaries/parsers in contrib

From
karpov@sao.ru (Sergey V. Karpov)
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> karpov@sao.ru (Sergey V. Karpov) writes:
>> Tom Lane <tgl@sss.pgh.pa.us> writes:
>>> I think the consensus is for one module per top-level contrib directory.
>
>> Please excuse me, but English is not my native language, and I don't
>> understand your phrase. Do you mean one module with two dictionaries and
>> a parser at once, or three different modules in contrib/?
>
> Sorry --- I meant they should be set up as three independent contrib
> modules.

Ok. I've made it. The code is again at

http://lynx.sao.ru/~karpov/tmp/ts_examples.tar.gz

Sincerely your,

Sergey Karpov