Thread: Preliminary patch for tsearch example dictionaries/parsers in contrib
Given all the flap about txid, this surely mustn't go in without public review first ;-). So, here is a submission from Sergey Karpov to fill in the lack of any working code examples for user-written tsearch parsers and dictionaries. I will be mostly off-line for the next day or so and don't have time to work on this more now, but here are a few comments: * It seems a bit odd to put multiple independent contrib modules under a single subfolder. I'd be inclined to drop the ts_pack layer and just make the dictionaries and parser be top-level contrib modules. * Depending on PCRE, when we have an at-least-equally-good regex engine built in, is silly. It's an unnecessary dependency and to the (minor) extent that the regex syntax is different, we'd have to document the discrepancies. * dict_regex is not nearly up to speed on encoding or locale issues. I didn't look at the other ones too closely, they may or may not need similar adjustments. * Allowing config files to be read from anywhere is not acceptable. We have dealt with this in the core code and the contrib examples *must* follow the same rules. * The whole "utils" part of dict_regex should probably go away; it is reinventing wheels that already exist in the Postgres backend environment. Since these are meant to be code examples, they should show the best ways of doing things within Postgres. regards, tom lane
Attachment
Re: Preliminary patch for tsearch example dictionaries/parsers in contrib
From
karpov@sao.ru (Sergey V. Karpov)
Date:
Hi Tom, Thank you for starting the discussion > Given all the flap about txid, this surely mustn't go in without public > review first ;-). So, here is a submission from Sergey Karpov to fill > in the lack of any working code examples for user-written tsearch > parsers and dictionaries. > > I will be mostly off-line for the next day or so and don't have time to > work on this more now, but here are a few comments: > > * It seems a bit odd to put multiple independent contrib modules under a > single subfolder. I'd be inclined to drop the ts_pack layer and just > make the dictionaries and parser be top-level contrib modules. Yes, I understand your position, as well as Magnus' complaints. However, putting all the code to its own contribs is not the best solution, as the majority of it is no more than examples. dict_regex, on the contrary, is an add-on very useful in some situations (and we actually use in in our projects). Also, its requirements differ from the rest of the dictionaries, see below. So, what about the following layout: - contrib/ts_examples - single module which contains all the example stuff in a single folder, to be built together - contrib/dict_regex - separate contrib > * Depending on PCRE, when we have an at-least-equally-good regex engine > built in, is silly. It's an unnecessary dependency and to the (minor) > extent that the regex syntax is different, we'd have to document the > discrepancies. Built-in regex engine seems to not support the one feature critical to the dict_regex operation - it is not able to report the "partial match" in a case when the matching fails solely due to premature end of input string (i.e. when matching may possibly succeed after adding some data to the string). If it is possible to achieve this behaviour with built-in engine, please point me to the right direction. > * dict_regex is not nearly up to speed on encoding or locale issues. > I didn't look at the other ones too closely, they may or may not need > similar adjustments. > > * Allowing config files to be read from anywhere is not acceptable. > We have dealt with this in the core code and the contrib examples > *must* follow the same rules. Is it necessary to require this behaviour from each contrib module? They are not core code, and usually solve application-level tasks - is it optimal to store the application config files in postgres tree? Also, these dictionaries need some example config files at the regression test time, and these configs are of no sense for anyone - is it good to pullute the system tree with them? On the other hand, to prevent reading arbitrary files we may require the specific header line which identifies these dictionary configs. > * The whole "utils" part of dict_regex should probably go away; it > is reinventing wheels that already exist in the Postgres backend > environment. Since these are meant to be code examples, they should > show the best ways of doing things within Postgres. Yes, you are right. I'll rewrite it using StringInfo (the "official" string-handling layer, right?). Sincerely your, Sergey Karpov
Re: Preliminary patch for tsearch example dictionaries/parsers in contrib
From
Magnus Hagander
Date:
On Tue, Oct 09, 2007 at 02:02:03PM +0400, Sergey V. Karpov wrote: > > Hi Tom, > > Thank you for starting the discussion > > > Given all the flap about txid, this surely mustn't go in without public > > review first ;-). So, here is a submission from Sergey Karpov to fill > > in the lack of any working code examples for user-written tsearch > > parsers and dictionaries. > > > > I will be mostly off-line for the next day or so and don't have time to > > work on this more now, but here are a few comments: > > > > * It seems a bit odd to put multiple independent contrib modules under a > > single subfolder. I'd be inclined to drop the ts_pack layer and just > > make the dictionaries and parser be top-level contrib modules. > > Yes, I understand your position, as well as Magnus' complaints. However, > putting all the code to its own contribs is not the best solution, as > the majority of it is no more than examples. dict_regex, on the > contrary, is an add-on very useful in some situations (and we actually > use in in our projects). Also, its requirements differ from the rest of > the dictionaries, see below. > > So, what about the following layout: > > - contrib/ts_examples - single module which contains all the example > stuff in a single folder, to be built together > > - contrib/dict_regex - separate contrib I haven't actually looked at the code. The problem from the MSVC perspective is that it generates one output file per directory. So we'll need to special-case it somehow. But that can certainly be done. But since we're on the topic (not here, but in the other thread really) of what should be in /contrib and what shouldn't - shuld contrib/ts_examples realy be in contrib or could they live on pgfoundry? With just dict_regex in contrib? > > * Depending on PCRE, when we have an at-least-equally-good regex engine > > built in, is silly. It's an unnecessary dependency and to the (minor) > > extent that the regex syntax is different, we'd have to document the > > discrepancies. > > Built-in regex engine seems to not support the one feature critical to > the dict_regex operation - it is not able to report the "partial match" > in a case when the matching fails solely due to premature end of input > string (i.e. when matching may possibly succeed after adding some data > to the string). > > If it is possible to achieve this behaviour with built-in engine, please > point me to the right direction. Can't comment on that since I don't know it, but perhaps it's something we can pull in from upstream and add to our regex engine? (But that would probably make this 8.4 material for sure, no?) > > * dict_regex is not nearly up to speed on encoding or locale issues. > > I didn't look at the other ones too closely, they may or may not need > > similar adjustments. > > > > * Allowing config files to be read from anywhere is not acceptable. > > We have dealt with this in the core code and the contrib examples > > *must* follow the same rules. > > Is it necessary to require this behaviour from each contrib module? They > are not core code, and usually solve application-level tasks - is it > optimal to store the application config files in postgres tree? Yes. We've been thruogh that many times wrt adminpack, and we don't want to do that again :-) See convert_and_check_filename() in adminpack.c. > Also, these dictionaries need some example config files at the > regression test time, and these configs are of no sense for anyone - is > it good to pullute the system tree with them? > > On the other hand, to prevent reading arbitrary files we may require the > specific header line which identifies these dictionary configs. Depending on how you deal with it, that will let the user know if a file exists (you'd have to give the "file not found" error or similar even if the file is there but doesn't have the header, which is likely to confuse the user). It can be worked around, but isn't necessarily easy. //Magnus
Re: Preliminary patch for tsearch example dictionaries/parsers in contrib
From
karpov@sao.ru (Sergey V. Karpov)
Date:
Magnus Hagander <magnus@hagander.net> writes: >> > * Allowing config files to be read from anywhere is not acceptable. >> > We have dealt with this in the core code and the contrib examples >> > *must* follow the same rules. >> >> Is it necessary to require this behaviour from each contrib module? They >> are not core code, and usually solve application-level tasks - is it >> optimal to store the application config files in postgres tree? > > Yes. We've been thruogh that many times wrt adminpack, and we don't want to > do that again :-) > See convert_and_check_filename() in adminpack.c. Ok, I understand. In this case, the simplest way for me is to use get_tsearch_config_filename(), which assumes files in share/tsearch_data/, like standard tsearch dictionaries do. Or contrib modules have to keep their files strictly in share/contrib/? Sergey Karpov
Sergey V. Karpov wrote: > Built-in regex engine seems to not support the one feature critical to > the dict_regex operation - it is not able to report the "partial match" > in a case when the matching fails solely due to premature end of input > string (i.e. when matching may possibly succeed after adding some data > to the string). > > If it is possible to achieve this behaviour with built-in engine, please > point me to the right direction. > > Adding new code at this stage of the process is bad enough. Adding new code which adds a library dependency we have not previously had at this stage of the process is quite unacceptable IMNSHO, and I will protest very loudly about it. Quite apart from anything else it will almost certainly break many buildfarm members. Are we in beta or not? To me, beta means nothing but bug fixes go in, period. No ifs, no buts, no maybes, no exceptions. And that should definitely go for contrib as well. We need a bit of self-discipline around here. cheers andrew
On Tue, 9 Oct 2007, Andrew Dunstan wrote: > > > Sergey V. Karpov wrote: >> Built-in regex engine seems to not support the one feature critical to >> the dict_regex operation - it is not able to report the "partial match" >> in a case when the matching fails solely due to premature end of input >> string (i.e. when matching may possibly succeed after adding some data >> to the string). >> >> If it is possible to achieve this behaviour with built-in engine, please >> point me to the right direction. >> >> > > Adding new code at this stage of the process is bad enough. Adding new code > which adds a library dependency we have not previously had at this stage of > the process is quite unacceptable IMNSHO, and I will protest very loudly > about it. Quite apart from anything else it will almost certainly break many > buildfarm members. good point. > > Are we in beta or not? To me, beta means nothing but bug fixes go in, > period. No ifs, no buts, no maybes, no exceptions. And that should > definitely go for contrib as well. We need a bit of self-discipline around > here. the original intention was to have text search companion with code examples for dictionaries and parser API and ability to keep them in workable condition, instead of having them in SGML documentation. Of course, buildfarm checking is very important for this purpose and I think we should remove dict_regex from ts_pack, but leave README.dict_regex, so interested users could find it. > > cheers > > andrew > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Oleg Bartunov <oleg@sai.msu.su> writes: > On Tue, 9 Oct 2007, Andrew Dunstan wrote: >> Are we in beta or not? To me, beta means nothing but bug fixes go in, >> period. No ifs, no buts, no maybes, no exceptions. And that should >> definitely go for contrib as well. We need a bit of self-discipline around >> here. > the original intention was to have text search companion with code examples > for dictionaries and parser API and ability to keep them in workable > condition, instead of having them in SGML documentation. Yes. I think that we can justify new contrib code here as a documentation bug fix: right now, the examples in sections 12.9 and 12.10 are wrong (obsolete), impossible to maintain (which is why they're obsolete; they failed to track code changes), and not in a format that's directly useful as a template for new code. However, the dict_regex code is not ready for a number of reasons, quite aside from the question of whether we want to add a dependency. I concur with the idea of dropping it for now. I'm a bit inclined to drop dict_roman as well, as it seems of questionable real use, and code-example-wise it adds nothing over dict_intdict. That would leave us with two dictionary examples and one parser example, and I think both of the dictionaries look useful enough to be worth keeping. (In particular I'd not vote for having only dict_intdict since it does not illustrate how to use a config file ... and we surely don't want to encourage people to not do that correctly.) regards, tom lane
Re: Preliminary patch for tsearch example dictionaries/parsers in contrib
From
karpov@sao.ru (Sergey V. Karpov)
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes: > Oleg Bartunov <oleg@sai.msu.su> writes: >> On Tue, 9 Oct 2007, Andrew Dunstan wrote: >>> Are we in beta or not? To me, beta means nothing but bug fixes go in, >>> period. No ifs, no buts, no maybes, no exceptions. And that should >>> definitely go for contrib as well. We need a bit of self-discipline around >>> here. > >> the original intention was to have text search companion with code examples >> for dictionaries and parser API and ability to keep them in workable >> condition, instead of having them in SGML documentation. > > Yes. I think that we can justify new contrib code here as a documentation > bug fix: right now, the examples in sections 12.9 and 12.10 are wrong > (obsolete), impossible to maintain (which is why they're obsolete; > they failed to track code changes), and not in a format that's directly > useful as a template for new code. > > However, the dict_regex code is not ready for a number of reasons, > quite aside from the question of whether we want to add a dependency. > I concur with the idea of dropping it for now. > > I'm a bit inclined to drop dict_roman as well, as it seems of > questionable real use, and code-example-wise it adds nothing over > dict_intdict. That would leave us with two dictionary examples and one > parser example, and I think both of the dictionaries look useful enough > to be worth keeping. (In particular I'd not vote for having only > dict_intdict since it does not illustrate how to use a config file > ... and we surely don't want to encourage people to not do that > correctly.) I've prepared reduced and renamed version of the examples. It may be downloaded at http://lynx.sao.ru/~karpov/tmp/ts_examples.tar.gz Changes: - renamed to ts_examples to better reflect its purpose - dict_regex and dict_roman excluded, README corrected correspondingly - dict_xsyn now uses the same code to locate its config as built-in dictionaries, and can't access files outside $(prefix)/share/tsearch_data/. Example config file (which is also used in regression test) is installed to this folder. README.dict_xsyn mentions it. - source reformatted to better match PostgreSQL coding style It still has a complicated source tree with subfolders for each example. Have I to break it into three separate contrib modules? Or combine into one with three build targets (this will break MSVC build)? Sincerely your, Sergey Karpov
karpov@sao.ru (Sergey V. Karpov) writes: > I've prepared reduced and renamed version of the examples. It may be > downloaded at > http://lynx.sao.ru/~karpov/tmp/ts_examples.tar.gz > - dict_xsyn now uses the same code to locate its config as built-in > dictionaries, and can't access files outside > $(prefix)/share/tsearch_data/. Good, did you also fix it to do encoding conversion while reading the file? > It still has a complicated source tree with subfolders for each > example. Have I to break it into three separate contrib modules? Or > combine into one with three build targets (this will break MSVC build)? I think the consensus is for one module per top-level contrib directory. regards, tom lane
Re: Preliminary patch for tsearch example dictionaries/parsers in contrib
From
karpov@sao.ru (Sergey V. Karpov)
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes: > karpov@sao.ru (Sergey V. Karpov) writes: >> I've prepared reduced and renamed version of the examples. It may be >> downloaded at >> http://lynx.sao.ru/~karpov/tmp/ts_examples.tar.gz > >> - dict_xsyn now uses the same code to locate its config as built-in >> dictionaries, and can't access files outside >> $(prefix)/share/tsearch_data/. > > Good, did you also fix it to do encoding conversion while reading the file? It has not been so in previous code. I've replaced the download file with fixed version. > >> It still has a complicated source tree with subfolders for each >> example. Have I to break it into three separate contrib modules? Or >> combine into one with three build targets (this will break MSVC build)? > > I think the consensus is for one module per top-level contrib directory. Please excuse me, but English is not my native language, and I don't understand your phrase. Do you mean one module with two dictionaries and a parser at once, or three different modules in contrib/? Sincerely your, Sergey Karpov
karpov@sao.ru (Sergey V. Karpov) writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> I think the consensus is for one module per top-level contrib directory. > Please excuse me, but English is not my native language, and I don't > understand your phrase. Do you mean one module with two dictionaries and > a parser at once, or three different modules in contrib/? Sorry --- I meant they should be set up as three independent contrib modules. regards, tom lane
Re: Preliminary patch for tsearch example dictionaries/parsers in contrib
From
karpov@sao.ru (Sergey V. Karpov)
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes: > karpov@sao.ru (Sergey V. Karpov) writes: >> Tom Lane <tgl@sss.pgh.pa.us> writes: >>> I think the consensus is for one module per top-level contrib directory. > >> Please excuse me, but English is not my native language, and I don't >> understand your phrase. Do you mean one module with two dictionaries and >> a parser at once, or three different modules in contrib/? > > Sorry --- I meant they should be set up as three independent contrib > modules. Ok. I've made it. The code is again at http://lynx.sao.ru/~karpov/tmp/ts_examples.tar.gz Sincerely your, Sergey Karpov