Thread: Text Search zero padding

From

"Richard Greenwood"

Date:

28 February 2008, 23:47:11

I am using text search across multiple columns. Two of the columns
have values that have zero padding - sort of. The values look like
R0001234 (1 char followed by 7 digits, zero padded). Users are
accustomed to searching with and without the zero padding (entering
R0001234 or R1234 should return identical results). This is easy to
accommodate when parsing user input for a single column, but text
searching across multiple columns it is harder to determine if a
char/digit group should be padded.

So far my best idea is to create a tsvector column containing both
padded and non-padded versions of the value. i.e. put both R1234 and
R0001234 into the tsvector column. This seems pretty brute force, and
I am pretty new to text search, so I'd welcome any suggestions.

Thanks,
Rich

--
Richard Greenwood
richard.greenwood@gmail.com
www.greenwoodmap.com

Re: Text Search zero padding

From

Tom Lane

Date:

29 February 2008, 01:19:18

"Richard Greenwood" <richard.greenwood@gmail.com> writes:
> I am using text search across multiple columns. Two of the columns
> have values that have zero padding - sort of. The values look like
> R0001234 (1 char followed by 7 digits, zero padded). Users are
> accustomed to searching with and without the zero padding (entering
> R0001234 or R1234 should return identical results). This is easy to
> accommodate when parsing user input for a single column, but text
> searching across multiple columns it is harder to determine if a
> char/digit group should be padded.

> So far my best idea is to create a tsvector column containing both
> padded and non-padded versions of the value. i.e. put both R1234 and
> R0001234 into the tsvector column. This seems pretty brute force, and
> I am pretty new to text search, so I'd welcome any suggestions.

I'm not an expert in tsearch either, but given what you say here,
it seems like the Right Thing is to create a parser or dictionary
that strips those zeroes as being insignificant, so that R0001234 and
R1234 get mapped to the same stored/searchable lexeme.

            regards, tom lane
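
The idea of mapping both spellings to one lexeme can be illustrated outside the database. The following is a minimal Python sketch of the normalization only, not a PostgreSQL parser or dictionary; the function name `strip_zero_padding` and the assumed token shape (one letter followed by digits) are hypothetical choices made for illustration.

```python
import re

# Assumed token shape: a single letter followed by one or more digits,
# e.g. R0001234 or R1234. Anything else is left alone.
_ID_PATTERN = re.compile(r"([A-Za-z])0*(\d+)")


def strip_zero_padding(token: str) -> str:
    """Return the token with leading zeroes removed from its digit group.

    Tokens that do not match the assumed shape are returned unchanged.
    """
    match = _ID_PATTERN.fullmatch(token)
    if match is None:
        return token
    letter, digits = match.groups()
    return letter + digits


if __name__ == "__main__":
    for raw in ("R0001234", "R1234", "hello"):
        print(raw, "->", strip_zero_padding(raw))
```

If the same function is applied both when indexing and when parsing the query, either spelling typed by the user ends up as the same search term.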

Re: Text Search zero padding

From

Oleg Bartunov

Date:

29 February 2008, 04:07:03

On Thu, 28 Feb 2008, Richard Greenwood wrote:

> I am using text search across multiple columns. Two of the columns
> have values that have zero padding - sort of. The values look like
> R0001234 (1 char followed by 7 digits, zero padded). Users are
> accustomed to searching with and without the zero padding (entering
> R0001234 or R1234 should return identical results). This is easy to
> accommodate when parsing user input for a single column, but text
> searching across multiple columns it is harder to determine if a
> char/digit group should be padded.
>
> So far my best idea is to create a tsvector column containing both
> padded and non-padded versions of the value. i.e. put both R1234 and
> R0001234 into the tsvector column. This seems pretty brute force, and
> I am pretty new to text search, so I'd welcome any suggestions.

Create your own dictionary that indexes R0001234 as both R0001234 and R1234.
It seems dict_regex is your friend.
http://vo.astronet.ru/arxiv/dict_regex.html

>
> Thanks,
> Rich
>
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
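
The "index both forms" approach can be sketched in Python as a function that expands one token into the padded and unpadded variants. This is only an illustration of the mapping, not dict_regex configuration syntax; the name `index_variants`, the default width of 7 digits, and the letter-plus-digits pattern are assumptions taken from the R0001234 example in the thread.

```python
import re
from typing import List

# Assumed token shape: one letter followed by 1 to 7 digits.
_ID_PATTERN = re.compile(r"([A-Za-z])(\d{1,7})")


def index_variants(token: str, width: int = 7) -> List[str]:
    """Return the zero-padded and unpadded forms of an ID-like token.

    Tokens that do not match the assumed shape are returned as-is.
    """
    match = _ID_PATTERN.fullmatch(token)
    if match is None:
        return [token]
    letter, digits = match.groups()
    stripped = digits.lstrip("0") or "0"  # keep one digit for all-zero IDs
    padded = letter + stripped.zfill(width)
    unpadded = letter + stripped
    # dict.fromkeys removes the duplicate when both forms are identical.
    return list(dict.fromkeys([padded, unpadded]))


if __name__ == "__main__":
    for raw in ("R0001234", "R1234", "R1234567", "hello"):
        print(raw, "->", index_variants(raw))
```

With both variants stored, a query term can be matched as typed, at the cost of a second entry per value in the index.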

Re: Text Search zero padding

From

Richard Huxton

Date:

29 February 2008, 05:55:38

Oleg Bartunov wrote:
> On Thu, 28 Feb 2008, Richard Greenwood wrote:
>
>> So far my best idea is to create a tsvector column containing both
>> padded and non-padded versions of the value. i.e. put both R1234 and
>> R0001234 into the tsvector column. This seems pretty brute force, and
>> I am pretty new to text search, so I'd welcome any suggestions.
>
> Create your own dictionary that indexes R0001234 as both R0001234 and R1234.
> It seems dict_regex is your friend.
> http://vo.astronet.ru/arxiv/dict_regex.html

Nice - I was thinking something like that would be useful, but Googling
hadn't found me anything. Thanks for that link, Oleg.

Wouldn't it be more efficient to have the regex-dictionary map just to
R1234 though? Or R0001234, I suppose.

--
   Richard Huxton
   Archonet Ltd

Re: Text Search zero padding

From

Oleg Bartunov

Date:

29 February 2008, 06:11:33

On Fri, 29 Feb 2008, Richard Huxton wrote:

> Oleg Bartunov wrote:
>> On Thu, 28 Feb 2008, Richard Greenwood wrote:
>>
>>> So far my best idea is to create a tsvector column containing both
>>> padded and non-padded versions of the value. i.e. put both R1234 and
>>> R0001234 into the tsvector column. This seems pretty brute force, and
>>> I am pretty new to text search, so I'd welcome any suggestions.
>>
>> Create your own dictionary that indexes R0001234 as both R0001234 and R1234.
>> It seems dict_regex is your friend.
>> http://vo.astronet.ru/arxiv/dict_regex.html
>
> Nice - I was thinking something like that would be useful, but Googling
> hadn't found me anything. Thanks for that link, Oleg.
>
> Wouldn't it be more efficient to have the regex-dictionary map just to R1234
> though? Or R0001234, I suppose.

Sure. But having both variants in the index allows more flexible searches using
different configurations, with or without the mapping. Think about 'exact' search.

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83