Thread: Re: [GENERAL] Determining if "in a text set"

Re: [GENERAL] Determining if "in a text set"

From

Herouth Maoz

Date:

28 March 1999, 12:23:17

I am redirecting the discussion to the SQL list, as it is more appropriate
than the GENERAL list.

At 06:41 +0200 on 27/03/1999, Richi Plana wrote:

> I have a table/class with a text field which contains a single word
> (possibly trailed by whitespaces). Given a phrase (a text literal composed
> of words separated by one or more whitespaces), what's the best way to
> SELECT those tuples with fields containing instances that match exactly
> one of the words in the given string?
>
> Example:
>
> Table:
> Tuple#   ...   FieldN
> 1        ...   'MON'
> 2        ...   'TUE'
> 3        ...   'THURS'
> 4        ...   'THU'
> 5        ...   'FRI'
>
> Given phrase:
> 'MON TUE WED THURS'
> Tuple 1 will match
> Tuple 2 will match
> Tuple 3 will match
> Tuple 4 will NOT match
> Tuple 5 will NOT match
>
> Right now, I'm extracting (using strtok(), 8^P) each word from the given
> phrase and using "SELECT * FROM {class} where FieldN LIKE '%{word}%'", but
> it's terribly slow, relies on strtok() to parse words, and Tuple 4 above
> will match.

First, do without the strtok. You could achieve the same result (that also
matches tuple 4) with:
 SELECT * FROM {class} WHERE '{phrase}' LIKE '%' || ( rtrim( FieldN ) || '%' );

Hehe... the strange expression simply cuts off spaces from the right side
of the field, and concatenates % signs on its left and right. So you match
the phrase against the keyword, not the other way around.

It is more readable if you use regular expression rather than 'like':
 SELECT * FROM {class} WHERE '{phrase}' ~ rtrim( FieldN );

Now, turning attention to eliminating the fourth tuple from matching: I'm
not an expert on regular expressions. I think postgres does not support
perl's concept of word boundaries. But if the phrase is supposed to contain
just spaces (ascii 32, if you will) as word separators, I'd match against
the field with a space on the left and nothing on the right, a space on the
right and nothing on the left, spaces on both sides, or standing alone:
 SELECT * FROM {class} WHERE '{phrase}' ~ ' ' || ( rtrim( FieldN ) || '$' )    OR '{phrase}' ~ '^' || ( rtrim( FieldN )
||' ' )    OR '{phrase}' ~ ' ' || ( rtrim( FieldN ) || ' ' )    OR '{phrase}' = rtrim( FieldN );

The equivalent with 'like' is:
 SELECT * FROM {class} WHERE '{phrase}' LIKE '% ' || rtrim( FieldN )    OR '{phrase}' LIKE rtrim( FieldN ) || ' %'
OR'{phrase}' LIKE '% ' || ( rtrim( FieldN ) || ' %' )    OR '{phrase}' = rtrim( FieldN );

If the spaces between the words in your phrase are not just plain spaces
but can also be tabs etc, you will have to use the regexp version, and
write something like '[ \t\r\f\n]' - though I'm less than sure that
postgres's regular expressions support these. Can anyone tell us which
regexp definition postgres uses?

Herouth

--
Herouth Maoz, Internet developer.
Open University of Israel - Telem project
http://telem.openu.ac.il/~herutma

Re: [SQL] Re: [GENERAL] Determining if "in a text set"

From

Herouth Maoz

Date:

28 March 1999, 12:43:23

At 19:24 +0200 on 28/03/1999, I  wrote:

>
> If the spaces between the words in your phrase are not just plain spaces
> but can also be tabs etc, you will have to use the regexp version, and
> write something like '[ \t\r\f\n]' - though I'm less than sure that
> postgres's regular expressions support these. Can anyone tell us which
> regexp definition postgres uses?

OK, I looked into it more deeply. The regular expressions in postgres seem
to follow the rules in the 'regex(5)' manpage.

So, if the delimiters are not necessarily plain space characters, you could
match, using the '~' version, with '[[:space:]]' instead of ' '. Ugly, but
it matches also tabs and newlines.

At least it does on my solaris. I'm not sure whether this is or is not
OS-dependent.

As for the words in the field, which you said can have trailing spaces. If
these spaces also may be tabs or newlines, you should use rtrim with two
arguments, and the second argument is a string of all characters which need
removal from the end of the string. The point is, however, that you type in
an actual tab or an actual newline.

Hope you followed me so far.

Herouth

--
Herouth Maoz, Internet developer.
Open University of Israel - Telem project
http://telem.openu.ac.il/~herutma

Re: [SQL] Re: [GENERAL] Determining if "in a text set"

From

Richi Plana

Date:

28 March 1999, 13:14:17

Hi,

On Sun, 28 Mar 1999, Herouth Maoz wrote:

|o| OK, I looked into it more deeply. The regular expressions in postgres seem
|o| to follow the rules in the 'regex(5)' manpage.
[snipped]
|o| At least it does on my solaris. I'm not sure whether this is or is not
|o| OS-dependent.

Looking at the PostgreSQL source, I think they're using some UC Berkeley
regex package (check out $(SRC)/src/backend/regex). There are two MAN
pages there if people are interested. re_format.7 specifies POSIX 1003.2
Regex specs ... Again, assuming that PostgreSQL uses this lib.

|o| Hope you followed me so far.

thanks, that was pretty straightforward. kinda teacher-like. ;^)


L   L Richi Plana 8^)         ,-,-.     ,-,-.     ,-,-.     ,-,-.     ,-
LL LL Systems Administrator  / / \ \   / / \ \   / / \ \   / / \ \   / /
LLLLL Mosaic Communications, Inc. \ \ / /   \ \ / /   \ \ / /   \ \ / / 
LLLLL mailto:richip@mozcom.com     `-'-'     `-'-'     `-'-'     `-'-'  
------------------------------------------------------------------------
P G P Key available at http://www2.mozcom.com/~richip/richip.asc
Tired of Spam? Join this CAUCE! http://www.cauce.org/

Re: [GENERAL] Determining if "in a text set"

From

Richi Plana

Date:

28 March 1999, 13:59:25

Hi,

On Sun, 28 Mar 1999, Herouth Maoz wrote:

|o| perl's concept of word boundaries. But if the phrase is supposed to contain
|o| just spaces (ascii 32, if you will) as word separators, I'd match against
|o| the field with a space on the left and nothing on the right, a space on the
|o| right and nothing on the left, spaces on both sides, or standing alone:
|o| 
|o|   SELECT * FROM {class}
|o|   WHERE '{phrase}' ~ ' ' || ( rtrim( FieldN ) || '$' )
|o|      OR '{phrase}' ~ '^' || ( rtrim( FieldN ) || ' ' )
|o|      OR '{phrase}' ~ ' ' || ( rtrim( FieldN ) || ' ' )
|o|      OR '{phrase}' = rtrim( FieldN );

1) I'm not sure the regex thing is ANSI SQL and would port to other SQL
systems easilly
2) Isn't there a way to do [^{ws}](word)[{ws}$], where [^{ws}] means
either the start or a whitespace and [{ws}$] means either a whitespace or
EOL?
3) When I try the concat strings operator (||), I get the following psql
error:

ERROR:  parser: syntax error at or near "||"

What am I doing wrong? (I'm using PostgreSQL 6.4.2 on Sparc/Solaris with
GCC compiler) There's something wrong with the concat operator on my
system. I get diff. results from the ff. commands:

SELECT * FROM {class} where 'MON' = FieldN; and
SELECT * FROM {class} where 'MO' || 'N' = fieldN;

Anybody have a clue.


L   L Richi Plana 8^)         ,-,-.     ,-,-.     ,-,-.     ,-,-.     ,-
LL LL Systems Administrator  / / \ \   / / \ \   / / \ \   / / \ \   / /
LLLLL Mosaic Communications, Inc. \ \ / /   \ \ / /   \ \ / /   \ \ / / 
LLLLL mailto:richip@mozcom.com     `-'-'     `-'-'     `-'-'     `-'-'  
------------------------------------------------------------------------
P G P Key available at http://www2.mozcom.com/~richip/richip.asc
Tired of Spam? Join this CAUCE! http://www.cauce.org/

Re: [SQL] Re: [GENERAL] Determining if "in a text set"

From

Richi Plana

Date:

28 March 1999, 14:15:12

Hi,

On Mon, 29 Mar 1999, Richi Plana wrote:

|o| |o|   SELECT * FROM {class}
|o| |o|   WHERE '{phrase}' ~ ' ' || ( rtrim( FieldN ) || '$' )
|o| |o|      OR '{phrase}' ~ '^' || ( rtrim( FieldN ) || ' ' )
|o| |o|      OR '{phrase}' ~ ' ' || ( rtrim( FieldN ) || ' ' )
|o| |o|      OR '{phrase}' = rtrim( FieldN );
|o| 
|o| 3) When I try the concat strings operator (||), I get the following psql
|o| error:
|o| 
|o| ERROR:  parser: syntax error at or near "||"

My mistake. I didn't take into account hierarchy/precedence of operators.
It seems string concatenation (||) has a higher order compared to the
regex and LIKE operators. It should have been.

SELECT * FROM {class} WHERE '{phrase}' ~ ('^' || TRIM(FieldN) ...'


L   L Richi Plana 8^)         ,-,-.     ,-,-.     ,-,-.     ,-,-.     ,-
LL LL Systems Administrator  / / \ \   / / \ \   / / \ \   / / \ \   / /
LLLLL Mosaic Communications, Inc. \ \ / /   \ \ / /   \ \ / /   \ \ / / 
LLLLL mailto:richip@mozcom.com     `-'-'     `-'-'     `-'-'     `-'-'  
------------------------------------------------------------------------
P G P Key available at http://www2.mozcom.com/~richip/richip.asc
Tired of Spam? Join this CAUCE! http://www.cauce.org/

Re: [GENERAL] Determining if "in a text set"

From

Herouth Maoz

Date:

29 March 1999, 05:33:00

At 20:56 +0200 on 28/03/1999, Richi Plana wrote:

>
> 1) I'm not sure the regex thing is ANSI SQL and would port to other SQL
> systems easilly

That's true. ANSI only has LIKE, but it's oh-so-limited.

> 2) Isn't there a way to do [^{ws}](word)[{ws}$], where [^{ws}] means
> either the start or a whitespace and [{ws}$] means either a whitespace or
> EOL?

Hehe... Actually, there is. This is what I came up with:
  SELECT * FROM {class}  WHERE '{phrase}' ~ (      '(^|[[:space:]])' || ( rtrim( FieldN ) || '([[:space:]]|$)' )  );

It seems a bit complicated. The [[:space:]] combination is the set of all
whitespace characters. Like perl's \s, but locale-sensitive.

The combination '(^|[[:space:]])' uses the regexp alternatives syntax. This
means "Either the beginning of the phrase (^) or a space character". The
combination '([[:space:]]|$)' is, along the same lines, "either a space
character or the end of the phrase".

All the rest of the mess are concatenation operators, and the parentheses
required for their precedence and associativity.

> 3) When I try the concat strings operator (||), I get the following psql
> error:

Yes, you are right about the precedence (I don't know why I didn't
encounter it yesterday). One must also remember that the || operator is
also non-associative (no real reason for it - it's a minor bug), so you
have to put parentheses around it if you concat more than two strings.

Herouth

--
Herouth Maoz, Internet developer.
Open University of Israel - Telem project
http://telem.openu.ac.il/~herutma