Thread: Problem: pq_recvbuf: unexpected EOF of client connection

Problem: pq_recvbuf: unexpected EOF of client connection

From
"Natalya S. Makushina"
Date:
Hi all!

I'm seeing messages on my console like this:
pq_recvbuf: unexpected EOF of client connection

It happens when a client closes its connection.
In other words, when the Windows application is terminated, this message shows up on the console.
I have PostgreSQL v 6.5.2 on Red Hat Linux.
The application uses ODBC for Win32.

What does it mean?
Any help would be greatly appreciated.

P.S. When I used PostgreSQL v 6.4.2, I had no such problem.

Natalya Makushina
mak@rtsoft.msk.ru



HAVING QUESTION

From
Alexander Barkov
Date:
Hi!


How can I refer to a calculated field in the HAVING clause?

This works in MySQL:

SELECT some_expression as field1, ...
FROM tablename
GROUP BY ...
HAVING field1>0;

PostgreSQL gives error "Attribute 'field1' not found".


Are there any workarounds?


Thanks.


--
Alexander Barkov
IZHCOM, Izhevsk
email:    bar@izhcom.ru      | http://www.izhcom.ru
Phone:    +7 (3412) 51-55-45 | Fax: +7 (3412) 78-70-10
ICQ:      7748759

Re: [GENERAL] HAVING QUESTION

From
Bob Kline
Date:
On Wed, 3 Nov 1999, Alexander Barkov wrote:

>
> Hi!
>
>
> How can I refer the calculated field in HAVING clause.
>
> This work in MySQL:
>
> SELECT some_expression as field1, ...
> FROM tablename
> GROUP BY ...
> HAVING field1>0;
>
> PostgreSQL gives error "Attribute 'field1' not found".
>
>
> Are there any workarounds?
>
>

How about HAVING some_expression > 0?  (Though your version is legal
SQL, I believe.)

--
Bob Kline
mailto:bkline@rksystems.com
http://www.rksystems.com


Re: [GENERAL] HAVING QUESTION

From
Alexander Barkov
Date:
Bob Kline wrote:
>
> On Wed, 3 Nov 1999, Alexander Barkov wrote:
>
> >
> > Hi!
> >
> >
> > How can I refer the calculated field in HAVING clause.
> >
> > This work in MySQL:
> >
> > SELECT some_expression as field1, ...
> > FROM tablename
> > GROUP BY ...
> > HAVING field1>0;
> >
> > PostgreSQL gives error "Attribute 'field1' not found".
> >
> >
> > Are there any workarounds?
> >
> >
>
> How about HAVING some_expression > 0?  (Thought your version is legal
> SQL, I believe).
>


The problem is that some_expression is quite big. I need
this query for a search engine, and the query depends on the number of
given words.  Look at the second field in this query:

SELECT
  dict.url_id,
 max(case when word IN ('word1') then 1 else 0 end)+
 max(case when word IN ('word2') then 1 else 0 end)+
 max(case when word IN ('word3') then 1 else 0 end) as num,
 sum(intag) as rate
FROM dict,url
WHERE url.rec_id=dict.url_id
AND dict.word IN ('word1','word2','word3')
GROUP BY dict.url_id ORDER BY num DESC, rate DESC


I need to check in HAVING that calculated field 'num' is 3.

This is the sample for three words. I can duplicate the big expression
for 'num' in HAVING, but the query would be huge for 10 or 15 words :-)


Strange. I cannot use 'num' in HAVING. But it works in ORDER BY.

Maybe I'm doing something wrong?


--
Alexander Barkov
IZHCOM, Izhevsk
email:    bar@izhcom.ru      | http://www.izhcom.ru
Phone:    +7 (3412) 51-55-45 | Fax: +7 (3412) 78-70-10
ICQ:      7748759

Re: [GENERAL] HAVING QUESTION

From
Bob Kline
Date:
On Wed, 3 Nov 1999, Alexander Barkov wrote:

> Bob Kline wrote:
> >
> > On Wed, 3 Nov 1999, Alexander Barkov wrote:
> >
> > >
> > > Hi!
> > >
> > >
> > > How can I refer the calculated field in HAVING clause.
> > >
> > > This work in MySQL:
> > >
> > > SELECT some_expression as field1, ...
> > > FROM tablename
> > > GROUP BY ...
> > > HAVING field1>0;
> > >
> > > PostgreSQL gives error "Attribute 'field1' not found".
> > >
> > >
> > > Are there any workarounds?
> > >
> > >
> >
> > How about HAVING some_expression > 0?  (Though your version is legal
> > SQL, I believe).
> >
>
>
> The problem is that some_expression is big enough. I need
> this query for search engine. The query depends of number of
> given words.  Check second field in this query:
>
> SELECT
>   dict.url_id,
>  max(case when word IN ('word1') then 1 else 0 end)+
>  max(case when word IN ('word2') then 1 else 0 end)+
>  max(case when word IN ('word3') then 1 else 0 end) as num,
>  sum(intag)as rate
> FROM dict,url
> WHERE url.rec_id=dict.url_id
> AND dict.word IN ('word1','word2','word3')
> GROUP BY dict.url_id ORDER BY num DESC, rate DESC
>
>
> I need to check in HAVING that calculated field 'num' is 3.
>
> This is the sample for three words. I can duplicate big expression
> for 'num' in HAVING. But the query will be huge for 10 or 15 words :-)
>
>
> Strange. I cannot use 'num' in HAVING. But it works in ORDER BY.
>
> May be I'm doing something wrong?
>

It appears that although some implementations support the syntax you're
trying to use, SQL92 (and apparently PostgreSQL) doesn't.  What SQL92
*does* support is:

  SELECT url_id,
         num,
         rate
    FROM (SELECT ... FROM ... WHERE ...GROUP BY ...) AS tmptab
   WHERE num = 3
ORDER BY rate DESC

If the DBMS doesn't support this either, then you could resort to
creating a temporary table.
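A minimal runnable sketch of this derived-table rewrite, applied to Alexander's search-engine query. SQLite (via Python) is used here purely as a stand-in engine; the table and column names come from his query, but the data rows are invented for illustration.

```python
import sqlite3

# In-memory stand-in for the dict/url tables from Alexander's query.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE url  (rec_id INTEGER);
CREATE TABLE dict (url_id INTEGER, word TEXT, intag INTEGER);
INSERT INTO url  VALUES (1), (2);
INSERT INTO dict VALUES
  (1, 'word1', 1), (1, 'word2', 1), (1, 'word3', 2),  -- url 1 has all 3 words
  (2, 'word1', 1), (2, 'word3', 1);                   -- url 2 has only 2
""")

# The big aggregate expression is written exactly once, in the inner
# query; the outer query filters on the alias, which standard SQL allows.
rows = con.execute("""
SELECT url_id, num, rate
FROM (SELECT dict.url_id AS url_id,
             max(CASE WHEN word IN ('word1') THEN 1 ELSE 0 END) +
             max(CASE WHEN word IN ('word2') THEN 1 ELSE 0 END) +
             max(CASE WHEN word IN ('word3') THEN 1 ELSE 0 END) AS num,
             sum(intag) AS rate
      FROM dict, url
      WHERE url.rec_id = dict.url_id
        AND dict.word IN ('word1','word2','word3')
      GROUP BY dict.url_id) AS tmptab
WHERE num = 3
ORDER BY rate DESC
""").fetchall()
print(rows)  # only url 1 matches all three words
```

Only the row for url_id 1 survives the `WHERE num = 3` filter, since url 2 matched just two of the words.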

--
Bob Kline
mailto:bkline@rksystems.com
http://www.rksystems.com


Re: [GENERAL] HAVING QUESTION

From
Bruce Momjian
Date:
[Charset koi8-r unsupported, filtering to ASCII...]
>
> Hi!
>
>
> How can I refer the calculated field in HAVING clause.
>
> This work in MySQL:
>
> SELECT some_expression as field1, ...
> FROM tablename
> GROUP BY ...
> HAVING field1>0;
>
> PostgreSQL gives error "Attribute 'field1' not found".

Worked here in development tree:

    select state as test, count(*)
    from friends
    group by test
    having count(*) > 1\g

    test|count
    ----+-----
    MA  |    2
    NJ  |    2
    (2 rows)

I think your problem is that field1 is not an aggregate; or, if it is, you
may need to repeat the aggregate in the HAVING clause.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

indexed regex select optimisation missing?

From
Stuart Woolford
Date:
Hello people.

Please bear with me, as I think I may have found either a bug or a 'missing
optimisation feature' in PostgreSQL 6.5.2.

I'm trying to use PostgreSQL 6.5.2 to implement (amongst other things) a
searchable word index, i.e. I have a table called 'inv_word_i' which contains
the fields

create table inv_word_i ( key char(10), word char(16), count int ) ;

and index

create index i3 on inv_word_i ( word );

and currently has 1,654,662 entries (1.6 million). I currently search
this with:

select key from inv_word_i where word='whatever' order by count desc ;

and this is fast, however, if I use:

select key from inv_word_i where word~'^whatever.*' order by count desc ;

it is very slow.

the explains for both are:
query 1:
Sort  (cost=67.82 rows=1057 width=16)
  ->  Index Scan using i3 on inv_word_i  (cost=67.82 rows=1057 width=16)
query 2:
Sort  (cost=35148.70 rows=353 width=16)
  ->  Index Scan using i3 on inv_word_i  (cost=35148.70 rows=353 width=16)

The explain on a query for ~'.*whatever.*' gives a cost of 70000.

The documentation says that the index will be used for a regex query that
has the left side tied (by using the ^ start-of-line match), and it seems to
be doing this, but it then scans the whole rest of the file, not using the
obvious optimisation of stopping as soon as the non-variable left part of the
regex no longer matches. I have verified this: the above ^-based regex search
takes a long time for a word starting with z, a massive time for one starting
with a, and the difference is linear as I progress through the alphabet.

Is there any reason why this optimisation is not used? It seems like a rather
important one to me, and would speed up queries of this type by a massive
amount (on average, the same amount as using the index to locate the start of
the search scan).

I've downloaded the source, and will start looking into this, but having not
touched the PostgreSQL source before, I'm not holding my breath for a quick
resolution from myself.

any ideas?

--
------------------------------------------------------------
Stuart Woolford, stuartw@newmail.net
Unix Consultant.
Software Developer.
Supra Club of New Zealand.
------------------------------------------------------------

Re: [GENERAL] indexed regex select optimisation missing?

From
"Gene Selkov, Jr."
Date:
> select key from inv_word_i where word='whatever' order by count desc ;
>
> and this is fast, however, if I use:
>
> select key from inv_word_i where word~'^whatever.*' order by count desc ;
>
> it is very slow.

Did you try '^whatever' instead of '^whatever.*'? Based on common
sense, the former should be much faster than the latter, which would
match any character any number of times, unless the regexp is
optimized to avoid that.

--Gene

Re: [GENERAL] indexed regex select optimisation missing?

From
Stuart Woolford
Date:
On Thu, 04 Nov 1999, Gene Selkov, Jr. wrote:
> > select key from inv_word_i where word='whatever' order by count desc ;
> >
> > and this is fast, however, if I use:
> >
> > select key from inv_word_i where word~'^whatever.*' order by count desc ;
> >
> > it is very slow.
>
> Did you try '^whatever' instead of '^whatever.*'? Based on common
> sense, the former should be much faster than the latter, which would
> match any cahracter any number of times, unless the regexp is
> optimized to avoid that.

Unfortunately '^whatever.*' is what I'm trying to locate (ie: all words
starting with whatever, with any trailing text). The problem seems to be in
the termination of the index scan, not in the actual regex match (which
actually seems very good, speed-wise); otherwise I could just use ='whatever',
which runs very very fast.

I've had one hint that I need to build with --enable-locale, which I have to
look into. So far I've just been using the Red Hat RPMs, so I have not
actually delved into the build options, but I would have thought this would be
a safe option everywhere. (Thinking about it, though, character order is
locale specific, so maybe it's not; I'll look. If that is the case,
enable-locale is a VERY important option for text searching, and perhaps
should be added to the section 4.x entry in the FAQ where the regex
optimisation is mentioned?)

> --Gene
--
------------------------------------------------------------
Stuart Woolford, stuartw@newmail.net
Unix Consultant.
Software Developer.
Supra Club of New Zealand.
------------------------------------------------------------

Re: [GENERAL] indexed regex select optimisation missing?

From
Peter Eisentraut
Date:
On Thu, 4 Nov 1999, Stuart Woolford wrote:

> unfortunately '^whatever.*' is what I'm trying to locate (ie: all words
> starting with whatever, but with nay trailing text), the problem seems to be in
> the termination of the index scan, not in the actual regex match (which actually
> seems very good, speed wise..) otherwise I could just use ='whatever', which
> runs very very fast.

Isn't "all words that start with whatever but without trailing text" the
same as = 'whatever'? From a regex point of view '^whatever' and
'^whatever.*' are exactly equivalent, but I can see where one could fail
to optimize properly. But that could even be a problem in the regex
library, which is not any of the PostgreSQL developers' domain and which
is used in other products as well.

    -Peter

--
Peter Eisentraut                  Sernanders vaeg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden


Re: [GENERAL] indexed regex select optimisation missing?

From
"Gene Selkov, Jr."
Date:
> > unfortunately '^whatever.*' is what I'm trying to locate (ie: all words
> > starting with whatever, but with nay trailing text), the problem seems to be in
> > the termination of the index scan, not in the actual regex match (which actually
> > seems very good, speed wise..) otherwise I could just use ='whatever', which
> > runs very very fast.
>
> Isn't "all words that start with whatever but without trailing text" the
> same as = 'whatever'? From a regex point of view '^whatever' and
> '^whatever.*' are exactly equivalent, but I can see where one could fail
> to optimize properly.

OK, let's turn from speculation to facts (I have just gotten off my
rear end and verified each):

1. '^whatever.*' and '^whatever' are equivalent regular expressions.

2. The version of regexp used in postgres is aware of this equivalence.

3. Btree index is used in the queries involving anchored expressions:

emp=> explain select * from ps where ps ~ '^EDTA';
NOTICE:  QUERY PLAN:

Index Scan using psix on ps  (cost=2373.21 rows=1 width=62)

emp=> explain select * from ps where ps ~ '^EDTA.*';
NOTICE:  QUERY PLAN:

Index Scan using psix on ps  (cost=2373.21 rows=1 width=62)

(ps is a 250k-row table; the result is returned immediately when
indexed and in about 3 seconds when not)

However,

4. Hash index is never used
===========================

Observations made with 6.5 on RedHat 5.1.


--Gene

Re: [GENERAL] indexed regex select optimisation missing?

From
Stuart Woolford
Date:
On Fri, 05 Nov 1999, Gene Selkov, Jr. wrote:
> OK, let's turn from speculations to facts (have just gotten off my
> rear end and verified each).:
>
> 1. '^whatever.*' and '^whatever' are equivalent regular expressions.

Yes, sorry, I was aware of this; I was using .* for clarity, and my mind got
stuck in 'proper' regex mode where it is needed. It unfortunately has no
effect on the outcome here.

> 2. The version of regexp used in postgres is aware of this equivalence.

sure seems that way.

> 3. Btree index is used in the queries involving anchored expressions:
>
> emp=> explain select * from ps where ps ~ '^EDTA';
> NOTICE:  QUERY PLAN:
>
> Index Scan using psix on ps  (cost=2373.21 rows=1 width=62)
>
> emp=> explain select * from ps where ps ~ '^EDTA.*';
> NOTICE:  QUERY PLAN:
>
> Index Scan using psix on ps  (cost=2373.21 rows=1 width=62)
>
> (ps is a 250k-row table; the result is returned immediately when
> indexed and in about 3 seconds when not)

My point is that, while the index (in 6.5.1 and 6.5.2, anyway) is used to
locate the start of the scan, the system then index-scans the *whole* rest of
the table (which takes minutes for my 1.6 million entry table if it starts
from near the beginning), as opposed to using a better 'stop term' to end the
scan once the regex can no longer match (ie: the static front of the regex no
longer matches). The ordered scan is only being half utilised, and this makes
a MASSIVE difference in performance.

For example, say one of the words in the table is 'alongword', and there is
also 'alongwords', but no other words with the root 'alongword'.

If I do a "select key from inv_word_i where word='alongword'" it will use the
btree index on inv_word_i, and locate the one match almost instantly.

If I do a "select key from inv_word_i where word~'alongword'" it will need to
scan all the records (this takes some time, minutes in fact) - as it should! -
and will match at least the two entries detailed above.

If I do a "select key from inv_word_i where word~'^alongword'" it uses the
index to find 'alongword', then does an index scan of the *whole* rest of the
table, checking all the remaining entries for regex matches, so it takes a
long time to return the two entries detailed above; almost as long as the
previous query.

What it should do is stop as soon as the leftmost part of the regex no longer
matches 'alongword', because, as it is scanning in index order, a match is no
longer possible. The query would then run at nearly the speed of the first
example, while finding the required two entries. This method extends to any
regex with a '^' followed by a static prefix: as soon as the static part no
longer matches in index scan order, the regex can never match again.

This makes a massive difference when searching large indexes of words where
you want to match a root word and all extensions of that word (for example,
window, windows, windowing, windowed, windowless, etc.). This optimisation (if
it is missing or broken) would make PostgreSQL a much more powerful tool for
this job, for what would seem to be quite a simple addition.
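The stop-term idea described above can be sketched in miniature. This is a toy model, not PostgreSQL code: a sorted Python list stands in for the btree index, and the function name prefix_scan is invented for illustration.

```python
import bisect

# 'words' stands in for the btree index: kept in sorted order.
words = sorted(["alongword", "alongwords", "apple", "window", "zebra"])

def prefix_scan(words, prefix):
    """Scan a sorted list in index order, stopping as soon as the static
    prefix no longer matches -- rather than scanning to the end."""
    matches = []
    i = bisect.bisect_left(words, prefix)      # index locates the start
    while i < len(words) and words[i].startswith(prefix):
        matches.append(words[i])               # stop term: first mismatch ends the scan
        i += 1
    return matches

print(prefix_scan(words, "alongword"))  # the two root-word matches, nothing scanned past them
```

Because the list is ordered, every match is contiguous from the start position, so the scan can terminate at the first non-matching entry instead of visiting the whole rest of the table.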

>
> However,
>
> 4. Hash index is never used

Makes a lot of sense: hash indexes do not supply ordering information, and
are therefore only useful for equivalence lookups, not the ordered scanning
which is required for the regex situation.

> ===========================
>
> Observations made with 6.5 on RedHat 5.1.
--
------------------------------------------------------------
Stuart Woolford, stuartw@newmail.net
Unix Consultant.
Software Developer.
Supra Club of New Zealand.
------------------------------------------------------------

Re: [GENERAL] indexed regex select optimisation missing?

From
"Ross J. Reedstrom"
Date:
Ah, your description just tripped a memory for me from the hackers list:

The behavior you describe has to do with the implementation of using an
index for regex matching, in the presence of the USE_LOCALE configuration
option.

Internally, the condition: WHERE word~'^alongword' is converted in the
parser(!) to:

WHERE word >= 'alongword' AND word < 'alongword\377'

since the index needs inequalities, not matches, to be usable. Now, the
problem is that the hack of tacking an octal \377 onto the string to create
the lexicographically 'just bigger' value assumes ASCII sort order. If
USE_LOCALE is defined, this rewrite is dropped, since we don't have a good fix
yet, and slow correct behavior is better than fast, incorrect behavior.

So, you have two options: if you don't need locale support, recompile
without it. Otherwise, hand code your anchored matches as the pair of
conditionals above. Hmm, is there syntax for adding an arbitrary value to
a string constant in SQL? I suppose you could use: word < 'alongwore',
i.e. hand increment the last character, so it's larger than any match.
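The hand-increment trick can be sketched as follows. The function name make_prefix_range is invented (it is not anything in PostgreSQL), and the sketch assumes plain ASCII byte ordering, which is exactly what locale collations break.

```python
def make_prefix_range(prefix: str):
    """Return (low, high) such that low <= s < high exactly when s starts
    with prefix, under plain ASCII ordering. Assumes the last character of
    the prefix is not the maximum code point."""
    # Bump the last character: 'window' -> 'windox', 'alongword' -> 'alongwore'.
    high = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, high

low, high = make_prefix_range("window")
# Every word starting with the prefix falls inside the range...
assert low <= "window" < high and low <= "windows" < high
assert low <= "windowing" < high
# ...and neighbours in sort order fall outside it.
assert not (low <= "windmill" < high)
assert not (low <= "wine" < high)
```

This is the "WHERE word >= 'window' AND word < 'windox'" pair written out as a general rule; unlike the \377 hack it also keeps values like 'window\377x' inside the range.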

Your point is correct; the developers are aware of it as a theoretical
problem, at least. It always helps to hear a real-world case, though. I
believe it's on the TODO list as is; otherwise, pester Bruce. ;-)

Reviewing my email logs from June, most of the work on this has to do with
people who need locales, and potentially multibyte character sets. Tom
Lane is of the opinion that this particular optimization needs to be moved
out of the parser, and deeper into the planner or optimizer/rewriter,
so a good fix may be some ways out.

Ross

On Fri, Nov 05, 1999 at 10:12:06AM +1300, Stuart Woolford wrote:
>
> My point is that, while the index (in 6.5.1 and 6.5.2, anyway) is used to locate
> the start of the scan, the system is then index-scanning the *whole* rest of the
> table (which takes minutes for my 1.6 million entry table if it is from near
> the start), as opposed to using a better 'stop term' to stop scanning once the
> regex will no longer be able to match (ie: the static front of the regex is no
> longer matching), so the ordered scan is only being half utilised, this makes a
> MASSIVE difference in performance.
>
> For example, say one of the words in the table is 'alongword', and there is
> also 'alongwords', but no other words with the root of 'alongword'
>

[...]

>
> If I do a 'select key from inv_word_i where word~'^alongword'  it uses the
> index to find 'alongword', then does an index scan of the *whole* rest of the
> table check all the rest of the entries for regex matching, so it takes a long
> time, and returns the two entries detailed above, it will take almost as long
> as the previous query.
>
> What it should do is stop as soon as the leftmost part of the regex match no
> longer matches 'alongword' because, as it is scanning in indexed order, a match
> is no longer possible. The query will then run at nearly the speed of the first

--
Ross J. Reedstrom, Ph.D., <reedstrm@rice.edu>
NSBRI Research Scientist/Programmer
Computer and Information Technology Institute
Rice University, 6100 S. Main St.,  Houston, TX 77005

Re: [GENERAL] indexed regex select optimisation missing?

From
Stuart Woolford
Date:
On Fri, 05 Nov 1999, you wrote:
> Ah, your description just tripped a memory for me from the hackers list:
>
> The behavior you describe has to do with the implementation of using an
> index for regex matching, in the presence of the USE_LOCALE configuration
> option.
>
> Internally, the condition: WHERE word~'^alongword' is converted in the
> parser(!) to:
>
> WHERE word >= 'alongword' AND word < 'alongword\377'
>
> since the index needs inequalities to be used, not matches. Now, the
> problem is the hack of tacking an octal \377 on the string to create
> the lexagraphically 'just bigger' value assumes ASCI sort order. If
> USE_LOCALE is defined, this is dropped, since we don't have a good fix
> yet, and slow correct behavior is better than fast, incorrect behavior.

Ah, now this makes sense. I'm using the RPMs, and I bet they have locale
support enabled by default (damn! perhaps another set should be produced
without this option? it makes a BIG difference).

> So, you have two options: if you don't need locale support, recompile
> without it. Otherwise, hand code your anchored matches as the pair of
> conditionals above Hmm, is there syntax for adding an arbitrary value to
> a string constant in the SQL? I suppose you could use: word < 'alongwore',
> i.e. hand increment the last character, so it's larger than any match.

I've tried a test using ">='window' and <'windox'", and it works perfectly
and very very fast, so I think we have found the culprit.

>
> Your point is correct, the developers are aware of it as a theoretical
> problem, at least. Always helps to hear a real world case, though. I
> believe it's on the TODO list as is, otherwise, pester Bruce. ;-)
>
> Reviewing my email logs from June, most of the work on this has to do with
> people who needs locales, and potentially multibyte character sets. Tom
> Lane is of the opinion that this particular optimization needs to be moved
> out of the parser, and deeper into the planner or optimizer/rewriter,
> so a good fix may be some ways out.

Hmm, perhaps a 'good' initial fix would be to produce another set of RPMs,
and/or add it to the FAQ in the 4.x section about slow queries, which says
indexes are used for this type of search. Using the >= AND < trick does seem
to work, but it is a little non-obvious (and hard to code in some situations;
it will make quite a difference to how I need to implement my searching
system).

>
> Ross

thank you very very much for your assistance on this, it is greatly appreciated!

--
------------------------------------------------------------
Stuart Woolford, stuartw@newmail.net
Unix Consultant.
Software Developer.
Supra Club of New Zealand.
------------------------------------------------------------

Re: [GENERAL] indexed regex select optimisation missing?

From
"Ross J. Reedstrom"
Date:
Stuart -
I'm forwarding a version of your last message to the hackers list, and
to Lamar Owen, who's the keeper of the RPMs. The short version, for those
who haven't followed this thread over on GENERAL, is that Stuart is being
bitten by the USE_LOCALE effect on the makeIndexable() function in the
parser: anchored regex searches on a large table (a glossary, I believe)
take a long time, proportional to the sort position of the anchoring text:
i.e. searching for '^zoo' is quick, '^apple' is very slow.

I seem to recall the packagers here (Lamar and Oliver) asking if defining
USE_LOCALE for the general RPM or deb would cause any problems for other
users who don't need locale info. Here's a real world example.

The discussion about this was last June, and shifted focus to the
multi-byte problem, as far as I can tell. Bruce, some version of this
is on the TODO list, right?

Ross

On Fri, Nov 05, 1999 at 12:09:19PM +1300, Stuart Woolford wrote:
> On Fri, 05 Nov 1999, you wrote:
> > Ah, your description just tripped a memory for me from the hackers list:
> >
> > The behavior you describe has to do with the implementation of using an
> > index for regex matching, in the presence of the USE_LOCALE configuration
> > option.
> >
> > Internally, the condition: WHERE word~'^alongword' is converted in the
> > parser(!) to:
> >
> > WHERE word >= 'alongword' AND word < 'alongword\377'
> >
> > since the index needs inequalities to be used, not matches. Now, the
> > problem is the hack of tacking an octal \377 on the string to create
> > the lexagraphically 'just bigger' value assumes ASCI sort order. If
> > USE_LOCALE is defined, this is dropped, since we don't have a good fix
> > yet, and slow correct behavior is better than fast, incorrect behavior.
>
> ah, now this makes sense, I'm using the RPMs, and I bet they have lexical
> enabled by default (damb! perhaps another set should be produced without this
> option? it makes a BIG difference)
>
> > So, you have two options: if you don't need locale support, recompile
> > without it. Otherwise, hand code your anchored matches as the pair of
> > conditionals above Hmm, is there syntax for adding an arbitrary value to
> > a string constant in the SQL? I suppose you could use: word < 'alongwore',
> > i.e. hand increment the last character, so it's larger than any match.
>
> I've tried a test using ">='window' and <'windox'", and it works perfectly, and
> very very fast, so I think we have found your culprit.
>
> >
> > Your point is correct, the developers are aware of it as a theoretical
> > problem, at least. Always helps to hear a real world case, though. I
> > believe it's on the TODO list as is, otherwise, pester Bruce. ;-)
> >
> > Reviewing my email logs from June, most of the work on this has to do with
> > people who needs locales, and potentially multibyte character sets. Tom
> > Lane is of the opinion that this particular optimization needs to be moved
> > out of the parser, and deeper into the planner or optimizer/rewriter,
> > so a good fix may be some ways out.
>
> Hmm, perhaps a 'good' initial fix would be to produce another set of RPMs,
> and/or add it to the FAQ in the 4.x section about the slow queries that say
> indexes are used for this type of search. using the >= AND < trick does seem to
> work, but is a little non-obvious (and hard to code in some situations, it will
> make quite a difference to how I need to implement my searching system)
>
> >
> > Ross
>
> thank you very very much for your assistance on this, it is greatly appreciated!
>
> --
> ------------------------------------------------------------
> Stuart Woolford, stuartw@newmail.net
> Unix Consultant.
> Software Developer.
> Supra Club of New Zealand.
> ------------------------------------------------------------
--
Ross J. Reedstrom, Ph.D., <reedstrm@rice.edu>
NSBRI Research Scientist/Programmer
Computer and Information Technology Institute
Rice University, 6100 S. Main St.,  Houston, TX 77005

Re: [GENERAL] indexed regex select optimisation missing?

From
Lamar Owen
Date:
"Ross J. Reedstrom" wrote:
>
> Stuart -
> I'm forwarding a version of your last message to the hackers list, and
> to Lamar Owen, who's the keeper of the RPMs. The short verson, for those

> > Hmm, perhaps a 'good' initial fix would be to produce another set of RPMs,

That is easy enough.  I can build two versions -- with locale, and
no-locale.  No-locale RPM's would be named differently --
postgresql-6.5.3-1nl.i386.rpm (that's 'one en ell').

I have been helping another user figure out the regression results for
locales -- it's not fun.  HOWEVER, I also need to follow the
RedHat-originated standard, which is with locale support.

It'll take a little bit to rebuild, but not too long -- I could release
no-locale RPM's as early as tomorrow for RedHat 6.x, and as early as an
hour from now for RedHat 5.2 (both releases happening after the official
6.5.3 release, of course).

In fact, if a user wants to build the no-locale RPM's themselves, it's
not too difficult:
1.)    get the postgresql-6.5.2-1.src.rpm source RPM (hereafter abbreviated
'the SRPM')
2.)    Install the SRPM with 'rpm -i'
3.)    Become root, and cd to /usr/src/redhat/SPECS
4.)    Open postgresql.spec with your favorite editor
5.)    Remove the configure option '--enable-locale' (if you use vi, and
are comfortable with doing so, you can ':%s/--enable-locale//g' to good
effect).
6.)    Change the string after the line 'Release:' to '1nl' from '1'.
7.)    Save and exit your editor.
8.)    execute the command 'rpm -ba postgresql.spec'
9.)    When it's done, install the new RPM's from the appropriate directory
under /usr/src/redhat/RPMS.
10.)    Clean up by removing the files under SOURCES and the
postgresql-6.5.2 build tree under BUILD.

NOTE: You need a fairly complete development environment to do this --
in particular, 'python-devel' must be installed (it's not by default,
even under a 'C Development' and 'Development Libraries' enabled
installation).  You do need the C++ compiler installed as well.

Would this help??

--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11

Re: [GENERAL] indexed regex select optimisation missing?

From
Bruce Momjian
Date:
> Stuart -
> I'm forwarding a version of your last message to the hackers list, and
> to Lamar Owen, who's the keeper of the RPMs. The short version, for those
> who haven't followed this thread over on GENERAL, is that Stuart is being
> bitten by the USE_LOCALE effect on the makeIndexable() function in the
> parser: anchored regex searches on a large table (a glossary, I believe)
> take a long time, proportional to the sort position of the anchoring text:
> i.e. searching for '^zoo' is quick, '^apple' is very slow.
>
> I seem to recall the packagers here (Lamar and Oliver) asking if defining
> USE_LOCALE for the general RPM or deb would cause any problems for other
> users, who don't need locale info. Here's a real world example.
>
> The discussion about this was last June, and shifted focus into the
> multi-byte problem, as far as I can tell. Bruce, some version of this
> is on the TODO list, right?

I have beefed up the FAQ with a mention that locale disables regex
indexing, and have added to TODO:

    * Allow LOCALE to use indexes in regular expression searches

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: [HACKERS] Re: [GENERAL] indexed regex select optimisation missing?

From
Tom Lane
Date:
"Ross J. Reedstrom" <reedstrm@wallace.ece.rice.edu> writes:
> Reviewing my email logs from June, most of the work on this has to do with
> people who needs locales, and potentially multibyte character sets. Tom
> Lane is of the opinion that this particular optimization needs to be moved
> out of the parser, and deeper into the planner or optimizer/rewriter,
> so a good fix may be some ways out.

Actually, that part is already done: addition of the index-enabling
comparisons is gone from the parser and is now done in the optimizer,
which has a whole bunch of benefits (one being that the comparison
clauses don't get added to the query unless they are actually used
with an index!).

But the underlying LOCALE problem still remains: I don't know a good
character-set-independent method for generating a "just a little bit
larger" string to use as the righthand limit.  If anyone out there is
an expert on foreign and multibyte character sets, some help would
be appreciated.  Basically, given that we know the LIKE or regex
pattern can only match values beginning with FOO, we want to generate
string comparisons that select out the range of values that begin with
FOO (or, at worst, a slightly larger range).  In USASCII locale it's not
hard: you can do
    field >= 'FOO' AND field < 'FOP'
but it's not immediately obvious how to make this idea work reliably
in the presence of odd collation orders or multibyte characters...

BTW: the \377 hack is actually wrong for USASCII too, since it'll
exclude a data value like 'FOO\377x' which should be included.

            regards, tom lane
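The ASCII-only version of that rewrite can be sketched in a few lines. This is a hypothetical helper for illustration only, not the actual PostgreSQL source (which does this in C inside the optimizer); it also shows why the prefix can sometimes have no finite upper bound:

```python
def prefix_upper_bound(prefix):
    """Return the smallest string greater than every string starting
    with `prefix`, assuming plain single-byte ASCII ordering.
    Real collations and multibyte encodings need more care."""
    s = list(prefix)
    while s:
        c = ord(s[-1])
        if c < 0x7F:
            # Increment the last character: 'FOO' -> 'FOP'.
            s[-1] = chr(c + 1)
            return "".join(s)
        # Last character is already the maximum; drop it and retry
        # on the shortened prefix.
        s.pop()
    return None  # no upper bound exists; emit only the >= clause

# prefix_upper_bound('FOO') == 'FOP', so the planner could emit
#   field >= 'FOO' AND field < 'FOP'
```

The `None` case is exactly why a blind "append \377" hack cannot be patched up: when the prefix consists entirely of maximal characters, no "just bigger" string of any length exists.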

Re: [HACKERS] Re: [GENERAL] indexed regex select optimisation missing?

From
Charles Tassell
Date:
I don't know much about the backend stuff, but wouldn't it reduce the
number of records you go through to search for FO. and then do
another check on each returned record to verify that the last character
matches?  More checks, but fewer total records.

Anyway, just a thought.
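The idea above, scanning on a shorter indexable condition and rechecking each returned row, can also be written out by hand. A sketch using the table and column names from this thread (illustrative only, single-byte ASCII assumed):

```sql
-- "Scan a wider range, then recheck each row": the inequality
-- pair can use the index, and the regex filters out the few
-- extra rows the wider range lets through.
SELECT key
FROM inv_word_i
WHERE word >= 'FO' AND word < 'FP'   -- indexable range on the shorter prefix
  AND word ~ '^FOO';                 -- per-row recheck of the full pattern
```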

At 12:46 PM 11/5/99, Tom Lane wrote:
>[snip]
>
>  Basically, given that we know the LIKE or regex
>pattern can only match values beginning with FOO, we want to generate
>string comparisons that select out the range of values that begin with
>FOO (or, at worst, a slightly larger range).  In USASCII locale it's not
>hard: you can do
>         field >= 'FOO' AND field < 'FOP'
>but it's not immediately obvious how to make this idea work reliably
>in the presence of odd collation orders or multibyte characters...
>
>BTW: the \377 hack is actually wrong for USASCII too, since it'll
>exclude a data value like 'FOO\377x' which should be included.
>
>                         regards, tom lane
>


Re: [GENERAL] indexed regex select optimisation missing?

From
Stuart Woolford
Date:
On Fri, 05 Nov 1999, Ross J. Reedstrom wrote:
> Ah, your description just tripped a memory for me from the hackers list:
>
> The behavior you describe has to do with the implementation of using an
> index for regex matching, in the presence of the USE_LOCALE configuration
> option.
>
> Internally, the condition: WHERE word~'^alongword' is converted in the
> parser(!) to:
>
> WHERE word >= 'alongword' AND word < 'alongword\377'
>
> since the index needs inequalities to be used, not matches. Now, the
> problem is the hack of tacking an octal \377 on the string to create
> the lexicographically 'just bigger' value assumes ASCII sort order. If
> USE_LOCALE is defined, this is dropped, since we don't have a good fix
> yet, and slow correct behavior is better than fast, incorrect behavior.

just to add to my previous reply, the 'hack' I am using now is:

select key from inv_word_i where word>='window' and word<'window\372'

which matches very nearly everything in my database (actually, I limit data to
printable characters, so it should be safe), and works with my normal queries
(which are actually Zope queries, and therefore changing the actual search word
is a little non-trivial)

anyway, just a quick hack that helps performance by several orders of magnitude
if you have locale enabled (ie: are using the standard RPMs)
BTW, I assume that my databases will need rebuilding if I compile up a
non-locale-aware version, which presents a problem currently :(

------------------------------------------------------------
Stuart Woolford,
stuartw@newmail.net Unix Consultant.
Software Developer.
Supra Club of New Zealand.
------------------------------------------------------------

Re: [HACKERS] Re: [GENERAL] indexed regex select optimisation missing?

From
Stuart Woolford
Date:
Firstly, damn, you guys are good! Please accept my strongest compliments for
the response time on this issue!

On Sat, 06 Nov 1999, Tom Lane wrote:
> "Ross J. Reedstrom" <reedstrm@wallace.ece.rice.edu> writes:
> > Reviewing my email logs from June, most of the work on this has to do with
> > people who need locales, and potentially multibyte character sets. Tom
> > Lane is of the opinion that this particular optimization needs to be moved
> > out of the parser, and deeper into the planner or optimizer/rewriter,
> > so a good fix may be some ways out.
>
> Actually, that part is already done: addition of the index-enabling
> comparisons is gone from the parser and is now done in the optimizer,
> which has a whole bunch of benefits (one being that the comparison
> clauses don't get added to the query unless they are actually used
> with an index!).
>
> But the underlying LOCALE problem still remains: I don't know a good
> character-set-independent method for generating a "just a little bit
> larger" string to use as the righthand limit.  If anyone out there is
> an expert on foreign and multibyte character sets, some help would
> be appreciated.  Basically, given that we know the LIKE or regex
> pattern can only match values beginning with FOO, we want to generate
> string comparisons that select out the range of values that begin with
> FOO (or, at worst, a slightly larger range).  In USASCII locale it's not
> hard: you can do
>     field >= 'FOO' AND field < 'FOP'
> but it's not immediately obvious how to make this idea work reliably
> in the presence of odd collation orders or multibyte characters...

how about something along the lines of:

field >= 'FOO' and field ~ '^FOO'

ie, terminate the scan once a row fails to match the static left-hand side
followed by anything (although I have the feeling this does not fit into your
execution system..), with a simple regex-type check added to the scan
validation code?

>
> BTW: the \377 hack is actually wrong for USASCII too, since it'll
> exclude a data value like 'FOO\377x' which should be included.

That's why I pointed out that in my particular case, I only have alpha and
numeric data in the database, so it is safe, it's certainly no general solution.

--
------------------------------------------------------------
Stuart Woolford, stuartw@newmail.net
Unix Consultant.
Software Developer.
Supra Club of New Zealand.
------------------------------------------------------------

[more] indexed regex select optimisations?

From
Stuart Woolford
Date:
Well, I've improved my regex text searches to actually use the indexes properly
now for the basic case, but I have found another 'problem' (or feature, call it
what you will ;) - to demonstrate:
with locale turned on (the default RPMs are like this):

the following takes a LONG time to run on 1.6 million records:
-------------------------------------
explain select isbn, count from inv_word_i where
word~'^foo'
order by count

Sort  (cost=35148.70 rows=353 width=16)
  ->  Index Scan using i3 on inv_word_i  (cost=35148.70 rows=353 width=16)
-------------------------------------
the following runs instantly, and does (nearly) the same thing:
-------------------------------------
explain select isbn, count from inv_word_i where
word>='foo' and word<'fop'
order by count

Sort  (cost=11716.57 rows=183852 width=16)
  ->  Index Scan using i3 on inv_word_i  (cost=11716.57 rows=183852 width=16)
-------------------------------------
but what about the following? :
-------------------------------------
explain select isbn , sum(count) from inv_word_i where
(word>='window' and word<'windox')
or
(word>='idiot' and word<'idiou')
group by isbn
order by sum(count) desc

Sort  (cost=70068.84 rows=605525 width=16)
  ->  Aggregate  (cost=70068.84 rows=605525 width=16)
        ->  Group  (cost=70068.84 rows=605525 width=16)
              ->  Sort  (cost=70068.84 rows=605525 width=16)
                    ->  Seq Scan on inv_word_i  (cost=70068.84 rows=605525 width=16)
-------------------------------------

this is the fastest way I've found so far to do a multi-word search (window and
idiot as the root words in this case); note that it does NOT use the indexes,
but falls back to a linear scan?!? it takes well over 30 seconds (much much too
long)

I've tried a LOT of different combinations, and have yet to find a way of
getting the system to use the indexes correctly to do what I want. the closest
I've found is using a select intersect select method to find all docs
containing both words (what I really want, although the query above is a ranked
or query), but it gets slow as soon as I select more than one field for the
results (I need to link isbn in this case to another database in the final
application)

I assume there is some reason the system falls back to a linear scan in this
case? it seems two index lookups would be much much more efficient..

am I missing something again?
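One rewrite worth trying for the OR-of-ranges case is to split it into a UNION ALL, so each branch can be a separate index-range scan. This is only a sketch in modern PostgreSQL syntax; 6.5 does not accept a subselect in FROM, so there you would materialize the two branches into a temporary table first:

```sql
-- Hypothetical rewrite: each branch of the OR becomes its own
-- index-range scan, and the results are combined before grouping.
SELECT isbn, sum(count) AS total
FROM (SELECT isbn, count FROM inv_word_i
          WHERE word >= 'window' AND word < 'windox'
      UNION ALL
      SELECT isbn, count FROM inv_word_i
          WHERE word >= 'idiot' AND word < 'idiou') AS hits
GROUP BY isbn
ORDER BY total DESC;
```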

--
------------------------------------------------------------
Stuart Woolford, stuartw@newmail.net
Unix Consultant.
Software Developer.
Supra Club of New Zealand.
------------------------------------------------------------