Thread: Newbie

Newbie

From
"William Winter"
Date:
I'm Having a hard time with this statement:

Select Jobs.*, ProjectedDates.* FROM Jobs, ProjectedDates WHERE (Plotted =
False) And (JobName = 'Boulder Valley') AND (BuilderName = 'Elliott Homes')
AND (Jobs.JobID = ProjectedDates.JobID)

Can anyone see why? the record and field names certainly do exist...any help
appreciated!



Re: Newbie

From
"Henshall, Stuart - WCP"
Date:
Did you use speech marks around the column names when creating the tables?
If so that makes them case sensitive and you'll always need them (unless
there all lower case)
- Stuart

> -----Original Message-----
> From:    William Winter [SMTP:wilscott@earthlink.net]
> Sent:    Friday, October 12, 2001 4:29 AM
> To:    pgsql-general@postgresql.org
> Subject:    [GENERAL] Newbie
>
> I'm Having a hard time with this statement:
>
> Select Jobs.*, ProjectedDates.* FROM Jobs, ProjectedDates WHERE (Plotted =
> False) And (JobName = 'Boulder Valley') AND (BuilderName = 'Elliott
> Homes')
> AND (Jobs.JobID = ProjectedDates.JobID)
>
> Can anyone see why? the record and field names certainly do exist...any
> help
> appreciated!
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org

Re: Newbie

From
Doug McNaught
Date:
"William Winter" <wilscott@earthlink.net> writes:

> I'm Having a hard time with this statement:
>
> Select Jobs.*, ProjectedDates.* FROM Jobs, ProjectedDates WHERE (Plotted =
> False) And (JobName = 'Boulder Valley') AND (BuilderName = 'Elliott Homes')
> AND (Jobs.JobID = ProjectedDates.JobID)
>
> Can anyone see why? the record and field names certainly do exist...any help
> appreciated!

It would help if you post the error message you're getting.

-Doug
--
Let us cross over the river, and rest under the shade of the trees.
   --T. J. Jackson, 1863

Re: Newbie

From
"Nick Fankhauser"
Date:
Are you getting an error message, or just not getting the results you
expected?

-Nick
--------------------------------------------------------------------------
Nick Fankhauser  nickf@ontko.com  Phone 1.765.935.4283  Fax 1.765.962.9788
Ray Ontko & Co.     Software Consulting Services     http://www.ontko.com/

> -----Original Message-----
> From: pgsql-general-owner@postgresql.org
> [mailto:pgsql-general-owner@postgresql.org]On Behalf Of William Winter
> Sent: Thursday, October 11, 2001 10:29 PM
> To: pgsql-general@postgresql.org
> Subject: [GENERAL] Newbie
>
>
> I'm Having a hard time with this statement:
>
> Select Jobs.*, ProjectedDates.* FROM Jobs, ProjectedDates WHERE (Plotted =
> False) And (JobName = 'Boulder Valley') AND (BuilderName =
> 'Elliott Homes')
> AND (Jobs.JobID = ProjectedDates.JobID)
>
> Can anyone see why? the record and field names certainly do
> exist...any help
> appreciated!
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org
>


FTI Queries and Explain

From
Gordan Bobic
Date:
Hi.

I've been playing with Full Text Indexing for a few days now, and there is
something in the behaviour of the queries that I don't fully understand. I
have set up a little test database that contains sample job adverts (it's the
sample data I had lying around froma different project)

Selecting on 1 field returns results blindingly fast, as one would expect
when indices are used. However, selecting on 2 fields takes forever.

I have done SET ENABLE_SEQSCAN=OFF.

Here's output of explain:

postgres=> explain select jobs.title from jobs, jobs_description_fti,
jobs_title_fti where (jobs_description_fti.string = 'linux' or
jobs_title_fti.string = 'linux') and (jobs_description_fti.id = jobs.oid and
jobs_title_fti.id = jobs.oid);
NOTICE:  QUERY PLAN:

Nested Loop  (cost=200000018.60..200000027.18 rows=1 width=48)
  ->  Merge Join  (cost=200000018.60..200000024.31 rows=1 width=32)
        ->  Sort  (cost=100000004.09..100000004.09 rows=75 width=16)
              ->  Seq Scan on jobs_description_fti
(cost=100000000.00..100000001.75 rows=75 width=16)
        ->  Sort  (cost=100000014.51..100000014.51 rows=251 width=16)
              ->  Seq Scan on jobs_title_fti
(cost=100000000.00..100000004.51 rows=251 width=16)
  ->  Index Scan using jobs_description_oid_index on jobs  (cost=0.00..2.01
rows=1 width=16)

EXPLAIN

This means, if I am understanding things correctly, that jobs_description_fti
is scanned with a sequential scan. That would explain the slowness.

Hwever, doing a:

explain select jobs.title from jobs, jobs_description_fti where
(jobs_description_fti.string = 'linux') and (jobs_description_fti.id =
jobs.oid);
NOTICE:  QUERY PLAN:

Nested Loop  (cost=0.00..4.04 rows=1 width=20)
  ->  Index Scan using jobs_description_fti_index on jobs_description_fti
(cost=0.00..2.01 rows=1 width=4)
  ->  Index Scan using jobs_description_oid_index on jobs  (cost=0.00..2.01
rows=1 width=16)

yields lightning fast results, as one would expect. Why does selecting from
two fields on an "or" basis cause both scans to be sequential? Even when
sequential scans are "disabled"? Because of the breakdown of descriptions
into thousands of lookup rows in the FTI tables, this is actually slower than
doing an index-less "ILIKE" search on both of the fields because of the huge
number of records in the lookup tables...

Can anyone suggest a way to do a two field "or" match using the FTI and
indices?

Cheers.

Gordan

Re: FTI Queries and Explain

From
Tom Lane
Date:
Gordan Bobic <gordan@bobich.net> writes:
> [ why is this slow? ]
> explain select jobs.title from jobs, jobs_description_fti,
> jobs_title_fti where (jobs_description_fti.string = 'linux' or
> jobs_title_fti.string = 'linux') and (jobs_description_fti.id = jobs.oid and
> jobs_title_fti.id = jobs.oid);

Because the query is wrong.  The way you wrote the WHERE, given a
match in jobs_description_fti and jobs, *any* jobs_title_fti row
with a matching ID will result in an output row.  Similarly, given
a match in jobs_title_fti and jobs, *any* jobs_description_fti row
with a matching ID will produce output.  So the system generates
what's essentially a doubly nested loop over the insufficiently
constrained tables.

A correct and practical form of the query would be something like

select jobs.title from jobs, jobs_description_fti where
jobs_description_fti.string = 'linux' and jobs_description_fti.id = jobs.oid
union
select jobs.title from jobs, jobs_title_fti where
jobs_title_fti.string = 'linux' and jobs_title_fti.id = jobs.oid;

One of the not-so-pleasant aspects of SQL is that erroneous queries
frequently look like performance problems, because no one waits around
for the enormous result set that the query actually generates ... they
try to debug the performance problem instead of looking to see if the
query requests what they want ...

            regards, tom lane

Re: FTI Queries and Explain (long)

From
Gordan Bobic
Date:
[Broken SQL instead of performance issue fixed]

It would appear that when I define the index on the FTI table (string and
oid) to be unique (which makes sense, since there is little point in having
duplicate rows in this case), a lot of inserts fail where they shouldn't. I
am guessing that if the insert into the look-up FTI table fails, the insert
into the master table fails as well.

I can understand that this might be useful for matches where the number of
occurences is important, but in this particular application, that is not the
case. Before I go and look into modifying the fti function code for my
specific purpose, it would be nice to have a confirmation of this behaviour -
otherwise it may take me a while to find what I'm looking for. ;-)

Another question - there are (as often happens) multiple ways of doing what I
want in SQL, but I am not sure what is the fastest and most efficient way of
doing it (in theory at least).

I want to do a multi-criterion search on the same field (the FTI indexed
one), and have separate AND, NOT and OR search terms.
AND = "terms that must occur in the text"
OR = "terms of which at least one has to occur in the text"
NOT = "terms which must not occur in the text"

Initially, before FTI, I used a big ILIKE query which worked reasonably well.

I should point out that my test bed machine for this is a Pentium 100 MHz
with 128 MB of RAM and an IDE disk. My database is expected to be around
50K-100K records, and about 100-200 MB on disk in PostgreSQL files (that's
what the disk consumption of the vacuumed database was before FTI).

Using the same example data set as before, yhe query was something like:

SELECT    *
FROM    Jobs
WHERE    (
        Description ILIKE '%AndTerm1%'    AND
        Description ILIKE '%AndTerm2%'    AND
        Description ILIKE '%AndTerm3%'
        ...
    )
    AND
    (
        Description ILIKE '%OrTerm1%'    OR
        Description ILIKE '%OrTerm2%'    OR
        Description ILIKE '%OrTerm3%'
        ...
    )
    AND
    (
        Description NOT ILIKE '%OrTerm1%'    AND
        Description NOT ILIKE '%OrTerm2%'    AND
        Description NOT ILIKE '%OrTerm3%'
        ...
    )

This usually returned the required data within 30 seconds or so, after,
obviously, doing as sequential search through the database due to the
non-anchored ILIKE match.

After implementing FTI, the insertion speed has gone through the floor (as
expected), but the select speed doesn't seem to be that much greater, even
when using the index (string, oid) on the FTI look-up table. On simple
queries that only require one or two terms there is a big speed improvement,
but on queries with three or more terms, the improvement is not that great.

The queries typically return 10 - 200 rows (give or take a bit, depending on
the specific query).

The queries I am using at the moment to replace the above ILIKE solution are
in the form

SELECT     Jobs.*
FROM    Jobs,
    Jobs_Description_FTI
WHERE    Jobs_Description_FTI.string    = $And/Or/NotTerms[$i]    AND
    Jobs_Description_FTI.id    = Jobs.oid

The AND queries are INTERSECTed together, OR queries and UNIONed together,
both are UNIONed, and then the NOT queries are EXCEPTed.

In some cases, this has yielded a signifficant improvement in performance, as
Tom suggested (thanks for that, it was much appreciated). Sometimes, however,
things go the other way.

To cut the long story short, I seem to have tracked the problem down to a
certain situation.

If there is, say, 10K records in the master table, there is about 4M records
in the lookup table. This in itself isn't an issue. Queries that return small
numbers of records, e.g.

SELECT    count(*)
FROM    Jobs_Description_FTI
WHERE    string = 'linux'

(returns ~300 rows)

happen more or less instantaneously.

However, a very similar query such as:

SELECT    count(*)
FROM    Jobs_Description_FTI
WHERE    string = 'nt'

(returns ~30K rows)

takes around two-three minutes.

I tried doing a

SELECT    count(*)
FROM    Jobs
WHERE    Description ILIKE '%nt%'

(returns 11K records out of 12K)

and that only takes about 10 seconds or so.

SELECT    count(*)
FROM    Jobs
WHERE    Description ILIKE '% nt %'

returns ~800 records out of 12K, which is much more reasonable.

Ideally, that should be

SELECT    count(*)
FROM    Jobs
WHERE    Description ~* '.*[!a-z]nt[!a-z].*'

or something like that, which yields a similar number of records to the
previous query, but is slower.

I am fully aware that this is fairly normal under the circumstances, but I
need a way of defeating this performance issue. The only way of doing that
that I can see at the moment is to:

1) Modify the FTI function to split the text field only at non-alphanumeric
characters, and only return whole words, rather than substrings of words.

2) Allow the insert into master table to succeed, even if some of the inserts
driven by the trigger fail, and define a unique string-oid index, which would
prevent duplicates, thus yielding a smaller lookup table.

One of the other things I'm considering is pruning the lookup table
duplicates periodically to shrink the table to a more reasonable size.

If anyone can suggest other courses of action, I am most interested to hear
them. Is there anything in the pipeline for addressing FTI in the next
version of PostgreSQL?

At the moment, the best average case scenario for my application seems to be
just doing an ILIKE or a ~* search, because although it takes a while, it
takes a comparatively similar amount of time for most queries, unlike the FTI
search which can go away for minutes at a time.

Is this another case of my being thick and producing broken SQL? Can anybody
think of a different way of doing this that would yield a performance
increase? I don't want to believe that doing a ~* unindexed sequential search
is the best solution here...

Thanks.

Gordan

FTI - unique look-up records, and no sub-words?

From
Gordan Bobic
Date:
Hi.

After looking again at the other email I've sent earlier, I realized that it
goes on for far too long, so I'll try to summarize my question more briefly.

1) How can FTI be made to NOT break up words into sub-strings? Most of those
are likely to be useless in my particular application. In fact, this feature
is why my FTI queries run slower on certain strings than sequential ILIKE
scans.

1.1) How do I make FTI only separate the text string into words on
non-alphanumeric characters (i.e. [!a-z|!A-Z|!0-9] - I think...). Is that
what it already does?

2) How can I make inserts work when enforcing unique records in the FTI
table? There is hardly any need for duplicates in the FTI table...

At the moment, the search for a particular string can take signifficantly
longer with FTI than using the plain ILIKE. FTI also returns 30K matches in
the FTI table for only 10K of records in the master table, which is clearly
not very useful. Doing a DISTINCT on the FTI table with that string will
yield 10K records, indicating that there are about 3 times more records in
the FTI table which are just duplicates taking up space.

I am NOT prepared to capitulate and use something MS SQL instead. How can I
get this to work with PostgreSQL? Who maintains the FTI contrib?

Kind regards.

Gordan

Re: FTI Queries and Explain (long)

From
Tom Lane
Date:
Gordan Bobic <gordan@bobich.net> writes:
> 1) Modify the FTI function to split the text field only at non-alphanumeric
> characters, and only return whole words, rather than substrings of words.

Sure, if that works better for you.  The reason FTI is in contrib is
that it's not really ready for prime time (if it were, it'd be in the
mainline code).  Feel free to hack it to make it match your application
better.

> Is there anything in the pipeline for addressing FTI in the next
> version of PostgreSQL?

There is a completely new contrib module 'tsearch' that will probably
supersede FTI over time.  I haven't looked at it closely myself, though.

            regards, tom lane

Re: FTI Queries and Explain (long)

From
Gordan Bobic
Date:
On Thursday 18 Oct 2001 19:59, Tom Lane wrote:
> > Is there anything in the pipeline for addressing FTI in the next
> > version of PostgreSQL?
>
> There is a completely new contrib module 'tsearch' that will probably
> supersede FTI over time.  I haven't looked at it closely myself, though.

I've just read the README for it, and it looks like it is a bit closer to
what I need. It's amazing how just before each beta release of PostgreSQL,
I'd find a feature that I REALLY need, which just happens to be implemented
in the next version. I wish other software was like that. ;^)

Thanks for the help.

Gordan

Re: Newbie

From
Keary Suska
Date:
Without providing the error message returned, we can only guess. One likely
issue is "Plotted = False", because all values must be quoted except
numbers. Better to use Plotted = 'f'.

Keary Suska
Esoteritech, Inc.
"Leveraging Open Source for a better Internet"

> From: "William Winter" <wilscott@earthlink.net>
> Organization: EarthLink Inc. -- http://www.EarthLink.net
> Reply-To: "William Winter" <wilscott@earthlink.net>
> Date: Fri, 12 Oct 2001 03:29:04 GMT
> To: pgsql-general@postgresql.org
> Subject: [GENERAL] Newbie
>
> I'm Having a hard time with this statement:
>
> Select Jobs.*, ProjectedDates.* FROM Jobs, ProjectedDates WHERE (Plotted =
> False) And (JobName = 'Boulder Valley') AND (BuilderName = 'Elliott Homes')
> AND (Jobs.JobID = ProjectedDates.JobID)
>
> Can anyone see why? the record and field names certainly do exist...any help
> appreciated!
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org
>


Re: Newbie

From
Stephan Szabo
Date:
On Fri, 12 Oct 2001, William Winter wrote:

> I'm Having a hard time with this statement:
>
> Select Jobs.*, ProjectedDates.* FROM Jobs, ProjectedDates WHERE (Plotted =
> False) And (JobName = 'Boulder Valley') AND (BuilderName = 'Elliott Homes')
> AND (Jobs.JobID = ProjectedDates.JobID)
>
> Can anyone see why? the record and field names certainly do exist...any help
> appreciated!

First off, what's the error you're getting?

If you're getting a table/field name not found, try double quoting
names.  If you double quoted (or used an interface that does so for
you) the names in mixed case on the create, you'll need to double
quote the names in queries from that point forward.



Re: Newbie

From
missive@frontiernet.net (Lee Harr)
Date:
On Fri, 12 Oct 2001 03:29:04 GMT, William Winter <wilscott@earthlink.net> wrote:
> I'm Having a hard time with this statement:
>
> Select Jobs.*, ProjectedDates.* FROM Jobs, ProjectedDates WHERE (Plotted =
> False) And (JobName = 'Boulder Valley') AND (BuilderName = 'Elliott Homes')
> AND (Jobs.JobID = ProjectedDates.JobID)
>
> Can anyone see why? the record and field names certainly do exist...any help
> appreciated!
>
>

What error message are you getting?


Also might be helpful:

version
schema
sample data


Idea for new newsgroup:

comp.databases.postgresql.help.psychic