Re: Query tuning help - Mailing list pgsql-performance

From Dan Harris
Subject Re: Query tuning help
Msg-id 6f4a543a2fbd348c7e82876b8e477600@drivefaster.net
In response to Re: Query tuning help  (Russell Smith <mr-russ@pws.com.au>)
Responses Re: Query tuning help  (Josh Berkus <josh@agliodbs.com>)
Re: Query tuning help  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Query tuning help  (Russell Smith <mr-russ@pws.com.au>)
List pgsql-performance
On May 8, 2005, at 6:51 PM, Russell Smith wrote:

> On Mon, 9 May 2005 09:20 am, Dan Harris wrote:
> You cannot use an index for %CORVETTE%, or %RED%.  There is no way
> for the index to know if a row had that in the middle without scanning
> the whole index.  So it's much cheaper to do a sequence scan.
>

While I believe you, I'm confused by this line in my original EXPLAIN
ANALYZE:

>> ->  Index Scan using ea1 on ea  (cost=0.00..2736.43 rows=42 width=47)
>>       (actual time=2.085..2.309 rows=2 loops=473)
>>     Index Cond: ((ea.incidentid)::text = ("outer".incidentid)::text)
>>     Filter: (((recordtext)::text ~~ '%RED%'::text) OR
>>             ((recordtext)::text ~~ '%CORVETTE%'::text))

Doesn't that mean it was using an index to filter?  Along those lines,
before I created the 'ea1' index, the query was much, much slower, so it
seemed like creating this index made a difference.
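
For reference, 'ea1' is an index on ea.incidentid, i.e. something along
these lines (approximate DDL, sketched from the plan rather than copied
from my schema):

-- Approximate DDL for the index named in the plan above; incidentid is
-- the column the "Index Cond" line is matching on.
CREATE INDEX ea1 ON ea (incidentid);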

> One possible way to make the query faster is to limit based on date,
> as you will only get about 700 rows.
> And then don't use subselects, as they are doing full sequence scans.
> I think this query does what you do above, and I think it will be
> faster, but I don't know.
>

I REALLY like this idea!  If I could just filter by date first and then
sequentially scan through those, it should be very manageable.  Hopefully
I can keep this approach while still accommodating the requirement
described further below.
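
If I go the date-filter route, I assume I would also want an index on
the date column so the range test itself stays cheap; something like
this (index name hypothetical):

-- Hypothetical index to support the entrydate range filter, so that
--   entrydate >= '2005-1-1 00:00' AND entrydate <= '2005-5-9 00:00'
-- doesn't have to read all of em.
CREATE INDEX em_entrydate_idx ON em (entrydate);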

> select distinct em.incidentid, ea.recordtext as retdata, eg.long,
> eg.lat
> FROM em JOIN ea ON (em.incidentid = ea.incidentid AND em.entrydate >=
> '2005-1-1 00:00'
> AND em.entrydate <= '2005-5-9 00:00' AND ea.recordtext like '%RED%'
> AND ea.recordtext like '%CORVETTE%')
> JOIN eg ON em.incidentid = eg.incidentid WHERE (recordtext like
> '%RED%' or recordtext like '%CORVETTE%'  );
>

I have run this, and while it is very fast, I'm concerned it's not
doing what I need.  Here's the situation:

Due to the format of the systems with which I integrate (I have no
control over these formats), we will get these 'recordtext' values one
line at a time, accumulating over time.  The only way I can find to
make this work is to insert a new record for each line.  The problem is
that when someone searches for multiple keywords, they expect the words
to be matched across multiple records that share a given incident
number.

  For a very simple example:

IncidentID    Date                Recordtext
----------    ----------------    ------------------------------
11111         2005-05-01 14:21    blah blah blah RED blah blah
2222          2005-05-01 14:23    not what we are looking for
11111         2005-05-02 02:05    blah CORVETTE blah blah
So, doing a search with an 'and' condition, e.g. WHERE RECORDTEXT LIKE
'%RED%' AND RECORDTEXT LIKE '%CORVETTE%', will not match, because the
condition is only applied to a single row of recordtext at a time, not
to the whole group of rows with the same incident number.
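
To illustrate (a rough sketch only, not my actual query), the matching
has to happen per incident rather than per row, e.g. with one subselect
per keyword:

-- Rough sketch, not my real query: the RED row and the CORVETTE row may
-- be different rows that merely share an incidentid, so each keyword
-- gets its own subselect.
SELECT em.incidentid
FROM   em
WHERE  em.incidentid IN ( SELECT incidentid FROM ea
                          WHERE  recordtext LIKE '%RED%' )
  AND  em.incidentid IN ( SELECT incidentid FROM ea
                          WHERE  recordtext LIKE '%CORVETTE%' );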

If I were to use tsearch2 for full-text indexing, would I need to
create another table that merges all of my recordtext rows into a
single 'text' field type?  If so, this is where I run into problems, as
my logic also needs to match multiple words in their original order.  I
may also receive additional updates to the previous data.  In that
case, I need to replace the original record with the latest version of
it.  If I have already concatenated these rows into a single field, the
logic needed to replace, in line, only the old text that has changed is
very difficult at best.  So, that's the reason I had to do two
subqueries in my example.  Please tell me if I misunderstood your logic
and it really will match given my condition above, but it didn't seem
like it would.
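
In case it helps clarify the question, this is the sort of merged table
I imagine tsearch2 would want (all names hypothetical, and I may well be
misunderstanding how tsearch2 is meant to be set up):

-- Hypothetical merged table for full-text search; names are made up.
CREATE TABLE ea_merged (
    incidentid   varchar PRIMARY KEY,
    alltext      text,      -- all recordtext lines for the incident
    alltext_fti  tsvector   -- tsearch2 vector built from alltext
);

CREATE INDEX ea_merged_fti_idx ON ea_merged USING gist (alltext_fti);

-- A keyword search would then be per incident, e.g. something like:
--   SELECT incidentid FROM ea_merged
--   WHERE  alltext_fti @@ to_tsquery('red & corvette');
-- but keeping alltext (and the vector) correct when an earlier line is
-- later corrected is exactly the part I don't see how to do cleanly.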

Thanks again for the quick responses!  This list has been a great
resource for me.

-Dan

