Re: 7.0.2 issues / Geocrawler - Mailing list pgsql-hackers

From Ross J. Reedstrom
Subject Re: 7.0.2 issues / Geocrawler
Date
Msg-id 20000712113644.B1487@rice.edu
Whole thread Raw
In response to Re: 7.0.2 issues / Geocrawler  (Tim Perdue <tperdue@valinux.com>)
Responses Re: 7.0.2 issues / Geocrawler
List pgsql-hackers
On Wed, Jul 12, 2000 at 08:14:29AM -0700, Tim Perdue wrote:
> "Ross J. Reedstrom" wrote:
> > Tim, why are you building a multikey index, especially one containing a
> > large text field? It's almost never a win to index a text field, unless
> 
> This is not a key on a text field.
> 

Ah, I see, I had merged the two problems you reported together. I see now
that the 'can't create index' problem was on a different index. 

Mike Mascari gave you a detailed answer to that, which you seemd to just blow
off, based on you guesstimate that it would run too long:

> This is a *big* help.
> 
> Yes, the table is approx 10-12GB in size and running your length() and
> update queries is going to take a lifetime, since it will require a
> calculation on 4 million rows.

Mike mentioned that he's run similar index creations on 2 million rows,
and it took 5-10 minutes. I reiterate: you've got a long subject that
tripped a bug in index creation in postgresql versions < 7.0. Give his
solution a try. It's a 'clean it up once' sort of thing: I don't think
anyone's going to complain about the subject getting trimmed at ~ 2k.

> The keys are:
> 
> mail_list (example, the PHP mailing list=1)
> mail_year (1999)
> mail_month (July=7)
> 
> Yes it is a multi-key index, and the matches are exact.
> 

Right, as your explain output showed: the planner is picking this index
and using it. I'd guess that your time is getting lost in the sort step.
I seem to recall that Tom reworked the sort code as well, to reduce the
size of temporary sort files: perhaps you've found a corner case that is
much slower.

Do you still have the 6.X install available? EXPLAIN output from that
would be useful.

> Someone else asked why I have separated these fields out from the
> mail_date.
> 
> If I didn't, and I wanted to see the messages for this month, I'd have
> to regex and that would overwhelm the database.

That's what the date_part function is for:

reedstrm=# select now();         now           
------------------------2000-07-12 11:03:11-05
(1 row)

reedstrm=# select date_part('month', now());date_part 
-----------        7
(1 row)

reedstrm=# select date_part('year', now());date_part 
-----------     2000
(1 row)

So your query would look like:


SELECT mailid, mail_date, mail_is_followup, mail_from, mail_subject FROM
mail_archive WHERE mail_list=35 AND date_part('year',mail_date)=2000 AND
date_part('month',mail_date)=1 ORDER BY mail_date DESC LIMIT 26 OFFSET 0;

You can even build functional indices.  However, since you're selecting
and sorting based on the same attribute, the time of the message, it
should be possible to build an index on mail_date, and construct a SELECT
that uses it for ordering as well as limiting the tuples returned.


You're generating the queries programmatically, from a scripting language,
right? So, the best thing would be if you could create a query that
looks something like:
 SELECT mailid, mail_date, mail_is_followup, mail_from, mail_subject FROM mail_archive WHERE mail_list=35 AND mail_date
>='January 1, 2000' AND mail_date < 'February 1, 2000' ORDER BY mail_date DESC LIMIT 26 OFFSET 0;
 


With an  index on mail_date, that should do a single index scan, returning
the first 26, and stop. I'd bet a lot that it's the sort that's killing
you, since the backend has to retrieve the entire result set and sort
it to be sure it returns the first 26.

You might be able to use a two key index, on mail_date, mailid. I think
you have to be careful to put key you want sorted output on first,
to ensure that the index order is presorted, and the planner know it.

Ross
-- 
Ross J. Reedstrom, Ph.D., <reedstrm@rice.edu> 
NSBRI Research Scientist/Programmer
Computer and Information Technology Institute
Rice University, 6100 S. Main St.,  Houston, TX 77005



pgsql-hackers by date:

Previous
From: The Hermit Hacker
Date:
Subject: Re: Vacuum only with 20% old tuples
Next
From: Tom Lane
Date:
Subject: Re: Insert..returning (was Re: Re: postgres TODO)