Thread:
[I hope job postings are kosher...]

I need help optimizing a PostgreSQL application:

Full-text search
~17,000 records
Articles (text) are about 10K long on average, ranging from 0 to 278K.

I don't know if we need to throw more RAM, more hard drive, more
comparison RAM in postmaster.conf or build a concordance or if this is
just not something that can be done within our budget.

I can't even seem to get the PostgreSQL profiling output using "-s" in the
startup of postmaster and client to determine what the db engine is doing.

I don't understand why PostgreSQL sometimes chooses not to use the
existing INDEXes to do an index scan instead of sequential scan -- Does it
really think sequential will be faster, or does it eliminate an index scan
because there won't be enough hard drive or swap space to do it?

Currently, full text search queries take on the order of 2 minutes to
execute.
We need them to be happening in 5 seconds, if at all possible.

Unfortunately, this needs to happen EARLY THIS WEEK, if at all possible.

Contact me off-list with some idea of price/availability/references if you
are interested in taking on this task.

THANKS!
> I don't understand why PostgreSQL sometimes chooses not to use the
> existing INDEXes to do an index scan instead of sequential scan -- Does it
> really think sequential will be faster, or does it eliminate an index scan

Yes, and it's generally right.

> because there won't be enough hard drive or swap space to do it?

Nope. Simply because of the time it takes to read from the disk. An index
scan makes ~1 read per tuple and sequential scans make one per page (gross
simplification).

> Currently, full text search queries take on the order of 2 minutes to
> execute.
> We need them to be happening in 5 seconds, if at all possible.

How about a couple of explains of the queries? What kind of tuning have you
done in postgresql.conf? What's your hardware like? Have you partitioned the
data to separate disks in any way? Are you doing mostly (all?) reads? Some
writes? Perhaps clustering? Is this on 7.2 or 7.3? What is the locale? C or
en_US or something else?

-- 
Rod Taylor <rbt@rbt.ca>

PGP Key: http://www.rbt.ca/rbtpub.asc
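Since the thread never shows the poster's postgresql.conf, here is a purely
illustrative sketch of the memory-related settings Rod is asking about,
assuming a dedicated box with roughly 1 GB of RAM and a 7.2/7.3-era server
(parameter names and units differ in later releases; none of these values
come from the thread):

    # postgresql.conf -- illustrative values only, for a ~1 GB dedicated server
    shared_buffers = 8192          # 8192 x 8 kB pages = 64 MB of shared buffer cache
    sort_mem = 8192                # 8 MB per sort/hash step (later renamed work_mem)
    effective_cache_size = 65536   # ~512 MB, a planner hint about the OS disk cache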
>> I don't understand why PostgreSQL sometimes chooses not to use the
>> existing INDEXes to do an index scan instead of sequential scan --
>> Does it really think sequential will be faster, or does it eliminate
>> an index scan
>
> Yes, and it's generally right.
>
>> because there won't be enough hard drive or swap space to do it?
>
> Nope. Simply because of the time it takes to read from the disk. An index
> scan makes ~1 read per tuple and sequential scans make one per page
> (gross simplification).

Hmmm. An "index" is apparently nothing like I expected it to be... Here I
thought it would be some quick hash-table small data-set lookup with a
reference to the OID -- and that most of the hash tables could just be
loaded in one fell swoop. Oh well.

>> Currently, full text search queries take on the order of 2 minutes to
>> execute.
>> We need them to be happening in 5 seconds, if at all possible.
>
> How about a couple of explains of the queries?

Explains were posted previously, but I'll do a couple more.

At its simplest, this takes 30 seconds:

explain select article.* from article where lower(text) like '%einstein%';
NOTICE: QUERY PLAN:

Seq Scan on article (cost=0.00..1155.01 rows=1 width=216)

Or, slightly more complex:

explain SELECT DISTINCT *, 0 + (0
    + 10 * (lower(title) like '%einstein%') ::int
    + 10 * (lower(author_flattened) like '%einstein%') ::int
    + 30 * (lower(subject_flattened) like '%einstein%') ::int
    + 30 * (lower(text) LIKE '%einstein%') ::int
    +  9 * (substring(lower(title), 1, 20) like '%einstein%') ::int
    + 25 * (substring(lower(text), 1, 20) LIKE '%einstein%') ::int
    ) AS points
  FROM article
  WHERE TRUE
    AND (FALSE
      OR (lower(title) like '%einstein%')
      OR (lower(author_flattened) like '%einstein%')
      OR (lower(subject_flattened) like '%einstein%')
      OR (lower(text) LIKE '%einstein%')
    )
  ORDER BY points desc, volume, number, article.article
  LIMIT 10, 0;
NOTICE: QUERY PLAN:

Limit (cost=1418.03..1418.08 rows=1 width=216)
  -> Unique (cost=1418.03..1418.08 rows=1 width=216)
       -> Sort (cost=1418.03..1418.03 rows=1 width=216)
            -> Seq Scan on article (cost=0.00..1418.02 rows=1 width=216)

> What kind of tuning have
> you done in postgresql.conf?

None. Never really understood what that one memory setting would affect...
And the rest of the options seemed to be about logging output (which I also
can't seem to crank up to the level of getting query analysis out). I RTFM,
but actually comprehending what was written ... :-^

> What's your hardware like?

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 11
model name      : Intel(R) Pentium(R) III CPU family 1400MHz
stepping        : 1
cpu MHz         : 1406.005
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips        : 2804.94

             total:      used:      free:   shared:  buffers:   cached:
Mem:     921235456  736669696  184565760    749568  75321344 592257024
Swap:   2097143808   15368192 2081775616
MemTotal:       899644 kB
MemFree:        180240 kB
MemShared:         732 kB
Buffers:         73556 kB
Cached:         573896 kB
SwapCached:       4480 kB
Active:         433776 kB
Inact_dirty:    182208 kB
Inact_clean:     36680 kB
Inact_target:   229376 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       899644 kB
LowFree:        180240 kB
SwapTotal:     2047992 kB
SwapFree:      2032984 kB

> Have you
> partitioned the data to separate disks in any way?

No, except when attempting to do the PostgreSQL contrib/fulltextindex we
clustered the _fti table by loading it in word order.

> Are you doing mostly (all?)
> reads? Some writes? Perhaps clustering?

Mostly reads.

Some writes by:
  Admin fixing typos, adding new articles
  Nightly cron jobs to "flatten" large-scale JOINs into text concatenations
  (We could get rid of that and go back to the JOINs, now that we've figured
  out that it's really the full text search that's killing us, not the JOINs)

> Is this on 7.2 or 7.3?

7.1.3

> What is the Locale? C or en_US or something
> else?

AFAIK, we didn't do anything to alter the locale from whatever the default
would be...
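The nightly "flattening" described above can also be expressed as plain SQL.
A minimal sketch, using hypothetical tables article(id, author_flattened) and
article_author(article_id, author_name), since the real schema is not shown
in the thread:

    -- Hypothetical schema; the column names are assumptions, not from the thread.
    -- Concatenate every author of an article into one searchable text column.
    UPDATE article
    SET author_flattened = (
        SELECT string_agg(lower(author_name), ' ')  -- string_agg needs PostgreSQL 9.0+;
                                                     -- 7.x would need a custom aggregate
        FROM article_author aa
        WHERE aa.article_id = article.id
    );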
On Tue, 03.12.2002 at 01:45, typea@l-i-e.com wrote:
> Explains were posted previously, but I'll do a couple more.
>
> At its simplest, this takes 30 seconds:
>
> explain select article.* from article where lower(text) like '%einstein%';
> NOTICE: QUERY PLAN:
>
> Seq Scan on article (cost=0.00..1155.01 rows=1 width=216)

Searches with LIKE use indexes only when the like expression starts with a
string (like 'einstein%'), and even then only if in C locale.

You should check out some real full-text index add-ons, like contrib/tsearch,
or construct your own using your imagination plus contrib/intarray and
contrib/intagg :)

---------------
Hannu
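To illustrate Hannu's point about anchored patterns, here is a minimal sketch
using the article table from the earlier EXPLAINs; the index name is made up,
and the index only helps under the conditions Hannu mentions (anchored
pattern, C locale on 7.x):

    -- Functional index on the lower-cased title.
    CREATE INDEX article_lower_title_idx ON article (lower(title));

    -- Can use the index: the pattern is anchored at the start of the string.
    SELECT * FROM article WHERE lower(title) LIKE 'einstein%';

    -- Cannot use a btree index: the leading wildcard forces a sequential scan,
    -- which is why a real full-text add-on such as contrib/tsearch is the better fix.
    SELECT * FROM article WHERE lower(title) LIKE '%einstein%';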
On Mon, 2002-12-02 at 12:30, typea@l-i-e.com wrote:
> [I hope job postings are kosher...]
>
> I need help optimizing a PostgreSQL application:
>
> Full-text search
> ~17,000 records
> Articles (text) are about 10K long on average, ranging from 0 to 278K.
>
> I don't know if we need to throw more RAM, more hard drive, more
> comparison RAM in postmaster.conf or build a concordance or if this is
> just not something that can be done within our budget.
>
> I can't even seem to get the PostgreSQL profiling output using "-s" in the
> startup of postmaster and client to determine what the db engine is doing.
>
> I don't understand why PostgreSQL sometimes chooses not to use the
> existing INDEXes to do an index scan instead of sequential scan -- Does it
> really think sequential will be faster, or does it eliminate an index scan
> because there won't be enough hard drive or swap space to do it?
>
> Currently, full text search queries take on the order of 2 minutes to
> execute.
> We need them to be happening in 5 seconds, if at all possible.
>
> Unfortunately, this needs to happen EARLY THIS WEEK, if at all possible.
>
> Contact me off-list with some idea of price/availability/references if you
> are interested in taking on this task.

After reading the thread to see that your box has what looks like 1GB RAM,
and firing up bc(1) to see that 17K articles each of which is ~10KB == 166MB,
it seems to this simple mind that given enough buffers, you could suck all of
the articles into the buffers. Thus, no more disk IO, but boy would it burn
up the CPU!

Also, I think that I might write some sort of "book index pre-processor" to
run against each article, to create, for each article, a list of words plus
byte offsets. (Some tweaking would have to occur in order to handle
capitalization vagaries. Probably capitalize all "index words".) (Yes, this
method has the limitation of [sub-]word searches instead of arbitrary string
searches.)

Then, insert all that data into a 3rd table (T_LOOKUP) whose structure is:
    val           TEXT     (primary key)
    article_name  TEXT
    byte_offset   INTEGER

Then, 'EINSTEIN%' queries would go against T_LOOKUP instead of the articles
table.

-- 
+------------------------------------------------------------+
| Ron Johnson, Jr.        mailto:ron.l.johnson@cox.net        |
| Jefferson, LA  USA      http://members.cox.net/ron.l.johnson|
|                                                             |
| "they love our milk and honey, but preach about another     |
|  way of living"                                             |
|    Merle Haggard, "The Fighting Side Of Me"                 |
+------------------------------------------------------------+
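A minimal sketch of the lookup table Ron describes, with the column names he
gives; a plain index is used here instead of a primary key on val, since the
same word will appear in many articles and at many offsets:

    -- Word-lookup table populated by the hypothetical pre-processor.
    CREATE TABLE t_lookup (
        val          TEXT,
        article_name TEXT,
        byte_offset  INTEGER
    );
    CREATE INDEX t_lookup_val_idx ON t_lookup (val);

    -- Anchored search against the small lookup table instead of scanning article text
    -- (the same anchored-pattern / locale caveats as elsewhere in the thread apply).
    SELECT DISTINCT article_name
    FROM t_lookup
    WHERE val LIKE 'EINSTEIN%';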
Hi,

I'm wondering whether there is any limit
for the maximum number of tables in a database.

I'm thinking about separating a table with up to millions of rows into
several tables with the same set of columns to speed up some complex
queries. As the size of the original table is increasing fast, I want
to get it separated once the size grows up to a limit. So there
will be a large amount of tables (having same structure) in a database. Is
there any potential performance problem with this design?

Thanks.

Li Li
On Mon, Dec 02, 2002 at 09:46:43PM -0800, li li wrote:
>
> for the maximum number of tables in a database.

<http://www.ca.postgresql.org/users-lounge/limitations.html>

For practical purposes, probably not.

> to get it separated once the size grows up to a limit. So there
> will be a large amount of tables (having same structure) in a database. Is
> there any potential performance problem with this design?

It depends on what you're going to do. If the idea is to join across
the tables, it'll probably perform worse than just having a large
table. OTOH, if what you're doing is, say, archiving from time to
time, it doesn't seem unreasonable.

A

-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110
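For the "archiving from time to time" case, the move can be done in a single
transaction. A minimal sketch, assuming a hypothetical big_table with a
created_on timestamp column and an archive table with the same columns (none
of these names appear in the thread):

    BEGIN;
    -- Copy old rows into the archive table, then remove them from the live table.
    INSERT INTO big_table_archive
        SELECT * FROM big_table
        WHERE created_on < now() - interval '1 year';
    DELETE FROM big_table
        WHERE created_on < now() - interval '1 year';
    COMMIT;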
>
> It depends on what you're going to do. If the idea is to join across
> the tables, it'll probably perform worse than just having a large
> table. OTOH, if what you're doing is, say, archiving from time to
> time, it doesn't seem unreasonable.
>

The purpose for this design is to avoid record lookup in a huge table.
I expect to see the query results in, say, one minute, by searching a much
smaller table (not join across multiple tables).

Thanks and regards.

Li Li
On Tuesday 03 Dec 2002 7:49 pm, li li wrote:
> > It depends on what you're going to do. If the idea is to join across
> > the tables, it'll probably perform worse than just having a large
> > table. OTOH, if what you're doing is, say, archiving from time to
> > time, it doesn't seem unreasonable.
>
> The purpose for this design is to avoid record lookup in a huge table.
> I expect to see the query results in, say, one minute, by searching a much
> smaller table (not join across multiple tables).
>
> Thanks and regards.

If you only want *most* queries to finish in one minute - I've used two
tables in the past: one for recent info (which is what most of my users
wanted) and one for older info (which only got accessed rarely). You're only
union-ing two tables then, and you can cluster the older table as mentioned
elsewhere.

-- 
Richard Huxton
Archonet Ltd
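A minimal sketch of the two-table split Richard describes, with hypothetical
table and column names (records_recent, records_old, customer_id):

    -- Both tables have the same set of columns; the view covers the rare
    -- "search everything" case.
    CREATE VIEW records_all AS
        SELECT * FROM records_recent
        UNION ALL
        SELECT * FROM records_old;

    -- Most queries touch only the small, recent table:
    SELECT * FROM records_recent WHERE customer_id = 42;

    -- The occasional full search unions both:
    SELECT * FROM records_all WHERE customer_id = 42;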
On Wed, 2002-12-04 at 09:29, Richard Huxton wrote:
> On Tuesday 03 Dec 2002 7:49 pm, li li wrote:
> > > It depends on what you're going to do. If the idea is to join across
> > > the tables, it'll probably perform worse than just having a large
> > > table. OTOH, if what you're doing is, say, archiving from time to
> > > time, it doesn't seem unreasonable.
> >
> > The purpose for this design is to avoid record lookup in a huge table.
> > I expect to see the query results in, say, one minute, by searching a much
> > smaller table (not join across multiple tables).
> >
> > Thanks and regards.
>
> If you only want *most* queries to finish in one minute - I've used two
> tables in the past: one for recent info (which is what most of my users
> wanted) and one for older info (which only got accessed rarely). You're only
> union-ing two tables then, and you can cluster the older table as mentioned
> elsewhere.

Another approach could be to have an index on the timestamp field (which
should be naturally clustered) and search in recent data only.

If the problem is simply too much data returned, you could use LIMIT.

-- 
Hannu Krosing <hannu@tm.ee>
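And a sketch of Hannu's alternative, assuming a hypothetical ts timestamp
column on the big table: index it, restrict queries to the recent slice, and
cap the result size with LIMIT:

    -- On naturally time-ordered data this index stays well clustered.
    CREATE INDEX big_table_ts_idx ON big_table (ts);

    -- Search only the recent slice, and cap the number of rows returned.
    SELECT *
    FROM big_table
    WHERE ts >= now() - interval '30 days'
    ORDER BY ts DESC
    LIMIT 100;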