Home > mailing lists
Optimizing a complex query (long) - Mailing list pgsql-general

From	Paulo Jan
Subject	Optimizing a complex query (long)
Date	October 26, 2001 18:21:39
Msg-id	3BD9AFE6.2B6876F6@digital.ddnet.es Whole thread Raw
Responses	Re: Optimizing a complex query (long)
List	pgsql-general
Tree view
Hi all:

    I have here a photographic database searchable through the Web. The
database stores the name of each photo, along with its attributes (color
/ B & W, description, names of the celebrities appearing in it, keywords
associated, etc.), and the users have to be able to find the photos they
want based on said attributes. The problem that I'm having is that the
search on text fields is too slow, and I've been trying everything I
could think of (plus many things from the list archives) to speed it up,
without success.
    First of all: yes, all the relevant fields in the tables are indexed,
and I have run VACUUM ANALYZE. The main table (where photos are stored)
is something like:


                    Table "archivos"
     Attribute     |           Type           | Modifier
-------------------+--------------------------+----------
 idarchivo         | integer                  | not null
 codigofoto        | numeric(18,0)            |
 idsexo            | integer                  |
 idprovincia       | integer                  |
 idccaa            | integer                  |
 idpais            | integer                  |
 place             | character varying(255)   |
 fechafoto         | timestamp with time zone |
 nino              | numeric(1,0)             | not null
 joven             | numeric(1,0)             | not null
 adulto            | numeric(1,0)             | not null
 anciano           | numeric(1,0)             | not null
 posado            | numeric(1,0)             | not null
 title             | character varying(255)   |
 description       | character varying(1000)  |
 limitations       | character varying(500)   |

    (Actually, the "real" table has many more fields, like 40 or so. I've
taken out those that are not used at all in searches).
    "idsexo", "idprovincia", "idccaa" and "idpais", in turn, references
external tables: "lu_sexos", "lu_provincias", "lu_ccaa" and "lu_paises":


               Table "lu_sexos"
 Attribute |         Type          | Modifier
-----------+-----------------------+----------
 idsexo    | integer               | not null
 sex       | character varying(50) |
Index: lu_sexos_pkey


               Table "lu_paises"
 Attribute |          Type          | Modifier
-----------+------------------------+----------
 idpais    | integer                | not null
 country   | character varying(255) |
Index: lu_paises_pkey


                Table "lu_ccaa"
 Attribute |          Type          | Modifier
-----------+------------------------+----------
 idccaa    | integer                | not null
 idpais    | integer                |
 state     | character varying(255) |
Indices: lu_ccaa_idx,
         lu_ccaa_pkey


              Table "lu_provincias"
  Attribute  |          Type          | Modifier
-------------+------------------------+----------
 idprovincia | integer                | not null
 idccaa      | integer                |
 province    | character varying(255) |
Indices: lu_provincias_idx,
         lu_provincias_pkey


    And there are also other tables that store other attributes for the
photos:

    -Races (of the people who appear in each photo):

    create table lu_razas (idraza int4 primary key, race varchar(50));
    create table archivos_razas (idarchivoraza int4 primary key, idarchivo
int4 references archivos, idraza int4 references lu_razas);

    -Keywords:

    create table lu_tesauro (idtesauro int4 primary key, codtesauro
varchar(50), descriptor_eng varchar(255), synonym1 varchar(255),
synonym2 varchar(255));
    create table tesauro_archivos (idtesauroarchivo int4 primary key,
idtesauro int4 references lu_tesauro, idarchivo int4 references
archivos);


    -Celebrities in each photo:

    create table lu_personajes (idpersonaje int4 primary key, personaje
varchar(255), persona decimal(1,0), personalidad decimal(1,0), famoso
decimal(1,0), profession1 varchar(255), profession2 varchar(255), idpais
int4 references lu_paises);
    create table personajes_archivos (idpersonajearchivo int4 primary key,
idpersonaje int4 references lu_personajes, idarchivo int4 references
archivos);


    -Category in which each photo belongs:

    create table lu_categoriafotos (idcategoriafoto int4 primary key, class
varchar(255));
    create table archivo_categoriafoto (idarchivocategoriafoto int4 primary
key, idarchivo int4 references archivos, idcategoriafoto int4 references
lu_categoriafotos);


    Now, when the user types a keyword in the website, I have to search for
matches in all the text fields mentioned above (archivos.title,
archivos.description... but also keywords, celebrities, races,
countries, etc.). To make this easier I created some views in those
secondary tables, with the text field instead of the foreign key in each
case:

    create view v_archivo_categoriafoto as SELECT
archivo_categoriafoto.idarchivo, lu_categoriafotos.class where
lu_categoriafotos.idcategoriafoto=archivo_categoriafoto.idcategoriafoto;

    create view v_tesauro_archivos as SELECT
tesauro_archivos.idtesauroarchivo, lu_tesauro.descriptor_eng,
lu_tesauro.synonym1, lu_tesauro.synonym2, tesauro_archivos.idarchivo
where lu_tesauro.idtesauro=tesauro_archivos.idtesauro;

    create view v_archivos_razas as SELECT archivos_razas.idarchivoraza,
archivos_razas.idarchivo, lu_razas.race where
lu_razas.idraza=archivos_razas.idraza;

    create view v_personajes_archivos as SELECT
personajes_archivos.idpersonajearchivo, lu_personajes.personaje,
personajes_archivos.idarchivo WHERE
personajes_archivos.idpersonaje=lu_personajes.idpersonaje;

    (This way, I can do directly something like "SELECT idarchivo FROM
v_personajes_archivos WHERE personaje ~ 'LADY DI'"...)


    Still with me? Before anything else, please don't tell me about the
data model; I know how convoluted it is, and all I can say is that I
didn't design it :-(. The query I'm using to search all the fields above
is:


    SELECT idarchivo, codigofoto, title, date(fechafoto) AS fechafoto FROM
archivos WHERE (((UPPER(TRANSLATE(archivos.title || archivos.description
|| archivos.place, '[áéíóúÁÉÍÓÚ]', '[aeiouAEIOU]')) ~ 'SEARCH TEXT') OR
EXISTS (SELECT idarchivo FROM v_archivos_razas WHERE
v_archivos_razas.idarchivo=archivos.idarchivo AND UPPER(race) ~ 'SEARCH
TEXT') OR EXISTS (SELECT idarchivo FROM v_tesauro_archivos WHERE
v_tesauro_archivos.idarchivo=archivos.idarchivo AND UPPER(descriptor_eng
|| synonym1 || synonym2) ~ 'SEARCH TEXT') OR EXISTS (SELECT idpais FROM
lu_paises WHERE lu_paises.idpais=archivos.idpais AND UPPER(country) ~
'SEARCH TEXT') OR EXISTS (SELECT idccaa FROM lu_ccaa WHERE
lu_ccaa.idccaa=archivos.idccaa AND UPPER(state) ~ 'SEARCH TEXT') OR
EXISTS (SELECT idprovincia FROM lu_provincias WHERE
lu_provincias.idprovincia=archivos.idprovincia AND UPPER(province) ~
'SEARCH TEXT') OR EXISTS (SELECT idarchivo FROM v_archivo_categoriafoto
WHERE v_archivo_categoriafoto.idarchivo=archivos.idarchivo AND
UPPER(class) ~ 'SEARCH TEXT') OR EXISTS (select idarchivo from
v_personajes_archivos where
v_personajes_archivos.idarchivo=archivos.idarchivo and upper(personaje)
~ 'SEARCH TEXT')));


    There are 1327 records right now in the main table (archivos), 2743
records in lu_tesauro, 5978 records in lu_personajes and around 200
countries in lu_paises (the other tables have much less entries, less
than 100 in all cases).
    In my server, a Pentium III 1Gz. with 128 Mb. RAM and running Red Hat
Linux 6.2 and Postgres 7.1.2, the above query takes 65 sec. to execute,
which is too much. I have tried to group all the text fields I need to
search in a new table, to avoid all the subselects, so I did:


create table archivos_fulltext as
select distinct archivos.idarchivo,
archivos.title, archivos.description, archivos.place, lu_paises.country,
lu_ccaa.state, lu_provincias.province, lu_sexos.sex,
archivos.codigofoto, archivos.fechafoto
from archivos, lu_razas, lu_paises, lu_ccaa, lu_provincias, lu_sexos
where archivos.idpais = lu_paises.idpais and archivos.idccaa =
lu_ccaa.idccaa and archivos.idprovincia = lu_provincias.idprovincia and
archivos.idsexo = lu_sexos.idsexo;

create index archivos_fulltext_idx on archivos_fulltext (idarchivo,
title, description, place, country, state, province, sex);


    (The new table, "archivos_fulltext", has not only the text fields that
I need for the search, but also the fields that I want returned).
    Now, first problem here: I would have added too the text fields in the
views: v_personajes_archivos, v_archivos_razas, etc., but it gave me a
lot of duplicated entries in "archivos_fulltext" for obvious reasons, so
I finally decided to leave them as is. The main problem, however, is
that the new search takes exactly the same time:


    SELECT idarchivo, codigofoto, title, date(fechafoto) AS fechafoto FROM
archivos_fulltext WHERE ((upper(translate(title || description || place
|| country || state || province || sex, '[áéíóúÁÉÍÓÚ]', '[aeiouAEIOU]'))
~ 'SEARCH TEXT')
OR EXISTS (SELECT idarchivo FROM v_archivos_razas WHERE
v_archivos_razas.idarchivo=archivos_fulltext.idarchivo AND UPPER(race) ~
'SEARCH TEXT') OR EXISTS (SELECT idarchivo FROM v_tesauro_archivos WHERE
v_tesauro_archivos.idarchivo=archivos_fulltext.idarchivo AND
UPPER(descriptor_eng || synonym1 || synonym2) ~ 'SEARCH TEXT')
OR EXISTS (SELECT idarchivo FROM v_archivo_categoriafoto WHERE
v_archivo_categoriafoto.idarchivo=archivos_fulltext.idarchivo AND
UPPER(class) ~ 'SEARCH TEXT') or exists (select idarchivo from
v_personajes_archivos where
v_personajes_archivos.idarchivo=archivos_fulltext.idarchivo and
upper(personaje) ~ 'SEARCH TEXT'))

    Apparently, taking out 3 or 4 subselects didn't help at all.
    Playing with the shared buffers values, I was able to speed up both
searches by around 10 seconds (with 3072 shared buffers, it takes 54
seconds instead of 64), but it's still too slow to be acceptable. So my
question is basically: is there *any* way, any way at all to speed up
the above query? Optimizing it further in some way? Or is it just a
matter of brute force, i.e. adding more memory and CPU power?
    (Something else I forgot: the EXPLAIN plans for both searches. The
first one, the search performed directly on "archivos", says:

NOTICE:  QUERY PLAN:

Index Scan using archivos_idx on archivos  (cost=0.00..288.98 rows=1
width=36)
  SubPlan
    ->  Nested Loop  (cost=0.00..3.15 rows=1 width=12)
          ->  Index Scan using archivos_razas_idx on archivos_razas
(cost=0.00..2.03 rows=1 width=8)
          ->  Seq Scan on lu_razas  (cost=0.00..1.10 rows=1 width=4)
    ->  Nested Loop  (cost=0.00..93.37 rows=1 width=12)
          ->  Seq Scan on lu_tesauro  (cost=0.00..90.86 rows=1 width=4)
          ->  Index Scan using tesauro_archivos_idx on tesauro_archivos
(cost=0.00..2.50 rows=1 width=8)
    ->  Index Scan using lu_paises_pkey on lu_paises  (cost=0.00..2.02
rows=1 width=4)
    ->  Seq Scan on lu_ccaa  (cost=0.00..1.31 rows=1 width=4)
    ->  Seq Scan on lu_provincias  (cost=0.00..1.91 rows=1 width=4)
    ->  Nested Loop  (cost=0.00..3.75 rows=1 width=12)
          ->  Index Scan using archivo_categoriafoto_idx on
archivo_categoriafoto  (cost=0.00..2.09 rows=1 width=8)
          ->  Seq Scan on lu_categoriafotos  (cost=0.00..1.65 rows=1
width=4)
    ->  Nested Loop  (cost=0.00..179.71 rows=1 width=12)
          ->  Seq Scan on lu_personajes  (cost=0.00..177.67 rows=1
width=4)
          ->  Index Scan using personajes_archivos_idx on
personajes_archivos  (cost=0.00..2.03 rows=1 width=8)


    While the second one, on the new "fulltext" table, says:

NOTICE:  QUERY PLAN:

Seq Scan on archivos_fulltext  (cost=0.00..371614.95 rows=1327 width=36)
  SubPlan
    ->  Nested Loop  (cost=0.00..3.15 rows=1 width=12)
          ->  Index Scan using archivos_razas_idx on archivos_razas
(cost=0.00..2.03 rows=1 width=8)
          ->  Seq Scan on lu_razas  (cost=0.00..1.10 rows=1 width=4)
    ->  Nested Loop  (cost=0.00..93.37 rows=1 width=12)
          ->  Seq Scan on lu_tesauro  (cost=0.00..90.86 rows=1 width=4)
          ->  Index Scan using tesauro_archivos_idx on tesauro_archivos
(cost=0.00..2.50 rows=1 width=8)
    ->  Nested Loop  (cost=0.00..3.75 rows=1 width=12)
          ->  Index Scan using archivo_categoriafoto_idx on
archivo_categoriafoto  (cost=0.00..2.09 rows=1 width=8)
          ->  Seq Scan on lu_categoriafotos  (cost=0.00..1.65 rows=1
width=4)
    ->  Nested Loop  (cost=0.00..179.71 rows=1 width=12)
          ->  Seq Scan on lu_personajes  (cost=0.00..177.67 rows=1
width=4)
          ->  Index Scan using personajes_archivos_idx on
personajes_archivos  (cost=0.00..2.03 rows=1 width=8)

    Yes, I noticed it uses a sequencial scan. I tried to force it to use
the index with "SET enable_seqscan=off", but then the cost went up
through the roof: "cost=100000000.00..100979513.63". And BTW, the
queries don't return the 1327 rows, not by far: the query that I've been
using to perform tests returns actually only 5 records).

    Well, these are all the details I could think of. If you've read this
far, a big thank you (and a even bigger one if you can think of a way to
solve my problem... <g>)



                        Paulo Jan.
                        DDnet.
pgsql-general by date:
From: Rich Shepard
Date: 26 October 2001, 18:11:30
Subject: Re: null != null ???
From: Tom Lane
Date: 26 October 2001, 18:31:38
Subject: Re: initdb segfault - solaris 8
Optimizing a complex query (long) - Mailing list pgsql-general

Previous

Next