Thread: performance on selecting a row in large tables

performance on selecting a row in large tables

From
"Rainer Spittel"
Date:
Hi guys,

  I am not really new to Postgres, but to be honest I am not an expert
on the internal configuration of Postgres. I run a couple of
PostgreSQL/PostGIS databases on several servers, but two of them are not
performing very well. Those two databases are just read-only databases
and they get dropped and recreated every week.

  I have a table like this (incl. a PostGIS geometry column):

  col01_id    integer
  col02       character varying(10)
  col03       character varying(100)
  col04       double precision
  col05       double precision
  col06       character varying(80)
  col07       character varying(80)
  col08       character varying(40)
  col09       character varying(6)
  col10       character varying(100)
  col11       date
  col12       date
  col13       character varying(30)
  col14       character varying(30)
  col15       character varying(40)
  col16       character varying(10)
  col17       character varying(100)
  the_geom    geometry

About 2,400,000 rows are in the table, and the table is about 2200 MB on
disk. Three indexes are created: on col01_id, on col02 and on the
geometry column (clustered).

When running 'select col01_id from table limit 1 offset 100000;', the
query takes up to 20 seconds. Monitoring dstat on the server, I see that
the box is reading approx. 1 GB from the disks.

I ran VACUUM FULL on the table, then REINDEX, ANALYZE and CLUSTER, all
sorts of desperate attempts to get better performance out of this table,
without any success.

I copied this table, with fewer columns, into a second table and ran the
same query on it; that query performs much better.

The server has two dual-core CPUs and 2 GB of RAM, running Fedora 5 with
Postgres 8.1.9 and PostGIS on top of it. The shared memory settings have
been increased to:

  kernel.shmmax=1073741824
  kernel.shmall=2097152

and I changed the following settings in postgresql.conf:

  max_connections = 1000
  shared_buffers = 65536
  max_fsm_pages = 104000

My co-workers are getting to the point of moving back to MySQL, but I
would like to stick with PostgreSQL.

Looking forward to any comments or suggestions.

Cheers,
Rainer

Re: performance on selecting a row in large tables

From
Tom Lane
Date:
"Rainer Spittel" <rainer.spittel@terralink.co.nz> writes:
> When performing a 'select col01_id from table limit 1 offset 100000;',
> the query takes up to 20sec. Monitoring the dstats on the server, I see
> that the box is reading approx. 1GB from the disks.

No surprise.  That query says "read 100001 rows, then throw away the
first 100000".

Why are you considering that this is an important performance measure?
Are you aware that the results of such a query aren't even very well
defined (since you omitted an ORDER BY)?
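To illustrate why the OFFSET form is expensive, here is a self-contained
sketch using SQLite via Python's sqlite3 (the single-column table is
invented; only the name col01_id comes from the thread). With a
deterministic ORDER BY, the row at a given position can be fetched either
by discarding OFFSET rows, or by seeking past the last key already seen
("keyset pagination") - and only the latter avoids the
read-and-throw-away work Tom describes.

```python
import sqlite3

# Self-contained stand-in for the 2.4M-row table, with col01_id as the
# ordering key (the data here is made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col01_id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO t (col01_id) VALUES (?)",
                 [(i,) for i in range(1, 1001)])

# OFFSET: the engine must produce and discard all preceding rows.
row_via_offset = conn.execute(
    "SELECT col01_id FROM t ORDER BY col01_id LIMIT 1 OFFSET 500"
).fetchone()

# Keyset pagination: remember the last key seen (here 500) and seek
# past it; an index satisfies this without discarding anything.
row_via_keyset = conn.execute(
    "SELECT col01_id FROM t WHERE col01_id > ? ORDER BY col01_id LIMIT 1",
    (500,)
).fetchone()

print(row_via_offset)  # (501,)
print(row_via_keyset)  # (501,)
```

On an indexed column the keyset form lets the engine descend the index
directly instead of materializing 100,000 rows first, which is why
large-OFFSET pagination gets slower the deeper you page.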

            regards, tom lane

Re: performance on selecting a row in large tables

From
Tino Schwarze
Date:
Hi Rainer,

On Tue, Feb 05, 2008 at 09:07:26AM +1300, Rainer Spittel wrote:

> When performing a 'select col01_id from table limit 1 offset 100000;',

As Tom already stated, the result of this query is a bit random since
there is no ordering at all. What are you actually querying for?

Try tuning queries before tuning the server.

> Looking forward for any comments or suggestions.

Here it is, right at the bottom of your mail:

> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend

explain analyze select col01_id from table limit 1 offset 100000;
will run the query and show you what the query plan looks like and how
much time is spent. It is likely to say "Seq Scan" somewhere - a
sequential scan of the table.

HTH!

Tino.

--
www.craniosacralzentrum.de
www.spiritualdesign-chemnitz.de

Tino Schwarze * Lortzingstraße 21 * 09119 Chemnitz

Re: performance on selecting a row in large tables

From
"Rainer Spittel"
Date:
Hi Tom,

  You are right, this query is not the right approach to performance
testing. I thought it would give an indication of the performance of a
select statement on that table.

  One of those slow queries runs on col02, which has a btree index.
But I use the 'in' expression to get a set of matching rows:

  select * from table where col02 in ('...',[...],'...')

  This query sometimes gets really slow; I guess it depends on the size
of the set used by 'in'.

  Would the query perform better if I clustered the index on col02 and
ordered the value set in the in clause?

  Is there a way to disable caching for testing? Once I have run the
query, the result set seems to be cached and the second run of the query
is fast. This makes testing a little difficult ;-)

regards.
Rainer

Re: performance on selecting a row in large tables

From
Tom Lane
Date:
"Rainer Spittel" <rainer.spittel@terralink.co.nz> writes:
>   One of those slow queries runs on col02, which has a btree index.
> But I use the 'in' expression to get a set of matching rows:
>   select * from table where col02 in ('...',[...],'...')
>   This query sometimes gets really slow; I guess it depends on the
> size of the set used by 'in'.

Don't guess.  Use EXPLAIN ANALYZE and find out what is happening.

            regards, tom lane

Re: performance on selecting a row in large tables

From
Tino Schwarze
Date:
Hi Rainer,

On Tue, Feb 05, 2008 at 10:24:11AM +1300, Rainer Spittel wrote:

>   You are right, this query is not the right approach to performance
> testing. I thought it would give an indication of the performance of
> a select statement on that table.

Only do benchmarking with real-world queries. Or you'll end up
optimizing for artificial workloads which never occur.

>   One of those slow queries runs on col02, which has a btree index.
> But I use the 'in' expression to get a set of matching rows:
>
>   select * from table where col02 in ('...',[...],'...')
>
>   This query sometimes gets really slow; I guess it depends on the
> size of the set used by 'in'.

How many values are in the in list? It might be faster to use a
subselect if the values come from the DB anyway.
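A sketch of that suggestion, self-contained via Python's sqlite3 (the
tables big and wanted and all the data are hypothetical): when the
values for the IN list already live in a table, a subselect returns the
same rows without building a long literal list in the SQL text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE big (col02 TEXT, payload TEXT);
CREATE INDEX big_col02_idx ON big (col02);
CREATE TABLE wanted (col02 TEXT PRIMARY KEY);  -- source of the values
""")
conn.executemany("INSERT INTO big VALUES (?, ?)",
                 [(f"k{i}", f"p{i}") for i in range(1000)])
conn.executemany("INSERT INTO wanted VALUES (?)",
                 [("k1",), ("k5",), ("k9",)])

# Literal IN list, as in the original query.
in_rows = conn.execute(
    "SELECT col02, payload FROM big"
    " WHERE col02 IN ('k1', 'k5', 'k9') ORDER BY col02"
).fetchall()

# Same result via a subselect when the values already live in a table.
sub_rows = conn.execute(
    "SELECT col02, payload FROM big"
    " WHERE col02 IN (SELECT col02 FROM wanted) ORDER BY col02"
).fetchall()

print(in_rows == sub_rows)  # True
```

In PostgreSQL the planner can treat the subselect form as a join against
the index on col02, and the SQL text stays the same size no matter how
many values there are.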

>   Would the query perform better if I clustered the index on col02
> and ordered the value set in the in clause?

>   Is there a way to disable caching for testing? Once I have run the
> query, the result set seems to be cached and the second run of the
> query is fast. This makes testing a little difficult ;-)

Use "EXPLAIN query" first to get an estimate of how the query will
perform. Get used to reading EXPLAIN output. You already seem to have
tuned the DB to your server (setting memory sizes etc.); now optimize
the queries first, then the DB server (look at settings like
random_page_cost, effective_cache_size etc. if the planner performs
sequential scans where you think it should use indexes).

If you really want to rule out caching, you'd need to keep thrashing
your OS's cache, e.g. by running "while dd if=/dev/some-drive
of=/dev/null bs=1M ; do : ; done" in the background. But again, you'll
be optimizing for an artificial workload - in real life, the cache will
play a significant role. There's no use in optimizing queries for a
no-cache scenario if, in practice, the cache will have the data. I know
of real-life applications where the first query takes some time and all
subsequent ones are fast. That's what caches are for.

HTH,

Tino.

--
www.craniosacralzentrum.de
www.spiritualdesign-chemnitz.de

Tino Schwarze * Lortzingstraße 21 * 09119 Chemnitz