Thread: performance on selecting a row in large tables
Hi guys,

I am not really new to Postgres, but to be honest I am not an expert on its internal configuration. I run a couple of PostgreSQL/PostGIS databases on several servers, but two of them are not performing very well. Those two databases are read-only and get dropped and recreated every week.

I have a table like this (including a PostGIS geometry column):

    col01_id   integer
    col02      character varying(10)
    col03      character varying(100)
    col04      double precision
    col05      double precision
    col06      character varying(80)
    col07      character varying(80)
    col07      character varying(40)
    col08      character varying(6)
    col09      character varying(100)
    col10      date
    col11      date
    col12      character varying(30)
    col13      character varying(30)
    col14      character varying(40)
    col15      character varying(10)
    col16      character varying(100)
    the_geom   geometry

About 2,400,000 rows are in the table, and its size is about 2200 MB. Three indexes are created: on col01_id, on col02, and on the geometry column (clustered).

When performing a 'select col01_id from table limit 1 offset 100000;', the query takes up to 20 sec. Monitoring dstat on the server, I see that the box is reading approx. 1 GB from the disks. I ran VACUUM FULL on the table, then REINDEX, ANALYZE and CLUSTER -- all sorts of desperate tries to get better performance out of this table, without any success. I copied this table with fewer columns into a second table and ran the same query on that table, which performs much better.

The server has two dual-core CPUs and 2 GB RAM, running Fedora 5, Postgres 8.1.9 and PostGIS on top of it. Shared memory settings have been increased to:

    kernel.shmmax = 1073741824
    kernel.shmall = 2097152

and I changed the following settings in postgresql.conf:

    max_connections = 1000
    shared_buffers = 65536
    max_fsm_pages = 104000

My co-workers are getting to the point of moving back to MySQL, but I would like to stick with PostgreSQL.

Looking forward to any comments or suggestions.

Cheers,
Rainer
"Rainer Spittel" <rainer.spittel@terralink.co.nz> writes: > When performing a 'select col01_id from table limit 1 offset 100000;', > the query takes up to 20sec. Monitoring the dstats on the server, I see > that the box is reading approx. 1GB from the disks. No surprise. That query says "read 100001 rows, then throw away the first 100000". Why are you considering that this is an important performance measure? Are you aware that the results of such a query aren't even very well defined (since you omitted an ORDER BY)? regards, tom lane
Hi Rainer,

On Tue, Feb 05, 2008 at 09:07:26AM +1300, Rainer Spittel wrote:

> When performing a 'select col01_id from table limit 1 offset 100000;',

As Tom already stated, the result of this query is a bit random since there is no ordering at all. What are you actually querying for? Try tuning queries before tuning the server.

> Looking forward to any comments or suggestions.

Here it is, right at the bottom of your mail:

> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend

    explain analyze select col01_id from table limit 1 offset 100000;

will run the query and show you how the query plan looks and how much time is spent. It is likely to say "Seq Scan" somewhere -- a sequential scan of the table.

HTH!

Tino.

-- 
www.craniosacralzentrum.de
www.spiritualdesign-chemnitz.de

Tino Schwarze * Lortzingstraße 21 * 09119 Chemnitz
Hi Tom,

You are right, this query is not the right approach for performance testing. I thought it would give an indication of the performance of a select statement on that table.

One of the slow queries runs on col02, which has a btree index, but I use the IN expression to get a set of matching rows:

    select * from table where col02 in ('...',[...],'...')

This query sometimes gets really slow; I guess it depends on the size of the set used by IN.

Would the query perform better if I clustered the index on col02 and forced the set in the IN clause to be ordered?

Is there a way to disable caching for testing? Once I have run the query, the result set seems to be cached and the second run of the query is fast. This makes testing a little difficult ;-)

regards.
Rainer
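PS: in case it matters, this is roughly what I had in mind for the clustering (8.1 syntax; the index name here is made up, ours is named differently):

    cluster table_col02_idx on table;
    analyze table;

The ANALYZE afterwards is because the docs recommend refreshing statistics after a CLUSTER.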
"Rainer Spittel" <rainer.spittel@terralink.co.nz> writes: > One of those slow queries are running on col02 which has a btree > index. But I use the 'in' expression to get a set of matching rows: > select * from table where col02 in ('...',[...],'...') > This query gets sometimes really slow, I guess it depends on the size > of the set used by 'in'. Don't guess. Use EXPLAIN ANALYZE and find out what is happening. regards, tom lane
Hi Rainer,

On Tue, Feb 05, 2008 at 10:24:11AM +1300, Rainer Spittel wrote:

> You are right, this query is not the right approach for performance
> testing. I thought it would give an indication of the performance of a
> select statement on that table.

Only do benchmarking with real-world queries, or you'll end up optimizing for artificial workloads which never occur.

> One of the slow queries runs on col02, which has a btree index, but I
> use the IN expression to get a set of matching rows:
>
>     select * from table where col02 in ('...',[...],'...')
>
> This query sometimes gets really slow; I guess it depends on the size
> of the set used by IN.

How many values are in the IN clause? It might be faster to use a subselect if the values come from the database anyway -- see the sketch below my signature.

> Would the query perform better if I clustered the index on col02 and
> forced the set in the IN clause to be ordered?
>
> Is there a way to disable caching for testing? Once I have run the
> query, the result set seems to be cached and the second run of the query
> is fast. This makes testing a little difficult ;-)

Use "EXPLAIN query" first to get an estimate of how the query will perform, and get used to reading EXPLAIN output. You already seem to have tuned the DB to your server (setting memory sizes etc.), so optimize the queries first, then the DB server (look at settings like random_page_cost and effective_cache_size if the planner performs sequential scans and you think it should use indexes).

If you really want to rule out caching, you'd need to keep thrashing your OS's cache, e.g. by running

    while dd if=/dev/some-drive of=/dev/null bs=1M ; do : ; done

in the background. But again, you'd be optimizing for an artificial workload -- in real life, the cache will play a significant role. There's no use in optimizing queries for a no-cache scenario if, in practice, the cache will have the data. I know of real-life applications where the first query takes some time and all subsequent ones are fast. That's what caches are for.

HTH,

Tino.

-- 
www.craniosacralzentrum.de
www.spiritualdesign-chemnitz.de

Tino Schwarze * Lortzingstraße 21 * 09119 Chemnitz
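PS: the subselect/join variant I mean, as a rough sketch -- the "lookup" table, its columns and the WHERE condition are made up, adjust them to wherever your IN values actually come from:

    select t.*
      from table t
      join lookup l on l.col02 = t.col02
     where l.batch_id = 42;

And rough starting points for the two planner settings mentioned above (8.1 units: effective_cache_size is in 8 kB pages; these are guesses for a 2 GB box, verify against your own workload before keeping them):

    # postgresql.conf
    effective_cache_size = 131072   # ~1 GB
    random_page_cost = 3            # default is 4; lower it if data is mostly cached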