Thread: Why are selects so slow on large tables, even when indexed?
To test PostgreSQL's scalability, I created a table with approximately 76M rows. The table had four columns: a bigint, a varchar(32), another bigint and a varchar(80). The first three columns were filled with values, the fourth was left null. After populating the table, I created an index on the first column (a non-unique index, as the column contains duplicate values) and then VACUUMed. Select statements involving only the indexed column are pathetically slow (tens of minutes). Some examples:
select count(*) from a where id < 0; /* returns 0 rows */
select * from a where id=5; /* returns a handful of rows */
76M rows is a lot, but it shouldn't be that bad when id is indexed.
Attached are two scripts. One creates the table, the other populates it. I typed "create index index_a on a(id)" and "vacuum" by hand. I see this behavior both on Windows and RedHat Linux using PostgreSQL version 7.1.3 in both cases. Any idea why the performance is so poor? Can this be corrected by tuning?
Attachment
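The attachment itself is not reproduced in the archive. A minimal sketch of the setup the post describes might look like the following — only the table name "a", the column name "id", and the index name "index_a" appear in the thread; the other column names here are placeholders:

```sql
-- Hypothetical reconstruction of the schema described above.
-- Only "a", "id", and "index_a" are named in the message;
-- the remaining column names are assumptions.
CREATE TABLE a (
    id    bigint,        -- filled, contains duplicate values
    name  varchar(32),   -- filled
    ref   bigint,        -- filled
    note  varchar(80)    -- left null
);

-- Non-unique index on the first column, as described:
CREATE INDEX index_a ON a (id);

-- Plain VACUUM, as the poster typed it by hand:
VACUUM;
```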
On 26 Mar 2002 at 16:28, Robert Wille wrote:

> Attached are two scripts. One creates the table, the other populates it. I
> typed "create index index_a on a(id)" and "vacuum" by hand.

What was the vacuum command you issued? I hope it was "vacuum analyze".

--
Dan Langille
The FreeBSD Diary - http://freebsddiary.org/ - practical examples
On Tue, 2002-03-26 at 18:28, Robert Wille wrote:

> To test PostgreSQL's scalability, I created a table with approximately 76M rows.
> The table had four columns: a bigint, a varchar(32), another bigint
> and a varchar(80).
>
> select count(*) from a where id < 0; /* returns 0 rows */
> select * from a where id=5; /* returns a handful of rows */
>
> 76M rows is a lot, but it shouldn't be that bad when id is indexed.

A couple things:

(1) You indicated that you ran VACUUM. You'll need to run VACUUM ANALYZE
(or just ANALYZE) to update the planner's statistics. For your
particular situation, this is essential.

(2) There is a long-standing bug with indexes on int8 columns: if you
use a numeric literal as a qualifier, it will be converted to an int4,
so the index won't be used. There is an easy work-around:

select * from a where id = 5;       /* won't use index if id is int8 */
select * from a where id = 5::int8; /* will use index, if appropriate */

(3) You can get more information on the decisions Postgres is making
when executing your query through the use of EXPLAIN. In this instance,
it will likely tell you that the index isn't being used at all, and a
sequential scan is being performed.

If you follow suggestions #1 and #2, you should see markedly improved
performance. Let us know the results...

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC
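Putting the three suggestions above together, a test session might look like the following (commands only — the actual plan output will depend on the installation, so none is shown here):

```sql
-- (1) Refresh the planner's statistics:
VACUUM ANALYZE a;

-- (2) Cast the literal so the int8 index can be considered:
-- (3) and use EXPLAIN to see whether the planner chose an
--     Index Scan or a Seq Scan for each query.
EXPLAIN SELECT count(*) FROM a WHERE id < 0::int8;
EXPLAIN SELECT * FROM a WHERE id = 5::int8;
```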
The suggested fixes have helped a lot, but it is still rather slow. The time varies and can be upwards of 10 to 20 seconds on a ~47M row table. Is this normal? Similar queries on an indexed varchar column in Oracle with about 1/2 as many rows execute at least a hundred times faster.

----- Original Message -----
From: "Neil Conway" <nconway@klamath.dyndns.org>
To: "Robert Wille" <rwille@iarchives.com>
Cc: <pgsql-general@postgresql.org>; "Russell Black" <russell.black@iarchives.com>
Sent: Tuesday, March 26, 2002 4:47 PM
Subject: Re: [GENERAL] Why are selects so slow on large tables, even when indexed?

> On Tue, 2002-03-26 at 18:28, Robert Wille wrote:
> > To test PostgreSQL's scalability, I created a table with approximately 76M rows.
> > The table had four columns: a bigint, a varchar(32), another bigint
> > and a varchar(80).
> >
> > select count(*) from a where id < 0; /* returns 0 rows */
> > select * from a where id=5; /* returns a handful of rows */
> >
> > 76M rows is a lot, but it shouldn't be that bad when id is indexed.
>
> A couple things:
>
> (1) You indicated that you ran VACUUM. You'll need to run VACUUM ANALYZE
> (or just ANALYZE) to update the planner's statistics. For your
> particular situation, this is essential.
>
> (2) There is a long-standing bug with indexes on int8 columns: if you
> use a numeric literal as a qualifier, it will be converted to an int4,
> so the index won't be used. There is an easy work-around:
>
> select * from a where id = 5; /* won't use index if id is int8 */
> select * from a where id = 5::int8; /* will use index, if appropriate */
>
> (3) You can get more information on the decisions Postgres is making
> when executing your query through the use of EXPLAIN. In this instance,
> it will likely tell you that the index isn't being used at all, and a
> sequential scan is being performed.
>
> If you follow suggestions #1 and #2, you should see markedly improved
> performance. Let us know the results...
>
> Cheers,
>
> Neil
>
> --
> Neil Conway <neilconway@rogers.com>
> PGP Key ID: DB3C29FC
On Wed, 2002-03-27 at 09:52, Robert Wille wrote:

> The suggested fixes have helped a lot, but it is still rather slow. The time
> varies and can be upwards of 10 to 20 seconds on a ~47M row table. Is this
> normal?

You haven't given us enough information to be able to tell. What
hardware is this running on? What (exact) queries are you executing?
What is the schema of any relevant database tables, indexes, views,
etc.? What does EXPLAIN produce for the query? What configuration
changes have you made? (e.g. increasing the size of the shared
buffers). What OS is this running on, and how has the OS been tuned?
Are you running a single query, multiple sequential queries, or
multiple concurrent queries?

For my own curiosity, what are the results if you execute the same
query using a hash index -- i.e.

CREATE INDEX your_index ON your_table USING hash (your_column);

(You'll need to drop the existing btree index -- once you've got the
results, I'd recommend removing the hash index and re-instating the
btree one for production use.)

> Similar queries on an indexed varchar column in Oracle with about
> 1/2 as many rows execute at least a hundred times faster.

Keep in mind that the time taken to return a single SELECT query is a
very incomplete measure of RDBMS performance; nevertheless, I doubt
that Postgres is, on average, 100x slower than Oracle. Can you post the
Oracle equivalent of EXPLAIN for the queries you're running to derive
this 100x figure?

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC
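Spelled out for the table in this thread, the index swap described above might look like the following (the hash index name is a placeholder; btree is the default index type, so no USING clause is needed when restoring it):

```sql
-- Replace the btree index with a hash index for the test:
DROP INDEX index_a;
CREATE INDEX index_a_hash ON a USING hash (id);

-- ... run the timing comparison, then restore the
-- btree index for production use:
DROP INDEX index_a_hash;
CREATE INDEX index_a ON a (id);
```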
"Robert Wille" <rwille@iarchives.com> writes: > The suggested fixes have helped a lot, but it is still rather > slow. The time varies and can be upwards of 10 to 20 seconds on a > ~47M row table. Is this normal? Similar queries on an indexed > varchar column in Oracle with about 1/2 as many rows execute at > least a hundred times faster. I realize that it has taken me quite a while to get back to you, on this particular case but I have been running some tests on your data (or 28 million lines of it anyway) and I have some stuff to share. First of all, your sample data set has the *opposite* problem of most queries that PostgreSQL users complain about. Most people complain about queries that do sequential scans when PostgreSQL should be doing an index scan. Your data, on the other hand, caused PostgreSQL to do an indexscan when it probably should have been doing a sequential scan. After all, there are only 1000 (or so) different unique values of "id" and the instances of each value are spread evenly throughout the table. Since you are going to touch most pages anyhow consulting the index is just an extra step. On my limited test machine I actually saw modest gains on queries like: SELECT count(*) FROM a WHERE id = 89::bigint; However, if you cluster the index on id then similar queries started to return *immediately*. Don't forget to vacuum after clustering. I hope this was helpful, I sured learned a lot. Jason