Thread: not using indexes on large table

not using indexes on large table

From: Jeroen Kleijer

Hi all,

I'm a bit new to PostgreSQL and database design in general so forgive me
for asking stupid questions. ;-)

I've set up a PostgreSQL database on a Linux machine (2 processors, 1GB
mem) and while the database itself resides on a NetApp filer, accessed via
NFS, this doesn't seem to impact performance too drastically.

I basically use it for indexed tables without any relations between 'em,
and so far this has worked perfectly.

For statistics I've created the following table:
volume varchar(30),
qtree varchar(255),
file varchar(512),
ctime timestamp,
mtime timestamp,
atime timestamp
and created separate indexes on the volume and qtree columns.
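
Spelled out in full, the table and indexes look roughly like this (the
index names here are just illustrative):

create table project_access_times (
    volume varchar(30),
    qtree  varchar(255),
    file   varchar(512),
    ctime  timestamp,
    mtime  timestamp,
    atime  timestamp
);

create index pat_volume_idx on project_access_times (volume);
create index pat_qtree_idx  on project_access_times (qtree);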

This table gets filled with the COPY command and, about 2 hours and
some 40 million records later, I issue a REINDEX command to make sure the
indexes are accurate. (For what it's worth, there are some 35 distinct
values for volume and some 1450 for qtree.)
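
Roughly, the load step looks like this (the file path is just a
placeholder):

copy project_access_times from '/path/to/stats.dat';  -- placeholder path
reindex table project_access_times;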

While filling this table, my database grows to an (expected) 11.5GB.

The problem comes when I try to do a query without a where clause,
because then it completely discards the indexes and does a complete
table scan which takes over half an hour! (40.710.725 rows, 1110258
pages, 1715 seconds)

I've tried several things but doing a query like:
select distinct volume from project_access_times
or
select distinct qtree from project_access_times
always results in a full sequential table scan, even after a 'vacuum' and
'vacuum analyze'.

I even tried 'set enable_seqscan = no' but it still does a full
table scan instead of using the indexes.
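
In case it helps, the session looked roughly like this (reconstructed from
memory, not a literal transcript):

set enable_seqscan = off;  -- same effect as 'no'
explain select distinct volume from project_access_times;
-- the plan still shows a sequential scan over the whole table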

Can anyone tell me if this is normal behaviour (half an hour seems over
the top to me) and, if not, what I can do about it?

Regards,

Jeroen Kleijer

Re: not using indexes on large table

From: Vincenzo Romano

On Saturday 21 April 2007 22:17:42 Jeroen Kleijer wrote:
> I've tried several things but doing a query like:
> select distinct volume from project_access_times

I'm new too, but an "order by volume" could help!

In any case maybe a different table design with a separate table for the
"distinct volumes" could help even more.

--
Vincenzo Romano
----
Maybe Computers will never become as intelligent as
Humans. For sure they won't ever become so stupid.

Re: not using indexes on large table

From: Andreas Kostyrka

* Jeroen Kleijer <jeroen.kleijer@xs4all.nl> [070421 23:10]:
>
> Hi all,
>
> I'm a bit new to PostgreSQL and database design in general so forgive me
> for asking stupid questions. ;-)
>
> I've set up a PostgreSQL database on a Linux machine (2 processors, 1GB
> mem) and while the database itself resides on a NetApp filer, accessed via
> NFS, this doesn't seem to impact performance too drastically.
>
> I basically use it for indexed tables without any relations between 'em,
> and so far this has worked perfectly.
>
> For statistics I've created the following table:
> volume varchar(30),
> qtree varchar(255),
> file varchar(512),
> ctime timestamp,
> mtime timestamp,
> atime timestamp
> and created separate indexes on the volume and qtree columns.
>
> This table gets filled with the COPY command and, about 2 hours and
> some 40 million records later, I issue a REINDEX command to make sure the
> indexes are accurate. (For what it's worth, there are some 35 distinct
> values for volume and some 1450 for qtree.)
>
> While filling this table, my database grows to an (expected) 11.5GB.
>
> The problem comes when I try to do a query without a where clause,
> because then it completely discards the indexes and does a complete
> table scan which takes over half an hour! (40.710.725 rows, 1110258
> pages, 1715 seconds)
>
> I've tried several things but doing a query like:
> select distinct volume from project_access_times
> or
> select distinct qtree from project_access_times
> always results in a full sequential table scan, even after a 'vacuum' and
> 'vacuum analyze'.

Try:
select volume from project_access_times group by volume;

And no matter what, running a database over NFS smells like a dead rat.

Hopefully you've mounted it hard, but NFS still does not have normal
filesystem semantics, e.g. locking.

Next thing: as you've got only one client for that NFS mount, try to
make it cache metadata aggressively. The ac-prefixed options in
nfs(5) come to mind.
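
For example, something along these lines (the export path, mount point and
timeout value are just illustrative):

mount -o rw,hard,intr,actimeo=600 filer:/vol/pgdata /var/lib/pgsql/data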

Andreas

Re: not using indexes on large table

From: "Dave Dutcher"

> From: pgsql-performance-owner@postgresql.org
> [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of
> Jeroen Kleijer
>
> The problem comes when I try to do a query without a where clause,
> because then it completely discards the indexes and does a complete
> table scan which takes over half an hour! (40.710.725 rows, 1110258
> pages, 1715 seconds)
>
> I've tried several things but doing a query like:
> select distinct volume from project_access_times
> or
> select distinct qtree from project_access_times
> always results in a full sequential table scan, even after a
> 'vacuum' and 'vacuum analyze'.

To my knowledge Postgres doesn't use indexes for distinct queries or
grouping.  Also, you are getting horrible I/O performance.  Our old slow test
machine can scan a table of 12 million rows in 100 seconds, and our
production server can do the same in 20 seconds.  If possible, I would try
running the same thing on your local hard drive.  That way you can see how
much the NetApp and NFS are slowing you down.  Although in the end, if you
need very fast distinct queries, you will need to maintain a separate table.
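
A minimal sketch of that idea, assuming a helper table (the name is just
illustrative; it would need to be refreshed from your load script, or kept
up to date with a trigger):

create table distinct_volumes (volume varchar(30) primary key);

insert into distinct_volumes
    select distinct volume from project_access_times;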

Dave


Re: not using indexes on large table

From: Scott Marlowe

On Sat, 2007-04-21 at 15:17, Jeroen Kleijer wrote:
> Hi all,
>
> I'm a bit new to PostgreSQL and database design in general so forgive me
> for asking stupid questions. ;-)
>
> I've set up a PostgreSQL database on a Linux machine (2 processors, 1GB
> mem) and while the database itself resides on a NetApp filer, accessed via
> NFS, this doesn't seem to impact performance too drastically.

What does a benchmark like bonnie++ say about your performance?  And I
hope your data's not too important to you, because I've had LOTS of
problems with NFS mounts in the past with pgsql.  Generally speaking,
NFS can be moderately fast, or moderately reliable (for databases) but
it generally isn't both at the same time.
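
For example, run it against the NFS mount (the path is just a placeholder;
bonnie++ wants a test size of roughly twice your RAM, so 2GB here):

bonnie++ -d /mnt/netapp/bonnie_test -s 2048 -u postgres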

Considering the cost of a quartet of 80 Gig SATA drives ($59x4) and a
decent RAID controller (LSI, Areca at ~$450 or so) you could be getting
VERY good performance out of your system with real reliability at the
same time on a RAID-10 volume.  Then use the NetApp for backup.  That's
what I'd do.

> I basically use it for indexed tables without any relations between 'em,
> and so far this has worked perfectly.
>
> For statistics I've created the following table:
> volume varchar(30),
> qtree varchar(255),
> file varchar(512),
> ctime timestamp,
> mtime timestamp,
> atime timestamp
> and created separate indexes on the volume and qtree columns.

You might want to look at setting this up as two or three tables, with a
view and update triggers to make them look like one table to the user, and
with qtree and file in their own tables.  That would make your main stats
table only one varchar(30) and 3 timestamps wide, which matters especially
if qtree and file tend to be large.  If one of those tends to be small and
the other large, then look at moving just the large one into its own table.
The reasons for this will be obvious later on in this post.
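
A rough sketch of that layout (table and column names are just
illustrative, the update triggers are left out, and the view keeps the old
table name so existing queries still work):

create table qtrees (
    qtree_id serial primary key,
    qtree    varchar(255) unique
);

create table files (
    file_id serial primary key,
    file    varchar(512) unique
);

create table access_stats (
    volume   varchar(30),
    qtree_id integer references qtrees,
    file_id  integer references files,
    ctime    timestamp,
    mtime    timestamp,
    atime    timestamp
);

create view project_access_times as
    select s.volume, q.qtree, f.file, s.ctime, s.mtime, s.atime
    from access_stats s
    join qtrees q using (qtree_id)
    join files f using (file_id);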

> The problem comes when I try to do a query without a where clause,
> because then it completely discards the indexes and does a complete
> table scan which takes over half an hour! (40.710.725 rows, 1110258
> pages, 1715 seconds)

Yes it does, and it should.

Why?  Visibility.  This has been discussed quite a bit on the lists.
Because of the particular design for PostgreSQL's MVCC implementation,
indexes cannot contain visibility information on tables.  Therefore,
every time the db looks in an index, it then has to look in the table
anyway to find the right version of that tuple and to see if it's
actually valid for your snapshot.

> Can anyone tell me if this is normal behaviour (half an hour seems over
> the top to me) and, if not, what I can do about it?

Yes, this is normal behaviour.  It's just how PostgreSQL works.  There
are some workarounds out there that involve updating extra tables that
carry things like counts etc...  Each of these costs something in
overhead.
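
One common variant, very roughly sketched (names are illustrative, and note
that a per-row trigger like this will slow your 40-million-row COPY down
noticeably):

create table row_counts (table_name text primary key, n bigint);
insert into row_counts values ('project_access_times', 0);

create or replace function bump_pat_count() returns trigger as $$
begin
    if tg_op = 'INSERT' then
        update row_counts set n = n + 1
            where table_name = 'project_access_times';
    else
        update row_counts set n = n - 1
            where table_name = 'project_access_times';
    end if;
    return null;
end;
$$ language plpgsql;

create trigger pat_count after insert or delete on project_access_times
    for each row execute procedure bump_pat_count();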

There are two distinct problems here.  One is that you're trying to use
PostgreSQL in a role where perhaps a different database might be a
better fit.  MSSQL Server or DB2 or even MySQL might be a better
choice depending on what you want to do with your data.

The other problem is that you're using an NFS server.  Either go whole
hog and buy a SAN with dual 2G NICs in it, or put local storage underneath
your machine with LOTS of hard drives in RAID-10.

Note that while other databases may be better at some of the queries
you're trying to run, it might be that PostgreSQL is still a good choice
because of other queries, and you can do rollups of the data that it's
slow at while using it for the things it is good at.

I've got a test db on my workstation that's pretty big at 42,463,248
rows, taking up 12 Gigs just for the table and 7.7 Gigs in indexes, and
a select count(*) on it takes 489 seconds.  I try not to do things like
that.  It covers the last 9 months of statistics.

This query:

select a, b, count(*) from summary where atime > '2006-06-16' and
        perspective = 'yada'
        group by a, b
        order by a, b

took 300 seconds, which is typical.

This is on a workstation with one CPU, 2 gigs of ram, and a 150 Gig SATA
drive.  It's running X Windows, with Evolution, Firefox, and a dozen
other user apps up and running.  Our "real" server, with 4 disks in a
RAID 5 on a mediocre RAID controller but with 2 CPUs and 6 gigs of ram,
stomps my little workstation into the ground.

I have the feeling my laptop with 512 Meg of ram and a 1.6 GHz CPU would
be faster than your current server.