Re: CPU and RAM - Mailing list pgsql-performance

From Harry Jackson
Subject Re: CPU and RAM
Date
Msg-id 45b42ce40512300038k7426786etfe6c3a6d0efffb39@mail.gmail.com
Whole thread Raw
In response to CPU and RAM  (Harry Jackson <harryjackson@gmail.com>)
List pgsql-performance
On 24 Dec 2005 10:25:09 -0500, Greg Stark <gsstark@mit.edu> wrote:
>
> Harry Jackson <harryjackson@gmail.com> writes:
>
> > I always look at the explain plans.
> >
> > =# explain select item_id, term_frequency from reverse_index where
> > term_id = 22781;
> >                                           QUERY PLAN
> > -----------------------------------------------------------------------------------------------
> >  Bitmap Heap Scan on reverse_index  (cost=884.57..84443.35 rows=150448 width=8)
> >    Recheck Cond: (term_id = 22781)
> >    ->  Bitmap Index Scan on reverse_index_term_id_idx
> > (cost=0.00..884.57 rows=150448 width=0)
> >          Index Cond: (term_id = 22781)
> > (4 rows)
>
> Can you send EXPLAIN ANALYZE for this query for a problematic term_id? Are you
> really retrieving 150k records like it expects? In an OLTP environment that's
> an awful lot of records to be retrieving and might explain your high CPU usage
> all on its own.

The above is with the problematic term_id ;)

The above comes in at around 1/4 of a second which is fine for now but
will cause me severe problems in a few months when the size of teh
database swells.

> 250ms might be as good as you'll get for 150k records. I'm not sure precaching
> that many records will help you. You're still going to have to read them from
> somewhere.

This is what I am thinking. I have tried various methods to reduce the
time. I even tried to use "order by" then reduce the amount of data to
50K records to see if this would work but it came in at around the
same amount of time. It is faster if I use the following though but
not by much.

=# explain select * from reverse_index where term_id = 22781 order by
term_frequency DESC limit 30000;
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Limit  (cost=67337.87..67412.87 rows=30000 width=12)
   ->  Sort  (cost=67337.87..67565.26 rows=90956 width=12)
         Sort Key: term_frequency
         ->  Index Scan using reverse_index_term_id_idx on
reverse_index  (cost=0.00..59846.33 rows=90956 width=12)
               Index Cond: (term_id = 22781)
(5 rows)

I was actually suprised by this and it shows that whatever routines
Postgresql is using to sort the data its pretty bloody fast. The total
sort time for 110K records is about 193ms. The its retrieval after
that. What also suprised me is that without the sort

select * from reverse_index where term_id = 22781;

is slower than

select item_id, term_frequency from reverse_index where term_id = 22781;

but with the sort and limit added

select * from reverse_index where term_id = 22781 order by
term_frequency DESC limit 30000;

is faster than

select item_id, term_frequency from reverse_index where term_id =
22781 order by term_frequency DESC limit 30000;

> I guess clustering on term_id might speed this up by putting all the records
> being retrieved together. It might also let the planner use a plain index scan
> instead of a bitmap scan and get the same benefit.

Yep. I clustered on the term_id index again before running the above
explain and this time we have a plain index scan.

> > The next query absolutely flies but it would have been the one I would
> > have expected to be a lot slower.
> > ...
> > This comes in under 10.6ms which is astounding and I am more than
> > happy with the performance to be had from it.
>
> Out of curiosity it would be interesting to see the EXPLAIN ANALYZE from this
> too.


                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on item i  (cost=8.01..16.18 rows=4 width=478)
   Recheck Cond: ((item_id = 20006293) OR (item_id = 20097065) OR
(item_id = 20101014) OR (item_id = 20101015))
   ->  BitmapOr  (cost=8.01..8.01 rows=4 width=0)
         ->  Bitmap Index Scan on item_item_id_pk  (cost=0.00..2.00
rows=1 width=0)
               Index Cond: (item_id = 20006293)
         ->  Bitmap Index Scan on item_item_id_pk  (cost=0.00..2.00
rows=1 width=0)
               Index Cond: (item_id = 20097065)

<snip lots of single item_id bitmap index scans>

         ->  Bitmap Index Scan on item_item_id_pk  (cost=0.00..2.00
rows=1 width=0)
               Index Cond: (item_id = 20101014)
         ->  Bitmap Index Scan on item_item_id_pk  (cost=0.00..2.00
rows=1 width=0)
               Index Cond: (item_id = 20101015)


Another intereting thing I noticed was the size of the tables and
indexes after the cluster operation

BEFORE:
          relname          |   bytes   | kbytes | relkind | mb
---------------------------+-----------+--------+---------+-----
 reverse_index             | 884293632 | 863568 | r       | 843
 reverse_index_pk          | 548126720 | 535280 | i       | 522
 reverse_index_term_id_idx | 415260672 | 405528 | i       | 396

AFTER:
 reverse_index             | 635944960 | 621040 | r       | 606
 reverse_index_pk          | 322600960 | 315040 | i       | 307
 reverse_index_term_id_idx | 257622016 | 251584 | i       | 245

This database has autovacuum running but it looks like there is a lot
of space in pages on disk that is not being used. Is this a trade off
when using MVCC?

--
Harry
http://www.uklug.co.uk
http://www.hjackson.org

pgsql-performance by date:

Previous
From: "Arup Dutta"
Date:
Subject: unsubscribe
Next
From: Jens-Wolfhard Schicke
Date:
Subject: Materialize Subplan and push into inner index conditions