Thread: Less rows -> better performance?

Less rows -> better performance?

From: Andreas Hartmann
Dear PostgreSQL community,

First, some info about our application:

- Online course directory for a University
- Amount of data: complete dump is 27 MB
- Semester is part of primary key in each table
- Data for approx. 10 semesters stored in the DB
- Read-only access from web application (JDBC)

Our client has asked us if the performance of the application could be
improved by moving the data from previous years to a separate "archive"
application. This would reduce the overall amount of data in the main
application by about 80% at the moment.

Actually, I doubt that this will have the desired effect: the semester
is part of the primary key in virtually all tables (apart from some
small tables containing string constants etc.) and is therefore indexed.
Some tests with EXPLAIN ANALYZE and some web tests (JMeter) seem to
confirm this: the queries showed the same performance with 2 and 10
semesters.
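
For illustration, a typical check looked roughly like this (table and
column names are made up for this mail):

-- hypothetical schema; semester is the leading primary key column
EXPLAIN ANALYZE
SELECT title, lecturer
FROM course
WHERE semester = '2008S';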

But since I'm not sure yet, I would very much appreciate any answers to
the following questions:

- Do you think the approach (reducing the data) is effective?
- Are there any particular tests which I should do?

Thanks a lot in advance!

-- Andreas



--
Andreas Hartmann, CTO
BeCompany GmbH
http://www.becompany.ch
Tel.: +41 (0) 43 818 57 01

Re: Less rows -> better performance?

From: Richard Huxton
Andreas Hartmann wrote:
> Dear PostgreSQL community,
>
> First, some info about our application:
>
> - Online course directory for a University
> - Amount of data: complete dump is 27 MB
> - Semester is part of primary key in each table
> - Data for approx. 10 semesters stored in the DB
> - Read-only access from web application (JDBC)
>
> Our client has asked us if the performance of the application could be
> improved by moving the data from previous years to a separate "archive"
> application.

If you had 27GB of data maybe, but you've only got 27MB - that's
presumably all sitting in memory.

What in particular is slow?

--
   Richard Huxton
   Archonet Ltd

Re: Less rows -> better performance?

From: Andreas Hartmann
Richard, thanks for your reply!

Richard Huxton wrote:
> Andreas Hartmann wrote:
>> Dear PostgreSQL community,
>>
>> First, some info about our application:
>>
>> - Online course directory for a University
>> - Amount of data: complete dump is 27 MB
>> - Semester is part of primary key in each table
>> - Data for approx. 10 semesters stored in the DB
>> - Read-only access from web application (JDBC)
>>
>> Our client has asked us if the performance of the application could be
>> improved by moving the data from previous years to a separate "archive"
>> application.
>
> If you had 27GB of data maybe, but you've only got 27MB - that's
> presumably all sitting in memory.

Here's some info about the actual amount of data:

SELECT pg_database.datname,
pg_size_pretty(pg_database_size(pg_database.datname)) AS size
FROM pg_database where pg_database.datname = 'vvz_live_1';

     datname    |  size
---------------+---------
  vvz_live_1    | 2565 MB

I wonder why the actual size is so much bigger than the data-only dump -
is this because of index data etc.?


> What in particular is slow?

There's no particular bottleneck (at least none that we're aware of).
During the first couple of days after the beginning of the semester,
request processing in the application tends to slow down due to the high
load (many students assembling their schedules). The customer upgraded
the hardware (which already helped a lot), but they have asked us to
find further approaches to performance optimization.

-- Andreas


--
Andreas Hartmann, CTO
BeCompany GmbH
http://www.becompany.ch
Tel.: +41 (0) 43 818 57 01

Re: Less rows -> better performance?

From: "Guillaume Smet"
On Mon, Jul 21, 2008 at 1:25 PM, Andreas Hartmann <andreas@apache.org> wrote:
> SELECT pg_database.datname,
> pg_size_pretty(pg_database_size(pg_database.datname)) AS size
> FROM pg_database where pg_database.datname = 'vvz_live_1';
>
>    datname    |  size
> ---------------+---------
>  vvz_live_1    | 2565 MB
>
> I wonder why the actual size is so much bigger than the data-only dump - is
> this because of index data etc.?

More probably because the database is totally bloated. Do you run
VACUUM regularly or did you set up autovacuum?
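
If you haven't, a quick check is something like this (VACUUM VERBOSE
reports per table how much space it reclaims):

SHOW autovacuum;         -- is autovacuum switched on at all?
VACUUM VERBOSE ANALYZE;  -- reclaim dead rows and print per-table details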

--
Guillaume

Re: Less rows -> better performance?

From: Richard Huxton
Andreas Hartmann wrote:
>
> Here's some info about the actual amount of data:
>
> SELECT pg_database.datname,
> pg_size_pretty(pg_database_size(pg_database.datname)) AS size
> FROM pg_database where pg_database.datname = 'vvz_live_1';
>
>     datname    |  size
> ---------------+---------
>  vvz_live_1    | 2565 MB
>
> I wonder why the actual size is so much bigger than the data-only dump -
> is this because of index data etc.?

I suspect Guillaume is right and you've not been vacuuming. That or
you've got a *LOT* of indexes. If the database is only 27MB dumped, I'd
just dump/restore it.
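
To see how much of that is index data, something along these lines
should work (untested sketch):

SELECT relname,
       pg_size_pretty(pg_relation_size(oid)) AS table_size,
       pg_size_pretty(pg_total_relation_size(oid)
                      - pg_relation_size(oid)) AS indexes_and_toast
FROM pg_class
WHERE relkind = 'r'
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 10;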

Since the database is read-only it might be worth running CLUSTER on
the main tables if there's a sensible ordering for them.
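
For example (hypothetical table and index names; from 8.3 on the syntax
is CLUSTER tablename USING indexname):

CLUSTER course_semester_idx ON course;  -- pre-8.3 syntax
ANALYZE course;                         -- refresh statistics afterwards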

>> What in particular is slow?
>
> There's no particular bottleneck (at least none that we're aware of).
> During the first couple of days after the beginning of the semester,
> request processing in the application tends to slow down due to the
> high load (many students assembling their schedules). The customer
> upgraded the hardware (which already helped a lot), but they have
> asked us to find further approaches to performance optimization.

1. Cache sensibly at the application level (I should have thought
there's plenty of opportunity here).
2. Make sure you're using a connection pool and have sized it reasonably
(try 4, 8, 16 and see what load you can support).
3. Use prepared statements where it makes sense (there's a SQL-level
sketch after this list). Not sure how you'll manage the interplay
between this and connection pooling in JDBC. Not a Java man I'm afraid.

If you're happy with the query plans, you're looking to reduce overheads
as much as possible during peak times.

4. Offload more of the processing to clients with some fancy ajax-ed
interface.
5. Throw in a spare machine as an app server for the first week of term.
Presumably your load is 100 times average at this time.
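
For point 3, at the SQL level a prepared statement amounts to the
following (hypothetical table; the JDBC driver can do the equivalent
through PreparedStatement):

PREPARE course_lookup (varchar) AS
    SELECT * FROM course WHERE semester = $1;

EXECUTE course_lookup('2008S');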

--
   Richard Huxton
   Archonet Ltd

Re: Less rows -> better performance?

From: Craig Ringer
Guillaume Smet wrote:
> On Mon, Jul 21, 2008 at 1:25 PM, Andreas Hartmann <andreas@apache.org> wrote:
>> SELECT pg_database.datname,
>> pg_size_pretty(pg_database_size(pg_database.datname)) AS size
>> FROM pg_database where pg_database.datname = 'vvz_live_1';
>>
>>    datname    |  size
>> ---------------+---------
>>  vvz_live_1    | 2565 MB
>>
>> I wonder why the actual size is so much bigger than the data-only dump - is
>> this because of index data etc.?
>
> More probably because the database is totally bloated. Do you run
> VACUUM regularly or did you set up autovacuum?

You might also want to REINDEX and see if that improves things. My
understanding is that if vacuum isn't run regularly, the indexes may end
up a bit of a mess as well as the tables.
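
For example (the per-table form uses a hypothetical name):

REINDEX DATABASE vvz_live_1;  -- rebuild all indexes in the database
REINDEX TABLE course;         -- or do it per table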

--
Craig Ringer

Re: Less rows -> better performance?

From: "Christian GRANDIN"
Hi,

Reducing the amount of data will only have an effect on table scans or
index scans. If your queries are selective and well optimized, it will
have no effect.

Before looking for solutions, the first thing to do is to understand
what's happening.

If you already know the queries, then EXPLAIN them. Otherwise, you
should log statement durations with the log_statement and
log_min_duration_statement parameters in postgresql.conf.

Before this, you should at least run VACUUM ANALYZE on the database to
collect up-to-date statistics and get realistic explain plans.
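
In postgresql.conf this could look like the following (illustrative
values):

# log every statement that takes longer than 200 ms
log_min_duration_statement = 200
# or, very verbose, log all statements:
# log_statement = 'all'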

Best regards.

Christian



Re: Less rows -> better performance?

From: Andreas Hartmann
Guillaume Smet wrote:
> On Mon, Jul 21, 2008 at 1:25 PM, Andreas Hartmann <andreas@apache.org> wrote:
>> SELECT pg_database.datname,
>> pg_size_pretty(pg_database_size(pg_database.datname)) AS size
>> FROM pg_database where pg_database.datname = 'vvz_live_1';
>>
>>    datname    |  size
>> ---------------+---------
>>  vvz_live_1    | 2565 MB
>>
>> I wonder why the actual size is so much bigger than the data-only dump - is
>> this because of index data etc.?
>
> More probably because the database is totally bloated. Do you run
> VACUUM regularly or did you set up autovacuum?

Thanks for the hint!

I just verified that the autovacuum property is enabled. I did the
following to prepare the tests:

- set up two test databases, let's call them db_all and db_current
- import the dump from the live DB into both test DBs
- delete the old semester data from db_current, leaving only the current
data

Both test DBs were 600 MB after this. I did a VACUUM FULL ANALYZE
on both of them now. db_all didn't shrink significantly (only 1 MB);
db_current shrank to 440 MB. We're using quite a lot of indexes; I guess
that's why so much space is allocated.

-- Andreas

--
Andreas Hartmann, CTO
BeCompany GmbH
http://www.becompany.ch
Tel.: +41 (0) 43 818 57 01

Re: Less rows -> better performance?

From: "Harald Armin Massa"
Andreas,

> I just verified that the autovacuum property is enabled. I did the following
> to prepare the tests:

"autovacuum property is enabled" Did you also check the logs, if
autovacuum is working?
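
On 8.2 or later the statistics views show this directly (sketch):

-- when did (auto)vacuum last touch each table?
SELECT relname, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY relname;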

> - set up two test databases, let's call them db_all and db_current
> - import the dump from the live DB into both test DBs
> - delete the old semester data from db_current, leaving only the current
> data
>
> Both test DBs were 600 MB after this. I did a VACUUM FULL ANALYZE on
> both of them now. db_all didn't shrink significantly (only 1 MB);
> db_current shrank to 440 MB.

Your test does not tell you whether vacuuming happens on your production
database! With pg_dump + pg_restore you removed nearly all database
bloat (theoretically all of it).

After loading a fresh dump, vacuuming ideally has almost nothing to do;
after deleting some data, VACUUM reclaims the space of the deleted rows,
and that's the shrinking you see after delete + vacuum.

The bloat in your production system may be the result of updates and
deletes in that system; dumping and restoring removes that bloat.

If your live DB is ~2.5 GB and your dumped/restored DB is only 600 MB,
the difference of roughly 1.9 GB is bloat from not vacuuming and/or
bloated indexes. So, before the start of the next semester, at least run
VACUUM (maybe also REINDEX).

Best wishes,

Harald



--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Spielberger Straße 49
70435 Stuttgart
0173/9409607
no fax, no carrier pigeon
-
EuroPython 2008 will take place in Vilnius, Lithuania - Stay tuned!

Re: Less rows -> better performance?

From: Andreas Hartmann
Mario Weilguni wrote:
> Andreas Hartmann wrote:

[…]

>> I just verified that the autovacuum property is enabled.

[…]

> Did you have:
> stats_start_collector = on
> stats_block_level = on
> stats_row_level = on
>
> Otherwise autovacuum won't run IMO.

Thanks for the hint! The section looks like this:

stats_start_collector = on
#stats_command_string = off
#stats_block_level = off
stats_row_level = on
#stats_reset_on_server_start = off


I'll check the logs to see whether vacuum really runs - as soon as I find them :)

-- Andreas
--
Andreas Hartmann, CTO
BeCompany GmbH
http://www.becompany.ch
Tel.: +41 (0) 43 818 57 01

Re: Less rows -> better performance?

From: Mario Weilguni
Andreas Hartmann wrote:
> Mario Weilguni wrote:
>> Andreas Hartmann wrote:
>
> […]
>
>>> I just verified that the autovacuum property is enabled.
>
> […]
>
>> Did you have:
>> stats_start_collector = on
>> stats_block_level = on
>> stats_row_level = on
>>
>> Otherwise autovacuum won't run IMO.
>
> Thanks for the hint! The section looks like this:
>
> stats_start_collector = on
> #stats_command_string = off
> #stats_block_level = off
> stats_row_level = on
> #stats_reset_on_server_start = off
>
>
> I'll check the logs to see whether vacuum really runs - as soon as I find them :)
>
> -- Andreas
You might want to use these entries in your config:
redirect_stderr = on
log_directory = 'pg_log'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_rotation_age = 1d

Adapt those to your needs, and you will then find the log files in $PGDATA/pg_log/.

And BTW, I was wrong: you just need stats_row_level = on;
stats_block_level doesn't matter. But in fact it's simple: if you don't
have 24x7 requirements, run VACUUM FULL ANALYZE; and check whether your
DB becomes smaller. I really doubt you have so many indexes that a 27 MB
dump could use 2.3 GB on disk.

You can check this too:
select relname, relpages, reltuples, relkind
  from pg_class
where relkind in ('r', 'i')
order by relpages desc limit 20;

This will give you the top-20 relations and their sizes. One page is
typically 8 KB, so you can cross-check whether relpages is completely
out of proportion to reltuples - a good indicator of table or index
bloat. For example, at roughly 100 bytes per row about 80 rows fit on
one 8 KB page, so a table with reltuples = 100000 should occupy on the
order of 1,250 pages; a relpages value of 20,000 for such a table would
point to heavy bloat.

Regards,
Mario


Re: Less rows -> better performance?

From: Łukasz Filut
[...]
> You can check this too:
> select relname, relpages, reltuples, relkind
>  from pg_class
> where relkind in ('r', 'i')
> order by relpages desc limit 20;
>
> This will give you the top-20 relations and their sizes. One page is
> typically 8 KB, so you can cross-check whether relpages is completely
> out of proportion to reltuples - a good indicator of table or index
> bloat.
Use this query instead:

select pg_size_pretty(pg_relation_size(oid)) as relation_size, relname,
relpages, reltuples, relkind
  from pg_class
 where relkind in ('r', 'i')
 order by relpages desc limit 20;

The output will be much more readable.

Lukasz

--
Lukasz Filut
DBA - IT Group, WSB-TEB Corporation
Poznan - Poland