Re: sum of all values - Mailing list pgsql-performance

From Madison Kelly
Subject Re: sum of all values
Msg-id 41E7E742.4060607@alteeve.com
In response to Re: sum of all values  (Richard Huxton <dev@archonet.com>)
Responses Re: sum of all values  (Richard Huxton <dev@archonet.com>)
List pgsql-performance
Richard Huxton wrote:
> Madison Kelly wrote:
>
>> Hi all,
>>
>>   Is there a fast(er) way to get the sum of all integer values for a
>> certain condition over many thousands of rows? What I am currently
>> doing is this (which takes ~5-10sec.):
>
>
> OK, I'm assuming you've configured PG to your satisfaction and this is
> the only query giving you problems.

   This is a program for general consumption (hopefully...
eventually...), so I want to leave the PostgreSQL config alone. Once I am
happier with the program I will try different tuning options and write a
FAQ, though I expect 9 out of 10 users won't read it.

>> SELECT SUM (a.file_size) FROM file_info_1 a, file_set_1 b WHERE
>> a.file_name=b.fs_name AND a.file_parent_dir=b.fs_parent_dir AND
>> a.file_type=b.fs_type AND b.fs_backup='t';
>
>
> You'll want to run EXPLAIN ANALYSE SELECT SUM... and post the output of
> that, although the query looks straightforward enough.

tle-bu=> EXPLAIN ANALYZE SELECT SUM (a.file_size) FROM file_info_1 a,
file_set_1 b WHERE a.file_name=b.fs_name AND
a.file_parent_dir=b.fs_parent_dir AND a.file_type=b.fs_type AND
b.fs_backup='t';

     QUERY PLAN
----------------------------------------------------------------
  Aggregate  (cost=2202.54..2202.54 rows=1 width=8) (actual
time=5078.744..5078.748 rows=1 loops=1)
    ->  Merge Join  (cost=724.94..2202.51 rows=11 width=8) (actual
time=3281.677..4969.719 rows=12828 loops=1)
          Merge Cond: (("outer".file_parent_dir = "inner".fs_parent_dir)
AND ("outer".file_name = "inner".fs_name) AND ("outer".file_type =
"inner".fs_type))
          ->  Index Scan using file_info_1_search_idx on file_info_1 a
(cost=0.00..1317.11 rows=12828 width=104) (actual time=0.042..116.825
rows=12828 loops=1)
          ->  Sort  (cost=724.94..740.97 rows=6414 width=96) (actual
time=3281.516..3350.640 rows=12828 loops=1)
                Sort Key: b.fs_parent_dir, b.fs_name, b.fs_type
                ->  Seq Scan on file_set_1 b  (cost=0.00..319.35
rows=6414 width=96) (actual time=0.029..129.129 rows=12828 loops=1)
                      Filter: (fs_backup = true)
  Total runtime: 5080.729 ms
(9 rows)
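   (Looking at that output now, the planner expected 11 rows out of the
merge join and 6,414 out of the seq scan, but both steps actually
returned 12,828. From what I have read, estimates that far off can mean
the statistics are stale, so the first thing I will try is re-analyzing
both tables:

tle-bu=> ANALYZE file_info_1;
tle-bu=> ANALYZE file_set_1;

I don't know yet whether that changes the plan.)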

>>   Here are the schemas, in case they help:
>>
>> tle-bu=> \d file_info_1
>>               Table "public.file_info_1"
>>      Column      |  Type   |         Modifiers
>> -----------------+---------+----------------------------
>>  file_acc_time   | bigint  | not null
>>  file_group_name | text    | not null
>>  file_group_uid  | integer | not null
>>  file_mod_time   | bigint  | not null
>>  file_name       | text    | not null
>>  file_parent_dir | text    | not null
>>  file_perm       | text    | not null
>>  file_size       | bigint  | not null
>>  file_type       | text    | not null default 'f'::text
>>  file_user_name  | text    | not null
>>  file_user_uid   | integer | not null
>> Indexes:
>>     "file_info_1_display_idx" btree (file_parent_dir, file_name,
>> file_type)
>>     "file_info_1_search_idx" btree (file_parent_dir, file_name,
>> file_type)
>>
>> tle-bu=> \d file_set_1
>>              Table "public.file_set_1"
>>     Column     |  Type   |         Modifiers
>> ---------------+---------+----------------------------
>>  fs_backup     | boolean | not null default true
>>  fs_display    | boolean | not null default false
>>  fs_name       | text    | not null
>>  fs_parent_dir | text    | not null
>>  fs_restore    | boolean | not null default false
>>  fs_type       | text    | not null default 'f'::text
>> Indexes:
>>     "file_set_1_sync_idx" btree (fs_parent_dir, fs_name, fs_type)
>
>
> 1. WHERE ARE YOUR PRIMARY KEYS???
> 2. Why do you have two identical indexes on file_info_1
> 3. WHERE ARE YOUR PRIMARY KEYS???
> 4. Am I right in thinking that always, file_name==fs_name (i.e. they
> represent the same piece of information) and if so, why are you storing
> it twice? Same for _parent_dir too
> 5. file_type/fs_type are being held as unbounded text? Not an index into
> some lookup table or a varchar(N)?
>
> Can you explain what you're trying to do here - it might be you want to
> alter your database design.
> --
>   Richard Huxton
>   Archonet Ltd

   This is where I have to admit my novice level of knowledge. Until now
I have been more concerned with "making it work". It is only now that I
have finished (more or less) the program that I have started going back
and trying to find ways to speed it up. I had not used Postgres (or
Perl, or anything else) before this program. I hope my questions aren't
too basic. ^.^;

   I keep hearing about primary keys, but I can't say that I know what
they are or how they are used. If I understand at all, a primary key is
a way to let another table's entry reference this one (using a foreign
key)? The two matching indexes are a typo in my program that I hadn't
noticed; I'll fix that ASAP.
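   If I have that right, declaring one might look something like the
statements below. This is only my guess, and it assumes the parent
dir/name/type triple really is unique in each table, which I would have
to verify first:

tle-bu=> ALTER TABLE file_info_1
tle-bu->     ADD PRIMARY KEY (file_parent_dir, file_name, file_type);
tle-bu=> ALTER TABLE file_set_1
tle-bu->     ADD PRIMARY KEY (fs_parent_dir, fs_name, fs_type);

   The unique index a primary key creates would also make the duplicated
'file_info_1' indexes redundant.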

   Here is what the database is used for:

   This is a backup program, and I use the DB to store extended
information on all selected files and directories on a partition. Each
partition has its own 'file_info_#' and 'file_set_#' tables, where '#'
matches the ID stored for that partition in another table in the DB.

   The 'file_info_#' table stores the data that can change, such as the
file size, last modified/accessed times, owning user and group, and so
forth. The 'file_set_#' table stores the flags that say whether to
include or exclude a file from a backup/restore job and whether it has
been selected for display in the file browser.

   In the first iteration I -used- to have all the data in a single
table, and I identified the partition with a column called 'file_in_id'
(or something similar). As I looked at each file on the system I would
do a DB call to see if the entry existed; if so, I updated it, and if
not, I inserted it. This was horribly slow, though, so I decided to
break things out into the schema above.
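   (Per file, that meant a lookup followed by an update or an insert,
roughly like the calls below. I am writing this from memory, so the old
column names and the values are only approximate:

   SELECT file_size FROM file_info
    WHERE file_in_id = 3 AND file_parent_dir = '/etc'
      AND file_name = 'hosts' AND file_type = 'f';

   -- if a row came back:
   UPDATE file_info SET file_size = 1024
    WHERE file_in_id = 3 AND file_parent_dir = '/etc'
      AND file_name = 'hosts' AND file_type = 'f';

   -- otherwise:
   INSERT INTO file_info (file_in_id, file_parent_dir, file_name,
       file_type, file_size)
   VALUES (3, '/etc', 'hosts', 'f', 1024);

Multiplied across every file on a partition, those round trips added up
fast.)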

   With the schema above, what I do now is just drop the 'file_info_#'
table, recreate it and its matching indexes, and then do a mass 'COPY'
of all the file info on the partition. After this is done I read the
new data back from the reloaded 'file_info_#' table and sync
'file_set_#' against it, which removes entries no longer in
'file_info_#', adds new ones matching the parent's values, and leaves
the existing entries alone. In rough SQL it looks like the sketch below.
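   Here is the sequence for partition 1. The dump file path is made up
for the example, and the two sync statements are a simplified version of
what the Perl code does (the real code copies the parent directory's
flags onto new entries rather than taking the column defaults):

   DROP TABLE file_info_1;
   CREATE TABLE file_info_1 (
       file_acc_time   bigint  NOT NULL,
       file_group_name text    NOT NULL,
       file_group_uid  integer NOT NULL,
       file_mod_time   bigint  NOT NULL,
       file_name       text    NOT NULL,
       file_parent_dir text    NOT NULL,
       file_perm       text    NOT NULL,
       file_size       bigint  NOT NULL,
       file_type       text    NOT NULL DEFAULT 'f',
       file_user_name  text    NOT NULL,
       file_user_uid   integer NOT NULL
   );
   CREATE INDEX file_info_1_search_idx
       ON file_info_1 (file_parent_dir, file_name, file_type);

   COPY file_info_1 FROM '/tmp/partition_1.scan';  -- made-up path

   -- remove set entries whose file no longer exists
   DELETE FROM file_set_1
    WHERE NOT EXISTS
          (SELECT 1 FROM file_info_1 a
            WHERE a.file_parent_dir = file_set_1.fs_parent_dir
              AND a.file_name       = file_set_1.fs_name
              AND a.file_type       = file_set_1.fs_type);

   -- add set entries for newly seen files
   INSERT INTO file_set_1 (fs_parent_dir, fs_name, fs_type)
   SELECT a.file_parent_dir, a.file_name, a.file_type
     FROM file_info_1 a
    WHERE NOT EXISTS
          (SELECT 1 FROM file_set_1 b
            WHERE b.fs_parent_dir = a.file_parent_dir
              AND b.fs_name       = a.file_name
              AND b.fs_type       = a.file_type);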

   I found dropping the table and re-creating it a lot faster than a
'DELETE FROM' call, and it also seems to have made 'VACUUM FULL' a lot
faster.
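   (One alternative I have seen mentioned, but have not benchmarked
against the drop/recreate, is TRUNCATE, which empties the table in one
shot while leaving the table and its indexes in place:

tle-bu=> TRUNCATE TABLE file_info_1;

Though I would guess the 'COPY' then gets slower, since the indexes
would have to be maintained during the load.)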

   Thank you very much for your feedback! I hope I haven't done
something -too- foolish. :p If I have, I will change it.

Madison
