Thread: SELECT AND AGG huge tables

SELECT AND AGG huge tables

From
houmanb
Date:
Dear all,
We have a DB containing transactional data.
There are about *50* to *100 x 10^6* rows in one *huge* table.
We are using postgres 9.1.6 on linux with a *SSD card on PCIex* providing us
a constant seeking time.

A typical select (see below) takes about 200 secs. As the database is the
backend for a web-based reporting facility, response times of 200 to 500
seconds or more are not acceptable for the customer.

Is there any way to speed up select statements like this:

SELECT
   SUM(T.x),
   SUM(T.y),
   SUM(T.z),
   AVG(T.a),
   AVG(T.b)
FROM T
WHERE
   T.creation_date=$SOME_DATE
GROUP BY
   T.c;

There is an index on T.c. But would it help to partition the table by T.c?
It should be mentioned that T.c is actually a foreign key to a table
containing a tiny number of rows (15 rows representing different companies).
My postgresql.conf is the default one, except that we increased
work_mem to 128MB.

Thanks in advance
Houman








Re: SELECT AND AGG huge tables

From
Merlin Moncure
Date:
On Mon, Oct 15, 2012 at 3:59 PM, houmanb <houman@gmx.at> wrote:
> Dear all,
> We have a DB containing transactional data.
> There are about *50* to *100 x 10^6* rows in one *huge* table.
> We are using postgres 9.1.6 on linux with a *SSD card on PCIex* providing us
> a constant seeking time.
>
> A typical select (see below) takes about 200 secs. As the database is the
> backend for a web-based reporting facility 200 to 500 or even more secs
> response times are not acceptable for the customer.
>
> Is there any way to speed up select statements like this:
>
> SELECT
>    SUM(T.x),
>    SUM(T.y),
>    SUM(T.z),
>    AVG(T.a),
>    AVG(T.b)
> FROM T
> WHERE
>    T.creation_date=$SOME_DATE
> GROUP BY
>    T.c;
>
> There is an Index on T.c. But would it help to partition the table by T.c?
> It should be mentioned, that T.c is actually a foreign key to a Table
> containing a
> tiny number of rows (15 rows representing different companies).
> my postgres.conf is actually the default one, despite the fact that we
> increased the value for work_mem=128MB

It might help a little bit or a lot -- it depends on the plan. I'd
also advise raising shared_buffers to around 25% of RAM for queries
like this.
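
For example, on a machine with 32GB of RAM (a made-up figure -- adjust to the
actual server) that would be something like this in postgresql.conf:

shared_buffers = 8GB    # roughly 25% of RAM; needs a restart, and on 9.1 possibly a larger kernel shmmax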

What does your server load look like while aggregating -- are you storage
or CPU bound?  Which SSD?  How much data churn do you have?

merlin


Re: SELECT AND AGG huge tables

From
Matheus de Oliveira
Date:
On Mon, Oct 15, 2012 at 5:59 PM, houmanb <houman@gmx.at> wrote:
Dear all,
We have a DB containing transactional data.
There are about *50* to *100 x 10^6* rows in one *huge* table.
We are using postgres 9.1.6 on linux with a *SSD card on PCIex* providing us
a constant seeking time.

A typical select (see below) takes about 200 secs. As the database is the
backend for a web-based reporting facility 200 to 500 or even more secs
response times are not acceptable for the customer.

Is there any way to speed up select statements like this:

SELECT
   SUM(T.x),
   SUM(T.y),
   SUM(T.z),
   AVG(T.a),
   AVG(T.b)
FROM T
WHERE
   T.creation_date=$SOME_DATE
GROUP BY
   T.c;

There is an Index on T.c. But would it help to partition the table by T.c?
It should be mentioned, that T.c is actually a foreign key to a Table
containing a
tiny number of rows (15 rows representing different companies).

How selective is T.creation_date? It looks like an index on this column would be better than the one on T.c (you could use both, of course), and the same would be true for partitioning - something like per-month or per-year partitioning.
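
For illustration, something along these lines (the index names are made up):

CREATE INDEX t_creation_date_idx ON T (creation_date);
-- or, covering both the filter and the grouping column:
CREATE INDEX t_creation_date_c_idx ON T (creation_date, c);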
 
my postgres.conf is actually the default one, despite the fact that we
increased the value for work_mem=128MB


How much memory do you have? Could you increase shared_buffers?

Also, with an SSD you could decrease random_page_cost a little bit.
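
For example (a sketch only -- the right value depends on the hardware):

random_page_cost = 2.0    # default is 4.0; values between 1.1 and 2.0 are common on SSDs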

See [1].


Regards.
-- 
Matheus de Oliveira
PostgreSQL Database Analyst
Dextra Sistemas - MPS.Br level F!
www.dextra.com.br/postgres

Re: SELECT AND AGG huge tables

From
Ondrej Ivanič
Date:
Hi,

On 16 October 2012 07:59, houmanb <houman@gmx.at> wrote:
> Dear all,
> We have a DB containing transactional data.
> There are about *50* to *100 x 10^6* rows in one *huge* table.
> We are using postgres 9.1.6 on linux with a *SSD card on PCIex* providing us
> a constant seeking time.

How many columns? What's the average row size?

> Is there any way to speed up select statements like this:
>
> SELECT
>    SUM(T.x),
>    SUM(T.y),
>    SUM(T.z),
>    AVG(T.a),
>    AVG(T.b)
> FROM T
> WHERE
>    T.creation_date=$SOME_DATE
> GROUP BY
>    T.c;
>
> There is an Index on T.c. But would it help to partition the table by T.c?
> It should be mentioned, that T.c is actually a foreign key to a Table
> containing a
> tiny number of rows (15 rows representing different companies).
> my postgres.conf is actually the default one, despite the fact that we
> increased the value for work_mem=128MB

Partitioning by T.c is not going to help. You should partition by
T.creation_date. The question is whether all queries have T.creation_date
in the WHERE clause. Moreover, you need to choose the partition size based
on the query range, so that the majority of queries can operate on one or
two partitions.
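
A rough, untested sketch of per-month range partitioning on 9.1 (assuming
creation_date is a date column; the table name and bounds are just examples):

CREATE TABLE t_2012_10 (
    CHECK (creation_date >= DATE '2012-10-01' AND creation_date < DATE '2012-11-01')
) INHERITS (T);

CREATE INDEX t_2012_10_creation_date_idx ON t_2012_10 (creation_date);

With constraint_exclusion = partition (the default), a query filtering on
creation_date only scans the matching child tables. Inserts have to be routed
to the right child, either by a trigger on the parent or by inserting into
the child directly.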

You can try vertical partitioning, i.e. splitting the table based on column usage:
- group by frequency of use
- group by number of NULLs (null_frac in pg_stats)

Having "SSD card on PCIex" joining tables should be the problem.

In my case the table has > 200 columns and monthly partitions (> 30 million
rows on average), and aggregation queries performed better than 200 sec.

--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)
(http://www.linkedin.com/in/ondrejivanic)


Re: SELECT AND AGG huge tables

From
Bob Lunney
Date:
Houman,

Partition by date and revise your processes to create and load a new child table every day. Since you already know
the date, append it to the table base name and go straight to the data you need. Also, the index on T.c won't help for
this query; you're looking at a full table scan every time.
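
A minimal sketch of the idea (table names are made up; the daily load creates
the child table, and the report queries it directly):

CREATE TABLE t_20121015 () INHERITS (T);   -- created by the daily load job
-- ... bulk load that day's rows into t_20121015 ...

SELECT SUM(x), SUM(y), SUM(z), AVG(a), AVG(b)
FROM t_20121015
GROUP BY c;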

Bob

Sent from my iPhone

On Oct 15, 2012, at 3:59 PM, houmanb <houman@gmx.at> wrote:

> Dear all,
> We have a DB containing transactional data.
> There are about *50* to *100 x 10^6* rows in one *huge* table.
> We are using postgres 9.1.6 on linux with a *SSD card on PCIex* providing us
> a constant seeking time.
>
> A typical select (see below) takes about 200 secs. As the database is the
> backend for a web-based reporting facility 200 to 500 or even more secs
> response times are not acceptable for the customer.
>
> Is there any way to speed up select statements like this:
>
> SELECT
>   SUM(T.x),
>   SUM(T.y),
>   SUM(T.z),
>   AVG(T.a),
>   AVG(T.b)
> FROM T
> WHERE
>   T.creation_date=$SOME_DATE
> GROUP BY
>   T.c;
>
> There is an Index on T.c. But would it help to partition the table by T.c?
> It should be mentioned, that T.c is actually a foreign key to a Table
> containing a
> tiny number of rows (15 rows representing different companies).
> my postgres.conf is actually the default one, despite the fact that we
> increased the value for work_mem=128MB
>
> Thanks in advance
> Houman


Re: SELECT AND AGG huge tables

From
Jeff Janes
Date:
On Mon, Oct 15, 2012 at 1:59 PM, houmanb <houman@gmx.at> wrote:
> Dear all,
> We have a DB containing transactional data.
> There are about *50* to *100 x 10^6* rows in one *huge* table.
> We are using postgres 9.1.6 on linux with a *SSD card on PCIex* providing us
> a constant seeking time.
>
> A typical select (see below) takes about 200 secs. As the database is the
> backend for a web-based reporting facility 200 to 500 or even more secs
> response times are not acceptable for the customer.
>
> Is there any way to speed up select statements like this:
>
> SELECT
>    SUM(T.x),
>    SUM(T.y),
>    SUM(T.z),
>    AVG(T.a),
>    AVG(T.b)
> FROM T
> WHERE
>    T.creation_date=$SOME_DATE
> GROUP BY
>    T.c;
>
> There is an Index on T.c. But would it help to partition the table by T.c?

Probably not.

But an index on creation_date, or on (creation_date, c) might.  How
many records are there per day?  If you add a count(*) to your select,
what would typical values be?
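
That is, something like:

SELECT
   COUNT(*),
   SUM(T.x),
   SUM(T.y),
   SUM(T.z),
   AVG(T.a),
   AVG(T.b)
FROM T
WHERE
   T.creation_date=$SOME_DATE
GROUP BY
   T.c;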

Cheers,

Jeff


Re: SELECT AND AGG huge tables

From
Craig Ringer
Date:
On 10/16/2012 04:59 AM, houmanb wrote:

> There is an Index on T.c. But would it help to partition the table by T.c?

You should really post EXPLAIN ANALYZE for questions like this. See

https://wiki.postgresql.org/wiki/Slow_Query_Questions

--
Craig Ringer




Re: SELECT AND AGG huge tables

From
houmanb
Date:
Hi all,
Thanks for your advice and the link about how to post my question in an
appropriate form. Here is the info. I thank all of you in advance.

Best regards
Houman



Postgres version: 9.1.4
=================================================
Postgres.conf
max_connections = 100
shared_buffers = 8192MB
work_mem = 500MB
log_statement = 'none'
datestyle = 'iso, mdy'
lc_messages = 'en_US.UTF-8'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
default_text_search_config = 'pg_catalog.english'
max_locks_per_transaction = 256

=================================================
Hardware:
CPU Quad Core Intel CPU
processor    : 0-7
vendor_id    : GenuineIntel
cpu family    : 6
model        : 45
model name    : Intel(R) Core(TM) i7-3820 CPU @ 3.60GHz

Memory:
MemTotal:       32927920 kB

HDD:
OCZ VeloDrive - Solid-State-Disk - 600 GB - internal - PCI Express 2.0 x8
Multi-Level-Cell (MLC)
========================IO/stat===================
iostat sdb1 1
Linux 3.2.0-23-generic (regula2)     10/17/2012     _x86_64_    (8 CPU)
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdb1              6.44       217.91       240.45 1956400373 2158777589
sdb1              0.00         0.00         0.00          0          0
sdb1              0.00         0.00         0.00          0          0
sdb1              0.00         0.00         0.00          0          0
sdb1              0.00         0.00         0.00          0          0
sdb1              0.00         0.00         0.00          0          0
=========================vmstat==========================
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0  44376 2417096 210784 28664024    0    0    30    35    0    0  0  0 100  0
 0  0  44376 2416964 210784 28664024    0    0     0     0   80  138  0  0 100  0
 1  0  44376 2416592 210784 28664024    0    0     0     0  278  228  7  0 93  0
 1  0  44376 2416592 210784 28664280    0    0     0     0  457  305 12  0 88  0
 1  0  44376 2416716 210784 28664280    0    0     0     0  472  303 12  0 88  0
 1  0  44376 2416716 210784 28664280    0    0     0     0  462  296 13  0 88  0
 1  0  44376 2416716 210784 28664280    0    0     0     0  478  293 12  0 88  0
 1  0  44376 2416716 210784 28664280    0    0     0     0  470  317 12  0 87  0
 1  0  44376 2416716 210784 28664280    0    0     0     0  455  299 12  0 88  0
 1  0  44376 2416716 210784 28664280    0    0     0     0  459  301 12  0 87  0
 1  0  44376 2416716 210784 28664280    0    0     0     0  370  291  7  5 88  0
 1  0  44376 2416716 210784 28664280    0    0     0    29  459  319 12  1 88  0
 1  0  44376 2416716 210784 28664280    0    0     0     0  453  295 12  0 88  0
 1  0  44376 2416716 210784 28664280    0    0     0     0  449  284 12  0 88  0
 1  0  44376 2416716 210784 28664280    0    0     0     8  462  304 12  0 88  0
 1  0  44376 2416716 210784 28664280    0    0     0     0  459  307 12  0 88  0
 2  0  44376 2416716 210784 28664280    0    0     0     0  461  300 12  0 88  0
 1  0  44376 2416716 210784 28664280    0    0     0     0  457  299 12  0 87  0
 1  0  44376 2416716 210784 28664280    0    0     0     0  439  295 12  0 88  0
 1  0  44376 2416716 210784 28664280    0    0     0     0  439  306 12  0 88  0
 1  0  44376 2416716 210784 28664280    0    0     0     0  448  305 12  0 88  0
 1  0  44376 2416716 210784 28664280    0    0     0     0  457  289 12  0 88  0
 0  0  44376 2416716 210784 28664280    0    0     0     0  174  179  3  0 97  0
 0  0  44376 2416716 210784 28664280    0    0     0     0   73  133  0  0 100  0
 0  0  44376 2416716 210784 28664280    0    0     0     0   75  133  0  0 100  0
 0  0  44376 2416716 210784 28664280    0    0     0     0   70  127  0  0 100  0



        Column         |            Type             |                       Modifiers
-----------------------+-----------------------------+-------------------------------------------------------
 modifying_action      | integer                     |
 modifying_client      | integer                     |
 modification_time     | timestamp without time zone |
 instance_entity       | integer                     |
 id                    | integer                     | not null default nextval('enigma.fact_seq'::regclass)
 successor             | integer                     |
 reporting_date        | integer                     |
 legal_entity          | integer                     |
 client_system         | integer                     |
 customer              | integer                     |
 customer_type         | integer                     |
 borrower              | integer                     |
 nace                  | integer                     |
 lsk                   | integer                     |
 review_date           | integer                     |
 uci_status            | integer                     |
 rating                | integer                     |
 rating_date           | integer                     |
 asset_class_sta_flags | integer                     |
 asset_class_flags     | integer                     |
 balance_indicator     | integer                     |
 quantity              | integer                     |
 credit_line           | numeric                     |
 outstanding           | numeric                     |
 ead                   | numeric                     |
 ead_collateralized    | numeric                     |
 ead_uncollateralized  | numeric                     |
 el                    | numeric                     |
 rwa                   | numeric                     |
 lgd                   | numeric                     |
 pd                    | numeric                     |
 economic_capital      | numeric                     |
 unit                  | integer                     |
========================================================================
Indexes:
 "fact_pkey" PRIMARY KEY, btree (id)
 "enigma_fact_id_present" UNIQUE CONSTRAINT, btree (id)
 "indx_enigma_fact_legal_entity" btree (legal_entity)
 "indx_enigma_fact_reporting_date" btree (reporting_date)
Triggers:
    fact_before_update_referrers_trigger BEFORE DELETE ON enigma.fact FOR EACH ROW EXECUTE PROCEDURE enigma.fact_update_referrers_function()
========================================================================
genesis=# SELECT count(*) FROM enigma.fact;
  count
---------
 7493958
========================================================================
EXPLAIN analyze SELECT
SUM(T.quantity) AS T__quantity,
SUM(T.credit_line) AS T__credit_line,
SUM(T.outstanding) AS T__outstanding,
SUM(T.ead) AS T__ead,
SUM(T.ead_collateralized) AS T__ead_collateralized,
SUM(T.ead_uncollateralized) AS T__ead_uncollateralized,
SUM(T.el) AS T__el,
SUM(T.rwa) AS T__rwa,
AVG(T.lgd) AS T__lgd,
AVG(T.pd) AS T__pd
FROM enigma.fact T
GROUP BY T.legal_entity
ORDER BY T.legal_entity;

----------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=1819018.32..1819018.36 rows=15 width=48) (actual time=20436.264..20436.264 rows=15 loops=1)
   Sort Key: legal_entity
   Sort Method: quicksort  Memory: 27kB
   ->  HashAggregate  (cost=1819017.80..1819018.02 rows=15 width=48) (actual time=20436.221..20436.242 rows=15 loops=1)
         ->  Seq Scan on fact t  (cost=0.00..959291.68 rows=31262768 width=48) (actual time=2.619..1349.523 rows=7493958 loops=1)
 Total runtime: 20436.410 ms

========================================================================

EXPLAIN (BUFFERS true, ANALYZE)  SELECT SUM(T.quantity) AS T__quantity,
SUM(T.credit_line) AS T__credit_line,
SUM(T.outstanding) AS T__outstanding,
SUM(T.ead) AS T__ead,
SUM(T.ead_collateralized) AS T__ead_collateralized,
SUM(T.ead_uncollateralized) AS T__ead_uncollateralized,
SUM(T.el) AS T__el,
SUM(T.rwa) AS T__rwa,
AVG(T.lgd) AS T__lgd,
AVG(T.pd) AS T__pd
FROM enigma.fact T
GROUP BY T.legal_entity
ORDER BY T.legal_entity;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=1819018.32..1819018.36 rows=15 width=48) (actual time=20514.976..20514.977 rows=15 loops=1)
   Sort Key: legal_entity
   Sort Method: quicksort  Memory: 27kB
   Buffers: shared hit=2315 read=644351
   ->  HashAggregate  (cost=1819017.80..1819018.02 rows=15 width=48) (actual time=20514.895..20514.917 rows=15 loops=1)
         Buffers: shared hit=2313 read=644351
         ->  Seq Scan on fact t  (cost=0.00..959291.68 rows=31262768 width=48) (actual time=2.580..1385.491 rows=7493958 loops=1)
               Buffers: shared hit=2313 read=644351
 Total runtime: 20515.369 ms








Re: SELECT AND AGG huge tables

From
Ants Aasma
Date:
On Wed, Oct 17, 2012 at 2:24 PM, houmanb <houman@gmx.at> wrote:
> Hi all,
> Thanks for your advice and the link about posting my question in an
> appropriate form.
> Here are the info. I thank all of you in advance.

Can you run the EXPLAIN once more with EXPLAIN (ANALYZE, BUFFERS,
TIMING OFF)? Given the number of rows processed by the query, the
detailed per-node timing overhead might be a considerable factor here.

What happened to the "WHERE T.creation_date=$SOME_DATE" part of the
query? These examples go through the whole table. The plans shown are
about as fast as it gets. Summarizing 5GB of data will never be fast.
If you need that information quickly, you'll need to actively maintain
the aggregate values via triggers.
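
A minimal, untested sketch of that approach, reusing the column names from
the original query (all object names are hypothetical, only INSERTs are
handled, and concurrent writers need extra care):

CREATE TABLE t_daily_summary (
    creation_date date    NOT NULL,
    c             integer NOT NULL,
    sum_x         numeric NOT NULL DEFAULT 0,
    sum_a         numeric NOT NULL DEFAULT 0,
    n_rows        bigint  NOT NULL DEFAULT 0,
    PRIMARY KEY (creation_date, c)
);

CREATE OR REPLACE FUNCTION t_summary_trg() RETURNS trigger AS $$
BEGIN
    -- fold the new fact row into the per-day, per-company totals
    UPDATE t_daily_summary
       SET sum_x  = sum_x + NEW.x,
           sum_a  = sum_a + NEW.a,
           n_rows = n_rows + 1
     WHERE creation_date = NEW.creation_date AND c = NEW.c;
    IF NOT FOUND THEN
        INSERT INTO t_daily_summary (creation_date, c, sum_x, sum_a, n_rows)
        VALUES (NEW.creation_date, NEW.c, NEW.x, NEW.a, 1);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER t_summary AFTER INSERT ON T
    FOR EACH ROW EXECUTE PROCEDURE t_summary_trg();

-- the report then reads the tiny summary table instead of scanning T:
SELECT c, sum_x, sum_a / n_rows AS avg_a
FROM t_daily_summary
WHERE creation_date = $SOME_DATE;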

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de