Thread: inaccurate stats on large tables

inaccurate stats on large tables

From: Kiran Mukhyala

Hello,

        I am running a select on a large table with two where
        conditions.
        Explain analyze shows that the estimated number of rows returned
        (190760) is much more than the actual rows returned (58221),
        which is probably the underlying cause for the poor performance
        I am seeing.

        Can someone please tell me how to improve the query planner
        estimate? I did try vacuum analyze. Here are some details:

        Explain plan:
        unison@csb-test=> explain analyze select * from paliasorigin a
        where a.origin_id=20 and a.tax_id=9606;

                                                             QUERY PLAN
        ---------------------------------------------------------------------------------------------------------------
         Bitmap Heap Scan on paliasorigin a  (cost=4901.38..431029.54 rows=190760 width=118) (actual time=12.447..112.902 rows=58221 loops=1)
           Recheck Cond: ((origin_id = 20) AND (tax_id = 9606))
           ->  Bitmap Index Scan on paliasorigin_search3_idx  (cost=0.00..4853.69 rows=190760 width=0) (actual time=11.407..11.407 rows=58221 loops=1)
                 Index Cond: ((origin_id = 20) AND (tax_id = 9606))

        Schema:
        unison@csb-test=> \d+ paliasorigin
           Column   |           Type           |          Modifiers
        -----------+--------------------------+-----------------------------
         palias_id | integer                  | not null
         origin_id | integer                  | not null
         alias     | text                     | not null
         descr     | text                     |
         tax_id    | integer                  |
         added     | timestamp with time zone | not null default timenow()
        Indexes:
            "palias_pkey" PRIMARY KEY, btree (palias_id)
            "paliasorigin_alias_unique_in_origin_idx" UNIQUE, btree (origin_id, alias)
            "paliasorigin_alias_casefold_idx" btree (upper(alias)) CLUSTER
            "paliasorigin_alias_idx" btree (alias)
            "paliasorigin_o_idx" btree (origin_id)
            "paliasorigin_search1_idx" btree (palias_id, origin_id)
            "paliasorigin_search3_idx" btree (origin_id, tax_id, palias_id)
            "paliasorigin_tax_id_idx" btree (tax_id)
        Foreign-key constraints:
            "origin_id_exists" FOREIGN KEY (origin_id) REFERENCES origin(origin_id) ON UPDATE CASCADE ON DELETE CASCADE
        Has OIDs: no


        Number of rows:
        unison@csb-test=> select count(*) from paliasorigin;
          count
        ----------
         37909009
        (1 row)

        Pg version:
        unison@csb-test=> select version();
                                                  version
        --------------------------------------------------------------------------------------------
         PostgreSQL 8.3.3 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.0 (SUSE Linux)
        (1 row)


        Info from analyze verbose:
        unison@csb-test=> analyze verbose paliasorigin;
        INFO:  analyzing "unison.paliasorigin"
        INFO:  "paliasorigin": scanned 300000 of 692947 pages, containing 16409041 live rows
        and 0 dead rows; 300000 rows in sample, 37901986 estimated total rows
        ANALYZE
        Time: 21999.506 ms


        Thank you,

        -Kiran Mukhyala




Re: inaccurate stats on large tables

From: "David Wilson"

On Thu, Sep 4, 2008 at 2:21 PM, Kiran Mukhyala <mukhyala.kiran@gene.com> wrote:

>        Can someone please tell me how to improve the query planner
>        estimate? I did try vacuum analyze. Here are some details:

Have you tried increasing the statistics target for that table (or in general)?
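
For example, something along these lines (the column and the target value of 100 are only illustrative):

alter table paliasorigin alter column tax_id set statistics 100;
analyze paliasorigin;

-- or raise the default for the session before analyzing (8.3 ships with a default of 10)
set default_statistics_target = 100;
analyze paliasorigin;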

--
- David T. Wilson
david.t.wilson@gmail.com

Re: inaccurate stats on large tables

From: "Matt Smiley"

Hi Kiran,

You gave great info on your problem.

First, is this the query you're actually trying to speed up, or is it a simplified version?  It looks like the
optimizer has already chosen the best execution plan for the given query.  Since the query has no joins, we only have
to consider access paths.  You're fetching 58221/37909009 = 0.15% of the rows, so a sequential scan is clearly
inappropriate.  A basic index scan is likely to incur extra scattered I/O, so a bitmap index scan is favored.

To improve on this query's runtime, you could try any of the following:

 - Reorganize the data to reduce this query's scattered I/O (i.e. cluster on "paliasorigin_search3_idx" rather than
"paliasorigin_alias_casefold_idx"; see the example after this list).  Bear in mind, this may adversely affect other
queries.

 - Increase the cache hit frequency by ensuring the underlying filesystem cache has plenty of RAM (usually so under
Linux) and checking that other concurrent queries aren't polluting the cache.  Consider adding RAM if you think the
working set of blocks required by most queries is larger than the combined Postgres and filesystem caches.  If
processes other than the db do I/O on this machine, consider them as resource consumers, too.

 - Restructure the table, partitioning along a column that would be useful for pruning whole partitions for your
painful queries.  In this case, origin_id or tax_id seems like a good bet, but again, consider other queries against
this table.  38 million rows probably makes your table around 2 GB (guessing about 55 bytes/row).  Depending on the
size and growth rate of the table, it may be time to consider partitioning (there's a rough sketch after this list).
Out of curiosity, what runtime are you typically seeing from this query?  The explain analyze ran in 113 ms, which I'm
guessing is the effect of caching, not the runtime you're trying to improve.

 - Rebuild the indexes on this table.  Under certain use conditions, btree indexes can get horribly bloated.
Rebuilding the indexes returns them to their most compact and balanced form.  For example: reindex index
"paliasorigin_search3_idx";  Apart from the locking and CPU usage during the rebuild, this has no negative
consequences, so I'd try this before something drastic like partitioning.  First review the current size of the index
for comparison: select pg_size_pretty(pg_relation_size('paliasorigin_search3_idx'));
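
As a concrete sketch of the clustering idea above (8.3 syntax; the target index here is simply the one from your plan,
so double-check it matches your hot queries before committing to it):

cluster paliasorigin using paliasorigin_search3_idx;
-- CLUSTER rewrites the whole table under an exclusive lock and does not update
-- statistics, so re-analyze when it finishes
analyze paliasorigin;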
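
And a very rough sketch of what 8.3-style partitioning could look like, purely illustrative: the child table names are
made up, and rows with NULL tax_id as well as the insert-routing rules/triggers are glossed over.

create table paliasorigin_tax_9606  (check (tax_id = 9606))  inherits (paliasorigin);
create table paliasorigin_tax_other (check (tax_id <> 9606)) inherits (paliasorigin);

-- with constraint exclusion enabled, the planner skips children whose CHECK
-- constraints contradict the WHERE clause
set constraint_exclusion = on;
explain select * from paliasorigin where tax_id = 9606;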

Since you asked specifically about improving the row-count estimate, like the previous responder said, you should
consider increasing the statistics target.  This will help if individual columns are being underestimated, but not if
the overestimate is due to joint variation.  In other words, the optimizer has no way to tell whether there is a
logical relationship between columns A and B such that certain values in B only occur with certain values of A.  Just
judging from the names, it sounds like origin_id and tax_id might have a parent-child relationship, so I thought it
was worth mentioning (there's a query at the end of this message you can use to check).

Do the columns individually have good estimates?
explain analyze select * from paliasorigin where origin_id=20;
explain analyze select * from paliasorigin where tax_id=9606;

If not, increase the statistics on that column, reanalyze the table, and recheck the selectivity estimate:
alter table paliasorigin alter column origin_id set statistics 20;
analyze paliasorigin;
explain analyze select * from paliasorigin where origin_id=20;
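
If the single-column estimates do look good, you can test the joint-variation theory directly with something like this
(a full scan of the 38-million-row table, so expect it to take a while; the constants are the ones from your query):

select sum(case when origin_id = 20 then 1 else 0 end)                   as n_origin,
       sum(case when tax_id = 9606 then 1 else 0 end)                    as n_tax,
       sum(case when origin_id = 20 and tax_id = 9606 then 1 else 0 end) as n_both,
       count(*)                                                          as n_total
from paliasorigin;

If n_both/n_total is far from (n_origin/n_total) * (n_tax/n_total), the columns are correlated, and raising per-column
statistics targets won't fix the combined estimate.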

Good luck!
Matt



Re: inaccurate stats on large tables

From: Kiran Mukhyala

On Mon, 2008-09-08 at 09:16 -0700, Matt Smiley wrote:
> Hi Kiran,
>
> You gave great info on your problem.
>
> First, is this the query you're actually trying to speed up, or is it a simplified version?  It looks like the
> optimizer has already chosen the best execution plan for the given query.  Since the query has no joins, we only
> have to consider access paths.  You're fetching 58221/37909009 = 0.15% of the rows, so a sequential scan is clearly
> inappropriate.  A basic index scan is likely to incur extra scattered I/O, so a bitmap index scan is favored.

Thanks for your analysis and sorry for the long silence.

It's a simplified version. I was tackling this part of the original query
plan since I saw that I was getting inaccurate stats on one of the tables.

>
> To improve on this query's runtime, you could try any of the following:
>
>  - Reorganize the data to reduce this query's scattered I/O (i.e. cluster on "paliasorigin_search3_idx" rather than
>    "paliasorigin_alias_casefold_idx").  Bear in mind, this may adversely affect other queries.

I applied this to a different table, which solved my original problem!
The query was hitting the statement_timeout limit but now runs in
reasonable time. I re-clustered one of the tables in my actual query on a
more appropriate index.

>
>  - Increase the cache hit frequency by ensuring the underlying filesystem cache has plenty of RAM (usually so under
>    Linux) and checking that other concurrent queries aren't polluting the cache.  Consider adding RAM if you think
>    the working set of blocks required by most queries is larger than the combined Postgres and filesystem caches.
>    If processes other than the db do I/O on this machine, consider them as resource consumers, too.
>
>  - Restructure the table, partitioning along a column that would be useful for pruning whole partitions for your
>    painful queries.  In this case, origin_id or tax_id seems like a good bet, but again, consider other queries
>    against this table.  38 million rows probably makes your table around 2 GB (guessing about 55 bytes/row).
>    Depending on the size and growth rate of the table, it may be time to consider partitioning.  Out of curiosity,
>    what runtime are you typically seeing from this query?  The explain analyze ran in 113 ms, which I'm guessing is
>    the effect of caching, not the runtime you're trying to improve.

This seems inevitable eventually, if my tables keep growing in size.

>  - Rebuild the indexes on this table.  Under certain use conditions, btree indexes can get horribly bloated.
>    Rebuilding the indexes returns them to their most compact and balanced form.  For example: reindex index
>    "paliasorigin_search3_idx";  Apart from the locking and CPU usage during the rebuild, this has no negative
>    consequences, so I'd try this before something drastic like partitioning.  First review the current size of the
>    index for comparison: select pg_size_pretty(pg_relation_size('paliasorigin_search3_idx'));

This didn't improve the stats.
>
> Since you asked specifically about improving the row-count estimate, like the previous responder said, you should
> consider increasing the statistics target.  This will help if individual columns are being underestimated, but not
> if the overestimate is due to joint variation.  In other words, the optimizer has no way to tell whether there is a
> logical relationship between columns A and B such that certain values in B only occur with certain values of A.
> Just judging from the names, it sounds like origin_id and tax_id might have a parent-child relationship, so I
> thought it was worth mentioning.
>
> Do the columns individually have good estimates?
Yes.
> explain analyze select * from paliasorigin where origin_id=20;
> explain analyze select * from paliasorigin where tax_id=9606;
>
> If not, increase the statistics on that column, reanalyze the table, and recheck the selectivity estimate:
> alter table paliasorigin alter column origin_id set statistics 20;
> analyze paliasorigin;
> explain analyze select * from paliasorigin where origin_id=20;

My default_statistics_target is set to 1000, and I did set some
column-specific statistics targets, but that didn't help in this case.

Thanks a lot.

-Kiran