Thread: How to specify/mock the statistic data of tables in PostgreSQL
Hi all,
I want to use PostgreSQL to help me calculate the cardinality/selectivity of some queries, but I do not want to insert any actual data into these tables in PostgreSQL (since the data size is huge). So I plan to calculate the statistic data myself (outside PostgreSQL) and specify the metrics manually (maybe by modifying the pg_statistic table), so that PG's optimizer may use these statistics to evaluate the query (EXPLAIN ...). Here are my questions:
1. Is it possible to do what I've described above?
2. I've taken a look at the pg_statistic table and the pg_stats view; in the view I saw that most_common_elems/most_common_elem_freqs/elem_count_histogram were empty, and I'm also a little confused about the column called correlation. Is there any detailed documentation about how these metrics are calculated in PostgreSQL?
Thanks!
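For reference, a minimal sketch of the kind of manual tweak being asked about (a hedged illustration, not something from the thread: the table and column names are invented, and superuser access on a throwaway database is assumed). The planner takes base row and page counts from pg_class and per-column distributions from pg_statistic, so both would need to be faked, and a later ANALYZE or autovacuum will overwrite the hand-edited values:

-- Create only the table structure; no data is loaded.
CREATE TABLE big_orders (id int, status text);

-- Pretend the empty table holds 100 million rows over ~1.2 million pages.
-- The planner reads pg_class.reltuples and pg_class.relpages for these.
UPDATE pg_class
   SET reltuples = 100000000, relpages = 1200000
 WHERE relname = 'big_orders';

-- Row estimates in EXPLAIN now scale from the faked counts, even though the
-- table is empty; without pg_statistic rows for "status", equality selectivity
-- falls back to a default.
EXPLAIN SELECT * FROM big_orders WHERE status = 'shipped';

Per-column statistics live in pg_statistic, whose stakindN/staopN/stavaluesN layout is documented in src/include/catalog/pg_statistic.h; hand-crafting those rows requires superuser rights and matching the column's data types, which is why loading a representative sample and running ANALYZE is usually easier.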
On Fri, Jan 10, 2014 at 6:00 PM, ygnhzeus <ygnhzeus@gmail.com> wrote:
> Hi all,
>
> I want to use PostgreSQL to help me calculate the cardinality/selectivity of
> some queries, but I do not want to insert any actual data into these tables
> (since the data size is huge). So I plan to calculate the statistic data
> myself (outside PostgreSQL) and manually specify the metrics (maybe by
> modifying the pg_statistic table), so that PG's optimizer may use these
> statistics to evaluate the query (EXPLAIN ...). Here are my questions:
>
> 1. Is it possible to do what I've described above?
>
> 2. I've taken a look at the pg_statistic table and the pg_stats view; in the
> view I saw that most_common_elems/most_common_elem_freqs/elem_count_histogram
> were empty, and I'm also a little confused about the column called
> correlation. Is there any detailed documentation about how these metrics are
> calculated in PostgreSQL?

About correlation:

As you might know, an index on some column imparts a logical ordering (for
example, ascending) to table rows based on that column, but the actual rows
are not stored in the relation file in that logical order. So there is a
random disk access penalty when fetching individual rows from the heap (for
example, in range scans that use the index). "correlation" denotes how close
these two orderings are to each other. A command called CLUSTER can be used to
physically reorder a table's rows to match the logical ordering imposed by
some index on that table. More about CLUSTER here:

http://www.postgresql.org/docs/9.3/static/sql-cluster.html

Consider the following example:

postgres=# create table test as select generate_series(1,1000000) as a order by random();
SELECT 1000000
postgres=# create index test_idx on test using btree (a);
CREATE INDEX
postgres=# analyze test;
ANALYZE
postgres=# select correlation from pg_stats where tablename = 'test';
 correlation
-------------
 -0.00164016
(1 row)

postgres=# select count(*) from test where a between 34000 and 68000;
 count
-------
 34001
(1 row)

Time: 26.875 ms

Note that the correlation is pretty close to zero, meaning the physical
ordering of the rows is different from the logical ordering imposed by the
index.

postgres=# cluster test using test_idx;
CLUSTER

This should put the rows of the table into the same order as the index.

postgres=# analyze test;
ANALYZE
postgres=# select correlation from pg_stats where tablename = 'test';
 correlation
-------------
           1
(1 row)

postgres=# select count(*) from test where a between 34000 and 68000;
 count
-------
 34001
(1 row)

Time: 12.990 ms

Now the rows of the table are in almost the same physical order as the index,
which reduces random disk accesses: after CLUSTER, the same query runs in
about half the time of the unclustered case.

How the pg_stats statistics are used by the planner for row estimation is
described here:

http://www.postgresql.org/docs/9.3/static/row-estimation-examples.html

However, to understand how they are generated by ANALYZE (in most cases, using
random sampling), I guess you'd need to go through its code in the source file
"src/backend/commands/analyze.c".

--
Amit Langote
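As a hedged illustration (not part of the reply above; it reuses the "test" table from the example, and exact plan shapes and numbers vary with version and settings), EXPLAIN shows both effects at once: the row estimate on the scan node comes from the histogram on column a, while the choice and cost of the access method reflect the correlation:

postgres=# explain select count(*) from test where a between 34000 and 68000;
-- The "rows=" estimate should land close to 34001, because the histogram over
-- a uniformly distributed column covers the BETWEEN range accurately.
-- Before CLUSTER (correlation near 0) the planner tends to avoid a plain index
-- scan in favour of a bitmap or sequential scan; after CLUSTER (correlation
-- near 1) an index scan becomes cheap because heap fetches are mostly
-- sequential.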
On Fri, Jan 10, 2014 at 11:19 PM, Atri Sharma <atri.jiit@gmail.com> wrote:
>
>
> Sent from my iPad
>
> On 10-Jan-2014, at 19:42, "ygnhzeus" <ygnhzeus@gmail.com> wrote:
>
> Thanks for your reply.
> So correlation is not related to the calculation of selectivity, right? If I
> force PostgreSQL not to optimize the join order (by setting
> join_collapse_limit and from_collapse_limit to 1), is there any other
> factor that may affect the structure of the execution plan, regardless of
> the data access method?
>
> 2014-01-10
> ________________________________
> ygnhzeus
> ________________________________
> From: Amit Langote <amitlangote09@gmail.com>
> Sent: 2014-01-10 22:00
> Subject: Re: [GENERAL] How to specify/mock the statistic data of tables in
> PostgreSQL
> To: "ygnhzeus" <ygnhzeus@gmail.com>
> Cc: "pgsql-general" <pgsql-general@postgresql.org>
>
>
>
> AFAIK, correlation is involved in the calculation of the costs that are used
> for deciding the type of access. If the correlation is low, an index scan can
> lead to quite a few random reads, hence higher costs.
>
Ah, I forgot to mention this point about how the planner uses correlation
for access method selection.
And selectivity is a function of the statistical distribution of column
values described in pg_statistic: histograms, most common values
(with their occurrence frequencies), the number of distinct values, etc.
It has nothing to do with correlation.
--
Amit Langote
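As an illustrative aside (a hedged sketch, not part of the message above; it reuses the "test" table and column a from earlier in the thread, and the output will differ on other data), the selectivity inputs listed above and the correlation figure can be inspected side by side in pg_stats:

select n_distinct, most_common_vals, most_common_freqs,
       histogram_bounds, correlation
from pg_stats
where tablename = 'test' and attname = 'a';
-- most_common_vals/most_common_freqs, histogram_bounds and n_distinct feed the
-- selectivity estimates; correlation feeds only the I/O cost model used when
-- choosing between sequential, bitmap, and index scans.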
Felix.徐 <ygnhzeus@gmail.com> writes:
> // line 194: In a "most common values" slot, staop is the OID of the "="
> // operator used to decide whether values are the same or not.
> // line 206: A "histogram" slot describes the distribution of scalar data.
> // staop is the OID of the "<" operator that describes the sort ordering.
> I don't understand the function of staop here; how is it used in the
> optimizer?

In principle a data type could have more than one sort ordering, and if we
were to collect stats according to multiple orderings, staop would be needed
to identify which ordering a particular set of statistics was created with.
That flexibility isn't being used right now, at least not by any built-in
code. There are types with more than one ordering (more than one btree
opclass), but ANALYZE only collects stats for the default btree opclass.

regards, tom lane
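A hedged illustration of the slot layout described above (not from the reply itself; it assumes the "test" table built earlier in the thread and the 9.3-era pg_statistic columns, and reading pg_statistic directly normally requires superuser rights): each statistics slot records the operator OID it was collected under, and those OIDs can be resolved to operator names:

select s.staattnum,
       s.stakind1, s.staop1, o1.oprname as op1,
       s.stakind2, s.staop2, o2.oprname as op2,
       s.stakind3, s.staop3, o3.oprname as op3
from pg_statistic s
join pg_class c on c.oid = s.starelid
left join pg_operator o1 on o1.oid = s.staop1
left join pg_operator o2 on o2.oid = s.staop2
left join pg_operator o3 on o3.oid = s.staop3
where c.relname = 'test';
-- Slot kinds: 1 = most common values (keyed by the "=" operator), 2 = histogram
-- (keyed by "<"), 3 = correlation (also keyed by "<"). For a unique column like
-- test.a, expect histogram and correlation slots rather than an MCV slot.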