Thread: TEXT vs PG_NODE_TREE in system columns (cross column and expression statistics patch)

TEXT vs PG_NODE_TREE in system columns (cross column and expression statistics patch)

From
Boszormenyi Zoltan
Date:
Hi,

attached is the WIP patch for cross-column statistics and
extra expression statistics.

My question is that why pg_node_tree is unusable as
syscache attribute? I attempted to alias it as text in the patch
but I get the following error if I try to use it by setting
USE_SYSCACHE_FOR_SEARCH to 1 in selfuncs.c.
Directly using the underlying pg_statistic3 doesn't cause an error.

zozo=# select * from t1 where i+1 = 5;
ERROR:  could not determine which collation to use for string comparison
HINT:  Use the COLLATE clause to set the collation explicitly.

The table looks like this:

create table t1 (id serial primary key, t text, i integer);

Best regards,
Zoltán Böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
     http://www.postgresql.at/


Attachment
Boszormenyi Zoltan <zb@cybertec.at> writes:
> My question is that why pg_node_tree is unusable as
> syscache attribute? I attempted to alias it as text in the patch
> but I get the following error if I try to use it by setting
> USE_SYSCACHE_FOR_SEARCH to 1 in selfuncs.c.
> Directly using the underlying pg_statistic3 doesn't cause an error.

I'm not sure what you're running into, but it doesn't matter because the
design would be unworkable anyway.  Expression text representations
could be extremely long, too long to be usable as index keys.  I don't
believe either of the proposed indexes on the new catalogs are workable,
actually, and the catalog definitions themselves seem a bit outre.
Why are you setting it up so that stats on expressions and cross-column
stats are mutually exclusive?

The idea that's used currently is that we only compute stats on
expressions that are indexed, so the OID/attnum of the index column
can be used as a reference in pg_statistic.  I don't see a strong
need to deviate from that approach.
        regards, tom lane


Excerpts from Boszormenyi Zoltan's message of jue abr 28 11:03:56 -0300 2011:
> Hi,
> 
> attached is the WIP patch for cross-column statistics and
> extra expression statistics.
> 
> My question is that why pg_node_tree is unusable as
> syscache attribute? I attempted to alias it as text in the patch
> but I get the following error if I try to use it by setting
> USE_SYSCACHE_FOR_SEARCH to 1 in selfuncs.c.
> Directly using the underlying pg_statistic3 doesn't cause an error.

Two comments:
1. it seems that expression stats are mostly separate from cross-column
stats; does it really make sense to submit the two in the same patch?

2. there are almost no code comments anywhere

3. (bonus) if you're going to copy/paste pg_attribute.h verbatim into
the new files, please remove the bits you currently have in "#if 0".
(Not to mention the fact that the new catalogs seem rather poorly
named).

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Excerpts from Boszormenyi Zoltan's message of jue abr 28 11:03:56 -0300 2011:

> My question is that why pg_node_tree is unusable as
> syscache attribute? I attempted to alias it as text in the patch
> but I get the following error if I try to use it by setting
> USE_SYSCACHE_FOR_SEARCH to 1 in selfuncs.c.
> Directly using the underlying pg_statistic3 doesn't cause an error.
> 
> zozo=# select * from t1 where i+1 = 5;
> ERROR:  could not determine which collation to use for string comparison
> HINT:  Use the COLLATE clause to set the collation explicitly.

Maybe the pg_node_tree problem is a bug with the collation feature.  If
you could reproduce it in unpatched master, I'm sure it'd find a quick
death.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Alvaro Herrera <alvherre@commandprompt.com> writes:
> Excerpts from Boszormenyi Zoltan's message of jue abr 28 11:03:56 -0300 2011:
>> ERROR:  could not determine which collation to use for string comparison
>> HINT:  Use the COLLATE clause to set the collation explicitly.

> Maybe the pg_node_tree problem is a bug with the collation feature.  If
> you could reproduce it in unpatched master, I'm sure it'd find a quick
> death.

Actually, I rather imagine it comes from this choice in catcache.c:
       /* Currently, there are no catcaches on collation-aware data types */       cache->cc_skey[i].sk_collation =
InvalidOid;

I'd be more worried about that if I thought it made any sense to use
a pg_node_tree column as an index key, but I don't ...
        regards, tom lane


Hi,

2011-04-28 17:20 keltezéssel, Alvaro Herrera írta:
> Excerpts from Boszormenyi Zoltan's message of jue abr 28 11:03:56 -0300 2011:
>> Hi,
>>
>> attached is the WIP patch for cross-column statistics and
>> extra expression statistics.
>>
>> My question is that why pg_node_tree is unusable as
>> syscache attribute? I attempted to alias it as text in the patch
>> but I get the following error if I try to use it by setting
>> USE_SYSCACHE_FOR_SEARCH to 1 in selfuncs.c.
>> Directly using the underlying pg_statistic3 doesn't cause an error.
> Two comments:
> 1. it seems that expression stats are mostly separate from cross-column
> stats; does it really make sense to submit the two in the same patch?
>
> 2. there are almost no code comments anywhere
>
> 3. (bonus) if you're going to copy/paste pg_attribute.h verbatim into
> the new files, please remove the bits you currently have in "#if 0".
> (Not to mention the fact that the new catalogs seem rather poorly
> named).

OK, we went to a different route this time. Here is what we came
up with. Attached are two patches.

attnum-int2vector.patch implements:

- int2vector support routines and catalog entries for them
- pg_statistic is modified so "staattnum int2" it converted to
  "staattnums int2vector". RemoveStatistics() is modified to take
  an array of AttrNumber and the length of it.
- pg_attribute.attstattarget is moved to pg_statistic.statarget,
  pg_statistic gains a new "stavalid" bool field. Two support routines
  are added: AddStatistics() and InvalidateStatistics(). Entries
  in pg_statistic for table columns are created upon table creation
  and ALTER TABLE ADD COLUMN and maintained for the lifetime
  of the column. Exceptions are system tables: calling AddStatistics()
  for them during initdb is a Catch-22 when pg_statistic doesn't yet
  exist. For these, ANALYZE creates the pg_statistic record just
  as before. ALTER TABLE ALTER COLUMN SET DATA TYPE
  only invalidates the record by setting "stavalid" to false.
- Factor out common code for getting the statistics tuple into a
  new function called validate_statistics().

cross-col-syntax.patch builds on the first patch and implements:

CREATE CROSS COLUMN STATISTICS ON TABLE tabname (col, ...)
     [ WITH ( statistics_target ) ] ;

DROP CROSS COLUMN STATISTICS ON TABLE tabname (col, ...) ;

CREATE CROSS COLUMN STATISTICS ON INDEX idxname
     [ WITH ( statistics_target ) ] ;

DROP CROSS COLUMN STATISTICS ON INDEX idxname ;

and puts new records into pg_statistic with array_length(staattnums, 1) > 1.
Note: this patch should record dependencies on the respective table or
index and the fields but doesn't.

The data structure for storing the N-dimension histogram is not yet decided.

Comments?

Best regards,
Zoltán böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
     http://www.postgresql.at/


Attachment