Thread: Upgrading pg_statistic to handle collation honestly

Upgrading pg_statistic to handle collation honestly

From
Tom Lane
Date:
When we first put in collations support, we basically punted on teaching
ANALYZE, pg_statistic, and the planner selectivity functions about that.
They just use DEFAULT_COLLATION_OID independently of the actual collation
of the data.  I tripped over this while investigating making type "name"
collatable: it needs to default to C_COLLATION_OID, and the mismatch
resulted in broken statistics for name columns.  So it's time to pay down
that technical debt.

Attached is a draft patch for same.  It adds storage to pg_statistic
to record the collation of each statistics "slot".  A plausible
alternative design would be to just say "look at the collation of the
underlying column", but that would require extra catcache lookups in
the selectivity functions that need the info.  Doing it like this also
makes it theoretically feasible to track stats computed with respect
to different collations for the same column, though I'm not really
convinced that we'd ever do that.

Loose ends:

* I'm not sure what, if anything, needs to be done in the extended
statistics stuff.  It looks like the existing types of extended stats
aren't really collation sensitive, so maybe the answer is "nothing".

* There's a remaining use of DEFAULT_COLLATION_OID in array_selfuncs.c's
element_compare().  I'm not sure if it's important to get rid of that,
either; it doesn't seem to be used for anything that relates to
collected statistics, so it might be fine as-is.

* Probably this conflicts to some extent with Peter's "Reorganize
collation lookup" patch, but I haven't studied that yet.

* There's a kluge in get_attstatsslot() that I'd like to get rid of
later, but it's necessary for now because of the weird things that
happen when doing regex operators on "name" columns.

Comments, objections?

            regards, tom lane

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 18c38e4..af4d062 100644
*** a/doc/src/sgml/catalogs.sgml
--- b/doc/src/sgml/catalogs.sgml
*************** SCRAM-SHA-256$<replaceable><iteration
*** 6395,6400 ****
--- 6395,6412 ----
       </row>

       <row>
+       <entry><structfield>stacoll<replaceable>N</replaceable></structfield></entry>
+       <entry><type>oid</type></entry>
+       <entry><literal><link
linkend="catalog-pg-collation"><structname>pg_collation</structname></link>.oid</literal></entry>
+       <entry>
+        The collation used to derive the statistics stored in the
+        <replaceable>N</replaceable>th <quote>slot</quote>.  For example, a
+        histogram slot for a collatable column would show the collation that
+        defines the sort order of the data.  Zero for noncollatable data.
+       </entry>
+      </row>
+
+      <row>
        <entry><structfield>stanumbers<replaceable>N</replaceable></structfield></entry>
        <entry><type>float4[]</type></entry>
        <entry></entry>
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index b8445dc..dcbd04c 100644
*** a/src/backend/commands/analyze.c
--- b/src/backend/commands/analyze.c
*************** examine_attribute(Relation onerel, int a
*** 904,914 ****
--- 904,917 ----
      {
          stats->attrtypid = exprType(index_expr);
          stats->attrtypmod = exprTypmod(index_expr);
+         stats->attrcollid = exprCollation(index_expr);
+         /* XXX should we consult indcollation instead? */
      }
      else
      {
          stats->attrtypid = attr->atttypid;
          stats->attrtypmod = attr->atttypmod;
+         stats->attrcollid = attr->attcollation;
      }

      typtuple = SearchSysCacheCopy1(TYPEOID,
*************** update_attstats(Oid relid, bool inh, int
*** 1553,1558 ****
--- 1556,1566 ----
          {
              values[i++] = ObjectIdGetDatum(stats->staop[k]);    /* staopN */
          }
+         i = Anum_pg_statistic_stacoll1 - 1;
+         for (k = 0; k < STATISTIC_NUM_SLOTS; k++)
+         {
+             values[i++] = ObjectIdGetDatum(stats->stacoll[k]);    /* stacollN */
+         }
          i = Anum_pg_statistic_stanumbers1 - 1;
          for (k = 0; k < STATISTIC_NUM_SLOTS; k++)
          {
*************** compute_distinct_stats(VacAttrStatsP sta
*** 1993,2001 ****
          firstcount1 = track_cnt;
          for (j = 0; j < track_cnt; j++)
          {
-             /* We always use the default collation for statistics */
              if (DatumGetBool(FunctionCall2Coll(&f_cmpeq,
!                                                DEFAULT_COLLATION_OID,
                                                 value, track[j].value)))
              {
                  match = true;
--- 2001,2008 ----
          firstcount1 = track_cnt;
          for (j = 0; j < track_cnt; j++)
          {
              if (DatumGetBool(FunctionCall2Coll(&f_cmpeq,
!                                                stats->attrcollid,
                                                 value, track[j].value)))
              {
                  match = true;
*************** compute_distinct_stats(VacAttrStatsP sta
*** 2202,2207 ****
--- 2209,2215 ----

              stats->stakind[0] = STATISTIC_KIND_MCV;
              stats->staop[0] = mystats->eqopr;
+             stats->stacoll[0] = stats->attrcollid;
              stats->stanumbers[0] = mcv_freqs;
              stats->numnumbers[0] = num_mcv;
              stats->stavalues[0] = mcv_values;
*************** compute_scalar_stats(VacAttrStatsP stats
*** 2273,2280 ****

      memset(&ssup, 0, sizeof(ssup));
      ssup.ssup_cxt = CurrentMemoryContext;
!     /* We always use the default collation for statistics */
!     ssup.ssup_collation = DEFAULT_COLLATION_OID;
      ssup.ssup_nulls_first = false;

      /*
--- 2281,2287 ----

      memset(&ssup, 0, sizeof(ssup));
      ssup.ssup_cxt = CurrentMemoryContext;
!     ssup.ssup_collation = stats->attrcollid;
      ssup.ssup_nulls_first = false;

      /*
*************** compute_scalar_stats(VacAttrStatsP stats
*** 2567,2572 ****
--- 2574,2580 ----

              stats->stakind[slot_idx] = STATISTIC_KIND_MCV;
              stats->staop[slot_idx] = mystats->eqopr;
+             stats->stacoll[slot_idx] = stats->attrcollid;
              stats->stanumbers[slot_idx] = mcv_freqs;
              stats->numnumbers[slot_idx] = num_mcv;
              stats->stavalues[slot_idx] = mcv_values;
*************** compute_scalar_stats(VacAttrStatsP stats
*** 2682,2687 ****
--- 2690,2696 ----

              stats->stakind[slot_idx] = STATISTIC_KIND_HISTOGRAM;
              stats->staop[slot_idx] = mystats->ltopr;
+             stats->stacoll[slot_idx] = stats->attrcollid;
              stats->stavalues[slot_idx] = hist_values;
              stats->numvalues[slot_idx] = num_hist;

*************** compute_scalar_stats(VacAttrStatsP stats
*** 2725,2730 ****
--- 2734,2740 ----

              stats->stakind[slot_idx] = STATISTIC_KIND_CORRELATION;
              stats->staop[slot_idx] = mystats->ltopr;
+             stats->stacoll[slot_idx] = stats->attrcollid;
              stats->stanumbers[slot_idx] = corrs;
              stats->numnumbers[slot_idx] = 1;
              slot_idx++;
diff --git a/src/backend/tsearch/ts_typanalyze.c b/src/backend/tsearch/ts_typanalyze.c
index 1f93963..bd34711 100644
*** a/src/backend/tsearch/ts_typanalyze.c
--- b/src/backend/tsearch/ts_typanalyze.c
***************
*** 14,19 ****
--- 14,20 ----
  #include "postgres.h"

  #include "access/hash.h"
+ #include "catalog/pg_collation.h"
  #include "catalog/pg_operator.h"
  #include "commands/vacuum.h"
  #include "tsearch/ts_type.h"
*************** compute_tsvector_stats(VacAttrStats *sta
*** 415,420 ****
--- 416,422 ----

              stats->stakind[0] = STATISTIC_KIND_MCELEM;
              stats->staop[0] = TextEqualOperator;
+             stats->stacoll[0] = DEFAULT_COLLATION_OID;
              stats->stanumbers[0] = mcelem_freqs;
              /* See above comment about two extra frequency fields */
              stats->numnumbers[0] = num_mcelem + 2;
diff --git a/src/backend/utils/adt/array_typanalyze.c b/src/backend/utils/adt/array_typanalyze.c
index 92e38b8..c4a1fef 100644
*** a/src/backend/utils/adt/array_typanalyze.c
--- b/src/backend/utils/adt/array_typanalyze.c
***************
*** 15,21 ****
  #include "postgres.h"

  #include "access/tuptoaster.h"
- #include "catalog/pg_collation.h"
  #include "commands/vacuum.h"
  #include "utils/array.h"
  #include "utils/builtins.h"
--- 15,20 ----
*************** typedef struct
*** 39,44 ****
--- 38,44 ----
      /* Information about array element type */
      Oid            type_id;        /* element type's OID */
      Oid            eq_opr;            /* default equality operator's OID */
+     Oid            coll_id;        /* collation to use */
      bool        typbyval;        /* physical properties of element type */
      int16        typlen;
      char        typalign;
*************** array_typanalyze(PG_FUNCTION_ARGS)
*** 135,140 ****
--- 135,141 ----
      extra_data = (ArrayAnalyzeExtraData *) palloc(sizeof(ArrayAnalyzeExtraData));
      extra_data->type_id = typentry->type_id;
      extra_data->eq_opr = typentry->eq_opr;
+     extra_data->coll_id = stats->attrcollid;    /* collation we should use */
      extra_data->typbyval = typentry->typbyval;
      extra_data->typlen = typentry->typlen;
      extra_data->typalign = typentry->typalign;
*************** compute_array_stats(VacAttrStats *stats,
*** 560,565 ****
--- 561,567 ----

              stats->stakind[slot_idx] = STATISTIC_KIND_MCELEM;
              stats->staop[slot_idx] = extra_data->eq_opr;
+             stats->stacoll[slot_idx] = extra_data->coll_id;
              stats->stanumbers[slot_idx] = mcelem_freqs;
              /* See above comment about extra stanumber entries */
              stats->numnumbers[slot_idx] = num_mcelem + 3;
*************** compute_array_stats(VacAttrStats *stats,
*** 661,666 ****
--- 663,669 ----

              stats->stakind[slot_idx] = STATISTIC_KIND_DECHIST;
              stats->staop[slot_idx] = extra_data->eq_opr;
+             stats->stacoll[slot_idx] = extra_data->coll_id;
              stats->stanumbers[slot_idx] = hist;
              stats->numnumbers[slot_idx] = num_hist + 1;
              slot_idx++;
*************** prune_element_hashtable(HTAB *elements_t
*** 703,709 ****
  /*
   * Hash function for elements.
   *
!  * We use the element type's default hash opclass, and the default collation
   * if the type is collation-sensitive.
   */
  static uint32
--- 706,712 ----
  /*
   * Hash function for elements.
   *
!  * We use the element type's default hash opclass, and the column collation
   * if the type is collation-sensitive.
   */
  static uint32
*************** element_hash(const void *key, Size keysi
*** 712,718 ****
      Datum        d = *((const Datum *) key);
      Datum        h;

!     h = FunctionCall1Coll(array_extra_data->hash, DEFAULT_COLLATION_OID, d);
      return DatumGetUInt32(h);
  }

--- 715,723 ----
      Datum        d = *((const Datum *) key);
      Datum        h;

!     h = FunctionCall1Coll(array_extra_data->hash,
!                           array_extra_data->coll_id,
!                           d);
      return DatumGetUInt32(h);
  }

*************** element_match(const void *key1, const vo
*** 729,735 ****
  /*
   * Comparison function for elements.
   *
!  * We use the element type's default btree opclass, and the default collation
   * if the type is collation-sensitive.
   *
   * XXX consider using SortSupport infrastructure
--- 734,740 ----
  /*
   * Comparison function for elements.
   *
!  * We use the element type's default btree opclass, and the column collation
   * if the type is collation-sensitive.
   *
   * XXX consider using SortSupport infrastructure
*************** element_compare(const void *key1, const
*** 741,747 ****
      Datum        d2 = *((const Datum *) key2);
      Datum        c;

!     c = FunctionCall2Coll(array_extra_data->cmp, DEFAULT_COLLATION_OID, d1, d2);
      return DatumGetInt32(c);
  }

--- 746,754 ----
      Datum        d2 = *((const Datum *) key2);
      Datum        c;

!     c = FunctionCall2Coll(array_extra_data->cmp,
!                           array_extra_data->coll_id,
!                           d1, d2);
      return DatumGetInt32(c);
  }

diff --git a/src/backend/utils/adt/rangetypes_typanalyze.c b/src/backend/utils/adt/rangetypes_typanalyze.c
index 9c50e4c..98cf5f8 100644
*** a/src/backend/utils/adt/rangetypes_typanalyze.c
--- b/src/backend/utils/adt/rangetypes_typanalyze.c
*************** compute_range_stats(VacAttrStats *stats,
*** 320,325 ****
--- 320,326 ----
              num_hist = 0;
          }
          stats->staop[slot_idx] = Float8LessOperator;
+         stats->stacoll[slot_idx] = InvalidOid;
          stats->stavalues[slot_idx] = length_hist_values;
          stats->numvalues[slot_idx] = num_hist;
          stats->statypid[slot_idx] = FLOAT8OID;
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index ffca0fe..c3db9ea 100644
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
***************
*** 87,97 ****
   * For both oprrest and oprjoin functions, the operator's input collation OID
   * (if any) is passed using the standard fmgr mechanism, so that the estimator
   * function can fetch it with PG_GET_COLLATION().  Note, however, that all
!  * statistics in pg_statistic are currently built using the database's default
   * collation.  Thus, in most cases where we are looking at statistics, we
!  * should ignore the actual operator collation and use DEFAULT_COLLATION_OID.
   * We expect that the error induced by doing this is usually not large enough
!  * to justify complicating matters.
   *----------
   */

--- 87,98 ----
   * For both oprrest and oprjoin functions, the operator's input collation OID
   * (if any) is passed using the standard fmgr mechanism, so that the estimator
   * function can fetch it with PG_GET_COLLATION().  Note, however, that all
!  * statistics in pg_statistic are currently built using the relevant column's
   * collation.  Thus, in most cases where we are looking at statistics, we
!  * should ignore the operator collation and use the stats entry's collation.
   * We expect that the error induced by doing this is usually not large enough
!  * to justify complicating matters.  In any case, doing otherwise would yield
!  * entirely garbage results for ordered stats data such as histograms.
   *----------
   */

*************** static double eqjoinsel_semi(Oid opfunco
*** 181,187 ****
                 RelOptInfo *inner_rel);
  static bool estimate_multivariate_ndistinct(PlannerInfo *root,
                                  RelOptInfo *rel, List **varinfos, double *ndistinct);
! static bool convert_to_scalar(Datum value, Oid valuetypid, double *scaledvalue,
                    Datum lobound, Datum hibound, Oid boundstypid,
                    double *scaledlobound, double *scaledhibound);
  static double convert_numeric_to_scalar(Datum value, Oid typid, bool *failure);
--- 182,189 ----
                 RelOptInfo *inner_rel);
  static bool estimate_multivariate_ndistinct(PlannerInfo *root,
                                  RelOptInfo *rel, List **varinfos, double *ndistinct);
! static bool convert_to_scalar(Datum value, Oid valuetypid, Oid collid,
!                   double *scaledvalue,
                    Datum lobound, Datum hibound, Oid boundstypid,
                    double *scaledlobound, double *scaledhibound);
  static double convert_numeric_to_scalar(Datum value, Oid typid, bool *failure);
*************** static double convert_one_string_to_scal
*** 201,207 ****
                               int rangelo, int rangehi);
  static double convert_one_bytea_to_scalar(unsigned char *value, int valuelen,
                              int rangelo, int rangehi);
! static char *convert_string_datum(Datum value, Oid typid, bool *failure);
  static double convert_timevalue_to_scalar(Datum value, Oid typid,
                              bool *failure);
  static void examine_simple_variable(PlannerInfo *root, Var *var,
--- 203,210 ----
                               int rangelo, int rangehi);
  static double convert_one_bytea_to_scalar(unsigned char *value, int valuelen,
                              int rangelo, int rangehi);
! static char *convert_string_datum(Datum value, Oid typid, Oid collid,
!                      bool *failure);
  static double convert_timevalue_to_scalar(Datum value, Oid typid,
                              bool *failure);
  static void examine_simple_variable(PlannerInfo *root, Var *var,
*************** var_eq_const(VariableStatData *vardata,
*** 370,381 ****
                  /* be careful to apply operator right way 'round */
                  if (varonleft)
                      match = DatumGetBool(FunctionCall2Coll(&eqproc,
!                                                            DEFAULT_COLLATION_OID,
                                                             sslot.values[i],
                                                             constval));
                  else
                      match = DatumGetBool(FunctionCall2Coll(&eqproc,
!                                                            DEFAULT_COLLATION_OID,
                                                             constval,
                                                             sslot.values[i]));
                  if (match)
--- 373,384 ----
                  /* be careful to apply operator right way 'round */
                  if (varonleft)
                      match = DatumGetBool(FunctionCall2Coll(&eqproc,
!                                                            sslot.stacoll,
                                                             sslot.values[i],
                                                             constval));
                  else
                      match = DatumGetBool(FunctionCall2Coll(&eqproc,
!                                                            sslot.stacoll,
                                                             constval,
                                                             sslot.values[i]));
                  if (match)
*************** mcv_selectivity(VariableStatData *vardat
*** 666,676 ****
          {
              if (varonleft ?
                  DatumGetBool(FunctionCall2Coll(opproc,
!                                                DEFAULT_COLLATION_OID,
                                                 sslot.values[i],
                                                 constval)) :
                  DatumGetBool(FunctionCall2Coll(opproc,
!                                                DEFAULT_COLLATION_OID,
                                                 constval,
                                                 sslot.values[i])))
                  mcv_selec += sslot.numbers[i];
--- 669,679 ----
          {
              if (varonleft ?
                  DatumGetBool(FunctionCall2Coll(opproc,
!                                                sslot.stacoll,
                                                 sslot.values[i],
                                                 constval)) :
                  DatumGetBool(FunctionCall2Coll(opproc,
!                                                sslot.stacoll,
                                                 constval,
                                                 sslot.values[i])))
                  mcv_selec += sslot.numbers[i];
*************** histogram_selectivity(VariableStatData *
*** 744,754 ****
              {
                  if (varonleft ?
                      DatumGetBool(FunctionCall2Coll(opproc,
!                                                    DEFAULT_COLLATION_OID,
                                                     sslot.values[i],
                                                     constval)) :
                      DatumGetBool(FunctionCall2Coll(opproc,
!                                                    DEFAULT_COLLATION_OID,
                                                     constval,
                                                     sslot.values[i])))
                      nmatch++;
--- 747,757 ----
              {
                  if (varonleft ?
                      DatumGetBool(FunctionCall2Coll(opproc,
!                                                    sslot.stacoll,
                                                     sslot.values[i],
                                                     constval)) :
                      DatumGetBool(FunctionCall2Coll(opproc,
!                                                    sslot.stacoll,
                                                     constval,
                                                     sslot.values[i])))
                      nmatch++;
*************** ineq_histogram_selectivity(PlannerInfo *
*** 873,879 ****
                                                           &sslot.values[probe]);

                  ltcmp = DatumGetBool(FunctionCall2Coll(opproc,
!                                                        DEFAULT_COLLATION_OID,
                                                         sslot.values[probe],
                                                         constval));
                  if (isgt)
--- 876,882 ----
                                                           &sslot.values[probe]);

                  ltcmp = DatumGetBool(FunctionCall2Coll(opproc,
!                                                        sslot.stacoll,
                                                         sslot.values[probe],
                                                         constval));
                  if (isgt)
*************** ineq_histogram_selectivity(PlannerInfo *
*** 958,964 ****
                   * values to a uniform comparison scale, and do a linear
                   * interpolation within this bin.
                   */
!                 if (convert_to_scalar(constval, consttype, &val,
                                        sslot.values[i - 1], sslot.values[i],
                                        vardata->vartype,
                                        &low, &high))
--- 961,968 ----
                   * values to a uniform comparison scale, and do a linear
                   * interpolation within this bin.
                   */
!                 if (convert_to_scalar(constval, consttype, sslot.stacoll,
!                                       &val,
                                        sslot.values[i - 1], sslot.values[i],
                                        vardata->vartype,
                                        &low, &high))
*************** eqjoinsel_inner(Oid opfuncoid,
*** 2499,2505 ****
                  if (hasmatch2[j])
                      continue;
                  if (DatumGetBool(FunctionCall2Coll(&eqproc,
!                                                    DEFAULT_COLLATION_OID,
                                                     sslot1->values[i],
                                                     sslot2->values[j])))
                  {
--- 2503,2509 ----
                  if (hasmatch2[j])
                      continue;
                  if (DatumGetBool(FunctionCall2Coll(&eqproc,
!                                                    sslot1->stacoll,
                                                     sslot1->values[i],
                                                     sslot2->values[j])))
                  {
*************** eqjoinsel_semi(Oid opfuncoid,
*** 2711,2717 ****
                  if (hasmatch2[j])
                      continue;
                  if (DatumGetBool(FunctionCall2Coll(&eqproc,
!                                                    DEFAULT_COLLATION_OID,
                                                     sslot1->values[i],
                                                     sslot2->values[j])))
                  {
--- 2715,2721 ----
                  if (hasmatch2[j])
                      continue;
                  if (DatumGetBool(FunctionCall2Coll(&eqproc,
!                                                    sslot1->stacoll,
                                                     sslot1->values[i],
                                                     sslot2->values[j])))
                  {
*************** estimate_multivariate_ndistinct(PlannerI
*** 4066,4072 ****
   * converted to measurements expressed in seconds.
   */
  static bool
! convert_to_scalar(Datum value, Oid valuetypid, double *scaledvalue,
                    Datum lobound, Datum hibound, Oid boundstypid,
                    double *scaledlobound, double *scaledhibound)
  {
--- 4070,4076 ----
   * converted to measurements expressed in seconds.
   */
  static bool
! convert_to_scalar(Datum value, Oid valuetypid, Oid collid, double *scaledvalue,
                    Datum lobound, Datum hibound, Oid boundstypid,
                    double *scaledlobound, double *scaledhibound)
  {
*************** convert_to_scalar(Datum value, Oid value
*** 4131,4141 ****
          case NAMEOID:
              {
                  char       *valstr = convert_string_datum(value, valuetypid,
!                                                           &failure);
                  char       *lostr = convert_string_datum(lobound, boundstypid,
!                                                          &failure);
                  char       *histr = convert_string_datum(hibound, boundstypid,
!                                                          &failure);

                  /*
                   * Bail out if any of the values is not of string type.  We
--- 4135,4145 ----
          case NAMEOID:
              {
                  char       *valstr = convert_string_datum(value, valuetypid,
!                                                           collid, &failure);
                  char       *lostr = convert_string_datum(lobound, boundstypid,
!                                                          collid, &failure);
                  char       *histr = convert_string_datum(hibound, boundstypid,
!                                                          collid, &failure);

                  /*
                   * Bail out if any of the values is not of string type.  We
*************** convert_one_string_to_scalar(char *value
*** 4404,4410 ****
   * before continuing, so as to generate correct locale-specific results.
   */
  static char *
! convert_string_datum(Datum value, Oid typid, bool *failure)
  {
      char       *val;

--- 4408,4414 ----
   * before continuing, so as to generate correct locale-specific results.
   */
  static char *
! convert_string_datum(Datum value, Oid typid, Oid collid, bool *failure)
  {
      char       *val;

*************** convert_string_datum(Datum value, Oid ty
*** 4432,4438 ****
              return NULL;
      }

!     if (!lc_collate_is_c(DEFAULT_COLLATION_OID))
      {
          char       *xfrmstr;
          size_t        xfrmlen;
--- 4436,4442 ----
              return NULL;
      }

!     if (!lc_collate_is_c(collid))
      {
          char       *xfrmstr;
          size_t        xfrmlen;
*************** get_variable_range(PlannerInfo *root, Va
*** 5407,5420 ****
                  continue;
              }
              if (DatumGetBool(FunctionCall2Coll(&opproc,
!                                                DEFAULT_COLLATION_OID,
                                                 sslot.values[i], tmin)))
              {
                  tmin = sslot.values[i];
                  tmin_is_mcv = true;
              }
              if (DatumGetBool(FunctionCall2Coll(&opproc,
!                                                DEFAULT_COLLATION_OID,
                                                 tmax, sslot.values[i])))
              {
                  tmax = sslot.values[i];
--- 5411,5424 ----
                  continue;
              }
              if (DatumGetBool(FunctionCall2Coll(&opproc,
!                                                sslot.stacoll,
                                                 sslot.values[i], tmin)))
              {
                  tmin = sslot.values[i];
                  tmin_is_mcv = true;
              }
              if (DatumGetBool(FunctionCall2Coll(&opproc,
!                                                sslot.stacoll,
                                                 tmax, sslot.values[i])))
              {
                  tmax = sslot.values[i];
*************** prefix_selectivity(PlannerInfo *root, Va
*** 6014,6019 ****
--- 6018,6024 ----
      Selectivity prefixsel;
      Oid            cmpopr;
      FmgrInfo    opproc;
+     AttStatsSlot sslot;
      Const       *greaterstrcon;
      Selectivity eq_sel;

*************** prefix_selectivity(PlannerInfo *root, Va
*** 6036,6051 ****

      /*-------
       * If we can create a string larger than the prefix, say
!      *    "x < greaterstr".
       *-------
       */
      cmpopr = get_opfamily_member(opfamily, vartype, vartype,
                                   BTLessStrategyNumber);
      if (cmpopr == InvalidOid)
          elog(ERROR, "no < operator for opfamily %u", opfamily);
      fmgr_info(get_opcode(cmpopr), &opproc);
!     greaterstrcon = make_greater_string(prefixcon, &opproc,
!                                         DEFAULT_COLLATION_OID);
      if (greaterstrcon)
      {
          Selectivity topsel;
--- 6041,6063 ----

      /*-------
       * If we can create a string larger than the prefix, say
!      * "x < greaterstr".  We try to generate the string referencing the
!      * collation of the var's statistics, but if that's not available,
!      * use DEFAULT_COLLATION_OID.
       *-------
       */
+     if (HeapTupleIsValid(vardata->statsTuple) &&
+         get_attstatsslot(&sslot, vardata->statsTuple,
+                          STATISTIC_KIND_HISTOGRAM, InvalidOid, 0))
+          /* sslot.stacoll is set up */ ;
+     else
+         sslot.stacoll = DEFAULT_COLLATION_OID;
      cmpopr = get_opfamily_member(opfamily, vartype, vartype,
                                   BTLessStrategyNumber);
      if (cmpopr == InvalidOid)
          elog(ERROR, "no < operator for opfamily %u", opfamily);
      fmgr_info(get_opcode(cmpopr), &opproc);
!     greaterstrcon = make_greater_string(prefixcon, &opproc, sslot.stacoll);
      if (greaterstrcon)
      {
          Selectivity topsel;
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index 7a263cc..33b5b16 100644
*** a/src/backend/utils/cache/lsyscache.c
--- b/src/backend/utils/cache/lsyscache.c
*************** get_attavgwidth(Oid relid, AttrNumber at
*** 2881,2886 ****
--- 2881,2887 ----
   *
   * If a matching slot is found, true is returned, and *sslot is filled thus:
   * staop: receives the actual STAOP value.
+  * stacoll: receives the actual STACOLL value.
   * valuetype: receives actual datatype of the elements of stavalues.
   * values: receives pointer to an array of the slot's stavalues.
   * nvalues: receives number of stavalues.
*************** get_attavgwidth(Oid relid, AttrNumber at
*** 2893,2898 ****
--- 2894,2903 ----
   *
   * If no matching slot is found, false is returned, and *sslot is zeroed.
   *
+  * Note that the current API doesn't allow for searching for a slot with
+  * a particular collation.  If we ever actually support recording more than
+  * one collation, we'll have to extend the API, but for now simple is good.
+  *
   * The data referred to by the fields of sslot is locally palloc'd and
   * is independent of the original pg_statistic tuple.  When the caller
   * is done with it, call free_attstatsslot to release the palloc'd data.
*************** get_attstatsslot(AttStatsSlot *sslot, He
*** 2927,2932 ****
--- 2932,2951 ----
          return false;            /* not there */

      sslot->staop = (&stats->staop1)[i];
+     sslot->stacoll = (&stats->stacoll1)[i];
+
+     /*
+      * XXX Hopefully-temporary hack: if stacoll isn't set, inject the default
+      * collation.  This won't matter for non-collation-aware datatypes.  For
+      * those that are, this covers cases where stacoll has not been set.  In
+      * the short term we need this because some code paths involving type NAME
+      * do not pass any collation to prefix_selectivity and related functions.
+      * Even when that's been fixed, it's likely that some add-on typanalyze
+      * functions won't get the word right away about filling stacoll during
+      * ANALYZE, so we'll probably need this for awhile.
+      */
+     if (sslot->stacoll == InvalidOid)
+         sslot->stacoll = DEFAULT_COLLATION_OID;

      if (flags & ATTSTATSSLOT_VALUES)
      {
diff --git a/src/include/catalog/pg_statistic.h b/src/include/catalog/pg_statistic.h
index 49223aa..2155f51 100644
*** a/src/include/catalog/pg_statistic.h
--- b/src/include/catalog/pg_statistic.h
*************** CATALOG(pg_statistic,2619,StatisticRelat
*** 74,85 ****
       * statistical data can be placed.  Each slot includes:
       *        kind            integer code identifying kind of data (see below)
       *        op                OID of associated operator, if needed
       *        numbers            float4 array (for statistical values)
       *        values            anyarray (for representations of data values)
!      * The ID and operator fields are never NULL; they are zeroes in an
!      * unused slot.  The numbers and values fields are NULL in an unused
!      * slot, and might also be NULL in a used slot if the slot kind has
!      * no need for one or the other.
       * ----------------
       */

--- 74,86 ----
       * statistical data can be placed.  Each slot includes:
       *        kind            integer code identifying kind of data (see below)
       *        op                OID of associated operator, if needed
+      *        coll            OID of relevant collation, or 0 if none
       *        numbers            float4 array (for statistical values)
       *        values            anyarray (for representations of data values)
!      * The ID, operator, and collation fields are never NULL; they are zeroes
!      * in an unused slot.  The numbers and values fields are NULL in an
!      * unused slot, and might also be NULL in a used slot if the slot kind
!      * has no need for one or the other.
       * ----------------
       */

*************** CATALOG(pg_statistic,2619,StatisticRelat
*** 95,100 ****
--- 96,107 ----
      Oid            staop4;
      Oid            staop5;

+     Oid            stacoll1;
+     Oid            stacoll2;
+     Oid            stacoll3;
+     Oid            stacoll4;
+     Oid            stacoll5;
+
  #ifdef CATALOG_VARLEN            /* variable-length fields start here */
      float4        stanumbers1[1];
      float4        stanumbers2[1];
*************** typedef FormData_pg_statistic *Form_pg_s
*** 159,165 ****

  /*
   * In a "most common values" slot, staop is the OID of the "=" operator
!  * used to decide whether values are the same or not.  stavalues contains
   * the K most common non-null values appearing in the column, and stanumbers
   * contains their frequencies (fractions of total row count).  The values
   * shall be ordered in decreasing frequency.  Note that since the arrays are
--- 166,173 ----

  /*
   * In a "most common values" slot, staop is the OID of the "=" operator
!  * used to decide whether values are the same or not, and stacoll is the
!  * collation used (same as column's collation).  stavalues contains
   * the K most common non-null values appearing in the column, and stanumbers
   * contains their frequencies (fractions of total row count).  The values
   * shall be ordered in decreasing frequency.  Note that since the arrays are
*************** typedef FormData_pg_statistic *Form_pg_s
*** 171,179 ****

  /*
   * A "histogram" slot describes the distribution of scalar data.  staop is
!  * the OID of the "<" operator that describes the sort ordering.  (In theory,
!  * more than one histogram could appear, if a datatype has more than one
!  * useful sort operator.)  stavalues contains M (>=2) non-null values that
   * divide the non-null column data values into M-1 bins of approximately equal
   * population.  The first stavalues item is the MIN and the last is the MAX.
   * stanumbers is not used and should be NULL.  IMPORTANT POINT: if an MCV
--- 179,189 ----

  /*
   * A "histogram" slot describes the distribution of scalar data.  staop is
!  * the OID of the "<" operator that describes the sort ordering, and stacoll
!  * is the relevant collation.  (In theory more than one histogram could appear,
!  * if a datatype has more than one useful sort operator or we care about more
!  * than one collation.  Currently the collation will always be that of the
!  * underlying column.)  stavalues contains M (>=2) non-null values that
   * divide the non-null column data values into M-1 bins of approximately equal
   * population.  The first stavalues item is the MIN and the last is the MAX.
   * stanumbers is not used and should be NULL.  IMPORTANT POINT: if an MCV
*************** typedef FormData_pg_statistic *Form_pg_s
*** 190,200 ****
  /*
   * A "correlation" slot describes the correlation between the physical order
   * of table tuples and the ordering of data values of this column, as seen
!  * by the "<" operator identified by staop.  (As with the histogram, more
!  * than one entry could theoretically appear.)    stavalues is not used and
!  * should be NULL.  stanumbers contains a single entry, the correlation
!  * coefficient between the sequence of data values and the sequence of
!  * their actual tuple positions.  The coefficient ranges from +1 to -1.
   */
  #define STATISTIC_KIND_CORRELATION    3

--- 200,211 ----
  /*
   * A "correlation" slot describes the correlation between the physical order
   * of table tuples and the ordering of data values of this column, as seen
!  * by the "<" operator identified by staop with the collation identified by
!  * stacoll.  (As with the histogram, more than one entry could theoretically
!  * appear.)  stavalues is not used and should be NULL.  stanumbers contains
!  * a single entry, the correlation coefficient between the sequence of data
!  * values and the sequence of their actual tuple positions.  The coefficient
!  * ranges from +1 to -1.
   */
  #define STATISTIC_KIND_CORRELATION    3

*************** typedef FormData_pg_statistic *Form_pg_s
*** 203,209 ****
   * except that it stores the most common non-null *elements* of the column
   * values.  This is useful when the column datatype is an array or some other
   * type with identifiable elements (for instance, tsvector).  staop contains
!  * the equality operator appropriate to the element type.  stavalues contains
   * the most common element values, and stanumbers their frequencies.  Unlike
   * MCV slots, frequencies are measured as the fraction of non-null rows the
   * element value appears in, not the frequency of all rows.  Also unlike
--- 214,221 ----
   * except that it stores the most common non-null *elements* of the column
   * values.  This is useful when the column datatype is an array or some other
   * type with identifiable elements (for instance, tsvector).  staop contains
!  * the equality operator appropriate to the element type, and stacoll
!  * contains the collation to use with it.  stavalues contains
   * the most common element values, and stanumbers their frequencies.  Unlike
   * MCV slots, frequencies are measured as the fraction of non-null rows the
   * element value appears in, not the frequency of all rows.  Also unlike
*************** typedef FormData_pg_statistic *Form_pg_s
*** 226,232 ****
   * A "distinct elements count histogram" slot describes the distribution of
   * the number of distinct element values present in each row of an array-type
   * column.  Only non-null rows are considered, and only non-null elements.
!  * staop contains the equality operator appropriate to the element type.
   * stavalues is not used and should be NULL.  The last member of stanumbers is
   * the average count of distinct element values over all non-null rows.  The
   * preceding M (>=2) members form a histogram that divides the population of
--- 238,245 ----
   * A "distinct elements count histogram" slot describes the distribution of
   * the number of distinct element values present in each row of an array-type
   * column.  Only non-null rows are considered, and only non-null elements.
!  * staop contains the equality operator appropriate to the element type,
!  * and stacoll contains the collation to use with it.
   * stavalues is not used and should be NULL.  The last member of stanumbers is
   * the average count of distinct element values over all non-null rows.  The
   * preceding M (>=2) members form a histogram that divides the population of
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 2f4303e..dfff23a 100644
*** a/src/include/commands/vacuum.h
--- b/src/include/commands/vacuum.h
***************
*** 52,60 ****
   * careful to allocate any pointed-to data in anl_context, which will NOT
   * be CurrentMemoryContext when compute_stats is called.
   *
!  * Note: for the moment, all comparisons done for statistical purposes
!  * should use the database's default collation (DEFAULT_COLLATION_OID).
!  * This might change in some future release.
   *----------
   */
  typedef struct VacAttrStats *VacAttrStatsP;
--- 52,62 ----
   * careful to allocate any pointed-to data in anl_context, which will NOT
   * be CurrentMemoryContext when compute_stats is called.
   *
!  * Note: all comparisons done for statistical purposes should use the
!  * underlying column's collation (attcollation), except in situations
!  * where a noncollatable container type contains a collatable type;
!  * in that case use the type's default collation.  Be sure to record
!  * the appropriate collation in stacoll.
   *----------
   */
  typedef struct VacAttrStats *VacAttrStatsP;
*************** typedef struct VacAttrStats
*** 78,88 ****
--- 80,92 ----
       * because some index opclasses store a different type than the underlying
       * column/expression.  Instead use attrtypid, attrtypmod, and attrtype for
       * information about the datatype being fed to the typanalyze function.
+      * Likewise, use attrcollid not attr->attcollation.
       */
      Form_pg_attribute attr;        /* copy of pg_attribute row for column */
      Oid            attrtypid;        /* type of data being analyzed */
      int32        attrtypmod;        /* typmod of data being analyzed */
      Form_pg_type attrtype;        /* copy of pg_type row for attrtypid */
+     Oid            attrcollid;        /* collation of data being analyzed */
      MemoryContext anl_context;    /* where to save long-lived data */

      /*
*************** typedef struct VacAttrStats
*** 103,108 ****
--- 107,113 ----
      float4        stadistinct;    /* # distinct values */
      int16        stakind[STATISTIC_NUM_SLOTS];
      Oid            staop[STATISTIC_NUM_SLOTS];
+     Oid            stacoll[STATISTIC_NUM_SLOTS];
      int            numnumbers[STATISTIC_NUM_SLOTS];
      float4       *stanumbers[STATISTIC_NUM_SLOTS];
      int            numvalues[STATISTIC_NUM_SLOTS];
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index ff1705a..6408993 100644
*** a/src/include/utils/lsyscache.h
--- b/src/include/utils/lsyscache.h
*************** typedef struct AttStatsSlot
*** 44,49 ****
--- 44,50 ----
  {
      /* Always filled: */
      Oid            staop;            /* Actual staop for the found slot */
+     Oid            stacoll;        /* Actual collation for the found slot */
      /* Filled if ATTSTATSSLOT_VALUES is specified: */
      Oid            valuetype;        /* Actual datatype of the values */
      Datum       *values;            /* slot's "values" array, or NULL if none */

Re: Upgrading pg_statistic to handle collation honestly

From
Peter Eisentraut
Date:
On 12/12/2018 16:57, Tom Lane wrote:
> Attached is a draft patch for same.  It adds storage to pg_statistic
> to record the collation of each statistics "slot".  A plausible
> alternative design would be to just say "look at the collation of the
> underlying column", but that would require extra catcache lookups in
> the selectivity functions that need the info.

That looks like a good approach to me.

> Doing it like this also
> makes it theoretically feasible to track stats computed with respect
> to different collations for the same column, though I'm not really
> convinced that we'd ever do that.

It's a good option to keep around.  Maybe someday extended statistics
could be used to ask for additional statistics to be collected.

> * Probably this conflicts to some extent with Peter's "Reorganize
> collation lookup" patch, but I haven't studied that yet.

I've looked it over, and it's nothing that can't be easily fixed up.  In
fact, it simplifies a few things, so I'm in favor of moving your patch
along first.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Upgrading pg_statistic to handle collation honestly

From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> On 12/12/2018 16:57, Tom Lane wrote:
>> * Probably this conflicts to some extent with Peter's "Reorganize
>> collation lookup" patch, but I haven't studied that yet.

> I've looked it over, and it's nothing that can't be easily fixed up.  In
> fact, it simplifies a few things, so I'm in favor of moving your patch
> along first.

Thanks for looking!  I'll take a closer look at the loose ends I mentioned
and then push it.

            regards, tom lane


Re: Upgrading pg_statistic to handle collation honestly

From
Peter Eisentraut
Date:
On 12/12/2018 16:57, Tom Lane wrote:
> diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
> index b8445dc..dcbd04c 100644
> *** a/src/backend/commands/analyze.c
> --- b/src/backend/commands/analyze.c
> *************** examine_attribute(Relation onerel, int a
> *** 904,914 ****
> --- 904,917 ----
>       {
>           stats->attrtypid = exprType(index_expr);
>           stats->attrtypmod = exprTypmod(index_expr);
> +         stats->attrcollid = exprCollation(index_expr);
> +         /* XXX should we consult indcollation instead? */

After looking through this again, I think the answer here is "yes".  If
the index definition overrides the collation, then we clearly want to
use that.  If it's not overridden, then indcollation is still set, so
it's just as easy to use it.

>       }
>       else
>       {
>           stats->attrtypid = attr->atttypid;
>           stats->attrtypmod = attr->atttypmod;
> +         stats->attrcollid = attr->attcollation;
>       }
>   
>       typtuple = SearchSysCacheCopy1(TYPEOID,


-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Upgrading pg_statistic to handle collation honestly

From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> On 12/12/2018 16:57, Tom Lane wrote:
>> +         stats->attrcollid = exprCollation(index_expr);
>> +         /* XXX should we consult indcollation instead? */

> After looking through this again, I think the answer here is "yes".

Yeah, I was leaning towards that as well, but hadn't written the extra
code needed to make it so.  Will fix.

            regards, tom lane