Thread: tsearch refactorings

tsearch refactorings

From

"Heikki Linnakangas"

Date:

03 September 2007, 11:47:29

I've been slowly reading through the tsearch code and refactoring things
as I go. Here's a patch of what I've done this far. It has grown larger
than I intended, but it has helped me a lot to understand the code.

I hope I didn't break anything; at least it passes regression tests. If
there's some changes that others don't agree with or feel uncomfortable
doing for 8.3, let me know and I can adjust the patch.


The usage of the QueryItem struct was very confusing. It was used for
both operators and operands. For operators, "val" was a single character
casted to a int4, marking the operator type. For operands, val was the
CRC-32 of the value. Other fields were used only either for operands or
for operators. The biggest change in the patch is that I broke the
QueryItem struct into QueryOperator and QueryOperand. Type was really
the only common field between them. QueryItem still exists, and is used
in the TSQuery struct as before, but it's now a union of the two. Many
other changes fell from that, like separation of pushval_asis function
into pushValue, pushOperator and pushStop.

Other changes include:
- Moved some structs that were for internal use only from header files
to the right .c-files.

- Moved tsvector parser to a new tsvector_parser.c file. Parser code was
about half of the size of tsvector.c, it's also used from tsquery.c, and
it has some data structures of its own, so it seems better to separate
it. Cleaned up the API so that TSVectorParserState is not accessed from
outside tsvector_parser.c.

- Separated enumerations (#defines, really) used for QueryItem.type
field and as return codes from gettoken_query. It was just accidental
code sharing.

- Removed ParseQueryNode struct used internally by makepol and friends.
push*-functions now construct QueryItems directly.

- Changed int4 variables to just ints for variables like "i" or "array
size", where the storage-size was not significant.

- Probably a lot of other stuff I can't remember right now

I have a few more things in mind I'm planning to look into:

- I'm not convinced that there's enough sanity checking in the input
functions. I think you can send queries into the server using the binary
protocol that you couldn't produce otherwise. For example, queries with
multiple VAL nodes, with no operator to connect them. Does that wreak
havoc in any of the functions? And the left-field of QueryOperator is an
int16, what happens if you have query that exceeds
that? And parse_query always produces trees that are in prefix notation,
so the left-field is really redundant, but using tsqueryrecv, you could
inject queries that are not in prefix notation; is there anything in the
code that depends on that?

- There's many internal intermediate representations of a query:
TSQuery, a QTNode-tree, NODE-tree (in tsquery_cleanup.c), prefix
notation stack of QueryItems (in parser), infix-tree. Could we remove
some of these?

- There's a lot of recursive functions. Are they all using
check_stack_depth?

- More commenting and documenting and little refactorings

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Attachment

tsearch-refactor-2.patch.gz

Re: tsearch refactorings

From

"Heikki Linnakangas"

Date:

05 September 2007, 12:34:14

I wrote:
> I have a few more things in mind I'm planning to look into:
>
> - I'm not convinced that there's enough sanity checking in the input
> functions. I think you can send queries into the server using the binary
> protocol that you couldn't produce otherwise. For example, queries with
> multiple VAL nodes, with no operator to connect them. Does that wreak
> havoc in any of the functions?

I added code to check that the query tree is well-formed. It was indeed
possible to send malformed queries in binary mode, which produced all
kinds of strange results.

> And the left-field of QueryOperator is an
> int16, what happens if you have query that exceeds
> that?

Apparently the field wraps around and you get a backend crash when it
tries to do access an array using a negative index. You need to have a
large stack (and increase max_stack_depth now that we have the check for
that in there) to reproduce it. Not sure if it's an exploitable security
vulnerability, but it might be.

I fixed this by making the left-field a uint32. There's no reason to
arbitrarily limit it to 16-bits, and it won't increase the disk/memory
footprint either now that QueryOperator and QueryOperand are separate
structs.

> And parse_query always produces trees that are in prefix notation,
> so the left-field is really redundant, but using tsqueryrecv, you could
> inject queries that are not in prefix notation; is there anything in the
> code that depends on that?

At least infix-function seems to depend on it. By checking that the tree
is well-formed, and the fact that the left operand is implicitly the
next node in the array, we know that it is in prefix notation.

> - There's many internal intermediate representations of a query:
> TSQuery, a QTNode-tree, NODE-tree (in tsquery_cleanup.c), prefix
> notation stack of QueryItems (in parser), infix-tree. Could we remove
> some of these?

I haven't looked into this yet. Might leave it for 8.4.

> - There's a lot of recursive functions. Are they all using
> check_stack_depth?

I added check_stack_depth() call to all recursive functions I found.
Some of them might have a natural limit so that you can't force
arbitrarily deep recursions, but check_stack_depth() is cheap enough
that seems best to just stick it into anything that might be a problem.

Patch attached. It's on top of the tsearch-refactor-2.patch I posted
earlier.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
*** ../pgsql.tsearch-1/src/backend/utils/adt/tsquery.c    2007-09-03 11:12:17.000000000 +0100
--- ./src/backend/utils/adt/tsquery.c    2007-09-05 11:59:09.000000000 +0100
***************
*** 21,26 ****
--- 21,27 ----
  #include "tsearch/ts_utils.h"
  #include "utils/memutils.h"
  #include "utils/pg_crc.h"
+ #include "nodes/bitmapset.h"


  struct TSQueryParserStateData
***************
*** 388,394 ****
   * QueryItems must be in polish (prefix) notation.
   */
  static void
! findoprnd(QueryItem *ptr, int *pos)
  {
      /* since this function recurses, it could be driven to stack overflow. */
      check_stack_depth();
--- 389,395 ----
   * QueryItems must be in polish (prefix) notation.
   */
  static void
! findoprnd(QueryItem *ptr, uint32 *pos)
  {
      /* since this function recurses, it could be driven to stack overflow. */
      check_stack_depth();
***************
*** 451,457 ****
      TSQuery        query;
      int            commonlen;
      QueryItem  *ptr;
!     int            pos = 0;
      ListCell   *cell;

      /* init state */
--- 452,458 ----
      TSQuery        query;
      int            commonlen;
      QueryItem  *ptr;
!     uint32        pos = 0;
      ListCell   *cell;

      /* init state */
***************
*** 792,797 ****
--- 793,799 ----
      QueryItem  *item;
      int            datalen = 0;
      char       *ptr;
+     Bitmapset  *parentset = NULL;

      size = pq_getmsgint(buf, sizeof(uint32));
      if (size < 0 || size > (MaxAllocSize / sizeof(QueryItem)))
***************
*** 814,819 ****
--- 816,830 ----
                  item->operand.valcrc = (int32) pq_getmsgint(buf, sizeof(int32));
                  item->operand.length = pq_getmsgint(buf, sizeof(int16));

+                 /* Check that the weight bitmap is valid */
+                 if (item->operand.weight < 0 || item->operand.weight > 0xF)
+                     elog(ERROR, "invalid weight bitmap");
+
+                 /* XXX: We don't check that the CRC is valid. Actually, if we
+                  * bothered to calculate it to verify, there would be no need
+                  * to transfer it.
+                  */
+
                  /*
                   * Check that datalen doesn't grow too large. Without the
                   * check, a malicious client could induce a buffer overflow
***************
*** 828,834 ****
                      elog(ERROR, "invalid tsquery; total operand length exceeded");

                  /* We can calculate distance from datalen, no need to send it
!                  * through the wire. If we did, we would have to check that
                   * it's valid anyway.
                   */
                  item->operand.distance = datalen;
--- 839,845 ----
                      elog(ERROR, "invalid tsquery; total operand length exceeded");

                  /* We can calculate distance from datalen, no need to send it
!                  * across the wire. If we did, we would have to check that
                   * it's valid anyway.
                   */
                  item->operand.distance = datalen;
***************
*** 842,864 ****
                      item->operator.oper != OP_OR &&
                      item->operator.oper != OP_AND)
                      elog(ERROR, "unknown operator type %d", (int) item->operator.oper);
                  if(item->operator.oper != OP_NOT)
                  {
!                     item->operator.left = (int16) pq_getmsgint(buf, sizeof(int16));
                      /*
!                      * Sanity checks
                       */
!                     if (item->operator.left <= 0 || i + item->operator.left >= size)
                          elog(ERROR, "invalid pointer to left operand");

!                     /* XXX: Though there's no way to construct a TSQuery that's
!                      * not in polish notation, we don't enforce that for
!                      * queries received from client in binary mode. Is there
!                      * anything that relies on it?
!                      *
!                      * XXX: The tree could be malformed in other ways too,
!                      * a node could have two parents, for example.
                       */
                  }

                  if (i == size - 1)
--- 853,890 ----
                      item->operator.oper != OP_OR &&
                      item->operator.oper != OP_AND)
                      elog(ERROR, "unknown operator type %d", (int) item->operator.oper);
+
+                 /*
+                  * Check that no previous operator node points to the right
+                  * operand. That would mean that the operand node
+                  * has two parents.
+                  */
+                 if (bms_is_member(i + 1, parentset))
+                     elog(ERROR, "malformed query tree");
+
+                 parentset = bms_add_member(parentset, i + 1);
+
                  if(item->operator.oper != OP_NOT)
                  {
!                     uint32 left = (uint32) pq_getmsgint(buf, sizeof(uint32));
!
                      /*
!                      * Right operand is implicitly at "this+1". Don't allow
!                      * left to point to the right operand, or to self.
                       */
!                     if (left <= 1 || i + left >= size)
                          elog(ERROR, "invalid pointer to left operand");

!                     /*
!                      * Check that no previous operator node points to the left
!                      * operand.
                       */
+                     if (bms_is_member(i + left, parentset))
+                         elog(ERROR, "malformed query tree");
+
+                     parentset = bms_add_member(parentset, i + left);
+
+                     item->operator.left = left;
                  }

                  if (i == size - 1)
***************
*** 871,876 ****
--- 897,907 ----
          item++;
      }

+     /* Now check that each node, except the root, has a parent. We
+      * already checked above that no node has more than one parent. */
+     if (bms_num_members(parentset) != size - 1 && size != 0)
+         elog(ERROR, "malformed query tree");
+
      query = (TSQuery) repalloc(query, len + datalen);

      item = GETQUERY(query);
diff -c -r ../pgsql.tsearch-1/src/backend/utils/adt/tsquery_cleanup.c ./src/backend/utils/adt/tsquery_cleanup.c
*** ../pgsql.tsearch-1/src/backend/utils/adt/tsquery_cleanup.c    2007-08-31 19:34:58.000000000 +0100
--- ./src/backend/utils/adt/tsquery_cleanup.c    2007-09-05 12:14:43.000000000 +0100
***************
*** 57,62 ****
--- 57,65 ----
  static void
  plainnode(PLAINTREE * state, NODE * node)
  {
+     /* since this function recurses, it could be driven to stack overflow. */
+     check_stack_depth();
+
      if (state->cur == state->len)
      {
          state->len *= 2;
***************
*** 107,112 ****
--- 110,118 ----
  static void
  freetree(NODE * node)
  {
+     /* since this function recurses, it could be driven to stack overflow. */
+     check_stack_depth();
+
      if (!node)
          return;
      if (node->left)
***************
*** 125,130 ****
--- 131,139 ----
  static NODE *
  clean_NOT_intree(NODE * node)
  {
+     /* since this function recurses, it could be driven to stack overflow. */
+     check_stack_depth();
+
      if (node->valnode->type == QI_VAL)
          return node;

***************
*** 203,208 ****
--- 212,220 ----
      char        lresult = V_UNKNOWN,
                  rresult = V_UNKNOWN;

+     /* since this function recurses, it could be driven to stack overflow. */
+     check_stack_depth();
+
      if (node->valnode->type == QI_VAL)
          return node;
      else
*** ../pgsql.tsearch-1/src/backend/utils/adt/tsquery_rewrite.c    2007-08-31 22:08:41.000000000 +0100
--- ./src/backend/utils/adt/tsquery_rewrite.c    2007-09-05 12:18:46.000000000 +0100
***************
*** 22,27 ****
--- 22,30 ----
  static int
  addone(int *counters, int last, int total)
  {
+     /* since this function recurses, it could be driven to stack overflow. */
+     check_stack_depth();
+
      counters[last]++;
      if (counters[last] >= total)
      {
***************
*** 173,178 ****
--- 176,184 ----
  static QTNode *
  dofindsubquery(QTNode *root, QTNode *ex, QTNode *subs, bool *isfind)
  {
+     /* since this function recurses, it could be driven to stack overflow. */
+     check_stack_depth();
+
      root = findeq(root, ex, subs, isfind);

      if (root && (root->flags & QTN_NOCHANGE) == 0 && root->valnode->type == QI_OPR)
*** ../pgsql.tsearch-1/src/backend/utils/adt/tsquery_util.c    2007-08-31 22:15:56.000000000 +0100
--- ./src/backend/utils/adt/tsquery_util.c    2007-09-05 12:21:29.000000000 +0100
***************
*** 22,27 ****
--- 22,30 ----
  {
      QTNode       *node = (QTNode *) palloc0(sizeof(QTNode));

+     /* since this function recurses, it could be driven to stack overflow. */
+     check_stack_depth();
+
      node->valnode = in;

      if (in->type == QI_OPR)
***************
*** 53,58 ****
--- 56,64 ----
      if (!in)
          return;

+     /* since this function recurses, it could be driven to stack overflow. */
+     check_stack_depth();
+
      if (in->valnode->type == QI_VAL && in->word && (in->flags & QTN_WORDFREE) != 0)
          pfree(in->word);

***************
*** 79,84 ****
--- 85,93 ----
  int
  QTNodeCompare(QTNode * an, QTNode * bn)
  {
+     /* since this function recurses, it could be driven to stack overflow. */
+     check_stack_depth();
+
      if (an->valnode->type != bn->valnode->type)
          return (an->valnode->type > bn->valnode->type) ? -1 : 1;

***************
*** 133,138 ****
--- 142,150 ----
  {
      int            i;

+     /* since this function recurses, it could be driven to stack overflow. */
+     check_stack_depth();
+
      if (in->valnode->type != QI_OPR)
          return;

***************
*** 165,170 ****
--- 177,185 ----
  {
      int            i;

+     /* since this function recurses, it could be driven to stack overflow. */
+     check_stack_depth();
+
      if (in->valnode->type != QI_OPR)
          return;

***************
*** 205,210 ****
--- 220,228 ----
  {
      int            i;

+     /* since this function recurses, it could be driven to stack overflow. */
+     check_stack_depth();
+
      if (in->valnode->type != QI_OPR)
          return;

***************
*** 244,249 ****
--- 262,270 ----
  static void
  cntsize(QTNode * in, int *sumlen, int *nnode)
  {
+     /* since this function recurses, it could be driven to stack overflow. */
+     check_stack_depth();
+
      *nnode += 1;
      if (in->valnode->type == QI_OPR)
      {
***************
*** 268,273 ****
--- 289,297 ----
  static void
  fillQT(QTN2QTState *state, QTNode *in)
  {
+     /* since this function recurses, it could be driven to stack overflow. */
+     check_stack_depth();
+
      if (in->valnode->type == QI_VAL)
      {
          memcpy(state->curitem, in->valnode, sizeof(QueryOperand));
***************
*** 325,331 ****
  QTNode *
  QTNCopy(QTNode *in)
  {
!     QTNode       *out = (QTNode *) palloc(sizeof(QTNode));

      *out = *in;
      out->valnode = (QueryItem *) palloc(sizeof(QueryItem));
--- 349,360 ----
  QTNode *
  QTNCopy(QTNode *in)
  {
!     QTNode       *out;
!
!     /* since this function recurses, it could be driven to stack overflow. */
!     check_stack_depth();
!
!     out = (QTNode *) palloc(sizeof(QTNode));

      *out = *in;
      out->valnode = (QueryItem *) palloc(sizeof(QueryItem));
*** ../pgsql.tsearch-1/src/backend/utils/adt/tsrank.c    2007-08-31 22:24:34.000000000 +0100
--- ./src/backend/utils/adt/tsrank.c    2007-09-05 12:24:27.000000000 +0100
***************
*** 508,513 ****
--- 508,517 ----
      int            i;
      bool        found = false;

+     /* since this function recurses, it could be driven to stack overflow.
+      * (though any decent compiler will optimize away the tail-recursion.   */
+     check_stack_depth();
+
      reset_istrue_flag(query);

      ext->p = 0x7fffffff;
*** ../pgsql.tsearch-1/src/include/tsearch/ts_type.h    2007-09-03 11:10:37.000000000 +0100
--- ./src/include/tsearch/ts_type.h    2007-09-05 12:17:02.000000000 +0100
***************
*** 159,165 ****
  typedef struct
  {
      QueryItemType        type;    /* operand or kind of operator (ts_tokentype) */
!     int8        weight;            /* weights of operand to search. Is this a bitmask? checkclass_str seems to think
so*/ 
      int32    valcrc;                /* XXX: pg_crc32 would be a more appropriate data type, but we use comparisons to
signedintegers in the code. They would need to be changed as well. */ 

      /* pointer to text value of operand, must correlate with WordEntry */
--- 159,172 ----
  typedef struct
  {
      QueryItemType        type;    /* operand or kind of operator (ts_tokentype) */
!
!     /* Weights of operand to search. This is a bitmap:
!      * A: 1<<3
!      * B: 1<<2
!      * C: 1<<1
!      * D: 1<<0
!      */
!     int8        weight;
      int32    valcrc;                /* XXX: pg_crc32 would be a more appropriate data type, but we use comparisons to
signedintegers in the code. They would need to be changed as well. */ 

      /* pointer to text value of operand, must correlate with WordEntry */
***************
*** 179,185 ****
  {
      QueryItemType    type;
      int8        oper;        /* see above */
!     int16        left;        /* pointer to left operand. Right operand is
                               * item + 1, left operand is placed
                               * item+item->left */
  } QueryOperator;
--- 186,192 ----
  {
      QueryItemType    type;
      int8        oper;        /* see above */
!     uint32        left;        /* pointer to left operand. Right operand is
                               * item + 1, left operand is placed
                               * item+item->left */
  } QueryOperator;

Re: tsearch refactorings

From

Teodor Sigaev

Date:

05 September 2007, 15:55:54

Heikki, I see some strange changes in your patch, not related to tsearch at all:
contrib/pageinspect/pageinspect.sql.in
contrib/pageinspect/rawpage.c

> The usage of the QueryItem struct was very confusing. It was used for
> both operators and operands. For operators, "val" was a single character
> casted to a int4, marking the operator type. For operands, val was the
> CRC-32 of the value. Other fields were used only either for operands or
> for operators. The biggest change in the patch is that I broke the
> QueryItem struct into QueryOperator and QueryOperand. Type was really
...
 > - Removed ParseQueryNode struct used internally by makepol and friends.
 > push*-functions now construct QueryItems directly.

It's needed to set unused bytes in QueryItem to zero, it's common requiremens
for types in pgsql. After allocating space for tsquery in parse_tsquery you copy
  just sizeof(QueryOperator) bytes and leave sizeof(QueryItem) -
sizeof(QueryOperator) bytes untouched. QueryOperand is a biggest component in
QueryItem union. I don't check other places.



> that? And parse_query always produces trees that are in prefix notation,
> so the left-field is really redundant, but using tsqueryrecv, you could
> inject queries that are not in prefix notation; is there anything in the
> code that depends on that?
It's used by TS_execute for optimization reason. With clear postfix notation you
should go through every nodes. For example:
FALSE FALSE & FALSE &
You will go to the end of query to produce correct result.
In fact, TSQuery is a prefix notation with pointer to another operand or, by
another words, just a plain view of tree where right operand of operation is
always placed after operation.
That notation allows to calculate only one of operand if it possible:
& FALSE & FALSE FALSE
1   2   3   4     5      --Nodes
After evaluating of second node you can return FALSE for whole expression and do
not evaluate nodes 3-5. For query
& TRUE & FALSE & FALSE
it's needed to evaluate 1,2,3,4 nodes. In most cases checking  QI_VAL node is
much more expensive that QI_OPR



>
> - There's many internal intermediate representations of a query:
> TSQuery, a QTNode-tree, NODE-tree (in tsquery_cleanup.c), prefix
> notation stack of QueryItems (in parser), infix-tree. Could we remove
> some of these?
I havn't strong objections, QTNode and NODE are tree-like structures, but
TSQuery is a postfix notation for storage in plain memory. NODE is used only
cleanup stop-word placeholders, so it's a binary tree while  QTNode represents
t-ary tree (with any number of children).

Thank you for your interesting in tsearch - after recheck of problem pointed
above I'll commit your patch.
--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: tsearch refactorings

From

Teodor Sigaev

Date:

05 September 2007, 15:56:57

> I fixed this by making the left-field a uint32. There's no reason to
> arbitrarily limit it to 16-bits, and it won't increase the disk/memory
> footprint either now that QueryOperator and QueryOperand are separate
> structs.
...
> I added check_stack_depth() call to all recursive functions I found.
> Some of them might have a natural limit so that you can't force
> arbitrarily deep recursions, but check_stack_depth() is cheap enough
> that seems best to just stick it into anything that might be a problem.
Agreed.

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: tsearch refactorings

From

"Heikki Linnakangas"

Date:

05 September 2007, 16:31:25

Teodor Sigaev wrote:
> Heikki, I see some strange changes in your patch, not related to tsearch
> at all:
> contrib/pageinspect/pageinspect.sql.in
> contrib/pageinspect/rawpage.c

Oh, sorry about that. Just ignore them, they are some quick&dirty
debugging functions I had in the same source tree.

>> The usage of the QueryItem struct was very confusing. It was used for
>> both operators and operands. For operators, "val" was a single character
>> casted to a int4, marking the operator type. For operands, val was the
>> CRC-32 of the value. Other fields were used only either for operands or
>> for operators. The biggest change in the patch is that I broke the
>> QueryItem struct into QueryOperator and QueryOperand. Type was really
> ...
>> - Removed ParseQueryNode struct used internally by makepol and friends.
>> push*-functions now construct QueryItems directly.
>
> It's needed to set unused bytes in QueryItem to zero, it's common
> requiremens for types in pgsql. After allocating space for tsquery in
> parse_tsquery you copy  just sizeof(QueryOperator) bytes and leave
> sizeof(QueryItem) - sizeof(QueryOperator) bytes untouched. QueryOperand
> is a biggest component in QueryItem union. I don't check other places.

Ok. Probably easiest to do that by changing the palloc to palloc0 in
parse_tsquery.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: tsearch refactorings

From

Teodor Sigaev

Date:

05 September 2007, 16:41:17

> Ok. Probably easiest to do that by changing the palloc to palloc0 in
> parse_tsquery.

and change sizeof to sizeof(QueryItem)

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: tsearch refactorings

From

"Heikki Linnakangas"

Date:

05 September 2007, 16:56:00

Teodor Sigaev wrote:
>> Ok. Probably easiest to do that by changing the palloc to palloc0 in
>> parse_tsquery.
>
> and change sizeof to sizeof(QueryItem)

Do you mean the sizeofs in the memcpys in parse_tsquery? You can't
change them. The source structs are allocated in
pushOperand/pushOperator, using sizeof(QueryOperand/QueryOperator), so
if you copy sizeof(QueryItem) bytes from them, you'll copy some random
piece of memory. If you just allocate the TSQuery with palloc0, that'll
make sure that any memory area not copied into in the loop will be zero.

BTW, can you explain what the CRC-32 of a value is used for? It looks
like it's used to speed up some operations, by comparing the CRCs before
comparing the values, but I didn't quite figure out how it works. How
much of a performance difference does it make? Would hash_any do a
better/cheaper job?

In any case, I think we need to calculate the CRC/hash in tsqueryrecv,
instead of trusting the client.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: tsearch refactorings

From

Teodor Sigaev

Date:

05 September 2007, 17:13:57


Heikki Linnakangas wrote:
> Teodor Sigaev wrote:
>>> Ok. Probably easiest to do that by changing the palloc to palloc0 in
>>> parse_tsquery.
>> and change sizeof to sizeof(QueryItem)
>
> Do you mean the sizeofs in the memcpys in parse_tsquery? You can't

Oops, I meant pallocs in push* function. palloc0 in parse_tsquery is another way.

>
> BTW, can you explain what the CRC-32 of a value is used for? It looks
> like it's used to speed up some operations, by comparing the CRCs before
> comparing the values, but I didn't quite figure out how it works. How
It's mostly used in GiST indexes - recalculating crc32 every time for each index
tuple to be checked is rather expensive.

> much of a performance difference does it make? Would hash_any do a
> better/cheaper job?
crc32 was chosen  after testing a lot of hash function. Perl's hash was the
fastest, but crc32 makes much less number of collisions. That's interesting  for
ASCII a lot of functions produce rather small number of collision, but for upper
part of table (0x7f-0xff) crc32 was the best. CRC32 has  evenly distributed
collisions over characters, others - not.


> In any case, I think we need to calculate the CRC/hash in tsqueryrecv,
> instead of trusting the client.
Agreed.
--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: tsearch refactorings

From

"Heikki Linnakangas"

Date:

06 September 2007, 16:47:20

Teodor Sigaev wrote:
> Heikki Linnakangas wrote:
>> In any case, I think we need to calculate the CRC/hash in tsqueryrecv,
>> instead of trusting the client.
> Agreed.

I started to write a patch for that, as I realized that we're
transferring the strings in a tsvector/tsquery in server encoding.
That's not good, right? A dump created with COPY ... BINARY wouldn't be
portable across clusters with different encodings, for example.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: tsearch refactorings

From

Tom Lane

Date:

06 September 2007, 18:59:19

"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> I started to write a patch for that, as I realized that we're
> transferring the strings in a tsvector/tsquery in server encoding.
> That's not good, right? A dump created with COPY ... BINARY wouldn't be
> portable across clusters with different encodings, for example.

Any portion of a binary value that is considered textual should be
converted to and from client encoding --- cf textsend/textrecv.
This should be pretty trivial to fix, just call a different support
routine.

BTW, Teodor, are you intending to review/apply Heikki's tsearch fixes,
or do you want someone else to do it?

            regards, tom lane

Re: tsearch refactorings

From

"Heikki Linnakangas"

Date:

07 September 2007, 09:54:33

Tom Lane wrote:
> Any portion of a binary value that is considered textual should be
> converted to and from client encoding --- cf textsend/textrecv.
> This should be pretty trivial to fix, just call a different support
> routine.

You do need to adjust length and position fields in the structs as well.
I fixed (rewrote, almost) the send/recv functions, and added a comment
above them describing the on-wire format. The CRC is now recalculated in
tsquery as well per previous discussion.

Patch attached. This is on top of the previous patches I sent. It
includes some additional changes that I had already started with. Most
notably:

- change the alignment requirement of lexemes in TSVector slightly.
Lexeme strings were always padded to 2-byte aligned length to make sure
that if there's position array (uint16[]) it has the right alignment.
The patch changes that so that the padding is not done when there's no
positions. That makes the storage of tsvectors without positions
slightly more compact.

- added some #include "miscadmin.h" lines I missed in the earlier when I
added calls to check_stack_depth().


BTW, the encoding of the XML datatype looks pretty funky. xml_recv first
reads the xml string with pq_getmsgtext, which applies a client->server
conversion. Then the xml declaration is parsed, extracting the encoding
attribute. Then the string is converted again from that encoding (or
UTF-8 if none was specified) to server encoding. I don't understand how
it's supposed to work, but ISTM there's one conversion too much,

> BTW, Teodor, are you intending to review/apply Heikki's tsearch fixes,
> or do you want someone else to do it?

I am getting confused with the patches and version I have lying around
here... I think I'll have to wait for review of the patches I've posted
this far before I continue hacking.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
*** ../pgsql.tsearch-2/src/backend/utils/adt/tsginidx.c    2007-09-06 11:19:57.000000000 +0100
--- ./src/backend/utils/adt/tsginidx.c    2007-09-07 09:20:27.000000000 +0100
***************
*** 22,28 ****
  gin_extract_tsvector(PG_FUNCTION_ARGS)
  {
      TSVector    vector = PG_GETARG_TSVECTOR(0);
!     uint32       *nentries = (uint32 *) PG_GETARG_POINTER(1);
      Datum       *entries = NULL;

      *nentries = vector->size;
--- 22,28 ----
  gin_extract_tsvector(PG_FUNCTION_ARGS)
  {
      TSVector    vector = PG_GETARG_TSVECTOR(0);
!     int32       *nentries = (int32 *) PG_GETARG_POINTER(1);
      Datum       *entries = NULL;

      *nentries = vector->size;
***************
*** 54,60 ****
  gin_extract_query(PG_FUNCTION_ARGS)
  {
      TSQuery        query = PG_GETARG_TSQUERY(0);
!     uint32       *nentries = (uint32 *) PG_GETARG_POINTER(1);
      StrategyNumber strategy = PG_GETARG_UINT16(2);
      Datum       *entries = NULL;

--- 54,60 ----
  gin_extract_query(PG_FUNCTION_ARGS)
  {
      TSQuery        query = PG_GETARG_TSQUERY(0);
!     int32       *nentries = (int32 *) PG_GETARG_POINTER(1);
      StrategyNumber strategy = PG_GETARG_UINT16(2);
      Datum       *entries = NULL;

*** ../pgsql.tsearch-2/src/backend/utils/adt/tsquery.c    2007-09-05 11:59:09.000000000 +0100
--- ./src/backend/utils/adt/tsquery.c    2007-09-07 09:35:18.000000000 +0100
***************
*** 21,27 ****
  #include "tsearch/ts_utils.h"
  #include "utils/memutils.h"
  #include "utils/pg_crc.h"
- #include "nodes/bitmapset.h"


  struct TSQueryParserStateData
--- 21,26 ----
***************
*** 384,399 ****
      }
  }

- /*
-  * Fills in the left-fields previously left unfilled. The input
-  * QueryItems must be in polish (prefix) notation.
-  */
  static void
! findoprnd(QueryItem *ptr, uint32 *pos)
  {
      /* since this function recurses, it could be driven to stack overflow. */
      check_stack_depth();

      if (ptr[*pos].type == QI_VAL ||
          ptr[*pos].type == QI_VALSTOP) /* need to handle VALSTOP here,
                                         * they haven't been cleansed
--- 383,397 ----
      }
  }

  static void
! findoprnd_recurse(QueryItem *ptr, uint32 *pos, int nnodes)
  {
      /* since this function recurses, it could be driven to stack overflow. */
      check_stack_depth();

+     if (*pos >= nnodes)
+         elog(ERROR, "malformed tsquery; operand not found");
+
      if (ptr[*pos].type == QI_VAL ||
          ptr[*pos].type == QI_VALSTOP) /* need to handle VALSTOP here,
                                         * they haven't been cleansed
***************
*** 410,416 ****
          {
              ptr[*pos].operator.left = 1;
              (*pos)++;
!             findoprnd(ptr, pos);
          }
          else
          {
--- 408,414 ----
          {
              ptr[*pos].operator.left = 1;
              (*pos)++;
!             findoprnd_recurse(ptr, pos, nnodes);
          }
          else
          {
***************
*** 420,432 ****
              Assert(curitem->oper == OP_AND || curitem->oper == OP_OR);

              (*pos)++;
!             findoprnd(ptr, pos);
              curitem->left = *pos - tmp;
!             findoprnd(ptr, pos);
          }
      }
  }

  /*
   * Each value (operand) in the query is be passed to pushval. pushval can
   * transform the simple value to an arbitrarily complex expression using
--- 418,448 ----
              Assert(curitem->oper == OP_AND || curitem->oper == OP_OR);

              (*pos)++;
!             findoprnd_recurse(ptr, pos, nnodes);
              curitem->left = *pos - tmp;
!             findoprnd_recurse(ptr, pos, nnodes);
          }
      }
  }

+
+ /*
+  * Fills in the left-fields previously left unfilled. The input
+  * QueryItems must be in polish (prefix) notation.
+  */
+ static void
+ findoprnd(QueryItem *ptr, int size)
+ {
+     uint32 pos;
+
+     pos = 0;
+     findoprnd_recurse(ptr, &pos, size);
+
+     if (pos != size)
+         elog(ERROR, "malformed tsquery; extra nodes");
+ }
+
+
  /*
   * Each value (operand) in the query is be passed to pushval. pushval can
   * transform the simple value to an arbitrarily complex expression using
***************
*** 452,458 ****
      TSQuery        query;
      int            commonlen;
      QueryItem  *ptr;
-     uint32        pos = 0;
      ListCell   *cell;

      /* init state */
--- 468,473 ----
***************
*** 522,529 ****
      pfree(state.op);

      /* Set left operand pointers for every operator. */
!     pos = 0;
!     findoprnd(ptr, &pos);

      return query;
  }
--- 537,543 ----
      pfree(state.op);

      /* Set left operand pointers for every operator. */
!     findoprnd(ptr, query->size);

      return query;
  }
***************
*** 734,739 ****
--- 748,769 ----
      PG_RETURN_CSTRING(nrm.buf);
  }

+ /*
+  * Binary Input / Output functions. The binary format is as follows:
+  *
+  * uint32     number of operators/operands in the query
+  *
+  * Followed by the operators and operands, in prefix notation. For each
+  * operand:
+  *
+  * uint8    type, QI_VAL
+  * uint8    weight
+  *             operand text in client encoding, null-terminated
+  *
+  * For each operator:
+  * uint8    type, QI_OPR
+  * uint8    operator, one of OP_AND, OP_OR, OP_NOT.
+  */
  Datum
  tsquerysend(PG_FUNCTION_ARGS)
  {
***************
*** 744,750 ****

      pq_begintypsend(&buf);

!     pq_sendint(&buf, query->size, sizeof(int32));
      for (i = 0; i < query->size; i++)
      {
          pq_sendint(&buf, item->type, sizeof(item->type));
--- 774,780 ----

      pq_begintypsend(&buf);

!     pq_sendint(&buf, query->size, sizeof(uint32));
      for (i = 0; i < query->size; i++)
      {
          pq_sendint(&buf, item->type, sizeof(item->type));
***************
*** 752,767 ****
          switch(item->type)
          {
              case QI_VAL:
!                 pq_sendint(&buf, item->operand.weight, sizeof(item->operand.weight));
!                 pq_sendint(&buf, item->operand.valcrc, sizeof(item->operand.valcrc));
!                 pq_sendint(&buf, item->operand.length, sizeof(int16));
                  /* istrue flag is just for temporary use in tsrank.c/Cover,
                   * so we don't need to transfer that */
                  break;
              case QI_OPR:
                  pq_sendint(&buf, item->operator.oper, sizeof(item->operator.oper));
-                 if (item->operator.oper != OP_NOT)
-                     pq_sendint(&buf, item->operator.left, sizeof(item->operator.left));
                  break;
              default:
                  elog(ERROR, "unknown tsquery node type %d", item->type);
--- 782,794 ----
          switch(item->type)
          {
              case QI_VAL:
!                 pq_sendint(&buf, item->operand.weight, sizeof(uint8));
!                 pq_sendstring(&buf, GETOPERAND(query) + item->operand.distance);
                  /* istrue flag is just for temporary use in tsrank.c/Cover,
                   * so we don't need to transfer that */
                  break;
              case QI_OPR:
                  pq_sendint(&buf, item->operator.oper, sizeof(item->operator.oper));
                  break;
              default:
                  elog(ERROR, "unknown tsquery node type %d", item->type);
***************
*** 769,782 ****
          item++;
      }

-     item = GETQUERY(query);
-     for (i = 0; i < query->size; i++)
-     {
-         if (item->type == QI_VAL)
-             pq_sendbytes(&buf, GETOPERAND(query) + item->operand.distance, item->operand.length);
-         item++;
-     }
-
      PG_FREE_IF_COPY(query, 0);

      PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
--- 796,801 ----
***************
*** 788,924 ****
      StringInfo    buf = (StringInfo) PG_GETARG_POINTER(0);
      TSQuery        query;
      int            i,
-                 size,
                  len;
      QueryItem  *item;
!     int            datalen = 0;
      char       *ptr;
!     Bitmapset  *parentset = NULL;

      size = pq_getmsgint(buf, sizeof(uint32));
!     if (size < 0 || size > (MaxAllocSize / sizeof(QueryItem)))
          elog(ERROR, "invalid size of tsquery");

!     len = HDRSIZETQ + sizeof(QueryItem) * size;

!     query = (TSQuery) palloc(len);
      query->size = size;
      item = GETQUERY(query);

      for (i = 0; i < size; i++)
      {
          item->type = (int8) pq_getmsgint(buf, sizeof(int8));

!         switch(item->type)
          {
!             case QI_VAL:
!                 item->operand.weight = (int8) pq_getmsgint(buf, sizeof(int8));
!                 item->operand.valcrc = (int32) pq_getmsgint(buf, sizeof(int32));
!                 item->operand.length = pq_getmsgint(buf, sizeof(int16));
!
!                 /* Check that the weight bitmap is valid */
!                 if (item->operand.weight < 0 || item->operand.weight > 0xF)
!                     elog(ERROR, "invalid weight bitmap");
!
!                 /* XXX: We don't check that the CRC is valid. Actually, if we
!                  * bothered to calculate it to verify, there would be no need
!                  * to transfer it.
!                  */
!
!                 /*
!                  * Check that datalen doesn't grow too large. Without the
!                  * check, a malicious client could induce a buffer overflow
!                  * by sending a tsquery whose size exceeds 2GB. datalen
!                  * would overflow, we would allocate a too small buffer below,
!                  * and overflow the buffer. Because operand.length is a 20-bit
!                  * field, adding one such value to datalen must exceed
!                  * MaxAllocSize before wrapping over the 32-bit datalen field,
!                  * so this check will protect from it.
!                  */
!                 if (datalen > MAXSTRLEN)
!                     elog(ERROR, "invalid tsquery; total operand length exceeded");
!
!                 /* We can calculate distance from datalen, no need to send it
!                  * across the wire. If we did, we would have to check that
!                  * it's valid anyway.
!                  */
!                 item->operand.distance = datalen;

!                 datalen += item->operand.length + 1;        /* \0 */
!
!                 break;
!             case QI_OPR:
!                 item->operator.oper = (int8) pq_getmsgint(buf, sizeof(int8));
!                 if (item->operator.oper != OP_NOT &&
!                     item->operator.oper != OP_OR &&
!                     item->operator.oper != OP_AND)
!                     elog(ERROR, "unknown operator type %d", (int) item->operator.oper);
!
!                 /*
!                  * Check that no previous operator node points to the right
!                  * operand. That would mean that the operand node
!                  * has two parents.
!                  */
!                 if (bms_is_member(i + 1, parentset))
!                     elog(ERROR, "malformed query tree");
!
!                 parentset = bms_add_member(parentset, i + 1);
!
!                 if(item->operator.oper != OP_NOT)
!                 {
!                     uint32 left = (uint32) pq_getmsgint(buf, sizeof(uint32));
!
!                     /*
!                      * Right operand is implicitly at "this+1". Don't allow
!                      * left to point to the right operand, or to self.
!                      */
!                     if (left <= 1 || i + left >= size)
!                         elog(ERROR, "invalid pointer to left operand");
!
!                     /*
!                      * Check that no previous operator node points to the left
!                      * operand.
!                      */
!                     if (bms_is_member(i + left, parentset))
!                         elog(ERROR, "malformed query tree");
!
!                     parentset = bms_add_member(parentset, i + left);
!
!                     item->operator.left = left;
!                 }
!
!                 if (i == size - 1)
!                     elog(ERROR, "invalid pointer to right operand");
!                 break;
!             default:
!                 elog(ERROR, "unknown tsquery node type %d", item->type);
          }

          item++;
      }

!     /* Now check that each node, except the root, has a parent. We
!      * already checked above that no node has more than one parent. */
!     if (bms_num_members(parentset) != size - 1 && size != 0)
!         elog(ERROR, "malformed query tree");
!
      query = (TSQuery) repalloc(query, len + datalen);
-
      item = GETQUERY(query);
      ptr = GETOPERAND(query);
      for (i = 0; i < size; i++)
      {
          if (item->type == QI_VAL)
          {
!             memcpy(ptr,
!                    pq_getmsgbytes(buf, item->operand.length),
!                    item->operand.length);
!             ptr += item->operand.length;
!             *ptr++ = '\0';
          }
          item++;
      }

      Assert(ptr - GETOPERAND(query) == datalen);

      SET_VARSIZE(query, len + datalen);
--- 807,919 ----
      StringInfo    buf = (StringInfo) PG_GETARG_POINTER(0);
      TSQuery        query;
      int            i,
                  len;
      QueryItem  *item;
!     int            datalen;
      char       *ptr;
!     uint32        size;
!     const char **operands;

      size = pq_getmsgint(buf, sizeof(uint32));
!     if (size > (MaxAllocSize / sizeof(QueryItem)))
          elog(ERROR, "invalid size of tsquery");

!     /* Allocate space to temporarily hold operand strings */
!     operands = palloc(size * sizeof(char *));

!     /* Allocate space for all the QueryItems. */
!     len = HDRSIZETQ + sizeof(QueryItem) * size;
!     query = (TSQuery) palloc0(len);
      query->size = size;
      item = GETQUERY(query);

+     datalen = 0;
      for (i = 0; i < size; i++)
      {
          item->type = (int8) pq_getmsgint(buf, sizeof(int8));

!         if (item->type == QI_VAL)
          {
!             size_t val_len; /* length after recoding to server encoding */
!             uint8 weight;
!             const char *val;
!             pg_crc32 valcrc;
!
!             weight      = (uint8) pq_getmsgint(buf, sizeof(uint8));
!             val = pq_getmsgstring(buf);
!             val_len = strlen(val);
!
!             /* Sanity checks */
!
!             if (weight > 0xF)
!                 elog(ERROR, "invalid tsquery; invalid weight bitmap");
!
!             if (val_len > MAXSTRLEN)
!                 elog(ERROR, "invalid tsquery; operand too long");
!
!             if (datalen > MAXSTRPOS)
!                 elog(ERROR, "invalid tsquery; total operand length exceeded");
!
!             /* Looks valid. */
!
!             INIT_CRC32(valcrc);
!             COMP_CRC32(valcrc, val, val_len);
!             FIN_CRC32(valcrc);
!
!             item->operand.weight = weight;
!             item->operand.valcrc = (int32) valcrc;
!             item->operand.length = val_len;
!             item->operand.distance = datalen;
!
!             /*
!              * Operand strings are copied to the final struct after this loop;
!              * here we just collect them to an array
!              */
!             operands[i] = val;
!
!             datalen += val_len + 1;        /* + 1 for the '\0' terminator */
!         }
!         else if (item->type == QI_OPR)
!         {
!             int8 oper;
!             oper = (int8) pq_getmsgint(buf, sizeof(int8));
!             if (oper != OP_NOT && oper != OP_OR && oper != OP_AND)
!                 elog(ERROR, "invalid tsquery; unknown operator type %d", (int) oper);
!             if (i == size - 1)
!                 elog(ERROR, "invalid pointer to right operand");

!             item->operator.oper = oper;
          }
+         else
+             elog(ERROR, "unknown tsquery node type %d", item->type);

          item++;
      }

!     /* Enlarge buffer to make room for the operand values. */
      query = (TSQuery) repalloc(query, len + datalen);
      item = GETQUERY(query);
      ptr = GETOPERAND(query);
+
+     /*
+      * Fill in the left-pointers. Checks that the tree is well-formed
+      * as a side-effect.
+      */
+     findoprnd(item, size);
+
+     /* Copy operands to output struct */
      for (i = 0; i < size; i++)
      {
          if (item->type == QI_VAL)
          {
!             memcpy(ptr, operands[i], item->operand.length + 1);
!             ptr += item->operand.length + 1;
          }
          item++;
      }

+     pfree(operands);
+
      Assert(ptr - GETOPERAND(query) == datalen);

      SET_VARSIZE(query, len + datalen);
*** ../pgsql.tsearch-2/src/backend/utils/adt/tsquery_cleanup.c    2007-09-05 12:14:43.000000000 +0100
--- ./src/backend/utils/adt/tsquery_cleanup.c    2007-09-07 09:35:48.000000000 +0100
***************
*** 17,22 ****
--- 17,23 ----

  #include "tsearch/ts_type.h"
  #include "tsearch/ts_utils.h"
+ #include "miscadmin.h"

  typedef struct NODE
  {
diff -r -c -x '*.o' -x '*.Po' -x config.log -x '*.so' -x CVS -x gram.c
../pgsql.tsearch-2/src/backend/utils/adt/tsquery_rewrite.c./src/backend/utils/adt/tsquery_rewrite.c 
*** ../pgsql.tsearch-2/src/backend/utils/adt/tsquery_rewrite.c    2007-09-05 12:18:46.000000000 +0100
--- ./src/backend/utils/adt/tsquery_rewrite.c    2007-09-06 23:25:00.000000000 +0100
***************
*** 17,22 ****
--- 17,23 ----
  #include "executor/spi.h"
  #include "tsearch/ts_type.h"
  #include "tsearch/ts_utils.h"
+ #include "miscadmin.h"


  static int
*** ../pgsql.tsearch-2/src/backend/utils/adt/tsquery_util.c    2007-09-05 12:21:29.000000000 +0100
--- ./src/backend/utils/adt/tsquery_util.c    2007-09-06 23:25:14.000000000 +0100
***************
*** 16,21 ****
--- 16,22 ----

  #include "tsearch/ts_type.h"
  #include "tsearch/ts_utils.h"
+ #include "miscadmin.h"

  QTNode *
  QT2QTN(QueryItem * in, char *operand)
*** ../pgsql.tsearch-2/src/backend/utils/adt/tsrank.c    2007-09-05 12:24:27.000000000 +0100
--- ./src/backend/utils/adt/tsrank.c    2007-09-06 23:56:29.000000000 +0100
***************
*** 18,23 ****
--- 18,24 ----
  #include "tsearch/ts_type.h"
  #include "tsearch/ts_utils.h"
  #include "utils/array.h"
+ #include "miscadmin.h"


  static float weights[] = {0.1, 0.2, 0.4, 1.0};
***************
*** 176,183 ****
      return res;
  }

  static WordEntryPos POSNULL[] = {
!     0,
      0
  };

--- 177,185 ----
      return res;
  }

+ /* A dummy WordEntryPos array to use when haspos is false */
  static WordEntryPos POSNULL[] = {
!     1, /* Number of elements that follow */
      0
  };

***************
*** 207,213 ****
      }
      pos = (uint16 **) palloc(sizeof(uint16 *) * q->size);
      memset(pos, 0, sizeof(uint16 *) * q->size);
-     *(uint16 *) POSNULL = lengthof(POSNULL) - 1;
      WEP_SETPOS(POSNULL[1], MAXENTRYPOS - 1);

      for (i = 0; i < size; i++)
--- 209,214 ----
***************
*** 265,271 ****
      QueryOperand **item;
      int            size = q->size;

-     *(uint16 *) POSNULL = lengthof(POSNULL) - 1;
      item = SortAndUniqItems(q, &size);

      for (i = 0; i < size; i++)
--- 266,271 ----
***************
*** 593,599 ****
      DocRepresentation *doc;
      char       *operand;

-     *(uint16 *) POSNULL = lengthof(POSNULL) - 1;
      doc = (DocRepresentation *) palloc(sizeof(DocRepresentation) * len);
      operand = GETOPERAND(query);
      reset_istrue_flag(query);
--- 593,598 ----
*** ../pgsql.tsearch-2/src/backend/utils/adt/tsvector.c    2007-09-03 11:05:31.000000000 +0100
--- ./src/backend/utils/adt/tsvector.c    2007-09-07 09:47:46.000000000 +0100
***************
*** 75,92 ****
  }

  static int
! compareentry(const void *a, const void *b, void *arg)
  {
      char       *BufferStr = (char *) arg;

!     if (((WordEntryIN *) a)->entry.len == ((WordEntryIN *) b)->entry.len)
      {
!         return strncmp(&BufferStr[((WordEntryIN *) a)->entry.pos],
!                        &BufferStr[((WordEntryIN *) b)->entry.pos],
!                        ((WordEntryIN *) a)->entry.len);
      }

!     return (((WordEntryIN *) a)->entry.len > ((WordEntryIN *) b)->entry.len) ? 1 : -1;
  }

  static int
--- 75,94 ----
  }

  static int
! compareentry(const void *va, const void *vb, void *arg)
  {
      char       *BufferStr = (char *) arg;
+     WordEntryIN *a = (WordEntryIN *) va;
+     WordEntryIN *b = (WordEntryIN *) vb;

!     if (a->entry.len == b->entry.len)
      {
!         return strncmp(&BufferStr[a->entry.pos],
!                        &BufferStr[b->entry.pos],
!                        a->entry.len);
      }

!     return (a->entry.len > b->entry.len) ? 1 : -1;
  }

  static int
***************
*** 104,109 ****
--- 106,114 ----
              a->poslen = uniquePos(a->pos, a->poslen);
              *outbuflen = SHORTALIGN(a->entry.len) + (a->poslen + 1) * sizeof(WordEntryPos);
          }
+         else
+             *outbuflen = a->entry.len;
+
          return l;
      }
      res = a;
***************
*** 118,127 ****
          {
              if (res->entry.haspos)
              {
                  res->poslen = uniquePos(res->pos, res->poslen);
                  *outbuflen += res->poslen * sizeof(WordEntryPos);
              }
!             *outbuflen += SHORTALIGN(res->entry.len);
              res++;
              memcpy(res, ptr, sizeof(WordEntryIN));
          }
--- 123,134 ----
          {
              if (res->entry.haspos)
              {
+                 *outbuflen += SHORTALIGN(res->entry.len);
                  res->poslen = uniquePos(res->pos, res->poslen);
                  *outbuflen += res->poslen * sizeof(WordEntryPos);
              }
!             else
!                 *outbuflen += res->entry.len;
              res++;
              memcpy(res, ptr, sizeof(WordEntryIN));
          }
***************
*** 147,158 ****
          }
          ptr++;
      }
      if (res->entry.haspos)
      {
          res->poslen = uniquePos(res->pos, res->poslen);
          *outbuflen += res->poslen * sizeof(WordEntryPos);
      }
!     *outbuflen += SHORTALIGN(res->entry.len);

      return res + 1 - a;
  }
--- 154,171 ----
          }
          ptr++;
      }
+
+     /* add last item */
+
      if (res->entry.haspos)
      {
+         *outbuflen += SHORTALIGN(res->entry.len);
+
          res->poslen = uniquePos(res->pos, res->poslen);
          *outbuflen += res->poslen * sizeof(WordEntryPos);
      }
!     else
!         *outbuflen += res->entry.len;

      return res + 1 - a;
  }
***************
*** 367,372 ****
--- 380,397 ----
      PG_RETURN_CSTRING(outbuf);
  }

+ /*
+  * Binary Input / Output functions. The binary format is as follows:
+  *
+  * uint32    number of lexemes
+  *
+  * for each lexeme:
+  *        lexeme text in client encoding, null-terminated
+  *         uint16    number of positions
+  *         for each position:
+  *            uint16 WordEntryPos
+  */
+
  Datum
  tsvectorsend(PG_FUNCTION_ARGS)
  {
***************
*** 381,398 ****
      pq_sendint(&buf, vec->size, sizeof(int32));
      for (i = 0; i < vec->size; i++)
      {
!         /*
!          * We are sure that sizeof(WordEntry) == sizeof(int32)
           */
!         pq_sendint(&buf, *(int32 *) weptr, sizeof(int32));

!         pq_sendbytes(&buf, STRPTR(vec) + weptr->pos, weptr->len);
!         if (weptr->haspos)
          {
              WordEntryPos *wepptr = POSDATAPTR(vec, weptr);

!             pq_sendint(&buf, POSDATALEN(vec, weptr), sizeof(WordEntryPos));
!             for (j = 0; j < POSDATALEN(vec, weptr); j++)
                  pq_sendint(&buf, wepptr[j], sizeof(WordEntryPos));
          }
          weptr++;
--- 406,427 ----
      pq_sendint(&buf, vec->size, sizeof(int32));
      for (i = 0; i < vec->size; i++)
      {
!         uint16 npos;
!
!         /* the strings in the TSVector array are not null-terminated, so
!          * we have to send the null-terminator separately
           */
!         pq_sendtext(&buf, STRPTR(vec) + weptr->pos, weptr->len);
!         pq_sendbyte(&buf, '\0');
!
!         npos = POSDATALEN(vec, weptr);
!         pq_sendint(&buf, npos, sizeof(uint16));

!         if(npos > 0)
          {
              WordEntryPos *wepptr = POSDATAPTR(vec, weptr);

!             for (j = 0; j < npos; j++)
                  pq_sendint(&buf, wepptr[j], sizeof(WordEntryPos));
          }
          weptr++;
***************
*** 407,477 ****
      StringInfo    buf = (StringInfo) PG_GETARG_POINTER(0);
      TSVector    vec;
      int            i;
!     uint32        size;
!     WordEntry  *weptr;
!     int            datalen = 0;
!     Size        len;

!     size = pq_getmsgint(buf, sizeof(uint32));
!     if (size < 0 || size > (MaxAllocSize / sizeof(WordEntry)))
          elog(ERROR, "invalid size of tsvector");

!     len = DATAHDRSIZE + sizeof(WordEntry) * size;

!     len = len * 2; /* times two to make room for lexemes */
      vec = (TSVector) palloc0(len);
!     vec->size = size;

!     weptr = ARRPTR(vec);
!     for (i = 0; i < size; i++)
      {
!         int32 tmp;

!         weptr = ARRPTR(vec) + i;

          /*
!          * We are sure that sizeof(WordEntry) == sizeof(int32)
           */
!         tmp = pq_getmsgint(buf, sizeof(int32));
!         *weptr = *(WordEntry *) & tmp;
!
!         while (CALCDATASIZE(size, datalen + SHORTALIGN(weptr->len)) >= len)
          {
              len *= 2;
              vec = (TSVector) repalloc(vec, len);
-             weptr = ARRPTR(vec) + i;
          }

!         memcpy(STRPTR(vec) + weptr->pos,
!                pq_getmsgbytes(buf, weptr->len),
!                weptr->len);
!         datalen += SHORTALIGN(weptr->len);

!         if (i > 0 && WordEntryCMP(weptr, weptr - 1, STRPTR(vec)) <= 0)
              elog(ERROR, "lexemes are unordered");

!         if (weptr->haspos)
          {
!             uint16        j,
!                         npos;
              WordEntryPos *wepptr;

!             npos = (uint16) pq_getmsgint(buf, sizeof(uint16));
!             if (npos > MAXNUMPOS)
!                 elog(ERROR, "unexpected number of positions");
!
!             while (CALCDATASIZE(size, datalen + (npos + 1) * sizeof(WordEntryPos)) >= len)
              {
!                 len *= 2;
!                 vec = (TSVector) repalloc(vec, len);
!                 weptr = ARRPTR(vec) + i;
              }

!             memcpy(_POSDATAPTR(vec, weptr), &npos, sizeof(int16));
!             wepptr = POSDATAPTR(vec, weptr);
              for (j = 0; j < npos; j++)
              {
!                 wepptr[j] = (WordEntryPos) pq_getmsgint(buf, sizeof(int16));
                  if (j > 0 && WEP_GETPOS(wepptr[j]) <= WEP_GETPOS(wepptr[j - 1]))
                      elog(ERROR, "position information is unordered");
              }
--- 436,527 ----
      StringInfo    buf = (StringInfo) PG_GETARG_POINTER(0);
      TSVector    vec;
      int            i;
!     int32        nentries;
!     int            datalen; /* number of bytes used in the variable size area
!                           * after fixed size TSVector header and WordEntries
!                           */
!     Size        hdrlen;
!     Size        len;  /* allocated size of vec */

!     nentries = pq_getmsgint(buf, sizeof(int32));
!     if (nentries < 0 || nentries > (MaxAllocSize / sizeof(WordEntry)))
          elog(ERROR, "invalid size of tsvector");

!     hdrlen = DATAHDRSIZE + sizeof(WordEntry) * nentries;

!     len = hdrlen * 2; /* times two to make room for lexemes */
      vec = (TSVector) palloc0(len);
!     vec->size = nentries;

!     datalen = 0;
!     for (i = 0; i < nentries; i++)
      {
!         const char *lexeme;
!         uint16 npos;
!         size_t lex_len;
!
!         lexeme = pq_getmsgstring(buf);
!         npos = (uint16) pq_getmsgint(buf, sizeof(uint16));
!
!         /* sanity checks */

!         lex_len = strlen(lexeme);
!         if (lex_len < 0 || lex_len > MAXSTRLEN)
!             elog(ERROR, "invalid tsvector; lexeme too long");
!
!         if (datalen > MAXSTRPOS)
!             elog(ERROR, "invalid tsvector; maximum total lexeme length exceeded");
!
!         if (npos > MAXNUMPOS)
!             elog(ERROR, "unexpected number of positions");

          /*
!          * Looks valid. Fill the WordEntry struct, and copy lexeme.
!          *
!          * But make sure the buffer is large enough first.
           */
!         while (hdrlen + SHORTALIGN(datalen + lex_len) +
!                (npos + 1) * sizeof(WordEntryPos) >= len)
          {
              len *= 2;
              vec = (TSVector) repalloc(vec, len);
          }

!         vec->entries[i].haspos = (npos > 0) ? 1 : 0;
!         vec->entries[i].len = lex_len;
!         vec->entries[i].pos = datalen;

!         memcpy(STRPTR(vec) + datalen, lexeme, lex_len);
!
!         datalen += lex_len;
!
!         if (i > 0 && WordEntryCMP(&vec->entries[i], &vec->entries[i - 1], STRPTR(vec)) <= 0)
              elog(ERROR, "lexemes are unordered");

!         /* Receive positions */
!
!         if (npos > 0)
          {
!             uint16        j;
              WordEntryPos *wepptr;

!             /*
!              * Pad to 2-byte alignment if necessary. Though we used palloc0
!              * for the initial allocation, subsequent repalloc'd memory
!              * areas are not initialized to zero.
!              */
!             if (datalen != SHORTALIGN(datalen))
              {
!                 *(STRPTR(vec) + datalen) = '\0';
!                 datalen = SHORTALIGN(datalen);
              }

!             memcpy(STRPTR(vec) + datalen, &npos, sizeof(uint16));
!
!             wepptr = POSDATAPTR(vec, &vec->entries[i]);
              for (j = 0; j < npos; j++)
              {
!                 wepptr[j] = (WordEntryPos) pq_getmsgint(buf, sizeof(WordEntryPos));
                  if (j > 0 && WEP_GETPOS(wepptr[j]) <= WEP_GETPOS(wepptr[j - 1]))
                      elog(ERROR, "position information is unordered");
              }
***************
*** 480,486 ****
          }
      }

!     SET_VARSIZE(vec, CALCDATASIZE(vec->size, datalen));

      PG_RETURN_TSVECTOR(vec);
  }
--- 530,536 ----
          }
      }

!     SET_VARSIZE(vec, hdrlen + datalen);

      PG_RETURN_TSVECTOR(vec);
  }
*** ../pgsql.tsearch-2/src/backend/utils/adt/tsvector_op.c    2007-08-31 21:20:00.000000000 +0100
--- ./src/backend/utils/adt/tsvector_op.c    2007-09-06 23:47:32.000000000 +0100
***************
*** 165,171 ****
      char       *cur;

      for (i = 0; i < in->size; i++)
!         len += SHORTALIGN(arrin[i].len);

      len = CALCDATASIZE(in->size, len);
      out = (TSVector) palloc0(len);
--- 165,171 ----
      char       *cur;

      for (i = 0; i < in->size; i++)
!         len += arrin[i].len;

      len = CALCDATASIZE(in->size, len);
      out = (TSVector) palloc0(len);
***************
*** 179,185 ****
          arrout[i].haspos = 0;
          arrout[i].len = arrin[i].len;
          arrout[i].pos = cur - STRPTR(out);
!         cur += SHORTALIGN(arrout[i].len);
      }

      PG_FREE_IF_COPY(in, 0);
--- 179,185 ----
          arrout[i].haspos = 0;
          arrout[i].len = arrin[i].len;
          arrout[i].pos = cur - STRPTR(out);
!         cur += arrout[i].len;
      }

      PG_FREE_IF_COPY(in, 0);
***************
*** 351,362 ****
              ptr->len = ptr1->len;
              memcpy(cur, data1 + ptr1->pos, ptr1->len);
              ptr->pos = cur - data;
-             cur += SHORTALIGN(ptr1->len);
              if (ptr->haspos)
              {
                  memcpy(cur, _POSDATAPTR(in1, ptr1), POSDATALEN(in1, ptr1) * sizeof(WordEntryPos) + sizeof(uint16));
                  cur += POSDATALEN(in1, ptr1) * sizeof(WordEntryPos) + sizeof(uint16);
              }
              ptr++;
              ptr1++;
              i1--;
--- 351,365 ----
              ptr->len = ptr1->len;
              memcpy(cur, data1 + ptr1->pos, ptr1->len);
              ptr->pos = cur - data;
              if (ptr->haspos)
              {
+                 cur += SHORTALIGN(ptr1->len);
                  memcpy(cur, _POSDATAPTR(in1, ptr1), POSDATALEN(in1, ptr1) * sizeof(WordEntryPos) + sizeof(uint16));
                  cur += POSDATALEN(in1, ptr1) * sizeof(WordEntryPos) + sizeof(uint16);
              }
+             else
+                 cur += ptr1->len;
+
              ptr++;
              ptr1++;
              i1--;
***************
*** 367,382 ****
              ptr->len = ptr2->len;
              memcpy(cur, data2 + ptr2->pos, ptr2->len);
              ptr->pos = cur - data;
-             cur += SHORTALIGN(ptr2->len);
              if (ptr->haspos)
              {
                  int            addlen = add_pos(in2, ptr2, out, ptr, maxpos);

                  if (addlen == 0)
                      ptr->haspos = 0;
                  else
                      cur += addlen * sizeof(WordEntryPos) + sizeof(uint16);
              }
              ptr++;
              ptr2++;
              i2--;
--- 370,389 ----
              ptr->len = ptr2->len;
              memcpy(cur, data2 + ptr2->pos, ptr2->len);
              ptr->pos = cur - data;
              if (ptr->haspos)
              {
                  int            addlen = add_pos(in2, ptr2, out, ptr, maxpos);

+                 cur += SHORTALIGN(ptr2->len);
+
                  if (addlen == 0)
                      ptr->haspos = 0;
                  else
                      cur += addlen * sizeof(WordEntryPos) + sizeof(uint16);
              }
+             else
+                 cur += ptr2->len;
+
              ptr++;
              ptr2++;
              i2--;
***************
*** 387,395 ****
              ptr->len = ptr1->len;
              memcpy(cur, data1 + ptr1->pos, ptr1->len);
              ptr->pos = cur - data;
-             cur += SHORTALIGN(ptr1->len);
              if (ptr->haspos)
              {
                  if (ptr1->haspos)
                  {
                      memcpy(cur, _POSDATAPTR(in1, ptr1), POSDATALEN(in1, ptr1) * sizeof(WordEntryPos) +
sizeof(uint16));
--- 394,402 ----
              ptr->len = ptr1->len;
              memcpy(cur, data1 + ptr1->pos, ptr1->len);
              ptr->pos = cur - data;
              if (ptr->haspos)
              {
+                 cur += SHORTALIGN(ptr1->len);
                  if (ptr1->haspos)
                  {
                      memcpy(cur, _POSDATAPTR(in1, ptr1), POSDATALEN(in1, ptr1) * sizeof(WordEntryPos) +
sizeof(uint16));
***************
*** 407,412 ****
--- 414,422 ----
                          cur += addlen * sizeof(WordEntryPos) + sizeof(uint16);
                  }
              }
+             else
+                 cur += ptr1->len;
+
              ptr++;
              ptr1++;
              ptr2++;
***************
*** 421,432 ****
          ptr->len = ptr1->len;
          memcpy(cur, data1 + ptr1->pos, ptr1->len);
          ptr->pos = cur - data;
-         cur += SHORTALIGN(ptr1->len);
          if (ptr->haspos)
          {
              memcpy(cur, _POSDATAPTR(in1, ptr1), POSDATALEN(in1, ptr1) * sizeof(WordEntryPos) + sizeof(uint16));
              cur += POSDATALEN(in1, ptr1) * sizeof(WordEntryPos) + sizeof(uint16);
          }
          ptr++;
          ptr1++;
          i1--;
--- 431,445 ----
          ptr->len = ptr1->len;
          memcpy(cur, data1 + ptr1->pos, ptr1->len);
          ptr->pos = cur - data;
          if (ptr->haspos)
          {
+             cur += SHORTALIGN(ptr1->len);
              memcpy(cur, _POSDATAPTR(in1, ptr1), POSDATALEN(in1, ptr1) * sizeof(WordEntryPos) + sizeof(uint16));
              cur += POSDATALEN(in1, ptr1) * sizeof(WordEntryPos) + sizeof(uint16);
          }
+         else
+             cur += ptr1->len;
+
          ptr++;
          ptr1++;
          i1--;
***************
*** 438,453 ****
          ptr->len = ptr2->len;
          memcpy(cur, data2 + ptr2->pos, ptr2->len);
          ptr->pos = cur - data;
-         cur += SHORTALIGN(ptr2->len);
          if (ptr->haspos)
          {
              int            addlen = add_pos(in2, ptr2, out, ptr, maxpos);

              if (addlen == 0)
                  ptr->haspos = 0;
              else
                  cur += addlen * sizeof(WordEntryPos) + sizeof(uint16);
          }
          ptr++;
          ptr2++;
          i2--;
--- 451,470 ----
          ptr->len = ptr2->len;
          memcpy(cur, data2 + ptr2->pos, ptr2->len);
          ptr->pos = cur - data;
          if (ptr->haspos)
          {
              int            addlen = add_pos(in2, ptr2, out, ptr, maxpos);

+             cur += SHORTALIGN(ptr2->len);
+
              if (addlen == 0)
                  ptr->haspos = 0;
              else
                  cur += addlen * sizeof(WordEntryPos) + sizeof(uint16);
          }
+         else
+             cur += ptr2->len;
+
          ptr++;
          ptr2++;
          i2--;
***************
*** 484,491 ****
  static bool
  checkclass_str(CHKVAL * chkval, WordEntry * val, QueryOperand * item)
  {
!     WordEntryPos *ptr = (WordEntryPos *) (chkval->values + val->pos + SHORTALIGN(val->len) + sizeof(uint16));
!     uint16        len = *((uint16 *) (chkval->values + val->pos + SHORTALIGN(val->len)));

      while (len--)
      {
--- 501,508 ----
  static bool
  checkclass_str(CHKVAL * chkval, WordEntry * val, QueryOperand * item)
  {
!     WordEntryPos *ptr = (WordEntryPos *) (chkval->values + SHORTALIGN(val->pos + val->len) + sizeof(uint16));
!     uint16        len = *((uint16 *) (chkval->values + SHORTALIGN(val->pos + val->len)));

      while (len--)
      {
*** ../pgsql.tsearch-2/src/include/tsearch/ts_type.h    2007-09-05 12:17:02.000000000 +0100
--- ./src/include/tsearch/ts_type.h    2007-09-07 09:20:43.000000000 +0100
***************
*** 62,87 ****
   *                            bytes from end of WordEntry array to start of
   *                            corresponding lexeme.
   * 4) Lexeme's storage:
!  *      SHORTALIGNED(lexeme) and position information if it exists
!  *      Position information: first int2 - is a number of positions and it
!  *      follows array of WordEntryPos
   */

  typedef struct
  {
      int32        vl_len_;        /* varlena header (do not touch directly!) */
!     uint32        size;
!     char        data[1];
  } TSVectorData;

  typedef TSVectorData *TSVector;

! #define DATAHDRSIZE (VARHDRSZ + sizeof(int4))
! #define CALCDATASIZE(x, lenstr) ( (x) * sizeof(WordEntry) + DATAHDRSIZE + (lenstr) )
! #define ARRPTR(x)    ( (WordEntry*) ( (char*)(x) + DATAHDRSIZE ) )
! #define STRPTR(x)    ( (char*)(x) + DATAHDRSIZE + ( sizeof(WordEntry) * ((TSVector)(x))->size ) )
! #define STRSIZE(x)    ( ((TSVector)(x))->len - DATAHDRSIZE - ( sizeof(WordEntry) * ((TSVector)(x))->size ) )
! #define _POSDATAPTR(x,e)    (STRPTR(x)+((WordEntry*)(e))->pos+SHORTALIGN(((WordEntry*)(e))->len))
  #define POSDATALEN(x,e) ( ( ((WordEntry*)(e))->haspos ) ? (*(uint16*)_POSDATAPTR(x,e)) : 0 )
  #define POSDATAPTR(x,e) ( (WordEntryPos*)( _POSDATAPTR(x,e)+sizeof(uint16) ) )

--- 62,94 ----
   *                            bytes from end of WordEntry array to start of
   *                            corresponding lexeme.
   * 4) Lexeme's storage:
!  *      lexeme (without null-terminator)
!  *    if haspos is true:
!  *        padding byte if necessary to make the number of positions 2-byte aligned
!  *        uint16        number of positions that follow.
!  *        uint16[]    positions
!  *
!  * The positions must be sorted.
   */

  typedef struct
  {
      int32        vl_len_;        /* varlena header (do not touch directly!) */
!     int32        size;
!     WordEntry    entries[1]; /* var size */
!     /* lexemes follow */
  } TSVectorData;

  typedef TSVectorData *TSVector;

! #define DATAHDRSIZE (offsetof(TSVectorData, entries))
! #define CALCDATASIZE(x, lenstr) (DATAHDRSIZE + (x) * sizeof(WordEntry) + (lenstr) )
! #define ARRPTR(x)    ( (x)->entries )
!
! /* returns a pointer to the beginning of lexemes */
! #define STRPTR(x)    ( (char *) &(x)->entries[x->size] )
!
! #define _POSDATAPTR(x,e)    (STRPTR(x) + SHORTALIGN((e)->pos + (e)->len))
  #define POSDATALEN(x,e) ( ( ((WordEntry*)(e))->haspos ) ? (*(uint16*)_POSDATAPTR(x,e)) : 0 )
  #define POSDATAPTR(x,e) ( (WordEntryPos*)( _POSDATAPTR(x,e)+sizeof(uint16) ) )

***************
*** 166,172 ****
       * C: 1<<1
       * D: 1<<0
       */
!     int8        weight;
      int32    valcrc;                /* XXX: pg_crc32 would be a more appropriate data type, but we use comparisons to
signedintegers in the code. They would need to be changed as well. */ 

      /* pointer to text value of operand, must correlate with WordEntry */
--- 173,179 ----
       * C: 1<<1
       * D: 1<<0
       */
!     uint8        weight;
      int32    valcrc;                /* XXX: pg_crc32 would be a more appropriate data type, but we use comparisons to
signedintegers in the code. They would need to be changed as well. */ 

      /* pointer to text value of operand, must correlate with WordEntry */

XML binary I/O (was Re: tsearch refactorings)

From

"Heikki Linnakangas"

Date:

07 September 2007, 11:03:37

Heikki Linnakangas wrote:
> BTW, the encoding of the XML datatype looks pretty funky. xml_recv first
> reads the xml string with pq_getmsgtext, which applies a client->server
> conversion. Then the xml declaration is parsed, extracting the encoding
> attribute. Then the string is converted again from that encoding (or
> UTF-8 if none was specified) to server encoding. I don't understand how
> it's supposed to work, but ISTM there's one conversion too much,

And it's got an unfortunate typo in it as well: it calls "free(result)"
instead of pfree. I think we need regression tests for the more complex
send/recv functions...

What's the difference between text and binary mode for something like
xml anyway? Could we just call the text format in/out functions and be
done with it?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: tsearch refactorings

From

Teodor Sigaev

Date:

07 September 2007, 14:41:36

> BTW, Teodor, are you intending to review/apply Heikki's tsearch fixes,
> or do you want someone else to do it?
I'll do it.

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: tsearch refactorings

From

Teodor Sigaev

Date:

07 September 2007, 16:06:33

> I am getting confused with the patches and version I have lying around
> here... I think I'll have to wait for review of the patches I've posted
> this far before I continue hacking.
Sorry for delay - I was busy by another. All your patches are committed with
very small changes.
--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: tsearch refactorings

From

"Heikki Linnakangas"

Date:

07 September 2007, 21:30:02

Teodor Sigaev wrote:
>> I am getting confused with the patches and version I have lying around
>> here... I think I'll have to wait for review of the patches I've posted
>> this far before I continue hacking.
> Sorry for delay - I was busy by another. All your patches are committed
> with very small changes.

Thanks!

Can you please apply this fix for the bug Pavel found as well:

http://archives.postgresql.org/pgsql-hackers/2007-09/msg00127.php

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: XML binary I/O (was Re: tsearch refactorings)

From

"Heikki Linnakangas"

Date:

14 September 2007, 11:56:06

Heikki Linnakangas wrote:
> Heikki Linnakangas wrote:
>> BTW, the encoding of the XML datatype looks pretty funky. xml_recv first
>> reads the xml string with pq_getmsgtext, which applies a client->server
>> conversion. Then the xml declaration is parsed, extracting the encoding
>> attribute. Then the string is converted again from that encoding (or
>> UTF-8 if none was specified) to server encoding. I don't understand how
>> it's supposed to work, but ISTM there's one conversion too much,
>
> And it's got an unfortunate typo in it as well: it calls "free(result)"
> instead of pfree. I think we need regression tests for the more complex
> send/recv functions...

According to the docs, xml_send is supposed to output the XML document
in client encoding, with that encoding specified in the xml declaration,
eg. ?<xml encoding="LATIN1"?>. xml_recv is supposed to ignore client
encoding, and parse the document according to the encoding specified in
the declaration.

Here's a patch that fixes the send/recv functions to work like the
manual says. It fixes the free/pfree typo as well.

Included is a new regression test for the send/recv functions as well
(unpatched CVS HEAD fails it BTW). It can be merged with the existing
xml test, but I didn't want to do it in the patch because it would've
involved moving the current xml test from sql/ to input/, and made the
patch longer to review.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/utils/adt/xml.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/adt/xml.c,v
retrieving revision 1.46
diff -c -r1.46 xml.c
*** src/backend/utils/adt/xml.c    13 Jul 2007 03:43:23 -0000    1.46
--- src/backend/utils/adt/xml.c    14 Sep 2007 11:23:48 -0000
***************
*** 232,238 ****
      xmlDocPtr    doc;
      xmlChar       *encoding = NULL;

!     str = pq_getmsgtext(buf, buf->len - buf->cursor, &nbytes);

      result = palloc(nbytes + VARHDRSZ);
      SET_VARSIZE(result, nbytes + VARHDRSZ);
--- 232,244 ----
      xmlDocPtr    doc;
      xmlChar       *encoding = NULL;

!     /*
!      * Read the data in raw format. We don't know yet what the encoding
!      * is, that information is embedded in the xml declaration, so we
!      * have to parse that before converting to server encoding.
!      */
!     nbytes = buf->len - buf->cursor;
!     str = (char *) pq_getmsgbytes(buf, nbytes);

      result = palloc(nbytes + VARHDRSZ);
      SET_VARSIZE(result, nbytes + VARHDRSZ);
***************
*** 247,262 ****
      doc = xml_parse(result, xmloption, true, encoding);
      xmlFreeDoc(doc);

      newstr = (char *) pg_do_encoding_conversion((unsigned char *) str,
                                                  nbytes,
                                                  encoding ? pg_char_to_encoding((char *) encoding) : PG_UTF8,
                                                  GetDatabaseEncoding());

-     pfree(str);
-
      if (newstr != str)
      {
!         free(result);

          nbytes = strlen(newstr);

--- 253,267 ----
      doc = xml_parse(result, xmloption, true, encoding);
      xmlFreeDoc(doc);

+     /* Now that we know what we're dealing with, convert to server encoding */
      newstr = (char *) pg_do_encoding_conversion((unsigned char *) str,
                                                  nbytes,
                                                  encoding ? pg_char_to_encoding((char *) encoding) : PG_UTF8,
                                                  GetDatabaseEncoding());

      if (newstr != str)
      {
!         pfree(result);

          nbytes = strlen(newstr);

***************
*** 277,287 ****
  xml_send(PG_FUNCTION_ARGS)
  {
      xmltype       *x = PG_GETARG_XML_P(0);
!     char       *outval = xml_out_internal(x, pg_get_client_encoding());
      StringInfoData buf;

      pq_begintypsend(&buf);
!     pq_sendstring(&buf, outval);
      PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
  }

--- 282,299 ----
  xml_send(PG_FUNCTION_ARGS)
  {
      xmltype       *x = PG_GETARG_XML_P(0);
!     char       *outval;
      StringInfoData buf;
+
+     /*
+      * xml_out_internal doesn't convert the encoding, it just prints
+      * the right declaration. pq_sendtext will do the conversion.
+      */
+     outval = xml_out_internal(x, pg_get_client_encoding());

      pq_begintypsend(&buf);
!     pq_sendtext(&buf, outval, strlen(outval));
!     pfree(outval);
      PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
  }

Index: src/test/regress/parallel_schedule
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/test/regress/parallel_schedule,v
retrieving revision 1.44
diff -c -r1.44 parallel_schedule
*** src/test/regress/parallel_schedule    11 Sep 2007 11:54:42 -0000    1.44
--- src/test/regress/parallel_schedule    14 Sep 2007 11:29:10 -0000
***************
*** 85,90 ****
--- 85,93 ----
  # "plpgsql" cannot run concurrently with "rules", nor can "plancache"
  test: plancache limit plpgsql copy2 temp domain rangefuncs prepare without_oid conversion truncate alter_table
sequencepolymorphism rowtypes returning largeobject xml 

+ # "xmlio" needs to run after xml
+ test: xmlio
+
  # run stats by itself because its delay may be insufficient under heavy load
  test: stats

Index: src/test/regress/serial_schedule
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/test/regress/serial_schedule,v
retrieving revision 1.41
diff -c -r1.41 serial_schedule
*** src/test/regress/serial_schedule    11 Sep 2007 11:54:42 -0000    1.41
--- src/test/regress/serial_schedule    14 Sep 2007 11:28:19 -0000
***************
*** 111,115 ****
--- 111,116 ----
  test: returning
  test: largeobject
  test: xml
+ test: xmlio
  test: stats
  test: tablespace
Index: src/test/regress/expected/xmlio.out
===================================================================
RCS file: src/test/regress/expected/xmlio.out
diff -N src/test/regress/expected/xmlio.out
*** /dev/null    1 Jan 1970 00:00:00 -0000
--- src/test/regress/expected/xmlio.out    14 Sep 2007 11:36:47 -0000
***************
*** 0 ****
--- 1,22 ----
+ -- Test binary I/O functions of XML datatype
+ --
+ -- This needs to be run after 'xml' test, because this depends
+ -- on the 'xmltest' table it creates
+ SELECT * FROM xmltest;
+  id |        data
+ ----+--------------------
+   1 | <value>one</value>
+   2 | <value>two</value>
+ (2 rows)
+
+ -- Basic test
+ COPY xmltest TO '/home/hlinnaka/pg_sandbox/pgsql.cvshead/src/test/regress/results/xmltest.data' WITH BINARY;
+ TRUNCATE xmltest;
+ COPY xmltest FROM '/home/hlinnaka/pg_sandbox/pgsql.cvshead/src/test/regress/results/xmltest.data' WITH BINARY;
+ SELECT * FROM xmltest;
+  id |        data
+ ----+--------------------
+   1 | <value>one</value>
+   2 | <value>two</value>
+ (2 rows)
+
Index: src/test/regress/expected/xmlio_1.out
===================================================================
RCS file: src/test/regress/expected/xmlio_1.out
diff -N src/test/regress/expected/xmlio_1.out
*** /dev/null    1 Jan 1970 00:00:00 -0000
--- src/test/regress/expected/xmlio_1.out    14 Sep 2007 11:44:13 -0000
***************
*** 0 ****
--- 1,18 ----
+ -- Test binary I/O functions of XML datatype
+ --
+ -- This needs to be run after 'xml' test, because this depends
+ -- on the 'xmltest' table it creates
+ SELECT * FROM xmltest;
+  id | data
+ ----+------
+ (0 rows)
+
+ -- Basic test
+ COPY xmltest TO '/home/hlinnaka/pg_sandbox/pgsql.cvshead/src/test/regress/results/xmltest.data' WITH BINARY;
+ TRUNCATE xmltest;
+ COPY xmltest FROM '/home/hlinnaka/pg_sandbox/pgsql.cvshead/src/test/regress/results/xmltest.data' WITH BINARY;
+ SELECT * FROM xmltest;
+  id | data
+ ----+------
+ (0 rows)
+
Index: src/test/regress/input/xmlio.source
===================================================================
RCS file: src/test/regress/input/xmlio.source
diff -N src/test/regress/input/xmlio.source
*** /dev/null    1 Jan 1970 00:00:00 -0000
--- src/test/regress/input/xmlio.source    14 Sep 2007 11:34:37 -0000
***************
*** 0 ****
--- 1,12 ----
+ -- Test binary I/O functions of XML datatype
+ --
+ -- This needs to be run after 'xml' test, because this depends
+ -- on the 'xmltest' table it creates
+
+ SELECT * FROM xmltest;
+
+ -- Basic test
+ COPY xmltest TO '@abs_builddir@/results/xmltest.data' WITH BINARY;
+ TRUNCATE xmltest;
+ COPY xmltest FROM '@abs_builddir@/results/xmltest.data' WITH BINARY;
+ SELECT * FROM xmltest;

Re: XML binary I/O (was Re: tsearch refactorings)

From

Alvaro Herrera

Date:

14 September 2007, 13:25:44

Heikki Linnakangas wrote:

> + -- Basic test
> + COPY xmltest TO '/home/hlinnaka/pg_sandbox/pgsql.cvshead/src/test/regress/results/xmltest.data' WITH BINARY;
> + TRUNCATE xmltest;
> + COPY xmltest FROM '/home/hlinnaka/pg_sandbox/pgsql.cvshead/src/test/regress/results/xmltest.data' WITH BINARY;
> + SELECT * FROM xmltest;
> +  id |        data
> + ----+--------------------
> +   1 | <value>one</value>
> +   2 | <value>two</value>
> + (2 rows)

This isn't going to work ...



--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: XML binary I/O (was Re: tsearch refactorings)

From

"Heikki Linnakangas"

Date:

14 September 2007, 13:38:01

Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
>
>> + -- Basic test
>> + COPY xmltest TO '/home/hlinnaka/pg_sandbox/pgsql.cvshead/src/test/regress/results/xmltest.data' WITH BINARY;
>> + TRUNCATE xmltest;
>> + COPY xmltest FROM '/home/hlinnaka/pg_sandbox/pgsql.cvshead/src/test/regress/results/xmltest.data' WITH BINARY;
>> + SELECT * FROM xmltest;
>> +  id |        data
>> + ----+--------------------
>> +   1 | <value>one</value>
>> +   2 | <value>two</value>
>> + (2 rows)
>
> This isn't going to work ...

Care to elaborate?

We could test a lot more, but that at least calls the send/recv functions.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: XML binary I/O (was Re: tsearch refactorings)

From

"Heikki Linnakangas"

Date:

14 September 2007, 13:39:08

Heikki Linnakangas wrote:
> Alvaro Herrera wrote:
>> Heikki Linnakangas wrote:
>>
>>> + -- Basic test
>>> + COPY xmltest TO '/home/hlinnaka/pg_sandbox/pgsql.cvshead/src/test/regress/results/xmltest.data' WITH BINARY;
>>> + TRUNCATE xmltest;
>>> + COPY xmltest FROM '/home/hlinnaka/pg_sandbox/pgsql.cvshead/src/test/regress/results/xmltest.data' WITH BINARY;
>>> + SELECT * FROM xmltest;
>>> +  id |        data
>>> + ----+--------------------
>>> +   1 | <value>one</value>
>>> +   2 | <value>two</value>
>>> + (2 rows)
>> This isn't going to work ...
>
> Care to elaborate?
>
> We could test a lot more, but that at least calls the send/recv functions.

Oh, I see what you mean. Yeah, the absolute paths need to be fixed :).

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: tsearch refactorings

From

Bruce Momjian

Date:

14 September 2007, 16:45:28

Done by Teodor.

---------------------------------------------------------------------------

Heikki Linnakangas wrote:
> Teodor Sigaev wrote:
> >> I am getting confused with the patches and version I have lying around
> >> here... I think I'll have to wait for review of the patches I've posted
> >> this far before I continue hacking.
> > Sorry for delay - I was busy by another. All your patches are committed
> > with very small changes.
>
> Thanks!
>
> Can you please apply this fix for the bug Pavel found as well:
>
> http://archives.postgresql.org/pgsql-hackers/2007-09/msg00127.php
>
> --
>   Heikki Linnakangas
>   EnterpriseDB   http://www.enterprisedb.com
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: You can help support the PostgreSQL project by donating at
>
>                 http://www.postgresql.org/about/donate

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: XML binary I/O (was Re: tsearch refactorings)

From

Tom Lane

Date:

14 September 2007, 17:04:43

"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> Oh, I see what you mean. Yeah, the absolute paths need to be fixed :).

Easier to just remove the regression test altogether.  Not every patch
requires permanent memorialization in a regression test.

            regards, tom lane

Re: XML binary I/O (was Re: tsearch refactorings)

From

Tom Lane

Date:

23 September 2007, 21:38:48

"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> Here's a patch that fixes the send/recv functions to work like the
> manual says. It fixes the free/pfree typo as well.

Applied with a further fix: the patch caused xml_recv to call
parse_xml_decl with a non-null-terminated string, which could in the
worst case lead to a crash.

> Included is a new regression test for the send/recv functions as well

I didn't apply this, as it seemed more trouble than it was worth.

            regards, tom lane