Thread: text datum VARDATA and strings

text datum VARDATA and strings

From: Reece Hart

Michael Enke recently asked in pgsql-bugs about VARDATA and C strings
(BUG #2574: C function: arg TEXT data corrupt).  Since that's not a bug,
I've moved this follow-up to pgsql-general.


On Mon, 2006-08-14 at 11:27 -0400, Tom Lane wrote:
> The usual way to get a C string from a TEXT datum is to call textout,
> eg
>         str = DatumGetCString(DirectFunctionCall1(textout, datumval));


Yikes!  I've been accessing VARDATA text data like Michael for years
(code below).  I account for length and don't expect null-termination,
but I don't use anything like Tom's suggestion above.  (I always try to
do what Tom says because that usually hurts less.)


I have three questions:

1) I based everything I did on examples lifted nearly verbatim from a
7.x manual, and I bet Michael did similarly.  I've never heard of
DatumGetCString, DirectFunctionCall1, or textout.  Are these and other
treasures documented somewhere?

2) Does DatumGetCString(DirectFunctionCall1(textout, datumval)) do
something other than null terminate a string?  All of the strings are
from [-A-Z0-1*]; server_encoding has been either SQL_ASCII or UTF8 in
case that's relevant.

3) Is there any reason to believe that the code below is problematic?

Thanks,
Reece




#include <postgres.h>
#include <fmgr.h>
#include <ctype.h>
#include <string.h>

static char* clean_sequence(const char* in, int32 n);

PG_FUNCTION_INFO_V1(pg_clean_sequence);
Datum pg_clean_sequence(PG_FUNCTION_ARGS)
  {
  text* t0;                                 /* in */
  text* t1;                                 /* out */
  char* tmp;
  int32 tmpl;

  if ( PG_ARGISNULL(0) )
    { PG_RETURN_NULL(); }

  t0 = PG_GETARG_TEXT_P(0);

  tmp = clean_sequence( VARDATA(t0), VARSIZE(t0)-VARHDRSZ );
  tmpl = (int32) strlen(tmp);

  /* copy temp sequence into new pg variable */
  t1 = (text*) palloc( tmpl + VARHDRSZ );
  if (!t1)
    { elog( ERROR, "couldn't palloc (%d bytes)", tmpl+VARHDRSZ ); }
  memcpy(VARDATA(t1),tmp,tmpl);
  VARATT_SIZEP(t1) = tmpl + VARHDRSZ;

  pfree(tmp);

  PG_RETURN_TEXT_P(t1);
  }



/* clean_sequence -- strip non-IUPAC symbols
   The intent is to strip non-sequence data which might result from
   copy-pasting a fasta file or some such.

   in: char*, length
   out: char*, |out|<=length, NULL-TERMINATED
   out is palloc'd memory; caller must free

   allow chars from IUPAC std 20
   + selenocysteine (U) + ambiguity (BZX) + gap (-) + stop (*)
*/

#define isseq(c) ( ((c)>='A' && (c)<='Z' && (c)!='J' && (c)!='O') \
                   || ((c)=='-') \
                   || ((c)=='*') )

char* clean_sequence(const char* in, int32 n) {
  char* out;
  char* oi;
  int32 i;

  out = palloc( n + 1 );        /* w/null */
  if (!out)
    { elog( ERROR, "couldn't palloc (%d bytes)", n+1 ); }

  for( i=0, oi=out; i<=n-1; i++ ) {
    char c = (char) toupper((unsigned char) in[i]);
    if ( isseq(c) ) {
      *oi++ = c;
    }
  }
  *oi = '\0';
  return(out);
}


--
Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0


Re: text datum VARDATA and strings

From: Tom Lane

Reece Hart <reece@harts.net> writes:
> On Mon, 2006-08-14 at 11:27 -0400, Tom Lane wrote:
>> The usual way to get a C string from a TEXT datum is to call textout,
>> eg
>> str = DatumGetCString(DirectFunctionCall1(textout, datumval));

> Yikes!  I've been accessing VARDATA text data like Michael for years
> (code below).  I account for length and don't expect null-termination,
> but I don't use anything like Tom's suggestion above.

Sure, that works.  The problem with what Michael was doing was that he
was passing the string to elog, which expects a null-terminated string.
Possibly I should have written "usual way to get a null-terminated string"
above, just to be clear.
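
To make the elog point concrete, here is a minimal standalone sketch in
plain C, not PostgreSQL code.  The struct lp_text and the helper
lp_to_cstring are hypothetical stand-ins for a length-prefixed value and
for the null-terminated copy that textout currently yields; the real
varlena layout and palloc are deliberately not used.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a length-prefixed text value.
   This is NOT PostgreSQL's varlena layout. */
typedef struct {
    size_t len;        /* number of payload bytes */
    char   data[16];   /* payload bytes, NOT null-terminated */
} lp_text;

/* Return a malloc'd, null-terminated copy of the payload,
   or NULL if malloc fails.  The caller must free() the result. */
static char *lp_to_cstring(const lp_text *t)
{
    char *s;

    assert(t != NULL && t->len <= sizeof t->data);

    s = malloc(t->len + 1);          /* +1 for the terminator */
    if (s == NULL)
        return NULL;
    memcpy(s, t->data, t->len);      /* copy exactly len bytes */
    s[t->len] = '\0';                /* terminate explicitly */
    return s;
}
```

Code that tracks the length itself can keep working on the raw bytes; only
callers that need a "%s"-style string need the terminated copy.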

> 1) I based everything I did on examples lifted nearly verbatim from a
> 7.x manual, and I bet Michael did similarly.  I've never heard of
> DatumGetCString, DirectFunctionCall1, or textout.  Are these and other
> treasures documented somewhere?

Whose 7.x manual?  This stuff has been there since we invented the
"version 1" function call convention, which was 7.3 or before.  There
is some documentation in the SGML docs, but really we kind of expect you
to look at the standard built-in functions to see how things are done...

> 2) Does DatumGetCString(DirectFunctionCall1(textout, datumval)) do
> something other than null terminate a string?

At the moment that's all it does ... assuming that you've already
detoasted the text datum ... but it's not impossible that someday
it will do something different.  For instance, sooner or later we are
going to support multiple locales/encodings within a single database,
and I wouldn't be surprised if that involves sticking extra data into
text values.  So it's best not to assume that you know what is inside a
text datum, if possible.

> 3) Is there any reason to believe that the code below is problematic?

The only thing I'd suggest is that checking for a null return from
palloc is a waste of effort.  It doesn't return to you if it runs
out of memory.
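
As a rough illustration of why such a check never fires, here is a
standalone sketch in plain C of an allocator that reports failure by
jumping to an error handler instead of returning NULL.  The names
sketch_alloc, sketch_try_alloc and sketch_error_jmp are hypothetical, and
setjmp/longjmp only approximates the backend's error handling.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical error-recovery point, standing in for the caller's
   error handler. */
static jmp_buf sketch_error_jmp;

/* Allocate n bytes.  On failure, jump to the recovery point;
   this function never returns NULL to its caller. */
static void *sketch_alloc(size_t n)
{
    /* Treat absurdly large requests as failures without calling malloc. */
    void *p = (n > SIZE_MAX / 2) ? NULL : malloc(n ? n : 1);

    if (p == NULL)
        longjmp(sketch_error_jmp, 1);
    return p;
}

/* Return 0 if the allocation succeeded, 1 if the error path was taken. */
static int sketch_try_alloc(size_t n)
{
    void *p;

    if (setjmp(sketch_error_jmp) != 0)
        return 1;               /* reached only via longjmp */

    p = sketch_alloc(n);        /* a NULL test here would be dead code */
    assert(p != NULL);
    free(p);
    return 0;
}
```

With this shape, any "if (!p)" test after the allocation call can never be
true, which is the sense in which the check is wasted effort.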

            regards, tom lane

Re: text datum VARDATA and strings

From: Reece Hart

On Mon, 2006-08-14 at 15:51 -0400, Tom Lane wrote:
> Whose 7.x manual?  This stuff has been there since we invented the
> "version 1" function call convention, which was 7.3 or before.  There
> is some documentation in the SGML docs, but really we kind of expect you
> to look at the standard built-in functions to see how things are done...

The PostgreSQL manual.  I wrote these functions early in the 7.x series
and I don't know exactly which manual version I used.  For example,
sec. 9.5.4 of http://www.postgresql.org/docs/7.3/static/xfunc-c.html
shows code for concat_text, which I remember was the basis for what I
wrote.  I now see and understand the text regarding detoasting and the
'DatumGetXXX' macros; the relevance of these was not obvious to me at
the time.

> So it's best not to assume that you know what is inside a
> text datum, if possible.

Okay.  Does that mean that the code in 9.5.4 should have a warning to that effect?


>> 3) Is there any reason to believe that the code below is problematic?

> The only thing I'd suggest is that checking for a null return from
> palloc is a waste of effort.  It doesn't return to you if it runs
> out of memory.

Okay.  Thanks for the advice, Tom.

-Reece


-- 
Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0