text datum VARDATA and strings - Mailing list pgsql-general

From Reece Hart
Subject text datum VARDATA and strings
Date
Msg-id 1155578671.4158.45.camel@tallac.gene.com
List pgsql-general
Michael Enke recently asked in pgsql-bugs about VARDATA and C strings
(BUG #2574: C function: arg TEXT data corrupt).  Since that's not a bug,
I've moved this follow-up to pgsql-general.


On Mon, 2006-08-14 at 11:27 -0400, Tom Lane wrote:
> The usual way to get a C string from a TEXT datum is to call textout,
> eg
>         str = DatumGetCString(DirectFunctionCall1(textout, datumval));


Yikes!  I've been accessing VARDATA text data like Michael for years
(code below).  I account for length and don't expect null-termination,
but I don't use anything like Tom's suggestion above.  (I always try to
do what Tom says because that usually hurts less.)
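
To make the length-versus-terminator point concrete, here is a minimal
standalone sketch. It does not use PostgreSQL at all: fake_text and
fake_text_to_cstring are hypothetical names invented for illustration, and
the struct is only a simplified model of a length-prefixed datum, not the
server's real varlena layout. It shows why the payload cannot be handed to
strlen() or printf("%s") directly, and what a copy-and-terminate step looks
like.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a TEXT datum: a byte count followed by
   payload bytes with no terminating NUL.  This is NOT PostgreSQL's
   actual varlena layout, only a model of the same idea. */
typedef struct {
    int  len;        /* payload length in bytes */
    char data[16];   /* payload, not NUL-terminated */
} fake_text;

/* Copy the payload into a fresh buffer and append the terminator.
   Returns malloc'd memory (caller frees), or NULL on allocation failure. */
static char *fake_text_to_cstring(const fake_text *t)
{
    char *s;

    assert(t->len >= 0 && (size_t) t->len <= sizeof t->data);

    s = malloc((size_t) t->len + 1);
    if (s == NULL)
        return NULL;
    memcpy(s, t->data, (size_t) t->len);
    s[t->len] = '\0';   /* the only byte the payload itself lacks */
    return s;
}
```

If data holds "ACGT" followed by junk bytes and len is 4, the function
returns the four-character C string "ACGT"; reading data directly as a C
string would run on into the junk.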


I have three questions:

1) I based everything I did on examples lifted nearly verbatim from a
7.x manual, and I bet Michael did similarly.  I've never heard of
DatumGetCString, DirectFunctionCall1, or textout.  Are these and other
treasures documented somewhere?

2) Does DatumGetCString(DirectFunctionCall1(textout, datumval)) do
something other than null terminate a string?  All of the strings are
from [-A-Z0-1*]; server_encoding has been either SQL_ASCII or UTF8 in
case that's relevant.

3) Is there any reason to believe that the code below is problematic?

Thanks,
Reece




#include <postgres.h>
#include <fmgr.h>
#include <ctype.h>
#include <string.h>

static char* clean_sequence(const char* in, int32 n);

PG_FUNCTION_INFO_V1(pg_clean_sequence);
Datum pg_clean_sequence(PG_FUNCTION_ARGS)
  {
  text* t0;                                 /* in */
  text* t1;                                 /* out */
  char* tmp;
  int32 tmpl;

  if ( PG_ARGISNULL(0) )
    { PG_RETURN_NULL(); }

  t0 = PG_GETARG_TEXT_P(0);

  tmp = clean_sequence( VARDATA(t0), VARSIZE(t0)-VARHDRSZ );
  tmpl = (int32) strlen(tmp);

  /* copy temp sequence into new pg variable */
  t1 = (text*) palloc( tmpl + VARHDRSZ );
  if (!t1)
    { elog( ERROR, "couldn't palloc (%d bytes)", tmpl+VARHDRSZ ); }
  memcpy(VARDATA(t1),tmp,tmpl);
  VARATT_SIZEP(t1) = tmpl + VARHDRSZ;

  pfree(tmp);

  PG_RETURN_TEXT_P(t1);
  }



/* clean_sequence -- strip non-IUPAC symbols
   The intent is to strip non-sequence data which might result from
   copy-pasting a fasta file or some such.

   in: char*, length
   out: char*, |out|<=length, NUL-terminated
   out is palloc'd memory; caller must free

   allow chars from IUPAC std 20
   + selenocysteine (U) + ambiguity (BZX) + gap (-) + stop (*)
*/

#define isseq(c) ( ((c)>='A' && (c)<='Z' && (c)!='J' && (c)!='O') \
                   || ((c)=='-') \
                   || ((c)=='*') )

char* clean_sequence(const char* in, int32 n) {
  char* out;
  char* oi;
  int32 i;

  out = palloc( n + 1 );        /* w/null */
  if (!out)
    { elog( ERROR, "couldn't palloc (%d bytes)", n+1 ); }

  for( i=0, oi=out; i<=n-1; i++ ) {
    /* cast first: toupper() is undefined for negative char values */
    char c = toupper((unsigned char) in[i]);
    if ( isseq(c) ) {
      *oi++ = c;
    }
  }
  *oi = '\0';
  return(out);
}
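
For anyone who wants to exercise the filter outside the server, here is a
rough standalone variant. It is a sketch under stated assumptions, not the
code above: malloc stands in for palloc, the function returns NULL on
allocation failure instead of calling elog, and the names is_seq_char and
clean_sequence_standalone are made up for this example. The filtering rule
is meant to match isseq().

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Same rule as the isseq() macro, written as a function so the
   argument is evaluated only once. */
static int is_seq_char(int c)
{
    return (c >= 'A' && c <= 'Z' && c != 'J' && c != 'O')
           || c == '-'
           || c == '*';
}

/* Hypothetical standalone counterpart of clean_sequence().
   in:  n bytes, not necessarily NUL-terminated
   out: malloc'd, NUL-terminated, at most n characters; caller frees.
   Returns NULL if malloc fails. */
static char *clean_sequence_standalone(const char *in, size_t n)
{
    char  *out = malloc(n + 1);   /* room for the terminator */
    char  *oi = out;
    size_t i;

    if (out == NULL)
        return NULL;

    for (i = 0; i < n; i++) {
        /* cast first: toupper() is undefined for negative char values */
        int c = toupper((unsigned char) in[i]);
        if (is_seq_char(c))
            *oi++ = (char) c;
    }
    *oi = '\0';
    return out;
}
```

With the 10-byte input "mk v\n1*-jo" this yields "MKV*-": letters are
upcased, whitespace and digits are dropped, and J and O are rejected. Note
that letters in a fasta header line would pass the filter too, since the
rule looks at characters one at a time.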


--
Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0

