Thread: Some encoding trouble via libpq

Some encoding trouble via libpq

From
"Billy Gray"
Date:
Friends,

I did a little research into the archives of this list for my
particular problem, and while I haven't found the solution, I'm
thinking that maybe I'm approaching it wrong.  If anyone has any
advice, it'd be much appreciated.

On the one hand I have a database in postgres created WITH
ENCODING='UTF8'.  On the other hand I have this C program written with
libpq-fe.h that takes data over standard input, does some checking,
and then inserts it into a table in the aforementioned database.

The trick there is properly reading off of stdin in a safe way (input
can be of variable size), especially since this data is mostly going
to be coming off of a pipe transport in exim!  Before I was doing it
sloppy with getc but it was working just fine with postgres, but now
I'm using fread.  The oddest thing about this is that the program does
just what it should when it is run on Mac OS X against postgres 8.2.3,
also on Mac OS X.  It's when I run it on CentOS 4 that I get this
error:

ERROR:  invalid byte sequence for encoding "UTF8": 0xc0f5
HINT:  This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".

Now that error message is very literal, and I've been trying to take
it at face value.  Yet, the data going in over stdin is ASCII, which
shouldn't need any conversion to UTF8!  The same data worked just fine
before on Cent OS, before I put together this new fread routine.  And
yet, that routine works just fine on Mac OS X!  Weird!

We have tried setting certain env vars (export LC_ALL=en_US.utf-8,
export PGCLIENTENCODING=utf8) to force a client encoding on the client
side, and we get this slightly different error in that instance:

ERROR:  invalid byte sequence for encoding "UTF8": 0xb0HINT:  This
error can also happen if the byte sequence does not match the encoding
expected by the server, which is controlled by "client_encoding".

Without further ado, this is the routine used to read off of stdin,
and below that is the snippet of code the does the statement execution
into postgres:

char *
readinput()
{
    char *buffer = (char *) xmalloc (STDIN_BLOCK); //xmalloc is really
malloc
    int offset = 0;
    int read = 1;
    int size = STDIN_BLOCK;

    while ( (read > 0) && (offset <= STDIN_MAX) )
    {
        syslog (LOG_DEBUG, "Reading a block...");
        read = fread (buffer + offset, 1, STDIN_BLOCK, stdin);
        offset += read;
        if (read == STDIN_BLOCK)
        {
            size += STDIN_BLOCK;
            buffer = xrealloc (buffer, size);
        }
    } // while

    // null terminate the string...
    memset(buffer + offset + 1, '\0', 1);

    syslog (LOG_DEBUG, "Read message of %d bytes", offset);

    fprintf (stderr, "Contents of the buffer:\n%s\n\n", buffer);

    return buffer;
}

Later on in my program, I set this buffer to one of my input
parameters and run an insert query:

message = readinput();

    paramValues[0] = ping_id;
    paramValues[1] = event_id;
    paramValues[2] = message; // get from std input! how do we do that
again???
    result = PQexecParams(conn,
                            "INSERT INTO event_changes (ping_id, event_id, created_at,
message) VALUES ($1, $2, NOW(), $3)",
                            3,
                            NULL, // backend figures out type itself
                            paramValues,
                            NULL, // apparently we don't need param lengths
                            NULL, // all text params
                            0 // we don't want binary results, no
                        );

I'm a little beside myself as to what I ought to try next, so if
anybody has any random hunches, they're much appreciated!  A really
interesting thing is that I've recreated my database with encoding
SQL_ASCII, which really should make it ignorant of the encoding of the
data coming in.  In that instance, I get a weird variation on the
error listed above:

ERROR:  invalid input syntax for integer: "????^?"
STATEMENT:  INSERT INTO event_changes (ping_id, event_id, created_at,
message) VALUES ($1, $2, NOW(), $3)

Thanks,
Billy


Re: Some encoding trouble via libpq

From
Tom Lane
Date:
"Billy Gray" <billy.zophar@gmail.com> writes:
>     char *buffer = (char *) xmalloc (STDIN_BLOCK); //xmalloc is really
> malloc
>     int offset = 0;
>     int read = 1;
>     int size = STDIN_BLOCK;

>     while ( (read > 0) && (offset <= STDIN_MAX) )
>     {
>         syslog (LOG_DEBUG, "Reading a block...");
>         read = fread (buffer + offset, 1, STDIN_BLOCK, stdin);
>         offset += read;
>         if (read == STDIN_BLOCK)
>         {
>             size += STDIN_BLOCK;
>             buffer = xrealloc (buffer, size);
>         }
>     } // while

This looks to me like it risks telling fread to read more bytes than
will actually fit in the buffer at the moment.  Think about what happens
if fread returns only a partial bufferload on any particular call.
I'm guessing you're clobbering memory ...

            regards, tom lane

Re: Some encoding trouble via libpq

From
Tom Lane
Date:
"William Gray" <billy.zophar@gmail.com> writes:
> ... And in the case
> that fread() pulls in less data than requested, that means the next call to
> fread() should return zero, right?

Wouldn't count on that, particularly not when reading from an
interactive device.  You are more likely to get a line per call.

What's bothering me about your code is that it assumes there are
exactly STDIN_BLOCK bytes available in the buffer when you call
fread, and the code does nothing that guarantees that.  Personally
I'd have used "size - offset" as the fread length parameter and not
had to worry.

            regards, tom lane

Re: Some encoding trouble via libpq

From
"William Gray"
Date:
Tom,

Thanks for your help, I really appreciate it!  I'm kinda new to fread (I grew up in the land of Java, so I haven't had to get this close to memory since my assembler class back in college!), so I'm not sure what you mean by a partial buffer load?  Since the first read is only for data up to the size of the initial allocation (because offset is zero the first time through), I don't think I could be going out into un-allocated memory.  And in the case that fread() pulls in less data than requested, that means the next call to fread() should return zero, right?  I've been working off the GNU manual, but seeing how things can be different depending on the OS vendor, maybe I'm making an assumption that I shouldn't be making?  Again, when I print out the contents of the buffer I get good data for large and small sets.  I'm going to try pumping it into hexdump and see if I can identify the offending character sequence.

Billy

On 3/29/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Billy Gray" <billy.zophar@gmail.com> writes:
>       char *buffer = (char *) xmalloc (STDIN_BLOCK); //xmalloc is really
> malloc
>       int offset = 0;
>       int read = 1;
>       int size = STDIN_BLOCK;

>       while ( (read > 0) && (offset <= STDIN_MAX) )
>       {
>               syslog (LOG_DEBUG, "Reading a block...");
>               read = fread (buffer + offset, 1, STDIN_BLOCK, stdin);
>               offset += read;
>               if (read == STDIN_BLOCK)
>               {
>                       size += STDIN_BLOCK;
>                       buffer = xrealloc (buffer, size);
>               }
>       } // while

This looks to me like it risks telling fread to read more bytes than
will actually fit in the buffer at the moment.  Think about what happens
if fread returns only a partial bufferload on any particular call.
I'm guessing you're clobbering memory ...

                        regards, tom lane

Re: Some encoding trouble via libpq

From
"William Gray"
Date:
Ahhh, I get it!  Thanks for pointing this out.  It ended up not being the problem I was running into, but eventually it would have snagged me (solution to original problem below).  I spent some time going over that bit of code and what you said with a coworker, and we ended up taking care of it.  So many little gotchas when doing this stuff!  Here's the snippet as of now for anyone interested:

    int size = STDIN_BLOCK + 1;
    char *buffer = (char *) xmalloc(size);
    int offset = 0;
    int read = 1;
   
    while ( (read > 0) && (offset <= STDIN_MAX) )
    {
        if (offset + STDIN_BLOCK >= size)
        {
            size += STDIN_BLOCK;
            buffer = xrealloc (buffer, size);
        }
        syslog (LOG_DEBUG, "Reading a block...");
        read = fread (buffer + offset, 1, STDIN_BLOCK, stdin);
        offset += read;
    } // while
   
    // null terminate the string...
    memset(buffer + offset + 1, '\0', 1);

Also, an update on the original problem:  It turned out with some experimenting that the problem I experienced had nothing to do with the standard input reading routine, or postgresql itself, but rather how I was re-inserting into one query data I had gotten from a previous query.  And this is maybe useful for others on the list:

I was doing a select on one table to get a foreign key id that I would use subsequently in an insert statement, like thus:

result = PQexecParams(conn,
                            "SELECT id FROM events WHERE ping_id = $1 AND serial = $2",
                            2,
                            NULL, // backend figures out type itself
                            paramValues,
                            NULL, // apparently we don't need param lengths
                            NULL, // all text params
                            0 // we don't want binary results, no
                        );
... a bunch of tuples checking, then...
event_id = PQgetvalue(result, 0, 0);  // <-- BAD GUY!
PQclear(result);  // <-- ACCOMPLICE!
... then the insert....
    paramValues[0] = ping_id;
    paramValues[1] = event_id;
    paramValues[2] = message; // get from std input! how do we do that again???
    result = PQexecParams(conn,
                            "INSERT INTO event_changes (ping_id, event_id, created_at, message) VALUES ($1, $2, NOW(), $3)",
                            3,
                            NULL, // backend figures out type itself
                            paramValues,
                            NULL, // apparently we don't need param lengths
                            NULL, // all text params
                            0 // we don't want binary results, no
                        );

This is what was causing our woes.  event_id is merely a pointer to data in result.  When you PQclear the result, it's gone!  Sort of...  Mac OS X was forgiving on some level in that the data we were pointing to was still there.  But since different operating systems manage their memory differently, Linux wasn't having any of it, it was basically pointing to garbage!

The solution was to not clear result until later, or to copy the data in allocated memory, then clear the result, which we did thusly:

    tmp_str = PQgetvalue(result, 0, 0);
    tmp_str_len = strlen(tmp_str);
    event_id = xmalloc(tmp_str_len + 1);
    strncpy(event_id, tmp_str, tmp_str_len);

Until next time,
Billy

On 3/29/07, Tom Lane < tgl@sss.pgh.pa.us> wrote:
"William Gray" < billy.zophar@gmail.com> writes:
> ... And in the case
> that fread() pulls in less data than requested, that means the next call to
> fread() should return zero, right?

Wouldn't count on that, particularly not when reading from an
interactive device.  You are more likely to get a line per call.

What's bothering me about your code is that it assumes there are
exactly STDIN_BLOCK bytes available in the buffer when you call
fread, and the code does nothing that guarantees that.  Personally
I'd have used "size - offset" as the fread length parameter and not
had to worry.

                        regards, tom lane

Re: Some encoding trouble via libpq

From
Martijn van Oosterhout
Date:
On Fri, Mar 30, 2007 at 11:43:08AM -0400, William Gray wrote:
> The solution was to not clear result until later, or to copy the data in
> allocated memory, then clear the result, which we did thusly:
>
>    tmp_str = PQgetvalue(result, 0, 0);
>    tmp_str_len = strlen(tmp_str);
>    event_id = xmalloc(tmp_str_len + 1);
>    strncpy(event_id, tmp_str, tmp_str_len);

Glad you got your problem solved. What you've done here is so common
there's even a standard function for it: strdup(). So what you did is
equivalent to:

event_id = strdup( PQgetvalue(result, 0, 0) );

You obviously need to free it later, but this gives you the right
idea...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Attachment