Thread: Some encoding trouble via libpq
Friends, I did a little research into the archives of this list for my particular problem, and while I haven't found the solution, I'm thinking that maybe I'm approaching it wrong. If anyone has any advice, it'd be much appreciated. On the one hand I have a database in postgres created WITH ENCODING='UTF8'. On the other hand I have this C program written with libpq-fe.h that takes data over standard input, does some checking, and then inserts it into a table in the aforementioned database. The trick there is properly reading off of stdin in a safe way (input can be of variable size), especially since this data is mostly going to be coming off of a pipe transport in exim! Before I was doing it sloppy with getc but it was working just fine with postgres, but now I'm using fread. The oddest thing about this is that the program does just what it should when it is run on Mac OS X against postgres 8.2.3, also on Mac OS X. It's when I run it on CentOS 4 that I get this error: ERROR: invalid byte sequence for encoding "UTF8": 0xc0f5 HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding". Now that error message is very literal, and I've been trying to take it at face value. Yet, the data going in over stdin is ASCII, which shouldn't need any conversion to UTF8! The same data worked just fine before on Cent OS, before I put together this new fread routine. And yet, that routine works just fine on Mac OS X! Weird! We have tried setting certain env vars (export LC_ALL=en_US.utf-8, export PGCLIENTENCODING=utf8) to force a client encoding on the client side, and we get this slightly different error in that instance: ERROR: invalid byte sequence for encoding "UTF8": 0xb0HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding". Without further ado, this is the routine used to read off of stdin, and below that is the snippet of code the does the statement execution into postgres: char * readinput() { char *buffer = (char *) xmalloc (STDIN_BLOCK); //xmalloc is really malloc int offset = 0; int read = 1; int size = STDIN_BLOCK; while ( (read > 0) && (offset <= STDIN_MAX) ) { syslog (LOG_DEBUG, "Reading a block..."); read = fread (buffer + offset, 1, STDIN_BLOCK, stdin); offset += read; if (read == STDIN_BLOCK) { size += STDIN_BLOCK; buffer = xrealloc (buffer, size); } } // while // null terminate the string... memset(buffer + offset + 1, '\0', 1); syslog (LOG_DEBUG, "Read message of %d bytes", offset); fprintf (stderr, "Contents of the buffer:\n%s\n\n", buffer); return buffer; } Later on in my program, I set this buffer to one of my input parameters and run an insert query: message = readinput(); paramValues[0] = ping_id; paramValues[1] = event_id; paramValues[2] = message; // get from std input! how do we do that again??? result = PQexecParams(conn, "INSERT INTO event_changes (ping_id, event_id, created_at, message) VALUES ($1, $2, NOW(), $3)", 3, NULL, // backend figures out type itself paramValues, NULL, // apparently we don't need param lengths NULL, // all text params 0 // we don't want binary results, no ); I'm a little beside myself as to what I ought to try next, so if anybody has any random hunches, they're much appreciated! A really interesting thing is that I've recreated my database with encoding SQL_ASCII, which really should make it ignorant of the encoding of the data coming in. In that instance, I get a weird variation on the error listed above: ERROR: invalid input syntax for integer: "????^?" STATEMENT: INSERT INTO event_changes (ping_id, event_id, created_at, message) VALUES ($1, $2, NOW(), $3) Thanks, Billy
"Billy Gray" <billy.zophar@gmail.com> writes: > char *buffer = (char *) xmalloc (STDIN_BLOCK); //xmalloc is really > malloc > int offset = 0; > int read = 1; > int size = STDIN_BLOCK; > while ( (read > 0) && (offset <= STDIN_MAX) ) > { > syslog (LOG_DEBUG, "Reading a block..."); > read = fread (buffer + offset, 1, STDIN_BLOCK, stdin); > offset += read; > if (read == STDIN_BLOCK) > { > size += STDIN_BLOCK; > buffer = xrealloc (buffer, size); > } > } // while This looks to me like it risks telling fread to read more bytes than will actually fit in the buffer at the moment. Think about what happens if fread returns only a partial bufferload on any particular call. I'm guessing you're clobbering memory ... regards, tom lane
"William Gray" <billy.zophar@gmail.com> writes: > ... And in the case > that fread() pulls in less data than requested, that means the next call to > fread() should return zero, right? Wouldn't count on that, particularly not when reading from an interactive device. You are more likely to get a line per call. What's bothering me about your code is that it assumes there are exactly STDIN_BLOCK bytes available in the buffer when you call fread, and the code does nothing that guarantees that. Personally I'd have used "size - offset" as the fread length parameter and not had to worry. regards, tom lane
Tom,
Thanks for your help, I really appreciate it! I'm kinda new to fread (I grew up in the land of Java, so I haven't had to get this close to memory since my assembler class back in college!), so I'm not sure what you mean by a partial buffer load? Since the first read is only for data up to the size of the initial allocation (because offset is zero the first time through), I don't think I could be going out into un-allocated memory. And in the case that fread() pulls in less data than requested, that means the next call to fread() should return zero, right? I've been working off the GNU manual, but seeing how things can be different depending on the OS vendor, maybe I'm making an assumption that I shouldn't be making? Again, when I print out the contents of the buffer I get good data for large and small sets. I'm going to try pumping it into hexdump and see if I can identify the offending character sequence.
Billy
Thanks for your help, I really appreciate it! I'm kinda new to fread (I grew up in the land of Java, so I haven't had to get this close to memory since my assembler class back in college!), so I'm not sure what you mean by a partial buffer load? Since the first read is only for data up to the size of the initial allocation (because offset is zero the first time through), I don't think I could be going out into un-allocated memory. And in the case that fread() pulls in less data than requested, that means the next call to fread() should return zero, right? I've been working off the GNU manual, but seeing how things can be different depending on the OS vendor, maybe I'm making an assumption that I shouldn't be making? Again, when I print out the contents of the buffer I get good data for large and small sets. I'm going to try pumping it into hexdump and see if I can identify the offending character sequence.
Billy
On 3/29/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Billy Gray" <billy.zophar@gmail.com> writes:
> char *buffer = (char *) xmalloc (STDIN_BLOCK); //xmalloc is really
> malloc
> int offset = 0;
> int read = 1;
> int size = STDIN_BLOCK;
> while ( (read > 0) && (offset <= STDIN_MAX) )
> {
> syslog (LOG_DEBUG, "Reading a block...");
> read = fread (buffer + offset, 1, STDIN_BLOCK, stdin);
> offset += read;
> if (read == STDIN_BLOCK)
> {
> size += STDIN_BLOCK;
> buffer = xrealloc (buffer, size);
> }
> } // while
This looks to me like it risks telling fread to read more bytes than
will actually fit in the buffer at the moment. Think about what happens
if fread returns only a partial bufferload on any particular call.
I'm guessing you're clobbering memory ...
regards, tom lane
Ahhh, I get it! Thanks for pointing this out. It ended up not being the problem I was running into, but eventually it would have snagged me (solution to original problem below). I spent some time going over that bit of code and what you said with a coworker, and we ended up taking care of it. So many little gotchas when doing this stuff! Here's the snippet as of now for anyone interested:
int size = STDIN_BLOCK + 1;
char *buffer = (char *) xmalloc(size);
int offset = 0;
int read = 1;
while ( (read > 0) && (offset <= STDIN_MAX) )
{
if (offset + STDIN_BLOCK >= size)
{
size += STDIN_BLOCK;
buffer = xrealloc (buffer, size);
}
syslog (LOG_DEBUG, "Reading a block...");
read = fread (buffer + offset, 1, STDIN_BLOCK, stdin);
offset += read;
} // while
// null terminate the string...
memset(buffer + offset + 1, '\0', 1);
Also, an update on the original problem: It turned out with some experimenting that the problem I experienced had nothing to do with the standard input reading routine, or postgresql itself, but rather how I was re-inserting into one query data I had gotten from a previous query. And this is maybe useful for others on the list:
I was doing a select on one table to get a foreign key id that I would use subsequently in an insert statement, like thus:
result = PQexecParams(conn,
"SELECT id FROM events WHERE ping_id = $1 AND serial = $2",
2,
NULL, // backend figures out type itself
paramValues,
NULL, // apparently we don't need param lengths
NULL, // all text params
0 // we don't want binary results, no
);
... a bunch of tuples checking, then...
event_id = PQgetvalue(result, 0, 0); // <-- BAD GUY!
PQclear(result); // <-- ACCOMPLICE!
... then the insert....
paramValues[0] = ping_id;
paramValues[1] = event_id;
paramValues[2] = message; // get from std input! how do we do that again???
result = PQexecParams(conn,
"INSERT INTO event_changes (ping_id, event_id, created_at, message) VALUES ($1, $2, NOW(), $3)",
3,
NULL, // backend figures out type itself
paramValues,
NULL, // apparently we don't need param lengths
NULL, // all text params
0 // we don't want binary results, no
);
This is what was causing our woes. event_id is merely a pointer to data in result. When you PQclear the result, it's gone! Sort of... Mac OS X was forgiving on some level in that the data we were pointing to was still there. But since different operating systems manage their memory differently, Linux wasn't having any of it, it was basically pointing to garbage!
The solution was to not clear result until later, or to copy the data in allocated memory, then clear the result, which we did thusly:
tmp_str = PQgetvalue(result, 0, 0);
tmp_str_len = strlen(tmp_str);
event_id = xmalloc(tmp_str_len + 1);
strncpy(event_id, tmp_str, tmp_str_len);
Until next time,
Billy
int size = STDIN_BLOCK + 1;
char *buffer = (char *) xmalloc(size);
int offset = 0;
int read = 1;
while ( (read > 0) && (offset <= STDIN_MAX) )
{
if (offset + STDIN_BLOCK >= size)
{
size += STDIN_BLOCK;
buffer = xrealloc (buffer, size);
}
syslog (LOG_DEBUG, "Reading a block...");
read = fread (buffer + offset, 1, STDIN_BLOCK, stdin);
offset += read;
} // while
// null terminate the string...
memset(buffer + offset + 1, '\0', 1);
Also, an update on the original problem: It turned out with some experimenting that the problem I experienced had nothing to do with the standard input reading routine, or postgresql itself, but rather how I was re-inserting into one query data I had gotten from a previous query. And this is maybe useful for others on the list:
I was doing a select on one table to get a foreign key id that I would use subsequently in an insert statement, like thus:
result = PQexecParams(conn,
"SELECT id FROM events WHERE ping_id = $1 AND serial = $2",
2,
NULL, // backend figures out type itself
paramValues,
NULL, // apparently we don't need param lengths
NULL, // all text params
0 // we don't want binary results, no
);
... a bunch of tuples checking, then...
event_id = PQgetvalue(result, 0, 0); // <-- BAD GUY!
PQclear(result); // <-- ACCOMPLICE!
... then the insert....
paramValues[0] = ping_id;
paramValues[1] = event_id;
paramValues[2] = message; // get from std input! how do we do that again???
result = PQexecParams(conn,
"INSERT INTO event_changes (ping_id, event_id, created_at, message) VALUES ($1, $2, NOW(), $3)",
3,
NULL, // backend figures out type itself
paramValues,
NULL, // apparently we don't need param lengths
NULL, // all text params
0 // we don't want binary results, no
);
This is what was causing our woes. event_id is merely a pointer to data in result. When you PQclear the result, it's gone! Sort of... Mac OS X was forgiving on some level in that the data we were pointing to was still there. But since different operating systems manage their memory differently, Linux wasn't having any of it, it was basically pointing to garbage!
The solution was to not clear result until later, or to copy the data in allocated memory, then clear the result, which we did thusly:
tmp_str = PQgetvalue(result, 0, 0);
tmp_str_len = strlen(tmp_str);
event_id = xmalloc(tmp_str_len + 1);
strncpy(event_id, tmp_str, tmp_str_len);
Until next time,
Billy
On 3/29/07, Tom Lane < tgl@sss.pgh.pa.us> wrote:
"William Gray" < billy.zophar@gmail.com> writes:
> ... And in the case
> that fread() pulls in less data than requested, that means the next call to
> fread() should return zero, right?
Wouldn't count on that, particularly not when reading from an
interactive device. You are more likely to get a line per call.
What's bothering me about your code is that it assumes there are
exactly STDIN_BLOCK bytes available in the buffer when you call
fread, and the code does nothing that guarantees that. Personally
I'd have used "size - offset" as the fread length parameter and not
had to worry.
regards, tom lane
On Fri, Mar 30, 2007 at 11:43:08AM -0400, William Gray wrote: > The solution was to not clear result until later, or to copy the data in > allocated memory, then clear the result, which we did thusly: > > tmp_str = PQgetvalue(result, 0, 0); > tmp_str_len = strlen(tmp_str); > event_id = xmalloc(tmp_str_len + 1); > strncpy(event_id, tmp_str, tmp_str_len); Glad you got your problem solved. What you've done here is so common there's even a standard function for it: strdup(). So what you did is equivalent to: event_id = strdup( PQgetvalue(result, 0, 0) ); You obviously need to free it later, but this gives you the right idea... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.