Thread: Error inserting RFC1738-encoded URLs

Error inserting RFC1738-encoded URLs

From
Javier Amor garcia
Date:
Hello,
sometimes I get encoding errors when inserting an encoded URL into a
text field.

The database uses UTF8, with both collation and ctype set to
en_US.UTF-8, and the URL field itself is defined as VARCHAR(1024). If
the URL is longer than that, the software truncates it.

The inserted URL is extracted from the log file of the Squid proxy,
which is encoded in UTF8.

The URL is encoded with RFC 1738 encoding of all non-ASCII characters in
the path and query sections, and Punycode encoding of characters in the
host (authority) section.
RFC 1738 -> http://www.ietf.org/rfc/rfc1738.txt
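
For illustration, here is a minimal Python sketch of the percent-encoding
described above. It assumes the URL is already available as correctly
decoded text. The function name escape_non_ascii and the chosen set of
"safe" characters are assumptions made for this example, not what Squid
does internally, and the host part is left untouched.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left as-is. '%' is included so that escapes already present
# in the URL (such as %20) are not encoded a second time.
SAFE = "/%:@!$&'()*+,;=?~-._"

def escape_non_ascii(url: str) -> str:
    """Percent-encode non-ASCII characters in the path and query as UTF-8."""
    parts = urlsplit(url)
    path = quote(parts.path, safe=SAFE)    # str input is encoded as UTF-8
    query = quote(parts.query, safe=SAFE)
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))

if __name__ == "__main__":
    # One of the failing URLs from this message, shortened to its path.
    print(escape_non_ascii("http://www.t-a-o.com/ES/moda-bebe-nino/pantal\u00f3n/flash/zoom.swf"))
    # http://www.t-a-o.com/ES/moda-bebe-nino/pantal%C3%B3n/flash/zoom.swf
```

The result contains only ASCII, so it is valid in any ASCII-compatible
encoding, UTF8 included.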


Examples of URLs that raise the error:


http://www.formacion.aimplas.es/_Documentos/2011/FORMACIÓN%20ABIERTA/Folleto%20Especialistas%20Universitarios%20Polímeros%20ok.pdf


http://ads.prisacom.com/RealMedia/ads/adstream_mjx.ads/www.elpais.es/edicionimpresa/deportes/articulos/1452867580@Middle,Middle1,Top,Top2,TopRight,x02,x20?search=VUELTA%20A%20ESPAÑA,Ciclismo,Deportes


http://ads.prisacom.com/RealMedia/ads/adstream_nx.ads/www.elpais.es/edicionimpresa/deportes/articulos/1452867580@Middle,Middle1,Top,Top2,TopRight,x02,x20!Middle?search=VUELTA%20A%20ESPAÑA,Ciclismo,Deportes

http://www.t-a-o.com/ES/moda-bebe-nino/pantalón/flash/zoom.swf?image_lien=52905_C1057_A_zoom.jpg&lang=ES


http://static.slidesharecdn.com/swf/menu.swf?embedCode=<div%20style="width:425px"%20id="__ss_1320169">%20<strong%20style="display:block;margin:12px%200%204px"><a%20href="http://www.slideshare.net/raimonesteve/que-es-openerp"%20title="¿Que%20es%20Openerp?"%20target="_blank">¿Que%20es%20Openerp?</a></strong>%20<iframe%20src="http://www.slideshare.net/slideshow/embed_code/1320169"%20width="425"%20height="355"%20frameborder="0"%20marginwidth="0"%20marginheight="0"%20scrolling="no"></iframe>%20<div%20style="padding:5px%200%2012px">%20View%20more%20<a%20href="http://www.slideshare.net/"%20target="_blank">presentations</a>%20from%20<a%20href="http://www.slideshare.net/raimonesteve"%20target="_blank">raimonesteve</a>%20</div>%20</div>&showID=1320169&showURL=http://www.slideshare.net/raimonesteve/que-es-openerp

---------------- End URL examples --------------------------------

Does anyone know what I must do to safely insert any HTTP URL?

Thanks for your time,
Javier

Re: Error inserting RFC1738-encoded URLs

From
Javier Amor Garcia
Date:
Thanks for your reply, Marti. I will try to complete the description of
the problem.

I get errors like this:

  Error inserting data: INSERT INTO squid_access ( bytes, event,
elapsed, rfc931, timestamp, url, method, peer, mimetype, remotehost,
code) VALUES ( ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
ERROR:  invalid byte sequence for encoding "UTF8": 0xe97469
HINT:  This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".


I am sending the data to the database through the Perl DBI module.
Tomorrow I will make sure that the data is read as UTF8, but I think I
had already tried to encode it as UTF8 (not sure, it was some weeks
ago).
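
As a quick check of the byte sequence 0xe97469 from the error message,
the following Python sketch (Python is used here only for illustration;
the application itself uses Perl) tests whether those three bytes are
valid UTF-8 and what they look like under Latin-1. Latin-1 is only an
assumption about the source encoding; the helper name is_valid_utf8 is
made up for this example.

```python
# The three bytes reported in the PostgreSQL error message.
bad = bytes.fromhex("e97469")

def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# 0xE9 announces a three-byte UTF-8 sequence, but 0x74 ("t") is not a
# continuation byte, so the sequence is rejected.
print(is_valid_utf8(bad))               # False

# Under Latin-1 (assumed), the same bytes are U+00E9 ("é") followed by "ti".
as_latin1 = bad.decode("latin-1")

# The same text, correctly encoded as UTF-8, takes four bytes.
print(as_latin1.encode("utf-8").hex())  # c3a97469
```

If the bytes really are Latin-1, this would be consistent with the data
reaching the server in a single-byte encoding rather than in UTF8.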

Anyway, given this new information, do you still think the problem is
that the data is sent in the wrong encoding?

Regards,
Javier



On 10/24/2011 01:37 PM, Marti Raudsepp wrote:
> On Mon, Oct 24, 2011 at 10:27, Javier Amor garcia<jamor@zentyal.com>  wrote:
>> sometimes I get encoding errors when inserting an encoded URL into a text
>> field.
>
> You forgot the most important thing: *What's* the error that you get?
>
>> http://www.formacion.aimplas.es/_Documentos/2011/FORMACIÓN%20ABIERTA/Folleto%20Especialistas%20Universitarios%20Polímeros%20ok.pdf
>
> Since I have to guess, I suspect you're sending these strings to
> Postgres in a non-UTF-8 encoding.
>
> This isn't a valid URL anyway -- you can't have unquoted "Ó" or "í"
> characters since they're not valid ASCII. But a 'varchar' field would
> accept them anyway if you send them in the right encoding.
>
> Regards,
> Marti


Re: Error inserting RFC1738-encoded URLs

From
Marti Raudsepp
Date:
On Mon, Oct 24, 2011 at 10:27, Javier Amor garcia <jamor@zentyal.com> wrote:
> sometimes I get encoding errors when inserting an encoded URL into a text
> field.

You forgot the most important thing: *What's* the error that you get?

> http://www.formacion.aimplas.es/_Documentos/2011/FORMACIÓN%20ABIERTA/Folleto%20Especialistas%20Universitarios%20Polímeros%20ok.pdf

Since I have to guess, I suspect you're sending these strings to
Postgres in a non-UTF-8 encoding.

This isn't a valid URL anyway -- you can't have unquoted "Ó" or "í"
characters since they're not valid ASCII. But a 'varchar' field would
accept them anyway if you send them in the right encoding.
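
One way to make sure the strings are sent "in the right encoding" is to
normalize each log line to UTF-8 before it is passed to the database. The
Python sketch below is only an illustration under stated assumptions: the
helper name to_text is hypothetical, and Latin-1 as the fallback encoding
is a guess about where the non-UTF-8 bytes come from.

```python
def to_text(raw: bytes) -> str:
    """Decode raw log bytes as UTF-8, falling back to Latin-1 (assumed)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this never fails.
        # Whether the result is the intended text depends on the real source.
        return raw.decode("latin-1")

# A path fragment as raw Latin-1 bytes (0xF3 is "ó" in Latin-1).
raw_latin1 = b"pantal\xf3n"

# After normalizing, re-encoding always yields valid UTF-8 for the server.
print(to_text(raw_latin1).encode("utf-8"))  # b'pantal\xc3\xb3n'
```

Input that is already valid UTF-8 passes through unchanged, so the helper
can be applied to every line regardless of its origin.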

Regards,
Marti