Thread: Re: Proposal to add a new URL data type.

Re: Proposal to add a new URL data type.

From
Peter Eisentraut
Date:
On 05.12.24 15:01, Alexander Borisov wrote:
> Postgres users often store URLs in the database.  As an example, they
> provide links to their pages on the web, analyze users posts and get
> links for further storage and analysis.  Naturally, there is a need to
> compare, sort, build indexes, get statistics on individual parts of
> the URL: hosts, protocols, path and so on.
> 
> Adding a new URL type will standardize the work with URLs in Postgres
> and provide simple tools for manipulating the new type.

Have you seen this: https://github.com/petere/pguri ?

The difference there is that it uses an external library for parsing the 
URLs, which makes the code much smaller.  Overall, the functionality 
looks pretty similar to yours at first glance.




Re: Proposal to add a new URL data type.

From
Ziga
Date:
On 05/12/2024 15:59, Peter Eisentraut wrote:
> On 05.12.24 15:01, Alexander Borisov wrote:
>> Postgres users often store URLs in the database.  As an example, they
>> provide links to their pages on the web, analyze users posts and get
>> links for further storage and analysis.  Naturally, there is a need to
>> compare, sort, build indexes, get statistics on individual parts of
>> the URL: hosts, protocols, path and so on.
>>
>> Adding a new URL type will standardize the work with URLs in Postgres
>> and provide simple tools for manipulating the new type.
>
> Have you seen this: https://github.com/petere/pguri ?
>
> The difference there is that it uses an external library for parsing 
> the URLs, which makes the code much smaller.  Overall, the 
> functionality looks pretty similar to yours at first glance.

+1 for adding this to contrib





Re: Proposal to add a new URL data type.

From
Alexander Borisov
Date:
05.12.2024 17:59, Peter Eisentraut пишет:
> On 05.12.24 15:01, Alexander Borisov wrote:
>> Postgres users often store URLs in the database.  As an example, they
>> provide links to their pages on the web, analyze users posts and get
>> links for further storage and analysis.  Naturally, there is a need to
>> compare, sort, build indexes, get statistics on individual parts of
>> the URL: hosts, protocols, path and so on.
>>
>> Adding a new URL type will standardize the work with URLs in Postgres
>> and provide simple tools for manipulating the new type.
> 
> Have you seen this: https://github.com/petere/pguri ?
> 
> The difference there is that it uses an external library for parsing the 
> URLs, which makes the code much smaller.  Overall, the functionality 
> looks pretty similar to yours at first glance.
> 

Hi Peter,

I looked at your implementation of the URI extension.  You are using
a third party library uriparser for URI/URL parsing.  The uriparser
library is based on the RFC 3986 specification, which I cite in
comparison to WHATWG in my email.  The name of the functions to get the
individual parts of the URL will naturally be the same, that's how
URLs/URIs work. But you have to look at the point and perspective here.

As I've written before, there is a difference between parsing URLs
according to the RFC 3986 specification and WHATWG URLs.  This is
especially true for host.  Here are a couple more examples.

The specifications describe character encoding differently, this
applies to userinfo (username/password), path, query, fragment.
RFC 3986 standard encodes characters in the same way everywhere.
The characters to encode are: < > “ ` \r \n \t { } | \ ^ '.
The WHATWG URL standard takes a more selective and subtle approach.
For example, path may contain the character “|”, but userinfo does
not (encoded).

The WHATWG specification also requires that tabs and newlines be removed
from URLs before parsing.  The WHATWG character encoding gradation can
be found in the specification [1].

In addition to functions to retrieve individual parts, the WHATWG URL
standard describes an API for changing them: scheme, userinfo, host,
hostname, port, path, query, fragment.  There is not just one value is
replaced by another, there is a certain logic, which is not always
obvious.  For example, try to replace scheme, let's take the URL for
example: https://example.com/.
This URL contains special scheme (any others are not special), there
are only six of them: ftp, file, http, https, ws, wss.  And it is
impossible to replace scepial scheme with a non-special one.  More
precisely, the URL will be returned with special scheme, i.e. without
changes.  This is how you can check it with the patch I have given:

Example: select url_scheme_set('https://example.com/'::url, 'wss');
Result: wss://example.com/

Example: select url_scheme_set('https://example.com/'::url, 'myown');
Result: https://example.com/

Example: select url_scheme_set('best://example.com/'::url, 'myown');
Result: myown://example.com/

Example: select url_scheme_set('best://example.com/'::url, 'https');
Result: best://example.com/

In addition, WHATWG validates URLs during parsing and reports
non-critical errors [2].  If such errors occur, parsing continues.
However, in my implementation I don't output these errors, I just
haven't figured out how to do it correctly in SQL (as NOTICE?).

Without going further into the differences in specifications I could
say simply - RFC 3986 is obsolete, for example, node.js has labeled
the API with it as Legacy (they use WHATWG).

If we abstract from specifications and consider our approaches in
implementation.  You parse the URL every time for any request
(even to retrieve fragments).  In my implementation I proceed from
the fact that requests to read URLs will significantly exceed their
changes. Parsing is done once on input, the result is saved in
a special format and later the necessary parts of the URL
(or the whole URL) are retrieved. Also please note that there are
no dependencies on third-party libraries (ICU dependencies can also
be fought off).

All currently available functions and examples can be seen in
the README file, in the patch.

[1] https://url.spec.whatwg.org/#c0-control-percent-encode-set
[2] https://url.spec.whatwg.org/#writing


--
Alexander Borisov



Re: Proposal to add a new URL data type.

From
Daniel Gustafsson
Date:
> On 6 Dec 2024, at 13:59, Alexander Borisov <lex.borisov@gmail.com> wrote:

> As I've written before, there is a difference between parsing URLs
> according to the RFC 3986 specification and WHATWG URLs.  This is
> especially true for host.  Here are a couple more examples.

As someone who wears another open-source hat which is heavily involved in
parsing URLs I cannot stress enough how much I think postgres should avoid
this.  The example url http://http://http://@http://http://?http://#http:// is
a valid url, but is rejected by a number of implementations and parsed
differently by most that accept it.

A URL datatype is a *good idea* but one which I personally believe is best
handled as an external extension.

--
Daniel Gustafsson




Re: Proposal to add a new URL data type.

From
Alexander Borisov
Date:
Hi Daniel,

06.12.2024 16:46, Daniel Gustafsson пишет:
>> On 6 Dec 2024, at 13:59, Alexander Borisov <lex.borisov@gmail.com> wrote:
> 
>> As I've written before, there is a difference between parsing URLs
>> according to the RFC 3986 specification and WHATWG URLs.  This is
>> especially true for host.  Here are a couple more examples.
> 
> As someone who wears another open-source hat which is heavily involved in
> parsing URLs I cannot stress enough how much I think postgres should avoid
> this.  The example url http://http://http://@http://http://?http://#http:// is
> a valid url, but is rejected by a number of implementations and parsed
> differently by most that accept it.

Your example is valid, yes, it looks scary, t might catch someone off
guard.  At the same time your URL is correctly parsed both RFC 3986
and WHATWG URL.
There are many examples of “scary” URLs that you can't even understand
how they are parsed.  You can write a URL with any intimidating host,
path, scheme, but that's not what I mean.
There are generally accepted standards for URL/URI parsing RFC 3986 and
WHATWG URL.  We are not talking about self-written implementations
(without relying on any specifications) or those who made a mistake
while implementing one of the standards.
I propose to implement support for one of the standards that looks
promising.  On the contrary, everything is quite clear.  All we need to
do is point out that we have a URL data type in extension by WHATWG
specification.  I would even say that by creating a new type we will
contribute to the standardization of this zoo.

It's about creating a new URL data type according to the
specification WHATWG and including it in contrib as an extension.


--
Alexander Borisov



Re: Proposal to add a new URL data type.

From
Robert Haas
Date:
On Fri, Dec 6, 2024 at 8:46 AM Daniel Gustafsson <daniel@yesql.se> wrote:
> A URL datatype is a *good idea* but one which I personally believe is best
> handled as an external extension.

+1. I don't think that it's an intrinsically bad idea to have a URL
data type in the core distribution, but no compelling reason for
putting it there has been proposed. Alexander's effort can equally
well live in github or pgxn or whatever and people who want to use it
still can. Also, it's pretty clear from reading this thread that
there's more than one way to do this and not everybody agrees with or
likes the particular thing Alexander has done. For an out-of-core
extension, that's perfectly fine, and even good: different people want
different things, and that's OK. For something that's part of
PostgreSQL itself, it's a big problem. There's no reason that we
should privilege one implementation over others, and we certainly do
not want the core project to have to maintain and ship multiple
versions of things like this.

--
Robert Haas
EDB: http://www.enterprisedb.com