Thread: Re: Proposal to add a new URL data type.
On 05.12.24 15:01, Alexander Borisov wrote: > Postgres users often store URLs in the database. As an example, they > provide links to their pages on the web, analyze users posts and get > links for further storage and analysis. Naturally, there is a need to > compare, sort, build indexes, get statistics on individual parts of > the URL: hosts, protocols, path and so on. > > Adding a new URL type will standardize the work with URLs in Postgres > and provide simple tools for manipulating the new type. Have you seen this: https://github.com/petere/pguri ? The difference there is that it uses an external library for parsing the URLs, which makes the code much smaller. Overall, the functionality looks pretty similar to yours at first glance.
On 05/12/2024 15:59, Peter Eisentraut wrote: > On 05.12.24 15:01, Alexander Borisov wrote: >> Postgres users often store URLs in the database. As an example, they >> provide links to their pages on the web, analyze users posts and get >> links for further storage and analysis. Naturally, there is a need to >> compare, sort, build indexes, get statistics on individual parts of >> the URL: hosts, protocols, path and so on. >> >> Adding a new URL type will standardize the work with URLs in Postgres >> and provide simple tools for manipulating the new type. > > Have you seen this: https://github.com/petere/pguri ? > > The difference there is that it uses an external library for parsing > the URLs, which makes the code much smaller. Overall, the > functionality looks pretty similar to yours at first glance. +1 for adding this to contrib
05.12.2024 17:59, Peter Eisentraut пишет: > On 05.12.24 15:01, Alexander Borisov wrote: >> Postgres users often store URLs in the database. As an example, they >> provide links to their pages on the web, analyze users posts and get >> links for further storage and analysis. Naturally, there is a need to >> compare, sort, build indexes, get statistics on individual parts of >> the URL: hosts, protocols, path and so on. >> >> Adding a new URL type will standardize the work with URLs in Postgres >> and provide simple tools for manipulating the new type. > > Have you seen this: https://github.com/petere/pguri ? > > The difference there is that it uses an external library for parsing the > URLs, which makes the code much smaller. Overall, the functionality > looks pretty similar to yours at first glance. > Hi Peter, I looked at your implementation of the URI extension. You are using a third party library uriparser for URI/URL parsing. The uriparser library is based on the RFC 3986 specification, which I cite in comparison to WHATWG in my email. The name of the functions to get the individual parts of the URL will naturally be the same, that's how URLs/URIs work. But you have to look at the point and perspective here. As I've written before, there is a difference between parsing URLs according to the RFC 3986 specification and WHATWG URLs. This is especially true for host. Here are a couple more examples. The specifications describe character encoding differently, this applies to userinfo (username/password), path, query, fragment. RFC 3986 standard encodes characters in the same way everywhere. The characters to encode are: < > “ ` \r \n \t { } | \ ^ '. The WHATWG URL standard takes a more selective and subtle approach. For example, path may contain the character “|”, but userinfo does not (encoded). The WHATWG specification also requires that tabs and newlines be removed from URLs before parsing. The WHATWG character encoding gradation can be found in the specification [1]. In addition to functions to retrieve individual parts, the WHATWG URL standard describes an API for changing them: scheme, userinfo, host, hostname, port, path, query, fragment. There is not just one value is replaced by another, there is a certain logic, which is not always obvious. For example, try to replace scheme, let's take the URL for example: https://example.com/. This URL contains special scheme (any others are not special), there are only six of them: ftp, file, http, https, ws, wss. And it is impossible to replace scepial scheme with a non-special one. More precisely, the URL will be returned with special scheme, i.e. without changes. This is how you can check it with the patch I have given: Example: select url_scheme_set('https://example.com/'::url, 'wss'); Result: wss://example.com/ Example: select url_scheme_set('https://example.com/'::url, 'myown'); Result: https://example.com/ Example: select url_scheme_set('best://example.com/'::url, 'myown'); Result: myown://example.com/ Example: select url_scheme_set('best://example.com/'::url, 'https'); Result: best://example.com/ In addition, WHATWG validates URLs during parsing and reports non-critical errors [2]. If such errors occur, parsing continues. However, in my implementation I don't output these errors, I just haven't figured out how to do it correctly in SQL (as NOTICE?). Without going further into the differences in specifications I could say simply - RFC 3986 is obsolete, for example, node.js has labeled the API with it as Legacy (they use WHATWG). If we abstract from specifications and consider our approaches in implementation. You parse the URL every time for any request (even to retrieve fragments). In my implementation I proceed from the fact that requests to read URLs will significantly exceed their changes. Parsing is done once on input, the result is saved in a special format and later the necessary parts of the URL (or the whole URL) are retrieved. Also please note that there are no dependencies on third-party libraries (ICU dependencies can also be fought off). All currently available functions and examples can be seen in the README file, in the patch. [1] https://url.spec.whatwg.org/#c0-control-percent-encode-set [2] https://url.spec.whatwg.org/#writing -- Alexander Borisov
> On 6 Dec 2024, at 13:59, Alexander Borisov <lex.borisov@gmail.com> wrote: > As I've written before, there is a difference between parsing URLs > according to the RFC 3986 specification and WHATWG URLs. This is > especially true for host. Here are a couple more examples. As someone who wears another open-source hat which is heavily involved in parsing URLs I cannot stress enough how much I think postgres should avoid this. The example url http://http://http://@http://http://?http://#http:// is a valid url, but is rejected by a number of implementations and parsed differently by most that accept it. A URL datatype is a *good idea* but one which I personally believe is best handled as an external extension. -- Daniel Gustafsson
Hi Daniel, 06.12.2024 16:46, Daniel Gustafsson пишет: >> On 6 Dec 2024, at 13:59, Alexander Borisov <lex.borisov@gmail.com> wrote: > >> As I've written before, there is a difference between parsing URLs >> according to the RFC 3986 specification and WHATWG URLs. This is >> especially true for host. Here are a couple more examples. > > As someone who wears another open-source hat which is heavily involved in > parsing URLs I cannot stress enough how much I think postgres should avoid > this. The example url http://http://http://@http://http://?http://#http:// is > a valid url, but is rejected by a number of implementations and parsed > differently by most that accept it. Your example is valid, yes, it looks scary, t might catch someone off guard. At the same time your URL is correctly parsed both RFC 3986 and WHATWG URL. There are many examples of “scary” URLs that you can't even understand how they are parsed. You can write a URL with any intimidating host, path, scheme, but that's not what I mean. There are generally accepted standards for URL/URI parsing RFC 3986 and WHATWG URL. We are not talking about self-written implementations (without relying on any specifications) or those who made a mistake while implementing one of the standards. I propose to implement support for one of the standards that looks promising. On the contrary, everything is quite clear. All we need to do is point out that we have a URL data type in extension by WHATWG specification. I would even say that by creating a new type we will contribute to the standardization of this zoo. It's about creating a new URL data type according to the specification WHATWG and including it in contrib as an extension. -- Alexander Borisov
On Fri, Dec 6, 2024 at 8:46 AM Daniel Gustafsson <daniel@yesql.se> wrote: > A URL datatype is a *good idea* but one which I personally believe is best > handled as an external extension. +1. I don't think that it's an intrinsically bad idea to have a URL data type in the core distribution, but no compelling reason for putting it there has been proposed. Alexander's effort can equally well live in github or pgxn or whatever and people who want to use it still can. Also, it's pretty clear from reading this thread that there's more than one way to do this and not everybody agrees with or likes the particular thing Alexander has done. For an out-of-core extension, that's perfectly fine, and even good: different people want different things, and that's OK. For something that's part of PostgreSQL itself, it's a big problem. There's no reason that we should privilege one implementation over others, and we certainly do not want the core project to have to maintain and ship multiple versions of things like this. -- Robert Haas EDB: http://www.enterprisedb.com