Re: Proposal to add a new URL data type. - Mailing list pgsql-hackers
From: Alexander Borisov
Subject: Re: Proposal to add a new URL data type.
Msg-id: 79547330-ab46-4886-9fdf-cf7fa526c876@gmail.com
In response to: Re: Proposal to add a new URL data type. (Peter Eisentraut <peter@eisentraut.org>)
Responses: Re: Proposal to add a new URL data type.
List: pgsql-hackers
On 05.12.2024 17:59, Peter Eisentraut wrote:
> On 05.12.24 15:01, Alexander Borisov wrote:
>> Postgres users often store URLs in the database. As an example, they
>> provide links to their pages on the web, analyze users' posts and get
>> links for further storage and analysis. Naturally, there is a need to
>> compare, sort, build indexes, and get statistics on individual parts of
>> the URL: hosts, protocols, paths and so on.
>>
>> Adding a new URL type will standardize the work with URLs in Postgres
>> and provide simple tools for manipulating the new type.
>
> Have you seen this: https://github.com/petere/pguri ?
>
> The difference there is that it uses an external library for parsing the
> URLs, which makes the code much smaller. Overall, the functionality
> looks pretty similar to yours at first glance.

Hi Peter,

I looked at your implementation of the URI extension. You use the third-party uriparser library for URI/URL parsing. The uriparser library is based on the RFC 3986 specification, which I compare with WHATWG in my email. The names of the functions that return the individual parts of the URL will naturally be the same; that is simply how URLs/URIs work. What matters here is the substance and the long-term perspective.

As I have written before, parsing URLs according to RFC 3986 differs from parsing them according to the WHATWG URL standard. This is especially true for the host. Here are a couple more examples. The two specifications describe character encoding differently; this applies to userinfo (username/password), path, query, and fragment. RFC 3986 encodes characters uniformly everywhere. The characters to encode are: < > " ` \r \n \t { } | \ ^ '. The WHATWG URL standard takes a more selective and subtle approach. For example, a path may contain the character "|", but userinfo may not (there it is percent-encoded). The WHATWG specification also requires that tabs and newlines be removed from a URL before parsing.
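The selective encoding described above can be sketched in Python. This is an illustrative simplification, not code from the patch: the sets below are reduced versions of the WHATWG percent-encode sets (fragment ⊂ path ⊂ userinfo), showing how "|" survives in a path but is encoded in userinfo.

```python
# Illustrative sketch (not from the patch): WHATWG-style per-component
# percent-encoding, in contrast to RFC 3986's uniform approach.
# The encode sets are simplified from https://url.spec.whatwg.org/.

def percent_encode(s, extra):
    """Percent-encode C0 controls, non-ASCII bytes, and any char in `extra`."""
    out = []
    for ch in s:
        if ord(ch) < 0x20 or ord(ch) > 0x7E or ch in extra:
            out.extend('%{:02X}'.format(b) for b in ch.encode('utf-8'))
        else:
            out.append(ch)
    return ''.join(out)

# Simplified WHATWG encode sets; each is a superset of the previous one.
FRAGMENT = set(' "<>`')
PATH = FRAGMENT | set('#?{}')
USERINFO = PATH | set("/:;=@[\\]^|")

def strip_tab_newline(url):
    """WHATWG requires tabs and newlines to be removed before parsing."""
    return url.replace('\t', '').replace('\n', '').replace('\r', '')

print(percent_encode('a|b', PATH))      # '|' is allowed in a path
print(percent_encode('a|b', USERINFO))  # '|' is encoded in userinfo
```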
The WHATWG percent-encoding gradation can be found in the specification [1].

In addition to functions that retrieve individual parts, the WHATWG URL standard describes an API for changing them: scheme, userinfo, host, hostname, port, path, query, fragment. It is not simply a matter of one value being replaced by another; there is a certain logic involved, and it is not always obvious. For example, try replacing the scheme. Take the URL https://example.com/. This URL has a special scheme (any others are not special); there are only six of them: ftp, file, http, https, ws, wss. It is impossible to replace a special scheme with a non-special one. More precisely, the URL is returned with its special scheme intact, i.e. unchanged. This is how you can check it with the patch I have posted:

Example: select url_scheme_set('https://example.com/'::url, 'wss');
Result:  wss://example.com/

Example: select url_scheme_set('https://example.com/'::url, 'myown');
Result:  https://example.com/

Example: select url_scheme_set('best://example.com/'::url, 'myown');
Result:  myown://example.com/

Example: select url_scheme_set('best://example.com/'::url, 'https');
Result:  best://example.com/

In addition, WHATWG validates URLs during parsing and reports non-critical errors [2]. If such errors occur, parsing continues. However, my implementation does not output these errors yet; I have not figured out how to do that correctly in SQL (as NOTICE?).

Without going further into the differences between the specifications, I could put it simply: RFC 3986 is obsolete. For example, Node.js has labeled its RFC 3986-based API as Legacy (it uses WHATWG).

If we set the specifications aside and compare our implementation approaches: you parse the URL on every request (even to retrieve the fragment). My implementation proceeds from the assumption that reads of a URL will significantly outnumber changes to it.
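The scheme-replacement rule behind the four SQL examples above can be sketched in a few lines of Python. This is a simplified illustration, not the patch's C code (the real WHATWG "set the scheme" steps have additional conditions, e.g. for file URLs):

```python
# Illustrative sketch of WHATWG's special-scheme rule: a scheme change is
# ignored when it would cross the special/non-special boundary.
# Mirrors the behavior of url_scheme_set() shown in the SQL examples.

SPECIAL = {'ftp', 'file', 'http', 'https', 'ws', 'wss'}

def set_scheme(url, new_scheme):
    old_scheme, rest = url.split(':', 1)
    # Special may only become special; non-special only non-special.
    if (old_scheme in SPECIAL) != (new_scheme in SPECIAL):
        return url  # returned unchanged
    return new_scheme + ':' + rest

print(set_scheme('https://example.com/', 'wss'))    # wss://example.com/
print(set_scheme('https://example.com/', 'myown'))  # https://example.com/
```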
Parsing is done once on input, the result is saved in a special format, and the necessary parts of the URL (or the whole URL) are later retrieved from it. Also, please note that there are no dependencies on third-party libraries (the ICU dependency can also be dropped).

All currently available functions and examples can be seen in the README file in the patch.

[1] https://url.spec.whatwg.org/#c0-control-percent-encode-set
[2] https://url.spec.whatwg.org/#writing

--
Alexander Borisov
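The parse-once design can be illustrated with a small Python sketch. The class name and the use of `urlsplit` as a stand-in parser are my own for illustration; the patch stores a binary representation inside the datum, but the principle is the same: parse at input time, then serve reads by lookup rather than reparsing.

```python
# Illustrative sketch of the parse-once design: the URL is parsed a single
# time on input and its components are stored, so getters never reparse.
from urllib.parse import urlsplit  # stand-in parser for this sketch

class StoredURL:
    def __init__(self, text):
        parts = urlsplit(text)  # parsed exactly once, at input time
        self._text = text
        self._parts = {
            'scheme': parts.scheme,
            'host': parts.hostname or '',
            'path': parts.path,
            'query': parts.query,
            'fragment': parts.fragment,
        }

    def get(self, name):
        return self._parts[name]  # plain lookup; no reparsing on read

u = StoredURL('https://example.com/a/b?q=1#top')
print(u.get('host'))   # example.com
print(u.get('query'))  # q=1
```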