BLOB / CLOB support in PostgreSQL - Mailing list pgsql-hackers

From Vladimir Sitnikov
Subject BLOB / CLOB support in PostgreSQL
Date
Msg-id CAB=Je-EN_Wc-SURAfUAgzLh-kcmxD_migPaO1Jx+SksTAMEuog@mail.gmail.com
Whole thread Raw
Responses Re: BLOB / CLOB support in PostgreSQL
List pgsql-hackers
Hi,

According to Pgjdbc GitHub statistics, the most popular page is https://github.com/pgjdbc/pgjdbc/issues/1102 which is
 "org.postgresql.jdbc.PgConnection.createClob() is not yet implemented" issue (1600 visits from 1400 unique visitors per a fortnight).

There are workarounds to silence the error, however I'm sure CLOB (as in "streaming text datatype") is not yet supported in PostgreSQL backend.

I have browsed pgsql-hackers mailing list re CLOB, and it looks like there's no relevant discussion, so I'm quite sure I've done my homework on "check prior mails regarding the subject".

**Issue**: there's no sensible way to map java.sql.Clob to the existing backend datatypes. `text` can't stream (it always loads the data fully), and it is 1GB limited.

Java distinguishes java.sql.Blob and java.sql.Clob.
Blob is a "binary stream with streaming features". It can be mapped to existing "Large Objects", and existing Large Object API somewhat suits for the implementation.
There are glitches (like "pgjdbc has to perform 4 API calls tell/seek/tell/seek in order just to get LO length once"), however it is fine.

Java Clob API is just a dozen of methods (13 to be exact), however there are two major issues there:
1) "Large Object" is just a binary object. There's no way to tell if the contents is a UTF-8 string or Windows-1251 string or protobuf-encoded message or whatever.
That is if pgjdbc encodes java.sql.Clob (large string) into some form of binary (e.g. UTF-8) and store it as PostgreSQL Large Object, then this LO automatically becomes "pgjdbc-specific blob".
There's no way to use the data in SQL or pl/pgsql or other applications. 
For instance, one can't perform " where clob_column like '%abcd%' "

2) "characters". For instance, `long length()` should return the number of characters in the string.
If pgjdbc implements java.sql.Clob as a UTF-8 encoded binary, then it would have to **process the whole blob** in order to measure string length.
The same thing goes for `String getSubString(long pos, int length)`. It would have to process all the bytes up to character `long pos` (how otherwise it would know byte position for character `pos`?).

Currently pgjdbc encodes strings using client_encoding, stores them as LO, and has been like that for ages. Apparently that might easily produce garbage in the DB if clients use various encodings, however pgjdbc's default setting is to use UTF-8 so the problem should be not that visible.

I fully understand LO has issues with "removing obsolete entries", however mapping java.sql.Clob to `text` seems to make less sense.
For instance: suppose pgjdbc choses "Clob == text". Then a client meets "1GB" limit.

"Streaming TOAST data" looks more like a research project rather than a clear thing to implement.

What if there was a standard of storing strings in Large Objects?
For instance: "CLOB is a UTF-8 encoded string stored as a single LO". When such an agreement exists, various applications could read and write the data.
Of course, UTF-8 might not suit everybody, so a format might be "prefix that specifies encoding, then encoded string".
Of course both variations above fail to support streaming (as in "need to process all the contents in order to get the last character"), so it might be better to use
"prefix that specifies encoding + 'index block' (that specifies offsets for each 1M characters) + encoded string".
I'm sure there are known algorithms to store strings in binary format that support subsequence / overwrite / length in reasonable time (O(1) or O(N) with reasonable constant).
There might be an option to use UTF-16 (so each "character" becomes 2 bytes always), however it would come at a cost of space usage.

**Here goes the question**:  do you think such an implementation ("large string stored in Large Objects" could be merged into the core eventually)?

Q2: any ideas/existing libraries for random access read-write large strings stored as binary?


--
Regards,
Vladimir Sitnikov

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Expression errors with "FOR UPDATE" and postgres_fdw withpartition wise join enabled.
Next
From: Tom Lane
Date:
Subject: Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.