Re: [GENERAL] workaround for lack of REPLACE() function - Mailing list pgsql-hackers

From Tatsuo Ishii
Subject Re: [GENERAL] workaround for lack of REPLACE() function
Date
Msg-id 20020812.100720.91313063.t-ishii@sra.co.jp
Whole thread Raw
In response to Re: [GENERAL] workaround for lack of REPLACE() function  (Joe Conway <mail@joeconway.com>)
List pgsql-hackers
> Now, in the multibyte case, again in textlen(), I see:
> 
>    /* optimization for single byte encoding */
>    if (pg_database_encoding_max_length() <= 1)
>      PG_RETURN_INT32(VARSIZE(t) - VARHDRSZ);
> 
>    PG_RETURN_INT32(
>        pg_mbstrlen_with_len(VARDATA(t), VARSIZE(t) - VARHDRSZ));
> 
> Three questions here.
> 1) In the case of encoding max length == 1, can we treat it the same as 
> the non-multibyte case (I presume they are exactly the same)?

Yes.

> 2) Can encoding max length ever be < 1? Doesn't make sense to me.

No. It seems just a defensive coding.

> 3) In the case of encoding max length > 1, if I understand correctly, 
> each encoded character can be one *or more* bytes, up to and encluding 
> encoding max length bytes.

Right.

> So the *only* way presently to get the length 
> of the original character string is to loop through the entire string 
> checking the length of each individual character (that's what 
> pg_mbstrlen_with_len() does it seems)?

Yes.

> Finally, if 3) is true, then there is no way to avoid the retrieval and 
> decompression of the datum just to find out its length. For large 
> datums, detoasting plus the looping through each character would add a 
> huge amount of overhead just to get at the length of the original 
> string. I don't know if we need to be able to get *just* the length 
> often enough to really care, but if we do, I had an idea for some future 
> release (I wouldn't propose doing this for 7.3):
> 
> - add a new EXTENDED state to va_external for MULTIBYTE
> - any string with max encoding length > 1 would be EXTENDED even if it
>    is not EXTERNAL and not COMPRESSED.
> - to each of the structs in the union, add va_strlen
> - populate va_strlen on INSERT and maintain it on UPDATE.
> 
> Now a new function similar to toast_raw_datum_size(), maybe 
> toast_raw_datum_strlen() could be used to get the original string 
> length, whether MB or not, without needing to retrieve and decompress 
> the entire datum.
> 
> I understand we would either: have to steal another bit from the VARHDR 
> which would reduce the effective size of a valena from 1GB down to .5GB; 
> or we would need to add a byte or two to the VARHDR which is extra 
> per-datum overhead. I'm not sure we would want to do either. But I 
> wanted to toss out the idea while it was fresh on my mind.

Interesting idea. I also was thinking about adding some extra
infomation to text data types such as character set, collation
etc. for 7.4 or later.
--
Tatsuo Ishii


pgsql-hackers by date:

Previous
From: Curt Sampson
Date:
Subject: Re: Table Inheritance Discussion
Next
From: "Christopher Kings-Lynne"
Date:
Subject: Re: python patch