Re: [GENERAL] workaround for lack of REPLACE() function - Mailing list pgsql-hackers

From Joe Conway
Subject Re: [GENERAL] workaround for lack of REPLACE() function
Date
Msg-id 3D569BBF.8040309@joeconway.com
Whole thread Raw
In response to Re: [GENERAL] workaround for lack of REPLACE() function  (Thomas Lockhart <lockhart@fourpalms.org>)
Responses Re: [GENERAL] workaround for lack of REPLACE() function  (Tatsuo Ishii <t-ishii@sra.co.jp>)
List pgsql-hackers
Tatsuo Ishii wrote:
>>Any objection if I rework this function to meet SQL92 and fix the bug? 
> 

I've started working on text_substr() as described in this thread (which 
is hopefully prep work for the replace() function that started the 
thread). I haven't really looked at toast or multibyte closely before, 
so I'd like to ask a couple of questions to be sure I'm understanding 
the relevant issues correctly.

First, in textlen() I see (ignoring multibyte for a moment):
  text       *t = PG_GETARG_TEXT_P(0);  PG_RETURN_INT32(VARSIZE(t) - VARHDRSZ);

Tom has pointed out to me before that PG_GETARG_TEXT_P(n) incurs the 
overhead of retrieving and possibly decompressing a toasted datum. So my 
first question is, can we simply do:  PG_RETURN_INT32(toast_raw_datum_size(PG_GETARG_DATUM(0)) - VARHDRSZ);
and save the overhead of retrieving and decompressing the whole datum?

Now, in the multibyte case, again in textlen(), I see:
  /* optimization for single byte encoding */  if (pg_database_encoding_max_length() <= 1)
PG_RETURN_INT32(VARSIZE(t)- VARHDRSZ);
 
  PG_RETURN_INT32(      pg_mbstrlen_with_len(VARDATA(t), VARSIZE(t) - VARHDRSZ));

Three questions here.
1) In the case of encoding max length == 1, can we treat it the same as 
the non-multibyte case (I presume they are exactly the same)?

2) Can encoding max length ever be < 1? Doesn't make sense to me.

3) In the case of encoding max length > 1, if I understand correctly, 
each encoded character can be one *or more* bytes, up to and encluding 
encoding max length bytes. So the *only* way presently to get the length 
of the original character string is to loop through the entire string 
checking the length of each individual character (that's what 
pg_mbstrlen_with_len() does it seems)?

Finally, if 3) is true, then there is no way to avoid the retrieval and 
decompression of the datum just to find out its length. For large 
datums, detoasting plus the looping through each character would add a 
huge amount of overhead just to get at the length of the original 
string. I don't know if we need to be able to get *just* the length 
often enough to really care, but if we do, I had an idea for some future 
release (I wouldn't propose doing this for 7.3):

- add a new EXTENDED state to va_external for MULTIBYTE
- any string with max encoding length > 1 would be EXTENDED even if it  is not EXTERNAL and not COMPRESSED.
- to each of the structs in the union, add va_strlen
- populate va_strlen on INSERT and maintain it on UPDATE.

Now a new function similar to toast_raw_datum_size(), maybe 
toast_raw_datum_strlen() could be used to get the original string 
length, whether MB or not, without needing to retrieve and decompress 
the entire datum.

I understand we would either: have to steal another bit from the VARHDR 
which would reduce the effective size of a valena from 1GB down to .5GB; 
or we would need to add a byte or two to the VARHDR which is extra 
per-datum overhead. I'm not sure we would want to do either. But I 
wanted to toss out the idea while it was fresh on my mind.

Thanks,

Joe




pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: [SECURITY] DoS attack on backend possible (was: Re:
Next
From: Florian Weimer
Date:
Subject: Re: [SECURITY] DoS attack on backend possible (was: Re: