Thread: What is the maximum encoding-conversion growth rate, anyway?

What is the maximum encoding-conversion growth rate, anyway?

From
Tom Lane
Date:
I just rearranged the code in mbutils.c a little bit to make it more
robust if conversion of an over-length string is attempted, and noted
this comment:

/*
 * When converting strings between different encodings, we assume that space
 * for converted result is 4-to-1 growth in the worst case. The rate for
 * currently supported encoding pairs are within 3 (SJIS JIS X0201 half width
 * kanna -> UTF8 is the worst case).  So "4" should be enough for the moment.
 *
 * Note that this is not the same as the maximum character width in any
 * particular encoding.
 */
#define MAX_CONVERSION_GROWTH  4

It strikes me that this is overly pessimistic, since we do not support
5- or 6-byte UTF8 characters, and AFAICS there are no 1-byte characters
in any supported encoding that require 4 bytes in another.  Could we
reduce the multiplier to 3?  Or even 2?  This has a direct impact on the
longest COPY lines we can support, so I'd like it not to be larger than
necessary.
        regards, tom lane
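
To illustrate why the multiplier limits the longest convertible string (and so the longest COPY line), here is a minimal C sketch. It assumes the output buffer is sized as input length times the multiplier plus a terminating NUL, and that a single allocation may not exceed roughly 1 GB. The names SKETCH_MAX_ALLOC, max_convertible_len and worst_case_buffer are hypothetical and are not taken from mbutils.c.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed allocation ceiling for this sketch: 1 GB - 1 byte. */
#define SKETCH_MAX_ALLOC ((size_t) 0x3fffffff)

/*
 * Largest input length (in bytes) whose worst-case converted result,
 * plus a terminating NUL, still fits under SKETCH_MAX_ALLOC.
 */
static size_t
max_convertible_len(size_t growth)
{
    assert(growth > 0);
    return (SKETCH_MAX_ALLOC - 1) / growth;
}

/* Worst-case output buffer size for an input of len bytes. */
static size_t
worst_case_buffer(size_t len, size_t growth)
{
    assert(len <= max_convertible_len(growth));
    return len * growth + 1;
}
```

Under these assumptions a multiplier of 4 caps the input at 268435455 bytes, while a multiplier of 3 raises the cap to 357913940 bytes.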


Re: What is the maximum encoding-conversion growth rate, anyway?

From
Tatsuo Ishii
Date:
> I just rearranged the code in mbutils.c a little bit to make it more
> robust if conversion of an over-length string is attempted, and noted
> this comment:
> 
> /*
>  * When converting strings between different encodings, we assume that space
>  * for converted result is 4-to-1 growth in the worst case. The rate for
>  * currently supported encoding pairs are within 3 (SJIS JIS X0201 half width
>  * kanna -> UTF8 is the worst case).  So "4" should be enough for the moment.
>  *
>  * Note that this is not the same as the maximum character width in any
>  * particular encoding.
>  */
> #define MAX_CONVERSION_GROWTH  4
> 
> It strikes me that this is overly pessimistic, since we do not support
> 5- or 6-byte UTF8 characters, and AFAICS there are no 1-byte characters
> in any supported encoding that require 4 bytes in another.  Could we
> reduce the multiplier to 3?  Or even 2?  This has a direct impact on the
> longest COPY lines we can support, so I'd like it not to be larger than
> necessary.

I'm afraid we have to make it larger, rather than smaller, for 8.3. For
example, 0x82f5 in SHIFT_JIS_2004 (new in 8.3) becomes a *pair* of 3-byte
UTF_8 characters (0x00e3818b and 0x00e3829a). See
util/mb/Unicode/shift_jis_2004_to_utf8_combined.map for more details.

So the worst case is now 6, rather than 3.

Can we add a column to pg_conversion which represents the "growth
rate"? This would bring the rate for most encodings well below 6.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: What is the maximum encoding-conversion growth rate, anyway?

From
Tom Lane
Date:
Tatsuo Ishii <ishii@postgresql.org> writes:
> I'm afraid we have to make it larger, rather than smaller, for 8.3. For
> example, 0x82f5 in SHIFT_JIS_2004 (new in 8.3) becomes a *pair* of 3-byte
> UTF_8 characters (0x00e3818b and 0x00e3829a). See
> util/mb/Unicode/shift_jis_2004_to_utf8_combined.map for more details.

> So the worst case is now 6, rather than 3.

Yipes.

> Can we add a column to pg_conversion which represents the "growth
> rate"? This would bring the rate for most encodings well below 6.

We need to do something, but the pg_conversion catalog seems a bad place
to put the info --- don't we have places that need to be able to do
conversion without catalog access?

Perhaps better would be to redefine the API for the conversion functions
so that they palloc their own result space.  Then each conversion
function would have to know the maximum growth rate for its particular
conversion.  This change would also make it feasible for a conversion
function to prescan the data and determine an exact output size, if that
seemed worthwhile because the potential growth rate was too extreme.
        regards, tom lane
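
As a rough illustration of the idea of a conversion function that allocates its own result after prescanning the input, here is a minimal C sketch for LATIN1 to UTF-8, where each input byte below 0x80 needs one output byte and every other byte needs two. The function names are hypothetical, the signature is not the PostgreSQL conversion-function API, and malloc stands in for palloc.

```c
#include <assert.h>
#include <stdlib.h>

/* Prescan: exact number of UTF-8 bytes needed for a LATIN1 input. */
static size_t
latin1_to_utf8_len(const unsigned char *src, size_t len)
{
    size_t need = 0;

    for (size_t i = 0; i < len; i++)
        need += (src[i] < 0x80) ? 1 : 2;
    return need;
}

/*
 * Convert LATIN1 to UTF-8, allocating the result inside the function.
 * Returns a NUL-terminated buffer the caller must free(), or NULL if
 * the allocation fails.  *out_len receives the length without the NUL.
 */
static unsigned char *
latin1_to_utf8_alloc(const unsigned char *src, size_t len, size_t *out_len)
{
    size_t need = latin1_to_utf8_len(src, len);
    unsigned char *dst = malloc(need + 1);
    unsigned char *p = dst;

    assert(out_len != NULL);
    if (dst == NULL)
        return NULL;
    for (size_t i = 0; i < len; i++)
    {
        if (src[i] < 0x80)
            *p++ = src[i];                    /* ASCII passes through */
        else
        {
            *p++ = 0xC0 | (src[i] >> 6);      /* 2-byte lead byte */
            *p++ = 0x80 | (src[i] & 0x3F);    /* continuation byte */
        }
    }
    *p = '\0';
    *out_len = need;
    return dst;
}
```

With this shape the caller needs no global multiplier: the buffer is exactly as large as the input requires, at the cost of reading the input twice.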


Re: What is the maximum encoding-conversion growth rate, anyway?

From
Tatsuo Ishii
Date:
> > Can we add a column to pg_conversion which represents the "growth
> > rate"? This would bring the rate for most encodings well below 6.
> 
> We need to do something, but the pg_conversion catalog seems a bad place
> to put the info --- don't we have places that need to be able to do
> conversion without catalog access?

Can you tell me where? I thought conversion functions are always
called via OidFunctionCall5, so we need to consult the
pg_conversion catalog beforehand anyway.

> Perhaps better would be to redefine the API for the conversion functions
> so that they palloc their own result space.  Then each conversion
> function would have to know the maximum growth rate for its particular
> conversion.  This change would also make it feasible for a conversion
> function to prescan the data and determine an exact output size, if that
> seemed worthwhile because the potential growth rate was too extreme.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: What is the maximum encoding-conversion growth rate, anyway?

From
Michael Fuhr
Date:
On Mon, May 28, 2007 at 10:23:42PM -0400, Tom Lane wrote:
> Tatsuo Ishii <ishii@postgresql.org> writes:
> > I'm afraid we have to make it larger, rather than smaller, for 8.3. For
> > example, 0x82f5 in SHIFT_JIS_2004 (new in 8.3) becomes a *pair* of 3-byte
> > UTF_8 characters (0x00e3818b and 0x00e3829a). See
> > util/mb/Unicode/shift_jis_2004_to_utf8_combined.map for more details.
> 
> > So the worst case is now 6, rather than 3.
> 
> Yipes.

Isn't MAX_CONVERSION_GROWTH a multiplier?  Doesn't 2 bytes becoming
2 * 3 bytes represent a growth of 3, not 6?  Or does that 2-byte
SHIFT_JIS_2004 sequence have a 1-byte sequence in another supported
encoding?  Or am I missing something?

-- 
Michael Fuhr


Re: What is the maximum encoding-conversion growth rate, anyway?

From
Tatsuo Ishii
Date:
> On Mon, May 28, 2007 at 10:23:42PM -0400, Tom Lane wrote:
> > Tatsuo Ishii <ishii@postgresql.org> writes:
> > > I'm afraid we have to make it larger, rather than smaller, for 8.3. For
> > > example, 0x82f5 in SHIFT_JIS_2004 (new in 8.3) becomes a *pair* of 3-byte
> > > UTF_8 characters (0x00e3818b and 0x00e3829a). See
> > > util/mb/Unicode/shift_jis_2004_to_utf8_combined.map for more details.
> > 
> > > So the worst case is now 6, rather than 3.
> > 
> > Yipes.
> 
> Isn't MAX_CONVERSION_GROWTH a multiplier?  Doesn't 2 bytes becoming
> 2 * 3 bytes represent a growth of 3, not 6?  Or does that 2-byte
> SHIFT_JIS_2004 sequence have a 1-byte sequence in another supported
> encoding?  Or am I missing something?

Oops. You are right. MAX_CONVERSION_GROWTH should be 3 (=
(2*3)/2), rather than 6, for this case.

So it seems we could safely bring MAX_CONVERSION_GROWTH down to 3 for
the moment.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
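
To make the multiplier arithmetic in the exchange above concrete, here is a minimal C sketch. The helper name growth_factor is hypothetical and not part of PostgreSQL; it only computes output bytes per input byte, rounded up, for one converted sequence.

```c
#include <assert.h>

/*
 * Hypothetical helper: bytes of output needed per byte of input,
 * rounded up, for one converted sequence.
 */
static unsigned
growth_factor(unsigned in_bytes, unsigned out_bytes)
{
    assert(in_bytes > 0);
    return (out_bytes + in_bytes - 1) / in_bytes;
}
```

The 2-byte SHIFT_JIS_2004 sequence 0x82f5 becomes 2 * 3 = 6 bytes of UTF-8, so growth_factor(2, 6) is 3. A 1-byte half-width kana that becomes 3 bytes gives growth_factor(1, 3), which is also 3.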


Re: What is the maximum encoding-conversion growth rate, anyway?

From
Tatsuo Ishii
Date:
> > On Mon, May 28, 2007 at 10:23:42PM -0400, Tom Lane wrote:
> > > Tatsuo Ishii <ishii@postgresql.org> writes:
> > > > I'm afraid we have to make it larger, rather than smaller, for 8.3. For
> > > > example, 0x82f5 in SHIFT_JIS_2004 (new in 8.3) becomes a *pair* of 3-byte
> > > > UTF_8 characters (0x00e3818b and 0x00e3829a). See
> > > > util/mb/Unicode/shift_jis_2004_to_utf8_combined.map for more details.
> > > 
> > > > So the worst case is now 6, rather than 3.
> > > 
> > > Yipes.
> > 
> > Isn't MAX_CONVERSION_GROWTH a multiplier?  Doesn't 2 bytes becoming
> > 2 * 3 bytes represent a growth of 3, not 6?  Or does that 2-byte
> > SHIFT_JIS_2004 sequence have a 1-byte sequence in another supported
> > encoding?  Or am I missing something?
> 
> Oops. You are right. MAX_CONVERSION_GROWTH should be 3 (=
> (2*3)/2), rather than 6, for this case.
> 
> So it seems we could safely bring MAX_CONVERSION_GROWTH down to 3 for
> the moment.

Thinking more, it struck me that users can define an arbitrary growth
rate by using CREATE CONVERSION. So it seems we need functionality to
define the growth rate anyway.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: What is the maximum encoding-conversion growth rate, anyway?

From
Tom Lane
Date:
Tatsuo Ishii <ishii@postgresql.org> writes:
> Thinking more, it struck me that users can define an arbitrary growth
> rate by using CREATE CONVERSION. So it seems we need functionality to
> define the growth rate anyway.

Seems to me that would be an argument for moving the palloc inside the
conversion functions, as I suggested before.

In practice though, I find it hard to imagine a pair of encodings for
which the growth rate is more than 3x.  You'd need something that
translates a single-byte character into 4 or more bytes (pretty
unlikely, especially considering we require all these encodings to be
ASCII supersets); or something that translates a 2-byte character into
more than 6 bytes.
        regards, tom lane


Re: What is the maximum encoding-conversion growth rate, anyway?

From
Michael Fuhr
Date:
On Tue, May 29, 2007 at 10:00:06AM -0400, Tom Lane wrote:
> In practice though, I find it hard to imagine a pair of encodings for
> which the growth rate is more than 3x.  You'd need something that
> translates a single-byte character into 4 or more bytes (pretty
> unlikely, especially considering we require all these encodings to be
> ASCII supersets); or something that translates a 2-byte character into
> more than 6 bytes.

Many characters in the 0x80..0xff range of single-byte encodings
like LATIN1 become four bytes in GB18030 (e.g., LATIN1 f1 = GB18030
81 30 8a 39).  PostgreSQL doesn't currently support such conversions
but it's something to be aware of.

-- 
Michael Fuhr
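
The effect of a case like the GB18030 one on the required multiplier can be sketched as a maximum over worst-case mappings. This is only an illustration in C: the struct, the table and the function name are hypothetical, and the byte counts are the ones quoted in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* One worst-case mapping for a conversion: bytes in, bytes out. */
struct worst_case
{
    const char *conversion;     /* label only, for readability */
    unsigned    in_bytes;
    unsigned    out_bytes;
};

/* Byte counts as quoted in the thread; the table itself is illustrative. */
static const struct worst_case thread_cases[] = {
    {"SJIS half-width kana -> UTF8", 1, 3},
    {"SHIFT_JIS_2004 0x82f5 -> UTF8", 2, 6},
    {"LATIN1 0xf1 -> GB18030 (not a supported conversion)", 1, 4},
};

/* Smallest integer multiplier that covers the first n table entries. */
static unsigned
required_growth(const struct worst_case *cases, size_t n)
{
    unsigned max = 1;

    for (size_t i = 0; i < n; i++)
    {
        unsigned g;

        assert(cases[i].in_bytes > 0);
        g = (cases[i].out_bytes + cases[i].in_bytes - 1) / cases[i].in_bytes;
        if (g > max)
            max = g;
    }
    return max;
}
```

Over the first two entries the result is 3; including the LATIN1 to GB18030 entry raises it to 4.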


Re: What is the maximum encoding-conversion growth rate, anyway?

From
"Jeroen T. Vermeulen"
Date:
On Tue, May 29, 2007 20:51, Tatsuo Ishii wrote:

> Thinking more, it struck me that users can define an arbitrary growth
> rate by using CREATE CONVERSION. So it seems we need functionality to
> define the growth rate anyway.

Would it make sense to define just the longest and shortest character
lengths for an encoding?  Then for any conversion you'd have a safe
estimate of
 ceil(target_encoding.max_char_len / source_encoding.min_char_len)

...without going through every possible conversion.


Jeroen




Re: What is the maximum encoding-conversion growth rate, anyway?

From
Bruce Momjian
Date:
Where are we on this?

---------------------------------------------------------------------------

Tom Lane wrote:
> I just rearranged the code in mbutils.c a little bit to make it more
> robust if conversion of an over-length string is attempted, and noted
> this comment:
> 
> /*
>  * When converting strings between different encodings, we assume that space
>  * for converted result is 4-to-1 growth in the worst case. The rate for
>  * currently supported encoding pairs are within 3 (SJIS JIS X0201 half width
>  * kanna -> UTF8 is the worst case).  So "4" should be enough for the moment.
>  *
>  * Note that this is not the same as the maximum character width in any
>  * particular encoding.
>  */
> #define MAX_CONVERSION_GROWTH  4
> 
> It strikes me that this is overly pessimistic, since we do not support
> 5- or 6-byte UTF8 characters, and AFAICS there are no 1-byte characters
> in any supported encoding that require 4 bytes in another.  Could we
> reduce the multiplier to 3?  Or even 2?  This has a direct impact on the
> longest COPY lines we can support, so I'd like it not to be larger than
> necessary.
> 
>             regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
> 
>                http://archives.postgresql.org

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: What is the maximum encoding-conversion growth rate, anyway?

From
Tatsuo Ishii
Date:
Sorry for the delay.

> On Tue, May 29, 2007 20:51, Tatsuo Ishii wrote:
> 
> > Thinking more, it struck me that users can define an arbitrary growth
> > rate by using CREATE CONVERSION. So it seems we need functionality to
> > define the growth rate anyway.
> 
> Would it make sense to define just the longest and shortest character
> lengths for an encoding?  Then for any conversion you'd have a safe
> estimate of
> 
>   ceil(target_encoding.max_char_len / source_encoding.min_char_len)
> 
> ...without going through every possible conversion.

This will not work, since certain conversions map n characters to m
characters.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
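
The objection above can be sketched in C by comparing the per-character estimate with the growth of a mapping in which one source character becomes several target characters. Both function names are hypothetical, and the 1-byte example in the note after the code uses made-up numbers rather than any real encoding.

```c
#include <assert.h>

/* The proposed estimate: ceil(dst_max_char_len / src_min_char_len). */
static unsigned
per_char_estimate(unsigned src_min_char_len, unsigned dst_max_char_len)
{
    assert(src_min_char_len > 0);
    return (dst_max_char_len + src_min_char_len - 1) / src_min_char_len;
}

/*
 * Growth, rounded up, when one source character of src_bytes bytes maps
 * to dst_chars target characters of dst_char_len bytes each.
 */
static unsigned
actual_growth(unsigned src_bytes, unsigned dst_chars, unsigned dst_char_len)
{
    assert(src_bytes > 0);
    return (dst_chars * dst_char_len + src_bytes - 1) / src_bytes;
}
```

Suppose a user-defined conversion maps a 1-byte source character to two 3-byte target characters, and 3 bytes is the longest character in the target encoding. Then per_char_estimate(1, 3) is 3, but actual_growth(1, 2, 3) is 6, so the estimate is no longer a safe bound. The SHIFT_JIS_2004 case from earlier in the thread, actual_growth(2, 2, 3), is 3.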


Re: What is the maximum encoding-conversion growth rate, anyway?

From
Tatsuo Ishii
Date:
The conclusion of the discussion appears to be that we could safely
reduce MAX_CONVERSION_GROWTH from 4 to 3 with all existing built-in
conversions.

However, since user-defined conversions could have an arbitrary growth
rate, it would probably be better to leave it as it is now.

For 8.4, maybe we could change the conversion functions' signature so
that we don't need a fixed conversion rate, as Tom suggested.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

> Where are we on this?


Re: What is the maximum encoding-conversion growth rate, anyway?

From
Bruce Momjian
Date:
This has been saved for the 8.4 release:
http://momjian.postgresql.org/cgi-bin/pgpatches_hold

---------------------------------------------------------------------------

Tatsuo Ishii wrote:
> The conclusion of the discussion appears to be that we could safely
> reduce MAX_CONVERSION_GROWTH from 4 to 3 with all existing built-in
> conversions.
> 
> However, since user-defined conversions could have an arbitrary growth
> rate, it would probably be better to leave it as it is now.
> 
> For 8.4, maybe we could change the conversion functions' signature so
> that we don't need a fixed conversion rate, as Tom suggested.
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan



Re: What is the maximum encoding-conversion growth rate, anyway?

From
Bruce Momjian
Date:
Added to TODO:

* Change memory allocation for multi-byte functions so memory is allocated inside conversion functions
 Currently we preallocate memory based on worst-case usage.


---------------------------------------------------------------------------

Tom Lane wrote:
> Tatsuo Ishii <ishii@postgresql.org> writes:
> > Thinking more, it struck me that users can define an arbitrary growth
> > rate by using CREATE CONVERSION. So it seems we need functionality to
> > define the growth rate anyway.
> 
> Seems to me that would be an argument for moving the palloc inside the
> conversion functions, as I suggested before.
> 
> In practice though, I find it hard to imagine a pair of encodings for
> which the growth rate is more than 3x.  You'd need something that
> translates a single-byte character into 4 or more bytes (pretty
> unlikely, especially considering we require all these encodings to be
> ASCII supersets); or something that translates a 2-byte character into
> more than 6 bytes.
> 
>             regards, tom lane
