Thread: lztext.c

lztext.c

From
Tatsuo Ishii
Date:
I'm going to commit changes to make lztextlen() aware of
multi-byte. While doing the work, I found that no POSITION() or
SUBSTRING() for lztext has been implemented in the file.

BTW, is anybody working on making lztext indexable?  If not, I will take
care of it along with the above additions.
--
Tatsuo Ishii



Re: [HACKERS] lztext.c

From
wieck@debis.com (Jan Wieck)
Date:
Tatsuo Ishii wrote:

> I'm going to commit changes to make lztextlen() aware of
> multi-byte. While doing the work, I found that no POSITION() or
> SUBSTRING() for lztext has been implemented in the file.

    Thanks for that. I usually don't have multi-byte support
    compiled in, so it's surely better if you do the extension
    and tests.

    I know that a lot of functions are still missing, especially
    comparison and the ones you mention. I planned to get back
    to it once the multi-byte support is in.

> BTW, is anybody working on making lztext indexable?  If not, I will take
> care of it along with the above additions.

    IMHO that is questionable.

    A compressed data type is preferred for storing large amounts
    of data. Indexing large fields, OTOH, is something to be
    prevented by database design. The new type at hand offers
    reasonable compression rates only above some input size.

    OTOOH, it might get someone around the btree split problems
    some of us encountered, which I was able to trigger with
    field contents above 2K already. In such a case it can be a
    last resort.

    I'd like to know what others think.

    Don't spend much effort on comparison and the SUBSTRING()
    things right now. I already have an additional, generalized
    decompressor in mind that can be used, for example, in
    comparison to decompress two values on the fly and stop
    comparing at the first difference, which usually happens
    early in two random datums.

    Tell me when you have the multi-byte (and maybe Cyrillic?)
    stuff committed and I'll get my hands back on the code.
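
    The on-the-fly comparison described above could be sketched
    roughly as follows. This is only an illustration of the control
    flow, not the lztext code: the (count, byte) run-length format
    and the names ToyStream, toy_init, toy_next and toy_compare are
    all hypothetical stand-ins for the real compressed format and
    decompressor.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy stand-in for a compressed value: a sequence of (count, byte)
 * pairs. This is NOT the lztext format; it only shows how a decoder
 * that yields one byte at a time allows an early exit.
 */
typedef struct ToyStream
{
    const unsigned char *pos;       /* next unread (count, byte) pair */
    const unsigned char *end;       /* one past the last pair */
    unsigned int remaining;         /* bytes left in the current run */
    unsigned char current;          /* byte value of the current run */
} ToyStream;

static void
toy_init(ToyStream *s, const unsigned char *data, size_t len)
{
    assert(len % 2 == 0);           /* input must consist of whole pairs */
    s->pos = data;
    s->end = data + len;
    s->remaining = 0;
    s->current = 0;
}

/* Return the next decompressed byte, or -1 at end of input. */
static int
toy_next(ToyStream *s)
{
    while (s->remaining == 0)
    {
        if (s->pos >= s->end)
            return -1;
        s->remaining = s->pos[0];
        s->current = s->pos[1];
        s->pos += 2;
    }
    s->remaining--;
    return s->current;
}

/*
 * Compare two compressed values without materializing either one.
 * Returns <0, 0 or >0 like memcmp(); a value that is a proper prefix
 * of the other sorts first. Decoding stops at the first difference.
 */
static int
toy_compare(const unsigned char *a, size_t alen,
            const unsigned char *b, size_t blen)
{
    ToyStream sa;
    ToyStream sb;

    toy_init(&sa, a, alen);
    toy_init(&sb, b, blen);

    for (;;)
    {
        int ca = toy_next(&sa);
        int cb = toy_next(&sb);

        if (ca != cb)
            return ca - cb;         /* -1 (end) sorts before any byte */
        if (ca == -1)
            return 0;               /* both ended together: equal */
    }
}
```

    The point of the design is that neither value is ever fully
    decompressed into a buffer; work stops as soon as one byte
    differs.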


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #

Re: [HACKERS] lztext.c

From
Tatsuo Ishii
Date:
>    Don't spend much effort on comparison and the SUBSTRING()
>    things right now. I already have an additional, generalized
>    decompressor in mind that can be used, for example, in
>    comparison to decompress two values on the fly and stop
>    comparing at the first difference, which usually happens
>    early in two random datums.

Ok.

>    Tell me when you have the multi-byte (and maybe Cyrillic?)
>    stuff committed and I'll get my hands back on the code.

I have committed the changes just now, though Cyrillic support is not
included. I vaguely recall the discussion about the usefulness of
the Cyrillic support.
--
Tatsuo Ishii



Re: [HACKERS] lztext.c

From
Oleg Bartunov
Date:
On Wed, 24 Nov 1999, Tatsuo Ishii wrote:

> Date: Wed, 24 Nov 1999 12:52:53 +0900
> From: Tatsuo Ishii <t-ishii@sra.co.jp>
> To: Jan Wieck <wieck@debis.com>
> Cc: pgsql-hackers@postgreSQL.org
> Subject: Re: [HACKERS] lztext.c 
> 
> >    Don't spend much effort on comparison and the SUBSTRING()
> >    things right now. I already have an additional, generalized
> >    decompressor in mind that can be used, for example, in
> >    comparison to decompress two values on the fly and stop
> >    comparing at the first difference, which usually happens
> >    early in two random datums.
> 
> Ok.
> 
> >    Tell me when you have the multi-byte (and maybe Cyrillic?)
> >    stuff committed and I'll get my hands back on the code.
> 
> I have committed the changes just now, though Cyrillic support is not
> included. I vaguely recall the discussion about the usefulness of
> the Cyrillic support.

If you mean --recode, you're right.


_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



Re: [HACKERS] lztext.c

From
wieck@debis.com (Jan Wieck)
Date:
Tatsuo Ishii wrote:

> >    Don't spend much effort on comparison and the SUBSTRING()
> >    things right now. I already have an additional, generalized
> >    decompressor in mind that can be used, for example, in
> >    comparison to decompress two values on the fly and stop
> >    comparing at the first difference, which usually happens
> >    early in two random datums.
>
> Ok.
>
> >    Tell me when you have the multi-byte (and maybe Cyrillic?)
> >    stuff committed and I'll get my hands back on the code.
>
> I have committed the changes just now, though Cyrillic support is not
> included. I vaguely recall the discussion about the usefulness of
> the Cyrillic support.

    I added the comparison functions, operators and the default
    nbtree operator class for indexing.

    For SUBSTR() and STRPOS(), I just checked the current setup,
    and it automatically casts an lztext argument of these
    functions to text. I assume lztext can now be used in every
    place where text is allowed. Is it really worth blowing up
    the catalogs with rarely used functions whose only gain is
    skipping part of the decompression?

    Remember, the algorithm is optimized for decompression speed.
    It might save some time to do this for a comparison function
    used inside index scans or btree operations, where it's
    likely to hit a difference early. But for something like
    STRPOS(), using the default cast and changing the STRPOS()
    match search itself into a KMP algorithm (instead of walking
    through the text and comparing each position against the
    pattern using strncmp) would outperform it in any case. With
    the byte by byte strncmp() method, we definitely implemented
    the slowest and most readable possibility.
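
    To make the contrast concrete, here is a minimal sketch of both
    search strategies on plain C strings. It is not the backend's
    STRPOS() code: the names naive_strpos and kmp_strpos are
    hypothetical, positions are counted in bytes (no multi-byte
    handling), the result is 1-based with 0 meaning "not found", and
    returning 1 for an empty pattern is a choice made for this sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Naive search: try strncmp() at every starting position. */
static int
naive_strpos(const char *text, const char *pattern)
{
    size_t n;
    size_t m;
    size_t i;

    assert(text != NULL && pattern != NULL);
    n = strlen(text);
    m = strlen(pattern);

    if (m == 0)
        return 1;
    for (i = 0; i + m <= n; i++)
    {
        if (strncmp(text + i, pattern, m) == 0)
            return (int) (i + 1);
    }
    return 0;
}

/*
 * Knuth-Morris-Pratt search: precompute, for each pattern prefix, the
 * length of its longest proper prefix that is also a suffix, so the
 * scan over the text never moves backwards.
 * Returns -1 if the failure table cannot be allocated.
 */
static int
kmp_strpos(const char *text, const char *pattern)
{
    size_t n;
    size_t m;
    size_t i;
    size_t k;
    size_t *fail;
    int result = 0;

    assert(text != NULL && pattern != NULL);
    n = strlen(text);
    m = strlen(pattern);

    if (m == 0)
        return 1;
    if (m > n)
        return 0;

    fail = malloc(m * sizeof(size_t));
    if (fail == NULL)
        return -1;

    /* Build the failure table for the pattern. */
    fail[0] = 0;
    k = 0;
    for (i = 1; i < m; i++)
    {
        while (k > 0 && pattern[i] != pattern[k])
            k = fail[k - 1];
        if (pattern[i] == pattern[k])
            k++;
        fail[i] = k;
    }

    /* Scan the text once; k is the number of pattern bytes matched. */
    k = 0;
    for (i = 0; i < n; i++)
    {
        while (k > 0 && text[i] != pattern[k])
            k = fail[k - 1];
        if (text[i] == pattern[k])
            k++;
        if (k == m)
        {
            result = (int) (i - m + 2);     /* 1-based start of match */
            break;
        }
    }

    free(fail);
    return result;
}
```

    The naive version can re-examine the same text bytes up to m
    times each, while the KMP version looks at each text byte once
    and does O(n + m) comparisons overall.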

    I think we would better spend our time adding an lzbpchar
    type, or working on compressed tables and tuple splitting to
    blow away the size limits altogether.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #