Thread: LONG

LONG

From
wieck@debis.com (Jan Wieck)
Date:
    I thought about the huge-size variable text type a little
    more, and I think I could get the following implementation
    to work reliably for our upcoming release.

    For any relation having one or more LONG data type
    attributes, another relation (named pg_<something>) is
    created, accessible only to superusers (and internal access
    routines).  All LONG data items are stored as a reference
    into that relation, split up automatically so the chunks fit
    into the installation-specific tuple size limit.  Items are
    added/updated/removed totally transparently.

    It would not be indexable (jesus no!) and using it in a WHERE
    clause will be expensive. But whoever uses a WHERE on a
    non-indexable data type (possibly containing megabytes per
    item) is a silly fool who gets what he asked for: poor
    response times.

    I'd like to name it LONG, like Oracle's 2G max. data type,
    even though I intend to restrict the data size to some
    megabytes for now. All the data must still be processable in
    memory, and there might be multiple instances of one item in
    memory at the same time, so a real 2G datatype is impossible
    with this kind of approach. But isn't a 64MB #define'd limit
    enough for now? This would possibly still blow away many
    installations due to limited memory and/or swap space. And we
    can adjust that #define in 2001 (an address space odyssey),
    when 64-bit hardware and plenty of GB of real memory is the
    low-end standard *1).

    I already thought that the 8K default BLKSIZE is a little out
    of date for today's hardware standards. Two weeks ago I
    bought a PC for my kids. It's a 433MHz Celeron, 64MB RAM, 6GB
    disk - it cost about $500 (exactly DM 999,-- at Media Markt).
    With today's disk cache <-> memory and cache <-> surface
    transfer rates, the 8K size seems a little archaic to me.

    Thus, if we can get a LONG data type into 7.0, and maybe adjust
    the default BLKSIZE to something more up to date, wouldn't
    the long-tuple item go away silently?

    Should I go ahead on this or not?


Jan

*1) Or will it be TB/PB?

    I'm afraid to estimate, because it was only a short time ago
    that a 4G hard disk was high-end. Today, IBM offers a 3.5''
    disk with 72G formatted capacity and 64M is the low end of
    real memory, so where's the limit?

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #

Re: [HACKERS] LONG

From
Bruce Momjian
Date:
> 
>     I  thought  about  the  huge size variable text type a little
>     more.  And I think I could get the  following  implementation
>     to work reliable for our upcoming release.
> 
>     For   any  relation,  having  one  or  more  LONG  data  type
>     attributes,  another  relation  (named   pg_<something>)   is
>     created,  accessible  only to superusers (and internal access
>     routines).  All LONG data items are  stored  as  a  reference
>     into  that relation, split up automatically so the chunks fit
>     into the installation specific tuple limit size.   Items  are
>     added/updated/removed totally transparent.

Should we use large objects for this, and beef them up?  Seems that
would be a good way.  I have considered putting them in a hash
bucket/directory tree for faster access to lots of large objects.

There is a lot to be said for storing long tuples outside the tables,
because long tuples fill cache buffers and make short fields take longer
to access.

> 
>     It would not be indexable (jesus no!) and using it in a WHERE
>     clause will be expensive. But who ever uses a WHERE on a  not
>     indexable  (possibly containing megabytes per item) data type
>     is a silly fool who should get what he wanted, poor  response
>     times.

Good restriction.

>     I'd  like  to  name it LONG, like Oracle's 2G max. data type.
>     Even if I intend to restrict the data size to some  megabytes
>     for  now.  All  the data must still be processable in memory,
>     and there might be multiple instances of one item  in  memory
>     at  the  same time.  So a real 2G datatype is impossible with
>     this kind of approach.  But  isn't  a  64MB  #define'd  limit
>     enough  for  now?  This  would  possibly still blow away many
>     installations due to limited memory and/or swap space. And we
>     can  adjust  that #define in 2001 (an address space odyssey),
>     when 64bit hardware and plenty of GB real memory is  the  low
>     end standard *1).
> 
>     I already thought that the 8K default BLKSIZE is a little out
>     of date for today's  hardware  standards.  Two  weeks  ago  I
>     bought a PC for my kids. It's a 433MHz Celeron, 64MB ram, 6GB
>     disk - costs about $500 (exactly DM 999,-- at  Media  Markt).
>     With  the  actual  on  disk  cache  <->  memory and cache <->
>     surface transfer rates, the 8K size seems a little archaic to
>     me.

We use 8K blocks because that is the base size for most file systems. 
When we fsync an 8k buffer, the assumption is that that buffer is
written to the disk in a single write.  Larger buffers would be spread
over the disk, making it impossible for a single fsync() to be atomic, I think.

Also, larger buffers take more cache space per buffer, making the
buffer cache coarser and holding fewer buffers.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] LONG

From
wieck@debis.com (Jan Wieck)
Date:
Bruce Momjian wrote:

> Should we use large objects for this, and beef them up.  Seems that
> would be a good way.  I have considered putting them in a hash
> bucket/directory tree for faster access to lots of large objects.
>
> There is a lot to say about storing long tuples outside the tables
> because long tuples fill cache buffers and make short fields longer to
> access.

    I thought to use a regular table. Of course, it will eat
    buffers, but managing external files or even large objects
    for it IMHO isn't that simple if you take transaction
    commit/abort and the MVCC problems into account too. And IMHO
    this is something that must be covered, because I meant to
    create a DATATYPE that can be used as a replacement for TEXT
    if that's too small, so it must behave like a regular datatype,
    without any restrictions WRT being able to roll back etc.

    Using LO or external files would need much more testing than
    creating one additional shadow table (plus an index for it) at
    CREATE TABLE.  This table would automatically have all the
    concurrency, MVCC and visibility stuff stable. And it would
    automatically split into multiple files if it grows very
    large, be vacuumed, ...

    Let me do it this way for 7.0, and then let's collect some
    feedback and our own experience with it. For 8.0 we can discuss
    again whether doing it the hard way would be worth the effort.

> We use 8K blocks because that is the base size for most file systems.
> When we fsync an 8k buffer, the assumption is that that buffer is
> written in a single write to the disk.  Larger buffers would be spread
> over the disk, making a single fsync() impossible to be atomic, I think.
>
> Also, larger buffers take more cache space per buffer, makeing the
> buffer cache more corse holding fewer buffers.

    Maybe something to play with a little.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #

Re: [HACKERS] LONG

From
Don Baccus
Date:
At 01:48 PM 12/11/99 +0100, Jan Wieck wrote:

>    I  thought  to  use  a  regular table. Of course, it will eat
>    buffers, but managing external files or  even  large  objects
>    for  it  IMHO  isn't  that  simple,  if  you take transaction
>    commit/abort and MVCC problematic into account too. And  IMHO
>    this  is  something  that must be covered, because I meant to
>    create a DATATYPE that can be used as a replacement for  TEXT
>    if that's too small, so it must behave as a regular datatype,
>    without any restrictions WRT beeing able to rollback etc.

Yes, please, this is what (some of, at least) the world wants.



- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at http://donb.photo.net.


Re: [HACKERS] LONG

From
Bruce Momjian
Date:
> Bruce Momjian wrote:
> 
> > Should we use large objects for this, and beef them up.  Seems that
> > would be a good way.  I have considered putting them in a hash
> > bucket/directory tree for faster access to lots of large objects.
> >
> > There is a lot to say about storing long tuples outside the tables
> > because long tuples fill cache buffers and make short fields longer to
> > access.
> 
>     I  thought  to  use  a  regular table. Of course, it will eat
>     buffers, but managing external files or  even  large  objects
>     for  it  IMHO  isn't  that  simple,  if  you take transaction
>     commit/abort and MVCC problematic into account too. And  IMHO
>     this  is  something  that must be covered, because I meant to
>     create a DATATYPE that can be used as a replacement for  TEXT
>     if that's too small, so it must behave as a regular datatype,
>     without any restrictions WRT beeing able to rollback etc.


OK, I have thought about your idea, and I like it very much.  In fact,
it borders on genius.

Our/my original idea was to chain tuples in the main table.  That has
some disadvantages:

    More complex tuple handling of chained tuples
    Requires more tuple storage overhead for housekeeping of chaining data
    Sequential scan of table has to read those large fields
    Vacuum has to keep the tuples chained as they are moved
Your system would be:
    CREATE TABLE pg_long (
        refoid      OID,
        attno       int2,
        line        int4,
        attdata     VARCHAR(8000)
    );

    CREATE INDEX pg_long_idx ON pg_long (refoid, attno, line);

You keep the long data out of the table.  When updating the tuple, you
mark the pg_long tuples as superseded with the transaction id, and just
keep going.   No need to do anything special.  Vacuum will remove
superseded tuples automatically while processing pg_long if the
transaction was committed.

The pg_long_idx index will allow rapid access to tuple long data.

This approach seems better than tuple chaining because it uses our
existing code more efficiently.  You keep long data out of the main
table, and allow use of existing tools to access the long data.  

In fact, you may decide to just extend varchar() and text to allow use
of long tuples.  Set the varlena VARLEN field to some special value like
-1, and when you see that, you go to pg_long to get the data.  Seems
very easy.  You could get fancy and keep data in the table in most
cases, but if the tuple length exceeds 8k, go to all the varlena fields
and start moving data into pg_long.  That way, a table with three 4k
columns could be stored without the user even knowing pg_long is
involved, but for shorter tuples, they are stored in the main table.
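
To make this concrete, here is a minimal sketch (not actual backend code) of
the special-length test described above in a varlena output routine;
fetch_long_datum() is a hypothetical helper that would reassemble the value
from pg_long, and -1 is just the suggested marker value.

    #include <string.h>
    #include "postgres.h"            /* struct varlena, VARHDRSZ, palloc() */

    /* Hypothetical: reassemble a value whose real bytes live in pg_long. */
    extern struct varlena *fetch_long_datum(struct varlena *stub);

    /* Sketch of a pg_long-aware textout. */
    char *
    long_aware_textout(struct varlena *t)
    {
        char   *result;
        int     len;

        if (t->vl_len == -1)         /* marker: data stored in pg_long */
            t = fetch_long_datum(t); /* hypothetical reassembly step   */

        len = t->vl_len - VARHDRSZ;
        result = (char *) palloc(len + 1);
        memcpy(result, t->vl_dat, len);
        result[len] = '\0';
        return result;
    }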

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] LONG

From
Bruce Momjian
Date:
>     I  thought  to  use  a  regular table. Of course, it will eat
>     buffers, but managing external files or  even  large  objects
>     for  it  IMHO  isn't  that  simple,  if  you take transaction
>     commit/abort and MVCC problematic into account too. And  IMHO
>     this  is  something  that must be covered, because I meant to
>     create a DATATYPE that can be used as a replacement for  TEXT
>     if that's too small, so it must behave as a regular datatype,
>     without any restrictions WRT beeing able to rollback etc.

In fact, you could get fancy and allow an update of a column that doesn't
use pg_long to leave pg_long untouched.  Just keep the same value in the
column.  Whether the transaction fails or succeeds, the pg_long data is the
same for that tuple.  Of course, because an update is a delete and then an
insert, that may be hard to do.  For very long fields, it would be a win
for UPDATE.  You certainly couldn't do that with chained tuples.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] LONG

From
Peter Eisentraut
Date:
On Sat, 11 Dec 1999, Bruce Momjian wrote:

> In fact, you could get fancy and allow an update of a non-pg_long using
> column to not change pg_long at all.  Just keep the same value in the
> column.  If the transaction fails or succeeds, the pg_long is the same
> for that tuple.  Of course, because an update is a delete and then an
> insert, that may be hard to do.  For very long fields, it would be a win
> for UPDATE.  You certainly couldn't do that with chained tuples.

While this is great and all, what will happen when long tuples finally get
done? Will you remove this, or keep it, or just make LONG and TEXT
equivalent? I fear that elaborate structures will be put in place here
that might perhaps only be of use for one release cycle.

-- 
Peter Eisentraut                  Sernanders vaeg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden



Last thoughts about LONG

From
wieck@debis.com (Jan Wieck)
Date:
I wrote:

> Bruce Momjian wrote:
>
> > Should we use large objects for this, and beef them up.  Seems that
> > would be a good way.  I have considered putting them in a hash
> > bucket/directory tree for faster access to lots of large objects.
> >
> > There is a lot to say about storing long tuples outside the tables
> > because long tuples fill cache buffers and make short fields longer to
> > access.
>
>     I  thought  to  use  a  regular table. Of course, it will eat
>     buffers ...

    When looking at my actual implementation concept, I'm not
    sure if it will win or lose compared with text itself!
    Amazing, but I think it could win already at relatively small
    text sizes (1-2K is IMHO small compared to what this type
    could store).

    Well,  the  implementation  details. I really would like some
    little  comments  to  verify  it's  really  complete   before
    starting.

    - A  new  field "rellongrelid" type Oid is added to pg_class.
      It contains the Oid  of  the  long-value  relation  or  the
      invalid Oid for those who have no LONG attributes.

    - At CREATE TABLE, a long value relation named
      "_LONG<tablename>" is created for those tables that need it,
      and of course dropped and truncated appropriately. The schema
      of this table is

          rowid       Oid,          -- oid of our main data row
          rowattno    int2,         -- the attribute number in main data
          chunk_seq   int4,         -- the part number of this data chunk
          chunk       text          -- the content of this data chunk

      There is a unique index defined on (rowid, rowattno).

    - The new data type is of variable size  with  the  following
      header:

          typedef struct LongData {
              int32           varsize;
              int32           datasize;
              Oid             longrelid;
              Oid             rowid;
              int16           rowattno;
          } LongData;

      The type's input function is very simple. Allocate
      sizeof(LongData) + strlen(input), set varsize to it,
      datasize to strlen(input), and the rest to invalid and 0.
      Then copy the input after the struct.

      The type's output function decides, based on the longrelid,
      what to do.  If it's invalid, just output the bytes stored
      after the struct (it must be a datum that resulted from an
      input operation).  If longrelid isn't invalid, it does an
      index scan on that relation, fetching all tuples that match
      rowid and attno. Since it knows the datasize, it doesn't need
      them in the correct order; it can put them at the right
      places into the allocated return buffer by their chunk_seq.
      (A rough sketch of both functions follows this list.)

    - For now (until we have enough experience to judge) I  think
      it  would  be  better  to  forbid  ALTER  TABLE  when  LONG
      attributes are involved.  Sure, it must be implemented
      eventually, but IMHO not on the first evaluation attempt.
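
    A rough sketch of the input and output behaviour described
    above, assuming the LongData struct from the previous item.
    The chunk fetching is hidden behind a hypothetical
    fetch_next_chunk() helper (and LONG_CHUNK_SIZE placeholder),
    since the real index-scan code depends on backend internals:

        #include <string.h>
        #include "postgres.h"              /* palloc(), Oid, InvalidOid */

        /* Mirrors the LongData header shown above. */
        typedef struct LongData {
            int32           varsize;
            int32           datasize;
            Oid             longrelid;
            Oid             rowid;
            int16           rowattno;
        } LongData;

        /* Hypothetical: one chunk row fetched from the long relation. */
        typedef struct LongChunk {
            int32           chunk_seq;
            int32           size;
            char           *data;
        } LongChunk;

        /* Hypothetical index scan over (rowid, rowattno); false when done. */
        extern bool fetch_next_chunk(Oid longrelid, Oid rowid,
                                     int16 rowattno, LongChunk *chunk);

        #define LONG_CHUNK_SIZE 7000       /* illustrative chunk payload size */

        /* Input: header plus the raw text; nothing stored externally yet. */
        static LongData *
        long_in(const char *input)
        {
            int32       len    = strlen(input);
            LongData   *result = (LongData *) palloc(sizeof(LongData) + len);

            result->varsize   = sizeof(LongData) + len;
            result->datasize  = len;
            result->longrelid = InvalidOid;   /* data follows the struct */
            result->rowid     = InvalidOid;
            result->rowattno  = 0;
            memcpy((char *) (result + 1), input, len);
            return result;
        }

        /* Output: copy the in-line bytes, or reassemble chunks by chunk_seq. */
        static char *
        long_out(LongData *ld)
        {
            char       *buf = (char *) palloc(ld->datasize + 1);

            if (ld->longrelid == InvalidOid)
                memcpy(buf, (char *) (ld + 1), ld->datasize);
            else
            {
                LongChunk   chunk;

                /* Order doesn't matter: chunk_seq says where each piece goes. */
                while (fetch_next_chunk(ld->longrelid, ld->rowid,
                                        ld->rowattno, &chunk))
                    memcpy(buf + chunk.chunk_seq * LONG_CHUNK_SIZE,
                           chunk.data, chunk.size);
            }
            buf[ld->datasize] = '\0';
            return buf;
        }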

Now how the data goes in and out of the longrel.

    - On heap_insert(), we look for non-NULL LONG attributes in
      the tuple. Whether there could be any can simply be seen by
      looking at the rellongrelid in rd_rel.  We fetch the value
      either from the memory after LongData or by using the type
      output function (for fetching it from the relation where it
      is!).  Then we simply break it up into single chunks and
      store them with our tuple's information.  Now we need to do
      something tricky - to shrink the main data tuple size, we
      form a new heap tuple with the datums of the original one,
      but we replace all LongData items we stored by faked ones,
      where the varsize is sizeof(LongData) and all the other
      information is set up appropriately.  We append that faked
      tuple instead, copy the resulting information into the
      original tuple's header and throw the faked one away.

      This is a point where I'm not totally sure.  Could it
      possibly be better or required to copy the entire faked
      tuple over the one we should have stored?  It could never
      need more space, so that wouldn't be a problem.

    - On heap_replace(), we check for each LONG attribute whether it
      is NULL, or whether the information in longrelid, rowid and
      rowattno doesn't match our rellongrelid, tupleid, and attno. In
      that case this attribute might have old content in the
      longrel, which we need to delete first.

      The rest of the operation is exactly like for
      heap_insert(), except when all the attribute's information did
      match - then it's our own OLD value that wasn't changed, so
      we can simply skip it - the existing data is still valid.
      (See the small check sketched after this list.)

    - heap_delete() is so simple that I don't explain it.
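
    A minimal sketch of that heap_replace() test, assuming the
    LongData struct above; rellongrelid is the proposed pg_class
    field and the function name is purely illustrative:

        /*
         * True when a LONG datum in the new tuple still points at this
         * row's own entry in the long relation, i.e. the value was not
         * changed by the UPDATE and the existing chunks can be kept.
         */
        static bool
        long_value_unchanged(LongData *new_val, Oid rellongrelid,
                             Oid tupleid, int16 attno)
        {
            return new_val != NULL &&
                   new_val->longrelid == rellongrelid &&
                   new_val->rowid     == tupleid &&
                   new_val->rowattno  == attno;
        }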

    Now I hear you asking "how could this overhead be a win?" :-)

    That's  easy  to  explain.   As  long as you don't use a LONG
    column in the WHERE clause, when will the data be fetched? At
    the  time  it's finally clear that it's needed. That's when a
    result tuple is sent to the client (type output)  or  when  a
    tuple resulting from INSERT ... SELECT should be stored.

    Thus, all the tuples moving around in the execution tree,
    getting joined together, abused by sorts and aggregates and
    filtered out again, always contain the small LongData
    struct, not the data itself. Weren't there recent reports
    about too-expensive sorts due to their huge size?

    Another  bonus  would be this: What happens on an UPDATE to a
    table  having  LONG  attributes?  If  the  attribute  is  not
    modified,  the  OLD LongData will be found in the targetlist,
    and we'll not waste any space by storing the same information
    again.  IIRC  that  one was one of the biggest concerns about
    storing huge data  in  tuples,  but  it  disappeared  without
    leaving a trace - funny eh?

    It  is  so simple, that I fear I made some mistake somewhere.
    But where?


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #

Re: [HACKERS] LONG

From
wieck@debis.com (Jan Wieck)
Date:
Bruce Momjian wrote:

> In fact, you may decide to just extent varchar() and text to allow use
> of long tuples.  Set the varlena VARLEN field to some special value like
> -1, and when you see that, you go to pg_long to get the data.  Seems
> very easy.  You could get fancy and keep data in the table in most
> cases, but if the tuple length exceeds 8k, go to all the varlena fields
> and start moving data into pg_long.  That way, a table with three 4k
> columns could be stored without the user even knowing pg_long is
> involved, but for shorter tuples, they are stored in the main table.

    So  you  realized  most  of  my explanations yourself while I
    wrote the last mail.  :-)

    No, I don't intend to change anything on the existing data
    types. Where should the limit be for deciding to store
    a datum in pg_long?  Based on the datum's size? On the tuple
    size and attribute order, taking them one by one until the tuple
    becomes small enough to fit?

    Maybe we make this mechanism so general that it is
    automatically applied to ALL varsize attributes? We'll end up
    with one big pg_long where 90+% of the database's content will
    be stored.

    But  as  soon as an attribute stored there is used in a WHERE
    or is subject to be joined, you'll see why not (as said, this
    type  will  NOT  be enabled for indexing). The operation will
    probably fall back to a seq-scan on the main table and then
    the attribute must be fetched from pg_long with an index scan
    on every single compare etc. - no, no, no.

    And it will not be one single pg_long table. Instead it will
    be a separate table per table that contains one or more LONG
    attributes.  IIRC, the TRUNCATE functionality was implemented
    exactly to QUICKLY be able to wipe out the data from huge
    relations AND get the disk space back.  In the case of a
    central pg_long, TRUNCATE would have to scan pg_long to mark
    the tuples for deletion and vacuum must be run to really get
    back the space.  And a vacuum on this central pg_long would
    probably take longer than the old DELETE, VACUUM of the now
    truncated table itself. Again no, no, no.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #

Re: [HACKERS] Last thoughts about LONG

From
Peter Eisentraut
Date:
On Sat, 11 Dec 1999, Jan Wieck wrote:

>     Well,  the  implementation  details. I really would like some
>     little  comments  to  verify  it's  really  complete   before
>     starting.

Before I start the nagging, please be aware that I'm not as smart as I
think I am. Long datatypes of some sort are clearly necessary -- more
power to you.

>     - A  new  field "rellongrelid" type Oid is added to pg_class.
>       It contains the Oid  of  the  long-value  relation  or  the
>       invalid Oid for those who have no LONG attributes.

I have mixed feelings about all these "sparse" fields everywhere. Doing
it completely formally, this seems to be a one-to-many relation, so you
should put the referencing field into the pg_long table or whatever
structure you use, pointing the other way around. This is probably slower,
but it's cleaner. As I mentioned earlier, this whole arrangement will
(hopefully) not be needed for all too long, and then we wouldn't want to
be stuck with it.

>     - At    CREATE   TABLE,   a   long   value   relation   named
>       "_LONG<tablename>" is created for those tables who need it.

Please don't forget, this would require changes to pg_dump and psql. Also,
the COPY command might not be able to get away without changes, either.

In general, it wouldn't surprise me if some sections of the code would go
nuts about the news of tuples longer than BLCKSZ coming along. (Where
"nuts" is either 'truncation' or 'segfault'.)

I guess what I'm really saying is that I'd be totally in awe of you if you
could get all of this (and RI) done by Feb 1st. Good luck.

-- 
Peter Eisentraut                  Sernanders vaeg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden



Re: [HACKERS] LONG

From
wieck@debis.com (Jan Wieck)
Date:
Peter Eisentraut wrote:

> On Sat, 11 Dec 1999, Bruce Momjian wrote:
>
> > In fact, you could get fancy and allow an update of a non-pg_long using
> > column to not change pg_long at all.  Just keep the same value in the
> > column.  If the transaction fails or succeeds, the pg_long is the same
> > for that tuple.  Of course, because an update is a delete and then an
> > insert, that may be hard to do.  For very long fields, it would be a win
> > for UPDATE.  You certainly couldn't do that with chained tuples.
>
> While this is great and all, what will happen when long tuples finally get
> done? Will you remove this, or keep it, or just make LONG and TEXT
> equivalent? I fear that elaborate structures will be put in place here
> that might perhaps only be of use for one release cycle.

    With the actual design explained, I don't think we are still
    in such need for long tuples that we should
    introduce all the problems of chaining tuples into the
    vacuum, bufmgr, heapam, hio etc. etc. code.

    The rare cases where someone really needs larger tuples and
    is not able to use the proposed LONG data type can be
    tackled by increasing BLKSIZE for that specific installation.

    Isn't there a FAQ entry about "tuple size too big" pointing
    to BLKSIZE?  Haven't checked, but if there is, could that be the
    reason why we get fewer requests on this item?


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #

Re: [HACKERS] LONG

From
Tom Lane
Date:
>> I  thought  about  the  huge size variable text type a little
>> more.  And I think I could get the  following  implementation
>> to work reliable for our upcoming release.
>> 
>> For   any  relation,  having  one  or  more  LONG  data  type
>> attributes,  another  relation  (named   pg_<something>)   is
>> created,  accessible  only to superusers (and internal access
>> routines).  All LONG data items are  stored  as  a  reference
>> into  that relation, split up automatically so the chunks fit
>> into the installation specific tuple limit size.   Items  are
>> added/updated/removed totally transparent.

> Should we use large objects for this, and beef them up.  Seems that
> would be a good way.

Yes, I think what Jan is describing *is* a large object, with the
slight change that he wants to put multiple objects into the same
behind-the-scenes relation.  (That'd be a good change for regular
large objects as well ... it'd cut down the umpteen-thousand-files
problem.)

The two principal tricky areas would be (1) synchronization ---
having one hidden relation per primary relation might solve the
problems there, but I'm not sure about it; and (2) VACUUM.

But I don't really see why this would be either easier to do or
more reliable than storing multiple segments of a tuple in the
primary relation itself.  And I don't much care for
institutionalizing a hack like a special "LONG" datatype.
        regards, tom lane


Re: [HACKERS] LONG

From
Tom Lane
Date:
wieck@debis.com (Jan Wieck) writes:
>     The rare cases, where someone really needs larger tuples  and
>     not  beeing  able  to  use the proposed LONG data type can be
>     tackled by increasing BLKSIZE for this specific installation.

This would be a more convincing argument if we supported BLCKSZ
greater than 32K, but we don't.

I think we've speculated about having a compilation flag that gets
thrown to change page offsets from shorts to longs, thereby allowing
larger page sizes.  But as Bruce was just pointing out, all of the
code depends in a fundamental way on the assumption that writing a
page is an atomic action.  The larger the page size, the more likely
that you'll see broken tables caused by partial page writes.  So
allowing BLCKSZ large enough to accommodate any tuple wouldn't be a
very good answer.

I think the proposed LONG type is a hack, and I'd rather see us solve
the problem correctly.  ISTM that allowing a tuple to be divided into
"primary" and "continuation" tuples, all stored in the same relation
file, would be a much more general answer and not significantly harder
to implement than a LONG datatype as Jan is describing it.
        regards, tom lane


Re: [HACKERS] Last thoughts about LONG

From
wieck@debis.com (Jan Wieck)
Date:
Peter Eisentraut wrote:

> Before I start the nagging, please be aware that I'm not as smart as I
> think I am. Long datatypes of some sort are clearly necessary -- more
> power to you.

    So be it. It forces me to think it over again and points to
    sections I might have forgotten so far.  Also, it has happened
    to me more than once that writing a totally OBVIOUS
    answer triggered a better solution in my brain (dunno what's
    wrong with that brain, but sometimes it needs to be shaken
    well before use).  Thus, any of your notes can help, and that
    counts!

>
> >     - A  new  field "rellongrelid" type Oid is added to pg_class.
> >       It contains the Oid  of  the  long-value  relation  or  the
> >       invalid Oid for those who have no LONG attributes.
>
> I have a mixed feeling about all these "sparse" fields everywhere. Doing
> it completely formally, this seems to be a one-to-many relation, so you
> should put the referencing field into the pg_long table or whatever
> structure you use, pointing the other way around. This is probably slower,
> but it's cleaner. As I mentioned earlier, this whole arrangement will
> (hopefully) not be needed for all too long, and then we wouldn't want to
> be stuck with it.

    It's 4 bytes per RELATION in pg_class. As a side effect, the
    information will be available at NO COST immediately after
    heap_open() and in every place where a relation is accessed.
    So it is the best place to put it.

>
> >     - At    CREATE   TABLE,   a   long   value   relation   named
> >       "_LONG<tablename>" is created for those tables who need it.
>
> Please don't forget, this would require changes to pg_dump and psql. Also,
> the COPY command might not be able to get away without changes, either.

    Oh yes, thanks. That was a point I forgot!

    Psql must not list tables that begin with "_LONG" on  the  \d
    request.  Anything else should IMHO be transparent.

    Pg_dump either uses a SELECT to build a script that INSERTs
    the data via SQL, or uses COPY. In the SELECT/INSERT case, my
    implementation would again be totally transparent and not
    noticed by pg_dump, only that it must IGNORE "_LONG*"
    relations and be aware that really big tuples can be sent,
    but that's more a libpq question I think (which I already
    checked, because the view/rule/PL combo I created to
    demonstrate a >128K tuple was done through psql).  AFAIK,
    pg_dump doesn't use a binary COPY, and looking at the code
    tells me that this is transparent too (due to the use of the
    type-specific input/output functions there).

    All  pg_dump would have to do is to ignore "_LONG*" relations
    too.

    The real problem is COPY. In the case of  a  COPY  BINARY  it
    outputs  the data portion of the fetched tuples directly. But
    these will only contain the LongData headers,  not  the  data
    itself.

    So at that point, COPY has to do the reverse process of
    heap_insert(): rebuild a faked tuple where all the non-NULL
    LONG values are placed in the representation they would have
    after type input.  Not a big deal; it must only be done with the
    same care as the changes in heapam, so as not to leave unfreed,
    leaked memory around.

> In general, it wouldn't surprise me if some sections of the code would go
> nuts about the news of tuples longer than BLCKSZ coming along. (Where
> "nuts" is either 'truncation' or 'segfault'.)

    The only place where the size of a heap tuple is checked
    (and where the "tuple size too big" message comes from)
    is in hio.c, right before it is copied into the block. Up to
    then, a tuple is NOT explicitly limited to any size.

    So  I would be glad to see crashes coming up from this change
    (not after release - during BETA of course). It would help us
    to get another existing bug out of the code.

> I guess what I'm really saying is that I'd be totally in awe of you if you
> could get all of this (and RI) done by Feb 1st. Good luck.

    Thanks for the flowers, but "awe" is far too much - sorry.

    Over the years I have had my hands on nearly every part of the
    code involved in this, so I'm not a newbie at creating data
    types, utility commands or doing syscat changes.  The LONG
    type I described will be the work of two or three nights.

    I already intended to tackle the long tuples next.  What was
    missing was the idea of how to AVOID them simply. And I had this
    idea just while answering a question about storing big text files
    in the database on the [SQL] list - that woke me up.

    In  contrast  to  the  RI stuff, this time I don't expect any
    bugs, because there are absolutely no side effects I  noticed
    so  far.   On  the  RI  stuff, we discussed for weeks (if not
    months) about tuple visibility during concurrent transactions
    and I finally ran into exactly these problems anyway.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #

Re: [HACKERS] Last thoughts about LONG

From
Tom Lane
Date:
wieck@debis.com (Jan Wieck) writes:
>     Another  bonus  would be this: What happens on an UPDATE to a
>     table  having  LONG  attributes?  If  the  attribute  is  not
>     modified,  the  OLD LongData will be found in the targetlist,
>     and we'll not waste any space by storing the same information
>     again.

Won't work.  If you do that, you have several generations of the
"primary" tuple pointing at the same item in the "secondary" table.
There is room in the multiple primary tuples to keep track of their
committed/uncommitted status, but there won't be enough room to
keep track in the secondary table.

I think this can only work if there are exactly as many generations
of the LONG chunks in the secondary table as there are of the primary
tuple in the main table, and all of them have the same transaction
identification info stored in them as the corresponding copies of
the primary tuple have.

Among other things, this means that an update or delete *must* scan
through the tuple, find all the LONG fields, and go over to the
secondary table to mark all the LONG chunks as deleted by the current
xact, just the same as the primary tuple gets marked.  This puts a
considerable crimp in your claim that it'd be more efficient than
a multiple-tuple-segment approach.

Of course, this could be worked around if the secondary table did *not*
use standard access methods (it could be more like an index, and rely on
the primary table for all xact status info).  But that makes it look
even less like a clean data-type-based solution...
        regards, tom lane


Re: [HACKERS] Last thoughts about LONG

From
Tom Lane
Date:
Peter Eisentraut <e99re41@DoCS.UU.SE> writes:
> I guess what I'm really saying is that I'd be totally in awe of you if you
> could get all of this (and RI) done by Feb 1st. Good luck.

When Jan said this was for 7.0, I assumed he meant the release *after*
the Feb 1st one ... whatever it ends up being called.  I don't believe
it's possible or reasonable to get this done by Feb 1, either.
        regards, tom lane


Re: [HACKERS] Last thoughts about LONG

From
Hannu Krosing
Date:
Jan Wieck wrote:
> 
> Peter Eisentraut wrote:
> 
> > Please don't forget, this would require changes to pg_dump and psql. Also,
> > the COPY command might not be able to get away without changes, either.
> 
>     Oh yes, thanks. That was a point I forgot!
> 
>     Psql must not list tables that begin with "_LONG" on  the  \d
>     request.  Anything else should IMHO be transparent.
> 

If this is the main concern then start them with "pg_L_" and they will be 
ignored by the current implementation as well.

But of course they will surface at \dS, which may or may not be a good thing,
as it makes it possible to list them without changing psql.

-----------
Hannu


Re: [HACKERS] Last thoughts about LONG

From
Hannu Krosing
Date:
Jan Wieck wrote:
> 
>     - At    CREATE   TABLE,   a   long   value   relation   named
>       "_LONG<tablename>" is created for those tables who need it.
>       And of course dropped and truncated appropriate. The schema
>       of this table is
> 
>           rowid       Oid,          -- oid of our main data row
>           rowattno    int2,         -- the attribute number in main data
>           chunk_seq   int4,         -- the part number of this data chunk
>           chunk       text          -- the content of this data chunk
> 
>       There is a unique index defined on (rowid, rowattno).
>

If you plan to use the same LONGs for multiple versions you will probably 
need a refcount int4 too

--------------------
Hannu


Re: [HACKERS] LONG

From
Hannu Krosing
Date:
Tom Lane wrote:
> 
> 
> But I don't really see why this would be either easier to do or
> more reliable than storing multiple segments of a tuple in the
> primary relation itself.  And I don't much care for
> institutionalizing a hack like a special "LONG" datatype.

AFAIK the "hack" is similar to what Oracle does.

At least this is my impression from some descriptions, and it also 
seems a reasonable thing to do in general, as we don't want to read in
500K tuples (and then sort them) just to join on int fields and filter
out on boolean and count(n) < 3.

The description referred to above is about Oracle's habit of returning LONG* 
fields as open file descriptors ready for reading when doing FETCH 1, 
and as already-read-in "strings" when fetching more than 1 tuple.

--------------------
Hannu


Re: [HACKERS] LONG

From
Hannu Krosing
Date:
Tom Lane wrote:
> 
> I think the proposed LONG type is a hack, and I'd rather see us solve
> the problem correctly.  ISTM that allowing a tuple to be divided into
> "primary" and "continuation" tuples, all stored in the same relation
> file, would be a much more general answer and not significantly harder
> to implement than a LONG datatype as Jan is describing it.

Actually they seem to be two _different_ problems - 

1) we may need bigger tuples for several reasons (I would also suggest 
making index tuples twice as long as data tuples to escape the problem 
of indexing text fields above 4K (2K?))

2) the LOB support should be advanced to a state where one could reasonably 
use them for storing more than a few LOBs without making everything else 
crawl, even on filesystems that don't use indexes on filenames (like ext2)

After achieving 2), support could be added for on-demand migration of LONG 
types to LOBs.

I guess that Jan's suggestion is just a quick hack for avoiding fixing LOBs.

-----------------------
Hannu


Re: [HACKERS] Last thoughts about LONG

From
wieck@debis.com (Jan Wieck)
Date:
Tom Lane wrote:

> wieck@debis.com (Jan Wieck) writes:
> >     Another  bonus  would be this: What happens on an UPDATE to a
> >     table  having  LONG  attributes?  If  the  attribute  is  not
> >     modified,  the  OLD LongData will be found in the targetlist,
> >     and we'll not waste any space by storing the same information
> >     again.
>
> Won't work.  If you do that, you have several generations of the
> "primary" tuple pointing at the same item in the "secondary" table.
> There is room in the multiple primary tuples to keep track of their
> committed/uncommitted status, but there won't be enough room to
> keep track in the secondary table.

    A really critical point, to think about in depth. And another
    point I could have stumbled over.

    But it would work anyway.

    I assumed up to now, that even under MVCC, and even if
    reading dirty, there could be at most one single transaction
    modifying one and the same tuple - no?  Ignore all the rest
    and forget all my comments if your answer is no.  But please
    tell me how something like RI should ever work RELIABLY in
    such an environment.  In fact, in that case I would
    immediately stop all my effort on FOREIGN KEY, because it
    would be a dead-end street - so I assume your answer is yes.

    My concept, using regular heap access inside of  heap  access
    to  act on "secondary" table, means to stamp the same current
    xact as for "primary" table into xmax of old, and  into  xmin
    of  new  tuples for the "secondary" table.  And it means that
    this operation appears to be atomic if living  in  a  locking
    environment.

    The only thing I DIDN'T want to do is to stamp xmax and
    create new instances in the "secondary" table if no update is
    done to the value of the old LONG attribute.  Any UPDATE
    modifying the LONG value, and INSERT/DELETE of course, will
    stamp this information and/or create new instances.  So the
    only thing (because the only difference) to worry about are
    unstamped and uncreated instances in the "secondary" table -
    right?

    Since INSERT/DELETE always act synchronously with the "primary"
    table, and an UPDATE modifying the LONG does too, the only thing
    left to worry about is an UPDATE without updating the LONG.

    In this scenario, a "secondary" tuple of a not-updated
    "primary" LONG will have an older, but surely committed,
    xmin.  And its xmax will be either infinite, or aborted.  So
    it is visible - no other chance. And that's good, because at
    that point the updater of the "primary" tuple does a
    NOOP on the "secondary".  And this (extended) part of the
    "primary" tuple's information is absolutely unaffected,
    regardless of whether its transaction commits or rolls back.

    Well, your concern is again valid. This concept MIGHT need to
    force a NON-MVCC locking scheme for "secondary" tables.  But
    as far as I learned from the RI stuff, that isn't a problem,
    and therefore the current jackpot value is to be added to Vadim's
    account.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #

Re: [HACKERS] LONG

From
Bruce Momjian
Date:
> On Sat, 11 Dec 1999, Bruce Momjian wrote:
> 
> > In fact, you could get fancy and allow an update of a non-pg_long using
> > column to not change pg_long at all.  Just keep the same value in the
> > column.  If the transaction fails or succeeds, the pg_long is the same
> > for that tuple.  Of course, because an update is a delete and then an
> > insert, that may be hard to do.  For very long fields, it would be a win
> > for UPDATE.  You certainly couldn't do that with chained tuples.
> 
> While this is great and all, what will happen when long tuples finally get
> done? Will you remove this, or keep it, or just make LONG and TEXT
> equivalent? I fear that elaborate structures will be put in place here
> that might perhaps only be of use for one release cycle.

I think the idea is that Jan's idea is better than chaining tuples.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] LONG

From
wieck@debis.com (Jan Wieck)
Date:
Bruce Momjian wrote:

> > While this is great and all, what will happen when long tuples finally get
> > done? Will you remove this, or keep it, or just make LONG and TEXT
> > equivalent? I fear that elaborate structures will be put in place here
> > that might perhaps only be of use for one release cycle.
>
> I think the idea is that Jan's idea is better than chaining tuples.

    Just as Tom already pointed out, it cannot completely replace
    tuple chaining because of the atomicity assumption of a single
    fsync(2) operation in the current code.  Due to this, we cannot
    get around the cases LONG will leave open by simply raising
    BLKSIZE; we instead need to tackle that anyway.

    But I believe LONG would still be something worth the effort.
    It will lower the pressure on chained tuples, giving us more
    time to build a really good solution, and I think LONG can
    survive tuple chaining and live in coexistence with it.  As
    said in my last mail, I still believe that not touching LONG
    values at UPDATE can avoid storing the same huge value again.
    And that's a benefit tuple chaining will never give us.

    Remember:  If  your only tool is a hammer, anything MUST look
    like a nail.  So why not provide a richer set of tools?


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #

Re: [HACKERS] LONG

From
wieck@debis.com (Jan Wieck)
Date:
Hannu Krosing wrote:

> Tom Lane wrote:
> >
> >
> > But I don't really see why this would be either easier to do or
> > more reliable than storing multiple segments of a tuple in the
> > primary relation itself.  And I don't much care for
> > institutionalizing a hack like a special "LONG" datatype.
>
> AFAIK the "hack" is similar to what Oracle does.
>
> At least this is my impression from some descriptions, and it also
> seems reasonable thing to do in general as we dont want to read in
> 500K tuples (and then sort them) just to join on int fields and filter
> out on boolean and count(n) < 3.

    Even if this is a side effect I hadn't seen at the
    beginning, it would be one of the best side effects I've
    ever seen.  A really tempting one that's worth trying
    anyway.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #

Re: [HACKERS] LONG

From
Bruce Momjian
Date:
> Bruce Momjian wrote:
> 
> > > While this is great and all, what will happen when long tuples finally get
> > > done? Will you remove this, or keep it, or just make LONG and TEXT
> > > equivalent? I fear that elaborate structures will be put in place here
> > > that might perhaps only be of use for one release cycle.
> >
> > I think the idea is that Jan's idea is better than chaining tuples.
> 
>     Just as Tom already pointed out, it cannot completely replace
>     tuple chaining because of the atomicy  assumption  of  single
>     fsync(2)  operation  in  current code. Due to this, we cannot
>     get around the cases LONG will leave open by  simply  raising
>     BLKSIZE, we instead need to tackle that anyways.

Actually, in looking at the fsync() system call, it does write out all the
data for the file descriptor before marking the transaction as complete, so
there is no hard reason not to raise it, but because the OS has to do
two reads to get 16k, I think we are better off keeping 8k as our base block
size.

Jan's idea is not to chain tuples, but to keep tuples at 8k, and instead
chain out individual fields into 8k tuple chunks, as needed.  This seems
like it makes much more sense.  It uses the database to recreate the
chains.

Let me mention a few things.  First, I would like to avoid a LONG data
type if possible.  Seems a new data type is just going to make things
more confusing for users.

My idea is a much more limited one than Jan's.  It is to have a special
-1 varlena length when the data is chained in the long relation.  I
would do:

-1|oid|attno

in 12 bytes.  That way, you can pass this around as long as you want,
and just expand it in the varlena textout and compare routines when you
need the value.  That prevents the tuples from changing size while being
processed.  As far as I remember, there is no need to see the data in
the tuple except in the type comparison/output routines.
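
As a sketch, that 12-byte reference could look roughly like this as a struct;
the names are invented, and attno is shown as int4 so the total comes out to
the 12 bytes mentioned (an int2 plus padding would work out the same):

    /* Illustrative only -- not an existing backend structure. */
    typedef struct LongRef
    {
        int32   vl_len;   /* the special marker (e.g. -1) instead of a length */
        Oid     refoid;   /* which row's long data to look up in pg_long      */
        int32   attno;    /* which attribute of that row                      */
    } LongRef;            /* 4 + 4 + 4 = 12 bytes                             */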

Now it would be nice if we could set the varlena length to 12, its
actual length, and then just somehow know that a varlena of 12 was a
long data entry.  Our current varlena has a maximum length of 64k.  I
wonder if we should grab a high bit of that to trigger long.  I think we
may be able to do that, and just do an AND mask to remove the bit to see
the length.  We don't need the high bit because our varlenas can't be
over 32k.  We can modify VARSIZE to strip it off, and make another
macro like ISLONG to check for that high bit.

Seems this could be done with little code.
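
Along those lines, a sketch of what the macros might look like; this is
illustrative only (the flag value and the ISLONG name are invented), and the
real VARSIZE macro would need the mask folded in as described:

    /* With values never over 32k, bit 15 of the length word is free
     * to mark "the data really lives in pg_long".  Sketch only.
     */
    #define VARLONGFLAG   0x8000
    #define VARSIZE(PTR)  (((struct varlena *) (PTR))->vl_len & ~VARLONGFLAG)
    #define ISLONG(PTR)   ((((struct varlena *) (PTR))->vl_len & VARLONGFLAG) != 0)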

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] LONG

From
Bruce Momjian
Date:
> > At least this is my impression from some descriptions, and it also
> > seems reasonable thing to do in general as we dont want to read in
> > 500K tuples (and then sort them) just to join on int fields and filter
> > out on boolean and count(n) < 3.
> 
>     Even  if  this  is  a  side  effect  I  haven't  seen  at the
>     beginning, it would be one of the  best  side  effect's  I've
>     ever  seen.  A  really  tempting  one  that's worth to try it
>     anyway.
> 

Or make struct varlena vl_len a 15-bit field, and make islong a 1-bit
field.  I don't remember if using & manually or bit fields is faster.
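
For comparison, the bit-field variant might look roughly like this; it is
purely illustrative (not the actual struct varlena definition) and relies on
the same assumption that real lengths never exceed 32k:

    struct varlena_bitfield
    {
        unsigned int vl_len : 15;   /* enough while values stay under 32k          */
        unsigned int islong : 1;    /* set when the data actually lives in pg_long */
        char         vl_dat[1];     /* in-line data, or a pg_long reference        */
    };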

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] LONG

From
Bruce Momjian
Date:
>     Maybe  we  make  this  mechanism  so  general  that   it   is
>     automatically applied to ALL varsize attributes? We'll end up
>     with on big pg_long where 90+% of the databases content  will
>     be stored.

If most joins, comparisons are done on the 10% in the main table, so
much the better.

> 
>     But  as  soon as an attribute stored there is used in a WHERE
>     or is subject to be joined, you'll see why not (as said, this
>     type  will  NOT  be enabled for indexing). The operation will
>     probably fallback to a seq-scan on the main  table  and  then
>     the attribute must be fetched from pg_long with an index scan
>     on every single compare etc. - no, no, no.

Let's face it.  Most long tuples are store/retrieve, not ordered on or
used in WHERE clauses.  Moving them out of the main table speeds up
things.  It also prevents expansion of rows that never end up in the
result set.

In your system, a sequential scan of the table will pull in all this
stuff because you are going to expand the tuple.  That could be very
costly.  In my system, the expansion only happens on output if the LONG
field does not appear in the WHERE or ORDER BY clauses.

Also, my idea was to auto-enable longs for all varlena types, so short
values stay in the table, while longer chained ones that take up lots of
space and are expensive to expand are retrieved only when needed.

I see this as much better than chained tuples.


> 
>     And it will not be one single pg_long table. Instead it  will
>     be a separate table per table, that contains one or more LONG
>     attributes.  IIRC, the TRUNCATE functionality was implemented
>     exactly  to  QUICKLY  be able to whipe out the data from huge
>     relations AND get the disk space  back.  In  the  case  of  a
>     central  pg_long, TRUNCATE would have to scan pg_long to mark
>     the tuples for deletion and vacuum must be run to really  get
>     back  the  space.  And a vacuum on this central pg_long would
>     probably take longer than the old DELETE, VACUUM of  the  now
>     truncated table itself. Again no, no, no.
> 

I guess a separate pg_long_ per table would be good.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] LONG

From
Bruce Momjian
Date:
>     Maybe  we  make  this  mechanism  so  general  that   it   is
>     automatically applied to ALL varsize attributes? We'll end up
>     with on big pg_long where 90+% of the databases content  will
>     be stored.
> 
>     But  as  soon as an attribute stored there is used in a WHERE
>     or is subject to be joined, you'll see why not (as said, this
>     type  will  NOT  be enabled for indexing). The operation will
>     probably fallback to a seq-scan on the main  table  and  then
>     the attribute must be fetched from pg_long with an index scan
>     on every single compare etc. - no, no, no.

A field value over 8k is not going to be something you join on,
restrict, or order by in most cases.  It is going to be some long
narrative or field that is just for output to the user, usually not used
to process the query.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Jesus, what have I done (was: LONG)

From
wieck@debis.com (Jan Wieck)
Date:
Bruce Momjian wrote (in several messages):

> Actually, in looking at the fsync() system call, it does write the
> entire file descriptor before marking the transaction as complete, so
> there is no hard reason not to raise it, but because the OS has to do
> two reads to get 16k, I think we are better keeping 8k as our base block
> size.

    Agreed. Let's stay with the 8K default.

> -1|oid|attno

    Actually I think you need two more pieces of information to move it
    around independently.  As you agreed somewhere else (on my
    TRUNCATE issue), it would be better to keep the long values
    in a per-table expansion relation.  Thus, you need the Oid of
    that too at least. Also, it would be good to know the size of
    the data before fetching it, so you need that too.

    But that's not the important issue; there's also an (IMHO
    dangerous) assumption in it, see below.

> Now it would be nice if we could set the varlena length to 12, it's
> actual length, and then just somehow know that the varlena of 12 was a
> long data entry.  Our current varlena has a maximum length of 64k.
>
> Or make struct varlena vl_len a 15-bit field, and make islong a 1-bit
> field.  I don't remember if using & manually or bit fields is faster.

    I don't see vl_len as a 15-bit field.  In the current sources
    (in postgres.h), it is an int32.  And I'm sure that no other
    code is aware that some magic bits in it carry a special
    meaning.  At least the types I added recently (numeric and
    lztext) aren't.  Nor am I sure that a variable-length Datum is
    never duplicated somewhere just by using the information from
    vl_len, with or without using the macro.  Thus we would have
    to visit a lot of code to make sure this new variable-length
    Datum can be passed around as you like.

    And the drawback that IMHO counts most is that existing user
    type definitions treat the first 32 bits of a variable-length
    data type just as I have interpreted them up to now.  So we
    could occasionally break more than we are aware of.

> In your system, a sequential scan of the table will pull in all this
> stuff because you are going to expand the tuple.  That could be very
> costly.  In my system, the expansion only happens on output if they LONG
> field does not appear in the WHERE or ORDER BY clauses.

In my system, it would do exactly as in yours, because they are mostly the
same. The modification done to the tuple in heap_insert() and heap_replace(),
just before the call to RelationPutHeapTupleAtEnd(), makes each
LONG Datum a varsize value of 20 bytes. It's just that the first 32 bits
don't contain any magic information.

> >     Maybe  we  make  this  mechanism  so  general  that   it   is
> >     automatically applied to ALL varsize attributes? We'll end up
> >     with on big pg_long where 90+% of the databases content  will
> >     be stored.
>
> If most joins, comparisons are done on the 10% in the main table, so
> much the better.

    Yes, but how would you want to judge which varsize value to
    put into the "secondary" relation, and which one to keep in
    the "primary" table for fast comparisons?

    I think you forgot one little detail.  In our model, you can
    only move the Datum's extended information around as is.  It
    will never be expanded in place, so it must be fetched again
    (by index scan) at any place where the value itself is
    required.

    The installed base currently uses varsize attributes with
    indices on them to restrict, sort and group on.  Pushing such
    a field into "secondary" storage occasionally will cause a
    substantial loss of performance.

    So again, how do you determine which of the attributes is a
    candidate to push into "secondary"?  It is such a generic
    approach that I cannot imagine any fail-safe method.

    I'd rather have another LONG data type, one that lets me
    store huge strings but where I know exactly what I can't do
    with it, than some automatic detection process that I cannot
    force to do what I want.  It has happened just too often to me
    that these "user friendly, knows better what I might want"
    systems got me by the balls.  I'm a real programmer, so
    there's always a way out for me, but what should a real user
    do?

> Let's fact it.  Most long tuples are store/retrieve, not ordered on or
> used in WHERE clauses.  Moving them out of the main table speeds up
> things.  It also prevents expansion of rows that never end up in the
> result set.

    Having a tuple consisting of 30+ attributes, where 20 of them
    are varsize ones (CHAR, VARCHAR, NUMERIC etc.), what makes it
    a long tuple?  Yes, I'm repeating this question once again,
    because we're talking about a "one size must fit all cases"
    solution here.

> stuff because you are going to expand the tuple.  That could be very
> costly.  In my system, the expansion only happens on output if they LONG
> field does not appear in the WHERE or ORDER BY clauses.

    No I won't.  As explained, I would return the tuple as is,
    just with the LONG reference information.  It will only be
    expanded, but then always again, when it is needed for a
    compare, stored again, or output to the client.  This "always
    again" is one of my objections against your "treat all varsize
    attributes as pushable" concept.  In one of my early projects,
    I had to manage a MicroVAX for a year, and ever since I love
    systems that can be fine-tuned, really!  Auto detection is a
    nice feature, but if it fails and you don't have any override
    option, you're hosed.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #

Re: Jesus, what have I done (was: LONG)

From
Bruce Momjian
Date:
>     around independently.  As you agreed somewhere  else  (on  my
>     TRUNCATE  issue),  it would be better to keep the long values
>     in a per table expansion relation.  Thus, you need the Oid of
>     that too at least. Also, it would be good to know the size of
>     the data before fetching it, so you need that to.
> 

Yes, I guess you could store the size, but the length is known by
looking at the long relation.  We already have an index to get them in
order, so there is no need to load them in random order.


>     The installed base currently  uses  varsize  attributes  with
>     indices  on  them  to  condition, sort and group on them. Now
>     pushing such a field into "secondary" occationally will cause
>     a substantial loss of performance.

We could allow indexes on long values by storing only the first 4k of the
value.  If there is no other index entry with a matching 4k prefix, the
4k-length index entry is fine.  If not, you fail the insert with an error.


>     I'd  better like to have another LONG data type, that enables
>     me to store huge string into but where I exactly know what  I
>     can't  do  with, than having some automatic detection process
>     that I cannot force to do what I want. It  happened  just  to
>     often  to me, that these "user friendly better knowing what I
>     might want"  systems  got  me  by  the  ball's.  I'm  a  real
>     programmer,  so  there's  allway  a  way out for me, but what
>     shoud a real user do?

Automatic handling allows small values to stay inline, and long values to be
moved to long tables, in the same column.  This is a nice feature.  It
maximizes performance and capabilities.  I can't imagine why someone
would want a LONG column if they can have a column that does both inline
and long automatically and efficiently.

>     No I won't. As explained, I would return a tuple as is,  just
>     with  the  LONG reference information. It will only, but then
>     allways again, be expanded if needed to compare, store  again
>     or  beeing output to the client.  This "allways again" is one
>     of my drawbacks against your "treating all varsize  pushable"
>     concept.  In  one  of  my  early  projects, I had to manage a
>     microVax for a year, and I love  systems  that  can  be  fine
>     tuned  since  then, really! Auto detection is a nice feature,
>     but if that failes and you don't have  any  override  option,
>     you're hosed.

So you expand it when you need it?  That's fine.  We can do that, except
if you are accessing a real in-buffer tuple, and I am not sure you are
going to know that at the time in all routines.  By looking up each time
it is needed and not changing the tuple, you make changes to the system
minimal.  And in my system, you have long entries only when the data
requires it.



Re: Jesus, what have I done (was: LONG)

From
Bruce Momjian
Date:
OK, I think I can take your ideas and polish this into a killer feature,
so I will keep going on this discussion.


> Bruce Momjian wrote (in several messages):
> 
> > Actually, in looking at the fsync() system call, it does write the
> > entire file descriptor before marking the transaction as complete, so
> > there is no hard reason not to raise it, but because the OS has to do
> > two reads to get 16k, I think we are better keeping 8k as our base block
> > size.
> 
>     Agreed. Let's stay with the 8K default.

OK.  I am worried about performance problems with increasing this for
non-large tuples.  That is why I wanted to keep 8k.  We are never
going to be able to configure 8MB tuples, so I figured 8k was good
enough.


> 
> > -1|oid|attno
> 
>     Actually  I  think  you need two more informations to move it
>     around independently.  As you agreed somewhere  else  (on  my
>     TRUNCATE  issue),  it would be better to keep the long values
>     in a per table expansion relation.  Thus, you need the Oid of
>     that too at least. Also, it would be good to know the size of
>     the data before fetching it, so you need that to.

Yes, I see your point that you don't know the relation oid in those adt
routines.  Yes, you would need the oid too.  New structure would be:
1-bit long flag|31-bit length|long relid|tuple oid|attno

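For illustration, here is a minimal standalone sketch of how that layout and
the high-order "long" bit could be encoded.  The names (VARLONGFLAG,
VARISLONG, LongRef) are invented for this sketch and are not the actual
postgres.h definitions:

/*
 * Standalone sketch (not the real postgres.h definitions) of how the
 * high-order bit of a 32-bit varlena length word could flag a "long"
 * reference, as proposed above.
 */
#include <stdio.h>
#include <stdint.h>

typedef uint32_t Oid;

#define VARLONGFLAG      0x80000000u            /* 1-bit long flag   */
#define VARLEN(hdr)      ((hdr) & 0x7FFFFFFFu)  /* 31-bit length     */
#define VARISLONG(hdr)   (((hdr) & VARLONGFLAG) != 0)

/* What would follow the length word when the long flag is set. */
typedef struct LongRef
{
    uint32_t    vl_len;     /* VARLONGFLAG | sizeof(LongRef)         */
    Oid         longrelid;  /* per-table expansion relation          */
    Oid         tupleoid;   /* row in that relation                  */
    int32_t     attno;      /* attribute of the original tuple       */
} LongRef;

int
main(void)
{
    LongRef ref;

    ref.vl_len = VARLONGFLAG | sizeof(LongRef);
    ref.longrelid = 10032;
    ref.tupleoid = 23123;
    ref.attno = 5;

    printf("is long: %d, stored size: %u bytes\n",
           VARISLONG(ref.vl_len), VARLEN(ref.vl_len));
    return 0;
}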

> > Now it would be nice if we could set the varlena length to 12, it's
> > actual length, and then just somehow know that the varlena of 12 was a
> > long data entry.  Our current varlena has a maximum length of 64k.
> >
> > Or make struct varlena vl_len a 15-bit field, and make islong a 1-bit
> > field.  I don't remember if using & manually or bit fields is faster.
> 
>     I  don't see vl_len as a 15-bit field. In the current sources
>     (in postgres.h), it is an int32. And I'm sure  that  not  any

Sorry, 32-bit field.  I said 16-bit because there is no need for length
values above 8k.  Seems we have more than 16 unused bits in the length word.


>     code  is  aware that some magic bit's in it contain a special
>     meaning. At least the types I  added  recently  (numeric  and
>     lztext)  aren't.  Nor  am  I sure, a variable length Datum is
>     never duplicated somewhere, just  by  using  the  information
>     from  vl_len,  with or without using the macro. Thus we would
>     have to visit alot of code to make  sure  this  new  variable
>     length Datum can be passed around as you like.

I just checked: vl_len is used only in varlena.c, inv_api.c, and in the
VARSIZE define.  I made sure of that several releases ago, so they all
use the macro.


> 
>     And  the  IMHO  most counting drawback is, that existing user
>     type definitions treat the first 32 bits in a variable length
>     data  type just as I interpreted the meaning up to now. So we
>     could occationally break more than we are aware of.

OK, the solution is that we never pass back this type with the long bit
set.  We always expand it on return to user applications.

We can restrict type expansion to only certain data types.  Not all
varlena types have to be expanded.

> 
> > In your system, a sequential scan of the table will pull in all this
> > stuff because you are going to expand the tuple.  That could be very
> > costly.  In my system, the expansion only happens on output if they LONG
> > field does not appear in the WHERE or ORDER BY clauses.
> 
> In my system, it would do exactly as in your's, because they are mostly the
> same. The modification done to the tuple in heap_insert() and heap_replace(),
> just before the call to RelationPutHeapTupleAtEnd(), makes each
> LONG Datum of varsize 20. Just that the first 32 bits don't contain any
> magic information.

OK.  I just want to get this working in a seamless way with our existing
types.

> 
> > >     Maybe  we  make  this  mechanism  so  general  that   it   is
> > >     automatically applied to ALL varsize attributes? We'll end up
> > >     with on big pg_long where 90+% of the databases content  will
> > >     be stored.
> >
> > If most joins, comparisons are done on the 10% in the main table, so
> > much the better.
> 
>     Yes, but how would you want to judge which varsize  value  to
>     put  onto  the "secondary" relation, and which one to keep in
>     the "primary" table for fast comparisions?

There is only one place in heap_insert that checks for tuple size and
returns an error if it exceeds block size.  I recommend that when we exceed
it, we scan the tuple, find the largest varlena attribute of a type that is
supported for long relations, set the long bit, and copy the data into the
long table.  Keep going until the tuple is small enough, and if it never
gets there, throw an error on tuple size exceeded.  Also, prevent indexed
columns from being made long.

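A rough standalone sketch of that loop, under the assumption that we only
track attribute sizes (the real heap_insert() works on a HeapTuple, and
MAXTUPLEN and LONGREFSIZE are invented numbers):

/*
 * Toy model of the loop described above -- NOT the real heap_insert().
 * "Moving an attribute to the long table" is simulated by shrinking it
 * to the size of a fixed-width long reference.
 */
#include <stdio.h>

#define MAXTUPLEN   8192    /* stand-in for the block-size limit  */
#define LONGREFSIZE 20      /* size of an out-of-line reference   */

typedef struct ToyAttr
{
    int     len;        /* current on-tuple size of the attribute */
    int     is_varlena; /* only varlena attrs can be pushed out   */
    int     is_indexed; /* indexed columns must stay inline       */
    int     moved;      /* set once pushed to the long table      */
} ToyAttr;

static int
tuple_size(ToyAttr *attrs, int natts)
{
    int     i, total = 0;

    for (i = 0; i < natts; i++)
        total += attrs[i].len;
    return total;
}

/* Returns 0 on success, -1 if the tuple still doesn't fit. */
static int
shrink_tuple(ToyAttr *attrs, int natts)
{
    while (tuple_size(attrs, natts) > MAXTUPLEN)
    {
        int     i, best = -1;

        /* find the largest movable varlena attribute */
        for (i = 0; i < natts; i++)
            if (attrs[i].is_varlena && !attrs[i].is_indexed &&
                !attrs[i].moved &&
                (best < 0 || attrs[i].len > attrs[best].len))
                best = i;

        if (best < 0 || attrs[best].len <= LONGREFSIZE)
            return -1;      /* nothing left to move: tuple too large */

        /* "copy the data into the long table" would happen here */
        attrs[best].len = LONGREFSIZE;
        attrs[best].moved = 1;
    }
    return 0;
}

int
main(void)
{
    ToyAttr attrs[] = {
        { 4,     0, 1, 0 },     /* indexed int       */
        { 120,   1, 0, 0 },     /* short varchar     */
        { 30000, 1, 0, 0 },     /* long narrative    */
    };

    if (shrink_tuple(attrs, 3) == 0)
        printf("tuple now %d bytes, attr 2 moved: %d\n",
               tuple_size(attrs, 3), attrs[2].moved);
    else
        printf("ERROR: tuple size exceeded\n");
    return 0;
}
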
> 
>     I think you forgot one little detail. In our model,  you  can
>     only  move  around the Datum's extended information around as
>     is. It will never be expanded in place, so it must be fetched
>     (index  scan)  again  at  any  place,  the  value  itself  is
>     required.

Yes, I agree, but in most cases it will only be expanded on return to the
user application, because long fields, used as above only when needed,
are usually not referenced in WHERE or ORDER BY.  If only a few values exceed
the 8k limit, only those would have to be retrieved to evaluate the WHERE or
ORDER BY.  If many are long, it would be a lot of lookups, but I think
this solution would be the best for most uses.

> 
>     The installed base currently  uses  varsize  attributes  with
>     indices  on  them  to  condition, sort and group on them. Now
>     pushing such a field into "secondary" occationally will cause
>     a substantial loss of performance.

Really?  Do people really group/order by on >8k values often?  I question
this.

> 
>     So  again,  how do you determine which of the attributes is a
>     candidate to push into "secondary"?  It  is  a  such  generic
>     approach, that I cannot imagine any fail safe method.

Outlined above in heap_insert().  Seems it would be a small loop.

> 
>     I'd  better like to have another LONG data type, that enables
>     me to store huge string into but where I exactly know what  I
>     can't  do  with, than having some automatic detection process
>     that I cannot force to do what I want. It  happened  just  to
>     often  to me, that these "user friendly better knowing what I
>     might want"  systems  got  me  by  the  ball's.  I'm  a  real
>     programmer,  so  there's  allway  a  way out for me, but what
>     shoud a real user do?

Automatic is better, I think.  We already have too many character types,
and another one is going to be confusing.  Also, if you have data that
is mostly under 8k, but a few values are over, how do you store that?  Make
them all LONG and have the overhead for each row?  This seems like a
situation many people are in.  Also, by making it automatic, we can
change the implementation later without having to re-teach people how to
store long tuples.

> 
> > Let's fact it.  Most long tuples are store/retrieve, not ordered on or
> > used in WHERE clauses.  Moving them out of the main table speeds up
> > things.  It also prevents expansion of rows that never end up in the
> > result set.
> 
>     Having a tuple consisting of 30+ attributes, where 20 of them
>     are varsize ones (CHAR, VARCHAR, NUMERIC etc.), what makes it
>     a long tuple? Yes, I'm repeating this  question  once  again,
>     because  we're talking about a "one must fit all cases" here.

Again, scan tuple and move to long table until tuple fits.

> 
> > stuff because you are going to expand the tuple.  That could be very
> > costly.  In my system, the expansion only happens on output if they LONG
> > field does not appear in the WHERE or ORDER BY clauses.
> 
>     No I won't. As explained, I would return a tuple as is,  just
>     with  the  LONG reference information. It will only, but then
>     allways again, be expanded if needed to compare, store  again
>     or  beeing output to the client.  This "allways again" is one
>     of my drawbacks against your "treating all varsize  pushable"
>     concept.  In  one  of  my  early  projects, I had to manage a
>     microVax for a year, and I love  systems  that  can  be  fine
>     tuned  since  then, really! Auto detection is a nice feature,
>     but if that failes and you don't have  any  override  option,
>     you're hosed.

I am confused here.  With my code, you only have to:
    add code to write/read from long tables
    add code to expand long values in varlen access routines
    add code to heap_insert() to move data to long tables
    add code to heap_delete() to invalidate long tuples



Re: Jesus, what have I done (was: LONG)

From
wieck@debis.com (Jan Wieck)
Date:
> >     No I won't. As explained, I would return a tuple as is,  just
> >     with  the  LONG reference information. It will only, but then
> >     allways again, be expanded if needed to compare, store  again
> >     or  beeing output to the client.  This "allways again" is one
> >     of my drawbacks against your "treating all varsize  pushable"
> >     concept.  In  one  of  my  early  projects, I had to manage a
> >     microVax for a year, and I love  systems  that  can  be  fine
> >     tuned  since  then, really! Auto detection is a nice feature,
> >     but if that failes and you don't have  any  override  option,
> >     you're hosed.
>
> I am confused here.  With my code, you only have to:
>
>    add code to write/read from long tables
>    add code to expand long values in varlen access routines
>    add code to heap_insert() to move data to long tables
>    add code to heap_delete() to invalidate long tuples

    Add  code  to  expand  long values in varlen access routines,
    you're joking - no?

    How many functions are there, called via the fmgr with a
    Datum as argument, that only know by themselves (and from a
    system catalog) that they receive a variable-length attribute?

    So you would be better off doing the fetching in the fmgr.
    Then again, there are many places in the code (and possibly in
    user extensions too) that call builtin functions like
    textout() directly, passing them the Datum they got from
    somewhere.

    I can understand why you would like to automatically pull out
    varsize values as needed.  But I really see a bunch of
    problems coming with it.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #

Re: [HACKERS] Re: Jesus, what have I done (was: LONG)

From
Bruce Momjian
Date:
> >    add code to write/read from long tables
> >    add code to expand long values in varlen access routines
> >    add code to heap_insert() to move data to long tables
> >    add code to heap_delete() to invalidate long tuples
> 
>     Add  code  to  expand  long values in varlen access routines,
>     you're joking - no?
> 
>     How many functions are there, called  via  the  fmgr  with  a
>     Datum as argument, and only knowing by themself (and a system
>     catalog) that they receive a variable length attribute?
> 
>     So you would better do the fetching in the fmgr. Then  again,
>     there  are  many  places  in  the  code (and possibly in user
>     extensions too), that call builtin functions  like  textout()
>     directly, passing it the Datum they got from somewhere.


You may be able to expand the in-tuple copy if you had a bit on the
tuple that said long fields exist, and do a heap_tuplecopy() only in
those cases.

You also could cache recently looked-up expand_long() values so repeated
calls could return the value without reconstructing it each time.



Re: [HACKERS] Re: Jesus, what have I done (was: LONG)

From
Bruce Momjian
Date:
> > I am confused here.  With my code, you only have to:
> >
> >    add code to write/read from long tables
> >    add code to expand long values in varlen access routines
> >    add code to heap_insert() to move data to long tables
> >    add code to heap_delete() to invalidate long tuples
> 
>     Add  code  to  expand  long values in varlen access routines,
>     you're joking - no?

No, I am not joking.  Why not expand them there?  If we look at textout,
it returns a character string for the text field.  Why not do the lookup
of long there and return a very long value?

If we look at texteq, we expand any long values into a palloc'ed area
and do the compare.  Here, I can see the advantage of knowing the length
of the long string.

> 
>     How many functions are there, called  via  the  fmgr  with  a
>     Datum as argument, and only knowing by themself (and a system
>     catalog) that they receive a variable length attribute?
> 
>     So you would better do the fetching in the fmgr. Then  again,
>     there  are  many  places  in  the  code (and possibly in user
>     extensions too), that call builtin functions  like  textout()
>     directly, passing it the Datum they got from somewhere.

I see what you are suggesting, that we expand in fmgr, but we don't know
the arg types in there, do we?  I was suggesting we create an
expand_long() function that takes a long varlena and returns the long
value in palloc'ed memory, and sprinkle the calls in varlena.c and
varchar.c, etc.

If you prefer to expand the tuple itself, you can do that, but I think
doing it only when needed is easier because of in-buffer tuples that you
have to process without modification.

> 
>     I can understand why you would like to automatically pull out
>     varsize values as  needed.  But  I  see  really  a  bunch  of
>     problems coming with it.

These are the only comments you have?  Does that mean the other things I
said are OK, or that you are humoring me?



Re: [HACKERS] LONG

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Also, my idea was to auto-enable longs for all varlena types, so short
> values stay in the table, while longer chained ones that take up lots of
> space and are expensive to expand are retrieved only when needed.

I missed most of yesterday's discussion (was off fighting a different
fire...).  This morning in the shower I had a brilliant idea, which
I now see Bruce has beaten me to ;-)

The idea of doing tuple splitting by pushing "long" fields out of line,
rather than just cutting up the tuple at arbitrary points, is clearly
a win for the reasons Bruce and Jan point out.  But I like Bruce's
approach (automatically do it for any overly-long varlena attribute)
much better than Jan's (invent a special LONG datatype).  A special
datatype is bad for several reasons:
* it forces users to kluge up their database schemas;
* inevitably, users will pick the wrong columns to make LONG (it's
  a truism that programmers seldom guess right about what parts of
  their programs consume the most resources; users would need a
  "profiler" to make the right decisions);
* it doesn't solve the problems for arrays, which desperately need it;
* we'd need to add a whole bunch of operations on the special datatype;

I could live with all of those limitations if a "clean" datatype-based
solution were possible, ie, all the special code is in the datatype
functions.  But we already know that that's not possible --- there would
have to be special hacks for the LONG datatype in other places.  So I
think we ought to handle the problem as part of the tuple access
machinery, not as a special datatype.

I think that the right place to implement this is in heapam, and that
it should go more or less like this:

1. While writing out a tuple, if the total tuple size is "too big"
(threshold would be some fraction of BLCKSZ, yet to be chosen),
then the tuple manager would go through the tuple to find the longest
varlena attribute, and convert same into an out-of-line attribute.
Repeat if necessary until tuple size fits within threshold.

2. While reading a tuple, fastgetattr() automatically fetches the
out-of-line value if it sees the requested attribute is out-of-line.
(I'd be inclined to mark out-of-line attributes in the same way that
NULL attributes are marked: one bit in the tuple header shows if any
out-of-line attrs are present, and if so there is a bitmap to show
which ones are out-of-line.  We could also use Bruce's idea of
commandeering the high-order bit of the varlena length word, but
I think that's a much uglier and more fragile solution.)

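A toy illustration of the marking scheme in point 2 -- an invented infomask
bit plus a per-attribute bitmap, not the real HeapTupleHeader or
fastgetattr():

/*
 * Toy of the read-side marking described above.  HEAP_HASOUTOFLINE and
 * the bitmap layout are invented for illustration; the real null-bitmap
 * handling is considerably more involved.
 */
#include <stdio.h>
#include <stdint.h>

#define HEAP_HASOUTOFLINE  0x0800   /* invented infomask bit */

typedef struct ToyTupleHeader
{
    uint16_t    infomask;
    uint8_t     outofline_bits[4];  /* 1 bit per attribute, up to 32 */
} ToyTupleHeader;

static int
att_is_out_of_line(const ToyTupleHeader *hdr, int attno)   /* 0-based */
{
    if ((hdr->infomask & HEAP_HASOUTOFLINE) == 0)
        return 0;                   /* fast path: nothing out of line */
    return (hdr->outofline_bits[attno >> 3] >> (attno & 7)) & 1;
}

int
main(void)
{
    ToyTupleHeader hdr = { HEAP_HASOUTOFLINE, { 0 } };
    int             attno;

    hdr.outofline_bits[0] |= 1 << 2;    /* mark attribute 2 */

    for (attno = 0; attno < 4; attno++)
        printf("att %d: %s\n", attno,
               att_is_out_of_line(&hdr, attno)
                   ? "fetch from expansion storage"
                   : "use inline value");
    return 0;
}
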
I think that these two changes would handle 99% of the problem.
VACUUM would still need work, but most normal access to tuples would
just work automatically, because all access to varlena fields must go
through fastgetattr().

An as-yet-unsolved issue is how to avoid memory leaks of out-of-line
values after they have been read in by fastgetattr().  However, I think
that's going to be a nasty problem with Jan's approach as well.  The
best answer might be to solve this in combination with addressing the
problem of leakage of temporary results during expression evaluation,
say by adding some kind of reference-count convention to all varlena
values.

BTW, I don't see any really good reason to keep the out-of-line values
in a separate physical file (relation) as Jan originally proposed.
Why not keep them in the same file, but mark them as being something
different than a normal tuple?  Sequential scans would have to know to
skip over them (big deal), and VACUUM would have to handle them
properly, but I think VACUUM is going to have to have special code to
support this feature no matter what.  If we do make them a new primitive
kind-of-a-tuple on disk, we could sidestep the problem of marking all
the out-of-line values associated with a tuple when the tuple is
outdated by a transaction.  The out-of-line values wouldn't have
transaction IDs in them at all; they'd just be labeled with the CTID
and/or OID of the primary tuple they belong to.  VACUUM would consult
that tuple to determine whether to keep or discard an out-of-line value.
        regards, tom lane


Re: [HACKERS] LONG

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> A field value over 8k is not going to be something you join on,
> restrict, or order by in most cases.  It is going to be some long
> narrative or field that is just for output to the user, usually not used
> to process the query.

Not necessarily.  The classic example in my mind is a text field that
the user will want to do LIKE or regexp matching searches on.  When
he does so (and only when he does so), we'd have no choice but to pull
in the out-of-line value for each tuple in order to check the WHERE
clause.  But we'd have to do that no matter how you slice the problem.

I think the case that is actually worth thinking about is where some
values of the column are long and some are not so long.  We should avoid
a solution that imposes out-of-line storage on *every* tuple even when
the particular tuple isn't large enough to cause a problem.

I believe all of the proposals made so far have the ability to keep a
short value in-line, but the data-type-based approach has a significant
disadvantage: the decision has to be made by a data-type-specific
routine that wouldn't have information about the rest of the tuple that
the data will end up in.  So it would have to err on the side of caution
and put anything more than a fairly short value out-of-line.  If the
decision is made by the tuple storage routine, then it can examine the
whole tuple and make a more nearly optimal choice about what to put
out-of-line.
        regards, tom lane


Re: [HACKERS] Re: Jesus, what have I done (was: LONG)

From
Hannu Krosing
Date:
Bruce Momjian wrote:
> 
> > > If most joins, comparisons are done on the 10% in the main table, so
> > > much the better.
> >
> >     Yes, but how would you want to judge which varsize  value  to
> >     put  onto  the "secondary" relation, and which one to keep in
> >     the "primary" table for fast comparisions?
> 
> There is only one place in heap_insert that checks for tuple size and
> returns an error if it exceeds block size.  I recommend when we exceed
> that we scan the tuple, and find the largest varlena type that is
> supported for long relations, and set the long bit and copy the data
> into the long table.  Keep going until the tuple is small enough, and if
> not, throw an error on tuple size exceeded.  Also, prevent indexed
> columns from being made long.

And prevent indexes from being created later if fields in some records
are made long ?

Or would it be enough here to give out a warning ?

Or should one try to re-pack these tuples ?

Or, for tables that have mostly 10-char fields but an occasional 10K field,
we could possibly approach the indexes as currently proposed for tables,
i.e. make the index's data part point to the same LONG relation ?

The latter would probably open another can of worms.

---------
Hannu


Re: [HACKERS] LONG

From
wieck@debis.com (Jan Wieck)
Date:
Tom Lane wrote:

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Also, my idea was to auto-enable longs for all varlena types, so short
> > values stay in the table, while longer chained ones that take up lots of
> > space and are expensive to expand are retrieved only when needed.
>
> I missed most of yesterday's discussion (was off fighting a different
> fire...).  This morning in the shower I had a brilliant idea, which
> I now see Bruce has beaten me to ;-)
>
> The idea of doing tuple splitting by pushing "long" fields out of line,
> rather than just cutting up the tuple at arbitrary points, is clearly
> a win for the reasons Bruce and Jan point out.  But I like Bruce's
> approach (automatically do it for any overly-long varlena attribute)
> much better than Jan's (invent a special LONG datatype).  A special
> datatype is bad for several reasons:
> * it forces users to kluge up their database schemas;
> * inevitably, users will pick the wrong columns to make LONG (it's
>   a truism that programmers seldom guess right about what parts of
>   their programs consume the most resources; users would need a
>   "profiler" to make the right decisions);
> * it doesn't solve the problems for arrays, which desperately need it;
> * we'd need to add a whole bunch of operations on the special datatype;

O.K.,

    you two got me now.

>
> I think that the right place to implement this is in heapam, and that
> it should go more or less like this:
>
> 1. While writing out a tuple, if the total tuple size is "too big"
> (threshold would be some fraction of BLCKSZ, yet to be chosen),
> then the tuple manager would go through the tuple to find the longest
> varlena attribute, and convert same into an out-of-line attribute.
> Repeat if necessary until tuple size fits within threshold.

    Yepp.  But it must NOT mangle up the tuple handed to it in
    place.  The flat values in the tuple are sometimes used AFTER
    heap_insert() and heap_update(), for example for index_insert.
    So doing it in place might break other places.

> 2. While reading a tuple, fastgetattr() automatically fetches the
> out-of-line value if it sees the requested attribute is out-of-line.
> (I'd be inclined to mark out-of-line attributes in the same way that
> NULL attributes are marked: one bit in the tuple header shows if any
> out-of-line attrs are present, and if so there is a bitmap to show
> which ones are out-of-line.  We could also use Bruce's idea of
> commandeering the high-order bit of the varlena length word, but
> I think that's a much uglier and more fragile solution.)
>
> I think that these two changes would handle 99% of the problem.
> VACUUM would still need work, but most normal access to tuples would
> just work automatically, because all access to varlena fields must go
> through fastgetattr().

    And I like Bruce's idea with the high-order bit of vl_len.
    This is IMHO the only chance to tell, on UPDATE, whether the
    value was changed.

    To detect that an UPDATE did not touch the out-of-line value,
    you need the complete long reference information in the
    RESULT tuple.  The executor must not expand the value already
    while building up the result tuples.

    But Tom is right, there is a visibility problem I hadn't seen
    before.  It arises when fetching the out-of-line attribute
    (for example in the type output function) happens later than
    fetching the reference information.  Then a transaction
    reading dirty or committed data might see wrong content, or
    worse, see different contents on different fetches.

    The solution I see is to give each out-of-line datum its own
    Oid, which is part of its header and stamped into the
    reference data.  That way, the long attribute lookup can use
    SnapshotAny with this Oid; only one such datum can ever exist,
    so SnapshotAny is safe here and forces that only the
    visibility of the master tuple in the main table counts at
    all.

    Since this value Oid is known in the value reference stored
    in the tuple, we only need two indices on the out-of-line
    data: one on this Oid, and one on the referencing row's
    oid|attno|seq to be fast in heap_delete() and heap_update().

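To make the proposed storage concrete, here is an illustrative (not actual)
layout for one chunk row of such a per-table expansion relation, with
comments noting the two indices described above; all names and the chunk
size are assumptions of the sketch:

/*
 * Illustrative layout for one row of the per-table expansion relation.
 * Field names and LONGCHUNKSIZE are invented, not catalog definitions.
 */
#include <stdio.h>
#include <stdint.h>

typedef uint32_t Oid;

#define LONGCHUNKSIZE 7000      /* chunk payload, below the tuple limit */

typedef struct LongChunk
{
    Oid     value_oid;  /* own Oid of the out-of-line datum; unique,
                         * so index #1 is on (value_oid) alone and can
                         * be scanned with SnapshotAny as described    */
    Oid     row_oid;    /* Oid of the referencing main-table row       */
    int16_t attno;      /* attribute number in that row                */
    int32_t seq;        /* chunk sequence number; index #2 is on
                         * (row_oid, attno, seq) for delete/update     */
    int32_t chunklen;
    char    chunk[LONGCHUNKSIZE];
} LongChunk;

int
main(void)
{
    LongChunk c = { 0 };

    c.value_oid = 400123;
    c.row_oid = 23123;
    c.attno = 5;
    c.seq = 0;
    c.chunklen = 11;
    snprintf(c.chunk, sizeof(c.chunk), "hello world");

    printf("value %u, seq %d (row %u, att %d): %.*s\n",
           c.value_oid, c.seq, c.row_oid, c.attno,
           (int) c.chunklen, c.chunk);
    return 0;
}
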
> An as-yet-unsolved issue is how to avoid memory leaks of out-of-line
> values after they have been read in by fastgetattr().  However, I think
> that's going to be a nasty problem with Jan's approach as well.  The
> best answer might be to solve this in combination with addressing the
> problem of leakage of temporary results during expression evaluation,
> say by adding some kind of reference-count convention to all varlena
> values.

    At the point we decide to move an attribute out of the tuple,
    we do a lookup in an array of type Oids.  Thus, we have plenty
    of time to add one datatype after another and enable them
    separately for long processing, but can get the first ones
    enabled and out of the door ASAP (next release).

    As Bruce suggested, we implement a central function that
    fetches back the long value.  This is used in all the
    type-specific functions in adt.  Now that we have an Oid
    identifier per single value, it's easy to implement a cache
    there that can manage an LRU table of the last fetched values
    and cache smaller ones for fast access.

    It's the responsibility of the types' adt functions to free
    the returned (old-VARLENA-looking) memory.  Since we enable
    the types one by one, there's no need to hurry on this.

> BTW, I don't see any really good reason to keep the out-of-line values
> in a separate physical file (relation) as Jan originally proposed.
> Why not keep them in the same file, but mark them as being something
> different than a normal tuple?  Sequential scans would have to know to
> skip over them (big deal), and VACUUM would have to handle them

    The one I see is that a sequential scan would not benefit
    from this: it still has to read the entire relation, even if
    it looks only at small, fixed-size items in the tuple.
    Keeping the long values in a separate relation will be a big
    win for count(*).  And with the mentioned value cache for
    relatively small values (yet to define what that is), there
    will be very little overhead in a sort if the tuples in it
    are sorted by an attribute where some long values
    occasionally appear.

> properly, but I think VACUUM is going to have to have special code to
> support this feature no matter what.  If we do make them a new primitive
> kind-of-a-tuple on disk, we could sidestep the problem of marking all
> the out-of-line values associated with a tuple when the tuple is
> outdated by a transaction.  The out-of-line values wouldn't have
> transaction IDs in them at all; they'd just be labeled with the CTID
> and/or OID of the primary tuple they belong to.  VACUUM would consult
> that tuple to determine whether to keep or discard an out-of-line value.

    AFAIK, VACUUM consults single attributes of a tuple only to
    produce the statistical information for them on ANALYZE.
    Well, statistical information for columns containing LONG
    values isn't good for the WHERE clause (I think we all agree
    on that).  So it doesn't matter if this information isn't
    totally accurate, or if VACUUM counts the values but uses
    only the first couple of bytes for the min/max etc. info.

    Also, the new long data relations should IMHO have their own
    relkind, so VACUUM can easily detect them.  This I think is
    required so that VACUUM can place an exclusive lock on the
    main table before starting to vacuum the long values (which
    can be done as is, since the long relation is in fact a
    normal relation - just not visible to the user).  This should
    avoid race conditions like the visibility problem explained
    above.

    I'll start to play around with this  approach  for  a  while,
    using  lztext  as  test  candidate  (with  custom compression
    parameters that force uncompressed storage). When I have some
    reasonable  result  ready to look at, I'll send a patch here,
    so we can continue the discussion while looking at some  test
    implementation.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #

Re: [HACKERS] Re: Jesus, what have I done (was: LONG)

From
Bruce Momjian
Date:
> > I am confused here.  With my code, you only have to:
> >
> >    add code to write/read from long tables
> >    add code to expand long values in varlen access routines
> >    add code to heap_insert() to move data to long tables
> >    add code to heap_delete() to invalidate long tuples
> 
>     Add  code  to  expand  long values in varlen access routines,
>     you're joking - no?

Here is a patch to textout() that allows it to handle long values.  It
checks the long bit, calls the proper expansion function, and
pfree()'s the expanded copy on exit.

It is a minimal amount of code that could be added to all the varlena
access routines.  I would be glad to do it.

By doing it there, we expand only when we access the varlena value, not
on every tuple.

---------------------------------------------------------------------------


*** varlena.c    Sun Nov  7 18:08:24 1999
--- varlena.c.new    Sun Dec 12 15:49:35 1999
***************
*** 176,181 ****
--- 176,182 ----
  {
      int            len;
      char       *result;
+     bool        islong = false;
 
      if (vlena == NULL)
      {
***************
*** 184,189 ****
--- 185,197 ----
          result[1] = '\0';
          return result;
      }
+ 
+     if (VARISLONG(vlena)) /* checks long bit */
+     {
+         vlena = expand_long(vlena); /* returns palloc long */
+         islong = true;
+     }
+ 
      len = VARSIZE(vlena) - VARHDRSZ;
      result = (char *) palloc(len + 1);
      memmove(result, VARDATA(vlena), len);
***************
*** 192,197 ****
--- 200,208 ----
  #ifdef CYR_RECODE
      convertstr(result, len, 1);
  #endif
+ 
+     if (islong)
+         pfree(vlena);
 
      return result;
  }

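Since VARISLONG() and expand_long() do not exist yet, here is a standalone
mock of what such an expansion helper might do -- concatenate the stored
chunks into a freshly allocated, old-style varlena that the caller frees,
mirroring the expand/use/free pattern of the patch.  malloc() stands in for
palloc(), and the in-memory chunk array stands in for the index scan on the
long relation:

/*
 * Standalone mock of an expand_long()-style helper, purely to show the
 * calling pattern textout() would follow.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

#define VARHDRSZ ((int) sizeof(uint32_t))

struct chunk
{
    int         len;
    const char *data;
};

/* Concatenate chunks into a varlena-looking buffer: 4-byte length + data. */
static char *
mock_expand_long(const struct chunk *chunks, int nchunks)
{
    uint32_t    total = VARHDRSZ;
    char       *result, *p;
    int         i;

    for (i = 0; i < nchunks; i++)
        total += chunks[i].len;

    result = malloc(total);
    if (result == NULL)
        return NULL;

    memcpy(result, &total, VARHDRSZ);           /* vl_len, long bit clear */
    p = result + VARHDRSZ;
    for (i = 0; i < nchunks; i++)
    {
        memcpy(p, chunks[i].data, chunks[i].len);
        p += chunks[i].len;
    }
    return result;
}

int
main(void)
{
    struct chunk    chunks[] = {
        { 6, "first " }, { 7, "second " }, { 5, "third" }
    };
    char           *value = mock_expand_long(chunks, 3);
    uint32_t        vl_len;

    if (value == NULL)
        return 1;

    memcpy(&vl_len, value, VARHDRSZ);
    printf("expanded %u data bytes: %.*s\n",
           vl_len - VARHDRSZ, (int) (vl_len - VARHDRSZ), value + VARHDRSZ);

    free(value);        /* the patch does the equivalent pfree() */
    return 0;
}
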


Re: [HACKERS] LONG

From
Bruce Momjian
Date:
> 2. While reading a tuple, fastgetattr() automatically fetches the
> out-of-line value if it sees the requested attribute is out-of-line.
> (I'd be inclined to mark out-of-line attributes in the same way that
> NULL attributes are marked: one bit in the tuple header shows if any
> out-of-line attrs are present, and if so there is a bitmap to show
> which ones are out-of-line.  We could also use Bruce's idea of
> commandeering the high-order bit of the varlena length word, but
> I think that's a much uglier and more fragile solution.)

Not sure if fastgetattr() is the place for this.  I thought the varlena
access routines themselves would work.  It is nice and clean to do it in
fastgetattr, but how do you know to pfree it?  I suppose if you kept the
high bit set, you could try cleaning up, but where?

My idea was to expand the out-of-line varlena, and unset the 'long' bit.
long-bit|length|reloid|tupleoid|attno|longlen

Unexpanded would be:
1|20|10032|23123|5|20000

Expanded would be:
0|20000|data


> 
> I think that these two changes would handle 99% of the problem.
> VACUUM would still need work, but most normal access to tuples would
> just work automatically, because all access to varlena fields must go
> through fastgetattr().
> 
> An as-yet-unsolved issue is how to avoid memory leaks of out-of-line
> values after they have been read in by fastgetattr().  However, I think
> that's going to be a nasty problem with Jan's approach as well.  The
> best answer might be to solve this in combination with addressing the
> problem of leakage of temporary results during expression evaluation,
> say by adding some kind of reference-count convention to all varlena
> values.

That's why I was going to do the expansion only in the varlena access
routines.  Patch already posted.

> 
> BTW, I don't see any really good reason to keep the out-of-line values
> in a separate physical file (relation) as Jan originally proposed.
> Why not keep them in the same file, but mark them as being something
> different than a normal tuple?  Sequential scans would have to know to
> skip over them (big deal), and VACUUM would have to handle them
> properly, but I think VACUUM is going to have to have special code to
> support this feature no matter what.  If we do make them a new primitive
> kind-of-a-tuple on disk, we could sidestep the problem of marking all
> the out-of-line values associated with a tuple when the tuple is
> outdated by a transaction.  The out-of-line values wouldn't have
> transaction IDs in them at all; they'd just be labeled with the CTID
> and/or OID of the primary tuple they belong to.  VACUUM would consult
> that tuple to determine whether to keep or discard an out-of-line value.

I disagree.  By moving to another table, we don't have non-standard
tuples in the main table.  We can create normal tuples in the long*
table, of identical format, and access them just like normal tuples. 
Having special long tuples in the main table that don't follow the
format of the other tuples is a certain mess.  The long* tables also
move the long data out of the main table so it is not accessed in
sequential scans.  Why keep them in the main table?



Re: [HACKERS] LONG

From
wieck@debis.com (Jan Wieck)
Date:
Bruce Momjian wrote:

> I disagree.  By moving to another table, we don't have non-standard
> tuples in the main table.  We can create normal tuples in the long*
> table, of identical format, and access them just like normal tuples.
> Having special long tuples in the main table that don't follow the
> format of the other tuples it a certain mess.  The long* tables also
> move the long data out of the main table so it is not accessed in
> sequential scans.  Why keep them in the main table?

    More ugly and complicated (especially for VACUUM) seems to
    me that we would need an index on these nonstandard tuples
    that doesn't see the standard ones, while the regular indices
    ignore the new long tuples.  At least if we want to delay
    reading of long values until they're explicitly requested.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #

Re: [HACKERS] LONG

From
Bruce Momjian
Date:
> Bruce Momjian wrote:
> 
> > I disagree.  By moving to another table, we don't have non-standard
> > tuples in the main table.  We can create normal tuples in the long*
> > table, of identical format, and access them just like normal tuples.
> > Having special long tuples in the main table that don't follow the
> > format of the other tuples it a certain mess.  The long* tables also
> > move the long data out of the main table so it is not accessed in
> > sequential scans.  Why keep them in the main table?
> 
>     More  ugly  and  complicated (especially for VACUUM) seems to
>     me, the we need an index on these  nonstandard  tuples,  that
>     doesn't  see  the  standard  ones,  while the regular indices
>     ignore the new long tuples. At least  if  we  want  to  delay
>     reading of long values until they're explicitly requested.
> 

Yes, good point.  No reason to create non-standard tuples if you can
avoid it.  And a separate table has performance advantages, especially
because the long tuples are by definition long and take up lots of
blocks.




libpq questions...when threads collide

From
Don Baccus
Date:
I'm working on bullet-proofing AOLserver's postgres driver.

I've fixed a bunch of weaknesses, but am stumped by the
following...

AOLserver's a multithreaded server, and libpq's database
connection routines aren't threadsafe.  It turns out the
environment in which the driver lives doesn't allow me
to ensure that only one thread executes a PQsetdb at a
time, at least without resorting to the specific operating
system's mutexes and cond primitives.  The server provides
a nice portable interface for such things but they're
not available to database drivers because in general the
server's not interested in having database drivers do such
things.

That's not a problem for this group, but I'm curious.  People
have been using this driver for years, and some use it 
heavily (Lamar Owen, for one).  Despite the thread unsafeness
of PQsetdb et al, I've never seen a failure in this environment
and I've never heard of folks experiencing such a failure.

So my question's simple - what exactly makes PQsetdb et al
thread unsafe?  I'm asking in order to attempt to get a handle
on just how vulnerable the routines are when two threads attempt
to open a database connection simultaneously.

The other question's simple, too - are the implications predictable,
i.e. will (for instance) one of the attempts simply crash or
fail when two or more threads attempt to make a connection?  Or
am I looking at something more evil, like silent building of a
connection messed up in some subtle way?  

I suspect the answer to the last question is that the result of
doing this is unpredictable, but thought I'd ask.

AOLserver supports external drivers called by a proxy with a 
separate process provided for each database connection, but
there are unfortunate performance implications with this
approach.  It's designed explicitly for dbs with no threadsafe
C API.  This includes Sybase, and in my testing the internal
Postgres driver can feed bytes to the server about three times
as fast as the external driver written for Sybase, so you
can see why I'm reluctant to rewrite the Postgres driver simply
because building a connection's not threadsafe.  After all,
unless a backend crashes, connections only get built when the server's
first fired up.  And people aren't seeing problems.




- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest Rare Bird Alert Service
  and other goodies at http://donb.photo.net.


Re: [HACKERS] libpq questions...when threads collide

From
Tom Lane
Date:
Don Baccus <dhogaza@pacifier.com> writes:
> Despite the thread unsafeness
> of PQsetdb et al, I've never seen a failure in this environment
> and I've never heard of folks experiencing such a failure.

The *only* thing that's actually thread-unsafe, AFAIR, is
PQconnectdb's use of a global array for connection parameters.
PQsetdb/setdbLogin are thread-safe; so just use them instead.

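As a practical illustration of that advice (the connection parameters are
placeholders), a threaded client can avoid the unsafe global state entirely
by giving each thread its own PQsetdbLogin() call:

/*
 * Each thread builds its own connection with PQsetdbLogin(), so no
 * locking is needed around connection setup.  Host/port/database are
 * placeholders; build with something like: cc conn_threads.c -lpq -lpthread
 */
#include <stdio.h>
#include <pthread.h>
#include <libpq-fe.h>

static void *
worker(void *arg)
{
    const char *dbname = arg;
    PGconn     *conn;

    conn = PQsetdbLogin("localhost", "5432",    /* host, port        */
                        NULL, NULL,             /* options, tty      */
                        dbname,                 /* database          */
                        NULL, NULL);            /* login, password   */

    if (PQstatus(conn) != CONNECTION_OK)
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
    else
        printf("connected to %s\n", dbname);

    PQfinish(conn);
    return NULL;
}

int
main(void)
{
    pthread_t   t1, t2;

    pthread_create(&t1, NULL, worker, "template1");
    pthread_create(&t2, NULL, worker, "template1");
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
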
At least that was true before the async-connection code got added.
I haven't looked at that to see if it introduces any problems.
        regards, tom lane


Re: [HACKERS] libpq questions...when threads collide

From
Don Baccus
Date:
At 05:41 PM 12/12/99 -0500, Tom Lane wrote:
>Don Baccus <dhogaza@pacifier.com> writes:
>> Despite the thread unsafeness
>> of PQsetdb et al, I've never seen a failure in this environment
>> and I've never heard of folks experiencing such a failure.
>
>The *only* thing that's actually thread-unsafe, AFAIR, is
>PQconnectdb's use of a global array for connection parameters.
>PQsetdb/setdbLogin are thread-safe; so just use them instead.

Cool!  I am using setdbLogin but the documentation sez they,
too, aren't threadsafe...maybe this should be changed?  This
is great news.

>At least that was true before the async-connection code got added.
>I haven't looked at that to see if it introduces any problems.

For the moment, I'm happy to believe that it hasn't, it makes my
immediate future much simpler if I do so...

Also, the documentation describes two routines, PQoidStatus and
PQoidValue, but the libpq source seems to only define PQoidStatus.

(some user asked for a routine to feed back the oid of an insert,
so I looked into it while simultaneously suggesting he study
"sequence" and its associated "nextval" and "currval" functions
and ponder on why it's really a bad idea to relate tables by
storing oids rather than generated keys)






Re: [HACKERS] libpq questions...when threads collide

From
Tom Lane
Date:
Don Baccus <dhogaza@pacifier.com> writes:
> Cool!  I am using setdbLogin but the documentation sez they,
> too, aren't threadsafe...maybe this should be changed?

I guess so.  Submit a patch...

> Also, the documentation describes two routines, PQoidStatus and
> PQoidValue, but the libpq source seem to only define PQoidStatus.

PQoidValue is new in current sources --- you must be looking at
current-snapshot docs, rather than what was released with 6.5.
        regards, tom lane


Re: [HACKERS] libpq questions...when threads collide

From
Don Baccus
Date:
At 05:58 PM 12/12/99 -0500, Tom Lane wrote:

>PQoidValue is new in current sources --- you must be looking at
>current-snapshot docs, rather than what was released with 6.5.

I'm using the docs at www.postgresql.org, which I assumed would
be matched to the current release.





Re: [HACKERS] libpq questions...when threads collide

From
Tom Lane
Date:
Don Baccus <dhogaza@pacifier.com> writes:
>> PQoidValue is new in current sources --- you must be looking at
>> current-snapshot docs, rather than what was released with 6.5.

> I'm using the docs at www.postgresql.org, which I assumed would
> be matched to the current release.

I believe the on-line manual is a nightly snapshot.  This is awfully
handy for developers but not so good for ordinary users.  What we
should probably do is have both the snapshot and the last release's
docs on the website ... clearly marked ;-)
        regards, tom lane


Re: [HACKERS] libpq questions...when threads collide

From
Vince Vielhaber
Date:
On 12-Dec-99 Tom Lane wrote:
> Don Baccus <dhogaza@pacifier.com> writes:
>>> PQoidValue is new in current sources --- you must be looking at
>>> current-snapshot docs, rather than what was released with 6.5.
> 
>> I'm using the docs at www.postgresql.org, which I assumed would
>> be matched to the current release.
> 
> I believe the on-line manual is a nightly snapshot.  This is awfully
> handy for developers but not so good for ordinary users.  What we
> should probably do is have both the snapshot and the last release's
> docs on the website ... clearly marked ;-)

Last I looked the docs for every particular version were included with 
the tarball.  No matter how clearly you mark anything it still won't be
seen and someone will complain.  Either we should keep the current docs
or the release docs online - not both.

Vince.
-- 
==========================================================================
Vince Vielhaber -- KA8CSH    email: vev@michvhf.com    flame-mail: /dev/null
 # include <std/disclaimers.h>
       Have you seen http://www.pop4.net?
       Online Campground Directory    http://www.camping-usa.com
       Online Giftshop Superstore     http://www.cloudninegifts.com
==========================================================================




Re: [HACKERS] libpq questions...when threads collide

From
Don Baccus
Date:
At 06:27 PM 12/12/99 -0500, Vince Vielhaber wrote:

>Last I looked the docs for every particular version were included with 
>the tarball.  No matter how clearly you mark anything it still won't be
>seen and someone will complain.

I'm not complaining, I'm explaining where I found the definition of
PQoidValue.   And, yes, I know the docs are in the tarball.  As it
happens I have a permanent, high-speed internet connection and find
it convenient to use the docs at postgres.org.  If that makes me an
idiot in your book I could care less.





Re: [HACKERS] libpq questions...when threads collide

From
Vince Vielhaber
Date:
On 12-Dec-99 Don Baccus wrote:
> At 06:27 PM 12/12/99 -0500, Vince Vielhaber wrote:
> 
>>Last I looked the docs for every particular version were included with 
>>the tarball.  No matter how clearly you mark anything it still won't be
>>seen and someone will complain.
> 
> I'm not complaining, I'm explaining where I found the definition of
> PQoidValue.   And, yes, I know the docs are in the tarball.  As it
> happens I have a permanent, high-speed internet connection and find
> it convenient to use the docs at postgres.org.  If that makes me an
> idiot in your book I could care less.

Now where the hell did I call you an idiot?   Tom said we should have
current and release docs online clearly marked.  Reread my reply.  Are
you volunteering to be that special "someone"?

Vince.




RE: [HACKERS] LONG

From
"Hiroshi Inoue"
Date:
There are so many mails for me to follow about this issue. 
For example, what's the conclusion about the following?
Please teach me.

> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org
> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Tom Lane
> 
> 
> BTW, I don't see any really good reason to keep the out-of-line values
> in a separate physical file (relation) as Jan originally proposed.
> Why not keep them in the same file, but mark them as being something
> different than a normal tuple?  Sequential scans would have to know to
> skip over them (big deal), and VACUUM would have to handle them
> properly, but I think VACUUM is going to have to have special code to
> support this feature no matter what.  If we do make them a new primitive
> kind-of-a-tuple on disk, we could sidestep the problem of marking all
> the out-of-line values associated with a tuple when the tuple is
> outdated by a transaction.  The out-of-line values wouldn't have
> transaction IDs in them at all; they'd just be labeled with the CTID

What is wrong if out-of-line values have their own XIDs ?
If an out-of-line value is newer than the corresponding row in the
"primary" table that's bad, but could it occur ?
Because (rowid) of the "secondary" table references the "primary"
table (oid) on delete cascade, their XID_MAXs would be synchronized.
Why is CTID needed ?  Is it necessary to know "primary" tuples from
out-of-line values ?

> and/or OID of the primary tuple they belong to.  VACUUM would consult
> that tuple to determine whether to keep or discard an out-of-line value.
>

What is wrong with separate VACUUM ?
VACUUM never changes OIDs and XIDs (after MVCC).

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp


Re: [HACKERS] libpq questions...when threads collide

From
Tom Lane
Date:
Vince Vielhaber <vev@michvhf.com> writes:
> Either we should keep the current docs
> or the release docs online - not both.

I disagree, because they serve different audiences.  The snapshot docs
are very useful to developers, particularly those of us who don't have
SGML tools installed but still want to know whether the docs we
committed recently look right or not ;-).  Meanwhile, current-release
documents are clearly the right thing to provide for ordinary users.

I think a reasonable choice would be to provide current-release docs
as the most readily accessible set of docs on the website, and to put
the snapshot docs somewhere less obvious where only developers would
normally go (preferably, accessed off a page that is clearly about
development sources).

If I can't have both, I'd reluctantly say that the release docs are
the right ones to have on the website.
        regards, tom lane


Re: [HACKERS] LONG

From
Bruce Momjian
Date:
>     The  solution  I see is to give any out of line datum another
>     Oid, that is  part  of  it's  header  and  stamped  into  the
>     reference  data.  That way, the long attribute lookup can use
>     SnapshotAny using this  Oid,  there  can  only  be  one  that
>     exists,  so SnapshotAny is safe here and forces that only the
>     visibility of the master tuple in the main  table  counts  at
>     all.

This is a great idea.  Get rid of my use of the attribute number.  Make
the varlena long value be:
long-bit|length|longrelid|longoid|longlen

No need for attno in there anymore.
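
Roughly, and only as a sketch (the field names and the flag bit here are
made up for illustration, not actual backend declarations), that reference
datum could look like:

    /*
     * Hypothetical C sketch of the reference format above:
     * long-bit|length|longrelid|longoid|longlen.
     */
    #include <stdint.h>

    #define LONGVAL_FLAG 0x80000000u   /* high bit of the length word */

    typedef uint32_t Oid;              /* stand-in for the backend's Oid */

    typedef struct LongRef
    {
        uint32_t vl_len;     /* size of this header, LONGVAL_FLAG set   */
        Oid      longrelid;  /* which long_ relation holds the chunks   */
        Oid      longoid;    /* id shared by all chunks of this value   */
        uint32_t longlen;    /* total byte length of the expanded value */
    } LongRef;

    /* A datum is a long reference when the high bit of its length is set. */
    static inline int
    is_long_reference(const void *datum)
    {
        return (*(const uint32_t *) datum & LONGVAL_FLAG) != 0;
    }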

Having a separate oid for the long value is great.  You can then have
multiple versions of the long attribute in the long table and can
control when updating a tuple.

I liked Hiroshi's idea of allowing long values in an index by just
pointing to the long table.  Seems that would work too.  varlena access
routines make that possible.



--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] LONG

From
Tom Lane
Date:
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> There are so many mails for me to follow about this issue. 
> For example,what's the conclusion about the following ?

I don't think it's concluded yet...

> Why is CTID needed ?  Is it necessary to know "primary" tuples from
> out-of-lines values ? 

It seems to me that the primary tuple should store CTIDs of the
out-of-line segment(s) it's using.  That way, we need no index at
all on the expansion relation, which would clearly be a win.

My thought was that if the expansion tuples stored CTIDs of their
primary tuples, then it would be practical to have VACUUM consult
the primary tuples' xact status while vacuuming the expansion.
That way, we'd have no need to update expansion tuples when changing
xact status of primary tuples.  But I think Jan has something else
in mind for that.

It would be a little tricky to write out a tuple plus its expansion
tuples and have them all know each others' CTIDs; the CTIDs would
have to be assigned before anything got written.  And VACUUM would
need a little extra logic to update these things.  But those are
very localized and IMHO solvable problems, and I think the performance
advantages would be significant...
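
As a minimal sketch of that layout (assuming the primary tuple's reference
simply carries an array of expansion-tuple CTIDs; all names here are
illustrative, and only the block-number-plus-offset shape of a CTID mirrors
the real thing):

    #include <stdint.h>

    typedef struct CtidSketch
    {
        uint32_t block;              /* block in the expansion relation */
        uint16_t offset;             /* line pointer slot in that block */
    } CtidSketch;

    typedef struct LongRefByCtid
    {
        uint32_t   total_len;        /* expanded size of the value      */
        uint16_t   nchunks;          /* number of expansion tuples      */
        CtidSketch chunks[1];        /* their CTIDs, in chunk order     */
    } LongRefByCtid;

    /* The tricky part noted above: if expansion tuples also store the
     * primary tuple's CTID for VACUUM's benefit, both sets of CTIDs
     * must be assigned before anything is written out. */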

> What is wrong with separate VACUUM ?
> VACUUM never changes OIDs and XIDs(after MVCC).

I believe VACUUM does assign its own XID to tuples that it moves,
so that a crash during VACUUM doesn't corrupt the table by leaving
multiple apparently-valid copies of a tuple.  We'd have to figure out
how to accomplish the same result for expansion tuples.
        regards, tom lane


Re: [HACKERS] LONG

From
wieck@debis.com (Jan Wieck)
Date:
>
> >     The  solution  I see is to give any out of line datum another
> >     Oid, that is  part  of  it's  header  and  stamped  into  the
> >     reference  data.  That way, the long attribute lookup can use
> >     SnapshotAny using this  Oid,  there  can  only  be  one  that
> >     exists,  so SnapshotAny is safe here and forces that only the
> >     visibility of the master tuple in the main  table  counts  at
> >     all.
>
> This is a great idea.  Get rid of my use of the attribute number.  Make
> the varlena long value be:
>
>    long-bit|length|longrelid|longoid|longlen
>
> No need for attno in there anymore.

    I  still  need  it  to  explicitly  remove  one long value on
    update, while the other one is untouched. Otherwise  I  would
    have  to  drop  all  long  values  for  the  row together and
    reinsert all new ones.

> Having a separate oid for the long value is great.  You can then have
> multiple versions of the long attribute in the long table and can
> control when updating a tuple.
>
> I liked Hiroshi's idea of allowing long values in an index by just
> pointing to the long table.  Seems that would work too.  varlena access
> routines make that possible.

    Maybe possible, but not that good IMHO. It would cause another
    index scan from inside an index scan to get at the value. And
    we all agree that indexing huge values isn't that good a thing
    at all.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #

Re: [HACKERS] LONG

From
Tom Lane
Date:
wieck@debis.com (Jan Wieck) writes:
>> I liked Hiroshi's idea of allowing long values in an index by just
>> pointing to the long table.  Seems that would work too.  varlena access
>> routines make that possible.

>     Maybe possible, but not that good IMHO. Would  cause  another
>     index  scan from inside index scan to get at the value. An we
>     all agree that indexing huge values isn't that a  good  thing
>     at all.

Well, no, you shouldn't make indexes on fields that are usually big.
But it'd be awfully nice if the system could cope with indexing fields
that just had a long value once in a while.  Right now, our answer is
to refuse to let you insert a long value into an indexed field; I don't
think that's very satisfactory.

What do you think of my idea of not using any index on the expansion
table at all, but instead having the primary tuple reference the
expansion tuples via their CTIDs?  More work at VACUUM time, for sure,
but a lot less work elsewhere.
        regards, tom lane


Re: [HACKERS] LONG

From
Bruce Momjian
Date:
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > There are so many mails for me to follow about this issue. 
> > For example,what's the conclusion about the following ?
> 
> I don't think it's concluded yet...
> 
> > Why is CTID needed ?  Is it necessary to know "primary" tuples from
> > out-of-lines values ? 
> 
> It seems to me that the primary tuple should store CTIDs of the
> out-of-line segment(s) it's using.  That way, we need no index at
> all on the expansion relation, which would clearly be a win.

That could be bad.  Vacuum moving expired entries in long_ tables would
need to update the ctids in the primary relation, which would be a mess.
Also, I can see a 16MB relation using 8k of stored ctids.  Entries over
16MB would overflow, causing problems.  I think an index and
traditional access will be just fine.
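
For a rough feel of the overhead (my own back-of-the-envelope figures, not
from the thread, assuming ~8KB chunks and 6-byte CTIDs):

    /* Illustrative arithmetic only: how many CTIDs a 16MB value needs. */
    #include <stdio.h>

    int
    main(void)
    {
        const long value_bytes = 16L * 1024 * 1024;  /* a 16MB long value   */
        const long chunk_bytes = 8 * 1024;           /* one chunk per block */
        const long ctid_bytes  = 6;                  /* block + offset      */

        long nchunks  = (value_bytes + chunk_bytes - 1) / chunk_bytes;
        long overhead = nchunks * ctid_bytes;

        /* ~2048 chunks -> ~12KB of CTIDs, more than fits in one 8K
         * block of the primary relation, which is the concern above. */
        printf("chunks: %ld, ctid bytes: %ld\n", nchunks, overhead);
        return 0;
    }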

> 
> My thought was that if the expansion tuples stored CTIDs of their
> primary tuples, then it would be practical to have VACUUM consult
> the primary tuples' xact status while vacuuming the expansion.
> That way, we'd have no need to update expansion tuples when changing
> xact status of primary tuples.  But I think Jan has something else
> in mind for that.

Then you need to have a way to point back to the primary table from the
long_ table.  Doesn't seem worth it.

Also, I am questioning the use of compression for long tuples.  I often
don't want compression happening behind the scenes.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


RE: [HACKERS] LONG

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org
> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Jan Wieck
> 
> >
> > Having a separate oid for the long value is great.  You can then have
> > multiple versions of the long attribute in the long table and can
> > control when updating a tuple.
> >
> > I liked Hiroshi's idea of allowing long values in an index by just
> > pointing to the long table.  Seems that would work too.  varlena access
> > routines make that possible.
> 
>     Maybe possible, but not that good IMHO. Would  cause  another
>     index  scan from inside index scan to get at the value. An we
>     all agree that indexing huge values isn't that a  good  thing
>     at all.
>

What I need is a unique index on (rowid, rowattno, chunk_seq) on the
"secondary" table.
Is it different from your original idea ?
I don't need any index on the primary table.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp


Re: [HACKERS] LONG

From
Bruce Momjian
Date:
> >
> > >     The  solution  I see is to give any out of line datum another
> > >     Oid, that is  part  of  it's  header  and  stamped  into  the
> > >     reference  data.  That way, the long attribute lookup can use
> > >     SnapshotAny using this  Oid,  there  can  only  be  one  that
> > >     exists,  so SnapshotAny is safe here and forces that only the
> > >     visibility of the master tuple in the main  table  counts  at
> > >     all.
> >
> > This is a great idea.  Get rid of my use of the attribute number.  Make
> > the varlena long value be:
> >
> >    long-bit|length|longrelid|longoid|longlen
> >
> > No need for attno in there anymore.
> 
>     I  still  need  it  to  explicitly  remove  one long value on
>     update, while the other one is untouched. Otherwise  I  would
>     have  to  drop  all  long  values  for  the  row together and
>     reinsert all new ones.

I am suggesting the longoid is not the oid of the primary or long*
table, but a unique id we assign just to number all parts of the long*
tuple.  I thought that's what your oid was for.

> 
> > Having a separate oid for the long value is great.  You can then have
> > multiple versions of the long attribute in the long table and can
> > control when updating a tuple.
> >
> > I liked Hiroshi's idea of allowing long values in an index by just
> > pointing to the long table.  Seems that would work too.  varlena access
> > routines make that possible.
> 
>     Maybe possible, but not that good IMHO. Would  cause  another
>     index  scan from inside index scan to get at the value. An we
>     all agree that indexing huge values isn't that a  good  thing
>     at all.

May as well.  I can't think of a better solution for indexing when you
have long values.  I don't think we want long* versions of indexes.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


RE: [HACKERS] LONG

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org
> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Tom Lane
> Sent: Monday, December 13, 1999 12:00 PM
> To: Hiroshi Inoue
> Cc: Bruce Momjian; Jan Wieck; pgsql-hackers@postgreSQL.org
> Subject: Re: [HACKERS] LONG 
> 
> 
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > There are so many mails for me to follow about this issue. 
> > For example,what's the conclusion about the following ?
> 
> I don't think it's concluded yet...
> 
> > Why is CTID needed ?  Is it necessary to know "primary" tuples from
> > out-of-lines values ? 
> 
> It seems to me that the primary tuple should store CTIDs of the
> out-of-line segment(s) it's using.  That way, we need no index at
> all on the expansion relation, which would clearly be a win.
> 
> My thought was that if the expansion tuples stored CTIDs of their
> primary tuples, then it would be practical to have VACUUM consult
> the primary tuples' xact status while vacuuming the expansion.
> That way, we'd have no need to update expansion tuples when changing
> xact status of primary tuples.  But I think Jan has something else
> in mind for that.
> 
> It would be a little tricky to write out a tuple plus its expansion
> tuples and have them all know each others' CTIDs; the CTIDs would
> have to be assigned before anything got written.  And VACUUM would
> need a little extra logic to update these things.  But those are
> very localized and IMHO solvable problems, and I think the performance
> advantages would be significant...
>

If CTIDs are needed, it isn't worth the work, I think.
I don't understand why the reference from "secondary" to "primary" is
needed.  As far as I see, VACUUM doesn't need the reference.
> > What is wrong with separate VACUUM ?
> > VACUUM never changes OIDs and XIDs(after MVCC).
> 
> I believe VACUUM does assign its own XID to tuples that it moves,

AFAIK, VACUUM never changes XIDs because MVCC doesn't allow
it.  Vadim changed it to preserve XIDs across VACUUM before MVCC.
Vadim used CommandId instead to see whether VACUUM succeeded
or not.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp


RE: [HACKERS] LONG

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org
> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Bruce Momjian
> 
> > >
> > > >     The  solution  I see is to give any out of line datum another
> > > >     Oid, that is  part  of  it's  header  and  stamped  into  the
> > > >     reference  data.  That way, the long attribute lookup can use
> > > >     SnapshotAny using this  Oid,  there  can  only  be  one  that
> > > >     exists,  so SnapshotAny is safe here and forces that only the
> > > >     visibility of the master tuple in the main  table  counts  at
> > > >     all.
> > >
> > > This is a great idea.  Get rid of my use of the attribute 
> number.  Make
> > > the varlena long value be:
> > >
> > >    long-bit|length|longrelid|longoid|longlen
> > >
> > > No need for attno in there anymore.
> > 
> >     I  still  need  it  to  explicitly  remove  one long value on
> >     update, while the other one is untouched. Otherwise  I  would
> >     have  to  drop  all  long  values  for  the  row together and
> >     reinsert all new ones.
> 
> I am suggesting the longoid is not the oid of the primary or long*
> table, but a unque id we assigned just to number all parts of the long*
> tuple.  I thought that's what your oid was for.
>

Unfortunately I couldn't follow this issue correctly.
Is the format of the long value relation different from Jan's original now ?

  - At    CREATE   TABLE,   a   long   value   relation   named
      "_LONG<tablename>" is created for those tables who need it.
      And of course dropped and truncated appropriate. The schema
      of this table is

          rowid       Oid,          -- oid of our main data row
          rowattno    int2,         -- the attribute number in main data
          chunk_seq   int4,         -- the part number of this data chunk
          chunk       text          -- the content of this data chunk

I thought that there's a unique index on (rowid, rowattno, chunk_seq).
Seems we could even update partially (a specified chunk_seq only)
without problem.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp


Re: [HACKERS] LONG

From
Bruce Momjian
Date:
> > I am suggesting the longoid is not the oid of the primary or long*
> > table, but a unque id we assigned just to number all parts of the long*
> > tuple.  I thought that's what your oid was for.
> >
> 
> Unfortunately I couldn't follow this issue correctly. 
> Is the format of long value relation different from Jan's original now ?
> 
>  - At    CREATE   TABLE,   a   long   value   relation   named
>       "_LONG<tablename>" is created for those tables who need it.
>       And of course dropped and truncated appropriate. The schema
>       of this table is
> 
>           rowid       Oid,          -- oid of our main data row

I am suggesting a unique oid just to store this long value.  The new oid
gets stored in the primary table, and on every row of the long* table.


>           rowattno    int2,         -- the attribute number in main data

Not needed anymore.

>           chunk_seq   int4,         -- the part number of this data chunk
>           chunk       text          -- the content of this data chunk

Yes.

>  
> I thought that there's an unique index (rowid,rowattno,chunk_seq).

Index on longoid only.  No need for an index on (longoid, chunk_seq)
because you don't need the rows returned in order.
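
A minimal sketch of why ordering doesn't matter (assuming fixed-size chunks
except possibly the last; the Chunk struct and CHUNK_SIZE here are
illustrative, not backend code): each chunk carries its chunk_seq, so it can
be copied straight to its offset in whatever order the index scan returns it.

    #include <stdlib.h>
    #include <string.h>

    #define CHUNK_SIZE 7000          /* illustrative chunk payload size */

    typedef struct Chunk
    {
        int   chunk_seq;             /* position of this piece          */
        int   len;                   /* bytes in this piece             */
        char *data;                  /* the piece itself                */
    } Chunk;

    /* Reassemble a long value from chunks fetched in arbitrary order. */
    char *
    reassemble_long(const Chunk *chunks, int nchunks, long total_len)
    {
        char *result = malloc(total_len);
        int   i;

        for (i = 0; i < nchunks; i++)
            memcpy(result + (long) chunks[i].chunk_seq * CHUNK_SIZE,
                   chunks[i].data, chunks[i].len);

        return result;               /* caller frees */
    }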


> Seems we could even update partially(specified chunk_seq only)
> without problem.

That could be done, but seems too rare because the new data would have
to be the same length.  Doesn't seem worth it, though others may
disagree.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] LONG

From
wieck@debis.com (Jan Wieck)
Date:
Bruce Momjian wrote:

> > > No need for attno in there anymore.
> >
> >     I  still  need  it  to  explicitly  remove  one long value on
> >     update, while the other one is untouched. Otherwise  I  would
> >     have  to  drop  all  long  values  for  the  row together and
> >     reinsert all new ones.
>
> I am suggesting the longoid is not the oid of the primary or long*
> table, but a unque id we assigned just to number all parts of the long*
> tuple.  I thought that's what your oid was for.

    It's  not  even  an  Oid  of  any  existing  tuple,  just  an
    identifier to quickly find all the chunks of one  LONG  value
    by (non-unique) index.

    My idea is this now:

    The schema of the expansion relation is

        value_id        Oid
        chunk_seq       int32
        chunk_data      text

    with a non unique index on value_id.

    We  change  heap_formtuple(),  heap_copytuple()  etc.  not to
    allocate the entire thing in one palloc(). Instead the  tuple
    portion itself is allocated separately and the current memory
    context remembered too  in  the  HeapTuple  struct  (this  is
    required below).

    The long value reference in a tuple is defined as:

        vl_len          int32;     /* high bit set, 32-bit = 18 */
        vl_datasize     int32;     /* real vl_len of long value */
        vl_valueid      Oid;       /* value_id in expansion relation */
        vl_relid        Oid;       /* Oid of "expansion" table */
        vl_rowid        Oid;       /* Oid of the row in "primary" table */
        vl_attno        int16;     /* attribute number in "primary" table */

    The  tuple  given to heap_update() (the most complex one) can
    now contain usual VARLENA values of the format

        high-bit=0|31-bit-size|data

    or if the value is the result of a scan eventually

        high-bit=1|31-bit=18|datasize|valueid|relid|rowid|attno

    Now there are a couple of different cases.

    1.  The value found is a plain VARLENA  that  must  be  moved
        off.

        To  move  it  off a new Oid for value_id is obtained, the
        value itself stored in the  expansion  relation  and  the
        attribute in the tuple is replaced by the above structure
        with the values  1,  18,  original  VARSIZE(),  value_id,
        "expansion" relid, "primary" tuples Oid and attno.

    2.  The  value  found  is a long value reference that has our
        own "expansion" relid and the correct  rowid  and  attno.
        This  would  be  the result of an UPDATE without touching
        this long value.

        Nothing to be done.

    3.  The value found is a  long  value  reference  of  another
        attribute,  row or relation and this attribute is enabled
        for move off.

        The long value is fetched from the expansion relation  it
        is  living  in,  and the same as for 1. is done with that
        value. There's space for optimization  here,  because  we
        might have room to store the value plain. This can happen
        if the operation was an INSERT INTO t1  SELECT  FROM  t2,
        where  t1 has few small plus one varsize attribute, while
        t2 has many, many long varsizes.

    4.  The value found is a  long  value  reference  of  another
        attribute, row or relation and this attribute is disabled
        for move off (either per column or because  our  relation
        does not have an expansion relation at all).

        The  long value is fetched from the expansion relation it
        is living in, and the reference in our tuple is  replaced
        with this plain VARLENA.
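
    Condensed into C, the dispatch over these four cases might look
    roughly like the sketch below. Everything in it is hypothetical
    (move_off_enabled(), store_in_expansion(), fetch_expanded() and
    the exact struct layout are illustrative names only), not an
    actual implementation:

        #include <stdbool.h>
        #include <stdint.h>

        typedef uint32_t Oid;

        typedef struct LongRef              /* mirrors the layout above */
        {
            uint32_t vl_len;                /* high bit set, 32-bit = 18   */
            uint32_t vl_datasize;           /* real length of the value    */
            Oid      vl_valueid;            /* value_id in expansion rel   */
            Oid      vl_relid;              /* Oid of "expansion" table    */
            Oid      vl_rowid;              /* Oid of row in "primary"     */
            int16_t  vl_attno;              /* attno in "primary" table    */
        } LongRef;

        /* Hypothetical helpers standing in for the real work. */
        extern bool  move_off_enabled(Oid exprelid, int16_t attno);
        extern void *store_in_expansion(void *plain, Oid exprelid,
                                        Oid rowid, int16_t attno);
        extern void *fetch_expanded(const LongRef *ref);

        #define VARLENA_ISLONG(p)  ((*(const uint32_t *)(p)) & 0x80000000u)

        static void *
        prepare_long_attr(void *attr, Oid my_exprelid,
                          Oid my_rowid, int16_t my_attno)
        {
            if (!VARLENA_ISLONG(attr))
                /* case 1: a plain varlena; move it off if this
                 * column has move-off enabled */
                return move_off_enabled(my_exprelid, my_attno)
                    ? store_in_expansion(attr, my_exprelid,
                                         my_rowid, my_attno)
                    : attr;

            {
                const LongRef *ref = (const LongRef *) attr;

                if (ref->vl_relid == my_exprelid &&
                    ref->vl_rowid == my_rowid &&
                    ref->vl_attno == my_attno)
                    return attr;         /* case 2: ours, untouched */

                if (move_off_enabled(my_exprelid, my_attno))
                    /* case 3: foreign reference; re-store under our
                     * own value_id */
                    return store_in_expansion(fetch_expanded(ref),
                                              my_exprelid,
                                              my_rowid, my_attno);

                /* case 4: move-off disabled; inline the plain value */
                return fetch_expanded(ref);
            }
        }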

    This in-place replacement of values in the main tuple is the
    reason why we have to make another allocation for the tuple
    data and remember the memory context where it was made. Due to
    the above process, the tuple data can expand, and we then need
    to change into that context and reallocate it.

    What heap_update() further must do is to examine the OLD
    tuple (which it has already grabbed by CTID for header
    modification) and delete, by their value_id, all long values
    that are no longer present in the new tuple.

    The VARLENA arguments to type specific functions now can also
    have both formats.  The macro

        #define VAR_GETPLAIN(arg) \
            (VARLENA_ISLONG(arg) ? expand_long(arg) : (arg))

    can be used to get a pointer to an always plain
    representation, and the macro

        #define VAR_FREEPLAIN(arg,userptr) \
              if (arg != userptr) pfree(userptr);

    is to be used to tidy up before returning.

    In this scenario, a function  like  smaller(text,text)  would
    look like

        text *
        smaller(text *t1, text *t2)
        {
            text *plain1 = VAR_GETPLAIN(t1);
            text *plain2 = VAR_GETPLAIN(t2);
            text *result;

            if ( /* whatever to compare plain1 and plain2 */ )
                result = t1;
            else
                result = t2;

            VAR_FREEPLAIN(t1,plain1);
            VAR_FREEPLAIN(t2,plain2);

            return result;
        }

    The LRU cache used in expand_long() will make this repeated
    expansion cheap enough. The benefit would be that huge values
    resulting from table scans will be passed around in the system
    (in and out of sorting, grouping etc.) until they are modified
    or really stored/output.
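
    expand_long() is only referenced above; here is a rough sketch
    of what it has to do: look up a cached copy, else fetch all
    chunks sharing vl_valueid through the non-unique index, place
    them by chunk_seq and cache the plain result. The scan and
    cache helpers are hypothetical names, and LongRef/Oid (plus the
    stdint include) are taken from the sketch above:

        #include <stdlib.h>
        #include <string.h>

        #define CHUNK_SIZE 7000      /* illustrative, must match the writer */

        typedef struct Chunk         /* one row of the expansion relation */
        {
            int   chunk_seq;
            int   len;
            char *data;
        } Chunk;

        /* hypothetical helpers: index scan on value_id, and an LRU cache */
        extern void  *lru_cache_lookup(Oid valueid);
        extern void  *lru_cache_store(Oid valueid, void *plain);
        extern void  *scan_chunks_by_valueid(Oid exprelid, Oid valueid);
        extern Chunk *next_chunk(void *scan);        /* NULL when done */

        void *
        expand_long(const LongRef *ref)
        {
            void  *cached = lru_cache_lookup(ref->vl_valueid);
            char  *plain;
            void  *scan;
            Chunk *c;

            if (cached != NULL)
                return cached;       /* repeated expansion stays cheap */

            plain = malloc(ref->vl_datasize + 4);
            *(uint32_t *) plain = ref->vl_datasize + 4;  /* plain vl_len */

            scan = scan_chunks_by_valueid(ref->vl_relid, ref->vl_valueid);
            while ((c = next_chunk(scan)) != NULL)
                memcpy(plain + 4 + (long) c->chunk_seq * CHUNK_SIZE,
                       c->data, c->len);

            return lru_cache_store(ref->vl_valueid, plain);
        }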

    And the LONG index stuff should be covered here already (free
    lunch)!  Index_insert() MUST always be called after
    heap_insert()/heap_update(), because it needs the CTID assigned
    there. So at that time, the moved-off attributes are already
    replaced in the tuple data by the references. These will be
    stored instead of the values that originally were in the
    tuple.  Should also work with hash indices, as long as the
    hashing functions use VAR_GETPLAIN as well.

    If we want to use auto compression too, no problem. We code
    this into another bit of the first 32-bit vl_len. The question
    of whether to call expand_long() changes now to "is one of
    these bits set".  This way, we can store both compressed and
    uncompressed values in both the "primary" tuple and the
    "expansion" relation. expand_long() will take care of it.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #

Re: [HACKERS] LONG

From
wieck@debis.com (Jan Wieck)
Date:
>     We  change  heap_formtuple(),  heap_copytuple()  etc.  not to
>     allocate the entire thing in one palloc(). Instead the  tuple
>     portion itself is allocated separately and the current memory
>     context remembered too  in  the  HeapTuple  struct  (this  is
>     required below).

Uhh,

    just  realized  that  the  usual  pfree(htup)  will  not work
    anymore. But shouldn't that already have been something  like
    heap_freetuple(htup)?
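
    A minimal sketch of what such a heap_freetuple() would have to
    do, assuming the tuple body is now allocated separately and the
    HeapTuple remembers the memory context it came from (all names
    here are illustrative, not an actual declaration):

        typedef struct HeapTupleSketch
        {
            void *t_data;       /* separately allocated tuple body   */
            void *t_datamcxt;   /* memory context t_data was made in */
            /* ... header fields ... */
        } HeapTupleSketch;

        /* hypothetical: pfree() within a given memory context,
         * NULL meaning the current context */
        extern void context_pfree(void *mcxt, void *ptr);

        void
        heap_freetuple_sketch(HeapTupleSketch *htup)
        {
            if (htup->t_data != NULL)
                context_pfree(htup->t_datamcxt, htup->t_data);
            context_pfree(NULL, htup);   /* then the header itself */
        }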


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #

RE: [HACKERS] LONG

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]
>
> > > I am suggesting the longoid is not the oid of the primary or long*
> > > table, but a unque id we assigned just to number all parts of
> the long*
> > > tuple.  I thought that's what your oid was for.
> > >
> > Unfortunately I couldn't follow this issue correctly.
> > Is the format of long value relation different from Jan's original now ?
> >
> >  - At    CREATE   TABLE,   a   long   value   relation   named
> >       "_LONG<tablename>" is created for those tables who need it.
> >       And of course dropped and truncated appropriate. The schema
> >       of this table is
> >
> >           rowid       Oid,          -- oid of our main data row
>
> I am suggesting a unique oid just to store this long value.  The new oid
> gets stored in the primary table, and on every row of the long* table.
>

Hmm, we could delete long values easily using rowid in the case of
heap_delete() ...

>
> > Seems we could even update partially(specified chunk_seq only)
> > without problem.
>
> That could be done, but seems too rare because the new data would have
> to be the same length. Doesn't seem worth it, though others may
> disagree.
>

First, I wanted to emphasize that we don't have to update any long value
tuples if we don't update the long values themselves. It's a special case
of partial update.

Second, large objects have a feature like this. If we were to replace
large objects with LONG, wouldn't it be needed ?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp



Re: [HACKERS] libpq questions...when threads collide

From
Peter Eisentraut
Date:
On Sun, 12 Dec 1999, Tom Lane wrote:

> Vince Vielhaber <vev@michvhf.com> writes:
> > Either we should keep the current docs
> > or the release docs online - not both.
> 
> I disagree, because they serve different audiences.  The snapshot docs
> are very useful to developers, particularly those of us who don't have
> SGML tools installed but still want to know whether the docs we
> committed recently look right or not ;-).  Meanwhile, current-release
> documents are clearly the right thing to provide for ordinary users.

Um, you mean you commit docs before you know whether they even "compile"?
As I see it, if you want to edit the docs, you should test them with your
own SGML tools. With recent sgmltools packages, this is not so hard. At
least the patch applicator hopefully does this.

-- 
Peter Eisentraut                  Sernanders vaeg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden



Re: [HACKERS] LONG

From
Bruce Momjian
Date:
This outline is perfect!


> > I am suggesting the longoid is not the oid of the primary or long*
> > table, but a unque id we assigned just to number all parts of the long*
> > tuple.  I thought that's what your oid was for.
> 
>     It's  not  even  an  Oid  of  any  existing  tuple,  just  an
>     identifier to quickly find all the chunks of one  LONG  value
>     by (non-unique) index.

Yes, I understood this and I think it is a great idea.  It allows UPDATE
to control whether it wants to replace the LONG value.


> 
>     My idea is this now:
> 
>     The schema of the expansion relation is
> 
>         value_id        Oid
>         chunk_seq       int32
>         chunk_data      text
> 
>     with a non unique index on value_id.

Yes, exactly.

> 
>     We  change  heap_formtuple(),  heap_copytuple()  etc.  not to
>     allocate the entire thing in one palloc(). Instead the  tuple
>     portion itself is allocated separately and the current memory
>     context remembered too  in  the  HeapTuple  struct  (this  is
>     required below).

I read the later part.  I understand.

> 
>     The long value reference in a tuple is defined as:
> 
>         vl_len          int32;     /* high bit set, 32-bit = 18 */
>         vl_datasize     int32;     /* real vl_len of long value */
>         vl_valueid      Oid;       /* value_id in expansion relation */
>         vl_relid        Oid;       /* Oid of "expansion" table */
>         vl_rowid        Oid;       /* Oid of the row in "primary" table */
>         vl_attno        int16;     /* attribute number in "primary" table */

I see you need vl_rowid and vl_attno so you don't accidentally reference
a LONG value twice.  Good point.  I hadn't thought of that.

> 
>     The  tuple  given to heap_update() (the most complex one) can
>     now contain usual VARLENA values of the format
> 
>         high-bit=0|31-bit-size|data
> 
>     or if the value is the result of a scan eventually
> 
>         high-bit=1|31-bit=18|datasize|valueid|relid|rowid|attno
> 
>     Now there are a couple of different cases.
> 
>     1.  The value found is a plain VARLENA  that  must  be  moved
>         off.
> 
>         To  move  it  off a new Oid for value_id is obtained, the
>         value itself stored in the  expansion  relation  and  the
>         attribute in the tuple is replaced by the above structure
>         with the values  1,  18,  original  VARSIZE(),  value_id,
>         "expansion" relid, "primary" tuples Oid and attno.
> 
>     2.  The  value  found  is a long value reference that has our
>         own "expansion" relid and the correct  rowid  and  attno.
>         This  would  be  the result of an UPDATE without touching
>         this long value.
> 
>         Nothing to be done.
> 
>     3.  The value found is a  long  value  reference  of  another
>         attribute,  row or relation and this attribute is enabled
>         for move off.
> 
>         The long value is fetched from the expansion relation  it
>         is  living  in,  and the same as for 1. is done with that
>         value. There's space for optimization  here,  because  we
>         might have room to store the value plain. This can happen
>         if the operation was an INSERT INTO t1  SELECT  FROM  t2,
>         where  t1 has few small plus one varsize attribute, while
>         t2 has many, many long varsizes.
> 
>     4.  The value found is a  long  value  reference  of  another
>         attribute, row or relation and this attribute is disabled
>         for move off (either per column or because  our  relation
>         does not have an expansion relation at all).
> 
>         The  long value is fetched from the expansion relation it
>         is living in, and the reference in our tuple is  replaced
>         with this plain VARLENA.

Yes.

> 
>     This  in place replacement of values in the main tuple is the
>     reason, why we have to make another allocation for the  tuple
>     data  and  remember the memory context where made. Due to the
>     above process, the tuple data can expand, and we then need to
>     change into that context and reallocate it.


Yes, got it.

> 
>     What  heap_update()  further  must  do  is to examine the OLD
>     tuple (that  it  already  has  grabbed  by  CTID  for  header
>     modification)  and  delete all long values by their value_id,
>     that aren't any longer present in the new tuple.

Yes, makes vacuum run fine on the LONG* relation.

> 
>     The VARLENA arguments to type specific functions now can also
>     have both formats.  The macro
> 
>         #define VAR_GETPLAIN(arg) \
>             (VARLENA_ISLONG(arg) ? expand_long(arg) : (arg))
> 
>     can   be   used   to  get  a  pointer  to  an  allways  plain
>     representation, and the macro
> 
>         #define VAR_FREEPLAIN(arg,userptr) \
>               if (arg != userptr) pfree(userptr);
> 
>     is to be used to tidy up before returning.

Got it.

> 
>     In this scenario, a function  like  smaller(text,text)  would
>     look like
> 
>         text *
>         smaller(text *t1, text *t2)
>         {
>             text *plain1 = VAR_GETPLAIN(t1);
>             text *plain2 = VAR_GETPLAIN(t2);
>             text *result;
> 
>             if ( /* whatever to compare plain1 and plain2 */ )
>                 result = t1;
>             else
>                 result = t2;
> 
>             VAR_FREEPLAIN(t1,plain1);
>             VAR_FREEPLAIN(t2,plain2);
> 
>             return result;
>         }

Yes.

> 
>     The  LRU cache used in expand_long() will the again and again
>     expansion become cheap enough. The  benefit  would  be,  that
>     huge  values resulting from table scans will be passed around
>     in the system (in and out of sorting,  grouping  etc.)  until
>     they are modified or really stored/output.

Yes.

> 
>     And the LONG index stuff should be covered here already (free
>     lunch)!   Index_insert()  MUST  allways   be   called   after
>     heap_insert()/heap_update(),   because  it  needs  the  there
>     assigned CTID. So at that time, the moved off attributes  are
>     replaced  in  the tuple data by the references. These will be
>     stored instead of the values that  originally  where  in  the
>     tuple.   Should  also  work with hash indices, as long as the
>     hashing functions use VAR_GETPLAIN as well.

I hoped this would be true.  Great.

> 
>     If we want to use auto compression too, no problem.  We  code
>     this  into  another  bit  of  the  first  32-bit  vl_len. The
>     question if to call expand_long() changes now to "is  one  of
>     these  set".  This  way,  we  can  store both, compressed and
>     uncompressed  into  both,  "primary"  tuple  or   "expansion"
>     relation. expand_long() will take care for it.

Perfect.  Sounds great.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] LONG

From
Christof Petig
Date:
As I offered some time to work on tuple chaining, this thread clearly
touches the same area.

The idea of transparently moving big attributes into a separate table
clearly has its benefits, as long as normal operations need not touch
these long values. I (too) see this as a great deal. And the fact that
it happens transparently (not visible to the user) is the best about it.

But AFAICS tuple chaining shouldn't be such a big deal; it should be
about three days of work. (It'll definitely take longer for me, since I
have to understand pgsql's internals first.)  Split the tuple into
multiple Items on disk storage, concatenate them on read-in. Then make
vacuum ignore continued items when not dealing with the whole tuple. No
need to touch CID, XID etc. The most obvious disadvantage is possible
fragmentation of tuples (unless handled in vacuum). Disk access
atomicity for tuples is a non-issue for Linux people since Linux uses 1k
blocks :-(

Storing attributes separately is the best solution once you exceed
4*BLKSZ; tuple chaining addresses 1.1-3*BLKSZ most efficiently. (Correct
me if I'm wrong.)
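
To make the chaining idea concrete, a hypothetical sketch of what a
continued on-disk item could carry (layout and names are purely
illustrative, not existing page-layout code):

    #include <stdint.h>

    #define CHAIN_HAS_NEXT 0x0001

    typedef struct ChainedItemSketch
    {
        uint16_t flags;          /* CHAIN_HAS_NEXT when a piece follows  */
        uint32_t next_block;     /* block holding the continuation item  */
        uint16_t next_offset;    /* line pointer slot of that item       */
        uint16_t frag_len;       /* bytes of tuple data in this fragment */
        /* frag_len bytes of raw tuple data follow */
    } ChainedItemSketch;

    /* Read-in walks the chain and concatenates the fragments into one
     * buffer; sequential scans and VACUUM skip items that are marked
     * as continuations rather than tuple heads. */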

LONG as a seperate type is IMHO just another concept you have to master
before you can use a RDBMS efficiently. The less different concepts a
user needs to learn, the easier life is for him. Postgres already has a
lot of data types to learn. 

Wrapping lo in a user type sounds good to me.

Yours     Christof




Re: [HACKERS] libpq questions...when threads collide

From
Thomas Lockhart
Date:
> > > Either we should keep the current docs
> > > or the release docs online - not both.
> > I disagree, because they serve different audiences.  The snapshot docs
> > are very useful to developers, particularly those of us who don't have
> > SGML tools installed but still want to know whether the docs we
> > committed recently look right or not ;-).  Meanwhile, current-release
> > documents are clearly the right thing to provide for ordinary users.

Vince, I'm with Tom on this one, having both would be great. The
"developer's only" posting is a holdover from the first days when we
could generate docs on the Postgres machine, and I only had one place
on the web page I could put docs. But having the release docs posted
from the "Documentation" page and the current tree docs posted either
there or on the "Developers" page would be great. I'm happy to
redirect my nightly cron job to put the output somewhere other than
where they are now.

> Um, you mean you commit docs before you know whether they even "compile"?
> As I see it, if you want to edit the docs, you should test them with your
> own SGML tools. With recent sgmltools packages, this is not so hard. At
> least the patch applicator hopefully does this.

No, testing doc output has never been a prerequisite for submitting
and committing doc improvements/updates. If the submitted sgml code is
a bit wrong, the nightly cron job halts in the middle and the output
tar files and web page copies don't get updated. I see the results in
the cron output I have sent to my home machine, and usually fix the
problem within a day or two (would be longer recently since I'm so
busy, but the scheme still is working...).

The important thing is getting the words updated in the docs, and
running jade or the SGML-tools wrappers would still be too much of a
barrier if it were a prerequisite.
                       - Thomas

-- 
Thomas Lockhart                lockhart@alumni.caltech.edu
South Pasadena, California


Re: [HACKERS] libpq questions...when threads collide

From
Vince Vielhaber
Date:
On Tue, 14 Dec 1999, Thomas Lockhart wrote:

> > > > Either we should keep the current docs
> > > > or the release docs online - not both.
> > > I disagree, because they serve different audiences.  The snapshot docs
> > > are very useful to developers, particularly those of us who don't have
> > > SGML tools installed but still want to know whether the docs we
> > > committed recently look right or not ;-).  Meanwhile, current-release
> > > documents are clearly the right thing to provide for ordinary users.
> 
> Vince, I'm with Tom on this one, having both would be great. The
> "developer's only" posting is a holdover from the first days when we
> could generate docs on the Postgres machine, and I only had one place
> on the web page I could put docs. But having the release docs posted
> from the "Documentation" page and the current tree docs posted either
> there or on the "Developers" page would be great. I'm happy to
> redirect my nightly cron job to put the output somewhere other than
> where they are now.

No problem, I'll come up with a developer's section.  I need to make it
as obvious as possible or as obscure as possible to keep the webmaster
mailbox from overflowing.  I'll let you know 'cuze it'll also affect 
the search engine.  Hopefully in the next week, otherwise it won't happen
till the next century :)

Vince.
-- 
==========================================================================
Vince Vielhaber -- KA8CSH    email: vev@michvhf.com    http://www.pop4.net
       Online Campground Directory    http://www.camping-usa.com
       Online Giftshop Superstore     http://www.cloudninegifts.com
==========================================================================