
From wieck@debis.com (Jan Wieck)
Subject Re: [HACKERS] LONG
Msg-id m11xFsS-0003kGC@orion.SAPserv.Hamburg.dsh.de
In response to Re: [HACKERS] LONG  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [HACKERS] LONG
List pgsql-hackers
Tom Lane wrote:

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Also, my idea was to auto-enable longs for all varlena types, so short
> > values stay in the table, while longer chained ones that take up lots of
> > space and are expensive to expand are retrieved only when needed.
>
> I missed most of yesterday's discussion (was off fighting a different
> fire...).  This morning in the shower I had a brilliant idea, which
> I now see Bruce has beaten me to ;-)
>
> The idea of doing tuple splitting by pushing "long" fields out of line,
> rather than just cutting up the tuple at arbitrary points, is clearly
> a win for the reasons Bruce and Jan point out.  But I like Bruce's
> approach (automatically do it for any overly-long varlena attribute)
> much better than Jan's (invent a special LONG datatype).  A special
> datatype is bad for several reasons:
> * it forces users to kluge up their database schemas;
> * inevitably, users will pick the wrong columns to make LONG (it's
>   a truism that programmers seldom guess right about what parts of
>   their programs consume the most resources; users would need a
>   "profiler" to make the right decisions);
> * it doesn't solve the problems for arrays, which desperately need it;
> * we'd need to add a whole bunch of operations on the special datatype;

O.K.,

    you two got me now.

>
> I think that the right place to implement this is in heapam, and that
> it should go more or less like this:
>
> 1. While writing out a tuple, if the total tuple size is "too big"
> (threshold would be some fraction of BLCKSZ, yet to be chosen),
> then the tuple manager would go through the tuple to find the longest
> varlena attribute, and convert same into an out-of-line attribute.
> Repeat if necessary until tuple size fits within threshold.

    Yepp.  But it must NOT mangle up the tuple handed to it in
    place.  The flat values in the tuple are sometimes used AFTER
    heap_insert() and heap_update(), for example for
    index_insert(), so modifying the tuple in place might break
    other places.
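
    Something like the following standalone sketch (all names and
    the threshold are invented here, just to illustrate the idea,
    not the real heapam interfaces): build a modified copy and
    leave the caller's flat tuple alone.

    #include <stdlib.h>
    #include <string.h>

    #define BLCKSZ          8192
    #define LONG_THRESHOLD  (BLCKSZ / 4)   /* fraction of BLCKSZ, to be tuned */

    typedef struct SketchAttr
    {
        int     is_varlena;         /* only varlena attrs can be moved */
        int     is_out_of_line;     /* already replaced by a reference? */
        size_t  size;               /* current stored size */
    } SketchAttr;

    typedef struct SketchTuple
    {
        int         natts;
        SketchAttr *attrs;
        size_t      size;           /* total flat tuple size */
    } SketchTuple;

    /* stand-in: store the value out of line, return the reference size */
    static size_t
    move_value_out_of_line(SketchAttr *att)
    {
        att->is_out_of_line = 1;
        return 20;                  /* e.g. value oid + attno + seq */
    }

    /* Return a modified copy; the tuple handed in stays untouched. */
    static SketchTuple *
    split_long_values(const SketchTuple *orig)
    {
        SketchTuple *copy = malloc(sizeof(SketchTuple));

        copy->natts = orig->natts;
        copy->size = orig->size;
        copy->attrs = malloc(sizeof(SketchAttr) * orig->natts);
        memcpy(copy->attrs, orig->attrs, sizeof(SketchAttr) * orig->natts);

        while (copy->size > LONG_THRESHOLD)
        {
            SketchAttr *longest = NULL;
            int         i;

            /* find the longest varlena attribute still stored inline */
            for (i = 0; i < copy->natts; i++)
            {
                SketchAttr *att = &copy->attrs[i];

                if (att->is_varlena && !att->is_out_of_line &&
                    (longest == NULL || att->size > longest->size))
                    longest = att;
            }

            if (longest == NULL)
                break;              /* nothing left to move out */

            copy->size -= longest->size;
            copy->size += move_value_out_of_line(longest);
        }

        return copy;
    }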

> 2. While reading a tuple, fastgetattr() automatically fetches the
> out-of-line value if it sees the requested attribute is out-of-line.
> (I'd be inclined to mark out-of-line attributes in the same way that
> NULL attributes are marked: one bit in the tuple header shows if any
> out-of-line attrs are present, and if so there is a bitmap to show
> which ones are out-of-line.  We could also use Bruce's idea of
> commandeering the high-order bit of the varlena length word, but
> I think that's a much uglier and more fragile solution.)
>
> I think that these two changes would handle 99% of the problem.
> VACUUM would still need work, but most normal access to tuples would
> just work automatically, because all access to varlena fields must go
> through fastgetattr().

    And I like Bruce's idea with the high order bit of vl_len.
    This is IMHO the only chance to tell on UPDATE whether the
    value was changed.

    To detect that an UPDATE did not touch the out of line value,
    you need the complete long reference information in the
    RESULT tuple.  The executor must not expand the value already
    while building up the result tuple.
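
    Roughly like this (the flag bit, the LongRef layout and the
    field names are only invented for this sketch): the reference
    left in the tuple still looks like a varlena, but with the
    high order bit of the length word set, and comparing old and
    new reference tells whether the UPDATE touched the value.

    #include <stdint.h>
    #include <string.h>

    #define VARATT_FLAG_LONG 0x80000000u    /* hypothetical marker bit */

    typedef struct LongRef
    {
        uint32_t vl_len;            /* VARATT_FLAG_LONG | sizeof(LongRef) */
        uint32_t value_oid;         /* per-value Oid, see further down */
        uint32_t raw_size;          /* size of the expanded value */
    } LongRef;

    #define VARATT_IS_LONG(ref) \
        ((((const LongRef *) (ref))->vl_len & VARATT_FLAG_LONG) != 0)

    /* The value was untouched by the UPDATE if the result tuple still
     * carries the very same reference, so nothing has to be deleted
     * or re-inserted in the long value relation for this attribute. */
    static int
    long_value_unchanged(const LongRef *oldref, const LongRef *newref)
    {
        return VARATT_IS_LONG(oldref) && VARATT_IS_LONG(newref) &&
               memcmp(oldref, newref, sizeof(LongRef)) == 0;
    }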

    But Tom is right, there is a visibility problem I haven't
    seen before.  It shows up when fetching the out of line
    attribute (for example in the type output function) happens
    later than fetching the reference information.  Then a
    transaction reading dirty or read committed might see wrong
    content, or worse, see different contents at different
    fetches.

    The solution I see is to give every out of line datum its own
    Oid, which is part of its header and stamped into the
    reference data.  That way the long attribute lookup can be
    done under SnapshotAny using this Oid.  Only one datum with
    that Oid can ever exist, so SnapshotAny is safe here and
    ensures that only the visibility of the master tuple in the
    main table counts at all.

    Since this value Oid is known in the value's reference in the
    tuple, we only need two indices on the out of line data: one
    on this Oid, and one on the referencing row's oid|attrno|seq
    to be fast in heap_delete() and heap_update().
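
    For illustration, a row of the (user invisible) long value
    relation and the two index keys could look about like this
    (field names made up for the sketch):

    #include <stdint.h>

    typedef uint32_t Oid;

    typedef struct LongValueChunk
    {
        Oid     value_oid;          /* identity of the out of line datum;
                                     * looked up under SnapshotAny, so only
                                     * the master tuple's visibility counts */
        Oid     row_oid;            /* oid of the referencing row */
        int16_t attno;              /* attribute number in that row */
        int32_t seq;                /* chunk number within the value */
        /* ... followed by the chunk data itself (a normal varlena) ... */
    } LongValueChunk;

    /*
     * Index 1: (value_oid, seq)      - fetching a value back
     * Index 2: (row_oid, attno, seq) - heap_delete() and heap_update()
     */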

> An as-yet-unsolved issue is how to avoid memory leaks of out-of-line
> values after they have been read in by fastgetattr().  However, I think
> that's going to be a nasty problem with Jan's approach as well.  The
> best answer might be to solve this in combination with addressing the
> problem of leakage of temporary results during expression evaluation,
> say by adding some kind of reference-count convention to all varlena
> values.

    At the point where we decide to move an attribute out of the
    tuple, we do a lookup in an array of type Oids.  Thus we have
    plenty of time to add one datatype after another and enable
    them separately for long processing, while getting the ones
    already enabled out of the door ASAP (next release).
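
    The check itself is trivial, about like this (the Oids in the
    list are just placeholders for whatever types we enable
    first):

    #include <stddef.h>

    typedef unsigned int Oid;

    /* types currently enabled for long processing; extended one by one */
    static const Oid long_enabled_types[] =
    {
        25,                         /* placeholder, e.g. text  */
        17,                         /* placeholder, e.g. bytea */
    };

    static int
    type_is_long_enabled(Oid typeoid)
    {
        size_t i;

        for (i = 0; i < sizeof(long_enabled_types) / sizeof(Oid); i++)
            if (long_enabled_types[i] == typeoid)
                return 1;
        return 0;
    }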

    As Bruce suggested, we implement a central function that
    fetches back the long value.  This is used in all the type
    specific functions in adt.  Now that we have an Oid
    identifier per single value, it's easy to implement a cache
    there that manages an LRU table of the last fetched values
    and keeps the smaller ones around for fast access.

    It is the responsibility of the type's adt functions to free
    the returned (ordinary VARLENA looking) memory.  Since we
    enable the types one by one, there's no need to hurry on
    this.
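
    A standalone sketch of that calling convention (the cache,
    long_read_chunks() and the struct layout are stand-ins; in
    the backend the allocations would of course be palloc'd): the
    central fetch always hands back a fresh allocation that the
    calling adt function frees when done with it.

    #include <stdlib.h>
    #include <string.h>

    typedef unsigned int Oid;

    typedef struct long_value       /* looks like an ordinary varlena */
    {
        unsigned int vl_len;        /* total size including header */
        char         vl_dat[1];
    } long_value;

    /* stand-in: read the chunks of one value (via the value_oid index,
     * under SnapshotAny) and glue them together into one allocation */
    static long_value *
    long_read_chunks(Oid value_oid)
    {
        (void) value_oid;
        return NULL;                /* left out of this sketch */
    }

    #define LONG_CACHE_SLOTS 16
    #define LONG_CACHE_MAX   4096   /* cache only the smaller values */

    static struct { Oid oid; long_value *val; } long_cache[LONG_CACHE_SLOTS];
    static int long_cache_next = 0;

    /* Central fetch used by all type specific adt functions.  The result
     * is always a fresh allocation the caller is responsible to free. */
    static long_value *
    fetch_long_value(Oid value_oid)
    {
        long_value *cached = NULL;
        long_value *copy;
        int         i;

        for (i = 0; i < LONG_CACHE_SLOTS; i++)
            if (long_cache[i].val != NULL && long_cache[i].oid == value_oid)
                cached = long_cache[i].val;

        if (cached == NULL)
        {
            long_value *fresh = long_read_chunks(value_oid);

            if (fresh == NULL)
                return NULL;
            if (fresh->vl_len > LONG_CACHE_MAX)
                return fresh;       /* too big to cache; caller owns it */

            /* remember it, replacing the oldest cache slot */
            i = long_cache_next++ % LONG_CACHE_SLOTS;
            free(long_cache[i].val);
            long_cache[i].val = fresh;
            long_cache[i].oid = value_oid;
            cached = fresh;
        }

        /* hand out a private copy, so the caller can free it in any case */
        copy = malloc(cached->vl_len);
        memcpy(copy, cached, cached->vl_len);
        return copy;
    }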

> BTW, I don't see any really good reason to keep the out-of-line values
> in a separate physical file (relation) as Jan originally proposed.
> Why not keep them in the same file, but mark them as being something
> different than a normal tuple?  Sequential scans would have to know to
> skip over them (big deal), and VACUUM would have to handle them

    The one reason I see is that a sequential scan would not
    benefit from this: it would still have to read the entire
    relation, even when looking only at small, fixed size items
    in the tuple.  Keeping the long values out of the way will be
    a big win for count(*).  And with the mentioned value cache
    for relatively small values (yet to be defined what that
    means), there will be very little overhead in a sort if the
    tuples in it are sorted by an attribute where long values
    occasionally appear.

> properly, but I think VACUUM is going to have to have special code to
> support this feature no matter what.  If we do make them a new primitive
> kind-of-a-tuple on disk, we could sidestep the problem of marking all
> the out-of-line values associated with a tuple when the tuple is
> outdated by a transaction.  The out-of-line values wouldn't have
> transaction IDs in them at all; they'd just be labeled with the CTID
> and/or OID of the primary tuple they belong to.  VACUUM would consult
> that tuple to determine whether to keep or discard an out-of-line value.

    AFAIK, VACUUM consults single attributes of a tuple only to
    produce the statistical information for them on ANALYZE.
    Well, statistics for columns containing LONG values aren't
    good for the WHERE clause anyway (I think we all agree on
    that).  So it doesn't matter if this information isn't
    totally accurate, or if VACUUM counts the values but uses
    only the first couple of bytes for the min/max etc. info.

    Also, the new long data relations should IMHO have their own
    relkind, so VACUUM can easily detect them.  I think this is
    required so that VACUUM can place an exclusive lock on the
    main table first, before starting to vacuum the long values
    (which can be done as is, since the long data relation is in
    fact a normal relation - just not visible to the user).  This
    should avoid race conditions like the visibility problem
    explained above.
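
    Just to illustrate the idea (the relkind letter and the names
    here are made up, not a proposal for the actual catalog
    values):

    #define RELKIND_RELATION  'r'   /* ordinary table */
    #define RELKIND_LONGVALUE 'l'   /* hypothetical: long value storage */

    typedef struct RelSketch
    {
        char relkind;
        /* ... */
    } RelSketch;

    /* VACUUM would skip these relations in its normal pass and vacuum
     * them only while holding the exclusive lock on their master table. */
    static int
    is_long_value_relation(const RelSketch *rel)
    {
        return rel->relkind == RELKIND_LONGVALUE;
    }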

    I'll start to play around with this approach for a while,
    using lztext as a test candidate (with custom compression
    parameters that force uncompressed storage).  When I have
    some reasonable result ready to look at, I'll send a patch
    here, so we can continue the discussion while looking at a
    test implementation.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #
