Re: Jesus, what have I done (was: LONG) - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: Jesus, what have I done (was: LONG)
Date
Msg-id 199912120539.AAA17817@candle.pha.pa.us
In response to Jesus, what have I done (was: LONG)  (wieck@debis.com (Jan Wieck))
Responses Re: Jesus, what have I done (was: LONG)
List pgsql-hackers
OK, I think I can take your ideas and polish this into a killer feature,
so I will keep going on this discussion.


> Bruce Momjian wrote (in several messages):
> 
> > Actually, in looking at the fsync() system call, it does write the
> > entire file descriptor before marking the transaction as complete, so
> > there is no hard reason not to raise it, but because the OS has to do
> > two reads to get 16k, I think we are better off keeping 8k as our base block
> > size.
> 
>     Agreed. Let's stay with the 8K default.

OK.  I am worried about performance problems from increasing this for
non-large tuples.  That is why I was inclined to keep 8k.  We are never
going to be able to configure a block size large enough for 8MB tuples
anyway, so I figured 8k was good enough.


> 
> > -1|oid|attno
> 
>     Actually I think you need two more pieces of information to
>     move it around independently.  As you agreed somewhere else (on my
>     TRUNCATE issue), it would be better to keep the long values
>     in a per table expansion relation.  Thus, you need the Oid of
>     that too at least.  Also, it would be good to know the size of
>     the data before fetching it, so you need that too.

Yes, I see your point that you don't know the relation oid in those adt
routines.  Yes, you would need the oid too.  New structure would be:
1-bit long flag|31-bit length|long relid|tuple oid|attno
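
For illustration, that reference could be laid out roughly like this; the
struct and field names below are hypothetical, not actual PostgreSQL source:

    /*
     * Hedged sketch of the proposed long-value reference, following the
     * "1-bit long flag|31-bit length|long relid|tuple oid|attno" layout
     * above.  Names are illustrative only.
     */
    typedef struct LongRef
    {
        int32   vl_len;     /* high bit set marks a long reference; the
                             * remaining 31 bits hold the real data length */
        Oid     longrelid;  /* OID of the per-table long (expansion) relation */
        Oid     tupleoid;   /* OID identifying the row in that relation */
        int16   attno;      /* attribute number of the original column */
    } LongRef;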


> > Now it would be nice if we could set the varlena length to 12, its
> > actual length, and then just somehow know that the varlena of 12 was a
> > long data entry.  Our current varlena has a maximum length of 64k.
> >
> > Or make struct varlena vl_len a 15-bit field, and make islong a 1-bit
> > field.  I don't remember if using & manually or bit fields is faster.
> 
>     I don't see vl_len as a 15-bit field.  In the current sources
>     (in postgres.h), it is an int32.  And I'm sure that no

Sorry, 32-bit field.  I was thinking 16-bit because there is no need for
length values >8k.  Seems we have >16 unused bits in the length.


>     code is aware that some magic bits in it contain a special
>     meaning.  At least the types I added recently (numeric and
>     lztext) aren't.  Nor am I sure that a variable length Datum is
>     never duplicated somewhere, just by using the information
>     from vl_len, with or without using the macro.  Thus we would
>     have to visit a lot of code to make sure this new variable
>     length Datum can be passed around as you like.

I just checked: vl_len is used only in varlena.c, inv_api.c, and the
VARSIZE define.  I made sure of that several releases ago, so they all
use the macro.
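
Just to make the idea concrete, here is a sketch of how the spare high bit
could be tested and masked through a macro path.  These macro names are
hypothetical, not the current postgres.h definitions:

    /* Hypothetical macros only, not the current postgres.h definitions. */
    #define VARLENA_LONG_FLAG   0x80000000
    #define VARSIZE_RAW(PTR)    (((struct varlena *) (PTR))->vl_len)

    /* true if this "varlena" is really a long-value reference */
    #define VARATT_IS_LONG(PTR) ((VARSIZE_RAW(PTR) & VARLENA_LONG_FLAG) != 0)

    /* length with the flag masked away: for a long reference this would be
     * the size of the stored long value, per the layout sketched above */
    #define VARSIZE_MASKED(PTR) ((int32) (VARSIZE_RAW(PTR) & ~VARLENA_LONG_FLAG))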


> 
>     And the IMHO most significant drawback is that existing user
>     type definitions treat the first 32 bits in a variable length
>     data type just as I interpreted the meaning up to now.  So we
>     could occasionally break more than we are aware of.

OK, the solution is that we never pass back this type with the long bit
set.  We always expand it on return to user applications.

We can restrict type expansion to only certain data types.  Not all
varlena types have to be expanded.

> 
> > In your system, a sequential scan of the table will pull in all this
> > stuff because you are going to expand the tuple.  That could be very
> > costly.  In my system, the expansion only happens on output if the LONG
> > field does not appear in the WHERE or ORDER BY clauses.
> 
> In my system, it would do exactly as in yours, because they are mostly the
> same. The modification done to the tuple in heap_insert() and heap_replace(),
> just before the call to RelationPutHeapTupleAtEnd(), gives each
> LONG Datum a varsize of 20. Just that the first 32 bits don't contain any
> magic information.

OK.  I just want to get this working in a seamless way with our existing
types.

> 
> > >     Maybe  we  make  this  mechanism  so  general  that   it   is
> > >     automatically applied to ALL varsize attributes? We'll end up
> > >     with one big pg_long where 90+% of the database's content will
> > >     be stored.
> >
> > If most joins, comparisons are done on the 10% in the main table, so
> > much the better.
> 
>     Yes, but how would you want to judge which varsize value to
>     put onto the "secondary" relation, and which one to keep in
>     the "primary" table for fast comparisons?

There is only one place in heap_insert() that checks the tuple size and
returns an error if it exceeds the block size.  I recommend that when we
exceed it, we scan the tuple, find the largest varlena value of a type
supported for long relations, set the long bit, and copy the data into
the long table.  Keep going until the tuple is small enough, and if it
never is, throw an error on tuple size exceeded.  Also, prevent indexed
columns from being made long.  A sketch of the loop follows.
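
Something along these lines; every helper name here is made up for
illustration and is not existing backend code:

    /*
     * Hedged sketch of the proposed check in heap_insert().
     * tuple_data_size(), find_largest_movable_varlena() and
     * move_value_to_long_relation() are hypothetical helpers.
     */
    while (tuple_data_size(tupleDesc, values, nulls) > MaxTupleSize)
    {
        /* largest varlena attribute of a long-capable type, skipping
         * indexed columns */
        int     attno = find_largest_movable_varlena(rel, tupleDesc,
                                                     values, nulls);

        if (attno < 0)
            elog(ERROR, "heap_insert: tuple is too big, even after moving long values");

        /*
         * Copy the value into the per-table long relation and replace the
         * attribute with a long reference (long bit, long relid, oid, attno).
         */
        values[attno] = move_value_to_long_relation(rel, values[attno], attno);
    }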

> 
>     I think you forgot one little detail.  In our model, you can
>     only move the Datum's extended information around as is.
>     It will never be expanded in place, so it must be fetched
>     (index scan) again at any place the value itself is
>     required.

Yes, I agree, but in most cases it will only be expanded on return to the
user application, because long fields, made long only when needed as
described above, are usually not used in WHERE or ORDER BY.  If only a few
values exceed the 8k limit, those would have to be retrieved to evaluate
the WHERE or ORDER BY.  If many values are long, that would mean a lot of
lookups, but I think this solution would be the best for most uses.

> 
>     The installed base currently uses varsize attributes with
>     indices on them to condition, sort and group on them.  Now
>     pushing such a field into "secondary" occasionally will cause
>     a substantial loss of performance.

Really?  Do people really group/order by on >8k values often?  I question
this.

> 
>     So again, how do you determine which of the attributes is a
>     candidate to push into "secondary"?  It is such a generic
>     approach that I cannot imagine any fail-safe method.

Outlined above in heap_insert().  Seems it would be a small loop.

> 
>     I'd rather have another LONG data type, one that lets
>     me store huge strings but where I know exactly what I
>     can't do with it, than some automatic detection process
>     that I cannot force to do what I want.  It has happened just
>     too often to me that these "user friendly, better knowing what
>     I might want" systems got me by the balls.  I'm a real
>     programmer, so there's always a way out for me, but what
>     should a real user do?

Automatic is better, I think.  We already have too many character types,
and another one is going to be confusing.  Also, if you have data that
is mostly under 8k, but a few values are over, how do you store that?
Make them all LONG and have the overhead for each row?  This seems like a
situation many people are in.  Also, by making it automatic, we can
change the implementation later without having to re-teach people how to
store long tuples.

> 
> > Let's face it.  Most long tuples are store/retrieve, not ordered on or
> > used in WHERE clauses.  Moving them out of the main table speeds up
> > things.  It also prevents expansion of rows that never end up in the
> > result set.
> 
>     Having a tuple consisting of 30+ attributes, where 20 of them
>     are varsize ones (CHAR, VARCHAR, NUMERIC etc.), what makes it
>     a long tuple? Yes, I'm repeating this  question  once  again,
>     because  we're talking about a "one must fit all cases" here.

Again, scan tuple and move to long table until tuple fits.

> 
> > stuff because you are going to expand the tuple.  That could be very
> > costly.  In my system, the expansion only happens on output if the LONG
> > field does not appear in the WHERE or ORDER BY clauses.
> 
>     No I won't.  As explained, I would return a tuple as is, just
>     with the LONG reference information.  It will only, but then
>     always again, be expanded if needed to compare, store again
>     or be output to the client.  This "always again" is one
>     of my drawbacks against your "treating all varsize as pushable"
>     concept.  In one of my early projects, I had to manage a
>     microVax for a year, and I have loved systems that can be fine
>     tuned ever since, really!  Auto detection is a nice feature,
>     but if that fails and you don't have any override option,
>     you're hosed.

I am confused here.  With my code, you only have to:

    add code to write/read from long tables
    add code to expand long values in varlena access routines
    add code to heap_insert() to move data to long tables
    add code to heap_delete() to invalidate long tuples


--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

