Thread: How are null's stored?

How are null's stored?

From
"Jim C. Nasby"
Date:
I have a 40M row table I need to import data into, then use to create a
bunch of more normalized tables. Right now all fields are varchar, but
I'm going to change this so that fields that are less than a certain
size are just char. Question is, how much impact is there from char
being nullable vs. not nullable? src/include/access/htup.h indicates
that nulls are stored in a bitmap, so I'd suspect that I should see a
decent space savings from not having to include length information all
the time... (most of these small fields are always the same size no
matter what...)
--
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


Re: How are null's stored?

From
Josh Berkus
Date:
Jim,

> I have a 40M row table I need to import data into, then use to create a
> bunch of more normalized tables. Right now all fields are varchar, but
> I'm going to change this so that fields that are less than a certain
> size are just char. Question is, how much impact is there from char
> being nullable vs. not nullable? src/include/access/htup.h indicates
> that nulls are stored in a bitmap, so I'd suspect that I should see a
> decent space savings from not having to include length information all
> the time... (most of these small fields are always the same size no
> matter what...)

This is moot.   PostgreSQL stores CHAR(x), VARCHAR, and TEXT in the same
internal format, which includes length information in the page header.   So
you save no storage space by converting to CHAR(x) ... you might even make
your tables *larger* because of the space padding.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: How are null's stored?

From
"Ryan"
Date:
> Jim,
>
>> I have a 40M row table I need to import data into, then use to create
>> a bunch of more normalized tables. Right now all fields are varchar,
>> but I'm going to change this so that fields that are less than a
>> certain size are just char. Question is, how much impact is there from
>> char being nullable vs. not nullable? src/include/access/htup.h
>> indicates that nulls are stored in a bitmap, so I'd suspect that I
>> should see a decent space savings from not having to include length
>> information all the time... (most of these small fields are always the
>> same size no matter what...)
>
> This is moot.   PostgreSQL stores CHAR(x), VARCHAR, and TEXT in the same
>  internal format, which includes length information in the page header.
>  So  you save no storage space by converting to CHAR(x) ... you might
> even make  your tables *larger* because of the space padding.

So if the internal format is identical, why does the INFERNAL database
ignore indexes when you have a text compared to a varchar?

Ryan


Re: How are null's stored?

From
Andrew Sullivan
Date:
On Mon, May 12, 2003 at 01:58:03PM -0500, Ryan wrote:
> So if the internal format is identical, why does the INFERNAL database
> ignore indexes when you have a text compared to a varchar?

Because the rules for handling the two data types are not the same.
Since spaces are significant on char(n) according to the spec, you
have strange rules in their handling.

Short answer: use text.  Varchar(n) if you must, to limit length.
But char(n) is almost always evil.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110


Re: How are null's stored?

From
Josh Berkus
Date:
Ryan,

> So if the internal format is identical, why does the INFERNAL database
> ignore indexes when you have a text compared to a varchar?

I don't seem to have this problem; I use TEXT or VARCHAR willy-nilly,
including in LIKE 'string%' and UPPER(field) queries, and the indexes work
fine.

I suspect that either you're talking about TEXT to CHAR(x) comparisons, which
are a different ball o' wax, or your query problem is something else.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: How are null's stored?

From
Stephan Szabo
Date:
On Mon, 12 May 2003, Josh Berkus wrote:

> > So if the internal format is identical, why does the INFERNAL database
> > ignore indexes when you have a text compared to a varchar?
>
> I don't seem to have this problem; I use TEXT or VARCHAR willy-nilly,
> including in LIKE 'string%' and UPPER(field) queries, and the indexes work
> fine.

I can get the case he's complaining about with some cases I believe.

With an indexed varchar field, I can get 7.3.1 to give me:
sszabo=# set enable_seqscan=off;
SET
sszabo=# explain select * from aq2 where a=('f' || 'g');
                             QUERY PLAN
---------------------------------------------------------------------
 Seq Scan on aq2  (cost=100000000.00..100000022.50 rows=1 width=168)
   Filter: ((a)::text = 'fg'::text)

but

sszabo=# explain select * from aq2 where a=('f' || 'g')::varchar;
                              QUERY PLAN
----------------------------------------------------------------------
 Index Scan using aq2_pkey on aq2  (cost=0.00..4.82 rows=1 width=168)
   Index Cond: (a = 'fg'::character varying)

or

sszabo=# explain select * from aq2 where a=('f' || 'g'::varchar);
                              QUERY PLAN
----------------------------------------------------------------------
 Index Scan using aq2_pkey on aq2  (cost=0.00..4.82 rows=1 width=168)
   Index Cond: (a = 'fg'::character varying)

All in all, I'm not sure what the semantic differences between a varchar
with no length specified and a text are in PostgreSQL actually and if the
whole thing could be simplified in some way that doesn't break backwards
compatibility.


Re: How are null's stored?

From
Tom Lane
Date:
Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:
> All in all, I'm not sure what the semantic differences between a varchar
> with no length specified and a text are in PostgreSQL actually and if the
> whole thing could be simplified in some way that doesn't break backwards
> compatibility.

Yeah, I've been wondering about that too.  A large part of the problem
is that varchar has its own set of operators, which the planner has no
right to assume behave exactly like the text ones ... but they do.  It
might work to rip out the redundant varchar operators and allow indexes
on varchar to become truly textual indexes (ie, they'd be text_ops not
varchar_ops opclass).  There might be a few tweaks needed to get the
planner to play nice with indexes that require implicit coercions, but
I think it could be made to work.

Another idea that has been rattling around is to stop treating bpchar as
binary-equivalent to text, and in fact to make bpchar-to-text promotion
go through rtrim() to eliminate padding spaces.

I think this stuff got put on hold because we haven't been able to come
up with a good solution for the comparable problems in the numeric
datatype hierarchy.  But bpchar/varchar/text is a lot simpler problem,
and maybe could be solved with the tools we have in place already.

            regards, tom lane


Re: How are null's stored?

From
Stephan Szabo
Date:
On Mon, 12 May 2003, Tom Lane wrote:

> Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:
> > All in all, I'm not sure what the semantic differences between a varchar
> > with no length specified and a text are in PostgreSQL actually and if the
> > whole thing could be simplified in some way that doesn't break backwards
> > compatibility.
>
> Yeah, I've been wondering about that too.  A large part of the problem
> is that varchar has its own set of operators, which the planner has no
> right to assume behave exactly like the text ones ... but they do.  It
> might work to rip out the redundant varchar operators and allow indexes
> on varchar to become truly textual indexes (ie, they'd be text_ops not
> varchar_ops opclass).  There might be a few tweaks needed to get the
> planner to play nice with indexes that require implicit coercions, but
> I think it could be made to work.

This seems to possibly work on 7.4.  I took my system and removed the
varchar comparison operators and directly made a text_ops index on a
varchar(30).
That gave me indexscans for
 col = 'a'
 col = 'a'::varchar
 col = 'a'::text
 col = 'a' || 'b'

but I don't know if it has other bad effects yet.


> Another idea that has been rattling around is to stop treating bpchar as
> binary-equivalent to text, and in fact to make bpchar-to-text promotion
> go through rtrim() to eliminate padding spaces.

I guess this depends on how we read the comparisons/conversions from PAD
SPACE to NO PAD are supposed to work, but I think this would be good and
make things easier for alot of people since most people don't expect it,
especially when using functions like upper and lower that return text.


Re: How are null's stored?

From
Tom Lane
Date:
Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:
> On Mon, 12 May 2003, Tom Lane wrote:
>> It might work to rip out the redundant varchar operators and allow indexes
>> on varchar to become truly textual indexes (ie, they'd be text_ops not
>> varchar_ops opclass).

> This seems to possibly work on 7.4.  I took my system and removed the
> varchar comparison operators and directly made a text_ops index on a
> varchar(30).

Yeah, I fooled with it a little bit last night too.  It seems that we'd
need to still have a varchar_ops entry in pg_opclass (else you get
complaints about unable to select a default opclass, not to mention that
old pg_dump files specifying varchar_ops would fail to load).  But this
entry could point to the textual comparison operators.  AFAICT the
planner doesn't have any problem dealing with the implicit coercions
that it's faced with in such cases.

>> Another idea that has been rattling around is to stop treating bpchar as
>> binary-equivalent to text, and in fact to make bpchar-to-text promotion
>> go through rtrim() to eliminate padding spaces.

> I guess this depends on how we read the comparisons/conversions from PAD
> SPACE to NO PAD are supposed to work, but I think this would be good and
> make things easier for alot of people since most people don't expect it,
> especially when using functions like upper and lower that return text.

I tried that too, and it seemed to work as expected.  Whether it's
arguably more spec-compliant than our current behavior I dunno; haven't
looked at that part of the spec closely...

            regards, tom lane


Re: How are null's stored? -- Some numbers

From
"Jim C. Nasby"
Date:
I did two experiments. First, as someone mentioned, changing between
char and varchar made absolutely no difference size-wise. In some other
RDBMSes, performance wise char might still win out because the database
wouldn't have to do the math to figure out where in the tuple the fields
are. I know it's splitting hairs, but on what will be a 40M row table...

Second, I modified the table (see below; all fields were originally
nullable):

Before:
usps=# vacuum full analyze verbose zip4_detail;
INFO:  --Relation public.zip4_detail--
INFO:  Pages 12728: Changed 0, reaped 1, Empty 0, New 0; Tup 467140: Vac
0, Keep/VTL 0/0, UnUsed 19, MinLen 154, MaxLen 302; Re-using:
Free/Avail. Space 1009820/264028; EndEmpty/Avail. Pages 0/1521.
        CPU 0.65s/0.86u sec elapsed 1.51 sec.
INFO:  Rel zip4_detail: Pages: 12728 --> 12728; Tuple(s) moved: 0.
        CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  Analyzing public.zip4_detail


After:
INFO:  --Relation public.zip4_detail--
INFO:  Pages 13102: Changed 0, reaped 6961, Empty 0, New 0; Tup 467140:
Vac 0, Keep/VTL 0/0, UnUsed 31795, MinLen 166, MaxLen 306; Re-using:
Free/Avail. Space 1136364/190188; EndEmpty/Avail. Pages 0/1056.
        CPU 0.41s/0.79u sec elapsed 1.20 sec.
INFO:  Rel zip4_detail: Pages: 13102 --> 13102; Tuple(s) moved: 0.
        CPU 0.59s/10.02u sec elapsed 18.17 sec.
INFO:  Analyzing public.zip4_detail

As you can see, space useage actually went up, by 2.9% (pages). In other
words, it appears to be more efficient to store a null than to store an
empty string in a varchar.

usps=# select count(*) from zip4_detail where street_pre_drctn_abbrev=''
and street_suffix_abbrev='' and street_post_drctn_abbrev='';
-------
  9599

usps=# select count(*) from zip4_detail where street_pre_drctn_abbrev=''
or street_suffix_abbrev='';
--------
 128434

(all rows have at least one of the 3 fields empty)

Hope someone finds this info useful... :)

                  Table "public.zip4_detail"
          Column           |         Type          | Modifiers
---------------------------+-----------------------+-----------
 zip_code                  | character varying(5)  |
 update_key_no             | character varying(10) |
 action_code               | character varying(1)  |
 record_type_code          | character varying(1)  |
 carrier_route_id          | character varying(4)  |
 street_pre_drctn_abbrev   | character varying(2)  | not null
 street_name               | character varying(28) |
 street_suffix_abbrev      | character varying(4)  | not null
 street_post_drctn_abbrev  | character varying(2)  | not null
 addr_primary_low_no       | character varying(10) |
 addr_primary_high_no      | character varying(10) |
 addr_prmry_odd_even_code  | character varying(1)  |
 building_or_firm_name     | character varying(40) |
 addr_secondary_abbrev     | character varying(4)  |
 addr_secondary_low_no     | character varying(8)  |
 addr_secondary_high_no    | character varying(8)  |
 addr_secny_odd_even_code  | character varying(1)  |
 zip_add_on_low_no         | character varying(4)  |
 zip_add_on_high_no        | character varying(4)  |
 base_alt_code             | character varying(1)  |
 lacs_status_ind           | character varying(1)  |
 govt_bldg_ind             | character varying(1)  |
 finance_no                | character varying(6)  |
 state_abbrev              | character varying(2)  |
 county_no                 | character varying(3)  |
 congressional_dist_no     | character varying(2)  |
 muncipality_ctyst_key     | character varying(6)  |
 urbanization_ctyst_key    | character varying(6)  |
 prefd_last_line_ctyst_key | character varying(6)  |

--
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"