Thread: 1 char in the world

1 char in the world

From

Matt Mello

Date:

29 January 2003, 03:25:09

TEXT vs "char" ... vs BOOLEAN

I am porting from Informix to PG.  In doing so, I had to pick some data
types for fields, and began wondering about the performance of char/text
fields with one character.  For example, I have a field which has one of
the following values/states: {'A', 'D', 'F', 'U'}.  Since CHAR(n),
VARCHAR, and TEXT are all supposed to have the same performance
according to the docs, it seems that they will all perform the same.
For this reason, I did not squabble over which one of these to use.
However, since "char" is implemented differently, I thought I would
compare it to one of the others.  I chose to pit TEXT against "char".

Test query = explain analyze select count(*) from table where onechar='D';
Table size = 512 wide [mostly TEXT] * 400000 rows
Performance averages:
  "char"  44ms
  TEXT    63ms

This seems somewhat reasonable, and makes me want to use "char" for my
single-char field.  Does everyone else find this to be reasonable?  Is
this pretty much the behavior I can expect on extraordinarily large
tables, too?  And, should I worry about things like the backend
developers removing "char" as a type later?

--

This naturally led me to another question.  How do TEXT, "char", and
BOOLEAN compare for storing t/f values.  The test results I saw were
surprising.

Test query=
  "char"/TEXT: explain analyze select count(*) from table where bool='Y';
  boolean:    explain analyze select count(*) from table where bool=true;
Table size (see above)
Performance averages:
  TEXT    24ms
  BOOLEAN 28ms
"char"   17ms

Why does boolean rate closer to TEXT than "char"?  I would think that
BOOLEANs would actually be stored like "char"s to prevent using the
extra 4 bytes with TEXT types.

Based on these results, I will probably store my booleans as "char"
instead of boolean.  I don't use stored procedures with my application
server, so I should never need my booleans to be the BOOLEAN type.  I
can convert faster in my own code.

--

NOTE: the above tests all had the same relative data in the different
fields (what was in TEXT could be found in "char", etc.) and were all
indexed equally.


Thanks!

--
Matt Mello

Re: 1 char in the world

From

Justin Clift

Date:

29 January 2003, 05:50:40

Matt Mello wrote:
<snip>
> This naturally led me to another question.  How do TEXT, "char", and
> BOOLEAN compare for storing t/f values.  The test results I saw were
> surprising.
>
> Test query=
>  "char"/TEXT: explain analyze select count(*) from table where bool='Y';
>  boolean:    explain analyze select count(*) from table where bool=true;
> Table size (see above)
> Performance averages:
>  TEXT    24ms
>  BOOLEAN 28ms
> "char"   17ms

Hi Matt,

This is interesting.  As a thought, would you be ok to run the same test
using int4 and int8 as well?

That would probably round out the test nicely.

:)

Regards and best wishes,

Justin Clift

--
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."
- Indira Gandhi

Re: 1 char in the world

From

Hannu Krosing

Date:

29 January 2003, 05:53:59

On Wed, 2003-01-29 at 06:22, Matt Mello wrote:
> TEXT vs "char" ... vs BOOLEAN
>
> I am porting from Informix to PG.  In doing so, I had to pick some data
> types for fields, and began wondering about the performance of char/text
> fields with one character.  For example, I have a field which has one of
> the following values/states: {'A', 'D', 'F', 'U'}.  Since CHAR(n),
> VARCHAR, and TEXT are all supposed to have the same performance
> according to the docs, it seems that they will all perform the same.
> For this reason, I did not squabble over which one of these to use.
> However, since "char" is implemented differently, I thought I would
> compare it to one of the others.  I chose to pit TEXT against "char".
>
> Test query = explain analyze select count(*) from table where onechar='D';
> Table size = 512 wide [mostly TEXT] * 400000 rows
> Performance averages:
>   "char"  44ms
>   TEXT    63ms
>
> This seems somewhat reasonable, and makes me want to use "char" for my
> single-char field.  Does everyone else find this to be reasonable?  Is
> this pretty much the behavior I can expect on extraordinarily large
> tables, too?

The actual compares will likely stay faster for char than for text.

OTOH the actual storage of one-char datatype should not play so
significant role for very large tables, even if this is the only field
in that table, as most of the overhead will be in other places - storage
overhead in page/tuple headers, performance in retrieving the
pages/tuples and cache lookups, etc.

Also, for very big tables you will most likely want to restrict selects
on other criteria than a 4-valued field, so that indexes could be used
in retrieving data.

>  And, should I worry about things like the backend
> developers removing "char" as a type later?
>
> --
>
> This naturally led me to another question.  How do TEXT, "char", and
> BOOLEAN compare for storing t/f values.  The test results I saw were
> surprising.
>
> Test query=
>   "char"/TEXT: explain analyze select count(*) from table where bool='Y';

You could also try just

select count(*) from table where bool;

>   boolean:    explain analyze select count(*) from table where bool=true;
> Table size (see above)
> Performance averages:
>   TEXT    24ms
>   BOOLEAN 28ms
> "char"   17ms
>
> Why does boolean rate closer to TEXT than "char"?  I would think that
> BOOLEANs would actually be stored like "char"s to prevent using the
> extra 4 bytes with TEXT types.
>
> Based on these results, I will probably store my booleans as "char"
> instead of boolean.  I don't use stored procedures with my application
> server, so I should never need my booleans to be the BOOLEAN type.  I
> can convert faster in my own code.
>
> --
>
> NOTE: the above tests all had the same relative data in the different
> fields (what was in TEXT could be found in "char", etc.) and were all
> indexed equally.

Did you repeat the texts enough times to be sure that you get reliable
results ?

>
> Thanks!
--
Hannu Krosing <hannu@tm.ee>

Re: 1 char in the world

From

Tom Lane

Date:

29 January 2003, 10:58:26

Matt Mello <alien@spaceship.com> writes:
> Test query=
>   "char"/TEXT: explain analyze select count(*) from table where bool='Y';
>   boolean:    explain analyze select count(*) from table where bool=true;
> Table size (see above)
> Performance averages:
>   TEXT    24ms
>   BOOLEAN 28ms
> "char"   17ms

I don't believe those numbers for a moment.  All else being equal,
comparing a "char" field to a literal should be exactly the same speed
as comparing a bool field to a literal (and if you'd just said "where bool",
the bool field would be faster).  Both ought to be markedly faster than
text.

Look for errors in your test procedure.  One thing I'd particularly
wonder about is whether the query plans are the same.  In the absence of
any VACUUM ANALYZE data, I'd fully expect the planner to pick a
different plan for a bool field than text/char --- because even without
ANALYZE data, it knows that a bool column has only two possible values.

            regards, tom lane

Re: 1 char in the world

From

Matt Mello

Date:

29 January 2003, 18:29:59

> OTOH the actual storage of one-char datatype should not play so
> significant role for very large tables, even if this is the only field
> in that table, as most of the overhead will be in other places - storage
> overhead in page/tuple headers, performance in retrieving the
> pages/tuples and cache lookups, etc.

Is that true if I have a table that consists of lots of 1-char fields?
For example, if I have a table with 4 billion records, which consist of
(20) 1-char fields each, then the storage for the data will be something
like 5 times as large if I use TEXT than if I use "char".

> Also, for very big tables you will most likely want to restrict selects
> on other criteria than a 4-valued field, so that indexes could be used
> in retrieving data.

I do.  I was just using that query for this test only.  I have some very
complex queries that are constrained by many foriegn-key int4 fields,
but also a few of these 1-char fields.

> You could also try just
>
> select count(*) from table where bool;
>

I will do this in a while and report to the list.  I am going to try
make a reproducable test that anyone can do, to be sure my results are
"real".

> Did you repeat the texts enough times to be sure that you get reliable
> results ?

I think so.  Not so much as hundreds of times, though.


--
Matt Mello
512-350-6900

Re: 1 char in the world

From

Tom Lane

Date:

29 January 2003, 18:59:26

Matt Mello <alien@spaceship.com> writes:
> Is that true if I have a table that consists of lots of 1-char fields?
> For example, if I have a table with 4 billion records, which consist of
> (20) 1-char fields each, then the storage for the data will be something
> like 5 times as large if I use TEXT than if I use "char".

Probably more like 8 times as large, when you allow for alignment
padding --- on most machines, TEXT fields will be aligned on 4-byte
boundaries, so several TEXT fields in a row will take up 8 bytes apiece,
vs one byte apiece for consecutive "char" or bool fields.

            regards, tom lane

Re: 1 char in the world

From

alien

Date:

02 February 2003, 13:09:32

Tom Lane wrote:
> I don't believe those numbers for a moment.  All else being equal,
> comparing a "char" field to a literal should be exactly the same speed
> as comparing a bool field to a literal (and if you'd just said "where bool",
> the bool field would be faster).  Both ought to be markedly faster than
> text.
>
> Look for errors in your test procedure.  One thing I'd particularly
> wonder about is whether the query plans are the same.  In the absence of
> any VACUUM ANALYZE data, I'd fully expect the planner to pick a
> different plan for a bool field than text/char --- because even without
> ANALYZE data, it knows that a bool column has only two possible values.

Well, the previous test was done on REAL data.  Everything was indexed
and vacuum analyzed as it should be.

However, I generated some test data under "controlled" circumstances and
did get different results.  Bear in mind, though, that the data is no
longer "real", and doesn't represent the system I am concerned about.

[Someone requested some tests with int4/int8, too, so I included them,
as well.  However, I would never use 4 or 8 bytes to store one bit.
Since a byte is platform-atomic, however, I will use a whole byte for a
single bit, as bit packing is too expensive.]

create table booltest (
   boo boolean,
   cha "char",
   txt text,
   in4 int4,
   in8 int8
);

Insert lots of data here, but stay consistent between fields.  [If you
insert a TRUE into a boolean, put a 'Y' into a text or "char" field and
a 1 into an int type.]  So, I basically had 2 different insert
statements (one for true and one for false), and I used a random number
generator to get a good distribution of them.

create index booidx on booltest(boo);
create index chaidx on booltest(cha);
create index txtidx on booltest(txt);
create index in4idx on booltest(in4);
create index in8idx on booltest(in8);

vacuum full verbose analyze booltest;
INFO:  --Relation public.booltest--
INFO:  Pages 6897: Changed 0, reaped 0, Empty 0, New 0; Tup 1000000: Vac
0, Keep/VTL 0/0, UnUsed 0, MinLen 52, MaxLen 52; Re-using: Free/Avail.
Space 362284/362284; EndEmpty/Avail. Pages 0/6897.
         CPU 0.53s/0.41u sec elapsed 18.69 sec.
INFO:  Index booidx: Pages 2193; Tuples 1000000.
         CPU 0.24s/0.11u sec elapsed 3.33 sec.
INFO:  Index chaidx: Pages 2193; Tuples 1000000.
         CPU 0.23s/0.19u sec elapsed 4.01 sec.
INFO:  Index txtidx: Pages 2745; Tuples 1000000.
         CPU 0.51s/0.14u sec elapsed 4.07 sec.
INFO:  Index in4idx: Pages 2193; Tuples 1000000.
         CPU 0.20s/0.17u sec elapsed 3.51 sec.
INFO:  Index in8idx: Pages 2745; Tuples 1000000.
         CPU 0.26s/0.04u sec elapsed 1.92 sec.
INFO:  Rel booltest: Pages: 6897 --> 6897; Tuple(s) moved: 0.
         CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  --Relation pg_toast.pg_toast_4327226--
INFO:  Pages 0: Changed 0, reaped 0, Empty 0, New 0; Tup 0: Vac 0,
Keep/VTL 0/0, UnUsed 0, MinLen 0, MaxLen 0; Re-using: Free/Avail. Space
0/0; EndEmpty/Avail. Pages 0/0.
         CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  Index pg_toast_4327226_index: Pages 1; Tuples 0.
         CPU 0.00s/0.00u sec elapsed 0.02 sec.
INFO:  Analyzing public.booltest
VACUUM

Count our test set:
select count(*) from booltest; [ALL] output:1000000
select count(*) from booltest where boo; [TRUES] output:498649

TESTS .....

1) INT8=1
explain analyze select count(*) from booltest where in8 = '1';
Aggregate  (cost=1342272.26..1342272.26 rows=1 width=0) (actual
time=3434.37..3434.37 rows=1 loops=1)
->  Index Scan using in8idx on booltest  (cost=0.00..1340996.42
rows=510333 width=0) (actual time=6.96..2704.45 rows=498649 loops=1)
Index Cond: (in8 = 1::bigint)
Total runtime: 3434.50 msec

2) INT4=1
explain analyze select count(*) from booltest where in4 = 1;
Aggregate  (cost=1341990.26..1341990.26 rows=1 width=0) (actual
time=3219.24..3219.24 rows=1 loops=1)
->  Index Scan using in4idx on booltest  (cost=0.00..1340714.42
rows=510333 width=0) (actual time=12.92..2548.20 rows=498649 loops=1)
Index Cond: (in4 = 1)
Total runtime: 3219.35 msec

3) TEXT='Y'
explain analyze select count(*) from booltest where txt = 'Y';
Aggregate  (cost=1342272.26..1342272.26 rows=1 width=0) (actual
time=4820.06..4820.06 rows=1 loops=1)
->  Index Scan using txtidx on booltest  (cost=0.00..1340996.42
rows=510333 width=0) (actual time=15.83..4042.07 rows=498649 loops=1)
Index Cond: (txt = 'Y'::text)
Total runtime: 4820.18 msec

4) BOOLEAN=true
explain analyze select count(*) from booltest where boo = true;
Aggregate  (cost=1341990.26..1341990.26 rows=1 width=0) (actual
time=3437.30..3437.30 rows=1 loops=1)
->  Index Scan using booidx on booltest  (cost=0.00..1340714.42
rows=510333 width=0) (actual time=28.16..2751.38 rows=498649 loops=1)
Index Cond: (boo = true)
Total runtime: 3437.42 msec

5) BOOLEAN [implied = true]
explain analyze select count(*) from booltest where boo;
Aggregate  (cost=100018172.83..100018172.83 rows=1 width=0) (actual
time=2775.40..2775.40 rows=1 loops=1)
->  Seq Scan on booltest  (cost=100000000.00..100016897.00 rows=510333
width=0) (actual time=0.10..2138.11 rows=498649 loops=1)
Filter: boo
Total runtime: 2775.50 msec

6) "char"='Y'
explain analyze select count(*) from booltest where cha = 'Y';
Aggregate  (cost=1341990.26..1341990.26 rows=1 width=0) (actual
time=3379.71..3379.71 rows=1 loops=1)
->  Index Scan using chaidx on booltest  (cost=0.00..1340714.42
rows=510333 width=0) (actual time=32.77..2695.77 rows=498649 loops=1)
Index Cond: (cha = 'Y'::"char")
Total runtime: 3379.82 msec

Average ms over 42 attempts per test, some one-after-the-other, others
mixed with other queries, was:

1) INT8=1          3229.76
2) INT4=1          3194.45
3) TEXT='Y'        4799.23
4) BOOLEAN=true    3283.30
5) BOOLEAN         2801.83
6) "char"='Y'      3290.15

The straight boolean test was the fastest at 2.8 secs, and the TEXT was
the slowest at 4.8 secs.  Everything else settled in the same pot at
3.25 secs.

I wasn't too impressed with any of these times, actually, but I'm
bearing in mind that we are talking about an aggregate, which I have
learned much about in the last few days from the mailing list, and which
I expect to be slow in PG.

Since straight-BOOLEAN [not adding "= true" to the condition] is about
15% faster than "char", I will stick with BOOLEAN.

My immediate future plans also include recoding my system to never use
aggregates in "live" queries.  I will be tracking statistics in real
time in statistics tables instead of hitting the database with an
aggregate.  It is much cheaper to add a few milliseconds per insert than
to slow my whole system down for several seconds during an aggregate query.

In fact, if you have a sizable table, and especially if you are running
an OLTP server, unless you are manually investigating something in the
table, I recommend never using aggregates.  Your SQL queries should be
aggregate-free for large tables, if possible.

Thanks!

--
Matt Mello