Thread: Fastest char datatype
I'm storing a lot of words in a database. What's the fastest format for finding them? I'm going to be doing a lot of WHERE w LIKE 'marsh%' and WHERE w IN ('m', 'ma'). All characters are lowercase a-z, no punctuation, no other alphabets. By default I'm using varchar in utf-8 encoding, but was wondering if I could specify something else (perhaps 7bit ascii, perhaps lowercase only) that would speed things up even further.
On Monday 20 July 2009 04:46:53 Robert James wrote:
> I'm storing a lot of words in a database. What's the fastest format for
> finding them? I'm going to be doing a lot of WHERE w LIKE 'marsh%' and
> WHERE w IN ('m', 'ma'). All characters are lowercase a-z, no punctuation,
> no other alphabets. By default I'm using varchar in utf-8 encoding, but
> was wondering if I could specify something else (perhaps 7bit ascii,
> perhaps lowercase only) that would speed things up even further.

If your data is only lowercase a-z, as you say, then the binary
representation will be the same in all server-side encodings, because they
are all supersets of ASCII. These concerns will likely be dominated by the
question of proper indexing and caching anyway.
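To make the indexing point concrete: under any locale other than C, a
plain btree index cannot be used for left-anchored LIKE searches such as
LIKE 'marsh%', but an index declared with the varchar_pattern_ops operator
class can be. A minimal sketch, assuming a hypothetical words table with a
varchar column w:

  create table words (w varchar not null);

  -- With a non-C locale, name the operator class explicitly so the
  -- planner can use the index for left-anchored LIKE:
  create index words_w_pattern_idx on words (w varchar_pattern_ops);

  -- Both query shapes from the original post can then use index scans;
  -- ordinary equality (the IN list) also works against a pattern_ops
  -- index:
  explain select * from words where w like 'marsh%';
  explain select * from words where w in ('m', 'ma');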
Is there a way to use a more compact encoding? I only need 4 bits per char; that would certainly help caching. (My indexes are already tuned very well.)
On Mon, Jul 20, 2009 at 2:02 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On Monday 20 July 2009 04:46:53 Robert James wrote:
> > I'm storing a lot of words in a database. What's the fastest format for
> > finding them? I'm going to be doing a lot of WHERE w LIKE 'marsh%' and
> > WHERE w IN ('m', 'ma'). All characters are lowercase a-z, no punctuation,
> > no other alphabets. By default I'm using varchar in utf-8 encoding, but
> > was wondering if I could specify something else (perhaps 7bit ascii,
> > perhaps lowercase only) that would speed things up even further.
>
> If your data is only lowercase a-z, as you say, then the binary
> representation will be the same in all server-side encodings, because
> they are all supersets of ASCII.
On Sun, Jul 19, 2009 at 9:46 PM, Robert James <srobertjames@gmail.com> wrote:
> I'm storing a lot of words in a database. What's the fastest format for
> finding them? I'm going to be doing a lot of WHERE w LIKE 'marsh%' and
> WHERE w IN ('m', 'ma'). All characters are lowercase a-z, no punctuation,
> no other alphabets. By default I'm using varchar in utf-8 encoding, but
> was wondering if I could specify something else (perhaps 7bit ascii,
> perhaps lowercase only) that would speed things up even further.

All the character types are basically the same, except for char(n), which
pads the string out on disk.

Reading downthread: [a-z] needs more than 4 bits (4 bits can only
represent 16 characters). 5 bits is a very awkward number in computing,
which may explain why this type of encoding is rarely done. Coming from
the COBOL world, where there were all kinds of zany bit-compressed
encodings, I can tell you that the trend is definitely in the other
direction: standard data layouts coupled with well-known algorithms.

Any type of simple bitwise encoding that would get you a space benefit
would mean converting your text fields to bytea. Any place you needed to
deal with the field as text would then require running your data through
a decoder function: you would encode going into the field and decode
coming out...ugh.

Better would be to use a functional index:

create index foo_idx on foo(compactify(myfield));

If you don't need index ordering, you could swap a hash function in for
compactify and have it return type 'int'. This should give the best
possible performance (probably better than the built-in hash index). You
would probably only see a useful benefit if your average string length
was well over 10 characters, though.

In the end, though, I bet you're best off using a vanilla text field and
index unless you expect your table to get really huge. PostgreSQL's btree
implementation is really quite amazing.

merlin
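A minimal sketch of what compactify() could look like, borrowing the name
from the index suggestion above (everything else here is made up, and the
encode/decode caveats still apply). It packs lowercase a-z into 5 bits per
letter as a bit string; a production version would validate its input and
probably fold the result down to bytea:

  -- Assumes input is strictly lowercase a-z, per the original post.
  create or replace function compactify(t text) returns bit varying
  language plpgsql immutable strict as $$
  declare
    result bit varying := '';
    code   int;
  begin
    for i in 1 .. length(t) loop
      code := ascii(substr(t, i, 1)) - ascii('a');  -- 0..25 for 'a'..'z'
      result := result || code::bit(5);             -- append 5 bits
    end loop;
    return result;
  end;
  $$;

  create index foo_idx on foo (compactify(myfield));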
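And a sketch of the hash variant, using the built-in hashtext() function
(the words/w names are again hypothetical). Note this only helps equality
lookups such as the IN list, not LIKE prefix searches:

  create index words_w_hash_idx on words (hashtext(w));

  -- Lookups must repeat the indexed expression, then re-check the raw
  -- value, since distinct strings can hash to the same int:
  select * from words
   where hashtext(w) = hashtext('marsh')
     and w = 'marsh';

  -- Caveat: hashtext's output is not guaranteed stable across major
  -- releases, so such an index needs a rebuild after an upgrade.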