Thread: Getting weird results with unicode table...

Getting weird results with unicode table...

From
"David B"
Date:
Folks,

Two possible bugs.


#1 Loading unicode data.
#2 Doing count of data based on substring(1,1) of a col.

--------------------------
#1 Loading unicode data
--------------------------

I've loaded freedb database (freedb.org) into a 8.0.0 beta 4 database I
created on my PC.
PC is Windows XP Pro. 1.7Ghz 1GB Ram. Pentium 4.


I loaded a bunch of rows via SQL window within pgAdmin III.(v1.2.0 Post-beta
3 OCt25th).


During the load I did get lots of errors
This was DDL script for initially for use with MySQL db.
I made it Postgres ver by removing the DDL for creating the table since it
was MySQL specific and replaced it with the following:

CREATE TABLE album
(
  id serial NOT NULL,
  cddb_id varchar(10) NOT NULL DEFAULT ''::character varying,
  title varchar(255) NOT NULL DEFAULT ''::character varying,
  artist varchar(255) NOT NULL DEFAULT ''::character varying,
  "year" int4 NOT NULL DEFAULT 0,
  num_tracks int2 NOT NULL DEFAULT 0,
  length int2 NOT NULL DEFAULT 0,
  CONSTRAINT album_pkey PRIMARY KEY (id)
)
WITH OIDS;
ALTER TABLE album OWNER TO postgres;
GRANT ALL ON TABLE album TO postgres;
GRANT ALL ON TABLE album TO public;


The INSERT statements look like this:

--
-- Dumping data for table 'album'
--

INSERT INTO album (id, cddb_id, title, artist, year, num_tracks, length)
VALUES
(1,'21037703','æã¨ã«ã¡ã¬ãªã³','ROLL DAYS',1997,3,889);
INSERT INTO album (id, cddb_id, title, artist, year, num_tracks, length)
VALUES
(2,'cd113c10','Best of Miss Butch Blues','Various',1999,16,4414);
INSERT INTO album (id, cddb_id, title, artist, year, num_tracks, length)
VALUES
(3,'a50b380b','Ray Charles in Paris','Ray Charles',2000,11,2874);
INSERT INTO album (id, cddb_id, title, artist, year, num_tracks, length)
VALUES
(4,'21033f03','éå½','ã¢ã³ã³ãã¥',2001,3,833);
INSERT INTO album (id, cddb_id, title, artist, year, num_tracks, length)
VALUES
(5,'cd113d4d','Untitled','Benjamin Gate',2000,77,4415);
INSERT INTO album (id, cddb_id, title, artist, year, num_tracks, length)
VALUES
(9,'c409830e','More Blues','Diversos',2000,14,2437);
INSERT INTO album (id, cddb_id, title, artist, year, num_tracks, length)
VALUES
(10,'21034c03','Everything+Ash','erico',2004,3,846);
INSERT INTO album (id, cddb_id, title, artist, year, num_tracks, length)
VALUES
(11,'21034d03','å¤ã«æ±ããã¦ãA Night in Afro Blueã','ä¹
ä¿
  INSERT INTO album (id, cddb_id, title, artist, year, num_tracks, length)
VALUE
S
(12,'21034e03','éå¹´ã大å¿ãæ±ã','大åå
æµ',1992,3,848);
INSERT INTO album (id, cddb_id, title, artist, year, num_tracks, length)
VALUES
(13,'ae0e0f0e','le voyageur','Calvin Russell',2000,14,3601);
INSERT INTO album (id, cddb_id, title, artist, year, num_tracks, length)
VALUES
(14,'21039204','Potrait_2 é¥','æ¶¼å®®é¥ (æ æã¿ãªã¿)',2003,4,916);



Running this script I get most inserts as sucess but some failures too:

INSERT 1356199 1
INSERT 1356200 1
INSERT 1356201 1
psql:freedb_oct04:1453202: ERROR:  Unicode characters greater than or equal
to 0
x10000 are not supported
INSERT 1356202 1
psql:freedb_oct04:1453219: ERROR:  Unicode characters greater than or equal
to 0
x10000 are not supported
INSERT 1356228 1
INSERT 1356229 1
psql:freedb_oct04:1453232: ERROR:  Unicode characters greater than or equal
to 0
x10000 are not supported
INSERT 1356230 1


Note that the DB is set to UNICODE.

Wonder if it is v8 beta related bug ?





--------------------------
#2 Counting data
--------------------------

I then take samples from that large ALBUM table and spread the records among
6 other tables.

create table album_1 as select * from album where lower(title) >= 'a%' and
lower(title) <= 'e%';
create table album_2 as select * from album where lower(title) >= 'f%' and
lower(title) <= 'k%';
create table album_3 as select * from album where lower(title) >= 'l%' and
lower(title) <= 'p%';
create table album_4 as select * from album where lower(title) >= 'q%' and
lower(title) <= 't%';
create table album_5 as select * from album where lower(title) >= 'u%' and
lower(title) <= 'w%';
create table album_6 as select * from album where lower(title) >= 'x%' or
lower(title) <= 'a%';

select count(*) from album_1; -- 296k
select count(*) from album_2; -- 205k
select count(*) from album_3; -- 211k
select count(*) from album_4; -- 180k
select count(*) from album_5; -- 39k
select count(*) from album_6; -- 78k


I then create a view that merges all 6 tables.

I count the data like this:

select substring ( lower(title), 1, 1), count(*)
from album
where lower(title) > 'a%' and lower(title) <= 'e%'
group by substring ( lower(title), 1, 1) ;

Should get counts for 'a', 'b', 'c', 'd', 'e'.
Don't even care about upper case rows.

I get over 100 rows back with lots of hidden or unprintable characters.
Does not make sense when I limit results to 'a' thru 'e'.

Thanks!!

---
Outgoing mail is certified Virus Free.
Checked by AVG anti-virus system (http://www.grisoft.com).
Version: 6.0.799 / Virus Database: 543 - Release Date: 11/19/2004

Re: Getting weird results with unicode table...

From
Tom Lane
Date:
"David B" <postgresql@thegatelys.com> writes:
> Two possible bugs.

> #1 Loading unicode data.

I'm thinking that the dump file probably doesn't set client_encoding,
or if it does, sets it to the wrong thing.  Is the data in the dump file
actually UTF8, or is it in some other encoding?

> select substring ( lower(title), 1, 1), count(*)
> from album
> where lower(title) > 'a%' and lower(title) <= 'e%'
> group by substring ( lower(title), 1, 1) ;

> Should get counts for 'a', 'b', 'c', 'd', 'e'.

... plus any other characters that the current locale chooses to sort
between 'a' and 'e'.  In particular I'd expect to see accented 'a' and/or
accented 'e' sort that way.

            regards, tom lane