Getting weird results with unicode table... - Mailing list pgsql-bugs

From David B
Subject Getting weird results with unicode table...
Date
Msg-id GOEGKICPFOPNLIEIHGFJAEJHDDAA.postgresql@thegatelys.com
Whole thread Raw
Responses Re: Getting weird results with unicode table...
List pgsql-bugs
Folks,

Two possible bugs.


#1 Loading unicode data.
#2 Doing count of data based on substring(1,1) of a col.

--------------------------
#1 Loading unicode data
--------------------------

I've loaded freedb database (freedb.org) into a 8.0.0 beta 4 database I
created on my PC.
PC is Windows XP Pro. 1.7Ghz 1GB Ram. Pentium 4.


I loaded a bunch of rows via SQL window within pgAdmin III.(v1.2.0 Post-beta
3 OCt25th).


During the load I did get lots of errors
This was DDL script for initially for use with MySQL db.
I made it Postgres ver by removing the DDL for creating the table since it
was MySQL specific and replaced it with the following:

CREATE TABLE album
(
  id serial NOT NULL,
  cddb_id varchar(10) NOT NULL DEFAULT ''::character varying,
  title varchar(255) NOT NULL DEFAULT ''::character varying,
  artist varchar(255) NOT NULL DEFAULT ''::character varying,
  "year" int4 NOT NULL DEFAULT 0,
  num_tracks int2 NOT NULL DEFAULT 0,
  length int2 NOT NULL DEFAULT 0,
  CONSTRAINT album_pkey PRIMARY KEY (id)
)
WITH OIDS;
ALTER TABLE album OWNER TO postgres;
GRANT ALL ON TABLE album TO postgres;
GRANT ALL ON TABLE album TO public;


The INSERT statements look like this:

--
-- Dumping data for table 'album'
--

INSERT INTO album (id, cddb_id, title, artist, year, num_tracks, length)
VALUES
(1,'21037703','æã¨ã«ã¡ã¬ãªã³','ROLL DAYS',1997,3,889);
INSERT INTO album (id, cddb_id, title, artist, year, num_tracks, length)
VALUES
(2,'cd113c10','Best of Miss Butch Blues','Various',1999,16,4414);
INSERT INTO album (id, cddb_id, title, artist, year, num_tracks, length)
VALUES
(3,'a50b380b','Ray Charles in Paris','Ray Charles',2000,11,2874);
INSERT INTO album (id, cddb_id, title, artist, year, num_tracks, length)
VALUES
(4,'21033f03','éå½','ã¢ã³ã³ãã¥',2001,3,833);
INSERT INTO album (id, cddb_id, title, artist, year, num_tracks, length)
VALUES
(5,'cd113d4d','Untitled','Benjamin Gate',2000,77,4415);
INSERT INTO album (id, cddb_id, title, artist, year, num_tracks, length)
VALUES
(9,'c409830e','More Blues','Diversos',2000,14,2437);
INSERT INTO album (id, cddb_id, title, artist, year, num_tracks, length)
VALUES
(10,'21034c03','Everything+Ash','erico',2004,3,846);
INSERT INTO album (id, cddb_id, title, artist, year, num_tracks, length)
VALUES
(11,'21034d03','å¤ã«æ±ããã¦ãA Night in Afro Blueã','ä¹
ä¿
  INSERT INTO album (id, cddb_id, title, artist, year, num_tracks, length)
VALUE
S
(12,'21034e03','éå¹´ã大å¿ãæ±ã','大åå
æµ',1992,3,848);
INSERT INTO album (id, cddb_id, title, artist, year, num_tracks, length)
VALUES
(13,'ae0e0f0e','le voyageur','Calvin Russell',2000,14,3601);
INSERT INTO album (id, cddb_id, title, artist, year, num_tracks, length)
VALUES
(14,'21039204','Potrait_2 é¥','æ¶¼å®®é¥ (æ æã¿ãªã¿)',2003,4,916);



Running this script I get most inserts as sucess but some failures too:

INSERT 1356199 1
INSERT 1356200 1
INSERT 1356201 1
psql:freedb_oct04:1453202: ERROR:  Unicode characters greater than or equal
to 0
x10000 are not supported
INSERT 1356202 1
psql:freedb_oct04:1453219: ERROR:  Unicode characters greater than or equal
to 0
x10000 are not supported
INSERT 1356228 1
INSERT 1356229 1
psql:freedb_oct04:1453232: ERROR:  Unicode characters greater than or equal
to 0
x10000 are not supported
INSERT 1356230 1


Note that the DB is set to UNICODE.

Wonder if it is v8 beta related bug ?





--------------------------
#2 Counting data
--------------------------

I then take samples from that large ALBUM table and spread the records among
6 other tables.

create table album_1 as select * from album where lower(title) >= 'a%' and
lower(title) <= 'e%';
create table album_2 as select * from album where lower(title) >= 'f%' and
lower(title) <= 'k%';
create table album_3 as select * from album where lower(title) >= 'l%' and
lower(title) <= 'p%';
create table album_4 as select * from album where lower(title) >= 'q%' and
lower(title) <= 't%';
create table album_5 as select * from album where lower(title) >= 'u%' and
lower(title) <= 'w%';
create table album_6 as select * from album where lower(title) >= 'x%' or
lower(title) <= 'a%';

select count(*) from album_1; -- 296k
select count(*) from album_2; -- 205k
select count(*) from album_3; -- 211k
select count(*) from album_4; -- 180k
select count(*) from album_5; -- 39k
select count(*) from album_6; -- 78k


I then create a view that merges all 6 tables.

I count the data like this:

select substring ( lower(title), 1, 1), count(*)
from album
where lower(title) > 'a%' and lower(title) <= 'e%'
group by substring ( lower(title), 1, 1) ;

Should get counts for 'a', 'b', 'c', 'd', 'e'.
Don't even care about upper case rows.

I get over 100 rows back with lots of hidden or unprintable characters.
Does not make sense when I limit results to 'a' thru 'e'.

Thanks!!

---
Outgoing mail is certified Virus Free.
Checked by AVG anti-virus system (http://www.grisoft.com).
Version: 6.0.799 / Virus Database: 543 - Release Date: 11/19/2004

pgsql-bugs by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: Psql history
Next
From: Karel Zak
Date:
Subject: Re: select to_number('1,000', '999,999');