Thread: Invalid unicode in COPY problem
Hi all, I've been chasing down a bug and from what I have learned it may be because of how postgreSQL (8.0.2 on Fedora Core 4 test 2) handles invalid unicode. I've been given some ideas on how to try to catch invalid unicode but it seems expensive so I am hoping there is a postgresql way to deal with this problem. I've run into a problem where a bulk postgres "COPY..." statement is dieing because one of the lines contains a file name with an invalid unicode character. In nautilus this file has '(invalid encoding)' and the postgres error is 'CONTEXT: COPY file_info_3, line 228287, column file_name: "Femme Fatal\uffff.url"'. To actually look at the file from the shell (bash) shows what appears to be a whitespace but when I copy/paste the file name I get the '\uffff' you see above. I could, with the help of the TLUG people, use regex to match for an invalid character and skip the file but that is not ideal. The reason is that this is for my backup program and invalid unicode or not, the contents of the file may still be important and I would prefer to have it in the database so that it is later copied. I can copy and move the file in the shell so the file isn't apparently in an of itself corrupt. So then, is there a way I can tell postresql to accept the invalid unicode name? Here is a copy of my schema: tle-bu=> \d file_info_2 Table "public.file_info_2" Column | Type | Modifiers ----------------------+----------------------+----------------------------------------- file_group_name | text | file_group_uid | bigint | not null file_mod_time | bigint | not null file_name | text | not null file_parent_dir | text | not null file_perm | text | not null file_size | bigint | not null file_type | character varying(2) | not null default 'f'::character varying file_user_name | text | file_user_uid | bigint | not null file_backup | boolean | not null default true file_display | boolean | not null default false file_restore_display | boolean | not null default false file_restore | boolean | not null default false Indexes: "file_info_2_display_idx" btree (file_type, file_parent_dir, file_name) 'file_name' and 'file_parent_dir' are the columns that could have entries with the invalid unicode characters. Maybe I could/should use something other than 'text'? These columns could contain anything that a file or directory name could be. Thanks! Madison -- -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- Madison Kelly (Digimer) TLE-BU, The Linux Experience; Back Up http://tle-bu.thelinuxexperience.com -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
We have developed patches which relaxes the character validation so that PostgreSQL accepts invalid characters. It works like this: 1) new postgresql.conf item "mbstr_check" added. 2) if mbstr_check = 0 then invalid characters are not accepted (same as current PostgreSQL behavior). This is the default. 3) if mbstr_check = 1 then invalid characters are accepted with WARNING 4) if mbstr_check = 2 then invalid characters are accepted without any warnings 5) We have checked PostgreSQL source code if accepting invalid characters makes some troubles. We have found that we need to fix a place and the fix is included in the patches. Madison, If you are interested in the patches, I could send it to you. Hackers, Do you think the functionality something like above is worth to add to PostgreSQL? -- Tatsuo Ishii > Hi all, > > I've been chasing down a bug and from what I have learned it may be > because of how postgreSQL (8.0.2 on Fedora Core 4 test 2) handles > invalid unicode. I've been given some ideas on how to try to catch > invalid unicode but it seems expensive so I am hoping there is a > postgresql way to deal with this problem. > > I've run into a problem where a bulk postgres "COPY..." statement is > dieing because one of the lines contains a file name with an invalid > unicode character. In nautilus this file has '(invalid encoding)' and > the postgres error is 'CONTEXT: COPY file_info_3, line 228287, column > file_name: "Femme Fatal\uffff.url"'. > > To actually look at the file from the shell (bash) shows what appears > to be a whitespace but when I copy/paste the file name I get the > '\uffff' you see above. > > I could, with the help of the TLUG people, use regex to match for an > invalid character and skip the file but that is not ideal. The reason is > that this is for my backup program and invalid unicode or not, the > contents of the file may still be important and I would prefer to have > it in the database so that it is later copied. I can copy and move the > file in the shell so the file isn't apparently in an of itself corrupt. > > So then, is there a way I can tell postresql to accept the invalid > unicode name? Here is a copy of my schema: > > tle-bu=> \d file_info_2 > Table "public.file_info_2" > Column | Type | Modifiers > ----------------------+----------------------+----------------------------------------- > file_group_name | text | > file_group_uid | bigint | not null > file_mod_time | bigint | not null > file_name | text | not null > file_parent_dir | text | not null > file_perm | text | not null > file_size | bigint | not null > file_type | character varying(2) | not null default > 'f'::character varying > file_user_name | text | > file_user_uid | bigint | not null > file_backup | boolean | not null default true > file_display | boolean | not null default false > file_restore_display | boolean | not null default false > file_restore | boolean | not null default false > Indexes: > "file_info_2_display_idx" btree (file_type, file_parent_dir, file_name) > > 'file_name' and 'file_parent_dir' are the columns that could have > entries with the invalid unicode characters. Maybe I could/should use > something other than 'text'? These columns could contain anything that a > file or directory name could be. > > Thanks! > > Madison > > -- > -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- > Madison Kelly (Digimer) > TLE-BU, The Linux Experience; Back Up > http://tle-bu.thelinuxexperience.com > -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- > > ---------------------------(end of broadcast)--------------------------- > TIP 7: don't forget to increase your free space map settings >
Thank you, I would! What versions have you tested the patch against? I am sorry but I am not too familiar with applying patches against the main program, is there documentation on how to apply the patch? Is there a way to roll the patch back/remove it? Would I be able to script the installation of the patch (I would expect so). The reason for the last question is that I expect (hope) many people will use it and I want to make it as easy as possible for a user to simply select or unselect the patch if it works well. If I can script the install and removal of this patch then I can do just this and that would be wonderful. Thank you again! どうも ありがとう ございます! (I hope that is right, my Japanese is still elementary. :) ) Madison Tatsuo Ishii wrote: > We have developed patches which relaxes the character validation so > that PostgreSQL accepts invalid characters. It works like this: > > 1) new postgresql.conf item "mbstr_check" added. > 2) if mbstr_check = 0 then invalid characters are not accepted > (same as current PostgreSQL behavior). This is the default. > 3) if mbstr_check = 1 then invalid characters are accepted with > WARNING > 4) if mbstr_check = 2 then invalid characters are accepted without any > warnings > 5) We have checked PostgreSQL source code if accepting invalid > characters makes some troubles. We have found that we need to fix a > place and the fix is included in the patches. > > Madison, > If you are interested in the patches, I could send it to you. > > Hackers, > Do you think the functionality something like above is worth to add to > PostgreSQL? > -- > Tatsuo Ishii > > >>Hi all, >> >> I've been chasing down a bug and from what I have learned it may be >>because of how postgreSQL (8.0.2 on Fedora Core 4 test 2) handles >>invalid unicode. I've been given some ideas on how to try to catch >>invalid unicode but it seems expensive so I am hoping there is a >>postgresql way to deal with this problem. >> >> I've run into a problem where a bulk postgres "COPY..." statement is >>dieing because one of the lines contains a file name with an invalid >>unicode character. In nautilus this file has '(invalid encoding)' and >>the postgres error is 'CONTEXT: COPY file_info_3, line 228287, column >>file_name: "Femme Fatal\uffff.url"'. >> >> To actually look at the file from the shell (bash) shows what appears >>to be a whitespace but when I copy/paste the file name I get the >>'\uffff' you see above. >> >> I could, with the help of the TLUG people, use regex to match for an >>invalid character and skip the file but that is not ideal. The reason is >>that this is for my backup program and invalid unicode or not, the >>contents of the file may still be important and I would prefer to have >>it in the database so that it is later copied. I can copy and move the >>file in the shell so the file isn't apparently in an of itself corrupt. >> >> So then, is there a way I can tell postresql to accept the invalid >>unicode name? Here is a copy of my schema: >> >>tle-bu=> \d file_info_2 >> Table "public.file_info_2" >> Column | Type | Modifiers >>----------------------+----------------------+----------------------------------------- >> file_group_name | text | >> file_group_uid | bigint | not null >> file_mod_time | bigint | not null >> file_name | text | not null >> file_parent_dir | text | not null >> file_perm | text | not null >> file_size | bigint | not null >> file_type | character varying(2) | not null default >>'f'::character varying >> file_user_name | text | >> file_user_uid | bigint | not null >> file_backup | boolean | not null default true >> file_display | boolean | not null default false >> file_restore_display | boolean | not null default false >> file_restore | boolean | not null default false >>Indexes: >> "file_info_2_display_idx" btree (file_type, file_parent_dir, file_name) >> >> 'file_name' and 'file_parent_dir' are the columns that could have >>entries with the invalid unicode characters. Maybe I could/should use >>something other than 'text'? These columns could contain anything that a >>file or directory name could be. >> >> Thanks! >> >>Madison >> >>-- >>-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- >>Madison Kelly (Digimer) >>TLE-BU, The Linux Experience; Back Up >>http://tle-bu.thelinuxexperience.com >>-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-