Thread: Re: [GENERAL] Invalid unicode in COPY problem

Re: [GENERAL] Invalid unicode in COPY problem

From
Tatsuo Ishii
Date:
We have developed patches which relaxes the character validation so
that PostgreSQL accepts invalid characters. It works like this:

1) new postgresql.conf item "mbstr_check" added.
2) if mbstr_check = 0 then invalid characters are not accepted
   (same as current PostgreSQL behavior). This is the default.
3) if mbstr_check = 1 then invalid characters are accepted with
   WARNING
4) if mbstr_check = 2 then invalid characters are accepted without any
   warnings
5) We have checked PostgreSQL source code if accepting invalid
   characters makes some troubles. We have found that we need to fix a
   place and the fix is included in the patches.

Madison,
If you are interested in the patches, I could send it to you.

Hackers,
Do you think the functionality something like above is worth to add to
PostgreSQL?
--
Tatsuo Ishii

> Hi all,
>
>    I've been chasing down a bug and from what I have learned it may be
> because of how postgreSQL (8.0.2 on Fedora Core 4 test 2) handles
> invalid unicode. I've been given some ideas on how to try to catch
> invalid unicode but it seems expensive so I am hoping there is a
> postgresql way to deal with this problem.
>
>    I've run into a problem where a bulk postgres "COPY..." statement is
> dieing because one of the lines contains a file name with an invalid
> unicode character. In nautilus this file has '(invalid encoding)' and
> the postgres error is 'CONTEXT:  COPY file_info_3, line 228287, column
> file_name: "Femme Fatal\uffff.url"'.
>
>    To actually look at the file from the shell (bash) shows what appears
> to be a whitespace but when I copy/paste the file name I get the
> '\uffff' you see above.
>
>    I could, with the help of the TLUG people, use regex to match for an
> invalid character and skip the file but that is not ideal. The reason is
> that this is for my backup program and invalid unicode or not, the
> contents of the file may still be important and I would prefer to have
> it in the database so that it is later copied. I can copy and move the
> file in the shell so the file isn't apparently in an of itself corrupt.
>
>    So then, is there a way I can tell postresql to accept the invalid
> unicode name? Here is a copy of my schema:
>
> tle-bu=> \d file_info_2
>                                Table "public.file_info_2"
>          Column        |         Type         |                Modifiers
> ----------------------+----------------------+-----------------------------------------
>   file_group_name      | text                 |
>   file_group_uid       | bigint               | not null
>   file_mod_time        | bigint               | not null
>   file_name            | text                 | not null
>   file_parent_dir      | text                 | not null
>   file_perm            | text                 | not null
>   file_size            | bigint               | not null
>   file_type            | character varying(2) | not null default
> 'f'::character varying
>   file_user_name       | text                 |
>   file_user_uid        | bigint               | not null
>   file_backup          | boolean              | not null default true
>   file_display         | boolean              | not null default false
>   file_restore_display | boolean              | not null default false
>   file_restore         | boolean              | not null default false
> Indexes:
>      "file_info_2_display_idx" btree (file_type, file_parent_dir, file_name)
>
>    'file_name' and 'file_parent_dir' are the columns that could have
> entries with the invalid unicode characters. Maybe I could/should use
> something other than 'text'? These columns could contain anything that a
> file or directory name could be.
>
>    Thanks!
>
> Madison
>
> --
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
> Madison Kelly (Digimer)
> TLE-BU, The Linux Experience; Back Up
> http://tle-bu.thelinuxexperience.com
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings
>

Re: [GENERAL] Invalid unicode in COPY problem

From
Madison Kelly
Date:
Thank you, I would!

  What versions have you tested the patch against? I am sorry but I am
not too familiar with applying patches against the main program, is
there documentation on how to apply the patch? Is there a way to roll
the patch back/remove it? Would I be able to script the installation of
the patch (I would expect so).

  The reason for the last question is that I expect (hope) many people
will use it and I want to make it as easy as possible for a user to
simply select or unselect the patch if it works well. If I can script
the install and removal of this patch then I can do just this and that
would be wonderful.

  Thank you again!

  どうも ありがとう ございます! (I hope that is right, my Japanese is
still elementary. :) )

  Madison


Tatsuo Ishii wrote:
> We have developed patches which relaxes the character validation so
> that PostgreSQL accepts invalid characters. It works like this:
>
> 1) new postgresql.conf item "mbstr_check" added.
> 2) if mbstr_check = 0 then invalid characters are not accepted
>    (same as current PostgreSQL behavior). This is the default.
> 3) if mbstr_check = 1 then invalid characters are accepted with
>    WARNING
> 4) if mbstr_check = 2 then invalid characters are accepted without any
>    warnings
> 5) We have checked PostgreSQL source code if accepting invalid
>    characters makes some troubles. We have found that we need to fix a
>    place and the fix is included in the patches.
>
> Madison,
> If you are interested in the patches, I could send it to you.
>
> Hackers,
> Do you think the functionality something like above is worth to add to
> PostgreSQL?
> --
> Tatsuo Ishii
>
>
>>Hi all,
>>
>>   I've been chasing down a bug and from what I have learned it may be
>>because of how postgreSQL (8.0.2 on Fedora Core 4 test 2) handles
>>invalid unicode. I've been given some ideas on how to try to catch
>>invalid unicode but it seems expensive so I am hoping there is a
>>postgresql way to deal with this problem.
>>
>>   I've run into a problem where a bulk postgres "COPY..." statement is
>>dieing because one of the lines contains a file name with an invalid
>>unicode character. In nautilus this file has '(invalid encoding)' and
>>the postgres error is 'CONTEXT:  COPY file_info_3, line 228287, column
>>file_name: "Femme Fatal\uffff.url"'.
>>
>>   To actually look at the file from the shell (bash) shows what appears
>>to be a whitespace but when I copy/paste the file name I get the
>>'\uffff' you see above.
>>
>>   I could, with the help of the TLUG people, use regex to match for an
>>invalid character and skip the file but that is not ideal. The reason is
>>that this is for my backup program and invalid unicode or not, the
>>contents of the file may still be important and I would prefer to have
>>it in the database so that it is later copied. I can copy and move the
>>file in the shell so the file isn't apparently in an of itself corrupt.
>>
>>   So then, is there a way I can tell postresql to accept the invalid
>>unicode name? Here is a copy of my schema:
>>
>>tle-bu=> \d file_info_2
>>                               Table "public.file_info_2"
>>         Column        |         Type         |                Modifiers
>>----------------------+----------------------+-----------------------------------------
>>  file_group_name      | text                 |
>>  file_group_uid       | bigint               | not null
>>  file_mod_time        | bigint               | not null
>>  file_name            | text                 | not null
>>  file_parent_dir      | text                 | not null
>>  file_perm            | text                 | not null
>>  file_size            | bigint               | not null
>>  file_type            | character varying(2) | not null default
>>'f'::character varying
>>  file_user_name       | text                 |
>>  file_user_uid        | bigint               | not null
>>  file_backup          | boolean              | not null default true
>>  file_display         | boolean              | not null default false
>>  file_restore_display | boolean              | not null default false
>>  file_restore         | boolean              | not null default false
>>Indexes:
>>     "file_info_2_display_idx" btree (file_type, file_parent_dir, file_name)
>>
>>   'file_name' and 'file_parent_dir' are the columns that could have
>>entries with the invalid unicode characters. Maybe I could/should use
>>something other than 'text'? These columns could contain anything that a
>>file or directory name could be.
>>
>>   Thanks!
>>
>>Madison
>>
>>--
>>-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
>>Madison Kelly (Digimer)
>>TLE-BU, The Linux Experience; Back Up
>>http://tle-bu.thelinuxexperience.com
>>-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-