Summary

The datastore codepage has to match the database middleware.
The file format codepage has to match the codepage of the file.
The engine has to be set to a codepage that is a superset of all of them - typically UTF-8. This is done manually in 11.7 and automatically in DI 12.
The target database has to use a codepage that is a superset as well, so that it can save all the characters.

As a source we have a file containing the list of languages of the world together with their local names, encoded as a UTF-8 file (see attachment).

Note: Data Integrator can deal with UTF-16 encoded files since version 12. Unfortunately, Excel, Word, and WordPad save files in UTF-16 when you specify Unicode, and there is no way to force them to encode the file as UTF-8.

The next step is to use this file as a source in the file format editor. So we create a new file format, call it languages_of_the_world, specify "Tab" as the delimiter, set "skip row header", and fill in the file name.
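As a quick sanity check outside Data Integrator, the same file can be previewed with a few lines of Python - a sketch that assumes the attachment was saved under the name languages_of_the_world.txt:

    # Preview the attachment as tab-delimited UTF-8; the file name is an assumption.
    import csv

    with open("languages_of_the_world.txt", encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)                    # skip the header row, as in the file format
        for row in list(reader)[:5]:    # show the first five data rows
            print(row)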

When you check the preview window, you will find wrong characters. The codepage is set to default, so it defaults to the Windows codepage, which on my computer is a standard Western European codepage: cp1252, essentially ISO Latin-1 (ISO-8859-1). Hence, each multi-byte UTF-8 character is shown as two or three separate characters, which is obviously wrong.


Play around and set the codepage in the dialog to different values. For many you will get identical results, as they are all based on the same ISO-8859 codepage, maybe with slight variations. But when changing the codepage to cp1250 (Central European), you will find that the same character code maps to a different character there. In our case, obviously, utf-8 is the correct setting to choose.
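The effect is easy to reproduce outside the Designer. In the Python sketch below, the same two UTF-8 bytes yield a different result under each single-byte codepage:

    # One UTF-8 character, interpreted under different codepages.
    data = "ä".encode("utf-8")       # the two bytes 0xC3 0xA4
    print(data.decode("cp1252"))     # 'Ã¤' - the Western European misreading
    print(data.decode("cp1250"))     # 'Ă¤' - the Central European misreading
    print(data.decode("utf-8"))      # 'ä'  - the correct interpretation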


Now we can use that file in a simple dataflow to load it into a new table. We build a job and a dataflow, drop in the source file, and add a template table for one of our datastores.


When executing it, you will find a first indicator that something is wrong: a line reading "initialize transcoder". Here it says that for the datastore "languages of the world" (yeah, I know, file formats are treated as datastores in DI) a transcoder is created to transcode the UTF-8 data to the engine codepage cp1252.


Having a transcoder might not be a bad thing per se, but in our case we read UTF-8 text, which is not only capable of representing all kinds of characters but actually does contain them. And that we convert to cp1252, capable of representing Western European characters only. Hence, when checking the results in the target table, most of the special characters are gone.
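A minimal Python sketch shows the same loss - everything outside the cp1252 repertoire is replaced (the sample string is made up):

    # Transcoding Unicode text down to cp1252 is lossy.
    text = "ä я 中"                                  # Latin, Cyrillic, CJK
    lossy = text.encode("cp1252", errors="replace")  # what a cp1252 engine keeps
    print(lossy.decode("cp1252"))                    # 'ä ? ?' - only Latin survives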


So we need to change the engine codepage. When you install Data Integrator, there is one dialog asking for the locale including the codepage, and normally you leave that at "<default>_<default>.<default>". This information is then saved in the DSConfig.txt file - and we can change it there.

Up to version 11.7, you are in charge of choosing the appropriate codepage yourself. When it is set to <default>, that means: use the default codepage of the OS.

Since version 12.0, <default> is not related to the OS anymore; it relates to the codepages of the sources and targets used in the dataflow. The idea is to use the minimal codepage with which no data is lost. Your source might be in the ASCII codepage and the target database in cp1252 - why should the engine cache two bytes per character when the single-byte codepage cp1252 is a superset of both? So cp1252 will be chosen. Another example is Cyrillic data loaded into a database with a different codepage. Since we certainly do not want to lose characters in the initial step, Data Integrator 12 will automatically use a Unicode codepage if the engine codepage is set to default.
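The selection logic can be pictured with a small Python sketch - not Data Integrator's actual algorithm, just the "minimal superset" idea with an assumed, simplified codepage hierarchy:

    # Pick the smallest codepage that can represent all source/target codepages.
    SUPERSETS = {
        "ascii":  {"ascii"},
        "cp1252": {"ascii", "cp1252"},
        "utf-8":  {"ascii", "cp1250", "cp1252", "utf-8"},  # Unicode holds all
    }

    def engine_codepage(codepages):
        for candidate in ("ascii", "cp1252", "utf-8"):      # smallest first
            if set(codepages) <= SUPERSETS[candidate]:
                return candidate
        return "utf-8"

    print(engine_codepage(["ascii", "cp1252"]))   # -> cp1252, one byte per char
    print(engine_codepage(["cp1250", "cp1252"]))  # -> utf-8, nothing gets lost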

In our current scenario we had a transcoder from UTF-8 to the enforced engine codepage cp1252 and were therefore losing characters. So we go to the Data Integrator bin directory, open the DSConfig.txt file, and modify the codepage there.
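Before editing, it can be handy to see which lines of DSConfig.txt mention the locale at all. A Python sketch (the installation path is an assumption, and the exact key names vary by version):

    # List the locale/codepage-related lines of DSConfig.txt.
    from pathlib import Path

    dsconfig = Path(r"C:\Program Files\Business Objects\Data Integrator\bin\DSConfig.txt")
    for line in dsconfig.read_text().splitlines():
        if "codepage" in line.lower() or "<default>" in line.lower():
            print(line)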


Note: A list of all valid codepages can be found in the documentation and in the DI\bin\locales\convrtrs.txt file.

The result will be the same as before. And again, the trace log has the hint for us. One line says that UTF-8 will be converted to the engine codepage UTF-16. This is nothing to worry about, as UTF-8 and UTF-16 are logically the same; the latter just uses double-byte characters for all characters, even the ASCII ones. And if the engine is set to UTF-8, we override that setting to UTF-16 to simplify the internal processing. But there is another line saying that the datastore called "target" is still converting the UTF data into cp1252.
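That this UTF-8-to-UTF-16 step is harmless can be verified in Python: both encodings round-trip any Unicode string without loss, they just differ in byte count.

    # UTF-8 and UTF-16 cover the same character repertoire.
    s = "Ärger с языками"                            # mixed Latin and Cyrillic
    assert s.encode("utf-8").decode("utf-8") == s    # lossless round trip
    assert s.encode("utf-16").decode("utf-16") == s  # lossless round trip
    print(len(s.encode("utf-8")), len(s.encode("utf-16")))  # sizes differ, content not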


So there is something potentially wrong there. The datastore setting has to match the database middleware setting, so it is time to figure out what the database driver is using. For Oracle on Windows you use the regedit utility for that: go to HKEY_LOCAL_MACHINE\SOFTWARE\ORACLE and check the entries for NLS_LANG.
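Instead of clicking through regedit, the NLS_LANG values of all Oracle homes can also be listed with Python's winreg module - a sketch assuming the usual registry layout of an Oracle client install:

    # Print NLS_LANG for every subkey below HKLM\SOFTWARE\ORACLE.
    import winreg

    root = winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, r"SOFTWARE\ORACLE")
    i = 0
    while True:
        try:
            sub = winreg.EnumKey(root, i)            # next Oracle home subkey
        except OSError:
            break                                    # no more subkeys
        i += 1
        try:
            with winreg.OpenKey(root, sub) as home:
                value, _ = winreg.QueryValueEx(home, "NLS_LANG")
                print(sub, "->", value)
        except OSError:
            pass                                     # subkey without NLS_LANG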


In my case the third part of the NLS_LANG value contained WIN1252, so using codepage 1252 for the datastore was correct. We alter the setting from "WE8MSWIN1252" to "UTF8" in regedit.


When loading that, the result is even worse - now we have even fewer correct characters: we have a mismatch between the datastore setting "cp1252" and the database middleware set to UTF-8. This is something that should never, ever happen, so let us modify the datastore setting. Just imagine what was going on here: we gave the database driver a memory structure with the code 0xC4 (= Ä in cp1252), and the database driver checks what 0xC4 means in the UTF-8 codepage table - certainly not an Ä character.
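Python makes the mismatch tangible: the byte 0xC4 is a perfectly good Ä in cp1252, but on its own it is not even a valid UTF-8 sequence.

    # The same byte under the two codepages involved in the mismatch.
    raw = b"\xc4"                       # written by a client thinking in cp1252
    print(raw.decode("cp1252"))         # 'Ä' - what the sender meant
    try:
        raw.decode("utf-8")             # what a UTF-8 reader makes of it
    except UnicodeDecodeError as err:
        print("invalid UTF-8:", err)    # 0xC4 alone is an incomplete sequence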

Keep in mind, there are two database codepages. One is the codepage of the database itself - the codepage that defines which characters can be stored inside the database. The other is the session codepage. It does not help much if the database is capable of storing every character (Unicode) but the database session driver passes ASCII characters only. Therefore it is good practice to set the session codepage to the database-internal codepage and align the datastore to that. We are currently considering altering the session codepage to what was specified by the user in the datastore, rather than leaving it up to the user to keep both in sync - and possibly even reading the database codepage automatically, so that the user cannot do anything wrong anymore (considered for DI 12.x).
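For Oracle, the database-internal codepage can be queried directly. A sketch using the cx_Oracle driver (the connect string is a placeholder; the session side still comes from the client's NLS_LANG):

    # Show the character sets the database itself is configured with.
    import cx_Oracle

    conn = cx_Oracle.connect("user/password@host/service")
    cur = conn.cursor()
    cur.execute("""SELECT parameter, value
                     FROM nls_database_parameters
                    WHERE parameter LIKE '%CHARACTERSET%'""")
    for parameter, value in cur:
        print(parameter, "=", value)    # e.g. NLS_CHARACTERSET = AL32UTF8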


The load now does not perform any conversion at all anymore (except UTF-8 to UTF-16): the source is Unicode, and so are the target and the engine. And when we check the results in the database.... Still wrong!?!?!!


That actually looks like correct UTF-8 data, just not shown as UTF-8! Using SQL*Plus does not help either, as both applications, TOAD and SQL*Plus, are not Unicode-aware yet. They use the Windows regional setting to determine the supported characters. Executing the same select statement with SQL Developer (a freeware tool found on the Oracle download pages) gives you the correct results, including the Arabic text.


So codepage handling is something one has to be very careful with - not only within Data Integrator but, even more importantly, in all the other tools around it.

Attachment
file.zip (35.24 KB)




1 Comment

  1. Hi Kendra,

Very good example of configuring DS for Unicode data. I tried to implement it with the data file provided, however I am getting data truncation errors. The only difference is that I have an IBM DB2 9.7 FP4 database. According to the DBA, the DB2 database is configured for UTF-8.

I made the DS configuration as you mentioned, but I don't know why it truncates multilanguage data. What's missing? What needs to be configured to ensure that the data in the target DB2 table is not truncated? If I load the same file directly into the DB2 database table, it loads the data correctly without losing anything.

    I am using DS 4.0 SP2 (14.0.2.376) on Windows 2008 64 bit box.

    Thanks, Abhay

    Errors:

    12540 11188 INZ-064027 6/15/2012 4:12:14 PM The string <"лыгъоравэтльан йилйил (lyg""oravėtl'an jiljil); чаучу (čauču) / ча> exceeds the

    12540 11188 INZ-064027 6/15/2012 4:12:14 PM maximum of <94> bytes, so the trailing character(s) <авчу (čavču)"> will be lost. Context <Column Column Field3>

    Db2 database config:

           Database Configuration for Database

     Database configuration release level                    = 0x0d00
     Database release level                                  = 0x0d00

     Database territory                                      = en_US
     Database code page                                      = 1208
     Database code set                                       = UTF-8