Debugging Unexpected Characters in Data Using Cross-Character-Set Maps
The Problem: Customers Get Unexpected Characters
One of the most frequently asked questions the Sybase Global Products Group gets from customers is:
“What happened to my data? Yesterday, I typed in character x, but when I retrieved my data today, I see character y!”
where characters x and y are two different non-ASCII characters. If you have experienced this problem, read on.
A character set is a specific collection of characters and their assigned numerical values. (Please note, the term character set is interchangeable with code page or code set.) Data is encoded as a series of binary codes for a particular character set; these codes are usually written in hexadecimal and are called hex codes here. Since different operating systems use different character sets, the encoding for a particular character may differ depending on the platform and its character set. As data passes between the client and the server, it must be correctly converted to the character set used by each part of your system in order to maintain the integrity of the data.
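As a small illustration of how the encoding for one character can differ by character set, the following Python sketch encodes a character in each of the three character sets discussed in this article. The codec names (cp850, cp1252, iso8859-1) are those of Python's standard library, and the helper name hex_code is hypothetical, chosen for this example only.

```python
# Codec names are Python's; they correspond to CP 850, CP 1252, and ISO 8859-1.
CHARSETS = ["cp850", "cp1252", "iso8859-1"]


def hex_code(char, charset):
    """Return the hex code of a single character in the given character set.

    Raises UnicodeEncodeError if the character set does not contain the character.
    """
    encoded = char.encode(charset)
    return "0x" + encoded.hex().upper()


if __name__ == "__main__":
    # Show the hex code for capital N with tilde in each character set.
    for charset in CHARSETS:
        print(charset, hex_code("\u00d1", charset))
```

For capital N with tilde (Ñ), this reports 0xA5 in CP 850 and 0xD1 in both CP 1252 and ISO 8859-1, which is why the same stored byte can display differently on different platforms.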
In order to understand what has happened when you do not get the characters you expect, you must carefully step through and identify the character set used by each piece of hardware and software in the application system. By doing so, you’re trying to identify the point at which both of the following are true:
- software that handles the data uses a character set that differs from the character set used to enter the data, and
- the data was not converted as it passed between systems or between the different layers of a system.
When this happens, an error is introduced. In fact, this may occur several times before the data is displayed back to the user, thus compounding the problem.
Once you have identified the two or more character sets being used, you need to identify what character is associated with the hex code in the two character sets. Hopefully, you will find that character x in the first character set is interpreted as character y in the second. If that’s the case, the mystery is solved! If not, then either the character sets being used are different than you thought, or there is yet another character set involved.
Looking up hex codes in two character sets is tedious. First, you must locate the descriptions of the two character sets, understand the addressing system used by each table, and then find the character in the table. Therefore, we’ve come up with a way to make this easier for you: a table that lists the equivalent characters in each of three commonly used character sets for the Western European languages.
The following mappings show the characters associated with different hexadecimal and decimal values for the three most commonly used character sets in North America, Latin America, and Western Europe:
- CP 850 (used by MS-DOS systems)
- CP 1252 (used by MS-Windows 95, 98, and NT)
- ISO 8859-1 (used by Sun Solaris and other UNIX systems)
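If you prefer to look up a single row programmatically, the idea behind the map can be sketched in Python as follows. This is only an illustration using Python's built-in codecs, not the published map itself; the function name map_row is hypothetical, and Python's definition of a code page may differ in small details from a given vendor's (for example, Python leaves a few CP 1252 values such as 0x81 unassigned).

```python
# Codec names are Python's; they correspond to CP 850, CP 1252, and ISO 8859-1.
CHARSETS = ["cp850", "cp1252", "iso8859-1"]


def map_row(code):
    """Return the character that one byte value represents in each character set.

    The result maps each codec name to a one-character string, or to None
    when Python's codec leaves that byte value unassigned.
    """
    row = {}
    for charset in CHARSETS:
        try:
            row[charset] = bytes([code]).decode(charset)
        except UnicodeDecodeError:
            row[charset] = None
    return row
```

Calling map_row(0xD1) returns Ð for cp850 and Ñ for both cp1252 and iso8859-1.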
How to Use the Cross-Character-Set Maps
We use the following example to illustrate how to use the character set maps when you are not getting the characters you expect. In this example, the user entered a capital letter N with tilde (Ñ) and is now getting a capital letter eth (Ð). The character set being used on the user’s machine is CP 1252. To diagnose the problem, you would go through the following steps:
- Begin with the output data that is causing the problem (Ð). Locate its hex value in the character set of the user's machine. In this example, the character set is CP 1252 and the hex value for Ð is 0xD0.
- Identify the encoding for the character the user entered on their system (Ñ) in their character set (CP 1252). The hex value for Ñ in CP 1252 is 0xD1.
- Look across at the other columns (for ISO 8859-1 and CP 850) for the same hex value (0xD1).
- Did the data entered by the user change to one of the characters in either of these two character sets?
In this example, a capital eth (Ð) appears at 0xD1 in the CP 850 column, so the answer is yes. You’ve found your problem. Somewhere along the way the data, entered in CP 1252, was treated as CP 850 without being converted to the correct CP 850 encoding.
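The lookup steps above can also be sketched in code. The following Python function is a minimal illustration, assuming Python's built-in codecs stand in for the map columns; the name find_misreadings and the CANDIDATES list are hypothetical and would need to be extended with any other character sets in use on your system.

```python
# Candidate character sets to check, named by their Python codec names.
CANDIDATES = ["cp850", "cp1252", "iso8859-1"]


def find_misreadings(entered, observed, user_charset, candidates=CANDIDATES):
    """List the character sets in which the entered character's hex code
    is interpreted as the observed character."""
    # Hex code of the character the user typed, in the user's character set.
    code = entered.encode(user_charset)
    matches = []
    for charset in candidates:
        if charset == user_charset:
            continue
        try:
            # Same hex code, read from another column of the map.
            seen = code.decode(charset)
        except UnicodeDecodeError:
            continue  # hex code is unassigned in this character set
        if seen == observed:
            matches.append(charset)
    return matches
```

For the example in this section, find_misreadings("Ñ", "Ð", "cp1252") returns ["cp850"]. An empty result suggests that the character sets involved are not the ones you assumed, or that another character set is in play.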
How to Fix Your Character Set Problem
Once you have found the mapping problem, you need to identify which part of your computer system is using the wrong character set without character set conversion. This is usually an intermediate machine which is set to a different character set. Once you have identified it, you need to switch character set conversion on at that particular point or find another way to prevent the problem from occurring. It is, unfortunately, impossible to be more specific than that because computer systems are so varied and each situation must be debugged individually.
In this example, data entered by the user is passed to a middle tier which is using CP 850. No conversion occurred between the user’s character set (CP 1252) and the middle tier. The data was not converted to the encoding for Ñ in CP 850 (0xA5); instead, it was passed, unconverted, as 0xD1.
The data goes from the middle tier to a server which is storing the data in ISO 8859-1. This time, conversion is turned on between the middle tier and the server. The data, which reached the middle tier as 0xD1, is passed on to the server. Since conversion is turned on, the server converts 0xD1, the CP 850 encoding for Ð, to 0xD0, the ISO 8859-1 encoding for the same character.
The user retrieves the data. Since conversion is turned on between all parts of the system (server, middle tier, and client), the user sees the character associated with the hex code 0xD0 (Ð), not Ñ.
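The whole round trip can be reproduced in a few lines of Python. This is a sketch of the scenario described above, not of any particular product's conversion logic. It assumes that conversion is off only on the way in from the client to the middle tier and is on for every other hop, including both hops on retrieval. The function names convert and round_trip are hypothetical.

```python
def convert(data, source_charset, target_charset):
    """Conversion turned on: re-encode the same characters in the target set."""
    return data.decode(source_charset).encode(target_charset)


def round_trip(char):
    """Follow one character through client, middle tier, server, and back.

    Returns the byte stored on the server and the character the user sees.
    """
    client = char.encode("cp1252")          # typed on a CP 1252 client
    middle = client                         # no conversion: same byte, read as CP 850
    server = convert(middle, "cp850", "iso8859-1")        # stored in ISO 8859-1
    back_middle = convert(server, "iso8859-1", "cp850")   # retrieval, first hop
    back_client = convert(back_middle, "cp850", "cp1252") # retrieval, second hop
    return server, back_client.decode("cp1252")
```

For Ñ, round_trip returns the stored byte 0xD0 and the displayed character Ð. Replacing the unconverted hop with convert(client, "cp1252", "cp850") makes the same function return 0xD1 and Ñ, which shows that the single missing conversion is what corrupts the data.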
We have provided the cross-character-set map, divided into eight pages, to make it easier to display. Each page contains 32 characters, broken into the hex ranges shown below. Almost all problems occur in the non-ASCII range, which starts at 0x80. In ISO 8859-1, the displayable characters within the 0x80-0xFF range begin at 0xA0; CP 850 and CP 1252 also assign displayable characters to values in the 0x80-0x9F range. The other ranges are provided for your reference.