
Writing Unicode programs
by Jeremy Gordon

This file is intended for those interested in writing Unicode programs (and programs using non-Roman characters) for 32 bit Windows, using GoAsm (assembler), GoRC (resource compiler) and GoLink (linker). It will also be of interest to those using other tools or other programming languages.

Contents

Introduction:
What is Unicode?
Why was Unicode needed?
Drawing non-Roman characters - it's font-based
What can be done with the ANSI APIs?
What can be done with the Unicode APIs?
Getting the strings to give to the APIs

Part 1: Using the "Go" tools to make Unicode programs
Types of Unicode source files readable by the "Go" tools
Making a Unicode source file with a Unicode editor
Unicode: command line, output files, filenames, comments,
     code and data labels, resource IDs, defined words, and exports
     and how they appear in the debugger
Conversion methods used by GoAsm

Part 2: Using Unicode strings in DATA
Controlling the string format
Using DUS (declare Unicode sequence)
L', A' and 8' overrides
Overriding using the STRINGS directive
Declaring data using the overrides
Pushing null terminated Unicode strings
Using the correct string in quoted immediates
Using Unicode strings in structures
When the overrides apply
Using SIZEOF on Unicode strings

Part 3: Unicode/ANSI switching
What is Unicode/ANSI switching and why is it needed?
Switching using run-time loading
Switching using the same source
How switching is achieved
Unicode/ANSI switching of APIs
Switching constants
Switching quoted strings and immediates
Switching of string sequences with control characters
Using a switched type indicator
Character size switch
Other ways to achieve switching
Testing the STRINGS directive

Part 4:
Summary of Unicode support in the "Go" tools

Demonstration files
"Hello Unicode 1" (draws Unicode characters to console)
"Hello Unicode 2" (draws Unicode characters in dialog and message box)
"Hello Unicode 3" (draws Unicode characters using TextOutW, and also demonstrates Unicode/ANSI switching)
"Run Time Loading" (demonstrates how to use run-time loading in large application so both ANSI and Unicode APIs can be used).

More information, references and links
Acknowledgements


Introduction

What is Unicode?

Unicode is a standard for character encoding which aims to cover all the major scripts of the world. It is published by the Unicode Consortium, a non-profit organisation based in Mountain View, California, whose members comprise computer corporations, software houses, academics and others interested in the subject.
Developers will probably not need to delve into the standard because all encoding is achieved through tools such as Unicode editors. But for those interested, there is a lot of information about it on the Unicode Consortium website. The standard was first published in 1991 in The Unicode Standard. It is regularly updated.
Currently about 50,000 characters from over 90 scripts are covered by the standard, and there is plenty of room for expansion.
A feature of Unicode is that characters have 16-bit base values. That is to say, a character either has a value of between 0 and 0FFFFh (the usual case) or by using surrogate pairs, it is represented by two such values. For this reason Unicode characters are sometimes described as "wide" characters. The feature remains true even in UTF-8 and other Unicode formats which are merely translations of the character values.
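As a small illustration of the surrogate pair mechanism just described, here is a minimal Python sketch which splits a character value above 0FFFFh into its two 16-bit values. The function name to_utf16_units is made up for this example; the arithmetic follows the UTF-16 definition, and no check is made that the input is a valid character value.

```python
def to_utf16_units(code_point):
    """Return the 16-bit value(s) that represent one character in UTF-16."""
    if code_point <= 0xFFFF:
        # The usual case: a single 16-bit value.
        return [code_point]
    # Characters above 0FFFFh are split into a high and a low surrogate.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits of the offset
    return [high, low]


if __name__ == "__main__":
    # Latin A, the Euro sign, and a musical symbol outside the 16-bit range.
    for cp in (0x0041, 0x20AC, 0x1D11E):
        print(hex(cp), [hex(u) for u in to_utf16_units(cp)])
```

Both values of a pair fall in the surrogates range D800 to DFFF, which is how a reader of the text can tell that two 16-bit values belong together.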

This much reduced table illustrates the extent of Unicode coverage:-

General scripts (0000 to 1FFF): Latin (Roman) and other non-ideographic scripts eg. Greek, Cyrillic, Armenian/Hebrew, Arabic, Oriya/Tamil, Thai/Lao, Tibetan, Canadian Aboriginal Syllabics, Mongolian
Symbols (2000 to 2DFF): eg. currency, numbers, maths, arrows, technical symbols, braille patterns, dingbats and bi-directional control codes
CJK phonetics and symbols (2E00 to 33FF): Phonetic characters, punctuation marks and symbols used in Chinese, Japanese and Korean
CJK ideographs (3400 to 9FFF): Ideographic Han characters unified from Chinese, Japanese and Korean sources
Yi syllables (A000 to A4CF): Yi syllables and Yi radicals
Hangul syllables (AC00 to D7A3): Precomposed Korean Hangul syllables
Surrogates (D800 to DFFF): Indicate that the character is represented by two 16-bit characters
Private use area (E000 to F8FF): Reserved for user-defined characters
Compatibility and specials (F900 to FFFD): Special use characters and alternative representation of characters defined elsewhere in the Unicode standard to aid compatibility and mapping
Byte order marks (FFFE and FEFF): These words at the beginning of a file can be taken to signify that the file is Unicode and in the format UTF-16 big-endian (UTF-16BE) and UTF-16 little-endian (UTF-16LE) respectively

Why was Unicode needed?

The existing character system was designed to suit text-based displays, text-based printing and serial text transmission protocols where space was limited.

In the case of video displays, this was organised by a chip called a video controller which would know how to draw a particular character on the screen. The program would send the video controller two bytes at a time. The first would tell the controller which character to draw, and the second would provide the character's "attribute" (its colour, background and whether it was flashing, bold, or underlined).
Since the character's value was contained in one byte, the number of different characters which could be displayed was limited. In practice the first 32 characters were used as control characters (a carriage return value 13, and linefeed value 10 being good examples), and value 255 was reserved, so that only 222 characters were available.
In serial communication it was even more restricted since one bit was used for parity checking.
The actual characters capable of being drawn with these arrangements were called a character set. The ASCII character set, in the first 7-bits of the byte, could represent the usual Roman characters (a to z, A to Z), Arabic numbers (0 to 9), and some additional characters and punctuation marks commonly used. IBM established a character set which included line draw characters using the 8th bit.
To draw other characters on a screen meant the video controller had to be told to switch character sets and draw a different character when sent a particular value. So different character sets were established to represent those. But here was the first problem. Since text was drawn on a byte-by-byte basis, languages which needed more than the number of characters available in a single-byte character set [for example Chinese, Japanese and Korean (Hangeul)] could not be represented fully. So Double and Multi-Byte Character Sets (DBCS and MBCS) were developed. These required special parsing and drawing techniques.

One difficulty with this arrangement arises partly from inconsistency in the individual character sets. For example the Windows character set appears to be usable by Hungarian speakers, but in fact there are four characters missing (Ő,ő,Ű and ű). So Hungarians have to use the Central European single-byte codepage iso-8859-2 which does have those characters (a codepage being a version within a character set). But then there is another problem. The iso-8859-2 codepage does not have another character which may be needed (Õ)! If this is a problem for those using Hungarian, imagine the difficulties faced by those using Asian scripts.
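The Hungarian example can be checked with a few lines of Python. This is only a sketch: it assumes that "the Windows character set" means codepage 1252 (Python codec name cp1252), and the helper name can_encode is made up for the illustration.

```python
def can_encode(text, codepage):
    """Return True if every character of text exists in the given codepage."""
    try:
        text.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # The four Hungarian characters, then the character missing from iso-8859-2.
    for ch in "\u0150\u0151\u0170\u0171\u00d5":
        print(hex(ord(ch)),
              "cp1252:", can_encode(ch, "cp1252"),
              "iso8859_2:", can_encode(ch, "iso8859_2"))
```

Neither codepage can encode all five characters, which is exactly the difficulty described above.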
A second problem arises from the fact that the default character set in any particular computer is set on installation to the "locale" of the computer. On Windows 9x machines, this can only be altered using the Control Panel and can not be altered at run-time. Now some system displays which are very useful to the programmer (such as MessageBox), can only draw using this default character set. This means that such displays will be wrong if the locale is not set properly by the user.
In other types of controls, considerable care is needed by the programmer to ensure that the correct character set is used for the language to be represented.

As the requirement has grown for more and more languages to be represented, this could have been achieved by introducing more and more character sets. But instead the decision was made to go down the Unicode route. The advantages are:-

  • Since every character can be represented in the Unicode character set, two characters which look the same can have the same Unicode value even if from different languages. This keeps the representation of Japanese, Chinese and Korean characters and punctuation more compact than if individual character sets were used for each of these languages.
  • It is much easier to identify a particular character in Unicode since each character has a unique value. You don't need to know the character set as well. This makes it much easier to transfer international character data between machines. However, to display characters properly in complex scripts you may well need to know the language in which the characters are being used, to make sure that they are drawn to the proportions required by that language.
  • Since characters are identified by value irrespective of character set, the display functions can draw characters from different languages in one string.
  • Whole character sets were needed for each language to provide the sort order, that is to say the alphabetical order of the characters. In Unicode this is handled by the operating system using the National Language Support ("NLS") APIs.
  • Unicode has plenty of space to accommodate new characters as they are required. For example new codepages had to be created to enable the Euro character (€) to be drawn, yet it was easy to add to Unicode (value 20ACh).
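To illustrate the point about unique values, this Python sketch looks at the Euro character. Its Unicode value is always 20ACh, but the byte used for it (if any) depends on the codepage. The helper name byte_in_codepage and the choice of codepages are mine, purely for illustration.

```python
EURO = "\u20ac"


def byte_in_codepage(ch, codepage):
    """Return the single byte used for ch in codepage, or None if it is absent."""
    try:
        return ch.encode(codepage)[0]
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    print("Unicode value:", hex(ord(EURO)))
    for cp in ("cp1252", "iso8859_15", "latin_1"):
        value = byte_in_codepage(EURO, cp)
        print(cp, "absent" if value is None else hex(value))
```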

Drawing non-Roman characters - it's font-based

This is one of those things you find in Windows programming which, once understood, will cause other things just to click into place. A control draws using a device context. Each device context has a font handle assigned to it (selected into it) which will be used in the draw. The font handle identifies the font which will be used in the draw but also it identifies the character set which will be used. Each font has one or more character sets. Each character set only contains a limited number of characters. So a control cannot draw a character which is not in the character set in the font selected in its device context. Unless the character can be found from elsewhere (see below) it will draw a default character instead, maybe a vertical line box or question mark.

So the trick is to find the correct font and character set to draw the characters which need to be drawn.

Not all fonts and character sets will be installed on all machines. What fonts and character sets are loaded on installation will depend on which version of Windows is being used, whether it is an English or other language version, and what applications are loaded (they can come with their own fonts).
Every Windows machine will have installed Courier New, Arial, Times New Roman, Symbol, Wingdings, MS Serif, MS Sans Serif and (on later versions) Tahoma. Depending on the Windows version, between them these fonts can handle at least Western and Central European, Hebrew, Arabic, Greek, Turkish, Baltic, Cyrillic and Vietnamese scripts.
You can see what fonts are loaded on your machine by looking in the Windows\fonts folder or at "Control Panel, fonts". You will also find that installed fonts are listed in the registry in HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft and sub-keys (which depend on the Windows version). However, there is not much information about the fonts. What you need to know is what languages are supported by what fonts. The easiest way to see this is to go into "Internet Options, General" and click "Fonts". Then in the drop down combo box select the language you wish to portray. A list of fonts then appears which have character sets capable of displaying the language specified. The "ChooseFont" common dialog which is used by various applications does a similar thing, except that you choose the font first and then the "script" drop down combo box shows what languages the font is capable of displaying.
The capability of a particular font can be expected to vary from machine to machine. For example, the Times New Roman font supplied with Windows 98 is limited to Western script and Greek, but on Windows XP can also draw Hebrew, Arabic, Turkish, Baltic, Central European, Cyrillic and Vietnamese.

If you have Word installed you can see the characters available for each font and character set by using "Insert, symbol" and then clicking the subset of the range you are interested in.

As developers we are more interested in what can be done at run-time to check that the correct font is installed. You can enumerate the loaded fonts which suit specified character sets using EnumFontFamiliesEx. This calls a callback routine which can report the language and character set in two ways: either by using the CHARSET value (a fixed byte value of between 0 and 255 which you find at +17h in the LOGFONT structure), or by using the name of a language in words (reported in the ENUMLOGFONTEX structure), which is a more flexible arrangement. The languages in words reported by EnumFontFamiliesEx seem to be those which appear in the "Internet Options, General", "Fonts" combo box, which probably uses that function anyway to fill its contents.
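The position of the CHARSET byte can be confirmed by laying out the ANSI version of the LOGFONT structure. The following Python sketch does this with ctypes. It only describes the structure in memory and calls no Windows API, so it can be run on any platform. The field names and the CHARSET values are those found in the Windows headers.

```python
import ctypes


class LOGFONTA(ctypes.Structure):
    """Layout of the ANSI LOGFONT structure (five dwords, eight bytes, face name)."""
    _fields_ = [
        ("lfHeight", ctypes.c_int32),
        ("lfWidth", ctypes.c_int32),
        ("lfEscapement", ctypes.c_int32),
        ("lfOrientation", ctypes.c_int32),
        ("lfWeight", ctypes.c_int32),
        ("lfItalic", ctypes.c_ubyte),
        ("lfUnderline", ctypes.c_ubyte),
        ("lfStrikeOut", ctypes.c_ubyte),
        ("lfCharSet", ctypes.c_ubyte),
        ("lfOutPrecision", ctypes.c_ubyte),
        ("lfClipPrecision", ctypes.c_ubyte),
        ("lfQuality", ctypes.c_ubyte),
        ("lfPitchAndFamily", ctypes.c_ubyte),
        ("lfFaceName", ctypes.c_char * 32),
    ]


# A few of the CHARSET values defined in the Windows headers.
ANSI_CHARSET = 0
DEFAULT_CHARSET = 1
GREEK_CHARSET = 161
HEBREW_CHARSET = 177
RUSSIAN_CHARSET = 204

if __name__ == "__main__":
    lf = LOGFONTA()
    lf.lfCharSet = GREEK_CHARSET
    print("offset of lfCharSet:", hex(LOGFONTA.lfCharSet.offset))
    print("byte at +17h:", bytes(lf)[0x17])
```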

Generally fonts are capable of representing only a limited selection of values in the Unicode character ranges, so under Windows 2000/XP the EnumFontFamiliesEx API is extended so that you can also receive the Unicode range of a particular font through the FONTSIGNATURE structure. The API GetFontUnicodeRanges can also be used to discover the Unicode range of a particular font.

Microsoft advises that the user of your application should also be able to find the correct font to display the language required using the ChooseFont common dialog, if necessary.

Windows 2000/XP uses fonts in some clever ways to maximise their versatility. For example, if a particular font does not support a particular character which needs to be drawn, Windows looks at "linked fonts" to see if there is another font which will do so. The fonts which are linked are contained in the registry key HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\FontLink\SystemLink. These are added to each time a new Language Group is added to the machine by the user, using Control Panel, Regional Options.

When you have found the font suitable for the characters you wish to draw, you would get a handle to the font using CreateFont or CreateFontIndirect and then select the font into the device context ready to draw.

If your application runs on both W9x/ME and XP/2000 platforms there is a very handy pseudo-font you can use: MS Shell Dlg. You can specify this font either in the resource file or in the LOGFONT structure and it will automatically map to the correct font to suit the platform. On W9x/ME machines it maps to MS Sans Serif, and on XP/2000 machines it maps to Microsoft Sans Serif (or to Tahoma if you specify that you want a fixed pitch font). Both Tahoma and Microsoft Sans Serif (as found on XP/2000 machines) support Greek, Hebrew, Arabic, Turkish, Baltic, Central European, Cyrillic, Vietnamese and Thai scripts. Note that Microsoft Sans Serif is not the same as MS Sans Serif. What MS Shell Dlg switches to can be found in HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\FontSubstitutes.

What can be done with the ANSI APIs?

The ANSI APIs are all available under Windows 9x and above. Most of the Unicode APIs (which display strings) are not available in Windows 9x and ME. So if you are writing for those platforms it is important to know what can be done with the ANSI APIs to display non-Roman characters.
The ANSI text drawing APIs accept byte character values. Therefore they cannot draw Unicode character values (which are 16-bits wide). The API will draw according to the character set of the font selected in the device context. Most character sets offer the same characters from 0 to 7Fh but vary the ones above. In Testbug, I have demonstrated how the output of DrawTextA varies with various character sets in fonts selected in the device context. On a basic Windows 98 machine with no other fonts loaded, DrawTextA could draw characters from various sets including Hebrew, Arabic, Greek, Turkish, and Russian.
If more character sets in fonts were to be loaded, the APIs could draw more characters.
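The way the same byte value produces a different character depending on the character set can be imitated in Python using the Windows codepages as stand-ins for the character sets in the fonts. This is only an analogy for what DrawTextA does (nothing is drawn here), and the helper name char_for_byte is made up.

```python
def char_for_byte(byte_value, codepage):
    """Return the character that a single byte stands for in the given codepage."""
    return bytes([byte_value]).decode(codepage)


# Western, Greek, Cyrillic and Hebrew Windows codepages.
CODEPAGES = ("cp1252", "cp1253", "cp1251", "cp1255")

if __name__ == "__main__":
    for cp in CODEPAGES:
        # 41h is below 80h so it is the same everywhere; 0E0h is not.
        print(cp, char_for_byte(0x41, cp), char_for_byte(0xE0, cp))
```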
Double Byte Character Sets are supported by the ANSI APIs at least in the Far East version of Windows.
You can also make a RichEdit control using the ANSI APIs to display characters from any font and character set installed on the machine. You tell the control which one to use by embedding this in the text itself (using formatting commands). From RichEdit Version 2, (available on Windows 98 and above, optional on Windows 95) you can switch the control to display Unicode characters, again using formatting commands.
Since drawing non-Roman characters using ANSI necessarily involves character sets, the ability of the system to find substitute fonts to draw the character is very limited. You are likely to have much more success using Unicode characters and the Unicode APIs.

What can be done with the Unicode APIs?

The Unicode APIs are capable of displaying Unicode text naturally, and are given text in 16-bit Unicode UTF-16 format. Given a Unicode 16-bit value they can draw the character corresponding to that value provided an appropriate font is loaded on the machine and an appropriate font is selected into the device context. If not, a default character will be drawn, which might be a box, line or question mark.
A small number of Unicode APIs are available in Windows 9x and ME, in particular TextOutW and ExtTextOutW. So even on those platforms (as demonstrated by Testbug) these APIs can display Unicode text. However Windows 9x and ME are not nearly as clever as Windows 2000/XP in finding the correct font to draw the correct character to accord with the Unicode value.

Windows NT/2000 and XP provide Unicode versions of the other string APIs such as DrawText, and Unicode common controls, menus, and dialogs. These are capable of drawing Unicode text irrespective of character sets and codepages provided they are given the correct font.

There is also the Uniscribe set of APIs which are highly sophisticated and designed to analyse complex Unicode characters and their individual components (glyphs) and display them correctly.

Getting the strings to give to the APIs

So as a developer, how do you get the Unicode text strings to give to the APIs? The traditional way to keep the Unicode strings was either by:-
  • keeping the Unicode text in a separate file to be loaded at run-time so that the strings could be used straight from data; or
  • keeping the Unicode text in a resource file either as a STRINGTABLE or RCDATA resource, or within DIALOG or MENU resources; at compile time the resources would then be bound into the executable, or into special-language DLLs which would be selected for loading at run-time.
So, using these methods there was no need for development tools to read Unicode files, although resource compilers obviously had to. And by design all resources are kept in the executable in Unicode format anyway.
Keeping strings as a resource meant that you had to recover the string you needed to use with LoadString or FindResource followed by LoadResource. FindResource can find a specific language resource out of several with the same id. Here is how LoadString would be used:-
PUSH 100h                ;size (in characters) of the buffer to receive the string
PUSH ADDR BUFFER         ;pointer to buffer to receive the string
PUSH 23                  ;identifier of the string to be obtained
PUSH [hInst]             ;handle to executable containing the resource
CALL LoadStringW
This code causes the string with the ID of 23 decimal to be placed in BUFFER. From there it can be written to the screen using a window or message box, for example:-
PUSH 40h                 ;information + ok button
PUSH ADDR TITLE_TEXT     ;pointer to text for the title
PUSH ADDR BUFFER         ;pointer to string 23 just retrieved
PUSH [hWnd]              ;to be child of main window
CALL MessageBoxW
Note that there are two versions of the API, LoadStringA and LoadStringW. In Windows 9x and ME only the ANSI version can be used. Similar care is needed when using FindResource and LoadResource.

What is new in GoAsm is that you can keep your strings in your assembler source script and declare them in data, just as you would with an ANSI string. This is because GoAsm reads Unicode source scripts and include files. See part 2 for details of how strings can be kept in data using GoAsm and GoRC.


Part 1: Using the "Go" tools to make Unicode programs

Using the "Go" tools you can make a Unicode program using non-Roman characters in the same way as you make an ANSI program. The only extra tool you need is a Unicode editor. Then you will be able to use Unicode in the command line, output files, filenames, comments, code and data labels, resource IDs, defined words, exports and strings.

Types of Unicode source files readable by the "Go" tools

GoAsm and GoRC are able to read source files in Unicode format UTF-8 and UTF-16 as well as ANSI (text) files. GoLink can read command files in these formats as well.
UTF stands for "UCS (Universal Character Set) Transformation Format".
In UTF-16 files the basic principle is that each character is 16 bits wide, although this can be extended by surrogate pairs (two 16 bit characters) in some applications.
In UTF-8 files the first 128 characters (values 0 to 127) are the same as in ANSI files. A UTF-8 file in English therefore will look almost identical to an ANSI file. Values 128 to 255 have special meaning in UTF-8 files and are used to code characters in subsequent bytes. When representing the same characters as can be represented by Unicode UTF-16, there can be up to three such bytes, so one character could be represented by anything from one to four UTF-8 bytes.
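The one-to-four byte rule is easy to see in practice. This Python sketch uses the built-in UTF-8 encoder to count the bytes needed for a few sample characters. The helper name utf8_length and the sample characters are my own choice.

```python
def utf8_length(ch):
    """Number of bytes used to store one character in UTF-8."""
    return len(ch.encode("utf-8"))


if __name__ == "__main__":
    # Latin A, e-acute, the Euro sign, and a character needing a surrogate pair in UTF-16.
    for ch in ("A", "\u00e9", "\u20ac", "\U0001D11E"):
        encoded = ch.encode("utf-8")
        print(hex(ord(ch)), utf8_length(ch), encoded.hex())
```

Every byte of a multi-byte character has a value of 80h or more, so such bytes can never be confused with the ANSI-compatible characters below 80h.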
Most UTF-8 files contain a BOM (byte order mark) in the first three bytes of the file. This identifies them as being UTF-8 files. This BOM is required by the "Go" tools.
UTF-16 files also usually have a BOM but this is 2 bytes in such files. There are two formats of UTF-16 files, BE (Big Endian) and LE (Little Endian). The "Go" tools only work with LE files with a BOM.
There are other formats for Unicode files, but UTF-8 with BOM and UTF-16LE with BOM are the most widely used. In fact, some Unicode editors refer to UTF-16LE files with BOM merely as "Unicode".
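A tool which has to recognise these file formats can simply look at the first bytes of the file. The following Python sketch shows one way this might be done. It is not taken from the "Go" tools, whose own detection code is not shown here, and the function name detect_bom is made up.

```python
import codecs


def detect_bom(data):
    """Identify the Unicode format of a file from its first bytes."""
    if data.startswith(codecs.BOM_UTF8):        # bytes EF BB BF
        return "UTF-8"
    if data.startswith(codecs.BOM_UTF16_LE):    # bytes FF FE
        return "UTF-16LE"
    if data.startswith(codecs.BOM_UTF16_BE):    # bytes FE FF
        return "UTF-16BE"
    return None                                 # no BOM found


if __name__ == "__main__":
    sample = codecs.BOM_UTF16_LE + "DB 'Hello'".encode("utf-16-le")
    print(detect_bom(sample))
    # Read as a little-endian word, the first two bytes of an LE file give 0FEFFh.
    print(hex(int.from_bytes(sample[:2], "little")))
```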

Making a Unicode source file with a Unicode editor

You can create a Unicode source file using a Unicode editor. There are several such editors around. Some are free and some you have to purchase. Which you use might depend on the language you are using and their ease of use for programming. There are some good free editors available. You can use a search engine, but you can also look at the resources area on the Unicode Consortium website and see also the Unicode resources on Alan Wood's site. UniRed is a good free Unicode editor available from Jurij Finkel's site (follow the link to UniRed via programs, and eventually you will find an English description). And there is Yudit which is free from Gaspar Sinai's site; Vim which is free from The Vim website, and NoteXPad from dREAMtHEATER Studio. And finally, Unipad from Sharmahd Computing which is purchasable.

Personally I have found NoteXPad to be very reliable. It comes in two versions, one in English and one in Chinese. Look either for "EN" or "SC" in the downloaded filename.

When you use the Unicode editor to create a source file, you must ensure that the file is saved in the correct format for the "Go" tools, that is either UTF-8 with BOM or UTF-16LE with BOM (sometimes referred to just as "Unicode").

Unicode command line

Under NT, 2000 and XP you can use Unicode characters on the command line in the MS-DOS (command prompt) window (the "console") but this only works if you have the console font switched to "Lucida Console". You switch the console font by right clicking on the console title bar, and using "properties" and the font tab. Alternatively this can be done dynamically by making changes to the registry.

In this way you can define a word (using the /d switch) using Unicode characters.

You can also use Unicode characters for the names of your input and output files.

Note that extensions to filenames must be in Roman characters if the extension is important to the "Go" tools. For example GoRC creates an obj file if given a .res file. Here "res" cannot be in anything other than in this form. Similarly the extension of GoAsm and GoRC's output files will always be in Roman characters, for example if you assemble the file Процесса.asm this will produce a file called Процесса.obj.

GoLink looks for specified files with ".obj" and ".res" extensions for its input files, and specified ".dll", ".exe", ".drv", and ".ocx" files when looking for imports. These extensions must be in this form, but the filename itself can use any characters permitted by the operating system.

Unicode output files

GoAsm produces a list file if the /l switch is specified and GoAsm, GoRC and GoLink will send their command line output to a file if the DOS redirection character ">" is used. In the case of GoAsm and GoRC the format of the first source file is used for the output file. This could be ANSI, Unicode UTF-16LE or UTF-8. In the case of GoLink the format of the first Unicode command file is used for the output file; if there is no Unicode command file the format depends on the operating system. See the GoLink manual for details.

Unicode filenames in your source script

You may often wish to specify filenames in your source script for example for include files, raw data files and resource files. Since GoAsm and GoRC read Unicode files, it is possible to give these filenames using non-Roman characters in a source script saved in a Unicode format. Under Windows NT/2000 and XP this works reliably since GoAsm can use the Unicode versions of the file APIs. But under Windows 9x and ME, although Windows keeps the filenames in Unicode, only the ANSI versions of the file APIs are available. So GoAsm converts the Unicode name to a multi-byte name at the time the file is accessed. This is done using the character set and codepage current at that time (based on the "locale"). But this means that if a filename contains a character which is not available in the current codepage, the file will not be recognised. This might happen, for example, if the codepage has been changed since the file was made.

Unicode comments in your source script

The usual instructions and operatives in the source script must still be in English. But comments can now use non-Roman characters and so they can be in any language. This is because GoAsm and GoRC just ignore comments. On both tools, you can use the assembler comment indicator (semi-colon) or the "C" comment indicators // (single line) or /* (start comment) and */ (end comment).

Code and data labels in assembler source scripts (ASM files)

The code and data labels in the assembler source are the names given to functions and to data objects.
Within the modules used to develop the program, Unicode labels can be used freely. For example, suppose you have a function which has זכוכית: as its label. This is a Hebrew name. Incidentally note that although the word should be read from right to left, GoAsm still expects the colon to be on the right hand side of the word. Now this label cannot be represented in ANSI, but provided your source script is in Unicode (UTF-16 or UTF-8 format) this label will appear properly in your Unicode editor when you write the source script. And you will be able to call this function using CALL זכוכית from within the same module (ie. the same source script and the same object file).
Suppose however, that the module which needs to call the function and the function itself are contained in different modules. Then we need to be satisfied that the label is kept properly in the object file so that it can be called using Unicode. You can do this with GoAsm because it automatically keeps all Unicode labels in the object file in UTF-8 format. In fact this is the only practical way to keep these labels using COFF object files. This is because in COFF object files and COFF executables all labels (symbols) are kept as null terminated ANSI strings. For this reason the UTF-16 format cannot be used (because zeroes are permitted within a UTF-16 string). However, UTF-8 format can be used (because zeroes are not permitted within a UTF-8 string). Note: as far as I am aware, there is no agreed standard for keeping Unicode non-resource symbols in COFF object and executable files, but it does seem likely that UTF-8 will be the format used.
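The reason UTF-16 cannot be used for null terminated symbols can be demonstrated in a few lines of Python. In this sketch the helper name usable_as_coff_symbol and the label MyLabel are invented for the illustration; the Hebrew label is the one used above.

```python
def usable_as_coff_symbol(encoded):
    """A null terminated symbol must not contain a zero byte before its end."""
    return b"\x00" not in encoded


if __name__ == "__main__":
    for label in ("MyLabel", "\u05d6\u05db\u05d5\u05db\u05d9\u05ea"):
        print(label,
              "UTF-16LE ok:", usable_as_coff_symbol(label.encode("utf-16-le")),
              "UTF-8 ok:", usable_as_coff_symbol(label.encode("utf-8")))
```

Note that a label made up only of Hebrew characters happens to contain no zero bytes in UTF-16 either, but this cannot be relied upon: any Roman letter, digit or underscore in the label would introduce one.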

Resource IDs from resource scripts (RC files)

It is fairly straightforward to have resource IDs in Unicode, since unlike code and data labels they are always kept in UTF-16 format in the executable anyway. This means you can have a resource with a named ID which uses non-Roman Unicode characters. The resource can then be retrieved using the Unicode version of the API FindResource (that is, FindResourceW) and LoadResource. Unfortunately FindResourceW is not available under Windows 9x and ME. This means that you will not be able to use non-Roman Unicode characters to name your resource IDs when writing for that platform. Instead you will need to use characters which translate exactly to ANSI so that under Windows 9x or ME, FindResourceA can be used instead.

Exporting Unicode named functions from your executable or DLL and calling them from other executables

As has been seen above, with GoAsm the Unicode function label is kept in the object file in UTF-8 format. When the object file (and possibly other object files too) are made into executable files (either an "exe" file or a DLL) if any of the functions are to be exported, the linker will place the label of the function in the Export Name Table in the executable. Again names here are kept as null terminated strings. Hence they must be kept in UTF-8 format. Now the significance of this is that the executable which calls the relevant function at run-time must also call the function using the UTF-8 format. Otherwise the name will not be recognised. This means that the name of the function to be called must also appear in the Hint/Name Table of the Import Directory of the calling executable in UTF-8 format. As referred to above, GoAsm ensures that this is done by keeping the label in the symbol table in UTF-8 format. The linker should keep that format and use it in the Import Directory information. GoLink does this.

Unicode defined words in your source script

GoAsm and GoRC expect their instructions to be in English in the source script, but you can use macros and equates to make your source script more understandable in your own language if you wish, for example:-
функция equ invoke
пересылка equ mov
ПолучитьХендлМодуля equ GetModuleHandle
МодальноеДиалоговоеОкно equ DialogBoxParam
ИнициализацияДиалога equ WM_INITDIALOG
СообщКоманда equ WM_COMMAND

How Unicode labels and IDs appear in the debugger

It may be that your debugger does not display Unicode names for IDs and labels (symbols) properly. GoBug can do so.

Conversion methods used by GoAsm

GoAsm uses its own algorithm to convert Unicode UTF-16 text to UTF-8 and vice versa.
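GoAsm's own conversion code is not reproduced here, but the general shape of a UTF-16 to UTF-8 conversion can be sketched in Python as follows. This is my own illustration of the standard bit layout, not GoAsm's algorithm, and the function name utf16_units_to_utf8 is invented. An unpaired surrogate is simply passed through as a three-byte sequence, which a stricter converter would reject.

```python
def utf16_units_to_utf8(units):
    """Convert a sequence of 16-bit UTF-16 values to UTF-8 bytes."""
    out = bytearray()
    i = 0
    while i < len(units):
        cp = units[i]
        i += 1
        # Combine a high surrogate and the low surrogate after it into one value.
        if (0xD800 <= cp <= 0xDBFF and i < len(units)
                and 0xDC00 <= units[i] <= 0xDFFF):
            cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        if cp < 0x80:                       # one byte, same as ANSI
            out.append(cp)
        elif cp < 0x800:                    # two bytes
            out.append(0xC0 | (cp >> 6))
            out.append(0x80 | (cp & 0x3F))
        elif cp < 0x10000:                  # three bytes
            out.append(0xE0 | (cp >> 12))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
        else:                               # four bytes
            out.append(0xF0 | (cp >> 18))
            out.append(0x80 | ((cp >> 12) & 0x3F))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
    return bytes(out)


if __name__ == "__main__":
    # The Euro sign followed by a surrogate pair.
    print(utf16_units_to_utf8([0x20AC, 0xD834, 0xDD1E]).hex())
```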
Converting Unicode text to ANSI and vice versa requires knowledge of the correct codepage to use. This is because ANSI characters above 7Fh (up to 0FFh) vary depending on the codepage. When making such conversions, the "Go" tools simply use the Windows APIs MultiByteToWideChar and WideCharToMultiByte. These are set to work with the current codepage for the machine. By default (for the English version of Windows) this is set to the Windows codepage. But you can alter it by changing the regional settings on your machine. If your application relies on using a particular codepage you should ensure that the machine running the "Go" tools is set to that codepage.


Part 2:
Using Unicode strings in DATA

We have seen how to use Unicode resources, and strings which are in a resource file. But now GoAsm provides a much easier way to keep and use your Unicode strings - in data as you would with any string. These strings then go into the object file and, when linked, into the executable in their original Unicode format, ready to be given to the Unicode APIs.

Controlling the string format

GoAsm needs to be told how to deal with the strings in your source script. Unless told otherwise, GoAsm will put strings into the object file in either ANSI format or Unicode (UTF-16) format, depending on the format in which they are held in the source script. So unless told otherwise GoAsm will convert the strings in a Unicode UTF-8 source script to UTF-16. The reason for this is that most of the Windows APIs work with UTF-16 strings rather than UTF-8 ones (GoRC works the same way and always puts strings into the RES file in UTF-16 format as required by Windows). You can rely on this behaviour in the usual case. For example, with a source script saved in Unicode format (UTF-16 or UTF-8) you can send a Unicode string to the API MessageBoxW using this code:-
PUSH 40h                 ;information + ok button
PUSH ADDR TITLE_TEXT     ;pointer to text for the title
PUSH ADDR STRING_23      ;pointer to string 23 in data
PUSH [hWnd]              ;to be child of main window
CALL MessageBoxW         ;call Unicode version of MessageBox
MessageBoxW requires that the Unicode string is in UTF-16 format and also null terminated (which must be a 16-bit null). To comply with this (in a Unicode file) you would declare the string as follows (this example in Greek):-
STRING_23 DB 'Μπορώ να φάω σπασμένα γυαλιά χωρίς να πάθω τίποτα'
          DW 0                ;null terminator
Clearly, to use this default method it is important that you are aware of the format of your source script.

Using DUS (declare Unicode sequence)

DUS (declare Unicode sequence) allows you to declare a string along with control characters. Here is an example for a string on two lines:-
STRING4 DUS "The strings, new line chars",0Dh,0Ah,"and null are in UTF-16",0
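
As a rough model of what such a declaration amounts to, the following Python sketch builds the byte sequence by hand. The helper name declare_unicode_sequence is hypothetical and exists only for this illustration; the sketch assumes that every quoted string becomes UTF-16 (little-endian) and every number becomes a 16-bit value.

```python
def declare_unicode_sequence(*items):
    """Hypothetical model of a DUS line: every item ends up as 16-bit units."""
    out = bytearray()
    for item in items:
        if isinstance(item, str):
            out += item.encode("utf-16-le")       # quoted text as UTF-16
        else:
            out += item.to_bytes(2, "little")     # control character or null, widened to 16 bits
    return bytes(out)

first = "The strings, new line chars"
second = "and null are in UTF-16"
STRING4 = declare_unicode_sequence(first, 0x0D, 0x0A, second, 0)

# Two bytes for every character, plus two each for CR, LF and the null.
print(len(STRING4) == 2 * (len(first) + len(second) + 3))
```

The point to note is that the carriage return, line feed and terminating null each take two bytes, just like the characters of the text.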

L', A' and 8' overrides

One of the ways to instruct GoAsm to put the string into the object file in a particular format is to use the L', A' and 8' overrides. Note that you can use double quotes (eg. L") instead of single (this enables you to use the apostrophe inside the double quotes) and you can use triple quotes (eg. L"""hello""") if you want the string to appear in the object file as a quoted string.

The overrides work like this:-
L' override - string will always be put in the object file in Unicode UTF-16 format
A' override - string will always be put in the object file in ANSI format
8' override - string will always be put in the object file in Unicode UTF-8 format
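
To make the difference between the three formats concrete, here is a small Python sketch which encodes one sample word in each of them. It only illustrates the resulting bytes; for the ANSI case the Cyrillic codepage 1251 is an assumption made for this example, since the real result depends on the codepage of the machine.

```python
text = "Мир"                        # sample string; any text would do

as_L = text.encode("utf-16-le")     # like the L' override: UTF-16, two bytes per character here
as_8 = text.encode("utf-8")         # like the 8' override: UTF-8, two bytes per Cyrillic character
as_A = text.encode("cp1251")        # like the A' override: ANSI, assuming codepage 1251

for name, data in (("L'", as_L), ("8'", as_8), ("A'", as_A)):
    print(name, len(data), data.hex())
```

The same three characters occupy six bytes in UTF-16, six in UTF-8 and three in ANSI, with entirely different byte values in each case.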

Overriding using the STRINGS directive

If you need to use a lot of L" or A" overrides in your source script, you may find it easier to override using the STRINGS directive. This tells GoAsm to convert strings to the required format for the remainder of the file (or until another STRINGS directive is encountered). You use this directive as follows:-
STRINGS UNICODE  ;from this point all strings are converted to UTF-16
STRINGS ANSI     ;from this point all strings are converted to ANSI
The STRINGS directive works on all quoted strings in the file, but it should not be used on filenames (for example after #include or INCBIN). This is because such filenames may have to be converted to the correct format to suit the system and this is a job for GoAsm itself. See using Unicode filenames.

Note that as a directive to GoAsm, STRINGS is only recognised in the source script and in "a" extension include files. It will not be recognised in non "a" include files because they are not treated as assembler source scripts, but merely as lists of defined words (equates, macros and structs).

Declaring data using the overrides

Here is an example of a null-terminated Unicode string in UTF-16 format (suitable for giving to the Windows APIs which deal with strings), made using the L' override:-
STRING_23 DB L'Μπορώ να φάω σπασμένα γυαλιά χωρίς να πάθω τίποτα'
          DW 0                ;null terminator
Note that UTF-16 needs two null bytes to terminate the string properly (the characters are 16 bits each). So instead of using DB you could use DW with one null terminator:-
STRING_28 DW L'Convert me to UTF-16',0
Using the STRINGS directive means you don't need the L' override:-
STRINGS UNICODE            ;all strings from now on to be UTF-16
STRING_30 DW 'nem lesz tőle bajom',0    ;null-terminated UTF-16 string
Even in a Unicode program there are occasional strings which need to be in ANSI format. For this, you can use the A' override, for example:-
STRING_36 DB A"This string will always be in ANSI"
Sometimes you may want to convert a string to UTF-8 format. To do this you can use the 8' override, for example:-
STRING_40 DB 8"The strings, new line chars",0Dh,0Ah,8"and null are in UTF-8",0
You can use the 8" override here with DB (without needing DW or DUS) because the control characters in UTF-8 always correspond with their ANSI values.
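
This property of UTF-8 is easy to check. The Python sketch below is only an illustration of the byte values, not of GoAsm's workings: it shows that the control characters are single bytes in UTF-8 but 16-bit values in UTF-16.

```python
controls = "\r\n\0"                          # 0Dh, 0Ah and the null terminator

controls_utf8 = controls.encode("utf-8")     # one byte each, same values as in ANSI
controls_utf16 = controls.encode("utf-16-le")  # two bytes each

# The whole of STRING_40 can therefore be laid out byte by byte, as DB does.
STRING_40 = ("The strings, new line chars".encode("utf-8") + bytes([0x0D, 0x0A])
             + "and null are in UTF-8".encode("utf-8") + bytes([0]))

print(controls_utf8.hex(), controls_utf16.hex())
```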

Pushing null terminated Unicode strings

In the same way as using strings in data, GoAsm's useful PUSH string instruction can also be used with Unicode strings, for example (using Russian this time):-
STRINGS UNICODE
PUSH 40h                 ;information + ok button
PUSH 'На берегу пустынных волн'   ;pointer to title
PUSH 'Стоял он, дум великих полн' ;pointer to text
PUSH [hWnd]              ;to be child of main window
CALL MessageBoxW
You can also push pointers to Unicode strings using INVOKE instead of CALL. This example is in Hungarian and uses the L' override:-
INVOKE MessageBoxW, [hWnd], L'Meg tudom enni az üveget', \
                            L'nem lesz tőle bajom', \
                            40h

Using the correct string in quoted immediates

Quoted immediates work in the same way as ordinary quoted strings for example:-
STRINGS UNICODE
CMP AX,'л'
will compare AX with the UTF-16 value for the Cyrillic character 'л' which is 43Bh.

However,

CMP AX,8'л'
here the value in quotes will be 0BBD0h instead (first byte 0D0h, second 0BBh) since this is the UTF-8 value for 'л'.

And like ordinary quoted strings, a quoted immediate in a Unicode source script saved in UTF-8 format would be converted to UTF-16 if neither the STRINGS directive nor an override is used.
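
The two values quoted above can be confirmed with a few lines of Python. This is just a check of the arithmetic: the encoded bytes are read as a little-endian number, which is how they would sit in AX.

```python
ch = "л"

# Value compared when the character is taken as UTF-16.
utf16_value = int.from_bytes(ch.encode("utf-16-le"), "little")

# Value compared when the character is taken as UTF-8 (bytes D0h, BBh in memory order).
utf8_value = int.from_bytes(ch.encode("utf-8"), "little")

print(hex(utf16_value), hex(utf8_value))   # prints 0x43b 0xbbd0
```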

Using Unicode strings in structures

The overrides work in the same way in structures for example:-
MyStruct STRUCT
   DB 0
   DW L"I am a null terminated Unicode string",0
   DB 0
ENDS
;
Label64 MyStruct              ;apply the structure template
Irrespective of the state of the STRINGS directive the string is in Unicode UTF-16 format because of the L" override. But if there was no override in the string, the state of the STRINGS directive at the time the structure template was applied (at Label64) would be used.

If the initialisation of the structure is itself overridden, that takes priority, for example:-

UP STRUCT
   DB L'I would prefer to forget this'
ENDS
DontGive UP <8"I won't let you">
Here the string is in Unicode UTF-8 format because although the structure contains the L' override, that is itself overridden by the 8" override.
Also, since the UTF-8 string is shorter than the original string, the string is padded with nulls. Had the UTF-8 string been longer than the original string it would have been truncated.
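
The padding and truncation can be pictured with the following Python sketch. The helper fit_to_field is hypothetical and simply models "make the data exactly as long as the member"; it assumes the member's size is the size of the UTF-16 string in the template.

```python
def fit_to_field(data: bytes, field_size: int) -> bytes:
    """Pad with nulls, or truncate, so that data occupies exactly field_size bytes."""
    return data[:field_size].ljust(field_size, b"\x00")

template = "I would prefer to forget this".encode("utf-16-le")   # L' string in the template
field_size = len(template)                                       # 29 characters, so 58 bytes

initialiser = "I won't let you".encode("utf-8")                  # 8" string, 15 bytes
stored = fit_to_field(initialiser, field_size)

print(len(initialiser), len(stored))    # prints 15 58
```

Note that truncating at a byte boundary, as this sketch does, could cut a multi-byte UTF-8 character in half, so it is safer to make sure the initialising string fits.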

When the overrides apply

The STRINGS directive applies to macros, equates and structures when they are applied in the source script, and not when they are declared. Strings in data are different because they are applied and established in data when they are declared. So for example:-

STRINGS ANSI
MyString1="I could be either"      ;an equate
MyString2 DB "I couldn't"          ;an ANSI data declaration
STRINGS UNICODE
PUSH MyString1                     ;push pointer to Unicode string
PUSH ADDR MyString2                ;use the ANSI string
But the override will always be effective against the STRINGS directive regardless:-
STRINGS ANSI
MyString1=A"I know what I am"      ;an ANSI string equate
MyString2 DB L"So do I"            ;a Unicode data declaration
STRINGS UNICODE
PUSH MyString1                     ;push pointer to ANSI string
PUSH ADDR MyString2                ;use the Unicode string

Using SIZEOF on Unicode strings

SIZEOF label reports the number of bytes from the specified label to the next label (or section end). When working with Unicode strings, it works the same way. It does not report the number of characters. So, for example:-
MyString DW 'Здравствуй Мир (Hello World in Russian)',0
MyStringSize DD SIZEOF MyString
In MyStringSize you will find the value 80 if the string is in Unicode UTF-16 (that is 39 16-bit words for the string itself plus a 16-bit null).

If you use structures or equates with strings which change size depending on whether you are making a Unicode or ANSI version of your program using a Unicode/ANSI switch, ensure that the correct switch is set when SIZEOF is used. This is because the size in bytes of the structure, structure member, or data declaration is measured at the time when SIZEOF is used, rather than at any other time.
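
The byte counts in this section can be verified with a short Python calculation. It only checks the arithmetic; the ANSI figure assumes a single-byte codepage able to hold the Cyrillic letters (codepage 1251 is used here as an example).

```python
text = "Здравствуй Мир (Hello World in Russian)"

characters = len(text)                                  # number of characters
size_utf16 = len((text + "\0").encode("utf-16-le"))     # 16-bit words plus a 16-bit null
size_ansi = len((text + "\0").encode("cp1251"))         # single bytes plus a one-byte null

print(characters, size_utf16, size_ansi)    # prints 39 80 40
```

So the same declaration reports a size of 80 in a Unicode build and 40 in an ANSI build, which is why the switch must be in the right state when SIZEOF is used.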


Part 3: Unicode/ANSI switching

What is Unicode/ANSI switching and why is it needed?

Although Unicode applications are likely to become the norm, it is also likely that ANSI applications will continue to be written. One reason is that applications using a lot of text in Roman characters will be much more compact than their Unicode counterparts. Unless they are speed-critical applications, they will not need to be written in Unicode at all, since they will not require the full character set.
Another reason, which will apply for some time yet, is that Windows 95, 98 and ME do not contain most of the Unicode APIs. This means that an application which uses Unicode APIs will not run on those platforms. Conversely, an application which uses only the ANSI APIs will run not only on Windows 95, 98 and ME but also on Windows NT/2000 and XP. For this reason many applications are written solely as ANSI versions so that only one version working on all Windows platforms is needed. This works fine for applications which have no need to use non-Roman characters.
However, when an ANSI program runs on NT/2000 and XP, the system has to convert API string input to Unicode, since it uses the Unicode version of the API, and also convert API string output back to ANSI. This not only slows down the application but also seems rather cumbersome.

Switching using run-time loading

One way to get the best out of both the Windows 9x/ME and NT/2000/XP platforms with just one version of your application is to use run-time loading. Such an application would have the following features:-
  • On start-up the application would check which platform it was running on by calling GetVersionExA (note that GetVersionExW should not be called since it is not implemented in Windows 9x/ME).
  • The application would not call any APIs directly which were not available under Windows 9x/ME. The reason for this is that when an application is loaded by the system, the loader writes into the application itself as held in memory, the addresses of the APIs which the application uses. If the system loader cannot find a particular API (or DLL which is supposed to contain that API) the application will not start at all.
  • Because of the above fact, if running under NT/2000/XP, the application must obtain the addresses of the Unicode APIs at run-time using run-time loading. In fact, this is a fairly simple and well-established procedure, using the APIs LoadLibrary and GetProcAddress. Note that if you were sure a particular DLL would be loaded by the system anyway (because you have called one of its APIs directly) then you would use GetModuleHandle instead of LoadLibrary for that DLL. An example is given in "Hello Unicode 1". In a large application the switching could be dealt with straightforwardly using tables and strings in data. An example of this is given in "Run Time Loading".
  • At run time the application would call the ANSI APIs in the usual way if running under Windows 9x/ME and the Unicode APIs using the addresses which it found at start-up if running under NT/2000/XP. Note that not every API needs to be dealt with in this way - many of them are available under both platforms.
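
The table-driven idea behind these steps can be sketched in Python as follows. This is a simplified model and not the code from "Run Time Loading": build_call_table and fake_loader are hypothetical names, the loader is a stand-in for the LoadLibrary/GetProcAddress step, and for brevity the sketch resolves the ANSI names through the same loader even though the text above calls them in the usual way.

```python
def build_call_table(is_nt, api_names, get_proc_address):
    """Map each base API name to a callable chosen for the running platform.

    get_proc_address stands in for the LoadLibrary/GetProcAddress step.
    """
    suffix = "W" if is_nt else "A"        # Unicode version on NT/2000/XP, ANSI on 9x/ME
    return {name: get_proc_address(name + suffix) for name in api_names}

def fake_loader(export_name):
    """Demonstration only: returns a function which reports the name it was resolved from."""
    return lambda *args: export_name

table = build_call_table(is_nt=True,
                         api_names=["MessageBox", "CreateWindowEx"],
                         get_proc_address=fake_loader)

print(table["MessageBox"]())     # prints MessageBoxW
```

The platform check is made once at start-up, and the rest of the application calls through the table without needing to know which version it is using.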

Switching using the same source

A developer may still want to make two versions of the application, one in ANSI and one in Unicode. This might be to avoid the system's Unicode/ANSI conversions at run-time, to make full use of the added functionality of NT/2000 and XP, or to permit different language versions with non-Roman characters on the NT/2000 and XP platform. This is where Unicode/ANSI switching comes in. Using Unicode/ANSI switching it is possible to have just one version of the source script, but at compile-time you tell the assembler whether to make an ANSI version of the application or a Unicode version, using conditional assembly.

This is the switching which may be needed in such a source script or in include files:-

  • Switching of APIs which have both Unicode and ANSI versions
  • Switching of constants
  • Switching of quoted strings and immediates
  • Switching of string sequences with control characters
  • Switching of the size of a write-to or read-from memory instruction by switching the type indicator
  • Switching of the size of a character to get structure members and data buffers to the correct size
  • Switching of initialisation of data

How switching is achieved

Traditionally, switching between Unicode and ANSI is done by defining the word UNICODE and then testing for this by using the conditional assembly operatives #ifdef or #ifndef ("if defined" and "if not defined" respectively). Using this method, it is possible to tell the assembler to look at only those parts of the source script required for the version of the program being made.

In GoAsm you can define UNICODE in one of two ways, either by adding /d to the command line for example:-

GoAsm Myfile /d UNICODE
or, by adding this line early in the source script:-
#define UNICODE
In both cases GoAsm regards UNICODE as defined, and also as the value one despite no value being given for it.

You can then use this switch like this:-

#ifdef UNICODE
                ;assemble these lines
                ;if UNICODE is defined
                ;then jump to #endif
#else
                ;assemble these lines
                ;if UNICODE is not defined
#endif
In a typical situation you would establish these switches:-
#ifdef UNICODE
AW=W             ;switch for APIs
STRINGS UNICODE  ;switch for quoted strings
DSS=DUS          ;switch for string sequences
S=2              ;type indicator and character size switch
#else
AW=A
STRINGS ANSI
DSS=DB
S=1
#endif
Using #ifdef means that if you needed to switch back to ANSI part way through your source script you could undefine UNICODE using:-
#undef UNICODE
There are other ways of using conditional assembly to switch between Unicode and ANSI versions.
See other ways to achieve switching.

Unicode/ANSI switching of APIs

GoAsm permits the ##AW switch after the name of an API for example:-
CALL DefWindowProc##AW
The conditional assembly switch is:-
#ifdef UNICODE
AW=W
#else
AW=A
#endif
Using this code, if UNICODE is defined, DefWindowProcW would be called. Otherwise, DefWindowProcA would be called. The double hash causes the two elements either side of it to combine.

You don't have to use AW in the switch - you could use anything you like, for example

CALL DefWindowProc##SWITCH
A number of APIs do not have A or W versions (in which case only one version of the API applies both to ANSI and Unicode programs). With these APIs you would not use the ##AW switch for example:-
CALL WriteFile
Users have their preferred methods of switching and there are many. See other ways of switching APIs.

Switching of constants

Examples of constants which need to be switched are the Windows messages which have ANSI and Unicode versions. This, for example, is taken from the Windows header file CommCtrl.h in the SDK (WM_USER being a constant of 400h):-
#define TB_ADDBUTTONSA          (WM_USER + 20)
#define TB_INSERTBUTTONA        (WM_USER + 21)
#define TB_INSERTBUTTONW        (WM_USER + 67)
#define TB_ADDBUTTONSW          (WM_USER + 68)

#ifdef UNICODE
#define TB_INSERTBUTTON         TB_INSERTBUTTONW
#define TB_ADDBUTTONS           TB_ADDBUTTONSW
#else
#define TB_INSERTBUTTON         TB_INSERTBUTTONA
#define TB_ADDBUTTONS           TB_ADDBUTTONSA
#endif
GoAsm can read this, but here is an alternative using the switched double hash:-
TB_ADDBUTTONSA   = WM_USER+20
TB_INSERTBUTTONA = WM_USER+21
TB_INSERTBUTTONW = WM_USER+67
TB_ADDBUTTONSW   = WM_USER+68

TB_INSERTBUTTON  = TB_INSERTBUTTON##AW
TB_ADDBUTTONS    = TB_ADDBUTTONS##AW
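
The effect of the switched double hash on these constants can be modelled in Python. The function switched is a hypothetical stand-in for the assembler's name joining: it appends W or A to the name and looks up the result.

```python
WM_USER = 0x400

constants = {
    "TB_ADDBUTTONSA":   WM_USER + 20,
    "TB_INSERTBUTTONA": WM_USER + 21,
    "TB_INSERTBUTTONW": WM_USER + 67,
    "TB_ADDBUTTONSW":   WM_USER + 68,
}

def switched(name, unicode_build):
    """Model of NAME##AW: join the name with W or A, then look up the joined name."""
    aw = "W" if unicode_build else "A"
    return constants[name + aw]

print(switched("TB_ADDBUTTONS", True), switched("TB_ADDBUTTONS", False))   # prints 1092 1044
```

The advantage over the header file approach is that one line per constant is enough, whichever version of the program is being made.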

Switching quoted strings and immediates

Here we are switching the STRINGS directive using:-
#ifdef UNICODE
STRINGS UNICODE
#else
STRINGS ANSI
#endif
If your source script is always saved in ANSI format (ie. as an ordinary text file) you could leave out the STRINGS ANSI line, but it would be required if you save your source script in one of the Unicode formats. The switch then allows you to switch either declared or PUSHed strings like this:-
ST36 DB "This string can be in Unicode or ANSI"
INVOKE MessageBox##AW,[hWnd],'Hello switched user',ADDR ST36,40h
or quoted immediates, for example if the register here is also switched using SWREG to be either AX or AL:-
CMP SWREG,'a'
This compares AX with the UTF-16 value for 'a' in a Unicode build, and AL with the ANSI value for 'a' in an ANSI build.

See more about the STRINGS directive and other ways of switching strings.

Switching string sequences with control characters

Here we might use this switch:-
#ifdef UNICODE
DSS=DUS
#else
DSS=DB
#endif
You don't have to use DSS for this switch; it can be anything you like.
Now if you declare this string in data, or in a structure:-
MyLabel DSS 'I am a string of varying character',0Dh,0Ah,0
When making a Unicode program the DSS translates to DUS "declare Unicode sequence", so that the string would be put in the object file in Unicode UTF-16 format, together with the line end characters and the null terminator in their 16-bit forms. But when making an ANSI program the DSS is DB, so that the string and the control characters are in 8-bit forms.

Using the switched type indicator

The letter S is reserved as a type indicator in all situations when GoAsm might expect to find one. So you can have this switch:-
#ifdef UNICODE
S=2
#else
S=1
#endif
S can be switched to the equivalent of any of the pre-defined type indicators that is B, W, D, Q or T. In this case it is switched either to W (value 2) or to B (value 1). Therefore you can control the size of the instruction with it, for example:-
MOV S[EDI],0         ;insert a single zero if ANSI, double if Unicode
INC S[COUNT]         ;increment byte at COUNT if ANSI, word if Unicode
LOCAL CharUnder:S    ;make local byte if ANSI, word if Unicode
LOCAL BUFFER[256]:S  ;make 256 byte local buffer if ANSI, 512 if Unicode
You may prefer to use the following switch which has the same effect as the one above, but emphasises the fact that S can be switched to B, W, D, Q or T:-
#ifdef UNICODE
S=W
#else
S=B
#endif

Using a character size switch

The switch used for the type indicator can also double-up as a character size switch. Here is the switch again:-
#ifdef UNICODE
S=2
#else
S=1
#endif
And this allows you to do this:-
ADD EDI,S               ;increment EDI by 1 if ANSI, 2 if Unicode
HLabel DB 256*S DUP 0   ;make 256 byte buffer if ANSI, 512 if Unicode
The same idea can be used in formal structures, for example:-
NMTTDISPINFO STRUCT
  hdr               NMHDR
  lpszText          DD
  szText            DB S*80 DUP 0   ;80 bytes if ANSI, 160 if Unicode
  hinst             DD
  uFlags            DD
  lParam            DD
NMTTDISPINFO ENDS
You don't have to use S as the character size switch. Traditionally the word CHAR or TCHAR is used, and this is used in many include files for switching.
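
To see what the character size switch does to the structure above, the sizes can be worked out in Python. This is only a calculation under stated assumptions: a 32-bit layout with no padding, NMHDR taken as three dwords, and a hypothetical helper name nmttdispinfo_size.

```python
import struct

def nmttdispinfo_size(unicode_build):
    """Size in bytes of the structure for an ANSI or a Unicode build (32-bit layout assumed)."""
    s = 2 if unicode_build else 1        # the S switch: bytes per character
    # hdr as three dwords, lpszText, szText of 80 characters, then hinst, uFlags, lParam
    layout = f"<3I I {80 * s}s 3I"
    return struct.calcsize(layout)

print(nmttdispinfo_size(False), nmttdispinfo_size(True))    # prints 108 188
```

Only szText changes size; the pointer and the other members are the same in both versions.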

Other ways to achieve switching

You may prefer to make the value of UNICODE more obvious, by specifically defining it as 1:-
GoAsm Myfile /d UNICODE=1
or
#define UNICODE 1
or
UNICODE=1
or the EQU equivalent.

You can then use this switch:-

#if UNICODE=1
                ;assemble these lines
                ;if UNICODE is "on"
                ;then jump to #endif
#else
                ;assemble these lines
                ;if UNICODE is "off"
#endif
To revert to ANSI you would then use:-
#define UNICODE 0
or
UNICODE=0
or the EQU equivalent.

Another way to switch strings

As an alternative to switching the STRINGS directive, strings can be switched using this conditional definition with arguments and double hashes (which cause the two elements to combine):-
#ifdef UNICODE
#define TEXT(x) L##x
#else
#define TEXT(x) x
#endif
This would then be used as follows:-
MyLabel DB TEXT("The string")

Other ways to switch APIs

There are several ways to switch the A or W APIs. Some traditional ways of switching unfortunately require lists of the switchable APIs in include files. Suggested methods of that type are:-
#ifdef UNICODE
#define DefWindowProc DefWindowProcW
#define MessageBox    MessageBoxW
#else
#define DefWindowProc DefWindowProcA
#define MessageBox    MessageBoxA
#endif
;
CALL DefWindowProc         ;switched depending on whether UNICODE defined
CALL MessageBox            ;switched depending on whether UNICODE defined
Or you can use:-
#ifdef UNICODE
AW=W
#else
AW=A
#endif
;
DefWindowProc = DefWindowProc##AW
MessageBox    = MessageBox##AW
;
CALL DefWindowProc         ;switched depending on whether UNICODE defined
CALL MessageBox            ;switched depending on whether UNICODE defined
There are various methods of switching using macros and arguments using the double hash. In this example, an argument is joined up with a non-argument to ensure that the correct call is made (DefWindowProcA or DefWindowProcW) depending on the program version:-
#ifdef UNICODE
AW(a)=a##W
#else
AW(a)=a##A
#endif
;
CALL AW(DefWindowProc)
And here is yet another variation on the same theme:-
#ifdef UNICODE
CALLAW(a)=CALL a##W
#else
CALLAW(a)=CALL a##A
#endif
;
CALLAW (DefWindowProc)
Note that an API switch is only suitable for those APIs which have different ANSI and Unicode versions. A list of APIs has the advantage that you know which APIs need switching. However, the linker will soon tell you if you mistakenly try to use an API's A or W version which doesn't exist.

Testing the STRINGS directive

Whether natural quoted strings (ie. without overrides) will be converted to Unicode or not can be tested using the #if conditional assembly operative:-
#if STRINGS UNICODE          ;or #if STRINGS ANSI
This tests the current state of the STRINGS directive if it has been declared and will assemble the following lines if the statement is true (up to the next #else, #elseif or #endif). Note that if the STRINGS directive has not been declared, this test reports whether the source script is in Unicode format or not (since absent the STRINGS directive, the format governs what happens to natural quoted strings).


Part 4:
Summary of Unicode support in the "Go" tools


GoAsm
Input file formats supported:-
ANSI, Unicode UTF-8 with BOM and UTF-16LE with BOM

Output:-
Output to the console will be in Unicode if permitted by the system; list file and any file to receive redirected output will be in the same format as the first input file.

Unicode can be used in:-
filenames in include files and raw data files;
filenames in input and output files via the console (command line);
defined words in the command line;
defined words in source or include files (equates, structs and macros);
comments;
data labels (permitting access to data using Unicode);
code labels (permitting calls to functions using Unicode names, imports and exports to other executables);
strings (see below).

Declaring strings in data:-
DB "..." - (default action) in an ANSI file, no conversion;
- in a Unicode file, put in data in UTF-16 format;
and to override the default action:-
STRINGS UNICODE - strings always to be in UTF-16 format for the rest of the file
STRINGS ANSI - strings always to be in ANSI format for the rest of the file
and irrespective of the STRINGS directive:-
DUS "....",0Dh,0Ah,0 - declare sequence in data in UTF-16 format (string and control characters)
DB L"..." - declare string in data in UTF-16 format
DW L"..." - declare string in data in UTF-16 format
DB 8"..." - declare string in data in UTF-8 format
DB A"..." - declare string in data in ANSI format

Pushing pointers to null terminated strings in data:-
PUSH "....." - (default action) in ANSI file no conversion;
- in a Unicode file, push pointer to null-terminated Unicode string in data in UTF-16 format
and to override the default action:-
STRINGS UNICODE - push strings to be in UTF-16 format for the rest of the file
STRINGS ANSI - push strings to be in ANSI format for the rest of the file
and irrespective of the STRINGS directive:-
PUSH L"....." - push pointer to null terminated string in UTF-16 format
PUSH 8"....." - push pointer to null terminated string in UTF-8 format
PUSH A"....." - push pointer to null terminated string in ANSI format

Quoted immediates:-
MOV EAX,'a' - (default action) in ANSI file no conversion (put into EAX the character value of 'a' in the current codepage);
- in a Unicode file, put Unicode character value for "a" into EAX;
and to override the default action:-
STRINGS UNICODE - use Unicode character values for the rest of the file
STRINGS ANSI - use ANSI character values for the rest of the file
and irrespective of the STRINGS directive:-
MOV EAX,L'a' - use Unicode character value
MOV EAX,8'a' - use UTF-8 character value
MOV EAX,A'a' - use ANSI character value

Switching using conditional assembly:-
Switching Unicode/ANSI APIs - various methods, aided by the double hash
Switching of the STRINGS directive
Switching of Unicode or ANSI character sequences
Switched type "S" to switch as B,W,D,Q or T type indicator
Switched character size indicator



GoRC
Input file formats supported:-
ANSI, Unicode UTF-8 with BOM and UTF-16LE with BOM

Output:-
Output to the console will be in Unicode if permitted by the system; file to receive redirected output will be in the same format as the first input file.

Unicode can be used in:-
filenames in include files, resource files (eg. icons and bitmaps) and raw data files;
filenames in input and output files via the console (command line);
defined words in the command line;
defined words in source or include files (equates and macros);
comments;
resource IDs;
resource types;
strings (see below).

Strings in version resource, stringtables, dialogs, menus, controls and user-defined resources:-
always converted to Unicode UTF-16 format if not in that format already, escape sequences allowed and number conversion carried out (see GoRC manual)

Strings in RCDATA resource (raw data in resource script)
(default action) kept in the same format as the source script, escape sequences allowed and number conversion carried out (see GoRC manual)
L"....." - string converted to UTF-16 format if not in that format already


GoLink
Input commands:-
Command line in console (MS-DOS command prompt):-
accepts Unicode filenames if supported by the operating system.
Command files:-
can be ANSI, Unicode UTF-8 with BOM and UTF-16LE with BOM

Output:-
Output to the console will be in Unicode if permitted by the system; file to receive redirected output will be in the same format as the first Unicode command file. If there is no Unicode command file the format depends on the operating system - see the GoLink manual.


More information, references and links

The Unicode Consortium official site for Unicode news, updates, information and links
Alan Wood's Unicode resources a must-visit site kept up to date with information and resources
Dr International Microsoft's international site for developers - clear and understandable
SIL (Summer Institute for Linguistics) computer page various tools and resources
Institute of Estonian language letter database view Unicode values, character sets, and requirements of the different languages

Newsgroups:-
MSDN international newsgroup

Various articles:-
Design a Single Unicode App that Runs on Both Windows 98 and Windows 2000 by F Avery Bishop (April 1999).
Supporting Multilanguage Text Layout and Complex Scripts with Windows NT 5.0 by F Avery Bishop, David C Brown and Davis M Meltzer (November 1998).
Get World-Ready - Fonts Global Development and Computing Portal.
Writing Win32 Multilingual User Interface Applications Global Development and Computing Portal.
Configurable Language and Cultural Settings Global Development and Computing Portal.
Character sets by Ken Fowles, Microsoft Typography site.
Unicode Fonts for Windows Computers by Alan Wood.

Other References:-
"The Unicode Standard Version 3.0", published by The Unicode Consortium.
"Unicode - A Primer" by Tony Graham, M & T Books.
Various articles in the Microsoft Development Library, Knowledge Base and Windows Software Development Kit ("SDK").
"Hello World" by Rob Pike, Ken Thompson AT & T Bell Laboratories January 1992.
"An Essay in Endian Order" 1996 by Dr William T Verts.
"Alternate formats", February 2000, Network Working Group.
"Under the Hood" (Unicode support in Windows NT) 1998 by Matt Pietrek.
"Forms of Unicode" 1999 by Mark Davis (IBM developerWorks).


Acknowledgements

Wayne J Radburn, Edgar Hansen, Daniel Fazekas, Leland M George, Dmitry Ilyin, Greg Heller, Rick at Unicode.org, and all those at the Win32Asm community. Some of the Unicode text used in this file is thanks to the Kermit project at Columbia University.


Copyright © Jeremy Gordon 2005