Introduction
What is Unicode?
Unicode is a standard for character encoding which aims to cover all the
major scripts of the world. It is published by the Unicode Consortium, a
non-profit organisation based in Mountain View, California, whose members
comprise computer corporations, software houses, academics and others
interested in the subject.
| Range | Values (hex) | Contents |
|---|---|---|
| General scripts | 0000 to 1FFF | Latin (Roman) and other non-ideographic scripts eg. Greek, Cyrillic, Armenian/Hebrew, Arabic, Oriya/Tamil, Thai/Lao, Tibetan, Canadian Aboriginal Syllabics, Mongolian |
| Symbols | 2000 to 2DFF | Eg. currency, numbers, maths, arrows, technical symbols, braille patterns, dingbats and bi-directional control codes |
| CJK phonetics and symbols | 2E00 to 33FF | Phonetic characters, punctuation marks and symbols used in Chinese, Japanese and Korean |
| CJK ideographs | 3400 to 9FFF | Ideographic Han characters unified from Chinese, Japanese and Korean sources |
| Yi syllables | A000 to A4CF | Yi syllables and Yi radicals |
| Hangul syllables | AC00 to D7A3 | Precomposed Korean Hangul syllables |
| Surrogates | D800 to DFFF | Indicate that the character is represented by two 16-bit characters |
| Private use area | E000 to F8FF | Reserved for user-defined characters |
| Compatibility and specials | F900 to FFFD | Special use characters and alternative representation of characters defined elsewhere in the Unicode standard to aid compatibility and mapping |
| Byte order marks | FFFE and FEFF | These words at the beginning of a file can be taken to signify that the file is Unicode and in the format UTF-16 big-endian (UTF-16BE) and UTF-16 little-endian (UTF-16LE) respectively |
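The surrogates row can be made concrete with a little arithmetic. The following is a minimal sketch in Python (used here only because it makes the values easy to inspect; it is not part of the "Go" tools). The function name to_utf16_units is hypothetical, and the sketch does not reject values which themselves fall inside the surrogate range.

```python
def to_utf16_units(code_point):
    """Return the 16-bit values which represent one Unicode character."""
    if code_point < 0x10000:
        return [code_point]              # fits in a single 16-bit value
    offset = code_point - 0x10000        # 20 bits remain to be stored
    high = 0xD800 + (offset >> 10)       # top 10 bits, range D800 to DBFF
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits, range DC00 to DFFF
    return [high, low]

# A Cyrillic letter needs one value; a character above FFFF needs two.
print([hex(u) for u in to_utf16_units(0x043B)])
print([hex(u) for u in to_utf16_units(0x1F600)])
```

A character in the ranges listed in the table is therefore stored as one word, while anything above FFFF is stored as a pair of words which both fall in the surrogates range.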
In the case of video displays, this was organised by a chip called a
video controller which would know how to draw a particular character
on the screen. The program would send the video controller two bytes
at a time. The first would tell the controller which character to draw,
and the second would provide the character's "attribute" (its
colour, background and whether it was flashing, bold, or underlined).
Since the character's value was contained in one byte, the number
of different characters which could be displayed was limited. In
practice the first 32 characters were used as control characters (a
carriage return value 13, and linefeed value 10 being good examples),
and value 255 was reserved, so that only 223 characters were available.
In serial communication it was even more restricted since one bit
was used for parity checking.
The actual characters capable of being drawn with these arrangements
were called a character set. The ASCII character set, which uses the
low 7 bits of the byte, could represent the usual Roman characters
(a to z, A to Z), Arabic numerals (0 to 9), and some additional characters
and punctuation marks commonly used. IBM established a character set
which used the 8th bit to add line draw characters.
To draw other characters on a screen meant the video controller
had to be told to switch character sets and draw a
different character when sent a particular value. So different
character sets were established to represent those. But here was the
first problem. Since text was drawn on a byte-by-byte basis, languages
which needed more than the number of characters available in a single-byte
character set [for example Chinese, Japanese and Korean (Hangeul)] could
not be represented fully. So Double and Multi-Byte Character Sets (DBCS
and MBCS) were developed. These required special parsing and drawing
techniques.
One difficulty with this arrangement arises partly from inconsistency
in the individual character sets. For example the
Windows character set
appears to be usable by Hungarian speakers, but in fact there are
four characters missing (Ő,ő,Ű and ű). So Hungarians have to use
the Central European single-byte codepage iso-8859-2 which does have
those characters (a codepage being a version within a character
set).
But then there is another problem. The iso-8859-2
codepage does not have another character which may be needed (Õ)!
If this is a problem for those using Hungarian, imagine the difficulties
faced by those using Asian scripts.
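The gaps described above can be checked by trying to encode the characters in each codepage. This is a minimal sketch in Python, assuming that the "Windows character set" referred to is codepage 1252 (the text does not name it); the helper can_encode is hypothetical.

```python
def can_encode(text, codepage):
    """Return True if every character of text exists in the codepage."""
    try:
        text.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False

hungarian = "ŐőŰű"   # the four characters said to be missing
for codepage in ("cp1252", "iso-8859-2"):
    # Report whether the codepage holds the Hungarian letters, then Õ.
    print(codepage, can_encode(hungarian, codepage), can_encode("Õ", codepage))
```

Neither single-byte codepage holds all five characters, which is the situation Unicode is designed to avoid.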
A second problem arises from the fact that the default character
set in any particular computer is set on installation to the "locale"
of the computer. On Windows 9x machines, this can only be altered using
the Control Panel and cannot be altered at run-time. Now some system
displays which are very useful to the programmer (such as MessageBox),
can only draw using this default character set. This means that such
displays will be wrong if the locale is not set properly by the user.
In other types of controls, considerable care is needed by the
programmer to ensure that the correct character set is used for the
language to be represented.
As the requirement has grown for more and more languages to be represented, this could have been achieved by introducing more and more character sets. But instead the decision was made to go down the Unicode route. The advantages are:-
So the trick is to find the correct font and character set to draw the characters which need to be drawn.
Not all fonts and character sets will be installed on all machines.
What fonts and character sets are loaded on installation will depend on
which version of Windows is being used, whether it is an English or
other language version, and what applications are loaded (they can come
with their own fonts).
Every Windows machine will have installed Courier New, Arial,
Times New Roman, Symbol, Wingdings, MS Serif, MS Sans Serif and (on
later versions) Tahoma. Depending on the Windows version, between them
these fonts can handle at least Western and Central European, Hebrew,
Arabic, Greek, Turkish, Baltic, Cyrillic and Vietnamese scripts.
You can see what fonts are loaded on your machine by looking in the
Windows\fonts folder or at "Control Panel, fonts". You will also find
that installed fonts are listed in the registry in
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft and sub-keys (which depend on the
Windows version). However, there is not much information about the
fonts. What you need to know is what languages are supported by what
fonts. The easiest way to see this is to go into "Internet Options,
General" and click "Fonts". Then in the drop down combo box select
the language you wish to portray. A list of fonts then appears which
have character sets capable of displaying the language specified. The
"ChooseFont" common dialog which is used by various applications does a
similar thing, except that you choose the font first and then the "script"
drop down combo box shows what languages the font is capable of displaying.
The capability of a particular font can be expected to vary from machine
to machine. For example, the Times New Roman font supplied with
Windows 98 is limited to Western script and Greek, but on Windows XP
can also draw Hebrew, Arabic, Turkish, Baltic, Central European, Cyrillic
and Vietnamese.
If you have Word installed you can see the characters available for each font and character set by using "Insert, symbol" and then clicking the subset of the range you are interested in.
As developers we are more interested in what can be done at run-time to check that the correct font is installed. You can enumerate the loaded fonts which suit specified character sets using EnumFontFamiliesEx. This API calls a callback routine for each font found, and it can report the language and character set in two ways: either by using the CHARSET value (a fixed byte value of between 0 and 255 which you find at +17h in the LOGFONT structure), or by using the name of a language in words (reported in the ENUMLOGFONTEX structure), which is a more flexible arrangement. The languages in words reported by EnumFontFamiliesEx seem to be those which appear in the "Internet Options, General", "Fonts" combo box, which probably uses that function anyway to fill its contents.
Generally fonts are capable of representing only a limited selection of values in the Unicode character ranges, so under Windows 2000/XP the EnumFontFamiliesEx API is extended so that you can also receive the Unicode range of a particular font through the FONTSIGNATURE structure. The API GetFontUnicodeRanges can also be used to discover the Unicode range of a particular font.
Microsoft advises that the user of your application should also be able to find the correct font to display the language required using the ChooseFont common control, if necessary.
Windows 2000/XP uses fonts in some clever ways to maximise their versatility. For example, if a particular font does not support a particular character which needs to be drawn, Windows looks at "linked fonts" to see if there is another font which will do so. The fonts which are linked are contained in the registry key HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\FontLink\SystemLink. These are added to each time a new Language Group is added to the machine by the user, using Control Panel, Regional Options.
When you have found the font suitable for the characters you wish to draw, you would get a handle to the font using CreateFont or CreateFontIndirect and then select the font into the device context ready to draw.
If your application runs on both W9x/ME and XP/2000 platforms there is a very handy pseudo-font you can use: MS Shell Dlg. You can specify this font either in the resource file or in the LOGFONT structure and it will automatically map to the correct font to suit the platform. On W9x/ME machines it maps to MS Sans Serif, and on XP/2000 machines it maps to Microsoft Sans Serif (or to Tahoma if you specify that you want a fixed pitch font). Both Tahoma and Microsoft Sans Serif (as found on XP/2000 machines) support Greek, Hebrew, Arabic, Turkish, Baltic, Central European, Cyrillic, Vietnamese and Thai scripts. Note that Microsoft Sans Serif is not the same as MS Sans Serif. What MS Shell Dlg switches to can be found in HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\FontSubstitutes.
Windows NT/2000 and XP provide Unicode versions of the other string APIs such as DrawText, and Unicode common controls, menus, and dialogs. These are capable of drawing Unicode text irrespective of character sets and codepages provided they are given the correct font.
There is also the Uniscribe set of APIs which are highly sophisticated and designed to analyse complex Unicode characters and their individual components (glyphs) and display them correctly.
    PUSH 100h             ;size of the buffer to receive the string
    PUSH ADDR BUFFER      ;pointer to buffer to receive the string
    PUSH 23               ;identifier of the string to be obtained
    PUSH [hInst]          ;handle to executable containing the resource
    CALL LoadStringW

This code causes the string with the ID of 23 decimal to be placed in BUFFER. From there it can be written to the screen using a window or message box, for example:-

    PUSH 40h              ;information + ok button
    PUSH ADDR TITLE_TEXT  ;pointer to text for the title
    PUSH ADDR BUFFER      ;pointer to string 23 just retrieved
    PUSH [hWnd]           ;to be child of main window
    CALL MessageBoxW

Note that there are two versions of the API, LoadStringA and LoadStringW. In Windows 9x and ME only the ANSI version can be used. Similar care is needed when using FindResource and LoadResource.
What is new in GoAsm is that you can keep your strings in your assembler source script and declare them in data, just as you would with an ANSI string. This is because GoAsm reads Unicode source scripts and include files. See part 2 for details of how strings can be kept in data using GoAsm and GoRC.
Personally I have found NoteXPad to be very reliable. It comes in two versions, one in English and one in Chinese. Look either for "EN" or "SC" in the downloaded filename.
When you use the Unicode editor to create a source file, you must ensure that the file is saved in the correct format for the "Go" tools, that is either UTF-8 with BOM or UTF-16LE with BOM (sometimes referred to just as "Unicode").
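If you are unsure how your editor saved a file, the first few bytes will tell you. The following is a minimal sketch in Python of such a check; the function name source_format and its return strings are hypothetical, and treating a file without a BOM as ANSI is an assumption made for this sketch only.

```python
import codecs

def source_format(data):
    """Guess the format of a source file from its first bytes (the BOM)."""
    if data.startswith(codecs.BOM_UTF8):       # bytes EF BB BF
        return "UTF-8 with BOM"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "UTF-16LE with BOM"
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "UTF-16BE with BOM"
    return "no BOM (assumed ANSI)"

# Build a small UTF-16LE file image in memory and check it.
sample = codecs.BOM_UTF16_LE + "DB 0".encode("utf-16-le")
print(source_format(sample))
```

In practice you would pass in the first few bytes read from the file in binary mode.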
In this way you can define a word (using the /d switch) using Unicode characters.
You can also use Unicode characters for the names of your input and output files.
Note that extensions to filenames must be in Roman characters if the extension is important to the "Go" tools. For example GoRC creates an obj file if given a .res file. Here "res" cannot be in anything other than in this form. Similarly the extension of GoAsm and GoRC's output files will always be in Roman characters, for example if you assemble the file Процесса.asm this will produce a file called Процесса.obj.
GoLink looks for specified files with ".obj" and ".res" extensions for its input files, and specified ".dll", ".exe", ".drv", and ".ocx" files when looking for imports. These extensions must be in this form, but the filename itself can use any characters permitted by the operating system.
    функция equ invoke
    пересылка equ mov
    ПолучитьХендлМодуля equ GetModuleHandle
    МодальноеДиалоговоеОкно equ DialogBoxParam
    ИнициализацияДиалога equ WM_INITDIALOG
    СообщКоманда equ WM_COMMAND

    PUSH 40h              ;information + ok button
    PUSH ADDR TITLE_TEXT  ;pointer to text for the title
    PUSH ADDR STRING_23   ;pointer to string 23 in data
    PUSH [hWnd]           ;to be child of main window
    CALL MessageBoxW      ;call Unicode version of MessageBox

MessageBoxW requires that the Unicode string is in UTF-16 format and also null terminated (which must be a 16-bit null). To comply with this (in a Unicode file) you would declare the string as follows (this example in Greek):-

    STRING_23 DB 'Μπορώ να φάω σπασμένα γυαλιά χωρίς να πάθω τίποτα'
              DW 0        ;null terminator

Clearly, to use this default method it is important that you are aware of the format of your source script.
STRING4 DUS "The strings, new line chars",0Dh,0Ah,"and null are in UTF-16",0
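To see what a sequence like this looks like once every element is 16 bits wide, here is a minimal sketch in Python which encodes the same text, line end characters and null as UTF-16 little-endian. It only imitates the byte layout described; it is not how GoAsm itself works, and the helper name utf16_sequence is hypothetical.

```python
def utf16_sequence(text):
    """Encode text, control characters included, as UTF-16 little-endian."""
    return text.encode("utf-16-le")

text = "The strings, new line chars\r\nand null are in UTF-16\0"
data = utf16_sequence(text)

print(len(text), "characters become", len(data), "bytes")
print(data[-2:])   # the null terminator occupies two bytes
```

The carriage return, linefeed and null each take up a full word, just like the ordinary characters.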
The overrides work like this:-
L' override - string will always be put in the object file in Unicode UTF-16 format
A' override - string will always be put in the object file in ANSI format
8' override - string will always be put in the object file in Unicode UTF-8 format
    STRINGS UNICODE   ;from this point all strings are converted to UTF-16
    STRINGS ANSI      ;from this point all strings are converted to ANSI

The STRINGS directive works on all quoted strings in the file, but it should not be used on filenames (for example after #include or INCBIN). This is because such filenames may have to be converted to the correct format to suit the system and this is a job for GoAsm itself. See using Unicode filenames.
Note that as a directive to GoAsm, STRINGS is only recognised in the source script and in "a" extension include files. It will not be recognised in non "a" include files because they are not treated as assembler source scripts, but merely as lists of defined words (equates, macros and structs).
    STRING_23 DB L'Μπορώ να φάω σπασμένα γυαλιά χωρίς να πάθω τίποτα'
              DW 0        ;null terminator

Note that a UTF-16 string needs two null bytes to terminate it properly (the characters are 16 bits each). So instead of using DB you could use DW with one null terminator:-

    STRING_28 DW L'Convert me to UTF-16',0

Using the STRINGS directive means you don't need the L' override:-

    STRINGS UNICODE                         ;all strings from now on to be UTF-16
    STRING_30 DW 'nem lesz tőle bajom',0    ;null-terminated UTF-16 string

Even in a Unicode program there are occasional strings which need to be in ANSI format. For this, you can use the A' override, for example:-

    STRING_36 DB A"This string will always be in ANSI"

Sometimes you may want to convert a string to UTF-8 format. To do this you can use the 8' override, for example:-

    STRING_40 DB 8"The strings, new line chars",0Dh,0Ah,8"and null are in UTF-8",0

You can use the 8" override here with DB (without needing DW or DUS) because the control characters in UTF-8 always correspond with their ANSI values.
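The point about control characters can be checked directly. This is a minimal sketch in Python (the helper name as_utf8 is hypothetical) showing that carriage return, linefeed and null keep their single-byte values in UTF-8, while a non-Roman letter needs more than one byte.

```python
def as_utf8(text):
    """Return the UTF-8 bytes for the given text."""
    return text.encode("utf-8")

print(as_utf8("\r\n\0"))   # control characters: one byte each, 0Dh 0Ah 00h
print(as_utf8("ő"))        # a Hungarian letter: two bytes in UTF-8
```

This is why the 0Dh,0Ah,0 in the declaration can stay as plain bytes while the quoted parts are converted.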
    STRINGS UNICODE
    PUSH 40h                            ;information + ok button
    PUSH 'На берегу пустынных волн'     ;pointer to title
    PUSH 'Стоял он, дум великих полн'   ;pointer to text
    PUSH [hWnd]                         ;to be child of main window
    CALL MessageBoxW

You can also push pointers to Unicode strings using INVOKE instead of CALL, this example in Hungarian using the L' override:-

    INVOKE MessageBoxW, [hWnd], L'Meg tudom enni az üveget', \
                        L'nem lesz tőle bajom', \
                        40h
    STRINGS UNICODE
    CMP AX,'л'

will compare AX with the UTF-16 value for the Cyrillic character 'л' which is 43Bh.
However,

    CMP AX,8'л'

here the value in quotes will be 0BBD0h instead (first byte 0D0h, second 0BBh) since this is the UTF-8 value for 'л'.
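The two values quoted above can be reproduced with a short check. This is a minimal sketch in Python, used only to inspect the numbers; the variable names are hypothetical. The UTF-8 bytes are read as a little-endian word because that is how a 16-bit immediate is held on x86.

```python
ch = "л"

utf16_value = ord(ch)                             # the UTF-16 value
utf8_bytes = ch.encode("utf-8")                   # the bytes in memory order
utf8_word = int.from_bytes(utf8_bytes, "little")  # the same bytes read as a word

print(hex(utf16_value))
print(utf8_bytes.hex(), hex(utf8_word))
```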
And like ordinary quoted strings, a quoted immediate in a Unicode source script saved in UTF-8 format would be converted to UTF-16 if neither the STRINGS directive nor an override is used.
    MyStruct STRUCT
             DB 0
             DW L"I am a null terminated Unicode string",0
             DB 0
    ENDS
    ;
    Label64 MyStruct    ;apply the structure template

Irrespective of the state of the STRINGS directive the string is in Unicode UTF-16 format because of the L" override. But if there was no override in the string, the state of the STRINGS directive at the time the structure template was applied (at Label64) would be used.
If the initialisation of the structure is itself overridden, that takes priority, for example:-

    UP STRUCT
       DB L'I would prefer to forget this'
    ENDS
    DontGive UP <8"I won't let you">

Here the string is in Unicode UTF-8 format because although the structure contains the L' override, that is itself overridden by the 8" override.
The STRINGS directive applies to macros, equates and structures when they are applied in the source script, and not when they are declared. Strings in data are different because they are applied and established in data when they are declared. So for example:-
    STRINGS ANSI
    MyString1="I could be either"     ;an equate
    MyString2 DB "I couldn't"         ;an ANSI data declaration
    STRINGS UNICODE
    PUSH MyString1                    ;push pointer to Unicode string
    PUSH ADDR MyString2               ;use the ANSI string

But the override will always be effective against the STRINGS directive regardless:-

    STRINGS ANSI
    MyString1=A"I know what I am"     ;an ANSI string equate
    MyString2 DB L"So do I"           ;a Unicode data declaration
    STRINGS UNICODE
    PUSH MyString1                    ;push pointer to ANSI string
    PUSH ADDR MyString2               ;use the Unicode string
    MyString DW 'Здравствуй Мир (Hello World in Russian)',0
    MyStringSize DD SIZEOF MyString

In MyStringSize you will find the value 80 if the string is in Unicode UTF-16 (that is 39 16-bit words for the string itself plus a 16-bit null).
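The figure of 80 can be confirmed by counting. The following is a minimal sketch in Python which works out the same size from the string itself; it imitates the arithmetic only and says nothing about how SIZEOF is implemented. The function name sizeof_utf16 is hypothetical.

```python
def sizeof_utf16(text):
    """Size in bytes of text as UTF-16 plus one 16-bit null terminator."""
    return len(text.encode("utf-16-le")) + 2

my_string = "Здравствуй Мир (Hello World in Russian)"
print(len(my_string), "characters,", sizeof_utf16(my_string), "bytes")
```

The same string declared in ANSI with a single null byte would occupy 40 bytes, which is why the size depends on the format in force when SIZEOF is used.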
If you use structures or equates with strings which change size depending on whether you are making a Unicode or ANSI version of your program using a Unicode/ANSI switch, ensure that the correct switch is set when SIZEOF is used. This is because the size in bytes of the structure, structure member, or data declaration is measured at the time when SIZEOF is used, rather than at any other time.
A developer may still want to make two versions of the application, one in ANSI and one in Unicode. This might be to avoid the system's Unicode/ANSI conversions at run-time, to make full use of the added functionality of NT/2000 and XP, or to permit different language versions with non-Roman characters on the NT/2000 and XP platform. This is where Unicode/ANSI switching comes in. Using Unicode/ANSI switching it is possible to have just one version of the source script, but at compile-time you tell the assembler whether to make an ANSI version of the application or a Unicode version, using conditional assembly.
This is the switching which may be needed in such a source script or in include files:-
In GoAsm you can define UNICODE in one of two ways, either by adding /d to the command line for example:-
    GoAsm Myfile /d UNICODE

or, by adding this line early in the source script:-

    #define UNICODE

In both cases GoAsm regards UNICODE as defined, and also as the value one despite no value being given for it.
You can then use this switch like this:-

    #ifdef UNICODE
    ;assemble these lines
    ;if UNICODE is defined
    ;then jump to #endif
    #else
    ;assemble these lines
    ;if UNICODE is not defined
    #endif

In a typical situation you would establish these switches:-

    #ifdef UNICODE
    AW=W              ;switch for APIs
    STRINGS UNICODE   ;switch for quoted strings
    DSS=DUS           ;switch for string sequences
    S=2               ;type indicator and character size switch
    #else
    AW=A
    STRINGS ANSI
    DSS=DB
    S=1
    #endif

Using #ifdef means that if you needed to switch back to ANSI part way through your source script you could undefine UNICODE using:-

    #undef UNICODE

There are other ways of using conditional assembly to switch between Unicode and ANSI versions.
    CALL DefWindowProc##AW

The conditional assembly switch is:-

    #ifdef UNICODE
    AW=W
    #else
    AW=A
    #endif

Using this code, if UNICODE is defined, DefWindowProcW would be called. Otherwise, DefWindowProcA would be called. The double hash causes the two elements either side of it to combine.
You don't have to use AW in the switch - you could use anything you like, for example

    CALL DefWindowProc##SWITCH

A number of APIs do not have A or W versions (in which case only one version of the API applies both to ANSI and Unicode programs). With these APIs you would not use the ##AW switch, for example:-

    CALL WriteFile

Users have their preferred methods of switching and there are many. See other ways of switching APIs.
    #define TB_ADDBUTTONSA (WM_USER + 20)
    #define TB_INSERTBUTTONA (WM_USER + 21)
    #define TB_INSERTBUTTONW (WM_USER + 67)
    #define TB_ADDBUTTONSW (WM_USER + 68)
    #ifdef UNICODE
    #define TB_INSERTBUTTON TB_INSERTBUTTONW
    #define TB_ADDBUTTONS TB_ADDBUTTONSW
    #else
    #define TB_INSERTBUTTON TB_INSERTBUTTONA
    #define TB_ADDBUTTONS TB_ADDBUTTONSA
    #endif

GoAsm can read this, but here is an alternative using the switched double hash:-

    TB_ADDBUTTONSA = WM_USER+20
    TB_INSERTBUTTONA = WM_USER+21
    TB_INSERTBUTTONW = WM_USER+67
    TB_ADDBUTTONSW = WM_USER+68
    TB_INSERTBUTTON = TB_INSERTBUTTON##AW
    TB_ADDBUTTONS = TB_ADDBUTTONS##AW
    #ifdef UNICODE
    STRINGS UNICODE
    #else
    STRINGS ANSI
    #endif

If your source script is always saved in ANSI format (ie. as an ordinary text file) you could leave out the STRINGS ANSI line, but it would be required if you save your source script in one of the Unicode formats. The switch then allows you to switch either declared or PUSHed strings like this:-

    ST36 DB "This string can be in Unicode or ANSI"
    INVOKE MessageBox##AW,[hWnd],'Hello switched user',ADDR ST36,40h

or quoted immediates, for example if the register here is also switched using SWREG to be either AX or AL:-

    CMP SWREG,'a'

this will compare AX with the UTF-16 value for 'a' if UNICODE, and AL with the ANSI value for 'a' if ANSI.
    #ifdef UNICODE
    DSS=DUS
    #else
    DSS=DB
    #endif

You don't have to use DSS for this switch; it can be anything you like.

    MyLabel DSS 'I am a string of varying character',0Dh,0Ah,0

When making a Unicode program the DSS translates to DUS "declare Unicode sequence", so that the string would be put in the object file in Unicode UTF-16 format, together with the line end characters and the null terminator in their 16-bit forms. But when making an ANSI program the DSS is DB, so that the string and the control characters are in 8-bit forms.
    #ifdef UNICODE
    S=2
    #else
    S=1
    #endif

S can be switched to the equivalent of any of the pre-defined type indicators that is B, W, D, Q or T. In this case it is switched either to W (value 2) or to B (value 1). Therefore you can control the size of the instruction with it, for example:-

    MOV S[EDI],0          ;insert a single zero if ANSI, double if Unicode
    INC S[COUNT]          ;increment byte at COUNT if ANSI, word if Unicode
    LOCAL CharUnder:S     ;make local byte if ANSI, word if Unicode
    LOCAL BUFFER[256]:S   ;make 256 byte local buffer if ANSI, 512 if Unicode

You may prefer to use the following switch which has the same effect as the one above, but emphasises the fact that S can be switched to B, W, D, Q or T:-

    #ifdef UNICODE
    S=W
    #else
    S=B
    #endif

    #ifdef UNICODE
    S=2
    #else
    S=1
    #endif

And this allows you to do this:-

    ADD EDI,S                 ;increment EDI by 1 if ANSI, 2 if Unicode
    HLabel DB 256*S DUP 0     ;make 256 byte buffer if ANSI, 512 if Unicode

The same idea can be used in formal structures, for example:-

    NMTTDISPINFO STRUCT
      hdr NMHDR
      lpszText DD
      szText DB S*80 DUP 0    ;80 bytes if ANSI, 160 if Unicode
      hinst DD
      uFlags DD
      lParam DD
    NMTTDISPINFO ENDS

You don't have to use S as the character size switch. Traditionally the word CHAR or TCHAR is used, and this is used in many include files for switching.
    GoAsm Myfile /d UNICODE=1

or

    #define UNICODE 1

or

    UNICODE=1

or the EQU equivalent.
You can then use this switch:-

    #if UNICODE=1
    ;assemble these lines
    ;if UNICODE is "on"
    ;then jump to #endif
    #else
    ;assemble these lines
    ;if UNICODE is "off"
    #endif

To revert to ANSI you would then use:-

    #define UNICODE 0

or

    UNICODE=0

or the EQU equivalent.
    #ifdef UNICODE
    #define TEXT(x) L##x
    #else
    #define TEXT(x) x
    #endif

This would then be used as follows:-

    MyLabel DB TEXT("The string")
    #ifdef UNICODE
    #define DefWindowProc DefWindowProcW
    #define MessageBox MessageBoxW
    #else
    #define DefWindowProc DefWindowProcA
    #define MessageBox MessageBoxA
    #endif
    ;
    CALL DefWindowProc   ;switched depending on whether UNICODE defined
    CALL MessageBox      ;switched depending on whether UNICODE defined

Or you can use:-

    #ifdef UNICODE
    AW=W
    #else
    AW=A
    #endif
    ;
    DefWindowProc = DefWindowProc##AW
    MessageBox = MessageBox##AW
    ;
    CALL DefWindowProc   ;switched depending on whether UNICODE defined
    CALL MessageBox      ;switched depending on whether UNICODE defined

There are various methods of switching using macros and arguments using the double hash. In this example, an argument is joined up with a non-argument to ensure that the correct call is made (DefWindowProcA or DefWindowProcW) depending on the program version:-

    #ifdef UNICODE
    AW(a)=a##W
    #else
    AW(a)=a##A
    #endif
    ;
    CALL AW(DefWindowProc)

And here is yet another variation on the same theme:-

    #ifdef UNICODE
    CALLAW(a)=CALL a##W
    #else
    CALLAW(a)=CALL a##A
    #endif
    ;
    CALLAW (DefWindowProc)

Note that an API switch is only suitable for those APIs which have different ANSI and Unicode versions. A list of APIs has the advantage that you know which APIs need switching. However the linker will soon tell you if you mistakenly try to use an API's A or W version which doesn't exist.
    #if STRINGS UNICODE
    ;or
    #if STRINGS ANSI

This tests the current state of the STRINGS directive if it has been declared and will assemble the following lines if the statement is true (up to the next #else, #elseif or #endif). Note that if the STRINGS directive has not been declared, this test reports whether the source script is in Unicode format or not (since absent the STRINGS directive, the format governs what happens to natural quoted strings).
GoAsm

Input file formats supported:- ANSI, Unicode UTF-8 with BOM and UTF-16LE with BOM

Output:- Output to the console will be in Unicode if permitted by the system; list file and any file to receive redirected output will be in the same format as the first input file.

Unicode can be used in:- filenames in include files and raw data files; filenames in input and output files via the console (command line); defined words in the command line; defined words in source or include files (equates, structs and macros); comments; data labels (permitting access to data using Unicode); code labels (permitting calls to functions using Unicode names, imports and exports to other executables); strings (see below).

Declaring strings in data:-

| Syntax | Effect |
|---|---|
| DB "..." | (default action) in an ANSI file, no conversion; in a Unicode file, put in data in UTF-16 format |
| and to override the default action:- | |
| STRINGS UNICODE | strings always to be in UTF-16 format for the rest of the file |
| STRINGS ANSI | strings always to be in ANSI format for the rest of the file |
| and irrespective of the STRINGS directive:- | |
| DUS "....",0Dh,0Ah,0 | declare sequence in data in UTF-16 format (string and control characters) |
| DB L"..." | declare string in data in UTF-16 format |
| DW L"..." | declare string in data in UTF-16 format |
| DB 8"..." | declare string in data in UTF-8 format |
| DB A"..." | declare string in data in ANSI format |

Pushing pointers to null terminated strings in data:-

| Syntax | Effect |
|---|---|
| PUSH "....." | (default action) in an ANSI file, no conversion; in a Unicode file, push pointer to null-terminated Unicode string in data in UTF-16 format |
| and to override the default action:- | |
| STRINGS UNICODE | push strings to be in UTF-16 format for the rest of the file |
| STRINGS ANSI | push strings to be in ANSI format for the rest of the file |
| and irrespective of the STRINGS directive:- | |
| PUSH L"....." | push pointer to null terminated string in UTF-16 format |
| PUSH 8"....." | push pointer to null terminated string in UTF-8 format |
| PUSH A"....." | push pointer to null terminated string in ANSI format |

Quoted immediates:-

| Syntax | Effect |
|---|---|
| MOV EAX,'a' | (default action) in an ANSI file, no conversion (put into EAX the character value of 'a' in the current codepage); in a Unicode file, put Unicode character value for "a" into EAX |
| and to override the default action:- | |
| STRINGS UNICODE | use Unicode character values for the rest of the file |
| STRINGS ANSI | use ANSI character values for the rest of the file |
| and irrespective of the STRINGS directive:- | |
| MOV EAX,L'a' | use Unicode character value |
| MOV EAX,8'a' | use UTF-8 character value |
| MOV EAX,A'a' | use ANSI character value |

Switching using conditional assembly:-

Switching Unicode/ANSI APIs - various methods, aided by the double hash
Switching of the STRINGS directive
Switching of Unicode or ANSI character sequences
Switched type "S" to switch as B,W,D,Q or T type indicator
Switched character size indicator
GoRC

Input file formats supported:- ANSI, Unicode UTF-8 with BOM and UTF-16LE with BOM

Output:- Output to the console will be in Unicode if permitted by the system; file to receive redirected output will be in the same format as the first input file.

Unicode can be used in:- filenames in include files, resource files (eg. icons and bitmaps) and raw data files; filenames in input and output files via the console (command line); defined words in the command line; defined words in source or include files (equates and macros); comments; resource IDs; resource types; strings (see below).

Strings in version resource, stringtables, dialogs, menus, controls and user-defined resources:- always converted to Unicode UTF-16 format if not in that format already, escape sequences allowed and number conversion carried out (see GoRC manual)

Strings in RCDATA resource (raw data in resource script):-

| Syntax | Effect |
|---|---|
| "....." | (default action) kept in the same format as the source script, escape sequences allowed and number conversion carried out (see GoRC manual) |
| L"....." | string converted to UTF-16 format if not in that format already |
GoLink

Input commands:- Command line in console (MS-DOS command prompt):- accepts Unicode filenames if supported by the operating system. Command files:- can be ANSI, Unicode UTF-8 with BOM and UTF-16LE with BOM

Output:- Output to the console will be in Unicode if permitted by the system; file to receive redirected output will be in the same format as the first Unicode command file. If there is no Unicode command file the format depends on the operating system - see the GoLink manual.
The Unicode Consortium
official site for Unicode news, updates, information and links
Alan Wood's Unicode resources
a must-visit site kept up to date with information and resources
Dr International
Microsoft's international site for developers - clear and
understandable
SIL (Summer Institute of Linguistics) computer page
various tools and resources
Institute of Estonian language letter database
view Unicode values, character sets, and requirements of the different languages
Newsgroups:-
MSDN international newsgroup
Various articles:-
Design a Single Unicode App that Runs on Both Windows 98 and Windows 2000
by F Avery Bishop (April 1999).
Supporting Multilanguage Text Layout and Complex Scripts with Windows NT 5.0
by F Avery Bishop, David C Brown and Davis M Meltzer (November 1998).
Get World-Ready - Fonts Global Development and Computing Portal.
Writing Win32 Multilingual User Interface Applications Global Development and Computing Portal.
Configurable Language and Cultural Settings Global Development and Computing Portal.
Character sets by Ken Fowles, Microsoft Typography site.
Unicode Fonts for Windows Computers by Alan Wood.
Other References:-
"The Unicode Standard Version 3.0", published by The Unicode Consortium.
"Unicode - A Primer" by Tony Graham, M & T Books.
Various articles in the Microsoft Development Library, Knowledge Base
and Windows Software Development Kit ("SDK").
"Hello World" by Rob Pike, Ken Thompson AT & T Bell Laboratories
January 1992.
"An Essay in Endian Order" 1996 by Dr William T Verts.
"Alternate formats", February 2000, Network Working Group.
"Under the Hood" (Unicode support in Windows NT) 1998 by Matt Pietrek.
"Forms of Unicode" 1999 by Mark Davis (IBM developerWorks).
Copyright © Jeremy Gordon 2005