ASM Diaries 2: A Hack for Case Insensitive Identifiers

This article will make a lot more sense with an ASCII table to look at. I don’t have quick access to a copyright free one, but you can view one at https://www.asciitable.com

Case insensitive comparisons in assembler are hard work [1]. The design of the ASCII codes splits letters into two blocks [2], so your code has to make multiple range checks before it can convert case or compare characters. This tip is useful when your character set is limited to a suitable subset of ASCII, as is common with identifiers in programming languages and, probably, elsewhere.

This is another trick I learnt from the Amstrad BASIC interpreter. In Amstrad BASIC a valid identifier must begin with a letter, which can be followed by letters, numbers and periods (.). The interpreter is case insensitive both for keywords and variable names [3].

During execution, variable names are added to a linked list [4], which also holds the variable values [5].

The variable names in this list are stored in a way that makes case insensitive string comparisons easier, but they are not simply ‘upper-cased’. The code which stores them masks out bit 5 of every character (hex &20, mask &df). This converts lower case letters to upper case. It also changes the codes for numbers and the period symbol, but without introducing any ambiguity: the digits move from &30-&39 to &10-&19 and the period moves from &2e to &0e, neither of which collides with the letters at &41-&5a. The characters have already been validated as a valid identifier, and the change is only for storage.

When comparing a variable name (i.e. when searching the list) the same conversion is applied to each character of the name being looked up before it is compared. The conversion adds a single assembly instruction which occupies two bytes.
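
As a rough sketch of the storage side of this idea (this is not the Amstrad ROM code; the label store_name and the register assignments are assumptions made purely for illustration), masking each character while copying a validated name could look like this:

```
;Hypothetical sketch: copy a validated identifier, clearing bit 5 of each char.
;On entry:
; HL=Address of the raw identifier
; DE=Address of the storage buffer
; B=Length of the identifier (must be non-zero)
;On exit:
; HL and DE point to the byte after the last char copied
; A,B and flags corrupt
store_name
.copy_loop
  ld a,(hl)   ;Raw identifier char
  and $df     ;Clear bit 5: $61..$7a -> $41..$5a, $30..$39 -> $10..$19, $2e -> $0e
  ld (de),a   ;Store the converted char
  inc hl      ;Next source char
  inc de      ;Next destination byte
  djnz .copy_loop
  ret
```

The and $df is the two byte instruction in question (opcode $e6 followed by the mask). A lookup routine would apply the same instruction to each incoming character before comparing it with the stored byte.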

In Quiche

I’m using a similar strategy in Quiche-Z80, although I’m setting bit 5 (rather than clearing it) to lower case the characters. This strategy keeps numbers unchanged (the codes &30 to &39 already have bit 5 set) which, for me, makes debugging easier. Periods are not valid in Quiche identifiers, but underscores are. Underscores (&5f) get converted to &7f (the ASCII DEL character), which is still distinct from every other converted character.
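
To make the three cases concrete, here they are worked through by hand using values from an ASCII table. This is an illustrative fragment rather than Quiche source:

```
;Illustrative only: the effect of setting bit 5 on each class of character
  ld a,$51    ;'Q'
  or $20      ;A=$71 'q' - upper case letters become lower case
  ld a,$71    ;'q'
  or $20      ;A=$71 'q' - lower case letters are unchanged
  ld a,$37    ;'7'
  or $20      ;A=$37 '7' - digits already have bit 5 set
  ld a,$5f    ;'_'
  or $20      ;A=$7f DEL - not a letter or digit, so still unambiguous
```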

Below is the identifier-string comparison code from Quiche-Z80. In the code comments ‘identifier’ refers to a ‘raw’ string in the source code, and ‘ASCII7’ refers to the case converted string it is being compared against. The case conversion is the OR $20 two lines after .cmp_loop. Identifiers are stored in ASCII7 format (with bit 7 of the last character set), which enables other fun assembly tricks in the code. See if you can spot any.

;Compares an identifier with an ASCII7 string.
;The identifier has already been verified to be a valid identifier and
;its length has been established.
;On entry:
; DE=Address of the identifier
; HL=Address of the ASCII7 string
; B=Length of the identifier
;On exit:
; If the identifier matches the ASCII7 string:
;   Carry flag is set
;   DE contains the address of the byte after the Identifier
;   HL contains the address of the byte after the ASCII7 string
; If the identifier does not match:
;   Carry flag is clear
;   HL contains the address of the byte after the ASCII7 string
;   DE corrupt
;Always:
;   A,B,C and other flags corrupt
comp_ascii7
  ;Loop while chars match.
.cmp_loop
  ld a,(de)   ;Identifier char
  or $20      ;Convert to lower case
  xor (hl)    ;Compare to ASCII7 char
  inc hl      ;Next ASCII7 char
  inc de      ;Next ident char
  jr nz,.not_matchchar  ;Chars differ
  djnz .cmp_loop
  ;If we fall through to here every identifier char matched without reaching the
  ;end of the ASCII7 string, so the ASCII7 string is longer and the strings differ.
  ;(The last char of an ASCII7 string can never XOR to zero because its high bit is set.)

;Not a match - advance HL to byte after end of ASCII7
.not_matchstr
  dec hl      ;Undo next ASCII7 - back to the char just compared
  ;On entry to the loop bit 7 of A is set if the char at HL is the last char
  ;of the ASCII7 string, and HL points to that char.
.skip_loop
  and $80     ;Is high bit set? - clears carry
  inc hl      ;Next ASCII7 char
  ld a,(hl)   ;Next
  jr z,.skip_loop  ;End when high bit set
  ret         ;Exit Not Carry
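
The loop above branches to .not_matchchar, which does not appear in the listing. The following is only a guess at what such a handler might look like, worked out from the entry and exit conditions in the header comment. It is not the author's code. It assumes identifier characters never have bit 7 set, so the XOR result is exactly $80 only when the two characters are equal and the ASCII7 character is the last one in its string.

```
;Hypothetical handler for the jr nz,.not_matchchar branch.
;On entry:
; A=converted identifier char XOR ASCII7 char (non-zero)
; B=Identifier chars remaining, including the one just compared
; HL and DE have already been advanced past the compared chars
.not_matchchar
  cp $80               ;Did the chars differ only in the end marker (bit 7)?
  jr nz,.not_matchstr  ;No - a genuine mismatch
  dec b                ;Was that also the last identifier char?
  jr nz,.not_matchstr  ;No - the identifier is longer than the ASCII7 string
  scf                  ;Both strings ended together - exit Carry
  ret
```

Under the same assumption, a caller might be set up as follows. The labels ident and name are hypothetical, and the data is written as hex bytes to avoid depending on any one assembler's string syntax:

```
  ld de,ident          ;Raw identifier from the source text
  ld hl,name           ;Stored ASCII7 string
  ld b,5               ;Length of the identifier
  call comp_ascii7     ;Carry set if they match
  ;...code using the result goes here; the data below must not be executed...

ident
  db $43,$6f,$75,$6e,$74   ;"Count"
name
  db $63,$6f,$75,$6e,$f4   ;"count" with bit 7 set on the final 't' ($74 or $80)
```

Tracing this by hand, the first four characters XOR to zero and loop via djnz, the fifth gives $80 with B=1, and the handler returns with carry set.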

Footnotes

  1. And, frankly, far too hard in high level languages too.
  2. And I appreciate this is even worse in most languages other than English.
  3. When LISTing programs, keywords are displayed in upper case and variables are displayed as they were entered. If code is entered in lower case this acts as an early form of syntax highlighting and nicely shows up any mistyped keywords.
  4. Actually multiple linked lists, depending on the first letter, with separate lists for arrays and DEF FNs, with arrays having separate lists for reals, integers and strings. All of which helps with performance.
  5. Length and pointer-to-data for strings.