How Amstrad CPC BASIC Compresses Error Messages

Amstrad CPC BASIC saves a few bytes by using some interesting compression in the way it stores error messages. Let’s take a look at how it works and how much space it saves.

Below is the table of error messages. The messages are stored in what I refer to as ASCII7 format – bit 7 of the last character is set.1 That accounts for the ‘+$80’ at the end of every string. That’s a standard way of storing ASCII strings and not of note.

error_message_full_list:
        defb $1a,$15+$80          ;$00 "Unknown error"
        defb $1d,$12+$80          ;$01 "Unexpected NEXT"
        defb "Syntax",$15+$80     ;$02 "Syntax error"
        defb $1d,"RETUR","N"+$80  ;$03 "Unexpected RETURN"
        defb "DATA ",$03,"haus",$0a+$80;$04 "DATA exhausted"
        defb "Impro",$00,"r argu",$09,"t"+$80;$05 "Improper argument"
;;=overflow message
overflow_message:
        defb "Ov",$02,"flo","w"+$80;$06 "Overflow"
        defb "Memory",$14+$80     ;$07 "Memory full"
        defb $11,"does ",$0e,$03,"is","t"+$80;$08 "Line does not exist"
        defb "Subscrip",$08,"u",$08,"f ",$07,"g","e"+$80;$09 "Subscript out of range"
        defb "Array ",$1c,"di",$09,"s",$04,"e","d"+$80;$0a "Array already dimensioned"
;;=division by zero message
division_by_zero_message:
        defb "Divi",$13,"by z",$02,"o"+$80;$0b "Division by zero"
        defb "Invalid d",$1e+$80  ;$0c "Invalid direct command"
        defb "Ty",$00,$10,"matc","h"+$80;$0d "Type mismatch"
        defb $17,"space",$14+$80  ;$0e "String space full"
        defb $17,$1b+$80          ;$0f "String too long"
        defb $17,$03,"pres",$13,$0f,$06,"pl",$03+$80;$10 "String expression too complex"
        defb "Can",$0e,"CONT",$01,"u","e"+$80;$11 "Cannot CONTinue"
        defb $1a," us",$02,$05,"nct",$04+$80;$12 "Unknown user function"
        defb $16,$19+$80          ;$13 "RESUME missing"
        defb $1d,$16+$80          ;$14 "Unexpected RESUME"
        defb "D",$1e," foun","d"+$80;$15 "Direct command found"
        defb "O",$00,$07,"d",$19+$80;$16 "Operand missing"
        defb $11,$1b+$80          ;$17 "Line too long"
        defb "EOF me","t"+$80     ;$18 "EOF met"
        defb $0d,"ty",$00,$15+$80 ;$19 "File type error"
        defb $12,$19+$80          ;$1a "NEXT missing"
        defb $0d,$1c,$0c+$80      ;$1b "File already open"
        defb $1a,$18+$80          ;$1c "Unknown command"
        defb $0b,$19+$80          ;$1d "WEND missing"
        defb $1d,$0b+$80          ;$1e "Unexpected WEND"
        defb $0d,$0e,$0c+$80      ;(WAS WRONG!!)$1f "File not open"
        defb "Broken ",$01+$80    ;(WAS WRONG!!)$20 "Broken in" (user terminated a cassette/disc operation?)

Partials

You’ll notice a slight lack of ASCII characters and but a lot of values less than $20 (decimal 32, ASCII for a space)2. These are indexes into a table of partials (or substrings if you prefer). When a partial is referenced it’s value is substituted for the partial number. Some of the partials reference other partials, allowing partials to be used recursively within other partials. The partials list is as follows,

error_message_partials_list:
        defb "p","e"+$80          ;$00 "pe"
        defb "i","n"+$80          ;$01 "in"
        defb "e","r"+$80          ;$02 "er"
        defb "e","x"+$80          ;$03 "ex"
        defb "io","n"+$80         ;$04 "ion"
        defb " f","u"+$80         ;$05 " fu"
        defb "co","m"+$80         ;$06 "com"
        defb "ra","n"+$80         ;$07 "ran"
        defb "t ","o"+$80         ;$08 "t o"
        defb "me","n"+$80         ;$09 "men"
        defb "te","d"+$80         ;$0a "ted"
        defb "WEN","D"+$80        ;$0b "WEND"
        defb "o",$00,"n"+$80      ;$0c "open"
        defb "File"," "+$80       ;$0d "File "
        defb "not"," "+$80        ;$0e "not "
        defb "too"," "+$80        ;$0f "too "
        defb " mi","s"+$80        ;$10 " mis"
        defb "L",$01,"e"," "+$80  ;$11 "Line "
        defb "NEX","T"+$80        ;$12 "NEXT"
        defb "s",$04," "+$80      ;$13 "sion "
        defb $05,"l","l"+$80      ;$14 " full"
        defb " ",$02,"ro","r"+$80 ;$15 " error"
        defb "RESUM","E"+$80      ;$16 "RESUME"
        defb "Str",$01,"g"," "+$80;$17 "String "
        defb " ",$06,"man","d"+$80;$18 " command"
        defb $10,"s",$01,"g"+$80  ;$19 " missing"
        defb "Unknow","n"+$80     ;$1a "Unknown"
        defb $0f,"lon","g"+$80    ;$1b "too long"
        defb "already"," "+$80    ;$1c "already "
        defb "Un",$03,"pec",$0a," "+$80;$1d "Unexpected "
        defb "irect",$18+$80      ;$1e "irect command"

Let’s take a look at how this works. Error message $02 is that for “Syntax error”. It is defined as

defb "Syntax",$15+$80

It begins with the string literal “Syntax”, which can be output as is.

This is followed by ‘$15+$80’. As mentioned the ‘+$80’ simply sets bit 7 to indicate the end of string. The ‘$15’ is a reference to a partial defined as

defb " ",$02,"ro","r"+$80

This partial starts with a space, which now makes our output string “Syntax error “.

Next up is another partial number, this time $02. That partial is defined as

defb "e","r"+$80

This is a simple two character string “er”. Adding that to our output gives “Syntax er”.

We now back up to the remainder of partial $15, which is the string “ror”. Add that to the output to give “Syntax error”.

Backing up again takes us back to the original error message string. Since we’re now at the last character the string is complete.

Code

Let’s have a look at the code that outputs one of these compressed error messages.

;;display error partial
;A=partial number
display_error_partial:
        ld      de,error_message_partials_list
        call    find_error_message_in_table

;;+display error message atDE
display_error_message_atDE:
        push    de                ;preserve DE
        ld      a,(de)            ;get character/code
        and     $7f               ;mask out bit 7
        cp      $20               ;Is it less than $20?

;; if $20<code<$7f -> code is a ASCII character. Display character.
;; if $00<code<$1f -> code is a message number. Display this message.
        call    nc,output_char    ;display text char
        call    c,display_error_partial ;display message partial
        pop     de                ;restore DE

        ld      a,(de)            ;reread the character
        inc     de
                ;advance to next character
        rla                       ;test for bit 7 - end of string
        jr      nc,display_error_message_atDE ;loop

        ret                       ;done

The main entry point is display_error_message_atDE. This loads the character at DE, masks out bit 7 and compares it to $20. If the value is $20 or above it’s an ASCII character so it calls output_char. If not if calls display_error_partial to expand the partial. After this it reloads the character, advances DE to the next character and tests if bit 7 is set. If not it loops for the next character, if so it finishes.

The call to display_error_partial is basically a recursive call to the same routine. The difference being that it calls find_error_message_in_table with the address of the partials table. This subroutine returns the start position (in DE) of the requested message.

Space Saving

So, how much space does this compression save? Since we’ve just looked at the code we can start by examining how much longer this makes the code.

Without partials we wouldn’t need the two instructions at display_error_partial. Those add six bytes. In the main loop we wouldn’t need the CP $20 or the call to display_error_partial. That’s another four bytes. So partials adds ten bytes to the code.3 (Without partials the call to output_char would be an unconditional one, but that wouldn’t change the instruction size).

And how many bytes are saved within the strings themselves? The error message lists start at address $cd14 and finish at address $ce72, taking up a total of $15f bytes, 351 in decimal. Adding up the lengths of the uncompressed strings from the error_message_full_list table gives 511 bytes. A saving of 160 bytes.

Subtract the ten bytes of extra code needed brings a total saving of 150 bytes.

That’s actually slightly more than I was expecting. But think about the work that went into extracting that table of partials and you can see how hard the coders had to work to fit what they did into the 16K ROM.4

What About Other Strings

This compression routine does raise a question though – there are a few other strings in the code which are stored as ASCIIZ (with a following $00 character). Using the compression scheme could have saved the byte used for the $00 termination, even if the compression of the string itself didn’t result in any savings. So, why didn’t they also use compression on those strings?

A quick scoot through the codebase shows up the following strings being defined.

defb " BASIC 1.1",10,10,0
defb "Ready",10,0
defb "Brea","k"+$80
defb " in ",0
defb "Undefined line ",0
defb "Random number seed ? ",0
defb "?Redo from start",10,0

“Break” is already displayed using the compression routine.

As to the others, I can’t see a single definitive answer but I’ll suggest the following as possible reasons

  • Control codes: The ’10’ values in a number of strings is a control code for a line feed. These strings obviously couldn’t be passed through the uncompression routine. The could be broken into a string and separate code to generate a new line but the extra code needed by such a routine would probably negate any savings in string size.
  • Register corruption: The routine to output an ASCIIZ string5 explicitly preserves the AF and HL registers. As far as I can see the display_error_message_atDE doesn’t touch HL but AF is definitely corrupted by it. The need to modify the error message code to preserve AF, or to preserve it around a call could negate any savings elsewhere. It may also be worth noting the the ASCIIZ message output routine takes the message address in the HL register, the error message code takes it in DE. Refactoring the calling code could make it longer.
  • The ASCIIZ output routine is used elsewhere within BASIC: For example the LIST command composes lines into an ASCIIZ buffer, as do the number to string conversion routines. If this code could have been dispensed with that may have made the effort to refactor other code to use the compressed error routine more worthwhile, but since it is needed anyway there’s little benefit to modifying code to use the uncompression routines.

Summary

I hope that was an interesting insight into how hard coders had to work back in the day to fit their programs into the limited memory available on old systems. Would you have gone to this amount of effort to save 150 bytes?

Links

Read the full reverse engineered Amstrad CPC BASIC source code at https://github.com/Bread80/Amstrad-CPC-BASIC-Source

Footnotes

  1. Standard ASCII only defines values 0 though 127, thus always leaving bit 7 clear. The Amstrad (and other computers) usually define values 128 to 255 as graphics or extended characters, but these aren’t used in any error messages.
  2. ASCII values less than $20 are defined as non-printable ‘control codes’. This includes things such as ‘carriage return’ and ‘line feed’ which you may be used to seeing as \r and \n in C style strings.
  3. It’s possible the code could be refactored to be shorter and save another couple of bytes, or removed completely in place of a general string output routine elsewhere in the code. I’ll leave that as an exercise for the reader.
  4. If you look at the end of the ROM there’s code right up to address $ffff. However there is a block of 21 bytes at address $ce9f which don’t appear to be used – they’re mostly RST 0 instructions. I’m not sure why they would put unused code in the middle of the ROM.
  5. output_ASCIIZ_string at $c38b.