Text and Braille Computer Translation

A dissertation submitted to the University of Manchester Institute of Science and Technology for the degree of Master of Science, 2001.

Alasdair King
Department of Computation
28 September 2001

Declaration

No portion of the work referred to in the dissertation has been submitted in support of an application for another degree or qualification of this or any other university or other institution of learning.

Acknowledgements

I gratefully acknowledge the great support of my supervisor, Dr Gareth Evans, without whom this dissertation would not have been possible.

Abstract

This project is concerned with the translation of text to and from Braille code by a number of Java programs. It builds on an existing translation system that combines a finite state machine with left and right context matching and a set of translation rules. This allows the translation of different languages and different grades of Braille contraction, and both text-to-Braille and Braille-to-text. An existing implementation in C, allowing the translation of languages based on 256-character extended-ANSI sets, has been successfully integrated into a Microsoft Word-based translation system. In this project, three Java implementations of this translation system were developed. LanguageInteger is a port of the existing C code. Language256 uses the same translation language files but is coded using Java programming idioms. LanguageUnicode is based upon the use of Unicode to encode characters. Each implements a Language Java interface, which defines common public methods for the classes in accordance with object-oriented software development principles of encapsulation and reuse. All the implementations performed translation correctly on a range of different operating systems and machines, demonstrating that they are platform-independent.
LanguageUnicode was able to use language data files obtained over HTTP from a webserver. The implementations performed well relative to the native C program on a high-specification machine, but their performance was strongly dependent on available system resources. LanguageInteger performed fastest, and more consistently, across the range of platforms tested and is suitable to serve as a component in future development as a part of a platform-independent translation application. Language256 did not perform as fast, so its development should be discontinued. LanguageUnicode performed least well, suffering from using Java Strings to represent language information. It should be recoded using arrays of ints. This can be based on the Language256 program, which uses this representation internally. It is recommended that Java Beans be developed from the classes to facilitate future development with them as applets, GUI components and network applications. The three classes are supplemented by two more Java programs for creating the language rules tables used by the programs, a test language and test input that will allow the validation of future implementations of Language, and full documentation of the classes in the standard Sun API format for future development.
Contents

Declaration
Acknowledgements
Abstract
Contents
1 Introduction
2 Current state of Braille and computer use
2.1 The Braille code
2.2 Use of Braille with computer technology
2.3 Approaches to performing Braille translation with computers
2.4 The UMIST translation system
2.5 Implementing the UMIST translation system: the BrailleTrans program
2.6 Using BrailleTrans: the Word translation system
2.7 Limitations of current implementation that can be addressed in this project
3 Solutions to current implementation limitations and development requirements
3.1 Implementation platform and implications
3.2 Addressing language universality
3.3 Advancing the UMIST translation system: planned development
4 Implementation of solutions
4.1 Language interface
4.2 The two 256-character implementations, LanguageInteger and Language256
4.3 LanguageUnicode
4.4 The Make utilities
4.5 Testing and translating utilities
4.6 Documentation
4.7 Packaging
4.2 Performance criteria
5 Results of implementations
5.1 Validation: meeting specification
5.2 Verification: performance
5.3 Language implementations
6 Conclusions and further work suggested
Appendix 1 - Computer Braille Code
Appendix 2 - Existing language rules table
    Character rules
    Wildcard specification
    Decision table
    Translation rules
Appendix 3 - Test results
Bibliography

1 Introduction

Braille is a system of writing that uses patterns of raised dots to inscribe characters on paper. It therefore allows visually-impaired people to read and write using touch instead of vision [rni2001]. It is a way for blind people to participate in a literate culture. First developed in the nineteenth century, Braille has become the pre-eminent tactile alphabet.
Its characters are six-dot cells, two wide by three tall. Any of the dots may be raised, giving 2^6 = 64 possible characters. Although Braille cells are used world-wide, the meaning of each of the 64 cells depends on the language that they are being used to depict. Different languages have their own Braille codes, mapping the alphabets, numbers and punctuation symbols to Braille cells according to need. Braille characters can also be used to represent whole words or groups of letters. These contractions allow fewer Braille cells to encode more text [dur1996], saving the expensive printing costs of Braille text and making Braille faster to use for some experienced users [wes2001, lor1996b].

Modern computer translation of Braille is of benefit to Braille users. Scanners allow printed documents to be transformed into accessible computer documents. This text can then be translated into Braille for output to a Braille printer or directed to special Braille output devices. Braille input devices like the Perkins Brailler are designed to allow Braille code to be entered directly, and Braille users can successfully use a standard QWERTY keyboard. Braille code is stored on computers as North American Computer Braille Code, a mapping of the six-dot Braille code to ASCII English text character values [kei2000].

Braille translation is not a trivial task, however, because of the need to correctly perform the contractions. They complicate translation logic and introduce many idiomatic rules and exceptions. For example, the contraction for “OF” in American Braille cannot be used unless it is applied to letters pronounced the same way as the word “of”. This means that “OFten” can be contracted but not “prOFessor” [bra1999]. Despite these complications, translation programs have been developed, based on dictionaries of correct translations or complex rule systems. Few remain in the public domain for examination, however.
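The figure of 64 cells follows from each of the six dots being independently raised or flat. A minimal Java sketch of that arithmetic is given here. It assumes the conventional dot numbering (1 to 3 down the left column, 4 to 6 down the right) and uses the Unicode Braille Patterns block purely for display; the class and method names are illustrative and do not come from the project, which stores Braille as North American Computer Braille Code.

```java
final class BrailleCellSketch {

    /** Number of distinct six-dot cells: each dot is either raised or flat. */
    static int cellCount() {
        return 1 << 6;
    }

    /**
     * Returns the Unicode Braille pattern for the given raised dots.
     * Dot n (1 to 6) corresponds to bit n-1 above the block base 0x2800.
     */
    static char cell(int... dots) {
        int bits = 0;
        for (int d : dots) {
            if (d < 1 || d > 6) {
                throw new IllegalArgumentException("dot must be 1..6: " + d);
            }
            bits |= 1 << (d - 1);
        }
        return (char) (0x2800 + bits);
    }

    public static void main(String[] args) {
        System.out.println(cellCount());      // number of possible cells
        System.out.println(cell());           // the empty cell, used as a space
        System.out.println(cell(2, 4, 5));    // the cell with dots 2, 4 and 5 raised
    }
}
```

Treating a cell as a six-bit value is only one convenient representation; any one-to-one mapping of the 64 cells to character values would serve equally well.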
One system that remains publicly available is the translation system developed at UMIST [ble1995, ble1997]. This combines a large set of rules, relating input and output text, with a finite state machine that allows the application of rules to be controlled by comparing left and right contexts. The system translates input text using three parts: a set of character rules, which determine what characters are valid for the language and their attributes; a finite state machine decision table; and a set of translation rules containing wildcards for matching input text. These parts together constitute a complete language rules table. Contraction and different types of translation can all be supported within the same language rules table because of the state table. Idiomatic translations can be supported by the translation rules and the context matching. This system is very flexible and can be used to translate to or from Braille code any language for which a language rules table can be created.

An implementation of this system has been produced at UMIST, the C program BrailleTrans [ble1995, ble1997]. It works with ANSI 256-character sets, used to represent characters on most computer systems [fow1997]. BrailleTrans does not contain any language information internally. The language rules table containing all the information for the language being used is loaded from a machine-format file when BrailleTrans is first executed. BrailleTrans can use any language rules table interchangeably. The language rules tables can be created in simple plain text files by non-technical users. They are then converted into the machine format by a second program, Mk. BrailleTrans can therefore translate any language for which a language rules table file has been created. So far, language rules tables have been developed for Standard British English, Welsh and a prototype Hungarian, covering both contracted and uncontracted Braille and translation both from text to Braille and back.
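The interaction of translation rules, context matching and the decision table can be illustrated with a much-simplified Java sketch. It is not the BrailleTrans rule format: the rule layout, the two states, the two input classes and the placeholder output "{OF}" are all assumptions made for illustration, the contexts are literal strings rather than wildcards, and the state stays fixed for a whole call instead of being updated as rules fire.

```java
import java.util.Arrays;
import java.util.List;

final class ContextRuleSketch {

    /** A hypothetical rule: replace focus with output when both contexts match. */
    static final class Rule {
        final int inputClass;
        final String left, focus, right, output;

        Rule(int inputClass, String left, String focus, String right, String output) {
            this.inputClass = inputClass;
            this.left = left;
            this.focus = focus;
            this.right = right;
            this.output = output;
        }
    }

    // DECISION[state][inputClass] is true when a rule of that class may fire.
    // Assumed states: 0 = uncontracted, 1 = contracted.
    // Assumed classes: 0 = always applicable, 1 = contraction.
    static final boolean[][] DECISION = { { true, false }, { true, true } };

    // Rules are tried in order, so the exception precedes the general rule.
    static final List<Rule> RULES = Arrays.asList(
        new Rule(0, "pr", "of", "essor", "of"),   // keep "of" in "professor"
        new Rule(1, "", "of", "", "{OF}"));       // otherwise contract "of"

    static boolean matches(Rule r, String input, int pos) {
        return input.startsWith(r.focus, pos)
            && input.startsWith(r.right, pos + r.focus.length())
            && input.substring(0, pos).endsWith(r.left);
    }

    static String translate(String input, int state) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < input.length()) {
            Rule hit = null;
            for (Rule r : RULES) {
                if (DECISION[state][r.inputClass] && matches(r, input, pos)) {
                    hit = r;
                    break;
                }
            }
            if (hit == null) {
                out.append(input.charAt(pos));    // no rule: copy the character
                pos++;
            } else {
                out.append(hit.output);
                pos += hit.focus.length();
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("often", 1));
        System.out.println(translate("professor", 1));
        System.out.println(translate("often", 0));
    }
}
```

Because the rules and the decision table are plain data, a different language or grade of contraction would need only different tables, which is the property the language rules table files rely on.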
BrailleTrans has also been used to add Braille translation functionality to the popular Microsoft Word word processor [ble2001]. This integration provides a friendly and familiar interface to the translation system for users.

BrailleTrans is fast and efficient, but it has limitations. It runs only on 32-bit Microsoft operating systems. It handles only 256-character sets, which do not allow the encoding of non-Western characters and do not supply a unique value for every different character. This makes a context-free encoding of characters impossible - there is no single correlation of characters to unique values, so text cannot be translated without first knowing what language it is written in.

These limitations are addressed in this project by the development of a number of Java implementations of the UMIST translation system. Java, as described in Chapter 3, allows for platform-independence - Java programs can run on any system with a Java interpreter, the Java Virtual Machine (JVM). JVMs are widespread, although the range of Java library classes a JVM supports can vary, so not all Java library classes may be available on every machine. Java also supports the use of 16-bit Unicode characters, which escapes the restriction on supporting only languages that use a 256-character set. The Java implementations developed adhere to the Java 1.1 standard, making them compatible with the majority of JVMs. Java also allows for the use of object-oriented programming approaches, which aim to promote re-use and minimise errors.

Chapter 4 describes how three Java programs, or classes, were developed to perform translation. Two have the same functionality as the existing BrailleTrans program and use the same 256-character set language files. One, LanguageInteger, is a port of the existing C BrailleTrans program to Java, for comparison of output and to provide a highly-optimised benchmark.
The other, Language256, is a more object-oriented class that uses more Java programming techniques. These Java and object-oriented techniques are intended to make implementation easier and the resulting class simpler to maintain. The third Java program, LanguageUnicode, uses Unicode characters to represent text. All three were designed to implement a Java interface for Language that defines the public methods and variables that must be provided by the program. This is intended to establish the implementations as objects and facilitate their use as components in future translation applications.

Two Java replacements of the existing Mk program convert the language rules tables from their plain text to machine formats - one handles the legacy 256-character files used by LanguageInteger, Language256 and BrailleTrans, the other the Unicode-based files used by LanguageUnicode. Both improve on Mk by allowing the text files to contain escape characters that code for other characters, allowing any Unicode or 256-character set to be represented in a strictly ASCII text file. The separate human-edited language rules table files were maintained to allow easy production and editing of languages. A number of command-line translation utilities demonstrated the ease of using the classes and their common interface for translation within a larger application.

The results of the implementations are described in Chapter 5. A test translation language and input and output files confirmed that the classes did translate as required on a variety of operating systems. The speed of translation differed between the classes, and was related to their different designs. Analysis of the execution of each of the classes was performed to identify possible coding improvements, but few such improvements were found to be obvious.
All of the classes are documented internally in source code and externally using the Sun Javadoc system to produce comprehensive documentation for subsequent development with the classes to form larger applications. A number of applications of the classes were investigated.

The project was therefore successful in producing platform-independent, component Java implementations of the UMIST translation system, including a Unicode language translator. All of the classes conform to good object-oriented design and, with their documentation, are well suited to future development as parts of a larger translation application. Chapter 6 provides more detail on these conclusions and specific proposals for future work on and with the classes.

2 Current state of Braille and computer use

2.1 The Braille code

Blind people cannot use printed or displayed text. They need a tactile or audile means to read and write. Braille was created in the nineteenth century to fulfil this need [rni2001]. Characters in Braille consist not of visual symbols, printed or displayed, but of physical symbols constructed of raised dots on paper. It is a system of reading and writing based on touch rather than vision, where characters are embossed rather than ink printed.

Each Braille character, or cell, is composed of a rectangular array of six positions, each of which may be filled by a raised dot. They are numbered 1 to 3 down the left-hand column and 4 to 6 down the right-hand column:

    1 4
    2 5
    3 6

This provides 2^6 = 64 possible characters. The cell with no dots at all, an empty space, is used to separate words and for layout, just as in visual text. This leaves 63 cells for characters. This allows a simple one-to-one mapping of any given common Indo-European alphabet to Braille cells. Letters, numbers and punctuation can be recorded in Braille by single cells or very simple combinations of them. Translating text to Braille is then a trivial process of converting written text character by character.
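A character-by-character conversion of this kind can be sketched in a few lines of Java. The sketch is illustrative only: the class name is hypothetical, the table covers just the letters E, H, J and T plus the space, case is ignored, and a number symbol is emitted once before each run of digits. Cells are written as Unicode Braille patterns for readability, not as the computer Braille code used by the project.

```java
import java.util.HashMap;
import java.util.Map;

final class SimpleBrailleSketch {

    static final char NUMBER_SIGN = '\u283C';           // dots 3-4-5-6

    // Digits 0 to 9 reuse the cells for the letters j, a, b, c, d, e, f, g, h, i.
    static final String DIGIT_CELLS =
        "\u281A\u2801\u2803\u2809\u2819\u2811\u280B\u281B\u2813\u280A";

    // A deliberately tiny letter table; a real table would cover the alphabet.
    static final Map<Character, Character> CELLS = new HashMap<>();
    static {
        CELLS.put('e', '\u2811');                       // dots 1-5
        CELLS.put('h', '\u2813');                       // dots 1-2-5
        CELLS.put('j', '\u281A');                       // dots 2-4-5
        CELLS.put('t', '\u281E');                       // dots 2-3-4-5
        CELLS.put(' ', '\u2800');                       // empty cell
    }

    static String translate(String text) {
        StringBuilder out = new StringBuilder();
        boolean inNumber = false;
        for (int i = 0; i < text.length(); i++) {
            char c = Character.toLowerCase(text.charAt(i));
            if (c >= '0' && c <= '9') {
                if (!inNumber) {
                    out.append(NUMBER_SIGN);            // announce the digits
                }
                out.append(DIGIT_CELLS.charAt(c - '0'));
                inNumber = true;
            } else {
                Character cell = CELLS.get(c);
                if (cell == null) {
                    throw new IllegalArgumentException("no cell defined for '" + c + "'");
                }
                out.append(cell.charValue());
                inNumber = false;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("THE"));
        System.out.println(translate("7"));
    }
}
```

A lookup table of this kind has no notion of context, so it cannot choose between alternative renderings of the same letters.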
Some examples from British English Braille [rni1992]:

    Character   Braille equivalent
    J           ⠚
    !           ⠖
    7           Number symbol (⠼) and 7 symbol (⠛), ⠼⠛
    3           Number symbol (⠼) and 3 symbol (⠉), ⠼⠉
    T           ⠞
    H           ⠓
    E           ⠑
    THE         ⠞⠓⠑

However, Braille languages do not use only these simple mappings. They also use abbreviations, or contractions, where single Braille cells or combinations of cells represent