Bugs and inefficiencies

The software's architecture is suboptimal in several respects. Because it was built as an experimental system for testing various theories, its functionality was not well defined at the outset. While the core of the algorithm is implemented as a clean Python module, options and variables reach it by several different routes: some are hard-coded in the micaseparse.py file (pseudosentence length, block size, and whether to use stop words); some are hard-coded in segmentation.py (smoothing size); and some are passed in from the command-line interface to segmentation.py.
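
One way to remove these separate routes would be to gather every setting into a single object that is filled in from the command line. The following is a minimal sketch only: the Options class, the flag names and the default values are hypothetical and are not taken from micaseparse.py or segmentation.py.

```python
import argparse
from dataclasses import dataclass


@dataclass
class Options:
    """All tunable settings in one place (names are hypothetical)."""
    pseudosentence_length: int = 20  # illustrative value only
    block_size: int = 6              # illustrative value only
    use_stop_words: bool = True
    smoothing_size: int = 3          # illustrative value only


def parse_options(argv=None):
    """Build an Options object from command-line arguments."""
    defaults = Options()
    parser = argparse.ArgumentParser(description="Segmentation options (sketch)")
    parser.add_argument("--pseudosentence-length", type=int,
                        default=defaults.pseudosentence_length)
    parser.add_argument("--block-size", type=int, default=defaults.block_size)
    parser.add_argument("--smoothing-size", type=int,
                        default=defaults.smoothing_size)
    # Stop-word handling is on unless this flag is given.
    parser.add_argument("--no-stop-words", dest="use_stop_words",
                        action="store_false")
    args = parser.parse_args(argv)
    return Options(**vars(args))
```

With such an arrangement, both scripts could accept the same Options object instead of each keeping its own hard-coded values.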

The architecture of micaseparse.py means that it can perform essentially only one pass over the input XML data. Because it extracts only the information it needs from the file, producing certain kinds of output requires two passes over the data. For example, two passes are required to produce automatic-break-annotated XML output: the first pass calculates the similarity values and performs trough detection, printing a list of breaks on standard output. The second pass reads in a list in the same format and prints an XML file with breaks inserted at the appropriate locations. This is very inefficient, as each run of the software performs a complete pseudosentence chunking and similarity value calculation, which in the second pass is completely redundant. The same inefficiency occurs if an experiment requires both smoothed and unsmoothed similarity data: there is currently no way to print both kinds of data simultaneously, so two full runs of the system are required, with different options each time.
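
The redundancy described above could be avoided by computing the similarity values once and deriving every output from that single result. The sketch below assumes the expensive chunking and similarity step has already produced a list of values; the smooth and find_troughs functions are simplified stand-ins (a moving average and a strict local-minimum test), not the routines used in segmentation.py.

```python
def smooth(values, size):
    """Moving average over up to `size` neighbours on each side."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - size): i + size + 1]
        out.append(sum(window) / len(window))
    return out


def find_troughs(values):
    """Indices whose value is strictly lower than both neighbours."""
    return [i for i in range(1, len(values) - 1)
            if values[i] < values[i - 1] and values[i] < values[i + 1]]


def analyse(similarities, smoothing_size):
    """Derive every output from one list of similarity values."""
    smoothed = smooth(similarities, smoothing_size)
    return {
        "unsmoothed": list(similarities),
        "smoothed": smoothed,
        "breaks": find_troughs(smoothed),
    }
```

A caller could then write the unsmoothed data, the smoothed data and the break list from the one returned dictionary, without repeating the similarity calculation.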

On a 1 GHz CPU, using a pseudosentence length of 20 and a block size of six, single-run execution time for a 5480-word dialogue (excluding stop words) is only 25.264 seconds. However, settings involving a greater number of comparisons take significantly longer: using a pseudosentence length of 1 and a block size of 120, a single run takes over three hours. When experimenting with pseudosentence length it is therefore important to minimise execution time and to remove redundant computation.
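
To give a rough sense of how the number of comparisons grows, the sketch below counts the gaps between adjacent pseudosentences for the two settings quoted above. It assumes one similarity comparison per gap, which may not match the actual implementation, and it says nothing about the cost of each individual comparison.

```python
import math


def gap_count(token_count, pseudosentence_length):
    """Number of gaps between adjacent pseudosentences."""
    pseudosentences = math.ceil(token_count / pseudosentence_length)
    return max(pseudosentences - 1, 0)


# The two (pseudosentence length, block size) settings from the text.
for length, block_size in [(20, 6), (1, 120)]:
    gaps = gap_count(5480, length)
    print(f"length={length} block={block_size}: {gaps} gaps, "
          f"{length * block_size} tokens per block")
```

Under these assumptions the first setting has 273 gaps and the second 5479, while both settings compare blocks of 120 tokens.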

The batch scripts written to automate the creation of the various output files, and to produce graphs, all use hard-coded paths to the scripts they execute. This is not a problem for the test environment, but represents a hazard to portability.
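
A common remedy is to locate the scripts relative to a known base directory rather than by absolute path. The following is a hedged sketch in Python; the script_path helper is hypothetical and the existing batch scripts may be written in another language.

```python
from pathlib import Path


def script_path(name, base_dir):
    """Return the path of script `name` located in `base_dir`."""
    path = Path(base_dir) / name
    if not path.is_file():
        raise FileNotFoundError(f"cannot find {name} in {base_dir}")
    return path
```

A driver script could call script_path("segmentation.py", Path(__file__).resolve().parent) so that the whole directory can be moved without editing any paths.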

A bug exists in the annotated XML generation functionality: when reading an external breaks list file, it incorrectly prints the name of that file on standard output. As a result, output redirected to a file includes this informational message, which confuses strict XML tools such as XSLT processors, and the message must be removed manually.
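
The usual fix for this kind of bug is to send informational messages to standard error, leaving standard output for the XML alone. The sketch below illustrates the idea; the read_breaks function and the one-integer-per-line file format are assumptions, not the actual code or format used by the system.

```python
import sys


def read_breaks(path):
    """Read one break position per line from `path` (format is assumed)."""
    # Diagnostics go to standard error so standard output stays pure XML.
    print(f"Reading breaks from {path}", file=sys.stderr)
    with open(path) as handle:
        return [int(line) for line in handle if line.strip()]
```

Redirecting standard output to a file then captures only the document, while the message still appears on the terminal.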

During the course of experimentation and evaluation, a new set of MICASE dialogue XML documents was acquired. Unfortunately, the XML format had changed: for example, the standard 'speaker' tags, <S1 /> and <S2 />, representing the original and overlapping speaker respectively, were replaced by the unified <S />. This introduced compatibility issues, as scripts had to be modified to handle both the new MICASE dialogues and the existing ones, which had already received much attention.
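
Handling both formats in one code path can be as simple as a single test that accepts either tag style. This is a minimal sketch under the assumption that only the element names S1, S2 and S are involved; the is_speaker_tag helper is hypothetical and is not the check used in the actual scripts.

```python
import re

# Matches the old numbered tags (S1, S2) and the newer unified tag (S).
SPEAKER_TAG = re.compile(r"S[12]?")


def is_speaker_tag(tag_name):
    """True if `tag_name` is a speaker element in either MICASE format."""
    return SPEAKER_TAG.fullmatch(tag_name) is not None
```

For example, is_speaker_tag("S2") and is_speaker_tag("S") are both true, while any other element name is rejected.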

While the Python scripts were successfully modified to read both formats, separate versions of the XSLT documents had to be created to handle each. As a result, the newer MICASE XML documents cannot be displayed using a different colour for each speaker. Additionally, due to differences between the original XML documents from MICASE and the system's output XML, further versions of the XSLT scripts were required to handle these cases.

James Ballantine 2005-02-19