Architecture

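The system consists of the following modules: the text segmentation unit, the cosine measure module, the trough detection module, the XML generation module, the HTML generation module, and the graph generation module.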

Each module is usable as a standalone unit, so generating all data for a single transcription requires a sequence of separate executions. This allows individual aspects of the system to be tested, debugged and evaluated more quickly, because only the components needed are run and the execution time of each is kept to a minimum. For example, some experiments may require fully marked-up HTML files but no graphs.

Because the modules must work in concert to produce a 'complete' result set, scripts are included to automate the process.

A full run of the system produces a number of intermediate files containing the data points of the cosine similarity measure, the smoothed similarity values, a list of detected topic changes, and a list of hand-marked topic changes. These data files can either be cached, which saves processing time when 'downstream' modules must be run several times, or used immediately and deleted. They are required for the generation of graphs and of marked-up conversations in XML and HTML.
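The caching behaviour described above can be sketched as follows. This is a minimal illustration and not the system's actual code: the function name load_or_compute, the use of JSON as the file format, and the file name in the usage note are all assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Callable, List


def load_or_compute(cache_file: Path, compute: Callable[[], List[float]]) -> List[float]:
    """Return cached values if the file exists; otherwise compute and store them.

    Assumption: values are stored as a JSON list. The real intermediate
    file format is not specified in this section.
    """
    if cache_file.is_file():
        # A cached copy exists, so the expensive upstream step is skipped.
        return json.loads(cache_file.read_text())
    values = compute()
    cache_file.write_text(json.dumps(values))
    return values
```

A downstream module could then call, for example, load_or_compute(Path("similarity.json"), run_segmentation), where run_segmentation is a hypothetical callable standing in for the upstream module. Deleting the file forces the values to be regenerated on the next run.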

The central module is the text segmentation unit itself. It is responsible for reading the original transcript file and applying the segmentation algorithm to it.

The cosine measure module has a single purpose: it calculates the value of the cosine measure for the two vectors currently under examination. For details of the cosine measure formula and the contents of the vectors, see section 3.3. It is called repeatedly by the central module while the continuous similarity values are calculated across each transcript.
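As an illustration, the standard cosine measure of two vectors can be computed as below. This is a sketch only: the function name cosine_measure is hypothetical, the vectors are assumed to be equal-length sequences of numeric weights, and returning 0.0 for a zero-length vector is a convention chosen for the example. The formula and vector contents actually used are those given in section 3.3.

```python
import math
from typing import Sequence


def cosine_measure(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a vector with no non-zero entries shares nothing
        # with its neighbour, so the similarity is taken to be zero.
        return 0.0
    return dot / (norm_a * norm_b)
```

Identical directions give a value of 1 and orthogonal vectors give 0, so a drop in the value between adjacent stretches of a transcript indicates that they have little in common.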

The trough detection module is called by the text segmentation unit. It detects troughs in the continuous series of similarity values; these troughs form the putative topic breaks output by the text segmentation unit.
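The simplest form of trough detection treats every strict local minimum as a trough, as in the sketch below. This section does not state the criterion the module actually applies (for example, any depth threshold or use of the smoothed values), so both the criterion and the function name find_troughs are assumptions made for illustration.

```python
from typing import List, Sequence


def find_troughs(values: Sequence[float]) -> List[int]:
    """Return the indices of strict local minima in a series of values.

    Assumption: a trough is any point lower than both of its neighbours.
    The first and last points are never reported, as each has one neighbour.
    """
    troughs = []
    for i in range(1, len(values) - 1):
        if values[i] < values[i - 1] and values[i] < values[i + 1]:
            troughs.append(i)
    return troughs
```

For the series [0.9, 0.4, 0.8, 0.7, 0.2, 0.6] this returns the indices 1 and 4, which would then be reported as putative topic breaks.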

The XML generation module uses the internal (memory-only) structure created by the text segmentation module to interleave topic change markers into a simplified version of the original input file. It must therefore run in the same execution as the text segmentation module, and its input cannot be cached.
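The interleaving step might look like the following sketch, written with Python's standard xml.etree.ElementTree module. The element names transcript, utterance and topicchange, the function name, and the representation of the in-memory structure as a list of utterances plus a collection of break indices are all hypothetical. The real schema is not described in this section.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Sequence


def build_marked_up_xml(utterances: Sequence[str], breaks: Iterable[int]) -> str:
    """Serialise utterances, placing an empty marker before each break index.

    Assumption: a break index i means that a topic change occurs
    immediately before utterance i.
    """
    break_set = set(breaks)
    root = ET.Element("transcript")  # hypothetical element name
    for index, text in enumerate(utterances):
        if index in break_set:
            ET.SubElement(root, "topicchange")  # hypothetical marker element
        utterance = ET.SubElement(root, "utterance")
        utterance.text = text
    return ET.tostring(root, encoding="unicode")
```

Because the markers are written while the segmentation results are still held in memory, a design of this kind matches the constraint that the XML step cannot start from cached input.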

The HTML generation module uses the cacheable output of the XML module to create the human-readable transcript of the data. The output of the HTML module is used in conjunction with a style sheet to allow convenient viewing and printing.

The graph generation module uses the cacheable output of the text segmentation module: the listings of similarity measure values, detected topic changes, and hand-marked topic changes. It can work with partial input, so it will generate graphs containing whatever input is provided.
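One way to support partial input is to treat every data series as optional and plot only those that were supplied. The sketch below shows the selection step only, without any plotting; the function name collect_plot_layers and the parameter names are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Sequence


def collect_plot_layers(
    similarity: Optional[Sequence[float]] = None,
    detected_breaks: Optional[Sequence[float]] = None,
    hand_marked_breaks: Optional[Sequence[float]] = None,
) -> Dict[str, List[float]]:
    """Return only the graph layers for which data was supplied."""
    candidates = {
        "similarity": similarity,
        "detected_breaks": detected_breaks,
        "hand_marked_breaks": hand_marked_breaks,
    }
    # Layers left as None are skipped, so a graph can still be drawn
    # from whatever subset of the intermediate data is available.
    return {name: list(data) for name, data in candidates.items() if data is not None}
```

Calling collect_plot_layers(similarity=[0.5, 0.25]) yields a single layer, so a graph of the similarity curve alone could be drawn before any topic changes have been detected or hand-marked.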

Additionally, all modules will accept as optional input a pre-generated list of detected topic breaks in order to reduce redundancy.

James Ballantine 2005-02-19