Outline

The aim of this research was to investigate methods of topic segmentation for the special case of spoken dialogue transcripts. To facilitate this, the differences between traditional written text (`expository text' [6]) and spoken dialogue were first explored. Subsequently, a series of tests was performed to evaluate the TextTiling algorithm, which was designed for topic segmentation of written text, in this domain. Using human annotation as the reference, evaluations were performed on a test suite of 10 transcripts drawn from diverse domains of spoken dialogue. Additionally, the parameters of the TextTiling algorithm were varied to determine whether the optimal values derived by Hearst [6] for magazine articles remain optimal for spoken dialogue. Finally, results were examined at the level of individual topic breaks to investigate patterns in the successes and failures of TextTiling in the spoken dialogue domain.

To this end, a system using TextTiling was implemented and run on transcripts of spoken dialogue (see chapter 3 for implementation details), and the Michigan Corpus of Academic Spoken English (MICASE) was chosen as the source of transcripts. These transcripts are manually transcribed (not the product of speech recognition) and so represent a relatively `clean' source of spoken dialogue. However, it is important to note that they contain little punctuation within speaker turns; see section 2.2.1 for a discussion of how this would affect heuristic detection of cue-phrases.

Human-annotated topic breaks within MICASE documents were then used to evaluate the topic breaks detected automatically by the system under a variety of parameter settings.
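
As a rough illustration of this kind of comparison, the sketch below scores a set of automatically detected breaks against a human-annotated reference. It is not the evaluation procedure used in this work: the function name `score_breaks`, the use of precision, recall and F1, the integer break positions (for example, speaker-turn indices), and the `tolerance` window are all assumptions made for the example.

```python
from typing import Iterable, Tuple


def score_breaks(
    predicted: Iterable[int],
    reference: Iterable[int],
    tolerance: int = 0,
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for predicted topic breaks.

    Breaks are integer positions (hypothetically, speaker-turn indices).
    A predicted break counts as a hit when an unused reference break
    lies within `tolerance` positions of it.
    """
    pred = sorted(set(predicted))
    unmatched = sorted(set(reference))
    n_ref = len(unmatched)
    hits = 0
    for p in pred:
        # Greedy matching: take the closest reference break still unused.
        # This is a simplification, not an optimal assignment.
        best = min(unmatched, key=lambda r: abs(r - p), default=None)
        if best is not None and abs(best - p) <= tolerance:
            unmatched.remove(best)
            hits += 1
    precision = hits / len(pred) if pred else 0.0
    recall = hits / n_ref if n_ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if hits else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up break positions, for illustration only.
    p, r, f = score_breaks([5, 12, 30], [5, 13, 20, 31], tolerance=1)
    print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Running such a function once per parameter setting would give one score triple per configuration, which is one simple way the settings could be compared against the same human reference.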

James Ballantine 2005-02-19