Automatically learning topic-change cues for dialogue

Litman and Passonneau [11] present a method for topic segmentation based on machine learning of ``discourse segment boundaries'', analysing both textual and prosodic information and using a corpus of spoken dialogue transcripts marked with prosodic features. They first hand-tuned the system based on the errors it made on the corpus, and then applied a machine learning technique to automate the same process of guided improvement, measured by accuracy on a pre-tagged corpus. In subsequent evaluation, the machine learning algorithm performed slightly better than the hand-tuned approach.

Litman's system approaches topic segmentation by analysing candidate segment boundaries. At each boundary in the training corpus, the prosodic and cue-phrase features are extracted. Each new utterance in turn is considered as a potential topic boundary, using a static set of cue phrases derived from [7] (also, and, anyway, basically, because, but, finally, first, like, meanwhile, no, now, oh, ok, only, or, see, so, then, well, where ...): if a member of the cue-phrase list appears at the start of the sentence and is followed by another word from a subset of secondary cue-phrase words, the utterance is marked as a potential boundary.
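The cue-phrase test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system's actual implementation: the text does not say which words make up the secondary subset, so `SECONDARY_CUES` is a hypothetical placeholder, and the function name and the simple whitespace tokenisation are likewise assumptions.

```python
# Cue phrases as listed in the text (the source list is elided after "where").
# ASSUMPTION: "but" and "finally" are treated as two separate entries.
CUE_PHRASES = {
    "also", "and", "anyway", "basically", "because", "but", "finally",
    "first", "like", "meanwhile", "no", "now", "oh", "ok", "only", "or",
    "see", "so", "then", "well", "where",
}

# ASSUMPTION: the text does not name the secondary subset; these words are
# placeholders chosen purely for illustration.
SECONDARY_CUES = {"now", "so", "then", "well"}


def is_potential_boundary(utterance: str) -> bool:
    """Return True if the utterance opens with a cue phrase that is
    immediately followed by a word from the secondary subset."""
    # Naive tokenisation: split on whitespace, strip punctuation, lowercase.
    tokens = [t.strip(".,;:!?").lower() for t in utterance.split()]
    tokens = [t for t in tokens if t]
    if len(tokens) < 2:
        return False
    return tokens[0] in CUE_PHRASES and tokens[1] in SECONDARY_CUES
```

With these placeholder lists, `is_potential_boundary("Well, now we come to the second point.")` is true, while an utterance such as "And he walked away." is not flagged, because its second word is not a secondary cue.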

Coreference resolution also plays a part in the detection of potential topic boundaries: if the first noun phrase after the boundary is found to be coreferent with a noun phrase from before the boundary, the site is marked `+coref'; otherwise it is marked `-coref'. The influence of this marker on topic boundary detection is tuned later, either by hand or through machine learning, but it is reasonable to hypothesize that coreference across a potential topic boundary makes a true boundary at that site less likely. In support of a hypothesis in [15], Passonneau notes that ``adjacent utterances are more likely to contain expressions that corefer, or that are inferentially linked, if they occur within the same segment; and that a definite pronoun is more likely than a full NP to refer to an entity that was mentioned in the current segment, if not in the previous utterance.'' [11]
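As a rough sketch of how the coreference marker could be computed, the Python function below assumes that coreference resolution has already been performed upstream, so that each noun phrase is represented by an entity identifier. The function name, the argument names and the identifier strings are all hypothetical.

```python
from typing import Iterable, Optional


def coref_marker(first_np_after: Optional[str], nps_before: Iterable[str]) -> str:
    """Label a candidate boundary site '+coref' or '-coref'.

    ASSUMPTION: noun phrases are given as entity identifiers produced by
    an earlier coreference-resolution step; two phrases corefer exactly
    when their identifiers are equal.
    """
    if first_np_after is not None and first_np_after in set(nps_before):
        return "+coref"
    # No noun phrase after the boundary, or no coreferent phrase before it.
    return "-coref"
```

For example, `coref_marker("e1", ["e1", "e2"])` yields `'+coref'`, whereas `coref_marker("e3", ["e1", "e2"])` yields `'-coref'`.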

Pause duration and other prosodic information are also introduced as separate metrics in the overall combination that makes up the system, but they are not the focus of this literature review.

The machine learning algorithm uses the collection of linguistic features (anaphoric, cue-phrase and prosodic) at each potential boundary site. Two classes are to be learned: boundary and non-boundary. From pre-marked training data consisting of 10 documents, comprising 1004 potential boundary sites, the algorithm produces a decision tree that predicts which of the two classes an unseen potential boundary site falls into.
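To make the idea of learning a decision tree over boundary sites concrete, the sketch below builds a small tree by information gain (in the style of ID3). It is a stand-in only: the text above does not name the learning algorithm used, the feature names `cue`, `coref` and `long_pause` are hypothetical, and the six training sites are invented for illustration rather than drawn from the corpus.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_feature(rows, labels, features):
    """Pick the feature whose split gives the largest information gain."""
    base = entropy(labels)

    def gain(feature):
        remainder = 0.0
        for value in {r[feature] for r in rows}:
            subset = [l for r, l in zip(rows, labels) if r[feature] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        return base - remainder

    return max(features, key=gain)


def build_tree(rows, labels, features):
    """Recursively build a tree; leaves are class labels."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority
    feature = best_feature(rows, labels, features)
    remaining = [f for f in features if f != feature]
    branches = {}
    for value in {r[feature] for r in rows}:
        idx = [i for i, r in enumerate(rows) if r[feature] == value]
        branches[value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining
        )
    return {"feature": feature, "branches": branches, "default": majority}


def classify(tree, row):
    """Walk the tree until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["feature"]], tree["default"])
    return tree


# ASSUMPTION: toy boundary sites invented for illustration only.
rows = [
    {"cue": True, "coref": "-coref", "long_pause": True},
    {"cue": True, "coref": "+coref", "long_pause": False},
    {"cue": False, "coref": "-coref", "long_pause": True},
    {"cue": False, "coref": "+coref", "long_pause": False},
    {"cue": True, "coref": "-coref", "long_pause": False},
    {"cue": False, "coref": "+coref", "long_pause": True},
]
labels = ["boundary", "non-boundary", "boundary",
          "non-boundary", "boundary", "non-boundary"]

tree = build_tree(rows, labels, ["cue", "coref", "long_pause"])
```

On this toy data the learned tree can be applied to an unseen site with `classify(tree, {"cue": False, "coref": "-coref", "long_pause": False})`. A real system would train on the full set of annotated sites and evaluate on held-out documents.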

Litman suggests that the success of this algorithm (which, unlike the methods described in [6] and [1], does not use any statistical word-frequency information) points to a correlation between specific linguistic devices and discourse structure. If this is true, combining a word-frequency-based statistical approach with one that verifies potential topic changes using linguistic features may provide better results.

James Ballantine 2005-02-19