Beeferman motivates his work initially as a tool to improve Information Retrieval task output: When a user submits a query to a large corpus of documents, it would be preferable to return a snippet containing the requested information which is bounded by its nearest topic-change boundaries, rather than the more traditional character or word limit. This forms a simple but reasonable approach to the general question of how much surrounding material to return to the user.
Beeferman's approach to topic segmentation is to assign a probability to the end of every sentence in the document, reflecting the likelihood that it is the end of a topic. This probability is determined by a model which weighs features in the surrounding text. Examples of features pertinent to the domain of multimedia include the presence of the phrase ``coming up'' in the last utterance.
These features are selected by incrementally building an exponential model (see [2]), and then combined into a predictive model using an algorithm ``akin to growing a decision tree''. Each potential feature in the source stream is added to the exponential model, and the gain (the largest possible reduction in divergence between the model and the pre-marked text) achieved from this addition is recorded. The candidate feature yielding the largest gain is then added to the model. This process is described in [2].
Each word is examined in the context of approximately 500 words surrounding it, effectively guaranteeing that the context is unique. Using Kullback-Leibler divergence, the degree to which the current context approximates each model in the training corpus is calculated. To avoid overfitting, it must be determined which are the salient features present in the training corpus. This is done by creating a linear exponential family distribution, which ensures that the features are salient.
Topicality is defined as a function combining a `myopic' trained trigram model for topic change combined with the longer-range model. This is because the long-range model is prone to confusion immediately after topic change: its probability assignments to various words will represent the wrong domain for several sentences after the change, which will confuse it.
Beeferman's exponential model for text segmentation achieved better results than TextTiling for the domain of newspaper text: An error rate of 0.19 compared to TextTiling's 0.296, a miss probability of 0.24 (TextTiling: 0.457) and a false-positive probability of 0.1575 (TextTiling: 0.191). The algorithm is, however, greatly more complex, both in implementation and in computational efficiency. Also, Beeferman's evaluation environment is disparate chunks of newspaper text, a more easily defined but different challenge to the one for which TextTiling was designed: `Subjective' segmentation within a single document.
James Ballantine 2005-02-19