TextTiling and document summarisation

Boguraev et al. [3] describe a system designed to meet the challenge of high-level information overview when handling large documents. It uses a modified form of TextTiling to organise automated document summaries into nested `point form' documents. Topics are detected within the document, and subsequently the key phrases (salient expressions or `topic stamps') are inserted as sub-points below each topic.

The work of Boguraev et al. primarily concerns document summarisation rather than topic segmentation, but in implementing a version of TextTiling for the system it proposes an important general extension to the topic segmentation algorithm: automatic labeling. In many conceivable applications, topic labeling is an important aspect of topic segmentation, useful for such tasks as browsing quickly through a large document and automatically indexing subject matter.

Boguraev's system makes use of `capsule overviews' of documents, abstractions of document content designed to capture `aboutness'. The algorithm aims to collect for each document a set of ``highly salient'' phrases. The highest-scoring of these salient phrases are then labeled ``topic stamps'', and the term capsule overview refers to the document's collection of topic stamps.

The topic stamps are noun phrases, intended to be those relevant to the current topic. Ideally they are discourse referents within the discourse of the document. Discourse referents are ranked according to their salience, which is judged by how prominently a referent is introduced into the discourse, how much of the discussion involves it, and how often it is mentioned elsewhere in the document. These criteria are combined into a single score; those discourse referents with a high salience weight become the topic stamps for the document. The resulting set of topic stamps is highly compact, listing only the most important topic phrases in the document.
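The ranking step can be illustrated with a minimal sketch. The source does not state how the three criteria are combined, so the weighted sum, the `Referent` fields, the default weights and the threshold below are all assumptions made for illustration, not the formula used in [3].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Referent:
    """A candidate discourse referent with three hypothetical criterion scores."""
    phrase: str
    intro_prominence: float    # how prominently the referent is introduced
    discussion: float          # how much of the discussion involves it
    mentions_elsewhere: float  # how often it is mentioned elsewhere


def salience(ref, weights=(1.0, 1.0, 1.0)):
    """Combine the three criteria into one score (assumed: a weighted sum)."""
    w_intro, w_disc, w_else = weights
    return (w_intro * ref.intro_prominence
            + w_disc * ref.discussion
            + w_else * ref.mentions_elsewhere)


def select_topic_stamps(referents, threshold, weights=(1.0, 1.0, 1.0)):
    """Return the phrases whose salience reaches the threshold, best first."""
    scored = [(salience(r, weights), r.phrase) for r in referents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [phrase for score, phrase in scored if score >= threshold]
```

Raising the threshold yields a more compact set of topic stamps; how the real system sets this cut-off is not described in this section.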

This list is integrated into a capsule overview by combining it with the result of a topic segmentation algorithm based upon TextTiling. The document is first segmented using TextTiling, as described in section 2.1.1. Once the locations of (likely) topic breaks have been determined, the topic stamp algorithm is applied to the same document. The topic stamps are retrieved and divided into groups according to the TextTiling topic under which they were found. It is important to note that this differs from performing a document summarisation task individually on each topic, as the algorithm analyses the whole document to learn the true importance of each topic stamp. The topic stamps are balanced representations of important subtopics, so a TextTiling topic which introduces few important phrases should receive fewer topic stamps; this is only possible if the algorithm has access to the full document at topic stamp generation time.
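The grouping step might be sketched as follows, assuming the segmenter reports the start position of each TextTiling topic and each topic stamp carries the position at which it was found. The function name, the position units (for example sentence indices) and the sample positions in the usage note are hypothetical; the key point from the text is that the stamps are selected over the whole document first and only then assigned to segments.

```python
from bisect import bisect_right


def group_stamps_by_segment(segment_starts, stamps):
    """Assign document-wide topic stamps to the segments that contain them.

    segment_starts: sorted start positions of the segments, beginning at 0.
    stamps: (phrase, position) pairs chosen over the whole document.
    Returns one list of phrases per segment, in order of appearance.
    """
    if not segment_starts or segment_starts[0] != 0:
        raise ValueError("segment_starts must be non-empty and begin at 0")
    groups = [[] for _ in segment_starts]
    for phrase, position in sorted(stamps, key=lambda item: item[1]):
        if position < 0:
            raise ValueError("positions must be non-negative")
        # Index of the last segment that starts at or before this position.
        segment_index = bisect_right(segment_starts, position) - 1
        groups[segment_index].append(phrase)
    return groups
```

For instance, with invented segment starts `[0, 10, 25]`, a stamp found at position 12 falls in the second segment. A segment containing no selected stamp simply receives an empty list, which reflects the balancing behaviour described above.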

The topic stamp strings, in order of appearance in the document, are superimposed on the overall layout provided by the TextTiling algorithm. This produces a numbered list (representing TextTiling topics), with sub-items representing an ordered list of subtopics (represented by topic stamps). In a full presentation of the system's output, these lists have superimposed upon them ``progressively more refined and more detailed discourse fragments'' [3], meaning the system is capable of providing a top-down view of the document with the TextTiling topics serving as the highest level of structure.

This system offers a robust method of providing labels, or sets of labels, for TextTiling topics. In the examples given in [3], each TextTiling topic has two topic stamps (which are noun phrases, as they are derived from discourse referents). The example document given is an editorial on Microsoft Corporation and Apple Computer; the discourse referents chosen for the first three topics were:

  1. APPLE; MICROSOFT
  2. DESKTOP MACHINES; OPERATING SYSTEMS
  3. GILBERT AMELIO; NEW OPERATING SYSTEM

Each pair of topic stamps forms an excellent label for its TextTiling topic. The stamps also suggest a possible method for finding finer-grained, sentence-level topic changes within a TextTiling approach, based on individual topic stamps.

As described in [3], a significant addition to TextTiling is the system's ability to ``track'' threads of topics. Because it follows topics from the highest level (TextTiling) to the sentence level, building a form of tree as it does so, it has a better and more complete picture of the layout and structure of the document.

This has many applications, of which document summarisation is the one explored here. Other possibilities include tuning a topic segmentation system to find broader or finer topics, in real time, according to a user request for more detail; one might imagine a slider or other widget allowing the user to expose more or less detail of a document (from two or three bullet points describing the subject of the document, to the entire document in its original format).
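The slider idea can be sketched under the assumption that the system's output is held as a tree, with TextTiling topics near the root and progressively finer fragments below. The `(label, children)` tuple representation and the `render` function are hypothetical and are not taken from [3]; the slider position is modelled as a maximum depth.

```python
def render(node, max_depth, depth=0):
    """Return the outline lines for a tree, cut off below max_depth.

    node: a (label, children) pair, where children is a list of nodes.
    max_depth: 0 shows only the root; larger values expose more detail.
    """
    label, children = node
    lines = ["  " * depth + label]
    if depth < max_depth:
        for child in children:
            lines.extend(render(child, max_depth, depth + 1))
    return lines
```

Moving the slider then amounts to calling `render` again with a different `max_depth`: a small value gives a handful of top-level points, while a value at least as large as the depth of the tree exposes every fragment stored in it.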

James Ballantine 2005-02-19