Mittal states that the ultimate requirement in document summarisation is to find the fine-grained discourse structure of the document, in the form of a tree. However, he regards this as a harder challenge than successful human-quality document summarisation, and so uses TextTiling as a broad, top-down approximation to discourse structure. Although the broad linear chain of topics produced by TextTiling does not provide tree structure, Mittal notes that in some domains (such as news journalism), the tree of discourse structure is essentially ``right branching'' [13], in that it proceeds linearly, digressing into a series of discrete topics in turn. This means that the flow of the document is essentially linear, so there is a significant possibility that TextTiling will discover most of the important interrelations in the discourse structure simply through its unbranching chain approach to topic flow.
Mittal notes that ``in theory, sub-document segmentation can be carried out recursively, yielding approximate boundaries for a hierarchical approximation to the discourse structure'' [13]. In practice, the TextTiling algorithm itself seems to impose certain limitations: its word-frequency-based approach would run into a data sparseness problem below the level at which it usually works, and the aforementioned ``fuzzy'' accuracy of its topic break placement would present a further challenge. However, the concept of using recursive topic segmentation to discover (an approximation to) the complete discourse structure of a document is intriguing. Perhaps TextTiling could still be applied to very large documents, segmenting first into ``chapters'' and then recursively into the segments for which it is usually employed.
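The idea of recursive segmentation can be illustrated with a minimal sketch. This is not TextTiling and not Mittal's method: it is a simplified stand-in that splits a list of sentences at the gap where the word-frequency vectors of the two sides are least similar, and then recurses on each side. The names `bag`, `cosine`, `segment` and the `min_size` parameter are hypothetical, introduced only for this illustration. The `min_size` cut-off reflects the data sparseness concern: below a certain amount of text, word frequencies are too thin to support a further split.

```python
import math
import re
from collections import Counter


def bag(sentences):
    """Word-frequency vector for a block of sentences."""
    return Counter(w for s in sentences for w in re.findall(r"[a-z]+", s.lower()))


def cosine(a, b):
    """Cosine similarity between two Counter vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment(sentences, min_size=2):
    """Recursively split at the gap where the two sides are least similar.

    Returns a nested list: a leaf is a list of sentence strings, and an
    internal node is a two-element list [left_subtree, right_subtree].
    Note that this sketch always splits when there is enough text; a real
    system would also need a criterion for deciding that no topic break exists.
    """
    if len(sentences) < 2 * min_size:
        return sentences  # too little text to split reliably
    gaps = range(min_size, len(sentences) - min_size + 1)
    best = min(gaps, key=lambda i: cosine(bag(sentences[:i]), bag(sentences[i:])))
    return [segment(sentences[:best], min_size), segment(sentences[best:], min_size)]
```

Calling `segment` on a long list of sentences would yield a binary tree of progressively smaller segments, which is one possible reading of the ``hierarchical approximation'' described above.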
Mittal's research showed that ``in all cases, pre-processing for topic boundaries can significantly improve the quality of a summarization system, often by a factor of 2''. Essentially, TextTiling was used here to reduce redundancy: the system attempts to select summary sentences from different topics in the document, to avoid choosing two summary sentences which describe the same area (and presumably to avoid missing any topics altogether).
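The redundancy-reduction idea can be sketched as follows, assuming the document has already been divided into topic segments. The sketch picks one sentence per segment, so that no two summary sentences come from the same topic and no topic is skipped. The scoring function is an assumption made for illustration (mean TF-IDF of a sentence's words, treating each sentence as a ``document'' for the IDF statistics) and is not claimed to be Mittal's ranking; the names `tokens` and `summarise` are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(sentence):
    """Lower-cased alphabetic word tokens of a sentence."""
    return re.findall(r"[a-z]+", sentence.lower())


def summarise(segments):
    """Pick the highest-scoring sentence from each non-empty topic segment.

    `segments` is a list of segments, each a list of sentence strings.
    """
    sentences = [s for seg in segments for s in seg]
    n = len(sentences)
    # Document frequency: the number of sentences in which each word occurs.
    df = Counter(w for s in sentences for w in set(tokens(s)))

    def score(sentence):
        words = tokens(sentence)
        if not words:
            return 0.0
        tf = Counter(words)
        return sum(tf[w] * math.log(n / df[w]) for w in tf) / len(words)

    # One sentence per segment, returned in document order.
    return [max(seg, key=score) for seg in segments if seg]
```

Because selection happens within each segment independently, no positional features (such as ``first sentence of the paragraph'') are needed to spread the summary across the document.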
The document summarisation algorithms employed by Mittal, used in conjunction with TextTiling, TF-IDF sentence ranking and syntactic complexity/named-entity relationships, were also able to avoid using the positional information typically required in document summarisation. Positional information can be seen as less important when working on a segmented document: there is no longer any need to locate potential topic changes by position, and each segment can be assumed to be homogeneous and monotopical.
James Ballantine 2005-02-19