Perhaps the most important next stage of evaluation is to expand the number of human evaluators in order to perform inter-annotator agreement tests on the scale of [6] and [15].
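As an illustration of what such an agreement test might compute, the sketch below implements Cohen's kappa for two annotators who each mark every candidate gap as a topic boundary (1) or not (0). This is only one common agreement statistic; the choice of kappa, the function name and the binary labelling scheme are assumptions made here for illustration, not the procedure of [6] or [15].

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: both sequences hold one categorical label per candidate
    gap (e.g. 1 = topic boundary, 0 = no boundary), in the same order.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

With more than two evaluators, a multi-rater statistic would be needed in place of this pairwise measure.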
Extending the system from a word-frequency-only approach to a combined approach that also makes use of cue-phrases would be useful as well: unlike in TextTiling's native domain, cue-phrases are ubiquitous in spoken dialogue and, as shown in [7], can be effective in locating potential topic changes.
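One simple way such a combination might work is to raise the lexical boundary score of a gap when the following utterance opens with a cue-phrase. The sketch below is a minimal illustration under stated assumptions: the cue-phrase list, the additive weighting and all function names are hypothetical, and the per-gap lexical scores are assumed to come from the existing word-frequency component.

```python
# Hypothetical cue-phrase list; a real list would be derived from data.
CUE_PHRASES = ("okay", "so", "anyway", "now", "right")


def cue_bonus(utterance, cues=CUE_PHRASES):
    """Return 1.0 if the utterance begins with a cue-phrase, else 0.0."""
    words = utterance.lower().split()
    if not words:
        return 0.0
    first = words[0].strip(".,!?")
    return 1.0 if first in cues else 0.0


def combined_scores(lexical_scores, utterances, weight=0.5):
    """Add a weighted cue-phrase bonus to each gap's lexical score.

    Assumption: lexical_scores[i] is the word-frequency-based boundary
    score for the gap between utterances[i] and utterances[i + 1].
    """
    if len(lexical_scores) != len(utterances) - 1:
        raise ValueError("expected one lexical score per gap between utterances")
    return [
        score + weight * cue_bonus(utterances[i + 1])
        for i, score in enumerate(lexical_scores)
    ]
```

The weight controls how much a cue-phrase can override weak lexical evidence and would need to be tuned on annotated data.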
To make the current system more applicable to real-world situations, a useful extension would be the ability to label detected topics by their content, as performed in [3]. This would enable applications such as web-based meeting transcript browsers, in which users scan a list of discussed topics and jump directly to the one they seek.