... 2'' (footnote 2.1)
``based on average 5-point $F_{1}$ score averaged across compression levels and normalized with the random sentence-selection baseline.'' [13]
... evaluation. (footnote 3.1)
Hearst notes, however, that there is a high level of disagreement between human judges in a topic segmentation task, and thus this method is not infallible [6]. See Chapter 5 for a discussion of this issue.