Ancillary modules

segmentation.py contains two variable constants relevant to the tuning of the system: smooth_size, which controls the size of the window over which the smoothing algorithm will function (details below) and percent_detection, the minimum size of a detected trough as a percentage of the total range of the cosine measure over the current document.

The cosinemeasure.py module is implemented as follows: It takes as input two equal-length vectors of integers. It computes a similarity value as a single real number, $n$, $0 \leq n \leq 1$. This similarity metric is computed using equation 3.1:

\begin{displaymath}
sim(b_{1},b_{2}) = \frac{\sum_{t} w_{t,b_{1}} w_{t,b_{2}}}
{\sqrt{\sum_{t} w_{t,b_{1}}^{2} \sum_{t=1}^{n} w^{2}_{t,b_{2}}}}
\end{displaymath} (3.1)

Where $b_1$ and $b_2$ are the vectors under comparison, $t$ ranges across the values in each vector, and $w_{t,b_{n}}$ is the value in vector $n$ at position $t$. The result of this calculation is the cosine of the angle between the hyperdimensional lines represented by each vector. For example, imagine two vectors of length two: In this case, the angle between the lines they represent is easy to visualise - the first number represents an arbitrary rotation in the horizontal plane, and the second a rotation in the vertical plane. This system is also immune to magnitude: The vectors $[1,2,3,4]$ and $[2,4,6,8]$ represent the same line, and thus have an angle between them of 0 radians, and a corresponding cosine measure value of 1.

In a language technology context, the cosine measure is usually used to determine the similarity of two documents based on word frequency. Thus, the vectors represent every unique word present in the entire document under segmentation. Each value is the frequency of the corresponding word within the current window. At any time, most of the values in the vector are zero, but this has no effect on the cosine measure, and only nonzero numbers impact the similarity measure.

The peakpick.py module locates significant local minima (troughs) within a vector of real numbers. It must in addition be passed a percentage value representing its sensitivity as a percentage of the total variation.

The module first determines the range within the vector, and calculates the absolute change in value that the percentage argument represents. It then steps through the values in the vector, remembering the highest value. When a value is located that has dropped more than the threshold amount from the current highest value, this value is watched. If the value rises above the current lowest value by a greater amount than the threshold, the current lowest value is marked as a trough, and current highest and lowest values are reset.

James Ballantine 2005-02-19