The cosinemeasure.py module is implemented as follows:
It takes as input two equal-length vectors of integers. It computes a similarity value as a single real number,
,
. This similarity metric is computed using equation 3.1:
In a language technology context, the cosine measure is usually used to determine the similarity of two documents based on word frequency. Thus, the vectors represent every unique word present in the entire document under segmentation. Each value is the frequency of the corresponding word within the current window. At any time, most of the values in the vector are zero, but this has no effect on the cosine measure, and only nonzero numbers impact the similarity measure.
The peakpick.py module locates significant local minima (troughs) within a vector of real numbers. It must in addition be passed a percentage value representing its sensitivity as a percentage of the total variation.
The module first determines the range within the vector, and calculates the absolute change in value that the percentage argument represents. It then steps through the values in the vector, remembering the highest value. When a value is located that has dropped more than the threshold amount from the current highest value, this value is watched. If the value rises above the current lowest value by a greater amount than the threshold, the current lowest value is marked as a trough, and current highest and lowest values are reset.
James Ballantine 2005-02-19