Visualisation

The system uses two visualisation methods for the data it produces: graphs and marked-up transcriptions.

The graphs are produced using R. A collection of simple R scripts reads as many input files as are available (at most the similarity data, smoothed similarity data, detected topic breaks and human-annotated topic breaks files) and plots the similarity values as a continuous line graph, with the detected and human-annotated breaks drawn as coloured vertical lines. These layers can be combined flexibly to show as much data as is necessary: at full functionality, the scripts (in file twoplot.R) overplot the raw similarity data in grey, the smoothed similarity data in black, the automatically detected breaks as broken green lines, and the human-annotated breaks in red. R outputs its graphs in Encapsulated PostScript (EPS) format, suitable for printing as a full page.
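The "as many input files as are available" behaviour can be illustrated with a minimal Python sketch. It is not the R code in twoplot.R: the file suffixes and the function name select_layers are hypothetical, and the solid line style for the human-annotated breaks is an assumption. Only the colours and the broken green lines come from the description above.

```python
from pathlib import Path

# Colours follow the description above; the suffixes are hypothetical.
LAYERS = [
    # (hypothetical suffix, kind of plot, colour, line style)
    (".sim", "line", "grey", "solid"),        # raw similarity values
    (".smooth", "line", "black", "solid"),    # smoothed similarity values
    (".auto", "vlines", "green", "dashed"),   # automatically detected breaks
    (".human", "vlines", "red", "solid"),     # human-annotated breaks (style assumed)
]


def select_layers(stem, directory="."):
    """Return, in plotting order, the layers whose input file exists."""
    base = Path(directory)
    return [
        {"path": base / (stem + suffix), "kind": kind,
         "colour": colour, "style": style}
        for suffix, kind, colour, style in LAYERS
        if (base / (stem + suffix)).is_file()
    ]
```

A dialogue for which only the raw similarity file and the human-annotated breaks file exist would therefore yield two layers, grey and red, and a dialogue with no files would yield an empty list.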

The marked-up transcriptions are produced using XSLT, an XML transformation language, processed by the program xsltproc. The XML transcript files produced by the Python modules are processed by an XSLT "stylesheet", which transforms XML elements such as <S name="x"/> (speaker), <BREAK /> (human-annotated topic break), and <AUTOBREAK /> (detected topic break) into valid HTML 4.0 for viewing in a web browser. Each speaker turn is marked as a separate <P /> (paragraph), with special features such as speaker names, pseudosentence numbers, and topic breaks each marked as a <SPAN /> with a special style name.
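The mapping from transcript elements to HTML can be sketched in Python using the standard library, purely as an illustration; the system itself performs this step with XSLT and xsltproc. The sketch assumes a flat transcript in which <S/>, <BREAK/> and <AUTOBREAK/> are empty marker elements followed by text. The function name, the style names (speaker, break, autobreak) and the "||" marker text are hypothetical choices, and pseudosentence numbers are omitted.

```python
import xml.etree.ElementTree as ET
from html import escape


def transcript_to_html(xml_text):
    """Convert a flat transcript (marker elements plus text) to HTML paragraphs."""
    root = ET.fromstring(xml_text)
    paragraphs = []   # finished <P> elements
    current = []      # fragments of the paragraph being built

    def flush():
        # Close the paragraph in progress, if there is one.
        if current:
            paragraphs.append("<P>" + "".join(current).strip() + "</P>")
            current.clear()

    for element in root:
        if element.tag == "S":
            # A speaker marker opens a new turn, hence a new paragraph.
            flush()
            name = escape(element.get("name", ""))
            current.append(f'<SPAN class="speaker">{name}:</SPAN> ')
        elif element.tag == "BREAK":
            current.append('<SPAN class="break">||</SPAN> ')
        elif element.tag == "AUTOBREAK":
            current.append('<SPAN class="autobreak">||</SPAN> ')
        # Text that follows a marker element is stored in its tail.
        if element.tail and element.tail.strip():
            current.append(escape(element.tail.strip()) + " ")
    flush()
    return "\n".join(paragraphs)
```

Because each feature becomes a <SPAN> with its own style name, the appearance of speakers and breaks can be changed in a style sheet without touching the transformation.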

Three XSLT scripts exist; each includes a different level of detail in the output HTML:

These scripts allow custom CSS (Cascading Style Sheets) files to be written easily for the transcript HTML, changing its layout and colouring to suit the medium. Two CSS files have been created:

They are optimised for screen viewing and (black-and-white) print viewing respectively. The file speakers.css uses a different colour for each speaker, making in-line interruptions and overlaps between two speakers easier to follow (though this feature only applies to older variants of the MICASE XML format). Topic breaks and pseudosentence numbers are printed in red, with breaks marked by two vertical bars, "||". In both CSS files, speaker names are printed in a larger font, and interrupting and overlapping text is printed in italics.

Finally, batch processing of input dialogue files for graphing purposes is managed by shell scripts:

These coordinate the processing of dialogue files into a series of graphs showing human-annotated and detected topic breaks. makegraph.sh calls the Python subsystem once for each XML file in the current working directory, producing four files per dialogue: the raw and smoothed similarity values, and the lists of hand-annotated and detected breaks.

buildgraph.sh then uses these output files to build four progressive graphs, from the raw similarity data alone, plotted in grey, to an overplot of all four data sources, as described above. These progressive graphs are useful for examining the success rate of the automatic detection step by step, without crowding a single graph.
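The batch step can be summarised by a short Python sketch that plans, for each XML dialogue, the four data files and the four cumulative graphs built from them. It is only an illustration of the workflow, not the contents of makegraph.sh or buildgraph.sh. The function name plan_graphs and the output suffixes are hypothetical, and the order in which the two intermediate graphs add their data is an assumption, since only the first and last graphs are specified.

```python
from pathlib import Path

# Hypothetical suffixes for the four per-dialogue files, in an assumed
# overplotting order; only the first and last graphs are specified.
OUTPUTS = ["raw", "smoothed", "human", "auto"]


def plan_graphs(directory="."):
    """Map each XML dialogue to four cumulative lists of data files."""
    plan = {}
    for xml_file in sorted(Path(directory).glob("*.xml")):
        files = [f"{xml_file.stem}.{kind}" for kind in OUTPUTS]
        # Graph k overplots the first k data sources; the last uses all four.
        plan[xml_file.name] = [files[:k] for k in range(1, len(files) + 1)]
    return plan
```

Each dialogue thus maps to four lists of increasing length, the first containing only the raw similarity file and the last containing all four files.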

James Ballantine 2005-02-19