The graphs are produced using R. A collection of simple R scripts reads as many input files as are available (at most four: the similarity data, the smoothed similarity data, the detected topic breaks, and the human-annotated topic breaks) and plots the similarity values as a continuous line graph, with the detected and human-annotated breaks drawn as clear, coloured vertical lines. These can be combined flexibly to show as much data as is necessary: the full functionality of the scripts (in file twoplot.R) overplots the raw similarity data in grey, the smoothed similarity data in black, the automatically detected breaks as broken green lines, and the human-annotated breaks in red. R outputs its graphs in Encapsulated PostScript (EPS) format, suitable for printing as a full page.
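The "as many input files as are available" behaviour can be illustrated with a minimal Python sketch. This is not the R code in twoplot.R: the file suffixes (.sim, .smooth, .auto, .human), the function name available_layers, and the solid line style for the human-annotated breaks are all assumptions made for illustration; only the colours and the broken green lines come from the description above.

```python
from pathlib import Path

# (hypothetical file suffix, colour, line style, kind), in plotting order.
# The suffixes are invented for this sketch; the real file names may differ.
LAYERS = [
    (".sim", "grey", "solid", "line"),      # raw similarity data
    (".smooth", "black", "solid", "line"),  # smoothed similarity data
    (".auto", "green", "dashed", "vline"),  # automatically detected breaks
    (".human", "red", "solid", "vline"),    # human-annotated breaks (style assumed)
]


def available_layers(directory, stem):
    """Return a plot description for each input file that actually exists."""
    found = []
    for suffix, colour, style, kind in LAYERS:
        path = Path(directory) / (stem + suffix)
        if path.is_file():  # silently skip inputs that were not produced
            found.append(
                {"path": path, "colour": colour, "style": style, "kind": kind}
            )
    return found
```

A plotting routine could then loop over the returned list, drawing a continuous line for each "line" entry and vertical lines for each "vline" entry, so that a missing input simply yields a sparser graph.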
The marked-up transcriptions are produced using XSLT, an XML transformation language, processed by the program xsltproc. The XML transcript files produced by the Python modules are processed by an XSLT 'stylesheet', which transforms XML elements such as <S name="x"/> (speaker), <BREAK /> (human-annotated topic break), and <AUTOBREAK /> (detected topic break) into valid HTML 4.0 for viewing in a web browser. Each speaker turn is marked as a separate <P /> (paragraph), with special features such as speaker names, pseudosentence numbers, and topic breaks marked as a <SPAN /> with a special style name.
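The element mapping described above can be sketched in Python using only the standard library. This is a stand-in for the XSLT stylesheet, not a copy of it: the root element name TRANSCRIPT, the flat layout (empty S, BREAK and AUTOBREAK elements with the spoken text following them), and the style names speaker, break and autobreak are assumptions made for illustration. Pseudosentence numbers are omitted.

```python
from html import escape
import xml.etree.ElementTree as ET


def transcript_to_html(xml_text):
    """Turn a flat transcript into one <P> per speaker turn (assumed layout)."""
    root = ET.fromstring(xml_text)
    paragraphs = []  # finished <P> elements
    current = []     # fragments of the turn being built

    def close_turn():
        if current:
            paragraphs.append("<P>" + "".join(current).strip() + "</P>")
            current.clear()

    if root.text and root.text.strip():
        current.append(escape(root.text.strip()) + " ")
    for child in root:
        if child.tag == "S":
            close_turn()  # a new speaker starts a new paragraph
            name = escape(child.get("name", ""))
            current.append(f'<SPAN class="speaker">{name}:</SPAN> ')
        elif child.tag == "BREAK":
            current.append('<SPAN class="break">[topic break]</SPAN> ')
        elif child.tag == "AUTOBREAK":
            current.append('<SPAN class="autobreak">[detected break]</SPAN> ')
        # Text after an empty element is stored as that element's tail.
        if child.tail and child.tail.strip():
            current.append(escape(child.tail.strip()) + " ")
    close_turn()
    return "\n".join(paragraphs)


if __name__ == "__main__":
    sample = (
        '<TRANSCRIPT><S name="A"/>Hello there.<BREAK/>'
        '<S name="B"/>Hi.<AUTOBREAK/>Bye.</TRANSCRIPT>'
    )
    print(transcript_to_html(sample))
```

Because every special feature ends up in a SPAN with its own style name, the appearance of speakers and breaks can be changed entirely from a CSS file without touching the transformation.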
Three XSLT scripts exist; each includes a different level of detail in the output HTML:
These scripts allow custom CSS (Cascading Style Sheet) files to be written easily for the transcript HTML, changing its layout and colouring to suit the medium. Two CSS files have been created:
Finally, batch processing of input dialogue files for graphing purposes is managed by shell scripts:
buildgraph.sh then uses these output files to build four progressive graphs, from the raw similarity data plotted in grey, to an overplot of all four data sources, as described above. These progressive graphs are useful for examining the success rate of the automatic detection without overcrowding any single graph.
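The idea of progressive graphs can be sketched as cumulative subsets of the four data sources. This Python fragment is illustrative only and does not reproduce buildgraph.sh: the function name progressive_graphs is hypothetical, and the order in which the intermediate graphs add their layers is an assumption (the text fixes only the first graph and the last).

```python
# The four data sources, in the assumed order in which they are added.
LAYERS = [
    "raw similarity (grey)",
    "smoothed similarity (black)",
    "detected breaks (broken green)",
    "human-annotated breaks (red)",
]


def progressive_graphs(layers):
    """Return one layer list per graph, each adding one more data source."""
    return [layers[:n] for n in range(1, len(layers) + 1)]


if __name__ == "__main__":
    for number, graph in enumerate(progressive_graphs(LAYERS), start=1):
        print(f"graph {number}: " + ", ".join(graph))
```

Each successive graph contains everything in the previous one plus a single new source, so a reader can inspect the similarity curve alone before the break markers are overlaid.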
James Ballantine 2005-02-19