net.sourceforge.nite.datainspection

Package net.sourceforge.nite.datainspection

PACKAGE UNDER DEVELOPMENT

Class Summary
Info

Package net.sourceforge.nite.datainspection Description

Various viewers and calculation packages for inspecting a corpus. They can be used for reliability and quality analysis, or simply to get a grip on what is in a corpus. This overview is the main documentation for these packages. If you want to use them, please read this text carefully, referring to the documentation of individual classes when necessary.

Package overview

data package

Contains interfaces to describe the data in your annotations in a way that can be handled by the other packages. Basically, an annotation by a human annotator is treated as a classification that assigns Values to Items. The precise interpretation of Item and Value depends on the actual data.

calc package

Contains the interfaces and classes needed for the calculations: kappa, alpha, confusion matrices, the distance metric interface, etc.

impl package

This package defines a number of implementations of Items and Values, together with some corresponding metrics: StringValue, SetValue, DiceMetric, etc.

view package

This package contains some GUI elements that can be used to (interactively) view results from the data inspection such as confusion matrices and coincidence matrices.

timespan package

Classes and programs to investigate annotations that are non-overlapping, possibly (but not necessarily) continuous, and potentially multi-label. Calculates different types of confusion matrices, kappa and alpha, and visualises the annotations (in a very naive way).

Summary of the core datainspection packages

Reliability analysis often consists of

finding out whether separate annotators identified the same Items (segments, units for labeling),
finding out whether comparable Items have been assigned the same Values (labels, classes, categories) by the annotators and
finding out where disagreement lies, i.e. what Values are confused with each other in what situations; what type of Items are most often NOT identified by both annotators at the same time; etc.
(Investigating the nature of the errors that annotators made, and deciding how important these errors are, given the use for which the annotations were created.)

The package nite.datainspection.data contains the basic interfaces to describe the annotations in your corpus as Classifications that assign Values to Items. [[INSERT PICTURE HERE WITH EXAMPLE OF WHAT A CLASSIFICATION IS, ESP. THE RELATION BETWEEN THE ITEMS OF CLASSIFICATIONS OF TWO DIFFERENT CODERS]] [[ALSO A PICTURE THAT SHOWS THAT INDEPENDENT CODINGS CLEARLY LEAD TO SEPARATE CLASSIFICATIONS]] [[NOTE THAT THIS 'CLASSIFICATION' IS ONLY ONE OF MANY WAYS TO LOOK AT ANNOTATIONS...]]

The package nite.datainspection.calc allows one to create a ConfusionMatrix or CoincidenceMatrix from two classifications, or a Multi-annotator CoincidenceMatrix for more classifications. Such matrices are a source of information about (dis)agreements between annotators as well as a first step towards calculating reliability measures such as kappa or alpha (available through methods in the ConfusionMatrix and CoincidenceMatrix classes). For calculating certain variations of Alpha reliability, a DistanceMetric is needed. [[INSERT PICTURE HERE THAT EXPLAINS WHAT IS THE RELATION BETWEEN THE CLASSIFICATION AND THE CONFUSION MATRIX]]
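As a rough illustration of the relation between two classifications, a confusion matrix and kappa, here is a minimal, self-contained sketch. It does not use the ConfusionMatrix class from this package: the class name KappaSketch, its methods, and the representation of a classification as a Map from item id to label are all assumptions made for the example. Items that only one coder identified are simply skipped, and the kappa shown is Cohen's kappa for two coders.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical sketch; these names are not the NXT ConfusionMatrix API. */
final class KappaSketch {

    /** Collects every label used by either coder, in sorted order. */
    static List<String> labelsOf(Map<String, String> coder1, Map<String, String> coder2) {
        TreeSet<String> labels = new TreeSet<>(coder1.values());
        labels.addAll(coder2.values());
        return new ArrayList<>(labels);
    }

    /** m[i][j] = number of shared items labeled labels[i] by coder 1 and labels[j] by coder 2. */
    static int[][] confusionMatrix(Map<String, String> coder1,
                                   Map<String, String> coder2,
                                   List<String> labels) {
        int[][] m = new int[labels.size()][labels.size()];
        for (Map.Entry<String, String> entry : coder1.entrySet()) {
            String other = coder2.get(entry.getKey());
            if (other == null) {
                continue; // assumption: items missing from coder 2 are ignored
            }
            m[labels.indexOf(entry.getValue())][labels.indexOf(other)]++;
        }
        return m;
    }

    /** Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement). */
    static double cohenKappa(int[][] m) {
        int size = m.length;
        double total = 0.0;
        double agreed = 0.0;
        double[] rowSum = new double[size];
        double[] colSum = new double[size];
        for (int i = 0; i < size; i++) {
            for (int j = 0; j < size; j++) {
                total += m[i][j];
                rowSum[i] += m[i][j];
                colSum[j] += m[i][j];
                if (i == j) {
                    agreed += m[i][j];
                }
            }
        }
        double observed = agreed / total;
        double chance = 0.0;
        for (int i = 0; i < size; i++) {
            chance += (rowSum[i] / total) * (colSum[i] / total);
        }
        // Yields NaN for an empty matrix or when chance agreement is 1.
        return (observed - chance) / (1.0 - chance);
    }

    public static void main(String[] args) {
        Map<String, String> coder1 = Map.of(
                "s1", "inform", "s2", "inform", "s3", "question", "s4", "question");
        Map<String, String> coder2 = Map.of(
                "s1", "inform", "s2", "question", "s3", "question", "s4", "question");
        List<String> labels = labelsOf(coder1, coder2);
        System.out.println(cohenKappa(confusionMatrix(coder1, coder2, labels)));
    }
}
```

In this toy data the coders agree on three of four segments, and the single off-diagonal cell (inform versus question) is where the disagreement lies.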

For a specific corpus, one needs to define what constitutes an Item and a Value. For example, when analysing dialogue act annotations, Items may be segments in the transcription, and Values may be Strings denoting the dialogue act label assigned to a transcription segment. As another example, when analysing a segmentation and labeling of the timeline with hand gestures, Items may be segments, or Items may be discretized timespans of e.g. 1 second, whereas a Value would be an assigned gesture label. When an annotation allows multiple labels to be assigned to a segment, Values may be sets of Strings, denoting the set of labels assigned to the segment. Furthermore, one needs to define the appropriate DistanceMetrics, which can be very corpus-specific. The package nite.datainspection.calc.impl offers a number of implementations of the interfaces Value, Item and DistanceMetric which may be sufficient for your corpus. If they are not adequate, you can make your own implementations of those interfaces.
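To make the idea of a distance between set-valued Values concrete, the following sketch computes a Dice-based distance between two label sets. It is only an illustration of the general technique: the class DiceDistanceSketch is hypothetical and is not the DiceMetric implementation shipped in the impl package, whose exact behaviour (for example for empty sets) may differ.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch; not the DiceMetric class from this package. */
final class DiceDistanceSketch {

    /**
     * Returns 1 minus the Dice coefficient 2|A ∩ B| / (|A| + |B|):
     * 0.0 for identical label sets, 1.0 for disjoint ones.
     */
    static double distance(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // assumption: two empty label sets count as identical
        }
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        return 1.0 - (2.0 * shared.size()) / (a.size() + b.size());
    }

    public static void main(String[] args) {
        // Coder 1 assigned two gesture labels to a segment, coder 2 only one of them.
        Set<String> coder1 = Set.of("point", "nod");
        Set<String> coder2 = Set.of("point");
        System.out.println(distance(coder1, coder2));
    }
}
```

A graded distance like this lets partially overlapping label sets count as smaller disagreements than completely different ones.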

How to use the datainspection packages

Reliability:

Figure out what the Items, Values and DistanceMetric are for your corpus; if necessary, create your own implementations of the interfaces.
Derive Classifications from the data in NXT format.
Create a coincidence matrix, calculate reliability values, print confusion matrices, and start analysing what went wrong and why.
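The last step can be sketched as follows for the simplest case: two coders who both labeled every Item. This is a hedged illustration of Krippendorff's alpha computed from a coincidence matrix, not the CoincidenceMatrix class of this package. The class AlphaSketch, its method names, and the use of a ToDoubleBiFunction as a stand-in for a DistanceMetric are assumptions; the function supplies the squared difference between two values (0 for equal values).

```java
import java.util.List;
import java.util.function.ToDoubleBiFunction;

/** Hypothetical sketch; these names are not the NXT CoincidenceMatrix API. */
final class AlphaSketch {

    /**
     * Builds a symmetric coincidence matrix for two coders.
     * Item u contributes one count to (c, k) and one to (k, c),
     * where c and k are the labels the two coders gave it.
     */
    static double[][] coincidenceMatrix(List<String> coder1,
                                        List<String> coder2,
                                        List<String> labels) {
        double[][] o = new double[labels.size()][labels.size()];
        for (int u = 0; u < coder1.size(); u++) {
            int c = labels.indexOf(coder1.get(u));
            int k = labels.indexOf(coder2.get(u));
            o[c][k] += 1.0;
            o[k][c] += 1.0;
        }
        return o;
    }

    /** alpha = 1 - observed disagreement / expected disagreement. */
    static double alpha(double[][] o,
                        List<String> labels,
                        ToDoubleBiFunction<String, String> difference) {
        int size = labels.size();
        double[] totals = new double[size];
        double n = 0.0;
        for (int c = 0; c < size; c++) {
            for (int k = 0; k < size; k++) {
                totals[c] += o[c][k];
                n += o[c][k];
            }
        }
        double observed = 0.0;
        double expected = 0.0;
        for (int c = 0; c < size; c++) {
            for (int k = 0; k < size; k++) {
                double d = difference.applyAsDouble(labels.get(c), labels.get(k));
                observed += o[c][k] * d;
                expected += totals[c] * totals[k] * d;
            }
        }
        // Yields NaN when only one label is ever used (expected == 0).
        return 1.0 - (n - 1.0) * observed / expected;
    }

    public static void main(String[] args) {
        List<String> labels = List.of("inform", "question");
        List<String> coder1 = List.of("inform", "inform", "question", "question");
        List<String> coder2 = List.of("inform", "question", "question", "question");
        // Nominal difference: any two distinct labels are equally far apart.
        ToDoubleBiFunction<String, String> nominal = (a, b) -> a.equals(b) ? 0.0 : 1.0;
        System.out.println(alpha(coincidenceMatrix(coder1, coder2, labels), labels, nominal));
    }
}
```

Swapping the nominal difference for a corpus-specific one is what turns this into one of the alpha variations that need a distance metric.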

However, besides calculating a reliability value, one also needs to investigate the corpus in a more anecdotal way: finding places where disagreement occurs, looking at the kinds of situations in which it happens, building informal hypotheses about sources of disagreement, etc. For this you might want to use some of the generic datainspection tools described below.

The generic datainspection tools

The datainspection packages contain a few tools which may help in investigating the data in a corpus. Each of these tools is documented in its respective subpackage documentation. Below is a (very short) summary of each tool. [[EXPLAIN ABOUT SINGLE CODER/MULTI CODER INSPECTION, mail from myrosia]]

Timespan tool

See timespan package. Classes and programs to investigate annotations that are non-overlapping, possibly (but not necessarily) continuous, and potentially multi-label.
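One way to make such timespan annotations comparable between coders, as mentioned above for gesture labeling, is to discretize the timeline into fixed-length slices and treat each slice as an Item. The sketch below illustrates that idea only; it is not taken from the timespan package. The classes DiscretizeSketch and Segment, the half-open interval convention, and the choice to sample each slice at its midpoint are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not part of the timespan package. */
final class DiscretizeSketch {

    /** A labeled, half-open timespan [start, end) in seconds. */
    static final class Segment {
        final double start;
        final double end;
        final String label;

        Segment(double start, double end, String label) {
            this.start = start;
            this.end = end;
            this.label = label;
        }
    }

    /**
     * Returns one label per slice of sliceSeconds, taken from the segment
     * covering the slice midpoint; slices without a segment get null.
     * Assumes the segments do not overlap.
     */
    static List<String> discretize(List<Segment> segments,
                                   double totalSeconds,
                                   double sliceSeconds) {
        int slices = (int) Math.ceil(totalSeconds / sliceSeconds);
        List<String> labels = new ArrayList<>();
        for (int i = 0; i < slices; i++) {
            double midpoint = (i + 0.5) * sliceSeconds;
            String label = null;
            for (Segment s : segments) {
                if (s.start <= midpoint && midpoint < s.end) {
                    label = s.label;
                    break;
                }
            }
            labels.add(label);
        }
        return labels;
    }

    public static void main(String[] args) {
        List<Segment> coder1 = List.of(
                new Segment(0.0, 2.0, "point"),
                new Segment(3.0, 5.0, "wave"));
        // One entry per second of a five-second recording.
        System.out.println(discretize(coder1, 5.0, 1.0));
    }
}
```

Running the same discretization on each coder's segments gives label sequences of equal length, which can then be compared slice by slice.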