LT-TTT2 Tutorial Guide


Table of Contents

1. Introduction
About LT-TTT2
What you should already know
Users of previous versions of LT-TTT and LTXML
Display of XML
2. Tutorial
Examples in this tutorial
Getting started
Example 1: a very simple lxtransduce grammar
Exercise 1: roman numerals
Rules, rules, rules
Example 2: adding a second rule
Exercise 2: adding a third rule
Rules in sequence
Example 3: using <seq>
Exercise 3: finding a monetary expression
First things first
Example 4: Using <first>
Exercise 4: a very simple question
Matching in context
Example 5: positioning within parent/container elements
Exercise 5: matching the firsts and middles
Example 6: suppress and more complex context
Exercise 6: your own suppress="true" rule
Repeated matching
Example 7: the mult attribute
Example 8: <repeat-until>
Example 9: <backtrack>
Passing and testing against values
Example 10: using <with-param/>, <constraint/> and <var/>
Boolean constructs
Example 11: using <and>
Exercise 11: using <not>
3. Using lexicons
Fundamentals
Example L1: your first lexicon
Example L2: other ways of doing the same thing
Standard usage
Example L3: Using categories
Example L4: Phrase lookups
Example L5: case sensitivity
4. Debugging lxtransduce grammars
Lxtransduce debug parameter
Other ways to debug rules
5. Character-level grammars
Overview
Example 12: a simple character-level grammar
Whitespace issues
6. Solutions to exercises
Exercise 1: roman numerals
Exercise 2: adding a third rule
Exercise 3: finding a monetary expression
Exercise 4: a very simple question
Exercise 5: matching the firsts and middles
Exercise 6: your own suppress="true" rule
Exercise 11: using <not>
7. Appendices
Glossary

List of Tables

1.1. LT-XML2 equivalents for LTXML programs
2.1. Multiplicity indicators for mult attribute
7A.1. Lxtransduce grammar rule reference
7B.1. Lexicon file rule reference
7C.1. Tools bundled with LT-TTT2


Language Technology Group
Human Communication Research Centre
University of Edinburgh
2 Buccleuch Place
Edinburgh
EH8 9LW
Scotland

http://www.ltg.ed.ac.uk/

Chapter 1. Introduction

About LT-TTT2

The TTT2 (Text Tokenisation Tool) system provides a flexible means of tokenising texts and adding linguistic markup at various levels. LT-TTT2 is a sister distribution to LT-XML2 and acts as a showcase for the LT-XML2 programs by providing extensive examples of how NLP components can be built using LT-XML2. There are no new executables in TTT2, but it contains linguistic resource files such as lexicons, grammars and pipelines which are not provided with LT-XML2, as well as NLP-focused documentation. Many of the components have been developed over a number of years and are at least as robust as similar components available elsewhere. In particular, the tokeniser and chunker have been used by ourselves and others as parts of larger NLP systems for some time. TTT2 also includes some third-party software, with agreement from the authors, thereby demonstrating how to wrap and integrate other non-XML components in our XML pipeline processing model.

The components and pipelines in TTT2 are described in detail in the pipelines documentation in this directory. The main aim of this tutorial material is to help users to understand how to write and modify the grammars for lxtransduce, the core program in TTT2.

What you should already know

This tutorial assumes basic knowledge of XML and familiarity with a Unix/Linux-style command shell such as bash. It is also presumed that the user will be capable of writing the regular expressions needed in their problem area. Knowledge of the XML query language XPath may be helpful but is not necessary. Prospective users wanting to give themselves a grounding in XML may find the W3 Schools XML Tutorial or A technical introduction to XML useful. The Regular Expression HOWTO and the Wikipedia article on regular expressions may also be handy for learning how to construct regular expressions to match string input.

Users of previous versions of LT-TTT and LTXML

The original LT-TTT was released in March 1999 and used the XML handling API provided by the LTXML library. LTXML was one of the first XML toolsets and last had a minor release in October 2004. The main program in TTT, fsgmatch, has been replaced in TTT2 by lxtransduce. Like TTT, the tools bundled in TTT2 are designed to be 'plugged together' in a variety of ways using pipelines, where the output from one modular stage of processing is the input for the next stage. The table below shows some equivalences between LTXML and LT-XML2 programs.

Table 1.1. LT-XML2 equivalents for LTXML programs

LTXML program    LT-XML2 equivalent
fsgmatch         lxtransduce
sggrep           lxgrep
knit             lxinclude
sgsort           lxsort
sgcount          lxcount

It must be noted that although some programs perform equivalent functions, their command syntax is different. Lxtransduce, for example, has different formats for grammars and lexicons than fsgmatch. There are also some programs from the older versions of LTXML which have no equivalent and vice-versa. Functionality present in the old LTXML to handle "normalised SGML", a subset of SGML rather like XML, is not included in LT-XML2.

Display of XML

The display of XML in this tutorial document is designed to emphasise the logical tree-like structure of such data. One XML element is displayed on each row and indented relative to its logical position within the XML document. This method of display, also known as 'pretty printing', may be slightly different from the output given by the LT-XML2 tools; users should not worry if the XML shown here differs in formatting from the results generated by completing exercises in the tutorial. Take, for example, the following output from lxtransduce:

<TEXT><p><w>The</w> <w>quick</w> <w>brown</w> <w>fox</w></p></TEXT>

This is the same as the example below. The only difference is that the XML below has been formatted to properly illuminate the logical structure of the document.


<TEXT>
  <p>
    <w>The</w>
    <w>quick</w>
    <w>brown</w>
    <w>fox</w>
  </p>
</TEXT>

The display of XML in this tutorial does not affect the preservation of whitespace by the LT-XML2 tools; this is discussed later in the document.

Chapter 2. Tutorial

Examples in this tutorial

In this tutorial we provide example input files, grammars and shell scripts so that you can try the examples for yourself. You will find the examples in TTT2/doc/tutorial/examples/ where there is a subdirectory for each example, containing the files source.xml, grammar.gr and run.sh. The first is the input file discussed in the text, the second is the lxtransduce grammar for the example and the third is a simple shell script to allow you to apply the grammar to the input. To run run.sh you first need to make sure that the TTT2 binaries directory is in your shell path (e.g. export PATH=/home/myname/TTT2/bin:$PATH). Then you should navigate to the relevant example directory (e.g. TTT2/doc/tutorial/examples/eg01/) and execute the command ./run.sh. If you look at run.sh you will see that it contains a call to lxtransduce like this:

lxtransduce -q /TEXT/p grammar.gr source.xml

The "-q /TEXT/p" argument uses an XPath query to tell lxtransduce to examine all <p> elements for matches specified in the grammar. This is followed by the location of the grammar file and the input file. Note that the -q parameter is required for XML grammars. Instead of running run.sh you can also type the above command directly into the shell.

Getting started

In this section we will learn how to use the <rules>, <rule> and <query> commands in a simple lxtransduce grammar to mark up tokens representing decimal numbers. We will invoke lxtransduce from the command shell and then edit the grammar to mark up roman as well as decimal numerals. We will also examine briefly the nature of the tokenisation performed by lxtransduce, although the exact process is discussed later in the tutorial.

Example 1: a very simple lxtransduce grammar

Nearly all non-trivial transformations performed by lxtransduce are done on XML documents. Lxtransduce can process plain text but this is usually for the purpose of marshalling it into XML format, where the 'real' processing will take place. Take the following plain text example:

In July 1995 CEG Corp. posted net of $102 million, or 34 cents a share.

Late last night the company announced a growth of 20%.

The conversion from plain text to XML can be performed by the component preparetxt in the TTT2/scripts directory (this uses lxtransduce in conjunction with another utility, lxplain2xml). The example in TTT2/doc/tutorial/examples/eg01/source.xml is a representation of the above text which has been converted into XML and in which all words, numbers and punctuation have been tokenised by wrapping each in a <w> element. We will skip over the details of the conversion, as this process is covered in more detail in the pipelines documentation.


<TEXT>
  <p>
    <w>In</w>
    <w>July</w>
    <w>1995</w>
    <w>CEG</w>
    <w>Corp</w>
    <w>posted</w>
    <w>net</w>
    <w>of</w>
    <w>$</w>
    <w>102</w>
    <w>million</w>
    <w>,</w>
    <w>or</w>
    <w>34</w>
    <w>cents</w>
    <w>a</w>
    <w>share</w>
    <w>.</w>
  </p>
	
  <p>
    <w>Late</w>
    <w>last</w>
    <w>night</w>
    <w>the</w>
    <w>company</w>
    <w>announced</w>
    <w>a</w>
    <w>growth</w>
    <w>of</w>
    <w>20</w>
    <w>%</w>
    <w>.</w>
  </p>
</TEXT>

In the above example, the XML classifies all items, even the dollar symbol and punctuation, as word tokens. We can now show how grammars are used to identify patterns by marking up all numerical tokens in the XML above. Lxtransduce needs 'grammars', sets of rules stored as XML, to decide how to process input. We can see below TTT2/doc/tutorial/examples/eg01/grammar.gr, a very simple grammar used to identify numerical tokens:


<rules type="xml" apply="number">     <!-- TOP LEVEL ELEMENT, SPECIFIES TYPE OF INPUT AND FIRST RULE TO BE EXECUTED -->

  <rule name="number" wrap="num">     <!-- INDIVIDUAL RULE, REFERENCED AS FIRST RULE TO BE EXECUTED -->
    <query match="w[.~'^[0-9]+$']"/>
  </rule>

</rules>

The top-level element is always <rules>, which will appear exactly once in each grammar file. The <rules> element also has an attribute type, which indicates the nature of input the grammar will process. In this case, it is set to "xml" as we are dealing with XML input rather than plain text, in which case it would be set to "plain". Although there may be dozens of rules in a grammar, there is one parent-rule which is initiated first; this is indicated by the attribute apply in <rules>. To enable rules to be referenced in this way, each must have a unique name conforming with XML constraints: alphanumeric characters starting with a letter. The name of the parent-rule in this grammar is "number", which appears in the name attribute of the <rule> element. Each rule will attempt to match any sequence of input corresponding with its child elements.

The most basic kind of matching for XML grammars is <query>, which will always have a match attribute specifying the elements to match. The syntax is that of XPath, a query language for XML. XPath usage in lxtransduce will be discussed in greater detail later on, so we need not concern ourselves with the details of the syntax at this stage. All we need to know now is that regular expression matches can be specified for particular elements of the tokenised input, and that any valid regular expression can be placed between the caret (^) and the dollar sign ($), which anchor it to the start and end of the element's text. The example above stipulates a match for the regular expression [0-9]+, i.e. a string consisting only of digits, within the element <w>. Now might be a good time to take a look at the quick command reference, which specifies each element usable in lxtransduce grammars, along with valid attributes and children.

When the grammar is executed on our input XML, the output should resemble the XML shown below. As we can see, the numerical expressions are now all contained within <num>. The choice of element in which a match is wrapped is determined by the optional wrap attribute of <rule>. A rule with no wrap attribute is still useful because it can match elements or text which can then be referenced by other rules.


<TEXT>
  <p>
    <w>In</w>
    <w>July</w>
    <num>
      <w>1995</w>    <!-- IDENTIFIED AS NUMBER -->
    </num>
    <w>CEG</w>
    <w>Corp</w>
    <w>posted</w>
    <w>net</w>
    <w>of</w>
    <w>$</w>
    <num>
      <w>102</w>     <!-- IDENTIFIED AS NUMBER -->
    </num>
    <w>million</w>
    <w>,</w>
    <w>or</w>
    <num>
      <w>34</w>      <!-- IDENTIFIED AS NUMBER -->
    </num>
    <w>cents</w>
    <w>a</w>
    <w>share</w>
    <w>.</w>
  </p>
	
  <p>
    <w>Late</w>
    <w>last</w>
    <w>night</w>
    <w>the</w>
    <w>company</w>
    <w>announced</w>
    <w>a</w>
    <w>growth</w>
    <w>of</w>
    <num>
      <w>20</w>      <!-- IDENTIFIED AS NUMBER -->
    </num>
    <w>%</w>
    <w>.</w>
  </p>
</TEXT>

You can test this for yourself by trying the example in TTT2/doc/tutorial/examples/eg01/.
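
As a further illustration of swapping the regular expression inside a <query>, the sketch below marks up capitalised word tokens instead of numbers. The rule name "capword" and the wrapper element <cap> are hypothetical names chosen for this illustration and are not part of the TTT2 distribution; the grammar otherwise follows the same pattern as the one above and assumes the same tokenised input:


<rules type="xml" apply="capword">

  <rule name="capword" wrap="cap">          <!-- HYPOTHETICAL RULE NAME AND WRAPPER ELEMENT -->
    <query match="w[.~'^[A-Z][a-z]+$']"/>   <!-- ONE CAPITAL LETTER FOLLOWED BY LOWERCASE LETTERS -->
  </rule>

</rules>

Applied to the example input, a grammar like this would be expected to wrap tokens such as "In", "July" and "Corp" with <cap>, while leaving "CEG" (all capitals) and "1995" untouched.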

Exercise 1: roman numerals

You should now know enough to make useful changes to the grammar we examined above. Make a copy of the file and try altering it so that the "number" rule also matches capitalised roman numerals from 1 to 10, i.e. "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX" and "X". The solution to this problem is given in the solutions section. HINT: this requires a single relatively simple alteration in one place.

Rules, rules, rules

In this section we will add more rules to the grammar.

Example 2: adding a second rule

We will now add a second rule to identify tokens referring to percentages, i.e. the character '%'. The first step is to add a new empty rule, specifying an appropriate name using the name attribute and text for the element in which matches will be wrapped using the wrap attribute:


<rules type="xml" apply="number">

  <rule name="number" wrap="num">
    <query match="w[.~'^[0-9]+$']"/>
  </rule>
  
  <rule name="percentage" wrap="per">     <!-- NEW RULE -->
  </rule>

</rules>

The regular expression '%' (without the quotes) will match the symbol we are interested in. We can put a <query> element inside our new rule with this regular expression:


<rules type="xml" apply="number">

  <rule name="number" wrap="num">
    <query match="w[.~'^[0-9]+$']"/>
  </rule>
  
  <rule name="percentage" wrap="per">
    <query match="w[.~'^%$']"/>        <!-- NEW QUERY ELEMENT -->
  </rule>

</rules>

We now face a problem: we want lxtransduce to apply both rules in the grammar, but only one rule can be named in the apply attribute of the <rules> element, which specifies the first rule to be executed. This is solved by adding a special rule containing a disjunction. This rule is called "all" and contains two elements we have not used so far: <best> and <ref>. Because the "all" rule does not specify a wrap attribute, it defers this function to the other rules, which are referenced by its children.


<rules type="xml" apply="all">         <!-- NOW APPLIES RULE "all" -->

  <rule name="number" wrap="num">
    <query match="w[.~'^[0-9]+$']"/>
  </rule>
  
  <rule name="percentage" wrap="per">
    <query match="w[.~'^%$']"/>
  </rule>
  
  <rule name="all">                    <!-- NEW DISJUNCTION REFERS TO OTHER RULES -->
    <best>                             <!-- MATCHES LONGEST INPUT OF ITS CHILDREN -->
      <ref name="number"/>             <!-- REFERENCES SPECIFIC RULE "number" -->
      <ref name="percentage"/>         <!-- REFERENCES SPECIFIC RULE "percentage" -->
    </best>
  </rule>

</rules>

A <ref> element refers to another rule by using the name attribute; only references to rules earlier in the grammar are allowed. The <ref> element allows complex rules to be built up from smaller, simpler ones. The two references to our existing rules are given as children inside the <best> element. The <best> element will match input if one of its child elements (in this case two <ref>) matches. If more than one child element matches then it will choose the "best" one - which will match the longest string of input. Imagine two rules, one matching instances of the string "abcd" and the other matching the string "abc". If these appeared as children of <best>, both would match a substring of "abcde" but "abcd" would be chosen as the match because it is longer.

We are now in a position to execute the grammar above which can be found in the TTT2/doc/tutorial/examples/eg02/ directory. As you will see, it correctly identifies both the percentage symbol and the numbers:


<TEXT>
  <p>
    <w>In</w>
    <w>July</w>
    <num>
      <w>1995</w>
    </num>
    <w>CEG</w>
    <w>Corp</w>
    <w>posted</w>
    <w>net</w>
    <w>of</w>
    <w>$</w>
    <num>
      <w>102</w>
    </num>
    <w>million</w>
    <w>,</w>
    <w>or</w>
    <num>
      <w>34</w>
    </num>
    <w>cents</w>
    <w>a</w>
    <w>share</w>
    <w>.</w>
  </p>

  <p>
    <w>Late</w>
    <w>last</w>
    <w>night</w>
    <w>the</w>
    <w>company</w>
    <w>announced</w>
    <w>a</w>
    <w>growth</w>
    <w>of</w>
    <num>
      <w>20</w>
    </num>
    <per>
      <w>%</w>     <!-- IDENTIFIED AS PERCENTAGE SYMBOL -->
    </per>
    <w>.</w>
  </p>
</TEXT>

Exercise 2: adding a third rule

Think about adding a third rule to the grammar. What would you need to add and where? You do not need to concern yourself with the specifics of the rule, only where it would be placed and what other alterations are required. The answer is given in the solutions section.

Rules in sequence

In this section we will learn how to use the <seq> element to match specific sequences of input to lxtransduce. We will then apply our knowledge to the problem of matching simple monetary expressions.

Example 3: using <seq>

The <seq> element is one of the most commonly used and will match its child elements when they appear in sequence. For example, the following rule is part of a larger grammar; it is designed to find full names of people in text. It does this by referencing two other rules called "forename" and "surname". <seq> will only match sections which appear in the document in the order given in the grammar.


<rule name="name" wrap="nm">
  <seq>
    <ref name="forename"/>     <!-- A FORENAME MUST APPEAR FIRST -->
    <ref name="surname"/>      <!-- AND THEN A SURNAME -->
  </seq>
</rule>

<seq> can have other constructs nested as children. Suppose that the "surname" rule did not match hyphenated (so-called "double barreled") surnames. We could create another rule called "hyph-surname" which performs this function and then place all of the surname related rules in a <best> element; lxtransduce would simply pick the best match it could find in the list.


<rule name="name" wrap="nm">
  <seq>
    <ref name="forename"/>
    <best>
      <ref name="surname"/>      <!-- ORIGINAL SURNAME RULE -->
      <ref name="hyph-surname"/> <!-- HYPHENATED SURNAME RULE -->
    </best>
  </seq>
</rule>

As we will see, many other lxtransduce elements can accept other constructs as children. Sophisticated and powerful rules can be built up in this way. Take the grammar we were looking at earlier:


<rules type="xml" apply="all">

  <rule name="number" wrap="num">
    <query match="w[.~'^[0-9]+$']"/>
  </rule>
  
  <rule name="percentage" wrap="per">
    <query match="w[.~'^%$']"/>
  </rule>
  
  <rule name="all">
    <best>
      <ref name="number"/>
      <ref name="percentage"/>
    </best>
  </rule>
</rules>

Let's try and alter the grammar so that it marks up whole percentage expressions (like "34%"), rather than just numbers (like "34") and percentage symbols (like "%"). Firstly, we need to create a new rule called "per-expr" which will also wrap its matches with <per-expr>:


<rules type="xml" apply="all">

  <rule name="number" wrap="num">         <!-- RULE I -->
    <query match="w[.~'^[0-9]+$']"/>
  </rule>
  
  <rule name="percentage" wrap="per">     <!-- RULE II -->
    <query match="w[.~'^%$']"/>
  </rule>
  
  <rule name="per-expr" wrap="per-expr">  <!-- NEW SEQUENCE RULE, MATCHES RULE I FOLLOWED BY RULE II -->
    <seq>
      <ref name="number"/>
      <ref name="percentage"/>
    </seq>
  </rule>
  
  <rule name="all">
    <best>
      <ref name="number"/>
      <ref name="percentage"/>
    </best>
  </rule>
</rules>

We now have a new rule in place which will wrap matches for expressions such as "34%". We no longer need the "number" and "percentage" rules to wrap matches on their own and must therefore remove the wrap attribute from these rules. The "all" rule must also be altered to include only the "per-expr" rule:


<rules type="xml" apply="all">

  <rule name="number">                     <!-- WRAP ATTRIBUTE REMOVED -->
    <query match="w[.~'^[0-9]+$']"/>
  </rule>
  
  <rule name="percentage">                 <!-- WRAP ATTRIBUTE REMOVED -->
    <query match="w[.~'^%$']"/>
  </rule>
  
  <rule name="per-expr" wrap="per-expr"> 
    <seq>
      <ref name="number"/>
      <ref name="percentage"/>
    </seq>
  </rule>
  
  <rule name="all">
    <ref name="per-expr"/>                 <!-- SINGLE RULE WITHIN "all" -->
  </rule>
</rules>

If we run this grammar (in examples/eg03), it will no longer wrap using <num> and <per>; rather, it only wraps whole percentage expressions with <per-expr>:


  <p>
    <w>Late</w>
    <w>last</w>
    <w>night</w>
    <w>the</w>
    <w>company</w>
    <w>announced</w>
    <w>a</w>
    <w>growth</w>
    <w>of</w>
    <per-expr>
      <w>20</w>
      <w>%</w>
    </per-expr>
    <w>.</w>
  </p>
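
The <seq> rule can also be combined with <best> so that lone numbers are still marked up. The following is a sketch rather than a tested grammar: the rule "plain-number" is a hypothetical addition, and the sketch relies only on the longest-match behaviour of <best> described earlier. Where a number is followed by "%", the two-token "per-expr" match is longer than the one-token "plain-number" match and should therefore be preferred; elsewhere only "plain-number" can match.


<rules type="xml" apply="all">

  <rule name="number">
    <query match="w[.~'^[0-9]+$']"/>
  </rule>

  <rule name="percentage">
    <query match="w[.~'^%$']"/>
  </rule>

  <rule name="per-expr" wrap="per-expr">
    <seq>
      <ref name="number"/>
      <ref name="percentage"/>
    </seq>
  </rule>

  <rule name="plain-number" wrap="num">    <!-- HYPOTHETICAL RULE: A NUMBER ON ITS OWN -->
    <ref name="number"/>
  </rule>

  <rule name="all">
    <best>                                 <!-- LONGEST MATCH WINS -->
      <ref name="plain-number"/>
      <ref name="per-expr"/>
    </best>
  </rule>
</rules>

With the example input, "20 %" would then be expected to be wrapped with <per-expr>, and "1995", "102" and "34" with <num>.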

Exercise 3: finding a monetary expression

The example input we have been using contains a reference to a monetary expression: "34 cents" as in "34 cents a share". Alter the grammar we developed in this section by creating a new rule which will identify such references to cents and wrap them with <cent-expr>. We're only interested in matching whole cents here; don't worry about other units of currency such as whole dollars or fractions of cents. For the answer to this problem see the solutions section.

First things first

Example 4: Using <first>

<first> works in precisely the same way as the <best> element we learned about earlier. The only difference is that <first> will look through its children in order, using the first valid match it finds rather than the longest one. Imagine a document which contains both plain numbers and also years expressed using numerals:


<p>
  <w>Simon</w>
  <w>Garner</w>
  <w>scored</w>
  <w>168</w>
  <w>goals</w>
  <w>for</w>
  <w>Blackburn</w>
  <w>Rovers</w>
  <w>between</w>
  <w>1978</w>
  <w>and</w>
  <w>1992</w>
  <w>.</w>
</p>

<first> is useful where we want to express exceptions to a rule. In the case above, the general rule is numbers and the exception is years. That is, years are a specific subset of general numbers which we may want to tag differently. We can achieve this easily using <first>, as shown in the grammar below:


<rules type="xml" apply="all">

  <rule name="number" wrap="num">     <!-- TYPICAL NUMBER RULE -->
    <query match="w[.~'^[0-9]+$']"/>
  </rule>
  
  <rule name="year" wrap="yr">        <!-- MATCHES YEARS 1900 TO 2019 -->
    <query match="w[.~'^((19[0-9][0-9])|(20[01][0-9]))$']"/>
  </rule>
  
  <rule name="all">
    <first>                           <!-- WILL MATCH USING FIRST CHILD IT CAN USE -->
      <ref name="year"/>              <!-- TRIES TO MATCH YEAR EXPRESSION -->
      <ref name="number"/>            <!-- TRIES TO MATCH (POSSIBLY LONGER) NUMERICAL EXPRESSION -->
    </first>
  </rule>
</rules>


Both <best> and <first> are ways of presenting options to lxtransduce. We use the former when we want lxtransduce to choose the longest match of input, irrespective of the ordering of its children. We use the latter when we want the program to express a preference which isn't based on the length of the match and this is done through ordering the children. We can see below how lxtransduce marks up the input we gave it earlier:


<p>
  <w>Simon</w>
  <w>Garner</w>
  <w>scored</w>
  <num>
    <w>168</w>
  </num>
  <w>goals</w>
  <w>for</w>
  <w>Blackburn</w>
  <w>Rovers</w>
  <w>between</w>
  <yr>
    <w>1978</w>
  </yr>
  <w>and</w>
  <yr>
    <w>1992</w>
  </yr>
  <w>.</w>
</p>

Exercise 4: a very simple question

Recall the grammar and the input XML we have seen in the previous section. What would be the effect on the marked up output if the ordering of the year and number rules were reversed in the "all" rule? The answer is given in the solutions section. HINT: This should be obvious by now. If it isn't, re-read the last section!

Matching in context

In this section we are going to learn about five grammar constructs which can be used to make lxtransduce aware of context when deciding on matches. The <start>, <end>, <not-start> and <not-end> grammar constructs do not match any input but can determine whether an XML input element is (or is not) at the start or end of its container (parent) element. The fifth construct we will learn is the suppress attribute, which can be used with various match elements to describe context which is part of the rules but will not be included as part of a match.

Example 5: positioning within parent/container elements

The following XML has been broken up into word tokens and also organised by sentence:


<TEXT>
  <sent>
    <w>What</w>
    <w>a</w>
    <w>sentence</w>
    <w>!</w>
  </sent>
  <sent>
    <w>And</w>
    <w>another</w>
    <w>,</w>
    <w>cool</w>
    <w>!</w>
  </sent>
</TEXT>

The <start>, <end>, <not-start> and <not-end> elements are typically used within a <seq> to stipulate positioning relative to the container element. In the case of the <w> tokens above, their container (or parent) element is <sent>. Imagine a situation where we want to identify the last token in a sentence and mark it with <last-token>. The following grammar will mark up the above input in this way:


<rules type="xml" apply="all">

  <rule name="last" wrap="last-token">
    <seq>
      <query match="w"/>
      <end/>                           <!-- END ELEMENT -->
    </seq>
  </rule>

  <rule name="all">
    <best>
      <ref name="last"/>
    </best>
  </rule>
</rules>

Notice above that although there is only one functional rule ("last"), there is also an "all" rule which references it. At this stage, this is completely unnecessary but some users of lxtransduce do this with new grammars as it makes it easier to add more rules later on. Notice also that, unlike the other grammars we have worked with, the match specified in <query> contains only one letter ("w") and no regular expression matching. This indicates that we are interested in matching any word token, regardless of its contents. We can see below the processed output:


<TEXT>
  <sent>            <!-- START OF CONTAINER ELEMENT -->
    <w>What</w>
    <w>a</w>
    <w>sentence</w>
    <last-token>
      <w>!</w>
    </last-token>
  </sent>           <!-- END OF CONTAINER ELEMENT -->
  <sent>            <!-- START OF CONTAINER ELEMENT -->
    <w>And</w>
    <w>another</w>
    <w>,</w>
    <w>cool</w>
    <last-token>
      <w>!</w>
    </last-token>
  </sent>           <!-- END OF CONTAINER ELEMENT -->
</TEXT>

Exercise 5: matching the firsts and middles

Expand the grammar above with two more rules: one should match word tokens at the start of sentences and wrap them with <first-token>; the other should match any word tokens in the middle (all the others) and wrap them with <middle-token>. The answer is given in the solutions section. HINT: you will need to use the other three grammar elements (not including suppress, which is an attribute) mentioned in the introduction to this section.

Example 6: suppress and more complex context

The <start> and <not-start>-style expressions give grammar rules only a limited awareness of context, restricted to the XML structure of container elements. More sophisticated left and right context can be added to a <seq> element by using the suppress="true" attribute on any match element at the beginning (left) or end (right) of the sequence. Elements with suppress="true" describe context which is required for the match to succeed but is not considered part of the match. This means that it is not part of the default rewrite for the <seq>, will not be included in any wrap if one is specified and will not be part of what is passed to other rules in references. Recall the example input we saw in the previous section. Imagine that the information about sentences given by the <sent> element was absent:


<TEXT>
  <w>What</w>
  <w>a</w>
  <w>sentence</w>
  <w>!</w>
  <w>And</w>
  <w>another</w>
  <w>,</w>
  <w>cool</w>
  <w>!</w>
</TEXT>

The absence of a container element for sentences means that, for example, we would be unable to match the first word in every sentence using <start>. If we use suppress, however, we can tell lxtransduce to look for indicators such as full stops, as well as the positioning within the container element:


<rules type="xml" apply="myrule">

  <rule name="myrule" wrap="first-word">
    <seq>
      <best suppress="true">                <!-- USE OF SUPPRESS -->
        <start/>                            <!-- FIRST OPTION: BEGINNING OF CONTAINER ELEMENT -->
        <query match="w[.~'^[.!]$']"/>       <!-- SECOND OPTION: FULL STOP OR EXCLAMATION -->
      </best>
      <query match="w"/>                    <!-- ONLY THIS PART WILL BE WRAPPED -->
    </seq>
  </rule>

</rules>

The grammar above uses suppress="true" on <best> to stipulate that a sentence boundary indicator (such as a full stop) or the beginning of a container element must be followed by a word token indicated by <w>. But it is only the second part, the word, which will be wrapped with "first-word" - everything else merely provides the context required and is contained within the suppressed <best>:


<TEXT>
  <first-word>
    <w>What</w>    <!-- WRAPPED - FOLLOWS START OF CONTAINER ELEMENT -->
  </first-word>
  <w>a</w>
  <w>sentence</w>
  <w>!</w>
  <first-word>
    <w>And</w>     <!-- WRAPPED - FOLLOWS EXCLAMATION MARK -->
  </first-word>
  <w>another</w>
  <w>,</w>
  <w>cool</w>
  <w>!</w>
</TEXT>
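
Right context works in the same way. In the sketch below, which assumes the input from Example 1, a number is matched only when it is immediately followed by a "%" token, but the "%" itself is suppressed and so stays outside the wrapper. The rule name "pct-number" and the element <pct-num> are hypothetical:


<rules type="xml" apply="pct-number">

  <rule name="pct-number" wrap="pct-num">          <!-- HYPOTHETICAL NAMES -->
    <seq>
      <query match="w[.~'^[0-9]+$']"/>              <!-- ONLY THIS PART WILL BE WRAPPED -->
      <query match="w[.~'^%$']" suppress="true"/>   <!-- RIGHT CONTEXT: REQUIRED BUT NOT PART OF THE MATCH -->
    </seq>
  </rule>

</rules>

On that input only "20" should be wrapped with <pct-num>; the other numbers are not followed by "%" and are left alone.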

Exercise 6: your own suppress="true" rule

How would the grammar be extended to mark up words at the ends of sentences in a similar manner to exercise 5? The answer is given in the solutions section.

Repeated matching

In this section we are going to learn about different ways of matching the same thing a number of times. Three main constructs are used for this in different situations: <repeat-until>, <backtrack> and the mult attribute for <ref>.

Example 7: the mult attribute

As we have seen, the <ref> element is used in grammars to refer to other rules. The mult attribute can be used with <ref> to specify multiplicities on the way that the rule matches. The mult attribute takes one of three characters which may be familiar from regular expressions:

Table 2.1. Multiplicity indicators for mult attribute

Expression    Multiplicity
?             Zero or one matches
*             Zero or more matches
+             One or more matches

We can see below the use of this attribute in an XML grammar:


<rules type="xml" apply="mult-test">

  <rule name="a">
    <query match="a"/>
  </rule>

  <rule name="b">
    <query match="b"/>
  </rule>

  <rule name="c">
    <query match="c"/>
  </rule>

  <rule name="mult-test" wrap="X">
    <best>
      <ref name="a" mult="?"/>
      <ref name="b" mult="+"/>
      <ref name="c" mult="*"/>
    </best>
  </rule>

</rules>

We will use the input below to demonstrate this grammar:


<TEXT>
  <a/>
  <a/>
  <a/>
  <b/>
  <b/>
  <b/>
  <c/>
  <c/>
  <c/>
</TEXT>

The following output is produced. We can see that mult="?" matches at most one reference at a time, whereas "+" and "*" can match several in a row:


<TEXT>
  <X><a/></X>
  <X><a/></X>
  <X><a/></X>
  <X>
    <b/>
    <b/>
    <b/>
  </X>
  <X>
    <c/>
    <c/>
    <c/>
  </X>
</TEXT>

In the instance outlined above, mult="+" and mult="*" behave identically, each matching as much input as it can. They behave differently, however, when used inside a sequence indicated by <seq>. Look at the rule excerpt shown below. Where the items matched by "rule2" are absent, this rule will still match if it sees the "rule1" items followed by the "rule3" items:


<rule name="example1">
  <seq>
    <ref name="rule1"/>
    <ref name="rule2" mult="*"/>
    <ref name="rule3"/>
  </seq>
</rule>

In the rule below, however, the use of mult="+" means that the items matched by "rule2" are required for the whole "example2" rule to match.


<rule name="example2">
  <seq>
    <ref name="rule1"/>
    <ref name="rule2" mult="+"/>
    <ref name="rule3"/>
  </seq>
</rule>
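
The remaining indicator, mult="?", can be used in a sequence in the same way to make an item optional. The following is a minimal sketch rather than a grammar shipped with LT-TTT2: the rule names are chosen for illustration, and it assumes word tokens marked up as <w> elements. It is intended to wrap a number whether or not a percent sign follows it:

```xml
<rules type="xml" apply="num-expr">

  <rule name="number">
    <query match="w[.~'^[0-9]+$']"/>
  </rule>

  <rule name="percentage">
    <query match="w[.~'^%$']"/>
  </rule>

  <rule name="num-expr" wrap="num-expr">   <!-- HYPOTHETICAL RULE NAME -->
    <seq>
      <ref name="number"/>
      <ref name="percentage" mult="?"/>    <!-- ZERO OR ONE PERCENT SIGN -->
    </seq>
  </rule>

</rules>
```

Replacing mult="?" with mult="+" in this sketch would make the percent sign compulsory, as in the "example2" rule above.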

Example 8: <repeat-until>

A <repeat-until> element refers to another rule by using a name attribute in the same manner as <ref>. A match element (such as <best> or <seq>) must be specified as a stop condition and placed as a child element of <repeat-until>. See the following example:


<repeat-until name="repeat-rule">  <!-- REPEATED RULE -->
  <ref name="stop-rule"/>          <!-- END CONDITION -->
</repeat-until>

The repeated rule is matched until either:

  • The end condition matches
  • No more input matches the rule

The rule can succeed without the end condition matching if there is no more input matching the repeated rule. If the end condition is matched, then this input is not considered part of the overall match. Take the following input:


<TEXT>
  <a/>
  <a/>
  <a/>
  <z/>
</TEXT>

If this is used with the following grammar...


<rules type="xml" apply="repeat-test">

  <rule name="a">
    <query match="a"/>
  </rule>

  <rule name="repeat-test" wrap="X">
    <repeat-until name="a" min-matches="1">
      <query match="z"/>
    </repeat-until>
  </rule>

</rules>

...then the output shows that the element matching the end condition is not included in the match:


<TEXT>
  <X>
    <a/>
    <a/>
    <a/>
  </X>
  <z/>    <!-- END CONDITION NOT INCLUDED IN MATCH -->
</TEXT>

By default, a <repeat-until> will succeed even if there are no matches of the repeated rule. This can sometimes lead to problems with looping which may cause lxtransduce to abort with an error:

Top-level rule matched without consuming any input;
This is not allowed because it would loop forever

The optional min-matches attribute can be used to specify a minimum number of matches. This is typically 1. If you want to require the end condition to match, use a <seq> containing the <repeat-until> followed by another copy of the end condition:


<seq>
  <repeat-until name="repeat-rule">  <!-- REPEATED RULE -->
    <ref name="stop-rule"/>          <!-- END CONDITION -->
  </repeat-until>
  <ref name="stop-rule"/>           <!-- END CONDITION REPEATED-->
</seq>

The min-matches attribute is not needed above to prevent looping: the sequence can only succeed if the end condition matches, so a successful match always consumes input.
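
As a sketch of how this pattern might look in a complete grammar, the earlier repeat-test example can be rewritten so that the <z/> is required. This version has not been run against lxtransduce. Under the semantics described above, it would be expected to wrap the three <a/> elements together with the closing <z/> in a single <X>, and to match nothing if the <z/> were missing:

```xml
<rules type="xml" apply="repeat-test">

  <rule name="a">
    <query match="a"/>
  </rule>

  <rule name="repeat-test" wrap="X">
    <seq>
      <repeat-until name="a">      <!-- REPEATED RULE -->
        <query match="z"/>         <!-- END CONDITION -->
      </repeat-until>
      <query match="z"/>           <!-- END CONDITION REPEATED -->
    </seq>
  </rule>

</rules>
```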

Example 9: <backtrack>

This construct must have two child elements in sequence:

  1. A <ref> with a mult attribute set to "+" or "*".
  2. Any match element.

This construct is analogous to a backtracking <seq> where the sequence is given as ordered children. lxtransduce attempts progressively shorter multiples of <ref> with the mult attribute until it finds a match. If no match is found then the whole rule fails. See, for example, this rule taken from the noung.gr grammar, bundled with LT-TTT2:


<rule name="compound-noun">
  <seq>
    <ref name="restr-nom"/>
    <backtrack>                      <!-- BACKTRACK RULE -->
      <ref name="nomger" mult="*"/>
      <ref name="compoundlast"/>
    </backtrack>
  </seq>
</rule>

We can see <backtrack> above as part of a sequence. The "nomger" reference followed by the "compoundlast" reference are attempted for a match using progressively shorter multiples of "nomger".
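
To see why backtracking matters, consider the following toy grammar. It is an illustrative sketch only, written from the description above and not tested against lxtransduce; the rule names are hypothetical. Given an input of three <a/> elements, the mult="*" reference would first consume all three, leaving nothing for the final <query>. The <backtrack> then retries with two, which allows the final <query> to match the third:

```xml
<rules type="xml" apply="run">

  <rule name="a">
    <query match="a"/>
  </rule>

  <rule name="run" wrap="X">          <!-- HYPOTHETICAL RULE NAME -->
    <backtrack>
      <ref name="a" mult="*"/>        <!-- TRIED WITH PROGRESSIVELY FEWER MATCHES -->
      <query match="a"/>              <!-- MUST MATCH AFTER THE REPEATED PART -->
    </backtrack>
  </rule>

</rules>
```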

Passing and testing against values

In this section we will learn how to pass parameters between rules using <with-param/>, place constraints on when a <query> will match using <constraint/>, and assign and reference variables using <var/>.

Example 10: using <with-param/>, <constraint/> and <var/>

A <ref> element may have one or more <with-param> children specifying values to be passed to the referenced rules. These values can then be used by the second rule, for example as constraints on matches. This can save time and space: instead of devising a rule for every particular instance of a situation, the user can write a smaller number of rules which handle precisely the same instances using the flexibility of parameter passing and variables.

The following rule is taken from a grammar dealing with natural language. It is designed to co-ordinate two verbs. For example, "John dances and sings" would be co-ordinated because "dances" and "sings" are both third person singular verbs. A reference is made to a rule called "coordinate-headverb". By using <with-param> this reference passes a parameter named "pos" with value "VBZ" to the "coordinate-headverb" rule:


<rule name="headverb-presact" attrs="headv='yes'">
    <first>
       <ref name="coordinate-headverb">               <!-- REFERENCE-->
         <with-param name="pos" value="'VBZ'"/>       <!-- PASSING A PARAMETER USING WITH-PARAM -->
       </ref>
       <ref name="be-pres"/>
       <query match="w[@p='VBZ']"/>
    </first>
</rule>

Note that because the value is specified as an XPath expression, to pass a constant string you need to put quotes around the string as well as around the attribute value, resulting in something like <with-param name="cat" value="'noun'"/>. We can see this above in value="'VBZ'". If lxtransduce didn't support parameter passing in this way, a separate copy of the coordinate-headverb rule would have to be written for every type of verb to be handled. We can see the coordinate-headverb rule below. The parameter pos which was passed can be referred to using $pos. This is done here in a <constraint> element:


<rule name="coordinate-headverb">
  <seq>
    <query match="w[@p~'^V']" attrs="headv='yes'">
      <constraint test="@p = $pos"/>                     <!-- CONSTRAINT -->
      <var name="first_tag" value="@p"/>                 <!-- ASSIGNING A VARIABLE -->
    </query>
    <ref name="conj-headverb" mult="+">
      <with-param name="subs_tag" value="$first_tag"/>   <!-- USING VARIABLE VALUE -->
    </ref>
  </seq>
</rule>

<constraint> elements appear as the children of <query> elements, placing some restriction on the conditions under which the query will match. In the case above, it tests whether an attribute matches the parameter passed to it. As we will see later, constraints are also used to test lexicon entries for categories. The rule shown above is also an example of the use of <var> to declare a variable. In this case, the variable name is first_tag and the value is set to that of an attribute which is attached to the match. Note that <with-param/> and <var> constructs take the same attributes.
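
The conj-headverb rule referenced above is not shown in this guide. Purely as an illustration of how the subs_tag parameter might be used on the receiving side, a rule of that name could look something like the sketch below. This is a hypothetical reconstruction, not the rule shipped with LT-TTT2, and it assumes that conjunctions carry the part-of-speech tag "CC" in their p attribute:

```xml
<rule name="conj-headverb">                  <!-- HYPOTHETICAL BODY -->
  <seq>
    <query match="w[@p='CC']"/>              <!-- ASSUMED TAG FOR A CONJUNCTION -->
    <query match="w[@p~'^V']" attrs="headv='yes'">
      <constraint test="@p = $subs_tag"/>    <!-- SAME TAG AS THE FIRST VERB -->
    </query>
  </seq>
</rule>
```

Under these assumptions, "dances and sings" would satisfy the constraint because both verbs share a tag, while a verb with a different tag after the conjunction would not.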

Boolean constructs

In this section we will learn how to use the boolean <and> and <not> constructs.

Example 11: using <and>

So far, each rule has tested the input against a single set of conditions. The <and> construct takes two or more match elements as children and matches only input that satisfies all of them. Imagine that we have two rules, one of which tests for elements which contain three characters and another which tests for elements containing numbers:


<rules type="xml" apply="all">
  
  <rule name="length-three">                <!-- MATCHES LENGTH = 3 -->
    <query match="w[.~'^...$']"/>
  </rule>

  <rule name="number">
    <query match="w[.~'^[0123456789]+$']"/>  <!-- MATCHES NUMBERS -->
  </rule>
  
  <rule name="all" wrap="num3">
    <and>
      <ref name="length-three"/>
      <ref name="number"/>
    </and>
  </rule>

</rules>

We can see the use of <and> above to create a rule which matches elements which are of length three and are numbers. Thus if we run the above grammar on the input below:


<TEXT>
  <w>1</w>
  <w>12</w>
  <w>123</w>
  <w>1234</w>
  <w>a</w>  
  <w>ab</w>
  <w>abc</w>
  <w>abcd</w>
</TEXT>

...we obtain the following output:


<TEXT>
  <w>1</w>
  <w>12</w>
  <num3>
    <w>123</w>
  </num3>
  <w>1234</w>
  <w>a</w>
  <w>ab</w>
  <w>abc</w>
  <w>abcd</w>
</TEXT>

If the rewrite for <and> is not specified lxtransduce defaults to the rewrite of the last child. <all> is a synonym for <and>.
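
Since <and> takes match elements as children, the two tests do not have to live in separate named rules. The following sketch, which assumes that <query> elements may be placed directly inside <and> in the same way as <ref> elements, expresses the same grammar more compactly:

```xml
<rules type="xml" apply="all">

  <rule name="all" wrap="num3">
    <and>
      <query match="w[.~'^...$']"/>       <!-- MATCHES LENGTH = 3 -->
      <query match="w[.~'^[0-9]+$']"/>    <!-- MATCHES NUMBERS -->
    </and>
  </rule>

</rules>
```

Keeping the tests in named rules, as in the example above, is likely to be preferable when the same test is needed in more than one place.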

Exercise 11: using <not>

A <not> element contains a single match element, and matches the input if the child element does not match. A successful <not> consumes no input; its match and rewrite are always empty. Recall the input used in the previous example:


<TEXT>
  <w>1</w>
  <w>12</w>
  <w>123</w>
  <w>1234</w>
  <w>a</w>  
  <w>ab</w>
  <w>abc</w>
  <w>abcd</w>
</TEXT>

Suppose that we want to define a grammar which matches every element except those which are numbers of length three. How might we modify the grammar given above to achieve this? Remember that <not> can only take one child element. It also consumes no input, so we must be careful of looping issues similar to those found with <repeat-until>. The solution to this problem is given in the solutions section.

Chapter 3. Using lexicons

Fundamentals

Example L1: your first lexicon

Lexicons, sometimes referred to as gazetteers, are lists of words stored as XML which are used by lxtransduce grammars. The top-level element of a lexicon is <lexicon> with <lex> elements as children. Each lex element represents a single word and has a required word attribute. We can see below an example lexicon storing colours:


<lexicon name="colours">
 <lex word="Red"/>
 <lex word="Green"/>
 <lex word="Blue"/>
 <lex word="Pink"/>
</lexicon>

We can use this with a grammar to identify colours in XML files. Notice in the grammar below the use of <lexicon> to declare the name and location of the lexicon we are using. Lexicons can also be stipulated on the command line. Within the only rule we can see the use of the simplest and most straightforward lexicon reference for rules, <lookup>, which combines a lexicon lookup and a query:


<rules type="xml" apply="all">
  <lexicon name="colours" href="colours.lex"/>   <!-- LEXICON DECLARATION -->
  
  <rule name="all" wrap="col">
    <lookup match="w" lexicon="colours"/>        <!-- CALL TO LEXICON -->
  </rule>
</rules>

Thus if we run the grammar on the following input:


<TEXT>
  <w>Alex</w>
  <w>wears</w>
  <w>pink</w>
  <w>trousers</w>
</TEXT>

...the colour in the XML is identified correctly using the lexicon:


<TEXT>
  <w>Alex</w>
  <w>wears</w>
  <col>
    <w>pink</w>
  </col>
  <w>trousers</w>
</TEXT>

Example L2: other ways of doing the same thing

In addition to <lookup>, we can also specify a lexicon lookup by using the constraint attribute on <query>:


<rules type="xml" apply="all">
  <lexicon name="colours" href="colours.lex"/> 
  
  <rule name="all" wrap="col">
    <query match="w" constraint="colours()"/>        <!-- CALL TO LEXICON -->
  </rule>
</rules>

...which can also be placed as a child element rather than an attribute:

<rules type="xml" apply="all">
  <lexicon name="colours" href="colours.lex"/>

  <rule name="all" wrap="col">
    <query match="w">
      <constraint test="colours()"/>            <!-- CONSTRAINT AS CHILD ELEMENT -->
    </query>
  </rule>
</rules>

Succinctly, the following are all equivalent:


<lookup match="w" lexicon="colours"/>

<query match="w" constraint="colours()"/>

<query match="w">
  <constraint test="colours()"/>
</query>

Standard usage

Example L3: Using categories

In some cases, we might want to distinguish between two or more different types of string in a lexicon, or handle words which may be in some category "A", category "B" or both. Lxtransduce allows us to assign categories or classes to words in a lexicon using the <cat> element:


<lexicon name="words">
  <lex word="boot">
    <cat>verb</cat>
    <cat>noun</cat>
  </lex>
  
  <lex word="dog">
    <cat>noun</cat>
  </lex>
  
  <lex word="weep">
    <cat>verb</cat>
  </lex>
</lexicon>

These categories may then be stipulated in the grammar as follows. These three examples are all equivalent:


<query match="w" constraint="words()/cat='noun'"/> 

<query match="w">
 <constraint test="words()/cat='noun'"/>
</query> 

<query match="w[words()/cat='noun']"/> 
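
A complete grammar built around these category tests might look like the sketch below. The file name words.lex and the rule names are assumptions made for illustration. Because <first> tries its children in order, a word such as "boot", which carries both categories, would be expected to be wrapped as a noun:

```xml
<rules type="xml" apply="all">

  <lexicon name="words" href="words.lex"/>   <!-- ASSUMED FILE NAME -->

  <rule name="noun" wrap="noun">
    <query match="w" constraint="words()/cat='noun'"/>
  </rule>

  <rule name="verb" wrap="verb">
    <query match="w" constraint="words()/cat='verb'"/>
  </rule>

  <rule name="all">
    <first>
      <ref name="noun"/>                     <!-- TRIED FIRST -->
      <ref name="verb"/>
    </first>
  </rule>

</rules>
```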

Example L4: Phrase lookups

It may sometimes be helpful to list phrases in lexicons rather than individual word tokens. Take, for example, this list of universities:


<lexicon name="unis">
  <lex word="University of Edinburgh"/>
  <lex word="University of St Andrews"/>
  <lex word="University of Central Lancashire"/>
  <lex word="University of Life"/>
</lexicon>

By using the phrase="true" attribute on <lookup>, lxtransduce can match a lexicon entry against a sequence of individual word tokens. Observe the following grammar:


<rules type="xml" apply="all">

  <lexicon name="unis" href="unis.lex"/>
  
  <rule name="all" wrap="uni">
    <lookup match="w" lexicon="unis" phrase="true"/> <!-- USE OF PHRASE ATTRIBUTE -->
  </rule>

</rules>

We can see the output here from running this grammar on a series of plain <w> tokens. Notice that the match "University of Edinburgh" is a phrase split across three tokens:


<TEXT>
  <w>I</w>
  <w>graduated</w>
  <w>from</w>
  <w>the</w>
  <uni>
    <w>University</w>
    <w>of</w>
    <w>Edinburgh</w>
  </uni>
  <w>in</w>
  <w>2006</w>
</TEXT>

Example L5: case sensitivity

By default, lexicon lookups are case insensitive. There are instances, however, where we need case-sensitive matching, which is requested with the case attribute on either <lexicon> or <lex>. On the former it controls the case sensitivity of the entire lexicon; on the latter it controls individual entries:


<lexicon name="dates">
  <lex word="May" case="yes">
    <cat>month</cat>
    <cat>verb</cat>
  </lex>

  <lex word="may">
    <cat>verb</cat>
  </lex>
</lexicon>

We can see above that the capitalized word "May" matches the first entry; uncapitalized it matches the second. In the first case it has both the lexical categories "month" and "verb", in the second only "verb".
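
For comparison, the following sketch places the case attribute on the <lexicon> element instead. It assumes that the attribute takes the same "yes" value there as it does on <lex>; the lexicon name and entries are invented for illustration. Under that assumption every entry would be matched case-sensitively, so "May" and "March" would be found but "may" and "march" would not:

```xml
<lexicon name="months" case="yes">   <!-- CASE SENSITIVITY FOR ALL ENTRIES -->
  <lex word="March">
    <cat>month</cat>
  </lex>

  <lex word="May">
    <cat>month</cat>
  </lex>
</lexicon>
```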

Chapter 4. Debugging lxtransduce grammars

Lxtransduce debug parameter

When an lxtransduce grammar doesn't produce the output you were expecting it is often quite difficult to see what is going on. One relatively new addition to lxtransduce is a parameter, -D 128, which makes lxtransduce add extra mark-up around sequences which have been consumed by rules to show which rules have fired.

We can illustrate this with the grammar from example 4. Navigate to the TTT2/doc/tutorial/examples/eg04 directory and instead of running the run.sh shell script, run this command:

    lxtransduce -D 128 -q p grammar.gr source.xml

The output from this command contains extra <x> elements with the name of a rule as the value of the rule attribute:


<p>
  <w>Simon</w>
  <w>Garner</w>
  <w>scored</w>
  <x rule="all"><x rule="number"><num><w>168</w></num></x></x>
  <w>goals</w>
  <w>for</w>
  <w>Blackburn</w>
  <w>Rovers</w>
  <w>between</w>
  <x rule="all"><x rule="year"><yr><w>1978</w></yr></x></x>
  <w>and</w>
  <x rule="all"><x rule="year"><yr><w>1992</w></yr></x></x>
  <w>.</w>
</p>

Here we can see that the <w>168</w> element has been matched and consumed by the all rule and by the number rule which is called by all. The two dates have also been matched and consumed by the all rule but in this case it is the year rule which was called by all.

Other ways to debug rules

The debug parameter is very useful for seeing which rules have fired but it is less helpful in explaining why a rule that you expected to work didn't. There are two main reasons why a rule may fail to fire when expected: first, the rule may be capable of matching the input but the rule ordering in the grammar causes some other rule to match and consume the input; second, there may be something wrong with the way the match conditions are formulated so that the input fails to match. You should be able to test for the former by using the debug parameter described above. For the latter, you can test the rule in question by using the -a parameter to lxtransduce. For example, continuing with example 4, type the following at the command line:

    lxtransduce -a year -q p grammar.gr source.xml

This causes just the year rule to be applied and results in the following output:


<p>
  <w>Simon</w>
  <w>Garner</w>
  <w>scored</w>
  <w>168</w>
  <w>goals</w>
  <w>for</w>
  <w>Blackburn</w>
  <w>Rovers</w>
  <w>between</w>
  <yr><w>1978</w></yr>
  <w>and</w>
  <yr><w>1992</w></yr>
  <w>.</w>
</p>

Similarly, repeating the command but with -a number will cause just the number rule to be applied:


<p>
  <w>Simon</w>
  <w>Garner</w>
  <w>scored</w>
  <num><w>168</w></num>
  <w>goals</w>
  <w>for</w>
  <w>Blackburn</w>
  <w>Rovers</w>
  <w>between</w>
  <num><w>1978</w></num>
  <w>and</w>
  <num><w>1992</w></num>
  <w>.</w>
</p>

Notice that because the year rule is not considered in this case, the number rule has been able to apply to the dates as well as the number. It is because both rules can potentially match dates that the year rule has been ordered first in the all rule of the grammar:


  <rule name="all">
    <first>
      <ref name="year"/>
      <ref name="number"/>
    </first>
  </rule>

Note that -a all is equivalent to not using the -a option since the all rule is defined as the default rule to apply in the apply attribute of the <rules> element in the grammar.

In the example we have given, the rules we selected with the -a parameter did match the input as expected. If you use -a and the rule you are testing does not succeed then it is likely that there is something wrong with the definition of the match conditions in the rule.

Chapter 5. Character-level grammars

Overview

All of the grammars that we have looked at so far have been XML-level grammars: if you look at the rules element in each of the grammars you will see that it has the attribute type="xml" and it is this attribute that defines a grammar as an XML-level grammar. XML-level grammars require XML files as input and the rules use XPath queries to select XML elements to operate on and are typically used to elaborate the XML mark-up.

Character-level grammars, by contrast, operate over character data: the input file need not be an XML file, and they do not necessarily output XML mark-up. A grammar with no type attribute on its rules element is a character-level grammar. In this TTT2 distribution, the only character-level grammar is TTT2/lib/tokenise/pretokenise.gr, which is used as the first stage in the conversion from strings of characters to XML structure which can be used by subsequent stages.

Example 12: a simple character-level grammar

The following is a simple character-level grammar which is similar to the XML-level grammar in Example 4 in that it identifies numbers and years.


<?xml version="1.0"?>

<rules apply="all">

<rule name="number" rewrite="NUMBER">
  <regex match="[0-9]+"/>
</rule>
  
<rule name="year" rewrite="YEAR">
  <regex match="(19[0-9][0-9])|(20[01][0-9])"/>
</rule>
  
<rule name="all">
  <first>
    <ref name="year"/>
    <ref name="number"/>
  </first>
</rule>

</rules>

We can compare the number and year rules with their equivalents in the XML-level grammar:


<rule name="number" wrap="num">
  <query match="w[.~'^[0-9]+$']"/>
</rule>
  
<rule name="year" wrap="yr">
  <query match="w[.~'^(19[0-9][0-9])|(20[01][0-9])$']"/>
</rule>

In the XML-level grammar the string to be operated on is selected with the match attribute. This has an XPath value which specifies a <w> element whose character data matches the regular expression. The character-level grammar has <regex> elements instead of <query> elements and the match attribute takes a regular expression as value. In the XML-level grammar the matched string is wrapped in a <num> or a <yr> element as defined by the wrap attribute. In the character-level grammar the matched elements are rewritten as the strings NUMBER and YEAR as defined by the rewrite attribute.

The file source1.xml in examples/eg12/ contains the same sentence as in Example 4 except that the file is a plain text file, not an XML file:


   Simon Garner scored 168 goals for Blackburn Rovers between 1978 and 1992.

Running the grammar over this using the run1.sh script gives the following output:


   Simon Garner scored NUMBER goals for Blackburn Rovers between YEAR and YEAR.

Because the input file is a plain text file, the call to lxtransduce in run1.sh does not specify a query, i.e. there is no -q option:


   lxtransduce grammar.gr source1.xml

Character-level grammars can be used on XML files and be directed via the -q option to operate on the character data inside specified XML elements. To illustrate, we have provided the source2.xml input file where the same sentence appears twice, once wrapped in a <p> element and once wrapped in a <para> element:


<doc>
  <para>
    Simon Garner scored 168 goals for Blackburn Rovers between 1978 and 1992.
  </para>
  <p>
    Simon Garner scored 168 goals for Blackburn Rovers between 1978 and 1992.
  </p>
</doc>

The script run2.sh takes source2.xml as input and calls lxtransduce with -q p to make the grammar operate over the character data content of the <p> element with the following result:


<doc>
  <para>
    Simon Garner scored 168 goals for Blackburn Rovers between 1978 and 1992.
  </para>
  <p>
    Simon Garner scored NUMBER goals for Blackburn Rovers between YEAR and YEAR.
  </p>
</doc>

It is possible to use a character-level grammar to produce XML mark-up in the output file, and we have provided grammar2.gr in examples/eg12/ to illustrate this:


<?xml version="1.0"?>
<!DOCTYPE rules SYSTEM "lxtransduce.dtd">

<rules apply="all">

<rule name="number" rewrite="&xlt;num&xgt;$-&xlt;/num&xgt;">
  <regex match="[0-9]+"/>
</rule>
  
<rule name="year" rewrite="&xlt;yr&xgt;$-&xlt;/yr&xgt;">
  <regex match="(19[0-9][0-9])|(20[01][0-9])"/>
</rule>
  
<rule name="all">
  <first>
    <ref name="year"/>
    <ref name="number"/>
  </first>
</rule>

</rules>

Here the rewrite attributes use some special entities and variables to create XML output. The special entities &xlt; and &xgt; produce the "<" and ">" characters respectively. To use them their definitions need to be accessed from lxtransduce.dtd (also provided in examples/eg12/) which is why the grammar has the DOCTYPE declaration. The special variable $- takes the matched string as its value so the effect of the grammar is the same as the XML-level grammar in Example 4. The output of this command:


   lxtransduce grammar2.gr source1.xml

is this:


   Simon Garner scored <num>168</num> goals for Blackburn Rovers between <yr>1978</yr> and <yr>1992</yr>.

while the output of this command:


   lxtransduce -q p grammar2.gr source2.xml

is this:


<doc>
  <para>
    Simon Garner scored 168 goals for Blackburn Rovers between 1978 and 1992.
  </para>
  <p>
    Simon Garner scored <num>168</num> goals for Blackburn Rovers between <yr>1978</yr> and <yr>1992</yr>.
  </p>
</doc>

The lxtransduce development documentation at http://www.cogsci.ed.ac.uk/~richard/ltxml2/lxtransduce-manual.html contains many examples of character-level grammar rules and we refer readers to this documentation for further information.

Whitespace issues

In XML-level grammars there is no need to consider whitespace that happens to occur between the elements that a rule is designed to match: when an element matches a query, any immediately following non-element siblings are implicitly attached to it; this has the effect of preserving whitespace between elements in the output (unless the order of elements is changed, in which case their attached whitespace will be moved correspondingly).

In character-level grammars, on the other hand, whitespace characters are no different from any other characters and can be matched using regular expressions. For example, a rule designed to match a sequence of words will need to include regex elements for the whitespace between the words:


<rule name="name" rewrite="NAME">
  <seq>
    <regex match="Simon"/>
    <regex match=" "/>
    <regex match="Garner"/>
  </seq>
</rule>
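
The rule above only matches when exactly one space separates the two words. As a sketch of a slightly more tolerant version, the whitespace pattern can be given a "+" so that a run of spaces is accepted. Note that a literal tab or newline typed inside an XML attribute value is normalised to a space by the XML parser, so this sketch deliberately handles spaces only:

```xml
<rule name="name" rewrite="NAME">
  <seq>
    <regex match="Simon"/>
    <regex match=" +"/>        <!-- ONE OR MORE SPACES -->
    <regex match="Garner"/>
  </seq>
</rule>
```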

Chapter 6. Solutions to exercises

Exercise 1: roman numerals

In order to add roman numerals to the tokens matched by our grammar, the only thing which need be changed is the regular expression. In our XPath query, the current regular expression lies in between the caret (^) and the dollar symbol ($):


[0-9]+

The following regular expressions match the capitalised roman numerals between one and ten ("I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX" and "X"):


I|II|III|IV|V|VI|VII|VIII|IX|X

I{1,3}|IV|VI{0,3}|I?X

The first example above is simply an exhaustive list of the terms we may want to match; the second is a shorter expression which exploits patterns in roman numerals. Both match exactly the same input and either can be combined with the first regular expression to give one which can match standard decimal characters or roman numerals:


[0-9]+|I|II|III|IV|V|VI|VII|VIII|IX|X

[0-9]+|I{1,3}|IV|VI{0,3}|I?X

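One of these may then be placed in our original grammar. The alternatives are grouped in parentheses so that the ^ and $ anchors apply to the whole expression rather than only to its first and last alternatives:


<rules type="xml" apply="number">

  <rule name="number" wrap="num">
    <query match="w[.~'^([0-9]+|I{1,3}|IV|VI{0,3}|I?X)$']"/>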
  </rule>

</rules>

This will then tag roman numerals with <num> in the same way as decimal numerals. To test this, try adding an expression with roman numerals to the input. For example "Henry VIII, son of Henry VII, was succeeded as King of England by Edward VI".
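
An alternative design, sketched below, keeps the two kinds of numeral in separate rules and combines them with <first>. The rule names "decimal" and "roman" are chosen for illustration. Each regular expression is enclosed in ^ and $, with the roman alternatives grouped in parentheses so that the anchors apply to all of them:

```xml
<rules type="xml" apply="all">

  <rule name="decimal" wrap="num">
    <query match="w[.~'^[0-9]+$']"/>
  </rule>

  <rule name="roman" wrap="num">
    <query match="w[.~'^(I{1,3}|IV|VI{0,3}|I?X)$']"/>
  </rule>

  <rule name="all">
    <first>
      <ref name="decimal"/>
      <ref name="roman"/>
    </first>
  </rule>

</rules>
```

This is longer than the single-rule solution, but it would allow the two kinds of numeral to be given different wrap values later without touching the regular expressions.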

Exercise 2: adding a third rule

The new rule would need to be placed as a child of the <rules> element. The reference to the new rule must then be placed within the <best> element in the rule called "all". For example:


<rules type="xml" apply="all">

  <rule name="number" wrap="num">
    <query match="w[.~'^[0-9]+$']"/>
  </rule>
  
  <rule name="percentage" wrap="per">
    <query match="w[.~'^%$']"/>
  </rule>
  
  <rule name="new-rule" wrap="new">  <!-- NEW RULE -->
    <!-- CONTENTS OF NEW RULE -->
  </rule>
  
  <rule name="all">
    <best>
      <ref name="number"/>
      <ref name="percentage"/>
      <ref name="new-rule"/>       <!-- REFERENCE TO NEW RULE -->
    </best>
  </rule>
</rules>

Exercise 3: finding a monetary expression

Recall the grammar we have been developing so far:


<rules type="xml" apply="all">

  <rule name="number">    
    <query match="w[.~'^[0-9]+$']"/>
  </rule>
  
  <rule name="percentage">
    <query match="w[.~'^%$']"/>
  </rule>
  
  <rule name="per-expr" wrap="per-expr"> 
    <seq>
    	<ref name="number"/>
    	<ref name="percentage"/>
    </seq>
  </rule>
  
  <rule name="all">
    <ref name="per-expr"/>
  </rule>
</rules>

Expressions for cents can be broken down into a numerical part followed by "cents" or "cent" or "c". We can create a new <seq> rule to fit this pattern: the first element of the sequence is a reference to a number rule which has already been defined; the second is a <query> with a regular expression designed to catch various references to cents:


  <rule name="cents" wrap="cent-expr">
    <seq>
      <ref name="number"/>                       <!-- MATCHES NUMBER PART -->
      <query match="w[.~'^([Cc]ents?|c)$']"/>    <!-- MATCHES CENTS PART -->
    </seq>
  </rule>

Fitting this into the whole grammar, we have:


<rules type="xml" apply="all">

  <rule name="number">    
    <query match="w[.~'^[0-9]+$']"/>
  </rule>
  
  <rule name="percentage">
    <query match="w[.~'^%$']"/>
  </rule>
  
  <rule name="per-expr" wrap="per-expr"> 
    <seq>
    	<ref name="number"/>
    	<ref name="percentage"/>
    </seq>
  </rule>
  
  <rule name="cents" wrap="cent-expr">           <!-- CENT RULE -->
    <seq>
      <ref name="number"/>
      <query match="w[.~'^([Cc]ents?|c)$']"/>
    </seq>
  </rule>
  
  <rule name="all">
    <best>                                         <!-- USING BEST AGAIN AS THERE ARE NOW TWO RULES -->
      <ref name="per-expr"/>
      <ref name="cents"/>                          <!-- REFERENCE TO NEW RULE -->
    </best>
  </rule>
</rules>

This should then mark up the relevant section of the input as follows:


  <p>
    <w>In</w>
    <w>July</w>
    <w>1995</w>
    <w>CEG</w>
    <w>Corp</w>
    <w>posted</w>
    <w>net</w>
    <w>of</w>
    <w>$</w>
    <w>102</w>
    <w>million</w>
    <w>,</w>
    <w>or</w>
    <cent-expr>
      <w>34</w>
      <w>cents</w>
    </cent-expr>
    <w>a</w>
    <w>share</w>
    <w>.</w>
  </p>

We can see above that there is also a dollar expression. Using the skills we have learned so far it would be a trivial task to add another rule to match this and other similar expressions.
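
As a sketch of what such a rule might look like, the two rules below match a dollar sign followed by a number and an optional magnitude word. The rule names, the wrap value and the list of magnitude words are all assumptions made for illustration; the "number" rule is the one already defined in the grammar:

```xml
  <rule name="magnitude">                        <!-- HYPOTHETICAL HELPER RULE -->
    <query match="w[.~'^(thousand|million|billion)$']"/>
  </rule>

  <rule name="dollars" wrap="dollar-expr">       <!-- HYPOTHETICAL DOLLAR RULE -->
    <seq>
      <query match="w[.~'^[$]$']"/>              <!-- MATCHES THE DOLLAR SIGN -->
      <ref name="number"/>                       <!-- MATCHES NUMBER PART -->
      <ref name="magnitude" mult="?"/>           <!-- OPTIONAL MAGNITUDE WORD -->
    </seq>
  </rule>
```

For the rule to be applied, a reference to "dollars" would also need to be added inside the <best> element of the "all" rule. With that in place, the tokens "$", "102" and "million" in the input above would be expected to be wrapped in a single <dollar-expr> element.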

Exercise 4: a very simple question

All the numbers, including the year, would be tagged with <num>.

Exercise 5: matching the firsts and middles

First, we add a new rule to our grammar to match the first tokens in the sentence. This is basically the same as our existing rule:


<rules type="xml" apply="all">

  <rule name="last" wrap="last-token">   <!-- EXISTING RULE -->
    <seq>
      <query match="w"/>
      <end/>
    </seq>
  </rule>

  <rule name="first" wrap="first-token"> <!-- SECOND RULE -->
    <seq>
      <start/>
      <query match="w"/>
    </seq>
  </rule>

  <rule name="all">
    <best>
      <ref name="first"/>
      <ref name="last"/>
    </best>
  </rule>
</rules>

Notice above the insertion of the new rule called "first", the use of <start>, ordering of the children of <seq> for this rule and the referencing within "all". The third and final rule, which will match all word tokens in between the first and last of a container element, is slightly more complex than the others:


<rules type="xml" apply="all">

  <rule name="last" wrap="last-token">
    <seq>
      <query match="w"/>
      <end/>
    </seq>
  </rule>

  <rule name="first" wrap="first-token">
    <seq>
      <start/>
      <query match="w"/>
    </seq>
  </rule>

  <rule name="middle" wrap="middle-token"> <!-- THIRD RULE -->
    <seq>
      <not-start/>
      <query match="w"/>
      <not-end/>
    </seq>
  </rule>

  <rule name="all">
    <best>
      <ref name="first"/>
      <ref name="last"/>
      <ref name="middle"/>
    </best>
  </rule>
</rules>

Notice the use of <not-start> and <not-end>. The ordering of the children of <best> is irrelevant - try swapping them around. This grammar should generate the following output when run on the input we used in the example:


<TEXT>
  <sent>           
    <first-token><w>What</w></first-token>
    <middle-token><w>a</w></middle-token>
    <middle-token><w>sentence</w></middle-token>
    <last-token><w>!</w></last-token>
  </sent>          
  <sent>           
    <first-token><w>And</w></first-token>
    <middle-token><w>another</w></middle-token>
    <middle-token><w>,</w></middle-token>
    <middle-token><w>cool</w></middle-token>
    <last-token><w>!</w></last-token>
  </sent>          
</TEXT>

Exercise 6: your own suppress="true" rule

The following grammar will mark up the first and last words of sentences in the way required. A second functional rule and another specifying a disjunction have been added:


<rules type="xml" apply="all">

  <rule name="first" wrap="first-word">
    <seq>
      <best suppress="true"> 
        <start/>
        <query match="w[.~'^[.!]$']"/>
      </best>
      <query match="w"/>
    </seq>
  </rule>

  <rule name="last" wrap="last-word">      <!-- NEW RULE -->
    <seq>
      <query match="w"/>
      <best suppress="true">
        <end/>
        <query match="w[.~'^[.!]$']"/>
      </best>
    </seq>
  </rule>

  <rule name="all">                        <!-- DISJUNCTION -->
    <best>
      <ref name="first"/>
      <ref name="last"/>
    </best>
  </rule>

</rules>

This should then mark up the input given in Example 6 as follows:


<TEXT>
  <first-word>
    <w>What</w>     <!-- WRAPPED - FOLLOWS START OF CONTAINER ELEMENT -->
  </first-word>
  <w>a</w>
  <last-word>
    <w>sentence</w> <!-- WRAPPED - BEFORE EXCLAMATION -->
  </last-word>
  <w>!</w>
  <first-word>
    <w>And</w>      <!-- WRAPPED - FOLLOWS EXCLAMATION -->
  </first-word>
  <w>another</w>
  <w>,</w>
  <last-word>
    <w>cool</w>     <!-- WRAPPED - BEFORE EXCLAMATION -->
  </last-word>
  <w>!</w>
</TEXT>

Exercise 11: using <not>

We start by altering the original grammar so that a new top-level rule negates the original main rule by reference. We cannot simply turn <and> into <not>, as the latter takes only one child element:

<rules type="xml" apply="all">
  
  <rule name="length-three">                <!-- MATCHES LENGTH = 3 -->
    <query match="w[.~'^...$']"/>
  </rule>

  <rule name="number">
    <query match="w[.~'^[0123456789]+$']"/>  <!-- MATCHES NUMBERS -->
  </rule>  
  
  <rule name="num3">
    <and>
      <ref name="length-three"/>
      <ref name="number"/>
    </and>    
  </rule>
  
  <rule name="all" wrap="not-num3">
    <not>
      <ref name="num3"/>
    </not>
  </rule>

</rules>

We cannot run this, however, because the top-level rule matches without consuming any input and so loops forever. We must deal with this by placing <not>, which consumes no input, in a sequence followed by <query match="w"/>. The <not> checks that the next token does not match the referenced rule, and the <query> then consumes that token, so the grammar no longer loops:


<rules type="xml" apply="all">
  
  <rule name="length-three">                <!-- MATCHES LENGTH = 3 -->
    <query match="w[.~'^...$']"/>
  </rule>

  <rule name="number">
    <query match="w[.~'^[0123456789]+$']"/>  <!-- MATCHES NUMBERS -->
  </rule>  
  
  <rule name="num3">
    <and>
      <ref name="length-three"/>
      <ref name="number"/>
    </and>    
  </rule>
  
  <rule name="all" wrap="not-num3">
    <seq>
      <not>
        <ref name="num3"/>           <!-- BOOLEAN CHECK ON NEXT TOKEN --> 
      </not>
      <query match="w"/>             <!-- ACTUAL CONSUMPTION OF NEXT TOKEN -->
    </seq>
  </rule>

</rules>

We should then get the following output:


<TEXT>
  <not-num3><w>1</w></not-num3>
  <not-num3><w>12</w></not-num3>
  <w>123</w>
  <not-num3><w>1234</w></not-num3>
  <not-num3><w>a</w></not-num3>
  <not-num3><w>ab</w></not-num3>
  <not-num3><w>abc</w></not-num3>
  <not-num3><w>abcd</w></not-num3>
</TEXT>
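The behaviour of the final grammar can be mimicked in Python to check the expected output. The sketch below is an analogy rather than a description of how lxtransduce is implemented: is_num3 plays the role of the "num3" rule, the negated test consumes nothing, and the loop always advances by one token, which corresponds to the <query match="w"/> that follows <not>. The function names and the token list are assumptions made for illustration.

```python
import re

LENGTH_THREE = re.compile(r"^...$")        # same pattern as the "length-three" rule
NUMBER = re.compile(r"^[0123456789]+$")    # same pattern as the "number" rule

# Assumed input tokens, taken from the expected output above.
TOKENS = ["1", "12", "123", "1234", "a", "ab", "abc", "abcd"]


def is_num3(token):
    """True only when both patterns match, like <and> in the "num3" rule."""
    return bool(LENGTH_THREE.match(token)) and bool(NUMBER.match(token))


def wrap_not_num3(tokens):
    """Wrap every token that is not a three-digit number."""
    out = []
    pos = 0
    while pos < len(tokens):
        token = tokens[pos]
        if not is_num3(token):             # zero-width check, like <not>
            out.append(f"<not-num3><w>{token}</w></not-num3>")
        else:
            out.append(f"<w>{token}</w>")
        pos += 1                           # consumption step; without it the loop never ends
    return out


if __name__ == "__main__":
    print("\n".join(wrap_not_num3(TOKENS)))
```

Removing the pos += 1 line reproduces, in miniature, the endless loop described for the first version of the grammar.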

Chapter 7. Appendices

Appendix A. Lxtransduce grammar rule reference

Attributes and elements marked with an asterisk are compulsory; all others are optional. The expression "match elements" refers to any construct that can directly match input, such as <query> and <ref>, as opposed to structural elements such as <rules>.

Table 7A..1. Lxtransduce grammar rule reference

Element | Description | Match element | Attributes | Children | Tutorial section
<rules> | Top-level element | No | apply*, type | <rule>, <lexicon> | Example 1
<rule> | Specifies individual rules | No | name*, rewrite, wrap, match, constraint | <ref>, <regex>, <query>, <seq>, <first>, <best>, <and>, <lookup>, <repeat-until>, <backtrack>, <var> | Example 1
<query/> | Most basic XML match | Yes | match*, constraint, rewrite, suppress | <constraint>, <var> | Example 1, Example 10
<best> | Matches the longest input among its child elements | Yes | rewrite, suppress | Match elements, given as children and examined for the longest match | Example 2
<first> (or <or>) | Matches the first child element that matches | Yes | rewrite, suppress | Match elements, given as children and examined in sequence for matches | Example 4
<ref> | References other rules by name | Yes | name*, mult, rewrite, suppress | <with-param> | Example 2, Example 7
<seq> | Matches child elements if they appear in sequence | Yes | rewrite, suppress | Any match element | Example 3
<start/> | Does not consume any input. Matches at the start of input container element or line. | No | None | None | Example 5
<end/> | Does not consume any input. Matches at the end of input container element or line. | No | None | None | Example 5
<not-start/> | Does not consume any input. Will only match when not at the start of input container element or line. | No | None | None | Example 5
<not-end/> | Does not consume any input. Will only match when not at the end of input container element or line. | No | None | None | Example 5
<repeat-until> | Repeatedly matches a rule until either an end condition is matched or no more input matches the rule. | Yes | name*, min-matches, rewrite, suppress | End condition given as any match element | Example 8
<backtrack> | Successively smaller multiplicities of a rule are attempted until a match is achieved. | Yes | rewrite, suppress | <ref> whose mult attribute is "*" or "+". This is followed by another element. | Example 9
<with-param/> | As a child of <ref>, specifies a parameter to be passed to another rule. | No | name*, value* | None | Example 10
<var/> | Instantiates a variable which may then be referred to elsewhere. | No | name*, value* | None | Example 10
<constraint/> | Appears as a child of <query>, placing some restriction on matches made. | No | test* | None | Example 10
<and> (or <all>) | Matches only when all of the match elements specified as children match the input. | Yes | rewrite, suppress | Two or more match elements | Example 11
<not> | Matches only when the child rule does not match. | Yes | None | A single match element | Exercise 11
<lookup> | Combines a lexicon lookup with a query | Yes | match*, lexicon*, phrase, case, rewrite, suppress | <constraint>, <var> | Example L1
<regex/> |  |  |  |  | Example 10
<lexicon/> | Declaration of a lexicon within grammar file | No | name*, href | None | Example L1

Appendix B. Lexicon file rule reference

Attributes and elements marked with an asterisk are compulsory; all others are optional.

Table 7B..1. Lexicon file rule reference

Element | Description | Attributes | Children | Tutorial section
<lexicon> | Top level element | name |  | Example L1
<lex> |  | word*, case | <cat> | Example L1
<cat> |  | None | None | Example L1

Appendix C. Tools bundled with LT-TTT2

Table 7C..1. Tools bundled with LT-TTT2

Tool command | Description
lxaddids | adds ID attributes to an XML document
lxconvert |
lxconvfsgmatch | used to convert old fsgmatch rule files to the new format used by lxtransduce
lxconvlex | used to convert old fsgmatch lexicon files to the new format used by lxtransduce
lxcount | counts elements in an XML document
lxdiff | shows the difference between two XML documents
lxgrep | XML version of the Unix/Linux grep program. Finds nodes that match an XPath query.
lxplain2xml | converts a plain text file to XML by wrapping it in an element
lxprintf | formats text extracted from an XML document
lxreplace | makes replacements or deletions in an XML document
lxsort | sorts elements in an XML document
lxt | a fast XSLT 1.0 processor
lxtransduce | main component of LT-TTT2
pos | part of speech (POS) tagger. Assigns part of speech tags to words reflecting their syntactic category.
rxp | an XML parser

Glossary

Attribute

XML attributes provide information about an element and are placed directly within its opening tag (as opposed to between the opening and closing tags). Given an element <file> that would be used with both opening and closing tags to describe the name of a computer file, an attribute that gives information about the file format might be used as follows: <file type="JPG">filename.jpg</file>. Elements can contain as many attributes as are necessary, sequenced one after the other. For more information see XML attributes at W3Schools.

Element

Elements are the basic building blocks of XML documents such as lxtransduce grammar and lexicon files. They can be used in pairs (opening and closing tags) to wrap around plain text content, <elname>Like this</elname>, or on their own: <elname/>. Notice the different positioning of the slash character ("/") in these cases. Elements can be nested together to structure information. For more information see XML elements at W3Schools.
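The distinction between elements and attributes can be seen by parsing the <file> example with a general-purpose XML library. The short sketch below uses Python's standard xml.etree.ElementTree module purely for illustration; it is not part of LT-TTT2, and the variable names are arbitrary.

```python
import xml.etree.ElementTree as ET

# An element used as a pair of tags, with one attribute and some text content.
element = ET.fromstring('<file type="JPG">filename.jpg</file>')

print(element.tag)          # the element name: file
print(element.get("type"))  # the value of the "type" attribute: JPG
print(element.text)         # the text between the tags: filename.jpg

# An element used on its own has no text content and, here, no attributes.
empty = ET.fromstring("<elname/>")
print(empty.text, empty.attrib)
```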

Lexicon

An inventory of words or phrases given in a file. lxtransduce uses lexicons given as XML. A lexicon is sometimes referred to as a 'gazetteer'.

Grammar

An XML file specifying rules which lxtransduce uses to process input.

XML

Stands for eXtensible Markup Language. XML is a structured file specification similar to HTML and can be read by humans and machines. The grammar and lexicon files used by lxtransduce are formatted as XML.

XSLT

eXtensible Stylesheet Language Transformations is an XML-based language used for the transformation of XML documents. The original document is not changed - a new document is created based on the content of an existing one.