This is a collection of frequently asked questions about Grammatical Framework (GF) and Universal Dependencies (UD) that I typically get during the LT2214 Computational Syntax course at the University of Gothenburg.


Contents:

GF

File structure/module types

What goes in which files?

In GF, there are several module types:

  • abstract: abstract syntax (cross-lingual). Course example: MicroLang.gf
  • concrete: concrete syntax (language-specific). Course example: MicroLangEng.gf
  • resource: reusable collection of opers (see next question), i.e. a library (type signatures & bodies together). Course example: MicroResEng.gf
  • interface: “abstract syntax” (type declarations) of a resource module. Not used in this course
  • instance: “concrete syntax” of a resource module whose “abstract syntax” is defined in an interface. Not used in this course
  • incomplete concrete: partial “concrete syntax”, used when some things are common to a few of the languages covered by the grammar, but not all of them. Not used in this course

Let’s focus on the first three.

An abstract module typically contains:

  • the list of the names of the cross-lingual categories used in the grammar (e.g. Det, CN, NP…)
  • a list of “functions”, i.e. type signatures for the linearization rules of the grammar (e.g. DetCN : Det -> CN -> NP ;). An abstract module works a bit like a Java interface or, if you haven’t done any object-oriented programming, like a contract that specifies everything the concrete syntaxes for each language have to provide for the grammar to be complete.

A concrete module contains the “implementations” of the cats and lins listed in the abstract syntax, for example:

  • a Det in English can be represented as a record storing a string and a number:
    lincat Det = {s : Str ; n : Number} ;
    
  • the “body” of the function (i.e. its linearization rule) DetCN can be written as
    lin DetCN det cn = {
      s = \\c => det.s ++ cn.s ! det.n ;
      a = Agr det.n } ;
    

Finally, a resource module is just a bunch of opers (i.e. helper functions) and parameter definitions that can be imported by other modules. You may well decide not to have a separate resource module in your grammar, but especially if your language of choice has extensive morphology, that’s a good place to store all the parameters and smart paradigms without cluttering your MicroLangXxx module.

GF terms/types of judgments

Can you give me a recap of what all of cat, lincat, fun, lin, oper and param mean?

There you go:

  • cat (“category”): a grammatical category. Found in abstract modules. Example: cat Noun
  • lincat (“linearization type”): the language-specific “implementation” of a category. Found in concrete modules. Example: lincat Noun = {s : Number => Str}
  • fun (“function”): the type signature of a grammar rule. Found in abstract modules. Example: fun UseN : N -> CN
  • lin (“linearization rule”): the language-specific “implementation” of a grammar rule. Found in concrete modules. Example: lin UseN n = n
  • oper (“operation”): a helper function. Found here, there and everywhere (but often in concrete and resource modules). Example: oper regNoun : Str -> Noun = \sg -> mkNoun sg (sg + "s")
  • param (“parameter”): a language-specific (inflectional) parameter, used in tables. Typically found in resource modules. Example: param Number = Sg | Pl

GF syntax

What is the difference between -> and =>?

-> is for functions, => is for tables.

More specifically:

  • -> is used when declaring the type signature of a function (i.e. of a fun or an oper). Examples:
    • fun PredVPS : NP -> VP -> S ;
    • oper regNoun : Str -> Noun = ...
  • \ -> is used when defining a function (typically in opers). Example:

    ... \sg -> mkNoun sg (sg + "s") ;
    
  • => is used when:
    • declaring the type of a table (typically in lincat). Example:

      lincat N = {s : Number => Str} ;
      
    • filling in the cell of a table. Example:

      table { Sg => "cat" ; 
              Pl => "cats"}
      
  • \\ => is used for filling in tables whose cells are all to be filled in the same way (single-branch tables). Example:

    AdjCN a cn = {
      -- whatever the n (the number) is, use it to select the correct cell of the table for the CN
      s = \\n => a.s ++ (cn.s ! n) ;
    } ;
    

Encoding issues

I am working with a language that uses a non-latin script and, when I try to linearize a tree, I get the following error:

  <stdout>: commitBuffer: invalid argument (invalid character)

How do I fix it?

Add the line

flags coding = utf8 ;

to your concrete module.

This is necessary because the default encoding for GF source files is iso-latin-1, which can only handle the ASCII charset and a few extra characters needed for some European languages.

GF best practices

Why do we use records for things that are essentially strings, such as lincat Adv = {s : Str}?

This is by no means obligatory, but there are at least three very practical reasons to do so:

  1. if you later realize that a string isn’t enough, it’s easy to add another record field (e.g. lincat Adv = {s : Str ; goesBeforeVP : Bool}), but more annoying to replace Str with a whole new record
  2. if some things in your grammar are strings and others {s : Str}, let’s say

    lincat Adv = {s : Str ; goesBeforeVP : Bool} ;
    lincat Adj = Str ;
    

    it becomes easy to forget whether you should write adv ++ adj, adv.s ++ adj or adv.s ++ adj.s in your lins

  3. you may want to extend a type, which only works with records! For example:

     oper Verb : Type = {s: Str} ; -- for languages with no verb inflection
     oper Verb2 : Type = Verb ** {prep: Str} ; -- verbs with an argument introduced by a preposition, such as "to wait for X"
    

Why do we use the identifier s specifically, even for things that are not strings (such as in lincat N = {s : Number => Str ; g : Gender})?

Superficial reason: it was (is?) hardcoded somewhere that s is the name of the record field used for linearization.

Deeper reason: s does stand for “string”. Record fields called s are typically the ones that will eventually become strings. In the case of Ns, for example, the table s is the part of the record representing a noun that will linearize to a string once we decide which table entry we need, for instance based on the number of the determiner that introduces the noun:

det.s ++ n.s ! det.n ; -- "a cat"

What am I supposed to put in a resource module?

You don’t even need a resource module unless there are opers or params that you want to use in multiple concrete files, but moving some code to a resource module can be a good way to organize your code regardless. It seems that there are two dominant approaches:

  1. only having lincats and lins in your concrete and putting everything else in one or more resource modules. In this way, the concrete is strictly an implementation of the abstract syntax, which acts like an interface in object-oriented programming. This also used to be the way to write GF grammars when params and opers were not allowed in concrete modules
  2. also keeping parameters and constructors (mkXXX) in the concrete syntax and only using resource modules for helper functions, as in a typical utils module.

Eerie error messages

I wrote something like

oper mkA : Str -> A = \a -> { s = a } ;

and got an error that says something like

Happened in operation mkA
 {s = a} is not in the lincat of A; try wrapping it with lin A

What does this mean and how do I fix it?

The best and most exhaustive explanation of this is here. In short, the compiler is complaining that the return type of mkA is not exactly A because, when you write a lincat such as lincat A = {s : Str}, GF inserts a hidden field lock_A to be able to distinguish the lincat for A from any other lincats with the same definition.

The first possible solution is to do exactly what the compiler suggests:

oper mkA : Str -> A = \a -> lin A ({ s = a }) ;

(the parentheses are actually redundant, but they’re a way to visualize the “wrapping around”)

However, it is usually considered cleaner to explicitly define a type synonym for the category (in this case A):

oper Adjective : Type = {s : Str} ;

and change the signature of the oper the compiler is complaining about accordingly:

oper mkA : Str -> Adjective = \a -> { s = a } ;

Using GF from Python

I am getting started with lab 2 and I can’t install/run the Python pgf library and/or the C runtime.

For the moment, don’t: follow Aarne’s alternative instructions for testing.

UD

MultiWord Tokens

How do I analyze:

  1. “do” (“de” + “o”) in Portuguese
  2. “au” (“à” + “le”) in French
  3. “nel” (“in” + “il”) in Italian
  4. “dámelo” (“da” + “me” + “lo”) in Spanish

and so on?

UD version 2 treats these cases (where an orthographic word consists of several syntactic units) as MultiWord Tokens (MWTs). The annotation consists of a so-called range line with the original form but no analysis, followed by one line per syntactic word, each analyzed individually. Example for Portuguese:

# text = A comida do gato.
# gloss = The food of+the cat.
1	A	o	DET	_	_	2	det	_	_
2	comida	comida	NOUN	_	_	0	root	_	_
3-4	do	_	_	_	_	_	_	_	_
3	de	de	ADP	_	_	5	case	_	_
4	o	o	DET	_	_	5	det	_	_
5	gato	gato	NOUN	_	_	2	nmod	_	_
6	.	.	PUNCT	_	_	2	punct	_	_

Unfortunately, MaChAmp cannot fully handle this yet. This is one of the reasons why you need to preprocess all of its input files as indicated in the lab description.
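To make the format concrete, here is a minimal Python sketch of one plausible preprocessing step: dropping the range lines so that only fully analyzed syntactic words remain. This is only an illustration (drop_mwt_ranges is a hypothetical helper name), not the official lab script, which may do more.

```python
def drop_mwt_ranges(conllu: str) -> str:
    """Remove MWT range lines (IDs like "3-4") from CoNLL-U text.

    Illustrative sketch, not the official lab preprocessing.
    """
    kept = []
    for line in conllu.splitlines():
        token_id = line.split("\t", 1)[0]
        # range lines have IDs such as "3-4" and carry no analysis;
        # comment lines (starting with "#") are always kept
        if "-" in token_id and not line.startswith("#"):
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"
```

Applied to the Portuguese example above, this keeps the lines for “de” and “o” but removes the unanalyzed “3-4 do” line.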

Conjunctions

How do I annotate conjunctions, in phrases like “A and B”?

Like this:

A     and    B
SYM   CCONJ  SYM
root  cc     conj

But why? Then A and B are not on the same level!

Reasonable objection. However, remember that UD prioritizes cross-lingual parallelism by avoiding using function words (such as conjunctions) as syntactic heads.

In Latin, for example, “A and B” can be translated as both “A et B” and “A Bque” (e.g. “SPQR: Senatus PopulusQue Romanus”). If we used “et” as the head, the tree for “A and/et B” would become:

A     et/and  B
SYM   CCONJ   SYM
conj  root    conj

which is more different than it needs to be from

A     Bque
SYM   NOUN
root  conj

where you can clearly see what the two conjuncts are.

But can’t “Bque” be treated as a MWT?

Sure can: in that case, you would split “B” and “que” and you could in theory treat the clitic “que” as a head. But imagine a language where conjunction is expressed by simply juxtaposing the conjuncts (e.g. “A B”). What would be the head then?

Syntax highlighting

I have tried to install the vscode-conllu extension but it doesn’t seem to work.

  1. check that your filename ends in .conllu (which is not the same as .connllu)
  2. check that you are using tabs and not spaces as separators
  3. if it still doesn’t work, save the file as .tsv. You’ll get a different but equally good highlighting for the token lines

I want syntax highlighting, but I don’t use Visual Studio Code.

GOTO step 3 of the answer above ;) Most editors have support for highlighting TSVs and CSVs.

Arborator

I am using Arborator for annotation and I can’t find my project under “My Projects”

…I know right? For the moment, look for it from the homepage using the search bar and then save the direct link. Eventually, I hope they’ll solve the issue.

MaChAmp

I am trying to run a MaChAmp script but it fails complaining that some other Python file does not exist.

Check that you are running the script from MaChAmp’s root folder (i.e. the folder named machamp or machamp-master).

I successfully installed MaChAmp and preprocessed the training and development data and I’m now trying to train my model, but training fails with something like

2025-05-21 10:34:51,518 - ERROR - STDERR - [enforce fail at inline_container.cc:595] . unexpected pos 196207040 vs 196206992

There is probably some formatting error in your CoNLL-U input files. The most common ones are:

  1. missing newlines at the end of the file, which often happens when you split the data into a training and development set yourself
  2. the diabolical non-unix newline characters, which some editors on Windows like adding when you simply open your files.

To fix error 1, open each file and check that the last token line is followed by a blank line (i.e. that the file ends with two newline characters). If it isn’t, add one.

To fix error 2, run

tr -d '\r' < PATH-TO-YOUR-NONUNIX-FILE.conllu >  PATH-FOR-THE-OUTPUT-FILE.conllu

(which means “delete all \r characters from the non-unix file and write the output on a new file”).
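Both fixes can also be sketched in Python, which is handy if you are already preprocessing your data there. normalize_conllu is a hypothetical helper, not part of MaChAmp; if you read the file in Python first, open it with newline="" so the carriage returns are not silently translated away before you can see them.

```python
def normalize_conllu(text: str) -> str:
    """Ensure a CoNLL-U string has unix newlines and one final blank line.

    Illustrative sketch, not part of MaChAmp.
    """
    text = text.replace("\r", "")        # fix 2: drop non-unix newline characters
    return text.rstrip("\n") + "\n\n"    # fix 1: end with exactly one blank line
```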

If training still fails after this, try validating your CoNLL-U file(s) (see below).

UD validator

How do I use the official UD validator?

  1. clone or download the UD tools repository
  2. move inside the tools folder
  3. run
    python validate.py PATH-TO-YOUR-TREEBANK.conllu --lang=2-LETTER-LANGCODE-FOR-YOUR-LANGUAGE
    

MaChAmp should only be concerned with basic formatting issues, so if you are validating for MaChAmp you can add --level=1. If you want to check for errors in your own annotated files, however, you can go up a few levels:

  • 2 checks UD format specifics
  • 3 checks that the universal UD guidelines are followed (e.g. that there are no VERBs used as AUX or multiple subjects in the same sentence)
  • 4 and 5 check language-specific stuff.