This is a collection of questions about Grammatical Framework (GF) and Universal Dependencies (UD) that I’ve been asked at least twice during the 2025 edition of the LT2214 Computational Syntax course at the University of Gothenburg, and/or that I haven’t been able to answer exhaustively in class, and/or that I myself tend to forget the answer to in between years.

NOTE: I maintain an up-to-date version of frequently asked GF+UD questions here.


Contents:

GF

File structure/module types

What goes in which files?

In GF, there are several module types:

| name | contents | example from the course |
|---|---|---|
| abstract | abstract syntax (cross-lingual) | MicroLang.gf |
| concrete | concrete syntax (language-specific) | MicroLangEng.gf |
| resource | reusable collection of opers (see next question), i.e. library (type signatures & bodies together) | MicroResEng.gf |
| interface | “abstract syntax” (type declarations) of a resource module | not used in this course |
| instance | “concrete syntax” of a resource module whose “abstract syntax” is defined in an interface | not used in this course |
| incomplete concrete | partial “concrete syntax”, used when some things are common to a few of the languages covered in the grammar, but not all of them | not used in this course |

Let’s focus on the first three.

An abstract module typically contains:

  • the list of the names of the cross-lingual categories used in the grammar (e.g. Det, CN, NP…)
  • a list of “functions”, i.e. type signatures for the linearization rules of the grammar (e.g. DetCN : Det -> CN -> NP ;). An abstract module works a bit like a Java interface, or rather, if you haven’t done any object-oriented programming, as a contract that specifies all the things the concrete syntaxes for each language have to provide for the grammar to be complete.
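Put together, a stripped-down abstract module along these lines (names as in the course grammar, but reduced to the judgments mentioned above) might look like:

```
abstract MicroLang = {
  cat
    Det ; CN ; NP ;
  fun
    DetCN : Det -> CN -> NP ;
}
```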

A concrete module contains the “implementations” of the cats and lins listed in the abstract syntax, for example:

  • a Det in English can be represented as a record storing a string and a number:
    lincat Det = {s : Str ; n : Number} ;
    
  • the “body” of the function (i.e. its linearization rule) DetCN can be written as
    lin DetCN det cn = {
      -- assuming here that lincat NP = {s : Case => Str ; a : Agr}
      s = \\c => det.s ++ cn.s ! det.n ;
      a = Agr det.n } ;
    

Finally, a resource module is just a bunch of opers (i.e. helper functions) and parameter definitions that can be imported into other modules. You may well decide not to have a separate resource module in your grammar, but especially if your language of choice has extensive morphology, that’s a good place to store all the parameters and smart paradigms without cluttering your MicroLangXxx module.
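A sketch of what such a module might contain (the oper names mirror the examples used elsewhere in these notes):

```
resource MicroResEng = {
  param
    Number = Sg | Pl ;
  oper
    Noun : Type = {s : Number => Str} ;
    mkNoun : Str -> Str -> Noun = \sg,pl -> {
      s = table {Sg => sg ; Pl => pl}
      } ;
    -- smart paradigm: guess the plural from the singular
    regNoun : Str -> Noun = \sg -> mkNoun sg (sg + "s") ;
}
```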

GF terms/types of judgments

Can you give me a recap of what all of cat, lincat, fun, lin, oper and param mean?

There you go:

| name | short for | description | found in | example |
|---|---|---|---|---|
| cat | category | grammatical category | abstract modules | cat Noun |
| lincat | linearization type | language-specific “implementation” of a category | concrete modules | lincat Noun = {s : Number => Str} |
| fun | function | type signature of a grammar rule | abstract modules | fun UseN : N -> CN |
| lin | linearization rule | language-specific “implementation” of a grammar rule | concrete modules | lin UseN n = n |
| oper | operation | helper function | here, there and everywhere (but often in concrete and resource modules) | oper regNoun : Str -> Noun = \sg -> mkNoun sg (sg + "s") |
| param | parameter | language-specific (inflectional) parameter tables | typically resource modules | param Number = Sg \| Pl |

GF syntax

What is the difference between -> and =>?

-> is for functions, => is for tables.

More specifically:

  • -> is used when declaring the type signature of a function (i.e. of a fun or an oper). Examples:
    • fun PredVPS : NP -> VP -> S ;
    • oper regNoun : Str -> Noun = ...
  • \ ... -> is used when defining a function (i.e. writing a lambda expression, typically in the body of an oper). Example:

    ... \sg -> mkNoun sg (sg + "s") ;
    
  • => is used when:
    • declaring the type of a table (typically in lincat). Example:

      lincat N = {s : Number => Str} ;
      
    • filling in the cell of a table. Example:

      table { Sg => "cat" ; 
              Pl => "cats"}
      
  • \\ => is used for filling in tables whose cells are all to be filled in the same way (single-branch tables). Example:

    AdjCN a cn = {
      -- whatever the n (the number) is, use it to select the correct cell of the table for the CN
      s = \\n => a.s ++ (cn.s ! n) ;
    } ;
    
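As a recap, here are all four notations side by side, in a (hypothetical) fragment you could drop into a concrete module, assuming a matching abstract with UseN : N -> CN:

```
param Number = Sg | Pl ;

oper mkNoun : Str -> Str -> {s : Number => Str} =   -- -> in a type signature
  \sg,pl ->                                         -- \ -> defining the function
  {s = table {Sg => sg ; Pl => pl}} ;               -- => filling in table cells

lincat N, CN = {s : Number => Str} ;                -- => in a table type
lin UseN n = {s = \\num => n.s ! num} ;             -- \\ => single-branch table
```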

Encoding issues

I am working with a language that uses a non-latin script and, when I try to linearize a tree, I get the following error:

  <stdout>: commitBuffer: invalid argument (invalid character)

How do I fix it?

Add the line

flags coding = utf8 ;

to your concrete module.

This is necessary because the default encoding for GF source files is ISO Latin-1, which only covers the ASCII charset plus a few extra characters needed for some Western European languages.
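For example, at the top of the module (MicroLangXxx stands in for your own concrete module name):

```
concrete MicroLangXxx of MicroLang = {
  flags coding = utf8 ;

  -- lincats, lins, opers etc. as usual
}
```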

Using GF from Python

I am getting started with lab 2 and I can’t install/run the Python pgf library and/or the C runtime.

For the moment, don’t: follow Aarne’s alternative instructions for testing.

UD

MultiWord tokens

How do I analyze:

  1. “do” (“de” + “o”) in Portuguese
  2. “au” (“à” + “le”) in French
  3. “nel” (“in” + “il”) in Italian
  4. “dámelo” (“da” + “me” + “lo”) in Spanish

and so on?

UD version 2 treats these cases (where an orthographic word consists of several syntactic units) as MultiWord Tokens (MWTs). Annotation consists of a so-called range line with the original form but no analysis, followed by 2+ lines where each element is analyzed individually. Example for Portuguese:

# text = A comida do gato.
# gloss = The food of+the cat.
1	A	o	DET	_	_	2	det	_	_
2	comida	comida	NOUN	_	_	0	root	_	_
3-4	do	_	_	_	_	_	_	_	_
3	de	de	ADP	_	_	5	case	_	_
4	o	o	DET	_	_	5	det	_	_
5	gato	gato	NOUN	_	_	2	nmod	_	_
6	.	.	PUNCT	_	_	2	punct	_	_

Unfortunately, MaChAmp cannot fully handle this yet. This is one of the reasons why you need to preprocess all of its input files as indicated in the lab description.
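If you are curious what such preprocessing can amount to, one common step is simply dropping the range lines, since they carry no analysis of their own. A sketch (check the lab description for what is actually required; file names here are placeholders):

```shell
# Drop MWT range lines (IDs like "3-4") from CoNLL-U input.
# Real use:  grep -vE '^[0-9]+-[0-9]+' IN.conllu > OUT.conllu
# Demo on the "do" fragment from above:
printf '3-4\tdo\n3\tde\n4\to\n' | grep -vE '^[0-9]+-[0-9]+'
```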

Conjunctions

How do I annotate conjunctions, in phrases like “A and B”?

Like this (“and” attaches to “B” with cc, and “B” to “A” with conj):

A      and     B
SYM    CCONJ   SYM
root   cc      conj

But why? Then A and B are not on the same level!

Reasonable objection. However, remember that UD prioritizes cross-lingual parallelism by avoiding using function words (such as conjunctions) as syntactic heads.

In Latin, for example, “A and B” can be translated as both “A et B” and “A Bque” (e.g. “SPQR: Senatus PopulusQue Romanus”). If we used “et” as the head, the tree for “A and/et B” would become:

A      et/and  B
SYM    CCONJ   SYM
conj   root    conj

which is more different than it needs to be from

A      Bque
SYM    NOUN
root   conj

where you can clearly see what the two conjuncts are.

But can’t “Bque” be treated as a MWT?

Sure can: in that case, you would split “B” and “que” and you could in theory treat the clitic “que” as a head. But imagine a language where conjunction is expressed by simply juxtaposing the conjuncts (e.g. “A B”). What would be the head then?
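For the record, this is also how UD handles such asyndetic coordination: the first conjunct is the head, the second attaches to it as conj, and there is simply no cc line. A minimal sketch (POS tags are, of course, hypothetical):

```
1	A	A	SYM	_	_	0	root	_	_
2	B	B	SYM	_	_	1	conj	_	_
```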

Syntax highlighting

I have tried to install the vscode-conllu extension but it doesn’t seem to work.

  1. check that your filename ends in .conllu (which is not the same as .connllu)
  2. check that you are using tabs and not spaces as separators
  3. if it still doesn’t work, save the file as .tsv. You’ll get a different but equally good highlighting for the token lines

I want syntax highlighting, but I don’t use Visual Studio Code.

GOTO step 3 of the answer above ;) Most editors have support for highlighting TSVs and CSVs.

Arborator

I am using Arborator for annotation and I can’t find my project under “My Projects”

…I know right? For the moment, look for it from the homepage using the search bar and then save the direct link. Eventually, I hope they’ll solve the issue.

MaChAmp

I am trying to run a MaChAmp script but it fails complaining that some other Python file does not exist.

Check that you are running the script from MaChAmp’s root folder (i.e. the folder named machamp or machamp-master).

I successfully installed MaChAmp and preprocessed the training and development data and I’m now trying to train my model, but training fails with something like

2025-05-21 10:34:51,518 - ERROR - STDERR - [enforce fail at inline_container.cc:595] . unexpected pos 196207040 vs 196206992

There is probably some formatting error in your CoNLL-U input files. The most common ones are:

  1. missing newlines at the end of the file, which often happens when you split the data into a training and development set yourself
  2. the diabolical non-unix newline characters, which some editors on Windows like adding when you simply open your files.

To fix error 1, open each file and check that it ends with an empty line (i.e. a blank line after the last sentence, as the CoNLL-U format requires). If it doesn’t, add one.
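If you prefer the command line, here is a sketch: the loop appends newlines until the file ends with a blank line. The path is a placeholder (the demo operates on a throwaway file).

```shell
# FILE is a placeholder; point it at your own treebank file.
FILE=/tmp/demo.conllu
printf '1\tcat\tcat\tNOUN\t_\t_\t0\troot\t_\t_' > "$FILE"  # demo: no final newline
# keep adding newlines until the last two bytes are both newlines
while [ -n "$(tail -c 2 "$FILE")" ]; do echo >> "$FILE"; done
```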

To fix error 2, run

tr -d '\r' < PATH-TO-YOUR-NONUNIX-FILE.conllu >  PATH-FOR-THE-OUTPUT-FILE.conllu

(which means “delete all \r characters from the non-unix file and write the output on a new file”).

If training still fails after this, try to validate your CoNLL-U file(s) (see below).

UD validator

How do I use the official UD validator?

  1. clone or download the UD tools repository
  2. move inside the tools folder
  3. run
    python validate.py PATH-TO-YOUR-TREEBANK.conllu --lang=2-LETTER-LANGCODE-FOR-YOUR-LANGUAGE
    

MaChAmp is only concerned with basic formatting issues, so if you are validating for MaChAmp you can add --level=1. If you want to check for errors in your own annotated files, however, you can go up a few levels:

  • 2 checks UD format specifics
  • 3 checks that the universal UD guidelines are followed (e.g. that there are no VERBs used as AUX or multiple subjects in the same sentence)
  • 4 and 5 check language-specific stuff.