Rarely Asked Questions about (Swedish) UD
This is a collection of summaries of discussions about the Universal Dependencies guidelines that my collaborators and I opened during the annotation of the UD_Swedish-SweLL treebank. Most of the conclusions are Swedish-specific, but some of the logic behind them is also applicable to other languages. Parts of this list have been or will be published on the Språkbanken Text blog.
- Comparative constructions (they are simpler than you think!)
- Går att VERBa and other equally tough constructions
- Participles
- Att VERBa själv
- Subword-level coordination
- Morphological analysis of syncretic adjective forms
- Att vara X år gammal or to be X years old
- Vad ADJ!
- The ABCs of UD
Comparative constructions (they are simpler than you think!)
Comparative constructions such as
1. att annotera dessa konstruktioner är enklare än du tror, and
2. vissa konstruktioner är enklare än andra
look tricky, but the guidelines for them have recently gotten more comprehensive and, at least when it comes to Swedish, easier to understand and follow.
The first question might be what UPOS tag to assign to the word än.
The answer to that is SCONJ.
In (1), this is uncontroversial, as än introduces a subordinate clause, än du tror.
As for (2), the guidelines state that “if the same conjunction is used with bare nominals, we still tag it SCONJ”.
When it comes to the dependency structure of the construction, the clause or nominal introduced by än should always be attached to the property whose degree is compared, which is enklare in both (1) and (2).
The specific labels depend on whether the standard of comparison is a clause or a nominal.
For sentences like (1), we use advcl for the subordinate clause (headed by tror) and mark for än.
In cases such as (2), we use obl for the nominal (andra) and case for än.
The only remaining issue is where to draw the line between clausal and nominal comparison. Sentences like
3. parsern annoterar dessa konstruktioner bättre än jag
can be rephrased as both
- parsern annoterar dessa konstruktioner bättre än jag skulle göra (clausal) and
- parsern annoterar dessa konstruktioner bättre än mig (nominal).
Ambiguous cases like (3) are treated like (2), that is, as nominal comparisons.
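The labelling rule for comparatives can be summarized in a few lines of Python. This is only a minimal sketch: the function names and the `(id, form, head, deprel)` tuple layout are made up for illustration, and only the attachment of än and of the standard of comparison comes from the discussion above. The remaining rows (det, nsubj, cop, csubj, obj) are an assumed, conventional UD analysis.

```python
def comparison_labels(standard):
    """Return (deprel of the standard of comparison, deprel of 'än').

    `standard` is "clause", "nominal" or "ambiguous"; ambiguous
    standards are treated like nominal ones.
    """
    if standard == "clause":
        return ("advcl", "mark")
    if standard in ("nominal", "ambiguous"):
        return ("obl", "case")
    raise ValueError(f"unknown kind of standard: {standard!r}")


# Rows are (id, form, head, deprel); head 0 marks the root.
# Sentence (1): clausal standard of comparison.
SENTENCE_1 = [
    (1, "att", 2, "mark"),
    (2, "annotera", 6, "csubj"),
    (3, "dessa", 4, "det"),
    (4, "konstruktioner", 2, "obj"),
    (5, "är", 6, "cop"),
    (6, "enklare", 0, "root"),
    (7, "än", 9, "mark"),
    (8, "du", 9, "nsubj"),
    (9, "tror", 6, "advcl"),
]

# Sentence (2): nominal standard of comparison.
SENTENCE_2 = [
    (1, "vissa", 2, "det"),
    (2, "konstruktioner", 4, "nsubj"),
    (3, "är", 4, "cop"),
    (4, "enklare", 0, "root"),
    (5, "än", 6, "case"),
    (6, "andra", 4, "obl"),
]


def attachment(sentence, form):
    """Return (head form, deprel) for the first token with this form."""
    forms = {tok_id: tok_form for tok_id, tok_form, _, _ in sentence}
    for _, tok_form, head, deprel in sentence:
        if tok_form == form:
            return forms.get(head, "ROOT"), deprel
    raise KeyError(form)
```

In both sentences the standard of comparison hangs from enklare; only the pair of labels changes.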
Går att VERBa and other equally tough constructions
How to annotate sentences like detta går att debattera? Well, that sparked a whole debate.
Semantically, detta is the object of debattera. If we rephrase the sentence as det går att debattera detta or att debattera detta går, the syntactic analysis coincides with the semantic one, with detta attached to debattera as its object.
In detta går att debattera, on the other hand, detta acts like the syntactic subject of går, which becomes the head of the construction.
It remains to decide how to annotate the subordinate clause att debattera.
Since its subject is not controlled by that of the superordinate clause, we use ccomp rather than xcomp.
It turns out that this is similar to tough-movement, so called because the prototypical English example sentences for the phenomenon involve the word tough:
1. this problem is tough to solve
2. it is tough to solve this problem
3. to solve this problem is tough
In sentences like (1), the syntactic subject (problem) of the main verb (is) is logically the object of an embedded non-finite verb (solve), although in UD the root of the sentence would be tough, not is. In paraphrases (2) and (3), on the other hand, logical and grammatical structure coincide. General guidelines about the annotation of tough-constructions, though, are still being debated at the time of writing.
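The analysis of detta går att debattera described above can be written down as a small Python sketch. The helper name `complement_label` and the tuple layout are hypothetical, and the mark label on att is an assumption based on common UD practice rather than something stated in the discussion.

```python
def complement_label(subject_is_controlled):
    """Pick the label for a clausal complement.

    xcomp is used when the embedded subject is controlled by an argument
    of the superordinate clause, ccomp otherwise.
    """
    return "xcomp" if subject_is_controlled else "ccomp"


# Rows are (id, form, head, deprel); head 0 marks the root.
GAR_ATT = [
    (1, "detta", 2, "nsubj"),   # syntactic subject of går
    (2, "går", 0, "root"),      # head of the construction
    (3, "att", 4, "mark"),      # assumed label for the infinitive marker
    (4, "debattera", 2, complement_label(subject_is_controlled=False)),
]

for tok_id, form, head, deprel in GAR_ATT:
    print(f"{tok_id}\t{form}\t{head}\t{deprel}")
```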
Participles
[go to the discussion on GitHub]
Participles may look nearly as tough as tough-constructions because they work as different parts of speech depending on the context. Consider the following cases:
1. skolan får ökade möjligheter
2. jag blev bjuden på te
3. flickan var strålande glad
4. detta sökande gav inget resultat
Cases like (1) are by far the most frequent.
The past participle ökade clearly modifies the noun möjligheter and should therefore be tagged as ADJ.
However, since it is derived from the verb öka, it also takes the typically verbal features Tense and VerbForm.
In (2), bjuden may also be seen as an adjective, but bli + participle passive constructions are treated differently: the participle is tagged as VERB (with Tense and VerbForm) and bli is annotated as a passive auxiliary.
In (3), the present participle strålande modifies an adjective, glad.
We therefore give it the ADV UPOS tag, again with the Tense and VerbForm features.
Finally, in cases like (4), the participle should be tagged as NOUN.
Its morphological analysis should be consistent with the UPOS tag, so rather than using Tense and VerbForm we annotate for Case, Definite, Gender and Number.
In all four cases, including when the UPOS tag is VERB, the lemma of the participial form is the participial form itself.
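The four cases above boil down to a lookup from the participle's function to a UPOS tag and a set of feature names. The following Python sketch is one way to write that down. The context names and the function `tag_participle` are invented for illustration, and the feature tuples list only the features mentioned in the discussion, not a complete morphological analysis.

```python
# Function of the participle -> (UPOS, feature names mentioned above).
PARTICIPLE_RULES = {
    "modifies_noun": ("ADJ", ("Tense", "VerbForm")),
    "bli_passive": ("VERB", ("Tense", "VerbForm")),
    "modifies_adjective": ("ADV", ("Tense", "VerbForm")),
    "nominal": ("NOUN", ("Case", "Definite", "Gender", "Number")),
}

# In the bli passive, bli itself is attached as a passive auxiliary.
BLI_DEPREL = "aux:pass"

# The participles from examples (1)-(4) and the context they occur in.
EXAMPLES = {
    "ökade": "modifies_noun",           # skolan får ökade möjligheter
    "bjuden": "bli_passive",            # jag blev bjuden på te
    "strålande": "modifies_adjective",  # flickan var strålande glad
    "sökande": "nominal",               # detta sökande gav inget resultat
}


def tag_participle(form):
    """Return (UPOS, feature names) for one of the example participles."""
    return PARTICIPLE_RULES[EXAMPLES[form]]


for form in EXAMPLES:
    upos, feats = tag_participle(form)
    print(form, upos, ", ".join(feats))
```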
Att VERBa själv
[go to the discussion on GitHub]
Constructions like att bestämma själv are clear when it comes to dependency structure (the root is the verb and själv is one of its direct dependents).
Talbanken and LinES, the two largest Swedish UD treebanks, used to label this edge differently: the former used amod, the latter advmod.
But while two dispute, the third enjoys!
Recent discussion led to re-analyzing this as secondary predication, which implies using a clausal relation type.
Since själv is optional, the relation of choice is advcl.
This is consistent with the pre-existing use of acl in cases like du borde vara dig själv, where the head is a nominal (dig).
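As a rough illustration of the choice between advcl and acl for själv, here is a minimal Python sketch. The function name `sjalv_deprel` and the set of nominal UPOS tags are assumptions made for the example, as is the mark label on att.

```python
# UPOS tags treated as nominal heads in this sketch (an assumption).
NOMINAL_UPOS = {"NOUN", "PROPN", "PRON"}


def sjalv_deprel(head_upos):
    """Secondary predication: acl under a nominal head, advcl otherwise."""
    return "acl" if head_upos in NOMINAL_UPOS else "advcl"


# att bestämma själv: rows are (id, form, UPOS, head, deprel).
BESTAMMA_SJALV = [
    (1, "att", "PART", 2, "mark"),
    (2, "bestämma", "VERB", 0, "root"),
    (3, "själv", "ADJ", 2, sjalv_deprel("VERB")),
]
```

With a pronoun head, as for dig in du borde vara dig själv, `sjalv_deprel("PRON")` gives acl instead.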
Subword-level coordination
[go to the discussion on GitHub]
How to analyze constructions like levnads- och beteendemässigt?
Ideally, we would want the conjuncts levnads- and beteende(-) to form a compound with mässigt, but this is currently beyond the expressive capacity of UD. To circumvent the problem, we lemmatize levnad as levnadsmässigt, obtaining a conjunction of two adverbs.
Morphological analysis of syncretic adjective forms
[go to the discussion on GitHub]
A handful of Swedish adjectives, such as bra and äkta, are indeclinable, or rather, they inflect for degree (and, if nominalized, case), but not gender, number or definiteness. Other adjectives, such as nyttig, inflect for the latter three features as well, but with a certain degree of syncretism: the form nyttiga, for example, can be a singular definite (of either gender) or a plural (irrespective of both gender and definiteness).
In (Swedish) UD, a general principle is to ground morphological annotation on the observed word form and avoid inferring features based on the context.
Adjectives like bra should therefore only be annotated for Case and Degree.
This amounts to saying “this form works just as well for every combination of gender, number and definiteness”.
Annotation of adjectives like nyttig, on the other hand, partially deviates from this idea.
When assigning morphological features to -a forms like nyttiga, we would ideally want to convey that they can be definite and/or plural (but not singular and indefinite!).
The problem is that UD v2 allows expressing disjunctions of values for a single feature (e.g. Number=Sing,Plur) but not of combinations of several feature-value pairs.
Leaving -a forms unannotated for number and definiteness would be misleading, as it would imply that they can be used in the indefinite singular case too.
As a consequence, these two features are annotated contextually.
The current practice can be summarized as follows (case and degree are ignored for the sake of compactness):
| form | features |
|---|---|
| nyttig | Definite=Ind\|Gender=Com\|Number=Sing |
| nyttigt | Definite=Ind\|Gender=Neut\|Number=Sing |
| nyttiga | Definite=Def in definite contexts; Definite=Ind\|Number=Plur in plural contexts |
Nyttiga regler för nyttiga adjektiv!
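The table can be read as a small decision procedure. The Python sketch below encodes it for the three forms of nyttig. The function name and the `definite_context` flag are hypothetical, and letting definite contexts take precedence when an -a form is both definite and plural is an assumption, since the table does not spell that case out. Case and Degree are left out, as in the table.

```python
def nyttig_feats(form, definite_context=False):
    """Return the FEATS string (without Case and Degree) for a form of nyttig.

    `definite_context` only matters for the syncretic -a form, which is
    the one annotated contextually.
    """
    if form == "nyttig":
        return "Definite=Ind|Gender=Com|Number=Sing"
    if form == "nyttigt":
        return "Definite=Ind|Gender=Neut|Number=Sing"
    if form == "nyttiga":
        # Assumption: definiteness wins over plurality when both apply.
        if definite_context:
            return "Definite=Def"
        return "Definite=Ind|Number=Plur"
    raise ValueError(f"not a positive form of nyttig: {form!r}")


# nyttiga regler: indefinite plural context
print(nyttig_feats("nyttiga"))
# den nyttiga regeln: definite context
print(nyttig_feats("nyttiga", definite_context=True))
```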
Att vara X år gammal or to be X years old
[go to the discussion on GitHub]
The expression to be X years old/att vara X år gammal used to be treated inconsistently across English and Swedish treebanks. As of UD 2.16, its annotation has been standardized.
Most importantly, gammal/old is the head and år/years is assigned the deprel obl.
Some English treebanks specify the subtype obl:unmarked (adpositionless oblique).
If you speak any other languages where a similar construction is used, check how it is annotated!
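For Swedish, the standardized analysis can be sketched in Python as follows. The sentence hon är tio år gammal is an invented example, not one from the discussion, and the nsubj, cop and nummod rows are an assumed, conventional UD analysis. Only the choice of gammal as head and of obl for år comes from the text above.

```python
# Rows are (id, form, head, deprel); head 0 marks the root.
YEARS_OLD = [
    (1, "hon", 5, "nsubj"),
    (2, "är", 5, "cop"),
    (3, "tio", 4, "nummod"),
    (4, "år", 5, "obl"),
    (5, "gammal", 0, "root"),
]


def dependents(sentence, head_form):
    """Return (form, deprel) pairs for the direct dependents of a token."""
    head_id = next(i for i, form, _, _ in sentence if form == head_form)
    return [(form, rel) for _, form, head, rel in sentence if head == head_id]


print(dependents(YEARS_OLD, "gammal"))
```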
Vad ADJ!
[go to the discussion on GitHub]
Vad trevligt att det finns en del 2 av det här blogginlägget![^1]
This sentence is equivalent to English How nice that there is a part 2 of this blog post!, except that vad, unlike how, is a pronoun.
Since the head, trevligt, is an adjective and not a nominal, we use the dependency relation obl.
By the same logic, I suppose that the vad in vad fan is to be annotated as nmod.
The ABCs of UD
[go to the discussion on GitHub]
How to annotate vitamin A and the like?
Since vitamin A is a kind of vitamin, but not a kind of A, the head should be vitamin. A (or B or C) is treated as a proper noun and attached to it with the relation nmod.
But what about musical keys, like C minor?
This is not quite relevant for Swedish, where C-moll is written as a single word, but it is cross-lingually interesting.
At least in English, French, German, Italian, Spanish and Welsh, minor and major are adjectives and therefore amods of the note at hand.
In English, this is a rare case of a postnominal adjectival modifier.
In Czech, on the other hand, dur and moll inflect as nouns and even have derived adjectives (durový, mollový).
Unlike the C in vitamin C, the note name C is just a common noun that, in languages like Italian, can be introduced by an article (cf. il do centrale, ‘the central C’).
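The contrast between vitamin A and C minor can be made explicit with a short Python sketch. The tuple layout and the helper `phrase_head` are hypothetical, and the NOUN tag for vitamin is an assumption. The head choices, the nmod and amod labels, and the PROPN, NOUN and ADJ tags for A, C and minor follow the discussion above.

```python
# Rows are (form, UPOS, head form or None for the phrase head, deprel).
VITAMIN_A = [
    ("vitamin", "NOUN", None, None),
    ("A", "PROPN", "vitamin", "nmod"),
]

C_MINOR = [
    ("C", "NOUN", None, None),
    ("minor", "ADJ", "C", "amod"),
]


def phrase_head(phrase):
    """Return the form of the token that heads the phrase."""
    return next(form for form, _, head, _ in phrase if head is None)


print(phrase_head(VITAMIN_A), phrase_head(C_MINOR))
```

The letter is the dependent in the first phrase and the head in the second, which is why it gets a different UPOS tag in each.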
[^1]: (hoppas jag att du tycker)