This repository was archived by the owner on Feb 14, 2023. It is now read-only.
tag-clusterer. It clusters tags.
Actually, it doesn't. It's just a pair of filters around a word clustering
tool (currently, mkcls only): the tagged text is delexicalised, the result is
clustered, and a tagset specification (.tsx) file is generated from the
clusters for use with apertium-tagger.
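To make "delexicalise" concrete, here is a minimal Python sketch of the idea. It is not the code of the filters in this repository: the function name, the dot-joined output and the one-token-per-unit layout are assumptions made for illustration, and it assumes the input uses Apertium stream units of the form ^...$ with tags written as <tag>. Escaped characters are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^ ... $ (assumed input shape).
UNIT = re.compile(r"\^([^$]*)\$")
# One tag inside a unit, e.g. <det> or <pl>.
TAG = re.compile(r"<([^<>]+)>")


def delexicalise(line: str) -> str:
    """Replace every ^...$ unit with its tags joined by dots (hypothetical layout)."""
    tokens = []
    for unit in UNIT.finditer(line):
        body = unit.group(1)
        tags = TAG.findall(body)
        # Units without any tag are kept verbatim (an assumption, not the
        # documented behaviour of the real filter).
        tokens.append(".".join(tags) if tags else body)
    return " ".join(tokens)


if __name__ == "__main__":
    print(delexicalise("^the/the<det><def><sp>$ ^cats/cat<n><pl>$"))
```

Each output token is then a whole tag sequence, which is what lets a word clustering tool group tags as if they were words.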
The input must be tagged -- not analysed -- text in the format used for
supervised training of the tagger. Feed the filter's output to mkcls (play
around with the number of classes it generates), then feed mkcls's output
into mkcls-to-tsx.pl.
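The last step can be pictured with the following Python sketch. It is not mkcls-to-tsx.pl: it assumes the class file has one "token<TAB>class" pair per line, and the CLASS<n> label names and the function name are hypothetical. The element names (tagger, tagset, def-label, tags-item) follow the .tsx format as I understand it, so check the result against a real .tsx file before relying on it.

```python
from collections import defaultdict
from xml.sax.saxutils import quoteattr


def classes_to_tsx(lines, name="clustered"):
    """Group delexicalised tokens by class id and emit a minimal .tsx skeleton."""
    clusters = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        token, cls = parts
        clusters[cls].append(token)

    out = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        f"<tagger name={quoteattr(name)}>",
        "  <tagset>",
    ]
    # Class ids are sorted as strings, so "10" sorts before "2".
    for cls in sorted(clusters):
        out.append(f"    <def-label name={quoteattr('CLASS' + cls)}>")
        for token in clusters[cls]:
            out.append(f"      <tags-item tags={quoteattr(token)}/>")
        out.append("    </def-label>")
    out += ["  </tagset>", "</tagger>"]
    return "\n".join(out)


if __name__ == "__main__":
    print(classes_to_tsx(["n.pl\t2", "det.def.sp\t3", "n.sg\t2"]))
```

Every cluster becomes one def-label, so the number of classes requested from mkcls directly sets the size of the generated tagset.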
semi-lexicalise.pl can take a set of tags to treat as stopwords, in the
apertium-transfer-tools .atx format, and semi-lexicalise the input. In this
case, the generated .tsx file will have closed classes, and may be usable
without extra intervention.
An .atx file looks like this:
<?xml version="1.0" encoding="iso-8859-1"?>
<transfer-at source="Portuguese" target="Spanish">
  <source>
    <lexicalized-words>
      <lexicalized-word tags="cnjsub"/>
      <lexicalized-word tags="det.*"/>
      <lexicalized-word tags="pr"/>
      <lexicalized-word tags="prn.tn.*"/>
      <lexicalized-word tags="prn.enc.*"/>
      <lexicalized-word tags="prn.pro.*"/>
      <lexicalized-word tags="rel.*"/>
      <lexicalized-word tags="vbser.*"/>
      <lexicalized-word tags="vbhaver.*"/>
      <lexicalized-word tags="vbmod.*"/>
      <lexicalized-word tags="vblex.*" lemma="há"/>
    </lexicalized-words>
  </source>
  <target>
    <lexicalized-words>
      <lexicalized-word tags="cnjsub"/>
      <lexicalized-word tags="det.*"/>
      <lexicalized-word tags="pr"/>
      <lexicalized-word tags="prn.tn.*"/>
      <lexicalized-word tags="prn.enc.*"/>
      <lexicalized-word tags="prn.pro.*"/>
      <lexicalized-word tags="rel.*"/>
      <lexicalized-word tags="vbser.*"/>
      <lexicalized-word tags="vbhaver.*"/>
      <lexicalized-word tags="vbmod.*"/>
      <lexicalized-word tags="vblex.*" lemma="hacer"/>
    </lexicalized-words>
  </target>
</transfer-at>
semi-lexicalise.pl only uses the <source> part, so you only need that.
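As an illustration of reading only the <source> part, here is a small Python sketch using the standard library. It is not how semi-lexicalise.pl is implemented. The function names are hypothetical, and the matching rule (a trailing ".*" matches any further tags, anything else must match exactly) is an assumption about how the patterns are meant to be read.

```python
import xml.etree.ElementTree as ET

# A cut-down .atx file with only the <source> part, encoded as declared.
ATX = """<?xml version="1.0" encoding="iso-8859-1"?>
<transfer-at source="Portuguese" target="Spanish">
  <source>
    <lexicalized-words>
      <lexicalized-word tags="det.*"/>
      <lexicalized-word tags="vblex.*" lemma="há"/>
    </lexicalized-words>
  </source>
</transfer-at>""".encode("iso-8859-1")


def load_source_stopwords(atx_bytes: bytes):
    """Return (tags, lemma) pairs from <source>; lemma is None when absent."""
    root = ET.fromstring(atx_bytes)
    source = root.find("source")
    if source is None:
        return []
    return [(w.get("tags"), w.get("lemma")) for w in source.iter("lexicalized-word")]


def matches(pattern: str, tags: str) -> bool:
    """Assumed semantics: a trailing ".*" matches zero or more further tags."""
    if pattern.endswith(".*"):
        prefix = pattern[:-2]
        return tags == prefix or tags.startswith(prefix + ".")
    return pattern == tags


if __name__ == "__main__":
    for tags, lemma in load_source_stopwords(ATX):
        print(tags, lemma)
```

Passing the file as bytes lets the XML parser honour the iso-8859-1 declaration, which matters for lemmas such as "há".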