DOCENT -- DOCUMENT-LEVEL LOCAL SEARCH DECODER FOR PHRASE-BASED SMT
==================================================================

Christian Hardmeier
26 September 2012

Docent is a decoder for phrase-based Statistical Machine Translation
(SMT). Unlike most existing SMT decoders, it treats complete documents,
rather than single sentences, as translation units and permits the
inclusion of features with cross-sentence dependencies to facilitate
the development of discourse-level models for SMT. Docent implements
the local search decoding approach described by Hardmeier et al.
(EMNLP 2012).

If you publish something that uses or refers to this work, please cite
the following paper:

@inproceedings{Hardmeier:2012a,
	Author = {Hardmeier, Christian and Nivre, Joakim and Tiedemann, J\"{o}rg},
	Booktitle = {Proceedings of the 2012 Joint Conference on Empirical
		Methods in Natural Language Processing and Computational
		Natural Language Learning},
	Month = {July},
	Pages = {1179--1190},
	Publisher = {Association for Computational Linguistics},
	Title = {Document-Wide Decoding for Phrase-Based Statistical
		Machine Translation},
	Address = {Jeju Island, Korea},
	Year = {2012}}

Requests and comments about this software can be addressed to
[email protected]


DOCENT LICENSE
==============

All code included in Docent, except for the contents of the 'external'
directory, is copyright 2012 by Christian Hardmeier. All rights
reserved.

Docent is free software: you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free
Software Foundation, either version 3 of the License, or (at your
option) any later version.

Docent is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.

You should have received a copy of the GNU General Public License along
with Docent. If not, see <http://www.gnu.org/licenses/>.


SNOWBALL STEMMER LICENSE
========================

Docent contains code from the Snowball stemmer library downloaded from
http://snowball.tartarus.org/ (in external/libstemmer_c). This is what
the Snowball stemmer website says about its license (as of 4 Sep 2012):

"All the software given out on this Snowball site is covered by the BSD
License (see http://www.opensource.org/licenses/bsd-license.html), with
Copyright (c) 2001, Dr Martin Porter, and (for the Java developments)
Copyright (c) 2002, Richard Boulton. Essentially, all this means is
that you can do what you like with the code, except claim another
Copyright for it, or claim that it is issued under a different license.
The software is also issued without warranties, which means that if
anyone suffers through its use, they cannot come back and sue you. You
also have to alert anyone to whom you give the Snowball software to the
fact that it is covered by the BSD license. We have not bothered to
insert the licensing arrangement into the text of the Snowball
software."


BUILD INSTRUCTIONS
==================

Building Docent can be a fairly complicated process because of its
library dependencies and the fact that it links against Moses and
shares some of its dependencies with it. Compared to the first public
version, the build process has been much simplified (thanks to Tomas
Hudik for helping me find problems!), but it may still need some
tweaking for your system.

1. Initialising Git submodules

The software packages Arabica (XML parser) and Moses (SMT decoder) are
referenced by Docent as Git submodules. When you check out Docent for
the first time, you need to initialise and update the submodules. From
Docent's top-level directory, run:

	git submodule init
	git submodule update

The build process of Arabica and Moses is driven automatically by
Docent's cmake script.

2. Installing prerequisite components

Download the following software components, then build and install them
into directories of your choice according to the build instructions
provided with each package:

- cmake (build system)
  Versions tested: 2.8.4, 2.8.9
  Source: http://www.cmake.org/
  Make sure you have cmake in your PATH.

- Autotools
  Versions tested:
	autoconf 2.68, 2.69
	automake 1.9.6, 1.12.4
	libtool 2.2.10, 2.4.2
  Source: ftp://ftp.gnu.org/gnu and mirrors
  Arabica requires you to have recent versions of autoconf, automake
  and libtool installed on your system. If you have outdated versions
  of them or experience problems building Arabica, consider installing
  the latest versions. Make sure you have these tools in your PATH.

- Boost (C++ library)
  Versions tested: 1.48.0, 1.50.0
  Source: http://www.boost.org/
  Watch out: many Linux boxes have old versions of Boost lying around
  in system-wide directories. Boost is a major source of build
  problems, especially since Docent links against Moses, which also
  uses Boost and requires a non-standard naming scheme
  (--layout=tagged). Unless your distribution has a very recent Boost
  version preinstalled, I recommend installing Boost in a prefix
  directory of its own and setting the BOOST_ROOT variable as
  explained below. If you install Boost yourself, follow the
  instructions in external/mosesdecoder/BUILD-INSTRUCTIONS.txt to make
  Moses happy. When you're done, rerun b2 without the --layout=tagged
  option:

	./b2 --prefix=$PREFIX --libdir=$PREFIX/lib64 link=static,shared threading=multi install

  Then pass -DBOOST_ROOT=$PREFIX to cmake (see below).

3. Building Docent

Docent can be built in a debuggable DEBUG and an optimised RELEASE
configuration. You will probably need both of them, so it's easiest to
build them in two separate directories. Start by setting the BOOST_ROOT
environment variable to Boost's prefix directory:

	BOOST_ROOT=<your Boost prefix directory>
	export BOOST_ROOT

Then run the following commands from Docent's top directory:

	mkdir DEBUG RELEASE
	cd DEBUG
	cmake -DCMAKE_BUILD_TYPE=DEBUG ..
	make
	cd ../RELEASE
	cmake -DCMAKE_BUILD_TYPE=RELEASE ..
	make


USAGE
=====

1. File input formats

Docent accepts the NIST-XML and the MMAX formats for document input.
Output is always produced in the NIST-XML format. MMAX input requires
you to provide both a NIST-XML file and an MMAX directory, since the
NIST-XML input file is used as a template for the output. NIST-XML
input without MMAX can be used for models that process unannotated
plain text input. MMAX input can be used to provide additional
annotations, e.g. if the input has been annotated for coreference with
a tool like BART (http://www.bart-coref.org/).
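For orientation, a minimal NIST-XML source file has roughly the
following shape. This is a generic NIST MT-evaluation skeleton, not
taken from Docent's own documentation; all attribute values are
placeholders, and the exact attributes Docent expects are best checked
against a working input file:

	<srcset setid="example-set" srclang="fr">
	  <doc docid="doc01" genre="news">
	    <seg id="1">First source sentence of the document.</seg>
	    <seg id="2">Second source sentence of the document.</seg>
	  </doc>
	</srcset>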
2. Decoder configuration

The decoder uses an XML configuration file format. There are two
example configuration files in tests/config.

The <random> tag specifies the initialisation of the random number
generator. When it is empty (<random/>), a random seed value is
generated and logged at the beginning of the decoder run. In order to
rerun a specific sequence of operations (for debugging), a seed value
can be provided as shown in one of the example configuration files
(see also the sketch after this section).

The only phrase table format currently supported is the binary phrase
table format of Moses (generated with processPhraseTable). Note that
Docent, unlike Moses, doesn't have a parameter to limit the number of
translations loaded from the phrase table for a given phrase. We
recommend that you filter your phrase tables with the filter-pt tool
from the Moses distribution and a setting like "-n 30" before using
them with Docent; otherwise you may experience very poor performance.

Language models should be in KenLM's probing hash format (other
formats supported by KenLM can potentially be used as well). Command
sketches for these preparation steps are given after this section.

There is some more documentation about the configuration file format
on Docent's wiki page:
https://github.com/chardmeier/docent/wiki/Docent-Configuration
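Only the behaviour of the <random> element is described above; the
fixed-seed variant below is a hypothetical illustration (in particular,
the <seed> child element is an assumption, not documented syntax), so
take the exact structure from the files in tests/config:

	<!-- a seed is generated and logged at decoder start: -->
	<random/>

	<!-- hypothetical sketch of a fixed seed for reproducible runs;
	     check tests/config for the element structure Docent
	     actually expects: -->
	<random>
		<seed>42</seed>
	</random>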
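The following shell sketch illustrates the model preparation steps
described above. It is meant as orientation, not authoritative usage:
the piping convention assumed for filter-pt may differ between Moses
versions (only the "-n 30" setting is taken from this README), so check
each tool's usage message in your Moses checkout. build_binary is part
of KenLM:

	# Filter the phrase table to at most 30 translations per source
	# phrase (filter-pt from the Moses distribution; invocation is
	# an assumption -- check its usage message):
	zcat phrase-table.gz | filter-pt -n 30 | gzip > phrase-table.filtered.gz

	# Binarise the filtered table with processPhraseTable
	# (5 scores, translating factor 0 to factor 0):
	export LC_ALL=C
	zcat phrase-table.filtered.gz | sort > phrase-table.sorted
	processPhraseTable -ttable 0 0 phrase-table.sorted -nscores 5 -out phrase-table.bin

	# Build a language model in KenLM's probing hash format:
	build_binary probing model.arpa model.probing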
3. Invoking the decoder

There are three main Docent binaries:

	docent		decodes a corpus on a single machine without
			producing intermediate output
	lcurve-docent	produces learning curves
	mpi-docent	decodes a corpus on an MPI cluster

I recommend that you use lcurve-docent for experimentation.

Usage:
	lcurve-docent {-n input.xml|-m input.mmaxdir input.xml} config.xml outputstem

Use the -n or -m switch to provide NIST-XML input only or NIST-XML
input combined with MMAX annotations, respectively. The file config.xml
contains the decoder configuration. The decoder produces output files
for the initial search state and for intermediate states whenever the
step number is a power of two, starting at 256. Output files are named

	outputstem.000000000.xml
	outputstem.000000256.xml

etc.

4. Extending the decoder

To implement new feature functions, start from one of the existing
ones. SentenceParityModel.cpp contains a very simple proof-of-concept
feature function that promotes length parity per document (i.e. all
sentences in a document should consistently have either odd or even
length); a stand-alone sketch of this scoring idea is given below. If
you're planning to implement something more complex, NgramModel.cpp or
SemanticSpaceLanguageModel.cpp may be good places to start.

Good luck!
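Docent's actual feature function interface is defined in the source
tree and is not reproduced here. The following stand-alone C++ sketch
only illustrates the scoring idea behind SentenceParityModel; the
function name, the character-based length measure and the scoring
convention are illustrative assumptions, not Docent's API:

	#include <cstddef>
	#include <iostream>
	#include <string>
	#include <vector>

	// Toy illustration of the idea behind SentenceParityModel:
	// a document scores higher the more consistently its sentences
	// share the same length parity (all odd or all even). Length is
	// measured in characters here for simplicity; a real feature
	// function would work on the decoder's document state and
	// return a log-domain score.
	double parityScore(const std::vector<std::string> &sentences) {
		std::size_t even = 0, odd = 0;
		for (const std::string &s : sentences) {
			if (s.length() % 2 == 0)
				even++;
			else
				odd++;
		}
		// Number of sentences agreeing with the majority parity.
		return static_cast<double>(even > odd ? even : odd);
	}

	int main() {
		std::vector<std::string> doc = {"abcd", "efghij", "klm"};
		std::cout << "parity score: " << parityScore(doc) << "\n"; // prints 2
		return 0;
	}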