Skip to content

ferhanture/WikiBitext

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

README

(created 3/30/12, last updated 7/10/12, by Ferhan Ture, [email protected])

We map Wikipedia internal docids to docnos used in our system:

dewiki_docno-mapping.dat : mapping from German Wikipedia internal docids to our docnos (to use this, see class edu.umd.cloud9.collection.DocnoMapping in our Cloud9 code library)
enwiki_docno-mapping.dat : mapping from English Wikipedia internal docids to our docnos
dewiki_docno-mapping.txt : same as dewiki_docno-mapping.dat, in text format.
enwiki_docno-mapping.txt : same as enwiki_docno-mapping.dat, in text format.

( format: [Wikipedia docid]\t[docno] )

$ head -3 enwiki_docno-mapping.txt
12	1
25	2
39	3

sample.docnos : list of docnos for the 1064 sample German articles used in evaluations

$ head -3 sample.docnos 
508
4077
6033

Similar German-English document pairs found by the cross-lingual pwsim algorithm (and ground truth pairs), as described in Ture, Lin and Elsayed, SIGIR 2011:

d1000_q300_t400_b2000.pwsim-all : all docno pairs, found simiar by the cross-lingual pwsim algo, with parameters #bits=1000, #tables=300, hamming threshold=400, windowsize=2000
d1000_q300_t400_b2000.pwsim-sample : docno pairs corresponding to the 1064 sampled German docnos

( format: [English docno]\t[German docno]\t[Hamming distance] )

$ head -3 d1000_q300_t400_b2000.pwsim-sample
2	77357	391
7	49806	399
41	1049339	396

ground_t30.cosine : docnos of all English articles similar to any of the 1064 German sample articles in sample.docnos, as found by brute-force over docvectors, cosine threshold = 0.3
ground_t30_top1.cosine : same as above, but keep top 1 for each German article

( format: [English docno]\t[German docno]\t[Cosine similarity] )

$ head -3 ground_t30.cosine 
3012458	11366	0.332
8205	11366	0.326
492411	11366	0.325
$ head -3 ground_t30_top1.cosine 
3012458	11366	0.332
739791	14456	0.380
1318820	18551	0.324

interwiki-links.en2de : docnos of all English articles similar to any of the 1064 German sample articles, as suggested by Wikipedia interwiki links (at most 1 for each German article)

( format: [English docno]\t[German docno] )

$ head -3 interwiki-links.en2de 
1534	299154
5708	121894
7845	64857

We extracted bitext from German and English Wikipedia (dumps from 1/30/11 and 1/15/11 respectively), using the algorithm described in Ture and Lin, NAACL-HLT 2012. This bitext was the best performing in our experiments (see paper), where we used the two-stage classification approach:

  • Threshold of 0.98 was applied on simple classifier in first stage (output = 13,746,173 pairs), and a 0.60 threshold was applied on the complex classifier in the second stage (final output = 5,761,517 pairs).

bitext-wiki_de-en.de.bz2 = German side of the bitext, lowercased, no tokenization, bz2-compressed. bitext-wiki_de-en.en.bz2 = English side of the bitext, lowercased, no tokenization, bz2-compressed.

$ bzcat bitext-wiki_de-en.de.bz2 | tail -3
kategorie:chief minister (orissa)
saddam hussein unterschrieb als vizepräsident persönlich einen zusicherungsvertrag  
saddam hussein unterschrieb als vizepräsident persönlich einen zusicherungsvertrag  
$ bzcat bitext-wiki_de-en.en.bz2 | tail -3
sitalsasthi  (may–june, orissa, neighbouring regions)	
saddam hussein declared that the invasion was a response to it.	
on december 30, 2006, saddam hussein was hanged.

Second version of German-English bitext, after a series of changes to our bitext extraction code, and with casing preserved. This bitext has not been evaluated on MT training yet.

  • Threshold of 0.98 was applied on simple classifier in first stage (output = 16,139,076 pairs), and a 0.62 threshold was applied on the complex classifier in the second stage (final output = 5,404,313 pairs).

bitext-wiki-v2_de-en.de.bz2 = German side of the bitext, in raw format (no tokenization or lowercasing). bitext-wiki-v2_de-en.en.bz2 = English side of the bitext, in raw format (no tokenization or lowercasing).

$ bzgrep 'Hitler' bitext-wiki-v2_de-en.de.bz2 | head -4
(1939) The High Cost of Hitler
(Deutsch: Hitlers Bankier.
(siehe Adolf Hitler:Politische Anfänge).
1. Januar 1925 von Hitler privat als persönlicher Mitarbeiter angestellt, war Schaub als einer von Hitlers persönlichen Adjutanten bis ins Jahr 1945 ständig in Hitlers Nähe.

$ bzgrep 'Hitler' bitext-wiki-v2_de-en.en.bz2 | head -4
(1939) The High Cost of Hitler		
Hitler's Banker: Hjalmar Horace Greeley Schacht.		
see Adolf Hitler for more books		
On January 1, 1925 hired privately by Hitler as a personal assistant, Schaub was one of Hitler's personal adjutant to the year 1945 and in constant close to Hitler.		

About

Bilingual text collections, extracted from Wikipedia

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published