Common Crawl

Download

  1. Download the website indices (25.6 MB): https://f000.backblazeb2.com/file/malay-dataset/dumping/common-crawl/mse-index.zip
  2. Download the dumped data (9.6 GB): https://f000.backblazeb2.com/file/malay-dataset/dumping/common-crawl/feather.zip
  3. Download the cleaned plain text (2.93 GB): https://f000.backblazeb2.com/file/malay-dataset/dumping/common-crawl/cleaned-common-crawl.txt
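
The files above are large, so it may help to stream them to disk rather than load them into memory. The following is a minimal sketch using only the Python standard library; the names `FILES`, `filename_from_url`, and `download` are hypothetical helpers introduced for illustration and are not part of this repository.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

# URLs copied from the list above; the dictionary keys are illustrative labels.
BASE = "https://f000.backblazeb2.com/file/malay-dataset/dumping/common-crawl/"
FILES = {
    "indices": BASE + "mse-index.zip",
    "dump": BASE + "feather.zip",
    "cleaned_text": BASE + "cleaned-common-crawl.txt",
}


def filename_from_url(url):
    """Return the last path component of a URL, e.g. 'feather.zip'."""
    return Path(urlparse(url).path).name


def download(url, dest, chunk_size=1024 * 1024):
    """Stream url to dest in chunks and return dest as a Path.

    Data is written to a temporary '.part' file first and renamed on
    success, so an interrupted transfer is not mistaken for a complete file.
    """
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    tmp.replace(dest)
    return dest
```

For example, `download(FILES["indices"], Path("data") / filename_from_url(FILES["indices"]))` would fetch the smallest of the three files into a local `data` directory (a directory name assumed here). This sketch does not resume partial downloads or verify checksums.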

Citation

@misc{Malay-Dataset,
  note = {We gather Bahasa Malaysia corpus! Common Crawl},
  author = {Husein, Zolkepli},
  title = {Malay-Dataset},
  year = {2018},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/dumping/singlish-text}}
}