CS4225-Big-Data

AY22/23 Sem 1

Task 1: Given two textual files, count the number of words that are common.

Problem definition

Given TWO textual files, for each common word between the two files, find the smaller number of times that it appears between the two files. Output the top 20 common words with highest such frequency (For words with the same frequency, there’s no special requirement for the output order). Example: if the word “John” appears 5 times in the 1st file and 3 times in the 2nd file, the smaller number of times is 3.

Requirements

Split the input text with “(space)\t\n\r\f”. Any other tokens like “,.:`” will be regarded as a part of the words. Remove stop-words as given in Stopwords.txt, such as “a”, “the”, “that”, “of”, … (case sensitive). Sort the common words in descending order of the smaller number of occurrences in the two files. In general, words with different case or different non-whitespace punctuation are considered different words.

Marking

The assignment contains a public dataset 'data/' and expected output 'answer.txt'. If your codes are correct, your output should be the same as 'answer.txt'. Different orders will also be considered as correct in marking. All the codes will be automatically compiled and marked by similar scripts as 'compile_run' on a private test dataset. So, ensure your codes can be compiled by the script in your package.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
assign1_common_words		assign1_common_words
4225_submit-receipt.png		4225_submit-receipt.png
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CS4225-Big-Data

Problem definition

Requirements

Marking

About

Releases

Packages

Languages

mrwsy1/CS4225-Big-Data

Folders and files

Latest commit

History

Repository files navigation

CS4225-Big-Data

Problem definition

Requirements

Marking

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages