CS4225 Big Data Systems for Data Science. Graded assignment 1, AY22/23 Sem1.

mrwsy1/CS4225-Big-Data

CS4225-Big-Data

AY22/23 Sem 1

Task 1: Given two text files, count the words they have in common.

Problem definition

Given two text files, for each word that appears in both files, take the smaller of its two occurrence counts. Output the 20 common words with the highest such counts (words with the same count may be output in any order). Example: if the word “John” appears 5 times in the first file and 3 times in the second file, the smaller count is 3.
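The definition above can be sketched in a few lines of plain Python. This is only an illustration of the counting rule, not the assignment's expected implementation; the function name `top_common_words` and the sample word lists are hypothetical. It relies on the fact that the `&` operator on two `collections.Counter` objects keeps, for each key, the minimum of the two counts and drops keys whose minimum is zero.

```python
from collections import Counter


def top_common_words(words_a, words_b, k=20):
    """Return up to k (word, count) pairs, where count is the smaller
    number of occurrences of the word across the two word lists."""
    # Counter intersection keeps min(count_a, count_b) for each word
    # and discards words that are missing from either side.
    common = Counter(words_a) & Counter(words_b)
    # most_common sorts by count in descending order.
    return common.most_common(k)


# Hypothetical sample data mirroring the "John" example above.
file1_words = ["John"] * 5 + ["Mary"] * 2
file2_words = ["John"] * 3 + ["Mary"] * 4 + ["Bob"]

result = top_common_words(file1_words, file2_words)
print(result)
```

Here “John” gets a count of 3 and “Mary” a count of 2, while “Bob” is dropped because it appears in only one list.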

Requirements

Split the input text on the characters “(space)\t\n\r\f”. Any other characters, such as “,.:`”, are treated as part of the word. Remove the stop-words listed in Stopwords.txt, such as “a”, “the”, “that”, “of”, … (matching is case sensitive). Sort the common words in descending order of the smaller of their two occurrence counts. In general, words that differ in case or in non-whitespace punctuation are considered different words.
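As a rough sketch of the splitting and stop-word rules above, the following Python snippet splits on exactly the five listed delimiter characters and filters stop-words with a case-sensitive lookup. The function name `tokenize`, the small stop-word set, and the sample sentence are assumptions for illustration; the real list comes from Stopwords.txt.

```python
import re

# Split only on space, tab, newline, carriage return, and form feed.
DELIMITERS = re.compile(r"[ \t\n\r\f]+")


def tokenize(text, stopwords):
    """Split text on the listed delimiters and drop stop-words.
    Punctuation stays attached to words; matching is case sensitive."""
    return [
        token
        for token in DELIMITERS.split(text)
        if token and token not in stopwords
    ]


# Hypothetical stop-word set; the assignment supplies Stopwords.txt.
stopwords = {"a", "the", "of"}
tokens = tokenize("The cat, the hat\tof John.\nA cat", stopwords)
print(tokens)
```

Note that “The” and “A” survive because only the lowercase forms are in the stop-word set, and “cat,” and “cat” remain distinct words because the comma is part of the token.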

Marking

The assignment includes a public dataset 'data/' and the expected output 'answer.txt'. If your code is correct, your output should be the same as 'answer.txt'. A different ordering is also considered correct in marking. All code will be automatically compiled and marked on a private test dataset by scripts similar to 'compile_run', so make sure your code can be compiled by the script in your package.
