Bring your own LIWC & matplotlib dependency fix #322
Merged

Commits (11):
34e7e84  added BYOL feature, updated some doc strings (sundy1994)
27c571e  add matplotlib requirements (sundy1994)
1367090  bug fix (sundy1994)
006ad11  bug fixes (sundy1994)
26f32d9  Merge branch 'dev' into yuxuan/BYOL (xehu)
65140f1  update documentation (xehu)
4909be0  update regex for emojis, add liwc loading checks, add columns for cus… (sundy1994)
8e138d8  use original message col in lexical_features() to improve readability (sundy1994)
7d9590e  rebased with dev (xehu)
c06f75c  remove comments after checking functionality (xehu)
c175b59  add check in case .dic file doesn't exist (xehu)
@@ -24,7 +24,8 @@
 os.environ["TOKENIZERS_PARALLELISM"] = "false"

 # Check if embeddings exist
-def check_embeddings(chat_data, vect_path, bert_path, need_sentence, need_sentiment, regenerate_vectors, message_col = "message"):
+def check_embeddings(chat_data: pd.DataFrame, vect_path: str, bert_path: str, need_sentence: bool,
+                     need_sentiment: bool, regenerate_vectors: bool, message_col: str = "message"):
     """
     Check if embeddings and required lexicons exist, and generate them if they don't.

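For orientation, here is a minimal sketch of a call against the newly annotated signature. The DataFrame and paths are hypothetical, and the importing module isn't named in this extract; only the parameter names and types come from the diff above.

import pandas as pd

# Hypothetical input: a chat table using the default message column name
chat_df = pd.DataFrame({"message": ["hello :)", "see you tomorrow"]})

check_embeddings(
    chat_data=chat_df,
    vect_path="./vector_data/",      # hypothetical path for cached vectors
    bert_path="./bert_sentiment/",   # hypothetical path for sentiment outputs
    need_sentence=True,
    need_sentiment=False,
    regenerate_vectors=False,
    message_col="message",
)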
@@ -90,15 +91,19 @@ def read_in_lexicons(directory, lexicons_dict):
             continue
         lines = []
         for lexicon in lexicons:
-            # get rid of parentheses
             lexicon = lexicon.strip()
-            lexicon = lexicon.replace('(', '')
-            lexicon = lexicon.replace(')', '')
+            # get rid of parentheses; comment out to keep the emojis like :)
+            # TODO: compare the difference if we keep ()
+            # lexicon = lexicon.replace('(', '')
+            # lexicon = lexicon.replace(')', '')
             if '*' not in lexicon:
                 lines.append(r"\b" + lexicon.replace("\n", "") + r"\b")
             else:
                 # get rid of any cases of multiple repeat -- e.g., '**'
-                lexicon = lexicon.replace('\**', '\*')
+                # lexicon = lexicon.replace('\**', '\*'); this will throw Invalid syntax error
+                pattern = re.compile(r'\*+')
+                lexicon = pattern.sub('*', lexicon)
                 lexicon = r"\b" + lexicon.replace("\n", "").replace("*", "") + r"\S*\b"

                 # build the final lexicon
+                lines.append(r"\b" + lexicon.replace("\n", "").replace("*", "") + r"\S*\b")

Review comment (on the commented-out lines): Check the commented out parentheses
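As a standalone sanity check (not code from this PR): the re.sub approach collapses any run of asterisks in one pass, whereas the old str.replace call used the escape sequence '\**', which newer Python versions flag as invalid (per the comment in the diff). The snippet also shows why the reviewer's note about the kept parentheses matters: an unescaped '(' embedded in a pattern makes re.compile fail.

import re

pattern = re.compile(r'\*+')
print(pattern.sub('*', 'win**'))       # win*   -- '**' collapses to '*'
print(pattern.sub('*', 'win****s'))    # win*s  -- any run collapses, not just pairs

# An emoticon kept verbatim breaks regex compilation downstream:
try:
    re.compile(r"\b" + ":)" + r"\b")
except re.error as err:
    print("re.error:", err)            # unbalanced parenthesis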
@@ -134,6 +139,95 @@ def generate_lexicon_pkl():
     except:
         print("WARNING: Lexicons not found. Skipping pickle generation...")

+def fix_abbreviations(dicTerm: str) -> str:
+    """
+    Helper function to fix abbreviations with punctuation.
+    src: https://github.com/ryanboyd/ContentCoder-Py/blob/main/ContentCodingDictionary.py#L714
+
+    This function goes over a list of hardcoded exceptions for the tokenizer / sentence parser
+    built into LIWC so that it doesn't convert them into separate strings
+    (e.g., we want "i.e." to not be seen as two words and two sentences [i, e]).
+
+    :param dicTerm: The lexicon term
+    :type dicTerm: str
+
+    :return: dicTerm
+    :rtype: str
+    """
+
+    AbbreviationList = ['ie.', 'i.e.', 'eg.', 'e.g.', 'vs.', 'ph.d.', 'phd.', 'm.d.', 'd.d.s.', 'b.a.',
+                        'b.s.', 'm.s.', 'u.s.a.', 'u.s.', 'u.t.', 'attn.', 'prof.', 'mr.', 'dr.', 'mrs.',
+                        'ms.', 'a.i.', 'a.g.i.', 'tl;dr', 't.t', 't_t']
+    AbbreviationDict = {}
+    for item in AbbreviationList:
+        itemClean = item.replace('.', '-').replace(';', '-').replace('_', '-')
+
+        if len(itemClean) > 2 and itemClean.endswith('-'):
+            numTrailers = len(itemClean)
+            itemClean = itemClean.strip('-')
+            numTrailers = numTrailers - len(itemClean)
+            itemClean = itemClean[:-1] + ''.join(['-'] * numTrailers) + itemClean[-1:]
+
+        AbbreviationDict[item] = itemClean
+        AbbreviationDict[item + ','] = itemClean
+
+    if dicTerm in AbbreviationDict.keys():
+        return AbbreviationDict[dicTerm]
+    else:
+        return dicTerm
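Tracing the helper above by hand (an editorial check; the expected values below are derived from the code as written, not from the PR's tests):

fix_abbreviations('tl;dr')   # -> 'tl-dr'  (separators become dashes)
fix_abbreviations('i.e.')    # -> 'i--e'   (the trailing dash is folded in before the final character)
fix_abbreviations('i.e.,')   # -> 'i--e'   (the item + ',' variant maps to the same cleaned form)
fix_abbreviations('hello')   # -> 'hello'  (terms outside the hardcoded list pass through unchanged)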
+
+def load_liwc_dict(dicText: str) -> dict:
+    """
+    Loads up a dictionary that is in the LIWC 2007/2015 format.
+    src: https://github.com/ryanboyd/ContentCoder-Py/blob/main/ContentCodingDictionary.py#L81
+
+    This function reads the content of a LIWC dictionary file in the official format
+    and converts it to a dictionary mapping each category name to a regular expression.
+    We assume the dicText has two parts: the header, which maps numbers to "category names,"
+    and the body, which maps words in the lexicon to different category numbers, separated by a '%' sign.
+
+    :param dicText: The content of a .dic file
+    :type dicText: str
+
+    :return: dicCategories
+    :rtype: dict
+    """
+    dicSplit = dicText.split('%', 2)
+    dicHeader, dicBody = dicSplit[1], dicSplit[2]
+    # read headers
+    catNameNumberMap = {}
+    for line in dicHeader.splitlines():
+        if line.strip() == '':
+            continue
+        lineSplit = line.strip().split('\t')
+        catNameNumberMap[lineSplit[0]] = lineSplit[1]
+    # read body
+    dicCategories = {}
+    for line in dicBody.splitlines():
+        lineSplit = line.strip().split('\t')
+        dicTerm, catNums = lineSplit[0], lineSplit[1:]
+        dicTerm = fix_abbreviations(dicTerm=' '.join(lineSplit[0].lower().strip().split()))
+        dicTerm = dicTerm.strip()
+        if dicTerm == '':
+            continue
+
+        if '*' in dicTerm:
+            # Replace consecutive asterisks with a single asterisk -- e.g., '**'->'*'
+            pattern = re.compile(r'\*+')
+            dicTerm = pattern.sub('*', dicTerm)
+            dicTerm = r"\b" + dicTerm.replace("\n", "").replace("*", "") + r"\S*\b"
+        else:
+            dicTerm = r"\b" + dicTerm.replace("\n", "").replace('(', r'\(').replace(')', r'\)') + r"\b"
+
+        for catNum in catNums:
+            cat = catNameNumberMap[catNum]
+            if cat not in dicCategories:
+                dicCategories[cat] = dicTerm
+            else:
+                cur_dicTerm = dicCategories[cat]
+                dicCategories[cat] = cur_dicTerm + "|" + dicTerm
+    return dicCategories
+
 def generate_certainty_pkl():
     """
     Helper function for generating the pickle file containing the certainty lexicon.
Review comment: Leaving a note to double check the commented-out code here

Reply: Ah, this is fine; it just comments out the previous code, which directly concats the dataframe, and replaces it with the new code, which calls it a second time if the custom dictionary is present.