Amy/package v2 #326

Merged Dec 3, 2024 · 65 commits
Changes shown below are from 51 of the 65 commits.

Commits
a47e4ea
package aggregation updates
amytangzheng Sep 11, 2024
e3ad8d1
Merge branch 'dev' of https://github.com/Watts-Lab/team_comm_tools in…
amytangzheng Sep 18, 2024
b2ed12a
updates to gini dependency
amytangzheng Sep 18, 2024
c6c64bd
Bump path-to-regexp and express in /website (#298)
dependabot[bot] Sep 23, 2024
e060a44
Bump nltk from 3.8.1 to 3.9 (#297)
dependabot[bot] Sep 23, 2024
6a05e80
Bump body-parser and express in /website (#296)
dependabot[bot] Sep 23, 2024
1c72695
Check embedding update (#295)
xehu Sep 23, 2024
607548a
Merge branch 'main' into dev
xehu Sep 23, 2024
650197e
Update README.md to remove col = "message"
xehu Sep 23, 2024
d35aeb1
updated user aggregation methods (max)
amytangzheng Sep 27, 2024
36cd76e
Closes #302.
xehu Sep 27, 2024
21987f3
Amy/website (#301)
amytangzheng Oct 7, 2024
119efe4
Update github-actions-website.yaml (#309)
xehu Oct 7, 2024
6d25efd
Update github-actions-feature_dict.yaml (#308)
xehu Oct 7, 2024
7e87679
Package updates in Amy/website (#310)
xehu Oct 7, 2024
5678567
Update package-lock.json to local version
xehu Oct 7, 2024
d75837f
Update package-lock.json
xehu Oct 7, 2024
143cb77
Update package.json
xehu Oct 7, 2024
28f85f7
Update package-lock.json
xehu Oct 7, 2024
8b8bd24
Fix "@babel/plugin-proposal-private-property-in-object" error (#311)
xehu Oct 7, 2024
89cd16b
upgrade node packages
xehu Oct 7, 2024
d04037d
update team page + try to remove some of the deprecated packages
xehu Oct 7, 2024
bdf7035
Revert "update team page + try to remove some of the deprecated packa…
xehu Oct 7, 2024
ec2ed64
revert attempts to upgrade packages
xehu Oct 7, 2024
d83f854
Denormalize liwc (#312)
xehu Oct 7, 2024
7905240
address https://github.com/Watts-Lab/team_comm_tools/issues/300 (#313)
xehu Oct 7, 2024
bf762d0
Address issues with making feature names more clear; have cleaner def…
xehu Oct 8, 2024
1dad080
small fix to ensure filtered_dict does not generate in every run
xehu Oct 8, 2024
ed17d7a
merge in main + bump dev's version up for next time
xehu Oct 8, 2024
6b94149
PATCH FIX: Defaults in 0.1.4 were incorrectly specified
xehu Oct 8, 2024
576a376
updates to package aggregation
amytangzheng Oct 16, 2024
fd50f83
Merge pull request #320 from Watts-Lab/temp-dev
xehu Oct 16, 2024
c4200c5
updates to package aggregation
amytangzheng Oct 16, 2024
10f325d
checking valid methods and columns
amytangzheng Oct 23, 2024
653e386
updates to checking numeric columns
amytangzheng Oct 23, 2024
7c9545d
package aggregation updates
amytangzheng Sep 11, 2024
7c73f8d
updates to gini dependency
amytangzheng Sep 18, 2024
b0bbb7a
updated user aggregation methods (max)
amytangzheng Sep 27, 2024
37080e8
updates to package aggregation
amytangzheng Oct 16, 2024
7d75712
updates to package aggregation
amytangzheng Oct 16, 2024
1c861a3
checking valid methods and columns
amytangzheng Oct 23, 2024
1da2ecd
updates to checking numeric columns
amytangzheng Oct 23, 2024
b10bdee
package aggregation updates
amytangzheng Nov 6, 2024
d007ae8
Merge branch 'amy/package_v2' of https://github.com/Watts-Lab/team_co…
amytangzheng Nov 6, 2024
3fca434
updates to package aggregation
amytangzheng Nov 6, 2024
a36107d
updates to requirements.txt
amytangzheng Nov 8, 2024
e050fb6
updates to featurize.py
amytangzheng Nov 8, 2024
3f31f07
updates to checking columns are in data
amytangzheng Nov 8, 2024
4f562cc
Merge branch 'dev' of https://github.com/Watts-Lab/team_comm_tools in…
amytangzheng Nov 10, 2024
2892a3c
remove local file in featurize
xehu Dec 2, 2024
b027d27
remove commented out func in featurize
xehu Dec 2, 2024
84c126e
correct issue with empty conversation aagg features
xehu Dec 2, 2024
4825972
remove excess preprocessing call
xehu Dec 2, 2024
3ba2082
add error checking and removing custom vector functionality from this PR
xehu Dec 2, 2024
1e0d3f2
remove redundant call to preprocess chat data again
xehu Dec 2, 2024
6a7cccf
restore check embeddings call
xehu Dec 3, 2024
51d833f
code refactor for conversation and user aggregation / error checking
xehu Dec 3, 2024
23b957b
correct issue in which we were looking for columns to summarize that …
xehu Dec 3, 2024
c4d5608
rebase with dev
xehu Dec 3, 2024
7e8d985
add back sum functionality (with warning) and clean up user aggs
xehu Dec 3, 2024
c34ee7f
fix issue with user centroids
xehu Dec 3, 2024
a5362bb
update featurize.py
xehu Dec 3, 2024
727b91e
add test for package custom aggregation
xehu Dec 3, 2024
0643dd3
update documentations
xehu Dec 3, 2024
7141bae
update docs
xehu Dec 3, 2024
25 changes: 22 additions & 3 deletions examples/featurize.py
@@ -18,8 +18,8 @@
juries_df = pd.read_csv("./example_data/full_empirical_datasets/jury_conversations_with_outcome_var.csv", encoding='utf-8')
csop_df = pd.read_csv("./example_data/full_empirical_datasets/csop_conversations_withblanks.csv", encoding='utf-8')
csopII_df = pd.read_csv("./example_data/full_empirical_datasets/csopII_conversations_withblanks.csv", encoding='utf-8')
"""
"""
TINY / TEST DATASETS -------------------------------

These are smaller versions of (real) empirical datasets for the purpose of testing and demonstration.
@@ -51,6 +51,25 @@
)
tiny_juries_feature_builder.featurize()

# Tiny Juries with custom Aggregations
print("Tiny Juries with Custom Aggregation...")
tiny_juries_feature_builder_custom_agg = FeatureBuilder(
input_df = tiny_juries_df,
grouping_keys = ["batch_num", "round_num"],
output_file_base = "jury_TINY_output_custom_agg", # Naming output files using the output_file_base parameter (recommended)
turns = False,
custom_features = [
"(BERT) Mimicry",
"Moving Mimicry",
"Forward Flow",
"Discursive Diversity"],
convo_methods = ['max', 'median'], # This will aggregate ONLY "positive_bert" at the conversation level, using max and median; at the speaker/user level, it will aggregate the three columns below, using max, mean, min, and median.
convo_columns = ['positive_bert'],
user_methods = ['max', 'mean', 'min', 'median'],
user_columns = ['positive_bert', 'negative_bert', 'named_entity_recognition'],
)
tiny_juries_feature_builder_custom_agg.featurize()

# Tiny multi-task
tiny_multi_task_feature_builder = FeatureBuilder(
input_df = tiny_multi_task_df,
@@ -104,4 +123,4 @@
# output_file_path_conv_level = "./csopII_output_conversation_level.csv",
# turns = True
# )
# csopII_feature_builder.featurize()
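
To make the new aggregation parameters concrete, here is a minimal, self-contained sketch of what they are expected to mean. This is illustrative only, not the package's internal implementation, and the "<method>_<column>" naming of aggregated output columns is an assumption, not something confirmed by this diff.

import pandas as pd

# Toy chat-level data: two conversations, one numeric feature column.
chat_df = pd.DataFrame({
    "conversation_num": [1, 1, 2, 2],
    "positive_bert": [0.9, 0.4, 0.7, 0.2],
})

convo_methods = ["max", "median"]   # as in the example above
convo_columns = ["positive_bert"]

# Aggregate the requested columns with the requested methods, per conversation.
agg = chat_df.groupby("conversation_num")[convo_columns].agg(convo_methods)
agg.columns = [f"{method}_{col}" for col, method in agg.columns]  # assumed naming
print(agg)  # columns: max_positive_bert, median_positive_bert
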
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -38,7 +38,8 @@ dependencies = [
"transformers==4.44.0",
"tqdm>=4.66.5",
"tzdata>=2023.3",
"tzlocal==5.2"
"tzlocal==5.2",
"fuzzywuzzy==0.18.0"
]
authors = [
{name = "Xinlan Emily Hu", email = "[email protected]"},
1 change: 1 addition & 0 deletions requirements.txt
@@ -28,3 +28,4 @@ transformers==4.44.0
tqdm>=4.66.5
tzdata>=2023.3
tzlocal==5.2
fuzzywuzzy==0.18.0
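
The new fuzzywuzzy pin lands alongside the "checking valid methods and columns" commits. One plausible use, offered here as an assumption rather than a description of the package's actual code, is suggesting a close match when a requested aggregation column is misspelled:

from fuzzywuzzy import process  # fuzzywuzzy==0.18.0, as pinned above

available_columns = ["positive_bert", "negative_bert", "named_entity_recognition"]
requested = "positve_bert"  # hypothetical user typo

# extractOne returns the best-scoring candidate and its similarity score (0-100).
best_match, score = process.extractOne(requested, available_columns)
if score >= 80:  # threshold is illustrative
    print(f"Column '{requested}' not found; did you mean '{best_match}'?")
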
94 changes: 80 additions & 14 deletions src/team_comm_tools/feature_builder.py
@@ -66,7 +66,8 @@ class FeatureBuilder:
:param timestamp_col: A string representing the column name that should be selected as the timestamp. Defaults to "timestamp".
:type timestamp_col: str, optional

:param grouping_keys: A list of multiple identifiers that collectively identify a conversation. If non-empty, we will group by all of the keys in the list and use the
grouped key as the unique "conversational identifier."
Defaults to an empty list.
:type grouping_keys: list, optional

@@ -86,11 +87,31 @@
:param ner_cutoff: This is the cutoff value for the confidence of prediction for each named entity. Defaults to 0.9.
:type ner_cutoff: int

:param regenerate_vectors: If true, will regenerate vector data even if it already exists. Defaults to False.
:type regenerate_vectors: bool, optional

:param compute_vectors_from_preprocessed: If true, computes vectors using preprocessed text (that is, with capitalization and punctuation removed). This was the default behavior for v.0.1.3 and earlier, but we now default to computing metrics on the unpreprocessed text (which INCLUDES capitalization and punctuation). Defaults to False.
:type compute_vectors_from_preprocessed: bool, optional
:param custom_vect_path: If provided, features will be generated using custom vectors rather than default SBERT. Defaults to None.
[Review comment - Collaborator] Note: need to update documentation
[Review comment - Collaborator] RESOLVED

:type custom_vect_path: str, optional

:param convo_aggregation: If true, will aggregate features at the conversational level. Defaults to True.
:type convo_aggregation: bool, optional

:param convo_methods: Specifies which functions you want to aggregate with (e.g., mean, stdev...) at the conversational level. Defaults to ['mean', 'max', 'min', 'stdev'].
:type convo_methods: list, optional

:param convo_columns: Specifies which columns (at the utterance/chat level) you want aggregated at the conversational level. Defaults to all numeric columns.
:type convo_columns: list, optional

:param user_aggregation: If true, will aggregate features at the speaker/user level. Defaults to True.
:type user_aggregation: bool, optional

:param user_methods: Specifies which functions you want to aggregate with (e.g., mean, stdev...) at the speaker/user level. Defaults to ['mean', 'max', 'min', 'stdev'].
:type user_methods: list, optional

:param user_columns: Specifies which columns (at the utterance/chat level) you want aggregated at the speaker/user level. Defaults to all numeric columns.
:type user_columns: list, optional

:return: The FeatureBuilder doesn't return anything; instead, it writes the generated features to files in the specified paths. It will also print out its progress, so you should see "All Done!" in the terminal, which will indicate that the features have been generated.
:rtype: None
@@ -117,7 +138,14 @@ def __init__(
ner_training_df: pd.DataFrame = None,
ner_cutoff: int = 0.9,
regenerate_vectors: bool = False,
compute_vectors_from_preprocessed: bool = False,
custom_vect_path: str = None,
convo_aggregation = True,
convo_methods: list = ['mean', 'max', 'min', 'stdev'],
convo_columns: list = None,
user_aggregation = True,
user_methods: list = ['mean', 'max', 'min', 'stdev'],
user_columns: list = None
) -> None:

# Defining input and output paths.
@@ -224,6 +252,12 @@ def __init__(
self.within_task = within_task
self.ner_cutoff = ner_cutoff
self.regenerate_vectors = regenerate_vectors
self.convo_aggregation = convo_aggregation
self.convo_methods = convo_methods
self.convo_columns = convo_columns
self.user_aggregation = user_aggregation
self.user_methods = user_methods
self.user_columns = user_columns

if(compute_vectors_from_preprocessed == True):
self.vector_colname = self.message_col # because the message col will eventually get preprocessed
@@ -358,7 +392,24 @@ def __init__(
if not re.match(r"(.*\/|^)output\/", self.output_file_path_user_level):
self.output_file_path_user_level = re.sub(r'/user/', r'/output/user/', self.output_file_path_user_level)

if custom_vect_path is not None:
[Review comment - Collaborator] Note: it seems like this PR builds in some of the initial infrastructure for custom vectors (document this)
[Review comment - Collaborator] RESOLVED -- custom vector infrastructure has been removed

print("Detected that user has requested custom vectors...")
print("We will generate features using custom vectors rather than default SBERT")
self.vect_path = custom_vect_path
else:
self.vect_path = vector_directory + "sentence/" + ("turns" if self.turns else "chats") + "/" + base_file_name

self.original_vect_path = vector_directory + "sentence/" + ("turns" if self.turns else "chats") + "/" + base_file_name

if custom_vect_path is not None:
print("Detected that user has requested custom vectors...")
print("We will generate features using custom vectors rather than default SBERT")
self.vect_path = custom_vect_path
else:
self.vect_path = vector_directory + "sentence/" + ("turns" if self.turns else "chats") + "/" + base_file_name

self.original_vect_path = vector_directory + "sentence/" + ("turns" if self.turns else "chats") + "/" + base_file_name

self.bert_path = vector_directory + "sentiment/" + ("turns" if self.turns else "chats") + "/" + base_file_name

# Check + generate embeddings
@@ -375,7 +426,11 @@
if(not need_sentiment and feature_dict[feature]["bert_sentiment_data"]):
need_sentiment = True

# preprocess chat data again
self.preprocess_chat_data()
# preprocess chat data again
self.preprocess_chat_data()
check_embeddings(self.chat_data, self.vect_path, self.bert_path, need_sentence, need_sentiment, self.regenerate_vectors, self.message_col)

if(need_sentence):
self.vect_data = pd.read_csv(self.vect_path, encoding='mac_roman')
@@ -487,7 +542,12 @@ def featurize(self) -> None:
Path(self.output_file_path_user_level).parent.mkdir(parents=True, exist_ok=True)
Path(self.output_file_path_chat_level).parent.mkdir(parents=True, exist_ok=True)
Path(self.output_file_path_conv_level).parent.mkdir(parents=True, exist_ok=True)


# Store column names of what we generated, so that the user can easily access them
self.chat_features = list(itertools.chain(*[feature_dict[feature]["columns"] for feature in self.feature_names if feature_dict[feature]["level"] == "Chat"]))
self.conv_features_base = list(itertools.chain(*[feature_dict[feature]["columns"] for feature in self.feature_names if feature_dict[feature]["level"] == "Conversation"]))
self.conv_features_all = [col for col in self.conv_data if col not in self.orig_data and col != 'conversation_num']

[Review comment - Collaborator] Note --- check this; we likely want the last line self.conv_features_all to appear AFTER we actually generate the features, so moving this line of code up may not work.
[Review comment - Collaborator] UPDATE: fixed

# Step 3a. Create user level features.
print("Generating User Level Features ...")
self.user_level_features()
Expand All @@ -497,14 +557,9 @@ def featurize(self) -> None:
self.conv_level_features()
self.merge_conv_data_with_original()

# Step 4. Write the features into the files defined in the output paths.
print("All Done!")


self.save_features()

def preprocess_chat_data(self) -> None:
Expand Down Expand Up @@ -607,7 +662,11 @@ def user_level_features(self) -> None:
vect_data= self.vect_data,
conversation_id_col = self.conversation_id_col,
speaker_id_col = self.speaker_id_col,
input_columns = self.input_columns,
user_aggregation = self.user_aggregation,
user_methods = self.user_methods,
user_columns = self.user_columns,
chat_features = self.chat_features
)
self.user_data = user_feature_builder.calculate_user_level_features()
# Remove special characters in column names
@@ -633,7 +692,14 @@ def conv_level_features(self) -> None:
speaker_id_col = self.speaker_id_col,
message_col = self.message_col,
timestamp_col = self.timestamp_col,
input_columns = self.input_columns,
convo_aggregation = self.convo_aggregation,
convo_methods = self.convo_methods,
convo_columns = self.convo_columns,
user_aggregation = self.user_aggregation,
user_methods = self.user_methods,
user_columns = self.user_columns,
chat_features = self.chat_features,
)
# Calling the driver inside this class to create the features.
self.conv_data = conv_feature_builder.calculate_conversation_level_features(self.feature_methods_conv)
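
Putting the new surface area together, here is a minimal usage sketch. The attribute names (chat_features, conv_features_all) and the aggregation parameters come from this diff; the import path, input file, and output_file_base value are hypothetical, and the exact contents of the generated columns are not shown in this PR.

from team_comm_tools import FeatureBuilder  # import path assumed from the package name
import pandas as pd

my_chat_df = pd.read_csv("my_chats.csv")  # hypothetical input data

builder = FeatureBuilder(
    input_df = my_chat_df,
    output_file_base = "my_output",      # hypothetical output name
    convo_methods = ['max', 'median'],   # aggregate conversations with max/median...
    convo_columns = ['positive_bert'],   # ...over this chat-level column only
    user_methods = ['mean'],             # aggregate speakers with the mean
    user_columns = ['positive_bert'],
)
builder.featurize()

# After featurize(), the builder stores the names of what it generated:
print(builder.chat_features)      # chat-level feature columns
print(builder.conv_features_all)  # all conversation-level columns, incl. aggregates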