Amy/package aggregation #272
Conversation
…sive range to not require a vector path
Starting sphinx setup
Testing folder featurizer file
Delete output.csv
Readthedocs setup
- valence testing
- rearranging files
- intermediate ner testing
- NER testing
- fix featurizer
- fix featurize bug
- updating test dataset + function
- code coverage
- burstiness
- move testing FB's into run_tests.py
- move NER dataframe to test file
- adding complex tests back to run_tests.py
- add chat_complex_df and conv_complex_df to run_tests.py
- correct dataset paths
- rebase
- changing references as part of rebase
- correcting FB calls based on latest interface updates
- correct run_tests.py
- add dd tests
- burstiness fix
- dd tests add
- forward flow tests
- src changes
- testing timestamp variations
- src changes
- update test ds
- fix formatting
- fix formatting

Co-authored-by: Xinlan Emily Hu <[email protected]>
Hey @amytangzheng! I set up a few tests in the `examples/featurize.py` file, with the following specifications:
- A version that aggregates nothing: `tiny_juries_feature_builder_no_aggregation`;
- A version that aggregates only 1 feature with 1 function at each level: `tiny_juries_feature_builder_custom_aggregation`.
Unfortunately, I am running into some issues. It seems that the source of the problem has to do with summing features. "Sum" is not currently one of the summarization functions:
```python
convo_methods: list = ['mean', 'max', 'min', 'std'],
convo_columns: list = None,
user_aggregation = True,
user_methods: list = ['mean', 'max', 'min', 'std'],
user_columns: list = None
) -> None:
```
However, it's a bit of a special case, because we actually should only aggregate 3 attributes via sum, as defined by `self.summable_columns`:

```python
self.summable_columns = ["num_words", "num_chars", "num_messages"]
```
That is because we had determined, at some point, that things like 'sum of Positivity' in a conversation didn't really make sense.
Additionally, the summable features are the ones we use to calculate the Gini coefficient. In other words, we'll get the Gini coefficient with respect to the number of words, number of characters, and number of messages each person sent:
```python
"""
Calculate the Gini index for relevant features in the conversation.

This function computes the Gini index for features involving counts, such as:
- Word count
- Character count
- Message count
- Function word accommodation

The Gini index is then merged into the conversation-level data.

:return: None
:rtype: None
"""
for column in self.summable_columns:
    self.conv_data = pd.merge(
        left=self.conv_data,
        right=get_gini(self.user_data.copy(), "sum_" + column, self.conversation_id_col),  # this applies to the summed columns in user_data, which matches the above
        on=[self.conversation_id_col],
        how="inner"
    )
```
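For reference, the `gini_coefficient` helper that `get_gini` applies per conversation typically follows the standard discrete formula. Here is a minimal sketch of such a function (an assumption about its shape, not the package's exact implementation):

```python
import numpy as np

def gini_coefficient(values) -> float:
    """Gini coefficient of a 1-D array of non-negative counts (sketch).

    0.0 means perfect equality (everyone contributed the same amount);
    values near 1.0 mean one person contributed nearly everything.
    """
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    if n == 0 or x.sum() == 0:
        return 0.0
    # Standard discrete formula: sum_i (2i - n - 1) * x_i / (n * sum(x)), i = 1..n
    index = np.arange(1, n + 1)
    return float(np.sum((2 * index - n - 1) * x) / (n * x.sum()))
```

This is why the sum columns matter: Gini over "mean of Positivity" would not be meaningful, but Gini over each person's total word count is a natural equality measure.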
However, possibly due to this dependency, we get an error when I run `tiny_juries_feature_builder_custom_aggregation`:
```
Traceback (most recent call last):
  File "/Users/xehu/Desktop/Team Process Mapping/team_comm_tools/examples/featurize.py", line 101, in <module>
    tiny_juries_feature_builder_custom_aggregation.featurize(col="message")
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/team_comm_tools/feature_builder.py", line 489, in featurize
    self.conv_level_features()
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/team_comm_tools/feature_builder.py", line 632, in conv_level_features
    self.conv_data = conv_feature_builder.calculate_conversation_level_features(self.feature_methods_conv)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/team_comm_tools/utils/calculate_conversation_level_features.py", line 100, in calculate_conversation_level_features
    method(self)
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/team_comm_tools/utils/calculate_conversation_level_features.py", line 139, in get_gini_features
    right=get_gini(self.user_data.copy(), "sum_"+column, self.conversation_id_col), # this applies to the summed columns in user_data, which matches the above
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/team_comm_tools/utils/gini_coefficient.py", line 37, in get_gini
    gini_calculated = input_data.groupby([conversation_id_col]).apply(lambda df : gini_coefficient(np.asarray(df[on_column]))).reset_index().rename(columns={0: "gini_coefficient_" + on_column})
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/pandas/core/groupby/groupby.py", line 1824, in apply
    result = self._python_apply_general(f, self._selected_obj)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/pandas/core/groupby/groupby.py", line 1885, in _python_apply_general
    values, mutated = self._grouper.apply_groupwise(f, data, self.axis)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/pandas/core/groupby/ops.py", line 919, in apply_groupwise
    res = f(group)
          ^^^^^^^^
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/team_comm_tools/utils/gini_coefficient.py", line 37, in <lambda>
    gini_calculated = input_data.groupby([conversation_id_col]).apply(lambda df : gini_coefficient(np.asarray(df[on_column]))).reset_index().rename(columns={0: "gini_coefficient_" + on_column})
                                                                                                              ~~^^^^^^^^^^^
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/pandas/core/frame.py", line 4102, in __getitem__
    indexer = self.columns.get_loc(key)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
    raise KeyError(key) from err
KeyError: 'sum_num_words'
```
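The root cause matches the last frame: the user-level frame simply never received a `sum_num_words` column. The failure mode can be reproduced in isolation with a toy DataFrame (column names here are illustrative, not the package's actual schema):

```python
import pandas as pd

# Toy user-level data where only the mean was aggregated, so the
# "sum_num_words" column that get_gini expects was never created.
user_data = pd.DataFrame({
    "conversation_num": ["c1", "c1", "c2"],
    "mean_num_words": [5.0, 7.0, 3.0],
})

try:
    # Mirrors the groupby-apply inside get_gini, looking up a missing column.
    user_data.groupby("conversation_num").apply(lambda df: df["sum_num_words"].sum())
    error_message = None
except KeyError as err:
    error_message = str(err)  # the missing-column name, as in the traceback
```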
I think something is going wrong such that we're not able to find the `sum_num_words` column it's looking for. I tried to fix this in the user-level feature calculator by forcing it to sum the summable columns and removing the dependency on `user_aggregation = True`, but things are still breaking after this change:
```python
def get_user_level_summed_features(self) -> None:
    """
    Aggregate summary statistics from chat-level features that need to be summed together.

    Features for which summing makes sense include:
    - Word count (total number of words)
    - Character count
    - Message count
    - Function word accommodation

    This function calculates and merges the summed features into the user-level data.

    :return: None
    :rtype: None
    """
    # For each summable feature
    for column in self.summable_columns:  # TODO --- Gini depends on the summation happening; something is happening here where it's causing Gini to break.
        # Sum of feature across the conversation
        self.user_data = pd.merge(
            left=self.user_data,
            right=get_user_sum_dataframe(self.chat_data, column, self.conversation_id_col, self.speaker_id_col),
            on=[self.conversation_id_col, self.speaker_id_col],
            how="inner"
        )
```
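One way to remove the hidden dependency would be to always materialize the `sum_*` columns for the summable features, regardless of the aggregation settings, since Gini consumes them downstream. A standalone sketch of that idea (with a toy stand-in for `get_user_sum_dataframe`, not the package's actual helper):

```python
import pandas as pd

SUMMABLE_COLUMNS = ["num_words", "num_chars", "num_messages"]

def get_user_sum_dataframe(chat_data, column, conv_col, speaker_col):
    # Toy stand-in for the package helper: per-speaker sums of one chat column.
    out = chat_data.groupby([conv_col, speaker_col])[column].sum().reset_index()
    return out.rename(columns={column: "sum_" + column})

def add_summed_features(user_data, chat_data, conv_col, speaker_col):
    # Always create sum_* columns for the summable features, even when the
    # user disabled aggregation, because Gini depends on them downstream.
    for column in SUMMABLE_COLUMNS:
        if column not in chat_data.columns:
            continue
        user_data = pd.merge(
            left=user_data,
            right=get_user_sum_dataframe(chat_data, column, conv_col, speaker_col),
            on=[conv_col, speaker_col],
            how="inner",
        )
    return user_data
```

Under this design, `user_aggregation` would only control the user-requested statistics (mean, max, etc.), while the internal sums are unconditional.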
My current call is that we should take a minute to track down the source of the bug, revisit the design decisions for summing, and write some tests (perhaps similar to `run_package_grouping_tests.py`, where we have the other package-related tests) to confirm that the custom aggregation is 100% working.
Summary of Next Steps
- We need to figure out exactly why the Gini feature is breaking on the test.
- How should we treat the summing function? I am noticing that, in the original code, we only sum `self.summable_columns` in the conversation-level features, but we sum all columns in the user-level features (for example, I am getting things like `sum_positive_bert` in the user-level output).
As an aside, though, the tests currently look for `sum_positive_bert`; I tried changing the user-level feature to only sum the "summable columns," and it caused a test failure:
```
  File "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/base.py", line 244, in __getitem__
    raise KeyError(f"Column not found: {key}")
KeyError: 'Column not found: sum_positive_bert'

Generating BERT sentiments....
Generating Lexicon pickle...
Chat Level Features ...
Generating features for the first 100.0% of messages...
Generating User Level Features ...
Generating Conversation Level Features ...
Error: Process completed with exit code 1.
```
- If the user didn't request any aggregations, how do we still provide them with the Gini coefficient features (which are technically not aggregations), without displaying any other aggregated features?
- I am also noticing that, currently, the user level only has mean and sum as aggregation functions. Should we add the other functions back in? Right now, we accept 'mean', 'max', 'min', and 'std', but there is no max or min function at the user level.
- Another thing is that, technically, at the conversation level, we aggregate from two sources: user and chat, but we only allow people to control which chat-level columns they want to aggregate. Should we change this?
- Let's write some robust test cases for this feature under `tests/`, just to make sure it works as expected. For example:
- If the user asks for no aggregations, do we get no aggregations?
- If the user asks for no aggregations, do we still get valid non-aggregated features (like burstiness)?
- Is it OK to aggregate only a single feature?
- Is it OK to aggregate DIFFERENT features at the user level and the conversation level?
- What if I request a feature to be aggregated, but it doesn't exist?
For this one, I tried to filter out columns that don't exist and added a warning if we found nonexistent columns:
```python
user_columns_in_data = list(set(user_columns).intersection(set(self.chat_data.columns)))
if len(user_columns_in_data) != len(user_columns):
    warnings.warn(
        "Warning: One or more requested user columns are not present in the data. Ignoring them."
    )
self.columns_to_summarize = user_columns_in_data
```
- What if I pass in an aggregation method, but it doesn't exist?
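For that last bullet, the same filter-and-warn pattern could apply to the methods as well. A sketch (the allowed set here is assumed from the signature above, plus 'sum'; it is not an existing constant in the package):

```python
import warnings

ALLOWED_METHODS = {"mean", "max", "min", "std", "sum"}

def validate_methods(requested: list) -> list:
    """Keep only recognized aggregation methods; warn about the rest."""
    valid = [m for m in requested if m in ALLOWED_METHODS]
    unknown = [m for m in requested if m not in ALLOWED_METHODS]
    if unknown:
        warnings.warn(
            f"Warning: Ignoring unrecognized aggregation methods: {unknown}"
        )
    return valid
```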
Marking this as a draft until we're ready to merge this in! |
(Branch force-pushed from 1ad1bb3 to de045e5.)
Pull Request Template:
If you are merging in a feature or other major change, use this template to check your pull request!

Basic Info
What's this pull request about?

My PR Adds or Improves Documentation
If your feature is about documentation, ensure that you check the boxes relevant to you.
- Docstrings
- Feature Wiki
  - I included a reference label (`.. _name:`) in the header, as this is important for referencing the feature in the Table of Contents!
- General Documentation
  - `make clean` and `make html` do not generate breaking errors.

My PR is About Adding a New Feature to the Code Repository
- Adding Feature to the Feature Dictionary
  - I have updated the `feature_dictionary.py` file with an appropriate entry for my feature; I confirm that all fields are accurately filled out.
- Documentation
  - Did you document your feature? You should follow the same requirements as above.
- Code Basics
  - My feature is named in snake case: `my_feature`, NOT `myFeature` (camel case).
  - My feature is in a file called `NAME_features.py` (where NAME is the name of my feature) under `src/features/`.
- Testing
  - My feature has tests in the `tests/` folder.
  - The location of my tests is here:

If you check all the boxes above, then you are ready to merge!