Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Amy/package aggregation #272

Closed
wants to merge 896 commits into from
Closed

Amy/package aggregation #272

wants to merge 896 commits into from

Conversation

amytangzheng
Copy link
Collaborator

Pull Request Template:
If you are merging in a feature or other major change, use this template to check your pull request!

Basic Info

What's this pull request about?

Allow users to customize user and conversational aggregation

My PR Adds or Improves Documentation

If your feature is about documentation, ensure that you check the boxes relevant to you.

Docstrings

  • Docstrings: I have followed the proper documentation format (https://www.sphinx-doc.org/en/master/usage/extensions/napoleon.html; Google Format recommended).
  • Docstrings: Every function in the file has a block quote comment with a description of the feature.
  • Docstrings: Every input argument is documented.
  • Docstrings: The output type is documented, along with a description of what the output is for.
  • Docstrings: I have linked the feature under the Table of Contents (docs/source/features/index.rst)

Feature Wiki

  • Conceptual Wiki: I made a copy of the TEMPLATE (docs/source/features_conceptual/TEMPLATE.rst)
  • Conceptual Wiki: I replaced the word TEMPLATE at the top of the file with the name of the feature (.. _TEMPLATE:) Please do NOT delete any of the punctuation (the .._ and :) in the header, as this is important for referencing the feature in the Table of Contents!
  • Conceptual Wiki: I have answered the six sections of the template to the best of my ability.
  • Conceptual Wiki: I have linked the feature under the Table of Contents (docs/source/features_conceptual/index.rst).

General Documentation

  • My documentation is linked in a toctree.
  • I have confirmed that make clean and make html do not generate breaking errors.

My PR is About Adding a New Feature to the Code Repository

Adding Feature to the Feature Dictionary

  • I have edited the feature_dictionary.py file with an appropriate entry for my feature. Below is a sample entry; I confirm that all fields are accurately filled out.
  "Function Word Accommodation": {
    "columns": ["function_word_accommodation"],
    "file": "./features/word_mimicry.py",
    "level": "Chat",
    "semantic_grouping": "Variance",
    "description": "The total number of function words used in a given turn that were also used in the previous turn. Function words are defined as a list of 190 words from the source paper.",
    "references": "(Ranganath et al., 2013)",
    "wiki_link": "https://github.com/Watts-Lab/team-process-map/wiki/C.9-Mimicry:-Function-word,-Content-word,-BERT,-Moving",
    "function": ChatLevelFeaturesCalculator.calculate_word_mimicry,
    "dependencies": [],
    "preprocess": [],
    "vect_data": False,
    "bert_sentiment_data": False
  }
  • If my feature is at the chat level, my dictionary entry is in the top half of the file; if my feature is at the conversation level, my dictionary entry is in the bottom half of the file (below the comment that says, ### Conversation Level).

Documentation

Did you document your feature? You should follow the same requirements as above:

  • Docstrings: I have followed the proper documentation format (https://www.sphinx-doc.org/en/master/usage/extensions/napoleon.html; Google Format recommended).
  • Docstrings: Every function in the file has a block quote comment with a description of the feature.
  • Docstrings: Every input argument is documented.
  • Docstrings: The output type is documented, along with a description of what the output is for.
  • Docstrings: I have linked the feature under the Table of Contents (docs/source/features/index.rst)

Code Basics

  • My feature is a .py file.
  • My feature uses snake case in the name. That means the name of the format is my_feature, NOT myFeature (camel case).
  • My feature has the name, NAME_features.py, where NAME is the name of my feature.
  • My feature is located in src/features/.

Testing

  • I have thought about test cases for my features, with inputs and expected outputs.
  • I have added test cases for my feature under the tests/ folder.
  • My feature passes the automated testing suite.

The location of my tests are here:

[PASTE LINK HERE]

If you check all the boxes above, then you ready to merge!

amytangzheng and others added 30 commits June 10, 2024 14:10
@amytangzheng amytangzheng requested a review from xehu August 7, 2024 16:11
@xehu xehu changed the base branch from main to initial_package_version August 7, 2024 16:16
xehu and others added 3 commits August 7, 2024 17:22
* valence testing

* rearranging files

* intermediate ner testing

* NER testing

* fix featurizer

* fix featurize bug

* updating test dataset + function

* code coverage

* burstiness

* move testing FB's into run_tests.py

* move NER dataframe to test file

* adding complex tests back to run_tests.py

* add chat_complex_df and conv_complex_df to run_tests.py

* correct dataset paths

* rebase

* changing references as part of rebase

* correcting FB calls based on latest interface updates

* correct run_tests.py

* add dd tests

* burstiness fix

* dd tests add

* forward flow tests

* src changes

* testing timestamp variations

* src changes

* update test ds

* fix formatting

* fix formatting

---------

Co-authored-by: Xinlan Emily Hu <[email protected]>
Co-authored-by: Xinlan Emily Hu <[email protected]>
@xehu xehu linked an issue Aug 7, 2024 that may be closed by this pull request
Copy link
Collaborator

@xehu xehu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @amytangzheng! I set up a few tests in the examples/featurize.py file, with the following specifications:

  1. A version that aggregates nothing: tiny_juries_feature_builder_no_aggregation;
  2. A version that aggregates only 1 feature with 1 function at each level: tiny_juries_feature_builder_custom_aggregation.

Unfortunately, I am running into some issues. It seems that the source of the problem has to do with summing features. "Sum" is not currently one of the summarization functions:

            convo_methods: list = ['mean', 'max', 'min', 'std'],
            convo_columns: list = None,
            user_aggregation = True,
            user_methods: list = ['mean', 'max', 'min', 'std'],
            user_columns: list = None
        ) -> None:

However, it's a bit of a special case because we actually should only aggregate 3 attributes via sum, as defined by self.summable_columns:

self.summable_columns = ["num_words", "num_chars", "num_messages"]

That is because we had determined, at some point, that things like 'sum of Positivity' in a conversation didn't really make sense.

Additionally, the summable features are the ones that we use to calculate the Gini coefficient with. In other words, we'll get the Gini coefficient with respect to the number of words, number of characters, and number of messages each person sent:

    """
        Calculate the Gini index for relevant features in the conversation.

        This function computes the Gini index for features involving counts, such as:
        - Word count
        - Character count
        - Message count
        - Function word accommodation

        The Gini index is then merged into the conversation-level data.

        :return: None
        :rtype: None
        """
        for column in self.summable_columns:
            self.conv_data = pd.merge(
                left=self.conv_data,
                right=get_gini(self.user_data.copy(), "sum_"+column, self.conversation_id_col), # this applies to the summed columns in user_data, which matches the above
                on=[self.conversation_id_col],
                how="inner"
            )

However, possibly due to this dependency, we get an error when I run the tiny_juries_feature_builder_custom_aggregation:

Traceback (most recent call last):
  File "/Users/xehu/Desktop/Team Process Mapping/team_comm_tools/examples/featurize.py", line 101, in <module>
    tiny_juries_feature_builder_custom_aggregation.featurize(col="message")	
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/team_comm_tools/feature_builder.py", line 489, in featurize
    self.conv_level_features()
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/team_comm_tools/feature_builder.py", line 632, in conv_level_features
    self.conv_data = conv_feature_builder.calculate_conversation_level_features(self.feature_methods_conv)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/team_comm_tools/utils/calculate_conversation_level_features.py", line 100, in calculate_conversation_level_features
    method(self)
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/team_comm_tools/utils/calculate_conversation_level_features.py", line 139, in get_gini_features
    right=get_gini(self.user_data.copy(), "sum_"+column, self.conversation_id_col), # this applies to the summed columns in user_data, which matches the above
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/team_comm_tools/utils/gini_coefficient.py", line 37, in get_gini
    gini_calculated = input_data.groupby([conversation_id_col]).apply(lambda df : gini_coefficient(np.asarray(df[on_column]))).reset_index().rename(columns={0: "gini_coefficient_" + on_column})
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/pandas/core/groupby/groupby.py", line 1824, in apply
    result = self._python_apply_general(f, self._selected_obj)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/pandas/core/groupby/groupby.py", line 1885, in _python_apply_general
    values, mutated = self._grouper.apply_groupwise(f, data, self.axis)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/pandas/core/groupby/ops.py", line 919, in apply_groupwise
    res = f(group)
          ^^^^^^^^
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/team_comm_tools/utils/gini_coefficient.py", line 37, in <lambda>
    gini_calculated = input_data.groupby([conversation_id_col]).apply(lambda df : gini_coefficient(np.asarray(df[on_column]))).reset_index().rename(columns={0: "gini_coefficient_" + on_column})
                                                                                                              ~~^^^^^^^^^^^
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/pandas/core/frame.py", line 4102, in __getitem__
    indexer = self.columns.get_loc(key)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
    raise KeyError(key) from err
KeyError: 'sum_num_words'

I think something is going wrong, in which we're not able to find the 'sum_num_words' column that it's looking for. I tried to fix this in user-level feature calculator by forcing it to sum the summable columns and removing the dependency on user_aggregation = True, but things are still breaking after this change:

    def get_user_level_summed_features(self) -> None:
        """
        Aggregate summary statistics from chat-level features that need to be summed together.

        Features for which summing makes sense include:
        - Word count (total number of words)
        - Character count
        - Message count
        - Function word accommodation

        This function calculates and merges the summed features into the user-level data.

        :return: None
        :rtype: None
        """

        # For each summarizable feature
        for column in self.summable_columns: # TODO --- Gini depends on the summation happening; something is happening here where it's causing Gini to break.
            
            # Sum of feature across the Conversation
            self.user_data = pd.merge(
                left=self.user_data,
                right=get_user_sum_dataframe(self.chat_data, column, self.conversation_id_col, self.speaker_id_col),
                on=[self.conversation_id_col, self.speaker_id_col],
                how="inner"
            )

My current call is that perhaps we need to take a minute to track down the source of the bug, as well as look into the design decisions for summing and writing some tests (perhaps similar to run_package_grouping_tests.py, where we have the other package-related tests) to confirm that the custom aggregation is 100% working.

Summary of Next Steps

  1. We need to figure out exactly why the Gini feature is breaking on the test.
  2. How should we treat the summing function? I am noticing that, in the original code, we are only summing self.summable_columns in Conversation Level Features, but we are summing all columns in User Level Features. (for example, I am getting things like sum_positive_bert inside the user-level output).

As an aside, though, the tests are currently looking for sum_positive_bert; I tried changing the user-level feature to only sum the "summable columns," and it caused a test failure:

  File "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/base.py", line 244, in __getitem__
    raise KeyError(f"Column not found: {key}")
KeyError: 'Column not found: sum_positive_bert'
Generating BERT sentiments....
Generating Lexicon pickle...
Chat Level Features ...
Generating features for the first 100.0% of messages...
Generating User Level Features ...
Generating Conversation Level Features ...
Error: Process completed with exit code 1.
  1. If the user didn't request any aggregations, how do we still provide them with the Gini coefficient features (which is technically not an aggregation), without displaying any other aggregated features?
  2. I am also noticing that currently, the user-level only has mean and sum as aggregation functions. Should we add the other functions back in? Right now, we accept 'mean', 'max', 'min', and 'std', but there is no max or min function at the user level.
  3. Another thing is that, technically, at the conversation level, we aggregate from two sources: user and chat, but we only allow people to control which chat-level columns they want to aggregate. Should we change this?
  4. Let's write some robust test cases for this feature under tests/ just to make sure it works as expected. For example:
  • If the user asks for no aggregations, do we get no aggregations?
  • If the user asks for no aggregations, do we still get valid non-aggregated features (like burstiness)?
  • Is it OK to aggregate only a single feature?
  • Is it OK to aggregate DIFFERENT features at the user level and the conversation level?
  • What if I request a feature to be aggregated, but it doesn't exist?
    For this one, I tried to filter out columns that don't exist and added a warning if we found nonexistent columns:
user_columns_in_data = list(set(user_columns).intersection(set(self.chat_data.columns)))

                if(len(user_columns_in_data) != len(user_columns)):
                    warnings.warn(
                        "Warning: One or more requested user columns are not present in the data. Ignoring them."
                    )

                self.columns_to_summarize = user_columns_in_data
  • What if I pass in an aggregation method, but it doesn't exist?

@xehu xehu marked this pull request as draft August 8, 2024 17:23
@xehu xehu changed the base branch from initial_package_version to main August 8, 2024 17:23
@xehu
Copy link
Collaborator

xehu commented Aug 8, 2024

Marking this as a draft until we're ready to merge this in!

@xehu xehu closed this Aug 17, 2024
@xehu xehu force-pushed the amy/package_aggregation branch from 1ad1bb3 to de045e5 Compare August 17, 2024 01:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow Users to Customize Aggregation
6 participants