Amy/package aggregation #272
Conversation
…sive range to not require a vector path
Starting sphinx setup
Testing folder featurizer file
Delete output.csv
Readthedocs setup
- valence testing
- rearranging files
- intermediate ner testing
- NER testing
- fix featurizer
- fix featurize bug
- updating test dataset + function
- code coverage
- burstiness
- move testing FB's into run_tests.py
- move NER dataframe to test file
- adding complex tests back to run_tests.py
- add chat_complex_df and conv_complex_df to run_tests.py
- correct dataset paths
- rebase
- changing references as part of rebase
- correcting FB calls based on latest interface updates
- correct run_tests.py
- add dd tests
- burstiness fix
- dd tests add
- forward flow tests
- src changes
- testing timestamp variations
- src changes
- update test ds
- fix formatting
- fix formatting

Co-authored-by: Xinlan Emily Hu <[email protected]>
Hey @amytangzheng! I set up a few tests in the `examples/featurize.py` file, with the following specifications:
- A version that aggregates nothing: `tiny_juries_feature_builder_no_aggregation`;
- A version that aggregates only 1 feature with 1 function at each level: `tiny_juries_feature_builder_custom_aggregation`.
Unfortunately, I am running into some issues. It seems that the source of the problem has to do with summing features. "Sum" is not currently one of the summarization functions:
```python
convo_methods: list = ['mean', 'max', 'min', 'std'],
convo_columns: list = None,
user_aggregation = True,
user_methods: list = ['mean', 'max', 'min', 'std'],
user_columns: list = None
) -> None:
```
However, it's a bit of a special case, because we actually should only aggregate 3 attributes via sum, as defined by `self.summable_columns`:

```python
self.summable_columns = ["num_words", "num_chars", "num_messages"]
```
That is because we had determined, at some point, that things like 'sum of Positivity' in a conversation didn't really make sense.
Additionally, the summable features are the ones we use to calculate the Gini coefficient. In other words, we'll get the Gini coefficient with respect to the number of words, number of characters, and number of messages each person sent:
```python
"""
Calculate the Gini index for relevant features in the conversation.

This function computes the Gini index for features involving counts, such as:
- Word count
- Character count
- Message count
- Function word accommodation

The Gini index is then merged into the conversation-level data.

:return: None
:rtype: None
"""
for column in self.summable_columns:
    self.conv_data = pd.merge(
        left=self.conv_data,
        right=get_gini(self.user_data.copy(), "sum_" + column, self.conversation_id_col),  # this applies to the summed columns in user_data, which matches the above
        on=[self.conversation_id_col],
        how="inner"
    )
```
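For reference, the `gini_coefficient` helper that `get_gini` applies per conversation typically follows the standard discrete formula. Here is a minimal sketch of such a function (an assumption about its shape, not the package's exact implementation):

```python
import numpy as np

def gini_coefficient(values) -> float:
    """Gini coefficient of a 1-D array of non-negative counts (sketch).

    0.0 means perfect equality (everyone contributed the same amount);
    values near 1.0 mean one person contributed nearly everything.
    """
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    if n == 0 or x.sum() == 0:
        return 0.0
    # Standard discrete formula: sum_i (2i - n - 1) * x_i / (n * sum(x)), i = 1..n
    index = np.arange(1, n + 1)
    return float(np.sum((2 * index - n - 1) * x) / (n * x.sum()))
```

This is why the sum columns matter: Gini over "mean of Positivity" would not be meaningful, but Gini over each person's total word count is a natural equality measure.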
However, possibly due to this dependency, we get an error when I run `tiny_juries_feature_builder_custom_aggregation`:
```
Traceback (most recent call last):
  File "/Users/xehu/Desktop/Team Process Mapping/team_comm_tools/examples/featurize.py", line 101, in <module>
    tiny_juries_feature_builder_custom_aggregation.featurize(col="message")
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/team_comm_tools/feature_builder.py", line 489, in featurize
    self.conv_level_features()
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/team_comm_tools/feature_builder.py", line 632, in conv_level_features
    self.conv_data = conv_feature_builder.calculate_conversation_level_features(self.feature_methods_conv)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/team_comm_tools/utils/calculate_conversation_level_features.py", line 100, in calculate_conversation_level_features
    method(self)
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/team_comm_tools/utils/calculate_conversation_level_features.py", line 139, in get_gini_features
    right=get_gini(self.user_data.copy(), "sum_"+column, self.conversation_id_col), # this applies to the summed columns in user_data, which matches the above
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/team_comm_tools/utils/gini_coefficient.py", line 37, in get_gini
    gini_calculated = input_data.groupby([conversation_id_col]).apply(lambda df : gini_coefficient(np.asarray(df[on_column]))).reset_index().rename(columns={0: "gini_coefficient_" + on_column})
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/pandas/core/groupby/groupby.py", line 1824, in apply
    result = self._python_apply_general(f, self._selected_obj)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/pandas/core/groupby/groupby.py", line 1885, in _python_apply_general
    values, mutated = self._grouper.apply_groupwise(f, data, self.axis)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/pandas/core/groupby/ops.py", line 919, in apply_groupwise
    res = f(group)
          ^^^^^^^^
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/team_comm_tools/utils/gini_coefficient.py", line 37, in <lambda>
    gini_calculated = input_data.groupby([conversation_id_col]).apply(lambda df : gini_coefficient(np.asarray(df[on_column]))).reset_index().rename(columns={0: "gini_coefficient_" + on_column})
                                                                                                              ~~^^^^^^^^^^^
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/pandas/core/frame.py", line 4102, in __getitem__
    indexer = self.columns.get_loc(key)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
    raise KeyError(key) from err
KeyError: 'sum_num_words'
```
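The root cause matches the last frame: the user-level frame simply never received a `sum_num_words` column. The failure mode can be reproduced in isolation with a toy DataFrame (column names here are illustrative, not the package's actual schema):

```python
import pandas as pd

# Toy user-level data where only the mean was aggregated, so the
# "sum_num_words" column that get_gini expects was never created.
user_data = pd.DataFrame({
    "conversation_num": ["c1", "c1", "c2"],
    "mean_num_words": [5.0, 7.0, 3.0],
})

try:
    # Mirrors the groupby-apply inside get_gini, looking up a missing column.
    user_data.groupby("conversation_num").apply(lambda df: df["sum_num_words"].sum())
    error_message = None
except KeyError as err:
    error_message = str(err)  # the missing-column name, as in the traceback
```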
I think something is going wrong such that we're not able to find the `sum_num_words` column it's looking for. I tried to fix this in the user-level feature calculator by forcing it to sum the summable columns and removing the dependency on `user_aggregation = True`, but things are still breaking after this change:
```python
def get_user_level_summed_features(self) -> None:
    """
    Aggregate summary statistics from chat-level features that need to be summed together.

    Features for which summing makes sense include:
    - Word count (total number of words)
    - Character count
    - Message count
    - Function word accommodation

    This function calculates and merges the summed features into the user-level data.

    :return: None
    :rtype: None
    """
    # For each summable feature
    for column in self.summable_columns:  # TODO --- Gini depends on the summation happening; something is happening here where it's causing Gini to break.
        # Sum of feature across the conversation
        self.user_data = pd.merge(
            left=self.user_data,
            right=get_user_sum_dataframe(self.chat_data, column, self.conversation_id_col, self.speaker_id_col),
            on=[self.conversation_id_col, self.speaker_id_col],
            how="inner"
        )
```
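One way to remove the hidden dependency would be to always materialize the `sum_*` columns for the summable features, regardless of the aggregation settings, since Gini consumes them downstream. A standalone sketch of that idea (with a toy stand-in for `get_user_sum_dataframe`, not the package's actual helper):

```python
import pandas as pd

SUMMABLE_COLUMNS = ["num_words", "num_chars", "num_messages"]

def get_user_sum_dataframe(chat_data, column, conv_col, speaker_col):
    # Toy stand-in for the package helper: per-speaker sums of one chat column.
    out = chat_data.groupby([conv_col, speaker_col])[column].sum().reset_index()
    return out.rename(columns={column: "sum_" + column})

def add_summed_features(user_data, chat_data, conv_col, speaker_col):
    # Always create sum_* columns for the summable features, even when the
    # user disabled aggregation, because Gini depends on them downstream.
    for column in SUMMABLE_COLUMNS:
        if column not in chat_data.columns:
            continue
        user_data = pd.merge(
            left=user_data,
            right=get_user_sum_dataframe(chat_data, column, conv_col, speaker_col),
            on=[conv_col, speaker_col],
            how="inner",
        )
    return user_data
```

Under this design, `user_aggregation` would only control the user-requested statistics (mean, max, etc.), while the internal sums are unconditional.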
My current call is that we should take a minute to track down the source of the bug, revisit the design decisions for summing, and write some tests (perhaps similar to `run_package_grouping_tests.py`, where we have the other package-related tests) to confirm that the custom aggregation is 100% working.
Summary of Next Steps
- We need to figure out exactly why the Gini feature is breaking on the test.
- How should we treat the summing function? I am noticing that, in the original code, we only sum `self.summable_columns` in the conversation-level features, but we sum all columns in the user-level features (for example, I am getting things like `sum_positive_bert` in the user-level output).
As an aside, though, the tests currently look for `sum_positive_bert`; I tried changing the user-level feature to only sum the "summable columns," and it caused a test failure:
```
  File "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/pandas/core/base.py", line 244, in __getitem__
    raise KeyError(f"Column not found: {key}")
KeyError: 'Column not found: sum_positive_bert'

Generating BERT sentiments....
Generating Lexicon pickle...
Chat Level Features ...
Generating features for the first 100.0% of messages...
Generating User Level Features ...
Generating Conversation Level Features ...
Error: Process completed with exit code 1.
```
- If the user didn't request any aggregations, how do we still provide them with the Gini coefficient features (which are technically not aggregations), without displaying any other aggregated features?
- I am also noticing that, currently, the user level only has mean and sum as aggregation functions. Should we add the other functions back in? Right now, we accept 'mean', 'max', 'min', and 'std', but there is no max or min function at the user level.
- Another thing is that, technically, at the conversation level, we aggregate from two sources: user and chat, but we only allow people to control which chat-level columns they want to aggregate. Should we change this?
- Let's write some robust test cases for this feature under `tests/`, just to make sure it works as expected. For example:
- If the user asks for no aggregations, do we get no aggregations?
- If the user asks for no aggregations, do we still get valid non-aggregated features (like burstiness)?
- Is it OK to aggregate only a single feature?
- Is it OK to aggregate DIFFERENT features at the user level and the conversation level?
- What if I request a feature to be aggregated, but it doesn't exist?
For this one, I tried to filter out columns that don't exist and added a warning if we found nonexistent columns:
```python
user_columns_in_data = list(set(user_columns).intersection(set(self.chat_data.columns)))
if len(user_columns_in_data) != len(user_columns):
    warnings.warn(
        "Warning: One or more requested user columns are not present in the data. Ignoring them."
    )
self.columns_to_summarize = user_columns_in_data
```
- What if I pass in an aggregation method, but it doesn't exist?
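For that last bullet, the same filter-and-warn pattern could apply to the methods as well. A sketch (the allowed set here is assumed from the signature above, plus 'sum'; it is not an existing constant in the package):

```python
import warnings

ALLOWED_METHODS = {"mean", "max", "min", "std", "sum"}

def validate_methods(requested: list) -> list:
    """Keep only recognized aggregation methods; warn about the rest."""
    valid = [m for m in requested if m in ALLOWED_METHODS]
    unknown = [m for m in requested if m not in ALLOWED_METHODS]
    if unknown:
        warnings.warn(
            f"Warning: Ignoring unrecognized aggregation methods: {unknown}"
        )
    return valid
```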
Marking this as a draft until we're ready to merge this in! |
(Branch force-pushed from 1ad1bb3 to de045e5.)
Pull Request Template:
If you are merging in a feature or other major change, use this template to check your pull request!

Basic Info
What's this pull request about?

My PR Adds or Improves Documentation
If your feature is about documentation, ensure that you check the boxes relevant to you.
- Docstrings
- Feature Wiki
  - I included a reference label (`.. _name:`) in the header, as this is important for referencing the feature in the Table of Contents!
- General Documentation
  - `make clean` and `make html` do not generate breaking errors.

My PR is About Adding a New Feature to the Code Repository
- Adding Feature to the Feature Dictionary
  - I have updated the `feature_dictionary.py` file with an appropriate entry for my feature; I confirm that all fields are accurately filled out.
- Documentation
  - Did you document your feature? You should follow the same requirements as above.
- Code Basics
  - My feature is named in snake case: `my_feature`, NOT `myFeature` (camel case).
  - My feature is in a file called `NAME_features.py` (where NAME is the name of my feature) under `src/features/`.
- Testing
  - My feature has tests in the `tests/` folder.
  - The location of my tests is here:

If you check all the boxes above, then you are ready to merge!