docs/build/html/_sources/examples.rst.txt
20 additions & 18 deletions
@@ -18,8 +18,9 @@ We also have demos available on Google Colab that you can copy and run on your o
Finally, this page will walk you through a case study, highlighting top use cases and considerations when using the toolkit.
+----------------
Getting Started
-=================
+----------------
To use our tool, please ensure that you have Python >= 3.10 installed and a working version of `pip <https://pypi.org/project/pip/>`_, which is Python's package installer. Then, in your local environment, run the following:
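The install command itself falls in the lines this hunk elides; for reference, it is the standard pip install, with the package name confirmed by the import note later in this file:

.. code-block:: console

   pip install team_comm_tools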
@@ -30,7 +31,7 @@ To use our tool, please ensure that you have Python >= 3.10 installed and a work
This command will automatically install our package and all required dependencies.
Troubleshooting
------------------
+================
35
36
In the event that some dependency installations fail (for example, you may get an error that ``en_core_web_sm`` from Spacy is not found, or that there is a missing NLTK resource), please run this simple one-line command in your terminal, which will force the installation of Spacy and NLTK dependencies:
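The one-line command itself is elided by the hunk break. As a hedged fallback (an assumption, not necessarily the toolkit's own command), the missing resources can also be fetched manually with the standard spaCy and NLTK downloaders:

.. code-block:: console

   python -m spacy download en_core_web_sm
   python -m nltk.downloader wordnet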
@@ -43,14 +44,14 @@ If you encounter a further issue in which the 'wordnet' package from NLTK is not
You can also find a full list of our requirements `here <https://github.com/Watts-Lab/team_comm_tools/blob/main/requirements.txt>`_.
Import Recommendations: Virtual Environment and Pip
**We strongly recommend using a virtual environment in Python to run the package.** We have several strict dependency requirements; notably, we are currently only compatible with numpy < 2.0.0, because `numpy 2.0.0 and above <https://numpy.org/devdocs/release/2.0.0-notes.html#changes>`_ made significant changes that are incompatible with other dependencies of our package. As those dependencies are updated, we will support later versions of numpy.
**We also strongly recommend that your version of pip is up-to-date (>=24.0).** There have been reports in which users have had trouble downloading dependencies (specifically, the Spacy package) with older versions of pip. If you get an error with downloading ``en_core_web_sm``, we recommend updating pip.
Importing the Package
------------------------
+======================
After you import the package and install dependencies, you can then use our tool in your Python script as follows:
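The snippet is elided by the hunk break; per the import note below, it presumably reads:

.. code-block:: python

   from team_comm_tools import FeatureBuilder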
@@ -62,13 +63,14 @@ Now you have access to the :ref:`feature_builder`. This is the main class that y
*Note*: PyPI treats hyphens and underscores equally, so "pip install team_comm_tools" and "pip install team-comm-tools" are equivalent. However, Python does NOT treat them equally, and **you should use underscores when you import the package, like this: from team_comm_tools import FeatureBuilder**.
Next, we'll go through the details of running the FeatureBuilder on your data, discussing each of the specific options / parameters at your disposal.
Configuring the FeatureBuilder
---------------------------------
+================================
The FeatureBuilder accepts any Pandas DataFrame as the input, so you can read in data in whatever format you like. For the purposes of this walkthrough, we'll be using some jury deliberation data from `Hu et al. (2021) <https://dl.acm.org/doi/pdf/10.1145/3411764.3445433?casa_token=d-b5sCdwpNcAAAAA:-U-ePTSSE3rY1_BLXy1-0spFN_i4gOJqy8D0CeXHLAJna5bFRTee9HEnM0TnK_R-g0BOqOn35mU>`_.
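For example, a minimal sketch of reading in the data (the file path is hypothetical):

.. code-block:: python

   import pandas as pd

   # Hypothetical path to the jury deliberation data.
   juries_df = pd.read_csv("./juries_data.csv")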
@@ -97,10 +99,10 @@ Now we are ready to call the FeatureBuilder on our data. All we need to do is de
jury_feature_builder.featurize()
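The constructor call itself is elided above; a minimal sketch of what it might look like, in which only **input_df** and **vector_directory** are parameter names confirmed elsewhere in this file:

.. code-block:: python

   from team_comm_tools import FeatureBuilder

   # Sketch only: the vector_directory value and any omitted parameters are assumptions.
   jury_feature_builder = FeatureBuilder(
       input_df=juries_df,
       vector_directory="./vector_data/",
   )
   jury_feature_builder.featurize()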
Basic Input Columns
-~~~~~~~~~~~~~~~~~~~~
+---------------------
Conversation Parameters
-**************************
+~~~~~~~~~~~~~~~~~~~~~~~~~
* The **input_df** parameter is where you pass in your dataframe. In this case, we want to run the FeatureBuilder on the juries data that we read in!
@@ -206,19 +208,19 @@ Turns
Advanced Configuration Columns
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+-------------------------------
More advanced users of the FeatureBuilder should consider the following optional parameters, depending on their needs.
Regenerating Vector Cache
-***************************
+~~~~~~~~~~~~~~~~~~~~~~~~~~
* The **regenerate_vectors** parameter controls whether you'd like the FeatureBuilder to re-generate the content in the **vector_directory**, even if we have already cached the output of a previous run. It is useful if the underlying data has changed, but you want to give the output file the same name as a previous run of the FeatureBuilder.
* By default, **we assume that, if your output file has the same name, the underlying vectors are the same**. If this isn't true, you should set **regenerate_vectors = True** to clear out the cache and re-generate the RoBERTa and SBERT outputs.
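A sketch of the flag in use, reusing the hypothetical constructor arguments from above:

.. code-block:: python

   jury_feature_builder = FeatureBuilder(
       input_df=juries_df,
       vector_directory="./vector_data/",
       regenerate_vectors=True,  # rebuild the cached RoBERTa/SBERT vectors
   )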
Custom Features
-*****************
+~~~~~~~~~~~~~~~~~
* The **custom_features** parameter allows you to specify features that do not exist within our default set. **We default to NOT generating four features that depend on SBERT vectors, as the process for generating the vectors tends to be slow.** However, these features can provide interesting insights into the extent to which individuals in a conversation speak "similarly" or not, based on a vector similarity metric. To access these features, simply use the **custom_features** parameter:
@@ -234,7 +236,7 @@ Custom Features
* You can choose to add any of these features depending on your preference (a sketch of the parameter in use appears below).
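The list itself is elided by the hunk break; a hedged sketch, in which the four feature names are assumptions drawn from the toolkit's feature documentation rather than from this diff:

.. code-block:: python

   jury_feature_builder = FeatureBuilder(
       input_df=juries_df,
       vector_directory="./vector_data/",
       # Assumed names of the four SBERT-dependent features:
       custom_features=[
           "(BERT) Mimicry",
           "Moving Mimicry",
           "Forward Flow",
           "Discursive Diversity",
       ],
   )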
Analyzing First Percentage (%)
-********************************
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* The **analyze_first_pct** parameter allows you to "cut off" and separately analyze the first X% of a conversation, in case you wish to study different sections of a conversation as it progresses. For example, you may be interested in knowing how the attributes of the first 50% of a conversation differ from those of the entire conversation. Then you can specify the following:
@@ -247,14 +249,14 @@ Analyzing First Percentage (%)
* By default, we will simply analyze 100% of each conversation.
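The elided snippet presumably passes one or more fractions; a sketch under that assumption, analyzing the first 50% alongside the full conversation:

.. code-block:: python

   jury_feature_builder = FeatureBuilder(
       input_df=juries_df,
       vector_directory="./vector_data/",
       # Assumption: fractions of each conversation to analyze separately.
       analyze_first_pct=[0.5, 1.0],
   )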
Named Entity Recognition
-**************************
+~~~~~~~~~~~~~~~~~~~~~~~~~~
* The parameters **ner_training_df** and **ner_cutoff** are required if you would like the FeatureBuilder to identify named entities in your conversations. For example, the sentence, "John, did you talk to Michael this morning?" has two named entities: "John" and "Michael." The FeatureBuilder includes a tool that automatically detects these named entities, but it requires the user (you!) to specify some training data with examples of the types of named entities you'd like to recognize. This is because proper nouns can take many forms, from standard Western-style names (e.g., "John") to pseudonymous online nicknames (like "littleHorse"). More information about these parameters can be found in :ref:`named_entity_recognition`.
.. _custom_aggregation:
Custom Aggregation
-********************
+~~~~~~~~~~~~~~~~~~~
Imagine that you, as a researcher, are interested in high-level characteristics of the entire conversation (for example, how much is said), but you only have measures at the (lower) level of each individual utterance (for example, the number of words in each message). How would you "aggregate" information from the lower level to the higher level?
@@ -317,7 +319,7 @@ The table below summarizes the different types of aggregation, and the ways in w
Example Usage of Custom Aggregation Parameters
-+++++++++++++++++++++++++++++++++++++++++++++++
+************************************************
To customize aggregation behavior, simply add the following when constructing your FeatureBuilder:
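The elided snippet is sketched below; ``convo_methods`` and ``user_aggregation`` are named elsewhere in this section, while the companion parameter names and values are assumptions:

.. code-block:: python

   jury_feature_builder = FeatureBuilder(
       input_df=juries_df,
       vector_directory="./vector_data/",
       convo_methods=["max", "median"],  # named in the table above
       convo_columns=["num_words"],      # assumed companion parameter
       user_methods=["mean"],            # assumed companion parameter
       user_columns=["positive_bert"],   # assumed companion parameter
   )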
@@ -336,14 +338,14 @@ To turn off aggregation, set the following parameters to ``False``. By default,
user_aggregation = False
Important Notes and Caveats
-++++++++++++++++++++++++++++
+*****************************
- **[NOTE 1]** Even when aggregation is disabled, totals of words, messages, and characters are still summarized, as these are required for calculating the Gini Coefficient features.
- **[NOTE 2]** Be careful when choosing the "sum" aggregation method, as it is not always appropriate. While summing is sensible for utterance-level attributes that are *countable* (for example, the total number of words, or other lexical word counts), it is less sensible for others (for example, it does not make sense to sum the sentiment scores of each utterance in a conversation). Consequently, using the "sum" method will come with an associated warning.
- **[NOTE 3]** In addition to aggregating from the utterance (chat) level to the conversation level, we also aggregate from the speaker (user) level to the conversation level, using the same methods specified in ``convo_methods`` to do so.
Cumulative Grouping
-*********************
+~~~~~~~~~~~~~~~~~~~~
* The parameters **cumulative_grouping** and **within_task** address a special case of having multiple conversational identifiers; **they assume that the same team has multiple sequential conversations, and that, in each conversation, they perform one or more separate activities**. This was originally created as a companion to a multi-stage Empirica game (see: `<https://github.com/Watts-Lab/multi-task-empirica>`_). For example, imagine that a team must complete 3 different tasks, each with 3 different subparts. Then we can model this event in terms of 1 team (High level), 3 tasks (Mid level), and 3 subparts per task (Low level).
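A sketch of the two flags in use; the parameter names come from the bullet above, and everything else is assumed:

.. code-block:: python

   jury_feature_builder = FeatureBuilder(
       input_df=juries_df,
       vector_directory="./vector_data/",
       cumulative_grouping=True,  # treat sequential conversations cumulatively
       within_task=True,          # keep the cumulative window within one task
   )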
@@ -460,7 +462,7 @@ Here is some example output (for the RoBERTa sentiment feature):
'bert_sentiment_data': True}
Feature Column Names
-~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~
Once you call **.featurize()**, you can also obtain a convenient list of the feature columns generated by the toolkit:
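The accessor itself is elided; a sketch with a hypothetical attribute name:

.. code-block:: python

   # Hypothetical attribute; consult the toolkit's documentation for the real accessor.
   print(jury_feature_builder.feature_names)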