
Commit 4211f0d

xehu and sundy1994 authored
Small updates to documentation and adding .__version__ parameter (#332)
* update examples hierarchy
* Closes #318.
* provide __version__ variable without setup.py
* small fix

---------

Co-authored-by: sundy1994 <[email protected]>
1 parent 827ca02 commit 4211f0d
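
Note on the version change: the file that actually adds `__version__` is not among the diffs shown below, so the following is only a minimal sketch of the standard way to provide a version attribute without setup.py, using importlib.metadata (the exact code in this commit may differ):

    # Sketch of what src/team_comm_tools/__init__.py might contain (assumed
    # location); the version is read from the installed package metadata
    # instead of being hard-coded alongside setup.py.
    from importlib.metadata import PackageNotFoundError, version

    try:
        __version__ = version("team_comm_tools")
    except PackageNotFoundError:
        # Running from a source checkout that has not been pip-installed.
        __version__ = "unknown"

With something like this in place, users can check the installed version via `team_comm_tools.__version__`.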

File tree

10 files changed, +80 / -68 lines changed


.gitignore

Lines changed: 1 addition & 0 deletions
@@ -35,6 +35,7 @@ MANIFEST
 src/team_comm_tools/features/lexicons/liwc_lexicons/*
 src/team_comm_tools/features/lexicons/liwc_lexicons_small_test/*
 src/team_comm_tools/features/lexicons/certainty.txt
+src/team_comm_tools/features/lexicons/liwc_2015.dic
 src/team_comm_tools/modules/
 src/team_comm_tools/output/*
 src/team_comm_tools/ipython_notebooks/.ipynb_checkpoints/
(unnamed binary file)

-1 Bytes · Binary file not shown.

docs/build/doctrees/examples.doctree

2 Bytes · Binary file not shown.

docs/build/html/_sources/examples.rst.txt

Lines changed: 20 additions & 18 deletions
@@ -18,8 +18,9 @@ We also have demos available on Google Colab that you can copy and run on your o
 
 Finally, this page will walk you through a case study, highlighting top use cases and considerations when using the toolkit.
 
+----------------
 Getting Started
-=================
+----------------
 
 To use our tool, please ensure that you have Python >= 3.10 installed and a working version of `pip <https://pypi.org/project/pip/>`_, which is Python's package installer. Then, in your local environment, run the following:
 
@@ -30,7 +31,7 @@ To use our tool, please ensure that you have Python >= 3.10 installed and a work
 This command will automatically install our package and all required dependencies.
 
 Troubleshooting
------------------
+================
 
 In the event that some dependency installations fail (for example, you may get an error that ``en_core_web_sm`` from Spacy is not found, or that there is a missing NLTK resource), please run this simple one-line command in your terminal, which will force the installation of Spacy and NLTK dependencies:
 
@@ -43,14 +44,14 @@ If you encounter a further issue in which the 'wordnet' package from NLTK is not
 You can also find a full list of our requirements `here <https://github.com/Watts-Lab/team_comm_tools/blob/main/requirements.txt>`_.
 
 Import Recommendations: Virtual Environment and Pip
------------------------------------------------------
+=====================================================
 
 **We strongly recommend using a virtual environment in Python to run the package.** We have several specific dependency requirements. One important one is that we are currently only compatible with numpy < 2.0.0 because `numpy 2.0.0 and above <https://numpy.org/devdocs/release/2.0.0-notes.html#changes>`_ made significant changes that are not compatible with other dependencies of our package. As those dependencies are updated, we will support later versions of numpy.
 
 **We also strongly recommend that your version of pip is up-to-date (>=24.0).** There have been reports in which users have had trouble downloading dependencies (specifically, the Spacy package) with older versions of pip. If you get an error with downloading ``en_core_web_sm``, we recommend updating pip.
 
 Importing the Package
------------------------
+======================
 
 After you import the package and install dependencies, you can then use our tool in your Python script as follows:
 
@@ -62,13 +63,14 @@ Now you have access to the :ref:`feature_builder`. This is the main class that y
 
 *Note*: PyPI treats hyphens and underscores equally, so "pip install team_comm_tools" and "pip install team-comm-tools" are equivalent. However, Python does NOT treat them equally, and **you should use underscores when you import the package, like this: from team_comm_tools import FeatureBuilder**.
 
+-------------------------------------------------------
 Walkthrough: Running the FeatureBuilder on Your Data
-=======================================================
+-------------------------------------------------------
 
 Next, we'll go through the details of running the FeatureBuilder on your data, discussing each of the specific options / parameters at your disposal.
 
 Configuring the FeatureBuilder
---------------------------------
+================================
 
 The FeatureBuilder accepts any Pandas DataFrame as the input, so you can read in data in whatever format you like. For the purposes of this walkthrough, we'll be using some jury deliberation data from `Hu et al. (2021) <https://dl.acm.org/doi/pdf/10.1145/3411764.3445433?casa_token=d-b5sCdwpNcAAAAA:-U-ePTSSE3rY1_BLXy1-0spFN_i4gOJqy8D0CeXHLAJna5bFRTee9HEnM0TnK_R-g0BOqOn35mU>`_.
 
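To ground the walkthrough this file documents, here is a hedged sketch of the setup. Only **input_df** is named in these hunks; the CSV path and the other constructor arguments are illustrative assumptions, not verbatim from the docs:

    import pandas as pd
    from team_comm_tools import FeatureBuilder  # underscores on import, per the note above

    # Hypothetical input file; any Pandas DataFrame of conversation data works.
    juries_df = pd.read_csv("jury_conversations.csv")

    jury_feature_builder = FeatureBuilder(
        input_df=juries_df,                 # documented in this file
        vector_directory="./vector_data/",  # assumed value; the parameter is mentioned below
        output_file_base="jury_output",     # assumed parameter name for output naming
    )
    jury_feature_builder.featurize()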
@@ -97,10 +99,10 @@ Now we are ready to call the FeatureBuilder on our data. All we need to do is de
    jury_feature_builder.featurize()
 
 Basic Input Columns
-~~~~~~~~~~~~~~~~~~~~
+---------------------
 
 Conversation Parameters
-**************************
+~~~~~~~~~~~~~~~~~~~~~~~~~
 
 * The **input_df** parameter is where you pass in your dataframe. In this case, we want to run the FeatureBuilder on the juries data that we read in!
 
@@ -206,19 +208,19 @@ Turns
 
 
 Advanced Configuration Columns
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+-------------------------------
 
 More advanced users of the FeatureBuilder should consider the following optional parameters, depending on their needs.
 
 Regenerating Vector Cache
-***************************
+~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 * The **regenerate_vectors** parameter controls whether you'd like the FeatureBuilder to re-generate the content in the **vector_directory**, even if we have already cached the output of a previous run. It is useful if the underlying data has changed, but you want to give the output file the same name as a previous run of the FeatureBuilder.
 
 * By default, **we assume that, if your output file is named the same, that the underlying vectors are the same**. If this isn't true, you should set **regenerate_vectors = True** in order to clear out the cache and re-generate the RoBERTa and SBERT outputs.
 
 Custom Features
-*****************
+~~~~~~~~~~~~~~~~~
 
 * The **custom_features** parameter allows you to specify features that do not exist within our default set. **We default to NOT generating four features that depend on SBERT vectors, as the process for generating the vectors tends to be slow.** However, these features can provide interesting insights into the extent to which individuals in a conversation speak "similarly" or not, based on a vector similarity metric. To access these features, simply use the **custom_features** parameter:
 
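Continuing the sketch above, the two parameters in this hunk might be used as follows; the four SBERT-dependent feature names are assumptions for illustration, since the diff does not list them:

    jury_feature_builder = FeatureBuilder(
        input_df=juries_df,
        regenerate_vectors=True,  # clear the cache and recompute RoBERTa/SBERT vectors
        custom_features=[         # assumed names of the four SBERT-dependent features
            "(BERT) Mimicry",
            "Moving Mimicry",
            "Forward Flow",
            "Discursive Diversity",
        ],
    )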
@@ -234,7 +236,7 @@ Custom Features
 * You can chose to add any of these features depending on your preference.
 
 Analyzing First Percentage (%)
-********************************
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 * The **analyze_first_pct** parameter allows you to "cut off" and separately analyze the first X% of a conversation, in case you wish to separately study different sections of a conversation as it progresses. For example, you may be interested in knowing how the attributes of the first 50% of a conversation differ from the attributes of the entire conversation. Then you can sepcify the following:
 
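The 50%-versus-full-conversation example above might look like the following; the list-of-fractions format is an assumption:

    jury_feature_builder = FeatureBuilder(
        input_df=juries_df,
        analyze_first_pct=[0.5, 1.0],  # first 50% of each conversation, plus the full 100%
    )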
@@ -247,14 +249,14 @@ Analyzing First Percentage (%)
 * By default, we will simply analyze 100% of each conversation.
 
 Named Entity Recognition
-**************************
+~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 * The parameters **ner_training_df** and **ner_cutoff** are required if you would like the FeatureBuilder to identify named entities in your conversations. For example, the sentence, "John, did you talk to Michael this morning?" has two named entities: "John" and "Michael." The FeatureBuilder includes a tool that automatically detects these named entities, but it requires the user (you!) to specify some training data with examples of the types of named entities you'd like to recognize. This is because proper nouns can take many forms, from standard Western-style names (e.g., "John") to pseudonymous online nicknames (like "littleHorse"). More information about these parameters can be found in :ref:`named_entity_recognition`.
 
 .. _custom_aggregation:
 
 Custom Aggregation
-********************
+~~~~~~~~~~~~~~~~~~~
 
 Imagine that you, as a researcher, are interested in high-level characteristics of the entire conversation (for example, how much is said), but you only have measures at the (lower) level of each individual utterance (for example, the number of words in each message). How would you "aggregate" information from the lower level to the higher level?
 
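A heavily hedged sketch of supplying these two parameters; the training-data schema and the cutoff scale are assumptions, and :ref:`named_entity_recognition` remains the authoritative reference:

    # Hypothetical training examples of the entity types to recognize.
    ner_examples = pd.DataFrame({"named_entity": ["John", "Michael", "littleHorse"]})

    jury_feature_builder = FeatureBuilder(
        input_df=juries_df,
        ner_training_df=ner_examples,  # examples of named entities to detect
        ner_cutoff=0.9,                # assumed confidence threshold for a match
    )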
@@ -317,7 +319,7 @@ The table below summarizes the different types of aggregation, and the ways in w
 
 
 Example Usage of Custom Aggregation Parameters
-+++++++++++++++++++++++++++++++++++++++++++++++
+************************************************
 
 To customize aggregation behavior, simply add the following when constructing your FeatureBuilder:
 
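The file's own example sits on unchanged lines not shown in this diff; as a hedged sketch of the general shape, ``convo_methods`` is named in NOTE 3 below, while the column-selection parameter name is an assumption:

    jury_feature_builder = FeatureBuilder(
        input_df=juries_df,
        convo_methods=["max", "median"],  # aggregation functions, per NOTE 3 below
        convo_columns=["num_words"],      # assumed name for choosing columns to aggregate
    )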
@@ -336,14 +338,14 @@ To turn off aggregation, set the following parameters to ``False``. By default,
    user_aggregation = False
 
 Important Notes and Caveats
-++++++++++++++++++++++++++++
+*****************************
 
 - **[NOTE 1]** Even when aggregation is disabled, totals of words, messages, and characters are still summarized, as these are required for calculating the Gini Coefficient features.
 - **[NOTE 2]** Be careful when choosing the "sum" aggregation method, as it is not always appropriate to use the "sum" as an aggregation function. While it is a sensible choice for utterance-level attributes that are *countable* (for example, the total number of words, or other lexical wordcounts), it is a less sensible choice for others (for example, it does not make sense to sum sentiment scores for each utterance in a conversation). Consequently, using the "sum" feature will come with an associated warning.
 - **[NOTE 3]** In addition to aggregating from the utterance (chat) level to the conversation level, we also aggregate from the speaker (user) level to the conversation level, using the same methods specified in ``convo_methods`` to do so.
 
 Cumulative Grouping
-*********************
+~~~~~~~~~~~~~~~~~~~~
 
 * The parameters **cumulative_grouping** and **within_task** address a special case of having multiple conversational identifiers; **they assume that the same team has multiple sequential conversations, and that, in each conversation, they perform one or more separate activities**. This was originally created as a companion to a multi-stage Empirica game (see: `<https://github.com/Watts-Lab/multi-task-empirica>`_). For example, imagine that a team must complete 3 different tasks, each with 3 different subparts. Then we can model this event in terms of 1 team (High level), 3 tasks (Mid level), and 3 subparts per task (Low level).
 
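A hedged sketch of the multi-stage scenario above; both parameter names appear in this hunk, while the values and the DataFrame are illustrative:

    jury_feature_builder = FeatureBuilder(
        input_df=multi_task_df,    # hypothetical data with team/task/subpart identifiers
        cumulative_grouping=True,  # treat each conversation as cumulative across stages
        within_task=True,          # only accumulate chats within the same task
    )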
@@ -460,7 +462,7 @@ Here is some example output (for the RoBERTa sentiment feature):
    'bert_sentiment_data': True}
 
 Feature Column Names
-~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~
 
 Once you call **.featurize()**, you can also obtain a convenient list of the feature columns generated by the toolkit:
 
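The retrieval snippet itself is on unchanged lines not shown in this diff; a hedged guess at its shape, with the attribute names below being assumptions rather than the documented API:

    jury_feature_builder.featurize()
    # Assumed attribute names for the lists of generated feature columns:
    print(jury_feature_builder.chat_features)
    print(jury_feature_builder.conv_features_all)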