Commit

address #286 and #299
xehu committed Oct 8, 2024
1 parent 44f71be commit abc1cae
Showing 17 changed files with 184 additions and 90 deletions.
Binary file modified docs/build/doctrees/environment.pickle
Binary file not shown.
Binary file modified docs/build/doctrees/examples.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/feature_builder.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/index.doctree
Binary file not shown.
39 changes: 32 additions & 7 deletions docs/build/html/_sources/examples.rst.txt
@@ -85,16 +85,17 @@ Now we are ready to call the FeatureBuilder on our data. All we need to do is de
timestamp_col = "timestamp",
grouping_keys = ["batch_num", "round_num"],
vector_directory = "./vector_data/",
output_file_path_chat_level = "./jury_output_chat_level.csv",
output_file_path_user_level = "./jury_output_user_level.csv",
output_file_path_conv_level = "./jury_output_conversation_level.csv",
output_file_base = "jury_output",
turns = True
)
jury_feature_builder.featurize()
Basic Input Columns
^^^^^^^^^^^^^^^^^^^^

Conversation Parameters
"""""""""""""""""""""""""

* The **input_df** parameter is where you pass in your dataframe. In this case, we want to run the FeatureBuilder on the juries data that we read in!

* The **speaker_id_col** refers to the name of the column containing a unique identifier for each speaker / participant in the conversation. Here, in the data, our column is named "speaker_nickname."
@@ -105,6 +106,8 @@ Basic Input Columns

* If you do not pass anything in, "message" is the default value for this parameter.

* We assume that all messages are ordered chronologically.

* The **timestamp_col** refers to the name of the column containing when each utterance was said. In this case, we have exactly one timestamp for each message, stored in "timestamp."

* If you do not pass anything in, "timestamp" is the default value for this parameter.
@@ -125,21 +128,39 @@ Basic Input Columns
conversation_id_col = "batch_num"
Vector Directory
""""""""""""""""""

* The **vector_directory** is the name of a directory in which we will store some pre-processed information. Some features require running inference from HuggingFace's `RoBERTa-based sentiment model <https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment>`_, and others require generating `SBERT vectors <https://sbert.net/>`_. These processes take time, and we cache the outputs so that subsequent runs of the FeatureBuilder on the same dataset will not take as much time. Therefore, we require you to pass in a location where you'd like us to save these outputs.

* By default, the directory is named "vector_data/."

* **Note that we do not require the name of the vector directory to be a folder that already exists**; if it doesn't exist, we will create it for you.

* Inside the folder, we will store the RoBERTa outputs in a subfolder called "sentiment", and the SBERT vectors in a subfolder called "sentence." We will create both of these subfolders for you.

* The **turns** parameter, which we will discuss later, controls whether or not you'd like the FeatureBuilder to treat successive utterances by the same individual as a single "turn," or whether you'd like them to be treated separately. We will cache different versions of outputs based on this parameter; we use a subfolder called "chats" (when **turns=False**) or "turns" (when **turns=True**).
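
As a rough illustration of the cache layout described above, the sketch below builds the folder paths with ``pathlib``. The helper name ``cache_dirs`` is hypothetical, and the nesting order (a "chats" or "turns" folder inside each of "sentiment" and "sentence") is an assumption; the text only names the folders and does not confirm how they nest.

```python
from pathlib import Path


def cache_dirs(vector_directory: str, turns: bool) -> dict:
    """Return the (assumed) cache folders for one run.

    Hypothetical helper, not part of the FeatureBuilder API.
    Assumption: the "chats"/"turns" folder sits inside "sentiment"/"sentence".
    """
    level = "turns" if turns else "chats"
    root = Path(vector_directory)
    return {
        "sentiment": root / "sentiment" / level,  # RoBERTa sentiment outputs
        "sentence": root / "sentence" / level,    # SBERT sentence vectors
    }


if __name__ == "__main__":
    for name, folder in cache_dirs("./vector_data/", turns=True).items():
        print(name, "->", folder)
```

Because the two values of **turns** map to different folders, a run with ``turns=True`` would not reuse vectors cached by a run with ``turns=False``.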

* There are three output files for each run of the FeatureBuilder, which mirror the three levels of analysis: utterance-, speaker-, and conversation-level. (Please see the section on `Generating Features: Utterance-, Speaker-, and Conversation-Level <intro#generating_features>`_ for more details.) However, this means that we require you to provide a path for where you would like us to store each of the output files; **output_file_path_chat_level** (Utterance- or Chat-Level Features), **output_file_path_user_level** (Speaker- or User-Level Features), and **output_file_path_conv_level** (Conversation-Level Features).
.. _output_file_details:

Output File Naming Details
""""""""""""""""""""""""""""

* There are three output files for each run of the FeatureBuilder, which mirror the three levels of analysis: utterance-, speaker-, and conversation-level. (Please see the section on `Generating Features: Utterance-, Speaker-, and Conversation-Level <intro#generating_features>`_ for more details.) These are generated using the **output_file_base** parameter.

* **All of the outputs will be generated in a folder called "output."**

* Within the "output" folder, **we generate sub-folders such that the three files will be located in subfolders called "chat," "user," and "conv," respectively.**

* Similar to the **vector_directory** parameter, the "chat" directory will be renamed to "turn" depending on the value of the **turns** parameter.

* It is possible to generate different names for each of the three output files, rather than using the same base file path by modifying **output_file_path_chat_level** (Utterance- or Chat-Level Features), **output_file_path_user_level** (Speaker- or User-Level Features), and **output_file_path_conv_level** (Conversation-Level Features). However, because outputs are organized in the specific locations described above, **we have specific requirements for inputting the output paths, and we will modify the path under the hood to match our file naming schema,** rather than saving the file directly to the specified location.

* We expect that you pass in a **path**, not just a filename. For example, the path needs to be "./my_file.csv", and not just "my_file.csv"; you will get an error if you pass in only a name without the "/".

* Regardless of your path location, we will automatically prepend the name "output" to your file path, such that **all of the outputs will be generated in a folder called "output."**
* Regardless of your path location, we will automatically prepend the name "output" to your file path.

* Within the "output" folder, **we will also generate sub-folders such that the three files will be located in subfolders called "chat," "user," and "conv," respectively.**
* Within the "output" folder, **we will also generate the chat/user/conv sub-folders.**

* If you pass in a path that already contains the above automatically-generated elements (for example, "./output/chat/my_chat_features.csv"), we will skip these steps and directly save it in the relevant folder.

@@ -153,14 +174,18 @@ Basic Input Columns
output_file_path_chat_level = "./output/chat/jury_output_chat_level.csv"
* And these two ways of specifying an output path are equivalent, assumign that turns=True:
* And these two ways of specifying an output path are equivalent, assuming that turns=True:

.. code-block:: python
output_file_path_chat_level = "./jury_output_turn_level.csv"
output_file_path_chat_level = "./output/turn/jury_output_turn_level.csv"
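
The path rewriting described above can be sketched as follows. This is a minimal illustration, not the FeatureBuilder's actual code: the function name ``normalize_output_path`` and its ``level`` argument are hypothetical. How nested directories in the input path are handled is not specified here, so the sketch assumes only the file name is kept.

```python
from pathlib import PurePosixPath


def normalize_output_path(path: str, level: str, turns: bool = False) -> str:
    """Rewrite a user-supplied path into the output/<level>/ layout.

    Hypothetical helper. `level` is one of "chat", "user", or "conv".
    """
    if "/" not in path:
        # The docs state that a bare file name (no "/") is an error.
        raise ValueError(f"Expected a path such as './{path}', not a bare file name.")
    # The "chat" folder is renamed to "turn" when turns=True.
    folder = "turn" if (level == "chat" and turns) else level
    # Assumption: only the file name is kept, so a path that already
    # contains output/<folder>/ maps to itself.
    name = PurePosixPath(path).name
    return f"./output/{folder}/{name}"


if __name__ == "__main__":
    print(normalize_output_path("./jury_output_chat_level.csv", "chat"))
    print(normalize_output_path("./jury_output_turn_level.csv", "chat", turns=True))
```

Under these assumptions, both spellings in each pair of examples above resolve to the same file.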
Turns
""""""

* The **turns** parameter controls whether we want to treat successive messages from the same person as a single turn. For example, in a text conversation, sometimes individuals will send many messages in rapid succession, as follows:

* **John**: Hey Michael
9 changes: 4 additions & 5 deletions docs/build/html/_sources/index.rst.txt
@@ -62,11 +62,10 @@ Once you import the tool, you will be able to declare a FeatureBuilder object, w
timestamp_col= "timestamp",
# this is where we'll cache things like sentence vectors; this directory doesn't have to exist; we'll create it for you!
vector_directory = "./vector_data/",
# give us names for the utterance (chat), speaker (user), and conversation-level outputs
output_file_path_chat_level = "./my_output_chat_level.csv",
output_file_path_user_level = "./my_output_user_level.csv",
output_file_path_conv_level = "./my_output_conversation_level.csv",
# if true, this will combine successive turns by the same speaker.
# this will be the base file path for which we generate the three outputs;
# you will get your outputs in output/chat/my_output_chat_level.csv; output/conv/my_output_conv_level.csv; and output/user/my_output_user_level.
output_file_base = "my_output",
# it will also store the output into output/turns/my_output_chat_level.csv
turns = False,
# these features depend on sentence vectors, so they take longer to generate on larger datasets. Add them in manually if you are interested in adding them to your output!
custom_features = [
41 changes: 34 additions & 7 deletions docs/build/html/examples.html
@@ -160,16 +160,16 @@ <h3>Configuring the FeatureBuilder<a class="headerlink" href="#configuring-the-f
<span class="n">timestamp_col</span> <span class="o">=</span> <span class="s2">&quot;timestamp&quot;</span><span class="p">,</span>
<span class="n">grouping_keys</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&quot;batch_num&quot;</span><span class="p">,</span> <span class="s2">&quot;round_num&quot;</span><span class="p">],</span>
<span class="n">vector_directory</span> <span class="o">=</span> <span class="s2">&quot;./vector_data/&quot;</span><span class="p">,</span>
<span class="n">output_file_path_chat_level</span> <span class="o">=</span> <span class="s2">&quot;./jury_output_chat_level.csv&quot;</span><span class="p">,</span>
<span class="n">output_file_path_user_level</span> <span class="o">=</span> <span class="s2">&quot;./jury_output_user_level.csv&quot;</span><span class="p">,</span>
<span class="n">output_file_path_conv_level</span> <span class="o">=</span> <span class="s2">&quot;./jury_output_conversation_level.csv&quot;</span><span class="p">,</span>
<span class="n">output_file_base</span> <span class="o">=</span> <span class="s2">&quot;jury_output&quot;</span><span class="p">,</span>
<span class="n">turns</span> <span class="o">=</span> <span class="kc">True</span>
<span class="p">)</span>
<span class="n">jury_feature_builder</span><span class="o">.</span><span class="n">featurize</span><span class="p">()</span>
</pre></div>
</div>
<section id="basic-input-columns">
<h4>Basic Input Columns<a class="headerlink" href="#basic-input-columns" title="Link to this heading"></a></h4>
<section id="conversation-parameters">
<h5>Conversation Parameters<a class="headerlink" href="#conversation-parameters" title="Link to this heading"></a></h5>
<ul>
<li><p>The <strong>input_df</strong> parameter is where you pass in your dataframe. In this case, we want to run the FeatureBuilder on the juries data that we read in!</p></li>
<li><p>The <strong>speaker_id_col</strong> refers to the name of the column containing a unique identifier for each speaker / participant in the conversation. Here, in the data, our column is named “speaker_nickname.”</p>
@@ -183,6 +183,7 @@ <h4>Basic Input Columns<a class="headerlink" href="#basic-input-columns" title="
<blockquote>
<div><ul class="simple">
<li><p>If you do not pass anything in, “message” is the default value for this parameter.</p></li>
<li><p>We assume that all messages are ordered chronologically.</p></li>
</ul>
</div></blockquote>
</li>
@@ -208,21 +209,41 @@ <h4>Basic Input Columns<a class="headerlink" href="#basic-input-columns" title="
</div>
</div></blockquote>
</li>
</ul>
</section>
<section id="vector-directory">
<h5>Vector Directory<a class="headerlink" href="#vector-directory" title="Link to this heading"></a></h5>
<ul>
<li><p>The <strong>vector_directory</strong> is the name of a directory in which we will store some pre-processed information. Some features require running inference from HuggingFace’s <a class="reference external" href="https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment">RoBERTa-based sentiment model</a>, and others require generating <a class="reference external" href="https://sbert.net/">SBERT vectors</a>. These processes take time, and we cache the outputs so that subsequent runs of the FeatureBuilder on the same dataset will not take as much time. Therefore, we require you to pass in a location where you’d like us to save these outputs.</p>
<blockquote>
<div><ul class="simple">
<li><p>By default, the directory is named “vector_data/.”</p></li>
<li><p><strong>Note that we do not require the name of the vector directory to be a folder that already exists</strong>; if it doesn’t exist, we will create it for you.</p></li>
<li><p>Inside the folder, we will store the RoBERTa outputs in a subfolder called “sentiment”, and the SBERT vectors in a subfolder called “sentence.” We will create both of these subfolders for you.</p></li>
<li><p>The <strong>turns</strong> parameter, which we will discuss later, controls whether or not you’d like the FeatureBuilder to treat successive utterances by the same individual as a single “turn,” or whether you’d like them to be treated separately. We will cache different versions of outputs based on this parameter; we use a subfolder called “chats” (when <strong>turns=False</strong>) or “turns” (when <strong>turns=True</strong>).</p></li>
</ul>
</div></blockquote>
</li>
<li><p>There are three output files for each run of the FeatureBuilder, which mirror the three levels of analysis: utterance-, speaker-, and conversation-level. (Please see the section on <a class="reference external" href="intro#generating_features">Generating Features: Utterance-, Speaker-, and Conversation-Level</a> for more details.) However, this means that we require you to provide a path for where you would like us to store each of the output files; <strong>output_file_path_chat_level</strong> (Utterance- or Chat-Level Features), <strong>output_file_path_user_level</strong> (Speaker- or User-Level Features), and <strong>output_file_path_conv_level</strong> (Conversation-Level Features).</p>
</ul>
</section>
<section id="output-file-naming-details">
<span id="output-file-details"></span><h5>Output File Naming Details<a class="headerlink" href="#output-file-naming-details" title="Link to this heading"></a></h5>
<ul>
<li><p>There are three output files for each run of the FeatureBuilder, which mirror the three levels of analysis: utterance-, speaker-, and conversation-level. (Please see the section on <a class="reference external" href="intro#generating_features">Generating Features: Utterance-, Speaker-, and Conversation-Level</a> for more details.) These are generated using the <strong>output_file_base</strong> parameter.</p>
<blockquote>
<div><ul class="simple">
<li><p><strong>All of the outputs will be generated in a folder called “output.”</strong></p></li>
<li><p>Within the “output” folder, <strong>we generate sub-folders such that the three files will be located in subfolders called “chat,” “user,” and “conv,” respectively.</strong></p></li>
<li><p>Similar to the <strong>vector_directory</strong> parameter, the “chat” directory will be renamed to “turn” depending on the value of the <strong>turns</strong> parameter.</p></li>
</ul>
</div></blockquote>
</li>
<li><p>It is possible to generate different names for each of the three output files, rather than using the same base file path by modifying <strong>output_file_path_chat_level</strong> (Utterance- or Chat-Level Features), <strong>output_file_path_user_level</strong> (Speaker- or User-Level Features), and <strong>output_file_path_conv_level</strong> (Conversation-Level Features). However, because outputs are organized in the specific locations described above, <strong>we have specific requirements for inputting the output paths, and we will modify the path under the hood to match our file naming schema,</strong> rather than saving the file directly to the specified location.</p>
<blockquote>
<div><ul class="simple">
<li><p>We expect that you pass in a <strong>path</strong>, not just a filename. For example, the path needs to be “./my_file.csv”, and not just “my_file.csv”; you will get an error if you pass in only a name without the “/”.</p></li>
<li><p>Regardless of your path location, we will automatically prepend the name “output” to your file path, such that <strong>all of the outputs will be generated in a folder called “output.”</strong></p></li>
<li><p>Within the “output” folder, <strong>we will also generate sub-folders such that the three files will be located in subfolders called “chat,” “user,” and “conv,” respectively.</strong></p></li>
<li><p>Regardless of your path location, we will automatically prepend the name “output” to your file path.</p></li>
<li><p>Within the “output” folder, <strong>we will also generate the chat/user/conv sub-folders.</strong></p></li>
<li><p>If you pass in a path that already contains the above automatically-generated elements (for example, “./output/chat/my_chat_features.csv”), we will skip these steps and directly save it in the relevant folder.</p></li>
<li><p>Similar to the <strong>vector_directory</strong> parameter, the “chat” directory will be renamed to “turn” depending on the value of the <strong>turns</strong> parameter.</p></li>
<li><p>This means that the following two ways of specifying an output path are equivalent, assuming that turns=False:</p></li>
@@ -233,7 +254,7 @@ <h4>Basic Input Columns<a class="headerlink" href="#basic-input-columns" title="
</pre></div>
</div>
<ul class="simple">
<li><p>And these two ways of specifying an output path are equivalent, assumign that turns=True:</p></li>
<li><p>And these two ways of specifying an output path are equivalent, assuming that turns=True:</p></li>
</ul>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">output_file_path_chat_level</span> <span class="o">=</span> <span class="s2">&quot;./jury_output_turn_level.csv&quot;</span>

@@ -242,6 +263,11 @@ <h4>Basic Input Columns<a class="headerlink" href="#basic-input-columns" title="
</div>
</div></blockquote>
</li>
</ul>
</section>
<section id="turns">
<h5>Turns<a class="headerlink" href="#turns" title="Link to this heading"></a></h5>
<ul>
<li><p>The <strong>turns</strong> parameter controls whether we want to treat successive messages from the same person as a single turn. For example, in a text conversation, sometimes individuals will send many messages in rapid succession, as follows:</p>
<blockquote>
<div><ul>
@@ -260,6 +286,7 @@ <h4>Basic Input Columns<a class="headerlink" href="#basic-input-columns" title="
</li>
</ul>
</section>
</section>
<section id="advanced-configuration-columns">
<h4>Advanced Configuration Columns<a class="headerlink" href="#advanced-configuration-columns" title="Link to this heading"></a></h4>
<p>More advanced users of the FeatureBuilder should consider the following optional parameters, depending on their needs.</p>