Commit abc1cae

committed
address #286 and #299
1 parent 44f71be commit abc1cae

17 files changed, +184 -90 lines changed
-7.57 KB
Binary file not shown.

docs/build/doctrees/examples.doctree

5.92 KB
Binary file not shown.
-16.1 KB
Binary file not shown.

docs/build/doctrees/index.doctree

-42 Bytes
Binary file not shown.

docs/build/html/_sources/examples.rst.txt

Lines changed: 32 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -85,16 +85,17 @@ Now we are ready to call the FeatureBuilder on our data. All we need to do is de
 timestamp_col = "timestamp",
 grouping_keys = ["batch_num", "round_num"],
 vector_directory = "./vector_data/",
-output_file_path_chat_level = "./jury_output_chat_level.csv",
-output_file_path_user_level = "./jury_output_user_level.csv",
-output_file_path_conv_level = "./jury_output_conversation_level.csv",
+output_file_base = "jury_output",
 turns = True
 )
 jury_feature_builder.featurize()

 Basic Input Columns
 ^^^^^^^^^^^^^^^^^^^^

+Conversation Parameters
+"""""""""""""""""""""""""
+
 * The **input_df** parameter is where you pass in your dataframe. In this case, we want to run the FeatureBuilder on the juries data that we read in!

 * The **speaker_id_col** refers to the name of the column containing a unique identifier for each speaker / participant in the conversation. Here, in the data, the name of our columns is called "speaker_nickname."
@@ -105,6 +106,8 @@ Basic Input Columns

 * If you do not pass anything in, "message" is the default value for this parameter.

+* We assume that all messages are ordered chronologically.
+
 * The **timestamp_col** refers to the name of the column containing when each utterance was said. In this case, we have exactly one timestamp for each message, stored in "timestamp."

 * If you do not pass anything in, "timestamp" is the default value for this parameter.
@@ -125,21 +128,39 @@ Basic Input Columns

 conversation_id_col = "batch_num"

+Vector Directory
+""""""""""""""""""
+
 * The **vector_directory** is the name of a directory in which we will store some pre-processed information. Some features require running inference from HuggingFace's `RoBERTa-based sentiment model <https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment>`_, and others require generating `SBERT vectors <https://sbert.net/>`_. These processes take time, and we cache the outputs so that subsequent runs of the FeatureBuilder on the same dataset will not take as much time. Therefore, we require you to pass in a location where you'd like us to save these outputs.

+* By default, the directory is named "vector_data/."
+
 * **Note that we do not require the name of the vector directory to be a folder that already exists**; if it doesn't exist, we will create it for you.

 * Inside the folder, we will store the RoBERTa outputs in a subfolder called "sentiment", and the SBERT vectors in a subfolder called "sentence." We will create both of these subfolders for you.

 * The **turns** parameter, which we will discuss later, controls whether or not you'd like the FeatureBuilder to treat successive utterances by the same individual as a single "turn," or whether you'd like them to be treated separately. We will cache different versions of outputs based on this parameter; we use a subfolder called "chats" (when **turns=False**) or "turns" (when **turns=True**).

-* There are three output files for each run of the FeatureBuilder, which mirror the three levels of analysis: utterance-, speaker-, and conversation-level. (Please see the section on `Generating Features: Utterance-, Speaker-, and Conversation-Level <intro#generating_features>`_ for more details.) However, this means that we require you to provide a path for where you would like us to store each of the output files; **output_file_path_chat_level** (Utterance- or Chat-Level Features), **output_file_path_user_level** (Speaker- or User-Level Features), and **output_file_path_conv_level** (Conversation-Level Features).
+.. _output_file_details:
+
+Output File Naming Details
+""""""""""""""""""""""""""""
+
+* There are three output files for each run of the FeatureBuilder, which mirror the three levels of analysis: utterance-, speaker-, and conversation-level. (Please see the section on `Generating Features: Utterance-, Speaker-, and Conversation-Level <intro#generating_features>`_ for more details.) These are generated using the **output_file_base** parameter.
+
+* **All of the outputs will be generated in a folder called "output."**
+
+* Within the "output" folder, **we generate sub-folders such that the three files will be located in subfolders called "chat," "user," and "conv," respectively.**
+
+* Similar to the **vector_directory** parameter, the "chat" directory will be renamed to "turn" depending on the value of the **turns** parameter.
+
+* It is possible to generate different names for each of the three output files, rather than using the same base file path by modifying **output_file_path_chat_level** (Utterance- or Chat-Level Features), **output_file_path_user_level** (Speaker- or User-Level Features), and **output_file_path_conv_level** (Conversation-Level Features). However, because outputs are organized in the specific locations described above, **we have specific requirements for inputting the output paths, and we will modify the path under the hood to match our file naming schema,** rather than saving the file directly to the specified location.

 * We expect that you pass in a **path**, not just a filename. For example, the path needs to be "./my_file.csv", and not just "my_file.csv"; you will get an error if you pass in only a name without the "/".

-* Regardless of your path location, we will automatically append the name "output" to the fornt of your file path, such that **all of the outputs will be generated in a folder called "output."**
+* Regardless of your path location, we will automatically append the name "output" to the fornt of your file path.

-* Within the "output" folder, **we will also generate sub-folders such that the three files will be located in subfolders called "chat," "user," and "conv," respectively.**
+* Within the "output" folder, **we will also generate the chat/user/conv sub-folders.**

 * If you pass in a path that already contains the above automatically-generated elements (for example, "./output/chat/my_chat_features.csv"), we will skip these steps and directly save it in the relevant folder.

@@ -153,14 +174,18 @@ Basic Input Columns

 output_file_path_chat_level = "./output/chat/jury_output_chat_level.csv"

-* And these two ways of specifying an output path are equivalent, assumign that turns=True:
+* And these two ways of specifying an output path are equivalent, assuming that turns=True:

 .. code-block:: python

 output_file_path_chat_level = "./jury_output_turn_level.csv"

 output_file_path_chat_level = "./output/turn/jury_output_turn_level.csv"

+
+Turns
+""""""
+
 * The **turns** parameter controls whether we want to treat successive messages from the same person as a single turn. For example, in a text conversation, sometimes individuals will send many message in rapid succession, as follows:

 * **John**: Hey Michael
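
To make the **turns** behavior described in this hunk concrete, here is a minimal sketch of how successive messages from one speaker could be merged into a single turn. It is not the FeatureBuilder's actual implementation: the function name `collapse_turns`, the `(speaker, text)` tuple layout, every sample message after the first, and the choice to join texts with a single space are assumptions made purely for illustration. Like the documentation, it assumes the messages are already in chronological order.

```python
from itertools import groupby


def collapse_turns(messages):
    """Merge consecutive messages by the same speaker into one turn.

    `messages` is a list of (speaker, text) tuples, assumed to be in
    chronological order. Joining with a space is an illustrative choice,
    not necessarily what the FeatureBuilder does.
    """
    turns = []
    # groupby only groups *adjacent* items with the same key, which is
    # exactly the "successive messages by the same person" notion.
    for speaker, group in groupby(messages, key=lambda m: m[0]):
        turns.append((speaker, " ".join(text for _, text in group)))
    return turns


# Hypothetical conversation; only the first message appears in the docs.
sample = [
    ("John", "Hey Michael"),
    ("John", "How are you?"),
    ("Michael", "Doing well"),
    ("John", "Great"),
]
collapsed = collapse_turns(sample)
```

With `turns = False` each of the four sample messages would stay a separate row; with `turns = True` the first two would be treated as one turn by John.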

docs/build/html/_sources/index.rst.txt

Lines changed: 4 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -62,11 +62,10 @@ Once you import the tool, you will be able to declare a FeatureBuilder object, w
 timestamp_col= "timestamp",
 # this is where we'll cache things like sentence vectors; this directory doesn't have to exist; we'll create it for you!
 vector_directory = "./vector_data/",
-# give us names for the utterance (chat), speaker (user), and conversation-level outputs
-output_file_path_chat_level = "./my_output_chat_level.csv",
-output_file_path_user_level = "./my_output_user_level.csv",
-output_file_path_conv_level = "./my_output_conversation_level.csv",
-# if true, this will combine successive turns by the same speaker.
+# this will be the base file path for which we generate the three outputs;
+# you will get your outputs in output/chat/my_output_chat_level.csv; output/conv/my_output_conv_level.csv; and output/user/my_output_user_level.
+output_file_base = "my_output"
+# it will also store the output into output/turns/my_output_chat_level.csv
 turns = False,
 # these features depend on sentence vectors, so they take longer to generate on larger datasets. Add them in manually if you are interested in adding them to your output!
 custom_features = [
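
The naming scheme that the new `output_file_base` comments describe can be sketched as follows. This is a hedged illustration rather than the library's code: `derive_output_paths` is a hypothetical helper, the `.csv` extension on the user-level file is assumed (the comment in the diff ends before the extension), and the "turn" folder name follows the examples page, whereas the comment added here says "output/turns/", so the real folder name may differ.

```python
def derive_output_paths(output_file_base: str, turns: bool = False) -> dict:
    """Build the three output locations from a single base name.

    Hypothetical helper: the folder names come from the documentation in
    this commit, but the exact behavior of the FeatureBuilder is assumed.
    """
    # The docs say the "chat" folder is renamed when turns=True; "turn"
    # is an assumption (one comment in the diff uses "turns" instead).
    chat_dir = "turn" if turns else "chat"
    return {
        "chat": f"output/{chat_dir}/{output_file_base}_chat_level.csv",
        "user": f"output/user/{output_file_base}_user_level.csv",
        "conv": f"output/conv/{output_file_base}_conv_level.csv",
    }


paths = derive_output_paths("my_output")
turn_paths = derive_output_paths("my_output", turns=True)
```

Only the utterance-level location depends on `turns`; the user- and conversation-level paths are the same in both calls.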

docs/build/html/examples.html

Lines changed: 34 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -160,16 +160,16 @@ <h3>Configuring the FeatureBuilder<a class="headerlink" href="#configuring-the-f
 <span class="n">timestamp_col</span> <span class="o">=</span> <span class="s2">&quot;timestamp&quot;</span><span class="p">,</span>
 <span class="n">grouping_keys</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&quot;batch_num&quot;</span><span class="p">,</span> <span class="s2">&quot;round_num&quot;</span><span class="p">],</span>
 <span class="n">vector_directory</span> <span class="o">=</span> <span class="s2">&quot;./vector_data/&quot;</span><span class="p">,</span>
-<span class="n">output_file_path_chat_level</span> <span class="o">=</span> <span class="s2">&quot;./jury_output_chat_level.csv&quot;</span><span class="p">,</span>
-<span class="n">output_file_path_user_level</span> <span class="o">=</span> <span class="s2">&quot;./jury_output_user_level.csv&quot;</span><span class="p">,</span>
-<span class="n">output_file_path_conv_level</span> <span class="o">=</span> <span class="s2">&quot;./jury_output_conversation_level.csv&quot;</span><span class="p">,</span>
+<span class="n">output_file_base</span> <span class="o">=</span> <span class="s2">&quot;jury_output&quot;</span><span class="p">,</span>
 <span class="n">turns</span> <span class="o">=</span> <span class="kc">True</span>
 <span class="p">)</span>
 <span class="n">jury_feature_builder</span><span class="o">.</span><span class="n">featurize</span><span class="p">()</span>
 </pre></div>
 </div>
 <section id="basic-input-columns">
 <h4>Basic Input Columns<a class="headerlink" href="#basic-input-columns" title="Link to this heading"></a></h4>
+<section id="conversation-parameters">
+<h5>Conversation Parameters<a class="headerlink" href="#conversation-parameters" title="Link to this heading"></a></h5>
 <ul>
 <li><p>The <strong>input_df</strong> parameter is where you pass in your dataframe. In this case, we want to run the FeatureBuilder on the juries data that we read in!</p></li>
 <li><p>The <strong>speaker_id_col</strong> refers to the name of the column containing a unique identifier for each speaker / participant in the conversation. Here, in the data, the name of our columns is called “speaker_nickname.”</p>
@@ -183,6 +183,7 @@ <h4>Basic Input Columns<a class="headerlink" href="#basic-input-columns" title="
 <blockquote>
 <div><ul class="simple">
 <li><p>If you do not pass anything in, “message” is the default value for this parameter.</p></li>
+<li><p>We assume that all messages are ordered chronologically.</p></li>
 </ul>
 </div></blockquote>
 </li>
@@ -208,21 +209,41 @@ <h4>Basic Input Columns<a class="headerlink" href="#basic-input-columns" title="
 </div>
 </div></blockquote>
 </li>
+</ul>
+</section>
+<section id="vector-directory">
+<h5>Vector Directory<a class="headerlink" href="#vector-directory" title="Link to this heading"></a></h5>
+<ul>
 <li><p>The <strong>vector_directory</strong> is the name of a directory in which we will store some pre-processed information. Some features require running inference from HuggingFace’s <a class="reference external" href="https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment">RoBERTa-based sentiment model</a>, and others require generating <a class="reference external" href="https://sbert.net/">SBERT vectors</a>. These processes take time, and we cache the outputs so that subsequent runs of the FeatureBuilder on the same dataset will not take as much time. Therefore, we require you to pass in a location where you’d like us to save these outputs.</p>
 <blockquote>
 <div><ul class="simple">
+<li><p>By default, the directory is named “vector_data/.”</p></li>
 <li><p><strong>Note that we do not require the name of the vector directory to be a folder that already exists</strong>; if it doesn’t exist, we will create it for you.</p></li>
 <li><p>Inside the folder, we will store the RoBERTa outputs in a subfolder called “sentiment”, and the SBERT vectors in a subfolder called “sentence.” We will create both of these subfolders for you.</p></li>
 <li><p>The <strong>turns</strong> parameter, which we will discuss later, controls whether or not you’d like the FeatureBuilder to treat successive utterances by the same individual as a single “turn,” or whether you’d like them to be treated separately. We will cache different versions of outputs based on this parameter; we use a subfolder called “chats” (when <strong>turns=False</strong>) or “turns” (when <strong>turns=True</strong>).</p></li>
 </ul>
 </div></blockquote>
 </li>
-<li><p>There are three output files for each run of the FeatureBuilder, which mirror the three levels of analysis: utterance-, speaker-, and conversation-level. (Please see the section on <a class="reference external" href="intro#generating_features">Generating Features: Utterance-, Speaker-, and Conversation-Level</a> for more details.) However, this means that we require you to provide a path for where you would like us to store each of the output files; <strong>output_file_path_chat_level</strong> (Utterance- or Chat-Level Features), <strong>output_file_path_user_level</strong> (Speaker- or User-Level Features), and <strong>output_file_path_conv_level</strong> (Conversation-Level Features).</p>
+</ul>
+</section>
+<section id="output-file-naming-details">
+<span id="output-file-details"></span><h5>Output File Naming Details<a class="headerlink" href="#output-file-naming-details" title="Link to this heading"></a></h5>
+<ul>
+<li><p>There are three output files for each run of the FeatureBuilder, which mirror the three levels of analysis: utterance-, speaker-, and conversation-level. (Please see the section on <a class="reference external" href="intro#generating_features">Generating Features: Utterance-, Speaker-, and Conversation-Level</a> for more details.) These are generated using the <strong>output_file_base</strong> parameter.</p>
+<blockquote>
+<div><ul class="simple">
+<li><p><strong>All of the outputs will be generated in a folder called “output.”</strong></p></li>
+<li><p>Within the “output” folder, <strong>we generate sub-folders such that the three files will be located in subfolders called “chat,” “user,” and “conv,” respectively.</strong></p></li>
+<li><p>Similar to the <strong>vector_directory</strong> parameter, the “chat” directory will be renamed to “turn” depending on the value of the <strong>turns</strong> parameter.</p></li>
+</ul>
+</div></blockquote>
+</li>
+<li><p>It is possible to generate different names for each of the three output files, rather than using the same base file path by modifying <strong>output_file_path_chat_level</strong> (Utterance- or Chat-Level Features), <strong>output_file_path_user_level</strong> (Speaker- or User-Level Features), and <strong>output_file_path_conv_level</strong> (Conversation-Level Features). However, because outputs are organized in the specific locations described above, <strong>we have specific requirements for inputting the output paths, and we will modify the path under the hood to match our file naming schema,</strong> rather than saving the file directly to the specified location.</p>
 <blockquote>
 <div><ul class="simple">
 <li><p>We expect that you pass in a <strong>path</strong>, not just a filename. For example, the path needs to be “./my_file.csv”, and not just “my_file.csv”; you will get an error if you pass in only a name without the “/”.</p></li>
-<li><p>Regardless of your path location, we will automatically append the name “output” to the fornt of your file path, such that <strong>all of the outputs will be generated in a folder called “output.”</strong></p></li>
-<li><p>Within the “output” folder, <strong>we will also generate sub-folders such that the three files will be located in subfolders called “chat,” “user,” and “conv,” respectively.</strong></p></li>
+<li><p>Regardless of your path location, we will automatically append the name “output” to the fornt of your file path.</p></li>
+<li><p>Within the “output” folder, <strong>we will also generate the chat/user/conv sub-folders.</strong></p></li>
 <li><p>If you pass in a path that already contains the above automatically-generated elements (for example, “./output/chat/my_chat_features.csv”), we will skip these steps and directly save it in the relevant folder.</p></li>
 <li><p>Similar to the <strong>vector_directory</strong> parameter, the “chat” directory will be renamed to “turn” depending on the value of the <strong>turns</strong> parameter.</p></li>
 <li><p>This means that the following two ways of specifying an output path are equivalent, assuming that turns=False:</p></li>
@@ -233,7 +254,7 @@ <h4>Basic Input Columns<a class="headerlink" href="#basic-input-columns" title="
 </pre></div>
 </div>
 <ul class="simple">
-<li><p>And these two ways of specifying an output path are equivalent, assumign that turns=True:</p></li>
+<li><p>And these two ways of specifying an output path are equivalent, assuming that turns=True:</p></li>
 </ul>
 <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">output_file_path_chat_level</span> <span class="o">=</span> <span class="s2">&quot;./jury_output_turn_level.csv&quot;</span>

@@ -242,6 +263,11 @@ <h4>Basic Input Columns<a class="headerlink" href="#basic-input-columns" title="
 </div>
 </div></blockquote>
 </li>
+</ul>
+</section>
+<section id="turns">
+<h5>Turns<a class="headerlink" href="#turns" title="Link to this heading"></a></h5>
+<ul>
 <li><p>The <strong>turns</strong> parameter controls whether we want to treat successive messages from the same person as a single turn. For example, in a text conversation, sometimes individuals will send many message in rapid succession, as follows:</p>
 <blockquote>
 <div><ul>
@@ -260,6 +286,7 @@ <h4>Basic Input Columns<a class="headerlink" href="#basic-input-columns" title="
 </li>
 </ul>
 </section>
+</section>
 <section id="advanced-configuration-columns">
 <h4>Advanced Configuration Columns<a class="headerlink" href="#advanced-configuration-columns" title="Link to this heading"></a></h4>
 <p>More advanced users of the FeatureBuilder should consider the following optional parameters, depending on their needs.</p>
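
The rules this commit documents for the explicit **output_file_path_*_level** parameters (a path must contain a "/", an "output" folder and a chat/user/conv subfolder are added, "chat" becomes "turn" when turns=True, and a path that already contains these elements is left alone) can be sketched as below. This is an illustration under stated assumptions, not the FeatureBuilder's code: `normalize_output_path` and `LEVELS` are hypothetical names, and placing "output" directly under the file's parent directory is a guess that matches the documented "./file.csv" examples but may not match how nested paths are really handled.

```python
from pathlib import PurePosixPath

# Hypothetical constant: the three levels named in the documentation.
LEVELS = ("chat", "user", "conv")


def normalize_output_path(path: str, level: str, turns: bool = False) -> str:
    """Rewrite a user-supplied output path into the documented layout.

    Assumed behavior, reconstructed from the docs in this commit only.
    """
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    # The docs state that a bare filename without "/" is an error.
    if "/" not in path:
        raise ValueError('expected a path such as "./my_file.csv"')

    # Only the utterance-level folder is renamed when turns=True.
    subfolder = "turn" if (level == "chat" and turns) else level

    p = PurePosixPath(path)
    # Already in ".../output/<subfolder>/<file>": skip the rewriting.
    if p.parts[-3:-1] == ("output", subfolder):
        return str(p)
    return str(p.parent / "output" / subfolder / p.name)


short_form = normalize_output_path("./jury_output_chat_level.csv", "chat")
long_form = normalize_output_path("./output/chat/jury_output_chat_level.csv", "chat")
turn_form = normalize_output_path("./jury_output_turn_level.csv", "chat", turns=True)
```

Under these assumptions `short_form` and `long_form` resolve to the same location, which mirrors the equivalence the documentation describes for turns=False, and `turn_form` lands in the "turn" folder.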
