* The **input_df** parameter is where you pass in your dataframe. In this case, we want to run the FeatureBuilder on the juries data that we read in!
* The **speaker_id_col** refers to the name of the column containing a unique identifier for each speaker / participant in the conversation. Here, in our data, this column is named "speaker_nickname."
@@ -105,6 +106,8 @@ Basic Input Columns
* If you do not pass anything in, "message" is the default value for this parameter.
+* We assume that all messages are ordered chronologically.
+
* The **timestamp_col** refers to the name of the column containing when each utterance was said. In this case, we have exactly one timestamp for each message, stored in "timestamp."
* If you do not pass anything in, "timestamp" is the default value for this parameter.
@@ -125,21 +128,39 @@ Basic Input Columns
conversation_id_col ="batch_num"
+Vector Directory
+""""""""""""""""""
+
* The **vector_directory** is the name of a directory in which we will store some pre-processed information. Some features require running inference from HuggingFace's `RoBERTa-based sentiment model <https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment>`_, and others require generating `SBERT vectors <https://sbert.net/>`_. These processes take time, and we cache the outputs so that subsequent runs of the FeatureBuilder on the same dataset will not take as much time. Therefore, we require you to pass in a location where you'd like us to save these outputs.
+* By default, the directory is named "vector_data/."
+
* **Note that we do not require the name of the vector directory to be a folder that already exists**; if it doesn't exist, we will create it for you.
* Inside the folder, we will store the RoBERTa outputs in a subfolder called "sentiment", and the SBERT vectors in a subfolder called "sentence." We will create both of these subfolders for you.
* The **turns** parameter, which we will discuss later, controls whether or not you'd like the FeatureBuilder to treat successive utterances by the same individual as a single "turn," or whether you'd like them to be treated separately. We will cache different versions of outputs based on this parameter; we use a subfolder called "chats" (when **turns=False**) or "turns" (when **turns=True**).
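As a rough illustration of the cache layout described above, the sketch below builds the folder paths with `pathlib`. The helper names `cache_dirs` and `ensure_cache_dirs` are hypothetical and are not part of the FeatureBuilder API. The nesting order (a "chats" or "turns" folder inside "sentiment" and "sentence") is an assumption, since the text names the subfolders but not how they nest.

```python
from pathlib import Path


def cache_dirs(vector_directory="vector_data/", turns=False):
    """Return the assumed cache folders for the sentiment and sentence outputs."""
    # "turns" when successive utterances are merged, "chats" otherwise.
    level = "turns" if turns else "chats"
    root = Path(vector_directory)
    return {
        "sentiment": root / "sentiment" / level,  # RoBERTa outputs
        "sentence": root / "sentence" / level,    # SBERT vectors
    }


def ensure_cache_dirs(vector_directory="vector_data/", turns=False):
    """Create the folders if they are missing, mirroring 'we will create it for you'."""
    dirs = cache_dirs(vector_directory, turns)
    for path in dirs.values():
        path.mkdir(parents=True, exist_ok=True)
    return dirs
```

Under these assumptions, `cache_dirs("vector_data/", turns=True)["sentiment"]` points at `vector_data/sentiment/turns`.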
-* There are three output files for each run of the FeatureBuilder, which mirror the three levels of analysis: utterance-, speaker-, and conversation-level. (Please see the section on `Generating Features: Utterance-, Speaker-, and Conversation-Level <intro#generating_features>`_ for more details.) However, this means that we require you to provide a path for where you would like us to store each of the output files; **output_file_path_chat_level** (Utterance- or Chat-Level Features), **output_file_path_user_level** (Speaker- or User-Level Features), and **output_file_path_conv_level** (Conversation-Level Features).
+.. _output_file_details:
+
+Output File Naming Details
+""""""""""""""""""""""""""""
+
+* There are three output files for each run of the FeatureBuilder, which mirror the three levels of analysis: utterance-, speaker-, and conversation-level. (Please see the section on `Generating Features: Utterance-, Speaker-, and Conversation-Level <intro#generating_features>`_ for more details.) These are generated using the **output_file_base** parameter.
+
+* **All of the outputs will be generated in a folder called "output."**
+
+* Within the "output" folder, **we generate sub-folders such that the three files will be located in subfolders called "chat," "user," and "conv," respectively.**
+
+* Similar to the **vector_directory** parameter, the "chat" directory will be renamed to "turn" depending on the value of the **turns** parameter.
+
+* It is possible to generate different names for each of the three output files, rather than using the same base file path, by modifying **output_file_path_chat_level** (Utterance- or Chat-Level Features), **output_file_path_user_level** (Speaker- or User-Level Features), and **output_file_path_conv_level** (Conversation-Level Features). However, because outputs are organized in the specific locations described above, **we have specific requirements for inputting the output paths, and we will modify the path under the hood to match our file naming schema,** rather than saving the file directly to the specified location.
* We expect that you pass in a **path**, not just a filename. For example, the path needs to be "./my_file.csv", and not just "my_file.csv"; you will get an error if you pass in only a name without the "/".
-* Regardless of your path location, we will automatically append the name "output" to the fornt of your file path, such that **all of the outputs will be generated in a folder called "output."**
+* Regardless of your path location, we will automatically prepend the name "output" to the front of your file path.
-* Within the "output" folder, **we will also generate sub-folders such that the three files will be located in subfolders called "chat," "user," and "conv," respectively.**
+* Within the "output" folder, **we will also generate the chat/user/conv sub-folders.**
* If you pass in a path that already contains the above automatically-generated elements (for example, "./output/chat/my_chat_features.csv"), we will skip these steps and directly save it in the relevant folder.
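The path rules above can be sketched as a small helper. This is only an illustration of the described behavior, not the FeatureBuilder's actual implementation: the names `resolve_output_path` and `paths_from_base` are hypothetical, and placing the level folder directly before the filename, as well as restricting the sketch to relative paths, are assumptions.

```python
from pathlib import PurePosixPath

LEVELS = ("chat", "user", "conv")


def resolve_output_path(path, level, turns=False):
    """Map a user-supplied path onto the assumed output/<level>/ layout."""
    if level not in LEVELS:
        raise ValueError(f"level must be one of {LEVELS}")
    if "/" not in path:
        # The docs require a path such as "./my_file.csv", not a bare filename.
        raise ValueError("expected a path such as './my_file.csv'")
    p = PurePosixPath(path)
    if p.is_absolute():
        raise ValueError("this sketch only handles relative paths")
    # The "chat" folder is renamed to "turn" when turns=True.
    folder = "turn" if (turns and level == "chat") else level
    *dirs, filename = p.parts
    if not dirs or dirs[0] != "output":
        dirs.insert(0, "output")      # put "output" at the front
    if level == "chat" and dirs[-1] in ("chat", "turn"):
        dirs[-1] = folder             # folder already present: keep or rename
    elif dirs[-1] != folder:
        dirs.append(folder)           # add the level sub-folder
    return str(PurePosixPath(*dirs, filename))


def paths_from_base(output_file_base, turns=False):
    """Derive the three output paths from a single base name."""
    return {
        level: resolve_output_path(
            f"./{output_file_base}_{level}_level.csv", level, turns
        )
        for level in LEVELS
    }
```

With these assumptions, `resolve_output_path("./my_chat_features.csv", "chat")` and `resolve_output_path("./output/chat/my_chat_features.csv", "chat")` both resolve to `output/chat/my_chat_features.csv`, and `paths_from_base("my_output")` yields one path each under `output/chat`, `output/user`, and `output/conv`.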
* The **turns** parameter controls whether we want to treat successive messages from the same person as a single turn. For example, in a text conversation, sometimes individuals will send many messages in rapid succession, as follows:
# if true, this will combine successive turns by the same speaker.
+# this will be the base file path for which we generate the three outputs;
+# you will get your outputs in output/chat/my_output_chat_level.csv; output/conv/my_output_conv_level.csv; and output/user/my_output_user_level.
+output_file_base="my_output",
+# it will also store the output into output/turns/my_output_chat_level.csv
turns=False,
# these features depend on sentence vectors, so they take longer to generate on larger datasets. Add them in manually if you are interested in adding them to your output!
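To make the effect of `turns=True` concrete, here is a minimal sketch of collapsing successive messages by the same speaker into one turn. It is not the FeatureBuilder's own code: the function name `collapse_turns` is hypothetical, and joining the merged messages with a single space is an assumption. It expects the messages of one conversation, already in chronological order.

```python
from itertools import groupby


def collapse_turns(messages, speaker_key="speaker_nickname", text_key="message"):
    """Merge consecutive messages from the same speaker into a single turn."""
    turns = []
    # groupby only groups adjacent items, which is exactly the "successive" rule.
    for speaker, group in groupby(messages, key=lambda m: m[speaker_key]):
        texts = [m[text_key] for m in group]
        turns.append({speaker_key: speaker, text_key: " ".join(texts)})
    return turns


chat = [
    {"speaker_nickname": "A", "message": "Hi"},
    {"speaker_nickname": "A", "message": "How are you?"},
    {"speaker_nickname": "B", "message": "Good"},
    {"speaker_nickname": "A", "message": "Great"},
]
collapsed = collapse_turns(chat)
```

Here the first two messages from speaker "A" become one turn, while the later message from "A" stays separate because speaker "B" spoke in between.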
<h4>Basic Input Columns<a class="headerlink" href="#basic-input-columns" title="Link to this heading"></a></h4>
+<section id="conversation-parameters">
+<h5>Conversation Parameters<a class="headerlink" href="#conversation-parameters" title="Link to this heading"></a></h5>
<ul>
<li><p>The <strong>input_df</strong> parameter is where you pass in your dataframe. In this case, we want to run the FeatureBuilder on the juries data that we read in!</p></li>
<li><p>The <strong>speaker_id_col</strong> refers to the name of the column containing a unique identifier for each speaker / participant in the conversation. Here, in our data, this column is named “speaker_nickname.”</p>
<h5>Vector Directory<a class="headerlink" href="#vector-directory" title="Link to this heading"></a></h5>
+<ul>
<li><p>The <strong>vector_directory</strong> is the name of a directory in which we will store some pre-processed information. Some features require running inference from HuggingFace’s <a class="reference external" href="https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment">RoBERTa-based sentiment model</a>, and others require generating <a class="reference external" href="https://sbert.net/">SBERT vectors</a>. These processes take time, and we cache the outputs so that subsequent runs of the FeatureBuilder on the same dataset will not take as much time. Therefore, we require you to pass in a location where you’d like us to save these outputs.</p>
<blockquote>
<div><ul class="simple">
+<li><p>By default, the directory is named “vector_data/.”</p></li>
<li><p><strong>Note that we do not require the name of the vector directory to be a folder that already exists</strong>; if it doesn’t exist, we will create it for you.</p></li>
<li><p>Inside the folder, we will store the RoBERTa outputs in a subfolder called “sentiment”, and the SBERT vectors in a subfolder called “sentence.” We will create both of these subfolders for you.</p></li>
<li><p>The <strong>turns</strong> parameter, which we will discuss later, controls whether or not you’d like the FeatureBuilder to treat successive utterances by the same individual as a single “turn,” or whether you’d like them to be treated separately. We will cache different versions of outputs based on this parameter; we use a subfolder called “chats” (when <strong>turns=False</strong>) or “turns” (when <strong>turns=True</strong>).</p></li>
</ul>
</div></blockquote>
</li>
-<li><p>There are three output files for each run of the FeatureBuilder, which mirror the three levels of analysis: utterance-, speaker-, and conversation-level. (Please see the section on <a class="reference external" href="intro#generating_features">Generating Features: Utterance-, Speaker-, and Conversation-Level</a> for more details.) However, this means that we require you to provide a path for where you would like us to store each of the output files; <strong>output_file_path_chat_level</strong> (Utterance- or Chat-Level Features), <strong>output_file_path_user_level</strong> (Speaker- or User-Level Features), and <strong>output_file_path_conv_level</strong> (Conversation-Level Features).</p>
+</ul>
+</section>
+<section id="output-file-naming-details">
+<span id="output-file-details"></span><h5>Output File Naming Details<a class="headerlink" href="#output-file-naming-details" title="Link to this heading"></a></h5>
+<ul>
+<li><p>There are three output files for each run of the FeatureBuilder, which mirror the three levels of analysis: utterance-, speaker-, and conversation-level. (Please see the section on <a class="reference external" href="intro#generating_features">Generating Features: Utterance-, Speaker-, and Conversation-Level</a> for more details.) These are generated using the <strong>output_file_base</strong> parameter.</p>
+<blockquote>
+<div><ul class="simple">
+<li><p><strong>All of the outputs will be generated in a folder called “output.”</strong></p></li>
+<li><p>Within the “output” folder, <strong>we generate sub-folders such that the three files will be located in subfolders called “chat,” “user,” and “conv,” respectively.</strong></p></li>
+<li><p>Similar to the <strong>vector_directory</strong> parameter, the “chat” directory will be renamed to “turn” depending on the value of the <strong>turns</strong> parameter.</p></li>
+</ul>
+</div></blockquote>
+</li>
+<li><p>It is possible to generate different names for each of the three output files, rather than using the same base file path, by modifying <strong>output_file_path_chat_level</strong> (Utterance- or Chat-Level Features), <strong>output_file_path_user_level</strong> (Speaker- or User-Level Features), and <strong>output_file_path_conv_level</strong> (Conversation-Level Features). However, because outputs are organized in the specific locations described above, <strong>we have specific requirements for inputting the output paths, and we will modify the path under the hood to match our file naming schema,</strong> rather than saving the file directly to the specified location.</p>
<blockquote>
<div><ul class="simple">
<li><p>We expect that you pass in a <strong>path</strong>, not just a filename. For example, the path needs to be “./my_file.csv”, and not just “my_file.csv”; you will get an error if you pass in only a name without the “/”.</p></li>
-<li><p>Regardless of your path location, we will automatically append the name “output” to the fornt of your file path, such that <strong>all of the outputs will be generated in a folder called “output.”</strong></p></li>
-<li><p>Within the “output” folder, <strong>we will also generate sub-folders such that the three files will be located in subfolders called “chat,” “user,” and “conv,” respectively.</strong></p></li>
+<li><p>Regardless of your path location, we will automatically prepend the name “output” to the front of your file path.</p></li>
+<li><p>Within the “output” folder, <strong>we will also generate the chat/user/conv sub-folders.</strong></p></li>
<li><p>If you pass in a path that already contains the above automatically-generated elements (for example, “./output/chat/my_chat_features.csv”), we will skip these steps and directly save it in the relevant folder.</p></li>
<li><p>Similar to the <strong>vector_directory</strong> parameter, the “chat” directory will be renamed to “turn” depending on the value of the <strong>turns</strong> parameter.</p></li>
<li><p>This means that the following two ways of specifying an output path are equivalent, assuming that turns=False:</p></li>
<h5>Turns<a class="headerlink" href="#turns" title="Link to this heading"></a></h5>
+<ul>
<li><p>The <strong>turns</strong> parameter controls whether we want to treat successive messages from the same person as a single turn. For example, in a text conversation, sometimes individuals will send many messages in rapid succession, as follows:</p>