diff --git a/docs/build/doctrees/environment.pickle b/docs/build/doctrees/environment.pickle
index 8d0e0f09..f31e7e92 100644
Binary files a/docs/build/doctrees/environment.pickle and b/docs/build/doctrees/environment.pickle differ
diff --git a/docs/build/doctrees/examples.doctree b/docs/build/doctrees/examples.doctree
index 459db6af..ed989d52 100644
Binary files a/docs/build/doctrees/examples.doctree and b/docs/build/doctrees/examples.doctree differ
diff --git a/docs/build/doctrees/feature_builder.doctree b/docs/build/doctrees/feature_builder.doctree
index e13cdc16..70c1ee9d 100644
Binary files a/docs/build/doctrees/feature_builder.doctree and b/docs/build/doctrees/feature_builder.doctree differ
diff --git a/docs/build/doctrees/index.doctree b/docs/build/doctrees/index.doctree
index 1a4d56a7..f6754f42 100644
Binary files a/docs/build/doctrees/index.doctree and b/docs/build/doctrees/index.doctree differ
diff --git a/docs/build/html/_sources/examples.rst.txt b/docs/build/html/_sources/examples.rst.txt
index 637d96db..b7bc948d 100644
--- a/docs/build/html/_sources/examples.rst.txt
+++ b/docs/build/html/_sources/examples.rst.txt
@@ -85,9 +85,7 @@ Now we are ready to call the FeatureBuilder on our data. All we need to do is de
       timestamp_col = "timestamp",
       grouping_keys = ["batch_num", "round_num"],
       vector_directory = "./vector_data/",
-      output_file_path_chat_level = "./jury_output_chat_level.csv",
-      output_file_path_user_level = "./jury_output_user_level.csv",
-      output_file_path_conv_level = "./jury_output_conversation_level.csv",
+      output_file_base = "jury_output",
       turns = True
    )
    jury_feature_builder.featurize()
@@ -95,6 +93,9 @@ Now we are ready to call the FeatureBuilder on our data. All we need to do is de
 Basic Input Columns
 ^^^^^^^^^^^^^^^^^^^^

+Conversation Parameters
+"""""""""""""""""""""""""
+
 * The **input_df** parameter is where you pass in your dataframe. In this case, we want to run the FeatureBuilder on the juries data that we read in!
 * The **speaker_id_col** refers to the name of the column containing a unique identifier for each speaker / participant in the conversation. Here, in the data, our column is called "speaker_nickname."
@@ -105,6 +106,8 @@ Basic Input Columns

   * If you do not pass anything in, "message" is the default value for this parameter.

+  * We assume that all messages are ordered chronologically.
+
 * The **timestamp_col** refers to the name of the column containing when each utterance was said. In this case, we have exactly one timestamp for each message, stored in "timestamp."

   * If you do not pass anything in, "timestamp" is the default value for this parameter.
@@ -125,21 +128,39 @@ Basic Input Columns

      conversation_id_col = "batch_num"

+Vector Directory
+""""""""""""""""""
+
 * The **vector_directory** is the name of a directory in which we will store some pre-processed information. Some features require running inference from HuggingFace's `RoBERTa-based sentiment model `_, and others require generating `SBERT vectors `_. These processes take time, and we cache the outputs so that subsequent runs of the FeatureBuilder on the same dataset will not take as much time. Therefore, we require you to pass in a location where you'd like us to save these outputs.

+  * By default, the directory is named "vector_data/."
+
   * **Note that we do not require the name of the vector directory to be a folder that already exists**; if it doesn't exist, we will create it for you.

   * Inside the folder, we will store the RoBERTa outputs in a subfolder called "sentiment", and the SBERT vectors in a subfolder called "sentence." We will create both of these subfolders for you.

   * The **turns** parameter, which we will discuss later, controls whether or not you'd like the FeatureBuilder to treat successive utterances by the same individual as a single "turn," or whether you'd like them to be treated separately. We will cache different versions of outputs based on this parameter; we use a subfolder called "chats" (when **turns=False**) or "turns" (when **turns=True**).

-* There are three output files for each run of the FeatureBuilder, which mirror the three levels of analysis: utterance-, speaker-, and conversation-level. (Please see the section on `Generating Features: Utterance-, Speaker-, and Conversation-Level `_ for more details.) However, this means that we require you to provide a path for where you would like us to store each of the output files; **output_file_path_chat_level** (Utterance- or Chat-Level Features), **output_file_path_user_level** (Speaker- or User-Level Features), and **output_file_path_conv_level** (Conversation-Level Features).
+.. _output_file_details:
+
+Output File Naming Details
+""""""""""""""""""""""""""""
+
+* There are three output files for each run of the FeatureBuilder, which mirror the three levels of analysis: utterance-, speaker-, and conversation-level. (Please see the section on `Generating Features: Utterance-, Speaker-, and Conversation-Level `_ for more details.) These are generated using the **output_file_base** parameter.
+
+  * **All of the outputs will be generated in a folder called "output."**
+
+  * Within the "output" folder, **we generate sub-folders such that the three files will be located in subfolders called "chat," "user," and "conv," respectively.**
+
+  * Similar to the **vector_directory** parameter, the "chat" directory will be renamed to "turn" depending on the value of the **turns** parameter.
+
+* It is possible to generate different names for each of the three output files, rather than using the same base file path, by modifying **output_file_path_chat_level** (Utterance- or Chat-Level Features), **output_file_path_user_level** (Speaker- or User-Level Features), and **output_file_path_conv_level** (Conversation-Level Features). However, because outputs are organized in the specific locations described above, **we have specific requirements for inputting the output paths, and we will modify the path under the hood to match our file naming schema,** rather than saving the file directly to the specified location.

   * We expect that you pass in a **path**, not just a filename. For example, the path needs to be "./my_file.csv", and not just "my_file.csv"; you will get an error if you pass in only a name without the "/".

-  * Regardless of your path location, we will automatically append the name "output" to the fornt of your file path, such that **all of the outputs will be generated in a folder called "output."**
+  * Regardless of your path location, we will automatically append the name "output" to the front of your file path.

-  * Within the "output" folder, **we will also generate sub-folders such that the three files will be located in subfolders called "chat," "user," and "conv," respectively.**
+  * Within the "output" folder, **we will also generate the chat/user/conv sub-folders.**

   * If you pass in a path that already contains the above automatically-generated elements (for example, "./output/chat/my_chat_features.csv"), we will skip these steps and directly save it in the relevant folder.
@@ -153,7 +174,7 @@ Basic Input Columns

      output_file_path_chat_level = "./output/chat/jury_output_chat_level.csv"

-  * And these two ways of specifying an output path are equivalent, assumign that turns=True:
+  * And these two ways of specifying an output path are equivalent, assuming that turns=True:

   .. code-block:: python

@@ -161,6 +182,10 @@ Basic Input Columns

      output_file_path_chat_level = "./output/turn/jury_output_turn_level.csv"

+
+Turns
+""""""
+
 * The **turns** parameter controls whether we want to treat successive messages from the same person as a single turn. For example, in a text conversation, sometimes individuals will send many messages in rapid succession, as follows:

   * **John**: Hey Michael
diff --git a/docs/build/html/_sources/index.rst.txt b/docs/build/html/_sources/index.rst.txt
index fe4e036e..9e4be9bf 100644
--- a/docs/build/html/_sources/index.rst.txt
+++ b/docs/build/html/_sources/index.rst.txt
@@ -62,11 +62,10 @@ Once you import the tool, you will be able to declare a FeatureBuilder object, w
       timestamp_col= "timestamp",
       # this is where we'll cache things like sentence vectors; this directory doesn't have to exist; we'll create it for you!
       vector_directory = "./vector_data/",
-      # give us names for the utterance (chat), speaker (user), and conversation-level outputs
-      output_file_path_chat_level = "./my_output_chat_level.csv",
-      output_file_path_user_level = "./my_output_user_level.csv",
-      output_file_path_conv_level = "./my_output_conversation_level.csv",
-      # if true, this will combine successive turns by the same speaker.
+      # this will be the base file path for which we generate the three outputs;
+      # you will get your outputs in output/chat/my_output_chat_level.csv; output/conv/my_output_conv_level.csv; and output/user/my_output_user_level.
+      output_file_base = "my_output",
+      # it will also store the output into output/turns/my_output_chat_level.csv
       turns = False,
       # these features depend on sentence vectors, so they take longer to generate on larger datasets. Add them in manually if you are interested in adding them to your output!
       custom_features = [
diff --git a/docs/build/html/examples.html b/docs/build/html/examples.html
index 5adbe896..f91818c9 100644
--- a/docs/build/html/examples.html
+++ b/docs/build/html/examples.html
@@ -160,9 +160,7 @@

Configuring the FeatureBuilder
       timestamp_col = "timestamp",
       grouping_keys = ["batch_num", "round_num"],
       vector_directory = "./vector_data/",
-      output_file_path_chat_level = "./jury_output_chat_level.csv",
-      output_file_path_user_level = "./jury_output_user_level.csv",
-      output_file_path_conv_level = "./jury_output_conversation_level.csv",
+      output_file_base = "jury_output",
       turns = True
    )
    jury_feature_builder.featurize()
@@ -170,6 +168,8 @@

Configuring the FeatureBuilder

Basic Input Columns

+
+
Conversation Parameters
@@ -208,21 +209,41 @@

Basic Input Columns +
Vector Directory
+
  • The vector_directory is the name of a directory in which we will store some pre-processed information. Some features require running inference from HuggingFace’s RoBERTa-based sentiment model, and others require generating SBERT vectors. These processes take time, and we cache the outputs so that subsequent runs of the FeatureBuilder on the same dataset will not take as much time. Therefore, we require you to pass in a location where you’d like us to save these outputs.

      +
    • By default, the directory is named “vector_data/.”

    • Note that we do not require the name of the vector directory to be a folder that already exists; if it doesn’t exist, we will create it for you.

    • Inside the folder, we will store the RoBERTa outputs in a subfolder called “sentiment”, and the SBERT vectors in a subfolder called “sentence.” We will create both of these subfolders for you.

    • The turns parameter, which we will discuss later, controls whether or not you’d like the FeatureBuilder to treat successive utterances by the same individual as a single “turn,” or whether you’d like them to be treated separately. We will cache different versions of outputs based on this parameter; we use a subfolder called “chats” (when turns=False) or “turns” (when turns=True).

  • -
  • There are three output files for each run of the FeatureBuilder, which mirror the three levels of analysis: utterance-, speaker-, and conversation-level. (Please see the section on Generating Features: Utterance-, Speaker-, and Conversation-Level for more details.) However, this means that we require you to provide a path for where you would like us to store each of the output files; output_file_path_chat_level (Utterance- or Chat-Level Features), output_file_path_user_level (Speaker- or User-Level Features), and output_file_path_conv_level (Conversation-Level Features).

    +
+

+
+
Output File Naming Details
+