Full paper accepted at the 33rd IEEE International Requirements Engineering Conference (RE 2025), Research Track.
This artifact supports the replication of the study presented in the paper "What About Emotions? Guiding Fine-Grained Emotion Extraction from Mobile App Reviews", accepted at the 33rd IEEE International Requirements Engineering Conference (RE 2025). It provides a comprehensive framework for conducting fine-grained emotion analysis of mobile app reviews using both human and large language model (LLM)-based annotations.
The artifact includes:
- Input: A dataset of user reviews, emotion annotation guidelines, and ground truth annotations from human annotators.
- Process: Scripts for generating emotion annotations via LLMs (GPT-4o, Mistral Large 2, and Gemini 2.0 Flash), splitting annotations into iterations, computing agreement metrics (e.g., Cohen’s Kappa), and evaluating correctness and cost-efficiency.
- Output: Annotated datasets (human and LLM-generated), agreement analyses, emotion statistics, and evaluation metrics including accuracy, precision, recall, and F1 score.
The artifact was developed to ensure transparency, reproducibility, and extensibility of the experimental pipeline. It enables researchers to replicate, validate, or extend the emotion annotation process across different LLMs and configurations, contributing to the broader goal of integrating emotional insights into requirements engineering practices.
The artifact is available at https://doi.org/10.6084/m9.figshare.28548638.
Find how to cite this replication package and author information at the end of this README file.
This replication package is organized into three main parts:
- Literature review: results from the literature review on opinion mining and emotion analysis within the context of software-based reviews.
- Data: data used in the study, including user reviews (input), human annotations (ground truth), and LLM-based annotations (generated by the assistants).
- Code: code used in the study, including the generative annotation, data processing, and evaluation.
Study selection and results are available in the literature_review/study-selection.xlsx file. This file contains the following sheets:
- `iteration_1_IC_analysis`: results from the first iteration of the inclusion criteria analysis.
- `iteration_1_feature_extraction`: results from the first iteration of the feature extraction analysis.
- `iteration_2_IC_analysis`: results from the second iteration of the inclusion criteria analysis.
- `iteration_2_feature_extraction`: results from the second iteration of the feature extraction analysis.
- `iteration_3_IC_analysis`: results from the third iteration of the inclusion criteria analysis.
- `iteration_3_feature_extraction`: results from the third iteration of the feature extraction analysis.
- `emotions`: statistical analysis of emotions covered by emotion taxonomies in the selected studies.
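For a quick look at these sheets, here is a minimal sketch using pandas (assuming pandas with openpyxl support is installed; this is illustrative and not part of the pipeline):

```python
# Illustrative only: list the sheets in study-selection.xlsx and preview one of them.
# Assumes pandas with openpyxl support is installed.
import pandas as pd

workbook = pd.ExcelFile("literature_review/study-selection.xlsx")
print(workbook.sheet_names)

emotions = pd.read_excel(workbook, sheet_name="emotions")
print(emotions.head())
```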
The data root folder contains the following files:
- `reviews.json` contains the reviews used in the study.
- `guidelines.txt` contains a .txt version of the annotation guidelines.
- `ground-truth.xlsx` contains the ground truth (human agreement) annotations for the reviews.
In addition, the data root folder contains the following subfolders:
- `assistants` contains the IDs of the assistants used for the generative annotation (see LLM-based annotation).
- `annotations` contains the results of the human and LLM-based annotation:
  - `iterations` contains both human and LLM-based annotations for each iteration.
  - `llm-annotations` contains the LLM-based annotations for each assistant, including results for various temperature values: low (0), medium (0.5), and high (1) (see LLM-based annotation).
- `agreements` contains the results of the agreement analysis between the human and LLM-based annotations (see Data Processing).
- `evaluation` contains the results of the evaluation of the LLM-based annotations (see Evaluation), including statistics, Cohen's Kappa, correctness, and cost-efficiency analysis (token usage and reported human annotation times).
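As a quick sanity check on the data folder, the following sketch loads the reviews and the ground truth (illustrative only, assuming pandas is available; column names are not prescribed here):

```python
# Illustrative only: load the input reviews and the ground-truth annotations.
# Assumes pandas is available; column names are not prescribed here.
import json

import pandas as pd

with open("data/reviews.json", encoding="utf-8") as f:
    reviews = json.load(f)
print(f"Loaded {len(reviews)} reviews")

ground_truth = pd.read_excel("data/ground-truth.xlsx")
print(ground_truth.columns.tolist())
```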
All artifacts in this replication package can be run on any operating system that meets the following requirements:
- OS: a Linux-based OS, macOS, or Windows with a Unix-like shell (e.g., Git Bash CLI)
- Python 3.10
Additionally, you will need at least one API key for OpenAI, Mistral, or Gemini. See Step 1 in Usage Instructions & Steps to reproduce.
⚙️ Install requirements
Create a virtual environment:
python -m venv venv
Activate the virtual environment. On a Linux-based OS or macOS:
source venv/bin/activate
On Windows with a Unix-like shell (e.g., Git Bash CLI):
source venv/Scripts/activate
Install the Python dependencies by running the following command:
pip install -r requirements.txt
Now you're ready to start the annotation process!
We structure the code available in this replication package based on the stages involved in the LLM-based annotation process.
The llm_annotation folder contains the code used to generate the LLM-based annotations.
There are two main scripts:
- `create_assistant.py` is used to create a new assistant with a particular provider and model. This class includes the definition of a common system prompt across all agents, using the `data/guidelines.txt` file as the basis.
- `annotate_emotions.py` is used to annotate a set of emotions using a previously created assistant. This script includes the assessment of the output format, as well as some common metrics for cost-efficiency analysis and output file generation.
Our research includes LLM-based annotation experiments with three LLMs: GPT-4o, Mistral Large 2, and Gemini 2.0 Flash. To illustrate the usage of the code, this README refers to the code execution for generating annotations using GPT-4o. However, full code is provided for all LLMs.
🔑 Step 1: Add your API key
If you haven't done this already, add your API key to the .env file in the root folder. For instance, for OpenAI, you can add the following:
OPENAI_API_KEY=sk-proj-...
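The annotation scripts are expected to read this key from the environment. As a minimal sketch of how such a key can be loaded, assuming the python-dotenv package is available (the actual scripts may load it differently):

```python
# Minimal sketch: read the API key from the .env file in the root folder.
# Assumes the python-dotenv package; the actual scripts may load the key differently.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
api_key = os.getenv("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("OPENAI_API_KEY not found in .env or the environment")
```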
🛠️ Step 2: Create an assistant
Create an assistant using the create_assistant script for your provider (e.g., create_assistant_openai.py). For instance, for GPT-4o, you can run the following command:
python ./code/llm_annotation/create_assistant_openai.py --guidelines ./data/guidelines.txt --model gpt-4o
This will create an assistant loading the data/guidelines.txt file and using the GPT-4o model.
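Conceptually, this step registers the guidelines as the assistant's system prompt with the chosen model. The following is an illustrative sketch using the OpenAI Python SDK (v1.x) Assistants API; the actual create_assistant_openai.py may differ in options and error handling:

```python
# Illustrative sketch of assistant creation with the OpenAI Python SDK (v1.x).
# The actual create_assistant_openai.py may differ in prompt handling and options.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

with open("data/guidelines.txt", encoding="utf-8") as f:
    guidelines = f.read()

assistant = client.beta.assistants.create(
    name="emotion-annotator",   # hypothetical name
    instructions=guidelines,    # annotation guidelines as the system prompt
    model="gpt-4o",
)
print(assistant.id)  # the ID that later annotation runs refer to
```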
📝 Step 3: Annotate emotions
Annotate emotions using the annotate_emotions script for your provider (e.g., annotate_emotions_openai.py). For instance, for GPT-4o, you can run the following command using a small subset of 100 reviews from the ground truth as an example:
python ./code/llm_annotation/annotate_emotions_openai.py --input ./data/ground-truth-small.xlsx --output ./data/annotations/llm/temperature-00/ --batch_size 10 --model gpt-4o --temperature 0 --sleep_time 10
For annotating the whole dataset, run the following command (IMPORTANT: this will take more than 60 minutes due to OpenAI, Mistral and Gemini consumption times!):
python ./code/llm_annotation/annotate_emotions_openai.py --input ./data/ground-truth.xlsx --output ./data/annotations/llm/temperature-00/ --batch_size 10 --model gpt-4o --temperature 0 --sleep_time 10
Parameters include:
- `input`: path to the input file containing the set of reviews to annotate (e.g., `data/ground-truth.xlsx`).
- `output`: path to the output folder where annotations will be saved (e.g., `data/annotations/llm/temperature-00/`).
- `batch_size`: number of reviews to annotate for each user request (e.g., 10).
- `model`: model to use for the annotation (e.g., `gpt-4o`).
- `temperature`: temperature for the model responses (e.g., 0).
- `sleep_time`: time to wait between batches, in seconds (e.g., 10).
This will annotate the emotions using the assistant created in the previous step, creating a new file with the same format as in the data/ground-truth.xlsx file.
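To clarify how batch_size and sleep_time interact, here is a simplified sketch of the batching loop (illustrative only; the prompt construction, output-format checks, and cost tracking of annotate_emotions_openai.py are omitted):

```python
# Simplified sketch of the batching loop behind --batch_size and --sleep_time.
# Illustrative only: prompt construction, output validation, and cost tracking are omitted.
import time

import pandas as pd

reviews = pd.read_excel("data/ground-truth.xlsx")
batch_size, sleep_time = 10, 10

for start in range(0, len(reviews), batch_size):
    batch = reviews.iloc[start:start + batch_size]
    # here, one request with this batch would be sent to the assistant
    print(f"Annotating reviews {start}..{start + len(batch) - 1}")
    time.sleep(sleep_time)  # wait between batches to respect rate limits
```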
In this stage, we split all annotation files into iterations and consolidate the agreement between multiple annotators or LLM runs. This logic serves both human and LLM annotations. Parameters can be updated to include more annotators or LLM runs.
✂️ Step 4: Split annotations into iterations
We split the annotations into iterations based on the number of annotators or LLM runs. For instance, for GPT-4o (run 0), we can run the following command:
python code/data_processing/split_annotations.py --input_file data/annotations/llm/temperature-00/gpt-4o-0-annotations.xlsx --output_dir data/annotations/iterations/
This facilitates the Kappa and agreement analyses in alignment with each human annotation iteration.
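As an illustration of the splitting step, the sketch below chunks one annotation file into iteration files; the chunk size and output naming are hypothetical, since the actual split_annotations.py aligns iterations with the human annotation iterations:

```python
# Illustrative only: chunk one annotation file into per-iteration files.
# The chunk size and output naming below are hypothetical; split_annotations.py
# aligns iterations with the human annotation iterations instead.
import pandas as pd

annotations = pd.read_excel("data/annotations/llm/temperature-00/gpt-4o-0-annotations.xlsx")
reviews_per_iteration = 25  # hypothetical value

for i, start in enumerate(range(0, len(annotations), reviews_per_iteration)):
    chunk = annotations.iloc[start:start + reviews_per_iteration]
    chunk.to_excel(f"data/annotations/iterations/iteration_{i}_gpt-4o-0.xlsx", index=False)
```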
🤝 Step 5: Analyse agreement
We consolidate the agreement between multiple annotators or LLM runs. For instance, for GPT-4o, we can run the following command to use the run from Step 3 (run 0) and three additional annotations (run 1, 2, and 3) already available in the replication package (NOTE: we simplify the process to speed up the analysis and avoid delays in annotation):
python code/evaluation/agreement.py --input-folder data/annotations/iterations/ --output-folder data/agreements/ --annotators gpt-4o-0 gpt-4o-1 gpt-4o-2 gpt-4o-3
For replicating our original study, run the following:
python code/evaluation/agreement.py --input-folder data/annotations/iterations/ --output-folder data/agreements/ --annotators gpt-4o-1 gpt-4o-2 gpt-4o-3
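As an illustration of what agreement consolidation can look like, the sketch below applies a simple majority vote per emotion label across annotators. This is one possible strategy shown for illustration only; the file layout and emotion columns are hypothetical, and agreement.py may use a different rule:

```python
# One possible consolidation strategy (majority vote per emotion label), for
# illustration only; agreement.py may use a different rule. The file layout and
# emotion columns below are hypothetical.
import pandas as pd

annotators = ["gpt-4o-1", "gpt-4o-2", "gpt-4o-3"]
frames = [pd.read_excel(f"data/annotations/iterations/{a}.xlsx") for a in annotators]

emotion_cols = ["joy", "sadness", "anger"]  # hypothetical subset of the taxonomy
agreement = frames[0].copy()
for emotion in emotion_cols:
    votes = sum(frame[emotion] for frame in frames)
    agreement[emotion] = (votes > len(frames) / 2).astype(int)
```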
After consolidating agreements, we can evaluate both the Cohen's Kappa agreement and correctness between the human and LLM-based annotations. Our code allows any combination of annotators and LLM runs.
📈 Step 6: Emotion statistics
We evaluate the statistics of the emotions in the annotations, including emotion frequency, distribution, and correlation between emotions. For instance, for GPT-4o and the example in this README file, we can run the following command:
python code/evaluation/emotion_statistics.py --input-file data/agreements/agreement_gpt-4o-0-gpt-4o-1-gpt-4o-2-gpt-4o-3.xlsx --output-dir data/evaluation/statistics/gpt-4o-0123
For replicating our original study, run the following:
python code/evaluation/emotion_statistics.py --input-file data/agreements/agreement_gpt-4o-1-gpt-4o-2-gpt-4o-3.xlsx --output-dir data/evaluation/statistics/gpt-4o
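The sketch below illustrates the kinds of statistics computed in this step (frequency, distribution, and correlation), assuming binary emotion columns in the agreement file; the actual column names may differ:

```python
# Illustrative sketch of the statistics computed in this step, assuming binary
# emotion columns in the agreement file; actual column names may differ.
import pandas as pd

df = pd.read_excel("data/agreements/agreement_gpt-4o-1-gpt-4o-2-gpt-4o-3.xlsx")
emotion_cols = ["joy", "sadness", "anger"]  # hypothetical subset of the taxonomy

frequency = df[emotion_cols].sum()       # how often each emotion is annotated
distribution = frequency / len(df)       # share of reviews carrying each emotion
correlation = df[emotion_cols].corr()    # co-occurrence between emotions

print(frequency, distribution, correlation, sep="\n\n")
```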
⚖️ Step 7: Cohen's Kappa pairwise agreement
We measure the average pairwise Cohen's Kappa agreement between annotators or LLM runs. For instance, for GPT-4o and the example in this README file, we can run the following command:
python code/evaluation/kappa.py --input_folder data/annotations/iterations/ --output_folder data/evaluation/kappa/ --annotators gpt-4o-0,gpt-4o-1,gpt-4o-2,gpt-4o-3
For replicating our original study, run the following:
python code/evaluation/kappa.py --input_folder data/annotations/iterations/ --output_folder data/evaluation/kappa/ --annotators gpt-4o-1,gpt-4o-2,gpt-4o-3 --exclude 0,1,2
In our analysis, we exclude iterations 0, 1 and 2 as they were used for guidelines refinement.
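For reference, the average pairwise Cohen's Kappa for a single emotion can be computed as in the sketch below (illustrative, with hypothetical binary labels and assuming scikit-learn is available; kappa.py operates on the iteration files):

```python
# Illustrative computation of the average pairwise Cohen's Kappa for one emotion,
# using hypothetical binary labels; kappa.py operates on the iteration files.
from itertools import combinations

from sklearn.metrics import cohen_kappa_score

labels = {
    "gpt-4o-1": [1, 0, 1, 1, 0],
    "gpt-4o-2": [1, 0, 0, 1, 0],
    "gpt-4o-3": [1, 1, 1, 1, 0],
}

scores = [cohen_kappa_score(labels[a], labels[b]) for a, b in combinations(labels, 2)]
print(sum(scores) / len(scores))  # average pairwise agreement
```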
✅ Step 8: LLM-based annotation correctness
We measure the correctness (accuracy, precision, recall, and F1 score) between a set of annotated reviews and a given ground truth. For instance, for GPT-4o agreement and the example in this README file, we can run the following command:
python code/evaluation/correctness.py --ground_truth data/ground-truth.xlsx --predictions data/agreements/agreement_gpt-4o-0-gpt-4o-1-gpt-4o-2-gpt-4o-3.xlsx --output_dir data/evaluation/correctness/gpt-4o
For replicating our original study, run the following:
python code/evaluation/correctness.py --ground_truth data/ground-truth.xlsx --predictions data/agreements/agreement_gpt-4o-1-gpt-4o-2-gpt-4o-3.xlsx --output_dir data/evaluation/correctness/gpt-4o
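For reference, the correctness metrics for a single emotion label can be computed as in the sketch below (illustrative, with hypothetical binary labels and assuming scikit-learn is available; correctness.py operates on the full annotation files):

```python
# Illustrative correctness computation for a single emotion label, using hypothetical
# binary labels (1 = emotion present); correctness.py operates on the full files.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0, 0, 1]  # hypothetical ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1]  # hypothetical LLM agreement labels

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```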
📝 Step 9: Check results
After completing these steps, you will be able to check all generated artefacts, including:
- LLM annotations: available at `data/annotations/llm/`
- Agreement between LLM annotations and humans: available at `data/evaluation/kappa/`
- Correctness of LLM annotations with respect to human agreement: available at `data/evaluation/correctness/`
This repository is licensed under the GPL-3.0 License. See the LICENSE file for details.
Full authors list:
- Quim Motger, Dept. of Service and Information System Engineering (ESSI), Universitat Politècnica de Catalunya, Barcelona, Spain, [email protected]
- Marc Oriol, Dept. of Service and Information System Engineering (ESSI), Universitat Politècnica de Catalunya, Barcelona, Spain, [email protected]
- Max Tiessler, Dept. of Service and Information System Engineering (ESSI), Universitat Politècnica de Catalunya, Barcelona, Spain, [email protected]
- Xavier Franch, Dept. of Service and Information System Engineering (ESSI), Universitat Politècnica de Catalunya, Barcelona, Spain, [email protected]
- Jordi Marco, Dept. of Computer Science (CS), Universitat Politècnica de Catalunya, Barcelona, Spain, [email protected]
To cite this replication package, please use the following citation format:
Q. Motger, M. Oriol, M. Tiessler, X. Franch, and J. Marco, "What About Emotions? Guiding Fine-Grained Emotion Extraction from Mobile App Reviews - Replication Package". figshare. Dataset. https://doi.org/10.6084/m9.figshare.28548638
To cite the full paper describing the research that produced these artifacts, please use the following citation format (DOI to be generated upon publication):
Q. Motger, M. Oriol, M. Tiessler, X. Franch, and J. Marco, "What About Emotions? Guiding Fine-Grained Emotion Extraction from Mobile App Reviews," in Proc. IEEE Int. Requirements Eng. Conf. (RE), 2025.