Full paper accepted at the 33rd IEEE International Requirements Engineering Conference (RE 2025), Research Track.
This artifact supports the replication of the study presented in the paper "What About Emotions? Guiding Fine-Grained Emotion Extraction from Mobile App Reviews", accepted at the 33rd IEEE International Requirements Engineering Conference (RE 2025). It provides a comprehensive framework for conducting fine-grained emotion analysis of mobile app reviews using both human and large language model (LLM)-based annotations.
The artifact includes:
- Input: A dataset of user reviews, emotion annotation guidelines, and ground truth annotations from human annotators.
- Process: Scripts for generating emotion annotations via LLMs (GPT-4o, Mistral Large 2, and Gemini 2.0 Flash), splitting annotations into iterations, computing agreement metrics (e.g., Cohen’s Kappa), and evaluating correctness and cost-efficiency.
- Output: Annotated datasets (human and LLM-generated), agreement analyses, emotion statistics, and evaluation metrics including accuracy, precision, recall, and F1 score.
The artifact was developed to ensure transparency, reproducibility, and extensibility of the experimental pipeline. It enables researchers to replicate, validate, or extend the emotion annotation process across different LLMs and configurations, contributing to the broader goal of integrating emotional insights into requirements engineering practices.
The artifact is available at https://doi.org/10.6084/m9.figshare.28548638.
Find how to cite this replication package and author information at the end of this README file.
This replication package is organized into three main parts:
- Literature review: results from the literature review on opinion mining and emotion analysis within the context of software-based reviews.
- Data: data used in the study, including user reviews (input), human annotations (ground truth), and LLM-based annotations (generated by the assistants).
- Code: code used in the study, including the generative annotation, data processing, and evaluation.
Study selection and results are available in the literature_review/study-selection.xlsx file. This file contains the following sheets:
- `iteration_1_IC_analysis`: results from the first iteration of the inclusion criteria analysis.
- `iteration_1_feature_extraction`: results from the first iteration of the feature extraction analysis.
- `iteration_2_IC_analysis`: results from the second iteration of the inclusion criteria analysis.
- `iteration_2_feature_extraction`: results from the second iteration of the feature extraction analysis.
- `iteration_3_IC_analysis`: results from the third iteration of the inclusion criteria analysis.
- `iteration_3_feature_extraction`: results from the third iteration of the feature extraction analysis.
- `emotions`: statistical analysis of emotions covered by emotion taxonomies in the selected studies.
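For a quick look at these sheets, here is a minimal sketch using pandas (assuming pandas with openpyxl support is installed; this is illustrative and not part of the pipeline):

```python
# Illustrative only: list the sheets in study-selection.xlsx and preview one of them.
# Assumes pandas with openpyxl support is installed.
import pandas as pd

workbook = pd.ExcelFile("literature_review/study-selection.xlsx")
print(workbook.sheet_names)

emotions = pd.read_excel(workbook, sheet_name="emotions")
print(emotions.head())
```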
The data root folder contains the following files:
- `reviews.json` contains the reviews used in the study.
- `guidelines.txt` contains a .txt version of the annotation guidelines.
- `ground-truth.xlsx` contains the ground truth (human agreement) annotations for the reviews.
In addition, the data root folder contains the following subfolders:
- `assistants` contains the IDs of the assistants used for the generative annotation (see LLM-based annotation).
- `annotations` contains the results of the human and LLM-based annotation:
  - `iterations` contains both human and LLM-based annotations for each iteration.
  - `llm-annotations` contains the LLM-based annotations for each assistant, including results for various temperature values: low (0), medium (0.5), and high (1) (see LLM-based annotation).
- `agreements` contains the results of the agreement analysis between the human and LLM-based annotations (see Data Processing).
- `evaluation` contains the results of the evaluation of the LLM-based annotations (see Evaluation), including statistics, Cohen's Kappa, correctness, and cost-efficiency analysis (token usage and reported human annotation times).
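As a quick sanity check on the data folder, the following sketch loads the reviews and the ground truth (illustrative only, assuming pandas is available; column names are not prescribed here):

```python
# Illustrative only: load the input reviews and the ground-truth annotations.
# Assumes pandas is available; column names are not prescribed here.
import json

import pandas as pd

with open("data/reviews.json", encoding="utf-8") as f:
    reviews = json.load(f)
print(f"Loaded {len(reviews)} reviews")

ground_truth = pd.read_excel("data/ground-truth.xlsx")
print(ground_truth.columns.tolist())
```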
All artifacts in this replication package can be run on any operating system that meets the following requirements:
- OS: a Linux-based OS, macOS, or Windows with a Unix-like shell (e.g., Git Bash CLI)
- Python 3.10
Additionally, you will need at least one API key for OpenAI, Mistral, or Gemini. See Step 1 in Usage Instructions & Steps to reproduce.
⚙️ Install requirements
Create a virtual environment:
python -m venv venv
Activate the virtual environment. On a Linux-based OS or macOS:
source venv/bin/activate
On Windows with a Unix-like shell (e.g., Git Bash CLI):
source venv/Scripts/activate
Install the Python dependencies by running the following command:
pip install -r requirements.txt
Now you're ready to start the annotation process!
We structure the code available in this replication package based on the stages involved in the LLM-based annotation process.
The llm_annotation folder contains the code used to generate the LLM-based annotations.
There are two main scripts:
- `create_assistant.py` is used to create a new assistant with a particular provider and model. This class includes the definition of a common system prompt across all agents, using the `data/guidelines.txt` file as the basis.
- `annotate_emotions.py` is used to annotate a set of emotions using a previously created assistant. This script includes the assessment of the output format, as well as some common metrics for cost-efficiency analysis and output file generation.
Our research includes LLM-based annotation experiments with three LLMs: GPT-4o, Mistral Large 2, and Gemini 2.0 Flash. To illustrate the usage of the code, this README refers to the code execution for generating annotations using GPT-4o. However, full code is provided for all LLMs.
🔑 Step 1: Add your API key
If you haven't done this already, add your API key to the .env file in the root folder. For instance, for OpenAI, you can add the following:
OPENAI_API_KEY=sk-proj-...
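The annotation scripts are expected to read this key from the environment. As a minimal sketch of how such a key can be loaded, assuming the python-dotenv package is available (the actual scripts may load it differently):

```python
# Minimal sketch: read the API key from the .env file in the root folder.
# Assumes the python-dotenv package; the actual scripts may load the key differently.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
api_key = os.getenv("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("OPENAI_API_KEY not found in .env or the environment")
```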
🛠️ Step 2: Create an assistant
Create an assistant using the create_assistant script for your provider (e.g., create_assistant_openai.py). For instance, for GPT-4o, you can run the following command:
python ./code/llm_annotation/create_assistant_openai.py --guidelines ./data/guidelines.txt --model gpt-4o
This will create an assistant loading the data/guidelines.txt file and using the GPT-4o model.
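Conceptually, this step registers the guidelines as the assistant's system prompt with the chosen model. The following is an illustrative sketch using the OpenAI Python SDK (v1.x) Assistants API; the actual create_assistant_openai.py may differ in options and error handling:

```python
# Illustrative sketch of assistant creation with the OpenAI Python SDK (v1.x).
# The actual create_assistant_openai.py may differ in prompt handling and options.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

with open("data/guidelines.txt", encoding="utf-8") as f:
    guidelines = f.read()

assistant = client.beta.assistants.create(
    name="emotion-annotator",   # hypothetical name
    instructions=guidelines,    # annotation guidelines as the system prompt
    model="gpt-4o",
)
print(assistant.id)  # the ID that later annotation runs refer to
```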
📝 Step 3: Annotate emotions
Annotate emotions using the annotate_emotions script for your provider (e.g., annotate_emotions_openai.py). For instance, for GPT-4o, you can run the following command using a small subset of 100 reviews from the ground truth as an example:
python ./code/llm_annotation/annotate_emotions_openai.py --input ./data/ground-truth-small.xlsx --output ./data/annotations/llm/temperature-00/ --batch_size 10 --model gpt-4o --temperature 0 --sleep_time 10
For annotating the whole dataset, run the following command (IMPORTANT: this will take more than 60 minutes due to OpenAI, Mistral and Gemini consumption times!):
python ./code/llm_annotation/annotate_emotions_openai.py --input ./data/ground-truth.xlsx --output ./data/annotations/llm/temperature-00/ --batch_size 10 --model gpt-4o --temperature 0 --sleep_time 10
Parameters include:
- `input`: path to the input file containing the set of reviews to annotate (e.g., `data/ground-truth.xlsx`).
- `output`: path to the output folder where annotations will be saved (e.g., `data/annotations/llm/temperature-00/`).
- `batch_size`: number of reviews to annotate for each user request (e.g., 10).
- `model`: model to use for the annotation (e.g., `gpt-4o`).
- `temperature`: temperature for the model responses (e.g., 0).
- `sleep_time`: time to wait between batches, in seconds (e.g., 10).
This will annotate the emotions using the assistant created in the previous step, creating a new file with the same format as in the data/ground-truth.xlsx file.
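To clarify how batch_size and sleep_time interact, here is a simplified sketch of the batching loop (illustrative only; the prompt construction, output-format checks, and cost tracking of annotate_emotions_openai.py are omitted):

```python
# Simplified sketch of the batching loop behind --batch_size and --sleep_time.
# Illustrative only: prompt construction, output validation, and cost tracking are omitted.
import time

import pandas as pd

reviews = pd.read_excel("data/ground-truth.xlsx")
batch_size, sleep_time = 10, 10

for start in range(0, len(reviews), batch_size):
    batch = reviews.iloc[start:start + batch_size]
    # here, one request with this batch would be sent to the assistant
    print(f"Annotating reviews {start}..{start + len(batch) - 1}")
    time.sleep(sleep_time)  # wait between batches to respect rate limits
```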
In this stage, we split all annotation files into iterations and consolidate the agreement between multiple annotators or LLM runs. This logic serves both human and LLM annotations. Parameters can be updated to include more annotators or LLM runs.
✂️ Step 4: Split annotations into iterations
We split the annotations into iterations based on the number of annotators or LLM runs. For instance, for GPT-4o (run 0), we can run the following command:
python code/data_processing/split_annotations.py --input_file data/annotations/llm/temperature-00/gpt-4o-0-annotations.xlsx --output_dir data/annotations/iterations/
This facilitates the Kappa and agreement analyses in alignment with each human annotation iteration.
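As an illustration of the splitting step, the sketch below chunks one annotation file into iteration files; the chunk size and output naming are hypothetical, since the actual split_annotations.py aligns iterations with the human annotation iterations:

```python
# Illustrative only: chunk one annotation file into per-iteration files.
# The chunk size and output naming below are hypothetical; split_annotations.py
# aligns iterations with the human annotation iterations instead.
import pandas as pd

annotations = pd.read_excel("data/annotations/llm/temperature-00/gpt-4o-0-annotations.xlsx")
reviews_per_iteration = 25  # hypothetical value

for i, start in enumerate(range(0, len(annotations), reviews_per_iteration)):
    chunk = annotations.iloc[start:start + reviews_per_iteration]
    chunk.to_excel(f"data/annotations/iterations/iteration_{i}_gpt-4o-0.xlsx", index=False)
```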
🤝 Step 5: Analyse agreement
We consolidate the agreement between multiple annotators or LLM runs. For instance, for GPT-4o, we can run the following command to use the run from Step 3 (run 0) and three additional annotations (run 1, 2, and 3) already available in the replication package (NOTE: we simplify the process to speed up the analysis and avoid delays in annotation):
python code/evaluation/agreement.py --input-folder data/annotations/iterations/ --output-folder data/agreements/ --annotators gpt-4o-0 gpt-4o-1 gpt-4o-2 gpt-4o-3
For replicating our original study, run the following:
python code/evaluation/agreement.py --input-folder data/annotations/iterations/ --output-folder data/agreements/ --annotators gpt-4o-1 gpt-4o-2 gpt-4o-3
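As an illustration of what agreement consolidation can look like, the sketch below applies a simple majority vote per emotion label across annotators. This is one possible strategy shown for illustration only; the file layout and emotion columns are hypothetical, and agreement.py may use a different rule:

```python
# One possible consolidation strategy (majority vote per emotion label), for
# illustration only; agreement.py may use a different rule. The file layout and
# emotion columns below are hypothetical.
import pandas as pd

annotators = ["gpt-4o-1", "gpt-4o-2", "gpt-4o-3"]
frames = [pd.read_excel(f"data/annotations/iterations/{a}.xlsx") for a in annotators]

emotion_cols = ["joy", "sadness", "anger"]  # hypothetical subset of the taxonomy
agreement = frames[0].copy()
for emotion in emotion_cols:
    votes = sum(frame[emotion] for frame in frames)
    agreement[emotion] = (votes > len(frames) / 2).astype(int)
```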
After consolidating agreements, we can evaluate both the Cohen's Kappa agreement and correctness between the human and LLM-based annotations. Our code allows any combination of annotators and LLM runs.
📈 Step 6: Emotion statistics
We evaluate the statistics of the emotions in the annotations, including emotion frequency, distribution, and correlation between emotions. For instance, for GPT-4o and the example in this README file, we can run the following command:
python code/evaluation/emotion_statistics.py --input-file data/agreements/agreement_gpt-4o-0-gpt-4o-1-gpt-4o-2-gpt-4o-3.xlsx --output-dir data/evaluation/statistics/gpt-4o-0123
For replicating our original study, run the following:
python code/evaluation/emotion_statistics.py --input-file data/agreements/agreement_gpt-4o-1-gpt-4o-2-gpt-4o-3.xlsx --output-dir data/evaluation/statistics/gpt-4o
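The sketch below illustrates the kinds of statistics computed in this step (frequency, distribution, and correlation), assuming binary emotion columns in the agreement file; the actual column names may differ:

```python
# Illustrative sketch of the statistics computed in this step, assuming binary
# emotion columns in the agreement file; actual column names may differ.
import pandas as pd

df = pd.read_excel("data/agreements/agreement_gpt-4o-1-gpt-4o-2-gpt-4o-3.xlsx")
emotion_cols = ["joy", "sadness", "anger"]  # hypothetical subset of the taxonomy

frequency = df[emotion_cols].sum()       # how often each emotion is annotated
distribution = frequency / len(df)       # share of reviews carrying each emotion
correlation = df[emotion_cols].corr()    # co-occurrence between emotions

print(frequency, distribution, correlation, sep="\n\n")
```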
⚖️ Step 7: Cohen's Kappa pairwise agreement
We measure the average pairwise Cohen's Kappa agreement between annotators or LLM runs. For instance, for GPT-4o and the example in this README file, we can run the following command:
python code/evaluation/kappa.py --input_folder data/annotations/iterations/ --output_folder data/evaluation/kappa/ --annotators gpt-4o-0,gpt-4o-1,gpt-4o-2,gpt-4o-3
For replicating our original study, run the following:
python code/evaluation/kappa.py --input_folder data/annotations/iterations/ --output_folder data/evaluation/kappa/ --annotators gpt-4o-1,gpt-4o-2,gpt-4o-3 --exclude 0,1,2
In our analysis, we exclude iterations 0, 1 and 2 as they were used for guidelines refinement.
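For reference, the average pairwise Cohen's Kappa for a single emotion can be computed as in the sketch below (illustrative, with hypothetical binary labels and assuming scikit-learn is available; kappa.py operates on the iteration files):

```python
# Illustrative computation of the average pairwise Cohen's Kappa for one emotion,
# using hypothetical binary labels; kappa.py operates on the iteration files.
from itertools import combinations

from sklearn.metrics import cohen_kappa_score

labels = {
    "gpt-4o-1": [1, 0, 1, 1, 0],
    "gpt-4o-2": [1, 0, 0, 1, 0],
    "gpt-4o-3": [1, 1, 1, 1, 0],
}

scores = [cohen_kappa_score(labels[a], labels[b]) for a, b in combinations(labels, 2)]
print(sum(scores) / len(scores))  # average pairwise agreement
```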
✅ Step 8: LLM-based annotation correctness
We measure the correctness (accuracy, precision, recall, and F1 score) between a set of annotated reviews and a given ground truth. For instance, for GPT-4o agreement and the example in this README file, we can run the following command:
python code/evaluation/correctness.py --ground_truth data/ground-truth.xlsx --predictions data/agreements/agreement_gpt-4o-0-gpt-4o-1-gpt-4o-2-gpt-4o-3.xlsx --output_dir data/evaluation/correctness/gpt-4o
For replicating our original study, run the following:
python code/evaluation/correctness.py --ground_truth data/ground-truth.xlsx --predictions data/agreements/agreement_gpt-4o-1-gpt-4o-2-gpt-4o-3.xlsx --output_dir data/evaluation/correctness/gpt-4o
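For reference, the correctness metrics for a single emotion label can be computed as in the sketch below (illustrative, with hypothetical binary labels and assuming scikit-learn is available; correctness.py operates on the full annotation files):

```python
# Illustrative correctness computation for a single emotion label, using hypothetical
# binary labels (1 = emotion present); correctness.py operates on the full files.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0, 0, 1]  # hypothetical ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1]  # hypothetical LLM agreement labels

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```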
📝 Step 9: Check results
After completing these steps, you will be able to check all generated artefacts, including:
- LLM annotations: available at `data/annotations/llm/`
- Agreement between LLM annotations and humans: available at `data/evaluation/kappa/`
- Correctness of LLM annotations with respect to human agreement: available at `data/evaluation/correctness/`
This repository is licensed under the GPL-3.0 License. See the LICENSE file for details.
Full authors list:
- Quim Motger, Dept. of Service and Information System Engineering (ESSI), Universitat Politècnica de Catalunya, Barcelona, Spain, [email protected]
- Marc Oriol, Dept. of Service and Information System Engineering (ESSI), Universitat Politècnica de Catalunya, Barcelona, Spain, [email protected]
- Max Tiessler, Dept. of Service and Information System Engineering (ESSI), Universitat Politècnica de Catalunya, Barcelona, Spain, [email protected]
- Xavier Franch, Dept. of Service and Information System Engineering (ESSI), Universitat Politècnica de Catalunya, Barcelona, Spain, [email protected]
- Jordi Marco, Dept. of Computer Science (CS), Universitat Politècnica de Catalunya, Barcelona, Spain, [email protected]
To cite this replication package, please use the following citation format:
Q. Motger, M. Oriol, M. Tiessler, X. Franch, and J. Marco, "What About Emotions? Guiding Fine-Grained Emotion Extraction from Mobile App Reviews - Replication Package". figshare. Dataset. https://doi.org/10.6084/m9.figshare.28548638
To cite the full paper describing the research that produced these artifacts, please use the following citation format (DOI to be generated upon publication):
Q. Motger, M. Oriol, M. Tiessler, X. Franch, and J. Marco, "What About Emotions? Guiding Fine-Grained Emotion Extraction from Mobile App Reviews," in Proc. IEEE Int. Requirements Eng. Conf. (RE), 2025.