Adding my NLP post #997

Open
f-allian wants to merge 3 commits into RSE-Sheffield:master from f-allian:master

Conversation

@f-allian
Member

@f-allian f-allian self-assigned this Feb 17, 2026
@f-allian f-allian added the new post Create a new post label Feb 17, 2026
Contributor

@ns-rse ns-rse left a comment

Reads well @f-allian 👍

A couple of lines haven't been wrapped to 120 characters, and some minor questions/queries are in-line.
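
For anyone checking the wrap width locally, a minimal sketch of such a check follows. Only the 120-character limit comes from the comment above; the function name `overlong_lines` and the sample text are illustrative assumptions, not part of the repository's tooling.

```python
def overlong_lines(text, limit=120):
    """Return (line_number, length) for every line longer than `limit` characters."""
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > limit
    ]


# Hypothetical sample: line 2 is one character over the limit, line 3 is exactly at it.
sample = "short line\n" + "x" * 121 + "\n" + "y" * 120
for number, length in overlong_lines(sample):
    print(f"line {number}: {length} characters")
```

A Markdown linter run as a pre-commit hook would normally do this job; the sketch only shows what is being checked.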

artificially upsampling of rare labels in our approach. Instead, we relied on strict stratified sampling across our
training, validation, and test splits that mimics the raw dataset's proportions and reduces the model's bias. This
guarantees that rare technological domains are preserved and adequately represented across all phases of model
development. A summary of the data splits is shown in Table 1.
Contributor

@ns-rse ns-rse Feb 24, 2026

Could have embedded links to Table 1 here, and for other tables/figures, even though this table is adjacent and the document isn't too complicated (it's a habit I have from using LaTeX to link internally to figures/tables).
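
The stratified splitting described in the quoted passage can be sketched roughly as below. This is a simplified illustration, not the post's actual procedure: it assumes each item has a single label (the post's task is multi-label, which needs a more involved scheme), and the 80/10/10 fractions, the function name `stratified_split`, and the toy labels are all assumptions, since Table 1 is not shown here.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split item indices into train/val/test so each label keeps its share.

    Assumes one label per item; the fractions are illustrative defaults.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train, val, test = [], [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)  # shuffle within the label group only
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


# Toy data: 80 "common" items and 20 "rare" items.
toy_labels = ["common"] * 80 + ["rare"] * 20
train_idx, val_idx, test_idx = stratified_split(toy_labels)
rare_counts = [sum(toy_labels[i] == "rare" for i in part) for part in (train_idx, val_idx, test_idx)]
print(rare_counts)
```

Because the split is done per label, the rare label appears in every split in the same proportion as in the full toy dataset, without any upsampling.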

architecture of our multi-label text classification pipeline involves the following three main steps:

1. **Preprocessing:** Raw abstracts are tokenised (up to a maximum sequence length of 512 tokens). Each token is mapped to
a 768-dimensional embedding vector, and the final hidden state of the classification token ([CLS]) is pooled to create a
Contributor

Why is ([CLS]) in both parentheses and square brackets? Is it meant to be a hyperlink to something?

a 768-dimensional embedding vector, and the final hidden state of the classification token ([CLS]) is pooled to create a
single, dense semantic representation of the entire abstract.

2. **Fine-tuning:** The token embeddings are passed into the pre-trained SciBERT layer to perform fine-tuning. [CLS] token
Contributor

As above for [CLS]: is it meant to be a hyperlink, with the target URL in parentheses missing?

Comment on lines +122 to +123
During validation, we treated the task as a multi-label classification problem, looking at both micro metrics (e.g.
F1-micro, which favour frequent classes) and macro metrics (e.g. F1-macro, which treat rare niche classes equally) to
Contributor

I've not heard of F1-[micro|macro] before. Would it be worth linking to or citing references that explain these for readers?
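
For readers in the same position, the difference can be shown with a small sketch. Micro-F1 pools true positives, false positives, and false negatives over all labels before computing F1, so frequent labels dominate; macro-F1 computes F1 per label and averages, so every label counts equally. The function names and toy data below are illustrative assumptions, not taken from the post.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there is nothing to score."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def micro_macro_f1(y_true, y_pred, labels):
    """Return (micro_f1, macro_f1) for multi-label data given as sets of labels."""
    counts = {label: {"tp": 0, "fp": 0, "fn": 0} for label in labels}
    for true_set, pred_set in zip(y_true, y_pred):
        for label in labels:
            if label in true_set and label in pred_set:
                counts[label]["tp"] += 1
            elif label in pred_set:
                counts[label]["fp"] += 1
            elif label in true_set:
                counts[label]["fn"] += 1

    # Macro: average the per-label F1 scores, so each label has equal weight.
    macro = sum(f1_from_counts(**c) for c in counts.values()) / len(labels)
    # Micro: pool the counts across labels first, so frequent labels dominate.
    pooled = {key: sum(c[key] for c in counts.values()) for key in ("tp", "fp", "fn")}
    micro = f1_from_counts(**pooled)
    return micro, macro


# Toy data: a classifier that always predicts the frequent label.
y_true = [{"common"}] * 8 + [{"rare"}] * 2
y_pred = [{"common"}] * 10
micro_f1, macro_f1 = micro_macro_f1(y_true, y_pred, labels=["common", "rare"])
print(f"micro={micro_f1:.3f} macro={macro_f1:.3f}")
```

In this toy case the micro score stays high while the macro score drops sharply, because the rare label is never predicted. Comparing the two is one way to see whether a model is neglecting rare classes.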


![Training and evaluation performances](/assets/images/2026-02-18-innovation-project/figure3.png)
{: style="text-align: center;"}
***Figure 3**: Training and evaluation performances of the hierarchical classifier across 20 epochs. (a) Training loss
Contributor

One other thing I just remembered...

Whilst *italics* italicises text and **bold** makes it bold, using the two adjacent in this manner could lead to confusion (I had to think a little about it, and it only really clicked when I rendered the page).

A solution to make it clearer is to use _italics_, which gives the same effect and makes the source easier to read.

This is more a matter of personal style, but if you use something like markdownlint-cli2 then, depending on how rule MD049 is configured, it might throw some errors. (Other Markdown linters are available; this is the one I commonly use as a pre-commit hook, though I may switch to the Rust-based rumdl in the future.)

@f-allian
Member Author

@ns-rse Thanks for your feedback, Neil. I've addressed the changes in a new commit.
