# Releases: techwolf-ai/workrb

## v0.5.1

### Highlights
This release introduces the infrastructure needed to reproduce and report paper benchmark results, adds the NDCG ranking metric, and resolves several robustness issues encountered during large-scale multilingual evaluation runs.
### New Features

#### NDCG metric

Added Normalized Discounted Cumulative Gain (NDCG) as a first-class ranking metric with binary relevance scoring.

- Supports both a top-k cutoff variant (`ndcg@k`) and full-list evaluation (`ndcg`). When no `@k` is specified, the metric evaluates over the entire ranked list.
- Handles edge cases: no relevant items (returns 0.0), all items relevant (returns 1.0), and `k` larger than the number of targets.
- Comprehensive test suite in `tests/test_ranking_metrics.py` (154 lines) covering hand-computed values, edge cases, torch/numpy input parity, and smoke tests for all existing metrics.
- README updated with the new metric entry.
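The binary-relevance variant described above can be sketched in a few lines. This is an illustrative stand-alone function, not the library's implementation; the name `ndcg_at_k` and its signature are assumptions for the example.

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=None):
    """Binary-relevance NDCG sketch; k=None evaluates the full ranked list."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # edge case: no relevant items -> 0.0
    # A cutoff larger than the list is clamped, handling k > num targets
    cutoff = len(ranked_ids) if k is None else min(k, len(ranked_ids))
    # DCG: gain 1 for relevant items, standard log2 position discount
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_ids[:cutoff])
        if item in relevant
    )
    # Ideal DCG: all relevant items packed at the top of the list
    ideal_hits = min(len(relevant), cutoff)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg
```

When every ranked item is relevant, DCG equals ideal DCG and the score is exactly 1.0, matching the edge-case behavior listed above.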
#### Paper results pipeline

Two new example scripts enable end-to-end reproduction of benchmark results:

- `examples/run_paper_results.py` — Runs the full benchmark suite across multilingual models (BM25, JobBERT-v3, Qwen3-0.6B) and monolingual English models (ConTeXTMatch, CurriculumMatch, JobBERT-v2).
- `examples/generate_paper_table.py` — Loads saved `results.json` files and generates a publication-ready LaTeX comparison table with model grouping, short display names, bold-best highlighting per model group, optional dataset count rows, and `\resizebox` support.
#### LaTeX results reporting

New `format_results_latex()` function in `src/workrb/metrics/reporting.py` (~290 lines) that builds a complete LaTeX table environment from multiple `BenchmarkResults` objects. Supports:

- Configurable aggregation level (per task group or per task).
- Model grouping with `\midrule` separators.
- Column renaming and ordering via a `short_names` dictionary.
- Per-group dataset count (`#D`) rows.
- Metric scaling (e.g. raw values to percentages).
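The core of such a table builder can be sketched with a simplified, hypothetical signature (plain dicts instead of `BenchmarkResults`, and only a subset of the options above):

```python
def build_latex_table(results, tasks, groups, scale=100.0):
    """Sketch of a LaTeX comparison table builder (hypothetical signature).

    results: {model_name: {task_name: raw_score}}
    groups:  list of lists of model names; groups are separated by \\midrule
    scale:   multiplier, e.g. raw [0, 1] scores to percentages
    """
    lines = [r"\begin{tabular}{l" + "c" * len(tasks) + "}", r"\toprule",
             "Model & " + " & ".join(tasks) + r" \\", r"\midrule"]
    for gi, group in enumerate(groups):
        if gi:
            lines.append(r"\midrule")  # separator between model groups
        # Bold the best score within this model group, per column
        best = {t: max(results[m][t] for m in group) for t in tasks}
        for model in group:
            cells = []
            for t in tasks:
                val = f"{results[model][t] * scale:.1f}"
                cells.append(rf"\textbf{{{val}}}"
                             if results[model][t] == best[t] else val)
            lines.append(f"{model} & " + " & ".join(cells) + r" \\")
    lines += [r"\bottomrule", r"\end{tabular}"]
    return "\n".join(lines)
```

Computing the per-group maximum inside the group loop is what makes the bolding "per model group" rather than global.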
#### Dataset count introspection

New `BenchmarkResults.get_dataset_counts()` method returns the number of datasets contributing to each task group (or task) score, respecting language aggregation filters.
#### Deduplication strategy for ranking datasets

Replaced the boolean `allow_duplicate_queries` / `allow_duplicate_targets` flags with a `DuplicateStrategy` enum offering three modes:

| Strategy | Behavior |
|---|---|
| `ALLOW` | Silently accept duplicates (no-op). |
| `RAISE` | Raise an error if duplicates are found. |
| `RESOLVE` (new default) | Deterministic deduplication: targets keep the first occurrence with index remapping; queries merge `target_indices` via set union. |

Tested in `tests/test_duplicate_strategy.py` (210 lines).
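The `RESOLVE` behavior described in the table can be sketched as follows. This is an illustrative stand-alone helper (the name `resolve_duplicates` and the list-of-pairs query format are assumptions), not the library code:

```python
def resolve_duplicates(targets, queries):
    """Sketch of the RESOLVE strategy.

    targets: list of target texts (may contain duplicates)
    queries: list of (query_text, target_indices) pairs
    """
    # Targets: keep the first occurrence of each text and build an
    # old-index -> new-index remapping table.
    unique_targets, first_seen, remap = [], {}, {}
    for old_idx, text in enumerate(targets):
        if text not in first_seen:
            first_seen[text] = len(unique_targets)
            unique_targets.append(text)
        remap[old_idx] = first_seen[text]

    # Queries: merge duplicate query texts, taking the set union of
    # their (remapped) target indices; sorting makes output deterministic.
    merged = {}
    for text, idxs in queries:
        merged.setdefault(text, set()).update(remap[i] for i in idxs)
    unique_queries = [(text, sorted(idxs)) for text, idxs in merged.items()]
    return unique_targets, unique_queries
```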
#### ConTeXTMatch query batching

`ConTeXTMatchModel._compute_rankings()` now scores queries in configurable chunks (`scoring_batch_size`, default 32) to prevent OOM from the `(num_queries, num_targets, seq_len)` intermediate tensor. Targets are encoded once and reused across all chunks. Tested in `tests/test_models/test_contextmatch_model.py` (108 lines).
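The chunking pattern is generic and can be sketched with a numpy stand-in (the real implementation is torch-based and its memory pressure comes from a 3-D token-level tensor; here a simple dot-product scorer illustrates the same loop structure):

```python
import numpy as np

def chunked_rankings(query_embs, target_embs, scoring_batch_size=32):
    """Score queries against all targets in fixed-size chunks.

    Only a (scoring_batch_size, num_targets) score block exists at any
    time; target embeddings are computed once, outside the loop, and
    reused for every chunk.
    """
    rankings = []
    for start in range(0, len(query_embs), scoring_batch_size):
        chunk = query_embs[start:start + scoring_batch_size]
        scores = chunk @ target_embs.T  # (chunk, num_targets)
        # Rank targets per query, best score first
        rankings.append(np.argsort(-scores, axis=1))
    return np.concatenate(rankings, axis=0)
```

Because each chunk is scored independently, the result is identical to scoring all queries at once, just with bounded peak memory.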
#### Version-dependent ESCO language support

`ESCO.get_supported_languages(version)` returns the correct language set per major.minor ESCO version. Languages added in v1.1 (Icelandic, Norwegian, Arabic, Ukrainian) are no longer incorrectly assumed available for v1.0.x. All ranking tasks now use this version-aware lookup.
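A minimal sketch of such a version-keyed lookup is below. The base language set is an illustrative assumption; only the four v1.1 additions come from these notes:

```python
# Illustrative base set -- the real ESCO v1.0 list is longer
BASE_LANGUAGES = {"en", "de", "fr", "es", "it", "nl"}
# Added in ESCO v1.1: Icelandic, Norwegian, Arabic, Ukrainian
ADDED_IN_1_1 = {"is", "no", "ar", "uk"}

def get_supported_languages(version):
    """Return the language set for a given ESCO version string.

    Only the major.minor components decide the set, so "1.0.5" and
    "1.0.9" resolve identically.
    """
    major, minor = (int(p) for p in version.split(".")[:2])
    languages = set(BASE_LANGUAGES)
    if (major, minor) >= (1, 1):
        languages |= ADDED_IN_1_1
    return languages
```

Keying on the parsed `(major, minor)` tuple rather than the raw string means patch releases inherit the right set automatically.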
### Bug Fixes

- Graceful handling of unsupported dataset configs — New `DatasetConfigNotSupported` exception for datasets that dynamically produce 0 queries or targets (e.g. an ESCO language/version lacking skill alternatives). `Task._load_datasets()` catches this exception and logs a warning instead of crashing. `self.dataset_ids` is updated to reflect only successfully loaded datasets.
- Float cast for prediction matrices — Prediction tensors are now explicitly cast with `.float()` before `.numpy()`, preventing dtype errors with bfloat16 models.
- Deterministic index ordering — `sorted(set(...))` replaces `list(set(...))` in `_postprocess_indices` for reproducible results.
- Reporting lint fix — Removed unnecessary parentheses in `format_results_latex` model group iteration.
### Breaking Changes

- `RankingDataset.__init__` signature — `allow_duplicate_queries` / `allow_duplicate_targets` (bool) replaced by `duplicate_query_strategy` / `duplicate_target_strategy` (`DuplicateStrategy` enum). External callers using the old boolean flags must update.
- `ConTeXTMatchModel.encode()` parameter rename — `batch_size` renamed to `encode_batch_size` for clarity.
### Files Changed (24 files)

| Category | Files | +/- |
|---|---|---|
| Metrics | `ranking.py`, `reporting.py`, `classification.py`, `__init__.py` | +330 |
| Models | `bi_encoder.py` | +127 |
| Tasks (core) | `base.py`, `ranking_base.py`, `__init__.py` | +156 |
| Tasks (ranking) | `job2skill.py`, `skill2job.py`, `skill_extraction.py`, `jobnorm.py`, `skillnorm.py`, `melo.py`, `mels.py` | +53 |
| Tasks (classification) | `job2skill.py` | +4 |
| Data | `esco.py` | +53 |
| Results | `results.py` | +54 |
| Examples | `run_paper_results.py`, `generate_paper_table.py` | +317 |
| Tests | `test_duplicate_strategy.py`, `test_contextmatch_model.py`, `test_ranking_metrics.py` | +472 |
| Config | `README.md`, `pyproject.toml`, `CHANGELOG.md` | misc |

Full Changelog: v0.5.0...v0.5.1
## v0.5.0

## v0.4.0

### Features
- Lexical baselines for ranking — Added BM25, TF-IDF (word/char n-gram), Edit Distance, and Random ranking models with optional lowercasing and unicode normalization. (`lexical_baselines.py`)
- Freelancer project ranking tasks — New cross-lingual ranking tasks for freelancer candidate and project matching. (`freelancer_project_matching.py`)
- Cross-lingual aggregation modes — Introduced `LanguageAggregationMode` enum with three modes (`monolingual_only`, `crosslingual_group_input_languages`, `crosslingual_group_output_languages`) for flexible per-language metric aggregation. Added `DatasetLanguages` type to describe input/output language sets per dataset.
- Lazy execution filtering — Added `ExecutionMode` enum (`LAZY`/`ALL`) to skip datasets incompatible with the chosen aggregation mode before evaluation, avoiding unnecessary compute.
- Language-grouped averaging — Per-task aggregation now groups datasets by language before averaging, giving equal weight to each language regardless of how many datasets it contains. A `SKIP_LANGUAGE_AGGREGATION` mode is available for the previous flat-average behavior.
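The language-grouped averaging above amounts to a two-level mean: average within each language group first, then across groups. A minimal sketch, with assumed helper and argument names:

```python
from collections import defaultdict

def language_grouped_average(dataset_scores, dataset_languages):
    """Sketch of language-grouped averaging.

    dataset_scores:    {dataset_id: score}
    dataset_languages: {dataset_id: language grouping key}
    """
    # Bucket dataset scores by language group
    groups = defaultdict(list)
    for dataset_id, score in dataset_scores.items():
        groups[dataset_languages[dataset_id]].append(score)
    # Mean within each group, then mean across groups, so each language
    # counts equally no matter how many datasets it contributes
    group_means = [sum(scores) / len(scores) for scores in groups.values()]
    return sum(group_means) / len(group_means)
```

With two English datasets scoring 0.9 and 0.7 and one German dataset scoring 0.2, the flat average would be 0.6, while the language-grouped average is (0.8 + 0.2) / 2 = 0.5.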
### Breaking Changes

- `MetricsResult.language` has been replaced by `input_languages` / `output_languages`.
- `get_dataset_language()` renamed to `get_dataset_languages()`, now returning input and output language sets.
- `language_aggregation_mode` is now a required (non-optional) parameter in `evaluate()`.
- Dataset indexing generalized from language-based to `dataset_id`-based throughout the pipeline (`language_results` renamed to `datasetid_results`).
### Refactors

- Migrated freelancer tasks to `dataset_id`-based language mapping, replacing the `Language.CROSS` sentinel with a proper `DATASET_LANGUAGES_MAP`.
- Extracted `get_language_grouping_key()` as a shared function in `types.py`, reused by both eval-time filtering and aggregation-time skipping.
- Updated docstrings project-wide to comply with NumPy style.
### Bug Fixes

- Fixed SkillSkape import.
- Included the lowercase setting in lexical baseline model names.
- Added a language field to `MetricsResult` for proper per-language aggregation.
- Removed from the examples a dataset using ESCO 1.0.5 that incorrectly defines UK as a supported language.
### Documentation

- Updated README with task/model overview, metrics explanation, and multi-model and cross-lingual examples.
- Updated CONTRIBUTING with CI/CD guidance, cross-lingual task examples, and commit formatting.
- Consolidated four aggregation example scripts into a single CLI-driven `run_benchmark_aggregation.py`.
- Added `run_all_ranking_tasks.py` for auto-discovering all registered ranking tasks and models via the registry.
### Tests

- Added regression tests for all 9 lexical baseline model variants.
- Added comprehensive tests for cross-lingual multi-dataset aggregation scenarios.
- Added task loading tests for the new MELO and MELS tasks.
- Widened regression test tolerance (`abs=1e-3`) for cross-platform stability.
### Package

- Exposed `ExecutionMode`, `LanguageAggregationMode`, and `setup_logger` from `workrb.__init__`.
- Default logger initialized to INFO on package import.
- `TaskRegistry.create()` and `ModelRegistry.create()` now log the task/model being instantiated.

Full Changelog: v0.3.0...v0.4.0
## v0.3.0

### What's Changed
- feat: add skillskape dataset by @jjzha in #23
- feat: add Job Title Similarity ranking task by @federetyk in #28
- ci: refactor tests to exclude heavy model benchmarking by default and allow manual triggering by @Mattdl in #31
- refactor: change evaluate.py to avoid ambiguity with workrb.evaluate function call by @Mattdl in #29
- chore: rename tasks to be more uniform and update README task overview by @Mattdl in #32
### New Contributors
- @jjzha made their first contribution in #23
- @federetyk made their first contribution in #28
Full Changelog: v0.2.1...v0.3.0
## v0.2.0

### What's Changed
- feat: Contribution of ConTeXTMatch model by @warreveys in #18
- feat: add curriculum encoder and benchmark tests (#19) by @AleksanderB-hub in #20
- docs: add citation reference by @Mattdl in #15
- docs: README fix license shield by @Mattdl in #16
- fix: wrong order attributes evaluate call in evaluate_multiple_models function by @warreveys in #17
- fix: usage example by @Mattdl in #21
### New Contributors
- @warreveys made their first contribution in #17
- @AleksanderB-hub made their first contribution in #20
Full Changelog: https://github.com/techwolf-ai/workrb/blob/main/CHANGELOG.md
Diff: v0.1.0...v0.2.0
## v0.1.0

### What's Changed
- End-to-end evaluation with WorkRB for multiple Ranking and Classification tasks
- End-to-end and unit test coverage of 78%
- PyPI release GitHub workflow automation
- Update issue templates by @Mattdl in #11
- Docs by @Mattdl in #12
Full Changelog: https://github.com/techwolf-ai/workrb/commits/v0.1.0