# Releases: techwolf-ai/workrb

## v0.5.1

### Highlights
This release introduces the infrastructure needed to reproduce and report paper benchmark results, adds the NDCG ranking metric, and resolves several robustness issues encountered during large-scale multilingual evaluation runs.
### New Features

#### NDCG metric

Added Normalized Discounted Cumulative Gain (NDCG) as a first-class ranking metric with binary relevance scoring.

- Supports both a top-k cutoff variant (`ndcg@k`) and full-list evaluation (`ndcg`). When no `@k` is specified, the metric evaluates over the entire ranked list.
- Handles edge cases: no relevant items (returns 0.0), all items relevant (returns 1.0), and `k` larger than the number of targets.
- Comprehensive test suite in `tests/test_ranking_metrics.py` (154 lines) covering hand-computed values, edge cases, torch/numpy input parity, and smoke tests for all existing metrics.
- README updated with the new metric entry.
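The binary-relevance variant described above can be sketched in a few lines. This is an illustrative stand-alone function, not the library's implementation; the name `ndcg_at_k` and its signature are assumptions for the example.

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=None):
    """Binary-relevance NDCG sketch; k=None evaluates the full ranked list."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # edge case: no relevant items -> 0.0
    # A cutoff larger than the list is clamped, handling k > num targets
    cutoff = len(ranked_ids) if k is None else min(k, len(ranked_ids))
    # DCG: gain 1 for relevant items, standard log2 position discount
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_ids[:cutoff])
        if item in relevant
    )
    # Ideal DCG: all relevant items packed at the top of the list
    ideal_hits = min(len(relevant), cutoff)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg
```

When every ranked item is relevant, DCG equals ideal DCG and the score is exactly 1.0, matching the edge-case behavior listed above.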
#### Paper results pipeline

Two new example scripts enable end-to-end reproduction of benchmark results:

- `examples/run_paper_results.py` — Runs the full benchmark suite across multilingual models (BM25, JobBERT-v3, Qwen3-0.6B) and monolingual English models (ConTeXTMatch, CurriculumMatch, JobBERT-v2).
- `examples/generate_paper_table.py` — Loads saved `results.json` files and generates a publication-ready LaTeX comparison table with model grouping, short display names, bold-best highlighting per model group, optional dataset count rows, and `\resizebox` support.
#### LaTeX results reporting

New `format_results_latex()` function in `src/workrb/metrics/reporting.py` (~290 lines) that builds a complete LaTeX table environment from multiple `BenchmarkResults` objects. Supports:

- Configurable aggregation level (per task group or per task).
- Model grouping with `\midrule` separators.
- Column renaming and ordering via a `short_names` dictionary.
- Per-group dataset count (`#D`) rows.
- Metric scaling (e.g. raw values to percentages).
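The core of such a table builder can be sketched with a simplified, hypothetical signature (plain dicts instead of `BenchmarkResults`, and only a subset of the options above):

```python
def build_latex_table(results, tasks, groups, scale=100.0):
    """Sketch of a LaTeX comparison table builder (hypothetical signature).

    results: {model_name: {task_name: raw_score}}
    groups:  list of lists of model names; groups are separated by \\midrule
    scale:   multiplier, e.g. raw [0, 1] scores to percentages
    """
    lines = [r"\begin{tabular}{l" + "c" * len(tasks) + "}", r"\toprule",
             "Model & " + " & ".join(tasks) + r" \\", r"\midrule"]
    for gi, group in enumerate(groups):
        if gi:
            lines.append(r"\midrule")  # separator between model groups
        # Bold the best score within this model group, per column
        best = {t: max(results[m][t] for m in group) for t in tasks}
        for model in group:
            cells = []
            for t in tasks:
                val = f"{results[model][t] * scale:.1f}"
                cells.append(rf"\textbf{{{val}}}"
                             if results[model][t] == best[t] else val)
            lines.append(f"{model} & " + " & ".join(cells) + r" \\")
    lines += [r"\bottomrule", r"\end{tabular}"]
    return "\n".join(lines)
```

Computing the per-group maximum inside the group loop is what makes the bolding "per model group" rather than global.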
#### Dataset count introspection

New `BenchmarkResults.get_dataset_counts()` method returns the number of datasets contributing to each task group (or task) score, respecting language aggregation filters.
#### Deduplication strategy for ranking datasets

Replaced the boolean `allow_duplicate_queries` / `allow_duplicate_targets` flags with a `DuplicateStrategy` enum offering three modes:

| Strategy | Behavior |
|---|---|
| `ALLOW` | Silently accept duplicates (no-op). |
| `RAISE` | Raise an error if duplicates are found. |
| `RESOLVE` (new default) | Deterministic deduplication: targets keep the first occurrence with index remapping; queries merge `target_indices` via set union. |

Tested in `tests/test_duplicate_strategy.py` (210 lines).
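The `RESOLVE` behavior described in the table can be sketched as follows. This is an illustrative stand-alone helper (the name `resolve_duplicates` and the list-of-pairs query format are assumptions), not the library code:

```python
def resolve_duplicates(targets, queries):
    """Sketch of the RESOLVE strategy.

    targets: list of target texts (may contain duplicates)
    queries: list of (query_text, target_indices) pairs
    """
    # Targets: keep the first occurrence of each text and build an
    # old-index -> new-index remapping table.
    unique_targets, first_seen, remap = [], {}, {}
    for old_idx, text in enumerate(targets):
        if text not in first_seen:
            first_seen[text] = len(unique_targets)
            unique_targets.append(text)
        remap[old_idx] = first_seen[text]

    # Queries: merge duplicate query texts, taking the set union of
    # their (remapped) target indices; sorting makes output deterministic.
    merged = {}
    for text, idxs in queries:
        merged.setdefault(text, set()).update(remap[i] for i in idxs)
    unique_queries = [(text, sorted(idxs)) for text, idxs in merged.items()]
    return unique_targets, unique_queries
```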
#### ConTeXTMatch query batching

`ConTeXTMatchModel._compute_rankings()` now scores queries in configurable chunks (`scoring_batch_size`, default 32) to prevent OOM from the `(num_queries, num_targets, seq_len)` intermediate tensor. Targets are encoded once and reused across all chunks. Tested in `tests/test_models/test_contextmatch_model.py` (108 lines).
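The chunking pattern is generic and can be sketched with a numpy stand-in (the real implementation is torch-based and its memory pressure comes from a 3-D token-level tensor; here a simple dot-product scorer illustrates the same loop structure):

```python
import numpy as np

def chunked_rankings(query_embs, target_embs, scoring_batch_size=32):
    """Score queries against all targets in fixed-size chunks.

    Only a (scoring_batch_size, num_targets) score block exists at any
    time; target embeddings are computed once, outside the loop, and
    reused for every chunk.
    """
    rankings = []
    for start in range(0, len(query_embs), scoring_batch_size):
        chunk = query_embs[start:start + scoring_batch_size]
        scores = chunk @ target_embs.T  # (chunk, num_targets)
        # Rank targets per query, best score first
        rankings.append(np.argsort(-scores, axis=1))
    return np.concatenate(rankings, axis=0)
```

Because each chunk is scored independently, the result is identical to scoring all queries at once, just with bounded peak memory.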
#### Version-dependent ESCO language support

`ESCO.get_supported_languages(version)` returns the correct language set per major.minor ESCO version. Languages added in v1.1 (Icelandic, Norwegian, Arabic, Ukrainian) are no longer incorrectly assumed available for v1.0.x. All ranking tasks now use this version-aware lookup.
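A minimal sketch of such a version-keyed lookup is below. The base language set is an illustrative assumption; only the four v1.1 additions come from these notes:

```python
# Illustrative base set -- the real ESCO v1.0 list is longer
BASE_LANGUAGES = {"en", "de", "fr", "es", "it", "nl"}
# Added in ESCO v1.1: Icelandic, Norwegian, Arabic, Ukrainian
ADDED_IN_1_1 = {"is", "no", "ar", "uk"}

def get_supported_languages(version):
    """Return the language set for a given ESCO version string.

    Only the major.minor components decide the set, so "1.0.5" and
    "1.0.9" resolve identically.
    """
    major, minor = (int(p) for p in version.split(".")[:2])
    languages = set(BASE_LANGUAGES)
    if (major, minor) >= (1, 1):
        languages |= ADDED_IN_1_1
    return languages
```

Keying on the parsed `(major, minor)` tuple rather than the raw string means patch releases inherit the right set automatically.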
### Bug Fixes

- Graceful handling of unsupported dataset configs — New `DatasetConfigNotSupported` exception for datasets that dynamically produce 0 queries or targets (e.g. an ESCO language/version lacking skill alternatives). `Task._load_datasets()` catches this exception and logs a warning instead of crashing. `self.dataset_ids` is updated to reflect only successfully loaded datasets.
- Float cast for prediction matrices — Prediction tensors are now explicitly cast with `.float()` before `.numpy()`, preventing dtype errors with bfloat16 models.
- Deterministic index ordering — `sorted(set(...))` replaces `list(set(...))` in `_postprocess_indices` for reproducible results.
- Reporting lint fix — Removed unnecessary parentheses in `format_results_latex` model group iteration.
### Breaking Changes

- `RankingDataset.__init__` signature — `allow_duplicate_queries` / `allow_duplicate_targets` (bool) replaced by `duplicate_query_strategy` / `duplicate_target_strategy` (`DuplicateStrategy` enum). External callers using the old boolean flags must update.
- `ConTeXTMatchModel.encode()` parameter rename — `batch_size` renamed to `encode_batch_size` for clarity.
### Files Changed (24 files)

| Category | Files | +/- |
|---|---|---|
| Metrics | `ranking.py`, `reporting.py`, `classification.py`, `__init__.py` | +330 |
| Models | `bi_encoder.py` | +127 |
| Tasks (core) | `base.py`, `ranking_base.py`, `__init__.py` | +156 |
| Tasks (ranking) | `job2skill.py`, `skill2job.py`, `skill_extraction.py`, `jobnorm.py`, `skillnorm.py`, `melo.py`, `mels.py` | +53 |
| Tasks (classification) | `job2skill.py` | +4 |
| Data | `esco.py` | +53 |
| Results | `results.py` | +54 |
| Examples | `run_paper_results.py`, `generate_paper_table.py` | +317 |
| Tests | `test_duplicate_strategy.py`, `test_contextmatch_model.py`, `test_ranking_metrics.py` | +472 |
| Config | `README.md`, `pyproject.toml`, `CHANGELOG.md` | misc |

Full Changelog: v0.5.0...v0.5.1
## v0.5.0

## v0.4.0

### Features
- Lexical baselines for ranking — Added BM25, TF-IDF (word/char n-gram), Edit Distance, and Random ranking models with optional lowercasing and unicode normalization. (`lexical_baselines.py`)
- Freelancer project ranking tasks — New cross-lingual ranking tasks for freelancer candidate and project matching. (`freelancer_project_matching.py`)
- Cross-lingual aggregation modes — Introduced `LanguageAggregationMode` enum with three modes (`monolingual_only`, `crosslingual_group_input_languages`, `crosslingual_group_output_languages`) for flexible per-language metric aggregation. Added `DatasetLanguages` type to describe input/output language sets per dataset.
- Lazy execution filtering — Added `ExecutionMode` enum (`LAZY`/`ALL`) to skip datasets incompatible with the chosen aggregation mode before evaluation, avoiding unnecessary compute.
- Language-grouped averaging — Per-task aggregation now groups datasets by language before averaging, giving equal weight to each language regardless of how many datasets it contains. A `SKIP_LANGUAGE_AGGREGATION` mode is available for the previous flat-average behavior.
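The language-grouped averaging above amounts to a two-level mean: average within each language group first, then across groups. A minimal sketch, with assumed helper and argument names:

```python
from collections import defaultdict

def language_grouped_average(dataset_scores, dataset_languages):
    """Sketch of language-grouped averaging.

    dataset_scores:    {dataset_id: score}
    dataset_languages: {dataset_id: language grouping key}
    """
    # Bucket dataset scores by language group
    groups = defaultdict(list)
    for dataset_id, score in dataset_scores.items():
        groups[dataset_languages[dataset_id]].append(score)
    # Mean within each group, then mean across groups, so each language
    # counts equally no matter how many datasets it contributes
    group_means = [sum(scores) / len(scores) for scores in groups.values()]
    return sum(group_means) / len(group_means)
```

With two English datasets scoring 0.9 and 0.7 and one German dataset scoring 0.2, the flat average would be 0.6, while the language-grouped average is (0.8 + 0.2) / 2 = 0.5.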
### Breaking Changes

- `MetricsResult.language` has been replaced by `input_languages` / `output_languages`.
- `get_dataset_language()` renamed to `get_dataset_languages()`, now returning input and output language sets.
- `language_aggregation_mode` is now a required (non-optional) parameter in `evaluate()`.
- Dataset indexing generalized from language-based to `dataset_id`-based throughout the pipeline (`language_results` renamed to `datasetid_results`).
### Refactors

- Migrated freelancer tasks to `dataset_id`-based language mapping, replacing the `Language.CROSS` sentinel with a proper `DATASET_LANGUAGES_MAP`.
- Extracted `get_language_grouping_key()` as a shared function in `types.py`, reused by both eval-time filtering and aggregation-time skipping.
- Updated docstrings project-wide to comply with NumPy style.
### Bug Fixes

- Fixed SkillSkape import.
- Included the lowercase setting in lexical baseline model names.
- Added a language field to `MetricsResult` for proper per-language aggregation.
- Removed from the examples a dataset using ESCO 1.0.5 that incorrectly defines UK as a supported language.
### Documentation

- Updated README with task/model overview, metrics explanation, and multi-model and cross-lingual examples.
- Updated CONTRIBUTING with CI/CD guidance, cross-lingual task examples, and commit formatting.
- Consolidated four aggregation example scripts into a single CLI-driven `run_benchmark_aggregation.py`.
- Added `run_all_ranking_tasks.py` for auto-discovering all registered ranking tasks and models via the registry.
### Tests

- Added regression tests for all 9 lexical baseline model variants.
- Added comprehensive tests for cross-lingual multi-dataset aggregation scenarios.
- Added task loading tests for the new MELO and MELS tasks.
- Widened regression test tolerance (`abs=1e-3`) for cross-platform stability.
### Package

- Exposed `ExecutionMode`, `LanguageAggregationMode`, and `setup_logger` from `workrb.__init__`.
- Default logger initialized to INFO on package import.
- `TaskRegistry.create()` and `ModelRegistry.create()` now log the task/model being instantiated.

Full Changelog: v0.3.0...v0.4.0
## v0.3.0

### What's Changed
- feat: add skillskape dataset by @jjzha in #23
- feat: add Job Title Similarity ranking task by @federetyk in #28
- ci: refactor tests to exclude heavy model benchmarking by default and allow manual triggering by @Mattdl in #31
- refactor: change evaluate.py to avoid ambiguity with workrb.evaluate function call by @Mattdl in #29
- chore: rename tasks to be more uniform and update README task overview by @Mattdl in #32
### New Contributors
- @jjzha made their first contribution in #23
- @federetyk made their first contribution in #28
Full Changelog: v0.2.1...v0.3.0
## v0.2.0

### What's Changed
- feat: Contribution of ConTeXTMatch model by @warreveys in #18
- feat: add curriculum encoder and benchmark tests (#19) by @AleksanderB-hub in #20
- docs: add citation reference by @Mattdl in #15
- docs: README fix license shield by @Mattdl in #16
- fix: wrong order attributes evaluate call in evaluate_multiple_models function by @warreveys in #17
- fix: usage example by @Mattdl in #21
### New Contributors
- @warreveys made their first contribution in #17
- @AleksanderB-hub made their first contribution in #20
Full Changelog: https://github.com/techwolf-ai/workrb/blob/main/CHANGELOG.md
Diff: v0.1.0...v0.2.0
## v0.1.0

### What's Changed
- End-to-end evaluation with WorkRB for multiple Ranking and Classification tasks
- End-to-end and unit test coverage of 78%
- PyPI release GitHub workflow automation
- Update issue templates by @Mattdl in #11
- Docs by @Mattdl in #12
Full Changelog: https://github.com/techwolf-ai/workrb/commits/v0.1.0