Releases: techwolf-ai/workrb

v0.5.1

13 Mar 14:20

Highlights

This release introduces the infrastructure needed to reproduce and report paper benchmark results, adds the NDCG ranking metric, and resolves several robustness issues encountered during large-scale multilingual evaluation runs.


New Features

NDCG metric

Added Normalized Discounted Cumulative Gain (NDCG) as a first-class ranking metric with binary relevance scoring.

  • Supports both a top-k cutoff variant (ndcg@k) and full-list evaluation (ndcg).
  • When no @k is specified, evaluates over the entire ranked list.
  • Handles edge cases: no relevant items (returns 0.0), all items relevant (returns 1.0), and k larger than the number of targets.
  • Comprehensive test suite in tests/test_ranking_metrics.py (154 lines) covering hand-computed values, edge cases, torch/numpy input parity, and smoke tests for all existing metrics.
  • README updated with the new metric entry.
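The behavior described above can be illustrated with a minimal, self-contained sketch of binary-relevance NDCG; this is not the library's actual implementation, just the standard formula with the listed edge cases handled:

```python
import math

def ndcg(ranked_relevance, k=None):
    """Binary-relevance NDCG; evaluates the full ranked list when k is None.

    ranked_relevance: list of 0/1 flags in ranked order (1 = relevant).
    Returns 0.0 when no item is relevant and 1.0 for an ideal ranking;
    k larger than the number of targets is handled by slicing.
    """
    ranked = ranked_relevance[:k] if k is not None else ranked_relevance
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked))
    if sum(ranked_relevance) == 0:
        return 0.0  # edge case: no relevant items
    ideal = sorted(ranked_relevance, reverse=True)
    if k is not None:
        ideal = ideal[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg
```

With `k=None` this corresponds to the full-list `ndcg` variant; passing `k` gives `ndcg@k`.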

Paper results pipeline

Two new example scripts enable end-to-end reproduction of benchmark results:

  • examples/run_paper_results.py — Runs the full benchmark suite across multilingual models (BM25, JobBERT-v3, Qwen3-0.6B) and monolingual English models (ConTeXTMatch, CurriculumMatch, JobBERT-v2).
  • examples/generate_paper_table.py — Loads saved results.json files and generates a publication-ready LaTeX comparison table with model grouping, short display names, bold-best highlighting per model group, optional dataset count rows, and \resizebox support.

LaTeX results reporting

New format_results_latex() function in src/workrb/metrics/reporting.py (~290 lines) that builds a complete LaTeX table environment from multiple BenchmarkResults objects. Supports:

  • Configurable aggregation level (per task group or per task).
  • Model grouping with \midrule separators.
  • Column renaming and ordering via a short_names dictionary.
  • Per-group dataset count (#D) rows.
  • Metric scaling (e.g. raw values to percentages).
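A simplified sketch of the kind of table-building logic described above, covering group separation with \midrule, metric scaling, and bold-best highlighting per model group. The signature and structure here are illustrative assumptions, not the actual format_results_latex() API:

```python
def build_results_table(groups, scale=100.0):
    """Build a grouped LaTeX results table (illustrative sketch only).

    groups: list of (group_name, {model: score}) pairs; scores in [0, 1]
    are scaled (e.g. to percentages) and the best model in each group is
    wrapped in \\textbf. Groups are separated by \\midrule.
    """
    lines = [r"\begin{tabular}{lr}", r"\toprule", r"Model & Score \\", r"\midrule"]
    for gi, (name, scores) in enumerate(groups):
        if gi > 0:
            lines.append(r"\midrule")  # separator between model groups
        best = max(scores.values())
        for model, score in scores.items():
            cell = f"{score * scale:.1f}"
            if score == best:
                cell = r"\textbf{" + cell + "}"  # bold-best per group
            lines.append(f"{model} & {cell} \\\\")
    lines += [r"\bottomrule", r"\end{tabular}"]
    return "\n".join(lines)
```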

Dataset count introspection

New BenchmarkResults.get_dataset_counts() method returns the number of datasets contributing to each task group (or task) score, respecting language aggregation filters.

Deduplication strategy for ranking datasets

Replaced the boolean allow_duplicate_queries / allow_duplicate_targets flags with a DuplicateStrategy enum offering three modes:

| Strategy | Behavior |
| --- | --- |
| ALLOW | Silently accept duplicates (no-op). |
| RAISE | Raise an error if duplicates are found. |
| RESOLVE (new default) | Deterministic deduplication — targets keep first occurrence with index remapping; queries merge target_indices via set union. |

Tested in tests/test_duplicate_strategy.py (210 lines).
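The RESOLVE semantics can be sketched as follows; the helper name and return shape are hypothetical, but the behavior matches the description above (targets keep their first occurrence with index remapping, duplicate queries merge their target_indices via set union):

```python
def resolve_duplicates(queries, targets, target_indices):
    """Sketch of the RESOLVE strategy (helper name hypothetical).

    queries/targets: lists of texts, possibly with duplicates.
    target_indices: per-query lists of indices into `targets`.
    """
    # Deduplicate targets, remembering where each old index now points.
    remap, first, unique_targets = {}, {}, []
    for old, text in enumerate(targets):
        if text not in first:
            first[text] = len(unique_targets)
            unique_targets.append(text)
        remap[old] = first[text]  # duplicates map to the first occurrence
    # Deduplicate queries, merging remapped target indices via set union.
    merged, unique_queries = {}, []
    for q, idxs in zip(queries, target_indices):
        if q not in merged:
            merged[q] = set()
            unique_queries.append(q)
        merged[q] |= {remap[i] for i in idxs}
    return unique_queries, unique_targets, [sorted(merged[q]) for q in unique_queries]
```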

ConTeXTMatch query batching

ConTeXTMatchModel._compute_rankings() now scores queries in configurable chunks (scoring_batch_size, default 32) to prevent OOM from the (num_queries, num_targets, seq_len) intermediate tensor. Targets are encoded once and reused across all chunks. Tested in tests/test_models/test_contextmatch_model.py (108 lines).
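The chunking pattern looks roughly like this; `score_fn` is a stand-in for the model's internal chunk scorer, and the function shape is an assumption rather than the actual private method:

```python
def compute_rankings(queries, target_embs, score_fn, scoring_batch_size=32):
    """Score queries against pre-encoded targets in memory-bounded chunks.

    Targets are encoded once (target_embs) and reused across all chunks,
    so the full (num_queries, num_targets, seq_len) intermediate tensor
    never materialises; `score_fn` returns one row of scores per query.
    """
    scores = []
    for start in range(0, len(queries), scoring_batch_size):
        chunk = queries[start:start + scoring_batch_size]
        scores.extend(score_fn(chunk, target_embs))  # one score row per query
    return scores
```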

Version-dependent ESCO language support

ESCO.get_supported_languages(version) returns the correct language set per major.minor ESCO version. Languages added in v1.1 (Icelandic, Norwegian, Arabic, Ukrainian) are no longer incorrectly assumed available for v1.0.x. All ranking tasks now use this version-aware lookup.
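A minimal sketch of the version-aware lookup; the four v1.1 additions are from the release notes, while the base language set shown is an illustrative subset, not the real ESCO list:

```python
BASE_LANGUAGES = {"en", "de", "fr", "es"}  # illustrative subset of v1.0 languages
ADDED_IN_1_1 = {"is", "no", "ar", "uk"}    # Icelandic, Norwegian, Arabic, Ukrainian

def get_supported_languages(version):
    """Return the language set for a major.minor ESCO version (sketch)."""
    major, minor = (int(p) for p in version.split(".")[:2])
    langs = set(BASE_LANGUAGES)
    if (major, minor) >= (1, 1):
        langs |= ADDED_IN_1_1  # these must not be assumed available on v1.0.x
    return langs
```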


Bug Fixes

  • Graceful handling of unsupported dataset configs — New DatasetConfigNotSupported exception for datasets that dynamically produce 0 queries or targets (e.g. ESCO language/version lacking skill alternatives). Task._load_datasets() catches this exception and logs a warning instead of crashing. self.dataset_ids is updated to reflect only successfully loaded datasets.
  • Float cast for prediction matrices — Prediction tensors are now explicitly cast to .float() before .numpy(), preventing dtype errors with bfloat16 models.
  • Deterministic index ordering — sorted(set(...)) replaces list(set(...)) in _postprocess_indices for reproducible results.
  • Reporting lint fix — Removed unnecessary parentheses in format_results_latex model group iteration.
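The catch-and-skip behavior for unsupported dataset configs follows the pattern below; `load_fn` is a hypothetical stand-in for the task's dataset loader:

```python
import logging

logger = logging.getLogger(__name__)

class DatasetConfigNotSupported(Exception):
    """Raised when a config dynamically yields 0 queries or targets."""

def load_datasets(dataset_ids, load_fn):
    """Load each dataset, skipping unsupported configs with a warning.

    Sketch of the behavior described above: unsupported configs are
    logged and dropped, and the returned id list reflects only the
    datasets that loaded successfully.
    """
    loaded, kept_ids = [], []
    for dataset_id in dataset_ids:
        try:
            loaded.append(load_fn(dataset_id))
            kept_ids.append(dataset_id)
        except DatasetConfigNotSupported as exc:
            logger.warning("Skipping %s: %s", dataset_id, exc)
    return loaded, kept_ids
```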

Breaking Changes

  • RankingDataset.__init__ signature — allow_duplicate_queries / allow_duplicate_targets (bool) replaced by duplicate_query_strategy / duplicate_target_strategy (DuplicateStrategy enum). External callers using the old boolean flags must update.
  • ConTeXTMatchModel.encode() parameter rename — batch_size renamed to encode_batch_size for clarity.
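For migration, the old boolean flags map onto the new enum roughly as follows; the enum members come from the release notes, while the mapping helper and the enum's concrete values are illustrative assumptions:

```python
from enum import Enum, auto

class DuplicateStrategy(Enum):
    """Mirror of the new enum (member values assumed, names from the notes)."""
    ALLOW = auto()
    RAISE = auto()
    RESOLVE = auto()

def migrate_flag(allow_duplicates):
    """Map an old boolean flag to its closest new strategy (hypothetical helper).

    True previously meant "silently accept"; False meant "error on duplicates".
    Note that the new default, RESOLVE, has no boolean equivalent.
    """
    return DuplicateStrategy.ALLOW if allow_duplicates else DuplicateStrategy.RAISE
```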

Files Changed (24 files)

| Category | Files | +/- |
| --- | --- | --- |
| Metrics | ranking.py, reporting.py, classification.py, __init__.py | +330 |
| Models | bi_encoder.py | +127 |
| Tasks (core) | base.py, ranking_base.py, __init__.py | +156 |
| Tasks (ranking) | job2skill.py, skill2job.py, skill_extraction.py, jobnorm.py, skillnorm.py, melo.py, mels.py | +53 |
| Tasks (classification) | job2skill.py | +4 |
| Data | esco.py | +53 |
| Results | results.py | +54 |
| Examples | run_paper_results.py, generate_paper_table.py | +317 |
| Tests | test_duplicate_strategy.py, test_contextmatch_model.py, test_ranking_metrics.py | +472 |
| Config | README.md, pyproject.toml, CHANGELOG.md | misc |
Full Changelog: v0.5.0...v0.5.1

v0.5.0

09 Mar 09:37

What's Changed

  • refactor: align task groups to paper (JOBSIM/SKILLSIM → Semantic Similarity + Candidate Ranking) by @Mattdl in #45

Full Changelog: v0.4.0...v0.5.0

v0.4.0

04 Mar 17:07

Features

  • Lexical baselines for ranking — Added BM25, TF-IDF (word/char n-gram), Edit Distance, and Random ranking models with optional lowercasing and unicode normalization. (lexical_baselines.py)
  • Freelancer project ranking tasks — New cross-lingual ranking tasks for freelancer candidate and project matching. (freelancer_project_matching.py)
  • Cross-lingual aggregation modes — Introduced LanguageAggregationMode enum with three modes (monolingual_only, crosslingual_group_input_languages, crosslingual_group_output_languages) for flexible per-language metric aggregation. Added DatasetLanguages type to describe input/output language sets per dataset.
  • Lazy execution filtering — Added ExecutionMode enum (LAZY / ALL) to skip datasets incompatible with the chosen aggregation mode before evaluation, avoiding unnecessary compute.
  • Language-grouped averaging — Per-task aggregation now groups datasets by language before averaging, giving equal weight to each language regardless of how many datasets it contains. A SKIP_LANGUAGE_AGGREGATION mode is available for the previous flat-average behavior.
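The language-grouped averaging described above can be sketched in a few lines, assuming per-dataset scores keyed by language; the function name and input shape are illustrative:

```python
from collections import defaultdict

def language_grouped_average(results):
    """Average per-dataset scores with equal weight per language (sketch).

    results: list of (language, score) pairs. Datasets are first averaged
    within each language, then the per-language means are averaged, so a
    language with many datasets does not dominate the task score.
    """
    by_lang = defaultdict(list)
    for lang, score in results:
        by_lang[lang].append(score)
    per_lang = [sum(scores) / len(scores) for scores in by_lang.values()]
    return sum(per_lang) / len(per_lang)
```

For example, two English datasets and one French dataset contribute one averaged English score and one French score, weighted equally, whereas a flat average would weight English twice.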

Breaking Changes

  • MetricsResult.language has been replaced by input_languages / output_languages.
  • get_dataset_language() renamed to get_dataset_languages(), now returning input and output language sets.
  • language_aggregation_mode is now a required (non-optional) parameter in evaluate().
  • Dataset indexing generalized from language-based to dataset_id-based throughout the pipeline (language_results renamed to datasetid_results).

Refactors

  • Migrated freelancer tasks to dataset_id-based language mapping, replacing the Language.CROSS sentinel with a proper DATASET_LANGUAGES_MAP.
  • Extracted get_language_grouping_key() as a shared function in types.py, reused by both eval-time filtering and aggregation-time skipping.
  • Updated docstrings project-wide to comply with NumPy style.

Bug Fixes

  • Fixed SkillSkape import.
  • Included lowercase setting in lexical baseline model names.
  • Added language field to MetricsResult for proper per-language aggregation.
  • Removed from examples a dataset using ESCO 1.0.5 that incorrectly defines UK as a supported language.

Documentation

  • Updated README with task/model overview, metrics explanation, multi-model and cross-lingual examples.
  • Updated CONTRIBUTING with CI/CD guidance, cross-lingual task examples, and commit formatting.
  • Consolidated four aggregation example scripts into a single CLI-driven run_benchmark_aggregation.py.
  • Added run_all_ranking_tasks.py for auto-discovering all registered ranking tasks and models via the registry.

Tests

  • Added regression tests for all 9 lexical baseline model variants.
  • Added comprehensive tests for cross-lingual multi-dataset aggregation scenarios.
  • Added task loading tests for new MELO and MELS tasks.
  • Widened regression test tolerance (abs=1e-3) for cross-platform stability.

Package

  • Exposed ExecutionMode, LanguageAggregationMode, and setup_logger from workrb.__init__.
  • Default logger initialized to INFO on package import.
  • TaskRegistry.create() and ModelRegistry.create() now log the task/model being instantiated.

Full Changelog: v0.3.0...v0.4.0

v0.3.0

09 Jan 12:34

What's Changed

  • feat: add skillskape dataset by @jjzha in #23
  • feat: add Job Title Similarity ranking task by @federetyk in #28
  • ci: refactor tests to exclude heavy model benchmarking by default and allow manual triggering by @Mattdl in #31
  • refactor: change evaluate.py to avoid ambiguity with workrb.evaluate function call by @Mattdl in #29
  • chore: rename tasks more uniformly and update README task overview by @Mattdl in #32

Full Changelog: v0.2.1...v0.3.0

v0.2.0

06 Jan 08:33

What's Changed

Full Changelog: https://github.com/techwolf-ai/workrb/blob/main/CHANGELOG.md
Diff: v0.1.0...v0.2.0

v0.1.0

11 Nov 15:24

What's Changed

  • End-to-end evaluation with WorkRB for multiple Ranking and Classification tasks
  • End-to-end and unit test coverage of 78%
  • PyPI release GitHub workflow automation
  • Update issue templates by @Mattdl in #11
  • Docs by @Mattdl in #12

Full Changelog: https://github.com/techwolf-ai/workrb/commits/v0.1.0