Skip to content

Improve license required phrase generation #4237

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 22 commits into from
Apr 22, 2025
Merged

Conversation

pombredanne
Copy link
Member

@pombredanne pombredanne commented Apr 14, 2025

This PR builds on top of:

Specifically:

  • Improve feedback and selectivity when generating new required phrase rules, mostly when updating existing rules
  • Add SPDX license ids to the addition of required phrases to existing rules
  • Update all rules "required phrase" status with these improvements

Tasks

  • Reviewed contribution guidelines
  • PR is descriptively titled 📑 and links the original issue above 🔗
  • Tests pass -- look for a green checkbox ✔️ a few minutes after opening your PR
    Run tests locally to check for errors.
  • Commits are in uniquely-named feature branch and has no merge conflicts 📁
  • Updated documentation pages (if applicable)
  • Updated CHANGELOG.rst (if applicable)

Put all required phrase validation in one code block

Signed-off-by: Philippe Ombredanne <[email protected]>
Signed-off-by: Philippe Ombredanne <[email protected]>
Ensure it works with all rules at once
Add filter to run only a few times and limit
the minimum size of rules to gen.
Use SPDX keys for rules too.

Signed-off-by: Philippe Ombredanne <[email protected]>
Signed-off-by: Philippe Ombredanne <[email protected]>
Signed-off-by: Philippe Ombredanne <[email protected]>
Signed-off-by: Philippe Ombredanne <[email protected]>
Signed-off-by: Philippe Ombredanne <[email protected]>
Signed-off-by: Philippe Ombredanne <[email protected]>
Signed-off-by: Philippe Ombredanne <[email protected]>
Signed-off-by: Philippe Ombredanne <[email protected]>
Do not generate rules for "license key".

Skip short rules that would contain stopwords as
they cannot be matched accurately

Validate short required phrase rules for stopwords

Add tests that highlight the stop word issue

Reference: #4238
Signed-off-by: Philippe Ombredanne <[email protected]>
Do not generate require phrase rules for these
rules.

Reference: #4238
Signed-off-by: Philippe Ombredanne <[email protected]>
@AyanSinhaMahapatra AyanSinhaMahapatra changed the base branch from develop to 4190-license-licence April 22, 2025 14:23
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops... should not be committed. junk local file!

Copy link
Member

@AyanSinhaMahapatra AyanSinhaMahapatra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pombredanne thanks++ for handling these. LGTM!
I will make the updates for a few nits and merge this after we merge #4215

@@ -2,6 +2,7 @@
license_expression: bsd-new
is_license_text: yes
is_continuous: yes
skip_for_required_phrase_generation: yes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we set the minimum_coverage instead here?
This is better for large parts of texts in rules that we want to detect exactly, more so when we want to match the entire rule exactly.

notes: Note a license
---

Some units are dual licensed and some are specifically Artistic-only.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be a false positive, we should instead make this a license clue for artistic, see example at https://github.com/natias/arnon/blob/e1233fdd840601e202d3bf1d3e9e5aff9a2e13d5/docker_with_perl_dbd/xzbnl#L245.

}}

<licenses><license><name>MPL</name><url>http://www.mozilla.org/MPL/2.0/index.txt</url><distribution>repo</distribution></license><license><name>LGPL</name><url>http://www.gnu.org/licenses/lgpl-2.1.txt</url><distribution>repo</distribution></license><license><name>GPL</name><url>http://www.gnu.org/licenses/gpl-2.0.txt</url><distribution>repo</distribution></license></licenses>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not have this entire rule as a required phrase, instead we should hane the name/urls as required phrases.

}}

<licenses><license><name>MPL</name><url>http://www.mozilla.org/MPL/2.0/index.txt</url><distribution>repo</distribution></license><license><name>LGPL</name><url>https://www.gnu.org/licenses/lgpl-2.1.txt</url><distribution>repo</distribution></license><license><name>GPL</name><url>https://www.gnu.org/licenses/gpl-2.0.txt</url><distribution>repo</distribution></license></licenses>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not have this entire rule as a required phrase, instead we should hane the name/urls as required phrases.

ignorable_urls:
- http://www.gnu.org/licenses/gpl.txt
- http://www.gnu.org/licenses/lgpl.txt
- https://www.mozilla.org/en-US/MPL/1.1/
---
{{


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not have this entire rule as a required phrase, instead we should hane the name/urls as required phrases.

Return True if this phrase is a minimally suitable to use as a required phrase
Return True if this phrase is a minimally suitable to use as a required phrase.
Use the original rule to ensure we skip when referenced_filenames could be damaged.
Also skip short rules that would contain stopwords as they could not be detected correctly.
Copy link
Member

@AyanSinhaMahapatra AyanSinhaMahapatra Apr 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably also check here for required phrase texts which are quite large as a refinement, as these can be better handled either by minimum coverage or by skipping required phrase generation from the source rules.

Signed-off-by: Philippe Ombredanne <[email protected]>
@AyanSinhaMahapatra AyanSinhaMahapatra changed the base branch from 4190-license-licence to develop April 22, 2025 19:17
@AyanSinhaMahapatra AyanSinhaMahapatra merged commit e4abe78 into develop Apr 22, 2025
7 of 42 checks passed
@pombredanne pombredanne deleted the improve-required branch April 23, 2025 06:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants