feat: keep test noise out of community names and attach tests to their subjects #600
Open
SHudici wants to merge 1 commit into
Two related fixes for test-heavy repositories, where community output
was dominated by test artifacts:
- Leiden tends to cluster tests with each other rather than with the
code they cover: shared fixtures and helpers give test files dense
internal CALLS edges, while TESTED_BY links to subjects are sparse
and weakly weighted. A new reassignment pass moves each Test node
into the community holding the majority of its TESTED_BY partners
(its own community counts as a vote, so tests already placed with
their subjects stay put; ties also prefer the current cluster and
otherwise resolve to the lowest cluster index, so the outcome never
depends on edge order). The test endpoint of a TESTED_BY edge is
identified by node kind rather than edge direction, so the pass is
independent of the direction the parser emits. TESTED_BY endpoints
inherited from unresolved cross-file calls can be bare names rather
than qualified ones; those are resolved by node name when the name
is unambiguous, so their tests still vote.
- Community naming is no longer hijacked by test members. Mixed
communities are named from their production members only, and BDD
test-name grammar ("should", "when", "given", ...) joined the
stop-word list, so a mixed community that used to come out as
"tests-should" is now named from its production side.
Behavior notes: tests with no TESTED_BY partner keep their original
cluster, and a test-only cluster that loses most members to
reassignment can fall below min_size and drop out of the community
list. The file-based fallback keeps grouping strictly by directory,
since moving tests out of their file group would contradict its
contract.
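The reassignment rule described above can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the code in this PR: the function name `reassign_test_nodes`, the dict-based inputs, the `::` separator for qualified names, and the de-duplication of partners are all hypothetical choices made for the example.

```python
from collections import Counter, defaultdict


def reassign_test_nodes(membership, kinds, edges):
    """Hypothetical sketch of the post-Leiden reassignment pass.

    membership: dict of node id -> cluster index
    kinds:      dict of node id -> node kind ("Test", "Function", ...)
    edges:      iterable of (source, target, edge_kind) triples
    """
    # Index bare names so unqualified endpoints resolve when unambiguous.
    # Assumption: qualified names look like "path::name".
    by_name = defaultdict(list)
    for node in membership:
        by_name[node.rsplit("::", 1)[-1]].append(node)

    def resolve(endpoint):
        if endpoint in membership:
            return endpoint
        candidates = by_name.get(endpoint, [])
        return candidates[0] if len(candidates) == 1 else None

    # Collect each test's TESTED_BY partners, whatever the edge direction.
    partners = defaultdict(set)
    for src, dst, edge_kind in edges:
        if edge_kind != "TESTED_BY":
            continue
        a, b = resolve(src), resolve(dst)
        if a is None or b is None:
            continue  # ambiguous or unknown bare name: no vote
        a_is_test, b_is_test = kinds[a] == "Test", kinds[b] == "Test"
        if a_is_test == b_is_test:
            continue  # test-test (or subject-subject) edges are ignored
        test, subject = (a, b) if a_is_test else (b, a)
        partners[test].add(subject)

    result = dict(membership)
    for test, subjects in partners.items():
        home = membership[test]
        votes = Counter(membership[s] for s in subjects)
        votes[home] += 1  # the test's own community counts as a vote
        best = max(votes.values())
        winners = [c for c, n in votes.items() if n == best]
        # Ties prefer the current cluster, otherwise the lowest index.
        result[test] = home if home in winners else min(winners)
    return result


membership = {"a.py::foo": 0, "a.py::bar": 0, "b.py::baz": 2,
              "t.py::test_foo": 1, "t.py::test_tie": 1}
kinds = {"a.py::foo": "Function", "a.py::bar": "Function",
         "b.py::baz": "Function",
         "t.py::test_foo": "Test", "t.py::test_tie": "Test"}
edges = [("a.py::foo", "t.py::test_foo", "TESTED_BY"),
         ("t.py::test_foo", "bar", "TESTED_BY"),  # reversed, bare name
         ("b.py::baz", "t.py::test_tie", "TESTED_BY")]
new_membership = reassign_test_nodes(membership, kinds, edges)
# test_foo moves to cluster 0 (two subject votes against one home vote);
# test_tie stays in cluster 1 (one vote each, and ties prefer home).
```

Tests with no TESTED_BY partner never enter the vote, so they keep their original cluster, which matches the behavior note above. Applying min_size after such a pass is left out of the sketch.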
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Problem
On test-heavy repositories, community output is dominated by test artifacts in two ways:
- Shared fixtures and helpers give test files dense internal CALLS edges, while TESTED_BY links to subjects are sparse and weakly weighted (0.4). Leiden therefore clusters tests with each other instead of with the code they cover, producing test-only blobs that are useless for review context.
- BDD-style test names (test_should_return_x_when_y) flood the keyword counter with grammar words, and tests/ directories win the file-prefix vote, so mixed communities come out named like tests-should.
Fix
- _reassign_test_nodes (new post-Leiden pass): each Test node moves to the community holding the majority of its TESTED_BY partners; its own community counts as a vote, so tests already placed with their subjects stay put. Ties also prefer the current cluster and otherwise resolve to the lowest cluster index, so the outcome never depends on edge order. The test endpoint of a TESTED_BY edge is identified by node kind, not edge direction, so the pass is independent of which direction the parser emits (and composes with the TESTED_BY direction fix, but does not require it). TESTED_BY endpoints that inherited a bare name from an unresolved cross-file call are resolved by node name when the name is unambiguous, so their tests still vote.
- Naming: mixed communities are named from their production members only, and BDD test-name grammar (should, when, given, returns, ...) joined _COMMON_WORDS.
Deliberate scoping: the file-based fallback keeps grouping strictly by directory; its contract is "group by file", and moving tests out of their file group would contradict it.
Behavior notes: tests with no TESTED_BY partner keep their original cluster; a test-only cluster that loses most members to reassignment can fall below min_size and drop out of the community list.
Testing
_reassign_test_nodes (move, direction-agnostic, majority vote, own-cluster votes, no-op without TESTED_BY, test-test edges ignored, bare-name endpoints resolved when unique / ignored when ambiguous, tie-breaks: stay home on a tie, deterministic lowest-cluster otherwise): pure-function tests that run without igraph.
🤖 Generated with Claude Code
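For the naming side of the fix, a rough sketch of the keyword step may help reviewers picture the effect. It is an illustration only: `name_community`, the `COMMON_WORDS` subset, the `top_n` parameter, and the fallback for test-only communities are assumptions made for the example, and the file-prefix vote is left out entirely.

```python
import re
from collections import Counter

# Assumed subset of stop words; the full _COMMON_WORDS list is not shown here.
COMMON_WORDS = {"test", "tests", "should", "when", "given", "returns"}


def name_community(members, top_n=2):
    """Hypothetical naming rule. members: list of (name, kind) pairs."""
    production = [name for name, kind in members if kind != "Test"]
    # Mixed communities are named from production members only.
    # Assumption: a test-only community falls back to its test names.
    pool = production or [name for name, _ in members]

    counts = Counter()
    for name in pool:
        # Split snake_case names into words; camelCase is not handled.
        for word in re.split(r"[^a-z]+", name.lower()):
            if word and word not in COMMON_WORDS:
                counts[word] += 1

    # Most frequent first, alphabetical among equals, so output is stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return "-".join(word for word, _ in ranked[:top_n])


mixed = [("parse_config", "Function"),
         ("load_config", "Function"),
         ("test_should_parse_config_when_given_path", "Test")]
label = name_community(mixed)  # "config-load": test words never enter the count
```

Filtering by member kind before counting is what keeps a single long BDD-style test name from outvoting several short production names.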