Make code graph APIs SCIP-oriented #59470
Description
Status Quo
At the moment, the way precise code graph APIs work is based on source ranges. For example, when you do Find references, you end up calling something like this: https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/cmd/frontend/graphqlbackend/codeintel.codenav.graphql?L114-128
references(line: Int!, character: Int!, ...)
The consequences of this position-centered design are:
- Ref panel URLs use source ranges. This means that ref panel URLs that are not permalinks (i.e. not pinned to a specific commit) can easily break if unrelated changes earlier in the file cause the range for the occurrence to change.
- Client-side code cannot distinguish the cases where multiple semantically different symbols are present at the same source range, from the case where only one unique symbol is present. Related issues:
Proposed direction
From the start, the vision for SCIP has been to serve as part of the core vocabulary at Sourcegraph. We have already incorporated that to some extent with code navigation for locals, where the syntax highlighter can provide information about occurrences for locals and parameters for a growing set of languages, and the client-side code can retrieve a SCIP document (1, 2) without having to make further requests to the server.
Integrating SCIP into the precise code graph APIs would involve adding support for:
1. Getting the precise SCIP Document corresponding to a file.
2. Looking up defs/refs/impls etc. based on SCIP symbol names (or suffixes).
With 1. available, when attempting to do Go to definition/Find references, the client-side code could surface a choice to the user when multiple symbols have occurrences for the same token. This would address #57347.
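As a rough illustration of the client-side check described above, the sketch below collects the distinct symbols that have an occurrence covering a position. The `SourceRange` and `Occurrence` shapes, the function names, and the symbol strings are all simplified, hypothetical stand-ins, not the real SCIP bindings or real symbol names.

```typescript
// Simplified, hypothetical shapes; not the generated SCIP bindings.
interface SourceRange {
    startLine: number
    startChar: number
    endLine: number
    endChar: number
}
interface Occurrence {
    symbol: string
    range: SourceRange
}

// True if (line, char) lies inside the range; the end position is exclusive.
function covers(r: SourceRange, line: number, char: number): boolean {
    const afterStart = line > r.startLine || (line === r.startLine && char >= r.startChar)
    const beforeEnd = line < r.endLine || (line === r.endLine && char < r.endChar)
    return afterStart && beforeEnd
}

// Distinct symbols that have an occurrence covering the position, in first-seen order.
function symbolsAt(occurrences: Occurrence[], line: number, char: number): string[] {
    const symbols: string[] = []
    for (const occ of occurrences) {
        if (covers(occ.range, line, char) && symbols.indexOf(occ.symbol) === -1) {
            symbols.push(occ.symbol)
        }
    }
    return symbols
}

// Made-up symbol names: two symbols share one token, a third sits elsewhere.
const sampleOccurrences: Occurrence[] = [
    { symbol: 'example pkg/Config#name.', range: { startLine: 27, startChar: 13, endLine: 27, endChar: 30 } },
    { symbol: 'example pkg/name.', range: { startLine: 27, startChar: 13, endLine: 27, endChar: 30 } },
    { symbol: 'example pkg/other().', range: { startLine: 40, startChar: 4, endLine: 40, endChar: 9 } },
]

const candidates = symbolsAt(sampleOccurrences, 27, 20)
// More than one candidate means the UI could ask the user which symbol they meant.
const needsChoice = candidates.length > 1
```

With only a position-based API, the client never sees `candidates` and so cannot tell these two cases apart.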
With 1. available, even when a source range only has an occurrence for a single symbol, the client-side code could use the SCIP symbol name (or a slightly cleaned up version of it), to form URLs. For example, a URL like:
would become something like:
So if there were changes to previous lines (e.g. new imports were added) but the code continued to have precise code graph data, the URL would still work: the client-side code could fetch the SCIP document, locate the source range nearest to L27:13-27:30 that has the matching symbol (symbol names for top-level symbols do not change based on source locations), and (optionally) rewrite the URL. We should probably still maintain the source range in the URL so that the blob view can highlight the intended range in the source file.
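The relocation step could look roughly like the following sketch. It assumes the symbol and the (possibly stale) range have already been parsed out of the URL, and that the URL range and the document's ranges use the same line/column numbering (converting between URL-style and SCIP-style numbering is omitted). The types, function names, and symbol strings are hypothetical.

```typescript
// Simplified, hypothetical shapes; not the generated SCIP bindings.
interface SourceRange {
    startLine: number
    startChar: number
    endLine: number
    endChar: number
}
interface Occurrence {
    symbol: string
    range: SourceRange
}

// True if range `a` starts closer to `target` than `b` does.
// Line distance is compared first, then column distance.
function isCloser(a: SourceRange, b: SourceRange, target: SourceRange): boolean {
    const lineA = Math.abs(a.startLine - target.startLine)
    const lineB = Math.abs(b.startLine - target.startLine)
    if (lineA !== lineB) {
        return lineA < lineB
    }
    return Math.abs(a.startChar - target.startChar) < Math.abs(b.startChar - target.startChar)
}

// Finds the occurrence of `symbol` nearest to the stale range taken from the URL.
// Returns undefined if the document no longer mentions the symbol.
function relocate(occurrences: Occurrence[], symbol: string, stale: SourceRange): SourceRange | undefined {
    let best: SourceRange | undefined
    for (const occ of occurrences) {
        if (occ.symbol !== symbol) {
            continue
        }
        if (best === undefined || isCloser(occ.range, best, stale)) {
            best = occ.range
        }
    }
    return best
}

// The URL still says 27:13-27:30, but three new imports pushed the token down to line 30.
const staleRange: SourceRange = { startLine: 27, startChar: 13, endLine: 27, endChar: 30 }
const currentOccurrences: Occurrence[] = [
    { symbol: 'example pkg/unrelated.', range: { startLine: 27, startChar: 13, endLine: 27, endChar: 22 } },
    { symbol: 'example pkg/Config#name.', range: { startLine: 30, startChar: 13, endLine: 30, endChar: 30 } },
    { symbol: 'example pkg/Config#name.', range: { startLine: 90, startChar: 2, endLine: 90, endChar: 19 } },
]
const movedRange = relocate(currentOccurrences, 'example pkg/Config#name.', staleRange)
```

Here `movedRange` is the occurrence on line 30, which the client could then use to highlight the token and, optionally, rewrite the URL.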
With 2. available, the client-side code could fetch precise data for only a single symbol instead of presenting a union.
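A minimal sketch of name-or-suffix matching follows. It assumes "suffix" means a plain string suffix of the full symbol name, which is only one possible reading; the function names and symbol strings are made up for illustration.

```typescript
// True if `symbol` ends with `suffix` (plain string comparison; an assumption,
// since the exact meaning of "suffix" for SCIP symbols is not fixed here).
function hasSuffix(symbol: string, suffix: string): boolean {
    return suffix.length <= symbol.length &&
        symbol.lastIndexOf(suffix) === symbol.length - suffix.length
}

// Distinct known symbols matching the full name or the suffix, in first-seen order.
function matchSymbols(knownSymbols: string[], nameOrSuffix: string): string[] {
    const matches: string[] = []
    for (const symbol of knownSymbols) {
        if (hasSuffix(symbol, nameOrSuffix) && matches.indexOf(symbol) === -1) {
            matches.push(symbol)
        }
    }
    return matches
}

// Made-up symbol names for illustration.
const knownSymbols = [
    'example pkg/Config#name.',
    'example pkg/Server#name.',
    'example pkg/Config#port.',
]

const exactMatch = matchSymbols(knownSymbols, 'example pkg/Config#name.')
const ambiguousMatch = matchSymbols(knownSymbols, '#name.')
```

A full name yields exactly one symbol, so the client can request data for just that symbol; a short suffix such as `#name.` may match several, in which case the caller would have to disambiguate.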
Accommodating new SCIP data sources
When the work on batch indexing is complete (tentatively planned for Q1 FY25), we'll have a new source of code graph data which is technically not precise, but it will be SCIP-oriented. We should design APIs and UI changes (e.g. for the ref panel and URLs) with this in mind. For example:
- The client should request data with a setting for a "data source" (precise indexer? tree-sitter? any?), rather than having a tailored API specifically for precise data. This would apply both to the new APIs for fetching SCIP documents, as well as the APIs for fetching defs/refs/impls etc.
- Results that are returned should include (or be extensible to include) accompanying information about the data source (this matters for the "any" data source setting).
- The data source should be implied by URLs, so that link sharing shows consistent results. Right now, the "Mix search-based and precise" setting is a user setting which means that it doesn't work if you're logged out 🙃, and if you're logged in, you may see different results compared to your colleague.
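The three points above could fit together roughly as in the sketch below: the request names a data source, each result is tagged with the source it actually came from, and the data source round-trips through the URL query. Every name here (`DataSource`, `UsagesRequest`, `TaggedResult`, the `symbol`/`source` query parameters) is hypothetical and not an existing Sourcegraph API.

```typescript
// Hypothetical data source names, following the options floated above.
type DataSource = 'precise' | 'tree-sitter' | 'any'
type ConcreteSource = 'precise' | 'tree-sitter'

interface UsagesRequest {
    symbol: string
    dataSource: DataSource
}
// Each result records where it came from, which matters when the request says 'any'.
interface TaggedResult {
    path: string
    line: number
    source: ConcreteSource
}

function isDataSource(value: string | undefined): value is DataSource {
    return value === 'precise' || value === 'tree-sitter' || value === 'any'
}

// Encodes the request into a query string so that a shared link implies the data source.
function toQuery(request: UsagesRequest): string {
    return 'symbol=' + encodeURIComponent(request.symbol) + '&source=' + request.dataSource
}

// Parses the query string back; returns undefined if it is incomplete or invalid.
function parseQuery(query: string): UsagesRequest | undefined {
    let symbol: string | undefined
    let source: string | undefined
    for (const pair of query.split('&')) {
        const eq = pair.indexOf('=')
        if (eq === -1) {
            continue
        }
        const key = pair.slice(0, eq)
        const value = decodeURIComponent(pair.slice(eq + 1))
        if (key === 'symbol') {
            symbol = value
        } else if (key === 'source') {
            source = value
        }
    }
    if (symbol === undefined || !isDataSource(source)) {
        return undefined
    }
    return { symbol, dataSource: source }
}

// Keeps the results allowed by the requested data source; 'any' keeps everything.
function selectResults(results: TaggedResult[], dataSource: DataSource): TaggedResult[] {
    return results.filter(r => dataSource === 'any' || r.source === dataSource)
}

const sharedQuery = toQuery({ symbol: 'example pkg/Config#name.', dataSource: 'any' })
const parsedRequest = parseQuery(sharedQuery)
```

Because the data source lives in the link rather than in a user setting, two people opening the same URL would issue the same request, whether or not they are logged in.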
Miscellaneous suggestions
- Instead of having specialized APIs for defs/refs/impls etc., it might make sense to have a single API where the desired kind of output is specified as a string. This only applies where the output is a list of occurrences; it wouldn't apply to potential features like call hierarchy, where the result would be a graph of occurrences. There seems to be an unnecessary amount of client-side complexity and code duplication because of the shape of the API.
- It would be helpful to document guarantees related to duplicates, ordering etc. to avoid unnecessary client-side code related to sorting/uniquing/merging.
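To make both suggestions concrete, here is a minimal sketch of a single lookup function that takes the desired kind as a string and states its ordering and duplicate guarantees in one place. The `UsageKind` values, the `Usage` shape, and the `usages` function are hypothetical, and the in-memory array stands in for whatever the server would actually query.

```typescript
// Hypothetical kinds; one entry point replaces separate defs/refs/impls functions.
type UsageKind = 'definition' | 'reference' | 'implementation'

interface Usage {
    symbol: string
    kind: UsageKind
    path: string
    line: number
    char: number
}

// Guarantees of this sketch: no two results share (path, line, char), and results
// are ordered by path, then line, then char. Callers need not sort or deduplicate.
function usages(all: Usage[], symbol: string, kind: UsageKind): Usage[] {
    const seen: Record<string, boolean> = {}
    const out: Usage[] = []
    for (const u of all) {
        if (u.symbol !== symbol || u.kind !== kind) {
            continue
        }
        const key = u.path + ':' + u.line + ':' + u.char
        if (seen[key]) {
            continue
        }
        seen[key] = true
        out.push(u)
    }
    out.sort((a, b) => {
        if (a.path !== b.path) {
            return a.path < b.path ? -1 : 1
        }
        if (a.line !== b.line) {
            return a.line - b.line
        }
        return a.char - b.char
    })
    return out
}

// Made-up data containing one duplicate and one out-of-order entry.
const sampleUsages: Usage[] = [
    { symbol: 'example pkg/Config#name.', kind: 'reference', path: 'b.go', line: 5, char: 2 },
    { symbol: 'example pkg/Config#name.', kind: 'reference', path: 'a.go', line: 9, char: 4 },
    { symbol: 'example pkg/Config#name.', kind: 'reference', path: 'b.go', line: 5, char: 2 },
    { symbol: 'example pkg/Config#name.', kind: 'definition', path: 'a.go', line: 1, char: 0 },
]

const referenceList = usages(sampleUsages, 'example pkg/Config#name.', 'reference')
const definitionList = usages(sampleUsages, 'example pkg/Config#name.', 'definition')
```

The client code path is identical for every kind; only the string argument changes, which is where the reduction in duplication would come from.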