80 changes: 80 additions & 0 deletions .github/prompts/plan-scipIndexImport.prompt.md
@@ -0,0 +1,80 @@
# Plan: `scip-index-import` Domain

**TL;DR**: Create a standalone `domains/scip-index-import/` domain that (1) imports SCIP type-graph CSVs into Neo4j by splitting source files into single-statement `.cypher` files, (2) enriches the data for `projectionFunctions.sh` compatibility, and (3) creates SCIP-scoped structural nodes and SCIP-specific query variants for cyclic-deps and external-deps compatibility — without polluting shared labels.

**Decided**: domain name = `scip-index-import`, always clean up before import, pass `referenceCount` as `$dependencies_projection_weight_property` parameter, assume CSVs pre-placed in Neo4j import dir, reuse `cypher/Dependency_Enrichment/` language-agnostic queries directly, keep all SCIP nodes under SCIP-specific labels only.

---

## Phase 1: Import Queries (split source files)

Files in `domains/scip-index-import/queries/`:

1. `Cleanup_SCIP_Type_Nodes.cypher`: copied from `getting-started-with-scip/type-graph/cleanup-code-unit-csv-from-neo4j.cypher`; single DETACH DELETE statement; strip semicolons per convention
2. `Create_SCIP_Type_Constraint.cypher`: Step 1 from `import-code-unit-csv-to-neo4j.cypher` (CREATE CONSTRAINT scip_type_symbol_unique)
3. `Import_SCIP_Type_Internal_Nodes.cypher`: Step 2 (LOAD CSV WHERE row.file <> '')
4. `Import_SCIP_Type_External_Nodes.cypher`: Step 3 (LOAD CSV WHERE row.file = '')
5. `Import_SCIP_Type_Edges.cypher`: Step 4 (LOAD CSV edges, MERGE DEPENDS_ON)

*Cypher convention: strip all semicolons; first line = description comment; one statement per file.*
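
The convention above could be checked mechanically before a query file is committed. The following is a minimal sketch, assuming a hypothetical helper named `check_cypher_conventions` that is not part of the repository. It only covers the description comment and the semicolon rule; it does not verify that a file holds exactly one statement, and it would also flag a semicolon inside a comment or string literal.

```shell
# Hypothetical helper (not part of the repository): checks one .cypher file
# against two of the conventions stated above.
check_cypher_conventions() {
  local file="$1"
  local first_line=""
  # Read only the first line; tolerate an empty file.
  IFS= read -r first_line < "${file}" || true
  # Convention: the first line is a description comment.
  if [[ "${first_line}" != //* ]]; then
    echo "${file}: first line is not a // description comment" >&2
    return 1
  fi
  # Convention: no semicolons anywhere in the file.
  if grep -q ';' "${file}"; then
    echo "${file}: contains a semicolon" >&2
    return 1
  fi
  return 0
}
```

It could be called in a loop over `domains/scip-index-import/queries/*.cypher`, for example from a pre-commit hook.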

## Phase 2: Projection-Compatibility Enrichment

6. `Set_Incoming_SCIP_Type_Dependencies.cypher`: MATCH `(target:SCIPType)` WHERE `incomingDependencies IS NULL`; OPTIONAL MATCH source nodes; SET `incomingDependencies`, `incomingDependenciesWeight`
7. `Set_Outgoing_SCIP_Type_Dependencies.cypher`: mirror of above for outgoing
8. `Set_SCIP_Type_Test_Marker_Integer.cypher`: MATCH `(n:SCIPType)` WHERE `n.testMarkerInteger IS NULL`; `SET n.testMarkerInteger = CASE WHEN n.isTest THEN 1 ELSE 0 END`

*Script reuses `cypher/Dependency_Enrichment/Set_Dependency_Degree.cypher` and `Set_Dependency_Degree_Rank.cypher` directly — no copies.*

## Phase 3: Structural Node Enrichment (SCIP-scoped)

Structural nodes carry only SCIP-specific labels to avoid collision with jQAssistant data. No `:Type`, `:Package`, `:Artifact`, or `:ExternalType` labels are added.

9. `Create_SCIP_Artifact_Nodes.cypher`: MERGE `:SCIP:SCIPArtifact` nodes from unique `(module, version, packageManager)` on SCIPType; `fqn = module + ' ' + version`, `name = module`, `fileName = packageId`
10. `Create_SCIP_Module_Nodes_For_Internal_Types.cypher`: MERGE `:SCIP:SCIPModule` nodes from unique directory portion of `file` on `:SCIPInternalType` nodes; `fqn` = raw directory path (language-agnostic: `left(file, size(file) - size(split(file, '/')[-1]) - 1)`); no language-specific stripping
11. `Link_SCIP_Module_CONTAINS_SCIP_InternalType.cypher`: MATCH SCIPModule by `fqn` equal to the derived directory of `file`, MATCH SCIPInternalType, MERGE `(module)-[:CONTAINS]->(type)`
12. `Link_SCIP_Artifact_CONTAINS_SCIP_Module.cypher`: MATCH SCIPArtifact by module+version, MATCH SCIPModule by module, MERGE `(artifact)-[:CONTAINS]->(module)`
13. `Link_SCIP_Artifact_CONTAINS_SCIP_ExternalType.cypher`: External types have no package path; link SCIPArtifact directly to SCIPExternalType via CONTAINS
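
To make the directory derivation in item 10 concrete, here is a small shell sketch of the same string operation. The function name `module_fqn_for_file` is hypothetical and exists only for illustration; the pipeline itself performs this step in Cypher. For a path without any `/`, the Cypher expression in item 10 would compute a length of `-1`; the plan does not say how such rows are handled, so the sketch simply returns an empty string as an assumption.

```shell
# Hypothetical illustration of the "directory portion of file" rule:
# everything before the last "/" becomes the module fqn.
module_fqn_for_file() {
  local file="$1"
  if [[ "${file}" == */* ]]; then
    # Drop the last path segment together with its leading slash.
    printf '%s\n' "${file%/*}"
  else
    # Assumption: a file without a directory yields an empty fqn.
    printf '\n'
  fi
}

module_fqn_for_file "src/main/java/org/example/Foo.java"
```

Because no language-specific prefix such as `src/main/java` is stripped, the resulting `fqn` is a raw path and not a package name.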

## Phase 4: SCIP-specific Domain Query Variants

Existing domain queries use `:Package`, `:Type`, `:Artifact`, `:ExternalType` — labels not present on SCIP nodes. SCIP variants are placed in `domains/scip-index-import/queries/` for now (domain not yet integrated).

14. `Cyclic_SCIP_Type_Dependencies.cypher`: adapted from `domains/cyclic-dependencies/queries/Cyclic_Dependencies.cypher`; replace `:Package` with `:SCIPModule`, `:Type` with `:SCIPType`, `:Artifact` with `:SCIPArtifact`; same logic and output columns
15. `External_SCIP_Type_Package_Usage_Overall.cypher`: adapted from `domains/external-dependencies/queries/External_package_usage_overall.cypher`; replace `:ExternalType` with `:SCIPExternalType`, `:Package` with `:SCIPModule`, `:Type` with `:SCIPType`; same logic and output columns

## Phase 5: Entry-point Shell Script

16. `domains/scip-index-import/importScipIndexData.sh`:
- Header: shebang, blank line, description comment, `set -o errexit -o pipefail -o nounset`, `IFS=$'\n\t'`
- Source `executeQueryFunctions.sh` (via SCRIPTS_DIR resolution pattern from `prepareAnalysis.sh`)
- QUERIES_DIR defined relative to script location
- DEPENDENCY_ENRICHMENT_CYPHER_DIR path to `cypher/Dependency_Enrichment/`
- Runs in sequence: cleanup → constraint → import nodes (2 queries) → import edges → incoming → outgoing → test marker → Set_Dependency_Degree → Set_Dependency_Degree_Rank → artifact nodes → module nodes → contains links (3 queries)
- Log each step with echo prefix `importScipIndexData:`
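
The logging and execution steps listed above follow one repeated pattern, which can be sketched with a small wrapper. This is only an illustration under stated assumptions: `run_step` is a hypothetical helper, and `execute_cypher` is replaced by a stand-in here so the sketch runs without a Neo4j instance. The real function comes from `executeQueryFunctions.sh`.

```shell
# Stand-in for the real execute_cypher from executeQueryFunctions.sh,
# defined only so that this sketch runs without Neo4j.
execute_cypher() {
  echo "(would execute $1)"
}

# Hypothetical helper: logs a timestamped message with the script prefix,
# then runs the given query file.
run_step() {
  local description="$1"
  local query_file="$2"
  echo "importScipIndexData: $(date +'%Y-%m-%dT%H:%M:%S%z') ${description}..."
  execute_cypher "${query_file}"
}

# Assumption: in the real script this is resolved relative to the script location.
QUERIES_DIR="./queries"

run_step "Cleaning up existing SCIP type nodes" "${QUERIES_DIR}/Cleanup_SCIP_Type_Nodes.cypher"
run_step "Creating SCIP type uniqueness constraint" "${QUERIES_DIR}/Create_SCIP_Type_Constraint.cypher"
```

Listing each step once through a wrapper like this makes it harder to run the same query twice by accident, at the cost of one extra indirection.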

## Verification

1. `shellcheck domains/scip-index-import/importScipIndexData.sh`
2. Copy test CSVs from `temp/simple-project-for-scip-java-comparision/import/` to Neo4j import dir; run script
3. Verify nodes exist: `MATCH (n:SCIPType) RETURN count(n)`
4. Verify projection readiness: run `Dependencies_0_Verify_Projectable.cypher` with params `dependencies_projection_node=SCIPType`, `dependencies_projection_weight_property=referenceCount`
5. Verify cyclic-deps SCIP variant: `domains/scip-index-import/queries/Cyclic_SCIP_Type_Dependencies.cypher`
6. Verify external-deps SCIP variant: `domains/scip-index-import/queries/External_SCIP_Type_Package_Usage_Overall.cypher`

## Gap Analysis

✅ Enabled after this plan:
- `projectionFunctions.sh` with `SCIPType` node + `referenceCount` property (graph algorithms, anomaly detection, node embeddings)
- `Cyclic_SCIP_Type_Dependencies.cypher`: via SCIPModule/SCIPArtifact/CONTAINS + SCIPType
- `External_SCIP_Type_Package_Usage_Overall.cypher`: via SCIPExternalType + SCIPModule/SCIPType

❌ Still missing (not in this plan):
- Internal-deps queries use `:Java:Package`, `:Java:Type` - SCIP-specific variants not planned here
- TypeScript-specific internal-deps queries - different schema entirely
- git-history domain - explicitly out of scope
- Queries using `globalFqn` or `fqn` on types - SCIP types use `symbol`; `name` is the display name

## Further Considerations

1. **Weight property aliasing (optional future optimization)**: If a hardcoded `weight` property becomes necessary (e.g., for domain queries that don't parameterize the weight property), it can be added in Phase 1 during the LOAD CSV edges step (item 5) with a simple `SET r.weight = r.referenceCount` clause. Currently all projection usage is parameterized, so this is not needed.
108 changes: 108 additions & 0 deletions domains/scip-index-import/README.md
@@ -0,0 +1,108 @@
# SCIP Index Import Domain

Imports SCIP type-graph data from CSV into Neo4j and enriches it for analysis.
[SCIP](https://github.com/sourcegraph/scip) (Sourcegraph Code Intelligence Protocol) provides a language-agnostic type dependency graph.

Supported languages: Go, Java, TypeScript, Rust, C++, Ruby, Python, C#.

## When to use

Run this domain after generating `scip_type_nodes.csv` and `scip_type_edges.csv` and placing them in the Neo4j import directory.

## Entry Point

| Script | Purpose |
|--------|---------|
| [importScipIndexData.sh](./importScipIndexData.sh) | Full import and enrichment pipeline — run this directly |

## Prerequisites

Two CSV files must be present in the Neo4j import directory before running:

| File | Columns |
|------|---------|
| `scip_type_nodes.csv` | `symbol`, `display_name`, `file`, `scheme`, `type_name`, `package_id`, `package_manager`, `version`, `module`, `is_abstract` |
| `scip_type_edges.csv` | `source_symbol`, `target_symbol`, `reference_count` |

Internal types have a non-empty `file` column. External types have an empty `file` column.
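
A pre-flight check can catch a missing file or column before any data is deleted. The sketch below is an assumption-laden illustration, not part of `importScipIndexData.sh`: `csv_has_columns` is a hypothetical helper, and it assumes an unquoted, comma-separated header row.

```shell
# Hypothetical pre-flight check (not part of importScipIndexData.sh).
# Succeeds only if the file exists and its header row names every given column.
csv_has_columns() {
  local csv_file="$1"
  shift
  if [[ ! -f "${csv_file}" ]]; then
    echo "missing file: ${csv_file}" >&2
    return 1
  fi
  local header=""
  IFS= read -r header < "${csv_file}" || true
  header="${header%$'\r'}" # tolerate Windows line endings
  local column
  for column in "$@"; do
    # Wrap header and column in commas so that only whole names match.
    if [[ ",${header}," != *",${column},"* ]]; then
      echo "${csv_file}: missing column ${column}" >&2
      return 1
    fi
  done
  return 0
}
```

A call such as `csv_has_columns "${import_dir}/scip_type_edges.csv" source_symbol target_symbol reference_count` would then guard the import, where `import_dir` stands for wherever the Neo4j import directory lives in a given setup.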

## Import Phases

`importScipIndexData.sh` runs the following queries in order:

### 1. Setup

| Query | Purpose |
|-------|---------|
| [Cleanup_SCIP_Type_Nodes.cypher](./queries/Cleanup_SCIP_Type_Nodes.cypher) | Delete all existing `SCIPType` nodes and their relationships for a clean slate before re-import |
| [Create_SCIP_Type_Constraint.cypher](./queries/Create_SCIP_Type_Constraint.cypher) | Create uniqueness constraint on `SCIPType.symbol` |

### 2. Import

| Query | Purpose |
|-------|---------|
| [Import_SCIP_Type_Internal_Nodes.cypher](./queries/Import_SCIP_Type_Internal_Nodes.cypher) | Import internal types (own source files); sets `isTest` from file path patterns |
| [Import_SCIP_Type_External_Nodes.cypher](./queries/Import_SCIP_Type_External_Nodes.cypher) | Import external types (library dependencies) |
| [Import_SCIP_Type_Edges.cypher](./queries/Import_SCIP_Type_Edges.cypher) | Import `DEPENDS_ON` relationships between types |

### 3. Type Enrichment

| Query | Purpose |
|-------|---------|
| [Set_Incoming_SCIP_Type_Dependencies.cypher](./queries/Set_Incoming_SCIP_Type_Dependencies.cypher) | Set `incomingDependencies` count on each type |
| [Set_Outgoing_SCIP_Type_Dependencies.cypher](./queries/Set_Outgoing_SCIP_Type_Dependencies.cypher) | Set `outgoingDependencies` count on each type |
| [Set_SCIP_Type_Test_Marker_Integer.cypher](./queries/Set_SCIP_Type_Test_Marker_Integer.cypher) | Set `testMarkerInteger` (0/1) from `isTest` on all types |

### 4. Structural Nodes and Links

| Query | Purpose |
|-------|---------|
| [Create_SCIP_Module_Nodes_For_Internal_Types.cypher](./queries/Create_SCIP_Module_Nodes_For_Internal_Types.cypher) | Create `SCIPModule` nodes — one per unique source directory |
| [Create_SCIP_Artifact_Nodes.cypher](./queries/Create_SCIP_Artifact_Nodes.cypher) | Create `SCIPArtifact` nodes — one per unique module+version combination |
| [Link_SCIP_Module_CONTAINS_SCIP_InternalType.cypher](./queries/Link_SCIP_Module_CONTAINS_SCIP_InternalType.cypher) | `SCIPModule -[:CONTAINS]-> SCIPInternalType` |
| [Link_SCIP_Artifact_CONTAINS_SCIP_Module.cypher](./queries/Link_SCIP_Artifact_CONTAINS_SCIP_Module.cypher) | `SCIPArtifact -[:CONTAINS]-> SCIPModule` |
| [Link_SCIP_Artifact_CONTAINS_SCIP_ExternalType.cypher](./queries/Link_SCIP_Artifact_CONTAINS_SCIP_ExternalType.cypher) | `SCIPArtifact -[:CONTAINS]-> SCIPExternalType` |
| [Set_SCIP_Module_Is_Test_And_Marker_Integer.cypher](./queries/Set_SCIP_Module_Is_Test_And_Marker_Integer.cypher) | Set `isTest` and `testMarkerInteger` on modules — true if any contained type is a test |

### 5. Dependency Metrics

Shared queries from [`cypher/Dependency_Enrichment/`](../../cypher/Dependency_Enrichment/):

- `Set_Dependency_Degree.cypher` — combined in/out degree per node
- `Set_Dependency_Degree_Rank.cypher` — percentile rank of dependency degree

## Graph Model

### Nodes

| Label | Description |
|-------|-------------|
| `SCIP:SCIPType:SCIPInternalType` | Type from own source code; has `isTest`, `testMarkerInteger`, `file` |
| `SCIP:SCIPType:SCIPExternalType` | Type from an external library; `isTest = false` |
| `SCIP:SCIPModule` | Source directory; has `isTest`, `testMarkerInteger` |
| `SCIP:SCIPArtifact` | Module + version package; groups types and modules |

### Relationships

| Relationship | From → To | Description |
|--------------|-----------|-------------|
| `DEPENDS_ON` | `SCIPType → SCIPType` | Type-level dependency with `referenceCount` |
| `CONTAINS` | `SCIPModule → SCIPInternalType` | Module contains its source types |
| `CONTAINS` | `SCIPArtifact → SCIPModule` | Artifact contains its modules |
| `CONTAINS` | `SCIPArtifact → SCIPExternalType` | Artifact contains its external types |

### Key Properties

| Property | Nodes | Description |
|----------|-------|-------------|
| `isTest` | `SCIPInternalType`, `SCIPModule` | `true` if the node is part of test code |
| `testMarkerInteger` | `SCIPType`, `SCIPModule` | `1` if `isTest`, `0` otherwise — used for graph projections |
| `language` | `SCIPType` | Detected language (e.g. `Java`, `TypeScript`, `Go`) |
| `incomingDependencies` | `SCIPType` | Number of types that depend on this type |
| `outgoingDependencies` | `SCIPType` | Number of types this type depends on |

### Test Detection

`isTest` is set on `SCIPInternalType` nodes during import by matching file path patterns (`/test/`, `/tests/`, `/spec/`, `__tests__`, `_test.go`, `.test.`, `.spec.`, Windows equivalents).

`isTest` on `SCIPModule` nodes is derived from its contained types: a module is a test module if **any** of its `SCIPInternalType` nodes has `isTest = true`.
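
The path patterns listed above can be tried out with a short shell sketch. It is an approximation only: the actual detection happens in the Cypher import query, which is not shown here, so treating every pattern as a substring match and normalizing backslashes first are both assumptions. The name `is_test_path` is hypothetical.

```shell
# Hypothetical approximation of the file path test detection.
# Returns 0 (true) if the path matches one of the listed patterns.
is_test_path() {
  # Assumption: Windows paths are handled by normalizing "\" to "/".
  local path="${1//\\//}"
  case "${path}" in
    */test/* | */tests/* | */spec/* | *__tests__* | *_test.go* | *.test.* | *.spec.*)
      return 0
      ;;
    *)
      return 1
      ;;
  esac
}

if is_test_path "pkg/server_test.go"; then
  echo "test code"
fi
```

Note that under these patterns a relative path that starts with `test/` (no leading slash) is not matched, which may or may not mirror the real query.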
95 changes: 95 additions & 0 deletions domains/scip-index-import/importScipIndexData.sh
@@ -0,0 +1,95 @@
#!/usr/bin/env bash

# Imports SCIP type-graph CSV data into Neo4j and enriches it for projection compatibility.
# Creates SCIPType, SCIPInternalType, SCIPExternalType, SCIPArtifact, and SCIPModule nodes.
# Also creates structural CONTAINS links between artifacts, modules, and types.
# Assumes scip_type_nodes.csv and scip_type_edges.csv are already placed in the Neo4j import directory.
# Requires executeQueryFunctions.sh

# Fail on any error ("-o errexit" = exit on first error, "-o pipefail" = fail on errors within piped commands, "-o nounset" = fail on unset variables)
set -o errexit -o pipefail -o nounset
IFS=$'\n\t'

## Get this "domains/scip-index-import" directory if not already set
# Even though $BASH_SOURCE is made for Bourne-like shells, it is also supported by others and is therefore the preferred solution here.
# CDPATH reduces the scope of the cd command to potentially prevent unintended directory changes.
# This way non-standard tools like readlink aren't needed.
SCIP_INDEX_IMPORT_SCRIPT_DIR=${SCIP_INDEX_IMPORT_SCRIPT_DIR:-$( CDPATH=. cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd -P )}
echo "importScipIndexData: SCIP_INDEX_IMPORT_SCRIPT_DIR=${SCIP_INDEX_IMPORT_SCRIPT_DIR}"

# Get the "scripts" directory by navigating two levels up from this domain directory.
SCRIPTS_DIR=${SCRIPTS_DIR:-"${SCIP_INDEX_IMPORT_SCRIPT_DIR}/../../scripts"}

# Cypher query directory within this domain
QUERIES_DIR="${SCIP_INDEX_IMPORT_SCRIPT_DIR}/queries"

# Dependency enrichment queries in the shared cypher directory
DEPENDENCY_ENRICHMENT_CYPHER_DIR="${SCRIPTS_DIR}/../cypher/Dependency_Enrichment"

# Define functions to execute a cypher query from within a given file like "execute_cypher"
source "${SCRIPTS_DIR}/executeQueryFunctions.sh"

echo "importScipIndexData: $(date +'%Y-%m-%dT%H:%M:%S%z') Cleaning up existing SCIP type nodes..."
execute_cypher "${QUERIES_DIR}/Cleanup_SCIP_Type_Nodes.cypher"

echo "importScipIndexData: $(date +'%Y-%m-%dT%H:%M:%S%z') Creating SCIP type uniqueness constraint..."
execute_cypher "${QUERIES_DIR}/Create_SCIP_Type_Constraint.cypher"

echo "importScipIndexData: $(date +'%Y-%m-%dT%H:%M:%S%z') Importing SCIP internal type nodes..."
execute_cypher "${QUERIES_DIR}/Import_SCIP_Type_Internal_Nodes.cypher"

echo "importScipIndexData: $(date +'%Y-%m-%dT%H:%M:%S%z') Importing SCIP external type nodes..."
execute_cypher "${QUERIES_DIR}/Import_SCIP_Type_External_Nodes.cypher"

echo "importScipIndexData: $(date +'%Y-%m-%dT%H:%M:%S%z') Importing SCIP type dependency edges..."
execute_cypher "${QUERIES_DIR}/Import_SCIP_Type_Edges.cypher"

echo "importScipIndexData: $(date +'%Y-%m-%dT%H:%M:%S%z') Setting incoming SCIP type dependencies..."
execute_cypher "${QUERIES_DIR}/Set_Incoming_SCIP_Type_Dependencies.cypher"

echo "importScipIndexData: $(date +'%Y-%m-%dT%H:%M:%S%z') Setting outgoing SCIP type dependencies..."
execute_cypher "${QUERIES_DIR}/Set_Outgoing_SCIP_Type_Dependencies.cypher"

echo "importScipIndexData: $(date +'%Y-%m-%dT%H:%M:%S%z') Setting SCIP type test marker integers..."
execute_cypher "${QUERIES_DIR}/Set_SCIP_Type_Test_Marker_Integer.cypher"

echo "importScipIndexData: $(date +'%Y-%m-%dT%H:%M:%S%z') Creating SCIP module nodes..."
execute_cypher "${QUERIES_DIR}/Create_SCIP_Module_Nodes_For_Internal_Types.cypher"

echo "importScipIndexData: $(date +'%Y-%m-%dT%H:%M:%S%z') Creating SCIP artifact nodes..."
execute_cypher "${QUERIES_DIR}/Create_SCIP_Artifact_Nodes.cypher"

echo "importScipIndexData: $(date +'%Y-%m-%dT%H:%M:%S%z') Linking SCIP modules to their contained internal types..."
execute_cypher "${QUERIES_DIR}/Link_SCIP_Module_CONTAINS_SCIP_InternalType.cypher"

echo "importScipIndexData: $(date +'%Y-%m-%dT%H:%M:%S%z') Linking SCIP artifacts to their contained modules..."
execute_cypher "${QUERIES_DIR}/Link_SCIP_Artifact_CONTAINS_SCIP_Module.cypher"

echo "importScipIndexData: $(date +'%Y-%m-%dT%H:%M:%S%z') Linking SCIP artifacts to their contained external types..."
execute_cypher "${QUERIES_DIR}/Link_SCIP_Artifact_CONTAINS_SCIP_ExternalType.cypher"

echo "importScipIndexData: $(date +'%Y-%m-%dT%H:%M:%S%z') Setting SCIP module test markers..."
execute_cypher "${QUERIES_DIR}/Set_SCIP_Module_Is_Test_And_Marker_Integer.cypher"

echo "importScipIndexData: $(date +'%Y-%m-%dT%H:%M:%S%z') Setting dependency degree..."
execute_cypher "${DEPENDENCY_ENRICHMENT_CYPHER_DIR}/Set_Dependency_Degree.cypher"

echo "importScipIndexData: $(date +'%Y-%m-%dT%H:%M:%S%z') Setting dependency degree rank..."
execute_cypher "${DEPENDENCY_ENRICHMENT_CYPHER_DIR}/Set_Dependency_Degree_Rank.cypher"

echo "importScipIndexData: $(date +'%Y-%m-%dT%H:%M:%S%z') SCIP index import complete."
@@ -0,0 +1,4 @@
// Remove all SCIPType nodes and their relationships from Neo4j. Run before re-importing to start with a clean slate.

MATCH (node:SCIPType)
DETACH DELETE node
@@ -0,0 +1,13 @@
// Create SCIPArtifact nodes from unique module, version, and packageManager combinations on SCIPType nodes. Requires "Import_SCIP_Type_Internal_Nodes.cypher" and "Import_SCIP_Type_External_Nodes.cypher".

MATCH (t:SCIPType)
WITH DISTINCT t.module AS module
,t.version AS version
,t.packageManager AS packageManager
,t.packageId AS packageId
MERGE (a:SCIP:SCIPArtifact {fqn: module + ' ' + version})
SET a.name = module
,a.version = version
,a.packageManager = packageManager
,a.fileName = packageId
RETURN count(*) AS writtenNodes
@@ -0,0 +1,7 @@
// Create SCIPModule nodes from unique directory portions of source file paths on SCIPInternalType nodes. Requires "Import_SCIP_Type_Internal_Nodes.cypher".

MATCH (t:SCIPInternalType)
WITH DISTINCT left(t.file, size(t.file) - size(split(t.file, '/')[-1]) - 1) AS directoryPath
MERGE (m:SCIP:SCIPModule {fqn: directoryPath})
SET m.name = split(directoryPath, '/')[-1]
RETURN count(*) AS writtenNodes
@@ -0,0 +1,4 @@
// Create uniqueness constraint on symbol property for SCIPType nodes.

CREATE CONSTRAINT scip_type_symbol_unique IF NOT EXISTS
FOR (n:SCIPType) REQUIRE n.symbol IS UNIQUE