Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

artifact_id for OP Atlas sourced projects is not being assigned on production runs #3267

Open
ccerv1 opened this issue Mar 14, 2025 · 4 comments
Labels
c:data Gathering data (e.g. indexing)

Comments

@ccerv1
Copy link
Member

ccerv1 commented Mar 14, 2025

Which area(s) are affected? (leave empty if unsure)

Other

To Reproduce

Let's take the following query:

    select
        artifact_id,
        project_id,
        artifact_source_id,
        artifact_source,
        artifact_namespace,
        artifact_name
    from oso.int_artifacts_by_project_in_op_atlas
    where artifact_name = '0x03232b5ee80369a88620615f8328beec1884b731'

Production Trino gives us:

[{'artifact_id': None,
  'project_id': 'VsbnIqH0Z8ErW7VixsMTGcLVf45E645bcxAVcD/KZT0=',
  'artifact_source_id': '0x03232b5ee80369a88620615f8328beec1884b731',
  'artifact_source': 'MAINNET',
  'artifact_namespace': None,
  'artifact_name': '0x03232b5ee80369a88620615f8328beec1884b731'},
 {'artifact_id': None,
  'project_id': 'VsbnIqH0Z8ErW7VixsMTGcLVf45E645bcxAVcD/KZT0=',
  'artifact_source_id': '0x03232b5ee80369a88620615f8328beec1884b731',
  'artifact_source': 'OPTIMISM',
  'artifact_namespace': None,
  'artifact_name': '0x03232b5ee80369a88620615f8328beec1884b731'}]

Meanwhile, if we test locally, we get:

┌──────────────────────┬──────────────────────┬──────────────────────┬───┬────────────────────┬──────────────────────┐
│     artifact_id      │      project_id      │  artifact_source_id  │ … │ artifact_namespace │    artifact_name     │
│       varchar        │       varchar        │       varchar        │   │      varchar       │       varchar        │
├──────────────────────┼──────────────────────┼──────────────────────┼───┼────────────────────┼──────────────────────┤
│ bad48b4822a5180d72…  │ 56c6e722a1f467c12b…  │ 0x03232b5ee80369a8…  │ … │                    │ 0x03232b5ee80369a8…  │
│ 43c276434f59b952ac…  │ 56c6e722a1f467c12b…  │ 0x03232b5ee80369a8…  │ … │                    │ 0x03232b5ee80369a8…  │
├──────────────────────┴──────────────────────┴──────────────────────┴───┴────────────────────┴──────────────────────┤
│ 2 rows                                                                                         6 columns (5 shown) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Describe the Bug

The issue is likely with oso_id handling of null artifact namespaces.

The logic for assigning artifact_id in int_artifacts_by_project_in_op_atlas is:

SELECT
  project_id,
  CASE
    WHEN artifact_source = 'GITHUB' THEN @oso_id(artifact_source, artifact_source_id)
    ELSE @oso_id(artifact_source, artifact_namespace, artifact_name)
  END AS artifact_id,
  artifact_source_id,
  artifact_source,
  artifact_namespace,
  artifact_name,
  artifact_url,
  artifact_type
FROM all_normalized_artifacts

In the oso_id macro we see:

@macro()
def oso_id(evaluator: MacroEvaluator, *args: exp.Expression):
    if evaluator.runtime_stage in ["loading", "creating"]:
        return exp.Literal(this="someid", is_string=True)
    concatenated = exp.Concat(expressions=args, safe=True, coalesce=False)
    if evaluator.engine_adapter.dialect == "trino":
        # Trino's SHA256 function only accepts type `varbinary`. So we convert
        # the varchar to varbinary with trino's to_utf8.
        concatenated = exp.Anonymous(this="to_utf8", expressions=[concatenated])
    sha = exp.SHA2(
        this=concatenated,
        length=exp.Literal(this=256, is_string=False),
    )
    if evaluator.runtime_stage in ["loading", "creating"]:
        return exp.Literal(this="", is_string=True)
    if evaluator.engine_adapter.dialect == "duckdb":
        return sha
    return exp.ToBase64(this=sha)

However, in our OP Atlas staging models, we are assigning namespaces as NULL::VARCHAR rather than ''.

For example:

SELECT
  @oso_id('OP_ATLAS', project_id) AS project_id, /* Translating op-atlas project_id to OSO project_id */
  LOWER(contract_address) AS artifact_source_id,
  @chain_id_to_chain_name(chain_id) AS artifact_source,
  NULL::VARCHAR AS artifact_namespace,
  contract_address AS artifact_name,
  NULL::VARCHAR AS artifact_url,
  'CONTRACT' AS artifact_type
FROM latest_contracts

Expected Behavior

We should be getting the artifact_id to generate -- and it should match the same artifact_id we get for the same artifact from OSSD. Currently, this only works for Github artifacts (that are available in both OSSD and OP Atlas). For example:

select
    artifact_id,
    project_id,
    artifact_source_id,
    artifact_namespace,
    artifact_name
from oso.artifacts_by_project_v1
where
    artifact_namespace = 'ethereum-attestation-service'
    and artifact_name = 'eas-contracts'
limit 10

returns:

[{'artifact_id': 'grgi4DC5KFU+WgNpSItyaGfsgYwZgiMZJjNx1J2FD+k=',
  'project_id': 'Ze2CblnBaz2Apm6a00KjwAZ7N4K5yxDUpXlK+mDkwJw=',
  'artifact_source_id': '300911356',
  'artifact_namespace': 'ethereum-attestation-service',
  'artifact_name': 'eas-contracts'},
 {'artifact_id': 'grgi4DC5KFU+WgNpSItyaGfsgYwZgiMZJjNx1J2FD+k=',
  'project_id': '1mBkagpZ6JI5aGrhKsfnb7/M1rtz8N6nraFghzTJs+w=',
  'artifact_source_id': '300911356',
  'artifact_namespace': 'ethereum-attestation-service',
  'artifact_name': 'eas-contracts'}]
@github-project-automation github-project-automation bot moved this to Backlog in OSO Mar 14, 2025
@ccerv1 ccerv1 added the c:data Gathering data (e.g. indexing) label Mar 14, 2025
@ccerv1 ccerv1 added this to the [c] Retro Funding S7 Metrics milestone Mar 14, 2025
@ccerv1 ccerv1 linked a pull request Mar 14, 2025 that will close this issue
@ravenac95
Copy link
Member

I see. I actually think rather than address this in the macro level it's likely the responsibility of the calling function. Otherwise every value that is given into the macro needs to be wrapped in a COALESCE(value, '') function which I feel gives too much responsibility to the oso_id macro. We should probably just either pass '' or COALESCE(artifact_namespace, '') when calling oso_id that way the other values passed in don't need to have coalesce as well.

@ravenac95
Copy link
Member

actually looking at this, the artifact_id for ossd artifacts is also @oso_id(artifact_source, artifact_source_id) I think we actually then need to redo the artifact ids everywhere if we want these things to match.

@ccerv1
Copy link
Member Author

ccerv1 commented Mar 14, 2025

Another WIP PR here:
#3276

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c:data Gathering data (e.g. indexing)
Projects
Status: Backlog
Development

Successfully merging a pull request may close this issue.

2 participants