Indicate Python data types for GraphML specifications #752

Merged
merged 5 commits into from
Aug 21, 2025
Changes from 1 commit
161 changes: 90 additions & 71 deletions site/content/3.13/data-science/graphml/notebooks-api.md
@@ -118,54 +118,71 @@ arangoml.projects.list_projects()

**API Documentation: [ArangoML.jobs.featurize](https://arangoml.github.io/arangoml/api.html#agml_api.jobs.v1.api.jobs_api.JobsApi.featurize)**

**The Featurization Service depends on a `Featurization Specification` that contains**:
- `featurizationName`: A name for the featurization task.
The Featurization Service depends on a **Featurization Specification**:

- `projectName`: The associated project name. You can use `project.name` here
{{< tip >}}
The descriptions of the specifications on this page indicate the Python data types,
but you can substitute them as follows for a schema description in terms of JSON:

| Python | JSON |
|:--------|:-------|
| `dict` | object |
| `list` | array |
| `int` | number |
| `float` | number |
| `str` | string |
{{< /tip >}}
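
As a rough illustration of the type mapping in the tip above, the following sketch serializes a small Python dictionary with the standard `json` module. The fragment and its values are hypothetical and are not sent to any service. `bool` and `None` are not listed in the table; `json` emits them as `true`/`false` and `null`.

```py
import json

# Hypothetical fragment of a specification, written with Python types.
spec_fragment = {
    "featurizationName": "example_featurization",  # str  -> string
    "dimensionalityReduction": {                    # dict -> object
        "disabled": False,                          # bool -> true / false
        "size": 512,                                # int  -> number
    },
    "featureSetID": None,                           # None -> null
}

# Serialize to the JSON form of the same structure.
spec_json = json.dumps(spec_fragment, indent=2)
print(spec_json)
```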

- `featurizationName` (str): A name for the featurization task.

- `projectName` (str): The associated project name. You can use `project.name` here
if it was created or retrieved as described above.

- `graphName`: The associated graph name that exists within the database.
- `graphName` (str): The associated graph name that exists within the database.

- `featureSetID` Optional: The ID of an existing Feature Set to re-use. If provided, the `metagraph` dictionary can be ommitted. Defaults to `None`.
- `featureSetID` (str, _optional_): The ID of an existing Feature Set to re-use. If provided, the `metagraph` dictionary can be omitted. Defaults to `None`.

- `featurizationConfiguration` Optional: The optional default configuration to be applied
- `featurizationConfiguration` (dict, _optional_): The optional default configuration to be applied
across all features. Individual collection feature settings override this option.

- `featurePrefix`: The prefix to be applied to all individual features generated. Default is `feat_`.
- `featurePrefix` (str): The prefix to be applied to all individual features generated. Default is `feat_`.

- `outputName`: Adjust the default feature name. This can be any valid ArangoDB attribute name. Defaults to `x`.
- `outputName` (str): Adjust the default feature name. This can be any valid ArangoDB attribute name. Defaults to `x`.

- `dimensionalityReduction`: Object configuring dimensionality reduction.
- `disabled`: Whether to disable dimensionality reduction. Default is `false`,
- `dimensionalityReduction` (dict): Object configuring dimensionality reduction.
- `disabled` (bool): Whether to disable dimensionality reduction. Default is `false`,
therefore dimensionality reduction is applied after Featurization by default.
- `size`: The number of dimensions to reduce the feature length to. Default is `512`.

- `defaultsPerFeatureType`: A dictionary mapping each feature to how missing or mismatched values should be handled. The keys of this dictionary are the features, and the values are sub-dictionaries with the following keys:
- `missing`: A sub-dictionary detailing how missing values should be handled.
- `strategy`: The strategy to use for missing values. Options include `REPLACE` or `RAISE`.
- `replacement`: The value to replace missing values with. Only needed if `strategy` is `REPLACE`.
- `mismatch`: A sub-dictionary detailing how mismatched values should be handled.
- `strategy`: The strategy to use for mismatched values. Options include `REPLACE`, `RAISE`, `COERCE_REPLACE`, or `COERCE_RAISE`.
- `replacement`: The value to replace mismatched values with. Only needed if `strategy` is `REPLACE`, or `COERCE_REPLACE`.

- `jobConfiguration` Optional: A set of configurations that are applied to the job.
- `batchSize`: The number of documents to process in a single batch. Default is `32`.
- `runAnalysisChecks`: Whether to run analysis checks, used to perform a high-level analysis of the data quality before proceeding. Default is `true`.
- `skipLabels`: Skips the featurization process for attributes marked as `label`. Default is `false`.
- `useFeatureStore`: Enables the use of the Feature Store database, which allows you to store features separately from your Source Database. Default is `false`, therefore features are written to the source graph.
- `overwriteFSGraph`: Whether to overwrite the Feature Store Graph if features were previously generated. Default is `false`, therefore features are written to an existing Feature Store Graph.s
- `writeToSourceGraph`: Whether to store the generated features on the Source Graph. Default is `true`.

- `metagraph`: Metadata to represent the vertex & edge collections of the graph.
- `vertexCollections`: A dictionary mapping the vertex collection names to the following values:
- `features`: A dictionary mapping document properties to the following values:
- `featureType`: The type of feature. Options include `text`, `category`, `numeric`, or `label`.
- `config`: Collection-level configuration settings.
- `featurePrefix`: Identical to global `featurePrefix` but for this collection.
- `dimensionalityReduction`: Identical to global `dimensionalityReduction` but for this collection.
- `outputName`: Identical to global `outputName`, but specifically for this collection.
- `defaultsPerFeatureType`: Identical to global `defaultsPerFeatureType`, but specifically for this collection.
- `edgeCollections`: A dictionary mapping the edge collection names to an empty dictionary, as edge attributes are not currently supported.
- `size` (int): The number of dimensions to reduce the feature length to. Default is `512`.

- `defaultsPerFeatureType` (dict): A dictionary mapping each feature type to how missing or mismatched values should be handled. The keys of this dictionary are the feature types, and the values are sub-dictionaries with the following keys:
- `text` / `numeric` / `category` / `label`:
- `missing` (dict): A sub-dictionary detailing how missing values should be handled.
- `strategy` (str): The strategy to use for missing values. Options include `REPLACE` or `RAISE`.
- `replacement`: The value to replace missing values with. Only needed if `strategy` is `REPLACE`.
- `mismatch` (dict): A sub-dictionary detailing how mismatched values should be handled.
- `strategy` (str): The strategy to use for mismatched values. Options include `REPLACE`, `RAISE`, `COERCE_REPLACE`, or `COERCE_RAISE`.
- `replacement` (str): The value to replace mismatched values with. Only needed if `strategy` is `REPLACE` or `COERCE_REPLACE`.

- `jobConfiguration` (dict, _optional_): A set of configurations that are applied to the job.
- `batchSize` (int): The number of documents to process in a single batch. Default is `32`.
- `runAnalysisChecks` (bool): Whether to run analysis checks, used to perform a high-level analysis of the data quality before proceeding. Default is `true`.
- `skipLabels` (bool): Skips the featurization process for attributes marked as `label`. Default is `false`.
- `useFeatureStore` (bool): Enables the use of the Feature Store database, which allows you to store features separately from your Source Database. Default is `false`, therefore features are written to the source graph.
- `overwriteFSGraph` (bool): Whether to overwrite the Feature Store Graph if features were previously generated. Default is `false`, therefore features are written to an existing Feature Store Graph.
- `writeToSourceGraph` (bool): Whether to store the generated features on the Source Graph. Default is `true`.

- `metagraph` (dict): Metadata to represent the vertex & edge collections of the graph.
- `vertexCollections` (dict): A dictionary mapping the vertex collection names to a configuration dictionary:
- _collection name_ (dict):
- `features` (dict): A dictionary mapping document properties to the following values:
- `featureType` (str): The type of feature. Options include `text`, `category`, `numeric`, or `label`.
- `config` (dict): Collection-level configuration settings.
- `featurePrefix` (str): Identical to global `featurePrefix` but for this collection.
- `dimensionalityReduction` (dict): Identical to global `dimensionalityReduction` but for this collection.
- `outputName` (str): Identical to global `outputName`, but specifically for this collection.
- `defaultsPerFeatureType` (dict): Identical to global `defaultsPerFeatureType`, but specifically for this collection.
- `edgeCollections` (dict): A dictionary mapping the edge collection names to an empty dictionary, as edge attributes are not currently supported.
- _collection name_ (dict): An empty dictionary.
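
The fields listed above can be assembled into a plain Python dictionary. The following is a minimal sketch, not a definitive specification: the project, graph, collection, and attribute names are hypothetical, the placement of `features` directly under each collection name is read from the list above, and the helper `invalid_feature_types` is an illustrative local check rather than part of the `arangoml` package.

```py
# Documented options for `featureType`.
ALLOWED_FEATURE_TYPES = {"text", "category", "numeric", "label"}

# Hypothetical Featurization Specification; all names are placeholders.
featurization_spec = {
    "featurizationName": "example_featurization",
    "projectName": "example_project",
    "graphName": "example_graph",
    "featurizationConfiguration": {
        "featurePrefix": "feat_",
        "outputName": "x",
        "dimensionalityReduction": {"disabled": False, "size": 512},
        "defaultsPerFeatureType": {
            "text": {
                "missing": {"strategy": "REPLACE", "replacement": ""},
                "mismatch": {"strategy": "RAISE"},
            },
        },
    },
    "jobConfiguration": {"batchSize": 32, "runAnalysisChecks": True},
    "metagraph": {
        "vertexCollections": {
            "Person": {
                "features": {
                    "bio": {"featureType": "text"},
                    "age": {"featureType": "numeric"},
                },
            },
        },
        # Edge collections map to empty dictionaries.
        "edgeCollections": {"knows": {}},
    },
}


def invalid_feature_types(spec):
    """Return (collection, attribute, featureType) triples that use an undocumented type."""
    problems = []
    for collection, config in spec["metagraph"]["vertexCollections"].items():
        for attribute, feature in config.get("features", {}).items():
            feature_type = feature.get("featureType")
            if feature_type not in ALLOWED_FEATURE_TYPES:
                problems.append((collection, attribute, feature_type))
    return problems


print(invalid_feature_types(featurization_spec))
```

Running a local check like this before submitting a job can surface a misspelled `featureType` early, under the assumption that the service accepts only the four documented options.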

The Featurization Specification example is used for the GDELT dataset:
- It featurizes the `name` attribute of the `Actor`, `Class`, `Country`,
@@ -383,34 +400,37 @@ Training Graph Machine Learning Models with GraphML requires two steps:
1. Describe which data points should be included in the Training Job.
2. Pass the Training Specification to the Training Service.

**The Training Service depends on a `Training Specification` that contains**:
- `featureSetID`: The feature set ID that was generated during the Featurization Job (if any). It replaces the need to provide the `metagraph`, `databaseName`, and `projectName` fields.
The Training Service depends on a **Training Specification**:

- `databaseName`: The database name the source data is in. Can be omitted if `featureSetID` is provided.
- `featureSetID` (str): The feature set ID that was generated during the Featurization Job (if any). It replaces the need to provide the `metagraph`, `databaseName`, and `projectName` fields.

- `projectName`: The top-level project to which all the experiments will link back. Can be omitted if `featureSetID` is provided.
- `databaseName` (str): The database name the source data is in. Can be omitted if `featureSetID` is provided.

- `useFeatureStore`: Boolean for enabling or disabling the use of the feature store. Default is `false`.
- `projectName` (str): The top-level project to which all the experiments will link back. Can be omitted if `featureSetID` is provided.

- `mlSpec`: Describes the desired machine learning task, input features, and
- `useFeatureStore` (bool): Boolean for enabling or disabling the use of the feature store. Default is `false`.

- `mlSpec` (dict): Describes the desired machine learning task, input features, and
the attribute label to be predicted.
- `classification`: Dictionary to describe the Node Classification Task Specification.
- `targetCollection`: The ArangoDB collection name that contains the prediction label.
- `inputFeatures`: The name of the feature to be used as input.
- `labelField`: The name of the attribute to be predicted.
- `batchSize`: The number of documents to process in a single training batch. Default is `64`.
- `graphEmbeddings`: Dictionary to describe the Graph Embedding Task Specification.
- `targetCollection`: The ArangoDB collection used to generate the embeddings.
- `embeddingSize`: The size of the embedding vector. Default is `128`.
- `batchSize`: The number of documents to process in a single training batch. Default is `64`.
- `generateEmbeddings`: Whether to generate embeddings on the training dataset. Default is `false`.

- `metagraph`: Metadata to represent the vertex & edge collections of the graph. If `featureSetID` is provided, this can be omitted.
- `graph`: The ArangoDB graph name.
- `vertexCollections`: A dictionary mapping the collection names to the following values:
- `x`: The name of the feature to be used as input.
- `y`: The name of the attribute to be predicted. Can only be specified for one collection.
- `edgeCollections`: A dictionary mapping the edge collection names to an empty dictionary, as edge features are not currently supported.
- `classification` (dict): Dictionary to describe the Node Classification Task Specification.
- `targetCollection` (str): The ArangoDB collection name that contains the prediction label.
- `inputFeatures` (str): The name of the feature to be used as input.
- `labelField` (str): The name of the attribute to be predicted.
- `batchSize` (int): The number of documents to process in a single training batch. Default is `64`.
- `graphEmbeddings` (dict): Dictionary to describe the Graph Embedding Task Specification.
- `targetCollection` (str): The ArangoDB collection used to generate the embeddings.
- `embeddingSize` (int): The size of the embedding vector. Default is `128`.
- `batchSize` (int): The number of documents to process in a single training batch. Default is `64`.
- `generateEmbeddings` (bool): Whether to generate embeddings on the training dataset. Default is `false`.

- `metagraph` (dict): Metadata to represent the vertex & edge collections of the graph. If `featureSetID` is provided, this can be omitted.
- `graph` (str): The ArangoDB graph name.
- `vertexCollections` (dict): A dictionary mapping the collection names to a configuration dictionary:
- _collection name_ (dict):
- `x` (str): The name of the feature to be used as input.
- `y`: The name of the attribute to be predicted. Can only be specified for one collection.
- `edgeCollections` (dict): A dictionary mapping the edge collection names to an empty dictionary, as edge features are not currently supported.
- _collection name_ (dict): An empty dictionary.
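
To make the two ways of identifying the training data concrete, here is a hedged sketch of both variants as Python dictionaries: one that relies on `featureSetID`, and one that spells out `databaseName`, `projectName`, and `metagraph`. All IDs and names are hypothetical, and `collections_with_label` is an illustrative helper for the "only one collection may specify `y`" rule, not an `arangoml` function.

```py
# Hypothetical node classification task, shared by both variants.
ml_spec = {
    "classification": {
        "targetCollection": "Person",
        "inputFeatures": "x",
        "labelField": "label",
        "batchSize": 64,
    },
}

# Variant 1: reuse a feature set from a Featurization Job.
# `metagraph`, `databaseName`, and `projectName` are omitted.
training_spec_from_feature_set = {
    "featureSetID": "example-feature-set-id",  # placeholder ID
    "mlSpec": ml_spec,
}

# Variant 2: describe the source data explicitly.
training_spec_with_metagraph = {
    "databaseName": "example_db",
    "projectName": "example_project",
    "useFeatureStore": False,
    "mlSpec": ml_spec,
    "metagraph": {
        "graph": "example_graph",
        "vertexCollections": {
            "Person": {"x": "x", "y": "label"},
            "Company": {"x": "x"},
        },
        "edgeCollections": {"works_at": {}},
    },
}


def collections_with_label(spec):
    """Return the vertex collections in the metagraph that specify `y`."""
    vertex_collections = spec.get("metagraph", {}).get("vertexCollections", {})
    return [name for name, config in vertex_collections.items() if "y" in config]


labelled = collections_with_label(training_spec_with_metagraph)
if len(labelled) > 1:
    raise ValueError(f"`y` may only be set for one collection, found: {labelled}")
print(labelled)
```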

A Training Specification allows for concisely defining your training task in a
single object and then passing that object to the training service using the
@@ -705,23 +725,22 @@ print(best_model)

**API Documentation: [ArangoML.jobs.predict](https://arangoml.github.io/arangoml/api.html#agml_api.jobs.v1.api.jobs_api.JobsApi.predict)**

Final step!

After selecting a model, a Prediction Job can be created. The Prediction Job
will generate predictions and persist them to the source graph in a new
collection, or within the source documents.

**The Prediction Service depends on a `Prediction Specification` that contains**:
- `projectName`: The top-level project to which all the experiments will link back.
- `databaseName`: The database name the source data is in.
- `modelID`: The model ID to use for generating predictions.
- `featurizeNewDocuments`: Boolean for enabling or disabling the featurization of new documents. Useful if you don't want to re-train the model upon new data. Default is `false`.
- `featurizeOutdatedDocuments`: Boolean for enabling or disabling the featurization of outdated documents. Outdated documents are those whose features have changed since the last featurization. Default is `false`.
- `schedule`: A cron expression to schedule the prediction job. The cron syntax is a set of
The Prediction Service depends on a **Prediction Specification**:

- `projectName` (str): The top-level project to which all the experiments will link back.
- `databaseName` (str): The database name the source data is in.
- `modelID` (str): The model ID to use for generating predictions.
- `featurizeNewDocuments` (bool): Boolean for enabling or disabling the featurization of new documents. Useful if you don't want to re-train the model upon new data. Default is `false`.
- `featurizeOutdatedDocuments` (bool): Boolean for enabling or disabling the featurization of outdated documents. Outdated documents are those whose features have changed since the last featurization. Default is `false`.
- `schedule` (str): A cron expression to schedule the prediction job. The cron syntax is a set of
five fields in a line, indicating when the job should be executed. The format must follow
the following order: `minute` `hour` `day-of-month` `month` `day-of-week`
(e.g. `0 0 * * *` for daily predictions at 00:00). Default is `None`.
- `embeddingsField`: The name of the field to store the generated embeddings. This is only used for Graph Embedding tasks. Default is `None`.
- `embeddingsField` (str): The name of the field to store the generated embeddings. This is only used for Graph Embedding tasks. Default is `None`.
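
Because `schedule` must consist of five whitespace-separated fields in a fixed order, a small local check can catch malformed expressions before a job is submitted. The helper `split_cron` below is a hypothetical sketch: it only verifies the field count and labels each field, and does not validate the values inside the fields.

```py
# Field order as described for the `schedule` option.
CRON_FIELDS = ("minute", "hour", "day-of-month", "month", "day-of-week")


def split_cron(expression):
    """Split a cron expression into its five named fields, or raise ValueError."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(
            f"expected {len(CRON_FIELDS)} fields, got {len(parts)}: {expression!r}"
        )
    return dict(zip(CRON_FIELDS, parts))


# Daily at 00:00, as in the example given for `schedule`.
daily = split_cron("0 0 * * *")
print(daily)
```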

```py
# 1. Define the Prediction Specification