[GLE-8861] feat(vector): built-in TG function for pairwise vector embedding; #175
Merged
Changes from 1 commit
5 commits
7f33d68 [GLE-8861] feat(vector): built-in TG function for pairwise vector emb… (jue-yuan)
870042f [GLE-8861] change euclidean to l2; (jue-yuan)
5160522 [GLE-8861] add missing range for foreach statements; (jue-yuan)
3d49227 [GLE-8861] address comments; (jue-yuan)
bef452d [GLE-8861] add OR REPLACE for each GSQL function; (jue-yuan)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
CREATE FUNCTION gds.vector.cosine_distance(list<double> list1, list<double> list2) RETURNS(float) { | ||
|
||
/* | ||
First Author: Jue Yuan | ||
First Commit Date: Nov 27, 2024 | ||
|
||
Recent Author: Jue Yuan | ||
Recent Commit Date: Nov 27, 2024 | ||
|
||
Maturity: | ||
alpha | ||
|
||
Description: | ||
Calculates the cosine distance between two vectors represented as lists of doubles. | ||
The cosine distance is derived from the cosine similarity and provides a measure of the angle | ||
between two non-zero vectors in a multi-dimensional space. A distance of 0 indicates vectors that point | ||
in the same direction, 1 indicates orthogonal vectors, and 2 indicates opposite (maximally dissimilar) vectors. | ||
|
||
Parameters: | ||
list<double> list1: | ||
The first vector as a list of double values. | ||
list<double> list2: | ||
The second vector as a list of double values. | ||
|
||
Returns: | ||
float: | ||
The cosine distance between the two input vectors. | ||
Exceptions: | ||
list_size_mismatch (90000): | ||
Raised when the input lists are not of equal size. | ||
|
||
Logic Overview: | ||
Validates that both input vectors have the same length. | ||
Computes the inner (dot) product of the two vectors. | ||
Calculates the magnitudes (Euclidean norms) of both vectors. | ||
Returns the cosine distance as 1 - (inner product) / (product of magnitudes). | ||
|
||
Use Case: | ||
This function is commonly used in machine learning, natural language processing, | ||
and information retrieval tasks to quantify the similarity between vector representations, | ||
such as word embeddings or document feature vectors. | ||
*/ | ||
|
||
EXCEPTION list_size_mismatch (90000); | ||
ListAccum<double> @@myList1 = list1; | ||
ListAccum<double> @@myList2 = list2; | ||
|
||
IF (@@myList1.size() != @@myList2.size()) THEN | ||
RAISE list_size_mismatch ("Two lists provided for gds.vector.cosine_distance have different sizes."); | ||
END; | ||
|
||
double innerP = inner_product(@@myList1, @@myList2); | ||
double v1_magn = sqrt(inner_product(@@myList1, @@myList1)); | ||
double v2_magn = sqrt(inner_product(@@myList2, @@myList2)); | ||
RETURN (1 - innerP / (v1_magn * v2_magn)); | ||
|
||
} |
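As a cross-check on the formula in the header comment, here is a minimal reference sketch in Python (not GSQL, and not part of this PR). The function name `cosine_distance` and the explicit zero-vector guard are assumptions made for illustration; the GSQL function above has no such guard.

```python
import math


def cosine_distance(v1, v2):
    """Reference sketch: 1 - dot(v1, v2) / (|v1| * |v2|)."""
    if len(v1) != len(v2):
        raise ValueError("vectors have different sizes")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        # Assumption: reject zero vectors explicitly, since the ratio is undefined.
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm1 * norm2)
```

With this sketch, parallel vectors give a distance near 0, orthogonal vectors give 1, and opposite vectors give 2, so the result spans the range 0 to 2.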
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,37 @@ | ||
CREATE FUNCTION gds.vector.dimension_count(list<double> list1) RETURNS(int) { | ||
|
||
/* | ||
First Author: Jue Yuan | ||
First Commit Date: Nov 27, 2024 | ||
|
||
Recent Author: Jue Yuan | ||
Recent Commit Date: Nov 27, 2024 | ||
|
||
Maturity: | ||
alpha | ||
|
||
Description: | ||
Returns the number of dimensions (elements) in a given vector, represented as a list of double values. | ||
This function is useful for determining the size or dimensionality of input vectors in mathematical | ||
and data processing operations. | ||
|
||
Parameters: | ||
list<double> list1: | ||
The input vector as a list of double values. | ||
|
||
Returns: | ||
int: | ||
The number of elements (dimensions) in the input vector. | ||
|
||
Logic Overview: | ||
Accepts a list of double values as input. | ||
Calculates the size of the list, which corresponds to the number of dimensions. | ||
Returns the size as an integer. | ||
Use Case: | ||
This function is valuable in vector-based computations, such as machine learning or data analysis tasks, | ||
where understanding the dimensionality of vectors is crucial for validation, preprocessing, or compatibility checks. | ||
*/ | ||
|
||
ListAccum<double> @@myList1 = list1; | ||
RETURN @@myList1.size(); | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,85 @@ | ||
CREATE FUNCTION gds.vector.distance(list<double> list1, list<double> list2, string metric) RETURNS(float) { | ||
|
||
/* | ||
First Author: Jue Yuan | ||
First Commit Date: Nov 27, 2024 | ||
|
||
Recent Author: Jue Yuan | ||
Recent Commit Date: Nov 27, 2024 | ||
|
||
Maturity: | ||
alpha | ||
|
||
Description: | ||
Calculates the distance between two vectors represented as lists of double values, | ||
based on a specified distance metric. This function supports multiple metrics, | ||
allowing for flexible similarity or dissimilarity measurements in various computational tasks. | ||
|
||
Parameters: | ||
list<double> list1: | ||
The first vector as a list of double values. | ||
list<double> list2: | ||
The second vector as a list of double values. | ||
string metric: | ||
The distance metric to use. Supported metrics are: | ||
"cosine": Cosine distance | ||
"euclidean": Euclidean distance | ||
"ip": Inner product (dot product) | ||
Returns: | ||
float: | ||
The computed distance between the two input vectors based on the specified metric. | ||
|
||
Exceptions: | ||
list_size_mismatch (90000): | ||
Raised when the input vectors are not of equal size. | ||
invalid_metric_type (90001): | ||
Raised when an unsupported distance metric is provided. | ||
|
||
Logic Overview: | ||
Input Validation: | ||
Ensures both vectors have the same size. | ||
Metric Handling: | ||
Cosine Distance: | ||
Calculated as 1 - (inner product of vectors) / (product of magnitudes). | ||
Euclidean Distance: | ||
Computes the square root of the sum of squared differences between corresponding elements. | ||
Inner Product: | ||
Directly computes the dot product of the two vectors. | ||
|
||
Error Handling: | ||
Raises an exception if the provided metric is invalid. | ||
|
||
Use Case: | ||
This function is essential for machine learning, data science, and information retrieval applications, | ||
where distance or similarity calculations between vector representations (such as embeddings or feature vectors) are required. | ||
*/ | ||
|
||
EXCEPTION list_size_mismatch (90000); | ||
EXCEPTION invalid_metric_type (90001); | ||
ListAccum<double> @@myList1 = list1; | ||
ListAccum<double> @@myList2 = list2; | ||
|
||
IF (@@myList1.size() != @@myList2.size()) THEN | ||
RAISE list_size_mismatch ("Two lists provided for gds.vector.distance have different sizes."); | ||
END; | ||
|
||
SumAccum<float> @@myResult; | ||
SumAccum<float> @@sqrSum; | ||
|
||
CASE lower(metric) | ||
WHEN "cosine" THEN | ||
@@myResult = 1 - inner_product(@@myList1, @@myList2) / (sqrt(inner_product(@@myList1, @@myList1)) * sqrt(inner_product(@@myList2, @@myList2))); | ||
|
||
WHEN "euclidean" THEN | ||
FOREACH i IN RANGE[0, @@myList1.size() - 1] DO | ||
@@sqrSum += (@@myList1.get(i) - @@myList2.get(i)) * (@@myList1.get(i) - @@myList2.get(i)); | ||
END; | ||
@@myResult = sqrt(@@sqrSum); | ||
Review comment: Have you tested the cases where the sizes of myList1 and myList2 are 0?
Reply: It should be good since the |
||
WHEN "ip" THEN | ||
@@myResult = inner_product(@@myList1, @@myList2); | ||
ELSE | ||
RAISE invalid_metric_type ("Invalid metric algorithm provided, currently supported: cosine, euclidean and ip."); | ||
END | ||
; | ||
|
||
RETURN @@myResult; | ||
} |
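The metric dispatch described in the Logic Overview can be sketched in Python as follows. This is an illustrative reference under stated assumptions, not the PR's implementation: the name `vector_distance` and the use of `ValueError` in place of the GSQL exceptions `list_size_mismatch` and `invalid_metric_type` are hypothetical.

```python
import math


def vector_distance(v1, v2, metric):
    """Dispatch on a case-insensitive metric name: cosine, euclidean, or ip."""
    if len(v1) != len(v2):
        raise ValueError("vectors have different sizes")

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    name = metric.lower()  # mirrors CASE lower(metric) in the GSQL body
    if name == "cosine":
        return 1.0 - dot(v1, v2) / (math.sqrt(dot(v1, v1)) * math.sqrt(dot(v2, v2)))
    if name == "euclidean":
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(v1, v2)))
    if name == "ip":
        return dot(v1, v2)
    raise ValueError(f"unsupported metric: {metric!r}")
```

Note that the "ip" branch returns a similarity (larger means closer) rather than a distance, so callers ranking by this function need to sort in the opposite direction for that metric.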
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,41 @@ | ||
CREATE FUNCTION gds.vector.elements_sum(list<double> list1) RETURNS(float) { | ||
|
||
/* | ||
First Author: Jue Yuan | ||
First Commit Date: Nov 27, 2024 | ||
|
||
Recent Author: Jue Yuan | ||
Recent Commit Date: Nov 27, 2024 | ||
|
||
Maturity: | ||
alpha | ||
|
||
Description: | ||
Calculates the sum of all elements in a vector, represented as a list of double values. | ||
This function is useful for aggregating vector components in mathematical and statistical operations. | ||
|
||
Parameters: | ||
list<double> list1: | ||
The input vector as a list of double values. | ||
|
||
Returns: | ||
float: | ||
The sum of all elements in the input vector. | ||
|
||
Logic Overview: | ||
Iterates through each element in the input list. | ||
Accumulates the sum of all elements. | ||
Returns the final sum as a floating-point value. | ||
|
||
Use Case: | ||
This function is valuable in various data processing tasks, such as computing vector norms, | ||
validating data integrity, or performing aggregations in machine learning and statistical analysis. | ||
*/ | ||
|
||
SumAccum<float> @@mySum; | ||
|
||
FOREACH i IN list1 DO | ||
@@mySum += i; | ||
END; | ||
RETURN @@mySum; | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,62 @@ | ||
CREATE FUNCTION gds.vector.euclidean_distance(list<double> list1, list<double> list2) RETURNS(float) { | ||
|
||
/* | ||
First Author: Jue Yuan | ||
First Commit Date: Nov 27, 2024 | ||
|
||
Recent Author: Jue Yuan | ||
Recent Commit Date: Nov 27, 2024 | ||
|
||
Maturity: | ||
alpha | ||
|
||
Description: | ||
Calculates the Euclidean distance between two vectors represented as lists of double values. | ||
Euclidean distance measures the straight-line distance between two points in multi-dimensional space, | ||
making it a fundamental metric in various computational and analytical applications. | ||
|
||
Parameters: | ||
list<double> list1: | ||
The first vector as a list of double values. | ||
list<double> list2: | ||
The second vector as a list of double values. | ||
|
||
Returns: | ||
float: | ||
The Euclidean distance between the two input vectors. | ||
|
||
Exceptions: | ||
list_size_mismatch (90000): Raised when the input vectors are not of equal size. | ||
|
||
Logic Overview: | ||
Input Validation: | ||
Ensures both vectors have the same length. | ||
Distance Calculation: | ||
Iterates through corresponding elements of both vectors. | ||
Computes the sum of the squared differences between each pair of elements. | ||
Returns the square root of the accumulated sum, representing the Euclidean distance. | ||
|
||
Formula: | ||
Distance = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2) | ||
Where xi and yi are elements of list1 and list2, respectively. | ||
|
||
Use Case: | ||
This function is widely used in machine learning (e.g., k-nearest neighbors), data science, | ||
and pattern recognition tasks to measure the similarity or dissimilarity between data points. | ||
*/ | ||
|
||
EXCEPTION list_size_mismatch (90000); | ||
ListAccum<double> @@myList1 = list1; | ||
ListAccum<double> @@myList2 = list2; | ||
|
||
IF (@@myList1.size() != @@myList2.size()) THEN | ||
RAISE list_size_mismatch ("Two lists provided for gds.vector.euclidean_distance have different sizes."); | ||
END; | ||
|
||
SumAccum<float> @@sqrSum; | ||
FOREACH i IN RANGE[0, @@myList1.size() - 1] DO | ||
@@sqrSum += (@@myList1.get(i) - @@myList2.get(i)) * (@@myList1.get(i) - @@myList2.get(i)); | ||
END; | ||
|
||
RETURN sqrt(@@sqrSum); | ||
} |
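For comparison, the index-based loop above can be written as a short Python reference sketch. The name `euclidean_distance` is an assumption for illustration. The sketch returns 0.0 for two empty lists; whether the GSQL loop behaves the same way for empty input is not established by this page.

```python
import math


def euclidean_distance(v1, v2):
    """Reference sketch: sqrt of the sum of squared element-wise differences."""
    if len(v1) != len(v2):
        raise ValueError("vectors have different sizes")
    sqr_sum = 0.0
    # range(len(v1)) visits indices 0 .. len(v1) - 1, the same inclusive
    # bounds as [0, size() - 1] in the GSQL loop; it is empty when len(v1) == 0.
    for i in range(len(v1)):
        diff = v1[i] - v2[i]
        sqr_sum += diff * diff
    return math.sqrt(sqr_sum)
```

Computing the difference once per element, as done here, avoids the repeated `get(i)` subtraction in the GSQL expression; that is a stylistic choice and does not change the result.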
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,58 @@ | ||
CREATE FUNCTION gds.vector.ip_distance(list<double> list1, list<double> list2) RETURNS(float) { | ||
|
||
/* | ||
First Author: Jue Yuan | ||
First Commit Date: Nov 27, 2024 | ||
|
||
Recent Author: Jue Yuan | ||
Recent Commit Date: Nov 27, 2024 | ||
|
||
Maturity: | ||
alpha | ||
|
||
Description: | ||
Calculates the inner product (dot product) between two vectors represented as lists of double values. | ||
The inner product is a key measure in linear algebra; it equals the scalar projection of one vector onto the other multiplied by the other's magnitude. | ||
This function provides a similarity measure commonly used in machine learning and data analysis. | ||
|
||
Parameters: | ||
list<double> list1: | ||
The first vector as a list of double values. | ||
list<double> list2: | ||
The second vector as a list of double values. | ||
|
||
Returns: | ||
float: | ||
The inner product (dot product) of the two input vectors. | ||
|
||
Exceptions: | ||
list_size_mismatch (90000): | ||
Raised when the input vectors are not of equal size. | ||
|
||
Logic Overview: | ||
Input Validation: | ||
Ensures both vectors have the same length. | ||
Inner Product Calculation: | ||
Computes the sum of the element-wise products of the two vectors. | ||
|
||
Formula: | ||
Inner Product = (x1 * y1) + (x2 * y2) + ... + (xn * yn) | ||
Where xi and yi are elements of list1 and list2, respectively. | ||
|
||
Use Case: | ||
This function is widely used in: | ||
Calculating similarity in machine learning models (e.g., recommendation systems). | ||
Performing vector projections in linear algebra. | ||
Evaluating similarity between embeddings in natural language processing (NLP). | ||
*/ | ||
|
||
EXCEPTION list_size_mismatch (90000); | ||
ListAccum<double> @@myList1 = list1; | ||
ListAccum<double> @@myList2 = list2; | ||
|
||
IF (@@myList1.size() != @@myList2.size()) THEN | ||
RAISE list_size_mismatch ("Two lists provided for gds.vector.ip_distance have different sizes."); | ||
END; | ||
|
||
RETURN inner_product(@@myList1, @@myList2); | ||
} |
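The inner product is related to a projection but is not the same quantity. The following Python sketch (hypothetical helper names `inner_product` and `scalar_projection`, not part of this PR) shows the relationship: dividing the inner product by the magnitude of the second vector gives the scalar projection of the first onto the second.

```python
import math


def inner_product(v1, v2):
    """Sum of element-wise products; raises if the sizes differ."""
    if len(v1) != len(v2):
        raise ValueError("vectors have different sizes")
    return sum(a * b for a, b in zip(v1, v2))


def scalar_projection(v1, v2):
    """Length of v1 along the direction of v2: dot(v1, v2) / |v2|."""
    return inner_product(v1, v2) / math.sqrt(inner_product(v2, v2))


# Example: v1 = [3, 4] projected onto the x-axis direction v2 = [2, 0].
dot_value = inner_product([3.0, 4.0], [2.0, 0.0])       # 3*2 + 4*0
proj_value = scalar_projection([3.0, 4.0], [2.0, 0.0])  # dot / |v2|
```

Because the inner product grows with vector magnitude and can be negative, it behaves as a similarity score rather than a true distance, despite the `ip_distance` name.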
what does "/gds" mean here?
I think it should be short for "graph database system". Neo4j used this as their built-in library: https://neo4j.com/docs/graph-data-science/current/algorithms/similarity-functions/
GDS means Graph Data Science.
Would returning double be better since float has lower precision? Same applies below.
It will introduce some type checking error since they are mainly used for float types.
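To put a rough number on the precision question, the Python sketch below rounds a 64-bit value to 32-bit IEEE 754 and measures the error. It assumes that the GSQL float type is 32-bit single precision and double is 64-bit; that assumption is not confirmed anywhere in this thread. The helper name `to_float32` is hypothetical.

```python
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit IEEE 754 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


value = 0.1234567890123456  # e.g. a distance computed in double precision
as_f32 = to_float32(value)  # what survives if the result is stored as a 32-bit float
error = abs(as_f32 - value)
```

Under that assumption, a 32-bit result keeps about 7 significant decimal digits, which may matter when ranking near-duplicate embeddings whose distances differ only in later digits.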