Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add third data quality metric #11939

Merged
merged 39 commits into from
Jan 30, 2025
Merged

Add third data quality metric #11939

merged 39 commits into from
Jan 30, 2025

Conversation

marthasharkey
Copy link
Contributor

@marthasharkey marthasharkey commented Dec 24, 2024

Pull Request Description

non-sampled-example
sampled-example

Important Notes

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

  • The documentation has been updated, if necessary.
  • Screenshots/screencasts have been attached, if there are any visual changes. For interactive or animated visual changes, a screencast is preferred.
  • All code follows the
    Scala,
    Java,
    TypeScript,
    and
    Rust
    style guides. In case you are using a language not listed above, follow the Rust style guide.
  • Unit tests have been written where possible.
  • If meaningful changes were made to logic or tests affecting Enso Cloud integration in the libraries,
    or the Snowflake database integration, a run of the Extra Tests has been scheduled.
    • If applicable, it is suggested to paste a link to a successful run of the Extra Tests.

@marthasharkey marthasharkey marked this pull request as ready for review January 8, 2025 13:35
Copy link
Member

@jdunkerley jdunkerley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some tweaks for efficiency but like the approach.

import org.enso.table.data.table.Column;
import org.graalvm.polyglot.Context;

public class CountWhitespace {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's create a base class (something like SampledOperation), which would hold the RANDOM_SEED and DEFAULT_SAMPLE_SIZE.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CountWhitespace feels the wrong name, probably should be CountNonTrivialWhitespace.

Comment on lines 32 to 33
private Future<Long> untrimmedCount;
private Future<Long> whitespaceCount;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets create a record type here.

record DataQualityMetrics(Long untrimmedCount, Long whitespaceCount);

We can then do it as sinlge CompletableFuture to compute the values.

Comment on lines 65 to 68
List<String> trivialWhiteSpaceList =
List.of(
"\u200A", "\u200B", "\u205F", "\u2004", "\u2005", "\u2006", "\u2008", "\u2009",
"\u2007", "\r", "\n", "\t", "\u2002", "\u00A0", "\u3000", "\u2003");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set<Char> should work here I think.

The variable name makes me think it is good not bad.

We can then loop over the characters of the String and return true if the set contains any of them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also confused by the variable name, shouldn't it be nonTrivialWhiteSpace instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I imagine the JVM is good at optimizing such things, but perhaps we will make it easier for it if we make the list (or Set) a private static final field ensuring that it is computed only once upon initialization and not on every invocation of this function (since this function may be invoked millions of times).

I imagine the JIT would fold this constant at some point, but making this a constant from the beginning maybe would just be better.

Copy link
Member

@radeusgd radeusgd Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other thing - how do we know that this list is comprehensive?

Checking e.g. https://jkorpela.fi/chars/spaces.html suggests that we are missing e.g. the NARROW NO-BREAK SPACE U+202F or U+180E - the "MONGOLIAN VOWEL SEPARATOR" 😅

I think that as long as we can reasonably avoid it, I'd rather be against including such lists of constants in our code - it is easy for them to miss some obscure constants, and they are a bit less future proof. If instead we rely on ICU4J or the Java SDK itself, if there are new Unicode versions released in the future we'd just need to rely on them to get updated to the latest version instead of having to update many of such constant lists manually (which we will probably never do unless users start reporting bugs).

Of course things like "MONGOLIAN VOWEL SEPARATOR" are probably not that commonly used in most cases and missing them probably won't be a huge deal for a generic metric. But if we can use existing dependencies that we rely on anyway, I'd suggest preferring that.

Thus we could rewrite the code to essentially:

  • iterate over every character in the text,
  • check if it is any whitespace using UCharacter::isUWhiteSpace,
  • if it is whitespace - then check if it is 'non-trivial' - a non trivial whitespace is any whitespace that is not the " " character.
for (...) {
    char c = str.charAt(i);
    if (UCharacter::isUWhiteSpace(c)) {
        boolean isNonTrivial = c != ' ';
        if (isNonTrivial) { return true; }
    }
    context.safepoint();
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry small correction - after further reading perhaps the two examples I provided above may actually not be classified as whitespace by the Unicode standard... So perhaps the List was good.

But I think we can still simplify the code by relying on isUWhiteSpace. It will also likely be much faster as it relies on bit-fiddling and boils down to 1 bit check per character instead of 16 independent text searches.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree not relying on a list seems more reliable and more efficient! I didnt know we already have the isUWhitespace elsewhere so thanks for pointing it out!

Comment on lines 208 to 215
number_untrimmed = case all_rows_count > Column.default_sample_size of
False -> JS_Object.from_pairs [["name", "Count untrimmed whitespace"], ["percentage_value", columns.map .count_untrimmed]]
True -> JS_Object.from_pairs [["name", "Count untrimmed whitespace (sampled)"], ["percentage_value", columns.map .count_untrimmed]]
[number_nothing, number_untrimmed]
number_non_triv = case all_rows_count > Column.default_sample_size of
False -> JS_Object.from_pairs [["name", "Count non trivial whitespace"], ["percentage_value", columns.map .count_non_trivial_whitespace]]
True -> JS_Object.from_pairs [["name", "Count non trivial whitespace (sampled)"], ["percentage_value", columns.map .count_non_trivial_whitespace]]
JS_Object.from_pairs
[number_nothing, number_untrimmed, number_non_triv]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's change it so we check the count only once. Tiny nit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name_extra = if all_rows_count > Column.default_sample_size then " (sampled)" else ""

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it would be much better if we go the name_extra route and avoid the duplication in all the percentage_value entries.

Used for data quality indicator in Table Viz.
count_non_trivial_whitespace : Integer -> Integer | Nothing
count_non_trivial_whitespace self sample_size:Integer=Column.default_sample_size =
if (self.value_type == Value_Type.Mixed || self.value_type.is_text).not then Nothing else
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (self.value_type == Value_Type.Mixed || self.value_type.is_text).not then Nothing else
if self.value_type.is_text.not then Nothing else

I think Mixed implies is_text.not

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No but this suggestion is not equivalent:

(self.value_type == Value_Type.Mixed || self.value_type.is_text).not
===
self.value_type != Value_Type.Mixed && self.value_type.is_text.not
!==
self.value_type.is_text.not

The counter-example is a mixed type column - self.value_type == Value_Type.Mixed.

Then the first and the equivalent second expression will evaluate to False but your suggested (third) expression evaluates to True.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it becomes clearer if the condition is named:

Suggested change
if (self.value_type == Value_Type.Mixed || self.value_type.is_text).not then Nothing else
is_eligible_for_whitespace_count = self.value_type == Value_Type.Mixed || self.value_type.is_text
if is_eligible_for_whitespace_count.not then Nothing else

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have pulled this out into a function as the same check is done in the other data quality count

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have pulled this out into a function as the same check is done in the other data quality count

Sounds great

Comment on lines 208 to 215
number_untrimmed = case all_rows_count > Column.default_sample_size of
False -> JS_Object.from_pairs [["name", "Count untrimmed whitespace"], ["percentage_value", columns.map .count_untrimmed]]
True -> JS_Object.from_pairs [["name", "Count untrimmed whitespace (sampled)"], ["percentage_value", columns.map .count_untrimmed]]
[number_nothing, number_untrimmed]
number_non_triv = case all_rows_count > Column.default_sample_size of
False -> JS_Object.from_pairs [["name", "Count non trivial whitespace"], ["percentage_value", columns.map .count_non_trivial_whitespace]]
True -> JS_Object.from_pairs [["name", "Count non trivial whitespace (sampled)"], ["percentage_value", columns.map .count_non_trivial_whitespace]]
JS_Object.from_pairs
[number_nothing, number_untrimmed, number_non_triv]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name_extra = if all_rows_count > Column.default_sample_size then " (sampled)" else ""

* @return whether the string has leading or trailing whitespace
*/
public static boolean has_leading_trailing_whitespace(String s) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was initially written in enso then the logic copied over, now as we are in a java file I have written this how I feel is simpler, happy to revert if the original is preferred

Comment on lines 45 to 46
String trimmedString = initialString.trim();
return trimmedString.length() != initialString.length();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The earlier algorithm was working in constant memory. Now it may allocate O(N) (as big as the input string) additional memory (just to throw it away).

I know we shouldn't optimize prematurely but this amount of copying seems slightly concerning if it may be done on huge values.

CompletableFuture.completedFuture(
CountUntrimmed.compute(
this, CountUntrimmed.DEFAULT_SAMPLE_SIZE, Context.getCurrent()));
dataQualityMetricsValues =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code that constructs dataQualityMetricsValues is repeated in this file -- it would be good to separate it out into a helper.

Copy link
Contributor

@GregoryTravis GregoryTravis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just one suggestion about factoring out some common code.

Copy link
Member

@jdunkerley jdunkerley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with @GregoryTravis comment would be worth extracting method to make the DataQualityMetrics. Could be a static in the record type taking the StringStorage.

Comment on lines 2238 to 2239
can_contain_text : Value_Type -> Boolean
is_eligible_for_text_data_metric_count value_type =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
can_contain_text : Value_Type -> Boolean
is_eligible_for_text_data_metric_count value_type =
private is_eligible_for_text_data_metric_count value_type:Value_Type -> Boolean =

}

/** Internal method performing the calculation on a storage. */
public static long compute(ColumnStorage storage, long sampleSize, Context context) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets get this merged but we can put the loop code here into SampleOperation and call with a delegate to perform the operation.

@marthasharkey marthasharkey added the CI: No changelog needed Do not require a changelog entry for this PR. label Jan 24, 2025
@marthasharkey marthasharkey added the CI: Ready to merge This PR is eligible for automatic merge label Jan 30, 2025
@mergify mergify bot merged commit 96b3a97 into develop Jan 30, 2025
45 checks passed
@mergify mergify bot deleted the wip/mk/dq-metric branch January 30, 2025 09:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI: No changelog needed Do not require a changelog entry for this PR. CI: Ready to merge This PR is eligible for automatic merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Data Quality indicator to columns in the Table Viz
4 participants